✈️ Migrate Data from URL to Google Cloud Storage Seamlessly with Mage & Python

Yash Chauhan
3 min read · Feb 9, 2024


Extracting, transforming, and loading (ETL) data is a fundamental task in data analysis, and tools like Mage help automate and simplify the process. In this blog, we’ll walk through building an ETL pipeline in Mage that fetches data from a URL, transforms it, and stores it efficiently in Google Cloud Storage.

Prerequisites:

  • A Google Cloud Platform (GCP) account with a Cloud Storage bucket and a service account.
  • Basic understanding of Python and data manipulation libraries like pandas.
  • Mage set up locally using Docker: Refer to our previous blog post for detailed instructions on installing and configuring Mage using Docker. This will ensure you have a functional Mage environment ready to use.

Let’s dive in!

1. Setting Up Google Cloud:

First, ensure you have a Cloud Storage bucket ready to store your transformed data. If not, head over to the GCP console and create one. Additionally, create a service account, download its JSON key, and store it securely. This key will grant Mage access to your Cloud Storage bucket.

2. Creating the Mage Pipeline:

Open the Mage UI and create a new standard (batch) pipeline. You can name it something descriptive like “URL-to-GCS-ETL”.

3. Configuring Credentials:

Mage uses an io_config.yaml file to store sensitive configuration such as your service account key. Open this file and update the relevant fields with your GCP project ID, your service account email, and the path to your JSON key file.

(Note: you can also create a separate profile in io_config.yaml and keep this pipeline’s configuration apart from the default one.)
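The relevant section of io_config.yaml looks roughly like this (the key names follow Mage’s default template; the file path is a placeholder for wherever you stored your key):

```yaml
version: 0.1.1
default:
  # Path to the service account JSON key you downloaded earlier
  GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/your-service-account-key.json"
  GOOGLE_LOCATION: US  # region of your Cloud Storage bucket
```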

4. Building the Data Loader Block:

This block retrieves data from the URL. Create a new Python data loader block and paste the following code:

(Note: to optimize resource usage, define column data types explicitly; this saves memory and accelerates parsing.)
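A minimal sketch of such a loader, assuming a CSV source. The URL and the column names in the dtype map (`id`, `category`, `value`) are placeholders for your own dataset:

```python
import pandas as pd

try:
    from mage_ai.data_preparation.decorators import data_loader
except ImportError:  # allows running this sketch outside a Mage project
    def data_loader(fn):
        return fn

# Placeholder URL: point this at your actual data source
DATA_URL = "https://example.com/data.csv"

# Declaring dtypes up front reduces memory usage and speeds up parsing
DTYPES = {
    "id": "int32",
    "category": "category",
    "value": "float32",
}


@data_loader
def load_data_from_url(*args, **kwargs):
    # read_csv also accepts file-like objects, which makes testing easy
    source = kwargs.get("source", DATA_URL)
    return pd.read_csv(source, dtype=DTYPES)
```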

5. Transforming the Data:

Now, let’s say you want to filter out rows with specific values. Create a Python transformer block and add the following code:
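As a sketch, the transformer below keeps only rows with a positive `value`; the column name and the condition are assumptions to adapt to your data:

```python
import pandas as pd

try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:  # allows running this sketch outside a Mage project
    def transformer(fn):
        return fn


@transformer
def filter_rows(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Example rule: keep only rows where 'value' is positive;
    # replace this with whatever condition your data requires
    return df[df["value"] > 0].reset_index(drop=True)
```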

6. Saving to Google Cloud Storage:

Finally, it’s time to store the data. Create a Python data exporter block and paste this code:
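A sketch following Mage’s standard GCS exporter template; the bucket name and object key are placeholders. `GoogleCloudStorage.with_config` picks up the credentials you configured in io_config.yaml:

```python
from os import path

import pandas as pd

try:
    from mage_ai.data_preparation.decorators import data_exporter
    from mage_ai.io.config import ConfigFileLoader
    from mage_ai.io.google_cloud_storage import GoogleCloudStorage
    from mage_ai.settings.repo import get_repo_path
    MAGE_AVAILABLE = True
except ImportError:  # running this sketch outside a Mage project
    MAGE_AVAILABLE = False

    def data_exporter(fn):
        return fn

# Placeholders: substitute your own bucket and object key
BUCKET_NAME = "your-gcs-bucket"
OBJECT_KEY = "url_to_gcs/data.parquet"


@data_exporter
def export_data_to_gcs(df: pd.DataFrame, **kwargs) -> None:
    if not MAGE_AVAILABLE:
        raise RuntimeError("This block needs to run inside a Mage pipeline")
    # 'default' is the profile name from io_config.yaml
    config = ConfigFileLoader(path.join(get_repo_path(), "io_config.yaml"), "default")
    GoogleCloudStorage.with_config(config).export(df, BUCKET_NAME, OBJECT_KEY)
```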

Saved data to Storage:

7. Bonus: Data Partitioning (Optional):

You can further optimize storage by partitioning data based on specific columns. Add another Python data exporter block with the following code:

Saved data to Storage:

8. Run the Pipeline!

With everything configured, run the pipeline by clicking the “Play” button in the Mage UI. Depending on your data size, it might take some time to complete.
