Building Your First Data Pipeline with Mage & Docker
Tired of wrestling with complex data pipeline setups? Want to unleash the power of your data, but frustrated by clunky tools and inefficient workflows? Then say hello to Mage and PostgreSQL, the dynamic duo ready to revolutionize your data engineering experience!
This action-packed guide walks you through setting up Mage and PostgreSQL locally using Docker, your one-stop shop for smooth, containerized development. We’ll dive into building your first pipeline, fetching real-world taxi data, transforming it with ease, and exporting it seamlessly to your Postgres database.
Prerequisites:
Before we dive in, make sure you have these ready:
- Docker: Download and install it from https://www.docker.com/.
- A code editor: Your trusty VSCode, PyCharm, or any other tool you like.
Step 1: Project Setup:
Let’s get our hands dirty! Create a directory named Mage and, within it, create these files:
- Dockerfile: defines the container’s blueprint.
- docker-compose.yml: specifies services and dependencies.
- .env: stores environment variables (keep it secret, keep it safe!).
- requirements.txt: lists all the Python libraries we need.
Step 2: Crafting the Dockerfile:
Imagine the Dockerfile
as a recipe for building our container. Open it and paste this code:
FROM mageai/mageai:latest
ARG PROJECT_NAME
ARG USER_CODE_PATH=/home/src/${PROJECT_NAME}
COPY requirements.txt ${USER_CODE_PATH}/requirements.txt
RUN pip3 install -r ${USER_CODE_PATH}/requirements.txt
Base Image: It uses the latest mageai/mageai image, which contains pre-configured tools needed for Mage.
User Code Path: An argument, USER_CODE_PATH, defines where your project files are located within the container.
Installing Dependencies:
- Copies the requirements.txt file, containing your Python library dependencies, into the chosen path.
- Runs pip3 install to install those dependencies within the container.
By running this Dockerfile, you get a container with Mage and your project’s specific dependencies ready to use. Think of it as a pre-configured environment where you can easily run your Mage pipelines.
Step 3: Composing Services with docker-compose.yml:
This file is where we orchestrate the show. Here’s the magic:
version: '3'
services:
  mage:
    image: mageai/mageai:latest
    command: mage start ${PROJECT_NAME}
    env_file:
      - .env
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      USER_CODE_PATH: /home/src/${PROJECT_NAME}
      POSTGRES_DBNAME: ${POSTGRES_DBNAME}
      POSTGRES_SCHEMA: ${POSTGRES_SCHEMA}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_HOST: ${POSTGRES_HOST}
      POSTGRES_PORT: ${POSTGRES_PORT}
    ports:
      - 6789:6789
    volumes:
      - .:/home/src/
      - ~/Documents/secrets/personal-gcp.json:/home/src/personal-gcp.json
    restart: on-failure:5
  postgres:
    image: postgres:14
    restart: on-failure
    container_name: ${PROJECT_NAME}-postgres
    env_file:
      - .env
    environment:
      POSTGRES_DB: ${POSTGRES_DBNAME}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${POSTGRES_PORT}:5432"
Services:
1. mage:
- Builds a container using the provided Dockerfile, setting up your project environment.
- Runs mage start ${PROJECT_NAME} to launch your Mage application.
- Loads environment variables from .env and sets additional environment variables for the database connection.
- Mounts your local project directory and a GCP service account key into the container.
- Exposes port 6789 for Mage UI access.
- Restarts automatically on failure, up to five times.
2. postgres:
- Uses the postgres:14 image for your database.
- Sets environment variables for database configuration based on values in .env.
- Maps the container’s port 5432 to the host port given by POSTGRES_PORT.
- Names the container based on your project name.
- Also restarts automatically on failure.
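Inside the Compose network, Mage reaches the database at the service name postgres, not localhost. As a quick sanity check of the variables the mage service receives, here is a minimal sketch that assembles a Postgres connection URL from the environment; the build_postgres_url helper and its fallback defaults are illustrative, not part of Mage:

```python
import os

def build_postgres_url() -> str:
    # Read the same variables docker-compose passes to the mage service;
    # the defaults below are illustrative only.
    user = os.environ.get("POSTGRES_USER", "postgres")
    password = os.environ.get("POSTGRES_PASSWORD", "postgres")
    host = os.environ.get("POSTGRES_HOST", "postgres")  # the Compose service name
    port = os.environ.get("POSTGRES_PORT", "5432")
    dbname = os.environ.get("POSTGRES_DBNAME", "ny_taxi")
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"

print(build_postgres_url())
```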
Step 4: Keeping Secrets with .env:
Remember those variables? Store them securely in the .env file. Note that POSTGRES_HOST should be set to postgres, the Compose service name, so Mage can reach the database over the Compose network:
PROJECT_NAME=
POSTGRES_DBNAME=
POSTGRES_SCHEMA=
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_HOST=
POSTGRES_PORT=
GCP_SERVICE_ACCOUNT_PATH=
Step 5: Declaring Dependencies:
List all the required Python libraries in requirements.txt.
Step 6: Building and Running:
Now, the moment of truth! Open your terminal and navigate to the project directory. Run these commands:
docker compose build
docker compose up -d
The first command builds the container image, and the second starts it in detached mode.
Step 7: Accessing the Mage UI:
Open http://localhost:6789 in your browser. Boom! You’re greeted by the friendly Mage UI, ready to orchestrate your data flow.
Step 8: Adding a New Pipeline:
Open Mage’s web interface (http://localhost:6789) and click on the + New Pipeline
button. Choose "Standard (batch)" for a basic pipeline.
Step 9: Configure Profile in io_config.yaml File:
dev:
  POSTGRES_CONNECT_TIMEOUT: 10
  POSTGRES_DBNAME: "{{ env_var('POSTGRES_DBNAME') }}"
  POSTGRES_SCHEMA: "{{ env_var('POSTGRES_SCHEMA') }}"
  POSTGRES_USER: "{{ env_var('POSTGRES_USER') }}"
  POSTGRES_PASSWORD: "{{ env_var('POSTGRES_PASSWORD') }}"
  POSTGRES_HOST: "{{ env_var('POSTGRES_HOST') }}"
  POSTGRES_PORT: "{{ env_var('POSTGRES_PORT') }}"
  GOOGLE_SERVICE_ACC_KEY_FILEPATH: "{{ env_var('GCP_SERVICE_ACCOUNT_PATH') }}"
How to add blocks:
Mage empowers you to build complex pipelines effortlessly. Simply select the relevant data loader, transformer, or data exporter block from the provided options, and customize each block to meet your specific data manipulation or export requirements.
Step 10: Fetching New York Taxi Data:
Data Loader Block:
- Add a new block with the “Python API Data Loader” type.
- Paste the provided URL: https://github.com/yash-chauhan-dev/test_data/raw/main/yellow_tripdata_2021-01.csv.gz
- Copy the provided taxi_dtypes dictionary to map data types efficiently.
- Add parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime'] to handle date columns.
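Under the hood, the loader boils down to a pandas.read_csv call with explicit dtypes and date parsing. Here is a minimal sketch; the inline sample stands in for the real CSV download so the snippet runs offline, and the column subset in this taxi_dtypes is illustrative rather than the full dictionary:

```python
import io
import pandas as pd

# Tiny inline stand-in for the taxi CSV (the real loader reads the URL above).
sample_csv = io.StringIO(
    "VendorID,passenger_count,tpep_pickup_datetime,tpep_dropoff_datetime\n"
    "1,1,2021-01-01 00:30:10,2021-01-01 00:36:12\n"
    "2,0,2021-01-01 00:51:20,2021-01-01 00:52:19\n"
)

# Explicit dtypes avoid pandas' type inference; nullable Int64 tolerates missing values.
taxi_dtypes = {"VendorID": pd.Int64Dtype(), "passenger_count": pd.Int64Dtype()}
parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

df = pd.read_csv(sample_csv, dtype=taxi_dtypes, parse_dates=parse_dates)
print(df.dtypes)
```

With the real URL, pass compression='gzip' is unnecessary — pandas infers it from the .csv.gz extension.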
Step 11: Transforming Data:
Python Transformer Block:
- If you want to remove entries with zero passengers, add a “Python Generic Transformer” block.
- Paste the provided transform function to filter based on the passenger_count column.
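The filtering itself is a one-line pandas operation. A minimal sketch of such a transform function (inside Mage the block is marked with a @transformer decorator; outside Mage you can call the function directly, as here):

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rides that recorded at least one passenger.
    return df[df["passenger_count"] > 0]

# Illustrative sample data, not the real taxi dataset.
rides = pd.DataFrame({"passenger_count": [1, 0, 2], "fare_amount": [10.0, 5.5, 7.0]})
cleaned = transform(rides)
print(len(cleaned))  # rows remaining after dropping zero-passenger trips
```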
Step 12: Exporting to PostgreSQL:
Python PostgreSQL Data Exporter Block:
- Add a “Python PostgreSQL Data Exporter” block.
- Update schema_name to “ny_taxi” and table_name to “yellow_taxi”, or adjust them to your preference.
- Modify config_profile to match your io_config.yaml settings.
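Conceptually, the exporter just writes the DataFrame into a table. Since a live Postgres isn’t always at hand, here is a runnable sketch using pandas’ to_sql against an in-memory SQLite connection as a stand-in; in the actual Mage block, the same write goes to Postgres through the profile you configured in io_config.yaml:

```python
import sqlite3
import pandas as pd

# Illustrative sample frame, standing in for the transformed taxi data.
df = pd.DataFrame({"passenger_count": [1, 2], "fare_amount": [10.0, 7.0]})

# SQLite replaces Postgres here so the sketch runs anywhere; SQLite has no
# schemas, so the schema-qualified ny_taxi.yellow_taxi becomes a plain table name.
con = sqlite3.connect(":memory:")
df.to_sql("yellow_taxi", con, index=False, if_exists="replace")

rows = con.execute("SELECT COUNT(*) FROM yellow_taxi").fetchone()[0]
print(rows)
```

The if_exists="replace" choice mirrors a full-refresh load; use "append" for incremental loads.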
Step 13: Verifying Data (Using SQL Data Loader):
Create a new “SQL Data Loader” block in Mage.
- Select “PostgreSQL” as the connection type.
- Paste the provided SQL query: SELECT * FROM ny_taxi.yellow_taxi LIMIT 10
- Run the pipeline to execute the query and preview the first 10 rows.