Building a Modern Data Pipeline with dbt, Airflow, and Snowflake
I recently built a robust data transformation pipeline to process TPC-H sample data into analytics-ready fact tables. This project uses dbt Core for transformation and Apache Airflow (via Cosmos) for orchestration, all targeting a Snowflake data warehouse.
Project Overview
The pipeline transforms raw source data into a structured dimensional model following modern data engineering best practices.
- Source: TPC-H dataset (`orders`, `lineitem`)
- Transformation: dbt Core (SQL-based)
- Orchestration: Airflow DAGs dynamically generated using Cosmos
- Target: Snowflake
Architecture Features
1. Multi-Layer Transformation
The dbt project is structured into three distinct layers:
- Staging Layer: Raw data cleaning and column renaming (e.g., `stg_tpch_orders`).
- Intermediate Layer: Business logic and joins (e.g., `int_order_items` joining orders and items).
- Marts Layer: Final business-facing fact tables (e.g., `fct_orders`).
2. Automated Data Quality
I implemented custom data quality checks as singular tests in the `tests/` directory. In dbt, a singular test is simply a SQL query that fails if it returns any rows:

```sql
-- Example check: a discount should never be stored as a positive amount
select *
from {{ ref('fct_orders') }}
where item_discount_amount > 0 -- should be negative
```

3. Dynamic Orchestration
Instead of manually defining every task in Airflow, I used Cosmos to render the dbt project as an Airflow DAG automatically. This ensures the orchestrator stays in sync with the data model.
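To make this concrete, here is a minimal sketch of what a Cosmos-rendered DAG can look like. The DAG id, project path, connection id, database, and schema are illustrative placeholders rather than the exact values from my repository:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Map an Airflow connection to a dbt profile; "snowflake_conn" is a placeholder conn id.
profile_config = ProfileConfig(
    profile_name="tpch",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",
        profile_args={"database": "ANALYTICS", "schema": "MARTS"},
    ),
)

# Cosmos walks the dbt project and renders one Airflow task per model and test.
tpch_pipeline = DbtDag(
    dag_id="tpch_pipeline",
    project_config=ProjectConfig(dbt_project_path="/usr/local/airflow/dags/dbt/tpch"),
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Because the task graph comes from the dbt project itself, adding or removing a model shows up in Airflow on the next parse without any changes to the DAG file.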
Key Learnings
Performance Optimization
I initially ran into performance issues where the DAG graph would disappear from the Airflow UI. The fix was to switch the Cosmos `RenderConfig` to `LoadMode.DBT_MANIFEST`. This makes Cosmos build the task graph from the pre-compiled `manifest.json` instead of re-parsing the entire dbt project every time the scheduler parses the DAG file, significantly improving stability.
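The change itself is small. Here is a sketch of the relevant configuration, assuming the manifest has been compiled ahead of time (for example in CI) and shipped with the project; the paths and project name are placeholders:

```python
from cosmos import ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

# Point Cosmos at the pre-compiled manifest instead of re-parsing the dbt project.
project_config = ProjectConfig(
    manifest_path="/usr/local/airflow/dags/dbt/tpch/target/manifest.json",
    project_name="tpch",
)

# DBT_MANIFEST tells Cosmos to build the task graph from manifest.json.
render_config = RenderConfig(load_method=LoadMode.DBT_MANIFEST)

# Both objects are then passed to DbtDag(project_config=..., render_config=...)
# alongside the profile_config shown earlier.
```

The trade-off is that the manifest must be regenerated whenever models change, since Cosmos no longer discovers them by parsing the project.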
Automated Documentation
I extended the DAG to include a `dbt_docs_generate` task using the `DbtDocsOperator`. Now, the documentation website is rebuilt after every successful pipeline run, ensuring the data dictionary is always fresh.
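For reference, here is a sketch of what that task can look like, reusing the `profile_config` from the first snippet. I am assuming the operator is importable from `cosmos.operators` and accepts `project_dir`, `profile_config`, and a `callback`; the exact import path and arguments may vary across Cosmos versions, and the publishing callback below is a placeholder:

```python
from cosmos.operators import DbtDocsOperator


def publish_docs(project_dir: str) -> None:
    # Placeholder callback: the generated site (index.html, catalog.json, manifest.json)
    # lands under <project_dir>/target and can be copied to whatever hosting you use.
    print(f"dbt docs generated under {project_dir}/target")


# Runs `dbt docs generate` and then hands the project directory to the callback.
dbt_docs_generate = DbtDocsOperator(
    task_id="dbt_docs_generate",
    project_dir="/usr/local/airflow/dags/dbt/tpch",
    profile_config=profile_config,  # same ProfileConfig used by the DbtDag
    callback=publish_docs,
)
```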
You can check out the full code and setup instructions on the GitHub Repository.