dbt · Airflow · Data Engineering · Snowflake

Building a Modern Data Pipeline with dbt, Airflow, and Snowflake

January 2026 · 10 min read

I recently built a robust data transformation pipeline to process TPC-H sample data into analytics-ready fact tables. This project uses dbt Core for transformation and Apache Airflow (via Cosmos) for orchestration, all targeting a Snowflake data warehouse.

Project Overview

The pipeline transforms raw source data into a structured dimensional model following modern data engineering best practices.

  • Source: TPC-H dataset (`orders`, `lineitem`)
  • Transformation: dbt Core (SQL-based)
  • Orchestration: Airflow DAGs dynamically generated using Cosmos
  • Target: Snowflake

Architecture Features

1. Multi-Layer Transformation

The dbt project is structured into three distinct layers:

  • Staging Layer: Raw data cleaning and column renaming (e.g., `stg_tpch_orders`).
  • Intermediate Layer: Business logic and joins (e.g., `int_order_items` joining orders and items).
  • Marts Layer: Final business-facing Fact tables (e.g., `fct_orders`).

2. Automated Data Quality

I implemented custom data quality checks in the `tests/` directory:

-- Example check: ensuring discounts are valid
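-- (A singular dbt test: the test fails if this query returns any rows.)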
select *
from {{ ref('fct_orders') }}
where item_discount_amount > 0  -- Should be negative

3. Dynamic Orchestration

Instead of manually defining every task in Airflow, I used Cosmos to render the dbt project as an Airflow DAG automatically: each dbt model becomes its own Airflow task, with dependencies derived from the project's `ref()` graph, so the orchestrator always stays in sync with the data model.
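As a rough sketch of how that wiring looks, a Cosmos `DbtDag` only needs to know where the dbt project lives and how to build a dbt profile from an Airflow connection. The paths, connection ID, and database/schema names below are illustrative placeholders, not the project's real values:

from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Build a dbt profile at runtime from an Airflow connection
# ("snowflake_conn", database, and schema are placeholder names).
profile_config = ProfileConfig(
    profile_name="tpch_pipeline",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",
        profile_args={"database": "ANALYTICS", "schema": "MARTS"},
    ),
)

# Cosmos walks the dbt project and emits one Airflow task per model,
# preserving the dependency graph defined by ref() in the SQL files.
dbt_pipeline = DbtDag(
    dag_id="tpch_dbt_pipeline",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/tpch_pipeline"),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
)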

Key Learnings

Performance Optimization

I initially hit a stability issue where the DAG graph would intermittently disappear from the Airflow UI, because parsing the full dbt project on every scheduler pass was too slow. Switching the Cosmos `RenderConfig` to `LoadMode.DBT_MANIFEST` fixed this: Airflow reads the pre-compiled `manifest.json` instead of re-parsing the entire dbt project each time the DAG file is processed, which significantly improved parse times and stability.
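The change itself is small. A sketch, assuming the compiled `manifest.json` is shipped alongside the project (paths are placeholders):

from cosmos import ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

# Point Cosmos at the pre-compiled dbt artifacts instead of the raw project.
project_config = ProjectConfig(
    dbt_project_path="/usr/local/airflow/dags/dbt/tpch_pipeline",
    manifest_path="/usr/local/airflow/dags/dbt/tpch_pipeline/target/manifest.json",
)

# Build the task graph from manifest.json rather than re-parsing every
# model file each time the scheduler processes the DAG.
render_config = RenderConfig(load_method=LoadMode.DBT_MANIFEST)

# Both objects are then passed to DbtDag (or DbtTaskGroup) as before:
# DbtDag(..., project_config=project_config, render_config=render_config)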

Automated Documentation

I extended the DAG to include a `dbt_docs_generate` task using the `DbtDocsOperator`. Now, the documentation website is rebuilt after every successful pipeline run, ensuring the data dictionary is always fresh.
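Roughly, the wiring looks like the sketch below: the earlier `DbtDag` is restructured as a `DbtTaskGroup` inside a regular Airflow DAG so an extra task can run after the models. I've used Cosmos's local docs operator here; the exact operator class name and its arguments vary between Cosmos versions, so treat the import path and parameters as assumptions rather than the project's exact code.

from datetime import datetime

from airflow import DAG
from cosmos import DbtTaskGroup
from cosmos.operators import DbtDocsLocalOperator

# project_config / profile_config: the same objects sketched earlier.
with DAG(
    dag_id="tpch_dbt_pipeline",
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # All dbt models, rendered as a task group by Cosmos.
    transform = DbtTaskGroup(
        group_id="dbt_transform",
        project_config=project_config,
        profile_config=profile_config,
    )

    # Rebuild the dbt docs site once the models have run successfully.
    generate_docs = DbtDocsLocalOperator(
        task_id="dbt_docs_generate",
        project_dir="/usr/local/airflow/dags/dbt/tpch_pipeline",
        profile_config=profile_config,
    )

    transform >> generate_docs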

You can check out the full code and setup instructions on the GitHub Repository.