dbt Data Transformation Tool Installation

dbt (data build tool) is an open-source SQL-based transformation framework that lets data analysts write modular, tested, and documented data models as SELECT statements. This guide covers installing dbt on Linux, setting up a project, creating models, writing tests, generating documentation, and integrating with CI/CD pipelines.

Prerequisites

  • Ubuntu 20.04+ or CentOS 8+ / Rocky Linux 8+
  • Python 3.8+ with pip
  • A supported data warehouse: PostgreSQL, BigQuery, Snowflake, Redshift, DuckDB, etc.
  • Database credentials with CREATE TABLE/VIEW permissions

Installing dbt

dbt is distributed as a Python package. Install the adapter for your database:

# Create a virtual environment (recommended)
python3 -m venv /opt/dbt-env
source /opt/dbt-env/bin/activate

# Install dbt with your database adapter
pip install dbt-postgres       # PostgreSQL / Redshift
pip install dbt-bigquery       # Google BigQuery
pip install dbt-snowflake      # Snowflake
pip install dbt-duckdb         # DuckDB (local analytics)
pip install dbt-mysql          # MySQL (community adapter)

# Install multiple adapters
pip install "dbt-core>=1.7" dbt-postgres dbt-duckdb

# Verify installation
dbt --version

# Activate the environment automatically in future shells
echo 'source /opt/dbt-env/bin/activate' >> ~/.bashrc

Creating a dbt Project

# Initialize a new project
dbt init my_analytics

# Project structure created:
# my_analytics/
# ├── dbt_project.yml    # Project configuration
# ├── models/            # SQL model files
# │   └── example/
# ├── tests/             # Singular tests
# ├── macros/            # Jinja macros
# ├── seeds/             # CSV data files
# ├── snapshots/         # SCD Type 2 snapshots
# └── analyses/          # Ad-hoc SQL files

cd my_analytics

Configure your database connection in ~/.dbt/profiles.yml:

# ~/.dbt/profiles.yml
my_analytics:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev          # dev schema for isolation
      threads: 4
      keepalives_idle: 0

    prod:
      type: postgres
      host: prod-db.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      dbname: analytics
      schema: public
      threads: 8

Configure dbt_project.yml:

name: 'my_analytics'
version: '1.0.0'
profile: 'my_analytics'

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

target-path: "target"
clean-targets: ["target", "dbt_packages"]

models:
  my_analytics:
    staging:
      +materialized: view          # staging models are views
    marts:
      +materialized: table         # marts are physical tables
      +schema: reporting

vars:
  start_date: '2024-01-01'

Test the connection:

dbt debug

Writing Models

Models are SQL SELECT statements saved as .sql files under models/. dbt wraps each one in the appropriate CREATE TABLE AS or CREATE VIEW AS statement based on its materialization, so you never write DDL by hand.

Staging model - cleans raw data:

-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('ecommerce', 'orders') }}
),
renamed as (
    select
        id                          as order_id,
        customer_id,
        status,
        amount / 100.0              as amount_dollars,
        created_at::timestamp       as created_at,
        updated_at::timestamp       as updated_at
    from source
    where created_at >= '{{ var("start_date") }}'
)
select * from renamed

Define your source tables in a schema YAML:

# models/staging/sources.yml
version: 2

sources:
  - name: ecommerce
    database: raw_db
    schema: public
    tables:
      - name: orders
        description: "Raw orders from the e-commerce platform"
        columns:
          - name: id
            description: "Primary key"
            tests:
              - not_null
              - unique

Mart model - business-level aggregation:

-- models/marts/fct_orders.sql
{{
  config(
    materialized = 'table',
    indexes = [
      {'columns': ['customer_id']},
      {'columns': ['created_at']}
    ]
  )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
),
final as (
    select
        o.order_id,
        o.customer_id,
        c.email,
        c.country,
        o.status,
        o.amount_dollars,
        o.created_at
    from orders o
    left join customers c using (customer_id)
)
select * from final

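The mart above also refs stg_customers, which isn't shown. A minimal sketch, assuming the raw customers table exposes id, email, and country columns (adjust to your actual schema):

```sql
-- models/staging/stg_customers.sql  (illustrative sketch; column names assumed)
with source as (
    select * from {{ source('ecommerce', 'customers') }}
),
renamed as (
    select
        id      as customer_id,
        email,
        country
    from source
)
select * from renamed
```

Remember to register the customers table under the ecommerce source in sources.yml so the source() call resolves.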
Run models:

# Run all models
dbt run

# Run a specific model
dbt run --select stg_orders

# Run a model and all its dependencies
dbt run --select +fct_orders

# Run all models in a directory
dbt run --select staging.*

# Run with a different target
dbt run --target prod

Testing Data Quality

dbt has two types of tests: generic (schema YAML) and singular (custom SQL).

Generic tests in schema YAML:

# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from raw source"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
      - name: amount_dollars
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

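The dbt_utils.expression_is_true test above comes from the dbt-utils package, which is not bundled with dbt-core. Declare it in packages.yml at the project root:

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
```

Then run dbt deps to download it into dbt_packages/ before running tests.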
Singular test (custom SQL):

-- tests/assert_no_negative_revenue.sql
-- Test passes if this query returns 0 rows
select
    order_id,
    amount_dollars
from {{ ref('fct_orders') }}
where amount_dollars < 0

Run tests:

# Run all tests
dbt test

# Run tests for a specific model
dbt test --select stg_orders

# Run and stop on first failure
dbt test --fail-fast

Generating Documentation

# Generate documentation
dbt docs generate

# Serve docs locally (opens at http://localhost:8080)
dbt docs serve --port 8080

Enhance docs with descriptions in schema YAML:

models:
  - name: fct_orders
    description: |
      Fact table of all customer orders. One row per order.
      Includes customer demographics joined from stg_customers.
    columns:
      - name: order_id
        description: "Unique identifier for each order"
      - name: amount_dollars
        description: "Order total in USD"

The generated docs include a DAG lineage graph showing how models depend on each other.
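Lineage is also queryable programmatically: after dbt compiles, target/manifest.json contains a parent_map recording each node's direct upstream dependencies. A minimal sketch of walking it (the inline sample dict stands in for the real file, and the node names are illustrative):

```python
import json

# Stand-in for json.load(open("target/manifest.json")); structure mirrors
# dbt's manifest artifact, which maps each node to its direct parents.
sample_manifest = {
    "parent_map": {
        "model.my_analytics.fct_orders": [
            "model.my_analytics.stg_orders",
            "model.my_analytics.stg_customers",
        ],
        "model.my_analytics.stg_orders": ["source.my_analytics.ecommerce.orders"],
        "model.my_analytics.stg_customers": [],
    }
}

def upstream(manifest, node, depth=0):
    """Recursively yield (depth, ancestor) pairs for a node's upstream lineage."""
    for parent in manifest["parent_map"].get(node, []):
        yield depth, parent
        yield from upstream(manifest, parent, depth + 1)

# Print the full upstream lineage of fct_orders, indented by distance
for d, p in upstream(sample_manifest, "model.my_analytics.fct_orders"):
    print("  " * d + p)
```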

Incremental Models

Incremental models only process new or changed rows, dramatically reducing compute time for large tables:

-- models/marts/fct_events.sql
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='sync_all_columns'
  )
}}

select
    event_id,
    user_id,
    event_type,
    properties,
    occurred_at
from {{ source('analytics', 'events') }}

{% if is_incremental() %}
  -- Only process records newer than the latest in the table
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}

Run full refresh when schema changes:

# Force full rebuild of incremental model
dbt run --select fct_events --full-refresh
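How dbt applies the new rows depends on the incremental strategy, which varies by adapter (common values include append, merge, and delete+insert). It can be set explicitly in the model config — a sketch, assuming your adapter supports merge:

```sql
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'   -- adapter-dependent; check your adapter's docs
  )
}}
```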

CI/CD Integration

GitHub Actions workflow:

# .github/workflows/dbt.yml
name: dbt CI

on:
  pull_request:
    paths:
      - 'models/**'
      - 'tests/**'
      - 'dbt_project.yml'

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-postgres

      - name: dbt debug
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt debug --target ci

      - name: dbt run
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt run --target ci

      - name: dbt test
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt test --target ci

Add a ci target to profiles.yml pointing to a dedicated CI schema, and make the file available to the runner — for example, commit a secrets-free profiles.yml to the repo and point the DBT_PROFILES_DIR environment variable at its directory.
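A sketch of such a ci target (host, user, and database names are placeholders):

```yaml
# profiles.yml -- ci target (illustrative)
my_analytics:
  outputs:
    ci:
      type: postgres
      host: ci-db.example.com        # placeholder host
      port: 5432
      user: dbt_ci
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_ci                 # isolated schema so CI runs never touch dev/prod
      threads: 4
```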

Troubleshooting

"Could not find profile" error:

# Ensure profiles.yml exists
cat ~/.dbt/profiles.yml
# Or specify profile path
dbt run --profiles-dir /path/to/profiles

Model compilation errors:

# Compile models without running to check SQL
dbt compile --select my_model
cat target/compiled/my_analytics/models/my_model.sql

Test failures - identify the bad rows:

# dbt stores test results in target/
dbt test --store-failures
# Failed rows land in tables under the <target_schema>_dbt_test__audit schema by default

Slow model runs:

# Check query execution time in target/run_results.json
cat target/run_results.json | python3 -c "
import json, sys
results = json.load(sys.stdin)['results']
for r in sorted(results, key=lambda x: x.get('execution_time', 0), reverse=True)[:5]:
    print(f\"{r['execution_time']:.1f}s  {r['unique_id']}\")
"

Unexpected model dependencies:

# Check the model DAG
dbt ls --select +my_model+ --output json

Conclusion

dbt brings software engineering best practices to SQL data transformation: version control, modular code, automated testing, and self-generating documentation. By organizing transformations into staging and mart layers with the ref() function managing dependencies, dbt makes data pipelines maintainable and testable. Incremental models and CI/CD integration complete the picture for a production-grade analytics engineering workflow on any SQL data warehouse.