dbt Data Transformation Tool Installation
dbt (data build tool) is an open-source SQL-based transformation framework that lets data analysts write modular, tested, and documented data models as SELECT statements. This guide covers installing dbt on Linux, setting up a project, creating models, writing tests, generating documentation, and integrating with CI/CD pipelines.
Prerequisites
- Ubuntu 20.04+ or CentOS 8+ / Rocky Linux 8+
- Python 3.8+ with pip
- A supported data warehouse: PostgreSQL, BigQuery, Snowflake, Redshift, DuckDB, etc.
- Database credentials with CREATE TABLE/VIEW permissions
Installing dbt
dbt is distributed as a Python package. Install the adapter for your database:
# Create a virtual environment (recommended)
python3 -m venv /opt/dbt-env
source /opt/dbt-env/bin/activate
# Install dbt with your database adapter
pip install dbt-postgres # PostgreSQL / Redshift
pip install dbt-bigquery # Google BigQuery
pip install dbt-snowflake # Snowflake
pip install dbt-duckdb # DuckDB (local analytics)
pip install dbt-mysql # MySQL (community adapter)
# Install multiple adapters
pip install "dbt-core>=1.7" dbt-postgres dbt-duckdb
# Verify installation
dbt --version
# Activate the environment automatically in future shells (puts dbt on PATH)
echo 'source /opt/dbt-env/bin/activate' >> ~/.bashrc
Creating a dbt Project
# Initialize a new project
dbt init my_analytics
# Project structure created:
# my_analytics/
# ├── dbt_project.yml    # Project configuration
# ├── models/            # SQL model files
# │   └── example/
# ├── tests/             # Singular tests
# ├── macros/            # Jinja macros
# ├── seeds/             # CSV data files
# ├── snapshots/         # SCD Type 2 snapshots
# └── analyses/          # Ad-hoc SQL files
cd my_analytics
Configure your database connection in ~/.dbt/profiles.yml:
# ~/.dbt/profiles.yml
my_analytics:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev          # dev schema for isolation
      threads: 4
      keepalives_idle: 0
    prod:
      type: postgres
      host: prod-db.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      dbname: analytics
      schema: public
      threads: 8
Configure dbt_project.yml:
name: 'my_analytics'
version: '1.0.0'
profile: 'my_analytics'
model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
target-path: "target"
clean-targets: ["target", "dbt_packages"]
models:
  my_analytics:
    staging:
      +materialized: view      # staging models are views
    marts:
      +materialized: table     # marts are physical tables
      +schema: reporting

vars:
  start_date: '2024-01-01'
Test the connection:
dbt debug
Writing Models
Models are SQL SELECT statements in .sql files under models/. dbt handles CREATE TABLE/VIEW automatically.
Staging model - cleans raw data:
-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

renamed as (
    select
        id as order_id,
        customer_id,
        status,
        amount / 100.0 as amount_dollars,
        created_at::timestamp as created_at,
        updated_at::timestamp as updated_at
    from source
    where created_at >= '{{ var("start_date") }}'
)

select * from renamed
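At run time, dbt resolves the Jinja and wraps the compiled SELECT in the DDL for the model's materialization. For a view, the statement dbt executes looks roughly like this (relation names are illustrative and depend on your profile):

```sql
-- Sketch of the DDL dbt emits for a view materialization (names illustrative)
create view dbt_dev.stg_orders as (
    -- the compiled SELECT from above, with {{ source('ecommerce', 'orders') }}
    -- resolved to raw_db.public.orders
    select * from renamed
);
```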
Define your source tables in a schema YAML:
# models/staging/sources.yml
version: 2

sources:
  - name: ecommerce
    database: raw_db
    schema: public
    tables:
      - name: orders
        description: "Raw orders from the e-commerce platform"
        columns:
          - name: id
            description: "Primary key"
            tests:
              - not_null
              - unique
Mart model - business-level aggregation:
-- models/marts/fct_orders.sql
{{
    config(
        materialized = 'table',
        indexes = [
            {'columns': ['customer_id']},
            {'columns': ['created_at']}
        ]
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

final as (
    select
        o.order_id,
        o.customer_id,
        c.email,
        c.country,
        o.status,
        o.amount_dollars,
        o.created_at
    from orders o
    left join customers c using (customer_id)
)

select * from final
Run models:
# Run all models
dbt run
# Run a specific model
dbt run --select stg_orders
# Run a model and all its dependencies
dbt run --select +fct_orders
# Run all models in a directory
dbt run --select staging.*
# Run with a different target
dbt run --target prod
Testing Data Quality
dbt has two types of tests: generic (schema YAML) and singular (custom SQL).
Generic tests in schema YAML:
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from raw source"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
      - name: amount_dollars
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
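The dbt_utils.expression_is_true test comes from the dbt-utils package, which is not installed by default. Declare it in a packages.yml file at the project root:

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
```

Then run dbt deps before dbt test so the package's macros are available.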
Singular test (custom SQL):
-- tests/assert_no_negative_revenue.sql
-- Test passes if this query returns 0 rows
select
order_id,
amount_dollars
from {{ ref('fct_orders') }}
where amount_dollars < 0
Run tests:
# Run all tests
dbt test
# Run tests for a specific model
dbt test --select stg_orders
# Run and stop on first failure
dbt test --fail-fast
Generating Documentation
# Generate documentation
dbt docs generate
# Serve docs locally (opens at http://localhost:8080)
dbt docs serve --port 8080
Enhance docs with descriptions in schema YAML:
models:
  - name: fct_orders
    description: |
      Fact table of all customer orders. One row per order.
      Includes customer demographics joined from stg_customers.
    columns:
      - name: order_id
        description: "Unique identifier for each order"
      - name: amount_dollars
        description: "Order total in USD"
The generated docs include a DAG lineage graph showing how models depend on each other.
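The same lineage data is written to target/manifest.json by dbt docs generate, whose parent_map maps each node to its direct upstream nodes. As a sketch, here is how you might walk that map to collect all ancestors of a model; the manifest excerpt is hand-written to mirror the real file's shape, and the node names are illustrative:

```python
import json  # in practice: json.load(open("target/manifest.json"))

# Hand-written excerpt shaped like a real manifest's parent_map
manifest = {
    "parent_map": {
        "model.my_analytics.stg_orders": ["source.my_analytics.ecommerce.orders"],
        "model.my_analytics.fct_orders": [
            "model.my_analytics.stg_orders",
            "model.my_analytics.stg_customers",
        ],
    }
}

def upstream(node, parent_map):
    """Collect every transitive ancestor of a node in the DAG."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parent_map.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("model.my_analytics.fct_orders", manifest["parent_map"])))
# → ['model.my_analytics.stg_customers', 'model.my_analytics.stg_orders',
#    'source.my_analytics.ecommerce.orders']
```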
Incremental Models
Incremental models only process new or changed rows, dramatically reducing compute time for large tables:
-- models/marts/fct_events.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        on_schema_change='sync_all_columns'
    )
}}

select
    event_id,
    user_id,
    event_type,
    properties,
    occurred_at
from {{ source('analytics', 'events') }}

{% if is_incremental() %}
-- Only process records newer than the latest in the table
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
Run full refresh when schema changes:
# Force full rebuild of incremental model
dbt run --select fct_events --full-refresh
CI/CD Integration
GitHub Actions workflow:
# .github/workflows/dbt.yml
name: dbt CI

on:
  pull_request:
    paths:
      - 'models/**'
      - 'tests/**'
      - 'dbt_project.yml'

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-postgres
      - name: dbt debug
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt debug --target ci
      - name: dbt run
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt run --target ci
      - name: dbt test
        env:
          DBT_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
        run: dbt test --target ci
Add a ci target to profiles.yml pointing to a dedicated CI schema.
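The ci entry sits alongside dev and prod under outputs; something like the following, where the host and schema names are illustrative:

```yaml
# ~/.dbt/profiles.yml (excerpt)
my_analytics:
  outputs:
    ci:
      type: postgres
      host: ci-db.example.com
      port: 5432
      user: dbt_ci
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_ci        # isolated schema so CI runs never touch dev or prod
      threads: 4
```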
Troubleshooting
"Could not find profile" error:
# Ensure profiles.yml exists
cat ~/.dbt/profiles.yml
# Or specify profile path
dbt run --profiles-dir /path/to/profiles
Model compilation errors:
# Compile models without running to check SQL
dbt compile --select my_model
cat target/compiled/my_analytics/models/my_model.sql
Test failures - identify the bad rows:
# dbt stores test results in target/
dbt test --store-failures
# Failed rows are saved as tables in a <target_schema>_dbt_test__audit schema
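You can then query the audit table for a given test directly; each table is named after its test (the names below are illustrative for a dbt_dev target schema):

```sql
-- Inspect the rows behind a failed uniqueness test (names illustrative)
select *
from dbt_dev_dbt_test__audit.unique_stg_orders_order_id
limit 100;
```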
Slow model runs:
# Check query execution time in target/run_results.json
cat target/run_results.json | python3 -c "
import json, sys
results = json.load(sys.stdin)['results']
for r in sorted(results, key=lambda x: x.get('execution_time', 0), reverse=True)[:5]:
    print(f\"{r.get('execution_time', 0):.1f}s {r['unique_id']}\")
"
Unexpected model dependencies:
# Check the model DAG
dbt ls --select +my_model+ --output json
Conclusion
dbt brings software engineering best practices to SQL data transformation: version control, modular code, automated testing, and self-generating documentation. By organizing transformations into staging and mart layers with the ref() function managing dependencies, dbt makes data pipelines maintainable and testable. Incremental models and CI/CD integration complete the picture for a production-grade analytics engineering workflow on any SQL data warehouse.