Grafana Loki and Promtail Complete Setup

Building a complete logging stack with Grafana, Loki, and Promtail provides comprehensive log aggregation and visualization. This guide covers deploying the entire stack, configuring complex pipeline stages, integrating with Grafana, and setting up log-based alerting for production environments.

Overview

A complete logging stack captures, processes, stores, and visualizes logs from all infrastructure components. Loki's label-based architecture reduces storage costs while Promtail's flexible configuration handles diverse log sources. Grafana provides unified visualization across metrics and logs.
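A concrete way to see why labels matter: Loki creates one stream (and one set of chunks) per unique label combination, so the stream count is the product of the label-value counts. The numbers below are illustrative:

```shell
# Back-of-envelope stream math: one stream per unique label set, so total
# streams is the product of the per-label value counts. Illustrative values.
jobs=5        # distinct values of the "job" label
hosts=20      # distinct values of the "host" label
levels=4      # distinct values of the "level" label
echo "$((jobs * hosts * levels)) streams"   # prints "400 streams"
```

Adding a high-cardinality label such as a user ID would multiply this by the number of users, which is why such values belong in the log body, not in labels.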

Architecture

Stack Components

┌─────────────────────────────────────────────┐
│      Application Logs / Syslog / Files      │
└───────────────────┬─────────────────────────┘
                    │
          ┌─────────▼─────────┐
          │    Promtail       │
          │  - Collection     │
          │  - Parsing        │
          │  - Labeling       │
          └─────────┬─────────┘
                    │
          ┌─────────▼─────────┐
          │   Loki Server     │
          │  - Ingestion      │
          │  - Indexing       │
          │  - Storage        │
          └─────────┬─────────┘
                    │
       ┌────────────┼────────────┐
       │            │            │
    BoltDB         S3/GCS      Cassandra
                    │
          ┌─────────▼─────────┐
          │     Grafana       │
          │  - Visualization  │
          │  - Alerting       │
          └───────────────────┘

System Preparation

Prerequisites

# System updates
sudo apt-get update && sudo apt-get upgrade -y

# Install dependencies
sudo apt-get install -y \
  curl wget unzip \
  git gcc make \
  openssl ca-certificates

# Create logging user
sudo useradd --no-create-home --shell /bin/false logging

Directory Structure

# Create directories
sudo mkdir -p /opt/logging/{loki,promtail,grafana}
sudo mkdir -p /var/lib/loki/{chunks,index,cache}
sudo mkdir -p /var/log/loki
sudo mkdir -p /etc/loki /etc/promtail

# Set permissions
sudo chown -R logging:logging /opt/logging
sudo chown -R logging:logging /var/lib/loki
sudo chown -R logging:logging /var/log/loki

Loki Server Setup

Installation

cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo chmod +x /usr/local/bin/loki

Production Configuration

sudo tee /etc/loki/loki-config.yml > /dev/null << 'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  log_format: json
  graceful_shutdown_timeout: 10s

ingester:
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  max_chunk_age: 2h
  chunk_encoding: snappy
  chunk_target_size: 1572864
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
      heartbeat_timeout: 5m
    num_tokens: 128

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_line_size: 2097152
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 200
  max_entries_limit_per_query: 10000
  max_global_streams_per_user: 10000
  retention_period: 720h
  cardinality_limit: 100000

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    shared_store: filesystem
    cache_location: /var/lib/loki/cache
    shared_store_key_prefix: index/
  filesystem:
    directory: /var/lib/loki/chunks

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    enable_fifocache: true
    default_validity: 1h

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      default_validity: 1h

tracing:
  enabled: false
EOF

sudo chown logging:logging /etc/loki/loki-config.yml
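The configuration above keeps chunks on the local filesystem. For durability, boltdb-shipper can ship both index and chunks to object storage instead; a sketch (the bucket name and region below are placeholders, and `object_store` in `schema_config` must be changed to `s3` to match):

```yaml
storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    shared_store: s3
    cache_location: /var/lib/loki/cache
  aws:
    s3: s3://us-east-1/my-loki-chunks   # placeholder region/bucket
    s3forcepathstyle: false
```

With IAM-based credentials the `s3://region/bucket` form shown here is enough; static keys can be embedded as `s3://access_key:secret@region/bucket`, though a credentials file is preferable in production.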

Systemd Service

sudo tee /etc/systemd/system/loki.service > /dev/null << 'EOF'
[Unit]
Description=Grafana Loki
Documentation=https://grafana.com/loki
After=network.target

[Service]
User=logging
Group=logging
Type=simple
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yml
Restart=on-failure
RestartSec=5

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=loki

# Resource limits
LimitNOFILE=65536
LimitNPROC=65536

# Security
ProtectSystem=full
ProtectHome=yes
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable loki
sudo systemctl start loki

Promtail Configuration

Base Configuration

sudo tee /etc/promtail/promtail-config.yml > /dev/null << 'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/loki/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: __HOSTNAME__   # replace with this machine's hostname
          __path__: /var/log/{syslog,messages}

  - job_name: kernel
    static_configs:
      - targets:
          - localhost
        labels:
          job: kernel
          __path__: /var/log/kern.log

  - job_name: auth
    static_configs:
      - targets:
          - localhost
        labels:
          job: auth
          __path__: /var/log/auth.log

  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            attrs_status: attrs.status
      - output:
          source: output

  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - multiline:
          line_start_pattern: '^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
      - regex:
          expression: '^(?P<remote>[\w\.]+) (?P<host>[\w\.]+) (?P<user>[\w\-\.]+) \[(?P<timestamp>[\w:/]+\s[+\-]\d{4})\] "(?P<method>\w+) (?P<path>[^\s]+) (?P<protocol>[\w/\.]+)" (?P<status>\d+|-) (?P<bytes>\d+|-)\s?"?(?P<referer>[^\s]*)"?\s?"?(?P<agent>[^"]*)"?'
      - timestamp:
          source: timestamp
          format: '02/Jan/2006:15:04:05 -0700'
      - labels:
          status:
          method:
          path:
      - metrics:
          http_requests_total:
            type: Counter
            description: "Total HTTP requests"
            prefix: "nginx_"
            max_idle_duration: 30s
            config:
              match_all: true
              action: inc
          http_request_duration_seconds:
            type: Histogram
            description: "Request duration"
            prefix: "nginx_"
            # response_time must be extracted upstream, e.g. add $request_time
            # to the nginx log_format and a capture group to the regex stage
            source: response_time
            max_idle_duration: 30s
            config:
              buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]

  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          env: production
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            timestamp: timestamp
            level: level
            msg: message
            trace_id: trace_id
            user_id: user_id
            request_path: request.path
            status_code: request.status_code
            response_time: response_time_ms
      - timestamp:
          source: timestamp
          format: 2006-01-02T15:04:05Z07:00
      - labels:
          level:
      # trace_id and user_id stay queryable in the log body; promoting them
      # to labels would create one stream per trace/user and explode cardinality
      - drop:
          expression: '.*health_check.*'
      - match:
          selector: '{level="error"}'
          stages:
            - metrics:
                app_errors_total:
                  type: Counter
                  description: "Total errors"
                  config:
                    match_all: true
                    action: inc
      - metrics:
          app_request_duration_ms:
            type: Histogram
            description: "Request duration in milliseconds"
            source: response_time
            config:
              buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000]
EOF

sudo chown logging:logging /etc/promtail/promtail-config.yml
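The format strings in the timestamp stages above are Go reference-time layouts: the literal moment Mon Jan 2 15:04:05 MST 2006 spells out the pattern. Promtail also accepts named aliases, so the two stages in this illustrative fragment parse the same RFC3339 timestamps:

```yaml
pipeline_stages:
  # named alias
  - timestamp:
      source: timestamp
      format: RFC3339
  # equivalent explicit Go layout
  # - timestamp:
  #     source: timestamp
  #     format: '2006-01-02T15:04:05Z07:00'
```

Other accepted aliases include RFC3339Nano, Unix, and UnixMs; anything else must be written as an explicit layout.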

Promtail Service

# Download Promtail
cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail

# Create systemd service
sudo tee /etc/systemd/system/promtail.service > /dev/null << 'EOF'
[Unit]
Description=Grafana Promtail
Documentation=https://grafana.com/loki
After=network.target loki.service

[Service]
User=logging
Group=logging
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml
Restart=on-failure
RestartSec=5

StandardOutput=journal
StandardError=journal
SyslogIdentifier=promtail

LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable promtail
sudo systemctl start promtail

Advanced Pipeline Stages

Multiline Log Parsing

scrape_configs:
  - job_name: java-app
    static_configs:
      - targets:
          - localhost
        labels:
          job: java
          __path__: /var/log/java-app/*.log
    pipeline_stages:
      - multiline:
          line_start_pattern: '^\d{4}-\d{2}-\d{2}'
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) \[(?P<thread>[^\]]+)\] (?P<logger>[^\s]+)\s-\s(?P<message>.*)$'
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05'
      - labels:
          level:
          thread:
          logger:

Complex JSON Parsing

scrape_configs:
  - job_name: structured-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: structured
          __path__: /var/log/app/*.json
    pipeline_stages:
      - json:
          expressions:
            timestamp: '"@timestamp"'
            level: log.level
            message: message
            service: service.name
            trace_id: trace.id
            user_email: user.email
            duration_ms: duration_ms
            status_code: http.status_code
      - timestamp:
          source: timestamp
          format: '2006-01-02T15:04:05.000Z07:00'
      - labels:
          level:
          service:
          status_code:
      # trace_id stays in the log body -- as a label it would create one
      # stream per trace
      - metrics:
          app_request_total:
            type: Counter
            description: "Total requests"
            config:
              match_all: true
              action: inc
      - match:
          selector: '{level="error"}'
          stages:
            - metrics:
                app_error_total:
                  type: Counter
                  description: "Total errors"
                  config:
                    match_all: true
                    action: inc
      - metrics:
          app_duration_ms:
            type: Histogram
            description: "Request duration in milliseconds"
            source: duration_ms
            config:
              buckets: [10, 50, 100, 500, 1000, 5000, 10000]
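A quick way to sanity-check the JMESPath expressions above is to run one sample event through jq locally (jq syntax is close enough for these simple paths; the event below is made up):

```shell
# Extract the same fields the json stage would, from one fabricated event.
event='{"@timestamp":"2023-10-10T13:55:36.000Z","log":{"level":"error"},"message":"db timeout","service":{"name":"api"},"duration_ms":154}'
echo "$event" | jq -r '[."@timestamp", .log.level, .service.name, (.duration_ms|tostring)] | join(" ")'
# -> 2023-10-10T13:55:36.000Z error api 154
```

If a path prints null here, the corresponding Promtail expression will come back empty as well.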

Conditional Processing

scrape_configs:
  - job_name: conditional-processing
    static_configs:
      - targets:
          - localhost
        labels:
          job: conditional
          __path__: /var/log/app/*.log
    pipeline_stages:
      - regex:
          expression: '(?P<method>\w+) (?P<path>\S+) HTTP'
      - drop:
          expression: '(health_check|status_check|metrics)'
      - labels:
          method:
      - json:
          expressions:
            duration: duration
      - match:
          selector: '{method="GET"}'
          stages:
            - regex:
                expression: 'duration=(?P<duration>\d+)'
            - metrics:
                get_requests_total:
                  type: Counter
                  description: "GET requests"
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{method="POST"}'
          stages:
            - metrics:
                post_requests_total:
                  type: Counter
                  description: "POST requests"
                  config:
                    match_all: true
                    action: inc

Grafana Integration

Add Loki Data Source

curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://localhost:3100",
    "access": "proxy",
    "isDefault": false,
    "jsonData": {
      "maxLines": 1000
    }
  }'

Creating Log Dashboards

Dashboard JSON with Log Panels

{
  "dashboard": {
    "title": "Application Logs Dashboard",
    "panels": [
      {
        "title": "Log Volume",
        "targets": [
          {
            "expr": "sum(rate({job=\"app\"}[5m])) by (level)",
            "legendFormat": "{{level}}"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Error Logs",
        "targets": [
          {
            "expr": "{job=\"app\", level=\"error\"}",
            "format": "logs"
          }
        ],
        "type": "logs"
      },
      {
        "title": "Request Duration",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate({job=\"app\"} | json [5m])) by (le))"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}

Alerting on Logs

Create Log-Based Alerts

# Alert on error rate via the alert-rule provisioning API (Grafana 9+).
# folderUID and datasourceUid below are placeholders -- substitute the UIDs
# of an existing folder and of your Loki data source.
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules \
  -H "Content-Type: application/json" \
  -d '{
    "uid": "error-rate-alert",
    "title": "High Error Rate",
    "ruleGroup": "logs",
    "folderUID": "logging",
    "condition": "A",
    "data": [
      {
        "refId": "A",
        "relativeTimeRange": { "from": 600, "to": 0 },
        "datasourceUid": "loki",
        "model": {
          "expr": "sum(rate({job=\"app\"} |= \"error\" [5m]))"
        }
      }
    ],
    "noDataState": "NoData",
    "execErrState": "Alerting",
    "for": "5m",
    "annotations": {
      "summary": "High error rate detected"
    }
  }'

Performance Optimization

Tuning Loki

# Increase chunk target size for high-volume logging
# Edit loki-config.yml (ingester block)
chunk_target_size: 3145728  # ~3MB compressed target (default 1572864)

# Increase ingestion rate
ingestion_rate_mb: 200
ingestion_burst_size_mb: 400

# Adjust query cache
query_range:
  cache_results: true
  results_cache:
    cache:
      default_validity: 2h

Promtail Optimization

# Increase batching
clients:
  - url: http://localhost:3100/loki/api/v1/push
    batchwait: 2s
    batchsize: 2097152  # 2MB batches
    backoff_config:
      minbackoff: 100ms
      maxbackoff: 10s
      maxretries: 5
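Promtail can also throttle itself so a runaway log file cannot flood Loki. A sketch using Promtail's own limits_config (field names as of Promtail 2.9; verify against your version's documentation):

```yaml
limits_config:
  readline_rate_enabled: true
  readline_rate: 10000     # max lines/sec across all tailed files
  readline_burst: 20000
```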

Troubleshooting

Health Checks

# Check Loki readiness
curl -f http://localhost:3100/ready

# Check Loki metrics
curl http://localhost:3100/metrics | grep loki_ingester

# Check Promtail metrics
curl http://localhost:9080/metrics | grep promtail

Verify Data Flow

# Query recent logs (start/end are nanosecond epoch timestamps)
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="app"}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=100' | jq .

# Check Promtail position tracking
tail -20 /var/lib/loki/positions.yaml
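If no logs appear, push a synthetic entry directly to Loki's API to separate Promtail problems from Loki problems. Loki's push endpoint expects nanosecond epoch timestamps as strings:

```shell
# Build a push-API payload by hand; the "smoke-test" job label is arbitrary.
now_ns="$(date +%s)000000000"
payload=$(printf '{"streams":[{"stream":{"job":"smoke-test"},"values":[["%s","hello from smoke test"]]}]}' "$now_ns")
echo "$payload"
# Then send it (requires Loki on localhost:3100):
# curl -s -H 'Content-Type: application/json' -XPOST \
#   -d "$payload" http://localhost:3100/loki/api/v1/push
```

After pushing, the entry should come back from a query for {job="smoke-test"}; if it does, ingestion and querying work and the problem is on the Promtail side.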

Debug Issues

# Enable debug logging
# Edit loki-config.yml
server:
  log_level: debug

# Restart services
sudo systemctl restart loki promtail

# Monitor logs
sudo journalctl -u loki -f
sudo journalctl -u promtail -f

Conclusion

A complete Loki stack provides cost-effective log aggregation with powerful querying and visualization. By following this guide, you've built a production-ready logging infrastructure. Focus on designing efficient label hierarchies, leveraging pipeline stages for intelligent parsing, and setting appropriate retention policies. This foundation scales to handle large-scale logging requirements while maintaining fast query performance and keeping operational costs low.