Promtail and Loki for Log Aggregation

Grafana Loki is a horizontally scalable log aggregation system inspired by Prometheus, using label-based indexing instead of full-text indexing to keep costs low while providing fast log queries through LogQL. Paired with Promtail for log scraping and Grafana for visualization, the PLG stack (Promtail-Loki-Grafana) delivers Prometheus-style observability for your logs.

Prerequisites

  • Ubuntu/Debian or CentOS/Rocky Linux server
  • Grafana 9.x+ (for visualization)
  • Object storage like S3 or MinIO (for production Loki storage)
  • Ports: 3100 (Loki HTTP), 9080 (Promtail HTTP)

Installing Loki

# Download and install Loki binary
LOKI_VERSION=3.0.0
wget https://github.com/grafana/loki/releases/download/v${LOKI_VERSION}/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo chmod +x /usr/local/bin/loki

# Create Loki user and directories
sudo useradd -r -s /bin/false loki
sudo mkdir -p /etc/loki /var/lib/loki/{chunks,rules,wal,compactor}
sudo chown -R loki:loki /etc/loki /var/lib/loki

# Basic Loki configuration (single-node, local filesystem storage)
sudo tee /etc/loki/loki-config.yaml >/dev/null <<'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9095

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 128
  max_query_series: 5000
  max_query_lookback: 0  # 0 = unlimited
  retention_period: 744h  # 31 days, enforced by the compactor below

# table_manager was removed in Loki 3.x; retention is handled by the compactor
compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
EOF

# Create systemd service
sudo tee /etc/systemd/system/loki.service >/dev/null <<'EOF'
[Unit]
Description=Grafana Loki
After=network-online.target

[Service]
User=loki
Group=loki
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yaml
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now loki
sudo systemctl status loki

# Verify Loki is running (the /ready endpoint may report not-ready for ~15s after startup)
curl -s http://localhost:3100/ready
curl -s http://localhost:3100/metrics | grep loki_build
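To confirm end-to-end ingestion before wiring up Promtail, you can push a synthetic entry straight to Loki's push API and query it back. A sketch, assuming Loki listens on localhost:3100; the `smoke-test` job label is arbitrary:

```shell
# Loki's push API expects nanosecond-precision string timestamps.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"
NOW_NS="$(date +%s%N)"
PAYLOAD="{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"${NOW_NS}\",\"hello from the smoke test\"]]}]}"
echo "$PAYLOAD"

# Push the entry (HTTP 204 on success), then read it back.
# || true keeps the script going if Loki is not reachable yet.
curl -s -o /dev/null -w "%{http_code}\n" -H "Content-Type: application/json" \
  -X POST "${LOKI_URL}/loki/api/v1/push" -d "$PAYLOAD" || true
curl -s -G "${LOKI_URL}/loki/api/v1/query_range" \
  --data-urlencode 'query={job="smoke-test"}' || true
```

If the push succeeds, the query response should contain the pushed line under the `{job="smoke-test"}` stream.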

Configuring Promtail

Promtail scrapes log files and sends them to Loki:

# Download and install Promtail
LOKI_VERSION=3.0.0
wget https://github.com/grafana/loki/releases/download/v${LOKI_VERSION}/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail

# Create the config directory first, then write the Promtail configuration
# (the unquoted EOF lets the shell expand ${HOSTNAME} at write time)
sudo mkdir -p /etc/promtail
sudo tee /etc/promtail/promtail-config.yaml >/dev/null <<EOF
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml  # Tracks read positions

clients:
  - url: http://loki-server:3100/loki/api/v1/push  # Replace loki-server with your Loki host

scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: ${HOSTNAME}
          __path__: /var/log/syslog

  # Auth logs
  - job_name: auth
    static_configs:
      - targets:
          - localhost
        labels:
          job: auth
          host: ${HOSTNAME}
          __path__: /var/log/auth.log

  # Nginx access logs with pipeline
  - job_name: nginx_access
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          log_type: access
          host: ${HOSTNAME}
          __path__: /var/log/nginx/access.log

  # Nginx error logs
  - job_name: nginx_error
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          log_type: error
          host: ${HOSTNAME}
          __path__: /var/log/nginx/error.log

  # Application logs (multiple files with glob)
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          environment: production
          host: ${HOSTNAME}
          __path__: /var/log/app/*.log
EOF

sudo mkdir -p /var/lib/promtail /etc/promtail
sudo useradd -r -s /bin/false promtail
# Add promtail user to adm group to read system logs
sudo usermod -a -G adm promtail
sudo chown -R promtail:promtail /var/lib/promtail /etc/promtail

sudo tee /etc/systemd/system/promtail.service >/dev/null <<'EOF'
[Unit]
Description=Promtail log shipper
After=network-online.target

[Service]
User=promtail
Group=promtail
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now promtail
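Beyond tailing files, Promtail can also read the systemd journal directly. A sketch of an additional scrape config, assuming a Promtail build with journal support (the official Linux amd64 release binaries generally include it):

```yaml
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h          # Ignore entries older than this on first start
      labels:
        job: systemd-journal
    relabel_configs:
      # Turn the systemd unit name into a queryable label
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```

With the `unit` label in place, queries like `{job="systemd-journal", unit="nginx.service"}` select a single service's journal output.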

Label Extraction and Pipeline Stages

Promtail pipeline stages parse and enrich log entries:

# Enhanced Promtail config with pipeline stages
scrape_configs:
  - job_name: nginx_parsed
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          host: server-01
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      # Parse nginx combined log format. Note: request_time is NOT part of the
      # combined format; it is captured optionally here and requires adding
      # $request_time to your nginx log_format to actually appear.
      - regex:
          expression: '^(?P<remote_addr>[\w\.]+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<method>\S+) (?P<request_uri>\S+) (?P<protocol>\S+)" (?P<status>\d+) (?P<body_bytes_sent>\d+)(?: .* (?P<request_time>[\d.]+))?'
      # Extract labels from parsed fields
      - labels:
          status:
          method:
      # Emit a histogram metric from the extracted request_time field
      - metrics:
          nginx_response_time_seconds:
            type: Histogram
            description: "Nginx response time"
            source: request_time
            config:
              buckets: [0.001, 0.01, 0.1, 0.5, 1.0, 5.0]
      # Add timestamp from log
      - timestamp:
          source: time_local
          format: "02/Jan/2006:15:04:05 -0700"
      # Drop noisy health check requests
      - drop:
          expression: '.*healthz.*'

  # JSON log pipeline
  - job_name: json_app
    static_configs:
      - targets:
          - localhost
        labels:
          job: json-app
          __path__: /var/log/app/app.json.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            trace_id: trace_id
            service: service
      - labels:
          level:
          service:
      # Tag error lines with a static severity label
      - match:
          selector: '{job="json-app"} |= "error"'
          stages:
            - static_labels:
                severity: error
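Stack traces and other multi-line events would otherwise arrive as separate log entries; Promtail's `multiline` stage can stitch them back together. A sketch, assuming each logical entry starts with an ISO-style date:

```yaml
    pipeline_stages:
      - multiline:
          # A new entry starts with a timestamp like 2024-01-15; any line that
          # doesn't match (e.g. a stack trace line) is appended to the previous entry.
          firstline: '^\d{4}-\d{2}-\d{2}'
          # Flush a partial entry if no continuation arrives within this window
          max_wait_time: 3s
```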

Loki Storage Backends

# Production: S3 storage backend
common:
  storage:
    s3:
      endpoint: s3.amazonaws.com
      bucketnames: your-loki-bucket
      region: us-east-1
      access_key_id: YOUR_ACCESS_KEY
      secret_access_key: YOUR_SECRET_KEY
      s3forcepathstyle: false

# Production: MinIO storage (self-hosted S3-compatible)
common:
  storage:
    s3:
      endpoint: minio.internal:9000
      bucketnames: loki-data
      region: us-east-1  # Required but ignored by MinIO
      access_key_id: minio-access-key
      secret_access_key: minio-secret-key
      s3forcepathstyle: true
      insecure: false

# Configure chunk caching with Redis
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis:6379
      db: 0
      expiration: 1h

# Retention policy configuration (enforced by the compactor)
limits_config:
  retention_period: 744h  # 31 days globally
  # Per-tenant overrides additionally require auth_enabled: true and a runtime overrides file

# Ruler storage for alerting/recording rules
ruler:
  storage:
    type: s3
    s3:
      bucketnames: loki-rules
      # ... same s3 config

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: s3
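With the compactor handling retention, you can also shorten (or extend) retention for specific streams via `retention_stream` in `limits_config`. A sketch that keeps debug logs for only one day:

```yaml
limits_config:
  retention_period: 744h            # Default for everything else
  retention_stream:
    - selector: '{job="app", level="debug"}'
      priority: 1                   # Higher priority wins when selectors overlap
      period: 24h
```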

LogQL Queries

LogQL is Loki's query language:

# Basic log stream selection
{job="nginx"}

# Filter by label and content
{job="nginx", status="500"} |= "upstream"

# Show error logs (the line limit is a query-API/Grafana parameter, not LogQL syntax)
{job="nginx"} |= "error"

# Parse nginx access log and filter slow requests
{job="nginx"}
| regexp `(?P<method>\w+) (?P<path>\S+).*" (?P<status>\d+) .* (?P<response_time>[\d.]+)$`
| response_time > 2.0

# Per-second error rate over a 5-minute window, by host (metric query)
sum(rate({job="nginx"} |= "error" [5m])) by (host)

# Count HTTP status codes
sum by (status) (rate({job="nginx"} [5m]))

# Error rate percentage
sum(rate({job="nginx", status=~"5.."}[5m])) 
/ sum(rate({job="nginx"}[5m])) * 100

# Find failed SSH logins
{job="auth"} |= "Failed password"
| regexp `Failed password for (?P<user>\S+) from (?P<ip>[\d.]+)`
| line_format "User: {{.user}} from IP: {{.ip}}"

# JSON log parsing
{job="json-app"} 
| json 
| level="error" 
| line_format "{{.timestamp}} [{{.level}}] {{.message}} trace={{.trace_id}}"

# Top 10 slowest endpoints (average of the extracted duration over 5m)
topk(10,
  avg_over_time(
    {job="nginx"}
    | regexp `(?P<path>/[^\s?]+).*(?P<duration>[\d.]+)$`
    | unwrap duration [5m]
  ) by (path)
)
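Note that the number of returned log lines is not part of LogQL itself; it is a parameter of the query API (or Grafana's "Line limit" option). Hitting the HTTP API directly with curl, assuming Loki on localhost:3100:

```shell
# Range query over the last hour; limit caps the number of returned lines.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"
START="$(date -d '1 hour ago' +%s)000000000"   # API takes nanosecond timestamps
END="$(date +%s)000000000"
echo "querying ${LOKI_URL} from ${START} to ${END}"
curl -s -G "${LOKI_URL}/loki/api/v1/query_range" \
  --data-urlencode 'query={job="nginx"} |= "error"' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=100' || true
```

The same endpoint accepts metric queries (e.g. the `sum by (status)` example above), returning a Prometheus-style matrix instead of log streams.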

Grafana Log Visualization

# Add Loki data source in Grafana
curl -s -X POST \
  -H "Content-Type: application/json" \
  -u admin:grafana-password \
  http://grafana:3000/api/datasources \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://loki:3100",
    "access": "proxy",
    "basicAuth": false
  }'
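Instead of the API call above, the data source can also be provisioned declaratively; Grafana loads files like this from /etc/grafana/provisioning/datasources/ at startup:

```yaml
# /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
    isDefault: false
```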

# Create a log dashboard panel via API or Grafana UI
# In Grafana: + > Dashboard > Add visualization > Select Loki datasource
# Use "Logs" visualization type for raw log viewing
# Use "Time series" or "Bar chart" for aggregated metrics from LogQL metric queries

# Example: Nginx dashboard with Logs panel
# Query: {job="nginx", log_type="access"}  (add "| json" only if nginx emits JSON logs)
# Visualization: Logs
# Labels: status, method

# Correlate logs with Prometheus metrics
# In Grafana, use the "Explore" view to correlate:
# - Prometheus panel showing latency spike
# - Switch to Logs explorer with same time range
# - Filter {job="nginx"} |= "error"

Troubleshooting

Promtail not sending logs:

# Check Promtail is running and reading files
curl -s http://localhost:9080/metrics | grep promtail_files_active

# Check positions file to see where Promtail is reading
cat /var/lib/promtail/positions.yaml

# View Promtail logs
sudo journalctl -u promtail -n 50

# Test Promtail configuration
promtail -config.file=/etc/promtail/promtail-config.yaml -dry-run

Loki returning "too many outstanding requests":

# Increase query concurrency limits in loki-config.yaml
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

# Or reduce query time range in LogQL
# Instead of last 7 days, use last 1 hour

High memory usage:

# Reduce ingestion rate limits
limits_config:
  ingestion_rate_mb: 32
  ingestion_burst_size_mb: 64

# Enable chunk compression (this setting lives under the ingester block)
ingester:
  chunk_encoding: snappy

Labels not being extracted:

# Test pipeline stages locally with promtail in dry-run mode
echo '127.0.0.1 - - [15/Jan/2024:10:00:00 +0000] "GET /api/health HTTP/1.1" 200 42' | \
  promtail -config.file=/etc/promtail/promtail-config.yaml -stdin -dry-run

# Check Promtail targets and labels
curl -s http://localhost:9080/targets | jq

Conclusion

The Promtail + Loki + Grafana stack provides a cost-effective, horizontally scalable log aggregation solution that integrates naturally with Prometheus-based monitoring. Loki's label-based indexing approach keeps storage costs dramatically lower than full-text search solutions, while LogQL provides powerful querying capabilities. Start with filesystem storage for development and migrate to S3-compatible object storage as your log volume grows, using Loki's retention policies to automatically manage storage costs.