Grafana Loki and Promtail: Complete Configuration

Building a complete logging stack with Grafana, Loki, and Promtail provides comprehensive log aggregation and visualization. This guide covers deploying the entire stack, configuring complex pipeline stages, integrating with Grafana, and setting up log-based alerting for production environments.

Overview

A complete logging stack captures, processes, stores, and visualizes logs from every infrastructure component. Loki's label-based architecture keeps storage costs low because it indexes only a small set of labels rather than full log content, while Promtail's flexible configuration handles diverse log sources. Grafana provides unified visualization across metrics and logs.
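The cost model becomes concrete in LogQL: a query first narrows the stream set via the label index, then scans only the matching chunks for line filters and parsing. The queries below are illustrative; the `job` and `level` labels assume the scrape configs shown later in this guide.

```logql
# Stream selection via indexed labels, then a brute-force line filter
{job="nginx"} |= "error"

# Metric query: per-level log throughput over the last 5 minutes
sum by (level) (rate({job="app"} [5m]))
```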

Architecture

Stack Components

┌─────────────────────────────────────────────┐
│      Application Logs / Syslog / Files      │
└───────────────────┬─────────────────────────┘
                    │
          ┌─────────▼─────────┐
          │    Promtail       │
          │  - Collection     │
          │  - Parsing        │
          │  - Labeling       │
          └─────────┬─────────┘
                    │
          ┌─────────▼─────────┐
          │   Loki Server     │
          │  - Ingestion      │
          │  - Indexing       │
          │  - Storage        │
          └─────────┬─────────┘
                    │
       ┌────────────┼────────────┐
       │            │            │
     BoltDB         S3/GCS      Cassandra
        (index / chunk storage backends)

          ┌───────────────────┐
          │     Grafana       │──▶ queries Loki's HTTP API
          │  - Visualization  │
          │  - Alerting       │
          └───────────────────┘

System Preparation

Prerequisites

# System updates
sudo apt-get update && sudo apt-get upgrade -y

# Install dependencies
sudo apt-get install -y \
  curl wget unzip \
  git gcc make \
  openssl ca-certificates

# Create logging user
sudo useradd --no-create-home --shell /bin/false logging

Directory Structure

# Create directories
sudo mkdir -p /opt/logging/{loki,promtail,grafana}
sudo mkdir -p /var/lib/loki/{chunks,index,cache}
sudo mkdir -p /var/log/loki
sudo mkdir -p /etc/loki /etc/promtail

# Set permissions
sudo chown -R logging:logging /opt/logging
sudo chown -R logging:logging /var/lib/loki
sudo chown -R logging:logging /var/log/loki
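The brace expansion in `mkdir -p` creates all sibling directories in a single call. A quick self-contained demonstration in a throwaway location (the paths here are illustrative, not the real /var/lib/loki tree):

```shell
# Demonstrate brace expansion with mkdir -p in a temporary directory
demo=$(mktemp -d)
mkdir -p "$demo"/loki/{chunks,index,cache}
ls "$demo/loki"    # prints: cache  chunks  index
```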

Loki Server Configuration

Installation

cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo chmod +x /usr/local/bin/loki

Production Configuration

sudo tee /etc/loki/loki-config.yml > /dev/null << 'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  log_format: json
  graceful_shutdown_timeout: 10s

ingester:
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  max_chunk_age: 2h
  chunk_encoding: snappy
  chunk_target_size: 1048576
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
      heartbeat_timeout: 5m
    num_tokens: 128

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_line_size: 2097152
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 200
  max_entries_limit_per_query: 10000
  max_global_streams_per_user: 10000
  retention_period: 720h
  cardinality_limit: 100000

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    shared_store: filesystem
    cache_location: /var/lib/loki/cache
    shared_store_key_prefix: index/
  filesystem:
    directory: /var/lib/loki/chunks

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    enable_fifocache: true
    default_validity: 1h

# With boltdb-shipper, retention is enforced by the compactor (together with
# limits_config.retention_period), not the table manager.
compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: filesystem
  retention_enabled: true
  retention_delete_delay: 2h

query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      default_validity: 1h

tracing:
  enabled: false
EOF

sudo chown logging:logging /etc/loki/loki-config.yml

Systemd Service

sudo tee /etc/systemd/system/loki.service > /dev/null << 'EOF'
[Unit]
Description=Grafana Loki
Documentation=https://grafana.com/loki
After=network.target

[Service]
User=logging
Group=logging
Type=simple
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yml
Restart=on-failure
RestartSec=5

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=loki

# Resource limits
LimitNOFILE=65536
LimitNPROC=65536

# Security
ProtectSystem=full
ProtectHome=yes
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable loki
sudo systemctl start loki

Promtail Configuration

Base Configuration

sudo tee /etc/promtail/promtail-config.yml > /dev/null << 'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/loki/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: ${HOSTNAME}   # stays literal unless Promtail runs with -config.expand-env=true
          __path__: /var/log/{syslog,messages}

  - job_name: kernel
    static_configs:
      - targets:
          - localhost
        labels:
          job: kernel
          __path__: /var/log/kern.log

  - job_name: auth
    static_configs:
      - targets:
          - localhost
        labels:
          job: auth
          __path__: /var/log/auth.log

  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            attrs_status: attrs.status
      - output:
          source: output

  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - multiline:
          firstline: '^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<remote>[\w\.]+) (?P<host>[\w\.]+) (?P<user>[\w\-\.]+) \[(?P<timestamp>[\w:/]+\s[+\-]\d{4})\] "(?P<method>\w+) (?P<path>[^\s]+) (?P<protocol>[\w/\.]+)" (?P<status>\d+|-) (?P<bytes>\d+|-)\s?"?(?P<referer>[^\s]*)"?\s?"?(?P<agent>[^"]*)"?'
      - timestamp:
          source: timestamp
          format: '02/Jan/2006:15:04:05 -0700'
      - labels:
          status:
          method:
          # path is intentionally not a label: URL paths are unbounded and
          # would explode index cardinality
      - metrics:
          # the combined log format carries no request-duration field, so only
          # a request counter is emitted here
          http_requests_total:
            type: Counter
            description: "Total HTTP requests"
            prefix: "nginx_"
            max_idle_duration: 30s
            config:
              match_all: true
              action: inc

  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          env: production
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            timestamp: timestamp
            level: level
            msg: message
            trace_id: trace_id
            user_id: user_id
            request_path: request.path
            status_code: request.status_code
            response_time: response_time_ms
      - timestamp:
          source: timestamp
          format: 2006-01-02T15:04:05Z07:00
      - labels:
          level:
          # trace_id and user_id are high-cardinality; query them with LogQL
          # filters instead of promoting them to labels
      - drop:
          expression: '.*health_check.*'
      - metrics:
          request_duration_ms:
            type: Histogram
            description: "Request duration in milliseconds"
            source: response_time
            prefix: "app_"
            config:
              buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000]
      - match:
          selector: '{level="error"}'
          stages:
            - metrics:
                errors_total:
                  type: Counter
                  description: "Total errors"
                  prefix: "app_"
                  config:
                    match_all: true
                    action: inc
EOF

sudo chown logging:logging /etc/promtail/promtail-config.yml
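Before reloading Promtail it is worth checking a multiline anchor against real samples. The snippet below is a rough stand-in for the nginx job's firstline pattern (grep -E instead of Go's RE2 engine, which agrees for this pattern; the log lines are illustrative): a line beginning with an IPv4 address opens a new entry, anything else gets folded into the previous one.

```shell
# A line matching the anchor starts a new log entry...
sample='203.0.113.7 example.com - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512'
# ...while a continuation line (e.g. a stack trace) does not match
continuation='    at handler.go:42'

anchor='^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
echo "$sample"       | grep -Eq "$anchor" && echo "first line: new entry"
echo "$continuation" | grep -Eq "$anchor" || echo "continuation: folded"
```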

Promtail Service

# Download Promtail
cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail

# Create systemd service
sudo tee /etc/systemd/system/promtail.service > /dev/null << 'EOF'
[Unit]
Description=Grafana Promtail
Documentation=https://grafana.com/loki
After=network.target loki.service

[Service]
User=logging
Group=logging
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml -config.expand-env=true
Restart=on-failure
RestartSec=5

StandardOutput=journal
StandardError=journal
SyslogIdentifier=promtail

LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable promtail
sudo systemctl start promtail

Advanced Pipeline Stages

Multiline Log Parsing

scrape_configs:
  - job_name: java-app
    static_configs:
      - targets:
          - localhost
        labels:
          job: java
          __path__: /var/log/java-app/*.log
    pipeline_stages:
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) \[(?P<thread>[^\]]+)\] (?P<logger>[^\s]+)\s-\s(?P<message>.*)$'
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05'
      - labels:
          level:
          thread:
          logger:

Complex JSON Parsing

scrape_configs:
  - job_name: structured-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: structured
          __path__: /var/log/app/*.json
    pipeline_stages:
      - json:
          expressions:
            timestamp: '@timestamp'
            level: log.level
            message: message
            service: service.name
            trace_id: trace.id
            user_email: user.email
            duration_ms: duration_ms
            status_code: http.status_code
      - timestamp:
          source: timestamp
          format: '2006-01-02T15:04:05.000Z07:00'
      - labels:
          level:
          service:
          status_code:
          # trace_id stays out of the label set to keep cardinality bounded
      - metrics:
          request_total:
            type: Counter
            description: "Total requests"
            prefix: "app_"
            config:
              match_all: true
              action: inc
          duration_ms:
            type: Histogram
            description: "Request duration in milliseconds"
            source: duration_ms
            prefix: "app_"
            config:
              buckets: [10, 50, 100, 500, 1000, 5000, 10000]
      - match:
          selector: '{level="error"}'
          stages:
            - metrics:
                error_total:
                  type: Counter
                  description: "Total errors"
                  prefix: "app_"
                  config:
                    match_all: true
                    action: inc

Conditional Processing

scrape_configs:
  - job_name: conditional-processing
    static_configs:
      - targets:
          - localhost
        labels:
          job: conditional
          __path__: /var/log/app/*.log
    pipeline_stages:
      - regex:
          expression: '(?P<method>\w+) (?P<path>\S+) HTTP'
      - drop:
          expression: '(health_check|status_check|metrics)'
      - json:
          expressions:
            duration: duration
      - match:
          selector: '{method="GET"}'
          stages:
            - regex:
                expression: 'duration=(?P<duration>\d+)'
            - metrics:
                get_requests_total:
                  type: Counter
                  config:
                    match_all: true
                    action: inc
      - match:
          selector: '{method="POST"}'
          stages:
            - metrics:
                post_requests_total:
                  type: Counter
                  config:
                    match_all: true
                    action: inc

Grafana Integration

Adding the Loki Data Source

curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://localhost:3100",
    "access": "proxy",
    "isDefault": false,
    "jsonData": {
      "maxLines": 1000
    }
  }'
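The API call above creates the data source once; on configuration-managed hosts, Grafana's file-based provisioning is usually preferable because it survives reinstalls. A sketch, assuming Grafana's default provisioning path:

```yaml
# /etc/grafana/provisioning/datasources/loki.yml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    isDefault: false
    jsonData:
      maxLines: 1000
```

Grafana loads this directory at startup, so a restart (or reload) applies the change.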

Creating Log Dashboards

Dashboard JSON with Log Panels

{
  "dashboard": {
    "title": "Application Logs Dashboard",
    "panels": [
      {
        "title": "Log Volume",
        "targets": [
          {
            "expr": "sum(rate({job=\"app\"}[5m])) by (level)",
            "legendFormat": "{{level}}"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Error Logs",
        "targets": [
          {
            "expr": "{job=\"app\", level=\"error\"}",
            "format": "logs"
          }
        ],
        "type": "logs"
      },
      {
        "title": "Request Duration",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate({job=\"app\"} | json [5m])) by (le))"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}

Alerting on Logs

Creating Log-Based Alerts

# Alert on error rate
curl -X POST http://admin:admin@localhost:3000/api/ruler/grafana/rules/logs \
  -H "Content-Type: application/json" \
  -d '{
    "uid": "error-rate-alert",
    "title": "High Error Rate",
    "condition": "A",
    "data": [
      {
        "refId": "A",
        "queryType": "logs",
        "model": {
          "expr": "sum(rate({job=\"app\"} |= \"error\" [5m]))"
        }
      }
    ],
    "noDataState": "NoData",
    "execErrState": "Alerting",
    "for": "5m",
    "annotations": {
      "summary": "High error rate detected"
    }
  }'

Performance Optimization

Tuning Loki

# Increase chunk size for high-volume logging
# Edit loki-config.yml (ingester block)
chunk_target_size: 2097152  # aim for 2MB compressed chunks instead of 1MB

# Increase ingestion rate
ingestion_rate_mb: 200
ingestion_burst_size_mb: 400

# Adjust query cache
query_range:
  cache_results: true
  results_cache:
    cache:
      default_validity: 2h

Promtail Optimization

# Increase batching
clients:
  - url: http://localhost:3100/loki/api/v1/push
    batchwait: 2s
    batchsize: 2097152  # 2MB batches
    backoff_config:
      minbackoff: 100ms
      maxbackoff: 10s
      maxretries: 5

Troubleshooting

Health Checks

# Check Loki readiness
curl -f http://localhost:3100/ready

# Check Loki metrics
curl http://localhost:3100/metrics | grep loki_ingester

# Check Promtail metrics
curl http://localhost:9080/metrics | grep promtail

Verify Data Flow

# Query logs from the last hour (Loki expects nanosecond epoch timestamps)
curl -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="app"}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=100' | jq .

# Check Promtail position tracking
tail -20 /var/lib/loki/positions.yaml
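positions.yaml maps each tailed file to the last byte offset Promtail has shipped, so restarts resume without duplicating entries. Its shape looks roughly like this (paths and offsets are illustrative):

```yaml
positions:
  /var/log/syslog: "2046823"
  /var/log/nginx/access.log: "918274"
```

If a file never appears here, Promtail is not matching it — check the `__path__` globs and file permissions for the `logging` user.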

Debug Issues

# Enable debug logging
# Edit loki-config.yml
server:
  log_level: debug

# Restart services
sudo systemctl restart loki promtail

# Monitor logs
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
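When a pipeline stage misbehaves, Promtail's dry-run mode prints the processed entries to stdout instead of shipping them, which separates parsing problems from delivery problems. `--dry-run` and `--stdin` are documented Promtail flags; the log path below is illustrative.

```shell
# Replay a sample file through the configured pipeline; entries are printed,
# nothing is sent to Loki
promtail -config.file=/etc/promtail/promtail-config.yml --dry-run --stdin \
  < /var/log/nginx/access.log
```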

Conclusion

A complete Loki stack provides cost-effective log aggregation with powerful querying and visualization. By following this guide, you've built a production-ready logging infrastructure. Focus on designing efficient label hierarchies, leveraging pipeline stages for intelligent parsing, and setting appropriate retention policies. This foundation scales to large log volumes while maintaining fast query performance and low operational costs.