Alertmanager Configuration for Prometheus

Alertmanager is a sophisticated alert management system designed to handle alerts generated by Prometheus. It provides alert deduplication, grouping, routing, silencing, and integration with notification channels including email, Slack, PagerDuty, and webhooks. This guide covers configuration, routing strategies, receiver setup, and advanced features.

Table of Contents

  • Introduction
  • Architecture
  • Installation
  • Configuration Fundamentals
  • Routing Configuration
  • Grouping and Timing
  • Receiver Configuration
  • Inhibition Rules
  • Silencing
  • Advanced Routing
  • Troubleshooting
  • Conclusion

Introduction

Alertmanager solves a critical problem in alert-heavy monitoring: alert fatigue. By grouping related alerts, deduplicating notifications, and intelligently routing them to the right channels, it transforms raw alerts into actionable notifications. It decouples alert generation from notification delivery, enabling flexible, sophisticated alert handling.

Architecture

Alert Flow

Prometheus Alerting Rules
        ↓
    Fires Alerts
        ↓
   Alertmanager
        ↓
   Routing Engine
        ↓
    ├─ Grouping
    ├─ Silencing
    └─ Inhibition
        ↓
   Receiver Channels
        ↓
    ├─ Email
    ├─ Slack
    ├─ PagerDuty
    ├─ Webhook
    └─ Custom Integrations

Installation

Download and Install

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

# Install binaries (the release tarball ships alertmanager and amtool)
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
sudo chmod 750 /etc/alertmanager /var/lib/alertmanager

Create the Systemd Service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager

Configuration Fundamentals

Basic Structure

The Alertmanager configuration file has these main sections:

global:
  # Global settings for all receivers

route:
  # Top-level routing rule

receivers:
  # Notification channel definitions

inhibit_rules:
  # Rules for suppressing alerts

Minimal Configuration

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname']

receivers:
  - name: 'default'

inhibit_rules: []
EOF

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo systemctl start alertmanager

Verify the Configuration

amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml

Routing Configuration

Route Structure

Routes form a tree-like alert routing system:

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    # Child routes with more specific matching
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 5m

    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 1h

    - match_re:
        service: 'api-.*'
      receiver: 'api-team'

match and match_re

Match alerts by exact label value (match) or by regular expression (match_re). Newer Alertmanager versions also support the unified matchers syntax:

routes:
  - match:
      job: 'prometheus'
    receiver: 'prometheus-team'

  - match:
      environment: 'production'
      severity: 'critical'
    receiver: 'critical-alerts'

  - match_re:
      alertname: '(High|Critical).*'
      instance: '.*prod.*'
    receiver: 'production-alerts'

Routing Priorities

Create nested routes for complex routing logic:

route:
  receiver: 'default'

  routes:
    # Production alerts take priority
    - match:
        environment: 'production'
      receiver: 'production'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 1h

      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m

        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'

    # Staging environment
    - match:
        environment: 'staging'
      receiver: 'staging'
      group_wait: 10s
      repeat_interval: 6h
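The nested-route logic above can be sketched in Python. This is an illustrative model, not Alertmanager's actual implementation; it handles only exact `match` labels and ignores `match_re`, `matchers`, and `continue`:

```python
# Illustrative sketch of route-tree matching: an alert descends into the first
# child route whose labels all match; if no child matches, the current route's
# receiver is used as the fallback.
def route_alert(route, labels):
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return route_alert(child, labels)  # descend into the matching child
    return route["receiver"]

# Mirror of the YAML routing tree above
route = {
    "receiver": "default",
    "routes": [
        {"receiver": "production", "match": {"environment": "production"},
         "routes": [
             {"receiver": "oncall", "match": {"severity": "critical"}},
             {"receiver": "prod-slack", "match": {"severity": "warning"}},
         ]},
        {"receiver": "staging", "match": {"environment": "staging"}},
    ],
}

print(route_alert(route, {"environment": "production", "severity": "critical"}))  # oncall
print(route_alert(route, {"environment": "production", "severity": "info"}))      # production
print(route_alert(route, {"environment": "development"}))                          # default
```

Note that a production alert with an unmatched severity falls back to the production receiver, not the global default, because matching stops at the deepest matching route.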

Grouping and Timing

Group Configuration

Group related alerts to reduce notification noise:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s           # Wait before sending the first notification for a new group
  group_interval: 10s       # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h       # Wait before re-sending a notification that has already been sent
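Under a simplified model (the group never changes after the first alert, so group_interval never applies), the notification schedule for a route can be estimated like this; `notification_times` is a hypothetical helper for illustration, not part of any Alertmanager tooling:

```python
# Simplified model of when notifications for one unchanging alert group go out:
# the first notification fires group_wait seconds after the first alert, and
# the unchanged group is re-sent every repeat_interval seconds after that.
def notification_times(first_alert_at, group_wait, repeat_interval, horizon):
    times = []
    t = first_alert_at + group_wait
    while t <= horizon:
        times.append(t)
        t += repeat_interval
    return times

# A critical route with group_wait: 0s and repeat_interval: 5m (300s),
# simulated over the first hour: notifications at 0s, 300s, 600s, ...
print(notification_times(0, 0, 300, 3600))
```

This makes the trade-off visible: a short repeat_interval keeps critical issues in front of the on-call engineer, while a long one (hours) is appropriate for warnings that do not need constant reminders.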

Timing Examples

routes:
  # Critical alerts: immediate notification, repeat every 5 minutes
  - match:
      severity: 'critical'
    receiver: 'critical'
    group_wait: 0s
    group_interval: 1m
    repeat_interval: 5m

  # Warnings: wait 30 seconds, repeat hourly
  - match:
      severity: 'warning'
    receiver: 'warnings'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 1h

  # Info: wait 5 minutes, repeat daily
  - match:
      severity: 'info'
    receiver: 'info'
    group_wait: 5m
    group_interval: 5m
    repeat_interval: 24h

Receiver Configuration

Email Receiver

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'app-specific-password'
  smtp_require_tls: true
  smtp_from: '[email protected]'

receivers:
  - name: 'email-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        html: |
          {{ range .Alerts }}
            <strong>Alert:</strong> {{ .Labels.alertname }}<br>
            <strong>Instance:</strong> {{ .Labels.instance }}<br>
            <strong>Description:</strong> {{ .Annotations.description }}<br>
          {{ end }}

Slack Receiver

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#monitoring-alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
            Service: {{ .Labels.service }}
            Instance: {{ .Labels.instance }}
            {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: 'https://grafana.example.com/d/dashboards'

PagerDuty Receiver

receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}'
        client: 'Prometheus'
        details:
          firing: '{{ range .Alerts.Firing }}{{ .Labels.instance }} {{ end }}'
          description: '{{ (index .Alerts 0).Annotations.description }}'

Webhook Receiver

receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://your-api.example.com/alerts'
        send_resolved: true
        http_config:
          authorization:
            type: 'Bearer'
            credentials: 'your-token'

  - name: 'webhook-slack'
    webhook_configs:
      - url: 'https://your-custom-slack-bot.example.com/notify'
        send_resolved: true
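On the receiving side, a webhook endpoint can be sketched with the Python standard library. The JSON field names (status, alerts, labels) follow Alertmanager's documented webhook payload format; the port and handler names are illustrative:

```python
# Minimal webhook receiver sketch using only the standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload):
    """Return one human-readable line per alert in a webhook payload."""
    return [
        f'{payload["status"]}: {a["labels"].get("alertname", "?")} '
        f'on {a["labels"].get("instance", "?")}'
        for a in payload.get("alerts", [])
    ]

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        for line in summarize(json.loads(body)):
            print(line)
        self.send_response(200)  # Alertmanager treats 2xx as delivered
        self.end_headers()

# To run the receiver on the port configured in the webhook URL:
# HTTPServer(("0.0.0.0", 5001), AlertHandler).serve_forever()
```

Alertmanager retries failed webhook deliveries, so the handler should return 2xx only after the alert has been safely accepted.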

Multiple Receivers

Send the same alert to multiple channels:

receivers:
  - name: 'critical-multi'
    slack_configs:
      - channel: '#critical-alerts'
    pagerduty_configs:
      - routing_key: 'YOUR_KEY'
    email_configs:
      - to: '[email protected]'

Inhibition Rules

Suppress Low-Priority Alerts

Prevent notifications for lower-severity alerts when a related higher-severity alert is already firing:

inhibit_rules:
  # Suppress warning alerts when critical exists
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # Suppress info when warning exists
  - source_match:
      severity: 'warning'
    target_match:
      severity: 'info'
    equal: ['alertname', 'instance']
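The matching semantics can be sketched as follows (an illustrative model, not the real implementation): a target alert is muted when some firing alert matches source_match and both alerts carry identical values for every label listed in equal.

```python
# Illustrative sketch of inhibition: is this target alert suppressed by rule,
# given the set of currently firing alerts?
def is_inhibited(target, firing_alerts, rule):
    def matches(labels, matchers):
        return all(labels.get(k) == v for k, v in matchers.items())

    if not matches(target, rule["target_match"]):
        return False  # rule does not apply to this alert at all
    return any(
        matches(src, rule["source_match"])
        and all(src.get(k) == target.get(k) for k in rule["equal"])
        for src in firing_alerts
    )

# Mirror of the first rule above: critical suppresses warning on the same
# alertname and instance.
rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "instance"]}

firing = [{"alertname": "HighCPU", "instance": "web-1", "severity": "critical"}]

print(is_inhibited({"alertname": "HighCPU", "instance": "web-1",
                    "severity": "warning"}, firing, rule))  # True
print(is_inhibited({"alertname": "HighCPU", "instance": "web-2",
                    "severity": "warning"}, firing, rule))  # False: instance differs
```

The second call shows why the equal list matters: without it, a critical alert on one host would silence warnings on every host.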

Complex Inhibition

inhibit_rules:
  # Don't alert on disk warnings if service is already down
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighDiskUsage'
    equal: ['instance']

  # Don't alert on memory if node is down
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'HighMemoryUsage'
    equal: ['instance']

  # Suppress replica alerts when master is down
  - source_match:
      alertname: 'DatabaseMasterDown'
    target_match:
      alertname: 'DatabaseReplicaLag'
    equal: ['cluster']

Silencing

Silence via Web UI

  1. Access http://localhost:9093
  2. Click "Silences"
  3. Create a new silence:
    • Matchers: alertname=HighCPU
    • Duration: 1 hour
    • Creator: Your name

Silence via API

# Silence alerts matching criteria for 1 hour
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "HighCPU",
        "isRegex": false
      },
      {
        "name": "instance",
        "value": ".*prod.*",
        "isRegex": true
      }
    ],
    "startsAt": "2024-01-01T10:00:00Z",
    "endsAt": "2024-01-01T11:00:00Z",
    "createdBy": "automation",
    "comment": "Maintenance window"
  }'
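Rather than hard-coding RFC 3339 timestamps, a small script can compute the silence window from the current time. `build_silence` is a hypothetical helper for illustration, not part of amtool:

```python
# Build a silence payload (matching the API call above) with a time window
# computed from "now" instead of hard-coded timestamps.
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers, duration_minutes, created_by, comment):
    now = datetime.now(timezone.utc)
    return {
        "matchers": matchers,
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

payload = build_silence(
    matchers=[{"name": "alertname", "value": "HighCPU", "isRegex": False}],
    duration_minutes=60,
    created_by="automation",
    comment="Maintenance window",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can then be POSTed to the silences endpoint with curl or an HTTP client of your choice.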

Query Silences

curl http://localhost:9093/api/v2/silences | jq .

Delete a Silence

curl -X DELETE http://localhost:9093/api/v2/silence/<silence_id>

Advanced Routing

Environment-Based Routing

route:
  group_by: ['alertname', 'environment']
  routes:
    - match:
        environment: 'production'
      receiver: 'prod-pagerduty'
      group_wait: 0s
      repeat_interval: 5m

    - match:
        environment: 'staging'
      receiver: 'staging-slack'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        environment: 'development'
      receiver: 'dev-slack'
      group_wait: 1m
      repeat_interval: 6h

Team-Based Routing

route:
  routes:
    - match:
        team: 'platform'
      receiver: 'platform-team'
      routes:
        - match:
            service: 'kubernetes'
          receiver: 'k8s-team'

    - match:
        team: 'database'
      receiver: 'db-team'
      routes:
        - match:
            service: 'mysql'
          receiver: 'mysql-team'

Troubleshooting

Check Alert Status

# View current alerts
curl http://localhost:9093/api/v2/alerts | jq .

# View grouped alerts
curl http://localhost:9093/api/v2/alerts/groups | jq .

Test the Configuration

amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml

# Validate YAML syntax (requires PyYAML)
python3 -c "import yaml; yaml.safe_load(open('/etc/alertmanager/alertmanager.yml'))"

View Routes

amtool config routes --output=json

Debug Receiver Issues

# Send a test alert, then watch the logs to see which receiver handles it
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test"
    }
  }]'

# Check logs
journalctl -u alertmanager -f

Common Configuración Issues

# Webhook not being called - verify URL is correct
curl -X POST https://your-webhook.example.com/notify \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Email not sending - verify SMTP settings
telnet smtp.gmail.com 587

# Slack not working - verify webhook URL
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-type: application/json' \
  -d '{"text": "Test message"}'

Conclusion

Alertmanager transforms raw Prometheus alerts into intelligent, routed notifications. By mastering routing configuration, receiver setup, and inhibition rules, you can build an alert management system that reduces fatigue while ensuring critical issues reach the right people immediately. Continuously refine your routing rules based on operational experience, monitor alert quality metrics, and regularly review silence policies to keep your alerting effective.