Alertmanager Configuration for Prometheus

Alertmanager is a sophisticated alert management system designed to handle alerts generated by Prometheus. It provides alert deduplication, routing, silencing, and integration with various notification channels including email, Slack, PagerDuty, and webhooks. This guide covers configuration, routing strategies, receiver setup, and advanced features.

Introduction

Alertmanager solves a critical problem in alert-heavy monitoring: alert fatigue. By grouping related alerts, deduplicating notifications, and intelligently routing to the right channels, it transforms raw alerts into actionable notifications. It decouples alert generation from notification delivery, enabling flexible, sophisticated alert handling.

Architecture

Alert Flow

Prometheus Alerting Rules
        ↓
    Fires Alerts
        ↓
   Alertmanager
        ↓
   Routing Engine
        ↓
    ├─ Grouping
    ├─ Silencing
    └─ Inhibition
        ↓
   Receiver Channels
        ↓
    ├─ Email
    ├─ Slack
    ├─ PagerDuty
    ├─ Webhook
    └─ Custom Integrations
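Alertmanager only handles delivery; Prometheus evaluates the alerting rules and pushes firing alerts to it. Wiring the two together takes a short fragment in prometheus.yml (the target address assumes the default port used throughout this guide, and the rule-file path is a placeholder):

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alerts/*.yml'    # alerting rule files evaluated by Prometheus
```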

Installation

Download and Install

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

# Install binaries (the tarball ships the server plus the amtool CLI)
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
sudo chmod 750 /etc/alertmanager /var/lib/alertmanager

Create Systemd Service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager

Configuration Fundamentals

Basic Structure

The Alertmanager configuration file has these main sections:

global:
  # Defaults shared by all receivers (SMTP, Slack API URL, resolve_timeout)

route:
  # Top-level routing tree

receivers:
  # Notification channel definitions

inhibit_rules:
  # Rules for suppressing alerts when related alerts are firing

templates:
  # Paths to custom notification template files

Minimal Configuration

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname']

receivers:
  - name: 'default'

inhibit_rules: []
EOF

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo systemctl start alertmanager

Verify Configuration

amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml

Routing Configuration

Route Structure

Routes form a tree. An alert enters at the top-level route and descends to the deepest matching child; that node's receiver and timing settings apply:

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    # Child routes with more specific matching
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 5m

    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 1h

    - match_re:
        service: 'api-.*'
      receiver: 'api-team'

Match and Match_RE

Match alerts using label matching:

routes:
  - match:
      job: 'prometheus'
    receiver: 'prometheus-team'

  - match:
      environment: 'production'
      severity: 'critical'
    receiver: 'critical-alerts'

  - match_re:
      alertname: '(High|Critical).*'
      instance: '.*prod.*'
    receiver: 'production-alerts'
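Note that match and match_re still work but have been deprecated since Alertmanager 0.22 in favor of the unified matchers list, which expresses equality and regex matches in one place. The last route above, rewritten:

```yaml
routes:
  - matchers:
      - alertname =~ "(High|Critical).*"
      - instance =~ ".*prod.*"
    receiver: 'production-alerts'
```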

Routing Priorities

Create nested routes for complex routing logic:

route:
  receiver: 'default'

  routes:
    # Production alerts take priority
    - match:
        environment: 'production'
      receiver: 'production'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 1h

      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m

        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'

    # Staging environment
    - match:
        environment: 'staging'
      receiver: 'staging'
      group_wait: 10s
      repeat_interval: 6h
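The first-match-wins descent above can be sketched in a few lines of Python. This is a hypothetical helper, not Alertmanager's implementation: continue, regex matchers, and inherited timing options are deliberately ignored.

```python
# Toy model: a child route applies when every label in its `match` block
# equals the alert's label; the first matching child is descended into,
# and a node with no matching child contributes its own receiver.

def resolve_receiver(route, labels):
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return resolve_receiver(child, labels)
    return route["receiver"]

# The routing tree from the YAML above, as a Python dict
tree = {
    "receiver": "default",
    "routes": [
        {"match": {"environment": "production"}, "receiver": "production",
         "routes": [
             {"match": {"severity": "critical"}, "receiver": "oncall"},
             {"match": {"severity": "warning"}, "receiver": "prod-slack"},
         ]},
        {"match": {"environment": "staging"}, "receiver": "staging"},
    ],
}

print(resolve_receiver(tree, {"environment": "production", "severity": "critical"}))  # oncall
print(resolve_receiver(tree, {"environment": "development"}))                         # default
```

A production alert with an unmatched severity (say, info) stops at the production node, which is exactly why intermediate routes carry their own receiver.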

Grouping and Timing

Group Configuration

Group related alerts to reduce notification noise:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s           # How long to wait before sending the first notification for a new group
  group_interval: 10s       # How long to wait before notifying about new alerts added to an existing group
  repeat_interval: 4h       # How long to wait before re-sending a notification that was already sent

Timing Examples

routes:
  # Critical alerts: immediate notification, repeat every 5 minutes
  - match:
      severity: 'critical'
    receiver: 'critical'
    group_wait: 0s
    group_interval: 1m
    repeat_interval: 5m

  # Warnings: wait 30 seconds, repeat hourly
  - match:
      severity: 'warning'
    receiver: 'warnings'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 1h

  # Info: wait 5 minutes, repeat daily
  - match:
      severity: 'info'
    receiver: 'info'
    group_wait: 5m
    group_interval: 5m
    repeat_interval: 24h
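If the interaction of these knobs is hard to picture, a toy model helps (a hypothetical helper, not Alertmanager code): the first notification leaves group_wait after the group's first alert, and an otherwise-unchanged group is re-notified every repeat_interval.

```python
def repeat_schedule(group_wait, repeat_interval, horizon):
    """Seconds after the first alert at which an otherwise-unchanged group
    is notified: once after group_wait, then every repeat_interval.
    Toy model only -- real Alertmanager evaluates groups on group_interval
    ticks, so actual send times align to those ticks."""
    times, t = [], group_wait
    while t <= horizon:
        times.append(t)
        t += repeat_interval
    return times

# The 'critical' route above (group_wait: 0s, repeat_interval: 5m), first 15 minutes:
print(repeat_schedule(0, 300, 900))  # [0, 300, 600, 900]
```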

Receivers Setup

Email Receiver

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'app-specific-password'
  smtp_require_tls: true
  smtp_from: '[email protected]'

receivers:
  - name: 'email-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        html: |
          {{ range .Alerts }}
            <strong>Alert:</strong> {{ .Labels.alertname }}<br>
            <strong>Instance:</strong> {{ .Labels.instance }}<br>
            <strong>Description:</strong> {{ .Annotations.description }}<br>
          {{ end }}

Slack Receiver

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#monitoring-alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
            Service: {{ .Labels.service }}
            Instance: {{ .Labels.instance }}
            {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: 'https://grafana.example.com/d/dashboards'

PagerDuty Receiver

receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}'
        client: 'Prometheus'
        details:
          firing: '{{ range .Alerts.Firing }}{{ .Labels.instance }} {{ end }}'
          description: '{{ (index .Alerts 0).Annotations.description }}'

Webhook Receiver

receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://your-api.example.com/alerts'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: 'your-token'

  - name: 'webhook-slack'
    webhook_configs:
      - url: 'https://your-custom-slack-bot.example.com/notify'
        send_resolved: true

Multiple Receivers

Send the same alert to multiple channels:

receivers:
  - name: 'critical-multi'
    slack_configs:
      - channel: '#critical-alerts'
    pagerduty_configs:
      - routing_key: 'YOUR_KEY'
    email_configs:
      - to: '[email protected]'

Inhibition Rules

Suppress Low-Priority Alerts

Suppress notifications for lower-severity alerts while a matching higher-severity alert is firing:

inhibit_rules:
  # Suppress warning alerts when critical exists
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # Suppress info when warning exists
  - source_match:
      severity: 'warning'
    target_match:
      severity: 'info'
    equal: ['alertname', 'instance']

Complex Inhibition

inhibit_rules:
  # Don't alert on disk warnings if service is already down
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighDiskUsage'
    equal: ['instance']

  # Don't alert on memory if node is down
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'HighMemoryUsage'
    equal: ['instance']

  # Suppress replica alerts when master is down
  - source_match:
      alertname: 'DatabaseMasterDown'
    target_match:
      alertname: 'DatabaseReplicaLag'
    equal: ['cluster']
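The core of a single inhibit rule can be sketched as follows (a hypothetical helper, not Alertmanager code; the real implementation first filters source and target alerts through the rule's source/target matchers, which is assumed already done here):

```python
# A target alert is muted when some source alert agrees with it on every
# label listed in `equal`.

def is_inhibited(target, source_alerts, equal):
    return any(all(s.get(k) == target.get(k) for k in equal)
               for s in source_alerts)

down = [{"alertname": "ServiceDown", "instance": "web-1"}]
print(is_inhibited({"alertname": "HighDiskUsage", "instance": "web-1"}, down, ["instance"]))  # True
print(is_inhibited({"alertname": "HighDiskUsage", "instance": "web-2"}, down, ["instance"]))  # False
```

This is why choosing the right equal labels matters: with equal: ['instance'], a ServiceDown on web-1 never silences a disk alert on web-2.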

Silencing

Silence via Web UI

  1. Access http://localhost:9093
  2. Click "Silences"
  3. Create new silence:
    • Matchers: alertname=HighCPU
    • Duration: 1 hour
    • Creator: Your name

Silence via API

# Silence alerts matching criteria for 1 hour (v2 API; v1 was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "HighCPU",
        "isRegex": false
      },
      {
        "name": "instance",
        "value": ".*prod.*",
        "isRegex": true
      }
    ],
    "startsAt": "2024-01-01T10:00:00Z",
    "endsAt": "2024-01-01T11:00:00Z",
    "createdBy": "automation",
    "comment": "Maintenance window"
  }'
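Hard-coded timestamps go stale quickly. A small sketch of building the same payload programmatically, so the RFC 3339 times are always current (matcher values are the same examples as above); the resulting JSON can then be POSTed with curl as shown:

```python
import json
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "alertname", "value": "HighCPU", "isRegex": False},
        {"name": "instance", "value": ".*prod.*", "isRegex": True},
    ],
    # RFC 3339 timestamps; Python emits "+00:00", which we normalize to "Z"
    "startsAt": now.isoformat().replace("+00:00", "Z"),
    "endsAt": (now + timedelta(hours=1)).isoformat().replace("+00:00", "Z"),
    "createdBy": "automation",
    "comment": "Maintenance window",
}
print(json.dumps(silence, indent=2))
```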

Query Silences

curl http://localhost:9093/api/v2/silences | jq .

Delete Silence

# Note the singular "silence" in the v2 deletion path
curl -X DELETE http://localhost:9093/api/v2/silence/silence_id

Advanced Routing

Environment-Based Routing

route:
  group_by: ['alertname', 'environment']
  routes:
    - match:
        environment: 'production'
      receiver: 'prod-pagerduty'
      group_wait: 0s
      repeat_interval: 5m

    - match:
        environment: 'staging'
      receiver: 'staging-slack'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        environment: 'development'
      receiver: 'dev-slack'
      group_wait: 1m
      repeat_interval: 6h

Team-Based Routing

route:
  routes:
    - match:
        team: 'platform'
      receiver: 'platform-team'
      routes:
        - match:
            service: 'kubernetes'
          receiver: 'k8s-team'

    - match:
        team: 'database'
      receiver: 'db-team'
      routes:
        - match:
            service: 'mysql'
          receiver: 'mysql-team'

Troubleshooting

Check Alert Status

# View current alerts
curl http://localhost:9093/api/v2/alerts | jq .

# View alerts grouped the same way the routing tree groups them
curl http://localhost:9093/api/v2/alerts/groups | jq .

Test Configuration

amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml

# Validate YAML syntax (requires python3 with PyYAML installed)
python3 -c "import yaml; yaml.safe_load(open('/etc/alertmanager/alertmanager.yml'))"

View Routes

amtool config routes --output=json

Debug Receiver Issues

# Fire a synthetic test alert to exercise routing, then watch the logs
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test"
    }
  }]'

# Check logs
journalctl -u alertmanager -f

Common Configuration Issues

# Webhook not being called - verify URL is correct
curl -X POST https://your-webhook.example.com/notify \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Email not sending - verify SMTP settings
telnet smtp.gmail.com 587

# Slack not working - verify webhook URL
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
  -H 'Content-type: application/json' \
  -d '{"text": "Test message"}'

Conclusion

Alertmanager transforms raw Prometheus alerts into intelligent, routed notifications. By mastering routing configuration, receiver setup, and inhibition rules, you create an alert management system that reduces fatigue while ensuring critical issues reach the right people immediately. Continuously refine your routing rules based on operational experience, monitor alert quality metrics, and regularly review silence policies to maintain an effective alerting system.