Alertmanager Configuration for Prometheus
Alertmanager is a sophisticated alert management system designed to handle alerts generated by Prometheus. It provides alert deduplication, routing, silencing, and integration with various notification channels including email, Slack, PagerDuty, and webhooks. This guide covers configuration, routing strategies, receiver setup, and advanced features.
Table of Contents
- Introduction
- Architecture
- Installation
- Configuration Fundamentals
- Routing Configuration
- Grouping and Timing
- Receiver Configuration
- Inhibition Rules
- Silencing
- Advanced Routing
- Troubleshooting
- Conclusion
Introduction
Alertmanager solves a critical problem in alert-heavy monitoring: alert fatigue. By grouping related alerts, deduplicating notifications, and intelligently routing to the right channels, it transforms raw alerts into actionable notifications. It decouples alert generation from notification delivery, enabling flexible, sophisticated alert handling.
Architecture
Alert Flow
Prometheus Alerting Rules
↓
Fires Alerts
↓
Alertmanager
↓
Routing Engine
↓
├─ Grouping
├─ Silencing
└─ Inhibition
↓
Receiver Channels
↓
├─ Email
├─ Slack
├─ PagerDuty
├─ Webhook
└─ Custom Integrations
Installation
Download and Install
# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Install binaries
# Install binaries (the release tarball ships alertmanager and amtool)
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
sudo chmod 750 /etc/alertmanager /var/lib/alertmanager
Create Systemd Service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
Configuration Fundamentals
Basic Structure
The Alertmanager configuration file has these main sections:
global:
  # Global settings for all receivers
route:
  # Top-level routing rule
receivers:
  # Notification channel definitions
inhibit_rules:
  # Rules for suppressing alerts
Minimal Configuration
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname']

receivers:
  - name: 'default'

inhibit_rules: []
EOF
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo systemctl start alertmanager
Verify Configuration
# Validate the configuration file
amtool check-config /etc/alertmanager/alertmanager.yml
# Show the resulting routing tree
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml
Routing Configuration
Route Structure
Routes form a tree-like alert routing system:
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Child routes with more specific matching
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 1h
    - match_re:
        service: 'api-.*'
      receiver: 'api-team'
match and match_re
Match alerts using label matching:
routes:
  - match:
      job: 'prometheus'
    receiver: 'prometheus-team'
  - match:
      environment: 'production'
      severity: 'critical'
    receiver: 'critical-alerts'
  - match_re:
      alertname: '(High|Critical).*'
      instance: '.*prod.*'
    receiver: 'production-alerts'
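`match` tests exact label equality, while `match_re` applies a fully anchored regular expression. The selection logic can be sketched in Python (a simplified illustration, not Alertmanager's actual implementation):

```python
import re

def route_matches(alert_labels, match=None, match_re=None):
    """Return True if an alert's labels satisfy a route's matchers.

    match: exact equality per label; match_re: fully anchored regex,
    mirroring how Alertmanager anchors match_re patterns.
    """
    for name, value in (match or {}).items():
        if alert_labels.get(name) != value:
            return False
    for name, pattern in (match_re or {}).items():
        if not re.fullmatch(pattern, alert_labels.get(name, "")):
            return False
    return True

alert = {"alertname": "HighLatency", "service": "api-gateway", "severity": "warning"}
print(route_matches(alert, match_re={"service": "api-.*"}))   # True
print(route_matches(alert, match={"severity": "critical"}))   # False
```

Note that because the regex is anchored, `service: 'api-.*'` matches `api-gateway` but not `legacy-api-gateway`.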
Routing Priorities
Create nested routes for complex routing logic:
route:
  receiver: 'default'
  routes:
    # Production alerts take priority
    - match:
        environment: 'production'
      receiver: 'production'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 1h
      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m
        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'
    # Staging environment
    - match:
        environment: 'staging'
      receiver: 'staging'
      group_wait: 10s
      repeat_interval: 6h
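An alert entering the tree descends into the first matching child route; if no child matches, it stays with the current node's receiver. A minimal sketch of that walk (simplified: it ignores `continue` and regex matchers):

```python
def resolve_receiver(route, labels):
    """Descend into the first child whose match labels all equal the
    alert's labels; otherwise fall back to this route's receiver."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return resolve_receiver(child, labels)
    return route["receiver"]

# Mirrors the nested-route example above
tree = {
    "receiver": "default",
    "routes": [
        {"match": {"environment": "production"}, "receiver": "production",
         "routes": [{"match": {"severity": "critical"}, "receiver": "oncall"}]},
        {"match": {"environment": "staging"}, "receiver": "staging"},
    ],
}

print(resolve_receiver(tree, {"environment": "production", "severity": "critical"}))  # oncall
print(resolve_receiver(tree, {"environment": "qa"}))                                  # default
```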
Grouping and Timing
Group Configuration
Group related alerts to reduce notification noise:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s       # Wait before sending the first notification for a new group
  group_interval: 10s   # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # Re-send a notification for a still-firing group after 4 hours
Timing Examples
routes:
  # Critical alerts: immediate notification, repeat every 5 minutes
  - match:
      severity: 'critical'
    receiver: 'critical'
    group_wait: 0s
    group_interval: 1m
    repeat_interval: 5m
  # Warnings: wait 30 seconds, repeat hourly
  - match:
      severity: 'warning'
    receiver: 'warnings'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 1h
  # Info: wait 5 minutes, repeat daily
  - match:
      severity: 'info'
    receiver: 'info'
    group_wait: 5m
    group_interval: 5m
    repeat_interval: 24h
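For a single alert group that keeps firing, these parameters produce a predictable notification timeline: the first notification goes out after `group_wait`, and the still-firing group is re-sent every `repeat_interval`. A rough sketch (it deliberately ignores `group_interval`, which only matters when new alerts join an existing group):

```python
def notification_times(group_wait_s, repeat_interval_s, horizon_s):
    """Seconds (from alert start) at which notifications are sent for a
    continuously firing group, up to horizon_s. Simplified model."""
    times, t = [], group_wait_s
    while t <= horizon_s:
        times.append(t)
        t += repeat_interval_s
    return times

# Critical route (group_wait: 0s, repeat_interval: 5m) over a 15-minute outage:
print(notification_times(0, 300, 900))    # [0, 300, 600, 900]
# Warning route (group_wait: 30s, repeat_interval: 1h) over the same window:
print(notification_times(30, 3600, 900))  # [30]
```

The contrast shows why short `repeat_interval` values are reserved for critical routes: four pages versus one Slack message for the same outage.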
Receiver Configuration
Email Receiver
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'app-specific-password'
  smtp_require_tls: true
  smtp_from: '[email protected]'

receivers:
  - name: 'email-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        html: |
          {{ range .Alerts }}
          <strong>Alert:</strong> {{ .Labels.alertname }}<br>
          <strong>Instance:</strong> {{ .Labels.instance }}<br>
          <strong>Description:</strong> {{ .Annotations.description }}<br>
          {{ end }}
Slack Receiver
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#monitoring-alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Service: {{ .Labels.service }}
          Instance: {{ .Labels.instance }}
          {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'View in Grafana'
            url: 'https://grafana.example.com/d/dashboards'
PagerDuty Receiver
receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}'
        client: 'Prometheus'
        details:
          firing: '{{ range .Alerts.Firing }}{{ .Labels.instance }} {{ end }}'
          description: '{{ (index .Alerts 0).Annotations.description }}'
Webhook Receiver
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://your-api.example.com/alerts'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: 'your-token'
  - name: 'webhook-slack'
    webhook_configs:
      - url: 'https://your-custom-slack-bot.example.com/notify'
        send_resolved: true
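A webhook receiver POSTs a JSON body containing the group status, group labels, and an `alerts` array. A small parser for that payload (the field names follow the documented webhook payload format; the summary line format itself is just an example):

```python
import json

def summarize_webhook(body):
    """Turn the JSON body Alertmanager POSTs to a webhook into one line per alert."""
    payload = json.loads(body)
    return [
        "[{status}] {name} on {instance}: {desc}".format(
            status=a["status"],
            name=a["labels"].get("alertname", "unknown"),
            instance=a["labels"].get("instance", "unknown"),
            desc=a["annotations"].get("description", ""),
        )
        for a in payload.get("alerts", [])
    ]

# Minimal example payload with the fields used above
body = json.dumps({
    "status": "firing",
    "groupLabels": {"alertname": "HighCPU"},
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighCPU", "instance": "web-1"},
        "annotations": {"description": "CPU above 90% for 5m"},
    }],
})
print(summarize_webhook(body))  # ['[firing] HighCPU on web-1: CPU above 90% for 5m']
```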
Multiple Receivers
Send the same alert to multiple channels:
receivers:
  - name: 'critical-multi'
    slack_configs:
      - channel: '#critical-alerts'
    pagerduty_configs:
      - routing_key: 'YOUR_KEY'
    email_configs:
      - to: '[email protected]'
Inhibition Rules
Suppress Low-Priority Alerts
Prevent lower-severity alerts from notifying when higher-priority ones exist:
inhibit_rules:
  # Suppress warning alerts when a critical alert exists
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Suppress info when a warning exists
  - source_match:
      severity: 'warning'
    target_match:
      severity: 'info'
    equal: ['alertname', 'instance']
Complex Inhibition
inhibit_rules:
  # Don't alert on disk warnings if the service is already down
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighDiskUsage'
    equal: ['instance']
  # Don't alert on memory if the node is down
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'HighMemoryUsage'
    equal: ['instance']
  # Suppress replica alerts when the master is down
  - source_match:
      alertname: 'DatabaseMasterDown'
    target_match:
      alertname: 'DatabaseReplicaLag'
    equal: ['cluster']
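The semantics of a rule: a target alert is suppressed when some firing alert matches `source_match` and both alerts carry identical values for every label listed in `equal`. As a Python sketch (simplified; real inhibition rules also support regex matchers):

```python
def is_inhibited(target, firing, rule):
    """True if `target` (a label dict) is suppressed by `rule` given the
    list of currently firing alerts (also label dicts)."""
    for src in firing:
        if (all(src.get(k) == v for k, v in rule["source_match"].items())
                and all(target.get(k) == v for k, v in rule["target_match"].items())
                and all(src.get(k) == target.get(k) for k in rule["equal"])):
            return True
    return False

rule = {"source_match": {"alertname": "NodeDown"},
        "target_match": {"alertname": "HighMemoryUsage"},
        "equal": ["instance"]}
firing = [{"alertname": "NodeDown", "instance": "node-3"}]

# Same instance as the NodeDown alert: suppressed
print(is_inhibited({"alertname": "HighMemoryUsage", "instance": "node-3"}, firing, rule))  # True
# Different instance: still notifies
print(is_inhibited({"alertname": "HighMemoryUsage", "instance": "node-7"}, firing, rule))  # False
```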
Silencing
Silence via Web UI
- Access http://localhost:9093
- Click "Silences"
- Create a new silence:
  - Matchers: alertname=HighCPU
  - Duration: 1 hour
  - Creator: Your name
Silence via API
# Silence alerts matching criteria for 1 hour
curl -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "HighCPU",
        "isRegex": false
      },
      {
        "name": "instance",
        "value": ".*prod.*",
        "isRegex": true
      }
    ],
    "startsAt": "2024-01-01T10:00:00Z",
    "endsAt": "2024-01-01T11:00:00Z",
    "createdBy": "automation",
    "comment": "Maintenance window"
  }'
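Hand-writing the timestamps is error-prone, so scripted silences usually compute the window relative to the current time. A small helper that builds the request body (the matcher structure follows the API example above; the function name is just illustrative):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers, hours, created_by, comment):
    """Build the JSON body for POST /api/v1/silences covering the next `hours` hours."""
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return json.dumps({
        "matchers": matchers,
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    })

body = build_silence(
    [{"name": "alertname", "value": "HighCPU", "isRegex": False}],
    hours=1, created_by="automation", comment="Maintenance window",
)
# Then: curl -X POST -H "Content-Type: application/json" -d "$body" http://localhost:9093/api/v1/silences
```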
Query Silences
curl http://localhost:9093/api/v1/silences | jq '.data'
Delete a Silence
curl -X DELETE http://localhost:9093/api/v1/silences/silence_id
Advanced Routing
Environment-Based Routing
route:
  group_by: ['alertname', 'environment']
  routes:
    - match:
        environment: 'production'
      receiver: 'prod-pagerduty'
      group_wait: 0s
      repeat_interval: 5m
    - match:
        environment: 'staging'
      receiver: 'staging-slack'
      group_wait: 10s
      repeat_interval: 1h
    - match:
        environment: 'development'
      receiver: 'dev-slack'
      group_wait: 1m
      repeat_interval: 6h
Team-Based Routing
route:
  receiver: 'default'
  routes:
    - match:
        team: 'platform'
      receiver: 'platform-team'
      routes:
        - match:
            service: 'kubernetes'
          receiver: 'k8s-team'
    - match:
        team: 'database'
      receiver: 'db-team'
      routes:
        - match:
            service: 'mysql'
          receiver: 'mysql-team'
Troubleshooting
Check Alert Status
# View current alerts
curl http://localhost:9093/api/v1/alerts | jq .
# View grouped alerts (v2 API)
curl http://localhost:9093/api/v2/alerts/groups | jq .
Test Configuration
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes --config.file=/etc/alertmanager/alertmanager.yml
# Validate YAML syntax (requires PyYAML)
python3 -c "import yaml; yaml.safe_load(open('/etc/alertmanager/alertmanager.yml'))"
View Routes
amtool config routes --output=json
Debug Receiver Issues
# Check which receiver handles an alert
curl -X POST http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning"
},
"annotations": {
"summary": "Test"
}
}]'
# Check logs
journalctl -u alertmanager -f
Common Configuration Issues
# Webhook not being called - verify URL is correct
curl -X POST https://your-webhook.example.com/notify \
-H "Content-Type: application/json" \
-d '{"test": "data"}'
# Email not sending - verify SMTP settings
telnet smtp.gmail.com 587
# Slack not working - verify webhook URL
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-type: application/json' \
-d '{"text": "Test message"}'
Conclusion
Alertmanager transforms raw Prometheus alerts into intelligent, routed notifications. By mastering routing configuration, receiver setup, and inhibition rules, you can build an alert management system that reduces fatigue while ensuring critical issues reach the right people immediately. Continuously refine your routing rules based on operational experience, monitor alert quality metrics, and regularly review silence policies to maintain an effective alerting system.