Grafana Alerting Rules Configuration
Grafana's unified alerting system provides powerful, flexible alert management directly within Grafana. It supports multiple data sources, complex evaluation logic, advanced routing, and rich notifications. This guide covers creating alert rules, configuring contact points, setting up notification policies, managing silences, and using alert templates for comprehensive alerting.
Table of Contents
- Introduction
- Alerting Architecture
- Alert Rules
- Contact Points
- Notification Policies
- Alert Silencing
- Templates and Labels
- Testing Alerts
- Advanced Features
- Troubleshooting
- Conclusion
Introduction
Grafana's unified alerting system consolidates alert management across multiple data sources. Unlike legacy alerts, unified alerting provides sophisticated routing, grouping, silencing, and integration with external systems, enabling mature alert management at scale.
Alerting Architecture
Alert Pipeline
Metrics/Logs Data Sources
↓
Alert Rules (Evaluation)
↓
Alert Instances Created
↓
├─ Silences (Suppression)
├─ Grouping
└─ Deduplication
↓
Routing Rules
↓
Contact Points
↓
Notifications
(Email, Slack, PagerDuty, etc.)
Alert Rules
Creating an Alert Rule
Navigate to Alerting > Alert rules > New alert rule
Basic Metric Alert
Alert name: High CPU Usage
Condition:
- Query A: SELECT mean(usage_user) FROM cpu WHERE time > now() - 10m
- Condition: A is above 80
Evaluation behavior:
- For: 5m
- Every: 1m
Annotation:
summary: "CPU usage is {{ $value | printf '%.2f' }}%"
description: "Host {{ $labels.instance }} has high CPU usage"
Advanced Alert with Multiple Conditions
Alert name: API Error Rate Alert
Conditions:
- Query A: sum(rate(http_requests_total{status=~"5.."}[5m]))
- Query B: sum(rate(http_requests_total[5m]))
- Math expression: A / B > 0.05 # 5% error rate
Evaluation: Every 1m for 5m
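To make the A / B math expression concrete, here is a small Python sketch of the same 5% error-rate check, using made-up request rates (Grafana evaluates this server-side; this is only an illustration of the logic):

```python
def error_rate_breached(errors_per_sec: float, total_per_sec: float,
                        threshold: float = 0.05) -> bool:
    """Mirror of the math expression A / B > threshold."""
    if total_per_sec == 0:
        return False  # no traffic in the window, nothing to alert on
    return errors_per_sec / total_per_sec > threshold

# Hypothetical rates returned by queries A and B
print(error_rate_breached(12.0, 400.0))  # 3% error rate -> False
print(error_rate_breached(30.0, 400.0))  # 7.5% error rate -> True
```

Guarding the zero-traffic case matters in the real rule too: a division by zero produces NaN, which surfaces as NoData or Error depending on the rule's state settings.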
Alert Rule JSON
{
  "uid": "alert-rule-1",
  "title": "High Memory Usage",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "datasourceUid": "prometheus-uid",
      "relativeTimeRange": {
        "from": 600,
        "to": 0
      },
      "model": {
        "expr": "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100",
        "refId": "A"
      }
    },
    {
      "refId": "B",
      "queryType": "",
      "datasourceUid": "__expr__",
      "model": {
        "type": "reduce",
        "expression": "A",
        "reducer": "last",
        "refId": "B"
      }
    },
    {
      "refId": "C",
      "queryType": "",
      "datasourceUid": "__expr__",
      "model": {
        "type": "math",
        "expression": "$B < 15",
        "refId": "C"
      }
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "summary": "High memory usage on {{ $labels.instance }}"
  },
  "labels": {
    "severity": "warning"
  }
}
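Before provisioning a rule payload like the one above, a quick local sanity check can catch wiring mistakes, such as a condition refId that no data node defines. This is a hedged sketch, not a Grafana API; the checks are minimal and only cover the fields shown in this guide:

```python
import json

def validate_rule(rule_json: str) -> list[str]:
    """Return a list of basic structural problems in an alert rule payload."""
    rule = json.loads(rule_json)
    problems = []
    ref_ids = {node.get("refId") for node in rule.get("data", [])}
    if rule.get("condition") not in ref_ids:
        problems.append(f"condition {rule.get('condition')!r} has no matching refId")
    for field in ("title", "for", "noDataState", "execErrState"):
        if field not in rule:
            problems.append(f"missing field {field!r}")
    return problems

sample = ('{"title": "High Memory Usage", "condition": "C", "for": "5m", '
          '"noDataState": "NoData", "execErrState": "Alerting", '
          '"data": [{"refId": "A"}, {"refId": "B"}, {"refId": "C"}]}')
print(validate_rule(sample))  # [] -> structurally sound
```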
Contact Points
Email Contact Point
Alerting > Contact Points > New Contact Point
Name: email-ops
Type: Email
Email address: [email protected]
Disable resolve message: false
Slack Contact Point
Name: slack-alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Channel: #monitoring
Username: Grafana Alerts
Icon emoji: :bell:
Mention Users: @devops-team
PagerDuty Contact Point
Name: pagerduty-oncall
Type: PagerDuty
Integration Key: YOUR_INTEGRATION_KEY
Client: Grafana
Severity: critical
Webhook Contact Point
Name: custom-webhook
Type: Webhook
URL: https://your-api.example.com/alerts
HTTP Method: POST
Authorization scheme: Bearer
Bearer token: YOUR_TOKEN
Configuring Contact Points via API
# Create email contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "email-ops",
"type": "email",
"settings": {
"addresses": "[email protected]"
}
}'
# Create Slack contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "slack-alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK",
"channel": "#monitoring",
"username": "Grafana"
}
}'
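When scripting many contact points, building the payloads programmatically avoids copy-paste errors in the hand-written JSON. A sketch of the payload construction only (actually posting them requires a running Grafana instance and the endpoint shown in the curl examples above):

```python
import json

def contact_point(name: str, cp_type: str, settings: dict) -> str:
    """Serialize a contact point payload for the provisioning API."""
    return json.dumps({"name": name, "type": cp_type, "settings": settings})

payloads = [
    contact_point("email-ops", "email", {"addresses": "[email protected]"}),
    contact_point("slack-alerts", "slack",
                  {"url": "https://hooks.slack.com/services/YOUR/WEBHOOK",
                   "channel": "#monitoring", "username": "Grafana"}),
]
for p in payloads:
    print(p)  # each string is the -d body of one authenticated POST
```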
Notification Policies
Default Route Configuration
Navigate to Alerting > Notification policies
Default policy:
- Group by: alertname, cluster, service
- Group wait: 10s
- Group interval: 10s
- Repeat interval: 12h
- Contact point: default-email
Complex Routing Rules
default-policy:
  receiver: default
  group_by:
    - alertname
    - cluster
    - service
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - receiver: critical-team
      match:
        severity: critical
      group_wait: 0s
      repeat_interval: 5m
    - receiver: slack-alerts
      match:
        severity: warning
      group_wait: 30s
      repeat_interval: 4h
    - receiver: api-team
      match_re:
        service: "api-.*"
      group_wait: 1m
    - receiver: database-team
      match:
        team: database
      routes:
        - receiver: db-oncall
          match:
            severity: critical
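Routing semantics can be surprising: child routes are evaluated in order, the first matching route wins, `match` is exact label equality, and `match_re` is a regular expression. A simplified Python sketch of that first-match, depth-first traversal (not Grafana's implementation; it omits `continue` and negative matchers):

```python
import re

def route(alert_labels: dict, policy: dict) -> str:
    """Return the receiver for an alert, walking nested routes depth-first."""
    for child in policy.get("routes", []):
        eq_ok = all(alert_labels.get(k) == v
                    for k, v in child.get("match", {}).items())
        re_ok = all(re.fullmatch(v, alert_labels.get(k, ""))
                    for k, v in child.get("match_re", {}).items())
        if eq_ok and re_ok:
            return route(alert_labels, child)  # recurse into nested routes
    return policy["receiver"]  # no child matched: this level's receiver applies

policy = {
    "receiver": "default",
    "routes": [
        {"receiver": "critical-team", "match": {"severity": "critical"}},
        {"receiver": "api-team", "match_re": {"service": "api-.*"}},
        {"receiver": "database-team", "match": {"team": "database"},
         "routes": [{"receiver": "db-oncall", "match": {"severity": "critical"}}]},
    ],
}
print(route({"severity": "critical"}, policy))                      # critical-team
print(route({"team": "database", "severity": "critical"}, policy))  # critical-team (first match wins)
print(route({"team": "database", "severity": "warning"}, policy))   # database-team
```

Note the second case: a critical database alert is caught by the earlier severity route and never reaches db-oncall. Route order matters as much as the matchers themselves.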
Nested Routing Policy
routing:
  group_by: ['alertname', 'cluster']
  receiver: 'default-receiver'
  routes:
    # Production alerts
    - match:
        environment: 'production'
      receiver: 'prod-team'
      group_wait: 10s
      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m
        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'
    # Staging alerts
    - match:
        environment: 'staging'
      receiver: 'staging-team'
      group_wait: 5m
Alert Silencing
Creating a Silence
Navigate to Alerting > Silences > New Silence
Match alerts:
- Label: severity
Operator: =
Value: warning
Schedule:
Start: 2024-01-15 14:00 UTC
End: 2024-01-15 16:00 UTC
Creator: ops-team
Comment: Maintenance window
Creating a Silence via API
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighMemoryUsage",
"isRegex": false
},
{
"name": "instance",
"value": ".*prod.*",
"isRegex": true
}
],
"startsAt": "2024-01-15T10:00:00Z",
"endsAt": "2024-01-15T11:00:00Z",
"createdBy": "automation",
"comment": "Scheduled maintenance"
}'
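To reason about which alert instances a silence will suppress, the matcher logic can be sketched as follows (a simplification: real Alertmanager matchers also support `!=` and `!~`, and every matcher must match for the silence to apply):

```python
import re

def silenced(alert_labels: dict, matchers: list[dict]) -> bool:
    """True if every silence matcher matches the alert's labels."""
    for m in matchers:
        value = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

# Matchers from the silence payload above
matchers = [
    {"name": "alertname", "value": "HighMemoryUsage", "isRegex": False},
    {"name": "instance", "value": ".*prod.*", "isRegex": True},
]
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-prod-1"}, matchers))     # True
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-staging-1"}, matchers))  # False
```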
Querying Active Silences
curl http://localhost:9093/api/v2/silences | jq '.'
Templates and Labels
Custom Alert Labels
- Alert name: Database Query Latency
Annotations:
summary: "Database latency alert"
description: "Query latency is {{ $value }}ms on {{ $labels.instance }}"
runbook: "https://wiki.example.com/db-latency"
dashboard: "https://grafana.example.com/d/db-dashboards"
Labels:
severity: "{{ if gt $value 1000 }}critical{{ else }}warning{{ end }}"
team: database
env: production
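The templated severity label above picks a level from the measured value at evaluation time. The same decision in plain Python, with the 1000 ms threshold taken from the template:

```python
def severity_for(latency_ms: float, critical_above: float = 1000.0) -> str:
    # Mirrors: {{ if gt $value 1000 }}critical{{ else }}warning{{ end }}
    return "critical" if latency_ms > critical_above else "warning"

print(severity_for(1500.0))  # critical
print(severity_for(250.0))   # warning
```

Be careful with dynamic labels in practice: a severity label that flips between values changes the alert's identity for grouping and routing, which can produce duplicate notifications.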
Template Functions
Alert: {{ .Alerts.Firing | len }} firing alerts
Resolved: {{ .Alerts.Resolved | len }} resolved alerts
Instance: {{ $labels.instance }}
Value: {{ $value }}
Timestamp: {{ .Now.Format "2006-01-02 15:04:05" }}
Notification Template
Alert: {{ .GroupLabels.alertname }}
Severity: {{ .GroupLabels.severity }}
{{ range .Alerts.Firing -}}
Instance: {{ .Labels.instance }}
Value: {{ .Value }}
{{ .Annotations.description }}
{{ end }}
Testing Alerts
Testing an Alert Rule
Navigate to Alerting > Alert rules > [Rule] > Test
Data: Select time range
Result: Shows evaluation result and alert instances
Sending a Test Notification
# Via API - Simulate alert
curl -X POST http://admin:admin@localhost:3000/api/v1/alerts/test \
-H "Content-Type: application/json" \
-d '{
"title": "Test Alert",
"state": "alerting",
"message": "This is a test notification"
}'
Manual Notification Test
Click "Test" on the contact point configuration page
Advanced Features
Conditional Alerting
- Query A: SERIES A query
- Query B: SERIES B query
- Condition: IF A > 100 AND B < 50 THEN alert
Grouped Alerts
Group by:
- alertname
- cluster
- instance
Grouped alert notifications show all related alerts together
Alert State Management
Alerting: Alert condition is true
Pending: Condition is true, but has not yet held for the full "for" duration
Resolved: Alert returned to normal
NoData: No data available for evaluation
Error: Error evaluating alert rule
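The Pending state exists because of the rule's "for" duration: the condition must stay true across consecutive evaluations before the alert transitions to Alerting. A minimal sketch of that state machine, with the evaluation interval and durations as hypothetical parameters:

```python
def walk_states(condition_results: list[bool], for_intervals: int) -> list[str]:
    """Track alert state per evaluation; for_intervals = for_duration / eval_interval."""
    states, true_streak = [], 0
    for breached in condition_results:
        true_streak = true_streak + 1 if breached else 0
        if not breached:
            states.append("Normal")        # condition cleared, streak resets
        elif true_streak > for_intervals:
            states.append("Alerting")      # held long enough to fire
        else:
            states.append("Pending")       # breached but still within "for"
    return states

# for: 5m, evaluated every 1m -> condition must hold across 5 intervals
print(walk_states([True, True, True, True, True, True, False], 5))
# ['Pending', 'Pending', 'Pending', 'Pending', 'Pending', 'Alerting', 'Normal']
```

The practical consequence: a brief spike shorter than the "for" duration only ever reaches Pending and never notifies anyone.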
Dynamic Thresholds
Using percentile functions:
histogram_quantile(0.95, metric) > threshold
Using rolling averages:
avg_over_time(metric[24h]) + 2*stddev_over_time(metric[24h])
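The rolling-average expression above fires when the current value exceeds the historical mean plus two standard deviations. Computed locally over an illustrative sample window:

```python
from statistics import mean, stdev

def dynamic_threshold(window: list[float], k: float = 2.0) -> float:
    """avg_over_time + k * stddev_over_time, over an in-memory window."""
    return mean(window) + k * stdev(window)

history = [100.0, 102.0, 98.0, 101.0, 99.0]  # hypothetical 24h of samples
threshold = dynamic_threshold(history)
print(round(threshold, 1))       # mean 100.0 + 2 * stdev ~1.58 -> 103.2
print(140.0 > threshold)         # a 140.0 reading breaches the dynamic threshold
```

Dynamic thresholds like this adapt to each host's baseline, which avoids hand-tuning a fixed number per instance; the trade-off is that a slow upward drift raises the threshold along with the problem.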
Troubleshooting
Verifying Alert Evaluation
# List provisioned alert rules
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules
# Check a specific rule
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules/{uid}
# View rule state and alert instances (Prometheus-compatible API)
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/rules
Debugging Notifications
# Check contact point configuration
curl http://admin:admin@localhost:3000/api/v1/provisioning/contact-points
# Verify notification policy
curl http://admin:admin@localhost:3000/api/v1/provisioning/policies
# Test contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points/{uid}/test
Common Issues
# Alert not firing
# 1. Check data source connectivity
# 2. Verify query returns data
# 3. Confirm threshold conditions
# 4. Check evaluation interval
# Notifications not received
# 1. Verify contact point configuration
# 2. Check notification policy routing
# 3. Review Grafana logs
# 4. Confirm external service connectivity
# View Grafana logs
docker logs grafana
journalctl -u grafana -f
Conclusion
Grafana's unified alerting provides comprehensive alert management capabilities. By following this guide, you've set up sophisticated alerting with flexible routing, multiple notification channels, and advanced silencing. Focus on designing clear alert rules with meaningful labels, setting thresholds based on service SLOs, and maintaining runbook links for on-call responders. Effective alerting is critical for operational reliability.


