Grafana Alerting Rules Configuration

Grafana's unified alerting system provides powerful, flexible alert management directly within Grafana. It supports multiple data sources, complex evaluation logic, advanced routing, and rich notifications. This guide covers creating alert rules, configuring contact points and notification policies, managing silences, and using alert templates for comprehensive alerting.

Introduction

Grafana's unified alerting system consolidates alert management across multiple data sources. Unlike legacy dashboard alerts, unified alerting provides sophisticated routing, grouping, silencing, and integration with external systems, enabling mature alert management at scale.

Alerting Architecture

Alert Pipeline

Metrics/Logs Data Sources
    ↓
Alert Rules (Evaluation)
    ↓
Alert Instances Created
    ↓
├─ Silences (Suppression)
├─ Grouping
└─ Deduplication
    ↓
Routing Rules
    ↓
Contact Points
    ↓
Notifications
(Email, Slack, PagerDuty, etc.)

Alert Rules

Creating an Alert Rule

Navigate to Alerting > Alert rules > New alert rule

Basic Metric Alert

Alert name: High CPU Usage
Condition: 
  - Query A: SELECT mean(usage_user) FROM cpu WHERE time > now() - 10m
  - Condition: A is above 80
Evaluation behavior:
  - For: 5m
  - Every: 1m
Annotation:
  summary: "CPU usage is {{ $value | printf '%.2f' }}%"
  description: "Host {{ $labels.instance }} has high CPU usage"

Advanced Alert with Multiple Conditions

Alert name: API Error Rate Alert
Conditions:
  - Query A: sum(rate(http_requests_total{status=~"5.."}[5m]))
  - Query B: sum(rate(http_requests_total[5m]))
  - Math expression: A / B > 0.05  # 5% error rate
Evaluation: Every 1m for 5m
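The math expression above combines two query results into a single boolean condition. A minimal Python sketch of that logic (illustrative only, not Grafana internals; the function name and zero-traffic handling are assumptions):

```python
# Sketch of the "A / B > 0.05" math expression: combine the 5xx-rate query (A)
# and the total-rate query (B) into an alert condition.

def error_rate_condition(a_5xx_rate: float, b_total_rate: float,
                         threshold: float = 0.05) -> bool:
    """Return True when the 5xx error ratio exceeds the threshold."""
    if b_total_rate == 0:
        return False  # no traffic -> no error-rate alert
    return (a_5xx_rate / b_total_rate) > threshold

# Example: 12 errored req/s out of 150 req/s total -> 8% error rate, fires
print(error_rate_condition(12.0, 150.0))  # True
```

Guarding against a zero denominator matters in the real rule too: with no traffic, A / B evaluates to NaN and the rule can flap into NoData or Error states.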

Alert Rule JSON

{
  "uid": "alert-rule-1",
  "title": "High Memory Usage",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "model": {
        "expr": "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100",
        "intervalFactor": 2,
        "refId": "A"
      },
      "datasourceUid": "prometheus-uid",
      "relativeTimeRange": {
        "from": 10,
        "to": 0
      }
    },
    {
      "refId": "B",
      "queryType": "",
      "expression": "A",
      "reducer": "last",
      "settings": {
        "mode": "strict"
      }
    },
    {
      "refId": "C",
      "queryType": "",
      "expression": "B",
      "reducer": "last",
      "settings": {},
      "mathExpression": "$B > 85",
      "type": "threshold"
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "summary": "High memory usage on {{ $labels.instance }}"
  },
  "labels": {
    "severity": "warning"
  }
}
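Rules like this can also be managed as code through Grafana's file provisioning. A minimal sketch of an alert-rule provisioning file (placed under provisioning/alerting/; the group and folder names here are assumptions, and the data array is elided for brevity):

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: memory-alerts        # hypothetical evaluation group name
    folder: Infrastructure     # hypothetical folder; created if missing
    interval: 1m               # evaluation frequency for the whole group
    rules:
      - uid: alert-rule-1
        title: High Memory Usage
        condition: C
        data: []               # same query/expression array as the JSON above
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage on {{ $labels.instance }}
```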

Contact Points

Email Contact Point

Alerting > Contact Points > New Contact Point

Name: email-ops
Type: Email
Email address: [email protected]
Disable resolve message: false

Slack Contact Point

Name: slack-alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Channel: #monitoring
Username: Grafana Alerts
Icon emoji: :bell:
Mention Users: @devops-team

PagerDuty Contact Point

Name: pagerduty-oncall
Type: PagerDuty
Integration Key: YOUR_INTEGRATION_KEY
Client: Grafana
Severity: critical

Webhook Contact Point

Name: custom-webhook
Type: Webhook
URL: https://your-api.example.com/alerts
HTTP Method: POST
Authorization scheme: Bearer
Bearer token: YOUR_TOKEN
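On the receiving side, Grafana POSTs a JSON body whose alerts array carries one entry per alert instance. A minimal sketch of a receiver for the webhook contact point above (the port, fields read, and handling logic are assumptions):

```python
# Minimal webhook receiver sketch: parse Grafana's alert payload and
# acknowledge with HTTP 200 so Grafana does not retry.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for alert in payload.get("alerts", []):
            # Each entry carries its own labels, annotations, and status
            print(alert.get("labels", {}).get("alertname"),
                  alert.get("status"))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # silence default per-request logging

# To run standalone (blocks forever), bind the port your webhook URL targets:
# HTTPServer(("", 8080), AlertHandler).serve_forever()
```

Responding quickly with a 2xx status is the important part; any slow downstream processing should happen asynchronously so notifications are not retried or dropped.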

Configuring Contact Points via API

# Create email contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
  -H "Content-Type: application/json" \
  -d '{
    "name": "email-ops",
    "type": "email",
    "settings": {
      "addresses": "[email protected]"
    }
  }'

# Create Slack contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
  -H "Content-Type: application/json" \
  -d '{
    "name": "slack-alerts",
    "type": "slack",
    "settings": {
      "url": "https://hooks.slack.com/services/YOUR/WEBHOOK",
      "channel": "#monitoring",
      "username": "Grafana"
    }
  }'

Notification Policies

Default Route Configuration

Navigate to Alerting > Notification policies

Default policy:
- Group by: alertname, cluster, service
- Group wait: 10s
- Group interval: 5m
- Repeat interval: 12h
- Contact point: default-email

Complex Routing Rules

default-policy:
  receiver: default
  group_by:
    - alertname
    - cluster
    - service
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h

  routes:
    - receiver: critical-team
      match:
        severity: critical
      group_wait: 0s
      repeat_interval: 5m

    - receiver: slack-alerts
      match:
        severity: warning
      group_wait: 30s
      repeat_interval: 4h

    - receiver: api-team
      match_re:
        service: "api-.*"
      group_wait: 1m

    - receiver: database-team
      match:
        team: database
      routes:
        - receiver: db-oncall
          match:
            severity: critical

Nested Routing Policy

routing:
  group_by: ['alertname', 'cluster']
  receiver: 'default-receiver'
  
  routes:
    # Production alerts
    - match:
        environment: 'production'
      receiver: 'prod-team'
      group_wait: 10s
      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m
        
        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'
    
    # Staging alerts
    - match:
        environment: 'staging'
      receiver: 'staging-team'
      group_wait: 5m

Alert Silencing

Creating a Silence

Navigate to Alerting > Silences > New Silence

Match alerts:
- Label: severity
  Operator: =
  Value: warning

Schedule:
Start: 2024-01-15 14:00 UTC
End: 2024-01-15 16:00 UTC

Creator: ops-team
Comment: Maintenance window

Creating a Silence via API

curl -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "HighMemoryUsage",
        "isRegex": false
      },
      {
        "name": "instance",
        "value": ".*prod.*",
        "isRegex": true
      }
    ],
    "startsAt": "2024-01-15T10:00:00Z",
    "endsAt": "2024-01-15T11:00:00Z",
    "createdBy": "automation",
    "comment": "Scheduled maintenance"
  }'

Querying Active Silences

curl http://localhost:9093/api/v1/silences | jq '.data'
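For automation it is handy to compute the maintenance window rather than hard-code timestamps. A sketch that builds the silence payload used above (the function name and duration parameter are assumptions; the field names follow the silence API shown earlier):

```python
# Build an Alertmanager silence payload for a maintenance window
# starting now and lasting a given number of hours.
import json
from datetime import datetime, timedelta, timezone

def maintenance_silence(duration_hours: int, created_by: str,
                        comment: str) -> str:
    now = datetime.now(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339, as the API expects
    return json.dumps({
        "matchers": [
            {"name": "alertname", "value": "HighMemoryUsage", "isRegex": False},
            {"name": "instance", "value": ".*prod.*", "isRegex": True},
        ],
        "startsAt": now.strftime(fmt),
        "endsAt": (now + timedelta(hours=duration_hours)).strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    })

print(maintenance_silence(1, "automation", "Scheduled maintenance"))
```

The output can be piped straight into the curl call above as the request body.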

Templates and Labels

Custom Alert Labels

- Alert name: Database Query Latency
  Annotations:
    summary: "Database latency alert"
    description: "Query latency is {{ $value }}ms on {{ $labels.instance }}"
    runbook: "https://wiki.example.com/db-latency"
    dashboard: "https://grafana.example.com/d/db-dashboards"
  Labels:
    severity: "{{ if gt $value 1000 }}critical{{ else }}warning{{ end }}"
    team: database
    env: production

Template Functions

Alert: {{ .Alerts.Firing | len }} firing alerts
Resolved: {{ .Alerts.Resolved | len }} resolved alerts

Instance: {{ $labels.instance }}
Value: {{ $value }}

Timestamp: {{ .Now.Format "2006-01-02 15:04:05" }}

Notification Template

Alert: {{ .GroupLabels.alertname }}
Severity: {{ .GroupLabels.severity }}

{{ range .Alerts.Firing -}}
Instance: {{ .Labels.instance }}
Value: {{ .Value }}
{{ .Annotations.description }}
{{ end }}
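Notification templates like this can also be provisioned from files alongside contact points and policies. A sketch of the templates section of an alerting provisioning file (the template name is a hypothetical; the body wraps the message above in a named define block so contact points can reference it):

```yaml
apiVersion: 1
templates:
  - orgId: 1
    name: firing_summary           # hypothetical template name
    template: |
      {{ define "firing_summary" }}
      Alert: {{ .GroupLabels.alertname }}
      {{ range .Alerts.Firing -}}
      Instance: {{ .Labels.instance }}
      {{ .Annotations.description }}
      {{ end }}
      {{ end }}
```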

Testing Alerts

Testing an Alert Rule

Navigate to Alerting > Alert rules > [Rule] > Test

Data: Select time range
Result: Shows evaluation result and alert instances

Send a Test Notification

# Via API - Simulate alert
curl -X POST http://admin:admin@localhost:3000/api/v1/alerts/test \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Test Alert",
    "state": "alerting",
    "message": "This is a test notification"
  }'

Manual Notification Test

Click "Test" on the contact point configuration

Advanced Features

Conditional Alerting

- Query A: first metric query
- Query B: second metric query
- Condition: alert when A > 100 AND B < 50, combined via a math expression

Grouped Alerts

Group by:
  - alertname
  - cluster
  - instance

Grouped alert notifications show all related alerts together.

Alert State Management

Alerting: Condition has been true for at least the "for" duration
Pending: Condition is true but has not yet held for the full "for" duration
Resolved: Alert returned to normal
NoData: No data available for evaluation
Error: Error evaluating alert rule

Dynamic Thresholds

Using percentile functions:
histogram_quantile(0.95, metric) > threshold

Using rolling averages:
avg_over_time(metric[24h]) + 2*stddev_over_time(metric[24h])
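A pure-Python stand-in for the rolling-average band (illustrative; the function name, window, and k=2 multiplier are assumptions mirroring the avg_over_time + 2*stddev_over_time expression above):

```python
# Flag a sample when it exceeds mean(window) + k * stddev(window),
# i.e. a dynamic threshold derived from recent history.
from statistics import mean, stdev

def breaches_dynamic_threshold(window: list, sample: float,
                               k: float = 2.0) -> bool:
    """True when sample sits above the mean + k standard deviations band."""
    return sample > mean(window) + k * stdev(window)

history = [100, 102, 98, 101, 99, 100, 103, 97]  # mean 100, stdev 2
print(breaches_dynamic_threshold(history, 120.0))  # True: well outside band
print(breaches_dynamic_threshold(history, 101.0))  # False: normal variation
```

The advantage over a fixed threshold is that the band widens for naturally noisy metrics and tightens for stable ones, reducing false positives.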

Troubleshooting

Verifying Alert Evaluation

# List provisioned alert rules
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules

# Check a specific rule
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules/{uid}

# View rule evaluation state and alert instances
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/rules

Debugging Notifications

# Check contact point configuration
curl http://admin:admin@localhost:3000/api/v1/provisioning/contact-points

# Verify notification policy
curl http://admin:admin@localhost:3000/api/v1/provisioning/policies

# Test contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points/{uid}/test

Common Issues

# Alert not firing
# 1. Check data source connectivity
# 2. Verify query returns data
# 3. Confirm threshold conditions
# 4. Check evaluation interval

# Notifications not received
# 1. Verify contact point configuration
# 2. Check notification policy routing
# 3. Review Grafana logs
# 4. Confirm external service connectivity

# View Grafana logs
docker logs grafana
journalctl -u grafana -f

Conclusion

Grafana's unified alerting provides comprehensive alert management capabilities. By following this guide, you've set up sophisticated alerting with flexible routing, multiple notification channels, and advanced silencing. Focus on designing clear alert rules with meaningful labels, setting thresholds based on service SLOs, and maintaining runbook links for on-call responders. Effective alerting is critical for operational reliability.