Grafana Alerting Rules Configuration
Grafana's unified alerting system provides powerful, flexible alert management directly within Grafana. It supports multiple data sources, complex evaluation logic, advanced routing, and rich notifications. This guide covers creating alert rules, configuring contact points, setting up notification policies, managing silences, and using alert templates for comprehensive alerting.
Table of Contents
- Introduction
- Alerting Architecture
- Alert Rules
- Contact Points
- Notification Policies
- Alert Silencing
- Templates and Labels
- Testing Alerts
- Advanced Features
- Troubleshooting
- Conclusion
Introduction
Grafana's unified alerting system consolidates alert management across multiple data sources. Unlike legacy alerts, unified alerting provides sophisticated routing, grouping, silencing, and integration with external systems, enabling mature alert management at scale.
Alerting Architecture
Alert Pipeline
Metrics/Logs Data Sources
↓
Alert Rules (Evaluation)
↓
Alert Instances Created
↓
├─ Silences (Suppression)
├─ Grouping
└─ Deduplication
↓
Routing Rules
↓
Contact Points
↓
Notifications
(Email, Slack, PagerDuty, etc.)
Alert Rules
Create Alert Rule
Navigate to Alerting > Alert rules > Create new alert rule
Basic CPU Usage Alert
Alert name: High CPU Usage
Condition:
- Query A: SELECT mean(usage_user) FROM cpu WHERE time > now() - 10m
- Condition: A is above 80
Evaluation behavior:
- For: 5m
- Every: 1m
Annotation:
summary: 'CPU usage is {{ printf "%.2f" $value }}%'
description: "Host {{ $labels.instance }} has high CPU usage"
Advanced Alert with Multiple Conditions
Alert name: API Error Rate Alert
Conditions:
- Query A: sum(rate(http_requests_total{status=~"5.."}[5m]))
- Query B: sum(rate(http_requests_total[5m]))
- Math expression: $A / $B > 0.05  # 5% error rate
Evaluation: Every 1m for 5m
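The math expression above is ordinary arithmetic on two query results. A minimal sketch of the evaluation (the function name and sample rates are hypothetical, not real query output):

```python
def error_rate_firing(error_rate: float, total_rate: float, threshold: float = 0.05) -> bool:
    """Mimic the Math expression $A / $B > threshold.

    Guards against division by zero: with no traffic there is no error rate.
    """
    if total_rate == 0:
        return False
    return error_rate / total_rate > threshold

# Hypothetical rates (requests/second) from queries A and B
print(error_rate_firing(6.0, 100.0))  # 6% errors -> True
print(error_rate_firing(2.0, 100.0))  # 2% errors -> False
```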
Alert Rule JSON
{
  "uid": "alert-rule-1",
  "title": "High Memory Usage",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "datasourceUid": "prometheus-uid",
      "relativeTimeRange": { "from": 600, "to": 0 },
      "model": {
        "refId": "A",
        "expr": "100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
      }
    },
    {
      "refId": "B",
      "datasourceUid": "__expr__",
      "model": {
        "refId": "B",
        "type": "reduce",
        "expression": "A",
        "reducer": "last"
      }
    },
    {
      "refId": "C",
      "datasourceUid": "__expr__",
      "model": {
        "refId": "C",
        "type": "math",
        "expression": "$B > 85"
      }
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "summary": "High memory usage on {{ $labels.instance }}"
  },
  "labels": {
    "severity": "warning"
  }
}
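Before provisioning a payload like the one above, a quick structural check catches the most common mistakes, such as a condition that points at a missing refId. This is a hypothetical helper, not part of any Grafana tooling; the required fields follow the example rather than the full schema:

```python
import json

def check_rule(payload: str) -> list:
    """Return a list of structural problems in an alert-rule JSON payload."""
    rule = json.loads(payload)
    problems = []
    for field in ("uid", "title", "condition", "data"):
        if field not in rule:
            problems.append(f"missing field: {field}")
    ref_ids = {item.get("refId") for item in rule.get("data", [])}
    if rule.get("condition") not in ref_ids:
        problems.append("condition does not reference a refId in data")
    return problems

rule_json = '{"uid": "r1", "title": "t", "condition": "C", "data": [{"refId": "A"}]}'
print(check_rule(rule_json))  # ['condition does not reference a refId in data']
```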
Contact Points
Email Contact Point
Alerting > Contact Points > New Contact Point
Name: email-ops
Type: Email
Email address: [email protected]
Disable resolve message: false
Slack Contact Point
Name: slack-alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Channel: #monitoring
Username: Grafana Alerts
Icon emoji: :bell:
Mention Users: @devops-team
PagerDuty Contact Point
Name: pagerduty-oncall
Type: PagerDuty
Integration Key: YOUR_INTEGRATION_KEY
Client: Grafana
Severity: critical
Webhook Contact Point
Name: custom-webhook
Type: Webhook
URL: https://your-api.example.com/alerts
HTTP Method: POST
Authorization scheme: Bearer
Bearer token: YOUR_TOKEN
Configure Contact Points via API
# Create email contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "email-ops",
"type": "email",
"settings": {
"addresses": "[email protected]"
}
}'
# Create Slack contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "slack-alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK",
"channel": "#monitoring",
"username": "Grafana"
}
}'
Notification Policies
Default Route Configuration
Navigate to Alerting > Notification policies
Default policy:
- Group by: alertname, cluster, service
- Group wait: 10s
- Group interval: 10s
- Repeat interval: 12h
- Contact point: default-email
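These three timers interact: group wait delays the first notification for a new group, group interval batches subsequent changes into the group, and repeat interval re-sends an unchanged firing group. For a group that fires once and stays firing, the schedule reduces to the sketch below (hypothetical helper, times in seconds):

```python
def first_notifications(group_wait: int, repeat_interval: int, horizon: int) -> list:
    """Notification times for a group that starts firing at t=0 and never
    changes: first after group_wait, then every repeat_interval."""
    times, t = [], group_wait
    while t <= horizon:
        times.append(t)
        t += repeat_interval
    return times

# Group wait 10s, repeat interval 12h, observed over 24h
print(first_notifications(10, 12 * 3600, 24 * 3600))  # [10, 43210]
```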
Complex Routing Rules
route:
  receiver: default
  group_by:
    - alertname
    - cluster
    - service
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - receiver: critical-team
      match:
        severity: critical
      group_wait: 0s
      repeat_interval: 5m
    - receiver: slack-alerts
      match:
        severity: warning
      group_wait: 30s
      repeat_interval: 4h
    - receiver: api-team
      match_re:
        service: "api-.*"
      group_wait: 1m
    - receiver: database-team
      match:
        team: database
      routes:
        - receiver: db-oncall
          match:
            severity: critical
Nested Routing Policy
route:
  group_by: ['alertname', 'cluster']
  receiver: 'default-receiver'
  routes:
    # Production alerts
    - match:
        environment: 'production'
      receiver: 'prod-team'
      group_wait: 10s
      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m
        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'
    # Staging alerts
    - match:
        environment: 'staging'
      receiver: 'staging-team'
      group_wait: 5m
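Both examples follow the same semantics: an alert walks the tree, descends into the first child route whose matchers all hold, and otherwise is delivered to the current route's receiver. A sketch of that walk (hypothetical helper; `match` is exact equality, `match_re` a full regex match):

```python
import re

def route_alert(labels: dict, route: dict) -> str:
    """Return the receiver for an alert's labels under a routing tree."""
    for child in route.get("routes", []):
        exact = all(labels.get(k) == v for k, v in child.get("match", {}).items())
        regex = all(re.fullmatch(v, labels.get(k, ""))
                    for k, v in child.get("match_re", {}).items())
        if exact and regex:
            return route_alert(labels, child)  # first match wins, recurse deeper
    return route["receiver"]  # no child matched: deliver here

tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"environment": "production"}, "receiver": "prod-team",
         "routes": [
             {"match": {"severity": "critical"}, "receiver": "oncall"},
             {"match": {"severity": "warning"}, "receiver": "prod-slack"},
         ]},
        {"match_re": {"service": "api-.*"}, "receiver": "api-team"},
        {"match": {"environment": "staging"}, "receiver": "staging-team"},
    ],
}
print(route_alert({"environment": "production", "severity": "critical"}, tree))  # oncall
print(route_alert({"service": "api-payments"}, tree))                            # api-team
print(route_alert({"environment": "dev"}, tree))                                 # default-receiver
```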
Alert Silencing
Create Silence
Navigate to Alerting > Silences > New Silence
Match alerts:
- Label: severity
Operator: =
Value: warning
Schedule:
Start: 2024-01-15 14:00 UTC
End: 2024-01-15 16:00 UTC
Creator: ops-team
Comment: Maintenance window
Create Silence via API
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighMemoryUsage",
"isRegex": false
},
{
"name": "instance",
"value": ".*prod.*",
"isRegex": true
}
],
"startsAt": "2024-01-15T10:00:00Z",
"endsAt": "2024-01-15T11:00:00Z",
"createdBy": "automation",
"comment": "Scheduled maintenance"
}'
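A silence suppresses an alert only when every matcher holds against the alert's labels, with `isRegex` switching between exact and regex comparison. A sketch of that check (hypothetical helper, mirroring the payload above):

```python
import re

def silenced(alert_labels: dict, matchers: list) -> bool:
    """An alert is silenced only if every matcher in the silence matches."""
    for m in matchers:
        value = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

matchers = [
    {"name": "alertname", "value": "HighMemoryUsage", "isRegex": False},
    {"name": "instance", "value": ".*prod.*", "isRegex": True},
]
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-prod-1"}, matchers))     # True
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-staging-1"}, matchers))  # False
```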
Query Active Silences
curl http://localhost:9093/api/v2/silences | jq '.'
Templates and Labels
Custom Alert Labels
- Alert name: Database Query Latency
Annotations:
summary: "Database latency alert"
description: "Query latency is {{ $value }}ms on {{ $labels.instance }}"
runbook: "https://wiki.example.com/db-latency"
dashboard: "https://grafana.example.com/d/db-dashboards"
Labels:
severity: "{{ if gt $values.A.Value 1000.0 }}critical{{ else }}warning{{ end }}"
team: database
env: production
Template Functions
In alert rule annotations and labels:
Instance: {{ $labels.instance }}
Value: {{ $value }}
In notification templates:
Firing: {{ .Alerts.Firing | len }} alerts
Resolved: {{ .Alerts.Resolved | len }} alerts
Notification Template
Alert: {{ .GroupLabels.alertname }}
Severity: {{ .GroupLabels.severity }}
{{ range .Alerts.Firing -}}
Instance: {{ .Labels.instance }}
Value: {{ .ValueString }}
{{ .Annotations.description }}
{{ end }}
Testing Alerts
Test Alert Rule
Navigate to Alerting > Alert rules > [Rule] > Test
Data: Select time range
Result: Shows evaluation result and alert instances
Send Test Notification
# Via API - inject a test alert into an external Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "This is a test notification"}
  }]'
Manual Notification Test
Click "Test" on contact point configuration
Advanced Features
Conditional Alerting
Combine multiple queries with a Math expression:
- Query A: first metric query
- Query B: second metric query
- Math expression: $A > 100 && $B < 50
Grouped Alerts
Group by:
- alertname
- cluster
- instance
Grouped alert notifications show all related alerts together
Alert State Management
Normal: Condition is false
Pending: Condition is true, but has not yet held for the full "for" duration
Alerting: Condition has held for the full "for" duration
NoData: The query returned no data
Error: The rule failed to evaluate
A firing alert that returns to Normal triggers a resolved notification.
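The Normal/Pending/Alerting transitions depend on how many consecutive evaluations the condition has held, i.e. the "for" duration divided by the evaluation interval. A sketch of that state machine (hypothetical helper):

```python
def rule_states(breaches: list, for_intervals: int) -> list:
    """Given per-evaluation condition results, return the rule state after
    each evaluation. The rule goes Pending on the first breach and Alerting
    once the condition has held for `for_intervals` consecutive evaluations."""
    states, streak = [], 0
    for breached in breaches:
        streak = streak + 1 if breached else 0
        if streak == 0:
            states.append("Normal")
        elif streak >= for_intervals:
            states.append("Alerting")
        else:
            states.append("Pending")
    return states

# "For: 5m" with "Every: 1m" -> 5 consecutive breaching evaluations
print(rule_states([True, True, True, True, True, False], 5))
# ['Pending', 'Pending', 'Pending', 'Pending', 'Alerting', 'Normal']
```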
Dynamic Thresholds
Using percentiles (histogram buckets):
histogram_quantile(0.95, sum(rate(metric_bucket[5m])) by (le)) > threshold
Using rolling averages:
metric > avg_over_time(metric[24h]) + 2 * stddev_over_time(metric[24h])
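The rolling-average expression sets the threshold at the mean plus two standard deviations of the lookback window (stddev_over_time is a population standard deviation). The same computation over a hypothetical sample window:

```python
import statistics

def dynamic_threshold(window: list) -> float:
    """Threshold = mean + 2 * stddev over the lookback window, mirroring
    avg_over_time(metric[24h]) + 2 * stddev_over_time(metric[24h])."""
    return statistics.mean(window) + 2 * statistics.pstdev(window)

window = [100, 102, 98, 101, 99]
threshold = dynamic_threshold(window)
print(round(threshold, 2))   # 102.83
print(120 > threshold)       # a spike well above the band fires -> True
```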
Troubleshooting
Check Alert Evaluation
# List alert rules (provisioning API)
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules
# Check a specific rule
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules/{uid}
# View rule state and alert instances (Prometheus-compatible API)
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/rules
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/alerts
Debug Notifications
# Check contact point configuration
curl http://admin:admin@localhost:3000/api/v1/provisioning/contact-points
# Verify notification policy
curl http://admin:admin@localhost:3000/api/v1/provisioning/policies
# Test a contact point from the UI: Alerting > Contact points > Test
Common Issues
# Alert not firing
# 1. Check data source connectivity
# 2. Verify query returns data
# 3. Confirm threshold conditions
# 4. Check evaluation interval
# Notifications not received
# 1. Verify contact point configuration
# 2. Check notification policy routing
# 3. Review Grafana logs
# 4. Confirm external service connectivity
# View Grafana logs
docker logs grafana
journalctl -u grafana -f
Conclusion
Grafana's unified alerting provides comprehensive alert management capabilities. By following this guide, you've set up sophisticated alerting with flexible routing, multiple notification channels, and advanced silencing. Focus on designing clear alert rules with meaningful labels, setting thresholds based on service SLOs, and maintaining runbook links for on-call responders. Effective alerting is critical for operational reliability.