Grafana Alerting Rules Configuration
Grafana's unified alerting system provides powerful, flexible alert management directly within Grafana. It supports multiple data sources, complex evaluation logic, advanced routing, and rich notifications. This guide covers creating alert rules, configuring contact points, setting up notification policies, managing silences, and using alert templates for comprehensive alerting.
Table of Contents
- Introduction
- Alerting Architecture
- Alert Rules
- Contact Points
- Notification Policies
- Alert Silencing
- Templates and Labels
- Testing Alerts
- Advanced Features
- Troubleshooting
- Conclusion
Introduction
Grafana's unified alerting system consolidates alert management across multiple data sources. Unlike legacy alerts, unified alerting provides sophisticated routing, grouping, silencing, and integration with external systems, enabling mature alert management at scale.
Alerting Architecture
Alert Pipeline
Metrics/Logs Data Sources
↓
Alert Rules (Evaluation)
↓
Alert Instances Created
↓
├─ Silences (Suppression)
├─ Grouping
└─ Deduplication
↓
Routing Rules
↓
Contact Points
↓
Notifications
(Email, Slack, PagerDuty, etc.)
Alert Rules
Create Alert Rule
Navigate to Alerting > Alert rules > Create new alert rule
Basic CPU Usage Alert
Alert name: High CPU Usage
Condition:
- Query A: SELECT mean(usage_user) FROM cpu WHERE time > now() - 10m
- Condition: A is above 80
Evaluation behavior:
- For: 5m
- Every: 1m
Annotation:
summary: 'CPU usage is {{ printf "%.2f" $value }}%'
description: "Host {{ $labels.instance }} has high CPU usage"
Advanced Alert with Multiple Conditions
Alert name: API Error Rate Alert
Conditions:
- Query A: sum(rate(http_requests_total{status=~"5.."}[5m]))
- Query B: sum(rate(http_requests_total[5m]))
- Math expression: $A / $B > 0.05  # 5% error rate
Evaluation: Every 1m for 5m
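The math expression above is ordinary arithmetic on two query results. A minimal sketch of the evaluation (the function name and sample rates are hypothetical, not real query output):

```python
def error_rate_firing(error_rate: float, total_rate: float, threshold: float = 0.05) -> bool:
    """Mimic the Math expression $A / $B > threshold.

    Guards against division by zero: with no traffic there is no error rate.
    """
    if total_rate == 0:
        return False
    return error_rate / total_rate > threshold

# Hypothetical rates (requests/second) from queries A and B
print(error_rate_firing(6.0, 100.0))  # 6% errors -> True
print(error_rate_firing(2.0, 100.0))  # 2% errors -> False
```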
Alert Rule JSON
{
  "uid": "alert-rule-1",
  "title": "High Memory Usage",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "datasourceUid": "prometheus-uid",
      "relativeTimeRange": { "from": 600, "to": 0 },
      "model": {
        "refId": "A",
        "expr": "100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
      }
    },
    {
      "refId": "B",
      "datasourceUid": "__expr__",
      "model": {
        "refId": "B",
        "type": "reduce",
        "expression": "A",
        "reducer": "last"
      }
    },
    {
      "refId": "C",
      "datasourceUid": "__expr__",
      "model": {
        "refId": "C",
        "type": "math",
        "expression": "$B > 85"
      }
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "summary": "High memory usage on {{ $labels.instance }}"
  },
  "labels": {
    "severity": "warning"
  }
}
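Before provisioning a payload like the one above, a quick structural check catches the most common mistakes, such as a condition that points at a missing refId. This is a hypothetical helper, not part of any Grafana tooling; the required fields follow the example rather than the full schema:

```python
import json

def check_rule(payload: str) -> list:
    """Return a list of structural problems in an alert-rule JSON payload."""
    rule = json.loads(payload)
    problems = []
    for field in ("uid", "title", "condition", "data"):
        if field not in rule:
            problems.append(f"missing field: {field}")
    ref_ids = {item.get("refId") for item in rule.get("data", [])}
    if rule.get("condition") not in ref_ids:
        problems.append("condition does not reference a refId in data")
    return problems

rule_json = '{"uid": "r1", "title": "t", "condition": "C", "data": [{"refId": "A"}]}'
print(check_rule(rule_json))  # ['condition does not reference a refId in data']
```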
Contact Points
Email Contact Point
Alerting > Contact Points > New Contact Point
Name: email-ops
Type: Email
Email address: [email protected]
Disable resolve message: false
Slack Contact Point
Name: slack-alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Channel: #monitoring
Username: Grafana Alerts
Icon emoji: :bell:
Mention Users: @devops-team
PagerDuty Contact Point
Name: pagerduty-oncall
Type: PagerDuty
Integration Key: YOUR_INTEGRATION_KEY
Client: Grafana
Severity: critical
Webhook Contact Point
Name: custom-webhook
Type: Webhook
URL: https://your-api.example.com/alerts
HTTP Method: POST
Authorization scheme: Bearer
Bearer token: YOUR_TOKEN
Configure Contact Points via API
# Create email contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "email-ops",
"type": "email",
"settings": {
"addresses": "[email protected]"
}
}'
# Create Slack contact point
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/contact-points \
-H "Content-Type: application/json" \
-d '{
"name": "slack-alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK",
"channel": "#monitoring",
"username": "Grafana"
}
}'
Notification Policies
Default Route Configuration
Navigate to Alerting > Notification policies
Default policy:
- Group by: alertname, cluster, service
- Group wait: 10s
- Group interval: 10s
- Repeat interval: 12h
- Contact point: default-email
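These three timers interact: group wait delays the first notification for a new group, group interval batches subsequent changes into the group, and repeat interval re-sends an unchanged firing group. For a group that fires once and stays firing, the schedule reduces to the sketch below (hypothetical helper, times in seconds):

```python
def first_notifications(group_wait: int, repeat_interval: int, horizon: int) -> list:
    """Notification times for a group that starts firing at t=0 and never
    changes: first after group_wait, then every repeat_interval."""
    times, t = [], group_wait
    while t <= horizon:
        times.append(t)
        t += repeat_interval
    return times

# Group wait 10s, repeat interval 12h, observed over 24h
print(first_notifications(10, 12 * 3600, 24 * 3600))  # [10, 43210]
```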
Complex Routing Rules
route:
  receiver: default
  group_by:
    - alertname
    - cluster
    - service
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - receiver: critical-team
      match:
        severity: critical
      group_wait: 0s
      repeat_interval: 5m
    - receiver: slack-alerts
      match:
        severity: warning
      group_wait: 30s
      repeat_interval: 4h
    - receiver: api-team
      match_re:
        service: "api-.*"
      group_wait: 1m
    - receiver: database-team
      match:
        team: database
      routes:
        - receiver: db-oncall
          match:
            severity: critical
Nested Routing Policy
route:
  group_by: ['alertname', 'cluster']
  receiver: 'default-receiver'
  routes:
    # Production alerts
    - match:
        environment: 'production'
      receiver: 'prod-team'
      group_wait: 10s
      routes:
        # Critical production alerts
        - match:
            severity: 'critical'
          receiver: 'oncall'
          group_wait: 0s
          repeat_interval: 5m
        # Warning production alerts
        - match:
            severity: 'warning'
          receiver: 'prod-slack'
    # Staging alerts
    - match:
        environment: 'staging'
      receiver: 'staging-team'
      group_wait: 5m
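Both examples follow the same semantics: an alert walks the tree, descends into the first child route whose matchers all hold, and otherwise is delivered to the current route's receiver. A sketch of that walk (hypothetical helper; `match` is exact equality, `match_re` a full regex match):

```python
import re

def route_alert(labels: dict, route: dict) -> str:
    """Return the receiver for an alert's labels under a routing tree."""
    for child in route.get("routes", []):
        exact = all(labels.get(k) == v for k, v in child.get("match", {}).items())
        regex = all(re.fullmatch(v, labels.get(k, ""))
                    for k, v in child.get("match_re", {}).items())
        if exact and regex:
            return route_alert(labels, child)  # first match wins, recurse deeper
    return route["receiver"]  # no child matched: deliver here

tree = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"environment": "production"}, "receiver": "prod-team",
         "routes": [
             {"match": {"severity": "critical"}, "receiver": "oncall"},
             {"match": {"severity": "warning"}, "receiver": "prod-slack"},
         ]},
        {"match_re": {"service": "api-.*"}, "receiver": "api-team"},
        {"match": {"environment": "staging"}, "receiver": "staging-team"},
    ],
}
print(route_alert({"environment": "production", "severity": "critical"}, tree))  # oncall
print(route_alert({"service": "api-payments"}, tree))                            # api-team
print(route_alert({"environment": "dev"}, tree))                                 # default-receiver
```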
Alert Silencing
Create Silence
Navigate to Alerting > Silences > New Silence
Match alerts:
- Label: severity
Operator: =
Value: warning
Schedule:
Start: 2024-01-15 14:00 UTC
End: 2024-01-15 16:00 UTC
Creator: ops-team
Comment: Maintenance window
Create Silence via API
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighMemoryUsage",
"isRegex": false
},
{
"name": "instance",
"value": ".*prod.*",
"isRegex": true
}
],
"startsAt": "2024-01-15T10:00:00Z",
"endsAt": "2024-01-15T11:00:00Z",
"createdBy": "automation",
"comment": "Scheduled maintenance"
}'
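A silence suppresses an alert only when every matcher holds against the alert's labels, with `isRegex` switching between exact and regex comparison. A sketch of that check (hypothetical helper, mirroring the payload above):

```python
import re

def silenced(alert_labels: dict, matchers: list) -> bool:
    """An alert is silenced only if every matcher in the silence matches."""
    for m in matchers:
        value = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

matchers = [
    {"name": "alertname", "value": "HighMemoryUsage", "isRegex": False},
    {"name": "instance", "value": ".*prod.*", "isRegex": True},
]
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-prod-1"}, matchers))     # True
print(silenced({"alertname": "HighMemoryUsage", "instance": "db-staging-1"}, matchers))  # False
```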
Query Active Silences
curl http://localhost:9093/api/v2/silences | jq '.'
Templates and Labels
Custom Alert Labels
- Alert name: Database Query Latency
Annotations:
summary: "Database latency alert"
description: "Query latency is {{ $value }}ms on {{ $labels.instance }}"
runbook: "https://wiki.example.com/db-latency"
dashboard: "https://grafana.example.com/d/db-dashboards"
Labels:
severity: "{{ if gt $values.A.Value 1000.0 }}critical{{ else }}warning{{ end }}"
team: database
env: production
Template Functions
In alert rule annotations and labels:
Instance: {{ $labels.instance }}
Value: {{ $value }}
In notification templates:
Firing: {{ .Alerts.Firing | len }} alerts
Resolved: {{ .Alerts.Resolved | len }} alerts
Notification Template
Alert: {{ .GroupLabels.alertname }}
Severity: {{ .GroupLabels.severity }}
{{ range .Alerts.Firing -}}
Instance: {{ .Labels.instance }}
Value: {{ .ValueString }}
{{ .Annotations.description }}
{{ end }}
Testing Alerts
Test Alert Rule
Navigate to Alerting > Alert rules > [Rule] > Test
Data: Select time range
Result: Shows evaluation result and alert instances
Send Test Notification
# Via API - inject a test alert into an external Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "This is a test notification"}
  }]'
Manual Notification Test
Click "Test" on contact point configuration
Advanced Features
Conditional Alerting
Combine multiple queries with a Math expression:
- Query A: first metric query
- Query B: second metric query
- Math expression: $A > 100 && $B < 50
Grouped Alerts
Group by:
- alertname
- cluster
- instance
Grouped alert notifications show all related alerts together
Alert State Management
Normal: Condition is false
Pending: Condition is true, but has not yet held for the full "for" duration
Alerting: Condition has held for the full "for" duration
NoData: The query returned no data
Error: The rule failed to evaluate
A firing alert that returns to Normal triggers a resolved notification.
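The Normal/Pending/Alerting transitions depend on how many consecutive evaluations the condition has held, i.e. the "for" duration divided by the evaluation interval. A sketch of that state machine (hypothetical helper):

```python
def rule_states(breaches: list, for_intervals: int) -> list:
    """Given per-evaluation condition results, return the rule state after
    each evaluation. The rule goes Pending on the first breach and Alerting
    once the condition has held for `for_intervals` consecutive evaluations."""
    states, streak = [], 0
    for breached in breaches:
        streak = streak + 1 if breached else 0
        if streak == 0:
            states.append("Normal")
        elif streak >= for_intervals:
            states.append("Alerting")
        else:
            states.append("Pending")
    return states

# "For: 5m" with "Every: 1m" -> 5 consecutive breaching evaluations
print(rule_states([True, True, True, True, True, False], 5))
# ['Pending', 'Pending', 'Pending', 'Pending', 'Alerting', 'Normal']
```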
Dynamic Thresholds
Using percentiles (histogram buckets):
histogram_quantile(0.95, sum(rate(metric_bucket[5m])) by (le)) > threshold
Using rolling averages:
metric > avg_over_time(metric[24h]) + 2 * stddev_over_time(metric[24h])
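The rolling-average expression sets the threshold at the mean plus two standard deviations of the lookback window (stddev_over_time is a population standard deviation). The same computation over a hypothetical sample window:

```python
import statistics

def dynamic_threshold(window: list) -> float:
    """Threshold = mean + 2 * stddev over the lookback window, mirroring
    avg_over_time(metric[24h]) + 2 * stddev_over_time(metric[24h])."""
    return statistics.mean(window) + 2 * statistics.pstdev(window)

window = [100, 102, 98, 101, 99]
threshold = dynamic_threshold(window)
print(round(threshold, 2))   # 102.83
print(120 > threshold)       # a spike well above the band fires -> True
```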
Troubleshooting
Check Alert Evaluation
# List alert rules (provisioning API)
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules
# Check a specific rule
curl http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules/{uid}
# View rule state and alert instances (Prometheus-compatible API)
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/rules
curl http://admin:admin@localhost:3000/api/prometheus/grafana/api/v1/alerts
Debug Notifications
# Check contact point configuration
curl http://admin:admin@localhost:3000/api/v1/provisioning/contact-points
# Verify notification policy
curl http://admin:admin@localhost:3000/api/v1/provisioning/policies
# Test a contact point from the UI: Alerting > Contact points > Test
Common Issues
# Alert not firing
# 1. Check data source connectivity
# 2. Verify query returns data
# 3. Confirm threshold conditions
# 4. Check evaluation interval
# Notifications not received
# 1. Verify contact point configuration
# 2. Check notification policy routing
# 3. Review Grafana logs
# 4. Confirm external service connectivity
# View Grafana logs
docker logs grafana
journalctl -u grafana -f
Conclusion
Grafana's unified alerting provides comprehensive alert management capabilities. By following this guide, you've set up sophisticated alerting with flexible routing, multiple notification channels, and advanced silencing. Focus on designing clear alert rules with meaningful labels, setting thresholds based on service SLOs, and maintaining runbook links for on-call responders. Effective alerting is critical for operational reliability.