Prometheus and Grafana Complete Monitoring Stack
A monitoring stack built on Prometheus and Grafana covers infrastructure observability from metric collection through alerting and visualization. This guide walks through Prometheus, Node Exporter for system metrics, Alertmanager for alert routing, and Grafana for dashboards. By the end, you'll have a working monitoring platform that watches multiple servers.
Table of Contents
- Overview
- Architecture
- Prerequisites
- Installing Prometheus
- Installing Node Exporter
- Installing Alertmanager
- Installing Grafana
- Configuring the Stack
- Creating Monitoring Dashboards
- Setting Up Alerts
- Scaling the Stack
- Troubleshooting
- Conclusion
Overview
A complete monitoring stack consists of four key components working together:
- Prometheus: Time-series database collecting metrics
- Node Exporter: Agent exposing system metrics
- Alertmanager: Routes and deduplicates alerts
- Grafana: Visualizes metrics and manages dashboards
This architecture enables monitoring of servers, applications, and services across multiple environments.
Architecture
Component Relationships
┌─────────────────────────────────────────────────────────┐
│                    Monitoring Stack                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│     ┌─────────────┐     ┌─────────────┐                 │
│     │Node Exporter│     │Node Exporter│                 │
│     │  Server 1   │     │  Server 2   │                 │
│     └──────┬──────┘     └──────┬──────┘                 │
│            │                   │                        │
│            └─────────┬─────────┘                        │
│                      │ Scrapes                          │
│               ┌──────▼──────┐                           │
│               │ Prometheus  │                           │
│               │    9090     │                           │
│               └──────┬──────┘                           │
│                      │                                  │
│          ┌───────────┼───────────┐                      │
│          │           │           │                      │
│     ┌────▼──┐   ┌────▼──┐   ┌────▼──┐                   │
│     │Grafana│   │Alert  │   │Custom │                   │
│     │3000   │   │Manager│   │Systems│                   │
│     └───────┘   │9093   │   └───────┘                   │
│                 └───────┘                               │
│                                                         │
└─────────────────────────────────────────────────────────┘
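The ports in the diagram can be captured in a small lookup helper, which is convenient when scripting health checks later on. This is only a sketch: the function name `component_port` is hypothetical, and port 9100 for Node Exporter is taken from the scrape configuration used later in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a stack component to the port used in this guide.
component_port() {
  case "$1" in
    prometheus)    echo 9090 ;;
    node_exporter) echo 9100 ;;
    alertmanager)  echo 9093 ;;
    grafana)       echo 3000 ;;
    *)             echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Print the URL each component is expected to listen on (localhost assumed).
for c in prometheus node_exporter alertmanager grafana; do
  echo "$c -> http://localhost:$(component_port "$c")"
done
```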
Prerequisites
System requirements for the complete stack:
- Two servers minimum (one to run the monitoring stack, one to be monitored)
- 2GB RAM on monitoring server
- 10GB storage for Prometheus TSDB
- Ubuntu 20.04+ or CentOS 8+
- Root or sudo access
- Network connectivity between servers
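The RAM and storage requirements above could be checked with a short preflight script before installing anything. The sketch below is an assumption-laden example: the function names `meets_minimum` and `preflight` are hypothetical, `/var/lib` is assumed to be the filesystem that will hold the Prometheus TSDB, and the RAM threshold is set slightly under 2048 MiB because the kernel reports a little less than the nominal memory size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a measured integer meets a required minimum.
meets_minimum() {
  local have="$1" need="$2"
  [ "$have" -ge "$need" ]
}

# Compare total RAM and free space under /var/lib against the listed minimums.
# Values are in MiB; thresholds are approximations of 2 GB RAM and 10 GB disk.
preflight() {
  local ram_mib disk_mib
  ram_mib=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
  disk_mib=$(df -Pk /var/lib | awk 'NR == 2 {print int($4 / 1024)}')
  if meets_minimum "$ram_mib" 1900; then echo "RAM ok: ${ram_mib} MiB"; else echo "RAM low: ${ram_mib} MiB"; fi
  if meets_minimum "$disk_mib" 10240; then echo "Disk ok: ${disk_mib} MiB free"; else echo "Disk low: ${disk_mib} MiB free"; fi
}
```

Running `preflight` on the monitoring server prints one line per check; the function is defined but not called here so the sketch has no side effects.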
Installing Prometheus
Step 1: Prepare System
# Update system
sudo apt-get update && sudo apt-get upgrade -y
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Step 2: Download and Install
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.50.0.linux-amd64.tar.gz
cd prometheus-2.50.0.linux-amd64
# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
# Copy configuration templates
sudo cp prometheus.yml /etc/prometheus/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries
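Before copying downloaded binaries into /usr/local/bin, it is worth confirming that the tarball is intact. The sketch below assumes the release page also offers a `sha256sums.txt` file and that it was downloaded next to the tarball; the helper name `verify_sha256` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-256 digest with an expected hex string.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Usage sketch (commented out; assumes sha256sums.txt sits in the current directory):
# expected=$(awk '$2 == "prometheus-2.50.0.linux-amd64.tar.gz" {print $1}' sha256sums.txt)
# verify_sha256 prometheus-2.50.0.linux-amd64.tar.gz "$expected" && echo "checksum ok"
```

The same helper applies unchanged to the Node Exporter and Alertmanager tarballs used later in this guide.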
Step 3: Create Prometheus Configuration
sudo tee /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-prod'
    environment: 'production'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
rule_files:
  - '/etc/prometheus/alert_rules.yml'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          group: 'local'
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          group: 'remote'
EOF
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
Step 4: Create Systemd Service
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
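After `systemctl start`, Prometheus may need a few seconds to replay its write-ahead log before it answers queries. A small retry loop makes follow-up scripts more robust. The helper name `wait_until` is hypothetical; the commented example polls Prometheus's `/-/ready` endpoint and assumes the default address from the unit file above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (commented out so the sketch makes no network calls):
# wait_until 30 1 curl -sf http://localhost:9090/-/ready && echo "Prometheus is ready"
```

The same function can wrap checks for the other services in this guide, since it only depends on the exit status of the command it is given.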
Installing Node Exporter
On Each Server to Monitor
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xvzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create Systemd Service for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netdev.device-exclude=^(veth.*|br.*|docker.*|virbr.*)$$ \
  --web.listen-address=:9100
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Verify Node Exporter Metrics
curl -s http://localhost:9100/metrics | head -30
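Rather than reading the first 30 lines by eye, a single value can be pulled out of the exposition text. The sketch below is a minimal example: `metric_value` is a hypothetical helper that handles only metrics without labels, such as `node_load1`, and skips the `# HELP` and `# TYPE` comment lines because their first field is `#`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of an unlabeled metric read from stdin.
# Input is Prometheus text exposition format: "<name> <value>" per sample line.
metric_value() {
  awk -v name="$1" '$1 == name {print $2; exit}'
}

# Example (commented out; requires a running Node Exporter):
# curl -s http://localhost:9100/metrics | metric_value node_load1
```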
Installing Alertmanager
Step 1: Download and Install
# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Install
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
Step 2: Configure Alertmanager
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m
  # Placeholder: replace with your real Slack webhook URL (must be a valid http(s) URL)
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
      group_wait: 0s
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: 'slack'
      group_wait: 30s
receivers:
  - name: 'default-receiver'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        client: 'Prometheus'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
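To reason about where an alert ends up, the routing tree above can be mirrored in a few lines of shell. This is a deliberately simplified model, not Alertmanager's own evaluation: the function name `receiver_for_severity` is hypothetical, and grouping, timing, and inhibition are ignored. The `continue: true` flag on the critical route makes Alertmanager keep checking sibling routes, but since the warning route does not match a critical alert, the outcome is the same as shown here.

```shell
#!/usr/bin/env bash
# Hypothetical model of the routing tree: pick a receiver from the severity label.
receiver_for_severity() {
  case "$1" in
    critical) echo "pagerduty" ;;
    warning)  echo "slack" ;;
    *)        echo "default-receiver" ;;   # no child route matched
  esac
}

for sev in critical warning info; do
  echo "severity=$sev -> $(receiver_for_severity "$sev")"
done
```

Note that `default-receiver` has no notification settings in the configuration above, so alerts without a `critical` or `warning` severity are accepted but not sent anywhere.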
Step 3: Create Systemd Service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
After=network.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
Installing Grafana
Step 1: Add Repository and Install
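sudo apt-get install -y apt-transport-https software-properties-common wget gnupg
# Import the Grafana signing key first (apt-key is deprecated on current Ubuntu releases)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
# The package is named "grafana"; it provides the grafana-server service
sudo apt-get install -y grafana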
Step 2: Configure Grafana
sudo tee /etc/grafana/grafana.ini > /dev/null << 'EOF'
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000
[database]
type = sqlite3
path = /var/lib/grafana/grafana.db
[security]
admin_user = admin
admin_password = admin
secret_key = replace_with_secure_key_min_16_chars
cookie_secure = false
[log]
mode = file
level = info
[paths]
logs = /var/log/grafana
EOF
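The `secret_key` placeholder in the configuration above should be replaced with a random value before Grafana is exposed to other users. One way to generate such a value is sketched below; the helper name `gen_secret` is hypothetical, and the output is hexadecimal, so 16 random bytes yield a 32-character key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print 2*N hex characters built from N random bytes.
gen_secret() {
  local bytes="${1:-16}"
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Example (commented out; edits the file written in the step above):
# sudo sed -i "s/^secret_key = .*/secret_key = $(gen_secret 16)/" /etc/grafana/grafana.ini
```

Hex output is used here because it contains no characters that need escaping in `sed` or in the INI file. The `admin_password` value deserves the same treatment.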
Step 3: Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
Configuring the Stack
Step 1: Add Prometheus as Data Source in Grafana
curl -X POST -H "Content-Type: application/json" \
-d '{
"name":"Prometheus",
"type":"prometheus",
"url":"http://localhost:9090",
"access":"proxy",
"isDefault":true
}' \
http://admin:admin@localhost:3000/api/datasources
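When the same request has to be repeated for several Grafana instances or data sources, the JSON body can be produced by a small function. This is a sketch under simple assumptions: `datasource_payload` is a hypothetical name, and the arguments are inserted verbatim, so they must not contain quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for a Prometheus data source.
# Arguments are not JSON-escaped; pass plain names and URLs only.
datasource_payload() {
  local name="$1" url="$2"
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$name" "$url"
}

datasource_payload "Prometheus" "http://localhost:9090"

# Example (commented out; sends the payload to the Grafana instance from this guide):
# curl -X POST -H "Content-Type: application/json" \
#   -d "$(datasource_payload Prometheus http://localhost:9090)" \
#   http://admin:admin@localhost:3000/api/datasources
```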
Step 2: Configure Alert Rules in Prometheus
sudo tee /etc/prometheus/alert_rules.yml > /dev/null << 'EOF'
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # rate() averages over the whole 5m window; irate() uses only the last
      # two samples, which tends to make alerts flap
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_avail_bytes{device!~"tmpfs"} / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space available on {{ $labels.device }}"
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node is down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
EOF
sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus
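The arithmetic behind the HighMemoryUsage rule can be checked by hand before trusting the threshold. The sketch below reproduces `(1 - MemAvailable / MemTotal) * 100` in awk; the helper names `mem_used_pct` and `exceeds` are hypothetical, and the byte counts are made-up sample values rather than measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: memory usage percent from available and total bytes.
mem_used_pct() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Hypothetical helper: succeed when value is strictly greater than threshold.
exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Sample values: 1 GiB available out of 8 GiB total.
pct=$(mem_used_pct 1073741824 8589934592)
if exceeds "$pct" 85; then
  echo "usage ${pct}% is above 85%: condition true"
else
  echo "usage ${pct}% is at or below 85%: condition false"
fi
```

With these sample values the condition is true, but the alert itself would only fire once the condition has held for the full `for: 5m` period.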
Step 3: Verify Alert Manager Integration
# Check Alertmanager status (API v2; the v1 API is deprecated and absent from newer releases)
curl -s http://localhost:9093/api/v2/status | jq .
# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | jq .
Creating Monitoring Dashboards
Create System Overview Dashboard
Access Grafana at http://localhost:3000 and create a new dashboard. Add these panels:
- CPU Usage (Time Series Panel):
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Memory Usage (Gauge Panel):
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Disk Usage (Stat Panel):
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
- Network Traffic (Time Series Panel):
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
Save Dashboard
# Export dashboard
curl -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
http://localhost:3000/api/dashboards/uid/system-overview > dashboard.json
Setting Up Alerts
Configure Alert Notification Channels
# Slack notification channel (legacy alerting API; Grafana releases that only ship unified alerting use contact points instead)
curl -X POST -H "Content-Type: application/json" \
-d '{
"name": "Slack Channel",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
"uploadImage": true
},
"isDefault": true
}' \
http://admin:admin@localhost:3000/api/alert-notifications
Test Alerting Pipeline
# Send a test alert to Alertmanager (API v2)
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing the alert pipeline"
}
}]'
Scaling the Stack
Add More Servers
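# Append another job to the Prometheus config; the indentation must match
# the existing entries under scrape_configs, which is the last section of the file
sudo tee -a /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
  - job_name: 'additional-servers'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']
        labels:
          datacenter: 'us-east-1'
EOF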
# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
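When the list of servers grows, typing the targets array by hand becomes error-prone. A job entry can instead be rendered from a list of hosts. The function name `render_job` is hypothetical, and the output is indented by two spaces on the assumption that it will be appended under an existing `scrape_configs:` section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render a static scrape job for a list of hosts.
# Usage: render_job <job_name> <port> <host>...
render_job() {
  local job="$1" port="$2" targets="" host
  shift 2
  for host in "$@"; do
    # Join entries with ", " and wrap each one in single quotes.
    targets="${targets:+$targets, }'${host}:${port}'"
  done
  printf "  - job_name: '%s'\n" "$job"
  printf "    static_configs:\n"
  printf "      - targets: [%s]\n" "$targets"
}

render_job additional-servers 9100 192.168.1.20 192.168.1.21
```

The output can be piped to `sudo tee -a /etc/prometheus/prometheus.yml` in place of the heredoc; running `promtool check config /etc/prometheus/prometheus.yml` before the reload catches indentation mistakes.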
Use Service Discovery
scrape_configs:
  - job_name: 'consul-discovered'
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
Troubleshooting
Check Prometheus Health
# Verify Prometheus is running
systemctl status prometheus
# Check metrics are being scraped
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .
# View scrape targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'
Debug Node Exporter
# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | wc -l
# Check specific metric
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -5
Verify Alert Rules
# Check rules syntax
promtool check rules /etc/prometheus/alert_rules.yml
# View active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'
Conclusion
You now have a complete monitoring stack: Prometheus collects metrics, Node Exporter provides system data, Alertmanager routes alerts, and Grafana visualizes everything. Before relying on it in production, change the default Grafana admin password, replace the placeholder keys and webhook URLs, and restrict network access to ports 9090, 9093, and 3000. The same layout extends to more servers by adding scrape targets or service discovery. From here, focus on improving your dashboards, refining alert thresholds, and documenting runbooks for alert responses. Regular maintenance, including retention review and storage monitoring, keeps the stack reliable over time.


