Prometheus and Grafana Complete Monitoring Stack

Building a complete monitoring stack with Prometheus and Grafana gives you observability across your infrastructure. This guide covers the full ecosystem: Prometheus for metric collection and storage, Node Exporter for system metrics, Alertmanager for alert routing, and Grafana for visualization. By the end, you'll have a production-ready platform that monitors multiple servers.

Overview

A complete monitoring stack consists of four key components working together:

Prometheus: Time-series database collecting metrics
Node Exporter: Agent exposing system metrics
Alertmanager: Routes and deduplicates alerts
Grafana: Visualizes metrics and manages dashboards

This architecture enables monitoring of servers, applications, and services across multiple environments.

Architecture

Component Relationships

┌─────────────────────────────────────────────────────────┐
│                    Monitoring Stack                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐    ┌─────────────┐                     │
│  │Node Exporter│    │Node Exporter│                     │
│  │  Server 1   │    │  Server 2   │                     │
│  └──────┬──────┘    └──────┬──────┘                     │
│         │                  │                            │
│         └───────────┬──────┘                            │
│                     │ Scrapes                           │
│              ┌──────▼──────┐                            │
│              │ Prometheus  │                            │
│              │ 9090        │                            │
│              └──────┬──────┘                            │
│                     │                                   │
│         ┌───────────┼───────────┐                       │
│         │           │           │                       │
│    ┌────▼──┐   ┌────▼──┐   ┌────▼──┐                    │
│    │Grafana│   │Alert  │   │Custom │                    │
│    │3000   │   │Manager│   │Systems│                    │
│    └───────┘   │9093   │   └───────┘                    │
│                └───────┘                                │
│                                                         │
└─────────────────────────────────────────────────────────┘

Prerequisites

System requirements for the complete stack:

Two servers minimum (one to host the monitoring stack, one to be monitored)
2GB RAM on monitoring server
10GB storage for Prometheus TSDB
Ubuntu 20.04+ or CentOS 8+
Root or sudo access
Network connectivity between servers
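
The 10GB figure is a starting point rather than a rule. As a rough sizing sketch, the Prometheus storage documentation estimates disk use as retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The function name and the series-per-host count below are assumptions for illustration; the 15s scrape interval and 30-day retention match the values used later in this guide.

```shell
#!/bin/sh
# Rough TSDB disk estimate: retention * samples/s * bytes/sample.
# All inputs are illustrative assumptions; measure your own server.
estimate_tsdb_bytes() {
  targets=$1            # number of scraped hosts
  series_per_target=$2  # assumed active series per host
  scrape_interval=$3    # seconds, matches scrape_interval in prometheus.yml
  retention_days=$4     # matches --storage.tsdb.retention.time
  bytes_per_sample=$5   # roughly 1-2 after compression

  samples_per_sec=$(( targets * series_per_target / scrape_interval ))
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# 3 hosts, ~1000 series each, 15s interval, 30d retention, 2 bytes/sample
estimate_tsdb_bytes 3 1000 15 30 2   # prints 1036800000 (about 1 GB)
```

Once Prometheus is running, the `prometheus_tsdb_head_series` metric gives the real active series count to plug in instead of the assumed one.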

Installing Prometheus

Step 1: Prepare System

# Update system
sudo apt-get update && sudo apt-get upgrade -y

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Step 2: Download and Install

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.50.0.linux-amd64.tar.gz
cd prometheus-2.50.0.linux-amd64

# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

# Copy configuration templates
sudo cp prometheus.yml /etc/prometheus/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries

Step 3: Create Prometheus Configuration

sudo tee /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-prod'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          group: 'local'
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          group: 'remote'
EOF

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

Step 4: Create Systemd Service

sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
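
With `Type=simple`, `systemctl start` returns as soon as the process is launched, which can be before Prometheus is ready to serve requests. A minimal sketch of a retry helper for polling the readiness endpoint follows; `wait_for` is a hypothetical helper name, not part of Prometheus, while `/-/ready` is the readiness path Prometheus exposes.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  wf_attempts=$1
  wf_delay=$2
  shift 2
  wf_i=1
  while [ "$wf_i" -le "$wf_attempts" ]; do
    # Succeed as soon as the command exits 0; hide its output.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$wf_delay"
    wf_i=$(( wf_i + 1 ))
  done
  return 1
}

# Example (not run here): poll the Prometheus readiness endpoint.
# wait_for 30 1 curl -sf http://localhost:9090/-/ready && echo "Prometheus is ready"
```

The same helper can be pointed at other services in the stack once they are installed.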

Installing Node Exporter

On Each Server to Monitor

# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xvzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

Create Systemd Service for Node Exporter

sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netdev.device-exclude=^(veth.*|br.*|docker.*|virbr.*)$$ \
  --web.listen-address=:9100

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Verify Node Exporter Metrics

curl -s http://localhost:9100/metrics | head -30
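
The first 30 lines are mostly `# HELP` and `# TYPE` comments. To see which metrics are actually exposed, a small filter over the text format can help. This is a sketch; `summarize_metrics` is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize Prometheus text-format metrics read from stdin:
# drop comment and blank lines, strip labels and values,
# then print each metric name with its number of series.
summarize_metrics() {
  grep -v -e '^#' -e '^$' | sed 's/[{ ].*//' | sort | uniq -c | sort -rn
}

# Example (not run here):
# curl -s http://localhost:9100/metrics | summarize_metrics | head
```

Metrics with many series, such as per-CPU or per-filesystem ones, appear at the top of the output.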

Installing Alertmanager

Step 1: Download and Install

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

# Install
# amtool is the command-line client shipped in the Alertmanager tarball
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Step 2: Configure Alertmanager

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
      group_wait: 0s
      repeat_interval: 5m

    - match:
        severity: warning
      receiver: 'slack'
      group_wait: 30s

receivers:
  - name: 'default-receiver'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        client: 'Prometheus'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
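
To make the routing in this configuration concrete, here is a toy model of its route tree. It mirrors only the two child routes above and is not how Alertmanager is implemented; `receiver_for` is a hypothetical name.

```shell
#!/bin/sh
# Toy model of the route tree in alertmanager.yml above.
# Real Alertmanager matching (regex matchers, nesting, grouping) is far richer.
receiver_for() {
  severity=$1
  case "$severity" in
    # continue: true makes Alertmanager also test later sibling routes,
    # but the next one matches only "warning", so nothing else is notified.
    critical) echo "pagerduty" ;;
    warning)  echo "slack" ;;
    # No child route matched: the top-level receiver handles the alert.
    *)        echo "default-receiver" ;;
  esac
}

for s in critical warning info; do
  echo "$s -> $(receiver_for "$s")"
done
```

To check the real file, `amtool check-config /etc/alertmanager/alertmanager.yml` validates it before the service is started; `amtool` ships in the Alertmanager tarball.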

Step 3: Create Systemd Service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

Installing Grafana

Step 1: Add Repository and Install

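sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
# apt-key is deprecated; store the signing key in a dedicated keyring instead
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list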

sudo apt-get update
# The package is named grafana; the service it installs is grafana-server
sudo apt-get install -y grafana

Step 2: Configure Grafana

sudo tee /etc/grafana/grafana.ini > /dev/null << 'EOF'
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000

[database]
type = sqlite3
path = /var/lib/grafana/grafana.db

[security]
admin_user = admin
admin_password = admin
secret_key = replace_with_secure_key_min_16_chars
cookie_secure = false

[log]
mode = file
level = info

[paths]
logs = /var/log/grafana
EOF

Step 3: Start Grafana

sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configuring the Stack

Step 1: Add Prometheus as Data Source in Grafana

curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name":"Prometheus",
    "type":"prometheus",
    "url":"http://localhost:9090",
    "access":"proxy",
    "isDefault":true
  }' \
  http://admin:admin@localhost:3000/api/datasources

Step 2: Configure Alert Rules in Prometheus

sudo tee /etc/prometheus/alert_rules.yml > /dev/null << 'EOF'
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_avail_bytes{device!~"tmpfs"} / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space available on {{ $labels.device }}"

      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node is down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
EOF

sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus
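
As a worked example of the HighCPUUsage expression: the rule averages the per-core idle rate for each instance and subtracts it from 100. The sketch below reproduces that arithmetic with assumed idle fractions; `cpu_usage_pct` is a hypothetical helper, not a Prometheus tool.

```shell
#!/bin/sh
# Worked arithmetic for the HighCPUUsage expression:
#   100 - (average idle rate across cores * 100)
# Arguments are per-core idle fractions (seconds idle per second), assumed values.
cpu_usage_pct() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", 100 - (sum / NR) * 100 }'
}

# Four cores idle 10%, 20%, 15%, and 15% of the time
usage=$(cpu_usage_pct 0.10 0.20 0.15 0.15)
echo "CPU usage: ${usage}%"   # prints "CPU usage: 85.0%"
# The rule fires only if the value stays above 80 for the full 5m "for" window.
```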

Step 3: Verify Alertmanager Integration

# Check Alertmanager status (the v1 API was removed in Alertmanager 0.27; use v2)
curl -s http://localhost:9093/api/v2/status | jq .

# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | jq .

Creating Monitoring Dashboards

Create System Overview Dashboard

Access Grafana at http://localhost:3000 and create a new dashboard. Add these panels:

CPU Usage (Time Series Panel):

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage (Gauge Panel):

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk Usage (Stat Panel):

(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

Network Traffic (Time Series Panel):

rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
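
To read these graphs, it helps to know what `rate()` computes: roughly the per-second increase of a counter over the window. The sketch below is a simplified model with assumed sample values; real `rate()` also handles counter resets and extrapolates to the window edges, which this ignores.

```shell
#!/bin/sh
# Simplified model of rate(): per-second increase between two counter samples.
# Integer arithmetic only; counter resets and extrapolation are ignored.
simple_rate() {
  first=$1
  last=$2
  seconds=$3
  echo $(( (last - first) / seconds ))
}

# Counter went from 1,000,000 to 4,000,000 bytes over a 5m (300s) window
simple_rate 1000000 4000000 300   # prints 10000 (bytes per second)
```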

Save Dashboard

# Export dashboard
curl -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  http://localhost:3000/api/dashboards/uid/system-overview > dashboard.json

Setting Up Alerts

Configure Alert Notification Channels

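# Slack notification channel
# Note: /api/alert-notifications belongs to Grafana's legacy alerting, which was
# removed in Grafana 11. On newer versions, create a contact point instead.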
curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name": "Slack Channel",
    "type": "slack",
    "settings": {
      "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
      "uploadImage": true
    },
    "isDefault": true
  }' \
  http://admin:admin@localhost:3000/api/alert-notifications

Test Alerting Pipeline

# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing the alert pipeline"
    }
  }]'
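
The test alert above exercises only the warning route. To try other severities, such as the critical route to PagerDuty, a small payload builder can help. This is a sketch; `make_test_alert` is a hypothetical helper and does no JSON escaping.

```shell
#!/bin/sh
# Build a one-alert JSON payload like the test alert above.
# Inputs must not contain double quotes or backslashes (no JSON escaping here).
make_test_alert() {
  alertname=$1
  severity=$2
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Test alert %s"}}]\n' \
    "$alertname" "$severity" "$alertname"
}

make_test_alert TestAlert critical

# Example (not run here): pipe the payload to Alertmanager.
# make_test_alert TestAlert critical | \
#   curl -X POST -H "Content-Type: application/json" -d @- http://localhost:9093/api/v2/alerts
```

Sending a critical alert this way triggers a real page if the PagerDuty key is live, so use a test routing key when experimenting.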

Scaling the Stack

Add More Servers

# In Prometheus config, add targets:
sudo tee -a /etc/prometheus/prometheus.yml > /dev/null << 'EOF'

  - job_name: 'additional-servers'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']
        labels:
          datacenter: 'us-east-1'
EOF

# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
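
Hand-writing target lists gets error-prone as hosts are added. One possible approach is to generate the block from a host list, as in this sketch. `make_scrape_job` is a hypothetical helper, and it assumes every host runs Node Exporter on port 9100.

```shell
#!/bin/sh
# Print a scrape job in the same shape as the 'additional-servers' block above.
# Usage: make_scrape_job JOB_NAME DATACENTER HOST [HOST...]; port 9100 is assumed.
make_scrape_job() {
  job=$1
  dc=$2
  shift 2
  targets=""
  for host in "$@"; do
    # Join hosts as 'host:9100', separated by ", "
    targets="${targets:+$targets, }'${host}:9100'"
  done
  printf "  - job_name: '%s'\n" "$job"
  printf "    static_configs:\n"
  printf "      - targets: [%s]\n" "$targets"
  printf "        labels:\n"
  printf "          datacenter: '%s'\n" "$dc"
}

make_scrape_job additional-servers us-east-1 192.168.1.20 192.168.1.21
```

Before reloading, `promtool check config /etc/prometheus/prometheus.yml` catches indentation mistakes and duplicate job names.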

Use Service Discovery

scrape_configs:
  - job_name: 'consul-discovered'
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service

Troubleshooting

Check Prometheus Health

# Verify Prometheus is running
systemctl status prometheus

# Check metrics are being scraped
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .

# View scrape targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'

Debug Node Exporter

# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | wc -l

# Check specific metric
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -5

Verify Alert Rules

# Check rules syntax
promtool check rules /etc/prometheus/alert_rules.yml

# View active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'

Conclusion

You now have a complete, production-ready monitoring stack. Prometheus collects metrics, Node Exporter provides system data, Alertmanager routes alerts, and Grafana visualizes everything. This foundation scales to monitor hundreds of servers. Focus on continuously improving your dashboards, refining alert thresholds, and documenting runbooks for alert responses. Regular maintenance, including data retention review and storage monitoring, ensures long-term reliability.
