Prometheus and Grafana Complete Monitoring Stack

Building a complete monitoring stack with Prometheus and Grafana provides comprehensive infrastructure observability. This guide covers the entire ecosystem: Prometheus, Node Exporter for system metrics, Alertmanager for alert routing, and Grafana for visualization. By the end of this guide, you'll have a production-ready monitoring platform watching multiple servers.

Overview

A complete monitoring stack consists of four key components working together:

  • Prometheus: Time-series database that scrapes and stores metrics
  • Node Exporter: Agent exposing system metrics
  • Alertmanager: Routes and deduplicates alerts
  • Grafana: Visualizes metrics and manages dashboards

This architecture enables monitoring of servers, applications, and services across multiple environments.

Architecture

Component Relationships

┌─────────────────────────────────────────────────────────┐
│                    Monitoring Stack                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐    ┌─────────────┐                     │
│  │Node Exporter│    │Node Exporter│                     │
│  │  Server 1   │    │  Server 2   │                     │
│  └──────┬──────┘    └──────┬──────┘                     │
│         │                  │                            │
│         └─────────┬────────┘                            │
│                   │ Scrapes                             │
│            ┌──────▼──────┐                              │
│            │ Prometheus  │                              │
│            │ 9090        │                              │
│            └──────┬──────┘                              │
│                   │                                     │
│        ┌──────────┼──────────┐                          │
│        │          │          │                          │
│    ┌───▼───┐  ┌───▼───┐  ┌───▼───┐                      │
│    │Grafana│  │Alert  │  │Custom │                      │
│    │3000   │  │Manager│  │Systems│                      │
│    └───────┘  │9093   │  └───────┘                      │
│               └───────┘                                 │
│                                                         │
└─────────────────────────────────────────────────────────┘
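
The ports in the diagram correspond to HTTP endpoints that can be probed once the stack is running. The sketch below maps each component to the URL typically used to check that it responds. It assumes default ports and a single host; `health_url` is a hypothetical helper name, not part of any of these tools.

```shell
# health_url COMPONENT [HOST]
# Prints the URL commonly used to check that a component is responding.
# The helper name and the default host are illustrative assumptions.
health_url() {
  host="${2:-localhost}"
  case "$1" in
    prometheus)    echo "http://${host}:9090/-/healthy" ;;
    alertmanager)  echo "http://${host}:9093/-/healthy" ;;
    grafana)       echo "http://${host}:3000/api/health" ;;
    node_exporter) echo "http://${host}:9100/metrics" ;;
    *)             echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Example (run after installation):
# for c in prometheus alertmanager grafana node_exporter; do
#   curl -fsS -o /dev/null "$(health_url "$c")" && echo "$c ok" || echo "$c FAILED"
# done
```

Node Exporter has no dedicated health endpoint in this sketch, so its metrics page doubles as the liveness probe.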

Prerequisites

System requirements for the complete stack:

  • Two servers minimum (one to host the monitoring stack, one to be monitored)
  • 2GB RAM on the monitoring server
  • 10GB storage for the Prometheus TSDB
  • Ubuntu 20.04+ or CentOS 8+
  • Root or sudo access
  • Network connectivity between servers
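
The RAM and storage requirements can be checked before installing anything. This is a minimal sketch for Linux hosts. The function names are hypothetical, and the thresholds are set slightly below the nominal 2 GB and 10 GB as an assumption, since reported values are usually a little lower than the advertised size.

```shell
# meets_minimum ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED (integers).
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Total RAM in MiB, read from /proc/meminfo (Linux only).
total_ram_mib() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Free space in MiB on the filesystem that holds the given path.
free_disk_mib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# preflight [DATA_DIR]: warn when the host looks too small for the stack.
preflight() {
  meets_minimum "$(total_ram_mib)" 1800 || echo "WARN: under 2 GB RAM"
  meets_minimum "$(free_disk_mib "${1:-/var/lib}")" 10000 || echo "WARN: under 10 GB free"
}

# Example: preflight /var/lib
```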

Installing Prometheus

Step 1: Prepare the System

# Update system
sudo apt-get update && sudo apt-get upgrade -y

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Step 2: Download and Install

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.50.0.linux-amd64.tar.gz
cd prometheus-2.50.0.linux-amd64

# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

# Copy configuration templates
sudo cp prometheus.yml /etc/prometheus/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries
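
Before copying binaries into place, the downloaded tarball can be checked against the checksum list published with the release. The sketch below assumes a checksum file in the usual `sha256sum` format (hash, whitespace, file name), such as the `sha256sums.txt` asset attached to Prometheus releases. `verify_sha256` is a hypothetical helper.

```shell
# verify_sha256 FILE SUMS_FILE
# Succeeds only if SUMS_FILE lists FILE's base name and the recorded hash
# matches the hash of the file on disk.
verify_sha256() {
  file="$1"; sums="$2"
  expected=$(awk -v f="$(basename "$file")" '$2 == f { print $1 }' "$sums")
  [ -n "$expected" ] || { echo "no checksum listed for $file" >&2; return 1; }
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  [ "$expected" = "$actual" ]
}

# Example (paths are assumptions):
# verify_sha256 /tmp/prometheus-2.50.0.linux-amd64.tar.gz /tmp/sha256sums.txt \
#   || echo "checksum mismatch, do not install"
```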

Step 3: Create the Prometheus Configuration

sudo tee /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-prod'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          group: 'local'
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          group: 'remote'
EOF

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

Step 4: Create the Systemd Service

sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
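
`systemctl start` returns as soon as the process is launched, which can be before Prometheus is ready to serve queries. A small polling loop against the readiness endpoint closes that gap. This is a sketch: `wait_for_url` is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl receives a successful HTTP response or the
# attempts run out.
wait_for_url() {
  url="$1"; attempts="${2:-30}"; delay="${3:-1}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Example: wait_for_url http://localhost:9090/-/ready && echo "Prometheus is ready"
```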

Installing Node Exporter

On Each Server to Monitor

# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xvzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

Create a Systemd Service for Node Exporter

sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netdev.device-exclude=^(veth.*|br.*|docker.*|virbr.*)$$ \
  --web.listen-address=:9100

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Verify Node Exporter Metrics

curl -s http://localhost:9100/metrics | head -30
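
The metrics page mixes `# HELP` and `# TYPE` comments with sample lines, and a plain text search for a metric name also matches any longer name that starts the same way. The sketch below filters the text format by exact metric name. `metric_lines` is a hypothetical helper.

```shell
# metric_lines NAME
# Reads Prometheus text-format metrics on stdin and prints only the sample
# lines whose metric name is exactly NAME, skipping comment lines.
metric_lines() {
  awk -v name="$1" '
    /^#/ { next }
    {
      # The metric name ends at the first "{" (labels) or space (no labels).
      n = $0
      sub(/[{ ].*/, "", n)
      if (n == name) print
    }'
}

# Example: curl -s http://localhost:9100/metrics | metric_lines node_load1
```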

Installing Alertmanager

Step 1: Download and Install

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

# Install
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Step 2: Configure Alertmanager

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
      group_wait: 0s
      repeat_interval: 5m

    - match:
        severity: warning
      receiver: 'slack'
      group_wait: 30s

receivers:
  - name: 'default-receiver'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        client: 'Prometheus'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
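
The routing tree can be exercised without sending real alerts by using `amtool`, the command-line tool shipped in the Alertmanager tarball. The sketch below validates the file and then asks which receiver a given label set would reach. `check_alert_route` is a hypothetical wrapper, and the check is expected to fail until the placeholder Slack and PagerDuty values are replaced with real ones.

```shell
# check_alert_route CONFIG LABEL=VALUE...
# Validates CONFIG, then prints the receiver(s) an alert carrying the given
# labels would be routed to.
check_alert_route() {
  cfg="$1"; shift
  amtool check-config "$cfg" >/dev/null || return 1
  amtool config routes test --config.file="$cfg" "$@"
}

# Examples:
# check_alert_route /etc/alertmanager/alertmanager.yml severity=critical
# check_alert_route /etc/alertmanager/alertmanager.yml severity=warning
```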

Step 3: Create the Systemd Service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

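Installing Grafana

Step 1: Add the Repository and Install

sudo apt-get install -y software-properties-common wget

# Import the signing key before adding the repository
# (apt-key is deprecated on newer Ubuntu releases but still works on 20.04)
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"

sudo apt-get update
# The package is named "grafana"; it installs the grafana-server service
sudo apt-get install -y grafana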

Step 2: Configure Grafana

sudo tee /etc/grafana/grafana.ini > /dev/null << 'EOF'
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000

[database]
type = sqlite3
path = /var/lib/grafana/grafana.db

[security]
admin_user = admin
admin_password = admin
secret_key = replace_with_secure_key_min_16_chars
cookie_secure = false

[log]
mode = file
level = info

[paths]
logs = /var/log/grafana
EOF
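
The `secret_key` placeholder in the configuration above should be replaced with a random value, ideally before Grafana starts for the first time. The sketch below shows one way to generate such a value from `/dev/urandom`. `random_hex` is a hypothetical helper, and the 16-byte length is an assumption chosen to give 32 hex characters.

```shell
# random_hex BYTES
# Prints BYTES random bytes from /dev/urandom as lowercase hex on one line.
random_hex() {
  od -An -N "$1" -tx1 /dev/urandom | tr -d ' \n'
  echo
}

# Example (GNU sed assumed):
# sudo sed -i "s/^secret_key = .*/secret_key = $(random_hex 16)/" /etc/grafana/grafana.ini
```

The same approach can produce a replacement for the default `admin` password.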

Step 3: Start Grafana

sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configuring the Stack

Step 1: Add Prometheus as a Data Source in Grafana

curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name":"Prometheus",
    "type":"prometheus",
    "url":"http://localhost:9090",
    "access":"proxy",
    "isDefault":true
  }' \
  http://admin:admin@localhost:3000/api/datasources

Step 2: Configure Alert Rules in Prometheus

sudo tee /etc/prometheus/alert_rules.yml > /dev/null << 'EOF'
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_avail_bytes{device!~"tmpfs"} / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space available on {{ $labels.device }}"

      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node is down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
EOF

sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus
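
To see what the HighCPUUsage expression computes, it helps to work the arithmetic by hand. The idle counter grows by one for every second a CPU spends idle, so its per-second rate of increase is the idle fraction, and subtracting that from 100% gives usage. The numbers below are hypothetical, and the helper covers a single CPU only, whereas the rule averages across all CPUs of an instance.

```shell
# cpu_usage_percent IDLE_START IDLE_END WINDOW_SECONDS
# Mirrors "100 - (idle rate * 100)" for one CPU: the idle counter's increase
# divided by the window length is the fraction of time spent idle.
cpu_usage_percent() {
  awk -v s="$1" -v e="$2" -v w="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((e - s) / w) * 100 }'
}

# Idle counter went from 1000 to 1045 over a 300 s window:
# 45 / 300 = 0.15 idle, so usage is 85%, above the 80% threshold.
cpu_usage_percent 1000 1045 300   # prints 85.0
```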

Step 3: Verify Alertmanager Integration

# Check Alertmanager status
curl -s http://localhost:9093/api/v2/status | jq .

# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | jq .

Creating Monitoring Dashboards

Create a System Overview Dashboard

Access Grafana at http://localhost:3000 and create a new dashboard. Add these panels:

  1. CPU Usage (Time Series Panel):
     100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  2. Memory Usage (Gauge Panel):
     (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  3. Disk Usage (Stat Panel):
     (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
  4. Network Traffic (Time Series Panel):
     rate(node_network_receive_bytes_total[5m])
     rate(node_network_transmit_bytes_total[5m])
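
The panel expressions can be tried against the Prometheus HTTP API before they are pasted into Grafana, which makes typos easier to spot. The sketch below sends an instant query with curl. `prom_query` is a hypothetical helper, and the base URL default assumes Prometheus runs on the same host.

```shell
# prom_query EXPR [BASE_URL]
# Runs an instant query against the Prometheus HTTP API. --data-urlencode
# takes care of the braces, quotes, and brackets that PromQL contains.
prom_query() {
  curl -fsS -G "${2:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example:
# prom_query '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100' | jq .
```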

Save the Dashboard

# Export dashboard
curl -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  http://localhost:3000/api/dashboards/uid/system-overview > dashboard.json

Setting Up Alerts

Configure Alert Notification Channels

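# Slack notification channel. This endpoint is part of Grafana's legacy alerting API;
# Grafana versions that ship only unified alerting use contact points instead.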
curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name": "Slack Channel",
    "type": "slack",
    "settings": {
      "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
      "uploadImage": true
    },
    "isDefault": true
  }' \
  http://admin:admin@localhost:3000/api/alert-notifications

Test the Alerting Pipeline

# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing the alert pipeline"
    }
  }]'
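
Because the routing tree treats warning and critical alerts differently, it is worth sending one test alert per severity. The sketch below builds the JSON payload from two arguments so the same command can exercise both routes. `test_alert_json` is a hypothetical helper, and it does no JSON escaping, so it is only suitable for simple names.

```shell
# test_alert_json NAME SEVERITY
# Prints a one-alert JSON array in the shape Alertmanager's alerts endpoint
# accepts. Arguments are inserted verbatim (no escaping).
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Test alert %s"}}]\n' \
    "$1" "$2" "$1"
}

# Example:
# test_alert_json TestCritical critical | curl -fsS -X POST \
#   -H 'Content-Type: application/json' --data-binary @- \
#   http://localhost:9093/api/v2/alerts
```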

Scaling the Stack

Add More Servers

# In Prometheus config, add targets:
sudo tee -a /etc/prometheus/prometheus.yml > /dev/null << 'EOF'

  - job_name: 'additional-servers'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']
        labels:
          datacenter: 'us-east-1'
EOF

# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
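
Appending to the configuration by hand makes indentation mistakes easy, so it can help to validate the file before triggering the reload. The sketch below only calls the reload endpoint when `promtool check config` succeeds. `safe_reload` is a hypothetical helper, and the default paths match the ones used earlier in this guide.

```shell
# safe_reload [CONFIG] [BASE_URL]
# Validates the configuration with promtool and triggers a reload only if
# validation passes. The reload endpoint needs --web.enable-lifecycle.
safe_reload() {
  promtool check config "${1:-/etc/prometheus/prometheus.yml}" || return 1
  curl -fsS -X POST "${2:-http://localhost:9090}/-/reload"
}

# Example: safe_reload && echo "configuration reloaded"
```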

Use Service Discovery

scrape_configs:
  - job_name: 'consul-discovered'
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service

Troubleshooting

Check Prometheus Status

# Verify Prometheus is running
systemctl status prometheus

# Check metrics are being scraped
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .

# View scrape targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'

Debug Node Exporter

# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | wc -l

# Check specific metric
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -5

Verify Alert Rules

# Check rules syntax
promtool check rules /etc/prometheus/alert_rules.yml

# View active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'

Conclusion

You now have a complete, production-ready monitoring stack. Prometheus collects metrics, Node Exporter provides system data, Alertmanager routes alerts, and Grafana visualizes everything. This foundation scales to monitor hundreds of servers. Focus on continuously improving your dashboards, refining alert thresholds, and documenting runbooks for alert responses. Regular maintenance, including data retention review and storage monitoring, helps ensure long-term reliability.