Prometheus and Grafana Complete Monitoring Stack
Building a complete monitoring stack with Prometheus and Grafana provides comprehensive infrastructure observability. This guide covers the entire ecosystem: Prometheus for metrics collection and storage, Node Exporter for system metrics, Alertmanager for alert routing, and Grafana for visualization. By the end of this guide, you'll have a production-ready monitoring platform covering multiple servers.
Table of Contents
- Overview
- Architecture
- Prerequisites
- Installing Prometheus
- Installing Node Exporter
- Installing Alertmanager
- Installing Grafana
- Configuring the Stack
- Creating Monitoring Dashboards
- Setting Up Alerts
- Scaling the Stack
- Troubleshooting
- Conclusion
Overview
A complete monitoring stack consists of four key components working together:
- Prometheus: Time-series database that scrapes and stores metrics
- Node Exporter: Agent exposing system metrics
- Alertmanager: Routes and deduplicates alerts
- Grafana: Visualizes metrics and manages dashboards
This architecture enables monitoring of servers, applications, and services across multiple environments.
Architecture
Component Relationships
┌─────────────────────────────────────────────────────────┐
│                    Monitoring Stack                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│      ┌─────────────┐               ┌─────────────┐      │
│      │Node Exporter│               │Node Exporter│      │
│      │  Server 1   │               │  Server 2   │      │
│      └──────┬──────┘               └──────┬──────┘      │
│             │                             │             │
│             └──────────────┬──────────────┘             │
│                            │ Scrapes                    │
│                     ┌──────▼──────┐                     │
│                     │ Prometheus  │                     │
│                     │    9090     │                     │
│                     └──────┬──────┘                     │
│                            │                            │
│            ┌───────────────┼───────────────┐            │
│            │               │               │            │
│        ┌───▼───┐       ┌───▼───┐       ┌───▼───┐        │
│        │Grafana│       │ Alert │       │Custom │        │
│        │ 3000  │       │Manager│       │Systems│        │
│        └───────┘       │ 9093  │       └───────┘        │
│                        └───────┘                        │
│                                                         │
└─────────────────────────────────────────────────────────┘
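As a rough companion to the diagram, the sketch below maps each component to the default port used in this guide and to an HTTP endpoint that can be probed. The helper names `health_url` and `check_stack` are hypothetical. The `/-/healthy` paths for Prometheus and Alertmanager and `/api/health` for Grafana are those projects' health endpoints; for Node Exporter the sketch simply probes `/metrics`.

```shell
# Print the probe URL for one component of the stack.
# health_url is a hypothetical helper; ports are the defaults used in this guide.
health_url() {
  local host="$1" component="$2"
  case "$component" in
    prometheus)    echo "http://${host}:9090/-/healthy" ;;
    alertmanager)  echo "http://${host}:9093/-/healthy" ;;
    grafana)       echo "http://${host}:3000/api/health" ;;
    node_exporter) echo "http://${host}:9100/metrics" ;;
    *)             echo "unknown component: ${component}" >&2; return 1 ;;
  esac
}

# Probe every component on one host and report whether it answered.
check_stack() {
  local host="${1:-localhost}" component url
  for component in prometheus alertmanager grafana node_exporter; do
    url="$(health_url "$host" "$component")"
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "${component}: up"
    else
      echo "${component}: DOWN (${url})"
    fi
  done
}
```

Once the stack is installed, running `check_stack localhost` on the monitoring server should report all four components as up.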
Prerequisites
System requirements for the complete stack:
- Two servers minimum (one to run the monitoring stack, one to be monitored)
- 2GB RAM on the monitoring server
- 10GB storage for the Prometheus TSDB
- Ubuntu 20.04+ or CentOS 8+ (the commands in this guide use apt; adapt the package steps on CentOS)
- Root or sudo access
- Network connectivity between servers
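The memory and storage requirements above can be checked before installing anything. This is a minimal preflight sketch for Linux; `has_min_memory` and `has_min_disk` are hypothetical helpers, and the thresholds are passed in kilobytes.

```shell
# Return 0 when MemTotal in a meminfo-style file is at least min_kb.
has_min_memory() {
  local min_kb="$1" meminfo="${2:-/proc/meminfo}"
  awk -v min="$min_kb" '
    $1 == "MemTotal:" { found = 1; ok = ($2 >= min) }
    END { exit !(found && ok) }
  ' "$meminfo"
}

# Return 0 when the filesystem holding path has at least min_kb available.
has_min_disk() {
  local min_kb="$1" path="$2" avail_kb
  # df -P guarantees one line per filesystem; column 4 is available space.
  avail_kb="$(df -Pk "$path" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge "$min_kb" ]
}
```

For example, `has_min_memory 1900000 && has_min_disk 10485760 /var/lib && echo "preflight OK"`. The memory threshold sits a little under 2 GB on the assumption that a "2GB" machine reports slightly less than 2097152 kB once the kernel has reserved its share.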
Installing Prometheus
Step 1: Prepare the System
# Update system
sudo apt-get update && sudo apt-get upgrade -y
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Step 2: Download and Install
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.50.0.linux-amd64.tar.gz
cd prometheus-2.50.0.linux-amd64
# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
# Copy configuration templates
sudo cp prometheus.yml /etc/prometheus/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries
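It is worth verifying a downloaded tarball before copying its binaries into place. The sketch below assumes a `sha256sums.txt` file has been downloaded from the same release page (lines of the form `<hash>  <filename>`); `verify_sha256` is a hypothetical helper, not part of Prometheus.

```shell
# Verify a downloaded file against a sha256sums-style list.
# verify_sha256 is a hypothetical helper: verify_sha256 <file> <sums-file>
verify_sha256() {
  local file="$1" sums="$2"
  local name expected actual
  name="$(basename "$file")"
  # Look up the expected hash by file name.
  expected="$(awk -v n="$name" '$2 == n { print $1 }' "$sums")"
  if [ -z "$expected" ]; then
    echo "no checksum listed for ${name}" >&2
    return 2
  fi
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  if [ "$expected" = "$actual" ]; then
    echo "OK: ${name}"
  else
    echo "MISMATCH: ${name}" >&2
    return 1
  fi
}
```

Usage would look like `verify_sha256 /tmp/prometheus-2.50.0.linux-amd64.tar.gz /tmp/sha256sums.txt`, run before the `tar` step.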
Step 3: Create the Prometheus Configuration
sudo tee /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-prod'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          group: 'local'
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          group: 'remote'
EOF
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
Step 4: Create the Systemd Service
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle \
--storage.tsdb.retention.time=30d
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
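The 30-day retention set in the unit file drives how much disk the TSDB needs. The Prometheus storage documentation gives a rough formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; `estimate_tsdb_bytes` is a hypothetical helper, and the series count used in the example is an assumption, not a measurement.

```shell
# Estimate Prometheus TSDB disk usage in bytes.
# estimate_tsdb_bytes <active_series> <scrape_interval_s> <retention_days> [bytes_per_sample]
estimate_tsdb_bytes() {
  local series="$1" interval_s="$2" retention_days="$3" bytes_per_sample="${4:-2}"
  # samples per second = series / interval; the division is done last
  # so integer arithmetic does not lose precision.
  echo $(( retention_days * 86400 * series * bytes_per_sample / interval_s ))
}
```

Assuming three hosts exposing about 1000 series each, `estimate_tsdb_bytes 3000 15 30` prints `1036800000`, or about 1 GB, which fits comfortably within the 10GB listed in the prerequisites.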
Installing Node Exporter
On Each Server to Monitor
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xvzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create a Systemd Service for Node Exporter
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
--collector.netdev.device-exclude=^(veth.*|br.*|docker.*|virbr.*)$$ \
--web.listen-address=:9100
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Verify Node Exporter Metrics
curl -s http://localhost:9100/metrics | head -30
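The `/metrics` output is plain text: comment lines start with `#`, and each sample line is a metric name (optionally with labels) followed by its value. A minimal sketch for pulling one value out of that text follows; `metric_value` is a hypothetical helper and assumes label values contain no spaces.

```shell
# Print the value of the first sample whose name (including any labels)
# matches exactly. Reads Prometheus text exposition format on stdin.
metric_value() {
  local name="$1"
  awk -v n="$name" '$0 !~ /^#/ && $1 == n { print $2; exit }'
}
```

For example, `curl -s http://localhost:9100/metrics | metric_value node_load1` prints the current one-minute load average.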
Installing Alertmanager
Step 1: Download and Install
# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Install (the release tarball ships the alertmanager server and the amtool CLI)
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
Step 2: Configure Alertmanager
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
      group_wait: 0s
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: 'slack'
      group_wait: 30s

receivers:
  - name: 'default-receiver'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        client: 'Prometheus'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
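The routing tree in this configuration can be read as a short decision procedure. Child routes are checked in order, a match normally stops the search, and `continue: true` lets the search go on to later siblings. The sketch below is only an illustration of that reading for the `severity` label; `route_receivers` is a hypothetical function and not part of Alertmanager.

```shell
# Print the receiver(s) an alert with the given severity would reach
# under the routing tree in this guide's alertmanager.yml.
route_receivers() {
  local severity="$1" matched=0
  # First child route: severity=critical, with continue: true,
  # so later sibling routes are still checked afterwards.
  if [ "$severity" = "critical" ]; then
    echo "pagerduty"
    matched=1
  fi
  # Second child route: severity=warning.
  if [ "$severity" = "warning" ]; then
    echo "slack"
    matched=1
  fi
  # No child matched: the alert stays on the root route's receiver.
  if [ "$matched" -eq 0 ]; then
    echo "default-receiver"
  fi
}
```

To check the real file rather than this illustration, `amtool check-config /etc/alertmanager/alertmanager.yml` validates the syntax, and `amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical` should report which receiver a given label set resolves to.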
Step 3: Create the Systemd Service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
After=network.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
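Installing Grafana
Step 1: Add the Repository and Install
sudo apt-get install -y apt-transport-https software-properties-common wget
# Import the Grafana signing key into a dedicated keyring (apt-key is deprecated)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
# The package is named grafana; it provides the grafana-server service
sudo apt-get install -y grafana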
Step 2: Configure Grafana
sudo tee /etc/grafana/grafana.ini > /dev/null << 'EOF'
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000
[database]
type = sqlite3
path = /var/lib/grafana/grafana.db
[security]
admin_user = admin
admin_password = admin
secret_key = replace_with_secure_key_min_16_chars
cookie_secure = false
[log]
mode = file
level = info
[paths]
logs = /var/log/grafana
EOF
Step 3: Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
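Grafana can take a few seconds to start answering HTTP requests after `systemctl start` returns, so scripted follow-up steps benefit from waiting for it. A minimal retry sketch follows; `retry` is a hypothetical helper that reruns any command until it succeeds or the attempt limit is reached.

```shell
# Run a command until it succeeds, up to max_attempts times.
# retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max_attempts="$1" delay_s="$2" attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay_s"
  done
}
```

For example, `retry 30 2 curl -fsS -o /dev/null http://localhost:3000/api/health` waits up to about a minute for Grafana's health endpoint to respond.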
Configuring the Stack
Step 1: Add Prometheus as a Data Source in Grafana
curl -X POST -H "Content-Type: application/json" \
-d '{
"name":"Prometheus",
"type":"prometheus",
"url":"http://localhost:9090",
"access":"proxy",
"isDefault":true
}' \
http://admin:admin@localhost:3000/api/datasources
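An alternative to the API call is Grafana's file-based provisioning, which reads data source definitions from YAML files at startup and keeps the setup reproducible. The sketch below writes such a file; `write_prometheus_datasource` is a hypothetical helper, and the default directory is assumed to be the provisioning path used by the Debian/Ubuntu package.

```shell
# Write a Grafana data source provisioning file for Prometheus.
# write_prometheus_datasource [prometheus_url] [provisioning_dir]
write_prometheus_datasource() {
  local url="${1:-http://localhost:9090}"
  local dir="${2:-/etc/grafana/provisioning/datasources}"
  mkdir -p "$dir"
  # Unquoted heredoc so ${url} is expanded into the file.
  cat > "${dir}/prometheus.yaml" << EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}
```

Run it as root with no arguments to use the defaults, then restart Grafana with `sudo systemctl restart grafana-server` so the file is picked up.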
Step 2: Configure Alert Rules in Prometheus
sudo tee /etc/prometheus/alert_rules.yml > /dev/null << 'EOF'
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_avail_bytes{device!~"tmpfs"} / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space available on {{ $labels.device }}"

      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node is down: {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
EOF
sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus
Step 3: Verify Alertmanager Integration
# Check Alertmanager status (API v2; the v1 API is deprecated and removed in newer releases)
curl -s http://localhost:9093/api/v2/status | jq .
# Check active alerts
curl -s http://localhost:9093/api/v2/alerts | jq .
Creating Monitoring Dashboards
Create a System Overview Dashboard
Access Grafana at http://localhost:3000 and create a new dashboard. Add these panels:
- CPU Usage (Time Series Panel):
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Memory Usage (Gauge Panel):
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Disk Usage (Stat Panel):
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
- Network Traffic (Time Series Panel):
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
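The percentage queries above all follow the same arithmetic: take a ratio and express the used share as a percentage. The sketch below works through that arithmetic on raw numbers, using awk for floating point; the three helpers are hypothetical and take plain values rather than PromQL.

```shell
# CPU usage % from the idle fraction (idle CPU-seconds per second, 0..1).
cpu_usage_percent() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# Memory usage % from MemAvailable and MemTotal, both in bytes.
memory_usage_percent() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Disk usage % from filesystem size and available bytes.
disk_usage_percent() {
  awk -v size="$1" -v avail="$2" 'BEGIN { printf "%.1f\n", (size - avail) / size * 100 }'
}
```

For instance, `cpu_usage_percent 0.15` prints `85.0`: a host whose CPUs are idle 15% of the time is 85% busy, which is above the 80% threshold of the HighCPUUsage rule.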
Save Dashboard
# Export the dashboard as JSON (assumes it was saved with the UID system-overview)
curl -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
http://localhost:3000/api/dashboards/uid/system-overview > dashboard.json
Setting Up Alerts
Configure Alert Notification Channels
# Slack notification channel (this endpoint belongs to Grafana's legacy alerting; Grafana 11 removed it in favor of contact points)
curl -X POST -H "Content-Type: application/json" \
-d '{
"name": "Slack Channel",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
"uploadImage": true
},
"isDefault": true
}' \
http://admin:admin@localhost:3000/api/alert-notifications
Test the Alerting Pipeline
# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing the alert pipeline"
}
}]'
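To exercise each route in turn, it helps to generate the test payload with a chosen severity instead of editing JSON by hand. A minimal sketch follows; `build_test_alert` is a hypothetical helper and does no JSON escaping, so its arguments must not contain quotes or backslashes.

```shell
# Build a one-alert JSON payload for Alertmanager.
# build_test_alert <alertname> <severity> [instance]
build_test_alert() {
  local name="$1" severity="$2" instance="${3:-test-host}"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},' \
    "$name" "$severity" "$instance"
  printf '"annotations":{"summary":"Test alert %s"}}]\n' "$name"
}
```

The output can be piped straight to the API, for example `build_test_alert TestAlert critical | curl -X POST -H "Content-Type: application/json" -d @- http://localhost:9093/api/v2/alerts`. Sending `critical` should reach the PagerDuty receiver, and `warning` the Slack receiver.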
Scaling the Stack
Add More Servers
# Append a scrape job to the Prometheus config.
# The indentation must match the existing entries under scrape_configs.
sudo tee -a /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
  - job_name: 'additional-servers'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']
        labels:
          datacenter: 'us-east-1'
EOF
# Reload Prometheus (works because the service runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
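Appending to `prometheus.yml` gets fragile as the server list grows. One alternative is Prometheus's file-based service discovery, where a scrape job points at JSON files through `file_sd_configs` and Prometheus picks up changes to those files without a reload. The sketch below generates such a file; `write_file_sd_targets` is a hypothetical helper, and port 9100 is assumed for every host.

```shell
# Write a file_sd target file listing Node Exporter endpoints.
# write_file_sd_targets <output.json> <datacenter> <host>...
write_file_sd_targets() {
  local out="$1" datacenter="$2" sep="" host
  shift 2
  {
    printf '[{"targets":['
    for host in "$@"; do
      # Comma-separate entries; the first one gets no leading comma.
      printf '%s"%s:9100"' "$sep" "$host"
      sep=","
    done
    printf '],"labels":{"datacenter":"%s"}}]\n' "$datacenter"
  } > "$out"
}
```

A scrape job would then reference the generated file with a `file_sd_configs` entry whose `files` list contains a path such as `/etc/prometheus/targets/*.json` (a hypothetical location), and adding a server becomes a matter of regenerating the file.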
Use Service Discovery
scrape_configs:
  - job_name: 'consul-discovered'
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
Troubleshooting
Check Prometheus Status
# Verify Prometheus is running
systemctl status prometheus
# Check metrics are being scraped
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .
# View scrape targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'
Debug Node Exporter
# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | wc -l
# Check specific metric
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -5
Verify Alert Rules
# Check rules syntax
promtool check rules /etc/prometheus/alert_rules.yml
# View active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'
Conclusion
You now have a complete, production-ready monitoring stack. Prometheus collects metrics, Node Exporter provides system data, Alertmanager routes alerts, and Grafana visualizes everything. This foundation scales to monitor hundreds of servers. Focus on continuously improving your dashboards, refining alert thresholds, and documenting runbooks for alert responses. Regular maintenance, including data retention review and storage monitoring, ensures long-term reliability.


