Monitoring Kubernetes Clusters with Prometheus
Kubernetes introduces operational complexity requiring sophisticated monitoring. The kube-prometheus-stack provides pre-configured Prometheus, Grafana, and Alertmanager alongside Kubernetes-specific exporters. This guide covers deploying the stack via Helm, configuring ServiceMonitors, creating dashboards, and setting up comprehensive alerting for Kubernetes clusters.
Table of Contents
- Introduction
- Architecture
- System Requirements
- Kubernetes Setup
- Helm Installation
- Kube-Prometheus-Stack Deployment
- ServiceMonitors
- Dashboards
- Alerting Rules
- Scaling and Performance
- Troubleshooting
- Conclusion
Introduction
Monitoring Kubernetes requires observability into the cluster infrastructure, API server, container runtime, and application workloads. The kube-prometheus-stack bundles these components pre-configured, eliminating manual setup while providing proven monitoring defaults.
Architecture
Kubernetes Monitoring Stack
┌────────────────────────────────────────┐
│ Kubernetes Cluster │
│ ┌──────────────────────────────────┐ │
│ │ kubelet (every node) │ │
│ │ ├─ cAdvisor metrics │ │
│ │ ├─ Node metrics │ │
│ │ └─ Pod metrics │ │
│ └──────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────┐ │
│ │ kube-prometheus-stack │ │
│ │ ├─ Prometheus Operator │ │
│ │ ├─ Prometheus Server │ │
│ │ ├─ Alertmanager │ │
│ │ ├─ Grafana │ │
│ │ ├─ Node Exporter │ │
│ │ └─ kube-state-metrics │ │
│ └──────────────────────────────────┘ │
└────────────────────────────────────────┘
↓
External Systems
(Slack, PagerDuty, etc.)
System Requirements
- Kubernetes 1.19+ cluster
- Helm 3.x installed
- kubectl configured and authenticated
- At least 4GB free memory in cluster
- 20GB persistent storage (for Prometheus)
- Internet access for image downloads
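The 20GB figure above is only a starting point: Prometheus disk usage scales with ingested samples and retention. A back-of-envelope estimate can be sketched as follows (all numbers are hypothetical; a common rule of thumb is roughly 1-2 bytes per sample after compression):

```shell
# Rough Prometheus TSDB sizing estimate.
# SAMPLES_PER_SEC ~= active series / scrape interval; the values below
# are illustrative, not measured from a real cluster.
SAMPLES_PER_SEC=10000
BYTES_PER_SAMPLE=2        # conservative compressed sample size
RETENTION_DAYS=15         # matches the retention used later in this guide

bytes=$((SAMPLES_PER_SEC * BYTES_PER_SAMPLE * RETENTION_DAYS * 86400))
echo "Estimated TSDB size: $((bytes / 1024 / 1024 / 1024)) GiB"
```

Add headroom (typically 20-30%) for WAL segments and compaction overhead before sizing the persistent volume.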
Kubernetes Setup
Install kubectl
# Ubuntu/Debian
curl -LO https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Verify
kubectl version --client
Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Access Kubernetes Cluster
# Configure kubectl context
kubectl config use-context your-cluster
# Verify cluster access
kubectl get nodes
kubectl get namespaces
Helm Installation
Add Prometheus Helm Repository
# Add the Prometheus community repository (it also hosts the
# kube-state-metrics and node-exporter charts bundled by the stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# List available charts
helm search repo prometheus-community | grep kube-prometheus-stack
Create Monitoring Namespace
kubectl create namespace monitoring
Create Values File
cat > prometheus-values.yaml << 'EOF'
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
grafana:
  enabled: true
  adminPassword: admin123  # change this for any non-test cluster
  persistence:
    enabled: true
    size: 10Gi
alertmanager:
  enabled: true
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
prometheusOperator:
  enabled: true
prometheus-node-exporter:
  enabled: true
  hostNetwork: true
kube-state-metrics:
  enabled: true
EOF
Kube-Prometheus-Stack Deployment
Deploy Stack
# Install kube-prometheus-stack
helm install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values prometheus-values.yaml
# Verify deployment
kubectl get all -n monitoring
kubectl get pods -n monitoring -w
Verify Components
# Service names are derived from the Helm release name; list them all with:
kubectl get svc -n monitoring
# Check Prometheus
kubectl get svc -n monitoring kube-prometheus-stack-prometheus
# Check Grafana
kubectl get svc -n monitoring kube-prometheus-stack-grafana
# Check Alertmanager
kubectl get svc -n monitoring kube-prometheus-stack-alertmanager
Access Services
# Port forward Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Port forward Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Port forward Alertmanager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Access:
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000
# Alertmanager: http://localhost:9093
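With the port-forwards above running in separate terminals, each component's built-in health endpoint can confirm it is actually serving traffic (these paths are part of each project's HTTP API):

```shell
# Requires the three port-forwards above to be active.
curl -sf http://localhost:9090/-/ready    && echo "Prometheus ready"
curl -sf http://localhost:3000/api/health && echo "Grafana healthy"
curl -sf http://localhost:9093/-/ready    && echo "Alertmanager ready"
```

A non-zero exit from any of these usually means the pod is still starting or the port-forward targets the wrong service.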
ServiceMonitors
Monitor Application Services
Create ServiceMonitor for application exposing metrics:
cat > servicemonitor-example.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: default
  labels:
    release: kube-prometheus-stack  # by default Prometheus only selects ServiceMonitors with the release label
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
EOF
kubectl apply -f servicemonitor-example.yaml
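A ServiceMonitor selects Services (not Pods) by label, and its `port` field refers to a named port on that Service. A minimal Service the ServiceMonitor above would match might look like this (the `my-application` name and port number are illustrative):

```shell
cat > app-service.yaml << 'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-application
  namespace: default
  labels:
    app: my-application        # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: my-application
  ports:
    - name: metrics            # referenced by the ServiceMonitor's "port" field
      port: 8080
      targetPort: 8080
EOF
kubectl apply -f app-service.yaml
```

If the port is unnamed, or the labels differ, the target never appears on Prometheus's targets page.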
Monitor Prometheus Operator
cat > servicemonitor-prometheus.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: prometheus-operator
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: kube-prometheus-operator
endpoints:
- port: metrics
interval: 30s
EOF
kubectl apply -f servicemonitor-prometheus.yaml
PrometheusRule
Create alerting rules for Kubernetes:
cat > prometheusrule-kubernetes.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # required so the operator loads the rule
spec:
  groups:
    - name: kubernetes
      interval: 30s
      rules:
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kubernetes Node not ready (instance {{ $labels.node }})"
        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[1h]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kubernetes Pod crash looping (pod {{ $labels.pod }})"
        - alert: KubernetesMemoryPressure
          expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kubernetes Memory Pressure (node {{ $labels.node }})"
        - alert: KubernetesDiskPressure
          expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kubernetes Disk Pressure (node {{ $labels.node }})"
EOF
kubectl apply -f prometheusrule-kubernetes.yaml
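Once applied, the operator injects the group into Prometheus, and the loaded rules become visible through the /api/v1/rules endpoint. A quick way to scan the response for rule names and states without extra tooling is plain grep; the JSON below is an abridged, hypothetical sample of what that endpoint returns:

```shell
# Sample (abridged, hypothetical) response; on a live cluster fetch it with:
#   curl -s http://localhost:9090/api/v1/rules
cat > /tmp/rules-sample.json << 'EOF'
{"status":"success","data":{"groups":[{"name":"kubernetes","rules":[{"name":"KubernetesNodeNotReady","state":"inactive"},{"name":"KubernetesPodCrashLooping","state":"firing"}]}]}}
EOF

# Extract each rule's name together with its current state
grep -o '"name":"[^"]*","state":"[^"]*"' /tmp/rules-sample.json
```

If your rules are missing from the live response, the usual cause is a missing `release` label on the PrometheusRule object.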
Dashboards
Pre-installed Dashboards
The kube-prometheus-stack ships with pre-built Grafana dashboards, including:
- Kubernetes Cluster
- Kubernetes Nodes
- Kubernetes Pods
- Prometheus Overview
Create Custom Dashboard
Note that cAdvisor metrics label the pod as pod; the older pod_name label was removed in Kubernetes 1.16.
cat > custom-dashboard.json << 'EOF'
{
  "dashboard": {
    "title": "Custom Kubernetes Application",
    "panels": [
      {
        "title": "Pod CPU Usage",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"
          }
        ]
      },
      {
        "title": "Pod Memory Usage",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (pod)"
          }
        ]
      },
      {
        "title": "Pod Network",
        "targets": [
          {
            "expr": "rate(container_network_receive_bytes_total[5m])"
          }
        ]
      }
    ]
  }
}
EOF
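Rather than pasting JSON into the UI, the dashboard can be pushed through Grafana's HTTP API. This assumes the Grafana port-forward from earlier is running and uses the admin password from the values file:

```shell
# POST the dashboard JSON to Grafana's import endpoint.
# The file already wraps the panels in a top-level "dashboard" key,
# which is the payload shape this endpoint expects.
curl -s -X POST \
  -H "Content-Type: application/json" \
  -u admin:admin123 \
  -d @custom-dashboard.json \
  http://localhost:3000/api/dashboards/db
```

A successful import returns a JSON body with the dashboard's uid and url; an authentication or schema error comes back as an HTTP 4xx with a message field.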
Alerting Rules
Configure Alertmanager
The Prometheus Operator manages Alertmanager's configuration itself, so a plain ConfigMap is ignored. With kube-prometheus-stack, supply the configuration through the chart's alertmanager.config value and upgrade the release:
cat > alertmanager-values.yaml << 'EOF'
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: 'critical-team'
          group_wait: 0s
        - match:
            severity: warning
          receiver: 'slack'
    receivers:
      - name: 'default'
      - name: 'critical-team'
        email_configs:
          - to: '[email protected]'
            from: '[email protected]'
            smarthost: 'smtp.example.com:587'
            auth_username: '[email protected]'
            auth_password: 'password'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
            channel: '#alerts'
EOF
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --values alertmanager-values.yaml
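Alertmanager configuration errors are easy to make, and `amtool check-config` (shipped alongside Alertmanager in its release tarball) validates a config file before it reaches the cluster. A minimal sketch, assuming amtool is installed locally:

```shell
# Write a minimal alertmanager.yml to a temp file and validate it.
# Run the same check against your real routing config before upgrading.
cat > /tmp/alertmanager-test.yml << 'EOF'
global:
  resolve_timeout: 5m
route:
  receiver: 'default'
receivers:
  - name: 'default'
EOF
amtool check-config /tmp/alertmanager-test.yml
```

amtool reports the number of routes and receivers it found, or exits non-zero with the offending line on a syntax error.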
Scaling and Performance
High Availability Setup
Note that two Prometheus replicas scrape all targets independently; deduplicating the resulting duplicate series at query time requires an additional layer such as Thanos or a federated global Prometheus.
cat > kube-prometheus-ha-values.yaml << 'EOF'
prometheus:
  prometheusSpec:
    replicas: 2
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    externalLabels:
      cluster: "production"
      region: "us-east-1"
alertmanager:
  alertmanagerSpec:
    replicas: 2
    retention: 120h
grafana:
  replicas: 2
  persistence:
    size: 20Gi
EOF
helm upgrade kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values kube-prometheus-ha-values.yaml
Resource Management
# Check resource usage (requires metrics-server; nodes are not namespaced)
kubectl top nodes
kubectl top pods -n monitoring
# Update resource requests/limits
kubectl set resources deployment \
  -n monitoring \
  kube-prometheus-stack-operator \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=2000m,memory=2Gi
Troubleshooting
Verify Component Health
# Check all pods running
kubectl get pods -n monitoring
# View Prometheus logs (operator-managed pods carry app.kubernetes.io labels)
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
# Check ServiceMonitor discovery (app-metrics was created in the default namespace)
kubectl get servicemonitor -n monitoring
kubectl describe servicemonitor -n default app-metrics
# Verify metrics scraping (adjust the pod name to match your release)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  promtool query instant http://localhost:9090 'up'
Debug Metrics Collection
# Access Prometheus console
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query metrics (quote the URL so the shell does not interpret ? and =)
curl 'http://localhost:9090/api/v1/query?query=kubernetes_build_info'
# Check targets
curl http://localhost:9090/api/v1/targets
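The targets response can be large; a quick grep over it surfaces unhealthy scrape targets without extra tooling. The JSON below is an abridged, hypothetical sample of the endpoint's output:

```shell
# Sample (abridged, hypothetical) /api/v1/targets response; on a live
# cluster fetch it with:
#   curl -s http://localhost:9090/api/v1/targets
cat > /tmp/targets-sample.json << 'EOF'
{"status":"success","data":{"activeTargets":[{"labels":{"job":"kubelet"},"health":"up"},{"labels":{"job":"app-metrics"},"health":"down"}]}}
EOF

# Count targets not reporting health "up"
grep -o '"health":"down"' /tmp/targets-sample.json | wc -l
```

Any non-zero count warrants a look at that target's lastError field in the full response, which states why the scrape failed.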
Common Issues
# ServiceMonitor not picked up
# Check that its labels match Prometheus's serviceMonitorSelector
# (by default the chart selects ServiceMonitors carrying the release label)
kubectl get servicemonitor -n monitoring -o yaml
# Prometheus not scraping targets
# Verify the ServiceMonitor's selector matches the Service's labels
# and that the named port exists on the Service
# Storage issues
# Check PVC status
kubectl get pvc -n monitoring
kubectl describe pvc prometheus-kube-prometheus-prometheus-db-prometheus-0 -n monitoring
Conclusion
The kube-prometheus-stack provides enterprise-grade Kubernetes monitoring out of the box. By following this guide, you've deployed a comprehensive monitoring platform for your Kubernetes infrastructure. Focus on creating meaningful ServiceMonitors for your applications, setting appropriate alert thresholds based on SLOs, and continuously refining dashboards. Kubernetes observability is critical for reliable, scalable deployments.


