Kubernetes Troubleshooting Common Issues
Troubleshooting Kubernetes issues requires systematic debugging skills and understanding of common failure patterns. This guide covers diagnosing CrashLoopBackOff, Pending pods, ImagePullBackOff errors, DNS resolution problems, networking issues, and using kubectl debug for advanced diagnostics on your VPS and baremetal infrastructure.
Table of Contents
- Troubleshooting Methodology
- Pod Status Codes
- CrashLoopBackOff
- Pending Pods
- ImagePullBackOff
- DNS and Networking Issues
- Debugging Tools
- Practical Examples
- Conclusion
Troubleshooting Methodology
Systematic Approach
- Gather Information: Pod status, events, logs
- Identify Root Cause: Use logs and events
- Check Dependencies: Config, secrets, network
- Test Hypothesis: Use debug commands
- Implement Fix: Apply solution
- Verify: Confirm resolution
Key kubectl Commands
# Pod status
kubectl get pods -n namespace
kubectl describe pod pod-name -n namespace
# Events
kubectl get events -n namespace --sort-by='.lastTimestamp'
# Logs
kubectl logs pod-name -n namespace
kubectl logs pod-name -c container-name -n namespace
# Previous logs (for crashed pods)
kubectl logs pod-name --previous -n namespace
# Detailed information
kubectl get pod pod-name -o yaml -n namespace
Pod Status Codes
Pod Phases
Pending: Pod accepted but not yet running
- Container image being pulled
- Awaiting resource availability
- Waiting for dependencies
Running: Pod assigned to node and containers started
- Containers may still be initializing
Succeeded: All containers completed successfully
- Terminal state for one-time tasks
Failed: All containers terminated, at least one with an error
- Check logs for failure reason
Unknown: Unable to determine pod state
- Usually network issue with node
Container States
Waiting: Not yet running
- Reasons: ContainerCreating, ImagePullBackOff, CrashLoopBackOff
Running: Container started and healthy
Terminated: Container exited
- Reasons: Completed, Error, OOMKilled
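These phases and container states surface in the STATUS column of kubectl get pods, which makes them easy to filter with standard shell tools. A small sketch (the sample output and pod names below are made up to stand in for live cluster output):

```shell
# Filter pod names by STATUS; in practice pipe `kubectl get pods` instead of $sample.
sample='NAME    READY   STATUS             RESTARTS   AGE
web-1   1/1     Running            0          2d
job-1   0/1     Completed          0          1d
api-2   0/1     CrashLoopBackOff   12         3h'
echo "$sample" | awk '$3 == "CrashLoopBackOff" {print $1}'   # api-2
```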
CrashLoopBackOff
Identifying CrashLoopBackOff
# Show status
kubectl get pods -n namespace
# Describe for details
kubectl describe pod crashing-pod -n namespace
Example output shows: "State: Waiting (reason: CrashLoopBackOff)"
Common Causes and Solutions
Application Crash:
# Check recent logs
kubectl logs crashing-pod -n namespace
kubectl logs crashing-pod --previous -n namespace
# Check exit code
kubectl describe pod crashing-pod -n namespace | grep "Exit Code"
Exit codes:
- 1: Generic error
- 126: Permission denied
- 127: Command not found
- 137: Killed by SIGKILL (128+9), often OOMKilled
- 139: Segmentation fault (SIGSEGV, 128+11)
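Exit codes above 128 encode 128 plus the number of the signal that killed the container, so the table above can be captured in a small helper (a sketch; the function name is made up):

```shell
# Map a container exit code to a likely cause; codes > 128 mean "killed by signal (code - 128)".
explain_exit() {
  case "$1" in
    0)   echo "success" ;;
    1)   echo "generic application error" ;;
    126) echo "permission denied or not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (128+9): often OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "unknown exit code"
      fi ;;
  esac
}
explain_exit 137   # SIGKILL (128+9): often OOMKilled
```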
Missing Dependency:
# Use an init container to wait for the dependency, plus a startup probe
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.35
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting; sleep 2; done']
  containers:
  - name: app
    image: myapp:1.0
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - "curl -f http://localhost:8080/health"
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 30
Configuration Issues:
# Check environment variables
kubectl set env pod/crashing-pod --list -n namespace
# Check mounted volumes
kubectl describe pod crashing-pod -n namespace | grep -A 5 "Mounts:"
# Verify ConfigMap/Secret
kubectl get configmap -n namespace
kubectl get secret -n namespace
Insufficient Resources:
# Check node resources
kubectl top nodes
kubectl describe node node-name
# Check pod resource requests
kubectl get pod crashing-pod -o yaml -n namespace | grep -A 3 "resources:"
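If the grep turns up no resources block at all, adding explicit requests and limits makes scheduling and OOM behavior predictable. A minimal fragment for the container spec (the numbers are illustrative; size them from observed usage):

```yaml
resources:
  requests:
    cpu: 100m       # what the scheduler reserves
    memory: 128Mi
  limits:
    cpu: 500m       # throttled above this
    memory: 256Mi   # OOM-killed above this
```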
Pending Pods
Identifying Pending Pods
kubectl get pods -n namespace | grep Pending
kubectl describe pod pending-pod -n namespace
Causes and Solutions
Insufficient Resources:
# View node capacity
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity, allocatable: .status.allocatable}'
# Check resource requests
kubectl get pod pending-pod -o yaml | grep -A 3 "resources:"
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
Solution: Increase cluster size or reduce pod resource requirements
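To see how much headroom a node actually has, compare the Allocated resources section of kubectl describe node against its capacity. The parsing step can be sketched with awk (the sample text stands in for live output, so the numbers are illustrative):

```shell
# Extract requested CPU from describe-node style output; in practice pipe
# `kubectl describe node <name>` instead of $sample.
sample='  Resource           Requests      Limits
  --------           --------      ------
  cpu                1500m (75%)   2 (100%)
  memory             2Gi (50%)     4Gi (100%)'
echo "$sample" | awk '$1 == "cpu" {print "cpu requested:", $2}'   # cpu requested: 1500m
```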
Node Affinity/Taints:
# Check pod affinity rules
kubectl get pod pending-pod -o yaml | grep -A 10 "affinity:"
# Remove node taint
kubectl taint nodes node-name key=value:NoSchedule-
# Add pod toleration
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  tolerations:
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule
  containers:
  - name: app
    image: myapp:1.0
Stuck PVC:
# Check PVC status
kubectl get pvc -n namespace
kubectl describe pvc pvc-name -n namespace
# Check StorageClass
kubectl get storageclass
# Check PV
kubectl get pv | grep pvc-name
Solution: Create PVC in available zone, check storage provisioner
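A PVC often sticks in Pending because its storageClassName matches no provisioner in the cluster. A minimal claim to compare against (the name and class below are placeholders; the class must appear in kubectl get storageclass):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard   # must match an existing StorageClass
  resources:
    requests:
      storage: 10Gi
```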
ResourceQuota Exceeded:
# Check quota usage
kubectl describe resourcequota -n namespace
# View quota
kubectl get resourcequota -n namespace
Solution: Increase quota or free resources
ImagePullBackOff
Identifying ImagePullBackOff
kubectl describe pod imagepull-pod -n namespace
# Look for: "Failed to pull image" or "ErrImagePull"
Common Causes and Solutions
Image Not Found:
# Verify image exists
docker pull myregistry.azurecr.io/myapp:1.0
# Check image name spelling
kubectl get pod imagepull-pod -o yaml | grep "image:"
# Check available tags
az acr repository show-tags --name myregistry --repository myapp
Registry Authentication:
# Create docker registry secret
kubectl create secret docker-registry regcred \
--docker-server=myregistry.azurecr.io \
--docker-username=username \
--docker-password=password \
-n namespace
# Use secret in pod
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: myregistry.azurecr.io/myapp:1.0
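Rather than listing imagePullSecrets on every Pod spec, the secret can be attached to the namespace's ServiceAccount so every pod using that account inherits the credential. A sketch reusing the regcred secret created above:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: namespace
imagePullSecrets:
- name: regcred
```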
Network/DNS Issues:
# Test DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup myregistry.azurecr.io
# Test connectivity
kubectl run -it --rm debug --image=busybox -- wget -O- myregistry.azurecr.io
# Check kubelet logs on node
ssh node-name
sudo journalctl -u kubelet -n 50
DNS and Networking Issues
Testing DNS
# Run DNS test pod
kubectl run -it --rm debug --image=busybox -- sh
# Inside pod, test:
nslookup kubernetes.default
nslookup myservice.production.svc.cluster.local
# busybox has no dig; use a fuller image for it:
kubectl run -it --rm debug --image=nicolaka/netshoot -- dig myservice.production.svc.cluster.local
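These lookups all follow the pattern service.namespace.svc.cluster.local (assuming the default cluster.local cluster domain). The composition is trivial to script when building test commands (the helper name is made up):

```shell
# Compose a service FQDN; assumes the default cluster.local cluster domain.
svc_fqdn() {
  echo "$1.$2.svc.cluster.local"
}
svc_fqdn myservice production   # myservice.production.svc.cluster.local
```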
Common DNS Issues
CoreDNS Down:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# View CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Restart CoreDNS
kubectl rollout restart -n kube-system deployment/coredns
Service Discovery Issues:
# Check service
kubectl get svc myservice -n production
kubectl get endpoints myservice -n production
# Check selector matches pods
kubectl get pods -l app=myapp -n production
# Test service connectivity
kubectl run -it --rm test --image=busybox -- sh
# Inside: wget -O- http://myservice.production.svc.cluster.local:8080
Network Connectivity
Pod-to-Pod Communication:
# Test from pod
kubectl run -it --rm debug --image=busybox -- sh
# Inside: wget -O- http://target-pod-ip:8080
# Check network policies
kubectl get networkpolicies -n production
# Test with tcpdump
kubectl run -it --rm debug --image=nicolaka/netshoot -- sh
# Inside: tcpdump -i eth0 -n
Egress Issues:
# Test external connectivity
kubectl run -it --rm debug --image=busybox -- sh
# Inside: wget -O- http://external-service.com
# Check egress gateway
kubectl get pods -n istio-system -l app=istio-egressgateway
# Check network policies blocking egress
kubectl get networkpolicies -A
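A frequent egress gotcha is a default-deny NetworkPolicy that also blocks DNS, so every external lookup fails before any connection is attempted. An allow-DNS policy like this sketch (names are illustrative) restores lookups while keeping other egress locked down:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}      # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:             # no "to" clause: port 53 allowed to any destination
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```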
Debugging Tools
kubectl debug
Interactive debugging pod:
# Attach an ephemeral debug container to a running pod
kubectl debug pod/myapp -n production -it --image=busybox
# Debug crashed pod (create a copy of it)
kubectl debug pod/myapp --copy-to=myapp-debug -n production
# Create debug container
kubectl debug pod/myapp -n production -it --image=busybox --target=container-name
Port Forwarding
Access pod directly:
# Forward local port to pod
kubectl port-forward pod/myapp 8080:8080 -n production
# Test
curl http://localhost:8080
Exec into Container
# Get shell access
kubectl exec -it pod/myapp -n production -- /bin/sh
# Run command
kubectl exec pod/myapp -n production -- curl http://localhost:8080
Logs and Events
# Tail logs
kubectl logs -f pod/myapp -n production
# Previous logs
kubectl logs pod/myapp --previous -n production
# All events in namespace
kubectl get events -n production --sort-by='.lastTimestamp'
# Watch events
kubectl get events -n production -w
Practical Examples
Example: Troubleshooting CrashLoopBackOff
#!/bin/bash
POD_NAME="myapp"
NAMESPACE="production"
echo "=== Step 1: Get Pod Status ==="
kubectl describe pod $POD_NAME -n $NAMESPACE | head -20
echo "=== Step 2: Check Latest Logs ==="
kubectl logs $POD_NAME -n $NAMESPACE --tail=50
echo "=== Step 3: Check Previous Logs (if exists) ==="
kubectl logs $POD_NAME --previous -n $NAMESPACE 2>/dev/null || echo "No previous logs"
echo "=== Step 4: Check Resource Status ==="
kubectl top pod $POD_NAME -n $NAMESPACE
echo "=== Step 5: Check Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo "=== Step 6: Examine Pod YAML ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 3 "resources:"
echo "=== Step 7: Debug with Copy ==="
kubectl debug pod/$POD_NAME --copy-to=${POD_NAME}-debug -n $NAMESPACE
Example: Diagnosing Pending Pod
#!/bin/bash
POD_NAME="pending-pod"
NAMESPACE="production"
echo "=== Checking Why Pod is Pending ==="
# Check describe
echo "=== Pod Description ==="
kubectl describe pod $POD_NAME -n $NAMESPACE
# Check node capacity
echo "=== Node Resources ==="
kubectl top nodes
# Check resource requests
echo "=== Pod Resource Requests ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 3 "resources:"
# Check ResourceQuota
echo "=== ResourceQuota Status ==="
kubectl describe resourcequota -n $NAMESPACE
# Check PVC if exists
echo "=== PVC Status ==="
kubectl get pvc -n $NAMESPACE | grep $POD_NAME
# Check node affinity
echo "=== Node Affinity Rules ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 10 "affinity:"
Conclusion
Effective Kubernetes troubleshooting requires understanding common failure patterns, using appropriate diagnostic tools, and systematically narrowing down root causes. By mastering kubectl describe, logs, events, and debug commands, you can quickly diagnose and resolve most issues. Start with simple status checks, examine logs and events, then progress to advanced debugging with ephemeral containers and network diagnostics. Regular practice with troubleshooting helps you build intuition for identifying issues quickly on your VPS and baremetal Kubernetes infrastructure.