Kubernetes Troubleshooting Common Issues

Troubleshooting Kubernetes issues requires systematic debugging skills and an understanding of common failure patterns. This guide covers diagnosing CrashLoopBackOff, Pending pods, ImagePullBackOff errors, DNS resolution problems, networking issues, and using kubectl debug for advanced diagnostics on your VPS and bare-metal infrastructure.

Troubleshooting Methodology

Systematic Approach

  1. Gather Information: Pod status, events, logs
  2. Identify Root Cause: Use logs and events
  3. Check Dependencies: Config, secrets, network
  4. Test Hypothesis: Use debug commands
  5. Implement Fix: Apply solution
  6. Verify: Confirm resolution
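The gather-information steps above can be sketched as a small triage helper. This is a sketch only; it assumes kubectl is configured for your cluster, and the pod/namespace arguments are placeholders:

```shell
# Sketch of the "gather information" steps; assumes a working kubectl context.
triage() {
  pod="$1"; ns="$2"
  echo "== status =="
  kubectl get pod "$pod" -n "$ns"
  echo "== events =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -10
  echo "== logs =="
  kubectl logs "$pod" -n "$ns" --tail=20
}
# Usage: triage myapp production
```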

Key kubectl Commands

# Pod status
kubectl get pods -n namespace
kubectl describe pod pod-name -n namespace

# Events
kubectl get events -n namespace --sort-by='.lastTimestamp'

# Logs
kubectl logs pod-name -n namespace
kubectl logs pod-name -c container-name -n namespace

# Previous logs (for crashed pods)
kubectl logs pod-name --previous -n namespace

# Detailed information
kubectl get pod pod-name -o yaml -n namespace

Pod Status Codes

Pod Phases

Pending: Pod created but not scheduled

  • Container image being pulled
  • Awaiting resource availability
  • Waiting for dependencies

Running: Pod assigned to node and containers started

  • Containers may still be initializing

Succeeded: All containers completed successfully

  • Terminal state for one-time tasks

Failed: At least one container exited with error

  • Check logs for failure reason

Unknown: Unable to determine pod state

  • Usually the node is unreachable (kubelet not reporting to the API server)

Container States

Waiting: Not yet running

  • Reasons: ContainerCreating, ImagePullBackOff, CrashLoopBackOff

Running: Container started and healthy

Terminated: Container exited

  • Reasons: Completed, Error, Signal
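The container state and its reason live under `.status.containerStatuses` in the pod JSON, so they can be pulled out with jq (which this guide also uses for node capacity). A hypothetical sample status is embedded below so the filter can be seen in action; in practice the input comes from `kubectl get pod POD -n NS -o json`:

```shell
# Extract each container's state (waiting/running/terminated) and reason.
# The sample JSON here is a hypothetical stand-in for real kubectl output.
sample='{"status":{"containerStatuses":[{"name":"app","state":{"waiting":{"reason":"CrashLoopBackOff"}}}]}}'
echo "$sample" | jq -r '.status.containerStatuses[] | "\(.name): \(.state | keys[0]) (\(.state[].reason // "n/a"))"'
```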

CrashLoopBackOff

Identifying CrashLoopBackOff

# Show status
kubectl get pods -n namespace

# Describe for details
kubectl describe pod crashing-pod -n namespace

Example output shows: "State: Waiting (reason: CrashLoopBackOff)"

Common Causes and Solutions

Application Crash:

# Check recent logs
kubectl logs crashing-pod -n namespace
kubectl logs crashing-pod --previous -n namespace

# Check exit code
kubectl describe pod crashing-pod -n namespace | grep "Exit Code"

Exit codes:

  • 1: Generic error
  • 126: Permission denied
  • 127: Command not found
  • 137: Killed by SIGKILL (128 + 9), typically the OOM killer
  • 139: Segmentation fault (SIGSEGV, 128 + 11)
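Codes above 128 encode "killed by signal (code - 128)", which is why 137 and 139 map to SIGKILL and SIGSEGV. A minimal decoder for reading these codes:

```shell
# Decode a container exit code: values above 128 mean the process was
# killed by signal (code - 128); lower values are the program's own exit code.
decode_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with code $code"
  fi
}
decode_exit 137   # -> killed by signal 9
decode_exit 126   # -> exited with code 126
```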

Missing Dependency:

# Use an init container to wait for the dependency, plus a startup probe for slow starts
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.35
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting; sleep 2; done']
  containers:
  - name: app
    image: myapp:1.0
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - "curl -f http://localhost:8080/health"
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 30
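The `until nc -z ...` loop in the init container above runs forever if the dependency never appears. A bounded variant fails fast instead, so the pod surfaces an Init error rather than hanging. A sketch, assuming nc (netcat) is available in the image as it is in busybox, with host/port as placeholders:

```shell
# Wait-for-dependency loop with a retry limit (sketch).
wait_for_tcp() {
  host="$1"; port="$2"; tries="${3:-30}"
  i=0
  while ! nc -z "$host" "$port" 2>/dev/null; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && echo "timed out waiting for $host:$port" && return 1
    echo "waiting for $host:$port ($i/$tries)"
    sleep 2
  done
  echo "$host:$port is up"
}
# Usage: wait_for_tcp db 5432 30
```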

Configuration Issues:

# Check environment variables
kubectl set env pod/crashing-pod --list -n namespace

# Check mounted volumes
kubectl describe pod crashing-pod -n namespace | grep -A 5 "Mounts:"

# Verify ConfigMap/Secret
kubectl get configmap -n namespace
kubectl get secret -n namespace

Insufficient Resources:

# Check node resources
kubectl top nodes
kubectl describe node node-name

# Check pod resource requests
kubectl get pod crashing-pod -o yaml -n namespace | grep -A 3 "resources:"

Pending Pods

Identifying Pending Pods

kubectl get pods -n namespace | grep Pending
kubectl describe pod pending-pod -n namespace

Causes and Solutions

Insufficient Resources:

# View node capacity
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity, allocatable: .status.allocatable}'

# Check resource requests
kubectl get pod pending-pod -o yaml | grep -A 3 "resources:"

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Solution: Increase cluster size or reduce pod resource requirements
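When comparing pod requests against node allocatable, CPU quantities appear in two notations ("500m" vs "2"), so a common unit is needed. A small normalizer to millicores (a sketch; integer core counts assumed, since fractional cores are written with the m suffix in practice):

```shell
# Normalize Kubernetes CPU quantities to millicores for comparison:
# "500m" -> 500, "2" -> 2000. Assumes whole-core values when no "m" suffix.
cpu_to_millicores() {
  q="$1"
  case "$q" in
    *m) echo "${q%m}" ;;
    *)  echo "$((q * 1000))" ;;
  esac
}
cpu_to_millicores 500m   # -> 500
cpu_to_millicores 2      # -> 2000
```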

Node Affinity/Taints:

# Check pod affinity rules
kubectl get pod pending-pod -o yaml | grep -A 10 "affinity:"

# Remove node taint
kubectl taint nodes node-name key=value:NoSchedule-

# Add pod toleration
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  tolerations:
  - key: key
    operator: Equal
    value: value
    effect: NoSchedule
  containers:
  - name: app
    image: myapp:1.0

Stuck PVC:

# Check PVC status
kubectl get pvc -n namespace
kubectl describe pvc pvc-name -n namespace

# Check StorageClass
kubectl get storageclass

# Check PV
kubectl get pv | grep pvc-name

Solution: Create PVC in available zone, check storage provisioner

ResourceQuota Exceeded:

# Check quota usage
kubectl describe resourcequota -n namespace

# View quota
kubectl get resourcequota -n namespace

Solution: Increase quota or free resources

ImagePullBackOff

Identifying ImagePullBackOff

kubectl describe pod imagepull-pod -n namespace

# Look for: "Failed to pull image" or "ErrImagePull"

Common Causes and Solutions

Image Not Found:

# Verify image exists
docker pull myregistry.azurecr.io/myapp:1.0

# Check image name spelling
kubectl get pod imagepull-pod -o yaml | grep "image:"

# Check available tags
az acr repository show-tags --name myregistry --repository myapp

Registry Authentication:

# Create docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.azurecr.io \
  --docker-username=username \
  --docker-password=password \
  -n namespace

# Use secret in pod
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: myregistry.azurecr.io/myapp:1.0
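Instead of passing the password on the command line (where it lands in shell history), an existing local docker login can be reused. A sketch, assuming ~/.docker/config.json holds credentials for the registry:

```shell
# Create the pull secret from an existing docker login (sketch).
# Assumes ~/.docker/config.json contains credentials for the registry.
create_regcred() {
  ns="$1"
  kubectl create secret generic regcred \
    --from-file=.dockerconfigjson="$HOME/.docker/config.json" \
    --type=kubernetes.io/dockerconfigjson \
    -n "$ns"
}
# Usage: create_regcred namespace
```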

Network/DNS Issues:

# Test DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup myregistry.azurecr.io

# Test connectivity
kubectl run -it --rm debug --image=busybox -- wget -O- myregistry.azurecr.io

# Check kubelet logs on node
ssh node-name
sudo journalctl -u kubelet -n 50

DNS and Networking Issues

Testing DNS

# Run DNS test pod
kubectl run -it --rm debug --image=busybox -- sh

# Inside pod, test:
nslookup kubernetes.default
nslookup myservice.production.svc.cluster.local
dig myservice.production.svc.cluster.local   # dig is not in busybox; use an image like nicolaka/netshoot
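The names being tested above follow a fixed pattern: `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain is usually cluster.local. A sketch of the rule, handy when scripting DNS checks:

```shell
# Build a service FQDN: <service>.<namespace>.svc.<cluster-domain>.
# The default cluster domain cluster.local is assumed; override via arg 3.
svc_fqdn() {
  svc="$1"; ns="$2"; domain="${3:-cluster.local}"
  echo "$svc.$ns.svc.$domain"
}
svc_fqdn myservice production   # -> myservice.production.svc.cluster.local
```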

Common DNS Issues

CoreDNS Down:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# View CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS
kubectl rollout restart -n kube-system deployment/coredns

Service Discovery Issues:

# Check service
kubectl get svc myservice -n production
kubectl get endpoints myservice -n production

# Check selector matches pods
kubectl get pods -l app=myapp -n production

# Test service connectivity
kubectl run -it --rm test --image=busybox -- sh
# Inside: wget -O- http://myservice.production.svc.cluster.local:8080
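A frequent root cause of empty endpoints is a Service selector that matches no pod labels. The check can be scripted with jq; here hypothetical sample JSON stands in for the real output of `kubectl get svc ... -o json` and `kubectl get pod ... -o json`:

```shell
# Does every selector key/value appear in the pod's labels? (sketch)
svc='{"spec":{"selector":{"app":"myapp"}}}'
pod='{"metadata":{"labels":{"app":"myapp","tier":"web"}}}'
selector=$(echo "$svc" | jq -c '.spec.selector')
labels=$(echo "$pod" | jq -c '.metadata.labels')
match=$(jq -n --argjson s "$selector" --argjson l "$labels" \
  '$s | to_entries | all(.value == $l[.key])')
echo "selector matches pod labels: $match"
```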

Network Connectivity

Pod-to-Pod Communication:

# Test from pod
kubectl run -it --rm debug --image=busybox -- sh
# Inside: wget -O- http://target-pod-ip:8080

# Check network policies
kubectl get networkpolicies -n production

# Test with tcpdump
kubectl run -it --rm debug --image=nicolaka/netshoot -- sh
# Inside: tcpdump -i eth0 -n

Egress Issues:

# Test external connectivity
kubectl run -it --rm debug --image=busybox -- sh
# Inside: wget -O- http://external-service.com

# Check egress gateway
kubectl get pods -n istio-system -l app=istio-egressgateway

# Check network policies blocking egress
kubectl get networkpolicies -A

Debugging Tools

kubectl debug

Interactive debugging pod:

# Attach an ephemeral debug container to a running pod
kubectl debug pod/myapp -n production -it --image=busybox

# Debug crashed pod (copy image)
kubectl debug pod/myapp --copy-to=myapp-debug -n production

# Create debug container
kubectl debug pod/myapp -n production -it --image=busybox --target=container-name

Port Forwarding

Access pod directly:

# Forward local port to pod
kubectl port-forward pod/myapp 8080:8080 -n production

# Test
curl http://localhost:8080
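The forward takes a moment to establish, so a first curl can fail spuriously. A small retry loop makes the check reliable; a sketch, with the URL and attempt count as placeholders:

```shell
# Retry until the forwarded port answers, since kubectl port-forward
# needs a moment to establish the tunnel.
wait_for_http() {
  url="$1"; tries="${2:-10}"
  i=0
  while ! curl -fsS "$url" >/dev/null 2>&1; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && echo "gave up on $url" && return 1
    sleep 1
  done
  echo "$url is responding"
}
# kubectl port-forward pod/myapp 8080:8080 -n production &
# wait_for_http http://localhost:8080
```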

Exec into Container

# Get shell access
kubectl exec -it pod/myapp -n production -- /bin/sh

# Run command
kubectl exec pod/myapp -n production -- curl http://localhost:8080

Logs and Events

# Tail logs
kubectl logs -f pod/myapp -n production

# Previous logs
kubectl logs pod/myapp --previous -n production

# All events in namespace
kubectl get events -n production --sort-by='.lastTimestamp'

# Watch events
kubectl get events -n production -w

Practical Examples

Example: Troubleshooting CrashLoopBackOff

#!/bin/bash

POD_NAME="myapp"
NAMESPACE="production"

echo "=== Step 1: Get Pod Status ==="
kubectl describe pod $POD_NAME -n $NAMESPACE | head -20

echo "=== Step 2: Check Latest Logs ==="
kubectl logs $POD_NAME -n $NAMESPACE --tail=50

echo "=== Step 3: Check Previous Logs (if exists) ==="
kubectl logs $POD_NAME --previous -n $NAMESPACE 2>/dev/null || echo "No previous logs"

echo "=== Step 4: Check Resource Status ==="
kubectl top pod $POD_NAME -n $NAMESPACE

echo "=== Step 5: Check Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo "=== Step 6: Examine Pod YAML ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 3 "resources:"

echo "=== Step 7: Debug with Copy ==="
kubectl debug pod/$POD_NAME --copy-to=${POD_NAME}-debug -n $NAMESPACE

Example: Diagnosing Pending Pod

#!/bin/bash

POD_NAME="pending-pod"
NAMESPACE="production"

echo "=== Checking Why Pod is Pending ==="

# Check describe
echo "=== Pod Description ==="
kubectl describe pod $POD_NAME -n $NAMESPACE

# Check node capacity
echo "=== Node Resources ==="
kubectl top nodes

# Check resource requests
echo "=== Pod Resource Requests ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 3 "resources:"

# Check ResourceQuota
echo "=== ResourceQuota Status ==="
kubectl describe resourcequota -n $NAMESPACE

# Check PVC if exists
echo "=== PVC Status ==="
kubectl get pvc -n $NAMESPACE | grep $POD_NAME

# Check node affinity
echo "=== Node Affinity Rules ==="
kubectl get pod $POD_NAME -o yaml -n $NAMESPACE | grep -A 10 "affinity:"

Conclusion

Effective Kubernetes troubleshooting requires understanding common failure patterns, using appropriate diagnostic tools, and systematically narrowing down root causes. By mastering kubectl describe, logs, events, and debug commands, you can quickly diagnose and resolve most issues. Start with simple status checks, examine logs and events, then progress to advanced debugging with ephemeral containers and network diagnostics. Regular practice builds the intuition to identify issues quickly on your VPS and bare-metal Kubernetes infrastructure.