High Load Average: What It Means and How to Fix It
Introduction
Load average is one of the most commonly monitored yet frequently misunderstood metrics in Linux system administration. A high load average can indicate system stress, but interpreting what constitutes "high" and identifying the underlying cause requires understanding what load average actually measures and how it differs from CPU usage.
This comprehensive guide explains what load average means, how to interpret it correctly, and provides systematic approaches to diagnosing and resolving high load situations. You'll learn to distinguish between CPU-bound, I/O-bound, and uninterruptible sleep issues, enabling you to quickly identify and fix the root cause of system performance degradation.
Understanding load average is critical for maintaining system performance and preventing outages. While CPU usage shows instantaneous utilization, load average reveals system stress over time, making it an invaluable metric for capacity planning and performance troubleshooting.
Understanding Load Average
What is Load Average?
Load average represents the average number of processes in a runnable or uninterruptible state over a specific time period. It includes:
- Runnable processes (R state): Running on a CPU or waiting in the run queue
- Uninterruptible sleep (D state): Blocked waiting for I/O (typically disk or NFS)
Key insight: Load average measures demand for system resources, not just CPU usage.
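These values come straight from the kernel; /proc/loadavg exposes them together with a count of currently runnable scheduling entities, which is a quick way to see the demand side without parsing uptime output:
cat /proc/loadavg
# Example output: 1.50 2.30 1.80 2/345 12345
# First three fields: the load averages
# Fourth field: currently runnable entities / total entities
# Fifth field: PID of the most recently created process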
The Three Numbers
Load average shows three values:
load average: 1.50, 2.30, 1.80
(left to right: the 1-minute, 5-minute, and 15-minute averages)
Interpretation:
- 1-minute: Current load trend
- 5-minute: Recent load trend
- 15-minute: Long-term load trend
What's Normal?
Load average is relative to your CPU count:
# Check CPU count
nproc
lscpu | grep "^CPU(s)"
# 4-core system examples:
# Load: 1.0 = 25% utilized
# Load: 2.0 = 50% utilized
# Load: 4.0 = 100% utilized
# Load: 8.0 = 200% utilized (overloaded)
General guidelines:
- Load < CPU count: System healthy
- Load = CPU count: System at capacity
- Load > CPU count: System overloaded
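Because these guidelines are relative to core count, a per-core ratio is often easier to reason about and alert on than the raw figure; a minimal sketch using bc (a result at or above 1.00 means the system is at or over capacity):
LOAD=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
echo "scale=2; $LOAD / $CORES" | bc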
Initial Load Average Assessment
Quick Load Check
# View load average
uptime
# Detailed system load
w
# Load with system info
top -bn1 | head -5
# Historical load (if sar installed)
sar -q
# CPU count for comparison
CPUS=$(nproc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)
echo "CPUs: $CPUS, Load: $LOAD"
# Quick health check
if (( $(echo "$LOAD > $CPUS" | bc -l) )); then
echo "System overloaded!"
else
echo "System load normal"
fi
Understanding the Pattern
# Increasing load (system getting busier)
load average: 0.80, 1.50, 2.30
# Decreasing load (system recovering)
load average: 2.30, 1.50, 0.80
# Spike load (recent event)
load average: 5.20, 1.50, 0.80
# Sustained high load (ongoing problem)
load average: 8.50, 8.20, 7.90
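A small script can classify the pattern automatically by comparing the 1-minute and 15-minute averages; a rough sketch, with the 1.5x factor chosen arbitrarily as a spike threshold:
read -r ONE FIVE FIFTEEN _ < /proc/loadavg
if (( $(echo "$ONE > $FIFTEEN * 1.5" | bc -l) )); then
    echo "Load rising sharply (1-min $ONE vs 15-min $FIFTEEN)"
elif (( $(echo "$ONE < $FIFTEEN / 1.5" | bc -l) )); then
    echo "Load falling (1-min $ONE vs 15-min $FIFTEEN)"
else
    echo "Load roughly steady"
fi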
Step 1: Determining Load Type
CPU-Bound vs I/O-Bound
# Check CPU usage
top -bn1 | grep "Cpu(s)"
# If CPU usage high (>70%) AND load high
# THEN CPU-bound
# If I/O wait (wa) high (>20%) AND load high
# THEN I/O-bound
# Check I/O wait specifically
mpstat 1 5 | awk '/Average/ {print "I/O Wait:", $6"%"}'
# Detailed CPU breakdown
mpstat -P ALL 1 3
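The same decision logic can be scripted; a rough sketch based on vmstat's final sample, with thresholds that are illustrative rather than hard rules:
# us=$13, sy=$14, wa=$16 in vmstat's output; take the last of 5 one-second samples
read -r US SY WA <<< "$(vmstat 1 5 | tail -1 | awk '{print $13, $14, $16}')"
if [ "$((US + SY))" -gt 70 ]; then
    echo "Likely CPU-bound (user+system = $((US + SY))%)"
elif [ "$WA" -gt 20 ]; then
    echo "Likely I/O-bound (iowait = ${WA}%)"
else
    echo "Check for many short-lived tasks or processes stuck in D state"
fi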
Identifying Process States
# Count processes in each state
ps aux | awk '{print $8}' | sort | uniq -c
# R = Running (CPU-bound)
# D = Uninterruptible sleep (I/O-bound)
# S = Sleeping
# Z = Zombie
# T = Stopped
# Find processes in D state
ps aux | awk '$8 ~ /D/ {print}'
# Find processes in R state
ps aux | awk '$8 ~ /R/ {print}'
# Count runnable vs waiting
echo "Runnable: $(ps aux | awk '$8 ~ /R/' | wc -l)"
echo "I/O Wait: $(ps aux | awk '$8 ~ /D/' | wc -l)"
Step 2: CPU-Bound Load Analysis
Identifying CPU Consumers
# Top CPU processes
ps aux --sort=-%cpu | head -15
# Real-time CPU monitoring
top
# Press 'P' to sort by CPU
# Press '1' to see per-core usage
# Per-process CPU
pidstat -u 1 5
# CPU usage by command
ps aux | awk 'NR>1 {cmd[$11]++; cpu[$11]+=$3} END {for(c in cmd) print cmd[c], cpu[c], c}' | sort -k2 -rn | head -15
# Find processes using close to a full core
ps aux | awk '$3 > 90 {print}'
# Multi-threaded CPU usage
top -H
# Shows threads instead of processes
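When top -H shows one hot thread, it helps to list the threads of that process non-interactively; a small sketch using ps, where 1234 is a placeholder PID:
ps -L -o pid,lwp,pcpu,comm -p 1234 --sort=-pcpu | head -10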
CPU Analysis
# Check CPU frequency
lscpu | grep MHz
# CPU throttling check
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
# Context switches (high = CPU contention)
vmstat 1 5
# Look at 'cs' column
# Per-CPU utilization
mpstat -P ALL 1 3
# CPU steal time (virtualization)
top -bn1 | grep "Cpu(s)" | awk '{print "Steal:", $16}'
# High steal = hypervisor contention
Step 3: I/O-Bound Load Analysis
Disk I/O Analysis
# I/O statistics
iostat -x 1 5
# Key metrics:
# %util approaching 100% = device saturated
# await consistently > 20ms = slow responses
# (svctm is deprecated in recent sysstat releases; rely on await and %util instead)
# Per-process I/O
iotop -o
# Shows only processes doing I/O
# I/O by process
pidstat -d 1 5
# Memory-heavy processes (memory pressure often surfaces as swap I/O)
ps aux --sort=-%mem | head -15
# Disk usage
df -h
du -sh /* | sort -rh | head -10
Identifying I/O Bottlenecks
# Processes in D state (uninterruptible)
ps aux | awk '$8 ~ /D/ {print}'
# What these processes are waiting for
for pid in $(ps aux | awk '$8 ~ /D/ {print $2}'); do
    echo "PID: $pid"
    cat /proc/$pid/wchan 2>/dev/null; echo
    cat /proc/$pid/stack 2>/dev/null   # reading the kernel stack requires root
done
# Check for NFS hangs (df itself can hang on a dead NFS mount, so bound it with timeout)
mount | grep nfs
timeout 5 df -h | grep nfs
# Network I/O
iftop -i eth0
nethogs
# Check swap I/O
vmstat 1 5
# si/so columns show swap in/out
Step 4: System Resource Analysis
Memory Pressure
# Memory status
free -h
# Swap usage
swapon --show
vmstat 1 5
# Memory by process
ps aux --sort=-%mem | head -15
# Page faults per process (major faults indicate disk reads)
ps -eo pid,maj_flt,min_flt,cmd --sort=-maj_flt | head -10
# OOM events
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"
grep -i "killed process" /var/log/kern.log  # path varies: /var/log/messages on RHEL
Network Load
# Network connections
ss -s
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
# Connection count
ss -tan | grep -c ESTAB
# Network throughput
iftop -i eth0
# Bandwidth by process
nethogs eth0
# Network errors
ip -s link
netstat -i
Step 5: Process Analysis
Finding Problematic Processes
# Processes sorted by load contribution
ps aux --sort=-%cpu | head -20
# Long-running processes
ps -eo pid,user,etime,%cpu,%mem,cmd --sort=-etime | head -20
# Processes with most threads
ps -eo pid,nlwp,cmd --sort=-nlwp | head -15
# Find runaway processes
ps aux | awk '$3 > 80 || $4 > 80 {print}'
# Process tree
ps auxf | less
pstree -p | less
Detailed Process Investigation
# Investigate specific process
PID=1234
# What's it doing?
strace -p $PID -c
# System calls
strace -p $PID 2>&1 | head -100
# Open files
lsof -p $PID
# Thread count
ps -o nlwp -p $PID
# CPU affinity
taskset -p $PID
# Memory map
pmap -x $PID
# Stack trace
cat /proc/$PID/stack
Solutions and Remediation
CPU-Bound Solutions
Immediate actions:
# Lower process priority
renice +10 PID
# Set CPU affinity (limit to specific cores)
taskset -cp 0,1 PID
# Limit CPU usage with cpulimit
cpulimit -p PID -l 50 # Limit to 50%
# Kill resource-intensive process
kill PID
kill -9 PID # Force kill
Configuration fixes:
# Apache/Nginx worker limits
# Apache - /etc/apache2/mods-available/mpm_prefork.conf
MaxRequestWorkers 150
# Nginx - /etc/nginx/nginx.conf
worker_processes auto;
worker_connections 1024;
# PHP-FPM - /etc/php/<version>/fpm/pool.d/www.conf
pm.max_children = 50
# MySQL - /etc/mysql/my.cnf
max_connections = 200
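After editing any of these files, validate the syntax and reload rather than restart where possible; a sketch assuming systemd service names apache2 and nginx (on RHEL-family systems the Apache service is httpd):
nginx -t && systemctl reload nginx
apachectl configtest && systemctl reload apache2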
I/O-Bound Solutions
Immediate actions:
# Sync and clear cache
sync
echo 3 > /proc/sys/vm/drop_caches
# Adjust I/O scheduler (list valid options first; newer kernels offer mq-deadline, bfq, kyber, none)
cat /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler
# Lower process I/O priority
ionice -c 3 -p PID # Idle class
# Reduce swappiness
sysctl vm.swappiness=10
echo "vm.swappiness=10" >> /etc/sysctl.conf
Long-term fixes:
# Optimize filesystem mounts
# Add 'noatime' to /etc/fstab
/dev/sda1 / ext4 defaults,noatime 0 1
# Increase read-ahead
blockdev --setra 8192 /dev/sda
# Optimize database
# MySQL buffer pool
innodb_buffer_pool_size = 4G
# Add SSD for database
# Move MySQL to SSD partition
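Relocating the data directory is the usual way to do this; a minimal sketch assuming a Debian-style layout with the SSD mounted at /mnt/ssd (adjust paths to your system and update AppArmor/SELinux policy for the new location):
systemctl stop mysql
rsync -a /var/lib/mysql/ /mnt/ssd/mysql/
chown -R mysql:mysql /mnt/ssd/mysql
# Point datadir at the new location in /etc/mysql/my.cnf:
#   datadir = /mnt/ssd/mysql
systemctl start mysql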
Memory Optimization
# Clear memory cache
sync && echo 3 > /proc/sys/vm/drop_caches
# Add swap
dd if=/dev/zero of=/swapfile bs=1G count=4
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Adjust swappiness
sysctl vm.swappiness=10
# Set in /etc/sysctl.conf
vm.swappiness = 10
vm.vfs_cache_pressure = 50
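# Apply the persistent settings from /etc/sysctl.conf without rebooting
sysctl -p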
# Kill the largest memory consumer (inspect it first; killing blindly can take down a critical service)
kill $(ps aux --sort=-%mem | head -2 | tail -1 | awk '{print $2}')
Monitoring and Prevention
Load Monitoring Script
cat > /usr/local/bin/load-monitor.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/load-monitor.log"
ALERT_EMAIL="[email protected]"
CPU_COUNT=$(nproc)
THRESHOLD=$(echo "$CPU_COUNT * 1.5" | bc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)
echo "$(date): Load: $LOAD, CPUs: $CPU_COUNT" >> "$LOG_FILE"
if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )); then
echo "$(date): High load detected: $LOAD" >> "$LOG_FILE"
# Capture system state
echo "=== Top Processes ===" >> "$LOG_FILE"
ps aux --sort=-%cpu | head -15 >> "$LOG_FILE"
echo "=== I/O Stats ===" >> "$LOG_FILE"
iostat -x >> "$LOG_FILE"
echo "=== Memory ===" >> "$LOG_FILE"
free -h >> "$LOG_FILE"
# Send alert
echo "High load: $LOAD on $(hostname)" | \
mail -s "Load Alert: $LOAD" "$ALERT_EMAIL"
fi
EOF
chmod +x /usr/local/bin/load-monitor.sh
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/load-monitor.sh") | crontab -
Performance Baseline
cat > /usr/local/bin/performance-baseline.sh << 'EOF'
#!/bin/bash
BASELINE_DIR="/var/log/performance-baseline"
mkdir -p "$BASELINE_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
# Capture baseline
uptime > "$BASELINE_DIR/uptime-$DATE.txt"
ps aux --sort=-%cpu | head -50 > "$BASELINE_DIR/processes-$DATE.txt"
iostat -x > "$BASELINE_DIR/iostat-$DATE.txt"
free -h > "$BASELINE_DIR/memory-$DATE.txt"
mpstat -P ALL > "$BASELINE_DIR/cpu-$DATE.txt"
echo "Baseline captured: $DATE"
EOF
chmod +x /usr/local/bin/performance-baseline.sh
(crontab -l 2>/dev/null; echo "0 */6 * * * /usr/local/bin/performance-baseline.sh") | crontab -
Capacity Planning
# Track load over time with sar
sar -q 1 86400 > daily-load.txt
# Average load for the day
sar -q | awk '/Average/ {print "Avg Load:", $4, $5, $6}'
# Peak load times
sar -q | sort -k5 -rn | head -10
# Trend analysis
cat > /tmp/load-trend.sh << 'EOF'
#!/bin/bash
# sa files live in /var/log/sysstat on Debian/Ubuntu and /var/log/sa on RHEL
for i in {30..1}; do
DATE=$(date -d "$i days ago" +%d)
AVG=$(sar -q -f /var/log/sysstat/sa$DATE 2>/dev/null | \
awk '/Average/ {print $4}')
echo "$(date -d "$i days ago" +%Y-%m-%d): $AVG"
done
EOF
chmod +x /tmp/load-trend.sh
/tmp/load-trend.sh
Advanced Diagnostics
Using perf
# System-wide performance analysis
perf top
# Record events
perf record -a -g -- sleep 30
# Analyze recording
perf report
# CPU flame graph
perf record -F 99 -a -g -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl > load-analysis.svg
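The stackcollapse-perf.pl and flamegraph.pl scripts come from Brendan Gregg's FlameGraph repository, which the commands above assume has been cloned into the working directory:
git clone https://github.com/brendangregg/FlameGraph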
Kernel Analysis
# Check kernel messages
dmesg | tail -100
# Kernel parameters
sysctl -a | grep -E "threads-max|pid_max"
# Process limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max
# Current process count
ps aux | wc -l
Conclusion
Load average is a critical metric for understanding system health, but it must be interpreted in the context of CPU count and other metrics. Key takeaways:
- Compare to CPU count: Load relative to cores, not absolute
- Check trend: Rising vs falling vs stable load
- Identify type: CPU-bound vs I/O-bound
- Find root cause: Top processes, disk I/O, or memory
- Monitor continuously: Track baselines and trends
- Capacity plan: Use historical data for resource planning
- Act appropriately: Different causes need different solutions
Understanding load average enables proactive capacity planning and rapid resolution of performance issues. Regular monitoring, baseline tracking, and these diagnostic techniques ensure optimal system performance and prevent load-related outages.


