High Load Average: What It Means and How to Fix It

Introduction

Load average is one of the most commonly monitored yet frequently misunderstood metrics in Linux system administration. A high load average can indicate system stress, but interpreting what constitutes "high" and identifying the underlying cause requires understanding what load average actually measures and how it differs from CPU usage.

This guide explains what load average means, how to interpret it correctly, and how to diagnose and resolve high load systematically. You'll learn to distinguish between CPU-bound, I/O-bound, and uninterruptible-sleep issues, so you can quickly identify and fix the root cause of system performance degradation.

Understanding load average is critical for maintaining system performance and preventing outages. While CPU usage shows instantaneous utilization, load average reveals system stress over time, making it an invaluable metric for capacity planning and performance troubleshooting.

Understanding Load Average

What is Load Average?

Load average represents the average number of processes in a runnable or uninterruptible state over a specific time period. It includes:

  1. Runnable processes (R state): Ready to run, waiting for CPU
  2. Uninterruptible sleep (D state): Blocked on I/O (typically disk or network filesystems such as NFS)

Key insight: Load average measures demand for system resources, not just CPU usage.

The Three Numbers

Load average shows three values:

load average: 1.50, 2.30, 1.80
              ^     ^     ^
              |     |     |
              |     |     +-- 15-minute average
              |     +-------- 5-minute average
              +-------------- 1-minute average

Interpretation:

  • 1-minute: Current load trend
  • 5-minute: Recent load trend
  • 15-minute: Long-term load trend
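
These values come straight from /proc/loadavg, which also reports the number of currently runnable scheduling entities out of the total, plus the last PID created; reading it directly is a quick sanity check:

# Raw kernel load data
cat /proc/loadavg
# Example: 1.50 2.30 1.80 2/345 12345
# Fields: 1-min, 5-min, 15-min, runnable/total scheduling entities, last PID created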

What's Normal?

Load average is relative to your CPU count:

# Check CPU count
nproc
lscpu | grep "^CPU(s)"

# 4-core system examples:
# Load: 1.0 = 25% of capacity
# Load: 2.0 = 50% of capacity
# Load: 4.0 = 100% of capacity (at capacity)
# Load: 8.0 = 200% of capacity (overloaded)

General guidelines:

  • Load < CPU count: System healthy
  • Load = CPU count: System at capacity
  • Load > CPU count: System overloaded
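
As a quick sketch of the "relative to CPU count" rule, the load figures can be normalized per core; values near or above 1.0 per core indicate saturation (this assumes nproc and /proc/loadavg, as used elsewhere in this guide):

# Load per core
awk -v cores="$(nproc)" '{printf "1m: %.2f  5m: %.2f  15m: %.2f (load per core)\n", $1/cores, $2/cores, $3/cores}' /proc/loadavg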

Initial Load Average Assessment

Quick Load Check

# View load average
uptime

# Detailed system load
w

# Load with system info
top -bn1 | head -5

# Historical load (if sar installed)
sar -q

# CPU count for comparison
CPUS=$(nproc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)
echo "CPUs: $CPUS, Load: $LOAD"

# Quick health check
if (( $(echo "$LOAD > $CPUS" | bc -l) )); then
    echo "System overloaded!"
else
    echo "System load normal"
fi

Understanding the Pattern

# Increasing load (system getting busier)
load average: 0.80, 1.50, 2.30

# Decreasing load (system recovering)
load average: 2.30, 1.50, 0.80

# Spike load (recent event)
load average: 5.20, 1.50, 0.80

# Sustained high load (ongoing problem)
load average: 8.50, 8.20, 7.90
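
A minimal sketch that classifies the trend automatically by comparing the 1-minute and 15-minute averages (the 1.5x and 0.5x thresholds are arbitrary and can be tuned):

# Rising, falling, or steady?
read ONE FIVE FIFTEEN REST < /proc/loadavg
if (( $(echo "$ONE > $FIFTEEN * 1.5" | bc -l) )); then
    echo "Load rising (1m: $ONE, 15m: $FIFTEEN)"
elif (( $(echo "$ONE < $FIFTEEN * 0.5" | bc -l) )); then
    echo "Load falling (1m: $ONE, 15m: $FIFTEEN)"
else
    echo "Load steady (1m: $ONE, 15m: $FIFTEEN)"
fi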

Step 1: Determining Load Type

CPU-Bound vs I/O-Bound

# Check CPU usage
top -bn1 | grep "Cpu(s)"

# If CPU usage high (>70%) AND load high
# THEN CPU-bound

# If I/O wait (wa) high (>20%) AND load high
# THEN I/O-bound

# Check I/O wait specifically
mpstat 1 5 | awk '/Average/ {print "I/O Wait:", $6"%"}'

# Detailed CPU breakdown
mpstat -P ALL 1 3
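
The two heuristics above can be combined into a rough classifier; this sketch reads the mpstat averages and applies the 70% CPU and 20% I/O-wait thresholds mentioned earlier (LC_ALL=C keeps the decimal separator predictable):

# Rough CPU-bound vs I/O-bound classification
read USR SYS WAIT <<< "$(LC_ALL=C mpstat 1 3 | awk '/Average/ {print $3, $5, $6}')"
BUSY=$(echo "$USR + $SYS" | bc)
if (( $(echo "$BUSY > 70" | bc -l) )); then
    echo "Likely CPU-bound (usr+sys: ${BUSY}%)"
elif (( $(echo "$WAIT > 20" | bc -l) )); then
    echo "Likely I/O-bound (iowait: ${WAIT}%)"
else
    echo "Check process states below (runnable backlog or D-state waits)"
fi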

Identifying Process States

# Count processes in each state
ps aux | awk 'NR>1 {print $8}' | sort | uniq -c   # NR>1 skips the header row

# R = Running (CPU-bound)
# D = Uninterruptible sleep (I/O-bound)
# S = Sleeping
# Z = Zombie
# T = Stopped

# Find processes in D state
ps aux | awk '$8 ~ /D/ {print}'

# Find processes in R state
ps aux | awk '$8 ~ /R/ {print}'

# Count runnable vs waiting
echo "Runnable: $(ps aux | awk '$8 ~ /R/' | wc -l)"
echo "I/O Wait: $(ps aux | awk '$8 ~ /D/' | wc -l)"

Step 2: CPU-Bound Load Analysis

Identifying CPU Consumers

# Top CPU processes
ps aux --sort=-%cpu | head -15

# Real-time CPU monitoring
top
# Press 'P' to sort by CPU
# Press '1' to see per-core usage

# Per-process CPU
pidstat -u 1 5

# Total CPU usage by command (sorted by CPU, with process counts)
ps aux | awk 'NR>1 {cpu[$11]+=$3; cnt[$11]++} END {for (c in cpu) print cpu[c], cnt[c], c}' | sort -rn | head -15

# Find processes using more than 90% CPU
ps aux | awk '$3 > 90 {print}'

# Multi-threaded CPU usage
top -H
# Shows threads instead of processes

CPU Analysis

# Check CPU frequency
lscpu | grep MHz

# CPU throttling check
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

# Context switches (high = CPU contention)
vmstat 1 5
# Look at 'cs' column

# Per-CPU utilization
mpstat -P ALL 1 3

# CPU steal time (virtualization)
top -bn1 | grep "Cpu(s)" | awk '{print "Steal:", $16}'
# High steal = hypervisor contention
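
# Alternative steal check via mpstat averages (field positions in top output
# can shift between versions, so this is a more stable cross-check)
mpstat 1 3 | awk '/Average/ {print "Steal:", $9"%"}'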

Step 3: I/O-Bound Load Analysis

Disk I/O Analysis

# I/O statistics
iostat -x 1 5

# Key metrics:
# %util approaching 100% = device saturated
# await > 20ms = slow responses
# (svctm is deprecated and has been removed from recent iostat versions)

# Per-process I/O
iotop -o
# Shows only processes doing I/O

# I/O by process
pidstat -d 1 5

# Find I/O-intensive processes (accumulated I/O over a 5-second window)
iotop -b -o -a -n 5 -d 1 | head -20

# Disk usage
df -h
du -sh /* | sort -rh | head -10

Identifying I/O Bottlenecks

# Processes in D state (uninterruptible)
ps aux | awk '$8 ~ /D/ {print}'

# What these processes are waiting for
for pid in $(ps aux | awk '$8 ~ /D/ {print $2}'); do
    echo "PID: $pid"
    cat /proc/$pid/wchan 2>/dev/null
    cat /proc/$pid/stack 2>/dev/null
done

# Check for NFS hangs
mount | grep nfs
df -h | grep nfs

# Network I/O
iftop -i eth0
nethogs

# Check swap I/O
vmstat 1 5
# si/so columns show swap in/out

Step 4: System Resource Analysis

Memory Pressure

# Memory status
free -h

# Swap usage
swapon --show
vmstat 1 5

# Memory by process
ps aux --sort=-%mem | head -15

# Page fault counters (pgfault = all faults, pgmajfault = major faults hitting disk)
grep -E "^pgfault|^pgmajfault" /proc/vmstat

# OOM events
dmesg | grep -i "out of memory"
grep "killed process" /var/log/kern.log

Network Load

# Network connections
ss -s
netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c

# Connection count
ss -tan | grep -c ESTAB

# Network throughput
iftop -i eth0

# Bandwidth by process
nethogs eth0

# Network errors
ip -s link
netstat -i

Step 5: Process Analysis

Finding Problematic Processes

# Processes sorted by load contribution
ps aux --sort=-%cpu | head -20

# Long-running processes
ps -eo pid,user,etime,%cpu,%mem,cmd --sort=-etime | head -20

# Processes with most threads
ps -eo pid,nlwp,cmd --sort=-nlwp | head -15

# Find runaway processes
ps aux | awk '$3 > 80 || $4 > 80 {print}'

# Process tree
ps auxf | less
pstree -p | less

Detailed Process Investigation

# Investigate specific process
PID=1234

# What's it doing?
strace -p $PID -c

# System calls
strace -p $PID 2>&1 | head -100

# Open files
lsof -p $PID

# Thread count
ps -o nlwp -p $PID

# CPU affinity
taskset -p $PID

# Memory map
pmap -x $PID

# Stack trace
cat /proc/$PID/stack

Solutions and Remediation

CPU-Bound Solutions

Immediate actions:

# Lower process priority (higher nice value = lower priority)
renice -n 10 -p PID

# Set CPU affinity (limit to specific cores)
taskset -cp 0,1 PID

# Limit CPU usage with cpulimit
cpulimit -p PID -l 50  # Limit to 50%

# Kill resource-intensive process
kill PID
kill -9 PID  # Force kill

Configuration fixes:

# Apache/Nginx worker limits
# Apache - /etc/apache2/mods-available/mpm_prefork.conf
MaxRequestWorkers 150

# Nginx - /etc/nginx/nginx.conf
worker_processes auto;
worker_connections 1024;  # goes inside the events {} block

# PHP-FPM - /etc/php/<version>/fpm/pool.d/www.conf (path varies by distro/version)
pm.max_children = 50

# MySQL - /etc/mysql/my.cnf
max_connections = 200

I/O-Bound Solutions

Immediate actions:

# Sync and clear cache
sync
echo 3 > /proc/sys/vm/drop_caches

# Adjust I/O scheduler (modern multi-queue kernels use mq-deadline; check what is available first)
cat /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

# Lower process I/O priority
ionice -c 3 -p PID  # Idle class

# Reduce swappiness
sysctl vm.swappiness=10
echo "vm.swappiness=10" >> /etc/sysctl.conf

Long-term fixes:

# Optimize filesystem mounts
# Add 'noatime' to /etc/fstab
/dev/sda1 / ext4 defaults,noatime 0 1

# Increase read-ahead
blockdev --setra 8192 /dev/sda

# Optimize database
# MySQL buffer pool
innodb_buffer_pool_size = 4G

# Add SSD for database
# Move MySQL to SSD partition

Memory Optimization

# Clear memory cache
sync && echo 3 > /proc/sys/vm/drop_caches

# Add swap
dd if=/dev/zero of=/swapfile bs=1G count=4
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Adjust swappiness
sysctl vm.swappiness=10

# Set in /etc/sysctl.conf
vm.swappiness = 10
vm.vfs_cache_pressure = 50

# Kill the single largest memory consumer (verify the PID first in production)
kill $(ps aux --sort=-%mem | awk 'NR==2 {print $2}')

Monitoring and Prevention

Load Monitoring Script

cat > /usr/local/bin/load-monitor.sh << 'EOF'
#!/bin/bash

LOG_FILE="/var/log/load-monitor.log"
ALERT_EMAIL="admin@example.com"
CPU_COUNT=$(nproc)
THRESHOLD=$(echo "$CPU_COUNT * 1.5" | bc)

LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)

echo "$(date): Load: $LOAD, CPUs: $CPU_COUNT" >> "$LOG_FILE"

if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )); then
    echo "$(date): High load detected: $LOAD" >> "$LOG_FILE"

    # Capture system state
    echo "=== Top Processes ===" >> "$LOG_FILE"
    ps aux --sort=-%cpu | head -15 >> "$LOG_FILE"

    echo "=== I/O Stats ===" >> "$LOG_FILE"
    iostat -x >> "$LOG_FILE"

    echo "=== Memory ===" >> "$LOG_FILE"
    free -h >> "$LOG_FILE"

    # Send alert
    echo "High load: $LOAD on $(hostname)" | \
        mail -s "Load Alert: $LOAD" "$ALERT_EMAIL"
fi
EOF

chmod +x /usr/local/bin/load-monitor.sh
echo "*/5 * * * * /usr/local/bin/load-monitor.sh" | crontab -

Performance Baseline

cat > /usr/local/bin/performance-baseline.sh << 'EOF'
#!/bin/bash

BASELINE_DIR="/var/log/performance-baseline"
mkdir -p "$BASELINE_DIR"
DATE=$(date +%Y%m%d-%H%M%S)

# Capture baseline
uptime > "$BASELINE_DIR/uptime-$DATE.txt"
ps aux --sort=-%cpu | head -50 > "$BASELINE_DIR/processes-$DATE.txt"
iostat -x > "$BASELINE_DIR/iostat-$DATE.txt"
free -h > "$BASELINE_DIR/memory-$DATE.txt"
mpstat -P ALL > "$BASELINE_DIR/cpu-$DATE.txt"

echo "Baseline captured: $DATE"
EOF

chmod +x /usr/local/bin/performance-baseline.sh
echo "0 */6 * * * /usr/local/bin/performance-baseline.sh" | crontab -

Capacity Planning

# Track load over time with sar (one sample per minute for 24 hours)
sar -q 60 1440 > daily-load.txt

# Average load for the day
sar -q | awk '/Average/ {print "Avg Load:", $4, $5, $6}'

# Peak load times
sar -q | sort -k5 -rn | head -10

# Trend analysis
cat > /tmp/load-trend.sh << 'EOF'
#!/bin/bash
for i in {30..1}; do
    DATE=$(date -d "$i days ago" +%d)
    AVG=$(sar -q -f /var/log/sysstat/sa$DATE 2>/dev/null | \
        awk '/Average/ {print $4}')
    echo "$(date -d "$i days ago" +%Y-%m-%d): $AVG"
done
EOF

chmod +x /tmp/load-trend.sh
/tmp/load-trend.sh

Advanced Diagnostics

Using perf

# System-wide performance analysis
perf top

# Record events
perf record -a -g -- sleep 30

# Analyze recording
perf report

# CPU flame graph
perf record -F 99 -a -g -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl > load-analysis.svg
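
The flame graph pipeline above assumes Brendan Gregg's FlameGraph scripts have been cloned into the working directory:

# Fetch the FlameGraph helper scripts
git clone https://github.com/brendangregg/FlameGraph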

Kernel Analysis

# Check kernel messages
dmesg | tail -100

# Kernel parameters
sysctl -a | grep -E "threads-max|pid_max"

# Process limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Current process count
ps -e --no-headers | wc -l
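
A quick headroom check against the limits above; note that threads, not just processes, count toward threads-max (a minimal sketch using ps):

# Compare current thread count against the kernel-wide limit
THREADS=$(ps -eLf --no-headers | wc -l)
MAX=$(cat /proc/sys/kernel/threads-max)
echo "Threads in use: $THREADS of $MAX"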

Conclusion

Load average is a critical metric for understanding system health, but it must be interpreted in context with CPU count and other metrics. Key takeaways:

  1. Compare to CPU count: Load relative to cores, not absolute
  2. Check trend: Rising vs falling vs stable load
  3. Identify type: CPU-bound vs I/O-bound
  4. Find root cause: Top processes, disk I/O, or memory
  5. Monitor continuously: Track baselines and trends
  6. Capacity plan: Use historical data for resource planning
  7. Act appropriately: Different causes need different solutions

Understanding load average enables proactive capacity planning and rapid resolution of performance issues. Regular monitoring, baseline tracking, and these diagnostic techniques ensure optimal system performance and prevent load-related outages.