High CPU Usage: Diagnostics with top, ps, pidstat
Introduction
High CPU usage is one of the most common performance issues system administrators encounter. When CPU resources are exhausted, servers become slow, unresponsive, or completely unavailable, directly impacting application performance and user experience. Identifying the root cause quickly is critical to maintaining service quality and preventing downtime.
This guide takes a systematic approach to diagnosing high CPU usage with command-line tools that ship with, or are easily installed on, most Linux distributions (pidstat, mpstat, and sar come from the sysstat package). You'll learn how to use top, ps, pidstat, and related utilities to identify CPU-intensive processes, analyze their behavior, and apply fixes.
Whether you're managing web servers, database servers, or application servers, understanding CPU diagnostics is essential for maintaining optimal performance. This guide covers everything from basic CPU monitoring to advanced profiling techniques that help you pinpoint exact causes of CPU bottlenecks.
Understanding CPU Usage
CPU Metrics Explained
Before diagnosing issues, understand these key CPU metrics:
- User Time (us): CPU time spent running user-space processes
- System Time (sy): CPU time spent in kernel-space operations
- Nice Time (ni): CPU time for processes with adjusted priority
- Idle Time (id): CPU time spent idle
- I/O Wait (wa): CPU time waiting for I/O operations
- Hardware Interrupts (hi): CPU time servicing hardware interrupts
- Software Interrupts (si): CPU time servicing software interrupts
- Steal Time (st): CPU time stolen by hypervisor (virtualization)
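These percentages are derived from the cumulative tick counters on the aggregate "cpu" line of /proc/stat. As a rough sketch of that arithmetic, the hypothetical helper below (cpu_breakdown is not a standard tool) takes two samples of that line and prints the share of each state between them. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and ignores the trailing guest fields, which the kernel already counts inside user time.

```shell
# Compute CPU state percentages from two samples of the "cpu" line in /proc/stat.
# cpu_breakdown is an illustrative helper name, not an existing utility.
cpu_breakdown() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total <= 0) { print "no elapsed ticks"; exit 1 }
            printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
                100*d[2]/total, 100*d[3]/total, 100*d[4]/total, 100*d[5]/total,
                100*d[6]/total, 100*d[7]/total, 100*d[8]/total, 100*d[9]/total
        }'
}

# Live usage (Linux only): sample twice, one second apart.
# first=$(grep '^cpu ' /proc/stat); sleep 1; second=$(grep '^cpu ' /proc/stat)
# cpu_breakdown "$first" "$second"
```

Sampling twice matters: a single read of /proc/stat only gives totals since boot, which hides what the CPU is doing right now.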
What Constitutes High CPU Usage?
CPU usage interpretation depends on context:
- 0-40%: Normal light load
- 40-70%: Moderate load, usually acceptable
- 70-90%: High load, investigate if sustained
- 90-100%: Critical, immediate investigation needed
Important: Brief spikes to 100% are normal. Sustained high usage indicates problems.
Load Average vs CPU Usage
Load average is the average number of tasks that are running, waiting for a CPU, or (on Linux) in uninterruptible sleep, measured over the last 1, 5, and 15 minutes:
# View load average
uptime
# Output: load average: 2.50, 1.80, 1.45
# Interpretation:
# - Load < CPU count: System healthy
# - Load = CPU count: System at capacity
# - Load > CPU count: System overloaded
For a 4-core system:
- Load average of 2.0 = 50% utilized
- Load average of 4.0 = 100% utilized
- Load average of 8.0 = 200% of capacity (overloaded)
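The same arithmetic can be scripted. This is a minimal sketch with a hypothetical helper name (load_per_core_pct); it simply divides the load average by the core count. Keep in mind that on Linux the load average also counts tasks in uninterruptible sleep, so the result is an indicator rather than a precise CPU utilization figure.

```shell
# Express a load average as a percentage of available cores.
# load_per_core_pct is an illustrative helper name.
load_per_core_pct() {
    # $1 = load average, $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "invalid core count"; exit 1 }
        printf "%.0f\n", 100 * load / cores
    }'
}

# Live usage: load_per_core_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```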
Initial CPU Assessment
Quick CPU Status Check
Start with these rapid assessment commands:
# System load and uptime
uptime
# CPU count
nproc
lscpu | grep "^CPU(s)"
# Current CPU usage
top -bn1 | grep "Cpu(s)"
# Per-core CPU usage (without an interval, mpstat reports averages since boot)
mpstat -P ALL 1 1
# Quick process overview (header plus top 10; ps %CPU is averaged over each process's lifetime)
ps aux --sort=-%cpu | head -11
# System resource summary
vmstat 1 5
Quick interpretation:
# If load average > CPU count
# AND CPU usage > 80%
# THEN investigate immediately
# If iowait > 30%
# THEN problem is I/O, not pure CPU
# If steal > 10%
# THEN virtualization overhead issue
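The three rules above can be folded into one small function. This is only a sketch: triage_cpu is a hypothetical name, the thresholds are the ones quoted above, and the order in which the rules are checked (steal, then iowait, then CPU) is an assumption made for the example.

```shell
# Apply the quick-interpretation rules to a set of measurements.
# triage_cpu is an illustrative helper; rule order is an assumption.
triage_cpu() {
    # $1 = 1-minute load average, $2 = core count,
    # $3 = CPU busy %, $4 = iowait %, $5 = steal %
    awk -v load="$1" -v cores="$2" -v busy="$3" -v wa="$4" -v st="$5" 'BEGIN {
        if (st > 10)
            print "steal: virtualization overhead"
        else if (wa > 30)
            print "iowait: problem is I/O, not pure CPU"
        else if (load > cores && busy > 80)
            print "cpu: investigate immediately"
        else
            print "ok: no rule triggered"
    }'
}
```

For example, `triage_cpu 6.2 4 92 3 0` reports a CPU problem, while the same load with 45% iowait points at storage instead.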
Step 1: Using top for CPU Analysis
Basic top Usage
The top command is the most common CPU monitoring tool:
# Interactive top
top
# Batch mode (one iteration)
top -bn1
# Monitor specific user
top -u username
# Update every 2 seconds
top -d 2
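# Show specific number of processes (-n sets the iteration count, not the row count, so trim the output)
top -bn1 | head -27   # 7 summary/header lines + 20 processes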
# Sort by CPU usage (default)
# In interactive mode, press:
# P = Sort by CPU
# M = Sort by Memory
# T = Sort by Time
# c = Show command line
# 1 = Show individual cores
Interpreting top Output
top - 10:30:45 up 5 days, 2:15, 3 users, load average: 4.23, 3.87, 2.91
Tasks: 247 total, 2 running, 245 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.3 us, 8.2 sy, 0.0 ni, 2.1 id, 2.1 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem: 16384.0 total, 2048.5 free, 12288.3 used, 2047.2 buff/cache
MiB Swap: 4096.0 total, 3072.5 free, 1023.5 used. 3584.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 www-data 20 0 2.1g 1.5g 12m R 95.3 9.3 123:45 php-fpm
5678 mysql 20 0 3.2g 2.1g 256m S 45.1 13.1 567:23 mysqld
Key observations:
- Load average 4.23 on 4-core system = overloaded
- User CPU 87.3% = application/process issue
- I/O wait 2.1% = not an I/O problem
- PID 1234 using 95.3% CPU (top reports per-process %CPU relative to a single core) = primary culprit
- php-fpm is the process to investigate first
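Picking out the heavy consumers can be automated for batch-mode output. The sketch below uses a hypothetical helper (hot_processes) and assumes top's default column layout, where %CPU is the ninth column and COMMAND the twelfth, and a locale that prints decimals with a dot.

```shell
# List processes at or above a %CPU threshold from `top -b` output on stdin.
# hot_processes is an illustrative helper; it assumes the default column order.
hot_processes() {
    # $1 = %CPU threshold
    awk -v limit="$1" '
        $1 == "PID" { in_table = 1; next }   # column header marks the start of the process table
        in_table && NF >= 12 && $9 + 0 >= limit { printf "%s %s %s\n", $1, $12, $9 }
    '
}

# Live usage: top -bn1 | hot_processes 50
```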
Advanced top Commands
# Save top output to file
top -bn1 > cpu-snapshot.txt
# Monitor specific process
top -p 1234
# Monitor multiple processes
top -p 1234,5678,9012
# Show threads instead of processes
top -H
# Show threads for specific process
top -H -p 1234
# Highlight running processes
# In interactive mode, press 'z' for color
# Show full command path
# Press 'c' in interactive mode
# Filter by user
# Press 'u' then enter username
Capturing CPU Snapshots
# Capture CPU usage over time
for i in {1..10}; do
echo "=== Snapshot $i at $(date) ===" >> cpu-monitor.log
top -bn1 | head -20 >> cpu-monitor.log
sleep 60
done
# Automated monitoring script
cat > /tmp/cpu-monitor.sh << 'EOF'
#!/bin/bash
while true; do
CPU=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')  # total busy % = 100 - idle (top's second field is user time only)
if (( $(echo "$CPU > 80" | bc -l) )); then
echo "$(date): High CPU detected: $CPU%" >> /var/log/cpu-alerts.log
top -bn1 | head -20 >> /var/log/cpu-alerts.log
fi
sleep 60
done
EOF
chmod +x /tmp/cpu-monitor.sh
Step 2: Using ps for Process Analysis
Basic ps Commands
The ps command provides detailed process information:
# All processes sorted by CPU
ps aux --sort=-%cpu
# Top 10 CPU consumers
ps aux --sort=-%cpu | head -11
# All processes sorted by memory
ps aux --sort=-%mem | head -11
# Processes by specific user (grep would also match the username anywhere in the line)
ps -u username -o pid,%cpu,%mem,etime,cmd --sort=-%cpu
# Show process hierarchy
ps auxf
ps -ejH
# Custom output format
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head -20
# Show threads
ps -eLf
# Process count by user (user= suppresses the header so it is not counted)
ps -eo user= | sort | uniq -c | sort -rn
Advanced ps Analysis
# Long-running processes (oldest start time first)
ps -eo pid,user,lstart,etime,%cpu,cmd --sort=start_time | head -20
# Processes with most threads
ps -eo pid,nlwp,cmd --sort=-nlwp | head -15
# Processes grouped by command
ps aux --sort=-%cpu | awk '{print $11}' | sort | uniq -c | sort -rn
# Zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
# Real-time process monitoring
watch -n 1 'ps aux --sort=-%cpu | head -15'
# Process tree for specific PID
ps --forest -p 1234
pstree -p 1234
# CPU usage by process name (-C matches the exact command name and avoids counting grep itself)
ps -C process_name -o %cpu= | awk '{sum+=$1} END {print "Total CPU:", (sum+0)"%"}'
Detailed Process Information
# Full process details
ps -fp 1234
# All information for process
ps -F -p 1234
# Process environment variables
cat /proc/1234/environ | tr '\0' '\n'
# Process command line
cat /proc/1234/cmdline | tr '\0' ' '
# Process status
cat /proc/1234/status
# Process CPU affinity
taskset -p 1234
# Process limits
cat /proc/1234/limits
# Number of open file descriptors (plain ls avoids counting the "total" line that ls -l prints)
ls /proc/1234/fd | wc -l
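The %CPU column in ps is CPU time divided by the process's total running time, so it can understate a process that only recently became busy. To measure usage over a short window instead, the cumulative utime and stime counters in /proc/PID/stat can be sampled twice. The helpers below (stat_cpu_ticks and cpu_pct_between) are hypothetical names for a sketch of that calculation; the result is relative to one core, like top's %CPU.

```shell
# Extract utime + stime (in clock ticks) from the contents of /proc/<pid>/stat.
# The command name is wrapped in parentheses and may contain spaces, so
# everything up to the last ") " is removed before counting fields.
stat_cpu_ticks() {
    # $1 = contents of /proc/<pid>/stat
    # After the cut, field 1 is the state; utime and stime are fields 12 and 13.
    printf '%s\n' "$1" | awk '{ sub(/^.*\) /, ""); print $12 + $13 }'
}

# Convert a tick delta into a percentage of one core.
cpu_pct_between() {
    # $1 = ticks at start, $2 = ticks at end, $3 = interval in seconds, $4 = ticks per second
    awk -v a="$1" -v b="$2" -v secs="$3" -v hz="$4" 'BEGIN {
        printf "%.1f\n", 100 * (b - a) / (hz * secs)
    }'
}

# Live usage for PID 1234 over 5 seconds:
# hz=$(getconf CLK_TCK)
# t1=$(stat_cpu_ticks "$(cat /proc/1234/stat)"); sleep 5
# t2=$(stat_cpu_ticks "$(cat /proc/1234/stat)")
# cpu_pct_between "$t1" "$t2" 5 "$hz"
```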
Step 3: Using pidstat for Detailed Analysis
Installing and Basic Usage
# Install sysstat (includes pidstat)
apt install sysstat # Debian/Ubuntu
yum install sysstat # CentOS/RHEL
# Enable sysstat
systemctl enable sysstat
systemctl start sysstat
# Basic pidstat usage
pidstat
# Monitor every 2 seconds
pidstat 2
# Monitor for 10 iterations
pidstat 2 10
# Monitor specific process
pidstat -p 1234
# Monitor multiple processes
pidstat -p 1234,5678,9012 2
Advanced pidstat Analysis
# Per-thread statistics
pidstat -t
# Per-thread for specific process
pidstat -t -p 1234 2
# Show full command line with arguments
pidstat -l
# CPU statistics only
pidstat -u
# I/O statistics
pidstat -d
# Memory statistics
pidstat -r
# Context switches
pidstat -w
# All statistics combined
pidstat -u -d -r -w -p 1234 2
# Monitor by task name
pidstat -C php-fpm 2
# All statistics on a single line per task, easier to parse (recent sysstat versions offer --human for readable units)
pidstat -h 2
Interpreting pidstat Output
# pidstat -u 2
Linux 5.4.0-42-generic (server01) 01/11/2026 _x86_64_
10:45:32 AM UID PID %usr %system %guest %wait %CPU CPU Command
10:45:34 AM 1000 1234 85.00 5.00 0.00 2.00 90.00 2 php-fpm
10:45:34 AM 1001 5678 25.00 15.00 0.00 5.00 40.00 0 mysqld
Key metrics:
- %usr: User-space CPU usage
- %system: Kernel-space CPU usage
- %guest: CPU time spent running a virtual processor (guest VMs)
- %wait: Time the task spent runnable but waiting for a CPU
- %CPU: Total CPU usage
- CPU: CPU core number
Context Switch Analysis
Context switch rates help separate CPU contention from I/O-bound behavior:
# Monitor context switches
pidstat -w 2
# Output interpretation:
# cswch/s = voluntary context switches (I/O wait, sleep)
# nvcswch/s = involuntary context switches (preempted)
# High involuntary switches = CPU contention
# High voluntary switches = I/O bound process
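The same counters that pidstat -w reports are exposed per process in /proc/PID/status. As a sketch, the hypothetical helpers below read those two counters and turn a pair of samples into per-second rates. The "more involuntary than voluntary" test is a deliberately simple heuristic chosen for illustration, not a rule taken from pidstat.

```shell
# Print "voluntary nonvoluntary" counters from /proc/<pid>/status on stdin.
ctxsw_fields() {
    awk '/^voluntary_ctxt_switches:/    { v = $2 }
         /^nonvoluntary_ctxt_switches:/ { n = $2 }
         END { print v + 0, n + 0 }'
}

# Turn two samples into rates and apply a simple heuristic.
classify_ctxsw() {
    # $1,$2 = voluntary/nonvoluntary at start; $3,$4 = at end; $5 = interval in seconds
    awk -v v1="$1" -v n1="$2" -v v2="$3" -v n2="$4" -v secs="$5" 'BEGIN {
        v = (v2 - v1) / secs
        n = (n2 - n1) / secs
        printf "cswch/s=%.1f nvcswch/s=%.1f ", v, n
        if (n > v)
            print "mostly preempted (CPU contention likely)"
        else
            print "mostly voluntary (blocking on I/O, locks, or sleep)"
    }'
}

# Live usage: ctxsw_fields < /proc/1234/status
```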
Step 4: CPU Profiling and Analysis
Using mpstat
Monitor per-CPU core statistics:
# Install if needed (part of sysstat)
apt install sysstat
# Show all CPU cores
mpstat -P ALL
# Update every 2 seconds
mpstat -P ALL 2
# Show specific CPU core
mpstat -P 0 2
# Extended statistics
mpstat -A 2
# JSON output
mpstat -o JSON 2 5
Interpreting mpstat:
# Unbalanced load across cores
# CPU0: 100%, CPU1: 20%, CPU2: 15%, CPU3: 10%
# Indicates: Single-threaded bottleneck
# Balanced load
# CPU0: 80%, CPU1: 85%, CPU2: 82%, CPU3: 87%
# Indicates: Work spread across cores (multi-threaded application or many busy processes)
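A quick way to quantify balance is the spread between the busiest and the least busy core. The sketch below uses a hypothetical helper (core_balance); the 50-point cutoff is an arbitrary value picked for illustration and should be tuned to your workload.

```shell
# Report the spread between per-core busy percentages.
# core_balance is an illustrative helper; the 50-point threshold is arbitrary.
core_balance() {
    # Arguments: busy percentage of each core, e.g. core_balance 100 20 15 10
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        NR == 1 { min = $1 + 0; max = $1 + 0 }
        {
            if ($1 + 0 < min) min = $1 + 0
            if ($1 + 0 > max) max = $1 + 0
        }
        END {
            spread = max - min
            verdict = (spread > 50) ? "unbalanced" : "balanced"
            printf "min=%.1f max=%.1f spread=%.1f %s\n", min, max, spread, verdict
        }'
}
```

With the two examples above, `core_balance 100 20 15 10` yields a spread of 90 (unbalanced) and `core_balance 80 85 82 87` a spread of 7 (balanced).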
Using vmstat
System-wide performance overview:
# Basic vmstat
vmstat 1 10
# Active/inactive memory instead of buff/cache
vmstat -a 2
# Wide output (keeps columns aligned on large-memory systems)
vmstat -w 2
# Output interpretation:
# r = processes waiting for CPU (runnable)
# b = processes in uninterruptible sleep
# us = user CPU time
# sy = system CPU time
# id = idle time
# wa = I/O wait time
Critical indicators:
# r column > CPU count = CPU bottleneck
# wa > 30% = I/O bottleneck, not CPU
# sy > 30% = excessive system calls
# us > 70% with r > cores = CPU overload
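These indicators can be checked mechanically against a vmstat sample. The helper below (vmstat_flags, a hypothetical name) is a sketch that assumes vmstat's default column order: r b swpd free buff cache si so bi bo in cs us sy id wa st.

```shell
# Flag the critical indicators in one vmstat data line read from stdin.
# vmstat_flags is an illustrative helper; it assumes the default column layout.
vmstat_flags() {
    # $1 = CPU core count
    awk -v cores="$1" '
        NF >= 17 && $1 ~ /^[0-9]+$/ {
            r = $1 + 0; us = $13 + 0; sy = $14 + 0; wa = $16 + 0
            out = ""
            if (r > cores)            out = out " run-queue>cores"
            if (wa > 30)              out = out " iowait"
            if (sy > 30)              out = out " high-system-time"
            if (us > 70 && r > cores) out = out " cpu-overload"
            result = (out == "") ? "ok" : substr(out, 2)
            print result
        }'
}

# Live usage: vmstat 1 2 | tail -1 | vmstat_flags "$(nproc)"
```

The live usage takes the second sample because the first line vmstat prints is an average since boot.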
Using sar
Historical performance data:
# Install and enable
apt install sysstat
systemctl enable sysstat
# CPU usage (last 10 minutes)
sar -u -s $(date -d '10 minutes ago' +%H:%M:%S)
# Per-core statistics
sar -P ALL
# Historical CPU data (Debian/Ubuntu path; RHEL/CentOS stores these files under /var/log/sa/)
sar -u -f /var/log/sysstat/sa$(date +%d)
# Yesterday's CPU data
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)
# CPU statistics for specific time
sar -u -s 10:00:00 -e 11:00:00
# Generate report
sar -u > cpu-report.txt
Step 5: Identifying CPU Bottleneck Causes
Application Issues
# Check for runaway processes
ps aux --sort=-%cpu | head -5
# Check process uptime (oldest start time first)
ps -eo pid,user,etime,%cpu,cmd --sort=start_time | head -15
# Multiple instances of same process (pgrep does not count itself, unlike ps | grep)
pgrep -c process_name
# Check process nice values
ps -eo pid,ni,cmd --sort=ni
# Processes in uninterruptible sleep (D state)
ps aux | awk '$8 ~ /D/ {print}'
Infinite Loops and Bugs
# Monitor process CPU over time
while true; do
ps -p 1234 -o %cpu,cmd
sleep 1
done
# Check if process is stuck
strace -p 1234 -c
# Look for repetitive system calls
# Sample process execution
strace -p 1234 -f -e trace=all 2>&1 | head -100
# Check for tight loops
perf record -p 1234 -g -- sleep 10
perf report
Database Query Issues
# MySQL slow queries
mysql -e "SHOW FULL PROCESSLIST;" | grep -v Sleep
# MySQL process list by time
mysql -e "SELECT * FROM information_schema.processlist WHERE command != 'Sleep' ORDER BY time DESC;"
# PostgreSQL active queries
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
# Check database CPU usage (-C matches exact command names, so grep is not counted)
ps -C mysqld,postgres -o %cpu= | awk '{sum+=$1} END {print "DB CPU:", (sum+0)"%"}'
Web Server Load
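# Apache processes (the binary is apache2 on Debian/Ubuntu, httpd on RHEL/CentOS)
pgrep -c apache2
pgrep -c httpd
# Apache CPU usage
ps -C apache2 -o %cpu= | awk '{sum+=$1} END {print "Apache CPU:", (sum+0)"%"}'
# Nginx worker CPU (the [n] keeps grep from matching its own command line)
ps aux | grep "[n]ginx: worker" | awk '{sum+=$3} END {print "Nginx CPU:", (sum+0)"%"}'
# PHP-FPM pool status (requires pm.status_path = /status in the pool config and a matching web server route)
curl http://localhost/status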
Container/Virtualization Issues
# Docker container CPU usage
docker stats --no-stream
# Container CPU limits
docker inspect container_name | grep -i cpu
# Check steal time (hypervisor overhead)
top -bn1 | grep "Cpu(s)" | awk '{print $16}'
# If steal > 10%, virtualization overhead is high
Step 6: Advanced Diagnostic Techniques
Using perf
Performance profiling tool:
# Install perf
apt install linux-tools-common linux-tools-$(uname -r)
# Record system-wide CPU profile
perf record -a -g -- sleep 30
# Record specific process
perf record -p 1234 -g -- sleep 30
# View report
perf report
# Top functions consuming CPU
perf top
# CPU cycle analysis
perf stat -p 1234 sleep 10
# Cache misses
perf stat -e cache-misses,cache-references -p 1234 sleep 10
CPU Flame Graphs
Visualize CPU consumption:
# Clone FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
# Capture data
perf record -F 99 -a -g -- sleep 60
# Generate flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg
# For specific process
perf record -F 99 -p 1234 -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > process-flamegraph.svg
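Before opening the SVG, the folded output of stackcollapse-perf.pl can be summarized on the command line. Each folded line is a semicolon-separated stack followed by a sample count, so summing counts by the last frame shows which functions were on-CPU most often. The helper name below (top_leaf_frames) is hypothetical, and this is only a rough text-mode complement to the flame graph.

```shell
# Sum sample counts by leaf frame from folded stacks on stdin.
# Input format per line: "frame1;frame2;...;leaf <count>"
# top_leaf_frames is an illustrative helper name.
top_leaf_frames() {
    # $1 = number of entries to print (default 5)
    awk '{
            count = $NF                         # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)          # strip the count, keep frames (which may contain spaces)
            n = split(stack, frames, ";")
            samples[frames[n]] += count
         }
         END { for (f in samples) printf "%d %s\n", samples[f], f }' |
        sort -rn | head -n "${1:-5}"
}

# Usage: perf script | ./stackcollapse-perf.pl | top_leaf_frames 10
```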
Using strace
Trace system calls:
# Trace process
strace -p 1234
# Count system calls
strace -c -p 1234
# Trace with timestamps
strace -tt -p 1234
# Trace specific calls (modern libc opens files with openat rather than open)
strace -e trace=openat,read,write -p 1234
# Follow forks
strace -f -p 1234
# Save to file
strace -o trace.log -p 1234
Using htop
Enhanced process viewer:
# Install htop
apt install htop
# Run htop
htop
# Key features:
# F5 = Tree view
# F6 = Sort by (CPU, Memory, etc.)
# F9 = Kill process
# Space = Mark process
# u = Filter by user
# t = Tree view
# H = Hide/show threads
Solutions and Remediation
Immediate Actions
Kill runaway process:
# Graceful termination
kill 1234
# Force kill
kill -9 1234
# Kill all instances of process (try without -9 first so processes can shut down cleanly)
pkill -9 process_name
killall -9 process_name
Reduce process priority:
# Lower priority (on Linux, renice -n sets an absolute nice value)
renice -n 10 -p 1234
# Set very low priority
renice -n 19 -p 1234
# Set high priority (requires root)
renice -n -10 -p 1234
CPU affinity management:
# Bind process to specific CPU cores
taskset -p -c 0,1 1234
# Start process on specific cores
taskset -c 0,1 command
# Check current affinity
taskset -p 1234
Application-Level Fixes
Restart problematic service:
# Restart service
systemctl restart service-name
# Reload configuration
systemctl reload service-name
# Check service status
systemctl status service-name
Limit process resources:
# Using ulimit (applies to the current shell and the processes it starts)
ulimit -t 3600 # CPU time limit (seconds)
# Using systemd service limits (drop-in file; the directory must exist first)
mkdir -p /etc/systemd/system/service-name.service.d
cat > /etc/systemd/system/service-name.service.d/limits.conf << 'EOF'
[Service]
CPUQuota=50%
EOF
systemctl daemon-reload
systemctl restart service-name
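CPUQuota is expressed relative to a single CPU, so 50% means half of one core and 200% means two full cores. On cgroup v2 systems the limit ends up in the unit's cpu.max file as a quota and a period in microseconds. The sketch below shows that conversion with a hypothetical helper (quota_to_cpu_max), assuming systemd's default period of 100 ms and whole-number percentages.

```shell
# Convert a CPUQuota percentage into the "quota period" pair used by cpu.max.
# quota_to_cpu_max is an illustrative helper; it handles integer percentages only.
quota_to_cpu_max() {
    # $1 = CPUQuota percentage (100 = one full CPU), with or without a trailing %
    # $2 = period in microseconds (assumed default: 100000, i.e. 100 ms)
    local pct=${1%\%} period=${2:-100000}
    echo "$(( pct * period / 100 )) $period"
}
```

For example, `quota_to_cpu_max 50%` prints `50000 100000`: the service may run for 50 ms out of every 100 ms window.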
Optimize application configuration:
# PHP-FPM optimization
# Edit /etc/php/7.4/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
# Apache optimization
# Edit /etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 1000
</IfModule>
Database Optimization
# Kill long-running MySQL query (use the Id column from SHOW PROCESSLIST; KILL QUERY stops the statement but keeps the connection)
mysql -e "KILL QUERY 1234;"
# Optimize MySQL tables
mysqlcheck -o database_name
# PostgreSQL query termination
sudo -u postgres psql -c "SELECT pg_terminate_backend(1234);" # PID
# Enable slow query log (MySQL)
mysql -e "SET GLOBAL slow_query_log = 'ON';"
mysql -e "SET GLOBAL long_query_time = 2;"
System-Level Optimization
Kernel parameters:
# Edit /etc/sysctl.conf
# Scheduler optimization (kernel.sched_migration_cost_ns is no longer exposed as a sysctl on kernels 5.13 and later)
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 0
# Apply changes
sysctl -p
CPU governor settings:
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Set performance governor
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance > "$cpu"
done
# Install cpufrequtils
apt install cpufrequtils
# Set governor permanently
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
systemctl restart cpufrequtils
Prevention and Monitoring
Continuous Monitoring Script
cat > /usr/local/bin/cpu-monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
LOG_FILE="/var/log/cpu-monitor.log"
ALERT_EMAIL="admin@example.com"  # placeholder address; replace with your own
while true; do
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')  # busy % = 100 - idle
LOAD=$(cut -d' ' -f1 /proc/loadavg)
if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
echo "$(date): High CPU detected: $CPU_USAGE% (load $LOAD)" >> "$LOG_FILE"
echo "Top processes:" >> "$LOG_FILE"
ps aux --sort=-%cpu | head -10 >> "$LOG_FILE"
# Send email alert
echo "High CPU alert on $(hostname): $CPU_USAGE%" | \
mail -s "CPU Alert: $CPU_USAGE%" "$ALERT_EMAIL"
fi
sleep 60
done
EOF
chmod +x /usr/local/bin/cpu-monitor.sh
# Run as systemd service
cat > /etc/systemd/system/cpu-monitor.service << 'EOF'
[Unit]
Description=CPU Monitoring Service
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/cpu-monitor.sh
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl enable cpu-monitor.service
systemctl start cpu-monitor.service
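The monitoring loop above reacts to a single sample, while brief spikes are normal and only sustained usage points to a problem. One way to reduce noisy alerts is to require several consecutive samples above the threshold. The function below is a sketch of that idea with a hypothetical name (sustained_high); how many samples count as "sustained" is an assumption you would tune.

```shell
# Succeed only if the most recent N samples all exceed the threshold.
# sustained_high is an illustrative helper name.
sustained_high() {
    # $1 = threshold, $2 = consecutive samples required,
    # remaining arguments = samples, oldest to newest
    local threshold=$1 needed=$2 streak=0 sample
    shift 2
    for sample in "$@"; do
        if awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }'; then
            streak=$(( streak + 1 ))
        else
            streak=0          # any sample at or below the threshold resets the streak
        fi
    done
    [ "$streak" -ge "$needed" ]
}

# Usage: keep the last few readings in a shell array and alert only when
# sustained_high 80 3 "${readings[@]}" succeeds.
```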
Automated Reporting
cat > /usr/local/bin/cpu-report.sh << 'EOF'
#!/bin/bash
REPORT="/tmp/cpu-report-$(date +%Y%m%d).txt"
echo "CPU Usage Report - $(date)" > "$REPORT"
echo "================================" >> "$REPORT"
echo "" >> "$REPORT"
echo "System Load:" >> "$REPORT"
uptime >> "$REPORT"
echo "" >> "$REPORT"
echo "CPU Info:" >> "$REPORT"
lscpu | grep -E "^CPU\(s\)|^Model name" >> "$REPORT"
echo "" >> "$REPORT"
echo "Current CPU Usage:" >> "$REPORT"
mpstat -P ALL 1 1 >> "$REPORT"
echo "" >> "$REPORT"
echo "Top 10 CPU Processes:" >> "$REPORT"
ps aux --sort=-%cpu | head -11 >> "$REPORT"
echo "" >> "$REPORT"
echo "Load Average History (today):" >> "$REPORT"
sar -q | tail -20 >> "$REPORT"
mail -s "Daily CPU Report - $(hostname)" admin@example.com < "$REPORT"  # placeholder address; replace with your own
EOF
chmod +x /usr/local/bin/cpu-report.sh
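# Schedule daily (append to the existing crontab; piping a single line to "crontab -" would replace it)
(crontab -l 2>/dev/null; echo "0 8 * * * /usr/local/bin/cpu-report.sh") | crontab -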
Performance Baseline
# Create baseline script
cat > /usr/local/bin/cpu-baseline.sh << 'EOF'
#!/bin/bash
BASELINE_DIR="/var/log/performance-baseline"
mkdir -p "$BASELINE_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
# Capture baseline
uptime > "$BASELINE_DIR/load-$DATE.txt"
mpstat -P ALL 1 1 > "$BASELINE_DIR/mpstat-$DATE.txt"
ps aux --sort=-%cpu | head -50 > "$BASELINE_DIR/processes-$DATE.txt"
sar -u 1 60 > "$BASELINE_DIR/sar-$DATE.txt"
echo "Baseline captured: $DATE"
EOF
chmod +x /usr/local/bin/cpu-baseline.sh
Conclusion
Diagnosing high CPU usage requires systematic analysis using the right tools. Key takeaways:
- Start with basics: Use top and ps for quick identification
- Use pidstat for detail: Thread-level and per-process statistics
- Profile when needed: perf and flame graphs for deep analysis
- Monitor continuously: Implement automated monitoring and alerting
- Understand metrics: Know the difference between user, system, and wait time
- Check context: High CPU isn't always bad - verify if it's expected
- Document baselines: Know what normal looks like for your systems
Regular monitoring, proper application configuration, and quick diagnostic skills minimize the impact of CPU-related performance issues. Keep these commands and techniques readily available for rapid troubleshooting when CPU bottlenecks occur.


