High CPU Usage: Diagnostics with top, ps, pidstat

Introduction

High CPU usage is one of the most common performance issues system administrators encounter. When CPU resources are exhausted, servers become slow, unresponsive, or completely unavailable, directly impacting application performance and user experience. Identifying the root cause quickly is critical to maintaining service quality and preventing downtime.

This guide provides a systematic approach to diagnosing high CPU usage with command-line tools. top and ps ship with virtually every Linux distribution; pidstat, mpstat, and sar come from the sysstat package, which may need to be installed first. You'll learn how to use these utilities to identify CPU-intensive processes, analyze their behavior, and implement effective solutions.

Whether you're managing web servers, database servers, or application servers, the same diagnostic workflow applies. This guide covers basic CPU monitoring through to profiling with perf and flame graphs, which help you pinpoint the cause of a CPU bottleneck.

Understanding CPU Usage

CPU Metrics Explained

Before diagnosing issues, understand these key CPU metrics:

User Time (us): CPU time spent running user-space processes
System Time (sy): CPU time spent in kernel-space operations
Nice Time (ni): CPU time for processes with adjusted priority
Idle Time (id): CPU time spent idle
I/O Wait (wa): CPU time waiting for I/O operations
Hardware Interrupts (hi): CPU time servicing hardware interrupts
Software Interrupts (si): CPU time servicing software interrupts
Steal Time (st): CPU time stolen by hypervisor (virtualization)

What Constitutes High CPU Usage?

CPU usage interpretation depends on context:

0-40%: Normal light load
40-70%: Moderate load, usually acceptable
70-90%: High load, investigate if sustained
90-100%: Critical, immediate investigation needed

Important: Brief spikes to 100% are normal. Sustained high usage indicates problems.

Load Average vs CPU Usage

Load average is the average number of tasks that are running, waiting for a CPU, or (on Linux) in uninterruptible sleep, reported over the last 1, 5, and 15 minutes:

# View load average
uptime
# Output: load average: 2.50, 1.80, 1.45

# Interpretation:
# - Load < CPU count: System healthy
# - Load = CPU count: System at capacity
# - Load > CPU count: System overloaded

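For a 4-core system, as a rough guide when the load is CPU-driven rather than I/O-driven:

Load average of 2.0 = about 50% utilized
Load average of 4.0 = at capacity
Load average of 8.0 = overloaded, about twice as many runnable tasks as cores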
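
The per-core arithmetic above can be scripted. The following is a minimal sketch; the function name load_per_core is illustrative and not part of any tool, and a ratio near or above 1.0 only suggests CPU saturation when the load is not dominated by tasks blocked on I/O.

```shell
#!/bin/sh
# load_per_core LOAD CORES
# Prints the load average divided by the number of cores (hypothetical helper).
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Values from the 4-core example
load_per_core 2.0 4   # 0.50
load_per_core 4.0 4   # 1.00
load_per_core 8.0 4   # 2.00
```

On a live system the inputs can come from the kernel directly: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"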

Initial CPU Assessment

Quick CPU Status Check

Start with these rapid assessment commands:

# System load and uptime
uptime

# CPU count
nproc
lscpu | grep "^CPU(s)"

# Current CPU usage
top -bn1 | grep "Cpu(s)"

# Per-core CPU usage (1-second sample; without an interval mpstat reports averages since boot)
mpstat -P ALL 1 1

# Quick process overview (header + top 10)
ps aux --sort=-%cpu | head -11

# System resource summary
vmstat 1 5

Quick interpretation:

# If load average > CPU count
# AND CPU usage > 80%
# THEN investigate immediately

# If iowait > 30%
# THEN problem is I/O, not pure CPU

# If steal > 10%
# THEN virtualization overhead issue
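
The rules of thumb above can be combined into one check. This is a sketch only: classify_cpu is a hypothetical helper, the thresholds are the ones quoted above, and the order in which the checks are applied (steal first, then I/O wait, then CPU) is an assumption.

```shell
#!/bin/sh
# classify_cpu LOAD CORES BUSY_PCT IOWAIT_PCT STEAL_PCT
# Applies the quick-interpretation rules; the order of the checks is an assumption.
classify_cpu() {
    awk -v load="$1" -v cores="$2" -v busy="$3" -v wa="$4" -v st="$5" 'BEGIN {
        if (st > 10)                        print "steal: virtualization overhead"
        else if (wa > 30)                   print "iowait: I/O problem, not pure CPU"
        else if (load > cores && busy > 80) print "cpu: investigate immediately"
        else                                print "ok"
    }'
}

classify_cpu 4.23 4 97.9 2.1 0.0    # busy 4-core host
classify_cpu 1.10 4 35.0 42.0 0.0   # host waiting on disk
```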

Step 1: Using top for CPU Analysis

Basic top Usage

The top command is the most common CPU monitoring tool:

# Interactive top
top

# Batch mode (one iteration)
top -bn1

# Monitor specific user
top -u username

# Update every 2 seconds
top -d 2

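# Show only the top 20 processes
# (-n sets the number of iterations, not processes, so trim with head: 7 header lines + 20 processes)
top -bn1 | head -n 27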

# Sort by CPU usage (default)
# In interactive mode, press:
# P = Sort by CPU
# M = Sort by Memory
# T = Sort by Time
# c = Show command line
# 1 = Show individual cores

Interpreting top Output

top - 10:30:45 up 5 days, 2:15, 3 users, load average: 4.23, 3.87, 2.91
Tasks: 247 total, 2 running, 245 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.3 us, 8.2 sy, 0.0 ni, 2.1 id, 2.1 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem: 16384.0 total, 2048.5 free, 12288.3 used, 2047.2 buff/cache
MiB Swap: 4096.0 total, 3072.5 free, 1023.5 used. 3584.2 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1234 www-data  20   0  2.1g   1.5g   12m R  95.3  9.3  123:45 php-fpm
 5678 mysql     20   0  3.2g   2.1g  256m S  45.1 13.1  567:23 mysqld

Key observations:

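Load average 4.23 (assuming a 4-core system) = slightly over capacity
User CPU 87.3% = time is going to application code rather than the kernel
I/O wait 2.1% = not an I/O problem
PID 1234 (php-fpm) using 95.3% CPU = primary culprit, consuming nearly one full core
mysqld at 45.1% CPU = secondary consumer worth checking next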
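
Total CPU usage is easiest to read as 100 minus the idle figure. The sketch below extracts it from a top summary line; cpu_busy is a hypothetical helper, and the pattern assumes the summary layout shown above (it also tolerates the older "2.1%id" style).

```shell
#!/bin/sh
# cpu_busy: read a top "%Cpu(s)" summary line on stdin and print 100 - idle.
cpu_busy() {
    sed -n 's/.*, *\([0-9.]*\)%* *id.*/\1/p' | awk '{ printf "%.1f\n", 100 - $1 }'
}

# Summary line from the sample output above
echo '%Cpu(s): 87.3 us, 8.2 sy, 0.0 ni, 2.1 id, 2.1 wa, 0.0 hi, 0.3 si, 0.0 st' | cpu_busy
```

On a live system: top -bn1 | grep "Cpu(s)" | cpu_busy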

Advanced top Commands

# Save top output to file
top -bn1 > cpu-snapshot.txt

# Monitor specific process
top -p 1234

# Monitor multiple processes
top -p 1234,5678,9012

# Show threads instead of processes
top -H

# Show threads for specific process
top -H -p 1234

# Highlight running processes
# In interactive mode, press 'z' to toggle color and 'y' to highlight running tasks

# Show full command path
# Press 'c' in interactive mode

# Filter by user
# Press 'u' then enter username

Capturing CPU Snapshots

# Capture CPU usage over time
for i in {1..10}; do
    echo "=== Snapshot $i at $(date) ===" >> cpu-monitor.log
    top -bn1 | head -20 >> cpu-monitor.log
    sleep 60
done

# Automated monitoring script
cat > /tmp/cpu-monitor.sh << 'EOF'
#!/bin/bash
while true; do
    # Total CPU = 100 - idle (the second field alone is user time only)
    CPU=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* *id.*/\1/' | awk '{print 100 - $1}')
    if (( $(echo "$CPU > 80" | bc -l) )); then
        echo "$(date): High CPU detected: $CPU%" >> /var/log/cpu-alerts.log
        top -bn1 | head -20 >> /var/log/cpu-alerts.log
    fi
    sleep 60
done
EOF
chmod +x /tmp/cpu-monitor.sh

Step 2: Using ps for Process Analysis

Basic ps Commands

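The ps command provides detailed process information. Note that the %CPU column in ps is the process's total CPU time divided by the time it has been running, so it is a lifetime average rather than a current reading; use top or pidstat for current usage: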

# All processes sorted by CPU
ps aux --sort=-%cpu

# Top 10 CPU consumers
ps aux --sort=-%cpu | head -11

# All processes sorted by memory
ps aux --sort=-%mem | head -11

# Processes by specific user (grep would also match the name anywhere in the line)
ps -u username -o pid,%cpu,%mem,etime,cmd

# Show process hierarchy
ps auxf
ps -ejH

# Custom output format
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head -20

# Show threads
ps -eLf

# Process count by user (user= suppresses the header so it is not counted)
ps -eo user= | sort | uniq -c | sort -rn

Advanced ps Analysis

# Long-running processes (sorted by start time, oldest first)
ps -eo pid,user,lstart,etime,%cpu,cmd --sort=start_time | head -20

# Processes with most threads
ps -eo pid,nlwp,cmd --sort=-nlwp | head -15

# Process count grouped by command name
ps -eo comm= | sort | uniq -c | sort -rn

# Zombie processes
ps aux | awk '$8 ~ /Z/ {print}'

# Real-time process monitoring
watch -n 1 'ps aux --sort=-%cpu | head -15'

# Process tree for specific PID
ps --forest -p 1234
pstree -p 1234

# CPU usage by process name (-C selects by exact command name, so grep itself is never matched)
ps -C process_name -o %cpu= | awk '{sum+=$1} END {print "Total CPU:", sum"%"}'
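
When the busy command is not known in advance, the same summing idea can be applied to every command at once. The following is a sketch; sum_cpu_by_command is a hypothetical helper that expects one "PCPU COMMAND" pair per line.

```shell
#!/bin/sh
# sum_cpu_by_command: read "PCPU COMMAND" lines on stdin, total PCPU per command,
# and print the totals highest first.
sum_cpu_by_command() {
    awk '{
        pcpu = $1
        sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # drop the first field, keep the full name
        total[$0] += pcpu
    }
    END { for (name in total) printf "%.1f %s\n", total[name], name }' | sort -rn
}

# Made-up sample input for illustration
printf '%s\n' '95.3 php-fpm' '45.1 mysqld' '3.2 php-fpm' | sum_cpu_by_command
```

On a live system: ps -eo %cpu= -o comm= | sum_cpu_by_command | head. Two separate -o options are used because the handling of "=" inside a comma-separated list varies between ps implementations.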

Detailed Process Information

# Full process details
ps -fp 1234

# All information for process
ps -F -p 1234

# Process environment variables
tr '\0' '\n' < /proc/1234/environ

# Process command line
tr '\0' ' ' < /proc/1234/cmdline

# Process status
cat /proc/1234/status

# Process CPU affinity
taskset -p 1234

# Process limits
cat /proc/1234/limits

# Number of open file descriptors (ls -l would add a "total" line to the count)
ls /proc/1234/fd | wc -l

Step 3: Using pidstat for Detailed Analysis

Installing and Basic Usage

# Install sysstat (includes pidstat)
apt install sysstat          # Debian/Ubuntu
yum install sysstat          # CentOS/RHEL

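# Enable sysstat data collection (needed for sar history; pidstat works without it)
systemctl enable sysstat
systemctl start sysstat

# Basic pidstat usage (with no interval, reports averages since boot)
pidstat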

# Monitor every 2 seconds
pidstat 2

# Monitor for 10 iterations
pidstat 2 10

# Monitor specific process every 2 seconds
pidstat -p 1234 2

# Monitor multiple processes
pidstat -p 1234,5678,9012 2

Advanced pidstat Analysis

# Per-thread statistics
pidstat -t

# Per-thread for specific process
pidstat -t -p 1234 2

# Show the full command line with arguments instead of just the command name
pidstat -l

# CPU statistics only
pidstat -u

# I/O statistics
pidstat -d

# Memory statistics
pidstat -r

# Context switches
pidstat -w

# All statistics combined
pidstat -u -d -r -w -p 1234 2

# Monitor by task name
pidstat -C php-fpm 2

# All activities on a single line per task (easier to parse; use --human for human-readable units)
pidstat -h 2

Interpreting pidstat Output

# pidstat -u 2
Linux 5.4.0-42-generic (server01)     01/11/2026      _x86_64_

10:45:32 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:45:34 AM  1000      1234   85.00    5.00    0.00    2.00   90.00     2  php-fpm
10:45:34 AM  1001      5678   25.00   15.00    0.00    5.00   40.00     0  mysqld

Key metrics:

%usr: User-space CPU usage
%system: Kernel-space CPU usage
%guest: CPU time the task spent running a virtual CPU (the task hosts a VM guest)
%wait: Time the task spent runnable but waiting for a CPU
%CPU: Total CPU usage
CPU: CPU core number
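
To pick the heavy tasks out of a longer pidstat report, the %CPU column can be filtered. This is a sketch: hot_pids is a hypothetical helper, and it assumes the column layout shown above with pidstat run without -l, so that the command is a single word.

```shell
#!/bin/sh
# hot_pids [LIMIT]: read `pidstat -u` output on stdin and print "PID COMMAND %CPU"
# for tasks whose %CPU exceeds LIMIT (default 50). Columns are counted from the
# right so a 12-hour (AM/PM) or 24-hour timestamp does not shift them.
hot_pids() {
    awk -v limit="${1:-50}" '
        $1 != "Average:" && NF >= 9 && $(NF-2) ~ /^[0-9.]+$/ && $(NF-2) + 0 > limit + 0 {
            print $(NF-7), $NF, $(NF-2)
        }'
}

# Sample report from above
hot_pids 50 <<'EOF'
10:45:32 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:45:34 AM  1000      1234   85.00    5.00    0.00    2.00   90.00     2  php-fpm
10:45:34 AM  1001      5678   25.00   15.00    0.00    5.00   40.00     0  mysqld
EOF
```

On a live system: pidstat -u 2 5 | hot_pids 50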

Context Switch Analysis

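A high rate of involuntary context switches indicates CPU contention, while a high voluntary rate usually means the process is waiting on I/O or sleeping: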

# Monitor context switches
pidstat -w 2

# Output interpretation:
# cswch/s = voluntary context switches (I/O wait, sleep)
# nvcswch/s = involuntary context switches (preempted)

# High involuntary switches = CPU contention
# High voluntary switches = I/O bound process
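
The kernel also exposes the same two counters per process in /proc/PID/status, as totals since the process started rather than per-second rates. The sketch below reads them; ctxt_summary is a hypothetical helper, and treating the larger counter as "dominant" is a simplification, not a threshold from any tool.

```shell
#!/bin/sh
# ctxt_summary: read /proc/PID/status content on stdin and report the
# voluntary and nonvoluntary context-switch totals and which one is larger.
ctxt_summary() {
    awk -F':[ \t]*' '
        $1 == "voluntary_ctxt_switches"    { v = $2 }
        $1 == "nonvoluntary_ctxt_switches" { nv = $2 }
        END {
            printf "voluntary=%d nonvoluntary=%d dominant=%s\n",
                   v, nv, (nv > v ? "involuntary" : "voluntary")
        }'
}

# Made-up sample input for illustration
printf 'Name:\tphp-fpm\nvoluntary_ctxt_switches:\t120\nnonvoluntary_ctxt_switches:\t4500\n' | ctxt_summary
```

On a live system: ctxt_summary < /proc/1234/status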

Step 4: CPU Profiling and Analysis

Using mpstat

Monitor per-CPU core statistics:

# Install if needed (part of sysstat)
apt install sysstat

# Show all CPU cores
mpstat -P ALL

# Update every 2 seconds
mpstat -P ALL 2

# Show specific CPU core
mpstat -P 0 2

# Extended statistics
mpstat -A 2

# JSON output
mpstat -o JSON 2 5

Interpreting mpstat:

# Unbalanced load across cores
# CPU0: 100%, CPU1: 20%, CPU2: 15%, CPU3: 10%
# Indicates: Single-threaded bottleneck

# Balanced load
# CPU0: 80%, CPU1: 85%, CPU2: 82%, CPU3: 87%
# Indicates: Multi-threaded application
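
The imbalance check above can be reduced to the gap between the busiest and the least busy core. This is a sketch; core_spread is a hypothetical helper that expects one "CPU BUSY_PCT" pair per line, and what counts as a large gap is left to the reader.

```shell
#!/bin/sh
# core_spread: read "CPU BUSY_PCT" lines on stdin and print the busiest and
# least busy core plus the gap between them.
core_spread() {
    awk 'NR == 1 { max = min = $2; maxc = minc = $1 }
         $2 + 0 > max + 0 { max = $2; maxc = $1 }
         $2 + 0 < min + 0 { min = $2; minc = $1 }
         END {
             if (NR) printf "busiest=cpu%s(%.0f%%) idlest=cpu%s(%.0f%%) spread=%.0f\n",
                            maxc, max, minc, min, max - min
         }'
}

# Unbalanced example from above
printf '%s\n' '0 100' '1 20' '2 15' '3 10' | core_spread
```

On a live system, per-core busy figures can be derived from the %idle column of the mpstat averages: mpstat -P ALL 1 1 | awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ { print $2, 100 - $NF }' | core_spread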

Using vmstat

System-wide performance overview:

# Basic vmstat
vmstat 1 10

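# Show active/inactive memory (-a changes the memory columns, not the CPU columns)
vmstat -a 2

# Wide output (-w), easier to read on systems with a lot of memory
vmstat -w 2

# Output interpretation:
# r = runnable processes (running or waiting for a CPU)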
# b = processes in uninterruptible sleep
# us = user CPU time
# sy = system CPU time
# id = idle time
# wa = I/O wait time

Critical indicators:

# r column > CPU count = CPU bottleneck
# wa > 30% = I/O bottleneck, not CPU
# sy > 30% = heavy kernel work (system calls, interrupts, context switching)
# us > 70% with r > cores = CPU overload
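
These indicators can be checked mechanically against a vmstat sample. The sketch below is illustrative: check_vmstat is a hypothetical helper, the thresholds are the ones listed above, and the column positions assume the default vmstat layout (r first, us/sy/id/wa in columns 13 to 16).

```shell
#!/bin/sh
# check_vmstat CORES: read one vmstat data line on stdin and print a flag for
# each rule of thumb that fires. Assumes the default column layout:
# r=$1, us=$13, sy=$14, wa=$16.
check_vmstat() {
    awk -v cores="$1" '{
        r = $1; us = $13; sy = $14; wa = $16
        if (r > cores)            print "cpu-bottleneck"
        if (wa > 30)              print "io-bottleneck"
        if (sy > 30)              print "high-system-time"
        if (us > 70 && r > cores) print "cpu-overload"
    }'
}

# Made-up sample line; columns are:
# r b swpd free buff cache si so bi bo in cs us sy id wa st
echo '6 0 0 2048500 100000 1900000 0 0 5 10 300 800 87 8 2 3 0' | check_vmstat 4
```

On a live system: vmstat 1 2 | tail -n 1 | check_vmstat "$(nproc)". The second sample is used because the first vmstat line reports averages since boot.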

Using sar

Historical performance data:

# Install and enable
apt install sysstat
systemctl enable sysstat

# CPU usage (last 10 minutes)
sar -u -s $(date -d '10 minutes ago' +%H:%M:%S)

# Per-core statistics
sar -P ALL

# Historical CPU data (Debian/Ubuntu path; RHEL/CentOS stores the files under /var/log/sa/)
sar -u -f /var/log/sysstat/sa$(date +%d)

# Yesterday's CPU data
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)

# CPU statistics for specific time
sar -u -s 10:00:00 -e 11:00:00

# Generate report
sar -u > cpu-report.txt

Step 5: Identifying CPU Bottleneck Causes

Application Issues

# Check for runaway processes
ps aux --sort=-%cpu | head -5

# Check process uptime (sorted by start time, oldest first)
ps -eo pid,user,etime,%cpu,cmd --sort=start_time | head -15

# Multiple instances of same process (pgrep does not count itself, unlike ps | grep | wc -l)
pgrep -c process_name

# Check process nice values
ps -eo pid,ni,cmd --sort=ni

# Processes in uninterruptible sleep (D state)
ps aux | awk '$8 ~ /D/ {print}'

Infinite Loops and Bugs

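# Monitor process CPU over time
# (ps %cpu is a lifetime average, so sample once per second with pidstat instead)
pidstat -u -p 1234 1

# Check if process is stuck (press Ctrl-C to print the summary)
strace -p 1234 -c
# Look for repetitive system calls; a busy process that makes no system calls
# at all is looping in user-space code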

# Sample process execution
strace -p 1234 -f -e trace=all 2>&1 | head -100

# Check for tight loops
perf record -p 1234 -g -- sleep 10
perf report
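
When pidstat is not installed, current CPU usage of a process can be measured by sampling its CPU time counters twice. This is a minimal sketch under stated assumptions: proc_ticks, cpu_pct, and sample_cpu are hypothetical helper names, and the field positions follow the /proc/PID/stat layout (utime and stime are fields 14 and 15 of the full line).

```shell
#!/bin/sh
# proc_ticks PID: print utime + stime (in clock ticks) from /proc/PID/stat.
proc_ticks() {
    # Strip "pid (comm) " first so spaces in the command name cannot shift
    # the fields; utime and stime are then fields 12 and 13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# cpu_pct TICKS_BEFORE TICKS_AFTER SECONDS HZ: percent of one core used.
cpu_pct() {
    awk -v a="$1" -v b="$2" -v secs="$3" -v hz="$4" \
        'BEGIN { printf "%.1f\n", (b - a) * 100 / (hz * secs) }'
}

# sample_cpu PID [SECONDS]: measure CPU use of PID over a short interval.
sample_cpu() {
    before=$(proc_ticks "$1")
    sleep "${2:-1}"
    after=$(proc_ticks "$1")
    cpu_pct "$before" "$after" "${2:-1}" "$(getconf CLK_TCK)"
}

# Worked example: 95 ticks consumed in 1 second at 100 ticks per second
cpu_pct 1000 1095 1 100
```

Usage: sample_cpu 1234 5 reports the average over the next five seconds. A process stuck in a tight loop stays close to 100 (one full core) on every sample.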

Database Query Issues

# MySQL active threads (long-running entries point to slow queries)
mysql -e "SHOW FULL PROCESSLIST;" | grep -v Sleep

# MySQL process list by time
mysql -e "SELECT * FROM information_schema.processlist WHERE command != 'Sleep' ORDER BY time DESC;"

# PostgreSQL active queries
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"

# Check database CPU usage (-C selects by exact command name, so grep itself is never matched)
ps -C mysqld,postgres -o %cpu= | awk '{sum+=$1} END {print "DB CPU:", sum"%"}'

Web Server Load

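# Apache processes (apache2 on Debian/Ubuntu, httpd on RHEL/CentOS)
pgrep -c apache2
pgrep -c httpd

# Apache CPU usage
ps -C apache2 -o %cpu= | awk '{sum+=$1} END {print "Apache CPU:", sum"%"}'

# Nginx worker CPU ([n] keeps grep from matching its own command line)
ps aux | grep "[n]ginx: worker" | awk '{sum+=$3} END {print "Nginx CPU:", sum"%"}'

# PHP-FPM pool status (requires pm.status_path = /status in the pool config and a matching web server route)
curl http://localhost/status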

Container/Virtualization Issues

# Docker container CPU usage
docker stats --no-stream

# Container CPU limits
docker inspect container_name | grep -i cpu

# Check steal time (hypervisor overhead)
top -bn1 | grep "Cpu(s)" | awk '{print $16}'

# If steal > 10%, virtualization overhead is high

Step 6: Advanced Diagnostic Techniques

Using perf

Performance profiling tool:

# Install perf
apt install linux-tools-common linux-tools-$(uname -r)

# Record system-wide CPU profile
perf record -a -g -- sleep 30

# Record specific process
perf record -p 1234 -g -- sleep 30

# View report
perf report

# Top functions consuming CPU
perf top

# CPU cycle analysis
perf stat -p 1234 sleep 10

# Cache misses
perf stat -e cache-misses,cache-references -p 1234 sleep 10

CPU Flame Graphs

Visualize CPU consumption:

# Clone FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Capture data
perf record -F 99 -a -g -- sleep 60

# Generate flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg

# For specific process
perf record -F 99 -p 1234 -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > process-flamegraph.svg

Using strace

Trace system calls:

# Trace process
strace -p 1234

# Count system calls
strace -c -p 1234

# Trace with timestamps
strace -tt -p 1234

# Trace specific calls (modern libc opens files with openat rather than open)
strace -e trace=open,openat,read,write -p 1234

# Follow forks
strace -f -p 1234

# Save to file
strace -o trace.log -p 1234

Using htop

Enhanced process viewer:

# Install htop
apt install htop

# Run htop
htop

# Key features:
# F5 = Tree view
# F6 = Sort by (CPU, Memory, etc.)
# F9 = Kill process
# Space = Mark process
# u = Filter by user
# t = Tree view
# H = Hide/show threads

Solutions and Remediation

Immediate Actions

Kill runaway process:

# Graceful termination
kill 1234

# Force kill
kill -9 1234

# Force-kill all instances of a process (omit -9 to try a graceful SIGTERM first)
pkill -9 process_name
killall -9 process_name

Reduce process priority:

# Lower priority (increase nice value)
renice +10 1234

# Set very low priority
renice +19 1234

# Set high priority (requires root)
renice -10 1234

CPU affinity management:

# Bind process to specific CPU cores
taskset -p -c 0,1 1234

# Start process on specific cores
taskset -c 0,1 command

# Check current affinity
taskset -p 1234

Application-Level Fixes

Restart problematic service:

# Restart service
systemctl restart service-name

# Reload configuration
systemctl reload service-name

# Check service status
systemctl status service-name

Limit process resources:

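# Using ulimit (applies to the current shell and processes started from it, not to running processes)
ulimit -t 3600  # CPU time limit (seconds)

# Using systemd service limits (the drop-in directory must exist first)
mkdir -p /etc/systemd/system/service-name.service.d
cat > /etc/systemd/system/service-name.service.d/limits.conf << 'EOF'
[Service]
CPUQuota=50%
EOF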

systemctl daemon-reload
systemctl restart service-name
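
CPUQuota= is expressed relative to a single CPU, so 50% means half of one core and values above 100% are valid on multi-core hosts. A small sketch of the conversion follows; cpu_quota is a hypothetical helper, not a systemd command.

```shell
#!/bin/sh
# cpu_quota CORES: print a CPUQuota= line that allows the given number of cores.
cpu_quota() {
    awk -v cores="$1" 'BEGIN { printf "CPUQuota=%d%%\n", cores * 100 + 0.5 }'
}

cpu_quota 0.5   # half of one core
cpu_quota 2     # two full cores
```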

Optimize application configuration:

# PHP-FPM optimization
# Edit /etc/php/7.4/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10

# Apache optimization
# Edit /etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
    StartServers         5
    MinSpareServers      5
    MaxSpareServers     10
    MaxRequestWorkers  150
    MaxConnectionsPerChild  1000
</IfModule>

Database Optimization

# Kill long-running MySQL query
# (use the Id column from SHOW PROCESSLIST; KILL drops the whole connection, KILL QUERY stops only the statement)
mysql -e "KILL 1234;"

# Optimize MySQL tables
mysqlcheck -o database_name

# PostgreSQL query termination
sudo -u postgres psql -c "SELECT pg_terminate_backend(1234);"  # PID

# Enable slow query log (MySQL)
mysql -e "SET GLOBAL slow_query_log = 'ON';"
mysql -e "SET GLOBAL long_query_time = 2;"

System-Level Optimization

Kernel parameters:

# Edit /etc/sysctl.conf

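# Scheduler optimization
# Note: on newer kernels (5.13 and later) sched_migration_cost_ns is no longer a sysctl key
# and lives under /sys/kernel/debug/sched/ instead; check with: sysctl kernel.sched_migration_cost_ns
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 0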

# Apply changes
sysctl -p

CPU governor settings:

# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Set performance governor
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

# Install cpufrequtils
apt install cpufrequtils

# Set governor permanently
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils
systemctl restart cpufrequtils

Prevention and Monitoring

Continuous Monitoring Script

cat > /usr/local/bin/cpu-monitor.sh << 'EOF'
#!/bin/bash

THRESHOLD=80
LOG_FILE="/var/log/cpu-monitor.log"
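ALERT_EMAIL="admin@example.com"  # placeholder address; replace with your own

while true; do
    # Total CPU = 100 - idle (the second field alone is user time only)
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* *id.*/\1/' | awk '{print 100 - $1}')
    LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)

    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "$(date): High CPU detected: $CPU_USAGE% (load average: $LOAD)" >> "$LOG_FILE"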
        echo "Top processes:" >> "$LOG_FILE"
        ps aux --sort=-%cpu | head -10 >> "$LOG_FILE"

        # Send email alert
        echo "High CPU alert on $(hostname): $CPU_USAGE%" | \
            mail -s "CPU Alert: $CPU_USAGE%" "$ALERT_EMAIL"
    fi

    sleep 60
done
EOF

chmod +x /usr/local/bin/cpu-monitor.sh

# Run as systemd service
cat > /etc/systemd/system/cpu-monitor.service << 'EOF'
[Unit]
Description=CPU Monitoring Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/cpu-monitor.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl enable cpu-monitor.service
systemctl start cpu-monitor.service

Automated Reporting

cat > /usr/local/bin/cpu-report.sh << 'EOF'
#!/bin/bash

REPORT="/tmp/cpu-report-$(date +%Y%m%d).txt"

echo "CPU Usage Report - $(date)" > "$REPORT"
echo "================================" >> "$REPORT"
echo "" >> "$REPORT"

echo "System Load:" >> "$REPORT"
uptime >> "$REPORT"
echo "" >> "$REPORT"

echo "CPU Info:" >> "$REPORT"
lscpu | grep -E "^CPU\(s\)|^Model name" >> "$REPORT"
echo "" >> "$REPORT"

echo "Current CPU Usage:" >> "$REPORT"
mpstat -P ALL 1 1 >> "$REPORT"
echo "" >> "$REPORT"

echo "Top 10 CPU Processes:" >> "$REPORT"
ps aux --sort=-%cpu | head -11 >> "$REPORT"
echo "" >> "$REPORT"

echo "Load Average History (today):" >> "$REPORT"
sar -q | tail -20 >> "$REPORT"

# admin@example.com is a placeholder address; replace with your own
mail -s "Daily CPU Report - $(hostname)" "admin@example.com" < "$REPORT"
EOF

chmod +x /usr/local/bin/cpu-report.sh

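# Schedule daily (append to the existing crontab; piping a single line to "crontab -" would replace it)
(crontab -l 2>/dev/null; echo "0 8 * * * /usr/local/bin/cpu-report.sh") | crontab -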

Performance Baseline

# Create baseline script
cat > /usr/local/bin/cpu-baseline.sh << 'EOF'
#!/bin/bash

BASELINE_DIR="/var/log/performance-baseline"
mkdir -p "$BASELINE_DIR"

DATE=$(date +%Y%m%d-%H%M%S)

# Capture baseline
uptime > "$BASELINE_DIR/load-$DATE.txt"
mpstat -P ALL 1 1 > "$BASELINE_DIR/mpstat-$DATE.txt"
ps aux --sort=-%cpu | head -50 > "$BASELINE_DIR/processes-$DATE.txt"
sar -u 1 60 > "$BASELINE_DIR/sar-$DATE.txt"

echo "Baseline captured: $DATE"
EOF

chmod +x /usr/local/bin/cpu-baseline.sh
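
A captured baseline is most useful when it can be reduced to a number for later comparison. The sketch below pulls the average busy percentage out of a saved sar -u report; sar_avg_busy is a hypothetical helper, and it assumes the usual sar layout in which the "Average:" line for "all" CPUs ends with %idle.

```shell
#!/bin/sh
# sar_avg_busy: read `sar -u` output on stdin and print 100 - %idle from the
# "Average:" line for all CPUs (%idle is the last column).
sar_avg_busy() {
    awk '$1 == "Average:" && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}

# Sample line in the layout sar -u prints (values are made up for illustration)
printf '%s\n' \
    'Average:        all     12.50      0.00      3.25      1.00      0.00     83.25' |
    sar_avg_busy
```

Usage: run sar_avg_busy on one of the sar-*.txt files in /var/log/performance-baseline and compare the result with the same figure taken during an incident.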

Conclusion

Diagnosing high CPU usage requires systematic analysis using the right tools. Key takeaways:

Start with basics: Use top and ps for quick identification
Use pidstat for detail: Thread-level and per-process statistics
Profile when needed: perf and flame graphs for deep analysis
Monitor continuously: Implement automated monitoring and alerting
Understand metrics: Know the difference between user, system, and wait time
Check context: High CPU isn't always bad - verify if it's expected
Document baselines: Know what normal looks like for your systems

Regular monitoring, proper application configuration, and quick diagnostic skills minimize the impact of CPU-related performance issues. Keep these commands and techniques readily available for rapid troubleshooting when CPU bottlenecks occur.
