Server Not Responding: Command-Based Diagnostics

Introduction

A server that stops responding is one of the most critical issues system administrators face. Whether you're managing web servers, database servers, or application servers, unresponsive systems can lead to significant downtime, lost revenue, and frustrated users. This comprehensive guide provides a systematic, command-line approach to diagnosing and resolving server unresponsiveness issues.

Knowing how to diagnose an unresponsive server is essential for any DevOps engineer, system administrator, or IT professional. The sections below walk through a methodical troubleshooting process, using proven diagnostic commands and techniques to identify root causes and implement effective solutions.

Understanding Server Unresponsiveness

What Does "Not Responding" Mean?

Server unresponsiveness can manifest in several ways:

  • Complete network timeout: Unable to ping or connect to the server
  • Partial responsiveness: Server responds to ping but services are unavailable
  • Slow response: Server responds but with significant delays
  • Service-specific issues: Specific applications or services fail while others work
  • Intermittent failures: Server responds inconsistently

Common Symptoms

Before diving into diagnostics, recognize these common symptoms:

  1. SSH connections time out or are refused
  2. Web services return HTTP 502/503/504 errors
  3. Database connections fail or hang
  4. Ping requests time out completely
  5. Services appear running but don't respond to requests
  6. High latency in all network communications
  7. System console shows frozen output
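
A quick way to confirm several of these symptoms from your workstation is to probe with short timeouts. This is a minimal sketch; substitute your own host, user, and URL:

# Does SSH answer within 5 seconds? (the exit code tells you)
ssh -o ConnectTimeout=5 user@your-server-ip exit; echo "ssh exit: $?"

# What HTTP status does the web service return, if any?
curl -s -o /dev/null --max-time 5 -w "%{http_code}\n" http://your-server-ip/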

Initial Assessment and Diagnostic Strategy

The Systematic Approach

When facing an unresponsive server, follow this structured diagnostic methodology:

  1. Verify the problem: Confirm the issue from multiple locations
  2. Check external factors: Network connectivity, DNS resolution
  3. Assess system resources: CPU, memory, disk I/O
  4. Review recent changes: Updates, deployments, configuration changes
  5. Analyze logs: System and application logs for errors
  6. Test services individually: Isolate the problematic component

Remote vs Console Access

Your diagnostic approach differs based on access method:

  • Remote Access (SSH): If SSH is available, you have full diagnostic capabilities
  • Console Access: If SSH fails, use KVM/IPMI/physical console access
  • No Access: Contact hosting provider or use out-of-band management tools

Step 1: Initial Connectivity Testing

Testing Basic Network Connectivity

Start with basic network reachability tests from your local machine:

# Basic ping test
ping -c 4 your-server-ip

# Traceroute to identify network path issues
traceroute your-server-ip

# mtr for a combined ping/traceroute report (100 cycles)
mtr -rw -c 100 your-server-ip

# Check specific ports
telnet your-server-ip 22
nc -zv your-server-ip 22 80 443

Interpretation:

  • No ping response: Network issue or firewall blocking ICMP
  • Ping works but ports closed: Services down or firewall rules changed
  • Packet loss: Network congestion or hardware issues
  • High latency: Network path problems or server resource exhaustion

DNS Resolution Testing

Verify DNS is not causing connectivity issues:

# Test DNS resolution
nslookup your-domain.com
dig your-domain.com +short

# Check reverse DNS
dig -x your-server-ip

# Test with alternative DNS servers
nslookup your-domain.com 8.8.8.8

If DNS fails but IP address works, the issue is DNS-related, not server unresponsiveness.
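
To take DNS out of the equation entirely, test the service at its IP address while keeping the original hostname. With curl, --resolve pins a hostname to an IP so TLS certificates and virtual hosting still behave normally (hostname and IP below are placeholders):

# Talk to the server directly, bypassing DNS resolution
curl -v --resolve your-domain.com:443:your-server-ip https://your-domain.com/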

Step 2: Establishing Server Access

Using Alternative Access Methods

If standard SSH fails, try these alternatives:

# SSH with verbose output
ssh -vvv user@server-ip

# SSH on alternative port
ssh -p 2222 user@server-ip

# SSH with specific identity file
ssh -i /path/to/key user@server-ip

# SSH through jump host
ssh -J jumphost user@server-ip

Console Access Options

When SSH is completely unavailable:

  1. Cloud Provider Console: AWS EC2 Serial Console, DigitalOcean Droplet Console
  2. IPMI/iLO/iDRAC: Out-of-band management for bare metal servers
  3. KVM over IP: Remote console access
  4. Physical Access: Direct keyboard/monitor connection
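
The exact out-of-band commands depend on your provider and hardware. Two illustrative sketches, assuming the AWS CLI is configured and ipmitool can reach the server's BMC (instance ID and credentials are placeholders):

# Fetch the most recent serial console output from an EC2 instance
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --latest --output text

# Query chassis power state over IPMI
ipmitool -I lanplus -H bmc-ip -U admin -P password chassis status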

Step 3: System Resource Diagnostics

CPU Usage Analysis

Once you have access, immediately check CPU utilization:

# Quick CPU overview
top -bn1 | head -20

# Detailed CPU statistics
mpstat 1 5

# Per-process CPU usage
ps aux --sort=-%cpu | head -15

# Real-time CPU monitoring
htop

# CPU information and utilization
lscpu
uptime

High CPU indicators:

  • Load average significantly higher than CPU count
  • One or more processes consuming >90% CPU
  • System CPU (sy) higher than user CPU (us)
  • iowait (wa) percentage consistently high
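
The load-versus-core comparison is easy to script; the one-liner below (a simple sketch) prints both numbers side by side. A sustained 1-minute load well above the core count suggests CPU saturation:

# Compare 1/5/15-minute load averages against available cores
echo "cores: $(nproc)  load: $(cut -d' ' -f1-3 /proc/loadavg)"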

Memory Usage Analysis

Memory exhaustion is a common cause of unresponsiveness:

# Memory usage overview
free -h

# Detailed memory statistics
vmstat 1 5

# Memory usage by process
ps aux --sort=-%mem | head -15

# Check for OOM killer activity
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/kern.log

# Slab memory usage
slabtop -o

Memory issues indicators:

  • Available memory near zero
  • High swap usage (Swap used > 50%)
  • OOM killer messages in logs
  • Processes killed unexpectedly
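
MemAvailable in /proc/meminfo is the kernel's own estimate of how much memory can be claimed without swapping, which makes it a better health check than free memory alone. A quick sketch:

# Percentage of RAM the kernel considers available
awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "available: %.1f%%\n", a/t*100}' /proc/meminfo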

Disk I/O Analysis

Disk bottlenecks can make systems appear unresponsive:

# Disk I/O statistics
iostat -x 1 5

# Per-process I/O usage
iotop -o

# Disk usage by filesystem
df -h

# Inode usage
df -i

# Find large directories (stay on each filesystem, ignore unreadable paths)
du -xsh /* 2>/dev/null | sort -rh | head -10

# Check for disk errors
dmesg | grep -i "I/O error"
smartctl -a /dev/sda

Disk issues indicators:

  • %util approaching 100%
  • High await times (>50ms)
  • Disk space at 100%
  • Inode usage at 100%
  • I/O errors in dmesg
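
Inode exhaustion (df -i at 100%) is usually caused by one directory accumulating millions of small files, such as session or cache files. This sketch counts files per candidate directory to find the culprit (adjust the list to your layout):

# Count files per directory; the largest count is the likely offender
for d in /var /tmp /home; do printf "%s\t" "$d"; find "$d" -xdev -type f 2>/dev/null | wc -l; done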

Step 4: Network and Service Diagnostics

Network Connection Analysis

Check active connections and network statistics:

# Active network connections
netstat -tunapl
ss -tunapl

# Connection counts by state
ss -s

# Connections by IP address
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head

# Network interface statistics
ip -s link
ifconfig

# Check for network errors
netstat -i

Network issues indicators:

  • Excessive connections in TIME_WAIT or CLOSE_WAIT states
  • Single IP with hundreds of connections (potential attack)
  • Network interface errors or drops
  • Firewall dropping packets
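
ss can filter by TCP state directly, which turns the TIME_WAIT / CLOSE_WAIT check into a one-liner (counts include a header line, so treat them as approximate):

# Sockets stuck in CLOSE_WAIT (often the application isn't closing connections)
ss -ant state close-wait | wc -l

# Sockets in TIME_WAIT (high counts after traffic spikes are normal and transient)
ss -ant state time-wait | wc -l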

Service Status Verification

Check critical services status:

# List all running services
systemctl list-units --type=service --state=running

# Check specific service status
systemctl status nginx
systemctl status mysql
systemctl status apache2

# Check service startup times
systemd-analyze blame

# Check for failed services
systemctl --failed

Port and Process Association

Identify what's listening on expected ports:

# List listening ports with processes
ss -tulpn
netstat -tulpn

# Check specific port
lsof -i :80
fuser -v 80/tcp

# List all open files by process
lsof -p <PID>

Step 5: Log Analysis

System Logs Examination

System logs often contain critical diagnostic information:

# Recent system messages
journalctl -xe

# Last 100 kernel messages
dmesg | tail -100

# Authentication logs
tail -100 /var/log/auth.log
grep "Failed password" /var/log/auth.log | tail -20

# System log
tail -100 /var/log/syslog
tail -100 /var/log/messages

# Check for segfaults
journalctl -b | grep -i segfault

# Kernel ring buffer
dmesg -T | grep -i "error\|fail\|warning"

Application-Specific Logs

Review logs for specific services:

# Web server logs
tail -f /var/log/nginx/error.log
tail -f /var/log/apache2/error.log

# Database logs
tail -f /var/log/mysql/error.log
tail -f /var/log/postgresql/postgresql-*.log

# Application logs
journalctl -u your-service -f

# Search for errors in last hour
journalctl --since "1 hour ago" | grep -i error

Step 6: Process and Service Analysis

Identifying Problem Processes

Find processes causing issues:

# Top CPU consumers
ps aux --sort=-%cpu | head -10

# Top memory consumers
ps aux --sort=-%mem | head -10

# Processes with highest open file count
lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head

# Longest-running processes (etimes = elapsed time in seconds)
ps -eo pid,user,comm,etimes --sort=-etimes | head -15

# Zombie processes (state Z in the STAT column)
ps aux | awk '$8 ~ /^Z/'

# Process tree
pstree -p

Analyzing Process Behavior

Get detailed information about problematic processes:

# Process details
ps -fp <PID>

# Process limits
cat /proc/<PID>/limits

# Process file descriptors
ls -l /proc/<PID>/fd | wc -l

# Process network connections
lsof -p <PID> -i

# Trace system calls
strace -p <PID>

# Process stack trace
pstack <PID>
gdb -p <PID> -batch -ex "thread apply all bt"

Step 7: Checking for Security Issues

Detecting Intrusions or Attacks

Look for signs of security compromise:

# Check for unusual processes
ps aux | grep -v "^root\|^www-data\|^mysql" | less

# Recent login activity
last -a | head -20
lastb | head -20  # Failed login attempts

# Current logged-in users
w
who

# Unusual established connections (note: ss reports the state as ESTAB, not ESTABLISHED)
ss -tunap state established | grep -v ":80\|:443\|:22"

# Check for rootkits
rkhunter --check
chkrootkit

# Check listening processes
ss -tulpn | grep LISTEN

Firewall and Security Log Review

# Check firewall rules
iptables -L -n -v
ufw status verbose

# Security log review
grep -i "refused\|denied\|error" /var/log/auth.log | tail -50

# Fail2ban status (if installed)
fail2ban-client status
fail2ban-client status sshd

Root Cause Analysis

Common Causes of Server Unresponsiveness

1. Resource Exhaustion

CPU Exhaustion:

  • Runaway processes consuming all CPU cycles
  • Infinite loops in applications
  • Cryptocurrency miners
  • DDoS attacks
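
One quick forensic check for miners and similar injected processes: inspect the binary behind a suspicious PID. A "(deleted)" marker in the symlink target means the executable was removed from disk after launch, a common evasion trick. This is a heuristic, not proof; <PID> is whatever your top CPU consumer is:

# "(deleted)" in the output is a red flag
ls -l /proc/<PID>/exe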

Memory Exhaustion:

  • Memory leaks in applications
  • Insufficient memory for workload
  • Cache growing unbounded
  • OOM killer terminating critical processes

Disk Exhaustion:

  • Full filesystem preventing writes
  • Inode exhaustion
  • Disk I/O bottleneck
  • Hardware failure

2. Network Issues

  • Firewall rule changes blocking access
  • DDoS or brute force attacks
  • Network interface errors
  • Routing problems
  • Bandwidth saturation

3. Service Failures

  • Application crashes
  • Database connection exhaustion
  • Configuration errors
  • Deadlocks in applications
  • Service dependencies failing

4. Kernel Issues

  • Kernel panics
  • Driver failures
  • File system corruption
  • Out of memory conditions
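
If the server rebooted on its own, evidence of a panic or fatal OOM event often survives only in the previous boot's kernel log. This requires a persistent journal (Storage=persistent in journald.conf); otherwise pre-reboot logs are lost:

# Kernel messages from the previous boot
journalctl -k -b -1 --no-pager | tail -50

# List boots the journal knows about
journalctl --list-boots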

Solutions and Remediation

Immediate Recovery Actions

If CPU bound:

# Kill problematic process
kill <PID>
kill -9 <PID>  # Force kill if needed

# Lower process priority
renice +10 <PID>

# Limit CPU usage with cpulimit
cpulimit -p <PID> -l 50

If memory bound:

# Drop the page cache (non-destructive, but reads slow down until caches refill)
sync && echo 1 > /proc/sys/vm/drop_caches

# Restart memory-hungry service
systemctl restart service-name

# Add swap space temporarily
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
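
If the extra swap should survive a reboot, register it in /etc/fstab as well (on most filesystems, fallocate -l 2G /swapfile is a faster way to create the file than dd):

# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab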

If disk bound:

# Find large log files, then truncate rather than delete them
# (deleting an open log file frees no space until the writing process closes it)
find /var/log -type f -size +100M -exec ls -lh {} \;
truncate -s 0 /path/to/oversized.log

# Clean package cache
apt clean  # Ubuntu/Debian
yum clean all  # CentOS/RHEL

# Remove old kernels
apt autoremove --purge  # Ubuntu/Debian

# Compress old logs
gzip /var/log/*.log.1

If service crashed:

# Restart service
systemctl restart service-name

# Enable service auto-restart
systemctl edit service-name
# Add:
# [Service]
# Restart=always
# RestartSec=10

# Run the daemon in the foreground with debug output
# (exact flags vary by service; check its man page)
service-name --verbose --debug

Network-Related Fixes

# Restart networking (service name depends on the distro)
systemctl restart networking       # Debian/Ubuntu with ifupdown
systemctl restart NetworkManager   # distros using NetworkManager

# Flush firewall rules (CAUTION: with a default-DROP policy this can lock you out)
iptables -F
systemctl restart firewalld  # CentOS/RHEL; use "ufw reload" on Ubuntu

# Reset network interface
ip link set eth0 down
ip link set eth0 up

# Clear ARP cache
ip neigh flush all

Service Recovery

# Graceful reload (re-reads configuration without dropping connections)
systemctl reload service-name

# Force restart with timeout
timeout 30 systemctl restart service-name || systemctl kill service-name

# Reset failed state
systemctl reset-failed service-name

Prevention and Best Practices

Proactive Monitoring

Implement monitoring to catch issues before they cause unresponsiveness:

# Install monitoring tools
apt install sysstat monitoring-plugins nagios-plugins-basic

# Enable system statistics collection
systemctl enable sysstat
systemctl start sysstat
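
Once sysstat has been collecting for a while, sar lets you look back at resource usage around the time an incident began (data files live under /var/log/sysstat/ on Debian/Ubuntu, /var/log/sa/ on RHEL):

# Today's CPU, memory, and load history
sar -u
sar -r
sar -q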

# Create monitoring script
cat > /usr/local/bin/resource-monitor.sh << 'EOF'
#!/bin/bash
CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)  # user CPU; field layout varies across top versions
MEM=$(free | grep Mem | awk '{print ($3/$2) * 100}')
DISK=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)

if (( $(echo "$CPU > 90" | bc -l) )); then
    echo "High CPU: $CPU%" | mail -s "Alert: High CPU" [email protected]
fi
if (( $(echo "$MEM > 90" | bc -l) )); then
    echo "High Memory: $MEM%" | mail -s "Alert: High Memory" [email protected]
fi
if [ $DISK -gt 90 ]; then
    echo "High Disk: $DISK%" | mail -s "Alert: High Disk" [email protected]
fi
EOF

chmod +x /usr/local/bin/resource-monitor.sh

# Add to crontab (preserving any existing entries)
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/resource-monitor.sh") | crontab -

System Hardening

# Set resource limits
cat >> /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
* soft nproc 32768
* hard nproc 32768
EOF

# Optimize kernel parameters
cat >> /etc/sysctl.conf << EOF
# Increase max connections
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192

# Optimize memory
vm.swappiness = 10
vm.vfs_cache_pressure = 50

# Network tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
EOF

sysctl -p

Automated Recovery

Create watchdog scripts for critical services:

cat > /usr/local/bin/service-watchdog.sh << 'EOF'
#!/bin/bash
SERVICES=("nginx" "mysql" "php-fpm")

for SERVICE in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$SERVICE"; then
        echo "$(date): $SERVICE is down, restarting..." >> /var/log/watchdog.log
        systemctl restart "$SERVICE"
        echo "$SERVICE restarted" | mail -s "Service Recovery: $SERVICE" [email protected]
    fi
done
EOF

chmod +x /usr/local/bin/service-watchdog.sh
echo "*/5 * * * * /usr/local/bin/service-watchdog.sh" | crontab -

Regular Maintenance

# Weekly cleanup script
cat > /usr/local/bin/weekly-maintenance.sh << 'EOF'
#!/bin/bash
# Rotate logs
logrotate -f /etc/logrotate.conf

# Clean package cache
apt autoremove -y
apt autoclean

# Update database
updatedb

# Check disk health
smartctl -H /dev/sda

# Send report
df -h > /tmp/disk-report.txt
free -h >> /tmp/disk-report.txt
mail -s "Weekly Maintenance Report" [email protected] < /tmp/disk-report.txt
EOF

chmod +x /usr/local/bin/weekly-maintenance.sh

Advanced Diagnostic Techniques

Performance Profiling

# System-wide performance profile
perf record -a -g sleep 30
perf report

# CPU flame graphs
git clone https://github.com/brendangregg/FlameGraph
perf record -F 99 -a -g -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu-flamegraph.svg

# Memory profiling
valgrind --leak-check=full --show-leak-kinds=all command

# Detailed I/O tracking
blktrace -d /dev/sda -o - | blkparse -i -

Kernel Debugging

# Enable kernel debugging
echo 1 > /proc/sys/kernel/sysrq

# Dump backtrace of all tasks
echo t > /proc/sysrq-trigger
dmesg | tail -100

# Show memory usage
echo m > /proc/sysrq-trigger

# Show CPU registers and flags
echo p > /proc/sysrq-trigger

Network Deep Dive

# Capture network traffic (cap the packet count so the file cannot fill the disk)
tcpdump -i any -c 10000 -w /tmp/capture.pcap

# Analyze specific connections
tcpdump -i any host server-ip and port 80

# Check for SYN floods
netstat -n | grep SYN_RECV | wc -l

# Monitor connection rates
watch -n 1 'ss -s'
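
If a SYN flood is confirmed, SYN cookies let the kernel keep accepting legitimate connections without exhausting the backlog. Most distributions enable them by default; verify, and enable as a stopgap if not:

# Check and enable SYN cookies
sysctl net.ipv4.tcp_syncookies
sysctl -w net.ipv4.tcp_syncookies=1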

Documentation and Reporting

Creating Incident Reports

Document your findings systematically:

# Automated diagnostic report
cat > /usr/local/bin/diagnostic-report.sh << 'EOF'
#!/bin/bash
REPORT="/tmp/diagnostic-report-$(date +%Y%m%d-%H%M%S).txt"

echo "=== DIAGNOSTIC REPORT ===" > $REPORT
echo "Date: $(date)" >> $REPORT
echo "" >> $REPORT

echo "=== SYSTEM INFO ===" >> $REPORT
uname -a >> $REPORT
uptime >> $REPORT
echo "" >> $REPORT

echo "=== CPU USAGE ===" >> $REPORT
top -bn1 | head -20 >> $REPORT
echo "" >> $REPORT

echo "=== MEMORY USAGE ===" >> $REPORT
free -h >> $REPORT
echo "" >> $REPORT

echo "=== DISK USAGE ===" >> $REPORT
df -h >> $REPORT
echo "" >> $REPORT

echo "=== NETWORK CONNECTIONS ===" >> $REPORT
ss -s >> $REPORT
echo "" >> $REPORT

echo "=== SERVICE STATUS ===" >> $REPORT
systemctl --failed >> $REPORT
echo "" >> $REPORT

echo "=== RECENT ERRORS ===" >> $REPORT
journalctl -p err -n 50 --no-pager >> $REPORT

echo "Report saved to: $REPORT"
EOF

chmod +x /usr/local/bin/diagnostic-report.sh

Conclusion

Server unresponsiveness is a critical issue that requires systematic diagnosis and swift resolution. By following the methodical approach outlined in this guide, you can quickly identify root causes and implement effective solutions. The keys to successful troubleshooting are:

  1. Stay calm and systematic: Follow the diagnostic steps in order
  2. Document everything: Keep notes of what you observe and try
  3. Understand your baseline: Know what normal looks like for your systems
  4. Implement monitoring: Catch issues before they become critical
  5. Practice recovery: Test your procedures regularly
  6. Automate where possible: Use scripts for common tasks
  7. Learn from incidents: Review and improve after each issue

Regular maintenance, proactive monitoring, and understanding these diagnostic commands will minimize downtime and ensure rapid recovery when issues do occur. Remember that prevention is always better than cure, so invest time in proper monitoring, alerting, and automated recovery mechanisms.

Keep this guide handy, practice with these commands in non-critical situations, and you'll be well-prepared when facing actual server unresponsiveness incidents.