Server Not Responding: Command-Based Diagnostics
Introduction
A server that stops responding is one of the most critical issues a system administrator faces. Whether you manage web, database, or application servers, an unresponsive system can mean significant downtime, lost revenue, and frustrated users. This guide provides a systematic, command-line approach to diagnosing and resolving server unresponsiveness.
Knowing how to diagnose an unresponsive server methodically is essential for any DevOps engineer, system administrator, or IT professional. The sections below walk through a structured troubleshooting process, using proven diagnostic commands to identify root causes and implement effective fixes.
Understanding Server Unresponsiveness
What Does "Not Responding" Mean?
Server unresponsiveness can manifest in several ways:
- Complete network timeout: Unable to ping or connect to the server
- Partial responsiveness: Server responds to ping but services are unavailable
- Slow response: Server responds but with significant delays
- Service-specific issues: Specific applications or services fail while others work
- Intermittent failures: Server responds inconsistently
Common Symptoms
Before diving into diagnostics, recognize these common symptoms:
- SSH connections timeout or refuse to establish
- Web services return HTTP 502/503/504 errors
- Database connections fail or hang
- Ping requests timeout completely
- Services appear running but don't respond to requests
- High latency in all network communications
- System console shows frozen output
Initial Assessment and Diagnostic Strategy
The Systematic Approach
When facing an unresponsive server, follow this structured diagnostic methodology:
- Verify the problem: Confirm the issue from multiple locations
- Check external factors: Network connectivity, DNS resolution
- Assess system resources: CPU, memory, disk I/O
- Review recent changes: Updates, deployments, configuration changes
- Analyze logs: System and application logs for errors
- Test services individually: Isolate the problematic component
Remote vs Console Access
Your diagnostic approach differs based on access method:
- Remote Access (SSH): If SSH is available, you have full diagnostic capabilities
- Console Access: If SSH fails, use KVM/IPMI/physical console access
- No Access: Contact hosting provider or use out-of-band management tools
Step 1: Initial Connectivity Testing
Testing Basic Network Connectivity
Start with basic network reachability tests from your local machine:
# Basic ping test
ping -c 4 your-server-ip
# Traceroute to identify network path issues
traceroute your-server-ip
# MTR for continuous network monitoring
mtr -c 100 your-server-ip
# Check specific ports
telnet your-server-ip 22
nc -zv your-server-ip 22 80 443
Interpretation:
- No ping response: Network issue or firewall blocking ICMP
- Ping works but ports closed: Services down or firewall rules changed
- Packet loss: Network congestion or hardware issues
- High latency: Network path problems or server resource exhaustion
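The "ping works but ports closed" case above can be narrowed down further from the client side: a refused connection usually means the host is up but nothing is listening, while a silent timeout points at a firewall drop or a routing problem. A minimal sketch using nc with a bounded wait:
# Bounded connection test: the error message distinguishes the two failure modes
nc -zv -w 5 your-server-ip 80
# "Connection refused"           -> host reachable, service not listening
# Timeout or "No route to host"  -> firewall drop or network path problem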
DNS Resolution Testing
Verify DNS is not causing connectivity issues:
# Test DNS resolution
nslookup your-domain.com
dig your-domain.com +short
# Check reverse DNS
dig -x your-server-ip
# Test with alternative DNS servers
nslookup your-domain.com 8.8.8.8
If DNS fails but IP address works, the issue is DNS-related, not server unresponsiveness.
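A quick way to confirm the split is to request the same service by hostname and by IP address and compare the results; this is a sketch assuming an HTTP service listening on port 80:
# If the request by IP succeeds while the request by name fails, the fault is DNS
curl -sS -o /dev/null -w "by name: %{http_code}\n" http://your-domain.com/
curl -sS -o /dev/null -w "by IP:   %{http_code}\n" http://your-server-ip/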
Step 2: Establishing Server Access
Using Alternative Access Methods
If standard SSH fails, try these alternatives:
# SSH with verbose output
ssh -vvv user@server-ip
# SSH on alternative port
ssh -p 2222 user@server-ip
# SSH with specific identity file
ssh -i /path/to/key user@server-ip
# SSH through jump host
ssh -J jumphost user@server-ip
Console Access Options
When SSH is completely unavailable:
- Cloud Provider Console: AWS EC2 Serial Console, DigitalOcean Droplet Console
- IPMI/iLO/iDRAC: Out-of-band management for bare metal servers
- KVM over IP: Remote console access
- Physical Access: Direct keyboard/monitor connection
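For cloud instances, the provider's CLI can usually pull the serial console output even when the guest's network stack is dead, which is often enough to spot a kernel panic or a full disk. A sketch using the AWS CLI (the instance ID is a placeholder):
# Fetch the most recent serial console output for an instance
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --latest --output text
# Last resort: hard reboot through the provider API
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0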
Step 3: System Resource Diagnostics
CPU Usage Analysis
Once you have access, immediately check CPU utilization:
# Quick CPU overview
top -bn1 | head -20
# Detailed CPU statistics
mpstat 1 5
# Per-process CPU usage
ps aux --sort=-%cpu | head -15
# Real-time CPU monitoring
htop
# CPU information and utilization
lscpu
uptime
High CPU indicators:
- Load average significantly higher than CPU count
- One or more processes consuming >90% CPU
- System CPU (sy) higher than user CPU (us)
- iowait (wa) percentage consistently high
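To put the load average in context, compare it against the number of online CPUs; a 1-minute load persistently above the core count indicates saturation. A minimal sketch:
# Compare the 1-minute load average with the CPU core count
CORES=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
echo "cores=$CORES load(1m)=$LOAD1"
awk -v l="$LOAD1" -v c="$CORES" 'BEGIN { exit !(l > c) }' && echo "WARNING: load exceeds core count"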
Memory Usage Analysis
Memory exhaustion is a common cause of unresponsiveness:
# Memory usage overview
free -h
# Detailed memory statistics
vmstat 1 5
# Memory usage by process
ps aux --sort=-%mem | head -15
# Check for OOM killer activity
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/kern.log
# Slab memory usage
slabtop -o
Indicators of memory issues:
- Available memory near zero
- High swap usage (Swap used > 50%)
- OOM killer messages in logs
- Processes killed unexpectedly
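On systemd distributions, the kernel log records exactly which process the OOM killer sacrificed and why; a quick check to run alongside the dmesg search above:
# Search the current boot's kernel messages for OOM activity
journalctl -k -b | grep -iE "out of memory|oom-killer|killed process"
# MemAvailable is the most realistic "free memory" figure on modern kernels
grep -E "^(MemAvailable|SwapFree)" /proc/meminfo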
Disk I/O Analysis
Disk bottlenecks can make systems appear unresponsive:
# Disk I/O statistics
iostat -x 1 5
# Per-process I/O usage
iotop -o
# Disk usage by filesystem
df -h
# Inode usage
df -i
# Find large files
du -sh /* | sort -rh | head -10
# Check for disk errors
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
Indicators of disk issues:
- %util approaching 100%
- High await times (>50ms)
- Disk space at 100%
- Inode usage at 100%
- I/O errors in dmesg
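One common trap is a filesystem that df reports as full while du cannot account for the space: a process is still holding a deleted file (often a rotated log) open, so the blocks are never released. A sketch for spotting this:
# List open files whose directory entry has been removed (link count < 1);
# the space is only freed once the owning process closes the file or is restarted
lsof +L1 2>/dev/null | head -20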
Step 4: Network and Service Diagnostics
Network Connection Analysis
Check active connections and network statistics:
# Active network connections
netstat -tunap
ss -tunap
# Connection counts by state
ss -s
# Connections by IP address
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
# Network interface statistics
ip -s link
ifconfig
# Check for network errors
netstat -i
Indicators of network issues:
- Excessive connections in TIME_WAIT or CLOSE_WAIT states
- Single IP with hundreds of connections (potential attack)
- Network interface errors or drops
- Firewall dropping packets
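A per-state connection count makes TIME_WAIT or SYN_RECV build-ups obvious at a glance; a minimal sketch:
# Count TCP connections grouped by state
ss -tan | awk 'NR > 1 {count[$1]++} END {for (s in count) print count[s], s}' | sort -rn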
Service Status Verification
Check critical services status:
# List all running services
systemctl list-units --type=service --state=running
# Check specific service status
systemctl status nginx
systemctl status mysql
systemctl status apache2
# Check service startup times
systemd-analyze blame
# Check for failed services
systemctl --failed
Port and Process Association
Identify what's listening on expected ports:
# List listening ports with processes
ss -tulpn
netstat -tulpn
# Check specific port
lsof -i :80
fuser -v 80/tcp
# List all open files by process
lsof -p <PID>
Step 5: Log Analysis
System Logs Examination
System logs often contain critical diagnostic information:
# Recent system messages
journalctl -xe
# Last 100 kernel messages
dmesg | tail -100
# Authentication logs
tail -100 /var/log/auth.log
grep "Failed password" /var/log/auth.log | tail -20
# System log
tail -100 /var/log/syslog
tail -100 /var/log/messages
# Check for segfaults
journalctl -b | grep -i segfault
# Kernel ring buffer
dmesg -T | grep -i "error\|fail\|warning"
Application-Specific Logs
Review logs for specific services:
# Web server logs
tail -f /var/log/nginx/error.log
tail -f /var/log/apache2/error.log
# Database logs
tail -f /var/log/mysql/error.log
tail -f /var/log/postgresql/postgresql-*.log
# Application logs
journalctl -u your-service -f
# Search for errors in last hour
journalctl --since "1 hour ago" | grep -i error
Step 6: Process and Service Analysis
Identifying Problem Processes
Find processes causing issues:
# Top CPU consumers
ps aux --sort=-%cpu | head -10
# Top memory consumers
ps aux --sort=-%mem | head -10
# Processes with highest open file count
lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head
# Long-running processes
ps -eo pid,user,comm,start,time | sort -k4
# Zombie processes
ps aux | awk '$8 ~ /^Z/'
# Process tree
pstree -p
Analyzing Process Behavior
Get detailed information about problematic processes:
# Process details
ps -fp <PID>
# Process limits
cat /proc/<PID>/limits
# Process file descriptors
ls -l /proc/<PID>/fd | wc -l
# Process network connections
lsof -p <PID> -i
# Trace system calls
strace -p <PID>
# Process stack trace
pstack <PID>
gdb -p <PID> -batch -ex "thread apply all bt"
Step 7: Checking for Security Issues
Detecting Intrusions or Attacks
Look for signs of security compromise:
# Check for unusual processes
ps aux | grep -v "^root\|^www-data\|^mysql" | less
# Recent login activity
last -a | head -20
lastb | head -20 # Failed login attempts
# Current logged-in users
w
who
# Unusual network connections
ss -tunap state established | grep -v ":80\|:443\|:22"
# Check for rootkits
rkhunter --check
chkrootkit
# Check listening processes
ss -tulpn | grep LISTEN
Firewall and Security Log Review
# Check firewall rules
iptables -L -n -v
ufw status verbose
# Security log review
grep -i "refused\|denied\|error" /var/log/auth.log | tail -50
# Fail2ban status (if installed)
fail2ban-client status
fail2ban-client status sshd
Root Cause Analysis
Common Causes of Server Unresponsiveness
1. Resource Exhaustion
CPU Exhaustion:
- Runaway processes consuming all CPU cycles
- Infinite loops in applications
- Cryptocurrency miners
- DDoS attacks
Memory Exhaustion:
- Memory leaks in applications
- Insufficient memory for workload
- Cache growing unbounded
- OOM killer terminating critical processes
Disk Exhaustion:
- Full filesystem preventing writes
- Inode exhaustion
- Disk I/O bottleneck
- Hardware failure
2. Network Issues
- Firewall rule changes blocking access
- DDoS or brute force attacks
- Network interface errors
- Routing problems
- Bandwidth saturation
3. Service Failures
- Application crashes
- Database connection exhaustion
- Configuration errors
- Deadlocks in applications
- Service dependencies failing
4. Kernel Issues
- Kernel panics
- Driver failures
- File system corruption
- Out of memory conditions
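If the machine has already been power-cycled, the previous boot's kernel log (available when persistent journaling is enabled) often shows the panic, driver failure, or filesystem error that caused the freeze. A quick sketch:
# List recorded boots, then inspect the tail of the previous boot's kernel log
journalctl --list-boots
journalctl -k -b -1 --no-pager | tail -50
# Look for filesystem or block-device errors in the current boot
dmesg -T | grep -iE "ext4-fs error|xfs.*corrupt|i/o error"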
Solutions and Remediation
Immediate Recovery Actions
If CPU bound:
# Kill problematic process
kill <PID>
kill -9 <PID> # Force kill if needed
# Lower process priority
renice +10 <PID>
# Limit CPU usage with cpulimit
cpulimit -p <PID> -l 50
If memory bound:
# Drop clean page caches (non-destructive, but expect a brief performance dip while caches refill)
sync && echo 1 > /proc/sys/vm/drop_caches
# Restart memory-hungry service
systemctl restart service-name
# Add swap space temporarily
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
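If the extra swap turns out to be needed long-term, it can be made persistent across reboots (assuming /swapfile from the commands above is kept):
# Persist the swap file and confirm it is active
echo '/swapfile none swap sw 0 0' >> /etc/fstab
swapon --show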
If disk bound:
# Find large log files and truncate them (truncating, rather than deleting, frees space even if a process still holds the file open)
find /var/log -type f -size +100M -exec truncate -s 0 {} \;
# Clean package cache
apt clean # Ubuntu/Debian
yum clean all # CentOS/RHEL
# Remove old kernels
apt autoremove --purge # Ubuntu/Debian
# Compress old logs
gzip /var/log/*.log.1
If service crashed:
# Restart service
systemctl restart service-name
# Enable service auto-restart
systemctl edit service-name
# Add:
# [Service]
# Restart=always
# RestartSec=10
# Start service with debugging
service-name --verbose --debug
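The same Restart= policy can be applied non-interactively with a drop-in file, which is easier to script than an interactive systemctl edit; a minimal sketch with service-name as a placeholder:
# Create a drop-in override that restarts the unit automatically on failure
mkdir -p /etc/systemd/system/service-name.service.d
cat > /etc/systemd/system/service-name.service.d/override.conf << 'EOF'
[Service]
Restart=always
RestartSec=10
EOF
systemctl daemon-reload
systemctl restart service-name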
Network-Related Fixes
# Restart networking (caution: over SSH this may drop your session)
systemctl restart networking        # Debian/Ubuntu with ifupdown
systemctl restart NetworkManager    # systems using NetworkManager
# Flush and reload the firewall (caution: if the default policy is DROP, flushing rules remotely can lock you out)
iptables -F
systemctl restart firewalld   # CentOS/RHEL
ufw reload                    # Ubuntu/Debian
# Reset network interface
ip link set eth0 down
ip link set eth0 up
# Clear ARP cache
ip neigh flush all
Service Recovery
# Graceful service restart
systemctl reload service-name
# Force restart with timeout
timeout 30 systemctl restart service-name || systemctl kill service-name
# Reset failed state
systemctl reset-failed service-name
Prevention and Best Practices
Proactive Monitoring
Implement monitoring to catch issues before they cause unresponsiveness:
# Install monitoring tools
apt install sysstat monitoring-plugins nagios-plugins-basic
# Enable system statistics collection
systemctl enable sysstat
systemctl start sysstat
# Create monitoring script
cat > /usr/local/bin/resource-monitor.sh << 'EOF'
#!/bin/bash
CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM=$(free | grep Mem | awk '{print ($3/$2) * 100}')
DISK=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
if (( $(echo "$CPU > 90" | bc -l) )); then
echo "High CPU: $CPU%" | mail -s "Alert: High CPU" [email protected]
fi
if (( $(echo "$MEM > 90" | bc -l) )); then
echo "High Memory: $MEM%" | mail -s "Alert: High Memory" [email protected]
fi
if [ $DISK -gt 90 ]; then
echo "High Disk: $DISK%" | mail -s "Alert: High Disk" [email protected]
fi
EOF
chmod +x /usr/local/bin/resource-monitor.sh
# Append to crontab without overwriting existing entries
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/resource-monitor.sh") | crontab -
System Hardening
# Set resource limits
cat >> /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
* soft nproc 32768
* hard nproc 32768
EOF
# Optimize kernel parameters
cat >> /etc/sysctl.conf << EOF
# Increase max connections
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
# Optimize memory
vm.swappiness = 10
vm.vfs_cache_pressure = 50
# Network tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
EOF
sysctl -p
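After reloading, confirm the values actually took effect; a quick check:
# Verify the tuned parameters
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog vm.swappiness net.ipv4.tcp_fin_timeout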
Automated Recovery
Create watchdog scripts for critical services:
cat > /usr/local/bin/service-watchdog.sh << 'EOF'
#!/bin/bash
SERVICES=("nginx" "mysql" "php-fpm")
for SERVICE in "${SERVICES[@]}"; do
if ! systemctl is-active --quiet "$SERVICE"; then
echo "$(date): $SERVICE is down, restarting..." >> /var/log/watchdog.log
systemctl restart "$SERVICE"
echo "$SERVICE restarted" | mail -s "Service Recovery: $SERVICE" [email protected]
fi
done
EOF
chmod +x /usr/local/bin/service-watchdog.sh
echo "*/5 * * * * /usr/local/bin/service-watchdog.sh" | crontab -
Regular Maintenance
# Weekly cleanup script
cat > /usr/local/bin/weekly-maintenance.sh << 'EOF'
#!/bin/bash
# Rotate logs
logrotate -f /etc/logrotate.conf
# Clean package cache
apt autoremove -y
apt autoclean
# Update database
updatedb
# Check disk health
smartctl -H /dev/sda
# Send report
df -h > /tmp/disk-report.txt
free -h >> /tmp/disk-report.txt
mail -s "Weekly Maintenance Report" [email protected] < /tmp/disk-report.txt
EOF
chmod +x /usr/local/bin/weekly-maintenance.sh
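To actually run the script weekly, append it to the crontab without clobbering existing entries (the same append-safe pattern used for the monitoring scripts above):
# Schedule the maintenance script for Sunday 03:00
(crontab -l 2>/dev/null; echo "0 3 * * 0 /usr/local/bin/weekly-maintenance.sh") | crontab -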
Advanced Diagnostic Techniques
Performance Profiling
# System-wide performance profile
perf record -a -g sleep 30
perf report
# CPU flame graphs
git clone https://github.com/brendangregg/FlameGraph
perf record -F 99 -a -g -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu-flamegraph.svg
# Memory profiling
valgrind --leak-check=full --show-leak-kinds=all command
# Detailed I/O tracking
blktrace -d /dev/sda -o - | blkparse -i -
Kernel Debugging
# Enable kernel debugging
echo 1 > /proc/sys/kernel/sysrq
# Dump backtrace of all tasks
echo t > /proc/sysrq-trigger
dmesg | tail -100
# Show memory usage
echo m > /proc/sysrq-trigger
# Show CPU registers and flags
echo p > /proc/sysrq-trigger
Network Deep Dive
# Capture network traffic
tcpdump -i any -w /tmp/capture.pcap
# Analyze specific connections
tcpdump -i any host server-ip and port 80
# Check for SYN floods
netstat -n | grep SYN_RECV | wc -l
# Monitor connection rates
watch -n 1 'ss -s'
Documentation and Reporting
Creating Incident Reports
Document your findings systematically:
# Automated diagnostic report
cat > /usr/local/bin/diagnostic-report.sh << 'EOF'
#!/bin/bash
REPORT="/tmp/diagnostic-report-$(date +%Y%m%d-%H%M%S).txt"
echo "=== DIAGNOSTIC REPORT ===" > $REPORT
echo "Date: $(date)" >> $REPORT
echo "" >> $REPORT
echo "=== SYSTEM INFO ===" >> $REPORT
uname -a >> $REPORT
uptime >> $REPORT
echo "" >> $REPORT
echo "=== CPU USAGE ===" >> $REPORT
top -bn1 | head -20 >> $REPORT
echo "" >> $REPORT
echo "=== MEMORY USAGE ===" >> $REPORT
free -h >> $REPORT
echo "" >> $REPORT
echo "=== DISK USAGE ===" >> $REPORT
df -h >> $REPORT
echo "" >> $REPORT
echo "=== NETWORK CONNECTIONS ===" >> $REPORT
ss -s >> $REPORT
echo "" >> $REPORT
echo "=== SERVICE STATUS ===" >> $REPORT
systemctl --failed >> $REPORT
echo "" >> $REPORT
echo "=== RECENT ERRORS ===" >> $REPORT
journalctl -p err -n 50 --no-pager >> $REPORT
echo "Report saved to: $REPORT"
EOF
chmod +x /usr/local/bin/diagnostic-report.sh
Conclusion
Server unresponsiveness is a critical issue that requires systematic diagnosis and swift resolution. By following the methodical approach outlined in this guide, you can quickly identify root causes and implement effective solutions. The keys to successful troubleshooting are:
- Stay calm and systematic: Follow the diagnostic steps in order
- Document everything: Keep notes of what you observe and try
- Understand your baseline: Know what normal looks like for your systems
- Implement monitoring: Catch issues before they become critical
- Practice recovery: Test your procedures regularly
- Automate where possible: Use scripts for common tasks
- Learn from incidents: Review and improve after each issue
Regular maintenance, proactive monitoring, and understanding these diagnostic commands will minimize downtime and ensure rapid recovery when issues do occur. Remember that prevention is always better than cure, so invest time in proper monitoring, alerting, and automated recovery mechanisms.
Keep this guide handy, practice with these commands in non-critical situations, and you'll be well-prepared when facing actual server unresponsiveness incidents.


