Monitoring with Custom Bash Scripts
Introduction
While enterprise monitoring solutions like Prometheus, Nagios, and Zabbix offer comprehensive features, custom Bash scripts provide a lightweight, flexible, and cost-effective alternative for monitoring Linux servers. Bash scripts can be tailored to your specific infrastructure needs, deployed instantly without dependencies, and integrated seamlessly with existing tools and workflows.
Custom monitoring scripts excel in scenarios where you need quick deployment, specific metrics not covered by standard tools, or lightweight monitoring for resource-constrained environments. They're particularly valuable for monitoring custom applications, enforcing specific business logic, and creating monitoring solutions that integrate directly with your notification systems.
This comprehensive guide teaches you how to build robust monitoring scripts from simple resource checks to advanced multi-server monitoring systems. You'll learn how to monitor CPU, memory, disk, network, services, and custom application metrics, implement alerting mechanisms, create dashboards, and follow best practices for maintainable, production-ready monitoring scripts.
Prerequisites
Before building custom monitoring scripts, ensure you have:
- A Linux server (Ubuntu 20.04/22.04, Debian 10/11, CentOS 7/8, Rocky Linux 8/9, or similar)
- Root or sudo access for system-level monitoring
- Basic to intermediate Bash scripting knowledge
- Familiarity with Linux system commands
- Text editor (nano, vim, or vi)
Recommended Knowledge:
- Understanding of process management
- Basic knowledge of awk, grep, and sed
- Familiarity with cron for scheduling
- Understanding of exit codes and error handling
Optional Tools:
- Mail transfer agent (for email alerts)
- curl or wget (for webhook notifications)
- bc (for floating-point calculations)
- jq (for JSON processing)
Basic Monitoring Script Structure
Essential Components
A well-structured monitoring script should include:
#!/bin/bash
#
# Script Name: monitor-system.sh
# Description: Basic system monitoring script
# Author: Your Name
# Date: 2024-01-11
# Version: 1.0
# Exit on error
set -e
# Configuration
LOG_FILE="/var/log/monitoring/system-monitor.log"
ALERT_EMAIL="[email protected]"
HOSTNAME=$(hostname)
# Thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
# Functions
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
send_alert() {
local subject="$1"
local message="$2"
echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
log_message "Alert sent: $subject"
}
# Main monitoring logic
main() {
log_message "Starting system monitoring"
# Monitoring checks go here
log_message "Monitoring completed"
}
# Execute main function
main "$@"
Error Handling and Logging
#!/bin/bash
# Enhanced error handling
set -euo pipefail # Exit on error, undefined variables, pipe failures
# Error trap
trap 'error_handler $? $LINENO' ERR
error_handler() {
local exit_code=$1
local line_number=$2
log_message "ERROR: Script failed at line $line_number with exit code $exit_code"
send_alert "Monitoring Script Failed" \
"Script failed on $HOSTNAME at line $line_number"
exit "$exit_code"
}
# Logging with levels
log_level() {
local level=$1
shift
local message="$*"
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $message" | tee -a "$LOG_FILE"
}
log_info() { log_level "INFO" "$@"; }
log_warn() { log_level "WARN" "$@"; }
log_error() { log_level "ERROR" "$@"; }
CPU Monitoring Scripts
Basic CPU Usage Check
#!/bin/bash
# cpu-monitor.sh - Monitor CPU usage
CPU_THRESHOLD=80
LOG_FILE="/var/log/monitoring/cpu.log"
# Get CPU usage (percentage of idle time subtracted from 100)
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
CPU_USAGE_INT=${CPU_USAGE%.*} # Convert to integer
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log_message "Current CPU usage: ${CPU_USAGE}%"
if [ "$CPU_USAGE_INT" -gt "$CPU_THRESHOLD" ]; then
log_message "WARNING: CPU usage above threshold (${CPU_USAGE}% > ${CPU_THRESHOLD}%)"
# Get top CPU consumers
TOP_PROCESSES=$(ps aux --sort=-%cpu | head -n 6)
# Send alert
{
echo "High CPU usage detected on $(hostname)"
echo "Current usage: ${CPU_USAGE}%"
echo "Threshold: ${CPU_THRESHOLD}%"
echo ""
echo "Top CPU consumers:"
echo "$TOP_PROCESSES"
} | mail -s "CPU Alert: $(hostname)" [email protected]
fi
Advanced CPU Monitoring with Per-Core Analysis
#!/bin/bash
# cpu-detailed-monitor.sh - Detailed CPU monitoring
monitor_cpu_cores() {
local threshold=$1
# Get per-core CPU usage
mpstat -P ALL 1 1 | tail -n +4 | while read line; do
if [[ $line =~ ^[0-9] ]]; then
core=$(echo "$line" | awk '{print $2}')
idle=$(echo "$line" | awk '{print $NF}')
usage=$(echo "100 - $idle" | bc)
usage_int=${usage%.*}
echo "Core $core: ${usage}%"
if (( $(echo "$usage_int > $threshold" | bc -l) )); then
echo "WARNING: Core $core usage: ${usage}%"
fi
fi
done
}
# Get load averages
get_load_average() {
local load_1min load_5min load_15min
read load_1min load_5min load_15min < /proc/loadavg
# Get number of CPU cores
local cpu_cores=$(nproc)
# Calculate load per core
local load_per_core=$(echo "scale=2; $load_1min / $cpu_cores" | bc)
echo "Load Average: $load_1min (1m), $load_5min (5m), $load_15min (15m)"
echo "CPU Cores: $cpu_cores"
echo "Load per Core: $load_per_core"
# Alert if load per core exceeds 1.0
if (( $(echo "$load_per_core > 1.0" | bc -l) )); then
echo "WARNING: High load per core: $load_per_core"
return 1
fi
}
# Main execution
echo "=== CPU Monitoring Report ==="
echo "Timestamp: $(date)"
echo ""
echo "--- Load Average ---"
get_load_average
echo ""
echo "--- Per-Core Usage ---"
monitor_cpu_cores 80
CPU Wait Time Monitoring
#!/bin/bash
# cpu-iowait-monitor.sh - Monitor I/O wait time
IOWAIT_THRESHOLD=20
# Get I/O wait percentage
get_iowait() {
iostat -c 1 2 | tail -1 | awk '{print $4}'
}
IOWAIT=$(get_iowait)
IOWAIT_INT=${IOWAIT%.*}
echo "Current I/O wait: ${IOWAIT}%"
if [ "$IOWAIT_INT" -gt "$IOWAIT_THRESHOLD" ]; then
echo "WARNING: High I/O wait detected"
# Identify I/O intensive processes
echo ""
echo "Top I/O processes:"
sudo iotop -b -n 1 -o | head -n 20
# Check disk statistics
echo ""
echo "Disk I/O statistics:"
iostat -x 1 2 | tail -n +4
fi
Memory Monitoring Scripts
Comprehensive Memory Monitor
#!/bin/bash
# memory-monitor.sh - Comprehensive memory monitoring
MEMORY_THRESHOLD=85
SWAP_THRESHOLD=50
LOG_FILE="/var/log/monitoring/memory.log"
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Get memory statistics
get_memory_stats() {
local total used free shared buffers cached available
read -r _ total used free shared buffers cached <<< \
"$(free -m | grep Mem:)"
available=$(free -m | grep Mem: | awk '{print $7}')
local mem_percent=$(echo "scale=2; ($used / $total) * 100" | bc)
local mem_percent_int=${mem_percent%.*}
echo "Total: ${total}MB"
echo "Used: ${used}MB"
echo "Free: ${free}MB"
echo "Available: ${available}MB"
echo "Usage: ${mem_percent}%"
if [ "$mem_percent_int" -gt "$MEMORY_THRESHOLD" ]; then
log_message "WARNING: Memory usage ${mem_percent}% exceeds threshold ${MEMORY_THRESHOLD}%"
return 1
fi
return 0
}
# Get swap statistics
get_swap_stats() {
local total used free
read -r _ total used free <<< "$(free -m | grep Swap:)"
if [ "$total" -eq 0 ]; then
echo "Swap: Not configured"
return 0
fi
local swap_percent=$(echo "scale=2; ($used / $total) * 100" | bc)
local swap_percent_int=${swap_percent%.*}
echo "Swap Total: ${total}MB"
echo "Swap Used: ${used}MB"
echo "Swap Usage: ${swap_percent}%"
if [ "$swap_percent_int" -gt "$SWAP_THRESHOLD" ]; then
log_message "WARNING: Swap usage ${swap_percent}% exceeds threshold ${SWAP_THRESHOLD}%"
return 1
fi
return 0
}
# Get top memory consumers
get_top_memory_processes() {
echo "Top 10 Memory Consuming Processes:"
ps aux --sort=-%mem | head -n 11 | awk '{printf "%-10s %-8s %-8s %s\n", $1, $2, $4"%", $11}'
}
# Check for memory leaks
check_memory_growth() {
local current_usage previous_usage
# Get current memory usage
current_usage=$(free -m | grep Mem: | awk '{print $3}')
# Read previous usage if exists
if [ -f /tmp/memory_usage.tmp ]; then
previous_usage=$(cat /tmp/memory_usage.tmp)
local growth=$((current_usage - previous_usage))
echo "Memory growth since last check: ${growth}MB"
if [ "$growth" -gt 500 ]; then
log_message "WARNING: Rapid memory growth detected: ${growth}MB"
fi
fi
# Save current usage
echo "$current_usage" > /tmp/memory_usage.tmp
}
# Main execution
main() {
log_message "Starting memory monitoring"
echo "=== Memory Monitoring Report ==="
echo "Timestamp: $(date)"
echo ""
echo "--- Physical Memory ---"
if ! get_memory_stats; then
get_top_memory_processes
echo "" | mail -s "Memory Alert: $(hostname)" [email protected]
fi
echo ""
echo "--- Swap Memory ---"
get_swap_stats
echo ""
check_memory_growth
echo ""
get_top_memory_processes
}
main
OOM Killer Detection
#!/bin/bash
# oom-monitor.sh - Detect Out Of Memory killer events
LOG_FILE="/var/log/monitoring/oom.log"
# Check for OOM killer events
check_oom_events() {
local oom_events
# Search kernel logs for OOM events in last hour
oom_events=$(journalctl -k --since "1 hour ago" | grep -i "Out of memory\|oom-killer\|Killed process")
if [ -n "$oom_events" ]; then
echo "OOM Killer Events Detected:"
echo "$oom_events"
# Log the event
echo "[$(date)] OOM events detected" >> "$LOG_FILE"
echo "$oom_events" >> "$LOG_FILE"
# Send alert
{
echo "Out Of Memory events detected on $(hostname)"
echo "Time: $(date)"
echo ""
echo "Events:"
echo "$oom_events"
echo ""
echo "Current Memory Status:"
free -h
echo ""
echo "Top Memory Consumers:"
ps aux --sort=-%mem | head -n 11
} | mail -s "OOM Alert: $(hostname)" [email protected]
return 1
fi
return 0
}
check_oom_events
Disk Monitoring Scripts
Disk Space Monitor
#!/bin/bash
# disk-monitor.sh - Monitor disk space usage
DISK_THRESHOLD=85
LOG_FILE="/var/log/monitoring/disk.log"
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check disk usage for all mounted filesystems
check_disk_usage() {
local alert_triggered=0
df -h | tail -n +2 | while read filesystem size used avail percent mountpoint; do
# Remove % sign from percentage
usage=${percent%\%}
echo "Checking $mountpoint: $percent used"
if [ "$usage" -gt "$DISK_THRESHOLD" ]; then
log_message "WARNING: $mountpoint is ${percent} full"
alert_triggered=1
# Find largest directories
echo "Largest directories in $mountpoint:"
du -h "$mountpoint" 2>/dev/null | sort -rh | head -n 10
# Send alert
{
echo "Disk space alert on $(hostname)"
echo "Mount point: $mountpoint"
echo "Usage: $percent"
echo "Threshold: ${DISK_THRESHOLD}%"
echo ""
echo "Largest directories:"
du -h "$mountpoint" 2>/dev/null | sort -rh | head -n 10
} | mail -s "Disk Space Alert: $mountpoint" [email protected]
fi
done
return $alert_triggered
}
# Check inode usage
check_inode_usage() {
local inode_threshold=85
df -i | tail -n +2 | while read filesystem inodes iused ifree ipercent mountpoint; do
# Remove % sign
usage=${ipercent%\%}
if [ "$usage" -gt "$inode_threshold" ]; then
log_message "WARNING: Inode usage on $mountpoint is ${ipercent}"
{
echo "Inode usage alert on $(hostname)"
echo "Mount point: $mountpoint"
echo "Inode usage: $ipercent"
echo ""
echo "Directories with most files:"
find "$mountpoint" -xdev -type d -exec sh -c \
'echo $(find "{}" -maxdepth 1 | wc -l) "{}"' \; 2>/dev/null | \
sort -rn | head -n 10
} | mail -s "Inode Alert: $mountpoint" [email protected]
fi
done
}
# Check for rapidly growing directories
check_directory_growth() {
local watch_dirs=("/var/log" "/tmp" "/var/tmp" "/home")
for dir in "${watch_dirs[@]}"; do
if [ ! -d "$dir" ]; then
continue
fi
current_size=$(du -sm "$dir" 2>/dev/null | awk '{print $1}')
state_file="/tmp/dirsize_$(echo $dir | tr '/' '_').tmp"
if [ -f "$state_file" ]; then
previous_size=$(cat "$state_file")
growth=$((current_size - previous_size))
if [ "$growth" -gt 1000 ]; then # Alert if grown by >1GB
log_message "WARNING: $dir grew by ${growth}MB"
{
echo "Rapid directory growth on $(hostname)"
echo "Directory: $dir"
echo "Growth: ${growth}MB"
echo ""
echo "Largest subdirectories:"
du -h "$dir" 2>/dev/null | sort -rh | head -n 10
} | mail -s "Directory Growth Alert: $dir" [email protected]
fi
fi
echo "$current_size" > "$state_file"
done
}
# Main execution
main() {
log_message "Starting disk monitoring"
echo "=== Disk Space Monitoring ==="
check_disk_usage
echo ""
echo "=== Inode Usage ==="
check_inode_usage
echo ""
echo "=== Directory Growth Check ==="
check_directory_growth
}
main
Disk I/O Performance Monitor
#!/bin/bash
# disk-io-monitor.sh - Monitor disk I/O performance
IOPS_THRESHOLD=1000
LATENCY_THRESHOLD=100 # milliseconds
# Get disk I/O statistics
get_disk_io_stats() {
iostat -x 1 2 | tail -n +4 | tail -n +7 | while read line; do
device=$(echo "$line" | awk '{print $1}')
tps=$(echo "$line" | awk '{print $2}')
read_kb=$(echo "$line" | awk '{print $3}')
write_kb=$(echo "$line" | awk '{print $4}')
await=$(echo "$line" | awk '{print $10}')
echo "Device: $device"
echo " Transactions/sec: $tps"
echo " Read: ${read_kb} kB/s"
echo " Write: ${write_kb} kB/s"
echo " Await: ${await} ms"
# Check thresholds
if (( $(echo "$await > $LATENCY_THRESHOLD" | bc -l) )); then
echo " WARNING: High latency detected"
# Get processes doing I/O
sudo iotop -b -n 1 -o | head -n 20
fi
done
}
get_disk_io_stats
Network Monitoring Scripts
Network Traffic Monitor
#!/bin/bash
# network-monitor.sh - Monitor network traffic and connections
BANDWIDTH_THRESHOLD=80 # Percentage of interface capacity
LOG_FILE="/var/log/monitoring/network.log"
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Monitor network interface traffic
monitor_interface_traffic() {
local interface=$1
# Get interface statistics
rx_bytes_before=$(cat /sys/class/net/"$interface"/statistics/rx_bytes)
tx_bytes_before=$(cat /sys/class/net/"$interface"/statistics/tx_bytes)
sleep 1
rx_bytes_after=$(cat /sys/class/net/"$interface"/statistics/rx_bytes)
tx_bytes_after=$(cat /sys/class/net/"$interface"/statistics/tx_bytes)
# Calculate bytes per second
rx_bps=$((rx_bytes_after - rx_bytes_before))
tx_bps=$((tx_bytes_after - tx_bytes_before))
# Convert to Mbps
rx_mbps=$(echo "scale=2; $rx_bps * 8 / 1000000" | bc)
tx_mbps=$(echo "scale=2; $tx_bps * 8 / 1000000" | bc)
echo "Interface: $interface"
echo " RX: ${rx_mbps} Mbps"
echo " TX: ${tx_mbps} Mbps"
}
# Check connection count
check_connection_count() {
local established=$(ss -tan | grep ESTAB | wc -l)
local time_wait=$(ss -tan | grep TIME-WAIT | wc -l)
local total=$(ss -tan | tail -n +2 | wc -l)
echo "Connection Statistics:"
echo " Total: $total"
echo " Established: $established"
echo " Time-Wait: $time_wait"
if [ "$established" -gt 1000 ]; then
log_message "WARNING: High number of established connections: $established"
# Show top connection sources
echo " Top connection sources:"
ss -tan | grep ESTAB | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -n 10
fi
}
# Check for network errors
check_network_errors() {
local interface=$1
local rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
local tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)
local rx_dropped=$(cat /sys/class/net/"$interface"/statistics/rx_dropped)
local tx_dropped=$(cat /sys/class/net/"$interface"/statistics/tx_dropped)
echo "Interface $interface errors:"
echo " RX errors: $rx_errors"
echo " TX errors: $tx_errors"
echo " RX dropped: $rx_dropped"
echo " TX dropped: $tx_dropped"
if [ "$rx_errors" -gt 100 ] || [ "$tx_errors" -gt 100 ]; then
log_message "WARNING: Network errors detected on $interface"
fi
}
# Monitor port availability
check_port_listening() {
local ports=("80" "443" "22" "3306")
echo "Port Availability:"
for port in "${ports[@]}"; do
if ss -tuln | grep -q ":$port "; then
echo " Port $port: LISTENING"
else
echo " Port $port: NOT LISTENING"
log_message "WARNING: Port $port is not listening"
fi
done
}
# Main execution
main() {
local interface=${1:-eth0}
echo "=== Network Monitoring Report ==="
echo "Timestamp: $(date)"
echo ""
echo "--- Interface Traffic ---"
monitor_interface_traffic "$interface"
echo ""
echo "--- Connection Statistics ---"
check_connection_count
echo ""
echo "--- Network Errors ---"
check_network_errors "$interface"
echo ""
echo "--- Port Status ---"
check_port_listening
}
main "$@"
Connection Monitor with Geo-IP
#!/bin/bash
# connection-geoip-monitor.sh - Monitor connections with geographic information
# Check for suspicious connection patterns
monitor_connections() {
echo "=== Active Connection Analysis ==="
# Get unique IPs with connection count
ss -tan | grep ESTAB | awk '{print $5}' | cut -d: -f1 | \
grep -v "^127\|^::1\|^$" | sort | uniq -c | sort -rn | \
while read count ip; do
echo "IP: $ip (Connections: $count)"
# Alert on high connection count from single IP
if [ "$count" -gt 50 ]; then
echo " WARNING: High connection count from $ip"
# Get GeoIP info if geoiplookup is available
if command -v geoiplookup &> /dev/null; then
geo=$(geoiplookup "$ip")
echo " Location: $geo"
fi
fi
done
}
monitor_connections
Service Monitoring Scripts
Comprehensive Service Monitor
#!/bin/bash
# service-monitor.sh - Monitor critical services
SERVICES=("nginx" "mysql" "redis" "ssh")
LOG_FILE="/var/log/monitoring/services.log"
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check if service is running
check_service() {
local service=$1
if systemctl is-active --quiet "$service"; then
echo "Service $service: RUNNING"
return 0
else
echo "Service $service: NOT RUNNING"
log_message "ERROR: Service $service is not running"
# Attempt to start service
log_message "Attempting to start $service"
if systemctl start "$service"; then
log_message "Successfully started $service"
# Send notification
echo "Service $service was down and has been restarted on $(hostname)" | \
mail -s "Service Recovered: $service" [email protected]
else
log_message "ERROR: Failed to start $service"
# Get service status
status=$(systemctl status "$service" --no-pager)
# Send alert
{
echo "CRITICAL: Service $service failed to start on $(hostname)"
echo "Time: $(date)"
echo ""
echo "Service Status:"
echo "$status"
echo ""
echo "Recent Logs:"
journalctl -u "$service" -n 50 --no-pager
} | mail -s "CRITICAL: Service $service Failed" [email protected]
fi
return 1
fi
}
# Check service response time
check_service_response() {
local service=$1
local url=$2
local max_time=5
if [ -z "$url" ]; then
return 0
fi
response_time=$(curl -o /dev/null -s -w '%{time_total}' --max-time "$max_time" "$url" 2>/dev/null || echo "timeout")
if [ "$response_time" = "timeout" ]; then
echo " Response: TIMEOUT"
log_message "WARNING: $service response timeout"
return 1
else
echo " Response time: ${response_time}s"
# Alert if response time is slow
if (( $(echo "$response_time > 2" | bc -l) )); then
log_message "WARNING: Slow response from $service: ${response_time}s"
fi
fi
return 0
}
# Check service ports
check_service_port() {
local service=$1
local port=$2
if [ -z "$port" ]; then
return 0
fi
if ss -tuln | grep -q ":$port "; then
echo " Port $port: LISTENING"
return 0
else
echo " Port $port: NOT LISTENING"
log_message "ERROR: $service port $port not listening"
return 1
fi
}
# Main monitoring loop
main() {
log_message "Starting service monitoring"
echo "=== Service Monitoring Report ==="
echo "Timestamp: $(date)"
echo ""
local failed_services=0
for service in "${SERVICES[@]}"; do
if ! check_service "$service"; then
((failed_services++))
fi
# Service-specific checks
case "$service" in
nginx|apache2)
check_service_port "$service" "80"
check_service_response "$service" "http://localhost"
;;
mysql|mariadb)
check_service_port "$service" "3306"
;;
redis)
check_service_port "$service" "6379"
# Test redis connection
if command -v redis-cli &> /dev/null; then
if redis-cli ping &>/dev/null; then
echo " Redis: RESPONSIVE"
else
echo " Redis: NOT RESPONSIVE"
log_message "ERROR: Redis not responsive"
fi
fi
;;
ssh)
check_service_port "$service" "22"
;;
esac
echo ""
done
if [ "$failed_services" -gt 0 ]; then
log_message "WARNING: $failed_services service(s) failed checks"
exit 1
fi
log_message "All services operational"
exit 0
}
main
Process Monitor
#!/bin/bash
# process-monitor.sh - Monitor specific processes
# Monitor process by name
monitor_process() {
local process_name=$1
local min_instances=${2:-1}
local max_instances=${3:-10}
# Count running instances
local count=$(pgrep -c "$process_name" || echo "0")
echo "Process $process_name: $count instance(s)"
if [ "$count" -lt "$min_instances" ]; then
echo " WARNING: Too few instances (minimum: $min_instances)"
log_message "ERROR: Process $process_name has only $count instances (minimum: $min_instances)"
return 1
elif [ "$count" -gt "$max_instances" ]; then
echo " WARNING: Too many instances (maximum: $max_instances)"
log_message "WARNING: Process $process_name has $count instances (maximum: $max_instances)"
return 1
fi
# Check process resource usage
total_cpu=0
total_mem=0
while IFS= read -r pid; do
cpu=$(ps -p "$pid" -o %cpu= | tr -d ' ')
mem=$(ps -p "$pid" -o %mem= | tr -d ' ')
total_cpu=$(echo "$total_cpu + $cpu" | bc)
total_mem=$(echo "$total_mem + $mem" | bc)
done < <(pgrep "$process_name")
echo " Total CPU: ${total_cpu}%"
echo " Total Memory: ${total_mem}%"
return 0
}
# Monitor process by PID file
monitor_pid_file() {
local pid_file=$1
local process_name=$2
if [ ! -f "$pid_file" ]; then
echo "Process $process_name: PID file not found"
log_message "ERROR: PID file $pid_file not found for $process_name"
return 1
fi
pid=$(cat "$pid_file")
if kill -0 "$pid" 2>/dev/null; then
echo "Process $process_name (PID: $pid): RUNNING"
return 0
else
echo "Process $process_name (PID: $pid): NOT RUNNING"
log_message "ERROR: Process $process_name with PID $pid not running"
return 1
fi
}
# Example usage
monitor_process "nginx" 1 10
monitor_process "php-fpm" 5 50
monitor_pid_file "/var/run/nginx.pid" "nginx"
Comprehensive Monitoring Dashboard
All-in-One Monitoring Script
#!/bin/bash
# comprehensive-monitor.sh - Complete system monitoring dashboard
# Configuration
CONFIG_FILE="/etc/monitoring/config.conf"
LOG_DIR="/var/log/monitoring"
ALERT_EMAIL="[email protected]"
HOSTNAME=$(hostname)
# Ensure log directory exists
mkdir -p "$LOG_DIR"
# Load configuration if exists
if [ -f "$CONFIG_FILE" ]; then
source "$CONFIG_FILE"
else
# Default thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=2.0
fi
# Color codes for terminal output
RED='\033[0;31m'
YELLOW='\033[1;33m'
GREEN='\033[0;32m'
NC='\033[0m' # No Color
# Logging functions
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_DIR/monitoring.log"
}
# Status tracking
declare -a ALERTS=()
declare -a WARNINGS=()
declare -a INFO=()
add_alert() {
ALERTS+=("$1")
log_message "ALERT: $1"
}
add_warning() {
WARNINGS+=("$1")
log_message "WARNING: $1"
}
add_info() {
INFO+=("$1")
}
# System checks
check_cpu() {
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
local cpu_int=${cpu_usage%.*}
add_info "CPU Usage: ${cpu_usage}%"
if [ "$cpu_int" -gt "$CPU_THRESHOLD" ]; then
add_alert "High CPU usage: ${cpu_usage}%"
return 1
fi
return 0
}
check_memory() {
local total=$(free -m | grep Mem: | awk '{print $2}')
local used=$(free -m | grep Mem: | awk '{print $3}')
local percent=$(echo "scale=2; ($used / $total) * 100" | bc)
local percent_int=${percent%.*}
add_info "Memory Usage: ${percent}% (${used}MB / ${total}MB)"
if [ "$percent_int" -gt "$MEMORY_THRESHOLD" ]; then
add_alert "High memory usage: ${percent}%"
return 1
fi
return 0
}
check_disk() {
local failed=0
while IFS= read -r line; do
local mountpoint=$(echo "$line" | awk '{print $6}')
local percent=$(echo "$line" | awk '{print $5}' | tr -d '%')
add_info "Disk $mountpoint: ${percent}% used"
if [ "$percent" -gt "$DISK_THRESHOLD" ]; then
add_alert "High disk usage on $mountpoint: ${percent}%"
failed=1
fi
done < <(df -h | tail -n +2 | grep -v "tmpfs")
return $failed
}
check_load() {
local load_1min=$(cat /proc/loadavg | awk '{print $1}')
local cpu_cores=$(nproc)
local load_per_core=$(echo "scale=2; $load_1min / $cpu_cores" | bc)
add_info "Load Average: $load_1min (cores: $cpu_cores, per-core: $load_per_core)"
if (( $(echo "$load_per_core > $LOAD_THRESHOLD" | bc -l) )); then
add_warning "High load average: $load_per_core per core"
return 1
fi
return 0
}
check_services() {
local services=("nginx" "mysql" "redis" "ssh")
local failed=0
for service in "${services[@]}"; do
if systemctl is-active --quiet "$service" 2>/dev/null; then
add_info "Service $service: running"
else
if systemctl list-unit-files | grep -q "^${service}.service"; then
add_alert "Service $service: not running"
failed=1
fi
fi
done
return $failed
}
check_network() {
local interface=${1:-eth0}
if [ ! -d "/sys/class/net/$interface" ]; then
add_warning "Network interface $interface not found"
return 1
fi
local rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
local tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)
add_info "Network $interface - RX errors: $rx_errors, TX errors: $tx_errors"
if [ "$rx_errors" -gt 1000 ] || [ "$tx_errors" -gt 1000 ]; then
add_warning "High network error count on $interface"
return 1
fi
return 0
}
# Generate report
generate_report() {
local report=""
report+="========================================\n"
report+="System Monitoring Report\n"
report+="========================================\n"
report+="Hostname: $HOSTNAME\n"
report+="Timestamp: $(date)\n"
report+="========================================\n\n"
# Alerts
if [ ${#ALERTS[@]} -gt 0 ]; then
report+="ALERTS (${#ALERTS[@]}):\n"
for alert in "${ALERTS[@]}"; do
report+=" [!] $alert\n"
done
report+="\n"
fi
# Warnings
if [ ${#WARNINGS[@]} -gt 0 ]; then
report+="WARNINGS (${#WARNINGS[@]}):\n"
for warning in "${WARNINGS[@]}"; do
report+=" [*] $warning\n"
done
report+="\n"
fi
# Info
if [ ${#INFO[@]} -gt 0 ]; then
report+="SYSTEM STATUS:\n"
for info in "${INFO[@]}"; do
report+=" $info\n"
done
report+="\n"
fi
report+="========================================\n"
echo -e "$report"
}
# Send alerts via email
send_alerts() {
if [ ${#ALERTS[@]} -gt 0 ]; then
local subject="ALERT: $HOSTNAME - ${#ALERTS[@]} alert(s)"
generate_report | mail -s "$subject" "$ALERT_EMAIL"
log_message "Alert email sent to $ALERT_EMAIL"
fi
}
# Display in terminal with colors
display_colored_report() {
echo "========================================"
echo "System Monitoring Report"
echo "========================================"
echo "Hostname: $HOSTNAME"
echo "Timestamp: $(date)"
echo "========================================"
echo ""
if [ ${#ALERTS[@]} -gt 0 ]; then
echo -e "${RED}ALERTS (${#ALERTS[@]}):${NC}"
for alert in "${ALERTS[@]}"; do
echo -e " ${RED}[!] $alert${NC}"
done
echo ""
fi
if [ ${#WARNINGS[@]} -gt 0 ]; then
echo -e "${YELLOW}WARNINGS (${#WARNINGS[@]}):${NC}"
for warning in "${WARNINGS[@]}"; do
echo -e " ${YELLOW}[*] $warning${NC}"
done
echo ""
fi
if [ ${#INFO[@]} -gt 0 ]; then
echo -e "${GREEN}SYSTEM STATUS:${NC}"
for info in "${INFO[@]}"; do
echo " $info"
done
echo ""
fi
echo "========================================"
}
# Main execution
main() {
log_message "Starting comprehensive system monitoring"
# Run all checks
check_cpu
check_memory
check_disk
check_load
check_services
check_network
# Display report
if [ -t 1 ]; then
# Terminal output - use colors
display_colored_report
else
# Non-terminal (cron, etc.) - plain text
generate_report
fi
# Send alerts if any
send_alerts
# Save report to file
generate_report > "$LOG_DIR/latest-report.txt"
# Return appropriate exit code
if [ ${#ALERTS[@]} -gt 0 ]; then
log_message "Monitoring completed with ${#ALERTS[@]} alerts"
exit 2
elif [ ${#WARNINGS[@]} -gt 0 ]; then
log_message "Monitoring completed with ${#WARNINGS[@]} warnings"
exit 1
else
log_message "Monitoring completed successfully"
exit 0
fi
}
# Execute
main "$@"
Conclusion
Custom Bash monitoring scripts provide a flexible, lightweight, and powerful approach to system monitoring. They offer immediate deployment, complete customization, and seamless integration with existing infrastructure without the overhead of complex monitoring frameworks.
Key advantages of custom Bash monitoring:
- Simplicity - Easy to understand, modify, and maintain
- Portability - Works on any Linux distribution without dependencies
- Customization - Tailored exactly to your specific monitoring needs
- Integration - Easily integrated with existing alerting and logging systems
- Resource efficiency - Minimal overhead compared to agent-based monitoring
- Rapid deployment - No installation or configuration of complex software
Best practices for production monitoring scripts:
- Implement comprehensive error handling and logging
- Use meaningful exit codes for integration with monitoring systems
- Add configuration files for easy threshold management
- Include detailed comments and documentation
- Test scripts thoroughly before production deployment
- Implement rate limiting for alerts to prevent notification fatigue
- Use locks to prevent concurrent executions
- Monitor the monitors - ensure monitoring scripts themselves are running
While custom scripts excel for specific use cases and lightweight monitoring, consider complementing them with enterprise monitoring solutions like Prometheus, Grafana, or Nagios for complex environments requiring historical data, advanced visualization, and distributed monitoring capabilities.
The monitoring scripts presented in this guide provide a solid foundation that you can extend and adapt to your specific requirements, creating a robust monitoring solution that grows with your infrastructure needs.


