Monitoring with Custom Bash Scripts

Introduction

While enterprise monitoring solutions like Prometheus, Nagios, and Zabbix offer comprehensive features, custom Bash scripts provide a lightweight, flexible, and cost-effective alternative for monitoring Linux servers. Bash scripts can be tailored to your specific infrastructure needs, deployed instantly without dependencies, and integrated seamlessly with existing tools and workflows.

Custom monitoring scripts excel in scenarios where you need quick deployment, specific metrics not covered by standard tools, or lightweight monitoring for resource-constrained environments. They're particularly valuable for monitoring custom applications, enforcing specific business logic, and creating monitoring solutions that integrate directly with your notification systems.

This comprehensive guide teaches you how to build robust monitoring scripts from simple resource checks to advanced multi-server monitoring systems. You'll learn how to monitor CPU, memory, disk, network, services, and custom application metrics, implement alerting mechanisms, create dashboards, and follow best practices for maintainable, production-ready monitoring scripts.

Prerequisites

Before building custom monitoring scripts, ensure you have:

  • A Linux server (Ubuntu 20.04/22.04, Debian 10/11, CentOS 7/8, Rocky Linux 8/9, or similar)
  • Root or sudo access for system-level monitoring
  • Basic to intermediate Bash scripting knowledge
  • Familiarity with Linux system commands
  • Text editor (nano, vim, or vi)

Recommended Knowledge:

  • Understanding of process management
  • Basic knowledge of awk, grep, and sed
  • Familiarity with cron for scheduling
  • Understanding of exit codes and error handling

Optional Tools:

  • Mail transfer agent (for email alerts)
  • curl or wget (for webhook notifications)
  • bc (for floating-point calculations)
  • jq (for JSON processing)

Basic Monitoring Script Structure

Essential Components

A well-structured monitoring script should include:

#!/bin/bash
#
# Script Name: monitor-system.sh
# Description: Basic system monitoring script
# Author: Your Name
# Date: 2024-01-11
# Version: 1.0

# Exit on error
set -e

# Configuration
LOG_FILE="/var/log/monitoring/system-monitor.log"
ALERT_EMAIL="[email protected]"
HOSTNAME=$(hostname)

# Thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90

# Functions
log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

send_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
    log_message "Alert sent: $subject"
}

# Main monitoring logic
main() {
    log_message "Starting system monitoring"

    # Monitoring checks go here

    log_message "Monitoring completed"
}

# Execute main function
main "$@"

Error Handling and Logging

#!/bin/bash

# Enhanced error handling
set -euo pipefail  # Exit on error, undefined variables, pipe failures

# Error trap
trap 'error_handler $? $LINENO' ERR

error_handler() {
    local exit_code=$1
    local line_number=$2
    log_message "ERROR: Script failed at line $line_number with exit code $exit_code"
    send_alert "Monitoring Script Failed" \
        "Script failed on $HOSTNAME at line $line_number"
    exit "$exit_code"
}

# Logging with levels
log_level() {
    local level=$1
    shift
    local message="$*"
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $message" | tee -a "$LOG_FILE"
}

log_info() { log_level "INFO" "$@"; }
log_warn() { log_level "WARN" "$@"; }
log_error() { log_level "ERROR" "$@"; }

CPU Monitoring Scripts

Basic CPU Usage Check

#!/bin/bash
# cpu-monitor.sh - Monitor CPU usage

CPU_THRESHOLD=80
LOG_FILE="/var/log/monitoring/cpu.log"

# Get CPU usage (percentage of idle time subtracted from 100)
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
CPU_USAGE_INT=${CPU_USAGE%.*}  # Convert to integer

log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

log_message "Current CPU usage: ${CPU_USAGE}%"

if [ "$CPU_USAGE_INT" -gt "$CPU_THRESHOLD" ]; then
    log_message "WARNING: CPU usage above threshold (${CPU_USAGE}% > ${CPU_THRESHOLD}%)"

    # Get top CPU consumers
    TOP_PROCESSES=$(ps aux --sort=-%cpu | head -n 6)

    # Send alert
    {
        echo "High CPU usage detected on $(hostname)"
        echo "Current usage: ${CPU_USAGE}%"
        echo "Threshold: ${CPU_THRESHOLD}%"
        echo ""
        echo "Top CPU consumers:"
        echo "$TOP_PROCESSES"
    } | mail -s "CPU Alert: $(hostname)" [email protected]
fi

Advanced CPU Monitoring with Per-Core Analysis

#!/bin/bash
# cpu-detailed-monitor.sh - Detailed CPU monitoring

monitor_cpu_cores() {
    local threshold=$1

    # Get per-core CPU usage
    mpstat -P ALL 1 1 | tail -n +4 | while read line; do
        if [[ $line =~ ^[0-9] ]]; then
            core=$(echo "$line" | awk '{print $2}')
            idle=$(echo "$line" | awk '{print $NF}')
            usage=$(echo "100 - $idle" | bc)
            usage_int=${usage%.*}

            echo "Core $core: ${usage}%"

            if (( $(echo "$usage_int > $threshold" | bc -l) )); then
                echo "WARNING: Core $core usage: ${usage}%"
            fi
        fi
    done
}

# Get load averages
get_load_average() {
    local load_1min load_5min load_15min
    read load_1min load_5min load_15min < /proc/loadavg

    # Get number of CPU cores
    local cpu_cores=$(nproc)

    # Calculate load per core
    local load_per_core=$(echo "scale=2; $load_1min / $cpu_cores" | bc)

    echo "Load Average: $load_1min (1m), $load_5min (5m), $load_15min (15m)"
    echo "CPU Cores: $cpu_cores"
    echo "Load per Core: $load_per_core"

    # Alert if load per core exceeds 1.0
    if (( $(echo "$load_per_core > 1.0" | bc -l) )); then
        echo "WARNING: High load per core: $load_per_core"
        return 1
    fi
}

# Main execution
echo "=== CPU Monitoring Report ==="
echo "Timestamp: $(date)"
echo ""

echo "--- Load Average ---"
get_load_average
echo ""

echo "--- Per-Core Usage ---"
monitor_cpu_cores 80

CPU Wait Time Monitoring

#!/bin/bash
# cpu-iowait-monitor.sh - Monitor I/O wait time

IOWAIT_THRESHOLD=20

# Get I/O wait percentage
get_iowait() {
    iostat -c 1 2 | tail -1 | awk '{print $4}'
}

IOWAIT=$(get_iowait)
IOWAIT_INT=${IOWAIT%.*}

echo "Current I/O wait: ${IOWAIT}%"

if [ "$IOWAIT_INT" -gt "$IOWAIT_THRESHOLD" ]; then
    echo "WARNING: High I/O wait detected"

    # Identify I/O intensive processes
    echo ""
    echo "Top I/O processes:"
    sudo iotop -b -n 1 -o | head -n 20

    # Check disk statistics
    echo ""
    echo "Disk I/O statistics:"
    iostat -x 1 2 | tail -n +4
fi

Memory Monitoring Scripts

Comprehensive Memory Monitor

#!/bin/bash
# memory-monitor.sh - Comprehensive memory monitoring

MEMORY_THRESHOLD=85
SWAP_THRESHOLD=50
LOG_FILE="/var/log/monitoring/memory.log"

log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Get memory statistics
get_memory_stats() {
    local total used free shared buffers cached available
    read -r _ total used free shared buffers cached <<< \
        "$(free -m | grep Mem:)"

    available=$(free -m | grep Mem: | awk '{print $7}')

    local mem_percent=$(echo "scale=2; ($used / $total) * 100" | bc)
    local mem_percent_int=${mem_percent%.*}

    echo "Total: ${total}MB"
    echo "Used: ${used}MB"
    echo "Free: ${free}MB"
    echo "Available: ${available}MB"
    echo "Usage: ${mem_percent}%"

    if [ "$mem_percent_int" -gt "$MEMORY_THRESHOLD" ]; then
        log_message "WARNING: Memory usage ${mem_percent}% exceeds threshold ${MEMORY_THRESHOLD}%"
        return 1
    fi

    return 0
}

# Get swap statistics
get_swap_stats() {
    local total used free
    read -r _ total used free <<< "$(free -m | grep Swap:)"

    if [ "$total" -eq 0 ]; then
        echo "Swap: Not configured"
        return 0
    fi

    local swap_percent=$(echo "scale=2; ($used / $total) * 100" | bc)
    local swap_percent_int=${swap_percent%.*}

    echo "Swap Total: ${total}MB"
    echo "Swap Used: ${used}MB"
    echo "Swap Usage: ${swap_percent}%"

    if [ "$swap_percent_int" -gt "$SWAP_THRESHOLD" ]; then
        log_message "WARNING: Swap usage ${swap_percent}% exceeds threshold ${SWAP_THRESHOLD}%"
        return 1
    fi

    return 0
}

# Get top memory consumers
get_top_memory_processes() {
    echo "Top 10 Memory Consuming Processes:"
    ps aux --sort=-%mem | head -n 11 | awk '{printf "%-10s %-8s %-8s %s\n", $1, $2, $4"%", $11}'
}

# Check for memory leaks
check_memory_growth() {
    local current_usage previous_usage

    # Get current memory usage
    current_usage=$(free -m | grep Mem: | awk '{print $3}')

    # Read previous usage if exists
    if [ -f /tmp/memory_usage.tmp ]; then
        previous_usage=$(cat /tmp/memory_usage.tmp)

        local growth=$((current_usage - previous_usage))

        echo "Memory growth since last check: ${growth}MB"

        if [ "$growth" -gt 500 ]; then
            log_message "WARNING: Rapid memory growth detected: ${growth}MB"
        fi
    fi

    # Save current usage
    echo "$current_usage" > /tmp/memory_usage.tmp
}

# Main execution
main() {
    log_message "Starting memory monitoring"

    echo "=== Memory Monitoring Report ==="
    echo "Timestamp: $(date)"
    echo ""

    echo "--- Physical Memory ---"
    if ! get_memory_stats; then
        get_top_memory_processes
        echo "" | mail -s "Memory Alert: $(hostname)" [email protected]
    fi

    echo ""
    echo "--- Swap Memory ---"
    get_swap_stats

    echo ""
    check_memory_growth

    echo ""
    get_top_memory_processes
}

main

OOM Killer Detection

#!/bin/bash
# oom-monitor.sh - Detect Out Of Memory killer events

LOG_FILE="/var/log/monitoring/oom.log"

# Check for OOM killer events
check_oom_events() {
    local oom_events

    # Search kernel logs for OOM events in last hour
    oom_events=$(journalctl -k --since "1 hour ago" | grep -i "Out of memory\|oom-killer\|Killed process")

    if [ -n "$oom_events" ]; then
        echo "OOM Killer Events Detected:"
        echo "$oom_events"

        # Log the event
        echo "[$(date)] OOM events detected" >> "$LOG_FILE"
        echo "$oom_events" >> "$LOG_FILE"

        # Send alert
        {
            echo "Out Of Memory events detected on $(hostname)"
            echo "Time: $(date)"
            echo ""
            echo "Events:"
            echo "$oom_events"
            echo ""
            echo "Current Memory Status:"
            free -h
            echo ""
            echo "Top Memory Consumers:"
            ps aux --sort=-%mem | head -n 11
        } | mail -s "OOM Alert: $(hostname)" [email protected]

        return 1
    fi

    return 0
}

check_oom_events

Disk Monitoring Scripts

Disk Space Monitor

#!/bin/bash
# disk-monitor.sh - Monitor disk space usage

DISK_THRESHOLD=85
LOG_FILE="/var/log/monitoring/disk.log"

log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check disk usage for all mounted filesystems
check_disk_usage() {
    local alert_triggered=0

    df -h | tail -n +2 | while read filesystem size used avail percent mountpoint; do
        # Remove % sign from percentage
        usage=${percent%\%}

        echo "Checking $mountpoint: $percent used"

        if [ "$usage" -gt "$DISK_THRESHOLD" ]; then
            log_message "WARNING: $mountpoint is ${percent} full"
            alert_triggered=1

            # Find largest directories
            echo "Largest directories in $mountpoint:"
            du -h "$mountpoint" 2>/dev/null | sort -rh | head -n 10

            # Send alert
            {
                echo "Disk space alert on $(hostname)"
                echo "Mount point: $mountpoint"
                echo "Usage: $percent"
                echo "Threshold: ${DISK_THRESHOLD}%"
                echo ""
                echo "Largest directories:"
                du -h "$mountpoint" 2>/dev/null | sort -rh | head -n 10
            } | mail -s "Disk Space Alert: $mountpoint" [email protected]
        fi
    done

    return $alert_triggered
}

# Check inode usage
check_inode_usage() {
    local inode_threshold=85

    df -i | tail -n +2 | while read filesystem inodes iused ifree ipercent mountpoint; do
        # Remove % sign
        usage=${ipercent%\%}

        if [ "$usage" -gt "$inode_threshold" ]; then
            log_message "WARNING: Inode usage on $mountpoint is ${ipercent}"

            {
                echo "Inode usage alert on $(hostname)"
                echo "Mount point: $mountpoint"
                echo "Inode usage: $ipercent"
                echo ""
                echo "Directories with most files:"
                find "$mountpoint" -xdev -type d -exec sh -c \
                    'echo $(find "{}" -maxdepth 1 | wc -l) "{}"' \; 2>/dev/null | \
                    sort -rn | head -n 10
            } | mail -s "Inode Alert: $mountpoint" [email protected]
        fi
    done
}

# Check for rapidly growing directories
check_directory_growth() {
    local watch_dirs=("/var/log" "/tmp" "/var/tmp" "/home")

    for dir in "${watch_dirs[@]}"; do
        if [ ! -d "$dir" ]; then
            continue
        fi

        current_size=$(du -sm "$dir" 2>/dev/null | awk '{print $1}')
        state_file="/tmp/dirsize_$(echo $dir | tr '/' '_').tmp"

        if [ -f "$state_file" ]; then
            previous_size=$(cat "$state_file")
            growth=$((current_size - previous_size))

            if [ "$growth" -gt 1000 ]; then  # Alert if grown by >1GB
                log_message "WARNING: $dir grew by ${growth}MB"

                {
                    echo "Rapid directory growth on $(hostname)"
                    echo "Directory: $dir"
                    echo "Growth: ${growth}MB"
                    echo ""
                    echo "Largest subdirectories:"
                    du -h "$dir" 2>/dev/null | sort -rh | head -n 10
                } | mail -s "Directory Growth Alert: $dir" [email protected]
            fi
        fi

        echo "$current_size" > "$state_file"
    done
}

# Main execution
main() {
    log_message "Starting disk monitoring"

    echo "=== Disk Space Monitoring ==="
    check_disk_usage

    echo ""
    echo "=== Inode Usage ==="
    check_inode_usage

    echo ""
    echo "=== Directory Growth Check ==="
    check_directory_growth
}

main

Disk I/O Performance Monitor

#!/bin/bash
# disk-io-monitor.sh - Monitor disk I/O performance

IOPS_THRESHOLD=1000
LATENCY_THRESHOLD=100  # milliseconds

# Get disk I/O statistics
get_disk_io_stats() {
    iostat -x 1 2 | tail -n +4 | tail -n +7 | while read line; do
        device=$(echo "$line" | awk '{print $1}')
        tps=$(echo "$line" | awk '{print $2}')
        read_kb=$(echo "$line" | awk '{print $3}')
        write_kb=$(echo "$line" | awk '{print $4}')
        await=$(echo "$line" | awk '{print $10}')

        echo "Device: $device"
        echo "  Transactions/sec: $tps"
        echo "  Read: ${read_kb} kB/s"
        echo "  Write: ${write_kb} kB/s"
        echo "  Await: ${await} ms"

        # Check thresholds
        if (( $(echo "$await > $LATENCY_THRESHOLD" | bc -l) )); then
            echo "  WARNING: High latency detected"

            # Get processes doing I/O
            sudo iotop -b -n 1 -o | head -n 20
        fi
    done
}

get_disk_io_stats

Network Monitoring Scripts

Network Traffic Monitor

#!/bin/bash
# network-monitor.sh - Monitor network traffic and connections

BANDWIDTH_THRESHOLD=80  # Percentage of interface capacity
LOG_FILE="/var/log/monitoring/network.log"

log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Monitor network interface traffic
monitor_interface_traffic() {
    local interface=$1

    # Get interface statistics
    rx_bytes_before=$(cat /sys/class/net/"$interface"/statistics/rx_bytes)
    tx_bytes_before=$(cat /sys/class/net/"$interface"/statistics/tx_bytes)

    sleep 1

    rx_bytes_after=$(cat /sys/class/net/"$interface"/statistics/rx_bytes)
    tx_bytes_after=$(cat /sys/class/net/"$interface"/statistics/tx_bytes)

    # Calculate bytes per second
    rx_bps=$((rx_bytes_after - rx_bytes_before))
    tx_bps=$((tx_bytes_after - tx_bytes_before))

    # Convert to Mbps
    rx_mbps=$(echo "scale=2; $rx_bps * 8 / 1000000" | bc)
    tx_mbps=$(echo "scale=2; $tx_bps * 8 / 1000000" | bc)

    echo "Interface: $interface"
    echo "  RX: ${rx_mbps} Mbps"
    echo "  TX: ${tx_mbps} Mbps"
}

# Check connection count
check_connection_count() {
    local established=$(ss -tan | grep ESTAB | wc -l)
    local time_wait=$(ss -tan | grep TIME-WAIT | wc -l)
    local total=$(ss -tan | tail -n +2 | wc -l)

    echo "Connection Statistics:"
    echo "  Total: $total"
    echo "  Established: $established"
    echo "  Time-Wait: $time_wait"

    if [ "$established" -gt 1000 ]; then
        log_message "WARNING: High number of established connections: $established"

        # Show top connection sources
        echo "  Top connection sources:"
        ss -tan | grep ESTAB | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -n 10
    fi
}

# Check for network errors
check_network_errors() {
    local interface=$1

    local rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
    local tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)
    local rx_dropped=$(cat /sys/class/net/"$interface"/statistics/rx_dropped)
    local tx_dropped=$(cat /sys/class/net/"$interface"/statistics/tx_dropped)

    echo "Interface $interface errors:"
    echo "  RX errors: $rx_errors"
    echo "  TX errors: $tx_errors"
    echo "  RX dropped: $rx_dropped"
    echo "  TX dropped: $tx_dropped"

    if [ "$rx_errors" -gt 100 ] || [ "$tx_errors" -gt 100 ]; then
        log_message "WARNING: Network errors detected on $interface"
    fi
}

# Monitor port availability
check_port_listening() {
    local ports=("80" "443" "22" "3306")

    echo "Port Availability:"
    for port in "${ports[@]}"; do
        if ss -tuln | grep -q ":$port "; then
            echo "  Port $port: LISTENING"
        else
            echo "  Port $port: NOT LISTENING"
            log_message "WARNING: Port $port is not listening"
        fi
    done
}

# Main execution
main() {
    local interface=${1:-eth0}

    echo "=== Network Monitoring Report ==="
    echo "Timestamp: $(date)"
    echo ""

    echo "--- Interface Traffic ---"
    monitor_interface_traffic "$interface"

    echo ""
    echo "--- Connection Statistics ---"
    check_connection_count

    echo ""
    echo "--- Network Errors ---"
    check_network_errors "$interface"

    echo ""
    echo "--- Port Status ---"
    check_port_listening
}

main "$@"

Connection Monitor with Geo-IP

#!/bin/bash
# connection-geoip-monitor.sh - Monitor connections with geographic information

# Check for suspicious connection patterns
monitor_connections() {
    echo "=== Active Connection Analysis ==="

    # Get unique IPs with connection count
    ss -tan | grep ESTAB | awk '{print $5}' | cut -d: -f1 | \
        grep -v "^127\|^::1\|^$" | sort | uniq -c | sort -rn | \
        while read count ip; do
            echo "IP: $ip (Connections: $count)"

            # Alert on high connection count from single IP
            if [ "$count" -gt 50 ]; then
                echo "  WARNING: High connection count from $ip"

                # Get GeoIP info if geoiplookup is available
                if command -v geoiplookup &> /dev/null; then
                    geo=$(geoiplookup "$ip")
                    echo "  Location: $geo"
                fi
            fi
        done
}

monitor_connections

Service Monitoring Scripts

Comprehensive Service Monitor

#!/bin/bash
# service-monitor.sh - Monitor critical services

SERVICES=("nginx" "mysql" "redis" "ssh")
LOG_FILE="/var/log/monitoring/services.log"

log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check if service is running
check_service() {
    local service=$1

    if systemctl is-active --quiet "$service"; then
        echo "Service $service: RUNNING"
        return 0
    else
        echo "Service $service: NOT RUNNING"
        log_message "ERROR: Service $service is not running"

        # Attempt to start service
        log_message "Attempting to start $service"
        if systemctl start "$service"; then
            log_message "Successfully started $service"

            # Send notification
            echo "Service $service was down and has been restarted on $(hostname)" | \
                mail -s "Service Recovered: $service" [email protected]
        else
            log_message "ERROR: Failed to start $service"

            # Get service status
            status=$(systemctl status "$service" --no-pager)

            # Send alert
            {
                echo "CRITICAL: Service $service failed to start on $(hostname)"
                echo "Time: $(date)"
                echo ""
                echo "Service Status:"
                echo "$status"
                echo ""
                echo "Recent Logs:"
                journalctl -u "$service" -n 50 --no-pager
            } | mail -s "CRITICAL: Service $service Failed" [email protected]
        fi

        return 1
    fi
}

# Check service response time
check_service_response() {
    local service=$1
    local url=$2
    local max_time=5

    if [ -z "$url" ]; then
        return 0
    fi

    response_time=$(curl -o /dev/null -s -w '%{time_total}' --max-time "$max_time" "$url" 2>/dev/null || echo "timeout")

    if [ "$response_time" = "timeout" ]; then
        echo "  Response: TIMEOUT"
        log_message "WARNING: $service response timeout"
        return 1
    else
        echo "  Response time: ${response_time}s"

        # Alert if response time is slow
        if (( $(echo "$response_time > 2" | bc -l) )); then
            log_message "WARNING: Slow response from $service: ${response_time}s"
        fi
    fi

    return 0
}

# Check service ports
check_service_port() {
    local service=$1
    local port=$2

    if [ -z "$port" ]; then
        return 0
    fi

    if ss -tuln | grep -q ":$port "; then
        echo "  Port $port: LISTENING"
        return 0
    else
        echo "  Port $port: NOT LISTENING"
        log_message "ERROR: $service port $port not listening"
        return 1
    fi
}

# Main monitoring loop
main() {
    log_message "Starting service monitoring"

    echo "=== Service Monitoring Report ==="
    echo "Timestamp: $(date)"
    echo ""

    local failed_services=0

    for service in "${SERVICES[@]}"; do
        if ! check_service "$service"; then
            ((failed_services++))
        fi

        # Service-specific checks
        case "$service" in
            nginx|apache2)
                check_service_port "$service" "80"
                check_service_response "$service" "http://localhost"
                ;;
            mysql|mariadb)
                check_service_port "$service" "3306"
                ;;
            redis)
                check_service_port "$service" "6379"
                # Test redis connection
                if command -v redis-cli &> /dev/null; then
                    if redis-cli ping &>/dev/null; then
                        echo "  Redis: RESPONSIVE"
                    else
                        echo "  Redis: NOT RESPONSIVE"
                        log_message "ERROR: Redis not responsive"
                    fi
                fi
                ;;
            ssh)
                check_service_port "$service" "22"
                ;;
        esac

        echo ""
    done

    if [ "$failed_services" -gt 0 ]; then
        log_message "WARNING: $failed_services service(s) failed checks"
        exit 1
    fi

    log_message "All services operational"
    exit 0
}

main

Process Monitor

#!/bin/bash
# process-monitor.sh - Monitor specific processes

# Monitor process by name
monitor_process() {
    local process_name=$1
    local min_instances=${2:-1}
    local max_instances=${3:-10}

    # Count running instances
    local count=$(pgrep -c "$process_name" || echo "0")

    echo "Process $process_name: $count instance(s)"

    if [ "$count" -lt "$min_instances" ]; then
        echo "  WARNING: Too few instances (minimum: $min_instances)"
        log_message "ERROR: Process $process_name has only $count instances (minimum: $min_instances)"
        return 1
    elif [ "$count" -gt "$max_instances" ]; then
        echo "  WARNING: Too many instances (maximum: $max_instances)"
        log_message "WARNING: Process $process_name has $count instances (maximum: $max_instances)"
        return 1
    fi

    # Check process resource usage
    total_cpu=0
    total_mem=0

    while IFS= read -r pid; do
        cpu=$(ps -p "$pid" -o %cpu= | tr -d ' ')
        mem=$(ps -p "$pid" -o %mem= | tr -d ' ')

        total_cpu=$(echo "$total_cpu + $cpu" | bc)
        total_mem=$(echo "$total_mem + $mem" | bc)
    done < <(pgrep "$process_name")

    echo "  Total CPU: ${total_cpu}%"
    echo "  Total Memory: ${total_mem}%"

    return 0
}

# Monitor process by PID file
monitor_pid_file() {
    local pid_file=$1
    local process_name=$2

    if [ ! -f "$pid_file" ]; then
        echo "Process $process_name: PID file not found"
        log_message "ERROR: PID file $pid_file not found for $process_name"
        return 1
    fi

    pid=$(cat "$pid_file")

    if kill -0 "$pid" 2>/dev/null; then
        echo "Process $process_name (PID: $pid): RUNNING"
        return 0
    else
        echo "Process $process_name (PID: $pid): NOT RUNNING"
        log_message "ERROR: Process $process_name with PID $pid not running"
        return 1
    fi
}

# Example usage
monitor_process "nginx" 1 10
monitor_process "php-fpm" 5 50
monitor_pid_file "/var/run/nginx.pid" "nginx"

Comprehensive Monitoring Dashboard

All-in-One Monitoring Script

#!/bin/bash
# comprehensive-monitor.sh - Complete system monitoring dashboard

# Configuration
CONFIG_FILE="/etc/monitoring/config.conf"
LOG_DIR="/var/log/monitoring"
ALERT_EMAIL="[email protected]"
HOSTNAME=$(hostname)

# Ensure log directory exists
mkdir -p "$LOG_DIR"

# Load configuration if exists
if [ -f "$CONFIG_FILE" ]; then
    source "$CONFIG_FILE"
else
    # Default thresholds
    CPU_THRESHOLD=80
    MEMORY_THRESHOLD=85
    DISK_THRESHOLD=90
    LOAD_THRESHOLD=2.0
fi

# Color codes for terminal output
RED='\033[0;31m'
YELLOW='\033[1;33m'
GREEN='\033[0;32m'
NC='\033[0m' # No Color

# Logging functions
log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_DIR/monitoring.log"
}

# Status tracking
declare -a ALERTS=()
declare -a WARNINGS=()
declare -a INFO=()

add_alert() {
    ALERTS+=("$1")
    log_message "ALERT: $1"
}

add_warning() {
    WARNINGS+=("$1")
    log_message "WARNING: $1"
}

add_info() {
    INFO+=("$1")
}

# System checks
check_cpu() {
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    local cpu_int=${cpu_usage%.*}

    add_info "CPU Usage: ${cpu_usage}%"

    if [ "$cpu_int" -gt "$CPU_THRESHOLD" ]; then
        add_alert "High CPU usage: ${cpu_usage}%"
        return 1
    fi

    return 0
}

check_memory() {
    local total=$(free -m | grep Mem: | awk '{print $2}')
    local used=$(free -m | grep Mem: | awk '{print $3}')
    local percent=$(echo "scale=2; ($used / $total) * 100" | bc)
    local percent_int=${percent%.*}

    add_info "Memory Usage: ${percent}% (${used}MB / ${total}MB)"

    if [ "$percent_int" -gt "$MEMORY_THRESHOLD" ]; then
        add_alert "High memory usage: ${percent}%"
        return 1
    fi

    return 0
}

check_disk() {
    local failed=0

    while IFS= read -r line; do
        local mountpoint=$(echo "$line" | awk '{print $6}')
        local percent=$(echo "$line" | awk '{print $5}' | tr -d '%')

        add_info "Disk $mountpoint: ${percent}% used"

        if [ "$percent" -gt "$DISK_THRESHOLD" ]; then
            add_alert "High disk usage on $mountpoint: ${percent}%"
            failed=1
        fi
    done < <(df -h | tail -n +2 | grep -v "tmpfs")

    return $failed
}

check_load() {
    local load_1min=$(cat /proc/loadavg | awk '{print $1}')
    local cpu_cores=$(nproc)
    local load_per_core=$(echo "scale=2; $load_1min / $cpu_cores" | bc)

    add_info "Load Average: $load_1min (cores: $cpu_cores, per-core: $load_per_core)"

    if (( $(echo "$load_per_core > $LOAD_THRESHOLD" | bc -l) )); then
        add_warning "High load average: $load_per_core per core"
        return 1
    fi

    return 0
}

check_services() {
    local services=("nginx" "mysql" "redis" "ssh")
    local failed=0

    for service in "${services[@]}"; do
        if systemctl is-active --quiet "$service" 2>/dev/null; then
            add_info "Service $service: running"
        else
            if systemctl list-unit-files | grep -q "^${service}.service"; then
                add_alert "Service $service: not running"
                failed=1
            fi
        fi
    done

    return $failed
}

check_network() {
    local interface=${1:-eth0}

    if [ ! -d "/sys/class/net/$interface" ]; then
        add_warning "Network interface $interface not found"
        return 1
    fi

    local rx_errors=$(cat /sys/class/net/"$interface"/statistics/rx_errors)
    local tx_errors=$(cat /sys/class/net/"$interface"/statistics/tx_errors)

    add_info "Network $interface - RX errors: $rx_errors, TX errors: $tx_errors"

    if [ "$rx_errors" -gt 1000 ] || [ "$tx_errors" -gt 1000 ]; then
        add_warning "High network error count on $interface"
        return 1
    fi

    return 0
}

# Generate report
generate_report() {
    local report=""

    report+="========================================\n"
    report+="System Monitoring Report\n"
    report+="========================================\n"
    report+="Hostname: $HOSTNAME\n"
    report+="Timestamp: $(date)\n"
    report+="========================================\n\n"

    # Alerts
    if [ ${#ALERTS[@]} -gt 0 ]; then
        report+="ALERTS (${#ALERTS[@]}):\n"
        for alert in "${ALERTS[@]}"; do
            report+="  [!] $alert\n"
        done
        report+="\n"
    fi

    # Warnings
    if [ ${#WARNINGS[@]} -gt 0 ]; then
        report+="WARNINGS (${#WARNINGS[@]}):\n"
        for warning in "${WARNINGS[@]}"; do
            report+="  [*] $warning\n"
        done
        report+="\n"
    fi

    # Info
    if [ ${#INFO[@]} -gt 0 ]; then
        report+="SYSTEM STATUS:\n"
        for info in "${INFO[@]}"; do
            report+="  $info\n"
        done
        report+="\n"
    fi

    report+="========================================\n"

    echo -e "$report"
}

# Send alerts via email
send_alerts() {
    if [ ${#ALERTS[@]} -gt 0 ]; then
        local subject="ALERT: $HOSTNAME - ${#ALERTS[@]} alert(s)"
        generate_report | mail -s "$subject" "$ALERT_EMAIL"
        log_message "Alert email sent to $ALERT_EMAIL"
    fi
}

# Display in terminal with colors
display_colored_report() {
    echo "========================================"
    echo "System Monitoring Report"
    echo "========================================"
    echo "Hostname: $HOSTNAME"
    echo "Timestamp: $(date)"
    echo "========================================"
    echo ""

    if [ ${#ALERTS[@]} -gt 0 ]; then
        echo -e "${RED}ALERTS (${#ALERTS[@]}):${NC}"
        for alert in "${ALERTS[@]}"; do
            echo -e "  ${RED}[!] $alert${NC}"
        done
        echo ""
    fi

    if [ ${#WARNINGS[@]} -gt 0 ]; then
        echo -e "${YELLOW}WARNINGS (${#WARNINGS[@]}):${NC}"
        for warning in "${WARNINGS[@]}"; do
            echo -e "  ${YELLOW}[*] $warning${NC}"
        done
        echo ""
    fi

    if [ ${#INFO[@]} -gt 0 ]; then
        echo -e "${GREEN}SYSTEM STATUS:${NC}"
        for info in "${INFO[@]}"; do
            echo "  $info"
        done
        echo ""
    fi

    echo "========================================"
}

# Main execution
main() {
    log_message "Starting comprehensive system monitoring"

    # Run all checks
    check_cpu
    check_memory
    check_disk
    check_load
    check_services
    check_network

    # Display report
    if [ -t 1 ]; then
        # Terminal output - use colors
        display_colored_report
    else
        # Non-terminal (cron, etc.) - plain text
        generate_report
    fi

    # Send alerts if any
    send_alerts

    # Save report to file
    generate_report > "$LOG_DIR/latest-report.txt"

    # Return appropriate exit code
    if [ ${#ALERTS[@]} -gt 0 ]; then
        log_message "Monitoring completed with ${#ALERTS[@]} alerts"
        exit 2
    elif [ ${#WARNINGS[@]} -gt 0 ]; then
        log_message "Monitoring completed with ${#WARNINGS[@]} warnings"
        exit 1
    else
        log_message "Monitoring completed successfully"
        exit 0
    fi
}

# Execute
main "$@"

Conclusion

Custom Bash monitoring scripts provide a flexible, lightweight, and powerful approach to system monitoring. They offer immediate deployment, complete customization, and seamless integration with existing infrastructure without the overhead of complex monitoring frameworks.

Key advantages of custom Bash monitoring:

  1. Simplicity - Easy to understand, modify, and maintain
  2. Portability - Works on any Linux distribution without dependencies
  3. Customization - Tailored exactly to your specific monitoring needs
  4. Integration - Easily integrated with existing alerting and logging systems
  5. Resource efficiency - Minimal overhead compared to agent-based monitoring
  6. Rapid deployment - No installation or configuration of complex software

Best practices for production monitoring scripts:

  • Implement comprehensive error handling and logging
  • Use meaningful exit codes for integration with monitoring systems
  • Add configuration files for easy threshold management
  • Include detailed comments and documentation
  • Test scripts thoroughly before production deployment
  • Implement rate limiting for alerts to prevent notification fatigue
  • Use locks to prevent concurrent executions
  • Monitor the monitors - ensure monitoring scripts themselves are running

While custom scripts excel for specific use cases and lightweight monitoring, consider complementing them with enterprise monitoring solutions like Prometheus, Grafana, or Nagios for complex environments requiring historical data, advanced visualization, and distributed monitoring capabilities.

The monitoring scripts presented in this guide provide a solid foundation that you can extend and adapt to your specific requirements, creating a robust monitoring solution that grows with your infrastructure needs.