Service Monitoring with systemd

Introduction

systemd has become the standard init system and service manager for most modern Linux distributions, replacing traditional SysV init and Upstart. Beyond service management, systemd provides powerful built-in monitoring capabilities that enable administrators to track service health, resource usage, failures, and dependencies without installing additional monitoring software.

Understanding systemd's monitoring features is essential for effective service management in modern Linux environments. systemd tracks detailed service metrics, maintains comprehensive logs via journald, offers automatic restart capabilities, implements dependency management, and provides real-time status information. These capabilities enable proactive service monitoring and rapid troubleshooting directly from the command line.

This guide covers how to check service status, track resource consumption, configure automatic restart policies, analyze service logs, write custom monitoring scripts, and set up alerts for service failures. The techniques apply whether you manage a single server or many services across several hosts.

Prerequisites

Before exploring systemd monitoring, ensure you have:

  • A Linux distribution using systemd (Ubuntu 16.04+, Debian 8+, CentOS 7+, Rocky Linux 8+)
  • Root or sudo access for service management
  • Basic understanding of systemd service units
  • Familiarity with Linux command line
  • Services configured under systemd management

Verify systemd is running:

# Check systemd version
systemctl --version

# Verify systemd is PID 1
ps -p 1 -o comm=
# Should output: systemd

# Check systemd status
systemctl status

Understanding systemd Service States

Service States

systemd services can be in various states:

Active States:

  • active (running) - Service is running normally
  • active (exited) - One-time service completed successfully
  • active (waiting) - Service is waiting for an event

Inactive, Failed, and Transitional States:

  • inactive (dead) - Service is not running
  • failed - Service failed to start or crashed
  • activating - Service is starting
  • deactivating - Service is stopping

Enable States:

  • enabled - Service starts automatically at boot
  • disabled - Service doesn't start at boot
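  • static - Unit has no [Install] section, so it can't be enabled directly; it is usually pulled in as a dependency of another unit
  • masked - Unit is linked to /dev/null and can't be started, either manually or as a dependency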
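
The active-state strings listed above are what `systemctl is-active` prints, so a monitoring script can branch on them. The following is a minimal sketch; the function name `classify_state` and the OK/WARN/CRIT labels are illustrative assumptions, not something systemd defines.

```shell
# classify_state: map an active-state string to a hypothetical severity label.
classify_state() {
    case "$1" in
        active)                            echo "OK" ;;
        activating|deactivating|reloading) echo "WARN" ;;
        inactive)                          echo "WARN" ;;
        failed)                            echo "CRIT" ;;
        *)                                 echo "UNKNOWN" ;;
    esac
}

# Demo with a literal state; on a systemd host you could instead run:
#   classify_state "$(systemctl is-active nginx)"
classify_state failed
```

Whether an inactive service deserves a warning depends on the service; adjust the mapping to your own policy.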

Basic Service Monitoring Commands

Check service status:

# Basic status
systemctl status nginx

# Show all properties
systemctl show nginx

# Check if service is active
systemctl is-active nginx

# Check if service is enabled
systemctl is-enabled nginx

# Check if service failed
systemctl is-failed nginx

List all services:

# List all loaded services
systemctl list-units --type=service

# List all services (including inactive)
systemctl list-units --type=service --all

# List failed services
systemctl list-units --state=failed

# List enabled services
systemctl list-unit-files --type=service --state=enabled

# List running services
systemctl list-units --type=service --state=running

Monitoring Service Status

Detailed Service Status

Get comprehensive service information:

# Full status with recent logs
systemctl status nginx -l --no-pager

# Status of multiple services
systemctl status nginx mysql redis

# Show service dependency tree
systemctl list-dependencies nginx

# Show reverse dependencies (what depends on this service)
systemctl list-dependencies nginx --reverse

Service properties:

# Show all properties
systemctl show nginx

# Show specific property
systemctl show nginx -p MainPID
systemctl show nginx -p ActiveState
systemctl show nginx -p SubState
systemctl show nginx -p LoadState
systemctl show nginx -p UnitFileState

# Multiple properties
systemctl show nginx -p MainPID -p MemoryCurrent -p CPUUsageNSec
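
Because `systemctl show` prints one `Key=Value` pair per line, scripts often need to pull a single value out of that output. Here is a small sketch of one way to do it; `prop_value` is a hypothetical helper name and the sample values are made up.

```shell
# prop_value: read "Key=Value" lines on stdin and print the value for one key.
# Everything after the first "=" is kept, so values containing "=" survive.
prop_value() {
    awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }
    '
}

# Sample input in the format that `systemctl show` prints (values are made up).
sample='MainPID=1234
ActiveState=active
SubState=running'

printf '%s\n' "$sample" | prop_value ActiveState
```

On a live system the same helper would be fed from `systemctl show nginx -p MainPID -p ActiveState`. Newer systemd releases also accept `--value` to print only the value.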

Real-Time Service Monitoring

Watch service status:

# Continuously monitor service status (updates every 2 seconds)
watch -n 2 'systemctl status nginx'

# Monitor multiple services
watch -n 2 'systemctl status nginx mysql redis | grep -E "Active|Main PID|Memory"'

# Monitor failed services
watch -n 5 'systemctl list-units --state=failed'

Follow service logs in real-time:

# Follow service logs
journalctl -u nginx -f

# Follow with more context
journalctl -u nginx -f -n 100

# Follow multiple services
journalctl -u nginx -u mysql -f

# Follow the entire journal (all units, plus kernel messages)
journalctl -f

Resource Monitoring

CPU and Memory Usage

Check resource consumption:

# Show resource usage for service
systemctl status nginx

# Detailed resource statistics
systemd-cgtop

# Resource usage for specific service
systemctl show nginx -p CPUUsageNSec -p MemoryCurrent

# Human-readable memory usage
systemctl show nginx -p MemoryCurrent | awk -F= '{printf "Memory: %.2f MB\n", $2/1024/1024}'

Monitor resource limits:

# Check configured limits
systemctl show nginx -p LimitNOFILE -p LimitNPROC -p LimitMEMLOCK

# Check current vs limit
systemctl show nginx | grep -E "Limit|Current"
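
`MemoryCurrent` is reported in bytes, and it may come back as `[not set]` when memory accounting is disabled for the unit. The sketch below converts a value defensively; `format_bytes` is a hypothetical helper, and treating the maximum 64-bit value as "unset" is an assumption based on how some systemd versions report missing accounting data.

```shell
# format_bytes: print a byte count in MB, or "n/a" for non-numeric input.
format_bytes() {
    case "$1" in
        ''|*[!0-9]*)          echo "n/a" ;;   # empty or "[not set]"
        18446744073709551615) echo "n/a" ;;   # assumed "unset" marker (2^64 - 1)
        *) awk -v b="$1" 'BEGIN { printf "%.2f MB\n", b / 1024 / 1024 }' ;;
    esac
}

format_bytes 52428800      # 50 * 1024 * 1024 bytes
format_bytes "[not set]"
```

On a systemd host, the input would come from `systemctl show nginx -p MemoryCurrent | cut -d= -f2`.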

systemd-cgtop for Real-Time Resource Monitoring

Interactive resource monitoring:

# Launch systemd-cgtop (like top for services)
systemd-cgtop

# Press 'p' to sort by path
# Press 't' to sort by tasks
# Press 'c' to sort by CPU
# Press 'm' to sort by memory
# Press 'q' to quit

# Batch mode (single output)
systemd-cgtop -n 1 --batch

# Monitor specific services
systemd-cgtop | grep -E "nginx|mysql|redis"

Resource usage script:

#!/bin/bash
# service-resources.sh - Monitor service resource usage

SERVICES=("nginx" "mysql" "redis")

echo "Service Resource Usage Report"
echo "=============================="
echo "Date: $(date)"
echo ""

for service in "${SERVICES[@]}"; do
    if systemctl is-active --quiet "$service"; then
        echo "Service: $service"

        # Get PID
        PID=$(systemctl show "$service" -p MainPID | cut -d= -f2)
        echo "  PID: $PID"

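        # Get memory usage ("[not set]" when memory accounting is off)
        MEM=$(systemctl show "$service" -p MemoryCurrent | cut -d= -f2)
        if [[ "$MEM" =~ ^[0-9]+$ ]]; then
            MEM_MB=$(echo "scale=2; $MEM/1024/1024" | bc)
            echo "  Memory: ${MEM_MB} MB"
        else
            echo "  Memory: n/a"
        fi

        # Get CPU usage (accumulated since the service started)
        CPU=$(systemctl show "$service" -p CPUUsageNSec | cut -d= -f2)
        if [[ "$CPU" =~ ^[0-9]+$ ]]; then
            CPU_SEC=$(echo "scale=2; $CPU/1000000000" | bc)
            echo "  CPU Time: ${CPU_SEC}s"
        else
            echo "  CPU Time: n/a"
        fi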

        # Get task count
        TASKS=$(systemctl show "$service" -p TasksCurrent | cut -d= -f2)
        echo "  Tasks: $TASKS"

        echo ""
    else
        echo "Service: $service - NOT RUNNING"
        echo ""
    fi
done
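
The script above reports accumulated CPU time, which only grows. To estimate a CPU percentage, one approach is to sample `CPUUsageNSec` twice and divide the difference by the elapsed time. The following is a sketch of that arithmetic; `cpu_percent` is a hypothetical helper, and the result is relative to one CPU core.

```shell
# cpu_percent START_NS END_NS INTERVAL_SEC
# Percentage of one core used between two CPUUsageNSec samples.
cpu_percent() {
    awk -v a="$1" -v b="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", (b - a) / 1000000000 / t * 100 }'
}

# Demo with made-up samples: 0.5 s of CPU time consumed over 5 s of wall time.
cpu_percent 1000000000 1500000000 5

# On a systemd host the samples could be taken like this:
#   s1=$(systemctl show nginx -p CPUUsageNSec | cut -d= -f2)
#   sleep 5
#   s2=$(systemctl show nginx -p CPUUsageNSec | cut -d= -f2)
#   cpu_percent "$s1" "$s2" 5
```

A multi-threaded service can exceed 100 with this formula, since the counter sums CPU time across all cores.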

Service Failure Monitoring

Detecting Service Failures

Check for failed services:

# List all failed services
systemctl --failed

# Count failed services
systemctl --failed --no-legend | wc -l

# Get failure reason
systemctl status nginx | grep -A 5 "Process"

# Show failure details
systemctl show nginx -p Result -p ExecMainStatus

Failed service details:

# Get exit code
systemctl show nginx -p ExecMainStatus

# Get failure result
systemctl show nginx -p Result
# Possible values include: success, protocol, timeout, exit-code, signal,
# core-dump, watchdog, start-limit-hit, resources, oom-kill

# View recent failures
journalctl -u nginx --since "1 hour ago" | grep -i "failed\|error"
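
The `Result` property is terse, so alert messages are easier to read if the value is translated into a sentence. Here is a minimal sketch; `describe_result` is a hypothetical helper, and the wording of each description is a paraphrase rather than official systemd text.

```shell
# describe_result: turn a Result value into a short human-readable explanation.
describe_result() {
    case "$1" in
        success)         echo "service ran and exited cleanly" ;;
        protocol)        echo "service violated its Type= startup protocol" ;;
        timeout)         echo "a start or stop operation timed out" ;;
        exit-code)       echo "main process exited with a non-zero status" ;;
        signal)          echo "main process was terminated by a signal" ;;
        core-dump)       echo "main process crashed and dumped core" ;;
        watchdog)        echo "watchdog keep-alive was missed" ;;
        start-limit-hit) echo "start rate limit was reached" ;;
        resources)       echo "a required system operation or resource failed" ;;
        oom-kill)        echo "a process was killed by the out-of-memory killer" ;;
        *)               echo "unrecognized result: $1" ;;
    esac
}

# Demo with a literal value; on a systemd host you could instead run:
#   describe_result "$(systemctl show nginx -p Result | cut -d= -f2)"
describe_result exit-code
```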

Automated Failure Detection Script

#!/bin/bash
# monitor-failed-services.sh - Alert on service failures

EMAIL="admin@example.com"  # Placeholder address; replace with your own
HOSTNAME=$(hostname)
STATE_FILE="/var/lib/monitoring/failed-services-state"

mkdir -p /var/lib/monitoring

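# Get currently failed services (--plain drops the leading status bullet
# that newer systemd versions print, so $1 is always the unit name)
FAILED=$(systemctl --failed --no-legend --plain | awk '{print $1}')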

if [ -n "$FAILED" ]; then
    # Check if this is a new failure
    if [ ! -f "$STATE_FILE" ] || ! diff -q <(echo "$FAILED") "$STATE_FILE" > /dev/null 2>&1; then
        # Send alert
        {
            echo "Service Failure Alert on $HOSTNAME"
            echo "=================================="
            echo "Time: $(date)"
            echo ""
            echo "Failed Services:"
            echo "$FAILED"
            echo ""
            echo "Details:"
            echo "--------"
            for service in $FAILED; do
                echo ""
                echo "Service: $service"
                systemctl status "$service" --no-pager -l
                echo ""
                echo "Recent Logs:"
                journalctl -u "$service" -n 20 --no-pager
                echo "---"
            done
        } | mail -s "ALERT: Service Failures on $HOSTNAME" "$EMAIL"

        # Update state file
        echo "$FAILED" > "$STATE_FILE"
    fi
else
    # No failures, remove state file
    rm -f "$STATE_FILE"
fi
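
The script above sends the full list whenever the set of failed units changes. If you only want to report units that failed since the previous run, the two lists can be compared line by line. This is a sketch under the assumption that each file holds one unit name per line; `newly_failed` and the demo unit names are hypothetical.

```shell
# newly_failed PREV_FILE CUR_FILE
# Print the units listed in CUR_FILE that are not present in PREV_FILE.
newly_failed() {
    if [ ! -f "$1" ]; then
        cat "$2"            # no previous state: every failure is new
        return
    fi
    awk -v prev="$1" '
        FILENAME == prev { seen[$0]; next }   # remember previous failures
        !($0 in seen)                         # print only unseen units
    ' "$1" "$2"
}

# Demo with made-up unit names.
prev=$(mktemp); cur=$(mktemp)
printf 'nginx.service\n' > "$prev"
printf 'nginx.service\nredis.service\n' > "$cur"
newly_failed "$prev" "$cur"
rm -f "$prev" "$cur"
```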

Service Restart Policies

Automatic Restart Configuration

Configure service restart:

# Edit service unit
sudo systemctl edit nginx

Add restart configuration:

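# Rate limiting belongs in [Unit] on current systemd; older releases
# (such as the one in CentOS 7) use StartLimitInterval= in [Service] instead
[Unit]
StartLimitIntervalSec=200s
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5s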

Restart policy options:

  • Restart=no - Never restart (default)
  • Restart=on-success - Restart only on clean exit
  • Restart=on-failure - Restart on failures
  • Restart=on-abnormal - Restart on crashes, watchdog, timeouts
  • Restart=on-abort - Restart on unclean signal
  • Restart=on-watchdog - Restart on watchdog timeout
  • Restart=always - Always restart
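
The policies above differ in which exit causes trigger a restart. The sketch below encodes that mapping as a lookup so the differences are easy to compare; `would_restart` and the cause labels (`clean-exit`, `unclean-exit`, `unclean-signal`, `timeout`, `watchdog`) are names invented for this illustration.

```shell
# would_restart POLICY CAUSE
# Print "yes" if a service with Restart=POLICY is restarted after CAUSE.
would_restart() {
    case "$1:$2" in
        always:*)                                 echo yes ;;
        on-success:clean-exit)                    echo yes ;;
        on-failure:unclean-exit|on-failure:unclean-signal) echo yes ;;
        on-failure:timeout|on-failure:watchdog)   echo yes ;;
        on-abnormal:unclean-signal)               echo yes ;;
        on-abnormal:timeout|on-abnormal:watchdog) echo yes ;;
        on-abort:unclean-signal)                  echo yes ;;
        on-watchdog:watchdog)                     echo yes ;;
        *)                                        echo no ;;
    esac
}

would_restart on-failure unclean-exit    # non-zero exit status
would_restart on-abnormal unclean-exit   # on-abnormal ignores exit codes
```

The main practical difference is that on-abnormal does not restart after a plain non-zero exit status, while on-failure does.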

Example configurations:

# Web server - restart on failure
[Service]
Restart=on-failure
RestartSec=10s

# Database - restart only on clean exit
[Service]
Restart=on-success
RestartSec=30s

# Critical service - always restart with rate limiting
# (on older systemd, set StartLimitInterval= in [Service] instead)
[Unit]
StartLimitIntervalSec=300s
StartLimitBurst=5

[Service]
Restart=always
RestartSec=5s

# Worker process - restart on abnormal exit
[Service]
Restart=on-abnormal
RestartSec=15s
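
The restart delay and the start rate limit interact. If a service crashes immediately on every start, attempts are spaced roughly `RestartSec` apart, so the limit only trips when more than `StartLimitBurst` starts fit inside the interval. The following sketch checks that condition under this simplified model, which ignores the time the service spends starting and running; `start_limit_reachable` is a hypothetical helper.

```shell
# start_limit_reachable RESTART_SEC BURST INTERVAL_SEC
# Under the simplified model, the (BURST + 1)th start happens about
# RESTART_SEC * BURST seconds after the first. The limit can only trip
# if that moment still falls inside the interval.
start_limit_reachable() {
    if [ $(( $1 * $2 )) -lt "$3" ]; then
        echo yes
    else
        echo no
    fi
}

start_limit_reachable 5 5 300    # RestartSec=5s, burst 5, interval 300s
start_limit_reachable 30 5 100   # restarts are too slow to ever hit the limit
```

When the answer is "no", a crash-looping service keeps restarting indefinitely and never enters the failed state, so failure alerts based on that state will not fire.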

Apply changes:

# Reload systemd configuration
sudo systemctl daemon-reload

# Restart service
sudo systemctl restart nginx

# Verify new configuration
# (time settings are exposed with a USec suffix, so RestartSec= appears as RestartUSec)
systemctl show nginx -p Restart -p RestartUSec

Monitor Restart Activity

Check restart count:

# View service restarts
systemctl show nginx -p NRestarts

# View with status
systemctl status nginx | grep -i restart

# Check restart rate limiting
systemctl show nginx -p StartLimitBurst -p StartLimitIntervalUSec

Track restart history:

# View restart events in journal
journalctl -u nginx | grep -E "Started|Stopped|Failed"

# Count restarts in last hour
journalctl -u nginx --since "1 hour ago" | grep -c "Started"

# View restart timestamps
journalctl -u nginx -o short-precise | grep "Started"

Watchdog Monitoring

Configure Watchdog

systemd can monitor services using watchdog functionality.

Enable watchdog in service:

sudo systemctl edit myapp

Add the watchdog settings:

[Service]
WatchdogSec=30s
Restart=on-watchdog

Application must send watchdog notifications:

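# Python example using the python-systemd library (module name: systemd)
import time

import systemd.daemon


def process_data():
    """Placeholder for the application's real work."""
    pass


while True:
    # Do work
    process_data()

    # Notify watchdog (service is alive); this must happen
    # more often than the configured WatchdogSec
    systemd.daemon.notify('WATCHDOG=1')
    time.sleep(10)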

Monitor watchdog status:

# Check watchdog configuration
systemctl show myapp -p WatchdogUSec -p WatchdogTimestamp

# View watchdog events
journalctl -u myapp | grep watchdog
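
A service written as a shell script can send the same keep-alive with the `systemd-notify` command. systemd exports the configured timeout to the service in the `WATCHDOG_USEC` environment variable (in microseconds), and a common convention is to ping at half that interval. The sketch below computes the ping interval; `watchdog_interval_sec` is a hypothetical helper, and the loop in the comments assumes the unit allows notifications from child processes (for example via `NotifyAccess=all`).

```shell
# watchdog_interval_sec USEC
# Half of the watchdog timeout, converted from microseconds to whole seconds.
watchdog_interval_sec() {
    usec=${1:-0}
    echo $(( usec / 2 / 1000000 ))
}

# Demo: WatchdogSec=30s is exported as WATCHDOG_USEC=30000000.
watchdog_interval_sec 30000000

# Inside a real service the main loop might look like this:
#   interval=$(watchdog_interval_sec "$WATCHDOG_USEC")
#   while true; do
#       do_work                      # hypothetical application function
#       systemd-notify WATCHDOG=1    # tell systemd the service is alive
#       sleep "$interval"
#   done
```

Timeouts shorter than two seconds round down to zero here, so a real script would need a finer-grained sleep for such values.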

Dependency Monitoring

Service Dependencies

View dependencies:

# What this service requires
systemctl list-dependencies nginx

# What requires this service
systemctl list-dependencies nginx --reverse

# Full dependency tree
systemctl list-dependencies nginx --all

# Just direct dependencies (read from the unit's properties)
systemctl show nginx -p Requires -p Wants

Dependency types:

# View unit file to see dependency configuration
systemctl cat nginx

# Common dependency directives:
# Requires= - Hard dependency (fails if dependency fails)
# Wants= - Soft dependency (continues if dependency fails)
# After= - Order dependency (start after)
# Before= - Order dependency (start before)
# BindsTo= - Strong binding (stops if dependency stops)

Monitor Dependency Failures

#!/bin/bash
# check-service-dependencies.sh

SERVICE="$1"

if [ -z "$SERVICE" ]; then
    echo "Usage: $0 <service-name>"
    exit 1
fi

echo "Checking dependencies for $SERVICE"
echo "==================================="

# Get required dependencies
REQUIRES=$(systemctl show "$SERVICE" -p Requires | cut -d= -f2)

if [ -n "$REQUIRES" ]; then
    echo "Required dependencies:"
    for dep in $REQUIRES; do
        STATUS=$(systemctl is-active "$dep")
        if [ "$STATUS" != "active" ]; then
            echo "  [WARN] $dep: $STATUS"
        else
            echo "  [OK] $dep: $STATUS"
        fi
    done
else
    echo "No hard dependencies"
fi

echo ""

# Get wanted dependencies
WANTS=$(systemctl show "$SERVICE" -p Wants | cut -d= -f2)

if [ -n "$WANTS" ]; then
    echo "Optional dependencies:"
    for dep in $WANTS; do
        STATUS=$(systemctl is-active "$dep")
        echo "  $dep: $STATUS"
    done
fi

Logging and Journal Monitoring

Journal Integration

Service-specific logs:

# View logs for service
journalctl -u nginx

# Last 100 lines
journalctl -u nginx -n 100

# Follow logs
journalctl -u nginx -f

# Since specific time
journalctl -u nginx --since "2024-01-11 10:00:00"
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since today

# Date range
journalctl -u nginx --since "2024-01-11" --until "2024-01-12"

# Priority filtering (errors only)
journalctl -u nginx -p err

# Multiple services
journalctl -u nginx -u mysql

Log analysis:

# Count error lines (-q suppresses informational lines such as "-- No entries --")
journalctl -u nginx --since today -p err -q --no-pager | wc -l

# Extract specific patterns
journalctl -u nginx --since today | grep "404\|500"

# Export to file
journalctl -u nginx --since "1 hour ago" > /tmp/nginx-logs.txt

# JSON output
journalctl -u nginx -n 10 -o json-pretty

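# Show kernel messages that mention the service (for example, OOM kills);
# combining -k with -u would match nothing, since kernel messages carry no unit field
journalctl -k | grep -i nginx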
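
The `-p` filter used above works on syslog severity levels: `-p err` shows entries at level `err` and anything more severe (lower number). Here is a small sketch of the name-to-number mapping; `priority_number` and `passes_filter` are hypothetical helper names.

```shell
# priority_number: map a syslog priority name to its numeric level (0-7).
priority_number() {
    case "$1" in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# passes_filter ENTRY_PRIORITY FILTER_PRIORITY
# "yes" if an entry of ENTRY_PRIORITY would be shown by `journalctl -p FILTER`.
passes_filter() {
    if [ "$(priority_number "$1")" -le "$(priority_number "$2")" ]; then
        echo yes
    else
        echo no
    fi
}

passes_filter crit err      # crit (2) is more severe than err (3)
passes_filter warning err   # warning (4) is filtered out
```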

Custom Monitoring Scripts

Comprehensive Service Monitor

#!/bin/bash
# comprehensive-service-monitor.sh - Complete service monitoring

SERVICES=("nginx" "mysql" "redis" "ssh")
REPORT_FILE="/tmp/service-monitor-$(date +%Y%m%d-%H%M).txt"
ALERT_EMAIL="admin@example.com"  # Placeholder address; replace with your own

declare -a ALERTS=()

{
    echo "========================================="
    echo "Service Monitoring Report"
    echo "Date: $(date)"
    echo "Hostname: $(hostname)"
    echo "========================================="
    echo ""

    for service in "${SERVICES[@]}"; do
        echo "--- Service: $service ---"

        # Check if service exists
        if ! systemctl list-unit-files | grep -q "^${service}.service"; then
            echo "  Status: NOT INSTALLED"
            echo ""
            continue
        fi

        # Get status
        STATUS=$(systemctl is-active "$service")
        ENABLED=$(systemctl is-enabled "$service" 2>/dev/null || echo "unknown")

        echo "  Status: $STATUS"
        echo "  Enabled: $ENABLED"

        if [ "$STATUS" = "active" ]; then
            # Get resource usage
            MEM=$(systemctl show "$service" -p MemoryCurrent | cut -d= -f2)
            if [ "$MEM" != "[not set]" ] && [ "$MEM" -gt 0 ]; then
                MEM_MB=$(echo "scale=2; $MEM/1024/1024" | bc)
                echo "  Memory: ${MEM_MB} MB"
            fi

            # Get restart count
            RESTARTS=$(systemctl show "$service" -p NRestarts | cut -d= -f2)
            echo "  Restarts: $RESTARTS"

            # Check for recent errors (-q keeps "-- No entries --" out of the count)
            ERROR_COUNT=$(journalctl -u "$service" --since "1 hour ago" -p err -q --no-pager | wc -l)
            echo "  Recent Errors (1h): $ERROR_COUNT"

            if [ "$ERROR_COUNT" -gt 10 ]; then
                ALERTS+=("High error count for $service: $ERROR_COUNT errors in last hour")
            fi
        else
            echo "  [ALERT] Service is not active!"
            ALERTS+=("Service $service is $STATUS")
        fi

        # Check last restart
        LAST_START=$(systemctl show "$service" -p ActiveEnterTimestamp | cut -d= -f2)
        echo "  Last Started: $LAST_START"

        echo ""
    done

    # Summary
    echo "========================================="
    echo "Summary"
    echo "========================================="

    ACTIVE_COUNT=0
    INACTIVE_COUNT=0

    for service in "${SERVICES[@]}"; do
        if systemctl is-active --quiet "$service"; then
            ((ACTIVE_COUNT++))
        else
            ((INACTIVE_COUNT++))
        fi
    done

    echo "Active Services: $ACTIVE_COUNT"
    echo "Inactive Services: $INACTIVE_COUNT"

    if [ ${#ALERTS[@]} -gt 0 ]; then
        echo ""
        echo "ALERTS:"
        for alert in "${ALERTS[@]}"; do
            echo "  - $alert"
        done
    else
        echo ""
        echo "No alerts - all services healthy"
    fi

} > "$REPORT_FILE"

# Display report
cat "$REPORT_FILE"

# Send email if there are alerts
if [ ${#ALERTS[@]} -gt 0 ]; then
    mail -s "Service Alert: $(hostname)" "$ALERT_EMAIL" < "$REPORT_FILE"
fi

Service Availability Monitor

#!/bin/bash
# service-availability.sh - Track service uptime and availability

SERVICE="$1"
STATS_FILE="/var/lib/monitoring/service-stats-${SERVICE}.json"

if [ -z "$SERVICE" ]; then
    echo "Usage: $0 <service-name>"
    exit 1
fi

mkdir -p /var/lib/monitoring

# Check if service is active
if systemctl is-active --quiet "$SERVICE"; then
    STATUS="up"
else
    STATUS="down"
fi

# Update statistics
if [ -f "$STATS_FILE" ]; then
    # Load existing stats
    TOTAL_CHECKS=$(jq -r '.total_checks' "$STATS_FILE")
    UP_CHECKS=$(jq -r '.up_checks' "$STATS_FILE")
    LAST_STATUS=$(jq -r '.last_status' "$STATS_FILE")

    # Increment counters
    ((TOTAL_CHECKS++))
    if [ "$STATUS" = "up" ]; then
        ((UP_CHECKS++))
    fi

    # Check for status change
    if [ "$STATUS" != "$LAST_STATUS" ]; then
        echo "Status change detected: $LAST_STATUS -> $STATUS"
        # Log to journal
        logger -t service-monitor "Service $SERVICE changed from $LAST_STATUS to $STATUS"
    fi
else
    # Initialize stats
    TOTAL_CHECKS=1
    if [ "$STATUS" = "up" ]; then
        UP_CHECKS=1
    else
        UP_CHECKS=0
    fi
fi

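# Calculate availability (multiply before dividing; bc with scale=2 would
# truncate 2/3 to .66 first and report 66.00 instead of 66.67)
AVAILABILITY=$(awk -v up="$UP_CHECKS" -v total="$TOTAL_CHECKS" 'BEGIN { printf "%.2f", up * 100 / total }')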

# Save stats
cat > "$STATS_FILE" <<EOF
{
  "service": "$SERVICE",
  "last_check": "$(date -Iseconds)",
  "current_status": "$STATUS",
  "last_status": "$STATUS",
  "total_checks": $TOTAL_CHECKS,
  "up_checks": $UP_CHECKS,
  "availability": $AVAILABILITY
}
EOF

echo "Service: $SERVICE"
echo "Status: $STATUS"
echo "Availability: ${AVAILABILITY}%"
echo "Checks: $UP_CHECKS/$TOTAL_CHECKS"
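
The availability figure is only as precise as the check interval, and it can also be expressed as estimated downtime. The following sketch shows both calculations; `availability_percent` and `downtime_minutes` are hypothetical helpers, and the downtime estimate assumes every failed check represents one full interval of outage.

```shell
# availability_percent UP_CHECKS TOTAL_CHECKS
availability_percent() {
    if [ "$2" -eq 0 ]; then
        echo "n/a"      # avoid dividing by zero before the first check
        return
    fi
    awk -v up="$1" -v total="$2" 'BEGIN { printf "%.2f\n", up * 100 / total }'
}

# downtime_minutes UP_CHECKS TOTAL_CHECKS INTERVAL_MINUTES
downtime_minutes() {
    echo $(( ($2 - $1) * $3 ))
}

# Demo: 1437 successful checks out of 1440 (one day at 1-minute intervals).
availability_percent 1437 1440
downtime_minutes 1437 1440 1
```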

Alerting Integration

systemd OnFailure Integration

Configure alert on service failure:

# Create alert service (a template unit; note the "@" before ".service")
sudo nano /etc/systemd/system/service-alert@.service

[Unit]
Description=Alert on service failure for %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/send-service-alert.sh %i

Create alert script:

sudo nano /usr/local/bin/send-service-alert.sh

#!/bin/bash
SERVICE="$1"
EMAIL="admin@example.com"  # Placeholder address; replace with your own
HOSTNAME=$(hostname)

{
    echo "Service Failure Alert"
    echo "===================="
    echo "Service: $SERVICE"
    echo "Hostname: $HOSTNAME"
    echo "Time: $(date)"
    echo ""
    echo "Status:"
    systemctl status "$SERVICE" --no-pager -l
    echo ""
    echo "Recent Logs:"
    journalctl -u "$SERVICE" -n 50 --no-pager
} | mail -s "CRITICAL: $SERVICE failed on $HOSTNAME" "$EMAIL"

sudo chmod +x /usr/local/bin/send-service-alert.sh

Add to service configuration:

sudo systemctl edit nginx

[Unit]
OnFailure=service-alert@%n.service

sudo systemctl daemon-reload
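
In the `OnFailure=` line, `%n` expands to the full name of the failing unit, including its `.service` suffix. The resulting alert unit name therefore ends in `.service.service`, and the template's `%i` receives the full unit name. This sketch reproduces the string handling to make that visible; `alert_unit_for` and `instance_of` are hypothetical helpers and do not handle names that need systemd escaping.

```shell
# alert_unit_for UNIT: the unit name that OnFailure=service-alert@%n.service yields.
alert_unit_for() {
    printf 'service-alert@%s.service\n' "$1"
}

# instance_of TEMPLATE_INSTANCE: the part between "@" and the final ".service",
# which is what %i expands to inside the template unit.
instance_of() {
    name=${1#*@}
    echo "${name%.service}"
}

unit=$(alert_unit_for nginx.service)
echo "$unit"
instance_of "$unit"
```

Because `%i` is the full unit name, the alert script can pass its argument straight to `systemctl status` and `journalctl -u`.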

Performance Monitoring

Benchmark Service Startup

# Analyze service startup time
systemd-analyze blame | grep nginx

# Show critical chain
systemd-analyze critical-chain nginx.service

# Plot boot chart (writes an SVG directly; no extra tools needed)
systemd-analyze plot > boot.svg
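
To compare startup times or alert on a threshold, the time spans printed by `systemd-analyze blame` need to be converted to plain numbers. The following is a sketch that assumes spans look like `542ms`, `1.023s`, or `1min 2.345s`; `timespan_to_ms` is a hypothetical helper, and other span formats are not handled.

```shell
# timespan_to_ms "SPAN": convert a span such as "1min 2.345s" to milliseconds.
timespan_to_ms() {
    echo "$1" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) {
            # Check the longer suffixes first so "ms" and "us" are not read as "s".
            if      ($i ~ /ms$/)  total += substr($i, 1, length($i) - 2)
            else if ($i ~ /us$/)  total += substr($i, 1, length($i) - 2) / 1000
            else if ($i ~ /min$/) total += substr($i, 1, length($i) - 3) * 60000
            else if ($i ~ /h$/)   total += substr($i, 1, length($i) - 1) * 3600000
            else if ($i ~ /s$/)   total += substr($i, 1, length($i) - 1) * 1000
        }
        printf "%.0f\n", total
    }'
}

timespan_to_ms "542ms"
timespan_to_ms "1min 2.345s"
```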

Monitor Service Performance

#!/bin/bash
# service-performance.sh - Track service performance metrics

SERVICE="$1"

if [ -z "$SERVICE" ]; then
    echo "Usage: $0 <service-name>"
    exit 1
fi

echo "Performance Metrics for $SERVICE"
echo "================================="

# Startup time
STARTUP_TIME=$(systemd-analyze blame | grep "$SERVICE" | awk '{print $1}')
echo "Startup Time: $STARTUP_TIME"

# Memory usage
MEM=$(systemctl show "$SERVICE" -p MemoryCurrent | cut -d= -f2)
if [ "$MEM" != "[not set]" ]; then
    MEM_MB=$(echo "scale=2; $MEM/1024/1024" | bc)
    echo "Memory Usage: ${MEM_MB} MB"
fi

# CPU time
CPU=$(systemctl show "$SERVICE" -p CPUUsageNSec | cut -d= -f2)
if [ "$CPU" != "[not set]" ]; then
    CPU_SEC=$(echo "scale=2; $CPU/1000000000" | bc)
    echo "CPU Time: ${CPU_SEC}s"
fi

# Tasks
TASKS=$(systemctl show "$SERVICE" -p TasksCurrent | cut -d= -f2)
echo "Active Tasks: $TASKS"

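# File descriptors of the main process, counted from /proc
# (usually requires root to read another user's process)
PID=$(systemctl show "$SERVICE" -p MainPID | cut -d= -f2)
if [ -n "$PID" ] && [ "$PID" != "0" ] && [ -r "/proc/$PID/fd" ]; then
    FD=$(ls "/proc/$PID/fd" | wc -l)
    echo "Open FDs: $FD"
fi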

Conclusion

systemd provides comprehensive built-in monitoring capabilities that enable effective service management without requiring external monitoring tools. By mastering systemd's monitoring features, you can track service health, detect failures, analyze resource usage, and automate remediation directly from the Linux command line.

Key takeaways:

  1. Built-in monitoring - systemd tracks extensive service metrics natively
  2. Resource tracking - Monitor CPU, memory, and other resources per service
  3. Automatic restart - Configure intelligent restart policies for resilience
  4. Journal integration - Unified logging with powerful filtering and search
  5. Dependency awareness - Monitor and manage service dependencies

Best practices:

  • Configure appropriate restart policies for each service type
  • Monitor failed services regularly
  • Implement alerting for critical service failures
  • Track resource usage to identify performance issues
  • Use journalctl for centralized log analysis
  • Document service dependencies
  • Automate routine monitoring tasks
  • Integrate with external monitoring for comprehensive coverage

While systemd provides excellent built-in monitoring, consider complementing it with dedicated monitoring solutions like Prometheus, Nagios, or Zabbix for historical metrics, advanced alerting, and distributed monitoring across multiple servers. systemd's monitoring capabilities form the foundation for effective service management in modern Linux infrastructure.