Disaster Recovery Plan for Linux Servers

A comprehensive disaster recovery (DR) plan is essential for any organization relying on Linux infrastructure. This guide provides a complete framework for developing, implementing, and maintaining an effective disaster recovery strategy that minimizes downtime and data loss while ensuring business continuity.

Table of Contents

  1. Understanding RTO and RPO
  2. Conducting a Risk Assessment
  3. Backup Strategies
  4. Recovery Procedures
  5. Testing Schedule
  6. Documentation Template
  7. Implementation Best Practices
  8. Monitoring and Alerting
  9. Conclusion

Understanding RTO and RPO

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are foundational concepts that define your disaster recovery requirements.

RTO represents the maximum acceptable downtime before business impact becomes critical. For example, an RTO of 4 hours means your services must be restored within that timeframe.

RPO defines the maximum acceptable data loss. An RPO of 1 hour means you can tolerate losing up to 1 hour of data.
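A schedule only satisfies an RPO if the interval between backup starts, plus the time one backup takes to finish, stays within it. A minimal sketch of that check (the helper name is illustrative):

```shell
#!/bin/bash
# Worst-case data loss is roughly the interval between backup starts plus the
# time one backup takes to complete. This helper (name is illustrative)
# returns success only when a schedule stays within the target RPO.
meets_rpo() {
    local interval_min=$1   # minutes between backup starts
    local duration_min=$2   # minutes a backup takes to complete
    local rpo_min=$3        # target RPO in minutes
    [ $((interval_min + duration_min)) -le "$rpo_min" ]
}

# Hourly backups that take 10 minutes cannot satisfy a 60-minute RPO:
meets_rpo 60 10 60 || echo "schedule misses the RPO target"
```

In practice this means a 1-hour RPO usually requires backups more often than hourly, or a faster backup mechanism.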

# Calculate RTO and RPO for your infrastructure
cat > /tmp/rto_rpo_calculator.sh << 'EOF'
#!/bin/bash

# Define criticality levels
declare -A criticality_rto
declare -A criticality_rpo

criticality_rto[critical]="1h"
criticality_rto[high]="4h"
criticality_rto[medium]="8h"
criticality_rto[low]="24h"

criticality_rpo[critical]="15min"
criticality_rpo[high]="1h"
criticality_rpo[medium]="4h"
criticality_rpo[low]="24h"

# Function to determine service criticality
determine_criticality() {
    local service=$1
    local business_impact=$2
    
    if [[ "$business_impact" == *"critical"* ]]; then
        echo "critical"
    elif [[ "$business_impact" == *"significant"* ]]; then
        echo "high"
    elif [[ "$business_impact" == *"moderate"* ]]; then
        echo "medium"
    else
        echo "low"
    fi
}

# Generate RTO/RPO matrix by classifying each service and looking up its targets
generate_dr_matrix() {
    echo "Service Criticality Analysis"
    echo "============================"
    echo "Service | Criticality | RTO | RPO"
    echo "--------|-------------|-----|----"
    local service impact level
    while IFS='|' read -r service impact; do
        level=$(determine_criticality "$service" "$impact")
        echo "$service | $level | ${criticality_rto[$level]} | ${criticality_rpo[$level]}"
    done << 'SERVICES'
web-frontend|critical revenue impact
database|critical data loss risk
mail-server|moderate disruption
internal-wiki|minor inconvenience
SERVICES
}

generate_dr_matrix
EOF

chmod +x /tmp/rto_rpo_calculator.sh

Conducting a Risk Assessment

A thorough risk assessment identifies potential threats and their impact on your infrastructure.

# Risk Assessment Framework
cat > /tmp/risk_assessment.md << 'EOF'
# Risk Assessment Template

## Infrastructure Assets
- [ ] Web servers
- [ ] Database servers
- [ ] Mail servers
- [ ] File servers
- [ ] Network infrastructure
- [ ] Storage systems

## Potential Threats
- [ ] Hardware failure (disk, memory, power supply)
- [ ] Data corruption
- [ ] Cyber attacks and ransomware
- [ ] Network outages
- [ ] Natural disasters
- [ ] Human error
- [ ] Software bugs
- [ ] Misconfiguration

## Impact Analysis
For each asset:
1. Identify critical functions
2. Estimate revenue impact per hour of downtime
3. Calculate recovery cost
4. Determine dependencies on other systems
EOF

cat /tmp/risk_assessment.md

Create a risk matrix to prioritize mitigation efforts:

# Create risk matrix analysis
cat > /tmp/risk_matrix.sh << 'EOF'
#!/bin/bash

# Risk = Probability x Impact
# Probability: 1-5 (1=unlikely, 5=very likely)
# Impact: 1-5 (1=minimal, 5=catastrophic)

declare -A risks=(
    ["disk_failure"]="4x5=20"
    ["ransomware"]="3x5=15"
    ["network_outage"]="2x4=8"
    ["human_error"]="4x3=12"
    ["power_failure"]="2x4=8"
    ["data_corruption"]="3x4=12"
)

echo "Risk Assessment Matrix"
echo "====================="
printf "%-25s %-15s %-10s\n" "Risk Type" "Score" "Priority"
echo "---------------------------------------------"

for risk in $(printf '%s\n' "${!risks[@]}" | sort); do
    score=${risks[$risk]}
    value=${score#*=}   # numeric part after the '=' ("4x5=20" -> "20")
    printf "%-25s %-15s %-10s\n" "$risk" "$score" "$([ "$value" -ge 15 ] && echo 'Critical' || echo 'High')"
done
EOF

chmod +x /tmp/risk_matrix.sh
bash /tmp/risk_matrix.sh

Backup Strategies

Implement a multi-layered backup strategy using the 3-2-1 rule: 3 copies of data, on 2 different media, with 1 offsite.
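A quick way to sanity-check compliance is to enumerate where each copy lives. The sketch below assumes a simple `name|media|location` inventory; the entries shown are placeholders for your own targets:

```shell
#!/bin/bash
# Sanity-check the 3-2-1 rule against an inventory of backup copies.
# The entries below are placeholders; each line is "name|media|location".
copies=(
    "local-disk|disk|onsite"
    "tape-library|tape|onsite"
    "remote-dc|disk|offsite"
)

total=${#copies[@]}
media=$(printf '%s\n' "${copies[@]}" | cut -d'|' -f2 | sort -u | wc -l)
offsite=$(printf '%s\n' "${copies[@]}" | grep -c '|offsite$')

if [ "$total" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]; then
    echo "3-2-1 rule satisfied: $total copies, $media media types, $offsite offsite"
else
    echo "3-2-1 rule NOT satisfied" >&2
fi
```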

Full Backup Strategy

# Full backup of entire server
backup_full() {
    local server=$1
    local backup_dir="/backup/full"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/full_${server}_${timestamp}.tar.gz"
    
    mkdir -p "$backup_dir"
    
    # Exclude unnecessary files
    tar --exclude='/proc' \
        --exclude='/sys' \
        --exclude='/dev' \
        --exclude='/tmp' \
        --exclude='/var/tmp' \
        --exclude='/var/log' \
        --exclude='/var/cache' \
        --exclude='/backup' \
        --exclude='*.swp' \
        -czf "$backup_file" / 2>/dev/null
    
    if [ $? -eq 0 ]; then
        echo "Full backup completed: $backup_file"
        ls -lh "$backup_file"
    else
        echo "Error: Full backup failed"
        return 1
    fi
}

# Execute full backup
backup_full "production-server-01"

Incremental Backup Strategy

# Incremental backup using find and tar
backup_incremental() {
    local backup_dir="/backup/incremental"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/incremental_${timestamp}.tar.gz"
    local last_backup="/var/lib/backup/.last_backup"
    
    mkdir -p "$backup_dir" "$(dirname "$last_backup")"
    
    # Create file listing for incremental backup
    if [ -f "$last_backup" ]; then
        # Backup only files modified since last backup
        find / -type f -newer "$last_backup" \
            -not -path '/proc/*' \
            -not -path '/sys/*' \
            -not -path '/dev/*' \
            -not -path '/tmp/*' \
            -not -path '/backup/*' | \
            tar -czf "$backup_file" -T - 2>/dev/null
    else
        # First run: no marker yet, so just seed it for the next run
        echo "No previous backup marker found; seeding marker"
    fi
    
    touch "$last_backup"
    echo "Incremental backup: $backup_file"
}

backup_incremental
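As an alternative to the find-based approach above, GNU tar's `--listed-incremental` snapshot file tracks changed and deleted files natively. A sketch (function name and paths are illustrative):

```shell
#!/bin/bash
# Sketch: incremental backups via GNU tar's snapshot file, which records
# changes *and* deletions between runs -- something a plain find/-newer
# pipeline cannot capture. Function name and paths are illustrative.
backup_tree_incremental() {
    local src=$1 dest_dir=$2
    local snapshot="$dest_dir/snapshot.snar"
    local stamp
    stamp=$(date +%Y%m%d_%H%M%S)
    mkdir -p "$dest_dir"
    # First run writes a full (level 0) archive; later runs contain only
    # changes since the snapshot file was last updated.
    tar --listed-incremental="$snapshot" \
        -czf "$dest_dir/incr_${stamp}.tar.gz" "$src"
}

# Example: backup_tree_incremental /etc /backup/incremental
```

Restoring requires extracting the level-0 archive first, then each incremental in order.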

Database Backup Strategy

# MySQL backup with compression
backup_mysql() {
    local db_user="backup_user"
    local db_password="secure_password"
    local backup_dir="/backup/mysql"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/mysql_${timestamp}.sql.gz"
    
    mkdir -p "$backup_dir"
    
    mysqldump \
        -u "$db_user" \
        -p"$db_password" \
        --all-databases \
        --single-transaction \
        --quick \
        --lock-tables=false | \
        gzip > "$backup_file"
    
    # $? would report gzip's status; check mysqldump (first pipeline stage)
    if [ "${PIPESTATUS[0]}" -eq 0 ]; then
        echo "MySQL backup: $backup_file ($(ls -lh "$backup_file" | awk '{print $5}'))"
    else
        echo "Error: mysqldump failed" >&2
        return 1
    fi
}

backup_mysql

# PostgreSQL backup
backup_postgresql() {
    local backup_dir="/backup/postgresql"
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_file="$backup_dir/postgresql_${timestamp}.sql.gz"
    
    mkdir -p "$backup_dir"
    
    pg_dumpall -U postgres | gzip > "$backup_file"
    
    echo "PostgreSQL backup: $backup_file ($(ls -lh "$backup_file" | awk '{print $5}'))"
}

# Enable WAL archiving for point-in-time recovery
postgresql_setup_wal_archiving() {
    local wal_archive_dir="/backup/postgres_wal"
    mkdir -p "$wal_archive_dir"
    
    # Update postgresql.conf
    sed -i "s/#wal_level = .*/wal_level = replica/" /etc/postgresql/*/main/postgresql.conf
    sed -i "s|#archive_mode = .*|archive_mode = on|" /etc/postgresql/*/main/postgresql.conf
    sed -i "s|#archive_command = .*|archive_command = 'test ! -f $wal_archive_dir/%f \&\& cp %p $wal_archive_dir/%f'|" /etc/postgresql/*/main/postgresql.conf
    
    systemctl restart postgresql
}
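With WAL archiving in place, point-in-time recovery replays archived segments up to a chosen moment. A sketch of staging that recovery on PostgreSQL 12+, where restore settings live in postgresql.conf and an empty recovery.signal file triggers recovery on startup (the helper name, paths, and target time are examples):

```shell
#!/bin/bash
# Stage point-in-time recovery (PostgreSQL 12+): restore settings go in
# postgresql.conf, and an empty recovery.signal file tells the server to
# enter recovery on startup. Helper name, paths and target time are examples.
stage_pitr() {
    local pgdata=$1 wal_archive=$2 target_time=$3
    cat >> "$pgdata/postgresql.conf" << CONF
restore_command = 'cp $wal_archive/%f %p'
recovery_target_time = '$target_time'
CONF
    touch "$pgdata/recovery.signal"
}

# After restoring a base backup into the data directory:
# stage_pitr /var/lib/postgresql/14/main /backup/postgres_wal '2024-01-01 11:55:00'
# systemctl start postgresql   # replays archived WAL up to the target time
```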

Backup Verification

# Verify backup integrity
verify_backup() {
    local backup_file=$1
    
    if [ ! -f "$backup_file" ]; then
        echo "Error: Backup file not found: $backup_file"
        return 1
    fi
    
    # Check file size (must not be zero)
    if [ ! -s "$backup_file" ]; then
        echo "Error: Backup file is empty"
        return 1
    fi
    
    # Check if gzip file is valid
    if [[ "$backup_file" == *.gz ]]; then
        gzip -t "$backup_file" 2>/dev/null
        if [ $? -ne 0 ]; then
            echo "Error: Backup corruption detected"
            return 1
        fi
    fi
    
    # Check tar archive integrity
    if [[ "$backup_file" == *.tar.gz ]]; then
        tar -tzf "$backup_file" > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "Error: TAR archive is corrupted"
            return 1
        fi
    fi
    
    echo "Backup verification passed: $backup_file"
    return 0
}

# Test verification
verify_backup "/backup/full/full_production-server-01_20240101_120000.tar.gz"
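The gzip and tar checks catch broken framing but not every form of silent corruption; recording a SHA-256 checksum at backup time closes that gap. A small sketch (helper names are illustrative):

```shell
#!/bin/bash
# Record a SHA-256 checksum alongside each backup so later verification can
# detect silent bit rot, not just broken gzip framing. Helper names are
# illustrative; wire them into your own backup functions.
record_checksum() {
    sha256sum "$1" > "$1.sha256"
}

verify_checksum() {
    # Exit status 0 only if the stored and recomputed hashes match
    sha256sum -c "$1.sha256" > /dev/null 2>&1
}
```

Call record_checksum immediately after a backup completes and verify_checksum before trusting or restoring the archive.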

Recovery Procedures

Establish clear, tested procedures for recovering from various failure scenarios.

Full Server Recovery

# Full server recovery procedure
recover_full_server() {
    local backup_file=$1
    local recovery_mount="/mnt/recovery"
    
    echo "Starting full server recovery"
    echo "==============================="
    
    # 1. Boot into recovery mode (manual step)
    echo "Step 1: Boot into recovery/rescue environment"
    
    # 2. Prepare disk
    echo "Step 2: Prepare disk"
    # Format disk (example with /dev/sda)
    # parted /dev/sda mklabel gpt
    # parted /dev/sda mkpart primary ext4 1MiB 100%
    
    # 3. Create filesystem
    echo "Step 3: Create filesystem"
    # mkfs.ext4 /dev/sda1
    
    # 4. Mount filesystem
    mkdir -p "$recovery_mount"
    # mount /dev/sda1 "$recovery_mount"
    
    # 5. Extract backup
    echo "Step 4: Extracting backup (this may take a while)"
    # tar stores members without the leading '/', so patterns like './dev'
    # never match; /dev, /proc and /sys were excluded at backup time anyway
    tar -xzf "$backup_file" -C "$recovery_mount" --exclude='dev' --exclude='proc' --exclude='sys'
    
    # 6. Create essential directories
    mkdir -p "$recovery_mount"/{dev,proc,sys,run}
    
    # 7. Reinstall bootloader
    echo "Step 5: Reinstalling bootloader"
    # mount -B /dev "$recovery_mount/dev"
    # mount -t proc proc "$recovery_mount/proc"
    # mount -t sysfs sys "$recovery_mount/sys"
    # chroot "$recovery_mount" grub-install /dev/sda
    # chroot "$recovery_mount" update-grub
    
    echo "Full server recovery completed"
}

# Partial file recovery
recover_files() {
    local backup_file=$1
    local file_path=$2
    local recovery_dir="/tmp/recovery"
    
    mkdir -p "$recovery_dir"
    
    # Extract specific files from backup
    tar -xzf "$backup_file" -C "$recovery_dir" "$file_path" 2>/dev/null
    
    if [ $? -eq 0 ]; then
        echo "File recovered to: $recovery_dir/$file_path"
    fi
}

Database Recovery

# MySQL recovery from backup
recover_mysql() {
    local backup_file=$1
    local db_user="root"
    local db_password="secure_password"
    
    # Restore from backup
    if [[ "$backup_file" == *.gz ]]; then
        gunzip < "$backup_file" | mysql -u "$db_user" -p"$db_password"
    else
        mysql -u "$db_user" -p"$db_password" < "$backup_file"
    fi
    
    echo "MySQL recovery completed"
}

# PostgreSQL recovery
recover_postgresql() {
    local backup_file=$1
    
    if [[ "$backup_file" == *.gz ]]; then
        gunzip < "$backup_file" | psql -U postgres
    else
        psql -U postgres -f "$backup_file"
    fi
    
    echo "PostgreSQL recovery completed"
}

Testing Schedule

Regular testing ensures your backups work when needed and your procedures are current.

# Create automated backup testing schedule
cat > /etc/cron.d/backup-testing << 'EOF'
# Backup Testing Schedule
# Weekly full backup test on Sunday at 2 AM
0 2 * * 0 root /usr/local/bin/test-backup-restoration.sh >> /var/log/backup-test.log 2>&1

# Daily backup integrity check at 3 AM
0 3 * * * root /usr/local/bin/verify-backup-integrity.sh >> /var/log/backup-verify.log 2>&1

# Monthly disaster recovery drill on the first Sunday at 4 AM
# (cron ORs restricted day-of-month and day-of-week fields, so "1 * 0" would
# fire on the 1st of the month AND every Sunday; guard on the date instead)
0 4 * * 0 root [ "$(date +\%d)" -le 7 ] && /usr/local/bin/dr-drill.sh >> /var/log/dr-drill.log 2>&1
EOF

# Create backup testing script
cat > /usr/local/bin/test-backup-restoration.sh << 'EOF'
#!/bin/bash

BACKUP_DIR="/backup/full"
TEST_DIR="/tmp/backup-test"
LOG_FILE="/var/log/backup-restoration-test.log"

echo "[$(date)] Starting backup restoration test" >> "$LOG_FILE"

# Find most recent backup
LATEST_BACKUP=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)

if [ -z "$LATEST_BACKUP" ]; then
    echo "[$(date)] ERROR: No backup found" >> "$LOG_FILE"
    exit 1
fi

# Create test directory
rm -rf "$TEST_DIR"
mkdir -p "$TEST_DIR"

# Extract and verify
tar -tzf "$LATEST_BACKUP" > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "[$(date)] PASS: Backup integrity verified" >> "$LOG_FILE"
else
    echo "[$(date)] FAIL: Backup corrupted" >> "$LOG_FILE"
    exit 1
fi

# Test partial extraction (members are stored without the leading '/')
tar -xzf "$LATEST_BACKUP" -C "$TEST_DIR" etc --warning=no-file-changed 2>/dev/null
if [ $? -eq 0 ] && [ -d "$TEST_DIR/etc" ]; then
    echo "[$(date)] PASS: Partial restoration successful" >> "$LOG_FILE"
else
    echo "[$(date)] FAIL: Partial restoration failed" >> "$LOG_FILE"
fi

# Cleanup
rm -rf "$TEST_DIR"
echo "[$(date)] Backup test completed" >> "$LOG_FILE"
EOF

chmod +x /usr/local/bin/test-backup-restoration.sh

Documentation Template

Maintain comprehensive documentation for your disaster recovery plan.

# Disaster recovery plan documentation template
cat > /tmp/dr_plan_template.md << 'EOF'
# Disaster Recovery Plan - [Organization]

## Executive Summary
- RTO: [Time]
- RPO: [Time]
- Last Review: [Date]
- Next Review: [Date]

## Critical Systems
| System | RTO | RPO | Location | Backup Method |
|--------|-----|-----|----------|---------------|
| Web Server | 4h | 1h | Primary DC | Daily full + hourly incremental |
| Database | 2h | 15min | Primary DC | Continuous replication + WAL archiving |
| File Server | 8h | 4h | Primary DC | Daily incremental |

## Recovery Procedures
1. [System] Recovery Steps
   - Prerequisites
   - Detailed steps
   - Verification
   - Estimated recovery time

## Contact Information
- DR Coordinator: [Name] - [Contact]
- Technical Lead: [Name] - [Contact]
- Executive Sponsor: [Name] - [Contact]

## Test Results
- Last Test Date: [Date]
- Test Type: [Full/Partial]
- Result: [Pass/Fail]
- Issues Found: [List]

## Version History
| Date | Version | Changes | Reviewed By |
|------|---------|---------|-------------|
EOF

Implementation Best Practices

Automated Backup Management

# Automated backup rotation (keep 30 days)
backup_rotation() {
    local backup_dir=$1
    local retention_days=30
    
    find "$backup_dir" -type f -mtime +$retention_days -delete
    echo "Backup rotation completed. Retained last $retention_days days."
}

# Schedule backup rotation
(crontab -l 2>/dev/null; echo "0 5 * * * /usr/local/bin/backup-rotation.sh /backup/full") | crontab -
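One caveat with purely age-based deletion: if backup jobs silently stop, rotation eventually deletes every remaining archive. A variant (illustrative, not part of the rotation function above) that always preserves the newest N archives before applying the age limit:

```shell
#!/bin/bash
# Age-based rotation that refuses to delete the newest $keep_min archives,
# so a stalled backup job cannot leave you with zero copies.
# Function name and defaults are illustrative.
backup_rotation_safe() {
    local backup_dir=$1
    local retention_days=${2:-30}
    local keep_min=${3:-5}

    # Newest-first listing; skip (protect) the first $keep_min entries,
    # then delete only the remainder that exceed the retention window.
    ls -t "$backup_dir"/*.tar.gz 2>/dev/null | tail -n +$((keep_min + 1)) | \
    while read -r f; do
        find "$f" -mtime +"$retention_days" -delete
    done
}
```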

# Backup monitoring
monitor_backups() {
    local backup_dir="/backup/full"
    local alert_threshold=24  # hours
    
    for backup in "$backup_dir"/*.tar.gz; do
        [ -e "$backup" ] || continue
        # GNU stat: %Y is mtime in seconds since the epoch ("stat -f" means
        # filesystem status on Linux, so the BSD fallback idiom misfires here)
        local age=$(($(date +%s) - $(stat -c %Y "$backup")))
        local age_hours=$((age / 3600))
        
        if [ $age_hours -gt $alert_threshold ]; then
            echo "ALERT: Backup is $age_hours hours old: $backup"
        fi
    done
}

monitor_backups

Offsite Backup Replication

# Replicate backups to remote location
replicate_backups_offsite() {
    local source_dir="/backup/full"
    local remote_host="backup.remote-dc.com"
    local remote_user="backup"
    local remote_dir="/backups/offsite"
    
    # Using rsync for efficient transfer
    rsync -avz \
        --delete \
        --bwlimit=10240 \
        "$source_dir/" \
        "$remote_user@$remote_host:$remote_dir"
    
    if [ $? -eq 0 ]; then
        echo "Offsite backup replication completed"
    else
        echo "ERROR: Offsite backup replication failed"
    fi
}

# Schedule daily offsite replication
(crontab -l 2>/dev/null; echo "0 22 * * * /usr/local/bin/replicate-backups-offsite.sh") | crontab -

Monitoring and Alerting

# Create backup monitoring alerts
cat > /usr/local/bin/backup-health-check.sh << 'EOF'
#!/bin/bash

BACKUP_DIR="/backup/full"
ALERT_EMAIL="[email protected]"
CRITICAL_THRESHOLD=48  # hours

check_backup_currency() {
    local latest_backup=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)
    
    if [ -z "$latest_backup" ]; then
        send_alert "CRITICAL: No backups found in $BACKUP_DIR"
        return 1
    fi
    
    local backup_age=$(($(date +%s) - $(stat -c%Y "$latest_backup" 2>/dev/null)))
    local backup_age_hours=$((backup_age / 3600))
    
    if [ $backup_age_hours -gt $CRITICAL_THRESHOLD ]; then
        send_alert "CRITICAL: Backup is $backup_age_hours hours old (threshold: $CRITICAL_THRESHOLD hours)"
        return 1
    fi
}

check_backup_size() {
    local latest_backup=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)
    [ -z "$latest_backup" ] && return 1
    local size=$(stat -c%s "$latest_backup" 2>/dev/null)
    local size_mb=$((size / 1048576))
    
    if [ "$size_mb" -lt 100 ]; then
        send_alert "WARNING: Backup is suspiciously small: ${size_mb}MB"
    fi
}

send_alert() {
    local message=$1
    echo "$message" | mail -s "Backup Alert" "$ALERT_EMAIL"
}

check_backup_currency
check_backup_size
EOF

chmod +x /usr/local/bin/backup-health-check.sh

# Schedule health checks
(crontab -l 2>/dev/null; echo "0 * * * * /usr/local/bin/backup-health-check.sh") | crontab -

Conclusion

A robust disaster recovery plan requires:

  1. Clear Objectives: Define RTO and RPO for each critical system
  2. Regular Backups: Implement 3-2-1 strategy with multiple backup types
  3. Testing: Regular restoration tests and DR drills prove procedures work
  4. Documentation: Maintain current, detailed recovery procedures
  5. Monitoring: Automated health checks and alerting
  6. Review: Quarterly updates as infrastructure changes

Start small with critical systems, then expand coverage. Regular testing is the only way to ensure your DR plan works when disaster strikes. Remember: untested backups are just expensive storage.