Disaster Recovery Plan for Linux Servers
A comprehensive disaster recovery (DR) plan is essential for any organization relying on Linux infrastructure. This guide provides a complete framework for developing, implementing, and maintaining an effective disaster recovery strategy that minimizes downtime and data loss while ensuring business continuity.
Table of Contents
- Understanding RTO and RPO
- Conducting a Risk Assessment
- Backup Strategies
- Recovery Procedures
- Testing Schedule
- Documentation Template
- Implementation Best Practices
- Monitoring and Alerting
- Conclusion
Understanding RTO and RPO
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are foundational concepts that define your disaster recovery requirements.
RTO represents the maximum acceptable downtime before business impact becomes critical. For example, an RTO of 4 hours means your services must be restored within that timeframe.
RPO defines the maximum acceptable data loss. An RPO of 1 hour means you can tolerate losing up to 1 hour of data.
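A quick way to sanity-check a backup schedule against an RPO: with interval-based backups, the worst-case data loss is roughly one full backup interval, so an interval meets an RPO only if it is no longer than the RPO. A minimal sketch (the function name and numbers are illustrative):

```shell
#!/bin/bash
# Worst-case data loss with interval-based backups: a failure just before
# the next backup loses almost one full interval of data, so the interval
# must be <= the RPO target.
meets_rpo() {
    local interval_min=$1 rpo_min=$2
    if [ "$interval_min" -le "$rpo_min" ]; then
        echo "OK: ${interval_min}min backups meet a ${rpo_min}min RPO"
    else
        echo "FAIL: ${interval_min}min backups exceed a ${rpo_min}min RPO"
    fi
}
meets_rpo 60 60   # hourly backups against a 1-hour RPO
meets_rpo 60 15   # hourly backups against a 15-minute RPO
```

A 15-minute RPO, as assigned to critical services above, therefore forces either sub-hourly backups or continuous replication.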
# Calculate RTO and RPO for your infrastructure
cat > /tmp/rto_rpo_calculator.sh << 'EOF'
#!/bin/bash
# Define criticality levels
declare -A criticality_rto
declare -A criticality_rpo
criticality_rto[critical]="1h"
criticality_rto[high]="4h"
criticality_rto[medium]="8h"
criticality_rto[low]="24h"
criticality_rpo[critical]="15min"
criticality_rpo[high]="1h"
criticality_rpo[medium]="4h"
criticality_rpo[low]="24h"
# Function to determine service criticality
determine_criticality() {
local service=$1
local business_impact=$2
if [[ "$business_impact" == *"critical"* ]]; then
echo "critical"
elif [[ "$business_impact" == *"significant"* ]]; then
echo "high"
elif [[ "$business_impact" == *"moderate"* ]]; then
echo "medium"
else
echo "low"
fi
}
# Generate RTO/RPO matrix from example services (names and impacts are illustrative)
generate_dr_matrix() {
echo "Service Criticality Analysis"
echo "============================"
echo "Service | Criticality | RTO | RPO"
echo "--------|-------------|-----|----"
while IFS='|' read -r service impact; do
level=$(determine_criticality "$service" "$impact")
echo "$service | $level | ${criticality_rto[$level]} | ${criticality_rpo[$level]}"
done << 'SERVICES'
payment-api|critical revenue impact
web-frontend|significant customer impact
internal-wiki|moderate productivity impact
SERVICES
}
generate_dr_matrix
EOF
chmod +x /tmp/rto_rpo_calculator.sh
Conducting a Risk Assessment
A thorough risk assessment identifies potential threats and their impact on your infrastructure.
# Risk Assessment Framework
cat > /tmp/risk_assessment.md << 'EOF'
# Risk Assessment Template
## Infrastructure Assets
- [ ] Web servers
- [ ] Database servers
- [ ] Mail servers
- [ ] File servers
- [ ] Network infrastructure
- [ ] Storage systems
## Potential Threats
- [ ] Hardware failure (disk, memory, power supply)
- [ ] Data corruption
- [ ] Cyber attacks and ransomware
- [ ] Network outages
- [ ] Natural disasters
- [ ] Human error
- [ ] Software bugs
- [ ] Misconfiguration
## Impact Analysis
For each asset:
1. Identify critical functions
2. Estimate revenue impact per hour of downtime
3. Calculate recovery cost
4. Determine dependencies on other systems
EOF
cat /tmp/risk_assessment.md
Create a risk matrix to prioritize mitigation efforts:
# Create risk matrix analysis
cat > /tmp/risk_matrix.sh << 'EOF'
#!/bin/bash
# Risk = Probability x Impact
# Probability: 1-5 (1=unlikely, 5=very likely)
# Impact: 1-5 (1=minimal, 5=catastrophic)
declare -A risks=(
["disk_failure"]="4x5=20"
["ransomware"]="3x5=15"
["network_outage"]="2x4=8"
["human_error"]="4x3=12"
["power_failure"]="2x4=8"
["data_corruption"]="3x4=12"
)
echo "Risk Assessment Matrix"
echo "====================="
printf "%-25s %-15s %-10s\n" "Risk Type" "Score" "Priority"
echo "---------------------------------------------"
for risk in $(printf '%s\n' "${!risks[@]}" | sort); do
score=${risks[$risk]}
value=${score#*=}  # numeric score after the "=", e.g. "4x5=20" -> 20
if [ "$value" -gt 15 ]; then priority="Critical"
elif [ "$value" -ge 10 ]; then priority="High"
else priority="Medium"
fi
printf "%-25s %-15s %-10s\n" "$risk" "$score" "$priority"
done
EOF
chmod +x /tmp/risk_matrix.sh
bash /tmp/risk_matrix.sh
Backup Strategies
Implement a multi-layered backup strategy using the 3-2-1 rule: 3 copies of data, on 2 different media, with 1 offsite.
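The 3-2-1 criteria can be expressed as a rule-of-thumb check; a minimal sketch with hypothetical copy counts (wiring it to a real backup inventory is left to your environment):

```shell
#!/bin/bash
# Evaluate the 3-2-1 rule from copy counts:
#   copies  - total copies of the data (including production)
#   media   - number of distinct media/storage types
#   offsite - number of copies held offsite
check_321() {
    local copies=$1 media=$2 offsite=$3
    if [ "$copies" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]; then
        echo "COMPLIANT"
    else
        echo "NOT COMPLIANT: need >=3 copies, >=2 media, >=1 offsite (got $copies/$media/$offsite)"
    fi
}
check_321 3 2 1   # classic 3-2-1
check_321 2 1 0   # a single local backup only
```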
Full Backup Strategy
# Full backup of entire server
backup_full() {
local server=$1
local backup_dir="/backup/full"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_file="$backup_dir/full_${server}_${timestamp}.tar.gz"
mkdir -p "$backup_dir"
# Exclude unnecessary files
tar --exclude='/proc' \
--exclude='/sys' \
--exclude='/dev' \
--exclude='/tmp' \
--exclude='/var/tmp' \
--exclude='/var/log' \
--exclude='/var/cache' \
--exclude='/backup' \
--exclude='*.swp' \
--warning=no-file-changed \
-czf "$backup_file" / 2>/dev/null
local status=$?
# tar exits 1 when files changed while being read, which is normal on a
# live system; treat only higher exit codes as failure
if [ $status -eq 0 ] || [ $status -eq 1 ]; then
echo "Full backup completed: $backup_file"
ls -lh "$backup_file"
else
echo "Error: Full backup failed (tar exit code $status)"
return 1
fi
}
# Execute full backup
backup_full "production-server-01"
Incremental Backup Strategy
# Incremental backup using find and tar
backup_incremental() {
local backup_dir="/backup/incremental"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_file="$backup_dir/incremental_${timestamp}.tar.gz"
local last_backup="/var/lib/backup/.last_backup"
mkdir -p "$backup_dir" "$(dirname "$last_backup")"
# Record the start time first, so files modified while this backup runs
# are caught by the next run instead of being skipped
local new_marker="${last_backup}.new"
touch "$new_marker"
# Create file listing for incremental backup
if [ -f "$last_backup" ]; then
# Backup only files modified since last backup; NUL separators keep
# filenames with spaces or newlines intact through the pipe
find / -type f -newer "$last_backup" \
-not -path '/proc/*' \
-not -path '/sys/*' \
-not -path '/dev/*' \
-not -path '/tmp/*' \
-not -path '/backup/*' -print0 | \
tar --null -czf "$backup_file" -T - 2>/dev/null
fi
mv "$new_marker" "$last_backup"
echo "Incremental backup: $backup_file"
}
backup_incremental
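Restoring from incrementals means extracting the last full backup first, then replaying each incremental oldest-first so newer file versions overwrite older ones. A sketch, assuming the timestamped filenames produced above (the function name and directory layout are ours):

```shell
#!/bin/bash
# Restore a full backup, then replay incrementals oldest-first so newer
# file versions win. The timestamped names sort chronologically, so the
# shell's sorted glob order is already the correct replay order.
restore_with_incrementals() {
    local full_backup=$1 incr_dir=$2 target=$3
    mkdir -p "$target"
    tar -xzf "$full_backup" -C "$target"
    for incr in "$incr_dir"/incremental_*.tar.gz; do
        [ -e "$incr" ] || continue   # skip if the glob matched nothing
        echo "Applying $incr"
        tar -xzf "$incr" -C "$target"
    done
}
```

Note this scheme cannot replay deletions: a file removed between backups reappears after restore, which is one reason tools like `rsync --link-dest` or tar's own `--listed-incremental` mode are often preferred.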
Database Backup Strategy
# MySQL backup with compression
backup_mysql() {
local db_user="backup_user"
local db_password="secure_password"
local backup_dir="/backup/mysql"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_file="$backup_dir/mysql_${timestamp}.sql.gz"
mkdir -p "$backup_dir"
# Prefer credentials in an option file (e.g. --defaults-extra-file);
# a password on the command line is visible to other users via ps
mysqldump \
-u "$db_user" \
-p"$db_password" \
--all-databases \
--single-transaction \
--quick \
--skip-lock-tables | \
gzip > "$backup_file"
# $? alone would report gzip's status; check mysqldump's via PIPESTATUS
if [ "${PIPESTATUS[0]}" -eq 0 ]; then
echo "MySQL backup: $backup_file ($(ls -lh "$backup_file" | awk '{print $5}'))"
fi
}
backup_mysql
# PostgreSQL backup
backup_postgresql() {
local backup_dir="/backup/postgresql"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_file="$backup_dir/postgresql_${timestamp}.sql.gz"
mkdir -p "$backup_dir"
# Run as the postgres superuser so every database is readable
sudo -u postgres pg_dumpall | gzip > "$backup_file"
echo "PostgreSQL backup: $backup_file ($(ls -lh "$backup_file" | awk '{print $5}'))"
}
# Enable WAL archiving for point-in-time recovery
postgresql_setup_wal_archiving() {
local wal_archive_dir="/backup/postgres_wal"
mkdir -p "$wal_archive_dir"
# Update postgresql.conf (these patterns assume the stock commented-out
# defaults; adjust manually if the settings were already changed)
sed -i "s/#wal_level = .*/wal_level = replica/" /etc/postgresql/*/main/postgresql.conf
sed -i "s|#archive_mode = .*|archive_mode = on|" /etc/postgresql/*/main/postgresql.conf
sed -i "s|#archive_command = .*|archive_command = 'test ! -f $wal_archive_dir/%f \&\& cp %p $wal_archive_dir/%f'|" /etc/postgresql/*/main/postgresql.conf
# Changing archive_mode requires a full restart, not just a reload
systemctl restart postgresql
}
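Once WAL archiving is running, point-in-time recovery restores a base backup and replays archived WAL up to a chosen target. On PostgreSQL 12+ the recovery settings look roughly like this (the paths and timestamp are placeholders for your environment):

```ini
# postgresql.auto.conf (or postgresql.conf) on the restored data directory
# restore_command copies archived WAL segments back during recovery
restore_command = 'cp /backup/postgres_wal/%f "%p"'
# Stop WAL replay at this moment, then promote to normal operation
recovery_target_time = '2024-01-01 12:00:00'
recovery_target_action = 'promote'
```

An empty `recovery.signal` file in the data directory tells PostgreSQL to enter targeted recovery on startup.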
Backup Verification
# Verify backup integrity
verify_backup() {
local backup_file=$1
if [ ! -f "$backup_file" ]; then
echo "Error: Backup file not found: $backup_file"
return 1
fi
# Check file size (must not be zero)
if [ ! -s "$backup_file" ]; then
echo "Error: Backup file is empty"
return 1
fi
# Check if gzip file is valid
if [[ "$backup_file" == *.gz ]]; then
gzip -t "$backup_file" 2>/dev/null
if [ $? -ne 0 ]; then
echo "Error: Backup corruption detected"
return 1
fi
fi
# Check tar archive integrity
if [[ "$backup_file" == *.tar.gz ]]; then
tar -tzf "$backup_file" > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "Error: TAR archive is corrupted"
return 1
fi
fi
echo "Backup verification passed: $backup_file"
return 0
}
# Test verification
verify_backup "/backup/full/full_production-server-01_20240101_120000.tar.gz"
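Structural checks catch truncation, but a checksum manifest recorded at backup time also catches silent bit rot later, before the archive structure itself fails. A minimal sketch (the helper names are ours):

```shell
#!/bin/bash
# Record a SHA-256 checksum alongside each backup at creation time, then
# re-verify it before relying on the backup. sha256sum writes the path
# into the manifest, so record with the same path you will verify with.
record_checksum() {
    sha256sum "$1" > "$1.sha256"
}
verify_checksum() {
    sha256sum --quiet -c "$1.sha256" && echo "Checksum OK: $1"
}
```

Call `record_checksum` at the end of each backup function and `verify_checksum` before any restore.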
Recovery Procedures
Establish clear, tested procedures for recovering from various failure scenarios.
Full Server Recovery
# Full server recovery procedure
recover_full_server() {
local backup_file=$1
local recovery_mount="/mnt/recovery"
echo "Starting full server recovery"
echo "==============================="
# 1. Boot into recovery mode (manual step)
echo "Step 1: Boot into recovery/rescue environment"
# 2. Prepare disk
echo "Step 2: Prepare disk"
# Format disk (example with /dev/sda)
# parted /dev/sda mklabel gpt
# parted /dev/sda mkpart primary ext4 1MiB 100%
# 3. Create filesystem
echo "Step 3: Create filesystem"
# mkfs.ext4 /dev/sda1
# 4. Mount filesystem
mkdir -p "$recovery_mount"
# mount /dev/sda1 "$recovery_mount"
# 5. Extract backup
echo "Step 5: Extracting backup (this may take a while)"
# Archive members have no leading slash or "./", so exclude 'dev', not './dev'
tar -xzf "$backup_file" -C "$recovery_mount" --exclude='dev' --exclude='proc' --exclude='sys'
# 6. Create essential directories
mkdir -p "$recovery_mount"/{dev,proc,sys,run}
# 7. Reinstall bootloader
echo "Step 7: Reinstalling bootloader"
# mount -B /dev "$recovery_mount/dev"
# mount -t proc proc "$recovery_mount/proc"
# mount -t sysfs sys "$recovery_mount/sys"
# chroot "$recovery_mount" grub-install /dev/sda
# chroot "$recovery_mount" update-grub
echo "Full server recovery completed"
}
# Partial file recovery
recover_files() {
local backup_file=$1
local file_path=$2
local recovery_dir="/tmp/recovery"
mkdir -p "$recovery_dir"
# Extract specific files from backup; member paths have no leading slash,
# so pass e.g. "etc/nginx/nginx.conf" rather than "/etc/nginx/nginx.conf"
tar -xzf "$backup_file" -C "$recovery_dir" "$file_path" 2>/dev/null
if [ $? -eq 0 ]; then
echo "File recovered to: $recovery_dir/$file_path"
else
echo "Error: could not extract $file_path from $backup_file"
return 1
fi
}
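Because members are stored without a leading slash, it helps to list the archive and find the exact stored path before extracting. A small helper (the function name is ours):

```shell
#!/bin/bash
# List archive members matching a pattern, so you can confirm the exact
# stored path (archives made from "/" store paths without the leading slash)
find_in_backup() {
    local backup_file=$1 pattern=$2
    tar -tzf "$backup_file" | grep -- "$pattern"
}
```

For example, `find_in_backup /backup/full/latest.tar.gz 'nginx.conf'` shows every stored path containing that name, ready to pass to `recover_files`.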
Database Recovery
# MySQL recovery from backup
recover_mysql() {
local backup_file=$1
local db_user="root"
local db_password="secure_password"
# Restore from backup
if [[ "$backup_file" == *.gz ]]; then
gunzip < "$backup_file" | mysql -u "$db_user" -p"$db_password"
else
mysql -u "$db_user" -p"$db_password" < "$backup_file"
fi
echo "MySQL recovery completed"
}
# PostgreSQL recovery
recover_postgresql() {
local backup_file=$1
if [[ "$backup_file" == *.gz ]]; then
gunzip < "$backup_file" | psql -U postgres
else
psql -U postgres -f "$backup_file"
fi
echo "PostgreSQL recovery completed"
}
Testing Schedule
Regular testing ensures your backups work when needed and your procedures are current.
# Create automated backup testing schedule
cat > /etc/cron.d/backup-testing << 'EOF'
# Backup Testing Schedule
# Weekly full backup test on Sunday at 2 AM
0 2 * * 0 root /usr/local/bin/test-backup-restoration.sh >> /var/log/backup-test.log 2>&1
# Daily backup integrity check at 3 AM
0 3 * * * root /usr/local/bin/verify-backup-integrity.sh >> /var/log/backup-verify.log 2>&1
# Monthly disaster recovery drill on the first Sunday at 4 AM.
# cron ORs day-of-month and day-of-week when both are restricted, so
# "0 4 1 * 0" would fire on the 1st AND every Sunday; gate on the date instead
0 4 * * 0 root [ "$(date +\%d)" -le 07 ] && /usr/local/bin/dr-drill.sh >> /var/log/dr-drill.log 2>&1
EOF
# Create backup testing script
cat > /usr/local/bin/test-backup-restoration.sh << 'EOF'
#!/bin/bash
BACKUP_DIR="/backup/full"
TEST_DIR="/tmp/backup-test"
LOG_FILE="/var/log/backup-restoration-test.log"
echo "[$(date)] Starting backup restoration test" >> "$LOG_FILE"
# Find most recent backup
LATEST_BACKUP=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ]; then
echo "[$(date)] ERROR: No backup found" >> "$LOG_FILE"
exit 1
fi
# Create test directory
rm -rf "$TEST_DIR"
mkdir -p "$TEST_DIR"
# Extract and verify
tar -tzf "$LATEST_BACKUP" > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "[$(date)] PASS: Backup integrity verified" >> "$LOG_FILE"
else
echo "[$(date)] FAIL: Backup corrupted" >> "$LOG_FILE"
exit 1
fi
# Test partial extraction
# Members were archived without a leading slash or "./", so request "etc" as stored
tar -xzf "$LATEST_BACKUP" -C "$TEST_DIR" etc --warning=no-file-changed 2>/dev/null
if [ $? -eq 0 ] && [ -d "$TEST_DIR/etc" ]; then
echo "[$(date)] PASS: Partial restoration successful" >> "$LOG_FILE"
else
echo "[$(date)] FAIL: Partial restoration failed" >> "$LOG_FILE"
fi
# Cleanup
rm -rf "$TEST_DIR"
echo "[$(date)] Backup test completed" >> "$LOG_FILE"
EOF
chmod +x /usr/local/bin/test-backup-restoration.sh
Documentation Template
Maintain comprehensive documentation for your disaster recovery plan.
# Save the plan template to a working file (the path is an example)
cat > /tmp/dr_plan_template.md << 'EOF'
# Disaster Recovery Plan - [Organization]
## Executive Summary
- RTO: [Time]
- RPO: [Time]
- Last Review: [Date]
- Next Review: [Date]
## Critical Systems
| System | RTO | RPO | Location | Backup Method |
|--------|-----|-----|----------|---------------|
| Web Server | 4h | 1h | Primary DC | Daily full + hourly incremental |
| Database | 2h | 15min | Primary DC | Continuous replication + WAL archiving |
| File Server | 8h | 4h | Primary DC | Daily incremental |
## Recovery Procedures
1. [System] Recovery Steps
- Prerequisites
- Detailed steps
- Verification
- Estimated recovery time
## Contact Information
- DR Coordinator: [Name] - [Contact]
- Technical Lead: [Name] - [Contact]
- Executive Sponsor: [Name] - [Contact]
## Test Results
- Last Test Date: [Date]
- Test Type: [Full/Partial]
- Result: [Pass/Fail]
- Issues Found: [List]
## Version History
| Date | Version | Changes | Reviewed By |
|------|---------|---------|-------------|
EOF
Implementation Best Practices
Automated Backup Management
# Automated backup rotation (keep 30 days)
backup_rotation() {
local backup_dir=$1
local retention_days=30
find "$backup_dir" -type f -mtime +$retention_days -delete
echo "Backup rotation completed. Retained last $retention_days days."
}
# Schedule backup rotation
(crontab -l 2>/dev/null; echo "0 5 * * * /usr/local/bin/backup-rotation.sh /backup/full") | crontab -
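Pure age-based deletion can leave you with zero backups if the backup job silently stops producing new archives. A variant that always retains the newest N files regardless of age (N=5 here is an example):

```shell
#!/bin/bash
# Delete old backups but always keep the newest $keep archives, even if
# they are past the age cutoff -- protects against a stalled backup job
# deleting its own last good copies. (Assumes filenames without newlines.)
rotate_keep_minimum() {
    local backup_dir=$1 retention_days=${2:-30} keep=${3:-5}
    # List newest-first, skip the newest $keep, then delete only files
    # that are also past the retention window
    ls -t "$backup_dir"/*.tar.gz 2>/dev/null | tail -n +$((keep + 1)) | \
    while read -r f; do
        if [ -n "$(find "$f" -mtime +"$retention_days" 2>/dev/null)" ]; then
            rm -f "$f"
        fi
    done
}
```

Swapping this in for the plain `find -delete` rotation keeps the retention policy while removing its failure mode.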
# Backup monitoring
monitor_backups() {
local backup_dir="/backup/full"
local alert_threshold=24 # hours
for backup in "$backup_dir"/*.tar.gz; do
# Skip the literal pattern when the glob matches nothing
[ -e "$backup" ] || continue
local age=$(($(date +%s) - $(stat -f%m "$backup" 2>/dev/null || stat -c%Y "$backup" 2>/dev/null)))
local age_hours=$((age / 3600))
if [ $age_hours -gt $alert_threshold ]; then
echo "ALERT: Backup is $age_hours hours old: $backup"
fi
done
}
monitor_backups
Offsite Backup Replication
# Replicate backups to remote location
replicate_backups_offsite() {
local source_dir="/backup/full"
local remote_host="backup.remote-dc.com"
local remote_user="backup"
local remote_dir="/backups/offsite"
# Using rsync for efficient transfer
rsync -avz \
--delete \
--bwlimit=10240 \
"$source_dir/" \
"$remote_user@$remote_host:$remote_dir"
if [ $? -eq 0 ]; then
echo "Offsite backup replication completed"
else
echo "ERROR: Offsite backup replication failed"
fi
}
# Schedule daily offsite replication
(crontab -l 2>/dev/null; echo "0 22 * * * /usr/local/bin/replicate-backups-offsite.sh") | crontab -
Monitoring and Alerting
# Create backup monitoring alerts
cat > /usr/local/bin/backup-health-check.sh << 'EOF'
#!/bin/bash
BACKUP_DIR="/backup/full"
ALERT_EMAIL="[email protected]"
CRITICAL_THRESHOLD=48 # hours
check_backup_currency() {
local latest_backup=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)
if [ -z "$latest_backup" ]; then
send_alert "CRITICAL: No backups found in $BACKUP_DIR"
return 1
fi
local backup_age=$(($(date +%s) - $(stat -c%Y "$latest_backup" 2>/dev/null)))
local backup_age_hours=$((backup_age / 3600))
if [ $backup_age_hours -gt $CRITICAL_THRESHOLD ]; then
send_alert "CRITICAL: Backup is $backup_age_hours hours old (threshold: $CRITICAL_THRESHOLD hours)"
return 1
fi
}
check_backup_size() {
local latest_backup=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -1)
# A missing backup is already alerted on by check_backup_currency
[ -z "$latest_backup" ] && return 1
local size=$(stat -c%s "$latest_backup" 2>/dev/null)
local size_mb=$(( ${size:-0} / 1048576 ))
if [ $size_mb -lt 100 ]; then
send_alert "WARNING: Backup is suspiciously small: ${size_mb}MB"
fi
}
send_alert() {
local message=$1
echo "$message" | mail -s "Backup Alert" "$ALERT_EMAIL"
}
check_backup_currency
check_backup_size
EOF
chmod +x /usr/local/bin/backup-health-check.sh
# Schedule health checks
(crontab -l 2>/dev/null; echo "0 * * * * /usr/local/bin/backup-health-check.sh") | crontab -
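If Prometheus and node_exporter are part of your stack, the same freshness check can be published as a metric through the textfile collector instead of relying on email alone. A sketch (the metric name and paths are ours; the textfile directory is whatever you set with --collector.textfile.directory):

```shell
#!/bin/bash
# Write the newest backup's age in Prometheus textfile format;
# node_exporter picks up *.prom files from its textfile directory.
export_backup_metrics() {
    local backup_dir=$1 textfile_dir=$2
    local latest=$(ls -t "$backup_dir"/*.tar.gz 2>/dev/null | head -1)
    local now=$(date +%s)
    local mtime=0   # no backup at all yields a huge age, which alerts
    [ -n "$latest" ] && mtime=$(stat -c%Y "$latest")
    {
        echo "# HELP backup_last_success_age_seconds Age of newest backup archive"
        echo "# TYPE backup_last_success_age_seconds gauge"
        echo "backup_last_success_age_seconds $((now - mtime))"
    } > "$textfile_dir/backup.prom"
}
```

An alerting rule on `backup_last_success_age_seconds > 172800` then mirrors the 48-hour threshold above.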
Conclusion
A robust disaster recovery plan requires:
- Clear Objectives: Define RTO and RPO for each critical system
- Regular Backups: Implement 3-2-1 strategy with multiple backup types
- Testing: Scheduled integrity checks, restoration tests, and DR drills prove procedures work
- Documentation: Maintain current, detailed recovery procedures
- Monitoring: Automated health checks and alerting
- Review: Quarterly updates as infrastructure changes
Start small with critical systems, then expand coverage. Regular testing is the only way to ensure your DR plan works when disaster strikes. Remember: untested backups are just expensive storage.


