Automated Disaster Recovery Testing

Automated disaster recovery testing ensures your backup and recovery procedures are effective and remain current as infrastructure evolves. This guide covers comprehensive test scenarios, restoration validation scripts, automated scheduling, CI/CD integration, and detailed reporting mechanisms.

Table of Contents

  1. DR Testing Strategy
  2. Test Scenarios
  3. Backup Restoration Scripts
  4. Validation Checks
  5. Scheduling Automated Tests
  6. CI/CD Integration
  7. Reporting and Metrics
  8. Lessons Learned Process
  9. Conclusion

DR Testing Strategy

Define Testing Objectives

# DR testing framework
cat > /tmp/dr-testing-framework.md << 'EOF'
# DR Testing Framework

## Testing Levels

### Level 1: Backup Verification (Weekly)
- Verify backup files exist and have valid size
- Test backup integrity (checksum validation)
- Confirm backups are stored in multiple locations
- Automated, low resource requirement

### Level 2: Partial Restoration (Monthly)
- Restore individual files/databases to test environment
- Verify file integrity matches source
- Test database consistency after restore
- Requires test environment, ~2 hours

### Level 3: Full System Recovery (Quarterly)
- Complete recovery of entire system to spare hardware
- Verify all services start correctly
- Run full application test suite
- Requires dedicated resources, ~4-8 hours

### Level 4: Production Failover Drill (Annual)
- Actual failover to disaster recovery site
- Redirect real traffic briefly
- Measure actual RTO/RPO
- High risk, requires careful planning

## Success Metrics
- RTO: Actual vs Target
- RPO: Data loss measured
- Service availability post-recovery
- Data integrity verification results
- Application functionality tests

EOF

cat /tmp/dr-testing-framework.md
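The Level 1 checks above are cheap enough to run on every backup. A minimal sketch, assuming a checksum manifest written at backup time (the `verify_backup_level1` name and `.sha256` convention are illustrative, not part of the framework):

```shell
# Level 1 check: backup exists, is non-empty, and matches its recorded checksum.
# The manifest is assumed to be written at backup time, e.g.:
#   sha256sum backup.tar.gz > backup.tar.gz.sha256
verify_backup_level1() {
    local backup_file=$1
    local manifest="${backup_file}.sha256"

    [ -f "$backup_file" ] || { echo "FAIL: missing $backup_file"; return 1; }
    [ -s "$backup_file" ] || { echo "FAIL: empty $backup_file"; return 1; }

    if [ -f "$manifest" ]; then
        # --status: report via exit code only, no per-file output
        if sha256sum --status -c "$manifest"; then
            echo "PASS: $backup_file"
        else
            echo "FAIL: checksum mismatch for $backup_file"
            return 1
        fi
    else
        echo "WARN: no checksum manifest for $backup_file"
    fi
}
```

Run from the directory the manifest's paths are relative to, since `sha256sum -c` resolves them against the current directory.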

Test Scenarios

Scenario 1: Database Restoration Test

# Complete database restoration test
test_database_restoration() {
    local backup_file=$1
    local test_database="test_restore_$$"
    local test_user="test_user"
    local test_password="test_password_$$"
    
    echo "Starting database restoration test"
    echo "Test DB: $test_database"
    
    # 1. Create test database
    mysql -u root << EOF
CREATE DATABASE $test_database;
CREATE USER '$test_user'@'localhost' IDENTIFIED BY '$test_password';
GRANT ALL PRIVILEGES ON $test_database.* TO '$test_user'@'localhost';
EOF
    
    # 2. Restore backup (pipefail so a failed gunzip is not masked by mysql's exit code)
    set -o pipefail
    if [[ "$backup_file" == *.gz ]]; then
        gunzip < "$backup_file" | mysql -u root "$test_database"
    else
        mysql -u root "$test_database" < "$backup_file"
    fi
    local restore_status=$?
    set +o pipefail
    
    if [ $restore_status -ne 0 ]; then
        echo "ERROR: Database restoration failed"
        mysql -u root -e "DROP DATABASE $test_database; DROP USER '$test_user'@'localhost';"
        return 1
    fi
    
    # 3. Validate restoration
    validate_restored_database "$test_database" "$test_user" "$test_password"
    
    # 4. Cleanup
    mysql -u root << EOF
DROP DATABASE $test_database;
DROP USER '$test_user'@'localhost';
EOF
    
    echo "Database restoration test completed"
}

# Validate restored database
validate_restored_database() {
    local test_db=$1
    local test_user=$2
    local test_pass=$3
    
    echo "Validating restored database: $test_db"
    
    # Check table count
    local table_count=$(mysql -u "$test_user" -p"$test_pass" "$test_db" -sNe \
        "SELECT COUNT(*) FROM information_schema.TABLES WHERE TABLE_SCHEMA='$test_db';")
    echo "✓ Tables restored: $table_count"
    
    # Check row counts for critical tables
    mysql -u "$test_user" -p"$test_pass" "$test_db" << 'SQL'
SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES 
WHERE TABLE_SCHEMA=DATABASE() ORDER BY TABLE_ROWS DESC LIMIT 10;
SQL
    
    # Run integrity check (scoped to the test schema; without the WHERE clause
    # the subquery would pull table names from every schema the user can see)
    mysql -u "$test_user" -p"$test_pass" "$test_db" -sNe \
        "CHECK TABLE $(mysql -u "$test_user" -p"$test_pass" "$test_db" -sNe \
        "SELECT GROUP_CONCAT(TABLE_NAME) FROM information_schema.TABLES WHERE TABLE_SCHEMA='$test_db';");"
}
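Table counts alone can hide truncated data. A hedged sketch of a per-table row-count comparison, assuming you dump `TABLE_NAME<TAB>TABLE_ROWS` lists from the source and restored databases (the helper name is hypothetical; note `TABLE_ROWS` is an estimate for InnoDB, so prefer `SELECT COUNT(*)` per table where exact counts matter):

```shell
# Compare per-table row counts between source and restored databases.
# Each input file holds "TABLE_NAME<TAB>rows" lines, e.g. produced by:
#   mysql ... -sNe "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES
#                   WHERE TABLE_SCHEMA='mydb';" > source-counts.txt
compare_row_counts() {
    local source_counts=$1
    local restored_counts=$2
    local mismatches

    # Sort both sides so diff compares table-by-table regardless of order
    mismatches=$(diff <(sort "$source_counts") <(sort "$restored_counts") | grep -c '^[<>]')

    if [ "$mismatches" -eq 0 ]; then
        echo "PASS: row counts match"
    else
        echo "FAIL: $mismatches row-count differences"
        diff <(sort "$source_counts") <(sort "$restored_counts")
        return 1
    fi
}
```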

Scenario 2: File System Restoration Test

# File system restoration test
test_filesystem_restoration() {
    local backup_file=$1
    local test_mount="/mnt/restore-test"
    local test_size_gb=50
    
    echo "Starting filesystem restoration test"
    
    # 1. Create temporary test filesystem
    mkdir -p "$test_mount"
    
    # Create a sparse test disk image (instant; avoids writing tens of GB of zeros)
    truncate -s "${test_size_gb}G" /tmp/test-disk.img
    
    # Format and mount (-F: don't prompt when the target is a regular file)
    mkfs.ext4 -q -F "/tmp/test-disk.img"
    mount -o loop "/tmp/test-disk.img" "$test_mount"
    
    # 2. Restore files from backup
    echo "Restoring files from backup..."
    
    if [[ "$backup_file" == *.tar.gz ]]; then
        tar -xzf "$backup_file" -C "$test_mount" --warning=no-file-changed
    elif [[ "$backup_file" == *.tar ]]; then
        tar -xf "$backup_file" -C "$test_mount" --warning=no-file-changed
    else
        echo "ERROR: Unsupported backup format"
        return 1
    fi
    
    # 3. Validate restoration
    validate_filesystem_restoration "$test_mount"
    
    # 4. Cleanup
    echo "Cleaning up test environment..."
    umount "$test_mount"
    rm "/tmp/test-disk.img"
    rmdir "$test_mount"
    
    echo "Filesystem restoration test completed"
}

# Validate filesystem restoration
validate_filesystem_restoration() {
    local test_mount=$1
    
    echo "Validating filesystem restoration"
    
    # Check directory structure
    echo "✓ Directory structure:"
    find "$test_mount" -type d -maxdepth 3 | head -20
    
    # Check total file count
    local file_count=$(find "$test_mount" -type f | wc -l)
    echo "✓ Files restored: $file_count"
    
    # Check disk usage
    local disk_usage=$(du -sh "$test_mount" | awk '{print $1}')
    echo "✓ Disk usage: $disk_usage"
    
    # Verify critical files
    check_critical_files "$test_mount"
}

# Check for critical files
check_critical_files() {
    local test_mount=$1
    local critical_files=("etc/passwd" "etc/shadow" "etc/hosts" "var/www/index.html")
    
    echo "✓ Critical files check:"
    for file in "${critical_files[@]}"; do
        if [ -f "$test_mount/$file" ]; then
            echo "  ✓ $file"
        else
            echo "  ✗ MISSING: $file"
        fi
    done
}
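Existence checks catch missing files but not corrupted ones. One way to extend the check, assuming a checksum manifest captured from the live system at backup time (paths and the function name are illustrative):

```shell
# Beyond existence: verify restored file contents against a checksum manifest
# captured from the live system at backup time, e.g.:
#   (cd / && sha256sum etc/hosts etc/passwd) > /backup/critical-files.sha256
check_critical_checksums() {
    local test_mount=$1
    local manifest=$2

    if [ ! -f "$manifest" ]; then
        echo "WARN: no manifest at $manifest, skipping content check"
        return 0
    fi

    # Run from the restore root so the manifest's relative paths resolve
    if (cd "$test_mount" && sha256sum --quiet -c "$manifest"); then
        echo "✓ Critical file contents match source"
    else
        echo "✗ Critical file content mismatch"
        return 1
    fi
}
```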

Scenario 3: Application Functionality Test

# Application functionality test after recovery
test_application_functionality() {
    local app_url=$1
    
    echo "Testing application functionality"
    
    # 1. HTTP connectivity
    echo -n "HTTP connectivity: "
    if curl -s -m 5 "$app_url" > /dev/null; then
        echo "✓"
    else
        echo "✗"
        return 1
    fi
    
    # 2. Check response time
    echo -n "Response time: "
    response_time=$(curl -s -w "%{time_total}" -o /dev/null "$app_url")
    echo "${response_time}s"
    
    if (( $(echo "$response_time > 5" | bc -l) )); then
        echo "⚠ Warning: Slow response time"
    fi
    
    # 3. Test critical endpoints
    test_endpoints "$app_url"
    
    # 4. Test database connectivity
    test_database_connectivity
    
    # 5. Test API responses
    test_api_functionality "$app_url"
}

# Test critical API endpoints
test_api_functionality() {
    local app_url=$1
    
    echo "Testing API endpoints:"
    
    local endpoints=(
        "/api/health"
        "/api/status"
        "/api/v1/users"
        "/api/v1/products"
    )
    
    for endpoint in "${endpoints[@]}"; do
        local response=$(curl -s -o /dev/null -w "%{http_code}" "$app_url$endpoint")
        
        if [ "$response" = "200" ]; then
            echo "  ✓ $endpoint"
        else
            echo "  ✗ $endpoint (HTTP $response)"
        fi
    done
}
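`test_endpoints` and `test_database_connectivity`, called by `test_application_functionality` above, are not shown. One possible shape, assuming a MySQL backend and placeholder page paths (adapt both to your application):

```shell
# Sketch of the helpers referenced by test_application_functionality.
# The page list and database client are placeholders for your stack.
test_endpoints() {
    local app_url=$1
    # Pages that must render for the app to be considered up
    for path in "/" "/login" "/healthz"; do
        local code
        code=$(curl -s -o /dev/null -w "%{http_code}" -m 5 "$app_url$path")
        if [ "$code" = "200" ]; then
            echo "  ✓ $path"
        else
            echo "  ✗ $path (HTTP $code)"
        fi
    done
}

test_database_connectivity() {
    echo -n "Database connectivity: "
    # mysqladmin ping exits 0 when the server is reachable
    if mysqladmin ping --silent 2>/dev/null; then
        echo "✓"
    else
        echo "✗"
        return 1
    fi
}
```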

Backup Restoration Scripts

Generic Restoration Script

# Universal backup restoration script
cat > /usr/local/bin/restore-backup.sh << 'EOF'
#!/bin/bash

BACKUP_LOCATION=$1
RESTORE_DESTINATION=$2
BACKUP_TYPE=${3:-"auto"}  # auto, tar, mysql, postgresql, disk
DRY_RUN=${4:-false}

if [ -z "$BACKUP_LOCATION" ] || [ -z "$RESTORE_DESTINATION" ]; then
    echo "Usage: $0 <backup_location> <restore_destination> [type] [dry_run]"
    echo "Types: auto, tar, mysql, postgresql, disk"
    exit 1
fi

# Detect backup type if auto
if [ "$BACKUP_TYPE" = "auto" ]; then
    if [[ "$BACKUP_LOCATION" == *.tar.gz ]]; then
        BACKUP_TYPE="tar"
    elif [[ "$BACKUP_LOCATION" == *.sql.gz ]]; then
        BACKUP_TYPE="mysql"
    elif [[ "$BACKUP_LOCATION" == *.sql ]]; then
        BACKUP_TYPE="mysql"
    elif [ -d "$BACKUP_LOCATION" ]; then
        BACKUP_TYPE="disk"
    fi
fi

echo "Detected backup type: $BACKUP_TYPE"

# Verify backup exists
if [ ! -e "$BACKUP_LOCATION" ]; then
    echo "ERROR: Backup not found: $BACKUP_LOCATION"
    exit 1
fi

# Restore based on type
case "$BACKUP_TYPE" in
    tar)
        echo "Restoring TAR backup..."
        if [ "$DRY_RUN" = "true" ]; then
            tar -tzf "$BACKUP_LOCATION" | head -20
        else
            tar -xzf "$BACKUP_LOCATION" -C "$RESTORE_DESTINATION"
        fi
        ;;
    mysql)
        echo "Restoring MySQL backup..."
        if [[ "$BACKUP_LOCATION" == *.gz ]]; then
            if [ "$DRY_RUN" = "true" ]; then
                zcat "$BACKUP_LOCATION" | head -50
            else
                gunzip < "$BACKUP_LOCATION" | mysql -u root "$RESTORE_DESTINATION"
            fi
        else
            if [ "$DRY_RUN" = "true" ]; then
                head -50 "$BACKUP_LOCATION"
            else
                mysql -u root "$RESTORE_DESTINATION" < "$BACKUP_LOCATION"
            fi
        fi
        ;;
    postgresql)
        echo "Restoring PostgreSQL backup..."
        if [ "$DRY_RUN" = "true" ]; then
            zcat "$BACKUP_LOCATION" | head -50
        else
            gunzip < "$BACKUP_LOCATION" | psql -U postgres
        fi
        ;;
    disk)
        echo "Restoring disk image..."
        if [ "$DRY_RUN" = "true" ]; then
            ls -lah "$BACKUP_LOCATION"
        else
            rsync -avz --delete "$BACKUP_LOCATION/" "$RESTORE_DESTINATION/"
        fi
        ;;
    *)
        echo "ERROR: Unknown backup type: $BACKUP_TYPE"
        exit 1
        ;;
esac

if [ $? -eq 0 ]; then
    echo "✓ Restoration completed successfully"
else
    echo "✗ Restoration failed"
    exit 1
fi
EOF

chmod +x /usr/local/bin/restore-backup.sh
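Callers of this script typically want the newest matching backup. Selecting it with `ls -t | head -1` breaks on unusual filenames; a more robust helper (name illustrative, GNU find assumed):

```shell
# Pick the newest backup matching a glob, without parsing ls output
# (robust against filenames containing spaces). Feeds restore-backup.sh.
latest_backup() {
    local pattern=$1
    find "$(dirname "$pattern")" -maxdepth 1 -name "$(basename "$pattern")" \
        -type f -printf '%T@ %p\n' 2>/dev/null |
        sort -rn | head -1 | cut -d' ' -f2-
}
```

A dry run might then look like: `restore-backup.sh "$(latest_backup '/backup/mysql/*.sql.gz')" test_db mysql true` (pattern quoted so the shell does not expand it first).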

Validation Checks

Comprehensive Validation Framework

# Automated validation checks after restoration
cat > /usr/local/bin/validate-restoration.sh << 'EOF'
#!/bin/bash

VALIDATION_LOG="/var/log/restoration-validation.log"
VALIDATION_REPORT="/var/reports/restoration-validation-$(date +%Y%m%d_%H%M%S).html"
mkdir -p /var/reports
TESTS_PASSED=0
TESTS_FAILED=0

log_test() {
    local test_name=$1
    local result=$2
    local details=$3
    
    echo "[$(date)] [$result] $test_name - $details" >> "$VALIDATION_LOG"
    
    if [ "$result" = "PASS" ]; then
        ((TESTS_PASSED++))
    else
        ((TESTS_FAILED++))
    fi
}

# Test 1: Verify backup file integrity
test_backup_integrity() {
    local backup_file=$1
    
    echo "Testing backup integrity..."
    
    if [ ! -f "$backup_file" ]; then
        log_test "Backup exists" "FAIL" "File not found"
        return 1
    fi
    
    if [ ! -s "$backup_file" ]; then
        log_test "Backup size" "FAIL" "File is empty"
        return 1
    fi
    
    log_test "Backup exists" "PASS" "Size: $(ls -lh "$backup_file" | awk '{print $5}')"
    
    # Test compression integrity
    if [[ "$backup_file" == *.gz ]]; then
        if gzip -t "$backup_file" 2>/dev/null; then
            log_test "Backup compression" "PASS" "gzip integrity verified"
        else
            log_test "Backup compression" "FAIL" "gzip corruption detected"
            return 1
        fi
    fi
}

# Test 2: Verify restored data consistency
test_data_consistency() {
    local source_hash=$1
    local restored_hash=$2
    
    echo "Testing data consistency..."
    
    if [ "$source_hash" = "$restored_hash" ]; then
        log_test "Data consistency" "PASS" "Hashes match"
        return 0
    else
        log_test "Data consistency" "FAIL" "Hash mismatch"
        return 1
    fi
}

# Test 3: Verify file permissions
test_file_permissions() {
    local restore_path=$1
    
    echo "Testing file permissions..."
    
    # Check for world-writable files (security risk)
    local world_writable=$(find "$restore_path" -perm -002 | wc -l)
    
    if [ "$world_writable" -gt 0 ]; then
        log_test "File permissions" "WARN" "Found $world_writable world-writable files"
    else
        log_test "File permissions" "PASS" "No world-writable files"
    fi
    
    # Check for world-readable sensitive files
    if [ -f "$restore_path/etc/shadow" ]; then
        case "$(stat -c %a "$restore_path/etc/shadow")" in
            0|400|600|640)
                log_test "Shadow file permissions" "PASS" "Correct permissions"
                ;;
            *)
                log_test "Shadow file permissions" "WARN" "Unexpected permissions"
                ;;
        esac
    fi
}

# Generate HTML report
generate_report() {
    cat > "$VALIDATION_REPORT" << HTMLEOF
<!DOCTYPE html>
<html>
<head>
    <title>Restoration Validation Report</title>
    <style>
        body { font-family: Arial; margin: 20px; }
        .pass { color: green; }
        .fail { color: red; }
        table { border-collapse: collapse; width: 100%; }
        th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
        th { background-color: #f2f2f2; }
    </style>
</head>
<body>
    <h1>Disaster Recovery Validation Report</h1>
    <p>Generated: $(date)</p>
    
    <h2>Summary</h2>
    <p><strong>Tests Passed:</strong> <span class="pass">$TESTS_PASSED</span></p>
    <p><strong>Tests Failed:</strong> <span class="fail">$TESTS_FAILED</span></p>
    
    <h2>Detailed Results</h2>
    <table>
        <tr>
            <th>Test</th>
            <th>Result</th>
            <th>Details</th>
            <th>Timestamp</th>
        </tr>
HTMLEOF
    
    # Append test results (log lines: "[<date>] [<RESULT>] <name> - <details>")
    tail -n 20 "$VALIDATION_LOG" | while IFS= read -r line; do
        timestamp=${line%%"]"*}; timestamp=${timestamp#"["}
        rest=${line#*"] ["}
        status=${rest%%"]"*}
        rest=${rest#*"] "}
        test_name=${rest%% - *}
        details=${rest#* - }
        if [ "$status" = "PASS" ]; then css="pass"; else css="fail"; fi
        printf '<tr><td>%s</td><td class="%s">%s</td><td>%s</td><td>%s</td></tr>\n' \
            "$test_name" "$css" "$status" "$details" "$timestamp" >> "$VALIDATION_REPORT"
    done
    
    cat >> "$VALIDATION_REPORT" << 'HTMLEOF'
    </table>
</body>
</html>
HTMLEOF
    
    echo "Report generated: $VALIDATION_REPORT"
}

# Run all tests
run_all_tests() {
    test_backup_integrity "/backup/latest.tar.gz"
    test_file_permissions "/mnt/restore"
    generate_report
}

run_all_tests
EOF

chmod +x /usr/local/bin/validate-restoration.sh
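`test_data_consistency` compares two precomputed hashes but does not show where they come from. One way to produce them is a single digest over an entire tree, run on the source before backup and on the restore mount afterwards (function name illustrative; this covers file contents and relative paths, not ownership or empty directories):

```shell
# Produce one digest for a whole directory tree: hash every file, then hash
# the sorted list of per-file hashes. Identical trees yield identical digests.
tree_hash() {
    local root=$1
    (cd "$root" && find . -type f -print0 |
        sort -z |
        xargs -0 sha256sum |
        sha256sum |
        awk '{print $1}')
}
```

Typical use: `test_data_consistency "$(tree_hash /srv/data)" "$(tree_hash /mnt/restore/srv/data)"`.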

Scheduling Automated Tests

Cron-Based Test Scheduling

# Setup automated DR testing schedule
cat > /etc/cron.d/dr-testing << 'EOF'
# Disaster Recovery Testing Schedule

# Weekly backup verification (Tuesday 2 AM)
0 2 * * 2 root /usr/local/bin/test-backup-integrity.sh 2>&1 | mail -s "Weekly Backup Test" [email protected]

# Monthly database restoration test (first Sunday, 3 AM). An entry restricting
# both day-of-month and day-of-week fires when EITHER matches, so the
# "first Sunday" window is enforced with a day-of-month guard instead.
0 3 * * 0 root [ "$(date +\%d)" -le 7 ] && /usr/local/bin/test-database-restoration.sh 2>&1 | mail -s "Monthly DB Restore Test" [email protected]

# Monthly filesystem restoration test (second Sunday, 3 AM)
0 3 * * 0 root [ "$(date +\%d)" -ge 8 ] && [ "$(date +\%d)" -le 14 ] && /usr/local/bin/test-filesystem-restoration.sh 2>&1 | mail -s "Monthly FS Restore Test" [email protected]

# Quarterly full system recovery test (first day of each quarter, 4 AM)
0 4 1 1,4,7,10 * root /usr/local/bin/test-full-recovery.sh 2>&1 | mail -s "Quarterly Full Recovery Test" [email protected]

# Daily backup validation (5 AM every day)
0 5 * * * root /usr/local/bin/validate-backup-health.sh >> /var/log/backup-health.log 2>&1
EOF
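A caveat worth knowing for these schedules: cron cannot express "first Sunday of the month" directly, because an entry that restricts both day-of-month and day-of-week fires when either field matches. The usual workaround is a weekday-only entry plus a day-of-month guard, sketched here (function name illustrative):

```shell
# True only during the Nth occurrence of a weekday within the month:
# the Nth weekday always falls in day-of-month range (7N-6)..(7N).
# Pair with a cron entry restricted to that weekday, e.g.
#   0 3 * * 0 root nth-weekday-guard 1 && /usr/local/bin/test-db.sh
nth_weekday_of_month() {
    local n=$1                     # 1 = first, 2 = second, ...
    local dom=${2:-$(date +%-d)}   # day of month, overridable for testing
    [ "$dom" -ge $(( 7 * n - 6 )) ] && [ "$dom" -le $(( 7 * n )) ]
}
```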

# Create test scheduling script
cat > /usr/local/bin/schedule-dr-tests.sh << 'EOF'
#!/bin/bash

TEST_NAME=$1
TEST_SCRIPT=$2
TEST_SCHEDULE=$3

if [ -z "$TEST_NAME" ] || [ -z "$TEST_SCRIPT" ] || [ -z "$TEST_SCHEDULE" ]; then
    echo "Usage: $0 <test_name> <test_script> <cron_schedule>"
    echo "Example: $0 'Database Restore' /usr/local/bin/test-db.sh '0 3 * * 0'"
    exit 1
fi

# Install as a drop-in under /etc/cron.d (entries there take a user field;
# per-user crontabs added via 'crontab -' must not include one)
CRON_FILE="/etc/cron.d/dr-test-$(echo "$TEST_NAME" | tr 'A-Z ' 'a-z-')"
echo "$TEST_SCHEDULE root $TEST_SCRIPT" > "$CRON_FILE"
chmod 644 "$CRON_FILE"

echo "✓ DR test scheduled: $TEST_NAME"
echo "  Script: $TEST_SCRIPT"
echo "  Schedule: $TEST_SCHEDULE"
EOF

chmod +x /usr/local/bin/schedule-dr-tests.sh

CI/CD Integration

Jenkins-Based DR Testing Pipeline

// Jenkinsfile for automated DR testing
pipeline {
    agent any
    
    triggers {
        // Run every Sunday at 3 AM
        cron('0 3 * * 0')
    }
    
    stages {
        stage('Backup Verification') {
            steps {
                script {
                    echo "Verifying backups..."
                    sh '''
                        /usr/local/bin/verify-backup-integrity.sh
                        if [ $? -ne 0 ]; then
                            exit 1
                        fi
                    '''
                }
            }
        }
        
        stage('Database Restoration') {
            steps {
                script {
                    echo "Testing database restoration..."
                    sh '''
                        LATEST_BACKUP=$(ls -t /backup/mysql/*.sql.gz | head -1)
                        /usr/local/bin/test-database-restoration.sh "$LATEST_BACKUP"
                    '''
                }
            }
        }
        
        stage('Filesystem Restoration') {
            steps {
                script {
                    echo "Testing filesystem restoration..."
                    sh '''
                        LATEST_BACKUP=$(ls -t /backup/full/*.tar.gz | head -1)
                        /usr/local/bin/test-filesystem-restoration.sh "$LATEST_BACKUP"
                    '''
                }
            }
        }
        
        stage('Validation') {
            steps {
                script {
                    echo "Running validation checks..."
                    sh '/usr/local/bin/validate-restoration.sh'
                }
            }
        }
    }
    
    post {
        always {
            // archiveArtifacts resolves paths relative to the workspace,
            // so copy results into it before archiving
            sh 'cp -f /var/log/dr-test-*.log /var/reports/restoration-*.html . 2>/dev/null || true'
            archiveArtifacts artifacts: 'dr-test-*.log, restoration-*.html', allowEmptyArchive: true
            
            // Send report
            emailext(
                subject: "DR Test Report - ${BUILD_STATUS}",
                body: readFile('/var/reports/latest-dr-report.html'),
                mimeType: 'text/html',
                to: '[email protected]'
            )
        }
        
        success {
            echo "DR tests passed successfully"
        }
        
        failure {
            echo "DR tests FAILED - investigate immediately"
        }
    }
}

Reporting and Metrics

Generate DR Testing Report

# Generate comprehensive DR testing report
cat > /usr/local/bin/generate-dr-report.sh << 'EOF'
#!/bin/bash

REPORT_DATE=$(date +%Y-%m-%d)
REPORT_FILE="/var/reports/dr-testing-report-$REPORT_DATE.txt"
mkdir -p /var/reports

cat > "$REPORT_FILE" << REPORT
==================================================
DISASTER RECOVERY TESTING REPORT
Date: $REPORT_DATE
==================================================

EXECUTIVE SUMMARY
-----------------
Overall DR Readiness: [Green/Yellow/Red]
Last Full Recovery Test: [Date]
Outstanding Issues: [Count]

DETAILED TEST RESULTS
---------------------

1. BACKUP TESTS
   Status: [PASS/FAIL]
   Latest Backup: $(ls -t /backup/full/*.tar.gz 2>/dev/null | head -1)
   Backup Size: $(du -sh /backup/full 2>/dev/null | awk '{print $1}')
   Age: $(find /backup/full -type f -mtime 0 2>/dev/null | wc -l) files from today

2. DATABASE RESTORATION
   Status: [PASS/FAIL]
   Tables Verified: [Count]
   Row Integrity: [Count rows]
   Test Duration: [Minutes]

3. FILESYSTEM RESTORATION
   Status: [PASS/FAIL]
   Files Restored: [Count]
   Total Size: [GB]
   Test Duration: [Minutes]

4. APPLICATION FUNCTIONALITY
   Status: [PASS/FAIL]
   API Endpoints: [Pass/Fail]
   Response Time: [ms]
   Database Connectivity: [PASS/FAIL]

5. INFRASTRUCTURE VALIDATION
   Network Connectivity: [PASS/FAIL]
   DNS Resolution: [PASS/FAIL]
   Service Startup: [PASS/FAIL]

METRICS & KPIs
--------------
RTO Actual: [Time]
RTO Target: [Time]
RPO Actual: [Time]
RPO Target: [Time]

Recovery Success Rate: [Percentage]
Mean Time to Recovery: [Hours]
Data Loss Percentage: [Percentage]

RECOMMENDATIONS
---------------
[List of improvements]

SIGN-OFF
--------
Report Generated: $(date)
Reviewed By: [Name]
Next Test Scheduled: [Date]
REPORT

echo "Report generated: $REPORT_FILE"
cat "$REPORT_FILE"
EOF

chmod +x /usr/local/bin/generate-dr-report.sh
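The `RTO Actual` field is easiest to fill from measurements rather than estimates. A small timing wrapper, assuming the recovery step can be run as a single command (log path and format are illustrative):

```shell
# Time a recovery command and append the result to a metrics log that the
# report script can read when filling in "RTO Actual".
measure_rto() {
    local label=$1; shift
    local start end elapsed
    start=$(date +%s)
    "$@"                       # the recovery command under test
    local status=$?
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "$(date +%Y-%m-%d) $label rto_seconds=$elapsed status=$status" \
        >> "${RTO_LOG:-/var/log/dr-rto-metrics.log}"
    return $status
}
```

For example, `measure_rto full-restore /usr/local/bin/restore-backup.sh /backup/latest.tar.gz /mnt/restore` would append one line per drill, giving a history of measured recovery times to trend against the RTO target.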

Lessons Learned Process

Document and Share Findings

# Capture lessons learned from DR tests
cat > /usr/local/bin/dr-lessons-learned.sh << 'EOF'
#!/bin/bash

LESSONS_FILE="/var/reports/dr-lessons-learned.log"
TEST_DATE=$(date +%Y-%m-%d)

# Delimiter left unquoted so $TEST_DATE expands in the header below
cat >> "$LESSONS_FILE" << LESSONS

==================================================
Lessons Learned from DR Test - $TEST_DATE
==================================================

WHAT WENT WELL
--------------
1. [Success item]
2. [Success item]

WHAT COULD BE IMPROVED
----------------------
1. [Issue item]
2. [Issue item]

ACTIONS TAKEN
-------------
1. [Action with owner and deadline]
2. [Action with owner and deadline]

PROCESS IMPROVEMENTS
--------------------
1. [Improvement to procedure]
2. [Improvement to tooling]

LESSONS
EOF

echo "Lessons learned documented in: $LESSONS_FILE"
EOF

chmod +x /usr/local/bin/dr-lessons-learned.sh

Conclusion

Automated disaster recovery testing provides:

  1. Confidence: Regular testing proves recovery procedures work
  2. Early Detection: Problems identified before actual disasters
  3. Compliance: Demonstrates compliance with recovery requirements
  4. Optimization: Identifies areas to improve RTO/RPO
  5. Documentation: Creates current recovery runbooks

Implement a testing pyramid: frequent automated checks (weekly), monthly partial tests, quarterly full tests, and annual production failover drills. Use CI/CD to integrate testing into your deployment pipeline and always capture lessons learned to continuously improve your DR capability.