Automated Disaster Recovery Testing
Automated disaster recovery testing ensures your backup and recovery procedures are effective and remain current as infrastructure evolves. This guide covers comprehensive test scenarios, restoration validation scripts, automated scheduling, CI/CD integration, and detailed reporting mechanisms.
Table of Contents
- DR Testing Strategy
- Test Scenarios
- Backup Restoration Scripts
- Validation Checks
- Scheduling Automated Tests
- CI/CD Integration
- Reporting and Metrics
- Lessons Learned Process
- Conclusion
DR Testing Strategy
Define Testing Objectives
# DR testing framework
cat > /tmp/dr-testing-framework.md << 'EOF'
# DR Testing Framework
## Testing Levels
### Level 1: Backup Verification (Weekly)
- Verify backup files exist and have valid size
- Test backup integrity (checksum validation)
- Confirm backups are stored in multiple locations
- Automated, low resource requirement
### Level 2: Partial Restoration (Monthly)
- Restore individual files/databases to test environment
- Verify file integrity matches source
- Test database consistency after restore
- Requires test environment, ~2 hours
### Level 3: Full System Recovery (Quarterly)
- Complete recovery of entire system to spare hardware
- Verify all services start correctly
- Run full application test suite
- Requires dedicated resources, ~4-8 hours
### Level 4: Production Failover Drill (Annual)
- Actual failover to disaster recovery site
- Redirect real traffic briefly
- Measure actual RTO/RPO
- High risk, requires careful planning
## Success Metrics
- RTO: Actual vs Target
- RPO: Data loss measured
- Service availability post-recovery
- Data integrity verification results
- Application functionality tests
EOF
cat /tmp/dr-testing-framework.md
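The RPO metric above can be approximated between drills without any restoration at all: when backups are the only recovery source, the age of the newest backup file is the worst-case data loss. A minimal sketch, assuming a single backup directory (the path is environment-specific):

```shell
# Worst-case RPO estimate: minutes since the newest file in the backup
# directory was written. Requires GNU find (-printf).
measure_rpo_minutes() {
    local backup_dir=$1
    local newest now
    # %T@ prints each file's mtime as a Unix timestamp
    newest=$(find "$backup_dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
    if [ -z "$newest" ]; then
        echo "ERROR: no backup files found in $backup_dir" >&2
        return 1
    fi
    now=$(date +%s)
    # Strip the fractional part of the timestamp before the arithmetic
    echo $(( (now - ${newest%.*}) / 60 ))
}
```

For example, `rpo=$(measure_rpo_minutes /backup/full)` can be compared against the RPO target as part of the weekly Level 1 run.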
Test Scenarios
Scenario 1: Database Restoration Test
# Complete database restoration test
test_database_restoration() {
local backup_file=$1
local test_database="test_restore_$$"
local test_user="test_user"
local test_password="test_password_$$"
echo "Starting database restoration test"
echo "Test DB: $test_database"
# 1. Create test database
mysql -u root << EOF
CREATE DATABASE $test_database;
CREATE USER '$test_user'@'localhost' IDENTIFIED BY '$test_password';
GRANT ALL PRIVILEGES ON $test_database.* TO '$test_user'@'localhost';
EOF
# 2. Restore backup (pipefail so a gunzip failure is not masked by mysql)
set -o pipefail
if [[ "$backup_file" == *.gz ]]; then
gunzip < "$backup_file" | mysql -u root "$test_database"
else
mysql -u root "$test_database" < "$backup_file"
fi
if [ $? -ne 0 ]; then
echo "ERROR: Database restoration failed"
# Remove both the test database and the test user on failure
mysql -u root -e "DROP DATABASE $test_database; DROP USER '$test_user'@'localhost';"
return 1
fi
# 3. Validate restoration
validate_restored_database "$test_database" "$test_user" "$test_password"
# 4. Cleanup
mysql -u root << EOF
DROP DATABASE $test_database;
DROP USER '$test_user'@'localhost';
EOF
echo "Database restoration test completed"
}
# Validate restored database
validate_restored_database() {
local test_db=$1
local test_user=$2
local test_pass=$3
echo "Validating restored database: $test_db"
# Check table count
local table_count=$(mysql -u "$test_user" -p"$test_pass" "$test_db" -sNe \
"SELECT COUNT(*) FROM information_schema.TABLES WHERE TABLE_SCHEMA='$test_db';")
echo "✓ Tables restored: $table_count"
# Check row counts for critical tables
mysql -u "$test_user" -p"$test_pass" "$test_db" << 'SQL'
SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES
WHERE TABLE_SCHEMA=DATABASE() ORDER BY TABLE_ROWS DESC LIMIT 10;
SQL
# Run integrity check on every table in the restored schema
local tables=$(mysql -u "$test_user" -p"$test_pass" "$test_db" -sNe \
"SELECT GROUP_CONCAT(TABLE_NAME) FROM information_schema.TABLES WHERE TABLE_SCHEMA='$test_db';")
if [ -n "$tables" ] && [ "$tables" != "NULL" ]; then
mysql -u "$test_user" -p"$test_pass" "$test_db" -e "CHECK TABLE $tables;"
fi
}
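A small driver can feed the newest dump into `test_database_restoration`. The `/backup/mysql` default below is an assumption; adjust it to your backup layout:

```shell
# Pick the newest .sql.gz dump in a directory (mtime order, newest first)
latest_sql_backup() {
    local dir=${1:-/backup/mysql}
    ls -t "$dir"/*.sql.gz 2>/dev/null | head -1
}

# Run the restoration test only when a dump exists and the function is loaded
backup=$(latest_sql_backup)
if [ -n "$backup" ] && declare -F test_database_restoration > /dev/null; then
    test_database_restoration "$backup"
else
    echo "No SQL backup found (or test function not sourced); skipping" >&2
fi
```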
Scenario 2: File System Restoration Test
# File system restoration test
test_filesystem_restoration() {
local backup_file=$1
local test_mount="/mnt/restore-test"
local test_size_gb=50
echo "Starting filesystem restoration test"
# 1. Create temporary test filesystem
mkdir -p "$test_mount"
# Create a sparse test disk image (avoids writing $test_size_gb GB of zeros)
truncate -s "${test_size_gb}G" /tmp/test-disk.img
# Format and mount (-F forces mkfs to run on a regular file)
mkfs.ext4 -F -q /tmp/test-disk.img
mount -o loop /tmp/test-disk.img "$test_mount"
# 2. Restore files from backup
echo "Restoring files from backup..."
if [[ "$backup_file" == *.tar.gz ]]; then
tar -xzf "$backup_file" -C "$test_mount" --warning=no-file-changed
elif [[ "$backup_file" == *.tar ]]; then
tar -xf "$backup_file" -C "$test_mount" --warning=no-file-changed
else
echo "ERROR: Unsupported backup format"
return 1
fi
# 3. Validate restoration
validate_filesystem_restoration "$test_mount"
# 4. Cleanup
echo "Cleaning up test environment..."
umount "$test_mount"
rm "/tmp/test-disk.img"
rmdir "$test_mount"
echo "Filesystem restoration test completed"
}
# Validate filesystem restoration
validate_filesystem_restoration() {
local test_mount=$1
echo "Validating filesystem restoration"
# Check directory structure
echo "✓ Directory structure:"
find "$test_mount" -maxdepth 3 -type d | head -20
# Check total file count
local file_count=$(find "$test_mount" -type f | wc -l)
echo "✓ Files restored: $file_count"
# Check disk usage
local disk_usage=$(du -sh "$test_mount" | awk '{print $1}')
echo "✓ Disk usage: $disk_usage"
# Verify critical files
check_critical_files "$test_mount"
}
# Check for critical files
check_critical_files() {
local test_mount=$1
local critical_files=("etc/passwd" "etc/shadow" "etc/hosts" "var/www/index.html")
echo "✓ Critical files check:"
for file in "${critical_files[@]}"; do
if [ -f "$test_mount/$file" ]; then
echo " ✓ $file"
else
echo " ✗ MISSING: $file"
fi
done
}
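File counts and spot checks catch gross failures; byte-level verification needs a hash manifest taken from the source tree before backup. A sketch of that pairing (manifest path is arbitrary; pass it as an absolute path so the check works from the restore root):

```shell
# Hash every file in the source tree before the backup runs...
create_manifest() {
    local src=$1 manifest=$2
    # Relative paths so the manifest verifies from any restore root;
    # -r keeps xargs from running sha256sum on an empty tree
    (cd "$src" && find . -type f -print0 | xargs -0 -r sha256sum) > "$manifest"
}

# ...then verify the restored tree against the saved manifest
verify_manifest() {
    local restored=$1 manifest=$2
    (cd "$restored" && sha256sum --check --quiet "$manifest")
}
```

`verify_manifest` exits non-zero on any mismatch, so it slots directly into the validation functions above.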
Scenario 3: Application Functionality Test
# Application functionality test after recovery
test_application_functionality() {
local app_url=$1
echo "Testing application functionality"
# 1. HTTP connectivity
echo -n "HTTP connectivity: "
if curl -s -m 5 "$app_url" > /dev/null; then
echo "✓"
else
echo "✗"
return 1
fi
# 2. Check response time
echo -n "Response time: "
response_time=$(curl -s -w "%{time_total}" -o /dev/null "$app_url")
echo "${response_time}s"
if (( $(echo "$response_time > 5" | bc -l) )); then
echo "⚠ Warning: Slow response time"
fi
# 3. Test critical endpoints
test_endpoints "$app_url"
# 4. Test database connectivity
test_database_connectivity
# 5. Test API responses
test_api_functionality "$app_url"
}
# Test critical API endpoints
test_api_functionality() {
local app_url=$1
echo "Testing API endpoints:"
local endpoints=(
"/api/health"
"/api/status"
"/api/v1/users"
"/api/v1/products"
)
for endpoint in "${endpoints[@]}"; do
local response=$(curl -s -o /dev/null -w "%{http_code}" "$app_url$endpoint")
if [ "$response" = "200" ]; then
echo " ✓ $endpoint"
else
echo " ✗ $endpoint (HTTP $response)"
fi
done
}
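`test_endpoints` and `test_database_connectivity` are called above but not defined in this guide. Minimal sketches follow; the endpoint list and the use of `mysqladmin ping` (credentials assumed in `~/.my.cnf`) are placeholders for your stack:

```shell
# Probe a fixed list of paths and report the HTTP status of each.
# Adjust the list to the application's real routes.
test_endpoints() {
    local app_url=$1
    local path code
    for path in / /login /health; do
        # || true: a connection failure should report 000, not abort the test
        code=$(curl -s -o /dev/null -m 5 -w "%{http_code}" "$app_url$path" || true)
        echo "  $path -> HTTP $code"
    done
}

# Cheapest possible DB liveness check; non-zero exit when unreachable
test_database_connectivity() {
    if mysqladmin ping --silent 2>/dev/null; then
        echo "Database connectivity: OK"
    else
        echo "Database connectivity: FAILED"
        return 1
    fi
}
```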
Backup Restoration Scripts
Generic Restoration Script
# Universal backup restoration script
cat > /usr/local/bin/restore-backup.sh << 'EOF'
#!/bin/bash
BACKUP_LOCATION=$1
RESTORE_DESTINATION=$2
BACKUP_TYPE=${3:-"auto"} # auto, tar, mysql, postgresql, disk
DRY_RUN=${4:-false}
if [ -z "$BACKUP_LOCATION" ] || [ -z "$RESTORE_DESTINATION" ]; then
echo "Usage: $0 <backup_location> <restore_destination> [type] [dry_run]"
echo "Types: auto, tar, mysql, postgresql, disk"
exit 1
fi
# Detect backup type if auto
if [ "$BACKUP_TYPE" = "auto" ]; then
if [[ "$BACKUP_LOCATION" == *.tar.gz ]]; then
BACKUP_TYPE="tar"
elif [[ "$BACKUP_LOCATION" == *.sql.gz ]]; then
BACKUP_TYPE="mysql"
elif [[ "$BACKUP_LOCATION" == *.sql ]]; then
BACKUP_TYPE="mysql"
elif [ -d "$BACKUP_LOCATION" ]; then
BACKUP_TYPE="disk"
fi
fi
echo "Detected backup type: $BACKUP_TYPE"
# Verify backup exists
if [ ! -e "$BACKUP_LOCATION" ]; then
echo "ERROR: Backup not found: $BACKUP_LOCATION"
exit 1
fi
# Restore based on type
case "$BACKUP_TYPE" in
tar)
echo "Restoring TAR backup..."
if [ "$DRY_RUN" = "true" ]; then
tar -tzf "$BACKUP_LOCATION" | head -20
else
tar -xzf "$BACKUP_LOCATION" -C "$RESTORE_DESTINATION"
fi
;;
mysql)
echo "Restoring MySQL backup..."
if [[ "$BACKUP_LOCATION" == *.gz ]]; then
if [ "$DRY_RUN" = "true" ]; then
zcat "$BACKUP_LOCATION" | head -50
else
gunzip < "$BACKUP_LOCATION" | mysql -u root "$RESTORE_DESTINATION"
fi
else
if [ "$DRY_RUN" = "true" ]; then
head -50 "$BACKUP_LOCATION"
else
mysql -u root "$RESTORE_DESTINATION" < "$BACKUP_LOCATION"
fi
fi
;;
postgresql)
echo "Restoring PostgreSQL backup..."
if [ "$DRY_RUN" = "true" ]; then
zcat "$BACKUP_LOCATION" | head -50
else
gunzip < "$BACKUP_LOCATION" | psql -U postgres
fi
;;
disk)
echo "Restoring disk image..."
if [ "$DRY_RUN" = "true" ]; then
ls -lah "$BACKUP_LOCATION"
else
rsync -avz --delete "$BACKUP_LOCATION/" "$RESTORE_DESTINATION/"
fi
;;
*)
echo "ERROR: Unknown backup type: $BACKUP_TYPE"
exit 1
;;
esac
if [ $? -eq 0 ]; then
echo "✓ Restoration completed successfully"
else
echo "✗ Restoration failed"
exit 1
fi
EOF
chmod +x /usr/local/bin/restore-backup.sh
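The auto-detection branch is the part most likely to break as new backup formats are added, so it helps to keep a standalone copy of that logic that can be exercised without touching any data (the `unknown` fallback is our addition):

```shell
# Standalone copy of restore-backup.sh's type detection, for quick testing
detect_backup_type() {
    local loc=$1
    if [[ "$loc" == *.tar.gz ]]; then
        echo tar
    elif [[ "$loc" == *.sql.gz || "$loc" == *.sql ]]; then
        echo mysql
    elif [ -d "$loc" ]; then
        echo disk
    else
        echo unknown
    fi
}
```

Before any real restoration, a dry run such as `restore-backup.sh /backup/full/latest.tar.gz /mnt/restore auto true` previews the backup contents without writing to the destination.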
Validation Checks
Comprehensive Validation Framework
# Automated validation checks after restoration
cat > /usr/local/bin/validate-restoration.sh << 'EOF'
#!/bin/bash
VALIDATION_LOG="/var/log/restoration-validation.log"
VALIDATION_REPORT="/var/reports/restoration-validation-$(date +%Y%m%d_%H%M%S).html"
mkdir -p /var/reports
TESTS_PASSED=0
TESTS_FAILED=0
log_test() {
local test_name=$1
local result=$2
local details=$3
echo "[$(date)] [$result] $test_name - $details" >> "$VALIDATION_LOG"
if [ "$result" = "PASS" ]; then
((TESTS_PASSED++))
else
((TESTS_FAILED++))
fi
}
# Test 1: Verify backup file integrity
test_backup_integrity() {
local backup_file=$1
echo "Testing backup integrity..."
if [ ! -f "$backup_file" ]; then
log_test "Backup exists" "FAIL" "File not found"
return 1
fi
if [ ! -s "$backup_file" ]; then
log_test "Backup size" "FAIL" "File is empty"
return 1
fi
log_test "Backup exists" "PASS" "Size: $(ls -lh "$backup_file" | awk '{print $5}')"
# Test compression integrity
if [[ "$backup_file" == *.gz ]]; then
if gzip -t "$backup_file" 2>/dev/null; then
log_test "Backup compression" "PASS" "gzip integrity verified"
else
log_test "Backup compression" "FAIL" "gzip corruption detected"
return 1
fi
fi
}
# Test 2: Verify restored data consistency
test_data_consistency() {
local source_hash=$1
local restored_hash=$2
echo "Testing data consistency..."
if [ "$source_hash" = "$restored_hash" ]; then
log_test "Data consistency" "PASS" "Hashes match"
return 0
else
log_test "Data consistency" "FAIL" "Hash mismatch"
return 1
fi
}
# Test 3: Verify file permissions
test_file_permissions() {
local restore_path=$1
echo "Testing file permissions..."
# Check for world-writable files (security risk)
local world_writable=$(find "$restore_path" -type f -perm -002 2>/dev/null | wc -l)
if [ "$world_writable" -gt 0 ]; then
log_test "File permissions" "WARN" "Found $world_writable world-writable files"
else
log_test "File permissions" "PASS" "No world-writable files"
fi
# Check that sensitive files are not group/world readable
# (stat -c %a prints "0" for mode 000, so match the common safe modes)
if [ -f "$restore_path/etc/shadow" ]; then
local shadow_perms=$(stat -c %a "$restore_path/etc/shadow")
case "$shadow_perms" in
0|400|600|640)
log_test "Shadow file permissions" "PASS" "Mode $shadow_perms"
;;
*)
log_test "Shadow file permissions" "WARN" "Unexpected mode $shadow_perms"
;;
esac
fi
}
# Generate HTML report
generate_report() {
cat > "$VALIDATION_REPORT" << HTMLEOF
<!DOCTYPE html>
<html>
<head>
<title>Restoration Validation Report</title>
<style>
body { font-family: Arial; margin: 20px; }
.pass { color: green; }
.fail { color: red; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
</style>
</head>
<body>
<h1>Disaster Recovery Validation Report</h1>
<p>Generated: $(date)</p>
<h2>Summary</h2>
<p><strong>Tests Passed:</strong> <span class="pass">$TESTS_PASSED</span></p>
<p><strong>Tests Failed:</strong> <span class="fail">$TESTS_FAILED</span></p>
<h2>Detailed Results</h2>
<table>
<tr>
<th>Test</th>
<th>Result</th>
<th>Details</th>
<th>Timestamp</th>
</tr>
HTMLEOF
# Add test results to report (log format: "[timestamp] [RESULT] name - details")
tail -n 20 "$VALIDATION_LOG" | while IFS= read -r line; do
timestamp=$(echo "$line" | sed -n 's/^\[\([^]]*\)\].*/\1/p')
result=$(echo "$line" | sed -n 's/^\[[^]]*\] \[\([^]]*\)\].*/\1/p')
rest=$(echo "$line" | sed -n 's/^\[[^]]*\] \[[^]]*\] //p')
test_name=${rest%% - *}
details=${rest#* - }
if [ "$result" = "PASS" ]; then css_class="pass"; else css_class="fail"; fi
echo "<tr><td>$test_name</td><td class=\"$css_class\">$result</td><td>$details</td><td>$timestamp</td></tr>" >> "$VALIDATION_REPORT"
done
cat >> "$VALIDATION_REPORT" << 'HTMLEOF'
</table>
</body>
</html>
HTMLEOF
echo "Report generated: $VALIDATION_REPORT"
}
# Run all tests
run_all_tests() {
test_backup_integrity "/backup/latest.tar.gz"
test_file_permissions "/mnt/restore"
generate_report
}
run_all_tests
EOF
chmod +x /usr/local/bin/validate-restoration.sh
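`test_data_consistency` expects precomputed hashes for the source and restored data. One way to produce them (GNU coreutils assumed) is to reduce a whole tree to a single hash:

```shell
# Reduce a directory tree to one hash: hash every file, sort for a
# deterministic order, then hash the resulting list itself.
tree_hash() {
    local dir=$1
    (cd "$dir" && find . -type f -print0 | sort -z \
        | xargs -0 -r sha256sum | sha256sum | awk '{print $1}')
}
```

Usage: `test_data_consistency "$(tree_hash /data)" "$(tree_hash /mnt/restore/data)"`.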
Scheduling Automated Tests
Cron-Based Test Scheduling
# Setup automated DR testing schedule
cat > /etc/cron.d/dr-testing << 'EOF'
# Disaster Recovery Testing Schedule
# Weekly backup verification (Tuesday 2 AM)
0 2 * * 2 root /usr/local/bin/test-backup-integrity.sh 2>&1 | mail -s "Weekly Backup Test" [email protected]
# Monthly database restoration test (first Sunday of the month, 3 AM).
# Note: restricting both day-of-month AND day-of-week in one cron entry
# matches on EITHER field, so the "nth Sunday" is selected with a date guard.
0 3 * * 0 root [ "$(date +\%d)" -le 7 ] && /usr/local/bin/test-database-restoration.sh 2>&1 | mail -s "Monthly DB Restore Test" [email protected]
# Monthly filesystem restoration test (second Sunday of the month, 3 AM)
0 3 * * 0 root [ "$(date +\%d)" -ge 8 ] && [ "$(date +\%d)" -le 14 ] && /usr/local/bin/test-filesystem-restoration.sh 2>&1 | mail -s "Monthly FS Restore Test" [email protected]
# Quarterly full system recovery test (first day of each quarter, 4 AM)
0 4 1 1,4,7,10 * root /usr/local/bin/test-full-recovery.sh 2>&1 | mail -s "Quarterly Full Recovery Test" [email protected]
# Daily backup validation (5 AM every day)
0 5 * * * root /usr/local/bin/validate-backup-health.sh >> /var/log/backup-health.log 2>&1
EOF
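The schedule above references `/usr/local/bin/test-backup-integrity.sh`, which is not shown elsewhere in this guide. The core of such a script can be as small as a loop over gzip archives; the backup paths it would be called with are assumptions:

```shell
# Verify gzip integrity of each archive passed in; exits non-zero
# (for cron/mail alerting) when any archive fails the check.
check_gzip_backups() {
    local status=0 f
    for f in "$@"; do
        [ -e "$f" ] || continue          # unexpanded globs are skipped
        if gzip -t "$f" 2>/dev/null; then
            echo "OK   $f"
        else
            echo "FAIL $f"
            status=1
        fi
    done
    return $status
}
```

`test-backup-integrity.sh` can then be little more than `check_gzip_backups /backup/full/*.tar.gz /backup/mysql/*.sql.gz`.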
# Create test scheduling script
cat > /usr/local/bin/schedule-dr-tests.sh << 'EOF'
#!/bin/bash
TEST_NAME=$1
TEST_SCRIPT=$2
TEST_SCHEDULE=$3
if [ -z "$TEST_NAME" ] || [ -z "$TEST_SCRIPT" ] || [ -z "$TEST_SCHEDULE" ]; then
echo "Usage: $0 <test_name> <test_script> <cron_schedule>"
echo "Example: $0 'Database Restore' /usr/local/bin/test-db.sh '0 3 * * 0'"
exit 1
fi
# Add to the invoking user's crontab (user crontabs have no user field,
# unlike /etc/cron.d entries)
(crontab -l 2>/dev/null; echo "$TEST_SCHEDULE $TEST_SCRIPT") | crontab -
echo "✓ DR test scheduled: $TEST_NAME"
echo " Script: $TEST_SCRIPT"
echo " Schedule: $TEST_SCHEDULE"
EOF
chmod +x /usr/local/bin/schedule-dr-tests.sh
CI/CD Integration
Jenkins-Based DR Testing Pipeline
// Jenkinsfile for automated DR testing
pipeline {
agent any
triggers {
// Run every Sunday at 3 AM
cron('0 3 * * 0')
}
stages {
stage('Backup Verification') {
steps {
script {
echo "Verifying backups..."
sh '''
# A non-zero exit status here fails the stage automatically
/usr/local/bin/verify-backup-integrity.sh
'''
}
}
}
stage('Database Restoration') {
steps {
script {
echo "Testing database restoration..."
sh '''
LATEST_BACKUP=$(ls -t /backup/mysql/*.sql.gz | head -1)
/usr/local/bin/test-database-restoration.sh "$LATEST_BACKUP"
'''
}
}
}
stage('Filesystem Restoration') {
steps {
script {
echo "Testing filesystem restoration..."
sh '''
LATEST_BACKUP=$(ls -t /backup/full/*.tar.gz | head -1)
/usr/local/bin/test-filesystem-restoration.sh "$LATEST_BACKUP"
'''
}
}
}
stage('Validation') {
steps {
script {
echo "Running validation checks..."
sh '/usr/local/bin/validate-restoration.sh'
}
}
}
}
post {
always {
// archiveArtifacts only accepts workspace-relative paths, so stage the files first
sh 'cp /var/log/dr-test-*.log /var/reports/restoration-*.html "$WORKSPACE"/ 2>/dev/null || true'
archiveArtifacts artifacts: '*.log, *.html', allowEmptyArchive: true
// Send report
emailext(
subject: "DR Test Report - ${currentBuild.currentResult}",
body: readFile('/var/reports/latest-dr-report.html'),
mimeType: 'text/html',
to: '[email protected]'
)
}
success {
echo "DR tests passed successfully"
}
failure {
echo "DR tests FAILED - investigate immediately"
}
}
}
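The Validation stage passes or fails on the exit code of the script it runs. A small helper (the function name is ours) can derive that exit code from the validation log written earlier by `validate-restoration.sh`, so the CI stage reflects the actual test outcome:

```shell
# Count PASS/FAIL entries in a validation log; exit non-zero when any
# test failed, so a calling CI stage fails with it.
dr_test_summary() {
    local log=$1
    local passed failed
    passed=$(grep -c '\[PASS\]' "$log") || true   # grep -c exits 1 on zero matches
    failed=$(grep -c '\[FAIL\]' "$log") || true
    echo "Passed: $passed  Failed: $failed"
    [ "$failed" -eq 0 ]
}
```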
Reporting and Metrics
Generate DR Testing Report
# Generate comprehensive DR testing report
cat > /usr/local/bin/generate-dr-report.sh << 'EOF'
#!/bin/bash
REPORT_DATE=$(date +%Y-%m-%d)
REPORT_FILE="/var/reports/dr-testing-report-$REPORT_DATE.txt"
mkdir -p /var/reports
cat > "$REPORT_FILE" << REPORT
==================================================
DISASTER RECOVERY TESTING REPORT
Date: $REPORT_DATE
==================================================
EXECUTIVE SUMMARY
-----------------
Overall DR Readiness: [Green/Yellow/Red]
Last Full Recovery Test: [Date]
Outstanding Issues: [Count]
DETAILED TEST RESULTS
---------------------
1. BACKUP TESTS
Status: [PASS/FAIL]
Latest Backup: $(ls -t /backup/full/*.tar.gz 2>/dev/null | head -1)
Backup Size: $(du -sh /backup/full 2>/dev/null | awk '{print $1}')
Backups Created Today: $(find /backup/full -type f -mtime 0 2>/dev/null | wc -l)
2. DATABASE RESTORATION
Status: [PASS/FAIL]
Tables Verified: [Count]
Row Integrity: [Count rows]
Test Duration: [Minutes]
3. FILESYSTEM RESTORATION
Status: [PASS/FAIL]
Files Restored: [Count]
Total Size: [GB]
Test Duration: [Minutes]
4. APPLICATION FUNCTIONALITY
Status: [PASS/FAIL]
API Endpoints: [Pass/Fail]
Response Time: [ms]
Database Connectivity: [PASS/FAIL]
5. INFRASTRUCTURE VALIDATION
Network Connectivity: [PASS/FAIL]
DNS Resolution: [PASS/FAIL]
Service Startup: [PASS/FAIL]
METRICS & KPIs
--------------
RTO Actual: [Time]
RTO Target: [Time]
RPO Actual: [Time]
RPO Target: [Time]
Recovery Success Rate: [Percentage]
Mean Time to Recovery: [Hours]
Data Loss Percentage: [Percentage]
RECOMMENDATIONS
---------------
[List of improvements]
SIGN-OFF
--------
Report Generated: $(date)
Reviewed By: [Name]
Next Test Scheduled: [Date]
REPORT
echo "Report generated: $REPORT_FILE"
cat "$REPORT_FILE"
EOF
chmod +x /usr/local/bin/generate-dr-report.sh
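"RTO Actual" in the report is only meaningful if it is measured. During a drill, the simplest measurement is to poll a health URL from the moment recovery starts; the URL, poll interval, and timeout below are assumptions:

```shell
# Poll a URL until it answers, then print elapsed seconds (the measured RTO).
# Gives up after $2 seconds (default 3600).
measure_rto_seconds() {
    local url=$1 timeout=${2:-3600}
    local start now
    start=$(date +%s)
    while true; do
        if curl -sf -m 5 "$url" > /dev/null 2>&1; then
            now=$(date +%s)
            echo $(( now - start ))
            return 0
        fi
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout" ]; then
            echo "RTO measurement timed out after ${timeout}s" >&2
            return 1
        fi
        sleep 5
    done
}
```

Run it as the recovery procedure begins, e.g. `rto=$(measure_rto_seconds https://app.example.com/health)`, and paste the result into the report.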
Lessons Learned Process
Document and Share Findings
# Capture lessons learned from DR tests
cat > /usr/local/bin/dr-lessons-learned.sh << 'EOF'
#!/bin/bash
LESSONS_FILE="/var/reports/dr-lessons-learned.log"
TEST_DATE=$(date +%Y-%m-%d)
mkdir -p /var/reports
# Heredoc delimiter left unquoted so $TEST_DATE expands in the header
cat >> "$LESSONS_FILE" << LESSONS
==================================================
Lessons Learned from DR Test - $TEST_DATE
==================================================
WHAT WENT WELL
--------------
1. [Success item]
2. [Success item]
WHAT COULD BE IMPROVED
----------------------
1. [Issue item]
2. [Issue item]
ACTIONS TAKEN
-------------
1. [Action with owner and deadline]
2. [Action with owner and deadline]
PROCESS IMPROVEMENTS
--------------------
1. [Improvement to procedure]
2. [Improvement to tooling]
LESSONS
EOF
echo "Lessons learned documented in: $LESSONS_FILE"
EOF
chmod +x /usr/local/bin/dr-lessons-learned.sh
Conclusion
Automated disaster recovery testing provides:
- Confidence: Regular testing proves recovery procedures work
- Early Detection: Problems identified before actual disasters
- Compliance: Demonstrates compliance with recovery requirements
- Optimization: Identifies areas to improve RTO/RPO
- Documentation: Creates current recovery runbooks
Implement a testing pyramid: frequent automated checks (weekly), monthly partial tests, quarterly full tests, and annual production failover drills. Use CI/CD to integrate testing into your deployment pipeline and always capture lessons learned to continuously improve your DR capability.


