Live Migration of Services Between Servers

Live migration lets you move running services from one server to another with minimal downtime. This guide covers planning, data synchronization, database replication and cutover, DNS transition, verification, and rollback procedures for near-zero-downtime migrations.

Table of Contents

  1. Migration Planning
  2. Pre-Migration Checklist
  3. Data Synchronization with Rsync
  4. Database Replication Cutover
  5. Application Service Migration
  6. DNS Transition
  7. Verification and Validation
  8. Rollback Procedures
  9. Conclusion

Migration Planning

Create a Migration Plan Document

# Migration plan template
cat > /tmp/migration-plan.md << 'EOF'
# Service Migration Plan

## Executive Summary
- Source Server: [hostname/IP]
- Destination Server: [hostname/IP]
- Services: [list]
- Estimated Duration: [hours]
- Maintenance Window: [date/time]
- Risk Level: [Low/Medium/High]

## Scope
### Services to migrate:
- Service A (port 8080)
- Service B (port 3306)
- Service C (port 443)

### Data to migrate:
- Application files: [size]
- Database: [size]
- User data: [size]
- Configuration files: [list]

## Dependencies
- Service A depends on: Database, Cache
- Service B depends on: Message Queue
- Service C depends on: Certificate Store

## Rollback Plan
1. [Step 1]
2. [Step 2]
3. [Step 3]

## Success Criteria
- [ ] All services running on destination
- [ ] No data loss detected
- [ ] No increase in error rates
- [ ] Users can access all features
- [ ] Performance metrics within acceptable range

EOF

cat /tmp/migration-plan.md

Risk Assessment

# Assess migration risks
assess_migration_risk() {
    echo "Migration Risk Assessment"
    echo "=========================="
    
    local high_risk_items=(
        "Database with active connections"
        "Real-time streaming services"
        "Services with local state"
        "Large databases (>100GB)"
        "Custom network configurations"
    )
    
    local medium_risk_items=(
        "Static web servers"
        "Cache layers"
        "Read-only databases"
        "Containerized applications"
    )
    
    local low_risk_items=(
        "Stateless services"
        "Load-balanced applications"
        "Services with built-in redundancy"
    )
    
    echo ""
    echo "High Risk (Requires careful planning):"
    printf '%s\n' "${high_risk_items[@]}"
    
    echo ""
    echo "Medium Risk (Standard procedures)"
    printf '%s\n' "${medium_risk_items[@]}"
    
    echo ""
    echo "Low Risk (Quick migration expected)"
    printf '%s\n' "${low_risk_items[@]}"
}

assess_migration_risk

Pre-Migration Checklist

# Pre-migration validation checklist
cat > /usr/local/bin/pre-migration-check.sh << 'EOF'
#!/bin/bash

SOURCE_HOST=$1
DEST_HOST=$2

if [ -z "$SOURCE_HOST" ] || [ -z "$DEST_HOST" ]; then
    echo "Usage: $0 <source_host> <dest_host>"
    exit 1
fi

CHECKS_PASSED=0
CHECKS_FAILED=0

run_check() {
    local check_name=$1
    local check_command=$2
    
    echo -n "Checking: $check_name... "
    
    if eval "$check_command"; then
        echo "✓ PASS"
        ((CHECKS_PASSED++))
    else
        echo "✗ FAIL"
        ((CHECKS_FAILED++))
    fi
}

# Network connectivity
run_check "Source SSH connectivity" \
    "ssh -o ConnectTimeout=5 root@$SOURCE_HOST 'exit' 2>/dev/null"

run_check "Destination SSH connectivity" \
    "ssh -o ConnectTimeout=5 root@$DEST_HOST 'exit' 2>/dev/null"

# Storage capacity
run_check "Source disk usage" \
    "ssh root@$SOURCE_HOST 'df /data | awk \"NR==2 {if (\\\$4 > 10000000) exit 0; else exit 1}\"'"

run_check "Destination free space" \
    "ssh root@$DEST_HOST 'df /data | awk \"NR==2 {if (\\\$4 > 100000000) exit 0; else exit 1}\"'"

# Service availability
run_check "Source services running" \
    "ssh root@$SOURCE_HOST 'systemctl is-active my-service' | grep -q active"

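# "systemctl cat" fails when the unit is not defined; the previous
# "list-units | wc -l" pipeline always succeeded
run_check "Destination services configured" \
    "ssh root@$DEST_HOST 'systemctl cat my-service' >/dev/null 2>&1"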

# Database connectivity
run_check "Source database reachable" \
    "ssh root@$SOURCE_HOST 'mysql -u root -e \"SELECT 1;\" 2>/dev/null' | grep -q 1"

# Report
echo ""
echo "Pre-Migration Check Results"
echo "============================"
echo "Passed: $CHECKS_PASSED"
echo "Failed: $CHECKS_FAILED"

if [ $CHECKS_FAILED -eq 0 ]; then
    echo "Status: ✓ Ready for migration"
    exit 0
else
    echo "Status: ✗ Fix failures before migration"
    exit 1
fi
EOF

chmod +x /usr/local/bin/pre-migration-check.sh

Data Synchronization with Rsync

Initial Full Sync

# Perform initial data synchronization
perform_initial_sync() {
    local source_host=$1
    local source_path=$2
    local dest_host=$3
    local dest_path=$4
    local sync_log="/var/log/migration-sync.log"
    
    echo "[$(date)] Starting initial data sync" | tee -a "$sync_log"
    
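    # Full sync with checksum verification. rsync cannot copy between two
    # remote hosts, so run it on the source and push to the destination
    # (assumes the source can SSH to the destination).
    ssh "root@$source_host" rsync -avz \
        --progress \
        --no-perms \
        --delete \
        --checksum \
        "$source_path/" \
        "$dest_host:$dest_path/" \
        2>&1 | tee -a "$sync_log"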
    
    if [ ${PIPESTATUS[0]} -eq 0 ]; then
        echo "[$(date)] Initial sync completed" >> "$sync_log"
        return 0
    else
        echo "[$(date)] Initial sync failed" >> "$sync_log"
        return 1
    fi
}

# Example usage:
# perform_initial_sync "source.example.com" "/var/www" "dest.example.com" "/var/www"

Continuous Delta Sync

# Continuous synchronization for minimal cutover time
continuous_delta_sync() {
    local source_host=$1
    local source_path=$2
    local dest_host=$3
    local dest_path=$4
    
    echo "Starting continuous delta synchronization"
    echo "Press Ctrl+C to stop"
    
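    # Re-run rsync on an interval; each pass transfers only what changed
    while true; do
        echo "[$(date)] Running delta sync..."
        
        # Run rsync on the source and push to the destination, because
        # rsync cannot copy between two remote hosts (assumes the source
        # can SSH to the destination). The filter is double-quoted so it
        # survives re-parsing by the remote shell.
        ssh "root@$source_host" rsync -avz \
            --progress \
            --no-perms \
            --checksum \
            --delete \
            "--filter=':- .gitignore'" \
            "$source_path/" \
            "$dest_host:$dest_path/"
        
        sync_status=$?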
        
        if [ $sync_status -eq 0 ]; then
            echo "[$(date)] Delta sync completed successfully"
        else
            echo "[$(date)] Delta sync completed with status $sync_status"
        fi
        
        # Wait before next sync (e.g., every 5 minutes)
        sleep 300
    done
}

# Alternative: Use inotify for real-time sync
realtime_sync_with_inotify() {
    local source_host=$1
    local source_path=$2
    local dest_host=$3
    local dest_path=$4
    
    # On source server: install inotify-tools
    ssh "root@$source_host" 'apt-get install -y inotify-tools'
    
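    # Monitor the directory for changes and sync after each event.
    # The heredoc is unquoted so the paths and hosts expand locally
    # before the script is sent to the source server.
    ssh "root@$source_host" << EOF
while inotifywait -r -e modify,create,delete "$source_path"; do
    rsync -avz --delete "$source_path/" "$dest_host:$dest_path/"
done
EOF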
}

Bandwidth-Limited Sync

# Sync with bandwidth limiting to avoid impacting production
bandwidth_limited_sync() {
    local source_host=$1
    local source_path=$2
    local dest_host=$3
    local dest_path=$4
    local max_bandwidth_mbps=50
    
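    # Convert Mbps to KB/s (Mbps * 1000 / 8). rsync reads a bare --bwlimit
    # value as units of 1024 bytes per second, so this is a close approximation.
    local bandwidth_limit=$((max_bandwidth_mbps * 1000 / 8))
    
    # Run rsync on the source; it cannot copy between two remote hosts
    ssh "root@$source_host" rsync -avz \
        --bwlimit="$bandwidth_limit" \
        --progress \
        "$source_path/" \
        "$dest_host:$dest_path/"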
}

# Monitor sync progress
monitor_sync_progress() {
    local log_file="/var/log/migration-sync.log"
    
    # rsync 3.1+ prints "to-chk" in its progress lines; older versions print "to-check"
    watch -n 1 'tail -20 '"$log_file"' | grep -E "to-chk|to-check|speedup"'
}
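
The Mbps-to-KB/s conversion used for --bwlimit above is plain arithmetic and can be made slightly more exact. The following is a minimal sketch, not part of the original scripts: mbps_to_rsync_bwlimit is a hypothetical helper name, and it assumes the link speed is given in decimal megabits per second. Because rsync treats a bare --bwlimit number as units of 1024 bytes per second, the helper divides by 1024 rather than 1000.

```shell
# Hypothetical helper: convert a link speed in Mbps to an rsync --bwlimit value.
# Assumes decimal megabits (1 Mbps = 1,000,000 bits/s). The result is in
# units of 1024 bytes per second, rounded down by integer division.
mbps_to_rsync_bwlimit() {
    mbps=$1
    echo $(( mbps * 1000000 / 8 / 1024 ))
}

# 50 Mbps -> 6,250,000 bytes/s -> 6103 (units of 1024 bytes/s)
mbps_to_rsync_bwlimit 50
```

The difference from the simpler formula is about 2 percent, so either value is reasonable when the goal is only to keep the sync from saturating the production link.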

Database Replication Cutover

Set Up MySQL Replication

# Configure MySQL replication for migration
setup_mysql_replication_for_migration() {
    local source_host=$1
    local dest_host=$2
    
    echo "Setting up MySQL replication from $source_host to $dest_host"
    
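    # On source server: set row-based binlog format and create the replication user.
    # Binary logging itself (log_bin, plus a unique server_id) must already be
    # enabled in the server configuration; it cannot be switched on at runtime.
    ssh "root@$source_host" << 'EOF'
mysql -u root << 'MYSQL'
SET GLOBAL binlog_format = 'ROW';
CREATE USER IF NOT EXISTS 'repl'@'%' IDENTIFIED BY 'replication_password';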
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
MYSQL
EOF
    
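    # Get the master log file and position from a single SHOW MASTER STATUS
    # call so the two values are consistent. -N is omitted because it
    # suppresses the "File:" and "Position:" labels in \G output.
    local master_status
    master_status=$(ssh "root@$source_host" "mysql -u root -e 'SHOW MASTER STATUS\\G'")
    local master_log=$(awk '/File:/ {print $2}' <<< "$master_status")
    local master_pos=$(awk '/Position:/ {print $2}' <<< "$master_status")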
    
    echo "Master log: $master_log at position $master_pos"
    
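    # On destination server: configure as replica. This assumes the destination
    # was already seeded with a consistent snapshot of the source data.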
    ssh "root@$dest_host" << MYSQL
mysql -u root << 'SLAVE'
CHANGE MASTER TO
    MASTER_HOST='$source_host',
    MASTER_USER='repl',
    MASTER_PASSWORD='replication_password',
    MASTER_LOG_FILE='$master_log',
    MASTER_LOG_POS=$master_pos;

START SLAVE;
SHOW SLAVE STATUS\G
SLAVE
MYSQL
}

# Monitor replication lag
monitor_replication_during_migration() {
    local dest_host=$1
    local lag_threshold=5  # seconds
    
    echo "Monitoring MySQL replication lag..."
    
    while true; do
        local lag=$(ssh "root@$dest_host" \
            "mysql -u root -e \"SHOW SLAVE STATUS\\G\" | grep 'Seconds_Behind_Master' | awk '{print \$NF}'")
        
        echo "[$(date)] Replication lag: ${lag}s"
        
        # NULL or empty means replication is not running. Test those as
        # strings first, because -gt only accepts integers.
        if [ -z "$lag" ] || [ "$lag" = "NULL" ] || [ "$lag" -gt "$lag_threshold" ]; then
            echo "Warning: Replication lag is high or not running"
        fi
        
        sleep 10
    done
}
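
Before cutting over, the lag reading has to be turned into a go/no-go decision. The sketch below shows one way to do that; lag_ok is a hypothetical helper that is not part of the scripts above, and it assumes the lag value is the raw Seconds_Behind_Master field, which is either a whole number or the string NULL when replication is stopped.

```shell
# Hypothetical helper: succeed only when the replication lag is a whole
# number no greater than the threshold. An empty value, NULL, or any other
# non-numeric string counts as "not safe to cut over".
lag_ok() {
    lag=$1
    threshold=$2
    case "$lag" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$lag" -le "$threshold" ]
}

if lag_ok "3" 5; then
    echo "lag acceptable, cutover may proceed"
fi
if ! lag_ok "NULL" 5; then
    echo "replication not running, do not cut over"
fi
```

Treating every non-numeric reading as a failure keeps a stopped or broken replica from being mistaken for one that has fully caught up.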

Stop Replication and Promote the Destination

# Execute cutover: Stop replication and promote destination
promote_destination_database() {
    local dest_host=$1
    
    echo "Promoting destination database to primary"
    
    # Stop replication
    ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
STOP SLAVE;
SHOW SLAVE STATUS\G

-- Verify replication is stopped
MYSQL
EOF
    
    # Make destination writable
    ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
SET GLOBAL read_only = 0;
SET GLOBAL super_read_only = 0;
SHOW VARIABLES LIKE '%read_only%';
MYSQL
EOF
    
    # Remove slave configuration
    ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
RESET SLAVE ALL;
MYSQL
EOF
    
    echo "Destination database is now primary"
}

# Verify promotion
verify_database_promotion() {
    local dest_host=$1
    
    ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
-- Check there are no replication threads
SHOW PROCESSLIST\G

-- Check database status
SHOW MASTER STATUS\G

-- Sanity check (table count only; this does not validate row data)
SELECT COUNT(*) FROM information_schema.tables;
MYSQL
EOF
}

Application Service Migration

Stop Services on the Source

# Gracefully stop services
stop_services_gracefully() {
    local source_host=$1
    local services=("nginx" "php-fpm" "nodejs" "custom-app")
    
    echo "Stopping services on $source_host"
    
    for service in "${services[@]}"; do
        echo "Stopping: $service"
        
        # Graceful stop
        ssh "root@$source_host" "systemctl stop $service"
        
        # Wait for graceful shutdown
        sleep 5
        
        # Force kill only if the unit is still active
        ssh "root@$source_host" \
            "systemctl is-active --quiet $service && systemctl kill -s SIGKILL $service" 2>/dev/null
        
        # Verify stopped
        if ssh "root@$source_host" "systemctl is-active --quiet $service"; then
            echo "Warning: $service still running"
        else
            echo "✓ $service stopped"
        fi
    done
}

Connection Draining

# Drain connections from load balancer
drain_connections_from_lb() {
    local source_host=$1
    local lb_host=$2
    
    echo "Draining connections from load balancer"
    
    # Mark server as unhealthy in load balancer
    ssh "root@$lb_host" << EOF
# Example for HAProxy
echo "set server backend/server01 state maint" | socat - UNIX-CONNECT:/var/run/haproxy/admin.sock
EOF
    
    # Wait for existing connections to drain
    echo "Waiting for connections to drain..."
    sleep 30
    
    # Check connection count
    local connections=$(ssh "root@$source_host" \
        "netstat -an | grep ESTABLISHED | wc -l")
    
    echo "Active connections: $connections"
}

Verify the Source Is Stopped

# Verify all services are stopped
verify_services_stopped() {
    local source_host=$1
    
    ssh "root@$source_host" << 'EOF'
#!/bin/bash

echo "Verifying all services are stopped"

services_running=$(systemctl list-units --type=service --state=running | \
    grep -v "system-getty\|user-runtime-dir\|user@" | \
    wc -l)

if [ "$services_running" -gt 2 ]; then
    echo "Warning: Still $services_running services running"
else
    echo "✓ Services successfully stopped"
fi

# Check for remaining connections
echo "Checking for remaining network connections..."
netstat -an | grep ESTABLISHED | grep -v "^unix" | wc -l
EOF
}

DNS Transition

Update DNS Records

# Plan DNS cutover
plan_dns_cutover() {
    local service_name=$1
    local new_ip=$2
    local current_ttl=300
    local low_ttl=60
    
    echo "DNS Cutover Plan for: $service_name.example.com"
    echo "Current IP: $(dig +short "$service_name.example.com")"
    echo "New IP: $new_ip"
    echo ""
    echo "Recommended steps:"
    echo "1. Lower TTL to $low_ttl (current: $current_ttl)"
    echo "   - This should be done 24 hours before cutover"
    echo "2. Monitor DNS propagation"
    echo "3. Update DNS A record to $new_ip"
    echo "4. Wait for TTL expiration ($low_ttl seconds)"
    echo "5. Monitor for issues"
}

# Lower TTL before migration
lower_dns_ttl() {
    local domain=$1
    local ttl=60
    
    echo "Lowering TTL for $domain to $ttl seconds"
    
    # Method 1: Using DNS provider API
    # Example with Route53:
    # aws route53 change-resource-record-sets \
    #   --hosted-zone-id Z123456 \
    #   --change-batch '{...}'
    
    # Method 2: Manual update via DNS control panel
    echo "Please update DNS TTL in control panel:"
    echo "Domain: $domain"
    echo "TTL: $ttl seconds"
    
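    # Verify TTL change: the second column of the answer is the TTL.
    # A recursive resolver reports the remaining cached TTL, so query
    # the authoritative nameserver to see the configured value.
    sleep 10
    dig +noall +answer "$domain"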
}

# Update DNS record
update_dns_record() {
    local domain=$1
    local new_ip=$2
    
    echo "Updating DNS record for $domain to $new_ip"
    
    # Using dig to check current IP
    local current_ip=$(dig +short "$domain")
    echo "Current IP: $current_ip"
    echo "New IP: $new_ip"
    
    # Update via provider API (example)
    # curl -X POST "https://api.namecheap.com/xml.response" \
    #   --data "ApiUser=user&ApiKey=key&... HostName=$domain&Address=$new_ip"
    
    # For nsupdate (if TSIG is configured)
    nsupdate << EOF
server ns1.example.com
zone example.com
update delete $domain A
update add $domain 60 A $new_ip
send
quit
EOF
}

# Monitor DNS propagation
monitor_dns_propagation() {
    local domain=$1
    local expected_ip=$2
    
    echo "Monitoring DNS propagation for $domain"
    
    local nameservers=("8.8.8.8" "1.1.1.1" "208.67.222.222")
    
    for ns in "${nameservers[@]}"; do
        while true; do
            local resolved_ip=$(dig +short "@$ns" "$domain")
            
            if [ "$resolved_ip" = "$expected_ip" ]; then
                echo "✓ $ns: Correct ($resolved_ip)"
                break  # this resolver has the new record; stop polling it
            else
                echo "⏳ $ns: Still old ($resolved_ip)"
            fi
            
            sleep 10
        done &
    done
    
    wait
}

Verification and Validation

Comprehensive Health Checks

# Post-migration validation
validate_migration() {
    local dest_host=$1
    
    echo "Post-Migration Validation"
    echo "=========================="
    
    validation_log="/var/log/migration-validation.log"
    
    {
        echo "[$(date)] Starting post-migration validation"
        
        # Check services are running
        echo ""
        echo "Service Status:"
        ssh "root@$dest_host" "systemctl status nginx mysql postgresql redis"
        
        # Check database connectivity
        echo ""
        echo "Database Checks:"
        ssh "root@$dest_host" << 'EOF'
mysql -u root -e "SELECT COUNT(*) FROM information_schema.TABLES;" 2>&1
psql -U postgres -c "SELECT datname FROM pg_database WHERE datname NOT LIKE 'template%';" 2>&1
EOF
        
        # Check application response
        echo ""
        echo "Application Response:"
        curl -s -o /dev/null -w "%{http_code}\n" "http://$dest_host/health"
        
        # Check disk usage
        echo ""
        echo "Disk Usage:"
        ssh "root@$dest_host" "df -h"
        
        # Check system resources
        echo ""
        echo "System Resources:"
        ssh "root@$dest_host" "top -bn1 | head -20"
        
    } | tee "$validation_log"
}

# Automated health check
cat > /usr/local/bin/post-migration-health-check.sh << 'EOF'
#!/bin/bash

DEST_HOST=$1
CHECKS_PASSED=0
CHECKS_FAILED=0

health_check() {
    local check_name=$1
    local command=$2
    
    echo -n "[$check_name] "
    
    if eval "$command"; then
        echo "✓ PASS"
        ((CHECKS_PASSED++))
    else
        echo "✗ FAIL"
        ((CHECKS_FAILED++))
    fi
}

# Perform checks
health_check "Web Server" "curl -sf http://$DEST_HOST > /dev/null"
health_check "Database" "ssh root@$DEST_HOST 'mysql -u root -e \"SELECT 1;\"' | grep -q 1"
health_check "Disk Space" "ssh root@$DEST_HOST 'df / | awk \"NR==2 {if (\\\$4 > 1000000) exit 0; else exit 1}\"'"
health_check "Memory" "ssh root@$DEST_HOST 'free | awk \"NR==2 {if (\\\$7 > 100000) exit 0; else exit 1}\"'"

echo ""
echo "Results: $CHECKS_PASSED passed, $CHECKS_FAILED failed"

# Exit non-zero when any check failed so callers can react
[ "$CHECKS_FAILED" -eq 0 ]
EOF

chmod +x /usr/local/bin/post-migration-health-check.sh

Data Integrity Verification

# Verify data integrity after migration
verify_data_integrity() {
    local source_host=$1
    local dest_host=$2
    
    echo "Verifying data integrity"
    
    # Compare file counts
    echo "Comparing file counts..."
    local source_files=$(ssh "root@$source_host" "find /var/www -type f | wc -l")
    local dest_files=$(ssh "root@$dest_host" "find /var/www -type f | wc -l")
    
    if [ "$source_files" -eq "$dest_files" ]; then
        echo "✓ File count matches: $source_files files"
    else
        echo "✗ File count mismatch: source=$source_files, dest=$dest_files"
    fi
    
    # Compare directory hashes. The per-file list is sorted by path because
    # find does not guarantee the same traversal order on both hosts.
    echo "Comparing directory checksums..."
    local source_hash=$(ssh "root@$source_host" "find /var/www -type f -exec sha256sum {} \; | sort -k 2 | sha256sum")
    local dest_hash=$(ssh "root@$dest_host" "find /var/www -type f -exec sha256sum {} \; | sort -k 2 | sha256sum")
    
    if [ "$source_hash" = "$dest_hash" ]; then
        echo "✓ All files match"
    else
        echo "✗ File mismatch detected"
        # Find specific differences
        ssh "root@$source_host" "find /var/www -type f -exec sha256sum {} \; | sort -k 2" > /tmp/source-hashes.txt
        ssh "root@$dest_host" "find /var/www -type f -exec sha256sum {} \; | sort -k 2" > /tmp/dest-hashes.txt
        diff /tmp/source-hashes.txt /tmp/dest-hashes.txt | head -20
    fi
    fi
    
    # Database integrity
    echo "Verifying database integrity..."
    ssh "root@$dest_host" << 'EOF'
# Check every table for errors (mysqlcheck runs CHECK TABLE by default)
mysqlcheck -u root --all-databases

mysql -u root << 'MYSQL'
-- Verify table counts, excluding system schemas
SELECT COUNT(*) FROM information_schema.TABLES WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
MYSQL
EOF
}
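
The directory comparison above depends on both hosts producing the same digest for the same content, which only holds when files are hashed in a stable order and with matching path names. The sketch below illustrates that idea locally. tree_digest is a hypothetical helper, not part of the original scripts, and it assumes GNU find, sort, and xargs plus sha256sum are available.

```shell
# Hypothetical helper: print one SHA-256 digest for a whole directory tree.
# Paths are made relative by changing into the directory, and the file list
# is sorted bytewise so the result does not depend on traversal order.
tree_digest() {
    (
        cd "$1" || exit 1
        find . -type f -print0 \
            | LC_ALL=C sort -z \
            | xargs -0 -r sha256sum \
            | sha256sum \
            | awk '{print $1}'
    )
}

# Two trees with identical content give identical digests
src_dir=$(mktemp -d)
dst_dir=$(mktemp -d)
echo "hello" > "$src_dir/index.html"
echo "hello" > "$dst_dir/index.html"

if [ "$(tree_digest "$src_dir")" = "$(tree_digest "$dst_dir")" ]; then
    echo "trees match"
fi

rm -rf "$src_dir" "$dst_dir"
```

To compare two servers, the same function could be run over SSH on each host and the two digests compared on the machine driving the migration.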

Rollback Procedures

Quick Rollback to Source

# Rollback procedure if migration fails
rollback_to_source() {
    local source_host=$1
    local dest_host=$2
    
    echo "INITIATING ROLLBACK TO SOURCE"
    echo "==============================="
    
    # Step 1: Update DNS back to source
    echo "Step 1: Reverting DNS to source"
    update_dns_record "service.example.com" "$(dig +short $source_host)"
    
    # Step 2: Stop destination services
    echo "Step 2: Stopping destination services"
    ssh "root@$dest_host" "systemctl stop nginx mysql"
    
    # Step 3: Start source services
    echo "Step 3: Starting source services"
    ssh "root@$source_host" "systemctl start nginx mysql"
    
    # Step 4: Remove destination from load balancer
    echo "Step 4: Removing destination from load balancer"
    # Update load balancer configuration
    
    # Step 5: Verify rollback
    echo "Step 5: Verifying rollback"
    sleep 10
    curl -f "http://$source_host/health" && echo "✓ Source online" || echo "✗ Source offline"
    
    echo "Rollback completed"
}

# Scheduled rollback option
schedule_auto_rollback() {
    local source_host=$1
    local dest_host=$2
    local rollback_timeout_minutes=30
    
    # Schedule automatic rollback if destination stays unhealthy
    cat > /usr/local/bin/auto-rollback.sh << 'EOF'
#!/bin/bash

DEST_HOST=$1
TIMEOUT_MINUTES=${2:-30}
START_TIME=$(date +%s)

while true; do
    ELAPSED=$(($(date +%s) - START_TIME))
    ELAPSED_MINUTES=$((ELAPSED / 60))
    
    if ! curl -sf "http://$DEST_HOST/health" > /dev/null; then
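        if [ $ELAPSED_MINUTES -gt $TIMEOUT_MINUTES ]; then
            echo "Triggering automatic rollback"
            # Assumes rollback_to_source has been saved as a standalone
            # script at this path; this guide does not create it
            /usr/local/bin/rollback-to-source.sh "$DEST_HOST"
            break
        fi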
    else
        echo "Destination healthy - no rollback needed"
        break
    fi
    
    sleep 30
done
EOF
    
    chmod +x /usr/local/bin/auto-rollback.sh
}

Conclusion

Successful live migrations require:

  1. Planning: Detailed documentation of services, data, and dependencies
  2. Synchronization: Multiple sync passes to minimize cutover time
  3. Replication: Database replication configured and validated before cutover
  4. DNS: TTL lowered and DNS provider ready for quick updates
  5. Verification: Comprehensive health checks on destination
  6. Rollback: Quick rollback procedures rehearsed and ready

The key to a near-zero-downtime migration is thorough preparation, continuous synchronization, and careful DNS cutover. Always maintain the ability to roll back quickly if issues arise.
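
The phases listed above can be chained into a single driver so that a failure in one phase stops the ones after it. The following is a rough sketch under stated assumptions rather than a finished tool: run_phase, migrate, and the DRY_RUN switch are hypothetical names introduced here, the /var/www path and the example hosts and IP are placeholders, and the phase commands refer to the functions and scripts defined earlier in this guide. With DRY_RUN=1, which is the default in the sketch, nothing is executed and the sequence is only printed.

```shell
# Hypothetical cutover driver. With DRY_RUN=1 each phase is only printed,
# so the order can be reviewed without touching any server.
DRY_RUN=${DRY_RUN:-1}

run_phase() {
    phase_name=$1
    shift
    echo "==> $phase_name"
    if [ "$DRY_RUN" = "1" ]; then
        echo "    (dry run) $*"
        return 0
    fi
    "$@"
}

migrate() {
    src=$1
    dst=$2
    domain=$3
    new_ip=$4
    # Each phase must succeed before the next one starts
    run_phase "Pre-checks"   /usr/local/bin/pre-migration-check.sh "$src" "$dst" || return 1
    run_phase "Initial sync" perform_initial_sync "$src" /var/www "$dst" /var/www || return 1
    run_phase "Stop source"  stop_services_gracefully "$src" || return 1
    run_phase "Final sync"   perform_initial_sync "$src" /var/www "$dst" /var/www || return 1
    run_phase "Promote DB"   promote_destination_database "$dst" || return 1
    run_phase "DNS cutover"  update_dns_record "$domain" "$new_ip" || return 1
    run_phase "Health check" /usr/local/bin/post-migration-health-check.sh "$dst" || return 1
}

migrate source.example.com dest.example.com service.example.com 203.0.113.10
```

Running the dry run during rehearsal gives the team a reviewable checklist, and the same ordering makes it clear at which phase a rollback would need to begin.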