Live Migration of Services Between Servers
Live migration allows you to move running services from one server to another with minimal downtime. This guide covers planning, data synchronization, database replication cutover, DNS transitions, and comprehensive verification procedures for zero-downtime migrations.
Table of Contents
- Migration Planning
- Pre-Migration Checklist
- Data Synchronization with Rsync
- Database Replication Cutover
- Application Service Migration
- DNS Transition
- Verification and Validation
- Rollback Procedures
- Conclusion
Migration Planning
Create Migration Plan Document
# Migration plan template
cat > /tmp/migration-plan.md << 'EOF'
# Service Migration Plan
## Executive Summary
- Source Server: [hostname/IP]
- Destination Server: [hostname/IP]
- Services: [list]
- Estimated Duration: [hours]
- Maintenance Window: [date/time]
- Risk Level: [Low/Medium/High]
## Scope
### Services to migrate:
- Service A (port 8080)
- Service B (port 3306)
- Service C (port 443)
### Data to migrate:
- Application files: [size]
- Database: [size]
- User data: [size]
- Configuration files: [list]
## Dependencies
- Service A depends on: Database, Cache
- Service B depends on: Message Queue
- Service C depends on: Certificate Store
## Rollback Plan
1. [Step 1]
2. [Step 2]
3. [Step 3]
## Success Criteria
- [ ] All services running on destination
- [ ] No data loss detected
- [ ] No increase in error rates
- [ ] Users can access all features
- [ ] Performance metrics within acceptable range
EOF
cat /tmp/migration-plan.md
Risk Assessment
# Assess migration risks
assess_migration_risk() {
echo "Migration Risk Assessment"
echo "=========================="
local high_risk_items=(
"Database with active connections"
"Real-time streaming services"
"Services with local state"
"Large databases (>100GB)"
"Custom network configurations"
)
local medium_risk_items=(
"Static web servers"
"Cache layers"
"Read-only databases"
"Containerized applications"
)
local low_risk_items=(
"Stateless services"
"Load-balanced applications"
"Services with built-in redundancy"
)
echo ""
echo "High Risk (Requires careful planning):"
printf '%s\n' "${high_risk_items[@]}"
echo ""
echo "Medium Risk (Standard procedures)"
printf '%s\n' "${medium_risk_items[@]}"
echo ""
echo "Low Risk (Quick migration expected)"
printf '%s\n' "${low_risk_items[@]}"
}
assess_migration_risk
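The categories above can be folded into a simple overall rating. The helper below is a sketch with invented thresholds; the counts of applicable items per category would come from your own service inventory:

```bash
# Hypothetical helper: derive an overall risk level from how many items
# in each category apply to this migration. Any high-risk item dominates.
overall_risk() {
    local high=$1 medium=$2 low=$3
    if [ "$high" -gt 0 ]; then
        echo "High"
    elif [ "$medium" -gt 0 ]; then
        echo "Medium"
    else
        echo "Low"
    fi
}
```

For example, `overall_risk 0 2 1` reports `Medium`: two medium-risk items and no high-risk ones.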
Pre-Migration Checklist
# Pre-migration validation checklist
cat > /usr/local/bin/pre-migration-check.sh << 'EOF'
#!/bin/bash
SOURCE_HOST=$1
DEST_HOST=$2
if [ -z "$SOURCE_HOST" ] || [ -z "$DEST_HOST" ]; then
echo "Usage: $0 <source_host> <dest_host>"
exit 1
fi
CHECKS_PASSED=0
CHECKS_FAILED=0
run_check() {
local check_name=$1
local check_command=$2
echo -n "Checking: $check_name... "
if eval "$check_command"; then
echo "✓ PASS"
((CHECKS_PASSED++))
else
echo "✗ FAIL"
((CHECKS_FAILED++))
fi
}
# Network connectivity
run_check "Source SSH connectivity" \
"ssh -o ConnectTimeout=5 root@$SOURCE_HOST 'exit' 2>/dev/null"
run_check "Destination SSH connectivity" \
"ssh -o ConnectTimeout=5 root@$DEST_HOST 'exit' 2>/dev/null"
# Storage capacity
run_check "Source disk usage" \
"ssh root@$SOURCE_HOST 'df /data | awk \"NR==2 {if (\\\$4 > 10000000) exit 0; else exit 1}\"'"
run_check "Destination free space" \
"ssh root@$DEST_HOST 'df /data | awk \"NR==2 {if (\\\$4 > 100000000) exit 0; else exit 1}\"'"
# Service availability
run_check "Source services running" \
"ssh root@$SOURCE_HOST 'systemctl is-active my-service' | grep -q active"
run_check "Destination services configured" \
"ssh root@$DEST_HOST 'systemctl list-units --type=service' | wc -l"
# Database connectivity
run_check "Source database reachable" \
"ssh root@$SOURCE_HOST 'mysql -u root -e \"SELECT 1;\" 2>/dev/null' | grep -q 1"
# Report
echo ""
echo "Pre-Migration Check Results"
echo "============================"
echo "Passed: $CHECKS_PASSED"
echo "Failed: $CHECKS_FAILED"
if [ $CHECKS_FAILED -eq 0 ]; then
echo "Status: ✓ Ready for migration"
exit 0
else
echo "Status: ✗ Fix failures before migration"
exit 1
fi
EOF
chmod +x /usr/local/bin/pre-migration-check.sh
Data Synchronization with Rsync
Initial Full Sync
# Perform initial data synchronization
perform_initial_sync() {
local source_host=$1
local source_path=$2
local dest_host=$3
local dest_path=$4
local sync_log="/var/log/migration-sync.log"
echo "[$(date)] Starting initial data sync" | tee -a "$sync_log"
# Full sync with verification
rsync -avz \
--progress \
--no-perms \
--delete \
--checksum \
"$source_host:$source_path/" \
"$dest_host:$dest_path/" \
2>&1 | tee -a "$sync_log"
if [ ${PIPESTATUS[0]} -eq 0 ]; then
echo "[$(date)] Initial sync completed" >> "$sync_log"
return 0
else
echo "[$(date)] Initial sync failed" >> "$sync_log"
return 1
fi
}
# Example usage:
# perform_initial_sync "source.example.com" "/var/www" "dest.example.com" "/var/www"
Continuous Delta Sync
# Continuous synchronization for minimal cutover time
continuous_delta_sync() {
local source_host=$1
local source_path=$2
local dest_host=$3
local dest_path=$4
echo "Starting continuous delta synchronization"
echo "Press Ctrl+C to stop"
# Run rsync in daemon mode for continuous sync
while true; do
echo "[$(date)] Running delta sync..."
rsync -avz \
--progress \
--no-perms \
--checksum \
--delete \
--filter=':- .gitignore' \
"$source_host:$source_path/" \
"$dest_host:$dest_path/"
sync_status=$?
if [ $sync_status -eq 0 ]; then
echo "[$(date)] Delta sync completed successfully"
else
echo "[$(date)] Delta sync completed with status $sync_status"
fi
# Wait before next sync (e.g., every 5 minutes)
sleep 300
done
}
# Alternative: Use inotify for real-time sync
realtime_sync_with_inotify() {
local source_host=$1
local source_path=$2
local dest_host=$3
local dest_path=$4
# On source server: install inotify-tools
ssh "root@$source_host" 'apt-get install -y inotify-tools'
# Monitor directory for changes and sync
ssh "root@$source_host" << 'EOF'
while inotifywait -r -e modify,create,delete /var/www; do
rsync -avz --delete /var/www/ remote-server:/var/www/
done
EOF
}
Bandwidth-Limited Sync
# Sync with bandwidth limiting to avoid impacting production
bandwidth_limited_sync() {
local source_host=$1
local source_path=$2
local dest_host=$3
local dest_path=$4
local max_bandwidth_mbps=50
# Convert to KB/s (bandwidth in Mbps / 8 * 1000)
local bandwidth_limit=$((max_bandwidth_mbps * 1000 / 8))
rsync -avz \
--bwlimit="$bandwidth_limit" \
--progress \
"$source_host:$source_path/" \
"$dest_host:$dest_path/"
}
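The Mbps-to-KB/s conversion above is easy to get wrong by a factor of 8, so isolating it makes it testable. When given a bare number, rsync's `--bwlimit` expects KB/s:

```bash
# Convert a link rate in megabits/s to the KB/s figure rsync --bwlimit
# expects: Mbps * 1000 = Kbit/s, divided by 8 bits per byte = KB/s.
mbps_to_bwlimit() {
    echo $(( $1 * 1000 / 8 ))
}
```

For example, `rsync --bwlimit="$(mbps_to_bwlimit 50)" ...` caps the transfer at 6250 KB/s.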
# Monitor sync progress
monitor_sync_progress() {
local log_file="/var/log/migration-sync.log"
watch -n 1 'tail -20 '"$log_file"' | grep -E "to-check|speedup"'
}
Database Replication Cutover
Setup MySQL Replication
# Configure MySQL replication for migration
setup_mysql_replication_for_migration() {
local source_host=$1
local dest_host=$2
echo "Setting up MySQL replication from $source_host to $dest_host"
# On source server: Enable binary logging
ssh "root@$source_host" << 'EOF'
mysql -u root << 'MYSQL'
SET GLOBAL binlog_format = 'ROW';
CREATE USER 'repl'@'%' IDENTIFIED BY 'replication_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
MYSQL
EOF
# Get master log position
local master_log=$(ssh "root@$source_host" "mysql -u root -sNe \"SHOW MASTER STATUS\\G\" | grep 'File:' | awk '{print \$2}'")
local master_pos=$(ssh "root@$source_host" "mysql -u root -sNe \"SHOW MASTER STATUS\\G\" | grep 'Position:' | awk '{print \$2}'")
echo "Master log: $master_log at position $master_pos"
# On destination server: Configure as slave
ssh "root@$dest_host" << MYSQL
mysql -u root << 'SLAVE'
CHANGE MASTER TO
MASTER_HOST='$source_host',
MASTER_USER='repl',
MASTER_PASSWORD='replication_password',
MASTER_LOG_FILE='$master_log',
MASTER_LOG_POS=$master_pos;
START SLAVE;
SHOW SLAVE STATUS\G
SLAVE
MYSQL
}
# Monitor replication lag
monitor_replication_during_migration() {
local dest_host=$1
local lag_threshold=5 # seconds
echo "Monitoring MySQL replication lag..."
while true; do
local lag=$(ssh "root@$dest_host" \
"mysql -u root -sNe \"SHOW SLAVE STATUS\\G\" | grep 'Seconds_Behind_Master' | awk '{print \$NF}'")
echo "[$(date)] Replication lag: ${lag}s"
if [ "$lag" -eq "NULL" ] || [ "$lag" -gt "$lag_threshold" ]; then
echo "Warning: Replication lag is high or not running"
fi
sleep 10
done
}
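`Seconds_Behind_Master` reads `NULL` whenever replication is not running, which breaks naive numeric comparisons. A small classifier (a sketch; the threshold is a parameter) keeps that handling in one place:

```bash
# Classify a Seconds_Behind_Master value as "stopped" (NULL or empty),
# "lagging" (above the threshold), or "ok".
classify_lag() {
    local lag=$1 threshold=${2:-5}
    if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
        echo "stopped"
    elif [ "$lag" -gt "$threshold" ]; then
        echo "lagging"
    else
        echo "ok"
    fi
}
```

A monitoring loop can then branch on `$(classify_lag "$lag" 5)` instead of repeating the NULL check inline.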
Stop Replication and Promote Destination
# Execute cutover: Stop replication and promote destination
promote_destination_database() {
local dest_host=$1
echo "Promoting destination database to primary"
# Stop replication
ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
STOP SLAVE;
SHOW SLAVE STATUS\G
-- Verify replication is stopped
MYSQL
EOF
# Make destination writable
ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
SET GLOBAL read_only = 0;
SET GLOBAL super_read_only = 0;
SHOW VARIABLES LIKE '%read_only%';
MYSQL
EOF
# Remove slave configuration
ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
RESET SLAVE ALL;
MYSQL
EOF
echo "Destination database is now primary"
}
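Promotion can be confirmed mechanically by parsing `SHOW SLAVE STATUS\G` output: both `Slave_IO_Running` and `Slave_SQL_Running` should read `No`, and after `RESET SLAVE ALL` the output is empty. A sketch that checks that output on stdin:

```bash
# Return 0 if the SHOW SLAVE STATUS\G text on stdin shows replication
# fully stopped (or no replication configured at all).
replication_stopped() {
    local status
    status=$(cat)
    # Empty output means RESET SLAVE ALL already removed the configuration
    [ -z "$status" ] && return 0
    ! printf '%s\n' "$status" | grep -Eq 'Slave_(IO|SQL)_Running: Yes'
}
```

Usage: `ssh "root@$dest_host" 'mysql -u root -e "SHOW SLAVE STATUS\G"' | replication_stopped && echo "safe to proceed"`.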
# Verify promotion
verify_database_promotion() {
local dest_host=$1
ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
-- Check there are no replication threads
SHOW PROCESSLIST\G
-- Check database status
SHOW MASTER STATUS\G
-- Verify data integrity
SELECT COUNT(*) FROM information_schema.tables;
MYSQL
EOF
}
Application Service Migration
Stop Services on Source
# Gracefully stop services
stop_services_gracefully() {
local source_host=$1
local services=("nginx" "php-fpm" "nodejs" "custom-app")
echo "Stopping services on $source_host"
for service in "${services[@]}"; do
echo "Stopping: $service"
# Graceful stop
ssh "root@$source_host" "systemctl stop $service"
# Wait for graceful shutdown
sleep 5
# Force kill if still running
ssh "root@$source_host" "pkill -9 $service" 2>/dev/null
# Verify stopped
ssh "root@$source_host" "systemctl is-active $service" && \
echo "Warning: $service still running" || \
echo "✓ $service stopped"
done
}
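The fixed `sleep 5` above is a guess; polling with a timeout is safer. A generic helper (a sketch) that retries a command once per second until it succeeds or the timeout expires:

```bash
# Run a command every second until it succeeds or timeout_s elapses.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout_s=$1; shift
    local waited=0
    until "$@"; do
        waited=$((waited + 1))
        [ "$waited" -ge "$timeout_s" ] && return 1
        sleep 1
    done
    return 0
}
```

For a service stop, something like `wait_until 30 sh -c "! systemctl is-active --quiet nginx"` waits up to 30 seconds for the unit to leave the active state before escalating to a kill.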
Drain Connections
# Drain connections from load balancer
drain_connections_from_lb() {
local source_host=$1
local lb_host=$2
echo "Draining connections from load balancer"
# Mark server as unhealthy in load balancer
ssh "root@$lb_host" << EOF
# Example for HAProxy
echo "set server backend/server01 state maint" | socat - UNIX-CONNECT:/var/run/haproxy/admin.sock
EOF
# Wait for existing connections to drain
echo "Waiting for connections to drain..."
sleep 30
# Check connection count
local connections=$(ssh "root@$source_host" \
"netstat -an | grep ESTABLISHED | wc -l")
echo "Active connections: $connections"
}
Verify Source Stopped
# Verify all services are stopped
verify_services_stopped() {
local source_host=$1
ssh "root@$source_host" << 'EOF'
#!/bin/bash
echo "Verifying all services are stopped"
services_running=$(systemctl list-units --type=service --state=running | \
grep -v "system-getty\|user-runtime-dir\|user@" | \
wc -l)
if [ "$services_running" -gt 2 ]; then
echo "Warning: Still $services_running services running"
else
echo "✓ Services successfully stopped"
fi
# Check for remaining connections
echo "Checking for remaining network connections..."
netstat -an | grep ESTABLISHED | grep -v "^unix" | wc -l
EOF
}
DNS Transition
Update DNS Records
# Plan DNS cutover
plan_dns_cutover() {
local service_name=$1
local new_ip=$2
local current_ttl=300
local low_ttl=60
echo "DNS Cutover Plan for: $service.example.com"
echo "Current IP: $(dig +short service.example.com)"
echo "New IP: $new_ip"
echo ""
echo "Recommended steps:"
echo "1. Lower TTL to $low_ttl (current: $current_ttl)"
echo " - This should be done 24 hours before cutover"
echo "2. Monitor DNS propagation"
echo "3. Update DNS A record to $new_ip"
echo "4. Wait for TTL expiration ($low_ttl seconds)"
echo "5. Monitor for issues"
}
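The waiting periods in the plan follow from TTL arithmetic: after a record changes, a resolver may keep serving the old answer for up to the TTL that was in effect when it last cached the record. A small helper (a sketch) for the worst-case wait from "now":

```bash
# Worst-case seconds until every cache sees a new record. A resolver that
# fetched just before the TTL was lowered can hold the old record for
# old_ttl seconds from that fetch; once that window has passed, the
# bound drops to new_ttl.
dns_worst_case_wait() {
    local old_ttl=$1 new_ttl=$2 since_change=$3
    local remaining=$(( old_ttl - since_change ))
    if [ "$remaining" -gt "$new_ttl" ]; then
        echo "$remaining"
    else
        echo "$new_ttl"
    fi
}
```

This is why the plan lowers the TTL well in advance: with a day-long original TTL lowered only an hour ago, `dns_worst_case_wait 86400 60 3600` still reports 82800 seconds of possible staleness.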
# Lower TTL before migration
lower_dns_ttl() {
local domain=$1
local ttl=60
echo "Lowering TTL for $domain to $ttl seconds"
# Method 1: Using DNS provider API
# Example with Route53:
# aws route53 change-resource-record-sets \
# --hosted-zone-id Z123456 \
# --change-batch '{...}'
# Method 2: Manual update via DNS control panel
echo "Please update DNS TTL in control panel:"
echo "Domain: $domain"
echo "TTL: $ttl seconds"
# Verify TTL change
sleep 10
dig "$domain" | grep -i "TTL"
}
# Update DNS record
update_dns_record() {
local domain=$1
local new_ip=$2
echo "Updating DNS record for $domain to $new_ip"
# Using dig to check current IP
local current_ip=$(dig +short "$domain")
echo "Current IP: $current_ip"
echo "New IP: $new_ip"
# Update via provider API (example)
# curl -X POST "https://api.namecheap.com/xml.response" \
# --data "ApiUser=user&ApiKey=key&... HostName=$domain&Address=$new_ip"
# For nsupdate (if TSIG is configured)
nsupdate << EOF
server ns1.example.com
zone example.com
update delete $domain A
update add $domain 60 A $new_ip
send
EOF
}
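Generating the nsupdate batch as text makes it reviewable and testable before anything is sent to the nameserver. A sketch; the server and zone names are placeholders:

```bash
# Emit an nsupdate command batch that replaces an A record.
build_nsupdate_batch() {
    local server=$1 zone=$2 fqdn=$3 new_ip=$4 ttl=${5:-60}
    printf 'server %s\n' "$server"
    printf 'zone %s\n' "$zone"
    printf 'update delete %s A\n' "$fqdn"
    printf 'update add %s %s A %s\n' "$fqdn" "$ttl" "$new_ip"
    printf 'send\n'
}
```

Usage: inspect the output first, then pipe it, e.g. `build_nsupdate_batch ns1.example.com example.com www.example.com 203.0.113.10 | nsupdate`.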
# Monitor DNS propagation
monitor_dns_propagation() {
local domain=$1
local expected_ip=$2
echo "Monitoring DNS propagation for $domain"
local nameservers=("8.8.8.8" "1.1.1.1" "208.67.222.222")
for ns in "${nameservers[@]}"; do
while true; do
local resolved_ip=$(dig +short "@$ns" "$domain")
if [ "$resolved_ip" = "$expected_ip" ]; then
echo "✓ $ns: Correct ($resolved_ip)"
else
echo "⏳ $ns: Still old ($resolved_ip)"
fi
sleep 10
done &
done
wait
}
Verification and Validation
Comprehensive Health Checks
# Post-migration validation
validate_migration() {
local dest_host=$1
echo "Post-Migration Validation"
echo "=========================="
validation_log="/var/log/migration-validation.log"
{
echo "[$(date)] Starting post-migration validation"
# Check services are running
echo ""
echo "Service Status:"
ssh "root@$dest_host" "systemctl status nginx mysql postgresql redis"
# Check database connectivity
echo ""
echo "Database Checks:"
ssh "root@$dest_host" << 'EOF'
mysql -u root -e "SELECT COUNT(*) FROM information_schema.TABLES;" 2>&1
psql -U postgres -c "SELECT datname FROM pg_database WHERE datname NOT LIKE 'template%';" 2>&1
EOF
# Check application response
echo ""
echo "Application Response:"
curl -s -o /dev/null -w "%{http_code}" "http://$dest_host/health"
# Check disk usage
echo ""
echo "Disk Usage:"
ssh "root@$dest_host" "df -h"
# Check system resources
echo ""
echo "System Resources:"
ssh "root@$dest_host" "top -bn1 | head -20"
} | tee "$validation_log"
}
# Automated health check
cat > /usr/local/bin/post-migration-health-check.sh << 'EOF'
#!/bin/bash
DEST_HOST=$1
CHECKS_PASSED=0
CHECKS_FAILED=0
health_check() {
local check_name=$1
local command=$2
echo -n "[$check_name] "
if eval "$command"; then
echo "✓ PASS"
((CHECKS_PASSED++))
else
echo "✗ FAIL"
((CHECKS_FAILED++))
fi
}
# Perform checks
health_check "Web Server" "curl -sf http://$DEST_HOST > /dev/null"
health_check "Database" "ssh root@$DEST_HOST 'mysql -u root -e \"SELECT 1;\"' | grep -q 1"
health_check "Disk Space" "ssh root@$DEST_HOST 'df / | awk \"NR==2 {if (\\\$4 > 1000000) exit 0; else exit 1}\"'"
health_check "Memory" "ssh root@$DEST_HOST 'free | awk \"NR==2 {if (\\\$7 > 100000) exit 0; else exit 1}\"'"
echo ""
echo "Results: $CHECKS_PASSED passed, $CHECKS_FAILED failed"
EOF
chmod +x /usr/local/bin/post-migration-health-check.sh
Data Integrity Verification
# Verify data integrity after migration
verify_data_integrity() {
local source_host=$1
local dest_host=$2
echo "Verifying data integrity"
# Compare file counts
echo "Comparing file counts..."
local source_files=$(ssh "root@$source_host" "find /var/www -type f | wc -l")
local dest_files=$(ssh "root@$dest_host" "find /var/www -type f | wc -l")
if [ "$source_files" -eq "$dest_files" ]; then
echo "✓ File count matches: $source_files files"
else
echo "✗ File count mismatch: source=$source_files, dest=$dest_files"
fi
# Compare directory hashes
echo "Comparing directory checksums..."
local source_hash=$(ssh "root@$source_host" "find /var/www -type f -exec sha256sum {} \; | sha256sum")
local dest_hash=$(ssh "root@$dest_host" "find /var/www -type f -exec sha256sum {} \; | sha256sum")
if [ "$source_hash" = "$dest_hash" ]; then
echo "✓ All files match"
else
echo "✗ File mismatch detected"
# Find specific differences
ssh "root@$source_host" "find /var/www -type f -exec sha256sum {} \;" > /tmp/source-hashes.txt
ssh "root@$dest_host" "find /var/www -type f -exec sha256sum {} \;" > /tmp/dest-hashes.txt
diff /tmp/source-hashes.txt /tmp/dest-hashes.txt | head -20
fi
# Database integrity
echo "Verifying database integrity..."
ssh "root@$dest_host" << 'EOF'
mysql -u root << 'MYSQL'
-- Verify table counts in user schemas
SELECT COUNT(*) FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
MYSQL
# CHECK TABLE cannot run against information_schema views; check the real tables instead
mysqlcheck -u root --all-databases
EOF
}
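The checksum comparison above only works if both hosts hash the same paths in the same order; `find` makes no ordering guarantee. A local helper that produces an order-independent, relative-path digest (relies on coreutils `sha256sum`):

```bash
# Single digest for a directory tree: hash every file, use relative
# paths, sort so traversal order does not matter, then hash the list.
dir_digest() {
    (cd "$1" && find . -type f -exec sha256sum {} \; | sort) | sha256sum | awk '{print $1}'
}
```

Run it on both hosts (locally or over ssh) and compare the two strings; any difference in file content or layout changes the digest.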
Rollback Procedures
Quick Rollback to Source
# Rollback procedure if migration fails
rollback_to_source() {
local source_host=$1
local dest_host=$2
echo "INITIATING ROLLBACK TO SOURCE"
echo "==============================="
# Step 1: Update DNS back to source
echo "Step 1: Reverting DNS to source"
update_dns_record "service.example.com" "$(dig +short $source_host)"
# Step 2: Stop destination services
echo "Step 2: Stopping destination services"
ssh "root@$dest_host" "systemctl stop nginx mysql"
# Step 3: Start source services
echo "Step 3: Starting source services"
ssh "root@$source_host" "systemctl start nginx mysql"
# Step 4: Remove destination from load balancer
echo "Step 4: Removing destination from load balancer"
# Update load balancer configuration
# Step 5: Verify rollback
echo "Step 5: Verifying rollback"
sleep 10
curl -f "http://$source_host/health" && echo "✓ Source online" || echo "✗ Source offline"
echo "Rollback completed"
}
# Scheduled rollback option
schedule_auto_rollback() {
local source_host=$1
local dest_host=$2
local rollback_timeout_minutes=30
# Schedule automatic rollback if destination stays unhealthy
cat > /usr/local/bin/auto-rollback.sh << 'EOF'
#!/bin/bash
DEST_HOST=$1
TIMEOUT_MINUTES=${2:-30}
START_TIME=$(date +%s)
while true; do
ELAPSED=$(($(date +%s) - START_TIME))
ELAPSED_MINUTES=$((ELAPSED / 60))
if ! curl -sf "http://$DEST_HOST/health" > /dev/null; then
if [ $ELAPSED_MINUTES -gt $TIMEOUT_MINUTES ]; then
echo "Triggering automatic rollback"
/usr/local/bin/rollback-to-source.sh "$DEST_HOST"
break
fi
else
echo "Destination healthy - no rollback needed"
break
fi
sleep 30
done
EOF
chmod +x /usr/local/bin/auto-rollback.sh
}
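The decision inside the auto-rollback loop can be isolated into a pure function, so the timeout logic is testable without a live endpoint (a sketch):

```bash
# Decide what the auto-rollback loop should do given the latest health
# probe result ("yes"/"no") and elapsed time.
# Prints one of: "healthy", "rollback", "keep-waiting".
rollback_decision() {
    local healthy=$1 elapsed_minutes=$2 timeout_minutes=$3
    if [ "$healthy" = "yes" ]; then
        echo "healthy"
    elif [ "$elapsed_minutes" -gt "$timeout_minutes" ]; then
        echo "rollback"
    else
        echo "keep-waiting"
    fi
}
```

The loop then reduces to probing with curl, mapping the exit status to `yes`/`no`, and acting on the returned verdict.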
Conclusion
Successful live migrations require:
- Planning: Detailed documentation of services, data, and dependencies
- Synchronization: Multiple sync passes to minimize cutover time
- Replication: Database replication configured and validated before cutover
- DNS: TTL lowered and DNS provider ready for quick updates
- Verification: Comprehensive health checks on destination
- Rollback: Quick rollback procedures rehearsed and ready
The key to zero-downtime migration is thorough preparation, continuous synchronization, and careful DNS cutover. Always maintain the ability to roll back quickly if issues arise.