Multi-Region High Availability Architecture

Multi-region high availability provides geographic redundancy and protection against entire datacenter failures. This guide covers distributed database replication, global load balancing, automated failover, split-brain prevention, and practical deployment patterns for true fault tolerance across regions.

Table of Contents

  1. Multi-Region Architecture Overview
  2. Geographic Redundancy Planning
  3. Database Replication Across Regions
  4. Global Load Balancing
  5. Automated Regional Failover
  6. Split-Brain Prevention
  7. Monitoring and Alerting
  8. Testing and Drills
  9. Conclusion

Multi-Region Architecture Overview

Multi-region architectures typically include:

  • Primary Region: Main production deployment
  • Secondary Regions: Standby or active-active deployments
  • Global Entry Point: DNS or load balancer directing traffic
  • Replication Layer: Continuous data synchronization
  • Orchestration: Automated failover and recovery
# Multi-region topology visualization
cat > /tmp/topology.txt << 'EOF'
Multi-Region HA Architecture

Internet
  |
  +-- Global Load Balancer/DNS
  |
  +-----------+-----------+
  |           |           |
Region A    Region B    Region C
(Primary)   (Active)    (Active)
Web | DB    Web | DB    Web | DB
  \___________|___________/
   Database Replication Mesh

Components:
- Web Layer: Stateless (can lose any instance)
- Database Layer: Replicated across regions
- Load Balancing: Global and regional
- DNS: GeoDNS or application-level routing
- Monitoring: Centralized across all regions

RTO: <5 minutes (automated failover)
RPO: <1 minute (continuous replication)
EOF

cat /tmp/topology.txt
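The RTO/RPO targets above are only useful if they are checked after every drill. A minimal sketch of that check (the thresholds come from the diagram; the function names are illustrative):

```shell
#!/bin/bash
# Compare measured recovery times against the stated targets
# (RTO < 5 minutes, RPO < 1 minute, expressed here in seconds).

RTO_TARGET_SECONDS=300
RPO_TARGET_SECONDS=60

# meets_target MEASURED TARGET -> exit 0 if within target
meets_target() {
    local measured=$1 target=$2
    [ "$measured" -le "$target" ]
}

report_objective() {
    local name=$1 measured=$2 target=$3
    if meets_target "$measured" "$target"; then
        echo "$name OK: ${measured}s (target ${target}s)"
    else
        echo "$name MISSED: ${measured}s (target ${target}s)"
    fi
}

# Hypothetical measurements from a drill:
report_objective "RTO" 240 "$RTO_TARGET_SECONDS"
report_objective "RPO" 90  "$RPO_TARGET_SECONDS"
```

A drill that restores service in 240s meets the 300s RTO, while 90s of lost writes misses the 60s RPO.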

Geographic Redundancy Planning

Select Datacenter Locations

# Datacenter selection criteria
cat > /tmp/region-selection.md << 'EOF'
# Multi-Region Selection Criteria

## Geographic Distribution
- Minimum 500km separation between regions
- Different geological zones (reduce disaster risk)
- Internet backbone diversity
- Regulatory compliance (data residency)

## Region Pairs (AWS Example)
Primary: us-east-1 (N. Virginia)
Secondary: us-west-2 (Oregon)
Tertiary: eu-west-1 (Ireland)

## Network Connectivity Requirements
- Minimum 50 Mbps dedicated inter-region link
- <100ms latency preferred for sync replication
- 99.9%+ uptime SLA
- BGP redundancy

## Cost Considerations
- Data transfer costs (egress between regions)
- Instance costs per region
- Storage replication costs
- Monitoring and logging infrastructure

## Compliance Requirements
- Data sovereignty (EU GDPR, etc.)
- Industry regulations (HIPAA, PCI-DSS)
- Backup location requirements
- Disaster recovery mandates
EOF

cat /tmp/region-selection.md
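The <100 ms guideline for synchronous replication can be verified from each candidate region with ping. A sketch (the region pairs and RTT figures are hypothetical survey numbers, not measurements):

```shell
#!/bin/bash
# Check measured inter-region latency against the sync-replication guideline.

SYNC_LATENCY_LIMIT_MS=100

# latency_ok AVG_RTT_MS -> 0 if suitable for synchronous replication
latency_ok() {
    # Compare as integers; strip any decimal part from the ping average
    [ "${1%.*}" -lt "$SYNC_LATENCY_LIMIT_MS" ]
}

# measure_latency HOST -> average RTT in ms (requires network access)
measure_latency() {
    ping -c 5 -q "$1" | awk -F'/' '/^rtt|^round-trip/ {print $5}'
}

# Hypothetical figures from a latency survey:
for entry in "us-east-1:us-west-2:68.4" "us-east-1:eu-west-1:81.2" "us-west-2:eu-west-1:132.7"; do
    pair=${entry%:*}; rtt=${entry##*:}
    if latency_ok "$rtt"; then
        echo "$pair: ${rtt}ms - sync replication feasible"
    else
        echo "$pair: ${rtt}ms - use async replication"
    fi
done
```

In practice, run `measure_latency` from a host in each region against its peers and feed the averages to `latency_ok`.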

Create Multi-Region Infrastructure Template

# Infrastructure as Code example (Terraform-like)
cat > /tmp/multi-region-infrastructure.sh << 'EOF'
#!/bin/bash

declare -A regions=(
    ["primary"]="us-east-1"
    ["secondary"]="us-west-2"
    ["tertiary"]="eu-west-1"
)

declare -A region_names=(
    ["us-east-1"]="US East (N. Virginia)"
    ["us-west-2"]="US West (Oregon)"
    ["eu-west-1"]="EU (Ireland)"
)

# Deploy infrastructure in each region
deploy_regional_infrastructure() {
    for region_key in "${!regions[@]}"; do
        local region="${regions[$region_key]}"
        
        echo "Deploying to region: $region - ${region_names[$region]}"
        
        # Create VPC
        # Create subnets
        # Create security groups
        # Launch instances
        # Configure load balancing
        # Setup monitoring
    done
}

# Create inter-region connectivity
setup_inter_region_connectivity() {
    echo "Setting up inter-region connectivity"
    
    # VPN between regions
    # Direct Connect / Dedicated network links
    # Route53 health checks
    # CloudFront for static content
}

# Run the deployment
deploy_regional_infrastructure
setup_inter_region_connectivity
EOF

bash /tmp/multi-region-infrastructure.sh

Database Replication Across Regions

Multi-Master Database Replication

# Setup MySQL multi-master replication across regions
setup_mysql_multiregion_replication() {
    # Region 1: Primary
    local region1_host="db1.region1.example.com"
    local region1_id=1
    
    # Region 2: Secondary
    local region2_host="db2.region2.example.com"
    local region2_id=2
    
    # Region 3: Tertiary
    local region3_host="db3.region3.example.com"
    local region3_id=3
    
    echo "Setting up MySQL multi-master replication"
    
    # Configure Region 1
    ssh "root@$region1_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region1_id;
SET GLOBAL binlog_format = 'ROW';

CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- Point to Region 2 (production setups also need MASTER_LOG_FILE/MASTER_LOG_POS,
-- or GTID mode with MASTER_AUTO_POSITION=1, to pick a consistent start point)
CHANGE MASTER TO
    MASTER_HOST='$region2_host',
    MASTER_USER='repl',
    MASTER_PASSWORD='repl_password';
START SLAVE;

SHOW MASTER STATUS;
SHOW SLAVE STATUS\G
MYSQL
EOF
    
    # Configure Region 2
    ssh "root@$region2_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region2_id;
SET GLOBAL binlog_format = 'ROW';

CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- Point to Region 1
CHANGE MASTER TO
    MASTER_HOST='$region1_host',
    MASTER_USER='repl',
    MASTER_PASSWORD='repl_password';
START SLAVE;

SHOW MASTER STATUS;
SHOW SLAVE STATUS\G
MYSQL
EOF
    
    # Configure Region 3 (replicates from Region 1)
    ssh "root@$region3_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region3_id;
SET GLOBAL read_only = 1;

CHANGE MASTER TO
    MASTER_HOST='$region1_host',
    MASTER_USER='repl',
    MASTER_PASSWORD='repl_password';
START SLAVE;

SHOW SLAVE STATUS\G
MYSQL
EOF
    
    echo "✓ Multi-region replication configured"
}
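With two writable masters, concurrent inserts can generate colliding auto-increment keys. The standard safeguard is interleaved auto-increment ranges; a sketch using the same hosts and credentials as above (this is database configuration, applied once per master):

```shell
# Give each writable master a distinct AUTO_INCREMENT offset so concurrent
# inserts in different regions never collide on primary keys.
# With increment=2: region 1 generates 1,3,5,... and region 2 generates 2,4,6,...

ssh "root@db1.region1.example.com" \
    "mysql -u root -e \"SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 1;\""

ssh "root@db2.region2.example.com" \
    "mysql -u root -e \"SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 2;\""
```

Set the increment to the total number of writable masters so the ranges stay disjoint if a third writer is added.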

# Monitor multi-region replication lag
monitor_multiregion_replication() {
    local regions=("us-east-1" "us-west-2" "eu-west-1")
    
    while true; do
        echo "=== Replication Status at $(date) ==="
        
        for region in "${regions[@]}"; do
            host="db.$region.example.com"
            
            lag=$(ssh "root@$host" \
                "mysql -u root -sNe 'SHOW SLAVE STATUS\\G' | awk '/Seconds_Behind_Master/ {print \$NF}'")
            
            printf "%-15s: %3s seconds\n" "$region" "$lag"
        done
        
        sleep 30
    done
}

PostgreSQL Multi-Region Replication

# PostgreSQL streaming replication across multiple regions
setup_postgresql_multiregion_replication() {
    local primary="db-primary.region1.example.com"
    local secondary1="db-secondary1.region2.example.com"
    local secondary2="db-secondary2.region3.example.com"
    
    echo "Setting up PostgreSQL multi-region replication"
    
    # Configure primary
    ssh "root@$primary" << EOF
sudo -u postgres cat >> /etc/postgresql/14/main/postgresql.conf << 'CONFIG'

wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal-archive/%f && cp %p /var/lib/postgresql/wal-archive/%f'
CONFIG

sudo systemctl restart postgresql

sudo -u postgres psql << SQL
CREATE USER repl_user REPLICATION ENCRYPTED PASSWORD 'repl_password';

SELECT pg_create_physical_replication_slot('secondary1_slot');
SELECT pg_create_physical_replication_slot('secondary2_slot');
SQL
EOF
    
    # Configure secondaries
    local i=1
    for secondary in "$secondary1" "$secondary2"; do
        ssh "root@$secondary" << EOF
sudo systemctl stop postgresql

# pg_basebackup requires an empty data directory
sudo -u postgres rm -rf /var/lib/postgresql/14/main

# Assumes pg_hba.conf on the primary allows replication connections for repl_user
sudo -u postgres pg_basebackup \
    -h $primary \
    -U repl_user \
    -D /var/lib/postgresql/14/main \
    -R \
    -S secondary${i}_slot \
    -X stream \
    -v

sudo systemctl start postgresql
EOF
        i=$((i + 1))
    done
    
    echo "✓ PostgreSQL multi-region replication configured"
}
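During a regional failover the surviving standby must be promoted so it accepts writes. With PostgreSQL 12+ this is a single function call; a sketch using the hostnames above (the wrapper function name is illustrative):

```shell
# Promote a streaming-replication standby to primary (PostgreSQL 12+).
promote_pg_standby() {
    local standby_host=$1

    # pg_promote() returns true once the server has left recovery
    ssh "root@$standby_host" \
        "sudo -u postgres psql -tAc 'SELECT pg_promote();'"

    # Verify: pg_is_in_recovery() should now report 'f' (false)
    ssh "root@$standby_host" \
        "sudo -u postgres psql -tAc 'SELECT pg_is_in_recovery();'"
}

# promote_pg_standby "db-secondary1.region2.example.com"
```

After promotion, repoint the remaining standbys at the new primary before re-enabling writes, or they will keep following the failed region.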

Global Load Balancing

DNS-Based Geographic Routing

# Setup GeoDNS with Route53 (or similar)
setup_geodns_routing() {
    local domain="app.example.com"
    
    echo "Configuring GeoDNS routing for: $domain"
    
    # Using AWS Route53 as example
    cat > /tmp/geodns-config.json << 'EOF'
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "US-East",
        "GeoLocation": {
          "CountryCode": "US",
          "SubdivisionCode": "VA"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.1.10"}
        ],
        "HealthCheckId": "us-east-health-check"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "US-West",
        "GeoLocation": {
          "CountryCode": "US",
          "SubdivisionCode": "OR"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.2.10"}
        ],
        "HealthCheckId": "us-west-health-check"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Europe",
        "GeoLocation": {
          "CountryCode": "IE"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.3.10"}
        ],
        "HealthCheckId": "eu-west-health-check"
      }
    }
  ]
}
EOF
    
    # Apply configuration
    # aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch file:///tmp/geodns-config.json
}
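GeoDNS answers can be spot-checked with dig and the EDNS Client Subnet extension, which lets a single query simulate clients from different networks (resolver support varies; the subnets and expected IPs below are placeholders):

```shell
# Verify GeoDNS answers by simulating clients from different networks.
# +subnet sends an EDNS Client Subnet hint with the query.

check_geo_answer() {
    local client_subnet=$1 expected_ip=$2
    local answer
    answer=$(dig +short +subnet="$client_subnet" app.example.com @8.8.8.8 | head -1)

    if [ "$answer" = "$expected_ip" ]; then
        echo "OK: $client_subnet -> $answer"
    else
        echo "MISMATCH: $client_subnet -> ${answer:-no answer} (expected $expected_ip)"
    fi
}

# check_geo_answer "203.0.113.0/24" "10.0.1.10"   # US client -> US-East
# check_geo_answer "192.0.2.0/24"   "10.0.3.10"   # EU client -> EU-West
```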

# Application-level geographic routing
setup_application_routing() {
    echo "Implementing application-level geographic routing"
    
    cat > /tmp/app-router.py << 'EOF'
import geoip2.database
from flask import Flask, request, redirect

app = Flask(__name__)

REGION_ENDPOINTS = {
    'US': 'https://us-east.app.example.com',
    'EU': 'https://eu-west.app.example.com',
    'APAC': 'https://ap-southeast.app.example.com',
}

@app.route('/')
def geo_redirect():
    # Behind a proxy or load balancer, prefer the X-Forwarded-For header
    client_ip = request.headers.get('X-Forwarded-For', request.remote_addr).split(',')[0].strip()
    
    try:
        with geoip2.database.Reader('/usr/share/GeoIP/GeoLite2-Country.mmdb') as reader:
            response = reader.country(client_ip)
            continent = response.continent.code
            
            if continent == 'NA':
                region = 'US'
            elif continent == 'EU':
                region = 'EU'
            else:
                region = 'APAC'
            
            return redirect(REGION_ENDPOINTS.get(region, REGION_ENDPOINTS['US']))
    except Exception:
        return redirect(REGION_ENDPOINTS['US'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)
EOF
}

Regional Load Balancing

# Setup regional load balancing within each region
setup_regional_load_balancer() {
    local region=$1       # numeric region index, used below in the 10.0.<region>.0/24 subnet
    local region_name=$2
    
    echo "Setting up load balancer for: $region_name"
    
    # HAProxy configuration
    cat > "/etc/haproxy/haproxy-$region.cfg" << EOF
global
    log stdout local0
    log stdout local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  http-server-close
    timeout connect 5000
    timeout client  50000
    timeout server  50000

frontend web_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/combined.pem
    redirect scheme https if !{ ssl_fc }
    
    default_backend web_servers

backend web_servers
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ example.com
    
    server web1 10.0.${region}.11:80 check inter 5s fall 3 rise 2
    server web2 10.0.${region}.12:80 check inter 5s fall 3 rise 2
    server web3 10.0.${region}.13:80 check inter 5s fall 3 rise 2
EOF
}
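A generated config should be validated before it replaces the running one; `haproxy -c` parses the file without starting the proxy. A small guard (sketch; assumes systemd manages haproxy):

```shell
# Validate a generated HAProxy configuration before reloading the service.
validate_haproxy_cfg() {
    local cfg=$1

    if [ ! -f "$cfg" ]; then
        echo "Config not found: $cfg"
        return 1
    fi

    if haproxy -c -f "$cfg"; then
        echo "✓ $cfg is valid - safe to reload"
        systemctl reload haproxy
    else
        echo "✗ $cfg failed validation - keeping current config"
        return 1
    fi
}

# validate_haproxy_cfg "/etc/haproxy/haproxy-1.cfg"
```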

Automated Regional Failover

Implement Automatic Failover Logic

# Automated regional failover with health checks
cat > /usr/local/bin/regional-failover-manager.sh << 'EOF'
#!/bin/bash

REGIONS=("us-east-1" "us-west-2" "eu-west-1")
PRIMARY_REGION="us-east-1"
FAILOVER_LOG="/var/log/regional-failover.log"
HEALTH_CHECK_INTERVAL=30

declare -A region_status
declare -A last_status_change

# Initialize status tracking
for region in "${REGIONS[@]}"; do
    region_status["$region"]="up"
    last_status_change["$region"]=$(date +%s)
done

# Health check function
check_region_health() {
    local region=$1
    local endpoint="https://api.$region.example.com/health"
    
    if curl -s --max-time 5 "$endpoint" | grep -q "ok"; then
        echo "up"
    else
        echo "down"
    fi
}

# Failover decision logic
make_failover_decision() {
    local current_primary=$1
    local failed_region=$2
    
    # Get list of healthy regions
    local healthy_regions=()
    for region in "${REGIONS[@]}"; do
        if [ "${region_status[$region]}" = "up" ]; then
            healthy_regions+=("$region")
        fi
    done
    
    # If primary is down and there are healthy regions
    if [ "$failed_region" = "$current_primary" ] && [ ${#healthy_regions[@]} -gt 0 ]; then
        # Elect new primary (lowest alphabetically among healthy)
        local new_primary=$(printf '%s\n' "${healthy_regions[@]}" | sort | head -1)
        
        log_failover "Primary region $current_primary is down. Promoting $new_primary"
        
        promote_region_to_primary "$new_primary"
        
        return 0
    fi
    
    return 1
}

# Promote region to primary
promote_region_to_primary() {
    local new_primary=$1
    
    log_failover "Promoting $new_primary to primary"
    
    # Update DNS routing
    update_dns_to_region "$new_primary"
    
    # Update database replication (if needed)
    promote_database_replica "$new_primary"
    
    # Update configuration in all regions
    broadcast_primary_change "$new_primary"
    
    # Track the new primary so subsequent health checks act on it
    PRIMARY_REGION="$new_primary"
    
    log_failover "✓ $new_primary is now primary region"
}

# Environment-specific actions -- implement for your DNS provider,
# database, and configuration management tooling
update_dns_to_region()     { log_failover "TODO: update DNS to route to $1"; }
promote_database_replica() { log_failover "TODO: promote database replica in $1"; }
broadcast_primary_change() { log_failover "TODO: announce new primary $1 to all regions"; }

# Health check loop
health_check_loop() {
    while true; do
        for region in "${REGIONS[@]}"; do
            current_status=$(check_region_health "$region")
            previous_status="${region_status[$region]}"
            
            if [ "$current_status" != "$previous_status" ]; then
                region_status["$region"]="$current_status"
                last_status_change["$region"]=$(date +%s)
                
                log_failover "Status change: $region $previous_status -> $current_status"
                
                # If primary went down, initiate failover
                if [ "$region" = "$PRIMARY_REGION" ] && [ "$current_status" = "down" ]; then
                    make_failover_decision "$PRIMARY_REGION" "$region"
                fi
            fi
        done
        
        sleep $HEALTH_CHECK_INTERVAL
    done
}

log_failover() {
    echo "[$(date)] $1" | tee -a "$FAILOVER_LOG"
}

health_check_loop
EOF

chmod +x /usr/local/bin/regional-failover-manager.sh
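A health-check loop launched from a shell dies with the session. To survive reboots and crashes, run the manager under systemd (the unit name is illustrative):

```shell
cat > /etc/systemd/system/regional-failover-manager.service << 'EOF'
[Unit]
Description=Regional failover health-check manager
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/regional-failover-manager.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now regional-failover-manager.service
```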

Split-Brain Prevention

Implement Consensus-Based Failover

# Use etcd or Consul for distributed consensus
setup_distributed_consensus() {
    echo "Setting up distributed consensus for split-brain prevention"
    
    # Install Consul
    apt-get install -y consul
    
    # Configure Consul for multi-region
    cat > /etc/consul/consul.json << 'EOF'
{
  "datacenter": "us-east-1",
  "node_name": "consul-1",
  "server": true,
  "ui": true,
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "bind_addr": "10.0.1.10",
  "retry_join": [
    "consul-2.us-west-2.example.com",
    "consul-3.eu-west-1.example.com"
  ],
  "services": [
    {
      "name": "web",
      "port": 80,
      "check": {
        "http": "http://localhost/health",
        "interval": "10s"
      }
    }
  ]
}
EOF
    
    systemctl restart consul
}

# Use quorum-based failover decisions
cat > /usr/local/bin/quorum-failover.sh << 'EOF'
#!/bin/bash

CONSUL_SERVERS=("consul1" "consul2" "consul3")
FAILOVER_THRESHOLD=2  # Require 2 out of 3 consensus

check_quorum_for_failover() {
    local region=$1
    local votes=0
    
    for server in "${CONSUL_SERVERS[@]}"; do
        # Query Consul for health status; a node may report several checks,
        # so vote if any of them is critical
        if curl -s "http://$server:8500/v1/health/service/$region" \
            | grep -o '"Status":"[^"]*"' | cut -d'"' -f4 | grep -qx "critical"; then
            ((votes++))
        fi
    done
    
    echo "Failover votes for $region: $votes/$FAILOVER_THRESHOLD"
    
    if [ $votes -ge $FAILOVER_THRESHOLD ]; then
        return 0  # Quorum reached for failover
    else
        return 1  # No quorum
    fi
}

# Prevent split-brain with lease-based primary election
lease_based_primary_election() {
    local region=$1
    
    # consul lock holds a session-backed lock for the lifetime of the child
    # process and releases it automatically if this node is partitioned away.
    # In production, run the primary workload as the child command so the
    # lock is held for its entire lifetime.
    if consul lock -timeout=30s "primary-role" bash -c "echo acquired"; then
        echo "✓ Acquired primary role for region: $region"
        return 0
    else
        echo "✗ Could not acquire primary role"
        return 1
    fi
}
EOF

chmod +x /usr/local/bin/quorum-failover.sh
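The voting logic above reduces to counting "critical" reports and comparing against the threshold, which is easy to exercise with canned statuses instead of live Consul servers:

```shell
#!/bin/bash
# Count "critical" votes from a newline-separated list of check statuses,
# one status per Consul server, as produced by the quorum poll.

count_critical_votes() {
    grep -cx "critical" || true   # grep -c exits 1 when the count is 0
}

quorum_reached() {
    local votes=$1 threshold=$2
    [ "$votes" -ge "$threshold" ]
}

# Example with stubbed statuses from three Consul servers:
votes=$(printf 'critical\npassing\ncritical\n' | count_critical_votes)
if quorum_reached "$votes" 2; then
    echo "Quorum reached: $votes votes - proceed with failover"
else
    echo "No quorum: $votes votes"
fi
```

With two of three servers reporting critical, the 2/3 threshold is met and failover proceeds; a single critical report is ignored.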

Monitoring and Alerting

Comprehensive Multi-Region Monitoring

# Setup centralized monitoring for all regions
setup_multiregion_monitoring() {
    echo "Setting up multi-region monitoring"
    
    # Prometheus configuration for all regions
    cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'multi-region'

scrape_configs:
  - job_name: 'web-servers-us-east'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.us-east.example.com:9100', 'web2.us-east.example.com:9100']
        labels:
          region: 'us-east-1'
  
  - job_name: 'web-servers-us-west'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.us-west.example.com:9100', 'web2.us-west.example.com:9100']
        labels:
          region: 'us-west-2'
  
  - job_name: 'web-servers-eu-west'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.eu-west.example.com:9100', 'web2.eu-west.example.com:9100']
        labels:
          region: 'eu-west-1'
  
  - job_name: 'database-replication'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['db1.us-east.example.com:9104', 'db1.us-west.example.com:9104', 'db1.eu-west.example.com:9104']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
EOF
}

# Create regional failure alerts
create_regional_alerts() {
    cat > /etc/prometheus/alerts/regional-failover.yml << 'EOF'
groups:
  - name: regional_failover
    rules:
      - alert: RegionDown
        expr: sum by (region) (up{region=~".+"}) == 0
        for: 2m
        annotations:
          summary: "Region is down: {{ $labels.region }}"
          description: "All servers in region {{ $labels.region }} are unreachable"
      
      - alert: HighReplicationLag
        expr: mysql_slave_lag_seconds > 30
        for: 5m
        annotations:
          summary: "High replication lag in {{ $labels.instance }}"
          description: "Replication lag: {{ $value }}s"
      
      - alert: PrimaryElectionConflict
        expr: count by (cluster) (primary_role{status="active"}) > 1
        for: 1m
        annotations:
          summary: "Split-brain detected: Multiple primary regions"
          description: "Multiple regions believe they are primary"
EOF
}
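A rule file with an expression mistake is worse than no rule at all, so lint it before deploying; `promtool` ships with Prometheus. A sketch (the reload endpoint assumes Prometheus runs with `--web.enable-lifecycle`):

```shell
# Syntax-check Prometheus config and alert rules before a live reload.
check_prometheus_config() {
    promtool check rules /etc/prometheus/alerts/regional-failover.yml \
        && promtool check config /etc/prometheus/prometheus.yml \
        && echo "✓ Prometheus configuration valid" \
        || { echo "✗ Validation failed - not reloading"; return 1; }

    # Reload without a restart (requires --web.enable-lifecycle)
    curl -s -X POST http://localhost:9090/-/reload
}
```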

Testing and Drills

Regular Failover Testing

# Automated failover testing
test_regional_failover() {
    local test_region=$1
    
    echo "Testing failover for region: $test_region"
    
    # Step 1: Record current state
    current_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
    echo "Current primary: $current_primary"
    
    # Step 2: Simulate region failure
    echo "Simulating failure in $test_region..."
    
    # Block new inbound connections while keeping this SSH session alive;
    # a bare DROP at the top of INPUT would cut off the session needed to restore
    ssh "ops@$test_region" << EOF
iptables -I INPUT 1 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -I INPUT 2 -j DROP
EOF
    
    # Step 3: Monitor failover
    sleep 30
    
    # Step 4: Verify failover occurred
    new_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
    
    if [ "$new_primary" != "$current_primary" ]; then
        echo "✓ Failover successful: $current_primary -> $new_primary"
    else
        echo "✗ Failover failed: Still on $current_primary"
    fi
    
    # Step 5: Restore region
    echo "Restoring $test_region..."
    ssh "ops@$test_region" << EOF
iptables -D INPUT -j DROP
iptables -D INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
EOF
    
    sleep 30
    
    # Step 6: Verify recovery
    recovered_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
    echo "Primary after recovery: $recovered_primary"
}
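The report template below asks for actual detection and recovery times, so each drill step should be bracketed with timestamps. A minimal timing helper (sketch):

```shell
#!/bin/bash
# Measure elapsed wall-clock time for a failover drill step.

step_start() { date +%s; }

# elapsed_seconds START_EPOCH -> whole seconds since START_EPOCH
elapsed_seconds() {
    echo $(( $(date +%s) - $1 ))
}

# Usage during a drill:
start=$(step_start)
sleep 1   # ...simulated failover step...
echo "Step took $(elapsed_seconds "$start")s"
```

Record one measurement per step (failure injected, failover detected, traffic shifted, region restored) and copy them into the RTO fields of the report.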

# Document failover test results
document_failover_test() {
    local test_date=$(date +%Y-%m-%d)
    local test_report="/var/reports/failover-test-$test_date.md"
    
    cat > "$test_report" << EOF
# Regional Failover Test Report

**Date**: $test_date
**Tested Regions**: [List]

## Test Execution
- [ ] Baseline metrics recorded
- [ ] Region failure simulated
- [ ] Failover detection time
- [ ] Failover execution time
- [ ] Service availability during failover
- [ ] Data consistency verified
- [ ] Region restored
- [ ] Recovery time measured

## Results
- RTO Actual: [Time]
- RTO Target: [Time]
- Data Loss: [Amount]
- RPO Target: [Time]

## Issues Found
1. [Issue]
2. [Issue]

## Improvements Made
1. [Improvement]
2. [Improvement]
EOF
}

Conclusion

Multi-region high availability requires:

  1. Geographic Distribution: Separate regions minimize single-point failures
  2. Database Replication: Continuous synchronization keeps data current
  3. Global Routing: Smart DNS or load balancing directs users to nearest region
  4. Automated Failover: Quick detection and promotion of backup regions
  5. Split-Brain Prevention: Quorum-based consensus prevents conflicts
  6. Monitoring: Comprehensive health checks across all regions
  7. Testing: Regular drills validate procedures and recovery times

The key challenge is balancing consistency (stricter replication = lower RPO but higher latency) with availability. Use eventual consistency for most data, but maintain strict consistency for critical operations (payments, etc.). Always test failover procedures regularly and keep multiple independent backups in different regions as a final safety net.