Multi-Region High Availability Architecture
Multi-region high availability provides geographic redundancy and protection against entire datacenter failures. This guide covers distributed database replication, global load balancing, automated failover, split-brain prevention, and practical deployment patterns for true fault tolerance across regions.
Table of Contents
- Multi-Region Architecture Overview
- Geographic Redundancy Planning
- Database Replication Across Regions
- Global Load Balancing
- Automated Regional Failover
- Split-Brain Prevention
- Monitoring and Alerting
- Testing and Drills
- Conclusion
Multi-Region Architecture Overview
Multi-region architectures typically include:
- Primary Region: Main production deployment
- Secondary Regions: Standby or active-active deployments
- Global Entry Point: DNS or load balancer directing traffic
- Replication Layer: Continuous data synchronization
- Orchestration: Automated failover and recovery
# Multi-region topology visualization
cat > /tmp/topology.txt << 'EOF'
Multi-Region HA Architecture

                    Internet
                        |
            Global Load Balancer / DNS
                        |
        +---------------+---------------+
        |               |               |
    Region A        Region B        Region C
    (Primary)       (Active)        (Active)
     |    |          |    |          |    |
    Web   DB        Web   DB        Web   DB
           |______________|______________|
              Database Replication Mesh
Components:
- Web Layer: Stateless (can lose any instance)
- Database Layer: Replicated across regions
- Load Balancing: Global and regional
- DNS: GeoDNS or application-level routing
- Monitoring: Centralized across all regions
RTO: <5 minutes (automated failover)
RPO: <1 minute (continuous replication)
EOF
cat /tmp/topology.txt
Geographic Redundancy Planning
Select Datacenter Locations
# Datacenter selection criteria
cat > /tmp/region-selection.md << 'EOF'
# Multi-Region Selection Criteria
## Geographic Distribution
- Minimum 500km separation between regions
- Different geological zones (reduce disaster risk)
- Internet backbone diversity
- Regulatory compliance (data residency)
## Region Pairs (AWS Example)
Primary: us-east-1 (N. Virginia)
Secondary: us-west-2 (Oregon)
Tertiary: eu-west-1 (Ireland)
## Network Connectivity Requirements
- Minimum 50 Mbps dedicated inter-region link
- <100ms latency preferred for sync replication
- 99.9%+ uptime SLA
- BGP redundancy
## Cost Considerations
- Data transfer costs (egress between regions)
- Instance costs per region
- Storage replication costs
- Monitoring and logging infrastructure
## Compliance Requirements
- Data sovereignty (EU GDPR, etc.)
- Industry regulations (HIPAA, PCI-DSS)
- Backup location requirements
- Disaster recovery mandates
EOF
cat /tmp/region-selection.md
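The "<100ms latency preferred for sync replication" criterion above can be checked mechanically when evaluating region pairs. A minimal sketch (thresholds are illustrative and should be tuned to your database's commit-latency budget):

```shell
# classify_latency: pick a replication mode from a measured RTT in ms.
# Thresholds are illustrative, not authoritative.
classify_latency() {
    local rtt_ms=$1
    if [ "$rtt_ms" -lt 10 ]; then
        echo "sync"          # same metro: synchronous replication viable
    elif [ "$rtt_ms" -lt 100 ]; then
        echo "semi-sync"     # cross-region: semi-synchronous acceptable
    else
        echo "async"         # intercontinental: asynchronous only
    fi
}

# measure_rtt: average RTT to a host in whole ms (summary field 5 of ping)
measure_rtt() {
    ping -c 5 -q "$1" | awk -F'/' '/^(rtt|round-trip)/ {printf "%d\n", $5}'
}
```

Usage: `classify_latency "$(measure_rtt db.us-west-2.example.com)"` against each candidate pair before committing to a replication topology.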
Create Multi-Region Infrastructure Template
# Infrastructure as Code example (Terraform-like)
cat > /tmp/multi-region-infrastructure.sh << 'EOF'
#!/bin/bash
declare -A regions=(
["primary"]="us-east-1"
["secondary"]="us-west-2"
["tertiary"]="eu-west-1"
)
declare -A region_names=(
["us-east-1"]="US East (N. Virginia)"
["us-west-2"]="US West (Oregon)"
["eu-west-1"]="EU (Ireland)"
)
# Deploy infrastructure in each region
deploy_regional_infrastructure() {
for region_key in "${!regions[@]}"; do
local region="${regions[$region_key]}"
echo "Deploying to region: $region - ${region_names[$region]}"
# Create VPC
# Create subnets
# Create security groups
# Launch instances
# Configure load balancing
# Setup monitoring
done
}
# Create inter-region connectivity
setup_inter_region_connectivity() {
echo "Setting up inter-region connectivity"
# VPN between regions
# Direct Connect / Dedicated network links
# Route53 health checks
# CloudFront for static content
}
# Run the deployment steps when the script is executed
deploy_regional_infrastructure
setup_inter_region_connectivity
EOF
bash /tmp/multi-region-infrastructure.sh
Database Replication Across Regions
Multi-Master Database Replication
# Setup MySQL multi-master replication across regions
setup_mysql_multiregion_replication() {
# Region 1: Primary
local region1_host="db1.region1.example.com"
local region1_id=1
# Region 2: Secondary
local region2_host="db2.region2.example.com"
local region2_id=2
# Region 3: Tertiary
local region3_host="db3.region3.example.com"
local region3_id=3
echo "Setting up MySQL multi-master replication"
# Configure Region 1
ssh "root@$region1_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region1_id;
SET GLOBAL binlog_format = 'ROW';
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
-- Point to Region 2 (GTID auto-positioning; assumes gtid_mode=ON and
-- enforce_gtid_consistency=ON are already set on every node)
CHANGE MASTER TO
MASTER_HOST='$region2_host',
MASTER_USER='repl',
MASTER_PASSWORD='repl_password',
MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW MASTER STATUS;
SHOW SLAVE STATUS\G
MYSQL
EOF
# Configure Region 2
ssh "root@$region2_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region2_id;
SET GLOBAL binlog_format = 'ROW';
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
-- Point to Region 1 (GTID auto-positioning, as above)
CHANGE MASTER TO
MASTER_HOST='$region1_host',
MASTER_USER='repl',
MASTER_PASSWORD='repl_password',
MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW MASTER STATUS;
SHOW SLAVE STATUS\G
MYSQL
EOF
# Configure Region 3 (replicates from Region 1)
ssh "root@$region3_host" << EOF
mysql -u root << MYSQL
SET GLOBAL server_id = $region3_id;
SET GLOBAL read_only = 1;
CHANGE MASTER TO
MASTER_HOST='$region1_host',
MASTER_USER='repl',
MASTER_PASSWORD='repl_password',
MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G
MYSQL
EOF
echo "✓ Multi-region replication configured"
}
# Monitor multi-region replication lag
monitor_multiregion_replication() {
local regions=("us-east-1" "us-west-2" "eu-west-1")
while true; do
echo "=== Replication Status at $(date) ==="
for region in "${regions[@]}"; do
host="db.$region.example.com"
lag=$(ssh "root@$host" \
"mysql -u root -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print \$2}'")
printf "%-15s: %3s seconds\n" "$region" "$lag"
done
sleep 30
done
}
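One subtlety the monitoring loop above glosses over: `Seconds_Behind_Master` reports `NULL` when the SQL thread is stopped, which a naive numeric comparison misreads as healthy. A small classification helper makes that case explicit (the 30-second default threshold is illustrative):

```shell
# check_lag_threshold LAG [THRESHOLD]: classify a Seconds_Behind_Master
# reading. "NULL" (or empty) means replication is not running - critical.
check_lag_threshold() {
    local lag=$1 threshold=${2:-30}
    if [ "$lag" = "NULL" ] || [ -z "$lag" ]; then
        echo "CRITICAL: replication stopped"
    elif [ "$lag" -gt "$threshold" ]; then
        echo "WARNING: lag ${lag}s exceeds ${threshold}s"
    else
        echo "OK: lag ${lag}s"
    fi
}
```

Feed it the `$lag` value collected in the loop instead of printing the raw number.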
PostgreSQL Multi-Region Replication
# PostgreSQL streaming replication across multiple regions
setup_postgresql_multiregion_replication() {
local primary="db-primary.region1.example.com"
local secondary1="db-secondary1.region2.example.com"
local secondary2="db-secondary2.region3.example.com"
echo "Setting up PostgreSQL multi-region replication"
# Configure primary
ssh "root@$primary" << EOF
sudo -u postgres cat >> /etc/postgresql/14/main/postgresql.conf << 'CONFIG'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal-archive/%f && cp %p /var/lib/postgresql/wal-archive/%f'
CONFIG
# Allow replication connections from the secondaries
# (tighten the CIDR to your actual inter-region network)
echo "host replication repl_user 10.0.0.0/8 md5" | sudo tee -a /etc/postgresql/14/main/pg_hba.conf
sudo systemctl restart postgresql
sudo -u postgres psql << SQL
CREATE USER repl_user REPLICATION ENCRYPTED PASSWORD 'repl_password';
SELECT pg_create_physical_replication_slot('secondary1_slot');
SELECT pg_create_physical_replication_slot('secondary2_slot');
SQL
EOF
# Configure secondaries (slot names must match those created on the primary;
# assumes a .pgpass entry so pg_basebackup can authenticate non-interactively)
local slot_index=1
for secondary in "$secondary1" "$secondary2"; do
ssh "root@$secondary" << EOF
sudo systemctl stop postgresql
sudo -u postgres rm -rf /var/lib/postgresql/14/main/*
sudo -u postgres pg_basebackup \
-h $primary \
-U repl_user \
-D /var/lib/postgresql/14/main \
-S secondary${slot_index}_slot \
-R \
-X stream \
-v
sudo systemctl start postgresql
EOF
slot_index=$((slot_index + 1))
done
echo "✓ PostgreSQL multi-region replication configured"
}
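For PostgreSQL, replication lag is most precisely measured in bytes of WAL: the difference between the primary's current LSN and each standby's replayed LSN (both visible in `pg_stat_replication`). An LSN such as `0/3000060` encodes a 64-bit position as two 32-bit hex halves, so the lag can also be computed outside the database, e.g. when comparing values scraped from different hosts:

```shell
# lsn_to_bytes: convert an LSN like "0/3000060" (high32/low32 in hex)
# to an absolute byte offset
lsn_to_bytes() {
    local hi=${1%%/*} lo=${1##*/}
    echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

# lsn_lag_bytes PRIMARY_LSN STANDBY_LSN: bytes of WAL the standby is behind
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

On the server itself, `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;` gives the same number directly.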
Global Load Balancing
DNS-Based Geographic Routing
# Setup GeoDNS with Route53 (or similar)
setup_geodns_routing() {
local domain="app.example.com"
echo "Configuring GeoDNS routing for: $domain"
# Using AWS Route53 as example
cat > /tmp/geodns-config.json << 'EOF'
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "US-East",
        "GeoLocation": {
          "CountryCode": "US",
          "SubdivisionCode": "VA"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.1.10"}
        ],
        "HealthCheckId": "us-east-health-check"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "US-West",
        "GeoLocation": {
          "CountryCode": "US",
          "SubdivisionCode": "OR"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.2.10"}
        ],
        "HealthCheckId": "us-west-health-check"
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Europe",
        "GeoLocation": {
          "CountryCode": "IE"
        },
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "10.0.3.10"}
        ],
        "HealthCheckId": "eu-west-health-check"
      }
    }
  ]
}
EOF
# Apply configuration
# aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch file:///tmp/geodns-config.json
}
# Application-level geographic routing
setup_application_routing() {
echo "Implementing application-level geographic routing"
cat > /tmp/app-router.py << 'EOF'
import geoip2.database
from flask import Flask, request, redirect
app = Flask(__name__)
REGION_ENDPOINTS = {
    'US': 'https://us-east.app.example.com',
    'EU': 'https://eu-west.app.example.com',
    'APAC': 'https://ap-southeast.app.example.com',
}

@app.route('/')
def geo_redirect():
    client_ip = request.remote_addr
    try:
        with geoip2.database.Reader('/usr/share/GeoIP/GeoLite2-Country.mmdb') as reader:
            response = reader.country(client_ip)
        continent = response.continent.code
        if continent == 'NA':
            region = 'US'
        elif continent == 'EU':
            region = 'EU'
        else:
            region = 'APAC'
        return redirect(REGION_ENDPOINTS.get(region, REGION_ENDPOINTS['US']))
    except Exception:
        # Unknown or unroutable IP: fall back to the default region
        return redirect(REGION_ENDPOINTS['US'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)
EOF
}
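The routing policy in the Flask app is worth isolating as a pure mapping so it can be exercised without GeoIP data or a running server. This shell mirror of the continent-to-region decision (same policy as the Python above: NA to US, EU to EU, everything else to APAC) makes routing drills scriptable:

```shell
# select_region CONTINENT_CODE: mirror of the geo-router's policy,
# handy for testing routing decisions from the command line
select_region() {
    case "$1" in
        NA) echo "US" ;;
        EU) echo "EU" ;;
        *)  echo "APAC" ;;
    esac
}
```

If the policy ever changes (e.g. routing SA to US), change it in both places or generate one from the other.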
Regional Load Balancing
# Setup regional load balancing within each region
setup_regional_load_balancer() {
local region_id=$1      # numeric region ID, used in the 10.0.<id>.0/24 subnet
local region_name=$2
echo "Setting up load balancer for: $region_name"
# HAProxy configuration
cat > "/etc/haproxy/haproxy-region${region_id}.cfg" << EOF
global
    log stdout local0
    log stdout local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option http-server-close
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend web_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/combined.pem
    redirect scheme https if !{ ssl_fc }
    default_backend web_servers

backend web_servers
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ example.com
    server web1 10.0.${region_id}.11:80 check inter 5s fall 3 rise 2
    server web2 10.0.${region_id}.12:80 check inter 5s fall 3 rise 2
    server web3 10.0.${region_id}.13:80 check inter 5s fall 3 rise 2
EOF
}
}
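With the stats socket enabled in the global section above, backend health becomes scriptable. The sketch below parses the CSV that HAProxy emits for `show stat` (columns: 1 = pxname, 2 = svname, 18 = status); the `socat` invocation and socket path in the usage line are assumptions taken from the config above:

```shell
# count_up_servers BACKEND: read 'show stat' CSV on stdin and count how
# many real servers (not the FRONTEND/BACKEND summary rows) report UP
count_up_servers() {
    awk -F',' -v backend="$1" \
        '$1 == backend && $2 != "FRONTEND" && $2 != "BACKEND" && $18 == "UP" {n++}
         END {print n+0}'
}

# Usage (assumes socat installed and the admin socket configured above):
# echo "show stat" | socat stdio /run/haproxy/admin.sock | count_up_servers web_servers
```

Comparing this count against the expected server count per region is a cheap pre-flight check before any failover drill.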
Automated Regional Failover
Implement Automatic Failover Logic
# Automated regional failover with health checks
cat > /usr/local/bin/regional-failover-manager.sh << 'EOF'
#!/bin/bash
REGIONS=("us-east-1" "us-west-2" "eu-west-1")
PRIMARY_REGION="us-east-1"
FAILOVER_LOG="/var/log/regional-failover.log"
HEALTH_CHECK_INTERVAL=30
declare -A region_status
declare -A last_status_change
# Initialize status tracking
for region in "${REGIONS[@]}"; do
region_status["$region"]="up"
last_status_change["$region"]=$(date +%s)
done
# Health check function
check_region_health() {
local region=$1
local endpoint="https://api.$region.example.com/health"
if curl -s --max-time 5 "$endpoint" | grep -q "ok"; then
echo "up"
else
echo "down"
fi
}
# Failover decision logic
make_failover_decision() {
local current_primary=$1
local failed_region=$2
# Get list of healthy regions
local healthy_regions=()
for region in "${REGIONS[@]}"; do
if [ "${region_status[$region]}" = "up" ]; then
healthy_regions+=("$region")
fi
done
# If primary is down and there are healthy regions
if [ "$failed_region" = "$current_primary" ] && [ ${#healthy_regions[@]} -gt 0 ]; then
# Elect new primary (lowest alphabetically among healthy)
local new_primary=$(printf '%s\n' "${healthy_regions[@]}" | sort | head -1)
log_failover "Primary region $current_primary is down. Promoting $new_primary"
promote_region_to_primary "$new_primary"
return 0
fi
return 1
}
# Environment-specific hooks - replace these stubs with real implementations
update_dns_to_region() { echo "TODO: point global DNS at region $1"; }
promote_database_replica() { echo "TODO: promote database replica in $1"; }
broadcast_primary_change() { echo "TODO: notify all regions that $1 is primary"; }
# Promote region to primary
promote_region_to_primary() {
local new_primary=$1
log_failover "Promoting $new_primary to primary"
# Update DNS routing
update_dns_to_region "$new_primary"
# Promote the database replica in the new primary region
promote_database_replica "$new_primary"
# Update configuration in all regions
broadcast_primary_change "$new_primary"
# Track the new primary so subsequent health checks watch the right region
PRIMARY_REGION="$new_primary"
log_failover "✓ $new_primary is now primary region"
}
# Health check loop
health_check_loop() {
while true; do
for region in "${REGIONS[@]}"; do
current_status=$(check_region_health "$region")
previous_status="${region_status[$region]}"
if [ "$current_status" != "$previous_status" ]; then
region_status["$region"]="$current_status"
last_status_change["$region"]=$(date +%s)
log_failover "Status change: $region $previous_status -> $current_status"
# If primary went down, initiate failover
if [ "$region" = "$PRIMARY_REGION" ] && [ "$current_status" = "down" ]; then
make_failover_decision "$PRIMARY_REGION" "$region"
fi
fi
done
sleep $HEALTH_CHECK_INTERVAL
done
}
log_failover() {
echo "[$(date)] $1" | tee -a "$FAILOVER_LOG"
}
health_check_loop
EOF
chmod +x /usr/local/bin/regional-failover-manager.sh
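The overview targets an RTO under 5 minutes, and it is worth sanity-checking that budget against the knobs in this script. Worst-case client-visible failover is roughly detection time (failed health checks spaced at HEALTH_CHECK_INTERVAL) plus DNS propagation (record TTL) plus promotion time. A back-of-envelope model, ignoring clients whose resolvers disobey TTLs:

```shell
# estimate_failover_rto CHECK_INTERVAL FAILURES_NEEDED DNS_TTL PROMOTION_SECS
# -> rough worst-case seconds until clients reach the new primary
estimate_failover_rto() {
    local interval=$1 failures=$2 dns_ttl=$3 promotion=$4
    echo $(( interval * failures + dns_ttl + promotion ))
}

# e.g. 30s checks, 3 consecutive failures, 60s TTL, 45s promotion:
# estimate_failover_rto 30 3 60 45   -> 195 (within the 300s target)
```

If the estimate exceeds the RTO target, the cheapest levers are usually a shorter check interval and a lower DNS TTL (the GeoDNS records above already use TTL 60).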
Split-Brain Prevention
Implement Consensus-Based Failover
# Use etcd or Consul for distributed consensus
setup_distributed_consensus() {
echo "Setting up distributed consensus for split-brain prevention"
# Install Consul
apt-get install -y consul
# Configure Consul for multi-region
# Note: this stretches a single consensus cluster across regions, so it
# needs reasonably low inter-region latency for stable Raft leadership
cat > /etc/consul.d/consul.json << 'EOF'
{
  "datacenter": "us-east-1",
  "node_name": "consul-1",
  "server": true,
  "ui": true,
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "bind_addr": "10.0.1.10",
  "retry_join": [
    "consul-2.us-west-2.example.com",
    "consul-3.eu-west-1.example.com"
  ],
  "services": [
    {
      "name": "web",
      "port": 80,
      "check": {
        "http": "http://localhost/health",
        "interval": "10s"
      }
    }
  ]
}
EOF
systemctl restart consul
}
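The `bootstrap_expect` of 3 above is not arbitrary: Raft-style consensus requires a strict majority, so a cluster of n servers needs floor(n/2)+1 votes and tolerates floor((n-1)/2) failures. The arithmetic also shows why even-sized clusters add cost without adding tolerance:

```shell
# quorum_size N: votes needed for a strict majority in an N-server cluster
quorum_size() { echo $(( $1 / 2 + 1 )); }

# failures_tolerated N: servers that can fail while quorum survives
failures_tolerated() { echo $(( ($1 - 1) / 2 )); }

# 3 servers tolerate 1 failure; 4 still tolerate only 1; 5 tolerate 2.
```

For three regions, one consensus server per region means any single region can fail while the remaining two still form a quorum, which is exactly the property the failover logic below relies on.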
# Use quorum-based failover decisions
cat > /usr/local/bin/quorum-failover.sh << 'EOF'
#!/bin/bash
CONSUL_SERVERS=("consul1" "consul2" "consul3")
FAILOVER_THRESHOLD=2 # Require 2 out of 3 consensus
check_quorum_for_failover() {
local region=$1
local votes=0
for server in "${CONSUL_SERVERS[@]}"; do
# Count this server's vote if any health check for the service is critical
# (the endpoint can return several checks, so match rather than parse)
if curl -s "http://$server:8500/v1/health/service/$region" | grep -q '"Status": *"critical"'; then
votes=$((votes + 1))
fi
done
echo "Failover votes for $region: $votes/$FAILOVER_THRESHOLD"
if [ $votes -ge $FAILOVER_THRESHOLD ]; then
return 0 # Quorum reached for failover
else
return 1 # No quorum
fi
}
# Prevent split-brain with lock-based primary election: consul lock holds a
# session-backed lock while the given command runs, and the session (and
# with it the primary role) is released automatically if this node dies or
# is partitioned away from the quorum
lease_based_primary_election() {
local region=$1
local lock_timeout=30
# Block up to $lock_timeout seconds trying to become primary; replace
# 'sleep infinity' with the actual primary-role workload to hold the lock
if consul lock -timeout="${lock_timeout}s" primary-role-lease sleep infinity; then
echo "✓ Held primary role for region: $region (lock now released)"
return 0
else
echo "✗ Could not acquire primary role"
return 1
fi
}
EOF
chmod +x /usr/local/bin/quorum-failover.sh
Monitoring and Alerting
Comprehensive Multi-Region Monitoring
# Setup centralized monitoring for all regions
setup_multiregion_monitoring() {
echo "Setting up multi-region monitoring"
# Prometheus configuration for all regions
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'multi-region'

scrape_configs:
  - job_name: 'web-servers-us-east'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.us-east.example.com:9100', 'web2.us-east.example.com:9100']
        labels:
          region: 'us-east-1'

  - job_name: 'web-servers-us-west'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.us-west.example.com:9100', 'web2.us-west.example.com:9100']
        labels:
          region: 'us-west-2'

  - job_name: 'web-servers-eu-west'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web1.eu-west.example.com:9100', 'web2.eu-west.example.com:9100']
        labels:
          region: 'eu-west-1'

  - job_name: 'database-replication'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['db1.us-east.example.com:9104', 'db1.us-west.example.com:9104', 'db1.eu-west.example.com:9104']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
EOF
}
# Create regional failure alerts
create_regional_alerts() {
cat > /etc/prometheus/alerts/regional-failover.yml << 'EOF'
groups:
  - name: regional_failover
    rules:
      - alert: RegionDown
        # Every target in the region reports down; sum stays defined
        # because failed scrapes still produce up=0 series
        expr: sum by (region) (up{region=~".+"}) == 0
        for: 2m
        annotations:
          summary: "Region is down: {{ $labels.region }}"
          description: "All servers in region {{ $labels.region }} are unreachable"
      - alert: HighReplicationLag
        # Metric name as exported by mysqld_exporter (port 9104 above)
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        annotations:
          summary: "High replication lag on {{ $labels.instance }}"
          description: "Replication lag: {{ $value }}s"
      - alert: PrimaryElectionConflict
        # More than one series claims the active primary role per cluster
        expr: count by (cluster) (primary_role{status="active"}) > 1
        for: 1m
        annotations:
          summary: "Split-brain detected: multiple primary regions"
          description: "Multiple regions believe they are primary"
EOF
}
Testing and Drills
Regular Failover Testing
# Automated failover testing
test_regional_failover() {
local test_region=$1
echo "Testing failover for region: $test_region"
# Step 1: Record current state
current_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
echo "Current primary: $current_primary"
# Step 2: Simulate region failure
echo "Simulating failure in $test_region..."
# Block all traffic except SSH, so the host stays reachable for restore
ssh "ops@$test_region" << 'EOF'
sudo iptables -I INPUT 1 -j DROP
sudo iptables -I INPUT 1 -p tcp --dport 22 -j ACCEPT
EOF
# Step 3: Monitor failover
sleep 30
# Step 4: Verify failover occurred
new_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
if [ "$new_primary" != "$current_primary" ]; then
echo "✓ Failover successful: $current_primary -> $new_primary"
else
echo "✗ Failover failed: Still on $current_primary"
fi
# Step 5: Restore region
echo "Restoring $test_region..."
ssh "ops@$test_region" << 'EOF'
sudo iptables -D INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -D INPUT -j DROP
EOF
sleep 30
# Step 6: Verify recovery
recovered_primary=$(curl -s https://api.example.com/status | grep -o '"primary":"[^"]*"' | cut -d'"' -f4)
echo "Primary after recovery: $recovered_primary"
}
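A drill is only as useful as its measurements. One cheap addition to the test above is recording client-visible downtime by polling the public endpoint once per second for the duration of the drill (URL and window below are illustrative):

```shell
# measure_downtime URL [DURATION]: poll URL once per second for DURATION
# seconds and report how many polls failed (approximate downtime)
measure_downtime() {
    local url=$1 duration=${2:-300} down=0 t=0
    while [ "$t" -lt "$duration" ]; do
        if ! curl -sf --max-time 2 "$url" > /dev/null; then
            down=$((down + 1))
        fi
        sleep 1
        t=$((t + 1))
    done
    echo "Unreachable for ~${down}s out of ${duration}s"
}
```

Run it in the background before triggering the simulated failure, then record the result in the test report as the measured RTO contribution.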
# Document failover test results
document_failover_test() {
local test_date=$(date +%Y-%m-%d)
local test_report="/var/reports/failover-test-$test_date.md"
cat > "$test_report" << EOF
# Regional Failover Test Report
**Date**: $test_date
**Tested Regions**: [List]
## Test Execution
- [ ] Baseline metrics recorded
- [ ] Region failure simulated
- [ ] Failover detection time
- [ ] Failover execution time
- [ ] Service availability during failover
- [ ] Data consistency verified
- [ ] Region restored
- [ ] Recovery time measured
## Results
- RTO Actual: [Time]
- RTO Target: [Time]
- Data Loss: [Amount]
- RPO Target: [Time]
## Issues Found
1. [Issue]
2. [Issue]
## Improvements Made
1. [Improvement]
2. [Improvement]
EOF
}
Conclusion
Multi-region high availability requires:
- Geographic Distribution: Separate regions minimize single-point failures
- Database Replication: Continuous synchronization keeps data current
- Global Routing: Smart DNS or load balancing directs users to nearest region
- Automated Failover: Quick detection and promotion of backup regions
- Split-Brain Prevention: Quorum-based consensus prevents conflicts
- Monitoring: Comprehensive health checks across all regions
- Testing: Regular drills validate procedures and recovery times
The key challenge is balancing consistency (stricter replication = lower RPO but higher latency) with availability. Use eventual consistency for most data, but maintain strict consistency for critical operations (payments, etc.). Always test failover procedures regularly and keep multiple independent backups in different regions as a final safety net.


