DNS Failover Configuration

DNS failover provides automatic redirection of traffic when primary servers fail. This guide covers health checks, failover record configuration, TTL optimization, BIND/PowerDNS setup, and comprehensive monitoring strategies for high-availability DNS systems.

DNS Failover Concepts

DNS failover redirects clients to backup servers when primary servers become unavailable. There are several approaches:

  • Round-robin DNS: Alternates between multiple IPs (no health checking)
  • Weighted DNS: Distributes based on server weight/capacity
  • Latency-based: Routes to the server with the lowest measured latency, which is usually (but not always) the geographically closest one
  • Health-checked failover: Routes based on server health status
# Compare failover strategies
compare_failover_strategies() {
    cat << 'EOF'
Strategy         | Health Check | Geographic | Weighted | Complexity
-----------------+--------------+------------+----------+----------
Round-robin      | No           | No         | No       | Low
Weighted         | No           | No         | Yes      | Low
Latency-based    | Optional     | Yes        | Yes      | Medium
Health-checked   | Yes          | Optional   | Yes      | High

Best for:
- Simple load distribution: Round-robin
- Capacity-aware distribution: Weighted
- Geographic optimization: Latency-based
- Mission-critical: Health-checked
EOF
}

compare_failover_strategies

Health Check Implementation

HTTP/HTTPS Health Checks

# HTTP-based health check script
cat > /usr/local/bin/health-check-http.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
HEALTH_URL="http://$SERVER_IP/health"
TIMEOUT=5
MAX_RETRIES=3
RETRY_DELAY=2

check_server_health() {
    local retries=0
    
    while [ $retries -lt $MAX_RETRIES ]; do
        HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
            --max-time $TIMEOUT \
            "$HEALTH_URL")
        
        if [ "$HTTP_CODE" = "200" ]; then
            echo "Healthy"
            return 0
        fi
        
        retries=$((retries + 1))
        sleep $RETRY_DELAY
    done
    
    echo "Unhealthy (HTTP $HTTP_CODE)"
    return 1
}

# Check with detailed diagnostics (one request captures both code and timing)
detailed_health_check() {
    local result response_code response_time
    result=$(curl -s -o /dev/null -w "%{http_code} %{time_total}" \
        --max-time $TIMEOUT "$HEALTH_URL")
    response_code=${result%% *}
    response_time=${result##* }
    
    # Unquoted delimiter so the variables below are expanded
    cat << RESULT
Server: $SERVER_IP
Response Code: $response_code
Response Time: $response_time seconds
Status: $([ "$response_code" = "200" ] && echo "HEALTHY" || echo "UNHEALTHY")
RESULT
}

check_server_health
EOF

chmod +x /usr/local/bin/health-check-http.sh

TCP Port Health Checks

# TCP-based health check (faster, less resource-intensive)
cat > /usr/local/bin/health-check-tcp.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
PORT=${2:-80}
TIMEOUT=3

check_tcp_port() {
    if timeout $TIMEOUT bash -c "echo > /dev/tcp/$SERVER_IP/$PORT" 2>/dev/null; then
        echo "Healthy"
        return 0
    else
        echo "Unhealthy"
        return 1
    fi
}

# Alternative using nc (netcat)
check_with_netcat() {
    if nc -z -w $TIMEOUT $SERVER_IP $PORT 2>/dev/null; then
        echo "Healthy"
        return 0
    else
        echo "Unhealthy"
        return 1
    fi
}

check_tcp_port
EOF

chmod +x /usr/local/bin/health-check-tcp.sh

Custom Health Check Script

# Advanced health check with multiple criteria
cat > /usr/local/bin/health-check-advanced.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
HEALTH_SCORE=0
HEALTH_THRESHOLD=70

# Check HTTP response (bounded so a hung server cannot stall the check)
http_code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "http://$SERVER_IP")
[ "$http_code" = "200" ] && HEALTH_SCORE=$((HEALTH_SCORE + 30))

# Check SSH connectivity
if timeout 2 bash -c "echo > /dev/tcp/$SERVER_IP/22" 2>/dev/null; then
    HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi

# Check disk usage via SSH (BatchMode avoids hanging on a password prompt)
disk_usage=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "df -P / | awk 'NR==2 {print \$5}' | sed 's/%//'" 2>/dev/null)
# A failed probe counts as a full disk, so it earns no points
if [ "${disk_usage:-100}" -lt 80 ]; then
    HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi

# Check available memory (KB) via SSH
mem_available=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "free | awk 'NR==2 {print \$7}'" 2>/dev/null)
# A failed probe counts as no available memory
if [ "${mem_available:-0}" -gt 500000 ]; then
    HEALTH_SCORE=$((HEALTH_SCORE + 30))
fi

# Determine status
if [ $HEALTH_SCORE -ge $HEALTH_THRESHOLD ]; then
    echo "Healthy ($HEALTH_SCORE%)"
    exit 0
else
    echo "Unhealthy ($HEALTH_SCORE%)"
    exit 1
fi
EOF

chmod +x /usr/local/bin/health-check-advanced.sh
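
The scoring rules in the script above are easier to tune when they are separated from the probes. The following is a minimal sketch of that idea, not part of the original script: `is_uint` and `compute_health_score` are hypothetical names, and the weights (30/20/20/30) and limits (80% disk, 500000 KB memory) are copied from the script. Non-numeric probe output, such as an empty string after an SSH timeout, simply earns no points.

```bash
# Hypothetical helper: succeeds only when the argument is a non-negative integer,
# so empty or garbled probe output is rejected before any numeric comparison.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: turns raw probe results into the 0-100 score.
# Arguments: http_ok (0 or 1), ssh_ok (0 or 1), disk usage in percent,
# available memory in KB.
compute_health_score() {
    local http_ok=$1
    local ssh_ok=$2
    local disk_usage=$3
    local mem_available=$4
    local score=0

    if [ "$http_ok" = "1" ]; then score=$((score + 30)); fi
    if [ "$ssh_ok" = "1" ]; then score=$((score + 20)); fi
    if is_uint "$disk_usage" && [ "$disk_usage" -lt 80 ]; then
        score=$((score + 20))
    fi
    if is_uint "$mem_available" && [ "$mem_available" -gt 500000 ]; then
        score=$((score + 30))
    fi

    echo "$score"
}
```

For example, `compute_health_score 1 1 "" ""` models a server that answers HTTP and accepts TCP connections on port 22 but whose SSH probes returned nothing; it scores 50, which is below the threshold of 70.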

BIND Configuration

Basic DNS Failover with BIND

# Install BIND
apt-get install -y bind9 bind9-utils

# Configure BIND for failover (the zones directory may not exist on a fresh install)
mkdir -p /etc/bind/zones
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@   IN  SOA ns1.example.com. admin.example.com. (
            2024010101  ; serial
            3600        ; refresh
            1800        ; retry
            604800      ; expire
            300 )       ; minimum
    
    IN  NS  ns1.example.com.
    IN  NS  ns2.example.com.

ns1     IN  A   10.0.1.10
ns2     IN  A   10.0.2.10

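; Web servers: both A records are returned, in rotating order.
; Plain DNS has no primary/backup notion, so failover means removing
; the failed address from the zone.
web     IN  A   10.0.1.20
web     IN  A   10.0.1.21

; Application servers (multiple A records, round-robin)
app     IN  A   10.0.1.30
app     IN  A   10.0.1.31

; Database with round-robin
db      IN  A   10.0.1.40
db      IN  A   10.0.1.41

; Service with priority and weight (SRV fields: priority weight port target)
; server2 (priority 20) is used only when server1 (priority 10) is unreachable.
; Each SRV target also needs its own A or AAAA record in the zone.
_service._tcp   IN  SRV 10 60 5000 server1.example.com.
_service._tcp   IN  SRV 20 40 5000 server2.example.com.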
EOF

# Include zone in main config
cat >> /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
    type master;
    file "/etc/bind/zones/db.example.com";
    allow-transfer { 10.0.2.10; };
};
EOF

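# Validate the configuration and the zone before restarting
named-checkconf && named-checkzone example.com /etc/bind/zones/db.example.com
systemctl restart bind9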

BIND Secondary (Slave) Server Configuration

# Configure the secondary (slave) DNS server; run this on the secondary host (10.0.2.10)
cat > /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
    type slave;
    file "/var/cache/bind/db.example.com";
    masters { 10.0.1.10; };
};
EOF

systemctl restart bind9

# Verify the zone transfer: the secondary should answer with the primary's SOA serial
dig @10.0.2.10 example.com SOA +short
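
A secondary that has transferred the zone serves the same SOA serial as the primary, so comparing the two serials is a quick replication check. The sketch below assumes `dig` is installed; `soa_serial` and `zone_in_sync` are hypothetical helper names, not BIND tools.

```bash
# Hypothetical helper: extracts the serial (third field) from one line of
# "dig +short SOA" output, e.g.
# "ns1.example.com. admin.example.com. 2024010101 3600 1800 604800 300".
soa_serial() {
    echo "$1" | awk '{print $3}'
}

# Hypothetical helper: compares the serials served by two nameservers.
# Usage: zone_in_sync <primary_ip> <secondary_ip> <zone>
zone_in_sync() {
    local primary_serial
    local secondary_serial
    primary_serial=$(soa_serial "$(dig @"$1" "$3" SOA +short)")
    secondary_serial=$(soa_serial "$(dig @"$2" "$3" SOA +short)")

    if [ -n "$primary_serial" ] && [ "$primary_serial" = "$secondary_serial" ]; then
        echo "in sync (serial $primary_serial)"
        return 0
    fi
    echo "out of sync (primary: ${primary_serial:-none}, secondary: ${secondary_serial:-none})"
    return 1
}
```

With the addresses used in this guide, the check would be run as `zone_in_sync 10.0.1.10 10.0.2.10 example.com`.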

Dynamic DNS Updates with BIND

# Configure BIND for dynamic updates
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@   IN  SOA ns1.example.com. admin.example.com. (
            2024010101
            3600
            1800
            604800
            300 )
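    IN  NS  ns1.example.com.
ns1 IN  A   10.0.1.10

; Other records in this zone are managed through dynamic updates authenticated with a TSIG key.
; named must be able to write this file and its journal (.jnl) in the same directory.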
EOF

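# Create a TSIG key for secure updates (prints a key clause that contains the secret).
# Current BIND releases generate TSIG keys with tsig-keygen; dnssec-keygen no longer supports HMAC algorithms.
tsig-keygen -a hmac-sha256 ddns-key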

# Configure BIND to accept dynamic updates
# (replace any earlier zone "example.com" block; BIND rejects a zone that is defined twice)
cat >> /etc/bind/named.conf << 'EOF'
key "ddns-key" {
    algorithm HMAC-SHA256;
    secret "your_generated_secret_key_here";
};

zone "example.com" {
    type master;
    file "/etc/bind/zones/db.example.com";
    
    # Allow dynamic updates from specific key
    update-policy {
        grant ddns-key wildcard *.example.com A TXT;
    };
};
EOF

systemctl restart bind9
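
Once the zone accepts signed updates, a failover can be applied with `nsupdate` instead of editing the zone file. The following is a sketch under stated assumptions: `build_failover_update` is a hypothetical helper that only prints the `nsupdate` command script, and the key file path in the usage example is an assumed location for the key clause generated earlier.

```bash
# Hypothetical helper: prints an nsupdate command script that replaces every
# A record of <name> with a single address.
# Usage: build_failover_update <server> <zone> <name> <ttl> <ip>
build_failover_update() {
    local server=$1
    local zone=$2
    local name=$3
    local ttl=$4
    local ip=$5

    printf 'server %s\n' "$server"
    printf 'zone %s\n' "$zone"
    printf 'update delete %s A\n' "$name"
    printf 'update add %s %s A %s\n' "$name" "$ttl" "$ip"
    printf 'send\n'
}
```

A possible invocation, assuming the key clause was saved to `/etc/bind/ddns-key.key`, is `build_failover_update 127.0.0.1 example.com web.example.com. 60 10.0.1.21 | nsupdate -k /etc/bind/ddns-key.key`. Because the delete and add are sent in one update, the name is never left without an address.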

PowerDNS Configuration

Install and Configure PowerDNS

# Install PowerDNS
apt-get install -y pdns-server pdns-backend-sqlite3

# Basic PowerDNS configuration
cat > /etc/powerdns/pdns.conf << 'EOF'
# General settings
setuid=pdns
setgid=pdns
dnsupdate=yes
allow-dnsupdate-from=127.0.0.1

# Backend configuration
launch=gsqlite3
gsqlite3-database=/var/lib/powerdns/pdns.db

# API settings
api=yes
api-key=your_secure_api_key
webserver=yes
webserver-address=127.0.0.1
webserver-port=8081

# Cache settings
max-cache-entries=1000000
# Keep the packet cache TTL below the record TTL so updates become visible quickly
cache-ttl=20
EOF

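# Initialize the database from the schema shipped with the backend package
# (path as installed by the Debian/Ubuntu package; adjust it if yours differs)
apt-get install -y sqlite3
sqlite3 /var/lib/powerdns/pdns.db < /usr/share/doc/pdns-backend-sqlite3/schema.sqlite3.sql
chown pdns:pdns /var/lib/powerdns /var/lib/powerdns/pdns.db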

systemctl restart pdns

API-Based Failover with PowerDNS

# Script to update DNS via PowerDNS API based on health checks
cat > /usr/local/bin/pdns-failover.sh << 'EOF'
#!/bin/bash

PDNS_API_URL="http://localhost:8081/api/v1"
PDNS_API_KEY="your_secure_api_key"
ZONE_NAME="example.com"
PRIMARY_IP="10.0.1.20"
BACKUP_IP="10.0.1.21"

# Check primary server health
check_primary_health() {
    if curl -s -m 3 "http://$PRIMARY_IP/health" | grep -q "ok"; then
        return 0
    else
        return 1
    fi
}

# Update DNS record via PowerDNS API
update_dns_record() {
    local record_name=$1
    local ip_address=$2
    
    # --fail makes curl return non-zero when the API rejects the change
    curl -sS --fail -X PATCH \
        -H "X-API-Key: $PDNS_API_KEY" \
        -H "Content-Type: application/json" \
        -d "{
            \"rrsets\": [{
                \"name\": \"$record_name\",
                \"type\": \"A\",
                \"changetype\": \"REPLACE\",
                \"ttl\": 60,
                \"records\": [{
                    \"content\": \"$ip_address\",
                    \"disabled\": false
                }]
            }]
        }" \
        "$PDNS_API_URL/servers/localhost/zones/$ZONE_NAME"
}

# Main failover logic
failover_loop() {
    local last_status="unknown"
    
    while true; do
        # The API only accepts fully qualified record names (with a trailing dot)
        if check_primary_health; then
            if [ "$last_status" != "primary" ]; then
                echo "[$(date)] Primary server is healthy, updating DNS"
                update_dns_record "web.example.com." "$PRIMARY_IP"
                last_status="primary"
            fi
        else
            if [ "$last_status" != "backup" ]; then
                echo "[$(date)] Primary server is down, failing over to backup"
                update_dns_record "web.example.com." "$BACKUP_IP"
                last_status="backup"
            fi
        fi
        
        sleep 10
    done
}

failover_loop
EOF

chmod +x /usr/local/bin/pdns-failover.sh
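
The loop above switches to the backup after a single failed check, so one dropped request is enough to trigger a failover. A common refinement is to require several consecutive failures first. The sketch below shows only that decision step; `next_failover_state` is a hypothetical helper, and the default threshold of 3 is an assumption to tune against your check interval.

```bash
# Hypothetical helper: decides the next failover state from the previous state,
# the running count of consecutive failed checks, and the latest check result.
# Prints "<state> <failure_count>".
# Usage: next_failover_state <primary|backup> <failures> <ok|fail> [threshold]
next_failover_state() {
    local state=$1
    local failures=$2
    local result=$3
    local threshold=${4:-3}

    if [ "$result" = "ok" ]; then
        # Any successful check resets the counter and returns traffic to the primary
        echo "primary 0"
        return 0
    fi

    failures=$((failures + 1))
    if [ "$failures" -ge "$threshold" ]; then
        state="backup"
    fi
    echo "$state $failures"
}
```

Inside a loop like the one above, the caller would feed the printed state and counter back in on the next iteration and call the DNS update only when the state actually changes. Failing back on the first successful check is the simplest policy; requiring several consecutive successes would damp flapping in the other direction as well.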

TTL Optimization

TTL Strategy for Failover

# TTL optimization for different scenarios
cat > /tmp/ttl-strategy.sh << 'EOF'
#!/bin/bash

# TTL recommendations for failover:
# - Critical services: 30-60 seconds (faster failover, higher DNS load)
# - Standard services: 300-600 seconds (balanced)
# - Static content: 3600+ seconds (lower DNS load)
# - Geo-redundancy: 30-300 seconds (quick regional failover)

calculate_ttl_for_rto() {
    local recovery_time_objective=$1  # in seconds
    
    # Keep the TTL well below the RTO: failure detection and the DNS update
    # also consume part of the recovery budget
    local recommended_ttl=$((recovery_time_objective / 3))
    
    echo "RTO: ${recovery_time_objective}s"
    echo "Recommended TTL: ${recommended_ttl}s"
}

# Example: Service with 5-minute RTO
calculate_ttl_for_rto 300

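# Dynamic TTL adjustment (uses the HTTP health check script installed earlier)
adjust_ttl_based_on_health() {
    local server=$1
    
    if /usr/local/bin/health-check-http.sh "$server" > /dev/null; then
        # Server healthy: use longer TTL
        echo "300"  # 5 minutes
    else
        # Server unhealthy: publish the replacement record with a short TTL
        # so clients switch back quickly once the server recovers
        echo "30"   # 30 seconds
    fi
}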
EOF

bash /tmp/ttl-strategy.sh
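
The TTL is only one part of the time a client can spend on a dead address: the failure must first be detected, and only then does the cached record start to age out. The arithmetic can be sketched as follows. Both function names are hypothetical, and the model is deliberately simple: it ignores the time needed to apply the DNS update and resolvers that do not honor TTLs.

```bash
# Worst case: the failure happens right after a successful check, the required
# number of failed checks must accumulate, and a resolver cached the old record
# just before the update.
# Usage: worst_case_failover_seconds <check_interval> <failures_required> <ttl>
worst_case_failover_seconds() {
    echo $(( $1 * $2 + $3 ))
}

# Largest TTL that still meets an RTO once detection time is subtracted.
# Prints 0 when detection alone already exceeds the RTO.
# Usage: max_ttl_for_rto <rto> <check_interval> <failures_required>
max_ttl_for_rto() {
    local budget=$(( $1 - $2 * $3 ))
    if [ "$budget" -lt 0 ]; then
        budget=0
    fi
    echo "$budget"
}
```

With a 10-second check interval, three failures required, and a 60-second TTL, `worst_case_failover_seconds 10 3 60` prints 90. For a 300-second RTO under the same checks, `max_ttl_for_rto 300 10 3` prints 270, which is an upper bound rather than a recommendation.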

Implementing TTL Changes in BIND

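# Update BIND zones with optimized TTLs
update_bind_ttl() {
    local zone_file=$1
    local new_ttl=$2
    local zone_name=${3:-example.com}
    
    # Records without an explicit TTL inherit $TTL, so changing the default is enough
    sed -i "s/^\$TTL .*/\$TTL $new_ttl/" "$zone_file"
    
    # Note: the SOA serial must also be incremented, or secondaries ignore the change
    
    # Reload only if the edited zone still parses
    named-checkzone "$zone_name" "$zone_file" && systemctl reload bind9
}

# Example: Set the default TTL of the example.com zone to 60 seconds
# update_bind_ttl "/etc/bind/zones/db.example.com" "60"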
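
Any change to a zone file, including a TTL change, only reaches secondaries when the SOA serial increases. The zone files in this guide use the date-based YYYYMMDDNN convention (for example 2024010101), so the next serial can be computed as in this sketch. `next_serial` is a hypothetical helper; writing the new value back into the zone file is left to the caller.

```bash
# Hypothetical helper: returns the next SOA serial in YYYYMMDDNN form.
# Usage: next_serial <current_serial> [today as YYYYMMDD]
next_serial() {
    local current=$1
    local today=${2:-$(date +%Y%m%d)}
    local first_today=$(( today * 100 + 1 ))

    if [ "$current" -lt "$first_today" ]; then
        # First change of the day: start again at revision 01
        echo "$first_today"
    else
        # Already changed today (or the serial is ahead of the date): increment
        echo $(( current + 1 ))
    fi
}
```

For instance, `next_serial 2024010101 20240305` yields 2024030501, while a second change on the same day, `next_serial 2024030501 20240305`, yields 2024030502.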

Monitoring and Alerting

Health Check Monitoring Script

# Comprehensive health monitoring
cat > /usr/local/bin/monitor-failover.sh << 'EOF'
#!/bin/bash

SERVERS=("web1.example.com" "web2.example.com" "web3.example.com")
CHECK_INTERVAL=30
ALERT_EMAIL="${ALERT_EMAIL:?set ALERT_EMAIL to the address that should receive alerts}"
LOG_FILE="/var/log/dns-failover.log"
STATUS_FILE="/var/lib/dns-failover/status.json"

# Initialize status tracking
mkdir -p "$(dirname "$STATUS_FILE")"

# Health check function
check_server() {
    local server=$1
    
    if curl -s -m 5 "http://$server/health" | grep -q "ok"; then
        echo "up"
    else
        echo "down"
    fi
}

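# Track state changes and alert
monitor_loop() {
    # Last observed state per server, kept in memory and mirrored to STATUS_FILE
    declare -A previous_status
    
    while true; do
        for server in "${SERVERS[@]}"; do
            current_status=$(check_server "$server")
            previous=${previous_status[$server]:-}
            
            if [ "$current_status" != "$previous" ]; then
                echo "[$(date)] Status change: $server is now $current_status" >> "$LOG_FILE"
                
                if [ "$current_status" = "down" ]; then
                    send_alert "CRITICAL: $server is down"
                elif [ -n "$previous" ]; then
                    # Skip the "resolved" alert on the first observation after startup
                    send_alert "RESOLVED: $server is back online"
                fi
                previous_status[$server]=$current_status
            fi
        done
        
        # Write the current state as a flat JSON object
        {
            printf '{'
            sep=""
            for server in "${SERVERS[@]}"; do
                printf '%s"%s":"%s"' "$sep" "$server" "${previous_status[$server]}"
                sep=","
            done
            printf '}\n'
        } > "$STATUS_FILE"
        
        sleep $CHECK_INTERVAL
    done
}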

send_alert() {
    local message=$1
    echo "$message" | mail -s "DNS Failover Alert" "$ALERT_EMAIL"
}

monitor_loop
EOF

chmod +x /usr/local/bin/monitor-failover.sh

Prometheus Metrics for DNS Failover

# Expose DNS failover metrics for Prometheus
cat > /usr/local/bin/dns-failover-exporter.sh << 'EOF'
#!/bin/bash

PORT=9999
SERVERS=("web1.example.com" "web2.example.com")

# Simple HTTP server for Prometheus metrics
start_metrics_server() {
    while true; do
        {
            echo -ne "HTTP/1.1 200 OK\r\n"
            echo -ne "Content-Type: text/plain\r\n"
            echo -ne "Connection: close\r\n"
            echo -ne "\r\n"
            
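            # Export metrics
            echo "# HELP dns_failover_server_up Whether the server passed its health check (1 = up, 0 = down)"
            echo "# TYPE dns_failover_server_up gauge"
            for server in "${SERVERS[@]}"; do
                # -f makes curl fail on HTTP error responses, not only on connection errors
                if curl -sf -m 3 "http://$server/health" > /dev/null; then
                    status=1
                else
                    status=0
                fi
                
                echo "dns_failover_server_up{server=\"$server\"} $status"
            done
        # Listener flags differ between netcat variants; adjust -p and -q for yours
        } | nc -l -p $PORT -q 1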
    done
}

start_metrics_server
EOF

chmod +x /usr/local/bin/dns-failover-exporter.sh

# Prometheus scrape configuration
# (Prometheus loads a single configuration file, so merge this job into your main prometheus.yml)
cat > /etc/prometheus/dns-failover.yml << 'EOF'
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'dns-failover'
    static_configs:
      - targets: ['localhost:9999']
EOF
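
Prometheus expects metrics in its text exposition format, where each metric family may be preceded by `# HELP` and `# TYPE` lines. Keeping the rendering in its own function makes the output easy to check without starting a listener. This is a sketch: `render_failover_metrics` is a hypothetical helper, and it takes the health results as `server=status` arguments instead of probing the servers itself.

```bash
# Hypothetical helper: renders "server=status" pairs in the Prometheus text format.
# Usage: render_failover_metrics web1.example.com=1 web2.example.com=0
render_failover_metrics() {
    local pair

    echo "# HELP dns_failover_server_up Whether the server passed its last health check (1 = up, 0 = down)."
    echo "# TYPE dns_failover_server_up gauge"

    for pair in "$@"; do
        # Split on the last "=": the label is everything before it, the value everything after
        printf 'dns_failover_server_up{server="%s"} %s\n' "${pair%=*}" "${pair##*=}"
    done
}
```

Calling `render_failover_metrics web1.example.com=1 web2.example.com=0` prints the two header lines followed by one sample line per server.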

Failover Testing

Manual Failover Test

# Test failover procedure
test_failover_procedure() {
    echo "DNS Failover Test Procedure"
    echo "============================"
    
    local primary_server="web1.example.com"
    local backup_server="web2.example.com"
    
    # Step 1: Verify primary is healthy
    echo "Step 1: Checking primary server health..."
    curl -v "http://$primary_server/health"
    
    # Step 2: Simulate primary failure
    echo "Step 2: Stopping primary server..."
    ssh "root@$primary_server" "systemctl stop nginx"
    
    # Step 3: Wait for detection and for cached DNS records to expire
    echo "Step 3: Waiting for health check to detect failure..."
    sleep 75  # Longer than the health check interval plus the record TTL (60s here)
    
    # Step 4: Verify failover occurred
    echo "Step 4: Verifying traffic is on backup..."
    for i in {1..10}; do
        # -i includes response headers, where a Server header would appear
        response=$(curl -si "http://example.com" | grep -o "Server: web[0-9]")
        echo "Request $i: $response"
    done
    
    # Step 5: Restore primary
    echo "Step 5: Restoring primary server..."
    ssh "root@$primary_server" "systemctl start nginx"
    
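    # Step 6: Verify failback
    echo "Step 6: Verifying traffic returns to primary..."
    sleep 75  # Longer than the health check interval plus the record TTL (60s here)
    for i in {1..10}; do
        # -i includes response headers, where a Server header would appear
        response=$(curl -si "http://example.com" | grep -o "Server: web[0-9]")
        echo "Request $i: $response"
    done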
}

# Automated failover test
# Usage: automate_failover_test <ssh-target-of-primary> (in user@host form)
automate_failover_test() {
    local primary_ssh=${1:?usage: automate_failover_test <ssh-target-of-primary>}
    local test_log="/var/log/failover-test.log"
    
    {
        echo "[$(date)] Starting automated failover test"
        
        # Get initial DNS response (asked of the authoritative server, so no resolver cache is involved)
        initial_response=$(dig @ns1.example.com web.example.com +short)
        echo "Initial DNS response: $initial_response"
        
        # Simulate server down
        ssh "$primary_ssh" systemctl stop nginx
        
        # Wait and check DNS update
        sleep 40
        failover_response=$(dig @ns1.example.com web.example.com +short)
        echo "Failover DNS response: $failover_response"
        
        # An empty answer means the query failed, which is not a successful failover
        if [ -n "$failover_response" ] && [ "$initial_response" != "$failover_response" ]; then
            echo "✓ Failover successful"
        else
            echo "✗ Failover failed"
        fi
        
        # Restore service
        ssh "$primary_ssh" systemctl start nginx
        
    } | tee -a "$test_log"
}
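
When a name has several A records, `dig +short` may list them in a different order on each query, so comparing the raw output as strings can report a failover that never happened. Comparing the answers as sets avoids that. `same_answer_set` is a hypothetical helper shown as a sketch.

```bash
# Hypothetical helper: compares two "dig +short" answers as sets of addresses,
# so a change in record order alone is not treated as a change in the answer.
# Returns 0 when both answers contain the same addresses.
# Usage: same_answer_set "<answer_a>" "<answer_b>"
same_answer_set() {
    local a
    local b
    a=$(printf '%s\n' "$1" | sort -u)
    b=$(printf '%s\n' "$2" | sort -u)
    [ "$a" = "$b" ]
}
```

In an automated test, the failover check could then be written as `if ! same_answer_set "$initial_response" "$failover_response"; then echo "answer changed"; fi`.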

Advanced Scenarios

Weighted Round-Robin Failover

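# Weighted failover in BIND using SRV records, appended to the existing zone.
# Only clients that look up SRV records honor them (web browsers do not),
# and each SRV target needs its own A or AAAA record.
cat >> /etc/bind/zones/db.example.com << 'EOF'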
; Load balancing with SRV records
_http._tcp.web  IN  SRV  10 60 80 web1.example.com.
_http._tcp.web  IN  SRV  10 40 80 web2.example.com.
_http._tcp.web  IN  SRV  20 100 80 web3.example.com.

; Priority: lower number = higher priority
; Weight: within same priority, distributed based on weight
EOF
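
The priority and weight rules above are applied by the client, not by the DNS server. The sketch below shows the selection logic in simplified form: it keeps only the records with the lowest priority value and picks one in proportion to its weight. `select_srv_target` is a hypothetical helper; the roll is passed in as an argument so the result is reproducible, and the full client algorithm (special handling of zero weights, falling back to the next priority when a target is unreachable) is not modeled.

```bash
# Hypothetical helper: picks a target from SRV data roughly the way a client would.
# Reads "priority weight port target" lines on stdin. <roll> is a number in
# [0, total weight of the best priority group), normally chosen at random.
# Prints "target:port", or nothing if the roll is out of range.
# Usage: select_srv_target <roll> < records
select_srv_target() {
    awk -v roll="$1" '
        {
            prio[NR] = $1 + 0; weight[NR] = $2 + 0; port[NR] = $3; target[NR] = $4
            if (NR == 1 || prio[NR] < best) best = prio[NR]   # lowest value wins
        }
        END {
            for (i = 1; i <= NR; i++) {
                if (prio[i] != best) continue                 # skip backup priorities
                cumulative += weight[i]
                if (roll + 0 < cumulative) { print target[i] ":" port[i]; exit }
            }
        }
    '
}
```

With the three records above (fed as `10 60 80 web1.example.com.` and so on), rolls 0 through 59 select web1 and rolls 60 through 99 select web2; web3 is never chosen while a priority-10 record exists. In bash, a random roll for this record set could be drawn with `$(( RANDOM % 100 ))`.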

Geographic Failover

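# Geographic failover using split-view DNS
# The CIDR ranges below are placeholders: real deployments match clients
# with GeoIP ACLs or per-region prefix lists.
# Note: this overwrites named.conf, and once views are used every zone must live inside a view.
cat > /etc/bind/named.conf << 'EOF'
# Define geographic views
view "europe" {
    match-clients { 80.0.0.0/4; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.eu";
    };
};

view "americas" {
    match-clients { 192.0.0.0/8; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.us";
    };
};

# Catch-all view for clients that match no region (their queries are refused otherwise)
view "default" {
    match-clients { any; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com";
    };
};
EOF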
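
BIND assigns a client to the first view whose `match-clients` list contains its source address, so it helps to check in advance which view a given address would land in. The sketch below does the CIDR test for IPv4 in plain shell arithmetic. `ip_to_int` and `ip_in_cidr` are hypothetical helpers, and octets are assumed to be written without leading zeros.

```bash
# Hypothetical helper: converts a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Hypothetical helper: returns 0 when <ip> lies inside <network>/<prefix>.
# Usage: ip_in_cidr <ip> <cidr>
ip_in_cidr() {
    local ip_int net_int prefix mask
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    # Build a mask with <prefix> leading one bits, truncated to 32 bits
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}
```

For example, `ip_in_cidr 85.1.2.3 80.0.0.0/4` succeeds, so that client would be served by the "europe" view, while `ip_in_cidr 96.0.0.1 80.0.0.0/4` fails because 80.0.0.0/4 only covers 80.0.0.0 through 95.255.255.255.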

Conclusion

Effective DNS failover requires:

  1. Health Checks: Regular probes that detect failures quickly
  2. Low TTLs: Critical services use 30-60 second TTLs for fast failover
  3. Multiple Servers: At least 2-3 nameservers for redundancy
  4. Automation: Scripts to automatically update DNS on failure
  5. Testing: Regular failover drills to verify procedures work
  6. Monitoring: Alerting and metrics on every state change

Choose between BIND (traditional, flexible) and PowerDNS (API-driven, modern) based on your infrastructure needs. Always test failover procedures in staging before deploying to production.