DNS Failover Configuration

DNS failover provides automatic redirection of traffic when primary servers fail. This guide covers health checks, failover record configuration, TTL optimization, BIND/PowerDNS setup, and comprehensive monitoring strategies for high-availability DNS systems.

Table of Contents

  1. DNS Failover Concepts
  2. Health Check Implementation
  3. BIND Configuration
  4. PowerDNS Configuration
  5. TTL Optimization
  6. Monitoring and Alerting
  7. Testing Failover
  8. Advanced Scenarios
  9. Conclusion

DNS Failover Concepts

DNS failover redirects clients to backup servers when primary servers become unavailable. There are several approaches:

  • Round-robin DNS: Alternates between multiple IPs (no health checking)
  • Weighted DNS: Distributes based on server weight/capacity
  • Latency-based: Routes to the server with the lowest measured latency, which is usually but not always the geographically closest
  • Health-checked failover: Routes based on server health status

# Compare failover strategies
compare_failover_strategies() {
    cat << 'EOF'
Strategy         | Health Check | Geographic | Weighted | Complexity
-----------------+--------------+------------+----------+----------
Round-robin      | No           | No         | No       | Low
Weighted         | No           | No         | Yes      | Low
Latency-based    | Optional     | Yes        | Yes      | Medium
Health-checked   | Yes          | Optional   | Yes      | High

Best for:
- Simple load distribution: Round-robin
- Capacity-aware distribution: Weighted
- Geographic optimization: Latency-based
- Mission-critical: Health-checked
EOF
}

compare_failover_strategies
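
The difference between plain round-robin and health-checked failover can be sketched in a few lines of shell. The names `filter_healthy` and `is_healthy` are hypothetical and exist only for this illustration; the stub checker stands in for a real probe such as the health check scripts in the next section.

```shell
# Sketch: health-checked selection over a round-robin pool.
# filter_healthy and is_healthy are illustrative names, not part of any tool.

# Stub checker for the demo: pretend 10.0.1.20 is down.
is_healthy() {
    [ "$1" != "10.0.1.20" ]
}

# Print only the addresses that pass the checker, one per line.
filter_healthy() {
    local ip
    for ip in "$@"; do
        if is_healthy "$ip"; then
            echo "$ip"
        fi
    done
}

filter_healthy 10.0.1.20 10.0.1.21
```

Plain round-robin would publish both addresses regardless; the filtered list is what a health-checked setup would publish.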

Health Check Implementation

HTTP/HTTPS Health Checks

# HTTP-based health check script
cat > /usr/local/bin/health-check-http.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
HEALTH_URL="http://$SERVER_IP/health"
TIMEOUT=5
MAX_RETRIES=3
RETRY_DELAY=2

check_server_health() {
    local retries=0
    
    while [ $retries -lt $MAX_RETRIES ]; do
        HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
            --max-time $TIMEOUT \
            "$HEALTH_URL")
        
        if [ "$HTTP_CODE" = "200" ]; then
            echo "Healthy"
            return 0
        fi
        
        retries=$((retries + 1))
        sleep $RETRY_DELAY
    done
    
    echo "Unhealthy (HTTP $HTTP_CODE)"
    return 1
}

# Check with detailed diagnostics (one request reports both code and timing)
detailed_health_check() {
    local result response_code response_time
    result=$(curl -s -o /dev/null --max-time $TIMEOUT \
        -w "%{http_code} %{time_total}" "$HEALTH_URL")
    response_code=${result%% *}
    response_time=${result##* }
    
    # The delimiter is unquoted so the variables below are expanded
    cat << RESULT
Server: $SERVER_IP
Response Code: $response_code
Response Time: $response_time seconds
Status: $([ "$response_code" = "200" ] && echo "HEALTHY" || echo "UNHEALTHY")
RESULT
}

check_server_health
EOF

chmod +x /usr/local/bin/health-check-http.sh
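
When a check fails, it helps to know whether the server answered with an error or did not answer at all. The sketch below maps the value curl writes for `%{http_code}` to a coarse status; curl reports `000` when no HTTP response was received. The function name `classify_http_code` and the status labels are assumptions made for this example.

```shell
# Sketch: interpret the status code captured by the health check.
# classify_http_code and its labels are illustrative, not a standard.
classify_http_code() {
    case "$1" in
        200)         echo "healthy" ;;
        000)         echo "unreachable" ;;   # curl got no HTTP response
        5[0-9][0-9]) echo "server-error" ;;
        *)           echo "unhealthy" ;;
    esac
}

classify_http_code 503
```

An "unreachable" result usually points at the network or a stopped service, while "server-error" means the application itself is responding but failing.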

TCP Port Health Checks

# TCP-based health check (faster, less resource-intensive)
cat > /usr/local/bin/health-check-tcp.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
PORT=${2:-80}
TIMEOUT=3

check_tcp_port() {
    if timeout $TIMEOUT bash -c "echo > /dev/tcp/$SERVER_IP/$PORT" 2>/dev/null; then
        echo "Healthy"
        return 0
    else
        echo "Unhealthy"
        return 1
    fi
}

# Alternative using nc (netcat)
check_with_netcat() {
    if nc -z -w $TIMEOUT $SERVER_IP $PORT 2>/dev/null; then
        echo "Healthy"
        return 0
    else
        echo "Unhealthy"
        return 1
    fi
}

check_tcp_port
EOF

chmod +x /usr/local/bin/health-check-tcp.sh

Custom Health Check Script

# Advanced health check with multiple criteria
cat > /usr/local/bin/health-check-advanced.sh << 'EOF'
#!/bin/bash

SERVER_IP=$1
HEALTH_SCORE=0
HEALTH_THRESHOLD=70

# Check HTTP response
http_code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "http://$SERVER_IP")
[ "$http_code" = "200" ] && HEALTH_SCORE=$((HEALTH_SCORE + 30))

# Check SSH connectivity
if timeout 2 bash -c "echo > /dev/tcp/$SERVER_IP/22" 2>/dev/null; then
    HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi

# Check disk usage via SSH (an empty result means the check failed, so assume 100%)
disk_usage=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "df / | awk 'NR==2 {print \$5}' | sed 's/%//'" 2>/dev/null)
if [ "${disk_usage:-100}" -lt 80 ]; then
    HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi

# Check available memory (in KiB) via SSH (an empty result counts as 0)
mem_available=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "free | awk 'NR==2 {print \$7}'" 2>/dev/null)
if [ "${mem_available:-0}" -gt 500000 ]; then
    HEALTH_SCORE=$((HEALTH_SCORE + 30))
fi

# Determine status
if [ $HEALTH_SCORE -ge $HEALTH_THRESHOLD ]; then
    echo "Healthy ($HEALTH_SCORE%)"
    exit 0
else
    echo "Unhealthy ($HEALTH_SCORE%)"
    exit 1
fi
EOF

chmod +x /usr/local/bin/health-check-advanced.sh
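
The scoring rule in the script above (30 points for HTTP, 20 for SSH, 20 for disk usage below 80%, 30 for more than 500000 KiB of available memory) can be separated from the probes so that the weights are easy to try out. This is a minimal sketch; `compute_health_score` is a hypothetical helper name and takes the probe results as plain arguments.

```shell
# Sketch: the weighted score on its own, without any network probes.
# compute_health_score HTTP_OK SSH_OK DISK_PCT MEM_KB
# HTTP_OK and SSH_OK are 1 or 0; the name is illustrative only.
compute_health_score() {
    local http_ok=$1 ssh_ok=$2 disk_pct=$3 mem_kb=$4
    local score=0

    if [ "$http_ok" -eq 1 ];     then score=$((score + 30)); fi
    if [ "$ssh_ok" -eq 1 ];      then score=$((score + 20)); fi
    if [ "$disk_pct" -lt 80 ];   then score=$((score + 20)); fi
    if [ "$mem_kb" -gt 500000 ]; then score=$((score + 30)); fi

    echo "$score"
}

# HTTP down, everything else fine: exactly at the threshold of 70
compute_health_score 0 1 50 600000
```

With the threshold at 70, a server whose web service is down can still be reported as healthy, which may or may not be the intended behavior.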

BIND Configuration

Basic BIND DNS Failover

# Install BIND
apt-get install -y bind9 bind9-utils

# Configure BIND for failover (the zones directory does not exist by default)
mkdir -p /etc/bind/zones
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@   IN  SOA ns1.example.com. admin.example.com. (
            2024010101  ; serial
            3600        ; refresh
            1800        ; retry
            604800      ; expire
            300 )       ; minimum
    
    IN  NS  ns1.example.com.
    IN  NS  ns2.example.com.

ns1     IN  A   10.0.1.10
ns2     IN  A   10.0.2.10

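; Primary (10.0.1.20) and backup (10.0.1.21) web servers.
; Both A records are returned in rotating order; BIND performs no health
; checks itself, so an external script must remove a failed address.
web     IN  A   10.0.1.20
web     IN  A   10.0.1.21

; Application servers (multiple A records, same caveat as above)
app     IN  A   10.0.1.30
app     IN  A   10.0.1.31

; Database with round-robin
db      IN  A   10.0.1.40
db      IN  A   10.0.1.41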

; Service with priority weighting (using SRV records)
_service._tcp   IN  SRV 10 60 5000 server1.example.com.
_service._tcp   IN  SRV 20 40 5000 server2.example.com.
EOF

# Include zone in main config
cat >> /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
    type master;
    file "/etc/bind/zones/db.example.com";
    allow-transfer { 10.0.2.10; };
};
EOF

systemctl restart bind9
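
Secondaries only fetch a zone again when its SOA serial increases, so every edit to the zone file needs a new serial. The zone above uses the common YYYYMMDDnn convention (2024010101). The sketch below computes the next serial under that convention; `next_serial` is a hypothetical helper, and the date is passed in as an argument so the result is reproducible.

```shell
# Sketch: next SOA serial in YYYYMMDDnn form.
# next_serial CURRENT_SERIAL TODAY   (TODAY as YYYYMMDD, e.g. from: date +%Y%m%d)
# The helper name is illustrative only.
next_serial() {
    local current=$1 today=$2

    if [ "$current" -lt "${today}00" ]; then
        # First change today: start a new sequence for this date
        echo "${today}01"
    else
        # Already changed today (or serial is ahead): just increment
        echo $((current + 1))
    fi
}

next_serial 2024010101 20240101
```

After writing the new serial into the zone file, `named-checkzone example.com /etc/bind/zones/db.example.com` can validate the file before BIND is reloaded.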

BIND Secondary (Slave) Server Configuration

# Configure slave (secondary) DNS server
cat > /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
    type slave;
    file "/var/cache/bind/db.example.com";
    masters { 10.0.1.10; };
};
EOF

systemctl restart bind9

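# Verify zone transfer
# Run on the secondary: request a transfer from the primary (permitted by allow-transfer)
dig @10.0.1.10 example.com axfr

# Confirm the secondary now serves the zone
dig @10.0.2.10 example.com soa +short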
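
A simple way to confirm that the secondary is current is to compare SOA serials on both servers. The sketch below parses the single line that `dig +short` prints for an SOA query (mname, rname, serial, refresh, retry, expire, minimum). The helper names `soa_serial` and `serials_match` are assumptions for this example.

```shell
# Sketch: compare SOA serials from two nameservers.
# soa_serial and serials_match are illustrative helper names.

# Extract the serial (third field) from a "dig +short SOA" line.
soa_serial() {
    echo "$1" | awk '{print $3}'
}

# Succeed when both SOA lines carry the same serial.
serials_match() {
    [ "$(soa_serial "$1")" = "$(soa_serial "$2")" ]
}

# Usage against the servers from this guide:
#   primary=$(dig @10.0.1.10 example.com SOA +short)
#   secondary=$(dig @10.0.2.10 example.com SOA +short)
#   serials_match "$primary" "$secondary" && echo "in sync" || echo "secondary is behind"
```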

Dynamic DNS Updates with BIND

# Configure BIND for dynamic updates
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@   IN  SOA ns1.example.com. admin.example.com. (
            2024010101
            3600
            1800
            604800
            300 )
    IN  NS  ns1.example.com.

; Allow dynamic updates for specific hosts
; Dynamic DNS keys will be used to authenticate updates
EOF

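# Create TSIG key for secure updates
# (tsig-keygen replaces dnssec-keygen for HMAC keys in current BIND 9 releases;
#  its output is a ready-made key clause)
tsig-keygen -a hmac-sha256 ddns-key > /etc/bind/ddns-key.key
chown root:bind /etc/bind/ddns-key.key
chmod 640 /etc/bind/ddns-key.key

# Configure BIND to accept dynamic updates
# Note: remove the earlier "example.com" zone block from named.conf.local first;
# BIND refuses to load a configuration that defines the same zone twice.
# The zone file and its directory must also be writable by the bind user.
cat >> /etc/bind/named.conf << 'EOF'
include "/etc/bind/ddns-key.key";

zone "example.com" {
    type master;
    file "/etc/bind/zones/db.example.com";
    
    # Allow dynamic updates from specific key
    update-policy {
        grant ddns-key wildcard *.example.com A TXT;
    };
};
EOF

systemctl restart bind9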
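
With dynamic updates enabled, a failover script can swap an A record through `nsupdate` instead of editing the zone file. The sketch below only builds the batch of nsupdate commands so it can be inspected first; `build_nsupdate_batch` is a hypothetical helper, and the key file path in the usage comment is a placeholder.

```shell
# Sketch: generate an nsupdate batch that replaces an A record.
# build_nsupdate_batch NAME TTL IP   (helper name is illustrative)
build_nsupdate_batch() {
    local name=$1 ttl=$2 ip=$3

    cat << BATCH
server 127.0.0.1
zone example.com
update delete $name A
update add $name $ttl A $ip
send
BATCH
}

# Point web.example.com at the backup server with a 60-second TTL
build_nsupdate_batch web.example.com. 60 10.0.1.21

# To apply it, pipe the batch into nsupdate with the TSIG key file:
#   build_nsupdate_batch web.example.com. 60 10.0.1.21 | nsupdate -k /path/to/ddns-key.key
```

The delete and add are sent in one update message, so resolvers never see the name without an address.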

PowerDNS Configuration

Install and Configure PowerDNS

# Install PowerDNS
apt-get install -y pdns-server pdns-backend-sqlite3

# Basic PowerDNS configuration
cat > /etc/powerdns/pdns.conf << 'EOF'
# General settings
setuid=pdns
setgid=pdns
dnsupdate=yes
allow-dnsupdate-from=127.0.0.1

# Backend configuration (backends are selected with "launch")
launch=gsqlite3
gsqlite3-database=/var/lib/powerdns/pdns.db

# API settings (the API is served by the built-in webserver)
api=yes
api-key=your_secure_api_key
webserver=yes
webserver-address=127.0.0.1
webserver-port=8081

# Performance settings
max-cache-entries=1000000
cache-ttl=120
EOF

# Initialize the database from the schema shipped with the backend package.
# The schema path varies by distribution and version; locate it with:
#   dpkg -L pdns-backend-sqlite3 | grep schema
apt-get install -y sqlite3
sqlite3 /var/lib/powerdns/pdns.db < /usr/share/doc/pdns-backend-sqlite3/schema.sqlite3.sql
chown -R pdns:pdns /var/lib/powerdns

systemctl restart pdns

PowerDNS API-Based Failover

# Script to update DNS via PowerDNS API based on health checks
cat > /usr/local/bin/pdns-failover.sh << 'EOF'
#!/bin/bash

PDNS_API_URL="http://localhost:8081/api/v1"
PDNS_API_KEY="your_secure_api_key"
ZONE_NAME="example.com"
PRIMARY_IP="10.0.1.20"
BACKUP_IP="10.0.1.21"

# Check primary server health
check_primary_health() {
    if curl -s -m 3 "http://$PRIMARY_IP/health" | grep -q "ok"; then
        return 0
    else
        return 1
    fi
}

# Update DNS record via PowerDNS API
update_dns_record() {
    local record_name=$1
    local ip_address=$2
    
    curl -s -X PATCH \
        -H "X-API-Key: $PDNS_API_KEY" \
        -H "Content-Type: application/json" \
        -d "{
            \"rrsets\": [{
                \"name\": \"$record_name\",
                \"type\": \"A\",
                \"changetype\": \"REPLACE\",
                \"ttl\": 60,
                \"records\": [{
                    \"content\": \"$ip_address\",
                    \"disabled\": false
                }]
            }]
        }" \
        "$PDNS_API_URL/servers/localhost/zones/$ZONE_NAME" \
        2>/dev/null
}

# Main failover logic
failover_loop() {
    local last_status="unknown"
    
    while true; do
        # The PowerDNS API expects fully qualified names with a trailing dot
        if check_primary_health; then
            if [ "$last_status" != "primary" ]; then
                echo "[$(date)] Primary server is healthy, updating DNS"
                update_dns_record "web.example.com." "$PRIMARY_IP"
                last_status="primary"
            fi
        else
            if [ "$last_status" != "backup" ]; then
                echo "[$(date)] Primary server is down, failing over to backup"
                update_dns_record "web.example.com." "$BACKUP_IP"
                last_status="backup"
            fi
        fi
        
        sleep 10
    done
}

failover_loop
EOF

chmod +x /usr/local/bin/pdns-failover.sh
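
The loop above switches DNS after a single failed check, so one dropped request is enough to trigger a failover. A common refinement is to require several consecutive failures first. The sketch below shows that counting logic in isolation; `next_failure_count`, `should_fail_over`, and the threshold of 3 are assumptions for illustration.

```shell
# Sketch: require several consecutive failed checks before failing over.
# next_failure_count and should_fail_over are illustrative names.

# next_failure_count CURRENT_COUNT CHECK_EXIT_CODE
# Resets to 0 after a successful check, otherwise increments.
next_failure_count() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 + 1 ))
    fi
}

# should_fail_over COUNT THRESHOLD
should_fail_over() {
    [ "$1" -ge "$2" ]
}

# Demo: one failure, a recovery, then three failures in a row (threshold 3)
count=0
for result in 1 0 1 1 1; do
    count=$(next_failure_count "$count" "$result")
    if should_fail_over "$count" 3; then
        echo "fail over after $count consecutive failures"
    fi
done
```

The trade-off is detection time: with a 10-second check interval and a threshold of 3, failover starts roughly 30 seconds after the first failed check.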

TTL Optimization

TTL Strategy for Failover

# TTL optimization for different scenarios
cat > /tmp/ttl-strategy.sh << 'EOF'
#!/bin/bash

# TTL recommendations for failover:
# - Critical services: 30-60 seconds (faster failover, higher DNS load)
# - Standard services: 300-600 seconds (balanced)
# - Static content: 3600+ seconds (lower DNS load)
# - Geo-redundancy: 30-300 seconds (quick regional failover)

calculate_ttl_for_rto() {
    local recovery_time_objective=$1  # in seconds
    
    # TTL should be less than RTO
    local recommended_ttl=$((recovery_time_objective / 3))
    
    echo "RTO: ${recovery_time_objective}s"
    echo "Recommended TTL: ${recommended_ttl}s"
}

# Example: Service with 5-minute RTO
calculate_ttl_for_rto 300

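# Dynamic TTL adjustment
# Note: a shorter TTL only affects lookups made after the change; resolvers
# that already cached the record keep it until the old TTL expires.
adjust_ttl_based_on_health() {
    local server=$1
    
    # Reuse the HTTP health check script installed earlier
    if /usr/local/bin/health-check-http.sh "$server" > /dev/null; then
        # Server healthy: use longer TTL
        echo "300"  # 5 minutes
    else
        # Server unhealthy: use shorter TTL for quick failover
        echo "30"   # 30 seconds
    fi
}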
EOF

bash /tmp/ttl-strategy.sh
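
The TTL is only one part of the time clients wait during a failover; detection takes time as well. A rough upper bound is (check interval + check timeout) × failures required, plus the TTL. The sketch below does that arithmetic. The function name `worst_case_failover` is hypothetical, and the estimate assumes resolvers honor the TTL.

```shell
# Sketch: rough worst-case failover time in seconds.
# worst_case_failover INTERVAL TIMEOUT FAILURES_REQUIRED TTL
# Assumes resolvers honor the TTL; the helper name is illustrative.
worst_case_failover() {
    local interval=$1 timeout=$2 failures=$3 ttl=$4
    echo $(( (interval + timeout) * failures + ttl ))
}

# Values from the PowerDNS failover script: 10 s interval, 3 s timeout,
# failover on the first failure, 60 s TTL
worst_case_failover 10 3 1 60
```

Comparing this figure with the RTO shows whether the TTL, the check interval, or both need to shrink.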

Implement TTL Changes in BIND

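# Update BIND zones with optimized TTLs
update_bind_ttl() {
    local zone_file=$1
    local new_ttl=$2
    
    # Records without an explicit TTL inherit $TTL, so changing the
    # directive is enough; rewriting every record line is not needed.
    sed -i "s/^\$TTL .*/\$TTL $new_ttl/" "$zone_file"
    
    # Increment the SOA serial as well, or secondaries will not pick up the change.
    # Validate the file before reloading.
    named-checkzone example.com "$zone_file" && systemctl reload bind9
}

# Example: Set the default TTL of the zone to 60 seconds
# update_bind_ttl "/etc/bind/zones/db.example.com" "60"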
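
After a TTL change it is worth checking what the nameserver actually returns. In the answer section printed by `dig +noall +answer`, the TTL is the second field of each record line. The sketch below extracts it; `answer_ttl` is a hypothetical helper name.

```shell
# Sketch: read the TTL from one line of "dig +noall +answer" output.
# Line format: NAME TTL CLASS TYPE RDATA   (helper name is illustrative)
answer_ttl() {
    echo "$1" | awk '{print $2}'
}

# Usage (querying the authoritative server directly shows the configured TTL,
# while a caching resolver shows the remaining time):
#   line=$(dig @ns1.example.com web.example.com A +noall +answer | head -n 1)
#   answer_ttl "$line"
```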

Monitoring and Alerting

Health Check Monitoring Script

# Comprehensive health monitoring
cat > /usr/local/bin/monitor-failover.sh << 'EOF'
#!/bin/bash

SERVERS=("web1.example.com" "web2.example.com" "web3.example.com")
CHECK_INTERVAL=30
ALERT_EMAIL="alerts@example.com"  # placeholder address; replace with your own
LOG_FILE="/var/log/dns-failover.log"
STATUS_FILE="/var/lib/dns-failover/status.json"

# Initialize status tracking
mkdir -p "$(dirname "$STATUS_FILE")"

# Health check function
check_server() {
    local server=$1
    
    if curl -s -m 5 "http://$server/health" | grep -q "ok"; then
        echo "up"
    else
        echo "down"
    fi
}

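# Track state changes and alert
monitor_loop() {
    declare -A previous_status   # last known state per server
    
    while true; do
        for server in "${SERVERS[@]}"; do
            current_status=$(check_server "$server")
            last_status=${previous_status[$server]:-unknown}
            
            if [ "$current_status" != "$last_status" ]; then
                echo "[$(date)] Status change: $server is now $current_status" >> "$LOG_FILE"
                
                if [ "$current_status" = "down" ]; then
                    send_alert "CRITICAL: $server is down"
                elif [ "$last_status" = "down" ]; then
                    send_alert "RESOLVED: $server is back online"
                fi
                previous_status[$server]=$current_status
            fi
        done
        
        # Persist the latest states as JSON so other tools can read them
        {
            printf '{'
            sep=''
            for server in "${SERVERS[@]}"; do
                printf '%s"%s":"%s"' "$sep" "$server" "${previous_status[$server]}"
                sep=','
            done
            printf '}\n'
        } > "$STATUS_FILE"
        
        sleep $CHECK_INTERVAL
    done
}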

send_alert() {
    local message=$1
    echo "$message" | mail -s "DNS Failover Alert" "$ALERT_EMAIL"
}

monitor_loop
EOF

chmod +x /usr/local/bin/monitor-failover.sh

Prometheus Metrics for DNS Failover

# Expose DNS failover metrics for Prometheus
cat > /usr/local/bin/dns-failover-exporter.sh << 'EOF'
#!/bin/bash

PORT=9999
SERVERS=("web1.example.com" "web2.example.com")

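# Simple HTTP server for Prometheus metrics
# (the nc flags below depend on the installed netcat variant; the metrics are
#  computed when the listener starts, so each scrape sees the previous result)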
start_metrics_server() {
    while true; do
        {
            echo -ne "HTTP/1.1 200 OK\r\n"
            echo -ne "Content-Type: text/plain\r\n"
            echo -ne "Connection: close\r\n"
            echo -ne "\r\n"
            
            # Export metrics
            for server in "${SERVERS[@]}"; do
                # -f makes curl fail on HTTP error responses (4xx/5xx)
                if curl -sf -m 3 "http://$server/health" > /dev/null; then
                    status=1
                else
                    status=0
                fi
                
                echo "dns_failover_server_up{server=\"$server\"} $status"
            done
        } | nc -l -p $PORT -q 1
    done
}

start_metrics_server
EOF

chmod +x /usr/local/bin/dns-failover-exporter.sh

# Prometheus scrape configuration
cat > /etc/prometheus/dns-failover.yml << 'EOF'
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'dns-failover'
    static_configs:
      - targets: ['localhost:9999']
EOF
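
Prometheus accepts bare samples, but the text exposition format also allows `# HELP` and `# TYPE` lines that describe each metric. The sketch below renders the same gauge with that metadata. `render_metrics` is a hypothetical helper, and the `server=status` argument form is an assumption chosen to keep the example self-contained.

```shell
# Sketch: render the gauge with HELP/TYPE metadata.
# render_metrics SERVER=STATUS [SERVER=STATUS ...]   (illustrative helper)
render_metrics() {
    local pair

    echo "# HELP dns_failover_server_up Whether the server passed its last health check (1 = up, 0 = down)."
    echo "# TYPE dns_failover_server_up gauge"

    for pair in "$@"; do
        # Split "server=status" at the "=" sign
        echo "dns_failover_server_up{server=\"${pair%%=*}\"} ${pair##*=}"
    done
}

render_metrics web1.example.com=1 web2.example.com=0
```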

Testing Failover

Manual Failover Test

# Test failover procedure
test_failover_procedure() {
    echo "DNS Failover Test Procedure"
    echo "============================"
    
    local primary_server="web1.example.com"
    local backup_server="web2.example.com"
    
    # Step 1: Verify primary is healthy
    echo "Step 1: Checking primary server health..."
    curl -v "http://$primary_server/health"
    
    # Step 2: Simulate primary failure
    echo "Step 2: Stopping primary server..."
    ssh "root@$primary_server" "systemctl stop nginx"
    
    # Wait for the health check to detect the failure and for cached records to expire
    echo "Step 3: Waiting for health check to detect failure..."
    sleep 75  # Health check interval plus the 60-second record TTL, with margin
    
    # Step 4: Verify failover occurred
    echo "Step 4: Verifying traffic is on backup..."
    for i in {1..10}; do
        response=$(curl -s "http://example.com" | grep -o "Server: web[0-9]")
        echo "Request $i: $response"
    done
    
    # Step 5: Restore primary
    echo "Step 5: Restoring primary server..."
    ssh "root@$primary_server" "systemctl start nginx"
    
    # Step 6: Verify failback
    echo "Step 6: Verifying traffic returns to primary..."
    sleep 75  # Health check interval plus the 60-second record TTL, with margin
    for i in {1..10}; do
        response=$(curl -s "http://example.com" | grep -o "Server: web[0-9]")
        echo "Request $i: $response"
    done
}

# Automated failover test
automate_failover_test() {
    local test_log="/var/log/failover-test.log"
    local primary_server="web1.example.com"
    
    {
        echo "[$(date)] Starting automated failover test"
        
        # Get initial DNS response
        initial_response=$(dig @ns1.example.com web.example.com +short)
        echo "Initial DNS response: $initial_response"
        
        # Simulate server down
        ssh "root@$primary_server" systemctl stop nginx
        
        # Wait and check DNS update
        sleep 40
        failover_response=$(dig @ns1.example.com web.example.com +short)
        echo "Failover DNS response: $failover_response"
        
        if [ "$initial_response" != "$failover_response" ]; then
            echo "✓ Failover successful"
        else
            echo "✗ Failover failed"
        fi
        
        # Restore service
        ssh "root@$primary_server" systemctl start nginx
        
        
    } | tee -a "$test_log"
}
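
A fixed sleep either waits too long or not long enough. An alternative is to poll until the DNS answer changes, with an upper time limit. In the sketch below the lookup command is passed in as arguments, so the same function works with `dig` or with a stub. `wait_for_change` is a hypothetical helper name.

```shell
# Sketch: poll a command until its output differs from an initial value.
# wait_for_change INITIAL TIMEOUT INTERVAL COMMAND [ARGS...]
# Prints the new value and returns 0, or returns 1 when TIMEOUT seconds pass.
# The helper name is illustrative only.
wait_for_change() {
    local initial=$1 timeout=$2 interval=$3
    shift 3
    local elapsed=0 current

    while [ "$elapsed" -le "$timeout" ]; do
        current=$("$@")
        # Ignore empty answers (lookup errors) and unchanged answers
        if [ -n "$current" ] && [ "$current" != "$initial" ]; then
            echo "$current"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Usage inside the automated test, polling every 5 seconds for up to 2 minutes:
#   wait_for_change "$initial_response" 120 5 dig @ns1.example.com web.example.com +short
```

The elapsed time at which the function returns is also a direct measurement of the failover delay.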

Advanced Scenarios

Weighted Round-Robin Failover

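# BIND weighted failover using SRV records
# (append to the existing zone so the SOA and NS records are kept;
#  web1, web2 and web3 also need A records in the zone)
cat >> /etc/bind/zones/db.example.com << 'EOF'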
; Load balancing with SRV records
_http._tcp.web  IN  SRV  10 60 80 web1.example.com.
_http._tcp.web  IN  SRV  10 40 80 web2.example.com.
_http._tcp.web  IN  SRV  20 100 80 web3.example.com.

; Priority: lower number = higher priority
; Weight: within same priority, distributed based on weight
EOF
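
For SRV-aware clients, the weights translate into expected traffic shares within the lowest-numbered priority; higher-numbered priorities are used only when every target in the preferred group is unreachable. The sketch below computes those shares from lines of `priority weight port target`. `srv_shares` is a hypothetical helper, and the percentages are long-run expectations, since the selection itself is random.

```shell
# Sketch: expected traffic share per target in the preferred priority group.
# Reads lines of "priority weight port target" on stdin.
# srv_shares is an illustrative helper name.
srv_shares() {
    awk '
        {
            prio[NR] = $1 + 0; weight[NR] = $2 + 0; target[NR] = $4
            if (NR == 1 || prio[NR] < min) min = prio[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                if (prio[i] == min) total += weight[i]
            for (i = 1; i <= NR; i++)
                if (prio[i] == min && total > 0)
                    printf "%s %d%%\n", target[i], weight[i] * 100 / total
        }'
}

# The three records from the zone above
printf '%s\n' \
    "10 60 80 web1.example.com." \
    "10 40 80 web2.example.com." \
    "20 100 80 web3.example.com." | srv_shares
```

Here web1 and web2 split the traffic 60/40, and web3 appears in the output only if the priority 10 records are removed.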

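Geographic Failover

# Geographic failover using split-view DNS
# The client ranges below are placeholders: address space does not map cleanly
# to continents, so production setups usually match on GeoIP ACLs or curated
# prefix lists. Once views are used, every zone must be defined inside a view.
cat > /etc/bind/named.conf << 'EOF'
# Define geographic zones
view "europe" {
    match-clients { 80.0.0.0/4; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.eu";
    };
};

view "americas" {
    match-clients { 192.0.0.0/8; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com.us";
    };
};

# Catch-all for clients that match neither range (otherwise they are refused)
view "default" {
    match-clients { any; };
    
    zone "example.com" {
        type master;
        file "/etc/bind/zones/db.example.com";
    };
};
EOF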

Conclusion

Effective DNS failover requires:

  1. Health Checks: Regular monitoring to detect failures quickly
  2. Low TTLs: Critical services use 30-60 second TTLs for fast failover
  3. Multiple Servers: At least 2-3 nameservers for redundancy
  4. Automation: Scripts to automatically update DNS on failure
  5. Testing: Regular failover drills to verify procedures work
  6. Monitoring: Continuous health checks and alerting

Choose between BIND (traditional, flexible) and PowerDNS (API-driven, modern) based on your infrastructure needs. Always test failover procedures in staging before deploying to production.