DNS Failover Configuration
DNS failover provides automatic redirection of traffic when primary servers fail. This guide covers health checks, failover record configuration, TTL optimization, BIND/PowerDNS setup, and comprehensive monitoring strategies for high-availability DNS systems.
Table of Contents
- DNS Failover Concepts
- Health Check Implementation
- BIND Configuration
- PowerDNS Configuration
- TTL Optimization
- Monitoring and Alerting
- Testing Failover
- Advanced Scenarios
- Conclusion
DNS Failover Concepts
DNS failover redirects clients to backup servers when primary servers become unavailable. There are several approaches:
- Round-robin DNS: Alternates between multiple IPs (no health checking)
- Weighted DNS: Distributes based on server weight/capacity
- Latency-based: Routes each client to the server with the lowest measured latency, which is usually but not always the geographically closest one
- Health-checked failover: Routes based on server health status
# Compare failover strategies
compare_failover_strategies() {
cat << 'EOF'
Strategy       | Health Check | Geographic | Weighted | Complexity
---------------+--------------+------------+----------+-----------
Round-robin    | No           | No         | No       | Low
Weighted       | No           | No         | Yes      | Low
Latency-based  | Optional     | Yes        | Yes      | Medium
Health-checked | Yes          | Optional   | Yes      | High
Best for:
- Simple load distribution: Round-robin
- Capacity-aware distribution: Weighted
- Geographic optimization: Latency-based
- Mission-critical: Health-checked
EOF
}
compare_failover_strategies
Health Check Implementation
HTTP/HTTPS Health Checks
# HTTP-based health check script
cat > /usr/local/bin/health-check-http.sh << 'EOF'
#!/bin/bash
SERVER_IP=$1
HEALTH_URL="http://$SERVER_IP/health"
TIMEOUT=5
MAX_RETRIES=3
RETRY_DELAY=2
check_server_health() {
local retries=0
while [ $retries -lt $MAX_RETRIES ]; do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
--max-time $TIMEOUT \
"$HEALTH_URL")
if [ "$HTTP_CODE" = "200" ]; then
echo "Healthy"
return 0
fi
retries=$((retries + 1))
sleep $RETRY_DELAY
done
echo "Unhealthy (HTTP $HTTP_CODE)"
return 1
}
# Check with detailed diagnostics
detailed_health_check() {
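local response_time=$(curl -s -o /dev/null --max-time $TIMEOUT -w "%{time_total}" "$HEALTH_URL")
local response_code=$(curl -s -o /dev/null --max-time $TIMEOUT -w "%{http_code}" "$HEALTH_URL")
# Unquoted delimiter so that the variables below are expanded
cat << RESULT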
Server: $SERVER_IP
Response Code: $response_code
Response Time: $response_time seconds
Status: $([ "$response_code" = "200" ] && echo "HEALTHY" || echo "UNHEALTHY")
RESULT
}
check_server_health
EOF
chmod +x /usr/local/bin/health-check-http.sh
TCP Port Health Checks
# TCP-based health check (faster, less resource-intensive)
cat > /usr/local/bin/health-check-tcp.sh << 'EOF'
#!/bin/bash
SERVER_IP=$1
PORT=${2:-80}
TIMEOUT=3
check_tcp_port() {
if timeout $TIMEOUT bash -c "echo > /dev/tcp/$SERVER_IP/$PORT" 2>/dev/null; then
echo "Healthy"
return 0
else
echo "Unhealthy"
return 1
fi
}
# Alternative using nc (netcat)
check_with_netcat() {
if nc -z -w $TIMEOUT $SERVER_IP $PORT 2>/dev/null; then
echo "Healthy"
return 0
else
echo "Unhealthy"
return 1
fi
}
check_tcp_port
EOF
chmod +x /usr/local/bin/health-check-tcp.sh
Custom Health Check Script
# Advanced health check with multiple criteria
cat > /usr/local/bin/health-check-advanced.sh << 'EOF'
#!/bin/bash
SERVER_IP=$1
HEALTH_SCORE=0
HEALTH_THRESHOLD=70
# Check HTTP response (bounded by a timeout so a hung server cannot stall the check)
http_code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "http://$SERVER_IP")
[ "$http_code" = "200" ] && HEALTH_SCORE=$((HEALTH_SCORE + 30))
# Check SSH connectivity
if timeout 2 bash -c "echo > /dev/tcp/$SERVER_IP/22" 2>/dev/null; then
HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi
# Check disk usage via SSH (BatchMode avoids hanging on a password prompt)
disk_usage=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "df / | awk 'NR==2 {print \$5}' | sed 's/%//'")
# An empty result (SSH failed) counts as 100% used, so no points are awarded
if [ "${disk_usage:-100}" -lt 80 ]; then
HEALTH_SCORE=$((HEALTH_SCORE + 20))
fi
# Check available memory (in kB) via SSH
mem_available=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "root@$SERVER_IP" "free | awk 'NR==2 {print \$7}'")
# An empty result (SSH failed) counts as 0 kB available
if [ "${mem_available:-0}" -gt 500000 ]; then
HEALTH_SCORE=$((HEALTH_SCORE + 30))
fi
# Determine status
if [ $HEALTH_SCORE -ge $HEALTH_THRESHOLD ]; then
echo "Healthy ($HEALTH_SCORE%)"
exit 0
else
echo "Unhealthy ($HEALTH_SCORE%)"
exit 1
fi
EOF
chmod +x /usr/local/bin/health-check-advanced.sh
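The weighting used by the advanced check (30 points for HTTP, 20 for SSH, 20 for disk, 30 for memory, threshold 70) is easier to reason about when the scoring is separated from the probes. The following is a minimal sketch under that assumption; the function name `compute_health_score` and the sample probe values are illustrative and not part of the script above.

```shell
#!/bin/bash
# Sketch: scoring logic only, with probe results passed in as arguments.
# Arguments: http_ok (0/1), ssh_ok (0/1), disk usage in percent, available memory in kB.
compute_health_score() {
    local http_ok=$1 ssh_ok=$2 disk_usage=$3 mem_available=$4
    local score=0

    [ "$http_ok" -eq 1 ] && score=$((score + 30))
    [ "$ssh_ok" -eq 1 ] && score=$((score + 20))
    # Empty values (a probe that failed) are treated as the worst case
    [ "${disk_usage:-100}" -lt 80 ] && score=$((score + 20))
    [ "${mem_available:-0}" -gt 500000 ] && score=$((score + 30))

    echo "$score"
}

# Made-up probe results: HTTP up, SSH up, disk 91% full, about 2 GB available
compute_health_score 1 1 91 2000000   # prints 80
```

With these inputs the server still passes the threshold of 70 despite the full disk, which shows how the weights decide which single failures are tolerated.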
BIND Configuration
Basic BIND DNS Failover
# Install BIND
apt-get install -y bind9 bind9-utils
# Configure BIND for failover
mkdir -p /etc/bind/zones
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@       IN SOA ns1.example.com. admin.example.com. (
            2024010101 ; serial
            3600       ; refresh
            1800       ; retry
            604800     ; expire
            300 )      ; negative-cache TTL
        IN NS  ns1.example.com.
        IN NS  ns2.example.com.
ns1     IN A   10.0.1.10
ns2     IN A   10.0.2.10
; Web servers: primary (10.0.1.20) and backup (10.0.1.21).
; Plain A records carry no priority, so resolvers rotate between both
; addresses until a health check removes the failed one.
web     IN A   10.0.1.20
web     IN A   10.0.1.21
; Application servers (multiple A records, same caveat as above)
app     IN A   10.0.1.30
app     IN A   10.0.1.31
; Database with round-robin
db      IN A   10.0.1.40
db      IN A   10.0.1.41
; Service with priorities and weights (SRV records): clients try priority 10
; first and fall back to priority 20; weights only split load within one priority
_service._tcp IN SRV 10 60 5000 server1.example.com.
_service._tcp IN SRV 20 40 5000 server2.example.com.
EOF
# Include zone in main config
cat >> /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
type master;
file "/etc/bind/zones/db.example.com";
allow-transfer { 10.0.2.10; };
};
EOF
systemctl restart bind9
BIND Slave (Secondary) Server Configuration
# Configure slave (secondary) DNS server
cat > /etc/bind/named.conf.local << 'EOF'
zone "example.com" {
type slave;
file "/var/cache/bind/db.example.com";
masters { 10.0.1.10; };
};
EOF
systemctl restart bind9
# Verify that the secondary has loaded the zone (its SOA serial should match the primary's)
dig @10.0.2.10 example.com SOA +short
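Replication between the two name servers can be checked by comparing SOA serials. The sketch below parses the single-line format that `dig +short SOA` prints (primary name server, contact, serial, refresh, retry, expire, minimum). The helper names `soa_serial` and `serials_match` are illustrative, and the sample lines are hard-coded so the example runs without a DNS server.

```shell
#!/bin/bash
# Sketch: extract the serial (third field) from one line of `dig +short SOA` output.
soa_serial() {
    awk '{print $3}' <<< "$1"
}

# Succeeds when both SOA lines carry the same serial.
serials_match() {
    [ "$(soa_serial "$1")" = "$(soa_serial "$2")" ]
}

# Sample SOA lines using the values from the zone file in this guide
primary_soa="ns1.example.com. admin.example.com. 2024010101 3600 1800 604800 300"
secondary_soa="ns1.example.com. admin.example.com. 2024010101 3600 1800 604800 300"

if serials_match "$primary_soa" "$secondary_soa"; then
    echo "in sync"
else
    echo "secondary is behind"
fi
```

In practice the two variables would be filled from live queries, for example `primary_soa=$(dig @10.0.1.10 example.com SOA +short)` and the same query against 10.0.2.10.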
Dynamic DNS Updates with BIND
# Configure BIND for dynamic updates
cat > /etc/bind/zones/db.example.com << 'EOF'
$TTL 300
@       IN SOA ns1.example.com. admin.example.com. (
            2024010101
            3600
            1800
            604800
            300 )
        IN NS  ns1.example.com.
ns1     IN A   10.0.1.10
; Allow dynamic updates for specific hosts
; Dynamic DNS keys will be used to authenticate updates
EOF
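# Create TSIG key for secure updates (prints a key block whose secret goes into named.conf;
# recent BIND releases no longer generate HMAC keys with dnssec-keygen)
tsig-keygen -a hmac-sha256 ddns-key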
# Configure BIND to accept dynamic updates
# (a zone may only be defined once, so remove any earlier example.com zone block first)
cat >> /etc/bind/named.conf << 'EOF'
key "ddns-key" {
algorithm HMAC-SHA256;
secret "your_generated_secret_key_here";
};
zone "example.com" {
type master;
file "/etc/bind/zones/db.example.com";
# Allow dynamic updates from specific key
update-policy {
grant ddns-key wildcard *.example.com A TXT;
};
};
EOF
systemctl restart bind9
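Once the key and update policy are in place, a failover can be pushed with `nsupdate`. The following sketch only builds the command stream, so it can be inspected before anything is sent; the function name `build_nsupdate_commands`, the default TTL of 60, and the local server address are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: print the nsupdate commands that replace the A record for a name.
build_nsupdate_commands() {
    local fqdn="${1%.}."     # normalise to exactly one trailing dot
    local new_ip=$2
    local ttl=${3:-60}

    printf 'server 127.0.0.1\n'
    printf 'zone example.com\n'
    printf 'update delete %s A\n' "$fqdn"
    printf 'update add %s %s A %s\n' "$fqdn" "$ttl" "$new_ip"
    printf 'send\n'
}

# Point web.example.com at the backup address used in this guide
build_nsupdate_commands web.example.com 10.0.1.21
```

To apply the change, the output would be piped into `nsupdate -k <keyfile>`, where the key file holds the `ddns-key` block; its location depends on where you saved the generated key.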
PowerDNS Configuration
Install and Configure PowerDNS
# Install PowerDNS
apt-get install -y pdns-server pdns-backend-sqlite3
# Basic PowerDNS configuration
cat > /etc/powerdns/pdns.conf << 'EOF'
# General settings
setuid=pdns
setgid=pdns
dnsupdate=yes
allow-dnsupdate-from=127.0.0.1
# Backend configuration
launch=gsqlite3
gsqlite3-database=/var/lib/powerdns/pdns.db
# API settings (the API is served by the built-in webserver)
api=yes
api-key=your_secure_api_key
webserver=yes
webserver-address=127.0.0.1
webserver-port=8081
# Performance settings
max-cache-entries=1000000
cache-ttl=120
EOF
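# Initialize database from the schema shipped with the backend package
# (path as found on Debian/Ubuntu; adjust it if your package installs the schema elsewhere)
sqlite3 /var/lib/powerdns/pdns.db < /usr/share/doc/pdns-backend-sqlite3/schema.sqlite3.sql
chown pdns:pdns /var/lib/powerdns/pdns.db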
systemctl restart pdns
PowerDNS API-Based Failover
# Script to update DNS via PowerDNS API based on health checks
cat > /usr/local/bin/pdns-failover.sh << 'EOF'
#!/bin/bash
PDNS_API_URL="http://localhost:8081/api/v1"
PDNS_API_KEY="your_secure_api_key"
ZONE_NAME="example.com"
PRIMARY_IP="10.0.1.20"
BACKUP_IP="10.0.1.21"
# Check primary server health
check_primary_health() {
if curl -s -m 3 "http://$PRIMARY_IP/health" | grep -q "ok"; then
return 0
else
return 1
fi
}
# Update DNS record via PowerDNS API
update_dns_record() {
# The API expects fully qualified names, so normalise to exactly one trailing dot
local record_name="${1%.}."
local ip_address=$2
curl -s -X PATCH \
-H "X-API-Key: $PDNS_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"rrsets\": [{
\"name\": \"$record_name\",
\"type\": \"A\",
\"changetype\": \"REPLACE\",
\"ttl\": 60,
\"records\": [{
\"content\": \"$ip_address\",
\"disabled\": false
}]
}]
}" \
"$PDNS_API_URL/servers/localhost/zones/$ZONE_NAME" \
2>/dev/null
}
# Main failover logic
failover_loop() {
local last_status="unknown"
while true; do
if check_primary_health; then
if [ "$last_status" != "primary" ]; then
echo "[$(date)] Primary server is healthy, updating DNS"
update_dns_record "web.example.com" "$PRIMARY_IP"
last_status="primary"
fi
else
if [ "$last_status" != "backup" ]; then
echo "[$(date)] Primary server is down, failing over to backup"
update_dns_record "web.example.com" "$BACKUP_IP"
last_status="backup"
fi
fi
sleep 10
done
}
failover_loop
EOF
chmod +x /usr/local/bin/pdns-failover.sh
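A loop that switches DNS on a single failed check can flap when the primary has a brief hiccup. One common mitigation is to require several consecutive failures before failing over. The sketch below models only that decision; `FAIL_THRESHOLD`, `record_check_result`, and `desired_target` are illustrative names, and the check results are simulated rather than taken from a real probe.

```shell
#!/bin/bash
# Sketch: flap damping with a consecutive-failure threshold.
FAIL_THRESHOLD=3
fail_count=0
desired_target="primary"

# Feed one check result ("ok" or "fail"); updates desired_target in place.
record_check_result() {
    if [ "$1" = "ok" ]; then
        fail_count=0
        desired_target="primary"
    else
        fail_count=$((fail_count + 1))
        if [ "$fail_count" -ge "$FAIL_THRESHOLD" ]; then
            desired_target="backup"
        fi
    fi
}

# Simulated sequence: one transient failure, then a sustained outage
for result in ok fail ok fail fail fail; do
    record_check_result "$result"
    echo "$result -> $desired_target"
done
```

Only the third consecutive failure moves the target to the backup. This sketch fails back on the first successful check; a stricter variant could require several consecutive successes as well.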
TTL Optimization
TTL Strategy for Failover
# TTL optimization for different scenarios
cat > /tmp/ttl-strategy.sh << 'EOF'
#!/bin/bash
# TTL recommendations for failover:
# - Critical services: 30-60 seconds (faster failover, higher DNS load)
# - Standard services: 300-600 seconds (balanced)
# - Static content: 3600+ seconds (lower DNS load)
# - Geo-redundancy: 30-300 seconds (quick regional failover)
calculate_ttl_for_rto() {
local recovery_time_objective=$1 # in seconds
# TTL should be less than RTO
local recommended_ttl=$((recovery_time_objective / 3))
echo "RTO: ${recovery_time_objective}s"
echo "Recommended TTL: ${recommended_ttl}s"
}
# Example: Service with 5-minute RTO
calculate_ttl_for_rto 300
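# Dynamic TTL adjustment (defined for illustration and not called here; it relies on a
# check_server_health function such as the one in health-check-http.sh)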
adjust_ttl_based_on_health() {
local server=$1
if check_server_health "$server"; then
# Server healthy: use longer TTL
echo "300" # 5 minutes
else
# Server unhealthy: use shorter TTL for quick failover
echo "30" # 30 seconds
fi
}
EOF
bash /tmp/ttl-strategy.sh
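The TTL is only one part of the time a client may keep reaching a failed server. A rough upper bound adds the detection time (check interval multiplied by the number of failed checks required) to the TTL. The sketch below does that arithmetic; the function name and sample values are assumptions, and it ignores update propagation to secondaries and resolvers that do not honour the TTL.

```shell
#!/bin/bash
# Sketch: rough worst-case failover time in seconds.
# Arguments: check interval (s), failed checks required, record TTL (s).
worst_case_failover_seconds() {
    local check_interval=$1 fail_threshold=$2 ttl=$3
    echo $(( check_interval * fail_threshold + ttl ))
}

# 10-second checks, 3 failures required, 60-second TTL
worst_case_failover_seconds 10 3 60   # prints 90
```

Comparing this figure with the recovery time objective shows whether a shorter TTL or a tighter check interval gives the bigger improvement.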
Implementing TTL Changes in BIND
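# Update BIND zones with optimized TTLs
update_bind_ttl() {
local zone_file=$1
local new_ttl=$2
local serial
# Records without an explicit TTL inherit $TTL, so only the directive needs to change
sed -i "s/^\$TTL .*/\$TTL $new_ttl/" "$zone_file"
# Increment the SOA serial (assumed to be the first 10-digit number in the file)
# so that secondaries pick up the change
serial=$(grep -oE '[0-9]{10}' "$zone_file" | head -n 1)
[ -n "$serial" ] && sed -i "0,/$serial/s//$((serial + 1))/" "$zone_file"
# Validate the zone before reloading
named-checkzone example.com "$zone_file" && systemctl reload bind9
}
# Example: set the default TTL of the example.com zone to 60 seconds
# update_bind_ttl "/etc/bind/zones/db.example.com" "60"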
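Any edit to a zone file, including a TTL change, only reaches secondary servers if the SOA serial increases. The zone in this guide uses a date-based serial (2024010101), so the sketch below assumes the common YYYYMMDDNN convention; the function name `next_serial` is illustrative.

```shell
#!/bin/bash
# Sketch: compute the next SOA serial in the YYYYMMDDNN convention.
# Arguments: current serial, optional date as YYYYMMDD (defaults to today).
next_serial() {
    local current=$1
    local today=${2:-$(date +%Y%m%d)}
    local base="${today}00"

    if [ "$current" -ge "$base" ]; then
        # Already changed today (or the serial is ahead of the date): increment
        echo $((current + 1))
    else
        # First change today: start at revision 01
        echo $((base + 1))
    fi
}

next_serial 2024010101 20240101   # prints 2024010102
next_serial 2024010101 20240215   # prints 2024021501
```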
Monitoring and Alerting
Health Check Monitoring Script
# Comprehensive health monitoring
cat > /usr/local/bin/monitor-failover.sh << 'EOF'
#!/bin/bash
SERVERS=("web1.example.com" "web2.example.com" "web3.example.com")
CHECK_INTERVAL=30
ALERT_EMAIL="alerts@example.com" # placeholder address; set this to your own alert recipient
LOG_FILE="/var/log/dns-failover.log"
STATUS_FILE="/var/lib/dns-failover/status.json"
# Initialize status tracking
mkdir -p "$(dirname "$STATUS_FILE")"
# Health check function
check_server() {
local server=$1
if curl -s -m 5 "http://$server/health" | grep -q "ok"; then
echo "up"
else
echo "down"
fi
}
# Track state changes and alert
monitor_loop() {
while true; do
for server in "${SERVERS[@]}"; do
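current_status=$(check_server "$server")
# The status file holds one "server":"status" entry per line
previous_status=$(grep -o "\"$server\":\"[^\"]*" "$STATUS_FILE" 2>/dev/null | cut -d'"' -f4)
if [ "$current_status" != "$previous_status" ]; then
echo "[$(date)] Status change: $server is now $current_status" >> "$LOG_FILE"
if [ "$current_status" = "down" ]; then
send_alert "CRITICAL: $server is down"
elif [ -n "$previous_status" ]; then
# Skip the "back online" alert on the first run, when no previous status exists
send_alert "RESOLVED: $server is back online"
fi
# Persist the new status so the next iteration compares against it
touch "$STATUS_FILE"
sed -i "/\"$server\":/d" "$STATUS_FILE"
echo "\"$server\":\"$current_status\"" >> "$STATUS_FILE"
fi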
done
sleep $CHECK_INTERVAL
done
}
send_alert() {
local message=$1
echo "$message" | mail -s "DNS Failover Alert" "$ALERT_EMAIL"
}
monitor_loop
EOF
chmod +x /usr/local/bin/monitor-failover.sh
Prometheus Metrics for DNS Failover
# Expose DNS failover metrics for Prometheus
cat > /usr/local/bin/dns-failover-exporter.sh << 'EOF'
#!/bin/bash
PORT=9999
SERVERS=("web1.example.com" "web2.example.com")
# Simple HTTP server for Prometheus metrics
start_metrics_server() {
while true; do
{
echo -ne "HTTP/1.1 200 OK\r\n"
echo -ne "Content-Type: text/plain\r\n"
echo -ne "Connection: close\r\n"
echo -ne "\r\n"
# Export metrics
for server in "${SERVERS[@]}"; do
# -f makes curl return a non-zero exit code on HTTP errors such as 500
if curl -sf -m 3 "http://$server/health" > /dev/null; then
status=1
else
status=0
fi
echo "dns_failover_server_up{server=\"$server\"} $status"
done
} | nc -l -p $PORT -q 1
done
}
start_metrics_server
EOF
chmod +x /usr/local/bin/dns-failover-exporter.sh
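# Prometheus scrape configuration
# (Prometheus reads a single config file: merge this job into prometheus.yml
# or start Prometheus with --config.file pointing at this file)
cat > /etc/prometheus/dns-failover.yml << 'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'dns-failover'
    static_configs:
      - targets: ['localhost:9999']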
EOF
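Prometheus accepts bare samples, but the text exposition format also allows `# HELP` and `# TYPE` lines that describe each metric. The sketch below renders the same `dns_failover_server_up` metric with that metadata from precomputed results; the function name `render_metrics` and the `server=status` argument format are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: render metrics in the Prometheus text format.
# Each argument is "server=status", where status is 1 (up) or 0 (down).
render_metrics() {
    local pair
    echo "# HELP dns_failover_server_up Whether the server passed its last health check (1 = up, 0 = down)"
    echo "# TYPE dns_failover_server_up gauge"
    for pair in "$@"; do
        printf 'dns_failover_server_up{server="%s"} %s\n' "${pair%%=*}" "${pair#*=}"
    done
}

render_metrics "web1.example.com=1" "web2.example.com=0"
```

Keeping the rendering separate from the health probes also makes it possible to run the probes on a timer and serve the last known results, instead of probing during each scrape.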
Testing Failover
Manual Failover Test
# Test failover procedure
test_failover_procedure() {
echo "DNS Failover Test Procedure"
echo "============================"
local primary_server="web1.example.com"
local backup_server="web2.example.com"
# Step 1: Verify primary is healthy
echo "Step 1: Checking primary server health..."
curl -v "http://$primary_server/health"
# Step 2: Simulate primary failure
echo "Step 2: Stopping primary server..."
ssh "root@$primary_server" "systemctl stop nginx"
# Wait for health check to detect failure
echo "Step 3: Waiting for health check to detect failure..."
sleep 35 # Just over health check interval
# Step 4: Verify failover occurred
echo "Step 4: Verifying traffic is on backup..."
for i in {1..10}; do
response=$(curl -s "http://example.com" | grep -o "Server: web[0-9]")
echo "Request $i: $response"
done
# Step 5: Restore primary
echo "Step 5: Restoring primary server..."
ssh "root@$primary_server" "systemctl start nginx"
# Step 6: Verify failback
echo "Step 6: Verifying traffic returns to primary..."
sleep 35
for i in {1..10}; do
response=$(curl -s "http://example.com" | grep -o "Server: web[0-9]")
echo "Request $i: $response"
done
}
# Automated failover test
automate_failover_test() {
local test_log="/var/log/failover-test.log"
{
echo "[$(date)] Starting automated failover test"
# Get initial DNS response
initial_response=$(dig @ns1.example.com web.example.com +short)
echo "Initial DNS response: $initial_response"
# Simulate server down
# (placeholder SSH target, assumed to match the primary used in the manual test; adjust as needed)
local primary_ssh="root@web1.example.com"
ssh "$primary_ssh" systemctl stop nginx
# Wait and check DNS update
sleep 40
failover_response=$(dig @ns1.example.com web.example.com +short)
echo "Failover DNS response: $failover_response"
if [ "$initial_response" != "$failover_response" ]; then
echo "✓ Failover successful"
else
echo "✗ Failover failed"
fi
# Restore service
ssh "$primary_ssh" systemctl start nginx
} | tee -a "$test_log"
}
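When a name has several A records, name servers may rotate their order between responses, so comparing raw `dig +short` output as strings can report a change when only the order differs. A hedged sketch of an order-insensitive comparison follows; the helper name `same_answer_set` is illustrative, and the sample answers are hard-coded so the example runs without DNS.

```shell
#!/bin/bash
# Sketch: succeed when two `dig +short` outputs contain the same set of addresses,
# regardless of the order in which they were returned.
same_answer_set() {
    [ "$(sort <<< "$1")" = "$(sort <<< "$2")" ]
}

# Sample answers using the web server addresses from this guide
before=$'10.0.1.20\n10.0.1.21'
after=$'10.0.1.21\n10.0.1.20'

if same_answer_set "$before" "$after"; then
    echo "unchanged (only the order rotated)"
else
    echo "changed"
fi
```

In a failover test, a successful failover would show up as "changed", for example when the failed primary's address disappears from the answer.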
Advanced Scenarios
Weighted Round-Robin Failover
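# BIND weighted failover using SRV records
# (appended to the existing zone file so the SOA and NS records are kept;
# increment the SOA serial afterwards)
cat >> /etc/bind/zones/db.example.com << 'EOF'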
; Load balancing with SRV records
_http._tcp.web IN SRV 10 60 80 web1.example.com.
_http._tcp.web IN SRV 10 40 80 web2.example.com.
_http._tcp.web IN SRV 20 100 80 web3.example.com.
; Priority: lower number = higher priority
; Weight: within same priority, distributed based on weight
EOF
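SRV records only describe priorities and weights; the selection itself happens in the client. The sketch below shows a simplified weighted pick within one priority group, in the spirit of the selection described in RFC 2782 rather than a complete implementation of it. The function name `pick_weighted` and the `target:weight` argument format are assumptions; the roll is passed in so the behaviour is deterministic and testable.

```shell
#!/bin/bash
# Sketch: weighted choice among SRV targets that share the same priority.
# Usage: pick_weighted ROLL target:weight [target:weight ...]
# ROLL must be an integer from 0 to (sum of weights - 1).
pick_weighted() {
    local roll=$1; shift
    local entry weight running=0
    for entry in "$@"; do
        weight=${entry##*:}
        running=$((running + weight))
        if [ "$roll" -lt "$running" ]; then
            echo "${entry%:*}"
            return 0
        fi
    done
    return 1   # roll was outside the total weight
}

# Priority-10 group from the zone above: web1 has weight 60, web2 has weight 40
roll=$((RANDOM % 100))
pick_weighted "$roll" "web1.example.com.:60" "web2.example.com.:40"
```

Over many requests web1 is chosen about 60% of the time. The priority-20 target (web3) would only be tried when every priority-10 target is unreachable.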
Geographic Failover
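# Geographic failover using split-view DNS
# Note: this overwrites /etc/bind/named.conf. The address ranges are illustrative;
# static prefixes only approximate where a client is located.
cat > /etc/bind/named.conf << 'EOF'
# Define geographic zones
view "europe" {
match-clients { 80.0.0.0/4; };
zone "example.com" {
type master;
file "/etc/bind/zones/db.example.com.eu";
};
};
view "americas" {
match-clients { 192.0.0.0/8; };
zone "example.com" {
type master;
file "/etc/bind/zones/db.example.com.us";
};
};
# Catch-all view: clients that match no other view would otherwise get no answer
view "default" {
match-clients { any; };
zone "example.com" {
type master;
file "/etc/bind/zones/db.example.com";
};
};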
EOF
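To see which view a given client address would fall into, the prefix match can be reproduced with integer arithmetic. The sketch below checks IPv4 addresses against the two example ranges (80.0.0.0/4 and 192.0.0.0/8); the helper names are illustrative, and the logic only mirrors simple `match-clients` prefix rules, not BIND's full ACL handling.

```shell
#!/bin/bash
# Sketch: convert a dotted IPv4 address to an integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds when the address lies inside the CIDR block (e.g. 80.0.0.0/4).
ip_in_cidr() {
    local ip=$1 network=${2%/*} prefix=${2#*/}
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$network") & mask )) ]
}

# Report which example range a client address matches.
view_for_client() {
    if ip_in_cidr "$1" 80.0.0.0/4; then
        echo "europe"
    elif ip_in_cidr "$1" 192.0.0.0/8; then
        echo "americas"
    else
        echo "other"
    fi
}

view_for_client 85.10.20.30   # prints europe
view_for_client 192.168.1.5   # prints americas
view_for_client 8.8.8.8       # prints other
```

Addresses reported as "other" fall outside both example ranges, which is why a split-view setup needs a catch-all view for those clients to receive an answer.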
Conclusion
Effective DNS failover requires:
- Health Checks: Regular monitoring to detect failures quickly
- Low TTLs: Critical services use 30-60 second TTLs for fast failover
- Multiple Servers: At least 2-3 nameservers for redundancy
- Automation: Scripts to automatically update DNS on failure
- Testing: Regular failover drills to verify procedures work
- Monitoring: Continuous health checks and alerting
Choose between BIND (traditional, flexible) and PowerDNS (API-driven, modern) based on your infrastructure needs. Always test failover procedures in staging before deploying to production.


