RabbitMQ Clustering and High Availability
RabbitMQ clustering enables multiple broker nodes to work together, providing high availability, load distribution, and fault tolerance. This guide covers cluster formation, quorum queues for guaranteed message delivery, federation for multi-cluster communication, and partition handling strategies.
Table of Contents
- Prerequisites
- Cluster Architecture
- Setting Up a RabbitMQ Cluster
- Quorum Queues
- RabbitMQ Federation
- RabbitMQ Shovel
- Partition Handling
- Cluster Health Monitoring
- Backup and Recovery
- Troubleshooting
- Conclusion
Prerequisites
Before setting up RabbitMQ clustering, ensure you have:
- Multiple Linux servers (minimum 3 for high availability)
- RabbitMQ 3.8+ installed on all nodes
- Network connectivity between all nodes
- Same Erlang cookie on all nodes for inter-node communication
- Unique hostname for each node
- 2+ GB RAM per node recommended
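The three-node minimum follows from majority (quorum) arithmetic: a cluster of n nodes stays available as long as a strict majority of nodes is reachable, so it tolerates floor((n - 1) / 2) failures. A small sketch of that rule (the helper name is ours, not part of any RabbitMQ tooling):

```python
# Hypothetical helper: how many node failures a cluster of size n can
# survive while a strict majority (quorum) remains available.
def tolerable_failures(cluster_size: int) -> int:
    if cluster_size < 1:
        raise ValueError("cluster size must be at least 1")
    return (cluster_size - 1) // 2

# An even-sized cluster tolerates no more failures than the next
# smaller odd size, which is why 3 or 5 nodes are recommended.
for n in (1, 2, 3, 4, 5):
    print(n, "nodes ->", tolerable_failures(n), "tolerable failure(s)")
```

Note that 4 nodes tolerate only one failure, the same as 3, while adding a node that can itself fail.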
Cluster Architecture
A RabbitMQ cluster consists of multiple nodes sharing data and state:
- Disk nodes: Store cluster metadata and queue data
- RAM nodes: Keep data in memory only (rarely used in production)
For high availability, deploy at least 3 disk nodes:
Load Balancer
|
_________|_______
| | |
Node1 Node2 Node3
Disk Disk Disk
Setting Up a RabbitMQ Cluster
Prepare three nodes: rabbit1.example.com, rabbit2.example.com, rabbit3.example.com
Configure hostnames and networking on all nodes:
sudo tee -a /etc/hosts <<EOF
192.168.1.10 rabbit1.example.com rabbit1
192.168.1.11 rabbit2.example.com rabbit2
192.168.1.12 rabbit3.example.com rabbit3
EOF
Create/update the Erlang cookie (shared secret for cluster communication). Copy the same cookie to all nodes:
# On node 1
cat /var/lib/rabbitmq/.erlang.cookie
# Copy the output
# On nodes 2 and 3 (stop the service first so the new cookie takes effect)
sudo systemctl stop rabbitmq-server
echo "SHARED_ERLANG_COOKIE_VALUE" | sudo tee /var/lib/rabbitmq/.erlang.cookie
sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie
sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
Configure RabbitMQ on each node. Edit /etc/rabbitmq/rabbitmq.conf on each node:
sudo nano /etc/rabbitmq/rabbitmq.conf
Add cluster-specific settings:
# Network binding
listeners.tcp.default = 5672
# Cluster configuration
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbit1
cluster_formation.classic_config.nodes.2 = rabbit@rabbit2
cluster_formation.classic_config.nodes.3 = rabbit@rabbit3
# Allow the default guest user to connect from remote hosts
# (guest is restricted to localhost by default; leave commented to keep that)
# loopback_users.guest = false
# Heartbeat and timeouts
heartbeat = 60
channel_max = 2048
# Management plugin
management.tcp.port = 15672
# Memory settings
vm_memory_high_watermark.relative = 0.8
# Cluster partition handling
cluster_partition_handling = autoheal
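Before restarting nodes, it can help to sanity-check the file programmatically. A minimal sketch of a parser for the simple key = value format shown above (the function is our own, not part of RabbitMQ):

```python
# Minimal sketch: parse key = value lines from a rabbitmq.conf-style
# file so settings can be verified before a node restart.
def parse_rabbitmq_conf(text: str) -> dict:
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

sample = """
heartbeat = 60
cluster_partition_handling = autoheal
# a comment
vm_memory_high_watermark.relative = 0.8
"""
conf = parse_rabbitmq_conf(sample)
print(conf["heartbeat"])                    # 60
print(conf["cluster_partition_handling"])   # autoheal
```

This catches typos like a missing `=` before they surface as startup failures.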
Update hostnames in the cluster config. On Node 1, ensure hostname is correctly set:
hostname
# Should output: rabbit1
sudo hostnamectl set-hostname rabbit1
Repeat for other nodes (rabbit2, rabbit3).
Start RabbitMQ on all nodes:
sudo systemctl restart rabbitmq-server
Join nodes to the cluster. On Node 2, stop the RabbitMQ application (keep Erlang running), reset, and join:
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl join_cluster rabbit@rabbit1
sudo rabbitmqctl start_app
Repeat on Node 3:
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl join_cluster rabbit@rabbit1
sudo rabbitmqctl start_app
Verify cluster status on any node:
sudo rabbitmqctl cluster_status
Output should show all three nodes:
Cluster status of node rabbit@rabbit1
Nodes in cluster: rabbit@rabbit1, rabbit@rabbit2, rabbit@rabbit3
Running nodes: rabbit@rabbit1, rabbit@rabbit2, rabbit@rabbit3
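cluster_status also supports --formatter json, which is easier to check from scripts than the human-readable output. A hedged sketch that flags configured disk nodes that are not running; the JSON below is a trimmed, illustrative sample of that output:

```python
import json

# Sketch: compare configured disk nodes against running nodes using
# the JSON emitted by `rabbitmqctl cluster_status --formatter json`.
def missing_nodes(status_json: str) -> list:
    status = json.loads(status_json)
    configured = set(status.get("disk_nodes", []))
    running = set(status.get("running_nodes", []))
    return sorted(configured - running)

# Trimmed, illustrative sample of the command's output.
sample = json.dumps({
    "disk_nodes": ["rabbit@rabbit1", "rabbit@rabbit2", "rabbit@rabbit3"],
    "running_nodes": ["rabbit@rabbit1", "rabbit@rabbit3"],
})
print(missing_nodes(sample))  # ['rabbit@rabbit2']
```

An empty result means every configured node is up; anything else is a candidate alert.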
Quorum Queues
Quorum queues provide high availability by replicating queue data across multiple nodes (default: 3). They guarantee data durability and availability as long as a majority of nodes are alive.
Declare a quorum queue with rabbitmqadmin (rabbitmqctl cannot declare queues; rabbitmqadmin ships with the management plugin, and the queue type is set via the x-queue-type argument):
rabbitmqadmin declare queue name=order_processing durable=true \
  arguments='{"x-queue-type":"quorum","x-max-length":1000000}'
Or via the management UI or client libraries:
import pika

credentials = pika.PlainCredentials('guest', 'guest')
connection = pika.BlockingConnection(pika.ConnectionParameters(
    host='rabbit1.example.com',
    credentials=credentials
))
channel = connection.channel()

# Declare quorum queue
channel.queue_declare(
    queue='order_processing',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-max-length': 1000000,
        'x-quorum-initial-group-size': 3  # Replicate on 3 nodes
    }
)
connection.close()
Monitor quorum queue replication (quorum queues report a leader and members; the slave_nodes field applies only to classic mirrored queues):
sudo rabbitmqctl list_queues name type leader members messages
Output example:
Listing queues for vhost / ...
name  type  leader  members  messages
order_processing  quorum  rabbit@rabbit1  [rabbit@rabbit1, rabbit@rabbit2, rabbit@rabbit3]  1250
Configure policies for quorum queues. Note that the queue type cannot be changed by policy; it must be set when the queue is declared. Policies can still control limits such as max-length:
# Set policy for queues matching a pattern
sudo rabbitmqctl set_policy -p / --priority 1 --apply-to queues \
  quorum-config "^.*" \
  '{"max-length":1000000}'
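The same policy can also be set through the management HTTP API (PUT /api/policies/<vhost>/<name>). The sketch below only builds the request; actually sending it requires a reachable management listener on port 15672 plus basic-auth credentials, and the policy definition shown is illustrative:

```python
import json
import urllib.parse
import urllib.request

# The default vhost "/" must be percent-encoded as %2F in the URL path.
vhost = urllib.parse.quote("/", safe="")
body = {
    "pattern": "^.*",
    "definition": {"max-length": 1000000},
    "priority": 1,
    "apply-to": "queues",
}
req = urllib.request.Request(
    url=f"http://rabbit1.example.com:15672/api/policies/{vhost}/quorum-config",
    data=json.dumps(body).encode(),
    headers={"content-type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
# To apply it for real: add an Authorization header and call
# urllib.request.urlopen(req) against a live management endpoint.
```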
RabbitMQ Federation
Federation connects multiple RabbitMQ clusters or brokers, allowing messages to be automatically replicated between them. Useful for geo-distributed deployments and disaster recovery.
Enable the federation plugin on all nodes:
sudo rabbitmq-plugins enable rabbitmq_federation
sudo rabbitmq-plugins enable rabbitmq_federation_management
Restart RabbitMQ:
sudo systemctl restart rabbitmq-server
Configure federation upstreams. On the downstream cluster (rabbit2), define an upstream pointing to rabbit1:
sudo rabbitmqctl set_parameter federation-upstream rabbit1-cluster \
'{"uri":"amqp://guest:guest@rabbit1.example.com:5672"}'
For redundancy, give a single upstream several URIs (the uri field accepts a JSON list; URIs are tried in turn):
sudo rabbitmqctl set_parameter federation-upstream upstream-list \
  '{"uri":["amqp://guest:guest@rabbit1.example.com:5672","amqp://guest:guest@rabbit3.example.com:5672"]}'
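When managing several upstreams from scripts, generating the parameter JSON is less error-prone than hand-editing it. A small sketch (the helper name is ours; uri and prefetch-count are documented federation-upstream fields):

```python
import json

# Sketch: build the JSON value for a federation-upstream parameter
# with multiple URIs (the `uri` field accepts a list).
def upstream_value(uris, prefetch=1000) -> str:
    return json.dumps({"uri": list(uris), "prefetch-count": prefetch})

value = upstream_value([
    "amqp://guest:guest@rabbit1.example.com:5672",
    "amqp://guest:guest@rabbit3.example.com:5672",
])
print(value)
```

The resulting string can be passed verbatim as the last argument of `rabbitmqctl set_parameter federation-upstream <name>`.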
Create a federation policy:
sudo rabbitmqctl set_policy -p / --priority 10 --apply-to exchanges \
  federation-exchanges '^(orders|events)$' \
  '{"federation-upstream-set":"all"}'
This policy federates exchanges matching the pattern to all configured upstreams.
For queues:
sudo rabbitmqctl set_policy -p / --priority 10 --apply-to queues \
  federation-queues '^(order_.*|event_.*)$' \
  '{"federation-upstream-set":"all"}'
Check federation status:
sudo rabbitmqctl federation_status
RabbitMQ Shovel
Shovel is a more targeted federation mechanism for moving messages from one broker to another. Useful for specific queue migration or traffic shaping.
Enable the shovel plugin:
sudo rabbitmq-plugins enable rabbitmq_shovel
sudo rabbitmq-plugins enable rabbitmq_shovel_management
Restart RabbitMQ:
sudo systemctl restart rabbitmq-server
Configure a static shovel to move messages from source to destination. Static shovels are defined in /etc/rabbitmq/advanced.config using Erlang terms (the ini-style rabbitmq.conf format does not support shovel definitions). Create or extend advanced.config:
sudo tee /etc/rabbitmq/advanced.config <<'EOF'
[
  {rabbitmq_shovel, [
    {shovels, [
      {move_messages, [
        {source, [
          {protocol, amqp091},
          {uris, ["amqp://guest:guest@rabbit1.example.com:5672"]},
          {queue, <<"legacy_orders">>}
        ]},
        {destination, [
          {protocol, amqp091},
          {uris, ["amqp://guest:guest@rabbit2.example.com:5672"]},
          {queue, <<"orders_new">>}
        ]},
        {ack_mode, on_confirm},
        {prefetch_count, 10}
      ]}
    ]}
  ]}
].
EOF
Restart RabbitMQ:
sudo systemctl restart rabbitmq-server
Or define shovels dynamically using the documented src-/dest- parameter keys:
sudo rabbitmqctl set_parameter shovel move-orders \
  '{
    "src-protocol": "amqp091",
    "src-uri": "amqp://guest:guest@rabbit1:5672",
    "src-queue": "orders_old",
    "dest-protocol": "amqp091",
    "dest-uri": "amqp://guest:guest@rabbit2:5672",
    "dest-queue": "orders_new",
    "ack-mode": "on-confirm"
  }'
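When migrating many queues, the shovel definitions can be generated rather than hand-written. A sketch that emits one set_parameter command per queue (the helper name and the queue names are ours; the src-/dest- keys are the dynamic shovel parameter keys):

```python
import json

# Sketch: build a dynamic shovel definition for one queue migration.
def shovel_definition(src_queue: str, dest_queue: str) -> str:
    return json.dumps({
        "src-protocol": "amqp091",
        "src-uri": "amqp://guest:guest@rabbit1:5672",
        "src-queue": src_queue,
        "dest-protocol": "amqp091",
        "dest-uri": "amqp://guest:guest@rabbit2:5672",
        "dest-queue": dest_queue,
        "ack-mode": "on-confirm",
    })

# Emit one rabbitmqctl command per queue to migrate.
for q in ("orders_old", "events_old"):
    dest = q.replace("_old", "_new")
    print(f"rabbitmqctl set_parameter shovel move-{q} '{shovel_definition(q, dest)}'")
```

Piping the output into a shell (after review) applies all migrations in one pass.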
List active shovels:
sudo rabbitmqctl shovel_status
Remove a shovel:
sudo rabbitmqctl clear_parameter shovel move-orders
Partition Handling
Network partitions can split a cluster into disconnected groups. Configure appropriate handling:
sudo nano /etc/rabbitmq/rabbitmq.conf
Set partition handling strategy:
# ignore: Do nothing on partition (the default; risky in production)
# cluster_partition_handling = ignore
# autoheal: Automatically pick a winning partition; losing nodes restart
cluster_partition_handling = autoheal
# pause_minority: Pause nodes in the minority partition until it heals
# cluster_partition_handling = pause_minority
With autoheal, the partition with the most client connections wins (ties are broken by node count), and nodes in losing partitions restart and rejoin:
# Monitor partition events in logs
sudo tail -f /var/log/rabbitmq/rabbit@rabbit1.log | grep -i partition
For pause_minority, ensure nodes form proper quorums before resuming:
# Check node status after partition
sudo rabbitmqctl cluster_status
# If a node stays paused, restart it; as a last resort, reset it
# (this wipes the node's local data) and rejoin the cluster
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl join_cluster rabbit@rabbit1
sudo rabbitmqctl start_app
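pause_minority behavior can be reasoned about with simple majority arithmetic: a partition keeps serving traffic only if it still sees a strict majority of the configured cluster. An illustrative model (not RabbitMQ code):

```python
# Illustrative model of pause_minority: a partition keeps running only
# if it can still see a strict majority of the configured cluster.
def partition_survives(visible_nodes: int, cluster_size: int) -> bool:
    return visible_nodes > cluster_size // 2

# In a 3-node cluster, a 2-node partition continues serving while the
# isolated node pauses itself until connectivity returns.
assert partition_survives(2, 3) is True
assert partition_survives(1, 3) is False
# An even split (2 of 4) pauses BOTH sides: another reason to prefer
# odd cluster sizes.
assert partition_survives(2, 4) is False
print("majority model checks passed")
```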
Cluster Health Monitoring
Monitor cluster state and node health continuously:
# Check overall cluster status
sudo rabbitmqctl cluster_status
# Monitor node memory usage
sudo rabbitmqctl eval 'erlang:memory().'
# Get statistics
sudo rabbitmqctl status | head -50
# Check disk free space
sudo rabbitmqctl status | grep -i disk
# Monitor connections
sudo rabbitmqctl list_connections
# Check quorum queue membership
sudo rabbitmqctl list_queues name leader members messages
Create a monitoring script:
#!/bin/bash
while true; do
    clear
    echo "RabbitMQ Cluster Health - $(date)"
    echo "========================================"
    echo -e "\nCluster Status:"
    rabbitmqctl cluster_status 2>/dev/null | grep -E "Nodes|Running"
    echo -e "\nQueue Status:"
    rabbitmqctl list_queues name node messages | head -10
    echo -e "\nNode Memory:"
    rabbitmqctl status 2>/dev/null | grep -i memory
    echo -e "\nConnections:"
    rabbitmqctl -q list_connections | wc -l
    sleep 30
done
Backup and Recovery
Backup cluster definitions and data:
# Backup RabbitMQ definitions (exchanges, queues, users, etc.)
sudo rabbitmqctl export_definitions /var/backups/rabbitmq-definitions.json
# Backup data directory
sudo tar -czf /var/backups/rabbitmq-data-$(date +%Y%m%d).tar.gz \
/var/lib/rabbitmq/mnesia
Restore from backup:
# Restore definitions to a fresh node
sudo rabbitmqctl import_definitions /var/backups/rabbitmq-definitions.json
# Restore data directory
sudo systemctl stop rabbitmq-server
sudo tar -xzf /var/backups/rabbitmq-data-20240115.tar.gz -C /
sudo systemctl start rabbitmq-server
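The dated tarballs above accumulate quickly, so a retention script keeps the backup directory bounded. A hypothetical helper (the function name and retention period are ours), matching the naming used in the backup commands above:

```python
import os
import tempfile
import time
from pathlib import Path

# Hypothetical retention helper: delete rabbitmq-data-*.tar.gz backups
# older than keep_days, returning the names that were removed.
def prune_backups(directory: str, keep_days: int = 14) -> list:
    cutoff = time.time() - keep_days * 86400
    removed = []
    for path in Path(directory).glob("rabbitmq-data-*.tar.gz"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)

# Example against a temporary directory:
with tempfile.TemporaryDirectory() as d:
    old = Path(d) / "rabbitmq-data-20240101.tar.gz"
    new = Path(d) / "rabbitmq-data-20240115.tar.gz"
    old.touch()
    new.touch()
    os.utime(old, (0, 0))  # make one backup look ancient
    print(prune_backups(d))  # ['rabbitmq-data-20240101.tar.gz']
```

Run it from cron alongside the backup job so retention stays automatic.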
Troubleshooting
Handle common cluster issues:
# Check node connectivity
telnet rabbit2.example.com 25672 # Erlang distribution port
# View cluster startup log
sudo journalctl -u rabbitmq-server -n 100
# Check Erlang cookie mismatch (most common issue)
sudo cat /var/lib/rabbitmq/.erlang.cookie
# Force remove a failed node from cluster
sudo rabbitmqctl forget_cluster_node rabbit@failed-node
# Reset a stuck node (WARNING: deletes all local node data)
sudo systemctl stop rabbitmq-server
sudo rm -rf /var/lib/rabbitmq/mnesia/*
sudo systemctl start rabbitmq-server
# Verify quorum queue membership
sudo rabbitmqctl list_queues name leader members
Conclusion
RabbitMQ clustering provides robust high availability for mission-critical messaging infrastructure. This guide covered cluster formation, quorum queues for guaranteed delivery, federation for multi-cluster communication, shovels for targeted message movement, and partition handling. Deploy at least 3 disk nodes for resilience, use quorum queues for critical data, monitor cluster health continuously, and establish backup/recovery procedures. Regular testing of failure scenarios ensures your RabbitMQ cluster maintains reliability under adverse conditions. For geo-distributed deployments, combine clustering with federation to create resilient, scalable messaging systems.


