Clustering with Pacemaker and Corosync: Enterprise High Availability Guide
Introduction
High availability clustering is a critical component of enterprise infrastructure, ensuring continuous service delivery even when individual nodes fail. Pacemaker and Corosync form the backbone of Linux-based high availability solutions, providing robust cluster resource management and node communication capabilities that power mission-critical applications worldwide.
Pacemaker is a sophisticated cluster resource manager that determines which nodes should host which services and handles failover operations automatically. Corosync, on the other hand, provides the cluster communication layer using the Totem protocol, ensuring reliable message delivery between cluster nodes and detecting node failures through heartbeat mechanisms.
Together, these technologies enable active-passive, active-active, and N+M redundancy configurations that protect against hardware failures, software crashes, and planned maintenance windows. Organizations deploying financial systems, healthcare applications, telecommunications infrastructure, and e-commerce platforms rely on Pacemaker/Corosync clusters to maintain service level agreements (SLAs) demanding 99.99% or higher uptime.
This comprehensive guide explores enterprise-grade cluster implementations, covering architecture design, advanced configuration patterns, performance optimization, and troubleshooting strategies that experienced systems engineers need to build production-ready high availability solutions.
Theory and Core Concepts
Cluster Architecture Fundamentals
High availability clusters built with Pacemaker and Corosync operate on several foundational principles that distinguish them from simple load balancing or redundancy mechanisms.
Quorum and Split-Brain Prevention: Clusters use voting mechanisms to establish quorum—the minimum number of nodes required to operate the cluster. When network partitions occur, only the partition containing quorum can continue managing resources, preventing split-brain scenarios where multiple cluster partitions attempt to manage the same resources simultaneously, leading to data corruption.
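The vote arithmetic behind these decisions can be inspected at any time on a running cluster; a minimal check looks like this:
# Show votes, expected votes, and whether this partition is quorate
corosync-quorumtool -s
# Equivalent summary through pcs
pcs quorum status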
Resource Management Hierarchy: Pacemaker organizes services into resources with configurable properties (a brief pcs sketch follows the list below):
- Primitive Resources: Individual services like IP addresses, filesystems, or applications
- Groups: Collections of resources that must run together on the same node
- Clones: Resources that run on multiple nodes simultaneously (active-active)
- Master-Slave Resources: Resources where one instance is primary and others are standby
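A minimal pcs sketch of each resource type; the resource names here are purely illustrative:
# Primitive: a single floating IP address
pcs resource create demo-ip ocf:heartbeat:IPaddr2 ip=192.168.100.200 cidr_netmask=24
# Group: members start in listed order and stay on the same node
pcs resource group add demo-group demo-ip
# Clone: run an instance on every node (Dummy agent used for illustration)
pcs resource create demo-clone ocf:pacemaker:Dummy clone
# Promotable clone: one instance is promoted, the others remain standby
pcs resource create demo-stateful ocf:pacemaker:Stateful promotable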
Fencing and STONITH: "Shoot The Other Node In The Head" mechanisms forcibly power off or isolate failed nodes to guarantee that malfunctioning systems cannot corrupt shared resources. Fencing is mandatory for production clusters managing stateful resources like databases or shared storage.
Corosync Communication Architecture
Corosync implements the Totem Single Ring Ordering and Membership protocol, providing:
Total Ordering: All messages are delivered to all nodes in the same order, essential for consistent cluster state.
Virtual Synchrony: Groups of nodes maintain synchronized views of cluster membership.
Redundant Ring Protocol: Support for multiple network paths to eliminate single points of failure.
The communication layer operates using multicast or unicast UDP, with configurable heartbeat intervals (typically 1-5 seconds) and failure detection timeouts that balance between rapid failover and false positive detection.
Pacemaker Decision Engine
Pacemaker's Policy Engine (PE) continuously evaluates cluster state against configured constraints and policies:
Location Constraints: Define which nodes can or should host specific resources.
Colocation Constraints: Specify resources that must run together or separately.
Order Constraints: Define startup/shutdown sequences for dependent resources.
Resource Stickiness: Preference for keeping resources on their current node versus migrating.
The Cluster Resource Manager (CRM) interprets these policies and orchestrates resource transitions, minimizing service disruption while respecting all configured constraints.
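These constraint types map directly onto pcs commands; a brief sketch using illustrative resource names:
# Location: prefer node1, never run on node3
pcs constraint location demo-app prefers cluster-node1=200
pcs constraint location demo-app avoids cluster-node3
# Colocation: keep demo-app with its virtual IP
pcs constraint colocation add demo-app with demo-ip INFINITY
# Order: bring the IP up before the application
pcs constraint order demo-ip then demo-app
# Stickiness: resist moving back after a failover
pcs resource meta demo-app resource-stickiness=100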
Prerequisites
Hardware Requirements
Enterprise cluster implementations require careful hardware planning:
Minimum Configuration:
- At least 3 nodes (2 for services + 1 quorum/tie-breaker)
- Dual network interfaces per node for redundant communication
- Shared storage infrastructure (SAN, NAS, or distributed filesystem)
- Dedicated IPMI/iLO/BMC interfaces for STONITH fencing
- 4 CPU cores minimum per node
- 8GB RAM minimum per node (16GB+ recommended)
Network Infrastructure:
- Dedicated cluster interconnect network (isolated VLAN)
- Redundant network paths with sub-5ms latency
- Jumbo frame support (MTU 9000) for storage networks
- Network switches with IGMP snooping for multicast
Storage Considerations:
- Shared block storage for clustered filesystems (GFS2, OCFS2)
- RAID configuration for local storage redundancy
- Battery-backed cache for write performance
- Multipath configuration for storage path redundancy
Software Prerequisites
Operating System Compatibility:
- Red Hat Enterprise Linux 8/9 or compatible (Rocky Linux, AlmaLinux)
- Ubuntu 20.04/22.04 LTS
- SUSE Linux Enterprise Server 15
- Debian 11/12
Required Packages (RHEL/Rocky):
pacemaker corosync pcs fence-agents-all resource-agents
Required Packages (Ubuntu/Debian):
pacemaker corosync crmsh fence-agents resource-agents
Network Configuration Requirements
All nodes require the following (a quick verification sketch follows this list):
- Synchronized time via NTP/Chrony (critical for cluster operations)
- Hostname resolution via /etc/hosts or DNS
- Firewall rules permitting cluster communication
- SELinux/AppArmor policies allowing cluster operations
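A quick per-node verification sketch for these requirements (adjust hostnames to your environment):
# Time must be synchronized and the offset small
chronyc tracking | grep -E 'Stratum|System time'
# Every cluster hostname must resolve identically on every node
getent hosts cluster-node1 cluster-node2 cluster-node3
# The high-availability firewalld service should be active
firewall-cmd --list-services | grep -q high-availability && echo "HA ports open"
# SELinux should be enforcing with the shipped cluster policies, not disabled ad hoc
getenforce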
Advanced Configuration
Initial Cluster Setup
Step 1: Time Synchronization
Configure chrony on all nodes:
# Install chrony
dnf install -y chrony
# Configure reliable NTP sources
cat >> /etc/chrony.conf << EOF
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
EOF
systemctl enable --now chronyd
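Time synchronization can be verified before proceeding; the source marked with '*' is the one currently selected:
chronyc sources -v
chronyc tracking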
Step 2: Hostname and Network Configuration
Configure /etc/hosts on all nodes:
cat >> /etc/hosts << EOF
192.168.100.11 cluster-node1
192.168.100.12 cluster-node2
192.168.100.13 cluster-node3
EOF
Step 3: Firewall Configuration
Open required ports on all nodes:
# Corosync communication
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --permanent --add-port=2224/tcp # pcsd
firewall-cmd --permanent --add-port=3121/tcp # pacemaker remote
firewall-cmd --permanent --add-port=5403/tcp # corosync qnetd
firewall-cmd --permanent --add-port=5404-5412/udp # corosync
firewall-cmd --reload
Step 4: Install and Configure Cluster Software
On all nodes:
# Install cluster packages
dnf install -y pacemaker corosync pcs fence-agents-all
# Enable and start pcsd daemon
systemctl enable --now pcsd
# Set hacluster password (same on all nodes)
echo "StrongClusterPassword123!" | passwd --stdin hacluster
Step 5: Authenticate Cluster Nodes
On node1:
# Authenticate all nodes
pcs host auth cluster-node1 cluster-node2 cluster-node3 \
-u hacluster -p StrongClusterPassword123!
Step 6: Create the Cluster
# Create cluster with all nodes
pcs cluster setup enterprise-cluster \
cluster-node1 addr=192.168.100.11 \
cluster-node2 addr=192.168.100.12 \
cluster-node3 addr=192.168.100.13 \
transport knet
# Enable cluster services on boot
pcs cluster enable --all
# Start the cluster
pcs cluster start --all
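At this point all three members should be online; a quick verification sketch:
# All nodes online, no resources configured yet
pcs status
# Each configured knet link should show as connected for every peer
corosync-cfgtool -s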
Advanced Corosync Configuration
Edit /etc/corosync/corosync.conf for production optimization:
totem {
version: 2
cluster_name: enterprise-cluster
transport: knet
# Failure detection timing (token, join and consensus are in milliseconds)
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 3600
# Enable encryption
crypto_cipher: aes256
crypto_hash: sha256
# With the knet transport, per-node link addresses come from the nodelist
# (ring0_addr/ring1_addr); interface sections carry only per-link tuning
interface {
linknumber: 0
knet_transport: udp
}
# Redundant link for fault tolerance
interface {
linknumber: 1
knet_transport: udp
}
}
quorum {
provider: corosync_votequorum
expected_votes: 3
two_node: 0
wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 10000
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
timestamp: on
logger_subsys {
subsys: QUORUM
debug: off
}
}
nodelist {
node {
ring0_addr: cluster-node1
ring1_addr: cluster-node1-priv
name: cluster-node1
nodeid: 1
}
node {
ring0_addr: cluster-node2
ring1_addr: cluster-node2-priv
name: cluster-node2
nodeid: 2
}
node {
ring0_addr: cluster-node3
ring1_addr: cluster-node3-priv
name: cluster-node3
nodeid: 3
}
}
Push the updated file to all nodes and reload it:
pcs cluster sync
pcs cluster reload corosync
STONITH/Fencing Configuration
Configure IPMI-based fencing:
# Create fence device for node1
pcs stonith create fence-node1 fence_ipmilan \
pcmk_host_list="cluster-node1" \
ipaddr="192.168.200.11" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
# Repeat for all nodes
pcs stonith create fence-node2 fence_ipmilan \
pcmk_host_list="cluster-node2" \
ipaddr="192.168.200.12" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
pcs stonith create fence-node3 fence_ipmilan \
pcmk_host_list="cluster-node3" \
ipaddr="192.168.200.13" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
# Enable STONITH globally
pcs property set stonith-enabled=true
# Test fencing (this really power-cycles the target node)
pcs stonith fence cluster-node3
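A common additional safeguard, sketched below, is to keep each fence device from running on the node it is meant to fence, so a hung node is never responsible for fencing itself:
pcs constraint location fence-node1 avoids cluster-node1
pcs constraint location fence-node2 avoids cluster-node2
pcs constraint location fence-node3 avoids cluster-node3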
Resource Configuration Examples
Virtual IP Address Resource:
pcs resource create vip-public ocf:heartbeat:IPaddr2 \
ip=192.168.100.100 \
cidr_netmask=24 \
nic=eth0 \
op monitor interval=10s
Apache Web Server Resource:
pcs resource create webserver systemd:httpd \
op monitor interval=30s \
op start timeout=60s \
op stop timeout=60s
# Ensure VIP starts before webserver
pcs constraint order vip-public then webserver
# Ensure VIP and webserver run on same node
pcs constraint colocation add webserver with vip-public INFINITY
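An equivalent and often simpler pattern is to place both resources into a group, which implies the same ordering and colocation (use either the explicit constraints above or the group, not both):
# Group members start in listed order and always share a node
pcs resource group add web-group vip-public webserver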
Clustered Filesystem Resource:
# Create LVM volume group resource
pcs resource create vg-cluster ocf:heartbeat:LVM-activate \
vgname=cluster_vg \
vg_access_mode=system_id \
op monitor interval=30s \
op start timeout=90s
# Create filesystem resource
pcs resource create fs-cluster Filesystem \
device="/dev/cluster_vg/cluster_lv" \
directory="/mnt/cluster" \
fstype="xfs" \
op monitor interval=20s \
op start timeout=60s \
op stop timeout=60s
# Create resource group
pcs resource group add cluster-storage vg-cluster fs-cluster
PostgreSQL Database Resource:
pcs resource create postgres-db pgsql \
pgctl="/usr/pgsql-14/bin/pg_ctl" \
psql="/usr/pgsql-14/bin/psql" \
pgdata="/var/lib/pgsql/14/data" \
pgport="5432" \
op start timeout=120s \
op stop timeout=120s \
op monitor interval=30s timeout=60s
# Add to resource group with storage and VIP
pcs constraint order cluster-storage then postgres-db
pcs constraint colocation add postgres-db with cluster-storage INFINITY
Performance Optimization
Corosync Tuning for Low Latency
Optimize token timeout values based on network characteristics:
# Fast failure detection for low-latency networks. Token timing is a corosync
# totem setting, not a Pacemaker property, so change it with pcs cluster config
# update (pcs 0.10.8+) or by editing /etc/corosync/corosync.conf and reloading
pcs cluster config update totem token=1000 token_retransmits_before_loss_const=20
# Recent corosync releases can also warn when the token round trip approaches
# the timeout via the token_warning totem option (a percentage, default 75)
Resource Stickiness and Migration Threshold
Prevent unnecessary resource migrations:
# Global resource stickiness
pcs resource defaults resource-stickiness=100
# Per-resource migration threshold
pcs resource meta webserver migration-threshold=3
pcs resource meta webserver failure-timeout=300s
Concurrent Fencing Operations
Enable parallel fencing for faster recovery:
pcs property set concurrent-fencing=true
pcs property set stonith-max-attempts=5
pcs property set stonith-action=reboot
Cluster Transition Optimization
Reduce cluster recalculation overhead:
pcs property set cluster-recheck-interval=2min
pcs property set dc-deadtime=20s
pcs property set election-timeout=2min
Resource Operation Tuning
Optimize resource check intervals:
# Less frequent monitoring for stable resources
pcs resource op remove webserver monitor
pcs resource op add webserver monitor interval=60s timeout=30s
# Aggressive monitoring for critical resources
pcs resource op remove postgres-db monitor
pcs resource op add postgres-db monitor interval=15s timeout=45s on-fail=restart
High Availability Patterns
Active-Passive Configuration
Standard failover cluster with services running on one node:
# Create a standard (non-cloned) primitive resource
pcs resource create app-service systemd:myapp \
op monitor interval=30s
# Set preferred node
pcs constraint location app-service prefers cluster-node1=100
pcs constraint location app-service prefers cluster-node2=50
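Failover for an active-passive service can be exercised manually; note that pcs resource move works by adding a temporary location constraint that should be cleared afterwards:
# Move the service away from its current node
pcs resource move app-service cluster-node2
# Remove the constraint created by the move so normal placement rules apply again
pcs resource clear app-service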
Active-Active with Cloned Resources
Services running simultaneously on all nodes:
# Create cloned resource (pcs names the resulting clone app-clone-clone)
pcs resource create app-clone systemd:myapp \
clone notify=true globally-unique=false
# Limit clone instances
pcs resource meta app-clone-clone clone-max=3 clone-node-max=1
Master-Slave (Promotable Clone) Configuration
Database replication with automatic promotion:
# PostgreSQL streaming replication
pcs resource create postgres-ha pgsqlms \
bindir="/usr/pgsql-14/bin" \
pgdata="/var/lib/pgsql/14/data" \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s role="Master" \
op monitor interval=30s role="Slave" \
promotable notify=true
# Virtual IP follows master
pcs constraint colocation add vip-db with master postgres-ha-clone INFINITY
pcs constraint order promote postgres-ha-clone then start vip-db
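Which instance gets promoted is driven by per-node promotion scores published by the resource agent; they can be inspected alongside cluster status:
# -A shows node attributes (including agent-published promotion scores), -1 prints once and exits
crm_mon -A1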
Monitoring and Observability
Cluster Status Monitoring
Real-time cluster status:
# Comprehensive cluster status
pcs status --full
# Resource status and full configuration
pcs resource status
pcs resource config
# Node attributes
pcs node attribute
Monitoring script for automation:
#!/bin/bash
# /usr/local/bin/cluster-health-check.sh
OUTPUT_FILE="/var/log/cluster/health-$(date +%Y%m%d-%H%M%S).log"
{
echo "=== Cluster Health Check: $(date) ==="
# Cluster status
echo -e "\n--- Cluster Status ---"
pcs status
# Quorum status
echo -e "\n--- Quorum Status ---"
corosync-quorumtool
# Resource constraints
echo -e "\n--- Resource Constraints ---"
pcs constraint --full
# Failed actions
echo -e "\n--- Failed Actions ---"
pcs status --full | grep -A5 "Failed Actions" || echo "None"
# STONITH status
echo -e "\n--- STONITH Devices ---"
pcs stonith status
} > "$OUTPUT_FILE"
# Alert on issues
if pcs status 2>&1 | grep -q "Failed\|Error\|Unclean"; then
mail -s "Cluster Health Alert" [email protected] < "$OUTPUT_FILE"
fi
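The health check above can be scheduled with cron; the interval and path are illustrative:
# Run every 5 minutes on each cluster node
echo '*/5 * * * * root /usr/local/bin/cluster-health-check.sh' > /etc/cron.d/cluster-health
chmod 644 /etc/cron.d/cluster-health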
Prometheus Integration
Export cluster metrics for monitoring:
# Install ha_cluster_exporter
wget https://github.com/ClusterLabs/ha_cluster_exporter/releases/download/1.3.0/ha_cluster_exporter-amd64.tar.gz
tar -xvf ha_cluster_exporter-amd64.tar.gz -C /usr/local/bin/
# Create systemd service
cat > /etc/systemd/system/ha_cluster_exporter.service << EOF
[Unit]
Description=HA Cluster Exporter
After=network.target
[Service]
Type=simple
User=hacluster
ExecStart=/usr/local/bin/ha_cluster_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now ha_cluster_exporter
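Prometheus then needs a scrape job for the exporter on every node; a minimal fragment for the scrape_configs section of prometheus.yml, assuming the exporter's default listen port of 9664:
scrape_configs:
  - job_name: 'ha_cluster'
    static_configs:
      - targets: ['cluster-node1:9664', 'cluster-node2:9664', 'cluster-node3:9664']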
Log Analysis and Alerting
Centralize cluster logging:
# Configure rsyslog forwarding
cat >> /etc/rsyslog.d/cluster.conf << EOF
# Pacemaker 2.x daemons log as pacemaker-controld, pacemaker-schedulerd, etc.
if \$programname startswith 'pacemaker' or \$programname == 'corosync' then @@log-server:514
EOF
systemctl restart rsyslog
Troubleshooting
Quorum Loss Scenarios
Symptom: Cluster stops managing resources due to lost quorum.
Diagnosis:
# Check quorum status
corosync-quorumtool
# View voting configuration
pcs quorum status
Resolution:
# Temporarily enable quorum override (dangerous!)
pcs quorum unblock
# Restore normal operations once nodes rejoin
pcs cluster start cluster-node2
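Clusters that must survive repeated node losses can add a quorum device on a separate host as an arbiter (the qnetd port was already opened in the firewall step); a minimal sketch, assuming a reachable host named qnetd-server that already runs corosync-qnetd:
# On the cluster nodes: install the qdevice client and attach it to the cluster
dnf install -y corosync-qdevice
pcs quorum device add model net host=qnetd-server algorithm=ffsplit
# Verify
pcs quorum device status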
Split-Brain Detection and Recovery
Symptom: Multiple nodes believe they are cluster coordinator.
Diagnosis:
# Check DC election
crm_mon -1 | grep "Current DC"
# Verify STONITH history
stonith_admin --history '*'
Resolution:
# Manually fence ambiguous nodes
pcs stonith fence cluster-node2
# Verify cluster integrity
pcs status --full
Resource Failures and Restart
Symptom: Resources repeatedly failing and restarting.
Diagnosis:
# View failure history
pcs status --full | grep -A10 "Failed"
# Check resource configuration
pcs resource config webserver
Resolution:
# Clear resource failures
pcs resource cleanup webserver
# Increase failure threshold
pcs resource meta webserver migration-threshold=5 failure-timeout=600s
# Manual resource operations
pcs resource debug-start webserver
Corosync Communication Issues
Symptom: Nodes joining/leaving cluster erratically.
Diagnosis:
# Check corosync ring status
corosync-cfgtool -s
# Monitor membership changes
corosync-cmapctl | grep members
Resolution:
# Verify network connectivity
ping -c 5 cluster-node2
mtr cluster-node2
# Check for packet loss
corosync-cfgtool -s | grep "failed"
# Increase the corosync token timeout for lossy networks (totem setting, not a Pacemaker property)
pcs cluster config update totem token=5000
STONITH Failures
Symptom: Fencing operations timing out or failing.
Diagnosis:
# List the fence devices capable of fencing the node
stonith_admin --list cluster-node3
# View fence device configuration
pcs stonith config fence-node3
Resolution:
# Verify IPMI connectivity
ipmitool -I lanplus -H 192.168.200.13 -U admin -P password power status
# Update fence device timeout
pcs stonith update fence-node3 pcmk_reboot_timeout=90s
# Configure fence delays to prevent simultaneous fencing
pcs stonith update fence-node1 pcmk_delay_max=30s
Performance Degradation
Symptom: Slow resource transitions and high CPU usage.
Diagnosis:
# Analyze Pacemaker performance
crm_verify -L -V
# Check for constraint loops
pcs constraint --full
Resolution:
# Simplify constraint configuration
pcs constraint remove complex-constraint-id
# Optimize transition calculation
pcs property set stop-orphan-resources=false
pcs property set stop-orphan-actions=false
Conclusion
Pacemaker and Corosync provide enterprise-grade high availability clustering capabilities essential for modern infrastructure demanding continuous service delivery. This guide has explored advanced configuration patterns, performance optimization techniques, and troubleshooting methodologies that systems engineers need to build and maintain production-ready cluster environments.
Successful cluster implementations require careful planning of hardware topology, network architecture, and resource dependencies. STONITH fencing remains non-negotiable for stateful resources, ensuring data integrity during failure scenarios. Performance tuning must balance rapid failure detection against false positive risks introduced by network latency and congestion.
Organizations should implement comprehensive monitoring using tools like Prometheus and centralized logging to maintain visibility into cluster health. Regular testing of failover scenarios, fence operations, and disaster recovery procedures ensures readiness when actual failures occur. As applications grow increasingly complex, mastering Pacemaker/Corosync clustering becomes essential for delivering the reliability that modern enterprises demand.
Advanced topics like geo-redundant clusters, integration with software-defined networking, and container orchestration platform clustering represent natural extensions of these foundational concepts, positioning skilled engineers to architect highly available solutions at scale.


