Clustering with Pacemaker and Corosync: Enterprise High Availability Guide
Introduction
High availability clustering is a critical component of enterprise infrastructure, ensuring continuous service delivery even when individual nodes fail. Pacemaker and Corosync form the backbone of Linux-based high availability solutions, providing robust cluster resource management and node communication capabilities that power mission-critical applications worldwide.
Pacemaker is a sophisticated cluster resource manager that determines which nodes should host which services and handles failover operations automatically. Corosync, on the other hand, provides the cluster communication layer using the Totem protocol, ensuring reliable message delivery between cluster nodes and detecting node failures through heartbeat mechanisms.
Together, these technologies enable active-passive, active-active, and N+M redundancy configurations that protect against hardware failures, software crashes, and planned maintenance windows. Organizations deploying financial systems, healthcare applications, telecommunications infrastructure, and e-commerce platforms rely on Pacemaker/Corosync clusters to maintain service level agreements (SLAs) demanding 99.99% or higher uptime.
This comprehensive guide explores enterprise-grade cluster implementations, covering architecture design, advanced configuration patterns, performance optimization, and troubleshooting strategies that experienced systems engineers need to build production-ready high availability solutions.
Theory and Core Concepts
Cluster Architecture Fundamentals
High availability clusters built with Pacemaker and Corosync operate on several foundational principles that distinguish them from simple load balancing or redundancy mechanisms.
Quorum and Split-Brain Prevention: Clusters use voting mechanisms to establish quorum—the minimum number of nodes required to operate the cluster. When network partitions occur, only the partition containing quorum can continue managing resources, preventing split-brain scenarios where multiple cluster partitions attempt to manage the same resources simultaneously, leading to data corruption.
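The vote arithmetic behind these decisions can be inspected at any time on a running cluster; a minimal check looks like this:
# Show votes, expected votes, and whether this partition is quorate
corosync-quorumtool -s
# Equivalent summary through pcs
pcs quorum status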
Resource Management Hierarchy: Pacemaker organizes services into resources with configurable properties (a brief pcs sketch follows the list below):
- Primitive Resources: Individual services like IP addresses, filesystems, or applications
- Groups: Collections of resources that must run together on the same node
- Clones: Resources that run on multiple nodes simultaneously (active-active)
- Master-Slave Resources: Resources where one instance is primary and others are standby
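A minimal pcs sketch of each resource type; the resource names here are purely illustrative:
# Primitive: a single floating IP address
pcs resource create demo-ip ocf:heartbeat:IPaddr2 ip=192.168.100.200 cidr_netmask=24
# Group: members start in listed order and stay on the same node
pcs resource group add demo-group demo-ip
# Clone: run an instance on every node (Dummy agent used for illustration)
pcs resource create demo-clone ocf:pacemaker:Dummy clone
# Promotable clone: one instance is promoted, the others remain standby
pcs resource create demo-stateful ocf:pacemaker:Stateful promotable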
Fencing and STONITH: "Shoot The Other Node In The Head" mechanisms forcibly power off or isolate failed nodes to guarantee that malfunctioning systems cannot corrupt shared resources. Fencing is mandatory for production clusters managing stateful resources like databases or shared storage.
Corosync Communication Architecture
Corosync implements the Totem Single Ring Ordering and Membership protocol, providing:
Total Ordering: All messages are delivered to all nodes in the same order, essential for consistent cluster state.
Virtual Synchrony: Groups of nodes maintain synchronized views of cluster membership.
Redundant Ring Protocol: Support for multiple network paths to eliminate single points of failure.
The communication layer operates using multicast or unicast UDP, with configurable heartbeat intervals (typically 1-5 seconds) and failure detection timeouts that balance between rapid failover and false positive detection.
Pacemaker Decision Engine
Pacemaker's Policy Engine (PE) continuously evaluates cluster state against configured constraints and policies:
Location Constraints: Define which nodes can or should host specific resources.
Colocation Constraints: Specify resources that must run together or separately.
Order Constraints: Define startup/shutdown sequences for dependent resources.
Resource Stickiness: Preference for keeping resources on their current node versus migrating.
The Cluster Resource Manager (CRM) interprets these policies and orchestrates resource transitions, minimizing service disruption while respecting all configured constraints.
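These constraint types map directly onto pcs commands; a brief sketch using illustrative resource names:
# Location: prefer node1, never run on node3
pcs constraint location demo-app prefers cluster-node1=200
pcs constraint location demo-app avoids cluster-node3
# Colocation: keep demo-app with its virtual IP
pcs constraint colocation add demo-app with demo-ip INFINITY
# Order: bring the IP up before the application
pcs constraint order demo-ip then demo-app
# Stickiness: resist moving back after a failover
pcs resource meta demo-app resource-stickiness=100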
Prerequisites
Hardware Requirements
Enterprise cluster implementations require careful hardware planning:
Minimum Configuration:
- At least 3 nodes (2 for services + 1 quorum/tie-breaker)
- Dual network interfaces per node for redundant communication
- Shared storage infrastructure (SAN, NAS, or distributed filesystem)
- Dedicated IPMI/iLO/BMC interfaces for STONITH fencing
- 4 CPU cores minimum per node
- 8GB RAM minimum per node (16GB+ recommended)
Network Infrastructure:
- Dedicated cluster interconnect network (isolated VLAN)
- Redundant network paths with sub-5ms latency
- Jumbo frame support (MTU 9000) for storage networks
- Network switches with IGMP snooping for multicast
Storage Considerations:
- Shared block storage for clustered filesystems (GFS2, OCFS2)
- RAID configuration for local storage redundancy
- Battery-backed cache for write performance
- Multipath configuration for storage path redundancy
Software Prerequisites
Operating System Compatibility:
- Red Hat Enterprise Linux 8/9 or compatible (Rocky Linux, AlmaLinux)
- Ubuntu 20.04/22.04 LTS
- SUSE Linux Enterprise Server 15
- Debian 11/12
Required Packages (RHEL/Rocky):
pacemaker corosync pcs fence-agents-all resource-agents
Required Packages (Ubuntu/Debian):
pacemaker corosync crmsh fence-agents resource-agents
Network Configuration Requirements
All nodes require the following (a quick verification sketch follows this list):
- Synchronized time via NTP/Chrony (critical for cluster operations)
- Hostname resolution via /etc/hosts or DNS
- Firewall rules permitting cluster communication
- SELinux/AppArmor policies allowing cluster operations
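A quick per-node verification sketch for these requirements (adjust hostnames to your environment):
# Time must be synchronized and the offset small
chronyc tracking | grep -E 'Stratum|System time'
# Every cluster hostname must resolve identically on every node
getent hosts cluster-node1 cluster-node2 cluster-node3
# The high-availability firewalld service should be active
firewall-cmd --list-services | grep -q high-availability && echo "HA ports open"
# SELinux should be enforcing with the shipped cluster policies, not disabled ad hoc
getenforce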
Advanced Configuration
Initial Cluster Setup
Step 1: Time Synchronization
Configure chrony on all nodes:
# Install chrony
dnf install -y chrony
# Configure reliable NTP sources
cat >> /etc/chrony.conf << EOF
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
EOF
systemctl enable --now chronyd
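Time synchronization can be verified before proceeding; the source marked with '*' is the one currently selected:
chronyc sources -v
chronyc tracking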
Step 2: Hostname and Network Configuration
Configure /etc/hosts on all nodes:
cat >> /etc/hosts << EOF
192.168.100.11 cluster-node1
192.168.100.12 cluster-node2
192.168.100.13 cluster-node3
EOF
Step 3: Firewall Configuration
Open required ports on all nodes:
# Corosync communication
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --permanent --add-port=2224/tcp # pcsd
firewall-cmd --permanent --add-port=3121/tcp # pacemaker remote
firewall-cmd --permanent --add-port=5403/tcp # corosync qnetd
firewall-cmd --permanent --add-port=5404-5412/udp # corosync
firewall-cmd --reload
Step 4: Install and Configure Cluster Software
On all nodes:
# Install cluster packages
dnf install -y pacemaker corosync pcs fence-agents-all
# Enable and start pcsd daemon
systemctl enable --now pcsd
# Set hacluster password (same on all nodes)
echo "StrongClusterPassword123!" | passwd --stdin hacluster
Step 5: Authenticate Cluster Nodes
On node1:
# Authenticate all nodes
pcs host auth cluster-node1 cluster-node2 cluster-node3 \
-u hacluster -p StrongClusterPassword123!
Step 6: Create the Cluster
# Create cluster with all nodes
pcs cluster setup enterprise-cluster \
cluster-node1 addr=192.168.100.11 \
cluster-node2 addr=192.168.100.12 \
cluster-node3 addr=192.168.100.13 \
transport knet
# Enable cluster services on boot
pcs cluster enable --all
# Start the cluster
pcs cluster start --all
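At this point all three members should be online; a quick verification sketch:
# All nodes online, no resources configured yet
pcs status
# Each configured knet link should show as connected for every peer
corosync-cfgtool -s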
Advanced Corosync Configuration
Edit /etc/corosync/corosync.conf for production optimization:
totem {
version: 2
cluster_name: enterprise-cluster
transport: knet
# Failure detection timing (token, join and consensus are in milliseconds)
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 3600
# Enable encryption
crypto_cipher: aes256
crypto_hash: sha256
# With the knet transport, per-node link addresses come from the nodelist
# (ring0_addr/ring1_addr); interface sections carry only per-link tuning
interface {
linknumber: 0
knet_transport: udp
}
# Redundant link for fault tolerance
interface {
linknumber: 1
knet_transport: udp
}
}
quorum {
provider: corosync_votequorum
expected_votes: 3
two_node: 0
wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 10000
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
timestamp: on
logger_subsys {
subsys: QUORUM
debug: off
}
}
nodelist {
node {
ring0_addr: cluster-node1
ring1_addr: cluster-node1-priv
name: cluster-node1
nodeid: 1
}
node {
ring0_addr: cluster-node2
ring1_addr: cluster-node2-priv
name: cluster-node2
nodeid: 2
}
node {
ring0_addr: cluster-node3
ring1_addr: cluster-node3-priv
name: cluster-node3
nodeid: 3
}
}
Push the updated file to all nodes and reload it:
pcs cluster sync
pcs cluster reload corosync
STONITH/Fencing Configuration
Configure IPMI-based fencing:
# Create fence device for node1
pcs stonith create fence-node1 fence_ipmilan \
pcmk_host_list="cluster-node1" \
ipaddr="192.168.200.11" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
# Repeat for all nodes
pcs stonith create fence-node2 fence_ipmilan \
pcmk_host_list="cluster-node2" \
ipaddr="192.168.200.12" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
pcs stonith create fence-node3 fence_ipmilan \
pcmk_host_list="cluster-node3" \
ipaddr="192.168.200.13" \
username="admin" \
password="ipmi_password" \
lanplus=1 \
cipher=1 \
op monitor interval=60s
# Enable STONITH globally
pcs property set stonith-enabled=true
# Test fencing (this really power-cycles the target node)
pcs stonith fence cluster-node3
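A common additional safeguard, sketched below, is to keep each fence device from running on the node it is meant to fence, so a hung node is never responsible for fencing itself:
pcs constraint location fence-node1 avoids cluster-node1
pcs constraint location fence-node2 avoids cluster-node2
pcs constraint location fence-node3 avoids cluster-node3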
Resource Configuration Examples
Virtual IP Address Resource:
pcs resource create vip-public ocf:heartbeat:IPaddr2 \
ip=192.168.100.100 \
cidr_netmask=24 \
nic=eth0 \
op monitor interval=10s
Apache Web Server Resource:
pcs resource create webserver systemd:httpd \
op monitor interval=30s \
op start timeout=60s \
op stop timeout=60s
# Ensure VIP starts before webserver
pcs constraint order vip-public then webserver
# Ensure VIP and webserver run on same node
pcs constraint colocation add webserver with vip-public INFINITY
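An equivalent and often simpler pattern is to place both resources into a group, which implies the same ordering and colocation (use either the explicit constraints above or the group, not both):
# Group members start in listed order and always share a node
pcs resource group add web-group vip-public webserver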
Clustered Filesystem Resource:
# Create LVM volume group resource
pcs resource create vg-cluster ocf:heartbeat:LVM-activate \
vgname=cluster_vg \
vg_access_mode=system_id \
op monitor interval=30s \
op start timeout=90s
# Create filesystem resource
pcs resource create fs-cluster Filesystem \
device="/dev/cluster_vg/cluster_lv" \
directory="/mnt/cluster" \
fstype="xfs" \
op monitor interval=20s \
op start timeout=60s \
op stop timeout=60s
# Create resource group
pcs resource group add cluster-storage vg-cluster fs-cluster
PostgreSQL Database Resource:
pcs resource create postgres-db pgsql \
pgctl="/usr/pgsql-14/bin/pg_ctl" \
psql="/usr/pgsql-14/bin/psql" \
pgdata="/var/lib/pgsql/14/data" \
pgport="5432" \
op start timeout=120s \
op stop timeout=120s \
op monitor interval=30s timeout=60s
# Add to resource group with storage and VIP
pcs constraint order cluster-storage then postgres-db
pcs constraint colocation add postgres-db with cluster-storage INFINITY
Performance Optimization
Corosync Tuning for Low Latency
Optimize token timeout values based on network characteristics:
# Fast failure detection for low-latency networks. Token timing is a corosync
# totem setting, not a Pacemaker property, so change it with pcs cluster config
# update (pcs 0.10.8+) or by editing /etc/corosync/corosync.conf and reloading
pcs cluster config update totem token=1000 token_retransmits_before_loss_const=20
# Recent corosync releases can also warn when the token round trip approaches
# the timeout via the token_warning totem option (a percentage, default 75)
Resource Stickiness and Migration Threshold
Prevent unnecessary resource migrations:
# Global resource stickiness
pcs resource defaults resource-stickiness=100
# Per-resource migration threshold
pcs resource meta webserver migration-threshold=3
pcs resource meta webserver failure-timeout=300s
Concurrent Fencing Operations
Enable parallel fencing for faster recovery:
pcs property set concurrent-fencing=true
pcs property set stonith-max-attempts=5
pcs property set stonith-action=reboot
Cluster Transition Optimization
Reduce cluster recalculation overhead:
pcs property set cluster-recheck-interval=2min
pcs property set dc-deadtime=20s
pcs property set election-timeout=2min
Resource Operation Tuning
Optimize resource check intervals:
# Less frequent monitoring for stable resources
pcs resource op remove webserver monitor
pcs resource op add webserver monitor interval=60s timeout=30s
# Aggressive monitoring for critical resources
pcs resource op remove postgres-db monitor
pcs resource op add postgres-db monitor interval=15s timeout=45s on-fail=restart
High Availability Patterns
Active-Passive Configuration
Standard failover cluster with services running on one node:
# Create a standard (non-cloned) primitive resource
pcs resource create app-service systemd:myapp \
op monitor interval=30s
# Set preferred node
pcs constraint location app-service prefers cluster-node1=100
pcs constraint location app-service prefers cluster-node2=50
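Failover for an active-passive service can be exercised manually; note that pcs resource move works by adding a temporary location constraint that should be cleared afterwards:
# Move the service away from its current node
pcs resource move app-service cluster-node2
# Remove the constraint created by the move so normal placement rules apply again
pcs resource clear app-service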
Active-Active with Cloned Resources
Services running simultaneously on all nodes:
# Create cloned resource (pcs names the resulting clone app-clone-clone)
pcs resource create app-clone systemd:myapp \
clone notify=true globally-unique=false
# Limit clone instances
pcs resource meta app-clone-clone clone-max=3 clone-node-max=1
Master-Slave (Promotable Clone) Configuration
Database replication with automatic promotion:
# PostgreSQL streaming replication
pcs resource create postgres-ha pgsqlms \
bindir="/usr/pgsql-14/bin" \
pgdata="/var/lib/pgsql/14/data" \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s role="Master" \
op monitor interval=30s role="Slave" \
promotable notify=true
# Virtual IP follows master
pcs constraint colocation add vip-db with master postgres-ha-clone INFINITY
pcs constraint order promote postgres-ha-clone then start vip-db
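Which instance gets promoted is driven by per-node promotion scores published by the resource agent; they can be inspected alongside cluster status:
# -A shows node attributes (including agent-published promotion scores), -1 prints once and exits
crm_mon -A1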
Monitoring and Observability
Cluster Status Monitoring
Real-time cluster status:
# Comprehensive cluster status
pcs status --full
# Resource status and full configuration
pcs resource status
pcs resource config
# Node attributes
pcs node attribute
Monitoring script for automation:
#!/bin/bash
# /usr/local/bin/cluster-health-check.sh
OUTPUT_FILE="/var/log/cluster/health-$(date +%Y%m%d-%H%M%S).log"
{
echo "=== Cluster Health Check: $(date) ==="
# Cluster status
echo -e "\n--- Cluster Status ---"
pcs status
# Quorum status
echo -e "\n--- Quorum Status ---"
corosync-quorumtool
# Resource constraints
echo -e "\n--- Resource Constraints ---"
pcs constraint --full
# Failed actions
echo -e "\n--- Failed Actions ---"
pcs status --full | grep -A5 "Failed Actions" || echo "None"
# STONITH status
echo -e "\n--- STONITH Devices ---"
pcs stonith status
} > "$OUTPUT_FILE"
# Alert on issues
if pcs status 2>&1 | grep -q "Failed\|Error\|Unclean"; then
mail -s "Cluster Health Alert" [email protected] < "$OUTPUT_FILE"
fi
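The health check above can be scheduled with cron; the interval and path are illustrative:
# Run every 5 minutes on each cluster node
echo '*/5 * * * * root /usr/local/bin/cluster-health-check.sh' > /etc/cron.d/cluster-health
chmod 644 /etc/cron.d/cluster-health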
Prometheus Integration
Export cluster metrics for monitoring:
# Install ha_cluster_exporter
wget https://github.com/ClusterLabs/ha_cluster_exporter/releases/download/1.3.0/ha_cluster_exporter-amd64.tar.gz
tar -xvf ha_cluster_exporter-amd64.tar.gz -C /usr/local/bin/
# Create systemd service
cat > /etc/systemd/system/ha_cluster_exporter.service << EOF
[Unit]
Description=HA Cluster Exporter
After=network.target
[Service]
Type=simple
User=hacluster
ExecStart=/usr/local/bin/ha_cluster_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now ha_cluster_exporter
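Prometheus then needs a scrape job for the exporter on every node; a minimal fragment for the scrape_configs section of prometheus.yml, assuming the exporter's default listen port of 9664:
scrape_configs:
  - job_name: 'ha_cluster'
    static_configs:
      - targets: ['cluster-node1:9664', 'cluster-node2:9664', 'cluster-node3:9664']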
Log Analysis and Alerting
Centralize cluster logging:
# Configure rsyslog forwarding
cat >> /etc/rsyslog.d/cluster.conf << EOF
# Pacemaker 2.x daemons log as pacemaker-controld, pacemaker-schedulerd, etc.
if \$programname startswith 'pacemaker' or \$programname == 'corosync' then @@log-server:514
EOF
systemctl restart rsyslog
Troubleshooting
Quorum Loss Scenarios
Symptom: Cluster stops managing resources due to lost quorum.
Diagnosis:
# Check quorum status
corosync-quorumtool
# View voting configuration
pcs quorum status
Resolution:
# Temporarily enable quorum override (dangerous!)
pcs quorum unblock
# Restore normal operations once nodes rejoin
pcs cluster start cluster-node2
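Clusters that must survive repeated node losses can add a quorum device on a separate host as an arbiter (the qnetd port was already opened in the firewall step); a minimal sketch, assuming a reachable host named qnetd-server that already runs corosync-qnetd:
# On the cluster nodes: install the qdevice client and attach it to the cluster
dnf install -y corosync-qdevice
pcs quorum device add model net host=qnetd-server algorithm=ffsplit
# Verify
pcs quorum device status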
Split-Brain Detection and Recovery
Symptom: Multiple nodes believe they are cluster coordinator.
Diagnosis:
# Check DC election
crm_mon -1 | grep "Current DC"
# Verify STONITH history
stonith_admin --history '*'
Resolution:
# Manually fence ambiguous nodes
pcs stonith fence cluster-node2
# Verify cluster integrity
pcs status --full
Resource Failures and Restart
Symptom: Resources repeatedly failing and restarting.
Diagnosis:
# View failure history
pcs status --full | grep -A10 "Failed"
# Check resource configuration
pcs resource config webserver
Resolution:
# Clear resource failures
pcs resource cleanup webserver
# Increase failure threshold
pcs resource meta webserver migration-threshold=5 failure-timeout=600s
# Manual resource operations
pcs resource debug-start webserver
Corosync Communication Issues
Symptom: Nodes joining/leaving cluster erratically.
Diagnosis:
# Check corosync ring status
corosync-cfgtool -s
# Monitor membership changes
corosync-cmapctl | grep members
Resolution:
# Verify network connectivity
ping -c 5 cluster-node2
mtr cluster-node2
# Check for packet loss
corosync-cfgtool -s | grep "failed"
# Increase the corosync token timeout for lossy networks (totem setting, not a Pacemaker property)
pcs cluster config update totem token=5000
STONITH Failures
Symptom: Fencing operations timing out or failing.
Diagnosis:
# List the fence devices capable of fencing the node
stonith_admin --list cluster-node3
# View fence device configuration
pcs stonith config fence-node3
Resolution:
# Verify IPMI connectivity
ipmitool -I lanplus -H 192.168.200.13 -U admin -P password power status
# Update fence device timeout
pcs stonith update fence-node3 pcmk_reboot_timeout=90s
# Configure fence delays to prevent simultaneous fencing
pcs stonith update fence-node1 pcmk_delay_max=30s
Performance Degradation
Symptom: Slow resource transitions and high CPU usage.
Diagnosis:
# Analyze Pacemaker performance
crm_verify -L -V
# Check for constraint loops
pcs constraint --full
Resolution:
# Simplify constraint configuration
pcs constraint remove complex-constraint-id
# Optimize transition calculation
pcs property set stop-orphan-resources=false
pcs property set stop-orphan-actions=false
Conclusion
Pacemaker and Corosync provide enterprise-grade high availability clustering capabilities essential for modern infrastructure demanding continuous service delivery. This guide has explored advanced configuration patterns, performance optimization techniques, and troubleshooting methodologies that systems engineers need to build and maintain production-ready cluster environments.
Successful cluster implementations require careful planning of hardware topology, network architecture, and resource dependencies. STONITH fencing remains non-negotiable for stateful resources, ensuring data integrity during failure scenarios. Performance tuning must balance rapid failure detection against false positive risks introduced by network latency and congestion.
Organizations should implement comprehensive monitoring using tools like Prometheus and centralized logging to maintain visibility into cluster health. Regular testing of failover scenarios, fence operations, and disaster recovery procedures ensures readiness when actual failures occur. As applications grow increasingly complex, mastering Pacemaker/Corosync clustering becomes essential for delivering the reliability that modern enterprises demand.
Advanced topics like geo-redundant clusters, integration with software-defined networking, and container orchestration platform clustering represent natural extensions of these foundational concepts, positioning skilled engineers to architect highly available solutions at scale.


