PostgreSQL High Availability with Patroni

Patroni is a modern, robust solution for PostgreSQL high availability that manages PostgreSQL initialization, configuration, and replication without hand-rolled failover scripts. It uses a distributed configuration store (typically etcd, Consul, or ZooKeeper) to manage cluster state and automatically handles failover, switchover, and cluster topology changes. This guide provides a comprehensive walkthrough of deploying PostgreSQL HA with Patroni on Linux systems, covering etcd setup, Patroni installation, configuration, failover mechanisms, and practical operational considerations.

Table of Contents

- Architecture Overview
- etcd Cluster Setup
- Patroni Installation
- Patroni Configuration
- Streaming Replication
- Failover and Recovery
- Switchover Operations
- HAProxy Load Balancing
- Monitoring and Alerting
- Maintenance and Best Practices
- Conclusion

Architecture Overview

A Patroni HA cluster consists of three main components: PostgreSQL servers running on cluster nodes, a distributed key-value store (etcd) for configuration and state management, and a load balancer (HAProxy) for client connection routing. In a typical three-node setup, each node runs PostgreSQL and Patroni, with one node serving as the primary and others as replicas. The etcd cluster manages Patroni state, facilitating automatic failover when the primary becomes unavailable.

Patroni on each node continuously monitors the local PostgreSQL instance and records its state in the distributed store, which coordinates failover decisions. When the primary fails, its leader lease in etcd expires, the surviving Patroni instances hold an election, the most advanced replica is promoted to primary, the topology is updated, and HAProxy health checks redirect clients to the new primary. This process typically completes in 10-30 seconds (governed by the ttl, loop_wait, and retry_timeout settings), minimizing downtime and data loss.
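The leader-lease pattern behind this behavior can be illustrated with a small, hypothetical sketch: the primary must keep renewing a key with a TTL, and once the lease expires, the replica with the highest replayed WAL position is chosen to minimize data loss. This is a conceptual illustration, not Patroni's actual implementation.

```python
import time

def elect_new_leader(members, lease_expiry, now=None):
    """If the leader's lease has expired, pick the replica with the
    highest replayed WAL position as the promotion candidate."""
    now = now if now is not None else time.time()
    if now < lease_expiry:
        return None  # the current leader still holds the lease
    replicas = [m for m in members if m["role"] == "replica"]
    return max(replicas, key=lambda m: m["replay_lsn"])["name"]

members = [
    {"name": "node1", "role": "leader",  "replay_lsn": 0x3000060},
    {"name": "node2", "role": "replica", "replay_lsn": 0x3000060},
    {"name": "node3", "role": "replica", "replay_lsn": 0x2FFFFF0},
]
print(elect_new_leader(members, lease_expiry=100, now=50))   # lease valid -> None
print(elect_new_leader(members, lease_expiry=100, now=150))  # lease expired -> node2
```

In the real system the "lease" is the /patroni/pg-cluster/leader key in etcd, renewed every loop_wait seconds.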

etcd Cluster Setup

etcd is a distributed, strongly consistent key-value store essential for Patroni's distributed consensus. Deploy a three-node etcd cluster with one etcd instance per PostgreSQL cluster node for proper quorum and fault tolerance.

Install etcd on all three nodes. First, download and install the latest stable version:

# Download etcd
ETCD_VER=v3.5.10
curl -L https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz

# Extract and move to /usr/local/bin
tar xzf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
sudo cp etcd-${ETCD_VER}-linux-amd64/{etcd,etcdctl} /usr/local/bin/
sudo chmod +x /usr/local/bin/etcd /usr/local/bin/etcdctl

# Verify installation
etcd --version
etcdctl version

Create an etcd systemd service file. Adjust node names and IPs according to your infrastructure:

sudo nano /etc/systemd/system/etcd.service

For node1 (192.168.1.10), use this configuration:

[Unit]
Description=etcd - Distributed reliable key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
Type=notify
User=etcd
PermissionsStartOnly=true
ProtectSystem=full
ProtectHome=yes
NoNewPrivileges=true
ExecStart=/usr/local/bin/etcd \
  --name node1 \
  --listen-client-urls http://192.168.1.10:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.1.10:2379 \
  --listen-peer-urls http://192.168.1.10:2380 \
  --initial-advertise-peer-urls http://192.168.1.10:2380 \
  --initial-cluster node1=http://192.168.1.10:2380,node2=http://192.168.1.11:2380,node3=http://192.168.1.12:2380 \
  --initial-cluster-state new \
  --initial-cluster-token etcd-cluster \
  --data-dir /var/lib/etcd

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Create the etcd data directory and set permissions:

sudo mkdir -p /var/lib/etcd
sudo useradd -r -s /bin/false etcd 2>/dev/null || true
sudo chown -R etcd:etcd /var/lib/etcd
sudo chmod 700 /var/lib/etcd

On node2 and node3, modify the service with appropriate node identifiers:

# Node 2 configuration
--name node2 \
--listen-client-urls http://192.168.1.11:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://192.168.1.11:2379 \
--listen-peer-urls http://192.168.1.11:2380 \
--initial-advertise-peer-urls http://192.168.1.11:2380 \

# Node 3 configuration
--name node3 \
--listen-client-urls http://192.168.1.12:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://192.168.1.12:2379 \
--listen-peer-urls http://192.168.1.12:2380 \
--initial-advertise-peer-urls http://192.168.1.12:2380 \

Start etcd on all nodes:

sudo systemctl daemon-reload
sudo systemctl start etcd
sudo systemctl enable etcd
sudo systemctl status etcd

Verify the etcd cluster is healthy:

# Check cluster health
etcdctl --endpoints=http://127.0.0.1:2379 endpoint health

# List cluster members
etcdctl --endpoints=http://127.0.0.1:2379 member list

# Test writing a key
etcdctl --endpoints=http://127.0.0.1:2379 put /patroni/test "value"
etcdctl --endpoints=http://127.0.0.1:2379 get /patroni/test

Patroni Installation

Install Patroni on all three PostgreSQL nodes. First, install PostgreSQL server and development tools:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y postgresql-15 postgresql-contrib-15 python3-pip

# RHEL/CentOS
sudo dnf install -y postgresql-server postgresql-contrib postgresql-devel python3-pip

# Stop and disable the distribution's PostgreSQL service — Patroni will manage the postgres processes itself
sudo systemctl stop postgresql
sudo systemctl disable postgresql

Install Patroni and dependencies using pip:

# Install Patroni with etcd v3 support (etcd 3.5 serves only the v3 API by default)
sudo pip3 install 'patroni[etcd3]'

# Install the PostgreSQL driver (required) and other helpers
sudo pip3 install psycopg2-binary requests pyyaml

# Verify installation
patroni --version

Patroni should run as the same operating-system user that owns the PostgreSQL data directory — normally postgres — so that it can initialize, start, and stop the database cluster. Create Patroni's configuration directory and hand it to that user:

sudo mkdir -p /etc/patroni
sudo chown postgres:postgres /etc/patroni
sudo chmod 750 /etc/patroni

Patroni Configuration

Create the main Patroni configuration file. This YAML file contains all cluster topology, PostgreSQL, and Patroni-specific settings:

sudo nano /etc/patroni/patroni.yml

For node1 (primary), use this comprehensive configuration:

scope: pg-cluster
namespace: /patroni
name: node1

restapi:
  listen: 192.168.1.10:8008
  connect_address: 192.168.1.10:8008
  username: patroni_api_user
  password: patroni_api_password

etcd3:
  hosts:
    - 192.168.1.10:2379
    - 192.168.1.11:2379
    - 192.168.1.12:2379

postgresql:
  data_dir: /var/lib/postgresql/15/main
  bin_dir: /usr/lib/postgresql/15/bin
  pgpass: /var/lib/postgresql/.pgpass
  
  listen: 192.168.1.10:5432
  connect_address: 192.168.1.10:5432
  
  authentication:
    superuser:
      username: postgres
      password: postgres_password
    replication:
      username: replication
      password: replication_password
  
  create_replica_methods:
    - basebackup
  basebackup:
    max-rate: 100M
    checkpoint: fast
  
  parameters:
    wal_level: logical
    max_wal_senders: 5
    max_replication_slots: 5
    wal_keep_size: 1GB
    hot_standby: on
    
    shared_preload_libraries: 'pg_stat_statements'
    shared_buffers: '256MB'
    effective_cache_size: '1GB'
    maintenance_work_mem: '64MB'
    log_min_duration_statement: 1000
    logging_collector: on
    log_directory: 'log'
    log_filename: 'postgresql-%Y-%m-%d_%H%M%S.log'
    log_file_mode: '0600'
    log_rotation_age: '1d'

bootstrap:
  method: initdb
  initdb:
    - encoding: UTF8
    - data-checksums
    - locale: en_US.UTF-8
  
  pg_hba:
    - local   all             all                                   trust
    - host    all             all             127.0.0.1/32          md5
    - host    all             all             ::1/128               md5
    - host    replication     replication     192.168.0.0/16        md5
  
  users:
    admin:
      password: admin_password
      options:
        - createrole
        - createdb

watchdog:
  mode: automatic
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

For nodes 2 and 3, modify the name and all listen/connect addresses to match each host. Node names must be unique within the Patroni cluster; matching them to the etcd member names, as done here, is a helpful convention but not a requirement:

# Node 2 configuration changes
name: node2
restapi:
  listen: 192.168.1.11:8008
  connect_address: 192.168.1.11:8008

postgresql:
  listen: 192.168.1.11:5432
  connect_address: 192.168.1.11:5432
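Rather than hand-editing three near-identical files, the node-specific fields can be stamped out from a template. A minimal sketch — the template fragment and IP map here are illustrative, covering only the fields that vary per node:

```python
# Hypothetical helper: render the per-node portion of patroni.yml.
TEMPLATE = """\
name: {name}
restapi:
  listen: {ip}:8008
  connect_address: {ip}:8008
postgresql:
  listen: {ip}:5432
  connect_address: {ip}:5432
"""

NODES = {"node1": "192.168.1.10", "node2": "192.168.1.11", "node3": "192.168.1.12"}

def render(name):
    """Fill the template for one node."""
    return TEMPLATE.format(name=name, ip=NODES[name])

print(render("node2"))
```

The same idea scales to configuration management tools (Ansible templates, etc.) for larger fleets.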

Set proper permissions on the configuration file:

sudo chown postgres:postgres /etc/patroni/patroni.yml
sudo chmod 600 /etc/patroni/patroni.yml

Create a systemd service file for Patroni:

sudo nano /etc/systemd/system/patroni.service

Use this service configuration:

[Unit]
Description=Patroni - PostgreSQL HA solution
After=network.target etcd.service
Wants=etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
WorkingDirectory=/etc/patroni

ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start Patroni:

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

# Check status
sudo systemctl status patroni
sudo journalctl -u patroni -f

Streaming Replication

Patroni automatically configures streaming replication between the primary and replicas. Verify replication is working correctly:

# Connect to primary node
psql -h 192.168.1.10 -U postgres -d postgres

# Check replication slots
SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;

# Check replication connections
SELECT usename, application_name, client_addr, state, sync_state FROM pg_stat_replication;

# Monitor replication lag (run on a replica; returns NULL on the primary)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
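Lag can also be measured in bytes by comparing the primary's pg_current_wal_lsn() with a replica's pg_last_wal_replay_lsn(). A pg_lsn value such as 0/3000060 is a 64-bit WAL position written as two hexadecimal halves, so the byte arithmetic is easy to reproduce client-side:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL pg_lsn string ('HI/LO' in hex) to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Replication lag in bytes, equivalent to pg_wal_lsn_diff() on the server."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)

print(lag_bytes("0/3000060", "0/3000000"))  # 96
print(lag_bytes("1/0", "0/FFFFFFFF"))       # 1 (crossing the 4 GiB boundary)
```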

Monitor replication lag across the cluster:

# Create a monitoring script
cat > /usr/local/bin/check-patroni-replication.sh << 'EOF'
#!/bin/bash

NODES=("192.168.1.10" "192.168.1.11" "192.168.1.12")

for node in "${NODES[@]}"; do
  echo "=== Node $node ==="
  psql -h "$node" -U postgres -d postgres -t -c \
    "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
done
EOF

chmod +x /usr/local/bin/check-patroni-replication.sh
/usr/local/bin/check-patroni-replication.sh

Verify replication status through Patroni REST API:

# Full cluster topology (members, roles, lag)
curl -u patroni_api_user:patroni_api_password http://192.168.1.10:8008/cluster

# /primary returns HTTP 200 only when the queried node is the primary
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.1.10:8008/primary

# /replica returns HTTP 200 only when the queried node is a healthy replica
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.1.10:8008/replica

Failover and Recovery

Patroni automatically detects primary failure and promotes the most advanced replica. Monitor the automatic failover process:

# Monitor etcd for leader changes
watch -n 1 "etcdctl --endpoints=http://127.0.0.1:2379 get /patroni/pg-cluster/leader"

# Watch Patroni status
watch -n 2 "curl -s -u patroni_api_user:patroni_api_password http://192.168.1.10:8008/cluster | jq '.members[] | {name, role}'"
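The 10-30 second detection window is governed by the DCS timing parameters ttl, loop_wait, and retry_timeout (defaults 30, 10, and 10 seconds). Patroni expects roughly loop_wait + 2 * retry_timeout <= ttl, so the leader can survive one failed update round without losing its lease; a small checker under that assumption:

```python
def validate_dcs_timing(ttl=30, loop_wait=10, retry_timeout=10):
    """Sanity-check Patroni DCS timing: one full update round with retries
    (loop_wait + 2*retry_timeout) must fit inside the leader lease ttl.
    Worst-case failover detection is the full ttl, since the lease
    must expire before any replica may be promoted."""
    ok = loop_wait + 2 * retry_timeout <= ttl
    return ok, ttl

print(validate_dcs_timing())        # defaults are consistent
print(validate_dcs_timing(ttl=20))  # too tight: risks spurious demotion
```

Raising ttl makes the cluster more tolerant of DCS hiccups but lengthens the outage window before failover begins.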

Simulate primary failure to test failover:

# Crash the primary (be careful in production!). Note that PostgreSQL is
# managed by Patroni here, so 'systemctl stop postgresql' has no effect;
# stop Patroni instead:
sudo systemctl stop patroni

# Or kill the Patroni process outright to test lease expiry
sudo kill -9 $(pgrep -f "patroni /etc/patroni")

# Monitor failover progress from a surviving node
sudo journalctl -u patroni -f

Manually recover a failed node:

# Check node status after failure
sudo systemctl status patroni

# Reinitialize the failed node from the current primary (preferred over
# deleting the data directory by hand)
patronictl -c /etc/patroni/patroni.yml reinit pg-cluster node1

# Alternatively, remove the stale data directory and let Patroni re-clone:
# sudo rm -rf /var/lib/postgresql/15/main/*
sudo systemctl start patroni

# Patroni will perform a pg_basebackup from the primary and rejoin the cluster
sudo journalctl -u patroni -f | grep -i basebackup

Configure callbacks in patroni.yml to run custom scripts on lifecycle events. Patroni invokes each callback with the event name, the node's role, and the cluster name as arguments (the script path below is a placeholder):

postgresql:
  callbacks:
    on_start: /usr/local/bin/patroni-callback.sh
    on_stop: /usr/local/bin/patroni-callback.sh
    on_role_change: /usr/local/bin/patroni-callback.sh

Switchover Operations

Switchover is a planned, non-disruptive role change from primary to replica. Perform switchover for maintenance or load balancing:

# Request switchover via Patroni REST API
curl -X POST \
  -u patroni_api_user:patroni_api_password \
  -H "Content-Type: application/json" \
  -d '{"leader":"node1","candidate":"node2"}' \
  http://192.168.1.10:8008/switchover

# Verify switchover completion
curl -u patroni_api_user:patroni_api_password http://192.168.1.10:8008/cluster | jq '.members[] | {name, role}'

Monitor switchover progress:

# Watch cluster status during switchover
watch -n 1 "curl -s -u patroni_api_user:patroni_api_password http://192.168.1.10:8008/cluster | jq '.members[] | {name, role, state}'"

# Check the current WAL write position (expressed in bytes since '0/0')
psql -h 192.168.1.10 -U postgres -d postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');"

Handle switchover failures:

# Check Patroni logs for error details
sudo journalctl -u patroni -n 50 -p err

# Retry switchover
curl -X POST \
  -u patroni_api_user:patroni_api_password \
  -H "Content-Type: application/json" \
  -d '{"leader":"node2","candidate":"node1"}' \
  http://192.168.1.10:8008/switchover

HAProxy Load Balancing

HAProxy distributes connections to the PostgreSQL cluster, routing write requests to the primary and read requests to replicas. Install HAProxy on a dedicated server or co-locate with an application server:

# Install HAProxy
sudo apt-get install -y haproxy

# Backup original configuration
sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.bak

Configure HAProxy for PostgreSQL cluster:

sudo nano /etc/haproxy/haproxy.cfg

Add this configuration, replacing default content:

global
  log /dev/log local0
  log /dev/log local1 notice
  chroot /var/lib/haproxy
  stats socket /run/haproxy/admin.sock mode 660 level admin
  stats timeout 30s
  user haproxy
  group haproxy
  daemon

  maxconn 4096
  tune.ssl.default-dh-param 2048

defaults
  log     global
  mode    tcp
  option  tcplog
  timeout connect 5000ms
  timeout client  50000ms
  timeout server  50000ms

# Stats interface
listen stats
  bind 127.0.0.1:8404
  stats enable
  stats uri /stats
  stats admin if TRUE

# PostgreSQL Primary (write)
listen pg-primary
  bind 0.0.0.0:5432
  mode tcp
  option httpchk GET /primary HTTP/1.0
  default-server fall 3 rise 2 on-marked-down shutdown-sessions
  
  server node1 192.168.1.10:5432 check port 8008
  server node2 192.168.1.11:5432 check port 8008 backup
  server node3 192.168.1.12:5432 check port 8008 backup

# PostgreSQL Replicas (read-only)
listen pg-replica
  bind 0.0.0.0:5433
  mode tcp
  option httpchk GET /replica HTTP/1.0
  balance roundrobin
  default-server fall 3 rise 2 on-marked-down shutdown-sessions
  
  server node1 192.168.1.10:5432 check port 8008 fall 10 rise 1
  server node2 192.168.1.11:5432 check port 8008 fall 10 rise 1
  server node3 192.168.1.12:5432 check port 8008 fall 10 rise 1

# All nodes
listen pg-all
  bind 0.0.0.0:5434
  mode tcp
  balance roundrobin
  default-server fall 3 rise 2 on-marked-down shutdown-sessions
  
  server node1 192.168.1.10:5432 check port 8008
  server node2 192.168.1.11:5432 check port 8008
  server node3 192.168.1.12:5432 check port 8008
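The effect of the three listeners can be modeled in a few lines: port 5432 follows whichever node passes the /primary health check, port 5433 spreads reads across nodes passing /replica, and port 5434 accepts any healthy node. This is a toy model of the routing outcome, not HAProxy internals:

```python
def route(members, port):
    """Return the node(s) a listener would send traffic to, based on the
    Patroni role each node reports via its REST health-check endpoints."""
    if port == 5432:   # pg-primary: only the /primary check passes
        return [m for m in members if m["role"] == "leader"]
    if port == 5433:   # pg-replica: only the /replica check passes
        return [m for m in members if m["role"] == "replica"]
    return list(members)  # pg-all: any node

members = [
    {"name": "node1", "role": "leader"},
    {"name": "node2", "role": "replica"},
    {"name": "node3", "role": "replica"},
]
print([m["name"] for m in route(members, 5432)])  # ['node1']
print([m["name"] for m in route(members, 5433)])  # ['node2', 'node3']
```

After a failover the roles change in Patroni, the health checks flip, and the same ports transparently point at the new topology.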

Enable and start HAProxy:

sudo systemctl enable haproxy
sudo systemctl restart haproxy
sudo systemctl status haproxy

# Verify HAProxy is listening
ss -tlnp | grep haproxy

Test HAProxy connectivity:

# Connect via HAProxy to primary
psql -h haproxy.server -p 5432 -U postgres -d postgres -c "SELECT pg_is_in_recovery();"

# Should return 'f' (not in recovery) on primary
# Should return 't' (in recovery) on replicas

# Test replica read-only connection
psql -h haproxy.server -p 5433 -U postgres -d postgres -c "SELECT pg_is_in_recovery();"

Monitor HAProxy:

# View HAProxy statistics (CSV format)
curl -s "http://127.0.0.1:8404/stats;csv" | head -50

# Watch real-time HAProxy activity
watch "curl -s http://127.0.0.1:8404/stats | grep -E 'pg-|#'"

Monitoring and Alerting

Implement comprehensive monitoring of the Patroni cluster using Prometheus and Grafana. Export Patroni metrics for monitoring:

# Create a Prometheus exporter for Patroni
cat > /usr/local/bin/patroni-exporter.py << 'EOF'
#!/usr/bin/env python3

import time

import requests
from prometheus_client import start_http_server, Gauge

# Must match the restapi section of patroni.yml
API_AUTH = ('patroni_api_user', 'patroni_api_password')

# Gauge metrics
cluster_size = Gauge('patroni_cluster_size', 'Number of nodes in cluster')
primary_node = Gauge('patroni_primary_node', 'Is this node primary', ['node'])
replication_lag = Gauge('patroni_replication_lag_bytes', 'Replication lag in bytes', ['node'])

def get_cluster_status():
    try:
        response = requests.get('http://localhost:8008/cluster',
                                auth=API_AUTH, timeout=5)
        return response.json()
    except Exception as e:
        print(f"Error fetching cluster status: {e}")
        return None

def update_metrics():
    while True:
        cluster_info = get_cluster_status()
        if cluster_info:
            members = cluster_info.get('members', [])
            cluster_size.set(len(members))

            for member in members:
                is_primary = 1 if member['role'] == 'leader' else 0
                primary_node.labels(node=member['name']).set(is_primary)
                # Replicas report their lag in the /cluster output
                lag = member.get('lag', 0)
                replication_lag.labels(node=member['name']).set(lag if isinstance(lag, int) else 0)

        time.sleep(10)

if __name__ == '__main__':
    start_http_server(8009)
    update_metrics()
EOF

chmod +x /usr/local/bin/patroni-exporter.py
python3 /usr/local/bin/patroni-exporter.py &

Configure Prometheus to scrape Patroni metrics:

sudo nano /etc/prometheus/prometheus.yml

Add this scrape configuration:

scrape_configs:
  - job_name: 'patroni'
    static_configs:
      - targets: ['192.168.1.10:8009', '192.168.1.11:8009', '192.168.1.12:8009']
    scrape_interval: 15s

Create Grafana dashboards for cluster visualization:

# Query Patroni REST API for monitoring data
curl -s http://192.168.1.10:8008/cluster | jq '.'

# Get node-local status (role, timeline, WAL position)
curl -s http://192.168.1.10:8008/patroni | jq '.'

# Check synchronous replication status
psql -h 192.168.1.10 -U postgres -c \
  "SELECT client_addr, application_name, sync_state FROM pg_stat_replication;"

Set up alerting for critical events:

# Create alert script
cat > /usr/local/bin/patroni-alert.sh << 'EOF'
#!/bin/bash

check_cluster_health() {
  CLUSTER_INFO=$(curl -s http://localhost:8008/cluster)
  SIZE=$(echo "$CLUSTER_INFO" | jq '.members | length')

  if [ "$SIZE" -lt 2 ]; then
    echo "ALERT: Cluster size is $SIZE (expected 3)"
  fi

  # Check this node's role via its local Patroni endpoint
  # (reported as "master" by older Patroni versions, "primary" by newer ones)
  ROLE=$(curl -s http://localhost:8008/patroni | jq -r '.role')
  if [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; then
    echo "INFO: This node is primary"
  else
    echo "INFO: This node is replica"
  fi
}

check_cluster_health
EOF

chmod +x /usr/local/bin/patroni-alert.sh

Maintenance and Best Practices

Perform regular cluster maintenance while maintaining high availability. Rolling restart procedure for updates:

# 1. Restart replicas first
for replica in node2 node3; do
  echo "Restarting $replica..."
  ssh $replica "sudo systemctl restart patroni"
  sleep 30
  
  # Verify node rejoined cluster
  curl -s http://192.168.1.10:8008/cluster | jq ".members[] | select(.name==\"$replica\") | .state"
done

# 2. Finally, hand off the primary role before restarting node1, so the
#    demotion is a controlled switchover rather than an unplanned failover
patronictl -c /etc/patroni/patroni.yml switchover pg-cluster --candidate node2 --force
ssh node1 "sudo systemctl restart patroni"
sleep 60

# Verify cluster recovered with new primary
curl -s http://192.168.1.10:8008/cluster | jq '.members[] | {name, role}'
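The replicas-first ordering above can be expressed as a small helper, useful when scripting rolling restarts against the /cluster output (the member data here is illustrative):

```python
def rolling_restart_order(members):
    """Restart replicas first so the primary changes role only once,
    at the very end of the maintenance window."""
    replicas = [m["name"] for m in members if m["role"] != "leader"]
    leaders = [m["name"] for m in members if m["role"] == "leader"]
    return replicas + leaders

members = [
    {"name": "node1", "role": "leader"},
    {"name": "node2", "role": "replica"},
    {"name": "node3", "role": "replica"},
]
print(rolling_restart_order(members))  # ['node2', 'node3', 'node1']
```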

Backup procedures for cluster:

# Logical backup of a single database in tar format (pg_dump handles one
# database at a time; use pg_dumpall for globals and all databases)
pg_dump -h 192.168.1.10 -U postgres -d postgres -F tar -f backup-$(date +%Y%m%d).tar

# Verify backup
tar -tf backup-$(date +%Y%m%d).tar | head

# Backup individual databases
pg_dump -h 192.168.1.10 -U postgres -d mydb > mydb-backup.sql

# Full cluster backup using pg_basebackup
pg_basebackup -h 192.168.1.10 -U replication -F tar -z -D ./backup-$(date +%Y%m%d)

Handle split-brain scenarios properly:

# Check cluster state as Patroni sees it
patronictl -c /etc/patroni/patroni.yml list

# If an isolated node diverged, restart Patroni on it so it re-syncs
sudo systemctl restart patroni

# Compare system identifiers to confirm both sides belong to the same cluster
# (pg_controldata is a command-line tool, not a SQL function)
/usr/lib/postgresql/15/bin/pg_controldata /var/lib/postgresql/15/main | grep "Database system identifier"

Document and test disaster recovery:

# Test recovery from backup
mkdir -p /tmp/recovery
cd /tmp/recovery

# Extract backup (the pg_dump tar archive is uncompressed; use plain -xf)
tar -xf backup-20240101.tar -C /tmp/recovery/

# Restore to new server (detailed recovery procedures specific to your backup type)

Conclusion

PostgreSQL with Patroni delivers enterprise-grade high availability with automatic failover, seamless switchover, and minimal operational overhead. By properly configuring etcd for distributed consensus, setting up streaming replication, and deploying HAProxy for intelligent connection routing, you create a robust, self-healing database infrastructure. Regular monitoring, testing of failover procedures, and documented maintenance processes ensure your cluster continues to provide reliable service. Remember that while Patroni automates many HA operations, understanding the underlying PostgreSQL replication mechanics and etcd consensus principles remains essential for effective cluster management and troubleshooting.