etcd Cluster Installation and Management

etcd is a distributed key-value store that uses the Raft consensus algorithm to provide strong consistency, making it the backbone of Kubernetes and other distributed systems. This guide covers deploying a production etcd cluster on Linux with TLS security, backup and restore procedures, and performance tuning.
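Raft commits a write only after a majority (quorum) of members acknowledge it, which is why cluster sizing matters: quorum = floor(n/2) + 1, and the cluster tolerates n - quorum failures. The arithmetic in a quick sketch:

```shell
# Quorum size and fault tolerance for a cluster of n members:
# quorum = floor(n/2) + 1, tolerated failures = n - quorum
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerates=$(( n - quorum )) failure(s)"
done
```

Note that going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any extra failures, which is why odd sizes are the rule.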

Prerequisites

  • 3 Linux servers running Ubuntu 22.04 or CentOS/Rocky 9 (an odd member count is required for a useful quorum; an even count adds no extra fault tolerance)
  • At least 2 CPU cores and 4 GB RAM per node; SSD storage strongly recommended, since etcd is highly sensitive to disk write latency
  • All nodes can reach each other on TCP ports 2379 (client) and 2380 (peer)
  • Static IP addresses and synchronized clocks (NTP)
  • cfssl or openssl for certificate generation

Install etcd

Install the same version on all 3 nodes:

# Download the latest stable release
ETCD_VER=v3.5.12
DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download

curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz \
  -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz

tar xvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcd* /usr/local/bin/

# Verify
etcd --version
etcdctl version

Create the system user and directories:

sudo useradd -r -s /sbin/nologin etcd
sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chown etcd:etcd /var/lib/etcd

Generate TLS Certificates

Generate a CA and certificates for peer and client communication:

# Install cfssl (run on one node, distribute certs)
sudo curl -L https://github.com/cloudflare/cfssl/releases/download/v1.6.4/cfssl_1.6.4_linux_amd64 \
  -o /usr/local/bin/cfssl && sudo chmod +x /usr/local/bin/cfssl
sudo curl -L https://github.com/cloudflare/cfssl/releases/download/v1.6.4/cfssljson_1.6.4_linux_amd64 \
  -o /usr/local/bin/cfssljson && sudo chmod +x /usr/local/bin/cfssljson

mkdir -p ~/etcd-certs && cd ~/etcd-certs

# CA config
cat > ca-config.json << 'EOF'
{
  "signing": {
    "default": { "expiry": "8760h" },
    "profiles": {
      "etcd": {
        "expiry": "8760h",
        "usages": ["signing", "key encipherment", "server auth", "client auth"]
      }
    }
  }
}
EOF

# CA CSR
cat > ca-csr.json << 'EOF'
{"CN":"etcd-ca","key":{"algo":"rsa","size":2048},"names":[{"C":"US","ST":"CA","L":"San Francisco","O":"etcd"}]}
EOF

cfssl gencert -initca ca-csr.json | cfssljson -bare ca

# Generate server cert covering all 3 node IPs and localhost
cat > etcd-csr.json << 'EOF'
{
  "CN": "etcd",
  "hosts": [
    "localhost", "127.0.0.1",
    "192.168.1.10", "192.168.1.11", "192.168.1.12",
    "etcd-01", "etcd-02", "etcd-03"
  ],
  "key": {"algo": "rsa", "size": 2048},
  "names": [{"C":"US","ST":"CA","O":"etcd"}]
}
EOF

cfssl gencert -ca=ca.pem -ca-key=ca-key.pem \
  -config=ca-config.json -profile=etcd etcd-csr.json | cfssljson -bare etcd

# Copy certs to all nodes
for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  scp ca.pem etcd.pem etcd-key.pem root@${node}:/etc/etcd/
done

# On each node, lock down permissions after the copy
sudo chmod 600 /etc/etcd/etcd-key.pem
sudo chown etcd:etcd /etc/etcd/*.pem
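Before distributing the certificates, confirm the SAN list actually covers every node IP and hostname; a missing entry surfaces later as opaque TLS handshake failures. A sketch using openssl, demonstrated on a throwaway self-signed cert so it runs anywhere; on a real node, point the final command at /etc/etcd/etcd.pem instead:

```shell
# Generate a throwaway self-signed cert with SANs purely for demonstration
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" \
  -subj "/CN=etcd-test" \
  -addext "subjectAltName=DNS:localhost,IP:127.0.0.1" 2>/dev/null

# Print just the SAN extension; every peer IP and hostname must appear here
san=$(openssl x509 -in "$tmp/cert.pem" -noout -ext subjectAltName)
echo "$san"
rm -rf "$tmp"
```

The `-addext` and `x509 -ext` options require OpenSSL 1.1.1 or newer, which both Ubuntu 22.04 and Rocky 9 ship.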

Bootstrap the etcd Cluster

Create the systemd service on each node. The unit is identical everywhere except --name and the URL flags (--listen-peer-urls, --listen-client-urls, --advertise-client-urls, --initial-advertise-peer-urls), which must carry that node's own name and IP:

# On etcd-01 (192.168.1.10)
sudo tee /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd distributed key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name etcd-01 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls https://192.168.1.10:2380 \
  --listen-client-urls https://192.168.1.10:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://192.168.1.10:2379 \
  --initial-advertise-peer-urls https://192.168.1.10:2380 \
  --initial-cluster etcd-01=https://192.168.1.10:2380,etcd-02=https://192.168.1.11:2380,etcd-03=https://192.168.1.12:2380 \
  --initial-cluster-state new \
  --initial-cluster-token etcd-cluster-01 \
  --peer-cert-file=/etc/etcd/etcd.pem \
  --peer-key-file=/etc/etcd/etcd-key.pem \
  --peer-trusted-ca-file=/etc/etcd/ca.pem \
  --peer-client-cert-auth=true \
  --cert-file=/etc/etcd/etcd.pem \
  --key-file=/etc/etcd/etcd-key.pem \
  --trusted-ca-file=/etc/etcd/ca.pem \
  --client-cert-auth=true \
  --auto-compaction-retention=1 \
  --quota-backend-bytes=8589934592
Restart=on-failure
RestartSec=5
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target
EOF
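Only the name and URL flags change between nodes. For example, on etcd-02 (192.168.1.11) the corresponding lines of the ExecStart block become:

```shell
# etcd-02 substitutions (etcd-03 follows the same pattern with 192.168.1.12)
  --name etcd-02 \
  --listen-peer-urls https://192.168.1.11:2380 \
  --listen-client-urls https://192.168.1.11:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://192.168.1.11:2379 \
  --initial-advertise-peer-urls https://192.168.1.11:2380 \
```

The --initial-cluster line stays identical on every node: it lists the whole membership, not just the local member.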

# Start on all 3 nodes at roughly the same time; each member blocks until a quorum of peers joins
sudo systemctl daemon-reload
sudo systemctl enable --now etcd

Verify Cluster Health

# Set etcdctl environment variables (the v3 API is the default since etcd 3.4)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379
export ETCDCTL_CACERT=/etc/etcd/ca.pem
export ETCDCTL_CERT=/etc/etcd/etcd.pem
export ETCDCTL_KEY=/etc/etcd/etcd-key.pem

# Check cluster membership
etcdctl member list -w table

# Check endpoint health
etcdctl endpoint health -w table

# Check endpoint status (shows leader)
etcdctl endpoint status -w table

Working with etcd Data

# Write a key
etcdctl put /config/database/host "db.example.com"
etcdctl put /config/database/port "5432"

# Read a key
etcdctl get /config/database/host

# List all keys under a prefix
etcdctl get /config/ --prefix

# Delete a key
etcdctl del /config/database/port

# Watch for changes in real-time
etcdctl watch /config/ --prefix &

# Write with TTL (lease)
etcdctl lease grant 120   # 120 second TTL, returns lease ID
etcdctl put /locks/job1 "worker-01" --lease=<lease-id>
etcdctl lease keep-alive <lease-id>  # renew the lease
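In scripts you won't paste the lease ID by hand. `lease grant` prints a line of the form `lease <hex-id> granted with TTL(<n>s)`, so the ID is simply the second whitespace-separated field. A sketch (the helper name and sample ID are illustrative):

```shell
# Extract the lease ID from `etcdctl lease grant` output (hypothetical helper)
extract_lease_id() { awk '{print $2}'; }

# Against a live cluster you would run:
#   LEASE_ID=$(etcdctl lease grant 120 | extract_lease_id)
#   etcdctl put /locks/job1 "worker-01" --lease="$LEASE_ID"
#   etcdctl lease keep-alive "$LEASE_ID"

# Demonstration on a sample output line:
echo "lease 694d77aabcdef08 granted with TTL(120s)" | extract_lease_id
```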

Backup and Restore

Take regular snapshots for disaster recovery:

# Create a snapshot backup (snapshot save requires a single endpoint)
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379

# Verify the snapshot
etcdctl snapshot status /backup/etcd-snapshot-20260101-120000.db -w table

# Automate backups with cron
sudo tee /usr/local/bin/etcd-backup.sh > /dev/null << 'EOF'
#!/bin/bash
BACKUP_DIR=/backup/etcd
SNAPSHOT_FILE=${BACKUP_DIR}/snapshot-$(date +%Y%m%d-%H%M%S).db

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/etcd/ca.pem
export ETCDCTL_CERT=/etc/etcd/etcd.pem
export ETCDCTL_KEY=/etc/etcd/etcd-key.pem

mkdir -p ${BACKUP_DIR}
etcdctl snapshot save ${SNAPSHOT_FILE}

# Keep only last 7 days of backups
find ${BACKUP_DIR} -name "snapshot-*.db" -mtime +7 -delete
echo "Backup saved: ${SNAPSHOT_FILE}"
EOF
sudo chmod +x /usr/local/bin/etcd-backup.sh

# Add to crontab
echo "0 2 * * * root /usr/local/bin/etcd-backup.sh" | sudo tee /etc/cron.d/etcd-backup
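The cron script prunes by file age; an alternative is to keep a fixed number of snapshots regardless of age, which still leaves you backups if the job silently stops running for a week. A sketch (the helper name is illustrative; the filenames embed a sortable timestamp, so lexical order is chronological):

```shell
# Keep only the newest N snapshots in a directory (hypothetical helper)
prune_snapshots() {
  local dir=$1 keep=$2
  # Sorted oldest-first; delete everything except the last $keep entries
  ls "$dir"/snapshot-*.db 2>/dev/null | sort | head -n -"$keep" | xargs -r rm --
}

# Demonstration against a scratch directory with 5 fake snapshots
demo=$(mktemp -d)
for d in 01 02 03 04 05; do touch "$demo/snapshot-202601${d}-020000.db"; done
prune_snapshots "$demo" 3
remaining=$(ls "$demo"/snapshot-*.db | wc -l)
echo "snapshots kept: $remaining"
rm -rf "$demo"
```

`head -n -N` is GNU coreutils syntax, which is fine on the Linux distributions listed in the prerequisites.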

Restore from snapshot:

# Stop etcd on ALL nodes first
sudo systemctl stop etcd

# Restore on each node from the same snapshot file; change --name and
# --initial-advertise-peer-urls to match the node (on etcd v3.5+,
# `etcdutl snapshot restore` is the preferred command)
sudo etcdctl snapshot restore /backup/etcd-snapshot-20260101-120000.db \
  --name etcd-01 \
  --initial-cluster etcd-01=https://192.168.1.10:2380,etcd-02=https://192.168.1.11:2380,etcd-03=https://192.168.1.12:2380 \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://192.168.1.10:2380 \
  --data-dir /var/lib/etcd-restored

sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd

# Start etcd after restoring all nodes
sudo systemctl start etcd

Monitoring and Performance Tuning

# Check metrics endpoint (Prometheus format)
curl -s https://127.0.0.1:2379/metrics \
  --cacert /etc/etcd/ca.pem \
  --cert /etc/etcd/etcd.pem \
  --key /etc/etcd/etcd-key.pem | grep etcd_server
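A handful of metrics cover most alerting needs: etcd_server_has_leader (must be 1), etcd_server_leader_changes_seen_total (should stay flat), and etcd_mvcc_db_total_size_in_bytes (watch it against the backend quota). The values below are illustrative samples, not output from a real cluster:

```shell
# Sample metric lines (illustrative values) in the format /metrics exposes
cat > /tmp/etcd-metrics-sample.txt << 'EOF'
etcd_server_has_leader 1
etcd_server_leader_changes_seen_total 2
etcd_mvcc_db_total_size_in_bytes 4.194304e+06
EOF

# On a healthy member, has_leader is exactly 1
has_leader=$(awk '$1 == "etcd_server_has_leader" {print $2}' /tmp/etcd-metrics-sample.txt)
echo "has_leader=$has_leader"
rm /tmp/etcd-metrics-sample.txt
```

In production the awk filter would read from the curl command above rather than a sample file.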

# Compact old revisions (this frees space inside the keyspace; a follow-up
# defrag returns it to the filesystem)
# First find the current revision
REV=$(etcdctl endpoint status --write-out="json" | python3 -c \
  "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")

# Compact to current revision
etcdctl compact $REV

# Defragment to release the freed space back to the OS (one member at a
# time, since defrag briefly blocks the member being defragmented)
etcdctl defrag --endpoints=https://192.168.1.10:2379
etcdctl defrag --endpoints=https://192.168.1.11:2379
etcdctl defrag --endpoints=https://192.168.1.12:2379

Key performance settings in the service file:

# Raise the heartbeat for high-latency networks (defaults: 100ms heartbeat,
# 1000ms election timeout); keep the election timeout ~5-10x the heartbeat
--heartbeat-interval=250
--election-timeout=1250

# Set the backend quota (default 2 GiB; 8 GiB is the suggested maximum)
--quota-backend-bytes=8589934592

# Compact automatically, retaining 1 hour of history
--auto-compaction-retention=1
--auto-compaction-mode=periodic
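Two sanity checks worth scripting when you change these values: the quota is specified in bytes (8589934592 is 8 GiB), and the election timeout should stay roughly 5-10x the heartbeat interval so network jitter does not trigger spurious elections:

```shell
# Sanity-check tuning values before editing the unit file
heartbeat_ms=250
election_ms=1250
quota_bytes=8589934592

echo "quota = $(( quota_bytes / 1024 / 1024 / 1024 )) GiB"

ratio=$(( election_ms / heartbeat_ms ))
if [ "$ratio" -ge 5 ] && [ "$ratio" -le 10 ]; then
  echo "election/heartbeat ratio ${ratio}x is within the recommended 5-10x"
fi
```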

Troubleshooting

Cluster won't bootstrap - peer connection refused:

# Check firewall rules
sudo firewall-cmd --list-all   # CentOS/Rocky
sudo ufw status                 # Ubuntu

# Allow etcd ports
sudo ufw allow 2379/tcp
sudo ufw allow 2380/tcp

# Check that all nodes can reach each other
nc -zv 192.168.1.11 2380

"no leader" errors / lost quorum (Raft prevents a true split-brain, but a minority partition cannot serve writes):

# Check quorum - with 3 nodes you can lose 1
etcdctl endpoint status -w table

# If majority of nodes are down, check logs
journalctl -u etcd -n 100 --no-pager

Database too large / quota exceeded:

# Check current database size
etcdctl endpoint status -w table

# Compact and defrag immediately
etcdctl compact $(etcdctl endpoint status --write-out="json" | \
  python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
etcdctl defrag

# Clear the NOSPACE alarm that blocks writes once the quota is hit
etcdctl alarm disarm

Conclusion

A production etcd cluster requires TLS encryption, regular snapshots, and disk performance monitoring to remain reliable under load. Running 3 nodes with Raft consensus tolerates one node failure while maintaining availability. Always take a snapshot before major operations, automate backups with cron, and monitor the backend database size to prevent quota-exceeded errors that block all writes.