Cassandra Installation and Configuration
Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling massive amounts of structured data across multiple servers with high availability and no single point of failure. It uses a peer-to-peer architecture with automatic partitioning, replication, and consistent hashing for data distribution. This comprehensive guide covers installation, cluster configuration, data modeling, consistency management, and operational procedures for production Cassandra deployments.
Table of Contents
- Architecture and Concepts
- Java Prerequisites
- Installation
- Cassandra Configuration
- Cluster Bootstrap
- Keyspaces and Tables
- Data Consistency
- Replication Strategies
- Node Management
- Monitoring with nodetool
- Backup and Recovery
- Conclusion
Architecture and Concepts
Cassandra is a distributed system where each node is independent and stores a portion of data determined by consistent hashing. Data is automatically replicated across multiple nodes based on replication factor settings, ensuring availability when nodes fail. The ring topology means nodes arrange logically in a circle, with each node owning tokens for a range of data values.
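The token-ring placement described above can be sketched in a few lines of Python. This is an illustration of the idea only: the hash function, node names, and single-token-per-node layout below are simplifications, not Cassandra's actual Murmur3 partitioner or vnode logic.

```python
import bisect
import hashlib

# Illustrative stand-in for a partitioner: hash a key to a 64-bit token.
def token(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

# Each node owns one token; a key belongs to the first node whose token
# is >= the key's token, wrapping around the ring.
ring = sorted((token(f"node{i}"), f"node{i}") for i in range(1, 4))
tokens = [t for t, _ in ring]

def replicas(key, rf=3):
    i = bisect.bisect_right(tokens, token(key)) % len(ring)
    # Replicas are the owner plus the next rf-1 nodes clockwise on the ring.
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

print(replicas("user:123", rf=2))  # owning node plus one successor
```

Adding a node inserts a new token into the ring and takes over only part of a neighbor's range, which is why consistent hashing keeps data movement small during topology changes.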
Cassandra uses eventual consistency with configurable read and write consistency levels, allowing tuning of the consistency-availability-partition tolerance tradeoff. A quorum-based consistency level provides strong consistency guarantees by requiring a majority of replicas to acknowledge reads and writes. Tunable consistency enables applications to balance consistency needs with latency requirements.
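The quorum arithmetic behind tunable consistency is simple enough to state as code. This sketch is just the standard R + W > RF overlap rule, not a driver API:

```python
def quorum(rf):
    """Majority of replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1

def is_strongly_consistent(rf, reads, writes):
    # Reads see the latest write when read and write replica sets
    # are guaranteed to overlap in at least one node.
    return reads + writes > rf

rf = 3
print(quorum(rf))                                          # 2 replicas
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))  # True
# ONE reads against ONE writes do not overlap: eventual consistency only.
print(is_strongly_consistent(rf, 1, 1))                    # False
```

This is why QUORUM reads plus QUORUM writes (2 + 2 > 3 with RF=3) give strong consistency while still tolerating one node failure.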
Java Prerequisites
Cassandra runs on the Java Virtual Machine. Install a compatible Java version:
# Ubuntu/Debian - Install OpenJDK 11 or 17
sudo apt-get update
sudo apt-get install -y openjdk-17-jdk-headless
# RHEL/CentOS - Install OpenJDK
sudo dnf install -y java-17-openjdk-headless
# Verify Java installation
java -version
javac -version
Configure Java environment variables:
# Set JAVA_HOME
# Use single quotes so the variables expand at login, not when echoed
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# Verify JAVA_HOME
echo $JAVA_HOME
java -version
Configure system limits for Cassandra:
sudo nano /etc/security/limits.conf
Add these limits:
cassandra soft nofile 100000
cassandra hard nofile 100000
cassandra soft nproc 32768
cassandra hard nproc 32768
cassandra soft as unlimited
cassandra hard as unlimited
cassandra soft memlock unlimited
cassandra hard memlock unlimited
Verify limits:
ulimit -n
ulimit -a
Installation
Install Cassandra from the official repository:
# Add Cassandra repository (Ubuntu/Debian)
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
echo "deb https://debian.cassandra.apache.org 41x main" | \
sudo tee /etc/apt/sources.list.d/cassandra.sources.list
# Update and install
sudo apt-get update
sudo apt-get install -y cassandra
# Or install from tarball (CentOS/RHEL)
cd /opt
sudo wget https://archive.apache.org/dist/cassandra/4.1.0/apache-cassandra-4.1.0-bin.tar.gz
sudo tar -xzf apache-cassandra-4.1.0-bin.tar.gz
sudo ln -s apache-cassandra-4.1.0 cassandra
Verify installation:
cassandra -v
nodetool version
cqlsh --version
Create Cassandra user and directories:
# Create system user
sudo useradd -r -s /bin/false cassandra 2>/dev/null || true
# Create data directories
sudo mkdir -p /var/lib/cassandra/data
sudo mkdir -p /var/lib/cassandra/commitlog
sudo mkdir -p /var/lib/cassandra/hints
sudo mkdir -p /var/lib/cassandra/saved_caches
sudo mkdir -p /var/log/cassandra
# Set ownership
sudo chown -R cassandra:cassandra /var/lib/cassandra /var/log/cassandra
sudo chmod 755 /var/lib/cassandra /var/log/cassandra
Create systemd service file:
sudo nano /etc/systemd/system/cassandra.service
Add this configuration:
[Unit]
Description=Apache Cassandra
After=network.target
[Service]
Type=simple
User=cassandra
Group=cassandra
Environment=CASSANDRA_CONF=/etc/cassandra
Environment=CASSANDRA_HOME=/usr/share/cassandra
ExecStart=/usr/sbin/cassandra -f -p /var/run/cassandra.pid
StandardOutput=journal
StandardError=journal
# Restart settings
Restart=always
RestartSec=5
# Timeout settings
TimeoutStopSec=300
[Install]
WantedBy=multi-user.target
Enable and start Cassandra:
sudo systemctl daemon-reload
sudo systemctl enable cassandra
sudo systemctl start cassandra
# Monitor startup
sudo journalctl -u cassandra -f
# Check service status
sudo systemctl status cassandra
Cassandra Configuration
Configure cassandra.yaml for a multi-node cluster. Edit the configuration file:
sudo nano /etc/cassandra/cassandra.yaml
For node1 (192.168.1.10), configure these essential settings:
# Cluster name (must be identical across all nodes)
cluster_name: 'MyCluster'
# Seed list (keep the same list on every node; two or three seeds is typical)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.10,192.168.1.11"
# This node's IP
listen_address: 192.168.1.10
rpc_address: 192.168.1.10
# Address advertised to clients
broadcast_rpc_address: 192.168.1.10
# Snitch for topology awareness
endpoint_snitch: SimpleSnitch
# For multi-datacenter clusters, use GossipingPropertyFileSnitch
# endpoint_snitch: GossipingPropertyFileSnitch
# Storage directories
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
  - /var/lib/cassandra/data
saved_caches_directory: /var/lib/cassandra/saved_caches
hints_directory: /var/lib/cassandra/hints
# Disk access mode
disk_access_mode: auto
# Note: heap sizing is not configured here; set MAX_HEAP_SIZE and
# HEAP_NEWSIZE in cassandra-env.sh (see below)
# Authentication (AllowAll means open access; use PasswordAuthenticator in production)
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
# Enable native transport (CQL protocol)
start_native_transport: true
native_transport_port: 9042
# Virtual nodes: leave initial_token unset when num_tokens is used
num_tokens: 256
# initial_token:
# Commit log compression
commitlog_compression:
  - class_name: LZ4Compressor
    parameters:
      - chunk_length_in_kb: 64
# Partitioner (Murmur3 is the default; never change it on a live cluster)
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
For nodes 2 and 3, modify these settings with appropriate node IPs:
# Node 2 configuration
listen_address: 192.168.1.11
rpc_address: 192.168.1.11
broadcast_rpc_address: 192.168.1.11
# Node 3 configuration
listen_address: 192.168.1.12
rpc_address: 192.168.1.12
broadcast_rpc_address: 192.168.1.12
Configure cassandra-env.sh for memory:
sudo nano /etc/cassandra/cassandra-env.sh
Uncomment and set:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1G"
Restart Cassandra to apply configuration:
sudo systemctl restart cassandra
# Monitor logs
sudo journalctl -u cassandra -f
Cluster Bootstrap
Bootstrap the first node and allow others to join:
# Verify first node is up
nodetool status
# Should show a single node as UN
# UN = Up and Normal
# UJ = Up and Joining
# UL = Up and Leaving
# DN = Down and Normal
Start the remaining nodes once the first node is stable:
# On node2 and node3
sudo systemctl start cassandra
# Monitor bootstrap via nodetool
nodetool status
# Watch logs
sudo journalctl -u cassandra -f | grep -i bootstrap
Verify cluster formation:
# Check peer information
nodetool describecluster
# All three nodes should appear, agreeing on a single schema version
Wait for gossip stabilization:
# Monitor gossip
watch -n 5 'nodetool status'
# All nodes should show "UN" (Up Normal) status
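An "all nodes UN" check can also be scripted by parsing nodetool status output. The sample text below mimics the usual layout, but the exact columns vary by Cassandra version, so treat this as a sketch; in practice you would feed it the output of subprocess.run(["nodetool", "status"], ...):

```python
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  192.168.1.10  103.2 KiB  256     66.7%   11111111-1111-1111-1111-111111111111  rack1
UN  192.168.1.11  98.4 KiB   256     66.7%   22222222-2222-2222-2222-222222222222  rack1
UN  192.168.1.12  101.7 KiB  256     66.6%   33333333-3333-3333-3333-333333333333  rack1
"""

def all_up_normal(status_output):
    # Node rows start with a two-letter state code such as UN, DN, UJ, UL.
    states = [line.split()[0] for line in status_output.splitlines()
              if line[:2] in ("UN", "DN", "UJ", "UL", "DL")]
    return bool(states) and all(s == "UN" for s in states)

print(all_up_normal(SAMPLE))  # True for the sample above
```

A check like this is handy as a readiness gate in deployment scripts: do not start the next node's bootstrap until the previous one reports UN.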
Keyspaces and Tables
Create a keyspace with replication configuration:
# Connect with cqlsh
cqlsh 192.168.1.10
-- Create keyspace with replication
CREATE KEYSPACE IF NOT EXISTS myapp
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
-- Or use NetworkTopologyStrategy for multiple datacenters
-- (datacenter names must match those reported by the snitch)
CREATE KEYSPACE IF NOT EXISTS myapp_distributed
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'us-west': 2
};
-- Switch to keyspace
USE myapp;
-- Create table for events
CREATE TABLE IF NOT EXISTS events (
event_id UUID,
event_time TIMESTAMP,
user_id BIGINT,
event_type TEXT,
event_data TEXT,
PRIMARY KEY ((user_id), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND compression = {'class': 'LZ4Compressor'}
AND gc_grace_seconds = 864000;
-- Create table for user profiles
CREATE TABLE IF NOT EXISTS user_profiles (
user_id BIGINT PRIMARY KEY,
username TEXT,
email TEXT,
signup_date TIMESTAMP,
updated_at TIMESTAMP
) WITH compression = {'class': 'LZ4Compressor'};
-- Create materialized view for lookups by email
-- (materialized views are marked experimental in Cassandra 4.x)
CREATE MATERIALIZED VIEW IF NOT EXISTS user_by_email AS
SELECT user_id, username, email, signup_date
FROM user_profiles
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);
-- Create table for metrics with TTL
CREATE TABLE IF NOT EXISTS metrics (
metric_name TEXT,
metric_time TIMESTAMP,
host_id TEXT,
metric_value DOUBLE,
PRIMARY KEY ((metric_name, host_id), metric_time)
) WITH CLUSTERING ORDER BY (metric_time DESC)
AND default_time_to_live = 2592000;
-- Create secondary index on a regular column (use sparingly; secondary
-- indexes perform poorly on high-cardinality or write-heavy columns)
CREATE INDEX IF NOT EXISTS ON events (event_type);
-- Show tables
DESCRIBE TABLES;
DESCRIBE KEYSPACE myapp;
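The events table's key layout (partition by user_id, rows clustered by event_time descending) can be mimicked with plain Python data structures to make the query pattern concrete. This models the logical storage layout only, not Cassandra's actual SSTable internals:

```python
from collections import defaultdict

# One "partition" per user_id; rows within it kept sorted by event_time DESC,
# mirroring PRIMARY KEY ((user_id), event_time, event_id) with DESC clustering.
partitions = defaultdict(list)

def insert_event(user_id, event_time, event_id, event_type):
    partitions[user_id].append((event_time, event_id, event_type))
    partitions[user_id].sort(key=lambda row: row[0], reverse=True)

insert_event(123, 1000, "e1", "login")
insert_event(123, 3000, "e3", "click")
insert_event(123, 2000, "e2", "logout")

# SELECT * FROM events WHERE user_id = 123 LIMIT 2 reads the newest rows
# first, straight off the front of the partition -- no sort at query time.
print(partitions[123][:2])
```

This is the essence of Cassandra data modeling: choose the partition key so each query touches one partition, and choose the clustering order so rows come back already sorted the way the application reads them.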
Data Consistency
Configure consistency levels for reads and writes:
-- Insert data (cqlsh defaults to consistency ONE unless changed)
INSERT INTO events (event_id, event_time, user_id, event_type, event_data)
VALUES (uuid(), toTimestamp(now()), 123, 'login', 'User logged in')
USING TIMESTAMP 1640000000000000;  -- write timestamp in microseconds since epoch
-- Insert with TTL (time to live)
INSERT INTO metrics (metric_name, metric_time, host_id, metric_value)
VALUES ('cpu_usage', toTimestamp(now()), 'host-01', 45.5)
USING TTL 86400;
-- Query by partition key (no ALLOW FILTERING needed for key lookups)
SELECT * FROM events WHERE user_id = 123;
-- Update data
UPDATE user_profiles
SET username = 'alice', updated_at = toTimestamp(now())
WHERE user_id = 123;
-- Range delete within a partition (the full partition key is required)
DELETE FROM metrics
WHERE metric_name = 'old_metric' AND host_id = 'host-01'
AND metric_time < '2024-01-01';
-- Batch multiple operations (best kept within a single partition)
BEGIN BATCH
INSERT INTO events (event_id, event_time, user_id, event_type)
VALUES (uuid(), toTimestamp(now()), 123, 'action1');
INSERT INTO events (event_id, event_time, user_id, event_type)
VALUES (uuid(), toTimestamp(now()), 123, 'action2');
APPLY BATCH;
Set the consistency level inside cqlsh with the CONSISTENCY command:
cqlsh 192.168.1.10
CONSISTENCY ONE;
-- or
CONSISTENCY QUORUM;
-- or
CONSISTENCY ALL;
Configure consistency in application code (pseudo-code):
# In Python (DataStax cassandra-driver)
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
cluster = Cluster(['192.168.1.10', '192.168.1.11', '192.168.1.12'])
session = cluster.connect()
# Set the default consistency level for the session
session.default_consistency_level = ConsistencyLevel.QUORUM
# Or per query
from cassandra.query import SimpleStatement
query = SimpleStatement(
"SELECT * FROM myapp.events WHERE user_id = %s",
consistency_level=ConsistencyLevel.ONE
)
rows = session.execute(query, (123,))
Replication Strategies
Understand replication strategies for fault tolerance:
-- SimpleStrategy (single datacenter)
-- Replicates to consecutive nodes on the ring
CREATE KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- NetworkTopologyStrategy (multiple datacenters)
-- Controls replicas per datacenter
CREATE KEYSPACE myapp_global
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-2': 3,
'eu-central-1': 2
};
-- Alter keyspace replication (run 'nodetool repair' on each node afterwards
-- so existing data is redistributed to match the new settings)
ALTER KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
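The "consecutive nodes on the ring" rule from the SimpleStrategy comment above can be sketched directly, next to a simplified per-datacenter walk for NetworkTopologyStrategy. Node names and datacenter labels are made up for illustration, and the real NTS placement also considers racks, which is omitted here:

```python
# Ring positions in token order; each node tagged with a datacenter.
RING = [("node1", "dc1"), ("node2", "dc2"), ("node3", "dc1"),
        ("node4", "dc2"), ("node5", "dc1")]

def simple_strategy(owner_index, rf):
    # Walk clockwise from the owning node, ignoring topology entirely.
    return [RING[(owner_index + k) % len(RING)][0] for k in range(rf)]

def network_topology(owner_index, rf_per_dc):
    # Walk clockwise, but count replicas per datacenter (rack logic omitted).
    need = dict(rf_per_dc)
    chosen = []
    for k in range(len(RING)):
        name, dc = RING[(owner_index + k) % len(RING)]
        if need.get(dc, 0) > 0:
            chosen.append(name)
            need[dc] -= 1
    return chosen

print(simple_strategy(0, rf=3))                   # node1, node2, node3
print(network_topology(0, {"dc1": 2, "dc2": 1}))  # two in dc1, one in dc2
```

The difference matters under failure: SimpleStrategy can land all replicas in one datacenter, while NetworkTopologyStrategy guarantees the per-datacenter counts survive a whole-datacenter outage.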
Configure datacenter-aware replication:
# Edit cassandra.yaml to enable the rack-aware snitch
sudo nano /etc/cassandra/cassandra.yaml
# endpoint_snitch: GossipingPropertyFileSnitch
# Configure datacenter and rack
sudo nano /etc/cassandra/cassandra-rackdc.properties
Add:
# US East datacenter; rack is typically the availability zone
dc=us-east-1
rack=us-east-1a
# Or for a node in the US West datacenter
dc=us-west-2
rack=us-west-2b
Node Management
Add and remove nodes from the cluster:
# Add new node to cluster
# On new node, ensure configuration points to existing cluster via seeds
# Start cassandra - it will bootstrap and download data from existing nodes
sudo systemctl start cassandra
# Monitor bootstrap progress
nodetool netstats
# Check when bootstrap is complete
nodetool status
# Gracefully remove a node
nodetool decommission
# Force remove a dead node
nodetool removenode <node-uuid>
Handle node failures:
# Replace a failed node with a new one
# The replacement may use a new IP; replace_address must point at the DEAD node's IP
# Add -Dcassandra.replace_address=<dead-node-ip> to the JVM options
sudo nano /etc/cassandra/cassandra-env.sh
# Add to JVM_OPTS
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.10"
# Start Cassandra - the node streams its data from the surviving replicas
Repair data consistency:
# Run repair on a node (recommended weekly)
nodetool repair
# Repair specific keyspace
nodetool repair myapp
# Repair only this node's primary token ranges (run on every node in turn)
nodetool repair -pr myapp
# List active repair sessions (Cassandra 4.x)
nodetool repair_admin list
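Why repair weekly: every node must be fully repaired within gc_grace_seconds (864000 seconds on the events table above), or tombstones can be purged before reaching all replicas and deleted data can resurrect. A tiny calculation, with a rule-of-thumb safety margin that is an assumption here, not a Cassandra requirement:

```python
GC_GRACE_SECONDS = 864_000  # 10 days, as set on the events table above

def max_repair_interval_days(gc_grace_seconds, safety_margin=0.7):
    """Repair interval should fit comfortably inside gc_grace_seconds.

    The 0.7 safety margin is a rule of thumb, leaving room for
    failed or slow repair runs before the tombstone deadline."""
    return gc_grace_seconds * safety_margin / 86_400

print(max_repair_interval_days(GC_GRACE_SECONDS))  # 7.0 days -> weekly repair
```

If you lower gc_grace_seconds to expire tombstones faster, tighten the repair schedule to match.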
Monitoring with nodetool
Monitor cluster health and performance using nodetool:
# Overall cluster status
nodetool status
# Detailed cluster information
nodetool describecluster
# Node information
nodetool info
# Ring information (token distribution)
nodetool ring
# Gossip state
nodetool gossipinfo
# Pending compactions and streams
nodetool netstats
# Tpstats (thread pool stats)
nodetool tpstats
# GC stats
nodetool gcstats
# Disk usage per keyspace/table
nodetool tablestats myapp
# Snapshot creation for backup
nodetool snapshot myapp -t backup_$(date +%Y%m%d)
# List snapshots
nodetool listsnapshots
# Clear snapshots (all of them, or a specific tag with -t)
nodetool clearsnapshot --all
# Flush memtables
nodetool flush myapp
# Compact sstables
nodetool compact
Monitor performance metrics:
# Review recent compaction history
nodetool compactionhistory
# Check whether the node is accepting CQL connections
nodetool statusbinary
# Check schema version
nodetool describecluster | grep "Schema"
# Find which nodes own the partition for a given key
nodetool getendpoints myapp events 123
Backup and Recovery
Implement backup procedures:
# Create snapshot
nodetool snapshot myapp -t myapp_$(date +%Y%m%d_%H%M%S)
# Backup snapshot files
tar -czf /backup/cassandra_$(date +%Y%m%d).tar.gz \
/var/lib/cassandra/data/myapp/*/snapshots/
# Backup schema
cqlsh -e "DESCRIBE KEYSPACE myapp;" > /backup/schema_$(date +%Y%m%d).cql
# Full cluster backup script
#!/bin/bash
BACKUP_DIR="/backup/cassandra-$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  # Use an account with shell access; the cassandra service user has no login shell
  ssh admin@$node "nodetool snapshot myapp -t daily"
  scp -r admin@$node:/var/lib/cassandra/data $BACKUP_DIR/node_$node/
done
# Backup commit logs and hints for point-in-time recovery
tar -czf $BACKUP_DIR/commit_logs.tar.gz /var/lib/cassandra/commitlog/
tar -czf $BACKUP_DIR/hints.tar.gz /var/lib/cassandra/hints/
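A companion to the snapshot script above is retention pruning: deciding which dated backup directories to delete. This sketch assumes the cassandra-YYYYMMDD naming used by the script; adapt the parsing if your tags differ:

```python
from datetime import date, timedelta

def backups_to_delete(names, keep_days, today):
    # Names follow the cassandra-YYYYMMDD pattern produced by the script above.
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in names:
        stamp = name.rsplit("-", 1)[-1]
        taken = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
        if taken < cutoff:
            stale.append(name)
    return stale

names = ["cassandra-20240101", "cassandra-20240110", "cassandra-20240115"]
print(backups_to_delete(names, keep_days=7, today=date(2024, 1, 16)))
```

Run the pruning only after verifying the newest backup restores cleanly; deleting old snapshots before validating the new one leaves no fallback.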
Restore from backup:
# Stop Cassandra
sudo systemctl stop cassandra
# Remove current data for the keyspace being restored
sudo rm -rf /var/lib/cassandra/data/myapp
# Start Cassandra and recreate the schema first
# (tables must exist before their sstables can be loaded)
sudo systemctl start cassandra
cqlsh -f /backup/schema_20240101.cql
# Copy the snapshot sstables into each table's data directory,
# then load them without a restart:
nodetool refresh myapp events
# (alternatively, stream the sstables in with sstableloader)
# Verify data
cqlsh -e "SELECT COUNT(*) FROM myapp.events;"
Conclusion
Cassandra provides a horizontally scalable, fault-tolerant database for applications requiring extreme availability and performance. Its peer-to-peer architecture eliminates single points of failure while consistent hashing ensures data is distributed efficiently across nodes. By understanding replication strategies, consistency levels, and node management procedures, you can build production Cassandra clusters that handle massive data volumes reliably. Proper configuration of snitch strategies, replication factors, and repair procedures maintains data consistency across the distributed system while delivering the high availability your applications demand.