Cassandra Installation and Configuration

Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling massive amounts of structured data across multiple servers with high availability and no single point of failure. It uses a peer-to-peer architecture with automatic partitioning, replication, and consistent hashing for data distribution. This comprehensive guide covers installation, cluster configuration, data modeling, consistency management, and operational procedures for production Cassandra deployments.

Architecture and Concepts

Cassandra is a distributed system in which each node is independent and stores the portion of data determined by consistent hashing. Data is automatically replicated across multiple nodes according to the replication factor, ensuring availability when nodes fail. In the ring topology, nodes are arranged logically in a circle, with each node owning one or more ranges of token values.
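The lookup rule behind consistent hashing can be sketched in a few lines of Python. This is a toy model only: it uses MD5 in place of Cassandra's Murmur3 partitioner and one token per node rather than vnodes, but the ring-walk is the same idea.

```python
import bisect
import hashlib

# Toy sketch of ring ownership (NOT Cassandra's implementation): a key maps
# to the first node whose token is >= the key's token, wrapping around the
# end of the ring.

def token_for(key: str) -> int:
    # Stand-in hash; Cassandra hashes with Murmur3 over a signed 64-bit range.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64

def owning_node(ring: dict, key: str) -> str:
    """ring maps token -> node name."""
    tokens = sorted(ring)
    idx = bisect.bisect_left(tokens, token_for(key)) % len(tokens)
    return ring[tokens[idx]]

# Three nodes with evenly spaced tokens
ring = {0: "node1", 2**64 // 3: "node2", 2 * 2**64 // 3: "node3"}
node = owning_node(ring, "user:123")  # deterministic for a given key
```

Because the mapping depends only on the key's hash and the ring's tokens, any node can compute which peers own a given partition without central coordination.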

Cassandra uses eventual consistency with configurable read and write consistency levels, allowing tuning of the consistency-availability-partition tolerance tradeoff. Strong consistency is achieved when the read and write replica sets overlap (R + W > RF), for example QUORUM reads paired with QUORUM writes at replication factor 3. Tunable consistency enables applications to balance consistency needs with latency requirements.
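The overlap rule above is plain arithmetic and can be checked directly; the helper names below are illustrative, not driver APIs.

```python
# If read CL + write CL > RF, the read replica set must intersect the write
# replica set, so a quorum read always sees the latest quorum-acked write.

def quorum(replication_factor: int) -> int:
    # Majority of replicas
    return replication_factor // 2 + 1

def overlaps(rf: int, read_replicas: int, write_replicas: int) -> bool:
    return read_replicas + write_replicas > rf

rf = 3
strong = overlaps(rf, quorum(rf), quorum(rf))  # 2 + 2 > 3: strongly consistent
weak = overlaps(rf, 1, 1)                      # 1 + 1 <= 3: may read stale data
```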

Java Prerequisites

Cassandra runs on the Java Virtual Machine. Install a compatible Java version:

# Ubuntu/Debian - Install OpenJDK 11 or 17
sudo apt-get update
sudo apt-get install -y openjdk-17-jdk-headless

# RHEL/CentOS - Install OpenJDK
sudo dnf install -y java-17-openjdk-headless

# Verify Java installation
java -version
javac -version

Configure Java environment variables:

# Set JAVA_HOME
echo "export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64" >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify JAVA_HOME
echo $JAVA_HOME
java -version

Configure system limits for Cassandra:

sudo nano /etc/security/limits.conf

Add these limits:

cassandra soft nofile 100000
cassandra hard nofile 100000
cassandra soft nproc 32768
cassandra hard nproc 32768
cassandra soft as unlimited
cassandra hard as unlimited
cassandra soft memlock unlimited
cassandra hard memlock unlimited

Verify limits:

ulimit -n
ulimit -a

Installation

Install Cassandra from the official repository:

# Add Cassandra repository (Ubuntu/Debian); apt-key is deprecated,
# so store the signing key under /etc/apt/keyrings instead
sudo mkdir -p /etc/apt/keyrings
sudo curl -o /etc/apt/keyrings/apache-cassandra.asc https://downloads.apache.org/cassandra/KEYS
echo "deb [signed-by=/etc/apt/keyrings/apache-cassandra.asc] https://debian.cassandra.apache.org 41x main" | \
  sudo tee /etc/apt/sources.list.d/cassandra.sources.list

# Update and install
sudo apt-get update
sudo apt-get install -y cassandra

# Or install from tarball (CentOS/RHEL)
cd /opt
sudo wget https://archive.apache.org/dist/cassandra/4.1.0/apache-cassandra-4.1.0-bin.tar.gz
sudo tar -xzf apache-cassandra-4.1.0-bin.tar.gz
sudo ln -s apache-cassandra-4.1.0 cassandra

Verify installation:

cassandra -v
nodetool version
cqlsh --version

Create Cassandra user and directories:

# Create system user
sudo useradd -r -s /bin/false cassandra 2>/dev/null || true

# Create data directories
sudo mkdir -p /var/lib/cassandra/data
sudo mkdir -p /var/lib/cassandra/commitlog
sudo mkdir -p /var/lib/cassandra/hints
sudo mkdir -p /var/lib/cassandra/saved_caches
sudo mkdir -p /var/log/cassandra

# Set ownership
sudo chown -R cassandra:cassandra /var/lib/cassandra /var/log/cassandra
sudo chmod 755 /var/lib/cassandra /var/log/cassandra

Create systemd service file:

sudo nano /etc/systemd/system/cassandra.service

Add this configuration:

[Unit]
Description=Apache Cassandra
After=network.target

[Service]
Type=simple
User=cassandra
Group=cassandra

Environment=CASSANDRA_CONF=/etc/cassandra
Environment=CASSANDRA_HOME=/usr/share/cassandra

ExecStart=/usr/sbin/cassandra -f -p /var/run/cassandra.pid

StandardOutput=journal
StandardError=journal

# Restart settings
Restart=always
RestartSec=5

# Timeout settings
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target

Enable and start Cassandra:

sudo systemctl daemon-reload
sudo systemctl enable cassandra
sudo systemctl start cassandra

# Monitor startup
sudo journalctl -u cassandra -f

# Check service status
sudo systemctl status cassandra

Cassandra Configuration

Configure cassandra.yaml for multi-node cluster. Edit the configuration file:

sudo nano /etc/cassandra/cassandra.yaml

For node1 (192.168.1.10), configure these essential settings:

# Cluster name (must be identical across all nodes)
cluster_name: 'MyCluster'

# Seed node list (the same list on every node; two or three seeds is typical)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.10,192.168.1.11"

# This node's IP
listen_address: 192.168.1.10
rpc_address: 192.168.1.10

# Broadcast addresses for cluster communication
broadcast_rpc_address: 192.168.1.10

# Snitch for topology awareness (SimpleSnitch suits only single-DC test clusters)
endpoint_snitch: SimpleSnitch

# For multi-region, use GossipingPropertyFileSnitch
# endpoint_snitch: GossipingPropertyFileSnitch

# Storage and performance settings
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
  - /var/lib/cassandra/data
saved_caches_directory: /var/lib/cassandra/saved_caches
hints_directory: /var/lib/cassandra/hints

# Disk access mode
disk_access_mode: auto

# Heap sizing belongs in cassandra-env.sh (see below), not in cassandra.yaml

# Authentication (AllowAll disables auth; use PasswordAuthenticator in production)
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer

# Enable native transport (CQL protocol)
start_native_transport: true
native_transport_port: 9042

# Token settings (vnodes; Cassandra 4.x defaults to 16 tokens per node)
num_tokens: 16
# Do not set initial_token when vnodes are in use

# Commit log compression
commitlog_compression:
  - class_name: org.apache.cassandra.io.compress.LZ4Compressor

# Partitioner
partitioner: org.apache.cassandra.dht.Murmur3Partitioner

For nodes 2 and 3, modify these settings with appropriate node IPs:

# Node 2 configuration
listen_address: 192.168.1.11
rpc_address: 192.168.1.11
broadcast_rpc_address: 192.168.1.11

# Node 3 configuration
listen_address: 192.168.1.12
rpc_address: 192.168.1.12
broadcast_rpc_address: 192.168.1.12

Configure cassandra-env.sh for memory:

sudo nano /etc/cassandra/cassandra-env.sh

Uncomment and set:

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1G"

Restart Cassandra to apply configuration:

sudo systemctl restart cassandra

# Monitor logs
sudo journalctl -u cassandra -f

Cluster Bootstrap

Bootstrap the first node and allow others to join:

# Verify first node is up
nodetool status

# Should show single node UN (up, normal)
# UN = Up and Normal
# UL = Up and Leaving
# DL = Down and Leaving

Start the remaining nodes once the first node is stable:

# On node2 and node3
sudo systemctl start cassandra

# Monitor bootstrap via nodetool
nodetool status

# Watch logs
sudo journalctl -u cassandra -f | grep -i bootstrap

Verify cluster formation:

# Check cluster name, snitch, and schema agreement
nodetool describecluster

# All nodes should report the same schema version

Wait for gossip stabilization:

# Monitor gossip
watch -n 5 'nodetool status'

# All nodes should show "UN" (Up Normal) status

Keyspaces and Tables

Create a keyspace with replication configuration:

# Connect with cqlsh
cqlsh 192.168.1.10

-- Create keyspace with replication
CREATE KEYSPACE IF NOT EXISTS myapp
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

-- Or use NetworkTopologyStrategy for multi-region
CREATE KEYSPACE IF NOT EXISTS myapp_distributed
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'us-west': 2
};

-- Switch to keyspace
USE myapp;

-- Create table for events
CREATE TABLE IF NOT EXISTS events (
  event_id UUID,
  event_time TIMESTAMP,
  user_id BIGINT,
  event_type TEXT,
  event_data TEXT,
  PRIMARY KEY ((user_id), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND compression = {'class': 'LZ4Compressor'}
AND gc_grace_seconds = 864000;

-- Create table for user profiles
CREATE TABLE IF NOT EXISTS user_profiles (
  user_id BIGINT PRIMARY KEY,
  username TEXT,
  email TEXT,
  signup_date TIMESTAMP,
  updated_at TIMESTAMP
) WITH compression = {'class': 'LZ4Compressor'};

-- Create materialized view for lookups by email
CREATE MATERIALIZED VIEW IF NOT EXISTS user_by_email AS
SELECT user_id, username, email, signup_date
FROM user_profiles
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);

-- Create table for metrics with TTL
CREATE TABLE IF NOT EXISTS metrics (
  metric_name TEXT,
  metric_time TIMESTAMP,
  host_id TEXT,
  metric_value DOUBLE,
  PRIMARY KEY ((metric_name, host_id), metric_time)
) WITH CLUSTERING ORDER BY (metric_time DESC)
AND default_time_to_live = 2592000;

-- Create secondary index
CREATE INDEX ON metrics(metric_name);

-- Show tables
DESCRIBE TABLES;
DESCRIBE KEYSPACE myapp;
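The durations in the schema above are raw seconds; a quick check of the values chosen (taken from the tables defined earlier):

```python
# default_time_to_live and gc_grace_seconds are specified in seconds
DAY = 86400  # seconds per day

metrics_ttl_days = 2592000 // DAY     # metrics default_time_to_live -> 30 days
events_gc_grace_days = 864000 // DAY  # events gc_grace_seconds -> 10 days
```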

Data Consistency

Configure consistency levels for reads and writes:

-- Insert data (cqlsh's default consistency level is ONE)
INSERT INTO events (event_id, event_time, user_id, event_type, event_data)
VALUES (uuid(), toTimestamp(now()), 123, 'login', 'User logged in')
USING TIMESTAMP 1640000000000000;  -- client-supplied write timestamp, in microseconds

-- Insert with TTL (time to live, in seconds)
INSERT INTO metrics (metric_name, metric_time, host_id, metric_value)
VALUES ('cpu_usage', toTimestamp(now()), 'host-01', 45.5)
USING TTL 86400;

-- Set the consistency level for subsequent cqlsh statements
CONSISTENCY QUORUM;

-- Read by partition key (no ALLOW FILTERING needed)
SELECT * FROM events WHERE user_id = 123;

-- Update data
UPDATE user_profiles 
SET username = 'alice', updated_at = toTimestamp(now())
WHERE user_id = 123;

-- Range delete (the full partition key is required)
DELETE FROM metrics
WHERE metric_name = 'old_metric' AND host_id = 'host-01' AND metric_time < '2024-01-01';

-- Batch multiple operations (keep logged batches to a single partition)
BEGIN BATCH
  INSERT INTO events (event_id, event_time, user_id, event_type)
  VALUES (uuid(), toTimestamp(now()), 123, 'action1');
  INSERT INTO events (event_id, event_time, user_id, event_type)
  VALUES (uuid(), toTimestamp(now()), 123, 'action2');
APPLY BATCH;

Set consistency level in cqlsh:

cqlsh --consistency ONE
# or
cqlsh --consistency QUORUM
# or
cqlsh --consistency ALL

Configure consistency in application code (pseudo-code):

# In Python (DataStax driver)
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(['192.168.1.10', '192.168.1.11', '192.168.1.12'])
session = cluster.connect()

# Set the default consistency level for the session
session.default_consistency_level = ConsistencyLevel.QUORUM

# Or per query
from cassandra.query import SimpleStatement
query = SimpleStatement(
    "SELECT * FROM myapp.events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
rows = session.execute(query, (123,))

Replication Strategies

Understand replication strategies for fault tolerance:

-- SimpleStrategy (single datacenter)
-- Replicates to consecutive nodes on the ring
CREATE KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- NetworkTopologyStrategy (multiple datacenters)
-- Controls replicas per datacenter
CREATE KEYSPACE myapp_global
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'us-west-2': 3,
  'eu-central-1': 2
};

-- Alter keyspace replication (run a full repair afterwards so existing
-- data conforms to the new replica placement)
ALTER KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
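The SimpleStrategy placement rule described above (replicas on consecutive ring nodes) can be sketched in Python. This is a toy model only, ignoring vnodes and snitch topology:

```python
import bisect

# SimpleStrategy sketch: the first replica is the token owner; the remaining
# replicas are the next distinct nodes walking clockwise around the ring.

def simple_strategy_replicas(ring: dict, key_token: int, rf: int) -> list:
    """ring maps token -> node name; returns the rf replica nodes for a key."""
    tokens = sorted(ring)
    start = bisect.bisect_left(tokens, key_token) % len(tokens)
    replicas = []
    i = start
    # Cap at the number of distinct nodes so rf > cluster size terminates
    while len(replicas) < min(rf, len(set(ring.values()))):
        node = ring[tokens[i % len(tokens)]]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

ring = {0: "node1", 100: "node2", 200: "node3"}
replicas = simple_strategy_replicas(ring, 150, rf=2)  # owner node3, then node1
```

This is why SimpleStrategy is single-datacenter only: the clockwise walk knows nothing about racks or datacenters, which is exactly what NetworkTopologyStrategy adds.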

Configure datacenter-aware replication:

# Edit cassandra.yaml to enable the rack-aware snitch
sudo nano /etc/cassandra/cassandra.yaml
# endpoint_snitch: GossipingPropertyFileSnitch

# Configure datacenter and rack
sudo nano /etc/cassandra/cassandra-rackdc.properties

Add:

# US East region (dc names must match those used in NetworkTopologyStrategy)
dc=us-east-1
rack=rack1

# Or, on a node in a different datacenter/rack
dc=us-west-2
rack=rack2

Node Management

Add and remove nodes from the cluster:

# Add new node to cluster
# On new node, ensure configuration points to existing cluster via seeds
# Start cassandra - it will bootstrap and download data from existing nodes
sudo systemctl start cassandra

# Monitor bootstrap progress
nodetool netstats

# Check when bootstrap is complete
nodetool status

# Gracefully remove a node
nodetool decommission

# Force remove a dead node
nodetool removenode <node-uuid>

Handle node failures:

# Replace a failed node (the replacement streams the dead node's data)
# The replacement may use a new IP; replace_address names the dead node
sudo nano /etc/cassandra/cassandra-env.sh

# Add to JVM_OPTS, using the dead node's IP address
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead-node-ip>"

# Start cassandra - it will take data from existing nodes

Repair data consistency:

# Run repair on a node (recommended weekly)
nodetool repair

# Repair specific keyspace
nodetool repair myapp

# Repair only this node's primary token ranges (run on each node in turn)
nodetool repair -pr myapp

# List active repair sessions (Cassandra 4.x)
nodetool repair_admin list

Monitoring with nodetool

Monitor cluster health and performance using nodetool:

# Overall cluster status
nodetool status

# Detailed cluster information
nodetool describecluster

# Node information
nodetool info

# Ring information (token distribution)
nodetool ring

# Gossip state
nodetool gossipinfo

# Pending compactions and streams
nodetool netstats

# Tpstats (thread pool stats)
nodetool tpstats

# GC stats
nodetool gcstats

# Disk usage per keyspace and table
nodetool tablestats myapp

# Snapshot creation for backup (tag option before the keyspace)
nodetool snapshot -t backup_$(date +%Y%m%d) myapp

# List snapshots
nodetool listsnapshots

# Clear all snapshots
nodetool clearsnapshot --all

# Flush memtables
nodetool flush myapp

# Compact sstables
nodetool compact

Monitor performance metrics:

# Check compaction throughput
nodetool compactionhistory

# Check native transport (CQL) status
nodetool statusbinary

# Check schema version
nodetool describecluster | grep "Schema"

# Find the replica nodes for a given partition key
nodetool getendpoints myapp events 123

Backup and Recovery

Implement backup procedures:

# Create snapshot
nodetool snapshot -t myapp_$(date +%Y%m%d_%H%M%S) myapp

# Backup snapshot files
tar -czf /backup/cassandra_$(date +%Y%m%d).tar.gz \
  /var/lib/cassandra/data/myapp/*/snapshots/

# Backup schema
cqlsh -e "DESCRIBE KEYSPACE myapp;" > /backup/schema_$(date +%Y%m%d).cql

#!/bin/bash
# Full cluster backup script (assumes an SSH-capable admin account on each
# node; the cassandra system user created earlier has no login shell)
BACKUP_DIR="/backup/cassandra-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  ssh admin@$node "nodetool snapshot -t daily myapp"
  mkdir -p "$BACKUP_DIR/node_$node"
  scp -r "admin@$node:/var/lib/cassandra/data/myapp/*/snapshots/daily" \
    "$BACKUP_DIR/node_$node/"
done

# Backup commit logs and hints for point-in-time recovery
tar -czf "$BACKUP_DIR/commit_logs.tar.gz" /var/lib/cassandra/commitlog/
tar -czf "$BACKUP_DIR/hints.tar.gz" /var/lib/cassandra/hints/

Restore from backup:

# Stop Cassandra
sudo systemctl stop cassandra

# Remove current data
sudo rm -rf /var/lib/cassandra/data/*

# Start Cassandra so the schema can be recreated
sudo systemctl start cassandra

# Restore schema (cqlsh requires a running node)
cqlsh -f /backup/schema_20240101.cql

# Unpack the snapshot, then copy each table's snapshot files into the
# matching /var/lib/cassandra/data/myapp/<table>/ directory
mkdir -p /tmp/restore
tar -xzf /backup/cassandra_20240101.tar.gz -C /tmp/restore/

# Load the restored SSTables without a restart
nodetool refresh myapp events

# Verify data
cqlsh -e "SELECT COUNT(*) FROM myapp.events;"

Conclusion

Cassandra provides a horizontally scalable, fault-tolerant database for applications requiring extreme availability and performance. Its peer-to-peer architecture eliminates single points of failure while consistent hashing ensures data is distributed efficiently across nodes. By understanding replication strategies, consistency levels, and node management procedures, you can build production Cassandra clusters that handle massive data volumes reliably. Proper configuration of snitch strategies, replication factors, and repair procedures maintains data consistency across the distributed system while delivering the high availability your applications demand.