Cassandra Installation and Configuration
Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling massive amounts of structured data across multiple servers with high availability and no single point of failure. It uses a peer-to-peer architecture with automatic partitioning, replication, and consistent hashing for data distribution. This comprehensive guide covers installation, cluster configuration, data modeling, consistency management, and operational procedures for production Cassandra deployments.
Table of Contents
- Architecture and Concepts
- Java Requirements
- Installation
- Cassandra Configuration
- Cluster Bootstrap
- Keyspaces and Tables
- Data Consistency
- Replication Strategies
- Node Management
- Monitoring with nodetool
- Backup and Recovery
- Conclusion
Architecture and Concepts
Cassandra is a distributed system where each node is independent and stores a portion of data determined by consistent hashing. Data is automatically replicated across multiple nodes based on replication factor settings, ensuring availability when nodes fail. The ring topology means nodes arrange logically in a circle, with each node owning tokens for a range of data values.
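The token-ring ownership described above can be sketched in a few lines of Python. This is an illustration only: the node names and token values are made up, and MD5 stands in for Cassandra's actual Murmur3 partitioner.

```python
import bisect
import hashlib

def token_for(key: str) -> int:
    """Stand-in hash; Cassandra actually uses the Murmur3Partitioner."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 64)

class Ring:
    """Minimal token ring: each node owns the range ending at its token."""
    def __init__(self, nodes_with_tokens):
        # sorted list of (token, node) pairs
        self.entries = sorted(nodes_with_tokens)
        self.tokens = [t for t, _ in self.entries]

    def owner(self, key: str) -> str:
        t = token_for(key)
        # first node whose token is >= the key's token, wrapping around the ring
        i = bisect.bisect_left(self.tokens, t) % len(self.entries)
        return self.entries[i][1]

ring = Ring([(2 ** 62, "node1"), (2 ** 63, "node2"), (3 * 2 ** 62, "node3")])
print(ring.owner("user:123"))  # same key always maps to the same node
```

Because ownership is a pure function of the key's hash, any node can route a request to the right replica without a central coordinator.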
Cassandra uses eventual consistency with configurable read and write consistency levels, allowing tuning of the consistency-availability-partition tolerance tradeoff. A quorum-based consistency level provides strong consistency guarantees by requiring a majority of replicas to acknowledge reads and writes. Tunable consistency enables applications to balance consistency needs with latency requirements.
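The arithmetic behind the quorum guarantee is compact: a read is guaranteed to see the latest write whenever the read and write replica sets must overlap, i.e. R + W > RF. A minimal sketch:

```python
def quorum(rf: int) -> int:
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1

def is_strongly_consistent(rf: int, read_replicas: int, write_replicas: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))  # True: QUORUM reads + QUORUM writes
print(is_strongly_consistent(rf, 1, 1))                    # False: ONE + ONE can miss the latest write
```

This is why QUORUM/QUORUM is the usual choice for strong consistency with RF=3: it tolerates one replica being down on both reads and writes.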
Java Requirements
Cassandra runs on the Java Virtual Machine. Install a compatible Java version:
# Ubuntu/Debian - Install OpenJDK 11 or 17
sudo apt-get update
sudo apt-get install -y openjdk-17-jdk-headless
# RHEL/CentOS - Install OpenJDK
sudo dnf install -y java-17-openjdk-headless
# Verify Java installation
java -version
javac -version
Configure Java environment variables:
# Set JAVA_HOME
echo "export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64" >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# Verify JAVA_HOME
echo $JAVA_HOME
java -version
Configure system limits for Cassandra:
sudo nano /etc/security/limits.conf
Add these limits:
cassandra soft nofile 100000
cassandra hard nofile 100000
cassandra soft nproc 32768
cassandra hard nproc 32768
cassandra soft as unlimited
cassandra hard as unlimited
cassandra soft memlock unlimited
cassandra hard memlock unlimited
Verify limits:
ulimit -n
ulimit -a
Installation
Install Cassandra from the official repository:
# Add Cassandra repository (Ubuntu/Debian)
curl https://www.apache.org/dist/cassandra/KEYS.asc | sudo apt-key add -
echo "deb https://debian.cassandra.apache.org 41x main" | \
sudo tee /etc/apt/sources.list.d/cassandra.sources.list
# Update and install
sudo apt-get update
sudo apt-get install -y cassandra
# Or install from tarball (CentOS/RHEL)
cd /opt
sudo wget https://archive.apache.org/dist/cassandra/4.1.0/apache-cassandra-4.1.0-bin.tar.gz
sudo tar -xzf apache-cassandra-4.1.0-bin.tar.gz
sudo ln -s apache-cassandra-4.1.0 cassandra
Verify installation:
cassandra -version
nodetool --version
cqlsh --version
Create Cassandra user and directories:
# Create system user
sudo useradd -r -s /bin/false cassandra 2>/dev/null || true
# Create data directories
sudo mkdir -p /var/lib/cassandra/data
sudo mkdir -p /var/lib/cassandra/commitlog
sudo mkdir -p /var/lib/cassandra/hints
sudo mkdir -p /var/lib/cassandra/saved_caches
sudo mkdir -p /var/log/cassandra
# Set ownership
sudo chown -R cassandra:cassandra /var/lib/cassandra /var/log/cassandra
sudo chmod 755 /var/lib/cassandra /var/log/cassandra
Create systemd service file:
sudo nano /etc/systemd/system/cassandra.service
Add this configuration:
[Unit]
Description=Apache Cassandra
After=network.target
[Service]
Type=simple
User=cassandra
Group=cassandra
Environment=CASSANDRA_CONF=/etc/cassandra
Environment=CASSANDRA_HOME=/usr/share/cassandra
ExecStart=/usr/sbin/cassandra -f -p /var/run/cassandra.pid
StandardOutput=journal
StandardError=journal
# Restart settings
Restart=always
RestartSec=5
# Timeout settings
TimeoutStopSec=300
[Install]
WantedBy=multi-user.target
Enable and start Cassandra:
sudo systemctl daemon-reload
sudo systemctl enable cassandra
sudo systemctl start cassandra
# Monitor startup
sudo journalctl -u cassandra -f
# Check service status
sudo systemctl status cassandra
Cassandra Configuration
Configure cassandra.yaml for a multi-node cluster. Edit the configuration file:
sudo nano /etc/cassandra/cassandra.yaml
For node1 (192.168.1.10), configure these essential settings:
# Cluster name (must be identical across all nodes)
cluster_name: 'MyCluster'
# Seed node list (must be the same on every node; include at least two nodes)
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "192.168.1.10,192.168.1.11"
# This node's IP
listen_address: 192.168.1.10
rpc_address: 192.168.1.10
# Broadcast addresses for cluster communication
broadcast_rpc_address: 192.168.1.10
# Snitch for topology awareness
endpoint_snitch: SimpleSnitch
# For multi-region, use GossipingPropertyFileSnitch
# endpoint_snitch: GossipingPropertyFileSnitch
# Storage and performance settings
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
- /var/lib/cassandra/data
saved_caches_directory: /var/lib/cassandra/saved_caches
hints_directory: /var/lib/cassandra/hints
# Disk access mode
disk_access_mode: auto
# Memory settings: heap size is configured in cassandra-env.sh, not in cassandra.yaml
# (see MAX_HEAP_SIZE and HEAP_NEWSIZE below)
# Authentication
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
# Enable native transport (CQL protocol)
start_native_transport: true
native_transport_port: 9042
# Replication settings
num_tokens: 256
# initial_token: leave unset when num_tokens (vnodes) is in use
# Commit log settings
commitlog_compression:
  - class_name: org.apache.cassandra.io.compress.LZ4Compressor
# Partitioner
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
For nodes 2 and 3, modify these settings with the appropriate node IPs:
# Node 2 configuration
listen_address: 192.168.1.11
rpc_address: 192.168.1.11
broadcast_rpc_address: 192.168.1.11
# Node 3 configuration
listen_address: 192.168.1.12
rpc_address: 192.168.1.12
broadcast_rpc_address: 192.168.1.12
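Before starting the nodes, it can help to cross-check the per-node settings above against each other. The following Python sketch is a hypothetical helper (configs are modeled as plain dicts rather than parsed YAML): it verifies that cluster_name matches everywhere, listen addresses are unique, and every seed is a known node.

```python
def validate_cluster_configs(configs):
    """Cross-check per-node cassandra.yaml settings before starting the cluster.

    `configs` is a list of dicts with keys: cluster_name, listen_address, seeds.
    Returns a list of problem descriptions (an empty list means the set looks sane).
    """
    problems = []
    names = {c["cluster_name"] for c in configs}
    if len(names) != 1:
        problems.append(f"cluster_name differs across nodes: {sorted(names)}")
    addrs = [c["listen_address"] for c in configs]
    if len(addrs) != len(set(addrs)):
        problems.append("duplicate listen_address values")
    all_addrs = set(addrs)
    for c in configs:
        unknown = set(c["seeds"]) - all_addrs
        if unknown:
            problems.append(f"{c['listen_address']}: seeds not in cluster: {sorted(unknown)}")
    return problems

nodes = [
    {"cluster_name": "MyCluster", "listen_address": "192.168.1.10", "seeds": ["192.168.1.10", "192.168.1.11"]},
    {"cluster_name": "MyCluster", "listen_address": "192.168.1.11", "seeds": ["192.168.1.10", "192.168.1.11"]},
    {"cluster_name": "MyCluster", "listen_address": "192.168.1.12", "seeds": ["192.168.1.10", "192.168.1.11"]},
]
print(validate_cluster_configs(nodes))  # [] - no problems found
```

A mismatched cluster_name is a common cause of a new node refusing to join, so catching it before startup saves a bootstrap cycle.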
Configure cassandra-env.sh for memory:
sudo nano /etc/cassandra/cassandra-env.sh
Uncomment and set:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1G"
Restart Cassandra to apply the configuration:
sudo systemctl restart cassandra
# Monitor logs
sudo journalctl -u cassandra -f
Cluster Bootstrap
Bootstrap the first node and allow the others to join:
# Verify the first node is up
nodetool status
# Should show a single node as UN (up, normal)
# UN = Up and Normal
# UL = Up and Leaving
# DL = Down and Leaving
Start the remaining nodes once the first node is stable:
# On node2 and node3
sudo systemctl start cassandra
# Monitor bootstrap via nodetool
nodetool status
# Watch logs
sudo journalctl -u cassandra -f | grep -i bootstrap
Verify cluster formation:
# Check peer information
nodetool describecluster
# Should show all three nodes agreeing on a single schema version
Wait for gossip stabilization:
# Monitor gossip
watch -n 5 'nodetool status'
# All nodes should show "UN" (Up Normal) status
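Waiting for stabilization can also be automated. Below is a hedged Python sketch that decides, from captured `nodetool status` output, whether every node reports UN; the sample output is illustrative and the exact column layout can vary between Cassandra versions.

```python
def all_nodes_up_normal(status_output: str) -> bool:
    """Return True when every node line in `nodetool status` output reports UN.

    Node lines start with a two-letter status/state code:
    U/D (Up/Down) followed by N/L/J/M (Normal/Leaving/Joining/Moving).
    """
    codes = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in {"UN", "UL", "UJ", "UM", "DN", "DL", "DJ", "DM"}:
            codes.append(parts[0])
    return bool(codes) and all(c == "UN" for c in codes)

sample = """\
Datacenter: datacenter1
=======================
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID  Rack
UN  192.168.1.10  1.2 MiB    256     33.3%  aaaa     rack1
UN  192.168.1.11  1.1 MiB    256     33.4%  bbbb     rack1
UJ  192.168.1.12  0.4 MiB    256     33.3%  cccc     rack1
"""
print(all_nodes_up_normal(sample))  # False - one node is still joining (UJ)
```

Such a check can gate deployment scripts so that schema changes or traffic cutover only proceed once the whole ring is UN.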
Keyspaces and Tables
Create a keyspace with a replication configuration:
# Connect with cqlsh
cqlsh 192.168.1.10
-- Create keyspace with replication
CREATE KEYSPACE IF NOT EXISTS myapp
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
-- Or use NetworkTopologyStrategy for multi-region
CREATE KEYSPACE IF NOT EXISTS myapp_distributed
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'us-west': 2
};
-- Switch to keyspace
USE myapp;
-- Create table for events
CREATE TABLE IF NOT EXISTS events (
event_id UUID,
event_time TIMESTAMP,
user_id BIGINT,
event_type TEXT,
event_data TEXT,
PRIMARY KEY ((user_id), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND compression = {'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND gc_grace_seconds = 864000;
-- Create table for user profiles
CREATE TABLE IF NOT EXISTS user_profiles (
user_id BIGINT PRIMARY KEY,
username TEXT,
email TEXT,
signup_date TIMESTAMP,
updated_at TIMESTAMP
) WITH compression = {'class': 'org.apache.cassandra.io.compress.LZ4Compressor'};
-- Create materialized view for lookups by email
CREATE MATERIALIZED VIEW user_by_email AS
SELECT user_id, username, email, signup_date
FROM user_profiles
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);
-- Create table for metrics with TTL
CREATE TABLE IF NOT EXISTS metrics (
metric_name TEXT,
metric_time TIMESTAMP,
host_id TEXT,
metric_value DOUBLE,
PRIMARY KEY ((metric_name, host_id), metric_time)
) WITH CLUSTERING ORDER BY (metric_time DESC)
AND default_time_to_live = 2592000;
-- Create a secondary index
CREATE INDEX ON metrics(metric_name);
-- Show tables
DESCRIBE TABLES;
DESCRIBE KEYSPACE myapp;
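One caveat on the events table above: because user_id alone is the partition key, all of a user's events accumulate in a single partition that grows without bound. A common refinement (not part of the schema above, shown here as an assumption) is to add a time bucket to the partition key, e.g. PRIMARY KEY ((user_id, day), event_time, event_id). A sketch of the bucketing function:

```python
from datetime import datetime, timezone

def partition_key(user_id: int, event_time: datetime) -> tuple:
    """Bucket a user's events by UTC day so no single partition grows unbounded.

    The corresponding (hypothetical) table would use
    PRIMARY KEY ((user_id, day), event_time, event_id).
    """
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (user_id, day)

ts = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
print(partition_key(123, ts))  # (123, '2024-01-15')
```

The tradeoff is that reading a user's full history now requires querying one partition per day, so the bucket size should match the dominant query window.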
Data Consistency
Configure consistency levels for reads and writes:
-- Insert data (consistency is chosen per session or per statement, not in the CQL text)
INSERT INTO events (event_id, event_time, user_id, event_type, event_data)
VALUES (uuid(), toTimestamp(now()), 123, 'login', 'User logged in')
USING TIMESTAMP 1640000000000;
-- Insert with TTL (time to live)
INSERT INTO metrics (metric_name, metric_time, host_id, metric_value)
VALUES ('cpu_usage', toTimestamp(now()), 'host-01', 45.5)
USING TTL 86400;
-- Read by partition key (no ALLOW FILTERING needed for a partition-key lookup)
SELECT * FROM events WHERE user_id = 123;
-- Update data
UPDATE user_profiles
SET username = 'alice', updated_at = toTimestamp(now())
WHERE user_id = 123;
-- Delete a range of rows (the full partition key must be specified)
DELETE FROM metrics WHERE metric_name = 'old_metric' AND host_id = 'host-01' AND metric_time < '2024-01-01';
-- Batch multiple operations
BEGIN BATCH
INSERT INTO events (event_id, event_time, user_id, event_type)
VALUES (uuid(), toTimestamp(now()), 123, 'action1');
INSERT INTO events (event_id, event_time, user_id, event_type)
VALUES (uuid(), toTimestamp(now()), 123, 'action2');
APPLY BATCH;
Set consistency level in cqlsh:
cqlsh --consistency ONE
# or
cqlsh --consistency QUORUM
# or
cqlsh --consistency ALL
Configure consistency in application code (Python driver):
# In Python (cassandra-driver package)
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
cluster = Cluster(['192.168.1.10', '192.168.1.11', '192.168.1.12'])
session = cluster.connect()
# Set the default consistency level for the session
session.default_consistency_level = ConsistencyLevel.QUORUM
# Or per query
query = SimpleStatement(
    "SELECT * FROM myapp.events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
rows = session.execute(query, (123,))
Replication Strategies
Understand replication strategies for fault tolerance:
-- SimpleStrategy (single datacenter)
-- Replicates to consecutive nodes on the ring
CREATE KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- NetworkTopologyStrategy (multiple datacenters)
-- Controls replicas per datacenter
CREATE KEYSPACE myapp_global
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-2': 3,
'eu-central-1': 2
};
-- Alter keyspace replication
ALTER KEYSPACE myapp
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
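SimpleStrategy's placement rule, replicate to the next RF distinct nodes clockwise on the ring, can be sketched directly. The tokens and node names below are made up, and real clusters use vnodes with many tokens per node; this is an illustration of the rule, not the production implementation.

```python
import bisect

def simple_strategy_replicas(ring, key_token, rf):
    """Walk the ring clockwise from the key's token, collecting rf distinct nodes.

    `ring` is a sorted list of (token, node) pairs.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    distinct_nodes = len({n for _, n in ring})
    replicas = []
    i = start
    while len(replicas) < min(rf, distinct_nodes):
        node = ring[i][1]
        if node not in replicas:  # skip vnodes belonging to an already-chosen node
            replicas.append(node)
        i = (i + 1) % len(ring)
    return replicas

ring = [(10, "node1"), (20, "node2"), (30, "node3"), (40, "node1")]
print(simple_strategy_replicas(ring, 15, 3))  # ['node2', 'node3', 'node1']
```

Because SimpleStrategy ignores datacenters and racks entirely, all replicas can end up in one failure domain; that is exactly the gap NetworkTopologyStrategy closes.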
Configure datacenter-aware replication:
# Edit cassandra.yaml with a rack-aware snitch
sudo nano /etc/cassandra/cassandra.yaml
# endpoint_snitch: GossipingPropertyFileSnitch
# Configure datacenter and rack
sudo nano /etc/cassandra/cassandra-rackdc.properties
Add:
# US East region: dc names the datacenter, rack typically names the availability zone
dc=us-east-1
rack=us-east-1a
# Or for a second datacenter
dc=us-west-2
rack=us-west-2a
Node Management
Add and remove nodes from the cluster:
# Add a new node to the cluster
# On the new node, make sure the configuration points to the existing cluster via its seeds
# Start cassandra - it will bootstrap and stream data from the existing nodes
sudo systemctl start cassandra
# Monitor bootstrap progress
nodetool netstats
# Check when bootstrap is complete
nodetool status
# Gracefully remove a node (run on the node being removed)
nodetool decommission
# Force-remove a dead node
nodetool removenode <host-id>
Handle node failures:
# Replace a failed node (the replacement takes over the dead node's tokens)
# Configure the new node like the failed one, then add
# -Dcassandra.replace_address=<dead-node-ip> to the JVM options:
sudo nano /etc/cassandra/cassandra-env.sh
# Append to JVM_OPTS (use the failed node's IP)
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.10"
# Start cassandra - it will stream data from the existing nodes
Repair data consistency:
# Run repair on a node (recommended weekly)
nodetool repair
# Repair specific keyspace
nodetool repair myapp
# Repair only the primary token ranges (run on each node in turn)
nodetool repair -pr myapp
# Check active repair sessions (Cassandra 4.x)
nodetool repair_admin list
Monitoring with nodetool
Monitor cluster health and performance using nodetool:
# Overall cluster status
nodetool status
# Detailed cluster information
nodetool describecluster
# Node information
nodetool info
# Ring information (token distribution)
nodetool ring
# Gossip state
nodetool gossipinfo
# Pending compactions and streams
nodetool netstats
# Tpstats (thread pool stats)
nodetool tpstats
# GC stats
nodetool gcstats
# Per-table disk usage and statistics
nodetool tablestats myapp
# Snapshot creation for backup
nodetool snapshot myapp -t backup_$(date +%Y%m%d)
# List snapshots
nodetool listsnapshots
# Clear all snapshots
nodetool clearsnapshot --all
# Flush memtables
nodetool flush myapp
# Compact sstables
nodetool compact
Monitor performance metrics:
# Check compaction throughput
nodetool compactionhistory
# Check whether the native transport (CQL port) is up
nodetool statusbinary
# Check schema version
nodetool describecluster | grep "Schema"
# Find the replica nodes for a given partition key
nodetool getendpoints myapp events 123
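Output from commands like `nodetool tpstats` is easy to post-process for alerting. A hedged Python sketch that flags thread pools with pending work follows; the sample output is abbreviated and illustrative, and the column order (Pool Name, Active, Pending, ...) is assumed.

```python
def pools_with_pending(tpstats_output: str, threshold: int = 0):
    """Return pools whose Pending column exceeds `threshold` in `nodetool tpstats` output.

    Assumes the usual column order: Pool Name, Active, Pending, Completed, ...
    """
    flagged = {}
    for line in tpstats_output.splitlines():
        parts = line.split()
        # pool lines have numeric Active/Pending columns; headers do not match
        if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            pool, pending = parts[0], int(parts[2])
            if pending > threshold:
                flagged[pool] = pending
    return flagged

sample = """\
Pool Name                    Active   Pending      Completed   Blocked
MutationStage                     0         0        1250341         0
ReadStage                         2        15         904112         0
CompactionExecutor                1         4          88210         0
"""
print(pools_with_pending(sample))  # {'ReadStage': 15, 'CompactionExecutor': 4}
```

Sustained pending counts in ReadStage or MutationStage usually mean the node is saturated; a growing CompactionExecutor backlog points at disk or compaction-throughput limits.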
Backup and Recovery
Implement backup procedures:
# Create snapshot
nodetool snapshot myapp -t myapp_$(date +%Y%m%d_%H%M%S)
# Back up snapshot files
tar -czf /backup/cassandra_$(date +%Y%m%d).tar.gz \
/var/lib/cassandra/data/myapp/*/snapshots/
# Back up the schema
cqlsh -e "DESCRIBE KEYSPACE myapp;" > /backup/schema_$(date +%Y%m%d).cql
# Full cluster backup script
#!/bin/bash
BACKUP_DIR="/backup/cassandra-$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
ssh cassandra@$node "nodetool snapshot myapp -t daily"
scp -r cassandra@$node:/var/lib/cassandra/data $BACKUP_DIR/node_$node/
done
# Back up commit logs and hints for point-in-time recovery
tar -czf $BACKUP_DIR/commit_logs.tar.gz /var/lib/cassandra/commitlog/
tar -czf $BACKUP_DIR/hints.tar.gz /var/lib/cassandra/hints/
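A retention policy keeps the backup directory from growing forever. The sketch below selects directories named like the script above (`cassandra-YYYYMMDD`, an assumed naming convention) that fall outside a keep window; the actual deletion is left to the operator.

```python
from datetime import date, timedelta

def backups_to_prune(dir_names, today: date, keep_days: int = 7):
    """Select backup directories named 'cassandra-YYYYMMDD' older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in dir_names:
        try:
            stamp = name.rsplit("-", 1)[1]
            d = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
        except (IndexError, ValueError):
            continue  # ignore names that don't match the pattern
        if d < cutoff:
            stale.append(name)
    return stale

dirs = ["cassandra-20240101", "cassandra-20240110", "notes.txt"]
print(backups_to_prune(dirs, today=date(2024, 1, 12), keep_days=7))  # ['cassandra-20240101']
```

Snapshots on the nodes themselves should be pruned too (`nodetool clearsnapshot`), since they hard-link SSTables and hold disk space until removed.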
Restore from a backup:
# Stop Cassandra
sudo systemctl stop cassandra
# Remove current data
sudo rm -rf /var/lib/cassandra/data/*
# Restore snapshot
tar -xzf /backup/cassandra_20240101.tar.gz -C /var/lib/cassandra/data/
# Restore schema
cqlsh -f /backup/schema_20240101.cql
# Start Cassandra
sudo systemctl start cassandra
# Verify data
cqlsh -e "SELECT COUNT(*) FROM myapp.events;"
Conclusion
Cassandra provides a horizontally scalable, fault-tolerant database for applications requiring extreme availability and performance. Its peer-to-peer architecture eliminates single points of failure while consistent hashing ensures data is distributed efficiently across nodes. By understanding replication strategies, consistency levels, and node management procedures, you can build production Cassandra clusters that handle massive data volumes reliably. Proper configuration of snitch strategies, replication factors, and repair procedures maintains data consistency across the distributed system while delivering the high availability your applications demand.