ScyllaDB Installation for High-Performance NoSQL

ScyllaDB is a Cassandra-compatible NoSQL database rebuilt in C++ with a shard-per-core architecture that eliminates GC pauses and delivers consistent low-latency performance at high throughput. This guide covers installing ScyllaDB on Linux, setting up a cluster, using the CQL interface, monitoring, and migrating from Apache Cassandra.

Prerequisites

  • Ubuntu 20.04/22.04 or CentOS 8/Rocky Linux 8+
  • Minimum 8 GB RAM (16+ GB recommended)
  • Multi-core CPU (ScyllaDB scales linearly with cores)
  • SSD storage required (NVMe preferred)
  • For cluster: 3+ nodes with low-latency network
  • Root or sudo access

Install ScyllaDB

Ubuntu/Debian:

# Add ScyllaDB repository
sudo apt install -y curl gnupg
curl -sSL https://downloads.scylladb.com/downloads/scylla/ubuntu/scylla.key | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/scylla.gpg

# For the latest stable release (check https://www.scylladb.com/download/ for current version)
SCYLLA_VERSION="6.0"
# "focal" is Ubuntu 20.04; substitute "jammy" for 22.04 (or use $(lsb_release -cs))
echo "deb [signed-by=/etc/apt/trusted.gpg.d/scylla.gpg] https://downloads.scylladb.com/downloads/scylla/ubuntu scylladb-${SCYLLA_VERSION}/ubuntu focal main" \
  | sudo tee /etc/apt/sources.list.d/scylla.list

sudo apt update
sudo apt install -y scylla

# Run ScyllaDB setup script (IMPORTANT - optimizes the system)
sudo scylla_setup

# During setup, it will ask about:
# - Developer mode (say No for production)
# - NTP configuration
# - RAID setup (if using multiple disks)
# - Disk/CPU optimization

sudo systemctl enable scylla-server
sudo systemctl start scylla-server

CentOS/Rocky Linux:

# Verify the key and repo URLs for your release at https://www.scylladb.com/download/
sudo rpm --import https://downloads.scylladb.com/downloads/scylla/rpm/unstable/centos/scylladb.key

# Writing under /etc requires root, so use sudo tee rather than a bare redirect
sudo tee /etc/yum.repos.d/scylla.repo > /dev/null << 'EOF'
[ScyllaDB]
name=ScyllaDB
baseurl=https://downloads.scylladb.com/downloads/scylla/rpm/centos/scylladb-6.0/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://downloads.scylladb.com/downloads/scylla/rpm/unstable/centos/scylladb.key
EOF

sudo dnf install -y scylla
sudo scylla_setup
sudo systemctl enable --now scylla-server

Verify the installation:

# Watch startup logs (startup takes 30-60 seconds; Ctrl+C to stop following)
sudo journalctl -u scylla-server -f

# Check node status
nodetool status
# Should show: UN (Up Normal) for the local node

# Connect with cqlsh
cqlsh localhost 9042

Initial Configuration and Tuning

ScyllaDB's main configuration is in /etc/scylla/scylla.yaml:

sudo nano /etc/scylla/scylla.yaml

Key settings:

# /etc/scylla/scylla.yaml

# Cluster name (must match across all nodes in the cluster)
cluster_name: 'MyScyllaCluster'

# Listen address (this node's IP)
listen_address: 192.168.1.10

# CQL bind address (0.0.0.0 = listen on all interfaces)
rpc_address: 0.0.0.0
# Address advertised to clients (required when rpc_address is 0.0.0.0)
broadcast_rpc_address: 192.168.1.10

# Seed nodes (IPs of seed nodes for cluster discovery)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.10,192.168.1.11"

# Data storage
data_file_directories:
  - /var/lib/scylla/data

commitlog_directory: /var/lib/scylla/commitlog

# Authentication
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

# Snitch for datacenter/rack topology
endpoint_snitch: GossipingPropertyFileSnitch
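These settings are easy to get subtly wrong: a listen_address left at 127.0.0.1, or an rpc_address of 0.0.0.0 without a broadcast_rpc_address, will break clustering or client access. As a rough pre-flight check, here is a minimal sketch that scans only flat top-level `key: value` lines of scylla.yaml; it deliberately ignores nested YAML such as seed_provider, and the checks themselves are illustrative, not exhaustive:

```python
import ipaddress

def check_scylla_config(text: str) -> list[str]:
    """Flag common misconfigurations in top-level scylla.yaml keys.

    Naive line-based parse: handles only flat 'key: value' lines,
    not nested structures like seed_provider.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if ":" in line and not line.startswith("-"):
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip().strip("'\"")

    problems = []
    listen = settings.get("listen_address", "")
    if listen in ("", "127.0.0.1", "localhost"):
        problems.append("listen_address must be this node's routable IP")
    try:
        ipaddress.ip_address(listen)
    except ValueError:
        problems.append(f"listen_address is not a valid IP: {listen!r}")
    if settings.get("rpc_address") == "0.0.0.0" and "broadcast_rpc_address" not in settings:
        problems.append("rpc_address 0.0.0.0 requires broadcast_rpc_address")
    if not settings.get("cluster_name"):
        problems.append("cluster_name must be set (and match on every node)")
    return problems
```

Run it against each node's scylla.yaml before first start; an empty list means none of these particular checks fired.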

Run ScyllaDB tuning tools:

# System tuning (run once per node; scylla_setup invokes these for you)
sudo scylla_io_setup    # Benchmarks the disks and records I/O properties
sudo scylla_cpuset_conf # Configures which CPUs ScyllaDB runs on
sudo scylla_ntp_setup   # Configures NTP time sync

# For test machines only: enable developer mode, which skips the tuning checks
sudo scylla_dev_mode_setup --developer-mode 1

Cluster Setup

# Node 1 (192.168.1.10) - first seed node
# scylla.yaml already configured above

# Node 2 (192.168.1.11)
sudo nano /etc/scylla/scylla.yaml
# Set:
# listen_address: 192.168.1.11
# broadcast_rpc_address: 192.168.1.11
# seeds: "192.168.1.10,192.168.1.11"  (same seed list)

# Node 3 (192.168.1.12)
sudo nano /etc/scylla/scylla.yaml
# Set:
# listen_address: 192.168.1.12
# broadcast_rpc_address: 192.168.1.12
# seeds: "192.168.1.10,192.168.1.11"

# Start nodes one at a time, starting with the seeds
sudo systemctl start scylla-server  # Start on node 1 first

# Wait for node 1 to be UP, then start node 2
nodetool status  # Check from node 1
sudo systemctl start scylla-server  # On node 2

# Then node 3
sudo systemctl start scylla-server  # On node 3

# Verify cluster from any node
nodetool status
# Expected output:
# Datacenter: datacenter1
# Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
# UN  192.168.1.10  ...  
# UN  192.168.1.11  ...
# UN  192.168.1.12  ...

# Check token distribution
nodetool ring
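Each node owns ranges of the Murmur3 token ring, which spans -2^63 to 2^63-1. For intuition about what `nodetool ring` shows, evenly spaced tokens for an N-node ring can be computed as below (a sketch for illustration only; in practice ScyllaDB assigns many vnode tokens per node automatically):

```python
def ideal_tokens(n_nodes: int) -> list[int]:
    """Evenly spaced tokens on the Murmur3 ring [-2**63, 2**63 - 1]."""
    ring_size = 2 ** 64
    return [-(2 ** 63) + i * ring_size // n_nodes for i in range(n_nodes)]

# With 3 nodes, each would own one third of the ring
for node, token in enumerate(ideal_tokens(3), start=1):
    print(f"node {node}: token {token}")
```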

Open required ports:

# ScyllaDB requires these ports
sudo ufw allow 7000/tcp   # Inter-node communication
sudo ufw allow 7001/tcp   # TLS inter-node
sudo ufw allow 9042/tcp   # CQL client port
sudo ufw allow 9160/tcp   # Thrift (legacy clients only; removed in recent releases)
sudo ufw allow 10000/tcp  # REST API
sudo ufw allow 9180/tcp   # Prometheus metrics
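To confirm the firewall rules from another host, a small TCP reachability sketch over the same port list (host and timeout are up to you; True only means the port accepts connections, not that the service behind it is healthy):

```python
import socket

# ScyllaDB's well-known ports, matching the ufw rules above
SCYLLA_PORTS = {
    7000: "inter-node communication",
    7001: "TLS inter-node",
    9042: "CQL clients",
    9160: "Thrift (legacy)",
    10000: "REST API",
    9180: "Prometheus metrics",
}

def check_ports(host: str, timeout: float = 2.0) -> dict[int, bool]:
    """Report which ScyllaDB ports accept TCP connections on `host`."""
    results = {}
    for port in SCYLLA_PORTS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

if __name__ == "__main__":
    for port, reachable in check_ports("192.168.1.10").items():
        print(f"{port} ({SCYLLA_PORTS[port]}): {'open' if reachable else 'closed'}")
```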

CQL Interface and Data Modeling

# Connect to ScyllaDB (default superuser account: cassandra / cassandra)
cqlsh 192.168.1.10 9042 -u cassandra -p cassandra

-- Change the default password immediately, then create an application user
ALTER USER cassandra WITH PASSWORD 'newstrongpassword';
CREATE USER appuser WITH PASSWORD 'apppassword' NOSUPERUSER;

-- Create a keyspace with replication
CREATE KEYSPACE IF NOT EXISTS myapp
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3   -- 3 replicas across datacenter1
};

USE myapp;

-- Create a time-series table (optimized for ScyllaDB)
CREATE TABLE IF NOT EXISTS metrics (
    host TEXT,
    ts TIMESTAMP,
    cpu_usage FLOAT,
    memory_mb INT,
    disk_io_mb FLOAT,
    PRIMARY KEY ((host), ts)    -- host is partition key, ts is clustering key
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'HOURS',
      'compaction_window_size': '1'
  };

-- Insert data
INSERT INTO metrics (host, ts, cpu_usage, memory_mb)
VALUES ('web01', toTimestamp(now()), 45.2, 2048);

-- Query with time range
SELECT * FROM metrics
WHERE host = 'web01'
  AND ts > '2024-01-01 00:00:00'
  AND ts < '2024-01-02 00:00:00'
LIMIT 100;

-- Create a wide-row table for IoT data
CREATE TABLE IF NOT EXISTS sensor_data (
    device_id UUID,
    reading_date DATE,        -- partition by date for time bucketing
    reading_time TIMESTAMP,
    value DOUBLE,
    unit TEXT,
    PRIMARY KEY ((device_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Create secondary index (use sparingly in ScyllaDB)
CREATE INDEX ON metrics (cpu_usage);

-- Materialized view for alternative access patterns
CREATE MATERIALIZED VIEW metrics_by_cpu AS
    SELECT host, ts, cpu_usage
    FROM metrics
    WHERE cpu_usage IS NOT NULL AND ts IS NOT NULL AND host IS NOT NULL
    PRIMARY KEY ((cpu_usage), ts, host)
    WITH CLUSTERING ORDER BY (ts DESC);
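The (device_id, reading_date) compound partition key in sensor_data bounds each partition to one day of readings per device. On the application side, the date bucket is just the UTC calendar date of the reading timestamp; a sketch of how a writer might derive the partition key (the helper name is illustrative, not part of any driver API):

```python
import uuid
from datetime import datetime, timezone

def partition_key_for(device_id: uuid.UUID, ts: datetime) -> tuple[uuid.UUID, str]:
    """Derive the (device_id, reading_date) partition key for sensor_data.

    reading_date is the UTC calendar date of the reading, so each
    partition holds at most one day of data for one device.
    """
    return device_id, ts.astimezone(timezone.utc).date().isoformat()

device = uuid.UUID("11111111-2222-3333-4444-555555555555")
ts = datetime(2024, 1, 15, 23, 59, 0, tzinfo=timezone.utc)
print(partition_key_for(device, ts))  # date bucket: 2024-01-15
```

Queries must then supply both partition key columns (device_id and reading_date), so a read spanning several days issues one query per day bucket.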

Shard-per-Core Architecture

ScyllaDB's shard-per-core model is its key differentiator from Cassandra:

# Each CPU core handles a dedicated shard with its own memory
# This eliminates cross-core coordination and GC pauses

# Check the shard count (one shard per CPU core ScyllaDB uses) by counting
# distinct shard labels in the Prometheus metrics
curl -s http://localhost:9180/metrics | grep -o 'shard="[0-9]*"' | sort -u | wc -l

# Each shard handles a portion of the token range
# Client drivers (like Datastax driver v4) are "shard-aware"
# and route requests directly to the correct shard

# View per-shard statistics (metrics on port 9180 carry a shard="N" label)
curl -s http://localhost:9180/metrics | grep 'shard="' | head -20

# Check CPU utilization per shard
curl -s http://localhost:9180/metrics | grep cpu_utilization | head -20

# The shard count defaults to the CPUs ScyllaDB is given; override it with
# the --smp command-line flag (SCYLLA_ARGS in /etc/default/scylla-server),
# not in scylla.yaml
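Shard-aware drivers compute the owning shard directly from a partition's Murmur3 token. The mapping described in ScyllaDB's shard-aware driver material biases the signed token into an unsigned range, then takes a "biased modulo" (multiply by the shard count, keep the high 64 bits), which spreads tokens evenly across shards. A sketch:

```python
def shard_of(token: int, nr_shards: int) -> int:
    """Map a signed 64-bit Murmur3 token to a ScyllaDB shard.

    Bias the token into [0, 2**64), multiply by the shard count,
    and keep the high 64 bits of the product.
    """
    unbiased = (token + 2 ** 63) % (2 ** 64)  # signed -> unsigned
    return (unbiased * nr_shards) >> 64

# Tokens at the two ends of the ring land on the first and last shard
print(shard_of(-2 ** 63, 8))      # 0
print(shard_of(2 ** 63 - 1, 8))   # 7
```

This is why shard-aware routing saves a hop: the client can open one connection per shard and send each request straight to the core that owns the data.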

Use a shard-aware driver in applications:

# pip install scylla-driver
# (ScyllaDB's fork of the DataStax driver; a drop-in replacement that adds
# shard-aware routing and installs under the same "cassandra" package name)
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# TokenAwarePolicy routes each request to a replica that owns the partition;
# with scylla-driver the connection also targets the owning shard
auth_provider = PlainTextAuthProvider(username='appuser', password='apppassword')
cluster = Cluster(
    contact_points=['192.168.1.10', '192.168.1.11', '192.168.1.12'],
    port=9042,
    auth_provider=auth_provider,
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='datacenter1')),
    connect_timeout=10,
    protocol_version=4
)

session = cluster.connect('myapp')
rows = session.execute("SELECT * FROM metrics WHERE host = 'web01' LIMIT 10")
for row in rows:
    print(row)
cluster.shutdown()

Monitoring ScyllaDB

# ScyllaDB exposes Prometheus metrics on port 9180
curl http://localhost:9180/metrics | grep -E "^scylla" | head -30

# Key metrics to watch:
# scylla_scheduler_runtime_ms - per-shard CPU usage
# scylla_storage_proxy_write_unavailable - write errors
# scylla_storage_proxy_read_unavailable - read errors
# scylla_io_queue_delay - disk I/O latency

# Nodetool commands for cluster health
nodetool status              # Node up/down status
nodetool tpstats             # Thread pool statistics
nodetool compactionstats     # Active compactions
nodetool tablestats myapp    # Per-table statistics
nodetool cfstats myapp.metrics  # Specific table stats

# Check read/write latency
nodetool tablehistograms myapp metrics
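The 9180 endpoint serves the standard Prometheus text exposition format, so it is easy to consume without a full Prometheus stack. A small parser sketch for pulling out scylla_* samples (simplified: it skips comments and does not handle escaped characters inside label values):

```python
def parse_prometheus(text: str, prefix: str = "scylla_"):
    """Yield (metric_name, labels, value) tuples from Prometheus text format.

    Simplified parser: ignores comment lines, assumes no escaped
    quotes in label values, and keeps only metrics starting with `prefix`.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            labels = dict(item.split("=", 1) for item in label_part.split(",") if item)
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, value_part = line.split(None, 1)
            labels = {}
        yield name, labels, float(value_part.split()[0])

sample = 'scylla_reactor_utilization{shard="0"} 42.5'
print(list(parse_prometheus(sample)))
```

In practice you would feed it the body of `curl -s http://localhost:9180/metrics` and aggregate per-shard values yourself.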

Migrate from Cassandra

# Method 1: sstableloader (for offline migration)
# Export from Cassandra
nodetool snapshot myapp
# Snapshot stored at: /var/lib/cassandra/data/myapp/<table>/snapshots/<name>/

# Load into ScyllaDB
sstableloader -d 192.168.1.10 \
  /var/lib/cassandra/data/myapp/metrics-abcdef123456/snapshots/snap1/

# Method 2: COPY command for smaller datasets
# Export from Cassandra
cqlsh cassandra-host -e "COPY myapp.metrics TO '/tmp/metrics.csv' WITH HEADER=TRUE;"

# Import into ScyllaDB
cqlsh scylla-host -e "COPY myapp.metrics FROM '/tmp/metrics.csv' WITH HEADER=TRUE;"
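Before importing, it is worth sanity-checking the exported CSV (header and row count should match what COPY TO reported). A sketch using the stdlib csv module; the column names follow the metrics table above:

```python
import csv
from pathlib import Path

def csv_summary(path: str) -> tuple[list[str], int]:
    """Return (header, data_row_count) for a COPY TO export with HEADER=TRUE."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = sum(1 for _ in reader)
    return header, rows

# Example: write a tiny export and summarize it
Path("/tmp/metrics_sample.csv").write_text(
    "host,ts,cpu_usage,memory_mb\nweb01,2024-01-01 00:00:00,45.2,2048\n"
)
print(csv_summary("/tmp/metrics_sample.csv"))
```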

# Method 3: Dual-write migration
# 1. Write to both Cassandra and ScyllaDB
# 2. Backfill historical data with sstableloader
# 3. Switch reads to ScyllaDB
# 4. Stop writing to Cassandra
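Step 1 of the dual-write approach can be as simple as a thin wrapper that mirrors every write to both clusters. A sketch with stand-in session objects (in a real application, `primary` and `secondary` would be driver sessions for Cassandra and ScyllaDB, and failed mirror writes would be queued for the backfill in step 2):

```python
class DualWriter:
    """Mirror every write to two backends during migration.

    Reads stay on the primary until the cutover (step 3); errors on
    the secondary are logged, not raised, so the app keeps working.
    """

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def execute(self, statement, params=None):
        result = self.primary.execute(statement, params)
        try:
            self.secondary.execute(statement, params)
        except Exception as exc:
            print(f"secondary write failed, needs backfill: {exc}")
        return result


class FakeSession:
    """In-memory stand-in used here instead of a real driver session."""

    def __init__(self):
        self.statements = []

    def execute(self, statement, params=None):
        self.statements.append((statement, params))
        return "ok"


cassandra, scylla = FakeSession(), FakeSession()
writer = DualWriter(cassandra, scylla)
writer.execute("INSERT INTO metrics (host) VALUES (%s)", ("web01",))
print(len(cassandra.statements), len(scylla.statements))  # 1 1
```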

Troubleshooting

Node won't join cluster (stuck in UJ, Up/Joining):

# Check logs
sudo journalctl -u scylla-server -n 100 | grep ERROR

# Verify seeds are reachable
nc -zv 192.168.1.10 7000

# Check listen_address is correct (not 127.0.0.1)
grep listen_address /etc/scylla/scylla.yaml

# Remove the stuck node from the ring (run from a healthy node, using the
# stuck node's Host ID from "nodetool status"), then wipe /var/lib/scylla
# data on the stuck node before restarting it to rejoin
nodetool removenode <host-id>

High read/write latency:

# Check if disk is the bottleneck
iostat -x 1 5

# Ensure ScyllaDB has I/O scheduler configured correctly
cat /sys/block/sda/queue/scheduler  # Should be "none" or "noop" for SSDs

# Check for compaction pressure
nodetool compactionstats

# If CPU bound: unlike Cassandra, ScyllaDB sizes read/write concurrency
# per shard automatically, so tuning Cassandra-style concurrent_reads/
# concurrent_writes is rarely useful; add cores or nodes, or reduce
# compaction pressure instead

CQL authentication error:

# If locked out of default account
sudo systemctl stop scylla-server

# Disable auth temporarily
# In scylla.yaml: authenticator: AllowAllAuthenticator
sudo systemctl start scylla-server
cqlsh localhost -e "ALTER USER cassandra WITH PASSWORD 'newpassword';"

# Re-enable auth
# authenticator: PasswordAuthenticator
sudo systemctl restart scylla-server

Conclusion

ScyllaDB's shard-per-core architecture delivers consistent low latency and high throughput that scales linearly with CPU cores, making it an excellent choice for high-performance NoSQL workloads on modern multi-core servers. Its Cassandra compatibility means existing CQL schemas and drivers work without modification. For production deployments, run the scylla_setup scripts on each node, use NVMe storage, and enable shard-aware routing in your client driver for optimal performance.