Apache Spark Installation on Linux

Apache Spark is the leading distributed data processing engine for big data workloads, enabling in-memory computation across clusters to process datasets that don't fit on a single machine at speeds far exceeding traditional MapReduce. With PySpark for Python-based data science, Spark SQL for structured queries, and built-in support for machine learning and streaming, Spark can be deployed in standalone mode on a VPS or scale to multi-node clusters on bare-metal infrastructure.

Prerequisites

  • Ubuntu 20.04+, Debian 11+, or CentOS/Rocky 8+
  • Java 11 or Java 17 (Java 8 minimum for older Spark versions)
  • Python 3.8+ for PySpark
  • Minimum 4 GB RAM (8+ GB recommended)
  • Root or sudo access
  • For multi-node: password-less SSH between nodes
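Before installing, a quick sanity check saves time. The sketch below is illustrative: it reports which required tools are present and how much RAM the machine has (assumes a Linux /proc filesystem; adjust the command list to your setup).

```shell
#!/usr/bin/env bash
# prereq_check.sh - report presence of required tools and total RAM.
for cmd in java python3 ssh curl; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "MISSING: $cmd"
  fi
done
# Total RAM in GiB from /proc/meminfo (Linux only)
awk '/^MemTotal/ {printf "ram_gib: %.1f\n", $2 / 1048576}' /proc/meminfo
```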

Installing Java and Spark

Install Java:

# Ubuntu/Debian
sudo apt update && sudo apt install -y openjdk-17-jdk

# CentOS/Rocky
sudo dnf install -y java-17-openjdk-devel

# Verify
java -version
javac -version

Download and install Spark:

# Download Spark (the -bin-hadoop3 build bundles the Hadoop client libraries and works standalone; no Hadoop cluster is required)
SPARK_VERSION="3.5.1"
HADOOP_VERSION="3"
curl -LO "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"

# Extract to /opt
sudo tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /opt/
sudo ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark

# Set ownership
sudo useradd -r -m -d /home/spark -s /bin/bash spark
sudo chown -R spark:spark /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
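It is worth verifying the tarball against the published SHA-512 before extracting. The checksum URL pattern below is an assumption (recent releases publish a `.sha512` file next to the tarball, in `sha512sum -c`-compatible format); the mechanics are demonstrated on a throwaway file.

```shell
# In practice, fetch the published checksum next to the tarball, e.g.:
#   curl -LO "https://downloads.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz.sha512"
# and run `sha512sum -c` against it. The mechanics, on a throwaway file:
printf 'demo payload\n' > /tmp/spark-demo.tgz
sha512sum /tmp/spark-demo.tgz > /tmp/spark-demo.tgz.sha512
sha512sum -c /tmp/spark-demo.tgz.sha512
```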

Configure environment variables:

sudo tee /etc/profile.d/spark.sh <<'EOF'
export SPARK_HOME=/opt/spark
# Adjust JAVA_HOME for your distro (e.g. /usr/lib/jvm/java-17-openjdk on CentOS/Rocky)
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
EOF

source /etc/profile.d/spark.sh

# Verify Spark installation
spark-submit --version

Standalone Cluster Setup

Spark's standalone cluster manager ships with the distribution, so a cluster can run without an external resource manager such as YARN (Hadoop) or Kubernetes.

Configure Spark on the master node (run these as the spark user, or with sudo):

cd /opt/spark/conf

# Copy template files
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf

# Configure spark-env.sh
cat >> spark-env.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=master-ip-or-hostname
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4           # CPU cores per worker
export SPARK_WORKER_MEMORY=8g         # RAM per worker
export SPARK_LOG_DIR=/opt/spark/logs
export SPARK_PID_DIR=/opt/spark/pids
EOF

sudo mkdir -p /opt/spark/logs /opt/spark/pids
sudo chown -R spark:spark /opt/spark/logs /opt/spark/pids

Configure spark-defaults.conf:

cat > /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.master                     spark://master-hostname:7077
spark.eventLog.enabled           true
spark.eventLog.dir               file:///opt/spark/logs/events
spark.history.fs.logDirectory    file:///opt/spark/logs/events
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2g
spark.executor.memory            4g
spark.sql.shuffle.partitions     100
spark.default.parallelism        100
EOF

sudo mkdir -p /opt/spark/logs/events
sudo chown spark:spark /opt/spark/logs/events

Add worker nodes (multi-node setup):

# Create workers file (list of worker hostnames/IPs)
cat > /opt/spark/conf/workers <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF

Start the standalone cluster:

# Start master
sudo -u spark /opt/spark/sbin/start-master.sh

# Start all workers (from master, requires password-less SSH)
sudo -u spark /opt/spark/sbin/start-workers.sh

# Or start a single worker on the current node
sudo -u spark /opt/spark/sbin/start-worker.sh spark://master-hostname:7077

# Check status
/opt/spark/sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1

Create systemd services:

sudo tee /etc/systemd/system/spark-master.service <<'EOF'
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
User=spark
Group=spark
Type=forking
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now spark-master
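A matching unit for workers follows the same pattern; create it on each worker node. The master URL is a placeholder to adjust for your cluster.

```shell
sudo tee /etc/systemd/system/spark-worker.service <<'EOF'
[Unit]
Description=Apache Spark Worker
After=network.target

[Service]
User=spark
Group=spark
Type=forking
ExecStart=/opt/spark/sbin/start-worker.sh spark://master-hostname:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now spark-worker
```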

PySpark Configuration

Install PySpark:

# Keep the PySpark version in lockstep with the installed Spark version
pip3 install "pyspark==${SPARK_VERSION}"

# Or use the PySpark bundled with Spark (globs don't expand inside shell
# assignments, so resolve the py4j zip path explicitly)
export PYTHONPATH="$SPARK_HOME/python:$(echo "$SPARK_HOME"/python/lib/py4j-*.zip):$PYTHONPATH"

Start a PySpark interactive shell:

# (--num-executors applies to YARN/Kubernetes; the standalone manager
# sizes the application via --total-executor-cores and --executor-memory)
pyspark \
  --master spark://master-hostname:7077 \
  --executor-memory 4g \
  --total-executor-cores 4

Example PySpark script:

# wordcount.py
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("WordCount") \
    .master("spark://master-hostname:7077") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Set log level
spark.sparkContext.setLogLevel("WARN")

# Read a text file (pick one source; the second assignment would
# otherwise silently overwrite the first)
# lines = spark.sparkContext.textFile("hdfs://path/to/file.txt")
lines = spark.sparkContext.textFile("file:///data/input.txt")

# Word count
word_counts = (lines
    .flatMap(lambda line: line.split())
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: x[1], ascending=False))

# Show top 20 words
for word, count in word_counts.take(20):
    print(f"{word}: {count}")

spark.stop()

Submitting Spark Jobs

# Submit a Python script (--num-executors is YARN/Kubernetes only;
# standalone caps the application with --total-executor-cores)
spark-submit \
  --master spark://master-hostname:7077 \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 2 \
  --total-executor-cores 8 \
  wordcount.py

# Submit with additional JARs (for database connectors etc.)
spark-submit \
  --master spark://master-hostname:7077 \
  --jars /opt/jars/postgresql-42.7.3.jar \
  --packages org.apache.spark:spark-avro_2.12:3.5.1 \
  my_etl_job.py

# Run in cluster mode (driver runs on a worker; the standalone manager
# supports cluster deploy mode for JAR applications only, not Python)
spark-submit \
  --master spark://master-hostname:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my_app.jar

# Run locally (no cluster needed)
spark-submit --master local[4] my_script.py

Spark SQL Usage

# spark_sql_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc

spark = SparkSession.builder \
    .appName("SparkSQL") \
    .config("spark.sql.shuffle.partitions", "50") \
    .getOrCreate()

# Read a CSV file
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("file:///data/sales.csv")

# Show schema
df.printSchema()
df.show(5)

# DataFrame API
result = df.groupBy("product") \
    .agg(
        count("*").alias("total_sales"),
        avg("amount").alias("avg_amount")
    ) \
    .orderBy(desc("total_sales"))

result.show()

# SQL API
df.createOrReplaceTempView("sales")
result = spark.sql("""
    SELECT product,
           COUNT(*) as total_sales,
           AVG(amount) as avg_amount
    FROM sales
    GROUP BY product
    ORDER BY total_sales DESC
    LIMIT 10
""")
result.show()

# Write output
result.write \
    .mode("overwrite") \
    .parquet("file:///data/output/top_products.parquet")

spark.stop()

Resource Management

# Dynamic allocation (auto-scales executors based on workload)
# In spark-defaults.conf:
spark.dynamicAllocation.enabled          true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     10
spark.dynamicAllocation.initialExecutors 2
spark.shuffle.service.enabled            true

# Memory configuration
# Total executor memory = executor.memory + executor.memoryOverhead
spark.executor.memory           4g
spark.executor.memoryOverhead   512m
spark.driver.memory             2g
spark.driver.maxResultSize      1g

# Limit resource usage per application: max cores across the whole app.
# (spark-defaults.conf has no trailing comments; everything after the key
# becomes part of the value, so keep comments on their own lines)
spark.cores.max                 8
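The overhead rule is worth internalizing when sizing workers: when spark.executor.memoryOverhead is unset, Spark defaults to max(384 MiB, 10% of executor memory). A small sketch of the arithmetic (the 10% factor and 384 MiB floor match current defaults, but verify against your Spark version):

```python
from typing import Optional

def executor_total_mb(heap_mb: int, overhead_mb: Optional[int] = None,
                      overhead_factor: float = 0.10,
                      overhead_min_mb: int = 384) -> int:
    """Per-executor footprint in MiB: requested heap plus overhead.

    Mirrors Spark's default rule: when spark.executor.memoryOverhead is
    unset, overhead = max(384 MiB, overhead_factor * heap).
    """
    if overhead_mb is None:
        overhead_mb = max(overhead_min_mb, int(heap_mb * overhead_factor))
    return heap_mb + overhead_mb

print(executor_total_mb(4096))        # 4g heap -> 4096 + 409 = 4505 MiB
print(executor_total_mb(2048))        # 2g heap -> 2048 + 384 = 2432 MiB
print(executor_total_mb(4096, 512))   # explicit 512 MiB overhead -> 4608 MiB
```

A worker advertising SPARK_WORKER_MEMORY=8g can therefore host one 4 GiB executor plus overhead with room to spare, but not two.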

Monitoring with the Spark UI

# Master UI shows cluster status
http://master-hostname:8080

# Application UI (while a job runs)
http://driver-hostname:4040

# History Server (for completed jobs)
/opt/spark/sbin/start-history-server.sh

# History Server UI
http://master-hostname:18080

# Configure history server in spark-defaults.conf:
# spark.history.ui.port 18080
# spark.history.fs.logDirectory file:///opt/spark/logs/events

Monitor from CLI:

# List running applications
curl -s http://master-hostname:8080/json/ | \
  python3 -c "import sys, json; apps=json.load(sys.stdin)['activeapps']; [print(a['name'], a['cores'], a['memoryperslave']) for a in apps]"
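The same idea as a standalone script is easier to extend. The payload shape (an `activeapps` array with `name`, `cores`, `memoryperslave`) matches the one-liner above; it is exercised here on a canned sample rather than a live cluster, with the live fetch left as a comment.

```python
import json

# Canned sample of the master's /json/ response; against a live cluster,
# fetch http://master-hostname:8080/json/ with urllib.request instead.
SAMPLE = """
{
  "activeapps": [
    {"name": "WordCount", "cores": 8, "memoryperslave": 4096, "state": "RUNNING"},
    {"name": "NightlyETL", "cores": 4, "memoryperslave": 2048, "state": "RUNNING"}
  ]
}
"""

def running_apps(payload):
    """Return (name, cores, memory_mb) for each active application."""
    return [(a["name"], a["cores"], a["memoryperslave"])
            for a in json.loads(payload).get("activeapps", [])]

for name, cores, mem in running_apps(SAMPLE):
    print(f"{name}: {cores} cores, {mem} MB per executor")
```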

Troubleshooting

Workers not connecting to master:

# Check master is accessible from workers
nc -zv master-hostname 7077

# Check firewall (worker and driver ports are randomized by default; pin
# them with SPARK_WORKER_PORT and spark.driver.port if you need exact rules)
sudo ufw allow 7077/tcp     # Master port
sudo ufw allow 8080/tcp     # Master UI
sudo ufw allow 8081/tcp     # Worker UI

# Check Spark logs
cat /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-*.out

OutOfMemoryError (OOM) during jobs:

# Increase executor memory in spark-submit:
# --executor-memory 8g
# --driver-memory 4g

# Increase memory overhead for large off-heap usage
# --conf spark.executor.memoryOverhead=2048

# Reduce partition size (increase number of partitions)
# spark.sql.shuffle.partitions=500

# Check GC overhead
# Look for "GC overhead limit exceeded" in logs
grep -i "GC\|OutOfMemory" /opt/spark/logs/*.out

Tasks running too slowly:

# Check for data skew (a few tasks take much longer)
# In Spark UI: Stages > Check task duration distribution

# Repartition skewed data
df.repartition(200, "partition_column")

# Enable Adaptive Query Execution (Spark 3.0+)
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.coalescePartitions.enabled = true
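These settings can also be toggled per job at submit time, without editing spark-defaults.conf:

```shell
# Enable AQE (and its skew-join handling) for a single run via --conf flags
spark-submit \
  --master spark://master-hostname:7077 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  my_script.py
```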

Conclusion

Apache Spark provides a powerful and flexible distributed data processing platform that handles everything from interactive data exploration with PySpark to large-scale ETL pipelines, with Spark SQL offering a familiar query interface for structured data analysis. Deploying Spark in standalone mode on a Linux VPS provides a cost-effective way to process large datasets without cloud infrastructure, while the built-in Spark UI and History Server give full visibility into job performance and resource utilization.