Apache Spark Installation on Linux
Apache Spark is the leading distributed data processing engine for big data workloads. It performs in-memory computation across clusters, processing datasets that don't fit on a single machine at speeds far exceeding traditional MapReduce. With PySpark for Python-based data science, Spark SQL for structured queries, and built-in libraries for machine learning and streaming, Spark can run in standalone mode on a VPS or scale out to multi-node clusters on bare-metal infrastructure.
Prerequisites
- Ubuntu 20.04+, Debian 11+, or CentOS/Rocky 8+
- Java 11 or Java 17 (Java 8 minimum for older Spark versions)
- Python 3.8+ for PySpark
- Minimum 4 GB RAM (8+ GB recommended)
- Root or sudo access
- For multi-node: password-less SSH between nodes
Installing Java and Spark
Install Java:
# Ubuntu/Debian
sudo apt update && sudo apt install -y openjdk-17-jdk
# CentOS/Rocky
sudo dnf install -y java-17-openjdk-devel
# Verify
java -version
javac -version
Download and install Spark:
# Download Spark (choose the Hadoop-free version if not using HDFS)
SPARK_VERSION="3.5.1"
HADOOP_VERSION="3"
curl -LO "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
# Extract to /opt
sudo tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /opt/
sudo ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark
# Set ownership
sudo useradd -r -m -d /home/spark -s /bin/bash spark
sudo chown -R spark:spark /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
Configure environment variables:
sudo tee /etc/profile.d/spark.sh <<'EOF'
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64  # adjust for your distro (e.g. /usr/lib/jvm/java-17-openjdk on CentOS/Rocky)
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
EOF
source /etc/profile.d/spark.sh
# Verify Spark installation
spark-submit --version
Standalone Cluster Setup
Spark's standalone cluster manager works without Hadoop or Kubernetes.
Configure Spark on the master node:
cd /opt/spark/conf
# Copy template files
sudo cp spark-env.sh.template spark-env.sh
sudo cp spark-defaults.conf.template spark-defaults.conf
# Configure spark-env.sh
sudo tee -a spark-env.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=master-ip-or-hostname
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4 # CPU cores per worker
export SPARK_WORKER_MEMORY=8g # RAM per worker
export SPARK_LOG_DIR=/opt/spark/logs
export SPARK_PID_DIR=/opt/spark/pids
EOF
sudo mkdir -p /opt/spark/logs /opt/spark/pids
sudo chown -R spark:spark /opt/spark/logs /opt/spark/pids
Configure spark-defaults.conf:
sudo tee /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.master spark://master-hostname:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///opt/spark/logs/events
spark.history.fs.logDirectory file:///opt/spark/logs/events
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 2g
spark.executor.memory 4g
spark.sql.shuffle.partitions 100
spark.default.parallelism 100
EOF
sudo mkdir -p /opt/spark/logs/events && sudo chown -R spark:spark /opt/spark/logs/events
Add worker nodes (multi-node setup):
# Create workers file (list of worker hostnames/IPs)
sudo tee /opt/spark/conf/workers <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF
Start the standalone cluster:
# Start master
sudo -u spark /opt/spark/sbin/start-master.sh
# Start all workers (from master, requires password-less SSH)
sudo -u spark /opt/spark/sbin/start-workers.sh
# Or start a single worker on the current node
sudo -u spark /opt/spark/sbin/start-worker.sh spark://master-hostname:7077
# Check status
/opt/spark/sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1
Create systemd services:
sudo tee /etc/systemd/system/spark-master.service <<'EOF'
[Unit]
Description=Apache Spark Master
After=network.target
[Service]
User=spark
Group=spark
Type=forking
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now spark-master
PySpark Configuration
Install PySpark:
pip3 install pyspark==${SPARK_VERSION}
# Or use the PySpark bundled with Spark
export PYTHONPATH="$SPARK_HOME/python:$(echo "$SPARK_HOME"/python/lib/py4j-*.zip):$PYTHONPATH"  # $(echo ...) expands the glob, which a plain assignment would not
Start a PySpark interactive shell:
pyspark \
--master spark://master-hostname:7077 \
--executor-memory 4g \
--total-executor-cores 4  # --num-executors applies only to YARN/Kubernetes; standalone caps cores this way
Example PySpark script:
# wordcount.py
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
.appName("WordCount") \
.master("spark://master-hostname:7077") \
.config("spark.executor.memory", "4g") \
.getOrCreate()
# Set log level
spark.sparkContext.setLogLevel("WARN")
# Read text file
lines = spark.sparkContext.textFile("hdfs://path/to/file.txt")
# Or from local filesystem:
lines = spark.sparkContext.textFile("file:///data/input.txt")
# Word count
word_counts = (lines
.flatMap(lambda line: line.split())
.map(lambda word: (word.lower(), 1))
.reduceByKey(lambda a, b: a + b)
.sortBy(lambda x: x[1], ascending=False))
# Show top 20 words
for word, count in word_counts.take(20):
    print(f"{word}: {count}")
spark.stop()
Submitting Spark Jobs
# Submit a Python script
spark-submit \
--master spark://master-hostname:7077 \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 2 \
--total-executor-cores 8 \
wordcount.py
# Submit with additional JARs (for database connectors etc.)
spark-submit \
--master spark://master-hostname:7077 \
--jars /opt/jars/postgresql-42.7.3.jar \
--packages org.apache.spark:spark-avro_2.12:3.5.1 \
my_etl_job.py
# Run in cluster mode (driver runs on a worker). Note: Spark's standalone
# cluster manager supports cluster deploy mode only for JVM applications,
# not Python scripts, so submit a packaged JAR:
spark-submit \
--master spark://master-hostname:7077 \
--deploy-mode cluster \
--class com.example.MyJob \
my_job.jar
# Run locally (no cluster needed)
spark-submit --master local[4] my_script.py
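The submit variants above share a small set of flags. As an illustrative sketch (the `build_submit_cmd` helper and its defaults are not part of Spark), the flags can be assembled programmatically, which is handy in job schedulers:

```python
# Illustrative helper (not a Spark API): assemble a spark-submit
# command line from keyword arguments, mirroring the flags shown above.
def build_submit_cmd(script, master="local[4]", driver_memory=None,
                     executor_memory=None, executor_cores=None,
                     total_executor_cores=None, jars=None):
    cmd = ["spark-submit", "--master", master]
    if driver_memory:
        cmd += ["--driver-memory", driver_memory]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores:
        cmd += ["--executor-cores", str(executor_cores)]
    if total_executor_cores:
        # Standalone mode caps total cores with --total-executor-cores;
        # --num-executors is a YARN/Kubernetes flag.
        cmd += ["--total-executor-cores", str(total_executor_cores)]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)
    return cmd

print(" ".join(build_submit_cmd("wordcount.py",
                                master="spark://master-hostname:7077",
                                executor_memory="4g",
                                total_executor_cores=8)))
```

The resulting list can be passed directly to `subprocess.run` without shell quoting concerns.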
Spark SQL Usage
# spark_sql_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc
spark = SparkSession.builder \
.appName("SparkSQL") \
.config("spark.sql.shuffle.partitions", "50") \
.getOrCreate()
# Read a CSV file
df = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("file:///data/sales.csv")
# Show schema
df.printSchema()
df.show(5)
# DataFrame API
result = df.groupBy("product") \
.agg(
count("*").alias("total_sales"),
avg("amount").alias("avg_amount")
) \
.orderBy(desc("total_sales"))
result.show()
# SQL API
df.createOrReplaceTempView("sales")
result = spark.sql("""
SELECT product,
COUNT(*) as total_sales,
AVG(amount) as avg_amount
FROM sales
GROUP BY product
ORDER BY total_sales DESC
LIMIT 10
""")
result.show()
# Write output
result.write \
.mode("overwrite") \
.parquet("file:///data/output/top_products.parquet")
spark.stop()
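For intuition about what the groupBy/agg pipeline computes, here is the same aggregation in plain Python over a few illustrative sample rows (no Spark required):

```python
# Illustrative: the count/avg-per-product aggregation above, in plain Python.
rows = [
    {"product": "widget", "amount": 10.0},
    {"product": "widget", "amount": 30.0},
    {"product": "gadget", "amount": 5.0},
]

# Accumulate (count, sum) per product, like GROUP BY product.
totals = {}
for r in rows:
    cnt, s = totals.get(r["product"], (0, 0.0))
    totals[r["product"]] = (cnt + 1, s + r["amount"])

# Emit (product, total_sales, avg_amount), ordered by count descending.
result = sorted(
    ((p, cnt, s / cnt) for p, (cnt, s) in totals.items()),
    key=lambda t: t[1], reverse=True)
print(result)  # [('widget', 2, 20.0), ('gadget', 1, 5.0)]
```

Spark performs the same logic, but partitions the rows across executors and shuffles partial (count, sum) pairs before the final merge.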
Resource Management
# Dynamic allocation (auto-scales executors based on workload)
# In spark-defaults.conf:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
spark.dynamicAllocation.initialExecutors 2
spark.shuffle.service.enabled true
# Memory configuration
# Total executor memory = executor.memory + executor.memoryOverhead
spark.executor.memory 4g
spark.executor.memoryOverhead 512m
spark.driver.memory 2g
spark.driver.maxResultSize 1g
# Limit resource usage per application (max cores for the entire app;
# note spark-defaults.conf does not support inline comments after values)
spark.cores.max 8
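To make the memory formula above concrete, here is a small illustrative calculator (not a Spark API). When `spark.executor.memoryOverhead` is unset, Spark derives it as 10% of executor memory with a 384 MiB floor:

```python
# Illustrative calculator for the total per-executor memory footprint.
# Default overhead when spark.executor.memoryOverhead is unset:
# max(384 MiB, 0.10 * executor memory).
def executor_footprint_mib(executor_memory_mib, overhead_mib=None,
                           overhead_factor=0.10):
    if overhead_mib is None:
        overhead_mib = max(384, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead_mib

# 4 GiB executor with the explicit 512 MiB overhead set above
print(executor_footprint_mib(4096, overhead_mib=512))  # 4608
# Same executor with the derived default: max(384, 409) = 409
print(executor_footprint_mib(4096))                    # 4505
```

Size workers so that `SPARK_WORKER_MEMORY` covers the sum of these footprints, not just the bare `spark.executor.memory` values.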
Monitoring with the Spark UI
# Master UI shows cluster status
http://master-hostname:8080
# Application UI (while a job runs)
http://driver-hostname:4040
# History Server (for completed jobs)
/opt/spark/sbin/start-history-server.sh
# History Server UI
http://master-hostname:18080
# Configure history server in spark-defaults.conf:
# spark.history.ui.port 18080
# spark.history.fs.logDirectory file:///opt/spark/logs/events
Monitor from CLI:
# List running applications
curl -s http://master-hostname:8080/json/ | \
python3 -c "import sys, json; apps=json.load(sys.stdin)['activeapps']; [print(a['name'], a['cores'], a['memoryperslave']) for a in apps]"
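The same JSON handling is easier to maintain as a short script. The payload below is an abridged, illustrative sample that follows the field names used in the one-liner above; real responses carry more fields:

```python
import json

# Illustrative, abridged sample of the master's /json payload.
sample = json.dumps({
    "activeapps": [
        {"name": "WordCount", "cores": 8, "memoryperslave": 4096},
        {"name": "ETL", "cores": 4, "memoryperslave": 2048},
    ]
})

def summarize_apps(payload):
    """Extract (name, cores, memory) tuples from a master /json payload."""
    apps = json.loads(payload).get("activeapps", [])
    return [(a["name"], a["cores"], a["memoryperslave"]) for a in apps]

for name, cores, mem in summarize_apps(sample):
    print(f"{name}: {cores} cores, {mem} MiB per executor")
```

In practice you would fetch the payload with `urllib.request` or `requests` from `http://master-hostname:8080/json/` instead of using the inline sample.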
Troubleshooting
Workers not connecting to master:
# Check master is accessible from workers
nc -zv master-hostname 7077
# Check firewall
sudo ufw allow 7077/tcp # Master port
sudo ufw allow 7078/tcp # Worker port (only if pinned via SPARK_WORKER_PORT; random by default)
sudo ufw allow 8080/tcp # Master UI
# Check Spark logs
cat /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-*.out
OutOfMemoryError (OOM) during jobs:
# Increase executor memory in spark-submit:
# --executor-memory 8g
# --driver-memory 4g
# Increase memory overhead for large off-heap usage
# --conf spark.executor.memoryOverhead=2048
# Reduce partition size (increase number of partitions)
# spark.sql.shuffle.partitions=500
# Check GC overhead
# Look for "GC overhead limit exceeded" in logs
grep -i "GC\|OutOfMemory" /opt/spark/logs/*.out
Tasks running too slowly:
# Check for data skew (a few tasks take much longer)
# In Spark UI: Stages > Check task duration distribution
# Repartition skewed data
df = df.repartition(200, "partition_column")  # returns a new DataFrame; reassign it
# Enable Adaptive Query Execution (Spark 3.0+)
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.coalescePartitions.enabled = true
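The AQE settings above can also be applied per session rather than cluster-wide. A minimal pure-Python sketch (the `aqe_conf` helper is illustrative, not part of Spark) that collects the relevant key/value pairs for `SparkSession.builder.config()`:

```python
# Illustrative helper: gather AQE-related settings (Spark 3.0+) as a dict
# so they can be applied in a loop when building a SparkSession.
def aqe_conf(coalesce=True, skew_join=True):
    conf = {"spark.sql.adaptive.enabled": "true"}
    if coalesce:
        # Merge small shuffle partitions after each stage completes
        conf["spark.sql.adaptive.coalescePartitions.enabled"] = "true"
    if skew_join:
        # Split oversized partitions in sort-merge joins
        conf["spark.sql.adaptive.skewJoin.enabled"] = "true"
    return conf

# Usage sketch (requires a running PySpark environment):
# builder = SparkSession.builder.appName("tuned")
# for k, v in aqe_conf().items():
#     builder = builder.config(k, v)
print(aqe_conf())
```

With skew-join handling enabled, AQE splits the straggler partitions automatically, which often removes the need for a manual `repartition` on the skewed column.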
Conclusion
Apache Spark provides a powerful and flexible distributed data processing platform that handles everything from interactive data exploration with PySpark to large-scale ETL pipelines, with Spark SQL offering a familiar query interface for structured data analysis. Deploying Spark in standalone mode on a Linux VPS is a cost-effective way to process large datasets without managed cloud infrastructure, while the built-in Spark UI and History Server give full visibility into job performance and resource utilization.