Apache Spark Installation on Linux
Apache Spark is the leading distributed data processing engine for big data workloads. It performs in-memory computation across clusters, processing datasets that don't fit on a single machine at speeds far exceeding traditional MapReduce. With PySpark for Python-based data science, Spark SQL for structured queries, and built-in libraries for machine learning and streaming, Spark can run in standalone mode on a VPS or scale out to multi-node clusters on bare-metal infrastructure.
Prerequisites
- Ubuntu 20.04+, Debian 11+, or CentOS/Rocky 8+
- Java 11 or Java 17 (Java 8 minimum for older Spark versions)
- Python 3.8+ for PySpark
- Minimum 4 GB RAM (8+ GB recommended)
- Root or sudo access
- For multi-node: password-less SSH between nodes
Installing Java and Spark
Install Java:
# Ubuntu/Debian
sudo apt update && sudo apt install -y openjdk-17-jdk
# CentOS/Rocky
sudo dnf install -y java-17-openjdk-devel
# Verify
java -version
javac -version
Download and install Spark:
# Download Spark (choose the Hadoop-free version if not using HDFS)
SPARK_VERSION="3.5.1"
HADOOP_VERSION="3"
curl -LO "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
# Extract to /opt
sudo tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /opt/
sudo ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark
# Set ownership
sudo useradd -r -m -d /home/spark -s /bin/bash spark
sudo chown -R spark:spark /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
Configure environment variables:
sudo tee /etc/profile.d/spark.sh <<'EOF'
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64  # adjust for your distro (e.g. /usr/lib/jvm/java-17-openjdk on CentOS/Rocky)
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
EOF
source /etc/profile.d/spark.sh
# Verify Spark installation
spark-submit --version
Standalone Cluster Setup
Spark's standalone cluster manager works without Hadoop or Kubernetes.
Configure Spark on the master node:
cd /opt/spark/conf
# Copy template files
sudo cp spark-env.sh.template spark-env.sh
sudo cp spark-defaults.conf.template spark-defaults.conf
# Configure spark-env.sh
sudo tee -a spark-env.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=master-ip-or-hostname
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4 # CPU cores per worker
export SPARK_WORKER_MEMORY=8g # RAM per worker
export SPARK_LOG_DIR=/opt/spark/logs
export SPARK_PID_DIR=/opt/spark/pids
EOF
sudo mkdir -p /opt/spark/logs /opt/spark/pids
sudo chown -R spark:spark /opt/spark/logs /opt/spark/pids
Configure spark-defaults.conf:
sudo tee /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.master spark://master-hostname:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///opt/spark/logs/events
spark.history.fs.logDirectory file:///opt/spark/logs/events
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 2g
spark.executor.memory 4g
spark.sql.shuffle.partitions 100
spark.default.parallelism 100
EOF
sudo mkdir -p /opt/spark/logs/events && sudo chown -R spark:spark /opt/spark/logs/events
Add worker nodes (multi-node setup):
# Create workers file (list of worker hostnames/IPs)
sudo tee /opt/spark/conf/workers <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF
Start the standalone cluster:
# Start master
sudo -u spark /opt/spark/sbin/start-master.sh
# Start all workers (from master, requires password-less SSH)
sudo -u spark /opt/spark/sbin/start-workers.sh
# Or start a single worker on the current node
sudo -u spark /opt/spark/sbin/start-worker.sh spark://master-hostname:7077
# Check status
/opt/spark/sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1
Create systemd services:
sudo tee /etc/systemd/system/spark-master.service <<'EOF'
[Unit]
Description=Apache Spark Master
After=network.target
[Service]
User=spark
Group=spark
Type=forking
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now spark-master
PySpark Configuration
Install PySpark:
pip3 install pyspark==${SPARK_VERSION}
# Or use the PySpark bundled with Spark
export PYTHONPATH="$SPARK_HOME/python:$(echo "$SPARK_HOME"/python/lib/py4j-*.zip):$PYTHONPATH"  # $(echo ...) expands the glob, which a plain assignment would not
Start a PySpark interactive shell:
pyspark \
--master spark://master-hostname:7077 \
--executor-memory 4g \
--total-executor-cores 4  # --num-executors applies only to YARN/Kubernetes; standalone caps cores this way
Example PySpark script:
# wordcount.py
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
.appName("WordCount") \
.master("spark://master-hostname:7077") \
.config("spark.executor.memory", "4g") \
.getOrCreate()
# Set log level
spark.sparkContext.setLogLevel("WARN")
# Read text file
lines = spark.sparkContext.textFile("hdfs://path/to/file.txt")
# Or from local filesystem:
lines = spark.sparkContext.textFile("file:///data/input.txt")
# Word count
word_counts = (lines
.flatMap(lambda line: line.split())
.map(lambda word: (word.lower(), 1))
.reduceByKey(lambda a, b: a + b)
.sortBy(lambda x: x[1], ascending=False))
# Show top 20 words
for word, count in word_counts.take(20):
    print(f"{word}: {count}")
spark.stop()
Submitting Spark Jobs
# Submit a Python script
spark-submit \
--master spark://master-hostname:7077 \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 2 \
--total-executor-cores 8 \
wordcount.py
# Submit with additional JARs (for database connectors etc.)
spark-submit \
--master spark://master-hostname:7077 \
--jars /opt/jars/postgresql-42.7.3.jar \
--packages org.apache.spark:spark-avro_2.12:3.5.1 \
my_etl_job.py
# Run in cluster mode (driver runs on a worker). Note: Spark's standalone
# cluster manager supports cluster deploy mode only for JVM applications,
# not Python scripts, so submit a packaged JAR:
spark-submit \
--master spark://master-hostname:7077 \
--deploy-mode cluster \
--class com.example.MyJob \
my_job.jar
# Run locally (no cluster needed)
spark-submit --master local[4] my_script.py
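The submit variants above share a small set of flags. As an illustrative sketch (the `build_submit_cmd` helper and its defaults are not part of Spark), the flags can be assembled programmatically, which is handy in job schedulers:

```python
# Illustrative helper (not a Spark API): assemble a spark-submit
# command line from keyword arguments, mirroring the flags shown above.
def build_submit_cmd(script, master="local[4]", driver_memory=None,
                     executor_memory=None, executor_cores=None,
                     total_executor_cores=None, jars=None):
    cmd = ["spark-submit", "--master", master]
    if driver_memory:
        cmd += ["--driver-memory", driver_memory]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores:
        cmd += ["--executor-cores", str(executor_cores)]
    if total_executor_cores:
        # Standalone mode caps total cores with --total-executor-cores;
        # --num-executors is a YARN/Kubernetes flag.
        cmd += ["--total-executor-cores", str(total_executor_cores)]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)
    return cmd

print(" ".join(build_submit_cmd("wordcount.py",
                                master="spark://master-hostname:7077",
                                executor_memory="4g",
                                total_executor_cores=8)))
```

The resulting list can be passed directly to `subprocess.run` without shell quoting concerns.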
Spark SQL Usage
# spark_sql_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc
spark = SparkSession.builder \
.appName("SparkSQL") \
.config("spark.sql.shuffle.partitions", "50") \
.getOrCreate()
# Read a CSV file
df = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("file:///data/sales.csv")
# Show schema
df.printSchema()
df.show(5)
# DataFrame API
result = df.groupBy("product") \
.agg(
count("*").alias("total_sales"),
avg("amount").alias("avg_amount")
) \
.orderBy(desc("total_sales"))
result.show()
# SQL API
df.createOrReplaceTempView("sales")
result = spark.sql("""
SELECT product,
COUNT(*) as total_sales,
AVG(amount) as avg_amount
FROM sales
GROUP BY product
ORDER BY total_sales DESC
LIMIT 10
""")
result.show()
# Write output
result.write \
.mode("overwrite") \
.parquet("file:///data/output/top_products.parquet")
spark.stop()
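For intuition about what the groupBy/agg pipeline computes, here is the same aggregation in plain Python over a few illustrative sample rows (no Spark required):

```python
# Illustrative: the count/avg-per-product aggregation above, in plain Python.
rows = [
    {"product": "widget", "amount": 10.0},
    {"product": "widget", "amount": 30.0},
    {"product": "gadget", "amount": 5.0},
]

# Accumulate (count, sum) per product, like GROUP BY product.
totals = {}
for r in rows:
    cnt, s = totals.get(r["product"], (0, 0.0))
    totals[r["product"]] = (cnt + 1, s + r["amount"])

# Emit (product, total_sales, avg_amount), ordered by count descending.
result = sorted(
    ((p, cnt, s / cnt) for p, (cnt, s) in totals.items()),
    key=lambda t: t[1], reverse=True)
print(result)  # [('widget', 2, 20.0), ('gadget', 1, 5.0)]
```

Spark performs the same logic, but partitions the rows across executors and shuffles partial (count, sum) pairs before the final merge.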
Resource Management
# Dynamic allocation (auto-scales executors based on workload)
# In spark-defaults.conf:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
spark.dynamicAllocation.initialExecutors 2
spark.shuffle.service.enabled true
# Memory configuration
# Total executor memory = executor.memory + executor.memoryOverhead
spark.executor.memory 4g
spark.executor.memoryOverhead 512m
spark.driver.memory 2g
spark.driver.maxResultSize 1g
# Limit resource usage per application (max cores for the entire app;
# note spark-defaults.conf does not support inline comments after values)
spark.cores.max 8
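To make the memory formula above concrete, here is a small illustrative calculator (not a Spark API). When `spark.executor.memoryOverhead` is unset, Spark derives it as 10% of executor memory with a 384 MiB floor:

```python
# Illustrative calculator for the total per-executor memory footprint.
# Default overhead when spark.executor.memoryOverhead is unset:
# max(384 MiB, 0.10 * executor memory).
def executor_footprint_mib(executor_memory_mib, overhead_mib=None,
                           overhead_factor=0.10):
    if overhead_mib is None:
        overhead_mib = max(384, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead_mib

# 4 GiB executor with the explicit 512 MiB overhead set above
print(executor_footprint_mib(4096, overhead_mib=512))  # 4608
# Same executor with the derived default: max(384, 409) = 409
print(executor_footprint_mib(4096))                    # 4505
```

Size workers so that `SPARK_WORKER_MEMORY` covers the sum of these footprints, not just the bare `spark.executor.memory` values.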
Monitoring with the Spark UI
# Master UI shows cluster status
http://master-hostname:8080
# Application UI (while a job runs)
http://driver-hostname:4040
# History Server (for completed jobs)
/opt/spark/sbin/start-history-server.sh
# History Server UI
http://master-hostname:18080
# Configure history server in spark-defaults.conf:
# spark.history.ui.port 18080
# spark.history.fs.logDirectory file:///opt/spark/logs/events
Monitor from CLI:
# List running applications
curl -s http://master-hostname:8080/json/ | \
python3 -c "import sys, json; apps=json.load(sys.stdin)['activeapps']; [print(a['name'], a['cores'], a['memoryperslave']) for a in apps]"
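The same JSON handling is easier to maintain as a short script. The payload below is an abridged, illustrative sample that follows the field names used in the one-liner above; real responses carry more fields:

```python
import json

# Illustrative, abridged sample of the master's /json payload.
sample = json.dumps({
    "activeapps": [
        {"name": "WordCount", "cores": 8, "memoryperslave": 4096},
        {"name": "ETL", "cores": 4, "memoryperslave": 2048},
    ]
})

def summarize_apps(payload):
    """Extract (name, cores, memory) tuples from a master /json payload."""
    apps = json.loads(payload).get("activeapps", [])
    return [(a["name"], a["cores"], a["memoryperslave"]) for a in apps]

for name, cores, mem in summarize_apps(sample):
    print(f"{name}: {cores} cores, {mem} MiB per executor")
```

In practice you would fetch the payload with `urllib.request` or `requests` from `http://master-hostname:8080/json/` instead of using the inline sample.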
Troubleshooting
Workers not connecting to master:
# Check master is accessible from workers
nc -zv master-hostname 7077
# Check firewall
sudo ufw allow 7077/tcp # Master port
sudo ufw allow 7078/tcp # Worker port (only if pinned via SPARK_WORKER_PORT; random by default)
sudo ufw allow 8080/tcp # Master UI
# Check Spark logs
cat /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-*.out
OutOfMemoryError (OOM) during jobs:
# Increase executor memory in spark-submit:
# --executor-memory 8g
# --driver-memory 4g
# Increase memory overhead for large off-heap usage
# --conf spark.executor.memoryOverhead=2048
# Reduce partition size (increase number of partitions)
# spark.sql.shuffle.partitions=500
# Check GC overhead
# Look for "GC overhead limit exceeded" in logs
grep -i "GC\|OutOfMemory" /opt/spark/logs/*.out
Tasks running too slowly:
# Check for data skew (a few tasks take much longer)
# In Spark UI: Stages > Check task duration distribution
# Repartition skewed data
df = df.repartition(200, "partition_column")  # returns a new DataFrame; reassign it
# Enable Adaptive Query Execution (Spark 3.0+)
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.coalescePartitions.enabled = true
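The AQE settings above can also be applied per session rather than cluster-wide. A minimal pure-Python sketch (the `aqe_conf` helper is illustrative, not part of Spark) that collects the relevant key/value pairs for `SparkSession.builder.config()`:

```python
# Illustrative helper: gather AQE-related settings (Spark 3.0+) as a dict
# so they can be applied in a loop when building a SparkSession.
def aqe_conf(coalesce=True, skew_join=True):
    conf = {"spark.sql.adaptive.enabled": "true"}
    if coalesce:
        # Merge small shuffle partitions after each stage completes
        conf["spark.sql.adaptive.coalescePartitions.enabled"] = "true"
    if skew_join:
        # Split oversized partitions in sort-merge joins
        conf["spark.sql.adaptive.skewJoin.enabled"] = "true"
    return conf

# Usage sketch (requires a running PySpark environment):
# builder = SparkSession.builder.appName("tuned")
# for k, v in aqe_conf().items():
#     builder = builder.config(k, v)
print(aqe_conf())
```

With skew-join handling enabled, AQE splits the straggler partitions automatically, which often removes the need for a manual `repartition` on the skewed column.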
Conclusion
Apache Spark provides a powerful and flexible distributed data processing platform that handles everything from interactive data exploration with PySpark to large-scale ETL pipelines, with Spark SQL offering a familiar query interface for structured data analysis. Deploying Spark in standalone mode on a Linux VPS is a cost-effective way to process large datasets without managed cloud infrastructure, while the built-in Spark UI and History Server give full visibility into job performance and resource utilization.