🧮 LARGE-SCALE ONLY: Spark Setup
Complete this once before starting any Spark or large-scale data processing project.
Apache Spark is a data processing engine used for analytics on larger or more complex datasets. We can run Spark locally through Python using PySpark.
Spark does not need to be started and stopped like Kafka. Once Java and the project dependencies are installed, Spark starts when the Python script runs and stops when the script finishes.
This setup is done once and reused across Spark-related modules. If Spark is already working on your machine, skip to Step 6. Verify PySpark.
Windows Users
If you are on Windows, read Windows and WSL first. Spark runs in WSL (Windows Subsystem for Linux) on Windows machines. The setup steps are the same, but the terminal environment is different.
What You Will Set Up
By the end of setting up Spark, you will have:
- Java JDK 17 installed
- JAVA_HOME set in your shell profile
- Python project dependencies installed with uv
Note on SPARK_HOME:
Do not set SPARK_HOME.
When PySpark is installed via uv or pip, Spark JARs are bundled inside
the Python package.
A SPARK_HOME pointing elsewhere causes a FileNotFoundError at runtime.
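If SPARK_HOME is already set on your machine, unset it by running unset SPARK_HOME in your terminal, and remove any export SPARK_HOME line from your shell profile.
Step 1. Open a New Terminal in VS Code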
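The SPARK_HOME rule above can also be checked from Python before a Spark job starts. The following is a minimal sketch, not part of the official setup; the function name `spark_home_warning` is hypothetical and only illustrates the check.

```python
import os


def spark_home_warning(env):
    """Return a warning string if SPARK_HOME is set in env, otherwise None.

    `env` is any mapping of environment variables, such as os.environ.
    """
    value = env.get("SPARK_HOME")
    if value:
        return (
            f"SPARK_HOME is set to {value!r}. "
            "Unset it so PySpark uses the JARs bundled in the Python package."
        )
    return None


if __name__ == "__main__":
    # Check the environment of the current terminal session.
    print(spark_home_warning(os.environ) or "SPARK_HOME is not set.")
```

Run it in the same terminal you will use for Spark, since environment variables can differ between terminal sessions.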
Windows users:
- Read Windows and WSL first.
- Open a new WSL terminal in VS Code.
- For example, from the VS Code menu, select Terminal / New Terminal.
- Then run wsl to start your Linux distribution.
Your prompt will change to something like username@DESKTOP:~$.
Do all steps below in this WSL terminal.
All other users:
- Open a new terminal in VS Code.
- For example, from the VS Code menu, select Terminal / New Terminal.
Step 2. Install Java 17+
Choose the option for your operating system.
Option 2A: Windows WSL / Ubuntu / Debian
Option 2C: Red Hat / Fedora
After installing, confirm the version by running java -version. Expected output: openjdk 17.x.x or higher.
Step 3. Detect and Set JAVA_HOME
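The "17 or higher" requirement can be checked programmatically. This sketch assumes `java -version` prints a line such as `openjdk version "17.0.9"` (the JVM writes it to stderr); the helper names `parse_java_major` and `read_java_version_text` are hypothetical.

```python
import re
import shutil
import subprocess


def parse_java_major(version_text):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier use the legacy "1.8.0_292" scheme, where the
    # real major version is the second number.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def read_java_version_text():
    """Return the text printed by `java -version`, or "" if java is missing."""
    if shutil.which("java") is None:
        return ""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    # The version banner goes to stderr; include stdout just in case.
    return result.stderr + result.stdout


if __name__ == "__main__":
    major = parse_java_major(read_java_version_text())
    if major is None:
        print("Java was not found on PATH.")
    elif major >= 17:
        print(f"Java {major} found: OK.")
    else:
        print(f"Java {major} found: install Java 17 or higher.")
```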
PySpark needs JAVA_HOME to locate Java at runtime.
Detect the actual path first, then write it to your profile.
Option 3A: Windows WSL / Ubuntu / Debian (tested)
Tested on WSL. Use the output of that command as your path below. Common values are `/usr/lib/jvm/java-17-openjdk-amd64` (AMD64) or `/usr/lib/jvm/java-17-openjdk-arm64` (ARM64).
Option 3B: macOS (please report any issues)
Option 3C: Red Hat / Fedora (please report any issues)
Step 4. Verify JAVA_HOME and Path
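One way to see how a JAVA_HOME value relates to the `java` binary is to resolve the binary's symlinks and step up two directories from `bin/java`. This is an illustrative sketch for Linux and WSL, not the commands this guide uses; the function names are hypothetical. On macOS, `/usr/bin/java` is a stub that does not resolve to the JDK folder, so the system tool `/usr/libexec/java_home` is the usual route there.

```python
import os
import shutil
from pathlib import PurePosixPath


def java_home_from_binary(java_binary):
    """Given a resolved .../bin/java path, return the JDK root two levels up.

    Assumes POSIX-style paths, as used on WSL and Linux.
    """
    return str(PurePosixPath(java_binary).parent.parent)


def detect_java_home():
    """Resolve the java found on PATH to a JAVA_HOME candidate, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # Follow symlinks such as /usr/bin/java -> /etc/alternatives/java -> JDK.
    return java_home_from_binary(os.path.realpath(java))


if __name__ == "__main__":
    candidate = detect_java_home()
    print(candidate or "Java was not found on PATH.")
```

Treat the printed value as a candidate only, and compare it with the common values listed above before writing it to your shell profile.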
All three commands should succeed and show openjdk 17.x.x.
If the first two succeed but the third fails, your PATH is not updated.
Re-run source ~/.bashrc (Linux) or source ~/.zshrc (macOS).
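The PATH check described above can be sketched in Python as a test of whether the `bin` folder under JAVA_HOME appears in PATH. The function name `java_bin_on_path` is hypothetical. A False result only means that this specific folder is not listed; `java` may still be reachable through another PATH entry such as a system symlink.

```python
import os


def java_bin_on_path(java_home, path_value):
    """Return True if <java_home>/bin is one of the entries in a PATH string."""
    expected = os.path.normpath(os.path.join(java_home, "bin"))
    # Split PATH on the platform separator and ignore empty entries.
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return expected in entries


if __name__ == "__main__":
    java_home = os.environ.get("JAVA_HOME", "")
    if not java_home:
        print("JAVA_HOME is not set in this terminal.")
    elif java_bin_on_path(java_home, os.environ.get("PATH", "")):
        print("JAVA_HOME/bin is on PATH.")
    else:
        print("JAVA_HOME/bin is not on PATH. Re-source your shell profile.")
```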
Step 5. Install Project Dependencies
Verify that pyproject.toml includes:
Then install and upgrade dependencies with:
This installs PySpark and all other required Python packages.
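To confirm that the install step put PySpark into the project environment, the installed version can be read with the standard library's `importlib.metadata`. This is a small sketch; the helper name `installed_version` is hypothetical, and the script should be run inside the project environment (for example with `uv run python`).

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pyspark")
    if version is None:
        print("pyspark is not installed in this environment.")
    else:
        print(f"pyspark {version} is installed.")
```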
Step 6. Verify PySpark
Create this file in your main project package folder and run it to verify. Adjust paths as needed for your project.
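"""src/bizintel/sparktest.py - Verify PySpark installation.

Run with:
    uv run python src/bizintel/sparktest.py
"""

from pyspark.sql import SparkSession

# Start a local Spark session that uses all available CPU cores.
spark = SparkSession.builder.appName("SparkTest").master("local[*]").getOrCreate()
# Reduce console noise so that only errors are logged.
spark.sparkContext.setLogLevel("ERROR")
print(f"Spark version: {spark.version}")

# Build a small DataFrame and display it to confirm Spark can run a job.
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()

# Release Spark resources before the script exits.
spark.stop()
print("PySpark is working.")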
You should see the following:
Spark version: 3.x.x
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
PySpark is working.