Installing Apache Spark¶
Apache Spark is a distributed computing engine used for large-scale data processing and analytics.
0. ADVANCED FOR WINDOWS USERS (OPTIONAL)¶
Apache Spark is optional for Windows users.
If you want to run Spark on Windows, you must set up WSL (Ubuntu) and make a second clone of your project inside WSL. There will be two operating-system-specific versions of the project:
- Windows repo with Power BI
- WSL repo with Spark
Steps:
- Open WSL (Ubuntu) and make a second clone of your repo from inside WSL, for example: git clone https://github.com/... (a consolidated example is sketched after this list).
- In Windows, install the "WSL – Windows Subsystem for Linux" extension (ms-vscode-remote.remote-wsl) in VS Code.
- In WSL, navigate into the project folder and run: code . This opens the WSL project in VS Code.
- All notebooks, Spark sessions, and terminal commands will now run inside WSL, while the VS Code interface runs normally in Windows.
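Taken together, the WSL setup looks roughly like this; the repository URL and folder name are placeholders for your own project:

# Inside a WSL (Ubuntu) terminal: clone a second copy of the project
git clone https://github.com/your-username/your-project.git
cd your-project

# Open the WSL copy in VS Code (requires the WSL extension on the Windows side)
code .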
All of the following steps should be run inside a WSL bash terminal.
1. Install Apache Spark¶
Download Spark from: https://spark.apache.org/downloads.html
Choose:
- Spark version: 3.x
- Package type: Pre-built for Apache Hadoop

Then follow the install instructions for your OS.
Add Spark’s bin/ folder to your shell PATH.
Example:
export PATH="$PATH:/path/to/spark/bin"
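To make the change persist across shells, you can add it to your ~/.bashrc instead; the Spark folder name below is an example, so adjust it to match the version you downloaded:

# Append Spark to PATH for every new shell (install location is an example)
echo 'export SPARK_HOME="$HOME/spark-3.5.1-bin-hadoop3"' >> ~/.bashrc
echo 'export PATH="$PATH:$SPARK_HOME/bin"' >> ~/.bashrc
source ~/.bashrc

# Verify the install
spark-submit --version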
By default, the project comments out pyspark in pyproject.toml.
- Open pyproject.toml.
- Uncomment the pyspark line in the dependencies section (see the snippet after this list).
- Resync your environment: uv sync --extra dev --extra docs --upgrade

This ensures:
- pyspark is installed
- Jupyter notebooks can import pyspark
- Your Spark session (SparkSession.builder) works inside the notebook
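For reference, the dependencies section of pyproject.toml should end up looking something like this; the exact version pin is an example and may differ in your project:

dependencies = [
    # ... other project dependencies ...
    "pyspark>=3.5",  # uncomment this line to enable Spark
]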
2. Install the JDBC Driver¶
Spark needs a JDBC driver to read SQLite or DuckDB databases.
Install the JDBC (Java Database Connectivity) driver and create a Data Source Name (DSN); one way to download the driver jar is sketched after this list:
- If using SQLite, see Working with SQLite
- If using DuckDB, see Working with DuckDB
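For SQLite, one option is to download the Xerial sqlite-jdbc jar from Maven Central into a lib/ folder; the version number here is only an example, so check for the current release:

mkdir -p lib
wget -P lib https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.36.0.3/sqlite-jdbc-3.36.0.3.jar

If you keep the versioned filename, point spark.driver.extraClassPath in the next step at that exact file, or rename it to lib/sqlite-jdbc.jar to match the example config.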
3. Configure Spark to Use the JDBC Driver¶
Example PySpark config:
from pyspark.sql import SparkSession

# Point the Spark driver at the JDBC jar so it can load the database driver class
spark = (
    SparkSession.builder
    .appName("SmartSales")
    .config("spark.driver.extraClassPath", "lib/sqlite-jdbc.jar")
    .getOrCreate()
)
If Spark cannot find the driver, use an absolute path.
Restart the notebook kernel after updating Spark configuration.
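If a relative path does not work, a minimal sketch of the absolute-path variant (assuming the jar sits in a lib/ folder inside the project) is:

from pathlib import Path
from pyspark.sql import SparkSession

# Resolve the jar relative to the current working directory, then pass the absolute path
jar_path = str(Path("lib/sqlite-jdbc.jar").resolve())

spark = (
    SparkSession.builder
    .appName("SmartSales")
    .config("spark.driver.extraClassPath", jar_path)
    .getOrCreate()
)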
4. Test Spark in a Notebook¶
# Read the fact table over JDBC; if Spark reports "No suitable driver",
# add driver="org.sqlite.JDBC" to the options to name the driver class explicitly
df = spark.read.format("jdbc").options(
    url="jdbc:sqlite:data/warehouse/smart_sales.db",
    dbtable="fact_sale",
).load()
df.show()
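If you are using DuckDB instead of SQLite, the same pattern applies with the DuckDB JDBC driver on the classpath; the database path and table name below are assumptions based on this project's layout:

# DuckDB variant: requires the DuckDB JDBC jar on spark.driver.extraClassPath
df = spark.read.format("jdbc").options(
    url="jdbc:duckdb:data/warehouse/smart_sales.duckdb",
    dbtable="fact_sale",
    driver="org.duckdb.DuckDBDriver",
).load()
df.printSchema()
df.show()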