Installing Apache Spark¶
Apache Spark is a distributed computing engine used for large-scale data processing and analytics.
0. ADVANCED FOR WINDOWS USERS (OPTIONAL)¶
Apache Spark is optional for Windows users.
If you want to run Spark on Windows, you must set up WSL (Ubuntu) and make a second clone of your project inside WSL. There will be two operating-system-specific versions of the project:
- Windows repo with Power BI
- WSL repo with Spark
Steps:
- Open WSL (Ubuntu) and make a second clone of your repo from inside WSL, for example: git clone https://github.com/... (a consolidated example is sketched after this list).
- In Windows, install the "WSL – Windows Subsystem for Linux" extension (ms-vscode-remote.remote-wsl) in VS Code.
- In WSL, navigate into the project folder and run: code . This opens the WSL project in VS Code.
- All notebooks, Spark sessions, and terminal commands will now run inside WSL, while the VS Code interface runs normally in Windows.
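Taken together, the WSL setup looks roughly like this; the repository URL and folder name are placeholders for your own project:

# Inside a WSL (Ubuntu) terminal: clone a second copy of the project
git clone https://github.com/your-username/your-project.git
cd your-project

# Open the WSL copy in VS Code (requires the WSL extension on the Windows side)
code .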
All of the following steps should be run inside a WSL bash terminal.
1. Install Apache Spark¶
Download Spark from: https://spark.apache.org/downloads.html
Choose:
- Spark version: 3.x
- Package type: Pre-built for Apache Hadoop

Then follow the install instructions for your OS.
Add Spark’s bin/ folder to your shell PATH.
Example:
export PATH="$PATH:/path/to/spark/bin"
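To make the change persist across shells, you can add it to your ~/.bashrc instead; the Spark folder name below is an example, so adjust it to match the version you downloaded:

# Append Spark to PATH for every new shell (install location is an example)
echo 'export SPARK_HOME="$HOME/spark-3.5.1-bin-hadoop3"' >> ~/.bashrc
echo 'export PATH="$PATH:$SPARK_HOME/bin"' >> ~/.bashrc
source ~/.bashrc

# Verify the install
spark-submit --version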
By default, the project comments out pyspark in pyproject.toml.
- Open pyproject.toml.
- Uncomment the pyspark line in the dependencies section (see the snippet after this list).
- Resync your environment: uv sync --extra dev --extra docs --upgrade

This ensures:
- pyspark is installed
- Jupyter notebooks can import pyspark
- Your Spark session (SparkSession.builder) works inside the notebook
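For reference, the dependencies section of pyproject.toml should end up looking something like this; the exact version pin is an example and may differ in your project:

dependencies = [
    # ... other project dependencies ...
    "pyspark>=3.5",  # uncomment this line to enable Spark
]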
2. Install the JDBC Driver¶
Spark needs a JDBC driver to read SQLite or DuckDB databases.
Install the JDBC (Java Database Connectivity) driver and create a Data Source Name (DSN); one way to download the driver jar is sketched after this list:
- If using SQLite, see Working with SQLite
- If using DuckDB, see Working with DuckDB
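For SQLite, one option is to download the Xerial sqlite-jdbc jar from Maven Central into a lib/ folder; the version number here is only an example, so check for the current release:

mkdir -p lib
wget -P lib https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.36.0.3/sqlite-jdbc-3.36.0.3.jar

If you keep the versioned filename, point spark.driver.extraClassPath in the next step at that exact file, or rename it to lib/sqlite-jdbc.jar to match the example config.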
3. Configure Spark to Use the JDBC Driver¶
Example PySpark config:
from pyspark.sql import SparkSession

# Point the Spark driver at the JDBC jar so it can load the database driver class
spark = (
    SparkSession.builder
    .appName("SmartSales")
    .config("spark.driver.extraClassPath", "lib/sqlite-jdbc.jar")
    .getOrCreate()
)
If Spark cannot find the driver, use an absolute path.
Restart the notebook kernel after updating Spark configuration.
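If a relative path does not work, a minimal sketch of the absolute-path variant (assuming the jar sits in a lib/ folder inside the project) is:

from pathlib import Path
from pyspark.sql import SparkSession

# Resolve the jar relative to the current working directory, then pass the absolute path
jar_path = str(Path("lib/sqlite-jdbc.jar").resolve())

spark = (
    SparkSession.builder
    .appName("SmartSales")
    .config("spark.driver.extraClassPath", jar_path)
    .getOrCreate()
)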
4. Test Spark in a Notebook¶
# Read the fact table over JDBC; if Spark reports "No suitable driver",
# add driver="org.sqlite.JDBC" to the options to name the driver class explicitly
df = spark.read.format("jdbc").options(
    url="jdbc:sqlite:data/warehouse/smart_sales.db",
    dbtable="fact_sale",
).load()
df.show()
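If you are using DuckDB instead of SQLite, the same pattern applies with the DuckDB JDBC driver on the classpath; the database path and table name below are assumptions based on this project's layout:

# DuckDB variant: requires the DuckDB JDBC jar on spark.driver.extraClassPath
df = spark.read.format("jdbc").options(
    url="jdbc:duckdb:data/warehouse/smart_sales.duckdb",
    dbtable="fact_sale",
    driver="org.duckdb.DuckDBDriver",
).load()
df.printSchema()
df.show()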