Project Instructions (Module 5: Relational Data and SQL Analytics)¶

WEDNESDAY: Complete Workflow Phases 1-3¶

Follow the instructions in ⭐ Workflow: Apply Example to complete:

Phase 1. Start & Run - copy the project and confirm it runs
Phase 2. Change Authorship - update the project to your name and GitHub account
Phase 3. Read & Understand - review the project structure and code

This project includes two working examples - run both and compare:

uv run python -m datafun.app_retail_duckdb_case
uv run python -m datafun.app_retail_sqlite_case

Read the corresponding .sql files in sql/ alongside each Python script.

Again, follow the instructions above to complete:

Phase 4. Make a Technical Modification - make a change and verify it still runs
Phase 5. Apply the Skills to a New Problem

Make a small technical change that does not break the pipeline. Choose any one of these (or a different modification as you like):

Add a new SQL query file that computes a different KPI for the retail domain (e.g., total units sold per store, average sale value, number of sales above a threshold)
Modify an existing query to add a WHERE clause that filters results
Add an ORDER BY clause to a query that currently returns unordered results
Add a new run_sql_query() call in the Python script so your new query runs and logs the result
Add a COUNT(*) query that verifies the number of rows loaded matches the source CSV

Confirm the pipeline still runs successfully after your change.

Implement the same pipeline for one of the provided non-retail domains.

Available domains in data/raw/:

Steps:

Choose one domain (not retail)
Create SQL files in sql/duckdb/ named yourname_<domain>_bootstrap.sql, yourname_<domain>_clean.sql, and at least three query files
Create a Python script src/datafun/app_<domain>_duckdb_yourname.py for a new domain modeled on the example
Run your new script with uv run python -m datafun.app_<domain>_duckdb_yourname
Confirm your generated database appears in artifacts/
If your pipeline exports processed CSV, Parquet, or summary files, place them in data/processed/
Commit all .sql files, your .py file, and the generated database so your completed pipeline artifacts are visible.

Then:

Implement a pipeline using a different relational or columnar storage format.

Good options:

DuckDB with a persistent file - analytics-oriented, file-based database
SQLite with a persistent file - portable, file-based relational database
DuckDB with an in-memory database - no file, useful for exploration and testing
Parquet - columnar storage format that DuckDB can query well

Steps:

Choose a format
Load the same retail (or alternate domain) CSV data into your chosen format
Write at least three queries that answer analytical questions
Compare the experience to DuckDB: what is easier, what is harder, what is different?
Document your findings in your README, docs page, or log output

Then:

As you work, focus on:

how SQL SELECT, WHERE, GROUP BY, ORDER BY, and JOIN answer analytical questions
how Python orchestrates SQL execution and handles results
how file-based databases (DuckDB, SQLite) differ from server-based ones (e.g., PostgreSQL, MySQL)
how separating SQL into .sql files makes queries reviewable and reusable
how a 1-to-many relationship is represented in two related tables

Your goal is to query structured data with SQL and interpret the results analytically.

Make sure the title and narrative reflect your work. Verify key files:

Ensure your project clearly demonstrates: