Project Instructions (Module 5: Relational Data and SQL Analytics)¶
WEDNESDAY: Complete Workflow Phases 1-3¶
Follow the instructions in ⭐ Workflow: Apply Example to complete:
- Phase 1. Start & Run - copy the project and confirm it runs
- Phase 2. Change Authorship - update the project to your name and GitHub account
- Phase 3. Read & Understand - review the project structure and code
This project includes two working examples - run both and compare:
Read the corresponding .sql files in sql/ alongside each Python script.
FRIDAY/SUNDAY: Complete Workflow Phases 4-5¶
Again, follow the instructions above to complete:
- Phase 4. Make a Technical Modification - make a change and verify it still runs
- Phase 5. Apply the Skills to a New Problem
Phase 4 Suggestions¶
Make a small technical change that does not break the pipeline. Choose any one of these (or a different modification as you like):
- Add a new SQL query file that computes a different KPI for the retail domain (e.g., total units sold per store, average sale value, number of sales above a threshold)
- Modify an existing query to add a
WHEREclause that filters results - Add an
ORDER BYclause to a query that currently returns unordered results - Add a new
run_sql_query()call in the Python script so your new query runs and logs the result - Add a
COUNT(*)query that verifies the number of rows loaded matches the source CSV
Confirm the pipeline still runs successfully after your change.
Phase 5 Suggestions¶
Phase 5 Suggestion 1. New Domain, Same Database Engine: DuckDB (Directed)¶
Implement the same pipeline for one of the provided non-retail domains.
Available domains in data/raw/:
library- a library branch manages many checkoutsshelter- a shelter manages many animal adoptionscivic_event- an event manages many attendees
Steps:
- Choose one domain (not retail)
- Create SQL files in
sql/duckdb/namedyourname_<domain>_bootstrap.sql,yourname_<domain>_clean.sql, and at least three query files - Create a Python script
src/datafun/app_<domain>_duckdb_yourname.pyfor a new domain modeled on the example - Run your new script with
uv run python -m datafun.app_<domain>_duckdb_yourname - Confirm your generated database appears in
artifacts/ - If your pipeline exports processed CSV, Parquet, or summary files, place them in
data/processed/ - Commit all
.sqlfiles, your.pyfile, and the generated database so your completed pipeline artifacts are visible.
Then:
- Describe the two entities in your domain and their 1-to-many relationship
- Write one SQL query of your own that answers a question about your data
- Explain when SQL is a better tool than pandas for this kind of question
Phase 5 Suggestion 2. Different Storage Format (Original)¶
Implement a pipeline using a different relational or columnar storage format.
Good options:
- DuckDB with a persistent file - analytics-oriented, file-based database
- SQLite with a persistent file - portable, file-based relational database
- DuckDB with an in-memory database - no file, useful for exploration and testing
- Parquet - columnar storage format that DuckDB can query well
Steps:
- Choose a format
- Load the same retail (or alternate domain) CSV data into your chosen format
- Write at least three queries that answer analytical questions
- Compare the experience to DuckDB: what is easier, what is harder, what is different?
- Document your findings in your README, docs page, or log output
Then:
- Explain why you chose this format and what use cases it is best suited for
- Describe one limitation of the format you encountered
- Identify which format you would recommend for a team analytics workflow and why
Key Skill Focus¶
As you work, focus on:
- how SQL
SELECT,WHERE,GROUP BY,ORDER BY, andJOINanswer analytical questions - how Python orchestrates SQL execution and handles results
- how file-based databases (DuckDB, SQLite) differ from server-based ones (e.g., PostgreSQL, MySQL)
- how separating SQL into
.sqlfiles makes queries reviewable and reusable - how a 1-to-many relationship is represented in two related tables
Your goal is to query structured data with SQL and interpret the results analytically.
Professional Communication¶
Make sure the title and narrative reflect your work. Verify key files:
- README.md
- docs/ (source and hosted on GitHub Pages)
- sql/ (your query files, correctly named)
- src/ (your pipeline script, correctly named)
- data/raw/ = source inputs
- data/processed/ (processed data exports, if the pipeline creates any)
- artifacts/ (generated database files or reports)
Ensure your project clearly demonstrates:
- a complete SQL pipeline for a non-retail domain
- at least three meaningful queries with logged results
- understanding of the relational structure of your chosen domain