# Star Schema Overview for smart-sales-example
This project uses a clean, minimal star schema to illustrate the core concepts of business intelligence (BI) data modeling. It provides a simple example of a BI project incorporating ETL, SQL, OLAP, and analytics workflows in a modern Python and SQLite (or DuckDB) pipeline.
## Initial Star Schema
We have one fact table and two dimension tables:
### Dimension: customer

- `customer_id` INTEGER PRIMARY KEY
- `name` TEXT
- `region` TEXT
- `join_date` TEXT
### Dimension: product

- `product_id` INTEGER PRIMARY KEY
- `product_name` TEXT
- `category` TEXT
- `unit_price` REAL
### Fact: sale

- `sale_id` INTEGER PRIMARY KEY
- `customer_id` INTEGER -- FK to customer
- `product_id` INTEGER -- FK to product
- `sale_amount` REAL -- measure
- `sale_date` TEXT -- transaction date
The grain of the fact table is: one row per sale event (customer × product × date).
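As a concrete sketch, the schema above can be created in SQLite with a few DDL statements. The snippet below uses Python's built-in `sqlite3` module; the database file name `smart_sales.db` is an assumption for illustration, not a project requirement.

```python
import sqlite3

# Sketch only: create the initial star schema in a local SQLite file.
# The file name "smart_sales.db" is an assumption, not a project requirement.
conn = sqlite3.connect("smart_sales.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    region      TEXT,
    join_date   TEXT
);

CREATE TABLE IF NOT EXISTS product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,
    unit_price   REAL
);

CREATE TABLE IF NOT EXISTS sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    sale_amount REAL,   -- measure
    sale_date   TEXT    -- transaction date
);
""")
conn.commit()
conn.close()
```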
## Star Schema

A data model is a star schema when it contains:

- A central fact table containing numeric/measurable events: `sale.sale_amount`
- One or more dimension tables containing descriptive attributes: `customer`, `product`
- Foreign key relationships from the fact table to the dimensions: `sale.customer_id`, `sale.product_id`
- A well-defined grain: each fact row is a single sale event
A star schema makes it easy to write analytic queries such as:
- Sales by customer
- Sales by product or category
- Sales by region
- Sales by month or year
- Top 10 customers by revenue
- Product category profitability
- etc.
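For example, the "Top 10 customers by revenue" item above reduces to a single join and aggregate. The sketch below assumes the `smart_sales.db` warehouse file from the DDL example and uses only the standard-library `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect("smart_sales.db")  # assumed warehouse file
top_customers = conn.execute("""
    SELECT c.name,
           c.region,
           SUM(s.sale_amount) AS total_revenue
    FROM sale AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.name, c.region
    ORDER BY total_revenue DESC
    LIMIT 10
""").fetchall()

for name, region, revenue in top_customers:
    print(f"{name} ({region}): {revenue:.2f}")
conn.close()
```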
Our project is small but demonstrates a full BI lifecycle:
- Data cleaning
- Conformed dimensions
- Fact table construction
- ETL
- OLAP queries (`GROUP BY`, aggregates, drill-downs)
- Visualizations
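A minimal sketch of the cleaning and loading steps might look like the following. It assumes pandas as the ETL tool, and the CSV paths are placeholders rather than actual project files.

```python
import re
import sqlite3

import pandas as pd


def to_snake_case(name: str) -> str:
    """Turn a messy header such as 'Unit Price ($)' into 'unit_price'."""
    name = re.sub(r"[^\w\s]", "", name)              # drop punctuation/symbols
    return re.sub(r"\s+", "_", name.strip()).lower()


def load_csv(path: str, table: str, conn: sqlite3.Connection) -> None:
    """Read one CSV, clean its headers, and load it into the warehouse."""
    df = pd.read_csv(path)
    df.columns = [to_snake_case(col) for col in df.columns]
    df.to_sql(table, conn, if_exists="replace", index=False)


conn = sqlite3.connect("smart_sales.db")          # assumed warehouse file
load_csv("data/customers.csv", "customer", conn)  # placeholder paths
load_csv("data/products.csv", "product", conn)
load_csv("data/sales.csv", "sale", conn)
conn.close()
```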
## Modifications
- Add your new columns to the dimension CSVs
- Extend the schema with your additional attributes
- Transform any messy headers and use clean snake_case throughout
- Load the warehouse with the ETL pipeline
- Once the data is loaded into the warehouse, we will build:
  - Aggregation queries
  - Multi-dimensional reports
  - Simple dashboards
- Perform OLAP-style analysis:
  - Drill down by category
  - Slice by region
  - Trend analysis using `sale_date`
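These OLAP-style steps translate directly into `GROUP BY` queries over the fact table and its dimensions. The sketch below combines a region slice, a category drill-down, and a monthly trend; the region value is a placeholder, and it assumes `sale_date` is stored in ISO `YYYY-MM-DD` format so SQLite's `strftime` can bucket it by month.

```python
import sqlite3

conn = sqlite3.connect("smart_sales.db")  # assumed warehouse file

# Slice by region, drill down by product category, and trend by month,
# all from one GROUP BY over the fact table joined to its dimensions.
rows = conn.execute("""
    SELECT p.category,
           strftime('%Y-%m', s.sale_date) AS sale_month,
           SUM(s.sale_amount)             AS total_sales
    FROM sale AS s
    JOIN product  AS p ON p.product_id  = s.product_id
    JOIN customer AS c ON c.customer_id = s.customer_id
    WHERE c.region = 'West'              -- slice: placeholder region value
    GROUP BY p.category, sale_month      -- drill down: category x month
    ORDER BY p.category, sale_month
""").fetchall()

for category, month, total in rows:
    print(category, month, total)
conn.close()
```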
This repo illustrates a complete, end-to-end BI project.
## Additional Improvements
To align with more formal BI/warehouse naming conventions:
### 1. Table naming
Rename:
| Current | Improved |
|---|---|
| `customer` | `dim_customer` |
| `product` | `dim_product` |
| `sale` | `fact_sale` or `fact_sales` |
These `dim_`/`fact_` prefixes are a common convention in Snowflake, Redshift, dbt, and enterprise BI environments.
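In SQLite the rename is a one-statement change per table; a sketch, assuming the warehouse file used in the earlier examples:

```python
import sqlite3

# Sketch: apply the dim_/fact_ prefixes to the existing tables.
conn = sqlite3.connect("smart_sales.db")  # assumed warehouse file
conn.executescript("""
ALTER TABLE customer RENAME TO dim_customer;
ALTER TABLE product  RENAME TO dim_product;
ALTER TABLE sale     RENAME TO fact_sale;
""")
conn.commit()
conn.close()
```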
### 2. Add a proper date dimension

Instead of storing `sale_date` in the fact table:

- create a `dim_date`
- join it via `date_id`
- allow easy year/month/week rollups
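A sketch of that change is shown below. The `dim_date` columns and the `YYYYMMDD` surrogate-key format are assumptions, and the fact table is referred to by its original name `sale` for simplicity.

```python
import sqlite3

conn = sqlite3.connect("smart_sales.db")  # assumed warehouse file
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_date (
    date_id   INTEGER PRIMARY KEY,  -- assumed YYYYMMDD surrogate key, e.g. 20240315
    full_date TEXT,
    year      INTEGER,
    month     INTEGER,
    week      INTEGER
);

-- Populate dim_date from the dates already present in the fact table.
INSERT OR IGNORE INTO dim_date (date_id, full_date, year, month, week)
SELECT CAST(strftime('%Y%m%d', sale_date) AS INTEGER),
       sale_date,
       CAST(strftime('%Y', sale_date) AS INTEGER),
       CAST(strftime('%m', sale_date) AS INTEGER),
       CAST(strftime('%W', sale_date) AS INTEGER)
FROM sale;

-- Give the fact table a date_id so rollups can join through dim_date.
ALTER TABLE sale ADD COLUMN date_id INTEGER REFERENCES dim_date(date_id);
UPDATE sale SET date_id = CAST(strftime('%Y%m%d', sale_date) AS INTEGER);
""")
conn.commit()
conn.close()
```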
### 3. Add more measures

Examples:

- `quantity`
- `discount_amount`
- `profit` (calculated)
- `unit_cost`
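Once those columns exist on the fact table, a calculated `profit` measure can be derived at query time. The sketch below assumes the extended fact table (original table names kept) and a simple `sale_amount - discount_amount - quantity * unit_cost` definition of profit, which is an illustrative assumption rather than the project's formula.

```python
import sqlite3

conn = sqlite3.connect("smart_sales.db")  # assumed warehouse file

# Assumes sale has been extended with quantity, discount_amount and unit_cost.
rows = conn.execute("""
    SELECT p.category,
           SUM(s.sale_amount - s.discount_amount - s.quantity * s.unit_cost) AS profit
    FROM sale AS s
    JOIN product AS p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY profit DESC
""").fetchall()

for category, profit in rows:
    print(f"{category}: {profit:.2f}")
conn.close()
```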
### 4. Add more descriptive attributes
Extend:
- Customer: age, loyalty segment, state, customer_type
- Product: brand, supplier, subcategory
New attributes expand analytic possibilities.