app_case.py - Project script (example).
Author: Denise Case
Date: 2026-04
Purpose
- exploratory data analysis (EDA)
- loading a dataset with seaborn
- inspecting a DataFrame with pandas
- checking for missing values
- computing descriptive statistics
- grouping and aggregating
- computing a correlation matrix
- creating charts with seaborn and matplotlib
Data Source:
- Palmer Archipelago (Antarctica) penguin data
- Available via Seaborn
Assumptions:
- The data contains columns like:
species, island, bill_length_mm, bill_depth_mm,
flipper_length_mm, body_mass_g, sex, year
Terminal command to run this file from the root project folder:
uv run python -m datafun.app_case
OBS
Don't edit this file - it should remain a working example.
Copy it, rename it, and modify your copy.
build_data_dictionary
build_data_dictionary(df: DataFrame) -> pd.DataFrame
Build a starter data dictionary.
Includes:
- column name
- data type
- missing value count
- percent missing
WHY: A data dictionary helps with understanding the structure and quality of the data.
Arguments:
- df: The DataFrame to analyze.
Returns:
- pd.DataFrame: A data dictionary summarizing the columns.
Source code in src/datafun/app_case.py
| def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
"""Build a starter data dictionary.
Includes:
- column name
- data type
- missing value count
- percent missing
WHY: A data dictionary helps with understanding the structure and quality of the data.
Arguments:
- df: The DataFrame to analyze.
Returns:
- pd.DataFrame: A data dictionary summarizing the columns.
"""
LOG.info("Building starter data dictionary")
data_dictionary = pd.DataFrame(
{
"column": df.columns,
"dtype": [str(t) for t in df.dtypes],
"missing_count": df.isna().sum().values,
"missing_pct": (df.isna().mean() * 100).round(2).values,
}
)
LOG.debug(f"\n{data_dictionary}")
return data_dictionary
|
check_quality
check_quality(df: DataFrame) -> None
Perform basic data quality checks.
Checks include:
- Missing values
- Duplicate rows
- Basic numeric sanity checks
WHY: Data quality issues can affect analysis results.
It's important to identify them early in the EDA process.
Arguments:
- df: The DataFrame to check.
Returns:
- None (logs results)
Source code in src/datafun/app_case.py
| def check_quality(df: pd.DataFrame) -> None:
"""Perform basic data quality checks.
Checks include:
- Missing values
- Duplicate rows
- Basic numeric sanity checks
WHY: Data quality issues can affect analysis results.
It's important to identify them early in the EDA process.
Arguments:
- df: The DataFrame to check.
Returns:
- None (logs results)
"""
LOG.info("Missing values per column:")
LOG.info(f"\n{df.isnull().sum()}")
LOG.info("Checking missing values per column")
LOG.debug(f"\n{df.isna().sum().sort_values(ascending=False)}")
dup_count = int(df.duplicated().sum())
LOG.info(f"Duplicate rows detected: {dup_count}")
LOG.info("Call describe() for numeric columns")
LOG.debug(f"\n{df[SELECTED_NUMERIC_COLS].describe()}\n")
|
correlation_matrix
correlation_matrix(df_clean: DataFrame) -> pd.DataFrame
Compute a simple numeric correlations to understand
relationships between numeric variables.
A correlation matrix is symmetric.
There are as many columns as numeric variables.
There are as many rows as numeric variables.
The diagonal values are always exactly 1.0.
since each variable perfectly correlates with itself.
WHY: Correlation tells us how numeric variables relate to each other.
- Values near 1 or -1 indicate strong relationships
- Values near 0 indicate weak or no linear relationship
Parameters:
| Name |
Type |
Description |
Default |
df_clean
|
DataFrame
|
Cleaned DataFrame for analysis.
|
required
|
Returns:
| Type |
Description |
DataFrame
|
pd.DataFrame: Correlation matrix of numeric columns.
|
Source code in src/datafun/app_case.py
| def correlation_matrix(df_clean: pd.DataFrame) -> pd.DataFrame:
"""Compute a simple numeric correlations to understand
relationships between numeric variables.
A correlation matrix is symmetric.
There are as many columns as numeric variables.
There are as many rows as numeric variables.
The diagonal values are always exactly 1.0.
since each variable perfectly correlates with itself.
WHY: Correlation tells us how numeric variables relate to each other.
- Values near 1 or -1 indicate strong relationships
- Values near 0 indicate weak or no linear relationship
Args:
df_clean (pd.DataFrame): Cleaned DataFrame for analysis.
Returns:
pd.DataFrame: Correlation matrix of numeric columns.
"""
LOG.info("Computing correlation matrix for numeric columns")
# Select only numeric columns
df_clean_numeric_cols: pd.DataFrame = df_clean.select_dtypes(include="number")
# calculate the correlation matrix using the df corr() method
correlation_matrix = df_clean_numeric_cols.corr()
LOG.info("\nCorrelation matrix:")
LOG.debug(f"\n{correlation_matrix}")
LOG.info("---------Visualize Correlation Matrix as a Heatmap---------------")
# Open a fresh blank canvas before a new chart
plt.figure()
# Use a heatmap() to visualize correlation matrix
heatmap: Axes = sns.heatmap(
correlation_matrix,
annot=True, # Set annotations to True to show correlation values
cmap="coolwarm", # try viridis, plasma, or other colormaps
center=0,
)
heatmap.set_title("Correlation Matrix Heatmap")
# IN NOTEBOOK: SHOW AS YOU GO
# plt.show() displays the current chart and closes it
# Call this before starting a new chart
# or next chart will be drawn on top of this one
# IN SCRIPT: WAIT TO SHOW TILL THE END
# Do not call plt.show() here - let figures accumulate
# so all charts display together with sequential Figure numbers.
# plt.show() is called once at the end of make_plots()
# plt.show()
LOG.info("""
CUSTOM: Update these notes and use Markdown cells to narrate and tell the story as you explore. For example:
Interpretation:
- Values close to 1 (dark red) = strong positive correlation (both increase together)
- Values close to -1 (dark blue) = strong negative correlation (one increases, other decreases)
- Values close to 0 (white) = little or no linear relationship
- The diagonal is always 1 (each variable correlates perfectly with itself)
From this heatmap, we can see that flipper_length_mm and body_mass_g show strong positive correlation (~0.87).
""")
return correlation_matrix
|
descriptive_stats
descriptive_stats(
df_clean: DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame]
Compute descriptive statistics overall and by group.
WHY: Summary statistics offer a quick overview of numeric data:
- Central tendency (mean)
- Spread (std, min, max)
- Distribution shape (quartiles)
Grouping by a categorical variable (i.e., non-numeric column)
enables comparing statistics across categories
Parameters:
| Name |
Type |
Description |
Default |
df_clean
|
DataFrame
|
Cleaned DataFrame for analysis.
|
required
|
Returns:
| Type |
Description |
tuple[DataFrame, DataFrame]
|
tuple[pd.DataFrame, pd.DataFrame]: Overall stats, stats by group.
|
Notes:
- Descriptive statistics summarize key aspects of numeric data.
- Grouped stats help compare across categories.
Source code in src/datafun/app_case.py
| def descriptive_stats(df_clean: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Compute descriptive statistics overall and by group.
WHY: Summary statistics offer a quick overview of numeric data:
- Central tendency (mean)
- Spread (std, min, max)
- Distribution shape (quartiles)
Grouping by a categorical variable (i.e., non-numeric column)
enables comparing statistics across categories
Args:
df_clean (pd.DataFrame): Cleaned DataFrame for analysis.
Returns:
tuple[pd.DataFrame, pd.DataFrame]: Overall stats, stats by group.
Notes:
- Descriptive statistics summarize key aspects of numeric data.
- Grouped stats help compare across categories.
"""
LOG.info("--------------- Manual statistics ---------------")
# Example: Calculate statistics for a specific column with numpy
mean_body_mass = np.mean(df_clean["body_mass_g"])
std_body_mass = np.std(df_clean["body_mass_g"])
min_body_mass = np.min(df_clean["body_mass_g"])
max_body_mass = np.max(df_clean["body_mass_g"])
range_body_mass = np.ptp(df_clean["body_mass_g"]) # peak to peak (max - min)
# Log the example results with formatting
LOG.debug("Body Mass Statistics (using numpy):")
LOG.debug(f" Mean: {mean_body_mass:.2f} g")
LOG.debug(f" Std Dev: {std_body_mass:.2f} g")
LOG.debug(f" Min: {min_body_mass:.2f} g")
LOG.debug(f" Max: {max_body_mass:.2f} g")
LOG.debug(f" Range: {range_body_mass:.2f} g")
LOG.info("--------------- Using pandas describe() method ---------------")
LOG.info("Computing overall descriptive statistics")
# Use describe() to get count, mean, std, min, 25%, 50%, 75%, max for numeric columns
# OPTION: Use .T to transpose the result so that columns become rows for easier reading in logs
stats_overall = df_clean[SELECTED_NUMERIC_COLS].describe().T
LOG.debug(f"\n{stats_overall}")
LOG.info("--------------- Using pandas groupby() and agg() ---------------")
LOG.info("Computing descriptive statistics by group")
# Step 1: Select only the numeric columns we want to summarize
df_numeric_subset: pd.DataFrame = df_clean[SELECTED_NUMERIC_COLS]
# Step 2: Split the numeric subset into groups based on the grouping column
# groupby() returns a GroupBy object - not a DataFrame yet, just a plan to group
grouped = df_numeric_subset.groupby(df_clean[GROUP_COL])
# Step 3: For each group, compute multiple summary statistics at once
# agg() applies each function in the list to each numeric column
# The result has a multi-level column index: (numeric_column, statistic)
df_stats_by_group: pd.DataFrame = grouped.agg(
["count", "mean", "std", "min", "max"]
)
LOG.debug(f"\n{df_stats_by_group}")
LOG.info("\nStacked view - easier to read in logs:")
# Yuck: That's the multi-level column index in action.
# pandas lays out the result as (numeric_column, statistic) pairs
# side by side, wrapping when the terminal width runs out.
# With 4 numeric columns × 5 statistics = 20 columns total,
# it can only fit 2 numeric columns per line at 120 characters wide.
# Let's stack it so each numeric column's stats are grouped together
# vertically instead of horizontally.
stats_by_group_stacked: pd.DataFrame | pd.Series[Any] = df_stats_by_group.stack(
level=0
)
LOG.debug(f"\n{stats_by_group_stacked}")
return stats_overall, df_stats_by_group
|
inspect_basic
inspect_basic(df: DataFrame) -> None
Inspect the basic structure of the dataset.
WHY: Always start by understanding what columns exist,
what types they are, and how large the dataset is.
- How many rows and columns are there?
- What types of data are present?
- Are there obvious missing values?
This step determines challenges we might have downstream (later).
Parameters:
| Name |
Type |
Description |
Default |
df
|
DataFrame
|
The DataFrame to inspect.
|
required
|
Returns:
Source code in src/datafun/app_case.py
| def inspect_basic(df: pd.DataFrame) -> None:
"""Inspect the basic structure of the dataset.
WHY: Always start by understanding what columns exist,
what types they are, and how large the dataset is.
- How many rows and columns are there?
- What types of data are present?
- Are there obvious missing values?
This step determines challenges we might have downstream (later).
Arguments:
df: The DataFrame to inspect.
Returns:
None
"""
# Preview the first few rows
LOG.info("Previewing first few rows of the dataset")
LOG.debug(f"\n{df.head()}")
LOG.info("Column names")
LOG.debug(f"{list(df.columns)}")
LOG.info("DataFrame info (types and non-null counts)")
df.info()
# Get shape - number of rows and columns
# It has two parts so the return value is a tuple of (num_rows, num_columns)
shape: tuple[int, int] = df.shape
# To get each value, we can unpack the tuple into two variables
# This is a common Python idiom for working with tuples.
# Or we could just use shape[0] and shape[1] directly without unpacking.
num_rows, num_cols = shape
LOG.info(f"Dataset shape: {num_rows} rows, {num_cols} columns")
|
load_data
load_data() -> pd.DataFrame
Load a dataset into a DataFrame.
Seaborn provides clean built-in datasets for practice.
Other projects may load from CSV, JSON, or a database.
Arguments: None
Returns:
| Type |
Description |
DataFrame
|
pd.DataFrame: The loaded dataset.
|
Source code in src/datafun/app_case.py
| def load_data() -> pd.DataFrame:
"""Load a dataset into a DataFrame.
Seaborn provides clean built-in datasets for practice.
Other projects may load from CSV, JSON, or a database.
Arguments: None
Returns:
pd.DataFrame: The loaded dataset.
"""
LOG.info(f"Loading dataset: {DATASET_NAME}")
df: pd.DataFrame = sns.load_dataset(DATASET_NAME)
LOG.info(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns")
return df
|
main
Main function to run the EDA workflow.
Source code in src/datafun/app_case.py
| def main() -> None:
"""Main function to run the EDA workflow."""
log_header(LOG, "EDA")
LOG.info("========================")
LOG.info("START main()")
LOG.info("========================")
LOG.info(f"--- Section 2: Load dataset: {DATASET_NAME} ---")
df = load_data()
LOG.info("--- Section 3: Inspect shape and basic structure ---")
inspect_basic(df)
LOG.info("--- Section 4: Create Data Dictionary and Check Data Quality ---")
build_data_dictionary(df)
check_quality(df)
LOG.info("--- Section 5: Create a cleaned view for EDA ---")
df_clean = make_clean_view(df)
LOG.info("--- Section 6: Descriptive statistics for numeric columns ---")
descriptive_stats(df_clean)
LOG.info("--- Section 7: Correlation matrix for numeric columns ---")
correlation_matrix(df_clean)
LOG.info("--- Section 8: Charts ---")
make_plots(df_clean)
LOG.info("--- Section 9: Summary and next steps ---")
summarize(df, df_clean)
LOG.info(
"----- in a script, call plt.show() once at the end to display all charts -----"
)
LOG.info(
"----- in a script, close the chart windows (with the close button) to continue -----"
)
plt.show()
LOG.info("EDA workflow complete")
LOG.info("IMPORTANT: This script creates chart windows.")
LOG.info(
"Close any chart windows and terminate this process with CTRL+c as needed."
)
LOG.info("========================")
LOG.info("Executed successfully!")
LOG.info("========================")
|
make_clean_view
make_clean_view(df: DataFrame) -> pd.DataFrame
Create a cleaned view for EDA.
Strategy:
- Keep the original DataFrame unchanged
- Drop rows missing key numeric fields and grouping field
WHY: EDA often focuses on a "clean" subset of the data.
This allows exploring patterns without being distracted by missing values.
Arguments:
- df: The original DataFrame.
Returns:
- pd.DataFrame: A cleaned view of the data for EDA.
Source code in src/datafun/app_case.py
| def make_clean_view(df: pd.DataFrame) -> pd.DataFrame:
"""Create a cleaned view for EDA.
Strategy:
- Keep the original DataFrame unchanged
- Drop rows missing key numeric fields and grouping field
WHY: EDA often focuses on a "clean" subset of the data.
This allows exploring patterns without being distracted by missing values.
Arguments:
- df: The original DataFrame.
Returns:
- pd.DataFrame: A cleaned view of the data for EDA.
"""
LOG.info("Creating cleaned view for EDA (dropping rows with key missing values)")
# Build the list of columns we require to be non-missing
# This includes all the selected numeric columns plus the grouping column.
# SELECTED_NUMERIC_COLS is a list of strings,
# GROUP_COL is a single string
# Wrap GROUP_COL in a list - two lists can be combined with +
cols_required: list[str] = SELECTED_NUMERIC_COLS + [GROUP_COL]
LOG.debug(f"Columns required to be non-missing: {cols_required}")
# Drop a row if it is missing a value in ANY of the required columns
# Use the dropna() method with subset= to specify which columns to check for missing values.
# Dropna means "drop rows if not available (na) that is, they have missing values". By default, it checks all columns, but we only want to check the key columns.
# dropna(subset=...) only looks at the specified columns, not the whole row
# .copy() creates a new DataFrame so we don't accidentally modify the original
df_clean: pd.DataFrame = df.dropna(subset=cols_required).copy()
# Report what was kept and what was dropped
count_original: int = df.shape[0]
count_clean: int = df_clean.shape[0]
count_dropped: int = count_original - count_clean
LOG.info(f"Original rows: {count_original}")
LOG.info(f"Clean rows: {count_clean}")
LOG.info(f"Rows dropped: {count_dropped}")
return df_clean
|
make_plots
make_plots(df_clean: DataFrame) -> None
Create simple, notebook-friendly plots.
WHY: Visualizations reveal patterns not obvious in tables.
CUSTOM: Charts will vary depending on the dataset
and questions of interest.
Common charts include:
1. A scatter plot to see relationships between two variables
2. A box plot to compare distributions across groups
A scatter plot shows the relationship between two numeric variables.
In this example:
- Each dot is one data record shown as x vs y.
- Color (hue) provides a third dimension.
A box plot shows the distribution of one numeric variable across groups.
- The box shows the middle 50% of values.
- The line inside the box is the median.
- The whiskers show the range. Dots beyond the whiskers are outliers.
Source code in src/datafun/app_case.py
| def make_plots(df_clean: pd.DataFrame) -> None:
"""Create simple, notebook-friendly plots.
WHY: Visualizations reveal patterns not obvious in tables.
CUSTOM: Charts will vary depending on the dataset
and questions of interest.
Common charts include:
1. A scatter plot to see relationships between two variables
2. A box plot to compare distributions across groups
A scatter plot shows the relationship between two numeric variables.
In this example:
- Each dot is one data record shown as x vs y.
- Color (hue) provides a third dimension.
A box plot shows the distribution of one numeric variable across groups.
- The box shows the middle 50% of values.
- The line inside the box is the median.
- The whiskers show the range. Dots beyond the whiskers are outliers.
"""
LOG.info("---- Creating Scatter Plot to see Relationships ------")
LOG.info("---- Use clean dataframe ---------------------------")
LOG.info("---- Set x to flipper length -----------------------")
LOG.info("---- Set y to bill length --------------------------")
LOG.info("---- Set the hue (color mapping) to the group column --")
# Open a fresh blank canvas before a new chart
plt.figure()
# Use a scatterplot() to visualize relationships between two variables (x vs y)
scatter_plt: Axes = sns.scatterplot(
data=df_clean,
x="flipper_length_mm",
y="bill_length_mm",
hue=GROUP_COL,
)
scatter_plt.set_xlabel("Flipper length (mm)")
scatter_plt.set_ylabel("Bill length (mm)")
scatter_plt.set_title("Flipper length vs Bill length (by species)")
# IN NOTEBOOK: SHOW AS YOU GO
# plt.show() displays the current chart and closes it
# Call this before starting a new chart
# or next chart will be drawn on top of this one
# IN SCRIPT: WAIT TO SHOW TILL THE END
# Do not call plt.show() here - let figures accumulate
# so all charts display together with sequential Figure numbers.
# plt.show() is called once at the end of make_plots()
# plt.show()
LOG.info("------ Creating Box Plot to see Distribution: ---------")
LOG.info("------ Use clean dataframe --------------------------")
LOG.info("------ Set x to the group column --------------------")
LOG.info("------ Set y to flipper length ----------------------")
# Open a fresh blank canvas before a new chart
plt.figure()
# Use a boxplot() to visualize the distribution of a numeric variable across groups
box_plt: Axes = sns.boxplot(
data=df_clean,
x=GROUP_COL,
y="flipper_length_mm",
)
box_plt.set_title("Flipper length by species")
|
summarize
summarize(df: DataFrame, df_clean: DataFrame) -> None
Log a brief summary of findings and suggested next steps.
WHY: EDA is not the final report.
A summary captures what was found and what to investigate next.
Parameters:
| Name |
Type |
Description |
Default |
df
|
DataFrame
|
|
required
|
df_clean
|
DataFrame
|
|
required
|
Returns:
Source code in src/datafun/app_case.py
| def summarize(df: pd.DataFrame, df_clean: pd.DataFrame) -> None:
"""Log a brief summary of findings and suggested next steps.
WHY: EDA is not the final report.
A summary captures what was found and what to investigate next.
Arguments:
df: The original DataFrame.
df_clean: The cleaned DataFrame.
Returns:
None
"""
LOG.info("========================")
LOG.info("SUMMARY")
LOG.info("========================")
LOG.info(f"Dataset: {DATASET_NAME}")
LOG.info(f"Original rows: {df.shape[0]}")
LOG.info(f"Clean rows: {df_clean.shape[0]}")
# Get the unique values in the grouping column (e.g. species names)
unique_groups_array: np.ndarray = df_clean[GROUP_COL].unique()
# Sort them alphabetically so the output is consistent and readable
sorted_groups: list[str] = sorted(unique_groups_array)
LOG.info(f"Groups found in {GROUP_COL}: {sorted_groups}")
LOG.info("Strongest correlation: ")
LOG.info(" flipper_length_mm and body_mass_g (~0.87)")
LOG.info("Suggested next step: ")
LOG.info(" Model body_mass_g ~ flipper_length_mm with linear regression")
|