API Reference

This page is auto-generated from Python docstrings.

datafun.app_case

app_case.py - Project script (example).

Author: Denise Case
Date: 2026-04

Purpose

exploratory data analysis (EDA)
loading a dataset with seaborn
inspecting a DataFrame with pandas
checking for missing values
computing descriptive statistics
grouping and aggregating
computing a correlation matrix
creating charts with seaborn and matplotlib

Data Source:
- Palmer Archipelago (Antarctica) penguin data
- Available via Seaborn

Assumptions:
- The data contains columns like: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year

Terminal command to run this file from the root project folder:

uv run python -m datafun.app_case

OBS

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

build_data_dictionary

build_data_dictionary(df: DataFrame) -> pd.DataFrame

Build a starter data dictionary.

Includes:
- column name
- data type
- missing value count
- percent missing

WHY: A data dictionary helps with understanding the structure and quality of the data.

Arguments: - df: The DataFrame to analyze.

Returns: - pd.DataFrame: A data dictionary summarizing the columns.
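
As a rough illustration of the dictionary described above, the following sketch builds the same four columns for a small hand-made DataFrame. The toy data is hypothetical (it is not the penguins dataset); the pandas calls mirror those in the function's source.

```python
import pandas as pd

# Hypothetical toy data: one missing value in each of the first two columns.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", None, "Chinstrap"],
        "body_mass_g": [3750.0, None, 4200.0, 3800.0],
        "year": [2007, 2008, 2008, 2009],
    }
)

# isna() marks missing cells as True; sum() counts them per column,
# and mean() gives the fraction missing per column.
data_dictionary = pd.DataFrame(
    {
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "missing_count": df.isna().sum().values,
        "missing_pct": (df.isna().mean() * 100).round(2).values,
    }
)
print(data_dictionary)
```

Each input column becomes one row of the result, so the dictionary stays readable even for wide datasets.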

Source code in src/datafun/app_case.py

def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Build a starter data dictionary.

    Includes:
    - column name
    - data type
    - missing value count
    - percent missing

    WHY: A data dictionary helps with understanding the structure and quality of the data.

    Arguments:
    - df: The DataFrame to analyze.

    Returns:
    - pd.DataFrame: A data dictionary summarizing the columns.
    """
    LOG.info("Building starter data dictionary")

    data_dictionary = pd.DataFrame(
        {
            "column": df.columns,
            "dtype": [str(t) for t in df.dtypes],
            "missing_count": df.isna().sum().values,
            "missing_pct": (df.isna().mean() * 100).round(2).values,
        }
    )

    LOG.debug(f"\n{data_dictionary}")
    return data_dictionary

check_quality

check_quality(df: DataFrame) -> None

Perform basic data quality checks.

Checks include:
- Missing values
- Duplicate rows
- Basic numeric sanity checks

WHY: Data quality issues can affect analysis results. It's important to identify them early in the EDA process.

Arguments:
- df: The DataFrame to check.

Returns:
- None (logs results)
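
The three checks can be tried on a tiny example. This is a minimal sketch with hypothetical data; the column names are borrowed from the penguins dataset only for familiarity.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first, and one value is missing.
df = pd.DataFrame(
    {
        "island": ["Biscoe", "Dream", "Dream", "Biscoe"],
        "flipper_length_mm": [181.0, None, 195.0, 181.0],
    }
)

# Missing values per column, largest count first.
missing_per_column = df.isna().sum().sort_values(ascending=False)

# duplicated() flags each row that exactly repeats an earlier row.
dup_count = int(df.duplicated().sum())

# describe() skips missing values when summarizing numeric columns.
summary = df[["flipper_length_mm"]].describe()

print(missing_per_column)
print(f"Duplicate rows: {dup_count}")
print(summary)
```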

Source code in src/datafun/app_case.py

def check_quality(df: pd.DataFrame) -> None:
    """Perform basic data quality checks.

    Checks include:
    - Missing values
    - Duplicate rows
    - Basic numeric sanity checks

    WHY: Data quality issues can affect analysis results.
    It's important to identify them early in the EDA process.

    Arguments:
    - df: The DataFrame to check.

    Returns:
    - None (logs results)
    """
    # Log the missing-value counts once, largest count first
    LOG.info("Checking missing values per column")
    LOG.debug(f"\n{df.isna().sum().sort_values(ascending=False)}")

    dup_count = int(df.duplicated().sum())
    LOG.info(f"Duplicate rows detected: {dup_count}")

    LOG.info("Call describe() for numeric columns")
    LOG.debug(f"\n{df[SELECTED_NUMERIC_COLS].describe()}\n")

correlation_matrix

correlation_matrix(df_clean: DataFrame) -> pd.DataFrame

Compute simple numeric correlations to understand relationships between numeric variables.

A correlation matrix is symmetric. It has as many rows and as many columns as there are numeric variables. The diagonal values are 1.0, since each variable correlates perfectly with itself.

WHY: Correlation tells us how numeric variables relate to each other.

Values near 1 or -1 indicate strong relationships
Values near 0 indicate weak or no linear relationship

Parameters:

Name	Type	Description	Default
`df_clean`	`DataFrame`	Cleaned DataFrame for analysis.	required

Returns:

Type	Description
`DataFrame`	pd.DataFrame: Correlation matrix of numeric columns.
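
The properties listed above (square, symmetric, ones on the diagonal) can be checked directly. The following is a small sketch on hypothetical data, using the same `select_dtypes(include="number")` and `corr()` calls as the function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y rises with x, z falls with x.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.1, 3.9, 6.2, 8.1, 9.8],
        "z": [9.0, 7.5, 6.1, 3.9, 2.2],
        "label": ["a", "b", "a", "b", "a"],  # non-numeric, excluded below
    }
)

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()

is_symmetric = np.allclose(corr.values, corr.values.T)
diagonal_is_one = np.allclose(np.diag(corr.values), 1.0)

print(corr.round(3))
print(f"Symmetric: {is_symmetric}, diagonal all 1.0: {diagonal_is_one}")
```

One caveat worth knowing: a column with no variation (a constant) has an undefined correlation, and pandas reports it as NaN rather than 1.0.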

Source code in src/datafun/app_case.py

def correlation_matrix(df_clean: pd.DataFrame) -> pd.DataFrame:
    """Compute simple numeric correlations to understand
    relationships between numeric variables.

    A correlation matrix is symmetric.
    It has as many rows and as many columns as there are numeric variables.
    The diagonal values are 1.0,
    since each variable correlates perfectly with itself.

    WHY: Correlation tells us how numeric variables relate to each other.

    - Values near 1 or -1 indicate strong relationships
    - Values near 0 indicate weak or no linear relationship

    Args:
        df_clean (pd.DataFrame): Cleaned DataFrame for analysis.

    Returns:
        pd.DataFrame: Correlation matrix of numeric columns.
    """
    LOG.info("Computing correlation matrix for numeric columns")

    # Select only numeric columns
    df_clean_numeric_cols: pd.DataFrame = df_clean.select_dtypes(include="number")

    # calculate the correlation matrix using the df corr() method
    correlation_matrix = df_clean_numeric_cols.corr()

    LOG.info("\nCorrelation matrix:")
    LOG.debug(f"\n{correlation_matrix}")

    LOG.info("---------Visualize Correlation Matrix as a Heatmap---------------")

    # Open a fresh blank canvas before a new chart
    plt.figure()

    # Use a heatmap() to visualize correlation matrix
    heatmap: Axes = sns.heatmap(
        correlation_matrix,
        annot=True,  # Set annotations to True to show correlation values
        cmap="coolwarm",  # try viridis, plasma, or other colormaps
        center=0,
    )
    heatmap.set_title("Correlation Matrix Heatmap")
    # IN NOTEBOOK: SHOW AS YOU GO
    #      plt.show() displays the current chart and closes it
    #      Call this before starting a new chart
    #      or next chart will be drawn on top of this one
    # IN SCRIPT: WAIT TO SHOW TILL THE END
    #      Do not call plt.show() here - let figures accumulate
    #      so all charts display together with sequential Figure numbers.
    #      plt.show() is called once at the end of main()
    # plt.show()

    LOG.info("""
CUSTOM: Update these notes and use Markdown cells to narrate and tell the story as you explore. For example:

Interpretation:

 - Values close to 1 (dark red) = strong positive correlation (both increase together)
 - Values close to -1 (dark blue) = strong negative correlation (one increases, other decreases)
 - Values close to 0 (white) = little or no linear relationship
 - The diagonal is always 1 (each variable correlates perfectly with itself)

From this heatmap, we can see that flipper_length_mm and body_mass_g show strong positive correlation (~0.87).
""")

    return correlation_matrix

descriptive_stats

descriptive_stats(
    df_clean: DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame]

Compute descriptive statistics overall and by group.

WHY: Summary statistics offer a quick overview of numeric data:

Central tendency (mean)
Spread (std, min, max)
Distribution shape (quartiles)
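
One detail about the spread figures: NumPy and pandas use different default denominators for the standard deviation, so the value from `np.std` may differ slightly from the `std` row reported by `describe()`. The sketch below uses hypothetical values to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical body-mass values in grams (illustrative, not from the dataset).
values = [3500.0, 3800.0, 4200.0, 4500.0]

# NumPy divides by n by default (ddof=0, population standard deviation).
std_numpy = np.std(np.array(values))

# pandas divides by n - 1 by default (ddof=1, sample standard deviation).
std_pandas = pd.Series(values).std()

# Passing ddof=1 to NumPy reproduces the pandas result.
std_numpy_sample = np.std(np.array(values), ddof=1)

print(f"np.std (ddof=0):     {std_numpy:.2f}")
print(f"Series.std (ddof=1): {std_pandas:.2f}")
print(f"np.std (ddof=1):     {std_numpy_sample:.2f}")
```

With a few hundred rows the gap is small, but it is worth stating which convention a reported figure uses.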

Grouping by a categorical variable (i.e., non-numeric column) enables comparing statistics across categories.

Parameters:

Name	Type	Description	Default
`df_clean`	`DataFrame`	Cleaned DataFrame for analysis.	required

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	tuple[pd.DataFrame, pd.DataFrame]: Overall stats, stats by group.

Notes:
- Descriptive statistics summarize key aspects of numeric data.
- Grouped stats help compare across categories.
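
To make the two return values concrete, here is a minimal sketch on hypothetical data. The names `numeric_cols` and the toy values are assumptions for illustration; the module itself uses its own `SELECTED_NUMERIC_COLS` and `GROUP_COL` constants.

```python
import pandas as pd

# Hypothetical toy data with one grouping column and two numeric columns.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.0, 41.0, 46.0, 48.0],
        "body_mass_g": [3700.0, 3900.0, 5000.0, 5200.0],
    }
)
numeric_cols = ["bill_length_mm", "body_mass_g"]

# Overall statistics, transposed so each numeric column becomes a row.
stats_overall = df[numeric_cols].describe().T

# Grouped statistics: the columns form a two-level index
# of (numeric_column, statistic) pairs.
stats_by_group = df.groupby("species")[numeric_cols].agg(["mean", "min", "max"])

# A single cell is addressed by group label and (column, statistic) pair.
adelie_mean_bill = stats_by_group.loc["Adelie", ("bill_length_mm", "mean")]

print(stats_overall)
print(stats_by_group)
print(f"Adelie mean bill length: {adelie_mean_bill}")
```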

Source code in src/datafun/app_case.py

def descriptive_stats(df_clean: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Compute descriptive statistics overall and by group.

    WHY: Summary statistics offer a quick overview of numeric data:

    - Central tendency (mean)
    - Spread (std, min, max)
    - Distribution shape (quartiles)

    Grouping by a categorical variable (i.e., non-numeric column)
    enables comparing statistics across categories.

    Args:
        df_clean (pd.DataFrame): Cleaned DataFrame for analysis.

    Returns:
        tuple[pd.DataFrame, pd.DataFrame]: Overall stats, stats by group.

    Notes:
    - Descriptive statistics summarize key aspects of numeric data.
    - Grouped stats help compare across categories.
    """

    LOG.info("--------------- Manual statistics ---------------")

    # Example: Calculate statistics for a specific column with numpy
    mean_body_mass = np.mean(df_clean["body_mass_g"])
    std_body_mass = np.std(df_clean["body_mass_g"])
    min_body_mass = np.min(df_clean["body_mass_g"])
    max_body_mass = np.max(df_clean["body_mass_g"])
    range_body_mass = np.ptp(df_clean["body_mass_g"])  # peak to peak (max - min)

    # Log the example results with formatting
    LOG.debug("Body Mass Statistics (using numpy):")
    LOG.debug(f"  Mean: {mean_body_mass:.2f} g")
    LOG.debug(f"  Std Dev: {std_body_mass:.2f} g")
    LOG.debug(f"  Min: {min_body_mass:.2f} g")
    LOG.debug(f"  Max: {max_body_mass:.2f} g")
    LOG.debug(f"  Range: {range_body_mass:.2f} g")

    LOG.info("--------------- Using pandas describe() method ---------------")

    LOG.info("Computing overall descriptive statistics")

    # Use describe() to get count, mean, std, min, 25%, 50%, 75%, max for numeric columns
    # OPTION: Use .T to transpose the result so that columns become rows for easier reading in logs
    stats_overall = df_clean[SELECTED_NUMERIC_COLS].describe().T
    LOG.debug(f"\n{stats_overall}")

    LOG.info("--------------- Using pandas groupby() and agg() ---------------")

    LOG.info("Computing descriptive statistics by group")

    # Step 1: Select only the numeric columns we want to summarize
    df_numeric_subset: pd.DataFrame = df_clean[SELECTED_NUMERIC_COLS]

    # Step 2: Split the numeric subset into groups based on the grouping column
    # groupby() returns a GroupBy object - not a DataFrame yet, just a plan to group
    grouped = df_numeric_subset.groupby(df_clean[GROUP_COL])

    # Step 3: For each group, compute multiple summary statistics at once
    # agg() applies each function in the list to each numeric column
    # The result has a multi-level column index: (numeric_column, statistic)
    df_stats_by_group: pd.DataFrame = grouped.agg(
        ["count", "mean", "std", "min", "max"]
    )

    LOG.debug(f"\n{df_stats_by_group}")

    LOG.info("\nStacked view - easier to read in logs:")

    # Yuck: That's the multi-level column index in action.
    # pandas lays out the result as (numeric_column, statistic) pairs
    # side by side, wrapping when the terminal width runs out.
    # With 4 numeric columns × 5 statistics = 20 columns total,
    # it can only fit 2 numeric columns per line at 120 characters wide.
    # Let's stack it so each numeric column's stats are grouped together
    # vertically instead of horizontally.

    stats_by_group_stacked: pd.DataFrame | pd.Series[Any] = df_stats_by_group.stack(
        level=0
    )
    LOG.debug(f"\n{stats_by_group_stacked}")

    return stats_overall, df_stats_by_group

inspect_basic

inspect_basic(df: DataFrame) -> None

Inspect the basic structure of the dataset.

WHY: Always start by understanding what columns exist, what types they are, and how large the dataset is.

How many rows and columns are there?
What types of data are present?
Are there obvious missing values?

This step reveals challenges we might face downstream (later).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The DataFrame to inspect.	required

Returns:

Type	Description
`None`	None
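
The inspection calls can be tried on any DataFrame. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "year": [2007, 2008, 2009],
    }
)

# df.shape is a (rows, columns) tuple, so it unpacks into two names.
num_rows, num_cols = df.shape

column_names = list(df.columns)
preview = df.head(2)  # first two rows only

print(f"Dataset shape: {num_rows} rows, {num_cols} columns")
print(column_names)
print(preview)
```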

Source code in src/datafun/app_case.py

def inspect_basic(df: pd.DataFrame) -> None:
    """Inspect the basic structure of the dataset.

    WHY: Always start by understanding what columns exist,
    what types they are, and how large the dataset is.

    - How many rows and columns are there?
    - What types of data are present?
    - Are there obvious missing values?

    This step reveals challenges we might face downstream (later).

    Arguments:
        df: The DataFrame to inspect.

    Returns:
        None
    """
    # Preview the first few rows
    LOG.info("Previewing first few rows of the dataset")
    LOG.debug(f"\n{df.head()}")

    LOG.info("Column names")
    LOG.debug(f"{list(df.columns)}")

    LOG.info("DataFrame info (types and non-null counts)")
    df.info()

    # Get shape - number of rows and columns
    # It has two parts so the return value is a tuple of (num_rows, num_columns)
    shape: tuple[int, int] = df.shape

    # To get each value, we can unpack the tuple into two variables
    # This is a common Python idiom for working with tuples.
    # Or we could just use shape[0] and shape[1] directly without unpacking.

    num_rows, num_cols = shape

    LOG.info(f"Dataset shape: {num_rows} rows, {num_cols} columns")

load_data

load_data() -> pd.DataFrame

Load a dataset into a DataFrame.

Seaborn provides clean built-in datasets for practice. Other projects may load from CSV, JSON, or a database.

Arguments: None

Returns:

Type	Description
`DataFrame`	pd.DataFrame: The loaded dataset.
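
For projects that load from CSV instead of a Seaborn built-in, a loader might look like the sketch below. The CSV text is hypothetical and is read from an in-memory buffer so the example runs without a file; with real data, a file path would typically be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Gentoo,Biscoe,5000
Chinstrap,Dream,
"""

# read_csv accepts a path or any file-like object; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns")
```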

Source code in src/datafun/app_case.py

def load_data() -> pd.DataFrame:
    """Load a dataset into a DataFrame.

    Seaborn provides clean built-in datasets for practice.
    Other projects may load from CSV, JSON, or a database.

    Arguments: None

    Returns:
        pd.DataFrame: The loaded dataset.
    """
    LOG.info(f"Loading dataset: {DATASET_NAME}")
    df: pd.DataFrame = sns.load_dataset(DATASET_NAME)
    LOG.info(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns")

    return df

main

main() -> None

Main function to run the EDA workflow.

Source code in src/datafun/app_case.py

def main() -> None:
    """Main function to run the EDA workflow."""
    log_header(LOG, "EDA")

    LOG.info("========================")
    LOG.info("START main()")
    LOG.info("========================")

    LOG.info(f"--- Section 2: Load dataset: {DATASET_NAME} ---")
    df = load_data()

    LOG.info("--- Section 3: Inspect shape and basic structure ---")
    inspect_basic(df)

    LOG.info("--- Section 4: Create Data Dictionary and Check Data Quality ---")
    build_data_dictionary(df)
    check_quality(df)

    LOG.info("--- Section 5: Create a cleaned view for EDA ---")
    df_clean = make_clean_view(df)

    LOG.info("--- Section 6: Descriptive statistics for numeric columns ---")
    descriptive_stats(df_clean)

    LOG.info("--- Section 7: Correlation matrix for numeric columns ---")
    correlation_matrix(df_clean)

    LOG.info("--- Section 8: Charts ---")
    make_plots(df_clean)

    LOG.info("--- Section 9: Summary and next steps ---")
    summarize(df, df_clean)

    LOG.info(
        "----- in a script, call plt.show() once at the end to display all charts -----"
    )
    LOG.info(
        "----- in a script, close the chart windows (with the close button) to continue  -----"
    )
    plt.show()

    LOG.info("EDA workflow complete")
    LOG.info("IMPORTANT: This script creates chart windows.")
    LOG.info(
        "Close any chart windows and terminate this process with CTRL+c as needed."
    )
    LOG.info("========================")
    LOG.info("Executed successfully!")
    LOG.info("========================")

make_clean_view

make_clean_view(df: DataFrame) -> pd.DataFrame

Create a cleaned view for EDA.

Strategy:
- Keep the original DataFrame unchanged
- Drop rows missing key numeric fields and grouping field

WHY: EDA often focuses on a "clean" subset of the data. This allows exploring patterns without being distracted by missing values.

Arguments: - df: The original DataFrame.

Returns: - pd.DataFrame: A cleaned view of the data for EDA.
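
The strategy above can be sketched on a few rows. The lowercase names `selected_numeric_cols` and `group_col` are hypothetical stand-ins for the module's `SELECTED_NUMERIC_COLS` and `GROUP_COL` constants, whose actual values are not shown on this page.

```python
import pandas as pd

# Hypothetical stand-ins for the module-level constants.
selected_numeric_cols = ["bill_length_mm", "body_mass_g"]
group_col = "species"

# Hypothetical toy data with gaps in different columns.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", None, "Chinstrap"],
        "bill_length_mm": [39.1, None, 45.0, 49.5],
        "body_mass_g": [3750.0, 5000.0, 4200.0, 3800.0],
        "sex": [None, "Female", "Male", "Male"],  # not a required column
    }
)

cols_required = selected_numeric_cols + [group_col]

# Drop a row only if a required column is missing; a missing "sex" is tolerated.
# .copy() returns an independent DataFrame, leaving df untouched.
df_clean = df.dropna(subset=cols_required).copy()

rows_dropped = len(df) - len(df_clean)
print(f"Original rows: {len(df)}, clean rows: {len(df_clean)}, dropped: {rows_dropped}")
```

The first row survives even though its `sex` value is missing, because only the required columns are checked.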

Source code in src/datafun/app_case.py

def make_clean_view(df: pd.DataFrame) -> pd.DataFrame:
    """Create a cleaned view for EDA.

    Strategy:
    - Keep the original DataFrame unchanged
    - Drop rows missing key numeric fields and grouping field

    WHY: EDA often focuses on a "clean" subset of the data.
    This allows exploring patterns without being distracted by missing values.

    Arguments:
    - df: The original DataFrame.

    Returns:
    - pd.DataFrame: A cleaned view of the data for EDA.
    """
    LOG.info("Creating cleaned view for EDA (dropping rows with key missing values)")

    # Build the list of columns we require to be non-missing
    # This includes all the selected numeric columns plus the grouping column.
    # SELECTED_NUMERIC_COLS is a list of strings,
    # GROUP_COL is a single string
    # Wrap GROUP_COL in a list - two lists can be combined with +
    cols_required: list[str] = SELECTED_NUMERIC_COLS + [GROUP_COL]
    LOG.debug(f"Columns required to be non-missing: {cols_required}")

    # Drop a row if it is missing a value in ANY of the required columns.
    # dropna() means "drop rows with values that are not available (NA)".
    # By default it checks every column; subset= restricts the check
    # to the key columns listed in cols_required.
    # .copy() creates a new DataFrame so we don't accidentally modify the original
    df_clean: pd.DataFrame = df.dropna(subset=cols_required).copy()

    # Report what was kept and what was dropped
    count_original: int = df.shape[0]
    count_clean: int = df_clean.shape[0]
    count_dropped: int = count_original - count_clean

    LOG.info(f"Original rows: {count_original}")
    LOG.info(f"Clean rows:    {count_clean}")
    LOG.info(f"Rows dropped:  {count_dropped}")

    return df_clean

make_plots

make_plots(df_clean: DataFrame) -> None

Create simple, notebook-friendly plots.

WHY: Visualizations reveal patterns not obvious in tables.
CUSTOM: Charts will vary depending on the dataset and questions of interest.

Common charts include:
1. A scatter plot to see relationships between two variables
2. A box plot to compare distributions across groups

A scatter plot shows the relationship between two numeric variables. In this example:
- Each dot is one data record shown as x vs y.
- Color (hue) provides a third dimension.

A box plot shows the distribution of one numeric variable across groups.
- The box shows the middle 50% of values.
- The line inside the box is the median.
- The whiskers extend to the most extreme values within 1.5 times the interquartile range of the box (the seaborn default). Dots beyond the whiskers are outliers.
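
The parts of a box plot can be computed by hand, which helps when reading one. The sketch below assumes the default whisker rule of 1.5 times the interquartile range and uses hypothetical values; it is an illustration of the idea, not the plotting library's internal code.

```python
import numpy as np

# Hypothetical flipper lengths in mm; 260 is a deliberately extreme value.
values = np.array([180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 260.0])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn as individual outlier dots.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Box: {q1} to {q3}, median {median}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```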

Source code in src/datafun/app_case.py

def make_plots(df_clean: pd.DataFrame) -> None:
    """Create simple, notebook-friendly plots.

    WHY: Visualizations reveal patterns not obvious in tables.
    CUSTOM: Charts will vary depending on the dataset
            and questions of interest.

    Common charts include:
    1. A scatter plot to see relationships between two variables
    2. A box plot to compare distributions across groups

    A scatter plot shows the relationship between two numeric variables.
    In this example:
    - Each dot is one data record shown as x vs y.
    - Color (hue) provides a third dimension.

    A box plot shows the distribution of one numeric variable across groups.
    - The box shows the middle 50% of values.
    - The line inside the box is the median.
    - The whiskers extend to the most extreme values within 1.5 times
      the interquartile range of the box (the seaborn default).
      Dots beyond the whiskers are outliers.

    """
    LOG.info("---- Creating Scatter Plot to see Relationships ------")
    LOG.info("----   Use clean dataframe ---------------------------")
    LOG.info("----   Set x to flipper length -----------------------")
    LOG.info("----   Set y to bill length --------------------------")
    LOG.info("----   Set the hue (color mapping) to the group column --")

    # Open a fresh blank canvas before a new chart
    plt.figure()

    # Use a scatterplot() to visualize relationships between two variables (x vs y)
    scatter_plt: Axes = sns.scatterplot(
        data=df_clean,
        x="flipper_length_mm",
        y="bill_length_mm",
        hue=GROUP_COL,
    )
    scatter_plt.set_xlabel("Flipper length (mm)")
    scatter_plt.set_ylabel("Bill length (mm)")
    scatter_plt.set_title("Flipper length vs Bill length (by species)")

    # IN NOTEBOOK: SHOW AS YOU GO
    #      plt.show() displays the current chart and closes it
    #      Call this before starting a new chart
    #      or next chart will be drawn on top of this one
    # IN SCRIPT: WAIT TO SHOW TILL THE END
    #      Do not call plt.show() here - let figures accumulate
    #      so all charts display together with sequential Figure numbers.
    #      plt.show() is called once at the end of main()
    # plt.show()

    LOG.info("------ Creating Box Plot to see Distribution: ---------")
    LOG.info("------   Use clean dataframe --------------------------")
    LOG.info("------   Set x to the group column --------------------")
    LOG.info("------   Set y to flipper length ----------------------")

    # Open a fresh blank canvas before a new chart
    plt.figure()

    # Use a boxplot() to visualize the distribution of a numeric variable across groups
    box_plt: Axes = sns.boxplot(
        data=df_clean,
        x=GROUP_COL,
        y="flipper_length_mm",
    )
    box_plt.set_title("Flipper length by species")

summarize

summarize(df: DataFrame, df_clean: DataFrame) -> None

Log a brief summary of findings and suggested next steps.

WHY: EDA is not the final report. A summary captures what was found and what to investigate next.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The original DataFrame.	required
`df_clean`	`DataFrame`	The cleaned DataFrame.	required

Returns:

Type	Description
`None`	None
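
The function's source logs the strongest correlation as a fixed message. When adapting it to another dataset, the strongest pair could instead be read from the correlation matrix. The following is one possible sketch on hypothetical data; it is not part of the module.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values only).
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [1.1, 2.0, 2.9, 4.2, 5.1],
        "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)

corr = df.corr()

# Work on absolute values so strong negative relationships also count,
# and zero the diagonal so self-correlations are ignored.
strength = corr.abs().to_numpy().copy()
np.fill_diagonal(strength, 0.0)

# Locate the largest remaining cell and map it back to column names.
row, col = np.unravel_index(np.argmax(strength), strength.shape)
strongest_pair = (corr.index[row], corr.columns[col])
strongest_value = corr.iloc[row, col]

print(f"Strongest correlation: {strongest_pair} ({strongest_value:.2f})")
```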

Source code in src/datafun/app_case.py

def summarize(df: pd.DataFrame, df_clean: pd.DataFrame) -> None:
    """Log a brief summary of findings and suggested next steps.

    WHY: EDA is not the final report.
    A summary captures what was found and what to investigate next.

    Arguments:
        df: The original DataFrame.
        df_clean: The cleaned DataFrame.

    Returns:
        None
    """
    LOG.info("========================")
    LOG.info("SUMMARY")
    LOG.info("========================")
    LOG.info(f"Dataset: {DATASET_NAME}")

    LOG.info(f"Original rows: {df.shape[0]}")
    LOG.info(f"Clean rows:    {df_clean.shape[0]}")

    # Get the unique values in the grouping column (e.g. species names)
    unique_groups_array: np.ndarray = df_clean[GROUP_COL].unique()

    # Sort them alphabetically so the output is consistent and readable
    sorted_groups: list[str] = sorted(unique_groups_array)

    LOG.info(f"Groups found in {GROUP_COL}: {sorted_groups}")

    LOG.info("Strongest correlation: ")
    LOG.info("  flipper_length_mm and body_mass_g (~0.87)")

    LOG.info("Suggested next step: ")
    LOG.info("  Model body_mass_g ~ flipper_length_mm with linear regression")