
API Reference

This page is auto-generated from Python docstrings.

datafun.app_case

app_case.py - Project script (example).

Author: Denise Case
Date: 2026-01

Practice key Python skills:
- pathlib for cross-platform paths
- logging (preferred over print)
- calling functions from modules
- clear ETVL pipeline stages:
    - E = Extract (read, get data from source into memory)
    - T = Transform (process, change data in memory)
    - V = Verify (check, validate data in memory)
    - L = Load (write results, to data/processed or other destination)

Terminal command to run this file from the root project folder:

uv run python -m datafun.app_case
Note

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

main

main() -> None

Entry point for the script: run four simple ETVL pipelines.

log_header() logs a standard run header. log_path() logs repo-relative paths (privacy-safe).

Arguments: None. Returns: None.

Source code in src/datafun/app_case.py
def main() -> None:
    """Entry point for the script.

    Entry point: run four simple ETVL pipelines.

    log_header() logs a standard run header.
    log_path() logs repo-relative paths (privacy-safe).

    Arguments: None.
    Returns: None.
    """
    log_header(LOG, "P03")

    LOG.info("========================")
    LOG.info("START main()")
    LOG.info("========================")

    log_path(LOG, "ROOT_DIR", ROOT_DIR)
    log_path(LOG, "PROCESSED_DIR", PROCESSED_DIR)

    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

    # Call each pipeline. Each reads from data/raw and writes to data/processed.
    run_csv_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_xlsx_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_json_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_text_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)

    LOG.info("========================")
    LOG.info("Executed successfully!")
    LOG.info("========================")

datafun.case_csv_pipeline

case_csv_pipeline.py - CSV ETVL pipeline.

Author: Denise Case
Date: 2026-04

Practice key Python skills related to:
- ETVL pipeline structure (Extract, Transform, Verify, Load)
- reading CSV files using the csv module
- keyword-only function arguments
- error handling with raise
- calculating statistics with the statistics module
- writing results to a text file

Paths (relative to repo root):

INPUT FILE:  data/raw/2020_happiness.csv
OUTPUT FILE: data/processed/csv_ladder_score_stats.txt

Terminal command to run this file from the root project folder:

uv run python -m datafun.case_csv_pipeline
Note

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

extract_csv_scores

extract_csv_scores(
    *, file_path: Path, column_name: str
) -> list[float]

E: Read CSV and extract one numeric column as floats.

Parameters:

    file_path (Path, required): Path to input CSV file.
    column_name (str, required): Name of the column to extract.

Returns:

    list[float]: List of float values from the specified column.

Source code in src/datafun/case_csv_pipeline.py
def extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]:
    """E: Read CSV and extract one numeric column as floats.

    Arguments:
        file_path: Path to input CSV file.
        column_name: Name of the column to extract.

    Returns:
        List of float values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    scores: list[float] = []
    with file_path.open("r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)

        # Handle known possible error: missing expected column.
        if reader.fieldnames is None or column_name not in reader.fieldnames:
            raise KeyError(
                f"CSV missing expected column '{column_name}'. "
                f"Found: {reader.fieldnames}"
            )

        for row in reader:
            raw_value = (row.get(column_name) or "").strip()
            # Skip empty cells rather than failing the whole pipeline.
            if not raw_value:
                continue
            try:
                scores.append(float(raw_value))
            except ValueError:
                # Skip rows that do not convert cleanly to float.
                continue

    return scores
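The same csv.DictReader pattern can be tried without the repository's data files. This is a minimal sketch that reads from an in-memory string instead of data/raw/2020_happiness.csv; the sample rows are illustrative, not real happiness data:

```python
import csv
import io

# In-memory stand-in for the CSV input file (sample values, not real data).
sample_csv = "Country,Ladder score\nFinland,7.81\nDenmark,7.65\nNowhere,\nOddland,n/a\n"

scores: list[float] = []
reader = csv.DictReader(io.StringIO(sample_csv))
for row in reader:
    raw_value = (row.get("Ladder score") or "").strip()
    if not raw_value:
        continue  # skip empty cells, as the pipeline does
    try:
        scores.append(float(raw_value))
    except ValueError:
        continue  # skip values that do not parse cleanly as float

print(scores)  # [7.81, 7.65]
```

The empty cell and the "n/a" row are silently skipped, matching the function's skip-rather-than-fail policy.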

load_stats_report

load_stats_report(
    *, stats: dict[str, float], out_path: Path
) -> None

L: Write stats to a text file in data/processed.

Parameters:

    stats (dict[str, float], required): Dictionary with statistics to write.
    out_path (Path, required): Path to output text file.

Returns:

    None.

Source code in src/datafun/case_csv_pipeline.py
def load_stats_report(*, stats: dict[str, float], out_path: Path) -> None:
    """L: Write stats to a text file in data/processed.

    Arguments:
        stats: Dictionary with statistics to write.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("CSV Ladder Score Statistics\n")
        f.write(f"Count: {int(stats['count'])}\n")
        f.write(f"Minimum: {stats['min']:.2f}\n")
        f.write(f"Maximum: {stats['max']:.2f}\n")
        f.write(f"Mean: {stats['mean']:.2f}\n")
        f.write(f"Standard Deviation: {stats['stdev']:.2f}\n")

run_csv_pipeline

run_csv_pipeline(
    *, raw_dir: Path, processed_dir: Path, logger: Any
) -> None

Run the full ETVL pipeline.

Parameters:

    raw_dir (Path, required): Path to data/raw directory.
    processed_dir (Path, required): Path to data/processed directory.
    logger (Any, required): Logger for logging messages.

Returns:

    None.

Source code in src/datafun/case_csv_pipeline.py
def run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Arguments:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None
    """
    logger.info("CSV: START")

    input_file = raw_dir / "2020_happiness.csv"
    output_file = processed_dir / "csv_ladder_score_stats.txt"

    # E: Read raw data.
    scores = extract_csv_scores(file_path=input_file, column_name="Ladder score")

    # T: Calculate statistics.
    stats = transform_scores_to_stats(scores=scores)

    # V: Verify results before writing.
    verify_stats(stats=stats)

    # L: Write results to disk.
    load_stats_report(stats=stats, out_path=output_file)

    logger.info("CSV: wrote %s", output_file)
    logger.info("CSV: END")

transform_scores_to_stats

transform_scores_to_stats(
    *, scores: list[float]
) -> dict[str, float]

T: Calculate basic statistics for a list of floats.

Parameters:

    scores (list[float], required): List of float values.

Returns:

    dict[str, float]: Dictionary with keys: count, min, max, mean, stdev.

Source code in src/datafun/case_csv_pipeline.py
def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    """T: Calculate basic statistics for a list of floats.

    Arguments:
        scores: List of float values.

    Returns:
        Dictionary with keys: count, min, max, mean, stdev.
    """
    if not scores:
        raise ValueError("No numeric values found for analysis.")

    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        # stdev() requires at least 2 values; return 0.0 for a single value.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
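The statistics calls above can be exercised on a small hand-made list. A minimal sketch with made-up scores (not values from the real input file):

```python
import statistics

# Sample scores for illustration only.
scores = [7.81, 7.65, 7.56]

stats = {
    "count": float(len(scores)),
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),
    # stdev() needs at least 2 values; fall back to 0.0 for a single value.
    "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
}

print(stats["count"], stats["min"], stats["max"])  # 3.0 7.56 7.81
```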

verify_stats

verify_stats(*, stats: dict[str, float]) -> None

V: Sanity-check the stats dictionary.

Parameters:

    stats (dict[str, float], required): Dictionary with statistics to verify.

Returns:

    None.

Source code in src/datafun/case_csv_pipeline.py
def verify_stats(*, stats: dict[str, float]) -> None:
    """V: Sanity-check the stats dictionary.

    Arguments:
        stats: Dictionary with statistics to verify.

    Returns:
        None
    """
    required = {"count", "min", "max", "mean", "stdev"}
    missing = required - set(stats.keys())
    # Handle known possible error: missing required keys.
    if missing:
        raise KeyError(f"Missing stats keys: {sorted(missing)}")

    # Handle known possible error: count must be positive.
    if stats["count"] <= 0:
        raise ValueError("Count must be positive.")

    # Handle known possible error: min cannot be greater than max.
    if stats["min"] > stats["max"]:
        raise ValueError("Min cannot be greater than max.")
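The set-difference trick used for the missing-key check is worth seeing in isolation. A minimal sketch with a deliberately incomplete stats dictionary (hypothetical values):

```python
required = {"count", "min", "max", "mean", "stdev"}

# Deliberately incomplete: 'mean' and 'stdev' are missing.
stats = {"count": 3.0, "min": 7.56, "max": 7.81}

# Set difference yields exactly the required keys that are absent.
missing = required - set(stats.keys())
print(sorted(missing))  # ['mean', 'stdev']
```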

datafun.case_xlsx_pipeline

case_xlsx_pipeline.py - XLSX ETVL pipeline.

Author: Denise Case
Date: 2026-04

Practice key Python skills related to:
- ETVL pipeline structure (Extract, Transform, Verify, Load)
- reading Excel files using the openpyxl package
- accessing cells by column letter
- keyword-only function arguments
- runtime type checking with isinstance()
- counting word occurrences across strings
- writing results to a text file

Paths (relative to repo root):

INPUT FILE:  data/raw/Feedback.xlsx
OUTPUT FILE: data/processed/xlsx_feedback_github_count.txt

Terminal command to run this file from the root project folder:

uv run python -m datafun.case_xlsx_pipeline
Note

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

extract_xlsx_column_strings

extract_xlsx_column_strings(
    *, file_path: Path, column_letter: str
) -> list[str]

E: Read an Excel file and extract string values from a column.

Parameters:

    file_path (Path, required): Path to input XLSX file.
    column_letter (str, required): Letter of the column to extract (e.g., 'A').

Returns:

    list[str]: List of non-empty string values from the specified column.

Source code in src/datafun/case_xlsx_pipeline.py
def extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]:
    """E: Read an Excel file and extract string values from a column.

    Arguments:
        file_path: Path to input XLSX file.
        column_letter: Letter of the column to extract (e.g., 'A').

    Returns:
        List of non-empty string values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    workbook = openpyxl.load_workbook(file_path)
    # active returns the first worksheet - the one visible when the file opens.
    sheet = workbook.active

    values: list[str] = []

    for cell in sheet[column_letter]:
        # cast() narrows the type for the type checker - no runtime effect.
        cell = cast(Cell, cell)
        value = cell.value
        # Only keep non-empty string values.
        if isinstance(value, str) and value.strip():
            values.append(value)

    return values

load_count_report

load_count_report(
    *,
    count: int,
    out_path: Path,
    word: str,
    column_letter: str,
) -> None

L: Write the word count result to a text file in data/processed.

Parameters:

    count (int, required): The word count to write.
    out_path (Path, required): Path to output text file.
    word (str, required): The word that was counted.
    column_letter (str, required): The column letter that was processed.

Returns:

    None.

Source code in src/datafun/case_xlsx_pipeline.py
def load_count_report(
    *, count: int, out_path: Path, word: str, column_letter: str
) -> None:
    """L: Write the word count result to a text file in data/processed.

    Arguments:
        count: The word count to write.
        out_path: Path to output text file.
        word: The word that was counted.
        column_letter: The column letter that was processed.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("XLSX Word Count Result\n")
        f.write(f"Word: {word}\n")
        f.write(f"Column: {column_letter}\n")
        f.write(f"Count: {count}\n")

run_xlsx_pipeline

run_xlsx_pipeline(
    *, raw_dir: Path, processed_dir: Path, logger: Any
) -> None

Run the full ETVL pipeline.

Parameters:

    raw_dir (Path, required): Path to data/raw directory.
    processed_dir (Path, required): Path to data/processed directory.
    logger (Any, required): Logger for logging messages.

Returns:

    None.

Source code in src/datafun/case_xlsx_pipeline.py
def run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Arguments:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None
    """
    logger.info("XLSX: START")

    input_file = raw_dir / "Feedback.xlsx"
    output_file = processed_dir / "xlsx_feedback_github_count.txt"

    column_letter = "A"
    word = "GitHub"

    # E: Read string values from column A.
    values = extract_xlsx_column_strings(
        file_path=input_file,
        column_letter=column_letter,
    )

    # T: Count occurrences of the target word.
    count = transform_count_word(values=values, word=word)

    # V: Verify the count before writing.
    verify_count(count=count)

    # L: Write results to disk.
    load_count_report(
        count=count,
        out_path=output_file,
        word=word,
        column_letter=column_letter,
    )

    logger.info("XLSX: wrote %s", output_file)
    logger.info("XLSX: END")

transform_count_word

transform_count_word(
    *, values: list[str], word: str
) -> int

T: Count occurrences of a word across all strings (case-insensitive).

Parameters:

    values (list[str], required): List of strings to search.
    word (str, required): Word to count.

Returns:

    int: Total count of occurrences of the word across all strings.

Source code in src/datafun/case_xlsx_pipeline.py
def transform_count_word(*, values: list[str], word: str) -> int:
    """T: Count occurrences of a word across all strings (case-insensitive).

    Arguments:
        values: List of strings to search.
        word: Word to count.

    Returns:
        Total count of occurrences of the word across all strings.
    """
    # Handle known possible error: no word provided by caller.
    if not word:
        raise ValueError("Word to count cannot be empty.")

    target = word.lower()
    count = 0
    for text in values:
        # Convert both to lowercase for case-insensitive matching.
        count += text.lower().count(target)
    return count
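The case-insensitive counting pattern can be demonstrated on a few hand-written strings (illustrative sample text, not rows from Feedback.xlsx):

```python
# Sample cell values for illustration only.
values = [
    "I pushed my work to GitHub today.",
    "github Actions ran all the tests.",
    "No mention of the word here.",
]

# Lowercase both sides so 'GitHub' and 'github' match the same target.
target = "GitHub".lower()
count = sum(text.lower().count(target) for text in values)
print(count)  # 2
```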

verify_count

verify_count(*, count: int) -> None

V: Verify the count is valid.

Parameters:

    count (int, required): The count to verify.

Returns:

    None.

Source code in src/datafun/case_xlsx_pipeline.py
def verify_count(*, count: int) -> None:
    """V: Verify the count is valid.

    Arguments:
        count: The count to verify.

    Returns:
        None
    """
    # Handle known possible error: count is negative.
    if count < 0:
        raise ValueError("Count cannot be negative.")

datafun.case_json_pipeline

case_json_pipeline.py - JSON ETVL pipeline.

Author: Denise Case
Date: 2026-04

Practice key Python skills related to:
- ETVL pipeline structure (Extract, Transform, Verify, Load)
- reading JSON files using the json module
- walking JSON: dictionaries, lists, and nested structures
- keyword-only function arguments
- defensive programming for untrusted input
- runtime type checking with isinstance()
- writing results to a text file

Paths (relative to repo root):

INPUT FILE:  data/raw/astros.json
OUTPUT FILE: data/processed/json_astronauts_by_craft.txt

Terminal command to run this file from the root project folder:

uv run python -m datafun.case_json_pipeline
Note

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

extract_people_list

extract_people_list(
    *, file_path: Path, list_key: str = 'people'
) -> list[dict[str, Any]]

E/V: Read JSON file and extract a list of dictionaries under list_key.

Parameters:

    file_path (Path, required): Path to input JSON file.
    list_key (str, default 'people'): Top-level key expected to map to a list.

Returns:

    list[dict[str, Any]]: A list of dictionaries from the JSON file.

Source code in src/datafun/case_json_pipeline.py
def extract_people_list(
    *, file_path: Path, list_key: str = "people"
) -> list[dict[str, Any]]:
    """E/V: Read JSON file and extract a list of dictionaries under list_key.

    Arguments:
        file_path: Path to input JSON file.
        list_key: Top-level key expected to map to a list (default: "people").

    Returns:
        A list of dictionaries from the JSON file.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        # json.load() reads the entire file and returns a Python object.
        data: Any = json.load(f)

    # JSON top level should be a dict - verify before accessing keys.
    if not isinstance(data, dict):
        raise TypeError("Expected JSON top-level object to be a dictionary.")

    # Use dict.get() to safely retrieve the list - default to empty list if missing.
    value: Any = data.get(list_key, [])

    # Verify the value is actually a list before iterating.
    if not isinstance(value, list):
        raise TypeError(f"Expected {list_key!r} to be a list.")

    # Walk the list and keep only items that are dictionaries.
    # Each person record should be a dict with keys like "name" and "craft".
    people_list: list[dict[str, Any]] = []
    for item in value:
        if isinstance(item, dict):
            people_list.append(item)  # type: ignore[arg-type]

    return people_list
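The same defensive walk can be tried on an in-memory payload. A minimal sketch using json.loads on a string shaped like the astros.json response (names and crafts are made up):

```python
import json

# In-memory stand-in for the JSON input file; the stray string element
# simulates untrusted input that is not a person record.
raw = '{"people": [{"name": "A", "craft": "ISS"}, {"name": "B", "craft": "ISS"}, "not-a-dict"]}'
data = json.loads(raw)

# Verify the top-level shape before accessing keys.
if not isinstance(data, dict):
    raise TypeError("Expected JSON top-level object to be a dictionary.")

# dict.get() with a default avoids a KeyError if the key is missing.
value = data.get("people", [])
if not isinstance(value, list):
    raise TypeError("Expected 'people' to be a list.")

# Keep only items that are dictionaries.
people = [item for item in value if isinstance(item, dict)]
print(len(people))  # 2
```

The non-dict element is filtered out rather than crashing the pipeline.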

load_counts_report

load_counts_report(
    *, counts: dict[str, int], out_path: Path
) -> None

L: Write craft counts to a text file in data/processed.

Parameters:

    counts (dict[str, int], required): Dictionary mapping craft names to counts.
    out_path (Path, required): Path to output text file.

Returns:

    None.

Source code in src/datafun/case_json_pipeline.py
def load_counts_report(*, counts: dict[str, int], out_path: Path) -> None:
    """L: Write craft counts to a text file in data/processed.

    Arguments:
        counts: Dictionary mapping craft names to counts.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Astronauts by spacecraft:\n")
        # Sort craft names alphabetically for consistent, readable output.
        for craft in sorted(counts):
            f.write(f"{craft}: {counts[craft]}\n")

run_json_pipeline

run_json_pipeline(
    *, raw_dir: Path, processed_dir: Path, logger: Any
) -> None

Run the full ETVL pipeline.

Parameters:

    raw_dir (Path, required): Path to data/raw directory.
    processed_dir (Path, required): Path to data/processed directory.
    logger (Any, required): Logger for logging messages.

Returns:

    None.

Source code in src/datafun/case_json_pipeline.py
def run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Arguments:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None
    """
    logger.info("JSON: START")

    input_file = raw_dir / "astros.json"
    output_file = processed_dir / "json_astronauts_by_craft.txt"

    # E: Read raw data.
    people_list = extract_people_list(file_path=input_file, list_key="people")

    # T: Count people by craft.
    craft_counts = transform_count_by_craft(people_list=people_list, craft_key="craft")

    # V: Verify results before writing.
    verify_counts(counts=craft_counts)

    # L: Write results to disk.
    load_counts_report(counts=craft_counts, out_path=output_file)

    logger.info("JSON: wrote %s", output_file)
    logger.info("JSON: END")

transform_count_by_craft

transform_count_by_craft(
    *,
    people_list: list[dict[str, Any]],
    craft_key: str = 'craft',
) -> dict[str, int]

T: Count people by craft.

Parameters:

    people_list (list[dict[str, Any]], required): List of person dictionaries.
    craft_key (str, default 'craft'): Key to read craft name from.

Returns:

    dict[str, int]: Dictionary mapping craft names to counts.

Source code in src/datafun/case_json_pipeline.py
def transform_count_by_craft(
    *, people_list: list[dict[str, Any]], craft_key: str = "craft"
) -> dict[str, int]:
    """T: Count people by craft.

    Arguments:
        people_list: List of person dictionaries.
        craft_key: Key to read craft name from (default: "craft").

    Returns:
        Dictionary mapping craft names to counts.
    """
    counts: dict[str, int] = {}

    for person in people_list:
        # Use dict.get() to safely access the craft key.
        craft: Any = person.get(craft_key, "Unknown")
        # Guard against non-string or empty values.
        if not isinstance(craft, str) or not craft.strip():
            craft = "Unknown"
        # Increment the count for this craft, starting at 0 if not yet seen.
        counts[craft] = counts.get(craft, 0) + 1

    return counts
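The counts.get(craft, 0) + 1 idiom can be seen on a tiny hand-made list, including the two "Unknown" fallback cases (sample records, not real data):

```python
# Sample person records for illustration only.
people = [
    {"name": "A", "craft": "ISS"},
    {"name": "B", "craft": "ISS"},
    {"name": "C"},               # missing craft key -> "Unknown"
    {"name": "D", "craft": ""},  # empty craft value -> "Unknown"
]

counts: dict[str, int] = {}
for person in people:
    craft = person.get("craft", "Unknown")
    # Guard against non-string or blank values.
    if not isinstance(craft, str) or not craft.strip():
        craft = "Unknown"
    # Start at 0 the first time a craft is seen, then increment.
    counts[craft] = counts.get(craft, 0) + 1

print(counts)  # {'ISS': 2, 'Unknown': 2}
```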

verify_counts

verify_counts(*, counts: dict[str, int]) -> None

V: Verify counts are non-negative and craft names are not empty.

Parameters:

    counts (dict[str, int], required): Dictionary mapping craft names to counts.

Returns:

    None.

Source code in src/datafun/case_json_pipeline.py
def verify_counts(*, counts: dict[str, int]) -> None:
    """V: Verify counts are non-negative and craft names are not empty.

    Arguments:
        counts: Dictionary mapping craft names to counts.

    Returns:
        None
    """
    for craft, count in counts.items():
        # Handle known possible error: invalid craft name.
        if not craft.strip():
            raise ValueError(f"Invalid craft name: {craft!r}")
        # Handle known possible error: count is negative.
        if count < 0:
            raise ValueError(f"Invalid count for craft {craft!r}: {count}")

datafun.case_text_pipeline

case_text_pipeline.py - Text ETVL pipeline.

Author: Denise Case
Date: 2026-04

Practice key Python skills related to:
- ETVL pipeline structure (Extract, Transform, Verify, Load)
- reading text files line by line
- counting lines, words, and characters
- keyword-only function arguments
- error handling with raise
- writing results to a text file

Paths (relative to repo root):

INPUT FILE:  data/raw/romeo_and_juliet.txt
OUTPUT FILE: data/processed/txt_summary.txt

Terminal command to run this file from the root project folder:

uv run python -m datafun.case_text_pipeline
Note

Don't edit this file - it should remain a working example. Copy it, rename it, and modify your copy.

extract_lines

extract_lines(*, file_path: Path) -> list[str]

E: Read a text file into a list of lines.

Parameters:

    file_path (Path, required): Path to input text file.

Returns:

    list[str]: List of lines from the text file.

Source code in src/datafun/case_text_pipeline.py
def extract_lines(*, file_path: Path) -> list[str]:
    """E: Read a text file into a list of lines.

    Arguments:
        file_path: Path to input text file.

    Returns:
        List of lines from the text file.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        return f.readlines()
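One detail of readlines() worth knowing: each returned line keeps its trailing newline. A minimal sketch using an in-memory stream instead of data/raw/romeo_and_juliet.txt:

```python
import io

# In-memory stand-in for the text input file (two sample lines).
text = "Two households, both alike in dignity,\nIn fair Verona, where we lay our scene,\n"

lines = io.StringIO(text).readlines()
print(len(lines))               # 2
print(lines[0].endswith("\n"))  # True: readlines() keeps newline characters
```

The preserved newlines matter later: the character count in the transform step includes them.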

load_summary_report

load_summary_report(
    *, summary: dict[str, int], out_path: Path
) -> None

L: Write summary to a text file in data/processed.

Parameters:

    summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.
    out_path (Path, required): Path to output text file.

Returns:

    None.

Source code in src/datafun/case_text_pipeline.py
def load_summary_report(*, summary: dict[str, int], out_path: Path) -> None:
    """L: Write summary to a text file in data/processed.

    Arguments:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Text File Summary\n")
        f.write(f"Lines: {summary['lines']}\n")
        f.write(f"Words: {summary['words']}\n")
        f.write(f"Characters: {summary['chars']}\n")

run_text_pipeline

run_text_pipeline(
    *, raw_dir: Path, processed_dir: Path, logger: Any
) -> None

Run the full ETVL pipeline.

Parameters:

    raw_dir (Path, required): Path to data/raw directory.
    processed_dir (Path, required): Path to data/processed directory.
    logger (Any, required): Logger for logging messages.

Returns:

    None.

Source code in src/datafun/case_text_pipeline.py
def run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Arguments:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None
    """
    logger.info("TXT: START")

    input_file = raw_dir / "romeo_and_juliet.txt"
    output_file = processed_dir / "txt_summary.txt"

    # E: Read raw data.
    lines = extract_lines(file_path=input_file)

    # T: Calculate counts.
    summary = transform_line_word_char_counts(lines=lines)

    # V: Verify results before writing.
    verify_summary(summary=summary)

    # L: Write results to disk.
    load_summary_report(summary=summary, out_path=output_file)

    logger.info("TXT: wrote %s", output_file)
    logger.info("TXT: END")

transform_line_word_char_counts

transform_line_word_char_counts(
    *, lines: list[str]
) -> dict[str, int]

T: Summarize a list of lines: line count, word count, character count.

Parameters:

    lines (list[str], required): List of lines from the text file.

Returns:

    dict[str, int]: Dictionary with counts for 'lines', 'words', and 'chars'.

Source code in src/datafun/case_text_pipeline.py
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    """T: Summarize a list of lines: line count, word count, character count.

    Arguments:
        lines: List of lines from the text file.

    Returns:
        Dictionary with counts for 'lines', 'words', and 'chars'.
    """
    line_count = len(lines)
    word_count = 0
    char_count = 0

    for line in lines:
        char_count += len(line)
        word_count += len(line.split())

    return {
        "lines": line_count,
        "words": word_count,
        "chars": char_count,
    }
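The three counts can be checked by hand on a tiny input. A minimal sketch with two sample lines (note that char_count includes the trailing newlines, because readlines() keeps them):

```python
# Two sample lines as readlines() would return them, newlines included.
lines = ["Two households, both alike in dignity,\n", "In fair Verona\n"]

line_count = len(lines)
# str.split() with no argument splits on any whitespace run, so it
# counts words without needing to strip the newline first.
word_count = sum(len(line.split()) for line in lines)
char_count = sum(len(line) for line in lines)

print(line_count, word_count, char_count)  # 2 9 54
```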

verify_summary

verify_summary(*, summary: dict[str, int]) -> None

V: Verify the summary has expected keys and non-negative values.

Parameters:

    summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.

Returns:

    None.

Source code in src/datafun/case_text_pipeline.py
def verify_summary(*, summary: dict[str, int]) -> None:
    """V: Verify the summary has expected keys and non-negative values.

    Arguments:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.

    Returns:
        None
    """
    for key in ("lines", "words", "chars"):
        # Handle known possible error: the key is missing.
        if key not in summary:
            raise KeyError(f"Missing summary key: {key}")
        # Handle known possible error: count is negative.
        if summary[key] < 0:
            raise ValueError(f"Invalid {key} count: {summary[key]}")