
API Reference

This page is auto-generated from Python docstrings.

datafun_03_analytics.app_case

app_case.py - Project script (example).

Author: Denise Case
Date: 2026-01

Practice key Python skills:

  • pathlib for cross-platform paths
  • logging (preferred over print)
  • calling functions from modules
  • clear ETVL pipeline stages:
      E = Extract (read: get data from source into memory)
      T = Transform (process: change data in memory)
      V = Verify (check: validate data in memory)
      L = Load (write results to data/processed or another destination)

OBS

Don't edit this file - it should remain a working example.

main

main() -> None

Entry point: run four simple ETVL pipelines.

Source code in src/datafun_03_analytics/app_case.py
def main() -> None:
    """Entry point: run four simple ETVL pipelines."""
    log_header(LOG, "Pipelines: Read, Process, Verify, Write (ETVL)")
    LOG.info("START main()")

    # Each pipeline reads from data/raw and writes to data/processed.
    run_csv_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_xlsx_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_json_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_text_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)

    LOG.info("END main()")

datafun_03_analytics.case_csv_pipeline

case_csv_pipeline.py - CSV ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
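The pattern can be sketched with a tiny standalone function (the name and parameters here are invented for illustration, not taken from the project):

```python
# A hypothetical function using the keyword-only marker `*`.
# Every parameter declared after the bare asterisk is keyword-only.
def describe(*, name: str, count: int) -> str:
    """Return a short summary; both arguments must be passed by name."""
    return f"{name}: {count}"

# Correct: arguments are named, so order cannot be confused.
print(describe(name="rows", count=3))

# Incorrect: describe("rows", 3) raises TypeError, because positional
# arguments are not accepted after the `*` marker.
```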

extract_csv_scores

extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]

E: Read CSV and extract one numeric column as floats.

Parameters:

  • file_path (Path, required): Path to input CSV file.
  • column_name (str, required): Name of the column to extract.

Returns:

  • list[float]: List of float values from the specified column.

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]:
    """E: Read CSV and extract one numeric column as floats.

    Args:
        file_path: Path to input CSV file.
        column_name: Name of the column to extract.

    Returns:
        List of float values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    scores: list[float] = []
    with file_path.open("r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)

        # Handle known possible error: missing expected column.
        if reader.fieldnames is None or column_name not in reader.fieldnames:
            raise KeyError(
                f"CSV missing expected column '{column_name}'. Found: {reader.fieldnames}"
            )

        for row in reader:
            raw_value = (row.get(column_name) or "").strip()
            if not raw_value:
                continue
            try:
                scores.append(float(raw_value))
            except ValueError:
                # Keep it simple: skip rows that do not convert cleanly.
                continue

    return scores

load_stats_report

load_stats_report(*, stats: dict[str, float], out_path: Path) -> None

L: Write stats to a text file in data/processed.

Parameters:

  • stats (dict[str, float], required): Dictionary with statistics to write.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def load_stats_report(*, stats: dict[str, float], out_path: Path) -> None:
    """L: Write stats to a text file in data/processed.

    Args:
        stats: Dictionary with statistics to write.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("CSV Ladder Score Statistics\n")
        f.write(f"Count: {int(stats['count'])}\n")
        f.write(f"Minimum: {stats['min']:.2f}\n")
        f.write(f"Maximum: {stats['max']:.2f}\n")
        f.write(f"Mean: {stats['mean']:.2f}\n")
        f.write(f"Standard Deviation: {stats['stdev']:.2f}\n")

run_csv_pipeline

run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("CSV: START")

    input_file = raw_dir / "2020_happiness.csv"
    output_file = processed_dir / "csv_ladder_score_stats.txt"

    # E
    scores = extract_csv_scores(file_path=input_file, column_name="Ladder score")

    # T
    stats = transform_scores_to_stats(scores=scores)

    # V
    verify_stats(stats=stats)

    # L
    load_stats_report(stats=stats, out_path=output_file)

    logger.info("CSV: wrote %s", output_file)
    logger.info("CSV: END")

transform_scores_to_stats

transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]

T: Calculate basic statistics for a list of floats.

Parameters:

  • scores (list[float], required): List of float values.

Returns:

  • dict[str, float]: Dictionary with keys: count, min, max, mean, stdev.

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    """T: Calculate basic statistics for a list of floats.

    Args:
        scores: List of float values.

    Returns:
        Dictionary with keys: count, min, max, mean, stdev.
    """
    if not scores:
        raise ValueError("No numeric values found for analysis.")

    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
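As a quick illustration, the transform can be exercised standalone (the sample scores below are invented, not from the happiness dataset; the function body is copied from above so the snippet runs on its own):

```python
import statistics

# Copied from transform_scores_to_stats above so this snippet is self-contained.
def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    if not scores:
        raise ValueError("No numeric values found for analysis.")
    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Invented sample values for illustration.
stats = transform_scores_to_stats(scores=[1.0, 2.0, 3.0])
print(stats)  # count 3.0, min 1.0, max 3.0, mean 2.0, stdev 1.0
```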

verify_stats

verify_stats(*, stats: dict[str, float]) -> None

V: Sanity-check the stats dictionary.

Parameters:

  • stats (dict[str, float], required): Dictionary with statistics to verify.

Raises:

  • KeyError: If expected keys are missing.
  • ValueError: If any stats values are invalid.

Returns:

  • None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def verify_stats(*, stats: dict[str, float]) -> None:
    """V: Sanity-check the stats dictionary.

    Args:
        stats: Dictionary with statistics to verify.

    Raises:
        KeyError: If expected keys are missing.
        ValueError: If any stats values are invalid.

    Returns:
        None
    """
    required = {"count", "min", "max", "mean", "stdev"}
    missing = required - set(stats.keys())
    # Handle known possible error: missing required keys.
    if missing:
        raise KeyError(f"Missing stats keys: {sorted(missing)}")

    # Handle known possible error: count must be positive.
    if stats["count"] <= 0:
        raise ValueError("Count must be positive.")
    # Handle known possible error: min cannot be greater than max.
    if stats["min"] > stats["max"]:
        raise ValueError("Min cannot be greater than max.")

datafun_03_analytics.case_xlsx_pipeline

case_xlsx_pipeline.py - XLSX ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

extract_xlsx_column_strings

extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]

E: Read an Excel file and extract string values from a column.

Parameters:

  • file_path (Path, required): Path to input XLSX file.
  • column_letter (str, required): Letter of the column to extract (e.g., 'A').

Returns:

  • list[str]: List of string values from the specified column.

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]:
    """E: Read an Excel file and extract string values from a column.

    Args:
        file_path: Path to input XLSX file.
        column_letter: Letter of the column to extract (e.g., 'A').

    Returns:
        List of string values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    workbook = openpyxl.load_workbook(file_path)
    sheet = workbook.active

    values: list[str] = []

    for cell in sheet[column_letter]:
        cell = cast(Cell, cell)
        value = cell.value
        if isinstance(value, str) and value.strip():
            values.append(value)
    return values

load_count_report

load_count_report(*, count: int, out_path: Path, word: str, column_letter: str) -> None

L: Write the result to a text file in data/processed.

Parameters:

  • count (int, required): The word count to write.
  • out_path (Path, required): Path to output text file.
  • word (str, required): The word that was counted.
  • column_letter (str, required): The column letter that was processed.

Returns:

  • None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def load_count_report(
    *, count: int, out_path: Path, word: str, column_letter: str
) -> None:
    """L: Write the result to a text file in data/processed.

    Args:
        count: The word count to write.
        out_path: Path to output text file.
        word: The word that was counted.
        column_letter: The column letter that was processed.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("XLSX Word Count Result\n")
        f.write(f"Word: {word}\n")
        f.write(f"Column: {column_letter}\n")
        f.write(f"Count: {count}\n")

run_xlsx_pipeline

run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("XLSX: START")

    input_file = raw_dir / "Feedback.xlsx"
    output_file = processed_dir / "xlsx_feedback_github_count.txt"

    column_letter = "A"
    word = "GitHub"

    # E
    values = extract_xlsx_column_strings(
        file_path=input_file,
        column_letter=column_letter,
    )

    # T
    count = transform_count_word(values=values, word=word)

    # V
    verify_count(count=count)

    # L
    load_count_report(
        count=count, out_path=output_file, word=word, column_letter=column_letter
    )

    logger.info("XLSX: wrote %s", output_file)
    logger.info("XLSX: END")

transform_count_word

transform_count_word(*, values: list[str], word: str) -> int

T: Count occurrences of a word across strings (case-insensitive).

Parameters:

  • values (list[str], required): List of strings to search.
  • word (str, required): Word to count.

Returns:

  • int: Count of occurrences of the word.

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def transform_count_word(*, values: list[str], word: str) -> int:
    """T: Count occurrences of a word across strings (case-insensitive).

    Args:
        values: List of strings to search.
        word: Word to count.

    Returns:
        Count of occurrences of the word.
    """
    # Handle known possible error: no word provided by caller.
    if not word:
        raise ValueError("Word to count cannot be empty.")

    target = word.lower()
    count = 0
    for text in values:
        count += text.lower().count(target)
    return count
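A quick standalone illustration (the sample strings below are invented; the function body is copied from above so the snippet runs on its own). Note that matching is case-insensitive and substring-based:

```python
# Copied from transform_count_word above so this snippet is self-contained.
def transform_count_word(*, values: list[str], word: str) -> int:
    if not word:
        raise ValueError("Word to count cannot be empty.")
    target = word.lower()
    count = 0
    for text in values:
        count += text.lower().count(target)
    return count

# Invented sample strings: "github" matches regardless of capitalization.
n = transform_count_word(values=["I use GitHub daily", "github and GitHub"], word="GitHub")
print(n)  # 3
```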

verify_count

verify_count(*, count: int) -> None

V: Verify the count is valid.

Parameters:

  • count (int, required): The count to verify.

Raises:

  • ValueError: If the count is negative.

Returns:

  • None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def verify_count(*, count: int) -> None:
    """V: Verify the count is valid.

    Args:
        count: The count to verify.

    Raises:
        ValueError: If the count is negative.

    Returns:
        None
    """
    # Handle known possible error: count is negative.
    if count < 0:
        raise ValueError("Count cannot be negative.")

datafun_03_analytics.case_json_pipeline

case_json_pipeline.py - JSON ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

This example is intentionally explicit about walking JSON:

  • json.load(file) returns a Python dictionary (top-level object)
  • dict.get("people", []) safely retrieves a nested list
  • iteration is used to walk arrays (lists)
  • each list element is expected to be a dictionary with keys such as "craft"

Core JSON Data Concepts:

  • JSON is hierarchical (tree-structured)
  • JSON arrays map to Python lists
  • JSON objects map to Python dictionaries (key-value pairs)
  • JSON is nested (lists and dictionaries can appear within each other)
  • JSON is untrusted input (keys may be missing, values may be wrong types)
  • JSON values are optional (no required keys)
  • JSON types are runtime facts, not promises (no static typing or schema)

Runtime Validation and Defensive Access:

  • Use isinstance() to verify value types at runtime
  • Use dict.get(key, default) to handle missing keys safely
  • Use iteration to walk arrays (lists)
  • Apply defensive programming for unexpected or missing data
  • Verify file existence before attempting to read JSON
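These rules can be demonstrated on a tiny in-memory document (the JSON below is invented for illustration): one entry is well-formed, one is missing the "craft" key, and one array element is not an object at all.

```python
import json

# A small invented JSON document with one well-formed entry,
# one entry missing the "craft" key, and one non-dict element.
raw = '{"people": [{"craft": "ISS", "name": "A"}, {"name": "B"}, 42]}'
data = json.loads(raw)

crafts: list[str] = []
if isinstance(data, dict):                   # verify the top level is an object
    for item in data.get("people", []):      # .get() defaults to [] if the key is missing
        if isinstance(item, dict):           # skip array elements that are not objects
            crafts.append(item.get("craft", "Unknown"))  # default for missing keys

print(crafts)  # ['ISS', 'Unknown']
```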

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

Example JSON Data:

{ "people": [ { "craft": "ISS", "name": "Oleg Kononenko" }, ...

extract_people_list

extract_people_list(*, file_path: Path, list_key: str = 'people') -> list[dict[str, Any]]

E/V: Read JSON file and extract a list of dictionaries under list_key.

Parameters:

  • file_path (Path, required): Path to input JSON file.
  • list_key (str, default 'people'): Top-level key expected to map to a list.

Returns:

  • list[dict[str, Any]]: A list of dictionaries from the JSON file.

Source code in src/datafun_03_analytics/case_json_pipeline.py
def extract_people_list(
    *, file_path: Path, list_key: str = "people"
) -> list[dict[str, Any]]:
    """E/V: Read JSON file and extract a list of dictionaries under list_key.

    Args:
        file_path: Path to input JSON file.
        list_key: Top-level key expected to map to a list (default: "people").

    Returns:
        A list of dictionaries from the JSON file.
    """
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        data: Any = json.load(f)

    if not isinstance(data, dict):
        raise TypeError("Expected JSON top-level object to be a dictionary.")

    value: Any = data.get(list_key, [])
    if not isinstance(value, list):
        raise TypeError(f"Expected {list_key!r} to be a list.")

    people_list: list[dict[str, Any]] = []
    for item in value:
        if isinstance(item, dict):
            # The isinstance() check above confirms the type, so a targeted
            # type-ignore comment silences the remaining static-analysis warning.
            people_list.append(item)  # type: ignore[arg-type]

    return people_list

load_counts_report

load_counts_report(*, counts: dict[str, int], out_path: Path) -> None

L: Write craft counts to a text file in data/processed.

Parameters:

  • counts (dict[str, int], required): Dictionary mapping craft names to counts.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def load_counts_report(*, counts: dict[str, int], out_path: Path) -> None:
    """L: Write craft counts to a text file in data/processed.

    Args:
        counts: Dictionary mapping craft names to counts.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Astronauts by spacecraft:\n")
        for craft in sorted(counts):
            f.write(f"{craft}: {counts[craft]}\n")

run_json_pipeline

run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("JSON: START")

    input_file = raw_dir / "astros.json"
    output_file = processed_dir / "json_astronauts_by_craft.txt"

    # E
    people_list = extract_people_list(file_path=input_file, list_key="people")

    # T
    craft_counts = transform_count_by_craft(people_list=people_list, craft_key="craft")

    # V
    verify_counts(counts=craft_counts)

    # L
    load_counts_report(counts=craft_counts, out_path=output_file)

    logger.info("JSON: wrote %s", output_file)
    logger.info("JSON: END")

transform_count_by_craft

transform_count_by_craft(*, people_list: list[dict[str, Any]], craft_key: str = 'craft') -> dict[str, int]

T/V: Count people by craft.

Parameters:

  • people_list (list[dict[str, Any]], required): List of person dictionaries.
  • craft_key (str, default 'craft'): Key to read craft name from.

Returns:

  • dict[str, int]: Dictionary mapping craft names to counts.

Source code in src/datafun_03_analytics/case_json_pipeline.py
def transform_count_by_craft(
    *, people_list: list[dict[str, Any]], craft_key: str = "craft"
) -> dict[str, int]:
    """T/V: Count people by craft.

    Args:
        people_list: List of person dictionaries.
        craft_key: Key to read craft name from (default: "craft").

    Returns:
        Dictionary mapping craft names to counts.
    """
    counts: dict[str, int] = {}

    for person in people_list:
        craft: Any = person.get(craft_key, "Unknown")
        if not isinstance(craft, str) or not craft.strip():
            craft = "Unknown"
        counts[craft] = counts.get(craft, 0) + 1

    return counts
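A standalone illustration (the people list below is invented; the function body is copied from above so the snippet runs on its own). A person with no "craft" key falls back to "Unknown":

```python
from typing import Any

# Copied from transform_count_by_craft above so this snippet is self-contained.
def transform_count_by_craft(
    *, people_list: list[dict[str, Any]], craft_key: str = "craft"
) -> dict[str, int]:
    counts: dict[str, int] = {}
    for person in people_list:
        craft: Any = person.get(craft_key, "Unknown")
        if not isinstance(craft, str) or not craft.strip():
            craft = "Unknown"
        counts[craft] = counts.get(craft, 0) + 1
    return counts

# Invented sample: the third person has no "craft" key.
people = [{"craft": "ISS", "name": "A"}, {"craft": "ISS", "name": "B"}, {"name": "C"}]
print(transform_count_by_craft(people_list=people))  # {'ISS': 2, 'Unknown': 1}
```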

verify_counts

verify_counts(*, counts: dict[str, int]) -> None

V: Verify counts are non-negative and craft names are not empty.

Parameters:

  • counts (dict[str, int], required): Dictionary mapping craft names to counts.

Raises:

  • ValueError: If any count is negative or craft name is invalid.

Returns:

  • None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def verify_counts(*, counts: dict[str, int]) -> None:
    """V: Verify counts are non-negative and craft names are not empty.

    Args:
        counts: Dictionary mapping craft names to counts.

    Raises:
        ValueError: If any count is negative or craft name is invalid.

    Returns:
        None
    """
    for craft, count in counts.items():
        # Handle known possible error: invalid craft name after stripping whitespace.
        if not craft.strip():
            raise ValueError(f"Invalid craft name: {craft!r}")
        # Handle known possible error: count is negative.
        if count < 0:
            raise ValueError(f"Invalid count for craft {craft!r}: {count}")

datafun_03_analytics.case_text_pipeline

case_text_pipeline.py - Text ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

extract_lines

extract_lines(*, file_path: Path) -> list[str]

E: Read a text file into a list of lines.

Parameters:

  • file_path (Path, required): Path to input text file.

Returns:

  • list[str]: List of lines from the text file.

Source code in src/datafun_03_analytics/case_text_pipeline.py
def extract_lines(*, file_path: Path) -> list[str]:
    """E: Read a text file into a list of lines.

    Args:
        file_path: Path to input text file.

    Returns:
        List of lines from the text file.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        return f.readlines()

load_summary_report

load_summary_report(*, summary: dict[str, int], out_path: Path) -> None

L: Write summary to a text file in data/processed.

Parameters:

  • summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def load_summary_report(*, summary: dict[str, int], out_path: Path) -> None:
    """L: Write summary to a text file in data/processed.

    Args:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Text File Summary\n")
        f.write(f"Lines: {summary['lines']}\n")
        f.write(f"Words: {summary['words']}\n")
        f.write(f"Characters: {summary['chars']}\n")

run_text_pipeline

run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("TXT: START")

    input_file = raw_dir / "romeo_and_juliet.txt"
    output_file = processed_dir / "txt_summary.txt"

    # E
    lines = extract_lines(file_path=input_file)

    # T
    summary = transform_line_word_char_counts(lines=lines)

    # V
    verify_summary(summary=summary)

    # L
    load_summary_report(summary=summary, out_path=output_file)

    logger.info("TXT: wrote %s", output_file)
    logger.info("TXT: END")

transform_line_word_char_counts

transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]

T: Create a simple summary: line count, word count, character count.

Parameters:

  • lines (list[str], required): List of lines from the text file.

Returns:

  • dict[str, int]: Dictionary with counts for 'lines', 'words', and 'chars'.

Source code in src/datafun_03_analytics/case_text_pipeline.py
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    """T: Create a simple summary: line count, word count, character count.

    Args:
        lines: List of lines from the text file.

    Returns:
        Dictionary with counts for 'lines', 'words', and 'chars'.
    """
    line_count = len(lines)
    word_count = 0
    char_count = 0

    for line in lines:
        char_count += len(line)
        word_count += len(line.split())

    return {
        "lines": line_count,
        "words": word_count,
        "chars": char_count,
    }
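A standalone illustration (the sample lines below are invented, written as readlines() would return them, with trailing newlines kept; the function body is copied from above so the snippet runs on its own):

```python
# Copied from transform_line_word_char_counts above so this snippet is self-contained.
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    line_count = len(lines)
    word_count = 0
    char_count = 0
    for line in lines:
        char_count += len(line)          # includes the trailing newline
        word_count += len(line.split())  # split() handles any run of whitespace
    return {"lines": line_count, "words": word_count, "chars": char_count}

# Invented sample lines, as readlines() would return them.
summary = transform_line_word_char_counts(lines=["hello world\n", "bye\n"])
print(summary)  # {'lines': 2, 'words': 3, 'chars': 16}
```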

verify_summary

verify_summary(*, summary: dict[str, int]) -> None

V: Verify the summary has expected keys and non-negative values.

Parameters:

  • summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.

Raises:

  • KeyError: If expected keys are missing.
  • ValueError: If any count is negative.

Returns:

  • None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def verify_summary(*, summary: dict[str, int]) -> None:
    """V: Verify the summary has expected keys and non-negative values.

    Args:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.

    Raises:
        KeyError: If expected keys are missing.
        ValueError: If any count is negative.

    Returns:
        None
    """
    for key in ("lines", "words", "chars"):
        # Handle known possible error: the key is missing.
        if key not in summary:
            raise KeyError(f"Missing summary key: {key}")
        # Handle known possible error: count is negative.
        if summary[key] < 0:
            raise ValueError(f"Invalid {key} count: {summary[key]}")