
API Reference

This page is auto-generated from Python docstrings.

datafun_03_analytics.app_case

app_case.py - Project script (example).

Author: Denise Case
Date: 2026-01

Practice key Python skills:

  • pathlib for cross-platform paths
  • logging (preferred over print)
  • calling functions from modules
  • clear ETVL pipeline stages:
      E = Extract (read: get data from source into memory)
      T = Transform (process: change data in memory)
      V = Verify (check: validate data in memory)
      L = Load (write results to data/processed or another destination)

OBS

Don't edit this file - it should remain a working example.

main

main() -> None

Entry point: run four simple ETVL pipelines.

Source code in src/datafun_03_analytics/app_case.py
def main() -> None:
    """Entry point: run four simple ETVL pipelines."""
    log_header(LOG, "Pipelines: Read, Process, Verify, Write (ETVL)")
    LOG.info("START main()")

    # Each pipeline reads from data/raw and writes to data/processed.
    run_csv_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_xlsx_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_json_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)
    run_text_pipeline(raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR, logger=LOG)

    LOG.info("END main()")

datafun_03_analytics.case_csv_pipeline

case_csv_pipeline.py - CSV ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
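The pattern can be sketched with a tiny standalone function (the name and parameters here are invented for illustration, not taken from the project):

```python
# A hypothetical function using the keyword-only marker `*`.
# Every parameter declared after the bare asterisk is keyword-only.
def describe(*, name: str, count: int) -> str:
    """Return a short summary; both arguments must be passed by name."""
    return f"{name}: {count}"

# Correct: arguments are named, so order cannot be confused.
print(describe(name="rows", count=3))

# Incorrect: describe("rows", 3) raises TypeError, because positional
# arguments are not accepted after the `*` marker.
```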

extract_csv_scores

extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]

E: Read CSV and extract one numeric column as floats.

Parameters:

  • file_path (Path, required): Path to input CSV file.
  • column_name (str, required): Name of the column to extract.

Returns:

  • list[float]: List of float values from the specified column.

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]:
    """E: Read CSV and extract one numeric column as floats.

    Args:
        file_path: Path to input CSV file.
        column_name: Name of the column to extract.

    Returns:
        List of float values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    scores: list[float] = []
    with file_path.open("r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)

        # Handle known possible error: missing expected column.
        if reader.fieldnames is None or column_name not in reader.fieldnames:
            raise KeyError(
                f"CSV missing expected column '{column_name}'. Found: {reader.fieldnames}"
            )

        for row in reader:
            raw_value = (row.get(column_name) or "").strip()
            if not raw_value:
                continue
            try:
                scores.append(float(raw_value))
            except ValueError:
                # Keep it simple: skip rows that do not convert cleanly.
                continue

    return scores

load_stats_report

load_stats_report(*, stats: dict[str, float], out_path: Path) -> None

L: Write stats to a text file in data/processed.

Parameters:

  • stats (dict[str, float], required): Dictionary with statistics to write.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def load_stats_report(*, stats: dict[str, float], out_path: Path) -> None:
    """L: Write stats to a text file in data/processed.

    Args:
        stats: Dictionary with statistics to write.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("CSV Ladder Score Statistics\n")
        f.write(f"Count: {int(stats['count'])}\n")
        f.write(f"Minimum: {stats['min']:.2f}\n")
        f.write(f"Maximum: {stats['max']:.2f}\n")
        f.write(f"Mean: {stats['mean']:.2f}\n")
        f.write(f"Standard Deviation: {stats['stdev']:.2f}\n")

run_csv_pipeline

run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("CSV: START")

    input_file = raw_dir / "2020_happiness.csv"
    output_file = processed_dir / "csv_ladder_score_stats.txt"

    # E
    scores = extract_csv_scores(file_path=input_file, column_name="Ladder score")

    # T
    stats = transform_scores_to_stats(scores=scores)

    # V
    verify_stats(stats=stats)

    # L
    load_stats_report(stats=stats, out_path=output_file)

    logger.info("CSV: wrote %s", output_file)
    logger.info("CSV: END")

transform_scores_to_stats

transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]

T: Calculate basic statistics for a list of floats.

Parameters:

  • scores (list[float], required): List of float values.

Returns:

  • dict[str, float]: Dictionary with keys: count, min, max, mean, stdev.

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    """T: Calculate basic statistics for a list of floats.

    Args:
        scores: List of float values.

    Returns:
        Dictionary with keys: count, min, max, mean, stdev.
    """
    if not scores:
        raise ValueError("No numeric values found for analysis.")

    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
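As a quick illustration, the transform can be exercised standalone (the sample scores below are invented, not from the happiness dataset; the function body is copied from above so the snippet runs on its own):

```python
import statistics

# Copied from transform_scores_to_stats above so this snippet is self-contained.
def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    if not scores:
        raise ValueError("No numeric values found for analysis.")
    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Invented sample values for illustration.
stats = transform_scores_to_stats(scores=[1.0, 2.0, 3.0])
print(stats)  # count 3.0, min 1.0, max 3.0, mean 2.0, stdev 1.0
```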

verify_stats

verify_stats(*, stats: dict[str, float]) -> None

V: Sanity-check the stats dictionary.

Parameters:

  • stats (dict[str, float], required): Dictionary with statistics to verify.

Raises:

  • KeyError: If expected keys are missing.
  • ValueError: If any stats values are invalid.

Returns:

  • None

Source code in src/datafun_03_analytics/case_csv_pipeline.py
def verify_stats(*, stats: dict[str, float]) -> None:
    """V: Sanity-check the stats dictionary.

    Args:
        stats: Dictionary with statistics to verify.

    Raises:
        KeyError: If expected keys are missing.
        ValueError: If any stats values are invalid.

    Returns:
        None
    """
    required = {"count", "min", "max", "mean", "stdev"}
    missing = required - set(stats.keys())
    # Handle known possible error: missing required keys.
    if missing:
        raise KeyError(f"Missing stats keys: {sorted(missing)}")

    # Handle known possible error: count must be positive.
    if stats["count"] <= 0:
        raise ValueError("Count must be positive.")
    # Handle known possible error: min cannot be greater than max.
    if stats["min"] > stats["max"]:
        raise ValueError("Min cannot be greater than max.")

datafun_03_analytics.case_xlsx_pipeline

case_xlsx_pipeline.py - XLSX ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

extract_xlsx_column_strings

extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]

E: Read an Excel file and extract string values from a column.

Parameters:

  • file_path (Path, required): Path to input XLSX file.
  • column_letter (str, required): Letter of the column to extract (e.g., 'A').

Returns:

  • list[str]: List of string values from the specified column.

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]:
    """E: Read an Excel file and extract string values from a column.

    Args:
        file_path: Path to input XLSX file.
        column_letter: Letter of the column to extract (e.g., 'A').

    Returns:
        List of string values from the specified column.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    workbook = openpyxl.load_workbook(file_path)
    sheet = workbook.active

    values: list[str] = []

    for cell in sheet[column_letter]:
        cell = cast(Cell, cell)
        value = cell.value
        if isinstance(value, str) and value.strip():
            values.append(value)
    return values

load_count_report

load_count_report(*, count: int, out_path: Path, word: str, column_letter: str) -> None

L: Write the result to a text file in data/processed.

Parameters:

  • count (int, required): The word count to write.
  • out_path (Path, required): Path to output text file.
  • word (str, required): The word that was counted.
  • column_letter (str, required): The column letter that was processed.

Returns:

  • None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def load_count_report(
    *, count: int, out_path: Path, word: str, column_letter: str
) -> None:
    """L: Write the result to a text file in data/processed.

    Args:
        count: The word count to write.
        out_path: Path to output text file.
        word: The word that was counted.
        column_letter: The column letter that was processed.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("XLSX Word Count Result\n")
        f.write(f"Word: {word}\n")
        f.write(f"Column: {column_letter}\n")
        f.write(f"Count: {count}\n")

run_xlsx_pipeline

run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("XLSX: START")

    input_file = raw_dir / "Feedback.xlsx"
    output_file = processed_dir / "xlsx_feedback_github_count.txt"

    column_letter = "A"
    word = "GitHub"

    # E
    values = extract_xlsx_column_strings(
        file_path=input_file,
        column_letter=column_letter,
    )

    # T
    count = transform_count_word(values=values, word=word)

    # V
    verify_count(count=count)

    # L
    load_count_report(
        count=count, out_path=output_file, word=word, column_letter=column_letter
    )

    logger.info("XLSX: wrote %s", output_file)
    logger.info("XLSX: END")

transform_count_word

transform_count_word(*, values: list[str], word: str) -> int

T: Count occurrences of a word across strings (case-insensitive).

Parameters:

  • values (list[str], required): List of strings to search.
  • word (str, required): Word to count.

Returns:

  • int: Count of occurrences of the word.

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def transform_count_word(*, values: list[str], word: str) -> int:
    """T: Count occurrences of a word across strings (case-insensitive).

    Args:
        values: List of strings to search.
        word: Word to count.

    Returns:
        Count of occurrences of the word.
    """
    # Handle known possible error: no word provided by caller.
    if not word:
        raise ValueError("Word to count cannot be empty.")

    target = word.lower()
    count = 0
    for text in values:
        count += text.lower().count(target)
    return count
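A quick standalone illustration (the sample strings below are invented; the function body is copied from above so the snippet runs on its own). Note that matching is case-insensitive and substring-based:

```python
# Copied from transform_count_word above so this snippet is self-contained.
def transform_count_word(*, values: list[str], word: str) -> int:
    if not word:
        raise ValueError("Word to count cannot be empty.")
    target = word.lower()
    count = 0
    for text in values:
        count += text.lower().count(target)
    return count

# Invented sample strings: "github" matches regardless of capitalization.
n = transform_count_word(values=["I use GitHub daily", "github and GitHub"], word="GitHub")
print(n)  # 3
```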

verify_count

verify_count(*, count: int) -> None

V: Verify the count is valid.

Parameters:

  • count (int, required): The count to verify.

Raises:

  • ValueError: If the count is negative.

Returns:

  • None

Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
def verify_count(*, count: int) -> None:
    """V: Verify the count is valid.

    Args:
        count: The count to verify.

    Raises:
        ValueError: If the count is negative.

    Returns:
        None
    """
    # Handle known possible error: count is negative.
    if count < 0:
        raise ValueError("Count cannot be negative.")

datafun_03_analytics.case_json_pipeline

case_json_pipeline.py - JSON ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

This example is intentionally explicit about walking JSON:

  • json.load(file) returns a Python dictionary (top-level object)
  • dict.get("people", []) safely retrieves a nested list
  • iteration is used to walk arrays (lists)
  • each list element is expected to be a dictionary with keys such as "craft"

Core JSON Data Concepts:

  • JSON is hierarchical (tree-structured)
  • JSON arrays map to Python lists
  • JSON objects map to Python dictionaries (key-value pairs)
  • JSON is nested (lists and dictionaries can appear within each other)
  • JSON is untrusted input (keys may be missing, values may be wrong types)
  • JSON values are optional (no required keys)
  • JSON types are runtime facts, not promises (no static typing or schema)

Runtime Validation and Defensive Access:

  • Use isinstance() to verify value types at runtime
  • Use dict.get(key, default) to handle missing keys safely
  • Use iteration to walk arrays (lists)
  • Apply defensive programming for unexpected or missing data
  • Verify file existence before attempting to read JSON
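These rules can be demonstrated on a tiny in-memory document (the JSON below is invented for illustration): one entry is well-formed, one is missing the "craft" key, and one array element is not an object at all.

```python
import json

# A small invented JSON document with one well-formed entry,
# one entry missing the "craft" key, and one non-dict element.
raw = '{"people": [{"craft": "ISS", "name": "A"}, {"name": "B"}, 42]}'
data = json.loads(raw)

crafts: list[str] = []
if isinstance(data, dict):                   # verify the top level is an object
    for item in data.get("people", []):      # .get() defaults to [] if the key is missing
        if isinstance(item, dict):           # skip array elements that are not objects
            crafts.append(item.get("craft", "Unknown"))  # default for missing keys

print(crafts)  # ['ISS', 'Unknown']
```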

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

Example JSON Data:

{ "people": [ { "craft": "ISS", "name": "Oleg Kononenko" }, ...

extract_people_list

extract_people_list(*, file_path: Path, list_key: str = 'people') -> list[dict[str, Any]]

E/V: Read JSON file and extract a list of dictionaries under list_key.

Parameters:

  • file_path (Path, required): Path to input JSON file.
  • list_key (str, default 'people'): Top-level key expected to map to a list.

Returns:

  • list[dict[str, Any]]: A list of dictionaries from the JSON file.

Source code in src/datafun_03_analytics/case_json_pipeline.py
def extract_people_list(
    *, file_path: Path, list_key: str = "people"
) -> list[dict[str, Any]]:
    """E/V: Read JSON file and extract a list of dictionaries under list_key.

    Args:
        file_path: Path to input JSON file.
        list_key: Top-level key expected to map to a list (default: "people").

    Returns:
        A list of dictionaries from the JSON file.
    """
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        data: Any = json.load(f)

    if not isinstance(data, dict):
        raise TypeError("Expected JSON top-level object to be a dictionary.")

    value: Any = data.get(list_key, [])
    if not isinstance(value, list):
        raise TypeError(f"Expected {list_key!r} to be a list.")

    people_list: list[dict[str, Any]] = []
    for item in value:
        if isinstance(item, dict):
            # The isinstance() check above confirms the type, so a targeted
            # type-ignore comment silences the remaining static-analysis warning.
            people_list.append(item)  # type: ignore[arg-type]

    return people_list

load_counts_report

load_counts_report(*, counts: dict[str, int], out_path: Path) -> None

L: Write craft counts to a text file in data/processed.

Parameters:

  • counts (dict[str, int], required): Dictionary mapping craft names to counts.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def load_counts_report(*, counts: dict[str, int], out_path: Path) -> None:
    """L: Write craft counts to a text file in data/processed.

    Args:
        counts: Dictionary mapping craft names to counts.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Astronauts by spacecraft:\n")
        for craft in sorted(counts):
            f.write(f"{craft}: {counts[craft]}\n")

run_json_pipeline

run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("JSON: START")

    input_file = raw_dir / "astros.json"
    output_file = processed_dir / "json_astronauts_by_craft.txt"

    # E
    people_list = extract_people_list(file_path=input_file, list_key="people")

    # T
    craft_counts = transform_count_by_craft(people_list=people_list, craft_key="craft")

    # V
    verify_counts(counts=craft_counts)

    # L
    load_counts_report(counts=craft_counts, out_path=output_file)

    logger.info("JSON: wrote %s", output_file)
    logger.info("JSON: END")

transform_count_by_craft

transform_count_by_craft(*, people_list: list[dict[str, Any]], craft_key: str = 'craft') -> dict[str, int]

T/V: Count people by craft.

Parameters:

  • people_list (list[dict[str, Any]], required): List of person dictionaries.
  • craft_key (str, default 'craft'): Key to read craft name from.

Returns:

  • dict[str, int]: Dictionary mapping craft names to counts.

Source code in src/datafun_03_analytics/case_json_pipeline.py
def transform_count_by_craft(
    *, people_list: list[dict[str, Any]], craft_key: str = "craft"
) -> dict[str, int]:
    """T/V: Count people by craft.

    Args:
        people_list: List of person dictionaries.
        craft_key: Key to read craft name from (default: "craft").

    Returns:
        Dictionary mapping craft names to counts.
    """
    counts: dict[str, int] = {}

    for person in people_list:
        craft: Any = person.get(craft_key, "Unknown")
        if not isinstance(craft, str) or not craft.strip():
            craft = "Unknown"
        counts[craft] = counts.get(craft, 0) + 1

    return counts
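A standalone illustration (the people list below is invented; the function body is copied from above so the snippet runs on its own). A person with no "craft" key falls back to "Unknown":

```python
from typing import Any

# Copied from transform_count_by_craft above so this snippet is self-contained.
def transform_count_by_craft(
    *, people_list: list[dict[str, Any]], craft_key: str = "craft"
) -> dict[str, int]:
    counts: dict[str, int] = {}
    for person in people_list:
        craft: Any = person.get(craft_key, "Unknown")
        if not isinstance(craft, str) or not craft.strip():
            craft = "Unknown"
        counts[craft] = counts.get(craft, 0) + 1
    return counts

# Invented sample: the third person has no "craft" key.
people = [{"craft": "ISS", "name": "A"}, {"craft": "ISS", "name": "B"}, {"name": "C"}]
print(transform_count_by_craft(people_list=people))  # {'ISS': 2, 'Unknown': 1}
```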

verify_counts

verify_counts(*, counts: dict[str, int]) -> None

V: Verify counts are non-negative and craft names are not empty.

Parameters:

  • counts (dict[str, int], required): Dictionary mapping craft names to counts.

Raises:

  • ValueError: If any count is negative or craft name is invalid.

Returns:

  • None

Source code in src/datafun_03_analytics/case_json_pipeline.py
def verify_counts(*, counts: dict[str, int]) -> None:
    """V: Verify counts are non-negative and craft names are not empty.

    Args:
        counts: Dictionary mapping craft names to counts.

    Raises:
        ValueError: If any count is negative or craft name is invalid.

    Returns:
        None
    """
    for craft, count in counts.items():
        # Handle known possible error: invalid craft name after stripping whitespace.
        if not craft.strip():
            raise ValueError(f"Invalid craft name: {craft!r}")
        # Handle known possible error: count is negative.
        if count < 0:
            raise ValueError(f"Invalid count for craft {craft!r}: {count}")

datafun_03_analytics.case_text_pipeline

case_text_pipeline.py - Text ETVL pipeline.

ETVL

  • E = Extract (read)
  • T = Transform (process)
  • V = Verify (check)
  • L = Load (write results to data/processed)

CUSTOM: We turn off some of our Pyright type checks when working with raw data pipelines. WHY: We don't know what types things are until after we read them. OBS: See pyproject.toml and the [tool.pyright] section for details.

CUSTOM: We use keyword-only function arguments. In our functions, you'll see a bare asterisk (*) in the parameter list; it can appear anywhere among the parameters. EVERY argument AFTER the asterisk must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.

extract_lines

extract_lines(*, file_path: Path) -> list[str]

E: Read a text file into a list of lines.

Parameters:

  • file_path (Path, required): Path to input text file.

Returns:

  • list[str]: List of lines from the text file.

Source code in src/datafun_03_analytics/case_text_pipeline.py
def extract_lines(*, file_path: Path) -> list[str]:
    """E: Read a text file into a list of lines.

    Args:
        file_path: Path to input text file.

    Returns:
        List of lines from the text file.
    """
    # Handle known possible error: no file at the path provided.
    if not file_path.exists():
        raise FileNotFoundError(f"Missing input file: {file_path}")

    with file_path.open("r", encoding="utf-8") as f:
        return f.readlines()

load_summary_report

load_summary_report(*, summary: dict[str, int], out_path: Path) -> None

L: Write summary to a text file in data/processed.

Parameters:

  • summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.
  • out_path (Path, required): Path to output text file.

Returns:

  • None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def load_summary_report(*, summary: dict[str, int], out_path: Path) -> None:
    """L: Write summary to a text file in data/processed.

    Args:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.
        out_path: Path to output text file.

    Returns:
        None
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", encoding="utf-8") as f:
        f.write("Text File Summary\n")
        f.write(f"Lines: {summary['lines']}\n")
        f.write(f"Words: {summary['words']}\n")
        f.write(f"Characters: {summary['chars']}\n")

run_text_pipeline

run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None

Run the full ETVL pipeline.

Parameters:

Name Type Description Default
raw_dir Path

Path to data/raw directory.

required
processed_dir Path

Path to data/processed directory.

required
logger Any

Logger for logging messages.

required

Returns:

Type Description
None

None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None:
    """Run the full ETVL pipeline.

    Args:
        raw_dir: Path to data/raw directory.
        processed_dir: Path to data/processed directory.
        logger: Logger for logging messages.

    Returns:
        None

    """
    logger.info("TXT: START")

    input_file = raw_dir / "romeo_and_juliet.txt"
    output_file = processed_dir / "txt_summary.txt"

    # E
    lines = extract_lines(file_path=input_file)

    # T
    summary = transform_line_word_char_counts(lines=lines)

    # V
    verify_summary(summary=summary)

    # L
    load_summary_report(summary=summary, out_path=output_file)

    logger.info("TXT: wrote %s", output_file)
    logger.info("TXT: END")

transform_line_word_char_counts

transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]

T: Create a simple summary: line count, word count, character count.

Parameters:

  • lines (list[str], required): List of lines from the text file.

Returns:

  • dict[str, int]: Dictionary with counts for 'lines', 'words', and 'chars'.

Source code in src/datafun_03_analytics/case_text_pipeline.py
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    """T: Create a simple summary: line count, word count, character count.

    Args:
        lines: List of lines from the text file.

    Returns:
        Dictionary with counts for 'lines', 'words', and 'chars'.
    """
    line_count = len(lines)
    word_count = 0
    char_count = 0

    for line in lines:
        char_count += len(line)
        word_count += len(line.split())

    return {
        "lines": line_count,
        "words": word_count,
        "chars": char_count,
    }
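A standalone illustration (the sample lines below are invented, written as readlines() would return them, with trailing newlines kept; the function body is copied from above so the snippet runs on its own):

```python
# Copied from transform_line_word_char_counts above so this snippet is self-contained.
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    line_count = len(lines)
    word_count = 0
    char_count = 0
    for line in lines:
        char_count += len(line)          # includes the trailing newline
        word_count += len(line.split())  # split() handles any run of whitespace
    return {"lines": line_count, "words": word_count, "chars": char_count}

# Invented sample lines, as readlines() would return them.
summary = transform_line_word_char_counts(lines=["hello world\n", "bye\n"])
print(summary)  # {'lines': 2, 'words': 3, 'chars': 16}
```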

verify_summary

verify_summary(*, summary: dict[str, int]) -> None

V: Verify the summary has expected keys and non-negative values.

Parameters:

  • summary (dict[str, int], required): Dictionary with counts for 'lines', 'words', and 'chars'.

Raises:

  • KeyError: If expected keys are missing.
  • ValueError: If any count is negative.

Returns:

  • None

Source code in src/datafun_03_analytics/case_text_pipeline.py
def verify_summary(*, summary: dict[str, int]) -> None:
    """V: Verify the summary has expected keys and non-negative values.

    Args:
        summary: Dictionary with counts for 'lines', 'words', and 'chars'.

    Raises:
        KeyError: If expected keys are missing.
        ValueError: If any count is negative.

    Returns:
        None
    """
    for key in ("lines", "words", "chars"):
        # Handle known possible error: the key is missing.
        if key not in summary:
            raise KeyError(f"Missing summary key: {key}")
        # Handle known possible error: count is negative.
        if summary[key] < 0:
            raise ValueError(f"Invalid {key} count: {summary[key]}")