API Reference
This page is auto-generated from Python docstrings.
datafun_03_analytics.app_case
app_case.py - Project script (example).
Author: Denise Case
Date: 2026-01
Practice key Python skills:

- pathlib for cross-platform paths
- logging (preferred over print)
- calling functions from modules
- clear ETVL pipeline stages:
    - E = Extract (read, get data from source into memory)
    - T = Transform (process, change data in memory)
    - V = Verify (check, validate data in memory)
    - L = Load (write results, to data/processed or other destination)
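The four ETVL stages can be wired together in a tiny sketch. The stage functions below are hypothetical stand-ins, not the project's actual code:

```python
# Illustrative ETVL wiring; each stage function is a hypothetical stand-in.

def extract() -> list[float]:
    # E: pretend these values were read from a raw data source
    return [3.0, 1.0, 2.0]

def transform(values: list[float]) -> dict[str, float]:
    # T: compute a small summary in memory
    return {"min": min(values), "max": max(values)}

def verify(stats: dict[str, float]) -> None:
    # V: sanity-check the transformed data before writing
    if stats["min"] > stats["max"]:
        raise ValueError("min exceeds max")

def load(stats: dict[str, float]) -> str:
    # L: format the report text that would be written to data/processed
    return "\n".join(f"{key}: {value}" for key, value in stats.items())

stats = transform(extract())
verify(stats)
report = load(stats)
```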
OBS: Don't edit this file - it should remain a working example.
main
main() -> None
Entry point: run four simple ETVL pipelines.
Source code in src/datafun_03_analytics/app_case.py
datafun_03_analytics.case_csv_pipeline
case_csv_pipeline.py - CSV ETVL pipeline.
ETVL

- E = Extract (read)
- T = Transform (process)
- V = Verify (check)
- L = Load (write results to data/processed)
CUSTOM: We turn off some PyRight type checks when working with raw data pipelines.
WHY: We don't know what types things are until after we read them.
OBS: See pyproject.toml and the [tool.pyright] section for details.
CUSTOM: We use keyword-only function arguments.

In our function signatures, you'll see a bare `*,`. The asterisk can appear anywhere in the parameter list; every argument after it must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
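A minimal illustration of the keyword-only pattern (the function name and parameters here are hypothetical, chosen only to demonstrate the syntax):

```python
def describe_column(*, file_name: str, column_name: str) -> str:
    # Both parameters sit after the bare *, so they are keyword-only.
    return f"{column_name} from {file_name}"

# OK: a self-documenting keyword call
describe_column(file_name="scores.csv", column_name="score")

# TypeError: positional arguments are rejected
# describe_column("scores.csv", "score")
```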
extract_csv_scores
extract_csv_scores(*, file_path: Path, column_name: str) -> list[float]
E: Read CSV and extract one numeric column as floats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to input CSV file. | *required* |
| `column_name` | `str` | Name of the column to extract. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[float]` | List of float values from the specified column. |
Source code in src/datafun_03_analytics/case_csv_pipeline.py
load_stats_report
load_stats_report(*, stats: dict[str, float], out_path: Path) -> None
L: Write stats to a text file in data/processed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stats` | `dict[str, float]` | Dictionary with statistics to write. | *required* |
| `out_path` | `Path` | Path to output text file. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_csv_pipeline.py
run_csv_pipeline
run_csv_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None
Run the full ETVL pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_dir` | `Path` | Path to data/raw directory. | *required* |
| `processed_dir` | `Path` | Path to data/processed directory. | *required* |
| `logger` | `Any` | Logger for logging messages. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_csv_pipeline.py
transform_scores_to_stats
transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]
T: Calculate basic statistics for a list of floats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `scores` | `list[float]` | List of float values. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Dictionary with keys: count, min, max, mean, stdev. |
Source code in src/datafun_03_analytics/case_csv_pipeline.py
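A sketch of computing the five documented statistics with the stdlib `statistics` module (the single-element `stdev` fallback to 0.0 is an assumption; the real function may handle that case differently):

```python
import statistics

def transform_scores_to_stats(*, scores: list[float]) -> dict[str, float]:
    # Compute the five summary values named in the docstring.
    return {
        "count": float(len(scores)),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```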
verify_stats
verify_stats(*, stats: dict[str, float]) -> None
V: Sanity-check the stats dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stats` | `dict[str, float]` | Dictionary with statistics to verify. | *required* |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If expected keys are missing. |
| `ValueError` | If any stats values are invalid. |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_csv_pipeline.py
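A sketch of a verifier matching the documented raise behavior (the specific validity checks are assumptions; the real function may check more):

```python
def verify_stats(*, stats: dict[str, float]) -> None:
    # Raise KeyError for missing keys, ValueError for impossible values.
    expected = {"count", "min", "max", "mean", "stdev"}
    missing = expected - stats.keys()
    if missing:
        raise KeyError(f"missing stats keys: {sorted(missing)}")
    if stats["count"] < 0 or stats["min"] > stats["max"]:
        raise ValueError("invalid stats values")
```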
datafun_03_analytics.case_xlsx_pipeline
case_xlsx_pipeline.py - XLSX ETVL pipeline.
ETVL

- E = Extract (read)
- T = Transform (process)
- V = Verify (check)
- L = Load (write results to data/processed)
CUSTOM: We turn off some PyRight type checks when working with raw data pipelines.
WHY: We don't know what types things are until after we read them.
OBS: See pyproject.toml and the [tool.pyright] section for details.
CUSTOM: We use keyword-only function arguments.

In our function signatures, you'll see a bare `*,`. The asterisk can appear anywhere in the parameter list; every argument after it must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
extract_xlsx_column_strings
extract_xlsx_column_strings(*, file_path: Path, column_letter: str) -> list[str]
E: Read an Excel file and extract string values from a column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to input XLSX file. | *required* |
| `column_letter` | `str` | Letter of the column to extract (e.g., 'A'). | *required* |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of string values from the specified column. |
Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
load_count_report
load_count_report(*, count: int, out_path: Path, word: str, column_letter: str) -> None
L: Write the result to a text file in data/processed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | The word count to write. | *required* |
| `out_path` | `Path` | Path to output text file. | *required* |
| `word` | `str` | The word that was counted. | *required* |
| `column_letter` | `str` | The column letter that was processed. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
run_xlsx_pipeline
run_xlsx_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None
Run the full ETVL pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_dir` | `Path` | Path to data/raw directory. | *required* |
| `processed_dir` | `Path` | Path to data/processed directory. | *required* |
| `logger` | `Any` | Logger for logging messages. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
transform_count_word
transform_count_word(*, values: list[str], word: str) -> int
T: Count occurrences of a word across strings (case-insensitive).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[str]` | List of strings to search. | *required* |
| `word` | `str` | Word to count. | *required* |

Returns:

| Type | Description |
|---|---|
| `int` | Count of occurrences of the word. |
Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
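A sketch of a case-insensitive counter with this signature. This version counts whole whitespace-separated tokens; the actual function may instead use substring matching:

```python
def transform_count_word(*, values: list[str], word: str) -> int:
    # Case-insensitive count of whole whitespace-separated tokens.
    target = word.lower()
    return sum(value.lower().split().count(target) for value in values)
```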
verify_count
verify_count(*, count: int) -> None
V: Verify the count is valid.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | The count to verify. | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the count is negative. |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_xlsx_pipeline.py
datafun_03_analytics.case_json_pipeline
case_json_pipeline.py - JSON ETVL pipeline.
ETVL

- E = Extract (read)
- T = Transform (process)
- V = Verify (check)
- L = Load (write results to data/processed)
This example is intentionally explicit about walking JSON:
- json.load(file) returns a Python dictionary (top-level object)
- dict.get("people", []) safely retrieves a nested list
- iteration is used to walk arrays (lists)
- each list element is expected to be a dictionary with keys such as "craft"
Core JSON Data Concepts:
- JSON is hierarchical (tree-structured)
- JSON arrays map to Python lists
- JSON objects map to Python dictionaries (key-value pairs)
- JSON is nested (lists and dictionaries can appear within each other)
- JSON is untrusted input (keys may be missing, values may be wrong types)
- JSON values are optional (no required keys)
- JSON types are runtime facts, not promises (no static typing or schema)
Runtime Validation and Defensive Access:
- Use isinstance() to verify value types at runtime
- Use dict.get(key, default) to handle missing keys safely
- Use iteration to walk arrays (lists)
- Apply defensive programming for unexpected or missing data
- Verify file existence before attempting to read JSON
CUSTOM: We turn off some PyRight type checks when working with raw data pipelines.
WHY: We don't know what types things are until after we read them.
OBS: See pyproject.toml and the [tool.pyright] section for details.
CUSTOM: We use keyword-only function arguments.

In our function signatures, you'll see a bare `*,`. The asterisk can appear anywhere in the parameter list; every argument after it must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
Example JSON Data:
{ "people": [ { "craft": "ISS", "name": "Oleg Kononenko" }, ...
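The defensive-access ideas above can be demonstrated on data shaped like the example (a shortened in-memory copy, not a live API call):

```python
import json
from typing import Any

# A shortened in-memory copy of the example JSON shape.
raw = '{"people": [{"craft": "ISS", "name": "Oleg Kononenko"}]}'
data: dict[str, Any] = json.loads(raw)   # top-level JSON object -> Python dict

people = data.get("people", [])          # safe retrieval: missing key -> []
crafts = [
    person.get("craft", "unknown")       # safe per-element retrieval with a default
    for person in people
    if isinstance(person, dict)          # runtime type check before access
]
```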
extract_people_list
extract_people_list(*, file_path: Path, list_key: str = 'people') -> list[dict[str, Any]]
E/V: Read JSON file and extract a list of dictionaries under list_key.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to input JSON file. | *required* |
| `list_key` | `str` | Top-level key expected to map to a list (default: "people"). | `'people'` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | A list of dictionaries from the JSON file. |
Source code in src/datafun_03_analytics/case_json_pipeline.py
load_counts_report
load_counts_report(*, counts: dict[str, int], out_path: Path) -> None
L: Write craft counts to a text file in data/processed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `counts` | `dict[str, int]` | Dictionary mapping craft names to counts. | *required* |
| `out_path` | `Path` | Path to output text file. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_json_pipeline.py
run_json_pipeline
run_json_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None
Run the full ETVL pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_dir` | `Path` | Path to data/raw directory. | *required* |
| `processed_dir` | `Path` | Path to data/processed directory. | *required* |
| `logger` | `Any` | Logger for logging messages. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_json_pipeline.py
transform_count_by_craft
transform_count_by_craft(*, people_list: list[dict[str, Any]], craft_key: str = 'craft') -> dict[str, int]
T/V: Count people by craft.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `people_list` | `list[dict[str, Any]]` | List of person dictionaries. | *required* |
| `craft_key` | `str` | Key to read craft name from (default: "craft"). | `'craft'` |

Returns:

| Type | Description |
|---|---|
| `dict[str, int]` | Dictionary mapping craft names to counts. |
Source code in src/datafun_03_analytics/case_json_pipeline.py
verify_counts
verify_counts(*, counts: dict[str, int]) -> None
V: Verify counts are non-negative and craft names are not empty.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `counts` | `dict[str, int]` | Dictionary mapping craft names to counts. | *required* |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any count is negative or craft name is invalid. |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_json_pipeline.py
datafun_03_analytics.case_text_pipeline
case_text_pipeline.py - Text ETVL pipeline.
ETVL

- E = Extract (read)
- T = Transform (process)
- V = Verify (check)
- L = Load (write results to data/processed)
CUSTOM: We turn off some PyRight type checks when working with raw data pipelines.
WHY: We don't know what types things are until after we read them.
OBS: See pyproject.toml and the [tool.pyright] section for details.
CUSTOM: We use keyword-only function arguments.

In our function signatures, you'll see a bare `*,`. The asterisk can appear anywhere in the parameter list; every argument after it must be passed as a named keyword argument (also called a kwarg), rather than by position.

WHY: Requiring named arguments prevents argument-order mistakes. It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
extract_lines
extract_lines(*, file_path: Path) -> list[str]
E: Read a text file into a list of lines.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to input text file. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of lines from the text file. |
Source code in src/datafun_03_analytics/case_text_pipeline.py
load_summary_report
load_summary_report(*, summary: dict[str, int], out_path: Path) -> None
L: Write summary to a text file in data/processed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `summary` | `dict[str, int]` | Dictionary with counts for 'lines', 'words', and 'chars'. | *required* |
| `out_path` | `Path` | Path to output text file. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_text_pipeline.py
run_text_pipeline
run_text_pipeline(*, raw_dir: Path, processed_dir: Path, logger: Any) -> None
Run the full ETVL pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_dir` | `Path` | Path to data/raw directory. | *required* |
| `processed_dir` | `Path` | Path to data/processed directory. | *required* |
| `logger` | `Any` | Logger for logging messages. | *required* |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_text_pipeline.py
transform_line_word_char_counts
transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]
T: Create a simple summary: line count, word count, character count.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `lines` | `list[str]` | List of lines from the text file. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, int]` | Dictionary with counts for 'lines', 'words', and 'chars'. |
Source code in src/datafun_03_analytics/case_text_pipeline.py
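A sketch of how a summarizer with this signature might compute its three counts, assuming "words" means whitespace-separated tokens and "chars" includes newlines (the real definitions may differ):

```python
def transform_line_word_char_counts(*, lines: list[str]) -> dict[str, int]:
    # Line count, whitespace-separated word count, and raw character count.
    return {
        "lines": len(lines),
        "words": sum(len(line.split()) for line in lines),
        "chars": sum(len(line) for line in lines),
    }
```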
verify_summary
verify_summary(*, summary: dict[str, int]) -> None
V: Verify the summary has expected keys and non-negative values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `summary` | `dict[str, int]` | Dictionary with counts for 'lines', 'words', and 'chars'. | *required* |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If expected keys are missing. |
| `ValueError` | If any count is negative. |

Returns:

| Type | Description |
|---|---|
| `None` | None |
Source code in src/datafun_03_analytics/case_text_pipeline.py