
Project Instructions (Module 3: Data Analytics Pipelines)

WEDNESDAY: Complete Workflow Phase 1

Follow the instructions in Workflow: Apply Example to complete:

  1. Phase 1. Start & Run - copy the project and confirm it runs

FRIDAY/SUNDAY: Complete Workflow Phases 2-4

Again, follow the instructions above to complete:

  1. Phase 2. Change Authorship - update the project to your name and GitHub account
  2. Phase 3. Read & Understand - review the project structure and code
  3. Phase 4. Make a Technical Modification - make a change and verify it still runs

Phase 4 Suggestions

Make a small technical change that does not break the pipeline. Choose any one of the following (or a different modification of your choice):

  • Change the target word in the XLSX pipeline to a different word (e.g., search for "Python" or "data" instead of "GitHub")
  • Add a new derived field in the CSV Transform stage (e.g., a normalized score, or a label based on whether the value is above the mean)
  • Add a line to any Load stage that writes the current date to the output file
  • Adjust logging messages in any pipeline to provide more detail about what was found

Confirm the pipeline still runs successfully after your change.
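As one concrete illustration of the derived-field suggestion above, a Transform stage can add a label based on whether a value is above the mean. This is a minimal sketch with hypothetical names (`transform_scores`, a `"score"` column); adapt it to the actual column names and function style used in the example project.

```python
# Hypothetical sketch: adding a derived "above_mean" field in a CSV Transform
# stage. The function and column names are assumptions, not the project's code.
import statistics

def transform_scores(*, rows: list[dict]) -> list[dict]:
    """Add a derived label based on whether 'score' is above the mean."""
    scores = [float(row["score"]) for row in rows]
    mean_score = statistics.mean(scores)
    for row in rows:
        # Derived field: True when this row's score exceeds the mean
        row["above_mean"] = float(row["score"]) > mean_score
    return rows
```

Because the change only adds a field (it never removes or reorders existing ones), downstream stages that ignore unknown fields keep working unchanged.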

Phase 5 Suggestions

Phase 5 Suggestion 1. New Data File, Same Format (Directed)

Apply one of the existing pipelines to a different data file of the same type.

Steps:

  • Choose one pipeline: CSV, JSON, text, or XLSX
  • Find or create a new data file of that type:
      • CSV: any dataset with at least one numeric column (e.g., from https://www.kaggle.com or https://data.gov)
      • JSON: any JSON file with a list of objects (e.g., from a public API or https://github.com/jdorfman/awesome-json-datasets)
      • Text: any plain text file with readable content (e.g., from https://www.gutenberg.org)
      • XLSX: any Excel file with text content in a column
  • Copy and rename the relevant pipeline file
  • Update the input filename and column/key names to match your new data
  • Update the output filename to reflect your data
  • Run the pipeline and confirm success
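The update steps above usually touch only a few lines near the top of the copied pipeline file. A minimal sketch, with hypothetical filenames and column name (your actual pipeline may define these differently):

```python
# Hypothetical sketch: the lines you would typically change when pointing a
# copied CSV pipeline at a new data file. All names here are assumptions.
from pathlib import Path

INPUT_FILE = Path("data/raw/my_new_dataset.csv")        # replaces the example's input
OUTPUT_FILE = Path("data/processed/my_new_output.txt")  # renamed to reflect your data
VALUE_COLUMN = "price"                                  # the numeric column in your file
```

Keeping these as named constants at the top of the file makes the adaptation a three-line change rather than a hunt through the pipeline logic.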

Then:

  • Describe your data source and what it contains
  • Identify the column or key you extracted and why you chose it
  • Describe one challenge you encountered adapting the pipeline and how you resolved it

Phase 5 Suggestion 2. New Format Pipeline (Original)

Write a new ETVL pipeline for a data format or structure not yet covered in the example.

Good options include:

  • A second CSV file with different columns and a different Transform (e.g., filtering rows)
  • A nested JSON file with a different structure (e.g., multiple levels of nesting)
  • A text file from a different source with different counting logic (e.g., count sentences instead of words)
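For the sentence-counting option, the Transform step might look like the sketch below. The splitting rule (split on `.`, `!`, or `?` followed by whitespace or end of text) is a deliberate simplification, and the function name is an assumption:

```python
# Hypothetical sketch of a Transform with different counting logic:
# counting sentences instead of words.
import re

def transform_count_sentences(*, text: str) -> int:
    """Count sentences, splitting on ., !, or ? followed by whitespace or end."""
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return len(sentences)
```

Abbreviations like "Dr." will be miscounted by this rule; noting such a limitation is a good candidate for your "one challenge" write-up.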

Steps:

  • Place your raw data file in data/raw/
  • Create a new pipeline file named yourname_<format>_pipeline.py
  • Implement all four ETVL steps as separate functions
  • Use keyword-only arguments (*,) in all function definitions
  • Call the pipeline from main() in app_yourname.py
  • Write at least one test for your Extract function and one for your Transform function
  • Run the full pipeline and confirm output appears in data/processed/
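The steps above can be sketched as a skeleton of the four ETVL functions, each with a single responsibility and keyword-only arguments. All names here are illustrative assumptions; match them to the conventions in the example project:

```python
# Hypothetical ETVL skeleton: four separate functions, all keyword-only (*,).
from pathlib import Path

def extract(*, input_path: Path) -> str:
    """Extract: read raw text from the input file."""
    return input_path.read_text(encoding="utf-8")

def transform(*, text: str) -> int:
    """Transform: compute something from the raw data (here, a word count)."""
    return len(text.split())

def verify(*, word_count: int) -> None:
    """Verify: fail loudly before anything is written to disk."""
    if word_count <= 0:
        raise ValueError(f"Unexpected word count: {word_count}")

def load(*, word_count: int, output_path: Path) -> None:
    """Load: write the verified result to data/processed/."""
    output_path.write_text(f"word_count={word_count}\n", encoding="utf-8")
```

Because the steps are separate functions, the required tests can call Transform directly with a literal string and assert on the result, with no file I/O involved.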

Then:

  • Describe the data format and structure of your file
  • Explain what your Transform step computes and why it is useful
  • Describe what your Verify step checks and why those checks matter

Key Skill Focus

As you work, focus on:

  • how each ETVL step has a single, clear responsibility
  • how keyword-only arguments prevent argument-order mistakes
  • how defensive programming handles missing files and unexpected values
  • how raise signals an error that the caller must handle
  • how the Verify step catches problems before writing results to disk

Your goal is to reuse the same pipeline pattern on new data sources.
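The defensive-programming and raise points above can be seen together in a small sketch of an Extract step: the function checks for a missing file and raises a clear error that the caller must handle. Names are assumptions, not the example project's code:

```python
# Hypothetical sketch of defensive programming in an Extract step:
# detect a missing input file and raise rather than returning bad data.
from pathlib import Path

def extract_csv(*, input_path: Path) -> list[str]:
    """Extract lines from a CSV, raising a clear error if the file is missing."""
    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_path}")
    return input_path.read_text(encoding="utf-8").splitlines()
```

Raising here, instead of returning an empty list, means the pipeline stops at the first broken step with a message naming the missing file, rather than writing empty output to data/processed/.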

Professional Communication

Make sure the title and narrative reflect your work. Verify key files:

  • README.md
  • docs/ (source and hosted on GitHub Pages)
  • src/ (your pipeline files)
  • data/processed/ (output files committed to Git)

Ensure your project clearly demonstrates:

  • a complete ETVL pipeline with all four steps
  • successful execution with output written to data/processed/
  • understanding of the data format you chose
  • at least one test for your pipeline functions