Project Instructions (Module 6: NLP Pipeline)

WEDNESDAY: Complete Workflow Phase 1

Follow the instructions in ⭐ Workflow: Apply Example to complete:

Again, follow the instructions above to complete:

Phase 2. Change Authorship - update the project to your name and GitHub account
Phase 3. Read & Understand - review the project structure and code
Phase 4. Make a Technical Modification - make a change and verify that the project still runs

Make a small technical change that does not break the pipeline. Choose any one of these, or another modification of your choice:

Change the target URL to a different arXiv paper (choose one that interests you at https://arxiv.org)
Add a new derived column in the Transform stage (e.g., sentence count in the abstract, or average word length)
Adjust the number of top tokens shown in the frequency bar chart
Add a new visualization in the Analyze stage (e.g., a histogram of word lengths)
Adjust logging messages to provide more detail about the pipeline stages

Confirm the script still runs successfully after your change.
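
As an illustration of the derived-column option, here is a minimal sketch of the two example features (sentence count and average word length). It uses a plain dictionary and the standard library only. The function names, the sample abstract, and the `record` structure are assumptions made for illustration; the project's Transform stage may hold its data in a DataFrame instead, in which case the same functions could be applied to the abstract column.

```python
import re


def sentence_count(text: str) -> int:
    """Count sentences by splitting on runs of . ! ? followed by whitespace or end of text.

    This is a rough heuristic: abbreviations such as "e.g." will be over-counted.
    """
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def average_word_length(text: str) -> float:
    """Return the mean length of alphabetic words, or 0.0 when there are none."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Hypothetical sample text standing in for a scraped abstract.
abstract = "We propose a new model. It is simple! Does it work?"

# Hypothetical record; in the project this would be one row of the dataset.
record = {"abstract": abstract}
record["sentence_count"] = sentence_count(abstract)
record["avg_word_length"] = round(average_word_length(abstract), 2)
print(record)
```

Both functions handle empty input without raising an error, which keeps the pipeline running if a field fails to extract.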

Apply the same EVTAL pipeline to a different arXiv abstract page.

Steps:

Then:

Apply this pipeline to a different web page of your choice.

Good options include:

Another arXiv listing page (e.g., https://arxiv.org/list/cs.AI/recent) to extract and analyze multiple abstracts at once
A Wikipedia article to extract and analyze the introduction
A Project Gutenberg page to extract text from a literary work (e.g., Pride and Prejudice at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm)

Steps:

Then:

Describe the HTML structure of your chosen page
Identify the tags and attributes you used to extract each field
Explain one challenge you encountered in cleaning the text and how you resolved it
Describe what the frequency analysis reveals about the content
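
To make "tags and attributes" concrete, the sketch below pulls the text out of the first element that matches a given tag name and class attribute. It uses only Python's standard `html.parser` module; the project itself may rely on a different parsing library. The sample HTML, the class names in it, and the names `ClassTextExtractor` and `extract_text` are invented for illustration, so inspect your chosen page to find its real structure.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element with a given tag and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Track nested elements with the same tag name so we close correctly.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left over from the page layout.
        return " ".join(" ".join(self.chunks).split())


def extract_text(html, tag, css_class):
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page fragment; real pages will differ.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example Title</h1>
  <blockquote class="abstract extra">
    <span class="descriptor">Abstract:</span>
    We study   text pipelines.
  </blockquote>
  <p>Footer text</p>
</body></html>
"""

title = extract_text(SAMPLE_HTML, "h1", "title")
abstract = extract_text(SAMPLE_HTML, "blockquote", "abstract")
print(title)
print(abstract)
```

Note that the extracted abstract still carries the "Abstract:" label from the nested span. Removing that kind of leftover is a typical cleaning step for the Transform stage.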

As you work, focus on:

how Transform is an iterative loop: inspect, clean, inspect, engineer, repeat
how cleaning decisions involve tradeoffs (what signal might be lost?)
how to compute and interpret token frequency, vocabulary richness, and type-token ratio
how visualizations (bar charts, word clouds) surface patterns that numbers alone do not
how data moves through the EVTAL pipeline

Your goal is to produce a clean, analysis-ready corpus and interpret what the analysis reveals.
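
The frequency and type-token ratio ideas listed above can be sketched with the standard library alone. The tokenizer, the tiny stopword set, and the sample sentence are assumptions chosen for illustration and are not the project's actual cleaning rules. The type-token ratio is computed here as the number of unique tokens divided by the total number of tokens.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for demonstration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text, remove_stopwords=True):
    """Lowercase the text and keep alphabetic tokens, optionally dropping stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


text = "The model learns the structure of the data, and the data shapes the model."

cleaned = tokenize(text)
raw = tokenize(text, remove_stopwords=False)

freq = Counter(cleaned)
print("Top tokens:", freq.most_common(2))
print("Vocabulary size (cleaned):", len(set(cleaned)))
print("TTR with stopwords removed:", round(type_token_ratio(cleaned), 3))
print("TTR with stopwords kept:", round(type_token_ratio(raw), 3))
```

Comparing the two ratios shows one cleaning tradeoff directly: removing stopwords changes both the token count and the vocabulary, so the same text can look more or less varied depending on the cleaning choices made earlier in the pipeline.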

If time allows, consider:

Remove instructor-provided content you no longer need in your project.

Make sure the title and narrative reflect your presentation. Verify key files:

Ensure your project clearly demonstrates: