Project Instructions (Module 6: NLP Pipeline)
WEDNESDAY: Complete Workflow Phase 1
Follow the instructions in ⭐ Workflow: Apply Example to complete:
- Phase 1. Start & Run - copy the project and confirm it runs
FRIDAY/SUNDAY: Complete Workflow Phases 2-4
Follow the same instructions in ⭐ Workflow: Apply Example to complete:
- Phase 2. Change Authorship - update the project to your name and GitHub account
- Phase 3. Read & Understand - review the project structure and code
- Phase 4. Make a Technical Modification - make a change and verify it still runs
Phase 4 Suggestions
Make a small technical change that does not break the pipeline. Choose any one of these, or a different modification of your choice:
- Change the target URL to a different arXiv paper (browse https://arxiv.org for one that interests you)
- Add a new derived column in the Transform stage (e.g., sentence count in the abstract, or average word length)
- Adjust the number of top tokens shown in the frequency bar chart
- Add a new visualization in the Analyze stage (e.g., a histogram of word lengths)
- Adjust logging messages to provide more detail about the pipeline stages
Confirm the script still runs successfully after your change.
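
As one illustration of the derived-column suggestion, here is a minimal sketch of two text features computed with the standard library only. The function names, the regular expressions, and the sample abstract are assumptions made for this example; they are not taken from the project code, and your Transform stage may structure its columns differently.

```python
import re


def sentence_count(text: str) -> int:
    """Count sentences by splitting on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def average_word_length(text: str) -> float:
    """Mean length of alphabetic word tokens; 0.0 when there are none."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Hypothetical abstract text used only for illustration.
abstract = "Transformers use attention. They scale well! Do they generalize?"

features = {
    "sentence_count": sentence_count(abstract),
    "avg_word_length": round(average_word_length(abstract), 2),
}
print(features)
```

The sentence splitter is deliberately naive: abbreviations such as "e.g." or decimal numbers will inflate the count, so inspect a few rows before trusting the new column.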
Phase 5 Suggestions
Phase 5 Suggestion 1. New arXiv Paper (Directed)
Apply the same EVTAL pipeline to a different arXiv abstract page.
Steps:
- Find an arXiv paper that interests you at https://arxiv.org
- Copy the abstract page URL (e.g., https://arxiv.org/abs/XXXX.XXXXX)
- Update PAGE_URL in your copied config file with the new URL
- Run the pipeline
- Inspect the extracted fields, cleaned text, and visualizations
- Confirm the pipeline runs successfully
Then:
- Identify the title, authors, and primary subject of your chosen paper
- Describe how the cleaned abstract differs from the raw abstract
- Compare the type-token ratio and token count to the case example
- Describe what the word cloud and bar chart reveal about the paper's topic
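
The type-token ratio comparison above can be sketched as follows. This is a minimal example under stated assumptions: the tokenizer is a simple lowercase regex, and both texts are invented placeholders rather than real abstracts or output from the case example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens: list[str]) -> float:
    """Unique tokens (types) divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical texts standing in for the case example and your chosen paper.
case_text = "the model learns the data and the model improves"
new_text = "sparse attention reduces memory cost for long sequences"

for label, text in [("case example", case_text), ("new paper", new_text)]:
    tokens = tokenize(text)
    print(
        f"{label}: tokens={len(tokens)}, "
        f"types={len(set(tokens))}, "
        f"ttr={type_token_ratio(tokens):.2f}"
    )
```

Type-token ratio tends to fall as a text gets longer, so report the token count alongside it and be cautious when comparing abstracts of very different lengths.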
Phase 5 Suggestion 2. New Web Page (Original Selection)
Apply this pipeline to a different web page of your choice.
Good options include:
- An arXiv listing page (e.g., https://arxiv.org/list/cs.AI/recent) to extract and analyze multiple abstracts at once
- A Wikipedia article to extract and analyze the introduction
- A Project Gutenberg page to extract text from a literary work (e.g., Pride and Prejudice at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm)
Steps:
- Open the target page in your browser
- Right-click and select "View Page Source" to inspect the HTML structure
- Identify the tags and class names that wrap the content you want
- Update your copied config file with the new URL
- Update your copied stage02_validate file to check for the new structure
- Update your copied stage03_transform file to extract and clean the new fields
- Run the pipeline and confirm success
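
To make the "identify the tags and class names" and "check for the new structure" steps concrete, here is a small sketch that pulls text out of one tag and class combination and fails loudly if nothing is found. It uses only Python's standard html.parser so that it is self-contained; the project itself may use a different parsing library. The class ClassTextExtractor, the helper extract_text, and the sample HTML are all hypothetical names and data created for this illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements with a given tag name and class."""

    def __init__(self, tag: str, class_name: str) -> None:
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Count nested tags of the same name so the match closes correctly.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.class_name in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html: str, tag: str, class_name: str) -> str:
    """Return whitespace-normalized text from matching elements, or ''."""
    parser = ClassTextExtractor(tag, class_name)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page source; replace with the HTML your Extract stage saved.
sample_html = """
<html><body>
  <h1 class="page-title">Sample Page</h1>
  <div class="summary"><p>First paragraph.</p><p>Second   paragraph.</p></div>
  <div class="footer">Ignore me</div>
</body></html>
"""

summary = extract_text(sample_html, "div", "summary")
if not summary:
    # Validation-style check: stop early if the expected structure is missing.
    raise ValueError("Expected <div class='summary'> was not found or was empty")
print(summary)
```

An empty result usually means the tag or class name does not match the page source, which is a quick signal to go back to "View Page Source" before changing any cleaning code.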
Then:
- Describe the HTML structure of your chosen page
- Identify the tags and attributes you used to extract each field
- Explain one challenge you encountered in cleaning the text and how you resolved it
- Describe what the frequency analysis reveals about the content
Key Skill Focus
As you work, focus on:
- how Transform is an iterative loop: inspect, clean, inspect, engineer, repeat
- how cleaning decisions involve tradeoffs (what signal might be lost?)
- how to compute and interpret token frequency, vocabulary richness, and type-token ratio
- how visualizations (bar charts, word clouds) surface patterns that numbers alone do not
- how data moves through the EVTAL pipeline
Your goal is to produce a clean, analysis-ready corpus and interpret what the analysis reveals.
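
The cleaning tradeoff mentioned above can be seen in a short sketch that compares the most frequent tokens before and after stopword removal. The stopword set, the helper top_tokens, and the sample sentence are assumptions for illustration only; they are not the project's stopword list or code.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "we", "that", "not"}


def top_tokens(text, n=3, remove_stopwords=False):
    """Return the n most frequent lowercase tokens as (token, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens).most_common(n)


# Hypothetical sentence used only for illustration.
text = (
    "We show that the model is not robust to noise, and that the noise "
    "in the data is the main cause of the errors."
)

print("raw:    ", top_tokens(text))
print("cleaned:", top_tokens(text, remove_stopwords=True))
```

Removing stopwords lets content words such as "noise" rise to the top, but this particular list also drops "not", so "not robust" is reduced to "robust". That is the kind of lost signal to look for when choosing a cleaning strategy.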
Optional Enhancements
If time allows, consider:
- extracting multiple abstracts into a multi-row DataFrame and analyzing them together
- comparing type-token ratios across two different papers
- experimenting with different stopword lists or cleaning strategies
- adding POS-filtered tokens (nouns only, verbs only) to the analysis
Professional Communication
Remove instructor-provided content you no longer need in your project.
Make sure the title and narrative reflect your presentation. Verify key files:
- README.md
- docs/ (source and hosted on GitHub Pages)
- src/ (pipeline and stage files)
Ensure your project clearly demonstrates:
- correct EVTAL pipeline execution
- understanding of text cleaning and its tradeoffs
- ability to compute and interpret NLP features
- meaningful visualizations with written interpretation