Project Instructions (Module 5: Web Documents and HTML Data)
WEDNESDAY: Complete Workflow Phase 1
Follow the instructions in ⭐ Workflow: Apply Example to complete:
- Phase 1. Start & Run - copy the project and confirm it runs
FRIDAY/SUNDAY: Complete Workflow Phases 2-4
Again, follow the instructions above to complete:
- Phase 2. Change Authorship - update the project to your name and GitHub account
- Phase 3. Read & Understand - review the project structure and code
- Phase 4. Make a Technical Modification - make a change and verify the project still runs
Phase 4 Suggestions
Make a small technical change that does not break the pipeline. Choose any one of these (or a different modification of your choice):
- Change the target URL to a different arXiv paper (find one that interests you at https://arxiv.org)
- Add a new derived column in the Transform stage (e.g., sentence count in the abstract, or first author only)
- Add extraction of a new field from the page (e.g., the PDF link, or the arXiv category code)
- Adjust logging messages to provide more detail about the pipeline stages
Confirm the script still runs successfully after your change.
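As an illustration of the derived-column suggestion, here is a minimal sketch that computes a sentence count and a first-author field from a single record. The function name and the field names (`abstract`, `authors`) are assumptions for illustration; your copied Transform stage may use different names or operate on a DataFrame rather than a dict.

```python
import re


def add_derived_fields(record: dict) -> dict:
    """Return a copy of the record with two extra derived fields.

    The keys "abstract" and "authors" are assumed names, not necessarily
    the ones used in the project.
    """
    abstract = record.get("abstract", "")
    authors = record.get("authors", "")

    # Rough sentence count: split on whitespace that follows ., !, or ?.
    # Abbreviations such as "e.g." will be overcounted.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]

    # First author: the text before the first comma, or a fallback if empty.
    first_author = authors.split(",")[0].strip() or "unknown"

    return {
        **record,
        "abstract_sentence_count": len(sentences),
        "first_author": first_author,
    }


# Made-up record used only to exercise the function.
example = {
    "title": "A Hypothetical Paper",
    "authors": "Ada Lovelace, Alan Turing",
    "abstract": "We study a problem. We propose a method! Does it work? Yes.",
}
result = add_derived_fields(example)
print(result["abstract_sentence_count"], result["first_author"])
```

Returning a new dict rather than modifying the input keeps the original extracted values available for logging and comparison.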
Phase 5 Suggestions
Phase 5 Suggestion 1. New arXiv Paper (Directed)
Apply the same EVTL pipeline to a different arXiv abstract page.
Steps:
- Find an arXiv paper that interests you at https://arxiv.org
- Copy the abstract page URL (e.g., https://arxiv.org/abs/XXXX.XXXXX)
- Update PAGE_URL in your copied config file with the new URL
- Run the pipeline
- Inspect the extracted fields in the log output
- Confirm the pipeline runs successfully
Then:
- Identify the title, authors, and primary subject of your chosen paper
- Describe one field that required cleaning or special handling
- Explain how the abstract word count compares to the case example
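Before running the pipeline with a new PAGE_URL, it can help to confirm that the URL really points at an abstract page. The sketch below is one possible check, assuming new-style arXiv identifiers (four digits, a dot, then four or five digits, with an optional version suffix). The function name is hypothetical and the ID shown is a placeholder, not a real paper.

```python
import re

# New-style arXiv identifiers look like YYMM.NNNNN, optionally followed by vN.
ARXIV_ABS_PATTERN = re.compile(r"^https://arxiv\.org/abs/(\d{4}\.\d{4,5})(v\d+)?$")


def parse_arxiv_abs_url(url: str) -> str:
    """Return the arXiv ID from an abstract page URL, or raise ValueError."""
    match = ARXIV_ABS_PATTERN.match(url.strip())
    if match is None:
        raise ValueError(f"Not a new-style arXiv abstract URL: {url!r}")
    return match.group(1)


# Placeholder ID for illustration only.
PAGE_URL = "https://arxiv.org/abs/0000.00000"
print(parse_arxiv_abs_url(PAGE_URL))
```

A listing URL such as https://arxiv.org/list/cs.AI/recent would raise ValueError here, which is a useful early signal that the page structure will differ from the case example.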
Phase 5 Suggestion 2. New Web Page (Original Selection)
Apply this pipeline to a different web page of your choice.
Good options include:
- An arXiv listing page (e.g., https://arxiv.org/list/cs.AI/recent) to extract multiple papers at once
- A Wikipedia article to extract the introduction and metadata
- A Project Gutenberg page to extract text from a literary work (e.g., Pride and Prejudice at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm)
Steps:
- Open the target page in your browser
- Right-click and select "View Page Source" to inspect the HTML structure
- Identify the tags and class names that wrap the content you want
- Update your copied config file with the new URL
- Update your copied stage02_validate file to check for the new structure
- Update your copied stage03_transform file to extract the new fields
- Run the pipeline and confirm success
Then:
- Describe the HTML structure of your chosen page
- Identify the tags and attributes you used to extract each field
- Explain one challenge you encountered and how you resolved it
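When View Page Source shows a long, unfamiliar document, a quick inventory of tags and class names can narrow down where the content lives. The following sketch uses only the Python standard library (html.parser, not BeautifulSoup) and a small made-up HTML snippet; the class names in the sample are assumptions and will differ on a real page.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Count (tag, class name) pairs seen in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        class_value = dict(attrs).get("class") or ""
        # An element may carry several space-separated class names.
        # Elements without a class are counted under the empty string.
        for class_name in class_value.split() or [""]:
            self.counts[(tag, class_name)] += 1


# Made-up HTML standing in for a fetched page.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example Title</h1>
  <div class="authors"><a href="#">First Author</a>, <a href="#">Second Author</a></div>
  <blockquote class="abstract">Example abstract text.</blockquote>
</body></html>
"""

counter = TagClassCounter()
counter.feed(SAMPLE_HTML)
for (tag, class_name), count in counter.counts.most_common():
    print(f"{tag:12} class={class_name or '-':10} count={count}")
```

Pairs that appear exactly once (a title or abstract wrapper) are often good candidates for single-value fields, while pairs that repeat many times usually mark list items or links.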
Key Skill Focus
As you work, focus on:
- how to fetch HTML from a web page
- how to inspect unknown HTML structures using View Page Source
- how to identify tags, classes, and attributes that wrap desired content
- how to extract clean text using BeautifulSoup
- how to handle missing elements gracefully with fallback values
- how data moves through the EVTL pipeline
Your goal is to reuse the same pipeline pattern on new web data sources.
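Two of the skills above, extracting clean text with BeautifulSoup and falling back to a default when an element is missing, can be sketched together. The helper name `extract_text`, the CSS selectors, and the sample HTML are assumptions for illustration, not the project's actual code; the sketch parses an inline string so that it runs without a network request.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page. Note that there is no
# authors element, so that field should fall back to the default.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">  Example   Title </h1>
  <blockquote class="abstract">Example abstract text.</blockquote>
</body></html>
"""


def extract_text(soup: BeautifulSoup, selector: str, default: str = "unknown") -> str:
    """Return whitespace-normalized text for the first match, else default."""
    element = soup.select_one(selector)
    if element is None:
        return default
    # Join text nodes with a space, then collapse runs of whitespace.
    text = " ".join(element.get_text(" ", strip=True).split())
    return text or default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
record = {
    "title": extract_text(soup, "h1.title"),
    "abstract": extract_text(soup, "blockquote.abstract"),
    "authors": extract_text(soup, "div.authors"),
}
print(record)
```

Checking for None before calling get_text avoids the AttributeError that occurs when a selector matches nothing, so one missing field does not stop the whole pipeline.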
Optional Enhancements
If time allows, consider:
- extracting additional fields (PDF link, DOI, version history)
- computing additional derived fields (sentence count, average word length)
- scraping a listing page to extract multiple records into a multi-row DataFrame
- comparing two papers side by side in a single DataFrame
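For the side-by-side comparison idea, one possible layout is a DataFrame with one column per paper. The sketch below uses two made-up records with assumed field names and adds word count and average word length as derived fields; in your project the records would come from running the pipeline on two URLs.

```python
import pandas as pd

# Made-up records for illustration; field names are assumptions.
records = [
    {
        "paper": "paper_a",
        "title": "First Hypothetical Title",
        "abstract": "Short abstract with five words.",
    },
    {
        "paper": "paper_b",
        "title": "Second Hypothetical Title",
        "abstract": "A slightly longer abstract that has nine words total.",
    },
]

df = pd.DataFrame(records)

# Derived fields: whitespace-delimited word count and mean word length.
# Punctuation attached to a word is counted as part of that word.
df["abstract_word_count"] = df["abstract"].str.split().str.len()
df["avg_word_length"] = df["abstract"].apply(
    lambda text: round(sum(len(w) for w in text.split()) / max(len(text.split()), 1), 2)
)

# Transpose so each paper becomes a column and each field a row.
comparison = df.set_index("paper")[
    ["title", "abstract_word_count", "avg_word_length"]
].T
print(comparison)
```

The same multi-row DataFrame shape also fits the listing-page enhancement, where each extracted paper becomes one row before any transposing.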
Professional Communication
Remove instructor-provided content you no longer need in your project.
Make sure the title and narrative reflect your presentation. Verify key files:
- README.md
- docs/ (source and hosted on GitHub Pages)
- src/ (pipeline and stage files)
Ensure your project clearly demonstrates:
- correct EVTL pipeline execution
- understanding of HTML structure and BeautifulSoup
- ability to adapt the pipeline to new web data sources