Web Mining and Applied NLP

Project Instructions (Module 5: Web Documents and HTML Data)

denisecase/nlp-05-web-documents

WEDNESDAY: Complete Workflow Phase 1

Follow the instructions in ⭐ Workflow: Apply Example to complete:

Phase 1. Start & Run - copy the project and confirm it runs

FRIDAY/SUNDAY: Complete Workflow Phases 2-4

Again, follow the instructions above to complete:

Phase 2. Change Authorship - update the project to your name and GitHub account
Phase 3. Read & Understand - review the project structure and code
Phase 4. Make a Technical Modification - make a change and verify the project still runs

Phase 4 Suggestions

Make a small technical change that does not break the pipeline. Choose any one of these, or a different modification of your own:

Change the target URL to a different arXiv paper (choose one that interests you at https://arxiv.org)
Add a new derived column in the Transform stage (e.g., sentence count in the abstract, or first author only)
Add extraction of a new field from the page (e.g., the PDF link, or the arXiv category code)
Adjust logging messages to provide more detail about the pipeline stages

Confirm the script still runs successfully after your change.
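
As one way to approach the "new derived column" suggestion, the sketch below computes a sentence count and a first-author value from plain strings. The function names, the record keys, and the sample record are hypothetical and are not taken from the project; the project's own Transform stage may store these values differently, so treat this as an illustration of the idea rather than a drop-in change.

```python
import re


def sentence_count(text: str) -> int:
    """Count sentences by splitting on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def first_author(authors: str) -> str:
    """Return the first name in a comma-separated author string, or 'unknown' if empty."""
    first = authors.split(",")[0].strip()
    return first or "unknown"


# Hypothetical record standing in for one row produced by the Transform stage.
record = {
    "authors": "First Author, Second Author",
    "abstract": "We study parsing. Results improve! Is more data needed?",
}
record["abstract_sentence_count"] = sentence_count(record["abstract"])
record["first_author"] = first_author(record["authors"])
print(record["abstract_sentence_count"], record["first_author"])
```

The splitting rule is deliberately simple: abbreviations such as "e.g. " would be counted as sentence breaks, which may be acceptable for a rough derived field.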

Phase 5 Suggestions

Phase 5 Suggestion 1. New arXiv Paper (Directed)

Apply the same EVTL pipeline to a different arXiv abstract page.

Steps:

Find an arXiv paper that interests you at https://arxiv.org
Copy the abstract page URL (e.g., https://arxiv.org/abs/XXXX.XXXXX)
Update PAGE_URL in your copied config file with the new URL
Run the pipeline
Inspect the extracted fields in the log output
Confirm the pipeline runs successfully

Then:

Identify the title, authors, and primary subject of your chosen paper
Describe one field that required cleaning or special handling
Explain how the abstract word count compares to the case example
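
Before running the pipeline with a new PAGE_URL, it can help to check that the URL really is an abstract page. The sketch below is one possible check; the pattern, the function name, and the example identifier are assumptions made for illustration, and the pattern covers only new-style arXiv identifiers.

```python
import re
from typing import Optional

# Assumption: only new-style identifiers (four digits, a dot, four or five digits,
# with an optional version suffix such as v2) are handled. Older identifiers
# that contain a subject prefix and a slash are not matched.
ABS_URL_PATTERN = re.compile(r"^https://arxiv\.org/abs/(\d{4}\.\d{4,5})(v\d+)?$")


def parse_arxiv_id(page_url: str) -> Optional[str]:
    """Return the identifier (with version, if present), or None if the URL does not match."""
    match = ABS_URL_PATTERN.match(page_url.strip())
    if match is None:
        return None
    return match.group(1) + (match.group(2) or "")


# Placeholder identifier used only to exercise the function.
print(parse_arxiv_id("https://arxiv.org/abs/2401.01234v2"))
print(parse_arxiv_id("https://arxiv.org/list/cs.AI/recent"))
```

A None result is a hint that the URL points at a listing page or a PDF rather than an abstract page, which the single-paper pipeline may not expect.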

Phase 5 Suggestion 2. New Web Page (Original Selection)

Apply this pipeline to a different web page of your choice.

Good options include:

Another arXiv listing page (e.g., https://arxiv.org/list/cs.AI/recent) to extract multiple papers at once
A Wikipedia article to extract the introduction and metadata
A Project Gutenberg page to extract text from a literary work (e.g., Pride and Prejudice at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm)

Steps:

Open the target page in your browser
Right-click and select "View Page Source" to inspect the HTML structure
Identify the tags and class names that wrap the content you want
Update your copied config file with the new URL
Update your copied stage02_validate file to check for the new structure
Update your copied stage03_transform file to extract the new fields
Run the pipeline and confirm success

Then:

Describe the HTML structure of your chosen page
Identify the tags and attributes you used to extract each field
Explain one challenge you encountered and how you resolved it
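
When a page source is long, a quick tally of tag and class combinations can complement View Page Source when identifying which elements wrap the content. The sketch below uses only the Python standard library (not BeautifulSoup) and runs against a made-up HTML string; the class name and the sample markup are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Tally (tag, class) pairs seen in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # An element with several classes is counted once per class name.
        for class_name in (class_value.split() or ["(no class)"]):
            self.counts[(tag, class_name)] += 1


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1 class="page-title">Example</h1>
  <div class="entry"><p class="summary">First.</p></div>
  <div class="entry featured"><p class="summary">Second.</p></div>
  <p>Footer text</p>
</body></html>
"""

counter = TagClassCounter()
counter.feed(SAMPLE_HTML)
for (tag, class_name), n in counter.counts.most_common():
    print(f"{tag}.{class_name}: {n}")
```

Combinations that repeat, such as `div.entry` in this sample, are often the containers for repeated records, so they are reasonable candidates to check in the validate stage.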

Key Skill Focus

As you work, focus on:

how to fetch HTML from a web page
how to inspect unknown HTML structures using View Page Source
how to identify tags, classes, and attributes that wrap desired content
how to extract clean text using BeautifulSoup
how to handle missing elements gracefully with fallback values
how data moves through the EVTL pipeline

Your goal is to reuse the same pipeline pattern on new web data sources.
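
Two of the skills above, extracting clean text and falling back when an element is missing, can be sketched together. The example assumes the `beautifulsoup4` package is installed; the helper name, the selectors, and the sample HTML are hypothetical and do not describe any particular site's markup.

```python
from bs4 import BeautifulSoup


def text_or_default(soup: BeautifulSoup, selector: str, default: str = "unknown") -> str:
    """Return the cleaned text of the first element matching a CSS selector, or a default."""
    element = soup.select_one(selector)
    if element is None:
        return default
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = " ".join(element.get_text().split())
    return text or default


# Made-up HTML: it has a title and authors but no abstract element.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">  An Example
      Title </h1>
  <div class="authors">First Author, Second Author</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
fields = {
    "title": text_or_default(soup, "h1.title"),
    "authors": text_or_default(soup, "div.authors"),
    "abstract": text_or_default(soup, "blockquote.abstract", default="(no abstract found)"),
}
print(fields)
```

Returning a default instead of raising keeps later stages running, at the cost of hiding the problem, so logging each fallback is worth considering.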

Optional Enhancements

If time allows, consider:

extracting additional fields (PDF link, DOI, version history)
computing additional derived fields (sentence count, average word length)
scraping a listing page to extract multiple records into a multi-row DataFrame
comparing two papers side by side in a single DataFrame
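
For the derived-field and side-by-side ideas, one possible shape is a list of dictionaries with one entry per paper. The sketch below uses plain Python; the records, the field names, and the helper are hypothetical. A list of dictionaries like `rows` can be passed to `pandas.DataFrame` to obtain a multi-row DataFrame.

```python
def average_word_length(text: str) -> float:
    """Mean characters per whitespace-separated word, after trimming common punctuation."""
    words = [w.strip(".,;:!?()\"'") for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Hypothetical extracted records, one dict per paper.
papers = [
    {"title": "Paper A", "abstract": "Short words win."},
    {"title": "Paper B", "abstract": "Considerably lengthier vocabulary."},
]

rows = [
    {
        "title": p["title"],
        "abstract_word_count": len(p["abstract"].split()),
        "abstract_avg_word_length": round(average_word_length(p["abstract"]), 2),
    }
    for p in papers
]
for row in rows:
    print(row)
```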

Professional Communication

Remove instructor-provided content you no longer need in your project.

Make sure the title and narrative reflect your presentation. Verify key files:

README.md
docs/ (source and hosted on GitHub Pages)
src/ (pipeline and stage files)

Ensure your project clearly demonstrates:

correct EVTL pipeline execution
understanding of HTML structure and BeautifulSoup
ability to adapt the pipeline to new web data sources