Suggested Datasets: Web Mining and NLP¶

Good datasets for Web Mining and Natural Language Processing typically contain collections of text that can be analyzed for patterns, topics, sentiment, or structure.

These datasets help analysts explore how language is used across websites, documents, conversations, or social platforms.

These are only suggestions.

Choosing a Dataset¶

Choose datasets where each row contains meaningful text such as a sentence, paragraph, comment, or document and contain plain text that can be easily extracted and processed.

Good datasets usually have:

text that can be copied or downloaded as plain text or HTML
many rows or documents to analyze
meaningful language (articles, conversations, books, comments)

Try to avoid datasets where:

the text is embedded in PDF files
the data consists mainly of images or scanned documents
the text is extremely short (for example: single keywords)

Look for real language data that can be tokenized, counted, and explored using Python.

Index¶

These datasets work well for exploring web mining and NLP techniques. More information appears in the sections below.

Dataset	Text Source	Good For Modules
Project Gutenberg Books	Books and literature	2, 3
News Headlines	News media text	2, 3
Movie Reviews	Opinion and sentiment text	2, 3
Reddit Comments	Online discussion forums	2, 3, 4
Wikipedia Articles	Structured knowledge text	2, 3
GitHub Issue Discussions	Technical collaboration text	3, 4
MIT Shakespeare Page	Structured literary text	2, 3, 5
Web Page Text	HTML page content	3, 4, 5

1. Project Gutenberg Books¶

Source: https://www.gutenberg.org/

Project Gutenberg provides thousands of public domain books that can be downloaded as plain text.

Example uses:

word frequency analysis
tokenization
text cleaning
topic exploration

Questions to explore:

What are the most common words in a text?
How does vocabulary vary across authors?
What words characterize different genres?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration

2. News Headlines¶

Example dataset:

https://www.kaggle.com/datasets/rmisra/news-category-dataset

This dataset contains thousands of news headlines across many topics.

Example fields:

headline
category
date

Possible analyses:

word frequency by topic
keyword extraction
topic exploration

Questions to explore:

Which words appear most frequently in news headlines?
How does language differ across news categories?
What topics dominate the dataset?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration

3. Movie Reviews¶

Example dataset:

https://ai.stanford.edu/~amaas/data/sentiment/

Movie review datasets contain opinion text written by viewers.

Example fields:

review_text
sentiment label (positive/negative)

Possible analyses:

sentiment words
word frequency in positive vs negative reviews
common phrases

Questions to explore:

What words are associated with positive reviews?
What words appear frequently in negative reviews?
Are there common phrases across many reviews?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration

4. Reddit Comments¶

Example dataset:

https://www.kaggle.com/datasets/reddit/reddit-comments-may-2015

Reddit comment datasets contain conversations from online forums.

Example fields:

comment_text
subreddit
timestamp

Possible analyses:

frequent discussion topics
vocabulary differences between communities
conversation language patterns

Questions to explore:

What topics appear frequently in a subreddit?
How does language vary between communities?
What keywords characterize discussions?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration
Module 4: API or Structured Data Workflows

5. Wikipedia Articles¶

Source: https://dumps.wikimedia.org/

Wikipedia articles provide structured informational text.

Example fields:

article_title
article_text

Possible analyses:

keyword extraction
topic exploration
vocabulary analysis

Questions to explore:

What keywords define an article topic?
How does language differ between article types?
What vocabulary appears frequently across many articles?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration

6. GitHub Issue Discussions¶

Example dataset:

https://www.githubarchive.org/

GitHub issue comments contain technical discussions between developers.

Example fields:

issue_title
comment_text
repository
timestamp

Possible analyses:

technical vocabulary frequency
common problem descriptions
developer discussion topics

Questions to explore:

What problems appear most frequently in issues?
What words characterize technical discussions?
How does language vary across repositories?

Works well for modules:

Module 3: Text Exploration
Module 4: API Data Workflows

7. MIT Shakespeare Page (Complete Works)¶

Source: https://shakespeare.mit.edu/

The MIT Shakespeare site provides the complete works of Shakespeare organized by play, act, and scene. The text is available in simple HTML pages, which makes it easy to retrieve and analyze.

Example text sources:

play titles
character dialogue
scene descriptions

Possible analyses:

word frequency across plays
vocabulary differences between characters
keyword extraction for themes
comparison of language across tragedies and comedies

Questions to explore:

Which words appear most frequently across Shakespeare's works?
Do different characters use different vocabulary?
Are certain words associated with particular plays or themes?

Works well for modules:

Module 2: Text Preprocessing
Module 3: Text Exploration
Module 5: Web Document Structure

8. Web Page Text¶

Example source:

Any public website page retrieved with a web request.

Example fields extracted from HTML:

page_title
headings
paragraphs

Possible analyses:

keyword extraction
word frequency
content structure

Questions to explore:

What topics appear on a web page?
Which keywords characterize a website?
How does page structure affect text analysis?

Works well for modules:

Module 3: Text Exploration
Module 4: API Data Workflows
Module 5: Web Document Structure