Glossary (Module 3: Text Exploration)

Corpus

A collection of text documents used for analysis. In this project, each document is a short sentence with an associated category.

Document

A single unit of text within a corpus. In this project, each document is represented as one line of text.

Category

A label assigned to a document that groups it with similar documents (e.g., "dog", "cat", "car", "truck").

Token

A word-like unit extracted from text during tokenization.

Tokenization

The process of splitting text into individual tokens (words).
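As a rough illustration of tokenization, the sketch below lowercases a sentence and pulls out runs of letters, digits, and apostrophes. The function name `tokenize` and the regular expression are assumptions chosen for this example; the project may split text differently.

```python
import re


def tokenize(text: str) -> list[str]:
    """Hypothetical helper: lowercase the text and return word-like tokens."""
    # Assumption: a token is a run of ASCII letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("The dog chased the ball!")
print(tokens)  # ['the', 'dog', 'chased', 'the', 'ball']
```

Punctuation is dropped here because it never matches the pattern; a different pattern would be needed to keep hyphenated words or non-ASCII letters together.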

The process of preparing text for analysis by:

A tabular structure where each row represents a single token and its associated category.
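One possible way to build such a table is shown below using plain Python dictionaries, one per row. The sample sentences, the `corpus` layout of (text, category) pairs, and the column names `token` and `category` are all illustrative assumptions, not the project's actual data or schema.

```python
# Assumed input: a list of (text, category) pairs, one per document.
corpus = [
    ("the dog barks", "dog"),
    ("the cat sleeps", "cat"),
]

# One row per token, each carrying the category of its source document.
token_rows = [
    {"token": token, "category": category}
    for text, category in corpus
    for token in text.lower().split()
]

for row in token_rows:
    print(row["token"], row["category"])
```

A list of dictionaries like this can be passed to a data frame library if one is in use, but the row-per-token shape is the important part.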

Token Frequency

A count of how often each token appears.

Token frequency calculated across the entire corpus.

Token frequency calculated within each category.
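The difference between counting across the whole corpus and counting within each category can be sketched with `collections.Counter`. The sample rows and the variable names `global_freq` and `category_freq` are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Assumed input: (token, category) rows, one per token occurrence.
token_rows = [
    ("the", "dog"), ("dog", "dog"), ("barks", "dog"),
    ("the", "cat"), ("cat", "cat"), ("sleeps", "cat"),
]

# Frequency across the entire corpus: ignore the category.
global_freq = Counter(token for token, _ in token_rows)

# Frequency within each category: one Counter per category label.
category_freq = defaultdict(Counter)
for token, category in token_rows:
    category_freq[category][token] += 1

print(global_freq["the"])           # 2
print(category_freq["dog"]["the"])  # 1
```

A token such as "the" can rank high in the corpus-wide count while telling little about any one category, which is why both views are useful.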

A measure of which tokens appear near each other in text.

The number of tokens before and after a target token used to define its context.
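To make the idea of a context window concrete, the sketch below counts, for each target token, the tokens that fall within a fixed number of positions on either side. The function name `cooccurrence`, its `window` parameter, and the nested-counter result are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def cooccurrence(tokens: list[str], window: int = 2) -> dict[str, Counter]:
    """Hypothetical helper: count neighbors within `window` positions of each token."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i, target in enumerate(tokens):
        start = max(0, i - window)                # clamp at the start of the text
        end = min(len(tokens), i + window + 1)    # clamp at the end of the text
        for j in range(start, end):
            if j != i:                            # skip the target position itself
                counts[target][tokens[j]] += 1
    return counts


counts = cooccurrence(["the", "dog", "chased", "the", "cat"], window=1)
print(counts["dog"]["chased"])  # 1
print(counts["the"]["cat"])     # 1
```

With `window=1`, only immediate neighbors are counted; a larger window captures looser associations at the cost of more noise.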

A dictionary where:

Bigram

A pair of consecutive tokens (two-word sequence).

Bigram Frequency

A count of how often each bigram appears.
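A minimal sketch of bigram counting follows: pairing each token with its successor using `zip`, then tallying the pairs with `collections.Counter`. The token list is made-up sample data, and the variable names are assumptions for this example.

```python
from collections import Counter

# Assumed input: the tokens of a single document, in order.
tokens = ["the", "dog", "chased", "the", "dog"]

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each pair occurs.
bigram_freq = Counter(bigrams)

print(bigram_freq[("the", "dog")])  # 2
```

When counting over a whole corpus, it is usually better to build bigrams one document at a time so that the last token of one document is not paired with the first token of the next.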

A graphical representation of data (e.g., bar charts of token frequencies).

A repeated structure in the data, such as:

The process of explaining what the results mean based on observed patterns in the data.