Here’s a comprehensive list of terms related to text processing and natural language processing:
General Terms:
- Text Data: Raw textual information used for analysis or processing.
- Corpus: A collection of text documents.
- Document: An individual piece of text (e.g., an article, a paragraph).
- Token: A unit of text resulting from tokenization (e.g., word, subword).
- Vocabulary: The set of all unique tokens (words or subwords) in a corpus.
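To make these terms concrete, here is a minimal sketch in plain Python. The two-sentence corpus and the `tokenize` helper are made up for illustration, and the regex is only one simple way to split text into word tokens.

```python
import re

# A corpus is a collection of documents; here, two toy documents.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(document):
    """Lowercase a document and split it into word tokens (illustrative rule)."""
    return re.findall(r"[a-z0-9]+", document.lower())

# One list of tokens per document.
tokens_per_document = [tokenize(doc) for doc in corpus]

# Vocabulary: the unique tokens across the whole corpus, sorted for readability.
vocabulary = sorted({tok for tokens in tokens_per_document for tok in tokens})

print(tokens_per_document[0])  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
```

Note that the first document has six tokens but only five unique ones, because "the" appears twice. The vocabulary counts each token once.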
Preprocessing:
- Lowercasing: Converting all text to lowercase.
- Stop Words: Commonly used words (e.g., “the,” “is”) often removed during preprocessing.
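- Lemmatization: Reducing words to their dictionary form (lemma) using vocabulary and morphological analysis (e.g., "better" becomes "good").
- Stemming: Reducing words to a crude stem by stripping suffixes with heuristic rules; the result is not always a real word (e.g., "studies" becomes "studi").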
- Normalization: Standardizing text by removing accents, special characters, etc.
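The preprocessing steps above can be sketched with the standard library alone. The stop-word set and the `naive_stem` rule below are small hypothetical stand-ins, not a real stop-word list or an established stemmer such as Porter's. Lemmatization is left out because it needs a dictionary or a trained model.

```python
import unicodedata

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}

def normalize(text):
    """Lowercase the text and strip accents via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip one common suffix if at least three characters remain (toy rule)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The Café is serving roasted beans"
tokens = normalize(text).split()      # whitespace split, for simplicity
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]

print(tokens)    # ['the', 'cafe', 'is', 'serving', 'roasted', 'beans']
print(filtered)  # ['cafe', 'serving', 'roasted', 'beans']
print(stems)     # ['cafe', 'serv', 'roast', 'bean']
```

The stem "serv" is not an English word, which is the usual trade-off of stemming: it is fast and rule-based, while lemmatization would return "serve".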
Tokenization:
- Word Tokenization: Splitting text into individual words.
- Sentence Tokenization: Splitting text into sentences.
- Subword Tokenization…