Comprehensive List of Terms for Text Processing and Natural Language Processing | by btd | Medium

btd
Dec 6, 2023
Here’s a comprehensive list of terms related to text processing and natural language processing:
General Terms:
Text Data: Raw textual information used for analysis or processing.
Corpus: A collection of text documents.
Document: An individual piece of text (e.g., an article, a paragraph).
Token: A unit of text resulting from tokenization (e.g., word, subword).
Vocabulary: The set of all unique tokens (words or subwords) in a corpus.
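To make the general terms concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names `tokenize` and `build_vocabulary` are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# A toy corpus: a list of documents (illustrative data only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

def tokenize(document):
    """Naive whitespace tokenizer; real tokenizers also handle punctuation."""
    return document.split()

def build_vocabulary(documents):
    """Return the set of unique tokens across all documents."""
    return {token for doc in documents for token in tokenize(doc)}

vocabulary = build_vocabulary(corpus)

# Token frequencies across the whole corpus.
token_counts = Counter(token for doc in corpus for token in tokenize(doc))

print(sorted(vocabulary))
print(token_counts["the"])
```

Each string in `corpus` is a document, each item returned by `tokenize` is a token, and `vocabulary` collapses repeated tokens into a set.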
Preprocessing:
Lowercasing: Converting all text to lowercase.
Stop Words: Commonly used words (e.g., “the,” “is”) often removed during preprocessing.
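Lemmatization: Reducing words to their dictionary form (lemma) using vocabulary and morphological analysis (e.g., “mice” becomes “mouse”).
Stemming: Reducing words to a stem by heuristically stripping suffixes; the result is not always a valid word (e.g., “studies” may become “studi”).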
Normalization: Standardizing text by removing accents, special characters, etc.
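The preprocessing steps above can be sketched with the standard library alone. Everything named here (`STOP_WORDS`, `LEMMAS`, `crude_stem`, `preprocess`) is a hypothetical helper for illustration; the stop-word list and lemma table are tiny stand-ins for the much larger resources a real pipeline would use, and the suffix stripper is far simpler than an established stemming algorithm.

```python
import re
import unicodedata

# Tiny illustrative resources (assumptions, not a real stop-word list or lexicon).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "were"}
LEMMAS = {"mice": "mouse", "running": "run"}

def normalize(text):
    """Lowercase, strip accents, and replace special characters with spaces."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", text)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Look the token up in the lemma table; unknown tokens pass through."""
    return LEMMAS.get(token, token)

def preprocess(text, reduce=crude_stem):
    """Normalize, split on whitespace, remove stop words, then reduce each token."""
    tokens = remove_stop_words(normalize(text).split())
    return [reduce(t) for t in tokens]

print(preprocess("The mice were running", reduce=crude_stem))
print(preprocess("The mice were running", reduce=lemmatize))
```

Running both calls on the same sentence shows the difference between the two reductions: the stemmer only trims suffixes, while the lemmatizer maps each word to a dictionary form it knows about.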
Tokenization:
Word Tokenization: Splitting text into individual words.
Sentence Tokenization: Splitting text into sentences.
Subword Tokenization…
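As a rough illustration of the word and sentence tokenization entries above, here is a regex-based sketch. The function names `word_tokenize` and `sentence_tokenize` are chosen for this example, and the patterns are deliberately naive: the sentence splitter, for instance, would wrongly break after an abbreviation such as "Dr.".

```python
import re

def word_tokenize(text):
    """Return word tokens (keeping simple contractions) and standalone punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sentence_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Don't panic! It works."
print(word_tokenize(sample))
print(sentence_tokenize(sample))
```

Library tokenizers handle many more edge cases (abbreviations, URLs, numbers with decimal points), so treat this only as a picture of what the two terms mean.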