Comprehensive List of Terms for Text Processing and Natural Language Processing

btd
5 min read · Dec 6, 2023

Here’s a comprehensive list of terms related to text processing and natural language processing:

General Terms:

  1. Text Data: Raw textual information used for analysis or processing.
  2. Corpus: A collection of text documents.
  3. Document: An individual piece of text (e.g., an article, a paragraph).
  4. Token: A unit of text resulting from tokenization (e.g., word, subword).
  5. Vocabulary: The set of all unique tokens (e.g., words or subwords) in a corpus.
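
To make these terms concrete, here's a minimal sketch that builds a vocabulary from a tiny corpus. The two example sentences and the naive `tokenize` helper are assumptions made up for illustration; a real project would normally use a proper tokenizer.

```python
from collections import Counter

# A toy corpus: a list of documents, each document being one string.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

def tokenize(document):
    """Naive tokenizer (illustrative only): split on whitespace,
    strip basic punctuation, and lowercase each piece."""
    stripped = (piece.strip(".,!?") for piece in document.split())
    return [piece.lower() for piece in stripped if piece]

# Tokens for each document, then counts over the whole corpus.
tokens_per_doc = [tokenize(doc) for doc in corpus]
token_counts = Counter(tok for doc in tokens_per_doc for tok in doc)

# The vocabulary is the set of unique tokens (sorted here for readability).
vocabulary = sorted(token_counts)

print(tokens_per_doc[0])   # tokens of the first document
print(vocabulary)          # unique tokens across the corpus
print(token_counts["the"]) # how often "the" occurs in the corpus
```

Note that the corpus holds 12 tokens but the vocabulary has only 7 entries, since repeated words are counted once.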

Preprocessing:

  1. Lowercasing: Converting all text to lowercase.
  2. Stop Words: Commonly used words (e.g., “the,” “is”) often removed during preprocessing.
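  3. Lemmatization: Reducing words to their dictionary form (lemma) using vocabulary and morphological analysis (e.g., “better” becomes “good”).
  4. Stemming: Reducing words to a stem by stripping suffixes with heuristic rules; the result is not always a real word (e.g., “studies” may become “studi”).
  5. Normalization: Converting text to a consistent form, for example by removing accents and special characters.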
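
The preprocessing steps above can be sketched with the Python standard library alone. Everything named here (`STOP_WORDS`, `LEMMA_LOOKUP`, `crude_stem`, and the sample sentence) is a hypothetical stand-in for illustration: real stop-word lists, stemmers, and lemmatizers are much more thorough than this.

```python
import re
import unicodedata

# Tiny illustrative resources (assumptions, not real linguistic data sets).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "in"}
LEMMA_LOOKUP = {"was": "be", "were": "be", "better": "good", "mice": "mouse"}

def normalize(text):
    """Lowercase, strip accents, and replace anything that is not
    a letter, digit, or whitespace with a space."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9\s]", " ", without_accents)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix; the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_LOOKUP.get(token, token)

text = "The café is serving the better pastries!"
tokens = normalize(text).split()
content = remove_stop_words(tokens)
stems = [crude_stem(tok) for tok in content]
lemmas = [lemmatize(tok) for tok in content]

print(content)
print(stems)
print(lemmas)
```

Comparing `stems` with `lemmas` shows the difference between the two approaches: the stemmer chops suffixes mechanically, while the lemmatizer only changes words it can look up.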

Tokenization:

  1. Word Tokenization: Splitting text into individual words.
  2. Sentence Tokenization: Splitting text into sentences.
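  3. Subword Tokenization: Splitting words into smaller units, such as frequent character sequences, so that rare words can be built from known pieces.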
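
Here's a rough sketch of the three tokenization levels using only the `re` module. The regular expressions are simplifications (the sentence splitter breaks on abbreviations such as "Dr."), and `TOY_VOCAB` is a made-up vocabulary. The subword function uses a greedy longest-match strategy with `##` marking continuation pieces, in the style of WordPiece; it is one possible approach, not the only one.

```python
import re

def word_tokenize(text):
    """Return words (keeping internal apostrophes) and punctuation marks
    as separate tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def sentence_tokenize(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of a single word.
    Pieces that do not start the word carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece fits: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical subword vocabulary, for illustration only.
TOY_VOCAB = {"token", "##ization", "##ize", "##s", "un", "##known"}

text = "Tokenization isn't hard. Is it?"
words = word_tokenize(text)
sentences = sentence_tokenize(text)
subwords = subword_tokenize("tokenization", TOY_VOCAB)

print(words)
print(sentences)
print(subwords)
```

In practice, subword vocabularies are learned from a corpus rather than written by hand.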
