Welcome to this lesson on Bag-of-Words (BoW) and N-Grams, foundational techniques in Natural Language Processing (NLP) for converting text data into numerical representations. Before we move on to more advanced text vectorization methods, understanding BoW is crucial because it provides the basic framework for handling textual data in machine learning models.
In this lesson, you will learn:
- What the Bag-of-Words model is and why it’s useful.
- How n-grams enhance text representation.
- How to implement BoW with scikit-learn's `CountVectorizer`.
- Challenges and best practices for working with BoW.
The Bag-of-Words (BoW) model is a simple yet effective way to represent text data numerically. It converts text into a fixed-length numerical feature vector by counting word occurrences, disregarding grammar and word order.
How BoW Works
- Tokenization: Splitting text into individual words (tokens).
- Building a Vocabulary: Creating a list of all unique words in the dataset.
- Encoding: Counting how many times each word appears in a document.
Consider these three sample documents:
- "Machine learning is amazing."
- "Bag-of-Words is a fundamental NLP technique."
- "NLP models often rely on n-grams."
The unique words across all sentences are:
tokenized = ["machine", "learning", "is", "amazing", "bag-of-words", "a", "fundamental", "nlp", "technique", "models", "often", "rely", "on", "n-grams"]
Each sentence is then represented as a vector of counts over this vocabulary:
- "Machine learning is amazing." → [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- "Bag-of-Words is a fundamental NLP technique." → [0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
- "NLP models often rely on n-grams." → [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
Stacked together, these vectors form a document-term matrix: each row represents a document, and each column counts how often a vocabulary word occurs in that document.
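To make the encoding concrete, here is a small pure-Python sketch of the three steps (tokenize, build a vocabulary, count). It reproduces the vectors above under the simplifying assumption that lowercasing, whitespace splitting, and stripping the trailing period are enough tokenization for these sentences:

```python
docs = [
    "Machine learning is amazing.",
    "Bag-of-Words is a fundamental NLP technique.",
    "NLP models often rely on n-grams.",
]

# 1. Tokenization: lowercase, split on whitespace, strip trailing periods
tokenized = [[w.lower().rstrip(".") for w in doc.split()] for doc in docs]

# 2. Building a vocabulary: all unique tokens, in order of first appearance
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# 3. Encoding: count how often each vocabulary word appears in each document
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```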
N-grams are contiguous sequences of n words taken from a document. The basic types include:
- Unigrams: Single words (e.g., "machine")
- Bigrams: Two consecutive words (e.g., "machine learning")
- Trigrams: Three consecutive words (e.g., "learning is amazing")
Using n-grams helps capture context that single-word unigrams miss. For example, the phrase "not good" conveys negative sentiment, but a unigram model treats "not" and "good" as two unrelated features.
Given the sentence:
"Natural Language Processing is fascinating."
- Unigrams: ["Natural", "Language", "Processing", "is", "fascinating"]
- Bigrams: ["Natural Language", "Language Processing", "Processing is", "is fascinating"]
- Trigrams: ["Natural Language Processing", "Language Processing is", "Processing is fascinating"]
By using n-grams, models can capture phrases and contextual meaning instead of isolated words.
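To make this concrete, here is a tiny sketch that extracts these n-grams using simple whitespace tokenization (the trailing period is stripped so the output matches the lists above; real pipelines typically use a proper tokenizer):

```python
# Toy n-gram extraction with whitespace tokenization
def extract_ngrams(text, n):
    tokens = text.rstrip(".").split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Natural Language Processing is fascinating."
print(extract_ngrams(sentence, 1))  # unigrams
print(extract_ngrams(sentence, 2))  # bigrams
print(extract_ngrams(sentence, 3))  # trigrams
```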
Let's now implement BoW using scikit-learn's `CountVectorizer`, focusing only on unigrams. The steps are outlined below, followed by a short code sketch that puts them together.
- Import Libraries: Import `CountVectorizer` from `sklearn.feature_extraction.text` for creating the BoW model, and import `pandas` for handling data in a tabular format.
- Define Example Documents: Create a list `docs` containing three example text documents. These documents will be used to demonstrate the BoW model.
- Initialize CountVectorizer: Create an instance of `CountVectorizer` with `ngram_range=(1, 1)` to specify that only unigrams (single words) should be considered, and set `stop_words='english'` to remove common English stop words from the documents.
- Fit and Transform Documents: Use the `fit_transform` method of `CountVectorizer` to learn the vocabulary from the documents and transform them into a BoW matrix. This matrix contains the frequency of each word in each document.
- Convert to DataFrame: Convert the BoW matrix to a pandas DataFrame for better visualization. The columns of the DataFrame represent the unique words (features) in the vocabulary, and the rows represent the documents.
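Putting these steps together, a minimal version of the code could look like this (the `get_feature_names_out` call is available in recent scikit-learn releases):

```python
# Minimal BoW implementation with scikit-learn, following the steps above
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example documents
docs = [
    "Machine learning is amazing.",
    "Bag-of-Words is a fundamental NLP technique.",
    "NLP models often rely on n-grams.",
]

# Unigrams only, with common English stop words removed
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')

# Learn the vocabulary and build the document-term (BoW) matrix
bow_matrix = vectorizer.fit_transform(docs)

# Convert the sparse matrix to a DataFrame for easier inspection
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)
print(bow_df)
```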
The resulting matrix shows the frequency of each unigram in every document. These counts can be used to train machine learning models for classification, clustering, and other NLP tasks.
In this example, you will apply the Bag-of-Words and n-gram techniques to a simple text classification task using a dataset of categorized short texts. We will implement a complete pipeline that includes creating a BoW representation, splitting the data, training a classifier, and evaluating performance; a code sketch of this pipeline follows the step list below.
Step-by-Step Implementation
- Dataset Preparation: Use a dataset of categorized short texts, such as news headlines or product descriptions, with 2-3 different categories.
- BoW Representation with N-Grams: Implement BoW with an appropriate n-gram range and preprocessing options like stop word removal and stemming/lemmatization.
- Data Splitting: Split the data into training and testing sets.
- Model Training: Train a simple classifier, such as Naive Bayes or Logistic Regression, on the BoW features.
- Performance Evaluation: Evaluate the classification performance using metrics like accuracy, precision, recall, and F1-score.
- Experimentation and Analysis: Experiment with different n-gram ranges and preprocessing options, and analyze how these choices affect classification accuracy.
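One possible sketch of this pipeline is shown below. The tiny labeled dataset is purely hypothetical and exists only to illustrate the flow; substitute your own categorized texts and expect meaningful metrics only with far more data.

```python
# Sketch of a BoW + n-gram text classification pipeline (hypothetical mini dataset)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical labeled short texts spanning three categories
texts = [
    "Stock markets rally after strong earnings reports",
    "Central bank raises interest rates again",
    "Quarterly profits beat analyst expectations",
    "New smartphone features a faster processor",
    "Latest laptop review highlights battery life",
    "Software update improves camera performance",
    "Team wins championship in overtime thriller",
    "Star striker scores twice in derby match",
    "Coach praises defense after shutout victory",
]
labels = ["business", "business", "business",
          "tech", "tech", "tech",
          "sports", "sports", "sports"]

# BoW representation with unigrams and bigrams, stop words removed
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(texts)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=3, random_state=42, stratify=labels
)

# Train a Naive Bayes classifier on the BoW features
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Evaluate with precision, recall, and F1-score
print(classification_report(y_test, clf.predict(X_test)))
```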
By following this example, you can demonstrate your ability to apply BoW models to solve practical NLP tasks and understand the impact of different preprocessing and n-gram settings on classification performance.
While BoW is simple and effective, it has some challenges:
- High Dimensionality: With a large vocabulary, the feature space grows significantly.
- Loss of Semantic Meaning: BoW ignores the order and meaning of words.
- Sparse Representation: Most entries in the feature matrix are zeros, which wastes memory and computation on large datasets unless sparse data structures are used.
✔ Use N-Grams: Bigrams and trigrams can improve context understanding.
✔ Remove Stop Words: Reduces unnecessary features.
✔ Apply Dimensionality Reduction: Use techniques like PCA or feature selection to manage feature explosion (see the vocabulary-pruning sketch below).
✔ Consider Stemming/Lemmatization: Helps normalize words and reduce redundancy.
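As a lightweight illustration of several of these practices (n-grams, stop-word removal, and vocabulary-level feature selection), the vocabulary can be pruned directly when building the BoW representation. The parameter values here are illustrative assumptions, not recommendations:

```python
# Vocabulary pruning with CountVectorizer keeps the feature space manageable
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams for extra context
    stop_words="english",  # remove common English stop words
    max_features=5000,     # keep only the 5,000 most frequent terms
    min_df=2,              # ignore terms that appear in fewer than 2 documents
)
```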
In this lesson, we covered the Bag-of-Words model and how n-grams improve text representation. We implemented BoW using scikit-learn and explored practical examples.
In the next lesson, we will continue our journey into more advanced text representation techniques. Apply what you've learned to real-world datasets and experiment with different n-gram settings to deepen your understanding!
