Hello, and welcome to today's lesson on n-grams! If you've ever wondered how language models or text classifiers can understand the context or sequence in text, it's usually courtesy of today's hero — n-grams. In this lesson, we'll delve into the magic of n-grams and how essential they are for processing textual data. Specifically, we'll learn how to create n-grams from text data using Python, covering unigrams and bigrams.
In Natural Language Processing, when we analyze text, it's often beneficial to consider not only individual words but sequences of words. This approach helps to grasp the context better. Here is where n-grams come in handy.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The 'n' stands for the number of words in the sequence. For instance, in "I love dogs," a 1-gram (or unigram) is just one word, like "love." A 2-gram (or bigram) would be a sequence of 2 words, like "I love" or "love dogs".
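To make this concrete, here is a tiny, self-contained sketch in plain Python; the ngrams helper below is purely illustrative and not part of any library:

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences in a whitespace-tokenized string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love dogs", 1))  # ['I', 'love', 'dogs']    -> unigrams
print(ngrams("I love dogs", 2))  # ['I love', 'love dogs']  -> bigrams
```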
N-grams help preserve the sequential information or context in text data, contributing significantly to many language models or text classifiers.
Before we can create n-grams, we need clean, structured text data. The text needs to be cleaned and preprocessed into a desirable format, after which it can be used for feature extraction or modeling.
Here's some already familiar code that cleans our text: lower-casing the words, removing punctuation and useless words (stopwords), and reducing the remaining words to their base or stemmed form.
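The original snippet isn't reproduced here, but a minimal sketch of those steps might look like the following. It assumes the 20 Newsgroups dataset limited to its first 100 posts, NLTK's English stopword list, and the Porter stemmer; the variable name cleaned_texts and the exact cleaning choices are assumptions carried through the later examples.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.datasets import fetch_20newsgroups

nltk.download('stopwords', quiet=True)

# Fetch raw newsgroup posts and keep only the first 100 records (assumed setup).
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
raw_texts = newsgroups.data[:100]

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    # Lower-case, strip punctuation and digits, drop stopwords, and stem what remains.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(stemmer.stem(word) for word in words)

cleaned_texts = [clean_text(text) for text in raw_texts]
```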
Python's sklearn library provides an accessible way to generate n-grams. The CountVectorizer class in the sklearn.feature_extraction.text module can convert a given text into its matrix representation and allows us to specify the type of n-grams we want.
Let's set up our vectorizer as a preliminary step towards creating n-grams:
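A minimal sketch of that setup (the variable name vectorizer is our own choice):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count occurrences of unigrams and bigrams (n from 1 to 2).
vectorizer = CountVectorizer(ngram_range=(1, 2))
```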
The ngram_range=(1, 2) parameter instructs our vectorizer to generate n-grams where n ranges from 1 to 2, so the CountVectorizer will generate both unigrams and bigrams. If we wanted unigrams, bigrams, and trigrams, we could use ngram_range=(1, 3).
Now that we've set up our n-gram-generating machine, let's use it on some real-world data.
Applying the vectorizer to our cleaned text data will create the n-grams:
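A sketch of that step, assuming the cleaned_texts list from the preprocessing sketch above; note that get_feature_names_out requires scikit-learn 1.0 or newer (older releases used get_feature_names):

```python
# Learn the vocabulary of unigrams and bigrams and build the document-term matrix.
X = vectorizer.fit_transform(cleaned_texts)
features = vectorizer.get_feature_names_out()

print(X.shape)
print(features[100:111])
```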
The output of the above code will be:
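(Shown here based on the values described in the next paragraph; the feature slice is abbreviated, and the exact n-grams depend on the records fetched and the cleaning applied.)

```
(100, 16246)
['accid figur', 'accid worri', 'accomod', ...]
```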
The shape of X is (100, 16246), indicating we have a high-dimensional feature space. The first number, 100, is the number of documents or records in the dataset (here it's 100 because we limited our fetch to the first 100 records), whereas 16246 is the number of unique n-grams, or features, created from all 100 documents.
By printing features[100:111], we get a glimpse of our features, where each string represents an n-gram from our cleaned text data. The returned n-grams ['accid figur', 'accid worri', 'accomod', ...] include both unigrams (single words like accomod and account) and bigrams (two-word phrases like accid figur and accid worri).
As you can see, generating n-grams adds a new level of complexity to our analysis, as we now have multiple types of features or tokens: unigrams and bigrams. You can experiment with the ngram_range parameter in CountVectorizer to include trigrams or higher-order n-grams, depending on your specific context and requirements. Remember, each choice has implications for the complexity and interpretability of your models, and it's always a balance between the two.
Congratulations, you've finished today's lesson on n-grams! We've explored what n-grams are and their importance in text classification. We then moved on to preparing data for creating n-grams before diving into generating them using Python's CountVectorizer class in the sklearn library.
Now, it's time to get hands-on. Try generating trigrams or 4-grams from the same cleaned newsgroups data and notice the differences. Practicing these skills will not only reinforce the concepts learned in this lesson but also enable you to understand when and how much context is needed for certain tasks.
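As a hint, only the ngram_range argument needs to change. Here is a sketch for unigrams through 4-grams, reusing the cleaned_texts list assumed earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

# n ranges from 1 to 4: unigrams, bigrams, trigrams, and 4-grams.
vectorizer_wide = CountVectorizer(ngram_range=(1, 4))
X_wide = vectorizer_wide.fit_transform(cleaned_texts)

# Expect far more features than with unigrams and bigrams alone.
print(X_wide.shape)
```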
As always, happy learning!
