Welcome back to the "Foundations of NLP Data Processing" course. In the previous lesson, we explored the Bag of Words (BoW) model, a foundational technique for text feature extraction that represents text data by counting the frequency of each word in a document. While BoW is simple and effective, it does not account for the importance of words across different documents in a corpus. Today, we will delve into text feature extraction using TF-IDF, which stands for Term Frequency-Inverse Document Frequency. TF-IDF not only considers the frequency of words within a document but also evaluates their significance across the entire corpus, making it a powerful tool for text analysis.
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). In this context:
- A document refers to a single piece of text data, such as a sentence, paragraph, or article. It is the unit of text for which we calculate the term frequency (TF).
- A corpus is a collection of documents. It represents the entire dataset of text data that we are analyzing. The inverse document frequency (IDF) is calculated across the whole corpus to determine the significance of a term across all documents.
TF-IDF combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). By multiplying these two metrics, TF-IDF assigns higher scores to terms that are frequent in a document but rare in the corpus, highlighting their significance.
- Term Frequency (TF): This measures how frequently a term appears in a document. It is calculated as TF(t, d) = (number of times t appears in d) / (total number of terms in d).
- Inverse Document Frequency (IDF): This measures how rare a term is across the corpus. It is calculated as IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents that contain the term t.
Consider two documents:
- Document 1: "The cat sat on the mat."
- Document 2: "The dog chased the cat."
For simplicity, we'll demonstrate the calculations for a few terms, but in practice, you would calculate TF-IDF for all words.
Step 1: Term Frequency (TF)
- Document 1: TF(cat) = 1/6, TF(sat) = 1/6
- Document 2: TF(cat) = 1/5, TF(dog) = 1/5
Step 2: Inverse Document Frequency (IDF), using the base-10 logarithm
- IDF(cat) = log(2/2) = 0 ("cat" appears in both documents)
- IDF(sat) = log(2/1) ≈ 0.301
- IDF(dog) = log(2/1) ≈ 0.301
Step 3: TF-IDF (TF × IDF)
- Document 1: TF-IDF(cat) = (1/6) × 0 = 0, TF-IDF(sat) = (1/6) × 0.301 ≈ 0.050
- Document 2: TF-IDF(cat) = (1/5) × 0 = 0, TF-IDF(dog) = (1/5) × 0.301 ≈ 0.060
This example shows how TF-IDF highlights distinctive terms: "cat" appears in every document and scores 0, while "sat" and "dog" are unique to their documents and receive positive scores.
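The hand calculations above can be verified with a short script. This is a minimal sketch that implements the TF and IDF formulas directly, using the base-10 logarithm from the worked example:

```python
import math

# The two example documents, pre-tokenized and lowercased
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]
N = len(docs)  # total number of documents in the corpus

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log10(N / number of docs containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tfidf("cat", docs[0]), 3))  # 0.0   ("cat" is in every document)
print(round(tfidf("sat", docs[0]), 3))  # 0.05
print(round(tfidf("dog", docs[1]), 3))  # 0.06
```

Note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF and row normalization by default, so their scores will differ from these textbook values.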
An n-gram is a contiguous sequence of n items from a given text. Unigrams are single words, bigrams are pairs of consecutive words, and trigrams are sequences of three consecutive words. Using n-grams can help capture more context in text data by considering combinations of words rather than individual words alone. The `ngram_range` parameter of the `TfidfVectorizer` specifies the range of n-values for the n-grams to be extracted. By setting `ngram_range=(1, 2)`, we instruct the vectorizer to consider both unigrams (single words) and bigrams (pairs of consecutive words) when analyzing the text. The TF-IDF vectorization then accounts for individual words as well as combinations of two consecutive words, allowing for a richer representation of the text data that captures more context and relationships between words.
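To see what `ngram_range=(1, 2)` actually extracts, one option is to inspect the vectorizer's analyzer on a short phrase (this uses `CountVectorizer` only because it shares the same analyzer machinery):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the tokenization + n-gram extraction step on its own
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

# Yields lowercased unigrams first, then the bigrams
print(analyzer("The cat sat"))  # ['the', 'cat', 'sat', 'the cat', 'cat sat']
```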
Let's start with a simple example using only unigrams to understand how TF-IDF vectorization works.
The resulting DataFrame displays the TF-IDF scores for each unigram in the documents: each column represents a unigram, each row corresponds to a document, and the values are the TF-IDF scores, indicating the importance of each term in the respective document.
Now, let's extend the example to include both unigrams and bigrams.
With `ngram_range=(1, 2)`, the output includes both unigrams and bigrams, providing a richer representation of the text data. In particular, the bigrams "cat sat" and "dog chased" receive significant TF-IDF scores, indicating their importance in their respective documents.
In this lesson, you learned about TF-IDF vectorization, a powerful technique for transforming text into numerical features. We covered the basics of TF-IDF, the role of n-grams, and practical examples of implementing TF-IDF with `scikit-learn` and visualizing the results with `pandas`. As you move on to the practice exercises, apply these concepts to different datasets and experiment with various n-gram settings to deepen your understanding. This knowledge will be invaluable as you continue to explore more advanced NLP techniques in future lessons.
