Introduction to Text Feature Extraction

Welcome back to the "Foundations of NLP Data Processing" course. In the previous lesson, we explored the Bag of Words (BoW) model, a foundational technique for text feature extraction that represents text data by counting the frequency of each word in a document. While BoW is simple and effective, it does not account for the importance of words across different documents in a corpus. Today, we will delve into text feature extraction using TF-IDF, which stands for Term Frequency-Inverse Document Frequency. TF-IDF not only considers the frequency of words within a document but also evaluates their significance across the entire corpus, making it a powerful tool for text analysis.

Understanding TF-IDF Vectorization

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). In this context:

  • A document refers to a single piece of text data, such as a sentence, paragraph, or article. It is the unit of text for which we calculate the term frequency (TF).

  • A corpus is a collection of documents. It represents the entire dataset of text data that we are analyzing. The inverse document frequency (IDF) is calculated based on the entire corpus to determine the significance of a term across all documents.

TF-IDF combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). By multiplying these two metrics, TF-IDF assigns higher scores to terms that are frequent in a document but rare in the corpus, highlighting their significance.

TF-IDF Calculations

  1. Term Frequency (TF): This measures how frequently a term appears in a document. It is calculated as:

    \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}

  2. Inverse Document Frequency (IDF): This measures how rare a term is across the corpus; terms that appear in many documents receive low scores. Using a base-10 logarithm, it is calculated as:

    \text{IDF}(t) = \log_{10}\left(\frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t}\right)

  3. TF-IDF: The final score is the product of the two metrics:

    \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)

Example: Calculating TF-IDF

Consider two documents:

  • Document 1: "The cat sat on the mat."
  • Document 2: "The dog chased the cat."

For simplicity, we'll demonstrate the calculations for a few terms, but in practice, you would calculate TF-IDF for all words.

Step 1: Term Frequency (TF)

  • Document 1: TF(cat) = 1/6, TF(sat) = 1/6
  • Document 2: TF(cat) = 1/5, TF(dog) = 1/5

Step 2: Inverse Document Frequency (IDF)

  • IDF(cat) = log(2/2) = 0
  • IDF(sat) = log(2/1) ≈ 0.301
  • IDF(dog) = log(2/1) ≈ 0.301

Step 3: TF-IDF

  • Document 1: TF-IDF(cat) = (1/6) × 0 = 0, TF-IDF(sat) = (1/6) × 0.301 ≈ 0.050
  • Document 2: TF-IDF(cat) = (1/5) × 0 = 0, TF-IDF(dog) = (1/5) × 0.301 ≈ 0.060

This example shows how TF-IDF highlights term importance: "cat", which appears in every document, scores 0, while "sat" and "dog", each unique to one document, receive positive scores. The short sketch below reproduces these numbers in code.
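A minimal plain-Python sketch, using the unsmoothed, base-10 formulas defined above (scikit-learn, used later in this lesson, applies a slightly different smoothed variant):

```python
import math

# Tokenized documents (lowercased, punctuation removed).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(term in doc for doc in docs)  # documents containing the term
    return math.log10(N / df)

for term in ["cat", "sat", "dog"]:
    scores = [round(tf(term, doc) * idf(term), 3) for doc in docs]
    print(term, scores)  # cat [0.0, 0.0], sat [0.05, 0.0], dog [0.0, 0.06]
```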

Exploring N-grams

An n-gram is a contiguous sequence of n items from a given text. Unigrams are single words, bigrams are pairs of consecutive words, and trigrams are sequences of three consecutive words. Using n-grams captures more context by considering combinations of words rather than individual words alone. The ngram_range parameter of the TfidfVectorizer specifies the range of n-gram sizes to extract: setting ngram_range=(1,2) instructs the vectorizer to consider both unigrams and bigrams, so the TF-IDF vectorization accounts for individual words as well as pairs of consecutive words, yielding a richer representation of the text, as the short sketch below illustrates.
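As a quick check of what ngram_range=(1, 2) actually extracts, this sketch uses the vectorizer's build_analyzer() helper to show the tokens produced for the first example document (expected output in the comment):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the analyzer that TfidfVectorizer applies to raw text before counting.
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

print(analyzer("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat',
#  'the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```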

Example 1: Implementing TF-IDF with Unigrams

Let's start with a simple example using only unigrams to understand how TF-IDF vectorization works.
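Below is a minimal sketch using scikit-learn's TfidfVectorizer together with pandas; the two documents are the ones from the hand-calculated example above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# ngram_range=(1, 1) restricts the features to unigrams (single words).
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the sparse matrix into a DataFrame for easier inspection.
df = pd.DataFrame(tfidf_matrix.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df)
```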

Analyzing the Output

The output of the above code is a DataFrame that displays the TF-IDF scores for each unigram in the documents. Each column represents a unigram, each row corresponds to a document, and the values are the TF-IDF scores, indicating the importance of each term in the respective document. Note that scikit-learn's scores differ from our hand calculation: by default it applies a smoothed IDF and normalizes each row to unit (L2) length, which is why even "cat", which appears in every document, receives a nonzero score here.

With scikit-learn's default settings (smoothed IDF and L2 row normalization), the output should look roughly like this (values rounded):
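```
        cat    chased       dog       mat        on       sat       the
0  0.302528  0.000000  0.000000  0.425195  0.425195  0.425195  0.605055
1  0.334251  0.469779  0.469779  0.000000  0.000000  0.000000  0.668501
```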

Example 2: Implementing TF-IDF with Unigrams and Bigrams

Now, let's extend the example to include both unigrams and bigrams.
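The only change from the previous sketch is the ngram_range argument:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# ngram_range=(1, 2) extracts unigrams and bigrams into one vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

df = pd.DataFrame(tfidf_matrix.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df)
```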

Analyzing the Output

The output now includes both unigrams and bigrams, providing a richer representation of the text data.

The vocabulary now contains 15 features (7 unigrams and 8 bigrams), so the DataFrame is wider; an abridged view of the output (selected columns, values rounded) looks like this:
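```
        cat   cat sat    chased  dog chased       sat    sat on       the   the cat
0  0.224579  0.315637  0.000000    0.000000  0.315637  0.315637  0.449158  0.224579
1  0.250969  0.000000  0.352728    0.352728  0.000000  0.000000  0.501938  0.250969
```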

In this output, you can see that the bigrams "cat sat" and "dog chased" have significant TF-IDF scores, indicating their importance in the respective documents.

Summary and Next Steps

In this lesson, you learned about TF-IDF vectorization, a powerful technique for transforming text into numerical features. We covered the basics of TF-IDF, the role of n-grams, and saw practical examples of implementing TF-IDF using scikit-learn and visualizing the results with pandas. As you move on to the practice exercises, apply these concepts to different datasets and experiment with various n-gram settings to deepen your understanding. This knowledge will be invaluable as you continue to explore more advanced NLP techniques in future lessons.
