Introduction to Stemming

Hello and welcome! In the world of Natural Language Processing (NLP), dealing with text data often involves various preprocessing steps. One such essential step is "Stemming".

Stemming is a heuristic process that reduces inflected (or sometimes derived) words to their root or base form, called the stem. The principal use of stemming is to map related words to the same stem, even if that stem is not itself a valid root.

For example, if we apply stemming to the words running, runs, and run, we should get run as the result for all of them.

Why does this matter? Words like running, runs, and run all carry a similar meaning, so when processing language it is beneficial to treat them as the same word. This simplification speeds up many NLP tasks and also shrinks the feature space, because several word forms collapse into one feature, while preserving most of the informational content.

It's essential to note that stemming is not the right method for every application, because it is based on heuristics and does not take the context of a word into account. This can produce incorrect stems for some words, but stemming is still an effective strategy for many NLP applications.

Implementing Stemming with Python and NLTK

To implement stemming, we will use NLTK (Natural Language Toolkit), a powerful Python library for processing natural language. It provides several algorithms for stemming words, but in this lesson we will focus on the most common one: the Porter Stemming Algorithm.

The Porter Stemming Algorithm is a heuristic process for removing the more common morphological and inflectional endings from English words. Its primary use is in information retrieval systems. It applies five phases of word reduction in sequence, each made up of multiple heuristic rules.
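To make the idea of ordered suffix rules concrete, here is a minimal sketch of just the first sub-step (step 1a) of the original Porter algorithm, which handles plural endings. The function name porter_step_1a is an assumption for illustration. This is not NLTK's implementation, whose default mode treats a few cases differently, and the full algorithm applies many more rules after this one.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the original Porter algorithm (plural endings only)."""
    # Rules are tried longest suffix first; only the first match is rewritten.
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (left unchanged)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no plural ending matched


for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "->", porter_step_1a(w))
```

Note that ponies becomes poni, which is not an English word. Later steps work the same way: they match a suffix, check a condition on what remains, and rewrite.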

Let's see how we can implement this in Python:
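A minimal sketch, assuming NLTK is installed (for example with pip install nltk): it creates a PorterStemmer and stems the three forms of run discussed earlier. The variable names are illustrative.

```python
from nltk.stem import PorterStemmer

# PorterStemmer is rule-based, so it needs no extra corpus downloads.
stemmer = PorterStemmer()

words = ["running", "runs", "run"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
# running -> run
# runs -> run
# run -> run
```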

Applying the Porter stemmer to running, runs, and run returns run in each case. This demonstrates how stemming reduces different forms of the word run to its root form.

Applying Stemming to SMS Spam Collection Dataset

Let's now apply stemming to real data. We will use the SMS Spam Collection dataset, which you have learned to import previously.

The dataset has already been lowercased and tokenized, and its stop words have been removed, as you learned in the previous lessons:
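The stemming step can be sketched as follows. Because the loading code is not shown here, the small DataFrame below is a hypothetical stand-in for the preprocessed dataset. The tokens column name and the stem_tokens helper are assumptions, and only the stemmed_tokens column name comes from this lesson.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical stand-in rows: lowercased tokens with stop words removed.
df = pd.DataFrame({
    "tokens": [
        ["calling", "friends"],
        ["winning", "tickets"],
        ["texts", "calls"],
    ]
})


def stem_tokens(tokens):
    """Return the Porter stem of every token in one message."""
    return [stemmer.stem(token) for token in tokens]


# Stem each message's token list and store the result in a new column.
df["stemmed_tokens"] = df["tokens"].apply(stem_tokens)

print(df.head())
```

With the real dataset, the same apply call would run on whichever column holds the token lists.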

Printing the first 5 records of the SMS Spam Collection dataset after this step shows the stemming results, with each word in the messages reduced to its stemmed form.

The resulting stemmed_tokens column in our DataFrame now contains the stemmed forms of each message's tokenized words. Remember that stemming is a heuristic and therefore imperfect process, but it is still effective for many NLP tasks.
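The imperfection is easy to see in a short sketch, assuming NLTK's PorterStemmer. The word list is an illustrative choice, not taken from the dataset: three words with clearly different meanings collapse to one stem, and a word is stemmed the same way regardless of whether it is used as a noun or a verb.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: distinct concepts are merged into a single non-word stem.
distinct_words = ["universe", "university", "universal"]
stems = {word: stemmer.stem(word) for word in distinct_words}
print(stems)
# {'universe': 'univers', 'university': 'univers', 'universal': 'univers'}

# No context: "meeting" gets the same stem as a noun ("a meeting")
# and as a verb ("we are meeting").
meeting_stem = stemmer.stem("meeting")
print(meeting_stem)  # meet
```

Whether this merging is acceptable depends on the task. It tends to matter less for spam filtering or search, and more where fine distinctions in meaning are important.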

And there you go! You've learned another key preprocessing step in NLP — Stemming.

Lesson Summary and Practice

Congrats on getting to the end of this lesson! We delved into Stemming—an essential text preprocessing step in Natural Language Processing, which is instrumental in reducing dimensionality and standardizing word variations.

You have learned about the Porter Stemming algorithm and how to implement it using the powerful and convenient Natural Language Toolkit (NLTK). Given the importance of stemming in many NLP tasks, understanding and mastering this concept will undoubtedly come in handy down the line.

Next up, we have some practice exercises where you'll be tasked with performing stemming on different pieces of text. They will help strengthen your grasp of the concept and perfect your coding skills. Happy learning!
