Introduction

Hello and welcome to this lesson on Removing Stop Words and Stemming! In this lesson, we will dive deep into two essential steps to prepare text data for machine learning models: removing stop words and stemming. These techniques will help us improve the efficiency and accuracy of our models. Let's get started!

Understanding Stop Words

Stop words in Natural Language Processing (NLP) refer to the most common words in a language. Examples include "and", "the", and "is": such words carry little meaning on their own, and they are often removed to speed up processing without losing crucial information. For this purpose, Python's Natural Language Toolkit (NLTK) provides a predefined list of stop words. Let's have a look:

Here, stopwords.words('english') returns a list of English stop words. You might sometimes need to extend this list with domain-specific stop words, depending on the nature of your text data.

Introduction to Stemming

Stemming is a technique that reduces a word to its root form. Although the stemmed word may not always be a real or grammatically correct word in English, it does help to consolidate different forms of the same word to a common base form, reducing the complexity of text data. This simplification leads to quicker computation and potentially better performance when implementing Natural Language Processing (NLP) algorithms, as there are fewer unique words to consider.

For example, the words "run", "runs", and "running" might all be stemmed to the common root "run". This helps our algorithm understand that these words are related and carry a similar semantic meaning.

Let's illustrate this with Porter Stemmer, a well-known stemming algorithm from the NLTK library:
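A short sketch of the Porter Stemmer in action (the word list here is just an illustrative choice):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem several inflected forms of the same word
for word in ["run", "runs", "running"]:
    print(word, "->", stemmer.stem(word))
# run -> run
# runs -> run
# running -> run
```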

The PorterStemmer class provides a stem method that takes a word and returns its root form. Here, "running" is correctly stemmed to its root word "run". Although this kind of preprocessing can produce words that are not recognizable English, it is standard practice in text preprocessing for NLP tasks.

Stop Words Removal and Stemming in Action

Having understood stop words and stemming, let's develop a function that removes stop words and applies stemming to a given text. We will tokenize the text (split it into individual words) and apply these transformations word by word.

The remove_stopwords_and_stem function performs the required processing and returns the cleaned-up text.

Stop Words Removal and Stemming on a Dataset

Let's implement the above concepts on a real-world text dataset – the 20 Newsgroups Dataset.

This process can take a while on large datasets, but the resulting text is much cleaner and easier for a machine learning model to work with.

Summary and Conclusion

And that's a wrap! In today's lesson, we've learned about stop words and stemming as crucial steps in text preprocessing for machine learning models. We've used Python's NLTK library to work with stop words and perform stemming. We have processed some example sentences and a real-world dataset to practice these concepts.

As we proceed to more advanced NLP tasks, preprocessing techniques like removing stop words and stemming will serve as a solid foundation. In the upcoming lessons, we will delve deeper into handling missing text data and learn about reshaping textual data for analysis. Let's keep going!
