Lesson Introduction

Welcome to our session on undersampling techniques! Today, we'll explore how to handle unbalanced datasets using undersampling methods. Unbalanced datasets are common in real-world scenarios, where one class significantly outnumbers the other, leading to biased models. To simplify our understanding, we'll start with toy data, a small, manageable dataset that allows us to focus on the concepts. Later, during the practice session, you'll apply these techniques to our data. Our goal is to understand and implement various undersampling techniques using Python.

Understanding Undersampling

Undersampling balances datasets by reducing the number of instances in the majority class, improving model performance on minority classes. However, it can lead to information loss, as some data points from the majority class are removed. The key is to maintain dataset integrity while addressing imbalance.

Random Undersampling

Random undersampling is the simplest form of undersampling. It involves randomly selecting a subset of the majority class to match the size of the minority class. This method is straightforward but can result in the loss of potentially important data. You can control the number of samples removed by setting the sampling_strategy parameter, which defines the desired ratio of the minority class to the majority class. Here's how it works:

In this example, sampling_strategy=0.5 means the majority class is reduced until the minority class is half its size. The default value of sampling_strategy is 'auto', which balances the classes to a 1:1 ratio. We use RandomUnderSampler from imblearn to balance the dataset, and random_state ensures reproducibility. After applying random undersampling, the dataset shape changes, reflecting the new, more balanced class distribution.

Tomek Links

Tomek Links is an undersampling technique that removes overlapping instances between classes. It identifies pairs of instances from different classes that are each other's nearest neighbors and removes the majority class instance. This method helps clean the dataset by eliminating borderline cases. However, if there are not many similar samples, Tomek Links may remove only a few or no samples at all. Here's how to implement Tomek Links:

We use TomekLinks to apply the technique. The resulting dataset is cleaner, with fewer overlapping instances, which can improve model performance. The number of samples removed depends on the presence of overlapping instances.

In our case, the classes are well separated, so TomekLinks didn't remove any samples.

However, if you lower the class_sep parameter in the dataset generator, bringing the classes closer to each other, TomekLinks will start removing majority-class samples. For example, this happens with class_sep=0.3.

Cluster Centroids

Cluster Centroids is an advanced undersampling technique that uses clustering to create representative samples of the majority class. It reduces the majority class by replacing clusters of samples with their centroids, preserving the overall distribution.

This method works by applying a clustering algorithm (such as k-means) to the majority class and computing centroids for each cluster. These centroids then replace the original data points. This preserves the geometric structure of the data while reducing redundancy.

Let's see it in action:

ClusterCentroids performs the undersampling, with random_state ensuring consistent results. This technique is useful when you want to preserve the overall distribution of the majority class while reducing its size.

Lesson Summary and Practice Introduction

We've explored three undersampling techniques: Random Undersampling, Tomek Links, and Cluster Centroids. Each method has its advantages and trade-offs, and the choice depends on the dataset and problem context. Random Undersampling is simple but may lead to information loss. Tomek Links help clean the dataset by removing overlapping instances, while Cluster Centroids preserve the distribution of the majority class.

Now that you've learned about these undersampling techniques, it's time to put your knowledge into practice. In the upcoming practice session, you'll apply these methods to a new dataset. Experiment with different techniques and observe their effects on the dataset. This hands-on experience will solidify your understanding and prepare you for real-world applications.
