Lesson Introduction

Welcome to the lesson on oversampling techniques! In this lesson, we'll explore how oversampling can balance imbalanced datasets and improve model performance on the minority class.

Understanding Oversampling

Oversampling addresses class imbalance by increasing the number of minority class instances, either by duplicating existing ones or generating new synthetic instances. This provides the model with more examples of the minority class, leading to better learning and predictions. However, oversampling can also lead to overfitting, where the model learns noise rather than the underlying pattern.

Random Oversampling

Random Oversampling is the simplest form of oversampling, involving random duplication of minority class instances until the dataset is balanced. While straightforward, it can lead to overfitting by replicating existing data points. Additionally, because it doesn't introduce any new variability in the data, Random Oversampling may not help the model learn to generalize beyond the examples it has already seen. This method is best used when the dataset is small and quick balancing is needed for preliminary testing or baseline comparison. Here's how to implement Random Oversampling using RandomOverSampler from imblearn:
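
A minimal sketch is shown below; the lesson's exact dataset isn't reproduced here, so the make_classification settings and the sampling_strategy=0.5 value are assumptions chosen to produce a roughly 90/10 starting split and a 33/67 resampled split.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Assumed toy dataset: 1,000 samples with a roughly 90/10 class imbalance
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print("Original distribution:", Counter(y))

# sampling_strategy=0.5 duplicates random minority samples until the minority
# class is half the size of the majority class (about a 33/67 overall split)
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print("Resampled distribution:", Counter(y_resampled))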

We create a toy dataset with an imbalance and apply Random Oversampling to reduce it. With a sampling_strategy of 0.5, the resampled distribution is 33/67 (minority/majority).

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is an advanced oversampling technique that generates synthetic samples for the minority class. Instead of duplicating instances, SMOTE creates new ones by interpolating between existing ones, reducing overfitting compared to Random Oversampling. Here's how SMOTE works:

  1. Selection of Minority Instances: SMOTE begins by selecting a random instance from the minority class.
  2. Finding Neighbors: It then identifies the k-nearest neighbors of this instance within the minority class.
  3. Interpolation: A new synthetic instance is created by selecting one of the k-nearest neighbors and interpolating between the selected instance and this neighbor. The interpolation is done by taking a weighted average of the two instances, where the weights are randomly chosen.
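
Concretely, given a minority instance x_i and a chosen neighbor x_zi, the synthetic sample is x_new = x_i + λ · (x_zi − x_i), where λ is a random value drawn uniformly from [0, 1]; every synthetic point therefore lies on the line segment between the two original points.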

Here's how to implement SMOTE:
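
The sketch below reuses the same assumed toy dataset as in the Random Oversampling example and passes sampling_strategy=0.25, so the resampled split comes out at roughly 20/80.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Same assumed toy dataset with a roughly 90/10 class imbalance
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# sampling_strategy=0.25 creates synthetic minority samples until the minority
# class is a quarter of the majority class (about a 20/80 overall split);
# k_neighbors=5 (the default) controls which neighbors are used for interpolation
smote = SMOTE(sampling_strategy=0.25, k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("Resampled distribution:", Counter(y_resampled))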

SMOTE creates a more diverse set of minority class instances, which helps reduce overfitting and improve model generalization. Note that we can pass the sampling_strategy=0.25 parameter here, in the same way as we did for undersampling; in this case, the resulting distribution will be 20/80 (minority/majority).

ADASYN (Adaptive Synthetic Sampling)

ADASYN extends SMOTE by adapting to the data distribution: it generates more synthetic samples for the minority class instances that are harder to learn, that is, those surrounded mostly by majority-class examples. Here's how ADASYN works:

  1. Density Distribution Calculation: ADASYN calculates the density distribution of the minority class instances. It identifies which instances are in low-density regions, meaning they have fewer neighbors from the same class.
  2. Weight Assignment: Each minority instance is assigned a weight based on its difficulty to classify, with higher weights given to instances in low-density regions.
  3. Synthetic Sample Generation: Similar to SMOTE, ADASYN generates synthetic samples by interpolating between minority instances and their neighbors. However, the number of synthetic samples generated for each instance is proportional to its weight.
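
In the standard formulation, each minority instance x_i is assigned a ratio r_i = Δ_i / k, where Δ_i is the number of majority-class examples among its k nearest neighbors; the ratios are normalized to sum to 1, and the number of synthetic samples generated from x_i is its normalized ratio multiplied by the total number of samples to be created, so instances with more majority-class neighbors contribute more synthetic points.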

Here's how to implement ADASYN:
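
A minimal sketch on the same assumed toy dataset follows; with the default sampling_strategy="auto", ADASYN generates synthetic samples until the classes are approximately balanced, and the exact counts vary slightly because the per-instance weights are data-dependent.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

# Same assumed toy dataset with a roughly 90/10 class imbalance
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# ADASYN adaptively creates more synthetic samples for minority instances that
# have many majority-class neighbors; n_neighbors=5 is the default neighborhood size
adasyn = ADASYN(sampling_strategy="auto", n_neighbors=5, random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)
print("Resampled distribution:", Counter(y_resampled))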

ADASYN focuses the model's learning on difficult-to-classify instances, potentially improving the model's performance on challenging data points.

Lesson Summary

In this lesson, we explored oversampling techniques to handle imbalanced datasets. We started with Random Oversampling, which duplicates existing instances, and moved to advanced techniques like SMOTE and ADASYN, which generate synthetic samples. Each technique has its advantages and trade-offs, and the choice depends on the dataset and problem context. By understanding and applying these techniques, you can improve your model's performance on the minority class.

Now that you've learned about oversampling techniques, it's time to put your knowledge into practice. In the upcoming practice session, you'll apply these techniques to our dataset, enhancing your understanding and skills. Let's get started!
