Lesson Introduction

Welcome to this lesson on training a better model, especially with unbalanced datasets. Previously, we explored the challenges of unbalanced datasets and techniques like undersampling and oversampling. Today, we'll focus on improving model performance using these resampling techniques and evaluating their impact. By the end, you'll know how to enhance your model's ability to predict minority classes effectively.

Recall: AUC-ROC

The AUC-ROC (Area Under the Receiver Operating Characteristic curve) measures a classifier's performance across all decision thresholds. The ROC curve plots the true positive rate (recall) against the false positive rate as the threshold varies, and the area under it summarizes how well the model separates the classes. An AUC of 0.5 indicates no discriminative power (equivalent to random guessing), while a perfect model achieves an AUC of 1.0. Values closer to 1.0 mean the model is better at ranking positive instances above negative ones, regardless of the decision threshold. Because AUC-ROC considers all possible thresholds, it measures the model's overall ranking ability rather than its performance at any single cutoff.

We will use AUC-ROC throughout this unit because, unlike plain accuracy, it stays informative when classes are imbalanced, and it is easy to interpret.
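To make the metric concrete, here is a minimal sketch of computing AUC-ROC with scikit-learn's roc_auc_score on a tiny hand-made example (the labels and scores below are illustrative, not from the lesson's dataset):

```python
from sklearn.metrics import roc_auc_score

# Two negatives and two positives, with the model's predicted scores.
# One negative (0.4) outranks one positive (0.35), so 3 of the 4
# negative/positive pairs are ranked correctly: AUC = 3/4 = 0.75.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, y_scores))  # 0.75
```

Note that roc_auc_score takes continuous scores (probabilities), not hard 0/1 predictions, since it evaluates ranking across all thresholds.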

Initial Model Training Without Resampling

Let's start by training a logistic regression model without resampling to establish a baseline. We'll use the LogisticRegression class from sklearn.linear_model, a simple yet powerful algorithm for binary classification.

Here's how we load our dataset and train the model:
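The sketch below stands in for the lesson's missing code. Since the lesson's actual data files aren't shown, it generates a synthetic imbalanced dataset with make_classification as an assumption; in practice you would load your own train/test data here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: a synthetic dataset with ~5% minority class, standing in
# for the lesson's real data. The resulting score will differ from the
# lesson's reported 0.52.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: logistic regression with no resampling
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# AUC-ROC needs predicted probabilities, not hard class labels
y_scores = model.predict_proba(X_test)[:, 1]
print(f"Baseline AUC-ROC: {roc_auc_score(y_test, y_scores):.2f}")
```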

We load the dataset, split it into features and labels, and train a logistic regression model. The roc_auc_score function evaluates the model's performance, providing a single metric that summarizes its ability to distinguish between classes. In this case, we got a poor result of 0.52, barely better than random guessing.

Resampling Strategy

Now, let's introduce a resampling strategy to address dataset imbalance. Resampling adjusts class distribution to improve model performance. We'll use SMOTE (Synthetic Minority Over-sampling Technique) and RandomUnderSampler.

  • SMOTE: Generates synthetic samples for the minority class by interpolating between existing samples.
  • RandomUnderSampler: Randomly removes samples from the majority class to balance the dataset.

Here's how we define our resampling strategy:

We create a pipeline that applies SMOTE to oversample the minority class and RandomUnderSampler to reduce the majority class. The sampling_strategy parameter controls the desired ratio after resampling.

Implementing Resampling with a Pipeline

To streamline resampling, we use a pipeline from imblearn. A pipeline chains multiple processing steps, making the code cleaner.

Here's how we apply the resampling strategy:

The fit_resample method applies the resampling strategy, resulting in a balanced dataset. This step is crucial for improving the model's ability to learn from the minority class.

Model Training with Resampled Data

With resampled data, we train a logistic regression model and evaluate its performance to understand resampling's impact.

We train a new logistic regression model on the resampled data and evaluate its performance. Comparing AUC-ROC scores before and after resampling shows the strategy's effectiveness: the score rises from 0.52 to 0.66. While far from perfect, this is a significant improvement, and combining resampling with more advanced models can often push results further.

Lesson Summary

In this lesson, we explored training a better model by applying resampling techniques to handle unbalanced datasets. We started with a logistic regression model without resampling, introduced a resampling strategy using SMOTE and RandomUnderSampler, implemented it with a pipeline, and trained a new model with resampled data. Evaluating the model's performance using AUC-ROC showed the positive impact of resampling on predicting minority classes.

Ready for practice? Now that you've learned the theory, it's time to practice. In the upcoming session, you'll apply these resampling techniques to your datasets and observe their impact on model performance. This hands-on experience will solidify your understanding and prepare you for real-world applications.
