Welcome to this lesson on training a better model, especially with unbalanced datasets. Previously, we explored the challenges of unbalanced datasets and techniques like undersampling and oversampling. Today, we'll focus on improving model performance using these resampling techniques and evaluating their impact. By the end, you'll know how to enhance your model's ability to predict minority classes effectively.
The AUC-ROC (Area Under the Receiver Operating Characteristic curve) is a performance measurement for classification problems across all threshold settings. The ROC curve plots the true positive rate (recall) against the false positive rate. The AUC measures separability: how well the model can distinguish between classes. An AUC of 0.5 indicates no discriminative power (equivalent to random guessing), while a perfect model achieves an AUC of 1.0; the closer to 1.0, the better the model is at ranking positive instances above negative ones. Because AUC-ROC aggregates over all possible thresholds, it measures the model's overall ranking ability rather than its performance at any single threshold.
We will use AUC-ROC in this unit because, unlike accuracy, it is not inflated by the majority class, making it an informative and easy-to-interpret metric for imbalanced datasets.
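As a quick illustration, here is a minimal sketch (with made-up labels and scores) showing the two extremes with scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels: class 1 is the minority (positive) class
y_true = [0, 0, 0, 0, 1, 1]

# A model that ranks every positive above every negative earns a perfect AUC
perfect_scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
print(roc_auc_score(y_true, perfect_scores))  # 1.0

# Constant scores carry no ranking information, so AUC falls to 0.5
uninformative_scores = [0.5] * 6
print(roc_auc_score(y_true, uninformative_scores))  # 0.5
```

Note that `roc_auc_score` takes continuous scores (probabilities or decision values), not hard class predictions.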
Let's start by training a logistic regression model without resampling as our baseline. We'll use the `LogisticRegression` class from `sklearn.linear_model`, a simple yet powerful algorithm for binary classification.
Here's how we load our dataset and train the model:
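The lesson's dataset files aren't reproduced here, so the sketch below substitutes a synthetic imbalanced dataset from `make_classification`; the file-loading step and the exact score will differ from the lesson's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lesson's dataset: roughly 5% minority class
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: logistic regression trained on the imbalanced data as-is
model = LogisticRegression(max_iter=1000)  # max_iter raised to ensure convergence
model.fit(X_train, y_train)

# Evaluate with probability scores for the positive class
y_scores = model.predict_proba(X_test)[:, 1]
print(f"Baseline AUC-ROC: {roc_auc_score(y_test, y_scores):.2f}")
```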
We load the dataset, split it into features and labels, and train a logistic regression model. The `roc_auc_score` function evaluates the model's performance, providing a single metric that summarizes its ability to distinguish between classes. In this case, we got a poor result of 0.52.
Now, let's introduce a resampling strategy to address dataset imbalance. Resampling adjusts the class distribution to improve model performance. We'll use SMOTE (Synthetic Minority Over-sampling Technique) and `RandomUnderSampler`.
- SMOTE: Generates synthetic samples for the minority class by interpolating between existing samples.
- RandomUnderSampler: Randomly removes samples from the majority class to balance the dataset.
Here's how we define our resampling strategy:
We create a pipeline that applies SMOTE to oversample the minority class and `RandomUnderSampler` to reduce the majority class. The `sampling_strategy` parameter controls the desired class ratio after resampling.
To streamline resampling, we use a pipeline from `imblearn`. A pipeline chains multiple processing steps, making the code cleaner.
Here's how we apply the resampling strategy:
The `fit_resample` method applies the resampling strategy, producing a balanced dataset. This step is crucial for improving the model's ability to learn from the minority class.
With resampled data, we train a logistic regression model and evaluate its performance to understand resampling's impact.
We train a new logistic regression model using the resampled data and evaluate its performance. Comparing AUC-ROC scores before and after resampling helps assess the strategy's effectiveness. The new score is 0.66. While far from perfect, it is a significant improvement, and by combining data resampling with more advanced models, you can often achieve good results.
In this lesson, we explored training a better model by applying resampling techniques to handle unbalanced datasets. We started with a logistic regression model without resampling, introduced a resampling strategy using SMOTE and `RandomUnderSampler`, implemented it with a pipeline, and trained a new model on the resampled data. Evaluating the model's performance with AUC-ROC showed the positive impact of resampling on predicting minority classes.
Ready to practice? Now that you've learned the theory, it's time to apply it. In the upcoming session, you'll use these resampling techniques on your own datasets and observe their impact on model performance. This hands-on experience will solidify your understanding and prepare you for real-world applications.
