Introduction

Hello and welcome to the third lesson of our "Evaluation Metrics & Advanced Techniques" course! You're making excellent progress. So far, we've explored evaluation strategies for imbalanced datasets and learned how to use class weights to improve logistic regression models. Today, we'll take our skills to the next level by mastering ensemble methods specifically designed for imbalanced data.

In many real-world scenarios, traditional ensemble methods like Random Forests or AdaBoost may still struggle with severe class imbalance, despite their general robustness. Fortunately, specialized ensemble techniques have been developed to directly address class imbalance while maintaining the power of ensemble learning.

Let's enhance our imbalanced data toolkit with these powerful ensemble methods!

Understanding Ensemble Methods for Imbalanced Data

Class imbalance occurs when one class significantly outnumbers another in your dataset, making it challenging for standard models to detect the minority class. While ensemble methods like Random Forests and AdaBoost are generally robust, they can still be biased toward the majority class because most of their training samples come from that class. This can lead to poor detection of the minority class and suboptimal overall performance.

Specialized ensemble methods for imbalanced data address these issues by combining the strengths of ensemble learning with techniques that ensure fair representation of both classes. These methods often modify the sampling process or the algorithm itself to improve minority class detection.

Some of the key advantages of specialized ensemble methods for imbalanced data include:

  1. Balanced sampling: These methods incorporate various sampling techniques to ensure balanced representation of all classes during training.

  2. Algorithm-level adjustments: Rather than just pre-processing the data, these methods modify the ensemble algorithms internally to address imbalance.

  3. Combining multiple strategies: Many specialized ensemble methods integrate resampling with traditional ensemble techniques for optimal performance.

The imblearn library provides several ensemble methods specifically designed to handle class imbalance, each with its unique approach to balancing model performance across classes.

The imblearn.ensemble Module

The imblearn.ensemble module extends scikit-learn's ensemble methods with specialized algorithms for imbalanced data. These algorithms combine the powerful predictive performance of ensemble methods with techniques to address class imbalance.

Let's start by importing the necessary libraries for our task:
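Below is a minimal setup sketch. The lesson's original dataset isn't shown here, so the synthetic data built with make_classification and the variable names (X, y, X_train, X_test, y_train, y_test) are illustrative assumptions, not the lesson's exact code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from imblearn.ensemble import (
    BalancedRandomForestClassifier,
    EasyEnsembleClassifier,
    BalancedBaggingClassifier,
)

# Create a synthetic imbalanced dataset (illustrative stand-in for the lesson's data):
# roughly 95% majority class (0) and 5% minority class (1)
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)

# Stratified split so the test set preserves the class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```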

The three main classifiers we'll explore from the imblearn.ensemble module are:

  1. BalancedRandomForestClassifier: An extension of the traditional Random Forest that balances classes in each bootstrap sample by under-sampling the majority class while maintaining all minority class instances.

  2. EasyEnsembleClassifier: Creates an ensemble of AdaBoost learners trained on balanced bootstrap samples, combining boosting's focus on difficult examples with balanced sampling.

  3. BalancedBaggingClassifier: A bagging classifier that incorporates resampling to balance the training data for each base estimator, allowing for various resampling strategies and base classifiers.

Each of these classifiers has been specifically designed to handle imbalanced datasets, but they differ in their underlying algorithms and how they combine ensemble learning with balancing techniques.

A Quick Recap: Random Forest, AdaBoost, and Bagging

Before we dive into their specialized imbalanced counterparts, let's briefly refresh our understanding of the foundational ensemble methods these imblearn classifiers build upon:

  • Random Forest: Imagine building a large number of decision trees. Each tree is trained on a slightly different random subset of the original data (a "bootstrap sample") and considers only a random subset of features at each split. To make a prediction, all trees "vote," and the most popular class wins. This approach reduces overfitting and generally leads to robust models. The BalancedRandomForestClassifier adapts this by ensuring each tree is trained on balanced data.

  • AdaBoost (Adaptive Boosting): This method builds a model sequentially. It starts with a simple model (a "weak learner," often a very shallow decision tree) and then trains subsequent models that focus more on the instances that the previous models misclassified. Each model's contribution to the final prediction is weighted based on its accuracy. AdaBoost is powerful because it adaptively focuses on the "hard-to-learn" parts of the data. The EasyEnsembleClassifier combines AdaBoost with balanced sampling.

  • Bagging (Bootstrap Aggregating): This is a more general technique where multiple base models (e.g., decision trees, but could be others) are trained independently on different bootstrap samples of the training data. The final prediction is made by aggregating the predictions of all base models (e.g., by voting). Random Forest is a specific type of bagging that adds feature randomness. The BalancedBaggingClassifier applies this bagging principle but ensures each base model is trained on a balanced dataset.

Understanding these core concepts will help you appreciate how their imblearn extensions are specifically tailored to tackle class imbalance. Now, let's see them in action!

BalancedRandomForestClassifier Implementation

How it works:

  • For each tree in the forest, it draws a bootstrap sample from the minority class.
  • It then randomly under-samples the majority class to match the size of the minority class.
  • This ensures each decision tree is trained on a balanced dataset.
  • The predictions from all trees are combined through majority voting.

Let's implement the BalancedRandomForestClassifier:
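Here is a minimal sketch of the fit-and-predict step, assuming the training/test split from the setup above; the variable names brf and y_pred_brf are illustrative:

```python
# Balanced Random Forest: each tree is trained on a class-balanced bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=10, random_state=42)
brf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_brf = brf.predict(X_test)
```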

In this implementation, we create a BalancedRandomForestClassifier with 10 trees (n_estimators=10) for simplicity, though in production you might use more trees (typically 100 or more) to improve performance. We set random_state=42 to ensure reproducibility of results, then fit the model to our training data and use it to make predictions on our test data.

Unlike a standard Random Forest, which might be overwhelmed by majority class instances, this balanced version ensures every tree "sees" an equal number of samples from each class. This is particularly effective when your data has a severe imbalance, as it gives the minority class an equal voice in the training process.

EasyEnsembleClassifier Implementation

How it works:

  • Creates an ensemble of AdaBoost learners.
  • Each AdaBoost learner is trained on a balanced bootstrap sample.
  • Uses random under-sampling to balance the training data for each base classifier.
  • Combines the predictions from multiple AdaBoost models.

The EasyEnsembleClassifier is particularly effective when you need to capture complex patterns in the minority class. Let's implement it:
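A minimal sketch, again assuming the earlier split; eec and y_pred_eec are illustrative names:

```python
# Easy Ensemble: an ensemble of AdaBoost learners, each trained on a
# randomly under-sampled (balanced) subset of the training data
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)

y_pred_eec = eec.predict(X_test)
```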

In this implementation, we create an EasyEnsembleClassifier with 10 AdaBoost learners (n_estimators=10). Each of these AdaBoost learners itself contains multiple weak classifiers, creating a powerful hierarchical ensemble. The combination of AdaBoost's ability to focus on difficult examples with balanced sampling makes this method particularly effective for complex imbalanced datasets.

The key insight of this approach is that it leverages the strength of boosting (focusing on examples that are hard to classify) while ensuring balanced representation through under-sampling. This dual approach often results in excellent recall for the minority class without severely sacrificing precision.

BalancedBaggingClassifier Implementation

How it works:

  • Uses the concept of bootstrap aggregating (bagging).
  • Trains multiple base classifiers on different balanced subsets of the data.
  • Balances each subset by random under-sampling.
  • Combines predictions through voting.

Let's implement the BalancedBaggingClassifier:
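A minimal sketch using the same split; bbc and y_pred_bbc are illustrative names:

```python
# Balanced Bagging: each base estimator (decision trees by default) is trained
# on a balanced, randomly under-sampled bootstrap subset
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

y_pred_bbc = bbc.predict(X_test)
```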

This implementation creates a BalancedBaggingClassifier with 10 base estimators, each trained on a balanced subset of our data. The bagging approach reduces variance by training each estimator on a different subset, while the balancing ensures each estimator receives adequate representation of minority class examples.

What makes this classifier particularly versatile is its flexibility with base estimators. By default, it uses decision trees, but you can specify any classifier by setting the estimator parameter. This allows you to create custom ensemble solutions that combine the balancing power of this classifier with the specific strengths of different base algorithms, from simple linear models to complex neural networks.
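As a quick illustration of that flexibility, here is a hedged sketch that swaps in logistic regression as the base estimator (in recent imblearn versions the parameter is named estimator; older versions use base_estimator):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical variant: bagging over balanced subsets with a linear base model
bbc_lr = BalancedBaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),
    n_estimators=10,
    random_state=42,
)
bbc_lr.fit(X_train, y_train)
```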

Comparing Ensemble Performance

Now that we've implemented all three ensemble methods, let's compare their performance to understand their strengths and weaknesses:
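A minimal comparison sketch, assuming the predictions from the previous sections (y_pred_brf, y_pred_eec, y_pred_bbc) are available:

```python
from sklearn.metrics import classification_report

# Map each model name to its test-set predictions
predictions = {
    "BalancedRandomForestClassifier": y_pred_brf,
    "EasyEnsembleClassifier": y_pred_eec,
    "BalancedBaggingClassifier": y_pred_bbc,
}

# Print a per-class precision/recall/F1 report for each ensemble
for name, y_pred in predictions.items():
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred))
```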

This code generates a classification report for each of our ensemble methods, showing per-class precision, recall, and F1-score.

These metrics help us understand how each method handles the minority and majority classes.

When analyzing the results, look for these patterns:

  1. BalancedRandomForestClassifier typically offers a good balance between precision and recall. In our example, for the minority class (1), it achieved a precision of 0.46 and recall of 0.82. The forest structure provides stability, while the balanced sampling improves minority class detection. If you need a robust all-around performer, this is often a good choice.

  2. EasyEnsembleClassifier usually achieves the highest recall for the minority class, though sometimes at the cost of precision. Here, it has the highest recall for class 1 (0.86) but the lowest precision (0.20). This makes it ideal for applications where detecting minority instances is critical, such as fraud detection or medical diagnosis, where missing a positive case is costly.

Conclusion

Congratulations! In this third lesson of our "Evaluation Metrics & Advanced Techniques" course, we've explored powerful ensemble methods specifically designed for imbalanced datasets. These specialized techniques combine the strengths of ensemble learning with sampling approaches to effectively handle class imbalance, offering significant improvements over both standard models and simpler approaches like class weighting.

In the practice exercises that follow, you'll have the opportunity to apply these methods to different datasets, tune their hyperparameters, and gain a deeper understanding of when to use each approach. As you work through these exercises, pay attention to how different ensemble methods perform on various imbalance ratios and dataset characteristics, which will help you develop intuition for selecting the right tool for your specific machine learning challenges.
