Hello and welcome to the second lesson of our "Evaluation Metrics & Advanced Techniques" course! In our previous lesson, we explored various evaluation strategies for imbalanced datasets and learned how traditional metrics like accuracy can be misleading. Today, we'll build upon that foundation by learning how to address class imbalance directly during model training using class weights.
Class imbalance presents a significant challenge in many real-world machine learning applications. When one class significantly outnumbers the other, models tend to develop a bias toward the majority class, potentially misclassifying or overlooking the minority class entirely. This is particularly problematic when the minority class represents the events of interest (like fraud, disease, or rare events).
By the end of this lesson, you'll understand:
- Why standard classification algorithms struggle with imbalanced data
- How class weights work to counterbalance data imbalance
- How to implement balanced class weights in Logistic Regression
Let's continue our journey to master machine learning techniques for imbalanced datasets!
In our first lesson, we discovered how imbalanced datasets pose unique challenges when it comes to evaluation. Now, let's explore why standard algorithms struggle with class imbalance during the training process itself. When one class significantly outnumbers another, most algorithms naturally optimize for overall accuracy, resulting in models that excel at predicting the majority class but perform poorly on the minority class for several key reasons:
- Loss function bias: Standard algorithms like Logistic Regression minimize the overall error rate, giving equal importance to each instance. With imbalanced data, the algorithm can minimize error by focusing predominantly on the majority class.
- Decision boundary skew: The algorithm has fewer examples of the minority class to learn from, leading to a decision boundary that heavily favors the majority class.
- Optimization shortcuts: The model can achieve high accuracy by simply predicting the majority class for all instances, creating a false sense of good performance.
While our previous lesson focused on evaluating models for imbalanced data with metrics like precision, recall, and F1-score, today we'll learn how to improve model training itself using class weighting. This technique helps the algorithm pay more attention to the minority class during training, potentially leading to more balanced predictions without modifying the underlying data.
Class weights provide a powerful approach to handling imbalanced datasets directly at the model training stage. Rather than modifying the dataset itself (through sampling techniques we'll explore in future lessons), class weights adjust how the learning algorithm weighs errors for different classes. In a nutshell:
- Each class is assigned a weight that determines how much the model should "care" about errors on that class
- Higher weights make misclassifications for that class more costly
- The algorithm adjusts to minimize this weighted error rather than just the raw error count
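The idea of a weighted error can be sketched in a few lines of NumPy. The function name `weighted_log_loss` and the example weights below are illustrative, not part of scikit-learn's API:

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weights):
    """Log-loss where each sample's error is scaled by its class weight."""
    # Pick the weight corresponding to each sample's true class.
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return np.average(per_sample, weights=w)

# A missed minority-class instance (true label 1, predicted probability 0.1)
# dominates the loss when class 1 carries a weight of 9.
y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.1, 0.1, 0.1])
print(weighted_log_loss(y, p, {0: 1.0, 1: 1.0}))  # equal weights
print(weighted_log_loss(y, p, {0: 1.0, 1: 9.0}))  # minority up-weighted
```

With the minority class up-weighted, the same misclassification produces a much larger loss, so the optimizer is pushed to correct it.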
In practice, we often use balanced class weights, which are inversely proportional to class frequencies. This simply means that the minority classes receive a higher weight whereas the majority classes receive a lower weight; the more imbalanced the classes, the greater the difference in weights. For example, in a binary classification problem with classes `0` and `1`, if class `1` appears only 10% of the time, it would receive a weight approximately 9 times higher than class `0`. This effectively tells the algorithm, "pay 9 times more attention to errors on class 1."
The beauty of this approach is that it doesn't change your data — it changes how the model learns from that data, making it more sensitive to the minority class without artificial data manipulation.
Before implementing class weights, let's train a standard Logistic Regression model to establish a baseline for comparison:
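A baseline along these lines might look as follows. Since the lesson's own dataset isn't shown here, this sketch generates a stand-in imbalanced dataset with `make_classification`; in practice you would use your existing train/test split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in imbalanced dataset (~90% class 0, ~10% class 1).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Standard Logistic Regression with no special handling for imbalance.
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```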
In this code, we're creating a basic Logistic Regression model without any special handling for class imbalance. We set `random_state=42` to ensure the reproducibility of our results. After training the model on our training data, we use it to make predictions on the test set and evaluate its performance.
The `classification_report` will provide detailed metrics on precision, recall, and F1-score for each class. As we discovered in our previous lesson, despite potentially high overall accuracy, the standard model typically struggles with the minority class, showing decent precision but very low recall — indicating that it misses many instances of the minority class.
Now, let's implement a Logistic Regression model with balanced class weights to address the imbalance:
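A sketch of the weighted version is below, again using a stand-in dataset generated with `make_classification` in place of the lesson's own split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Same stand-in imbalanced dataset as the baseline.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The only change from the baseline: class_weight='balanced'.
weighted_model = LogisticRegression(class_weight='balanced', random_state=42)
weighted_model.fit(X_train, y_train)
y_pred_weighted = weighted_model.predict(X_test)
print(classification_report(y_test, y_pred_weighted))
```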
The critical difference here is the addition of the `class_weight='balanced'` parameter. This simple addition tells `scikit-learn` to automatically adjust weights inversely proportional to class frequencies in the input data. For our binary classification problem, the weight for each class is calculated as:

`n_samples / (n_classes * np.bincount(y))`

Where:
- `n_samples` is the total number of samples
- `n_classes` is the number of classes (`2` in our case)
- `np.bincount(y)` counts the number of occurrences of each class
So, if 90% of our data is class `0` and 10% is class `1`, class `0` would get a weight of approximately `0.56` and class `1` would get a weight of `5`, making errors on the minority class about 9 times more "expensive" during training.
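You can verify this calculation yourself with scikit-learn's `compute_class_weight` utility, using a toy label array that mirrors the 90/10 split above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90 instances of class 0 and 10 of class 1, matching the 90/10 example.
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # roughly [0.56, 5.0]

# The same result from the formula n_samples / (n_classes * np.bincount(y)).
manual = len(y) / (2 * np.bincount(y))
print(manual)
```

The weight ratio 5.0 / 0.56 ≈ 9 is exactly the "pay 9 times more attention" factor described earlier.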
Let's analyze the results from both models to understand the impact of class weights:
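A side-by-side comparison could be run like this; as before, the `make_classification` dataset is a stand-in for the lesson's own split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in imbalanced dataset (~90% class 0, ~10% class 1).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Train and report both models with identical settings apart from class_weight.
for name, model in [
    ("Standard", LogisticRegression(random_state=42)),
    ("Weighted", LogisticRegression(class_weight='balanced', random_state=42)),
]:
    model.fit(X_train, y_train)
    print(f"--- {name} model ---")
    print(classification_report(y_test, model.predict(X_test)))
```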
When we run the evaluation for both models, the resulting classification reports reveal key insights about the impact of class weights:
- Standard Model:
  - For the majority class (`0`), the model achieves very high precision (0.96), recall (0.99), and f1-score (0.98).
  - For the minority class (`1`), the precision is 0.67, but the recall is very low at 0.19, resulting in a low f1-score of 0.30.
  - The overall accuracy is 0.95, but this is misleading because the model is missing most of the minority class instances.
  - The macro average recall (0.59) and f1-score (0.64) reflect the poor performance on the minority class.
- Weighted Model:
  - For the majority class (`0`), precision increases to 0.99, but recall drops to 0.83, and f1-score drops to 0.90.
  - For the minority class (`1`), precision drops to 0.19, but recall jumps dramatically to 0.78, with a slight increase in f1-score to 0.31.
In this second lesson of our course, we've explored how class weights can be a powerful tool for addressing class imbalance directly during model training. We've learned that by setting `class_weight='balanced'` in `scikit-learn`'s `LogisticRegression`, we can significantly improve the model's ability to identify the minority class, often at a minimal cost to overall accuracy. The beauty of class weights lies in their simplicity: a single parameter change that can dramatically alter how the model learns from imbalanced data, without requiring any modification to the dataset itself.
Class weighting is just one of several techniques for handling imbalanced data. In future lessons, we'll explore other approaches such as ensemble techniques specifically designed for imbalanced datasets and anomaly detection for extremely imbalanced data. The practice exercises that follow will give you hands-on experience with implementing class weights in different scenarios and analyzing their impact on model performance, deepening your understanding of when and how to use this technique effectively. Happy coding!
