Introduction

Hello and welcome to the second lesson of our "Evaluation Metrics & Advanced Techniques" course! In our previous lesson, we explored various evaluation strategies for imbalanced datasets and learned how traditional metrics like accuracy can be misleading. Today, we'll build upon that foundation by learning how to address class imbalance directly during model training using class weights.

Class imbalance presents a significant challenge in many real-world machine learning applications. When one class significantly outnumbers the other, models tend to develop a bias toward the majority class, potentially misclassifying or overlooking the minority class entirely. This is particularly problematic when the minority class represents the events of interest (like fraud, disease, or rare events).

By the end of this lesson, you'll understand:

  • Why standard classification algorithms struggle with imbalanced data
  • How class weights work to counterbalance data imbalance
  • How to implement balanced class weights in Logistic Regression

Let's continue our journey to master machine learning techniques for imbalanced datasets!

The Challenge of Class Imbalance

In our first lesson, we discovered how imbalanced datasets pose unique challenges when it comes to evaluation. Now, let's explore why standard algorithms struggle with class imbalance during the training process itself. When one class significantly outnumbers another, most algorithms naturally optimize for overall accuracy, resulting in models that excel at predicting the majority class but perform poorly on the minority class for several key reasons:

  1. Loss function bias: Standard algorithms like Logistic Regression minimize the overall error rate, giving equal importance to each instance. With imbalanced data, the algorithm can minimize error by focusing predominantly on the majority class.

  2. Decision boundary skew: The algorithm has fewer examples of the minority class to learn from, leading to a decision boundary that heavily favors the majority class.

  3. Optimization shortcuts: The model can achieve high accuracy by simply predicting the majority class for all instances, creating a false sense of good performance.
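
To see point 3 in numbers, here is a toy illustration (the 95/5 split and the data are made up for this sketch):

```python
import numpy as np

# Toy labels: 95% majority class (0), 5% minority class (1)
y = np.array([0] * 95 + [1] * 5)

# A "model" that takes the shortcut of always predicting the majority class
y_pred = np.zeros_like(y)

accuracy = (y == y_pred).mean()
print(accuracy)  # 0.95 -- looks impressive, yet every minority instance is missed
```

Despite 95% accuracy, this predictor has zero recall on the minority class, which is exactly the false sense of good performance described above.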

While our previous lesson focused on evaluating models for imbalanced data with metrics like precision, recall, and F1-score, today we'll learn how to improve model training itself using class weighting. This technique helps the algorithm pay more attention to the minority class during training, potentially leading to more balanced predictions without modifying the underlying data.

Understanding Class Weights

Class weights provide a powerful approach to handling imbalanced datasets directly at the model training stage. Rather than modifying the dataset itself (through sampling techniques we'll explore in future lessons), class weights adjust how the learning algorithm weighs errors for different classes. In a nutshell:

  • Each class is assigned a weight that determines how much the model should "care" about errors on that class
  • Higher weights make misclassifications for that class more costly
  • The algorithm adjusts to minimize this weighted error rather than just the raw error count

In practice, we often use balanced class weights, which are inversely proportional to class frequencies. This simply means that the minority classes receive a higher weight whereas the majority classes receive a lower weight; the more imbalanced the classes, the greater the difference in weights. For example, in a binary classification problem with classes 0 and 1, if class 1 appears only 10% of the time, it would receive a weight approximately 9 times higher than class 0. This effectively tells the algorithm, "pay 9 times more attention to errors on class 1."

The beauty of this approach is that it doesn't change your data — it changes how the model learns from that data, making it more sensitive to the minority class without artificial data manipulation.
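
To make the weighted-error idea concrete, here is a small sketch using scikit-learn's log_loss with per-sample weights (the labels, probabilities, and the 4x weight are made-up illustration values):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities for the positive class (made up)
y_true = np.array([0, 0, 0, 0, 1])            # imbalanced: one positive in five
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.4])  # model is unsure about the positive

# Unweighted loss treats every sample equally
plain = log_loss(y_true, y_prob)

# Give the minority class 4x the weight (inverse-frequency style)
weights = np.where(y_true == 1, 4.0, 1.0)
weighted = log_loss(y_true, y_prob, sample_weight=weights)

print(plain, weighted)  # the weighted loss penalizes the missed positive more heavily
```

The weighted loss comes out larger precisely because the single poorly-predicted minority sample now dominates the average, which is the pressure that pushes the optimizer toward it.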

Building a Standard Logistic Regression Model

Before implementing class weights, let's train a standard Logistic Regression model to establish a baseline for comparison:

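A minimal sketch of such a baseline might look like this (the synthetic dataset from make_classification stands in for the course's data, which is an assumption here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline model: no special handling for class imbalance
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```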
In this code, we're creating a basic Logistic Regression model without any special handling for class imbalance. We set random_state=42 to ensure the reproducibility of our results. After training the model on our training data, we use it to make predictions on the test set and evaluate its performance.

The classification_report will provide detailed metrics on precision, recall, and F1-score for each class. As we discovered in our previous lesson, despite potentially high overall accuracy, the standard model typically struggles with the minority class, showing decent precision but very low recall — indicating that it misses many instances of the minority class.

Implementing Class Weights

Now, let's implement a Logistic Regression model with balanced class weights to address the imbalance:

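A sketch of the weighted version, again assuming the same train/test split names as before (the synthetic data is a stand-in for the course's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Same synthetic imbalanced dataset as in the baseline sketch
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Weighted model: errors on the minority class cost more during training
weighted_model = LogisticRegression(class_weight='balanced', random_state=42)
weighted_model.fit(X_train, y_train)

y_pred_weighted = weighted_model.predict(X_test)
print(classification_report(y_test, y_pred_weighted))
```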
The critical difference here is the addition of the class_weight='balanced' parameter. This simple addition tells scikit-learn to automatically adjust weights inversely proportional to class frequencies in the input data. For our binary classification problem, the weight for each class is calculated as:

```
n_samples / (n_classes * np.bincount(y))
```

Where:

  • n_samples is the total number of samples
  • n_classes is the number of classes (2 in our case)
  • np.bincount(y) counts the number of occurrences of each class

So, if 90% of our data is class 0 and 10% is class 1, class 0 would get a weight of approximately 0.55 and class 1 would get a weight of about 5, making errors on the minority class about 9 times more "expensive" during training.
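
We can check this arithmetic directly with scikit-learn's compute_class_weight utility (the 900/100 split below is just the 90%/10% example in code):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90% class 0, 10% class 1
y = np.array([0] * 900 + [1] * 100)

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # class 0 gets 1000/(2*900) ~= 0.556, class 1 gets 1000/(2*100) = 5.0
```

The ratio of the two weights is 9:1, matching the "pay 9 times more attention" intuition from earlier.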

Comparing Model Performance

Let's analyze the results from both models to understand the impact of class weights:

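A sketch of such a comparison, training both models side by side (the synthetic dataset is an assumed stand-in for the course's data, so the exact numbers it prints will differ from those discussed below):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Train both models on identical data
standard_model = LogisticRegression(random_state=42).fit(X_train, y_train)
weighted_model = LogisticRegression(
    class_weight='balanced', random_state=42
).fit(X_train, y_train)

# Print a classification report for each model
for name, m in [("Standard Model", standard_model), ("Weighted Model", weighted_model)]:
    print(f"\n{name}:")
    print(classification_report(y_test, m.predict(X_test)))
```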
Running this code prints a classification report for each model, summarizing per-class precision, recall, and F1-score.

The differences between these models reveal key insights about the impact of class weights:

  1. Standard Model:

    • For the majority class (0), the model achieves very high precision (0.96), recall (0.99), and f1-score (0.98).
    • For the minority class (1), the precision is 0.67, but the recall is very low at 0.19, resulting in a low f1-score of 0.30.
    • The overall accuracy is 0.95, but this is misleading because the model is missing most of the minority class instances.
    • The macro average recall (0.59) and f1-score (0.64) reflect the poor performance on the minority class.
  2. Weighted Model:

    • For the majority class (0), precision increases to 0.99, but recall drops to 0.83, and f1-score drops to 0.90.
    • For the minority class (1), precision drops to 0.19, but recall jumps dramatically to 0.78, with a slight increase in f1-score to 0.31.
    • This trade-off is typical of balanced weights: the model now catches most minority instances (high recall) at the cost of many more false positives (low precision).

Conclusion and Next Steps

In this second lesson of our course, we've explored how class weights can be a powerful tool for addressing class imbalance directly during model training. We've learned that by setting class_weight='balanced' in scikit-learn's LogisticRegression, we can significantly improve the model's ability to identify the minority class, often at a minimal cost to overall accuracy. The beauty of class weights lies in their simplicity — a single parameter change that can dramatically alter how the model learns from imbalanced data, without requiring any modification to the dataset itself.

Class weighting is just one of several techniques for handling imbalanced data. In future lessons, we'll explore other approaches such as ensemble techniques specifically designed for imbalanced datasets and anomaly detection for extremely imbalanced data. The practice exercises that follow will give you hands-on experience with implementing class weights in different scenarios and analyzing their impact on model performance, deepening your understanding of when and how to use this technique effectively. Happy coding!
