Introduction: The Need for Model Evaluation

Welcome to the first lesson of the course, Fixing Classical Models – Diagnosis & Regularization. In this course, you will learn how to take a poorly performing machine learning model and improve it step by step. The journey starts here, with model evaluation. Before you can fix a model, you need to know what is wrong with it. This lesson will show you how to use two essential tools for diagnosing classification models: the confusion matrix and the classification report. By the end of this lesson, you will be able to evaluate a model’s predictions and spot where it is making mistakes. This is the foundation for all the improvements you will make in the rest of the course.

What Is a Confusion Matrix?

A confusion matrix is a simple but powerful way to see how well your classification model is performing. It is a table that compares the actual labels from your dataset to the predictions made by your model. Each row of the matrix represents the true class, while each column represents the predicted class. The main components are:

  • True Positives (TP): The model correctly predicted the positive class.
  • True Negatives (TN): The model correctly predicted the negative class.
  • False Positives (FP): The model predicted the positive class, but the actual class was negative.
  • False Negatives (FN): The model predicted the negative class, but the actual class was positive.

For a binary classification problem, the confusion matrix looks like this:

                     Predicted Negative    Predicted Positive
Actual Negative      TN                    FP
Actual Positive      FN                    TP

This matrix helps you see not just how many predictions were correct, but also what kinds of mistakes your model is making. For example, if your model is predicting too many positives, you will see a high number in the FP cell.
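
To make the layout concrete, here is a minimal sketch using scikit-learn's confusion_matrix on a handful of hypothetical labels (the labels themselves are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: class 1 is the "positive" class
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[2 1]   <- actual 0: 2 true negatives, 1 false positive
#  [1 2]]  <- actual 1: 1 false negative, 2 true positives
```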

Understanding the Classification Report

While the confusion matrix gives you a raw count of correct and incorrect predictions, the classification report provides more detailed metrics. The most common metrics are:

  • Precision: Out of all the positive predictions, how many were actually positive?

    \text{Precision} = \frac{TP}{TP + FP}
  • Recall: Out of all the actual positives, how many did the model correctly identify?

    \text{Recall} = \frac{TP}{TP + FN}
  • F1-score: The harmonic mean of precision and recall, giving a single number that balances the two.

    \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
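
To see these formulas in action, here is a short sketch using scikit-learn's metric functions on the same hypothetical labels as before (so TP = 2, FP = 1, FN = 1):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Same hypothetical labels as the confusion matrix example above
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 3 ≈ 0.67
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3 ≈ 0.67
print(f1_score(y_true, y_pred))         # harmonic mean of the two ≈ 0.67
```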

Example: Evaluating a Poorly Tuned Logistic Regression Model

Let’s walk through a practical example using scikit-learn. In this example, you will see how to generate a dataset, train a logistic regression model with poor settings, and then evaluate it using both a confusion matrix and a classification report.

Here is the code:
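
The snippet below is a minimal sketch consistent with the walkthrough that follows; the random_state and test_size values are illustrative assumptions rather than fixed requirements.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset: 1,000 samples, 20 features,
# with roughly 70% of samples in one class and 30% in the other
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    weights=[0.7, 0.3],
    random_state=42,  # assumed seed, for reproducibility
)

# Hold out part of the data for evaluation (test_size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a logistic regression model with a very small C value,
# i.e. very strong regularization
model = LogisticRegression(C=1e-6)
model.fit(X_train, y_train)

# Predict on the test set and print the evaluation results
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```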

Let’s break down what is happening here. First, we generate a synthetic dataset with 1,000 samples and 20 features, where 70% of the samples belong to one class and 30% to the other. We split the data into training and test sets.

Then, we train a logistic regression model with a very small C value (C=1e-6). In scikit-learn’s logistic regression, the C parameter controls the strength of regularization. Specifically, C is the inverse of regularization strength: smaller values of C mean stronger regularization, while larger values mean weaker regularization. Regularization is a technique used to prevent overfitting by discouraging the model from fitting the training data too closely. However, if regularization is too strong (i.e., C is too small), the model can become too simple and underfit the data, failing to capture important patterns. In this example, setting C=1e-6 makes the model heavily regularized, which is why it performs poorly and fails to identify the minority class.
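
For contrast, here is a quick sketch (continuing from the variables defined in the code above, and only an illustration): training the same model with scikit-learn's default C=1.0 applies far weaker regularization, so the model is free to fit the patterns in the data.

```python
# Same data, but with the default C=1.0 (much weaker regularization)
weaker_reg = LogisticRegression(C=1.0)
weaker_reg.fit(X_train, y_train)
print(confusion_matrix(y_test, weaker_reg.predict(X_test)))
```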

Evaluating the Model

After training, we use the model to predict the test set labels. We then print out the classification report and the confusion matrix.

Here is an example of what the output might look like:
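
The exact numbers depend on the data and the split, but with the counts described below (210 class 0 samples and 90 class 1 samples in the test set, all predicted as class 0), the report and matrix would look roughly like this:

```
              precision    recall  f1-score   support

           0       0.70      1.00      0.82       210
           1       0.00      0.00      0.00        90

    accuracy                           0.70       300
   macro avg       0.35      0.50      0.41       300
weighted avg       0.49      0.70      0.58       300

[[210   0]
 [ 90   0]]
```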

From the confusion matrix, you can see that the model predicted every sample as class 0. It got all the class 0 samples correct (210) but missed all the class 1 samples (90). The classification report shows that precision, recall, and f1-score for class 1 are all zero. This is a clear sign that the model is not working well, especially for the minority class.

Summary and What’s Next

In this lesson, you learned why model evaluation is the first step in improving a machine learning model. You saw how the confusion matrix and classification report can help you understand not just how many mistakes your model is making, but also what kinds of mistakes. By looking at these outputs, you can quickly spot if your model is missing an entire class or making too many false positives.

These tools are essential for diagnosing problems before you try to fix them. In the next set of practice exercises, you will get hands-on experience generating and interpreting confusion matrices and classification reports for different models. This will prepare you for the next steps in the course, where you will learn how to actually fix the problems you find.
