Welcome back to our course, "Training Neural Networks: The Backpropagation Algorithm"! You've made excellent progress through our first three lessons, where we covered loss functions, gradient descent, and implementing backpropagation for a single neural network layer. Today, in our fourth lesson, we're going to extend your knowledge by implementing backpropagation for an entire Multi-Layer Perceptron (MLP).
In our previous lesson, we focused on calculating gradients for a single layer. While this is a crucial building block, real neural networks typically have multiple layers. Today, we'll see how to propagate gradients through an entire network, from the output layer all the way back to the input layer. This is where backpropagation truly shines — efficiently calculating gradients through complex networks with many parameters.
By the end of this lesson, we'll understand how to:
- Calculate derivatives of the `MSE` loss function
- Implement the `backward` method for a complete `MLP`
- Orchestrate the flow of gradients from the output layer to the input layer
- Analyze the gradients calculated during backpropagation
Let's dive in and unlock the full power of backpropagation!
As you may recall from our previous lesson, we implemented backpropagation for a single `DenseLayer`. We calculated how the loss changes with respect to the layer's weights and biases, and also how it changes with respect to the layer's inputs (which would be passed to the previous layer).
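For reference, here is a simplified NumPy sketch of such a layer. The attribute and method names (`weights`, `biases`, `d_weights`, `d_biases`) and the ReLU handling are illustrative assumptions based on this description, not the exact code from that lesson:

```python
import numpy as np

class DenseLayer:
    """Simplified recap of a single fully connected layer (illustrative names)."""

    def __init__(self, n_inputs, n_neurons, activation='relu'):
        self.weights = np.random.randn(n_inputs, n_neurons) * 0.1
        self.biases = np.zeros((1, n_neurons))
        self.activation = activation

    def forward(self, inputs):
        self.inputs = inputs  # cache for the backward pass
        self.z = inputs @ self.weights + self.biases
        self.output = np.maximum(0, self.z) if self.activation == 'relu' else self.z
        return self.output

    def backward(self, d_output):
        # Gradient through the activation function
        d_z = d_output * (self.z > 0) if self.activation == 'relu' else d_output
        # Gradients w.r.t. this layer's parameters (used later by gradient descent)
        self.d_weights = self.inputs.T @ d_z
        self.d_biases = np.sum(d_z, axis=0, keepdims=True)
        # Gradient w.r.t. this layer's inputs, handed to the previous layer
        return d_z @ self.weights.T
```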
The key insight for extending backpropagation to an entire `MLP` is to recognize the sequential nature of the algorithm. The name "backpropagation" comes from the fact that we propagate error gradients backward through the network, starting from the output layer and moving toward the input layer.
Here's how the process works in a multi-layer network:
- We perform a complete forward pass through all layers to get the prediction.
- We calculate the loss between our prediction and the true target.
- We compute the gradient of the loss with respect to the network's output.
- We then propagate this gradient backward through each layer, in reverse order:
  - For each layer, we receive the gradient of the loss with respect to its output.
  - We use this to calculate gradients for the layer's parameters (weights and biases).
  - We also calculate the gradient of the loss with respect to the layer's inputs.
  - This gradient becomes the input for the backpropagation step of the previous layer.
This elegant recursive process allows us to efficiently compute gradients for all parameters in the network, regardless of how many layers it has.
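In equation form, this is just the chain rule applied layer by layer. Writing $L$ for the loss, $a^{(l)}$ for the output of layer $l$, and $W^{(l)}$ for its weights (notation introduced here purely for illustration):

$$
\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(l)}}\,\frac{\partial a^{(l)}}{\partial W^{(l)}},
\qquad
\frac{\partial L}{\partial a^{(l-1)}} = \frac{\partial L}{\partial a^{(l)}}\,\frac{\partial a^{(l)}}{\partial a^{(l-1)}}
$$

The second expression is exactly the gradient each layer returns from its `backward` method and hands to the layer before it.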
Before we can backpropagate through our network, we need to calculate how the loss changes with respect to our network's output. This is the starting point for the backward pass.
We'll use the Mean Squared Error (MSE) loss function, which we covered in our first lesson. Let's implement both the loss function and its derivative:
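A minimal NumPy sketch of these two functions might look like the following (the names `mse_loss` and `mse_loss_derivative` are illustrative):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error: the average of the squared prediction errors."""
    return np.mean((y_pred - y_true) ** 2)

def mse_loss_derivative(y_true, y_pred):
    """Gradient of the MSE loss with respect to the predictions."""
    # d/dy_pred of mean((y_pred - y_true)^2) = 2 * (y_pred - y_true) / n
    return 2 * (y_pred - y_true) / y_true.size
```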
Let's understand the derivative calculation:
- The MSE loss is defined as $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the network's prediction and $y_i$ is the true target.
- Differentiating with respect to each prediction gives $\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$: the gradient is positive when the prediction is too high and negative when it is too low, which is exactly what the derivative function returns.
We've already seen and implemented the `MLP` class in the previous course. As a quick recap, the `MLP` class manages a list of layers, provides an `add_layer` method to build the network, and a `forward` method to pass data through each layer in sequence:
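A stripped-down sketch of that class might look like this (method names follow the description above; other details are assumptions):

```python
class MLP:
    """Container for a stack of layers (recap sketch with illustrative details)."""

    def __init__(self):
        self.layers = []

    def add_layer(self, layer):
        self.layers.append(layer)

    def forward(self, x):
        # Pass the data through each layer in sequence
        for layer in self.layers:
            x = layer.forward(x)
        return x
```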
This structure allows us to easily stack multiple layers and perform a forward pass through the entire network.
With our understanding of the MLP structure and how backpropagation works layer by layer, we can now implement the `backward` method for the entire network:
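A minimal sketch of such a `backward` method, written as it would sit inside the `MLP` class above, could look like this (the parameter name `d_loss_wrt_prediction` matches the description below):

```python
    def backward(self, d_loss_wrt_prediction):
        # Start from the gradient of the loss w.r.t. the network's output...
        gradient = d_loss_wrt_prediction
        # ...and walk through the layers in reverse order (output layer first).
        for layer in reversed(self.layers):
            # Each layer stores its own parameter gradients internally and
            # returns the gradient of the loss w.r.t. its inputs, which becomes
            # the upstream gradient for the previous layer.
            gradient = layer.backward(gradient)
        return gradient
```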
The beauty of this implementation lies in its simplicity. Let's break it down:
- We start with `d_loss_wrt_prediction`, which is the gradient of the loss with respect to our network's output (calculated using our loss function derivative).
- We iterate through the layers in reverse order using Python's `reversed()` function — this is the essence of backpropagation, as we're going backward through the network!
- For each layer, we call the layer's `backward` method, passing in the current gradient, and store the returned gradient (which is the gradient of the loss with respect to that layer's input). This returned gradient becomes the input for the next (previous) layer's backward pass.
- The process continues until we've propagated through all layers.
This elegant recursive approach allows the error signal to flow backward through the network, with each layer computing its contribution to the overall gradient.
Let's see how all these components work together in a complete example. We'll create a simple `MLP` with two layers, perform a forward pass, calculate the loss, and then analyze the results of backpropagation:
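Putting the sketches above together, an illustrative end-to-end example might look like this (the network sizes, input data, and targets are made up for demonstration):

```python
import numpy as np

np.random.seed(42)  # reproducible (illustrative) numbers

# Build a small two-layer network: 3 inputs -> 4 hidden (ReLU) -> 1 output (linear)
mlp = MLP()
mlp.add_layer(DenseLayer(3, 4, activation='relu'))
mlp.add_layer(DenseLayer(4, 1, activation='linear'))

# A tiny batch of inputs and targets
X = np.array([[0.5, -0.2,  0.1],
              [0.9,  0.4, -0.5]])
y_true = np.array([[1.0],
                   [0.5]])

# 1. Forward pass through the whole network
y_pred = mlp.forward(X)

# 2. Loss between prediction and target
loss = mse_loss(y_true, y_pred)
print("Loss:", loss)

# 3. Gradient of the loss w.r.t. the network's output
d_loss = mse_loss_derivative(y_true, y_pred)

# 4. Backward pass: propagate the gradient through every layer
mlp.backward(d_loss)

# Inspect the gradients each layer computed for its parameters
for i, layer in enumerate(mlp.layers, start=1):
    print(f"Layer {i} weight gradients:\n{layer.d_weights}")
    print(f"Layer {i} bias gradients:\n{layer.d_biases}")
```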
Running this code produces output similar to the following:
Let's analyze these gradients:
- Layer 1 (ReLU): We see small but nonzero gradients for most weights and biases. These values tell us how each parameter should be adjusted to reduce the overall loss. Note that some values are zero, which is characteristic of the ReLU function's derivative (zero when the input is negative).
- Layer 2 (Linear): We see somewhat larger gradients, especially for the bias, which has a gradient close to -1. This large negative gradient suggests that increasing this bias would substantially reduce the loss, which makes sense as our initial predictions were too low compared to the targets.
These gradients provide the information needed for gradient descent to update the network's parameters. The sign of the gradient (positive or negative) indicates the direction in which parameters should change to reduce the loss, while the magnitude suggests how much influence each parameter has.
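As a small preview of how these gradients get used (the full training loop is the subject of the next lesson), a single gradient descent update over the sketched network might look like this, assuming the attribute names from the earlier sketches:

```python
learning_rate = 0.01  # illustrative value

for layer in mlp.layers:
    # Move each parameter a small step against its gradient to reduce the loss
    layer.weights -= learning_rate * layer.d_weights
    layer.biases -= learning_rate * layer.d_biases
```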
Congratulations! You've now mastered one of the most fundamental algorithms in deep learning: backpropagation through a multi-layer neural network. We've extended single-layer backpropagation to work across an entire `MLP`, implemented the derivative of the `MSE` loss function, created a powerful `backward` method that orchestrates gradient flow, and analyzed the calculated gradients to understand how they guide parameter updates. This algorithm is the workhorse behind neural network training, efficiently computing gradients that drive the learning process.
In our upcoming practice exercises, you'll gain hands-on experience implementing and working with backpropagation in `MLP`s. This practice will solidify your understanding and prepare you for the last lesson in this course, where we'll build a complete training loop and explore the full Stochastic Gradient Descent optimization algorithm. The foundations you've built today are crucial stepping stones toward mastering deep learning.
