Welcome back to our course, "Training Neural Networks: The Backpropagation Algorithm"! You've made excellent progress through our first three lessons, where we covered loss functions, gradient descent, and implemented backpropagation for a single neural network layer using a modular, math.js-based approach. Today, in our fourth lesson, we're going to extend your knowledge by implementing backpropagation for an entire Multi-Layer Perceptron (MLP).
In our previous lesson, we focused on calculating gradients for a single dense layer, using math.js for matrix operations and supporting multiple activation functions. While this is a crucial building block, real neural networks typically have multiple layers. Today, we'll see how to propagate gradients through an entire network, from the output layer all the way back to the input layer. This is where backpropagation truly shines — efficiently calculating gradients through complex networks with many parameters.
By the end of this lesson, you'll understand how to:
- Calculate derivatives of the `MSE` loss function
- Implement the `backward` method for a complete `MLP`
- Orchestrate the flow of gradients from the output layer to the input layer
- Analyze the gradients calculated during backpropagation
Let's dive in and unlock the full power of backpropagation!
As you may recall from our previous lesson, we implemented backpropagation for a single dense layer using math.js and modular activation functions. We calculated how the loss changes with respect to the layer's weights and biases, and also how it changes with respect to the layer's inputs (which would be passed to the previous layer).
The key insight for extending backpropagation to an entire `MLP` is to recognize the sequential nature of the algorithm. The name "backpropagation" comes from the fact that we propagate error gradients backward through the network, starting from the output layer and moving toward the input layer.
Here's how the process works in a multi-layer network:
- We perform a complete forward pass through all layers to get the prediction.
- We calculate the loss between our prediction and the true target.
- We compute the gradient of the loss with respect to the network's output.
- We then propagate this gradient backward through each layer, in reverse order:
  - For each layer, we receive the gradient of the loss with respect to its output.
  - We use this to calculate gradients for the layer's parameters (weights and biases).
  - We also calculate the gradient of the loss with respect to the layer's inputs.
  - This gradient becomes the input for the backpropagation step of the previous layer.
This elegant recursive process allows us to efficiently compute gradients for all parameters in the network, regardless of how many layers it has.
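To make this flow concrete before we build each piece, here is a compact sketch of one backpropagation pass written as a generic helper. The function name `backpropStep` and its parameters are illustrative; the concrete loss functions, layers, and `MLP` class follow in the rest of this lesson.

```javascript
// A generic backpropagation step: the four steps above, in code.
// `network` is assumed to expose a `layers` array whose elements have
// forward/backward methods, as in the MLP we build below.
function backpropStep(network, input, target, lossFn, lossDerivative) {
  const prediction = network.forward(input);       // 1. forward pass through all layers
  const loss = lossFn(target, prediction);         // 2. scalar loss (useful for monitoring)
  let grad = lossDerivative(target, prediction);   // 3. gradient of the loss w.r.t. the output
  for (let i = network.layers.length - 1; i >= 0; i--) {
    grad = network.layers[i].backward(grad);       // 4. propagate the gradient backward
  }
  return loss;
}
```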
Before we can backpropagate through our network, we need to calculate how the loss changes with respect to our network's output. This is the starting point for the backward pass.
We'll use the Mean Squared Error (MSE) loss function, which we covered in our first lesson. Let's implement both the loss function and its derivative using math.js:
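Here is a minimal sketch of how these two functions can be written with math.js. The names `mseLoss` and `mseLossDerivative`, and the assumption that `yTrue` and `yPred` are math.js matrices of shape `[batchSize, 1]`, are conventions used throughout this lesson; your exact implementation may differ.

```javascript
const math = require('mathjs');

// Mean Squared Error: average squared difference between predictions and targets
function mseLoss(yTrue, yPred) {
  const diff = math.subtract(yPred, yTrue);
  return math.mean(math.dotMultiply(diff, diff));
}

// Derivative of the MSE loss with respect to the predictions:
// dL/dyPred = 2 * (yPred - yTrue) / batchSize
function mseLossDerivative(yTrue, yPred) {
  const batchSize = yTrue.size()[0];               // number of samples in the batch
  const diff = math.subtract(yPred, yTrue);
  return math.divide(math.multiply(2, diff), batchSize);
}
```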
Let's understand the derivative calculation:
- The MSE loss is defined as $L = \frac{1}{N} \sum_{i=1}^{N} \left(y_{\text{pred},i} - y_{\text{true},i}\right)^2$, where $N$ is the batch size.
- To find $\frac{\partial L}{\partial y_{\text{pred},i}}$, we differentiate with respect to $y_{\text{pred},i}$: $\frac{\partial L}{\partial y_{\text{pred},i}} = \frac{2}{N} \left(y_{\text{pred},i} - y_{\text{true},i}\right)$.
- In JavaScript, we use `yTrue.size()[0]` to get the batch size $N$.
The division by batch size normalizes the gradient, which is important for consistent learning regardless of batch size. This derivative gives us the direction in which our predictions would need to change to decrease the loss, serving as the starting point for our backward pass through the network.
We've already seen and implemented the `MLP` class in the previous lessons. As a quick recap, the `MLP` class manages a list of layers, provides an `addLayer` method to build the network, and a `forward` method to pass data through each layer in sequence:
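Here is a recap sketch of that structure (your implementation from the previous lessons may differ in detail, but the shape is the same):

```javascript
class MLP {
  constructor() {
    this.layers = [];            // layers in order, from input side to output side
  }

  // Append a layer to the end of the network
  addLayer(layer) {
    this.layers.push(layer);
  }

  // Forward pass: feed the data through each layer in sequence
  forward(input) {
    let output = input;
    for (const layer of this.layers) {
      output = layer.forward(output);
    }
    return output;
  }
}
```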
This structure allows us to easily stack multiple layers and perform both forward and backward passes through the entire network.
With our understanding of the MLP structure and how backpropagation works layer by layer, we can now implement the `backward` method for the entire network. The method simply loops through the layers in reverse order, passing the gradient backward:
Notice that `backward(dLossWrtPrediction)` here is the MLP's backward method, the one that orchestrates the backward pass through all layers. Inside the loop, `this.layers[i].backward(currentDLoss)` calls the backward method of each individual layer (such as a `DenseLayer`'s backward).
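For reference, here is a simplified sketch of what such a layer-level `backward` can look like for a dense layer with ReLU or linear activation, following the conventions referenced in this lesson (`dWeights`, `dBiases`, math.js matrices). Treat it as one possible implementation rather than the exact code from the previous lesson:

```javascript
// Assumes `const math = require('mathjs')`, as in the loss snippet above.
class DenseLayer {
  constructor(inputSize, outputSize, activation = 'linear') {
    // Small random weights, zero biases; shapes: weights [in x out], biases [1 x out]
    this.weights = math.matrix(math.random([inputSize, outputSize], -0.5, 0.5));
    this.biases = math.matrix(math.zeros(1, outputSize));
    this.activation = activation;  // 'relu' or 'linear'
  }

  forward(input) {
    this.input = input;            // cached for the backward pass
    const batchSize = input.size()[0];
    // z = input * W + b (the bias row is repeated across the batch)
    this.z = math.add(
      math.multiply(input, this.weights),
      math.multiply(math.ones(batchSize, 1), this.biases)
    );
    return this.activation === 'relu'
      ? math.map(this.z, (v) => Math.max(0, v))
      : this.z;
  }

  backward(dLossWrtOutput) {
    // Gradient through the activation: ReLU' is 1 where z > 0 and 0 elsewhere; linear' is 1
    const activationGrad = this.activation === 'relu'
      ? math.map(this.z, (v) => (v > 0 ? 1 : 0))
      : math.map(this.z, () => 1);
    const dZ = math.dotMultiply(dLossWrtOutput, activationGrad);

    const batchSize = this.input.size()[0];
    // Parameter gradients, stored on the layer for the optimizer to use later
    this.dWeights = math.multiply(math.transpose(this.input), dZ);
    this.dBiases = math.multiply(math.ones(1, batchSize), dZ);   // sums dZ over the batch

    // Gradient of the loss w.r.t. this layer's input, passed to the previous layer
    return math.multiply(dZ, math.transpose(this.weights));
  }
}
```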
Let's break down how the `backward` method for the MLP works and why it's so effective:

- We start with `dLossWrtPrediction`, which is the gradient of the loss with respect to our network's output (calculated using our loss function derivative).
- We iterate through the layers in reverse order using a reverse for loop; this is the essence of backpropagation, as we're going backward through the network!
- For each layer, we call the layer's `backward` method, passing in the current gradient, and store the returned gradient (which is the gradient of the loss with respect to that layer's input). This returned gradient becomes the input for the next (previous) layer's backward pass.
- The process continues until we've propagated through all layers.
The beauty of this implementation lies in its simplicity and modularity. Each layer is responsible for computing its own gradients and passing the necessary information backward. This approach allows the error signal to flow backward through the network, with each layer computing its contribution to the overall gradient. As a result, the code remains clean, easy to understand, and scalable to networks of any depth.
Let's see how all these components work together in a complete example. We'll create a simple `MLP` with two layers, perform a forward pass, calculate the loss, and then analyze the results of backpropagation.

Below is a sample implementation. For this example, we assume you have already implemented the `DenseLayer` class from the previous lesson, and that each layer stores its gradients in `dWeights` and `dBiases` properties.
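One way such an example can look, reusing the `mseLoss`/`mseLossDerivative` functions and the `MLP`/`DenseLayer` sketches from earlier in this lesson (the input values, targets, and layer sizes here are illustrative, not necessarily the ones from the original example):

```javascript
// Build a small network: 3 inputs -> 4 hidden ReLU units -> 1 linear output
const mlp = new MLP();
mlp.addLayer(new DenseLayer(3, 4, 'relu'));
mlp.addLayer(new DenseLayer(4, 1, 'linear'));

// A tiny batch of two samples and their targets (illustrative values)
const X = math.matrix([
  [0.5, -0.2, 0.1],
  [0.9, 0.4, -0.3],
]);
const yTrue = math.matrix([
  [0.4],
  [0.6],
]);

// 1. Forward pass
const yPred = mlp.forward(X);
console.log('Predictions:', yPred.toString());

// 2. Loss
console.log('MSE loss:', mseLoss(yTrue, yPred));

// 3. Gradient of the loss with respect to the predictions
const dLossWrtPrediction = mseLossDerivative(yTrue, yPred);

// 4. Backward pass through the whole network
mlp.backward(dLossWrtPrediction);

// Inspect the gradients stored in each layer during backpropagation
mlp.layers.forEach((layer, i) => {
  console.log(`Layer ${i + 1} dWeights:`, layer.dWeights.toString());
  console.log(`Layer ${i + 1} dBiases:`, layer.dBiases.toString());
});
```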
Running this code prints the predictions, the loss, and the gradients stored in each layer after the backward pass. The exact numbers vary with the random weight initialization, but the overall pattern is consistent.
Let's analyze these gradients:
- **Layer 1 (ReLU)**: We see small but nonzero gradients for most weights and biases. These values tell us how each parameter should be adjusted to reduce the overall loss. Some values are zero, which is characteristic of the ReLU function's derivative (it is zero wherever the pre-activation input is negative).
- **Layer 2 (Linear)**: We see somewhat larger gradients, especially for the bias, which in a typical run has a gradient close to -1. This large negative gradient suggests that increasing this bias would substantially reduce the loss, which makes sense as our initial predictions were too low compared to the targets.
These gradients provide the information needed for gradient descent to update the network's parameters. The sign of the gradient (positive or negative) indicates the direction in which parameters should change to reduce the loss, while the magnitude suggests how much influence each parameter has.
Congratulations! You've now mastered one of the most fundamental algorithms in deep learning: backpropagation through a multi-layer neural network. We've extended single-layer backpropagation to work across an entire `MLP`, implemented the derivative of the `MSE` loss function, created a powerful `backward` method that orchestrates gradient flow, and analyzed the calculated gradients to understand how they guide parameter updates. This algorithm is the workhorse behind neural network training, efficiently computing gradients that drive the learning process.

In our upcoming practice exercises, you'll gain hands-on experience implementing and working with backpropagation in MLPs. This practice will solidify your understanding and prepare you for the last lesson in this course, where we'll build a complete training loop and explore the full Stochastic Gradient Descent optimization algorithm. The foundations you've built today are crucial stepping stones toward mastering deep learning.
