Welcome back to our course, "Training Neural Networks: The Backpropagation Algorithm"! You've made excellent progress through our first three lessons, where we covered loss functions, gradient descent, and implemented backpropagation for a single neural network layer using a modular, math.js-based approach. Today, in our fourth lesson, we're going to extend your knowledge by implementing backpropagation for an entire Multi-Layer Perceptron (MLP).
In our previous lesson, we focused on calculating gradients for a single dense layer, using math.js for matrix operations and supporting multiple activation functions. While this is a crucial building block, real neural networks typically have multiple layers. Today, we'll see how to propagate gradients through an entire network, from the output layer all the way back to the input layer. This is where backpropagation truly shines — efficiently calculating gradients through complex networks with many parameters.
By the end of this lesson, you'll understand how to:
- Calculate derivatives of the `MSE` loss function
- Implement the `backward` method for a complete `MLP`
- Orchestrate the flow of gradients from the output layer to the input layer
- Analyze the gradients calculated during backpropagation
Let's dive in and unlock the full power of backpropagation!
As you may recall from our previous lesson, we implemented backpropagation for a single dense layer using math.js and modular activation functions. We calculated how the loss changes with respect to the layer's weights and biases, and also how it changes with respect to the layer's inputs (which would be passed to the previous layer).
The key insight for extending backpropagation to an entire `MLP` is to recognize the sequential nature of the algorithm. The name "backpropagation" comes from the fact that we propagate error gradients backward through the network, starting from the output layer and moving toward the input layer.
Here's how the process works in a multi-layer network:
- We perform a complete forward pass through all layers to get the prediction.
- We calculate the loss between our prediction and the true target.
- We compute the gradient of the loss with respect to the network's output.
- We then propagate this gradient backward through each layer, in reverse order:
  - For each layer, we receive the gradient of the loss with respect to its output.
  - We use this to calculate gradients for the layer's parameters (weights and biases).
  - We also calculate the gradient of the loss with respect to the layer's inputs.
  - This gradient becomes the input for the backpropagation step of the previous layer.
This elegant recursive process allows us to efficiently compute gradients for all parameters in the network, regardless of how many layers it has.
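To make this flow concrete before we build each piece, here is a compact sketch of one backpropagation pass written as a generic helper. The function name `backpropStep` and its parameters are illustrative; the concrete loss functions, layers, and `MLP` class follow in the rest of this lesson.

```javascript
// A generic backpropagation step: the four steps above, in code.
// `network` is assumed to expose a `layers` array whose elements have
// forward/backward methods, as in the MLP we build below.
function backpropStep(network, input, target, lossFn, lossDerivative) {
  const prediction = network.forward(input);       // 1. forward pass through all layers
  const loss = lossFn(target, prediction);         // 2. scalar loss (useful for monitoring)
  let grad = lossDerivative(target, prediction);   // 3. gradient of the loss w.r.t. the output
  for (let i = network.layers.length - 1; i >= 0; i--) {
    grad = network.layers[i].backward(grad);       // 4. propagate the gradient backward
  }
  return loss;
}
```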
Before we can backpropagate through our network, we need to calculate how the loss changes with respect to our network's output. This is the starting point for the backward pass.
We'll use the Mean Squared Error (MSE) loss function, which we covered in our first lesson. Let's implement both the loss function and its derivative using math.js:
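Here is a minimal sketch of how these two functions can be written with math.js. The names `mseLoss` and `mseLossDerivative`, and the assumption that `yTrue` and `yPred` are math.js matrices of shape `[batchSize, 1]`, are conventions used throughout this lesson; your exact implementation may differ.

```javascript
const math = require('mathjs');

// Mean Squared Error: average squared difference between predictions and targets
function mseLoss(yTrue, yPred) {
  const diff = math.subtract(yPred, yTrue);
  return math.mean(math.dotMultiply(diff, diff));
}

// Derivative of the MSE loss with respect to the predictions:
// dL/dyPred = 2 * (yPred - yTrue) / batchSize
function mseLossDerivative(yTrue, yPred) {
  const batchSize = yTrue.size()[0];               // number of samples in the batch
  const diff = math.subtract(yPred, yTrue);
  return math.divide(math.multiply(2, diff), batchSize);
}
```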
Let's understand the derivative calculation:
- The MSE loss is defined as $L = \frac{1}{N} \sum_{i=1}^{N} \left(y_{\text{pred},i} - y_{\text{true},i}\right)^2$, where $N$ is the batch size.
- To find $\frac{\partial L}{\partial y_{\text{pred},i}}$, we differentiate with respect to $y_{\text{pred},i}$: $\frac{\partial L}{\partial y_{\text{pred},i}} = \frac{2}{N} \left(y_{\text{pred},i} - y_{\text{true},i}\right)$.
- In JavaScript, we use `yTrue.size()[0]` to get the batch size $N$.
The division by batch size normalizes the gradient, which is important for consistent learning regardless of batch size. This derivative gives us the direction in which our predictions would need to change to decrease the loss, serving as the starting point for our backward pass through the network.
We've already seen and implemented the `MLP` class in the previous lessons. As a quick recap, the `MLP` class manages a list of layers, provides an `addLayer` method to build the network, and a `forward` method to pass data through each layer in sequence:
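Here is a recap sketch of that structure (your implementation from the previous lessons may differ in detail, but the shape is the same):

```javascript
class MLP {
  constructor() {
    this.layers = [];            // layers in order, from input side to output side
  }

  // Append a layer to the end of the network
  addLayer(layer) {
    this.layers.push(layer);
  }

  // Forward pass: feed the data through each layer in sequence
  forward(input) {
    let output = input;
    for (const layer of this.layers) {
      output = layer.forward(output);
    }
    return output;
  }
}
```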
This structure allows us to easily stack multiple layers and perform both forward and backward passes through the entire network.
With our understanding of the MLP structure and how backpropagation works layer by layer, we can now implement the `backward` method for the entire network. The method simply loops through the layers in reverse order, passing the gradient backward:
Notice that `backward(dLossWrtPrediction)` here is the MLP's backward method, the one that orchestrates the backward pass through all layers. Inside the loop, `this.layers[i].backward(currentDLoss)` calls the backward method of each individual layer (such as a `DenseLayer`'s backward).
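For reference, here is a simplified sketch of what such a layer-level `backward` can look like for a dense layer with ReLU or linear activation, following the conventions referenced in this lesson (`dWeights`, `dBiases`, math.js matrices). Treat it as one possible implementation rather than the exact code from the previous lesson:

```javascript
// Assumes `const math = require('mathjs')`, as in the loss snippet above.
class DenseLayer {
  constructor(inputSize, outputSize, activation = 'linear') {
    // Small random weights, zero biases; shapes: weights [in x out], biases [1 x out]
    this.weights = math.matrix(math.random([inputSize, outputSize], -0.5, 0.5));
    this.biases = math.matrix(math.zeros(1, outputSize));
    this.activation = activation;  // 'relu' or 'linear'
  }

  forward(input) {
    this.input = input;            // cached for the backward pass
    const batchSize = input.size()[0];
    // z = input * W + b (the bias row is repeated across the batch)
    this.z = math.add(
      math.multiply(input, this.weights),
      math.multiply(math.ones(batchSize, 1), this.biases)
    );
    return this.activation === 'relu'
      ? math.map(this.z, (v) => Math.max(0, v))
      : this.z;
  }

  backward(dLossWrtOutput) {
    // Gradient through the activation: ReLU' is 1 where z > 0 and 0 elsewhere; linear' is 1
    const activationGrad = this.activation === 'relu'
      ? math.map(this.z, (v) => (v > 0 ? 1 : 0))
      : math.map(this.z, () => 1);
    const dZ = math.dotMultiply(dLossWrtOutput, activationGrad);

    const batchSize = this.input.size()[0];
    // Parameter gradients, stored on the layer for the optimizer to use later
    this.dWeights = math.multiply(math.transpose(this.input), dZ);
    this.dBiases = math.multiply(math.ones(1, batchSize), dZ);   // sums dZ over the batch

    // Gradient of the loss w.r.t. this layer's input, passed to the previous layer
    return math.multiply(dZ, math.transpose(this.weights));
  }
}
```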
Let's break down how the `backward` method for the MLP works and why it's so effective:

- We start with `dLossWrtPrediction`, which is the gradient of the loss with respect to our network's output (calculated using our loss function derivative).
- We iterate through the layers in reverse order using a reverse for loop; this is the essence of backpropagation, as we're going backward through the network!
- For each layer, we call the layer's `backward` method, passing in the current gradient, and store the returned gradient (which is the gradient of the loss with respect to that layer's input). This returned gradient becomes the input for the next (previous) layer's backward pass.
- The process continues until we've propagated through all layers.
The beauty of this implementation lies in its simplicity and modularity. Each layer is responsible for computing its own gradients and passing the necessary information backward. This approach allows the error signal to flow backward through the network, with each layer computing its contribution to the overall gradient. As a result, the code remains clean, easy to understand, and scalable to networks of any depth.
Let's see how all these components work together in a complete example. We'll create a simple `MLP` with two layers, perform a forward pass, calculate the loss, and then analyze the results of backpropagation.

Below is a sample implementation. For this example, we assume you have already implemented the `DenseLayer` class from the previous lesson, and that each layer stores its gradients in `dWeights` and `dBiases` properties.
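One way such an example can look, reusing the `mseLoss`/`mseLossDerivative` functions and the `MLP`/`DenseLayer` sketches from earlier in this lesson (the input values, targets, and layer sizes here are illustrative, not necessarily the ones from the original example):

```javascript
// Build a small network: 3 inputs -> 4 hidden ReLU units -> 1 linear output
const mlp = new MLP();
mlp.addLayer(new DenseLayer(3, 4, 'relu'));
mlp.addLayer(new DenseLayer(4, 1, 'linear'));

// A tiny batch of two samples and their targets (illustrative values)
const X = math.matrix([
  [0.5, -0.2, 0.1],
  [0.9, 0.4, -0.3],
]);
const yTrue = math.matrix([
  [0.4],
  [0.6],
]);

// 1. Forward pass
const yPred = mlp.forward(X);
console.log('Predictions:', yPred.toString());

// 2. Loss
console.log('MSE loss:', mseLoss(yTrue, yPred));

// 3. Gradient of the loss with respect to the predictions
const dLossWrtPrediction = mseLossDerivative(yTrue, yPred);

// 4. Backward pass through the whole network
mlp.backward(dLossWrtPrediction);

// Inspect the gradients stored in each layer during backpropagation
mlp.layers.forEach((layer, i) => {
  console.log(`Layer ${i + 1} dWeights:`, layer.dWeights.toString());
  console.log(`Layer ${i + 1} dBiases:`, layer.dBiases.toString());
});
```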
Running this code prints the predictions, the loss, and the gradients stored in each layer after the backward pass. The exact numbers vary with the random weight initialization, but the overall pattern is consistent.
Let's analyze these gradients:
- **Layer 1 (ReLU)**: We see small but nonzero gradients for most weights and biases. These values tell us how each parameter should be adjusted to reduce the overall loss. Some values are zero, which is characteristic of the ReLU function's derivative (it is zero wherever the pre-activation input is negative).
- **Layer 2 (Linear)**: We see somewhat larger gradients, especially for the bias, which in a typical run has a gradient close to -1. This large negative gradient suggests that increasing this bias would substantially reduce the loss, which makes sense as our initial predictions were too low compared to the targets.
These gradients provide the information needed for gradient descent to update the network's parameters. The sign of the gradient (positive or negative) indicates the direction in which parameters should change to reduce the loss, while the magnitude suggests how much influence each parameter has.
Congratulations! You've now mastered one of the most fundamental algorithms in deep learning: backpropagation through a multi-layer neural network. We've extended single-layer backpropagation to work across an entire `MLP`, implemented the derivative of the `MSE` loss function, created a powerful `backward` method that orchestrates gradient flow, and analyzed the calculated gradients to understand how they guide parameter updates. This algorithm is the workhorse behind neural network training, efficiently computing gradients that drive the learning process.

In our upcoming practice exercises, you'll gain hands-on experience implementing and working with backpropagation in MLPs. This practice will solidify your understanding and prepare you for the last lesson in this course, where we'll build a complete training loop and explore the full Stochastic Gradient Descent optimization algorithm. The foundations you've built today are crucial stepping stones toward mastering deep learning.
