Introduction

Welcome back to "Training Neural Networks: The Backpropagation Algorithm"! You've made excellent progress so far, having learned about loss functions in our first lesson and gradient descent in our second. Today, we're diving into the heart of neural network training: backpropagation.

In our previous lesson, we explored how gradient descent updates weights by moving in the direction opposite to the gradient of the loss function. But we left an important question unanswered: How do we actually calculate these gradients in a neural network with multiple layers and thousands or even millions of parameters?

That's where backpropagation comes in. Backpropagation (short for "backward propagation of errors") is an efficient algorithm for computing these gradients. Today, we'll focus specifically on implementing the backward pass for a single dense layer, which will form the building block for training complete neural networks.

By the end of this lesson, you'll understand how to:

  • Calculate derivatives for different activation functions
  • Store necessary values during the forward pass
  • Implement the backward pass to calculate gradients
  • Connect these gradients to the gradient descent algorithm we learned previously

Let's embark on this crucial step in our neural network journey!

Understanding the Chain Rule for Backpropagation

Before diving into code, let's build some intuition about how backpropagation works. The core mathematical principle behind backpropagation is the chain rule from calculus.

The chain rule allows us to calculate the derivative of composite functions. In simple terms, if we have a function f(g(x)), the derivative with respect to x is:

\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}
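
In a dense layer, the loss L depends on a weight only through the pre-activation output z (the weighted sum plus bias) and the activation output a, so the same rule chains three factors together:

\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}

This is exactly the decomposition the backward pass below computes, one factor at a time.
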
Activation Functions and Their Derivatives

Let's start implementing backpropagation by first defining our activation functions and their derivatives. These are crucial because the derivative of the activation function (\frac{da}{dz}) is a key component in our chain rule calculations.

Now, let's define the activation functions and their derivatives:
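
Here is a minimal sketch of what these definitions might look like in JavaScript, using sigmoid (which the worked example later in this lesson relies on) and, as an additional assumption, ReLU:

```javascript
// Sigmoid squashes any real number into the range (0, 1).
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// The sigmoid derivative can be written in terms of the sigmoid output itself:
// sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
function sigmoidDerivative(z) {
  const s = sigmoid(z);
  return s * (1 - s);
}

// ReLU passes positive values through and zeroes out negatives.
// (Included here as an assumed second activation; the example below uses sigmoid.)
function relu(z) {
  return Math.max(0, z);
}

// ReLU's derivative is a step function: 1 for positive inputs, 0 otherwise.
function reluDerivative(z) {
  return z > 0 ? 1 : 0;
}
```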

Let's analyze each activation function and its derivative:
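
A couple of points are worth noting. The sigmoid derivative, sigmoid'(z) = sigmoid(z) · (1 - sigmoid(z)), can be computed from values we already evaluate during the forward pass, and it peaks at 0.25 when z = 0, which is why the example later in this lesson produces fairly small gradients. The ReLU derivative (the assumed second activation above) is just a step function, 1 for positive inputs and 0 otherwise, which makes it very cheap to evaluate.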

The DenseLayer Class Structure

Now, let's look at the structure of our DenseLayer class. This class encapsulates both the forward pass (which we've seen in previous lessons) and the backward pass (which we're focusing on today).
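
Here is a minimal sketch of how such a class might be laid out, assuming a constructor signature of (numInputs, numNeurons, activationFn, activationDerivativeFn), small random weight initialization, and the activation helpers sketched above; the forward and backward bodies are filled in over the next sections:

```javascript
class DenseLayer {
  constructor(numInputs, numNeurons, activationFn, activationDerivativeFn) {
    // Weights: one row per input feature, one column per neuron,
    // initialized to small random values close to zero.
    this.weights = Array.from({ length: numInputs }, () =>
      Array.from({ length: numNeurons }, () => (Math.random() - 0.5) * 0.1)
    );
    // One bias per neuron, initialized to zero.
    this.biases = new Array(numNeurons).fill(0);

    // The activation function and its derivative, both needed for training.
    this.activationFn = activationFn;
    this.activationDerivativeFn = activationDerivativeFn;

    // Values cached during the forward pass for use in the backward pass.
    this.inputs = null;  // x
    this.z = null;       // pre-activation outputs
    this.output = null;  // activation outputs

    // Gradients computed during the backward pass.
    this.dWeights = null;
    this.dBiases = null;
  }

  // forward(inputs) and backward(dLossWrtLayerOutput) are shown
  // in the sections that follow.
}
```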

The key added components to notice:

  1. Activation Derivatives: We store not only the chosen activationFn but also its activationDerivativeFn, so the backward pass can evaluate \frac{da}{dz}.
  2. Cached Forward Values: this.inputs, this.z, and this.output are kept after each forward pass, because the backward pass needs them to compute gradients.
  3. Gradient Storage: the backward pass fills in dWeights and dBiases, which the gradient descent step we covered previously will use to update the parameters.
The Forward Pass: Setting Up for Backpropagation

The forward pass not only computes the layer's output but also stores the necessary values for the backward pass. Let's examine the forward method implementation:
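
Here is a minimal sketch of how such a forward method might look, assuming a single input sample stored as a flat array and the class fields sketched in the previous section:

```javascript
// Inside the DenseLayer class sketched above:
forward(inputs) {
  // 1. Cache the inputs; the backward pass needs them to compute dWeights.
  this.inputs = inputs;

  // 2. Pre-activation output z = W·x + b, one value per neuron.
  const numNeurons = this.biases.length;
  this.z = new Array(numNeurons).fill(0);
  for (let j = 0; j < numNeurons; j++) {
    let sum = this.biases[j];
    for (let i = 0; i < inputs.length; i++) {
      sum += inputs[i] * this.weights[i][j];
    }
    this.z[j] = sum;
  }

  // 3. Apply the activation function element-wise and cache the result.
  this.output = this.z.map((zj) => this.activationFn(zj));

  // 4. Return the activations for the next layer (or the loss function).
  return this.output;
}
```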

This method:

  1. Stores the input values in this.inputs
  2. Calculates and stores the pre-activation outputs this.z (the weighted sum plus bias)
  3. Applies the activation function and stores the results in this.output
  4. Returns the output for use in subsequent layers

The key insight here is that we're caching all the intermediate values we'll need for the backward pass. This is essential for efficient computation of gradients during backpropagation.

As you may recall from our previous lessons, this is how information flows forward through the network. Now, let's see how errors flow backward during the backpropagation process.

The Backward Pass: Calculating Gradients

Now we come to the heart of backpropagation: the backward method. This method calculates the gradients that will be used to update the weights during gradient descent.
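
Here is a minimal sketch of one way such a backward method could be written for the class sketched above, under the same single-sample, flat-array assumptions:

```javascript
// Inside the DenseLayer class sketched above:
backward(dLossWrtLayerOutput) {
  const numInputs = this.weights.length;
  const numNeurons = this.biases.length;

  // 1. dL/dz = dL/da * da/dz (element-wise chain rule through the activation).
  const dLossWrtZ = this.z.map(
    (zj, j) => dLossWrtLayerOutput[j] * this.activationDerivativeFn(zj)
  );

  // 2. dL/dW[i][j] = input[i] * dL/dz[j], since z[j] = sum_i input[i] * W[i][j] + b[j].
  this.dWeights = Array.from({ length: numInputs }, (_, i) =>
    Array.from({ length: numNeurons }, (_, j) => this.inputs[i] * dLossWrtZ[j])
  );

  // 3. dL/db[j] = dL/dz[j], because each bias feeds directly into its neuron's z[j].
  this.dBiases = dLossWrtZ.slice();

  // 4. dL/dx[i] = sum_j W[i][j] * dL/dz[j], passed back to the previous layer.
  const dLossWrtInput = new Array(numInputs).fill(0);
  for (let i = 0; i < numInputs; i++) {
    for (let j = 0; j < numNeurons; j++) {
      dLossWrtInput[i] += this.weights[i][j] * dLossWrtZ[j];
    }
  }
  return dLossWrtInput;
}
```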

Let's break down what's happening:

  1. Input: dLossWrtLayerOutput is the gradient of the loss with respect to this layer's output. For the output layer, this comes directly from the loss function. For hidden layers, it's passed backward from the next layer.

  2. Gradient w.r.t. pre-activation output: We first calculate \frac{dL}{dz} by applying the chain rule: \frac{dL}{dz} = \frac{dL}{da} \cdot \frac{da}{dz}, where a is the activation output and z is the pre-activation output.

  3. Gradient w.r.t. weights: Since z = Wx + b, the gradient for each weight is the corresponding input value multiplied by \frac{dL}{dz}; these values are stored in dWeights.

  4. Gradient w.r.t. biases: Each bias feeds directly into its neuron's pre-activation output, so its gradient is simply \frac{dL}{dz}; these values are stored in dBiases.

  5. Gradient w.r.t. inputs: Finally, we compute \frac{dL}{dx} = W^T \frac{dL}{dz} and return it as dLossWrtInput, so the previous layer can continue the backward pass.

Backpropagation in Action: A Practical Example

Let's now see how our backpropagation implementation works in practice with a simple example:
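
Here is a minimal sketch of what that example might look like, reusing the DenseLayer class and sigmoid helpers sketched earlier:

```javascript
// Create a layer with 2 inputs and 3 neurons, using sigmoid activation.
const layer = new DenseLayer(2, 3, sigmoid, sigmoidDerivative);

// Forward pass with a single sample that has two features.
const input = [0.5, -0.2];
const output = layer.forward(input);
console.log("Forward output:", output);

// Dummy gradient of the loss w.r.t. each of the 3 output neurons,
// standing in for what the next layer (or the loss function) would send back.
const dummyGradient = [0.1, -0.2, 0.05];

// Backward pass: computes dWeights, dBiases, and the gradient w.r.t. the inputs.
const dLossWrtInput = layer.backward(dummyGradient);

console.log("dWeights:", layer.dWeights);
console.log("dBiases:", layer.dBiases);
console.log("dLossWrtInput:", dLossWrtInput);
```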

This example:

  1. Creates a single DenseLayer with 2 inputs and 3 neurons
  2. Performs a forward pass with a sample input
  3. Simulates receiving gradients from the next layer using a dummy gradient
  4. Performs a backward pass using this gradient
  5. Prints the computed gradients for weights, biases, and inputs
Output Discussion

When we run this code, we get output similar to the following:
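
Because the weights are initialized randomly, the exact numbers differ from run to run. As a rough illustration, if we assume the weights are essentially zero (so every pre-activation z is about 0, each sigmoid output is about 0.5, and each sigmoid derivative is about 0.25), the printed values would look approximately like this:

```
Forward output: [ ~0.5, ~0.5, ~0.5 ]
dWeights:       [ [ ~0.0125, ~-0.025, ~0.00625 ],
                  [ ~-0.005, ~0.01,   ~-0.0025 ] ]
dBiases:        [ ~0.025, ~-0.05, ~0.0125 ]
dLossWrtInput:  [ ~0, ~0 ]
```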

Looking at this output:

  1. Our input is a single sample with two features: [0.5, -0.2]
  2. The forward pass produces outputs around 0.5 (since our weights are initialized close to zero, the sigmoid of values near zero is about 0.5)
  3. We provide a dummy gradient [0.1, -0.2, 0.05] representing how the loss would change if each output neuron's value changed slightly
  4. The backward pass calculates:
    • Gradients for each weight (dWeights)
    • Gradients for each bias (dBiases)
    • Gradients to pass to the previous layer (dLossWrtInput)

This example demonstrates the full cycle of forward and backward passes for a single layer. In a complete neural network, we would perform this process for each layer, starting from the output and working backward (hence the name "backpropagation").

Conclusion and Next Steps

Congratulations! You've now mastered one of the most fundamental algorithms in deep learning: backpropagation for a single dense layer. The chain rule has empowered us to efficiently calculate gradients through a network, while our careful implementation of activation functions and their derivatives has given us the building blocks for neural network learning. Our layer's forward pass not only computes outputs but also strategically caches values needed for the backward pass, which then efficiently computes the gradients that power the learning process.

In our upcoming practice exercises, you'll gain hands-on experience with backpropagation and see how these gradients drive the learning process in neural networks. After solidifying these concepts through practice, we'll expand this foundation to implement backpropagation for entire multi-layer networks and explore more advanced optimization techniques to enhance our models' performance.
