Introduction

Welcome back to "Training Neural Networks: The Backpropagation Algorithm"! You've made excellent progress so far, having learned about loss functions in our first lesson and gradient descent in our second. Today, we're diving into the heart of neural network training: backpropagation.

In our previous lesson, we explored how gradient descent updates weights by moving in the direction opposite to the gradient of the loss function. But we left an important question unanswered: How do we actually calculate these gradients in a neural network with multiple layers and thousands or even millions of parameters?

That's where backpropagation comes in. Backpropagation (short for "backward propagation of errors") is an efficient algorithm for computing these gradients. Today, we'll focus specifically on implementing the backward pass for a single dense layer, which will form the building block for training complete neural networks.

By the end of this lesson, you'll understand how to:

  • Calculate derivatives for different activation functions
  • Store necessary values during the forward pass
  • Implement the backward pass to calculate gradients
  • Connect these gradients to the gradient descent algorithm we learned previously

Let's embark on this crucial step in our neural network journey!

Understanding the Chain Rule for Backpropagation

Before diving into code, let's build some intuition about how backpropagation works. The core mathematical principle behind backpropagation is the chain rule from calculus.

The chain rule allows us to calculate the derivative of composite functions. In simple terms, if we have a function $f(g(x))$, the derivative with respect to $x$ is:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In neural networks, we have many nested functions. Consider a simple example with one dense layer:

  1. We multiply inputs by weights and add a bias: $z = x \cdot w + b$
  2. We apply an activation function: $a = \sigma(z)$
  3. We calculate a loss: $L = \text{loss}(a, y_{\text{true}})$

If we want to find how the loss changes with respect to the weights ($\frac{dL}{dw}$), we apply the chain rule:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$

Backpropagation gets its name because we start at the output (the loss) and work backward through the network, calculating these gradients layer by layer. This is much more efficient than trying to directly compute the gradient of the loss with respect to each parameter.

For our implementation, we'll use matrix operations to handle batches of data efficiently, but the underlying principle remains the same.
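
To make the chain rule concrete, here is a tiny numerical sketch that compares the chain-rule gradient with a finite-difference estimate. The specific numbers, the sigmoid activation, and the squared-error loss are illustrative choices, not part of the lesson's code:

```python
import numpy as np

# Illustrative scalar example: one input, one weight, one bias,
# sigmoid activation, squared-error loss.
x, w, b, y_true = 0.5, 0.8, 0.1, 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    a = sigmoid(x * w + b)       # forward pass
    return (a - y_true) ** 2     # squared error

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
a = sigmoid(x * w + b)
dL_da = 2 * (a - y_true)
da_dz = a * (1 - a)
dz_dw = x
chain_rule_grad = dL_da * da_dz * dz_dw

# Finite-difference check of the same derivative.
eps = 1e-6
numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(chain_rule_grad, numeric_grad)  # the two values should agree closely
```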

Activation Functions and Their Derivatives

Let's start implementing backpropagation by first defining our activation functions and their derivatives. These are crucial because the derivative of the activation function ($\frac{da}{dz}$) is a key component in our chain rule calculations.
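
A minimal NumPy sketch of these three activations and their derivatives might look like the following (the exact function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(sigmoid_output):
    # Takes the sigmoid *output*, so the derivative is simply s * (1 - s).
    return sigmoid_output * (1.0 - sigmoid_output)

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where the input is positive, 0 elsewhere.
    return np.where(x > 0, 1.0, 0.0)

def linear(x):
    return x

def linear_derivative(x):
    # The identity function has derivative 1 everywhere.
    return np.ones_like(x)
```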

Let's analyze each activation function and its derivative:

  1. Sigmoid:

    • The function is $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Its derivative is $\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$
    • Note that we pass the output of the sigmoid function to calculate its derivative, as this is computationally efficient
  2. ReLU (Rectified Linear Unit):

    • The function is $\text{ReLU}(x) = \max(0, x)$
    • Its derivative is 1 if $x > 0$ and 0 otherwise
    • We use NumPy's where function to efficiently compute this
  3. Linear:

    • The function is simply $f(x) = x$
    • Its derivative is always 1
    • We use ones_like to create an array of ones with the same shape as the input

These activation functions and their derivatives are essential building blocks for both the forward and backward passes in our neural network layer.

The DenseLayer Class Structure

Now, let's look at the structure of our DenseLayer class. This class encapsulates both the forward pass (which we've seen in previous lessons) and the backward pass (which we're focusing on today).
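
A sketch of the class and its constructor might look like this; the small random weight initialization and selecting the activation by name are assumptions on our part:

```python
class DenseLayer:
    def __init__(self, n_inputs, n_neurons, activation="sigmoid"):
        # Small random weights and zero biases.
        self.weights = np.random.randn(n_inputs, n_neurons) * 0.01
        self.biases = np.zeros((1, n_neurons))

        # Pair each activation function with its derivative.
        pairs = {
            "sigmoid": (sigmoid, sigmoid_derivative),
            "relu": (relu, relu_derivative),
            "linear": (linear, linear_derivative),
        }
        self.activation_fn, self.activation_derivative_fn = pairs[activation]

        # Caches filled in during the forward and backward passes.
        self.inputs = None      # layer inputs X
        self.z = None           # pre-activation outputs XW + b
        self.output = None      # post-activation outputs
        self.d_weights = None   # dL/dW
        self.d_biases = None    # dL/db
```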

The key added components to notice:

  1. Activation Derivatives: We store not only the chosen activation_fn but also its activation_derivative_fn.
  2. Caching Variables:
    • self.inputs will store the inputs to the layer;
    • self.z will store the pre-activation outputs;
    • self.output will store the post-activation outputs;
    • self.d_weights and self.d_biases will store the gradients of weights and biases.

This caching of intermediate values is crucial for backpropagation. We need to know these values during the backward pass to correctly compute the gradients.

The Forward Pass: Setting Up for Backpropagation

The forward pass not only computes the layer's output but also stores the necessary values for the backward pass. Let's examine the forward method implementation:
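
Continuing the DenseLayer sketch above, a forward method along these lines might look like this:

```python
    def forward(self, inputs):
        # Cache the inputs; they are needed to compute dL/dW in the backward pass.
        self.inputs = inputs
        # Pre-activation output: weighted sum plus bias.
        self.z = np.dot(inputs, self.weights) + self.biases
        # Post-activation output, also cached for the backward pass.
        self.output = self.activation_fn(self.z)
        return self.output
```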

This method:

  1. Stores the input values in self.inputs
  2. Calculates and stores the pre-activation outputs self.z (the weighted sum plus bias)
  3. Applies the activation function and stores the results in self.output
  4. Returns the output for use in subsequent layers

The key insight here is that we're caching all the intermediate values we'll need for the backward pass. This is essential for efficient computation of gradients during backpropagation.

As you may recall from our previous lessons, this is how information flows forward through the network. Now, let's see how errors flow backward during the backpropagation process.

The Backward Pass: Calculating Gradients

Now we come to the heart of backpropagation: the backward method. This method calculates the gradients that will be used to update the weights during gradient descent.
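
Continuing the sketch, a backward method that follows the steps broken down below might look like this (here the activation derivative is evaluated on the cached activation output, which is valid for all three activations defined above):

```python
    def backward(self, d_loss_wrt_layer_output):
        # dL/dz = dL/dy * dy/dz (element-wise), where y is the activation output.
        d_loss_wrt_z = d_loss_wrt_layer_output * self.activation_derivative_fn(self.output)

        # dL/dW = X^T . dL/dz
        self.d_weights = np.dot(self.inputs.T, d_loss_wrt_z)
        # dL/db: sum dL/dz over the batch dimension (since dz/db = 1).
        self.d_biases = np.sum(d_loss_wrt_z, axis=0, keepdims=True)
        # dL/dX = dL/dz . W^T, passed backward to the previous layer.
        d_loss_wrt_input = np.dot(d_loss_wrt_z, self.weights.T)
        return d_loss_wrt_input
```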

Let's break down what's happening:

  1. Input: d_loss_wrt_layer_output is the gradient of the loss with respect to this layer's output. For the output layer, this comes directly from the loss function. For hidden layers, it's passed backward from the next layer.

  2. Gradient w.r.t. pre-activation output: We first calculate $\frac{dL}{dz}$ by applying the chain rule: $\frac{dL}{dz} = \frac{dL}{dy} \cdot \frac{dy}{dz}$, where $y$ is the activation output and $z$ is the pre-activation output.

  3. Gradient w.r.t. weights: We calculate $\frac{dL}{dW}$ using the chain rule again: $\frac{dL}{dW} = \frac{dL}{dz} \cdot \frac{dz}{dW}$. Since $z = X \cdot W + b$, we have $\frac{dz}{dW} = X^T$, leading to $\frac{dL}{dW} = X^T \cdot \frac{dL}{dz}$.

  4. Gradient w.r.t. biases: Similarly, $\frac{dL}{db} = \frac{dL}{dz} \cdot \frac{dz}{db}$. Since $\frac{dz}{db} = 1$, we sum $\frac{dL}{dz}$ over all samples.

  5. Gradient w.r.t. inputs: Finally, we calculate $\frac{dL}{dX}$ to pass backward to the previous layer: $\frac{dL}{dX} = \frac{dL}{dz} \cdot \frac{dz}{dX}$. Since $\frac{dz}{dX} = W^T$, we get $\frac{dL}{dX} = \frac{dL}{dz} \cdot W^T$.

Each of these calculations is a direct application of the chain rule, and together they form the core of the backpropagation algorithm.

Backpropagation in Action: A Practical Example

Let's now see how our backpropagation implementation works in practice with a simple example:
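
A sketch of such an example, using the DenseLayer outlined above (the random seed is an illustrative choice for reproducibility):

```python
np.random.seed(42)  # for reproducible weight initialization

# A single dense layer: 2 inputs, 3 neurons, sigmoid activation.
layer = DenseLayer(n_inputs=2, n_neurons=3, activation="sigmoid")

# One sample with two features.
X = np.array([[0.5, -0.2]])

# Forward pass.
output = layer.forward(X)
print("Input:", X)
print("Output:", output)

# Dummy gradient of the loss w.r.t. this layer's output,
# standing in for what the loss function (or next layer) would send back.
d_loss_wrt_output = np.array([[0.1, -0.2, 0.05]])

# Backward pass.
d_loss_wrt_input = layer.backward(d_loss_wrt_output)

print("d_weights:\n", layer.d_weights)
print("d_biases:\n", layer.d_biases)
print("d_loss_wrt_input:\n", d_loss_wrt_input)
```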

This example:

  1. Creates a single DenseLayer with 2 inputs and 3 neurons
  2. Performs a forward pass with a sample input
  3. Simulates receiving gradients from the next layer using a dummy gradient
  4. Performs a backward pass using this gradient
  5. Prints the computed gradients for weights, biases, and inputs

Output Discussion

When we run this code, it prints the input, the layer's forward-pass output, and the gradients computed by the backward pass; the exact numbers depend on the random weight initialization.

Looking at this output:

  1. Our input is a single sample with two features: [0.5, -0.2]
  2. The forward pass produces outputs around 0.5 (since our weights are initialized close to zero, the sigmoid of values near zero is about 0.5)
  3. We provide a dummy gradient [0.1, -0.2, 0.05] representing how the loss would change if each output neuron's value changed slightly
  4. The backward pass calculates:
    • Gradients for each weight (d_weights)
    • Gradients for each bias (d_biases)
    • Gradients to pass to the previous layer (d_loss_wrt_input)

This example demonstrates the full cycle of forward and backward passes for a single layer. In a complete neural network, we would perform this process for each layer, starting from the output and working backward (hence the name "backpropagation").

Conclusion and Next Steps

Congratulations! You've now mastered one of the most fundamental algorithms in deep learning: backpropagation for a single dense layer. The chain rule has empowered us to efficiently calculate gradients through a network, while our careful implementation of activation functions and their derivatives has given us the building blocks for neural network learning. Our layer's forward pass not only computes outputs but also strategically caches values needed for the backward pass, which then efficiently computes the gradients that power the learning process.

In our upcoming practice exercises, you'll gain hands-on experience with backpropagation and see how these gradients drive the learning process in neural networks. After solidifying these concepts through practice, we'll expand this foundation to implement backpropagation for entire multi-layer networks and explore more advanced optimization techniques to enhance our models' performance.
