Welcome back to our course, "Training Neural Networks: the Backpropagation Algorithm"! This is our fifth and final lesson in this course, and we've come a long way. So far, we've explored loss functions, gradient descent, and implemented backpropagation for both individual layers and complete networks.
In our previous lesson, we implemented the backward pass for a Multi-Layer Perceptron, allowing us to calculate the gradients of all weights and biases with respect to the loss. However, calculating gradients is only half the story. These gradients tell us which direction to move, but we still need to actually move in that direction!
Today, we'll complete our neural network training journey by implementing Stochastic Gradient Descent (SGD), which will use the gradients to update our network's parameters. We'll also create a complete training loop and apply our knowledge to train a model on a real-world regression task.
This represents the culmination of everything we've learned so far. After this lesson, you'll be ready to move on to our fourth and final course, "Building and Applying Your Neural Network Library"!
Stochastic Gradient Descent (SGD) is the foundational algorithm that powers most neural network training. In an earlier lesson, we explored basic gradient descent, also called Batch Gradient Descent, where we minimized a simple quadratic function by computing the gradient over the entire dataset and updating the parameters accordingly. However, as datasets grow larger and models become more complex, this approach becomes computationally expensive and slow.
To address this, several variants of gradient descent have been developed:
- Batch Gradient Descent: Computes gradients using the entire dataset for each update. While this provides the most accurate gradient direction, it is often too slow and memory-intensive for large datasets.
- Stochastic Gradient Descent (SGD): Updates parameters using only a single randomly chosen data point at each step. This allows for very frequent updates and can help the model escape local minima, but introduces a lot of noise into the updates.
- Mini-batch Gradient Descent: The most common or "default" approach in deep learning. Instead of using the whole dataset or a single data point, it updates parameters using small, randomly selected batches of data (e.g., 32 or 64 samples at a time). This strikes a balance between computational efficiency and the stability of gradient estimates.
The key idea behind these variants is how much data is used to estimate the gradient at each update step. Using smaller subsets (mini-batches) introduces randomness, or "stochasticity," into the training process. This randomness provides several benefits:
- Computational efficiency: Processing small batches requires less memory and allows for faster updates.
- Faster convergence: Weights are updated more frequently, so learning can progress more quickly.
- Ability to escape local minima: The noise in gradient estimates can help the model avoid getting stuck in shallow local minima.
In modern neural network training, mini-batch SGD is the standard, as it leverages the strengths of both batch and stochastic approaches and is well-suited to parallel computation on hardware like GPUs.
Here's the pseudocode for mini-batch SGD, adapted to our style:
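A minimal sketch, assuming a generic `model` with forward and backward passes and a mean squared error loss:

```text
initialize the network's weights randomly
set the learning rate η

for each epoch:
    shuffle the training data

    for each mini-batch (X_batch, y_batch):
        y_pred = model.forward(X_batch)        # forward pass
        loss   = MSE(y_pred, y_batch)          # loss calculation
        grad   = dLoss / dy_pred               # loss derivative
        model.backward(grad)                   # backward pass
        for each parameter θ with gradient g:  # parameter update
            θ = θ - η * g
```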
Let's discuss the key steps:
- Initialization: Start by randomly initializing the network's weights.
- Learning Rate: Set the learning rate η, which determines the size of each update step.
- Epoch Loop: For each epoch (a full pass through the dataset), shuffle the data to ensure randomness.
- Mini-batch Loop: Split the data into mini-batches and process each batch in turn.
- Forward Pass: Compute the model's predictions for the current mini-batch.
- Loss Calculation: Measure how far off the predictions are from the true values.
- Backward Pass: Compute the gradients of the loss with respect to each parameter.
- Parameter Update: Adjust the weights and biases in the direction that reduces the loss, scaled by the learning rate.
By following this cycle of forward and backward passes, combined with parameter updates, SGD enables neural networks to learn from data in a scalable and efficient way.
Let's implement SGD as a class that works with our `DenseLayer` and math.js matrices:
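A minimal sketch of this optimizer is shown below; the property names `weights`, `biases`, `gradWeights`, and `gradBiases` are assumptions about how our `DenseLayer` stores its parameters and gradients, so adjust them to match your own layer class:

```javascript
const math = require('mathjs');

// A minimal SGD optimizer (sketch). The layer property names used here
// (weights, biases, gradWeights, gradBiases) are assumptions and may
// differ from your DenseLayer implementation.
class SGD {
  constructor(learningRate = 0.01) {
    this.learningRate = learningRate;
  }

  update(layer) {
    // Skip layers whose gradients haven't been computed yet.
    if (!layer.gradWeights || !layer.gradBiases) {
      console.warn('SGD.update: layer has no gradients; did you run the backward pass?');
      return;
    }

    // Core update rule: parameter -= learningRate * gradient
    layer.weights = math.subtract(
      layer.weights,
      math.multiply(this.learningRate, layer.gradWeights)
    );
    layer.biases = math.subtract(
      layer.biases,
      math.multiply(this.learningRate, layer.gradBiases)
    );
  }
}
```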
This implementation is straightforward but powerful:
- We initialize the optimizer with a specified learning rate (defaulting to 0.01 if none is provided).
- The `update` method takes a layer as input and applies the SGD update rule to its weights and biases using math.js for element-wise operations.
- We check if the gradients exist before attempting to use them, providing a warning if they're missing.
- The core update rule is simply `parameter -= learningRate * gradient`, applied to both weights and biases.
Each layer calculates and stores its own gradients during the backward pass, and the optimizer simply uses these stored gradients to update the parameters. This separation of concerns creates a clean architecture that can be easily extended to more sophisticated optimization algorithms in the future.
Before we implement our training loop, let's discuss the dataset we'll be using. The diabetes dataset is a commonly used regression dataset, and it's perfect for testing our neural network implementation.
The dataset contains measurements for 442 patients with diabetes, along with a quantitative measure of disease progression one year after baseline. It includes 10 features, such as age, sex, body mass index (BMI), average blood pressure (BP), and six blood serum measurements.
To load and process our CSV data, we'll introduce a new library: PapaParse, a fast and powerful CSV parser for JavaScript. If you're following along in the CodeSignal IDE, PapaParse is already installed; if you need to install it manually, you can run:
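```bash
npm install papaparse
```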
We'll also use Node's built-in `fs` module to read files from disk.
I will provide you with a sample CSV file (`data/train.csv`) to use for this lesson. We'll load it using the `papaparse` and `fs` modules, and filter out any rows with missing targets to ensure clean data.
Here's how to load and prepare the dataset:
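A sketch of this loading step, assuming the label column in `data/train.csv` is named `target`:

```javascript
const fs = require('fs');
const Papa = require('papaparse');

// Read and parse the CSV file.
const csvText = fs.readFileSync('data/train.csv', 'utf8');
const parsed = Papa.parse(csvText, {
  header: true,          // use the first row as column names
  dynamicTyping: true,   // convert numeric strings to numbers
  skipEmptyLines: true
});

// Keep only rows that actually have a target value.
const rows = parsed.data.filter(
  (row) => row.target !== null && row.target !== undefined && row.target !== ''
);

// Feature names are all columns except 'target'.
const featureNames = Object.keys(rows[0]).filter((name) => name !== 'target');

// X_train: 2D array of features; y_train: column vector of targets.
const X_train = rows.map((row) => featureNames.map((name) => row[name]));
const y_train = rows.map((row) => [row.target]);

const numSamples = X_train.length;
const numFeatures = featureNames.length;
console.log(`Loaded ${numSamples} samples with ${numFeatures} features.`);
```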
Explanation:
- We read the CSV file and parse it with PapaParse.
- We filter out any rows where the target is missing.
- We extract the feature names (all columns except 'target').
- We build `X_train` as a 2D array of features and `y_train` as a column vector of targets.
- We also determine the number of samples and features for later use.
With our SGD optimizer and dataset ready, let's implement a basic training loop to tie everything together. The training loop is the heart of neural network learning, orchestrating the forward pass, loss calculation, backward pass, and parameter updates.
Below is the structure of a typical training loop using mini-batch SGD in JavaScript, closely following the pseudocode we walked through earlier:
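The sketch below assumes an `MLP` constructor that takes an array of layers and a `DenseLayer(inputSize, outputSize, activation)` signature like the one from our earlier lessons; adapt the names to your own classes as needed.

```javascript
// Build the network: 10-neuron ReLU hidden layer, 1-neuron linear output layer.
const mlp = new MLP([
  new DenseLayer(numFeatures, 10, 'relu'),
  new DenseLayer(10, 1, 'linear')
]);

const optimizer = new SGD(0.002);  // small learning rate for stable training

const epochs = 100;
const batchSize = 32;

for (let epoch = 0; epoch < epochs; epoch++) {
  // Shuffle the sample indices so each epoch sees the data in a new order.
  const indices = Array.from({ length: numSamples }, (_, i) => i);
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }

  let epochLoss = 0;
  let numBatches = 0;

  // Walk through the shuffled data in mini-batches.
  for (let start = 0; start < numSamples; start += batchSize) {
    const batchIdx = indices.slice(start, start + batchSize);
    const X_batch = batchIdx.map((i) => X_train[i]);
    const y_batch = batchIdx.map((i) => y_train[i]);

    // ... forward pass, loss, backward pass, and updates go here (next section)
  }
}
```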
Analysis of this initial part of our training loop:
- MLP Construction: We first create our MLP with two layers: a hidden layer with 10 neurons and ReLU activation, and an output layer with 1 neuron (for regression) and linear activation.
- Optimizer: We initialize our SGD optimizer with a learning rate of 0.002, which is relatively small to ensure stable training.
- Hyperparameters: We set up hyperparameters: 100 epochs and a batch size of 32.
- Shuffling: For each epoch, we shuffle the data to prevent the model from learning patterns based on data order.
- Mini-batch Processing: We then process the data in mini-batches, selecting subsets of the training data for each update.
The shuffling step is crucial in stochastic gradient descent. It ensures that each epoch presents the data in a different order, which helps prevent the model from getting stuck in local minima and improves generalization.
Now let's complete our training loop by implementing the forward pass, backward pass, and parameter updates for each mini-batch. This is the core of the learning process:
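The sketch below fills in the body of the mini-batch loop; it assumes `mlp.forward`, `mlp.backward`, and an `mlp.layers` array, plus `const math = require('mathjs')` at the top of the file.

```javascript
// Inside the mini-batch loop from the previous snippet:

// 1. Forward pass: compute predictions for the current mini-batch.
const y_pred = mlp.forward(X_batch);

// 2. Loss calculation: mean squared error between predictions and targets.
const loss = math.mean(math.dotPow(math.subtract(y_pred, y_batch), 2));

// 3. Loss derivative: dMSE/dy_pred = 2 * (y_pred - y_batch) / n
const n = y_batch.length;
const dLoss = math.multiply(2 / n, math.subtract(y_pred, y_batch));

// 4. Backward pass: propagate the gradient through every layer.
mlp.backward(dLoss);

// 5. Parameter updates: let the optimizer adjust each layer's weights and biases.
for (const layer of mlp.layers) {
  optimizer.update(layer);
}

// Accumulate the loss so we can report an average at the end of the epoch.
epochLoss += loss;
numBatches++;
```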
Step-by-step review of each mini-batch:
- Forward pass: The MLP processes the current mini-batch of input data (`X_batch`) and produces predictions (`y_pred`).
- Loss calculation: The mean squared error (MSE) loss is computed between the predictions and the true targets (`y_batch`). This quantifies how far off the model's predictions are.
- Loss derivative: We calculate the derivative of the loss with respect to the predictions. This tells us how to change the predictions to reduce the loss.
- Backward pass: The gradient of the loss is propagated backward through the network using backpropagation. This computes the gradients of the loss with respect to each parameter (weights and biases) in the network.
- Parameter updates: The SGD optimizer uses the computed gradients to update each layer's weights and biases, moving them in the direction that reduces the loss.
After processing all mini-batches, we calculate the average loss for the epoch and periodically print it to track training progress. This allows us to monitor whether the model is learning effectively.
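For example, the end of the epoch loop might log progress every ten epochs (a sketch using the `epochLoss` and `numBatches` accumulators from above):

```javascript
// After the mini-batch loop, still inside the epoch loop:
const avgLoss = epochLoss / numBatches;
if ((epoch + 1) % 10 === 0) {
  console.log(`Epoch ${epoch + 1}/${epochs} - average MSE: ${avgLoss.toFixed(4)}`);
}
```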
Summary:
The training loop is the orchestrator that brings together all the components we've built throughout this course: the layer structure, forward propagation, loss functions, backpropagation, and now parameter updates through SGD. This elegant cycle of computation and learning is what gives neural networks their power.
After training, we evaluate the model and print some predictions:
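A sketch of this evaluation step, reusing the same assumed `mlp.forward` method and math.js helpers, and assuming the forward pass returns a nested array:

```javascript
// Evaluate the trained model on the full training set.
const finalPred = mlp.forward(X_train);
const finalLoss = math.mean(math.dotPow(math.subtract(finalPred, y_train), 2));
console.log(`Final training MSE: ${finalLoss.toFixed(4)}`);

// Print a few sample predictions next to the true targets for a sanity check.
for (let i = 0; i < 5; i++) {
  console.log(`Sample ${i}: predicted = ${finalPred[i][0].toFixed(1)}, actual = ${y_train[i][0]}`);
}
```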
Explanation:
- After training, we run a forward pass on the entire training set to get the final predictions.
- We compute the final MSE loss to see how well the model fits the training data.
- We print a few sample predictions alongside the true values for a quick qualitative check. The exact numbers you see will vary with the random weight initialization, but the output will follow the format of the `console.log` calls above.
Congratulations! You've completed the fifth and final lesson in our course on "Training Neural Networks: the Backpropagation Algorithm". Throughout this lesson, we've implemented the crucial final component of neural network training: the Stochastic Gradient Descent optimizer that transforms calculated gradients into parameter updates. We created a complete training loop that orchestrates forward propagation, loss calculation, backpropagation, and parameter updates, applying it to a real-world regression task with the diabetes dataset. You now understand the full lifecycle of neural network training, from initializing parameters to making predictions, calculating loss, computing gradients, and updating parameters in a continuous cycle of improvement.
