Introduction

Welcome back to our course, "Training Neural Networks: the Backpropagation Algorithm"! This is our fifth and final lesson in this course, and we've come a long way. So far, we've explored loss functions, gradient descent, and implemented backpropagation for both individual layers and complete networks.

In our previous lesson, we implemented the backward pass for a Multi-Layer Perceptron, allowing us to calculate the gradients of all weights and biases with respect to the loss. However, calculating gradients is only half the story. These gradients tell us which direction to move, but we still need to actually move in that direction!

Today, we'll complete our neural network training journey by implementing Stochastic Gradient Descent (SGD), which will use the gradients to update our network's parameters. We'll also create a complete training loop and apply our knowledge to train a model on a real-world regression task.

This represents the culmination of everything we've learned so far. After this lesson, you'll be ready to move on to our fourth and final course, "Building and Applying Your Neural Network Library"!

Stochastic Gradient Descent: The Workhorse of Neural Network Training

Stochastic Gradient Descent (SGD) is the foundational algorithm that powers most neural network training. In our earlier lesson, we explored basic gradient descent, also called Batch Gradient Descent, where we minimized a simple quadratic function by computing the gradient over the entire dataset and updating the parameters accordingly. However, as datasets grow larger and models become more complex, this approach becomes computationally expensive and slow.

To address this, several variants of gradient descent have been developed:

  • Batch Gradient Descent: Computes gradients using the entire dataset for each update. While this provides the most accurate gradient direction, it is often too slow and memory-intensive for large datasets.
  • Stochastic Gradient Descent (SGD): Updates parameters using only a single randomly chosen data point at each step. This allows for very frequent updates and can help the model escape local minima, but introduces a lot of noise into the updates.
  • Mini-batch Gradient Descent: The most common or "default" approach in deep learning. Instead of using the whole dataset or a single data point, it updates parameters using small, randomly selected batches of data (e.g., 32 or 64 samples at a time). This strikes a balance between computational efficiency and the stability of gradient estimates.

The key idea behind these variants is how much data is used to estimate the gradient at each update step. Using smaller subsets (mini-batches) introduces randomness, or "stochasticity," into the training process. This randomness provides several benefits:

  1. Computational efficiency: Processing small batches requires less memory and allows for faster updates.
  2. Faster convergence: Weights are updated more frequently, so learning can progress more quickly.
  3. Ability to escape local minima: The noise in gradient estimates can help the model avoid getting stuck in shallow local minima.

In modern neural network training, mini-batch SGD is the standard, as it leverages the strengths of both batch and stochastic approaches and is well-suited to parallel computation on hardware like GPUs.
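
To make the distinction concrete, here is a small standalone sketch (not part of our network code) that estimates the gradient of a mean squared error loss for a plain linear model using each of the three strategies; all variable names here are illustrative.

```python
import numpy as np

# Toy data: 1000 samples, 5 features, and a linear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=(5, 1))
y = X @ true_w + 0.1 * rng.normal(size=(1000, 1))

w = np.zeros((5, 1))  # current parameters

def mse_gradient(X_subset, y_subset, w):
    """Gradient of the mean squared error with respect to w for a linear model."""
    n = X_subset.shape[0]
    return (2.0 / n) * X_subset.T @ (X_subset @ w - y_subset)

# Batch gradient descent: one gradient estimate from the entire dataset.
grad_batch = mse_gradient(X, y, w)

# Stochastic gradient descent: one gradient estimate from a single random sample.
i = rng.integers(len(X))
grad_sgd = mse_gradient(X[i:i + 1], y[i:i + 1], w)

# Mini-batch gradient descent: one gradient estimate from a small random subset.
idx = rng.choice(len(X), size=32, replace=False)
grad_minibatch = mse_gradient(X[idx], y[idx], w)

# All three estimate the same underlying gradient; smaller subsets are cheaper
# to compute but noisier.
```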

The SGD Algorithm: Pseudocode and Key Steps

Let's break down the SGD algorithm as it is typically used in neural network training. The process involves iteratively updating the model's parameters in the direction that reduces the loss, using gradients computed from mini-batches of data.

Here's the pseudocode for mini-batch SGD (sketched here with η denoting the learning rate and B the mini-batch size):
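
```
initialize the network's weights and biases randomly
choose the learning rate η, the batch size B, and the number of epochs

for each epoch:
    shuffle the training data
    split the shuffled data into mini-batches of size B
    for each mini-batch (X_batch, y_batch):
        predictions = forward(X_batch)              # forward pass
        loss = compute_loss(predictions, y_batch)   # loss calculation
        gradients = backward(loss)                  # backward pass (backpropagation)
        for each parameter θ with gradient g:
            θ ← θ − η · g                           # parameter update
```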

Let's discuss the key steps:

  1. Initialization: Start by randomly initializing the network's weights.
  2. Learning Rate: Set the learning rate η, which determines the size of each update step.
  3. Epoch Loop: For each epoch (a full pass through the dataset), shuffle the data to ensure randomness.
  4. Mini-batch Loop: Split the data into mini-batches and process each batch in turn.
    • Forward Pass: Compute the model's predictions for the current mini-batch.
    • Loss Calculation: Measure how far off the predictions are from the true values.
    • Backward Pass: Compute the gradients of the loss with respect to each parameter.
    • Parameter Update: Adjust the weights and biases in the direction that reduces the loss, scaled by the learning rate.

By following this cycle of forward and backward passes, combined with parameter updates, SGD enables neural networks to learn from data in a scalable and efficient way.

Implementing the SGD Optimizer

Now that we understand the concept behind SGD, let's implement it as a Python class. Our SGD optimizer will be responsible for updating the weights and biases of each layer using their calculated gradients:
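
Here is a minimal sketch. It assumes each layer stores its parameters as weights and biases and its gradients as d_weights and d_biases after the backward pass; adjust the attribute names if your layer class uses different ones.

```python
class SGD:
    """Stochastic Gradient Descent optimizer."""

    def __init__(self, learning_rate=0.01):
        # Step size for each parameter update; defaults to 0.01.
        self.learning_rate = learning_rate

    def update(self, layer):
        """Apply the SGD update rule to one layer's weights and biases."""
        # Make sure the backward pass has populated the gradients first.
        if getattr(layer, "d_weights", None) is None or getattr(layer, "d_biases", None) is None:
            print(f"Warning: no gradients found for layer {layer}; skipping update.")
            return

        # Core update rule: parameter -= learning_rate * gradient
        layer.weights -= self.learning_rate * layer.d_weights
        layer.biases -= self.learning_rate * layer.d_biases
```

With this in place, updating an entire network is simply a matter of looping over its layers and calling `optimizer.update(layer)` after each backward pass.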

This implementation is straightforward but powerful. Let's break it down:

  1. We initialize the optimizer with a specified learning rate (defaulting to 0.01 if none is provided).
  2. The update method takes a layer as input and applies the SGD update rule to its weights and biases.
  3. We check if the gradients exist before attempting to use them, providing a warning if they're missing.
  4. The core update rule is simply: parameter -= learning_rate * gradient

The beauty of this design is its simplicity and flexibility. Each layer calculates and stores its own gradients during the backward pass, and the optimizer simply uses these stored gradients to update the parameters. This separation of concerns creates a clean architecture that can be easily extended to more sophisticated optimization algorithms in the future.

The Diabetes Dataset

Before we implement our training loop, let's briefly discuss the dataset we'll be using. The diabetes dataset is a commonly used regression dataset available through scikit-learn, and it's perfect for testing our neural network implementation.

The dataset contains measurements for 442 patients with diabetes, along with a quantitative measure of disease progression one year after baseline. It includes 10 features: age, sex, body mass index (BMI), average blood pressure (BP), and six blood serum measurements (s1-s6).

Our task will be to predict the disease progression measure, making this a regression problem. This dataset is relatively small and manageable, making it ideal for our educational purposes. We won't need to preprocess it extensively, allowing us to focus on the neural network training process.

Here's how we'll load the dataset in our code:
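
A minimal way to do this with scikit-learn, training on the full dataset and reshaping the targets into a column vector so they match the network's single output:

```python
from sklearn.datasets import load_diabetes

# Load the diabetes regression dataset (442 samples, 10 features).
X_train, y_train = load_diabetes(return_X_y=True)

# Reshape the targets into a column vector to match the network's output shape.
y_train = y_train.reshape(-1, 1)

print(X_train.shape)  # (442, 10)
print(y_train.shape)  # (442, 1)
```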

This gives us our feature matrix X_train with shape (442, 10) and our target vector y_train with shape (442, 1). We're now ready to build our training loop and apply SGD to learn from this data.

Creating a Training Loop

With our SGD optimizer and dataset ready, let's implement a basic training loop to tie everything together. The training loop is the heart of neural network learning, orchestrating the forward pass, loss calculation, backward pass, and parameter updates.

Here's the structure of a typical training loop using mini-batch SGD:
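
The sketch below assumes the MLP and DenseLayer classes built in the previous lessons; their exact names and constructor arguments may differ in your own implementation.

```python
import numpy as np

# Build the network: a hidden layer with 10 neurons (ReLU) and a single
# linear output neuron for regression. (Class names and constructor
# arguments are assumed from the previous lessons.)
mlp = MLP([
    DenseLayer(n_inputs=10, n_neurons=10, activation="relu"),
    DenseLayer(n_inputs=10, n_neurons=1, activation="linear"),
])

# SGD optimizer with a small learning rate for stable training.
optimizer = SGD(learning_rate=0.002)

# Training hyperparameters.
n_epochs = 100
batch_size = 32
n_samples = X_train.shape[0]

for epoch in range(n_epochs):
    # Shuffle the data at the start of every epoch.
    permutation = np.random.permutation(n_samples)
    X_shuffled = X_train[permutation]
    y_shuffled = y_train[permutation]

    epoch_loss = 0.0
    n_batches = 0

    # Process the data in mini-batches.
    for start in range(0, n_samples, batch_size):
        X_batch = X_shuffled[start:start + batch_size]
        y_batch = y_shuffled[start:start + batch_size]
        # ... forward pass, loss, backward pass, and updates (next section)
```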

Let's analyze this initial part of our training loop:

  1. We first create our MLP with two layers: a hidden layer with 10 neurons and ReLU activation, and an output layer with 1 neuron (for regression) and linear activation.
  2. We initialize our SGD optimizer with a learning rate of 0.002, which is relatively small to ensure stable training.
  3. We set up hyperparameters: 100 epochs and a batch size of 32.
  4. For each epoch, we shuffle the data to prevent the model from learning patterns based on data order.
  5. We then process the data in mini-batches, selecting subsets of the training data for each update.

The shuffling step is crucial in stochastic gradient descent. It ensures that each epoch presents the data in a different order, which helps prevent the model from getting stuck in local minima and improves generalization.

Completing the Training Loop

Now let's complete our training loop by implementing the forward pass, backward pass, and parameter updates for each mini-batch:
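
Continuing inside the mini-batch loop from the previous snippet, and assuming the mse_loss and mse_loss_derivative helpers from our earlier lesson on loss functions (the names may differ in your implementation):

```python
        # --- inside the mini-batch loop from the previous snippet ---

        # 1. Forward pass: generate predictions for this mini-batch.
        predictions = mlp.forward(X_batch)

        # 2. Loss calculation: MSE between predictions and targets.
        loss = mse_loss(y_batch, predictions)
        epoch_loss += loss
        n_batches += 1

        # 3. Loss derivative: gradient of the loss w.r.t. the predictions.
        d_loss = mse_loss_derivative(y_batch, predictions)

        # 4. Backward pass: propagate the gradient through the network so
        #    each layer stores its weight and bias gradients.
        mlp.backward(d_loss)

        # 5. Parameter updates: apply the SGD rule to every layer.
        for layer in mlp.layers:
            optimizer.update(layer)

    # After all mini-batches: report the average loss for this epoch,
    # printing it periodically (here, every 10 epochs).
    average_loss = epoch_loss / n_batches
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {average_loss:.2f}")
```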

For each mini-batch, we:

  1. Forward pass: Use our MLP to generate predictions
  2. Loss calculation: Compute the MSE loss between predictions and targets
  3. Loss derivative: Calculate how the loss changes with respect to our predictions
  4. Backward pass: Propagate this gradient backward through the network
  5. Parameter updates: Use our SGD optimizer to update each layer's weights and biases

After processing all mini-batches, we calculate the average loss for the epoch and periodically print it to track training progress. This allows us to monitor whether the model is learning effectively.

The training loop is the orchestrator that brings together all the components we've built throughout this course: the layer structure, forward propagation, loss functions, backpropagation, and now parameter updates through SGD. This elegant cycle of computation and learning is what gives neural networks their power.

Evaluating Training Results

After training is complete, we want to evaluate our model's performance and examine some sample predictions. Let's add this final piece to our implementation:
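
A simple way to do this, reusing the trained mlp and the same assumed mse_loss helper:

```python
# Evaluate the trained network on the training data.
final_predictions = mlp.forward(X_train)
final_mse = mse_loss(y_train, final_predictions)
print(f"\nFinal MSE on training data: {final_mse:.2f}")

# Compare a few predictions with their true targets.
print("\nSample predictions vs. targets:")
for i in range(5):
    print(f"  Prediction: {final_predictions[i][0]:8.2f}   Target: {y_train[i][0]:6.2f}")
```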

Running this code prints the loss every few epochs, followed by the final MSE and a handful of sample predictions.

The training results show that:

  1. Loss Reduction: The loss drops significantly in the early epochs (from 23125.65 to around 4000), demonstrating that our model is learning. The loss continues to decrease more gradually in later epochs.

  2. Final Performance: The final MSE on the training data is approximately 2910, which gives us a sense of the model's fit. For the diabetes dataset, this is a reasonable result for a simple MLP model without any feature engineering or hyperparameter tuning.

  3. Predictions vs. Targets: Looking at the sample predictions, we can see that our model makes reasonable predictions for some samples (like the second and fifth examples) but has larger errors for others. This suggests there's room for improvement, which could be achieved through model refinement or additional training.

This analysis provides valuable insights into our model's performance and confirms that our implementation of SGD and the training loop is working correctly.

Conclusion

Congratulations! You've completed the fifth and final lesson in our course on "Training Neural Networks: the Backpropagation Algorithm". Throughout this lesson, we've implemented the crucial final component of neural network training: the Stochastic Gradient Descent optimizer that transforms calculated gradients into parameter updates. We created a complete training loop that orchestrates forward propagation, loss calculation, backpropagation, and parameter updates, applying it to a real-world regression task with the diabetes dataset. You now understand the full lifecycle of neural network training, from initializing parameters to making predictions, calculating loss, computing gradients, and updating parameters in a continuous cycle of improvement.

You're now ready to move on to our fourth and final course in this path, "Building and Applying Your Neural Network Library". There, you'll build upon the foundations established in this course to create a more comprehensive neural network library and apply it to solve real-world problems. Remember that while our implementation prioritized clarity and educational value over computational efficiency, the fundamental principles remain the same in production libraries like TensorFlow and PyTorch, which leverage advanced hardware for optimization. Keep experimenting, keep learning, and most importantly, enjoy applying your new skills to solve interesting problems!
