Introduction: Why Optimizer Choice Matters

Welcome back to the Advanced Neural Tuning course. In the last lesson, you learned how adjusting the learning rate during training can help your neural network learn more efficiently. Now, we will focus on another key part of the training process: the optimizer.

The optimizer is the algorithm that updates the weights of your neural network based on the gradients calculated during backpropagation. Choosing the right optimizer can make a big difference in how quickly your model learns and how well it performs. Just as with learning rate scheduling, the optimizer you select can help your model achieve better results, sometimes with less effort. In this lesson, you will learn how to set up and compare two of the most popular optimizers in PyTorch: SGD and Adam.

SGD vs. Adam: What’s the Difference?

Before we look at the code, let’s briefly discuss what makes SGD and Adam different.

SGD stands for Stochastic Gradient Descent. It is one of the simplest and most widely used optimizers. With SGD, the model’s weights are updated in the direction that reduces the loss, using a single global learning rate shared by all parameters. While it is simple and effective, it can be slow to converge, especially if the learning rate is not set well.

Adam, which stands for Adaptive Moment Estimation, is a more advanced optimizer. It maintains running averages of both the gradients and the squared gradients for each parameter. This allows Adam to adapt the step size for each parameter individually, often leading to faster and more stable training. In practice, Adam is a good default choice for many deep learning problems, but it is still important to understand it and compare it with simpler methods like SGD.
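To make the contrast concrete, here is a simplified sketch of the two update rules in plain Python. It is only illustrative: Adam’s bias correction is omitted, and the variable names are not PyTorch internals.

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: step against the gradient with one global learning rate
    return w - lr * grad

def adam_step(w, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient and of the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # The effective step size shrinks for parameters whose gradients have been large or noisy
    w = w - lr * m / (v ** 0.5 + eps)
    return w, m, v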

Setting Up Multiple Optimizers in PyTorch

Let’s see how you can set up both SGD and Adam optimizers for the same model in PyTorch. For this example, we will use a simple multi-layer perceptron (MLP) model. If you are working in the CodeSignal environment, you do not need to install PyTorch, as it is already available.

Here is how you can define two optimizers for the same model:
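The sketch below assumes model is the network you are training; here it is an instance of the MLP class shown later in this lesson.

import torch.optim as optim

model = MLP()  # your network; the MLP class is defined in the next section

optimizers = [
    optim.SGD(model.parameters(), lr=0.01),   # plain stochastic gradient descent
    optim.Adam(model.parameters(), lr=0.01),  # adaptive moment estimation
]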

In this code, we create a list called optimizers that contains two optimizer objects. The first is an SGD optimizer with a learning rate of 0.01, and the second is an Adam optimizer with the same learning rate. Both optimizers are set up to update the parameters of the same model. This setup will allow us to compare how each optimizer performs under the same conditions.

Example: Training with Different Optimizers

Now, let’s look at a complete example where we train the same model using both optimizers, one after the other. It is important to reset the model before training with each optimizer so that both runs start from the same initial state; re-seeding the random number generator before each reset is a simple way to guarantee identical starting weights. This way, you can fairly compare their effects.

In the code below, MLP() refers to a simple multi-layer perceptron class. You should define this class yourself, or use one provided earlier in the course. For example, you might have:
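(The layer sizes below are arbitrary placeholders; match them to your own data.)

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size=10, hidden_size=32, output_size=2):
        super().__init__()
        # Two fully connected layers with a ReLU activation in between
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        return self.layers(x)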

Here is how you can set up and compare the optimizers:
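This sketch reuses the optimizers list from earlier. Two details keep the comparison fair and correct: the random seed is reset so every run starts from identical initial weights, and the optimizer is rebuilt for the fresh model, since an optimizer only updates the parameters it was constructed with.

import torch

for optimizer in optimizers:
    print(f"Training with: {optimizer}")

    # Re-seed so every run starts from identical initial weights
    torch.manual_seed(42)
    model = MLP()

    # Rebuild the optimizer around the new model's parameters
    optimizer = type(optimizer)(model.parameters(), lr=0.01)

    # ... your usual training loop goes here, for example:
    # for epoch in range(num_epochs):
    #     optimizer.zero_grad()
    #     loss = criterion(model(inputs), targets)
    #     loss.backward()
    #     optimizer.step()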

Let’s break down what is happening here. We loop through each optimizer in the optimizers list and print out which one is being used. Then, we re-seed the random number generator and create a new instance of the MLP model, so that each run starts from the same initial weights. Because an optimizer only updates the parameters it was constructed with, we also rebuild the optimizer for the fresh model before running the usual training loop. This approach ensures that any differences you observe are due to the optimizer itself, not to differences in the model’s starting point.

When you run this code, you might see output like:
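(The exact fields vary with your PyTorch version; the listing below is abbreviated.)

Training with: SGD (
Parameter Group 0
    lr: 0.01
    momentum: 0
    nesterov: False
    weight_decay: 0
    ...
)
Training with: Adam (
Parameter Group 0
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.01
    weight_decay: 0
    ...
)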

This output is the string representation of each optimizer object. It shows the configuration parameters for each optimizer, such as the learning rate (lr), momentum (for SGD), and beta values (for Adam). This helps you verify which optimizer is being used and with what settings for each training run.

Summary and Practice Preview

In this lesson, you learned why optimizer choice is important and how it can impact the training of your neural network. You saw the main differences between SGD and Adam, and how to set up both optimizers in PyTorch. You also learned how to structure your code to compare optimizers fairly by resetting the model before each run.

In the upcoming practice exercises, you will get hands-on experience using different optimizers. You will train models with both SGD and Adam, observe their effects, and start to develop an intuition for when to use each one. This will help you build stronger, more efficient neural networks as you continue through the course.
