Last time, we saw how gradients point us "downhill" to reduce loss. The process of actually taking those steps is called gradient descent.
Imagine you're on a foggy mountain. Gradient descent is how you find your way to the lowest valley, one step at a time.
Engagement Message
What quantity are we trying to minimize on this downhill journey?
The size of each step you take is called the learning rate. It's a critical setting.
If your steps are too big, you might leap right over the valley's bottom. If they're too small, it could take forever to get there.
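To make this concrete, here is a minimal sketch of a single update step in plain Python; the names w, grad, and lr are illustrative, not from any particular library:

```python
# One gradient descent step: move the weight against the gradient.
lr = 0.01          # learning rate: the size of each step
w = 2.0            # current weight
grad = 4.0         # gradient of the loss at w (points "uphill")
w = w - lr * grad  # step downhill; too large an lr can overshoot the minimum
```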
Engagement Message
What could go wrong if your learning rate is too large?
One method is Batch Gradient Descent. Here, we calculate the gradient using our entire dataset before taking a single step.
This gives a very accurate, stable direction. It's like surveying the whole mountain before moving.
Engagement Message
What's the biggest downside of using millions of data points for just one step?
The opposite is Stochastic Gradient Descent (SGD). We update weights after seeing just one training example!
This is much faster, but the path is very noisy and zig-zags a lot. It's like only looking at the ground right under your feet to decide your next step.
Engagement Message
Why does using just one training example at a time make the SGD path so noisy?
The most popular method is Mini-Batch Gradient Descent. It strikes a practical balance between the two extremes.
We use a small batch of data (e.g., 32 examples) to calculate our step. This is much faster than Batch and far less noisy than SGD.
Engagement Message
Why is this "best of both worlds" approach so widely used in practice?
In PyTorch, all of these approaches use the same optimizer, torch.optim.SGD. Whether you get batch, mini-batch, or stochastic gradient descent depends on the batch_size you set in your DataLoader, not on the optimizer itself.
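Here is a minimal sketch assuming a toy dataset of 1,000 examples and a simple linear model; the dataset, model, and variable names are illustrative:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model (any dataset/model works the same way)
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
dataset = TensorDataset(X, y)
model = nn.Linear(10, 1)

# The optimizer is identical in all three cases
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Only the DataLoader's batch_size changes:
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # mini-batch GD
# loader = DataLoader(dataset, batch_size=1, shuffle=True)  # stochastic GD
# loader = DataLoader(dataset, batch_size=len(dataset))     # batch GD

loss_fn = nn.MSELoss()
for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()  # one weight update per batch drawn from the loader
```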
Engagement Message
Does this make sense?
So, to recap the trade-offs:
- Batch: Uses all data. It's smooth but very slow.
- SGD: Uses one example. It's fast but noisy.
- Mini-Batch: Uses a small group. It's a practical balance.
Engagement Message
Which method from our recap gives the best balance between speed and smoothness?
Type
Sort Into Boxes
Practice Question
Let's sort these characteristics based on the gradient descent method they describe.
Labels
- First Box Label: Batch GD
- Second Box Label: Stochastic GD
First Box Items
- Uses all data
- Slowest per update
- Smooth descent
Second Box Items
- Uses one example
- Fastest per update
- Noisy descent
