Last time, we saw how gradients point us "downhill" to reduce loss. The process of actually taking those steps is called gradient descent.
Imagine you're on a foggy mountain. Gradient descent is how you find your way to the lowest valley, one step at a time.
Engagement Message
What quantity are we trying to minimize on this downhill journey?
The size of each step you take is called the learning rate. It's a critical setting.
If your steps are too big, you might leap right over the valley's bottom. If they're too small, it could take forever to get there.
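To make this concrete, here is a minimal sketch of a single update step in plain Python; the names w, grad, and lr are illustrative, not from any particular library:

```python
# One gradient descent step: move the weight against the gradient.
lr = 0.01          # learning rate: the size of each step
w = 2.0            # current weight
grad = 4.0         # gradient of the loss at w (points "uphill")
w = w - lr * grad  # step downhill; too large an lr can overshoot the minimum
```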
Engagement Message
What could go wrong if your learning rate is too large?
One method is Batch Gradient Descent. Here, we calculate the gradient using our entire dataset before taking a single step.
This gives a very accurate, stable direction. It's like surveying the whole mountain before moving.
Engagement Message
What's the biggest downside of using millions of data points for just one step?
The opposite is Stochastic Gradient Descent (SGD). We update weights after seeing just one training example!
This is much faster, but the path is very noisy and zig-zags a lot. It's like only looking at the ground right under your feet to decide your next step.
Engagement Message
Why does using just one training example at a time make the SGD path so noisy?
The most popular method is Mini-Batch Gradient Descent. It strikes a practical balance between the two extremes.
We use a small batch of data (e.g., 32 examples) to calculate our step. This is much faster than Batch and far less noisy than SGD.
Engagement Message
Why is this "best of both worlds" approach so widely used in practice?
In PyTorch, all of these approaches use the same optimizer, torch.optim.SGD. Whether you get batch, mini-batch, or stochastic gradient descent depends on the batch_size you set in your DataLoader, not on the optimizer itself.
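Here is a minimal sketch assuming a toy dataset of 1,000 examples and a simple linear model; the dataset, model, and variable names are illustrative:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model (any dataset/model works the same way)
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
dataset = TensorDataset(X, y)
model = nn.Linear(10, 1)

# The optimizer is identical in all three cases
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Only the DataLoader's batch_size changes:
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # mini-batch GD
# loader = DataLoader(dataset, batch_size=1, shuffle=True)  # stochastic GD
# loader = DataLoader(dataset, batch_size=len(dataset))     # batch GD

loss_fn = nn.MSELoss()
for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()  # one weight update per batch drawn from the loader
```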
Engagement Message
Does this make sense?
So, to recap the trade-offs:
- Batch: Uses all data. It's smooth but very slow.
- SGD: Uses one example. It's fast but noisy.
- Mini-Batch: Uses a small group. It's a practical balance.
Engagement Message
Which method from our recap gives the best balance between speed and smoothness?
Type
Sort Into Boxes
Practice Question
Let's sort these characteristics based on the gradient descent method they describe.
Labels
- First Box Label: Batch GD
- Second Box Label: Stochastic GD
First Box Items
- Uses all data
- Slowest per update
- Smooth descent
Second Box Items
- Uses one example
- Fastest per update
- Noisy descent
