Lesson Introduction

Welcome back! Today, we'll explore Gradient Descent with Momentum. You've already familiarized yourself with basic gradient descent. However, sometimes gradient descent is slow and gets stuck, especially on bumpy paths to the minimum.

So, how do we speed it up? We use momentum. Imagine pushing a heavy shopping cart. Instead of stopping and starting, you build momentum. This helps you move faster. By the end of this lesson, you'll understand how gradient descent with momentum works, implement it in Python, and see how it improves optimization.

Introducing Momentum

Momentum in optimization is like a push in physical movement. It reduces oscillations (back-and-forth movements) as you head toward the minimum point. Essentially, it helps you move down faster and more steadily.

Let's look at the formulas for momentum:

v_t = \beta v_{t-1} - \alpha \nabla f(\theta_{t-1})

\theta_t = \theta_{t-1} + v_t

Here, v_t is the velocity at step t, β is the momentum coefficient (commonly set around 0.9), α is the learning rate, and ∇f(θ_{t-1}) is the gradient of the function at the current parameters. The velocity accumulates past gradients, and the parameters move along the velocity rather than along the raw gradient alone.

How Velocity Works in Gradient Descent with Momentum

In basic gradient descent, the parameter update is directly proportional to the gradient of the function: θ_t = θ_{t-1} - α∇f(θ_{t-1}). This approach can be slow because the path tends to oscillate back and forth on its way to the minimum.

With momentum, the velocity term v_t is introduced to accelerate the updates along directions where the gradient consistently points. Here's a detailed breakdown of how the velocity term works:

  1. Initial Update (at t = 0)
    • Velocity starts at zero: v_0 = 0, so the very first step is identical to a plain gradient descent step.
  2. Subsequent Updates (t ≥ 1)
    • The velocity becomes a decaying sum of past gradients: components that consistently point toward the minimum reinforce one another, while components that flip sign from step to step largely cancel, which damps the oscillations (see the worked example below).
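To make this concrete, here is a worked one-dimensional example (the numbers are chosen purely for illustration). Take f(θ) = θ^2 with θ_0 = 2, β = 0.9, and α = 0.1. The gradient at θ_0 is 4, so v_1 = 0.9·0 - 0.1·4 = -0.4 and θ_1 = 2 - 0.4 = 1.6. At the next step the gradient is 3.2, so v_2 = 0.9·(-0.4) - 0.1·3.2 = -0.68 and θ_2 = 1.6 - 0.68 = 0.92; because both gradients point the same way, this second step is more than twice the plain gradient step of 0.32.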

Python Implementation: Part 1

Now, let's implement Gradient Descent with Momentum in Python.
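Below is a minimal sketch of such a routine. The function name, signature, and default hyperparameter values are illustrative assumptions; the key update lines are the ones discussed in the breakdown that follows.

```python
def gradient_descent_with_momentum(grad_func, point, learning_rate=0.1,
                                   momentum=0.9, iterations=100):
    # Velocity vector with the same length as the starting point
    velocity = [0] * len(point)
    for _ in range(iterations):
        grad = grad_func(point)  # gradient at the current point
        for i in range(len(point)):
            # Apply momentum and subtract the gradient scaled by the learning rate
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i]
            # Move the current point by the newly calculated velocity
            point[i] += velocity[i]
    return point
```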

Here’s a breakdown of the key lines in the code:

  • velocity = [0] * len(point): Initializes the velocity vector with zeros, having the same length as the starting point.
  • velocity[i] = momentum * velocity[i] - learning_rate * grad[i]: Updates the velocity by applying the momentum and subtracting the gradient scaled by the learning rate.
  • point[i] += velocity[i]: Updates the current point using the newly calculated velocity.

Python Implementation: Part 2

Next, let's apply the routine to an example function and an initial point.
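The sketch below reuses gradient_descent_with_momentum from Part 1 on f(x, y) = x^2 + y^2, the same function used in the visualization later in this lesson; the starting point, hyperparameter values, and iteration count are illustrative assumptions.

```python
# Example function: f(x, y) = x^2 + y^2, whose gradient is (2x, 2y)
def grad_f(point):
    return [2 * point[0], 2 * point[1]]

# Illustrative starting point and hyperparameters
start = [4.0, 3.0]
result = gradient_descent_with_momentum(grad_f, start, learning_rate=0.1,
                                        momentum=0.9, iterations=200)
print(result)  # both coordinates end up very close to 0, the minimum of f
```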

Benefits of Using Momentum

Using momentum in gradient descent offers several benefits:

  1. Faster Convergence: The accumulated velocity allows larger steps along consistent descent directions, so the minimum is reached in fewer iterations.
  2. Reduced Oscillations: Gradient components that flip sign from step to step cancel out in the velocity, smoothing the back-and-forth movements.
  3. Better Navigation Through Local Minima: The built-up velocity can carry the update past small bumps and shallow local minima where plain gradient descent might get stuck.

Visualizing Momentum

To visually understand this, let's compare the paths taken by basic gradient descent and gradient descent with momentum. We will take f(x, y) = x^2 + y^2 and run both variants for just 3 iterations:
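A minimal sketch of such a comparison is shown below; the starting point and hyperparameter values are illustrative assumptions, and matplotlib is used only to draw the two trajectories.

```python
import matplotlib.pyplot as plt

def grad_f(p):
    # Gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return [2 * p[0], 2 * p[1]]

learning_rate, momentum, iterations = 0.1, 0.9, 3
start = [4.0, 3.0]  # illustrative starting point

# Path taken by basic gradient descent
plain_path = [start[:]]
point = start[:]
for _ in range(iterations):
    g = grad_f(point)
    point = [point[i] - learning_rate * g[i] for i in range(2)]
    plain_path.append(point[:])

# Path taken by gradient descent with momentum
momentum_path = [start[:]]
point = start[:]
velocity = [0.0, 0.0]
for _ in range(iterations):
    g = grad_f(point)
    for i in range(2):
        velocity[i] = momentum * velocity[i] - learning_rate * g[i]
        point[i] += velocity[i]
    momentum_path.append(point[:])

# Plot both trajectories toward the minimum at (0, 0)
for path, label in [(plain_path, "Basic gradient descent"),
                    (momentum_path, "With momentum")]:
    xs, ys = zip(*path)
    plt.plot(xs, ys, marker="o", label=label)
plt.scatter([0], [0], marker="*", s=150, label="Minimum")
plt.xlabel("x")
plt.ylabel("y")
plt.title("3 iterations on f(x, y) = x^2 + y^2")
plt.legend()
plt.show()
```

After 3 iterations the momentum path ends up noticeably closer to the origin than the plain path, which is exactly the effect described above.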

Lesson Summary

Congratulations! You've learned about Gradient Descent with Momentum. We covered its importance, how it works, and implemented it in Python. You've seen how it speeds up optimization and reduces oscillations.

Now, let’s practice. In the practice session, you'll implement Gradient Descent with Momentum and observe its effects on different functions. Get ready to solidify your understanding and see momentum in action!
