Getting Started with Momentum

Hello! Today, we will learn about a powerful technique that makes our Gradient Descent move faster, like a ball rolling down a hill. We call this "Momentum".

What Momentum Is and How It Works

Momentum improves our Gradient Descent. How does it do that? Remember how a ball on top of a hill starts rolling down? If the slope is steep, the ball picks up speed, right? That's what momentum does to our Gradient Descent: it builds up speed when the gradient keeps pointing in the same direction across iterations.

How to Add Momentum to Gradient Descent

Let's get down to coding! We will demonstrate the effect of momentum in a gradient descent process using a gradient function, grad_func(). The weight, or parameter (theta), starts at a point and moves down the slope by adjusting itself in every iteration, or 'epoch', according to the following update rules:

v := v \cdot \gamma + \alpha \cdot \text{gradient}

\theta := \theta - v

Where:

  • \theta is the parameter vector,
  • gradient is the gradient of the cost function with respect to the parameters at the current parameter value,
  • \alpha is the learning rate,
  • v is the velocity vector (initialized to 0), and
  • \gamma is the momentum parameter (a new hyperparameter).

A higher \gamma gives past gradients more influence, which generally speeds up convergence; if it is set too high, though, the updates can overshoot the minimum.
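As a quick sanity check with purely illustrative numbers: suppose \gamma = 0.9, \alpha = 0.01, the current velocity is v = 0.1, and the gradient is 10. Then:

v := 0.1 \cdot 0.9 + 0.01 \cdot 10 = 0.09 + 0.1 = 0.19

So \theta decreases by 0.19, almost twice the plain gradient step of \alpha \cdot \text{gradient} = 0.1.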

Here is the Python implementation:
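What follows is a minimal sketch. The cost function (x^2, so grad_func() returns 2 * theta), the starting point, and the hyperparameter values are illustrative assumptions rather than fixed choices:

```python
def grad_func(theta):
    # Gradient of the illustrative cost function J(theta) = theta^2
    return 2 * theta

theta = 5.0   # parameter, starting away from the minimum at 0
v = 0.0       # velocity, initialized to 0
alpha = 0.01  # learning rate
gamma = 0.9   # momentum parameter

for epoch in range(100):
    gradient = grad_func(theta)        # gradient at the current parameter value
    v = v * gamma + alpha * gradient   # new velocity: decayed old velocity plus scaled gradient
    theta = theta - v                  # move the parameter by the velocity

print(round(theta, 4))  # close to 0, the minimum of theta^2
```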

We compute the gradient at the current parameter value. Then we calculate the new velocity: the old velocity scaled by the momentum parameter, plus the learning rate times the gradient. Finally, we update the parameter by subtracting this velocity from it.

Compare Gradient Descents: Setup

Now let's visualize how momentum aids in faster convergence (which means getting to the answer quicker) in the following code snippet:
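Here is one possible version of that snippet, a sketch that reuses the illustrative grad_func() and hyperparameters from above and runs both methods side by side:

```python
def grad_func(theta):
    # Gradient of the illustrative cost function J(theta) = theta^2
    return 2 * theta

epochs = 50
alpha = 0.01  # learning rate, shared by both methods
gamma = 0.9   # momentum parameter

theta_plain = 5.0     # weight for plain gradient descent
theta_momentum = 5.0  # weight for momentum-based gradient descent
v = 0.0               # velocity for the momentum method

history_plain = [theta_plain]
history_momentum = [theta_momentum]

for epoch in range(epochs):
    # Plain update: step directly along the negative gradient
    theta_plain = theta_plain - alpha * grad_func(theta_plain)
    history_plain.append(theta_plain)

    # Momentum update: accumulate velocity, then step by the velocity
    v = v * gamma + alpha * grad_func(theta_momentum)
    theta_momentum = theta_momentum - v
    history_momentum.append(theta_momentum)
```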

Here, we implement plain and momentum gradients within one loop and track the history of weight changes to visualize them later.

Compare Gradient Descents: Visualization

Let's visualize the comparison:
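Below is a sketch using matplotlib; it assumes the history_plain and history_momentum lists from the previous snippet and plots the cost x^2 at every recorded weight:

```python
import matplotlib.pyplot as plt

# Cost J(theta) = theta^2 at every recorded weight
cost_plain = [theta ** 2 for theta in history_plain]
cost_momentum = [theta ** 2 for theta in history_momentum]

plt.plot(cost_plain, label='Gradient Descent')
plt.plot(cost_momentum, label='Momentum-based Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.show()
```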

The resulting plot compares Gradient Descent (without momentum) and Momentum-based Gradient Descent on the same function, x^2. The graph shows how the cost (the value of the function) changes over the epochs. The cost shrinks faster for the Momentum-based method. That's because it gets a speed boost from the momentum, just like the ball rolling down the hill!

Wrapping Up

You've done it! You've understood how to use momentum to improve Gradient Descent and seen it in action. Doesn't the ball-on-a-hill analogy make it easier to understand? Now, it's time to put your knowledge into practice! If you remember how a rolling ball picks up speed, you'll never forget how momentum improves Gradient Descent. Happy practicing and coding!
