Introduction

Welcome! We're about to explore Stochastic Gradient Descent (SGD), a pivotal optimization algorithm. SGD, a variant of Gradient Descent, is renowned for its efficiency with large datasets thanks to its stochastic nature. Stochastic means “random” and is the opposite of deterministic: a deterministic algorithm runs the same way every time, while a stochastic one introduces randomness. Our journey includes understanding SGD, its theoretical concepts, and implementing it in Python.

Understanding Stochastic Gradient Descent

Understanding SGD starts with its structure. Unlike standard Gradient Descent, which uses the entire dataset, SGD estimates the gradient from a single randomly selected data point. Consequently, SGD is highly efficient for large datasets.

While SGD's efficient handling of large datasets is a blessing, its stochasticity often leads to a noisier convergence process, meaning the model may not settle at the exact minimum.

Defining Data

We are going to use a simple example dataset:
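Since the lesson's exact data isn't reproduced here, the snippet below is a minimal synthetic stand-in: 100 points scattered around the line y = 2x + 1 with a little Gaussian noise (the choice of slope, intercept, and noise level is an assumption for illustration).

```python
import numpy as np

# Synthetic data assumed for illustration: points around y = 2x + 1 plus noise
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2 * X + 1 + np.random.normal(0, 1, size=X.shape)
```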

Math Behind

In terms of math, SGD can be formulated as follows. Imagine we are looking for a best-fit line, setting the parameters of the familiar $y = mx + b$ equation. Remember, $m$ is the slope and $b$ is the y-intercept. Then:

$m' = m - 2\alpha \cdot ((mx_i + b) - y_i) \cdot x_i$

$b' = b - 2\alpha \cdot ((mx_i + b) - y_i)$

where:

  • $m$ and $b$ are the initial values of your parameters
  • $m'$ and $b'$ are the updated parameters
  • $x_i$ is a particular feature of your training set
  • $y_i$ is the actual output for the given feature $x_i$
  • $\alpha$ is the learning rate

These formulas represent the update rules for parameters $m$ and $b$. Here, the term $(mx_i + b) - y_i$ is the difference between the model's prediction $mx_i + b$ and the actual output $y_i$ for the randomly chosen sample. Multiplying it by the feature $x_i$ gives the gradient with respect to $m$ for that single sample; no average over the whole dataset is taken, which is exactly what distinguishes SGD from batch Gradient Descent. The same principle applies to parameter $b$, but without the multiplication by $x_i$, as $b$ is the bias term.
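To see where these rules come from, note that each one is a single gradient-descent step on the error of one sample. Assuming the per-sample squared error loss $L_i(m, b) = ((mx_i + b) - y_i)^2$ (an assumption consistent with the factor of 2 above), the partial derivatives are:

$\frac{\partial L_i}{\partial m} = 2((mx_i + b) - y_i) \cdot x_i$

$\frac{\partial L_i}{\partial b} = 2((mx_i + b) - y_i)$

Substituting these into the generic update $\theta' = \theta - \alpha \frac{\partial L_i}{\partial \theta}$ recovers the formulas above.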

Implementing Stochastic Gradient Descent

Now, let's dive into Python to implement SGD. This process encompasses initializing the parameters randomly, then repeatedly selecting a random training sample, calculating the gradient, and updating the parameters, over several passes through the data (known as epochs).

Let's break it down with the following code:
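The lesson's original code isn't reproduced here, so the following is a minimal sketch built on the synthetic data above; the function name sgd and the learning_rate and epochs parameters are illustrative choices rather than a fixed API.

```python
import numpy as np

def sgd(X, y, learning_rate=0.005, epochs=100):
    """Fit y = m*x + b using stochastic gradient descent."""
    # Initialize parameters randomly
    m, b = np.random.randn(), np.random.randn()
    n = len(X)
    for _ in range(epochs):
        for _ in range(n):
            # Select a single random training sample
            i = np.random.randint(n)
            x_i, y_i = X[i], y[i]
            # Prediction error for this sample
            error = (m * x_i + b) - y_i
            # Apply the update rules derived above
            m -= learning_rate * 2 * error * x_i
            b -= learning_rate * 2 * error
    return m, b

m, b = sgd(X, y)
print(f"Optimized slope m: {m:.3f}, intercept b: {b:.3f}")
```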

After running the SGD implementation, we should see the final optimized values of m (slope) and b (intercept).
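With the synthetic data assumed above, these values should land close to the generating parameters, roughly m ≈ 2 and b ≈ 1.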

Testing the Algorithm

We apply our SGD function and then visualize the resulting fit using Matplotlib.
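A minimal sketch of that step, assuming the sgd function and dataset defined above:

```python
import matplotlib.pyplot as plt

# Scatter the raw data and overlay the line given by the learned parameters
plt.scatter(X, y, s=10, label="Data")
plt.plot(X, m * X + b, color="red", label=f"SGD fit: y = {m:.2f}x + {b:.2f}")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear regression via Stochastic Gradient Descent")
plt.legend()
plt.show()
```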

Here is the result:

The plot shows the data points together with the model produced by SGD on this simple linear regression problem.

Lesson Summary and Practice

Today's lesson unveiled critical aspects of the Stochastic Gradient Descent algorithm. We explored its significance, advantages, disadvantages, mathematical formulation, and Python implementation. You'll soon practice these concepts in upcoming tasks, cementing your understanding of SGD and enhancing your Python coding skills in machine learning. Happy learning!
