Welcome! We're about to explore Stochastic Gradient Descent (SGD), a pivotal optimization algorithm. SGD, a variant of Gradient Descent, is renowned for its efficiency with large datasets thanks to its stochastic nature. Stochastic means “random” and is the opposite of deterministic: a deterministic algorithm runs the same way every time, while a stochastic one introduces randomness. Our journey includes understanding SGD, its theoretical underpinnings, and implementing it in Python.
Understanding SGD starts with its structure. Unlike standard Gradient Descent, which computes the gradient over the entire dataset, SGD estimates the gradient from a single randomly selected data point at each step. Consequently, SGD is highly efficient for large datasets.
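To make this distinction concrete, here is a minimal sketch contrasting the two gradient computations. The function names, toy data, and squared-error loss for a line are assumptions for illustration, not code from this lesson:

```python
import numpy as np

def batch_gradient(m, b, X, y):
    # Gradient Descent: average the squared-error gradient over the ENTIRE dataset.
    error = y - (m * X + b)
    return -2 * np.mean(X * error), -2 * np.mean(error)

def stochastic_gradient(m, b, X, y, rng):
    # SGD: estimate the gradient from ONE randomly selected data point.
    i = rng.integers(len(X))
    error = y[i] - (m * X[i] + b)
    return -2 * X[i] * error, -2 * error

rng = np.random.default_rng(0)
X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(batch_gradient(0.0, 0.0, X, y))             # uses all three points
print(stochastic_gradient(0.0, 0.0, X, y, rng))   # uses just one random point
```

Both return a slope-gradient and intercept-gradient pair; the stochastic version is much cheaper per step but noisier.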
While SGD's efficient handling of large datasets is a blessing, its stochasticity makes convergence noisier: because each update is based on a single point, the parameters jitter around the optimum, and the model may not settle at the exact minimum.
We are going to use a simple example dataset:
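The original values are not reproduced here, but a small toy dataset of this shape might look like the following (the numbers are assumed for illustration, loosely following the line y = 2x + 1 with a little noise):

```python
import numpy as np

# Hypothetical toy dataset: X values and noisy y values roughly on y = 2x + 1.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 4.9, 7.1, 9.0, 11.2])
```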
In terms of math, SGD can be formulated as follows. Imagine we are looking for a best-fit line by setting the parameters of the familiar equation $y = mx + b$. Remember, $m$ is the slope and $b$ is the y-intercept. Then, for a randomly chosen data point $(x_i, y_i)$ and learning rate $\alpha$:

$m \leftarrow m - \alpha \, \frac{\partial L_i}{\partial m}, \qquad b \leftarrow b - \alpha \, \frac{\partial L_i}{\partial b}$

where $L_i = \bigl(y_i - (m x_i + b)\bigr)^2$ is the squared error on the single sampled point, so that

$\frac{\partial L_i}{\partial m} = -2 x_i \bigl(y_i - (m x_i + b)\bigr), \qquad \frac{\partial L_i}{\partial b} = -2 \bigl(y_i - (m x_i + b)\bigr).$
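Putting the update rule into code, here is a minimal sketch of SGD for the best-fit line. The dataset, learning rate, epoch count, and function name are illustrative assumptions rather than values from the lesson:

```python
import numpy as np

# Hypothetical toy dataset (assumed values, roughly y = 2x + 1 plus noise).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 4.9, 7.1, 9.0, 11.2])

def sgd_fit_line(X, y, learning_rate=0.01, epochs=100, seed=42):
    """Fit y = m*x + b with SGD, using one data point per parameter update."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0                              # start with a flat line through the origin
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # visit the points in a random order
            error = y[i] - (m * X[i] + b)        # residual on the single sampled point
            m += learning_rate * 2 * X[i] * error   # m <- m - alpha * dL/dm
            b += learning_rate * 2 * error          # b <- b - alpha * dL/db
    return m, b

m, b = sgd_fit_line(X, y)
print(f"slope ≈ {m:.2f}, intercept ≈ {b:.2f}")   # should land roughly near 2 and 1
```

Because each update uses only one point, the fitted slope and intercept wobble slightly from run to run and epoch to epoch, which is exactly the noisy convergence behaviour described above.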