Consider an objective function in the form of a sum:
$$f(\Theta) = \sum_{i=1}^{n} f_i(\Theta)$$
During stochastic gradient descent (SGD), we select one term of the sum at random and take a step in the direction of its negative gradient. This is an alternative to computing the gradient of the whole dataset, which is computationally expensive and infeasible with large datasets. Furthermore, many small steps taken in this stochastic manner tend to average out to roughly the same trajectory as naive gradient descent. Concretely, on iteration $t$ we choose an index $i$ uniformly at random and update
$$\Theta_t = \Theta_{t-1} - \eta(t)\, \nabla_{\Theta} f_i(\Theta_{t-1}).$$
Note that now, instead of a fixed step size $\eta$, the step size is indexed by the iteration of the algorithm, $\eta(t)$. Choosing a good stopping criterion can be a little trickier for SGD than for traditional gradient descent. In the above, we've just chosen to stop after a fixed number of iterations $T$.
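As a concrete sketch of this loop, here is a minimal Python implementation on a toy least-squares objective; the helper name `sgd`, the synthetic data, and the $1/(20+t)$ schedule are illustrative choices, not from the notes:

```python
import numpy as np

def sgd(grad_fns, theta0, eta, T):
    """Minimal SGD loop: run T iterations with step-size schedule eta(t)."""
    theta = theta0.copy()
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(len(grad_fns))              # pick one term of the sum at random
        theta = theta - eta(t) * grad_fns[i](theta)  # step using only that term's gradient
    return theta

# Toy use: f_i(theta) = (x_i . theta - y_i)^2 on noiseless synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0])
grad_fns = [lambda th, x=X[i], yi=y[i]: 2 * (x @ th - yi) * x for i in range(100)]
theta_hat = sgd(grad_fns, np.zeros(2), eta=lambda t: 1 / (20 + t), T=5000)
print(theta_hat)  # should land near [3, -2]
```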
For SGD to converge to a local optimum as $t$ increases, the step size $\eta(t)$ has to decrease as a function of time.
Theorem: Step size and convergence
If $f$ is convex, and $\eta(t)$ is a sequence satisfying
$$\sum_{t=1}^{\infty} \eta(t) = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta(t)^2 < \infty,$$
then SGD converges with probability one to the optimal $\Theta$.
An example of a decreasing step size would be to make $\eta(t) = 1/t$, but people often use rules that decrease more slowly, and so don't strictly satisfy the criteria for convergence.
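To check that $\eta(t) = 1/t$ qualifies: the harmonic series diverges while the sum of its squares converges, so both conditions of the theorem hold:
$$\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty.$$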
Algorithmic Benefits
There are several reasons why SGD might be better algorithmically than regular gradient descent (also known as batch gradient descent, or BGD).
- If your $f$ is actually non-convex, but has many shallow local optima that might trap BGD, then the stochastic steps taken toward individual terms of the sum might bounce you around the landscape and out of those local optima.
- Sometimes, optimizing $f$ really well is not what we want, because doing so might overfit the training set; so although SGD might not achieve lower training error than BGD, it might result in lower test error.
- BGD typically requires computing some quantity over every data point in a data set. SGD may perform well after visiting only some of the data. This behavior can be useful for very large data sets, saving both runtime and memory (see the sketch below).
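To make the cost difference concrete, here is a rough sketch (the data shapes and step size are illustrative assumptions) comparing one BGD step, which touches all $n$ points, with one SGD step, which touches a single point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
theta = np.zeros(d)

# One BGD step: the gradient uses every data point, O(n * d) work per iteration.
full_grad = 2 * X.T @ (X @ theta - y) / n
theta_bgd = theta - 0.01 * full_grad

# One SGD step: the gradient of a single randomly chosen term, O(d) work per
# iteration, and only row X[i] needs to be in memory (data could be streamed).
i = rng.integers(n)
stoch_grad = 2 * (X[i] @ theta - y[i]) * X[i]
theta_sgd = theta - 0.01 * stoch_grad
```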