Nice Distill blog: Why Momentum Really Works

A common modification of Stochastic Gradient Descent is to add a momentum term. We update the parameters with a weighted combination of the gradient computed from the current batch and the direction moved in the previous step:

$$v_t = \gamma v_{t-1} + \eta\, \nabla_\theta \mathcal{L}(x_i; \theta_{t-1})$$

$$\theta_t = \theta_{t-1} - v_t$$

where:

  • $v_t$ is the momentum (which drives the update at iteration $t$)
  • $\gamma$ is a decay parameter that controls the degree to which the gradient is smoothed over time
  • $\eta$ is the learning rate
  • $\nabla_\theta \mathcal{L}(x_i; \theta_{t-1})$ is the gradient of the loss function for data point $x_i$, with respect to the model parameters $\theta$, evaluated at the current parameters $\theta_{t-1}$.
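
A minimal sketch of this update rule in NumPy (the quadratic toy loss, hyperparameter values, and function name are illustrative placeholders, not from the notes):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.01, gamma=0.9, n_steps=200):
    """v_t = gamma * v_{t-1} + lr * grad,  theta_t = theta_{t-1} - v_t."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)          # momentum buffer, v_0 = 0
    for _ in range(n_steps):
        g = grad_fn(theta)            # gradient of the loss at the current parameters
        v = gamma * v + lr * g        # blend the previous direction with the new gradient
        theta = theta - v             # step along the smoothed direction
    return theta

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
print(sgd_momentum(lambda th: th, [5.0, -3.0]))  # approaches the minimum at [0, 0]
```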

Intuition

Essentially, we “remember” the direction the past gradients took us in, and use a weighted combination of that accumulated past direction and the current gradient, resulting in a smoother path.

If we roll out the momentum expression, we can see that older gradients contribute less and less (exponential decay).
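
Concretely, writing $g_k$ for the gradient at step $k$ (shorthand introduced here) and starting from $v_0 = 0$, unrolling the recursion gives a geometrically weighted sum:

$$v_t = \eta \sum_{k=1}^{t} \gamma^{\,t-k} g_k = \eta \left( g_t + \gamma\, g_{t-1} + \gamma^{2} g_{t-2} + \cdots + \gamma^{\,t-1} g_1 \right)$$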

The recursive formulation of the momentum calculation means that the gradient step is a weighted sum of all the previous gradients, where the weights shrink geometrically as we move back in time. The effective learning rate increases if these gradients stay aligned over multiple iterations, but decreases if the gradient direction repeatedly changes, since the terms in the sum cancel out. The overall effect is a smoother trajectory and reduced oscillatory behavior in valleys.
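
To make the “effective learning rate” point concrete (a small derivation added here, not part of the original notes): if the gradient stays equal to a constant $g$, the geometric sum converges to an amplified step, while a gradient that flips sign every step is damped:

$$v_\infty^{\text{aligned}} = \eta\, g \sum_{k=0}^{\infty} \gamma^{k} = \frac{\eta}{1-\gamma}\, g,
\qquad
\bigl|v_t^{\text{alternating}}\bigr| \;\longrightarrow\; \frac{\eta}{1+\gamma}\, |g|$$

With $\gamma = 0.9$, this is roughly a $10\times$ larger step when the gradients agree and roughly half a step when they alternate.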

MIT 6.036 Notes

We can use methods like Running Averages to describe strategies for computing the update $v_t$. Momentum is a simple method that does this by “averaging” recent gradient updates, so that if they have been bouncing back and forth in some direction, we take out that component of the motion. For momentum, we have:

$$v_t = \gamma v_{t-1} + \eta\, \nabla_\theta \mathcal{L}(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$$

This can also be written as:

$$m_t = \gamma m_{t-1} + (1-\gamma)\, \nabla_\theta \mathcal{L}(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta' m_t$$

These two forms are equivalent if we let $\eta' = \eta / (1-\gamma)$. Essentially, $v_t$ (or $m_t$) is a moving average of the gradient. We are doing an update with step size $\eta'$ on a moving average of the gradients with parameter $\gamma$.
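
A quick numerical check of this equivalence (a throwaway sketch with an arbitrary random gradient sequence): running both forms with $\eta' = \eta / (1-\gamma)$ yields identical parameter iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))       # an arbitrary sequence of gradients
eta, gamma = 0.1, 0.9
eta_prime = eta / (1 - gamma)          # the equivalence condition

# Form 1: v_t = gamma*v_{t-1} + eta*g_t,         theta_t = theta_{t-1} - v_t
# Form 2: m_t = gamma*m_{t-1} + (1-gamma)*g_t,   theta_t = theta_{t-1} - eta_prime*m_t
theta1, theta2 = np.zeros(3), np.zeros(3)
v, m = np.zeros(3), np.zeros(3)
for g in grads:
    v = gamma * v + eta * g
    theta1 = theta1 - v
    m = gamma * m + (1 - gamma) * g
    theta2 = theta2 - eta_prime * m

print(np.allclose(theta1, theta2))     # True: the two forms match step for step
```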

$m_t$ will be bigger in dimensions that consistently have the same sign for $\nabla_\theta \mathcal{L}$ and smaller for those that don’t. Of course we now have two parameters to set ($\eta'$ and $\gamma$), but the hope is that the algorithm will perform better overall, so it will be worth trying to find good values for them. Often $\gamma$ is set to be something like 0.9.
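
To see this per-dimension effect numerically (an illustrative toy with a made-up gradient sequence): the moving average stays large in a coordinate whose gradient always has the same sign, and nearly cancels in one whose gradient flips sign every step.

```python
import numpy as np

gamma = 0.9
m = np.zeros(2)
for t in range(100):
    g = np.array([1.0, (-1.0) ** t])   # dim 0: consistent sign, dim 1: sign flips each step
    m = gamma * m + (1 - gamma) * g    # moving average of the gradients
print(m)   # roughly [1.0, -0.05]: large where signs agree, nearly cancelled where they flip
```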