We can use methods like running averages to describe strategies for computing $\eta$. Momentum is a simple method that does this by "averaging" recent gradient updates, so that if they have been bouncing back and forth in some direction, we take out that component of the motion. For momentum, we have:

$$
\begin{aligned}
V_0 &= 0 \\
V_t &= \gamma V_{t-1} + \eta \nabla_W J(W_{t-1}) \\
W_t &= W_{t-1} - V_t
\end{aligned}
$$
This can also be written as:

$$
\begin{aligned}
M_0 &= 0 \\
M_t &= \gamma M_{t-1} + (1 - \gamma)\,\nabla_W J(W_{t-1}) \\
W_t &= W_{t-1} - \eta' M_t
\end{aligned}
$$
These two forms are equivalent if we let $\eta' = \eta / (1 - \gamma)$. Essentially, $V_t$ (or $M_t$) is a moving average of the gradient. We are doing an update with step size $\eta'$ on a moving average of the gradients with parameter $\gamma$.
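To make the equivalence concrete, here is a minimal sketch in Python with NumPy. The quadratic objective, and the names `grad`, `momentum_v`, and `momentum_m`, are illustrative assumptions rather than anything from the text; the sketch runs both forms with $\eta' = \eta/(1-\gamma)$ and shows that they produce the same iterates (up to floating-point rounding):

```python
import numpy as np

# Illustrative quadratic objective f(W) = 0.5 * W @ A @ W, with gradient A @ W
# (this objective is an assumption chosen for demonstration).
A = np.diag([1.0, 10.0])

def grad(W):
    return A @ W

def momentum_v(W0, eta, gamma, steps):
    """First form: V_t = gamma*V_{t-1} + eta*grad(W_{t-1}); W_t = W_{t-1} - V_t."""
    W, V = W0.copy(), np.zeros_like(W0)
    for _ in range(steps):
        V = gamma * V + eta * grad(W)
        W = W - V
    return W

def momentum_m(W0, eta_prime, gamma, steps):
    """Second form: M_t = gamma*M_{t-1} + (1-gamma)*grad(W_{t-1}); W_t = W_{t-1} - eta'*M_t."""
    W, M = W0.copy(), np.zeros_like(W0)
    for _ in range(steps):
        M = gamma * M + (1 - gamma) * grad(W)
        W = W - eta_prime * M
    return W

W0 = np.array([1.0, 1.0])
eta, gamma = 0.01, 0.9
print(momentum_v(W0, eta, gamma, 50))
print(momentum_m(W0, eta / (1 - gamma), gamma, 50))  # eta' = eta/(1-gamma): same iterates
```

The reason the runs match is that the two state variables differ only by a constant scaling, $V_t = \frac{\eta}{1-\gamma} M_t$, which the choice of $\eta'$ exactly cancels in the weight update.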
$M_t$ will be bigger in dimensions that consistently have the same sign for $\nabla_W J$ and smaller for those that don't. Of course we now have two parameters to set ($\eta'$ and $\gamma$), but the hope is that the algorithm will perform better overall, so it will be worth trying to find good values for them. Often $\gamma$ is set to be something like $0.9$.
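As a quick illustration of that point, the sketch below feeds a synthetic gradient sequence (an assumption made up for demonstration) into the moving average with $\gamma = 0.9$: in the coordinate whose gradient flips sign every step the average stays near zero, while in the coordinate with a consistent sign it grows to roughly the gradient's magnitude:

```python
import numpy as np

gamma = 0.9
M = np.zeros(2)
for t in range(100):
    # Synthetic gradients: coordinate 0 flips sign every step,
    # coordinate 1 always has the same sign.
    g = np.array([(-1.0) ** t, 1.0])
    M = gamma * M + (1 - gamma) * g

print(M)  # about [+/-0.05, 1.0]: small where signs flip, large where they agree
```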