The momentum term can be considered a coarse prediction of where the SGD algorithm will move next. Nesterov accelerated momentum computes the gradients at this predicted point rather than at the current point, so the gradients are now evaluated at the current parameters displaced by the momentum step.
- One way to think about this is that the gradient term now corrects the path provided by momentum alone.
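
To make the look-ahead idea concrete, here is a minimal sketch of one possible Nesterov update on a toy quadratic loss. It assumes the momentum is kept as a weighted average of gradients (weight `beta` on the old momentum, `1 - beta` on the new gradient) and that the predicted point is the current parameters minus `alpha * beta` times the momentum; some texts drop the `1 - beta` factor. The names `nesterov_step`, `quadratic_grad`, `alpha`, and `beta` are illustrative and not taken from the text.

```python
def nesterov_step(phi, m, grad_fn, alpha=0.1, beta=0.9):
    """One Nesterov momentum update (illustrative helper, assumed form)."""
    # Predicted point: where the momentum term alone would move the parameters.
    lookahead = [p - alpha * beta * mi for p, mi in zip(phi, m)]
    # The gradient is evaluated at the predicted point, not at phi.
    g = grad_fn(lookahead)
    # Assumed convention: momentum is a weighted average of past gradients.
    m_new = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    # Parameter update uses the refreshed momentum.
    phi_new = [p - alpha * mi for p, mi in zip(phi, m_new)]
    return phi_new, m_new


def quadratic_grad(phi):
    # Gradient of the toy loss 0.5 * (phi[0]**2 + 10 * phi[1]**2).
    return [phi[0], 10.0 * phi[1]]


phi, m = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    phi, m = nesterov_step(phi, m, quadratic_grad)
final_loss = 0.5 * (phi[0] ** 2 + 10.0 * phi[1] ** 2)
```

Replacing `lookahead` with `phi` in the call to `grad_fn` would give plain momentum under the same assumed convention, which makes the single difference between the two methods easy to see.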