The momentum term can be considered a coarse prediction of where the SGD algorithm will move next. Nesterov accelerated momentum computes the gradients at this predicted point rather than at the current point:

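$$
\begin{aligned}
m_{t+1} &\leftarrow \beta \cdot m_t + (1-\beta) \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\phi_t - \alpha\beta \cdot m_t]}{\partial \phi} \\
\phi_{t+1} &\leftarrow \phi_t - \alpha \cdot m_{t+1},
\end{aligned}
$$

where now the gradients are evaluated at $\phi_t - \alpha\beta \cdot m_t$.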

  • One way to think about this is that the gradient term now corrects the path provided by momentum alone.
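
The look-ahead idea can be made concrete with a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the names `nesterov_step`, `quadratic_grad`, and `run`, the toy quadratic loss, and the values `alpha=0.1` (learning rate) and `beta=0.9` (momentum coefficient) are all hypothetical choices made for this example, and the full gradient of the toy loss stands in for a batch gradient.

```python
def nesterov_step(phi, m, grad_fn, alpha=0.1, beta=0.9):
    """One Nesterov momentum update on a list of parameters (illustrative)."""
    # Look-ahead point: where momentum alone would move the parameters.
    lookahead = [p - alpha * beta * mi for p, mi in zip(phi, m)]
    # The gradient is evaluated at the look-ahead point, not at phi.
    g = grad_fn(lookahead)
    # Blend the previous momentum with the new gradient.
    m_new = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
    # Move the parameters along the updated momentum.
    phi_new = [p - alpha * mi for p, mi in zip(phi, m_new)]
    return phi_new, m_new


def quadratic_grad(phi):
    # Gradient of the assumed toy loss 0.5 * (phi[0]**2 + 10 * phi[1]**2).
    return [phi[0], 10.0 * phi[1]]


def run(steps=200):
    # Start away from the minimum at the origin, with zero momentum.
    phi, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        phi, m = nesterov_step(phi, m, quadratic_grad)
    return phi
```

Calling `run()` repeatedly applies the update and should drive both parameters toward the minimum at the origin. Replacing `lookahead` with `phi` in the call to `grad_fn` would recover ordinary momentum, which makes the single difference between the two methods easy to see.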