Nesterov Accelerated Momentum

The momentum term can be considered a coarse prediction of where the SGD algorithm will move next. Nesterov accelerated momentum computes the gradients at this predicted point rather than at the current point:

$$
\begin{aligned}
m_{t+1} &⟵ β \cdot m_{t} + (1 - β) \sum_{i \in B_{t}} \frac{\partial ℓ_{i}[ϕ_{t} - α β \cdot m_{t}]}{\partial ϕ} \\
ϕ_{t+1} &⟵ ϕ_{t} - α \cdot m_{t+1}
\end{aligned}
$$

where now the gradients are evaluated at $ϕ_{t} - α β \cdot m_{t}$.

One way to think about this is that the gradient term now corrects the path provided by momentum alone.
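
As a rough illustration, the update can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `nesterov_step`, the callable `grad_fn` (assumed to return the batch-summed gradient at a given point), the default values of `alpha` and `beta`, and the toy quadratic loss are all assumptions made for the example.

```python
import numpy as np

def nesterov_step(phi, m, grad_fn, alpha=0.1, beta=0.9):
    """One Nesterov momentum update (hypothetical helper for illustration)."""
    # Predicted point: where momentum alone would move the parameters.
    lookahead = phi - alpha * beta * m
    # Gradient evaluated at the predicted point, not at phi itself.
    g = grad_fn(lookahead)
    # Momentum update, followed by the parameter update.
    m_next = beta * m + (1 - beta) * g
    phi_next = phi - alpha * m_next
    return phi_next, m_next

# Assumed toy loss: l(phi) = 0.5 * ||phi||^2, whose gradient is phi.
def toy_grad(phi):
    return phi

phi = np.array([1.0, -2.0])
m = np.zeros_like(phi)
for _ in range(200):
    phi, m = nesterov_step(phi, m, toy_grad)
```

On this toy loss the parameters spiral in toward the minimum at the origin. The only difference from plain momentum is the `lookahead` line: replacing it with `lookahead = phi` recovers the standard momentum update.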
