Adam has become the default method for managing step sizes in neural networks. It combines the ideas of momentum and Adadelta:

  • Take a moving average of the gradient, so that updates tend to keep moving in the same direction (momentum).
  • Scale the step size for each weight by the magnitude of its recent gradients (Adadelta).

We start by writing the moving averages of the gradient and squared gradient, which reflect estimates of the mean and (uncentered) variance of the gradient for weight $w_i$:

$$m_i^{(t)} = \beta_1\, m_i^{(t-1)} + (1 - \beta_1)\, g_i^{(t)}, \qquad v_i^{(t)} = \beta_2\, v_i^{(t-1)} + (1 - \beta_2)\, \big(g_i^{(t)}\big)^2,$$

where $g_i^{(t)} = \partial L / \partial w_i$ is the gradient for weight $w_i$ at iteration $t$, and $\beta_1, \beta_2 \in [0, 1)$ are decay rates.
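To make this concrete, here is a minimal NumPy sketch of these two averages for one layer's weight matrix (an illustration added here, not code from the text; the shapes, variable names and the decay rates $\beta_1 = 0.9$, $\beta_2 = 0.999$ are assumptions, using the defaults discussed further below):

```python
import numpy as np

# Sketch (not from the text): the two running averages for one weight matrix.
beta1, beta2 = 0.9, 0.999           # assumed decay rates (the usual defaults)
w = np.random.randn(4, 3)           # a layer's weight matrix
m = np.zeros_like(w)                # estimate of the mean gradient
v = np.zeros_like(w)                # estimate of the (uncentered) variance
grad = np.random.randn(*w.shape)    # stand-in for dL/dw from backprop

m = beta1 * m + (1 - beta1) * grad        # moving average of the gradient
v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of the squared gradient
```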

A problem with these estimates is that, if we initialize $m_i^{(0)} = v_i^{(0)} = 0$, they will always be biased (slightly too small). So we will correct for that bias by defining:

$$\hat m_i^{(t)} = \frac{m_i^{(t)}}{1 - \beta_1^{\,t}}, \qquad \hat v_i^{(t)} = \frac{v_i^{(t)}}{1 - \beta_2^{\,t}}.$$
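For example (a quick check added here, assuming $\beta_1 = 0.9$): at $t = 1$ the raw average is $m_i^{(1)} = \beta_1 \cdot 0 + (1 - \beta_1)\, g_i^{(1)} = 0.1\, g_i^{(1)}$, ten times too small as an estimate of the mean gradient, whereas the corrected estimate $\hat m_i^{(1)} = m_i^{(1)} / (1 - \beta_1^{\,1}) = g_i^{(1)}$ is not.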

Note that $\beta_1$ is raised to the power of $t$, and likewise for $\beta_2$. To justify these corrections, note that if we were to expand $m_i^{(t)}$ in terms of $m_i^{(0)}$ and the gradients $g_i^{(1)}, \ldots, g_i^{(t)}$, the coefficients would sum to $1$. However, the coefficient behind $m_i^{(0)}$ is $\beta_1^{\,t}$, and since $m_i^{(0)} = 0$, the sum of the coefficients of the non-zero terms is $1 - \beta_1^{\,t}$, hence the correction. The same justification holds for $\hat v_i^{(t)}$.
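Spelling the expansion out (added here for completeness, in the notation above):

$$m_i^{(t)} = \beta_1^{\,t}\, m_i^{(0)} + (1 - \beta_1) \sum_{s=1}^{t} \beta_1^{\,t-s}\, g_i^{(s)}, \qquad \beta_1^{\,t} + (1 - \beta_1) \sum_{s=1}^{t} \beta_1^{\,t-s} = \beta_1^{\,t} + \big(1 - \beta_1^{\,t}\big) = 1.$$

With $m_i^{(0)} = 0$, the surviving terms carry total weight $1 - \beta_1^{\,t}$, so dividing by $1 - \beta_1^{\,t}$ restores a weighted average whose coefficients sum to $1$.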

Now, our update for weight $w_i$ has a step size that takes the steepness into account (like Adadelta), but also tends to keep moving in the same direction (like momentum):

$$w_i \leftarrow w_i - \alpha\, \frac{\hat m_i^{(t)}}{\sqrt{\hat v_i^{(t)}} + \epsilon}.$$

The authors of Adam recommend setting $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
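Putting the pieces together, a single Adam step for one weight array might look like the following sketch (the function name `adam_step` and its signature are choices made here for illustration, not the authors' reference code):

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight array (illustrative sketch).

    w, grad, m and v are arrays of the same shape; t is the iteration count,
    starting at 1 so that the bias corrections 1 - beta**t are nonzero.
    """
    m = beta1 * m + (1 - beta1) * grad         # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias-corrected mean estimate
    v_hat = v / (1 - beta2 ** t)               # bias-corrected variance estimate
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```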

Although we now have even more hyperparameters, Adam is not highly sensitive to their values. And even though we now have a separate step size for each weight, and several extra quantities to update on each iteration of gradient descent, it is still relatively easy to implement: we simply maintain a matrix (of the same shape as the weights) for each quantity, for each layer of the network.
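One way to organize that bookkeeping (a sketch under the same assumptions as above, with names such as `Adam`, `params` and `grads` chosen here rather than taken from the text) is a small class that stores an $m$ and a $v$ matrix per layer, plus a shared iteration counter:

```python
import numpy as np

class Adam:
    """Sketch: keeps one m matrix and one v matrix per layer of the network."""

    def __init__(self, params, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # params: dict mapping a layer name to its weight matrix
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = {name: np.zeros_like(w) for name, w in params.items()}
        self.v = {name: np.zeros_like(w) for name, w in params.items()}
        self.t = 0  # shared iteration counter for the bias corrections

    def step(self, params, grads):
        # grads: dict with the same keys as params, holding dL/dw per layer
        self.t += 1
        for name, w in params.items():
            g = grads[name]
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g ** 2
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            params[name] = w - self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)
        return params
```

Inside a training loop this would be used as `params = opt.step(params, grads)`, with `params` and `grads` dicts of NumPy arrays keyed by layer.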