Another useful idea in picking the learning rate is that we want to take large steps in parts of the space where the loss surface is nearly flat (because there is no risk of taking too big a step due to the gradient being large), and smaller steps where it is steep. We can apply this idea to each weight independently, and end up with a method called adadelta, which is a variant on adagrad (adaptive gradient).
Even though our weights are indexed by layer, input unit, and output unit, for simplicity here just let $W_j$ be any weight in the network (we will do the same thing for all of them). Writing $g_{t,j}$ for the $j$th component of the gradient at step $t$, we have:

$$G_{t,j} = \gamma\, G_{t-1,j} + (1 - \gamma)\, g_{t,j}^2$$

$$W_{t,j} = W_{t-1,j} - \frac{\eta}{\sqrt{G_{t,j} + \epsilon}}\, g_{t,j}$$

where $\gamma \in (0,1)$ controls how quickly the moving average forgets old gradients and $\eta$ is the step size.
The sequence $G_{t,j}$ is a moving average of the square of the $j$th component of the gradient. We square it in order to be insensitive to the sign; we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight $j$, but divide the step size by $\sqrt{G_{t,j} + \epsilon}$, which is larger when the surface is steeper in direction $j$ at point $W_{t-1}$ in weight space; this means that the step size will be smaller when it's steep and larger when it's flat. The $\epsilon$ is a small smoothing value to prevent division by zero.
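To make the per-weight bookkeeping concrete, here is a minimal NumPy sketch of this update; the function name, the toy objective, and the hyperparameter values are illustrative choices, not part of the notes.

```python
import numpy as np

def adaptive_step(W, g, G, eta=0.1, gamma=0.9, epsilon=1e-8):
    """One adadelta-style update applied to every weight independently."""
    # Moving average of the squared gradient (squared so that only the
    # magnitude matters, not the sign).
    G = gamma * G + (1 - gamma) * g ** 2
    # Divide the step size by sqrt(G + epsilon): smaller steps where the
    # surface has been steep, larger steps where it is nearly flat.
    W = W - eta / np.sqrt(G + epsilon) * g
    return W, G

# Toy example: f(w) = w_0^2 + 10 * w_1^2, with gradient [2*w_0, 20*w_1].
W = np.array([3.0, 3.0])
G = np.zeros_like(W)
for _ in range(500):
    g = np.array([2.0 * W[0], 20.0 * W[1]])
    W, G = adaptive_step(W, g, G)
# W ends up near the minimum at the origin, within a neighborhood whose
# size scales with eta.
```

Note that both coordinates take steps of comparable size even though the surface is ten times steeper in the second direction; that is the point of dividing by $\sqrt{G_{t,j} + \epsilon}$ per weight.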
Adagrad
Adagrad is a more basic version of the above: instead of using a moving average, we accumulate the sum of the squares of all past gradients. Because this accumulated sum can only grow, the effective learning rate is monotonically decreasing, which can make later progress very slow.
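For comparison, here is a sketch of the adagrad version under the same illustrative assumptions as before; the only change is that the squared gradients are summed rather than exponentially averaged, so the divisor never shrinks.

```python
import numpy as np

def adagrad_step(W, g, G_sum, eta=0.1, epsilon=1e-8):
    """One adagrad update: accumulate all past squared gradients."""
    # G_sum only grows, so the effective per-weight rate
    # eta / sqrt(G_sum + epsilon) is monotonically decreasing.
    G_sum = G_sum + g ** 2
    W = W - eta / np.sqrt(G_sum + epsilon) * g
    return W, G_sum
```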