Weight decay is a simple regularization strategy that penalizes the norm of all the weights, as we did in Ridge Regression. We take the gradient of the objective

$$
L_{\text{reg}}(w) = L(w) + \frac{\lambda}{2}\lVert w \rVert_2^2,
\qquad
\nabla_w L_{\text{reg}}(w) = \nabla_w L(w) + \lambda w,
$$
and end up with an update of the form

$$
w_{t+1} = w_t - \eta\left(\nabla_w L(w_t) + \lambda w_t\right) = (1 - \eta\lambda)\, w_t - \eta\, \nabla_w L(w_t).
$$
So, we are “decaying” the weights by a factor of $(1 - \eta\lambda)$ and then taking a gradient step.
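As a minimal sketch of that update (function name, learning rate, and decay coefficient are illustrative choices, not from the text):

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr=0.1, wd=1e-2):
    """One SGD step with weight decay: shrink the weights, then take a gradient step."""
    w = (1.0 - lr * wd) * w   # decay by a factor of (1 - lr * wd)
    return w - lr * grad      # gradient step on the unregularized loss

# Toy example: L(w) = 0.5 * ||w||^2, whose gradient is just w.
w = np.array([1.0, -2.0])
print(sgd_weight_decay_step(w, grad=w))  # -> [ 0.899 -1.798]
```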

Weight decay seems to be very similar to L2 Regularization and is equivalent in some cases, for instance plain SGD (see the sketch below). I don’t fully understand the distinction yet.
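A quick numerical check of the “equivalent in some cases” claim (names and values are illustrative): for vanilla SGD, folding the L2 penalty’s gradient into the update and decaying the weights directly give the same step.

```python
import numpy as np

def l2_sgd_step(w, grad, lr=0.1, lam=1e-2):
    """SGD on the L2-penalized objective: the penalty's gradient (lam * w) is folded into grad."""
    return w - lr * (grad + lam * w)

def decoupled_wd_step(w, grad, lr=0.1, lam=1e-2):
    """Decoupled weight decay: shrink w directly, then step on the unregularized gradient."""
    return (1.0 - lr * lam) * w - lr * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.3])
print(np.allclose(l2_sgd_step(w, g), decoupled_wd_step(w, g)))  # True: identical for plain SGD
```

For plain SGD the algebra makes the two identical; they stop coinciding once the optimizer transforms the gradient before the step (as adaptive methods like Adam do).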