Weight decay is a simple regularization strategy that penalizes the norm of all weights, as we did in Ridge Regression. We take the gradient of the objective

$$J(w) = L(w) + \frac{\lambda}{2}\|w\|_2^2, \qquad \nabla J(w) = \nabla L(w) + \lambda w,$$

and end up with an update of the form

$$w \leftarrow w - \eta\big(\nabla L(w) + \lambda w\big) = (1 - \eta\lambda)\, w - \eta\, \nabla L(w).$$

So, we are “decaying” $w$ by a factor of $(1 - \eta\lambda)$ and then taking a gradient step.
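A minimal NumPy sketch of this equivalence, using assumed placeholder values for the weights, gradient, learning rate $\eta$, and coefficient $\lambda$: one update folds $\lambda w$ into the gradient (the L2 view), the other shrinks the weights by $(1 - \eta\lambda)$ and then takes the plain gradient step (the weight-decay view). For vanilla SGD the two give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights (placeholder values)
grad_loss = rng.normal(size=5)  # gradient of the unregularized loss at w
eta, lam = 0.1, 0.01            # learning rate and weight-decay coefficient

# L2 regularization: add lambda * w to the gradient, then take the step.
w_l2 = w - eta * (grad_loss + lam * w)

# Weight decay: shrink w by (1 - eta * lambda), then take the plain gradient step.
w_decay = (1 - eta * lam) * w - eta * grad_loss

print(np.allclose(w_l2, w_decay))  # True: identical updates for vanilla SGD
```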
Weight decay seems to be very similar to L2 regularization, and for a plain SGD update like the one above the two are equivalent. I don’t fully understand the distinction yet.