Weight decay is a simple regularization strategy that penalizes the norm of all the weights, as we did in Ridge Regression. We take the gradient of the objective

$$
L_{\text{reg}}(w) = L(w) + \frac{\lambda}{2}\lVert w \rVert_2^2,
\qquad
\nabla_w L_{\text{reg}}(w) = \nabla_w L(w) + \lambda w,
$$
and end up with an update of the form

$$
w_{t+1} = w_t - \eta\left(\nabla_w L(w_t) + \lambda w_t\right) = (1 - \eta\lambda)\, w_t - \eta\, \nabla_w L(w_t).
$$
So, we are “decaying” the weights by a factor of $(1 - \eta\lambda)$ and then taking a gradient step.
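As a minimal sketch of that update (function name, learning rate, and decay coefficient are illustrative choices, not from the text):

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr=0.1, wd=1e-2):
    """One SGD step with weight decay: shrink the weights, then take a gradient step."""
    w = (1.0 - lr * wd) * w   # decay by a factor of (1 - lr * wd)
    return w - lr * grad      # gradient step on the unregularized loss

# Toy example: L(w) = 0.5 * ||w||^2, whose gradient is just w.
w = np.array([1.0, -2.0])
print(sgd_weight_decay_step(w, grad=w))  # -> [ 0.899 -1.798]
```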

Weight decay seems to be very similar to L2 Regularization and is equivalent in some cases, for instance plain SGD (see the sketch below). I don’t fully understand the distinction yet.
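A quick numerical check of the “equivalent in some cases” claim (names and values are illustrative): for vanilla SGD, folding the L2 penalty’s gradient into the update and decaying the weights directly give the same step.

```python
import numpy as np

def l2_sgd_step(w, grad, lr=0.1, lam=1e-2):
    """SGD on the L2-penalized objective: the penalty's gradient (lam * w) is folded into grad."""
    return w - lr * (grad + lam * w)

def decoupled_wd_step(w, grad, lr=0.1, lam=1e-2):
    """Decoupled weight decay: shrink w directly, then step on the unregularized gradient."""
    return (1.0 - lr * lam) * w - lr * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.3])
print(np.allclose(l2_sgd_step(w, g), decoupled_wd_step(w, g)))  # True: identical for plain SGD
```

For plain SGD the algebra makes the two identical; they stop coinciding once the optimizer transforms the gradient before the step (as adaptive methods like Adam do).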