Problem 9.1

Consider a model where the prior distribution over the parameters is a normal distribution with mean zero and variance $\sigma_{\boldsymbol{\phi}}^2$ so that

$$\Pr(\boldsymbol{\phi}) = \prod_{j=1}^{J} \text{Norm}_{\phi_j}\bigl[0, \sigma_{\boldsymbol{\phi}}^2\bigr],$$

where $j$ indexes the model parameters. We now maximize $\prod_{i=1}^{I} \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \Pr(\boldsymbol{\phi})$. Show that the associated loss function of this model is equivalent to L2 regularization.

Recall from the probabilistic interpretation of regularization that the regularization term can be considered a prior representing some knowledge we have about the parameters.

The posterior objective is then:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmax}} \left[ \prod_{i=1}^{I} \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \Pr(\boldsymbol{\phi}) \right].$$
We use the log to convert the product into a sum:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmax}} \left[ \sum_{i=1}^{I} \log \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) + \log \Pr(\boldsymbol{\phi}) \right].$$
The prior term is:

$$\log \Pr(\boldsymbol{\phi}) = \sum_{j=1}^{J} \log \text{Norm}_{\phi_j}\bigl[0, \sigma_{\boldsymbol{\phi}}^2\bigr] = \sum_{j=1}^{J} \left( -\frac{1}{2}\log\bigl(2\pi\sigma_{\boldsymbol{\phi}}^2\bigr) - \frac{\phi_j^2}{2\sigma_{\boldsymbol{\phi}}^2} \right) = \text{const} - \frac{1}{2\sigma_{\boldsymbol{\phi}}^2} \sum_{j=1}^{J} \phi_j^2,$$

where we collapsed the first term into $\text{const}$ because it does not depend on the parameters $\phi_j$.
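As a quick sanity check of this expansion, here is a small Python sketch comparing the summed Gaussian log-densities to the collapsed form; the value of `sigma_phi` and the parameter vector are arbitrary illustrations, not part of the problem.

```python
# Sanity check: sum of Gaussian log-densities vs. the collapsed form.
# sigma_phi and phi are arbitrary illustrative values (assumptions).
import numpy as np
from scipy.stats import norm

sigma_phi = 1.5
phi = np.array([0.3, -1.2, 2.0])

lhs = np.sum(norm.logpdf(phi, loc=0.0, scale=sigma_phi))
const = phi.size * (-0.5 * np.log(2.0 * np.pi * sigma_phi**2))
rhs = const - np.sum(phi**2) / (2.0 * sigma_phi**2)
print(lhs, rhs)  # agree up to floating-point error
```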

Therefore, maximizing the log posterior is equivalent to

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmax}} \left[ \sum_{i=1}^{I} \log \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) - \frac{1}{2\sigma_{\boldsymbol{\phi}}^2} \sum_{j=1}^{J} \phi_j^2 \right],$$
or equivalently minimizing

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \log \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) + \lambda \sum_{j=1}^{J} \phi_j^2 \right],$$

with $\lambda = \frac{1}{2\sigma_{\boldsymbol{\phi}}^2}$. This is the negative log-likelihood loss with an L2 regularization term, as required.
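To make the equivalence concrete, here is a minimal numeric sketch, assuming a one-parameter linear model with unit-variance Gaussian noise as the likelihood (an illustrative choice, not specified by the problem). The log posterior plus the regularized loss collapses to the prior's constant normalizer, so the two objectives differ only by a constant and share the same optimum.

```python
# Numeric check: log posterior and negative regularized loss differ
# only by a constant. The data, likelihood model, and sigma_phi are
# illustrative assumptions, not part of the problem statement.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)
y = rng.normal(size=10)
sigma_phi = 2.0
lam = 1.0 / (2.0 * sigma_phi**2)  # lambda = 1 / (2 sigma_phi^2)

def log_likelihood(phi):
    # sum_i log Norm_{y_i}[phi * x_i, 1]
    resid = y - phi * x
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * resid**2)

def log_prior(phi):
    # log Norm_phi[0, sigma_phi^2]
    return -0.5 * np.log(2.0 * np.pi * sigma_phi**2) - phi**2 / (2.0 * sigma_phi**2)

def regularized_loss(phi):
    # NLL plus lambda * phi^2
    return -log_likelihood(phi) + lam * phi**2

# log_posterior + regularized_loss reduces to the prior's constant
# normalizer, so the printed value is the same for every phi.
for phi in [-1.0, 0.0, 0.5, 3.0]:
    total = log_likelihood(phi) + log_prior(phi) + regularized_loss(phi)
    print(f"phi = {phi:5.2f}   log-posterior + loss = {total:.6f}")
```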

Problem 9.2

How do the gradients of the loss function change when L2 regularization is added?

The parameters are incentivized to stay small (near zero), since a larger norm increases the loss. The regularized loss is

$$\tilde{L}[\boldsymbol{\phi}] = L[\boldsymbol{\phi}] + \lambda \sum_{j=1}^{J} \phi_j^2,$$

where $L[\boldsymbol{\phi}]$ is the regular NLL objective.

Then the gradient becomes:

$$\frac{\partial \tilde{L}}{\partial \phi_j} = \frac{\partial L}{\partial \phi_j} + 2\lambda \phi_j.$$
Thus, every parameter receives an extra gradient term $2\lambda\phi_j$ that pulls it toward zero in proportion to its magnitude.
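This identity can be checked with finite differences; the quadratic stand-in for $L[\boldsymbol{\phi}]$ and the value of $\lambda$ below are illustrative assumptions, not part of the problem.

```python
# Finite-difference check of the gradient identity:
# d(L + lambda * sum phi_j^2)/d phi_j = dL/d phi_j + 2 * lambda * phi_j.
# The quadratic stand-in for L and lambda are illustrative assumptions.
import numpy as np

lam = 0.1
phi = np.array([0.5, -2.0, 1.5])

def L(phi):
    # stand-in for the unregularized NLL (any smooth function works)
    return 0.5 * np.sum((phi - 1.0) ** 2)

def L_reg(phi):
    return L(phi) + lam * np.sum(phi**2)

def num_grad(f, phi, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(phi)
    for j in range(phi.size):
        e = np.zeros_like(phi)
        e[j] = eps
        g[j] = (f(phi + e) - f(phi - e)) / (2.0 * eps)
    return g

print(num_grad(L_reg, phi))                 # gradient of regularized loss
print(num_grad(L, phi) + 2.0 * lam * phi)   # NLL gradient + 2*lambda*phi
```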

Problem 9.3

Problem 9.4

Problem 9.5

Problem 9.6

Problem 9.7

Problem 9.8

Problem 9.9

Problem 9.10