There are some problems with the closed-form OLS solution for regression:

  • What if $X^\top X$ isn’t invertible?
  • What if we overfit?

Ridge regression deals with these problems by adding a regularization term of $\lambda \|\theta\|^2$ to the OLS objective, with a trade-off parameter $\lambda$.

Original linear regression objective function:

$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2$$

Ridge regression objective function:

$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2$$
Note that $\|\theta\|^2 = \theta^\top \theta$ (the dot product of $\theta$ with itself). Larger values of $\lambda$ “pressure” the $\theta$ values to be smaller (near zero).
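As a concrete illustration, here is a minimal NumPy sketch of the ridge objective above; the function name `ridge_objective` and the variable names are illustrative choices, not anything prescribed by these notes.

```python
import numpy as np

def ridge_objective(theta, theta_0, X, y, lam):
    """Ridge objective: mean squared error plus lam * ||theta||^2.

    X is an (n, d) matrix of inputs, y an (n,) vector of targets,
    theta a (d,) weight vector, theta_0 a scalar offset.
    Note that theta_0 is NOT included in the penalty term.
    """
    residuals = X @ theta + theta_0 - y      # (n,) vector of prediction errors
    mse = np.mean(residuals ** 2)            # (1/n) * sum of squared errors
    penalty = lam * np.dot(theta, theta)     # lam * ||theta||^2
    return mse + penalty
```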

Why don't we penalize $\theta_0$?

If we think about what linear regression does, we are basically trying to fit a function $y = mx + b$. The point of regularization is to keep $m$ (the weights $\theta$) from overfitting to the training data; $b$, or $\theta_0$ in our ML notation, is just an offset term that governs the translation of the function/line, so we generally don’t want to regularize it.

Analytical Solution for Ridge Regression

Like the analytical solution for OLS, we can also minimize $J_{\text{ridge}}$ analytically. It’s a bit more complicated because $\theta_0$ requires special treatment. For simplicity, we will decide not to treat $\theta_0$ specially (so we add a constant 1 feature to our input vectors), then we get:

$$J_{\text{ridge}}(\theta) = \frac{1}{n} (X\theta - Y)^\top (X\theta - Y) + \lambda \|\theta\|^2$$

Setting the gradient $\nabla_\theta J_{\text{ridge}} = \frac{2}{n} X^\top (X\theta - Y) + 2\lambda\theta$ to zero and solving:

$$\theta = \left( X^\top X + n\lambda I \right)^{-1} X^\top Y$$
The matrix $X^\top X + n\lambda I$ becomes invertible when $\lambda > 0$. This is called “ridge” regression because we are adding a “ridge” of $n\lambda$ values along the diagonal of the $X^\top X$ matrix before inverting, since $I$ is the identity matrix for the dimension $d$. This basically makes the matrix more diagonally dominant, which guarantees that it is invertible (every eigenvalue of $X^\top X$ is nonnegative and gets shifted up by $n\lambda > 0$).
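To make the closed-form solution concrete, here is a minimal NumPy sketch; the function name `ridge_analytical` is illustrative, and the $n\lambda$ scaling follows the $\frac{1}{n}$ objective used above.

```python
import numpy as np

def ridge_analytical(X, y, lam):
    """Closed-form ridge solution: theta = (X^T X + n*lam*I)^{-1} X^T y.

    Assumes a column of 1s has already been appended to X, so the
    offset theta_0 is just the weight on that constant feature (and is
    therefore also penalized, as discussed above).
    """
    n, d = X.shape
    A = X.T @ X + n * lam * np.eye(d)   # add a "ridge" of n*lam down the diagonal
    # Solving the linear system is preferable to explicitly inverting A.
    return np.linalg.solve(A, X.T @ y)

# Usage sketch on random data:
# X = np.hstack([np.random.randn(100, 3), np.ones((100, 1))])
# y = np.random.randn(100)
# theta = ridge_analytical(X, y, lam=0.1)
```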

Gradient Descent Solution

Inverting a $d \times d$ matrix takes $O(d^3)$ time, so the analytical solution above is impractical for large $d$. We can fall back on gradient descent, which needs only $O(nd)$ computation per step.

Ridge regression objective:

$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2$$

Gradient of the ridge regression objective with respect to $\theta$:

$$\nabla_\theta J_{\text{ridge}} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right) x^{(i)} + 2\lambda\theta$$

Partial derivative with respect to $\theta_0$:

$$\frac{\partial J_{\text{ridge}}}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)$$
With these derivatives, we can do gradient descent, using either the regular (batch) or stochastic gradient methods. Even better, the objective functions for OLS and ridge regression are convex, which means every minimum they have is a global minimum. This means that, with a small enough step size, gradient descent is guaranteed to find the optimum.
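Here is a sketch of how these gradients plug into a batch gradient descent loop; the step size, iteration count, and the function name `ridge_gradient_descent` are illustrative choices rather than anything prescribed by these notes.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, step_size=0.01, num_steps=1000):
    """Batch gradient descent on the ridge objective.

    X: (n, d) inputs, y: (n,) targets, lam: regularization trade-off.
    theta_0 is kept separate so that it is not penalized.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(num_steps):
        residuals = X @ theta + theta_0 - y                 # (n,) prediction errors
        grad_theta = (2 / n) * (X.T @ residuals) + 2 * lam * theta
        grad_theta_0 = (2 / n) * residuals.sum()            # no penalty term for theta_0
        theta -= step_size * grad_theta
        theta_0 -= step_size * grad_theta_0
    return theta, theta_0
```

Each pass through the loop touches every entry of $X$ once, which is the $O(nd)$ per-step cost mentioned above; a stochastic variant would instead compute the gradient on one (or a few) randomly chosen examples per step.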