There are some problems with the closed-form OLS solution for regression:

  • What if $X^\top X$ isn’t invertible?
  • What if we overfit?

Ridge regression deals with these problems by adding a regularization term of $\lambda \|\theta\|^2$ to the OLS objective, with a trade-off parameter $\lambda$.

Original linear regression objective function:

$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2$$

Ridge regression objective function:

$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2$$
Note that $\|\theta\|^2 = \theta^\top \theta$ (the dot product of $\theta$ with itself). Larger values of $\lambda$ “pressure” the $\theta$ values to be smaller (near zero).
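As a concrete illustration, here is a minimal NumPy sketch of the ridge objective above; the function name `ridge_objective` and the variable names are illustrative choices, not anything prescribed by these notes.

```python
import numpy as np

def ridge_objective(theta, theta_0, X, y, lam):
    """Ridge objective: mean squared error plus lam * ||theta||^2.

    X is an (n, d) matrix of inputs, y an (n,) vector of targets,
    theta a (d,) weight vector, theta_0 a scalar offset.
    Note that theta_0 is NOT included in the penalty term.
    """
    residuals = X @ theta + theta_0 - y      # (n,) vector of prediction errors
    mse = np.mean(residuals ** 2)            # (1/n) * sum of squared errors
    penalty = lam * np.dot(theta, theta)     # lam * ||theta||^2
    return mse + penalty
```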

Why don't we penalize $\theta_0$?

If we think about what linear regression does, we are basically trying to fit a function $y = mx + b$. The point of regularization is to keep $m$ (the weights $\theta$) from overfitting to the training data; $b$, or $\theta_0$ in our ML notation, is just an offset term that governs the translation of the function/line, so we generally don’t want to regularize it.

Analytical Solution for Ridge Regression

Like the analytical solution for OLS, we can also minimize $J_{\text{ridge}}$ analytically. It’s a bit more complicated because $\theta_0$ requires special treatment. For simplicity, we will decide not to treat $\theta_0$ specially (so we add a constant 1 feature to our input vectors), then we get:

$$J_{\text{ridge}}(\theta) = \frac{1}{n} (X\theta - Y)^\top (X\theta - Y) + \lambda \|\theta\|^2$$

Setting the gradient $\nabla_\theta J_{\text{ridge}} = \frac{2}{n} X^\top (X\theta - Y) + 2\lambda\theta$ to zero and solving:

$$\theta = \left( X^\top X + n\lambda I \right)^{-1} X^\top Y$$
The matrix $X^\top X + n\lambda I$ becomes invertible when $\lambda > 0$. This is called “ridge” regression because we are adding a “ridge” of $n\lambda$ values along the diagonal of the $X^\top X$ matrix before inverting, since $I$ is the identity matrix for the dimension $d$. This basically makes the matrix more diagonally dominant, which guarantees that it is invertible (every eigenvalue of $X^\top X$ is nonnegative and gets shifted up by $n\lambda > 0$).
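To make the closed-form solution concrete, here is a minimal NumPy sketch; the function name `ridge_analytical` is illustrative, and the $n\lambda$ scaling follows the $\frac{1}{n}$ objective used above.

```python
import numpy as np

def ridge_analytical(X, y, lam):
    """Closed-form ridge solution: theta = (X^T X + n*lam*I)^{-1} X^T y.

    Assumes a column of 1s has already been appended to X, so the
    offset theta_0 is just the weight on that constant feature (and is
    therefore also penalized, as discussed above).
    """
    n, d = X.shape
    A = X.T @ X + n * lam * np.eye(d)   # add a "ridge" of n*lam down the diagonal
    # Solving the linear system is preferable to explicitly inverting A.
    return np.linalg.solve(A, X.T @ y)

# Usage sketch on random data:
# X = np.hstack([np.random.randn(100, 3), np.ones((100, 1))])
# y = np.random.randn(100)
# theta = ridge_analytical(X, y, lam=0.1)
```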

Gradient Descent Solution

Inverting a $d \times d$ matrix takes $O(d^3)$ time, so the analytical solution above is impractical for large $d$. We can fall back on gradient descent, which needs only $O(nd)$ computation per step.

Ridge regression objective:

$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2$$

Gradient of the ridge regression objective with respect to $\theta$:

$$\nabla_\theta J_{\text{ridge}} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right) x^{(i)} + 2\lambda\theta$$

Partial derivative with respect to $\theta_0$:

$$\frac{\partial J_{\text{ridge}}}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} + \theta_0 - y^{(i)} \right)$$
With these derivatives, we can do gradient descent, using either the regular (batch) or stochastic gradient methods. Even better, the objective functions for OLS and ridge regression are convex, which means every minimum they have is a global minimum. This means that, with a small enough step size, gradient descent is guaranteed to find the optimum.
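Here is a sketch of how these gradients plug into a batch gradient descent loop; the step size, iteration count, and the function name `ridge_gradient_descent` are illustrative choices rather than anything prescribed by these notes.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, step_size=0.01, num_steps=1000):
    """Batch gradient descent on the ridge objective.

    X: (n, d) inputs, y: (n,) targets, lam: regularization trade-off.
    theta_0 is kept separate so that it is not penalized.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(num_steps):
        residuals = X @ theta + theta_0 - y                 # (n,) prediction errors
        grad_theta = (2 / n) * (X.T @ residuals) + 2 * lam * theta
        grad_theta_0 = (2 / n) * residuals.sum()            # no penalty term for theta_0
        theta -= step_size * grad_theta
        theta_0 -= step_size * grad_theta_0
    return theta, theta_0
```

Each pass through the loop touches every entry of $X$ once, which is the $O(nd)$ per-step cost mentioned above; a stochastic variant would instead compute the gradient on one (or a few) randomly chosen examples per step.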