Recall that Ordinary Least Squares is a regression problem aiming to minimize the mean squared loss, such that we have objective function:

$$J(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2$$
This problem actually has an analytical solution based on calculus.

Analytical/Closed-Form Solution

We can approach this like a minimization problem by taking the derivative of $J$ with respect to $\theta$, setting it to 0, and then solving for $\theta$. It’s possible to do this by:

  • Finding $\frac{\partial J}{\partial \theta_k}$ for each $k$ in $1, \ldots, d$
  • Constructing a set of equations of the form $\frac{\partial J}{\partial \theta_k} = 0$
  • Solving the system for the values of $\theta_k$ (a sketch of one such equation follows this list).
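
For concreteness, here is what one of those component equations looks like, assuming the mean squared loss written above (the index $k$ and the dimension count $d$ are just the notation assumed here):

$$\frac{\partial J}{\partial \theta_k} = \frac{2}{n}\sum_{i=1}^{n} x_k^{(i)}\left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right) = 0$$

This is just the chain rule applied to each squared term; collecting one such equation per $\theta_k$ (plus the analogous one for $\theta_0$) gives a linear system in the parameters.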

We can work through this in a cool matrix view. Let’s also assume that the inputs $x^{(i)}$ have been augmented with an extra input dimension with a value of 1, so that we can fold $\theta_0$ into $\theta$ and ignore it.

Let us think of our training data in terms of matrices $\tilde{X}$ and $\tilde{Y}$, where each column of $\tilde{X}$ is an example and each column of $\tilde{Y}$ is the corresponding output target value:

$$\tilde{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(n)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(n)} \end{bmatrix} \qquad \tilde{Y} = \begin{bmatrix} y^{(1)} & \cdots & y^{(n)} \end{bmatrix}$$

To make this easier and more consistent with most textbooks, we define a new matrix and vector, $X$ and $Y$, which are just the transposes of our $\tilde{X}$ and $\tilde{Y}$:

$$X = \tilde{X}^T = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix} \qquad Y = \tilde{Y}^T = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$

Now, each row of $X$ corresponds to a sample, so $X$ is $n \times d$ and $Y$ is $n \times 1$.

We can then write our objective as:

$$J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y)$$

Here $X\theta$ is an $n \times 1$ vector of predictions, so $X\theta - Y$ is a vector of differences between predictions and labels. When we compute $(X\theta - Y)^T(X\theta - Y)$, we’re basically taking the dot product of the vector of differences with itself, which achieves a summing and squaring effect.
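
As a quick sanity check, here is a minimal NumPy sketch (the data and variable names are made up for illustration) confirming that the matrix form matches the per-example sum of squared errors:

```python
import numpy as np

# Toy data just for illustration; the names X, Y, theta mirror the text.
rng = np.random.default_rng(0)
n, d = 5, 3                        # 5 samples, 3 features (last one is the appended 1)
X = rng.normal(size=(n, d))
X[:, -1] = 1.0                     # the augmented constant feature
Y = rng.normal(size=(n, 1))
theta = rng.normal(size=(d, 1))    # an arbitrary candidate parameter vector

# Matrix form: (1/n) (X theta - Y)^T (X theta - Y)
diff = X @ theta - Y
J_matrix = (diff.T @ diff).item() / n

# Per-example form: (1/n) * sum of squared prediction errors
J_sum = 0.0
for i in range(n):
    err = (X[i] @ theta).item() - Y[i, 0]   # prediction error on example i
    J_sum += err ** 2
J_sum /= n

print(J_matrix, J_sum)             # the two numbers agree
```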

To solve this problem, we take the gradient (which we will then set to zero):

$$\nabla_\theta J = \frac{2}{n} X^T (X\theta - Y)$$

The gradient has the same shape as $\theta$. The simplified version is basically just the power rule and chain rule applied to the matrix form of the objective above. Verifying the shapes, we see:

$$\underbrace{X^T}_{d \times n}\ \underbrace{(X\theta - Y)}_{n \times 1}$$

so our result would have shape $d \times 1$, which is the shape of $\theta$, which is the shape we want!
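
To build confidence in the gradient formula itself (not just its shape), here is a hedged NumPy sketch comparing the analytic gradient $\frac{2}{n} X^T (X\theta - Y)$ against a finite-difference approximation; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, 1))
theta = rng.normal(size=(d, 1))

def J(th):
    """Mean squared error objective in matrix form."""
    diff = X @ th - Y
    return (diff.T @ diff).item() / n

# Analytic gradient: (2/n) X^T (X theta - Y); same shape as theta, (d, 1).
grad_analytic = (2.0 / n) * X.T @ (X @ theta - Y)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
grad_numeric = np.zeros_like(theta)
for k in range(d):
    e = np.zeros_like(theta)
    e[k, 0] = eps
    grad_numeric[k, 0] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be ~1e-9 or smaller
```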

Setting $\nabla_\theta J$ to $0$ and solving, we get:

$$\frac{2}{n} X^T (X\theta - Y) = 0 \quad\Longrightarrow\quad X^T X \theta = X^T Y \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T Y$$

And the dimensions work out: $(X^T X)^{-1}$ is $d \times d$ and $X^T Y$ is $d \times 1$, so $\theta$ is $d \times 1$!
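
Here is a minimal NumPy sketch of using this closed-form solution on synthetic data (the data and names are illustrative); note that solving the normal equations $X^T X \theta = X^T Y$ directly is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
X = np.hstack([X_raw, np.ones((n, 1))])     # append the constant-1 feature
true_theta = np.array([[2.0], [-1.0], [0.5], [3.0]])
Y = X @ true_theta + 0.01 * rng.normal(size=(n, 1))

# theta = (X^T X)^{-1} X^T Y, computed by solving the normal equations
# X^T X theta = X^T Y rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(theta_hat.ravel())                    # close to [ 2. -1.  0.5  3. ]
```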

This is cool because it’s a rare closed-form solution!

To be really good and proper, we should also check that this solution yields a minimum, not just a critical point. Also, what if $X^T X$ is not invertible?