Regression problems can be expressed in terms of error minimization, as in Ordinary Least Squares. We can also view regression as a probabilistic maximum likelihood estimation problem.

The goal in the regression problem is to make predictions for the target variable $t$ given some new value of the input variable $x$. This is done by using a set of training data comprising $N$ input values $\mathbf{x} = (x_1, \ldots, x_N)^T$ and their corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^T$.

We can express our uncertainty over the value of the target variable using a probability distribution. Given a value of $x$, the corresponding $t$ has a Gaussian distribution with a variance $\sigma^2$ and a mean equal to $y(x, \mathbf{w})$, such that:

$$t = y(x, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Thus, we have:

$$p(t \mid x, \mathbf{w}, \sigma^2) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \sigma^2\right)$$
This is shown in the diagram below. Instead of predicting a point with the function $y(x, \mathbf{w})$, we’re predicting a distribution where $y(x, \mathbf{w})$ is the mean.
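To make the generative view concrete, here’s a minimal NumPy sketch. The polynomial degree, the weights `w_true`, and the noise level `sigma` are all made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def y(x, w):
    """Polynomial mean function: y(x, w) = w[0] + w[1]*x + w[2]*x**2 + ..."""
    return np.polynomial.polynomial.polyval(x, w)

w_true = np.array([0.5, -1.0, 2.0])  # hypothetical weights, for illustration only
sigma = 0.3                          # noise standard deviation, so the variance is sigma**2

x = rng.uniform(0.0, 1.0, size=10)
# Each target is a draw from N(y(x, w), sigma^2) rather than a deterministic point.
t = y(x, w_true) + rng.normal(0.0, sigma, size=x.shape)
```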

We can then use our training data $\mathbf{x}$ and labels $\mathbf{t}$ to determine the values of the unknown parameters $\mathbf{w}$ and $\sigma^2$ by maximum likelihood. If the data is drawn independently from the distribution above, the likelihood function is:

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, \mathbf{w}), \sigma^2\right)$$
Like our Gaussian MLE example, we can maximize the logarithm of the likelihood function:

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$$
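As a sanity check, the log likelihood can be computed directly from this formula. A sketch, reusing the illustrative `y`, `x`, `t`, and noise level from the snippet above:

```python
def log_likelihood(w, x, t, sigma2):
    """ln p(t | x, w, sigma^2) under i.i.d. Gaussian noise."""
    N = len(x)
    residuals = y(x, w) - t
    return (-np.sum(residuals**2) / (2.0 * sigma2)
            - (N / 2.0) * np.log(sigma2)
            - (N / 2.0) * np.log(2.0 * np.pi))
```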
Now, we would want to maximize the above expression with respect to $\mathbf{w}$:

  • We can ignore the 2nd and 3rd terms because they don’t depend on $\mathbf{w}$.
  • Scaling the log likelihood by a positive constant coefficient doesn’t change the location of the maximum with respect to $\mathbf{w}$, so we can replace $\frac{1}{2\sigma^2}$ with just $\frac{1}{2}$.
  • Minimizing the negative log likelihood is equivalent to maximizing the log likelihood.
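Putting these observations together:

$$\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \arg\min_{\mathbf{w}} \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$$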

Thus, our MLE is equivalent, as far as $\mathbf{w}$ is concerned, to minimizing the sum-of-squares error defined by:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$$
(The square guarantees that each term contributes a non-negative value to the error, so residuals of opposite sign cannot cancel.)
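For a polynomial $y(x, \mathbf{w})$, minimizing $E(\mathbf{w})$ is ordinary least squares, which has a closed-form solution. A sketch, again using the illustrative data from above (`np.linalg.lstsq` minimizes exactly this squared error, up to the constant $\frac{1}{2}$ factor):

```python
def fit_w_ml(x, t, degree):
    """Minimize the sum-of-squares error E(w) for a polynomial of the given degree."""
    # Design matrix whose j-th column is x**j.
    X = np.vander(x, degree + 1, increasing=True)
    # lstsq solves min_w ||X w - t||^2, which has the same minimizer as E(w).
    w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w_ml

w_ml = fit_w_ml(x, t, degree=2)
```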

Thus, we’ve shown how the sum-of-squares error function arises as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine $\sigma^2$. Maximizing the logarithm of the likelihood function with respect to $\sigma^2$ gives:

$$\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \right\}^2$$
We can first determine the parameter vector $\mathbf{w}_{\mathrm{ML}}$ governing the mean, and subsequently use it to find the variance $\sigma^2_{\mathrm{ML}}$, as was the case for the Gaussian MLE.
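In code, this two-step procedure amounts to a single residual calculation after the fit, continuing the illustrative example:

```python
# Step 1 gave us w_ml via least squares; step 2 plugs it into the variance formula.
sigma2_ml = np.mean((y(x, w_ml) - t) ** 2)
```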

Since we have determined the parameters $\mathbf{w}_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$, we can now make predictions for new values of $x$. Our model is probabilistic, so these are expressed in terms of the predictive distribution, which gives the probability distribution over $t$ rather than simply a point estimate. It is obtained by substituting the maximum likelihood parameters into our original distribution expression to get:

$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \sigma^2_{\mathrm{ML}}) = \mathcal{N}\left(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \sigma^2_{\mathrm{ML}}\right)$$
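With both estimates in hand, a prediction for a new input is a full Gaussian rather than a single number. A final sketch, with `x_new` chosen arbitrarily:

```python
x_new = 0.5
mean = y(x_new, w_ml)      # center of the predictive Gaussian
std = np.sqrt(sigma2_ml)   # its spread, shared across all x under this model
print(f"t | x={x_new} ~ N({mean:.3f}, {std:.3f}^2)")
```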