The goal of univariate regression is to predict a single scalar output $y$ from an input $x$ using a model $f[x, \phi]$ with parameters $\phi$. To use the loss function recipe with a maximum likelihood approach, we predict a univariate normal distribution (Gaussian), which is defined over $y \in \mathbb{R}$.

The Gaussian has two parameters, the mean $\mu$ and the variance $\sigma^2$, and has a probability density function of

$$Pr(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right].$$

We then need to set the model $f[x, \phi]$ to compute one or more parameters of this distribution. Here, we just compute the mean, so $\mu = f[x, \phi]$:

$$Pr(y \mid f[x, \phi], \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - f[x, \phi])^2}{2\sigma^2}\right].$$

We aim to find the parameters $\hat{\phi}$ that make the training data $\{x_i, y_i\}_{i=1}^{I}$ most probable under this distribution. To do this, we choose a loss function based on the negative log-likelihood:

$$L[\phi] = -\sum_{i=1}^{I} \log Pr(y_i \mid f[x_i, \phi], \sigma^2).$$

When we train the model, we seek the parameters $\hat{\phi}$ that minimize this loss.
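As a concrete illustration, here is a minimal NumPy sketch of this negative log-likelihood for a toy linear model that predicts the mean; the data, the function name `gaussian_nll`, and the fixed variance are illustrative assumptions, not part of the recipe itself.

```python
import numpy as np

def gaussian_nll(y, mu, sigma2=1.0):
    """Negative log-likelihood of targets y under normal distributions
    with predicted means mu and a fixed variance sigma2."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma2)
                  + (y - mu) ** 2 / (2 * sigma2))

# Toy model f[x, phi]: a straight line mu = phi[0] + phi[1] * x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.2])
phi = np.array([0.0, 1.0])          # [intercept, slope]
mu = phi[0] + phi[1] * x            # predicted means
print(gaussian_nll(y, mu))          # the loss we minimize during training
```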

Least squares loss function

We can perform some algebraic manipulation on the loss function above:

$$\hat{\phi} = \mathrm{argmin}_{\phi}\left[-\sum_{i=1}^{I} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - f[x_i, \phi])^2}{2\sigma^2}\right]\right)\right]$$
$$= \mathrm{argmin}_{\phi}\left[-\sum_{i=1}^{I} \left(\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(y_i - f[x_i, \phi])^2}{2\sigma^2}\right)\right]$$
$$= \mathrm{argmin}_{\phi}\left[\sum_{i=1}^{I} \frac{(y_i - f[x_i, \phi])^2}{2\sigma^2}\right]$$
$$= \mathrm{argmin}_{\phi}\left[\sum_{i=1}^{I} \left(y_i - f[x_i, \phi]\right)^2\right].$$

  • We removed the first term between the 2nd and 3rd lines because it doesn’t depend on $\phi$, so it doesn’t affect the position of the minimum.
  • We removed the denominator $2\sigma^2$ between the 3rd and 4th lines because it’s just a constant positive scaling factor, so it doesn’t affect the position of the minimum.

The result that we arrive at is the least squares loss function that we use for linear regression.

We can see that least squares and maximum likelihood loss are equivalent for the normal distribution.
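To make this equivalence concrete, here is a small sketch (with made-up data and a one-parameter model) that evaluates both criteria over a grid of candidate parameters and checks that they are minimized at the same place:

```python
import numpy as np

# Toy data and a one-parameter model mu = phi * x (purely illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.2])
phis = np.linspace(0.0, 2.0, 201)    # candidate parameter values

least_squares = np.array([np.sum((y - p * x) ** 2) for p in phis])
neg_log_lik = np.array([np.sum(0.5 * np.log(2 * np.pi)
                               + 0.5 * (y - p * x) ** 2) for p in phis])

# Both criteria pick out the same parameter value
assert np.argmin(least_squares) == np.argmin(neg_log_lik)
print(phis[np.argmin(least_squares)])
```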

  • (a) Consider the linear model we saw for linear regression. The least squares criterion minimizes the sum of the squares of the deviations (dashed lines) between the model prediction (green line) and the ground truth outputs (orange points). Here the fit is good, so these deviations are small.
  • (b) For these parameters, the fit is bad, and the squared deviations are large.
  • (c) The least squares criterion follows from the assumption that the model predicts the mean of a normal distribution over the outputs and that we maximize the probability. For the first case, the model fits well, so the probability of the data (horizontal orange dashed lines) is large, which in turn means the negative log probability is small.
  • (d) For this case, the model fits badly, so the probability is small and the negative log probability is large.

Inference

The network no longer directly predicts $y$ but instead predicts the mean $\mu = f[x, \phi]$ of the normal distribution over $y$. When we perform inference, we usually want a single “best” point estimate $\hat{y}$, so we take the maximum of the predicted distribution:

$$\hat{y} = \mathrm{argmax}_{y}\, Pr(y \mid f[x, \hat{\phi}]).$$

For the univariate normal distribution, the maximum position is determined by the mean parameter $\mu$. This is exactly what the model computed, so $\hat{y} = f[x, \hat{\phi}]$.
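A quick numerical check of this fact, using SciPy’s normal density with hypothetical values for the predicted mean and standard deviation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.58, 0.4               # hypothetical predicted mean and std. dev.
ys = np.linspace(-2.0, 5.0, 7001)   # grid of candidate outputs y
density = norm.pdf(ys, loc=mu, scale=sigma)

# The density peaks at the mean, so the best point estimate is y_hat = mu
y_hat = ys[np.argmax(density)]
print(y_hat)                        # approximately 1.58
```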

Estimating variance

To formulate the least squares loss function, we assumed that the network predicts the mean of a normal distribution. The final expression above for the best parameters did not depend on the variance $\sigma^2$. However, we can easily also treat $\sigma^2$ as a learned quantity; then, we minimize the loss function with respect to both the model parameters $\phi$ and the variance $\sigma^2$:

$$\hat{\phi}, \hat{\sigma}^2 = \mathrm{argmin}_{\phi, \sigma^2}\left[-\sum_{i=1}^{I} \log Pr(y_i \mid f[x_i, \phi], \sigma^2)\right].$$

In inference, the model predicts the mean $\mu = f[x, \hat{\phi}]$ from the input, and we learned the variance $\hat{\sigma}^2$ during the training process. The mean $\mu$ is the best prediction, and the variance $\hat{\sigma}^2$ tells us about the uncertainty of this prediction.
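The sketch below (toy data, illustrative parameter names) jointly fits the mean parameters and the variance by minimizing the negative log-likelihood with a generic SciPy optimizer; the variance is parameterized through its logarithm to keep it positive:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data; the model for the mean is mu = phi0 + phi1 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 0.9, 2.3, 2.8, 4.3])

def nll(params):
    phi0, phi1, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)          # ensures the variance stays positive
    mu = phi0 + phi1 * x
    return np.sum(0.5 * np.log(2 * np.pi * sigma2)
                  + (y - mu) ** 2 / (2 * sigma2))

result = minimize(nll, x0=np.array([0.0, 1.0, 0.0]))
phi0_hat, phi1_hat, log_sigma2_hat = result.x
print(phi0_hat, phi1_hat, np.exp(log_sigma2_hat))   # fitted mean parameters and variance
```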