Let’s make the notions of noise, bias, and variance mathematically precise.
Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$. We can observe different outputs $y$ for the same input $x$; so for each $x$, there is a distribution $Pr(y[x])$ with expected value (mean) $\mu[x]$:

$$\mu[x] = \mathbb{E}_y\bigl[y[x]\bigr] = \int y[x]\,Pr(y[x])\,dy[x],$$

and fixed noise $\sigma^2 = \mathbb{E}_y\bigl[(y[x] - \mu[x])^2\bigr]$.
- Here we have used the notation $y[x]$ to specify that we are considering the output $y$ at a given input position $x$.
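As a concrete illustration, here is a minimal NumPy sketch of such a data generation process; the true function (a sine curve), the noise level `sigma`, and the sample count are arbitrary choices for illustration, not anything fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # noise standard deviation; the noise variance is sigma**2

def mu(x):
    # Arbitrary illustrative choice for the true underlying mean mu[x].
    return np.sin(2 * np.pi * x)

def sample_y(x, n_samples):
    # Draw n_samples noisy observations y[x] at the same input x.
    return mu(x) + sigma * rng.normal(size=n_samples)

# Observing many outputs at one input recovers the mean mu[x] and noise sigma^2.
x = 0.25
ys = sample_y(x, 100_000)
print(np.mean(ys), mu(x))    # empirical mean ~= mu[x]
print(np.var(ys), sigma**2)  # empirical variance ~= sigma^2
```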
Definition: Expectation
Consider a function $g[x]$ and a probability distribution $Pr(x)$ defined over $x$. The expected value of the function $g[\cdot]$ of a random variable $x$ with respect to the probability $Pr(x)$ is defined as:

$$\mathbb{E}_x\bigl[g[x]\bigr] = \int g[x]\,Pr(x)\,dx.$$
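For intuition, the integral can be approximated by a Monte Carlo average over samples drawn from $Pr(x)$; a small sketch, where the choices $g[x] = x^2$ and a standard normal $Pr(x)$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

g = lambda x: x**2               # arbitrary function g[x]
xs = rng.normal(size=1_000_000)  # samples from Pr(x), here a standard normal

# E_x[g[x]] ~= the sample average of g over draws from Pr(x).
print(np.mean(g(xs)))            # ~= 1.0, since E[x^2] = 1 for a standard normal
```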
Now consider a least squares loss between the model prediction $f[x,\phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$
\begin{aligned}
L[x] &= \bigl(f[x,\phi] - y[x]\bigr)^2 \\
&= \bigl((f[x,\phi] - \mu[x]) + (\mu[x] - y[x])\bigr)^2 \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2,
\end{aligned}
$$

where we have both added and subtracted the mean $\mu[x]$ of the underlying function in the second line and have expanded out the squared term in the third line.
The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. The expected loss is:

$$
\begin{aligned}
\mathbb{E}_y\bigl[L[x]\bigr] &= \mathbb{E}_y\Bigl[\bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - \mathbb{E}_y\bigl[y[x]\bigr]\bigr) + \mathbb{E}_y\Bigl[\bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + \mathbb{E}_y\Bigl[\bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + \sigma^2,
\end{aligned}
$$

where we have made use of the rules for manipulating expectations.
- In the second line, we have distributed the expectation operator and removed it from terms with no dependence on $y[x]$.
- In the third line, we note that the second term is zero since $\mathbb{E}_y\bigl[y[x]\bigr] = \mu[x]$ by definition.
- In the fourth line, we have substituted in the definition of the noise $\sigma^2 = \mathbb{E}_y\bigl[(\mu[x] - y[x])^2\bigr]$.
We can see that the expected loss has been broken down into two terms: the first term is the squared deviation between the model and the true function mean, and the second term is the noise.
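This two-term decomposition is easy to verify numerically: hold the model prediction fixed and average the loss over many draws of $y[x]$. A minimal sketch, where the mean, noise level, and fixed prediction `f_pred` are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

mu_x, sigma = 0.5, 0.3  # mean mu[x] and noise level at some fixed input x
f_pred = 0.8            # arbitrary fixed model prediction f[x, phi]

ys = mu_x + sigma * rng.normal(size=1_000_000)  # draws of y[x]

expected_loss = np.mean((f_pred - ys) ** 2)      # Monte Carlo E_y[(f - y)^2]
decomposition = (f_pred - mu_x) ** 2 + sigma**2  # (f - mu)^2 + sigma^2
print(expected_loss, decomposition)              # agree up to sampling error
```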
The first term can be further partitioned into bias and variance.
The parameters $\phi$ of the model depend on the training dataset $\mathcal{D} = \{x_i, y_i\}$; so to be more precise, we should write $f[x, \phi[\mathcal{D}]]$. The training dataset is a random sample from the data generation process; with a different sample of training data, we would learn different parameter values. The expected model output $f_\mu[x]$ with respect to all possible datasets $\mathcal{D}$ is hence:

$$f_\mu[x] = \mathbb{E}_{\mathcal{D}}\Bigl[f\bigl[x, \phi[\mathcal{D}]\bigr]\Bigr].$$
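This expectation over datasets can also be approximated by resampling: fit the model to many independently drawn training sets and average the predictions. A sketch under illustrative assumptions (polynomial least-squares model, sine-curve generator, arbitrary degree and dataset size):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n_train, degree = 0.3, 20, 3  # illustrative noise level, dataset size, model degree

def fit_and_predict(x_test):
    # Sample a fresh training set D, fit phi[D] by polynomial least squares,
    # and return the prediction f[x_test, phi[D]].
    x = rng.uniform(0, 1, size=n_train)
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n_train)
    return np.polyval(np.polyfit(x, y, deg=degree), x_test)

# f_mu[x] ~= the average prediction over many resampled training sets.
x_test = 0.25
preds = np.array([fit_and_predict(x_test) for _ in range(2000)])
print(preds.mean())  # Monte Carlo estimate of f_mu[x_test]
```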
Returning to the first term of the expected loss above, we add and subtract $f_\mu[x]$ and expand:

$$
\begin{aligned}
\bigl(f[x,\phi[\mathcal{D}]] - \mu[x]\bigr)^2 &= \bigl((f[x,\phi[\mathcal{D}]] - f_\mu[x]) + (f_\mu[x] - \mu[x])\bigr)^2 \\
&= \bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2 + 2\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)\bigl(f_\mu[x] - \mu[x]\bigr) + \bigl(f_\mu[x] - \mu[x]\bigr)^2.
\end{aligned}
$$
We then take the expectation with respect to the training dataset $\mathcal{D}$:

$$\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - \mu[x]\bigr)^2\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr] + \bigl(f_\mu[x] - \mu[x]\bigr)^2.$$

- The middle term vanishes: the factor $(f_\mu[x] - \mu[x])$ does not depend on $\mathcal{D}$ and can be moved outside the expectation, leaving $\mathbb{E}_{\mathcal{D}}\bigl[f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr] = 0$ by the definition of $f_\mu[x]$; the final term is likewise unchanged by the expectation since it has no dependence on $\mathcal{D}$.
Finally, substituting this result into the expected loss gives:

$$\mathbb{E}_{\mathcal{D}}\Bigl[\mathbb{E}_y\bigl[L[x]\bigr]\Bigr] = \underbrace{\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr]}_{\text{variance}} + \underbrace{\bigl(f_\mu[x] - \mu[x]\bigr)^2}_{\text{bias}} + \underbrace{\sigma^2}_{\text{noise}}.$$
This equation says that the expected loss, after accounting for the uncertainty in both the training data and the test data, consists of three additive components (checked numerically in the sketch after this list):
- The variance is the uncertainty in the fitted model due to the particular training dataset we sample.
- The bias is the systematic deviation of the model from the mean of the function we are modeling.
- The noise is the inherent uncertainty in the true mapping from input to output.
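Putting the pieces together, the sketch below estimates all three components at a single test input and checks that they sum to the overall expected loss; it reuses the illustrative sine-curve generator and polynomial model from the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n_train, degree, x_test = 0.3, 20, 3, 0.25
mu_test = np.sin(2 * np.pi * x_test)  # true mean mu[x_test]

def fit_and_predict():
    # One training set D -> fitted parameters phi[D] -> prediction f[x_test, phi[D]].
    x = rng.uniform(0, 1, size=n_train)
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n_train)
    return np.polyval(np.polyfit(x, y, deg=degree), x_test)

preds = np.array([fit_and_predict() for _ in range(5000)])  # f over many datasets D
ys = mu_test + sigma * rng.normal(size=5000)                # test observations y[x_test]

variance = preds.var()                   # E_D[(f - f_mu)^2]
bias_sq = (preds.mean() - mu_test) ** 2  # (f_mu - mu)^2
noise = sigma**2                         # sigma^2

expected_loss = np.mean((preds - ys) ** 2)  # E_D[E_y[L[x]]]; preds and ys independent
print(variance + bias_sq + noise, expected_loss)  # equal up to sampling error
```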
These sources of error will be present for any task. We’ve seen that they combine additively for regression with least squares loss; however, their interaction can be more complex for other problem types.