Let’s make the notions of noise, bias, and variance mathematically precise.
Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$. We can observe different outputs $y$ for the same input $x$; so for each $x$, there is a distribution $Pr(y[x])$ with expected value (mean) $\mu[x]$:

$$\mu[x] = \mathbb{E}_y\bigl[y[x]\bigr] = \int y[x]\,Pr(y[x])\,dy[x],$$

and fixed noise $\sigma^2 = \mathbb{E}_y\bigl[(y[x] - \mu[x])^2\bigr]$.
- Here we have used the notation $y[x]$ to specify that we are considering the output $y$ at a given input position $x$.
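As a concrete illustration, here is a minimal NumPy sketch of such a data generation process; the true function (a sine curve), the noise level `sigma`, and the sample count are arbitrary choices for illustration, not anything fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # noise standard deviation; the noise variance is sigma**2

def mu(x):
    # Arbitrary illustrative choice for the true underlying mean mu[x].
    return np.sin(2 * np.pi * x)

def sample_y(x, n_samples):
    # Draw n_samples noisy observations y[x] at the same input x.
    return mu(x) + sigma * rng.normal(size=n_samples)

# Observing many outputs at one input recovers the mean mu[x] and noise sigma^2.
x = 0.25
ys = sample_y(x, 100_000)
print(np.mean(ys), mu(x))    # empirical mean ~= mu[x]
print(np.var(ys), sigma**2)  # empirical variance ~= sigma^2
```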
Definition: Expectation
Consider a function $g[x]$ and a probability distribution $Pr(x)$ defined over $x$. The expected value of the function $g[\cdot]$ of a random variable $x$ with respect to the probability $Pr(x)$ is defined as:

$$\mathbb{E}_x\bigl[g[x]\bigr] = \int g[x]\,Pr(x)\,dx.$$
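For intuition, the integral can be approximated by a Monte Carlo average over samples drawn from $Pr(x)$; a small sketch, where the choices $g[x] = x^2$ and a standard normal $Pr(x)$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

g = lambda x: x**2               # arbitrary function g[x]
xs = rng.normal(size=1_000_000)  # samples from Pr(x), here a standard normal

# E_x[g[x]] ~= the sample average of g over draws from Pr(x).
print(np.mean(g(xs)))            # ~= 1.0, since E[x^2] = 1 for a standard normal
```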
Now consider a least squares loss between the model prediction $f[x,\phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$
\begin{aligned}
L[x] &= \bigl(f[x,\phi] - y[x]\bigr)^2 \\
&= \bigl((f[x,\phi] - \mu[x]) + (\mu[x] - y[x])\bigr)^2 \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2,
\end{aligned}
$$

where we have both added and subtracted the mean $\mu[x]$ of the underlying function in the second line and have expanded out the squared term in the third line.
The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. The expected loss is:

$$
\begin{aligned}
\mathbb{E}_y\bigl[L[x]\bigr] &= \mathbb{E}_y\Bigl[\bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - \mathbb{E}_y\bigl[y[x]\bigr]\bigr) + \mathbb{E}_y\Bigl[\bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + \mathbb{E}_y\Bigl[\bigl(\mu[x] - y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + \sigma^2,
\end{aligned}
$$

where we have made use of the rules for manipulating expectations.
- In the second line, we have distributed the expectation operator and removed it from terms with no dependence on $y[x]$.
- In the third line, we note that the second term is zero since $\mathbb{E}_y\bigl[y[x]\bigr] = \mu[x]$ by definition.
- In the fourth line, we have substituted in the definition of the noise $\sigma^2 = \mathbb{E}_y\bigl[(\mu[x] - y[x])^2\bigr]$.
We can see that the expected loss has been broken down into two terms: the first term is the squared deviation between the model and the true function mean, and the second term is the noise.
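This two-term decomposition is easy to verify numerically: hold the model prediction fixed and average the loss over many draws of $y[x]$. A minimal sketch, where the mean, noise level, and fixed prediction `f_pred` are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

mu_x, sigma = 0.5, 0.3  # mean mu[x] and noise level at some fixed input x
f_pred = 0.8            # arbitrary fixed model prediction f[x, phi]

ys = mu_x + sigma * rng.normal(size=1_000_000)  # draws of y[x]

expected_loss = np.mean((f_pred - ys) ** 2)      # Monte Carlo E_y[(f - y)^2]
decomposition = (f_pred - mu_x) ** 2 + sigma**2  # (f - mu)^2 + sigma^2
print(expected_loss, decomposition)              # agree up to sampling error
```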
The first term can be further partitioned into bias and variance.
The parameters $\phi$ of the model depend on the training dataset $\mathcal{D} = \{x_i, y_i\}$; so to be more precise, we should write $f[x, \phi[\mathcal{D}]]$. The training dataset is a random sample from the data generation process; with a different sample of training data, we would learn different parameter values. The expected model output $f_\mu[x]$ with respect to all possible datasets $\mathcal{D}$ is hence:

$$f_\mu[x] = \mathbb{E}_{\mathcal{D}}\Bigl[f\bigl[x, \phi[\mathcal{D}]\bigr]\Bigr].$$
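This expectation over datasets can also be approximated by resampling: fit the model to many independently drawn training sets and average the predictions. A sketch under illustrative assumptions (polynomial least-squares model, sine-curve generator, arbitrary degree and dataset size):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n_train, degree = 0.3, 20, 3  # illustrative noise level, dataset size, model degree

def fit_and_predict(x_test):
    # Sample a fresh training set D, fit phi[D] by polynomial least squares,
    # and return the prediction f[x_test, phi[D]].
    x = rng.uniform(0, 1, size=n_train)
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n_train)
    return np.polyval(np.polyfit(x, y, deg=degree), x_test)

# f_mu[x] ~= the average prediction over many resampled training sets.
x_test = 0.25
preds = np.array([fit_and_predict(x_test) for _ in range(2000)])
print(preds.mean())  # Monte Carlo estimate of f_mu[x_test]
```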
Returning to the first term of the expected loss above, we add and subtract $f_\mu[x]$ and expand:

$$
\begin{aligned}
\bigl(f[x,\phi[\mathcal{D}]] - \mu[x]\bigr)^2 &= \bigl((f[x,\phi[\mathcal{D}]] - f_\mu[x]) + (f_\mu[x] - \mu[x])\bigr)^2 \\
&= \bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2 + 2\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)\bigl(f_\mu[x] - \mu[x]\bigr) + \bigl(f_\mu[x] - \mu[x]\bigr)^2.
\end{aligned}
$$
We then take the expectation with respect to the training dataset $\mathcal{D}$:

$$\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - \mu[x]\bigr)^2\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr] + \bigl(f_\mu[x] - \mu[x]\bigr)^2.$$

- The middle term vanishes: the factor $(f_\mu[x] - \mu[x])$ does not depend on $\mathcal{D}$ and can be moved outside the expectation, leaving $\mathbb{E}_{\mathcal{D}}\bigl[f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr] = 0$ by the definition of $f_\mu[x]$; the final term is likewise unchanged by the expectation since it has no dependence on $\mathcal{D}$.
Finally, substituting this result into the expected loss gives:

$$\mathbb{E}_{\mathcal{D}}\Bigl[\mathbb{E}_y\bigl[L[x]\bigr]\Bigr] = \underbrace{\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr]}_{\text{variance}} + \underbrace{\bigl(f_\mu[x] - \mu[x]\bigr)^2}_{\text{bias}} + \underbrace{\sigma^2}_{\text{noise}}.$$
This equation says that the expected loss, after accounting for the uncertainty in both the training data and the test data, consists of three additive components (checked numerically in the sketch after this list):
- The variance is the uncertainty in the fitted model due to the particular training dataset we sample.
- The bias is the systematic deviation of the model from the mean of the function we are modeling.
- The noise is the inherent uncertainty in the true mapping from input to output.
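Putting the pieces together, the sketch below estimates all three components at a single test input and checks that they sum to the overall expected loss; it reuses the illustrative sine-curve generator and polynomial model from the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n_train, degree, x_test = 0.3, 20, 3, 0.25
mu_test = np.sin(2 * np.pi * x_test)  # true mean mu[x_test]

def fit_and_predict():
    # One training set D -> fitted parameters phi[D] -> prediction f[x_test, phi[D]].
    x = rng.uniform(0, 1, size=n_train)
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n_train)
    return np.polyval(np.polyfit(x, y, deg=degree), x_test)

preds = np.array([fit_and_predict() for _ in range(5000)])  # f over many datasets D
ys = mu_test + sigma * rng.normal(size=5000)                # test observations y[x_test]

variance = preds.var()                   # E_D[(f - f_mu)^2]
bias_sq = (preds.mean() - mu_test) ** 2  # (f_mu - mu)^2
noise = sigma**2                         # sigma^2

expected_loss = np.mean((preds - ys) ** 2)  # E_D[E_y[L[x]]]; preds and ys independent
print(variance + bias_sq + noise, expected_loss)  # equal up to sampling error
```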
These sources of error will be present for any task. We’ve seen that they combine additively for regression with least squares loss; however, their interaction can be more complex for other problem types.