With sufficient capacity, a neural network can often fit its training data perfectly; however, this does not necessarily mean that it generalizes well to new test data.

We will see that test error has three distinct causes and that their relative contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of training data, and (iii) the choice of model. The latter dependency raises the issue of hyperparameter search. We discuss how to select both the model hyperparameters (e.g., the number of hidden layers and the number of hidden units in each) and the learning algorithm hyperparameters (e.g., the learning rate and batch size).

What are the sources of error that cause a model to fail to generalize?

Noise, bias and variance

There are three possible sources of error: noise, bias, and variance. The figure below illustrates each of these for the running example above, in which we fit a quasi-sinusoidal function using a three-region shallow neural network.

Noise

The data generation process includes the addition of noise, so there are multiple possible valid outputs for each input. This source of error is insurmountable for test data. It does not necessarily limit the training performance; we will likely never see the same input twice during training, so it is still possible to fit the training data perfectly.
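A minimal sketch of such a data generation process is below. Both the exact quasi-sinusoidal function and the Gaussian noise model are assumptions for illustration; the text specifies neither:

```python
import numpy as np

def true_function(x):
    # Hypothetical quasi-sinusoidal ground truth; the exact form is an
    # assumption, since the text only says "quasi-sinusoidal".
    return np.sin(2.0 * np.pi * x) + 0.3 * np.sin(6.0 * np.pi * x)

def generate_data(n, sigma=0.1, rng=None):
    # Data generation process: sample inputs uniformly, evaluate the
    # true function, then add Gaussian noise with standard deviation sigma.
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_function(x) + rng.normal(0.0, sigma, size=n)
    return x, y

# The same input can yield different valid outputs:
rng = np.random.default_rng(0)
x0 = np.full(2, 0.5)
y0 = true_function(x0) + rng.normal(0.0, 0.1, size=2)
print(y0)  # two distinct valid outputs for the identical input 0.5
```

The last lines print two different valid outputs for the same input, which is exactly the component of the test error that no model can remove.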

Noise may arise because there is a genuine stochastic element to the data generation process, because some of the data are mislabeled, or because there are further explanatory variables that are not observed. In rare cases, noise may be absent; for example, a network might approximate a function that is deterministic but requires significant computation to evaluate. However, noise is usually a fundamental limitation on the possible test performance.

Bias

A second source of error occurs when the model is not flexible enough (lacks the capacity) to fit the true function perfectly. For example, the three-region neural network cannot exactly describe a quasi-sinusoidal function, even when the parameters are chosen optimally. This is known as bias.
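To see the bias in isolation, we can fit the small model to dense, noiseless samples of the true function, so that any remaining error is attributable to limited capacity alone. The sketch below assumes a shallow 1D ReLU network with two hidden units (which produces at most three linear regions) and the same hypothetical quasi-sinusoid as before; gradient descent may not find the globally optimal parameters, so the printed error only upper-bounds the true bias:

```python
import numpy as np

def true_function(x):
    # Hypothetical quasi-sinusoid (an assumption, as before).
    return np.sin(2.0 * np.pi * x) + 0.3 * np.sin(6.0 * np.pi * x)

def fit_shallow_relu(x, y, hidden=2, steps=20000, lr=0.05, seed=1):
    # Shallow 1D ReLU network; two hidden units give at most three
    # linear regions (assumed to correspond to the three-region model).
    rng = np.random.default_rng(seed)
    w = rng.normal(size=hidden); b = rng.normal(size=hidden)
    v = rng.normal(size=hidden); c = 0.0
    n = len(x)
    for _ in range(steps):
        pre = np.outer(x, w) + b                # (n, hidden) pre-activations
        h = np.maximum(0.0, pre)                # ReLU
        err_grad = 2.0 * (h @ v + c - y) / n    # d(MSE)/d(prediction)
        back = err_grad[:, None] * v[None, :] * (pre > 0.0)
        v -= lr * (h.T @ err_grad)
        c -= lr * err_grad.sum()
        w -= lr * (back * x[:, None]).sum(axis=0)
        b -= lr * back.sum(axis=0)
    return np.maximum(0.0, np.outer(x, w) + b) @ v + c

# Dense, noiseless data: any remaining error is pure bias.
x = np.linspace(0.0, 1.0, 200)
y = true_function(x)
pred = fit_shallow_relu(x, y)
print(np.mean((pred - y) ** 2))  # > 0: the model cannot match the function
```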

Variance

We have limited training samples, and there is no way to distinguish systematic changes in the underlying function from noise in the data. When we fit a model, we therefore do not get the closest possible approximation to the true underlying function: for different training datasets, the result will be slightly different each time. This additional source of variability in the fitted function is called variance. In practice, there might also be additional variance due to the stochastic learning algorithm, which does not necessarily converge to the same solution each time.
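The sketch below makes the variance term concrete by fitting the same model to many training sets drawn from the same process and measuring how much the fitted function varies. To keep the code short, a cubic least-squares fit stands in for the neural network (a substitution; the mechanism, variability of the fit across datasets, is unchanged):

```python
import numpy as np

def true_function(x):
    # Hypothetical quasi-sinusoid (an assumption, as in earlier sketches).
    return np.sin(2.0 * np.pi * x) + 0.3 * np.sin(6.0 * np.pi * x)

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 100)
fits = []
for _ in range(50):
    # Each trial draws a fresh small training set from the same process,
    # then fits the model to it.
    x = rng.uniform(0.0, 1.0, size=15)
    y = true_function(x) + rng.normal(0.0, 0.1, size=15)
    coeffs = np.polyfit(x, y, deg=3)   # cubic stands in for the network
    fits.append(np.polyval(coeffs, x_grid))

fits = np.stack(fits)
# Variance of the fitted function across training sets, averaged over inputs:
print(fits.var(axis=0).mean())
```

Increasing the training set size (here, 15 points per trial) shrinks this number, which is why variance, unlike noise, can be reduced by collecting more data.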