The universal approximation theorem states that for any continuous function on a compact domain, there exists a shallow network that can approximate it to any specified precision.

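Let’s say we have a shallow neural network with a scalar input $x$, a scalar output $y$, and $D$ hidden units, of the form:

$$y = \phi_0 + \sum_{d=1}^{D} \phi_d h_d$$

where the $d$-th hidden unit is:

$$h_d = a(\theta_{d0} + \theta_{d1} x)$$

where $a(\cdot)$ is an activation function such as ReLU. We can then write the neural network as:

$$y = \phi_0 + \sum_{d=1}^{D} \phi_d \, a(\theta_{d0} + \theta_{d1} x)$$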
The number of hidden units in a shallow network is a measure of the network capacity.
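
As a rough illustration, a shallow network of this kind can be sketched in a few lines of NumPy. The parameter names (`theta0`, `theta1` for the hidden-unit offsets and slopes, `phi0`, `phi` for the output offset and weights) and the numeric values are assumptions made for this example, and the input is taken to be a scalar.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_network(x, theta0, theta1, phi0, phi):
    """Compute y = phi0 + sum_d phi[d] * relu(theta0[d] + theta1[d] * x)."""
    x = np.asarray(x, dtype=float)
    # Hidden activations, one row per input and one column per hidden unit.
    h = relu(theta0 + np.outer(x, theta1))
    return phi0 + h @ phi

# D = 3 hidden units; the values are arbitrary and only for illustration.
theta0 = np.array([0.0, -1.0, -2.0])
theta1 = np.array([1.0, 1.0, 1.0])
phi0 = 0.5
phi = np.array([1.0, -2.0, 2.0])

y = shallow_network([0.0, 1.0, 2.0, 3.0], theta0, theta1, phi0, phi)
print(y)  # [0.5 1.5 0.5 1.5]
```

Changing the length of the parameter arrays changes the number of hidden units, and with it the capacity of the sketch.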

With ReLU activation functions, the output of a network with $D$ hidden units has at most $D$ “joints” and so is a piecewise linear function with at most $D+1$ linear regions.
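
One way to see where the joints come from is to compute them directly. The sketch below assumes each hidden unit computes `relu(theta0[d] + theta1[d] * x)` for a scalar input `x`; the function name `joint_locations` and the parameter values are hypothetical. A unit contributes a joint where its pre-activation crosses zero, so the count below is an upper bound: joints can coincide, and a unit with a zero output weight adds no visible joint.

```python
import numpy as np

def joint_locations(theta0, theta1):
    """Inputs where a ReLU unit switches on or off: theta0 + theta1 * x = 0."""
    theta0 = np.asarray(theta0, dtype=float)
    theta1 = np.asarray(theta1, dtype=float)
    has_slope = theta1 != 0  # a unit with zero slope never crosses zero
    return np.unique(-theta0[has_slope] / theta1[has_slope])

# D = 3 hidden units; the values are arbitrary and only for illustration.
theta0 = np.array([1.0, -1.0, -2.0])
theta1 = np.array([1.0, 1.0, 1.0])

joints = joint_locations(theta0, theta1)
num_regions = len(joints) + 1
print(joints)
print(num_regions)  # 4, i.e. D + 1 for D = 3 distinct joints
```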

As we add more hidden units, the model can approximate more complex functions. With enough capacity (more hidden units), a shallow network can describe any continuous 1D function defined on a compact subset of the real line to arbitrary precision.

To see this, consider that every time we add a hidden unit, we can add another linear region to the function. More regions mean that each one covers a smaller section of the function, which in turn means a better approximation.
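
This argument can be checked numerically with a small sketch. Rather than training a network, the code below hand-constructs the parameters of a ReLU network with `D` hidden units so that it linearly interpolates a target function at equally spaced points. The target (`np.sin` on $[0, 2\pi]$), the function names, and the parameter layout are all assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_relu_interpolant(f, a, b, D):
    """Hand-construct a D-unit ReLU network that linearly interpolates f on [a, b]."""
    knots = np.linspace(a, b, D + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)  # slope of each of the D segments
    theta0 = -knots[:-1]                         # unit d switches on at knots[d]
    theta1 = np.ones(D)
    phi0 = f(knots[0])
    phi = np.diff(slopes, prepend=0.0)           # change in slope at each joint
    return theta0, theta1, phi0, phi

def predict(x, theta0, theta1, phi0, phi):
    return phi0 + relu(theta0 + np.outer(x, theta1)) @ phi

x = np.linspace(0.0, 2.0 * np.pi, 2001)
errors = {}
for D in (2, 4, 8, 16, 32):
    params = fit_relu_interpolant(np.sin, 0.0, 2.0 * np.pi, D)
    # Worst-case absolute error over a dense grid of test inputs.
    errors[D] = np.max(np.abs(predict(x, *params) - np.sin(x)))
    print(D, errors[D])
```

The printed maximum error shrinks as `D` grows, which is the behaviour the argument predicts for this particular construction; a trained network would generally place its joints differently.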

Width Version

The width version of this theorem states that there exists a network with one hidden layer containing a finite number of hidden units that can approximate any specified continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy.