The universal approximation theorem states that for any continuous function on a compact domain, there exists a shallow network that can approximate this function to any specified precision.
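Let’s say we have a shallow neural network with scalar input $x$ and scalar output $y$ of the form:
$$y = \phi_0 + \sum_{d=1}^{D} \phi_d h_d$$
where the $d$-th hidden unit is:
$$h_d = a[\theta_{d0} + \theta_{d1} x]$$
where $a[\cdot]$ is an activation function such as ReLU. We can then write the neural network as:
$$y = \phi_0 + \sum_{d=1}^{D} \phi_d \, a[\theta_{d0} + \theta_{d1} x]$$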
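A minimal sketch of this forward pass for a scalar input may help make the structure concrete. It assumes each hidden unit has one bias and one slope; the names `theta`, `phi`, and `shallow_network`, and all parameter values, are illustrative choices rather than anything fixed by the text.

```python
def relu(z):
    """ReLU activation: pass positive values through, clip negatives to zero."""
    return max(0.0, z)


def shallow_network(x, theta, phi):
    """Evaluate a one-hidden-layer network on a scalar input x.

    theta: list of (bias, slope) pairs, one per hidden unit (assumed layout).
    phi:   output bias followed by one output weight per hidden unit.
    """
    hidden = [relu(t0 + t1 * x) for (t0, t1) in theta]
    return phi[0] + sum(p * h for p, h in zip(phi[1:], hidden))


# Hypothetical parameters for a network with three hidden units.
theta = [(-0.2, 1.0), (0.5, -1.0), (-1.0, 2.0)]
phi = [0.1, 1.0, -0.5, 0.3]

y = shallow_network(0.7, theta, phi)
print(y)
```

At `x = 0.7` the second hidden unit is inactive (its pre-activation is negative), so only the first and third units contribute to the output.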
The number of hidden units in a shallow network is a measure of the network capacity.
With ReLU activation functions, the output of a network with $D$ hidden units has at most $D$ “joints” and so is a piecewise linear function with at most $D+1$ linear regions.
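The joint count can be checked directly: a ReLU hidden unit with bias `t0` and slope `t1` changes from inactive to active (or back) where `t0 + t1 * x = 0`. The sketch below, which assumes the same hypothetical `(bias, slope)` parameter layout, collects these positions and shows why the bound is "at most": two units can share a joint.

```python
def joint_locations(theta):
    """Return the sorted, distinct x-positions where a hidden unit switches.

    theta: list of (bias, slope) pairs, one per hidden unit (assumed layout).
    Units with zero slope are constant and contribute no joint.
    """
    joints = {-t0 / t1 for (t0, t1) in theta if t1 != 0.0}
    return sorted(joints)


# Three hidden units; the second and third happen to switch at the same x.
theta = [(-0.2, 1.0), (0.5, -1.0), (-1.0, 2.0)]

joints = joint_locations(theta)
max_regions = len(joints) + 1  # upper bound on the number of linear regions

print(joints, max_regions)
```

Here three hidden units give only two distinct joints, so the output has at most three linear regions rather than the maximum of four.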
As we add more hidden units, the model can approximate more complex functions. With enough capacity (more hidden units), a shallow network can describe any continuous 1D function defined on a compact subset of the real line to arbitrary precision.
To see this, consider that every time we add a hidden unit, we can add another linear region to the function. More regions mean that each one covers a smaller section of the function, which in turn means a better approximation.
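This argument can be illustrated numerically. The sketch below does not train anything: it hand-constructs a ReLU network whose joints sit at evenly spaced knots so that it linearly interpolates a target function, then measures the worst-case error as the number of hidden units grows. The target `sin(2πx)`, the interval `[0, 1]`, and all function names are assumptions made for the example.

```python
import math


def relu(z):
    return max(0.0, z)


def build_interpolating_network(f, a, b, num_hidden):
    """Build a shallow ReLU network that linearly interpolates f on [a, b].

    One hidden unit is placed at each of the first num_hidden knots; its
    output weight is the change in slope of the interpolant at that knot.
    """
    knots = [a + (b - a) * i / num_hidden for i in range(num_hidden + 1)]
    values = [f(x) for x in knots]
    slopes = [
        (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
        for i in range(num_hidden)
    ]
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, num_hidden)]
    bias = values[0]

    def network(x):
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots[:-1]))

    return network


def max_error(f, network, a, b, samples=2001):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - network(x)) for x in xs)


def target(x):
    return math.sin(2.0 * math.pi * x)


widths = [2, 4, 8, 16, 32]
errors = [
    max_error(target, build_interpolating_network(target, 0.0, 1.0, d), 0.0, 1.0)
    for d in widths
]

for d, e in zip(widths, errors):
    print(f"{d:3d} hidden units: max error {e:.4f}")
```

Under these assumptions the measured error shrinks each time the number of hidden units doubles, which is the behaviour the region-counting argument predicts.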
Width Version
The width version of this theorem states that there exists a network with one hidden layer containing a finite number of hidden units that can approximate any specified continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy.