As an entrypoint to deep neural networks, we first consider composing two shallow neural networks so that the output of the first becomes the input of the second.

Consider two shallow networks with 3 hidden units each.

The first network takes an input $x$, returns an output $y$, and is defined by:

$$h_1 = a[\theta_{10} + \theta_{11}x], \qquad h_2 = a[\theta_{20} + \theta_{21}x], \qquad h_3 = a[\theta_{30} + \theta_{31}x],$$

and

$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3.$$

The second network takes $y$ as input, returns $y'$, and is defined by:

$$h_1' = a[\theta_{10}' + \theta_{11}'y], \qquad h_2' = a[\theta_{20}' + \theta_{21}'y], \qquad h_3' = a[\theta_{30}' + \theta_{31}'y],$$

and

$$y' = \phi_0' + \phi_1' h_1' + \phi_2' h_2' + \phi_3' h_3'.$$
  • The first network maps inputs $x$ to outputs $y$ using a function comprising three linear regions whose slopes alternate in sign (a fourth linear region lies outside the range of the graph). Multiple inputs $x$ (gray circles) now map to the same output $y$ (cyan circle).
  • The second network defines a function comprising three linear regions that takes $y$ and returns $y'$ (i.e., the cyan circle is mapped to the brown circle).
  • The combined effect of these two functions when composed is that three different inputs are mapped to any given value of $y$ by the first network, and each is processed in the same way by the second network.
  • The result is that the function defined by the second network in panel (c) is duplicated three times, variously flipped and rescaled according to the slopes of the regions in panel (b).
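To make this concrete, here is a minimal NumPy sketch of the two networks defined above with ReLU activations. The parameter values are illustrative assumptions (they are not taken from the figure), chosen so that the first network's slope alternates in sign across its three linear regions.

```python
import numpy as np

def shallow_net(x, theta, phi):
    """Shallow network with three hidden units and ReLU activation a[.].

    theta: (3, 2) array; row d holds [theta_d0, theta_d1].
    phi:   (4,)   array [phi_0, phi_1, phi_2, phi_3].
    """
    pre = theta[:, 0] + theta[:, 1] * x     # pre-activations theta_d0 + theta_d1 * x
    h = np.maximum(pre, 0.0)                # ReLU hidden units h_1, h_2, h_3
    return phi[0] + phi[1:] @ h             # linear combination of activations

# Illustrative (assumed) parameters: the first network's regions on [0, 1.5]
# have slopes +1, -1, +1, and the second network's joints fall inside the
# first network's output range [0, 0.5].
theta1 = np.array([[0.0, 1.0], [-0.5, 1.0], [-1.0, 1.0]])
phi1   = np.array([0.0, 1.0, -2.0, 2.0])
theta2 = np.array([[0.1, 1.0], [-0.2, 1.0], [-0.4, 1.0]])
phi2   = np.array([0.0, 1.0, -2.0, 2.0])

def composed(x):
    y = shallow_net(x, theta1, phi1)        # first network: x -> y
    return shallow_net(y, theta2, phi2)     # second network: y -> y'
```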

With ReLU activations, this model also describes a family of piecewise linear functions. However, the number of linear regions is potentially greater than for a shallow network with 6 hidden units. To see this, consider choosing the first network to produce three alternating regions of positive and negative slope (panel b above). This means that three different ranges of $x$ are mapped to the same output range of $y$, and the subsequent mapping from this range of $y$ to $y'$ is applied three times. The overall effect is that the function defined by the second network is duplicated three times, creating nine linear regions.
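This count can be checked numerically with the sketch above: record which ReLUs are active at each point on a grid; each maximal run of grid points sharing the same activation pattern lies inside one linear region.

```python
def activation_pattern(x):
    """Which of the six ReLUs (three per network) are active at input x."""
    pre1 = theta1[:, 0] + theta1[:, 1] * x
    y = phi1[0] + phi1[1:] @ np.maximum(pre1, 0.0)
    pre2 = theta2[:, 0] + theta2[:, 1] * y
    return tuple(np.concatenate([pre1, pre2]) > 0)

xs = np.linspace(0.01, 1.49, 2001)
patterns = [activation_pattern(x) for x in xs]
regions = 1 + sum(p != q for p, q in zip(patterns[:-1], patterns[1:]))
print("linear regions:", regions)           # 9 for these illustrative parameters
```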

The same principle applies in higher dimensions as well.

A different way to think about composing networks is that the first network “folds” input space onto itself so that multiple inputs generate the same output. Then the second network applies a function, which is replicated at all points that were folded on top of one another.
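A small check of this folding picture, again with the assumed parameters from the sketch above (the particular inputs below are chosen for those parameters and have no special meaning in general): three inputs from different linear regions of the first network fold onto the same $y$ and therefore receive the same $y'$.

```python
# Three inputs from different linear regions of the first network that are
# folded onto the same y (true for the assumed parameters, not in general).
for x in [0.3, 0.7, 1.3]:
    y = shallow_net(x, theta1, phi1)
    print(f"x = {x:.1f} -> y = {y:.3f} -> y' = {shallow_net(y, theta2, phi2):.3f}")
# Each line prints the same y (here 0.300) and hence the same y'.
```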

From composing networks to deep networks

We can show that our composition of networks is a special case of a deep neural network with two hidden layers.

The output of the first network ($y$) is a linear combination of the activations of its hidden units. The pre-activations of the second network ($\theta_{10}' + \theta_{11}'y$, $\theta_{20}' + \theta_{21}'y$, and $\theta_{30}' + \theta_{31}'y$) are linear functions of the output of the first network. Applying one linear function to another yields another linear function.

Substituting the expression for $y$ into the calculation of the hidden units of the second network gives:

$$\begin{aligned}
h_1' &= a[\theta_{10}' + \theta_{11}'(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)] \\
h_2' &= a[\theta_{20}' + \theta_{21}'(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)] \\
h_3' &= a[\theta_{30}' + \theta_{31}'(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)],
\end{aligned}$$

which can be re-written as:

$$\begin{aligned}
h_1' &= a[\psi_{10} + \psi_{11} h_1 + \psi_{12} h_2 + \psi_{13} h_3] \\
h_2' &= a[\psi_{20} + \psi_{21} h_1 + \psi_{22} h_2 + \psi_{23} h_3] \\
h_3' &= a[\psi_{30} + \psi_{31} h_1 + \psi_{32} h_2 + \psi_{33} h_3],
\end{aligned}$$

where $\psi_{10} = \theta_{10}' + \theta_{11}'\phi_0$, $\psi_{11} = \theta_{11}'\phi_1$, $\psi_{12} = \theta_{11}'\phi_2$, and so on. The result is a network with two hidden layers.
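The same reduction can be checked numerically. Continuing the earlier sketch (same illustrative parameters), we build the $\psi$ parameters from $\theta'$ and $\phi$ and confirm that the resulting two-hidden-layer network matches the composition of the two shallow networks.

```python
# psi[d, 0] = theta'_d0 + theta'_d1 * phi_0;  psi[d, j] = theta'_d1 * phi_j
psi = np.empty((3, 4))
psi[:, 0] = theta2[:, 0] + theta2[:, 1] * phi1[0]
psi[:, 1:] = np.outer(theta2[:, 1], phi1[1:])

def two_layer_net(x):
    h1 = np.maximum(theta1[:, 0] + theta1[:, 1] * x, 0.0)   # first hidden layer
    h2 = np.maximum(psi[:, 0] + psi[:, 1:] @ h1, 0.0)       # second hidden layer
    return phi2[0] + phi2[1:] @ h2                          # output y'

xs = np.linspace(0.0, 1.5, 101)
assert np.allclose([two_layer_net(x) for x in xs], [composed(x) for x in xs])
```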

It follows that a network with two hidden layers can represent the family of functions created by passing the output of one shallow network into another. It represents a broader family, because the nine slope parameters $\psi_{11}, \psi_{12}, \ldots, \psi_{33}$ can take arbitrary values, whereas the corresponding parameters of the composed networks are constrained to be the outer product $[\theta_{11}', \theta_{21}', \theta_{31}']^{T}[\phi_1, \phi_2, \phi_3]$.
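In the sketch above this constraint is visible directly: the $3\times 3$ slope block of $\psi$ is an outer product and so has rank one, whereas a general two-hidden-layer network could use any $3\times 3$ matrix of slopes.

```python
# Rank of the 3x3 slope block from the composed networks: one, because it is
# the outer product [theta'_11, theta'_21, theta'_31]^T [phi_1, phi_2, phi_3].
print(np.linalg.matrix_rank(psi[:, 1:]))    # prints 1; a general network could be rank 3
```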