As an entry point to deep neural networks, we first consider composing two shallow neural networks so that the output of the first becomes the input of the second.

Consider two shallow networks with 3 hidden units each.

The first network takes an input $x$ and returns an output $y$ and is defined by:

$$h_1 = a[\theta_{10} + \theta_{11}x], \quad h_2 = a[\theta_{20} + \theta_{21}x], \quad h_3 = a[\theta_{30} + \theta_{31}x],$$

where $a[\cdot]$ denotes the activation function (a ReLU here), and

$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3.$$

The second network takes $y$ as input, returns $y'$, and is defined by:

$$h'_1 = a[\theta'_{10} + \theta'_{11}y], \quad h'_2 = a[\theta'_{20} + \theta'_{21}y], \quad h'_3 = a[\theta'_{30} + \theta'_{31}y]$$

and

$$y' = \phi'_0 + \phi'_1 h'_1 + \phi'_2 h'_2 + \phi'_3 h'_3.$$

  • The first network maps inputs $x$ to outputs $y$ using a function comprising three linear regions that are chosen so that they alternate the sign of their slope (a fourth linear region lies outside the range of the graph). Multiple inputs (gray circles) now map to the same output (cyan circle).
  • The second network defines a function comprising three linear regions that takes $y$ and returns $y'$ (i.e., the cyan circle is mapped to the brown circle).
  • The combined effect of these two functions when composed is that three different inputs are mapped to any given value of $y$ by the first network and are then processed in the same way by the second network.
  • The result is that the function defined by the second network in panel (c) is duplicated three times, variously flipped and rescaled according to the slope of the corresponding region in panel (b).

With ReLU activations, this model also describes a family of piecewise linear functions. However, the number of linear regions is potentially greater than for a shallow network with 6 hidden units. To see this, consider choosing the first network to produce three alternating regions of positive and negative slope (panel b above). This means that three different ranges of $x$ are mapped to the same output range of $y$, and the subsequent mapping from this range of $y$ to $y'$ is applied three times. The overall effect is that the function defined by the second network is duplicated three times, creating nine linear regions.
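To make this concrete, here is a minimal numpy sketch that composes two shallow ReLU networks with three hidden units each and counts the linear regions of the result on a fine grid. The parameter values (theta1, phi1, theta2, phi2) are hand-picked illustrations, not the values behind the figure above; they are chosen so that the first network's slopes alternate in sign and both joints of the second network fall inside the range of $y$, which yields the full $3 \times 3 = 9$ regions.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def shallow(x, theta, phi):
    """Shallow ReLU network with 3 hidden units; rows of theta are [bias, slope]."""
    h = relu(theta[:, 0][:, None] + theta[:, 1][:, None] * x[None, :])
    return phi[0] + phi[1:] @ h

# Illustrative parameters (assumed for this sketch):
theta1 = np.array([[0.0, 1.0], [-0.4, 1.0], [-0.8, 1.0]])
phi1   = np.array([0.1, 1.0, -2.0, 2.0])     # slopes of y over [0, 1]: +1, -1, +1
theta2 = np.array([[0.0, 1.0], [-0.15, 1.0], [-0.25, 1.0]])
phi2   = np.array([0.0, 1.0, -2.0, 2.0])     # joints at y = 0.15 and y = 0.25

x  = np.linspace(0.0, 1.0, 100001)
y  = shallow(x, theta1, phi1)    # first network:  x -> y
yp = shallow(y, theta2, phi2)    # second network: y -> y'

# Count linear regions of the composition: flag "kinks" via second differences,
# merging flags in adjacent grid cells that belong to the same kink.
kink   = np.abs(np.diff(yp, 2)) > 1e-8
joints = int(kink[0]) + np.count_nonzero(kink[1:] & ~kink[:-1])
print("linear regions of the composed function:", joints + 1)   # 9 here
```

Moving the second network's joints outside the range of $y$ produced by the first network removes some of these regions, which is why nine is a maximum rather than a guarantee.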

The same principle applies in higher dimensions as well.

A different way to think about composing networks is that the first network “folds” input space onto itself so that multiple inputs generate the same output. Then the second network applies a function, which is replicated at all points that were folded on top of one another.
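A tiny numerical check makes this folding picture concrete. Using the same kind of hand-picked parameters as in the sketch above (again, assumed values, not those behind the figures), the first network sends three different inputs to the same output $y$, so the second network, which only ever sees $y$, necessarily assigns them the same final value $y'$:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def shallow(x, theta, phi):
    """Shallow ReLU network with 3 hidden units; rows of theta are [bias, slope]."""
    h = relu(theta[:, 0][:, None] + theta[:, 1][:, None] * x[None, :])
    return phi[0] + phi[1:] @ h

theta1 = np.array([[0.0, 1.0], [-0.4, 1.0], [-0.8, 1.0]])
phi1   = np.array([0.1, 1.0, -2.0, 2.0])
theta2 = np.array([[0.0, 1.0], [-0.15, 1.0], [-0.25, 1.0]])
phi2   = np.array([0.0, 1.0, -2.0, 2.0])

# Three inputs that the first network "folds" onto the same value y = 0.3.
x = np.array([0.2, 0.6, 1.0])
y = shallow(x, theta1, phi1)
print(y)                          # ~[0.3 0.3 0.3]
print(shallow(y, theta2, phi2))   # the same y' for all three inputs
```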

From composing networks to deep networks

We can show that our composition of networks is a special case of a deep neural network with two hidden layers. The first layer is defined, exactly as in the first network, by:

$$h_1 = a[\theta_{10} + \theta_{11}x], \quad h_2 = a[\theta_{20} + \theta_{21}x], \quad h_3 = a[\theta_{30} + \theta_{31}x].$$

The output of the first network ($y$) is a linear combination of the activations of these hidden units. The first operations of the second network (the pre-activations $\theta'_{i0} + \theta'_{i1}y$) are linear in the output of the first network. Applying one linear function to another yields another linear function.

Substituting the expression for $y$ into the calculation of the hidden units of the second network gives:

$$\begin{aligned}
h'_1 &= a[\theta'_{10} + \theta'_{11}\phi_0 + \theta'_{11}\phi_1 h_1 + \theta'_{11}\phi_2 h_2 + \theta'_{11}\phi_3 h_3] \\
h'_2 &= a[\theta'_{20} + \theta'_{21}\phi_0 + \theta'_{21}\phi_1 h_1 + \theta'_{21}\phi_2 h_2 + \theta'_{21}\phi_3 h_3] \\
h'_3 &= a[\theta'_{30} + \theta'_{31}\phi_0 + \theta'_{31}\phi_1 h_1 + \theta'_{31}\phi_2 h_2 + \theta'_{31}\phi_3 h_3],
\end{aligned}$$

which can be re-written as:

$$\begin{aligned}
h'_1 &= a[\psi_{10} + \psi_{11} h_1 + \psi_{12} h_2 + \psi_{13} h_3] \\
h'_2 &= a[\psi_{20} + \psi_{21} h_1 + \psi_{22} h_2 + \psi_{23} h_3] \\
h'_3 &= a[\psi_{30} + \psi_{31} h_1 + \psi_{32} h_2 + \psi_{33} h_3],
\end{aligned}$$

where $\psi_{10} = \theta'_{10} + \theta'_{11}\phi_0$, $\psi_{11} = \theta'_{11}\phi_1$, and so on. Finally, we can define the output by:

$$y' = \phi'_0 + \phi'_1 h'_1 + \phi'_2 h'_2 + \phi'_3 h'_3.$$
The result is a network with two hidden layers.

It follows that a network with two hidden layers can represent the family of functions created by passing the output of one single-layer network into another. In fact, it represents a broader family, because the nine slope parameters $\psi_{11}, \psi_{12}, \ldots, \psi_{33}$ can take arbitrary values, whereas in the composition they are constrained to form the outer product $[\theta'_{11}, \theta'_{21}, \theta'_{31}]^T[\phi_1, \phi_2, \phi_3]$ (i.e., $\psi_{ij} = \theta'_{i1}\phi_j$).
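The sketch below checks this equivalence numerically for one set of randomly drawn parameters (the variable names and values are assumptions for this illustration, not anything defined above): it composes two shallow networks directly, then builds the corresponding two-hidden-layer network using $\psi_{i0} = \theta'_{i0} + \theta'_{i1}\phi_0$ and $\psi_{ij} = \theta'_{i1}\phi_j$, and confirms that both produce the same outputs.

```python
import numpy as np

rng  = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def shallow(x, th, ph):
    """Shallow ReLU network with 3 hidden units; rows of th are [bias, slope]."""
    h = relu(th[:, 0][:, None] + th[:, 1][:, None] * x[None, :])
    return ph[0] + ph[1:] @ h

# Arbitrary parameters for the two shallow networks.
theta   = rng.standard_normal((3, 2))   # first network hidden units
phi     = rng.standard_normal(4)        # first network output: bias + 3 weights
theta_p = rng.standard_normal((3, 2))   # second network hidden units
phi_p   = rng.standard_normal(4)        # second network output: bias + 3 weights

x = np.linspace(-3.0, 3.0, 1001)
y_composed = shallow(shallow(x, theta, phi), theta_p, phi_p)

# Equivalent two-hidden-layer network: psi_i0 = theta'_i0 + theta'_i1 * phi_0,
# and the slope matrix is the outer product psi_ij = theta'_i1 * phi_j.
psi0 = theta_p[:, 0] + theta_p[:, 1] * phi[0]
Psi  = np.outer(theta_p[:, 1], phi[1:])

h  = relu(theta[:, 0][:, None] + theta[:, 1][:, None] * x[None, :])   # first hidden layer
hp = relu(psi0[:, None] + Psi @ h)                                    # second hidden layer
y_deep = phi_p[0] + phi_p[1:] @ hp

print(np.allclose(y_composed, y_deep))   # True: same function, different parameterization
```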

Considering the above equations leads to another way of thinking about how the network constructs an increasingly complex function:

  1. The three hidden units $h_1$, $h_2$, and $h_3$ in the first layer are computed as usual by forming linear functions of the input and passing these through ReLU activation functions.
  2. The pre-activations at the second hidden layer are computed by taking three new linear functions of these hidden units. At this point, we effectively have a shallow network with three outputs; we have computed three piecewise linear functions, with the “joints” between linear regions in the same places (see the numerical sketch after this list).
  3. At the second hidden layer, another ReLU function is applied to each function, which clips them and adds new “joints” to each.
  4. The final output is a linear combination of these hidden units.
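The following numpy sketch walks through these four steps with randomly drawn parameters (assumed values, purely for illustration) and prints approximate joint locations. It shows that the three second-layer pre-activations kink at the same positions, inherited from $h_1$, $h_2$, and $h_3$, while any extra joints in the output come from the second ReLU clipping the pre-activations where they cross zero. The helper kink_locations is defined here just for this demonstration.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def kink_locations(x, f, tol=1e-8):
    """Approximate x positions where the piecewise linear function f changes slope."""
    d2  = np.abs(np.diff(f, 2))
    idx = np.flatnonzero(d2 > tol)
    if idx.size == 0:
        return np.array([])
    keep = np.concatenate(([True], np.diff(idx) > 1))   # merge flags from the same joint
    return x[idx[keep] + 1]

rng   = np.random.default_rng(1)
theta = rng.standard_normal((3, 2))   # step 1: first-layer [bias, slope] per hidden unit
Psi   = rng.standard_normal((3, 4))   # step 2: [bias, 3 weights] per second-layer unit
phi_p = rng.standard_normal(4)        # step 4: output bias + 3 weights

x   = np.linspace(-3.0, 3.0, 100001)
h   = relu(theta[:, 0][:, None] + theta[:, 1][:, None] * x[None, :])   # step 1
pre = Psi[:, 0][:, None] + Psi[:, 1:] @ h                              # step 2: pre-activations
hp  = relu(pre)                                                        # step 3: clipping
yp  = phi_p[0] + phi_p[1:] @ hp                                        # step 4: output

for i in range(3):
    print(f"pre-activation {i} joints:", np.round(kink_locations(x, pre[i]), 3))
print("output joints:          ", np.round(kink_locations(x, yp), 3))
# The pre-activations share joint locations; additional output joints appear only
# where a pre-activation crosses zero and is clipped by the second ReLU.
```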

We can think of each layer either as “folding” the input space or as creating new functions, which are clipped (creating new regions) and then recombined. The former view emphasizes the dependencies in the output but not how clipping creates new joints; the latter has the opposite emphasis. Both provide only partial insight into how deep neural networks operate.

It’s important not to lose sight of the fact that this is still merely an equation relating the input $x$ to the output $y'$. We can combine all the equations to get one expression:

$$\begin{aligned}
y' = \phi'_0 &+ \phi'_1\, a\bigl[\psi_{10} + \psi_{11}\, a[\theta_{10} + \theta_{11}x] + \psi_{12}\, a[\theta_{20} + \theta_{21}x] + \psi_{13}\, a[\theta_{30} + \theta_{31}x]\bigr] \\
&+ \phi'_2\, a\bigl[\psi_{20} + \psi_{21}\, a[\theta_{10} + \theta_{11}x] + \psi_{22}\, a[\theta_{20} + \theta_{21}x] + \psi_{23}\, a[\theta_{30} + \theta_{31}x]\bigr] \\
&+ \phi'_3\, a\bigl[\psi_{30} + \psi_{31}\, a[\theta_{10} + \theta_{11}x] + \psi_{32}\, a[\theta_{20} + \theta_{21}x] + \psi_{33}\, a[\theta_{30} + \theta_{31}x]\bigr].
\end{aligned}$$
Matrix Notation

We can describe our composition above in matrix notation as:

$$\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = a\!\left[\begin{bmatrix} \theta_{10} \\ \theta_{20} \\ \theta_{30} \end{bmatrix} + \begin{bmatrix} \theta_{11} \\ \theta_{21} \\ \theta_{31} \end{bmatrix} x\right]$$

and

$$\begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix} = a\!\left[\begin{bmatrix} \psi_{10} \\ \psi_{20} \\ \psi_{30} \end{bmatrix} + \begin{bmatrix} \psi_{11} & \psi_{12} & \psi_{13} \\ \psi_{21} & \psi_{22} & \psi_{23} \\ \psi_{31} & \psi_{32} & \psi_{33} \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix}\right]$$

and

$$y' = \phi'_0 + \begin{bmatrix} \phi'_1 & \phi'_2 & \phi'_3 \end{bmatrix} \begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix},$$
or even more compactly as

$$y' = \phi'_0 + \boldsymbol{\phi}'\, a\!\left[\boldsymbol{\psi}_{0} + \boldsymbol{\Psi}\, a[\boldsymbol{\theta}_{0} + \boldsymbol{\theta}\, x]\right],$$

where the bold quantities $\boldsymbol{\theta}_{0}$, $\boldsymbol{\theta}$, $\boldsymbol{\psi}_{0}$, $\boldsymbol{\Psi}$, and $\boldsymbol{\phi}'$ collect the corresponding biases, slopes, and weights into vectors and matrices and, in each case, the function $a[\cdot]$ applies the activation function separately to every element of its vector input.
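As a minimal sketch of this matrix form (the parameter values below are arbitrary placeholders), the function evaluates $y' = \phi'_0 + \boldsymbol{\phi}'\, a[\boldsymbol{\psi}_0 + \boldsymbol{\Psi}\, a[\boldsymbol{\theta}_0 + \boldsymbol{\theta}x]]$ for a scalar input:

```python
import numpy as np

def relu(z):                               # a[.]: applied elementwise to its vector input
    return np.maximum(0.0, z)

# Arbitrary placeholder parameters in the matrix notation above.
theta0 = np.array([-0.2, 0.1, 0.4])        # first-layer biases,  shape (3,)
theta  = np.array([ 0.9, -1.1, 0.7])       # first-layer slopes,  shape (3,) for a scalar input
psi0   = np.array([ 0.3, -0.5, 0.2])       # second-layer biases, shape (3,)
Psi    = np.array([[ 0.8, -0.3,  0.6],
                   [-0.7,  0.5,  0.4],
                   [ 0.2,  0.9, -1.0]])    # second-layer weight matrix, shape (3, 3)
phi0_p = 0.1                               # output bias
phi_p  = np.array([ 1.0, -2.0, 0.5])       # output weights, shape (3,)

def two_layer(x):
    """y' = phi'_0 + phi' a[psi_0 + Psi a[theta_0 + theta x]] for a scalar input x."""
    h  = relu(theta0 + theta * x)          # first hidden layer,  shape (3,)
    hp = relu(psi0 + Psi @ h)              # second hidden layer, shape (3,)
    return phi0_p + phi_p @ hp             # scalar output y'

print(two_layer(0.5))
```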