We have seen that composing shallow networks can give us complex functions. We can extend this to construct deep networks with more than two hidden layers; modern networks have hundreds of layers with thousands of hidden units at each layer.

  • The number of hidden units in each layer is referred to as the width of the network
  • The number of hidden layers is the depth
  • The total number of hidden units is a measure of network capacity

Hyperparameters

We denote the number of layers as $K$ and the number of hidden units in each layer as $D$. These are examples of hyperparameters: quantities chosen before we learn the model parameters (i.e., the slope and intercept terms). For fixed hyperparameters (e.g., a given choice of $K$ layers with $D$ hidden units each), the model describes a family of functions, and the parameters determine the particular function. Hence, when we also consider the hyperparameters, we can think of neural networks as representing a family of families of functions relating input to output.
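
The following NumPy sketch illustrates this distinction under stated assumptions (the helper name `sample_network` and the specific sizes are illustrative, not from the text): fixing the hyperparameters $K$ and $D$ fixes the shapes of the weight matrices and bias vectors, while each draw of their values picks one particular function from the family.

```python
import numpy as np

# Hypothetical helper: the name and the argument choices are illustrative assumptions.
def sample_network(D_i, D, K, D_o, rng):
    """Hyperparameters K (number of hidden layers) and D (units per layer) fix the
    shapes of the parameters; sampling values picks one function from the family."""
    sizes = [D_i] + [D] * K + [D_o]
    weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(K + 1)]
    biases = [rng.standard_normal(sizes[k + 1]) for k in range(K + 1)]
    return weights, biases

rng = np.random.default_rng(0)
weights, biases = sample_network(D_i=2, D=4, K=3, D_o=1, rng=rng)
print([W.shape for W in weights])   # [(4, 2), (4, 4), (4, 4), (1, 4)] -- fixed by K and D
```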

General Formulation

  • See Matrix Notation for an introduction to matrix notation for a simple composition of shallow networks

We denote the vector of hidden units at layer $k$ as $\mathbf{h}_k$, the vector of biases (intercepts) that contribute to hidden layer $k+1$ as $\boldsymbol{\beta}_k$, and the weights (slopes) that are applied to the $k$-th layer and contribute to the $(k+1)$-th layer as $\boldsymbol{\Omega}_k$. A general deep network with $K$ layers can now be written as:

$$
\begin{aligned}
\mathbf{h}_1 &= a[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}] \\
\mathbf{h}_2 &= a[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1] \\
&\;\;\vdots \\
\mathbf{h}_K &= a[\boldsymbol{\beta}_{K-1} + \boldsymbol{\Omega}_{K-1} \mathbf{h}_{K-1}] \\
\mathbf{y} &= \boldsymbol{\beta}_K + \boldsymbol{\Omega}_K \mathbf{h}_K,
\end{aligned}
$$

where $a[\cdot]$ is the activation function applied elementwise. The parameters of this model comprise all of the weight matrices $\{\boldsymbol{\Omega}_k\}_{k=0}^{K}$ and bias vectors $\{\boldsymbol{\beta}_k\}_{k=0}^{K}$.

  • If the $k$-th hidden layer has $D_k$ hidden units, then the bias vector $\boldsymbol{\beta}_{k-1}$ will be of size $D_k$. The last bias vector $\boldsymbol{\beta}_K$ has the size of the output, $D_o$.
  • The first weight matrix $\boldsymbol{\Omega}_0$ has size $D_1 \times D_i$, where $D_i$ is the size of the input.
  • The last weight matrix $\boldsymbol{\Omega}_K$ is $D_o \times D_K$, and the remaining matrices $\boldsymbol{\Omega}_k$ are $D_{k+1} \times D_k$ (the sketch after this list makes these shapes concrete).
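
The following NumPy sketch makes the recursion and the shapes above concrete. The ReLU activation, the helper names, and the particular widths are assumptions for illustration, not fixed by the text.

```python
import numpy as np

def relu(z):
    # One possible choice of activation a[.]; the text does not fix it here.
    return np.maximum(z, 0.0)

def forward(x, omegas, betas):
    """Evaluate y = beta_K + Omega_K h_K, with h_k = a[beta_{k-1} + Omega_{k-1} h_{k-1}]."""
    h = x
    for Omega, beta in zip(omegas[:-1], betas[:-1]):   # hidden layers h_1 .. h_K
        h = relu(beta + Omega @ h)
    return betas[-1] + omegas[-1] @ h                  # output layer is affine (no activation)

# Shapes follow the bullet list above: D_i = 3 inputs, hidden widths D_1 = 4, D_2 = 5, output D_o = 2.
rng = np.random.default_rng(1)
widths = [3, 4, 5, 2]                                  # [D_i, D_1, D_2, D_o]
omegas = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(len(widths) - 1)]
betas = [rng.standard_normal(widths[k + 1]) for k in range(len(widths) - 1)]

y = forward(np.array([1.0, -0.5, 2.0]), omegas, betas)
print(y.shape)   # (2,)
```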

We can equivalently write the network as a single function:

$$
\mathbf{y} = \boldsymbol{\beta}_K + \boldsymbol{\Omega}_K\, a\!\left[\boldsymbol{\beta}_{K-1} + \boldsymbol{\Omega}_{K-1}\, a\!\left[\cdots\, a\!\left[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1\, a\!\left[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}\right]\right] \cdots\right]\right]
$$
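
As a quick check of this equivalence (again a NumPy sketch with an assumed ReLU activation and $K = 2$), the nested expression produces the same output as evaluating the layers one at a time:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(2)
D_i, D1, D2, D_o = 2, 3, 3, 1
O0, O1, O2 = (rng.standard_normal((D1, D_i)),
              rng.standard_normal((D2, D1)),
              rng.standard_normal((D_o, D2)))
b0, b1, b2 = rng.standard_normal(D1), rng.standard_normal(D2), rng.standard_normal(D_o)
x = np.array([0.7, -1.2])

# Layer-by-layer recursion.
h1 = relu(b0 + O0 @ x)
h2 = relu(b1 + O1 @ h1)
y_step = b2 + O2 @ h2

# Equivalent single nested function.
y_nested = b2 + O2 @ relu(b1 + O1 @ relu(b0 + O0 @ x))

assert np.allclose(y_step, y_nested)
```
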
In diagram form: