Let’s repeat the backpropagation toy example but for a three-layer network.

The intuition and much of the algebra are identical. The main differences are that the intermediate variables are vectors, the biases are vectors, the weights are matrices, and we are using ReLU functions rather than simple algebraic functions like $\sin[\bullet]$, $\exp[\bullet]$, and $\cos[\bullet]$.

Forward pass

We write the network as a series of sequential calculations:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x} \\
\mathbf{h}_1 &= a[\mathbf{f}_0] \\
\mathbf{f}_1 &= \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1 \\
\mathbf{h}_2 &= a[\mathbf{f}_1] \\
\mathbf{f}_2 &= \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2 \mathbf{h}_2 \\
\mathbf{h}_3 &= a[\mathbf{f}_2] \\
\mathbf{f}_3 &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3 \mathbf{h}_3 \\
\ell &= \mathrm{l}[\mathbf{f}_3, \mathbf{y}],
\end{aligned}
$$

where:

  • $\mathbf{f}_0, \mathbf{f}_1, \mathbf{f}_2$ represent the pre-activations at the three hidden layers (values before the ReLU function $a[\bullet]$), and $\mathbf{f}_3$ is the network output
  • $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$ represent the activations at the three hidden layers (after the ReLU function)
  • The term $\mathrm{l}[\mathbf{f}_3, \mathbf{y}]$ represents the loss function, and $\ell$ is the resulting loss

In the forward pass, we work through these calculations and store all the intermediate values.
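
To make this concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes, the random initialization, the least-squares loss, and the helper names (`relu`, `forward_pass`) are assumptions for illustration, not part of the original example:

```python
import numpy as np

def relu(z):
    # ReLU activation a[z]: elementwise max(z, 0)
    return np.maximum(z, 0.0)

def forward_pass(x, y, betas, omegas):
    """Three-layer network forward pass, storing every intermediate value.

    betas  : biases  [beta0, beta1, beta2, beta3]
    omegas : weights [Omega0, Omega1, Omega2, Omega3]
    """
    f = [None] * 4          # pre-activations f0..f3 (f3 is the network output)
    h = [None] * 4          # h[0] holds the input x; h[1..3] are the hidden activations
    h[0] = x
    for k in range(4):
        f[k] = betas[k] + omegas[k] @ h[k]   # f_k = beta_k + Omega_k h_k
        if k < 3:
            h[k + 1] = relu(f[k])            # h_{k+1} = a[f_k]
    loss = np.sum((f[3] - y) ** 2)           # assumed least-squares loss l[f3, y]
    return loss, f, h

# Assumed sizes for the example: input 4, hidden layers 5/6/7, output 3
rng = np.random.default_rng(0)
sizes = [4, 5, 6, 7, 3]
betas  = [rng.standard_normal(sizes[k + 1]) for k in range(4)]
omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(4)]
x, y = rng.standard_normal(4), rng.standard_normal(3)

loss, f, h = forward_pass(x, y, betas, omegas)
```

The returned lists `f` and `h` are exactly the stored intermediate values that the backward passes below will reuse.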

Backward pass 1

Now, let’s consider how the loss changes when the pre-activations $\mathbf{f}_2$ change. Applying the chain rule, the expression for the derivative of the loss with respect to $\mathbf{f}_2$ is:

$$
\frac{\partial \ell}{\partial \mathbf{f}_2} = \frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}.
$$

  • The first term on the right, $\partial \ell / \partial \mathbf{f}_3$, has size $1 \times D_f$
    • $D_f$ is the dimension of the model output $\mathbf{f}_3$
  • The second term, $\partial \mathbf{f}_3 / \partial \mathbf{h}_3$, has size $D_f \times D_3$
    • $D_3$ is the number of hidden units in the third layer
  • The third term, $\partial \mathbf{h}_3 / \partial \mathbf{f}_2$, has size $D_3 \times D_3$, so the product (and hence $\partial \ell / \partial \mathbf{f}_2$) has size $1 \times D_3$ (these shapes are checked numerically in the sketch below)

Similarly, we can compute how the loss changes when we change $\mathbf{f}_1$ and $\mathbf{f}_0$:

$$
\frac{\partial \ell}{\partial \mathbf{f}_1} = \left(\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\right)\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1}
$$

$$
\frac{\partial \ell}{\partial \mathbf{f}_0} = \left(\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\,\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1}\right)\frac{\partial \mathbf{f}_1}{\partial \mathbf{h}_1}\,\frac{\partial \mathbf{h}_1}{\partial \mathbf{f}_0}
$$

  • In each case, the term in brackets was computed in the previous step. By working backward through the network, we can reuse previous computations.

Each term tends to be fairly simple:

  • $\partial \ell / \partial \mathbf{f}_3$ (the derivative of the loss w.r.t. the network output $\mathbf{f}_3$) depends on the loss function but generally has a simple form
  • The derivative $\partial \mathbf{f}_3 / \partial \mathbf{h}_3$ of the network output with respect to the hidden layer $\mathbf{h}_3$ is $\boldsymbol{\Omega}_3$, since $\mathbf{f}_3 = \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3 \mathbf{h}_3$. This is shown in Problem 7.6.
  • The derivative $\partial \mathbf{h}_3 / \partial \mathbf{f}_2$ of the output of the activation function with respect to its input will depend on the activation function.
    • It will be a diagonal matrix since each activation only depends on the corresponding pre-activation.
    • For ReLU functions, the diagonal terms are zero everywhere $\mathbf{f}_2$ is less than zero and one otherwise. Rather than multiply by this matrix, we extract the diagonal terms as a vector and pointwise multiply, which is more efficient, as in the sketch below.
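
Putting these pieces together, here is a minimal sketch of this first backward pass, continuing the example above. The function name and the least-squares loss are assumptions for illustration; the gradients are stored as plain vectors, so multiplying by $\boldsymbol{\Omega}_{k+1}^\top$ plays the role of the $\partial \mathbf{f}_{k+1}/\partial \mathbf{h}_{k+1}$ term in the text:

```python
def backward_pass_preactivations(f, y, omegas):
    """Compute dl/df_k for k = 3, 2, 1, 0 (assumed least-squares loss).

    Multiplying by Omega_{k+1}^T applies df_{k+1}/dh_{k+1} = Omega_{k+1},
    and the pointwise ReLU mask stands in for the diagonal matrix dh_{k+1}/df_k.
    """
    dl_df = [None] * 4
    dl_df[3] = 2.0 * (f[3] - y)                              # dl/df3 (network output)
    for k in range(2, -1, -1):                               # k = 2, 1, 0
        mask = (f[k] > 0).astype(float)                      # ReLU derivative as a vector
        dl_df[k] = mask * (omegas[k + 1].T @ dl_df[k + 1])   # reuse the previous result
    return dl_df

dl_df = backward_pass_preactivations(f, y, omegas)
print([g.shape for g in dl_df])    # [(5,), (6,), (7,), (3,)] for the assumed sizes
```

Note how each iteration reuses the result of the previous one, which is exactly the saving described above.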

Backward pass 2

Now that we know how to compute $\partial \ell / \partial \mathbf{f}_k$, we can focus on calculating the derivatives of the loss with respect to the weights and biases.

To calculate the derivatives of the loss with respect to the biases $\boldsymbol{\beta}_k$, we use the chain rule:

$$
\frac{\partial \ell}{\partial \boldsymbol{\beta}_k} = \frac{\partial \ell}{\partial \mathbf{f}_k}\,\frac{\partial \mathbf{f}_k}{\partial \boldsymbol{\beta}_k} = \frac{\partial \ell}{\partial \mathbf{f}_k}\,\frac{\partial}{\partial \boldsymbol{\beta}_k}\bigl(\boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k\bigr) = \frac{\partial \ell}{\partial \mathbf{f}_k},
$$

which we already calculated above.

Similarly, the derivative of the loss with respect to the weight matrix $\boldsymbol{\Omega}_k$ is given by:

$$
\begin{aligned}
\frac{\partial \ell}{\partial \boldsymbol{\Omega}_k} &= \frac{\partial \ell}{\partial \mathbf{f}_k}\,\frac{\partial \mathbf{f}_k}{\partial \boldsymbol{\Omega}_k} \\
&= \frac{\partial \ell}{\partial \mathbf{f}_k}\,\frac{\partial}{\partial \boldsymbol{\Omega}_k}\bigl(\boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k\bigr) \\
&= \left(\frac{\partial \ell}{\partial \mathbf{f}_k}\right)^{\!\top} \mathbf{h}_k^\top,
\end{aligned}
$$

where we treat $\mathbf{h}_0 \equiv \mathbf{x}$ for the first layer.

  • The progression from line 2 to 3 is shown in Problem 7.9.

The result above makes intuitive sense; the final line is a matrix of the same size as $\boldsymbol{\Omega}_k$. It depends linearly on $\mathbf{h}_k$, which was multiplied by $\boldsymbol{\Omega}_k$ in the original expression $\mathbf{f}_k = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k$.

This is consistent with the intuition that the derivatives of the weights in $\boldsymbol{\Omega}_k$ will be proportional to the values of the hidden units $\mathbf{h}_k$ that they multiply. Recall that we already computed these activations during the forward pass.
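
To close the loop, here is a sketch of this second backward pass, again continuing the assumed NumPy example. It turns the stored $\partial\ell/\partial\mathbf{f}_k$ and the forward-pass activations into bias and weight gradients, and checks one weight gradient against a finite-difference estimate; the function name and loss are illustrative assumptions:

```python
def backward_pass_parameters(dl_df, h):
    """Bias and weight gradients from the stored dl/df_k and the forward-pass activations.

    dl/dbeta_k  = dl/df_k
    dl/dOmega_k = outer(dl/df_k, h_k)   (h[0] is the input x)
    """
    dl_dbeta  = [dl_df[k].copy() for k in range(4)]
    dl_domega = [np.outer(dl_df[k], h[k]) for k in range(4)]
    return dl_dbeta, dl_domega

dl_dbeta, dl_domega = backward_pass_parameters(dl_df, h)

# Finite-difference sanity check on one weight entry
eps = 1e-6
omegas_pert = [w.copy() for w in omegas]
omegas_pert[1][0, 0] += eps
loss_pert, _, _ = forward_pass(x, y, betas, omegas_pert)
print(dl_domega[1][0, 0], (loss_pert - loss) / eps)   # should agree closely
```

The outer product makes each weight gradient a matrix of the same size as the corresponding $\boldsymbol{\Omega}_k$, proportional to the hidden-unit values, mirroring the intuition above.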