Let’s repeat the backpropagation toy example but for a three-layer network.

The intuition and much of the algebra are identical. The main differences are that the intermediate variables are vectors, the biases are vectors, the weights are matrices, and we use ReLU activation functions rather than the simple algebraic functions of the toy example.

Forward pass

We write the network as a series of sequential calculations:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0\,\mathbf{x} \\
\mathbf{h}_1 &= a[\mathbf{f}_0] \\
\mathbf{f}_1 &= \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1\,\mathbf{h}_1 \\
\mathbf{h}_2 &= a[\mathbf{f}_1] \\
\mathbf{f}_2 &= \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2\,\mathbf{h}_2 \\
\mathbf{h}_3 &= a[\mathbf{f}_2] \\
\mathbf{f}_3 &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3\,\mathbf{h}_3 \\
\ell &= \mathrm{l}[\mathbf{f}_3, \mathbf{y}]
\end{aligned}
$$

where:

  • $\mathbf{f}_{k-1}$ represents the pre-activations at the $k$-th hidden layer (the values before the ReLU function $a[\bullet]$)
  • $\mathbf{h}_k$ represents the activations at the $k$-th hidden layer (the values after the ReLU function)
  • The term $\ell = \mathrm{l}[\mathbf{f}_3, \mathbf{y}]$ represents the loss function

In the forward pass, we work through these calculations and store all the intermediate values.
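As a concrete sketch of this bookkeeping, here is a minimal NumPy version of the forward pass. The layer sizes, the random initialization, and the least-squares loss are assumptions for illustration (the text does not fix them); the variable names `beta` and `Omega` mirror the bias vectors and weight matrices above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical sizes: input D_i, three hidden layers with D_1, D_2, D_3 units, output D_f.
rng = np.random.default_rng(0)
D_i, D_1, D_2, D_3, D_f = 4, 5, 5, 5, 2

# Biases are vectors, weights are matrices (beta_k, Omega_k).
beta = [rng.normal(size=d) for d in (D_1, D_2, D_3, D_f)]
Omega = [rng.normal(size=(d_out, d_in))
         for d_out, d_in in ((D_1, D_i), (D_2, D_1), (D_3, D_2), (D_f, D_3))]

def forward(x, y):
    """Run the sequential calculations and store every intermediate value."""
    f = [None] * 4   # pre-activations f_0 ... f_3
    h = [None] * 4   # activations h_1 ... h_3 (index 0 unused)
    f[0] = beta[0] + Omega[0] @ x
    for k in range(1, 4):
        h[k] = relu(f[k - 1])              # h_k = a[f_{k-1}]
        f[k] = beta[k] + Omega[k] @ h[k]   # f_k = beta_k + Omega_k h_k
    loss = np.sum((f[3] - y) ** 2)         # assumed least-squares loss l[f_3, y]
    return loss, f, h

x = rng.normal(size=D_i)
y = rng.normal(size=D_f)
loss, f, h = forward(x, y)
```

Storing the lists `f` and `h` is the whole point of the forward pass here: the backward pass below reads these cached values rather than recomputing them.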

Backward pass 1

Now, let’s consider how the loss changes when the pre-activations $\mathbf{f}_2$ change. Applying the chain rule, the derivative of the loss with respect to $\mathbf{f}_2$ is:

$$
\frac{\partial \ell}{\partial \mathbf{f}_2} \;=\; \frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}
$$

  • The first term on the right, $\partial \ell / \partial \mathbf{f}_3$, has size $1 \times D_f$
    • $D_f$ is the dimension of the model output $\mathbf{f}_3$
  • The second term, $\partial \mathbf{f}_3 / \partial \mathbf{h}_3$, has size $D_f \times D_3$
    • $D_3$ is the number of hidden units in the third hidden layer
  • The third term, $\partial \mathbf{h}_3 / \partial \mathbf{f}_2$, has size $D_3 \times D_3$ (a sizes sanity check in code follows this list)
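Continuing the NumPy sketch above (same assumed least-squares loss and hypothetical sizes), the three terms and their product can be written out explicitly to confirm the shapes:

```python
# dl/df3: derivative of the assumed least-squares loss, shape (1, D_f).
dl_df3 = (2.0 * (f[3] - y)).reshape(1, -1)

# df3/dh3: since f_3 = beta_3 + Omega_3 h_3, this Jacobian is just Omega_3, shape (D_f, D_3).
df3_dh3 = Omega[3]

# dh3/df2: ReLU acts elementwise, so this Jacobian is diagonal, shape (D_3, D_3).
dh3_df2 = np.diag((f[2] > 0).astype(float))

# Chain rule: dl/df2 has shape (1, D_3).
dl_df2 = dl_df3 @ df3_dh3 @ dh3_df2
print(dl_df3.shape, df3_dh3.shape, dh3_df2.shape, dl_df2.shape)
# (1, 2) (2, 5) (5, 5) (1, 5) with the hypothetical sizes above
```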

Similarly, we can compute how the loss changes when we change $\mathbf{f}_1$ and $\mathbf{f}_0$:

$$
\begin{aligned}
\frac{\partial \ell}{\partial \mathbf{f}_1} &= \left[\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\right]\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1} \\[4pt]
\frac{\partial \ell}{\partial \mathbf{f}_0} &= \left[\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\,\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1}\right]\frac{\partial \mathbf{f}_1}{\partial \mathbf{h}_1}\,\frac{\partial \mathbf{h}_1}{\partial \mathbf{f}_0}
\end{aligned}
$$

  • In each case, the term in brackets was computed in the previous step. By working backward through the network, we can reuse previous computations.
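A sketch of this reuse, still continuing the NumPy example under the same assumptions: the bracketed quantity is held in a single running variable and extended by two new factors at each layer as we walk backward.

```python
# Start from the derivative of the loss with respect to the model output f_3.
running = (2.0 * (f[3] - y)).reshape(1, -1)   # dl/df3

dl_df = [None] * 4
dl_df[3] = running
for k in (2, 1, 0):
    # Multiply the quantity computed so far (the bracketed term) by two new factors:
    # df_{k+1}/dh_{k+1} = Omega_{k+1}, and dh_{k+1}/df_k = diagonal ReLU mask.
    running = running @ Omega[k + 1] @ np.diag((f[k] > 0).astype(float))
    dl_df[k] = running   # dl/df_k, reused unchanged in the next iteration
```

In practice one would apply the ReLU mask elementwise instead of building a diagonal matrix, but the explicit matrices make the sizes listed above easy to verify.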