Consider a model $f[x, \boldsymbol{\phi}]$ with eight scalar parameters $\boldsymbol{\phi} = \{\beta_0, \omega_0, \beta_1, \omega_1, \beta_2, \omega_2, \beta_3, \omega_3\}$ that consists of a composition of the functions $\sin[\bullet]$, $\exp[\bullet]$, and $\cos[\bullet]$:

$$f[x, \boldsymbol{\phi}] = \beta_3 + \omega_3 \cdot \cos\bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x]\bigr]\bigr],$$

and a least squares loss function $L[\boldsymbol{\phi}]$ with individual terms:

$$\ell_i = \bigl(f[x_i, \boldsymbol{\phi}] - y_i\bigr)^2,$$

where $x_i$ is the $i$-th training input, and $y_i$ is the $i$-th training output. You can think of this as a simple neural network with one input, one output, one hidden unit at each layer, and different activation functions $\sin[\bullet]$, $\exp[\bullet]$, and $\cos[\bullet]$ between each layer.
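
As a concrete reference, the model and the individual loss term can be written as a minimal Python sketch (the function names and the ordering of the parameters inside `phi` are choices made for this illustration):

```python
import math

def f(x, phi):
    # Unpack the eight scalar parameters: phi = (beta0, omega0, ..., beta3, omega3).
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    # The composition of sin, exp, and cos from the model equation above.
    return b3 + w3 * math.cos(b2 + w2 * math.exp(b1 + w1 * math.sin(b0 + w0 * x)))

def loss_term(x_i, y_i, phi):
    # Individual least squares term (f(x_i, phi) - y_i)^2.
    return (f(x_i, phi) - y_i) ** 2
```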

We aim to compute the derivatives of $\ell_i$ with respect to each of the eight parameters:

$$\frac{\partial \ell_i}{\partial \beta_0},\;\; \frac{\partial \ell_i}{\partial \omega_0},\;\; \frac{\partial \ell_i}{\partial \beta_1},\;\; \frac{\partial \ell_i}{\partial \omega_1},\;\; \frac{\partial \ell_i}{\partial \beta_2},\;\; \frac{\partial \ell_i}{\partial \omega_2},\;\; \frac{\partial \ell_i}{\partial \beta_3},\;\; \text{and}\;\; \frac{\partial \ell_i}{\partial \omega_3}.$$

Of course, we could find expressions for these derivatives by hand and compute them directly. However, some of these expressions are quite complex. Such expressions are awkward to derive and to code without mistakes, and they do not exploit the inherent redundancy in the computation: the derivatives for different parameters share many common terms.
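
For instance, expanding the derivative with respect to $\omega_0$ by repeatedly applying the chain rule to the model above gives

$$\frac{\partial \ell_i}{\partial \omega_0} = -2\,\omega_1\omega_2\omega_3\, x_i \Bigl(\beta_3 + \omega_3 \cdot \cos\bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr]\bigr] - y_i\Bigr)\,\cos[\beta_0 + \omega_0 \cdot x_i]\;\exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr]\;\sin\bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr]\bigr],$$

in which the same inner quantities (such as $\beta_0 + \omega_0 \cdot x_i$) appear over and over again; this is the redundancy that backpropagation exploits.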

Backpropagation is an efficient method for computing all of these derivatives at once. It consists of a forward pass, in which we compute and store a series of intermediate values and the network output, and a backward pass, in which we calculate the derivative of the loss with respect to each parameter, starting at the end of the network and reusing previous calculations as we move toward the start.

Forward pass

We treat the computation of the loss as a series of calculations:

$$\begin{aligned}
f_0 &= \beta_0 + \omega_0 \cdot x_i \\
h_1 &= \sin[f_0] \\
f_1 &= \beta_1 + \omega_1 \cdot h_1 \\
h_2 &= \exp[f_1] \\
f_2 &= \beta_2 + \omega_2 \cdot h_2 \\
h_3 &= \cos[f_2] \\
f_3 &= \beta_3 + \omega_3 \cdot h_3 \\
\ell_i &= (f_3 - y_i)^2.
\end{aligned}$$

We compute and store the values of the intermediate variables $f_k$ and $h_k$.
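
In Python, the forward pass might be sketched as follows (the function name `forward_pass` is illustrative; the variable names mirror the equations above):

```python
import math

def forward_pass(x_i, y_i, phi):
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    f0 = b0 + w0 * x_i
    h1 = math.sin(f0)
    f1 = b1 + w1 * h1
    h2 = math.exp(f1)
    f2 = b2 + w2 * h2
    h3 = math.cos(f2)
    f3 = b3 + w3 * h3
    loss = (f3 - y_i) ** 2
    # Return every intermediate value so the backward pass can reuse them.
    return loss, (f0, h1, f1, h2, f2, h3, f3)
```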

Backward pass 1

We now compute the derivatives of $\ell_i$ with respect to these intermediate variables, but in reverse order.

The first one is very straightforward:

$$\frac{\partial \ell_i}{\partial f_3} = 2(f_3 - y_i).$$

The next derivative can be calculated using the chain rule:

$$\frac{\partial \ell_i}{\partial h_3} = \frac{\partial \ell_i}{\partial f_3}\,\frac{\partial f_3}{\partial h_3}.$$

  • The left side asks how $\ell_i$ changes when $h_3$ changes.
  • The right side says we can decompose this into (i) how $\ell_i$ changes when $f_3$ changes and (ii) how $f_3$ changes when $h_3$ changes. In the original equations, $h_3$ changes $f_3$, which changes $\ell_i$, and the derivatives represent the effects of this chain. Notice that we already computed the first derivative, $\partial \ell_i/\partial f_3$, and the other is just the derivative of $f_3 = \beta_3 + \omega_3 \cdot h_3$ with respect to $h_3$, namely $\omega_3$ (substituted just below this list).
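
Substituting the two factors, since $\partial \ell_i/\partial f_3 = 2(f_3 - y_i)$ and $\partial f_3/\partial h_3 = \omega_3$, we obtain

$$\frac{\partial \ell_i}{\partial h_3} = 2(f_3 - y_i)\,\omega_3.$$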

We can continue this way:

$$\begin{aligned}
\frac{\partial \ell_i}{\partial f_2} &= \left(\frac{\partial \ell_i}{\partial f_3}\frac{\partial f_3}{\partial h_3}\right)\frac{\partial h_3}{\partial f_2} \\
\frac{\partial \ell_i}{\partial h_2} &= \left(\frac{\partial \ell_i}{\partial f_3}\frac{\partial f_3}{\partial h_3}\frac{\partial h_3}{\partial f_2}\right)\frac{\partial f_2}{\partial h_2} \\
\frac{\partial \ell_i}{\partial f_1} &= \left(\frac{\partial \ell_i}{\partial f_3}\frac{\partial f_3}{\partial h_3}\frac{\partial h_3}{\partial f_2}\frac{\partial f_2}{\partial h_2}\right)\frac{\partial h_2}{\partial f_1} \\
\frac{\partial \ell_i}{\partial h_1} &= \left(\frac{\partial \ell_i}{\partial f_3}\frac{\partial f_3}{\partial h_3}\frac{\partial h_3}{\partial f_2}\frac{\partial f_2}{\partial h_2}\frac{\partial h_2}{\partial f_1}\right)\frac{\partial f_1}{\partial h_1} \\
\frac{\partial \ell_i}{\partial f_0} &= \left(\frac{\partial \ell_i}{\partial f_3}\frac{\partial f_3}{\partial h_3}\frac{\partial h_3}{\partial f_2}\frac{\partial f_2}{\partial h_2}\frac{\partial h_2}{\partial f_1}\frac{\partial f_1}{\partial h_1}\right)\frac{\partial h_1}{\partial f_0}.
\end{aligned}$$

In each case, we already calculated the quantities in the brackets in the previous step, and the last term has a simple expression (for example, $\partial h_3/\partial f_2 = -\sin[f_2]$ and $\partial f_2/\partial h_2 = \omega_2$). These equations embody Observation 2 we made in Backpropagation Intuition: we can reuse previously computed derivatives if we calculate them in reverse order.
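
Continuing the Python sketch, the first backward pass reuses the stored intermediate values; each `dl_d...` variable holds the derivative of $\ell_i$ with respect to the corresponding intermediate quantity (the function name is again illustrative):

```python
import math

def backward_pass_intermediates(y_i, phi, intermediates):
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    f0, h1, f1, h2, f2, h3, f3 = intermediates
    dl_df3 = 2.0 * (f3 - y_i)
    dl_dh3 = dl_df3 * w3                 # df3/dh3 = omega3
    dl_df2 = dl_dh3 * -math.sin(f2)      # dh3/df2 = -sin(f2)
    dl_dh2 = dl_df2 * w2                 # df2/dh2 = omega2
    dl_df1 = dl_dh2 * math.exp(f1)       # dh2/df1 = exp(f1)
    dl_dh1 = dl_df1 * w1                 # df1/dh1 = omega1
    dl_df0 = dl_dh1 * math.cos(f0)       # dh1/df0 = cos(f0)
    return dl_df0, dl_df1, dl_df2, dl_df3
```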

Backward pass 2

Finally, we consider how the loss changes when we change the parameters $\beta_k$ and $\omega_k$.

Once again, we apply the chain rule:

$$\frac{\partial \ell_i}{\partial \beta_k} = \frac{\partial \ell_i}{\partial f_k}\,\frac{\partial f_k}{\partial \beta_k}, \qquad
\frac{\partial \ell_i}{\partial \omega_k} = \frac{\partial \ell_i}{\partial f_k}\,\frac{\partial f_k}{\partial \omega_k}.$$

In each case, the first term on the right side was already computed above. When $k > 0$, we have $\partial f_k/\partial \beta_k = 1$ and $\partial f_k/\partial \omega_k = h_k$, so

$$\frac{\partial \ell_i}{\partial \beta_k} = \frac{\partial \ell_i}{\partial f_k}, \qquad
\frac{\partial \ell_i}{\partial \omega_k} = \frac{\partial \ell_i}{\partial f_k}\cdot h_k.$$

This is consistent with Observation 1 from Backpropagation Intuition: the effect of a change in the weight $\omega_k$ is proportional to the value of the source $h_k$, which was stored in the forward pass. The final derivatives, from the term $f_0 = \beta_0 + \omega_0 \cdot x_i$, are:

$$\frac{\partial \ell_i}{\partial \beta_0} = \frac{\partial \ell_i}{\partial f_0}, \qquad
\frac{\partial \ell_i}{\partial \omega_0} = \frac{\partial \ell_i}{\partial f_0}\cdot x_i.$$

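The second backward pass then reads off the eight parameter derivatives from the stored values; in this sketch, $x_i$ plays the role of $h_0$ for the first layer:

```python
def backward_pass_parameters(x_i, intermediates, dl_df):
    f0, h1, f1, h2, f2, h3, f3 = intermediates
    sources = (x_i, h1, h2, h3)                            # the h_k (or x_i) feeding each f_k
    dl_dbeta = list(dl_df)                                 # dfk/dbetak = 1
    dl_domega = [d * h for d, h in zip(dl_df, sources)]    # dfk/domegak = h_k
    return dl_dbeta, dl_domega
```
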
Backpropagation is both simpler and more efficient than computing the derivatives individually.
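
As a quick sanity check, the derivatives produced by these sketches can be compared against central finite differences (the parameter and data values below are arbitrary):

```python
phi = [0.3, -0.4, 0.5, 0.2, -0.1, 0.7, 0.9, -0.6]
x_i, y_i, eps = 1.2, 0.8, 1e-6

_, inter = forward_pass(x_i, y_i, phi)
dl_df = backward_pass_intermediates(y_i, phi, inter)
dl_dbeta, dl_domega = backward_pass_parameters(x_i, inter, dl_df)

# Interleave back into the (beta0, omega0, ..., beta3, omega3) ordering of phi.
analytic = [d for pair in zip(dl_dbeta, dl_domega) for d in pair]

for j, d in enumerate(analytic):
    phi_plus, phi_minus = list(phi), list(phi)
    phi_plus[j] += eps
    phi_minus[j] -= eps
    numeric = (loss_term(x_i, y_i, phi_plus) - loss_term(x_i, y_i, phi_minus)) / (2 * eps)
    print(f"parameter {j}: backprop {d:+.6f}, finite difference {numeric:+.6f}")
```

The two columns should agree to several decimal places.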