Let’s repeat the backpropagation toy example but for a three-layer network.

The intuition and much of the algebra are identical. The main differences are that the intermediate variables are vectors, the biases are vectors, the weights are matrices, and we use ReLU activation functions rather than the simple algebraic functions of the toy example.

Forward pass

We write the network as a series of sequential calculations:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0\,\mathbf{x} \\
\mathbf{h}_1 &= a[\mathbf{f}_0] \\
\mathbf{f}_1 &= \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1\,\mathbf{h}_1 \\
\mathbf{h}_2 &= a[\mathbf{f}_1] \\
\mathbf{f}_2 &= \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2\,\mathbf{h}_2 \\
\mathbf{h}_3 &= a[\mathbf{f}_2] \\
\mathbf{f}_3 &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3\,\mathbf{h}_3 \\
\ell &= \mathrm{l}[\mathbf{f}_3, \mathbf{y}]
\end{aligned}
$$

where:

  • $\mathbf{f}_{k-1}$ represents the pre-activations at the $k$-th hidden layer (the values before the ReLU function $a[\bullet]$)
  • $\mathbf{h}_k$ represents the activations at the $k$-th hidden layer (the values after the ReLU function)
  • The term $\ell = \mathrm{l}[\mathbf{f}_3, \mathbf{y}]$ represents the loss function

In the forward pass, we work through these calculations and store all the intermediate values.
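As a concrete sketch of this bookkeeping, here is a minimal NumPy version of the forward pass. The layer sizes, the random initialization, and the least-squares loss are assumptions for illustration (the text does not fix them); the variable names `beta` and `Omega` mirror the bias vectors and weight matrices above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical sizes: input D_i, three hidden layers with D_1, D_2, D_3 units, output D_f.
rng = np.random.default_rng(0)
D_i, D_1, D_2, D_3, D_f = 4, 5, 5, 5, 2

# Biases are vectors, weights are matrices (beta_k, Omega_k).
beta = [rng.normal(size=d) for d in (D_1, D_2, D_3, D_f)]
Omega = [rng.normal(size=(d_out, d_in))
         for d_out, d_in in ((D_1, D_i), (D_2, D_1), (D_3, D_2), (D_f, D_3))]

def forward(x, y):
    """Run the sequential calculations and store every intermediate value."""
    f = [None] * 4   # pre-activations f_0 ... f_3
    h = [None] * 4   # activations h_1 ... h_3 (index 0 unused)
    f[0] = beta[0] + Omega[0] @ x
    for k in range(1, 4):
        h[k] = relu(f[k - 1])              # h_k = a[f_{k-1}]
        f[k] = beta[k] + Omega[k] @ h[k]   # f_k = beta_k + Omega_k h_k
    loss = np.sum((f[3] - y) ** 2)         # assumed least-squares loss l[f_3, y]
    return loss, f, h

x = rng.normal(size=D_i)
y = rng.normal(size=D_f)
loss, f, h = forward(x, y)
```

Storing the lists `f` and `h` is the whole point of the forward pass here: the backward pass below reads these cached values rather than recomputing them.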

Backward pass 1

Now, let’s consider how the loss changes when the pre-activations $\mathbf{f}_2$ change. Applying the chain rule, the derivative of the loss with respect to $\mathbf{f}_2$ is:

$$
\frac{\partial \ell}{\partial \mathbf{f}_2} \;=\; \frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}
$$

  • The first term on the right, $\partial \ell / \partial \mathbf{f}_3$, has size $1 \times D_f$
    • $D_f$ is the dimension of the model output $\mathbf{f}_3$
  • The second term, $\partial \mathbf{f}_3 / \partial \mathbf{h}_3$, has size $D_f \times D_3$
    • $D_3$ is the number of hidden units in the third hidden layer
  • The third term, $\partial \mathbf{h}_3 / \partial \mathbf{f}_2$, has size $D_3 \times D_3$ (a sizes sanity check in code follows this list)
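Continuing the NumPy sketch above (same assumed least-squares loss and hypothetical sizes), the three terms and their product can be written out explicitly to confirm the shapes:

```python
# dl/df3: derivative of the assumed least-squares loss, shape (1, D_f).
dl_df3 = (2.0 * (f[3] - y)).reshape(1, -1)

# df3/dh3: since f_3 = beta_3 + Omega_3 h_3, this Jacobian is just Omega_3, shape (D_f, D_3).
df3_dh3 = Omega[3]

# dh3/df2: ReLU acts elementwise, so this Jacobian is diagonal, shape (D_3, D_3).
dh3_df2 = np.diag((f[2] > 0).astype(float))

# Chain rule: dl/df2 has shape (1, D_3).
dl_df2 = dl_df3 @ df3_dh3 @ dh3_df2
print(dl_df3.shape, df3_dh3.shape, dh3_df2.shape, dl_df2.shape)
# (1, 2) (2, 5) (5, 5) (1, 5) with the hypothetical sizes above
```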

Similarly, we can compute how the loss changes when we change $\mathbf{f}_1$ and $\mathbf{f}_0$:

$$
\begin{aligned}
\frac{\partial \ell}{\partial \mathbf{f}_1} &= \left[\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\right]\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1} \\[4pt]
\frac{\partial \ell}{\partial \mathbf{f}_0} &= \left[\frac{\partial \ell}{\partial \mathbf{f}_3}\,\frac{\partial \mathbf{f}_3}{\partial \mathbf{h}_3}\,\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2}\,\frac{\partial \mathbf{f}_2}{\partial \mathbf{h}_2}\,\frac{\partial \mathbf{h}_2}{\partial \mathbf{f}_1}\right]\frac{\partial \mathbf{f}_1}{\partial \mathbf{h}_1}\,\frac{\partial \mathbf{h}_1}{\partial \mathbf{f}_0}
\end{aligned}
$$

  • In each case, the term in brackets was computed in the previous step. By working backward through the network, we can reuse previous computations.
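A sketch of this reuse, still continuing the NumPy example under the same assumptions: the bracketed quantity is held in a single running variable and extended by two new factors at each layer as we walk backward.

```python
# Start from the derivative of the loss with respect to the model output f_3.
running = (2.0 * (f[3] - y)).reshape(1, -1)   # dl/df3

dl_df = [None] * 4
dl_df[3] = running
for k in (2, 1, 0):
    # Multiply the quantity computed so far (the bracketed term) by two new factors:
    # df_{k+1}/dh_{k+1} = Omega_{k+1}, and dh_{k+1}/df_k = diagonal ReLU mask.
    running = running @ Omega[k + 1] @ np.diag((f[k] > 0).astype(float))
    dl_df[k] = running   # dl/df_k, reused unchanged in the next iteration
```

In practice one would apply the ReLU mask elementwise instead of building a diagonal matrix, but the explicit matrices make the sizes listed above easy to verify.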