Consider a deep neural network that takes input $\mathbf{x}_i$, has $K$ hidden layers with ReLU activations, and individual loss term $\ell_i$.

The goal of backpropagation is to compute the derivatives $\partial \ell_i / \partial \boldsymbol\beta_k$ and $\partial \ell_i / \partial \boldsymbol\Omega_k$ with respect to the biases $\boldsymbol\beta_k$ and weights $\boldsymbol\Omega_k$.

Forward pass

We compute and store the following quantities:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol\beta_0 + \boldsymbol\Omega_0 \mathbf{x}_i, \\
\mathbf{h}_k &= \text{a}[\mathbf{f}_{k-1}] \qquad k \in \{1, 2, \ldots, K\}, \\
\mathbf{f}_k &= \boldsymbol\beta_k + \boldsymbol\Omega_k \mathbf{h}_k \qquad k \in \{1, 2, \ldots, K\},
\end{aligned}
$$

where $\text{a}[\cdot]$ denotes the ReLU activation function, applied elementwise.
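As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The function name `forward_pass`, the representation of the parameters as Python lists of bias vectors (`betas`) and weight matrices (`Omegas`), and the single-example input vector `x` are assumptions made for this example, not details from the text.

```python
import numpy as np

def relu(f):
    """ReLU activation, applied elementwise: a[f] = max(0, f)."""
    return np.maximum(0.0, f)

def forward_pass(x, betas, Omegas):
    """Run the forward pass, storing the pre-activations f_0..f_K and
    hidden activations h_1..h_K needed later by the backward pass."""
    fs = [betas[0] + Omegas[0] @ x]                # f_0 = beta_0 + Omega_0 x_i
    hs = []
    for k in range(1, len(betas)):                 # k = 1, 2, ..., K
        hs.append(relu(fs[k - 1]))                 # h_k = a[f_{k-1}]
        fs.append(betas[k] + Omegas[k] @ hs[-1])   # f_k = beta_k + Omega_k h_k
    return fs, hs
```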

Backward pass

We start with the derivative $\partial \ell_i / \partial \mathbf{f}_K$ of the loss function with respect to the network output $\mathbf{f}_K$ and work backward through the network. For $k = K, K-1, \ldots, 1$, we do:

$$
\begin{aligned}
\frac{\partial \ell_i}{\partial \boldsymbol\beta_k} &= \frac{\partial \ell_i}{\partial \mathbf{f}_k}, \\
\frac{\partial \ell_i}{\partial \boldsymbol\Omega_k} &= \frac{\partial \ell_i}{\partial \mathbf{f}_k} \mathbf{h}_k^{\mathsf{T}}.
\end{aligned}
$$

We then pass backward to the previous layer:

$$
\frac{\partial \ell_i}{\partial \mathbf{f}_{k-1}} = \mathbb{I}[\mathbf{f}_{k-1} > 0] \odot \left( \boldsymbol\Omega_k^{\mathsf{T}} \frac{\partial \ell_i}{\partial \mathbf{f}_k} \right),
$$

where:

  • $\odot$ is a pointwise multiplication
  • $\mathbb{I}[\mathbf{f}_{k-1} > 0]$ is a vector containing ones where $\mathbf{f}_{k-1}$ is greater than zero and zeros elsewhere.
  • Thus, this operation is a mask that accounts for the derivative of the ReLU activation function.

Finally, we compute the derivatives with respect to the first set of weights and biases:

$$
\begin{aligned}
\frac{\partial \ell_i}{\partial \boldsymbol\beta_0} &= \frac{\partial \ell_i}{\partial \mathbf{f}_0}, \\
\frac{\partial \ell_i}{\partial \boldsymbol\Omega_0} &= \frac{\partial \ell_i}{\partial \mathbf{f}_0} \mathbf{x}_i^{\mathsf{T}}.
\end{aligned}
$$
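Continuing the same assumed representation as the forward-pass sketch, a matching sketch of the backward pass might look as follows; `dl_dfK` stands for the derivative of the loss with respect to the network output, which must be supplied by whatever loss function is in use.

```python
import numpy as np

def backward_pass(x, fs, hs, Omegas, dl_dfK):
    """Backpropagate from the network output to every bias and weight,
    reusing the pre-activations fs and activations hs stored in the forward pass."""
    K = len(Omegas) - 1
    dl_dbetas = [None] * (K + 1)
    dl_dOmegas = [None] * (K + 1)
    dl_df = dl_dfK                                   # dl/df_K, supplied by the loss
    for k in range(K, 0, -1):                        # k = K, K-1, ..., 1
        dl_dbetas[k] = dl_df                         # dl/dbeta_k = dl/df_k
        dl_dOmegas[k] = np.outer(dl_df, hs[k - 1])   # dl/dOmega_k = dl/df_k h_k^T
        mask = (fs[k - 1] > 0).astype(dl_df.dtype)   # I[f_{k-1} > 0]
        dl_df = mask * (Omegas[k].T @ dl_df)         # dl/df_{k-1}, passed backward
    dl_dbetas[0] = dl_df                             # dl/dbeta_0 = dl/df_0
    dl_dOmegas[0] = np.outer(dl_df, x)               # dl/dOmega_0 = dl/df_0 x_i^T
    return dl_dbetas, dl_dOmegas
```

The `mask` line implements the pointwise masking operation described in the bullet list above.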

We calculate these derivatives for every training example in the batch and sum them together to retrieve the gradient for the SGD update.
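One possible way to tie the two sketches together into a single SGD step on a batch is shown below; the squared-error loss (whose derivative with respect to the output is simply $\mathbf{f}_K - \mathbf{y}_i$) and the learning rate are placeholder choices for illustration, not prescribed by the text.

```python
def sgd_step(batch_x, batch_y, betas, Omegas, lr=0.01):
    """Sum the per-example derivatives over the batch and apply one SGD update,
    using the forward_pass and backward_pass sketches above."""
    grad_betas = [np.zeros_like(b) for b in betas]
    grad_Omegas = [np.zeros_like(W) for W in Omegas]
    for x, y in zip(batch_x, batch_y):
        fs, hs = forward_pass(x, betas, Omegas)
        dl_dfK = fs[-1] - y                          # derivative of 0.5*||f_K - y||^2
        dl_dbetas, dl_dOmegas = backward_pass(x, fs, hs, Omegas, dl_dfK)
        for k in range(len(betas)):                  # sum the per-example derivatives
            grad_betas[k] += dl_dbetas[k]
            grad_Omegas[k] += dl_dOmegas[k]
    for k in range(len(betas)):                      # gradient descent step
        betas[k] -= lr * grad_betas[k]
        Omegas[k] -= lr * grad_Omegas[k]
```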

Backpropagation is extremely efficient; the most demanding computational step in the forward and backward pass is matrix multiplication (by $\boldsymbol\Omega_k$ and $\boldsymbol\Omega_k^{\mathsf{T}}$, respectively), which only requires additions and multiplications. However, it is not memory efficient; the intermediate values in the forward pass must all be stored, which can limit the size of the model we can train.