Consider a deep neural network that takes input $x_i$, has $K$ hidden layers with ReLU activations, and individual loss term $\ell_i$.
The goal of backpropagation is to compute the derivatives $\partial \ell_i / \partial \beta_k$ and $\partial \ell_i / \partial \Omega_k$ with respect to the biases $\beta_k$ and weights $\Omega_k$.
Forward pass
We compute and store the following quantities:

$$
\begin{aligned}
f_0 &= \beta_0 + \Omega_0 x_i \\
h_k &= a[f_{k-1}] \qquad\quad\;\; k \in \{1, 2, \ldots, K\} \\
f_k &= \beta_k + \Omega_k h_k \qquad k \in \{1, 2, \ldots, K\},
\end{aligned}
$$

where $a[\bullet]$ is the ReLU function, applied elementwise, and the final pre-activation $f_K$ is the network output.
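As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The function name `forward_pass`, the layer sizes, and the random initialization are illustrative choices, not part of the text.

```python
import numpy as np

def forward_pass(x, betas, Omegas):
    """Run the forward pass, storing the pre-activations f_k and activations h_k."""
    K = len(Omegas) - 1             # number of hidden layers
    f = [betas[0] + Omegas[0] @ x]  # f_0 = beta_0 + Omega_0 x
    h = [None]                      # h_0 unused; keeps indices aligned with the text
    for k in range(1, K + 1):
        h.append(np.maximum(f[k - 1], 0))      # h_k = a[f_{k-1}]  (ReLU)
        f.append(betas[k] + Omegas[k] @ h[k])  # f_k = beta_k + Omega_k h_k
    return f, h                     # all intermediate values are stored for the backward pass

# Illustrative network: input size 4, two hidden layers of width 6, scalar output.
rng = np.random.default_rng(0)
sizes = [4, 6, 6, 1]
Omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
betas = [np.zeros((sizes[k + 1], 1)) for k in range(len(sizes) - 1)]
x = rng.standard_normal((4, 1))
f, h = forward_pass(x, betas, Omegas)
```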
Backward pass
We start with the derivative of the loss function with respect to the network output, $\partial \ell_i / \partial f_K$, and work backward through the network. For $k = K, K-1, \ldots, 1$, we do:

$$
\begin{aligned}
\frac{\partial \ell_i}{\partial \beta_k} &= \frac{\partial \ell_i}{\partial f_k} \\
\frac{\partial \ell_i}{\partial \Omega_k} &= \frac{\partial \ell_i}{\partial f_k} h_k^\top.
\end{aligned}
$$
We then pass the derivative backward to the previous layer:

$$
\frac{\partial \ell_i}{\partial f_{k-1}} = \mathbb{I}[f_{k-1} > 0] \odot \left( \Omega_k^\top \frac{\partial \ell_i}{\partial f_k} \right),
$$
where:
- $\odot$ is a pointwise (elementwise) multiplication.
- $\mathbb{I}[f_{k-1} > 0]$ is a vector containing ones where $f_{k-1}$ is greater than zero and zeros elsewhere.
- Thus, the operation $\mathbb{I}[f_{k-1} > 0] \odot$ is a mask corresponding to the derivative of the ReLU activation function.
Finally, we compute the derivatives with respect to the first set of weights and biases:

$$
\begin{aligned}
\frac{\partial \ell_i}{\partial \beta_0} &= \frac{\partial \ell_i}{\partial f_0} \\
\frac{\partial \ell_i}{\partial \Omega_0} &= \frac{\partial \ell_i}{\partial f_0} x_i^\top.
\end{aligned}
$$
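A matching backward-pass sketch is below. It assumes a least-squares loss $\ell_i = (f_K - y_i)^2$, so that $\partial \ell_i / \partial f_K = 2(f_K - y_i)$; the loss is not specified in the text, so treat this only as an illustration of the recursion above.

```python
import numpy as np

def backward_pass(x, y, f, h, Omegas):
    """Backward pass for a ReLU network with an assumed least-squares loss."""
    K = len(Omegas) - 1
    dl_df = 2.0 * (f[K] - y)             # dl/df_K for the assumed loss l = (f_K - y)^2
    dl_dbetas = [None] * (K + 1)
    dl_dOmegas = [None] * (K + 1)
    for k in range(K, 0, -1):
        dl_dbetas[k] = dl_df             # dl/dbeta_k  = dl/df_k
        dl_dOmegas[k] = dl_df @ h[k].T   # dl/dOmega_k = (dl/df_k) h_k^T
        # Pass backward: dl/df_{k-1} = I[f_{k-1} > 0] ⊙ (Omega_k^T dl/df_k)
        dl_df = (f[k - 1] > 0).astype(float) * (Omegas[k].T @ dl_df)
    dl_dbetas[0] = dl_df                 # dl/dbeta_0  = dl/df_0
    dl_dOmegas[0] = dl_df @ x.T          # dl/dOmega_0 = (dl/df_0) x^T
    return dl_dbetas, dl_dOmegas

# Illustrative check: tiny network, forward pass as in the earlier sketch, then backward pass.
rng = np.random.default_rng(0)
sizes = [4, 6, 6, 1]
Omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
betas = [np.zeros((sizes[k + 1], 1)) for k in range(len(sizes) - 1)]
x, y = rng.standard_normal((4, 1)), rng.standard_normal((1, 1))
f, h = [betas[0] + Omegas[0] @ x], [None]
for k in range(1, len(Omegas)):
    h.append(np.maximum(f[k - 1], 0))
    f.append(betas[k] + Omegas[k] @ h[k])
dl_dbetas, dl_dOmegas = backward_pass(x, y, f, h, Omegas)
```

Comparing `dl_dOmegas` against finite-difference estimates is a simple way to check such an implementation.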
We calculate these derivatives for every training example in the batch and sum them together to retrieve the gradient for the SGD update.
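In code, this amounts to accumulating the per-example derivatives before taking a step. The sketch below assumes the `forward_pass` and `backward_pass` functions and the `betas`/`Omegas` parameters from the earlier sketches, plus a hypothetical learning rate `alpha`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative batch of (input, target) pairs matching the tiny network above.
batch = [(rng.standard_normal((4, 1)), rng.standard_normal((1, 1))) for _ in range(16)]
alpha = 0.01  # hypothetical learning rate

grad_betas = [np.zeros_like(b) for b in betas]
grad_Omegas = [np.zeros_like(W) for W in Omegas]
for x_i, y_i in batch:
    f, h = forward_pass(x_i, betas, Omegas)                        # from the first sketch
    dl_dbetas, dl_dOmegas = backward_pass(x_i, y_i, f, h, Omegas)  # from the second sketch
    for k in range(len(betas)):                                    # sum per-example derivatives
        grad_betas[k] += dl_dbetas[k]
        grad_Omegas[k] += dl_dOmegas[k]
for k in range(len(betas)):                                        # single SGD step
    betas[k] -= alpha * grad_betas[k]
    Omegas[k] -= alpha * grad_Omegas[k]
```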
Backpropagation is extremely efficient; the most demanding computational step in the forward and backward pass is matrix multiplication (by $\Omega_k$ and $\Omega_k^\top$, respectively), which only requires additions and multiplications. However, it is not memory efficient; the intermediate values $f_k$ and $h_k$ in the forward pass must all be stored, which can limit the size of the model we can train.
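For a rough sense of scale (the numbers here are illustrative, not from the text): storing the pre-activations $f_k$ and activations $h_k$ for $K = 20$ hidden layers of width 1000 over a batch of 100 examples means keeping about $2 \times 20 \times 1000 \times 100 = 4 \times 10^6$ values, roughly 16 MB at 32-bit precision, and this cost grows linearly with depth, width, and batch size.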