Problem 7.1

A two-layer network with two hidden units in each layer can be defined as:

where the activation functions are ReLU functions. Compute the derivatives of the output with respect to each of the 13 parameters directly. The derivative of the ReLU function with respect to its input is the indicator function $\mathbb{I}[z > 0]$, which returns one if the argument is greater than zero and zero otherwise.

Output layer:

Second hidden layer: Let us first define

Then:

First hidden layer: Let us first define

Then we have:
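
These hand derivations are easy to sanity-check with central finite differences. The sketch below assumes a hypothetical flattening of the 13 parameters into a single vector; the names `theta`, `psi`, and `phi` and their ordering are my own, not the book's, and the comparison is only meaningful away from the ReLU kinks, where the derivative exists.

```python
import numpy as np

# Sketch only: the parameter flattening and the names theta/psi/phi are
# arbitrary choices for illustration, not the book's notation.

def relu(z):
    return np.maximum(z, 0.0)

def network(x, p):
    theta = p[0:4].reshape(2, 2)    # first layer: [bias, slope] for each of 2 hidden units
    psi   = p[4:10].reshape(2, 3)   # second layer: [bias, w1, w2] for each of 2 hidden units
    phi   = p[10:13]                # output layer: [bias, w1, w2]
    h1 = relu(theta[:, 0] + theta[:, 1] * x)   # first hidden layer activations
    h2 = relu(psi[:, 0] + psi[:, 1:] @ h1)     # second hidden layer activations
    return phi[0] + phi[1:] @ h2               # scalar output

def finite_difference_grad(f, p, eps=1e-6):
    """Central finite differences of f with respect to every entry of p."""
    grad = np.zeros_like(p)
    for i in range(len(p)):
        dp = np.zeros_like(p)
        dp[i] = eps
        grad[i] = (f(p + dp) - f(p - dp)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = rng.normal(size=13)
# Compare these 13 numbers entry by entry against the analytic derivatives above.
print(finite_difference_grad(lambda p: network(0.3, p), params))
```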

Problem 7.2

Find an expression for the final term in each of the five chains of derivatives in equation 7.13.

Problem 7.3

What size is each of the terms in equation 7.20?

  • is
  • is
  • is
  • is

Problem 7.4

Calculate the derivative $\partial \ell_i / \partial \text{f}[\mathbf{x}_i, \boldsymbol{\phi}]$ for the least squares loss function:

$$\ell_i = (y_i - \text{f}[\mathbf{x}_i, \boldsymbol{\phi}])^2$$
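
Since the derivative of $(y_i - f)^2$ with respect to $f$ is $2(f - y_i)$, a quick finite-difference check in Python (with arbitrary values for $y_i$ and $f$) confirms the claim:

```python
import numpy as np

# Check d/df (y - f)^2 = 2 * (f - y) at a random point.
rng = np.random.default_rng(1)
y, f, eps = rng.normal(), rng.normal(), 1e-6
loss = lambda f: (y - f) ** 2
numeric = (loss(f + eps) - loss(f - eps)) / (2 * eps)
analytic = 2 * (f - y)
print(numeric, analytic)   # the two values should agree to ~1e-9
```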

Problem 7.5

Calculate the derivative $\partial \ell_i / \partial \text{f}[\mathbf{x}_i, \boldsymbol{\phi}]$ for the binary classification loss function:

$$\ell_i = -(1-y_i)\log\big[1-\text{sig}[\text{f}[\mathbf{x}_i, \boldsymbol{\phi}]]\big] - y_i\log\big[\text{sig}[\text{f}[\mathbf{x}_i, \boldsymbol{\phi}]]\big]$$

For the sake of clean math, let's write $f = \text{f}[\mathbf{x}_i, \boldsymbol{\phi}]$. So we want to find $\partial \ell_i / \partial f$.

We have:

$$\frac{\partial \ell_i}{\partial f} = \frac{\partial \ell_i}{\partial\, \text{sig}[f]} \cdot \frac{\partial\, \text{sig}[f]}{\partial f}$$

The first term is:

$$\frac{\partial \ell_i}{\partial\, \text{sig}[f]} = \frac{1-y_i}{1-\text{sig}[f]} - \frac{y_i}{\text{sig}[f]}$$

The second term is:

$$\frac{\partial\, \text{sig}[f]}{\partial f} = \text{sig}[f]\,(1-\text{sig}[f])$$

Combining them back, we have:

$$\frac{\partial \ell_i}{\partial f} = \left( \frac{1-y_i}{1-\text{sig}[f]} - \frac{y_i}{\text{sig}[f]} \right)\text{sig}[f]\,(1-\text{sig}[f])$$

Recall that the sigmoid is:

$$\text{sig}[f] = \frac{1}{1+\exp[-f]}, \qquad 1-\text{sig}[f] = \frac{\exp[-f]}{1+\exp[-f]}$$

Let's substitute these back into our expression for the derivative and simplify:

\begin{align} \frac{ \partial \ell_{i} }{ \partial f } & = \left( (1-y_{i}) \left( \frac{1+\exp[-f]}{\exp[-f]} \right) - (y_{i}) (1+\exp[-f]) \right) \left( \frac{\exp[-f]}{(1+\exp[-f])^{2}} \right) \\[2ex] & = \left( \frac{(1-y_{i})(1+\exp[-f])}{\exp[-f]} - (y_{i}) (1+\exp[-f])\right)\left( \frac{\exp[-f]}{(1+\exp[-f])^{2}} \right) \\[2ex] & =\cancel{ (1+\exp[-f]) }\left( \frac{1-y_{i}}{\exp[-f]} - y_{i} \right)\left( \frac{\exp[-f]}{(1+\exp[-f])^\cancel{ {2} }} \right)\\[2ex] & =\left( \frac{1-y_{i}}{\exp[-f]}-y_{i} \right)\left( \frac{\exp[-f]}{1+\exp[-f]} \right) \\[2ex] & = \frac{1-y_{i}}{1+\exp[-f]}-\frac{y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1-y_{i}-y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]}- \frac{y_{i}+y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]}-\frac{y_{i}(1+\exp[-f])}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]} - y_{i} \\[2ex] & =\boxed{\text{sig}[f]-y_{i}} \end{align}

Nice result (and seems important)!
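
As a sanity check on the boxed result, here's a small Python sketch that compares a finite-difference derivative of the binary cross-entropy above against $\text{sig}[f] - y_i$ at a few arbitrary values of $f$ and $y_i$:

```python
import numpy as np

# Numerically verify d loss / d f = sig(f) - y for the binary cross-entropy.
sig = lambda f: 1.0 / (1.0 + np.exp(-f))
loss = lambda f, y: -(1 - y) * np.log(1 - sig(f)) - y * np.log(sig(f))

eps = 1e-6
for y in (0.0, 1.0):
    for f in (-2.0, 0.1, 3.0):
        numeric = (loss(f + eps, y) - loss(f - eps, y)) / (2 * eps)
        print(f"y={y}, f={f}: numeric={numeric:.6f}, sig(f)-y={sig(f) - y:.6f}")
```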

Problem 7.6

Show that for $\mathbf{f} = \boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h}$:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{h}} = \boldsymbol{\Omega}^{T},$$

where $\partial \mathbf{f}/\partial \mathbf{h}$ is a matrix containing the term $\partial f_i/\partial h_j$ in its $i$-th column and $j$-th row.

To do this, first find an expression for the constituent elements $\partial f_i/\partial h_j$, and then consider the form that the matrix $\partial \mathbf{f}/\partial \mathbf{h}$ must take.

Let:

$$\mathbf{f} = \boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h},$$

so that $f_i = \beta_i + \sum_j \Omega_{ij} h_j$. Each element $f_i$ is a dot product between the $i$-th row of $\boldsymbol{\Omega}$ and $\mathbf{h}$, and then we add the bias for that row.

Now let's consider the element-wise derivative. For some fixed $i$ and $j$:

$$\frac{\partial f_i}{\partial h_j} = \Omega_{ij}.$$

This is because $f_i$ depends linearly on each $h_j$, so the partial derivative just plucks out the corresponding weight.

Then, since "$\partial \mathbf{f}/\partial \mathbf{h}$ is a matrix containing the term $\partial f_i/\partial h_j$ in its $i$-th column and $j$-th row", the entry in row $j$, column $i$ of this matrix is $\Omega_{ij}$.

Now we can notice that the matrix whose $(j, i)$ entry is $\Omega_{ij}$ is precisely the transpose of $\boldsymbol{\Omega}$, leading us to:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{h}} = \boldsymbol{\Omega}^{T}.$$

If you instead stored $\partial f_i/\partial h_j$ in row $i$, column $j$ (the more common "Jacobian" convention), the matrix would be $\boldsymbol{\Omega}$ itself. The textbook's row/column choice is therefore what introduces the transpose.
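
The row/column convention is easy to confuse, so here is a small numpy sketch (with arbitrarily chosen shapes) that builds the matrix exactly as the book's convention prescribes, with $\partial f_i/\partial h_j$ stored in row $j$, column $i$, and confirms it equals $\boldsymbol{\Omega}^T$:

```python
import numpy as np

# f = beta + Omega @ h; build the derivative matrix by finite differences,
# storing d f_i / d h_j in row j, column i (the book's convention).
rng = np.random.default_rng(2)
Omega = rng.normal(size=(3, 4))
beta = rng.normal(size=3)
h = rng.normal(size=4)
eps = 1e-6

f = lambda h: beta + Omega @ h
D = np.zeros((4, 3))                       # rows indexed by j, columns by i
for j in range(4):
    dh = np.zeros(4)
    dh[j] = eps
    D[j, :] = (f(h + dh) - f(h - dh)) / (2 * eps)   # d f_i / d h_j for every i

print(np.allclose(D, Omega.T))             # True: the book's convention gives Omega^T
```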

Problem 7.7

Consider the case where we use the logistic sigmoid as an activation function, so $h = \text{sig}[z] = \frac{1}{1+\exp[-z]}$. Compute the derivative $\partial h/\partial z$ for this activation function. What happens to the derivative when the input takes (i) a large positive value and (ii) a large negative value?

We have:

$$h = \text{sig}[z] = \frac{1}{1+\exp[-z]},$$

and it follows that

$$\frac{\partial h}{\partial z} = \frac{\exp[-z]}{(1+\exp[-z])^{2}},$$

which can then be re-written as:

$$\frac{\partial h}{\partial z} = \text{sig}[z]\,(1-\text{sig}[z]).$$

When the input takes a large positive value, $\text{sig}[z] \to 1$, so we have $\partial h/\partial z = \text{sig}[z](1-\text{sig}[z]) \to 0$. When the input takes a large negative value, we in turn have $\text{sig}[z] \to 0$, which also gives $\partial h/\partial z \to 0$. Thus, the gradient is (near) zero at both extremes.
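
A few numerical values make the vanishing-gradient behaviour concrete:

```python
import numpy as np

# The sigmoid derivative sig(z) * (1 - sig(z)) collapses toward zero at both extremes.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
for z in (-20.0, -5.0, 0.0, 5.0, 20.0):
    print(z, sig(z) * (1 - sig(z)))
# Maximum value is 0.25 at z = 0; it is ~6.6e-3 at |z| = 5 and ~2e-9 at |z| = 20.
```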

Problem 7.8

Consider using (i) the Heaviside function and (ii) the rectangular function as activation functions:

$$\text{heaviside}[z] = \begin{cases} 0 & z < 0 \\ 1 & z \geq 0 \end{cases} \qquad\qquad \text{rect}[z] = \begin{cases} 0 & z < 0 \\ 1 & 0 \leq z \leq 1 \\ 0 & z > 1 \end{cases}$$

Discuss why these functions are problematic for neural network training with gradient-based optimization methods.

Both of these functions are flat and discontinuous.

In regions where the function is flat, the weights before the activation will not change, because the chain rule will include a multiplication by zero. For some weight $\omega$ that feeds into the pre-activation $z$ of the activation $h = \text{a}[z]$:

$$\frac{\partial \ell}{\partial \omega} = \frac{\partial \ell}{\partial h}\, \frac{\partial h}{\partial z}\, \frac{\partial z}{\partial \omega}.$$

In this case $\partial h/\partial z = 0$ because the activation is flat, so we also just have $\partial \ell/\partial \omega = 0$ and the weight doesn't update at all.

In regions where the function is discontinuous, there’s no gradient to follow (undefined) – optimization algorithms like gradient descent are stuck.
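
A quick numerical illustration: the finite-difference derivative of the Heaviside activation is zero at every point away from the jump, so the chain-rule product above is zero too, and at the jump itself the difference quotient blows up instead of converging.

```python
# Heaviside step activation: 1 for z >= 0, 0 otherwise.
def heaviside(z):
    return 1.0 if z >= 0 else 0.0

eps = 1e-4
# Away from the jump the finite-difference derivative is exactly zero,
# so no upstream weight gets a gradient signal.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, (heaviside(z + eps) - heaviside(z - eps)) / (2 * eps))
# At the jump the quotient diverges rather than converging: 1 / (2 * eps) = 5000.
print(0.0, (heaviside(eps) - heaviside(-eps)) / (2 * eps))
```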

Problem 7.9

Consider a loss function $\ell[\mathbf{f}]$, where $\mathbf{f} = \boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h}$. We want to find how the loss changes when we change $\boldsymbol{\Omega}$, which we'll express with a matrix $\partial \ell/\partial \boldsymbol{\Omega}$ that contains the derivative $\partial \ell/\partial \Omega_{ij}$ at the $i$-th row and $j$-th column. Find an expression for $\partial \ell/\partial \Omega_{ij}$, and, using the chain rule, show that

$$\frac{\partial \ell}{\partial \boldsymbol{\Omega}} = \frac{\partial \ell}{\partial \mathbf{f}}\, \mathbf{h}^{T}.$$

We have

$$f_i = \beta_i + \sum_{j} \Omega_{ij} h_j,$$

and so:

$$\frac{\partial f_k}{\partial \Omega_{ij}} = \begin{cases} h_j & k = i \\ 0 & k \neq i \end{cases}$$

Using the chain rule:

$$\frac{\partial \ell}{\partial \Omega_{ij}} = \sum_{k} \frac{\partial \ell}{\partial f_k}\, \frac{\partial f_k}{\partial \Omega_{ij}} = \frac{\partial \ell}{\partial f_i}\, h_j$$

Converting back to matrix form, we have

$$\frac{\partial \ell}{\partial \boldsymbol{\Omega}} = \frac{\partial \ell}{\partial \mathbf{f}}\, \mathbf{h}^{T},$$

as required.
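
To double-check the matrix result numerically, here's a numpy sketch with a concrete, arbitrary choice of loss, $\ell[\mathbf{f}] = \tfrac{1}{2}\lVert\mathbf{f}\rVert^2$, for which $\partial \ell/\partial \mathbf{f} = \mathbf{f}$:

```python
import numpy as np

# Check d loss / d Omega = (d loss / d f) h^T for loss(f) = 0.5 * ||f||^2,
# so that d loss / d f = f (a stand-in loss chosen only for this check).
rng = np.random.default_rng(3)
Omega = rng.normal(size=(3, 4))
beta = rng.normal(size=3)
h = rng.normal(size=4)

loss = lambda Omega: 0.5 * np.sum((beta + Omega @ h) ** 2)
f = beta + Omega @ h
analytic = np.outer(f, h)                  # (d loss / d f) h^T with d loss / d f = f

eps = 1e-6
numeric = np.zeros_like(Omega)
for i in range(Omega.shape[0]):
    for j in range(Omega.shape[1]):
        dO = np.zeros_like(Omega)
        dO[i, j] = eps
        numeric[i, j] = (loss(Omega + dO) - loss(Omega - dO)) / (2 * eps)

print(np.allclose(numeric, analytic))      # True
```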

Problem 7.10

Derive the equations for the backward pass of the backpropagation algorithm that uses leaky ReLU activations, which are defined as:

$$\text{a}[z] = \begin{cases} \alpha \cdot z & z < 0 \\ z & z \geq 0 \end{cases}$$

where $\alpha$ is a small positive constant, typically $0.1$.

The leaky ReLU has a gradient of $1$ when the input is greater than zero, and a gradient of $\alpha$ when the input is less than zero. The backprop equations are the same, except that wherever the ReLU derivative $\mathbb{I}[z > 0]$ appears in the backward pass, it is replaced by $\mathbb{I}[z > 0] + \alpha\, \mathbb{I}[z < 0]$.
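
A minimal Python sketch of the modified activation and the corresponding backward step through it (the function names are mine; $\alpha = 0.1$ as above):

```python
import numpy as np

# Leaky ReLU and its derivative; in the backward pass this derivative replaces
# the plain ReLU indicator I[z > 0] wherever it multiplies the incoming gradient.
def leaky_relu(z, alpha=0.1):
    return np.where(z >= 0, z, alpha * z)

def leaky_relu_deriv(z, alpha=0.1):
    return np.where(z >= 0, 1.0, alpha)

def backward_through_activation(dloss_dh, z, alpha=0.1):
    # dloss_dh: gradient arriving from the layer above, evaluated at h = leaky_relu(z)
    return dloss_dh * leaky_relu_deriv(z, alpha)

z = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print(leaky_relu(z))                                     # forward pass
print(backward_through_activation(np.ones_like(z), z))   # backward pass: 0.1 or 1 per entry
```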

Problem 7.11

Problem 7.12

Problem 7.13

Problem 7.14

Problem 7.15

Problem 7.16

Problem 7.17