Problem 7.1

A two-layer network with two hidden units in each layer can be defined as:

where the functions $a[\bullet]$ are ReLU functions. Compute the derivatives of the output with respect to each of the 13 parameters directly. The derivative of the ReLU function with respect to its input is the indicator function $\mathbb{I}[z > 0]$, which returns one if the argument is greater than zero and zero otherwise.

Output layer:

Second hidden layer: Let us first define

Then:

First hidden layer: Let us first define

Then we have:
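
The closed-form derivatives are easy to get wrong, so a finite-difference sanity check is useful. The sketch below assumes one plausible parameterization of the network with 13 parameters (the names `phi*`, `psi*`, `theta*` and their indexing are my assumption, not taken from the text); it simply prints the numerical derivative of the output with respect to each parameter so you can compare against the hand-derived expressions.

```python
import numpy as np

# Assumed form of the two-layer, two-hidden-unit network (13 parameters):
#   h1  = a[theta01 + theta11*x],          h2  = a[theta02 + theta21*x]
#   h1p = a[psi01 + psi11*h1 + psi21*h2],  h2p = a[psi02 + psi12*h1 + psi22*h2]
#   y   = phi0 + phi1*h1p + phi2*h2p
relu = lambda z: np.maximum(z, 0.0)

def forward(p, x):
    h1 = relu(p['theta01'] + p['theta11'] * x)
    h2 = relu(p['theta02'] + p['theta21'] * x)
    h1p = relu(p['psi01'] + p['psi11'] * h1 + p['psi21'] * h2)
    h2p = relu(p['psi02'] + p['psi12'] * h1 + p['psi22'] * h2)
    return p['phi0'] + p['phi1'] * h1p + p['phi2'] * h2p

rng = np.random.default_rng(0)
names = ['phi0', 'phi1', 'phi2', 'psi01', 'psi11', 'psi21', 'psi02',
         'psi12', 'psi22', 'theta01', 'theta11', 'theta02', 'theta21']
params = {n: rng.normal() for n in names}
x = 0.7

# Central finite differences for each of the 13 parameters.
# (Unreliable exactly at a ReLU kink, so re-run with another seed if needed.)
eps = 1e-6
for n in names:
    p_plus, p_minus = dict(params), dict(params)
    p_plus[n] += eps
    p_minus[n] -= eps
    print(n, (forward(p_plus, x) - forward(p_minus, x)) / (2 * eps))
```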

Problem 7.2

Find an expression for the final term in each of the five chains of derivatives in equation 7.13.

Problem 7.3

What size is each of the terms in equation 7.20?

  • is
  • is
  • is
  • is

Problem 7.4

Calculate the derivative for the least squares loss function:

Problem 7.5

Calculate the derivative for the binary classification loss function:

For the sake of clean math let’s write $f = \text{f}[\mathbf{x}_{i}, \boldsymbol{\phi}]$. So we want to find $\frac{\partial \ell_{i}}{\partial f}$.

We have:

The first term is:

The second term is:

Combining them back, we have:

Recall that the sigmoid is

$$\text{sig}[f] = \frac{1}{1+\exp[-f]}, \qquad \text{so that} \qquad 1-\text{sig}[f] = \frac{\exp[-f]}{1+\exp[-f]}.$$

Let’s substitute these back into our expression for $\frac{\partial \ell_{i}}{\partial f}$ and simplify:

\begin{align} \frac{ \partial \ell_{i} }{ \partial f } & = \left( (1-y_{i}) \left( \frac{1+\exp[-f]}{\exp[-f]} \right) - (y_{i}) (1+\exp[-f]) \right) \left( \frac{\exp[-f]}{(1+\exp[-f])^{2}} \right) \\[2ex] & = \left( \frac{(1-y_{i})(1+\exp[-f])}{\exp[-f]} - (y_{i}) (1+\exp[-f])\right)\left( \frac{\exp[-f]}{(1+\exp[-f])^{2}} \right) \\[2ex] & =\cancel{ (1+\exp[-f]) }\left( \frac{1-y_{i}}{\exp[-f]} - y_{i} \right)\left( \frac{\exp[-f]}{(1+\exp[-f])^\cancel{ {2} }} \right)\\[2ex] & =\left( \frac{1-y_{i}}{\exp[-f]}-y_{i} \right)\left( \frac{\exp[-f]}{1+\exp[-f]} \right) \\[2ex] & = \frac{1-y_{i}}{1+\exp[-f]}-\frac{y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1-y_{i}-y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]}- \frac{y_{i}+y_{i}\exp[-f]}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]}-\frac{y_{i}(1+\exp[-f])}{1+\exp[-f]} \\[2ex] & = \frac{1}{1+\exp[-f]} - y_{i} \\[2ex] & =\boxed{\text{sig}[f]-y_{i}} \end{align}

Nice result (and seems important)!
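
As a sanity check on the boxed result, here is a minimal numerical sketch. It assumes the loss has the form $\ell_i = -(1-y_i)\log[1-\text{sig}[f]] - y_i \log[\text{sig}[f]]$ (consistent with the derivation above) and compares a central finite difference against $\text{sig}[f]-y_i$.

```python
import numpy as np

def sig(f):
    return 1.0 / (1.0 + np.exp(-f))

def bce_loss(f, y):
    # binary cross-entropy written in terms of the pre-sigmoid value f
    return -(1 - y) * np.log(1 - sig(f)) - y * np.log(sig(f))

# Check that d(loss)/df == sig(f) - y via central finite differences.
rng = np.random.default_rng(1)
eps = 1e-6
for _ in range(5):
    f = rng.normal()
    y = rng.integers(0, 2)
    numeric = (bce_loss(f + eps, y) - bce_loss(f - eps, y)) / (2 * eps)
    analytic = sig(f) - y
    print(f"{numeric:+.6f}  {analytic:+.6f}")
```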

Problem 7.6

Show that for :

where is a matrix containing the term in its -th column and -th row.

To do this, first find an expression for the constituent elements , and then consider the form that the matrix must take.

Let:

so that each element of the output is a dot product between the input vector and the corresponding row of the weight matrix, and then we add the bias for that row.

Now let’s consider the element-wise derivative. For some fixed and :

This is because depends linearly on each , so the partial derivative just plucks out the corresponding weight.

Then, since is a matrix containing the term in its -th column and -th row:

Now we can notice that the matrix whose entry is is precisely the transpose of , leading us to:

If you instead store the term in row , column (the more common “Jacobian” convention), the matrix would be itself. The textbook’s row/column choice is what introduces the transpose.
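
A quick numerical check of the result, under the assumption that the map has the form $\boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h}$ with the weight matrix stored as (outputs $\times$ inputs); the shapes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
Omega = rng.normal(size=(3, 4))   # weight matrix (3 outputs, 4 inputs)
beta = rng.normal(size=3)         # bias
h = rng.normal(size=4)

def f(h):
    return beta + Omega @ h

# Build the matrix D whose (j, i) entry is d f_i / d h_j, i.e. the derivative of
# output i stored in row j, column i (the textbook's convention), by finite differences.
eps = 1e-6
D = np.zeros((4, 3))
for j in range(4):
    hp, hm = h.copy(), h.copy()
    hp[j] += eps
    hm[j] -= eps
    D[j, :] = (f(hp) - f(hm)) / (2 * eps)

print(np.allclose(D, Omega.T))   # True: with this convention the matrix is Omega^T
```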

Problem 7.7

Consider the case where we use the logistic sigmoid as an activation function, so $a[z] = \text{sig}[z] = \frac{1}{1+\exp[-z]}$. Compute the derivative $\partial a[z]/\partial z$ for this activation function. What happens to the derivative when the input takes (i) a large positive value and (ii) a large negative value?

We have:

and it follows that

which can then be re-written as:

When the input $z$ takes a large positive value, $\text{sig}[z] \to 1$, so the derivative $\text{sig}[z](1-\text{sig}[z]) \to 0$. When the input takes a large negative value, we in turn have $\text{sig}[z] \to 0$, which again drives the derivative to zero. Thus, the gradient is (near) zero at both extremes.
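
A small numerical illustration of the saturation effect (the helper names are mine):

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsig(z):
    # derivative in the factored form sig(z) * (1 - sig(z))
    return sig(z) * (1.0 - sig(z))

# The derivative peaks at 0.25 for z = 0 and collapses towards zero at both extremes.
for z in [-20.0, -5.0, 0.0, 5.0, 20.0]:
    print(f"z = {z:+5.1f}   sig(z) = {sig(z):.6f}   dsig(z) = {dsig(z):.2e}")
```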

Problem 7.8

Consider using (i) the Heaviside function and (ii) the rectangular function as activations:

Discuss why these functions are problematic for neural network training with gradient-based optimization methods.

Both of these functions are piecewise flat (their derivative is zero almost everywhere) and discontinuous at the points where they jump.

In regions where the function is flat, weights before the activation will not change because the chain rule will include a multiplication by zero. For some weight :

In this case the derivative of the activation is zero because the function is flat there, so the whole chain-rule product is zero and the weight doesn’t update at all.

In regions where the function is discontinuous, the derivative is undefined, so there is no gradient to follow and optimization algorithms like gradient descent get stuck.
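
A tiny numerical illustration. The rectangular function is assumed here to be 1 on $[0, 1]$ and 0 elsewhere (check the exact definition in the text); the point is only that finite differences are zero on the flat regions and uninformative at a jump.

```python
import numpy as np

heaviside = lambda z: np.where(z >= 0, 1.0, 0.0)
rect = lambda z: np.where((z >= 0) & (z <= 1), 1.0, 0.0)   # assumed support [0, 1]

# Almost everywhere the numerical derivative is exactly zero, so any weight feeding
# into the activation receives a zero gradient through the chain rule.
eps = 1e-4
zs = np.array([-2.0, -0.5, 0.3, 0.7, 2.0])
print((heaviside(zs + eps) - heaviside(zs - eps)) / (2 * eps))   # [0. 0. 0. 0. 0.]
print((rect(zs + eps) - rect(zs - eps)) / (2 * eps))             # [0. 0. 0. 0. 0.]

# At the discontinuity the finite difference blows up instead of being a usable gradient.
print((heaviside(0.0 + eps) - heaviside(0.0 - eps)) / (2 * eps))  # 5000.0
```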

Problem 7.9

Consider a loss function , where . We want to find how the loss changes when we change , which we’ll express with a matrix that contains the derivative at the -th row and -th column. Find an expression for , and, using the chain rule, show that

We have

and so:

Using the chain rule:

Converting back to vector form, we have

as required.
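
A minimal numerical check of the chain-rule result, using an arbitrary scalar loss (half the squared norm of the output, so the loss gradient with respect to the output is the output itself); shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
Omega = rng.normal(size=(3, 4))
beta = rng.normal(size=3)
h = rng.normal(size=4)

def loss(Omega):
    f = beta + Omega @ h
    return 0.5 * np.sum(f ** 2)     # any scalar loss works; here dl/df = f

# Analytic gradient from the chain rule: dl/dOmega = (dl/df) h^T  (an outer product)
f = beta + Omega @ h
analytic = np.outer(f, h)

# Finite-difference gradient, element by element
eps = 1e-6
numeric = np.zeros_like(Omega)
for i in range(Omega.shape[0]):
    for j in range(Omega.shape[1]):
        Op, Om = Omega.copy(), Omega.copy()
        Op[i, j] += eps
        Om[i, j] -= eps
        numeric[i, j] = (loss(Op) - loss(Om)) / (2 * eps)

print(np.allclose(numeric, analytic))   # True
```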

Problem 7.10

Derive the equations for the backward pass of the backpropagation algorithm that uses leaky ReLU activations, which are defined as:

where $\alpha$ is a small positive constant, typically a value such as 0.1 or 0.01.

The leaky ReLU has a gradient of 1 when the input is greater than zero, and a gradient of $\alpha$ when the input is less than zero. The backprop equations are otherwise unchanged: the only difference is that the derivative of the activation is 1 where the pre-activation is positive and $\alpha$ where it is negative, instead of the ReLU indicator function.
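
A minimal sketch of the forward and backward computations for the leaky ReLU; the value of the leak constant ($\alpha = 0.1$) is an assumption.

```python
import numpy as np

ALPHA = 0.1   # assumed value of the leak constant

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def leaky_relu_backward(z, grad_out):
    # Multiply the incoming gradient by 1 where z > 0 and by ALPHA where z <= 0;
    # this is the only place the backward pass differs from plain ReLU.
    return grad_out * np.where(z > 0, 1.0, ALPHA)

z = np.array([-2.0, -0.1, 0.5, 3.0])
print(leaky_relu(z))                                # [-0.2  -0.01  0.5   3.  ]
print(leaky_relu_backward(z, np.ones_like(z)))      # [0.1 0.1 1.  1. ]
```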

Problem 7.11

Consider training a network with fifty layers, where we only have enough memory to store the pre-activations at every tenth hidden layer during the forward pass. Explain how to compute the derivatives in this situation using gradient checkpointing.

We restart from the saved checkpoint at every tenth hidden layer and process the network backwards in windows of ten layers:

  • We have the loss gradient at layer 50 and the saved checkpoint at layer 40, so we recompute the forward pass for layers 41–50 from the saved state at layer 40. Then we backpropagate from layer 50 down to 41, accumulating the parameter gradients and the gradient with respect to the activation at layer 40.
  • Then we move to the next window and repeat: recompute the forward pass for layers 31–40 and backpropagate through them, then do the same for 21–30, 11–20, and finally 1–10 (a minimal sketch of this procedure follows this list).
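
Here is a minimal sketch of the procedure with a toy, parameter-free layer (so only the activation gradient is propagated; in a real network you would also accumulate parameter gradients inside each window). The layer function, loss, and shapes are placeholders.

```python
import numpy as np

def layer_forward(x, k):
    return np.tanh(x + 0.01 * k)                 # toy layer k

def layer_backward(x, k, grad_out):
    return grad_out * (1.0 - np.tanh(x + 0.01 * k) ** 2)

x0 = np.ones(4)
checkpoints = {0: x0}
x = x0
for k in range(1, 51):                           # forward pass, saving layers 10, 20, 30, 40, 50
    x = layer_forward(x, k)
    if k % 10 == 0:
        checkpoints[k] = x

grad = 2 * checkpoints[50]                       # gradient of a toy loss sum(x50**2) w.r.t. x50
for start in (40, 30, 20, 10, 0):                # windows 41-50, 31-40, ..., 1-10
    # recompute the activations inside the window from its checkpoint
    acts = [checkpoints[start]]
    for k in range(start + 1, start + 11):
        acts.append(layer_forward(acts[-1], k))
    # backprop through the window, reusing the recomputed activations
    for k in range(start + 10, start, -1):
        grad = layer_backward(acts[k - start - 1], k, grad)

print(grad)   # gradient of the toy loss with respect to the network input
```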

Problem 7.12

This problem explores computing derivatives on general acyclic computational graphs. Consider the function:

We can break this down into a series of intermediate computations so that:

The associated computational graph is shown below.

Compute the derivative by reverse-mode differentiation. In other words, compute in order:

using the chain rule in each case to make use of the derivatives already computed.

Problem 7.13

For the same function as 7.12, compute the derivative by forward-mode differentiation. In other words, compute in order:

using the chain rule in each case to make use of the derivatives already computed. Why do we not use forward-mode differentiation when we calculate the parameter gradients for deep networks?

We don’t use forward-mode differentiation because it needs one pass per input variable, whereas reverse mode needs one pass per output variable. A deep network has a single scalar loss as its output but millions of parameters playing the role of inputs, so reverse mode is far cheaper.
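
To see the one-pass-per-input cost concretely, here is a minimal forward-mode sketch using dual numbers; the three-input function is a stand-in, not the function from Problem 7.12.

```python
class Dual:
    """A (value, derivative) pair propagated through the computation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def f(x, y, z):
    return x * y + 2.0 * y * z + x * z           # toy scalar function of three inputs

# Forward mode needs one full pass per input variable: seed that input's dot with 1.
inputs = {'x': 2.0, 'y': -1.0, 'z': 0.5}
for name in inputs:
    args = {k: Dual(v, 1.0 if k == name else 0.0) for k, v in inputs.items()}
    print(name, f(**args).dot)

# A deep network has millions of parameters but a single scalar loss,
# so this one-pass-per-input cost is exactly what reverse mode avoids.
```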

Problem 7.14

Consider a random variable $z$ with variance $\sigma^{2}$ and a symmetric distribution around the mean $\mu = 0$. Prove that if we pass this variable through the ReLU function

$$a = \text{ReLU}[z] = \max(0, z),$$

then the second moment of the transformed variable is $\sigma^{2}/2$.

We have:

where $\mathbb{I}[\cdot]$ is an indicator function that returns 1 if the condition inside the brackets is true and 0 if false.

Then:

To find this, let’s first find an expression for in terms of .

Since is symmetric around 0, we have:

Splitting into positive and negative parts:

Therefore,

Since the distribution has zero mean, $\mathbb{E}[z^{2}] = \sigma^{2}$. Thus,

as desired.
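
A quick Monte Carlo check of the result for two different zero-mean symmetric distributions (the value of $\sigma$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.7
n = 1_000_000

# Any zero-mean symmetric distribution with variance sigma^2 should work.
z_gauss = rng.normal(0.0, sigma, size=n)
z_unif = rng.uniform(-1.0, 1.0, size=n) * np.sqrt(3) * sigma   # variance sigma^2

for z in (z_gauss, z_unif):
    second_moment = np.mean(np.maximum(z, 0.0) ** 2)
    print(f"E[ReLU(z)^2] ~ {second_moment:.4f}   sigma^2 / 2 = {sigma**2 / 2:.4f}")
```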

Problem 7.15

What would you expect to happen if we initialized all of the weights and biases in the network to zero?

During gradient descent, we calculate how each neuron in a layer affects the loss. However, if all the weights and biases are initialized to zero, all the neurons in the same layer receive identical gradients; even when the parameters do move away from zero, they all move in exactly the same way, so the symmetry is never broken.

If the neurons in a layer are clones, they compute the same function, so the network loses its ability to learn different features: no matter how wide each layer is, it effectively behaves as if it had a single unit.
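
A small demo of the symmetry argument: a tiny 1–2–1 tanh network (the architecture, loss, and training setup are my own choices, not from the text) trained from an all-equal initialization. Whether the shared starting value is 0 or 0.5, the two hidden units end with identical parameters.

```python
import numpy as np

def train(init_value, steps=500, lr=0.1, seed=5):
    # Every weight and bias starts at the same value, so layer-mates get identical gradients.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(32, 1)); Y = np.sin(X)
    W1 = np.full((2, 1), init_value); b1 = np.full(2, init_value)
    W2 = np.full((1, 2), init_value); b2 = np.full(1, init_value)
    for _ in range(steps):
        H = np.tanh(X @ W1.T + b1)               # hidden activations, shape (32, 2)
        err = (H @ W2.T + b2) - Y                # least-squares residual
        dW2 = err.T @ H / len(X); db2 = err.mean(0)
        dH = (err @ W2) * (1 - H ** 2)
        dW1 = dH.T @ X / len(X); db1 = dH.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1

for init in (0.0, 0.5):
    W1, b1 = train(init)
    clones = np.allclose(W1[0], W1[1]) and np.isclose(b1[0], b1[1])
    print(f"init = {init}: hidden weights {W1.ravel()}, biases {b1}, clones = {clones}")
```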

Problem 7.16

see github

Problem 7.17

see github