Problem 6.1

Show that the derivatives of the least squares loss function in equation 6.5 are given by the equations in 6.7.

Each loss component is

$$\ell_i = (\phi_0 + \phi_1 x_i - y_i)^2,$$

so the full loss is $L[\boldsymbol\phi] = \sum_{i=1}^{I} \ell_i$.

Derivatives:

$$\frac{\partial \ell_i}{\partial \phi_0} = 2(\phi_0 + \phi_1 x_i - y_i), \qquad \frac{\partial \ell_i}{\partial \phi_1} = 2x_i(\phi_0 + \phi_1 x_i - y_i),$$

and summing over the training points gives the expressions in equation 6.7:

$$\frac{\partial L}{\partial \phi_0} = \sum_{i=1}^{I} 2(\phi_0 + \phi_1 x_i - y_i), \qquad \frac{\partial L}{\partial \phi_1} = \sum_{i=1}^{I} 2x_i(\phi_0 + \phi_1 x_i - y_i).$$
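As a quick sanity check, here is a small finite-difference comparison against these analytic derivatives. The toy data and parameter values are arbitrary choices for this sketch, not from the book.

import numpy as np

# Hypothetical toy data for a finite-difference gradient check.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 0.5 + 1.5 * x + rng.normal(scale=0.1, size=20)

def loss(phi0, phi1):
    return np.sum((phi0 + phi1 * x - y) ** 2)

def grad(phi0, phi1):
    r = phi0 + phi1 * x - y          # residuals
    return np.array([2 * np.sum(r), 2 * np.sum(r * x)])

phi0, phi1, eps = 0.3, -0.7, 1e-6
numeric = np.array([
    (loss(phi0 + eps, phi1) - loss(phi0 - eps, phi1)) / (2 * eps),
    (loss(phi0, phi1 + eps) - loss(phi0, phi1 - eps)) / (2 * eps),
])
print(np.allclose(grad(phi0, phi1), numeric, rtol=1e-4))  # expect True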

Problem 6.2

A surface is guaranteed to be convex if the eigenvalues of the Hessian are positive everywhere. In this case, the surface has a unique minimum, and optimization is easy. Find an algebraic expression for the Hessian matrix,

$$\mathbf{H}[\boldsymbol\phi] = \begin{bmatrix} \dfrac{\partial^2 L}{\partial \phi_0^2} & \dfrac{\partial^2 L}{\partial \phi_0\,\partial \phi_1} \\[2mm] \dfrac{\partial^2 L}{\partial \phi_1\,\partial \phi_0} & \dfrac{\partial^2 L}{\partial \phi_1^2} \end{bmatrix},$$

for the linear regression model. Prove that this function is convex by showing that the eigenvalues are always positive. This can be done by showing that both the trace and determinant of the matrix are positive.

We have, from problem 6.1, the first derivatives of the per-point loss:

$$\frac{\partial \ell_i}{\partial \phi_0} = 2(\phi_0 + \phi_1 x_i - y_i), \qquad \frac{\partial \ell_i}{\partial \phi_1} = 2x_i(\phi_0 + \phi_1 x_i - y_i).$$

Differentiating once more gives the four entries of the per-point Hessian.

Top left: $\dfrac{\partial^2 \ell_i}{\partial \phi_0^2} = 2$

Bottom left: $\dfrac{\partial^2 \ell_i}{\partial \phi_1\,\partial \phi_0} = 2x_i$

Top right: $\dfrac{\partial^2 \ell_i}{\partial \phi_0\,\partial \phi_1} = 2x_i$

Bottom right: $\dfrac{\partial^2 \ell_i}{\partial \phi_1^2} = 2x_i^2$

So the result for a single point is:

$$\mathbf{H}_i = 2\begin{bmatrix} 1 & x_i \\ x_i & x_i^2 \end{bmatrix}.$$

And the Hessian of the total loss is:

$$\mathbf{H} = \sum_{i=1}^{I} \mathbf{H}_i = 2\begin{bmatrix} I & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}.$$

The trace is positive:

$$\operatorname{tr}[\mathbf{H}] = 2I + 2\sum_i x_i^2 > 0.$$

The determinant is:

$$\det[\mathbf{H}] = 4\left(I\sum_i x_i^2 - \Big(\sum_i x_i\Big)^2\right) \ge 0,$$

which is non-negative by the Cauchy–Schwarz inequality and strictly positive whenever the $x_i$ are not all identical. A positive trace and positive determinant mean both eigenvalues are positive, so the Hessian is positive definite.

Thus, the surface is convex.
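A minimal numerical illustration of this result, using arbitrary made-up inputs (note that the Hessian of the linear regression loss depends only on the inputs, not on the targets or the parameters):

import numpy as np

# Hypothetical inputs, just to evaluate the algebraic Hessian numerically.
rng = np.random.default_rng(1)
x = rng.normal(size=50)

# Hessian of the total least squares loss for y = phi_0 + phi_1 * x.
H = 2 * np.array([
    [len(x),    np.sum(x)],
    [np.sum(x), np.sum(x ** 2)],
])

print(np.linalg.eigvalsh(H))            # both eigenvalues should be positive
print(np.trace(H), np.linalg.det(H))    # trace and determinant both positive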

Problem 6.3

Compute the derivatives of the least squares loss with respect to the parameters $\phi_0$ and $\phi_1$ for the Gabor model:

$$f[x, \boldsymbol\phi] = \sin[\phi_0 + 0.06\cdot\phi_1 x]\cdot \exp\!\left[-\frac{(\phi_0 + 0.06\cdot\phi_1 x)^2}{8.0}\right].$$

Let:

$$z_i = \phi_0 + 0.06\cdot\phi_1 x_i.$$

Then, we have

$$L[\boldsymbol\phi] = \sum_{i=1}^{I} \big(f[x_i, \boldsymbol\phi] - y_i\big)^2,$$

where $f[x_i, \boldsymbol\phi] = \sin[z_i]\exp[-z_i^2/8]$.

First, we can find:

$$\frac{\partial f}{\partial z_i} = \cos[z_i]\exp[-z_i^2/8] - \frac{z_i}{4}\sin[z_i]\exp[-z_i^2/8] = \exp[-z_i^2/8]\Big(\cos[z_i] - \frac{z_i}{4}\sin[z_i]\Big),$$

using the product rule and chain rule.

So:

$$\frac{\partial L}{\partial \phi_0} = \sum_{i=1}^{I} 2\big(f[x_i,\boldsymbol\phi] - y_i\big)\,\frac{\partial f}{\partial z_i}\,\frac{\partial z_i}{\partial \phi_0}$$

and

$$\frac{\partial L}{\partial \phi_1} = \sum_{i=1}^{I} 2\big(f[x_i,\boldsymbol\phi] - y_i\big)\,\frac{\partial f}{\partial z_i}\,\frac{\partial z_i}{\partial \phi_1}.$$

Because $z_i = \phi_0 + 0.06\cdot\phi_1 x_i$, for each of the parameters, we have

$$\frac{\partial z_i}{\partial \phi_0} = 1, \qquad \frac{\partial z_i}{\partial \phi_1} = 0.06\, x_i.$$

So:

$$\frac{\partial L}{\partial \phi_0} = \sum_{i=1}^{I} 2\big(f[x_i,\boldsymbol\phi] - y_i\big)\exp[-z_i^2/8]\Big(\cos[z_i] - \frac{z_i}{4}\sin[z_i]\Big)$$

and

$$\frac{\partial L}{\partial \phi_1} = \sum_{i=1}^{I} 2\big(f[x_i,\boldsymbol\phi] - y_i\big)\exp[-z_i^2/8]\Big(\cos[z_i] - \frac{z_i}{4}\sin[z_i]\Big)\cdot 0.06\, x_i.$$
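A small sketch that verifies these expressions by finite differences, assuming the Gabor model as written above; the data and parameter values are arbitrary.

import numpy as np

# Finite-difference sanity check of the Gabor-model gradients (hypothetical data).
rng = np.random.default_rng(2)
x = rng.uniform(-10, 10, size=30)
y = rng.normal(size=30)

def f(x, phi0, phi1):
    z = phi0 + 0.06 * phi1 * x
    return np.sin(z) * np.exp(-z ** 2 / 8.0)

def loss(phi0, phi1):
    return np.sum((f(x, phi0, phi1) - y) ** 2)

def grad(phi0, phi1):
    z = phi0 + 0.06 * phi1 * x
    dfdz = np.exp(-z ** 2 / 8.0) * (np.cos(z) - z / 4.0 * np.sin(z))
    common = 2 * (f(x, phi0, phi1) - y) * dfdz
    return np.array([np.sum(common), np.sum(common * 0.06 * x)])

phi0, phi1, eps = 0.2, 1.3, 1e-6
numeric = np.array([
    (loss(phi0 + eps, phi1) - loss(phi0 - eps, phi1)) / (2 * eps),
    (loss(phi0, phi1 + eps) - loss(phi0, phi1 - eps)) / (2 * eps),
])
print(np.allclose(grad(phi0, phi1), numeric, rtol=1e-4, atol=1e-8))  # expect True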

Problem 6.4

The logistic regression model uses a linear function to assign an input $x$ to one of two classes $y \in \{0, 1\}$. For a 1D input and a 1D output, it has two parameters, $\phi_0$ and $\phi_1$, and is defined by

$$y = \text{sig}[\phi_0 + \phi_1 x],$$

where $\text{sig}[\cdot]$ is the logistic sigmoid function:

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}.$$

  • (i) Plot $y$ against $x$ for this model for different values of $\phi_0$ and $\phi_1$ and explain the qualitative meaning of each parameter.
  • (ii) What is a suitable loss function for this model?
  • (iii) Compute the derivatives of this loss function with respect to the parameters.
  • (iv) Generate ten data points from a normal distribution with mean $-1$ and standard deviation $1$ and assign them the label $y = 0$. Generate another ten data points from a normal distribution with mean $1$ and standard deviation $1$ and assign these the label $y = 1$. Plot the loss as a heatmap in terms of the two parameters $\phi_0$ and $\phi_1$.
  • (v) Is this loss function convex? How could you prove this?

(i) $\phi_0$ controls the horizontal location of the centerpoint of the sigmoid function (the point where $y = 0.5$, which occurs at $x = -\phi_0/\phi_1$); $\phi_1$ controls the slope (steepness and direction) of the transition.
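A minimal plotting sketch for part (i); the parameter settings below are arbitrary illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

# Plot y = sig(phi0 + phi1 * x) for a few arbitrary parameter settings.
x = np.linspace(-6, 6, 400)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for phi0, phi1 in [(0.0, 1.0), (2.0, 1.0), (0.0, 4.0), (0.0, -1.0)]:
    plt.plot(x, sigmoid(phi0 + phi1 * x), label=rf"$\phi_0={phi0},\ \phi_1={phi1}$")

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()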

(ii) Binary cross-entropy (negative log-likelihood) loss:

$$L[\boldsymbol\phi] = -\sum_{i=1}^{I} \Big( y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \Big), \qquad \hat{y}_i = \text{sig}[\phi_0 + \phi_1 x_i].$$

(iii) The derivative of the sigmoid function is:

$$\frac{\partial\, \text{sig}[z]}{\partial z} = \text{sig}[z]\big(1 - \text{sig}[z]\big).$$

It follows that the derivatives of the loss function are

$$\frac{\partial L}{\partial \phi_0} = \sum_{i=1}^{I} \big(\hat{y}_i - y_i\big)$$

and

$$\frac{\partial L}{\partial \phi_1} = \sum_{i=1}^{I} \big(\hat{y}_i - y_i\big)\, x_i.$$
(iv)

import numpy as np
import matplotlib.pyplot as plt
 
np.random.seed(42)
 
# Generate data
x0 = np.random.normal(-1.0, 1.0, 10)
y0 = np.zeros(10)
 
x1 = np.random.normal(1.0, 1.0, 10)
y1 = np.ones(10)
 
x = np.concatenate([x0, x1])
y = np.concatenate([y0, y1])
 
# Sigmoid and loss
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
 
def loss(phi0, phi1):
    z = phi0 + phi1 * x
    p = sigmoid(z)
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps) # Keep p away from exactly 0 or 1 so np.log stays finite
    return -np.sum((1 - y) * np.log(1 - p) + y * np.log(p))
 
# Grid
phi0_vals = np.linspace(-4, 4, 200)
phi1_vals = np.linspace(-4, 4, 200)
loss_grid = np.zeros((len(phi1_vals), len(phi0_vals)))
 
for i, phi1 in enumerate(phi1_vals):
    for j, phi0 in enumerate(phi0_vals):
        loss_grid[i, j] = loss(phi0, phi1)
 
# Plot
plt.figure(figsize=(7, 6))
plt.imshow(
    loss_grid,
    extent=[phi0_vals.min(), phi0_vals.max(), phi1_vals.min(), phi1_vals.max()],
    origin="lower",
    aspect="auto",
)
 
# Add level curves (contours)
min_loss = np.min(loss_grid)
max_loss = np.max(loss_grid)
contour_levels = np.linspace(min_loss, max_loss, 15)
plt.contour(
    phi0_vals, 
    phi1_vals, 
    loss_grid, 
    levels=contour_levels,
    colors='white',
    alpha=0.7,
    linewidths=0.5
)
 
plt.colorbar(label="Loss")
plt.xlabel(r"$\phi_0$")
plt.ylabel(r"$\phi_1$")
plt.title("Binary Cross‑Entropy Loss Surface")
plt.show()

(v) The loss surface in the plot above looks convex, and the binary cross-entropy loss for logistic regression is indeed convex. We can prove it by examining the Hessian matrix like we did in question 6.2; here it is

$$\mathbf{H} = \sum_{i=1}^{I} \hat{y}_i(1 - \hat{y}_i)\begin{bmatrix} 1 & x_i \\ x_i & x_i^2 \end{bmatrix},$$

a non-negatively weighted sum of positive semi-definite matrices, so it is positive semi-definite everywhere.
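As a numerical spot-check of that claim, the sketch below builds this Hessian at a few arbitrary parameter values and confirms its eigenvalues are non-negative. It reuses x and sigmoid from the code above.

import numpy as np

# Spot-check convexity: the Hessian of the binary cross-entropy loss,
# evaluated at arbitrary parameter values, should have eigenvalues >= 0.
def hessian(phi0, phi1):
    p = sigmoid(phi0 + phi1 * x)
    w = p * (1 - p)                      # per-point weights, always in (0, 1)
    return np.array([
        [np.sum(w),     np.sum(w * x)],
        [np.sum(w * x), np.sum(w * x ** 2)],
    ])

for phi0, phi1 in [(-3.0, 0.5), (0.0, 0.0), (2.0, -1.5)]:
    print(phi0, phi1, np.linalg.eigvalsh(hessian(phi0, phi1)))  # all >= 0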

Problem 6.5

Compute the derivatives of the least squares loss with respect to the ten parameters of the simple neural network model:

$$f[x, \boldsymbol\phi] = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11} x] + \phi_2\, a[\theta_{20} + \theta_{21} x] + \phi_3\, a[\theta_{30} + \theta_{31} x],$$

where $a[\cdot]$ is the ReLU activation function.

Think carefully about what the derivative of the ReLU function will be.

The derivative of the least squares loss with respect to the model output is given by:

$$\frac{\partial L}{\partial f}\bigg|_{x_i} = 2\big(f[x_i, \boldsymbol\phi] - y_i\big),$$

and each parameter derivative follows from the chain rule: $\partial L / \partial \phi = \sum_i 2(f[x_i, \boldsymbol\phi] - y_i)\, \partial f / \partial \phi$.

The derivative of ReLU is:

$$\frac{\partial\, a[z]}{\partial z} = \begin{cases} 0 & z < 0 \\ 1 & z > 0, \end{cases}$$

which is undefined at $z = 0$ (in practice either value is used). We can write this as the indicator function $\mathbb{I}[z > 0]$.

Then, the derivatives of the model output with respect to the output weights are:

$$\frac{\partial f}{\partial \phi_0} = 1, \qquad \frac{\partial f}{\partial \phi_k} = a[\theta_{k0} + \theta_{k1} x] \quad \text{for } k = 1, 2, 3,$$

and with respect to the hidden-unit parameters:

$$\frac{\partial f}{\partial \theta_{k0}} = \phi_k\, \mathbb{I}[\theta_{k0} + \theta_{k1} x > 0], \qquad \frac{\partial f}{\partial \theta_{k1}} = \phi_k\, x\, \mathbb{I}[\theta_{k0} + \theta_{k1} x > 0].$$
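A finite-difference sanity check of these ten derivatives, assuming the shallow network defined above; the data, parameter values, and parameter ordering are arbitrary choices for this sketch.

import numpy as np

# Check the shallow-network gradients against finite differences.
# Parameter order (hypothetical): [phi_0..phi_3, theta_10, theta_11,
# theta_20, theta_21, theta_30, theta_31].
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=25)
y = rng.normal(size=25)
params = rng.normal(size=10)

def net(x, p):
    phi, theta = p[:4], p[4:].reshape(3, 2)
    h = np.maximum(0.0, theta[:, 0:1] + theta[:, 1:2] * x)   # ReLU hidden units, shape (3, N)
    return phi[0] + phi[1:] @ h

def loss(p):
    return np.sum((net(x, p) - y) ** 2)

def grad(p):
    phi, theta = p[:4], p[4:].reshape(3, 2)
    pre = theta[:, 0:1] + theta[:, 1:2] * x          # pre-activations, shape (3, N)
    h = np.maximum(0.0, pre)
    ind = (pre > 0).astype(float)                    # ReLU derivative
    r = 2 * (net(x, p) - y)                          # dL/df at each point
    g = np.empty(10)
    g[0] = np.sum(r)
    g[1:4] = h @ r                                   # d/d phi_k
    g[4::2] = (phi[1:, None] * ind) @ r              # d/d theta_k0
    g[5::2] = (phi[1:, None] * ind * x) @ r          # d/d theta_k1
    return g

eps = 1e-6
numeric = np.array([
    (loss(params + eps * np.eye(10)[k]) - loss(params - eps * np.eye(10)[k])) / (2 * eps)
    for k in range(10)
])
print(np.allclose(grad(params), numeric, rtol=1e-4, atol=1e-6))  # expect True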

Problem 6.6

Which of the functions in figure 6.11 is convex? Justify your answer. Characterize each of the points 1-7 as (i) a local minimum, (ii) a global minimum, or (iii) neither.

B is the only convex function; every chord lies on or above the function. A is non-convex (a chord from 1 → 2 or from 2 → 3 would pass below the curve), and C is non-convex (a chord from 6 → 7 would pass below the curve).

Points:

  1. Local minimum
  2. Global minimum
  3. Local minimum
  4. Neither
  5. Global minimum
  6. Global minimum
  7. Neither (saddle point)

Problem 6.7

The gradient descent trajectory for path 1 in figure 6.5a oscillates back and forth inefficiently as it moves down the valley toward the minimum. It’s also notable that it turns at right angles to the previous direction at each step. Provide a qualitative explanation for these phenomena. Propose a solution that might help prevent this behavior.

The trajectory turns at (approximately) right angles: if the new direction still had a component along the previous one that decreased the function, the previous step would in effect not have gone far enough in that direction. Here the path sits in a curved valley, and each fixed-size step overshoots the valley floor, so continuing in the same direction would move uphill on the other side; the gradient therefore swings sharply to point back across the valley, producing the inefficient zigzag.

Solutions include Newton's method (using second derivatives to account for the curvature of the loss landscape) or momentum (averaging successive gradients damps the oscillating cross-valley component and speeds up progress along the valley).
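A toy illustration (not from the book) of the zigzag and of how momentum damps it, on an arbitrary elongated quadratic valley f(a, b) = a² + 20b²; the learning rate and momentum coefficient are arbitrary choices.

import numpy as np

# Plain gradient descent overshoots the valley floor in b at every step
# (the sign of b flips each iteration), while momentum damps this oscillation.
def grad(p):
    return np.array([2.0 * p[0], 40.0 * p[1]])

lr, beta, n = 0.045, 0.8, 20

p = np.array([-10.0, 1.0])
b_gd = []
for _ in range(n):
    p = p - lr * grad(p)
    b_gd.append(p[1])

p, m = np.array([-10.0, 1.0]), np.zeros(2)
b_mom = []
for _ in range(n):
    m = beta * m + (1 - beta) * grad(p)
    p = p - lr * m
    b_mom.append(p[1])

print("sign flips, plain GD:", np.sum(np.diff(np.sign(b_gd)) != 0))   # flips every step
print("sign flips, momentum:", np.sum(np.diff(np.sign(b_mom)) != 0))  # far fewer flips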

Problem 6.8

Can (non-stochastic) gradient descent with a fixed learning rate escape local minima?

No. At a local minimum the gradient is zero (and near it the gradient is very small), so non-stochastic gradient descent with a fixed learning rate simply stays there: the update is (close to) zero and there is no noise to push the parameters out of the basin.

Problem 6.9

We run the stochastic gradient descent algorithm for 1000 iterations on a dataset of size 100 with a batch size of 20. For how many epochs did we train the model?

With a batch size of 20 and a dataset of size 100, one epoch is 100/20 = 5 iterations, so 1000 iterations correspond to 1000/5 = 200 epochs.

Problem 6.10

Show that the momentum term (equation 6.11) is an infinite weighted sum of the gradients at the previous iterations and derive an expression for the coefficients (weights) of that sum.

Recall that the momentum update (equation 6.11) is given by

$$\mathbf{m}_{t+1} \leftarrow \beta\, \mathbf{m}_t + (1-\beta)\, \mathbf{g}_t, \qquad \boldsymbol\phi_{t+1} \leftarrow \boldsymbol\phi_t - \alpha\, \mathbf{m}_{t+1},$$

where $\mathbf{g}_t$ denotes the (batch) gradient of the loss at iteration $t$.

We want to unroll this sequence:

$$\begin{aligned}
\mathbf{m}_{t+1} &= \beta\,\mathbf{m}_t + (1-\beta)\,\mathbf{g}_t \\
&= \beta\big(\beta\,\mathbf{m}_{t-1} + (1-\beta)\,\mathbf{g}_{t-1}\big) + (1-\beta)\,\mathbf{g}_t \\
&= (1-\beta)\,\mathbf{g}_t + \beta(1-\beta)\,\mathbf{g}_{t-1} + \beta^2(1-\beta)\,\mathbf{g}_{t-2} + \dots + \beta^t(1-\beta)\,\mathbf{g}_0 + \beta^{t+1}\mathbf{m}_0.
\end{aligned}$$

Thus, as $t \to \infty$ (with $\mathbf{m}_0 = \mathbf{0}$, so the last term vanishes), this sum approaches an infinite weighted sum of the gradients at the previous iterations:

$$\mathbf{m}_{t+1} = \sum_{j=0}^{\infty} w_j\, \mathbf{g}_{t-j},$$

where $w_j = (1-\beta)\,\beta^j$. The weights decay geometrically, so recent gradients contribute most, and they sum to one: $\sum_{j=0}^{\infty} (1-\beta)\beta^j = 1$.
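A quick numerical confirmation that the recursive update and the weighted sum agree, using random stand-in gradients (arbitrary values, not real batch gradients):

import numpy as np

# Check that the recursive momentum update equals the (truncated) weighted
# sum of past gradients with weights (1 - beta) * beta**j.
rng = np.random.default_rng(4)
beta, T = 0.9, 200
g = rng.normal(size=(T, 2))              # stand-in gradient at each iteration

m = np.zeros(2)
for t in range(T):
    m = beta * m + (1 - beta) * g[t]     # recursive form

weights = (1 - beta) * beta ** np.arange(T)   # w_j for j = 0..T-1
m_sum = weights @ g[::-1]                # weighted sum over past gradients

print(np.allclose(m, m_sum))             # expect True (m_0 = 0)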

Problem 6.11

What dimensions will the Hessian have if the model has one million parameters?

The Hessian will be $10^6 \times 10^6$: it contains one second derivative for every pair of parameters, i.e. $10^{12}$ entries in total.