Problem 5.1

Show that the logistic sigmoid function becomes 0 as $z \to -\infty$, is 0.5 when $z = 0$, and becomes 1 as $z \to \infty$, where

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}.$$

For $z \to -\infty$:

$$\text{sig}[-\infty] = \frac{1}{1 + \exp[\infty]} = \frac{1}{\infty} = 0.$$

For $z = 0$:

$$\text{sig}[0] = \frac{1}{1 + \exp[0]} = \frac{1}{1 + 1} = 0.5.$$

For $z \to \infty$:

$$\text{sig}[\infty] = \frac{1}{1 + \exp[-\infty]} = \frac{1}{1 + 0} = 1.$$
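As a quick numerical check (a minimal NumPy sketch, not part of the original solution; the cutoff of $\pm 30$ just stands in for $\pm\infty$):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate at a very negative value, at zero, and at a very positive value
print(sigmoid(-30.0))  # ~0   (approaches 0 as z -> -infinity)
print(sigmoid(0.0))    # 0.5
print(sigmoid(30.0))   # ~1   (approaches 1 as z -> +infinity)
```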

Problem 5.2

The loss for binary classification for a single training pair $\{x, y\}$ is

$$L = -(1 - y)\log\Bigl[1 - \text{sig}\bigl[f[x,\phi]\bigr]\Bigr] - y\log\Bigl[\text{sig}\bigl[f[x,\phi]\bigr]\Bigr].$$

Plot this loss as a function of the transformed output $\text{sig}[f[x,\phi]]$ (i) when the training label $y = 0$ and (ii) when $y = 1$.

When $y = 0$, we just have $L = -\log\bigl[1 - \text{sig}[f[x,\phi]]\bigr]$. When $y = 1$, we just have $L = -\log\bigl[\text{sig}[f[x,\phi]]\bigr]$.
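A small matplotlib sketch of the two curves, plotting the loss against the transformed output $p = \text{sig}[f[x,\phi]]$ (the variable name `p` and the plotting range are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)   # transformed output sig[f[x, phi]]

plt.plot(p, -np.log(1.0 - p), label="y = 0: -log(1 - p)")
plt.plot(p, -np.log(p), label="y = 1: -log(p)")
plt.xlabel("sig[f[x, phi]]")
plt.ylabel("loss")
plt.legend()
plt.show()
```

Each curve goes to zero when the transformed output matches the label and grows without bound as it approaches the wrong extreme.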

Problem 5.3

Suppose we want to build a model that predicts the direction $y$ in radians of the prevailing wind based on local measurements of barometric pressure $x$. A suitable distribution over circular domains is the von Mises distribution:

$$\Pr(y|\mu,\kappa) = \frac{\exp\bigl[\kappa \cos(y - \mu)\bigr]}{2\pi \cdot \text{Bessel}_0[\kappa]}$$

  • $\mu$ is a measure of the mean direction

  • $\kappa$ is a measure of concentration (i.e. the inverse of the variance)

  • The term $\text{Bessel}_0[\kappa]$ is a modified Bessel function of the first kind of order 0.

Use the loss function recipe to develop a loss function for learning the parameter $\mu$ of a model $f[x,\phi]$ to predict the most likely wind direction. Your solution should treat the concentration $\kappa$ as a constant. How would you perform inference?

We set $\mu = f[x,\phi]$, so

$$\Pr\bigl(y|f[x,\phi],\kappa\bigr) = \frac{\exp\bigl[\kappa \cos(y - f[x,\phi])\bigr]}{2\pi \cdot \text{Bessel}_0[\kappa]}.$$

Then the negative log-likelihood loss function is

$$L[\phi] = -\sum_{i=1}^{I} \log \Pr\bigl(y_i|f[x_i,\phi],\kappa\bigr) = \sum_{i=1}^{I} \Bigl(-\kappa \cos\bigl(y_i - f[x_i,\phi]\bigr) + \log\bigl[2\pi \cdot \text{Bessel}_0[\kappa]\bigr]\Bigr),$$

where the second term is constant with respect to $\phi$ and can be dropped.

To perform inference we just take the maximum of the distribution, which is simply the predicted parameter $\hat\mu = f[x,\hat\phi]$. This might be outside the range $(-\pi, \pi]$, in which case we would add or subtract multiples of $2\pi$ until it is in the right range.
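A minimal PyTorch sketch of this loss, assuming the network's predicted mean directions are passed in as `mu_pred` and that `kappa` is fixed (the value 2.0 is an arbitrary illustration):

```python
import math
import torch

def von_mises_nll(mu_pred, y, kappa=2.0):
    # Negative log-likelihood of circular targets y (radians) under a von Mises
    # distribution with predicted mean direction mu_pred and fixed concentration kappa.
    kappa_t = torch.as_tensor(float(kappa))
    # log[2*pi*Bessel_0(kappa)] is constant in phi and could be dropped during training
    log_norm = torch.log(2.0 * math.pi * torch.special.i0(kappa_t))
    return torch.sum(-kappa_t * torch.cos(y - mu_pred) + log_norm)
```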

Problem 5.4

Sometimes, the outputs $y$ for input $x$ are multimodal; there is more than one valid prediction for a given input. Here, we might use a sum of normal components as the distribution over the output. This is known as a mixture of Gaussians model. For example, a mixture of two Gaussians has parameters $\theta = \{\lambda, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2\}$:

$$\Pr(y|\lambda, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = \lambda \cdot \text{Norm}_y\bigl[\mu_1, \sigma_1^2\bigr] + (1 - \lambda) \cdot \text{Norm}_y\bigl[\mu_2, \sigma_2^2\bigr],$$

where $\lambda \in [0, 1]$ controls the relative weight of the two components, which have means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, respectively. This model can represent a distribution with two peaks or a distribution with one peak but a more complex shape.

Use the loss function recipe to construct a loss function for training a model $f[x,\phi]$ that takes input $x$, has parameters $\phi$, and predicts a mixture of two Gaussians. The loss should be based on $I$ training data pairs $\{x_i, y_i\}$. What problems do you foresee when performing inference?

Let the network produce five outputs $f_1[x,\phi], \ldots, f_5[x,\phi]$ and set:

  • $\mu_1 = f_1[x,\phi]$ and $\mu_2 = f_2[x,\phi]$.
  • $\sigma_1^2 = \exp\bigl[f_3[x,\phi]\bigr]$ and $\sigma_2^2 = \exp\bigl[f_4[x,\phi]\bigr]$, using the exponential to enforce $\sigma_1^2, \sigma_2^2 > 0$.
  • $\lambda = \text{sig}\bigl[f_5[x,\phi]\bigr]$, using the sigmoid to enforce $\lambda \in [0, 1]$.

Then the loss is

$$L[\phi] = -\sum_{i=1}^{I} \log\Bigl[\lambda_i \cdot \text{Norm}_{y_i}\bigl[\mu_{1i}, \sigma_{1i}^2\bigr] + (1 - \lambda_i) \cdot \text{Norm}_{y_i}\bigl[\mu_{2i}, \sigma_{2i}^2\bigr]\Bigr],$$

where the subscript $i$ indicates that each parameter is computed from the network outputs at $x_i$.

Inference is a bit trickier in this case since there is no simple closed form for the mode of this distribution.
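A sketch of this loss in PyTorch, where `net_out` holds the five raw network outputs per example (the column ordering and the use of `exp` for the variances are assumptions for illustration):

```python
import torch

def mixture_of_two_gaussians_nll(net_out, y):
    # net_out: shape (I, 5); y: shape (I,)
    mu1, mu2 = net_out[:, 0], net_out[:, 1]
    var1, var2 = torch.exp(net_out[:, 2]), torch.exp(net_out[:, 3])  # exp enforces positive variances
    lam = torch.sigmoid(net_out[:, 4])                               # sigmoid enforces lambda in [0, 1]

    norm1 = torch.distributions.Normal(mu1, var1.sqrt()).log_prob(y).exp()
    norm2 = torch.distributions.Normal(mu2, var2.sqrt()).log_prob(y).exp()
    return -torch.log(lam * norm1 + (1.0 - lam) * norm2).sum()
```

In practice, one would compute the log of the weighted sum with `torch.logsumexp` over the two components' log-densities for numerical stability, rather than exponentiating and then taking the log again.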

Problem 5.5

Consider extending the model from problem 5.3 to predict the wind direction using a mixture of two von Mises distributions. Write an expression for the likelihood $\Pr(y|\theta)$ for this model. How many outputs will the network produce?

Each von Mises distribution is parametrized by $\mu$ and $\kappa$. Thus, for a mixture of two von Mises distributions, the parameters will be

$$\theta = \{\lambda, \mu_1, \kappa_1, \mu_2, \kappa_2\},$$

where $\lambda \in [0, 1]$ is the relative weight of the two distributions. The likelihood will then be:

$$\Pr(y|\theta) = \lambda \cdot \frac{\exp\bigl[\kappa_1 \cos(y - \mu_1)\bigr]}{2\pi \cdot \text{Bessel}_0[\kappa_1]} + (1 - \lambda) \cdot \frac{\exp\bigl[\kappa_2 \cos(y - \mu_2)\bigr]}{2\pi \cdot \text{Bessel}_0[\kappa_2]}.$$

Like the mixture of Gaussians above, we would need five outputs, unless we consider $\kappa_1$ and $\kappa_2$ to be constants, in which case we would need three.

Problem 5.6

Consider building a model to predict the number of pedestrians $y \in \{0, 1, 2, \ldots\}$ that will pass a given point in the city in the next minute, based on data $x$ that contains information about the time of day, the longitude and latitude, and the type of neighborhood. A suitable distribution for modeling counts is the Poisson distribution. This has a single parameter $\lambda > 0$ called the rate that represents the mean of the distribution. The distribution has probability mass function

$$\Pr(y = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}.$$

Design a loss function for this model assuming we have access to $I$ training pairs $\{x_i, y_i\}$.

We make the rate the learned parameter such that $\lambda = f[x,\phi]$. Then, we have

$$\Pr\bigl(y|f[x,\phi]\bigr) = \frac{f[x,\phi]^{y}\,\exp\bigl[-f[x,\phi]\bigr]}{y!}.$$

The loss function based on the negative log-likelihood is then

$$L[\phi] = \sum_{i=1}^{I} \Bigl(f[x_i,\phi] - y_i \log f[x_i,\phi] + \log\bigl[y_i!\bigr]\Bigr).$$

We can drop the last term above since it doesn't depend on $\phi$. Also, we would often want to apply a function like $\exp[\cdot]$ or $\text{ReLU}[\cdot]$ to the network output to enforce the condition that $\lambda > 0$, similarly to how we used the sigmoid/softmax to enforce outputs in $[0, 1]$ before. The exponential is likely the better choice because it is differentiable everywhere (and strictly positive).
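A sketch of this loss in PyTorch, where `raw_out` is the unconstrained network output and `exp` is used to enforce a positive rate (both choices as discussed above):

```python
import torch

def poisson_nll(raw_out, y):
    # raw_out: unconstrained network outputs, shape (I,); y: observed counts, shape (I,)
    y = y.to(raw_out.dtype)
    rate = torch.exp(raw_out)                # exp enforces rate > 0
    log_y_factorial = torch.lgamma(y + 1.0)  # log(y!), constant in phi and could be dropped
    return torch.sum(rate - y * torch.log(rate) + log_y_factorial)
```

PyTorch also provides `torch.nn.PoissonNLLLoss`, which computes essentially the same quantity (with `log_input=True` it exponentiates the raw output internally, and the `log(y!)` term is only approximated when `full=True`).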

Problem 5.7

Consider a multivariate regression problem where we predict ten outputs, so $y \in \mathbb{R}^{10}$, and model each with an independent normal distribution where the means $\mu_d$ are predicted by the network, and the variances $\sigma^2$ are constant. Write an expression for the likelihood $\Pr(y|x)$. Show that minimizing the negative log-likelihood of this model is still equivalent to minimizing a sum of squared terms if we don't estimate the variance $\sigma^2$.

The likelihood of each output dimension is given by

$$\Pr(y_d|x) = \text{Norm}_{y_d}\bigl[\mu_d, \sigma^2\bigr] = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y_d - \mu_d)^2}{2\sigma^2}\right].$$

We learn parameters $\phi$ to predict each mean, such that $\mu_d = f_d[x,\phi]$.

Then, the overall joint likelihood is given by

$$\Pr(y|x) = \prod_{d=1}^{10} \text{Norm}_{y_d}\bigl[f_d[x,\phi], \sigma^2\bigr].$$

Then:

$$L[\phi] = -\sum_{i=1}^{I} \log\left[\prod_{d=1}^{10} \text{Norm}_{y_{id}}\bigl[f_d[x_i,\phi], \sigma^2\bigr]\right] = -\sum_{i=1}^{I}\sum_{d=1}^{10} \log \text{Norm}_{y_{id}}\bigl[f_d[x_i,\phi], \sigma^2\bigr].$$

Then, following the derivation we did for least squares, we just have

$$\hat\phi = \underset{\phi}{\operatorname{argmin}}\left[\sum_{i=1}^{I}\sum_{d=1}^{10} \bigl(y_{id} - f_d[x_i,\phi]\bigr)^2\right].$$

  • The terms with no dependence on $\phi$ were discarded since they have no effect on the location of the minimum $\hat\phi$.
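A quick numerical confirmation (a NumPy sketch with made-up data) that the negative log-likelihood differs from the sum of squares only by a scale factor and an additive constant, so both are minimized by the same means:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 10))    # 4 training examples, 10 output dimensions
mu = rng.normal(size=(4, 10))   # predicted means f_d[x_i, phi]
sigma2 = 1.5                    # fixed, shared variance

nll = np.sum(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))
sse = np.sum((y - mu) ** 2)

# nll = sse / (2 * sigma2) + constant
const = y.size * 0.5 * np.log(2 * np.pi * sigma2)
print(np.isclose(nll, sse / (2 * sigma2) + const))  # True
```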

Problem 5.8

Construct a loss function for making multivariate predictions $y \in \mathbb{R}^{D}$ based on independent normal distributions with different variances $\sigma_d^2$ for each dimension. Assume a heteroscedastic model so that both the means $\mu_d$ and the variances $\sigma_d^2$ vary as a function of the data.

Let the network predict both the means and the variances: set $\mu_d = f_d[x,\phi]$ and $\sigma_d^2 = g_d[x,\phi]^2$, where the squaring enforces $\sigma_d^2 \ge 0$.

Then the loss function is:

$$L[\phi] = -\sum_{i=1}^{I} \log\left[\prod_{d=1}^{D} \text{Norm}_{y_{id}}\bigl[f_d[x_i,\phi],\, g_d[x_i,\phi]^2\bigr]\right].$$

We can move the negative log inside, turning the product into a sum:

$$L[\phi] = -\sum_{i=1}^{I}\sum_{d=1}^{D} \log \text{Norm}_{y_{id}}\bigl[f_d[x_i,\phi],\, g_d[x_i,\phi]^2\bigr].$$

Expanding the normal density inside the log:

$$L[\phi] = \sum_{i=1}^{I}\sum_{d=1}^{D} \left(\frac{\bigl(y_{id} - f_d[x_i,\phi]\bigr)^2}{2\, g_d[x_i,\phi]^2} + \frac{1}{2}\log\Bigl[2\pi\, g_d[x_i,\phi]^2\Bigr]\right).$$
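A sketch of this heteroscedastic loss in PyTorch, assuming the network produces the means `mu` and the pre-square outputs `g` (so the variance is `g ** 2`, as above):

```python
import math
import torch

def heteroscedastic_nll(mu, g, y):
    # mu, g, y: tensors of shape (I, D); the variance for each dimension is g**2
    var = g ** 2
    return torch.sum((y - mu) ** 2 / (2.0 * var) + 0.5 * torch.log(2.0 * math.pi * var))
```

A small epsilon is often added to `var` in practice so the loss stays finite when a predicted variance collapses toward zero.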

Problem 5.9

Consider a multivariate regression problem in which we predict the height of a person in meters and their weight in kilos from data $x$. Here, the units take quite different ranges. What problems do you see this causing? Propose two solutions to these problems.

The values for weight in kilos (roughly the 50–100 range) are going to be much larger than those for height in meters (roughly the 1–2 range). Using least squares, the loss will therefore focus much more on the weight than on the height.

Possible solutions:

  • Rescale/normalize the outputs so they have the same standard deviation, build the model to predict the normalized outputs, and scale the predictions back after inference (see the sketch after this list).
  • Learn a separate variance for the two dimensions so that the model can automatically take care of this. This can be done in either a homoscedastic or heteroscedastic way.
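A minimal NumPy sketch of the first solution (the array values are made up for illustration):

```python
import numpy as np

# y: (I, 2) targets with height in meters (column 0) and weight in kilos (column 1)
y = np.array([[1.80, 80.0],
              [1.65, 62.0],
              [1.72, 70.0]])

mean, std = y.mean(axis=0), y.std(axis=0)
y_norm = (y - mean) / std            # train the model against these normalized targets

# After inference, map a normalized prediction back to the original units
y_pred_norm = np.array([0.3, -0.5])  # hypothetical model output
y_pred = y_pred_norm * std + mean
print(y_pred)
```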

Problem 5.10

Extend the model from problem 5.3 to predict both the wind direction and the wind speed and define the associated loss function.

Direction uses the von Mises distribution from problem 5.3:

$$\Pr(y_{\text{dir}}|\mu,\kappa) = \frac{\exp\bigl[\kappa \cos(y_{\text{dir}} - \mu)\bigr]}{2\pi \cdot \text{Bessel}_0[\kappa]},$$

which, with $\mu = f_{\text{dir}}[x,\phi]$ and $\kappa$ treated as a constant, results in a negative log-likelihood loss function of

$$L_{\text{dir}}[\phi] = \sum_{i=1}^{I} -\kappa \cos\bigl(y_{\text{dir},i} - f_{\text{dir}}[x_i,\phi]\bigr) + \text{const.}$$

For wind speed $y_{\text{spd}} \ge 0$, we can use a Weibull distribution:

$$\Pr(y_{\text{spd}}|k,\lambda) = \frac{k}{\lambda}\left(\frac{y_{\text{spd}}}{\lambda}\right)^{k-1} \exp\left[-\left(\frac{y_{\text{spd}}}{\lambda}\right)^{k}\right],$$

where $k > 0$ is a shape parameter and $\lambda > 0$ is a scale parameter.

Using the negative log-likelihood of the Weibull distribution gives us

$$L_{\text{spd}} = \sum_{i=1}^{I} \left(-\log k + k\log\lambda - (k - 1)\log y_{\text{spd},i} + \left(\frac{y_{\text{spd},i}}{\lambda}\right)^{k}\right).$$

We can learn both $k$ and $\lambda$, or fix $k$ and just learn $\lambda$. Let's say we learn both, with $k = f_k[x,\phi]$ and $\lambda = f_\lambda[x,\phi]$, where both outputs are passed through a function such as $\exp[\cdot]$ to enforce positivity. Then the complete loss function is:

$$L[\phi] = \sum_{i=1}^{I} \left(-\kappa\cos\bigl(y_{\text{dir},i} - f_{\text{dir}}[x_i,\phi]\bigr) - \log f_k[x_i,\phi] + f_k[x_i,\phi]\log f_\lambda[x_i,\phi] - \bigl(f_k[x_i,\phi] - 1\bigr)\log y_{\text{spd},i} + \left(\frac{y_{\text{spd},i}}{f_\lambda[x_i,\phi]}\right)^{f_k[x_i,\phi]}\right).$$
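A PyTorch sketch of the combined loss, assuming the network's raw outputs for the Weibull shape and scale are exponentiated to enforce positivity and that `kappa` is fixed, as above (the names and the example value of `kappa` are illustrative):

```python
import torch

def wind_nll(mu_dir, raw_k, raw_lam, y_dir, y_spd, kappa=2.0):
    # mu_dir: predicted mean direction (radians); raw_k, raw_lam: unconstrained outputs
    # for the Weibull shape and scale; y_dir, y_spd: observed direction and speed (> 0).
    k = torch.exp(raw_k)      # enforce k > 0
    lam = torch.exp(raw_lam)  # enforce lambda > 0

    # von Mises term (the constant log[2*pi*Bessel_0(kappa)] is dropped)
    nll_dir = -kappa * torch.cos(y_dir - mu_dir)

    # Weibull term: -log k + k*log(lambda) - (k - 1)*log(y) + (y / lambda)^k
    nll_spd = -torch.log(k) + k * torch.log(lam) - (k - 1.0) * torch.log(y_spd) + (y_spd / lam) ** k

    return torch.sum(nll_dir + nll_spd)
```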