Consider a model $f[x, \phi]$ with parameters $\phi$ that computes an output $y$ from an input $x$. Instead of thinking about the model as directly computing a prediction $y$, we can shift perspective and consider the model as computing a conditional probability distribution $Pr(y|x)$ over possible outputs $y$ given the input $x$.

The loss encourages each training output $y_i$ to have a high probability under the distribution $Pr(y_i|x_i)$ computed from the corresponding input $x_i$.
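In other words, we seek the parameters that maximize the likelihood of the training outputs, or equivalently minimize the negative log-likelihood:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmax}} \prod_{i} Pr\bigl(y_i \mid f[x_i, \phi]\bigr) = \underset{\phi}{\mathrm{argmin}} \left[ -\sum_{i} \log Pr\bigl(y_i \mid f[x_i, \phi]\bigr) \right].$$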

!>Maximum Likelihood for Loss Functions-20250609235217601.png

  • a) Regression task, where the goal is to predict a real-valued output $y$ from the input $x$ based on training data (orange points). For each input value $x$, the model predicts a distribution $Pr(y|x)$ over the output $y$. The loss function aims to maximize the probability of the observed training outputs $y_i$ under the distributions predicted from the corresponding inputs $x_i$.
  • b) To predict discrete classes in a classification task, we use a discrete probability distribution, so the model predicts a different histogram over the four possible values of $y$ for each value of $x$.

  • c) To predict counts, we use distributions defined over the positive integers.
  • d) To predict directions, we use distributions defined over circular domains.

Computing a distribution over outputs

How exactly can a model be adapted to compute a probability distribution?

First, we choose a parametric distribution $Pr(y|\theta)$ defined on the output domain $\mathcal{Y}$. Then, we use the network to compute one or more of the parameters $\theta$ of this distribution.

For example, suppose the prediction domain is the set of real numbers, so $y \in \mathbb{R}$. Here, we might choose the univariate normal distribution, which is defined on $\mathbb{R}$. This distribution has two parameters, the mean $\mu$ and the variance $\sigma^2$, so $\theta = \{\mu, \sigma^2\}$. The model might predict the mean $\mu = f[x, \phi]$, and the variance $\sigma^2$ could be treated as an unknown constant.
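As a minimal sketch of this recipe (hypothetical code, not from the source; it assumes PyTorch and fixes the standard deviation rather than predicting it), the network outputs the mean of a normal distribution, and the loss is the negative log-likelihood of the training outputs under the predicted distributions:

```python
import torch
import torch.nn as nn

# Hypothetical model: the single network output is the mean mu of a normal
# distribution Pr(y | x) = Normal(mu = f(x, phi), sigma), with sigma fixed.
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
sigma = 1.0  # standard deviation treated as a fixed constant, not predicted

def nll_loss(x, y):
    """Negative log-likelihood of targets y under the predicted normal distributions."""
    mu = model(x)                                  # network predicts the mean
    dist = torch.distributions.Normal(mu, sigma)   # fixed scale
    return -dist.log_prob(y).mean()                # average NLL over the batch

# Toy training loop on synthetic data (illustration only).
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2 * x + 0.1 * torch.randn_like(x)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nll_loss(x, y)
    loss.backward()
    opt.step()
```

With the standard deviation held fixed, minimizing this negative log-likelihood is equivalent (up to an additive constant and a scale factor) to minimizing the least-squares error on the predicted mean.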

Inference

We use the negative log-likelihood as a loss function to find the best parameters $\hat\phi$. At inference time, the network no longer predicts the output $y$ directly but instead determines a probability distribution over $y$. However, we often want a point estimate rather than a distribution.

To do this, we return the maximum of the distribution:

$$\hat{y} = \underset{y}{\mathrm{argmax}}\; Pr\bigl(y \mid f[x, \hat\phi]\bigr).$$

It’s usually possible to find an expression for this maximum in terms of the distribution parameters predicted by the model. For example, for the univariate normal distribution, the maximum occurs at the mean $\mu$.
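Continuing the hypothetical sketch above, the point estimate for the normal case is simply the predicted mean:

```python
import torch

# The mode of a normal distribution is its mean, so the point estimate at
# inference time is just the network output.
@torch.no_grad()
def predict(model, x):
    return model(x)   # y_hat = argmax_y Normal(y; mu, sigma) = mu

y_hat = predict(model, torch.tensor([[0.5]]))  # `model` from the sketch above
```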

Multiple Outputs

Often, we wish to make more than one prediction with the same model, so the target output is a vector. For example, we might want to predict a molecule’s melting and boiling point (multivariate regression), or the obstacle class at every point in an image (multivariate classification). While it’s possible to define multivariate probability distributions and use a neural network to model their parameters as a function of the input, it’s more usual to treat each prediction as independent.

Independence implies that we treat the probability $Pr(\mathbf{y}|x)$ as a product of univariate terms, one for each element $y_d$:

$$Pr\bigl(\mathbf{y} \mid f[x, \phi]\bigr) = \prod_{d} Pr\bigl(y_d \mid f_d[x, \phi]\bigr),$$

where $f_d[x, \phi]$ is the $d$-th set of network outputs, which describe the parameters of the distribution over $y_d$. For example:

  • To predict multiple continuous variables $y_d \in \mathbb{R}$, we use a normal distribution for each $y_d$, and the network outputs $f_d[x, \phi]$ predict the means of these distributions.
  • To predict multiple discrete variables $y_d$, we use a categorical distribution for each $y_d$. Here, each set of network outputs $f_d[x, \phi]$ predicts the values that contribute to the categorical distribution for $y_d$.

When we minimize the negative log-likelihood, this product becomes a sum of terms:

$$L[\phi] = -\sum_{i} \sum_{d} \log Pr\bigl(y_{id} \mid f_d[x_i, \phi]\bigr),$$

where $y_{id}$ is the $d$-th output from the $i$-th training example.
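A hypothetical sketch of the multivariate regression case (again assuming PyTorch, with each output modeled as an independent normal with fixed standard deviation; the model and dimensions below are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical multivariate regression: predict D = 2 real outputs per example
# (e.g., melting point and boiling point), each modeled by an independent normal.
D = 2
multi_model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, D))

def multi_nll(x, y):
    """Negative log-likelihood summed over output dimensions, averaged over examples."""
    mu = multi_model(x)                         # shape (batch, D): one mean per output y_d
    dist = torch.distributions.Normal(mu, 1.0)  # independent normals, fixed scale
    return -dist.log_prob(y).sum(dim=1).mean()  # sum over d, average over training examples i
```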

To make two or more types of prediction simultaneously, we similarly assume the errors in each are independent.

  • Example: To predict wind direction and strength, we might choose the von Mises distribution (defined on circular domains) for the direction, and the exponential distribution (defined on positive real numbers) for the strength.

The independence assumption implies that the joint likelihood of the two predictions is the product of the individual likelihoods. These terms become additive when we compute the negative log-likelihood.
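A hypothetical sketch of this combined loss (assuming PyTorch; the two-headed model and the fixed concentration parameter are illustrative choices, not from the source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-headed model for wind prediction: one head gives the mean
# direction of a von Mises distribution, the other the rate of an exponential
# distribution over wind strength. The concentration kappa is fixed for simplicity.
backbone = nn.Sequential(nn.Linear(4, 32), nn.ReLU())
direction_head = nn.Linear(32, 1)
strength_head = nn.Linear(32, 1)

def combined_nll(x, y_direction, y_strength, kappa=2.0):
    """Joint NLL: the independence assumption makes the two terms additive."""
    h = backbone(x)
    mu = direction_head(h)                     # mean direction in radians
    rate = F.softplus(strength_head(h))        # rate parameter must be positive
    direction_dist = torch.distributions.VonMises(mu, torch.tensor(kappa))
    strength_dist = torch.distributions.Exponential(rate)
    nll_direction = -direction_dist.log_prob(y_direction)
    nll_strength = -strength_dist.log_prob(y_strength)
    return (nll_direction + nll_strength).mean()
```

Because the joint negative log-likelihood is a sum, the direction and strength terms contribute separate gradients, coupled only through the shared backbone.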