Consider a model $f[x, \phi]$ with parameters $\phi$ that computes an output from an input $x$. Instead of thinking of the model as directly computing a prediction $y$, we can shift perspective and consider it as computing a conditional probability distribution $Pr(y|x)$ over possible outputs $y$ given the input $x$.

The loss encourages each training output $y_i$ to have a high probability under the distribution $Pr(y_i|x_i)$ computed from the corresponding input $x_i$.
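One common way to make this precise is the maximum likelihood criterion, sketched below. It assumes the training pairs $\{x_i, y_i\}_{i=1}^{I}$ are independent; the symbols $\phi$ (model parameters), $I$ (number of training pairs), and $\hat{\phi}$ (fitted parameters) are notation assumed here for illustration. Because the logarithm is monotonic, maximizing the product of probabilities is equivalent to minimizing the sum of negative log probabilities:

```latex
\hat{\phi}
  = \underset{\phi}{\operatorname{argmax}} \left[ \prod_{i=1}^{I} Pr(y_i \mid x_i) \right]
  = \underset{\phi}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \log Pr(y_i \mid x_i) \right]
```

Here each $Pr(y_i \mid x_i)$ depends on $\phi$ through the model that computes the distribution from $x_i$.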

  • a) Regression task, where the goal is to predict a real-valued output $y$ from the input $x$ based on training data (orange points). For each input value $x$, the model predicts a distribution $Pr(y|x)$ over the output $y$. The loss function aims to maximize the probability of the observed training outputs $y_i$ under the distribution predicted from the corresponding inputs $x_i$.
  • b) To predict discrete classes $y$ in a classification task, we use a discrete probability distribution, so the model predicts a different histogram over the four possible values of $y$ for each value of $x$.

  • c) To predict counts, we use distributions defined over the non-negative integers.
  • d) To predict directions, we use distributions defined over circular domains.

Computing a distribution over outputs

How exactly can a model be adapted to compute a probability distribution?

First, we choose a parametric distribution $Pr(y|\theta)$ defined on the output domain of $y$. Then, we use the network to compute one or more of the parameters $\theta$ of this distribution.

For example, suppose the prediction domain is the set of real numbers, so $y \in \mathbb{R}$. Here, we might choose the univariate normal distribution, which is defined on $\mathbb{R}$. This distribution is defined by the mean $\mu$ and variance $\sigma^2$, so $\theta = \{\mu, \sigma^2\}$. The model might predict the mean $\mu$, and the variance $\sigma^2$ could be treated as an unknown constant.
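The normal-distribution example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear model `predict_mean`, its parameters `phi`, the toy data, and the fixed value of `sigma2` are all hypothetical and not taken from the text. The sketch shows the model output being used as the mean of a normal distribution, and how well two parameter settings explain the same data.

```python
import math

def predict_mean(x, phi):
    """Hypothetical linear model: returns the mean mu of the normal for input x."""
    phi0, phi1 = phi
    return phi0 + phi1 * x

def normal_log_pdf(y, mu, sigma2):
    """Log density of the univariate normal with mean mu and variance sigma2 at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (y - mu) ** 2 / (2.0 * sigma2)

def negative_log_likelihood(xs, ys, phi, sigma2):
    """Sum of -log Pr(y_i | x_i) over all training pairs."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = predict_mean(x, phi)  # the model supplies the mean for this input
        total -= normal_log_pdf(y, mu, sigma2)
    return total

# Made-up training data lying roughly on y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
sigma2 = 0.25  # variance treated as a fixed constant (assumed value)

nll_good = negative_log_likelihood(xs, ys, (1.0, 2.0), sigma2)
nll_bad = negative_log_likelihood(xs, ys, (0.0, 0.0), sigma2)
print(nll_good < nll_bad)  # parameters that fit the data give a lower loss
```

Parameters that place the predicted means close to the observed outputs assign those outputs higher probability, and therefore a lower negative log-likelihood.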

Inference

During training, we use the negative log-likelihood as a loss function to find the best parameters. At inference time, the network no longer predicts the output $y$ directly but instead determines a probability distribution $Pr(y|x)$ over $y$. However, we often want a point estimate rather than a distribution.

To do this, we return the output value at which the distribution is maximal, $\hat{y} = \operatorname{argmax}_y Pr(y|x)$.

It’s usually possible to find an expression for this in terms of the distribution parameters $\theta$ predicted by the model. For example, in the univariate normal distribution, the maximum occurs at the mean $\mu$.
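As a rough illustration of turning a predicted distribution into a point estimate, the sketch below returns the mode in closed form for two cases: a univariate normal (the mean) and a discrete distribution over classes (the most probable class). The function names and the example numbers are hypothetical; the grid search is included only as a numerical sanity check of the closed-form answer for the normal case.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of the univariate normal with mean mu and variance sigma2 at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def normal_point_estimate(mu, sigma2):
    """The normal density peaks at its mean, whatever the variance."""
    return mu

def categorical_point_estimate(probs):
    """Index of the most probable class in a discrete distribution."""
    return max(range(len(probs)), key=lambda k: probs[k])

# Assumed distribution parameters, as if predicted by a model for one input.
mu, sigma2 = 1.5, 0.5

# Numerical check: search a grid of candidate outputs for the highest density.
grid = [i / 100.0 for i in range(-500, 501)]
best_y = max(grid, key=lambda y: normal_pdf(y, mu, sigma2))

print(best_y == normal_point_estimate(mu, sigma2))
print(categorical_point_estimate([0.1, 0.6, 0.2, 0.1]))
```

Using the closed-form expression avoids searching over the output domain at inference time, which is why such an expression is worth deriving when it exists.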