Cross-entropy loss is based on the idea of finding parameters $\theta$ that minimize the distance between the empirical distribution $q(y)$ of the observed data and a model distribution $Pr(y|\theta)$.

The distance between two probability distributions $q(z)$ and $p(z)$ can be evaluated using the Kullback-Leibler (KL) divergence:

$$
D_{KL}\bigl[q(z)\,\|\,p(z)\bigr] = \int q(z)\log q(z)\,dz - \int q(z)\log p(z)\,dz.
$$
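As a quick numerical check, here is a minimal sketch that evaluates this divergence for two small discrete distributions; the particular values of $q$ and $p$ are invented for illustration. The second term in the split below is the cross-entropy that this section builds on.

```python
import numpy as np

# Illustrative discrete distributions (values chosen arbitrarily).
q = np.array([0.1, 0.4, 0.5])   # "empirical" distribution q
p = np.array([0.2, 0.3, 0.5])   # model distribution p

# D_KL[q || p] = sum_z q(z) log q(z)  -  sum_z q(z) log p(z)
entropy_term = np.sum(q * np.log(q))          # depends only on q
cross_entropy_term = -np.sum(q * np.log(p))   # the cross-entropy H(q, p)
kl = entropy_term + cross_entropy_term

print(f"KL(q||p)             = {kl:.4f}")
print(f"cross-entropy H(q,p) = {cross_entropy_term:.4f}")
```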
Now consider that we observe an empirical data distribution at points $\{y_i\}_{i=1}^{I}$. We can describe this as a weighted sum of point masses:

$$
q(y) = \frac{1}{I}\sum_{i=1}^{I}\delta[y - y_i],
$$

where $\delta[\cdot]$ is the Dirac delta function.
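A minimal sketch of this representation, assuming a handful of made-up observations $y_i$: each point carries a mass of $1/I$, and expectations against the empirical distribution reduce to weighted sums over the data.

```python
import numpy as np

# Illustrative observed data points y_1..y_I (values invented).
y = np.array([1.3, -0.2, 0.7, 2.1, 0.7])
I = len(y)
weights = np.full(I, 1.0 / I)        # one point mass of 1/I per y_i

print("total mass:", weights.sum())  # the point masses sum to 1

# Expectations against q(y) collapse to weighted sums over the data,
# e.g. E_q[y] is just the ordinary sample mean:
print("E_q[y] =", np.sum(weights * y), "=", y.mean())
```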

We want to minimize the KL divergence between the model distribution $Pr(y|\theta)$ and this empirical distribution $q(y)$:

$$
\begin{aligned}
\hat{\theta} &= \underset{\theta}{\mathrm{argmin}}\left[\int q(y)\log q(y)\,dy - \int q(y)\log Pr(y|\theta)\,dy\right] \\
             &= \underset{\theta}{\mathrm{argmin}}\left[-\int q(y)\log Pr(y|\theta)\,dy\right].
\end{aligned}
$$
The first term disappears, as it has no dependence on $\theta$. The second term is known as the cross-entropy. It can be interpreted as the amount of uncertainty that remains in one distribution after taking into account what we already know from the other.
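The sketch below illustrates this for a toy one-parameter family of discrete distributions (invented purely for illustration): the KL divergence and the cross-entropy differ only by a constant that depends on $q$ alone, so they share the same minimizer over $\theta$.

```python
import numpy as np

# Fixed "empirical" distribution q and a one-parameter softmax family
# p_theta; both are assumptions made up for this illustration.
q = np.array([0.1, 0.4, 0.5])

def p_theta(theta):
    logits = theta * np.array([1.0, 2.0, 3.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

thetas = np.linspace(-2.0, 2.0, 401)
kl = np.array([np.sum(q * np.log(q / p_theta(t))) for t in thetas])
ce = np.array([-np.sum(q * np.log(p_theta(t))) for t in thetas])

print("argmin over theta, KL           :", thetas[kl.argmin()])
print("argmin over theta, cross-entropy:", thetas[ce.argmin()])
# The gap between the curves is the constant entropy term of q.
print("difference is constant:", np.allclose(kl - ce, np.sum(q * np.log(q))))
```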

Now, we substitute in the definition of the empirical distribution $q(y)$:

$$
\begin{aligned}
\hat{\theta} &= \underset{\theta}{\mathrm{argmin}}\left[-\int \frac{1}{I}\sum_{i=1}^{I}\delta[y - y_i]\log Pr(y|\theta)\,dy\right] \\
             &= \underset{\theta}{\mathrm{argmin}}\left[-\frac{1}{I}\sum_{i=1}^{I}\log Pr(y_i|\theta)\right] \\
             &= \underset{\theta}{\mathrm{argmin}}\left[-\sum_{i=1}^{I}\log Pr(y_i|\theta)\right].
\end{aligned}
$$
The product of the two terms in the first line corresponds to pointwise multiplying the point masses (5.12a) with the logarithm of the distribution (5.12b). We are left with a finite set of weighted probability masses centered on the data points. In the last line, we eliminate the constant scaling factor $1/I$, as this doesn’t affect the position of the minimum.
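As a concrete sketch, assume a Gaussian model with candidate parameters $\mu$ and $\sigma$ and a few made-up data points: the cross-entropy against the empirical distribution is just the average negative log-probability of the observations, and dropping $1/I$ gives the summed form above.

```python
import numpy as np

# Illustrative data and candidate Gaussian parameters (all invented).
y = np.array([1.3, -0.2, 0.7, 2.1, 0.7])   # observed points y_1..y_I
mu, sigma = 0.5, 1.0                        # candidate model parameters

# log Pr(y_i | theta) for the Gaussian model
log_prob = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

cross_entropy = -np.mean(log_prob)   # -(1/I) * sum_i log Pr(y_i | theta)
nll = -np.sum(log_prob)              # same minimizer once 1/I is dropped

print(cross_entropy, nll)
```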

In machine learning, the distribution parameters are computed by a model $f[x_i, \phi]$ with parameters $\phi$, so we have:

$$
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[-\sum_{i=1}^{I}\log Pr\bigl(y_i \,|\, f[x_i, \phi]\bigr)\right].
$$
This is exactly the negative log-likelihood criterion: the cross-entropy and negative log-likelihood criteria are equivalent.
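A small sketch of this equivalence for multiclass classification, with an invented linear model standing in for $f[x_i, \phi]$: the model maps each input to the logits of a categorical distribution, and the cross-entropy between the one-hot empirical labels and the predicted distribution equals the negative log-likelihood of those labels.

```python
import numpy as np

# Toy data and weights, generated at random purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))           # 8 inputs with 4 features each
labels = rng.integers(0, 3, size=8)   # 3 classes
W = rng.normal(size=(4, 3))           # stand-in for f[x, phi] (linear model)

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

log_probs = log_softmax(X @ W)        # log Pr(y | f[x, phi]) per class

# Negative log-likelihood: -sum_i log Pr(y_i | f[x_i, phi])
nll = -np.sum(log_probs[np.arange(len(labels)), labels])

# Cross-entropy form: one-hot empirical distribution dotted with log-probs
one_hot = np.eye(3)[labels]
cross_entropy = -np.sum(one_hot * log_probs)

print(nll, cross_entropy, np.isclose(nll, cross_entropy))
```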