Kullback-Leibler Divergence shows us the relationship between data compression and density estimation.

  • Density estimation is the problem of modeling an unknown probability distribution.

The most efficient compression is achieved when we know the true distribution. If we use a distribution that differs from the true one, the encoding is necessarily less efficient, and on average the additional information that must be transmitted is (at least) equal to the KL divergence between the two distributions.
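As a quick numerical check (a sketch with arbitrary example distributions, not taken from the text above), the following Python snippet shows that coding a discrete source with a mismatched distribution $q$ costs, on average, the entropy $H(p)$ plus exactly $\mathrm{KL}(p\|q)$ extra bits per symbol:

```python
# Illustrative distributions only: p is the true source distribution,
# q is the (mismatched) distribution used to build the code lengths -log2 q(x).
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched coding distribution

entropy = -np.sum(p * np.log2(p))          # optimal average code length, H(p)
cross_entropy = -np.sum(p * np.log2(q))    # average code length when coding with q
kl = np.sum(p * np.log2(p / q))            # KL divergence, in bits

print(f"H(p)       = {entropy:.3f} bits/symbol")
print(f"H(p, q)    = {cross_entropy:.3f} bits/symbol")
print(f"extra cost = {cross_entropy - entropy:.3f} bits/symbol")
print(f"KL(p || q) = {kl:.3f} bits/symbol")   # matches the extra cost
```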

Suppose data is being generated from an unknown distribution $p(\mathbf{x})$ that we want to model. We can try to approximate this distribution using some parametric model $q(\mathbf{x}|\boldsymbol{\theta})$, governed by a set of adjustable parameters $\boldsymbol{\theta}$ (for example, a multivariate Gaussian).

One way to determine $\boldsymbol{\theta}$ would be to minimize the KL divergence between $p(\mathbf{x})$ and $q(\mathbf{x}|\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$. We cannot do this directly because we do not know $p(\mathbf{x})$. However, suppose that we have observed a finite set of training points $\mathbf{x}_n$, for $n = 1, \ldots, N$, drawn from $p(\mathbf{x})$. Then the expectation with respect to $p(\mathbf{x})$ can be approximated by a finite sum over these points, such that:

$$\mathrm{KL}(p\|q) \simeq \frac{1}{N} \sum_{n=1}^{N} \left\{ -\ln q(\mathbf{x}_n|\boldsymbol{\theta}) + \ln p(\mathbf{x}_n) \right\}$$

  • The second term on the right-hand side is independent of $\boldsymbol{\theta}$.
  • The first term is the negative log-likelihood function for $\boldsymbol{\theta}$ under the distribution $q(\mathbf{x}|\boldsymbol{\theta})$, evaluated using the training set.
  • Therefore, minimizing the KL divergence is equivalent to maximizing the log-likelihood function, as the sketch after this list illustrates.
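To make this concrete, here is a minimal Python sketch (the Gaussian form of $q(\mathbf{x}|\boldsymbol{\theta})$, the grid search, and all numerical values are illustrative assumptions, not taken from the text above): minimizing the sampled $-\ln q(\mathbf{x}_n|\boldsymbol{\theta})$ term is exactly maximum-likelihood estimation, and it recovers the usual closed-form estimates.

```python
# Sketch: data drawn from an "unknown" Gaussian p(x); we fit a parametric
# Gaussian q(x | theta) with theta = (mu, sigma) by minimizing the average
# -ln q(x_n | theta), i.e. the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples from p(x) (values assumed)

def negative_log_likelihood(mu, sigma, data):
    """Average -ln q(x_n | mu, sigma) for a univariate Gaussian model q."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (data - mu)**2 / (2 * sigma**2))

# Crude grid search over theta = (mu, sigma); any optimizer would do.
mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
best = min((negative_log_likelihood(m, s, x), m, s) for m in mus for s in sigmas)

print(f"grid-search MLE: mu = {best[1]:.2f}, sigma = {best[2]:.2f}")
print(f"closed-form MLE: mu = {x.mean():.2f}, sigma = {x.std():.2f}")
```

Both lines print (approximately) the same parameters, which is the point of the last bullet: since the $\ln p(\mathbf{x}_n)$ term does not depend on $\boldsymbol{\theta}$, the parameters that minimize the finite-sum KL estimate are the maximum-likelihood parameters.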