Kullback-Leibler Divergence shows us the relationship between data compression and density estimation.

  • Density estimation is the problem of modeling an unknown probability distribution.

The most efficient compression is achieved when we know the true distribution. If we use a distribution that differs from the true one, the encoding is necessarily less efficient, and on average the additional information that must be transmitted is (at least) equal to the KL divergence between the two distributions.
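As a quick numerical check (a sketch with arbitrary example distributions, not taken from the text above), the following Python snippet shows that coding a discrete source with a mismatched distribution $q$ costs, on average, the entropy $H(p)$ plus exactly $\mathrm{KL}(p\|q)$ extra bits per symbol:

```python
# Illustrative distributions only: p is the true source distribution,
# q is the (mismatched) distribution used to build the code lengths -log2 q(x).
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched coding distribution

entropy = -np.sum(p * np.log2(p))          # optimal average code length, H(p)
cross_entropy = -np.sum(p * np.log2(q))    # average code length when coding with q
kl = np.sum(p * np.log2(p / q))            # KL divergence, in bits

print(f"H(p)       = {entropy:.3f} bits/symbol")
print(f"H(p, q)    = {cross_entropy:.3f} bits/symbol")
print(f"extra cost = {cross_entropy - entropy:.3f} bits/symbol")
print(f"KL(p || q) = {kl:.3f} bits/symbol")   # matches the extra cost
```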

Suppose data is being generated from an unknown distribution $p(\mathbf{x})$ that we want to model. We can try to approximate this distribution using some parametric model $q(\mathbf{x}|\boldsymbol{\theta})$, governed by a set of adjustable parameters $\boldsymbol{\theta}$ (for example, a multivariate Gaussian).

One way to determine $\boldsymbol{\theta}$ would be to minimize the KL divergence between $p(\mathbf{x})$ and $q(\mathbf{x}|\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$. We cannot do this directly because we do not know $p(\mathbf{x})$. However, suppose that we have observed a finite set of training points $\mathbf{x}_n$, for $n = 1, \ldots, N$, drawn from $p(\mathbf{x})$. Then the expectation with respect to $p(\mathbf{x})$ can be approximated by a finite sum over these points, such that:

$$\mathrm{KL}(p\|q) \simeq \frac{1}{N} \sum_{n=1}^{N} \left\{ -\ln q(\mathbf{x}_n|\boldsymbol{\theta}) + \ln p(\mathbf{x}_n) \right\}$$

  • The second term on the right-hand side is independent of $\boldsymbol{\theta}$.
  • The first term is the negative log-likelihood function for $\boldsymbol{\theta}$ under the distribution $q(\mathbf{x}|\boldsymbol{\theta})$, evaluated using the training set.
  • Therefore, minimizing the KL divergence is equivalent to maximizing the log-likelihood function, as the sketch after this list illustrates.
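To make this concrete, here is a minimal Python sketch (the Gaussian form of $q(\mathbf{x}|\boldsymbol{\theta})$, the grid search, and all numerical values are illustrative assumptions, not taken from the text above): minimizing the sampled $-\ln q(\mathbf{x}_n|\boldsymbol{\theta})$ term is exactly maximum-likelihood estimation, and it recovers the usual closed-form estimates.

```python
# Sketch: data drawn from an "unknown" Gaussian p(x); we fit a parametric
# Gaussian q(x | theta) with theta = (mu, sigma) by minimizing the average
# -ln q(x_n | theta), i.e. the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples from p(x) (values assumed)

def negative_log_likelihood(mu, sigma, data):
    """Average -ln q(x_n | mu, sigma) for a univariate Gaussian model q."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (data - mu)**2 / (2 * sigma**2))

# Crude grid search over theta = (mu, sigma); any optimizer would do.
mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
best = min((negative_log_likelihood(m, s, x), m, s) for m in mus for s in sigmas)

print(f"grid-search MLE: mu = {best[1]:.2f}, sigma = {best[2]:.2f}")
print(f"closed-form MLE: mu = {x.mean():.2f}, sigma = {x.std():.2f}")
```

Both lines print (approximately) the same parameters, which is the point of the last bullet: since the $\ln p(\mathbf{x}_n)$ term does not depend on $\boldsymbol{\theta}$, the parameters that minimize the finite-sum KL estimate are the maximum-likelihood parameters.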