When two variables $x$ and $y$ are independent, their joint distribution factorizes into the product of their marginals, $p(x, y) = p(x)\,p(y)$.
If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals, given by
$$
I[x, y] \equiv \mathrm{KL}\bigl(p(x, y)\,\big\|\,p(x)\,p(y)\bigr) = -\iint p(x, y) \ln\!\left\{\frac{p(x)\,p(y)}{p(x, y)}\right\} \mathrm{d}x\,\mathrm{d}y,
$$
which is called the mutual information between the variables $x$ and $y$.
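As a concrete illustration (not taken from the text), the short Python sketch below evaluates this definition for a small, arbitrarily chosen discrete joint distribution, computing the mutual information as the KL divergence between the joint table and the outer product of its marginals.

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y) over two binary variables
# (an arbitrary example; any normalized, non-factorizing table would do).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Mutual information I[x, y] = KL(p(x, y) || p(x) p(y))
#                            = sum_{x, y} p(x, y) ln[ p(x, y) / (p(x) p(y)) ],
# the discrete analogue of the integral above.
mutual_info = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mutual_info)  # strictly positive here, since x and y are not independent
```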
From the properties of the Kullback-Leibler divergence, we see that $I[x, y] \geq 0$, with equality if and only if $x$ and $y$ are independent.
Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy through
$$
I[x, y] = H[x] - H[x|y] = H[y] - H[y|x].
$$
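This decomposition can be checked numerically. The sketch below reuses the hypothetical joint table from the previous example and computes $I[x, y]$ both as $H[x] - H[x|y]$ and as $H[y] - H[y|x]$, using the chain-rule identity $H[x|y] = H[x, y] - H[y]$; both values match the KL-divergence form computed earlier.

```python
import numpy as np

# Same hypothetical joint table as before.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def entropy(p):
    """Entropy -sum_i p_i ln p_i in nats; works for 1-D or 2-D tables."""
    return -np.sum(p * np.log(p))

# Conditional entropies via the chain rule: H[x|y] = H[x, y] - H[y], etc.
H_x_given_y = entropy(p_xy) - entropy(p_y)
H_y_given_x = entropy(p_xy) - entropy(p_x)

# Both forms of the decomposition give the same mutual information.
print(entropy(p_x) - H_x_given_y)  # I[x, y] = H[x] - H[x|y]
print(entropy(p_y) - H_y_given_x)  # I[x, y] = H[y] - H[y|x]
```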
Thus, the mutual information represents the reduction in uncertainty about $x$ by virtue of being told the value of $y$, or vice versa.
From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x|y)$ as the posterior distribution after observing $y$. The mutual information thus represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$.
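As a sketch of this interpretation, again using the arbitrary joint table from the earlier examples rather than a distribution from the text, the prior entropy $H[x]$ minus the expected posterior entropy $\mathbb{E}_y\bigl[H[x|y]\bigr]$ recovers the same mutual information.

```python
import numpy as np

# Same hypothetical joint table as in the earlier sketches.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)   # prior p(x)
p_y = p_xy.sum(axis=0)

def entropy(p):
    return -np.sum(p * np.log(p))

# Posterior p(x|y) = p(x, y) / p(y); column j holds the posterior after observing y = j.
p_x_given_y = p_xy / p_y

# Expected posterior entropy E_y[ H[x|y] ] = sum_y p(y) H[p(x|y)].
expected_posterior_entropy = sum(p_y[j] * entropy(p_x_given_y[:, j])
                                 for j in range(p_xy.shape[1]))

# The drop from prior entropy to expected posterior entropy is the mutual information.
print(entropy(p_x) - expected_posterior_entropy)
```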