Consider a discrete random variable $x$. How much information is received when we observe a specific value of this variable?
The amount of information can be viewed as the “degree of surprise” on learning the value of $x$. If we are told that a highly improbable event has just occurred, we receive more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen, we would receive no information. Our measure of information content will therefore depend on the probability distribution $p(x)$, and so we look for a quantity $h(x)$ that is a monotonic function of the probability $p(x)$ and that expresses the information content.
Our goal, then, is to find the form of $h(\cdot)$. This can be done by noting that if we have two events $x$ and $y$ that are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, so that $h(x, y) = h(x) + h(y)$. Two unrelated events are statistically independent, and so $p(x, y) = p(x)\,p(y)$.
The operation that turns products into sums is the logarithm; therefore, for $h(x, y) = h(x) + h(y)$ to hold under the condition that $p(x, y) = p(x)\,p(y)$, it follows that $h(x)$ must be proportional to the logarithm of the probability of $x$, such that $h(x) \propto \log p(x)$. We then add a negative sign: probabilities lie between $0$ and $1$, and the logarithm of a number between $0$ and $1$ is negative, whereas we want the measure of information to be positive (or zero). Negating the value gives us

$$
h(x) = -\log_2 p(x).
$$
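As a quick verification, written out here for completeness, this form satisfies the additivity requirement for independent events:

$$
\begin{aligned}
h(x, y) &= -\log_2 p(x, y) \\
        &= -\log_2\bigl(p(x)\,p(y)\bigr) \\
        &= -\log_2 p(x) - \log_2 p(y) \\
        &= h(x) + h(y).
\end{aligned}
$$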
Now, events that are certain (probability $1$) have zero information content, and events that are less likely carry more information. The choice of base for the logarithm is arbitrary, but base $2$ is a prevalent convention in information theory. The units of $h(x)$ here are bits (‘binary digits’).
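As a quick numerical illustration (a minimal sketch in Python, not part of the derivation above; the function name `information_content` is just an illustrative choice), the snippet below computes $h(x)$ in bits for a few probabilities and checks the additivity property for two independent events:

```python
import math

def information_content(p: float) -> float:
    """Information content h(x) = -log2 p(x), in bits."""
    return -math.log2(p)

print(information_content(1.0))    # -0.0, i.e. zero bits for a certain event
print(information_content(0.5))    # 1.0 bit, e.g. learning the outcome of a fair coin flip
print(information_content(1 / 8))  # 3.0 bits, a one-in-eight event

# Additivity for independent events: h(x, y) = h(x) + h(y)
# whenever p(x, y) = p(x) * p(y).
p_x, p_y = 0.5, 0.25
h_joint = information_content(p_x * p_y)
h_sum = information_content(p_x) + information_content(p_y)
print(math.isclose(h_joint, h_sum))  # True
```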
Averaging this information content over the distribution $p(x)$ gives the entropy of the random variable,

$$
\mathrm{H}[x] = -\sum_x p(x)\log_2 p(x).
$$

Another common choice is to use natural logarithms in defining entropy, such that

$$
\mathrm{H}[x] = -\sum_x p(x)\ln p(x).
$$

In this case, the entropy is measured in units of nats (from ‘natural logarithm’) instead of bits; the two units differ simply by a factor of $\ln 2$.
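For a concrete check (again a small illustrative sketch; the four-outcome distribution is made up for this purpose), the snippet below computes the entropy in both bits and nats and confirms that the two values differ by a factor of $\ln 2$:

```python
import math

# An example distribution over four outcomes (probabilities sum to 1).
p = [0.5, 0.25, 0.125, 0.125]

# Entropy in bits (base-2 logarithm) and in nats (natural logarithm).
H_bits = -sum(pi * math.log2(pi) for pi in p)
H_nats = -sum(pi * math.log(pi) for pi in p)

print(H_bits)  # 1.75 bits
print(H_nats)  # ~1.213 nats

# Nats and bits differ by a factor of ln 2: H_nats = H_bits * ln 2.
print(math.isclose(H_nats, H_bits * math.log(2)))  # True
```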