Entropy, KL divergence and Perplexity

Wikipedia: https://en.wikipedia.org/wiki/Entropy_%28information_theory%29

Entropy

Information entropy measures unpredictability: given the probability distribution generating the events, the entropy tells us how well we can predict the outcome of a sample drawn from it.

Formally, entropy (H(X)) of random variable X is:

H(X) = E[-log(p(X))]

and, for a discrete domain (all logarithms here are base 2, so entropy is measured in bits):

H(X) = -\sum{p(x_i) log (p(x_i))}

For example, for a fair coin, the probability of getting heads P(head) equals the probability of getting tails P(tail), both being 0.5. Thus, the entropy of the fair coin is 1.0 bit. As another example, a fair die has probability 1/6 for each of its six sides, giving an entropy of log(6) ≈ 2.585 bits.

The entropy of a distribution can thus be seen as the average number of bits required to encode a single sample drawn from the distribution. For a (fair) coin toss, this is 1 bit; for a die toss, it’s just under 3 bits.
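As a quick sanity check, here is a minimal Python sketch (the helper name entropy is just for illustration) that reproduces the coin and die numbers:

    import math

    def entropy(probs):
        # Shannon entropy in bits of a discrete distribution given as a list of probabilities
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin: 1.0
    print(entropy([1/6] * 6))    # fair die: ~2.585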

Cross-entropy

Given the true distribution p(X) and an approximating distribution q(X), the cross-entropy measures how far q(X) is from p(X):

H(p,q) = E_{p(X)} [-log(q(X))]

For a discrete domain:

H(p,q) = -\sum{p(x_i) log q(x_i)}

Thus, cross-entropy measures how efficient an encoding optimal for the approximating distribution q is when events actually follow the true distribution p. In other words, the cross-entropy from p to the approximation q is the average number of bits we need to encode a sample drawn from p using an encoding optimal for q.

For example, if our true distribution models a fair die and our approximating distribution assigns probability 0.5 to one side and 0.1 to each of the five other sides, the cross-entropy is

H(p,q) = -(\frac{1}{6} log(0.5) + 5*\frac{1}{6} log(0.1)) = -(\frac{1}{6} * -1.0 + \frac{5}{6} * -3.32) \approx 2.9

which is higher than a perfectly matching approximating distribution, in which case the cross-entropy would equal the entropy of p.
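Under the same assumptions as before (base-2 logarithms, distributions given as plain probability lists; cross_entropy is again just an illustrative name), a short sketch checks this example:

    import math

    def cross_entropy(p, q):
        # H(p, q) in bits; p is the true distribution, q the approximation, as probability lists
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [1/6] * 6               # fair die
    q = [0.5] + [0.1] * 5       # skewed approximation
    print(cross_entropy(p, q))  # ~2.93
    print(cross_entropy(p, p))  # ~2.585, i.e. the entropy of p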

Kullback-Leibler divergence

(Also known as relative entropy or information gain.)

KL divergence D_{KL}(p||q) from (true) distribution p to approximating distribution q measures the increase in required encoding length when an encoding optimized for q is used for events distributed according to p.

In terms of the entropies defined earlier:

D_{KL}(p||q) = H(p,q) - H(p)

For discrete domains:

D_{KL}(p||q) = \sum{ p(x_i) log (\frac{p(x_i)}{q(x_i)}) }

The more the approximating distribution q deviates from the true distribution p, the higher the KL divergence. For continuous distributions, it can be visualized as the area under p(x) log(\frac{p(x)}{q(x)}), i.e. the log-ratio of the two densities weighted by p.

KL divergence cannot be negative; it is exactly 0 when q = p.

For our die example, the KL divergence is about 0.35 bits: the extra bits needed to encode a sample from the true distribution p under an encoding optimal for q, compared to an encoding optimal for p itself.
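The same number can be reproduced directly from the discrete formula; kl_divergence below is once more just an illustrative helper:

    import math

    def kl_divergence(p, q):
        # D_KL(p || q) in bits, computed from the discrete formula above
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [1/6] * 6               # fair die
    q = [0.5] + [0.1] * 5       # skewed approximation
    print(kl_divergence(p, q))  # ~0.35, matches H(p, q) - H(p)
    print(kl_divergence(p, p))  # 0.0 when q = p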

Perplexity

Perplexity of a probability distribution p is defined as

2^{H(p)} = 2^{-\sum p(x) log(p(x))},

with H(p) the entropy of p.

Intuitively, perplexity measures the predictability of a distribution: it gives the number of outcomes a uniform distribution would need in order to be equally (un)predictable.

For example, if some distribution p has a perplexity of 6.0, it’s as unpredictable as a fair die.
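A small sketch makes the "equally unpredictable uniform distribution" reading concrete (perplexity is an illustrative name, building on the entropy helper above):

    import math

    def perplexity(p):
        # 2 to the power of the entropy of p (in bits)
        h = -sum(pi * math.log2(pi) for pi in p if pi > 0)
        return 2 ** h

    print(perplexity([0.5, 0.5]))   # 2.0: as unpredictable as a fair coin
    print(perplexity([1/6] * 6))    # 6.0: as unpredictable as a fair die
    print(perplexity([0.9, 0.1]))   # ~1.38: much easier to predict than a fair coin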

Perplexity of a model

Consider a distribution q that tries to approximate the true distribution p. We don’t know the true distribution p, but we can draw samples from it and use them to measure how well q fits p, via q‘s perplexity over the samples:

2^{H(\tilde{p}, q)} = 2^{-\sum \tilde{p}(x) log(q(x))}

where \tilde{p} is the empirical distribution of samples drawn from the true distribution. The lower the perplexity of q, the better it approximates the true distribution and the less it is “surprised” by a new sample from p.
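A minimal sketch of this, assuming the model q is given as a mapping from outcomes to probabilities and the samples are a plain list (model_perplexity and the example data are hypothetical):

    import math
    from collections import Counter

    def model_perplexity(samples, q):
        # 2 to the power of the cross-entropy between the empirical distribution
        # of the samples and the model q (a dict mapping outcome -> probability)
        counts = Counter(samples)
        n = len(samples)
        h = -sum((c / n) * math.log2(q[x]) for x, c in counts.items())
        return 2 ** h

    # hypothetical samples from a fair die, evaluated under the skewed model from above
    samples = [1, 3, 2, 6, 5, 4, 2, 1, 6, 3, 5, 4]
    q = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}
    print(model_perplexity(samples, q))   # ~7.65, worse than the 6.0 of a perfectly matching model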
