Basic Concepts in Information Theory

While information theory was developed to model a communication system, it has several basic concepts that have widespread applications. In this lecture, we introduced the following basic concepts: (1) Entropy; (2) Cross Entropy; (3) Relative Entropy (i.e., Kullback-Leibler Divergence); and (4) Mutual Information.

Entropy is a function defined on a discrete random variable. Intuitively, it measures the "uncertainty" or "randomness" of the random variable. That is, if a random variable is very uncertain/random, it would have a high entropy value; maximum entropy is achieved when the probability distribution of the variable is uniform. Similarly, if a random variable is relatively certain, it would have a small entropy value; in the extreme case when the variable always takes one single value, its entropy would be zero.

Mathematically, given p(X), entropy of X is defined as the expected value of log(1/p(X)), where expectation is taken according to p(X). Since log(1/p(X=a)) is the length of the code for value a according to optimal coding (i.e., coding that achieves maximum compression), entropy can be interpreted as the expected code length for values of X. Such an interpretation is useful for understanding cross entropy and relative entropy.

Cross entropy H(p,q) is the expected code length for values of X when we use a different distribution q(X) (rather than the true distribution p(X)) to obtain optimal coding. It is thus the expected value of log(1/q(X)), where expectation is taken according to p(X). Intuitively, H(p,q) should be no less than the entropy H(p) because the "optimal" coding designed based on q(X) is unlikely optimal for values following distribution p(X). We can then compute the average number of bits wasted by using q(X) instead of p(X) to design our code, which is H(p,q)-H(p). This quantity is called relative entropy or Kullback-Leibler divergence D(p||q).

D(p||q) can often be used as a distance measure defined on p and q. Intuitively, when p=q, our coding would be optimal, so D(p||q)=0. In general, the closer q is to p, the smaller D(p||q) would be. Note that D(p||q) is not equal to D(q||p), and they have different interpretations.

When p is taken as empirical distribution of values in our observed data and q is taken as the distribution given by a probabilistic model to be estimated, minimizing the KL-divergence D(p||q) would be equivalent to maximizing the likelihood of our data. Thus the problem maximum likelihood estimation can also be understood as to find a model q that would minimize the KL-divergence D(p||q) where p is the empirical distribution of values. Note that D(p||q)=H(p,q)-H(p). Since the second part H(p) is fixed if p is the empirical distribution, finding a q to minimize D(p||q) is equivalent to finding a q to minimize the cross entropy H(p,q), which can be shown to be equivalent to maximizing the data log-likelihood.

Another important concept is mutual information I(X;Y), which is defined as the KL-divergence of p(X,Y) and p(X)p(Y). Note that we now have to think of a distribution over (X,Y) pairs. That is, when we compute the KL-divergence, the sum would be taken over all the (x,y) pairs where x is a value of X and y is a value of Y. It's easy to see that p(X)p(Y) also gives us a distribution over such pairs. This interpretation of I(X;Y) tells us that I(X;Y) measures the association/dependency of X and Y because the more independent X and Y are, the closer p(X,Y) and p(X)p(Y) would be. In case when X and Y are completely independent, p(X,Y)=p(X)p(Y), so I(X;Y)=0.

Mutual information can also have another interpretation I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X). This interpretation indicates that I(X;Y) measures the reduction of uncertainty of X (or Y) due to knowing Y (or X). Note that the conditional entropy H(Y|X) measures the uncertainty of Y given that we know X.

These basic concepts are very useful. For example, if X is the category of a document and Y is a word, I(X;Y) can be used to select words that are associated with the topics. Such words are presumably most useful for classifying documents into categories. KL-divergence D(p||q) can be used to do retrieval if p is a query unigram language model and q is a document unigram language model.

The reading we have chosen for this topic is one of the best tutorials for this topic. You should read it if you can't follow the lecture slides. There may be some errors (typos) in that note. Trying to find those errors should be a good exercise.