1 Introduction to Information Theory

We begin with an overview of probability on discrete spaces. Let the discrete set of outcomes of some experiment be Ω. The set Ω is usually called the sample space, and subsets of Ω are called events. A probability distribution on Ω is a function P : 2^Ω → [0, 1] such that the following hold.
1. P(∅) = 0. When we do the experiment, we stipulate that some outcome in Ω must occur. Thus the probability that no outcome in Ω occurs should be zero.
2. We would like the probability of a set S of outcomes to be the sum of the probabilities of the individual outcomes in S.
We generalize this to say the following. Suppose S is partitioned into disjoint sets A_1, A_2, . . . . Then S can be viewed either as a whole, or as the union of the different A_i's. Since the set is the same when viewed in both these ways, we have the stipulation that the probability of S should be the sum of the probabilities of the different A_i's. We have the following condition.
For disjoint events A_1, A_2, . . . ,

    P( ⋃_{i=1}^∞ A_i ) = Σ_{i=1}^∞ P(A_i).
3. Either an event A happens, or it does not happen. Since these events are mutually exclusive and exhaustive, by the previous condition, their probabilities have to add up to the probability of the entire space of outcomes. Set P(Ω) = 1. Thus we have the following condition. (Note that the values set for ∅ and Ω satisfy this stipulation.)
For any event A, P(A^c) = 1 - P(A).
Then the pair (Ω, P) is called a discrete probability space.
Example 1.0.1. Let Ω be the set of outcomes of a fair coin toss. Then the function µ with µ(H) = 0.5 and µ(T) = 0.5 defines a distribution.
Example 1.0.2. For a positive integer n, let µ(n) = 2^{-n}. Then (ℕ \ {0}, µ) forms a distribution, since Σ_{n≥1} 2^{-n} = 1.
Let (Ω, F) be a discrete probability space for the subsequent discussion.
A random variable is a function X : Ω → 𝒳, where 𝒳 is a discrete set. The probability distribution induced by X on 𝒳, denoted p_X, is defined for every x ∈ 𝒳 as

    p_X(x) = F({ω ∈ Ω | X(ω) = x}).

This can be abbreviated as

    p_X(x) = F(X^{-1}(x)).

Thus X imposes a probability structure on the image 𝒳 using the probability distribution on the domain Ω.
Example 1.0.3. Consider a game of darts, where the board has 3 bands, red, blue and green from outside in, and the bull's eye is marked in black. Let the score a player gets be 10, 20, 10 and 40, respectively, for dart throws on red, blue, green and black. Assume that the probability that a dart falls on red is 0.5, on blue is 0.2, on green is 0.2 and on the bull's eye is 0.1.
This can be abstracted by the random variable X(red) = 10, X(blue) = 20, X(green) = 10 and X(black) = 40.
Then, after a throw, p_X(10) = F({ red, green }) = 0.7.
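To make the induced distribution concrete, here is a minimal Python sketch (the helper name induced_distribution and the dictionary encoding of F and X are our own choices, not part of the notes) that computes p_X(x) = F(X^{-1}(x)) by summing F over the preimage of each value, using the dart-board example above.

    from collections import defaultdict

    def induced_distribution(F, X):
        # p_X(x) = F(X^{-1}(x)): add up the probability of every outcome
        # that the random variable X maps to the value x.
        p_X = defaultdict(float)
        for outcome, prob in F.items():
            p_X[X[outcome]] += prob
        return dict(p_X)

    # The dart-board example: F is the distribution on the sample space,
    # X assigns a score to each region.
    F = {"red": 0.5, "blue": 0.2, "green": 0.2, "black": 0.1}
    X = {"red": 10, "blue": 20, "green": 10, "black": 40}
    print(induced_distribution(F, X))   # {10: 0.7, 20: 0.2, 40: 0.1}, up to rounding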
We now would like to define multidimensional probability distributions induced by random variables. Consider a distribution F_2 defined on Ω × Ω.
Definition 1.0.4. Let X : Ω → 𝒳 and Y : Ω → 𝒴 be two random variables. The joint distribution p_{X,Y} : 𝒳 × 𝒴 → [0, 1] is defined as

    p_{X,Y}(x, y) = F_2( X^{-1}(x), Y^{-1}(y) ).
It is also possible to define marginal and conditional distributions. The marginal distribution of X is p_X(x) = Σ_y p_{X,Y}(x, y), and the conditional distribution of Y given X is

    p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x).

That is, p_{Y|X}(· | x) is the probability distribution produced by restricting the space to the outcomes where X takes the value x, and rescaling. It follows that

    p_{X,Y}(x, y) = p_X(x) p_{Y|X}(y | x) = p_Y(y) p_{X|Y}(x | y).
Example 1.0.5. Let Ω = {H, T} represent the outcomes of a coin toss. We consider the sample space Ω × Ω of two coin tosses. Let F_2 be defined as F_2(T, T) = 0.1, F_2(H, T) = 0.2, F_2(T, H) = 0.3 and F_2(H, H) = 0.4.
Let X be the map X(T) = 1, X(H) = 2, and Y be the map Y(T) = 10, Y(H) = 20. The following table represents the joint distribution of X and Y. (That is, the first coin toss is scored according to X and the second according to Y.)
                     X = 1          X = 2          p_{X|Y}(· | y)
    Y = 20           0.3            0.4            (0.43, 0.57)
    Y = 10           0.1            0.2            (0.33, 0.67)
    p_{Y|X}(· | x)   (0.25, 0.75)   (0.33, 0.67)

(The pair at the end of each row is the conditional distribution p_{X|Y}(· | y) over (X = 1, X = 2), and the pair below each column is p_{Y|X}(· | x) over (Y = 10, Y = 20).)
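The joint and conditional distributions in this example are easy to compute mechanically. Below is a small Python sketch (variable names such as p_xy are ours) that builds p_{X,Y}(x, y) = F_2(X^{-1}(x), Y^{-1}(y)) and the conditional p_{Y|X} from the data of Example 1.0.5; it reproduces the table above.

    # Data of Example 1.0.5: F2 on pairs of tosses; X scores the first toss,
    # Y scores the second.
    F2 = {("T", "T"): 0.1, ("H", "T"): 0.2, ("T", "H"): 0.3, ("H", "H"): 0.4}
    X = {"T": 1, "H": 2}
    Y = {"T": 10, "H": 20}

    # Joint distribution p_{X,Y}(x, y) = F2(X^{-1}(x), Y^{-1}(y)).
    p_xy = {}
    for (w1, w2), prob in F2.items():
        key = (X[w1], Y[w2])
        p_xy[key] = p_xy.get(key, 0.0) + prob

    # Marginal p_X and conditional p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x).
    p_x = {}
    for (x, y), prob in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + prob
    p_y_given_x = {(y, x): prob / p_x[x] for (x, y), prob in p_xy.items()}

    print(p_xy)         # {(1, 10): 0.1, (2, 10): 0.2, (1, 20): 0.3, (2, 20): 0.4}
    print(p_y_given_x)  # e.g. p_{Y|X}(10 | 1) = 0.25 and p_{Y|X}(20 | 1) = 0.75, up to rounding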
An important notion which characterizes the theory of probability is the idea of independence. Independence tries to capture the idea that the occurrence of an event does not give any information about the occurrence of another. The mathematical way to capture this notion is as follows. Two outcomes ω_1 and ω_2 are said to be independent if F_2(ω_1, ω_2) = F(ω_1) F(ω_2). Similarly, two events A and B are independent if F(A ∩ B) = F(A) F(B). We extend this to the concept of two random variables being independent.
Definition 1.0.6. Two random variables X : Ω → 𝒳 and Y : Ω → 𝒴 are independent if, for every x and y,

    p_{X,Y}(x, y) = p_X(x) p_Y(y).

Alternatively, we can say that X is independent of Y if p_{X|Y} = p_X. We can verify that if X is independent of Y, then Y is independent of X. We can think of this as the most basic instance of symmetry of information.
Example 1.0.7. It is easily verified that the joint distribution in the previous example does not lead to independent X and Y. Let Ω, X and Y be as before. Let F_2 be defined as F_2(T, T) = 0.01, F_2(T, H) = 0.09, F_2(H, T) = 0.09, F_2(H, H) = 0.81. This is the product distribution generated by the single-toss distribution F with F(T) = 0.1 and F(H) = 0.9.
We can verify that X and Y are independent of each other in this distribution.
                     X = 1        X = 2        p_{X|Y}(· | y)
    Y = 20           0.09         0.81         (0.1, 0.9)
    Y = 10           0.01         0.09         (0.1, 0.9)
    p_{Y|X}(· | x)   (0.1, 0.9)   (0.1, 0.9)

(Each conditional distribution in the margins coincides with the corresponding marginal, as expected under independence.)
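As a check on Definition 1.0.6, the following Python sketch (the helper name is_independent is ours) tests whether a joint distribution factors into the product of its marginals; it reports False for the joint distribution of Example 1.0.5 and True for the product distribution above.

    def is_independent(p_xy, tol=1e-12):
        # X and Y are independent exactly when every entry of the joint
        # distribution factors into the product of its two marginals.
        p_x, p_y = {}, {}
        for (x, y), p in p_xy.items():
            p_x[x] = p_x.get(x, 0.0) + p
            p_y[y] = p_y.get(y, 0.0) + p
        return all(abs(p - p_x[x] * p_y[y]) < tol for (x, y), p in p_xy.items())

    # Example 1.0.5 (dependent) versus Example 1.0.7 (product distribution).
    print(is_independent({(1, 10): 0.1, (2, 10): 0.2, (1, 20): 0.3, (2, 20): 0.4}))      # False
    print(is_independent({(1, 10): 0.01, (2, 10): 0.09, (1, 20): 0.09, (2, 20): 0.81}))  # True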
For more than two random variables, we can define various degrees of independence.
Definition 1.0.8. Random variables X_1, X_2, . . . , X_n are said to be mutually independent if

    p_{X_1, X_2, ..., X_n}(x_1, . . . , x_n) = p_{X_1}(x_1) p_{X_2}(x_2) · · · p_{X_n}(x_n).
Mutual independence is the strongest form of independence among n random variables. Another frequently useful notion is the weaker notion of pairwise independence.
Definition 1.0.9. Random variables X_1, X_2, . . . , X_n are said to be pairwise independent if every pair of distinct random variables among them is independent.
Mutual independence implies pairwise independence, but not conversely. An example showing that pairwise independence does not imply mutual independence is given below, using the properties of the parity function.
Example 1.0.10. Consider two bits b_0 and b_1 produced by flips of a fair coin, designating T as 1 and H as 0. Let b_2 be defined as the parity (XOR) of b_0 and b_1. It can be verified that any pair (b_i, b_j) of distinct bits is independent, but the three variables are not mutually independent.
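The parity example can be verified by direct enumeration. The following Python sketch (the helper names marginal and joint are ours) lists the four equally likely outcomes, checks that every pair of bits factorizes into its marginals, and exhibits the failure of mutual independence.

    from itertools import product

    # The four equally likely outcomes of two fair coin flips (T = 1, H = 0),
    # with b2 set to the XOR of b0 and b1.
    outcomes = [(b0, b1, b0 ^ b1) for b0, b1 in product([0, 1], repeat=2)]
    prob = 1 / len(outcomes)  # each outcome has probability 1/4

    def marginal(i):
        return {v: sum(prob for o in outcomes if o[i] == v) for v in (0, 1)}

    def joint(i, j):
        return {(u, v): sum(prob for o in outcomes if o[i] == u and o[j] == v)
                for u in (0, 1) for v in (0, 1)}

    # Pairwise independence: every pair of bits factorizes into its marginals.
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        assert all(abs(joint(i, j)[(u, v)] - marginal(i)[u] * marginal(j)[v]) < 1e-12
                   for u in (0, 1) for v in (0, 1))

    # Mutual independence fails: the outcome (1, 1, 1) is impossible, yet the
    # product of the three marginals assigns it probability 1/8.
    p_all_ones = sum(prob for o in outcomes if o == (1, 1, 1))
    print(p_all_ones, marginal(0)[1] * marginal(1)[1] * marginal(2)[1])  # 0 versus 0.125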
With this basic set of definitions from probability, we can now define the notion of entropy of a random variable.
Definition 1.0.11. Let (Ω, P) be a discrete probability space, and X : Ω → 𝒳 be a random variable. The entropy of the random variable X is defined to be

    H(X) = - Σ_{x ∈ 𝒳} p_X(x) log p_X(x),

where we adopt the convention that 0 log 0 = 0.
We can think of the entropy of a random variable as follows. If we assign an optimal coding scheme to the image X(Ω), where the higher the probability of a point, the fewer the bits we use in its encoding, we would use log(1/p_X(x)) bits to represent a point x. The entropy is then the weighted average of the code lengths in this scheme. Thus the entropy is the expected length of an optimal encoding of the random variable X, where X is distributed according to p_X.
Once we are familiar with the notion of probabilities induced by a random variable, we can drop the
subscript from the probability. This is done to ease the burden of notation, when there is no confusion
regarding which random variable we are talking about.
In the case where 𝒳 consists of two symbols, we have the binary entropy function H(X) = -p log p - (1 - p) log(1 - p). We denote this as h(p).
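As a quick illustration, here is a Python sketch of the entropy and binary entropy functions (logarithms taken to base 2, with the 0 log 0 = 0 convention handled by skipping zero-probability points); the distribution used in the example is the induced distribution p_X from the darts example.

    import math

    def entropy(p, base=2):
        # H = -sum p_i log p_i, skipping zero entries (the 0 log 0 = 0 convention).
        return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

    def h(p):
        # Binary entropy function.
        return entropy([p, 1 - p])

    # Entropy of the dart score X, whose induced distribution is (0.7, 0.2, 0.1).
    print(entropy([0.7, 0.2, 0.1]))  # about 1.16 bits
    print(h(0.5))                    # 1.0 bit: a fair coin toss needs one bit on average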
For two random variables X and Y, we can define the notions of joint entropy and conditional entropy.
Definition 1.0.12. The joint entropy of two random variables X and Y is defined as

    H(X, Y) = - Σ_{x,y} p(x, y) log p(x, y) = E[- log p_{X,Y}].

The conditional entropy of Y given X is defined by

    H(Y | X) = - Σ_{x,y} p_{X,Y}(x, y) log p_{Y|X}(y | x) = E_{p_{X,Y}}[- log p_{Y|X}].
Note the asymmetry in the last definition. We can understand it better by writing p(x, y) as p(x) p(y | x). Then

    H(Y | X) = - Σ_{x,y} p(x) p(y | x) log p(y | x).

The summation over y can now be carried out separately for each fixed x:

    H(Y | X) = Σ_x p(x) ( - Σ_y p(y | x) log p(y | x) ).

The inner term is the entropy H(Y | X = x). Thus

    H(Y | X) = Σ_x p(x) H(Y | X = x).
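A small numerical check of this identity, using the joint distribution of Example 1.0.5: the Python sketch below (helper names are ours) computes H(Y | X) both as Σ_x p(x) H(Y | X = x) and via the chain rule H(X, Y) = H(X) + H(Y | X), which follows by taking the expectation of log p(x, y) = log p(x) + log p(y | x).

    import math

    def H(dist, base=2):
        # Entropy of a distribution given as a dictionary of probabilities.
        return -sum(p * math.log(p, base) for p in dist.values() if p > 0)

    # Joint distribution of (X, Y) from Example 1.0.5 and the marginal of X.
    p_xy = {(1, 10): 0.1, (2, 10): 0.2, (1, 20): 0.3, (2, 20): 0.4}
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p

    def H_Y_given_X():
        # H(Y | X) = sum_x p(x) H(Y | X = x).
        total = 0.0
        for x, px in p_x.items():
            cond = {y: p / px for (xx, y), p in p_xy.items() if xx == x}
            total += px * H(cond)
        return total

    print(H_Y_given_X())     # about 0.88 bits
    print(H(p_xy) - H(p_x))  # the same value via the chain rule H(X,Y) = H(X) + H(Y|X)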
Definition 1.0.13. The information about X contained in Y is defined as I(X; Y) = H(X) - H(X | Y).
Then, we have the property of symmetry of information.
Lemma 1.0.14. I(X; Y) = I(Y; X).
Proof.

    I(X; Y) = H(X) - H(X | Y)
            = - Σ_x p(x) log p(x) + Σ_{x,y} p(x, y) log p(x | y)
            = Σ_x p(x) log(1/p(x)) + Σ_{x,y} p(x, y) log( p(x, y) / p(y) ).

We know that the probability of a point x, namely p(x), is the sum of p(x, y) over all values y of Y. Thus we can write the above sum as

      Σ_x Σ_y p(x, y) log(1/p(x)) + Σ_{x,y} p(x, y) log( p(x, y) / p(y) )
    = Σ_{x,y} p(x, y) log(1/p(x)) + Σ_{x,y} p(x, y) log( p(x, y) / p(y) )    [Notation]
    = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ).

By the symmetry of the resultant expression, I(X; Y) = I(Y; X).
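The final expression in the proof is easy to evaluate numerically. The Python sketch below (the helper name mutual_information is ours) computes I(X; Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ) for the two joint distributions used earlier and confirms both the symmetry and the fact that the product distribution carries zero mutual information.

    import math

    def mutual_information(p_xy, base=2):
        # I(X; Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ).
        p_x, p_y = {}, {}
        for (x, y), p in p_xy.items():
            p_x[x] = p_x.get(x, 0.0) + p
            p_y[y] = p_y.get(y, 0.0) + p
        return sum(p * math.log(p / (p_x[x] * p_y[y]), base)
                   for (x, y), p in p_xy.items() if p > 0)

    dependent = {(1, 10): 0.1, (2, 10): 0.2, (1, 20): 0.3, (2, 20): 0.4}        # Example 1.0.5
    independent = {(1, 10): 0.01, (2, 10): 0.09, (1, 20): 0.09, (2, 20): 0.81}  # Example 1.0.7

    print(mutual_information(dependent))    # about 0.006 bits
    print(mutual_information(independent))  # essentially 0, up to floating point

    # Symmetry: swapping the roles of X and Y leaves the value unchanged.
    swapped = {(y, x): p for (x, y), p in dependent.items()}
    print(mutual_information(swapped))      # matches the first value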
An important property of the logarithm function is its concavity. The secant between two points on the graph of a concave function always lies below the graph; this characterizes concavity. A consequence of this fact is Jensen's inequality, which is a very convenient tool in the analysis of convex and concave functions.
Theorem 1.0.15 (Jensen's Inequality). Let X : Ω → ℝ be a random variable with E[X] < ∞, and let f : ℝ → ℝ be a concave function. Then

    f(E[X]) ≥ E[f(X)].
Jensen's inequality can hence be used to establish inequalities of the following kind:

    log( Σ_i p_i x_i ) ≥ Σ_i p_i log x_i.

Usually, the left side is easier to estimate. From the geometric observation that the tangent to the graph of a concave function lies above the graph, we have the following useful upper bound on the logarithm function.
Theorem 1.0.16 (Fundamental Inequality). For any a > 0, we have

    ln a ≤ a - 1.

To illustrate an application of Jensen's inequality, we will prove that the entropy of an n-dimensional probability distribution is at most log n.
Lemma 1.0.17. Let P = (p_0, p_1, . . . , p_{n-1}) be an n-dimensional probability distribution. Then H(P) ≤ log n.
Proof.

    H(P) = Σ_{i=0}^{n-1} p_i log(1/p_i)
         ≤ log Σ_{i=0}^{n-1} p_i · (1/p_i)        [log is concave, and Jensen's inequality]
         = log Σ_{i=0}^{n-1} 1
         = log n.
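A quick empirical sanity check of the bound, under the assumption that log denotes the base-2 logarithm: the sketch below draws a few random probability vectors and confirms that their entropy never exceeds log n.

    import math, random

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    # Spot-check Lemma 1.0.17 on a few random probability vectors:
    # the entropy never exceeds log n.
    random.seed(0)
    for _ in range(5):
        n = random.randint(2, 8)
        weights = [random.random() for _ in range(n)]
        P = [w / sum(weights) for w in weights]
        assert entropy(P) <= math.log2(n) + 1e-9
        print(n, round(entropy(P), 3), "<=", round(math.log2(n), 3))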
Similarly, the fundamental inequality gives us a lower bound on the entropy, proving that it is non-negative.
Lemma 1.0.18. Let P = (p_0, p_1, . . . , p_{n-1}) be an n-dimensional probability distribution. Then H(P) ≥ 0.
Proof.

    H(P) = Σ_{i=0}^{n-1} p_i log(1/p_i) ≥ Σ_{i=0}^{n-1} p_i (1 - p_i)        [Fundamental Inequality].

Both p_i and 1 - p_i are non-negative terms, since they are coordinates of a probability vector. H(P) is lower bounded by a sum of non-negative terms, thus H(P) ≥ 0.
We can push the above analysis a bit further and get the result that H(P) = 0 if and only if P is a deterministic distribution. It is easily verified that if P is a deterministic distribution, then H(P) = 0. We now prove the converse.
Lemma 1.0.19. Let P = (p_0, p_1, . . . , p_{n-1}) be an n-dimensional probability distribution. Then H(P) = 0 only if P is a deterministic distribution.
Proof. By the fundamental inequality, we have

    H(P) ≥ Σ_{i=0}^{n-1} p_i (1 - p_i) ≥ 0.

Assume H(P) = 0. Then the above sum of non-negative terms is 0, thus each of the constituent summands is equal to 0. This can happen only if p_i = 1 or p_i = 0 for each i. Since P is a probability distribution, exactly one p_i is 1, which proves the result.
Similarly, from symmetry of information and the fundamental inequality, we can prove that I(X; Y) ≥ 0. For, we have the following.

    I(X; Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) )
            ≥ Σ_{x,y} p(x, y) ( 1 - p(x) p(y) / p(x, y) )        (by the Fundamental Inequality for log)
            = Σ_{x,y} p(x, y) - Σ_{x,y} p(x) p(y)
            = 1 - Σ_x p(x) Σ_y p(y)
            = 1 - 1 = 0.
1.1 Majorization*
We now briefly introduce a theory that is very useful in the study of entropy, specifically of how the entropy changes when the probability distribution changes. Consider probability distributions on a set of n elements. We have proved that the entropy of the uniform distribution is maximal, and is equal to log n. Similarly, we have proved that the entropy of a deterministic distribution, where one of the events has probability 1 and the rest have probability 0, is minimal, and equal to 0 (taking 0 log 0 = 0 by convention).
This says the following. The space of probability distributions on n elements can be seen as a convex set, with vertices the deterministic distributions (1, 0, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, 0, 0, . . . , 1). The entropy is maximal at the centroid of this set, the point corresponding to the uniform distribution. It then decreases outwards towards the vertices, and reaches 0 at each of the vertices.
How do we compare the entropy at two arbitrary points within this convex set? Can we determine some easily verifiable property of the probability distributions and use it to say qualitatively which distribution has greater entropy?
There is a powerful theory of majorization which can be used for this purpose. Majorization is a comparison criterion for two n-dimensional vectors. A vector x is said to be majorized by another vector y if, informally, x is more equitably distributed than y.
Definition 1.1.1. Let x and y be two non-negative n-dimensional vectors. Let a be x sorted in descending order of coordinate values, and b be y sorted in descending order of coordinate values. Then x is majorized by y, written x ≺ y, if the following hold.

    a_0 ≤ b_0
    a_0 + a_1 ≤ b_0 + b_1
    . . .
    a_0 + a_1 + · · · + a_{n-1} = b_0 + b_1 + · · · + b_{n-1}.

For example, (1/3, 1/3, 1/3) ≺ (1/2, 0, 1/2) ≺ (1, 0, 0); each vector in the chain is majorized by the next.
Thus, majorization provides a way to compare probability vectors. If one probability vector majorizes another, then their entropies can be compared. This is because the n-dimensional entropy function has a property called Schur concavity.
Definition 1.1.2. A function f : ℝ^n → ℝ is called Schur concave if f(x) ≥ f(y) whenever x ≺ y.
Shannon entropy is Schur concave, as can be seen by an easy application of the following theorem.
Theorem 1.1.3 (Schur). A symmetric, continuous function f : ℝ^n → ℝ is Schur-concave if and only if the projection of f onto each coordinate is a continuous concave function.
The theorem connecting these concepts enables us to compare the entropies of probability vectors in an easy manner: a probability vector x has entropy at least that of y whenever x ≺ y.
Note that majorization is only a sufficient criterion for comparing entropies. There are vectors which do not majorize one another and yet have different entropies. For example, the 3-dimensional probability vector (0.6, 0.25, 0.15) has greater entropy than (0.5, 0.5, 0) even though neither majorizes the other.
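The definitions in this section are straightforward to code. The Python sketch below (the helper name majorized_by is ours) checks the prefix-sum condition of Definition 1.1.1, verifies the chain of majorizations given above, illustrates Schur concavity of the entropy along that chain, and confirms that the pair (0.6, 0.25, 0.15) and (0.5, 0.5, 0) is incomparable under majorization even though their entropies differ.

    import math

    def entropy(p):
        return -sum(x * math.log2(x) for x in p if x > 0)

    def majorized_by(x, y, tol=1e-12):
        # x is majorized by y when every prefix sum of x, sorted in descending
        # order, is at most the corresponding prefix sum of y, and the totals agree.
        a, b = sorted(x, reverse=True), sorted(y, reverse=True)
        sa = sb = 0.0
        for ai, bi in zip(a, b):
            sa += ai
            sb += bi
            if sa > sb + tol:
                return False
        return abs(sa - sb) <= tol

    # The chain from the text, and Schur concavity of the entropy along it.
    print(majorized_by([1/3, 1/3, 1/3], [0.5, 0, 0.5]))  # True
    print(majorized_by([0.5, 0, 0.5], [1, 0, 0]))        # True
    print(entropy([1/3, 1/3, 1/3]), entropy([0.5, 0, 0.5]), entropy([1, 0, 0]))  # about 1.585, 1.0, 0.0

    # The counterexample: neither vector majorizes the other, yet the entropies differ.
    p, q = [0.6, 0.25, 0.15], [0.5, 0.5, 0]
    print(majorized_by(p, q), majorized_by(q, p))  # False False
    print(entropy(p), entropy(q))                  # about 1.35 bits versus 1.0 bit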
1.2 Kullback-Leibler Divergence
Majorization was a qualitative notion for comparing two probability distributions. A very useful quantitative notion for comparing two probability measures is the Kullback-Leibler divergence.
Definition 1.2.1. Let P and Q be two n-dimensional probability distributions. The Kullback-Leibler divergence of P from Q is defined as

    D(P||Q) = Σ_{i=0}^{n-1} P_i log( P_i / Q_i ).
We can interpret this as follows. Note that

    D(P||Q) = Σ_{i=0}^{n-1} P_i log P_i - Σ_{i=0}^{n-1} P_i log Q_i = ( - Σ_{i=0}^{n-1} P_i log Q_i ) - H(P).

This form is amenable to some interpretation. Suppose a sample space is distributed according to P. We mistakenly encode the space as though it were distributed according to Q. Then the divergence D(P||Q) may be interpreted as the coding inefficiency rate of encoding the space with respect to Q, when the optimal rate would have been achieved with respect to P.
The KL-divergence does not have many properties of a Euclidean distance, and hence is not a satisfactory
notion of distance between probability vectors. However, it is remarkably useful.
Note: For example, the universal integrable test for a computable probability measure P in the previous chapter was the total KL divergence of M from P on any string x. Also, the deficiency test for the weak law of large numbers was approximately the KL divergence between the uniform distribution and p_x.
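Here is a short Python sketch of Definition 1.2.1 (helper names and the particular P and Q are ours; Q is taken to be strictly positive so that the logarithms are defined). It also checks the decomposition of D(P||Q) as cross-entropy minus entropy, which is the coding-inefficiency reading given above.

    import math

    def kl(P, Q, base=2):
        # D(P||Q) = sum_i P_i log(P_i / Q_i); Q must be positive wherever P is.
        return sum(p * math.log(p / q, base) for p, q in zip(P, Q) if p > 0)

    def entropy(P, base=2):
        return -sum(p * math.log(p, base) for p in P if p > 0)

    def cross_entropy(P, Q, base=2):
        # Expected code length when coding a P-distributed source with a code built for Q.
        return -sum(p * math.log(q, base) for p, q in zip(P, Q) if p > 0)

    P = [0.5, 0.25, 0.25]
    Q = [0.8, 0.1, 0.1]

    print(kl(P, Q))                          # about 0.32 bits, and never negative
    print(cross_entropy(P, Q) - entropy(P))  # the same number: D(P||Q) = cross-entropy - entropy
    print(kl(P, P))                          # 0.0: the divergence of a distribution from itself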
We conclude our discussion by proving some basic facts about the KL-divergence. We prove that D(P||Q) = 0 if and only if P = Q. One direction is easy: if P = Q, then D(P||Q) = 0. The converse is slightly tricky.
Lemma 1.2.2. For finite positive probability distributions P and Q, D(P||Q) = 0 only if P = Q.
Proof. First, we prove that D(P||Q) ≥ 0; we will then analyze the calculations in this argument to prove the lemma. We see that

    D(P||Q) = Σ_i p_i log( p_i / q_i )
            = - Σ_i p_i log( q_i / p_i )
            ≥ Σ_i p_i ( 1 - q_i / p_i )        [Fundamental Inequality]
            = Σ_i p_i - Σ_i q_i = 0.           [P and Q are probability distributions]

Then D(P||Q) = 0 only if the inequality between lines 2 and 3 is an equality. We know that

    - log( q_i / p_i ) - ( 1 - q_i / p_i ) ≥ 0,

hence every summand - p_i log( q_i / p_i ) in line 2 is at least the corresponding summand p_i ( 1 - q_i / p_i ) in line 3. Thus lines 2 and 3 are equal only if the corresponding summands are equal. Thus, for every i,

    q_i / p_i = 1,

proving that P = Q.