
The Source Coding Theorem

Mário S. Alvim
(msalvim@dcc.ufmg.br)

Information Theory

DCC-UFMG
(2020/01)



The Source Coding Theorem - Introduction

In this lecture we will define sensible measures of

1. the information content of a random event, and


2. the expected information content of a random experiment.

We will also see how the measures above are connected with the compression
of sources of information.



Definition of entropy and
related functions



Ensembles

Recall that an ensemble X is a triple (x, AX, PX), where:

x is the outcome of a random variable,

AX = {a1, a2, . . . , ai, . . . , aI} is the set of possible values for the random
variable, and

PX = {p1, p2, . . . , pI} are the probabilities of each value, with pi standing for
p(x = ai).



Introduction to Shannon entropy

The Shannon information content of an outcome x is defined to be

    h(x) = log2(1/p(x)),

and it is measured in bits.

Convention. From now on, log x stands for log2 x, unless otherwise stated.



Introduction to Shannon entropy

Example 1 Frequency of letters in “The Frequently Asked Questions Manual for
Linux”, and their entropy.



Introduction to Shannon entropy

The entropy of an ensemble X is defined to be the average Shannon information
content of an outcome:

    H(X) = Σ_{x ∈ AX} p(x) h(x) = Σ_{x ∈ AX} p(x) log2(1/p(x)),

with the convention that for p(x) = 0 we have

    0 · log2(1/0) = 0,

since lim_{θ→0} θ log2(1/θ) = 0.


When it is convenient we may write H(X ) as H(p), where p is the vector
(p1 , p2 , . . . , pI ).
Another name for the entropy of X is the uncertainty of X .
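
To make the definitions concrete, here is a small Python sketch (not from the
original slides) of h(x) and H(X); the four-outcome distribution used below is
made up for illustration.

```python
import math

def h(p_x):
    """Shannon information content (in bits) of an outcome with probability p_x."""
    return math.log2(1 / p_x)

def entropy(probs):
    """Entropy H(X) in bits, with the convention 0 * log2(1/0) = 0."""
    return sum(p * h(p) for p in probs if p > 0)

# A made-up ensemble with AX = {a, b, c, d}.
PX = [0.5, 0.25, 0.125, 0.125]
print([h(p) for p in PX])      # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(PX))             # 1.75 bits
```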



Introduction to Shannon entropy

Example 1 (Continued)

Frequency of letters in “The Frequently Asked Questions Manual for Linux”.



Introduction to Shannon entropy

Some properties of the entropy function:

1. Entropy is always non-negative:

H(X ) ≥ 0,

with equality iff pi = 1 for one i.

2. Entropy is maximized when p is uniform:

H(X ) ≤ log2 |AX |,

with equality iff pi = 1/|AX | for all i.

Verifying these properties is part of your homework assignment: for that you’ll
need Jensen’s inequality.
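
As a numerical sanity check only (not the proof asked for in the homework), the
following sketch evaluates both bounds on a few distributions; the random
distributions and the alphabet size of 8 are arbitrary choices.

```python
import math
import random

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

random.seed(0)
for _ in range(5):                                   # random distributions over |AX| = 8 outcomes
    w = [random.random() for _ in range(8)]
    p = [x / sum(w) for x in w]
    assert 0 <= entropy(p) <= math.log2(8) + 1e-9    # property 1 and property 2 hold

print(entropy([1.0, 0, 0, 0, 0, 0, 0, 0]))           # 0.0: a certain outcome gives zero entropy
print(entropy([1 / 8] * 8), math.log2(8))            # 3.0 and 3.0: the uniform case attains the max
```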



Jensen’s inequality and convex functions

A function f(x) is convex over an interval (a, b) if every chord of the function
lies above the function.
That is, if for all x1, x2 ∈ (a, b) and all 0 ≤ λ ≤ 1:

    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).

Jensen’s inequality: if f is a convex function and x is a random variable, then

    E[f(x)] ≥ f(E[x]),

where E[·] denotes expected value.
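
A quick numerical illustration of Jensen’s inequality (my own example, not from
the slides), assuming the convex function f(x) = x² and a uniform sample:

```python
import random

random.seed(1)
f = lambda x: x * x                               # a convex function
xs = [random.uniform(-1, 3) for _ in range(100_000)]

E_f_x = sum(f(x) for x in xs) / len(xs)           # E[f(x)]
f_E_x = f(sum(xs) / len(xs))                      # f(E[x])
print(E_f_x >= f_E_x)                             # True, as Jensen's inequality predicts
```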



Introduction to Shannon entropy

The joint entropy of X, Y is

    H(X, Y) = Σ_{xy ∈ AX AY} p(x, y) h(x, y) = Σ_{xy ∈ AX AY} p(x, y) log2(1/p(x, y)).



Introduction to Shannon entropy
Theorem Entropy is additive for independent random variables:
H(X , Y ) = H(X ) + H(Y ),
iff p(x, y ) = p(x)p(y ).

Proof.
We start by proving the following auxiliary result for when p(x, y) = p(x)p(y):

    h(x, y) = log2(1/p(x, y))                     (by def. of h(·))
            = log2(1/(p(x)p(y)))                  (p(x, y) = p(x)p(y))
            = log2((1/p(x)) · (1/p(y)))
            = log2(1/p(x)) + log2(1/p(y))
            = h(x) + h(y)                         (by def. of h(·))



Introduction to Shannon entropy

Proof. (Continued)

Then, we can show that:


    H(X, Y) = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x, y) h(x, y)                (by definition)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y) [h(x) + h(y)]         (x, y independent)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} [p(x)p(y)h(x) + p(x)p(y)h(y)]  (by distributivity)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(x)
              + Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(y)                 (splitting the sums (?))



Joint Entropy - Properties
Proof. (Continued)

Note that the first term in the sum in Equation (?) can be written as

    Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(x) = Σ_{x ∈ AX} p(x)h(x) Σ_{y ∈ AY} p(y)   (moving out constants)
                                        = Σ_{x ∈ AX} p(x)h(x) · 1               (Σ_{y ∈ AY} p(y) = 1)
                                        = H(X)                                  (by definition (??)),

and the second term in the sum in Equation (?) can be written as

    Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(y) = Σ_{y ∈ AY} p(y)h(y) Σ_{x ∈ AX} p(x)   (moving out constants)
                                        = Σ_{y ∈ AY} p(y)h(y) · 1               (Σ_{x ∈ AX} p(x) = 1)
                                        = H(Y)                                  (by definition (???)).



Joint Entropy - Properties

Proof. (Continued)

Now we can substitute Equations (??) and (???) in Equation (?) to obtain

H(X , Y ) = H(X ) + H(Y ).

The proof of the converse, i.e., that if H(X , Y ) = H(X ) + H(Y ) then X and
Y are independent, is similar and is left as an exercise.
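
The theorem can also be checked numerically. The sketch below (my own
illustration, with made-up marginals) builds an independent joint distribution
and a dependent one, and compares H(X, Y) with H(X) + H(Y):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

pX = {'a': 0.5, 'b': 0.5}                 # made-up marginals
pY = {'0': 0.25, '1': 0.75}

# Independent joint distribution: p(x, y) = p(x) p(y).
pXY = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY)}
print(abs(entropy(pXY.values()) - (entropy(pX.values()) + entropy(pY.values()))) < 1e-12)  # True

# A dependent joint distribution (y deterministically copies x): additivity fails.
pXY_dep = {('a', '0'): 0.5, ('b', '1'): 0.5}
print(entropy(pXY_dep.values()), entropy(pX.values()) + entropy([0.5, 0.5]))  # 1.0 vs 2.0
```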



The Source Coding Theorem



The Source Coding Theorem - Three claims

Our goal in this class is to convince you of the following three claims:

1. The Shannon information content (a.k.a. surprisal, or self-information)

    h(x = ai) = log2(1/p(x = ai))

is a sensible measure of the information content of the outcome x = ai.

2. The entropy

    H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x))

is a sensible measure of the expected information content of an ensemble X.

3. Source coding theorem: N outcomes from a source X can be compressed into
roughly N · H(X) bits.



The Shannon information content of an outcome

Our first claim is that the Shannon information content

    h(x = ai) = log2(1/p(x = ai))

is a sensible measure of the information content of the outcome x = ai.



The Shannon information content of an outcome
Some intuitive support for our first claim:

a) The less probable an outcome is, the more informative its occurrence is.
Or: the more “surprising” an outcome is, the more informative it is.
The function h(x) captures this intuition, since it grows as the probability of x
diminishes:



The Shannon information content of an outcome

Some intuitive support for our first claim:

b) If an outcome x happens with certainty (i.e., p(x) = 1), its occurrence
conveys no information:

    h(x) = log2(1/p(x)) = log2(1/1) = 0.

If an outcome x is impossible (i.e., p(x) = 0), its occurrence would convey an
infinite amount of information:

    h(x) = log2(1/p(x)) = log2(1/0) = ∞.



The Shannon information content of an outcome

Some intuitive support for our first claim:

c) Independent events add up their surprises.

If p(x, y) = p(x)p(y) then

    h(x, y) = log2(1/p(x, y))
            = log2((1/p(x)) · (1/p(y)))
            = log2(1/p(x)) + log2(1/p(y))
            = h(x) + h(y).



The entropy of an ensemble

Our second claim is that the entropy

    H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x))

is a sensible measure of the expected information content of an ensemble
X = (x, AX, PX).



The entropy of an ensemble

Some intuitive support for our second claim:

a) (The weighing problem.) You are given 12 balls, all equal in weight except
for one that is either heavier or lighter.
You are also given a two-pan balance to use.
In each use of the balance you may put any number of the 12 balls on the left
pan, and the same number on the right pan, and push a button to initiate the
weighing; there are three possible outcomes:

1. either the weights are equal; or


2. the balls on the left are heavier; or
3. the balls on the left are lighter.

Your task is to design a strategy to determine which is the odd ball and
whether it is heavier or lighter than the others in as few uses of the balance as
possible.



The entropy of an ensemble
A possible optimal solution for the weighing problem.



The entropy of an ensemble

Insights on “information” gained from the weighing problem:

(i) The world may be in many different states, and you are uncertain about which
is the real one.
(ii) You have measurements (questions) that you can make (ask) to probe in what
state the world is.
(iii) Each measurement (question) produces an observation (answer) that allows you
to rule out some states of the world as impossible.
(iv) Each time a subset of possible states is ruled out, you gain some
information about the real state of the world.
The information you have increases because your uncertainty about the real
state of the world decreases.



The entropy of an ensemble

Insights on “information” gained from the weighing problem:

(v) The most efficient way of finding the actual state is to make the outcomes of
every measurement (question) as close to equally probable as possible.
If your measurement (question) allows for n different outcomes (types of
answers), it is best to use them so as to always split the set of still possible
states of the world into n sets of probability 1/n each.
(vi) The entropy (computed in base 3) of the ensemble of 24 equally likely states
(which of the 12 balls is odd, and whether it is heavier or lighter) is

    H(X) = Σ_{i=1}^{24} (1/24) log3(1/(1/24)) = log3 24 ≈ 2.89,

which is just about the minimal number of measurements (3) needed in a best
strategy (a quick numeric check follows below).
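
A couple of lines of Python mirroring the sum over the 24 equally likely states
(an illustration, not from the slides):

```python
import math

states = 24                      # 12 balls, each possibly heavier or lighter
H_base3 = sum((1 / states) * math.log(1 / (1 / states), 3) for _ in range(states))
print(H_base3)                   # ~2.8928: just under 3 weighings
```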



The entropy of an ensemble

Some intuitive support for our second claim:

b) (The guessing game.) Do you know the “20 questions” game? (If you don’t,
Google it, it’s a fun game to while away the time with friends during a long
trip.)
In a dumber version of the game, a friend thinks of a number between 0 and
15 and you have to guess which number was selected using yes/no questions.
What is the smallest number of questions needed to be guaranteed to identify
an integer between 0 and 15?



The entropy of an ensemble

An optimal strategy of yes/no questions to identify an integer in 0-15:



The entropy of an ensemble

Insights on “information” gained from the guessing game:

(i) A series of answers to yes/no questions can be used to uniquely identify an


object from a set.
(That’s why the questions are useful in the first place!)
(ii) The optimal strategy to win the game corresponds to the shortest sequences
of yes/no questions needed to identify objects in the set.
(iii) If you map each yes/no answer to a 0/1 bit, you get a unique binary string
that identifies each object in the set.
(iv) Hence, the optimal strategy to win the game leads to the shortest binary
description of objects in the set.



The entropy of an ensemble

Insights on “information” gained from the guessing game:

(v) Encoding information efficiently is related to asking the right questions.


(vi) We saw that the number of yes/no questions needed to identify an integer
between 0 and 15 is 4.
Let us calculate the Shannon information of the set of integers between 0 and
15, assuming your friend can pick any number in the set with equal probability:

    H(X) = Σ_{i=0}^{15} (1/16) log2(1/(1/16)) = log2 16 = 4.

Shannon entropy gave us the minimal number of questions necessary to win the
game. Is this a coincidence? (A sketch of the halving strategy follows below.)
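
The sketch below (my own illustration) assumes questions of the form
“is the number ≥ m?”, which is one possible optimal questioning scheme:

```python
import math

def guess(secret, lo=0, hi=15):
    """Identify `secret` in [lo, hi] using yes/no questions of the form 'is it >= m?'."""
    questions = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2       # split the remaining range in half
        questions += 1
        if secret >= mid:              # answer "yes"
            lo = mid
        else:                          # answer "no"
            hi = mid - 1
    return lo, questions

assert all(guess(s) == (s, 4) for s in range(16))   # always exactly 4 questions
print(math.log2(16))                                # 4.0 bits = H(X) for a uniform pick
```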



The entropy of an ensemble

Some intuitive support for our second claim:

c) (The game of submarine.) In a boring version of the “game of battleships”
called “game of submarine”, each player hides just one submarine in one
square of an eight-by-eight grid.
At each round, the other player picks a square in the grid to shoot at.
The two possible outcomes of a shot are y and n, corresponding to a hit and a
miss, and their probabilities depend on the state of the board.



The entropy of an ensemble
In the game of submarine, each shot made by a player defines an ensemble:
at the beginning:

    p(y) = 1/64 and p(n) = 63/64;

at the second shot, if the first shot missed:

    p(y | n) = 1/63 and p(n | n) = 62/63;

at the third shot, if the first two shots missed:

    p(y | nn) = 1/62 and p(n | nn) = 61/62;

...

at the k-th shot (with 1 ≤ k ≤ 64), if the first k − 1 shots missed:

    p(y | n^(k−1)) = 1/(64 − k + 1) and p(n | n^(k−1)) = (64 − k)/(64 − k + 1).



The entropy of an ensemble

A game of submarine in which the submarine is hit on the 49th attempt.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(i) If we get a hit y on the k-th shot, the sequence of answers we get is a string

    x = n^(k−1) y,

in which the first k − 1 symbols are n and the last one is y.
This string x is the outcome of the hit/miss experiment that uniquely
identifies the square where the submarine is.
For a fixed strategy, our game has 64 possible outcomes:

    y,
    ny,
    nny,
    . . .,
    nnnn . . . nny = n^62 y,
    nnnn . . . nnny = n^63 y.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:


In particular, we can call y a bit 1 and n a bit 0, and encode each of the
game’s 64 possible outcomes as a binary string:

    y = 1 (a binary string of 1 bit),
    ny = 01 (a binary string of 2 bits),
    nny = 001 (a binary string of 3 bits),
    . . .,
    n^62 y = 0^62 1 (a binary string of 63 bits),
    n^63 y = 0^63 1 (a binary string of 64 bits).

Note that the use of symbols {y, n} or {0, 1} makes little difference: each
binary string uniquely identifies a result of the game.
Note also that we have binary strings of many different sizes.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(ii) Let us calculate the Shannon information content of an arbitrary string
x = n^(k−1) y, for some 1 ≤ k ≤ 64:

    h(x = n^(k−1) y) = log2(1/p(x = n^(k−1) y))
                     = log2(1 / [(63/64) · (62/63) · (61/62) · ... · ((64−k+1)/(64−k+2)) · (1/(64−k+1))])
                     = log2(1/(1/64))
                     = log2 64
                     = 6 bits.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(iii) Every outcome n^(k−1) y, no matter what value k assumes from 1 to 64, conveys
the same amount of information: h(n^(k−1) y) = 6 bits.
6 bits is exactly the number of bits necessary to uniquely identify a square in a
set of 64!
(iv) The information contained in a binary string representing an object is not
necessarily the number of bits in the string: it is related to the object the
string identifies within a set!
In our submarine example, all strings from y (which has a size of 1 bit) to n^63 y
(which has a size of 64 bits) carry the same amount of information: 6 bits each.
(A sketch of this calculation follows below.)
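
The telescoping product in item (ii) can be checked exactly with Fractions; the
values of k below are arbitrary samples (my own illustration):

```python
import math
from fractions import Fraction

def p_outcome(k):
    """Probability of the outcome n^(k-1) y: miss the first k-1 shots, then hit."""
    p = Fraction(1)
    for j in range(1, k):                 # j-th miss: (64 - j) safe squares out of (64 - j + 1)
        p *= Fraction(64 - j, 64 - j + 1)
    return p * Fraction(1, 64 - k + 1)    # the hit on the k-th shot

for k in (1, 2, 32, 64):
    p = p_outcome(k)
    print(k, p, math.log2(1 / p))         # p is always 1/64, so h is always 6.0 bits
```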



Entropy as the complexity of binary search

Complementing the intuitions we got from the previous examples, in a future
lecture we will be able to prove an operational interpretation of Shannon
entropy in terms of search trees.

The Shannon entropy H(X) of a random variable distributed according to a
probability distribution pX is:

    the expected number of comparisons needed
    for an optimal binary-search algorithm
    on a space of values X following distribution pX.

In this sense, a random variable with:


low entropy would be located “fast” using binary search, whereas
one with high entropy would need “more effort/time” to be located.

Note that there could be other ways of measuring the information of a
distribution: we’ll discuss them at the end of this course.
Data compression

Our third claim is that N outcomes from a source X can be compressed into
roughly N · H(X) bits.
In other words, our claim is that the number of bits necessary to compress N
outcomes of a source grows linearly in N, at a rate given by the entropy of the
source.

This claim implies an intimate connection between data compression and the
measure of information content of the source.
Before giving support for our third claim, let us understand better what we
mean by “data compression”.



Data compression

A source of information is a stochastic process

    X1, X2, X3, . . .

in which the outcome of each ensemble Xi is a symbol produced by the source.

Examples of sources of information include:

1. the speech produced by a human (each word is a symbol),
2. the sequence of pixels in a black-and-white image (each symbol is either
“black” or “white”),
3. the sequence of states of the weather in a region over a sequence of days (each
symbol is “good”, “cloudy”, or “rainy”).



Data compression

Average information content per symbol of a source:


If we can show that we can compress data from a particular source into a file
of L bits per source symbol and recover the data reliably, then we will say that
the average information content of that source is at most L bits per symbol.



Data compression

The raw bit content of an ensemble X is

    H0(X) = log2 |AX|.

H0 represents the minimum number of bits necessary to give a unique codeword
(i.e., a “name”) to every element in the ensemble X.

H0 is a measure of information of X:
it is a bound on the number of bits necessary to encode elements of X as
codewords.

This measure of information only considers the encoding of ensemble X, but it
does not consider how the encoding for the ensemble can be compressed.
To do compression, we need to take into consideration the probability of each
outcome of the ensemble.



Data compression

Example 2 (MacKay 4.5) Could there be a compressor that maps an outcome x to
a binary code c(x), and a decompressor that maps c back to x, such that every
possible outcome is compressed into a binary code of length shorter than
H0(X) bits?

Solution. No! Just use the pigeonhole principle to verify that. (This exercise
is part of your homework assignment for this lecture.)




Data compression

There are only two ways a compressor can compress files:

1. A lossy compressor maps all files to shorter codewords, but that means that
sometimes two or more files will necessarily be mapped to the same codeword.
The decompressor will be, in this case, unsure of how to decompress
ambiguous codewords, leading to a failure.
Calling δ the probability that the source file is one of the confusable files, a
lossy compressor has probability δ of failure.
If δ is small, the compressor is acceptable, but with some loss of information
(i.e., not all codewords are guaranteed to be decompressed correctly).



Data compression

There are only two ways a compressor can compress files:

2. A lossless compressor maps most files to shorter codewords, but it will
necessarily map some files to longer codewords.
By picking wisely which files to map to shorter codewords (i.e., the most
probable files) and which files to map to longer codewords (i.e., the least
probable files), the compressor can usually achieve satisfactory compression
rates, and without any loss of information.

In the remainder of this lecture we will cover a simple lossy compressor, and
in future lectures we will cover lossless compressors.



Lossy data compression
All compressors must take into consideration the probabilities of the different
outcomes a source may produce.

Example 3 Let

    AX = {a, b, c, d, e, f, g, h}
    PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}.

The raw bit content of this ensemble is

    H0 = log2 |AX| = log2 8 = 3 bits,

so to represent any symbol of the ensemble we need, in principle, codewords
of 3 bits each.
But notice that a small set contains almost all the probability:

    p(x ∈ {a, b, c, d}) = 15/16.
Lossy data compression
Example 3 (Continued)

If we accept a risk δ = 1/16 of not having a codeword for a symbol x, we can use
an encoding with only 2 bits per symbol instead of 3:


Lossy data compression

The above example can be generalized to the principle below.

Principle of lossy data compression.

Let Sδ denote the smallest δ-sufficient set (a subset of AX) satisfying

    p(x ∉ Sδ) ≤ δ or, equivalently, p(x ∈ Sδ) ≥ 1 − δ.

The maximum compression tolerating a probability of error at most δ uses
codewords of size log2 |Sδ| bits.

The quantity

    Hδ(X) = log2 |Sδ|

is called the essential bit content of X.

If we are willing to accept a probability δ of error, we can compress the source
from H0(X) bits per symbol to Hδ(X) bits per symbol. (A sketch of the greedy
construction of Sδ follows below.)
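
A minimal Python sketch of that greedy construction, applied to the ensemble of
Example 3 above; the function name essential_bit_content and the use of
Fractions are my own choices:

```python
import math
from fractions import Fraction

def essential_bit_content(P, delta):
    """H_delta(X) = log2 |S_delta|, with S_delta the smallest delta-sufficient set,
    built greedily: keep the most probable outcomes until the excluded mass is <= delta."""
    ranked = sorted(P.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], Fraction(0)
    for symbol, prob in ranked:
        if mass >= 1 - delta:        # the excluded outcomes already have probability <= delta
            break
        kept.append(symbol)
        mass += prob
    return kept, math.log2(len(kept))

# The ensemble of Example 3.
P = {'a': Fraction(1, 4), 'b': Fraction(1, 4), 'c': Fraction(1, 4), 'd': Fraction(3, 16),
     'e': Fraction(1, 64), 'f': Fraction(1, 64), 'g': Fraction(1, 64), 'h': Fraction(1, 64)}

print(essential_bit_content(P, 0))                  # all 8 symbols, H_0(X) = 3 bits
print(essential_bit_content(P, Fraction(1, 16)))    # {a, b, c, d}, H_delta(X) = 2 bits
```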



Lossy data compression

Example 4 For the lossy compressor where

    AX = {a, b, c, d, e, f, g, h}, and
    PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64},

we have:



Lossy data compression
Example 4 (Continued)

For this lossy compressor, we have:


Data compression of groups of symbols

We can ask ourselves whether we can achieve better compression if, instead of
encoding each symbol of the source individually, we encode groups of symbols
as blocks.

Let’s start by reasoning about the entropy of a group of symbols as a block.

Let x = (x1, x2, . . . , xN) be a string of N independent identically distributed
(i.i.d.) random variables from a single ensemble X.
Let X^N denote the ensemble (X1, X2, . . . , XN).
Because entropy is additive for independent variables, we have

    H(X^N) = N · H(X).

(A brute-force check for small N follows below.)
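
The sketch below (my own illustration) uses a hypothetical two-symbol source
with probabilities (0.9, 0.1) and N = 4:

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p = {'0': 0.9, '1': 0.1}       # a hypothetical two-symbol source
N = 4

# Probability of every length-N block, assuming i.i.d. symbols.
block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=N)]
print(entropy(block_probs), N * entropy(p.values()))   # both ~1.876 bits
```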



Data compression of groups of symbols

Example 5 Consider a string of N flips of a bent coin, x = (x1, x2, . . . , xN),
where xn ∈ {0, 1}, with probabilities p0 = 0.9 and p1 = 0.1.
If r(x) is the number of 1s in x, then

    p(x) = p0^(N−r(x)) · p1^(r(x)).



Data compression of groups of symbols

Example 5 (Continued)

If we want to encode blocks of size N, we can make a graph of how the number
Hδ(X^N) of bits necessary to encode the blocks varies as a function of the
error δ we are willing to tolerate.



Data compression of groups of symbols

Example 5 (Continued)

To encode blocks of size N we need Hδ(X^N) bits per block of N symbols.
That means that the number of bits per symbol is

    (1/N) · Hδ(X^N).

What happens as N grows?



Data compression of groups of symbols

Example 5 (Continued)

As N grows we have the following graph:

It seems that as N grows, (1/N) Hδ(X^N) flattens out (i.e., becomes constant),
independently of the value of the error δ tolerated.

What is the fixed value that (1/N) Hδ(X^N) tends to as N grows?
Shannon’s source coding theorem tells us what it is...

The Source Coding Theorem

Shannon’s source coding theorem. Let X be an ensemble. Then, for any error
tolerance δ with 0 < δ < 1,

    lim_{N→∞} (1/N) Hδ(X^N) = H(X).

In English, Shannon’s source coding theorem states that if

1. you are encoding N symbols of a source X, and
2. you are willing to accept a probability δ of error in the decompression,

then

3. the maximum achievable compression will use approximately H(X) bits per
symbol if the number N of symbols being encoded is large enough.
(A numeric sketch for the bent coin of Example 5 follows below.)
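
This sketch (my own illustration, not part of the slides) estimates
(1/N) Hδ(X^N) for the bent coin with p1 = 0.1 and δ = 0.01, using exact integer
arithmetic. It adds whole classes of blocks with the same number of 1s, so it
slightly overestimates Hδ, and the convergence toward H(X) ≈ 0.47 is slow.

```python
import math

def H_delta_per_symbol(N, delta_num=1, delta_den=100):
    """(1/N) * H_delta(X^N) for N i.i.d. flips of a coin with p0 = 0.9, p1 = 0.1,
    tolerating an error probability delta = delta_num/delta_den.
    Exact integers: a block with r ones has probability 9**(N-r) / 10**N."""
    kept, mass_num = 0, 0
    target = (delta_den - delta_num) * 10 ** N       # stop once included mass >= 1 - delta
    for r in range(N + 1):                           # blocks with fewer 1s are more probable
        if mass_num * delta_den >= target:
            break
        kept += math.comb(N, r)                      # whole class of C(N, r) blocks
        mass_num += math.comb(N, r) * 9 ** (N - r)
    return math.log2(kept) / N

H = 0.9 * math.log2(10 / 9) + 0.1 * math.log2(10)    # H(X) ~ 0.469 bits/symbol
for N in (10, 100, 1000):
    print(N, round(H_delta_per_symbol(N), 3))         # slowly decreases toward H(X)
print("H(X) =", round(H, 3))
```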



The Source Coding Theorem - Why it works

To see why Shannon’s coding theorem is true we will use the concept of a
typical string generated by the source.

A string of size N is typical if the frequency of each symbol in the string is
(approximately) the same as the probability of the symbol being produced by the
source, so that its probability is

    P(x)typ ≈ p1^(p1·N) · p2^(p2·N) · p3^(p3·N) · . . . · pI^(pI·N).



The Source Coding Theorem - Why it works
The information content of a typical string is

    h(x)typ = log2(1/P(x)typ)
            ≈ log2(1/(p1^(p1·N) · p2^(p2·N) · . . . · pI^(pI·N)))
            = N · Σ_i pi log2(1/pi)
            = N·H(X).

Note that we just showed that

    h(x)typ = log2(1/P(x)typ) ≈ N·H(X),

which implies that any typical string has probability

    P(x)typ ≈ 2^(−N·H(X)).
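
A quick empirical check, with a made-up three-symbol source (my own
illustration): a long i.i.d. string drawn from the source has information
content per symbol close to H(X), i.e., probability close to 2^(−N·H(X)).

```python
import math
import random

random.seed(0)
p = {'a': 0.5, 'b': 0.25, 'c': 0.25}                   # a made-up source
H = sum(q * math.log2(1 / q) for q in p.values())      # H(X) = 1.5 bits
N = 10_000

# Draw one i.i.d. string of N symbols and measure its information content.
x = random.choices(list(p), weights=list(p.values()), k=N)
h_x = sum(math.log2(1 / p[s]) for s in x)
print(h_x / N, H)    # ~1.5 and 1.5: p(x) is close to 2**(-N * H(X))
```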



The Source Coding Theorem - Why it works

Using the observations above, we define the typical set to be the set of typical
strings, up to a tolerance β ≥ 0:

    TNβ = {x | 2^(−N·H(X)−β) ≤ p(x) ≤ 2^(−N·H(X)+β)}.

The typical set satisfies two important properties:

1. The typical set TNβ contains almost all the probability: p(x ∈ TNβ) ≈ 1.

That is a direct consequence of the law of large numbers: as N grows, it
becomes less and less likely that the frequency of any given symbol in a string
will differ from its probability of being generated by the source.
Hence, as N grows, more and more strings will happen to be typical.

2. The typical set TNβ contains roughly 2^(N·H(X)) elements: |TNβ| ≈ 2^(N·H(X)).

That is a consequence of the fact that the typical set has probability almost 1,
and the probability of a typical element is about 2^(−N·H(X)), so the set must
have about 1/2^(−N·H(X)) = 2^(N·H(X)) elements.



The Source Coding Theorem - Why it works

The properties of the typical set lead us to Shannon’s coding theorem as follows.

We know that

1. Sδ contains all the probability of the sequences in X^N, up to an error δ, and
2. the typical set TNβ contains almost all the probability of the sequences in X^N.

Hence we can conclude that the two sets must have a large intersection, and
therefore comparable sizes:

    |Sδ| ≈ |TNβ| ≈ 2^(N·H(X)).

That means that the essential information content of X^N is

    Hδ(X^N) = log2 |Sδ| ≈ log2 |TNβ| ≈ log2 2^(N·H(X)) = N·H(X) bits,

which is exactly what Shannon’s coding theorem states.



The Source Coding Theorem - Asymptotic Equipartition
Property

As an addendum, if you like to be formal, our justification of Shannon’s coding
theorem can be formalized by the following principle, which is a direct
consequence of the law of large numbers.

Asymptotic Equipartition Principle (AEP). For an ensemble of N independent
identically distributed (i.i.d.) random variables X^N = (X1, X2, . . . , XN),
with N sufficiently large, the outcome x = (x1, x2, . . . , xN) is almost certain
to belong to a subset of AX^N having only 2^(N·H(X)) members, each having
probability “close” to 2^(−N·H(X)).



Take-home message

At the beginning of this lecture we set the goal of convincing you of three
claims:

1. The Shannon information content h(x = ai) = log2(1/p(x = ai)) is a sensible
measure of the information content of the outcome x = ai.
2. The entropy H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x)) is a sensible measure of the
expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be compressed
into roughly N · H(X) bits.

Are you convinced?

After this lecture, can you provide good arguments in favor of them?

