
The Source Coding Theorem

Mário S. Alvim
(msalvim@dcc.ufmg.br)

Information Theory

DCC-UFMG
(2020/01)



The Source Coding Theorem - Introduction

In this lecture we will define sensible measures of

1. the information content of a random event, and


2. the expected information content of a random experiment.

We will also see how the measures above are connected with the compression
of sources of information.



Definition of entropy and
related functions



Ensembles

Recall that an ensemble X is a triple (x, AX, PX), where:

x is the outcome of a random variable,

AX = {a1, a2, . . . , ai, . . . , aI} is the set of possible values for the random
variable, and

PX = {p1, p2, . . . , pI} are the probabilities of each value, with pi standing for
p(x = ai).



Introduction to Shannon entropy

The Shannon information content of an outcome x is defined to be

    h(x) = log2(1/p(x)),

and it is measured in bits.

Convention. From now on, log x stands for log2 x, unless otherwise stated.



Introduction to Shannon entropy

Example 1 Frequency of letters in “The Frequently Asked Questions Manual for
Linux”, and their entropy.



Introduction to Shannon entropy

The entropy of an ensemble X is defined to be the average Shannon information
content of an outcome:

    H(X) = Σ_{x ∈ AX} p(x) h(x) = Σ_{x ∈ AX} p(x) log2(1/p(x)),

with the convention that for p(x) = 0 we have

    0 · log2(1/0) = 0,

since lim_{θ→0} θ log2(1/θ) = 0.


When it is convenient we may write H(X ) as H(p), where p is the vector
(p1 , p2 , . . . , pI ).
Another name for the entropy of X is the uncertainty of X .
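
To make the definitions concrete, here is a small Python sketch (not from the
original slides) of h(x) and H(X); the four-outcome distribution used below is
made up for illustration.

```python
import math

def h(p_x):
    """Shannon information content (in bits) of an outcome with probability p_x."""
    return math.log2(1 / p_x)

def entropy(probs):
    """Entropy H(X) in bits, with the convention 0 * log2(1/0) = 0."""
    return sum(p * h(p) for p in probs if p > 0)

# A made-up ensemble with AX = {a, b, c, d}.
PX = [0.5, 0.25, 0.125, 0.125]
print([h(p) for p in PX])      # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(PX))             # 1.75 bits
```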



Introduction to Shannon entropy

Example 1 (Continued)

Frequency of letters in “The Frequently Asked Questions Manual for Linux”.



Introduction to Shannon entropy

Some properties of the entropy function:

1. Entropy is always non-negative:

H(X ) ≥ 0,

with equality iff pi = 1 for one i.

2. Entropy is maximized when p is uniform:

H(X ) ≤ log2 |AX |,

with equality iff pi = 1/|AX | for all i.

Verifying these properties is part of your homework assignment: for that you’ll
need Jensen’s inequality.
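
As a numerical sanity check only (not the proof asked for in the homework), the
following sketch evaluates both bounds on a few distributions; the random
distributions and the alphabet size of 8 are arbitrary choices.

```python
import math
import random

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

random.seed(0)
for _ in range(5):                                   # random distributions over |AX| = 8 outcomes
    w = [random.random() for _ in range(8)]
    p = [x / sum(w) for x in w]
    assert 0 <= entropy(p) <= math.log2(8) + 1e-9    # property 1 and property 2 hold

print(entropy([1.0, 0, 0, 0, 0, 0, 0, 0]))           # 0.0: a certain outcome gives zero entropy
print(entropy([1 / 8] * 8), math.log2(8))            # 3.0 and 3.0: the uniform case attains the max
```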



Jensen’s inequality and convex functions

A function f(x) is convex over an interval (a, b) if every chord of the function
lies above the function.
That is, if for all x1, x2 ∈ (a, b) and all 0 ≤ λ ≤ 1:

    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).

Jensen’s inequality: if f is a convex function and x is a random variable, then

    E[f(x)] ≥ f(E[x]),

where E[·] denotes expected value.
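
A quick numerical illustration of Jensen’s inequality (my own example, not from
the slides), assuming the convex function f(x) = x² and a uniform sample:

```python
import random

random.seed(1)
f = lambda x: x * x                               # a convex function
xs = [random.uniform(-1, 3) for _ in range(100_000)]

E_f_x = sum(f(x) for x in xs) / len(xs)           # E[f(x)]
f_E_x = f(sum(xs) / len(xs))                      # f(E[x])
print(E_f_x >= f_E_x)                             # True, as Jensen's inequality predicts
```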



Introduction to Shannon entropy

The joint entropy of X, Y is

    H(X, Y) = Σ_{xy ∈ AX AY} p(x, y) h(x, y) = Σ_{xy ∈ AX AY} p(x, y) log2(1/p(x, y)).



Introduction to Shannon entropy
Theorem Entropy is additive for independent random variables:
H(X , Y ) = H(X ) + H(Y ),
iff p(x, y ) = p(x)p(y ).

Proof.
We start by proving the following auxiliary result for when p(x, y) = p(x)p(y):

    h(x, y) = log2(1/p(x, y))                     (by def. of h(·))
            = log2(1/(p(x)p(y)))                  (p(x, y) = p(x)p(y))
            = log2((1/p(x)) · (1/p(y)))
            = log2(1/p(x)) + log2(1/p(y))
            = h(x) + h(y)                         (by def. of h(·))



Introduction to Shannon entropy

Proof. (Continued)

Then, we can show that:


    H(X, Y) = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x, y) h(x, y)                (by definition)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y) [h(x) + h(y)]         (x, y independent)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} [p(x)p(y)h(x) + p(x)p(y)h(y)]  (by distributivity)
            = Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(x)
              + Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(y)                 (splitting the sums (?))



Joint Entropy - Properties
Proof. (Continued)

Note that the first term in the sum in Equation (?) can be written as

    Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(x) = Σ_{x ∈ AX} p(x)h(x) Σ_{y ∈ AY} p(y)   (moving out constants)
                                        = Σ_{x ∈ AX} p(x)h(x) · 1               (Σ_{y ∈ AY} p(y) = 1)
                                        = H(X)                                  (by definition (??)),

and the second term in the sum in Equation (?) can be written as

    Σ_{x ∈ AX} Σ_{y ∈ AY} p(x)p(y)h(y) = Σ_{y ∈ AY} p(y)h(y) Σ_{x ∈ AX} p(x)   (moving out constants)
                                        = Σ_{y ∈ AY} p(y)h(y) · 1               (Σ_{x ∈ AX} p(x) = 1)
                                        = H(Y)                                  (by definition (???)).



Joint Entropy - Properties

Proof. (Continued)

Now we can substitute Equations (??) and (???) in Equation (?) to obtain

H(X , Y ) = H(X ) + H(Y ).

The proof of the converse, i.e., that if H(X , Y ) = H(X ) + H(Y ) then X and
Y are independent, is similar and is left as an exercise.
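
The theorem can also be checked numerically. The sketch below (my own
illustration, with made-up marginals) builds an independent joint distribution
and a dependent one, and compares H(X, Y) with H(X) + H(Y):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

pX = {'a': 0.5, 'b': 0.5}                 # made-up marginals
pY = {'0': 0.25, '1': 0.75}

# Independent joint distribution: p(x, y) = p(x) p(y).
pXY = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY)}
print(abs(entropy(pXY.values()) - (entropy(pX.values()) + entropy(pY.values()))) < 1e-12)  # True

# A dependent joint distribution (y deterministically copies x): additivity fails.
pXY_dep = {('a', '0'): 0.5, ('b', '1'): 0.5}
print(entropy(pXY_dep.values()), entropy(pX.values()) + entropy([0.5, 0.5]))  # 1.0 vs 2.0
```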



The Source Coding Theorem



The Source Coding Theorem - Three claims

Our goal in this class is to convince you of the following three claims:

1. The Shannon information content (a.k.a. surprisal, or self-information)

    h(x = ai) = log2(1/p(x = ai))

is a sensible measure of the information content of the outcome x = ai.

2. The entropy

    H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x))

is a sensible measure of the expected information content of an ensemble X.

3. Source coding theorem: N outcomes from a source X can be compressed into
roughly N · H(X) bits.



The Shannon information content of an outcome

Our first claim is that the Shannon information content

    h(x = ai) = log2(1/p(x = ai))

is a sensible measure of the information content of the outcome x = ai.



The Shannon information content of an outcome
Some intuitive support for our first claim:

a) The less probable an outcome is, the more informative its occurrence is.
Or: the more “surprising” an outcome is, the more informative it is.
The function h(x) captures this intuition, since it grows as the probability of x
diminishes:



The Shannon information content of an outcome

Some intuitive support for our first claim:

b) If an outcome x happens with certainty (i.e., p(x) = 1), its occurrence
conveys no information:

    h(x) = log2(1/p(x)) = log2(1/1) = 0.

If an outcome x is impossible (i.e., p(x) = 0), its occurrence would convey an
infinite amount of information:

    h(x) = log2(1/p(x)) = log2(1/0) = ∞.



The Shannon information content of an outcome

Some intuitive support for our first claim:

c) Independent events add up their surprises.

If p(x, y) = p(x)p(y) then

    h(x, y) = log2(1/p(x, y))
            = log2((1/p(x)) · (1/p(y)))
            = log2(1/p(x)) + log2(1/p(y))
            = h(x) + h(y).



The entropy of an ensemble

Our second claim is that the entropy

    H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x))

is a sensible measure of the expected information content of an ensemble
X = (x, AX, PX).



The entropy of an ensemble

Some intuitive support for our second claim:

a) (The weighing problem.) You are given 12 balls, all equal in weight except
for one that is either heavier or lighter.
You are also given a two-pan balance to use.
In each use of the balance you may put any number of the 12 balls on the left
pan, and the same number on the right pan, and push a button to initiate the
weighing; there are three possible outcomes:

1. either the weights are equal; or


2. the balls on the left are heavier; or
3. the balls on the left are lighter.

Your task is to design a strategy to determine which is the odd ball and
whether it is heavier or lighter than the others in as few uses of the balance as
possible.



The entropy of an ensemble
A possible optimal solution for the weighing problem.



The entropy of an ensemble

Insights on “information” gained from the weighing problem:

(i) The world may be in many different states, and you are uncertain about which
is the real one.
(ii) You have measurements (questions) that you can make (ask) to probe in what
state the world is.
(iii) Each measurement (question) produces an observation (answer) that allows you
to rule out some states of the world as impossible.
(iv) Each time a subset of possible states is ruled out, you gain some
information about the real state of the world.
The information you have increases because your uncertainty about the real
state of the world decreases.



The entropy of an ensemble

Insights on “information” gained from the weighing problem:

(v) The most efficient way of finding the actual state is to make the outcomes of
every measurement (question) as close to equally probable as possible.
If your measurement (question) allows for n different outcomes (types of
answers), it is best to use them so as to always split the set of still possible
states of the world into n sets of probability 1/n each.
(vi) The entropy (computed in base 3) of the ensemble of 24 equally likely states
(which of the 12 balls is odd, and whether it is heavier or lighter) is

    H(X) = Σ_{i=1}^{24} (1/24) log3(1/(1/24)) = log3 24 ≈ 2.89,

which is just about the minimal number of measurements (3) needed in a best
strategy (a quick numeric check follows below).
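
A couple of lines of Python mirroring the sum over the 24 equally likely states
(an illustration, not from the slides):

```python
import math

states = 24                      # 12 balls, each possibly heavier or lighter
H_base3 = sum((1 / states) * math.log(1 / (1 / states), 3) for _ in range(states))
print(H_base3)                   # ~2.8928: just under 3 weighings
```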



The entropy of an ensemble

Some intuitive support for our second claim:

b) (The guessing game.) Do you know the “20 questions” game? (If you don’t,
Google it, it’s a fun game to while away the time with friends during a long
trip.)
In a dumber version of the game, a friend thinks of a number between 0 and
15 and you have to guess which number was selected using yes/no questions.
What is the smallest number of questions needed to be guaranteed to identify
an integer between 0 and 15?



The entropy of an ensemble

An optimal strategy of yes/no questions to identify an integer in 0-15:



The entropy of an ensemble

Insights on “information” gained from the guessing game:

(i) A series of answers to yes/no questions can be used to uniquely identify an


object from a set.
(That’s why the questions are useful in the first place!)
(ii) The optimal strategy to win the game corresponds to the shortest sequences
of yes/no questions needed to identify objects in the set.
(iii) If you map each yes/no answer to a 0/1 bit, you get a unique binary string
that identifies each object in the set.
(iv) Hence, the optimal strategy to win the game leads to the shortest binary
description of objects in the set.



The entropy of an ensemble

Insights on “information” gained from the guessing game:

(v) Encoding information efficiently is related to asking the right questions.


(vi) We saw that the number of yes/no questions needed to identify an integer
between 0 and 15 is 4.
Let us calculate the Shannon information of the set of integers between 0 and
15, assuming your friend can pick any number in the set with equal probability:

    H(X) = Σ_{i=0}^{15} (1/16) log2(1/(1/16)) = log2 16 = 4.

Shannon entropy gave us the minimal number of questions necessary to win the
game. Is this a coincidence? (A sketch of the halving strategy follows below.)
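
The sketch below (my own illustration) assumes questions of the form
“is the number ≥ m?”, which is one possible optimal questioning scheme:

```python
import math

def guess(secret, lo=0, hi=15):
    """Identify `secret` in [lo, hi] using yes/no questions of the form 'is it >= m?'."""
    questions = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2       # split the remaining range in half
        questions += 1
        if secret >= mid:              # answer "yes"
            lo = mid
        else:                          # answer "no"
            hi = mid - 1
    return lo, questions

assert all(guess(s) == (s, 4) for s in range(16))   # always exactly 4 questions
print(math.log2(16))                                # 4.0 bits = H(X) for a uniform pick
```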



The entropy of an ensemble

Some intuitive support for our second claim:

c) (The game of submarine.) In a boring version of the “game of battleships”
called “game of submarine”, each player hides just one submarine in one
square of an eight-by-eight grid.
At each round, the other player picks a square in the grid to shoot at.
The two possible outcomes of a shot are y and n, corresponding to a hit and a
miss, and their probabilities depend on the state of the board.



The entropy of an ensemble
In the game of submarine, each shot made by a player defines an ensemble:
at the beginning:

    p(y) = 1/64 and p(n) = 63/64;

at the second shot, if the first shot missed:

    p(y | n) = 1/63 and p(n | n) = 62/63;

at the third shot, if the first two shots missed:

    p(y | nn) = 1/62 and p(n | nn) = 61/62;

...

at the k-th shot (with 1 ≤ k ≤ 64), if the first k − 1 shots missed:

    p(y | n^(k−1)) = 1/(64 − k + 1) and p(n | n^(k−1)) = (64 − k)/(64 − k + 1).



The entropy of an ensemble

A game of submarine in which the submarine is hit on the 49th attempt.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(i) If we get a hit y on the k-th shot, the sequence of answers we get is a string

    x = n^(k−1) y,

in which the first k − 1 symbols are n and the last one is y.
This string x is the outcome of the hit/miss experiment that uniquely
identifies the square where the submarine is.
For a fixed strategy, our game has 64 possible outcomes:

    y,
    ny,
    nny,
    . . .,
    nnnn . . . nny = n^62 y,
    nnnn . . . nnny = n^63 y.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:


In particular, we can call y a bit 1 and n a bit 0, and encode each of the
game’s 64 possible outcomes as a binary string:

    y = 1 (a binary string of 1 bit),
    ny = 01 (a binary string of 2 bits),
    nny = 001 (a binary string of 3 bits),
    . . .,
    n^62 y = 0^62 1 (a binary string of 63 bits),
    n^63 y = 0^63 1 (a binary string of 64 bits).

Note that the use of symbols {y, n} or {0, 1} makes little difference: each
binary string uniquely identifies a result of the game.
Note also that we have binary strings of many different sizes.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(ii) Let us calculate the Shannon information content of an arbitrary string
x = n^(k−1) y, for some 1 ≤ k ≤ 64:

    h(x = n^(k−1) y) = log2(1/p(x = n^(k−1) y))
                     = log2(1 / [(63/64) · (62/63) · (61/62) · ... · ((64−k+1)/(64−k+2)) · (1/(64−k+1))])
                     = log2(1/(1/64))
                     = log2 64
                     = 6 bits.



The entropy of an ensemble

Insights on “information” gained from the game of submarine:

(iii) Every outcome n^(k−1) y, no matter what value k assumes from 1 to 64, conveys
the same amount of information: h(n^(k−1) y) = 6 bits.
6 bits is exactly the number of bits necessary to uniquely identify a square in a
set of 64!
(iv) The information contained in a binary string representing an object is not
necessarily the number of bits in the string: it is related to the object the
string identifies within a set!
In our submarine example, all strings from y (which has a size of 1 bit) to n^63 y
(which has a size of 64 bits) carry the same amount of information: 6 bits each.
(A sketch of this calculation follows below.)
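
The telescoping product in item (ii) can be checked exactly with Fractions; the
values of k below are arbitrary samples (my own illustration):

```python
import math
from fractions import Fraction

def p_outcome(k):
    """Probability of the outcome n^(k-1) y: miss the first k-1 shots, then hit."""
    p = Fraction(1)
    for j in range(1, k):                 # j-th miss: (64 - j) safe squares out of (64 - j + 1)
        p *= Fraction(64 - j, 64 - j + 1)
    return p * Fraction(1, 64 - k + 1)    # the hit on the k-th shot

for k in (1, 2, 32, 64):
    p = p_outcome(k)
    print(k, p, math.log2(1 / p))         # p is always 1/64, so h is always 6.0 bits
```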



Entropy as the complexity of binary search

Complementing the intuitions we got from the previous examples, in a future
lecture we will be able to prove an operational interpretation of Shannon
entropy in terms of search trees.

The Shannon entropy H(X) of a random variable distributed according to a
probability distribution pX is:

    the expected number of comparisons needed
    for an optimal binary-search algorithm
    on a space of values X following distribution pX.

In this sense, a random variable with:


low entropy would be located “fast” using binary search, whereas
one with high entropy would need “more effort/time” to be located.

Note that there could be other ways of measuring the information of a
distribution: we’ll discuss them at the end of this course.
Data compression

Our third claim is that N outcomes from a source X can be compressed into
roughly N · H(X) bits.
In other words, our claim is that the number of bits necessary to compress N
outcomes of a source grows linearly in N, at a rate given by the entropy of the
source.

This claim implies an intimate connection between data compression and the
measure of information content of the source.
Before giving support for our third claim, let us understand better what we
mean by “data compression”.



Data compression

A source of information is a stochastic process

    X1, X2, X3, . . .

in which the outcome of each ensemble Xi is a symbol produced by the source.

Examples of sources of information include:

1. the speech produced by a human (each word is a symbol),
2. the sequence of pixels in a black-and-white image (each symbol is either
“black” or “white”),
3. the sequence of states of the weather in a region over a sequence of days (each
symbol is “good”, “cloudy”, or “rainy”).



Data compression

Average information content per symbol of a source:


If we can show that we can compress data from a particular source into a file
of L bits per source symbol and recover the data reliably, then we will say that
the average information content of that source is at most L bits per symbol.



Data compression

The raw bit content of an ensemble X is

    H0(X) = log2 |AX|.

H0 represents the minimum number of bits necessary to give a unique codeword
(i.e., a “name”) to every element in the ensemble X.

H0 is a measure of information of X:
it is a bound on the number of bits necessary to encode elements of X as
codewords.

This measure of information only considers the encoding of ensemble X, but it
does not consider how the encoding for the ensemble can be compressed.
To do compression, we need to take into consideration the probability of each
outcome of the ensemble.



Data compression

Example 2 (MacKay 4.5) Could there be a compressor that maps an outcome x to
a binary code c(x), and a decompressor that maps c back to x, such that every
possible outcome is compressed into a binary code of length shorter than
H0(X) bits?

Solution. No! Just use the pigeonhole principle to verify that. (This exercise
is part of your homework assignment for this lecture.)




Data compression

There are only two ways a compressor can compress files:

1. A lossy compressor maps all files to shorter codewords, but that means that
sometimes two or more files will necessarily be mapped to the same codeword.
The decompressor will be, in this case, unsure of how to decompress
ambiguous codewords, leading to a failure.
Calling δ the probability that the source file is one of the confusable files, a
lossy compressor has probability δ of failure.
If δ is small, the compressor is acceptable, but with some loss of information
(i.e., not all codewords are guaranteed to be decompressed correctly).



Data compression

There are only two ways a compressor can compress files:

2. A lossless compressor maps most files to shorter codewords, but it will
necessarily map some files to longer codewords.
By picking wisely which files to map to shorter codewords (i.e., the most
probable files) and which files to map to longer codewords (i.e., the least
probable files), the compressor can usually achieve satisfactory compression
rates, and without any loss of information.

In the remainder of this lecture we will cover a simple lossy compressor, and
in future lectures we will cover lossless compressors.



Lossy data compression
All compressors must take into consideration the probabilities of the different
outcomes a source may produce.

Example 3 Let

    AX = {a, b, c, d, e, f, g, h}
    PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}.

The raw bit content of this ensemble is

    H0 = log2 |AX| = log2 8 = 3 bits,

so to represent any symbol of the ensemble we need, in principle, codewords
of 3 bits each.
But notice that a small set contains almost all the probability:

    p(x ∈ {a, b, c, d}) = 15/16.
Lossy data compression
Example 3 (Continued)

If we accept a risk δ = 1/16 of not having a codeword for a symbol x, we can use
an encoding with only 2 bits per symbol instead of 3:


Lossy data compression

The above example can be generalized to the principle below.

Principle of lossy data compression.

Let Sδ denote the smallest δ-sufficient set (a subset of AX) satisfying

    p(x ∉ Sδ) ≤ δ or, equivalently, p(x ∈ Sδ) ≥ 1 − δ.

The maximum compression tolerating a probability of error at most δ uses
codewords of size log2 |Sδ| bits.

The quantity

    Hδ(X) = log2 |Sδ|

is called the essential bit content of X.

If we are willing to accept a probability δ of error, we can compress the source
from H0(X) bits per symbol to Hδ(X) bits per symbol. (A sketch of the greedy
construction of Sδ follows below.)
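
A minimal Python sketch of that greedy construction, applied to the ensemble of
Example 3 above; the function name essential_bit_content and the use of
Fractions are my own choices:

```python
import math
from fractions import Fraction

def essential_bit_content(P, delta):
    """H_delta(X) = log2 |S_delta|, with S_delta the smallest delta-sufficient set,
    built greedily: keep the most probable outcomes until the excluded mass is <= delta."""
    ranked = sorted(P.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], Fraction(0)
    for symbol, prob in ranked:
        if mass >= 1 - delta:        # the excluded outcomes already have probability <= delta
            break
        kept.append(symbol)
        mass += prob
    return kept, math.log2(len(kept))

# The ensemble of Example 3.
P = {'a': Fraction(1, 4), 'b': Fraction(1, 4), 'c': Fraction(1, 4), 'd': Fraction(3, 16),
     'e': Fraction(1, 64), 'f': Fraction(1, 64), 'g': Fraction(1, 64), 'h': Fraction(1, 64)}

print(essential_bit_content(P, 0))                  # all 8 symbols, H_0(X) = 3 bits
print(essential_bit_content(P, Fraction(1, 16)))    # {a, b, c, d}, H_delta(X) = 2 bits
```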



Lossy data compression

Example 4 For the lossy compressor where

    AX = {a, b, c, d, e, f, g, h}, and
    PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64},

we have:



Lossy data compression
Example 4 (Continued)

For this lossy compressor, we have:


Data compression of groups of symbols

We can ask ourselves whether we can achieve better compression if, instead of
encoding each symbol of the source individually, we encode groups of symbols
as blocks.

Let’s start by reasoning about the entropy of a group of symbols as a block.

Let x = (x1, x2, . . . , xN) be a string of N independent identically distributed
(i.i.d.) random variables from a single ensemble X.
Let X^N denote the ensemble (X1, X2, . . . , XN).
Because entropy is additive for independent variables, we have

    H(X^N) = N · H(X).

(A brute-force check for small N follows below.)
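
The sketch below (my own illustration) uses a hypothetical two-symbol source
with probabilities (0.9, 0.1) and N = 4:

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p = {'0': 0.9, '1': 0.1}       # a hypothetical two-symbol source
N = 4

# Probability of every length-N block, assuming i.i.d. symbols.
block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=N)]
print(entropy(block_probs), N * entropy(p.values()))   # both ~1.876 bits
```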



Data compression of groups of symbols

Example 5 Consider a string of N flips of a bent coin, x = (x1, x2, . . . , xN),
where xn ∈ {0, 1}, with probabilities p0 = 0.9 and p1 = 0.1.
If r(x) is the number of 1s in x, then

    p(x) = p0^(N−r(x)) · p1^(r(x)).



Data compression of groups of symbols

Example 5 (Continued)

If we want to encode blocks of size N, we can make a graph of how the number
Hδ(X^N) of bits necessary to encode the blocks varies as a function of the
error δ we are willing to tolerate.



Data compression of groups of symbols

Example 5 (Continued)

To encode blocks of size N we need Hδ(X^N) bits per block of N symbols.
That means that the number of bits per symbol is

    (1/N) · Hδ(X^N).

What happens as N grows?



Data compression of groups of symbols

Example 5 (Continued)

As N grows we have the following graph:

It seems that as N grows, (1/N) Hδ(X^N) flattens out (i.e., becomes constant),
independently of the value of the error δ tolerated.

What is the fixed value that (1/N) Hδ(X^N) tends to as N grows?
Shannon’s source coding theorem tells us what it is...

The Source Coding Theorem

Shannon’s source coding theorem. Let X be an ensemble. Then, for any error
tolerance δ with 0 < δ < 1,

    lim_{N→∞} (1/N) Hδ(X^N) = H(X).

In English, Shannon’s source coding theorem states that if

1. you are encoding N symbols of a source X, and
2. you are willing to accept a probability δ of error in the decompression,

then

3. the maximum achievable compression will use approximately H(X) bits per
symbol if the number N of symbols being encoded is large enough.
(A numeric sketch for the bent coin of Example 5 follows below.)
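
This sketch (my own illustration, not part of the slides) estimates
(1/N) Hδ(X^N) for the bent coin with p1 = 0.1 and δ = 0.01, using exact integer
arithmetic. It adds whole classes of blocks with the same number of 1s, so it
slightly overestimates Hδ, and the convergence toward H(X) ≈ 0.47 is slow.

```python
import math

def H_delta_per_symbol(N, delta_num=1, delta_den=100):
    """(1/N) * H_delta(X^N) for N i.i.d. flips of a coin with p0 = 0.9, p1 = 0.1,
    tolerating an error probability delta = delta_num/delta_den.
    Exact integers: a block with r ones has probability 9**(N-r) / 10**N."""
    kept, mass_num = 0, 0
    target = (delta_den - delta_num) * 10 ** N       # stop once included mass >= 1 - delta
    for r in range(N + 1):                           # blocks with fewer 1s are more probable
        if mass_num * delta_den >= target:
            break
        kept += math.comb(N, r)                      # whole class of C(N, r) blocks
        mass_num += math.comb(N, r) * 9 ** (N - r)
    return math.log2(kept) / N

H = 0.9 * math.log2(10 / 9) + 0.1 * math.log2(10)    # H(X) ~ 0.469 bits/symbol
for N in (10, 100, 1000):
    print(N, round(H_delta_per_symbol(N), 3))         # slowly decreases toward H(X)
print("H(X) =", round(H, 3))
```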



The Source Coding Theorem - Why it works

To see why Shannon’s coding theorem is true we will use the concept of a
typical string generated by the source.

A string of size N is typical if the frequency of each symbol in the string is
(approximately) the same as the probability of the symbol being produced by the
source, so that its probability is

    P(x)typ ≈ p1^(p1·N) · p2^(p2·N) · p3^(p3·N) · . . . · pI^(pI·N).



The Source Coding Theorem - Why it works
The information content of a typical string is

    h(x)typ = log2(1/P(x)typ)
            ≈ log2(1/(p1^(p1·N) · p2^(p2·N) · . . . · pI^(pI·N)))
            = N · Σ_i pi log2(1/pi)
            = N·H(X).

Note that we just showed that

    h(x)typ = log2(1/P(x)typ) ≈ N·H(X),

which implies that any typical string has probability

    P(x)typ ≈ 2^(−N·H(X)).
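
A quick empirical check, with a made-up three-symbol source (my own
illustration): a long i.i.d. string drawn from the source has information
content per symbol close to H(X), i.e., probability close to 2^(−N·H(X)).

```python
import math
import random

random.seed(0)
p = {'a': 0.5, 'b': 0.25, 'c': 0.25}                   # a made-up source
H = sum(q * math.log2(1 / q) for q in p.values())      # H(X) = 1.5 bits
N = 10_000

# Draw one i.i.d. string of N symbols and measure its information content.
x = random.choices(list(p), weights=list(p.values()), k=N)
h_x = sum(math.log2(1 / p[s]) for s in x)
print(h_x / N, H)    # ~1.5 and 1.5: p(x) is close to 2**(-N * H(X))
```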



The Source Coding Theorem - Why it works

Using the observations above, we define the typical set to be the set of typical
strings, up to a tolerance β ≥ 0:

    TNβ = {x | 2^(−N·H(X)−β) ≤ p(x) ≤ 2^(−N·H(X)+β)}.

The typical set satisfies two important properties:

1. The typical set TNβ contains almost all the probability: p(x ∈ TNβ) ≈ 1.

That is a direct consequence of the law of large numbers: as N grows, it
becomes less and less likely that the frequency of any given symbol in a string
will differ from its probability of being generated by the source.
Hence, as N grows, more and more strings will happen to be typical.

2. The typical set TNβ contains roughly 2^(N·H(X)) elements: |TNβ| ≈ 2^(N·H(X)).

That is a consequence of the fact that the typical set has probability almost 1,
and the probability of a typical element is about 2^(−N·H(X)), so the set must
have about 1/2^(−N·H(X)) = 2^(N·H(X)) elements.



The Source Coding Theorem - Why it works

The properties of the typical set lead us to Shannon’s coding theorem as follows.

We know that

1. Sδ contains all the probability of the sequences in X^N, up to an error δ, and
2. the typical set TNβ contains almost all the probability of the sequences in X^N.

Hence we can conclude that the two sets must have a large intersection, and
therefore comparable sizes:

    |Sδ| ≈ |TNβ| ≈ 2^(N·H(X)).

That means that the essential information content of X^N is

    Hδ(X^N) = log2 |Sδ| ≈ log2 |TNβ| ≈ log2 2^(N·H(X)) = N·H(X) bits,

which is exactly what Shannon’s coding theorem states.



The Source Coding Theorem - Asymptotic Equipartition
Property

As an addendum, if you like to be formal, our justification of Shannon’s coding
theorem can be formalized by the following principle, which is a direct
consequence of the law of large numbers.

Asymptotic Equipartition Principle (AEP). For an ensemble of N independent
identically distributed (i.i.d.) random variables X^N = (X1, X2, . . . , XN),
with N sufficiently large, the outcome x = (x1, x2, . . . , xN) is almost certain
to belong to a subset of AX^N having only 2^(N·H(X)) members, each having
probability “close” to 2^(−N·H(X)).



Take-home message

At the beginning of this lecture we set the goal of convincing you of three
claims:

1. The Shannon information content h(x = ai) = log2(1/p(x = ai)) is a sensible
measure of the information content of the outcome x = ai.
2. The entropy H(X) = Σ_{x ∈ AX} p(x) log2(1/p(x)) is a sensible measure of the
expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be compressed
into roughly N · H(X) bits.

Are you convinced?

After this lecture, can you provide good arguments in favor of them?

