CH 8
Rate-Distortion Theory
A distortion measure is a mapping
$$d : \mathcal{X} \times \hat{\mathcal{X}} \to \Re^+.$$
The value d(x, x̂) denotes the distortion incurred when a source symbol x is
reproduced as x̂.
A distortion measure d is normal if
$$c_x := \min_{\hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) = 0 \quad \text{for all } x \in \mathcal{X}.$$
For any distortion measure d, define its normalization d̃ by
$$\tilde{d}(x, \hat{x}) = d(x, \hat{x}) - c_x.$$
Then
$$Ed(X, \hat{X}) = E\tilde{d}(X, \hat{X}) + \Delta,$$
where
$$\Delta = \sum_x p(x) c_x$$
is a constant which depends only on p(x) and d but not on the conditional
distribution p(x̂|x).
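A minimal numerical sketch of a distortion measure and its normalization (the alphabets, matrix entries, and distributions below are illustrative choices, not from the text):

```python
import numpy as np

# Toy alphabets: X = {0, 1, 2}, X_hat = {0, 1, 2}.
# d[x, x_hat] is the distortion when x is reproduced as x_hat.
d = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.5, 1.0],
              [3.0, 2.0, 1.0]])

p = np.array([0.5, 0.3, 0.2])     # source distribution p(x)

# c_x = min over x_hat of d(x, x_hat); d is normal iff all c_x = 0.
c = d.min(axis=1)
d_tilde = d - c[:, None]          # the normalization of d

Delta = p @ c                     # depends only on p(x) and d
print("c_x =", c, " Delta =", Delta)

# For any conditional distribution q(x_hat|x),
# E d(X, X_hat) = E d_tilde(X, X_hat) + Delta.
q = np.array([[0.7, 0.2, 0.1],    # an arbitrary test channel; rows sum to 1
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
Ed = np.sum(p[:, None] * q * d)
Ed_tilde = np.sum(p[:, None] * q * d_tilde)
assert np.isclose(Ed, Ed_tilde + Delta)
```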
Definition 8.7 Let x̂* minimize Ed(X, x̂) over all x̂ ∈ X̂, and define Dmax = Ed(X, x̂*).
• If we know nothing about a source variable X, then x̂⇤ is the best estimate
of X, and Dmax is the minimum expected distortion between X and a
constant estimate of X.
• Specifically, Dmax can be asymptotically achieved by taking (x̂⇤ , x̂⇤ , · · · , x̂⇤ )
to be the reproduction sequence.
• Therefore it is not meaningful to impose the constraint D ≥ Dmax on the
reproduction sequence.
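A small numerical sketch of Definition 8.7, continuing the toy example above (the numbers remain illustrative): x̂* and Dmax are found by a direct search over X̂.

```python
# E d(X, x_hat) for each constant estimate x_hat is p @ d (one value per column).
expected_d = p @ d
x_hat_star = int(np.argmin(expected_d))
D_max = float(expected_d[x_hat_star])
print("x_hat* =", x_hat_star, " D_max =", D_max)
```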
8.2 The Rate-Distortion Function
All the discussions are with respect to an i.i.d. information source {Xk, k ≥ 1}
with generic random variable X and a distortion measure d.
Definition 8.8 An (n, M) rate-distortion code is defined by an encoding function
$$f : \mathcal{X}^n \to \{1, 2, \cdots, M\}$$
and a decoding function
$$g : \{1, 2, \cdots, M\} \to \hat{\mathcal{X}}^n.$$
The set {1, 2, · · · , M }, denoted by I, is called the index set. The reproduc-
tion sequences g(1), g(2), · · · , g(M ) in X̂ n are called codewords, and the set of
codewords is called the codebook.
[Figure: A rate-distortion code. The encoder maps the source sequence X to the index f(X), and the decoder maps f(X) to the reproduction sequence X̂ = g(f(X)).]
Definition 8.9 The rate of an (n, M) rate-distortion code is $n^{-1}\log M$ in bits
per symbol.
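For instance (numbers chosen only for illustration), an (n, M) = (4, 8) code has rate $\frac{1}{4}\log_2 8 = 0.75$ bits per symbol.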
Remark If (R, D) is achievable, then (R′, D) and (R, D′) are achievable for
all R′ ≥ R and D′ ≥ D. This in turn implies that (R′, D′) is achievable for all
R′ ≥ R and D′ ≥ D.
Definition 8.11 The rate-distortion region is the subset of $\Re^2$ containing all
achievable pairs (R, D).
Theorem 8.12 The rate-distortion region is closed and convex.

Proof
• The closedness follows from the definition of the achievability of an (R, D)
pair.
• The convexity is proved by time-sharing. Specifically, if (R⁽¹⁾, D⁽¹⁾) and
(R⁽²⁾, D⁽²⁾) are achievable, then so is (R⁽λ⁾, D⁽λ⁾) for any 0 ≤ λ ≤ 1, where
$$R^{(\lambda)} = \lambda R^{(1)} + \bar{\lambda} R^{(2)}$$
$$D^{(\lambda)} = \lambda D^{(1)} + \bar{\lambda} D^{(2)}$$
and λ̄ = 1 − λ.
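For instance (numbers chosen only for illustration, assuming both pairs are achievable for the source at hand), taking λ = 1/2, (R⁽¹⁾, D⁽¹⁾) = (1, 0) and (R⁽²⁾, D⁽²⁾) = (0, Dmax) gives
$$R^{(\lambda)} = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 0 = \tfrac{1}{2}, \qquad D^{(\lambda)} = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot D_{\max} = \tfrac{D_{\max}}{2},$$
realized operationally by using the first code on half of the source symbols and the second code on the other half.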
[Figure: The rate-distortion region, plotted in the (D, R) plane with Dmax marked on the D-axis.]
Definition 8.13 The rate-distortion function R(D) is the minimum of all rates
R for a given distortion D such that (R, D) is achievable.
[Figure: The rate-distortion function R(D): a curve decreasing from R(0) ≤ H(X) at D = 0 to 0 at D = Dmax; the rate-distortion region is the set of all points on or above the curve.]
8.3 The Rate-Distortion Theorem
Definition 8.16 For D ≥ 0, the information rate-distortion function is defined
by
$$R_I(D) = \min_{\hat{X}: Ed(X,\hat{X}) \le D} I(X; \hat{X}).$$
• The minimization is taken over the set of all p(x̂|x) such that Ed(X, X̂) ≤
D is satisfied, namely the set
$$\left\{ p(\hat{x}|x) : \sum_{x,\hat{x}} p(x)p(\hat{x}|x)d(x,\hat{x}) \le D \right\}.$$
• Since this set is compact in $\Re^{|\mathcal{X}||\hat{\mathcal{X}}|}$ and I(X; X̂) is a continuous functional
of p(x̂|x), the minimum value of I(X; X̂) can be attained.
• Since
$$E\tilde{d}(X, \hat{X}) = Ed(X, \hat{X}) - \Delta,$$
where Δ does not depend on p(x̂|x), we can always replace d by d̃ and D
by D − Δ in the definition of RI(D) without changing the minimization
problem.
• Without loss of generality, we can assume d is normal. A numerical sketch of the minimization defining RI(D) follows.
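The minimization is a convex program. It is commonly computed with the Blahut–Arimoto algorithm, which is not part of this chapter; the sketch below is mine, with a Lagrange multiplier s ≤ 0 tracing points (D, RI(D)) parametrically, and all names and parameter choices are illustrative:

```python
import numpy as np

def blahut_arimoto(p, d, s, n_iter=500):
    """For a multiplier s <= 0, return one point (D, R_I(D)) on the curve.

    p: source distribution p(x); d: distortion matrix d[x, x_hat].
    """
    nx, ny = d.shape
    q = np.full(ny, 1.0 / ny)              # output distribution q(x_hat)
    for _ in range(n_iter):
        # Optimal test channel for the current output distribution q:
        Q = q[None, :] * np.exp(s * d)     # unnormalized p(x_hat|x)
        Q /= Q.sum(axis=1, keepdims=True)
        q = p @ Q                          # updated output distribution
    D = np.sum(p[:, None] * Q * d)         # achieved distortion
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(Q > 0, Q / q[None, :], 1.0)
        R = np.sum(p[:, None] * Q * np.log2(ratio))  # I(X; X_hat) in bits
    return D, R

# Example: uniform binary source with Hamming distortion,
# where R_I(D) = 1 - h_b(D).
p = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
for s in [-8.0, -4.0, -2.0, -1.0]:
    D, R = blahut_arimoto(p, d, s)
    print(f"s={s:5.1f}  D={D:.4f}  R_I(D)={R:.4f}")
```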
Theorem 8.17 (The Rate-Distortion Theorem) R(D) = RI (D).
Theorem 8.18 The following properties hold for the information rate-distortion
function RI (D):
1. RI (D) is non-increasing in D.
2. RI (D) is convex.
3. RI(D) = 0 for D ≥ Dmax.
4. RI(0) ≤ H(X).
Proof of Theorem 8.18
1. For a larger D, the minimization is taken over a larger set.
3. Let X̂ = x̂* w.p. 1. Then Ed(X, X̂) = Dmax and I(X; X̂) = 0. Then for D ≥
Dmax, RI(D) ≤ I(X; X̂) = 0, which implies RI(D) = 0.
2. Let X̂⁽¹⁾ and X̂⁽²⁾ achieve RI(D⁽¹⁾) and RI(D⁽²⁾), respectively, and define X̂⁽λ⁾ by the conditional distribution
$$p_\lambda(\hat{x}|x) = \lambda p_1(\hat{x}|x) + \bar{\lambda} p_2(\hat{x}|x).$$
Then
$$Ed(X, \hat{X}^{(\lambda)}) = \lambda Ed(X, \hat{X}^{(1)}) + \bar{\lambda} Ed(X, \hat{X}^{(2)}) \le \lambda D^{(1)} + \bar{\lambda} D^{(2)} = D^{(\lambda)}.$$
Finally, consider
$$R_I(D^{(\lambda)}) \le I(X; \hat{X}^{(\lambda)}) \le \lambda I(X; \hat{X}^{(1)}) + \bar{\lambda} I(X; \hat{X}^{(2)}) = \lambda R_I(D^{(1)}) + \bar{\lambda} R_I(D^{(2)}),$$
where the second inequality follows from the convexity of mutual information in p(x̂|x).
Example 8.20 (binary source) Let X be a binary random variable with Pr{X = 1} = γ, and let d be the Hamming distortion measure.

Proof
• Assume 0 ≤ γ ≤ 1/2, so that Dmax = γ. Then for D < γ = Dmax and any X̂ such that Ed(X, X̂) ≤ D,
$$I(X; \hat{X}) \ge h_b(\gamma) - h_b(D).$$
• Now we need to construct an X̂ which is tight for (1) and (2), so that the above bound
is achieved.
[Figure: The reverse binary symmetric channel achieving the bound: X̂ with Pr{X̂ = 1} = (γ − D)/(1 − 2D) is connected to X through a BSC with crossover probability D, so that Pr{X = 1} = γ and Pr{X ≠ X̂} = D.]
• Therefore, we conclude that
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \gamma \\ 0 & \text{if } D \ge \gamma. \end{cases}$$
For 1/2 ≤ γ ≤ 1, by exchanging the roles of the symbols 0 and 1 and applying
the same argument, we obtain RI(D) as above except that γ is replaced by
1 − γ. Combining the two cases, we have
$$R_I(D) = \begin{cases} h_b(\gamma) - h_b(D) & \text{if } 0 \le D < \min(\gamma, 1 - \gamma) \\ 0 & \text{if } D \ge \min(\gamma, 1 - \gamma) \end{cases}$$
for 0 ≤ γ ≤ 1.
[Figure: RI(D) for the binary source with γ = 0.5: the curve 1 − h_b(D), decreasing from 1 at D = 0 to 0 at D = 0.5.]
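A short numerical check of this closed form (a sketch; the helper names and the spot-check values γ = 0.3, D = 0.1 are my own choices), using the reverse-BSC construction from the figure above:

```python
import numpy as np

def hb(p):
    """Binary entropy in bits (clipped so hb(0) = hb(1) = 0 numerically)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def R_I_binary(gamma, D):
    """R_I(D) = hb(gamma) - hb(D) for 0 <= D < min(gamma, 1-gamma), else 0."""
    return hb(gamma) - hb(D) if D < min(gamma, 1 - gamma) else 0.0

# Reverse BSC: Pr{X_hat = 1} = (gamma - D)/(1 - 2D), and X is obtained
# from X_hat through a BSC with crossover probability D.
gamma, D = 0.3, 0.1
alpha = (gamma - D) / (1 - 2 * D)         # Pr{X_hat = 1}
p_x1 = alpha * (1 - D) + (1 - alpha) * D  # recovers Pr{X = 1} = gamma
# Hamming distortion: E d(X, X_hat) = Pr{X != X_hat} = D by construction,
# and I(X; X_hat) = H(X) - H(X|X_hat) = hb(gamma) - hb(D).
I = hb(p_x1) - hb(D)
print(np.isclose(p_x1, gamma), I, R_I_binary(gamma, D))
```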
A Remark
The rate-distortion theorem does not include the source coding theorem as a
special case:
• In Example 8.20, RI(0) = hb(γ) = H(X).
• By the rate-distortion theorem, if R > H(X), the average Hamming dis-
tortion, i.e., the error probability per symbol, can be made arbitrarily
small.
• However, by the source coding theorem, if R > H(X), the message error
probability can be made arbitrarily small, which is much stronger.
8.4 The Converse
• Prove that for any achievable rate-distortion pair (R, D), R ≥ RI(D).
• Fix D and minimize R over all achievable pairs (R, D) to conclude that
R(D) ≥ RI(D).
Proof
1. Let (R, D) be any achievable rate-distortion pair, i.e., for any ε > 0, there
exists for sufficiently large n an (n, M) code such that
$$\frac{1}{n}\log M \le R + \epsilon$$
and
$$\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\} \le \epsilon,$$
where X̂ = g(f(X)).
2. Then
$$n(R + \epsilon) \overset{a)}{\ge} \log M$$
$$\cdots$$
$$\ge \sum_{k=1}^{n} I(X_k; \hat{X}_k)$$
$$\overset{c)}{\ge} \sum_{k=1}^{n} R_I(Ed(X_k, \hat{X}_k))$$
$$= n\left[\frac{1}{n}\sum_{k=1}^{n} R_I(Ed(X_k, \hat{X}_k))\right]$$
$$\overset{d)}{\ge} n R_I\!\left(\frac{1}{n}\sum_{k=1}^{n} Ed(X_k, \hat{X}_k)\right)$$
$$= n R_I(Ed(\mathbf{X}, \hat{\mathbf{X}})),$$
where c) follows from the definition of RI, d) follows from the convexity of RI (Jensen's inequality), and the last equality follows from the definition of the average distortion measure.
3. Consider
$$Ed(\mathbf{X}, \hat{\mathbf{X}}) = E[d(\mathbf{X}, \hat{\mathbf{X}}) \mid d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) > D + \epsilon\}$$
$$\quad + E[d(\mathbf{X}, \hat{\mathbf{X}}) \mid d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon]\Pr\{d(\mathbf{X}, \hat{\mathbf{X}}) \le D + \epsilon\}$$
$$\le d_{\max} \cdot \epsilon + (D + \epsilon) \cdot 1$$
$$= D + (d_{\max} + 1)\epsilon.$$
That is, if the probability that the average distortion between X and X̂
exceeds D + ε is small, then the expected average distortion between X
and X̂ can exceed D only by a small amount.
4. Therefore,
$$R + \epsilon \ge R_I(Ed(\mathbf{X}, \hat{\mathbf{X}})) \ge R_I(D + (d_{\max} + 1)\epsilon),$$
where the second inequality holds because RI is non-increasing. Letting ε → 0 and invoking the continuity of RI(D) in D, we conclude that R ≥ RI(D).
8.5 Achievability of RI(D)
• Fix any D and an X̂ such that Ed(X, X̂) ≤ D, and prove that the pair
(I(X; X̂), D) is achievable by a random coding scheme.
• Minimize I(X; X̂) over all such X̂ to conclude that (RI(D), D) is achiev-
able.
• This implies that RI(D) ≥ R(D).
Random Coding Scheme
• Fix ε > 0 and X̂ with Ed(X, X̂) ≤ D, where 0 ≤ D ≤ Dmax. Let δ > 0 be
specified later.
• Let M be an integer satisfying
$$I(X; \hat{X}) + \frac{\epsilon}{2} \le \frac{1}{n}\log M \le I(X; \hat{X}) + \epsilon,$$
where n is sufficiently large.
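As an illustrative instance (numbers mine): if I(X; X̂) = 0.5 bits, ε = 0.1, and n = 100, the condition reads 55 ≤ log₂ M ≤ 60, so M = 2⁵⁶ qualifies.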
• The random coding scheme:
1. Construct a codebook C of an (n, M ) code by randomly generating M
codewords in X̂ n independently and identically according to p(x̂)n .
Denote these codewords by X̂(1), X̂(2), · · · , X̂(M ).
2. Reveal the codebook C to both the encoder and the decoder.
3. The source sequence X is generated according to p(x)n .
4. The encoder encodes the source sequence X into an index K in the set
I = {1, 2, · · · , M}. The index K takes the value i if
(a) $(\mathbf{X}, \hat{\mathbf{X}}(i)) \in T^n_{[X\hat{X}]\delta}$;
(b) if there exists more than one i satisfying (a), let K be the largest one.
Otherwise, K takes the constant value 1. A toy simulation of the scheme follows.
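This sketch simulates steps 1–4 (illustration only: the blocklength, codebook size, tolerance δ, test channel, and the empirical strong-typicality-style check are my own choices; with such small parameters the encoder can occasionally fall back to K = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary source and a fixed test channel p(x_hat|x); Hamming distortion.
p_x = np.array([0.5, 0.5])
p_xhat_given_x = np.array([[0.7, 0.3],
                           [0.3, 0.7]])
p_joint = p_x[:, None] * p_xhat_given_x   # p(x, x_hat)
p_xhat = p_joint.sum(axis=0)              # p(x_hat)

n, M, delta = 100, 2**14, 0.05            # blocklength, codebook size, tolerance

def jointly_typical(x, xhat):
    """Empirical joint distribution of (x, xhat) within delta of p_joint."""
    emp = np.zeros((2, 2))
    np.add.at(emp, (x, xhat), 1.0 / n)
    return np.all(np.abs(emp - p_joint) <= delta)

# Steps 1-2: random codebook, revealed to encoder and decoder.
codebook = rng.choice(2, size=(M, n), p=p_xhat)
# Step 3: source sequence.
x = rng.choice(2, size=n, p=p_x)
# Step 4: K is the largest index whose codeword is jointly typical with x;
# if no such index exists, K falls back to the constant value 1.
K = 1
for i in range(M, 0, -1):
    if jointly_typical(x, codebook[i - 1]):
        K = i
        break
xhat = codebook[K - 1]                    # decoder output g(K)
print("K =", K, " per-symbol distortion =", np.mean(x != xhat))
```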
• Define the event $E_i = \{(\mathbf{X}, \hat{\mathbf{X}}(i)) \in T^n_{[X\hat{X}]\delta}\}$. Then
$$\{K = 1\} \subset E_2^c \cap E_3^c \cap \cdots \cap E_M^c.$$
Since the codewords are generated i.i.d.,
$$\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \le \Pr\{E_2^c \cap \cdots \cap E_M^c \mid \mathbf{X} = \mathbf{x}\} = (1 - \Pr\{E_1 \mid \mathbf{X} = \mathbf{x}\})^{M-1}.$$
• We will focus on $\mathbf{x} \in S^n_{[X]\delta}$, where
$$S^n_{[X]\delta} = \{\mathbf{x} \in T^n_{[X]\delta} : |T^n_{[\hat{X}|X]\delta}(\mathbf{x})| \ge 1\},$$
because $\Pr\{\mathbf{X} \in S^n_{[X]\delta}\} \approx 1$ for large n (Proposition 6.13).
n
• For x 2 S[X] , obtain a lower bound on Pr{E1 |X = x} as follows:
n o
n
Pr{E1 |X = x} = Pr (x, X̂(1)) 2 T[X X̂]
X
= p(x̂)
x̂2T n (x)
[X̂|X]
a) X
n(H(X̂)+⌘)
2
x̂2T n (x)
[X̂|X]
b)
2n(H(X̂|X) ⇠)
2 n(H(X̂)+⌘)
n(H(X̂) H(X̂|X)+⇠+⌘)
= 2
n(I(X;X̂)+⇣)
= 2 ,
• Then
$$\ln \Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \le (M - 1)\ln\left[1 - 2^{-n(I(X;\hat{X})+\zeta)}\right]$$
$$\overset{a)}{\le} \left(2^{n(I(X;\hat{X})+\frac{\epsilon}{2})} - 1\right)\ln\left[1 - 2^{-n(I(X;\hat{X})+\zeta)}\right]$$
$$\overset{b)}{\le} -\left(2^{n(I(X;\hat{X})+\frac{\epsilon}{2})} - 1\right)2^{-n(I(X;\hat{X})+\zeta)}$$
$$= -\left[2^{n(\frac{\epsilon}{2}-\zeta)} - 2^{-n(I(X;\hat{X})+\zeta)}\right],$$
where a) follows because $M - 1 \ge 2^{n(I(X;\hat{X})+\epsilon/2)} - 1$ and the logarithm is negative, and b) follows from ln(1 − x) ≤ −x. If δ is chosen small enough that ζ < ε/2, the right-hand side tends to −∞, so $\Pr\{K = 1 \mid \mathbf{X} = \mathbf{x}\} \to 0$ uniformly for $\mathbf{x} \in S^n_{[X]\delta}$.
• If M grows with n at a rate higher than I(X; X̂), then the probability
that there exists at least one X̂(i) which is jointly typical with the source
sequence X with respect to p(x, x̂) is high.
• Such an X̂(i), if it exists, would have d(X, X̂(i)) ≈ Ed(X, X̂) ≤ D, because
the joint relative frequency of (X, X̂(i)) is approximately p(x, x̂).
• Conditioning on {K ≠ 1}, we have $(\mathbf{X}, \hat{\mathbf{X}}) \in T^n_{[X\hat{X}]\delta}$.
• Therefore, Pr{d(X, X̂) > D + ε | K ≠ 1} = 0, which implies Pr{d(X, X̂) >
D + ε} ≤ Pr{K = 1} ≤ ε for sufficiently large n.
[Figure: Covering interpretation of the random coding scheme: ≈ 2^{nH(X)} typical sequences in T^n_{[X]}, ≈ 2^{nI(X;X̂)} codewords in T^n_{[X̂]}, each codeword jointly typical with ≈ 2^{nH(X|X̂)} of the source sequences.]