Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Book Heading
This textbook introduces the subject of information theory at a level suitable for advanced
undergraduate and graduate students. It develops both the classical Shannon theory and recent
applications in statistical learning. There are five parts covering foundations of information mea-
sures; (lossless) data compression; binary hypothesis testing and large deviations theory; channel
coding and channel capacity; lossy data compression; and, finally, statistical applications. There
are over 150 exercises included to help the reader learn about and bring attention to recent
discoveries in the literature.
Yihong Wu is a Professor in the Department of Statistics and Data Science at Yale University.
He obtained his B.E. degree from Tsinghua University in 2006 and Ph.D. degree from Princeton
University in 2011. He is a recipient of the NSF CAREER award in 2017 and the Sloan Research
Fellowship in Mathematics in 2018. He is broadly interested in the theoretical and algorithmic
aspects of high-dimensional statistics, information theory, and optimization.
i i
i i
i i
i i
i i
i i
Information Theory
From Coding to Learning
FIRS T E DI TI ON
Yury Polyanskiy
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Yihong Wu
Department of Statistics and Data Science
Yale University
i i
i i
i i
www.cambridge.org
Information on this title: www.cambridge.org/XXX-X-XXX-XXXXX-X
DOI: 10.1017/XXX-X-XXX-XXXXX-X
© Author name XXXX
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published XXXX
Printed in <country> by <printer>
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN XXX-X-XXX-XXXXX-X Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
i i
i i
i i
Dedicated to
Names
i i
i i
i i
Contents
Preface page xv
Introduction xvi
2 Divergence 17
2.1 Divergence and Radon-Nikodym derivatives 17
2.2 Divergence: main inequality and equivalent expressions 21
2.3 Differential entropy 23
2.4 Markov kernels 25
2.5 Conditional divergence, chain rule, data-processing inequality 27
2.6* Local behavior of divergence and Fisher information 32
2.6.1* Local behavior of divergence for mixtures 32
2.6.2* Parametrized family 34
3 Mutual information 37
3.1 Mutual information 37
3.2 Mutual information as difference of entropies 40
3.3 Examples of computing mutual information 42
3.4 Conditional mutual information and conditional independence 45
3.5 Sufficient statistics and data processing 48
i i
i i
i i
Contents vii
7 f-divergences 88
7.1 Definition and basic properties of f-divergences 88
7.2 Data-processing inequality; approximation by finite partitions 91
7.3 Total variation and Hellinger distance in hypothesis testing 95
7.4 Inequalities between f-divergences and joint range 98
7.5 Examples of computing joint range 102
7.5.1 Hellinger distance versus total variation 102
7.5.2 KL divergence versus total variation 103
7.5.3 Chi-squared versus total variation 103
7.6 A selection of inequalities between various divergences 104
7.7 Divergences between Gaussians 105
7.8 Mutual information based on f-divergence 106
7.9 Empirical distribution and χ2 -information 107
7.10 Most f-divergences are locally χ2 -like 109
7.11 f-divergences in parametric families: Fisher information 111
7.12 Rényi divergences and tensorization 115
7.13 Variational representation of f-divergences 118
7.14*Technical proofs: convexity, local expansions and variational representations 121
i i
i i
i i
viii Contents
i i
i i
i i
Contents ix
i i
i i
i i
x Contents
i i
i i
i i
Contents xi
i i
i i
i i
xii Contents
i i
i i
i i
Contents xiii
i i
i i
i i
xiv Contents
i i
i i
i i
Preface
This book is a modern introduction to the field of information theory. In the last two decades,
information theory has evolved from a discipline primarily dealing with problems of information
storage and transmission (“coding”) to one focusing increasingly on information extraction and
denoising (“learning”). This transformation is reflected in the title and content of this book.
The content grew out of the lecture notes accumulated over a decade of the authors’ teaching
regular courses at MIT, University of Illinois, and Yale, as well as short courses at EPFL (Switzer-
land) and ENSAE (France). Our intention is to use this manuscript as a textbook for a first course
on information theory for graduate (and advanced undergraduate) students, or for a second (topics)
course delving deeper into specific areas. A significant part of the book is devoted to the exposition
of information-theoretic methods which have found influential applications in other fields such as
statistical learning and computer science. (Specifically, we cover Kolmogorov’s metric entropy,
strong data processing inequalities, and entropic upper bounds for statistical estimation). We also
include some lesser known classical material (for example, connections to ergodicity) along with
the latest developments, which are often covered by the exercises (following the style of Csiszár
and Körner [81]).
It is hard to mention everyone, who helped us start and finish this work, but some stand out
especially. First and foremost, we owe our debt to Sergio Verdú, whose course at Princeton is
responsible for our life-long admiration of the subject. Furthermore, some techical choices, such
as the “one-shot” approach to coding theorems and simultaneous treatment of discrete and contin-
uous alphabets, reflect the style we learned from his courses. Next, we were fortunate to have many
bright students contribute to typing the lecture notes (precursor of this book), as well as to cor-
recting and extending the content. Among them, we especially thank Ganesh Ajjanagadde, Austin
Collins, Yuzhou Gu, Richard Guo, Qingqing Huang, Yunus Inan, Reka Inovan, Jason Klusowski,
Anuran Makur, Pierre Quinton, Aolin Xu, Sheng Xu, Pengkun Yang, Junhui Zhang.
Y. Polyanskiy <[email protected]>
MIT
Y. Wu <[email protected]>
Yale
i i
i i
i i
Introduction
What is information?
The Oxford English Dictionary lists 18 definitions of the word information, while the Merriam-
Webster Dictionary lists 17. This emphasizes the diversity of meaning and domains in which the
word information may appear. This book, however, is only concerned with a precise mathematical
understanding of information, independent of the application context.
How can we measure something that we cannot even define well? Among the earliest attempts
of quantifying information we can list R.A. Fisher’s works on the uncertainty of statistical esti-
mates (“confidence intervals”) and R. Hartley’s definition of information as the logarithm of the
number of possibilities. Around the same time, Fisher [127] and others identified connection
between information and thermodynamic entropy. This line of thinking culminated in Claude
Shannon’s magnum opus [277], where he formalized the concept of (what we call today the)
Shannon information and forever changed the human language by accepting John Tukey’s word
bit as the unit of its measurement. In addition to possessing a number of elegant properties, Shan-
non information turned out to also answer certain rigorous mathematical questions (such as the
optimal rate of data compression and data transmission). This singled out Shannon’s definition as
the right way of quantifying information. Classical information theory, as taught in [76, 81, 133],
focuses exclusively on this point of view.
In this book, however, we take a slightly more general point of view. To introduce it, let us
quote an emminent physicist L. Brillouin [53]:
We must start with a precise definition of the word “information”. We consider a problem involving a certain
number of possible answers, if we have no special information on the actual situation. When we happen to be
in possession of some information on the problem, the number of possible answers is reduced, and complete
information may even leave us with only one possible answer. Information is a function of the ratio of the
number of possible answers before and after, and we choose a logarithmic law in order to insure additivity of
the information contained in independent situations.
Note that only the last sentence specializes the more general term information to the Shannon’s
special version. In this book, we think of information without that last sentence. Namely, for us
information is a measure of difference between two beliefs about the system state. For example, it
could be the amount of change in our worldview following an observation or an event. Specifically,
suppose that initially the probability distribution P describes our understanding of the world (e.g.,
P allows us to answer questions such as how likely it is to rain today). Following an observation our
distribution changes to Q (e.g., upon observing clouds or a clear sky). The amount of information in
the observation is the dissimilarity between P and Q. How to quantify dissimilarity depends on the
particular context. As argued by Shannon, in many cases the right choice is the Kullback-Leibler
i i
i i
i i
Introduction xvii
(KL) divergence D(QkP), see Definition 2.1. Indeed, if the prior belief is described by a probability
mass function P = (p1 , . . . , pk ) on the set of k possible outcomes, then the observation of the first
outcome results in the new (posterior) belief vector Q = (1, 0, . . . , 0) giving D(QkP) = log p11 ,
and similarly for other outcomes. Since the outcome i happens with probability pi we see that the
average dissimilarity between the prior and posterior beliefs is
X
k
1
pi log ,
pi
i=1
1
For Institute of Electrical and Electronics Engineers; pronounced “Eye-triple-E”.
i i
i i
i i
xviii Introduction
have been heavily influenced by the information theory. Many more topics ranging from biology,
neuroscience and thermodynamics to pattern recognition, artificial intelligence and control theory
all regularly appear in information-theoretic conferences and journals.
It seems that objectively circumscribing the territory claimed by information theory is futile.
Instead, we highlight what we believe to be the most interesting developments of late.
First, information processing systems of today are much more varied compared to those of last
century. A modern controller (robot) is not just reacting to a few-dimensional vector of observa-
tions, modeled as a linear time-invariant system. Instead, it has million-dimensional inputs (e.g., a
rasterized image), delayed and quantized, which also need to be communicated across noisy links.
The target of statistical inference is no longer a low-dimensional parameter, but rather a high-
dimensional (possibly discrete) object with structure (e.g. a sparse matrix, or a graph between
communities). Furthermore, observations arrive to a statistician from spatially or temporally sep-
arated sources, which need to be transmitted cognizant of rate limitations. Recognizing these new
challenges, multiple communities simultaneously started re-investigating classical results (Chap-
ter 29) on the optimality of maximum-likelihood and the (optimal) variance bounds given by the
Fisher information. These developments in high-dimensional statistics, computer science and sta-
tistical learning depend on the mastery of the f-divergences (Chapter 7), the mutual-information
method (Chapter 30), and the strong version of the data-processing inequality (Chapter 33).
Second, since the 1990s technological advances have brought about a slew of new noisy channel
models. While classical theory addresses the so-called memoryless channels, the modern channels,
such as in flash storage, or urban wireless (multi-path, multi-antenna) communication, are far from
memoryless. In order to analyze these, the classical “asymptotic i.i.d.” theory is insufficient. The
resolution is the so-called “one-shot” approach to information theory, in which all main results are
developed while treating the channel inputs and outputs as abstract [307]. Only at the last step those
inputs are given the structure of long sequences and the asymptotic values are calculated. This
new “one-shot” approach has additional relevance to anyone willing to learn quantum information
theory, where it is in fact necessary.
Third, and perhaps the most important, is the explosion in the interest of understanding the meth-
ods and limits of machine learning from data. Information-theoretic methods were instrumental for
several discoveries in this area. As examples, we recall the concept of metric entropy (Chapter 27)
that is a cornerstone of Vapnik’s approach to supervised learning (known as empirical risk mini-
mization). In addition, metric entropy turns out to govern the fundamental limits of, and suggest
algorithms for, the problem of density estimation, the canonical building block of unsupervised
learning (Chapter 32). Another fascinating connection is that the optimal prediction performance
of online-learning algorithms is given by the maximum of the mutual information. This is shown
through a deep connection between prediction and universal compression (Chapter 13), which lead
to the multiplicative weight update algorithms [327, 74]. Finally, there is a common information-
theoretic method for solving a series of problems in distributed estimation, community detection
(in graphs), and computation with noisy logic gates. This method is a strong version of the classical
data-processing inequality (see Chapter 33), and is being actively developed and applied.
i i
i i
i i
Introduction xix
i i
i i
i i
xx Introduction
A note to statisticians
The interplay between information theory and statistics is a constant theme in the development of
both fields. Since its inception, information theory has been indispensable for understanding the
fundamental limits of statistical estimation. The prominent role of information-theoretic quanti-
ties, such as mutual information, f-divergence, metric entropy, and capacity, in establishing the
minimax rates of estimation has long been recognized since the seminal work of Le Cam [192],
Ibragimov and Khas’minski [162], Pinsker [234], Birgé [34], Haussler and Opper [157], Yang and
Barron [341], among many others. In Part VI of this book we give an exposition to some of the
most influential information-theoretic ideas and their applications in statistics. Of course, this is
not meant to be a thorough treatment of decision theory or mathematical statistics; for that purpose,
we refer to the classics [162, 196, 44, 313] and the more recent monographs [55, 265, 140, 328]
focusing on high dimensions. Instead, we apply the theory developed in previous Parts I–V of
this book to several concrete and carefully chosen examples of determining the minimax risk
in both classical (fixed-dimensional, large-sample asymptotic) and modern (high-dimensional,
non-asymptotic) settings.
At a high level, the connection between information theory (in particular, data transmission)
and statistical inference is that both problems are defined by a conditional distribution PY|X , which
is referred to as the channel for the former and the statistical model or experiment for the latter. In
data transmission we optimize the encoder, which maps messages to codewords, chosen in a way
that permits the decoder to reconstruct the message based on the noisy observation Y. In statistical
settings, Y is still the observation while X plays the role of the parameter which determines the
distribution of Y via PY|X ; the major distinction is that here we no longer have the freedom to
preselect X and the only task is to smartly estimate X (in either the average or the worst case) on
the basis of the data Y. Despite this key difference, many information-theoretic ideas still have
influential and fruitful applications for statistical problems, as we shall see next.
In Chapter 29 we show how the data processing inequality can be used to deduce classical lower
bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees). In Chapter 30 we introduce the
mutual information method, based on the reasoning in joint source-channel coding. Namely, by
comparing the amount of information contained in the data and the amount of information required
for achieving a given estimation accuracy, both measured in bits, this method allows us to apply
the theory of capacity and rate-distortion function developed in Parts IV and V to lower bound the
statistical risk. Besides being principled, this approach also unifies the three popular methods for
proving minimax lower bounds due to Le Cam, Assouad, and Fano respectively (Chapter 31).
It is a common misconception that information theory only supplies techniques for proving
negative results in statistics. In Chapter 32 we present three upper bounds on statistical estimation
risk based on metric entropy: Yang-Barron’s construction inspired by universal compression, Le
Cam-Birgé’s tournament based on pairwise hypothesis testing, and Yatracos’ minimum-distance
approach. These powerful methods are responsible for some of the strongest and most general
results in statistics and applicable for both high-dimensional and nonparametric problems. Finally,
in Chapter 33 we introduce the method based on strong data processing inequalities and apply it to
resolve an array of contemporary problems including community detection on graphs, distributed
i i
i i
i i
Introduction xxi
estimation with communication constraints and generating random tree colorings. These problems
are increasingly captivating the minds of computer scientists as well.
• Part I: Chapters 1–3, Sections 4.1, 5.1–5.3, 6.1, and 6.3, focusing only on discrete prob-
ability space and ignoring Radon-Nikodym derivatives. Some mention of applications in
combinatorics and cryptography (Chapters 8, 9) is recommended.
• Part II: Chapter 10, Sections 11.1–11.4.
• Part III: Chapter 14, Sections 15.1–15.3, and 16.1.
• Part IV: Chapters 17–18, Sections 19.1–19.3, 19.7, 20.1–20.2, 23.1.
• Part V: Sections 24.1–24.3, 25.1, 26.1, and 26.3.
• Conclude with a few applications of information theory outside the classical domain (Chap-
ters 30 and 33).
i i
i i
i i
General conventions
Analysis
• Let int(E) and cl(E) denote the interior and closure of a set E, namely, the largest open set
contained in and smallest closed set containing E, respectively.
• Let co(E) denote the convex hull of E (without topology), namely, the smallest convex set
Pn Pn
containing E, given by co(E) = { i=1 αi xi : αi ≥ 0, i=1 αi = 1, xi ∈ E, n ∈ N}.
• For subsets A, B of a real vector space and λ ∈ R, denote the dilation λA = {λa : a ∈ A} and
the Minkowski sum A + B = {a + b : a ∈ A, B ∈ B}.
• For a metric space (X , d), a function f : X → R is called C-Lipschiptz if |f(x) − f(y)| ≤ Cd(x, y)
for all x, y ∈ X . We set kfkLip(X ) = inf{C : f is C-Lipschitz}.
• The Lebesgue measure on Euclidean spaces is denoted by Leb and also by vol (volume).
• Throughout the book, all measurable spaces (X , E) are standard Borel spaces. Unless explicitly
needed, we suppress the underlying σ -algebra E .
i i
i i
i i
• The collection of all probability measures on X is denoted by ∆(X ). For finite spaces we
abbreviate ∆k ≡ ∆([k]), a (k − 1)-dimensional simplex.
• For measures P and Q, their product measure is denoted by P × Q or P ⊗ Q. The n-fold product
of P is denoted by Pn or P⊗n .
• Let P be absolutely continuous with respect to Q, denoted by P Q. The Radon-Nikodym
dP dP
derivative of P with respect to Q is denoted by dQ . For a probability measure P, if Q = Leb, dQ
is referred to the probability density function (pdf); if Q is the counting measure on a countable
X , dQ
dP
is the probability mass function (pmf).
• Let P ⊥ Q denote their mutual singularity, namely, P(A) = 0 and Q(A) = 1 for some A.
• The support of a probability measure P, denoted by supp(P), is the smallest closed set C such
that P(C) = 1. An atom x of P is such that P({x}) > 0. A distribution P is discrete if supp(P)
is a countable set (consisting of its atoms).
• Let X be a random variable taking values on X , which is referred to as the alphabet of X. Typi-
cally upper case, lower case, and script case are reserved for random variables, realizations, and
alphabets. Oftentimes X and Y are automatically assumed to be the alphabet of X and Y, etc.
• Let PX denote the distribution (law) of the random variable X, PX,Y the joint distribution of X
and Y, and PY|X the conditional distribution of Y given X.
• The independence of random variables X and Y is denoted by X ⊥ ⊥ Y, in which case PX,Y =
PX × PY . Similarly, X ⊥ ⊥ Y|Z denotes their conditional independence given Z, in which case
PX,Y|Z = PX|Z × PY|Z .
• Throughout the book, Xn ≡ Xn1 ≜ (X1 , . . . , Xn ) denotes an n-dimensional random vector. We
i.i.d.
write X1 , . . . , Xn ∼ P if they are independently and identically distributed (iid) as P, in which
case PXn = Pn .
• The empirical distribution of a sequence x1 , . . . , xn denoted by P̂xn ; empirical distribution of a
random sample X1 , . . . , Xn denoted by P̂n ≡ P̂Xn .
a.s. P d
• Let −−→, − →, − → denote convergence almost surely, in probability, and in distribution (law),
d
respectively. Let = denote equality in distribution.
• Some commonly used distributions are as follows:
– Ber(p): Bernoulli distribution with mean p.
– Bin(n, p): Binomial distribution with n trials and success probability p.
– Poisson(λ): Poisson distribution with mean λ.
– Let N ( μ, σ 2 ) denote the Gaussian (normal) distribution on R with mean μ and σ 2 and
N ( μ, Σ) the Gaussian distribution on Rd with mean μ and covariance matrix Σ. Denote
the standard normal density by φ(x) = √12π e−x /2 , the CDF and complementary CDF by
2
Rt
Φ(t) = −∞ φ(x)dx and Q(t) = Φc (t) = 1 − Φ(t). The inverse of Q is denoted by Q−1 (ϵ).
– Z ∼ Nc ( μ, σ 2 ) denotes the complex-valued circular symmetric normal distribution with
expectation E[Z] = μ ∈ C and E[|Z − μ|2 ] = σ 2 .
– For a compact subset X of Rd with non-empty interior, Unif(X ) denotes the uniform distri-
bution on X , with Unif(a, b) ≡ Unif([a, b]) for interval [a, b]. We also use Unif(X ) to denote
the uniform (equiprobable) distribution on a finite set X .
i i
i i
i i
Part I
Information measures
i i
i i
i i
i i
i i
i i
Information measures form the backbone of information theory. The first part of this book
is devoted to an in-depth study of various information measures, notably, entropy, divergence,
mutual information, as well as their conditional versions (Chapters 1–3). In addition to basic
definitions illustrated through concrete examples, we will also study various aspects including
regularity, tensorization, variational representation, local expansion, convexity and optimization
properties, as well as the data processing principle (Chapters 4–6). These information measures
will be imbued with operational meaning when we proceed to classical topics in information theory
such as data compression and transmission, in subsequent parts of the book.
In addition to the classical (Shannon) information measures, Chapter 7 provides a systematic
treatment of f-divergences, a generalization of (Shannon) measures introduced by Csíszar that
plays an important role in many statistical problems (see Parts III and VI). Finally, towards the
end of this part we will discuss two operational topics: random number generators in Chapter 9
and the application of entropy method to combinatorics and geometry Chapter 8.
i i
i i
i i
1 Entropy
This chapter introduces the first information measure – Shannon entropy. After studying its stan-
dard properties (chain rule, conditioning), we will briefly describe how one could arrive at its
definition. We discuss axiomatic characterization, the historical development in statistical mechan-
ics, as well as the underlying combinatorial foundation (“method of types”). We close the chapter
with Han’s and Shearer’s inequalities, that both exploit submodularity of entropy. After this Chap-
ter, the reader is welcome to consult the applications in combinatorics (Chapter 8) and random
number generation (Chapter 9), which are independent of the rest of this Part.
log2 ↔ bits
loge ↔ nats
log256 ↔ bytes
log ↔ arbitrary units, base always matches exp
Different units will be convenient in different cases and so most of the general results in this book
are stated with “baseless” log/exp.
i i
i i
i i
Definition 1.2 (Joint entropy). The joint entropy of n discrete random variables Xn ≜
(X1 , X2 , . . . , Xn ) is
h 1 i
H(Xn ) = H(X1 , . . . , Xn ) = E log .
PX1 ,...,Xn (X1 , . . . , Xn )
Note that joint entropy is a special case of Definition 1.1 applied to the random vector Xn =
(X1 , X2 , . . . , Xn ) taking values in the product space.
Remark 1.1. The name “entropy” originates from thermodynamics – see Section 1.3, which
also provides combinatorial justification for this definition. Another common justification is to
derive H(X) as a consequence of natural axioms for any measure of “information content” – see
Section 1.2. There are also natural experiments suggesting that H(X) is indeed the amount of
“information content” in X. For example, one can measure time it takes for ant scouts to describe
the location of the food to ants-workers. It was found that when nest is placed at the root of a full
binary tree of depth d and food at one of the leaves, the time was proportional to the entropy of a
random variable describing the food location [262]. (It was also estimated that ants communicate
with about 0.7–1 bit/min and that communication time reduces if there are some regularities in
path-description: paths like “left,right,left,right,left,right” are described by scouts faster).
Entropy measures the intrinsic randomness or uncertainty of a random variable. In the simple
setting where X takes values uniformly over a finite set X , the entropy is simply given by log-
cardinality: H(X) = log |X |. In general, the more spread out (resp. concentrated) a probability
mass function is, the higher (resp. lower) is its entropy, as demonstrated by the following example.
h(p)
Example 1.1 (Bernoulli). Let X ∼ Ber(p), with PX (1) = p
and PX (0) = p ≜ 1 − p. Then
log 2
1 1
H(X) = h(p) ≜ p log + p log .
p p
Here h(·) is called the binary entropy function, which is
continuous, concave on [0, 1], symmetric around 12 , and sat-
isfies h′ (p) = log pp , with infinite slope at 0 and 1. The
highest entropy is achieved at p = 21 (uniform), while the
lowest entropy is achieved at p = 0 or 1 (deterministic).
It is instructive to compare the plot of the binary entropy
p
function with the variance p(1 − p). 0 1
2
1
Example 1.2 (Geometric). Let X be geometrically distributed, with PX (i) = ppi , i = 0, 1, . . .. Then
E[X] = p̄p and
1 1 1 h( p)
H(X) = E[log ] = log + E[X] log = .
pp̄X p p̄ p
Example 1.3 (Infinite entropy). Is it possible that H(X) = +∞? Yes, for example, P[X = k] ∝
1
k ln2 k
, k = 2, 3, · · · .
i i
i i
i i
Many commonly used information measures have their conditional counterparts, defined
by applying the original definition to a conditional probability measure followed by a further
averaging. For entropy this is defined as follows.
Definition 1.3 (Conditional entropy). Let X be a discrete random variable and Y arbitrary. Denote
by PX|Y=y (·) or PX|Y (·|y) the conditional distribution of X given Y = y. The conditional entropy of
X given Y is
h 1 i
H(X|Y) = Ey∼PY [H(PX|Y=y )] = E log ,
PX|Y (X|Y)
Similar to entropy, conditional entropy measures the remaining randomness of a random vari-
able when another is revealed. As such, H(X|Y) = H(X) whenever Y is independent of X. But
when Y depends on X, observing Y does lower the entropy of X. Before formalizing this in the
next theorem, here is a concrete example.
Example 1.4 (Conditional entropy and noisy channel). Let Y be a noisy observation of X ∼ Ber( 21 )
as follows.
1.0
0.8
0.6
0.4
0.2
0.0
Before discussing various properties of entropy and conditional entropy, let us first review some
relevant facts from convex analysis, which will be used extensively throughout the book.
i i
i i
i i
Review: Convexity
(f) (Entropy under deterministic transformation) H(X) = H(X, f(X)) ≥ H(f(X)) with equality iff
f is one-to-one on the support of PX .
(g) (Full chain rule)
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |Xi−1 ) ≤ H(Xi ), (1.3)
i=1 i=1
i i
i i
i i
10
Proof. (a) Since log PX1(X) is a positive random variable, its expectation H(X) is also positive,
with H(X) = 0 if and only if log PX1(X) = 0 almost surely, namely, PX is a point mass.
(b) Apply Jensen’s inequality to the strictly concave function x 7→ log x:
1 1
H(X) = E log ≤ log E = log |X |.
PX (X) PX (X)
(c) H(X) as a summation only depends on the values of PX , not locations.
(d) Abbreviate P(x) ≡ PX (x) and P(x|y) ≡ PX|Y (x|y). Using P(x) = EY [P(x|Y)] and applying
Jensen’s inequality to the strictly concave function x 7→ x log 1x ,
X 1
X
1
H(X|Y) = EY P(x|Y) log ≤ P(x) log = H(X).
P(x|Y) P ( x)
x∈X x∈X
Additionally, this also follows from (and is equivalent to) Corollary 3.5 in Chapter 3 or
Theorem 5.2 in Chapter 5.
(e) Telescoping PX,Y (X, Y) = PY|X (Y|X)PX (X) and noting that both sides are positive PX,Y -almost
surely, we have
1 h 1 i h 1 i h 1 i
E[log ] = E log = E log + E log
PX,Y (X, Y) PX (X) · PY|X (Y|X) PX (X) PY|X (Y|X)
| {z } | {z }
H(X) H(Y|X)
(f) The intuition is that (X, f(X)) contains the same amount of information as X. Indeed, x 7→
(x, f(x)) is one-to-one. Thus by (c) and (e):
i i
i i
i i
Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original
state periodically), produces useful work and whose only other effect on the outside world is drawing heat from
a warm body. (That is, every such machine, should expend some amount of heat to some cold body too!)1
1
Note that the reverse effect (that is converting work into heat) is rather easy: friction is an example.
i i
i i
i i
12
Equivalent formulation is as follows: “There does not exist a cyclic process that transfers heat
from a cold body to a warm body”. That is, every such process needs to be helped by expending
some amount of external work; for example, the air conditioners, sadly, will always need to use
some electricity.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy
of it (initially described as “transformative content” or Verwandlungsinhalt in German), whose
value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or
“very slow”) process operated on this machine the change of entropy is proportional to the ratio
of absorbed heat and the temperature of the machine:
∆Q
∆S = . (1.4)
T
If heat Q is absorbed at temperature Thot then to return to the original state, one must return some
amount of heat Q′ , where Q′ can be significantly smaller than Q but never zero if Q′ is returned
at temperature 0 < Tcold < Thot . Further logical arguments can convince one that for irreversible
cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy
cannot reduce.
There were great many experimentally verified consequences that second law produced. How-
ever, what is surprising is that the mysterious entropy did not have any formula for it (unlike, say,
energy), and thus had to be computed indirectly on the basis of relation (1.4). This was changed
with the revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation
of the second law based on statistical physics principles and showed that, e.g., for a system of n
independent particles (as in ideal gas) the entropy of a given macro-state can be computed as
X
ℓ
1
S = kn pj log , (1.5)
pj
j=1
where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is
the fraction of particles in j-th molecular state.
More explicitly, their innovation was two-fold. First, they separated the concept of a micro-
state (which in our example above corresponds to a tuple of n states, one for each particle) and the
macro-state (a list {pj } of proportions of particles in each state). Second, they postulated that for
experimental observations only the macro-state matters, but the multiplicity of the macro-state
(number of micro-states that correspond to a given macro-state) is precisely the (exponential
of the) entropy. The formula (1.5) then follows from the following explicit result connecting
combinatorics and entropy.
i i
i i
i i
1.4* Submodularity 13
Pk
Proposition 1.5 (Method of types). Let n1 , . . . , nk be non-negative integers with i=1 ni = n,
nk ≜
n
and denote the distribution P = (p1 , . . . , pk ), pi = nni . Then the multinomial coefficient n1 ,...
n!
n1 !···nk ! satisfies
1 n
exp{nH(P)} ≤ ≤ exp{nH(P)} .
( 1 + n) k − 1 n1 , . . . nk
i.i.d. Pn
Proof. For the upper bound, let X1 , . . . , Xn ∼ P and let Ni = i=1 1{Xj =i} denote the number of
occurences of i. Then (N1 , . . . , Nk ) has a multinomial distribution:
Y
k
n n′
P[N1 = n′1 , . . . , Nk = n′k ] = ′ ′ pi i ,
n1 , . . . , nk
i=1
for any nonnegative integers n′i such that n′1 + · · · + n′k = n. Recalling that pi = ni /n, the upper
bound follows from P[N1 = n1 , . . . , Nk = nk ] ≤ 1. In addition, since (N1 , . . . , Nk ) takes at most
(n + 1)k−1 values, the lower bound follows if we can show that (n1 , . . . , nk ) is its mode. Indeed,
for any n′i with n′1 + · · · + n′k = n, defining ∆i = n′i − ni we have
Proposition 1.5 shows that the multinomial coefficient can be approximated up to a polynomial
(in n) term by exp(nH(P)). More refined estimates can be obtained; see Ex. I.2. In particular, the
binomial coefficient can be approximated using the binary entropy function as follows: Provided
that p = nk ∈ (0, 1),
n
e−1/6 ≤ k
≤ 1. (1.6)
√ 1
enh(p)
2πnp(1−p)
For more on combinatorics and entropy, see Ex. I.1, I.3 and Chapter 8.
1.4* Submodularity
Recall that [n] denotes a set {1, . . . , n}, Sk denotes subsets of S of size k and 2S denotes all subsets
of S. A set function f : 2S → R is called submodular if for any T1 , T2 ⊂ S
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing
returns”. Indeed consider T′ ⊂ T and b 6∈ T. Then
f( T ∪ b) − f( T ) ≤ f( T ′ ∪ b) − f( T ′ ) .
i i
i i
i i
14
Proof. Let A = XT1 \T2 , B = XT1 ∩T2 , C = XT2 \T1 . Then we need to show
H(A, B, C) + H(B) ≤ H(A, B) + H(B, C) .
This follows from a simple chain
H(A, B, C) + H(B) = H(A, C|B) + 2H(B) (1.8)
≤ H(A|B) + H(C|B) + 2H(B) (1.9)
= H(A, B) + H(B, C) (1.10)
on Xn . Let us also denote Γ̄∗n the closure of Γ∗n . It is not hard to show, cf. [347], that Γ̄∗n is also a
closed convex cone and that
Γ∗n ⊂ Γ̄∗n ⊂ Γn .
The astonishing result of [348] is that
Γ∗2 = Γ̄∗2 = Γ2 (1.11)
Γ∗3 ⊊ Γ̄∗3 = Γ3 (1.12)
Γ∗n ⊊ Γ̄∗n ⊊Γn n ≥ 4. (1.13)
This follows from the fundamental new information inequality not implied by the submodularity
of entropy (and thus called non-Shannon inequality). Namely, [348] showed that for any 4-tuple
of discrete random variables:
1 1 1
I(X3 ; X4 ) − I(X3 ; X4 |X1 ) − I(X3 ; X4 |X2 ) ≤ I(X1 ; X2 ) + I(X1 ; X3 , X4 ) + I(X2 ; X3 , X4 ) .
2 4 4
(This can be restated in the form of an entropy inequality using Theorem 3.4 but the resulting
expression is too cumbersome).
1 1
H̄n ≤ · · · ≤ H̄k · · · ≤ H̄1 . (1.14)
n k
i i
i i
i i
Furthermore, the sequence H̄k is increasing and concave in the sense of decreasing slope:
H̄m
Proof. Denote for convenience H̄0 = 0. Note that m is an average of differences:
1X
m
1
H̄m = (H̄k − H̄k−1 )
m m
k=1
Thus, it is clear that (1.15) implies (1.14) since increasing m by one adds a smaller element to the
average. To prove (1.15) observe that from submodularity
Now average this inequality over all n! permutations of indices {1, . . . , n} to get
as claimed by (1.15).
Alternative proof: Notice that by “conditioning decreases entropy” we have
Theorem 1.8 (Shearer’s Lemma). Let Xn be discrete n-dimensional RV and let S ⊂ [n] be a
random variable independent of Xn and taking values in subsets of [n]. Then
Remark 1.2. In the special case where S is uniform over all subsets of cardinality k, (1.16) reduces
to Han’s inequality 1n H(Xn ) ≤ 1k H̄k . The case of n = 3 and k = 2 can be used to give an entropy
proof of the following well-known geometry result that relates the size of 3-D object to those
of its 2-D projections: Place N points in R3 arbitrarily. Let N1 , N2 , N3 denote the number of dis-
tinct points projected onto the xy, xz and yz-plane, respectively. Then N1 N2 N3 ≥ N2 . For another
application, see Section 8.2.
Proof. We will prove an equivalent (by taking a suitable limit) version: If C = (S1 , . . . , SM ) is a
list (possibly with repetitions) of subsets of [n] then
X
H(XSj ) ≥ H(Xn ) · min deg(i) , (1.17)
i
j
where deg(i) ≜ #{j : i ∈ Sj }. Let us call C a chain if all subsets can be rearranged so that
S1 ⊆ S2 · · · ⊆ SM . For a chain, (1.17) is trivial, since the minimum on the right-hand side is either
i i
i i
i i
16
For the case of C not a chain, consider a pair of sets S1 , S2 that are not related by inclusion and
replace them in the collection with S1 ∩ S2 , S1 ∪ S2 . Submodularity (1.7) implies that the sum on the
left-hand side of (1.17) does not increase under this replacement, values deg(i) are not changed.
Notice that the total number of pairs that are not related by inclusion strictly decreases by this
replacement: if T was related by inclusion to S1 then it will also be related to at least one of S1 ∪ S2
or S1 ∩ S2 ; if T was related to both S1 , S2 then it will be related to both of the new sets as well.
Therefore, by applying this operation we must eventually arrive to a chain, for which (1.17) has
already been shown.
Remark 1.3. Han’s inequality (1.15) holds for any submodular set-function. For Han’s inequal-
ity (1.14) we also need f(∅) = 0 (this can be achieved by adding a constant to all values of f).
Shearer’s lemma holds for any submodular set-function that is also non-negative.
Example 1.5 (Non-entropy submodular function). Another submodular set-function is
S 7→ I(XS ; XSc ) .
Han’s inequality for this one reads
1 1
0= In ≤ · · · ≤ Ik · · · ≤ I1 ,
n k
1
P
where Ik = S:|S|=k I(XS ; XSc ) measures the amount of k-subset coupling in the random vector
(nk)
Xn .
2
Note that, consequently, for Xn without constant coordinates, and if C is a chain, (1.17) is only tight if C consists of only ∅
and [n] (with multiplicities). Thus if degrees deg(i) are known and non-constant, then (1.17) can be improved, cf. [206].
i i
i i
i i
2 Divergence
In this chapter we study divergence D(PkQ) (also known as information divergence, Kullback-
Leibler (KL) divergence, relative entropy), which is the first example of dissimilarity (information)
measure between a pair of distributions P and Q. As we will see later in Chapter 7, KL diver-
gence is a special case of f-divergences. Defining KL divergence and its conditional version in full
generality requires some measure-theoretic acrobatics (Radon-Nikodym derivatives and Markov
kernels), that we spend some time on. (We stress again that all this abstraction can be ignored if
one is willing to only work with finite or countably-infinite alphabets.)
Besides definitions we prove the “main inequality” showing that KL-divergence is non-negative.
Coupled with the chain rule for divergence, this inequality implies the data-processing inequality,
which is arguably the central pillar of information theory and this book. We conclude the chapter
by studying local behavior of divergence when P and Q are close. In the special case when P and
Q belong to a parametric family, we will see that divergence is locally quadratic with Hessian
being the Fisher information, explaining the fundamental role of the latter in classical statistics.
Review: Measurability
• All complete separable metric spaces, endowed with Borel σ -algebras are standard
Borel. In particular, countable alphabets and Rn and R∞ (space of sequences) are
standard Borel.
Q∞
• If Xi , i = 1, . . . are standard Borel, then so is i=1 Xi .
• Singletons {x} are measurable sets.
• The diagonal {(x, x) : x ∈ X } is measurable in X × X .
17
i i
i i
i i
18
We now need to define the second central concept of this book: the relative entropy, or Kullback-
Leibler divergence. Before giving the formal definition, we start with special cases. For that we
fix some alphabet A. The relative entropy from between distributions P and Q on X is denoted by
D(PkQ), defined as follows.
• Suppose A = Rk , P and Q have densities (pdfs) p and q with respect to the Lebesgue measure.
Then
(R
{p>0,q>0}
p(x) log qp((xx)) dx Leb{p > 0, q = 0} = 0
D(PkQ) = (2.2)
+∞ otherwise
These two special cases cover a vast majority of all cases that we encounter in this book. How-
ever, mathematically it is not very satisfying to restrict to these two special cases. For example, it
is not clear how to compute D(PkQ) when P and Q are two measures on a manifold (such as a
unit sphere) embedded in Rk . Another problematic case is computing D(PkQ) between measures
on the space of sequences (stochastic processes). To address these cases we need to recall the
concepts of Radon-Nikodym derivative and absolute continuity.
Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by
P Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P Q, then Radon-Nikodym
theorem show that there exists a function f : X → R+ such that for any measurable set E,
Z
P(E) = fdQ. [change of measure] (2.3)
E
dP
Such f is called a relative density or a Radon-Nikodym derivative of P w.r.t. Q, denoted by dQ . Not
dP dP
that dQ may not be unique. In the simple cases, dQ is just the familiar likelihood ratio:
We can see that the two special cases of D(PkQ) were both computing EP [log dQdP
]. This turns
out to be the most general definition that we are looking for. However, we will state it slightly
differently, following the tradition.
i i
i i
i i
Below we will show (Lemma 2.4) that the expectation in (2.4) is well-defined (but possibly
infinite) and coincides with EP [log dQdP
] whenever P Q.
To demonstrate the general definition in the case not covered by discrete/continuous special-
izations, consider the situation in which both P and Q are given as densities with respect to a
common dominating measure μ, written as dP = fP dμ and dQ = fQ dμ for some non-negative
fP , fQ . (In other words, P μ and fP = dP dμ .) For example, taking μ = P + Q always allows one to
specify P and Q in this form. In this case, we have the following expression for divergence:
(R
dμ fP log ffQP μ({fQ = 0, fP > 0}) = 0,
D(PkQ) = fQ >0,fP >0
(2.5)
+∞ otherwise
Indeed, first note that, under the assumption of P μ and Q μ, we have P Q iff
μ({fQ = 0, fP > 0}) = 0. Furthermore, if P Q, then dQdP
= ffQP Q-a.e, in which case apply-
ing (2.3) and (1.1) reduces (2.5) to (2.4). Namely, D(PkQ) = EQ [ dQ dP dP
log dQ ] = EQ [ ffQP log ffQP ] =
R R
dμfP log ffQP 1{fQ >0} = dμfP log ffQP 1{fQ >0,fP >0} .
Note that D(PkQ) was defined to be +∞ if P 6 Q. However, it can also be +∞ even when
P Q. For example, D(CauchykGaussian) = ∞. However, it does not mean that there are
somehow two different ways in which D can be infinite. Indeed, what can be shown is that in
both cases there exists a sequence of (finer and finer) finite partitions Π of the space A such that
evaluating KL divergence between the induced discrete distributions P|Π and Q|Π grows without
a bound. This will be subject of Theorem 4.5 below.
Our next observation is that, generally, D(PkQ) 6= D(QkP) and, therefore, divergence is not a
distance. We will see later, that this is natural in many cases; for example it reflects the inherent
asymmetry of hypothesis testing (see Part III and, in particular, Section 14.5). Consider the exam-
ple of coin tossing where under P the coin is fair and under Q the coin always lands on the head.
Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon
observing HHT, one knows for sure it is P. Indeed, D(PkQ) = ∞, D(QkP) = 1 bit.
Having made these remarks we proceed to some examples. First, we show that D is unsurpris-
ingly a generalization of entropy.
i i
i i
i i
20
1
log q
d(p∥q) d(p∥q)
1
log q̄
q p
0 p 1 0 q 1
i i
i i
i i
D(PkQ) ≥ 0,
Proof. In view of the definition (2.4), it suffices to consider P Q. Let φ(x) ≜ x log x, which
is strictly convex on R+ . Applying Jensen’s Inequality:
h dP i h dP i
D(PkQ) = EQ φ ≥ φ EQ = φ(1) = 0,
dQ dQ
dP
with equality iff dQ = 1 Q-a.e., namely, P = Q.
Lemma 2.4. Let P, Q, R μ and fP , fQ , fR denote their densities relative to μ. Define a bivariate
function Log ab : R+ × R+ → R ∪ {±∞} by
−∞ a = 0, b > 0
a +∞ a > 0, b = 0
Log = (2.10)
b 0 a = 0, b = 0
log ab a > 0, b > 0.
Then the following results hold:
• First,
fR
EP Log = D(PkQ) − D(PkR) , (2.11)
fQ
provided at least one of the hdivergences
i is finite.
• Second, the expectation EP Log fQ is well-defined (but possibly infinite) and, furthermore,
fP
fP
D(PkQ) = EP Log . (2.12)
fQ
In particular, when P Q we have
dP
D(PkQ) = EP log . (2.13)
dQ
i i
i i
i i
22
Remark 2.1. Note that ignoring the issue of dividing by or taking a log of 0, the proof of (2.12)
dR
is just the simple identity log dQ dRdP
= log dQdP = log dQdP
− log dR
dP
. What permits us to handle zeros
is the Log function, which satisfies several natural properties of the log: for every a, b ∈ R+
a b
Log = −Log
b a
and for every c > 0 we have
a a c ac
Log = Log + Log = Log − log(c)
b c b b
except for the case a = b = 0.
Proof. First, suppose D(PkQ) = ∞ and D(PkR) < ∞. Then P[fR (Y) = 0] = 0, and hence in
computation of the expectation in (2.11) only the second part of convention (2.10) can possibly
apply. Since also fP > 0 P-almost surely, we have
fR fR fP
Log = Log + Log , (2.14)
fQ fP fQ
with both logarithms evaluated according to (2.10). Taking expectation over P we see that the
first term, equal to −D(PkR), is finite, whereas the second term is infinite. Thus, the expectation
in (2.11) is well-defined and equal to +∞, as is the LHS of (2.11).
Now consider D(PkQ) < ∞. This implies that P[fQ (Y) = 0] = 0 and this time in (2.11) only
the first part of convention (2.10) can apply. Thus, again we have identity (2.14). Since the P-
expectation of the second term is finite, and of the first term non-negative, we again conclude that
expectation in (2.11) is well-defined, equals the LHS of (2.11) (and both sides are possibly equal
to −∞).
For the second part, we first show that
fP log e
EP min(Log , 0) ≥ − . (2.15)
fQ e
Let g(x) = min(x log x, 0). It is clear − loge e ≤ g(x) ≤ 0 for all x. Since fP (Y) > 0 for P-almost
all Y, in convention (2.10) only the 10 case is possible, which is excluded by the min(·, 0) from the
expectation in (2.15). Thus, the LHS in (2.15) equals
Z Z
f P ( y) f P ( y) f P ( y)
fP (y) log dμ = f Q ( y) log dμ
{fP >fQ >0} f Q ( y ) {fP >fQ >0} f Q ( y ) f Q ( y)
Z
f P ( y)
= f Q ( y) g dμ
{fQ >0} f Q ( y)
log e
≥− .
e
h i h i
Since the negative part of EP Log ffQP is bounded, the expectation EP Log ffQP is well-defined. If
P[fQ = 0] > 0 then it is clearly +∞, as is D(PkQ) (since P 6 Q). Otherwise, let E = {fP >
i i
i i
i i
From here, we notice that Q[fQ > 0] = 1 and on {fP = 0, fQ > 0} we have φ( ffQP ) = 0. Thus, the
term 1E can be dropped and we obtain the desired (2.12).
The final statement of the Lemma follows from taking μ = Q and noticing that P-almost surely
we have
dP
dQ dP
Log = log .
1 dQ
In particular, if X has probability density function (pdf) p, then h(X) = E log p(1X) ; otherwise
h(X) = −∞. The conditional differential entropy is h(X|Y) ≜ E log pX|Y (1X|Y) where pX|Y is a
conditional pdf.
1 n c −(−1)n n
For an example, consider a piecewise-constant pdf taking value e(−1) n on the n-th interval of width ∆n = n2
e .
i i
i i
i i
24
for sums of independent random variables, for integer-valued X and Y, H(X + Y) is finite whenever
H(X) and H(Y) are, because H(X + Y) ≤ H(X, Y) = H(X) + H(Y). This again fails for differential
entropy. In fact, there exists real-valued X with finite h(X) such that h(X + Y) = ∞ for any
independent Y such that h(Y) > −∞; there also exist X and Y with finite differential entropy such
that h(X + Y) does not exist (cf. [41, Section V]).
Nevertheless, differential entropy shares many functional properties with the usual Shannon
entropy. For a short application to Euclidean geometry see Section 8.4.
Theorem 2.6 (Properties of differential entropy). Assume that all differential entropies appearing
below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
X
n
h( X n ) = h(Xk |Xk−1 ) .
k=1
Proof. Parts (a), (c), and (d) follow from the similar argument in the proof (b), (d), and (g) of
Theorem 1.4. Part (b) is by a change of variable in the density. Finall, (e) and (f) are analogous to
Theorems 1.6 and 1.7.
Interestingly, the first property is robust to small additive perturbations, cf. Ex. I.6. Regard-
ing maximizing entropy under quadratic constraints, we have the following characterization of
Gaussians.
Theorem 2.7. Let Cov(X) = E[XX⊤ ] − E[X]E[X]⊤ denote the covariance matrix of a random
vector X. For any d × d positive definite matrix Σ,
1
max h(X) = h(N(0, Σ)) = log((2πe)d det Σ) (2.19)
PX :Cov(X)⪯Σ 2
i i
i i
i i
Proof. To show (2.19), without loss of generality, assume that E[X] = 0. By comparing to
Gaussian, we have
where in the last step we apply E[X⊤ Σ−1 X] = Tr(E[XX⊤ ]Σ−1 ) ≤ Tr(I) due to the constraint
Cov(X) Σ and the formula (2.18). The inequality (2.20) follows analogously by choosing the
reference measure to be N(0, ad Id ).
Finally, let us mention a connection between the differential entropy and the Shannon entropy.
Let X be a continuous random vector in Rd . Denote its discretized version by Xm = m1 bmXc
for m ∈ N, where b·c is taken componentwise. Rényi showed that [261, Theorem 1] provided
H(bXc) < ∞ and h(X) is defined, we have
To interpret this result, consider, for simplicity, d = 1, m = 2k and assume that X takes values in
the unit interval, in which case X2k is the k-bit uniform quantization of X. Then (2.21) suggests
that for large k, the quantized bits behave as independent fair coin flips. The underlying reason is
that for “nice” density functions, the restriction to small intervals is approximately uniform. For
more on quantization see Section 24.1 (notably Section 24.1.5) in Chapter 24.
The kernel K can be viewed as a random transformation acting from X to Y , which draws
Y from a distribution depending on the realization of X, including deterministic transformations
PY|X
as special cases. For this reason, we write PY|X : X → Y and also X −−→ Y. In information-
theoretic context, we also refer to PY|X as a channel, where X and Y are the channel input and
output respectively. There are two ways of obtaining Markov kernels. The first way is defining
them explicitly. Here are some examples of that:
i i
i i
i i
26
Note that above we have implicitly used the facts that the slices Ex of E are measurable subsets
of Y for each x and that the function x 7→ K(Ex |x) is measurable (cf. [59, Chapter I, Prop. 6.8 and
6.9], respectively). We also notice that one joint distribution PX,Y can have many different versions
of PY|X differing on a measure-zero set of x’s.
The operation of combining an input distribution on X and a kernel K : X → Y as we did
in (2.22) is going to appear extensively in this book. We will usually denote it as multiplication:
Given PX and kernel PY|X we can multiply them to obtain PX,Y ≜ PX PY|X , which in the discrete
case simply means that the joint PMF factorizes as product of marginal and conditional PMFs:
To denote this (linear) relation between the input PX and the output PY we sometimes also write
PY|X
PX −−→ PY .
We must remark that technical assumptions such as restricting to standard Borel spaces are
really necessary for constructing any sensible theory of distintegration/conditioning and multi-
plication. To emphasize this point we consider a (cautionary!) example involving a pathological
measurable space Y .
i i
i i
i i
Example 2.5 (X ⊥⊥ Y but PY|X=x 6 PY for all x). Consider X a unit interval with Borel σ -algebra
and Y a unit interval with the σ -algebra σY consisting of all sets which are either countable or
have a countable complement. Clearly σY is a sub-σ -algebra of Borel one. We define the following
kernel K : X → Y :
K(A|x) ≜ 1{x ∈ A} .
This is simply saying that Y is produced from X by setting Y = X. It should be clear that for
every A ∈ σY the map x 7→ K(A|x) is measurable, and thus K is a valid Markov kernel. Letting
X ∼ Unif(0, 1) and using formula (2.22) we can define a joint distribution PX,Y . But what is the
conditional distribution PY|X ? On one hand, clearly we can set PY|X (A|x) = K(A|x), since this
was how PX,Y was constructed. On the other hand, we will show that PX,Y = PX PY , i.e. X ⊥ ⊥ Y
and X = Y at the same time! Indeed, consider any set E = B × C ⊂ X × Y . We always have
PX,Y [B × C] = PX [B ∩ C]. Thus if C is countable then PX,Y [E] = 0 and so is PX PY [E] = 0. On the
other hand, if Cc is countable then PX [C] = PY [C] = 1 and PX,Y [E] = PX PY [E] again. Thus, both
PY|X = K and PY|X = PY are valid conditional distributions. But notice that since PY [{x}] = 0, we
have K(·|x) 6 PY for every x ∈ X . In particular, the value of D(PY|X=x kPY ) can either be 0 or
+∞ for every x depending on the choice of the version of PY|X . It is, thus, advisable to stay within
the realm of standard Borel spaces.
We will also need to use the following result extensively. We remind that a σ -algebra is called
separable if it is generated by a countable collection of sets. Any standard Borel space’s σ -algebra
is separable. The following is another useful result about Markov kernels, cf. [59, Chapter 5,
Theorem 4.44]:
dPY|X=x
The meaning of this theorem is that the Radon-Nikodym derivative dRY|X=x can be made jointly
measurable with respect to (x, y).
i i
i i
i i
28
In order to extend the above definition to more general X , we need to first understand whether
the map x 7→ D(PY|X=x kQY|X=x ) is even measurable.
Lemma 2.11. Suppose that Y is standard Borel. The set A0 ≜ {x : PY|X=x QY|X=x } and the
function
x 7→ D(PY|X=x kQY|X=x )
dPY|X=x dQY|X=x
Proof. Take RY|X = 1
2 PY|X + 12 QY|X and define fP (y|x) ≜ dRY|X=x (y) and fQ (y|x) ≜ dRY|X=x (y).
By Theorem 2.10 these can be chosen to be jointly measurable on X × Y . Let us define B0 ≜
{(x, y) : fP (y|x) > 0, fQ (y|x) = 0} and its slice Bx0 = {y : (x, y) ∈ B0 }. Then note that PY|X=x
QY|X=x iff RY|X=x [Bx0 ] = 0. In other words, x ∈ A0 iff RY|X=x [Bx0 ] = 0. The measurability of B0
implies that of x 7→ RY|X=x [Bx0 ] and thus that of A0 . Finally, from (2.12) we get that
f P ( Y | x)
D(PY|X=x kQY|X=x ) = EY∼PY|X=x Log , (2.23)
f Q ( Y | x)
Theorem 2.13 (Chain rule). For any pair of measures PX,Y and QX,Y we have
regardless of the versions of conditional distributions PY|X and QY|X one chooses.
Proof. First, let us consider the simplest case: X , Y are discrete and QX,Y (x, y) > 0 for all x, y.
Letting (X, Y) ∼ PX,Y we get
PX,Y (X, Y) PX (X)PY|X (Y|X)
D(PX,Y kQX,Y ) = E log = E log
QX,Y (X, Y) QX (X)QY|X (Y|X)
PY|X (Y|X) PX (X)
= E log + E log
QY|X (Y|X) QX (X)
i i
i i
i i
Next, let us address the general case. If PX 6 QX then PX,Y 6 QX,Y and both sides of (2.24) are
infinity. Thus, we assume PX QX and set λP (x) ≜ dQ dPX
X
(x). Define fP (y|x), fQ (y|x) and RY|X as in
the proof of Lemma 2.11. Then we have PX,Y , QX,Y RX,Y ≜ QX RY|X , and for any measurable E
Z Z
PX,Y [E] = λP (x)fP (y|x)RX,Y (dx dy) , QX,Y [E] = fQ (y|x)RX,Y (dx dy) .
E E
The chain rule has a number of useful corollaries, which we summarize below.
Theorem 2.14 (Properties of Divergence). Assume that X and Y are standard Borel. Then
i i
i i
i i
30
X
n
≥ D(PXi kQXi ), (2.26)
i=1
Qn
where the inequality holds with equality if and only if PXn = j=1 PXj .
(d) (Tensorization)
Yn
Y n X n
D PXj
QXj = D(PXj kQXj ).
j=1 j=1 j=1
(e) (Conditioning increases divergence) Given PY|X , QY|X and PX , let PY = PY|X ◦ PX and QY =
QY|X ◦ PX , as represented by the diagram:
PY |X PY
PX
QY |X QY
Then D(PY kQY ) ≤ D(PY|X kQY|X |PX ), with equality iff D(PX|Y kQX|Y |PY ) = 0.
We remark that as before without the standard Borel assumption even the first property can
fail. For example, Example 2.5 shows an example where PX PY|X = PX QY|X but PY|X 6= QY|X and
D(PY|X kQY|X |PX ) = ∞.
Proof. (a) This follows from the chain rule (2.24) since PX = QX .
(b) Apply (2.24), with X and Y interchanged and use the fact that conditional divergence is non-
negative.
Qn Qn
(c) By telescoping PXn = i=1 PXi |Xi−1 and QXn = i=1 QXi |Xi−1 .
(d) Apply (c).
(e) The inequality follows from (a) and (b). To get conditions for equality, notice that by the chain
rule for D:
• There is a nice interpretation of the full chain rule as a decomposition of the “distance” from
PXn to QXn as a sum of “distances” between intermediate distributions, cf. Ex. I.33.
• In general, D(PX,Y kQX,Y ) and D(PX kQX ) + D(PY kQY ) are incomparable. For example, if X = Y
under P and Q, then D(PX,Y kQX,Y ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PX,Y 6= QX,Y we have D(PX,Y kQX,Y ) > 0 = D(PX kQX ) + D(PY kQY ).
i i
i i
i i
The following result, known as the Data-Processing Inequality (DPI), is an important principle
in all of information theory. In many ways, it underpins the whole concept of information. The
intuitive interpretation is that it is easier to distinguish two distributions using clean (resp. full) data
as opposed to noisy (resp. partial) data. DPI is a recurring theme in this book, and later we will
study DPI for other information measures such as those for mutual information and f-divergences.
Theorem 2.15 (DPI for KL divergence). Let PY = PY|X ◦ PX and QY = PY|X ◦ QX , as represented
by the diagram:
PX PY
PY|X
QX QY
Then
Note that D(Pf(X) kQf(X) ) = D(PX kQX ) does not imply that f is one-to-one; as an example,
consider PX = Gaussian, QX = Laplace, Y = |X|. In fact, the equality happens precisely when
f(X) is a sufficient statistic for testing P against Q; in other words, there is no loss of information
in summarizing X into f(X) as far as testing these two hypotheses is concerned. See Example 3.8
for details.
A particular useful application of Corollary 2.16 is when we take f to be an indicator function:
i i
i i
i i
32
Proof.
1 1
D(λP + λ̄QkQ) = EQ (λf + λ̄) log(λf + λ̄)
λ λ
dP
where f = . As λ → 0 the function under expectation decreases to (f − 1) log e monotonically.
dQ
Indeed, the function
λ 7→ g(λ) ≜ (λf + λ̄) log(λf + λ̄)
g(λ)
is convex and equals zero at λ = 0. Thus λ is increasing in λ. Moreover, by the convexity of
x 7→ x log x:
1 1
(λf + λ)(log(λf + λ)) ≤ (λf log f + λ1 log 1) = f log f
λ λ
i i
i i
i i
and by assumption f log f is Q-integrable. Thus the Monotone Convergence Theorem applies.
To prove (2.28) first notice that if P 6 Q then there is a set E with p = P[E] > 0 = Q[E].
Applying data-processing for divergence to X 7→ 1E (X), we get
1
D(QkλP + λ̄Q) ≥ d(0kλp) = log
1 − λp
λ 7→ D(λP + λ̄QkQ) ,
Our second result about the local behavior of KL-divergence is the following (see Section 7.10
for generalizations):
i i
i i
i i
34
i i
i i
i i
where μ is some common dominating measure (e.g. Lebesgue or counting measure). If for each
fixed x, the density pθ (x) depends smoothly on θ, one can define the Fisher information matrix
with respect to the parameter θ as
JF (θ) ≜ Eθ VV⊤ , V ≜ ∇θ ln pθ (X) , (2.31)
E θ [ V] = 0 (2.32)
JF (θ) = cov(V)
θ
Z p p
= 4 μ(dx)(∇θ pθ (x))(∇θ pθ (x))⊤
where the last identity is obtained by differentiating (2.32) with respect to each θj .
The significance of Fisher information matrix arises from the fact that it gauges the local
behaviour of divergence for smooth parametric families. Namely, we have (again under suitable
technical conditions):2
log e ⊤
D(Pθ0 kPθ0 +ξ ) = ξ JF (θ0 )ξ + o(kξk2 ) , (2.33)
2
which is obtained by integrating the Taylor expansion:
1
ln pθ0 +ξ (x) = ln pθ0 (x) + ξ ⊤ ∇θ ln pθ0 (x) + ξ ⊤ Hessθ (ln pθ0 (x))ξ + o(kξk2 ) .
2
We will establish this fact rigorously later in Section 7.11. Property (2.33) is of paramount impor-
tance in statistics. We should remember it as: Divergence is locally quadratic on the parameter
space, with Hessian given by the Fisher information matrix. Note that for the Gaussian location
model Pθ = N (θ, Σ), (2.33) is in fact exact with JF (θ) ≡ Σ−1 – cf. Example 2.2.
As another example, note that Proposition 2.19 is a special case of (2.33) by considering Pλ =
λ̄Q + λP parametrized by λ ∈ [0, 1]. In this case, the Fisher information at λ = 0 is simply
χ2 (PkQ). Nevertheless, Proposition 2.19 is completely general while the asymptotic expansion
(2.33) is not without regularity conditions (see Section 7.11).
Remark 2.3. Some useful properties of Fisher information are as follows:
2
To illustrate the subtlety here, consider a scalar location family, i.e. pθ (x) = f0 (x − θ) for some density f0 . In this case
∫ (f′0 )2
Fisher information JF (θ0 ) = f0
does not depend on θ0 and is well-defined even for compactly supported f0 ,
provided f′0 vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (2.33) is infinite for any
ξ > 0. In such cases, a better interpretation for Fisher information is as the coefficient of the expansion
ξ2
D(Pθ0 k 12 Pθ0 + 12 Pθ0 +ξ ) = J (θ )
8 F 0
+ o(ξ 2 ). We will discuss this in more detail in Section 7.11.
i i
i i
i i
36
Example 2.6. Let Pθ = (θ0 , . . . , θd ) be a probability distribution on the finite alphabet {0, . . . , d}.
Pd
We will take θ = (θ1 , . . . , θd ) as the free parameter and set θ0 = 1 − i=1 θi . So all derivatives
are with respect to θ1 , . . . , θd only. Then we have
(
θi , i = 1, . . . , d
pθ (i) = Pd
1 − i=1 θi , i = 0
and for Fisher information matrix we get
1 1 1
JF (θ) = diag ,..., + Pd 11⊤ , (2.36)
θ1 θd 1 − i=1 θi
where 1 is the d × 1 vector of all ones. For future references (see Sections 29.4 and 13.4*), we also
compute the inverse and determinant of JF (θ). By the matrix inversion lemma (A + UCV)−1 =
A−1 − A−1 U(C−1 + VA−1 U)−1 VA−1 , we have
J− ⊤
F (θ) = diag(θ) − θθ .
1
(2.37)
For the determinant, notice that det(A + xy⊤ ) = det A · det(I + A−1 xy⊤ ) = det A · (1 + y⊤ A−1 x),
where we used the identity det(I + AB) = det(I + BA). Thus, we have
Y
d
1
det JF (θ) = . (2.38)
θi
i=0
i i
i i
i i
3 Mutual information
After technical preparations in previous chapters we define perhaps the most famous concept in
the entire field of information theory, the mutual information. It was originally defined by Shan-
non, although the name was coined later by Robert Fano1 It has two equivalent expressions (as a
KL divergence and as difference of entropies), both having its merits. In this chapter, we prove
first properties of mutual information (non-negativity, chain rule and the data-processing inequal-
ity). While defining conditional information, we also introduce the language of directed graphical
models, and connect the equality case in the data-processing inequality with Fisher’s concept of
sufficient statistics.
Definition 3.1 (Mutual information). For a pair of random variables X and Y we define
The intuitive interpretation of mutual information is that I(X; Y) measures the dependency
between X and Y by comparing their joint distribution to the product of the marginals in the KL
divergence, which, as we show next, is also equivalent to comparing the conditional distribution
to the unconditional.
The way we defined I(X; Y) it is a functional of the joint distribution PX,Y . However, it is also
rather fruitful to look at it as a functional of the pair (PX , PY|X ) – more on this in Section 5.1.
In general, the divergence D(PX,Y kPX PY ) should be evaluated using the general definition (2.4).
Note that PX,Y PX PY need not always hold. Let us consider the following examples, though.
1
Professor of electrical engineering at MIT, who developed the first course on information theory and as part of it
formalized and rigorized much of Shannon’s ideas. Most famously, he showed the “converse part” of the noisy channel
coding theorem, see Section 17.4.
37
i i
i i
i i
38
Example 3.1. If X = Y ∼ N(0, 1) then PX,Y 6 PX PY and I(X; Y) = ∞. This reflects our intuition
that X contains an “infinite” amount of information requiring infinitely many bits to describe. On
the other hand, if even one of X or Y is discrete, then we always have PX,Y PX PY . Indeed,
consider any E ⊂ X × Y measurable in the product sigma algebra with PX,Y (E) > 0. Since
P
x∈S P[(X, Y) ∈ S, X = x], there exists some x0 ∈ S such that PY (E ) ≥ P[X =
x0
PX,Y (E) =
x0 , Y ∈ E ] > 0, where E ≜ {y : (x0 , y) ∈ E} is a section of E (measurable for every x0 ). But
x0 x0
then PX PY (E) ≥ PX PY ({x0 } × Ex0 ) = PX ({x0 })PY (Ex0 ) > 0, implying that PX,Y PX PY .
It is clear that the two sides correspond to the two mutual informations. For bijective f, simply
apply the inequality to f and f−1 .
(e) Apply (d) with f(X1 , X2 ) = X1 .
P P
Proof. (a) I(X; Y) = E log PPXXP,YY = E log PYY|X = E log PXX|Y .
(b) Apply data-processing inequality twice to the map (x, y) → (y, x) to get D(PX,Y kPX PY ) =
D(PYX kPY PX ).
(c) By definition and Theorem 2.3.
i i
i i
i i
(d) We will use the data-processing inequality of mutual information (to be proved shortly in
Theorem 3.7(c)). For bijective f, consider the chain of data processing: (x, y) 7→ (f(x), y) 7→
(f−1 (f(x)), y). Then I(X; Y) ≥ I(f(X); Y) ≥ I(f−1 (f(X)); Y) = I(X; Y).
(e) Apply (d) with f(X1 , X2 ) = X1 .
Of the results above, the one we will use the most is (3.1). Note that it implies that
D(PX,Y kPX PY ) < ∞ if and only if
x 7→ D(PY|X=x kPY )
Proof. Suppose PX,Y PX PY . We need to prove that any version of the conditional probability
satisfies PY|X=x PY for almost every x. Note, however, that if we prove this for some version P̃Y|X
then the statement for any version follows, since PY|X=x = P̃Y|X=x for PX -a.e. x. (This measure-
theoretic fact can be derived from the chain rule (2.24): since PX P̃Y|X = PX,Y = PX PY|X we must
have 0 = D(PX,Y kPX,Y ) = D(P̃Y|X kPY|X |PX ) = Ex∼PX [D(P̃Y|X=x kPY|X=x )], implying the stated
dPX,Y R
fact.) So let g(x, y) = dP X PY
(x, y) and ρ(x) ≜ Y g(x, y)PY (dy). Fix any set E ⊂ X and notice
Z Z
PX [E] = 1E (x)g(x, y)PX (dx) PY (dy) = 1E (x)ρ(x)PX (dx) .
X ×Y X
R
On the other hand, we also have PX [E] = 1E dPX , which implies ρ(x) = 1 for PX -a.e. x. Now
define
(
g(x, y)PY (dy), ρ(x) = 1
P̃Y|X (dy|x) =
PY (dy), ρ(x) 6= 1 .
Directly plugging P̃Y|X into (2.22) shows that P̃Y|X does define a valid version of the conditional
probability of Y given X. Since by construction P̃Y|X=x PY for every x, the result follows.
Conversely, let PY|X be a kernel such that PX [E] = 1, where E = {x : PY|X=x PY } (recall that
E is measurable by Lemma 2.11). Define P̃Y|X=x = PY|X=x if x ∈ E and P̃Y|X=x = PY , otherwise.
By construction PX P̃Y|X = PX PY|X = PX,Y and P̃Y|X=x PY for every x. Thus, by Theorem 2.10
there exists a jointly measurable f(y|x) such that
i i
i i
i i
40
Theorem 3.4.
(
H(X) X discrete
(a) I(X; X) =
+∞ otherwise.
(b) If X is discrete, then
I(X; Y) + H(X|Y) = H(X) . (3.2)
Consequently, either H(X|Y) = H(X) = ∞,2 or H(X|Y) < ∞ and
I(X; Y) = H(X) − H(X|Y). (3.3)
(d) Similarly, if X, Y are real-valued random vectors with a joint PDF, then
I(X; Y) = h(X) + h(Y) − h(X, Y)
provided that h(X, Y) < ∞. If X has a marginal PDF pX and a conditional PDF pX|Y (x|y),
then
I(X; Y) = h(X) − h(X|Y) ,
provided h(X|Y) < ∞.
(e) If X or Y are discrete then I(X; Y) ≤ min (H(X), H(Y)), with equality iff H(X|Y) = 0 or
H(Y|X) = 0, or, equivalently, iff one is a deterministic function of the other.
Proof. (a) By Theorem 3.2.(a), I(X; X) = D(PX|X kPX |PX ) = Ex∼X D(δx kPX ). If PX is discrete,
then D(δx kPX ) = log PX1(x) and I(X; X) = H(X). If PX is not discrete, let A = {x : PX (x) > 0}
denote the set of atoms of PX . Let ∆ = {(x, x) : x 6∈ A} ⊂ X × X . (∆ is measurable since it’s
2
This is indeed possible if one takes Y = 0 (constant) and X from Example 1.3, demonstrating that (3.3) does not always
hold.
i i
i i
i i
the intersection of Ac × Ac with the diagonal {(x, x) : x ∈ X }.) Then PX,X (∆) = PX (Ac ) > 0
but since
Z Z
(PX × PX )(E) ≜ PX (dx1 ) PX (dx2 )1{(x1 , x2 ) ∈ E}
X X
we have by taking E = ∆ that (PX × PX )(∆) = 0. Thus PX,X 6 PX × PX and thus by definition
I(X; X) = D(PX,X kPX PX ) = +∞ .
(b) Since X is discrete there exists a countable set S such that P[X ∈ S] = 1, and for any x0 ∈ S we
have P[X = x0 ] > 0. Let λ be a counting measure on S and let μ = λ×PY , so that PX PY μ. As
shown in Example 3.1 we also have PX,Y μ. Furthermore, fP (x, y) ≜ dPdμX,Y (x, y) = pX|Y (x|y),
where the latter denotes conditional pmf of X given Y (which is a proper pmf for almost every
y, since P[X ∈ S|Y = y] = 1 for a.e. y). We also have fQ (x, y) = dPdμ
X PY
(x, y) = dP
dλ (x) = pX (x),
X
Note that PX,Y -almost surely both pX|Y (X|Y) > 0 and PX (x) > 0, so we can replace Log with
log in the above. On the other hand,
X 1
H(X|Y) = Ey∼PY pX|Y (x|y) log .
pX|Y (x|y)
x∈ S
From (3.2) we deduce the following result, which was previously shown in Theorem 1.4(d).
Corollary 3.5 (Conditioning reduces entropy). For discrete X, H(X|Y) ≤ H(X), with equality iff
X⊥
⊥ Y.
i i
i i
i i
42
H(X, Y )
H(Y ) H(X)
As an example, we have
H(X1 |X2 , X3 ) = μ(E1 \ (E2 ∪ E3 )) , (3.6)
I(X1 , X2 ; X3 |X4 ) = μ(((E1 ∪ E2 ) ∩ E3 ) \ E4 ) . (3.7)
By inclusion-exclusion, the quantity in (3.5) corresponds to μ(E1 ∩ E2 ∩ E3 ), which explains why
μ is not necessarily a positive measure. For an extensive discussion, see [80, Chapter 1.3].
i i
i i
i i
I(X; Y )
ρ
-1 0 1
show (3.8), by shifting and scaling if necessary, we can assume without loss of generality that
EX = EY = 0 and EX2 = EY2 = 1. Then ρ = EXY. By joint Gaussianity, Y = ρX + Z for some
Z ∼ N ( 0, 1 − ρ 2 ) ⊥
⊥ X. Then using the divergence formula for Gaussians (2.7), we get
I(X; Y) = D(PY|X kPY |PX )
= ED(N (ρX, 1 − ρ2 )kN (0, 1))
1 1 log e
=E log + (ρX) + 1 − ρ − 1
2 2
2 1 − ρ2 2
1 1
= log .
2 1 − ρ2
Alternatively, we can use the differential entropy representation in Theorem 3.4(d) and the entropy
formula (2.17) for Gaussians:
I(X; Y) = h(Y) − h(Y|X)
= h( Y ) − h( Z )
1 1 1 1
= log(2πe) − log(2πe(1 − ρ2 )) = log .
2 2 2 1 − ρ2
where the second equality follows h(Y|X) = h(Y − X|X) = h(Z|X) = h(Z) applying the shift-
invariance of h and the independence between X and Z.
Similar to the role of mutual information, the correlation coefficient also measures the depen-
dency between random variables which are real-valued (more generally, on an inner-product
space) in a certain sense. In contrast, mutual information is invariant to bijections and thus more
general: it can be defined not just for numerical but for arbitrary random variables.
i i
i i
i i
44
X + Y
Then
1 σ2
I(X; Y) = log 1 + X2 ,
2 σN
σX2
where σN2
is frequently referred to as the signal-to-noise ratio (SNR).
1 det ΣX det ΣY
I(X; Y) = log
2 det Σ[X,Y]
where ΣX ≜ E (X − EX)(X − EX)⊤ denotes the covariance matrix of X ∈ Rm , and Σ[X,Y]
denotes the covariance matrix of the random vector [X, Y] ∈ Rm+n .
In the special case of additive noise: Y = X + N for N ⊥
⊥ X, we have
1 det(ΣX + ΣN )
I(X; X + N) = log
2 det ΣN
ΣX ΣX
why?
since det Σ[X,X+N] = det ΣX ΣX +ΣN = det ΣX det ΣN .
Example 3.5 (Binary symmetric channel). Recall the setting in Example 1.4(1). Let X ∼ Ber( 21 )
and N ∼ Ber(δ) be independent. Let Y = X ⊕ N; or equivalently, Y is obtained by flipping X with
probability δ .
N
1− δ
0 0
X δ Y X + Y
1 1
1− δ
The channel PY|X , called the binary symmetric channel with parameter δ and denoted by BSCδ ,
will be encountered frequently in this book.
i i
i i
i i
Example 3.6 (Addition over finite groups). Generalizing Example 3.5, let X and Z take values on
a finite group G. If X is uniform on G and independent of Z, then
where the product PX|Z PY|Z is a conditional distribution such that (PX|Z PY|Z )(A × B|z) =
PX|Z (A|z)PY|Z (B|z), under which X and Y are independent conditioned on Z.
Denoting I(X; Y) as a functional I(PX,Y ) of the joint distribution PX,Y , we have I(X; Y|Z) =
Ez∼PZ [I(PX,Y|Z=z )]. As such, I(X; Y|Z) is a linear functional in PZ . Measurability of the map z 7→
I(PX,Y|Z=z ) is not obvious, but follows from Lemma 2.11.
To further discuss properties of the conditional mutual information, let us first introduce the
notation for conditional independence. A family of joint distributions can be represented by a
directed acyclic graph encoding the dependency structure of the underlying random variables. A
simple example is a Markov chain (path graph) X → Y → Z, which represents distributions that
factor as {PX,Y,Z : PX,Y,Z = PX PY|X PZ|Y }. We have the following equivalent descriptions:
Theorem 3.7 (Further properties of mutual information). Suppose that all random variables are
valued in standard Borel spaces. Then:
i i
i i
i i
46
(f) (Permutation invariance) If f and g are one-to-one (with measurable inverses), then
I(f(X); g(Y)) = I(X; Y).
On the other hand, from the chain rule for D, (2.24), we have
(b)
D(PX,Y,Z kPX,Y PZ ) = D(PX,Z kPX PZ ) + D(PY|X,Z kPY|X |PX,Z ) ,
where in the second term we noticed that conditioning on X, Z under the measure PX,Y PZ
results in PY|X (independent of Z). Putting (a) and (b) together completes the proof.
(c) Apply Kolmogorov identity to I(Y, Z; X):
Remark 3.2. In general, I(X; Y|Z) and I(X; Y) are incomparable. Indeed, consider the following
examples:
3
Also known as “Kolmogorov identities”.
i i
i i
i i
• I(X; Y|Z) > I(X; Y): We need to find an example of X, Y, Z, which do not form a Markov chain.
To that end notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z,
i.i.d.
namely X → Y ← Z. With this idea in mind, we construct X, Z ∼ Bern( 12 ) and Y = X ⊕ Z. Then
I(X; Y) = 0 since X ⊥⊥ Y; however, I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X) = 1 bit.
• I(X; Y|Z) < I(X; Y): Simply take X, Y, Z to be any random variables on finite alphabets and
Z = Y. Then I(X; Y|Z) = I(X; Y|Y) = H(Y|Y) − H(Y|X, Y) = 0 by a conditional version of (3.3).
Remark 3.3 (Chain rule for I ⇒ Chain rule for H). Set Y = Xn . Then H(Xn ) = I(Xn ; Xn ) =
Pn n k− 1
Pn
k=1 I(Xk ; X |X ) = k=1 H(Xk |Xk−1 ), since H(Xk |Xn , Xk−1 ) = 0.
Remark 3.4 (DPI for divergence =⇒ DPI for mutual information). We proved DPI for mutual
information in Theorem 3.7 using Kolmogorov’s identity. In fact, DPI for mutual information is
implied by that for divergence in Theorem 2.15:
P Z| Y PZ|Y
where note that for each x, we have PY|X=x −−→ PZ|X=x and PY −−→ PZ . Therefore if we have a
bi-variate functional of distributions D(PkQ) which satisfies DPI, then we can define a “mutual
information-like” quantity via ID (X; Y) ≜ D(PY|X kPY |PX ) ≜ Ex∼PX D(PY|X=x kPY ) which will
satisfy DPI on Markov chains. A rich class of examples arises by taking D = Df (an f-divergence
– see Chapter 7).
Remark 3.5 (Strong data-processing inequalities). For many channels PY|X , it is possible to
strengthen the data-processing inequality (2.27) as follows: For any PX , QX we have
where ηKL < 1 and depends on the channel PY|X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information in Theorem 3.7(c): For any PU,X we have
For example, for PY|X = BSCδ we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
(SDPIs) quantify the intuitive observation that noise intrinsict in the channel PY|X must reduce the
information that Y carries about the data U, regardless of how we optimize the encoding U 7→ X.
We explore SDPI further in Chapter 33 as well as their ramifications in statistics.
In addition to the case of strict inequality in DPI, the case of equality is also worth taking a closer
look. If U → X → Y and I(U; X) = I(U; Y), intuitively it means that, as far as U is concerned,
there is no loss of information in summarizing X into Y. In statistical parlance, we say that Y is a
sufficient statistic of X for U. This is the topic for the next section.
i i
i i
i i
48
Definition 3.8 (Sufficient statistic). We say that T is a sufficient statistic of X for θ if there exists a
transition probability kernel PX|T so that PθX PT|X = PθT PX|T , i.e., PX|T can be chosen to not depend
on θ.
The intuitive interpretation of T being sufficient is that, with T at hand, one can ignore X; in
other words, T contains all the relevant information to infer about θ. This is because X can be
simulated on the sole basis of T without knowing θ. As such, X provides no extra information
for identification of θ. Any one-to-one transformation of X is sufficient, however, this is not the
interesting case. In the interesting cases dimensionality of T will be much smaller (typically equal
to that of θ) than that of X. See examples below.
Observe also that the parameter θ need not be a random variable, as Definition 3.8 does not
involve any distribution (prior) on θ. This is a so-called frequentist point of view on the problem
of parameter estimation.
Theorem 3.9. Let θ, X, T be as in the setting above. Then the following are equivalent
Proof. We omit the details, which amount to either restating the conditions in terms of condi-
tional independence, or invoking equality cases in the properties stated in Theorem 3.7.
Theorem 3.10 (Fisher’s factorization theorem). For all θ ∈ Θ, let PθX have a density pθ with
respect to a common dominating measure μ. Let T = T(X) be a deterministic function of X. Then
T is a sufficient statistic of X for θ iff
pθ (x) = gθ (T(x))h(x)
i i
i i
i i
Proof. We only give the proof in the discrete case where pθ represents the PMF. (The argument
P R
for the general case is similar replacing by dμ). Let t = T(x).
“⇒”: Suppose T is a sufficient statistic of X for θ. Then pθ (x) = Pθ (X = x) = Pθ (X = x, T =
t) = Pθ (X = x|T = t)Pθ (T = t) = P(X = x|T = T(x)) Pθ (T = T(x))
| {z }| {z }
h ( x) gθ (T(x))
“⇐”: Suppose the factorization holds. Then
p θ ( x) gθ (t)h(x) h ( x)
Pθ (X = x|T = t) = P =P =P ,
x 1{T(x)=t} pθ (x) x 1{T(x)=t} gθ (t)h(x) x 1{T(x)=t} h(x)
free of θ.
Example 3.7 (Independent observations). In the following examples, a parametrized distribution
generates an independent sample of size n, which can be summarized into a scalar-valued sufficient
statistic. These can be verified by checking the factorization of the n-fold product distribution and
applying Theorem 3.10.
i.i.d.
• Normal mean model. Let θ ∈ R and observations X1 , . . . , Xn ∼ N (θ, 1). Then the sample mean
Pn
X̄ = 1n j=1 Xj is a sufficient statistic of Xn for θ.
i.i.d. Pn
• Coin flips. Let Bi ∼ Ber(θ). Then i=1 Bi is a sufficient statistic of Bn for θ.
i.i.d.
• Uniform distribution. Let Ui ∼ Unif(0, θ). Then maxi∈[n] Ui is a sufficient statistic of Un for θ.
Example 3.8 (Sufficient statistic for hypothesis testing). Let Θ = {0, 1}. Given θ = 0 or 1,
X ∼ PX or QX , respectively. Then Y – the output of PY|X – is a sufficient statistic of X for θ iff
D(PX|Y kQX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s. Indeed, the latter means that for kernel QX|Y
we have
PX PY|X = PY QX|Y and QX PY|X = QY QX|Y ,
which is precisely the definition of sufficient statistic when θ ∈ {0, 1}. This example explains
the condition for equality in the data-processing for divergence in Theorem 2.15. Then assuming
D(PY kQY ) < ∞ we have:
D(PX kQX ) = D(PY kQY ) ⇐⇒ Y is a sufficient statistic for testing PX vs. QX
Proof. Let QX,Y = QX PY|X , PX,Y = PX PY|X , then
D(PX,Y kQX,Y ) = D(PY|X kQY|X |PX ) +D(PX kQX )
| {z }
=0
= D(PX|Y kQX|Y |PY ) + D(PY kQY )
≥ D(PY kQY )
with equality iff D(PX|Y kQX|Y |PY ) = 0, which is equivalent to Y being a sufficient statistic for
testing PX vs QX as desired.
i i
i i
i i
In this chapter we collect some results on variational characterizations. It is a well known method
in analysis to study a functional by proving a variational characterization of the form F(x) =
supλ∈Λ fλ (x) or F(x) = infμ∈M gμ (x). Such representations can be useful for multiple purposes:
We will see in this chapter that divergence has two different sup characterizations (over partitions
and over functions). The mutual information is even more special. In addition to inheriting the
ones from KL divergence, it possesses two very special ones: an inf over (centroid) measures QY
and a sup over Markov kernels.
As the main applications of these variational characetizations, we will first pursue the topic of
continuity. In fact, we will discuss several types of continuity.
First, is the continuity in discretization. This is related to the issue of computation. For compli-
cated P and Q direct computation of D(PkQ) might be hard. Instead, one may want to discretize the
infinite alphabet and compute numerically the finite sum. Is this procedure stable, i.e., as the quan-
tization becomes finer, does this procedure guarantee to converge to the true value? The answer
is positive and this continuity with respect to discretization is guaranteed by Theorem 4.5.
Second, is the continuity under change of the distribution. For example, this is arises in the
problem of estimating information measures. In many statistical setups, oftentimes we do not
know P or Q, and we estimate the distribution by P̂n using n iid observations sampled from P (in
discrete cases we may set P̂n to be simply the empirical distribution). Does D(P̂n kQ) provide a
good estimator for D(PkQ)? Does D(P̂n kQ) → D(PkQ) if P̂n → P? The answer is delicate – see
Section 4.4.
Third, there is yet another kind of continuity: continuity “in the σ -algebra”. Despite the scary
name, this one is useful even in the most “discrete” situations. For example, imagine that θ ∼
i.i.d.
Unif(0, 1) and Xi ∼ Ber(θ). Suppose that you observe a sequence of Xi ’s until the random moment
τ equal to the first occurence of the pattern 0101. How much information about θ did you learn
by time τ ? We can encode these observations as
(
Xj , j≤τ,
Zj = ,
?, j>τ
50
i i
i i
i i
where ? designates the fact that we don’t know the value of Xj on those times. Then the question
we asked above is to compute I(θ; Z∞ ). We will show in this chapter that
X
∞
I(θ; Z∞ ) = lim I(θ; Zn ) = I(θ; Zn |Zn−1 ) (4.1)
n→∞
n=1
thus reducing computation to evaluating an infinite sum of simpler terms (not involving infinite-
dimensional vectors). Thus, even in this simple question about biased coin flips we have to
understand how to safely work with infinite-dimensional vectors.
Furthermore, it turns out that PY , similar to the center of gravity, minimizes this weighted distance
and thus can be thought as the best approximation for the “center” of the collection of distributions
{PY|X=x : x ∈ X } with weights given by PX . We formalize these results in this section and start
with the proof of a “golden formula”. Its importance is in bridging the two points of view on
mutual information: (4.3) is the difference of (relative) entropies in the style of Shannon, while
retaining applicability to continuous spaces in the style of Fano.
Proof. In the discrete case and ignoring the possibility of dividing by zero, the argument is really
simple. We just need to write
(3.1) PY|X PY|X QY
I(X; Y) = EPX,Y log = EPX,Y log
PY PY QY
P Q P
and then expand log PYY|XQYY = log QY|YX − log QPYY . The argument below is a rigorous implementation
of this idea.
First, notice that by Theorem 2.14(e) we have D(PY|X kQY |PX ) ≥ D(PY kQY ) and thus if
D(PY kQY ) = ∞ then both sides of (4.2) are infinite. Thus, we assume D(PY kQY ) < ∞ and
in particular PY QY . Rewriting LHS of (4.2) via the chain rule (2.24) we see that Theorem
amounts to proving
D(PX,Y kPX QY ) = D(PX,Y kPX PY ) + D(PY kQY ) .
i i
i i
i i
52
The case of D(PX,Y kPX QY ) = D(PX,Y kPX PY ) = ∞ is clear. Thus, we can assume at least one of
these divergences is finite, and, hence, also PX,Y PX QY .
dPY
Let λ(y) = dQ Y
(y). Since λ(Y) > 0 PY -a.s., applying the definition of Log in (2.10), we can
write
λ(Y)
EPY [log λ(Y)] = EPX,Y Log . (4.4)
1
dPX PY
Notice that the same λ(y) is also the density dPX QY
(x, y) of the product measure PX PY with respect
to PX QY . Therefore, the RHS of (4.4) by (2.11) applied with μ = PX QY coincides with
while the LHS of (4.4) by (2.13) equals D(PY kQY ). Thus, we have shown the required
and, consequently,
Remark 4.1. The variational representation (4.5) is useful for upper bounding mutual information
by choosing an appropriate QY . Indeed, often each distribution in the collection PY|X=x is simple,
but their mixture, PY , is very hard to work with. In these cases, choosing a suitable QY in (4.5)
provides a convenient upper bound. As an example, consider the AWGN channel Y = X + Z in
Example 3.3, where Var(X) = σ 2 , Z ∼ N (0, 1). Then, choosing the best possible Gaussian Q and
applying the above bound, we have:
1
I(X; Y) ≤ inf E[D(N (X, 1)kN ( μ, s))] = log(1 + σ 2 ),
μ∈R,s≥0 2
which is tight when X is Gaussian. For more examples and statistical applications, see Chapter 30.
i i
i i
i i
Proof. We only need to use the previous corollary and the chain rule (2.24):
(2.24)
D(PX,Y kQX QY ) = D(PY|X kQY |PX ) + D(PX kQX ) ≥ I(X; Y) .
Interestingly, the point of view in the previous result extends to conditional mutual information
as follows: We have
where the minimization is over all QX,Y,Z = QX QY|X QZ|Y , cf. Section 3.4. Showing this character-
ization is very similar to the previous theorem. By repeating the same argument as in (4.2) we
get
≥ I ( X ; Z| Y) .
Characterization (4.6) can be understood as follows. The most general graphical model for the
triplet (X, Y, Z) is a 3-clique (triangle).
Y X
What is the information flow on the dashed edge X → Z? To answer this, notice that removing
this edge restricts the joint distribution to a Markov chain X → Y → Z. Thus, it is natural to
ask what is the minimum (KL-divergence) distance between a given PX,Y,Z and the set of all
distributions QX,Y,Z satisfying the Markov chain constraint. By the above calculation, optimal
QX,Y,Z = PY PX|Y PZ|Y and hence the distance is I(X; Z|Y). For this reason, we may interpret I(X; Z|Y)
as the amount of information flowing through the X → Z edge.
In addition to inf-characterization, mutual information also has a sup-characterization.
Theorem 4.4. For any Markov kernel QX|Y such that QX|Y=y PX for PY -a.e. y we have
dQX|Y
I(X; Y) ≥ EPX,Y log .
dPX
i i
i i
i i
54
Remark 4.2. Similar to how Theorem 4.1 is used to upper-bound I(X; Y) by choosing a good
approximation to PY , this result is used to lower-bound I(X; Y) by selecting a good (but com-
putable) approximation QX|Y to usually a very complicated posterior PX|Y . See Section 5.6 for
applications.
Proof. Since modifying QX|Y=y on a negligible set of y’s does not change the expectations, we
will assume that QX|Y=y PY for every y. If I(X; Y) then there is nothing to prove. So we assume
I(X; Y) < ∞, which implies PX,Y PX PY . Then by Lemma 3.3 we have that PX|Y=y PX for
dQX|Y=y /dPX
almost every y. Choose any such y and apply (2.11) with μ = PX and noticing Log 1 =
dQX|Y=y
log dP X
we get
dQX|Y=y
EPX|Y=y log = D(PX|Y=y kPX ) − D(PX|Y=y kQX|Y=y ) ,
dPX
which is applicable since the first term is finite for a.e. y by (3.1). Taking expectation of the previous
identity over y we obtain
dQX|Y
EPX,Y log = I(X; Y) − D(PX|Y kQX|Y |PY ) ≤ I(X; Y) , (4.8)
dPX
implying the first part. The equality case in (4.7) follows by taking QX|Y = PX|Y , which satisfies
the conditions on Q when I(X; Y) < ∞.
i i
i i
i i
Remark 4.3. This theorem, in particular, allows us to prove all general identities and inequalities
for the cases of discrete random variables and then pass to the limit. In case of mutual information
I(X; Y) = D(PX,Y kPX PY ), the partitions over X and Y can be chosen separately, see (4.29).
“≤”: To show D(PkQ) is indeed achievable, first note that if P 6 Q, then by definition, there
exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = Bc , we have
P2 P[Ei ]
D(PkQ) = ∞ = i=1 P[Ei ] log Q[Ei ] . In the sequel we assume that P Q and let X = dQ .
dP
Then D(PkQ) = EQ [X log X] = EQ [φ(X)] by (2.4). Note that φ(x) ≥ 0 if and only if x ≥ 1. By
monotone convergence theorem, we have EQ [φ(X)1{X<c} ] → D(PkQ) as c → ∞, regardless of
the finiteness of D(PkQ).
Next, we construct a finite partition. Let n = c/ϵ be an integer and for j = 0, . . . , n − 1, let
Ej = {jϵ ≤ X(j + 1)ϵ} and En = {X ≥ c}. Define Y = ϵbX/ϵc as the quantized version. Since φ is
uniformly continuous on [0, c], for any x, y ∈ [0, c] such |x − y| ≤ ϵ, we have |φ(x) − φ(y)| ≤ ϵ′
for some ϵ′ = ϵ′ (ϵ, c) such as ϵ′ → 0 as ϵ → 0. Then EQ [φ(Y)1{X<c} ] ≥ EQ [φ(X)1{X<c} ] − ϵ′ .
Morever,
X
n−1 n−1
X
P(Ej )
EQ [φ(Y)1{X<c} ] = φ(jϵ)Q(Ej ) ≤ ϵ′ + φ Q(Ej )
Q( E j )
j=0 j=0
X
n
P(Ej )
≤ ϵ′ + Q(X ≥ c) log e + P(Ej ) log ,
Q( E j )
j=0
P(E )
where the first inequality applies the uniform continuity of φ since jϵ ≤ Q(Ejj ) < (j + 1)ϵ, and the
second applies φ ≥ − log e. As Q(X ≥ c) → 0 as c → ∞, the proof is completed by first sending
ϵ → 0 then c → ∞.
i i
i i
i i
56
Theorem 4.6 (Donsker-Varadhan [100]). Let P, Q be probability measures on X and let CQ denote
the set of functions f : X → R such that EQ [exp{f(X)}] < ∞. We have
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] . (4.11)
f∈CQ
In particular, if D(PkQ) < ∞ then EP [f(X)] is finite for every f ∈ CQ . The identity (4.11) holds
with CQ replaced by the class of all simple functions. If X is a normal topological space (e.g., a
metric space) with Borel σ -algebra, then also
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] , (4.12)
f∈Cb
Proof. “≥”: We can assume for this part that D(PkQ) < ∞, since otherwise there is nothing to
prove. Then fix f ∈ CQ and define a probability measure Qf (tilted version of Q) via
Qf (dx) = exp{f(x) − Zf }Q(dx) , Zf ≜ log EQ [exp{f(X)}] .
Then, obviously Qf Q and we have
dQf dPdQf
EP [f(X)] − Zf = EP log = EP log = D(PkQ) − D(PkQf ) ≤ D(PkQ) .
dQ dQdP
“≤”: The idea is to just take f = log dQ
dP
; however to handle the edge cases we proceed carefully.
First, notice that if P 6 Q then for some E with Q[E] = 0 < P[E] and c → ∞ taking f = c1E shows
that both sides of (4.11) are infinite. Thus, we assume P Q. For any partition of X = ∪nj=1 Ej
Pn P[ E ]
we set f = j=1 1Ej log Q[Ejj ] . Then the right-hand sides of (4.11) and (4.9) evaluate to the same
value and hence by Theorem 4.5 we obtain that supremum over simple functions (and thus over
CQ ) is at least as large as D(PkQ).
Finally, to show (4.12), we show that for every simple function f there exists a continuous
bounded f′ such that EP [f′ ] − log EQ [exp{f′ }] is arbitrarily close to the same functional evaluated
at f. Clearly, for that it is enough to show that for any a ∈ R and measurable A ⊂ X there exists a
sequence of continuous bounded fn such that
EP [fn ] → aP[A], and EQ [exp{fn }] → exp{a}Q[A] (4.13)
hold simultaneously. We only consider the case of a > 0 below. Let compact F and open U be
such that F ⊂ A ⊂ U and max(P[U] − P[F], Q[U] − Q[F]) ≤ ϵ. Such F and U exist whenever P and
Q are so-called regular measures. Without going into details, we just notice that finite measures
on Polish spaces are automatically regular. Then by Urysohn’s lemma there exists a continuous
function fϵ : X → [0, a] equal to a on F and 0 on Uc . Then we have
aP[F] ≤ EP [fϵ ] ≤ aP[U]
exp{a}Q[F] ≤ EQ [exp{fϵ }] ≤ exp{a}Q[U] .
Subtracting aP[A] and exp{a}Q[A] for each of these inequalities, respectively, we see that taking
ϵ → 0 indeed results in a sequence of functions satisfying (4.13).
i i
i i
i i
Remark 4.4. 1 What is the Donsker-Varadhan representation useful for? By setting f(x) = ϵ · g(x)
with ϵ 1 and linearizing exp and log we can see that when D(PkQ) is small, expecta-
tions under P can be approximated by expectations over Q (change of measure): EP [g(X)] ≈
EQ [g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation
distance provides a similar bound, but for a narrower class of bounded functions:
| EP [g(X)] − EQ [g(X)]| ≤ kgk∞ TV(P, Q) .
2 More formally, the inequality EP [f(X)] ≤ log EQ [exp f(X)] + D(PkQ) is useful in estimating
EP [f(X)] for complicated distribution P (e.g. over high-dimensional X with weakly dependent
coordinates) by making a smart choice of Q (e.g. with iid components).
3 In Chapter 5 we will show that D(PkQ) is convex in P (in fact, in the pair). A general method
of obtaining variational formulas like (4.11) is via the Young-Fenchel duality. Indeed, (4.11) is
exactly this inequality since the Fenchel-Legendre conjugate of D(·kQ) is given by a convex
map f 7→ Zf . For more details, see Section 7.13.
4 Donsker-Varadhan should also be seen as an “improved version” of the DPI. For example, one
of the main applications of the DPI in this book is in obtaining estimates like
1
P[A] log ≤ D(PkQ) + log 2 , (4.14)
Q[ A ]
which is the basis of the large deviations theory (Corollary 2.17) and Fano’s inequality
(Theorem 6.3). The same estimate can be obtained by applying (4.11) via f(x) = 1{x∈A} log Q[1A] .
Proposition 4.7. Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then the
map
P 7→ D(PkQ)
is continuous. In particular,
P 7→ H(P) (4.15)
is continuous.
Warning: Divergence is never continuous in the pair, even for finite alphabets. For example,
as n → ∞, d( 1n k2−n ) 6→ 0.
Proof. Notice that
X P ( x)
D(PkQ) = P(x) log
x
Q ( x)
i i
i i
i i
58
Our next goal is to study continuity properties of divergence for general alphabets. We start
with a negative observation.
Remark 4.5. In general, D(PkQ) is not continuous in either P or Q. For example, let X1 , . . . , Xn
Pn d
be iid and equally likely to be {±1}. Then by central limit theorem, Sn = √1n i=1 Xi −
→N (0, 1)
as n → ∞. But
for all n. Note that this is an example for strict inequality in (4.16).
Nevertheless, there is a very useful semicontinuity property.
Theorem 4.8 (Lower semicontinuity of divergence). Let X be a metric space with Borel σ -algebra
H. If Pn and Qn converge weakly to P and Q, respectively,1 then
On a general space if Pn → P and Qn → Q pointwise2 (i.e. Pn [E] → P[E] and Qn [E] → Q[E] for
every measurable E) then (4.16) also holds.
Proof. This simply follows from (4.12) since EPn [f] → EP [f] and EQn [exp{f}] → EQ [exp{f}] for
every f ∈ Cb .
D(PF kQF ) .
1
Recall that sequence of random variables Xn converges in distribution to X if and only if their laws PXn converge weakly
to PX .
2
Pointwise convergence is weaker than convergence in total variation and stronger than weak convergence.
i i
i i
i i
For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to (a) any algebra of sets F and (b) two positive additive (not necessarily σ -additive)
set-functions P, Q on F .
Definition 4.9 (KL divergence over an algebra). Let P and Q be two positive, additive (not nec-
essarily σ -additive) set-functions defined over an algebra F of subsets of X (not necessarily a
σ -algebra). We define
X
n
P[Ei ]
D(PF kQF ) ≜ sup P[Ei ] log ,
{E1 ,...,En } i=1 Q[Ei ]
Sn
where the supremum is over all finite F -measurable partitions: j=1 Ej = X , Ej ∩ Ei = ∅, and
0 log 01 = 0 and log 10 = ∞ per our usual convention.
Note that when F is not a σ -algebra or P, Q are not σ -additive, we do not have Radon-Nikodym
theorem and thus our original definition of KL-divergence is not applicable.
• If F is (P + Q)-dense in G then3
and, in particular,
Proof. The first two items are straightforward applications of the definition. The third follows
from the following fact: if F is dense in G then any G -measurable partition {E1 , . . . , En } can
be approximated by a F -measurable partition {E′1 , . . . , E′n } with (P + Q)[Ei 4E′i ] ≤ ϵ. Indeed,
first we set E′1 to be an element of F with (P + Q)(E1 4E′1 ) ≤ 2n ϵ
. Then, we set E′2 to be
3
Recall that F is μ-dense in G if ∀E ∈ G, ϵ > 0∃E′ ∈ F s.t. μ[E∆E′ ] ≤ ϵ.
i i
i i
i i
60
an ϵ
2n -approximation of E2 \ E′1 , etc. Finally, E′n = (∪j≤1 E′j )c . By taking ϵ → 0 we obtain
P ′ P[E′i ] P
P[Ei ] log QP[[EEii]] .
i P[Ei ] log Q[E′i ] → i
The last statement follows from the previous one and the fact that any algebra F is μ-dense in
the σ -algebra σ{F} it generates for any bounded μ on (X , H) (cf. [107, Lemma III.7.1].)
Finally, we address the continuity under the decreasing σ -algebra, i.e. (4.18).
i i
i i
i i
Further properties of mutual information follow from I(X; Y) = D(PX,Y kPX PY ) and correspond-
ing properties of divergence, e.g.
4
Here we only assume that topology on the space of measures is compatible with the linear structure, so that all linear
operations on measures are continuous.
i i
i i
i i
62
from which (4.24) follows by moving the outer expectation inside the log. Both of these can
be used to show that E[f(X, Y)] ≈ E[f(X, Ȳ)] as long as the dependence between X and Y (as
measured by I(X; Y)) is weak. For example, suppose that for every x the random variable h(x, Ȳ)
is ϵ-subgaussian, i.e.
1
log E[exp{λh(x, Ȳ)}] ≤ λ E[h(x, Ȳ)] + ϵ2 λ2 .
2
Then plugging f = λh into (4.25) and optimizing λ shows
p
E[h(X, Y)] − E[h(X, Ȳ)] ≤ 2ϵ2 I(X; Y) . (4.26)
This allows one to control expectations of functions of dependent random variables by replac-
ing them with independent pairs at the expense of (square-root of the) mutual information
slack [338]. Variant of this idea for bounding deviations with high-probability is a foundation of
the PAC-Bayes bounds on generalization of learning algorithims (in there, Y becomes training
data, X is the selected hypothesis/predictor, PX|Y the learning algorithm, E[h(X, Ȳ)] the test loss,
etc); see Ex. I.44 and [58] for more.
2 (Uniform convergence and Donsker-Varadhan) There is an interesting other consequence
of (4.25). By Theorem 4.1 we have I(X; Y) ≤ D(PX|Y kQX |PY ) for any fixed QX . This lets us con-
vert (4.25) into the following inequality: (we denote by EY and EX|Y the respective uncoditional
and conditional expectations): For every f, PY , QX and PX|Y we have
EY EX|Y [f(X, Y) − log EȲ [exp f(X, Ȳ)] − D(PX|Y=Y kQX ) ≤ 0 .
Now because of the arbitrariness of PX|Y , setting measurability issues aside, we get: For every
f, PY and QX
EY sup EX∼PX [f(X, Y) − log EȲ [exp f(X, Ȳ)] − D(PX kQX ) ≤ 0 .
PX
5
Just apply Donsker-Varadhan to D(PY|X=x0 kPY ) and average over x0 ∼ PX .
i i
i i
i i
For example, taking QX to be uniform on N elements recovers the standard bound on the
maximum of subgaussian random variables: if H1 , . . . , HN are ϵ-subgaussian, then
p
E max (Hi − E[Hi ]) ≤ 2ϵ2 log N . (4.27)
1≤i≤N
d
• Good example of strict inequality: Xn = Yn = 1n Z. In this case (Xn , Yn ) → (0, 0) but
I(Xn ; Yn ) = H(Z) > 0 = I(0; 0).
• Even more impressive example: Let (Xp , Yp ) be uniformly distributed on the unit ℓp -ball on
d
the plane: {x, y : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) → (0, 0), but I(Xp ; Yp ) → ∞. (See
Ex. I.36)
4 Mutual information as supremum over partitions:
X PX,Y [Ei × Fj ]
I(X; Y) = sup PX,Y [Ei × Fj ] log , (4.29)
{Ei }×{Fj } PX [Ei ]PY [Fj ]
i,j
This implies that the full amount of mutual information between two processes X∞ and Y∞
is contained in their finite-dimensional projections, leaving nothing in the tail σ -algebra. Note
also that applying the (finite-n) chain rule to (4.30) recovers (4.1).
T
6 (Monotone convergence II): Let Xtail be a random variable such that σ(Xtail ) = n≥1 σ(X∞ n ).
Then
I(Xtail ; Y) = lim I(X∞
n ; Y) , (4.32)
n→∞
whenever the right-hand side is finite. This is a consequence of Proposition 4.11. Without the
i.i.d.
finiteness assumption the statement is incorrect. Indeed, consider Xj ∼ Ber(1/2) and Y = X∞ 0 .
Then each I(X∞ n ; Y) = ∞ , but Xtail = const a.e. by Kolmogorov’s 0-1 law, and thus the left-hand
side of (4.32) is zero.
6
To prove this from (4.9) one needs to notice that algebra of measurable rectangles is dense in the product σ-algebra. See
[95, Sec. 2.2].
i i
i i
i i
• I-projection: Given Q minimize D(PkQ) over convex class of P. (See Chapter 15.)
• Maximum likelihood: Given P minimize D(PkQ) over some class of Q. (See Section 29.3.)
• Rate-Distortion: Given PX minimize I(X; Y) over a convex class of PY|X . (See Chapter 26.)
• Capacity: Given PY|X maximize I(X; Y) over a convex class of PX . (This chapter.)
In this chapter we show that all these problems have convex/concave objective functions,
discuss iterative algorithms for solving them, and study the capacity problem in more detail.
Remark 5.1. The proof shows that for an arbitrary measure of similarity D(PkQ) convexity of
(P, Q) 7→ D(PkQ) is equivalent to “conditioning increases divergence” property of D. Convexity
can also be understood as “mixing decreases divergence”.
Remark 5.2. There are a number of alternative arguments possible. For example, (p, q) 7→ p log pq
is convex on R2+ , which isa manifestation
of a general phenomenon: for a convex f(·) the perspec-
tive function (p, q) 7→ qf pq is convex too. Yet another way is to invoke the Donsker-Varadhan
variational representation Theorem 4.6 and notice that supremum of convex functions is convex.
64
i i
i i
i i
Theorem 5.2. The map PX 7→ H(X) is concave. Furthermore, if PY|X is any channel, then PX 7→
H(X|Y) is concave. If X is finite, then PX 7→ H(X|Y) is continuous.
Proof. For the special case of the first claim, when PX is on a finite alphabet, the proof is complete
by H(X) = log |X | − D(PX kUX ). More generally, we prove the second claim as follows. Let
f(PX ) = H(X|Y). Introduce a random variable U ∼ Ber(λ) and define the transformation
P0 U = 0
PX|U =
P1 U = 1
Consider the probability space U → X → Y. Then we have f(λP1 + (1 − λ)P0 ) = H(X|Y) and
λf(P1 ) + (1 − λ)f(P0 ) = H(X|Y, U). Since H(X|Y, U) ≤ H(X|Y), the proof is complete. Continuity
follows from Proposition 4.12.
Recall that I(X; Y) is a function of PX,Y , or equivalently, (PX , PY|X ). Denote I(PX , PY|X ) =
I(X; Y).
Proof. There are several ways to prove the first statement, all having their merits.
• First proof : Introduce θ ∈ Ber(λ). Define PX|θ=0 = P0X and PX|θ=1 = P1X . Then θ → X → Y.
Then PX = λ̄P0X + λP1X . I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our
desired I(λ̄P0X + λP1X , PY|X ) ≥ λ̄I(P0X , PY|X ) + λI(P0X , PY|X ).
• Second proof : I(X; Y) = minQ D(PY|X kQ|PX ), which is a pointwise minimum of affine functions
in PX and hence concave.
• Third proof : Pick a Q and use the golden formula: I(X; Y) = D(PY|X kQ|PX ) − D(PY kQ), where
PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→ D(PY kQ)
(convex).
The argument PY is a linear function of PY|X and thus the statement follows from convexity of D
in the pair.
i i
i i
i i
66
Theorem 5.4 (Saddle point). Let P be a convex set of distributions on X . Suppose there exists
P∗X ∈ P , called a capacity-achieving input distribution, such that
Let P∗Y = P∗X PY|X , called a capacity-achieving output distribution. Then for all PX ∈ P and for all
QY , we have
D(PY|X kP∗Y |PX ) ≤ D(PY|X kP∗Y |P∗X ) ≤ D(PY|X kQY |P∗X ). (5.1)
Proof. Right inequality: obvious from C = I(P∗X , PY|X ) = minQY D(PY|X kQY |P∗X ).
Left inequality: If C = ∞, then trivial. In the sequel assume that C < ∞, hence I(PX , PY|X ) < ∞
for all PX ∈ P . Let PXλ = λPX + λP∗X ∈ P and PYλ = PY|X ◦ PXλ . Clearly, PYλ = λPY + λP∗Y ,
where PY = PY|X ◦ PX .
We have the following chain then:
where inequality is by the right part of (5.1) (already shown). Thus, subtracting λ̄C and dividing
by λ we get
and the proof is completed by taking lim infλ→0 and applying the lower semincontinuity of
divergence (Theorem 4.8).
Corollary 5.5. In addition to the assumptions of Theorem 5.4, suppose C < ∞. Then the capacity-
achieving output distribution P∗Y is unique. It satisfies the property that for any PY induced by some
PX ∈ P (i.e. PY = PY|X ◦ PX ) we have
i i
i i
i i
Remark 5.3. • The finiteness of C is necessary for Corollary 5.5 to hold. For a counterexample,
consider the identity channel Y = X, where X takes values on integers. Then any distribution
with infinite entropy is a capacity-achieving input (and output) distribution.
• Unlike the output distribution, capacity-achieving input distribution need not be unique. For
example, consider Y1 = X1 ⊕ Z1 and Y2 = X2 where Z1 ∼ Ber( 12 ) is independent of X1 . Then
maxPX1 X2 I(X1 , X2 ; Y1 , Y2 ) = log 2, achieved by PX1 X2 = Ber(p) × Ber( 21 ) for any p. Note that
the capacity-achieving output distribution is unique: P∗Y1 Y2 = Ber( 12 ) × Ber( 21 ).
Suppose we have a bivariate function f. Then we always have the minimax inequality:
inf sup f(x, y) ≥ sup inf f(x, y).
y x x y
1 It turns out minimax equality is implied by the existence of a saddle point (x∗ , y∗ ),
i.e.,
f ( x, y∗ ) ≤ f ( x∗ , y∗ ) ≤ f ( x∗ , y) ∀ x, y
Furthermore, minimax equality also implies existence of saddle point if inf and sup
are achieved c.f. [31, Section 2.6]) for all x, y [Straightforward to check. See proof
of corollary below].
2 There are a number of known criteria establishing
inf sup f(x, y) = sup inf f(x, y)
y x x y
i i
i i
i i
68
Proof. This follows from the saddle-point: Maximizing/minimizing the leftmost/rightmost sides
of (5.1) gives
min sup D(PY|X kQY |PX ) ≤ max D(PY|X kP∗Y |PX ) = D(PY|X kP∗Y |P∗X )
QY PX ∈P PX ∈P
but by definition min max ≥ max min. Note that we were careful to only use max and min for the
cases where we know the optimum is achievable.
i i
i i
i i
1 Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e.,
rad (A) = inf sup d(x, y). (5.3)
y∈ X x∈ A
2 Diameter of A:
diam (A) = sup d(x, y). (5.4)
x, y∈ A
Note that the radius and the diameter both measure the massiveness/richness of a
set.
3 From definition and triangle inequality we have
1
diam (A) ≤ rad (A) ≤ diam (A). (5.5)
2
The lower and upper bounds are achieved when A is, for example, a Euclidean ball
and the Hamming space, respectively.
4 In many special cases, the upper bound in (5.5) can be improved:
• A result of Bohnenblust [43] shows that in Rn equipped with any norm we always
have rad (A) ≤ n+n 1 diam (A).
q
• For Rn with Euclidean distance Jung proved rad (A) ≤ n
2(n+1) diam (A),
attained by simplex. The best constant is sometimes called the Jung constant
of the space.
• For Rn with ℓ∞ -norm the situation is even simpler: rad (A) = 12 diam (A); such
spaces are called centrable.
The next simple corollary shows that capacity is just the radius of a finite collection of dis-
tributions {PY|X=x : x ∈ X } when distances are measured by divergence (although, we remind,
divergence is not a metric).
Corollary 5.7. For any finite X and any kernel PY|X , the maximal mutual information over all
distributions PX on X satisfies
i i
i i
i i
70
The last corollary gives a geometric interpretation to capacity: It equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY|X=x : x ∈ X }. Moreover, the optimal
center P∗Y is a convex combination of some PY|X=x and is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence)
for arbitrary input space (see Theorem 32.4 for a related representation):
I(X; Y) = inf D(PY|X kQ|PX ) ≤ inf sup D(PY|X=x kQ) ≤ ′inf sup D(PY|X=x kPY|X=x′ ).
Q Q x∈X x ∈X x∈X
C = sup I(X; Y)
PX ∈P
can be (a) interpreted as a saddle point; (b) written in the minimax form; and (c) that the capacity-
achieving output distribution P∗Y is unique. This was all done under the extra assumption that the
supremum over PX is attainable. It turns out, properties b) and c) can be shown without that extra
assumption.
Theorem 5.9 (Kemperman). For any PY|X and a convex set of distributions P such that
Furthermore,
i i
i i
i i
Note that Condition (5.6) is automatically satisfied if there exists a QY such that
sup D(PY|X kQY |PX ) < ∞ . (5.11)
PX ∈P
Without the constraint E[X4 ] = s, the capacity is uniquely achieved at the input distribution PX =
N (0, P); see Theorem 5.11. When s 6= 3P2 , such PX is no longer feasible. However, for s > 3P2
the maximum
1
C = log(1 + P)
2
is still attainable. Indeed, we can add a small “bump” to the gaussian distribution as follows:
PX = (1 − p)N (0, P) + pδx ,
where p → 0 and x → ∞ such that px2 → 0 but px4 → s − 3P2 > 0. This shows that for the
problem (5.12) with s > 3P2 , the capacity-achieving input distribution does not exist, but the
capacity-achieving output distribution P∗Y = N (0, 1 + P) exists and is unique as Theorem 5.9
shows.
Proof of Theorem 5.9. Let P′Xn be a sequence of input distributions achieving C, i.e.,
I(P′Xn , PY|X ) → C. Let Pn be the convex hull of {P′X1 , . . . , P′Xn }. Since Pn is a finite-dimensional
simplex, the (concave) function PX 7→ I(PX , PY|X ) is continuous (Proposition 4.12) and attains its
maximum at some point PXn ∈ Pn , i.e.,
In ≜ I(PXn , PY|X ) = max I(PX , PY|X ) .
PX ∈Pn
Since the space of all probability distributions on a fixed alphabet is complete in total variation,
the sequence must have a limit point PYn → P∗Y . Convergence in TV implies weak convergence,
i i
i i
i i
72
and thus by taking a limit as k → ∞ in (5.15) and applying the lower semicontinuity of divergence
(Theorem 4.8) we get
and therefore, PYn → P∗Y in the (stronger) sense of D(PYn kP∗Y ) → 0. By Theorem 4.1,
To prove that (5.18) holds for arbitrary PX ∈ P , we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃∗Y , and obtain:
where (5.20) follows from (5.18) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃∗Y = P∗Y
and therefore (5.18) holds.
To see the uniqueness of P∗Y , assuming there exists Q∗Y that fulfills C = supPX ∈P D(PY|X kQ∗Y |PX ),
we show Q∗Y = P∗Y . Indeed,
C ≥ D(PY|X kQ∗Y |PXn ) = D(PY|X kPYn |PXn ) + D(PYn kQ∗Y ) = In + D(PYn kQ∗Y ).
Since In → C, we have D(PYn kQ∗Y ) → 0. Since we have already shown that D(PYn kP∗Y ) → 0,
we conclude P∗Y = Q∗Y (this can be seen, for example, from Pinsker’s inequality and the triangle
inequality TV(P∗Y , Q∗Y ) ≤ TV(PYn , Q∗Y ) + TV(PYn , P∗Y ) → 0).
Finally, to see (5.9), note that by definition capacity as a max-min is at most the min-max, i.e.,
C = sup min D(PY|X kQY |PX ) ≤ min sup D(PY|X kQY |PX ) ≤ sup D(PY|X kP∗Y |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P
Corollary 5.10. Let X be countable and P a convex set of distributions on X . If supPX ∈P H(X) <
∞ then
X 1
sup H(X) = min sup PX (x) log <∞
PX ∈P QX PX ∈P
x
Q X ( x)
and the optimizer Q∗X exists and is unique. If Q∗X ∈ P , then it is also the unique maximizer of H(X).
i i
i i
i i
P
Example 5.2 (Max entropy). Assume that f : Z → R is such that Z(λ) ≜ n∈Z exp{−λf(n)} < ∞
for all λ > 0. Then
This follows from taking QX (n) = Z(λ)−1 exp{−λf(n)} in Corollary 5.10. Distributions of this
form are known as Gibbs distributions for the energy function f. This bound is often tight and
achieved by PX (n) = Z(λ∗ )−1 exp{−λ∗ f(n)} with λ∗ being the minimizer.
1. “Gaussian capacity”:
1 σ2
C = I(Xg ; Xg + Ng ) = log 1 + X2
2 σN
I(X; X + Ng ) ≤ I(Xg ; Xg + Ng ),
d
with equality iff X=Xg .
3. “Gaussian noise is the worst for Gaussian input”: For for all N s.t. E[Xg N] = 0 and EN2 ≤ σN2 ,
I(Xg ; Xg + N) ≥ I(Xg ; Xg + Ng ),
d
with equality iff N=Ng and independent of Xg .
Interpretations:
1 For AWGN channel, Gaussian input is the most favorable. Indeed, immediately from the second
statement we have
1 σ2
max I(X; X + Ng ) = log 1 + X2
X:Var X≤σX2 2 σN
i i
i i
i i
74
Proof. WLOG, assume all random variables have zero mean. Let Yg = Xg + Ng . Define
1 σ 2 log e x2 − σX2
f(x) ≜ D(PYg |Xg =x kPYg ) = D(N (x, σN2 )kN (0, σX2 + σN2 )) = log 1 + X2 +
2 σN 2 σX2 + σN2
| {z }
=C
3. Let Y = Xg + N and let PY|Xg be the respective kernel. Note that here we only assume that N is
uncorrelated with Xg , i.e., E [NXg ] = 0, not necessarily independent. Then
dPXg |Yg (Xg |Y)
I(Xg ; Xg + N) ≥ E log (5.21)
dPXg (Xg )
dPYg |Xg (Y|Xg )
= E log (5.22)
dPYg (Y)
log e h Y2 N2 i
=C+ E 2 2
− 2 (5.23)
2 σX + σN σN
log e σX 2 EN2
=C+ 1 − (5.24)
2 σX2 + σN2 σN2
≥ C, (5.25)
where
• (5.21): follows from (4.7),
dPX |Y dPY |X
• (5.22): dPgX g = dPgY g
g g
i i
i i
i i
Note that there is a steady improvement at each step (the value F(sk , tk ) is decreasing), so it
can be often proven that the algorithm converges to a local minimum, or even a global minimum
under appropriate conditions (e.g. the convexity of f). Below we discuss several applications of
this idea, and refer to [82] for proofs of convergence. We need a result, which will be derived
in Chapter 15: for any function c : Y → R and any QY on Y , under the integrability condition
Z = ∫ QY(dy) exp{−c(y)} < ∞,
  min_{PY} { D(PY ‖ QY) + E_{Y∼PY}[c(Y)] }    (5.26)
Maximizing mutual information (capacity). We have a fixed PY|X and the optimization problem
  C = max_{PX} I(X; Y) = max_{PX} max_{QX|Y} E_{PX,Y}[ log( QX|Y(X|Y)/PX(X) ) ].
This results in the iterations:
  QX|Y(x|y) ← (1/Z(y)) PX(x) PY|X(y|x),
  PX(x) ← Q′(x) ≜ (1/Z) exp{ Σ_y PY|X(y|x) log QX|Y(x|y) },
where Z(y) and Z are normalization constants. To derive this, notice that for a fixed PX the optimal
QX|Y = PX|Y . For a fixed QX|Y , we can see that
  E_{PX,Y}[ log( QX|Y(X|Y)/PX(X) ) ] = log Z − D(PX ‖ Q′),
and thus the optimal PX = Q′ .
Denoting Pn to be the value of PX at the nth iteration, we observe that
  I(Pn, PY|X) ≤ C ≤ sup_x D(PY|X=x ‖ PY|X ◦ Pn).    (5.27)
This is useful since at every iteration we obtain not only an estimate Pn of the optimizer, but also a bound on the gap to optimality, C − I(Pn, PY|X) ≤ RHS − LHS of (5.27). It can be shown, furthermore, that both the RHS and the LHS in (5.27) monotonically converge to C as n → ∞; see [82] for details.
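The two updates above are the Blahut–Arimoto iterations. Below is a minimal Python sketch for a finite channel; the BSC test channel, tolerance and iteration limit are our own illustrative choices.

import numpy as np

def blahut_arimoto(P_yx, tol=1e-9, max_iter=2000):
    """Capacity (in bits) of a channel P_yx[y, x] = P(Y=y|X=x) via the iterations above."""
    n_y, n_x = P_yx.shape
    p_x = np.full(n_x, 1.0 / n_x)                 # initial input distribution
    for _ in range(max_iter):
        q_xy = P_yx * p_x                         # Q(x|y) proportional to p(x) P(y|x)
        q_xy /= q_xy.sum(axis=1, keepdims=True)
        # p(x) proportional to exp( sum_y P(y|x) log Q(x|y) )
        terms = np.where(P_yx > 0, P_yx * np.log(q_xy + 1e-300), 0.0)
        new_px = np.exp(terms.sum(axis=0))
        new_px /= new_px.sum()
        if np.max(np.abs(new_px - p_x)) < tol:
            p_x = new_px
            break
        p_x = new_px
    # mutual information I(p_x, P_yx) in bits
    p_y = P_yx @ p_x
    mask = (P_yx > 0) & (p_x[None, :] > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        kl_terms = np.where(mask, P_yx * np.log2(P_yx / p_y[:, None]), 0.0)
    return float(p_x @ kl_terms.sum(axis=0)), p_x

# Binary symmetric channel with crossover 0.11: capacity = 1 - h(0.11), about 0.5 bit
bsc = np.array([[0.89, 0.11], [0.11, 0.89]])
C, px = blahut_arimoto(bsc)
print(C, px)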
where QX|Y is a given channel. This is a problem arising in the maximum likelihood estimation
Pn
for mixture models where QY is the unknown mixing distribution and PX = 1n i=1 δxi is the
empirical distribution of the sample (x1 , . . . , xn ).1
1 (θ)
Note that EM algorithm is also applicable more generally, when QX|Y itself depends on the unknown parameter θ and the
∑ (θ)
goal (see Section 29.3) is to maximize the total log likelihood ni=1 log QX (xi ) joint over (QY , θ). A canonical example
(which was one of the original motivations for the EM algorithm) a k-component Gaussian mixture
(θ) ∑ (θ)
QX = kj=1 wj N (μj , 1); in other words, QY = (w1 , . . . , wk ), QX|Y=j = N (μj , 1) and θ = (μ1 , . . . , μk ). If the centers
μj ’s are known and only the weights wj ’s are to be estimated, then we get the simple convex case in (5.29). In general the
log likelihood function is non-convex in the centers and EM iterations may not converge to the global optimum even with
infinite sample size (see [169] for an example with k = 3).
(Note that taking d(x, y) = −log (dQX|Y=y/dPX)(x) shows that this problem is equivalent to (5.28).) By the chain rule, thus, we find the iterations
  PY|X(y|x) ← (1/Z(x)) QY(y) QX|Y(x|y),
  QY ← PY|X ◦ PX .
Denote by Qn the value of QX = QX|Y ◦ QY at the nth iteration. Notice that for any n and all QX we have from Jensen’s inequality
  D(PX ‖ QX) − D(PX ‖ Qn) = E_{X∼PX}[ −log E_{Y∼QY}[ dQX|Y/dQn ] ] ≥ gap(Qn),
where we defined gap(Qn) = −log esssup_y E_{X∼PX}[ dQX|Y=y/dQn ]. In all, we get the following sandwich bound:
  D(PX ‖ Qn) + gap(Qn) ≤ L ≤ D(PX ‖ Qn),    (5.30)
and it can be shown that as n → ∞ both sides converge to L.
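As a concrete illustration of these iterations, here is a minimal Python sketch of the convex case discussed in the footnote (Gaussian mixture with known unit-variance centers, unknown weights only); the centers, weights and sample size below are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1-D Gaussian mixture with known unit-variance centers;
# only the mixing weights Q_Y are unknown (the convex case).
centers = np.array([-2.0, 0.0, 3.0])
true_w = np.array([0.2, 0.5, 0.3])
x = rng.normal(centers[rng.choice(3, size=2000, p=true_w)], 1.0)

# Q_{X|Y=j}(x_i): likelihood of sample i under component j
lik = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2) / np.sqrt(2 * np.pi)

w = np.full(3, 1.0 / 3.0)                     # initial mixing weights Q_Y
for _ in range(200):
    post = lik * w                            # E-step: P_{Y|X}(j|x_i) prop. to Q_Y(j) Q_{X|Y=j}(x_i)
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                     # M-step: Q_Y <- P_{Y|X} o P_X (empirical average)

print(np.round(w, 3))                         # should approach true_w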
Sinkhorn’s algorithm. This algorithm [284] is very similar, but not exactly the same as the ones above. We fix QX,Y and two marginals VX, VY, and solve the problem
  S = min{ D(PX,Y ‖ QX,Y) : PX = VX, PY = VY }.
From the results of Chapter 15 it is clear that the optimal distribution PX,Y is given by
  P∗X,Y(x, y) = A(x) QX,Y(x, y) B(y),
for some A, B ≥ 0. In order to find the functions A, B we notice that under a fixed B the value of A that makes PX = VX is given by
  A(x) ← VX(x) / Σ_y QX,Y(x, y) B(y).
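A minimal Python sketch of the resulting alternating scaling (with the analogous update for B under a fixed A); the toy QX,Y and marginals VX, VY below are made up.

import numpy as np

# Sketch of Sinkhorn scaling for min{ D(P_{X,Y} || Q_{X,Y}) : P_X = V_X, P_Y = V_Y }.
rng = np.random.default_rng(1)
Q = rng.random((4, 5))
Q /= Q.sum()
V_X = np.array([0.1, 0.2, 0.3, 0.4])
V_Y = np.full(5, 0.2)

A = np.ones(4)
B = np.ones(5)
for _ in range(500):
    A = V_X / (Q @ B)            # enforce P_X = V_X for the current B
    B = V_Y / (A @ Q)            # enforce P_Y = V_Y for the current A

P = A[:, None] * Q * B[None, :]  # P*_{X,Y}(x, y) = A(x) Q(x, y) B(y)
print(np.abs(P.sum(axis=1) - V_X).max(), np.abs(P.sum(axis=0) - V_Y).max())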
In this chapter we start with explaining the important property of mutual information known as
tensorization (or single-letterization), which allows one to maximize and minimize mutual infor-
mation between two high-dimensional vectors. So far in this book we have tacitly failed to give
any operational meaning to the value of I(X; Y). In this chapter, we give one fundamental such
justification in the form of Fano’s inequality. It states that whenever I(X; Y) is small, one should
not be able to predict X on the basis of Y with a small probability of error. As such, this inequality
will be applied countless times in the rest of the book. We also define concepts of entropy rate (for
a stochastic process) and of mutual information rate (for a pair of stochastic processes). For the
former, it is shown that two processes that coincide often must have close entropy rates – a fact
to be used later in the discussion of ergodicity. For the latter we give a closed form expression for
the pair of Gaussian processes in terms of their spectral density.
(2) If X1 ⊥⊥ ... ⊥⊥ Xn then
  I(X^n; Y) ≥ Σ_{i=1}^n I(Xi; Y)    (6.2)
with equality iff PXn|Y = ∏_i PXi|Y, PY-almost surely.¹ Consequently,
  min_{PYn|Xn} I(X^n; Y^n) = Σ_{i=1}^n min_{PYi|Xi} I(Xi; Yi).
Proof. (1) Use I(X^n; Y^n) − Σ_i I(Xi; Yi) = D(PYn|Xn ‖ ∏_i PYi|Xi | PXn) − D(PYn ‖ ∏_i PYi).
(2) Reverse the roles of X and Y: I(X^n; Y) − Σ_i I(Xi; Y) = D(PXn|Y ‖ ∏_i PXi|Y | PY) − D(PXn ‖ ∏_i PXi).
1 For product channel, the input maximizing the mutual information is a product distribution.
2 For product source, the channel minimizing the mutual information is a product channel.
Example 6.1. 1. (6.1) fails for non-product channels. Let X1 ⊥⊥ X2 ∼ Bern(1/2) on {0, 1} = F2 and
  Y1 = X1 + X2,   Y2 = X1.
Then I(X1; Y1) = I(X2; Y2) = 0 but I(X^2; Y^2) = 2 bits.
Similarly, with Y1 = X2, Y2 = X3, . . . , Yn = X1 we have I(Xk; Yk) = 0 for every k, yet
  I(X^n; Y^n) = Σ_i H(Xi) > 0 = Σ_k I(Xk; Yk).
¹ That is, if PXn,Y = PY ∏_{i=1}^n PXi|Y as joint distributions.
  max_{Σ_k E[X_k²] ≤ nP} I(X^n; X^n + Z^n) ≤ max_{Σ_k E[X_k²] ≤ nP} Σ_{k=1}^n I(Xk; Xk + Zk)
Given a distribution PX1 · · · PXn satisfying the constraint, form the “average of marginals” distribution P̄X = (1/n) Σ_{k=1}^n PXk, which also satisfies the single-letter constraint E[X²] = (1/n) Σ_{k=1}^n E[X_k²] ≤ P. Then from the concavity in PX of I(PX, PY|X),
  I(P̄X, PY|X) ≥ (1/n) Σ_{k=1}^n I(PXk, PY|X).
So P̄X gives the same or better mutual information, which shows that the extremization above
ought to have the form nC(P) where C(P) is the single letter capacity. Now suppose Yn = Xn + ZnG
where ZnG ∼ N (0, In ). Since an isotropic Gaussian is rotationally symmetric, for any orthogo-
nal transformation U ∈ O(n), the additive noise has the same distribution ZnG ∼ UZnG , so that
PUYn |UXn = PYn |Xn , and
From the “average of marginal” argument above, averaging over many rotations of Xn can only
make the mutual information larger. Therefore, the optimal input distribution PXn can be chosen
to be invariant under orthogonal transformations. Consequently, the (unique!) capacity achiev-
ing output distribution P∗Yn must be rotationally invariant. Furthermore, from the conditions for
equality in (6.1) we conclude that P∗Yn must have independent components. Since the only product
distribution satisfying the power constraints and having rotational symmetry is an isotropic Gaus-
sian, we conclude that P∗Yn = (P∗Y)⊗n with P∗Y = N(0, 1 + P); correspondingly, the optimal input is X^n ∼ N(0, P In).
  min_{PN: E[N²]=1} I(XG; XG + N)
This uses the same trick, except here the input distribution is automatically invariant under
orthogonal transformations.
Our goal is to draw converse statements: for example, if the uncertainty of W is too high or if the
information provided by the data is too scarce, then it is difficult to guess the value of W.
The function FM (·) is shown in Fig. 6.1. Notice that due to its non-monotonicity the state-
ment (6.5) does not imply (6.3), even though P[X = X̂] ≤ Pmax .
Figure 6.1 The function FM in (6.4) is concave with maximum log M at maximizer 1/M, but not monotone.
Proof. To show (6.3) consider an auxiliary distribution QX,X̂ = UX PX̂ , where UX is uniform on
X . Then Q[X = X̂] = 1/M. Denoting P[X = X̂] ≜ PS , applying the DPI for divergence to the data
processor (X, X̂) 7→ 1{X=X̂} yields d(PS k1/M) ≤ D(PXX̂ kQXX̂ ) = log M − H(X).
To show the second part, suppose one is trying to guess the value of X without any side informa-
tion. Then the best bet is obviously the most likely outcome (mode) and the maximal probability
of success is
  max_{X̂ ⊥⊥ X} P[X = X̂] = Pmax.    (6.6)
Thus, applying (6.3) with X̂ being the mode yields (6.5). Finally, suppose that P = (Pmax, P2, . . . , PM) and introduce Q = (Pmax, (1 − Pmax)/(M − 1), . . . , (1 − Pmax)/(M − 1)). Then the difference of the right and left sides of (6.5) equals D(P‖Q) ≥ 0, with equality iff P = Q.
Remark 6.1. Let us discuss the unusual proof technique. Instead of studying directly the prob-
ability space PX,X̂ given to us, we introduced an auxiliary one: QX,X̂ . We then drew conclusions
about the target metric (probability of error) for the auxiliary problem (the probability of error
= 1 − 1/M). Finally, we used the DPI to transport a statement about Q to a statement about P: if D(P‖Q) is small, then the probabilities of events (e.g., {X ≠ X̂}) under P and Q should be close as well. This is a general method, known as the meta-converse, that we develop in more detail later in this book. For this result, however, there are much more explicit ways to derive it – see Ex. I.42.
Similar to Shannon entropy H, Pmax is also a reasonable measure for randomness of P. In fact,
  H∞(P) ≜ log(1/Pmax)    (6.7)
is known as the Rényi entropy of order ∞ (or the min-entropy in the cryptography literature). Note that H∞(P) = log M iff P is uniform, and H∞(P) = 0 iff P is a point mass. In this regard, Fano's inequality can be thought of as our first example of a comparison of information measures: it compares H and H∞.
Theorem 6.3 (Fano’s inequality). Let |X| = M < ∞ and X → Y → X̂. Let Pe = P[X ≠ X̂], then
Proof. The benefit of the previous proof is that it trivially generalizes to this new case of (possibly
randomized) estimators X̂, which may depend on some observation Y correlated with X. Note that
it is clear that the best predictor of X given Y is the maximum a posteriori (MAP) rule, i.e., the posterior mode: X̂(y) = argmax_x PX|Y(x|y).
To show (6.8) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs. QX,Y,X̂ =
UX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1{X̸=X̂} (note that PX̂|Y is identical for both).
To show (6.9) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs. QX,Y,X̂ =
PX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1{X̸=X̂} to obtain:
where the last step follows from Q[X = X̂] ≤ Pmax since X ⊥
⊥ X̂ under Q. (Again, we refer to
Ex. I.42 for a direct proof.)
The following corollary of the previous result emphasizes its role in providing converses (or
impossibility results) for statistics and data transmission.
Proof. Apply Theorem 6.3 and the data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
A sufficient condition for the entropy rate to exist is stationarity, which essentially means invariance with respect to time shifts. Formally, X is stationary if (Xt1, . . . , Xtn) has the same distribution as (Xt1+k, . . . , Xtn+k) for any t1, . . . , tn, k ∈ N. This definition naturally extends to two-sided processes.
Proof.
(a) Further conditioning + stationarity: H(Xn | X^{n−1}) ≤ H(Xn | X_2^{n−1}) = H(X_{n−1} | X^{n−2}).
(b) Using the chain rule: (1/n) H(X^n) = (1/n) Σ_{i=1}^n H(Xi | X^{i−1}) ≥ H(Xn | X^{n−1}).
(c) H(X^n) = H(X^{n−1}) + H(Xn | X^{n−1}) ≤ H(X^{n−1}) + (1/n) H(X^n).
(d) n ↦ (1/n) H(X^n) is a decreasing sequence and lower bounded by zero, hence has a limit H(X). Moreover, by the chain rule, (1/n) H(X^n) = (1/n) Σ_{i=1}^n H(Xi | X^{i−1}). From here we claim that H(Xn | X^{n−1}) converges to the same limit H(X). Indeed, from the monotonicity shown in part (a), lim_n H(Xn | X^{n−1}) = H′ exists. Next, recall the following fact from calculus: if an → a, then the Cesàro mean (1/n) Σ_{i=1}^n ai → a as well. Thus, H′ = H(X).
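As a quick illustration of parts (a)–(d), the following Python sketch computes (1/n)H(X^n) and the entropy rate for a stationary two-state Markov chain (the transition matrix is made up); for a Markov chain H(Xi | X^{i−1}) = H(X2 | X1) for all i ≥ 2, so the chain-rule average converges to that value.

import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # transition matrix (made-up)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

rate = sum(pi[i] * entropy(P[i]) for i in range(2))   # entropy rate H(X) in bits

# H(X^n) via the chain rule: H(X_1) + (n - 1) * H(X_2 | X_1) for a stationary Markov chain
for n in (1, 2, 5, 50):
    H_n = entropy(pi) + (n - 1) * rate
    print(n, H_n / n, rate)                     # (1/n) H(X^n) decreases to the rate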
  lim_{n→∞} [H(X1) − H(X1 | X^0_{−n})] = lim_{n→∞} I(X1; X^0_{−n}) = I(X1; X^0_{−∞}) = H(X1) − H(X1 | X^0_{−∞})
  (1/n) Σ_{j=1}^n P[Xj ≠ Yj] ≤ ϵ.    (6.13)
For binary alphabet this quantity is known as the bit error rate, which is one of the performance
metrics we consider for reliable data transmission in Part IV (see Section 17.1 and Section 19.6).
Notice that if we define the Hamming distance as
  dH(x^n, y^n) ≜ Σ_{j=1}^n 1{xj ≠ yj}    (6.14)
  δ = (1/n) E[dH(X^n, Y^n)] = (1/n) Σ_{j=1}^n P[Xj ≠ Yj].
Proof. For each j ∈ [n], applying (6.8) to the Markov chain Xj → Yn → Yj yields
where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.3),
and combining with (6.16), we get
  H(X^n | Y^n) ≤ Σ_{j=1}^n H(Xj | Y^n)    (6.17)
              ≤ Σ_{j=1}^n FM(P[Xj = Yj])    (6.18)
              ≤ n FM( (1/n) Σ_{j=1}^n P[Xj = Yj] ),    (6.19)
where in the last step we used the concavity of FM and Jensen’s inequality.
Corollary 6.8. Consider two processes X and Y with entropy rates H(X) and H(Y). If P[Xj ≠ Yj] ≤ ϵ for every j, then
  H(X) − H(Y) ≤ FM(1 − ϵ).
and apply (6.15). For the last statement just recall the expression for FM .
Definition 6.9 (Contiguity). Let {Pn } and {Qn } be sequences of probability measures on some
Ωn . We say Pn is contiguous with respect to Qn (denoted by Pn ◁ Qn ) if for any sequence {An } of
measurable sets, Qn (An ) → 0 implies that Pn (An ) → 0. We say Pn and Qn are mutually contiguous
(denoted by Pn ◁▷ Qn ) if Pn ◁ Qn and Qn ◁ Pn .
Theorem 6.10. Let X be a finite set and Qn the uniform distribution on X n . If Pn ◁ Qn , then
H(Pn ) = H(Qn ) + o(n) = n log |X | + o(n). Equivalently, D(Pn kQn ) = o(n).
Proof. Suppose for the sake of contradiction that H(Pn) ≤ (1 − ϵ) n log|X| for some constant ϵ. Let ϵ′ < ϵ and define An ≜ {x^n ∈ X^n : Pn(x^n) ≥ |X|^{−(1−ϵ′)n}}. Then |An| ≤ |X|^{(1−ϵ′)n} and hence Qn(An) ≤ |X|^{−ϵ′n}. Since Pn ◁ Qn, we have Pn(An) → 0. On the other hand, H(Pn) ≥ E_{Pn}[log(1/Pn) 1_{A_n^c}] ≥ (1 − ϵ′) n log|X| · Pn(A_n^c). Thus Pn(A_n^c) ≤ (1 − ϵ)/(1 − ϵ′), which is a contradiction.
Remark 6.2. It is natural to ask whether Theorem 6.10 holds for non-uniform Qn, that is, whether Pn ◁▷ Qn implies H(Pn) = H(Qn) + o(n). This turns out to be false. To see this, choose any μn, νn and set Pn ≜ (1/2)μn + (1/2)νn and Qn ≜ (1/3)μn + (2/3)νn. Then we always have Pn ◁▷ Qn since 3/4 ≤ dPn/dQn ≤ 3/2. Using conditional entropy, it is clear that H(Pn) = (1/2)H(μn) + (1/2)H(νn) + O(1) and H(Qn) = (1/3)H(μn) + (2/3)H(νn) + O(1). Choosing, say, μn = Ber(1/2)^⊗n and νn = Ber(1/3)^⊗n leads to |H(Pn) − H(Qn)| = Ω(n).
Example 6.3 (Gaussian processes). Consider X, N two stationary Gaussian processes, independent
of each other. Assume that their auto-covariance functions are absolutely summable and thus there
exist continuous power spectral density functions fX and fN . Without loss of generality, assume all
means are zero. Let cX (k) = E [X1 Xk+1 ]. Then fX is the Fourier transform of the auto-covariance
function cX, i.e., fX(ω) = Σ_{k=−∞}^∞ cX(k) e^{iωk}. Finally, assume fN ≥ δ > 0. Then recall from Example 3.4:
  I(X^n; X^n + N^n) = (1/2) log( det(ΣXn + ΣNn) / det ΣNn )
  = (1/2) Σ_{i=1}^n log σi − (1/2) Σ_{i=1}^n log λi,
where σj , λj are the eigenvalues of the covariance matrices ΣYn = ΣXn + ΣNn and ΣNn , which are
all Toeplitz matrices, e.g., (ΣXn )ij = E [Xi Xj ] = cX (i − j). By Szegö’s theorem [146, Sec. 5.2]:
  (1/n) Σ_{i=1}^n log σi → (1/2π) ∫_0^{2π} log fY(ω) dω    (6.20)
Note that cY (k) = E [(X1 + N1 )(Xk+1 + Nk+1 )] = cX (k) + cN (k) and hence fY = fX + fN . Thus, we
have
  (1/n) I(X^n; X^n + N^n) → I(X; X + N) = (1/4π) ∫_0^{2π} log( (fX(ω) + fN(ω)) / fN(ω) ) dω.
Maximizing this over fX (ω) leads to the famous water-filling solution f∗X (ω) = |T − fN (ω)|+ .
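The following Python sketch checks the spectral formula numerically against the finite-n Toeplitz computation, using a made-up geometrically decaying auto-covariance for X and white noise N.

import numpy as np

def toeplitz_cov(c, n):
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return c[idx]

n = 400
k = np.arange(n)
c_X = 2.0 * 0.8 ** k              # c_X(k) = 2 * 0.8^|k| (made-up, AR(1)-like)
c_N = np.zeros(n); c_N[0] = 1.0   # white noise, f_N = 1

Sigma_X = toeplitz_cov(c_X, n)
Sigma_N = toeplitz_cov(c_N, n)
_, logdet_Y = np.linalg.slogdet(Sigma_X + Sigma_N)
_, logdet_N = np.linalg.slogdet(Sigma_N)
rate_finite = 0.5 * (logdet_Y - logdet_N) / n       # (1/n) I(X^n; X^n + N^n) in nats

omega = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
f_X = np.real(sum(c_X[abs(j)] * np.exp(1j * omega * j) for j in range(-n + 1, n)))
f_N = np.ones_like(omega)
rate_spectral = np.mean(0.5 * np.log((f_X + f_N) / f_N))  # = (1/4 pi) * integral

print(rate_finite, rate_spectral)                   # should be close for large n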
7 f-divergences
In Chapter 2 we introduced the KL divergence that measures the dissimilarity between two dis-
tributions. This turns out to be a special case of the family of f-divergence between probability
distributions, introduced by Csiszár [79]. Like KL-divergence, f-divergences satisfy a number of
useful properties:
The purpose of this chapter is to establish these properties and prepare the ground for appli-
cations in subsequent chapters. The important highlight is a joint range Theorem of Harremoës
and Vajda [155], which gives the sharpest possible comparison inequality between arbitrary f-
divergences (and puts an end to a long sequence of results starting from Pinsker’s inequality –
Theorem 7.9). This material can be skimmed on the first reading and referenced later upon need.
with the agreement that if P[q = 0] = 0 the last term is taken to be zero regardless of the value of
f′ (∞) (which could be infinite).
Remark 7.1. For the discrete case, with Q(x) and P(x) being the respective pmfs, we can also write
  Df(P ‖ Q) = Σ_x Q(x) f( P(x)/Q(x) ),
where we adopt the conventions
• f(0) = f(0+),
• 0·f(0/0) = 0, and
• 0·f(a/0) = lim_{x↓0} x f(a/x) = a f′(∞) for a > 0.
Remark 7.2. A nice property of Df (PkQ) is that the definition is invariant to the choice of the
dominating measure μ in (7.2). This is not the case for other dissimilarity measures, e.g., the squared L²-distance between the densities ‖p − q‖²_{L²(dμ)}, which is a popular loss function for density estimation in the statistics literature.
The following are common f-divergences:
Note that we can also choose f(x) = x2 − 1. Indeed, f’s differing by a linear term lead to the
same f-divergence, cf. Proposition 7.2.
• Squared Hellinger distance: f(x) = (1 − √x)²,
  H²(P, Q) ≜ E_Q[ (1 − √(dP/dQ))² ] = ∫ (√dP − √dQ)² = 2 − 2 ∫ √(dP dQ).    (7.5)
¹ In (7.3), ∫ d(P ∧ Q) is the usual shorthand for ∫ (dP/dμ ∧ dQ/dμ) dμ, where μ is any dominating measure. The expressions in (7.4) and (7.5) are understood in a similar sense.
Here the quantity B(P, Q) ≜ ∫ √(dP dQ) is known as the Bhattacharyya coefficient (or Hellinger affinity) [33]. Note that H(P, Q) = √(H²(P, Q)) defines a metric on the space of probability distributions: indeed, the triangle inequality follows from that of L²(μ) for a common dominating measure. Note, however, that (P, Q) ↦ H(P, Q) is not convex. (This is because the metric H is not induced by a Banach norm on the space of measures.)
• Le Cam distance [193, p. 47]: f(x) = (1 − x)²/(2x + 2),
  LC(P, Q) = (1/2) ∫ (dP − dQ)²/(dP + dQ).    (7.6)
Moreover, √(LC(P, Q)) is a metric on the space of probability distributions [114].
• Jensen-Shannon divergence: f(x) = x log(2x/(x+1)) + log(2/(x+1)),
  JS(P, Q) = D( P ‖ (P+Q)/2 ) + D( Q ‖ (P+Q)/2 ).    (7.7)
Moreover, √(JS(P, Q)) is a metric on the space of probability distributions [114].
Moreover, JS(PkQ) is a metric on the space of probability distributions [114].
Remark 7.3. If Df (PkQ) is an f-divergence, then it is easy to verify that Df (λP + λ̄QkQ) and
Df (PkλP + λ̄Q) are f-divergences for all λ ∈ [0, 1]. In particular, Df (QkP) = Df̃ (PkQ) with
f̃(x) ≜ x f(1/x).
We start by summarizing some formal observations about f-divergences:
the latter referred to as the conditional f-divergence (similar to Definition 2.12 for conditional
KL divergence).
5 If PX,Y = PX PY|X and QX,Y = QX PY|X then
In particular,
Df ( P X P Y k QX P Y ) = Df ( P X k QX ) . (7.10)
In particular, we can always assume that f ≥ 0 and (if f is differentiable at 1) that f′ (1) = 0.
Proof. The first and second are clear. For the third property, verify explicitly that Df (PkQ) = 0
for f = c(x − 1). Next consider general f and observe that for P ⊥ Q, by definition we have
which is well-defined (i.e., ∞ − ∞ is not possible) since by convexity f(0) > −∞ and f′ (∞) >
−∞. So all we need to verify is that f(0) + f′ (∞) = 0 if and only if f = c(x − 1) for some c ∈ R.
Indeed, since f(1) = 0, the convexity of f implies that x ↦ g(x) ≜ f(x)/(x − 1) is non-decreasing. By assumption, we have g(0+) = g(∞) and hence g(x) is constant on x > 0, as desired.
For property 4, let RY|X = (1/2)PY|X + (1/2)QY|X. By Theorem 2.10 there exist jointly measurable p(y|x) and q(y|x) such that dPY|X=x = p(y|x)dRY|X=x and dQY|X=x = q(y|x)dRY|X=x. We can then take μ in (7.2) to be μ = PX RY|X, which gives dPX,Y = p(y|x)dμ and dQX,Y = q(y|x)dμ, and thus
  Df(PX,Y ‖ QX,Y)
  = ∫_{X×Y} dμ 1{q(y|x) > 0} q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{X×Y} dμ 1{q(y|x) = 0} p(y|x)
  = ∫_X dPX [ ∫_{y: q(y|x)>0} dRY|X=x q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{y: q(y|x)=0} dRY|X=x p(y|x) ],
where by (7.2) the bracketed term equals Df(PY|X=x ‖ QY|X=x).
Proof. Note that in the case PX,Y ≪ QX,Y (and thus PX ≪ QX), the proof is a simple application of Jensen’s inequality to definition (7.1):
  Df(PX,Y ‖ QX,Y) = E_{X∼QX} E_{Y∼QY|X}[ f( (dPY|X/dQY|X) · (dPX/dQX) ) ]
                  ≥ E_{X∼QX}[ f( E_{Y∼QY|X}[ dPY|X/dQY|X ] · (dPX/dQX) ) ]
                  = E_{X∼QX}[ f( dPX/dQX ) ].
To prove the general case we need to be more careful. Let RX = (1/2)(PX + QX) and RY|X = (1/2)PY|X + (1/2)QY|X. It should be clear that PX,Y, QX,Y ≪ RX,Y ≜ RX RY|X and that for every x: PY|X=x, QY|X=x ≪ RY|X=x. By Theorem 2.10 there exist measurable functions p1, p2, q1, q2 so that dPX = p1(x)dRX, dQX = q1(x)dRX, and dPY|X=x = p2(y|x)dRY|X=x, dQY|X=x = q2(y|x)dRY|X=x. We also denote p(x, y) = p1(x)p2(y|x), q(x, y) = q1(x)q2(y|x).
Fix t > 0 and consider a supporting line to f at t with slope μ, so that
f( u) ≥ f( t) + μ ( u − t) , ∀u ≥ 0 .
Note that we added t = 0 case as well, since for t = 0 the statement is obvious (recall, though,
that f(0) ≜ f(0+) can be equal to +∞).
Next, fix some x with q1(x) > 0 and consider the chain
  ∫_{y: q2(y|x)>0} dRY|X=x q2(y|x) f( (p1(x) p2(y|x)) / (q1(x) q2(y|x)) ) + (p1(x)/q1(x)) PY|X=x[q2(Y|x) = 0] f′(∞)
  (a)≥ f( (p1(x)/q1(x)) PY|X=x[q2(Y|x) > 0] ) + (p1(x)/q1(x)) PY|X=x[q2(Y|x) = 0] f′(∞)
  (b)≥ f( p1(x)/q1(x) ),
where (a) is by Jensen’s inequality and the convexity of f, and (b) by taking t = p1(x)/q1(x) and λ = PY|X=x[q2(Y|x) > 0] in (7.13). Now multiplying the obtained inequality by q1(x) and integrating over {x : q1(x) > 0} we get
  ∫_{q>0} dRX,Y q(x, y) f( p(x, y)/q(x, y) ) + f′(∞) PX,Y[q1(X) > 0, q2(Y|X) = 0]
  ≥ ∫_{q1>0} dRX q1(x) f( p1(x)/q1(x) ).
Adding f′(∞) PX[q1(X) = 0] to both sides we obtain (7.12), since both sides then evaluate to definition (7.2).
Theorem 7.4 (Data processing). Consider a channel that produces Y given X based on the
conditional law PY|X (shown below).
  PX → [PY|X] → PY,    QX → [PY|X] → QY
Let PY (resp. QY) denote the distribution of Y when X is distributed as PX (resp. QX). For any f-divergence Df(·‖·),
  Df(PY ‖ QY) ≤ Df(PX ‖ QX).
Next we discuss some of the more useful properties of f-divergence that parallel those of KL
divergence in Theorem 2.14:
  PX → [PY|X] → PY,    PX → [QY|X] → QY
Then
  Df(PY ‖ QY) ≤ Df(PY|X ‖ QY|X | PX).
Proof. (a) Non-negativity follows from monotonicity by taking X to be unary. To show strict
positivity, suppose for the sake of contradiction that Df (PkQ) = 0 for some P 6= Q. Then
there exists some measurable A such that p = P(A) 6= q = Q(A) > 0. Applying the data
2
By strict convexity at 1, we mean for all s, t ∈ [0, ∞) and α ∈ (0, 1) such that αs + ᾱt = 1, we have
αf(s) + (1 − α)f(t) > f(1).
Remark 7.4 (Strict convexity). Note that even when f is strictly convex at 1, the map (P, Q) 7→
Df (PkQ) may not be strictly convex (e.g. TV(Ber(p), Ber(q)) = |p − q| is piecewise linear).
However, if f is strictly convex everywhere on R+ then so is Df . Indeed, if P 6= Q, then there
exists E such that P(E) 6= Q(E). By the DPI and the strict convexity of f, we have Df (PkQ) ≥
Df (Ber(P(E))kBer(Q(E))) > 0. Strict convexity of f is also related to other desirable properties
of If (X; Y), see Ex. I.31.
Remark 7.5 (g-divergences). We note that, more generally, we may call functional D(PkQ) a
“g-divergence”, or a generalized dissimilarity measure, if it satisfies the following properties: pos-
itivity, monotonicity, data processing inequality (DPI), conditioning increases divergence (CID)
and convexity in the pair. As we have seen in the proof of Theorem 5.1 the latter two are exactly
equivalent. Furthermore, our proof demonstrated that DPI and CID are both implied by monotonic-
ity. If D(PkP) = 0 then monotonicity, as in (7.12), also implies positivity by taking X to be unary.
Finally, notice that DPI also implies monotonicity by applying it to the (deterministic) channel
(X, Y) 7→ X. Thus, requiring DPI (or monotonicity) for D automatically implies all the other main
properties. We remark also that there exist g-divergences which are not monotone transformations
of any f-divergence, cf. [243, Section V]. On the other hand, for finite alphabets, [232] shows that any D(P‖Q) = Σ_i ϕ(Pi, Qi) is a g-divergence iff it is an f-divergence.
The following convenient property, a counterpart of Theorem 4.5, allows us to reduce any gen-
eral problem about f-divergences to the problem on finite alphabets. The proof is in Section 7.14*.
Theorem 7.6. Let P, Q be two probability measures on X with σ-algebra F. Given a finite F-measurable partition E = {E1, . . . , En}, define the distributions PE and QE on [n] by PE(i) = P[Ei] and QE(i) = Q[Ei]. Then
  Df(P ‖ Q) = sup_E Df(PE ‖ QE),
where the supremum is over all finite F-measurable partitions E.
is minimized.
In this section we first show that optimization over ϕ naturally leads to the concept of TV. Subsequently, we will see that asymptotic considerations (when P and Q are replaced with P⊗n and Q⊗n) lead to H². We start with the former case.
where joint distributions PX,Y with the property PX = P and PY = Q are called couplings of P and Q.
³ The extension of (7.17) from simple to composite hypothesis testing is in (32.24).
which establishes that the second supremum in (7.16) lower bounds TV, and hence (by taking
f(x) = 2 · 1E (x) − 1) so does the first. For the other direction, let E = {x : p(x) > q(x)} and notice
  0 = ∫ (p(x) − q(x)) dμ = ( ∫_E + ∫_{E^c} ) (p(x) − q(x)) dμ,
implying that ∫_{E^c} (q(x) − p(x)) dμ = ∫_E (p(x) − q(x)) dμ. But the sum of these two integrals precisely equals 2 · TV, which implies that this choice of E attains equality in (7.16).
For the inf-representation, we notice that given a coupling PX,Y, for any ‖f‖∞ ≤ 1, we have
which, in view of (7.16), shows that the inf-representation is always an upper bound. To show that this bound is tight one constructs X, Y as follows: with probability π ≜ ∫ min(p(x), q(x)) dμ we take X = Y = c with c sampled from the distribution with density r(x) = (1/π) min(p(x), q(x)), whereas with probability 1 − π we take X, Y sampled independently from the distributions p1(x) = (1/(1−π))(p(x) − min(p(x), q(x))) and q1(x) = (1/(1−π))(q(x) − min(p(x), q(x))), respectively. The result follows upon verifying that this PX,Y indeed defines a coupling of P and Q and applying the last identity of (7.3).
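The construction in the last paragraph is easy to implement. Here is a minimal Python sketch for two made-up pmfs on a three-letter alphabet, verifying that the sampled coupling attains P[X ≠ Y] = TV(P, Q).

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])

m = np.minimum(p, q)
pi = m.sum()                      # probability of the "X = Y" component; pi = 1 - TV
tv = 0.5 * np.abs(p - q).sum()

n = 200_000
same = rng.random(n) < pi
x = np.empty(n, dtype=int)
y = np.empty(n, dtype=int)
# with prob. pi: X = Y = c, where c ~ r = m / pi
x[same] = y[same] = rng.choice(3, size=same.sum(), p=m / pi)
# with prob. 1 - pi: X ~ (p - m)/(1 - pi) and Y ~ (q - m)/(1 - pi), independently
x[~same] = rng.choice(3, size=(~same).sum(), p=(p - m) / (1 - pi))
y[~same] = rng.choice(3, size=(~same).sum(), p=(q - m) / (1 - pi))

print(tv, np.mean(x != y))        # the two numbers should agree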
Remark 7.6 (Variational representation). The sup-representation (7.16) of the total variation will
be extended to general f-divergences in Section 7.13. In turn, the inf-representation (7.18) has
no analogs for other f-divergences, with the notable exception of Marton’s d2 , see Remark 7.15.
Distances defined via inf-representations over couplings are often called Wasserstein distances,
and hence we may think of TV as the Wasserstein distance with respect to Hamming distance
d(x, x′ ) = 1{x 6= x′ } on X . The benefit of variational representations is that choosing a particular
coupling in (7.18) gives an upper bound on TV(P, Q), and choosing a particular f in (7.16) yields
a lower bound.
Of particular relevance is the special case of testing with multiple observations, where the data
X = (X1 , . . . , Xn ) are i.i.d. drawn from either P or Q. In other words, the goal is to test
H0 : X ∼ P⊗n vs H1 : X ∼ Q⊗n .
By Theorem 7.7, the optimal total probability of error is given by 1 − TV(P⊗n , Q⊗n ). By the data
processing inequality, TV(P⊗n , Q⊗n ) is a non-decreasing sequence in n (and bounded by 1 by
definition) and hence converges. One would expect that as n → ∞, TV(P⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test vanishes. It turns out that for fixed
distributions P 6= Q, large deviation theory (see Chapter 16) shows that TV(P⊗n , Q⊗n ) indeed
converges to one as n → ∞ and, in fact, exponentially fast:
where the exponent C(P, Q) > 0 is known as the Chernoff Information of P and Q given in (16.2).
However, as frequently encountered in high-dimensional statistical problems, if the distributions
P = Pn and Q = Qn depend on n, then the large-deviation asymptotics in (7.19) can no longer be
directly applied. Since computing the total variation between two n-fold product distributions is
typically difficult, understanding how a more tractable f-divergence is related to the total variation
may give insight on its behavior. It turns out Hellinger distance is precisely suited for this task.
Shortly, we will show the following relation between TV and the Hellinger divergence:
  (1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √(1 − H²(P, Q)/4) ≤ 1.    (7.20)
Direct consequences of the bound (7.20) are:
• H2 (P, Q) = 2, if and only if TV(P, Q) = 1. In this case, the probability of error is zero since
essentially P and Q have disjoint supports.
• H2 (P, Q) = 0 if and only if TV(P, Q) = 0. In this case, the smallest total probability of error is
one, meaning the best test is random guessing.
• Hellinger consistency is equivalent to TV consistency: we have
Proof. For convenience, let X1, X2, . . . , Xn be i.i.d. ∼ Qn. Then
  H²(Pn^{⊗n}, Qn^{⊗n}) = 2 − 2 E[ √( ∏_{i=1}^n (Pn/Qn)(Xi) ) ]
                       = 2 − 2 ∏_{i=1}^n E[ √( (Pn/Qn)(Xi) ) ] = 2 − 2 ( E[ √(Pn/Qn) ] )^n
                       = 2 − 2 ( 1 − (1/2) H²(Pn, Qn) )^n.    (7.23)
We now use (7.23) to conclude the proof. Recall from (7.21) that TV(Pn^{⊗n}, Qn^{⊗n}) → 0 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 0, which happens precisely when H²(Pn, Qn) = o(1/n). Similarly, by (7.22), TV(Pn^{⊗n}, Qn^{⊗n}) → 1 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 2, which is further equivalent to H²(Pn, Qn) = ω(1/n).
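Identity (7.23) is easy to check numerically; the following Python sketch enumerates the product space for a made-up pair of pmfs and small n.

import numpy as np
from itertools import product

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

def hellinger_sq(a, b):
    return float(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

h2 = hellinger_sq(p, q)
for n in (1, 2, 3, 4):
    pn = np.array([np.prod(p[list(idx)]) for idx in product(range(3), repeat=n)])
    qn = np.array([np.prod(q[list(idx)]) for idx in product(range(3), repeat=n)])
    lhs = hellinger_sq(pn, qn)               # H^2 of the n-fold products
    rhs = 2 - 2 * (1 - h2 / 2) ** n          # right-hand side of (7.23)
    print(n, lhs, rhs)                       # the two columns should agree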
While some other f-divergences also satisfy tensorization, see Section 7.12, the H2 has the advan-
tage of a sandwich bound (7.20) making it the most convenient tool for checking asymptotic
testability of hypotheses.
Remark 7.8 (Kakutani’s dichotomy). Let P = ∏_{i≥1} Pi and Q = ∏_{i≥1} Qi, where Pi ≪ Qi. Kakutani’s theorem shows the following dichotomy between these two distributions on the infinite sequence space:
• If Σ_{i≥1} H²(Pi, Qi) = ∞, then P and Q are mutually singular.
• If Σ_{i≥1} H²(Pi, Qi) < ∞, then P and Q are equivalent (i.e. absolutely continuous with respect to each other).
In the Gaussian case, say, Pi = N(μi, 1) and Qi = N(0, 1), the equivalence condition simplifies to Σ μi² < ∞.
To understand Kakutani’s criterion, note that by the tensorization property (7.24), we have
  H²(P, Q) = 2 − 2 ∏_{i≥1} ( 1 − H²(Pi, Qi)/2 ).
Thus, if ∏_{i≥1} (1 − H²(Pi, Qi)/2) = 0, or equivalently, Σ_{i≥1} H²(Pi, Qi) = ∞, then H²(P, Q) = 2, which, by (7.20), is equivalent to TV(P, Q) = 1 and hence P ⊥ Q. If Σ_{i≥1} H²(Pi, Qi) < ∞, then H²(P, Q) < 2. To conclude the equivalence between P and Q, note that the likelihood ratio dP/dQ = ∏_{i≥1} dPi/dQi satisfies that either Q(dP/dQ = 0) = 0 or 1 by Kolmogorov’s 0-1 law. See [108, Theorem 5.3.5] for details.
Proof. It suffices to consider the natural logarithm for the KL divergence. First we show that, by
the data processing inequality, it suffices to prove the result for Bernoulli distributions. For any
event E, let Y = 1{X∈E} which is Bernoulli with parameter P(E) or Q(E). By the DPI, D(PkQ) ≥
d(P(E)kQ(E)). If Pinsker’s inequality holds for all Bernoulli distributions, we have
  √( (1/2) D(P‖Q) ) ≥ TV(Ber(P(E)), Ber(Q(E))) = |P(E) − Q(E)|.
Taking the supremum over E gives √( (1/2) D(P‖Q) ) ≥ sup_E |P(E) − Q(E)| = TV(P, Q), in view of Theorem 7.7.
The binary case follows easily from a second-order Taylor expansion (with integral remainder form) of p ↦ d(p‖q):
  d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)².
Pinsker’s inequality is sharp in the sense that the constant (2 log e) in (7.25) is not improvable, i.e., there exist {Pn, Qn}, e.g., Pn = Ber(1/2 + 1/n) and Qn = Ber(1/2), such that the ratio of the two sides of (7.25) tends to 1 as n → ∞.
(This is best seen by inspecting the local quadratic behavior in Proposition 2.19.) Nevertheless,
this does not mean that the inequality (7.25) is not improvable, as the RHS can be replaced by some
other function of TV(P, Q) with additional higher-order terms. Indeed, several such improvements
of Pinsker’s inequality are known. But what is the best inequality? In addition, another natural
question is the reverse inequality: can we upper-bound D(PkQ) in terms of TV(P, Q)? Settling
these questions rests on characterizing the joint range (the set of possible values) of a given pair of
f-divergences. This systematic approach to comparing f-divergences (as opposed to the ad hoc
proof of Theorem 7.9 we presented above) is the subject of this section.
Definition 7.10 (Joint range). Consider two f-divergences Df (PkQ) and Dg (PkQ). Their joint
range is a subset of [0, ∞]2 defined by
As an example, Fig. 7.1 gives the joint range R between the KL divergence and the total varia-
tion. By definition, the lower boundary of the region R gives the optimal refinement of Pinsker’s
inequality:
Also from Fig. 7.1 we see that it is impossible to bound D(PkQ) from above in terms of TV(P, Q)
due to the lack of upper boundary.
The joint range R may appear difficult to characterize since we need to consider P, Q over
all measurable spaces; on the other hand, the region Rk for small k is easy to obtain (at least
numerically). Revisiting the proof of Pinsker’s inequality in Theorem 7.9, we see that the key
step is the reduction to Bernoulli distributions. It is natural to ask: to obtain full joint range is it
possible to reduce to the binary case? It turns out that it is always sufficient to consider quaternary
distributions, or the convex hull of that of binary distributions.
Figure 7.1 Joint range of TV and KL divergence. The dashed line is the quadratic lower bound given by
Pinsker’s inequality (7.25).
R = co(R2 ) = R4 .
where co denotes the convex hull with a natural extension of convex operations to [0, ∞]2 .
We will rely on the following famous result from convex analysis (cf. e.g. [110, Chapter 2,
Theorem 18]).
• Claim 1: co(R2 ) ⊂ R4 ;
• Claim 2: Rk ⊂ co(R2 );
• Claim 3: R = R4 .
Note that Claims 1-2 prove the most interesting part: ∪_{k=1}^∞ Rk = co(R2). Claim 3 is more technical and its proof can be found in [155]. However, the approximation result in Theorem 7.6 shows that R is the closure of ∪_{k=1}^∞ Rk. Thus for the purpose of obtaining inequalities between Df and Dg, Claims 1-2 are sufficient.
i i
i i
i i
We start with Claim 1. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space
X and given any α ∈ [0, 1], define two joint distributions of the random variables (X, B) where
PB = QB = Ber(α), PX|B=i = Pi and QX|B=i = Qi for i = 0, 1. Then by (7.8) we get
R2 = R̃2 ∪ {(pf′ (∞), pg′ (∞)) : p ∈ (0, 1]} ∪ {(qf(0), qg(0)) : q ∈ (0, 1]} ,
Since (0, 0) ∈ R̃2, we see that regardless of which of f(0), f′(∞), g(0), g′(∞) are infinite, the set R2 ∩ ℝ² is connected. Thus, by Lemma 7.12 any point in co(R2 ∩ ℝ²) is a combination of two points in R2 ∩ ℝ², which, by the argument above, is a subset of R4. Finally, it is not hard to see that co(R2) \ ℝ² ⊂ R4, which concludes the proof of co(R2) ⊂ R4.
Next, we prove Claim 2. Fix P, Q on [k] and denote their PMFs (pj ) and (qj ), respectively. Note
that without changing either Df (PkQ) or Dg (PkQ) (but perhaps, by increasing k by 1), we can
make qj > 0 for j > 1 and q1 = 0, which we thus assume. Denote ϕj = pj/qj for j > 1 and consider the set
  S = { Q̃ = (q̃j)_{j∈[k]} : q̃j ≥ 0, Σ_j q̃j = 1, q̃1 = 0, Σ_{j=2}^k q̃j ϕj ≤ 1 }.
affinely maps S to [0, ∞] (note that f(0) or f′ (∞) can equal ∞). In particular, if we denote P̃i =
P̃(Q̃i ) corresponding to Q̃i in decomposition (7.26), we get
  Df(P ‖ Q) = Σ_{i=1}^m αi Df(P̃i ‖ Q̃i),
and similarly for Dg (PkQ). We are left to show that (P̃i , Q̃i ) are supported on at most two points,
which verifies that any element of Rk is a convex combination of k elements of R2 . Indeed, for
Q̃ ∈ Se the set {j ∈ [k] : q̃j > 0 or p̃j > 0} has cardinality at most two (for the second type
extremal points we notice p̃j1 + p̃j2 = 1 implying p̃1 = 0). This concludes the proof of Claim
2.
shown as non-convex grey region in Fig. 7.2. By Theorem 7.11, their full joint range R is the
convex hull of R2 , which turns out to be exactly described by the sandwich bound (7.20) shown
earlier in Section 7.3. This means that (7.20) is not improvable. Indeed, with t ranging from 0 to
1,
• the upper boundary is achieved by P = Ber((1+t)/2), Q = Ber((1−t)/2),
• the lower boundary is achieved by P = (1 − t, t, 0), Q = (1 − t, 0, t).
Figure 7.2 The joint range R of TV and H2 is characterized by (7.20), which is the convex hull of the grey
region R2 .
where we take the natural logarithm. Here is a corollary (weaker bound) due to [316]:
  D(P‖Q) ≥ log( (1 + TV(P, Q))/(1 − TV(P, Q)) ) − 2TV(P, Q)/(1 + TV(P, Q)).    (7.28)
Both bounds are stronger than Pinsker’s inequality (7.25). Note the following consequences:
where the function f is a convex increasing bijection of [0, 1) onto [0, ∞). Furthermore, for every
s ≥ f(t) there exists a pair of distributions such that χ2 (PkQ) = s and TV(P, Q) = t.
  TV(Ber(p), Ber(q)) = |p − q| ≜ t,    χ²(Ber(p)‖Ber(q)) = (p − q)²/(q(1 − q)) = t²/(q(1 − q)).
Given |p − q| = t, let us determine the possible range of q(1 − q). The smallest value of q(1 − q) is always 0, by choosing p = t, q = 0. The largest value is 1/4 if t ≤ 1/2 (by choosing p = 1/2 − t, q = 1/2). If t > 1/2 then we can at most get t(1 − t) (by setting p = 0 and q = t). Thus we get χ²(Ber(p)‖Ber(q)) ≥ f(|p − q|) as claimed. The convexity of f follows since its derivative is monotonically increasing. Clearly, f(t) ≥ 4t² because t(1 − t) ≤ 1/4.
• KL vs TV: see (7.27). For discrete distributions there is a partial comparison in the other direction (“reverse Pinsker”, cf. [272, Section VI]):
  D(P‖Q) ≤ log( 1 + (2/Qmin) TV(P, Q)² ) ≤ (2 log e/Qmin) TV(P, Q)²,    Qmin = min_x Q(x).
• KL vs Hellinger:
  D(P‖Q) ≥ 2 log( 2/(2 − H²(P, Q)) ).    (7.30)
This is tight at P = Ber(0), Q = Ber(q). For a fixed H², in general D(P‖Q) has no finite upper bound, as seen from P = Ber(p), Q = Ber(0). Therefore (7.30) gives the joint range.
There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami-Beckner semigroup, cf. [89, Theorem A.1]):
  D(P‖Q) ≤ ( log(1/Qmin − 1)/(1 − 2Qmin) ) ( 1 − (1 − H²(P, Q))² ),    Qmin = min_x Q(x).
• χ2 and TV: The full joint range is given by (7.29). Two simple consequences are:
  TV(P, Q) ≤ (1/2) √(χ²(P‖Q)),    (7.34)
  TV(P, Q) ≤ max{ 1/2, χ²(P‖Q)/(1 + χ²(P‖Q)) },    (7.35)
where the second is useful for bounding TV away from one.
• JS and TV: The full joint region is given by
  2 d( (1 − TV(P, Q))/2 ‖ 1/2 ) ≤ JS(P, Q) ≤ TV(P, Q) · 2 log 2.    (7.36)
The lower bound is a consequence of Fano’s inequality. For the upper bound notice that for p, q ∈ [0, 1] and |p − q| = τ the maximum of d(p ‖ (p+q)/2) is attained at p = 0, q = τ (from the convexity of d(·‖·)) and, thus, the binary joint range is given by τ ↦ d(τ‖τ/2) + d(1 − τ ‖ 1 − τ/2). Since the latter is convex, its concave envelope is a straight line connecting the endpoints at τ = 0 and τ = 1.
1 Total variation:
  TV(N(0, σ²), N(μ, σ²)) = 2Φ(|μ|/(2σ)) − 1 = ∫_{−|μ|/(2σ)}^{|μ|/(2σ)} φ(x) dx = |μ|/(√(2π) σ) + O(μ²), μ → 0.    (7.37)
2 Hellinger distance:
  H²(N(0, σ²) ‖ N(μ, σ²)) = 2 − 2 e^{−μ²/(8σ²)} = μ²/(4σ²) + O(μ³), μ → 0.    (7.38)
More generally,
  H²(N(μ1, Σ1) ‖ N(μ2, Σ2)) = 2 − 2 ( |Σ1|^{1/4} |Σ2|^{1/4} / |Σ̄|^{1/2} ) exp{ −(1/8) (μ1 − μ2)′ Σ̄^{−1} (μ1 − μ2) },
where Σ̄ = (Σ1 + Σ2)/2.
3 KL divergence:
  D(N(μ1, σ1²) ‖ N(μ2, σ2²)) = (1/2) log(σ2²/σ1²) + (1/2)( (μ1 − μ2)²/σ2² + σ1²/σ2² − 1 ) log e.    (7.39)
Proof. Note that If (U; X) = Df (PU,X kPU PX ) ≥ Df (PU,Y kPU PY ) = If (U; Y), where we
applied the data-processing Theorem 7.4 to the (possibly stochastic) map (U, X) 7→ (U, Y). See
also Remark 3.4.
is not subadditive.
In other words, an additional observation does not improve TV-information at all. This is the
main reason for the famous herding effect in economics [20].
3 The symmetric KL-divergence4 ISKL (X; Y) ≜ D(PX,Y kPX PY ) + D(PX PY kPX,Y ) satisfies, quite
amazingly [189], the additivity property:
Let us prove this in the discrete case. First notice the following equivalent expression for ISKL :
  ISKL(X; Y) = Σ_{x,x′} PX(x) PX(x′) D(PY|X=x ‖ PY|X=x′).    (7.46)
From (7.46) we get (7.45) by the additivity D(PA,B|X=x ‖ PA,B|X=x′) = D(PA|X=x ‖ PA|X=x′) + D(PB|X=x ‖ PB|X=x′). To prove (7.46) first consider the obvious identity:
  Σ_{x,x′} PX(x) PX(x′) [ D(PY ‖ PY|X=x′) − D(PY ‖ PY|X=x) ] = 0,
which is rewritten as
  Σ_{x,x′} PX(x) PX(x′) Σ_y PY(y) log( PY|X(y|x)/PY|X(y|x′) ) = 0.    (7.47)
Next, by definition,
  ISKL(X; Y) = Σ_{x,y} [PX,Y(x, y) − PX(x)PY(y)] log( PX,Y(x, y)/(PX(x)PY(y)) ).
Since the marginals of PX,Y and PX PY coincide, we can replace log( PX,Y(x, y)/(PX(x)PY(y)) ) = log( PY|X(y|x)/PY(y) ) by log( PY|X(y|x)/f(y) ) for any f. We choose f(y) = PY|X(y|x′) to get
  ISKL(X; Y) = Σ_{x,y} [PX,Y(x, y) − PX(x)PY(y)] log( PY|X(y|x)/PY|X(y|x′) ).
Now averaging this over PX(x′) and applying (7.47) to get rid of the second term in [· · ·], we obtain (7.46). For another interesting property of ISKL, see Ex. I.43.
4
This is the f-information corresponding to the Jeffreys divergence D(PkQ) + D(QkP).
  P̂n = (1/n) Σ_{i=1}^n δ_{Xi}
denote the empirical distribution corresponding to this sample. Let PY = PY|X ◦ PX be the output distribution corresponding to PX and PY|X ◦ P̂n be the output distribution corresponding to P̂n (a random distribution). Note that when PY|X=x(·) = ϕ(· − x), where ϕ is a fixed density, we can think of PY|X ◦ P̂n as a kernel density estimator (KDE), whose density is p̂n(x) = (ϕ ∗ P̂n)(x) = (1/n) Σ_{i=1}^n ϕ(x − Xi). Furthermore, using the fact that E[PY|X ◦ P̂n] = PY, we have
where the first term represents the bias of the KDE due to convolution and increases with band-
width of ϕ, while the second term represents the variability of the KDE and decreases with the
bandwidth of ϕ. Surprisingly, the second term is sharply (within a factor of two) given by the Iχ² information. More exactly, we prove the following result.
In Section 25.4* we will discuss an extension of this simple bound, in particular showing that
in many cases about n = exp{I(X; Y) + K} samples are sufficient to get e−O(K) bound on D(PY|X ◦
P̂n kPY ).
  I(X^n; Ȳ) ≥ Σ_{i=1}^n I(Xi; Ȳ) = n I(X1; Ȳ),
and apply the local expansion of KL divergence (Proposition 2.19) to get (7.49).
In the discrete case, by taking PY|X to be the identity (Y = X) we obtain the following guarantee
on the closeness between the empirical and the population distribution. This fact can be used to
test whether the sample was truly generated by the distribution PX .
Otherwise, we have
  E[D(P̂n ‖ PX)] ≤ (|X| − 1) (log e)/n.    (7.52)
  lim_{n→∞} n E[D(P̂n ‖ PX)] = (|supp(PX)| − 1) (log e)/2.    (7.53)
See Lemma 13.2 below.
Corollary 7.16 is also useful for the statistical application of entropy estimation. Given n iid
samples, a natural estimator of the entropy of PX is the empirical entropy Ĥemp = H(P̂n ) (plug-in
estimator). It is clear that empirical entropy is an underestimate, in the sense that the bias
is always non-negative. For fixed PX , Ĥemp is known to be consistent even on countably infinite
alphabets [14], although the convergence rate can be arbitrarily slow, which aligns with the con-
clusion of (7.51). However, for a large alphabet of size Θ(n), the upper bound (7.52) does not vanish (this is tight for, e.g., the uniform distribution). In this case, one needs to de-bias the empirical entropy (e.g. on the basis of (7.53)) or employ different techniques in order to achieve consistent estimation.
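A small Monte Carlo sketch in Python of the bias terms in (7.52)–(7.53), for a uniform PX on k symbols (k and the sample sizes are made up): n E[D(P̂n‖PX)] should approach (k − 1)/2 nats and stay below (k − 1) nats.

import numpy as np

rng = np.random.default_rng(0)
k = 5
p = np.full(k, 1.0 / k)

def kl_nats(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

for n in (10, 100, 1000):
    vals = []
    for _ in range(2000):
        counts = rng.multinomial(n, p)
        vals.append(kl_nats(counts / n, p))
    print(n, n * np.mean(vals), (k - 1) / 2, k - 1)   # middle value -> (k-1)/2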
Theorem 7.17. Suppose that Df(P‖Q) < ∞ and the derivative of f(x) at x = 1 exists. Then
  lim_{λ→0} (1/λ) Df(λP + λ̄Q ‖ Q) = (1 − P[supp(Q)]) f′(∞),
where as usual we take 0 · ∞ = 0.
Remark 7.10. Note that we do not need a separate theorem for Df (QkλP + λ̄Q) since the exchange
of arguments leads to another f-divergence with f(x) replaced by xf(1/x).
Proof. Without loss of generality we may assume f(1) = f′(1) = 0 and f ≥ 0. Then, decomposing P = μP1 + μ̄P0 with P0 ⊥ Q and P1 ≪ Q, we have
  (1/λ) Df(λP + λ̄Q ‖ Q) = μ̄ f′(∞) + (1/λ) ∫ dQ f( 1 + λ( μ dP1/dQ − 1 ) ).
Note that g(λ) = f(1 + λt) is positive and convex for every t ∈ R and hence (1/λ) g(λ) is monotonically decreasing to g′(0) = 0 as λ ↓ 0. Since for λ = 1 the integrand is assumed to be Q-integrable, the dominated convergence theorem applies and we get the result.
If χ²(P‖Q) < ∞, then Df(λ̄Q + λP ‖ Q) < ∞ for all 0 ≤ λ < 1 and
  lim_{λ→0} (1/λ²) Df(λ̄Q + λP ‖ Q) = (f′′(1)/2) χ²(P‖Q).    (7.54)
If χ²(P‖Q) = ∞ and f′′(1) > 0 then (7.54) also holds, i.e. Df(λ̄Q + λP ‖ Q) = ω(λ²).
Remark 7.11. Conditions of the theorem include D, DSKL, H², JS, LC and all Rényi-type divergences, with f(x) = (x^p − 1)/(p − 1), of orders p < 2. A similar result holds also for the case when f′′(x) → ∞ as x → +∞ (e.g. Rényi-type divergences with p > 2), but then we need to make extra assumptions in order to guarantee applicability of the dominated convergence theorem (often just the finiteness of Df(P‖Q) is sufficient).
Proof. Assuming that χ²(P‖Q) < ∞, we must have P ≪ Q and hence we can use (7.1) as the definition of Df. Note that under (7.1), without loss of generality we may assume f′(1) = f(1) = 0 (indeed, for that we can just add a multiple of (x − 1) to f(x), which does not change the value of Df(P‖Q)). From the Taylor expansion we have then
  f(1 + u) = u² ∫_0^1 (1 − t) f′′(1 + tu) dt.
We can see that with the exception of TV, other f-divergences behave quadratically under small
displacement t → 0. This turns out to be a general fact, and furthermore the coefficient in front
of t2 is given by the Fisher information (at t = 0). To proceed carefully, we need some technical
assumptions on the family Pt .
Definition 7.19 (Regular single-parameter families). Fix τ > 0, space X and a family Pt of
distributions on X , t ∈ [0, τ ). We define the following types of conditions that we call regularity
at t = 0:
(a) Pt (dx) = pt (x) μ(dx), for some measurable (t, x) 7→ pt (x) ∈ R+ and a fixed measure μ on X ;
(b0) There exists a measurable function (s, x) ↦ ṗs(x), s ∈ [0, τ), x ∈ X, such that for μ-almost every x0 we have ∫_0^τ |ṗs(x0)| ds < ∞ and
  pt(x0) = p0(x0) + ∫_0^t ṗs(x0) ds.    (7.57)
Furthermore, for μ-almost every x0 we have lim_{t↘0} ṗt(x0) = ṗ0(x0).
(b1) We have ṗt(x) = 0 whenever p0(x) = 0 and, furthermore,
  ∫_X μ(dx) sup_{0≤t<τ} (ṗt(x))²/p0(x) < ∞.    (7.58)
(c0) There exists a measurable function (s, x) ↦ ḣs(x), s ∈ [0, τ), x ∈ X, such that for μ-almost every x0 we have ∫_0^τ |ḣs(x0)| ds < ∞ and
  ht(x0) ≜ √(pt(x0)) = √(p0(x0)) + ∫_0^t ḣs(x0) ds.    (7.59)
Furthermore, for μ-almost every x0 we have lim_{t↘0} ḣt(x0) = ḣ0(x0).
(c1 ) The family of functions {(ḣt (x))2 : t ∈ [0, τ )} is uniformly μ-integrable.
Remark 7.12. Recall that the uniform integrability condition (c1 ) is implied by the following
stronger (but easier to verify) condition:
  ∫_X μ(dx) sup_{0≤t<τ} (ḣt(x))² < ∞.    (7.60)
Impressively, if one also assumes the continuous differentiability of ht then the uniform integrability condition becomes equivalent to the continuity of the Fisher information
  t ↦ JF(t) ≜ 4 ∫ μ(dx) (ḣt(x))².    (7.61)
Theorem 7.20. Let the family of distributions {Pt : t ∈ [0, τ )} satisfy the conditions (a), (b0 ) and
(b1 ) in Definition 7.19. Then we have
  D(Pt ‖ P0) = (log e/2) JF(0) t² + o(t²),    (7.63)
where JF(0) ≜ ∫_X μ(dx) (ṗ0(x))²/p0(x) < ∞ is the Fisher information at t = 0.
Proof. From assumption (b1) we see that for any x0 with p0(x0) = 0 we must have ṗt(x0) = 0 and thus pt(x0) = 0 for all t ∈ [0, τ). Hence, we may restrict all integrals below to the subset {x : p0(x) > 0}, on which the ratio (pt(x0) − p0(x0))²/p0(x0) is well-defined. Consequently, we have by (7.57)
  (1/t²) χ²(Pt ‖ P0) = (1/t²) ∫ μ(dx) (pt(x) − p0(x))²/p0(x)
                     = (1/t²) ∫ μ(dx) (1/p0(x)) ( t ∫_0^1 du ṗ_{tu}(x) )²
                    (a)= ∫ μ(dx) ∫_0^1 du1 ∫_0^1 du2 ṗ_{tu1}(x) ṗ_{tu2}(x) / p0(x).
The first fraction inside the bracket is between 0 and 1 and the second is bounded by sup_{0<t<τ} (ṗt(X)/p0(X))², which is P0-integrable by (b1). Thus, the dominated convergence theorem applies to the double integral
Remark 7.13. Theorem 7.20 extends to the case of multi-dimensional parameters as follows.
Define the Fisher information matrix at θ ∈ Rd :
  JF(θ) ≜ 4 ∫ μ(dx) ∇θ √(pθ(x)) ∇θ √(pθ(x))^⊤.    (7.67)
Then (7.62) becomes χ²(Pt ‖ P0) = t^⊤ JF(0) t + o(‖t‖²) as t → 0, and similarly for (7.63), which has previously appeared in (2.33).
Theorem 7.20 applies to many cases (e.g. to smooth subfamilies of exponential families, for
which one can take μ = P0 and p0 (x) ≡ 1), but it is not sufficiently general. To demonstrate the
issue, consider the following example.
Example 7.1 (Location families with compact support). We say that family Pt is a (scalar) location
family if X = R, μ = Leb and pt (x) = p0 (x − t). Consider the following example, for α > −1:
  p0(x) = Cα × { x^α for x ∈ [0, 1];  (2 − x)^α for x ∈ [1, 2];  0 otherwise },
with Cα chosen by normalization. Clearly, here condition (7.58) is not satisfied and both χ²(Pt‖P0) and D(Pt‖P0) are infinite for t > 0, since Pt is not absolutely continuous with respect to P0. But JF(0) < ∞ whenever α > 1 and thus one expects that a certain remedy should be possible. Indeed, one can compute those f-divergences that remain finite when Pt is not absolutely continuous with respect to P0 and find that for α > 1 they are quadratic in t. As an illustration, we have
  H²(Pt, P0) = Θ(t^{1+α}) for 0 ≤ α < 1,   Θ(t² log(1/t)) for α = 1,   Θ(t²) for α > 1,    (7.68)
as t → 0. This can be computed directly, or from a more general result of [162, Theorem VI.1.1].⁵
⁵ The statistical significance of this calculation is that if we were to estimate the location parameter t from n iid samples, then the precision δ*n of the optimal estimator up to constant factors is given by solving H²(P_{δ*n}, P0) ≍ 1/n, cf. [162, Chapter VI]. For α < 1 we have δ*n ≍ n^{−1/(1+α)}, which is notably better than the empirical mean estimator (attaining precision of only n^{−1/2}). For α = 1/2 this fact was noted by D. Bernoulli in 1777 as a consequence of his (newly proposed) maximum likelihood estimation.
The previous example suggests that quadratic behavior as t → 0 can hold even when Pt is not absolutely continuous with respect to P0, which is the case handled by the next (more technical) result, whose proof we placed in Section 7.14*. One can verify that condition (c1) is indeed satisfied for all α > 1 in Example 7.1, thus establishing the quadratic behavior. Also note that the stronger (7.60) only applies to α ≥ 2.
Theorem 7.21. Given a family of distributions {Pt : t ∈ [0, τ)} satisfying the conditions (a), (c0) and (c1) of Definition 7.19, we have
  χ²(Pt ‖ ϵ̄P0 + ϵPt) = t² ϵ̄² [ JF(0) + ((1 − 4ϵ)/ϵ) J#(0) ] + o(t²),  ∀ϵ ∈ (0, 1),    (7.69)
  H²(Pt, P0) = (t²/4) JF(0) + o(t²),    (7.70)
where JF(0) = 4 ∫ ḣ0² dμ < ∞ is the Fisher information and J#(0) = ∫ ḣ0² 1{h0=0} dμ can be called the Fisher defect at t = 0.
Example 7.2 (On Fisher defect). Note that in most cases of interest we will have the situation that t ↦ ht(x) is actually differentiable for all t in some two-sided neighborhood (−τ, τ) of 0. In such cases, h0(x) = 0 implies that t = 0 is a local minimum and thus ḣ0(x) = 0, implying that the defect J#(0) = 0. However, for other families this will not be so, sometimes even when pt(x) is smooth on t ∈ (−τ, τ) (but not ht). Here is such an example.
Consider Pt = Ber(t²). A straightforward calculation shows:
  χ²(Pt ‖ ϵ̄P0 + ϵPt) = (ϵ̄²/ϵ) t² + O(t⁴),    H²(Pt, P0) = 2(1 − √(1 − t²)) = t² + O(t⁴).
Note that if we view Pt as a family on t ∈ [0, τ ) for small τ , then all conditions (a), c0 ) and (c1 ) are
clearly satisfied (ḣt is bounded on t ∈ (−τ, τ )). We have JF (0) = 4 and J# (0) = 1 and thus (7.69)
recovers the correct expansion for χ2 and (7.70) for H2 .
Notice that the non-smoothness of ht only becomes visible if we extend the domain to t ∈
(−τ, τ ). In fact, this issue is not seen in terms of densities pt . Indeed, let us compute the density pt
and its derivative ṗt explicitly too:
  pt(x) = { 1 − t² for x = 0;  t² for x = 1 },    ṗt(x) = { −2t for x = 0;  2t for x = 1 }.
is discontinuous at t = 0. To make things worse, at t = 0 this expectation does not match our
definition of the Fisher information JF (0) in Theorem 7.21, and thus does not yield the correct
small-t behavior for either χ2 or H2 . In general, to avoid difficulties one should restrict to those
families with t 7→ ht (x) continuously differentiable in t ∈ (−τ, τ ).
where EQ [·] is understood as an f-divergence Df (PkQ) with f(x) = xλ , see Definition 7.1.
Conditional Rényi divergence is defined as
Numerous properties of Rényi divergences are known, see [319]. Here we only notice a few:
which is a simple consequence of (7.71). Dλ ’s are the only divergences satisfying DPI and
tensorization [222]. The most well-known special cases of (7.73) are for Hellinger distance,
see (7.24) and for χ2 :
  1 + χ²( ∏_{i=1}^n Pi ‖ ∏_{i=1}^n Qi ) = ∏_{i=1}^n ( 1 + χ²(Pi ‖ Qi) ).
We can also obtain additive bounds for non-product distributions, see Ex. I.32 and I.33.
The following consequence of the chain rule will be crucial in statistical applications later (see
Section 32.2, in particular, Theorem 32.7).
Proposition 7.23. Consider product channels PYn|Xn = ∏ PYi|Xi and QYn|Xn = ∏ QYi|Xi. We have (with all optimizations over all possible distributions)
  inf_{PXn,QXn} Dλ(PYn ‖ QYn) = Σ_{i=1}^n inf_{PXi,QXi} Dλ(PYi ‖ QYi),    (7.74)
  sup_{PXn,QXn} Dλ(PYn ‖ QYn) = Σ_{i=1}^n sup_{PXi,QXi} Dλ(PYi ‖ QYi) = Σ_{i=1}^n sup_{x,x′} Dλ(PYi|Xi=x ‖ QYi|Xi=x′).    (7.75)
Remark 7.14. The mnemonic for (7.76)-(7.77) is that “mixtures of products are less distinguishable than products of mixtures”. The former arise in statistical settings where iid observations are drawn from a single distribution whose parameter is drawn from a prior.
Proof. The second equality in (7.75) follows from the fact that Dλ is an increasing function
of an f-divergence, and thus maximization should be attained at an extreme point of the space
of probabilities, which are just the single-point masses. The main equalities (7.74)-(7.75) follow
from a) restricting optimizations to product distributions and invoking (7.73); and b) the chain rule
for Dλ . For example for n = 2, we fix PX2 and QX2 , which (via channels) induce joint distributions
PX2 ,Y2 and QX2 ,Y2 . Then we have
since PY1 |Y2 =y is a distribution induced by taking P̃X1 = PX1 |Y2 =y , and similarly for QY1 |Y2 =y′ . In
all, we get
  Dλ(PY1,Y2 ‖ QY1,Y2) = Dλ(PY2 ‖ QY2) + Dλ(PY1|Y2 ‖ QY1|Y2 | P(λ)_{Y2}) ≥ Σ_{i=1}^2 inf_{PXi,QXi} Dλ(PYi ‖ QYi),
Denote the domain of f* by dom(f*) ≜ {y : f*(y) < ∞}. Two important properties of the convex conjugates are
  f(x) + f*(y) ≥ xy.
Similarly, we can define a convex conjugate for any convex functional Ψ(P) defined on the space of measures, by setting
  Ψ*(g) = sup_P ∫ g dP − Ψ(P).    (7.79)
Under appropriate conditions (e.g. finite X ), biconjugation then yields the sought-after variational
representation
  Ψ(P) = sup_g ∫ g dP − Ψ*(g).    (7.80)
Next we compute these conjugates for Ψ(P) = Df(P‖Q). It turns out to be convenient to first extend the definition of Df(P‖Q) to all finite signed measures P and then compute the conjugate. To this end, let fext : R → R ∪ {+∞} be an extension of f, such that fext(x) = f(x) for x ≥ 0 and fext is convex on R. In general, we can always choose fext(x) = ∞ for all x < 0. In special cases, e.g. f(x) = |x − 1|/2 or f(x) = (x − 1)², we can directly take fext(x) = f(x) for all x. Now we can define Df(P‖Q) for all signed measures P in the same way as in Definition 7.1, using fext in place of f.
For each choice of fext we have a variational representation of f-divergence:
Theorem 7.24. Let P and Q be probability measures on X. Fix an extension fext of f and let f*ext be the conjugate of fext, i.e., f*ext(y) = sup_{x∈R} xy − fext(x). Denote dom(f*ext) ≜ {y : f*ext(y) < ∞}. Then
  Df(P‖Q) = sup_{g: X → dom(f*ext)} EP[g(X)] − EQ[f*ext(g(X))],    (7.81)
where the supremum can be taken either (a) over all simple g or (b) over all g satisfying EQ[f*ext(g(X))] < ∞.
We remark that when P Q then both results (a) and (b) also hold for supremum over g :
X → R, i.e. without restricting g(x) ∈ dom(f∗ext ).
As a consequence of the variational characterization, we get the following properties for f-
divergences:
1 Convexity: First of all, note that Df (PkQ) is expressed as a supremum of affine functions (since
the expectation is a linear operation). As a result, we get that (P, Q) 7→ Df (PkQ) is convex,
which was proved previously in Theorem 7.5 using different method.
2 Weak lower semicontinuity: Recall the example in Remark 4.5, where {Xi } are i.i.d. Rademach-
ers (±1), and
  (1/√n) Σ_{i=1}^n Xi → N(0, 1) in distribution
by the central limit theorem; however, by Proposition 7.2, for all n,
  Df( P_{(X1+X2+...+Xn)/√n} ‖ N(0, 1) ) = f(0) + f′(∞) > 0,
since the former distribution is discrete and the latter is continuous. Therefore similar to the
KL divergence, the best we can hope for f-divergence is semicontinuity. Indeed, if X is a nice
space (e.g., Euclidean space), in (7.81) we can restrict the function g to continuous bounded
functions, in which case Df (PkQ) is expressed as a supremum of weakly continuous functionals
(note that f* ◦ g is also continuous and bounded since f* is continuous) and is hence weakly lower semicontinuous, i.e., for any sequence of distributions Pn and Qn such that Pn → P and Qn → Q weakly, we have
which previously appeared in (7.16). A similar calculation for the squared Hellinger distance yields (after changing from g to h = 1 − g in (7.81)):
  H²(P, Q) = 2 − inf_{h>0} ( EP[h] + EQ[1/h] ).
Example 7.4 (χ²-divergence). For the χ²-divergence we have f(x) = (x − 1)². Take fext(x) = (x − 1)², whose conjugate is f*ext(y) = y + y²/4. Applying (7.81) yields
  χ²(P‖Q) = sup_{g: X→R} EP[g(X)] − EQ[ g(X) + g²(X)/4 ]    (7.83)
          = sup_{g: X→R} 2 EP[g(X)] − EQ[g²(X)] − 1,    (7.84)
i i
i i
i i
Example 7.5 (KL-divergence). In this case we have f(x) = x log x. Consider the extension fext (x) =
∞ for x < 0, whose convex conjugate is f∗ (y) = loge e exp(y). Hence (7.81) yields
Note that in the last example, the variational representation (7.86) we obtained for the KL
divergence is not the same as the Donsker-Varadhan identity in Theorem 4.6, that is,
D(PkQ) = sup EP [g(X)] − log EQ [exp{g(X)}] . (7.87)
g:X →R
In fact, (7.86) is weaker than (7.87) in the sense that for each choice of g, the obtained lower bound
on D(PkQ) in the RHS is smaller. Furthermore, regardless of the choice of fext , the Donsker-
Varadhan representation can never be obtained from Theorem 7.24 because, unlike (7.87), the
second term in (7.81) is always linear in Q. It turns out if we define Df (PkQ) = ∞ for all non-
probability measure P, and compute its convex conjugate, we obtain in the next theorem a different
type of variational representation, which, specialized to KL divergence in Example 7.5, recovers
exactly the Donsker-Varadhan identity.
Theorem 7.25. Consider the extension fext of f such that fext (x) = ∞ for x < 0. Let S = {x :
q(x) > 0} where q is as in (7.2). Then
Df (PkQ) = f′ (∞)P[Sc ] + sup EP [g1S ] − Ψ∗Q,P (g) , (7.88)
g
where
Ψ∗Q,P (g) ≜ inf EQ [f∗ext (g(X) − a)] + aP[S].
a∈R
′
In the special case f (∞) = ∞, we have
Df (PkQ) = sup EP [g] − Ψ∗Q (g), Ψ∗Q (g) ≜ inf EQ [f∗ext (g(X) − a)] + a. (7.89)
g a∈R
Remark 7.15 (Marton’s divergence). Recall that in Theorem 7.7 we have shown both the sup and
inf characterizations for the TV. Do other f-divergences also possess inf characterizations? The
only other known example (to us) is due to Marton. Let
Z 2
dP
Dm (PkQ) = dQ 1 − ,
dQ +
which is clearly an f-divergence with f(x) = (1 − x)2+ . We have the following [45, Lemma 8.3]:
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : X ∼ P, Y ∼ Q} ,
where the infimum is over all couplings of P and Q. See Ex. I.34.
Marton’s Dm divergence plays a crucial role in the theory of concentration of measure [45,
Chapter 8]. Note also that while Theorem 7.18 does not apply to Dm , due to the absence of twice
continuous differentiability, it does apply to the symmetrized Marton divergence Dsm (PkQ) ≜
Dm (PkQ) + Dm (QkP).
i i
i i
i i
122
Since |ġ(s, x)| ≤ C|hs (x)| for some C = C(ϵ), we have from Cauchy-Schwarz
Z Z
μ(dx)˙|g(s1 , x)ġ(s2 , x)| ≤ C2 sup μ(dx)ḣt (x)2 < ∞ . (7.94)
t X
where the last inequality follows from the uniform integrability assumption (c1 ). This implies that
Fubini’s theorem applies in (7.93) and we obtain
Z 1 Z 1 Z
L ( t) = du1 du2 G(tu1 , tu2 ) , G(s1 , s2 ) ≜ μ(dx)ġ(s1 , x)ġ(s2 , x) .
0 0
Notice that if a family of functions {fα (x) : α ∈ I} is uniformly square-integrable, then the family
{fα (x)fβ (x) : α ∈ I, β ∈ I} is uniformly integrable simply because apply |fα fβ | ≤ 21 (f2α + f2β ).
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 123
Consequently, from the assumption (c1 ) we see that the integral defining G(s1 , s2 ) allows passing
the limit over s1 , s2 inside the integral. From (7.92) we get as t → 0
Z
1 1 − 4ϵ #
G(tu1 , tu2 ) → G(0, 0) = μ(dx)ḣ0 (x) 4 · 1{h0 > 0} + 1{h0 = 0} = JF (0)+
2
J ( 0) .
ϵ ϵ
From (7.94) we see that G(s1 , s2 ) is bounded and thus, the bounded convergence theorem applies
and
Z 1 Z 1
lim du1 du2 G(tu1 , tu2 ) = G(0, 0) ,
t→0 0 0
which thus concludes the proof of L(t) → JF (0) and of (7.69) assuming facts about ϕ. Let us
verify those.
For simplicity, in the next paragraph we omit the argument x in h0 (x) and ϕ(·; x). A straightfor-
ward differentiation yields
h20 (1 − 2ϵ ) + 2ϵ h2
ϕ′ (h) = 2h .
(ϵ̄h20 + ϵh2 )3/2
h20 (1− ϵ2 )+ ϵ2 h2 1−ϵ/2
Since √ h
≤ √1
ϵ
and ϵ̄h20 +ϵh2
≤ 1−ϵ we obtain the finiteness of ϕ′ . For the continuity
ϵ̄h20 +ϵh2
of ϕ′ notice that if h0 > 0 then clearly the function is continuous, whereas for h0 = 0 we have
ϕ′ (h) = √1ϵ for all h.
We next proceed to the Hellinger distance. Just like in the argument above, we define
Z Z 1 Z 1
1
M(t) ≜ 2 H2 (Pt , P0 ) = μ(dx) du1 du2 ḣtu1 (x)ḣtu2 (x) .
t 0 0
R
Exactly as above from Cauchy-Schwarz and supt μ(dx)ḣt (x)2 < ∞ we conclude that Fubini
applies and hence
Z 1 Z 1 Z
M(t) = du1 du2 H(tu1 , tu2 ) , H(s1 , s2 ) ≜ μ(dx)ḣs1 (x)ḣs2 (x) .
0 0
Again, the family {ḣs1 ḣs2 : s1 ∈ [0, τ ), s2 ∈ [0, τ } is uniformly integrable and thus from c0 ) we
conclude H(tu1 , tu2 ) → 14 JF (0). Furthermore, similar to (7.94) we see that H(s1 , s2 ) is bounded
and thus
Z 1 Z 1
1
lim M(t) = du1 du2 lim H(tu1 , tu2 ) = JF (0) ,
t→ 0 0 0 t→ 0 4
concluding the proof of (7.70).
Proof of Theorem 7.6. The lower bound Df (PkQ) ≥ Df (PE kQE ) follows from the DPI. To prove
an upper bound, first we reduce to the case of f ≥ 0 by property 6 in Proposition 7.2. Then define
i i
i i
i i
124
sets S = suppQ, F∞ = { dQ
dP
= 0} and for a fixed ϵ > 0 let
dP
Fm = ϵm ≤ f < ϵ(m + 1) , m = 0, 1, . . . .
dQ
We have
X Z X
dP
ϵ mQ[Fm ] ≤ dQf ≤ϵ (m + 1)Q[Fm ] + f(0)Q[F∞ ]
m S dQ m
X
≤ϵ mQ[Fm ] + f(0)Q[F∞ ] + ϵ . (7.95)
m
Notice that on the interval I+m = {x > 1 : ϵm ≤ f(x) < ϵ(m + 1)} the function f is increasing and
on I−
m = { x ≤ 1 : ϵ m ≤ f ( x ) < ϵ(m + 1)} it is decreasing. Thus partition further every Fm into
−
Fm = { dQ ∈ Im } and Fm = { dQ
+ dP + dP
∈ I−
m }. Then, we see that
P[F±m]
f ≥ ϵm .
Q[ F ±
m]
− −
0 , F0 , . . . , Fn , Fn , F∞ , S , ∪m>n Fm }.
Consequently, for a fixed n define the partition consisting of sets E = {F+ + c
We next show that with sufficiently large n and sufficiently small ϵ the RHS of (7.96) approaches
Df (PkQ). If f(0)Q[F∞ ] = ∞ (and hence Df (PkQ) = ∞) then clearly (7.96) is also infinite. Thus,
assume thatf(0)Q[F∞ ] < ∞.
R
If S dQf dQdP
= ∞ then the sum over m on the RHS of (7.95) is also infinite, and hence for any
P
N > 0 there exists some n such that m≤n mQ[Fm ] ≥ N, thus showing that RHS for (7.96) can be
R
made arbitrarily large. Thus assume S dQf dQ dP
< ∞. Considering LHS of (7.95) we conclude
P
that for some large n we have m>n mQ[Fm ] ≤ 12 . Then, we must have again from (7.95)
X Z
dP 3
ϵ mQ[Fm ] + f(0)Q[F∞ ] ≥ dQf − ϵ.
S dQ 2
m≤n
Thus, we have shown that for arbitrary ϵ > 0 the RHS of (7.96) can be made greater than
Df (PkQ) − 32 ϵ.
Proof of Theorem 7.24. First, we show that for any g : X → dom(f∗ext ) we must have
Let p(·) and q(·) be the densities of P and Q. Then, from the definition of f∗ext we have for every x
s.t. q(x) > 0:
p ( x) p ( x)
f∗ext (g(x)) + fext ( ) ≥ g ( x) .
q ( x) q ( x)
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 125
where the last step follows is because the two sumprema combined is equivalent to the supremum
over all simple (finitely-valued) functions g.
Next, consider finite X . Let S = {x ∈ X : Q(x) > 0} denote the support of Q. We show the
following statement
Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))] + f′ (∞)P(Sc ), (7.100)
g:S→dom(f∗
ext )
Consider the functional Ψ(P) defined above where P takes values over all signed measures on S,
which can be identified with RS . The convex conjugate of Ψ(P) is as follows: for any g : S → R,
( )
X P( x )
Ψ∗ (g) = sup P(x)g(x) − Q(x) sup h − f∗ext (h)
P h∈ dom( f∗ ) Q ( x)
x ext
X
= sup inf ∗ P(x)(g(x) − h(x)) + Q(x)f∗ext (h(x))
P h:S→dom(fext ) x
( a) X
= inf sup P(x)(g(x) − h(x)) + EQ [f∗ext (h)]
h:S→dom(f∗
ext ) P
x
i i
i i
i i
126
(
EQ [f∗ext (g(X))] g : S → dom(f∗ext )
= .
+∞ otherwise
where (a) follows from the minimax theorem (which applies due to finiteness of X ). Applying
the convex duality in (7.80) yields the proof of the desired (7.100).
Proof of Theorem 7.25. First we argue that the supremum in the right-hand side of (7.88) can
be taken over all simple functions g. Then thanks to Theorem 7.6, it will suffice to consider finite
alphabet X . To that end, fix any g. For any δ , there exists a such that EQ [f∗ext (g − a)] − aP[S] ≤
Ψ∗Q,P (g) + δ . Since EQ [f∗ext (g − an )] can be approximated arbitrarily well by simple functions we
conclude that there exists a simple function g̃ such that simultaneously EP [g̃1S ] ≥ EP [g1S ] − δ and
This implies that restricting to simple functions in the supremization in (7.88) does not change the
right-hand side.
Next consider finite X . We proceed to compute the conjugate of Ψ, where Ψ(P) ≜ Df (PkQ) if
P is a probability measure on X and +∞ otherwise. Then for any g : X → R, maximizing over
all probability measures P we have:
X
Ψ∗ (g) = sup P(x)g(x) − Df (PkQ)
P x∈X
X X X
P(x)
= sup P(x)g(x) − P(x)g(x) − Q ( x) f
P x∈X Q ( x)
x∈Sc x∈ S
X X X
= sup inf P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + Q(x)f∗ext (h(x))
P h:S→R x∈S x∈Sc x∈S
( ! )
( a) X X
= inf sup P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + EQ [f∗ext (h(X))]
h:S→R P x∈ S x∈Sc
(b) ′ ∗
= inf max max g(x) − h(x), maxc g(x) − f (∞) + EQ [fext (h(X))]
h:S→R x∈ S x∈ S
( c) ′ ∗
= inf max a, maxc g(x) − f (∞) + EQ [fext (g(X) − a)]
a∈ R x∈ S
where (a) follows from the minimax theorem; (b) is due to P being a probability measure; (c)
follows since we can restrict to h(x) = g(x) − a for x ∈ S, thanks to the fact that f∗ext is non-
decreasing (since dom(fext ) = R+ ).
From convex duality we have shown that Df (PkQ) = supg EP [g] − Ψ∗ (g). Notice that without
loss of generality we may take g(x) = f′ (∞) + b for x ∈ Sc . Interchanging the optimization over
b with that over a we find that
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 127
which then recovers (7.88). To get (7.89) simply notice that if P[Sc ] > 0, then both sides of (7.89)
are infinite (since Ψ∗Q (g) does not depend on the values of g outside of S). Otherwise, (7.89)
coincides with (7.88).
i i
i i
i i
A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. This method typically proceeds as follows: in
order to count the cardinality of a given set C , we draw an element uniformly at random from C ,
whose entropy is given by log |C|. To bound |C| from above, we describe this random object by a
random vector X = (X1 , . . . , Xn ), e.g., an indicator vector, then proceed to compute or upper-bound
the joint entropy H(X1 , . . . , Xn ).
Notably, three methods of increasing precision are as follows:
• Marginal bound:
X
n
H(X1 , . . . , Xn ) ≤ H( X i )
i=1
1 X
H(X1 , . . . , Xn ) ≤ H(Xi , Xj )
n−1
i< j
X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 )
i=1
We give three applications using the above three methods, respectively, in the order of increasing
difficulty:
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties
of differential entropy (Section 2.3).
128
i i
i i
i i
where wH (x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ 2nh(p) .
where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that
1X
n
p= pi ,
n
i=1
since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
!
Xn
1X
n
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1
Theorem 8.2.
k
X n
≤ 2nh(k/n) , k ≤ n/2.
j
j=0
Proof. We take C = {x ∈ {0, 1}n : wH (x) ≤ k} and invoke the previous lemma, which says that
k
X n
= |C| ≤ 2nh(p) ≤ 2nh(k/n) ,
j
j=0
where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.
Remark 8.2. Alternatively, we can prove Theorem 8.2 using the large-deviation bound in Part III.
By the Chernoff bound on the binomial tail (see (15.19) in Example 15.1),
LHS RHS
= P(Bin(n, 1/2) ≤ k) ≤ 2−nd( n ∥ 2 ) = 2−n(1−h(k/n)) = n .
k 1
2n 2
i i
i i
i i
130
N( , ) = 4, N( , ) = 8.
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, let’s define
1
To be precise, here N(H, G) is the number of subgraphs of G (subsets of edges) isomorphic to H. If we denote by inj(H, G)
the number of injective maps V(H) → V(G) mapping edges of H to edges of G, then N(H, G) = |Aut(H)| 1
inj(H, G).
i i
i i
i i
number of a graph. For a graph H = (V, E), define the fractional covering number as the value of
the following linear program:2
( )
X X
ρ∗ (H) = min wH (e) : wH (e) ≥ 1, ∀v ∈ V, wH (e) ∈ [0, 1] (8.3)
w
e∈E e∈E, v∈e
Theorem 8.3.
∗ ∗
c0 ( H ) m ρ (H)
≤ N(H, m) ≤ c1 (H)mρ (H)
. (8.4)
For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 8.3 is consistent with (8.1).
Proof. Upper bound: Let V(H) = [n] and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H. Given
such a random subgraph set Xi ∈ V(G) to be the vertex corresponding to an i-th vertex of H, i ∈ [n].
∗
Now define a random 2-subset S of [n] by sampling an edge e from E(H) with probability ρw∗ ((He)) .
By the definition of ρ∗ (H) we have for any i ∈ [n] that P[i ∈ S] ≥ ρ∗1(H) . We are now ready to
apply Shearer’s Theorem 1.8:
where the last bound is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values. Overall,
∗
we get3 N(H, G) ≤ (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N(H, G) ≥
∗
c(H)|e(G)|ρ (H) . Consider the dual LP of (8.3)
X
α∗ (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (8.5)
ψ
v∈V(H)
i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is
X
|E(G)| = m(v)m(w).
(vw)∈E(H)
2
If the “∈ [0, 1]” constraints in (8.3) and (8.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H) and the
independence number α(H) of H, respectively.
3
Note that for H = K3 this gives a bound weaker than (8.2). To recover (8.2) we need to take X = (X1 , . . . , Xn ) be
uniform on all injective homomorphisms H → G.
i i
i i
i i
132
Q
Furthermore, N(G, H) ≥ v∈V(H) m(v). To minimize the exponent log N(G,H)
log |E(G)| , fix a large number
ψ(v)
M and let m(v) = M , where ψ is the maximizer in (8.5). Then
X
|E(G)| ≤ 4Mψ(v)+ψ(w) ≤ 4M|E(H)|
(vw)∈E(H)
Y ∗
N(G, H) ≥ Mψ(v) = Mα (H)
v∈V(H)
XY
n
perm(A) ≜ aiπ (i) ,
π ∈Sn i=1
where Sn denotes the group of all permutations of [n]. For a bipartite graph G with n vertices on
the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix. For example,
perm = 1, perm =2
Theorem 8.4 (Brégman’s Theorem). For any n × n bipartite graph with adjacency matrix A,
Y
n
1
perm(A) ≤ (di !) di ,
i=1
where di is the degree of left vertex i (i.e. sum of the ith row of A).
As an example, consider G = Kn,n . Then perm(G) = n!, which coincides with the RHS
[(n!)1/n ]n = n!. More generally, if G consists of n/d copies of Kd,d , then Bregman’s bound is
tight and perm = (d!)n/d .
As a first attempt of proving Theorem 8.4 using the entropy method, we select a perfect
matching uniformly at random which matches the ith left vertex to the Xi th right one. Let X =
i i
i i
i i
(X1 , . . . , Xn ). Then
X
n X
n
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H(Xi ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is worse than Brégman’s bound by an exponential factor, since by
Stirling’s formula
!
Y
n
1 Y
n
(di !) di
∼ di e− n .
i=1 i=1
Here is our second attempt. The hope is to use the chain rule to expand the joint entropy and
bound the conditional entropy more carefully. Let’s write
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1
where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, there is no reason why we should order from 1 to n. The key idea is to label the vertices
randomly, apply chain rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then
where Nk denotes the number of possible matchings for vertex k given the outcomes of {Xj :
π −1 (j) < π −1 (k)} and the expectation is with respect to (X, π ). The key observation is:
i i
i i
i i
134
1 X
dk
1
E(X,π ) log Nk = log i = log(di !) di
dk
i=1
and hence
X
n
1 Y
n
1
log perm(A) ≤ log(di !) di = log (di !) di .
k=1 i=1
Proof. Note that Xi = σ(i) for some random permutation σ . Let T = ∂(k) be the neighbors of k.
Then
which is a function of (σ, π ). In fact, conditioned on any realization of σ , Nk is uniform over [dk ].
To see this, note that σ −1 (T) is a fixed subset of [n] of cardinality dk , and k ∈ σ −1 (T). On the other
hand, S ≜ {j : π −1 (j) < π −1 (k)} is a uniformly random subset of [n]\{k}. Then
Theorem 8.6 (Bollobás-Thomason Box Theorem). Let K ⊂ Rn be a compact set. For S ⊂ [n],
denote by KS ⊂ RS the projection of K onto those coordinates indexed by S. Then there exists a
rectangle A s.t. Leb(A) = Leb(K) and for all S ⊂ [n]:
Leb(AS ) ≤ Leb(KS )
4
Note that since K is compact, its projection and slices are all compact and hence measurable.
i i
i i
i i
Proof. Let Xn be uniformly distributed on K. Then h(Xn ) = log Leb(K). Let A be a rectangle of
size a1 × · · · × an where
log ai = h(Xi |Xi−1 ) .
Then, we have by Theorem 2.6(a)
h(XS ) ≤ log Leb(KS ).
On the other hand, by the chain rule and the fact that conditioning reduces differential entropy
(recall Theorem 2.6(a) and (c)),
X
n
h( X S ) = 1{i ∈ S}h(Xi |X[i−1]∩S )
i=1
X
≥ h(Xi |Xi−1 )
i∈S
Y
= log ai
i∈S
= log Leb(AS )
The following result is a continuous counterpart of Shearer’s lemma (see Theorem 1.8 and
Remark 1.2):
Corollary 8.7 (Loomis-Whitney). Let K be a compact subset of Rn and let Kjc denote the
projection of K onto coordinates in [n] \ j. Then
Y
n
1
Leb(K) ≤ Leb(Kjc ) n−1 . (8.6)
j=1
Y
n
Leb(K) ≥ wj ,
j=1
i.e. that volume of K is greater than that of the rectangle of average widths.
i i
i i
i i
Consider the following problem: Given a stream of independent Ber(p) bits, with unknown p, we
want to turn them into pure random bits, i.e., independent Ber(1/2) bits; Our goal is to find a
universal way to extract the most number of bits. In other words, we want to extract as many fair
coin flips as possible from possibly biased coin flips, without knowing the actual bias.
In 1951 von Neumann [326] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both
01 and 10 occur with probability pq (where q ≜ 1 − p throughout this chapter), regardless of the
value of p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s
scheme, note that, on average, we have 2n bits in and 2pqn bits out. So the efficiency (rate) is pq.
The question is: Can we do better?
There are several choices to be made in the problem formulation. Universal v.s. non-universal:
the source distribution can be unknown or partially known, respectively. Exact v.s. approximately
fair coin flips: whether the generated coin flips are exactly fair or approximately, as measured by
one of the f-divergences studied in Chapter 7 (e.g., the total variation or KL divergence). In this
chapter, we only focus on the universal generation of exactly fair coins.
9.1 Setup
Let {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denote the set of all finite-length binary strings,
where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , l(x) denotes the length of x.
Let us first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is the
following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
136
i i
i i
i i
In other words, Ψ consumes a stream of n coins with bias p and outputs on average nrΨ (p) fair
coins.
Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN (x2n+1 ) =
ΨvN (x2n )), whose rate is rvN (p) = pq. Clearly this is wasteful, because even if the input bits are
already fair, we only get 25% in return.
9.2 Converse
We show that no extractor has a rate higher than the binary entropy function h(p), even if the
extractor is allowed to be non-universal (depending on p). The intuition is that the “information
content” contained in each Ber(p) variable is h(p) bits; as such, it is impossible to extract more
than that. This is easily made precise by the data processing inequality for entropy (since extractors
are deterministic functions).
1 1
rΨ (p) ≥ h(p) = p log2 + q log2 .
p q
nh(p) = H(Xn ) ≥ H(Ψ(Xn )) = H(Ψ(Xn )|L) + H(L) ≥ H(Ψ(Xn )|L) = E [L] bits,
where the last step follows from the assumption on Ψ that Ψ(Xn ) is uniform over {0, 1}k
conditioned on L = k.
The rate of von Neumann extractor and the entropy bound are plotted below. Next we present
two extractors, due to Elias [112] and Peres [233] respectively, that attain the binary entropy func-
tion. (More precisely, both construct a sequence of extractors whose rate approaches the entropy
bound).
i i
i i
i i
138
rate
1 bit
rvN
p
0 1 1
2
1 For iid Xn , the probability of each string only depends on its type, i.e., the number of 1’s. (This is
the main idea of the method of types for data compression.) Therefore conditioned on the num-
ber of 1’s, Xn is uniformly distributed (over the type class). This observation holds universally
for any p.
2 Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length string of fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and the remaining
eight numbers 4, . . . , 11 are assigned to 3-bit strings.
Lemma 9.3. Given U uniformly distributed on [M], there exists f : [M] → {0, 1}∗ such that
conditioned on l(f(U)) = k, f(U) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f(U))] ≤ log2 M bits.
Proof. We defined f by partitioning [M] into subsets whose cardinalities are powers of two, and
assign elements in each subset to binary strings of that length. Formally, denote the binary expan-
Pn
sion of M by M = i=0 mi 2i , where the most significant bit mn = 1 and n = blog2 Mc + 1. Those
non-zero mi ’s defines a partition [M] = ∪tj=0 Mj , where |Mi | = 2ij . Map the elements of Mj to
{0, 1}ij . Finally, notice that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower bound
follows from
1 X 1 X 2n X i−n
n n n
2n+1
E[l(f(U))] = mi 2 · i = n −
i
mi 2 ( n − i) ≥ n −
i
2 ( n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0
i i
i i
i i
Elias’ extractor Fix n ≥ 1. Let wH (xn ) define the Hamming weight (number of ones) of a
binary string xn . Let Tk = {xn ∈ {0, 1}n : wH (xn ) = k} define the Hamming sphere of radius k.
For each 0 ≤ k ≤ n, we apply the function f from Lemma 9.3 to each Tk . This defines a mapping
ΨE : {0, 1}n → {0, 1}∗ and then we extend it to ΨE : {0, 1}∗ → {0, 1}∗ by applying the mapping
per n-bit block and discard the last incomplete block. Then it is clear that the rate is given by
n E[l(ΨE (X ))]. By Lemma 9.3, we have
1 n
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
wH (Xn ) wH (Xn )
Using Stirling’s approximation, we can show (see, e.g., [19, Lemma 4.7.1])
2nh(p) n 2nh(p)
p ≤ ≤p (9.1)
8k(n − k)/n k 2πk(n − k)/n
• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .
Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Note
that un is important because it encodes the location of the yk and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
vn−k and un .
i i
i i
i i
140
Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits that
enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to introduce
the notion of exchangeability. We say Xn are exchangeable if the joint distribution is invariant
under permutation, that is, PX1 ,...,Xn = PXπ (1) ,...,Xπ (n) for any permutation π on [n]. In particular, if
Xi ’s are binary, then Xn are exchangeable if and only if the joint distribution only depends on the
Hamming weight, i.e., PXn (xn ) = f(wH (xn )) for some function f. Examples: Xn is iid Ber(p); Xn is
uniform over the Hamming sphere Tk .
As an example, if X2n are i.i.d. Ber(p), then conditioned on L = k, Vn−k is iid Ber(p2 /(p2 + q2 )),
since L ∼ Binom(n, 2pq) and
pk+2m qn−k−2m
P[Yk = y, Un = u, Vn−k = v|L = k] =
n
k(p2 + q2 )n−k (2pq)k
− 1
n p2 m q2 n − k − m
= 2− k · · 2
k p + q2 p2 + q2
= P[Yk = y|L = k]P[Un = u|L = k]P[Vn−k = v|L = k],
where m = wH (v). In general, when X2n are only exchangeable, we have the following:
Lemma 9.4 (Ψt preserves exchangeability). Let X2n be exchangeable and L = Ψ1 (X2n ). Then
conditioned on L = k, Yk , Un and Vn−k are independent, each having an exchangeable distribution.
i.i.d.
Furthermore, Yk ∼ Ber( 12 ) and Un is uniform over Tk .
Proof. If suffices to show that ∀y, y′ ∈ {0, 1}k , u, u′ ∈ Tk and v, v′ ∈ {0, 1}n−k such that wH (v) =
wH (v′ ), we have
which implies that P[Yk = y, Un = u, Vn−k = v|L = k] = f(wH (v)) for some function f. Note that
the string X2n and the triple (Yk , Un , Vn−k ) are in one-to-one correspondence of each other. Indeed,
to reconstruct X2n , simply read the k distinct pairs from Y and fill them according to the locations of
ones in U and fill the remaining equal pairs from V. [Examples: (y, u, v) = (01, 1100, 01) ⇒ x =
(10010011), (y, u, v) = (11, 1010, 10) ⇒ x′ = (01110100).] Finally, note that u, y, v and u′ , y′ , v′
correspond to two input strings x and x′ of identical Hamming weight (wH (x) = k + 2wH (v)) and
hence of identical probability due to the exchangeability of X2n .
i.i.d.
Lemma 9.5 (Ψt is an extractor). Let X2n be exchangeable. Then Ψt (X2n ) ∼ Ber(1/2) conditioned
on l(Ψt (X2n )) = m.
i i
i i
i i
Proof. Note that Ψt (X2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,
Proceed by induction on t. The base case of t = 1 follows from Lemma 9.4 (the distribution of
the Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X2n ) = (Ψ1 (X2n ), Ψt−1 (Un ), Ψt−1 (Vn−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 9.4. Then
P[Ψt (X2n ) = sm ]
Xm
= P[Ψt (X2n ) = sm |L1 = k]P[L1 = k]
k=0
m X
X m−k
Lemma 9.4 n−k
= P[L1 = k]P[Yk = sk |L1 = k]P[Ψt−1 (Un ) = skk+1 |L1 = k]P[Ψt−1 (V
+r
k+r+1 |L1 = k]
) = sm
k=0 r=0
X
m X
m−k
P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
induction
=
k=0 r=0
Then
1 Ln 1 1
l(Ψt (X2n )) = + l(Ψt−1 (Un )) + l(Ψt−1 (Vn−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a. s .
Note that Un ∼ Ber(2pq), Vn−Ln |Ln ∼ Ber(p2 /(p2 + q2 )) and Ln −−→∞. Then the induction hypoth-
a. s . a. s .
esis implies that 1n l(Ψt−1 (Un ))−−→rt−1 (2pq) and 2(n−1 Ln ) l(Ψt−1 (Vn−Ln ))−−→rt−1 (p2 /(p2 +q2 )). We
obtain the recursion:
1 p2 + q2 p2
rt (p) = pq + rt−1 (2pq) + rt−1 ≜ (Trt−1 )(p), (9.2)
2 2 p2 + q2
where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is mono-
tone in the senes that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt converges
monotonically from below to the fixed point of T, which turns out to be exactly the binary
entropy function h. Instead of directly verifying Th = h, here is a simple proof: Consider
i.i.d.
X1 , X2 ∼ Ber(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(2pq) + 2pqh( 12 ) + (p2 + q2 )h( p2p+q2 ).
The convergence of rt to h are shown in Fig. 9.1.
i i
i i
i i
142
1.0
0.8
0.6
0.4
0.2
Figure 9.1 Rate function rt for t = 1, 4, 10 versus the binary entropy function.
example is whether we can simulate Ber(2p) from Ber(p), i.e., f(p) = 2p ∧ 1. Keane and O’Brien
[175] showed that all f that can be simulated are either constants or “polynomially bounded away
from 0 or 1”: for all 0 < p < 1, min{f(p), 1 − f(p)} ≥ min{p, 1 − p}n for some n ∈ N. In particular,
doubling the bias is impossible.
The above result deals with what f(p) can be simulated in principle. What type of computational
devices are needed for such as task? Note that since r1 (p) is quadratic in p, all rate functions rt
that arise from the iteration (9.2) are rational functions (ratios of polynomials), converging to the
binary entropy function as Fig. 9.1 shows. It turns out that for any rational function f that satisfies
0 < f < 1 on (0, 1), we can generate independent Ber(f(p)) from Ber(p) using either of the
following schemes with finite memory [221]:
1 Finite-state machine (FSM): initial state (red), intermediate states (white) and final states (blue,
output 0 or 1 then reset to initial state).
2 Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f.
The next table gives some examples of f that can be realized with these two architectures. (Exercise:
How to generate f(p) = 1/3?)
It turns out that the only type of f that can be simulated using either FSM or block simulation
√
is rational function. For f(p) = p, which satisfies Keane-O’Brien’s characterization, it cannot
be simulated by FSM or block simulation, but it can be simulated by the so-called pushdown
automata, which is a FSM operating with a stack (infinite memory) [221].
It is unknown how to find the optimal Bernoulli factory with the best rate. Clearly, a converse
is the entropy bound h(hf((pp))) , which can be trivial (bigger than one).
i i
i i
i i
1
1
0
0
f(p) = 1/2 A0 = 10; A1 = 01
1
1 0
0
0 0
1
f(p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1
0 0
0
0
1
1
p3
f(p) = p3 +q3
A0 = 000; A1 = 111
0
0
1
1
1 1
i i
i i
i i
where C(n, k) is bounded by two universal constants C0 ≤ C(n, k) ≤ C1 , and h(·) is the
binary entropy. Conclude that for all 0 ≤ k ≤ n we have
3* More generally, let X be a finite alphabet, P̂, Q distributions on X , and TP̂ a set of all strings
in X n with composition P̂. If TP̂ is non-empty (i.e. if nP̂(·) is integral) then
and furthermore, both O(log n) terms can be bounded as |O(log n)| ≤ |X | log(n + 1). (Hint:
show that number of non-empty TP̂ is ≤ (n + 1)|X | .)
I.2 (Refined method of types) The following refines Proposition 1.5. Let n1 , . . . , be non-negative
P
integers with i ni = n and let k+ be the number of non-zero ni ’s. Then
n k+ − 1 1 X
log = nH(P̂) − log(2πn) − log P̂i − Ck+ ,
n1 , n2 , . . . 2 2
i:ni >0
i i
i i
i i
where we define xn+1 = x1 (cyclic continuation). Show that 1n Nxn (·, ·) defines a probability
distribution PA,B on X ×X with equal marginals PA = PB . Conclude that H(A|B) = H(B|A).
Is PA|B = PB|A ?
(2)
(b) Let Txn (Markov type-class of xn ) be defined as
(2)
Txn = {x̃n ∈ X n : Nx̃n = Nxn } .
(2)
Show that elements of Txn can be identified with cycles in the complete directed graph G
on X , such that for each (a, b) ∈ X × X the cycle passes Nxn (a, b) times through edge
( a, b) .
(c) Show that each such cycle can be uniquely specified by indentifying the first node and by
choosing at each vertex of the graph the order in which the outgoing edges are taken. From
this and Stirling’s approximation conclude that
(2)
log |Txn | = nH(xT+1 |xT ) + O(log n) , T ∼ Unif([n]) .
I.4 Find the entropy rate of a stationary ergodic Markov chain with transition probability matrix
1 1 1
2 4 4
P= 0 1
2
1
2
1 0 0
I.5 Let X = X∞
0 be a stationary Markov chain. Let PY|X be a Markov kernel. Define a new process
Y = Y∞0 where Yi ∼ PY|X=Xi conditionally independent of all other Xj , j 6= i. Prove that
and
I.6 (Robust version of the maximal entropy) Maximal differential entropy among all variables X
supported on [−b, b] is attained by a uniform distribution. Prove that as ϵ → 0+ we have
where supremization is over all (not necessarily independent) random variables M, Z such that
M + Z possesses a density. (Hint: [120, Appendix C] proves o(1) = O(ϵ1/3 log 1ϵ ) bound.)
I.7 (Maximum entropy.) Prove that for any X taking values on N = {1, 2, . . .} such that E[X] < ∞,
1
H(X) ≤ E[X]h ,
E [ X]
i i
i i
i i
maximized uniquely by the geometric distribution. Here as usual h(·) denotes the binary entropy
function. Hint: Find an appropriate Q such that RHS - LHS = D(PX kQ).
I.8 (Finiteness of entropy) We have shown that any N-valued random variable X, with E[X] < ∞
has H(X) ≤ E[X]h(1/ E[X]) < ∞. Next let us improve this result.
(a) Show that E[log X] < ∞ ⇒ H(X) < ∞.
Moreover, show that the condition of X being integer-valued is not superfluous by giving a
counterexample.
(b) Show that if k 7→ PX (k) is a decreasing sequence, then H(X) < ∞ ⇒ E[log X] < ∞.
Moreover, show that the monotonicity assumption is not superfluous by giving a counterex-
ample.
I.9 (Maximum entropy under Hamming weight constraint.) For any α ≤ 1/2 and d ∈ N,
achieved by the product distribution Y ∼ Ber(α)⊗d . Hint: Find an appropriate Q such that RHS
- LHS = D(PY kQ).
I.10 Let N (m, �) be the Gaussian distribution on Rn with mean m ∈ Rn and covariance matrix �.
(a) Under what conditions on m0 , �0 , m1 , �1 is
(b) Compute D(N (m, �)kN (0, In )), where In is the n × n identity matrix.
(c) Compute D( N (m1 , �1 ) k N (m0 , �0 ) ) for non-singular �0 . (Hint: think how Gaussian dis-
tribution changes under shifts x 7→ x + a and non-singular linear transformations x 7→ Ax.
Apply data-processing to reduce to previous case.)
I.11 (Information lost in erasures) Let X, Y be a pair of random variables with I(X; Y) < ∞. Let Z
be obtained from Y by passing the latter through an erasure channel, i.e., X → Y → Z where
(
1 − δ, z = y ,
PZ|Y (z|y) =
δ, z =?
I.13 The Hewitt-Savage 0-1 law states that certain symmetric events have no randomness. Let
{Xi }i≥1 be a sequence be iid random variables. Let E be an event determined by this sequence.
We say E is exchangeable if it is invariant under permutation of finitely many indices in
the sequence of {Xi }’s, e.g., the occurance of E is unchanged if we permute the values of
(X1 , X4 , X7 ), etc.
Let’s prove the Hewitt-Savage 0-1 law information-theoretically in the following steps:
P Pn
(a) (Warm-up) Verify that E = { i≥1 Xi converges} and E = {limn→∞ n1 i=1 Xi = E[X1 ]}
are exchangeable events.
i i
i i
i i
(b) Let E be an exchangeable event and W = 1E is its indicator random variable. Show that
for any k, I(W; X1 , . . . , Xk ) = 0. (Hint: Use tensorization (6.2) to show that for arbitrary n,
nI(W; X1 , . . . , Xk ) ≤ 1 bit.)
(c) Since E is determined by the sequence {Xi }i≥1 , we have by continuity of mutual informa-
tion:
xk
PY|X [k|x] = e−x , k = 0, 1, 2, . . .
k!
Let X be an exponential random variable with unit mean. Find I(X; Y).
I.15 Consider the following Z-channel given by PY|X [1|1] = 1 and PY|X [1|0] = PY|X [0|0] = 1/2.
1 1
0 0
C = max I(X; Y) .
X
(b) Find D(PY|X=0 kP∗Y ) and D(PY|X=1 kP∗Y ) where P∗Y is the capacity-achieving output distribu-
tion, or caod, i.e., the distribution of Y induced by the maximizer of I(X; Y).
I.16 (a) For any X such that E [|X|] < ∞, show that
(E[X])2
D(PX kN (0, 1)) ≥ nats.
2
(b) For a > 0, find the minimum and minimizer of
i i
i i
i i
I.17 (Entropy numbers and capacity.) Let {PY|X=x : x ∈ X } be a set of distributions and let C =
supPX I(X; Y) be its capacity. For every ϵ ≥ 0, define1
C = inf ϵ2 + log N(ϵ) . (I.5)
ϵ≥0
Comments: The reason these estimates are useful is because N(ϵ) for small ϵ roughly speaking
depends on local (differential) properties of the map x 7→ PY|X=x , unlike C which is global.
I.18 Consider the channel PYm |X : [0, 1] 7→ {0, 1}m , where given x ∈ [0, 1], Ym is i.i.d. Ber(x). Using
the upper bound from Ex. I.17 prove
1
C(m) ≜ max I(X; Ym ) ≤ log m + O(1) , m → ∞.
PX 2
Hint: Find a covering of the input space.
Show a lower bound to establish
1
C(m) ≥ log m + o(log m) , m → ∞.
2
You may use without proof that ∀ϵ > 0 there exists K(ϵ) such that for all m ≥ 1 and all
p ∈ [ϵ, 1 − ϵ] we have |H(Binom(m, p)) − 12 log m| ≤ K(ϵ).
I.19 Show that
Y
n
PY1 ···Yn |X1 ···Xn = PYi |Xi (I.7)
i=1
1
N(ϵ) is the minimum number of points that cover the set {PY|X=x : x ∈ X } to within ϵ in divergence; log N(ϵ) would be
called (Kolmogorov) metric ϵ-entropy of the set {PY|X=x : x ∈ X } – see Chapter 27.
i i
i i
i i
Pn
I.20 Suppose Z1 , . . . Zn are independent Poisson random variables with mean λ. Show that i=1 Zi
is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.21 Suppose Z1 , . . . Zn are independent uniformly distributed on the interval [0, λ]. Show that
max1≤i≤n Zi is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.22 Consider a binary symmetric random walk Xn on Z that starts at zero. In other words, Xn =
Pn
j=1 Bj , where (B1 , B2 , . . .) are independent and equally likely to be ±1.
(a) When n 1 does knowing X2n provide any information about Xn ? More exactly, prove
I.23 (Continuity of entropy on finite alphabet.) We have shown that entropy is continuous on on
finite alphabet. Now let us study how continuous it is with respect to the total variation. Prove
You may include only the minimal DAGs (recall: the DAG is minimal for a given
distribution if removal of any edge leads to a graphical model incompatible with the
distribution).2
(b) Draw the DAG describing the set of distributions PXn Yn satisfying:
Y
n
PYn |Xn = PYi |Xi
i=1
(c) Recall that two DAGs G1 and G2 are called equivalent if they have the same vertex sets and
each distribution factorizes w.r.t. G1 if and only if it does so w.r.t. G2 . For example, it is
2
Note: {X → Y}, {X ← Y} and {X Y} are the three possible directed graphical modelss for two random variables. For
example, the third graph describes the set of distributions for which X and Y are independent: PXY = PX PY . In fact, PX PY
factorizes according to any of the three DAGs, but {X Y} is the unique minimal DAG.
i i
i i
i i
well known
X→Y→Z ⇐⇒ X←Y←Z ⇐⇒ X ← Y → Z.
X1 → X2 → · · · → Xn → · · ·
X1 ← X2 ← · · · ← Xn ← · · ·
A→B→C
A→B→C
=⇒ A ⊥
⊥ (B, C)
A→C→B
Discuss implications for sufficient statistics.
Bonus: for binary (A, B, C) characterize all counter-examples.
Comment: Thus, a popular positivity condition PABC > 0 allows to infer conditional indepen-
dence relations, which are not true in general. Wisdom: This example demonstrates that a set
of distributions satisfying certain (conditional) independence relations does not equal to the
closure of its intersection with {PABC > 0}.
I.27 Show that for jointly gaussian (A, B, C)
I( A; C ) = I( B; C ) = 0 =⇒ I(A, B; C) = 0 . (I.11)
i i
i i
i i
eϵ
I.29 (Rényi divergences and Blackwell order) Let pϵ = 1+eϵ . Show that for all ϵ > 0 and all α > 0
we have
Note: This shows that domination under all Rényi divergences does not imply a similar
comparison in other f-divergences [? ]. On the other hand, we have the equivalence [222]:
Whenever the LHS is finite, derive the explicit form of a unique minimizer R.
I.31 For an f-divergence, consider the following statements:
(i) If If (X; Y) = 0, then X ⊥
⊥ Y.
(ii) If X − Y − Z and If (X; Y) = If (X; Z) < ∞, then X − Z − Y.
Recall that f : (0, ∞) → R is a convex function with f(1) = 0.
(a) Choose an f-divergence which is not a multiple of the KL divergence (i.e., f cannot be of
form c1 x log x + c2 (x − 1) for any c1 , c2 ∈ R). Prove both statements for If .
(b) Choose an f-divergence which is non-linear (i.e., f cannot be of form c(x − 1) for any c ∈ R)
and provide examples that violate (i) and (ii).
(c) Choose an f-divergence. Prove that (i) holds, and provide an example that violates (ii).
I.32 (Chain rules I)
(a) Show using (I.12) and the chain rule for KL that
X
n
(1 − α)Dα (PXn kQXn ) ≥ inf(1 − α)Dα (PXi |Xi−1 =a kQXi |Xi−1 =a )
a
i=1
1 Y n
1
1 − H2 (PXn , QXn ) ≤ sup(1 − H2 (PXi |Xi−1 =a , QXi |Xi−1 =a ))
2 a 2
i=1
Y
n
1 + χ2 (PXn kQXn ) ≤ sup(1 + χ2 (PXi |Xi−1 =a kQXi |Xi−1 =a ))
a
i=1
i i
i i
i i
(a) Show that the chain rule for divergence can be restated as
X
n
D(PXn kQXn ) = D(Pi kPi−1 ),
i=1
where Pi = PXi QXni+1 |Xi , with Pn = PXn and P0 = QXn . The identity above shows how
KL-distance from PXn to QXn can be traversed by summing distances between intermediate
Pi ’s.
(b) Using the same path and triangle inequality show that
X
n
TV(PXn , QXn ) ≤ EPXi−1 TV(PXi |Xi−1 , QXi |Xi−1 )
i=1
Prove that
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : PX = P, PY = Q}
PXY
where the infimum is over all couplings. (Hint: For one direction use the same coupling
achieving TV. For the other direction notice that P[X 6= Y|Y] ≥ 1 − QP((YY)) .)
(b) Define symmetrized Marton’s divergence
Dsm (PkQ) = Dm (PkQ) + Dm (QkP).
Prove that
Dsm (PkQ) = inf{E[P2 [X 6= Y|Y]] + E[P2 [X 6= Y|X]] : PX = P, PY = Q}.
PXY
I.35 (Center of gravity under f-divergences.) Recall from Corollary 4.2 the fact that
min D(PY|X kQY |PX ) = I(X; Y)
QY
3
Note that the results do not depend on the choice of μ, so we can take for example μ = PY , in view of Lemma 3.3.
i i
i i
i i
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ exp(E[log p(y|X)]) μ(dy).
Note: This exercise shows that the center of gravity with respect to other f-divergences need not
be PY but its reweighted version. For statistical applications, see Exercise VI.6 and Exercise VI.9,
where (I.13) is used to determine the form of the Bayes estimator.
I.36 Let (X, Y) be uniformly distributed in the unit ℓp -ball Bp ≜ {(x, y) : |x|p + |y|p ≤ 1}, where
p ∈ (0, ∞). Also define the ℓ∞ -ball B∞ ≜ {(x, y) : |x| ≤ 1, |y| ≤ 1}.
(a) Compute I(X; Y) for p = 1/2, p = 1 and p = ∞.
(b) (Bonus) What do you think I(X; Y) converges to as p → 0. Can you prove it?
I.37 (Divergence of order statistics) Given xn = (x1 , . . . , xn ) ∈ Rn , let x(1) ≤ . . . ≤ x(n) denote the
ordered entries. Let P, Q be distributions on R and PXn = Pn , QXn = Qn .
(a) Prove that
I.38 (Sampling without replacement I, [293]) Consider two ways of generating a random vector
Xn = (X1 , . . . , Xn ): Under P, Xn are sampled from the set [n] = {1, . . . , n} without replacement;
under Q, Xn are sampled from [n] with replacement. Let’s compare the joint distribution of the
first k draws X1 , . . . , Xk for some 1 ≤ k ≤ n.
(a) Show that
k! n
TV(PXk , QXk ) = 1 − k
n k
k! n
D(PXk kQXk ) = − log k .
n k
√
Conclude that D and TV are o(1) iff k = o( n). You may use the fact that TV between two
discrete distributions is equal to half the ℓ1 -distance between their PMFs.
√
(b) Explain the specialness of n by find an explicit test that distinguishes P and Q with high
√
probability when k n. Hint: Birthday problem.
I.39 (Sampling without replacement II, [293]) Let X1 , . . . , Xk be a random sample of balls without
Pq
replacement from an urn containing ai balls of color i ∈ [q], i=1 ai = n. Let QX (i) = ani . Show
that
k2 ( q − 1 ) log e
D(PXk kQkX ) ≤ c , c= .
(n − 1)(n − k + 1) 2
Let Rm,b0 ,b1 be the distribution of the number of 1’s in the first m ≤ b0 + b1 coordinates of a
randomly permuted binary strings with b0 zeros and b1 ones.
i i
i i
i i
(b) Show that there must exist some t ∈ {k, k + 1, . . . , n} such that
H( X k − 1 )
I(Xk−1 ; Xk |Xnt+1 ) ≤ .
n−k+1
(Hint: Expand I(Xk−1 ; Xnk ) via chain rule.)
(c) Show from 1 and 2 that
Y kH(Xk−1 )
D PXk |T
PXj |T |PT ≤
n−k+1
where T = Xnt+1 .
(d) By Pinsker’s inequality
h i r
Y kH(Xk−1 )|X | 1
ET TV PXk |T , PXj |T ≤ c , c= p .
n−k+1 2 log e
Conclude the proof of (I.16) by convexity of total variation.
Note: Another estimate [293, 90] is easy to deduce from Exercise I.39 and Exercise I.38: there
exists a mixture of iid QXk such that
k
TV(QXk , PXk ) ≤ min(2|X |, k − 1) .
n
The bound (I.16) improves the above only when H(X1 ) ≲ 1.
i i
i i
i i
I.41 (Wringing Lemma [105, 309]) Prove that for any δ > 0 and any (Un , Vn ) there exists an index
n n
set I ⊂ [n] of size |I| ≤ I(U δ;V ) such that
I(Ut ; Vt |UI , VI ) ≤ δ ∀ t ∈ [ n] .
When I(Un ; Vn ) n, this shows that conditioning on a (relatively few) entries, one can make
individual coordinates almost independent. (Hint: Show I(A, B; C, D) ≥ I(A; C) + I(B; D|A, C)
first. Then start with I = ∅ and if there is any index t s.t. I(Ut ; Vt |UI , VI ) > δ then add it to I and
repeat.)
I.42 This exercise shows other ways of proving the Fano’s inequality in its various forms.
(a) Prove (6.5) as follows. Given any P = (Pmax , P2 , . . . , PM ), apply a random permutation π
to the last M − 1 atoms to obtain the distribution Pπ . By comparing H(P) and H(Q), where
Q is the average of Pπ over all permutations, complete the proof.
(b) Prove (6.5) by directly solving the convex optimization max{H(P) : 0 ≤ pi ≤ Pmax , i =
P
1, . . . , M, i pi = 1}.
(c) Prove (6.9) as follows. Let Pe = P[X 6= X̂]. First show that
I(X; Y) ≥ I(X; X̂) ≥ min{I(PX , PZ|X ) : P[X = Z] ≥ 1 − Pe }.
PZ|X
Notice that the minimum is non-zero unless Pe = Pmax . Second, solve the stated convex
optimization problem. (Hint: look for invariants that the matrix PZ|X must satisfy under
permutations (X, Z) 7→ (π (X), π (Z)) then apply the convexity of I(PX , ·)).
I.43 (Generalization gap = ISKL , [12]) A learning algorithm selects a parameter W based on observing
(not necessarily independent) samples S1 , . . . , Sn , where all Si have a common marginal law PS ,
with the goal of minimizing the loss on a fresh sample = E[ℓ(W, S)], where Sn ⊥ ⊥ S ∼ PS
and ℓ is an arbitrary loss function4 . Consider a Gibbs algorithm (generalizing ERM and various
regularizations) which chooses
αX
n
1
W ∼ PW|Sn (w|sn ) = n
π (w) exp{− ℓ(w, si )} ,
Z( s ) n
i=1
where π (·) is a fixed prior on weights and Z(·) – normalization constant. Show that generaliza-
tion gap of this algorithm is given by
1X
n
1
E[ℓ(W, S)] − E[ ℓ(W, Si )] = ISKL (W; Sn ) .
n α
i=1
4
For example, if S = (X, Y) we may have ℓ(w, (x, y)) = 1{fw (x) 6= y} where fw denotes a neural network with weights w.
i i
i i
i i
q
(in other words, the typical deviation is of order ϵ2 log δ1 + D(· · · )). Prove this inequality in
two steps (assume that (X, Y, Ȳ) are all discrete):
• For convenience, let EX|Y and EȲ denote the respective (conditional) expectation operators.
Show that the result follows from the following inequality (valid for all f and QX ):
h i
EY eEX|Y [f(X,Y)−ln EȲ e )]−D(PX|Y=Y ∥QX ) ≤ 1
f(X,Ȳ
(I.18)
i i
i i
i i
λ 1
D(PkϵU + ϵ̄R) ≤ 8(H (P, R) + 2ϵ)
2
log + Dλ (PkU) .
λ−1 ϵ
Thus, a Hellinger ϵ-net for a set of P’s can be converted into a KL (ϵ2 log 1ϵ )-net; see
Section 32.2.4.)
−1
(a) Start by proving the tail estimate for the divergence: For any λ > 1 and b > e(λ−1)
dP dP log b
EP log · 1{ > b} ≤ λ−1 exp{(λ − 1)Dλ (PkQ)}
dQ dQ b
(b) Show that for any b > 1 we have
b log b dP dP
D(PkQ) ≤ H2 (P, Q) √ + EP log · 1{ > b}
( b − 1)2 dQ dQ
h(x)
(Hint: Write D(PkQ) = EP [h( dQ
dP )] for h(x) = − log x + x − 1 and notice that
√
( x−1)2
is
monotonically decreasing on R+ .)
(c) Set Q = ϵU + ϵ̄R and show that for every δ < e− λ−1 ∧ 14
1
1
D(PkQ) ≤ 4H2 (P, R) + 8ϵ + cλ ϵ1−λ δ λ−1 log ,
δ
where cλ = exp{(λ − 1)Dλ (PkU). (Notice H2 (P, Q) ≤ H2 (P, R) + 2ϵ, Dλ (PkQ) ≤
Dλ (PkU) + log 1ϵ and set b = 1/δ .)
2
(d) Complete the proof by setting δ λ−1 = 4H c(λPϵ,λ−
R)+2ϵ
1 .
I.49 Let G = (V, E) be a finite directed graph. Let
4 = (x, y, z) ∈ V3 : (x, y), (y, z), (z, x) ∈ E ,
∧ = (x, y, z) ∈ V3 : (x, y), (x, z) ∈ E .
Prove that 4 ≤ ∧.
Hint: Prove H(X, Y, Z) ≤ H(X) + 2H(Y|X) for random variables (X, Y, Z) distributed uniformly
over the set of directed 3-cycles, i.e. subsets X → Y → Z → X.
i i
i i
i i
i i
i i
i i
Part II
i i
i i
i i
i i
i i
i i
161
• Variable-length lossless compression. Here we require P[X 6= X̂] = 0, where X̂ is the decoded
version. To make the question interesting, we compress X into a variable-length binary string. It
will turn out that optimal compression length is H(X) − O(log(1 + H(X))). If we further restrict
attention to so-called prefix-free or uniquely decodable codes, then the optimal compression
length is H(X) + O(1). Applying these results to n-letter variables X = Sn we see that optimal
5
Of course, one should not take these “laws” too far. In regards to language modeling, (finite-state) Markov assumption is
too simplistic to truly generate all proper sentences, cf. Chomsky [67].
i i
i i
i i
162
compression length normalized by n converges to the entropy rate (Section 6.4) of the process
{Sj }.
• Fixed-length, almost lossless compression. Here, we allow some very small (or vanishing with
n → ∞ when X = Sn ) probability of error, i.e. P[X 6= X̂] ≤ ϵ. It turns out that under mild
assumptions on the process {Sj }, here again we can compress to entropy rate but no more.
This mode of compression permits various beautiful results in the presence of side-information
(Slepian-Wolf, etc).
• Lossy compression. Here we require only E[d(X, X̂)] ≤ ϵ where d(·, ·) is some loss function.
This type of compression problems is the central topic of Part V.
Note that more correctly we should have called all the examples above as “fixed-to-variable”,
“fixed-to-fixed” and “fixed-to-lossy” codes, because they take fixed number of input letters. We
omit descussion of the beautiful class of variable-to-fixed compressors, such as the famous
Tunstall code [314], which consume an incoming stream of letters in variable-length chunks.
i i
i i
i i
X Compressor
{0, 1}∗ Decompressor X
f: X →{0,1}∗ g: {0,1}∗ →X
sor is a function f that maps each symbol x ∈ X into a variable-length string f(x) in {0, 1}∗ ≜
∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . }. Each f(x) is referred to as a codeword and the collection of
codewords the codebook. We say f is a lossless compressor for a random variable X if there exists
a decompressor g : {0, 1}∗ → X such that P [X = g(f(X))] = 1, i.e., g(f(x)) = x for all x ∈ X
such that PX (x) > 0. (As such, f is injective on the support of PX ). We are interested in the most
economical way to compress the data. So let us introduce the length function l : {0, 1}∗ → Z+ ,
e.g., l(∅) = 0, l(01001) = 5.
Notice that since {0, 1}∗ is countable, lossless compression is only possible for discrete X.
Also, without loss of generality, we can relabel X such that X = N = {1, 2, . . . } and sort the
PMF decreasingly: PX (1) ≥ PX (2) ≥ · · · . At this point we do not impose any other constraints on
the map f; later in Section 10.3 we will introduce conditions such as prefix-freeness and unique-
decodability. The unconstrained setting is sometimes called a single-shot compression setting,
cf. [181].
We could consider different objectives for selecting the best compressor f, for example, min-
imizing any of E[l(f(X))], esssup l(f(X)), median[l(f(X))] would be reasonable. It turns out that
there is a compressor f∗ that minimizes all objectives simultaneously. As mentioned in the preface
of this chapter, the main idea is to assign longer codewords to less likely symbols, and reserve
the shorter codewords for more probable symbols. To make precise of the optimality of f∗ , let us
recall the concept of stochastic dominance.
Definition 10.1 (Stochastic dominance). For real-valued random variables X and Y, we say Y
st.
stochastically dominates (or, is stochastically larger than) X, denoted by X ≤ Y, if P [Y ≤ t] ≤
P [X ≤ t] for all t ∈ R.
163
i i
i i
i i
164
PX (i)
i
1 2 3 4 5 6 7 ···
∗
f
∅ 0 1 00 01 10 11 ···
st.
By definition, X ≤ Y if and only if the CDF of X is larger than that of Y pointwise; in other words,
the distribution of X assigns more probability to lower values than that of Y does. In particular, if
X is dominated by Y stochastically, so are their means, medians, supremum, etc.
Theorem 10.2 (Optimal f∗ ). Consider the compressor f∗ defined (for a down-sorted PMF PX )
by f∗ (1) = ∅, f∗ (2) = 0, f∗ (3) = 1, f∗ (4) = 00, etc, assigning strings with increasing lengths to
symbols i ∈ X . (See Fig. 10.1 for an illustration.) Then
1 Length of codeword:
2 l(f∗ (X)) is stochastically the smallest: For any lossless compressor f : X → {0, 1}∗ ,
st.
l(f∗ (X)) ≤ l(f(X))
i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗ (X)) ≤ k]. As a result, E[l(f∗ (X))] ≤ E[l(f(X))].
Here the inequality is because f is lossless so that |Ak | can at most be the total number of binary
strings of length up to k. Then
X X
P[l(f(X)) ≤ k] = P X ( x) ≤ PX (x) = P[l(f∗ (X)) ≤ k], (10.1)
x∈Ak x∈A∗
k
since |Ak | ≤ |A∗k | and A∗k contains all 2k+1 − 1 most likely symbols.
The following lemma (see Ex. I.7) is useful in bounding the expected code length of f∗ . It says
if the random variable is integer-valued, then its entropy can be controlled using its mean.
i i
i i
i i
Lemma 10.3. For any Z ∈ N s.t. E[Z] < ∞, H(Z) ≤ E[Z]h( E[1Z] ), where h(·) is the binary entropy
function.
Theorem 10.4 (Optimal average code length: exact expression). Suppose X ∈ N and PX (1) ≥
PX (2) ≥ . . .. Then
X
∞
E[l(f∗ (X))] = P[X ≥ 2k ].
k=1
P
Proof. Recall that expectation of U ∈ Z+ can be written as E [U] = k≥1 P [U ≥ k]. Then by
P P
Theorem 10.2, E[l(f∗ (X))] = E [blog2 Xc] = k≥1 P [blog2 Xc ≥ k] = k≥1 P [log2 X ≥ k].
Remark 10.1. Theorem 10.5 is the first example of a coding theorem in this book, which relates
the fundamental limit E[l(f∗ (X))] (an operational quantity) to the entropy H(X) (an information
measure).
Proof. Define L(X) = l(f∗ (X))). For the upper bound, observe that since the PMF are ordered
decreasingly by assumption, PX (m) ≤ 1/m, so L(m) ≤ log2 m ≤ log2 (1/PX (m)). Taking
expectation yields E[L(X)] ≤ H(X).
For the lower bound,
( a)
H(X) = H(X, L) = H(X|L) + H(L) ≤ E[L] + H(L)
(b) 1
≤ E [ L] + h (1 + E[L])
1 + E[L]
1
= E[L] + log2 (1 + E[L]) + E[L] log2 1 + (10.2)
E [ L]
( c)
≤ E[L] + log2 (1 + E[L]) + log2 e
(d)
≤ E[L] + log2 (e(1 + H(X)))
where in (a) we have used the fact that H(X|L = k) ≤ k bits, because f∗ is lossless, so that given
f∗ (X) ∈ {0, 1}k , X can take at most 2k values; (b) follows by Lemma 10.3; (c) is via x log(1+1/x) ≤
log e, ∀x > 0; and (d) is by the previously shown upper bound H(X) ≤ E[L].
i i
i i
i i
166
1
For the case of sources for which log2 PS has non-lattice distribution, it is further shown in [300,
Theorem 3]:
1
E[l(f∗ (Sn ))] = nH(S) − log2 (8πeV(S)n) + o(1) , (10.3)
2
where V(S) is the varentropy of the source S:
1
V(S) ≜ Var log2 . (10.4)
PS (S)
Theorem 10.5 relates the mean of l(f∗ (X)) to that of log2 PX1(X) (entropy). It turns out that
distributions of these random variables are also closely related.
Proof. Lower bound (achievability): Use PX (m) ≤ 1/m. Then similarly as in Theorem 10.5,
L(m) = blog2 mc ≤ log2 m ≤ log2 PX 1(m) . Hence L(X) ≤ log2 PX1(X) a.s.
Upper bound (converse): By truncation,
1 1
P [L ≤ k] = P L ≤ k, log2 ≤ k + τ + P L ≤ k, log2 >k+τ
PX (X) PX (X)
X
1
≤ P log2 ≤k+τ + PX (x)1{l(f∗ (x))≤k} 1{PX (x)≤2−k−τ }
PX (X)
x∈X
1
≤ P log2 ≤ k + τ + (2k+1 − 1) · 2−k−τ
PX (X)
So far our discussion applies to an arbitrary random variable X. Next we consider the source as
a random process (S1 , S2 , . . .) and introduce blocklength n. We apply our results to X = Sn , that is,
by treating the first n symbols as a supersymbol. The following corollary states that the limiting
behavior of l(f∗ (Sn )) and log PSn 1(Sn ) always coincide.
Corollary 10.7. Let (S1 , S2 , . . .) be a random process and U, V real-valued random variable. Then
1 1 d 1 ∗ n d
log2 →U
− ⇔ l(f (S ))−
→U (10.5)
n PSn (Sn ) n
i i
i i
i i
and
1 1 1
√ (l(f∗ (Sn )) − H(Sn ))−
d d
√ log2 − H( S ) →
n
−V ⇔ →V (10.6)
n PSn (Sn ) n
Proof. First recall that convergence in distribution is equivalent to convergence of CDF at all
d
→U ⇔ P [Un ≤ u] → P [U ≤ u] for all u at which point the CDF of U is
continuity point, i.e., Un −
continuous (i.e., not an atom of U).
√
To get (10.5), apply Theorem 10.6 with k = un and τ = n:
1 1 1 ∗ 1 1 1 √
P log2 ≤ u ≤ P l(f (X)) ≤ u ≤ P log2 ≤ u + √ + 2− n+1 .
n PX (X) n n PX (X) n
√
To get (10.6), apply Theorem 10.6 with k = H(Sn ) + nu and τ = n1/4 :
∗ n
1 1 l(f (S )) − H(Sn )
P √ log − H( S ) ≤ u ≤ P
n
√ ≤u
n PSn (Sn ) n
1 1 −1/4
+ 2−n +1
1/ 4
≤P √ log n
− H( S ) ≤ u + n
n
n PSn (S )
Now let us particularize the preceding theorem to memoryless sources of i.i.d. Sj ’s. The
important observation is that the log likihood becomes an i.i.d. sum:
1 X n
1
log n
= log .
PSn (S ) PS (Si )
i=1 | {z }
i.i.d.
P
1 By the Law of Large Numbers (LLN), we know that n1 log PSn 1(Sn ) − →E log PS1(S) = H(S).
Therefore in (10.5) the limiting distribution U is degenerate, i.e., U = H(S), and we
P
have 1n l(f∗ (Sn ))−
→E log PS1(S) = H(S). [Note: convergence in distribution to a constant ⇔
convergence in probability to a constant]
2 By the Central Limit Theorem (CLT), if varentropy V(S) < ∞, then we know that V in (10.6)
is Gaussian, i.e.,
1 1 d
p log − nH(S) −→N (0, 1).
nV(S) PSn (Sn )
Consequently, we have the following Gaussian approximation for the probability law of the
optimal code length
1
(l(f∗ (Sn )) − nH(S))−
d
p →N (0, 1),
nV(S)
or, in shorthand,
p
l(f∗ (Sn )) ∼ nH(S) + nV(S)N (0, 1) in distribution.
1 ∗ n
Gaussian approximation tells us the speed of n l(f (S )) to entropy and give us a good
approximation at finite n.
i i
i i
i i
168
Optimal compression: CDF, n = 200, PS = [0.445 0.445 0.110] Optimal compression: PMF, n = 200, P S = [0.445 0.445 0.110]
1 0.06
True PMF
Gaussian approximation
Gaussian approximation (mean adjusted)
0.9
0.05
0.8
0.7
0.04
0.6
0.5
P
0.03
P
0.4
0.02
0.3
0.2
0.01
True CDF
0.1 Lower bound
Upper bound
Gaussian approximation
Gaussian approximation (mean adjusted)
0 0
1.25 1.3 1.35 1.4 1.45 1.5 1.25 1.3 1.35 1.4 1.45 1.5
Rate Rate
Figure 10.2 Left plot: Comparison of the true CDF of l(f∗ (Sn )), bounds of Theorem 10.6 (optimized over τ ),
and the Gaussian approximations in (10.7) and (10.8). Right plot: PMF of the optimal compression length
l(f∗ (Sn )) and the two Gaussian approximations.
Example 10.1 (Ternary source). Next we apply our bounds to approximate the distribution of
l(f∗ (Sn )) in a concrete example. Consider a memoryless ternary source outputing i.i.d. n symbols
from the distribution PS = [0.445, 0.445, 0.11]. We first compare different results on the minimal
expected length E[l(f∗ (Sn ))] in the following table:
Blocklength Lower bound (10.5) E[l(f∗ (Sn ))] H(Sn ) (upper bound) asymptotics (10.3)
n = 20 21.5 24.3 27.8 23.3 + o(1)
n = 100 130.4 134.4 139.0 133.3 + o(1)
n = 500 684.1 689.2 695.0 688.1 + o(1)
In all cases above E[l(f∗ (S))] is close to a midpoint between the bounds.
Next we consider the distribution of l(f∗ (Sn ). Its Gaussian approximation is defined as
p
nH(S) + nV(S)Z , Z ∼ N ( 0, 1) . (10.7)
However, in view of (10.3) we also define the mean-adjusted Gaussian approximation as
1 p
nH(S) − log2 (8πeV(S)n) + nV(S)Z , Z ∼ N (0, 1) . (10.8)
2
Fig. 10.2 compares the true distribution of l(f∗ (Sn )) with bounds and two Gaussian approxima-
tions.
i i
i i
i i
Figure 10.3 The log-log frequency-rank plots of the most used words in various languages exhibit a power
law tail with exponent close to 1, as popularized by Zipf [350]. Data from [292].
pr r−α for some value of α. Remarkably, this holds across various corpi of text in multiple
different languages (and with α ≈ 1) – see Fig. 10.3 for an illustration. Even more surprisingly, a
lot of other similar tables possess the power-law distribution: “city populations, the sizes of earth-
quakes, moon craters, solar flares, computer files, wars, personal names in most cultures, number
of papers scientists write, number of citations a paper receives, number of hits on web pages, sales
of books and music recordings, number of species in biological taxa, people’s incomes” (quoting
from [225], which gives references for each study). This spectacular universality of the power law
continues to provoke scientiests from many disciplines to suggest explanations for its occurrence;
see [220] for a survey of such. One of the earliest (in the context of natural language of Zipf) is
due to Mandelbrot [209] and is in fact intimately related to the topic of this Chapter.
Let us go back to the question of minimal expected length of the representation of source X. We
have shown bounds on this quantity in terms of the entropy of X in Theorem 10.5. Let us introduce
the following function
i i
i i
i i
170
where optimization is over lossless encoders and probability distributions PX = {pj : j = 1, . . .}.
Theorem 10.5 (or more precisely, the intermediate result (10.2)) shows that
It turns out that the upper bound is in fact tight. Furthermore, among all distributions the optimal
tradeoff between entropy and minimal compression length is attained at power law distributions.
To show that, notice that in computing H(Λ), we can restrict attention to sorted PMFs p1 ≥ p2 ≥
· · · (call this class P ↓ ), for which the optimal encoder is such that l(f(j)) = blog2 jc (Theorem 10.2).
Thus, we have shown
X
H(Λ) = sup {H(P) : pj blog2 jc ≤ Λ} .
P∈P ↓ j
Next, let us fix the base of the logarithm of H to be 2, for convenience. (We will convert to arbitrary
base at the end). Applying Example 5.2 we obtain:
Comparing with (10.9) we find that the upper bound in (10.9) is tight and attained by Pλ∗ . From
the first equation above, we also find λ∗ = log2 2+Λ2Λ . Altogether this yields
The argument of Mandelbrot [209] The above derivation shows a special (extremality) prop-
erty of the power law, but falls short of explaining its empirical ubiquity. Here is a way to connect
the optimization problem H(Λ) to the evolution of the natural language. Suppose that there is a
countable set S of elementary concepts that are used by the brain as building blocks of perception
and communication with the outside world. As an approximation we can think that concepts are
in one-to-one correspondence with language words. Now every concept x is represented internally
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 171
by the brain as a certain pattern, in the simplest case – a sequence of zeros and ones of length l(f(x))
([209] considers more general representations). Now we have seen that the number of sequences
of concepts with a composition P grows exponentially (in length) with the exponent given by
H(P), see Proposition 1.5. Thus in the long run the probability distribution P over the concepts
results in the rate of information transfer equal to EP [Hl((fP(X) ))] . Mandelbrot concludes that in order
to transfer maximal information per unit, language and brain representation co-evolve in such a
way as to maximize this ratio. Note that
H(P) H(Λ)
sup = sup .
P,f EP [l(f(X))] Λ Λ
It is not hard to show that H(Λ) is concave and thus the supremum is achieved at Λ = 0+ and
equals infinity. This appears to have not been observed by Mandelbrot. To fix this issue, we can
postulate that for some unknown reason there is a requirement of also having a certain minimal
entropy H(P) ≥ h0 . In this case
H(P) h0
sup = −1
P,f:H(P)≥h0 EP [l(f(X))] H ( h0 )
and the supremum is achieved at a power law distribution P. Thus, the implication is that the fre-
quency of word usage in human languages evolves until a power law is attained, at which point it
maximizes information transfer within the brain. That’s the gist of the argument of [209]. It is clear
that this does not explain appearance of the power law in other domains, for which other explana-
tions such as preferential attachment models are more plausible, see [220]. Finally, we mention
that the Pλ distributions take discrete values 2−λm−log2 Z(λ) , m = 0, 1, 2, . . . with multiplicities 2m .
Thus Pλ appears as a rather unsightly staircase on frequency-rank plots such as Fig. 10.3. This
artifact can be alleviated by considering non-binary brain representations with unequal lengths of
signals.
i i
i i
i i
172
Definition 10.9 (Uniquely decodable codes). f : A → {0, 1}∗ is uniquely decodable if its
extension f : A+ → {0, 1}∗ is injective.
Definition 10.10 (Prefix codes). f : A → {0, 1}∗ is a prefix code1 if no codeword is a prefix of
another (e.g., 010 is a prefix of 0101).
• f(a) = 0, f(b) = 1, f(c) = 10. Not uniquely decodable, since f(ba) = f(c) = 10.
• f(a) = 0, f(b) = 10, f(c) = 11. Uniquely decodable and a prefix code.
• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 Uniquely decodable but not a prefix code, since
as long as 0 appears, we know that the previous codeword has terminated.2
Remark 10.2.
1 Prefix codes are uniquely decodable and hence lossless, as illustrated in the following picture:
prefix codes
Huffman
code
2 Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3 By definition, any uniquely decodable code does not have the empty string as a codeword. Hence
f : X → {0, 1}+ in both Definition 10.9 and Definition 10.10.
4 Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
1
Also known as prefix-free/comma-free/self-punctuatingf/instantaneous code.
2
In this example, if 0 is placed at the very end of each codeword, the code is uniquely decodable, known as the unary code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 173
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5 Prefix codes are in one-to-one correspondence with binary trees (with codewords at leaves). It
is also equivalent to strategies to ask “yes/no” questions previously mentioned at the end of
Section 1.1.
1 Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft inequality
X
2−la ≤ 1. (10.10)
a∈A
2 Conversely, for any set of code length {la : a ∈ A} satisfying (10.10), there exists a prefix code
f, such that la = l(f(a)). Moreover, such an f can be computed efficiently.
Remark 10.3. The consequence of Theorem 10.11 is that as far as compression efficiency is
concerned, we can ignore those uniquely decodable codes that are not prefix codes.
Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The proof for the former is probabilistic, following ideas in [10, Exercise 1.8, p. 12]. Let f be a
prefix code. Let us construct a probability space such that the LHS of (10.10) is the probability
of some event, which cannot exceed one. To this end, consider the following scenario: Generate
independent Ber( 12 ) bits. Stop if a codeword has been written, otherwise continue. This process
P
terminates with probability a∈A 2−la . The summation makes sense because the events that a
given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
P PL
finite. Then L = maxa∈A la is finite. Let Gf (z) = a∈A zla = l=0 Al (f)zl , where Al (f) denotes
the number of codewords of length l in f. For k ≥ 1, define fk : Ak → {0, 1}+ as the symbol-
P k k P P
by-symbol extension of f. Then Gfk (z) = ak ∈Ak zl(f (a )) = a1 · · · ak zla1 +···+lak = [Gf (z)]k =
PkL k l
l=0 Al (f )z . By the unique decodability of f, fk is lossless. Hence Al (fk ) ≤ 2l . Therefore we have
P
Gf (1/2) = Gfk (1/2) ≤ kL for all k. Then a∈A 2−la = Gf (1/2) ≤ limk→∞ (kL)1/k = 1. If A is
k
P
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives a∈A′ 2−la ≤
1. The proof is complete by the arbitrariness of A′ .
P
Conversely, given a set of code lengths {la : a ∈ A} s.t. a∈A 2−la ≤ 1, construct a prefix
code f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define
X
i− 1
ai ≜ 2− l k
k=1
with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}+ as the
first li bits in the binary expansion of ai . Finally, we prove that f is a prefix code by contradiction:
i i
i i
i i
174
Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li . Then aj − ai ≤ 2−li , since they agree
on the most significant li bits. But aj − ai = 2−li + 2−li+1 +. . . > 2−li , which is a contradiction.
Remark 10.4. A conjecture of Ahslwede et al [4] states that for any set of lengths for which
P −la
2 ≤ 34 there exists a fix-free code (i.e. one which is simultaneously prefix-free and suffix-
free). So far, existence has only been shown when the Kraft sum is ≤ 58 , cf. [343].
In view of Theorem 10.11, the optimal average code length among all prefix (or uniquely
decodable) codes is given by the following optimization problem
X
L∗ (X) ≜ min P X ( a) la (10.11)
a∈A
X
s.t. 2−la ≤ 1
a∈A
la ∈ N
This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:
Theorem 10.12.
Light inequality: We give two proofs for this converse. One of the commonly used ideas to deal
with combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in
(10.11) and relax it into the following optimization problem, which obviously provides a lower
bound
X
L∗ (X) ≜ min PX (a)la (10.13)
a∈A
X
s.t. 2−la ≤ 1 (10.14)
a∈A
This is a nice convex optimization problem, with affine objective function and a convex feasible
set. Solving (10.13) by Lagrange multipliers (Exercise!) yields that the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
3
Such a code is called a Shannon code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 175
Another proof is the following: For any f whose codelengths {la } satisfying the Kraft inequality,
− la
define a probability measure Q(a) = P 2 2−la . Then
a∈A
X
El(f(X)) − H(X) = D(PkQ) − log 2−la ≥ 0.
a∈A
Next we describe the Huffman code, which achieves the optimum in (10.11). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of the Huffman code is to build
the binary tree from the bottom up: Given a PMF {PX (a) : a ∈ A},
The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted PMF)
or O(|A| log |A|) (unsorted PMF).
Example 10.3. A = {a, b, c, d, e}, PX = {0.25, 0.25, 0.2, 0.15, 0.15}.
Huffman tree: Codebook:
0 1 f(a) = 00
0.55 0.45 f(b) = 10
0 1 0 1 f(c) = 11
f(d) = 010
a 0.3 b c
0 1 f(e) = 011
d e
Theorem 10.13 (Optimality of Huffman codes). The Huffman code achieves the minimal average
code length (10.11) among all prefix (or uniquely decodable) codes.
1 As Shannon pointed out in his 1948 paper, in compressing English texts, in addition to exploit-
ing the nonequiprobability of English letters, working with pairs (or more generally, n-grams)
of letters achieves even more compression. To compress a block of symbols (S1 , . . . , Sn ), while
a natural idea is to apply the Huffman codes on a symbol-by-symbol basis (i.e., applying the cor-
responding Huffman code for each PSi ). By Theorem 10.12, this is only guaranteed to achieve an
Pn
average length at most i=1 H(Si ) + n bits, which also fails to exploit the memory in the source
i i
i i
i i
176
Pn
when i=1 H(Si ) is significantly larger than H(S1 , . . . , Sn ). The solution is to apply block Huff-
man coding. Indeed, compressing the block (S1 , . . . , Sn ) using its Huffman code (designed for
PS1 ,...,Sn ) achieves H(S1 , . . . , Sn ) within one bit, but the complexity is |A|n !
2 Constructing the Huffman code requires knowing the source distribution. This brings us the
question: Is it possible to design universal compressor which achieves entropy for a class of
source distributions? And what is the price to pay? These questions are addressed in Chapter 13.
There are much more elegant solutions, e.g.,
i i
i i
i i
In the previous chapter we introduced the concept of variable-length compression and studied
its fundamental limits (with and without prefix-free condition). In some situations, however, one
may desire that the output of the compressor always has a fixed length, say, k bits. Unless k is
unreasonably large, then, this will require relaxing the losslessness condition. This is the focus of
this chapter: compression in the presence of (typically vanishingly small) probability of error. It
turns out allowing even very small error enables several beautiful effects:
• The possibility to compress data via matrix multiplication over finite fields (Linear Compres-
sion).
• The possibility to reduce compression length from H(X) to H(X|Y) if side information Y is
available at the decompressor (Slepian-Wolf).
• The possibility to reduce compression length below H(X) if access to a compressed representa-
tion of side-information Y is available at the decompressor (Ahslwede-Körner-Wyner).
that g(f(X)) = X with probability one, then k ≥ log2 |supp(PX )| and no meaningful compression
can be achieved. It turns out that by tolerating a small error probability, we can gain a lot in
terms of code length! So, instead of requiring g(f(x)) = x for all x ∈ X , consider only lossless
decompression for a subset S ⊂ X :
(
x x∈S
g(f(x)) =
e x 6∈ S
P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈
/ S] .
177
i i
i i
i i
178
f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}
The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Chapter 10):
Theorem 11.2 (Fundamental limit of fixed-length compression). Recall the optimal variable-
length compressor f∗ defined in Theorem 10.2. Then
Comparing Theorems 10.2 and 11.2, we see that the optimal codes in these two settings work
as follows:
Remark 11.1. In Definition 11.1 we require that the errors are always detectable, i.e., g(f(x)) = x
or e. Alternatively, we can drop this requirement and allow undetectable errors, in which case we
can of course do better since we have more freedom in designing codes. It turns out that we do
not gain much by this relaxation. Indeed, if we define
then ϵ̃∗ (X, k) = 1 − sum of 2k largest masses of X. This follows immediately from
P
P [g(f(X)) = X] = x∈S PX (x) where S ≜ {x : g(f(x)) = x} satisfies |S| ≤ 2k , because f takes no
more than 2k values. Compared to Theorem 11.2, we see that ϵ̃∗ (X, k) and ϵ∗ (X, k) do not differ
much. In particular, ϵ∗ (X, k + 1) ≤ ϵ̃∗ (X, k) ≤ ϵ∗ (X, k).
i i
i i
i i
11.1 Fixed-length almost lossless code. Asymptotic Equipartition Property (AEP). 179
p
lim ϵ∗ (Sn , nH(S) + nV(S)γ) = 1 − Φ(γ).
n→∞
where Φ(·) is the CDF of N (0, 1), H(S) = E[log PS1(S) ] is the entropy, V(S) = Var[log PS1(S) ] is the
varentropy which is assumed to be finite.
Next we give separate achievability and converse bounds complementing the exact result in
Theorem 11.2.
Theorem 11.5.
∗ 1
ϵ (X, k) ≤ P log2 ≥k . (11.1)
PX (X)
Theorem 11.6.
∗ 1
ϵ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0. (11.2)
PX (X)
Note that Theorem 11.5 is in fact always stronger than Theorem 11.6. Still, we present the proof
of Theorem 11.6 and the technology behind it – random coding – a powerful technique introduced
by Shannon for proving existence of good codes (achievability). This technique is used throughout
in this book and Theorem 11.6 is its first appearance. To see that Theorem 11.5 gives a better bound,
note that even the first term in (11.2) exceeds (11.1). Nevertheless, the random coding argument for
proving this weaker bound is much more important and generalizable. We will apply it again for
linear compression in Section 11.2 and the Slepian-Wolf problem in Section 11.4 in this chapter;
i i
i i
i i
180
later for data transmission and lossy data compression in Parts IV and V it will take the central
stage as the method of choice for most achievability proofs.
Proof of Theorem 11.5. Construction: use those 2k − 1 symbols with the highest probabilities.
The analysis is essentially the same as the lower bound in Theorem 10.6 from Chapter 10. Note
that the mth largest mass PX (m) ≤ m1 . Therefore
X X X
ϵ∗ (X, k) = P X ( m) = 1{m≥2k } PX (m) ≤ 1n 1 ≥2k o PX (m) = E1nlog 1 ≥ko .
PX (m) 2 PX (X)
m≥2k
Proof of Theorem 11.6. (Random coding.) For a given compressor f, the optimal decompressor
which minimizes the error probability is the maximum a posteriori (MAP) decoder, i.e.,
which can be hard to analyze. Instead, let us consider the following (suboptimal) decompressor g:
x, ∃! x ∈ X s.t. f(x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)
e, o.w.
i i
i i
i i
11.1 Fixed-length almost lossless code. Asymptotic Equipartition Property (AEP). 181
X
= 2− k E X 1{PX (x′ )≥2−k+τ }
x′ ̸=X
X
≤ 2− k 1{PX (x′ )≥2−k+τ }
x′ ∈X
Remark 11.2 (Why random coding works). The compressor f(x) = cx can be thought as hashing
x ∈ X to a random k-bit string cx ∈ {0, 1}k , as illustrated below:
Here, x has high probability ⇔ log2 PX1(x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the number of
those high-probability x’s is at most 2k−τ , which is far smaller than 2k , the total number of k-bit
codewords. Hence the chance of collision is small.
Remark 11.3. The random coding argument is a canonical example of probabilistic method: To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object.
Remark 11.4. This is a subtle point: Notice that in the proof we choose the random codebook to
be uniform over all possible codebooks. In other words, C = {cx : x ∈ X } consists of iid k-bit
strings. In fact, in the proof we only need pairwise independence, i.e., cx ⊥ ⊥ cx′ for any x 6= x′
(Why?). Now, why should we care about this? In fact, having access to external randomness is
also a lot of resources. It is more desirable to use less randomness in the random coding argument.
Indeed, if we use zero randomness, then it is a deterministic construction, which is the best situa-
tion! Using pairwise independent codebook requires significantly less randomness than complete
random coding which needs |X |k bits. To see this intuitively, note that one can use 2 independent
random bits to generate 3 random bits that is pairwise independent but not mutually independent,
e.g., {b1 , b2 , b1 ⊕ b2 }. This observation is related to linear compression studied in the next section,
where the codeword we generated are not iid, but elements of a random linear subspace.
i i
i i
i i
182
sequences whose Hamming is close to the expectation: Tδn = {sn ∈ {0, 1}n : w(sn ) ∈ [p ± δ ′ ]n},
where δ ′ is a constant depending on δ .
As a consequence of (11.3),
1 P Sn ∈ Tδn → 1 as n → ∞.
2 |Tδn | ≤ 2(H(S)+δ)n |S|n .
In other words, Sn is concentrated on the set Tδn which is exponentially smaller than the whole
space. In almost lossless compression we can simply encode this set losslessly. Although this is
different than the optimal encoding, Corollary 11.3 indicates that in the large-n limit the optimal
compressor is no better.
The property (11.3) is often referred as the Asymptotic Equipartition Property (AEP), in the
sense that the random vector is concentrated on a set wherein each reliazation is roughly equally
likely up to the exponent. Indeed, Note that for any sn ∈ Tδn , its likelihood is concentrated around
PSn (sn ) ∈ 2−(H(S)±δ)n , called δ -typical sequences.
Definition 11.7 (Galois field). F is a finite set with operations (+, ·) where
i i
i i
i i
A linear compressor is a linear function H : Fnq → Fkq (represented by a matrix H ∈ Fqk×n ) that
maps each x ∈ Fnq to its codeword w = Hx, namely
w1 h11 . . . h1n x1
.. .. .. ..
. = . . .
wk hk1 ... hkn xn
Compression is achieved if k < n, i.e., H is a fat matrix, which, again, is only possible in the
almost lossless sense.
Theorem 11.8 (Achievability). Let X ∈ Fnq be a random vector. ∀τ > 0, ∃ linear compressor
H : Fnq → Fkq and decompressor g : Fkq → Fnq ∪ {e}, s.t.
1
P [g(HX) 6= X] ≤ P logq > k − τ + q−τ
PX (X)
Remark 11.6. Consider the Hamming space q = 2. In comparison with Shannon’s random coding
achievability, which uses k2n bits to construct a completely random codebook, here for linear codes
we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional linear
subspace of the Hamming space.
Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 11.6),
given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) =
argmaxx:Hx=w PX (x), which outputs the most likely symbol that is compatible with the codeword
received. Instead, let us consider the following (suboptimal) decoder for its ease of analysis:
(
x ∃!x ∈ Fnq : w = Hx, x − h.p.
g( w ) =
e otherwise
where we used the short-hand:
1
x − h.p. (high probability) ⇔ logq < k − τ ⇔ PX (x) ≥ q−k+τ .
P X ( x)
Note that this decoder is the same as in the proof of Theorem 11.6. The proof is also mostly the
same, except now hash collisions occur under the linear map H. By union bound,
1
P [g(f(X)) = e] ≤ P logq > k − τ + P [∃x′ − h.p. : x′ 6= X, Hx′ = HX]
P X ( x)
i i
i i
i i
184
X X
1
(union bound) ≤ P logq >k−τ + PX (x) 1{Hx′ = Hx}
PX (x) x x′ −h.p.,x′ ̸=x
Now we use random coding to average the second term over all possible choices of H. Specif-
ically, choose H as a matrix independent of X where each entry is iid and uniform on Fq . For
distinct x0 and x1 , the collision probability is
PH [Hx1 = Hx0 ] = PH [Hx2 = 0] (x2 ≜ x1 − x0 6= 0)
= P H [ H 1 · x2 = 0 ] k
(iid rows)
where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality qn−1 . So the probability for the first row to lie in this subspace is
qn−1 /qn = 1/q, hence the collision probability 1/qk . Averaging over H gives
X X
EH 1{Hx′ = Hx} = PH [Hx′ = Hx] = |{x′ : x′ − h.p., x′ 6= x}|q−k ≤ qk−τ q−k = q−τ
x′ −h.p.,x′ ̸=x x′ −h.p.,x′ ̸=x
• f : X × Y → {0, 1}k
• g : {0, 1}k × Y → X ∪ {e}
• P[g(f(X, Y), Y) 6= X] < ϵ
• Fundamental Limit: ϵ∗ (X|Y, k) = inf{ϵ : ∃(k, ϵ) − S.I. code}
i i
i i
i i
Note that here unlike the source X, the side information Y need not be discrete. Conditioned
on Y = y, the problem reduces to compression without side information studied in Section 11.1,
where the source X is distributed according to PX|Y=y . Since Y is known to both the compressor
and decompressor, they can use the best code tailored for this distribution. Recall ϵ∗ (X, k) defined
in Definition 11.1, the optimal probability of error for compressing X using k bits, which can also
be denoted by ϵ∗ (PX , k). Then we have the following relationship
ϵ∗ (X|Y, k) = Ey∼PY [ϵ∗ (PX|Y=y , k)],
Theorem 11.10.
1 1
P log > k + τ − 2−τ ≤ ϵ∗ (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y) PX|Y (X|Y)
i.i.d.
Corollary 11.11. Let (X, Y) = (Sn , Tn ) where the pairs (Si , Ti ) ∼ PST . Then
(
∗ n n 0 R > H(S|T)
lim ϵ (S |T , nR) =
n→∞ 1 R < H(S|T)
Proof. Using the converse Theorem 11.4 and achievability Theorem 11.6 (or Theorem 11.5) for
compression without side information, we have
1 1
P log > k + τ Y = y − 2−τ ≤ ϵ∗ (PX|Y=y , k) ≤ P log > k Y = y
PX|Y (X|y) PX|Y (X|y)
By taking the average over all y ∼ PY , we get the theorem. For the corollary
1X
n
1 1 1 P
log = log −
→H(S|T)
n PSn |Tn (S |T )
n n n PS|T (Si |Ti )
i=1
i i
i i
i i
186
• f : X → {0, 1}k
• g : {0, 1}k × Y → X ∪ {e}
• P[g(f(X), Y) 6= X] ≤ ϵ
• Fundamental Limit: ϵ∗SW (X|Y, k) = inf{ϵ : ∃(k, ϵ)-S.W. code}
Now the very surprising result: Even without side information at the compressor, we can still
compress down to the conditional entropy!
Corollary 11.14.
(
0 R > H(S|T)
lim ϵ∗SW (Sn |Tn , nR) =
n→∞ 1 R < H(S|T)
Remark 11.8. Definition 11.12 does not include the zero-undected-error condition (that is
g(f(x), y) = x or e). In other words, we allow for the possibility of undetected errors. Indeed,
if we require this condition, the side-information savings will be mostly gone. Indeed, assuming
PX,Y (x, y) > 0 for all (x, y) it is clear that under zero-undetected-error condition, if f(x1 ) = f(x2 ) =
c then g(c) = e. Thus except for c all other elements in {0, 1}k must have unique preimages. Sim-
ilarly, one can show that Slepian-Wolf theorem does not hold in the setting of variable-length
lossless compression (i.e. average length is H(X) not H(X|Y).)
Proof. LHS is obvious, since side information at the compressor and decompressor is better than
only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y), then define the compressor and decoder as
f ( x) = cx
i i
i i
i i
(
x ∃!x : cx = w, x − h.p.|y
g(w, y) =
0 o.w.
where we used the shorthand x − h.p.|y ⇔ log2 PX|Y1(x|y) < k − τ . The error probability of this
scheme, as a function of the code book C, is
1
E(C) = P log ≥ k − τ or J(X, C|Y) 6= ∅
PX|Y (X|Y)
1
≤ P log ≥ k − τ + P [J(X, C|Y) 6= ∅]
PX|Y (X|Y)
X
1
= P log ≥k−τ + PX,Y (x, y)1{J(x,C|y)̸=∅} .
PX|Y (X|Y) x, y
= 2k−τ P[cx′ = cx ]
= 2−τ
Hence the theorem follows as usual from two terms in the union bound.
X {0, 1}k1
Compressor f1
Decompressor g
(X̂, Ŷ)
Y {0, 1}k2
Compressor f2
i i
i i
i i
188
• (f1 , f2 , g) is (k1 , k2 , ϵ)-code if f1 : X → {0, 1}k1 , f2 : Y → {0, 1}k2 , g : {0, 1}k1 × {0, 1}k2 →
X × Y , s.t. P[(X̂, Ŷ) 6= (X, Y)] ≤ ϵ, where (X̂, Ŷ) = g(f1 (X), f2 (Y)).
• Fundamental limit: ϵ∗SW (X, Y, k1 , k2 ) = inf{ϵ : ∃(k1 , k2 , ϵ)-code}.
R2
Achievable
H(T )
Region
H(T |S)
R1
H(S|T ) H(S)
Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope is −1.
Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:
1 R1 < H(S|T). Then even if encoder and decoder had full Tn , still can’t achieve this (from
compression with side info result – Corollary 11.11).
2 R2 < H(T|S) (same).
3 R1 + R2 < H(S, T). Can’t compress below the joint entropy of the pair (S, T).
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T|S) + δ). By
Corollary 11.3, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (Sn )) 6= Sn ] ≤
ϵn → 0. By Theorem 11.13, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (Tn ), Sn ) 6= Tn ] ≤ ϵn → 0. Now that Sn is not available, feed the S.W. decompres-
sor with g(f(Sn )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 ))) (see
below):
i i
i i
i i
Sn Ŝn
f1 g1
Tn T̂n
f2 g2
(Exercise: Write down the details rigorously.) Therefore, all convex combinations of points in the
achievable regions are also achievable, so the achievable region must be convex.
Theorem 11.17 (Ahlswede-Körner-Wyner). Consider i.i.d. source (Xn , Yn ) ∼ PX,Y with X dis-
crete. If rate pair (R1 , R2 ) is achievable with vanishing probability of error P[X̂n 6= Xn ] → 0, then
there exists an auxiliary random variable U taking values on alphabet of cardinality |Y| + 1 such
that PX,Y,U = PX,Y PU|X,Y and
i i
i i
i i
190
X {0, 1}k1
Compressor f1
Decompressor g
X̂
Y {0, 1}k2
Compressor f2
Furthermore, for every such random variable U the rate pair (H(X|U), I(Y; U)) is achievable with
vanishing error.
where (11.9) follows from I(W2 , Xk−1 ; Yk |Yk−1 ) = I(W2 ; Yk |Yk−1 ) + I(Xk−1 ; Yk |W2 , Yk−1 ) and the
⊥ Xk−1 |Yk−1 ; and (11.10) from Yk−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (11.7) and (11.10) we
i i
i i
i i
and thus (from convexity) the rate pair must belong to the region spanned by all pairs
(H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be chosen to take
at most |Y| + 1 values, one can invoke Carathéodory’s theorem (see Lemma 7.12). We omit the
details.
Finally, showing that for each U the mentioned rate-pair is achievable, we first notice that if there
were side information at the decompressor in the form of the i.i.d. sequence Un correlated to Xn ,
then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be sufficient to reconstruct
Xn . Thus, the question boils down to creating a correlated sequence Un at the decompressor by
using the minimal rate R2 . This is the content of the so called covering lemma – see Theorem 25.5:
It is sufficient to use rate I(U; Y) to do so. We omit further details.
i i
i i
i i
In this chapter, we shall examine similar results for ergodic processes and we first state the main
theory as follows:
Corollary 12.2. For any stationary and ergodic discrete process {S1 , S2 , . . . }, (12.1) – (12.2) hold
with H(S) replaced by H.
Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 10.6 + Theo-
rem 11.2 which tie together the respective CDF of the random variable l(f∗ (Sn )) and log PSn1(sn ) .
In Chapter 11 we learned the asymptotic equipartition property (AEP) for iid sources. Here we
generalize it to stationary ergodic sources thanks to Shannon-McMillan.
Corollary 12.3 (AEP for stationary ergodic sources). Let {S1 , S2 , . . . } be a stationary and ergodic
discrete process. For any δ > 0, define the set
1 1
Tδn = sn : log − H ≤δ .
n PSn (sn )
Then
1 P Sn ∈ Tδn → 1 as n → ∞.
2 2n(H−δ) (1 + o(1)) ≤ |Tδn | ≤ 2(H+δ)n (1 + o(1)).
192
i i
i i
i i
Some historical notes are in order. Convergence in probability for stationary ergodic Markov
chains was already shown in [277]. The extension to convergence in L1 for all stationary ergodic
processes is due to McMillan in [216], and to almost sure convergence to Breiman [52]. A modern
proof is in [6]. Note also that for a Markov chain, existence of typical sequences and the AEP can
be anticipated by thinking of a Markov process as a sequence of independent decisions regarding
which transitions to take at each state. It is then clear that Markov process’s trajectory is simply a
transformation of trajectories of an iid process, hence must concentrate similarly.
The set E is called τ -invariant if E = τ −1 E. The set of all τ -invariant sets forms a σ -algrebra
(exercise) denoted Finv .
Sj = Sj−1 ◦ τ = S0 ◦ τ j
Remark 12.1.
(s1 , s2 , . . . ) ∈ E ⇒ (s0 , s1 , s2 , . . . ) ∈ E, s0
i i
i i
i i
194
It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event
{∃n : xi = 0, ∀ odd i ≥ n}
Proposition 12.6 (Poincare recurrence). Let τ be measure-preserving for (Ω, F, P). Then for any
measurable A with P[A] > 0 we have
[
P[ τ −k A|A] = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥ 1
S
Proof. Let B = k≥ 1 τ −k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently
P[ A ∪ B] = P[ B] . (12.5)
but the left-hand side equals P[A ∪ B] by the measure-preservation of τ , proving (12.5).
Consider τ mapping initial state of the conservative (Hamiltonian) mechanical system to its
state after passage of a given unit of time. It is known that τ preserves Lebesgue measure in
phase space (Liouville’s theorem). Thus Poincare recurrence leads to a rather counter-intuitive
conclusions. For example, opening the barrier separating two gases in a cylinder allows them to
mix. Poincare recurrence says that eventually they will return back to the original separated state
(with each gas occupying roughly its half of the cylinder). Of course, the “paradox” is resolved
by observing that it will take unphysically long for this to happen.
i i
i i
i i
P[A ∩ τ −n B] → P[A]P[B] .
verify this process is ergodic (in the sense defined above!). Note however, that in Markov-chain
literature a chain is called ergodic if it is irreducible, aperiodic and recurrent. This example does
not satisfy this definition (this clash of terminology is a frequent source of confusion).
• (optional) {Si }: stationary zero-mean Gaussian process with autocovariance function R(n) =
E[S0 S∗n ].
1 X
n
lim R[t] = 0 ⇔ {Si } ergodic ⇔ {Si } weakly mixing
n→∞ n + 1
t=0
Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesaro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {R(n)}, in the sense that there exists a unique probability measure μ on [− 12 , 21 ] such
that R(n) = E exp(i2nπX) where X ∼ μ. The spectral criteria can be formulated as follows:
i i
i i
i i
196
Detailed exposition on stationary Gaussian processes can be found in [101, Theorem 9.3.2, pp.
474, Theorem 9.7.1, pp. 493–494].
Theorem 12.8 (Birkhoff-Khintchine’s Ergodic Theorem). Let {Si } be a stationary and ergodic
process. For any integral function f, i.e., E |f(S1 , . . . )| < ∞,
1X
n
lim f(Sk , . . . ) = E f(S1 , . . . ). a.s. and in L1 .
n→∞ n
k=1
In the special case where f depends on finitely many coordinates, say, f = f(S1 , . . . , Sm ),
1X
n
lim f(Sk , . . . , Sk+m−1 ) = E f(S1 , . . . , Sm ). a.s. and in L1 .
n→∞ n
k=1
Definition 12.9. {Si : i ∈ N} is an mth order Markov chain if PSt+1 |St1 = PSt+1 |Stt−m+1 for all t ≥ m.
It is called time homogeneous if PSt+1 |Stt−m+1 = PSm+1 |Sm1 .
Remark 12.2. Showing (12.3) for an mth order time homogeneous Markov chain {Si } is a direct
application of Birkhoff-Khintchine.
1X
n
1 1 1
log = log
n n
PSn (S ) n PSt |St−1 (St |St−1 )
t=1
1 X
n
1 1 1
= log + log
n PSm (Sm ) n
t=m+1
PSt |St−1 (Sl |Sll− 1
−m )
t−m
1 X
n
1 1 1
= log + log t−1
, (12.6)
n PS1 (Sm ) n P | m ( S | S − )
| {z 1
} | t=m+1 S
{z
m+ 1 S1
t t m
}
→0
→H(Sm+1 |Sm
1 ) by Birkhoff-Khintchine
1
where we applied Theorem 12.8 with f(s1 , s2 , . . .) = log PS m (sm+1 |sm )
.
m+1 |S1 1
i i
i i
i i
Now let’s prove (12.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to approximate the distribution of that ergodic process by an m-th order MC
(finite memory) and make use of (12.6); then let m → ∞ to make the approximation accurate
(Markov approximation).
Proof of Theorem 12.1 in L1 . To show that (12.3) converges in L1 , we want to show that
1
1
E log − H → 0, n → ∞.
n PSn (Sn )
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
Y
∞
Q(m) (S∞ m
1 ) = PS1m (S1 ) PSt |St−1 (St |Stt− 1
−m )
t− m
t=m+1
Y∞
stat.
= PSm1 (Sm
1) PSm+1 |Sm1 (St |Stt− 1
−m )
t=m+1
By triangle inequality,
1 1
1 1 1 1
E log n
− H ≤E log n
− log (m)
n PSn (S ) n PSn (S ) n n
QSn (S )
| {z }
≜A
1
1
+ E log (m) − Hm + |Hm − H|
n n
QSn (S ) | {z }
| {z } ≜C
≜B
stat. 1 (−H(Sn )
+ H( S ) + ( n − m) Hm )
m
=
n
→ Hm − H as n → ∞
i i
i i
i i
198
Lemma 12.10.
dP 2 log e
EP log ≤ D(PkQ) + .
dQ e
Intuitively:
1X k 1
n
An = T = (I − Tn )(I − T)−1
n n
k=1
Then, if f ⊥ ker(I − T) we should have An f → 0, since only components in the kernel can blow
up. This intuition is formalized in the proof below.
Let’s further decompose f into two parts f = f1 + f2 , where f1 ∈ ker(I − T) and f2 ∈ ker(I − T)⊥ .
Observations:
• if g ∈ ker(I − T), g must be a constant function. This is due to the ergodicity. Consider indicator
function 1A , if 1A = 1A ◦ τ = 1τ −1 A , then P[A] = 0 or 1. For a general case, suppose g = Tg and
g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and have non-trivial
measure, violating ergodicity.
i i
i i
i i
where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ kfk · kgk only holds with
equality for g = cf for some constant c.
• ker(I − T)⊥ = ker(I − T∗ )⊥ = [Im(I − T)], where [Im(I − T)] is an L2 closure.
• g ∈ ker(I − T)⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.
With these observations, we know that f1 = m is a const. Also, f2 ∈ [Im(I − T)] so we further
approximate it by f2 = f0 + h1 , where f0 ∈ Im(I − T), namely f0 = g − g ◦ τ for some function
g ∈ L2 , and kh1 k1 ≤ kh1 k2 < ϵ. Therefore we have
An f1 = f1 = E[f]
1
An f0 = (g − g ◦ τ n ) → 0 a.s. and L1
n
P g◦τ n 2 P 1 a. s .
since E[ n≥1 ( n ) ] = E[g ] n2 < ∞ =⇒ 1n g ◦ τ n −−→0.
2
as required.
Proof of (12.7) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 12.11 (Maximal Ergodic Lemma). Let (P, τ ) be a probability measure and a measure-
preserving transformation. Then for any f ∈ L1 (P) we have
E[f1
{supn≥1 An f>a} ] kfk1
P sup An f > a ≤ ≤
n≥1 a a
Pn−1
where An f = 1
n k=0 f ◦ τ k.
This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this theorem
is exactly equivalent to the following result:
Lemma 12.12 (Estimate for the maximum of averages). Let {Zn , n = 1, . . .} be a stationary
process with E[|Z|] < ∞ then
|Z1 + . . . + Zn | E[|Z|]
P sup >a ≤ ∀a > 0
n≥1 n a
i i
i i
i i
200
Proof. The argument for this Lemma has originally been quite involved, until a dramatically
simple proof (below) was found by A. Garcia.
Define
Xn
Sn = Zk (12.8)
k=1
Ln = max{0, Z1 , . . . , Z1 + · · · + Zn } (12.9)
Mn = max{0, Z2 , Z2 + Z3 , . . . , Z2 + · · · + Zn } (12.10)
Sn
Z∗ = sup (12.11)
n≥1 n
i i
i i
i i
Note that every random variable X0 generates a stationary process adapted to τ , that is
Xk ≜ X0 ◦ τ k .
In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:
Theorem 12.14 (Sinai’s generator theorem). Let Y be the generator of a p.p.t. (Ω, F, P, τ ). Let
H(Y) be the entropy rate of the process Y = {Yk = Y ◦ τ k , k = 0, . . .}. If H(Y) is finite, then
H(τ ) = H(Y).
Proof. Notice that since H(Y) is finite, we must have H(Y0n ) < ∞ and thus H(Y) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞
we have from lower semicontinuity of mutual information, cf. (4.28), that
lim I(Y; Ỹm ) ≥ H(Y) ,
m→∞
i i
i i
i i
202
X
n
= H(Ỹn0 ) + H(Yi |Ỹn0 , Yi0−1 )
i=0
X
n
≤ H(Ỹn0 ) + H(Yi |Ỹi )
i=0
Thus, entropy rate of Ỹ (which has finite-alphabet) can be made arbitrarily close to the entropy
rate of Y, concluding that H(τ ) ≥ H(Y).
The main part is showing that for any stationary process X adapted to τ the entropy rate is
upper bounded by H(Y). To that end, consider X : Ω → X with finite X and define as usual the
process X = {X ◦ τ k , k = 0, 1, . . .}. By generating property of Y we have that X (perhaps after
modification on a set of measure zero) is a function of Y∞0 . So are all Xk . Thus
where we used the continuity-in-σ -algebra property of mutual information, cf. (4.30). Rewriting
the latter limit differently, we have
lim H(X0 |Yn0 ) = 0 .
n→∞
where we used stationarity of (Xk , Yk ) and the fact that H(X0 |Yn0−i ) < ϵ for i ≤ n − m. After
dividing by n and passing to the limit our argument implies
H( X ) ≤ H( Y ) + ϵ .
Taking here ϵ → 0 completes the proof.
Alternative proof: Suppose X0 is taking values on a finite alphabet X and X0 = f(Y∞
0 ). Then (this
is a measure-theoretic fact) for every ϵ > 0 there exists m = m(ϵ) and a function fϵ : Y m+1 → X
s.t.
P [ f( Y ∞
0 ) 6= fϵ (Y0 )] ≤ ϵ .
m
S
(This is just another way to say that n σ(Yn0 ) is P-dense in σ(Y∞ 0 ).) Define a stationary process
X̃ as
X̃j ≜ fϵ (Ym
j
+j
).
i i
i i
i i
It is easy to show that Y(ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
• Let Ω be the unit circle S1 , F the Borel σ -algebra, and P the normalized length and
τ (ω) = ω + γ
γ
i.e. τ is a rotation by the angle γ . (When 2π is irrational, this is known to be an ergodic p.p.t.).
Here Y = 1{|ω| < 2π ϵ} is a generator for arbitrarily small ϵ and hence
H(τ ) ≤ H(X) ≤ H(Y0 ) = h(ϵ) → 0 as ϵ → 0 .
This is an example of a zero-entropy p.p.t.
Remark 12.3. Two p.p.t.’s (Ω1 , τ1 , P1 ) and (Ω0 , τ0 , P0 ) are called isomorphic if there exists fi :
Ωi → Ω1−i defined Pi -almost everywhere and such that 1) τ1−i ◦ fi = f1−i ◦ τi ; 2) fi ◦ f1−i is identity
on Ωi (a.e.); 3) Pi [f−
1−i E] = P1−i [E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphic
1
p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revoluationary, since
it allowed to show that p.p.t.s corresponding shifts of iid Ber(1/2) and iid Ber(1/3) procceses are
not isomorphic. Before, the only invariants known were those obtained from studying the spectrum
of a unitary operator
Uτ : L 2 ( Ω , P ) → L 2 ( Ω , P ) (12.16)
i i
i i
i i
204
1
To see the statement about the spectrum, let Xi be iid with zero mean and unit variance. Then consider ϕ(x∞1 ) defined as
∑m iωk x . This ϕ has unit energy and as m → ∞ we have kU ϕ − eiω ϕk
√1 L2 → 0. Hence every e
iω belongs to
m k=1 e k τ
the spectrum of Uτ .
i i
i i
i i
13 Universal compression
In this chapter we will discuss how to produce compression schemes that do not require apriori
knowledge of the distribution. Here, compressor is a map X n → {0, 1}∗ . Now, however, there is
no one fixed probability distribution PXn on X n . The plan for this chapter is as follows:
1 We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymp-
totically optimal simulatenously for all i.i.d. distributions and with small modifications for all
finite-order Markov chains.
2 Next class of universal compressors is based on assuming that the true distribution PXn belongs
to a given class. These methods proceed by choosing a good model distribution QXn serving as
the minimax approximation to each distribution in the class. The compression algorithm for a
single distribution QXn is then designed as in previous chapters.
3 Finally, an entirely different idea are algorithms of Lempel-Ziv type. These automatically adapt
to the distribution of the source, without any prior assumptions required.
Throughout this chapter, all logarithms are binary. Instead of describing each compres-
sion algorithm, we will merely specify some distribution QXn and apply one of the following
constructions:
• Sort all xn in the order of decreasing QXn (xn ) and assign values from {0, 1}∗ as in Theorem 10.2,
this compressor has lengths satisfying
1
ℓ(f(xn )) ≤ log .
QXn (xn )
• Set lengths to be
1
ℓ(f(x )) ≜ log
n
Q X ( xn )
n
205
i i
i i
i i
206
and in this way we may and will always replace lengths with log QXn1(xn ) . In this architecture, the
only task of a universal compression algorithm is to specify the probability assignment QXn .
Remark 13.1. Furthermore, if we only restrict attention to prefix codes, then any code f : X n →
{0, 1}∗ defines a distribution QXn (xn ) = 2−ℓ(f(x )) . (We assume the code’s binary tree is full such
n
that the Kraft sum equals one). In this way, for prefix-free codes results on redundancy, stated
in terms of optimizing the choice of QXn , imply tight converses too. For one-shot codes without
prefix constraints the optimal answers are slightly different, however. (For example, the optimal
universal code for all i.i.d. sources satisfies E[ℓ(f(Xn ))] ≈ H(Xn ) + |X 2|−3 log n in contrast with
|X |−1
2 log n for prefix-free codes, cf. [26, 184].)
Qn
If one factorizes QXn = t=1 QXt |Xt−1 then we arrive at a crucial conclusion: (universal) com-
1
pression is equivalent to sequential (online) prediction under the log-loss. As of 2022 the best
performing text compression algorithms (cf. the leaderboard at [207]) use a deep neural network
(transformer model) that starts from a fixed initialization. As the input text is processed, parame-
ters of the network are continuously updated via stochastic gradient descent causing progressively
better prediction (and hence compression) performance.
Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QXn (xn )). These intervals are disjoint
subintervals of [0, 1). As such, each xn can be represented uniquely by any point in the interval Ixn .
A specific choice is as follows. Encode
and we agree to select the left-most dyadic interval when there are two possibilities. Recall that
dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ) where m is an integer. We encode
such interval by the k-bit (zero-padded) binary expansion of the fractional number m2−k =
Pk
0.b1 b2 . . . bk = i=1 bi 2−i . For example, [3/4, 7/8) 7→ 110, [3/4, 13/16) 7→ 1100. We set the
i i
i i
i i
codeword f(xn ) to be that string. The resulting code is a prefix code satisfying
1 1
log2 ≤ ℓ(f(x )) ≤ log2
n
+ 1. (13.2)
QXn (xn ) QXn (xn )
(This is an exercise, see Ex. II.11.)
Observe that
X
Fn (xn ) = Fn−1 (xn−1 ) + QXn−1 (xn−1 ) QXn |Xn−1 (y|xn−1 )
y<xn
and thus Fn (xn ) can be computed sequentially if QXn−1 and QXn |Xn−1 are easy to compute. This
method is the method of choice in many modern compression algorithms because it allows to
dynamically incorporate the learned information about the stream, in the form of updating QXn |Xn−1
(e.g. if the algorithm detects that an executable file contains a long chunk of English text, it may
temporarily switch to QXn |Xn−1 modeling the English language).
We note that efficient implementation of arithmetic encoder and decoder is a continuing
research area. Indeed, performance depends on number-theoretic properties of denominators of
distributions QXt |Xt−1 , because as encoder/decoder progress along the string, they need to periodi-
cally renormalize the current interval Ixt to be [0, 1) but this requires carefully realigning the dyadic
boundaries. A recent idea, known as asymmetric numeral system (ANS) [103], lead to such impres-
sive computational gains that in less than a decade it was adopted by most compression libraries
handling diverse data streams (e.g., the Linux kernel images, Dropbox and Facebook traffic, etc).
1X
n
P̂xn (a) ≜ 1 { xi = a } . (13.5)
n
i=1
Then Fitingof argues that it should be possible to produce a prefix code with
This can be done in many ways. In the spirit of what comes next, let us define
i i
i i
i i
208
cn = O(n−(|X |−1) ) ,
and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (13.6).1 Now
i.i.d.
taking expectation over Xn ∼ PX we get
Universal compressor for all finite-order Markov chains. Fitingof’s idea can be extended as
follows. Define now the first order information content Φ1 (xn ) to be the log of the number of all
sequences, obtainable by permuting xn with extra restriction that the new sequence should have
the same statistics on digrams. Asymptotically, Φ1 is just the conditional entropy
where T − 1 is understood in the sense of modulo n. Again, it can be shown that there exists a code
such that lengths
This implies that for every first order stationary Markov chain X1 → X2 → · · · → Xn we have
This can be further continued to define Φ2 (xn ) and build a universal code, asymptotically
optimal for all second order Markov chains etc.
simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we do
next? Krichevsky suggested that the next barrier should be to minimize the regret, or redundancy:
1
Explicitly, we can do a two-part encoding: first describe the type class of xn (takes (|X | − 1) log n bits) and then describe
the element of the class (takes Φ0 (xn ) bits).
i i
i i
i i
Replacing code lengths with log Q1Xn , we define redundancy of the distribution QXn as
Thus, the question of designing the best universal compressor (in the sense of optimizing worst-
case deviation of the average length from the entropy) becomes the question of finding solution
of:
Definition 13.1 (Redundancy in universal compression). Given a class of sources {PXn |θ=θ0 : θ0 ∈
Θ, n = 1, . . .} we define its minimax redundancy as
Assuming the finiteness of R∗n , Theorem 5.9 gives the maximin and capacity representation
where optimization is over priors π ∈ P(Θ) on θ. Thus redundancy is simply the capacity of
the channel θ → Xn . This result, obvious in hindsight, was rather surprising in the early days of
universal compression. It is known as capacity-redundancy theorem.
Finding exact QXn -minimizer in (13.8) is a daunting task even for the simple class of all i.i.d.
Bernoulli sources (i.e. Θ = [0, 1], PXn |θ = Bern (θ)). In fact, for smooth parametric families the
capacity-achieving input distribution is rather ugly: it is a discrete distribution with a kn atoms, kn
slowly growing as n → ∞. A provocative conjecture was put forward by physicists [213, 1] that
there is a certain universality relation:
3
R∗n = log kn + o(log kn )
4
satisfied for all parametric families simultaneously. For the Bernoulli example this implies kn
n2/3 , but even this is open. However, as we will see below it turns out that these unwieldy capacity-
achieving input distributions converge as n → ∞ to a beautiful limiting law, known as the Jeffreys
prior.
Remark 13.2. (Shtarkov, Fitingof and individual sequence approach) There is a connection
between the combinatorial method of Fitingof and the method of optimality for a class. Indeed,
i i
i i
i i
210
(S)
following Shtarkov we may want to choose distribution QXn so as to minimize the worst-case
redundancy for each realization xn (not average!):
PXn |θ (xn |θ0 )
min max sup log (13.11)
QXn n x θ0 Q X n ( xn )
This leads to Shtarkov’s distribution (also known as the normalized maximal likehood (NML)
code):
(S)
QXn (xn ) = c sup PXn |θ (xn |θ0 ) , (13.12)
θ0
where c is the normalization constant. If class {PXn |θ , θ ∈ Θ} is chosen to be all i.i.d. distributions
on X then
(S)
i.i.d. QXn (xn ) = c exp{−nH(P̂xn )} , (13.13)
(S)
and thus compressing w.r.t. QXn recovers Fitingof’s construction Φ0 up to O(log n) differences
between nH(P̂xn ) and Φ0 (xn ). If we take PXn |θ to be all first order Markov chains, then we get
construction Φ1 etc. Note also, that the problem (13.11) can also be written as minimization of the
regret for each individual sequence (under log-loss, with respect to a parameter class PXn |θ ):
1 1
min max log − inf log . (13.14)
QXn xn QXn (xn ) θ0 PXn |θ (xn |θ0 )
The gospel is that if there is a reason to believe that real-world data xn are likely to be generated
by one of the models PXn |θ , then using minimizer of (13.14) will result in the compressor that both
learns the right model (in the sense of QXn |Xn−1 ≈ true PXn |Xn−1 ) and compresses with respect to it.
See more in Section 13.6.
Γ(α0 +...+αd )
where c(α0 , . . . , αd ) = Qd is the normalizing constant.
j=0 Γ(αj )
First, we give the formal setting as follows:
i i
i i
i i
Pd
• As in Example 2.6, let Θ = {(θ1 , . . . , θd ) : j=1 θj ≤ 1, θj ≥ 0} parametrizes the collection of
all probability distributions on X . Note that Θ is a d-dimensional simplex. We will also define
X
d
θ0 ≜ 1 − θj .
j=1
In order to find the (near) optimal QXn , we need to guess an (almost) optimal prior π ∈ P(Θ)
in (13.10) and take QXn to be the mixture of PXn |θ ’s. We will search for π in the class of smooth
densities on Θ and set
Z
QXn (xn ) ≜ PXn |θ (xn |θ′ )π (θ′ )dθ′ . (13.16)
Θ
Before proceeding further, we recall the Laplace method of approximating exponential inte-
grals. Suppose that f(θ) has a unique minimum at the interior point θ̂ of Θ and that Hessian Hessf
is uniformly lower-bounded by a multiple of identity (in particular, f(θ) is strongly convex). Then
taking Taylor expansion of π and f we get
Z Z
−nf(θ)
dθ = (π (θ̂) + O(ktk))e−n(f(θ̂)− 2 t Hessf(θ̂)t+o(∥t∥ )) dt
1 T 2
π (θ)e (13.17)
Θ
Z
dx
= π (θ̂)e−nf(θ̂) e−x Hessf(θ̂)x √ (1 + O(n−1/2 ))
T
(13.18)
Rd nd
d2
−nf(θ̂) 2π 1
= π (θ̂)e q (1 + O(n−1/2 )) (13.19)
n
det Hessf(θ̂)
θ̂(xn ) ≜ P̂xn
d 2π Pθ (θ̂)
+ O( n− 2 ) ,
1
log QXn (xn ) = −nH(θ̂) + log + log q
2 n log e
det JF (θ̂)
i i
i i
i i
212
where we used the fact that Hessθ′ D(P̂kPX|θ=θ′ )|θ′ =θ̂ = log1 e JF (θ̂) with JF being the Fisher infor-
mation matrix introduced previously in (2.33). From here, using the fact that under Xn ∼ PXn |θ=θ′
the random variable θ̂ = θ′ + O(n−1/2 ) we get by approximating JF (θ̂) and Pθ (θ̂)
d Pθ (θ′ )
D(PXn |θ=θ′ kQXn ) = n(E[H(θ̂)]−H(X|θ = θ′ ))+ log n−log p +C+O(n− 2 ) , (13.20)
1
2 ′
det JF (θ )
where C is some constant (independent of the prior Pθ or θ′ ). The first term is handled by the next
result, refining Corollary 7.16.
i.i.d.
Lemma 13.2. Let Xn ∼ P on a finite alphabet X such that P(x) > 0 for all x ∈ X . Let P̂ = P̂Xn
be the empirical distribution of Xn , then
|X | − 1 1
E[D(P̂kP)] = log e + o .
2n n
log e 2
In fact, nD(P̂kP) → 2 χ (|X | − 1) in distribution.
√
Proof. By Central Limit Theorem, n(P̂ − P) converges in distribution to N (0, Σ), where Σ =
diag(P) − PPT , where P is an |X |-by-1 column vector. Thus, computing second-order Taylor
expansion of D(·kP), cf. (2.33) and (2.36), we get the result.
provided that the right side is integrable. A prior proportional to the square root of the determinant of the Fisher information matrix is known as the Jeffreys prior. In our case, using the explicit expression for the Fisher information (2.38), we conclude that π* is the Dirichlet(1/2, 1/2, · · · , 1/2) prior, with density
π*(θ) = c_d / √( ∏_{j=0}^{d} θ_j ) ,   (13.22)
where c_d = Γ((d+1)/2) / Γ(1/2)^{d+1} is the normalization constant. The corresponding redundancy is then
R*_n = (d/2) log (n/(2πe)) − log ( Γ((d+1)/2) / Γ(1/2)^{d+1} ) + o(1) .   (13.23)
Making the above derivation rigorous is far from trivial, and was completed in [337]. Surprisingly,
while the Jeffreys prior π ∗ that we derived does attain the claimed value (13.23) of the mutual
information I(θ; X^n), the corresponding mixture Q_{X^n} does not yield (13.23). In other words, when this Q_{X^n} is plugged into (13.8), the resulting value of the supremum over θ0 is much larger than the optimal value (13.23). The way (13.23) was proved is by patching the Jeffreys prior near the boundary of the simplex.
Extension to general smooth parametric families. The fact that Jeffreys prior θ ∼ π maxi-
mizes the value of mutual information I(θ; Xn ) for general parametric families was conjectured
in [29] in the context of selecting priors in Bayesian inference. This result was proved rigorously
in [68, 69]. We briefly summarize the results of the latter.
Let {Pθ : θ ∈ Θ0 } be a smooth parametric family admitting a continuous and bounded Fisher
information matrix JF (θ) everywhere on the interior of Θ0 ⊂ Rd . Then for every compact Θ
contained in the interior of Θ0 we have
R*_n(Θ) = (d/2) log (n/(2πe)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .   (13.24)
Although the Jeffreys prior on Θ achieves (up to o(1)) the optimal value of sup_π I(θ; X^n), to produce an approximately capacity-achieving output distribution Q_{X^n} one needs to take a mixture with respect to a Jeffreys prior on a slightly larger set Θ_ϵ = {θ : d(θ, Θ) ≤ ϵ} and let ϵ → 0 slowly as n → ∞. This sequence of Q_{X^n}'s does achieve the optimal redundancy up to o(1).
Remark 13.3. In statistics the Jeffreys prior is justified as being invariant to smooth reparametrization, as evidenced by (2.34). For example, in answering "will the sun rise tomorrow", Laplace proposed to estimate the probability by modeling sunrise as an i.i.d. Bernoulli process with a uniform prior on θ ∈ [0, 1]. However, this is clearly not very logical, as one may equally well postulate uniformity of α = θ^{10} or β = √θ. The Jeffreys prior θ ∼ 1/√(θ(1−θ)) is invariant to reparametrization in the sense that if one computed √(det J_F(α)) under the α-parametrization, the result would be exactly the pushforward of 1/√(θ(1−θ)) along the map θ ↦ θ^{10}.
2 This is obtained from the identity ∫_0^1 θ^a (1−θ)^b / √(θ(1−θ)) dθ = π · (1·3···(2a−1)) · (1·3···(2b−1)) / (2^{a+b} (a+b)!) for integer a, b ≥ 0. This identity can be derived by the change of variable z = θ/(1−θ) and using the standard keyhole contour in the complex plane.
This assignment can now be used to create a universal compressor via one of the methods outlined in the beginning of this chapter. However, what is remarkable is that it has a very nice sequential interpretation (as does any assignment obtained via Q_{X^n} = ∫ P_θ(dθ) P_{X^n|θ} with P_θ not depending on n).
Q^{(KT)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1/2)/n ,   t_1 = #{j ≤ n − 1 : x_j = 1}   (13.27)
Q^{(KT)}_{X_n|X^{n−1}}(0|x^{n−1}) = (t_0 + 1/2)/n ,   t_0 = #{j ≤ n − 1 : x_j = 0}   (13.28)
This is the famous "add 1/2" rule of Krichevsky and Trofimov. As mentioned in Section 13.1, this sequential assignment is very convenient for use in prediction as well as in implementing an arithmetic coder. The version for a general (non-binary) alphabet is equally simple:
Q^{(KT)}_{X_n|X^{n−1}}(a|x^{n−1}) = (t_a + 1/2)/(n + (|X| − 2)/2) ,   t_a = #{j ≤ n − 1 : x_j = a} .
Remark 13.4 (Laplace "add 1" rule). A slightly less optimal choice of Q_{X^n} results from the Laplace prior: just take P_θ to be uniform on [0, 1]. Then, in the Bernoulli (d = 1) case we get
Q^{(Lap)}_{X^n}(x^n) = 1 / ( C(n, w) (n + 1) ) ,   w = #{j : x_j = 1} .   (13.29)
The corresponding successive probability is given by
Q^{(Lap)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1)/(n + 1) ,   t_1 = #{j ≤ n − 1 : x_j = 1} .
We notice two things. First, the distribution (13.29) is exactly the same as Fitingof's (13.7). Second, this distribution "almost" attains the optimal first-order term in (13.23). Indeed, when X^n is iid Ber(θ) we have for the redundancy:
E[ log 1/Q^{(Lap)}_{X^n}(X^n) ] − H(X^n) = log(n + 1) + E[ log C(n, W) ] − nh(θ) ,   W ∼ Bin(n, θ) .   (13.30)
From Stirling's expansion we know that as n → ∞ this redundancy evaluates to (1/2) log n + O(1), uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redundancy (13.30) clearly equals log(n + 1). Thus, the supremum over θ ∈ [0, 1] is achieved close to the endpoints and results in a suboptimal redundancy log n + O(1). The Jeffreys prior (13.25) fixes the problem at the endpoints.
Consider the following problem: a sequence x^n is observed sequentially and our goal is to predict (by making a soft decision) the next symbol given the past observations. The experiment proceeds as follows:
Note that to make this goal formal, we need to explain how x^n is generated. Consider first a naive requirement that the worst-case loss is minimized:
min_{{Q_t}_{t=1}^n} max_{x^n} ℓ({Q_t}, x^n) .
This is clearly hopeless. Indeed, at any step t the distribution Q_t must have at least one atom with weight ≤ 1/|X|, and hence for any predictor
max_{x^n} ℓ({Q_t}, x^n) ≥ n log |X| ,
with equality iff Q_t(·) ≡ 1/|X|, i.e. if the predictor simply makes uniform random guesses. This triviality is not surprising: in the absence of any prior information on x^n it is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [119, 218], is to replace the loss with the regret, i.e. the gap to the best possible static oracle. More precisely, suppose a non-causal oracle can examine the entire string x^n and output a constant Q_t ≡ Q. From non-negativity of divergence this non-causal oracle achieves
ℓ_oracle(x^n) = min_Q Σ_{t=1}^n log 1/Q(x_t) = nH(P̂_{x^n}) .
Can a causal (but time-varying) predictor come close to this performance? In other words, we define the regret of a sequential predictor as the excess risk over the static oracle
reg({Q_t}, x^n) ≜ ℓ({Q_t}, x^n) − nH(P̂_{x^n})
and ask to minimize the worst-case regret:
Reg*_n ≜ min_{{Q_t}} max_{x^n} reg({Q_t}, x^n) .   (13.31)
Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do not rely on any assumptions on the prior distribution of x^n.
We next consider the case of X = {0, 1} for simplicity. To solve (13.31), first notice that designing a sequence {Q_t(·|x^{t−1})} is equivalent to defining one joint distribution Q_{X^n} and then factorizing the latter as Q_{X^n}(x^n) = ∏_t Q_t(x_t|x^{t−1}). Then the problem (13.31) becomes simply
Reg*_n = min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − nH(P̂_{x^n}) ] .
First, we notice that generally the optimal Q_{X^n} is the Shtarkov distribution (13.12), which implies that the regret is just the log of the normalization constant in the Shtarkov distribution. In the iid case we are considering, we get
Reg*_n = log Σ_{x^n} max_Q ∏_{i=1}^n Q(x_i) = log Σ_{x^n} exp{−nH(P̂_{x^n})} .
This expression is, however, frequently not very convenient to analyze, so instead we consider upper and lower bounds. We may lower-bound the max over x^n by the average over X^n ∼ Ber(θ)^n and obtain (also applying Lemma 13.2):
Reg*_n ≥ R*_n + ((|X| − 1)/2) log e + o(1) ,
where R*_n is the universal compression redundancy defined in (13.8), whose asymptotics we derived in (13.23).
On the other hand, taking Q^{(KT)}_{X^n} from Krichevsky-Trofimov (13.26) we find after some algebra and Stirling's expansion:
max_{x^n} [ log 1/Q^{(KT)}_{X^n}(x^n) − nH(P̂_{x^n}) ] = (1/2) log n + O(1) .
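In the binary case the Shtarkov sum can be evaluated exactly by grouping sequences according to their type, which gives a quick numerical check of the (1/2) log n behaviour; the Python sketch below (our own illustration) does this, and the gap between the two printed columns stays bounded in n, which is the O(1) term above.

import math

def shtarkov_regret_binary(n):
    """Reg*_n = log2 sum_{x^n} exp{-n H(P_hat)}, computed by summing over types k."""
    total = 0.0
    for k in range(n + 1):
        # max_Q prod_i Q(x_i) = (k/n)^k ((n-k)/n)^(n-k) for a sequence with k ones
        p_max = (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0
        total += math.comb(n, k) * p_max
    return math.log2(total)

for n in [10, 100, 1000]:
    print(n, round(shtarkov_regret_binary(n), 3), "  (1/2) log2 n =", round(0.5 * math.log2(n), 3))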
Reg*_n(Θ) = min_{Q_{X^n}} sup_{x^n} [ log 1/Q_{X^n}(x^n) − inf_{θ0∈Θ} log 1/P_{X^n|θ=θ0}(x^n) ] ,
This regret can be interpreted as worst-case loss of a given estimator compared to the best possible
one from a class PXn |θ , when the latter is selected optimally for each sequence. In this sense, regret
gives a uniform (in xn ) bound on the performance of an algorithm against a class.
It turns out that similarly to (13.24) the individual sequence redundancy for general d-
parametric families (under smoothness conditions) can be shown to satisfy [267]:
Reg*_n(Θ) = R*_n(Θ) + (d/2) log e + o(1) = (d/2) log (n/(2π)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .
In machine learning terms, we say that R∗n (Θ) in (13.8) is a cumulative sequential prediction
regret under the well-specified setting (i.e. data Xn is generated by a distribution inside the model
class Θ), while here Reg∗n (Θ) corresponds to a fully mis-specified setting (i.e. data is completely
arbitrary). There are also interesting settings in between these extremes, e.g. when data is iid but
not from a model class Θ, cf. [120].
the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we will attain the entropy rate of H(X_n | X_{n−r}^{n−1}). Note that the Krichevsky-Trofimov assignment (13.28) is clearly learning the distribution too: as n grows, the estimator Q_{X_n|X^{n−1}} converges to the true P_X (provided that the sequence is i.i.d.). So in some sense the converse is also true: any good universal compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the class of sources to include those with memory, we are invariably led to the problem of learning the joint distribution P_{X_0^{r−1}} of r-blocks. However, the number of samples required to obtain a good estimate of P_{X_0^{r−1}} is exponential in r. Thus learning may proceed rather slowly. The Lempel-Ziv family of algorithms works around this in an ingeniously elegant way:
• First, estimating probabilities of rare substrings takes longest, but it is also the least useful, as
these substrings almost never appear at the input.
• Second, and the most crucial, point is that an unbiased estimate of P_{X^r}(x^r) is given by the reciprocal of the time since the last observation of x^r in the data stream.
• Third, there is a prefix code³ mapping any integer n to a binary string of length roughly log₂ n; this integer code, denoted f_int, is referenced in (13.32) below (a sketch of one such construction follows this list).
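Here is a minimal Python sketch of one such prefix-free integer code (an Elias-gamma-style construction in the spirit of Exercise II.14; the function names are ours). It spends roughly 2 log₂ n bits per integer; the construction in the footnote, with length about log₂ n + 2 log₂ log n, is a refinement of the same idea.

def elias_gamma_encode(n: int) -> str:
    """Prefix-free code for n >= 1: (len-1) zeros, then the binary representation of n."""
    assert n >= 1
    b = bin(n)[2:]                    # binary representation, starts with '1'
    return "0" * (len(b) - 1) + b

def elias_gamma_decode_stream(bits: str):
    """Decode a concatenation of codewords back into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":         # leading zeros = len(binary) - 1
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

codeword = elias_gamma_encode(10)     # '0001010' (7 bits, about 2*log2(10) + 1)
print(codeword, elias_gamma_decode_stream(codeword + elias_gamma_encode(3)))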
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
We proceed to formal details. First, we need to establish Kac’s lemma.
3 For this just notice that Σ_{k≥1} 2^{−log₂ k − 2 log₂ log(k+1)} < ∞ and use Kraft's inequality. See also Ex. II.14.
= (1/P[X_0 = u]) · P[∃t ≥ 0 : X_t = u]   (13.38)
= 1/P[X_0 = u] ,   (13.39)
where (13.34) is the standard expression for the expectation of a Z_+-valued random variable, (13.37) is from stationarity, (13.38) is because the events corresponding to different t are disjoint, and (13.39) is from (13.33).
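Kac's lemma is easy to verify empirically: along a stationary trajectory the average time between consecutive visits to a state u converges to 1/P[X_0 = u]. The following Python sketch (the chain and its parameters are our own arbitrary choices) illustrates this.

import numpy as np

rng = np.random.default_rng(1)
# hypothetical 3-state transition matrix (rows sum to 1)
P = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.6,  0.2 ],
              [0.3, 0.3,  0.4 ]])
w, v = np.linalg.eig(P.T)                       # stationary law = eigenvector for eigenvalue 1
pi = np.real(v[:, np.argmax(np.real(w))]); pi = pi / pi.sum()

u, steps = 0, 300_000
state, last, gaps = rng.choice(3, p=pi), None, []
for t in range(steps):
    if state == u:
        if last is not None:
            gaps.append(t - last)
        last = t
    state = rng.choice(3, p=P[state])

print("mean return time to state", u, "≈", np.mean(gaps), "   1/pi[u] =", 1 / pi[u])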
The following proposition serves to explain the basic principle behind operation of Lempel-Ziv:
(1/n) E[ ℓ(f_n(X_0^{n−1}, X_{−∞}^{−1})) ] → H ,
Proof. Let L_n be the last occurrence of the block x_0^{n−1} in the string x_{−∞}^{−1} (recall that the latter is known to the decoder), namely
L_n = inf{t > 0 : x_{−t}^{−t+n−1} = x_0^{n−1}} .
Then, by Kac's lemma applied to the process Y_t^{(n)} = X_t^{t+n−1} we have
E[L_n | X_0^{n−1} = x_0^{n−1}] = 1/P[X_0^{n−1} = x_0^{n−1}] .
We now encode L_n using the code (13.32). Note that there is a crucial subtlety: even if L_n < n and thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring x_0^{n−1} can still be decoded from the knowledge of L_n.
We have, by applying Jensen's inequality twice and noticing that (1/n) H(X_0^{n−1}) → H and (1/n) log H(X_0^{n−1}) → 0, that
(1/n) E[ℓ(f_int(L_n))] ≤ (1/n) E[ log 1/P_{X_0^{n−1}}(X_0^{n−1}) ] + o(1) → H .
From Kraft's inequality we know that for any prefix code we must have
(1/n) E[ℓ(f_int(L_n))] ≥ (1/n) H(X_0^{n−1} | X_{−∞}^{−1}) = H .
The result shown above demonstrates that LZ algorithm has asymptotically optimal com-
pression rate for every stationary ergodic process. Recall, however, that previously discussed
compressors also enjoyed non-stochastic (individual sequence) guarantees. For example, we have
seen in Section 13.6 that Krichevsky-Trofimov’s compressor achieves on every input sequence a
compression ratio that is at most O( logn n ) worse than the arithmetic encoder built with the best
possible (for this sequence!) static probability assignment. It turns out that LZ algorithm is also
i i
i i
i i
220
special from this point of view. In [237] (see also [118, Theorem 4]) it was shown that the LZ
compression rate on every input sequence is better than that achieved by any finite state machine
(FSM) up to correction terms O( logloglogn n ). Consequently, investing via LZ achieves capital growth
that is competitive against any possible FSM investor [118].
Altogether we can see that LZ compression enjoys certain optimality guarantees in both the
stochastic and individual sequence senses.
Yj = Sj Ej .
Find entropy rate of Yj (you can give answer in the form of a convergent series). Evaluate at
τ = 0.11, δ = 1/2 and compare with H(Y1 ).
II.2 Recall that the entropy rate of a process {Xj : j = 1, . . .} is defined, provided the limit exists, as
H = lim_{n→∞} (1/n) H(X^n) .
Consider a 4-state Markov chain with transition probability matrix
0.89 0.11 0 0
0.11 0.89 0 0
0 0 0.11 0.89
0 0 0.89 0.11
The distribution of the initial state is [p, 0, 0, 1 − p].
(a) Does the entropy rate of such a Markov chain exist? If it does, find it.
(b) Describe the asymptotic behavior of the optimum variable-length rate (1/n) ℓ(f*(X1, . . . , Xn)).
Consider convergence in probability and in distribution.
(c) Repeat with transition matrix:
0.89 0.11 0 0
0.11 0.89 0 0
0 0 0. 5 0. 5
0 0 0. 5 0. 5
II.3 Consider a three-state Markov chain S1, S2, . . . with the following transition probability matrix
P = [ 1/2  1/4  1/4
      0    1/2  1/2
      1    0    0   ] .
Compute the limit of (1/n) E[ℓ(f*(S^n))] as n → ∞. Does your answer depend on the distribution of the initial state S1?
II.4 (a) Let X take values on a finite alphabet X . Prove that
ϵ*(X, k) ≥ (H(X) − k − 1) / log(|X| − 1) .
(b) Deduce the following converse result: For a stationary process {S_k : k ≥ 1} on a finite alphabet S,
lim inf_{n→∞} ϵ*(S^n, nR) ≥ (H − R) / log |S| ,
where H = lim_{n→∞} H(S^n)/n is the entropy rate of the process.
II.5 Run-length encoding is a popular variable-length lossless compressor used in fax machines, image compression, etc. Consider compression of S^n – an i.i.d. Ber(δ) source with very small δ = 1/128 – using run-length encoding: a chunk of consecutive r ≤ 255 zeros (resp. ones) is encoded into a zero (resp. one) followed by an 8-bit binary encoding of r (if there are > 255 consecutive zeros then two or more 9-bit blocks will be output). Compute the average achieved compression rate
lim_{n→∞} (1/n) E[ℓ(f(S^n))] .
How does it compare with the optimal lossless compressor?
Hint: Compute the expected number of 9-bit blocks output per chunk of consecutive zeros/ones; normalize by the expected length of the chunk.
II.6 Draw n random points independently and uniformly from the vertices of the following square.
Denote the coordinates by (X1 , Y1 ), . . . , (Xn , Yn ). Suppose Alice only observes Xn and Bob only
observes Yn . They want to encode their observation using RX and RY bits per symbol respectively
and send the codewords to Charlie who will be able to reconstruct the sequence of pairs.
(a) Find the optimal rate region for (RX , RY ).
(b) What if the square is rotated by 45◦ ?
II.7 Recall a bound on the probability of error for the Slepian-Wolf compression to k bits:
ϵ*_SW(k) ≤ min_{τ>0} P[ log_{|A|} 1/P_{X^n|Y}(X^n|Y) > k − τ ] + |A|^{−τ}   (II.1)
(1/n) Σ_{k=0}^{n−1} P[A ∩ τ^{−k}B] → P[A] P[B] .
For example, (0, 0, 0, 0), (0, 0, 0, 2) and (1, 1, 0, 0) satisfy the constraint but (0, 0, 1, 2) does not.
Let ϵ∗ (Sn , k) denote the minimum probability of error among all possible compressors of Sn =
{Sj , j = 1, . . . , n} with i.i.d. entries of finite entropy H(S) < ∞. Compute
as a function of R ≥ 0.
Hint: Relate to P[ℓ(f∗ (Sn )) ≥ γ n] and use Stirling’s formula (or Theorems 11.1.1, 11.1.3 in [76])
to find γ .
II.10 Mismatched compression. Let P, Q be distributions on some discrete alphabet A.
(a) Let f*_P : A → {0, 1}* denote the optimal variable-length lossless compressor for X ∼ P. Show that under Q,
E_Q[l(f*_P(X))] ≤ H(Q) + D(Q‖P).
(b) The Shannon code for X ∼ P is a prefix code f_P with the code length l(f_P(a)) = ⌈log₂ 1/P(a)⌉, a ∈ A. Show that if X is distributed according to Q instead, then
(g) Assume that X = (X1 , . . . , Xn ) is not iid but PX1 , PX2 |X1 , . . . , PXn |Xn−1 are known. How would
you modify the scheme so that we have
II.12 Enumerative Codes. Consider the following simple universal compressor for binary sequences: Given x^n ∈ {0, 1}^n, denote by n_1 = Σ_{i=1}^n x_i and n_0 = n − n_1 the number of ones and zeros in x^n. First encode n_1 ∈ {0, 1, . . . , n} using ⌈log₂(n + 1)⌉ bits, then encode the index of x^n in the set of all strings with n_1 ones using ⌈log₂ C(n, n_1)⌉ bits. Concatenating the two binary strings, we obtain the codeword of x^n. This defines a lossless compressor f : {0, 1}^n → {0, 1}*.
(a) Verify that f is a prefix code.
(b) Let S^n_θ be i.i.d. ∼ Ber(θ). Show that for any θ ∈ [0, 1],
[Optional: Explain why enumerative coding fails to achieve the optimal redundancy.]
Hint: The following non-asymptotic version of Stirling's approximation might be useful:
1 ≤ n! / ( √(2πn) (n/e)^n ) ≤ e/√(2π) ,   ∀n ∈ N.
II.13 Krichevsky-Trofimov codes. From Kraft's inequality we know that any probability distribution Q_{X^n} on {0, 1}^n gives rise to a prefix code f such that l(f(x^n)) = ⌈log₂ 1/Q_{X^n}(x^n)⌉ for all x^n. Consider the following Q_{X^n} defined by the factorization Q_{X^n} = Q_{X_1} Q_{X_2|X_1} · · · Q_{X_n|X^{n−1}},
Q_{X_1}(1) = 1/2 ,   Q_{X_{t+1}|X^t}(1|x^t) = (n_1(x^t) + 1/2)/(t + 1) ,   (II.3)
where n_1(x^t) denotes the number of ones in x^t. Denote the prefix code corresponding to this Q_{X^n} by f_KT : {0, 1}^n → {0, 1}*.
(a) Prove that for any n and any x^n ∈ {0, 1}^n,
Q_{X^n}(x^n) ≥ (1/(2√(n_0 + n_1))) · (n_0/(n_0 + n_1))^{n_0} (n_1/(n_0 + n_1))^{n_1} ,
where n_0 = n_0(x^n) and n_1 = n_1(x^n) denote the number of zeros and ones in x^n.
Hint: Use induction on n.
(b) Conclude that the K-T code length satisfies:
l(f_KT(x^n)) ≤ n h(n_1/n) + (1/2) log n + 2 ,   ∀x^n ∈ {0, 1}^n .
(c) Conclude that for K-T codes:
sup_{0≤θ≤1} { E[l(f_KT(S^n_θ))] − nh(θ) } ≤ (1/2) log n + O(1).
This value is known as the redundancy of a universal code. It turns out that (1/2) log n + O(1) is optimal for the class of all Bernoulli sources (see (13.23)).
Comments:
(a) The probability assignment (II.3) is known as the "add-1/2" estimator: upon observing x^t which contains n_1 ones, a natural probability assignment for x_{t+1} = 1 is the empirical average n_1/t. Instead, K-T codes assign probability (n_1 + 1/2)/(t + 1), or equivalently, add 1/2 to both n_0 and n_1. This is a crucial modification to Laplace's "add-one" estimator.⁴
(b) By construction, the probability assignment Q_{X^n} can be sequentially computed, which allows us to implement sequential encoding and encode a stream of bits on the fly. This is a highly desirable feature of the K-T codes. Of course, we need to resort to a construction other than the one used in the proof of Kraft's inequality, e.g., arithmetic coding.
II.14 (Elias coding) In this problem all logarithms and entropy units are binary.
(a) Consider the following universal compressor for natural numbers: For x ∈ N = {1, 2, . . .},
let k(x) denote the length of its binary representation. Define its codeword c(x) to be k(x)
zeros followed by the binary representation of x. Compute c(10). Show that c is a prefix
code and describe how to decode a stream of codewords.
4
Interested readers should check Laplace’s rule of succession and the sunrise problem
https://en.wikipedia.org/wiki/Rule_of_succession.
(b) Next we construct another code using the one above: Define the codeword c′ (x) to be c(k(x))
followed by the binary representation of x. Compute c′ (10). Show that c′ is a prefix code
and describe how to decode a stream of codewords.
(c) Let X be a random variable on N whose probability mass function is decreasing. Show that
E[log(X)] ≤ H(X).
(d) Show that the average code length of c satisfies E[ℓ(c(X))] ≤ 2H(X) + 2 bit.
(e) Show that the average code length of c′ satisfies E[ℓ(c′ (X))] ≤ H(X) + 2 log(H(X) + 1) + 3
bit.
Comments: The two coding schemes are known as Elias γ-codes and δ-codes.
Part III
In this part we study the topic of binary hypothesis testing (BHT). This is an important area
of statistics, with a definitive treatment given in [197]. Historically, there have been two schools
of thought on how to approach this question. One is the so-called significance testing of Karl
Pearson and Ronald Fisher. This is perhaps the most widely used approach in modern biomedical
and social sciences. The concepts of null hypothesis, p-value, χ2 -test, goodness-of-fit belong to
this world. We will not be discussing these.
The other school was pioneered by Jerzy Neyman and Egon Pearson, and is our topic in this part.
The concepts of type-I and type-II errors, likelihood-ratio tests, Chernoff exponent are from this
domain. This is, arguably, a more popular way of looking at the problem among the engineering
disciplines (perhaps explained by its foundational role in radar and electronic signal detection.)
The conceptual difference between the two is that in the first approach the full probabilistic model is specified only under the null hypothesis. (It still could be very specific, like X_i i.i.d. ∼ N(0, 1); contain unknown parameters, like X_i i.i.d. ∼ N(θ, 1) with θ ∈ R arbitrary; or be nonparametric, like (X_i, Y_i) i.i.d. ∼ P_{X,Y} = P_X P_Y denoting that observables X and Y are statistically independent.) The main goal of the statistician in this setting is inventing a testing process that is able to find statistically significant deviations from the postulated null behavior. If such a deviation is found then the null is rejected and (in scientific fields) a discovery is announced. The role of the alternative hypothesis (if one is specified at all) is to roughly suggest which features of the null are most likely to be violated and to motivate the choice of test procedures. For example, if under the null X_i i.i.d. ∼ N(0, 1), then both of the following are reasonable tests:
(1/n) Σ_{i=1}^n X_i ≈? 0    and    (1/n) Σ_{i=1}^n X_i² ≈? 1 .
However, the first one would be preferred if, under the alternative, “data has non-zero mean”, and
the second if “data has zero mean but variance not equal to one”. Whichever of the alternatives is
selected does not imply in any way the validity of the alternative. In addition, theoretical properties
of the test are mostly studied under the null rather than the alternative. For this approach the null
hypothesis (out of the two) plays a very special role.
The second approach treats hypotheses in complete symmetry. Exact specifications of proba-
bility distributions are required for both hypotheses and the precision of a proposed test is to be
analyzed under both. This is the setting that is most useful for our treatment of forthcoming topics
of channel coding (Part IV) and statistical estimation (Part VI).
The outline of this part is the following. First, we define the performance metric R(P, Q) giving
a full description of the BHT problem. A key result in this theory, the Neyman-Pearson lemma, determines the form of the optimal test and, at the same time, characterizes R(P, Q). We then
specialize to the setting of iid observations and consider two types of asymptotics (as the sam-
ple size n goes to infinity): Stein’s regime (where type-I error is held constant) and Chernoff’s
regime (where errors of both types are required to decay exponentially). The fundamental limit
in the former regime is simply a scalar (given by D(PkQ)), while in the latter it is a region. To
describe this region (Chapter 16) we will need to understand the problem of large deviations and
the information projection (Chapter 15).
14 Neyman-Pearson lemma
H0 : X ∼ P
H1 : X ∼ Q .
What this means is that under hypothesis H0 (the null hypothesis) X is distributed according to P,
and under H1 (the alternative hypothesis) X is distributed according to Q. A test (or decision rule)
between two distributions chooses either H0 or H1 based on an observation of X. We will consider randomized tests, described by a Markov kernel P_{Z|X} : X → {0, 1}. Let Z = 0 denote that the test chooses P (accepting the null) and Z = 1 that the test chooses Q (rejecting the null).
(rejecting the null).
This setting is called “testing simple hypothesis against simple hypothesis”. Here “simple”
refers to the fact that under each hypothesis there is only one distribution that could generate
the data. In comparison, a composite hypothesis postulates that X ∼ P for some P in a given class
of distributions; see Section 32.2.1.
In order to quantify the “effectiveness” of a test, we focus on two metrics. Let π i|j denote the
probability of the test choosing i when the correct hypothesis is j, with i, j ∈ {0, 1}. For every test
PZ|X we associate a pair of numbers:
• Bayesian: Assuming the prior distribution P[H0] = π_0 and P[H1] = π_1, we minimize the average probability of error:
P*_b = min_{P_{Z|X} : X → {0,1}} π_0 π_{1|0} + π_1 π_{0|1} .   (14.1)
• Minimax: Assuming there is an unknown prior distribution, we choose the test that performs the best for the worst-case prior
P*_m = min_{P_{Z|X} : X → {0,1}} max{π_{1|0}, π_{0|1}} .
• Neyman-Pearson: Minimize the type-II error β subject to the constraint that the success probability under the null is at least α.
In this book the Neyman-Pearson formulation and the following quantities play important roles:
Definition 14.1. Given (P, Q), the Neyman-Pearson region consists of the achievable points for all randomized tests:
R(P, Q) = { (P[Z = 0], Q[Z = 0]) : P_{Z|X} : X → {0, 1} } ⊂ [0, 1]² .   (14.2)
In particular, its lower boundary is defined as (see Fig. 14.1 for an illustration)
β_α(P, Q) ≜ inf_{P[Z=0]≥α} Q[Z = 0] .   (14.3)
(Figure 14.1: the Neyman-Pearson region R(P, Q) and its lower boundary β_α(P, Q).)
Remark 14.1. The Neyman-Pearson region encodes much useful information about the relation-
ship between P and Q. For example, we have the following extreme cases1
1
Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P[E] = 0 and Q[E] = 1 for some E.
P = Q ⇔ R(P, Q) is the diagonal segment connecting (0, 0) and (1, 1);  P ⊥ Q ⇔ R(P, Q) is the full unit square [0, 1]².
Moreover, TV(P, Q) coincides with half the length of the longest vertical segment contained in R(P, Q) (Exercise III.2).
Proof. (a) For convexity, suppose that (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), corresponding to tests
PZ0 |X , PZ1 |X , respectively. Randomizing between these two tests, we obtain the test λPZ0 |X +
λ̄PZ1 |X for λ ∈ [0, 1], which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
The closedness of R(P, Q) will follow from the explicit determination of all boundary
points via the Neyman-Pearson lemma – see Remark 14.3. In more complicated situations
(e.g. in testing against composite hypothesis) simple explicit solutions similar to Neyman-
Pearson Lemma are not available but closedness of the region can frequently be argued
still. The basic reason is that the collection of bounded functions {g : X → [0, 1]} (with g(x) = P_{Z|X}(0|x)) forms a weakly compact set and hence its image under the linear functional g ↦ (∫ g dP, ∫ g dQ) is closed.
(b) Testing by random guessing, i.e., Z ∼ Ber(1 − α) ⊥ ⊥ X, achieves the point (α, α).
(c) If (α, β) ∈ R(P, Q) is achieved by PZ|X , P1−Z|X achieves (1 − α, 1 − β).
The region R(P, Q) consists of the operating points of all randomized tests, which include as special cases those of deterministic tests, namely
R_det(P, Q) ≜ { (P[E], Q[E]) : E ⊂ X measurable } .   (14.4)
As the next result shows, the former is in fact the closed convex hull of the latter. Recall that cl(E) (resp. co(E)) denotes the closure (resp. convex hull) of a set E, namely, the smallest closed (resp. convex) set containing E. A useful example: for a subset E of a Euclidean space and measurable functions f, g with (f(x), g(x)) ∈ E for all x, we have (E[f(X)], E[g(X)]) ∈ cl(co(E)) for any real-valued random variable X. Consequently, if P and Q are on a finite alphabet X, then R(P, Q) is a polygon with at most 2^{|X|} vertices.
Proof. "⊃": Comparing (14.2) and (14.4), by definition R(P, Q) ⊃ R_det(P, Q), and the former is closed and convex by Theorem 14.2.
"⊂": Given any randomized test P_{Z|X}, define a measurable function g : X → [0, 1] by g(x) = P_{Z|X}(0|x). Then
P[Z = 0] = Σ_x g(x)P(x) = E_P[g(X)] = ∫_0^1 P[g(X) ≥ t] dt
Q[Z = 0] = Σ_x g(x)Q(x) = E_Q[g(X)] = ∫_0^1 Q[g(X) ≥ t] dt
where we applied the "area rule" E[U] = ∫_{R_+} P[U ≥ t] dt for any non-negative random variable U. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of points (P[g(X) ≥ t], Q[g(X) ≥ t]) ∈ R_det, averaged according to t uniformly distributed on the unit interval. Hence R ⊂ cl(co(R_det)).
The last claim follows because there are at most 2^{|X|} subsets in (14.4).
Example 14.1 (Testing Ber(p) versus Ber(q)). Assume that p < 1/2 < q. Using Theorem 14.3, note that there are 2² = 4 events E = ∅, {0}, {1}, {0, 1}. Then R(Ber(p), Ber(q)) is given by the closed convex hull of the four points (0, 0), (p, q), (p̄, q̄), (1, 1):
(Figure: the region R(Ber(p), Ber(q)), a quadrilateral with vertices (0, 0), (p, q), (1, 1) and (p̄, q̄) in the (α, β) unit square.)
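To make Example 14.1 concrete, the following Python sketch (ours, with hypothetical p and q) computes the vertices of the lower boundary β_α(P, Q) for two distributions on a finite alphabet: sort the symbols by decreasing likelihood ratio P/Q and take cumulative sums, exactly as the Neyman-Pearson lemma below prescribes; randomization interpolates linearly between consecutive vertices.

import numpy as np

def np_lower_boundary(P, Q):
    """Vertices (alpha, beta) of the lower boundary of R(P, Q); assumes Q > 0 on each symbol."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    order = np.argsort(-(P / Q))                 # decreasing likelihood ratio
    alpha = np.concatenate([[0.0], np.cumsum(P[order])])
    beta = np.concatenate([[0.0], np.cumsum(Q[order])])
    return alpha, beta

p, q = 0.3, 0.8                                  # hypothetical Ber(p) vs Ber(q)
alpha, beta = np_lower_boundary([1 - p, p], [1 - q, q])
for a, b in zip(alpha, beta):
    print(f"alpha = {a:.2f}   beta_alpha = {b:.2f}")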
Definition 14.4 (Extended log likelihood ratio). Assume that dP = p(x)dμ and dQ = q(x)dμ for
some dominating measure μ (e.g. μ = P + Q.) Recalling the definition of Log from (2.10) we
We see that taking the expectation over P and over Q are equivalent upon multiplying the quantity inside the expectation by exp(±T). The next result gives the precise details in the general case.
=^{(c)} ∫_{{−∞<T(x)≤∞}} p(x) exp(−T(x)) h(x) dμ = E_P[exp(−T) g(T)] ,
where in (a) we used (14.8) to justify the restriction to finite values of T; in (b) we used exp(−T(x)) = q(x)/p(x) for p, q > 0; and (c) follows from the fact that exp(−T(x)) = 0 whenever T = ∞. Exchanging the roles of P and Q proves (14.6).
The last part follows upon taking h(x) = f(x)1{T(x) ≥ τ} and h(x) = f(x)1{T(x) ≤ τ} in (14.5) and (14.6), respectively.
The importance of the LLR is that it is a sufficient statistic for testing the two hypotheses (recall
Section 3.5 and in particular Example 3.8), as the following result shows.
Proof. For part 2, sufficiency of T would be implied by P_{X|T} = Q_{X|T}. For the case of X being discrete we have:
From Theorem 14.3 we know that to obtain the achievable region R(P, Q), one can iterate over
all decision regions and compute the region Rdet (P, Q) first, then take its closed convex hull. But
this is a formidable task if the alphabet is large or infinite. On the other hand, we know that the
LLR is a sufficient statistic. Next we give bounds to the region R(P, Q) in terms of the statistics
of the LLR. As usual, there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy certain constraints;
• Achievability (inner bounds): points satisfying certain constraints belong to R(P, Q).
d(α‖β) ≤ D(P‖Q) ,   d(β‖α) ≤ D(Q‖P) .
We will strengthen this bound with the aid of the following result.
Note that we do not need to assume P ≪ Q precisely because ±∞ are admissible values for the (extended) LLR.
Proof. Defining τ = log γ and g(x) = P_{Z|X}(0|x) we get from (14.7):
P[Z = 0, T ≤ τ] − γ Q[Z = 0, T ≤ τ] ≤ 0 .
Decomposing P[Z = 0] = P[Z = 0, T ≤ τ] + P[Z = 0, T > τ] and similarly for Q, we then obtain
P[Z = 0] − γ Q[Z = 0] ≤ P[T > log γ, Z = 0] − γ Q[T > log γ, Z = 0] ≤ P[T > log γ] .
• Theorem 14.9 provides an outer bound for the region R(P, Q) in terms of half-spaces. To see this, fix γ > 0 and consider the line α − γβ = c, gradually increasing c from zero. There exists a maximal c, say c*, at which point the line touches the lower boundary of the region. Then (14.9) says that c* cannot exceed P[log (dP/dQ) > log γ]. Hence R must lie to the left of the line. Similarly, (14.10) provides bounds for the upper boundary. Altogether Theorem 14.9 states that R(P, Q) is contained in the intersection of an infinite collection of half-spaces indexed by γ.
• To apply the strong converse Theorem 14.9, we need to know the CDF of the LLR, whereas to apply the weak converse Theorem 14.7 we only need to know the expectation of the LLR, i.e., the divergence.
which is equivalent to minimizing the average probability of error in (14.1), with t = π_1/π_0. This can be solved without much effort. For simplicity, consider the discrete case. Then
α* − tβ* = max_{(α,β)∈R} (α − tβ) = max_{P_{Z|X}} Σ_{x∈X} (P(x) − tQ(x)) P_{Z|X}(0|x) = Σ_{x∈X} [P(x) − tQ(x)]⁺ ,
where the last equality follows from the fact that we are free to choose P_{Z|X}(0|x), and the best choice is obvious:
P_{Z|X}(0|x) = 1{ log (P(x)/Q(x)) ≥ log t } .
Thus, we have shown that all supporting hyperplanes are parameterized by LRTs. This completely recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the region. The precise result is stated as follows:
Theorem 14.10 (Neyman-Pearson Lemma: "LRT is optimal"). For each α, the infimum β_α in (14.3) is attained by the following test:
P_{Z|X}(0|x) = { 1,  if log (dP/dQ) > τ ;   λ,  if log (dP/dQ) = τ ;   0,  if log (dP/dQ) < τ }   (14.11)
where τ ∈ R ∪ {±∞} and λ ∈ [0, 1] are chosen so that P[Z = 0] = α.
Proof of Theorem 14.10. Let t = exp(τ). Given any test P_{Z|X}, let g(x) = P_{Z|X}(0|x) ∈ [0, 1]. We want to show that
α = P[Z = 0] = E_P[g(X)] = P[dP/dQ > t] + λ P[dP/dQ = t]   (14.12)
implies (this is the goal)
β = Q[Z = 0] = E_Q[g(X)] ≥ Q[dP/dQ > t] + λ Q[dP/dQ = t] .   (14.13)
Using the simple fact that E_Q[f(X)1{dP/dQ ≤ t}] ≥ t^{−1} E_P[f(X)1{dP/dQ ≤ t}] for any f ≥ 0 twice, we have
β = E_Q[g(X)] ≥ (1/t) E_P[g(X)1{dP/dQ ≤ t}] + E_Q[g(X)1{dP/dQ > t}]
  =_{(14.12)} (1/t) ( E_P[(1 − g(X))1{dP/dQ > t}] + λ P[dP/dQ = t] ) + E_Q[g(X)1{dP/dQ > t}]
  ≥ E_Q[(1 − g(X))1{dP/dQ > t}] + λ Q[dP/dQ = t] + E_Q[g(X)1{dP/dQ > t}]
  = Q[dP/dQ > t] + λ Q[dP/dQ = t] .
Remark 14.3. As a consequence of the Neyman-Pearson lemma, all the points on the boundary of the region R(P, Q) are attainable. Since α ↦ β_α is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set, as previously stated in Theorem 14.2. Consequently, the infimum in the definition of β_α is in fact a minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the following two sets:
{ (α, β) : α = P[log (dP/dQ) > τ], β = Q[log (dP/dQ) > τ] } ,   τ ∈ R ∪ {±∞},
and
{ (α, β) : α = P[log (dP/dQ) ≥ τ], β = Q[log (dP/dQ) ≥ τ] } ,   τ ∈ R ∪ {±∞}.
Therefore it does not lose optimality to restrict our attention to tests of the form 1{log (dP/dQ) ≥ τ} or 1{log (dP/dQ) > τ};² convex combinations (randomization) of these two types of tests trace out the full lower boundary.
(Figure: the complementary CDF τ ↦ P[log (dP/dQ) > τ], showing how a target level α determines the threshold τ.)
2
Note that it so happens that in Definition 14.4 the LRT is defined with an ≤ instead of <.
β ≤ exp(−τ) P[ log (dP/dQ) > τ ] ≤ exp(−τ)
• Stein regime: When π 1|0 is constrained to be at most ϵ, what is the best exponential rate of
convergence for π 0|1 ?
• Chernoff regime: When both π 1|0 and π 0|1 are required to vanish exponentially, what is the
optimal tradeoff between their exponents?
Theorem 14.13 (Stein's lemma). Consider the iid setting (14.14) where P_{X^n} = Pⁿ and Q_{X^n} = Qⁿ. Then, for every ϵ ∈ (0, 1), Vϵ = D(P‖Q). Consequently, V = D(P‖Q).
The way to use this result in practice is the following. Suppose it is required that α ≥ 0.999 and β ≤ 10^{−40}; what is the required sample size? Stein's lemma provides a rule of thumb: n ≥ (−log 10^{−40}) / D(P‖Q).
F_n = log (dP_{X^n}/dQ_{X^n}) = Σ_{i=1}^n log (dP/dQ)(X_i) ,   (14.15)
i=1
Note that both convergence results hold even if the divergence is infinite.
(Achievability) We show that Vϵ ≥ D(P‖Q) ≡ D for any ϵ > 0. First assume that D < ∞. Pick τ = n(D − δ) for some small δ > 0. Then Corollary 14.11 yields β ≤ exp(−n(D − δ)); picking n large enough (depending on ϵ, δ) so that α ≥ 1 − ϵ, the exponent E = D − δ is achievable and Vϵ ≥ E. Sending δ → 0 yields Vϵ ≥ D. Finally, if D = ∞, the above argument holds for arbitrary τ > 0, proving that Vϵ = ∞.
(Converse) We show that Vϵ ≤ D for any ϵ < 1, to which end it suffices to consider D < ∞. As
a warm-up, we first show a weak converse by applying Theorem 14.7 based on data processing
inequality. For any (α, β) ∈ R(PXn , QXn ), we have
−h(α) + α log (1/β) ≤ d(α‖β) ≤ D(P_{X^n}‖Q_{X^n})   (14.18)
For any achievable exponent E < Vϵ, by definition, there exists a sequence of tests such that α_n ≥ 1 − ϵ and β_n ≤ exp(−nE). Plugging this into (14.18) and using h ≤ log 2, we have E ≤ D(P‖Q)/(1 − ϵ) + log 2/(n(1 − ϵ)). Sending n → ∞ yields
Vϵ ≤ D(P‖Q)/(1 − ϵ) ,
which is weaker than what we set out to prove; nevertheless, this weak converse is tight for ϵ → 0, so that for Stein's exponent we have succeeded in proving the desired result V = lim_{ϵ→0} Vϵ = D(P‖Q). So the question remains: if we allow the type-I error to be ϵ = 0.999, is it possible for the type-II error to decay faster? This is shown to be impossible by the strong converse next.
To this end, note that in proving the weak converse we only made use of the expectation of F_n in (14.18); to obtain a better result we need to make use of its entire distribution (CDF). Applying the strong converse Theorem 14.9 to testing P_{X^n} versus Q_{X^n} with α = 1 − ϵ and β = exp(−nE), we have
1 − ϵ − γ exp(−nE) ≤ P[F_n > log γ] .
Pick γ = exp(n(D + δ)) for δ > 0; by the WLLN (14.16) the probability on the right side goes to 0, which implies that for any fixed ϵ < 1 we have E ≤ D + δ and hence Vϵ ≤ D + δ. Sending δ → 0 completes the proof.
Finally, let us address the case P ̸≪ Q, in which case D(P‖Q) = ∞. By definition, there exists a subset A such that Q(A) = 0 but P(A) > 0. Consider the test that selects P if X_i ∈ A for some i ∈ [n]. It is clear that this test achieves β = 0 and 1 − α = (1 − P(A))^n, which can be made less than any ϵ for large n. This shows Vϵ = ∞, as desired.
Remark 14.5 (Non-iid data). Just like in Chapter 12 on data compression, Theorem 14.13 can be
extended to stationary ergodic processes:
Vϵ = lim_{n→∞} (1/n) D(P_{X^n}‖Q_{X^n})
where {Xi } is stationary and ergodic under both P and Q. Indeed, the counterpart of (14.16) based
on WLLN, which is the key for choosing the appropriate threshold τ , for ergodic processes is the
Birkhoff-Khintchine convergence theorem (cf. Theorem 12.8).
Thus knowledge of Stein’s exponent Vϵ allows one to prove exponential bounds on probabilities
of arbitrary sets; this technique is known as “change of measure”, which will be applied in large
deviations analysis in Chapter 15.
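As an illustration of Stein's lemma (ours, with arbitrary Bernoulli parameters), the Python sketch below computes the exact β_{1−ϵ}(Pⁿ, Qⁿ) for P = Ber(p), Q = Ber(q) by applying the Neyman-Pearson test to the type of the sequence, and shows −(1/n) log β approaching D(P‖Q).

import numpy as np
from scipy.stats import binom

def beta_alpha_bernoulli(p, q, n, alpha):
    """Exact beta_alpha(P^n, Q^n) for P = Ber(p), Q = Ber(q) via the NP test on the type."""
    k = np.arange(n + 1)
    Pk, Qk = binom.pmf(k, n, p), binom.pmf(k, n, q)
    llr = k * np.log(p / q) + (n - k) * np.log((1 - p) / (1 - q))
    order = np.argsort(-llr)                  # accept H0 on the highest-LLR types first
    Pc, Qc = np.cumsum(Pk[order]), np.cumsum(Qk[order])
    j = np.searchsorted(Pc, alpha)            # first index where P-mass reaches alpha
    lam = (alpha - (Pc[j-1] if j > 0 else 0.0)) / Pk[order][j]   # randomize on the boundary type
    return (Qc[j-1] if j > 0 else 0.0) + lam * Qk[order][j]

p, q, eps = 0.6, 0.3, 0.1                     # hypothetical choices
D = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))      # D(P||Q) in nats
for n in [50, 200, 800]:
    beta = beta_alpha_bernoulli(p, q, n, 1 - eps)
    print(n, "-(1/n) log beta =", round(-np.log(beta) / n, 4), "  D(P||Q) =", round(D, 4))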
H0 : X^n ∼ P^n   versus   H1 : X^n ∼ Q^n ,
but the objective in the Chernoff regime is to achieve exponentially small error probability of both
types simultaneously. We say a pair of exponents (E0 , E1 ) is achievable if there exists a sequence
of tests such that
1 − α = π 1|0 ≤ exp(−nE0 )
β = π 0|1 ≤ exp(−nE1 ).
Intuitively, one exponent can be made large at the expense of making the other small. So the interest-
ing question is to find their optimal tradeoff by characterizing the achievable region of (E0 , E1 ).
This problem was solved by [159, 38] and is the topic of Chapter 16. (See Fig. 16.2 for an
illustration of the optimal (E0 , E1 )-tradeoff.)
Let us explain what we already know about the region of achievable pairs of exponents (E0 , E1 ).
First, Stein’s regime corresponds to corner points of this achievable region. Indeed, Theo-
rem 14.13 tells us that when fixing αn = 1 − ϵ, namely E0 = 0, picking τ = D(PkQ) − δ
(δ → 0) gives the exponential convergence rate of β_n as E1 = D(P‖Q). Similarly, exchanging the roles of P and Q, we can achieve the point (E0, E1) = (D(Q‖P), 0).
Second, we have shown in Section 7.3 that the minimum total error probability over all tests satisfies
min_{(α,β)∈R(P^n,Q^n)} 1 − α + β = 1 − TV(P^n, Q^n) ,
where we denoted
E_H ≜ log 1/(1 − H²(P, Q)/2) ,
and the exponent E of the minimum total error probability satisfies
E_H ≤ E ≤ 2E_H .
This characterization is valid even if P and Q depend on the sample size n, which will prove useful later when we study composite hypothesis testing in Section 32.2.1. However, for fixed P and Q this is not precise enough. In order to determine the full set of achievable pairs, we need to make
a detour into the topic of large deviations next. To see how this connection arises, notice that the (optimal) likelihood ratio tests give us explicit expressions for both error probabilities:
1 − α_n = P[ (1/n) F_n ≤ τ ] ,   β_n = Q[ (1/n) F_n > τ ] ,
where F_n is the LLR in (14.15). When τ falls in the range (−D(Q‖P), D(P‖Q)), both probabilities are vanishing thanks to the WLLN – see (14.16) and (14.17) – and we are interested in their exponential convergence rate. This falls under the purview of large deviations theory.
1 Basics of large deviation: log moment generating function (MGF) ψX and its conjugate (rate
function) ψX∗ , tilting.
2 Information projection problem:
3 Use information projection to prove a tight Chernoff bound: for iid copies X1, . . . , Xn of X,
P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] = exp(−nψ*(γ) + o(n)) .
In the next chapter, we apply these results to characterize the achievable (E0 , E1 )-region (as defined
in Section 14.6) to get
The full account of such theory requires delicate consideration of topological properties of E , and
is the subject of classical treatments e.g. [87]. We focus here on a simple special case which,
however, suffices for the purpose of establishing the Chernoff exponents in hypothesis testing,
and also showcases all the relevant information-theoretic ideas. Our ultimate goal is to show the
following result:
Theorem 15.1. Consider a random variable X whose log MGF ψ_X(λ) = log E[exp(λX)] is finite for all λ ∈ R. Let B = esssup X and let E[X] < γ < B. Then
P[ Σ_{i=1}^n X_i ≥ nγ ] = exp{−nE(γ) + o(n)} ,
where E(γ) = sup_{λ≥0} (λγ − ψ_X(λ)) = ψ*_X(γ), known as the rate function.
The concepts of log MGF and the rate function will be elaborated in subsequent sections. We
provide the proof below that should be revisited after reading the rest of the chapter.
Proof. Let us recall the usual Chernoff bound: for iid X^n and any λ ≥ 0,
P[ Σ_{i=1}^n X_i ≥ nγ ] = P[ exp(λ Σ_{i=1}^n X_i) ≥ exp(nλγ) ]
  ≤ exp(−nλγ) E[ exp(λ Σ_{i=1}^n X_i) ]   (Markov inequality)
  = exp(−n(λγ − ψ_X(λ))) .
Optimizing over λ ≥ 0 gives the non-asymptotic upper bound (concentration inequality), which holds for any n:
P[ Σ_{i=1}^n X_i ≥ nγ ] ≤ exp{ −n sup_{λ≥0} (λγ − ψ_X(λ)) } .   (15.1)
This proves the upper bound part of Theorem 15.1. Our goal, thus, is to show the lower bound. This
will be accomplished by first expressing E(γ) as a certain KL-minimization problem (see Theo-
rem 15.9), known as information projection, and then solving this problem (see (15.26)) to obtain
the desired value of E(γ). In the process of this proof we will also understand why the apparently
naive Chernoff bound is in fact sharp. The explanation is that, essentially, inequality (15.1) per-
forms a change of measure to a tilted distribution Pλ , which is the closest to P (in KL divergence)
among all distributions Q with EQ [X] ≥ γ .
As an example, for a standard Gaussian Z ∼ N(0, 1), we have ψ_Z(λ) = λ²/2. Taking X = Z³ yields a random variable such that ψ_X(λ) is infinite for all non-zero λ.
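Numerically, the rate function and the Chernoff bound are easy to explore; the following Python sketch (ours, with hypothetical p, γ and n) computes ψ*_X(γ) for X ∼ Ber(p) by a one-dimensional optimization and compares it with a Monte Carlo estimate of −(1/n) log P[Σ X_i ≥ nγ]; the residual gap reflects the sub-exponential o(n) term.

import numpy as np
from scipy.optimize import minimize_scalar

p, gamma, n = 0.3, 0.45, 200           # hypothetical Ber(p), threshold gamma > E[X] = p

def psi(lmbda):                        # log MGF of Ber(p), in nats
    return np.log(1 - p + p * np.exp(lmbda))

# rate function psi*(gamma) = sup_{lambda >= 0} (lambda*gamma - psi(lambda))
res = minimize_scalar(lambda l: -(l * gamma - psi(l)), bounds=(0.0, 50.0), method="bounded")
rate = -res.fun

rng = np.random.default_rng(0)
hits = np.mean(rng.binomial(n, p, size=2_000_000) >= n * gamma)
print("psi*(gamma)                    =", round(rate, 4))
print("-(1/n) log P[sum >= n*gamma]  ≈", round(-np.log(hits) / n, 4))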
In the remainder of the chapter, we shall assume that the MGF of the random variable X is finite, namely ψ_X(λ) < ∞ for all λ ∈ R. This, in particular, implies that all moments of X are finite.
then A ≤ X ≤ B a.s.;
(f) If X is not a constant, then ψX is strictly convex, and consequently, ψX′ is strictly increasing.
(g) Chernoff bound:
Remark 15.1. The slope of log MGF encodes the range of X. Indeed, Theorem 15.3(d) and (e)
together show that the smallest closed interval containing the support of PX equals (closure of) the
range of ψX′ . In other words, A and B coincide with the essential infimum and supremum (min and
max of RV in the probabilistic sense) of X respectively,
Proof. Note that (g) is already proved in (15.1). The proof of (e)–(f) relies on Theorem 15.8 and
can be revisited later.
E[exp((λ1/p + λ2/q)X)] ≤ ‖exp(λ1X/p)‖_p ‖exp(λ2X/q)‖_q = E[exp(λ1X)]^θ E[exp(λ2X)]^{θ̄} ,
(Figure 15.1: Example of a log MGF ψ_X(λ) with P_X supported on [A, B]. The limiting minimal and maximal slopes are A and B respectively; the slope at λ = 0 is ψ′_X(0) = E[X]. The plot is for X = ±1 with P[X = 1] = 1/3.)
(c) The subtlety here is that we need to be careful when exchanging the order of differentiation and expectation.
Assume without loss of generality that λ ≥ 0. First, we show that E[|Xe^{λX}|] exists. Since |X| ≤ e^{|X|} ≤ e^X + e^{−X}, we have
|Xe^{λX}| ≤ e^{|(λ+1)X|} ≤ e^{(λ+1)X} + e^{−(λ+1)X}
≤ E[ |X| e^{λ(B−ϵ)−c−λ(B−ϵ/2)} ]
= E[|X|] e^{−λϵ/2−c} → 0 as λ → ∞,
where the first inequality is from (15.4) and the second from X < B − ϵ. Thus, the first term in (15.3) goes to 0, implying the desired contradiction.
(f) Suppose ψ_X is not strictly convex. Since ψ_X is convex, it must then be "flat" (affine) near some point: there exists a small neighborhood of some λ0 such that ψ_X(λ0 + u) = ψ_X(λ0) + ur for some r ∈ R. Then ψ_{P_{λ0}}(u) = ur for all u in a small neighborhood of zero, or equivalently E_{P_{λ0}}[e^{u(X−r)}] = 1 for small u. The following Lemma 15.4 implies P_{λ0}[X = r] = 1, but then P[X = r] = 1, contradicting the assumption X ≠ const.
Definition 15.5 (Rate function). The rate function ψ*_X : R → R ∪ {+∞} is given by the Legendre-Fenchel transform of the log MGF:
ψ*_X(γ) ≜ sup_{λ∈R} (λγ − ψ_X(λ)) .   (15.5)
Note that the maximization (15.5) is a convex optimization problem since ψ_X is strictly convex, so we can find the maximum by taking the derivative and finding the stationary point. In fact, ψ*_X is precisely the convex conjugate of ψ_X; cf. (7.78).
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in the
vertical strip {z : |Rez| < 1}.
The next result describes useful properties of the rate function. See Fig. 15.2 for an illustration.
(Figure 15.2: the log MGF ψ_X(λ) and its conjugate (rate function) ψ*_X(γ), which is finite on (A, B) and equals +∞ outside [A, B], for X taking values in [A, B]; continuing the example in Fig. 15.1.)
(b) ψX∗ is strictly convex and strictly positive except ψX∗ (E[X]) = 0.
(c) ψ*_X is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B).
Proof. By Theorem 15.3(d), since A ≤ X ≤ B a.s., we have A ≤ ψ′_X ≤ B. When γ ∈ (A, B), the strictly concave function λ ↦ λγ − ψ_X(λ) has a single stationary point which achieves the unique maximum. When γ > B (resp. < A), λ ↦ λγ − ψ_X(λ) increases (resp. decreases) without bound. When γ = B, since X ≤ B a.s., we have
ψ*_X(B) = sup_{λ∈R} ( λB − log E[exp(λX)] ) = − log inf_{λ∈R} E[exp(λ(X − B))]
P_λ(dx) = ( e^{λx} / E[e^{λX}] ) P(dx) = e^{λx − ψ_X(λ)} P(dx) .   (15.6)
In particular, if P has a PDF p, then the PDF of P_λ is given by p_λ(x) = e^{λx − ψ_X(λ)} p(x).
In the above examples we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0
(resp. < 0). Indeed, this is a general property of tilting.
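A short Python sketch (ours, with an arbitrary finitely supported P) illustrates tilting: as λ grows the mean of P_λ increases, and D(P_λ‖P) = λ E_{P_λ}[X] − ψ_X(λ), the identity used in the proof of Corollary 15.12 later in this section.

import numpy as np

# A hypothetical distribution P on a few real-valued atoms
xs = np.array([-1.0, 0.0, 0.5, 2.0])
P  = np.array([0.2, 0.4, 0.3, 0.1])

def tilt(lmbda):
    """Tilted distribution P_lambda(x) proportional to exp(lambda*x) P(x)."""
    w = P * np.exp(lmbda * xs)
    return w / w.sum()

def psi(lmbda):                       # log MGF, in nats
    return np.log(np.sum(P * np.exp(lmbda * xs)))

for lmbda in [-1.0, 0.0, 1.0, 2.0]:
    Pl = tilt(lmbda)
    mean = np.dot(Pl, xs)                                    # equals psi'(lambda)
    kl = np.sum(Pl * np.log(Pl / P))                         # D(P_lambda || P)
    print(f"lambda={lmbda:+.1f}  E_Pl[X]={mean:+.3f}  D(Pl||P)={kl:.4f}  "
          f"lambda*mean - psi = {lmbda*mean - psi(lmbda):.4f}")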
= ψ*_X(ψ′_X(λ)) , where the last equality follows from Theorem 15.6(a).
Theorem 15.9. Let X1, X2, . . . be i.i.d. ∼ P. Then for any γ ∈ R,
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k > γ ] = inf_{Q : E_Q[X]>γ} D(Q‖P)   (15.9)
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] = inf_{Q : E_Q[X]≥γ} D(Q‖P)   (15.10)
Remark 15.2 (Subadditivity). It is possible to argue from first principles that the limits (15.9) and (15.10) exist. Indeed, the sequence p_n ≜ P[(1/n) Σ_{k=1}^n X_k ≥ γ] satisfies p_{n+m} ≥ p_n p_m, and hence log(1/p_n) is subadditive. As such, lim_{n→∞} (1/n) log(1/p_n) = inf_n (1/n) log(1/p_n) by Fekete's lemma.
Proof. First note that if the events have zero probability, then both sides coincide with infinity. Indeed, if P[(1/n) Σ_{k=1}^n X_k > γ] = 0, then P[X > γ] = 0. Then E_Q[X] > γ ⇒ Q[X > γ] > 0 ⇒ Q ̸≪ P ⇒ D(Q‖P) = ∞ and hence (15.9) holds trivially. The case for (15.10) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (15.9). Set P[E_n] = P[(1/n) Σ_{k=1}^n X_k > γ].
Lower Bound on P[E_n]: Fix a Q such that E_Q[X] > γ and let X^n be iid under Q. Then by the WLLN,
Q[E_n] = Q[ Σ_{k=1}^n X_k > nγ ] = 1 − o(1) .
Upper Bound on P[E_n]: The key observation is that for any X and any event E with P_X(E) > 0, the probability can be expressed via the divergence between the conditional and unconditional distributions as log 1/P_X(E) = D(P_{X|X∈E}‖P_X). Define P̃_{X^n} = P_{X^n | Σ X_i > nγ}, under which Σ X_i > nγ holds a.s. Then
log 1/P[E_n] = D(P̃_{X^n}‖P_{X^n}) ≥ inf_{Q_{X^n} : E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n})   (15.13)
We now show that the last problem "single-letterizes", i.e., reduces to n = 1. Note that this is a special case of a more general phenomenon – see Ex. III.7. Consider the following two steps:
D(Q_{X^n}‖P_{X^n}) ≥ Σ_{j=1}^n D(Q_{X_j}‖P) ≥ n D(Q̄‖P) ,   Q̄ ≜ (1/n) Σ_{j=1}^n Q_{X_j} ,   (15.14)
where the first step follows from (2.26) in Theorem 2.14, after noticing that P_{X^n} = P^n, and the second step is by convexity of divergence (Theorem 5.1). From this argument we conclude that
inf_{Q_{X^n} : E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q : E_Q[X]>γ} D(Q‖P)   (15.15)
inf_{Q_{X^n} : E_Q[Σ X_i]≥nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q : E_Q[X]≥γ} D(Q‖P)   (15.16)
In particular, (15.13) and (15.15) imply the required lower bound in (15.9).
Next we prove (15.10). First, notice that the lower bound argument (15.13) applies equally well, so that for each n we have
(1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] ≥ inf_{Q : E_Q[X]≥γ} D(Q‖P) .
• Case I: P[X > γ] = 0. If P[X ≥ γ] = 0, then both sides of (15.10) are +∞. If P[X = γ] > 0, then P[Σ X_k ≥ nγ] = P[X_1 = . . . = X_n = γ] = P[X = γ]^n. For the right-hand side, since D(Q‖P) < ∞ ⟹ Q ≪ P ⟹ Q(X ≤ γ) = 1, the only possibility for E_Q[X] ≥ γ is that Q(X = γ) = 1, i.e., Q = δ_γ. Then inf_{E_Q[X]≥γ} D(Q‖P) = log 1/P(X = γ).
• Case II: P[X > γ] > 0. Since P[Σ X_k ≥ nγ] ≥ P[Σ X_k > nγ], from (15.9) we know that
lim sup_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] ≤ inf_{Q : E_Q[X]>γ} D(Q‖P) .
Indeed, let P̃ = P_{X|X>γ}, which is well defined since P[X > γ] > 0. For any Q such that E_Q[X] ≥ γ, set Q̃ = ϵ̄Q + ϵP̃, which satisfies E_{Q̃}[X] > γ. Then by convexity, D(Q̃‖P) ≤ ϵ̄D(Q‖P) + ϵD(P̃‖P) = ϵ̄D(Q‖P) + ϵ log 1/P[X > γ]. Sending ϵ → 0, we conclude the proof of (15.17).
Remark 15.3. Note that the upper bound (15.11) also holds for independent non-identically distributed X_i. Indeed, we only need to replace the step (15.14) with D(Q_{X^n}‖P_{X^n}) ≥ Σ_{i=1}^n D(Q_{X_i}‖P_{X_i}) ≥ nD(Q̄‖P̄), where P̄ = (1/n) Σ_{i=1}^n P_{X_i}. This yields the bound (15.11) with P replaced by P̄ on the right-hand side.
where f(u) ≜ u log u − (u − 1) log e ≥ 0. These follow from (15.18)-(15.19) via the following useful estimate:
d(up‖p) ≥ p f(u)   ∀p ∈ [0, 1], u ∈ [0, 1/p] .   (15.20)
Indeed, consider the elementary inequality
x log (x/y) ≥ (x − y) log e
for all x, y ∈ [0, 1] (since the difference between the left and right sides is minimized over y at y = x). Using x = 1 − up and y = 1 − p establishes (15.20).
• Bernstein's inequality:
P[X > np + t] ≤ e^{−t²/(2(t+np))}   ∀t > 0 .
This follows from the previous bound for u > 1 by bounding f(u)/log e = ∫_1^u (u − x)/x dx ≥ (1/u) ∫_1^u (u − x) dx = (u−1)²/(2u).
• Okamoto's inequality: For all 0 < p < 1 and t > 0,
P[ √X − √(np) ≥ t ] ≤ e^{−t²} ,   (15.21)
P[ √X − √(np) ≤ −t ] ≤ e^{−t²} .   (15.22)
These simply follow from the inequality between KL divergence and Hellinger distance in (7.30). Indeed, we get d(x‖p) ≥ H²(Ber(x), Ber(p)) ≥ (√x − √p)². Plugging x = (√(np)+t)²/n into (15.18)-(15.19) we obtain the result. We note that [226, Theorem 3] shows a stronger bound of e^{−2t²} in (15.21).
Remarkably, the bounds in (15.21) and (15.22) do not depend on n or p. This is due to the variance-stabilizing effect of the square-root transformation for binomials: Var(√X) is at most a constant for all n, p. In addition, √X − √(np) = (X − np)/(√X + √(np)) is of a self-normalizing form: the denominator is on par with the standard deviation of the numerator. For more on self-normalized sums, see [45, Problem 12.2].
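These binomial tail bounds are easy to sanity-check; the Python sketch below (ours, with arbitrary n, p, t) compares the exact tail P[X > np + t] with the Bernstein bound and with the Okamoto bound applied to the same event.

import numpy as np
from scipy.stats import binom

n, p = 100, 0.2
for t in [5, 10, 20]:
    exact = binom.sf(n * p + t, n, p)                 # P[X > np + t]
    bernstein = np.exp(-t**2 / (2 * (t + n * p)))
    s = np.sqrt(n * p + t) - np.sqrt(n * p)           # same event in the square-root scale
    okamoto = np.exp(-s**2)                           # bound on P[sqrt(X) - sqrt(np) >= s]
    print(f"t={t:2d}  exact={exact:.3e}  Bernstein={bernstein:.3e}  Okamoto={okamoto:.3e}")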
inf_{Q∈E} D(Q‖P)
Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E .
(Figure: the I-projection Q* of P onto the convex set E within the space of distributions on X.)
Theorem 15.10. Suppose ∃Q* ∈ E such that D(Q*‖P) = min_{Q∈E} D(Q‖P). Then ∀Q ∈ E,
D(Q‖P) ≥ D(Q‖Q*) + D(Q*‖P) .
Proof. If D(QkP) = ∞, then there is nothing to prove. So we assume that D(QkP) < ∞, which
also implies that D(Q∗ kP) < ∞. For λ ∈ [0, 1], form the convex combination Q(λ) = λ̄Q∗ +λQ ∈
E . Since Q∗ is the minimizer of D(QkP), then
0 ≤ (d/dλ) D(Q^{(λ)}‖P) |_{λ=0} = D(Q‖P) − D(Q‖Q*) − D(Q*‖P) .
The rigorous analysis requires an argument for interchanging derivatives and integrals (via domi-
nated convergence theorem) and is similar to the proof of Proposition 2.18. The details are in [83,
Theorem 2.2].
Remark 15.4. If we view the picture above in the Euclidean setting, the “triangle” formed by P,
Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The relevant set E of Q's that we will focus on next is the "half-space" of distributions E = {Q : E_Q[X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by the relation with the large-deviations exponent in Theorem 15.9. First, we solve this I-projection problem explicitly.
2 Whenever the minimum is finite, the minimizing distribution is unique and equal to tilting of P
along X, namely2
Remark 15.5. Both Theorem 15.9 and Theorem 15.11 are stated for the right tail, where the sample mean exceeds the population mean. For the left tail, simply apply these results to −X_i to obtain, for γ < E[X],
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k < γ ] = inf_{Q : E_Q[X]<γ} D(Q‖P) = ψ*_X(γ).
In other words, the large deviation exponent is still given by the rate function (15.5), except that the optimal tilting parameter λ is negative.
Proof. We first prove (15.25).
2
Note that unlike the setting of Theorems 15.1 and 15.9 here P and Pλ are measures on an abstract space Ω, not necessarily
on the real line.
• Third case: If P(X = B) = 0, then X < B a.s. under P, and Q ̸≪ P for any Q s.t. E_Q[X] ≥ B. Then the minimum is ∞. Now assume P(X = B) > 0. Since D(Q‖P) < ∞ ⟹ Q ≪ P ⟹ Q(X ≤ B) = 1, the only possibility for E_Q[X] ≥ B is that Q(X = B) = 1, i.e., Q = δ_B. Then D(Q‖P) = log 1/P(X = B).
• Second case: Fix E_P[X] ≤ γ < B, and find the unique λ such that ψ′_X(λ) = γ = E_{P_λ}[X], where dP_λ = exp(λX − ψ_X(λ)) dP. This corresponds to tilting P far enough to the right to increase its mean from E_P[X] to γ; in particular λ ≥ 0. Moreover, ψ*_X(γ) = λγ − ψ_X(λ). Take any Q such that E_Q[X] ≥ γ; then
D(Q‖P) = E_Q[ log ( (dQ/dP_λ) · (dP_λ/dP) ) ]   (15.28)
= D(Q‖P_λ) + E_Q[ log (dP_λ/dP) ]
= D(Q‖P_λ) + E_Q[λX − ψ_X(λ)]
≥ D(Q‖P_λ) + λγ − ψ_X(λ)
= D(Q‖P_λ) + ψ*_X(γ)
≥ ψ*_X(γ) ,   (15.29)
where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows
the minimizer is unique, proving the second claim. Note that even in the corner case of γ = B
(assuming P(X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure
(P∞ ), since Pλ → δB as λ → ∞, cf. Theorem 15.8(c).
An alternative version of the solution, given by expression (15.26), follows from Theorem 15.6.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X] we
have just shown
The final step is to notice that ψ*_X is increasing and continuous by Theorem 15.6, and hence the right-hand side infimum equals ψ*_X(γ). The case of min_{Q : E_Q[X]=γ} is handled similarly.
Corollary 15.12. For any Q with E_Q[X] ∈ (A, B), there exists a unique λ ∈ R such that the tilted distribution P_λ satisfies
E_{P_λ}[X] = E_Q[X]   and   D(P_λ‖P) ≤ D(Q‖P) ,
and furthermore the gap in the last inequality equals D(Q‖P_λ) = D(Q‖P) − D(P_λ‖P).
Proof. Proceed as in the proof of Theorem 15.11, and find the unique λ s.t. EPλ [X] = ψX′ (λ) =
EQ [X]. Then D(Pλ kP) = ψX∗ (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (15.28)-(15.29)
obtaining D(QkP) = D(QkPλ ) + D(Pλ kP).
Remark: For any Q, this allows us to find a tilted measure Pλ that has the same mean yet
smaller (or equal) divergence.
(Figure: the space of distributions on R, sliced by the sets {Q : E_Q[X] = γ} as γ ranges from A to B; the one-parameter family {P_λ} starts at P (λ = 0) and meets each slice at Q* = P_λ, with D(P_λ‖P) = ψ*(γ); the regions γ < A and γ > B contain only Q ̸≪ P.)
• Each set {Q : E_Q[X] = γ} corresponds to a slice. As γ varies from A to B, the slices fill the entire space except for the corner regions.
• When γ < A or γ > B, Q ̸≪ P.
• As γ varies, the P_λ's trace out a curve via ψ*(γ) = D(P_λ‖P). This set of distributions is called a one-parameter family, or exponential family.
Key Point: The one-parameter family curve intersects each γ-slice E = {Q : E_Q[X] = γ} "orthogonally" at the minimizing Q* ∈ E, and the distance from P to Q* is given by ψ*(γ). To see this, note that applying Theorem 15.10 to the convex set E gives us D(Q‖P) ≥ D(Q‖Q*) + D(Q*‖P). Now thanks to Corollary 15.12, we in fact have equality D(Q‖P) = D(Q‖Q*) + D(Q*‖P) and Q* = P_λ for some tilted measure.
Examples of regularity conditions in the above theorem include: (a) X is finite and E is closed
with non-empty interior – see Exercise III.12 for a full proof in this case; (b) X is a Polish space
and the set E is weakly closed and has non-empty interior.
Proof sketch. The lower bound is proved as in Theorem 15.9: Just take an arbitrary Q ∈ E and
apply a suitable version of WLLN to conclude Qn [P̂ ∈ E] = 1 + o(1).
For the upper bound we can again adapt the proof from Theorem 15.9. Alternatively, we can
write the convex set E as an intersection of half spaces. Then we have already solved the problem
for half-spaces {Q : EQ [X] ≥ γ}. The general case follows by the following consequence of
Theorem 15.10: if Q∗ is projection of P onto E1 and Q∗∗ is projection of Q∗ on E2 , then Q∗∗ is
also projection of P onto E1 ∩ E2 :
(
∗∗ D(Q∗ kP) = minQ∈E1 D(QkP)
D(Q kP) = min D(QkP) ⇐
Q∈E1 ∩E2 D(Q∗∗ kQ∗ ) = minQ∈E2 D(QkQ∗ )
(Repeated projection property)
Indeed, by first tilting from P to Q∗ we find
P[P̂ ∈ E1 ∩ E2 ] ≤ exp (−nD(Q∗ kP)) Q∗ [P̂ ∈ E1 ∩ E2 ]
≤ exp (−nD(Q∗ kP)) Q∗ [P̂ ∈ E2 ]
and from here proceed by tilting from Q∗ to Q∗∗ and note that D(Q∗ kP) + D(Q∗∗ kQ∗ ) =
D(Q∗∗ kP).
i i
i i
i i
In this chapter our goal is to determine the achievable region of the exponent pairs (E0 , E1 ) for
the Type-I and Type-II error probabilities. Our strategy is to apply the achievability and (strong)
converse bounds from Chapter 14 in conjunction with the large deviation theory developed in
Chapter 15.
260
i i
i i
i i
ψP (λ)
0 1
λ
E0 = ψP∗ (θ)
E1 = ψP∗ (θ) − θ
slope θ
Figure 16.1 Geometric interpretation of Theorem 16.1 relies on the properties of ψP (λ) and ψP∗ (θ). Note that
ψP (0) = ψP (1) = 0. Moreover, by Theorem 15.6, θ 7→ E0 (θ) is increasing, θ 7→ E1 (θ) is decreasing.
apply under the (milder) conditions of P Q and Q P, we will only present proofs under
the (stronger) condition that log-MGF exists for all λ, following the convention of the previous
chapter. The following result determines the optimal (E0 , E1 )-tradeoff in a parametric form. For a
concrete example, see Exercise III.11 for testing two Gaussians.
Remark 16.1 (Rényi divergence). In Definition 7.22 we defined Rényi divergences Dλ . Note that
ψP (λ) = (λ − 1)Dλ (QkP) = −λD1−λ (PkQ). This provides another explanation that ψP (λ) is
negative for λ between 0 and 1, and the slope at endpoints is: ψP′ (0) = −D(PkQ) and ψP′ (1) =
D(QkP). See also Ex. I.30.
Corollary 16.2 (Bayesian criterion). Fix a prior (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 < π 0 < 1.
Denote the optimal Bayesian (average) error probability by
P∗e (n) ≜ inf π 0 π 1|0 + π 1 π 0|1
PZ|Xn
with exponent
1 1
E ≜ lim log ∗ .
n→∞ n P e ( n)
Then
E = max min(E0 (θ), E1 (θ)) = ψP∗ (0)
θ
i i
i i
i i
262
If |X | = 2 and if the compositions (types) of xn and x̃n are equal (!), the expression is invariant
under λ ↔ 1 − λ and thus from the convexity in λ we conclude that λ = 12 is optimal,2 yielding
E = 1n dB (xn , x̃n ), where
X
n Xq
dB (x , x̃ ) = −
n n
log PY|X (y|xt )PY|X (y|x̃t ) (16.3)
t=1 y∈Y
is known as the Bhattacharyya distance between codewords xn and x̃n . (Compare with the Bhat-
tacharyya coefficient defined after (7.5).) Without the two assumptions stated, dB (·, ·) does not
necessarily give the optimal error exponent. We do, however, always have the bounds, see (14.19):
1
exp (−2dB (xn , x̃n )) ≤ P∗e (xn , x̃n ) ≤ exp (−dB (xn , x̃n )) ,
4
where the upper bound becomes tighter when the joint composition of (xn , x̃n ) and that of (x̃n , xn )
are closer.
Pn
Proof of Theorem 16.1. The idea is to apply the large deviation theory to the iid sum k=1 Tk .
Specifically, let’s rewrite the achievability and converse bounds from Chapter 14 in terms of T:
1
In short, this is because the optimal tilting parameter λ does not need to be chosen differently for different values of
(xt , x̃t ).
2 1
For another example where λ = 2
achieves the optimal in the Chernoff information, see Exercise III.19.
i i
i i
i i
• Achievability (Neyman-Pearson): Applying Theorem 14.10 with τ = −nθ, the LRT achieves
the following
" n # " n #
X X
π 1|0 = P Tk ≥ nθ π 0|1 = Q Tk < nθ (16.4)
k=1 k=1
• Converse (strong): Applying Theorem 14.9 with γ = exp (−nθ), any achievable π 1|0 and π 0|1
satisfy
" n #
X
π 1|0 + exp (−nθ) π 0|1 ≥ P T k ≥ nθ . (16.5)
k=1
For achievability, applying the nonasymptotic large deviations upper bound in Theorem 15.9
(and Theorem 15.11) to (16.4), we obtain that for any n,
" n #
X
π 1|0 = P Tk ≥ nθ ≤ exp (−nψP∗ (θ)) , for θ ≥ EP T = −D(PkQ)
k=1
" #
Xn
π 0|1 = Q Tk < nθ ≤ exp −nψQ∗ (θ) , for θ ≤ EQ T = D(QkP)
k=1
Theorem 16.3. (a) The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as
E0 = D(Pλ kP), E1 = D(Pλ kQ) (16.6)
i i
i i
i i
264
where the distribution Pλ 3 is tilting of P along T given in (15.27), which moves from P0 = P
to P1 = Q as λ ranges from 0 to 1:
dPλ = (dP)1−λ (dQ)λ exp{−ψP (λ)}.
(b) Yet another characterization of the boundary is
E∗1 (E0 ) = min D(Q′ kQ) , 0 ≤ E0 ≤ D(QkP) (16.7)
Q′ :D(Q′ ∥P)≤E0
Remark 16.3. The interesting consequence of this point of view is that it also suggests how
typical error event looks like. Namely, consider an optimal hypothesis test achieving the pair
of exponents (E0 , E1 ). Then conditioned on the error event (under either P or Q) we have that
the empirical distribution of the sample will be close to Pλ . For example, if P = Bin(m, p) and
Q = Bin(m, q), then the typical error event will correspond to a sample whose empirical distribu-
tion P̂n is approximately Bin(m, r) for some r = r(p, q, λ) ∈ (p, q), and not any other distribution
on {0, . . . , m}.
Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) ≜ EPλ [T], then
from (15.8) we have
D(Pλ kP) = ψP∗ (θ) ,
whereas
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP) − EPλ [T] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (15.7) we know that as λ ranges in [0, 1] the mean θ = EPλ [T] ranges from −D(PkQ)
to D(QkP).
To prove the second claim (16.7), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
dPλ = exp{λT − ψP (λ)} · dP
dQλ′ = exp{λ′ T − ψQ (λ′ )} · dQ
are in fact the same family with Qλ′ = Pλ′ +1 .
Now, suppose that Q∗ achieves the minimum in (16.7) and that Q∗ 6= Q, Q∗ 6= P (these cases
should be verified separately). Note that we have not shown that this minimum is achieved, but it
will be clear that our argument can be extended to the case of when Q′n is a sequence achieving
the infimum. Then, on one hand, obviously
D(Q∗ kQ) = min D(Q′ kQ) ≤ D(PkQ)
Q′ :D(Q′ ∥P)≤E0
3
This is called a geometric mixture of P and Q.
i i
i i
i i
Therefore,
dQ∗ dQ
EQ∗ [T] = EQ∗ log = D(Q∗ kP) − D(Q∗ kQ) ∈ [−D(PkQ), D(QkP)] . (16.8)
dP dQ∗
Next, we have from Corollary 15.12 that there exists a unique Pλ with the following three
properties:4
Remark 16.4. A geometric interpretation of (16.7) is given in Fig. 16.2: As λ increases from 0 to
1, or equivalently, θ increases from −D(PkQ) to D(QkP), the optimal distribution Pλ traverses
down the dotted path from P and Q. Note that there are many ways to interpolate between P and
Q, e.g., by taking their (arithmetic) mixture (1 − λ)P + λQ. In contrast, Pλ is a geometric mixture
of P and Q, and this special path is in essence a geodesic connecting P to Q and the exponents
E0 and E1 measures its respective distances to P and Q. Unlike Riemannian geometry, though,
here the sum of distances to the two endpoints from an intermediate Pλ actually varies along the
geodesic.
4
A subtlety: In Corollary 15.12 we ask EQ∗ [T] ∈ (A, B). But A, B – the essential range of T – depend on the distribution
under which the essential range is computed, cf. (15.23). Fortunately, we have Q P and P Q, so the essential range
is the same under both P and Q. And furthermore (16.8) implies that EQ∗ [T] ∈ (A, B).
i i
i i
i i
266
E1
P
D(PkQ) Pλ
space of distributions
D(Pλ kQ)
E0
0 D(Pλ kP) D(QkP)
Figure 16.2 Geometric interpretation of (16.7). Here the shaded circle represents {Q′ : D(Q′ kP) ≤ E0 }, the
KL divergence “ball” of radius E0 centered at P. The optimal E∗1 (E0 ) in (16.7) is given by the divergence from
Q to the closest element of this ball, attained by some tilted distribution Pλ . The tilted family Pλ is the
geodesic traversing from P to Q as λ increases from 0 to 1.
i i
i i
i i
So far we have always been working with a fixed number of observations n. However, different
realizations of Xn are informative to different levels, i.e. under some realizations we are very certain
about declaring the true hypothesis, whereas some other realizations leave us more doubtful. In
the fixed n setting, the tester is forced to take a guess in the latter case. In the sequential setting,
pioneered by Wald [329], the tester is allowed to request more samples. We show in this section that
the optimal test in this setting is something known as sequential probability ratio test (SPRT) [331].
It will also be shown that the resulting tradeoff between the exponents E0 and E1 is much improved
in the sequential setting.
We start with the concept of a sequential test. Informally, at each time t, upon receiving the
observation Xt , a sequential test either declares H0 , declares H1 , or requests one more observation.
The rigorous definition is as follows: a sequential hypothesis test consists of (a) a stopping time
τ with respect to the filtration {Fk , k ∈ Z+ }, where Fk ≜ σ{X1 , . . . , Xn } is generated by the first
n observations; and (b) a random variable (decision) Z ∈ {0, 1} measurable with respect to Fτ .
Each sequential test is associated with the following performance metrics:
α = P[Z = 0], β = Q [ Z = 0] (16.9)
l0 = EP [τ ], l1 = EQ [τ ] (16.10)
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 12 δ−1 . Since P 6⊥ Q,
we also have Pn 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both
hypotheses. However, an obvious sequential test (wait for the first appearance of ±1) achieves zero
error probability with finite average number of samples (2) under both hypotheses. This advantage
is also very clear in the achievable error exponents as Fig. 16.3 shows.
The following result is due to [331] (for the special case of E0 = D(QkP) and E1 = D(PkQ))
and [245] (for the generalization).
5
This assumption is satisfied for example for a pair of full support discrete distributions on finite alphabets.
i i
i i
i i
268
E1
Sequential test
D(PkQ)
E0
0 D(QkP)
Figure 16.3 Tradeoff between Type-I and Type-II error exponents. The bottom curve corresponds to optimal
tests with fixed sample size (Theorem 16.1) and the upper curve to optimal sequential tests (Theorem 16.4).
0, if Sτ ≥ B
Z=
1, if Sτ < −A
where
X
n
P(Xk )
Sn = log
Q(Xk )
k=1
Remark 16.5 (Interpretation of SPRT). Under the usual setup of hypothesis testing, we collect
a sample of n iid observations, evaluate the LLR Sn , and compare it to the threshold to give the
optimal test. Under the sequential setup, {Sn : n ≥ 1} is a random walk, which has positive
(resp. negative) drift D(PkQ) (resp. −D(QkP)) under the null (resp. alternative)! SPRT simply
declares P if the random walk crosses the upper boundary B, or Q if the random walk crosses the
upper boundary −A. See Fig. 16.4 for an illustration.
i i
i i
i i
Sn
0 n
τ
−A
Figure 16.4 Illustration of the SPRT(A, B) test. Here, at the stopping time τ , the LLR process Sn reaches B
before reaching −A and the decision is Z = 1.
EQ [Sτ ] = − EQ [τ ]D(QkP) .
Mn = Sn − nD(PkQ)
M̃n ≜ Mmin(τ,n)
E[M̃n ] = E[M̃0 ] = 0 ,
or, equivalently,
This holds for every n ≥ 0. From boundedness assumption we have |Sn | ≤ nc and thus
|Smin(n,τ ) | ≤ nτ , implying that collection {Smin(n,τ ) , n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (16.12) and interchange expectation and limit safely to conclude (16.11).
i i
i i
i i
270
By monotone convergence theorem applied to the both sides of (16.13) it is then sufficient to
verify that for every n
Next, we denote τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas the expectation of τ0 can
be bounded using (16.11) as:
EP [τ ] ≤ EP [τ0 ] = EP [Sτ0 ] ≤ B + c0 ,
S τ 0 ≤ B + c0 .
Thus
B + c0 B
l0 = EP [τ ] ≤ EP [τ0 ] ≤ ≈ . for large B
D(PkQ) D(PkQ)
Similarly we can show π 0|1 ≤ e−B and l1 ≤ D(QA∥P) for large A. Take B = l0 D(PkQ), A =
l1 D(QkP), this shows the achievability.
Converse: Assume (E0 , E1 ) achievable for large l0 , l1 . Recall from Section 4.5* that
D(PFτ kQFτ ) denotes the divergence between P and Q when viewed as measures on σ -algebra
Fτ . We apply the data processing inequality for divergence to obtain:
(16.11)
d(P(Z = 1)kQ(Z = 1)) ≤ D(PFτ kQFτ ) = EP [Sτ ] = EP [τ ]D(PkQ) = l0 D(PkQ),
i i
i i
i i
Notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) = l1 E1 (1 + o(1)), therefore
l1 E1 ≤ (1 + o(1))l0 D(PkQ). Similarly we can show that l0 E0 ≤ (1 + o(1))l1 D(QkP). Thus taking
ℓ0 , ℓ1 → ∞ we conclude
E0 E1 ≤ D(PkQ)D(QkP) .
Unlike testing simple hypotheses for which Neyman-Pearson’s test is optimal (Theorem 14.10), in
general there is no explicit description for the optimal test of composite hypotheses (cf. (32.24)).
The popular choice is a generalized likelihood-ratio test (GLRT) that proposes to threshold the
GLR
supP∈P P⊗n (Xn )
T( X n ) = .
supQ∈Q Q⊗n (Xn )
For examples and counterexamples of the optimality of GLRT in terms of error exponents, see,
e.g. [346].
Sometimes the families P and Q are small balls (in some metric) surrounding the center dis-
tributions P and Q, respectively. In this case, testing P against Q is known as robust hypothesis
testing (since the test is robust to small deviations of the data distribution). There is a notable
finite-sample optimality result in this case due to Huber [161], Exercise III.20. Asymptotically, it
turns out that if P and Q are separated in the Hellinger distance, then the probability of error can
be made exponentially small: see Theorem 32.7.
Sometimes in the setting of composite testing the distance between P and Q is zero. This is
the case, for example, for the most famous setting of a Student t-test: P = {N (0, σ 2 ) : σ 2 > 0},
Q = {N ( μ, σ 2 ) : μ 6= 0, σ 2 > 0}. It is clear that in this case there is no way to construct a test with
α + β < 1, since the data distribution under H1 can be arbitrarily close to P0 . Here, thus, instead
of minimizing worst-case β , one tries to find a test statistic T(X1 , . . . , Xn ) which is a) pivotal in
i i
i i
i i
272
the sense that its distribution under the H0 is (asymptotically) independent of the choice P0 ∈ P ;
and b) consistent, in the sense that T → ∞ as n → ∞ under any Q ∈ Q. Optimality questions are
studied by minimizing β as a function of Q ∈ Q (known as the power curve). The uniform most
powerful tests are the gold standard in this area [197, Chapter 3], although besides a few classical
settings (such as the one above) their existence is unknown.
In other settings, known as the goodness-of-fit testing [197, Chapter 14], instead of relatively
low-complexity parametric families P and Q one is interested in a giant set of alternatives Q. For
i.i.d. i.i.d.
example, the simplest setting is to distinguish H0 : Xi ∼ P0 vs H1 : Xi ∼ Q, TV(P0 , Q) > δ . If
δ = 0, then in this case again the worst case α + β = 1 for any test and one may only ask for a
statistic T(Xn ) with a known distribution under H0 and T → ∞ for any Q in the alternative. For
δ > 0 the problem is known as nonparametric detection [165, 166] and related to that of property
testing [141].
i i
i i
i i
III.1 Let P0 and P1 be distributions on X . Recall that the region of achievable pairs (P0 [Z =
0], P1 [Z = 0]) via randomized tests PZ|X : X → {0, 1} is denoted
[
R(P0 , P1 ) ≜ (P0 [Z = 0], P1 [Z = 0]) ⊆ [0, 1]2 .
PZ|X
PY|X
Let PY|X : X → Y be a Markov kernel, which maps Pj to Qj according to Pj −−→ Qj , j =
0, 1. Compare the regions R(P0 , P1 ) and R(Q0 , Q1 ). What does this say about βα (P0 , P1 ) vs.
β α ( Q0 , Q1 ) ?
Comment: This is the most general form of data-processing, all the other ones (divergence,
mutual information, f-divergence, total-variation, Rényi-divergence, etc) are corollaries.
Bonus: Prove that R(P0 , P1 ) ⊃ R(Q0 , Q1 ) implies existence of some PY|X carrying Pj to Qj
(“inclusion of R is equivalent to degradation”).
III.2 Recall the total variation distance
Explain how to read the value TV(P, Q) from the region R(P, Q). Does it equal half the
maximal vertical segment in R(P, Q)?
(b) (Bayesian criteria) Fix a prior π = (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 < π 0 < 1. Denote
the optimal average error probability by
1
Pe = (1 − TV(P, Q)).
2
Find the optimal test.
(c) Find the optimal test for general prior π (not necessarily equiprobable).
(d) Why is it always sufficient to focus on deteministic test in order to minimize the Bayesian
error probability?
i i
i i
i i
βα (P, Q) ≜ min Q[ Z = 0] = α 2 .
PZ|X :P[Z=0]≥α
III.5 We have shown that for testing iid products and any fixed ϵ ∈ (0, 1):
which is equivalent to Stein’s lemma. Show furthermore that assuming V(PkQ) < ∞ we have
p √
log β1−ϵ (Pn , Qn ) = −nD(PkQ) + nV(PkQ)Q−1 (ϵ) + o( n) , (III.2)
R ∞
where Q−1 (·) is the functional inverse of Q(x) = x √12π e−t /2 dt and
2
dP
V(PkQ) ≜ VarP log .
dQ
III.6 (Inverse Donsker-Varadhan) Verify for positive discrete random variables X that,
for some F ′ .
Conclude that in the case when PYj = P and
X
n
F = QYn : EQ f(Yj ) ≥ nγ
j=1
i i
i i
i i
we have (single-letterization)
6
In this exercise and the next, you may assume all log’s and exp’s are to the natural basis and that MGF exists for all λ.
i i
i i
i i
i i
i i
i i
(b) Consider the conditional distribution P̃Xn = PXn |π n ∈E . Show that P̃Xn ∈ En .
(c) Prove the following nonasymptotic upper bound:
P(π n ∈ E) ≤ exp − n inf D(QkP) , ∀n.
Q∈E
(Hint: Use data processing as in the proof of the large deviations theorem.)
(e) Conclude (III.7).
III.14 Error exponents of data compression. Let Xn be iid according to P on a finite alphabet X . Let
ϵ∗n (R) denote the minimal probability of error achieved by fixed-length compressors and decom-
pressors for Xn of compression rate R. We have learned that the if R < H(P), then ϵ∗n (R) tends
to zero. The goal of this exercise is to show it converges exponentially fast and find the best
exponent.
(a) For any sequence xn , denote by π (xn ) its empirical distribution and by Ĥ(xn ) its empirical
entropy, i.e., the entropy of the empirical distribution.7 For each R > 0, define the set
T = {xn : Ĥ(xn ) < R}. Show that
Specify the achievable scheme. (Hint: Use Sanov’s theorem in Exercise III.13.)
(c) Prove that the above exponent is asymptotically optimal:
1 1
lim sup log ∗ ≤ inf D(QkP).
n→∞ n ϵn (R) Q:H(Q)>R
(Hint: Recall that any compression scheme for memoryless source with rate below the
entropy fails with probability tending to one. Use data processing inequality. )
III.15 Denote by N( μ, σ 2 ) the one-dimensional Gaussian distribution with mean μ and variance σ 2 .
Let a > 0. All logarithms below are natural.
(a) Show that
a2
min D(QkN(0, 1)) = .
Q:EQ [X]≥a 2
(b) Let X1 , . . . , Xn be drawn iid from N(0, 1). Using part (a) show that
1 1 a2
lim log = . (III.8)
n→∞ n P [X1 + · · · + Xn ≥ na] 2
7
For example, for the binary sequence xn = (010110), the empirical distribution is Ber(1/2) and the empirical entropy is
1 bit.
i i
i i
i i
R∞
(c) Let Φ̄(x) = x √12π e−t /2 dt denote the complementary CDF of the standard Gaussian
2
distribution. Express P [X1 + · · · + Xn ≥ na] in terms of the Φ̄ function. Using the fact that
Φ̄(x) = e−x /2+o(x ) as x → ∞, reprove (III.8).
2 2
(d) Let Y be a continuous random variable with zero mean and unit variance. Show that
III.16 (Gibbs distribution) Let X be finite alphabet, f : X → R some function and Emin = min f(x).
(a) Using I-projection show that for any E ≥ Emin the solution of
β 0 ( E0 ) = β 1 ( E1 ) (III.10)
Hint: Let h = f1{ρ ≤ exp(L + t/2)}. Use triangle inequality and bound E |In (h) − Eν h|,
E |In (h) − In (f)|, | Eν f − Eν h| separately.
i i
i i
i i
Pμ (log ρ ≤ L − t/2)
P(In (1) ≥ 1 − δ)| ≤ exp(−t/2) + ,
1−δ
for all δ ∈ (0, 1), where 1 is the constant-1 function.
Hint: Divide into two cases depending on whether max1≤i≤n ρ(Xi ) ≤ exp(L − t/2).
This shows that a sample of size exp(D(νk μ) + Θ(1)) is both necessary and sufficient for
accurate estimation by importance sampling.
III.18 M-ary hypothesis testing.8 The following result [194] generalizes Corollary 16.2 on the best
average probability of error for testing two hypotheses to multiple hypotheses.
Fix a collection of distributions {P1 , . . . , PM }. Conditioned on θ, which takes value i with prob-
i.i.d.
ability π i > 0 for i = 1, . . . , M, let X1 , . . . , Xn ∼ Pθ . Denote the optimal average probability of
error by p∗n = inf P[θ̂ 6= θ], where the infimum is taken over all decision rules θ̂ = θ̂(X1 , . . . , Xn ).
(a) Show that
1 1
lim log ∗ = min C(Pi , Pj ), (III.12)
n→∞ n pn 1≤i<j≤M
8
Not to be confused with multiple testing in the statistics literature, which refers to testing multiple pairs of binary
hypotheses simultaneously.
i i
i i
i i
III.20 (Stochastic dominance and robust LRT) Let P0 , P1 be two families of probability distributions
on X . Suppose that there is a least favorable pair (LFP) (Q0 , Q1 ) ∈ P0 × P1 such that
Q0 [π > t] ≥ Q′0 [π > t]
Q1 [π > t] ≤ Q′1 [π > t],
for all t ≥ 0 and Q′i ∈ Pi , where π = dQ1 /dQ0 . Prove that (Q0 , Q1 ) simultaneously minimizes
all f-divergences between P0 and P1 , i.e.
Df (Q1 kQ0 ) ≤ Df (Q′1 kQ′0 ) ∀Q′0 ∈ P0 , Q′1 ∈ P1 . (III.13)
Hint: Interpolate between (Q0 , Q1 ) and (Q′0 , Q′1 ) and differentiate.
Remark: For the case of two TV-balls, i.e. Pi = {Q : TV(Q, Pi ) ≤ ϵ}, the existence of LFP is
shown in [161], in which case π = min(c′ , max(c′′ , dP ′ ′′
dP1 )) for some 0 ≤ c < c ≤ ∞ giving
0
i i
i i
i i
Part IV
Channel coding
i i
i i
i i
i i
i i
i i
283
In this Part we study a new type of problem known as “channel coding”. Historically, this
was the first application area of information theory that lead to widely recognized and surprising
results [277]. To explain the relation of this Part to others, let us revisit what problems we have
studied so far.
In Part II our objective was data compression. The main object there was a single distribution
PX and the fundamental limit E[ℓ(f∗ (X))] – the minimal compression length. The main result was
connection between the fundamental limit and an information quantity, that we can summarize as
E[ℓ(f∗ (X))] ≈ H(X)
In Part III we studied binary hypothesis testing. There the main object was a pair of distributions
(P, Q), the fundamental limit was the Neyman-Pearson curve β1−ϵ (Pn , Qn ) and the main result
i i
i i
i i
Definition 17.1. An M-code for PY|X is an encoder/decoder pair (f, g) of (randomized) functions1
• encoder f : [M] → X
• decoder g : Y → [M] ∪ {e}
In most cases f and g are deterministic functions, in which case we think of them, equivalently,
in terms of codewords, codebooks, and decoding regions (see Fig. 17.1 for an illustration)
c1 b
b
b
D1 b
b b
b cM
b b
b
DM
Figure 17.1 When X = Y, the decoding regions can be pictured as a partition of the space, each containing
one codeword.
Given an M-code we can define a probability space, underlying all the subsequent developments
in this Part. For that we chain the three objects – message W, the encoder and the decoder – together
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ|Y .
284
i i
i i
i i
where we set W ∼ Unif([M]). In the case of discrete spaces, we can explicitly write out the joint
distribution of these variables as follows:
1
(general) PW,X,Y,Ŵ (m, a, b, m̂) = P (a|m)PY|X (b|a)PŴ|Y (m̂|b)
M X|W
1
(deterministic f, g) PW,X,Y,Ŵ (m, cm , b, m̂) = PY|X (b|cm )1{b ∈ Dm̂ }
M
Throughout these sections, these random variables will be referred to by their traditional names:
W – original (true) message, X - (induced) channel input, Y - channel output and Ŵ - decoded
message.
Although any pair (f, g) is called an M-code, in reality we are only interested in those that satisfy
certain “error-correcting” properties. To assess their quality we define the following performance
metrics:
Note that, clearlym, Pe ≤ Pe,max . Therefore, requirement of the small maximum error probability
is a more stringent criterion, and offers uniform protection for all codewords. Some codes (such
as linear codes, see Section 18.6) have the property of Pe = Pe,max by construction, but generally
these two metrics could be very different.
Having defined the concept of an M-code and the performance metrics, we can finally define
the fundamental limits for a given channel PY|X .
Definition 17.2. A code (f, g) is an (M, ϵ)-code for PY|X if Pe (f, g) ≤ ϵ. Similarly, an (M, ϵ)max -
code must satisfy Pe,max ≤ ϵ. The fundamental limits of channel coding are defined as
The argument PY|X will be omitted when PY|X is clear from the context.
In other words, the quantity log2 M∗ (ϵ) gives the maximum number of bits that we can
push through a noisy transformation PY|X , while still guaranteeing the error probability in the
appropriate sense to be at most ϵ.
Example 17.1. The channel BSC⊗ n
δ (recall from Example 3.5 that BSC stands for binary symmet-
ric channel) acts between X = {0, 1}n and Y = {0, 1}n , where the input Xn is contaminated by
i.i.d.
additive noise Zn ∼ Ber(δ) independent of Xn , resulting in the channel output
Yn = Xn ⊕ Zn .
i i
i i
i i
286
0 1 0 0 1 1 0 0 1 1
PY n |X n
1 1 0 1 0 1 0 0 0 1
In the next section we discuss coding for the BSC channel in more detail.
0 0 1 0
Decoding can be done by taking a majority vote inside each ℓ-block. Thus, each data bit is decoded
with probability of bit error Pb = P[Binom(l, δ) > l/2]. However, the probability of block error of
this scheme is Pe ≤ kP[Binom(l, δ) > l/2]. (This bound is essentially tight in the current regime).
Consequently, to satisfy Pe ≤ 10−3 we must solve for k and ℓ satisfying kl ≤ n = 1000 and also
i i
i i
i i
This gives l = 21, k = 47 bits. So we can see that using repetition coding we can send 47 data
bits by using 1000 channel uses.
Repetition coding is a natural idea. It also has a very natural tradeoff: if you want better reliabil-
ity, then the number ℓ needs to increase and hence the ratio nk = 1ℓ should drop. Before Shannon’s
groundbreaking work, it was almost universally accepted that this is fundamentally unavoidable:
vanishing error probability should imply vanishing communication rate nk .
Before delving into optimal codes let us offer a glimpse of more sophisticated ways of injecting
redundancy into the channel input n-sequence than simple repetition. For that, consider the so-
called first-order Reed-Muller codes (1, r). We interpret a sequence of r data bits a0 , . . . , ar−1 ∈ Fr2
as a degree-1 polynomial in (r − 1) variables:
X
r− 1
a = (a0 , . . . , ar−1 ) 7→ fa (x) ≜ a i xi + a 0 .
i=1
In order to transmit these r bits of data we simply evaluate fa (·) at all possible values of the variables
xr−1 ∈ Fr2−1 . This code, which maps r bits to 2r−1 bits, has minimum distance dmin = 2r−2 . That
is, for two distinct a 6= a′ the number of positions in which fa and fa′ disagree is at least 2r−2 . In
coding theory notation [n, k, dmin ] we say that the first-order Reed-Muller code (1, 7) is a [64, 7, 32]
code. It can be shown that the optimal decoder for this code achieves over the BSC0.11 ⊗ 64 channel
a probability of error at most 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7 data bits
and occupying 64 bits on the channel) over the BSC⊗ δ
1024
, and still have (by the union bound)
−4 −3
overall probability of block error Pe ≲ 10 < 10 . Thus, with the help of Reed-Muller codes
we can send 7 · 16 = 112 bits in 1024 channel uses, more than doubling that of the repetition code.
Shannon’s noisy channel coding theorem (Theorem 19.9) – a crown jewel of information theory
– tells us that over memoryless channel PYn |Xn = (PY|X )n of blocklength n the fundamental limit
satisfies
as n → ∞ and for arbitrary ϵ ∈ (0, 1). Here C = maxPX1 I(X1 ; Y1 ) is the capacity of the single-letter
channel. In our case of BSC we have
1
C = log 2 − h(δ) ≈ bit ,
2
since the optimal input distribution is uniform (from symmetry) – see Section 19.3. Shannon’s
expansion (17.2) can be used to predict (not completely rigorously, of course, because of the
o(n) residual) that it should be possible to send around 500 bits reliably. As it turns out, for the
blocklength n = 1000 this is not quite possible.
Note that computing M∗ exactly requries iterating over all possible encoders and decoder –
an impossible task even for small values of n. However, there exist rigorous and computation-
ally tractable finite blocklength bounds [239] that demonstrate for our choice of n = 1000, δ =
0.11, ϵ = 10−3 :
i i
i i
i i
288
Thus we can see that Shannon’s prediction is about 20% too optimistic. We will see below some
of such finite-length bounds. Notice, however, that while the guarantee existence of an encoder-
decoder pair achieving a prescribed performance, building an actual f and g implementable with
a modern software/hardware is a different story.
It took about 60 years after Shannon’s discovery of (17.2) to construct practically imple-
mentable codes achieving that performance. The first codes that approach the bounds on log M∗
are called Turbo codes [30] (after the turbocharger engine, where the exhaust is fed back in to
power the engine). This class of codes is known as sparse graph codes, of which the low-density
parity check (LDPC) codes invented by Gallager are particularly well studied [264]. As a rule of
thumb, these codes typically approach 80 . . . 90% of log M∗ when n ≈ 103 . . . 104 . For shorter
blocklengths in the range of n = 100 . . . 1000 there is an exciting alternative to LDPC codes: the
polar codes of Arıkan [15], which are most typically used together with the list-decoding idea
of Tal and Vardy [301]. And of course, the story is still evolving today as new channel models
become relevant and new hardware possibilities open up.
We wanted to point out a subtle but very important conceptual paradigm shift introduced by
Shannon’s insistence on coding over many (information) bits together. Indeed, consider the sit-
uation discussed above, where we constructed a powerful code with M ≈ 2400 codewords and
n = 1000. Now, one might imagine this code as a constellation of 2400 points carefully arranged
inside a hypercube {0, 1}1000 to guarantee some degree of separation between them, cf. (17.6).
Next, suppose one was using this code every second for the lifetime of the universe (≈ 1018 sec).
Yet, even after this laborious process she will have explored at most 260 different codewords from
among an overwhelmingly large codebook 2400 . So a natural question arises: why did we need
to carefully place all these many codewords if majority of them will never be used by anyone?
The answer is at the heart of the concept of information: to transmit information is to convey a
selection of one element (W) from a collection of possibilities ([M]). The fact that we do not know
which W will be selected forces us to apriori prepare for every one of the possibilities. This simple
idea, proposed in the first paragraph of [277], is now tacitly assumed by everyone, but was one of
the subtle ways in which Shannon revolutionized scientific approach to the study of information
exchange.
Notice that the optimal decoder is deterministic. For the special case of deterministic encoder,
where we can identify the encoder with its image C the minimal (MAP) probability of error for
i i
i i
i i
Consequently, the optimal decoding regions – see Fig. 17.1 – become the Voronoi cells tesselating
the Hamming space {0, 1}n . Similarly, the MAP decoder for the AWGN channel induces a Voronoi
tesselation of Rn – see Section 20.3.
So we have seen that the optimal decoder is without loss of generality can be assumed to be
deterministic. Similarly, we can represent any randomized encoder f as a function of two argu-
ments: the true message W and an external randomness U ⊥ ⊥ W, so that X = f(W, U) where this
time f is a deterministic function. Then we have
which implies that if P[W 6= Ŵ] ≤ ϵ then there must exist some choice u0 such that P[W 6= Ŵ|U =
u0 ] ≤ ϵ. In other words, the fundamental limit M∗ (ϵ) is unchanged if we restrict our attention to
deterministic encoders and decoders only.
Note, however, that neither of the above considerations apply to the maximal probability of
error Pe,max . Indeed, the fundamental limit M∗max (ϵ) does indeed require considering randomized
encoders and decoders. For example, when M = 2 from the decoding point of view we are back to
the setting of binary hypotheses testing in Part III. The optimal decoder (test) that minimizes the
maximal Type-I and II error probability, i.e., max{1 − α, β}, will not be deterministic if max{1 −
α, β} is not achieved at a vertex of the Neyman-Pearson region R(PY|W=1 , PY|W=2 ).
i i
i i
i i
290
Theorem 17.3 (Weak converse). Any (M, ϵ)-code for PY|X satisfies
supPX I(X; Y) + h(ϵ)
log M ≤ ,
1−ϵ
where h(x) = H(Ber(x)) is the binary entropy function.
Proof. This can be derived as a one-line application of Fano’s inequalty (Theorem 6.3), but we
proceed slightly differently. Consider an M-code with probability of error Pe and its corresponding
probability space: W → X → Y → Ŵ. We want to show that this code can be used as a hypothesis
test between distributions PX,Y and PX PY . Indeed, given a pair (X, Y) we can sample (W, Ŵ) from
PW,Ŵ|X,Y = PW|X PŴ|Y and compute the binary value Z = 1{W 6= Ŵ}. (Note that in the most
interesting cases when encoder and decoder are deterministic and the encoder is injective, the value
Z is a deterministic function of (X, Y).) Let us compute performance of this binary hypothesis test
under two hypotheses. First, when (X, Y) ∼ PX PY we have that Ŵ ⊥ ⊥ W ∼ Unif([M]) and therefore:
1
PX PY [Z = 1] = .
M
Second, when (X, Y) ∼ PX,Y then by definition we have
PX,Y [Z = 1] = 1 − Pe .
Thus, we can now apply the data-processing inequality for divergence to conclude: Since W →
X → Y → Ŵ, we have the following chain of inequalities (cf. Fano’s inequality Theorem 6.3):
DPI 1
D(PX,Y kPX PY ) ≥ d(1 − Pe k )
M
≥ −h(P[W 6= Ŵ]) + (1 − Pe ) log M
i i
i i
i i
So far our discussion of channel coding was mostly following the same lines as the M-ary hypothe-
sis testing (HT) in statistics. In this chapter we introduce the key departure: the principal and most
interesting goal in information theory is the design of the encoder f : [M] → X or the codebook
{ci ≜ f(i), i ∈ [M]}. Once the codebook is chosen, the problem indeed becomes that of M-ary HT
and can be tackled by the standard statistical methods. However, the task of choosing the encoder
f has no exact analogs in statistical theory (the closest being design of experiments). Each f gives
rise to a different HT problem and the goal is to choose these M hypotheses PX|c1 , . . . , PX|cM to
ensure maximal testability. It turns out that the problem of choosing a good f will be much sim-
plified if we adopt a suboptimal way of testing M-ary HT. Namely, roughly speaking we will run
M binary HTs testing PY|X=cm against PY , which tries to distinguish the channel output induced by
the message m from an “average background noise” PY . An optimal such test, as we know from
Neyman-Pearson (Theorem 14.10), thresholds the following quantity
PY|X=x
log
PY
This explains the central role played by the information density (see below) in these achievability
bounds.
In this chapter it will be convenient to introduce the following independent pairs (X, Y) ⊥
⊥ (X, Y)
with their joint distribution given by:
We will often call X the sent codeword and X̄ the unsent codeword.
291
i i
i i
i i
292
Definition 18.1 (Information density). Let PX,Y μ and PX PY μ for some dominating measure
μ, and denote by f(x, y) = dPdμX,Y and f̄(x, y) = dPdμ
X PY
the Radon-Nikodym derivatives of PX,Y and
PX PY with respect to μ, respectively. Then recalling the Log definition (2.10) we set
log ff̄((xx,,yy)) , f(x, y) > 0, f̄(x, y) > 0
f(x, y) +∞, f(x, y) > 0, f̄(x, y) = 0
iPX,Y (x; y) ≜ Log = (18.2)
f̄(x, y) −∞, f(x, y) = 0, f̄(x, y) > 0
0, f(x, y) = f̄(x, y) = 0 ,
Note an important observation: (18.3) holds regardless of the input distribution PX used for the def-
PM
inition of i(x; y), in particular we do not have to use the code-induced distribution PX = M1 i=1 δci .
However, if we are to threshold information density, different choices of PX will result in different
decoders, so we need to justify the choice of PX .
To that end, recall that to distinguish between two codewords ci and cj , one can apply (as we
P
learned in Part III for binary HT) the likelihood ratio test, namely thresholding the LLR log PYY||XX=
=c
ci
.
j
As we explained at the beginning of this Part, a (possibly suboptimal) approach in M-ary HT
is to run binary tests by thresholding each information density i(ci ; y). This, loosely speaking,
i i
i i
i i
evaluates the likelihood of ci against the average distribution of the other M − 1 codewords, which
1
P
we approximate by PY (as opposed to the more precise form M− 1 j̸=i PY|X=cj ). Putting these ideas
together we can propose the decoder as
where λ is a threshold and PX is judiciously chosen (to maximize I(X; Y) as we will see soon).
We proceed to show some elementary properties of the information density. The next result
explains the name “information density”1
Proposition 18.2. The expectation E[i(X; Y)] is well-defined and non-negative (but possibly
infinite). In any case, we have I(X; Y) = E[i(X; Y)].
Proof. This is follows from (2.12) and the definition of i(x; y) as log-ratio.
Being defined as log-likelihood, information density possesses the standard properties of the
latter, cf. Theorem 14.5. However, because its defined in terms of two variables (X, Y), there are
also very useful conditional expectation versions. To illustrate the meaning of the next proposition,
let us consider the case of discrete X, Y and PX,Y PX PY . Then we have for every x:
X X
f(x, y)PX (x)PY (y) = f(x, y) exp{−i(x; y)}PX,Y (x, y) .
y y
E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = E[f+ (X, Y) exp{−i(X; Y)}|X = x] (18.5)
Proof. The first part (18.4) is simply a restatement of (14.5). For the second part, let us define
a(x) ≜ E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x], b(x) ≜ E[f+ (X, Y) exp{−i(X; Y)}|X = x]
We first additionally assume that f is bounded. Fix ϵ > 0 and denote Sϵ = {x : a(x) ≥ b(x) + ϵ}.
As ϵ → 0 we have Sϵ % {x : a(x) > b(x)} and thus if we show PX [Sϵ ] = 0 this will imply that
a(x) ≤ b(x) for PX -a.e. x. The symmetric argument shows b(x) ≤ a(x) and completes the proof
of the equality.
1
Still an unfortunate name for a quantity that can be negative, though.
i i
i i
i i
294
To show PX [Sϵ ] = 0 let us apply (18.4) to the function f(x, y) = f+ (x, y)1{x ∈ Sϵ }. Then we get
E[f+ (X, Y)1{X ∈ Sϵ } exp{−i(X; Y)}] = E[f+ (X̄, Y)1{i(X̄; Y) > −∞}1{X ∈ Sϵ }] .
Let us re-express both sides of this equality by taking the conditional expectations over Y to get:
E[b(X)1{X ∈ Sϵ }] = E[a(X̄)1{X̄ ∈ Sϵ }] .
But from the definition of Sϵ we have
E[b(X)1{X ∈ Sϵ }] ≥ E[(b(X̄) + ϵ)1{X̄ ∈ Sϵ }] .
(d)
Recall that X = X̄ and hence
E[b(X)1{X ∈ Sϵ }] ≥ E[b(X)1{X ∈ Sϵ }] + ϵPX [Sϵ ] .
Since f+ (and therefore b) was assumed to be bounded we can cancel the common term from both
sides and conclude PX [Sϵ ] = 0 as required.
Finally, to show (18.5) in full generality, given an unbounded f+ we define fn (x, y) =
min(f+ (x, y), n). Since (18.5) holds for fn we can take limit as n → ∞ on both sides of it:
lim E[fn (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = lim E[fn (X, Y) exp{−i(X; Y)}|X = x]
n→∞ n→∞
By the monotone convergence theorem (for conditional expectations!) we can take the limits inside
the expectations to conclude the proof.
i i
i i
i i
and no tools for constructing them available to Shannon. So facing the problem of understanding
if error-correction is even possible, Shannon decided to check if placing codewords randomly
in space will somehow result in favorable geometric arrangement. To everyone’s astonishment,
which is still producing aftershocks today, this method not only produced reasonable codes, but
in fact turned out to be optimal asymptotically (and almost-optimal non-asymptotically [239]).
We also remark that the method of proving existence of certain combinatorial objects by random
selection is known as Erdös’s probabilistic method [10], which Shannon apparently discovered
independently and, perhaps, earlier.
Theorem 18.5 (Shannon’s achievability bound). Fix a channel PY|X and an arbitrary input
distribution PX . Then for every τ > 0 there exists an (M, ϵ)-code with
ϵ ≤ P[i(X; Y) ≤ log M + τ ] + exp(−τ ). (18.9)
Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP and is equiv-
alent to maximizing information density, cf. (18.3). The step of maximizing the i(cm ; Y) makes
analyzing the error probability difficult. Similar to what we did in almost loss compression, cf. The-
orem 11.6, the first important step for showing the achievability bound is to consider a suboptimal
decoder. In Shannon’s bound, we consider a threshold-based suboptimal decoder g(y) as follows:
m, ∃! cm s.t. i(cm ; y) ≥ log M + τ
g ( y) = (18.10)
e, o.w.
In words, decoder g reports m as decoded message if and only if codeword cm is a unique one
with information density exceeding the threshold log M + τ . If there are multiple or none such
codewords, then decoder outputs a special value of e, which always results in error since W 6= e
ever. (We could have decreased probability of error slightly by allowing the decoder to instead
output a random message, or to choose any one of the messages exceeding the threshold, or any
other clever ideas. The point, however, is that even the simplistic resolution of outputting e already
achieves all qualitative goals, while simplifying the analysis considerably.)
For a given codebook (c1 , . . . , cM ), the error probability is:
Pe (c1 , . . . , cM ) = P[{i(cW ; Y) ≤ log M + τ } ∪ {∃m 6= W, i(cm ; Y) > log M + τ }]
where W is uniform on [M] and the probability space is as in (17.1).
The second (and most ingenious) step proposed by Shannon was to forego the complicated
discrete optimization of the codebook. His proposal is to generate the codebook (c1 , . . . , cM ) ran-
domly with cm ∼ PX i.i.d. for m ∈ [M] and then try to reason about the average E[Pe (c1 , . . . , cM )].
By symmetry, this averaged error probability over all possible codebooks is unchanged if we con-
dition on W = 1. Considering also the random variables (X, Y, X̄) as in (18.1), we get the following
chain:
E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y) > log M + τ }|W = 1]
i i
i i
i i
296
X
M
≤ P[i(c1 ; Y) ≤ log M + τ |W = 1] + P[i(cm̄ ; Y) > log M + τ |W = 1] (union bound)
m̄=2
( a)
= P [i(X; Y) ≤ log M + τ ] + (M − 1)P i(X; Y) > log M + τ
≤ P [i(X; Y) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 18.4)
≤ P [i(X; Y) ≤ log M + τ ] + exp(−τ ) ,
where the crucial step (a) follows from the fact that given W = 1 and m̄ 6= 1 we have
d
(c1 , cm̄ , Y) = (X, X̄, Y)
Remark 18.2 (Joint typicality). Shortly in Chapter 19, we will apply this theorem for the case
of PX = P⊗ n ⊗n
X1 (the iid input) and PY|X = PY1 |X1 (the memoryless channel). Traditionally, cf. [81],
decoders in such settings were defined with the help of so called “joint typicality”. Those decoders
given y = yn search for the codeword xn (both of which are an n-letter vectors) such that the
empirical joint distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PX1 ,Y1 , where
1
P̂xn ,yn (a, b) = · |{j ∈ [n] : xj = a, yj = b}|
n
is the joint empirical distribution of (xn , yn ). This definition is used for the case when random
coding is done with cj ∼ uniform on the type class {xn : P̂xn ≈ PX }. Another alternative, “entropic
Pn
typicality”, cf. [76], is to search for a codeword with j=1 log PX ,Y 1(xj ,yj ) ≈ H(X, Y). We think of
1 1
our requirement, {i(xn ; yn ) ≥ nγ1 }, as a version of “joint typicality” that is applicable to much
wider generality of channels (not necessarily over product alphabets, or memoryless).
Theorem 18.6 (DT bound). Fix a channel PY|X and an arbitrary input distribution PX . Then for
every τ > 0 there exists an (M, ϵ)-code with
M − 1 +
ϵ ≤ E exp − i(X; Y) − log (18.11)
2
i i
i i
i i
Setting Ŵ = g(Y) we note that given a codebook {c1 , . . . , cM }, we have by union bound
P[Ŵ 6= j|W = j] = P[i(cj ; Y) ≤ γ|W = j] + P[i(cj ; Y) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y) > γ]
j−1
X
≤ P[i(cj ; Y) ≤ γ|W = j] + P[i(ck ; Y) > γ|W = j].
k=1
Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:
1 X
M
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1
1 X
j−1
M X
≤ P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
M
j=1 k=1
M−1
= P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
2
M−1
= P[i(X; Y) ≤ γ] + E[exp(−i(X; Y))1 {i(X; Y) > γ}] (by (18.4))
2
h M−1 i
= E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
2
To optimize over γ , note the simple observation that U1E + V1Ec ≥ min{U, V}, with equal-
ity iff U ≥ V on E. Therefore for any x, y, 1[i(x; y) ≤ γ] + M− 1 −i(x;y)
2 e 1[i(x; y) > γ] ≥
M−1 −i(x;y) M−1
min(1, 2 e ), achieved by γ = log 2 regardless of x, y. Thus, we continue the bounding
as follows
h M−1 i
inf E[Pe (c1 , . . . , cM )] ≤ inf E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
γ γ 2
h M−1 i
= E min 1, exp(−i(X; Y))
2
M − 1 +
= E exp − i(X; Y) − log .
2
H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M−1
prior prob.: π 0 = , π1 = .
M+1 M+1
i i
i i
i i
298
Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent code-
word. As we know from binary hypothesis testing, the best threshold for the likelihood ratio test
(minimizing the weighted probability of error) is log ππ 10 , as we indeed found out.
One of the immediate benefits of Theorem 18.6 compared to Theorem 18.5 is precisely the fact
that we do not need to perform a cumbersome minimization over τ in (18.9) to get the minimum
upper bound in Theorem 18.5. Nevertheless, it can be shown that the DT bound is stronger than
Shannon’s bound with optimized τ . See also Exc. IV.30.
Finally, we remark (and will develop this below in our treatment of linear codes) that DT bound
and Shannon’s bound both hold without change if we generate {ci } by any other (non-iid) pro-
cedure with a prescribed marginal and pairwise independent codewords – see Theorem 18.13
below.
Theorem 18.7 (Feinstein’s lemma). Fix a channel PY|X and an arbitrary input distribution PX .
Then for every γ > 0 and for every ϵ ∈ (0, 1) there exists an (M, ϵ)max -code with
M ≥ γ(ϵ − P[i(X; Y) < log γ]) (18.12)
Remark 18.4 (Comparison with Shannon’s bound). We can also interpret (18.12) differently: for
any fixed M, there exists an (M, ϵ)max -code that achieves the maximal error probability bounded
as follows:
M
ϵ ≤ P[i(X; Y) < log γ] +
γ
If we take log γ = log M + τ , this gives the bound of exactly the same form as Shannon’s (18.9). It
is rather surprising that two such different methods of proof produced essentially the same bound
(modulo the difference between maximal and average probability of error). We will discuss the
reason for this phenomenon in Section 18.7.
Proof. From the definition of (M, ϵ)max -code, we recall that our goal is to find codewords
c1 , . . . , cM ∈ X and disjoint subsets (decoding regions) D1 , . . . , DM ⊂ Y , s.t.
PY|X (Di |ci ) ≥ 1 − ϵ, ∀i ∈ [M].
Feinstein’s idea is to construct a codebook of size M in a sequential greedy manner.
2
Nevertheless, we should point out that this is not a serious advantage: from any (M, ϵ) code we can extract an
(M′ , ϵ′ )max -subcode with a smaller M′ and larger ϵ′ – see Theorem 19.4.
i i
i i
i i
Ex ≜ {y ∈ Y : i(x; y) ≥ log γ}
Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them
into final decoding regions {Dx }, which will be disjoint. Next, we apply Corollary 18.4 and find
out that there is a set F ⊂ X with two properties: a) PX [F] = 1 and b) for every x ∈ F we have
1
PY ( Ex ) ≤ . (18.13)
γ
We can assume that P[i(X; Y) < log γ] ≤ ϵ, for otherwise the RHS of (18.12) is negative and
there is nothing to prove. We first claim that there exists some c ∈ F such that P[Y ∈ Ec |X =
c] = PY|X (Ec |c) ≥ 1 − ϵ. Indeed, assume (for the sake of contradiction) that ∀c ∈ F, P[i(c; Y) ≥
log γ|X = c] < 1 − ϵ. Note that since PX (F) = 1 we can average this inequality over c ∼ PX . Then
we arrive at P[i(X; Y) ≥ log γ] < 1 − ϵ, which is a contradiction.
With these preparations we construct the codebook in the following way:
1 Pick c1 to be any codeword in F such that PY|X (Ec1 |c1 ) ≥ 1 − ϵ, and set D1 = Ec1 ;
2 Pick c2 to be any codeword in F such that PY|X (Ec2 \D1 |c2 ) ≥ 1 − ϵ, and set D2 = Ec2 \D1 ;
...
−1
3 Pick cM to be any codeword in F such that PY|X (EcM \ ∪M j=1 Dj |cM ] ≥ 1 − ϵ, and set DM =
M−1
EcM \ ∪j=1 Dj .
We stop if cM+1 codeword satisfying the requirement cannot be found. Thus, M is determined by
the stopping condition:
∀c ∈ F, PY|X (Ec \ ∪M
j=1 Dj |c) < 1 − ϵ
Averaging the stopping condition over c ∼ PX (which is permissible due to PX (F) = 1), we
obtain
[
M
P i(X; Y) ≥ log γ and Y 6∈ Dj < 1 − ϵ,
j=1
or, equivalently,
[
M
ϵ < P i(X; Y) < log γ or Y ∈ Dj .
j=1
X
M
≤ P[i(X; Y) < log γ] + PY (Ecj )
j=1
i i
i i
i i
300
M
≤ P[i(X; Y) < log γ] +
γ
where the last step makes use of (18.13).Evidently, this completes the proof.
Theorem 18.8 (RCU bound). Fix a channel PY|X and an arbitrary input distribution PX . Then for
every integer M ≥ 1 there exists an (M, ϵ)-code with
ϵ ≤ E min 1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y , (18.14)
Proof. For a given codebook (c1 , . . . cM ) the average probability of error for the maximum
likelihood decoder, cf. (18.3), is upper bounded by
1 X M [
M
ϵ≤ P {i(cj ; Y) ≥ i(cm ; Y)} |X = cm .
M
m=1 j=1;j̸=m
Note that we do not necessarily have equality here, since the maximum likelihood decoder will
resolves ties (i.e. the cases when multiple codewords maximize information density) in favor of
the correct codeword, whereas in the expression above we pessimistically assume that all ties are
resolved incorrectly. Now, similar to Shannon’s bound in Theorem 18.5 we prove existence of a
i.i.d.
good code by averaging the last expression over cj ∼ PX .
To that end, notice that expectations of each term in the sum coincide (by symmetry). To evalu-
ate this expectation, let us take m = 1 condition on W = 1 and observe that under this conditioning
we have
Y
M
(c1 , Y, c2 , . . . , cM ) ∼ PX,Y PX .
j=2
i i
i i
i i
[
M
= E(x,y)∼PX,Y P {i(cj ; Y) ≥ i(c1 ; Y)} c1 = x, Y = y, W = 1
(a)
j=2
( b)
≤ E min{1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y }
where (a) is just expressing the probability by first conditioning on the values of (c1 , Y); and (b)
corresponds to applying the union bound but capping the result by 1. This completes the proof
of the bound. We note that the step (b) is the essence of the RCU bound and corresponds to the
self-evident fact that for any collection of events Ej we have
X
P[∪Ej ] ≤ min{1, P[Ej ]} .
j
What is makes its application clever is that we first conditioned on (c1 , Y). If we applied the union
bound right from the start without conditioning, the resulting estimate on ϵ would have been much
weaker (in particular, would not have lead to achieving capacity).
It turns out that Shannon’s bound Theorem 18.5 is just a weaking of (18.14) obtained by split-
ting the expectation according to whether or not i(X; Y) ≤ log β and upper bounding min{x, 1}
by 1 when i(X; Y) ≤ log β and by x otherwise. Another such weaking is a famous Gallager’s
bound [132]:
Theorem 18.9 (Gallager’s bound). Fix a channel PY|X , an arbitrary input distribution PX and
ρ ∈ [0, 1]. Then there exists an (M, ϵ) code such that
" 1+ρ #
i ( X̄ ; Y)
ϵ ≤ Mρ E E exp Y (18.15)
1+ρ
Proof. We first notice that by Proposition 18.3 applied with f+ (x, y) = exp{ i1(+ρ
x; y)
} and
interchanged X and Y we have for PY -almost every y
ρ 1 1
E[exp{−i(X; Y) }|Y = y] = E[exp{i(X; Ȳ) }|Ȳ = y] = E[exp{i(X̄; Y) }|Y = y] ,
1+ρ 1+ρ 1+ρ
(18.16)
d
where we also used the fact that (X, Ȳ) = (X̄, Y) under (18.1).
Now, consider the bound (18.14) and replace the min via the bound
min{t, 1} ≤ tρ ∀t ≥ 0 . (18.17)
this results in
ϵ ≤ Mρ E P[i(X̄; Y) > i(X; Y)|X, Y]ρ . (18.18)
We apply the Chernoff bound
1 1
P[i(X̄; Y) > i(X; Y)|X, Y] ≤ exp{− i(X; Y)} E[exp{ i(X̄; Y)}|Y] .
1+ρ 1+ρ
i i
i i
i i
302
The key innovation of Gallager – a step (18.17), which became know as the ρ-trick – cor-
responds to the following version of the union bound: For any events Ej and 0 ≤ ρ ≤ 1 we
have
ρ
X X
P[∪Ej ] ≤ min{1, P[Ej ]} ≤ P [ Ej ] .
j j
Now to understand properly the significance of Gallager’s bound we need to first define the concept
of the memoryless channels (see (19.1) below). For such channels and using the iid inputs, the
expression (18.15) turns, after optimization over ρ, into
ϵ ≤ exp{−nEr (R)} ,
where R = logn M is the rate and Er (R) is the Gallager’s random coding exponent. This shows that
not only the error probability at a fixed rate can be made to vanish, but in fact it can be made to
vanish exponentially fast in the blocklength. We will discuss such exponential estimates in more
detail in Section 22.4*.
Definition 18.10 (Linear codes). Let Fq denote the finite field of cardinality q (cf. Definition 11.7).
Let the input and output space of the channel be X = Y = Fnq . We say a codebook C = {cu : u ∈
Fkq } of size M = qk is a linear code if C is a k-dimensional linear subspace of Fnq .
i i
i i
i i
• Generator matrix G ∈ Fkq×n , so that the codeword for each u ∈ Fkq is given by cu = uG
(row-vector convention) and the codebook C is the row-span of G, denoted by Im(G);
(n−k)×n
• Parity-check matrix H ∈ Fq , so that each codeword c ∈ C satisfies Hc⊤ = 0. Thus C is
the nullspace of H, denoted by Ker(H). We have HG⊤ = 0.
Example 18.1 (Hamming code). The [7, 4, 3]2 Hamming code over F2 is a linear code with the
following generator and parity check matrices:
1 0 0 0 1 1 0
0 1 1 0 1 1 0 0
1 0 0 1 0 1
G=
0
, H= 1 0 1 1 0 1 0
0 1 0 0 1 1
0 1 1 1 0 0 1
0 0 0 1 1 1 1
In particular, G and H are of the form G = [I; P] and H = [−P⊤ ; I] (systematic codes) so that
HG⊤ = 0. The following picture helps to visualize the parity check operation:
x5
x2 x1
x4
x7 x6
x3
Note that all four bits in each circle (corresponding to a row of H) sum up to zero. One can verify
that the minimum distance of this code is 3 bits. As such, it can correct 1 bit of error and detect 2
bits of error.
Linear codes are almost always examined with channels of additive noise, a precise definition
of which is given below:
Definition 18.11 (Additive noise). A channel PY|X with input and output space Fnq is called
additive-noise if
PY|X (y|x) = PZ (y − x)
for some random vector Z taking values in Fnq . In other words, Y = X + Z, where Z ⊥
⊥ X.
Given a linear code and an additive-noise channel PY|X , it turns out that there is a special
“syndrome decoder” that is optimal.
i i
i i
i i
304
Theorem 18.12. Any [k, n]Fq linear code over an additive-noise PY|X has a maximum likelihood
(ML) decoder g : Fnq → Fkq such that:
1 g(y) = y − gsynd (Hy⊤ ), i.e., the decoder is a function of the “syndrome” Hy⊤ only. Here gsynd :
Fnq−k → Fnq , defined by gsynd (s) ≜ argmaxz:Hz⊤ =s PZ (z), is called the “syndrome decoder”,
which decodes the most likely realization of the noise.
2 (Geometric uniformity) Decoding regions are translates of D0 = Im(gsynd ): Du = cu + D0 for
any u ∈ Fkq .
3 Pe,max = Pe .
In other words, syndrome is a sufficient statistic (Definition 3.8) for decoding a linear code.
Remark 18.5. As a concrete example, consider the binary symmetric channel BSC⊗ n
δ previously
considered in Example 17.1 and Section 17.2. This is an additive-noise channel over Fn2 , where
i.i.d.
Y = X + Z and Z = (Z1 , . . . , Zn ) ∼ Ber(δ). Assuming δ < 1/2, the syndrome decoder aims
to find the noise realization with the fewest number of flips that is compatible with the received
codeword, namely gsynd (s) = argminz:Hz⊤ =s wH (z), where wH denotes the Hamming weight. In
this case elements of the image of gsynd , which we deonted by D0 , are known as “minimal weight
coset leaders”. Counting how many of them occur at each Hamming weight is a difficult open
problem even for the most well-studied codes such as Reed-Muller ones. In Hamming space D0
looks like a Voronoi region of a lattice and Du ’s constitute a Voronoi tesselation of Fnq .
Remark 18.6. Overwhelming majority of practically used codes are in fact linear codes. Early in
the history of coding, linearity was viewed as a way towards fast and low-complexity encoding (just
binary matrix multiplication) and slightly lower complexity of the maximum-likelihood decoding
(via the syndrome decoder). As codes became longer and longer, though, the syndrome decoding
became impractical and today only those codes are used in practice for which there are fast and
low-complexity (suboptimal) decoders.
i i
i i
i i
Theorem 18.13 (DT bound for linear codes). Let PY|X be an additive noise channel over Fnq . For
all integers k ≥ 1 there exists a linear code f : Fkq → Fnq with error probability:
+
− n−k−logq 1
Pe,max = Pe ≤ E q .
P Z ( Z)
(18.19)
Remark 18.7. The bound above is the same as Theorem 18.6 evaluated with PX = Unif(Fnq ). The
analogy between Theorems 18.6 and 18.13 is the same as that between Theorems 11.6 and 11.8
(full random coding vs random linear codes).
Proof. Recall that in proving DT bound (Theorem 18.6), we selected the codewords
i.i.d.
c1 , . . . , cM ∼ PX and showed that
M−1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y) ≤ γ] + P[i(X; Y) ≥ γ]
2
Here we will adopt the same approach and take PX = Unif(Fnq ) and M = qk .
By Theorem 18.12 the optimal decoding regions are translational invariant, i.e. Du = cu +
D0 , ∀u, and therefore:
cu = uG + h, ∀u ∈ Fkq
where random G and h are drawn as follows: the k × n entries of G and the 1 × n entries
of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role that the
all-zero codeword plays (since it is contained in any linear codebook).
Step 2: We claim that the codewords are pairwise independent and uniform, i.e. ∀u 6= u′ ,
(cu , cu′ ) ∼ (X, X), where PX,X (x, x) = 1/q2n . To see this note that
cu ∼ uniform on Fnq
cu′ = u′ G + h = uG + h + (u′ − u)G = cu + (u′ − u)G
i i
i i
i i
306
Step 5: Remove dithering h. We claim that there exists a linear code without dithering such
that (18.20) is satisfied. The intuition is that shifting a codebook has no effect on its
performance. Indeed,
• Before, with dithering, the encoder maps u to uG + h, the channel adds noise to produce
Y = uG + h + Z, and the decoder g outputs g(Y).
• Now, without dithering, we encode u to uG, the channel adds noise to produce Y =
uG + Z, and we apply decode g′ defined by g′ (Y) = g(Y + h).
By doing so, we “simulate” dithering at the decoder end and the probability of error
remains the same as before. Note that this is possible thanks to the additivity of the noisy
channel.
We see that random coding can be done with different ensembles of codebooks. For example,
we have
i.i.d.
• Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random ensemble.
• Elias ensemble [113]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G drawn uniformly
at random from the set of all matrices. (This ensemble is used in the proof of Theorem 18.13.)
• Gallager ensemble: C = {c : Hc⊤ = 0}, with the (n − k) × n parity-check matrix H drawn
uniformly at random. Note this is not the same as the Elias ensemble.
• One issue with Elias ensemble is that with some non-zero probability G may fail to be full rank.
(It is a good exercise to find P [rank(G) < k] as a function of n, k, q.) If G is not full rank, then
there are two identical codewords and hence Pe,max ≥ 1/2. To fix this issue, one may let the
generator matrix G be uniform on the set of all k × n matrices of full (row) rank.
• Similarly, we may modify Gallager’s ensemble by taking the parity-check matrix H to be
uniform on all n × (n − k) full rank matrices.
For the modified Elias and Gallager’s ensembles, we could still do the analysis of random coding.
A small modification would be to note that this time (X, X̄) would have distribution
1
PX,X̄ = 1{X̸=X′ }
q2n − qn
uniform on all pairs of distinct codewords and are not pairwise independent.
Finally, we note that the Elias ensemble with dithering, cu = uG + h, has pairwise independence
property and its joint entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q. This is significantly
smaller than for Shannon’s fully random ensemble that we used in Theorem 18.5. Indeed, when
i i
i i
i i
i.i.d.
cj ∼ Unif(Fnq ) we have H(c1 , . . . , cM ) = qk n log q. An interesting question, thus, is to find
min H(c1 , . . . , cM )
where the minimum is over all distributions with P[ci = a, cj = b] = q−2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly, we
may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case, the Wozencraft
ensemble (see Exercise IV.13) for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q, which is
essentially our lower bound.
In short, we will see that the answer is that both of these methods are well-known to be (almost)
optimal for submodular function maximization, and this is exactly what channel coding is about.
Before proceeding, we notice that in the second question it is important to qualify that PX
is simple, since taking PX to be supported on the optimal M∗ (ϵ)-achieving codebook would of
course result in very good performance. However, instead we will see that choosing rather simple
PX already achieves a rather good lower bound on M∗ (ϵ). More explicitly, by simple we mean a
product distribution for the memoryless channel. Or, as an even better example to have in mind,
consider an additive-noise vector channel:
Yn = Xn + Zn
with addition over a product abelian group and arbitrary (even non-memoryless) noise Zn . In this
case the choice of uniform PX in random coding bound works, and is definitely “simple”.
The key observation of [23] is submodularity of the function mapping a codebook C ⊂ X to
the |C|(1 − Pe,MAP (C)), where Pe,MAP (C) is the probability of error under the MAP decoder (17.5).
(Recall (1.7) for the definition of submodularity.) More expicitly, consider a discrete Y and define
X
S(C) ≜ max PY|X (y|x) , S(∅) = 0
x∈C
y∈Y
i i
i i
i i
308
and set
Ct+1 = Ct ∪ {xt+1 } .
In other words, the probability of successful (MAP) decoding for the greedily constructed code-
book is at most a factor (1 − 1/e) away from the largest possible probability of success among all
codebooks of the same cardinality. Since we are mostly interested in success probabilities very
close to 1, this result may not appear very exciting. However, a small modification of the argument
yields the following (see [185, Theorem 1.5] for the proof):
Theorem 18.14 ([223]). For any non-negative submodular set-function f and a greedy sequence
Ct we have for all ℓ, t:
Applying this to the special case of f(·) = S(·) we obtain the result of [23]: The greedily
constructed codebook C ′ with M′ codewords satisfies
M ′
1 − Pe,MAP (C ′ ) ≥ ′
(1 − e−M /M )(1 − ϵ∗ (M)) .
M
In particular, the greedily constructed code with M′ = M2−10 achieves probability of success that
is ≥ 0.9995(1 −ϵ∗ (M)). In other words, compared to the best possible code a greedy code carrying
10 bits fewer of data suffers at most 5 · 10−4 worse probability of error. This is a very compelling
evidence for why greedy construction is so good. We do note, however, that Feinstein’s bound
does greedy construction not with the MAP decoder, but with a suboptimal one.
Next we address the question of random coding. Recall that our goal is to explain how can
selecting codewords uniformly at random from a “simple” distribution PX be any good. The key
i i
i i
i i
idea is again contained in [223]. The set-function S(C) can also be understood as a function with
domain {0, 1}|X | . Here is a natural extension of this function to the entire solid hypercube [0, 1]|X | :
X X
SLP (π ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} . (18.21)
x, y x
Indeed, it is easy to see that SLP (1C ) = S(C) and that SLP is a concave function.3
Since SLP is an extension of S it is clear that
X
S∗ (M) ≤ S∗LP (M) ≜ max{SLP (π ) : 0 ≤ π x ≤ 1, π x ≤ M} . (18.22)
x
In fact, we will see later in Section 22.3 that this bound coincides with the bound known as
meta-converse. Surprisingly, [223] showed that the greedy construction not only achieves a large
multiple of S∗ (M) but also of S∗LP (M):
To connect to the concept of random coding, though, we need the following result of [23]:4
P i.i.d.
Theorem 18.15. Fix π ∈ [0, 1]|X | and let M = x∈X π x . Let C = {c1 , . . . , cM′ } with cj ∼ PX (x) =
πx
M . Then we have
′
E[S(C)] ≥ (1 − e−M /M )SLP (π ) .
The proof of this result trivially follows from applying the following lemma with g(x) =
PY|X (y|x), summing over y and recalling the definition of SLP in (18.21).
Lemma 18.16. Let π and C be as in Theorem. Let g : X → R be any function and denote
P P
T(π , g) = max{ x rx g(x) : 0 ≤ rx ≤ π x , x rx = 1}. Then
′
E[max g(x)] ≥ (1 − e−M /M )T(π , g) .
x∈C
3
There are a number of standard extensions of a submodular function f to a hypercube. The largest convex interpolant f+ ,
also known as Lovász extension, the least concave interpolant f− , and multi-linear extension [56]. However, SLP does not
coincide with any of these and in particular strictly larger (in general) than f− .
4
There are other ways of doing “random coding” to produce an integer solution from a fractional one. For example, see the
multi-linear extension based one in [56].
i i
i i
i i
310
Proof. Without loss of generality we take X = [m] and g(1) ≥ g(2) ≥ · · · ≥ g(m) ≥ g(m + 1) ≜
′ ′
0. Denote for convenience a = 1 − (1 − M1 )M ≥ 1 − e−M /M , b(j) ≜ P[{1, . . . , j} ∩ C 6= ∅]. Then
P[max g(x) = g(j)] = b(j) − b(j − 1) ,
x∈C
b(j) ≥ ac(j) .
Plugging this into (18.24) we conclude the proof by noticing that rj = c(j) − c(j − 1) attains the
maximum in the definition of T(π , g).
Theorem 18.15 completes this section’s goal and shows that the random coding (as well as the
greedy/maximal coding) attains an almost optimal value of S∗ (M). Notice also that the random
coding distribution that we should be using is the one that attains the definition of S∗LP (M). For input
symmetric channels (such as additive noise ones) it is easy to show that the optimal π ∈ [0, 1]X is
a constant vector, and hence the codewords are to be generated iid uniformly on X .
i i
i i
i i
19 Channel capacity
In this chapter we apply methods developed in the previous chapters (namely the weak converse
and the random/maximal coding achievability) to compute the channel capacity. This latter notion
quantifies the maximal amount of (data) bits that can be reliably communicated per single channel
use in the limit of using the channel many times. Formalizing the latter statement will require
introducing the concept of a communication channel. Then for special kinds of channels (the
memoryless and the information stable ones) we will show that computing the channel capacity
reduces to maximizing the (sequence of the) mutual informations. This result, known as Shannon’s
noisy channel coding theorem, is very special as it relates the value of a (discrete, combinatorial)
optimization problem over codebooks to that of a (convex) optimization problem over information
measures. It builds a bridge between the abstraction of Information Measures (Part I) and the
practical engineering problems.
Information theory as a subject is sometimes accused of “asymptopia”, or the obsession with
asymptotic results and computing various limits. Although in this book we mostly refrain from
asymptopia, the topic of this chapter requires committing this sin ipso facto.
Definition 19.1. Fix an input alphabet A and an output alphabet B . A sequence of Markov kernels
PYn |Xn : An → B n indexed by the integer n = 1, 2 . . . is called a channel. The length of the input
n is known as blocklength.
To give this abstract notion more concrete form one should recall Section 17.2, in which we
described the BSC channel. Note, however, that despite this definition, it is customary to use the
term channel to refer to a single Markov kernel (as we did before in this book). An even worse,
311
i i
i i
i i
312
yet popular, abuse of terminology is to refer to n-th element of the sequence, the kernel PYn |Xn , as
the n-letter channel.
Although we have not imposed any requirements on the sequence of kernels PYn |Xn , one is never
interested in channels at this level of generality. Most of the time the elements of the channel input
Xn = (X1 , . . . , Xn ) are thought as indexed by time. That is the Xt corresponds to the letter that is
transmitted at time t, while Yt is the letter received at time t. The channel’s action is that of “adding
noise” to Xt and outputting Yt . However, the generality of the previous definition allows to model
situations where the channel has internal state, so that the amount and type of noise added to Xt
depends on the previous inputs and in principle even on the future inputs. The interpretation of t
as time, however, is not exclusive. In storage (magnetic, non-volatile or flash) t indexes space. In
those applications, the noise may have a rather complicated structure with transformation Xt → Yt
depending on both the “past” X<t and the “future” X>t .
Almost all channels of interest satisfy one or more of the restrictions that we list next:
• A channel is called non-anticipatory if it has the following extension property. Under the n-letter
kernel PYn |Xn , the conditional distribution of the first k output symbols Yk only depends on Xk
(and not on Xnk+1 ) and coincides with the kernel PYk |Xk (the k-th element of the channel sequence)
the k-th channel transition kernel in the sequence. This requirement models the scenario wherein
channel outputs depend causally on the inputs.
• A channel is discrete if A and B are finite.
• A channel is additive-noise if A = B are abelian group and Yn = Xn + Zn for some Zn
independent of Xn (see Definition 18.11). Thus
Y
n
PYn |Xn = PYk |Xk . (19.1)
k=1
where each PYk |Xk : A → B ; in particular, PYn |Xn are compatible at different blocklengths n.
• A channel is stationary memoryless if (19.1) is satisfied with PYk |Xk not depending on k, denoted
commonly by PY|X . In other words,
The interpretation is that each coordinate of the transmitted codeword Xn is corrupted by noise
independently with the same noise statistic.
• Discrete memoryless stationary channel (DMC): A DMC is a channel that is both discrete and
stationary memoryless. It can be specified in two ways:
i i
i i
i i
– an |A| × |B|-dimensional (row-stochastic) matrix PY|X where elements specify the transition
probabilities;
– a bipartite graph with edge weight specifying the transition probabilities.
Fig. 19.1 lists some common binary-input binary-output DMCs.
Let us recall the example of the AWGN channel Example 3.3: the alphabets A = B = R and
Yn = Xn + Zn , with Xn ⊥ ⊥ Zn ∼ N (0, σ 2 In ). This channel is a non-discrete, stationary memoryless,
additive-noise channel.
Having defined the notion of the channel, we can define next the operational problem that the
communication engineer faces when tasked with establishing a data link across the channel. Since
the channel is noisy, the data is not going to pass unperturbed and the error-correcting codes are
naturally to be employed. To send one of M = 2k messages (or k data bits) with low probabil-
ity of error, it is often desirable to use the shortest possible length of the input sequence. This
desire explains the following definitions, which extend the fundamental limits in Definition 17.2
to involve the blocklength n.
• An (n, M, ϵ)-code is an (M, ϵ)-code for PYn |Xn , consisting of an encoder f : [M] → An and a
decoder g : B n → [M] ∪ {e}.
• An (n, M, ϵ)max -code is analogously defined for the maximum probability of error.
How to understand the behaviour of M∗ (n, ϵ)? Recall that blocklength n measures the amount
of time or space resource used by the code. Thus, it is natural to maximize the ratio of the data
i i
i i
i i
314
transmitted to the resource used, and that leads us to the notion of the transmission rate defined as
log M
R = n2 and equal to the number of bits transmitted per channel use. Consequently, instead of
studying M∗ (n, ϵ) one is lead to the study of 1n log M∗ (n, ϵ). A natural first question is to determine
the first-order asymptotics of this quantity and this motivates the final definition of the Section.
Definition 19.3 (Channel capacity). The ϵ-capacity Cϵ and Shannon capacity C are defined as
follows
1
Cϵ ≜ lim inf log M∗ (n, ϵ);
n→∞ n
C = lim Cϵ .
ϵ→0+
The operational meaning of Cϵ (resp. C) is the maximum achievable rate at which one can
communicate through a noisy channel with probability of error at most ϵ (resp. o(1)). In other
words, for any R < C, there exists an (n, exp(nR), ϵn )-code, such that ϵn → 0. In this vein, Cϵ and
C can be equivalently defined as follows:
The reason that capacity is defined as a large-n limit (as opposed to a supremum over n) is because
we are concerned with rate limit of transmitting large amounts of data without errors (such as in
communication and storage).
The case of zero-error (ϵ = 0) is so different from ϵ > 0 that the topic of ϵ = 0 constitutes a
separate subfield of its own (cf. the survey [182]). Introduced by Shannon in 1956 [279], the value
1
C0 ≜ lim inf log M∗ (n, 0) (19.6)
n→∞ n
is known as the zero-error capacity and represents the maximal achievable rate with no error
whatsoever. Characterizing the value of C0 is often a hard combinatorial problem. However, for
many practically relevant channels it is quite trivial to show C0 = 0. This is the case, for example,
for the DMCs we considered before: the BSC or BEC. Indeed, for them we have log M∗ (n, 0) = 0
for all n, meaning transmitting any amount of information across these channels requires accepting
some (perhaps vanishingly small) probability of error. Nevertheless, there are certain interesting
and important channels for which C0 is positive, cf. Section 23.3.1 for more.
As a function of ϵ the Cϵ could (most generally) behave like the plot below on the left-hand
side below. It may have a discontinuity at ϵ = 0 and may be monotonically increasing (possibly
even with jump discontinuities) in ϵ. Typically, however, Cϵ is zero at ϵ = 0 and stays constant for
all 0 < ϵ < 1 and, hence, coincides with C (see the plot on the right-hand side). In such cases we
say that the strong converse holds (more on this later in Section 22.1).
i i
i i
i i
Cǫ Cǫ
strong converse
holds
Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1
In Definition 19.3, the capacities Cϵ and C are defined with respect to the average probabil-
ity of error. By replacing M∗ with M∗max , we can define, analogously, the capacities Cϵ
(max)
and
C(max) with respect to the maximal probability of error. It turns out that these two definitions are
equivalent, as the next theorem shows.
Proof. The second inequality is obvious, since any code that achieves a maximum error
probability ϵ also achieves an average error probability of ϵ.
For the first inequality, take an (n, M, ϵ(1 − τ ))-code, and define the error probability for the jth
codeword as
λj ≜ P[Ŵ 6= j|W = j]
Then
X X X
M(1 − τ )ϵ ≥ λj = λj 1{λj ≤ϵ} + λj 1{λj >ϵ} ≥ ϵ|{j : λj > ϵ}|.
Hence |{j : λj > ϵ}| ≤ (1 − τ )M. (Note that this is exactly Markov inequality.) Now by removing
those codewords1 whose λj exceeds ϵ, we can extract an (n, τ M, ϵ)max -code. Finally, take M =
M∗ (n, ϵ(1 − τ )) to finish the proof.
(max)
Corollary 19.5 (Capacity under maximal probability of error). Cϵ = Cϵ for all ϵ > 0 such
thatIn particular, C(max) = C.
Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0
1
Cϵ ≥ C(ϵmax) ≥ lim inf log τ M∗ (n, ϵ(1 − τ )) ≥ Cϵ(1−τ )
n→∞ n
(max)
Sending τ → 0 yields Cϵ ≥ Cϵ ≥ Cϵ− .
1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to reach a
desired property.
i i
i i
i i
316
Note that information capacity C(I) so defined is not the same as the Shannon capacity C in Def-
inition 19.3; as such, from first principles it has no direct interpretation as an operational quantity
related to coding. Nevertheless, they are related by the following coding theorems. We start with
a converse result:
C( I )
Theorem 19.7 (Upper Bound for Cϵ ). For any channel, ∀ϵ ∈ [0, 1), Cϵ ≤ 1−ϵ and C ≤ C(I) .
Proof. Applying the general weak converse bound in Theorem 17.3 to PYn |Xn yields
supPXn I(Xn ; Yn ) + h(ϵ)
log M∗ (n, ϵ) ≤
1−ϵ
Normalizing this by n and taking the lim inf as n → ∞, we have
1 1 supPXn I(Xn ; Yn ) + h(ϵ) C(I)
Cϵ = lim inf log M∗ (n, ϵ) ≤ lim inf = .
n→∞ n n→∞ n 1−ϵ 1−ϵ
Theorem 19.8 (Lower Bound for Cϵ ). For a stationary memoryless channel, Cϵ ≥ supPX I(X; Y),
for any ϵ ∈ (0, 1].
Here the information density is defined with respect to the distribution PXn ,Yn = P⊗ n
X,Y and, therefore,
X
n
dPX,Y Xn
i(Xn ; Yn ) = log (Xk , Yk ) = i(Xk ; Yk ),
dPX PY
k=1 k=1
i i
i i
i i
where i(x; y) = iPX,Y (x; y) and i(xn ; yn ) = iPXn ,Yn (xn ; yn ). What is important is that under PXn ,Yn the
random variable i(Xn ; Yn ) is a sum of iid random variables with mean I(X; Y). Thus, by the weak
law of large numbers we have
P[i(Xn ; Yn ) < n(I(X; Y) − δ)] → 0
for any δ > 0.
With this in minde, let us set log M = n(I(X; Y) − 2δ) for some δ > 0, and take τ = δ n in
Shannon’s bound. Then for the error bound we have
" n #
X n→∞
ϵn ≤ P i(Xk ; Yk ) ≤ nI(X; Y) − δ n + exp(−δ n) −−−→ 0, (19.7)
k=1
Since the bound converges to 0, we have shown that there exists a sequence of (n, Mn , ϵn )-codes
with ϵn → 0 and log Mn = n(I(X; Y) − 2δ). Hence, for all n such that ϵn ≤ ϵ
log M∗ (n, ϵ) ≥ n(I(X; Y) − 2δ)
And so
1
Cϵ = lim inf log M∗ (n, ϵ) ≥ I(X; Y) − 2δ
n→∞ n
Since this holds for all PX and all δ > 0, we conclude Cϵ ≥ supPX I(X; Y).
The following result follows from pairing the upper and lower bounds on Cϵ .
Theorem 19.9 (Shannon’s Noisy Channel Coding Theorem [277]). For a stationary memoryless
channel,
C = C(I) = sup I(X; Y). (19.8)
PX
As we mentioned several times already this result is among the most significant results in
information theory. From the engineering point of view, the major surprise was that C > 0,
i.e. communication over a channel is possible with strictly positive rate for any arbitrarily small
probability of error. The way to achieve this is to encode the input data jointly (i.e. over many
input bits together). This is drastically different from the pre-1948 methods, which operated on
a letter-by-letter bases (such as Morse code). This theoretical result gave impetus (and still gives
guidance) to the evolution of practical communication systems – quite a rare achievement for an
asymptotic mathematical fact.
Proof. Statement (19.8) contains two equalities. The first one follows automatically from the
second and Theorems 19.7 and 19.8. To show the second equality C(I) = supPX I(X; Y), we note
that for stationary memoryless channels C(I) is in fact easy to compute. Indeed, rather than solving
a sequence of optimization problems (one for each n) and taking the limit of n → ∞, memoryless-
ness of the channel implies that only the n = 1 problem needs to be solved. This type of result is
known as “single-letterization” in information theory and we show it formally in the following
proposition, which concludes the proof.
i i
i i
i i
318
Q
Proof. Recall that from Theorem 6.1 we know that for product kernels PYn |Xn = PYi |Xi , mutual
P n
information satisfies I(Xn ; Yn ) ≤ k=1 I(Xk ; Yk ) with equality whenever Xi ’s are independent.
Then
1
C(I) = lim inf sup I(Xn ; Yn ) = lim inf sup I(X; Y) = sup I(X; Y).
n→∞ n P Xn n→∞ PX PX
Shannon’s noisy channel theorem shows that by employing codes of large blocklength, we can
approach the channel capacity arbitrarily close. Given the asymptotic nature of this result (or any
other asymptotic result), a natural question is understanding the price to pay for reaching capacity.
This can be understood in two ways:
The main tool in the proof of Theorem 19.8 was the law of large numbers. The lower bound
Cϵ ≥ C(I) in Theorem 19.8 shows that log M∗ (n, ϵ) ≥ nC + o(n) (this just restates the fact
that normalizing by n and taking the lim inf must result in something ≥ C). If instead we apply
i i
i i
i i
a more careful analysis using the central-limit theorem (CLT), we obtain the following refined
achievability result.
Theorem 19.11. Consider a stationary memoryless channel with a capacity-achieving input dis-
tribution. Namely, C = maxPX I(X; Y) = I(X∗ ; Y∗ ) is attained at P∗X , which induces PX∗ Y∗ =
PX∗ PY|X . Assume that V = Var[i(X∗ ; Y∗ )] < ∞. Then
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n),
where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.
Proof. Writing the little-o notation in terms of lim inf, our goal is
log M∗ (n, ϵ) − nC
lim inf √ ≥ −Q−1 (ϵ) = Φ−1 (ϵ),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound
∃(n, M, ϵ)max : M ≥ β (ϵ − P[i(Xn ; Yn ) ≤ log β])
√
Take log β = nC + nVt, then applying the CLT gives
√ hX √ i
log M ≥ nC + nVt + log ϵ − P i(Xk ; Yk ) ≤ nC + nVt
√
=⇒ log M ≥ nC + nVt + log (ϵ − Φ(t)) ∀Φ(t) < ϵ
log M − nC log(ϵ − Φ(t))
=⇒ √ ≥t+ √ ,
nV nV
where Φ(t) is the standard normal CDF. Taking the liminf of both sides
log M∗ (n, ϵ) − nC
lim inf √ ≥ t,
n→∞ nV
for all t such that Φ(t) < ϵ. Finally, taking t % Φ−1 (ϵ), and writing the liminf in little-oh notation
completes the proof
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n).
Remark 19.1. Theorem 19.9 implies that for any R < C, there exists a sequence of (n, exp(nR), ϵn )-
codes such that the probability of error ϵn vanishes as n → ∞. Examining the upper bound (19.7),
we see that the probability of error actually vanishes exponentially fast, since the event in the first
term is of large-deviations type (recall Chapter 15) so that both terms are exponentially small.
Finding the value of the optimal exponent (or even the existence thereof) has a long history (but
remains generally open) in information theory, see Section 22.4*. Recently, however, it was under-
stood that a practically more relevant, and also much easier to analyze, is the regime of fixed
(non-vanishing) error ϵ, in which case the main question is to bound the speed of convergence of
R → Cϵ = C. Previous theorem shows one bound on this speed of convergence. See Sections 22.5
i i
i i
i i
320
√
and 22.6 for more. In particular, we will show that the bound on the n term in Theorem 19.11
is often tight.
C
C C
1 bit
1 bit 1 bit
δ
0 1 1 δ δ
2 0 1 0 1
BSCδ BECδ Z-channel
First for the BSCδ we have the following description of the input-output law:
Y = X + Z mod 2, Z ∼ Ber(δ) ⊥
⊥X
More generally, for all additive-noise channel over a finite abelian group G, C = supPX I(X; X +
Z) = log |G| − H(Z), achieved by X ∼ Unif(G).
Next we consider the binary erasure channel (BEC). BECδ is a multiplicative channel. Indeed,
if we define the input X ∈ {±1} and output Y ∈ {±1, 0}, then BEC relation can be written as
Y = XZ, Z ∼ Ber(δ) ⊥
⊥ X.
To compute the capacity, we first notice that even without evaluating Shannon’s formula, it is
clear that C ≤ 1 − δ (bit), because for a large blocklength n about δ -fraction of the message is
completely lost (even if the encoder knows a priori where the erasures are going to occur, the rate
still cannot exceed 1 − δ ). More formally, we notice that P[X = +1|Y = e] = P[X= δ
1]δ
= P[X = 1]
and therefore
i i
i i
i i
Finally, the Z-channel can be thought of as a multiplicative channel with transition law
Y = XZ, X ∈ { 0, 1} ⊥
⊥ Z ∼ Ber(1 − δ) ,
PY|X (f−
o (E)|fi (x)) = PY|X (E|x) ,
1
for all measurable E ⊂ Y and x ∈ X . Two symmetries f and g can be composed to produce another
symmetry as
( gi , go ) ◦ ( fi , fo ) ≜ ( gi ◦ fi , fo ◦ go ) . (19.9)
Note that both components of an automorphism f = (fi , fo ) are bimeasurable bijections, that is
fi , f− 1 −1
i , fo , fo are all measurable and well-defined functions.
Naturally, every symmetry group G possesses a canonical left action on X × Y defined as
Since the action on X × Y splits into actions on X and Y , we will abuse notation slightly and write
g · ( x, y) ≜ ( g x , g y ) .
For the cases of infinite X , Y we need to impose certain additional regularity conditions:
i i
i i
i i
322
G×X ×Y →X ×Y
is measurable.
Note that under the regularity assumption the action (19.10) also defines a left action of G on
P(X ) and P(Y) according to
or, in words, if X ∼ PX then gX ∼ gPX , and similarly for Y and gY. For every distribution PX we
define an averaged distribution P̄X as
Z
P̄X [E] ≜ PX [g−1 E]ν(dg) , (19.13)
G
which is the distribution of random variable gX when g ∼ ν and X ∼ PX . The measure P̄X is G-
invariant, in the sense that gP̄X = P̄X . Indeed, by left-invariance of ν we have for every bounded
function f
Z Z
f(g)ν(dg) = f(hg)ν(dg) ∀h ∈ G ,
G G
and therefore
Z
P̄X [h−1 E] = PX [(hg)−1 E]ν(dg) = P̄X [E] .
G
In other words, if the pair (X, Y) is generated by taking X ∼ PX and applying PY|X , then the pair
(gX, gY) has marginal distribution gPX but conditional kernel is still PY|X . For finite X , Y this is
equivalent to
i i
i i
i i
which may also be taken as the definition of the automorphism. In terms of the G-action on P(Y)
we may also say:
It is not hard to show that for any channel and a regular group of symmetries G the capacity-
achieving output distribution must be G-invariant, and capacity-achieving input distribution can
be chosen to be G-invariant. That is, the saddle point equation
inf sup D(PY|X kQY |PX ) = sup inf D(PY|X kQY |PX ) ,
PX QY QY PX
can be solved in the class of G-invariant distribution. Often, the action of G is transitive on X (Y ),
in which case the capacity-achieving input (output) distribution can be taken to be uniform.
Below we systematize many popular notions of channel symmetry and explain relationship
between them.
Y = X ◦ Z,
• Note that it is an easy consequence of the definitions that any input-symmetric (resp. output-
symmetric) channel’s PY|X has all rows (resp. columns) – permutations of the first row (resp.
column). Hence,
i i
i i
i i
324
• Since Gallager symmetry implies all rows are permutations of the first one, while output
symmetry implies the same statement for columns we have
• Clearly, not every Dobrushin-symmetric channel is square. One may wonder, however, whether
every square Dobrushin channel is a group-noise channel. This is not so. Indeed, according
to [286] the latin squares that are Cayley tables are precisely the ones in which composition of
two rows (as permutations) gives another row. An example of the latin square which is not a
Cayley table is the following:
1 2 3 4 5
2 5 4 1 3
3 1 2 5 4 . (19.21)
4 3 5 2 1
5 4 1 3 2
1
Thus, by multiplying this matrix by 15 we obtain a counter-example:
In fact, this channel is not even input-symmetric. Indeed, suppose there is g ∈ G such that
g4 = 1 (on X ). Then, applying (19.16) with x = 4 we figure out that on Y the action of g must
be:
1 7→ 4, 2 7→ 3, 3 7→ 5, 4 7→ 2, 5 7→ 1 .
1 7→ 2, 2 7→ 5, 3 7→ 1, 4 7→ 3, 5 7→ 4 ,
which implies via (19.16) that PY|X (g1|x) is not a column of (19.21). Thus:
i i
i i
i i
• Clearly, not every input-symmetric channel is Dobrushin (e.g., BEC). One may even find a
counter-example in the class of square channels:
1 2 3 4
1 3 2 4 1
4 2 3 1 · 10 (19.22)
4 3 2 1
This shows:
input-symmetric, square 6=⇒ Dobrushin
• Channel (19.22) also demonstrates:
Gallager-symmetric, square 6=⇒ Dobrushin .
• Example (19.22) naturally raises the question of whether every input-symmetric channel is
Gallager symmetric. The answer is positive: by splitting Y into the orbits of G we see that a
subchannel X → {orbit} is input and output symmetric. Thus by (19.18) we have:
input-symmetric =⇒ Gallager-symmetric =⇒ weakly input-symmetric (19.23)
(The second implication is evident).
• However, not all weakly input-symmetric channels are Gallager-symmetric. Indeed, consider
the following channel
1/7 4/7 1/7 1/7
4/7 1/7 0 4/7
W= . (19.24)
0 0 4 /7 2 / 7
2/7 2/7 2/7 0
Since det W 6= 0, the capacity achieving input distribution is unique. Since H(Y|X = x) is
independent of x and PX = [1/4, 1/4, 3/8, 1/8] achieves uniform P∗Y it must be the unique
optimum. Clearly any permutation Tx fixes a uniform P∗Y and thus the channel is weakly input-
symmetric. At the same time it is not Gallager-symmetric since no row is a permutation of
another.
• For more on the properties of weakly input-symmetric channels see [238, Section 3.4.5].
i i
i i
i i
326
Gallager
1010
1111111
0000000
0000000
1111111
0000000
1111111 0
1 Dobrushin
0000000
1111111 101111
0000000000
1111111111
0000
0000000
1111111 101111
0000000000
1111111111
0000
0000
1111
0000000
1111111 101111
0000000000
1111111111
0000
000
111
0000
1111 000
111
0000
1111
0000000
1111111 0
1
0000000000
1111111111
0000
1111
000
111
000
111
0000
1111 000
111
0000
1111
0000000
1111111 0
1
0000000000
1111111111
0000
1111
000
111
000
111
0000
1111
000
111 000
111
0000
1111
0000000
1111111
0000
1111 0000
1111
0000
1111 000
111
000input−symmetric
111
output−symmetric group−noise
Definition 19.14. A channel is called information stable if there exists a sequence of input
distributions {PXn , n = 1, 2, . . .} such that
1 n n P (I)
i( X ; Y ) −
→C .
n
For example, we can pick PXn = (P∗X )n for stationary memoryless channels. Therefore
stationary memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Proof. Like the stationary, memoryless case, the upper bound comes from the general con-
verse Theorem 17.3, and the lower bound uses a similar strategy as Theorem 19.8, except utilizing
the definition of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 19.16. A memoryless channel is information stable if there exists {X∗k : k ≥ 1} such
that both of the following hold:
1X ∗ ∗
n
I(Xk ; Yk ) → C(I) (19.25)
n
k=1
X
∞
1
Var[i(X∗n ; Y∗n )] < ∞ . (19.26)
n2
n=1
i i
i i
i i
where convergence to 0 follows from Kronecker lemma (Lemma 19.17 to follow) applied with
bn = n2 , xn = Var[i(X∗n ; Y∗n )]/n2 .
The second part follows from the first. Indeed, notice that
1X
n
C(I) = lim inf sup I(Xk ; Yk ) .
n→∞ n PXk
k=1
(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k
X
n X
n
I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 1 ,
PXk
k=1 k=1
and hence normalizing by n we get (19.25). We next show that for any joint distribution PX,Y we
have
Var[i(X; Y)] ≤ 2 log2 (min(|A|, |B|)) . (19.28)
The argument is symmetric in X and Y, so assume for concreteness that |B| < ∞. Then
E[i2 (X; Y)]
Z X
≜ dPX (x) PY|X (y|x) log PY|X (y|x) + log PY (y) − 2 log PY|X (y|x) · log PY (y)
2 2
A y∈B
Z X h i
≤ dPX (x) PY|X (y|x) log2 PY|X (y|x) + log2 PY (y) (19.29)
A y∈B
Z X X
= dPX (x) PY|X (y|x) log2 PY|X (y|x) + PY (y) log2 PY (y)
A y∈B y∈B
Z
≤ dPX (x)g(|B|) + g(|B|) (19.30)
A
i i
i i
i i
328
=2g(|B|) ,
where (19.29) is because 2 log PY|X (y|x) · log PY (y) is always non-negative, and (19.30) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X
n
g(n) ≜ sup
Pn
aj log2 aj . (19.31)
aj ≥0: j=1 aj =1 j=1
Since the x log2 x has unbounded derivative at the origin, the solution of (19.31) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = 1n .
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,
Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.
Lemma 19.17 (Kronecker Lemma). Let a sequence 0 < bn % ∞ and a non-negative sequence
P∞
{xn } such that n=1 xn < ∞, then
1 X
n
bj xj → 0
bn
j=1
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from
above
X
n X
m X
n
bk xk ≤ bm xk + b k xk
k=1 k=1 k=m+1
1 X bm X X bm X X
n ∞ n ∞ ∞
bk
=⇒ b k xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1
1 X X
n ∞
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1
Since this holds for any m, we can make the last term arbitrarily small.
How to show information stability? One important class of channels with memory for which
information stability can be shown easily are Gaussian channels. The complete details will be
shown below (see Sections 20.5* and 20.6*), but here we demonstrate a crucial fact.
For jointly Gaussian (X, Y) we always have bounded variance:
cov[X, Y]
Var[i(X; Y)] = ρ2 (X, Y) log2 e ≤ log2 e , ρ(X, Y) = p . (19.32)
Var[X] Var[Y]
i i
i i
i i
Indeed, first notice that we can always represent Y = X̃ + Z with X̃ = aX ⊥⊥ Z. On the other hand,
we have
log e x̃2 + 2x̃z σ2 2
i(x̃; y) = − z , z ≜ y − x̃ .
2 σY2 σY2 σZ2
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
σX̃2
log e X̃ − σZ2
2
E[i(X̃; Y)|X̃] = ,
2 σY2
and hence
2 log2 e 4
Var[E[i(X̃; Y)|X̃]] = σ .
4σY4 X̃
On the other hand,
2 log2 e 2 2
Var[i(X̃; Y)|X̃] = [4σX̃ σZ + 2σX̃4 ] .
4σY4
Putting it all together we get (19.32). Inequality (19.32) justifies information stability of all sorts
of Gaussian channels (memoryless and with memory), as we will see shortly.
1X
k
1
Pb ≜ P[Sj 6= Ŝj ] = E[dH (Sk , Ŝk )] , (19.33)
k k
j=1
i i
i i
i i
330
1X X
k k
1{Si 6= Ŝi } ≤ 1{Sk 6= Ŝk } ≤ 1{Si 6= Ŝi },
k
i=1 i=1
where the first inequality is obvious and the second follow from the union bound. Taking
expectation of the above expression gives the theorem.
Next, the following pair of results is often useful for lower bounding Pb for some specific codes.
Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then
X
k X
k
1{Si =
6 Ŝi } ≥ 1{Sk = Ŝk + ei }
i=1 i=1
1X
k
Pb ≥ P[Sk = Ŝk + ei ]
k
i=1
Theorem 19.20. If A, B ∈ {0, 1}k (with arbitrary marginals!) then for every r ≥ 1 we have
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (19.34)
k r−1
Pr,min ≜ min{P[B = c′ |A = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = r} (19.35)
Next, notice
dH (x, y) ≥ r1{dH (x, y) = r}
and take the expectation with x ∼ A, y ∼ B.
In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the minimax risk
of an estimator. See Section 31.2 for more.
The following is a converse bound for channel coding under BER constraint.
i i
i i
i i
Theorem 19.21 (Converse under BER). Any M-code with M = 2k and bit-error rate Pb satisfies
supPX I(X; Y)
log M ≤ .
log 2 − h(Pb )
i.i.d.
Proof. Note that Sk → X → Y → Ŝk , where Sk ∼ Ber( 12 ). Recall from Theorem 6.1 that for iid
P
Sn , I(Si ; Ŝi ) ≤ I(Sk ; Ŝk ). This gives us
X
k
sup I(X; Y) ≥ I(X; Y) ≥ I(Si ; Ŝi )
PX
i=1
1
1X
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
1
= kd 1 − Pb
= k(log 2 − h(Pb ))
2
where the second line used Fano’s inequality (Theorem 6.3) for binary random variables (or data
processing inequality for divergence), and the third line used the convexity of divergence.2
Pairing this bound with Proposition 19.10 shows that any sequence of codes with Pb → 0 (for
a memoryless channel) must have rate R < C. In other words, relaxing the constraint from Pe to
Pb does not yield any higher rates.
Later in Section 26.3 we will see that channel coding under BER constraint is a special case
of a more general paradigm known as lossy joint source channel coding so that Theorem 19.21
follows from Theorem 26.5. Furthermore, this converse bound is in fact achievable asymptotically
for stationary memoryless channels.
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)
2
Note that this last chain of inequalities is similar to the proof of Proposition 6.7.
i i
i i
i i
332
probability of error P[Sk 6= Ŝk ] is small. The fundamental limit (optimal probability of error) is
defined as
In channel coding we are interested in transmitting M messages and all messages are born equal.
Here we want to convey the source realizations which might not be equiprobable (has redundancy).
Indeed, if Sk is uniformly distributed on, say, {0, 1}k , then we are back to the channel coding setup
with M = 2k under average probability of error, and ϵ∗JSCC (k, n) coincides with ϵ∗ (n, 2k ) defined
in Section 22.1.
Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix
of two problems we’ve seen: compressing a source and coding over a channel. The following
theorem shows that compressing and channel coding separately is optimal. This is a relief, since
it implies that we do not need to develop any new theory or architectures to solve the Joint Source
Channel Coding problem. As far as the leading term in the asymptotics is concerned, the following
two-stage scheme is optimal: First use the optimal compressor to eliminate all the redundancy in
the source, then use the optimal channel code to add redundancy to combat the noise in the data
transmission.
Theorem 19.22. Let the source {Sk } be stationary memoryless on a finite alphabet with entropy
H. Let the channel be stationary memoryless with finite capacity C. Then
(
∗ → 0 R < C/H
ϵJSCC (nR, n) n → ∞.
6→ 0 R > C/H
The interpretation of this result is as follows: Each source symbol has information content
(entropy) H bits. Each channel use can convey C bits. Therefore to reliably transmit k symbols
over n channel uses, we need kH ≤ nC.
Proof. (Achievability.) The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 P Yn | X n g2 g1
Sk −→ W −→ Xn −→ Yn −→ Ŵ −→ Ŝk
Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
from Corollary 11.3 and Theorem 19.9 we get:
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝk 6= Sk (W)] ≤ ϵ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ϵ ∀m, ∀k ≥ k0
n
i i
i i
i i
And therefore if R(H + δ) < C − δ , then ϵ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
(Converse.) To prove the converse notice that any JSCC encoder/decoder induces a Markov
chain
Sk → Xn → Yn → Ŝk .
On the other hand, since P[Sk 6= Ŝk ] ≤ ϵn , Fano’s inequality (Theorem 6.3) yields
We remark that instead of using Fano’s inequality we could have lower bounded I(Sk ; Ŝk ) as in
the proof of Theorem 17.3 by defining QSk Ŝk = USk PŜk (with USk = Unif({0, 1}k ) and applying the
data processing inequality to the map (Sk , Ŝk ) 7→ 1{Sk = Ŝk }:
D(PSk Ŝk kQSk Ŝk ) = D(PSk kUSk ) + D(PŜ|Sk kPŜ |PSk ) ≥ d(1 − ϵn k|A|−k )
Rearranging terms yields (19.36). As we discussed in Remark 17.2, replacing D with other f-
divergences can be very fruitful.
In a very similar manner, by invoking Corollary 12.2 and Theorem 19.15 we obtain:
Theorem 19.23. Let source {Sk } be ergodic on a finite alphabet, and have entropy rate H. Let
the channel have capacity C and be information stable. Then
(
∗ = 0 R > H/C
lim ϵJSCC (nR, n)
n→∞ > 0 R < H/C
i i
i i
i i
334
i i
i i
i i
In this chapter we study data transmission with constraints on the channel input. Namely, in pre-
vious chapter the encoder for blocklength n code was permitted to produce arbitrary sequences of
inputs, i.e. elements of An . However, in many practical problem only a subset of An is allowed
to be used. As a motivation, consider the setting of the AWGN channel Example 3.3. Without
restricting the input, i.e. allowing arbitrary elements of Rn as input, the channel capacity is infi-
nite: supPX I(X; X + Z) = ∞ (for example, take X ∼ N (0, P) and P → ∞). Indeed, one can
transmit arbitrarily many messages with arbitrarily small error probability by choosing elements
of Rn with giant pairwise distance. In reality, however, one is limited by the available power. In
other words, only the elements xn ∈ Rn are allowed satisfying
1X 2
n
xt ≤ P ,
n
t=1
where P > 0 is known as the power constraint. How many bits per channel use can we transmit
under this constraint on the codewords? To answer this question in general, we need to extend
the setup and coding theorems to channels with input constraints. After doing that we will apply
these results to compute capacities of various Gaussian channels (memoryless, with inter-symbol
interference and subject to fading).
An
b b
b
b Fn b
b b
b b b
b b
b
We will say that an (n, M, ϵ)-code satisfies the input constraint Fn ⊂ An if the encoder maps
[M] into Fn , i.e. f : [M] → Fn . What subsets Fn are of interest?
In the context of Gaussian channels, we have A = R. Then one often talks about the following
constraints:
335
i i
i i
i i
336
1X 2 √
n
| xi | ≤ P ⇔ kxn k2 ≤ nP.
n
i=1
√
In other words, codewords must lie in a ball of radius nP.
• Peak power constraint :
Notice that the second type of constraint does not introduce any new problems: we can simply
restrict the input space from A = R to A = [−A, A] and be back into the setting of input-
unconstrained coding. The first type of the constraint is known as a separable cost-constraint.
We will restrict our attention from now on to it exclusively.
1 A, B : input/output spaces
2 PYn |Xn : An → B n , n = 1, 2, . . .
3 Cost function c : A → R ∪ {±∞}.
1X
n
c(xn ) ≜ c(xk )
n
k=1
• Information capacity
1
C(I) (P) = lim inf sup I(Xn ; Yn )
n→∞ n PXn :E[Pnk=1 c(Xk )]≤nP
i i
i i
i i
• Information stability: Channel is information stable if for all (admissible) P, there exists a
sequence of channel input distributions PXn such that the following two properties hold:
1 P
iP n n (Xn ; Yn )−
→C(I) (P) (20.1)
n X ,Y
P[c(Xn ) > P + δ] → 0 ∀δ > 0 . (20.2)
These definitions clearly parallel those of Definitions 19.3 and 19.6 for channels without input
constraints. A notable and crucial exception is the definition of the information capacity C(I) (P).
Indeed, under input constraints instead of maximizing I(Xn ; Yn ) over distributions supported on
Fn we extend maximization to a richer set of distributions, namely, those satisfying
X
n
E[ c(Xk )] ≤ nP .
k=1
Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with 1 codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence ϕ(λ̄P0 + λP1 ) ≥
λ̄ϕ(P0 ) + λϕ(P1 ). The second claim follows from concavity of ϕ(·).
To extend these results to C(I) (P) observe that for every n
1
P 7→ sup I(Xn ; Yn )
n PXn :E[c(Xn )]≤P
is concave. Then taking lim infn→∞ the same holds for C(I) (P).
An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
i i
i i
i i
338
Proof. C(I) (P) ≥ ϕ(P) is obvious by using PXn = (PX )⊗n . For “≤”, fix any PXn satisfying the
cost constraint. Consider the chain
( a) X (b) X X
n n ( c)
n
1
I(Xn ; Yn ) ≤ I(Xj ; Yj ) ≤ ϕ(E[c(Xj )]) ≤ nϕ E[c(Xj )] ≤ nϕ(P) ,
n
j=1 j=1 j=1
where (a) follows from Theorem 6.1; (b) from the definition of ϕ; and (c) from Jensen’s inequality
and concavity of ϕ.
Proof. The argument is the same as we used in Theorem 17.3. Take any (n, M, ϵ, P)-code, W →
Xn → Yn → Ŵ. Applying Fano’s inequality and the data-processing, we get
Normalizing both sides by n and taking lim infn→∞ we obtain the result.
Next we need to extend one of the coding theorems to the case of input constraints. We do so for
the Feinstein’s lemma (Theorem 18.7). Note that when F = X , it reduces to the original version.
Theorem 20.7 (Extended Feinstein’s lemma). Fix a Markov kernel PY|X and an arbitrary PX .
Then for any measurable subset F ⊂ X , everyγ > 0 and any integer M ≥ 1, there exists an
(M, ϵ)max -code such that
i i
i i
i i
Proof. Similar to the proof of the original Feinstein’s lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Next, we apply Corollary 18.4 and find out
that there is a set F0 ⊂ X with two properties: a) PX [F0 ] = 1 and b) for every x ∈ F0 we have
PY (Ex ) ≤ γ1 . We now let F′ = F ∩ F0 and notice that PX [F′ ] = PX [F].
We sequentially pick codewords {c1 , . . . , cM } from the set F′ (!) and define the decoding regions
{D1 , . . . , DM } as Dj ≜ Ecj \ ∪jk− 1
=1 Dk . The stopping criterion is that M is maximal, i.e.,
∀x0 ∈ F′ , PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − ϵ
⇔ ∀x0 ∈ X , PY [Ex0 \ ∪M ′
j=1 Dj X = x0 ] < (1 − ϵ)1[x0 ∈ F ] + 1[x0 ∈ F ]
′c
From here, we complete the proof by following the same steps as in the proof of original Feinstein’s
lemma (Theorem 18.7).
Given the coding theorem we can establish a lower bound on capacity
Theorem 20.8 (Capacity lower bound). For any information stable channel with input constraints
and P > P0 we have
C(P) ≥ C(I) (P). (20.3)
Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). Thus, we assume PYn |Xn = (PY|X )⊗n .
Fix n ≥ 1. Choose a PX such that E[c(X)] < P, Pick log M = n(I(X; Y) − 2δ) and log γ =
n(I(X; Y) − δ).
P
With the input constraint set Fn = {xn : 1n c(xk ) ≤ P}, and iid input distribution PXn = P⊗ n
X ,
we apply the extended Feinstein’s lemma. This shows existence of an (n, M, ϵn , P)max -code with
the encoder satisfying input constraint Fn and vanishing (maximal) error probability
ϵn PXn [Fn ] ≤ P[i(Xn ; Yn ) ≤ n(I(X; Y) − δ)] + exp(−nδ)
| {z } | {z } | {z }
→1 →0 as n→∞ by WLLN and stationary memoryless assumption →0
Indeed, the first term is vanishing by the weak law of large numbers: since E[c(X)] < P, we have
P
PXn (Fn ) = P[ 1n c(xk ) ≤ P] → 1. Since ϵn → 0 this implies that for every ϵ > 0 we have
1
Cϵ (P) ≥ log M = I(X; Y) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ Cϵ (P) ≥ sup lim (I(X; Y) − 2δ)
PX :E[c(X)]<P δ→0
where the last equality is from the continuity of C(I) on (P0 , ∞) by Proposition 20.4.
For a general information stable channel, we just need to use the definition to show that
P[i(Xn ; Yn ) ≤ n(C(I) − δ)] → 0, and the rest of the proof follows similarly.
i i
i i
i i
340
Theorem 20.9 (Channel capacity under cost constraint). For an information stable channel with
cost constraint and for any admissible constraint P we have
Proof. The boundary case of P = P0 is treated in Ex. IV.10, which shows that C(P0 ) = C(I) (P0 )
even though C(I) (P) may be discontinuous at P0 . So assume P > P0 next. Theorem 20.6 shows
(I)
Cϵ (P) ≤ C1−ϵ (P)
, thus C(P) ≤ C(I) (P). On the other hand, from Theorem 20.8 we have C(P) ≥
C(I) (P).
Z ∼ N (0, σ 2 )
X Y
+
Definition 20.10 (The stationary AWGN channel). The Additive White Gaussian Noise (AWGN)
channel is a stationary memoryless additive-noise channel with separable cost constraint: A =
B = R, c(x) = x2 , and a single-letter kernel PY|X given by Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥⊥ X.
The n-letter kernel is given by a product extension, i.e. Yn = Xn + Zn with Zn ∼ N (0, In ). When
the power constraint is E[c(X)] ≤ P we say that the signal-to-noise ratio (SNR) equals σP2 .
The terminology white noise refers to the fact that the noise variables are uncorrelated across
time. This makes the power spectral density of the process {Zj } constant in frequency (or “white”).
We often drop the word stationary when referring to this channel. The definition we gave above is
more correctly should be called the real AWGN, or R-AWGN, channel. The complex AWGN, or
C-channel is defined similarly: A = B = C, c(x) = |x|2 , and Yn = Xn + Zn , with Zn ∼ Nc (0, In )
being the circularly symmetric complex gaussian.
Theorem 20.11. For the stationary AWGN channel, the channel capacity is equal to information
capacity, and is given by:
1 P
( I)
C(P) = C (P) = log 1 + 2 for R-AWGN (20.4)
2 σ
P
C(P) = C(I) (P) = log 1 + 2 for C-AWGN
σ
i i
i i
i i
Then using Theorem 5.11 (the Gaussian saddle point) to conclude X ∼ N (0, P) (or Nc (0, P)) is
the unique capacity-achieving input distribution.
At this point it is also instructive to revisit Section 6.2* which shows that Gaussian capacity
can in fact be derived essentially without solving the maximization of mutual information: the
Euclidean rotational symmetry implies the optimal input should be Gaussian.
There is a great deal of deep knowledge embedded in the simple looking formula of Shan-
non (20.4). First, from the engineering point of view we immediately see that to transmit
information faster (per unit time) one needs to pay with radiating at higher power. Second, the
amount of energy spent per transmitted information bit is minimized by solving
P log 2
inf = 2σ 2 loge 2 (20.5)
P>0 C(P)
and is achieved by taking P → 0. (We will discuss the notion of energy-per-bit more in
Section 21.1.) Thus, we see that in order to maximize communication rate we need to send
powerful, high-power waveforms. But in order to minimize energy-per-bit we need to send in
very quiet “whisper” and at very low communication rate.1 In either case the waveforms of good
error-correcting codes should look like samples of the white gaussian process.
Third, from the mathematical point of view, formula (20.4) reveals certain properties of high-
dimensional Euclidean geometry
√ as follows. Since Zn ∼ N (0, σ 2 ), then with high probability,
kZ k2 concentrates around nσ 2 . Similarly, due the power constraint and the fact that Zn ⊥
n
⊥ Xn , we
n 2 n 2 n 2
have E kY k = E kY p k + E kZ k ≤ n(P + σ 2 ) and the received vector Yn lies in an ℓ√ 2 -ball
of radius approximately n(P + σ 2 ). Since the noise √ can at most perturb the codeword p by nσ 2
in Euclidean distance, if we can pack M balls of radius nσ 2 into the ℓ2 -ball of radius n(P + σ 2 )
centered at the origin, this yields a good codebook and decoding regions – see Fig. 20.1 for an
illustration. So how large can M be? Note that the volume of an ℓ2 -ball of radius r in Rn is given by
2 n/ 2 n/2
cn rn for some constant cn . Then cn (cnn((Pn+σ ))
= 1 + σP2 . Taking the log and dividing by n, we
σ 2 ) n/ 2
∗
get n log M ≈ 2 log 1 + σ2 . This tantalazingly convincing reasoning, however, is flawed in at
1 1 P
least two ways. (a) Computing the volume ratio only gives an upper bound on the maximal number
of disjoint balls (See Section 27.2 for an extensive discussion on this topic.) (b) Codewords need
not correspond to centers of disjoint ℓ2 -balls. √ Indeed, the fact that we allow some vanishing (but
non-zero) probability of error means that the nσ 2 -balls are slightly overlapping and Shannon’s
formula establishes the maximal number of such partially overlapping balls that we can pack so
that they are (mostly) inside a larger ball.
Theorem 20.11 applies to Gaussian noise. What if the noise is non-Gaussian and how sensi-
tive is the capacity formula 12 log(1 + SNR) to the Gaussian assumption? Recall the Gaussian
1
This explains why, for example, the deep space probes communicate with earth via very low-rate codes and very long
blocklengths.
i i
i i
i i
342
c3
c4
p n
√ c2
nσ 2
(P
c1
+
σ
2
)
c5
c8
···
c6
c7
cM
saddlepoint result we have studied in Chapter 5 where we showed that for the same variance,
Gaussian noise is the worst which shows that the capacity of any non-Gaussian noise is at least
1
2 log(1 + SNR). Conversely, it turns out the increase of the capacity can be controlled by how
non-Gaussian the noise is (in terms of KL divergence). The following result is due to Ihara [163].
Theorem 20.12 (Additive Non-Gaussian noise). Let Z be a real-valued random variable indepen-
dent of X and EZ2 < ∞. Let σ 2 = Var Z. Then
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (EZ, σ 2 )).
2 σ PX :EX2 ≤P 2 σ
Remark 20.1. The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z,
where N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian
density, say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to
AWGN, which still scales as 21 log SNR in the high-SNR regime. On the other hand, if Z is discrete,
then D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite
because the noise is “too weak”.
i i
i i
i i
Figure 20.2 Power allocation via water-filling. Here, the second branch is too noisy (σ2 too big) for the
amount of available power P and the optimal coding should discard (input zeros to) this branch altogether.
1X + T
L
C = log
2 σj2
j=1
X
L
P = |T − σj2 |+
j=1
Proof.
X
L
≤ P
sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[X2k ]≤Pk
X
L
1 Pk
= P
sup log(1 + )
Pk ≤P,Pk ≥0 k=1 2 σk2
with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last maximiza-
P
tion problem – power allocation: Denote the Lagragian multipliers for the constraint Pk ≤ P by
P1 P
λ and for the constraint Pk ≥ 0 by μk . We want to solve max 2 log(1 + σk2 )− μk Pk +λ(P −
Pk
Pk ).
First-order condition on Pk gives that
1 1
2
= λ − μk , μk Pk = 0
2 σk + Pk
X
L
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1
i i
i i
i i
344
On Fig. 20.2 we give a visual interpretation of the waterfilling solution. It has a number of
practically important conclusions. First, it gives a precise recipee for how much power to allocate
to different frequency bands. This solution, simple and elegant, was actually pivotal for bringing
high-speed internet to many homes (via cable modems): initially, before information theorists
had a say, power allocations were chosen on the basis of costly and imprecise simulations of real
codes. Simplicity of the waterfilling power allocation allowed to make power allocation dynamic
and enable instantaneous reaction to changing noise environments.
Second, there is a very important consequence for multiple-antenna (MIMO) communication.
Given nr receive antennas and nt transmit antennas, very often one gets as a result a parallel AWGN
with L = min(nr , nt ) branches. For a single-antenna system the capacity then scales as 12 log P with
increasing power (Theorem 20.11), while the capacity for a MIMO AWGN channel is approxi-
mately L2 log( PL ) ≈ L2 log P for large P. This results in a L-fold increase in capacity at high SNR.
This is the basis of a powerful technique of spatial multiplexing in MIMO, largely behind much
of advance in 4G, 5G cellular (3GPP) and post-802.11n WiFi systems.
Notice that spatial diversity (requiring both receive and transmit antennas) is different from a
so-called multipath diversity (which works even if antennas are added on just one side). Indeed,
if a single stream of data is sent through every parallel channel simultaneously, then sufficient
statistic would be to average the received vectors, resulting in a the effective noise level reduced
by L1 factor. The result is capacity increase from 12 log P to 12 log(LP) – a far cry from the L-fold
increase of spatial multiplexing. These exciting topics are explored in excellent books [312, 190].
Theorem 20.16. Assume that for every T the following limits exist:
1X1
n
T
C̃(I) (T) = lim log+ 2
n→∞ n 2 σj
j=1
1X
n
P̃(T) = lim |T − σj2 |+ .
n→∞ n
j=1
Then the capacity of the non-stationary AWGN channel is given by the parameterized form:
C(T) = C̃(I) (T) with input power constraint P̃(T).
Proof. Fix T > 0. Then it is clear from the waterfilling solution that
X
n
1 T
sup I(Xn ; Yn ) = log+ , (20.6)
2 σj2
j=1
i i
i i
i i
1X
n
E[c(Xn )] ≤ |T − σj2 |+ . (20.7)
n
j=1
Now, by assumption, the LHS of (20.7) converges to P̃(T). Thus, we have that for every δ > 0
Taking δ → 0 and invoking continuity of P 7→ C(I) (P), we get that the information capacity
satisfies
and thus
X
n
1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
Non-stationary AWGN is primarily interesting due to its relationship to the additive colored
gaussian noise channel in the following section.
Theorem 20.18. The capacity of the ACGN channel with fZ (ω) > 0 for almost every ω ∈ [−π , π ]
is given by the following parametric form:
Z 2π
1 1 T
C ( T) = log+ dω,
2π 0 2 fZ (ω)
Z 2π
1
P ( T) = |T − fZ (ω)|+ dω.
2π 0
i i
i i
i i
346
Figure 20.3 The ACGN channel: the “whitening” process used in the capacity proof and the waterfilling
solution.
en = UXn and Y
Since Cov(Zn ) is positive semi-definite, U is a unitary matrix. Define X en = UYn ,
the channel between Xen and Yen is thus
en = X
Y en + UZn ,
e
Cov(UZn ) = U · Cov(Zn ) · U∗ = Σ
1 X
n
lim |T − σj2 |+ = P(T).
n→∞ n
j=1
e
Finally since U is unitary, C = C.
The idea used in the proof as well as the waterfilling power allocation are illustrated on Fig. 20.3.
Note that most of the time the noise that impacts real-world systems is actually “born” white
(because it’s a thermal noise). However, between the place of its injection and the processing there
are usually multiple circuit elements. If we model them linearly then their action can equivalently
be described as the ACGN channel, since the effective noise added becomes colored. In fact, this
i i
i i
i i
20.7* Additive White Gaussian Noise channel with Intersymbol Interference 347
filtering can be inserted deliberately in order to convert the actual channel into an additive noise
one. This is the content of the next section.
Definition 20.19 (AWGN with ISI). An AWGN channel with ISI is a channel with memory that
is defined as follows: the alphabets are A = B = R, and the separable cost is c(x) = x2 . The
channel law PYn |Xn is given by
X
n
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1
i.i.d.
where Zk ∼ N (0, σ 2 ) is white Gaussian noise, {hk , k = −∞, . . . , ∞} are coefficients of a discrete-
time channel filter.
The coefficients {hk } describe the action of the environment. They are often learned by the
receiver during the “handshake” process of establishing a communication link.
Theorem 20.20. Suppose that the sequence {hk } is the inverse Fourier transform of a frequency
response H(ω):
Z 2π
1
hk = eiωk H(ω)dω .
2π 0
Assume also that H(ω) is a continuous function on [0, 2π ]. Then the capacity of the AWGN channel
with ISI is given by
Z 2π
1 1
C ( T) = log+ (T|H(ω)|2 )dω
2π 0 2
Z 2π +
1 1
P ( T) = T − dω
2π 0 |H(ω)| 2
Proof sketch. At the decoder apply the inverse filter with frequency response ω 7→ 1
H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
i i
i i
i i
348
The capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. The
number of atoms is Ω(A) and O(A2 ) as A → ∞. Moreover,
1 2A2 1
log 1 + ≤ C(A) ≤ log 1 + A2
2 eπ 2
Capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. Moreover,
the convergence speed of limA→∞ C(A, P) = 21 log(1 + P) is of the order e−O(A ) .
2
For details, see [290], [247, Section III] and [109, 253] for the O(A2 ) bound on the number of
atoms.
i i
i i
i i
There are two drastically different cases of fading channels, depending on the presence or
absence of the dashed link on Fig. 20.4. In the first case, known as the coherent case or the CSIR
case (for channel state information at the receiver), the receiver is assumed to have perfect esti-
mate of the channel state information Hi at every time i. In other words, the channel output is
effectively (Yi , Hi ). This situation occurs, for example, when there are pilot signals sent period-
ically and are used at the receiver to estimate the channel. in some cases, the i then references
different frequencies or sub-channels of an OFDM frame).
Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
1 P|H|2
C(P) = E log 1 +
2 σ2
and the capacity achieving input distribution is the usual PX = N (0, P). Note that the capacity
C(P) is in the order of log P and we call the channel “energy efficient”.
In the second case, known as non-coherent or no-CSIR, the receiver does not have any knowl-
edge of the Hi . In this case, there is no simple expression for the channel capacity. Most of the
i.i.d.
known results were shown for the case of Hi ∼ according to the Rayleigh distribution. In this
case, the capacity achieving input distribution is discrete [2], and the capacity scales as [304, 191]
C(P) = O(log log P), P→∞ (20.10)
This channel is said to be “energy inefficient” since increase in communication rate requires
dramatic expenditures in power.
Further generalization of the Gaussian channel models requires introducing multiple input and
output antennas. In this case, the single-letter input Xi ∈ Cnt and the output Yi ∈ Cnr are related
by
Yi = Hi Xi + Zi , (20.11)
i.i.d.
where Zi ∼ CN (0, σ 2 Inr ), nt and nr are the number of transmit and receive antennas, and Hi ∈
Cnt ×nr is a matrix-valued channel gain process. An incredible effort in the 1990s and 2000s was
i i
i i
i i
350
i i
i i
i i
In this chapter we will consider an interesting variation of the channel coding problem. Instead
of constraining the blocklength (i.e. the number of channel uses), we will constrain the total cost
incurred by the codewords. The motivation is the following. Consider a deep space probe which
has a k bit message that needs to be delivered to Earth (or a satellite orbiting it). The duration of
transmission is of little worry for the probe, but what is really limited is the amount of energy it has
stored in its battery. In this chapter we will learn how to study this question abstractly, how coding
over large number of bits k → ∞ reduces the energy spent (per bit), and how this fundamental
limit is related to communication over continuous-time channels.
Note that in this chapter we have denoted the noise level for Zi to be N20 . There is a long tradition for
such a notation. Indeed, most of the noise in communication systems is a white thermal noise at the
receiver. The power spectral density of that noise is flat and denoted by N0 (in Joules per second
per Hz). However, recall that received signal is complex-valued and, thus, each real component
has power N20 . Note also that thermodynamics suggests that N0 = kT, where k = 1.38 × 10−23 is
the Boltzmann constant, and T is the absolute temperature in Kelvins.
In previous chapter, we analyzed the maximum number of information messages (M∗ (n, ϵ, P))
that can be sent through this channel for a given n number of channel uses and under the power
constraint P. We have also hinted that in (20.5) that there is a fundamental minimal required cost
to send each (data) bit. Here we develop this question more rigorously. Everywhere in this chapter
for v ∈ R∞ or u ∈ C∞ we define
X
∞ X
∞
kvk22 = v2j , kuk22 = | uj | 2 .
j=1 j=1
Definition 21.1 ((E, 2k , ϵ)-code). Given a Markov kernel with input space R∞ or C∞ we define
an (E, 2k , ϵ)-code to be an encoder-decoder pair, f : [2k ] → R∞ and g : R∞ → [2k ] (or similar
351
i i
i i
i i
352
randomized versions), such that for all messages m ∈ [2k ] we have kf(m)k22 ≤ E and
P[g(Y∞ ) 6= W] ≤ ϵ ,
The operational meaning of E∗ (k, ϵ) should be apparent: it is the minimal amount of energy the
space probe needs to draw from the battery in order to send k bits of data.
Theorem 21.2 ((Eb /N0 )min = −1.6dB). For the AWGN channel we have
E∗ (k, ϵ) N0
lim lim sup = . (21.2)
ϵ→0 k→∞ k log2 e
Remark 21.1. This result, first obtained by Shannon [277], is colloquially referred to as: mini-
mal Eb /N0 (pronounced “eebee over enzero” or “ebno”) is −1.6 dB. The latter value is simply
10 log10 ( log1 e ) ≈ −1.592. Achieving this value of the ebno was an ultimate quest for coding
2
theory, first resolved by the turbo codes [30]. See [73] for a review of this long conquest.
Proof. We start with a lower bound (or the “converse” part). As usual, we have the working
probability space
W → X∞ → Y∞ → Ŵ .
log e X EX2i
∞
≤ linearization of log
2 N0 /2
i=1
E
≤ log e
N0
Thus, we have shown
E∗ (k, ϵ) N0 h(ϵ)
≥ (ϵ − )
k log e k
i i
i i
i i
E∗ (kn , ϵ) nP
lim sup ≤ lim sup ∗ (n, ϵ, P)
n→∞ kn n→∞ log M
P
=
lim infn→∞ 1n log M∗max (n, ϵ, P)
P
= 1 ,
2 log( 1 + N0P/2 )
where in the last step we applied Theorem 20.11. Now the above statement holds for every P > 0,
so let us optimize it to get the best bound:
E∗ (kn , ϵ) P
lim sup ≤ inf 1 P
n→∞ kn P≥0
2 log(1 + N0 / 2 )
P
= lim
P→0 1 log(1 + P
2 N0 / 2 )
N0
= (21.3)
log2 e
Note that the fact that minimal energy per bit is attained at P → 0 implies that in order to send
information reliably at the Shannon limit of −1.6dB, infinitely many time slots are needed. In
other words, the information rate (also known as spectral efficiency) should be vanishingly small.
Conversely, in order to have non-zero spectral efficiency, one necessarily has to step away from
the −1.6 dB. This tradeoff is known as spectral efficiency vs. energy-per-bit.
We next can give a simpler and more explicit construction of the code, not relying on the random
coding implicit in Theorem 20.11. Let M = 2k and consider the following code, known as the
pulse-position modulation (PPM):
√
PPM encoder: ∀m, f(m) = cm ≜ (0, 0, . . . , |{z} E ,...) (21.4)
m-th location
It is not hard to derive an upper bound on the probability of error that this code achieves [242,
Theorem 2]:
" ( r ! )#
2E
ϵ ≤ E min MQ + Z ,1 , Z ∼ N (0, 1) . (21.5)
N0
i i
i i
i i
354
Indeed, our orthogonal codebook under a maximum likelihood decoder has probability of error
equal to
Z ∞" r !#M−1 √
(z− E)2
1 2 − N
Pe = 1 − √ 1−Q z e 0 dz , (21.6)
πN0 −∞ N0
which is obtained by observing that conditioned on (W = j,q Zj ) the events {||cj + z||2 ≤ ||cj +
z − ci ||2 }, i 6= j are independent. A change of variables x = N20 z and application of the bound
1 − (1 − y)M−1 ≤ min{My, 1} weakens (21.6) to (21.5).
To see that (21.5) implies (21.3), fix c > 0 and condition on |Z| ≤ c in (21.5) to relax it to
r
2E
ϵ ≤ MQ( − c) + 2Q(c) .
N0
Recall the expansion for the Q-function [322, (3.53)]:
x2 log e 1
log Q(x) = − − log x − log 2π + o(1) , x→∞ (21.7)
2 2
Thus, choosing τ > 0 and setting E = (1 + τ )k logN0 e we obtain
2
r
2E
2k Q( − c) → 0
N0
as k → ∞. Thus choosing c > 0 sufficiently large we obtain that lim supk→∞ E∗ (k, ϵ) ≤ (1 +
τ ) logN0 e for every τ > 0. Taking τ → 0 implies (21.3).
2
Remark 21.2 (Simplex conjecture). The code (21.4) in fact achieves the first three terms in the
large-k expansion of E∗ (k, ϵ), cf. [242, Theorem 3]. In fact, the code can be further slightly opti-
√ √
mized by subtracting the common center of gravity (2−k E, . . . , 2−k E . . .) and rescaling each
codeword to satisfy the power constraint. The resulting constellation is known as the simplex code.
It is conjectured to be the actual optimal code achieving E∗ (k, ϵ) for a fixed k and ϵ, see [75, Section
3.16] and [294] for more.
i i
i i
i i
P[g(Y∞ ) 6= W] ≤ ϵ ,
Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined
in (20.1)). Assuming P0 = 0 and C(0) = 0 it is not hard to show (basically following the steps of
Theorem 21.2) that:
C(P) C(P) d
Cpuc = sup = lim = C(P) .
P P P→0 P dP P=0
The surprising discovery of Verdú [321] is that one can avoid computing C(P) and derive the Cpuc
directly. This is a significant help, as for many practical channels C(P) is unknown. Additionally,
this gives a yet another fundamental meaning to divergence.
Q
Theorem 21.4. For a stationary memoryless channel PY∞ |X∞ = PY|X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
D(PY|X=x kPY|X=x0 )
Cpuc = sup .
x̸=x0 c(x)
Proof. Let
D(PY|X=x kPY|X=x0 )
CV = sup .
x̸=x0 c(x)
i i
i i
i i
356
where we denoted for convenience d(x) ≜ D(PY|X=x kPY|X=x0 ). By the definition of CV we have
d(x) ≤ c(x)CV .
Thus, continuing (21.9) we obtain
" #
X
∞
(1 − ϵ) log M + h(ϵ) ≤ CV E c(Xt ) ≤ CV · E ,
t=1
where the last step is by the cost constraint (21.8). Thus, dividing by E and taking limits we get
Cpuc ≤ CV .
Achievability: We generalize the PPM code (21.4). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:
f ( 1 ) = ( x1 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.10)
| {z } | {z }
n-times n(M−1)-times
f ( 2 ) = ( x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.11)
| {z } | {z } | {z }
n-times n-times n(M−2)-times
··· (21.12)
f(M) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 ) (21.13)
| {z } | {z }
n(M−1)-times n-times
Now, by Stein’s lemma (Theorem 14.13) there exists a subset S ⊂ Y n with the property that
P[Yn ∈ S|Xn = (x1 , . . . , x1 )] ≥ 1 − ϵ1 (21.14)
P[Yn ∈ S|Xn = (x0 , . . . , x0 )] ≤ exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} . (21.15)
Therefore, we propose the following (suboptimal!) decoder:
Yn ∈ S =⇒ Ŵ = 1 (21.16)
n+1 ∈ S
Y2n =⇒ Ŵ = 2 (21.17)
··· (21.18)
From the union bound we find that the overall probability of error is bounded by
ϵ ≤ ϵ1 + M exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} .
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and
after straightforward manipulations, we conclude that
D(PY|X=x1 kPY|X=x0 )
Cpuc ≥ .
c(x1 )
i i
i i
i i
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥
CV , as required.
Yj = Hj Xj + Zj , Hj ∼ N c ( 0, 1) ⊥
⊥ Zj ∼ Nc (0, N0 )
(we use here a more convenient C-valued fading channel, the Hj ∼ Nc is known as the Rayleigh
fading). The cost function is the usual quadratic one c(x) = |x|2 . As we discussed previously,
cf. (20.10), the capacity-cost function C(P) is unknown in closed form, but is known to behave
drastically different from the case of non-fading AWGN (i.e. when Hj = 1). So here Theorem 21.4
comes handy. Let us perform a simple computation required, cf. (2.9):
D(Nc (0, |x|2 + N0 )kNc (0, N0 ))
Cpuc = sup (21.19)
x̸=0 | x| 2
| x| 2
1 log ( 1 + )
= sup log e − | x| 2
N0
(21.20)
N0 x̸=0
N0
log e
= (21.21)
N0
Comparing with Theorem 21.2 we discover that surprisingly, the capacity-per-unit-cost is unaf-
fected by the presence of fading. In other words, the random multiplicative noise which is so
detrimental at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C′ (0)
and thus computing Cpuc corresponds to computing C(P) at P → 0). There is one important differ-
ence: the supremization over x in (21.20) is solved at x = ∞. Following the proof of the converse
bound, we conclude that any code hoping to achieve optimal Cpuc must satisfy a strange constraint:
X X
|xt |2 1{|xt | ≥ A} ≈ | xt | 2 ∀A > 0
t t
i.e. the total energy expended by each codeword must be almost entirely concentrated in very
large spikes. Such a coding method is called “flash signalling”. Thus, we can see that unlike the
non-fading AWGN (for which due to rotational symmetry all codewords can be made relatively
non-spiky), the only hope of achieving full Cpuc in the presence of fading is by signalling in short
bursts of energy.
This effect manifests itself in the speed of convergence to Cpuc with increasing constellation
∗
sizes. Namely, the energy-per-bit E (kk,ϵ) behaves asymptotically as
r
E∗ (k, ϵ) const −1
= (−1.59 dB) + Q (ϵ) (AWGN) (21.22)
k k
i i
i i
i i
358
14
12
10
Achievability
8
Converse
dB
2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k
Figure 21.1 Comparing the energy-per-bit required to send a packet of k-bits for different channel models
∗
(curves represent upper and lower bounds on the unknown optimal value E (k,ϵ) k
). As a comparison: to get to
−1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading AWGN or fading AWGN with
Hj known perfectly at the receiver. For fading AWGN without knowledge of Hj (noCSI), one has to code over
at least 7 · 107 data bits to get to the same −1.5 dB. Plot generated via [291].
r
E∗ (k, ϵ) 3 log k −1 2
= (−1.59 dB) + (Q (ϵ)) (non-coherent fading) (21.23)
k k
That is we see that the speed of convergence to Shannon limit is much slower under fading.
Fig. 21.1 shows this effect numerically by plotting evaluation of (the upper and lower bounds
for) E∗ (k, ϵ) for the fading and non-fading channels. See [340] for details.
i i
i i
i i
E[Ws Wt ] = min(s, t) .
Let M∗ (T, ϵ, P) the maximum number of messages that can be sent through this channel such that
given an encoder f : [M] → L2 [0, T] for each m ∈ [M] the waveform x(t) ≜ f(m)
and the decoding error probability P[Ŵ 6= W] ≤ ϵ. This is a natural extension of the previously
defined log M∗ functions to continuous-time setting.
We prove the capacity result for this channel next.
Theorem 21.5. The maximal reliable rate of communication across the continuous-time AWGN
channel is NP0 log e (per unit of time). More formally, we have
1 P
lim lim inf log M∗ (T, ϵ, P) = log e (21.24)
ϵ→0 T→∞ T N0
Proof. Note that the space of all square-integrable functions on [0, T], denoted L2 [0, T] has count-
able basis (e.g. sinusoids). Thus, by expanding our input and output waveforms in that basis we
obtain an equivalent channel model:
N0
Ỹj = X̃j + Z̃j , Z̃j ∼ N (0, ),
2
and energy constraint (dependent upon duration T):
X
∞
X̃2j ≤ PT .
j=1
But then the problem is equivalent to the energy-per-bit for the (discrete-time) AWGN channel
(see Theorem 21.2) and hence
Thus,
1 P P
lim lim inf log2 M∗ (T, ϵ, P) = E∗ (k,ϵ)
= log2 e ,
ϵ→0 n→∞ T limϵ→0 lim supk→∞ N0
k
i i
i i
i i
360
In other words, the capacity of this channel is B log(1 + NP0 B ). To understand the idea of the proof,
we need to recall the concept of modulation first. Every signal X(t) that is required to live in
[fc − B/2, fc + B/2] frequency band can be obtained by starting with a complex-valued signal XB (t)
with frequency content in [−B/2, B/2] and mapping it to X(t) via the modulation:
√
X(t) = Re(XB (t) 2ejωc t ) ,
where ωc = 2πfc . Upon receiving the sum Y(t) = X(t) + N(t) of the signal and the white noise
N(t) we may demodulate Y by computing
√
YB (t) = 2LPF(Y(t)ejωc t ), ,
where the LPF is a low-pass filter removing all frequencies beyond [−B/2, B/2]. The important
fact is that converting from Y(t) to YB (t) does not lose information.
Overall we have the following input-output relation:
e ( t) ,
YB (t) = XB (t) + N
e ( t) N
E[ N e (s)∗ ] = N0 δ(t − s).
1
Here we already encounter a major issue: the waveform x(t) supported on a finite interval (0, T] cannot have spectrum
supported on a compact. The requiremes of finite duration and finite spectrum are only satisfied by the zero waveform.
Rigorously, one should relax the bandwidth constraint to requiring that the signal have a vanishing out-of-band energy as
T → ∞. As we said, rigorous treatment of this issue lead to the theory of prolate spheroidal functions [287].
i i
i i
i i
where sincB (x) = sin(xBx) and Xi = XB (i/B). After the Nyquist sampling on XB and YB we get the
following equivalent input-output relation:
Yi = Xi + Zi , Zi ∼ Nc (0, N0 ) (21.26)
R∞
where the noise Zi = t=−∞ N e (t)sincB (t − i )dt. Finally, given that XB (t) is only non-zero for
B
t ∈ (0, T] we see that the C-AWGN channel (21.26) is only allowed to be used for i = 1, . . . , TB.
This fact is known in communication theory as “bandwidth B and duration T signal has BT complex
degrees of freedom”.
Let us summarize what we obtained so far:
i i
i i
i i
362
i i
i i
i i
In previous chapters our main object of study was the fundamental limit of blocklenght-n coding:
Finally, the finite blocklength information theory strives to prove the sharpest possible computa-
tional bounds on log M∗ (n, ϵ) at finite n, which allows evaluating real-world codes’ performance
taking their latency n into account. These results are surveyed in this chapter.
Theorem 22.1. For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we have
Cϵ = C for 0 < ϵ < 1. Equivalently, for every 0 < ϵ < 1 we have
363
i i
i i
i i
364
Pe
1
10−1
10−2
10−3
10−4
10−5
SNR
In other words, below a certain critical SNR, the probability of error quickly approaches 1, so that
the receiver cannot decode anything meaningful. Above the critical SNR the probability of error
quickly approaches 0 (unless there is an effect known as the error floor, in which case probability
of error decreases reaches that floor value and stays there regardless of the further SNR increase).
Proof. We will improve the method used in the proof of Theorem 17.3. Take an (n, M, ϵ)-code
and consider the usual probability space
W → Xn → Yn → Ŵ ,
where W ∼ Unif([M]). Note that PXn is the empirical distribution induced by the encoder at the
channel input. Our goal is to replace this probability space with a different one where the true
channel PYn |Xn = P⊗ n
Y|X is replaced with a “dummy” channel:
i i
i i
i i
Therefore, the random variable 1{Ŵ=W} is likely to be 1 under P and likely to be 0 under Q. It
thus looks like a rather good choice for a binary hypothesis test statistic distinguishing the two
distributions, PW,Xn ,Yn ,Ŵ and QW,Xn ,Yn ,Ŵ . Since no hypothesis test can beat the optimal (Neyman-
Pearson) test, we get the upper bound
1
β1−ϵ (PW,Xn ,Yn ,Ŵ , QW,Xn ,Yn ,Ŵ ) ≤ (22.2)
M
(Recall the definition of β from (14.3).) The likelihood ratio is a sufficient statistic for this
hypothesis test, so let us compute it:
PW,Xn ,Yn ,Ŵ PW PXn |W PYn |Xn PŴ|Yn PW|Xn PXn ,Yn PŴ|Yn PXn ,Yn
= ⊗
= =
QW,Xn ,Yn ,Ŵ n
PW PXn |W (QY ) PŴ|Yn PW|Xn PXn (QY )⊗n PŴ|Yn PXn (QY )⊗n
From the Neyman Pearson test, the optimal HT takes the form
⊗n PXn Yn PXn Yn
βα (PXn Yn , PXn (QY ) ) = Q log ≥ γ where α = P log ≥γ
| {z } | {z } PXn (QY )⊗n PXn (QY )⊗n
P Q
i i
i i
i i
366
Putting this together with our main bound (22.3), we see that any (n, M, ϵ) code for the BSC
satisfies
1
log M ≤ nD(Ber(δ)kBer( )) + o(n) = nC + o(n) .
2
Clearly, this implies the strong converse for the BSC.
For the general channel, let us denote by P∗Y the capacity achieving output distribution. Recall
that by Corollary 5.5 it is unique and by (5.1) we have for every x ∈ A:
This property will be very useful. We next consider two cases separately:
1 If |B| < ∞ we take QY = P∗Y and note that from (19.31) we have
X
PY|X (y|x0 ) log2 PY|X (y|x0 ) ≤ log2 |B| ∀ x0 ∈ A
y
and since miny P∗Y (y) > 0 (without loss of generality), we conclude that for some constant
K > 0 and for all x0 ∈ A we have
PY|X (Y|X = x0 )
Var log | X = x0 ≤ K < ∞ .
QY ( Y )
Thus, if we let
X
n
PY|X (Yi |Xi )
Sn = log ,
P∗Y (Yi )
i=1
then we have
1X
n
P̂c (x) ≜ 1{cj = x} .
n
j=1
By simple counting it is clear that from any (n, M, ϵ) code, it is possible to select an (n, M′ , ϵ)
subcode, such that a) all codeword have the same composition P0 ; and b) M′ > (n+1M )|A|−1
. Note
′ ′
that, log M = log M + O(log n) and thus we may replace M with M and focus on the analysis of
the chosen subcode. Then we set QY = PY|X ◦ P0 . From now on we also assume that P0 (x) > 0
for all x ∈ A (otherwise just reduce A). Let i(x; y) denote the information density with respect
i i
i i
i i
to P0 PY|X . If X ∼ P0 then I(X; Y) = D(PY|X kQY |P0 ) ≤ log |A| < ∞ and we conclude that
PY|X=x QY for each x and thus
dPY|X=x
i(x; y) = log ( y) .
dQY
From (19.28) we have
So if we define
X
n
dPY|X=Xi (Yi |Xi ) X n
Sn = log ( Yi ) = i(Xi ; Yi ) ,
dQY
i=1 i=1
1−ϵ
γβ1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≥ ,
2
which then implies
√
log β1−ϵ (PXn Yn , PXn (QY )n ) ≥ −nC + O( n) .
We note several lessons from this proof. First, we basically followed the same method as in the
proof of the weak converse, except instead of invoking data-processing inequality for divergence,
we analyzed the hypothesis testing problem explicitly. Second, the bound on variance of the infor-
mation density is important. Thus, while the AWGN channel is excluded by the assumptions of
the Theorem, the strong converse for it does hold as well (see Ex. IV.9). Third, this method of
proof is also known as “sphere-packing”, for the reason that becomes clear if we do the example
of the BSC slightly differently (see Ex. IV.8).
i i
i i
i i
368
Thus by Theorem 19.9 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the ϵ-capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
single-letter information density is given by
( (
log am , w.p. amm 2nC + O(log n), w.p. amm
i(X; Y) = = .
log bm , w.p. 1 − amm O( 1n ), w.p. 1 − amm
Thus, for blocklength-n density we have
1X
n
1 n n d 1 1 am d
i( X ; Y ) = i(Xi ; Yi ) = O( ) + (2C + O( log n)) · Bin(n, )−→2C · Poisson(1/2) ,
n n n n m
i=1
i i
i i
i i
In particular,
Cϵ ≥ 2C ∀ϵ > e−1/2
22.3 Meta-converse
We have seen various ways in which one can derive upper (impossibility or converse) bounds on
the fundamental limits such as log M∗ (n, ϵ). In Theorem 17.3 we used data-processing and Fano’s
inequalities. In the proof of Theorem 22.1 we reduced the problem to that of hypothesis testing.
There are many other converse bounds that were developed over the years. It turns out that there
is a very general approach that encompasses all of them. For its versatility it is sometimes referred
to as the “meta-converse”.
To describe it, let us fix a Markov kernel PY|X (usually, it will be the n-letter channel PYn |Xn ,
but in the spirit of “one-shot” approach, we avoid introducing blocklength). We are also given a
certain (M, ϵ) code and the goal is to show that there is an upper bound on M in terms of PY|X and
ϵ. The essence of the meta-converse is described by the following diagram:
PY |X
W Xn Yn Ŵ
QY |X
Here the W → X and Y → Ŵ represent encoder and decoder of our fixed (M, ϵ) code. The upper
arrow X → Y corresponds to the actual channel, whose fundamental limits we are analyzing. The
lower arrow is an auxiliary channel that we are free to select.
The PY|X or QY|X together with PX (distribution induced by the code) define two distribu-
tions: PX,Y and QX,Y . Consider a map (X, Y) 7→ Z ≜ 1{W 6= Ŵ} defined by the encoder and
decoder pair (if decoders are randomized or W → X is not injective, we consider a Markov kernel
PZ|X,Y (1|x, y) = P[Z = 1|X = x, Y = y] instead). We have
PX,Y [Z = 0] = 1 − ϵ, QX , Y [ Z = 0] = 1 − ϵ ′ ,
where ϵ and ϵ′ are the average probilities of error of the given code under the PY|X and QY|X
respectively. This implies the following relation for the binary HT problem of testing PX,Y vs
QX,Y :
i i
i i
i i
370
The high-level idea of the meta-converse is to select a convenient QY|X , bound 1 − ϵ′ from above
(i.e. prove a converse result for the QY|X ), and then use the Neyman-Pearson β -function to lift the
Q-channel converse to P-channel.
How one chooses QY|X is a matter of art. For example, in the proof of Case 2 of Theorem 22.1
we used the trick of reducing to the constant-composition subcode. This can instead be done by
taking QYn |Xn =c = (PY|X ◦ P̂c )⊗n . Since there are at most (n + 1)|A|−1 different output distributions,
we can see that
(n + 1)∥A|−1
1 − ϵ′ ≤ ,
M
and bounding of β can be done similar to Case 2 proof of Theorem 22.1. For channels with
|A| = ∞ the technique of reducing to constant-composition codes is not available, but the meta-
converse can still be applied. Examples include proof of parallel AWGN channel’s dispersion [238,
Theorem 78] and the study of the properties of good codes [246, Theorem 21].
However, the most common way of using meta-converse is to apply it with the trivial channel
QY|X = QY . We have already seen this idea in Section 22.1. Indeed, with this choice the proof
of the converse for the Q-channel is trivial, because we always have: 1 − ϵ′ = M1 . Therefore, we
conclude that any (M, ϵ) code must satisfy
1
≥ β1−ϵ (PX,Y , PX QY ) . (22.12)
M
Or, after optimization we obtain
1
≥ inf sup β1−ϵ (PX,Y , PX QY ) .
M∗ (ϵ) PX QY
This is a special case of the meta-coverse known as the minimax meta-converse. It has a number of
interesting properties. First, the minimax problem in question possesses a saddle-point and is of
convex-concave type [248]. It, thus, can be seen as a stronger version of the capacity saddle-point
result for divergence in Theorem 5.4.
Second, the bound given by the minimax meta-converse coincides with the bound we obtained
before via linear programming relaxation (18.22), as discovered by [212]. To see this connection,
instead of writing the meta-converse as an upper bound M (for a given ϵ) let us think of it as an
upper bound on 1 − ϵ (for a given M).
We have seen that existence of an (M, ϵ)-code for PY|X implies existence of the (stochastic) map
(X, Y) 7→ Z ∈ {0, 1}, denoted by PZ|X,Y , with the following property:
1
PX,Y [Z = 0] ≥ 1 − ϵ, and PX QY [Z = 0] ≤ ∀ QY .
M
That is PZ|X,Y is a test of a simple null hypothesis (X, Y) ∼ PX,Y against a composite alternative
(X, Y) ∼ PX QY for an arbitrary QY . In other words every (M, ϵ) code must satisfy
1 − ϵ ≤ α̃(M; PX ) ,
i i
i i
i i
Let us now replace PX with a π x ≜ MPX (x), x ∈ X . It is clear that π ∈ [0, 1]X . Let us also
replace the optimization variable with rx,y ≜ MPZ|X,Y (0|x, y)PX (x). With these notational changes
we obtain
1 X X
α̃(M; PX ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} .
M x, y x
It is now obvious that α̃(M; PX ) = SLP (π ) defined in (18.21). Optimizing over the choice of PX
P
(or equivalently π with x π x ≤ M) we obtain
1 1 X S∗ ( M )
1−ϵ≤ SLP (π ) ≤ sup{SLP (π ) : π x ≤ M} = LP .
M M x
M
Now recall that in (18.23) we showed that a greedy procedure (essentially, the same as the one we
used in the Feinstein’s bound Theorem 18.7) produces a code with probability of success
1 S∗ (M)
1 − ϵ ≥ (1 − ) LP .
e M
This indicates that in the regime of a fixed ϵ the bound based on minimax metaconverse should
be very sharp. This, of course, provided we can select the best QY in applying it. Fortunately, for
symmetric channels optimal QY can be guessed fairly easily, cf. [248] for more.
i i
i i
i i
372
What is the best value of Ẽ(R) for each R? This is perhaps the most famous open question in all of
channel coding. More formally, let us define what is known as reliability function of a channel as
(
limn→∞ − 1n log ϵ∗ (n, exp{nR}) R<C
E( R) =
∗
limn→∞ − n log(1 − ϵ (n, exp{nR})) R > C .
1
We leave E(R) as undefined if the limit does not exist. Unfortunately, there is no general argument
showing that this limit exist. The only way to show its existence is to prove an achievability bound
1
lim inf − log ϵ∗ (n, exp{nR}) ≥ Elower (R) ,
n→∞ n
a converse bound
1
lim sup − log ϵ∗ (n, exp{nR}) ≤ Eupper (R) ,
n→∞ n
and conclude that the limit exist whenever Elower = Eupper . It is common to abuse notation and
write such pair of bounds as
even though, as we said, the E(R) is not known to exist unless the two bounds match unless the
two bounds match.
From now on we restrict our discussion to the case of a DMC. An important object to
define is the Gallager’s E0 function, which is nothing else than the right-hand side of Gallager’s
bound (18.15). For the DMC it has the following expression:
!1+ρ
X X 1
E0 (ρ, PX ) = − log PX (x)PY|X (y|x)
1+ρ
y∈B x∈A
This expression is defined in terms of the single-letter channel PY|X . It is not hard to see that E0
function for the n-letter extension evaluated with P⊗ n
X just equals nE0 (ρ, PX ), i.e. it tensorizes
1
similar to mutual information. From this observation we can apply Gallager’s random coding
bound (Theorem 18.9) with P⊗ n
X to obtain
Optimizing the choice of PX we obtain our first estimate on the reliability function
1
There is one more very pleasant analogy with mutual information: the optimization problems in the definition of E0 (ρ)
also tensorize. That is, the optimal distribution for the n-letter channel is just P⊗n
X , where PX is optimal for a single-letter
one.
i i
i i
i i
An analysis, e.g. [133, Section 5.6], shows that the function Er (R) is a convex, decreasing and
strictly positive on 0 ≤ R < C. Therefore, Gallager’s bound provides a non-trivial estimate of
the reliability function for the entire range of rates below capacity. At rates R → C the optimal
choice of ρ → 0. As R departs further away from the capacity the optimal ρ reaches 1 at a certain
rate R = Rcr known as the critical rate, so that for R < Rcr we have Er (R) = E0 (1) − R behaving
linearly.
Going to the upper bounds, taking QY to be the iid product distribution in (22.12) and optimizing
yields the bound [278] known as the sphere-packing bound:
E(R) ≤ Esp (R) ≜ sup E0 (ρ) − ρR .
ρ≥0
Comparing the definitions of Esp and Er we can see that for Rcr < R < C we have
Esp (R) = E(R) = Er (R)
thus establishing reliability function value for high rates. However, for R < Rcr we have Esp (R) >
Er (R), so that E(R) remains unknown.
Both upper and lower bounds have classical improvements. The random coding bound can be
improved via technique known as expurgation showing
E(R) ≥ Eex (R) ,
and Eex (R) > Er (R) for rates R < Rx where Rx ≤ Rcr is the second critical rate; see Exc. IV.18.
The sphere packing bound can also be improved at low rates by analyzing a combinatorial packing
problem by showing that any code must have a pair of codewords which are close (in terms of
Hellinger distance between the induced output distributions) and concluding that confusing these
two leads to a lower bound on probability of error (via (16.3)). This class of bounds is known
as “minimum distance” based bounds. The straight-line bound [133, Theorem 5.8.2] allows to
interpolate between any minimum distance bound and the Esp (R). Unfortunately, these (classical)
improvements tightly bound E(R) at only one additional rate point R = 0. This state of affairs
remains unchanged (for a general DMC) since 1967. As far as we know, the common belief is that
Eex (R) is in fact the true value of E(R) for all rates.
We demonstrate these bounds (and some others, but not the straight-line bound) on the reli-
ability function on Fig. 22.1 for the case of the BSCδ . For this channel, there is an interesting
interpretation of the expurgated bound. To explain it, let us recall the different ensembles of ran-
dom codes that we discussed in Section 18.6. In particular, we had the Shannon ensemble (as used
in Theorem 18.5) and the random linear code (either Elias or Gallager ensembles, we do not need
to make a distinction here).
For either ensemble, it it is known [134] that Er (R) is not just an estimate, but in fact the exact
value of the exponent of the average probability of error (averaged over a code in the ensemble).
For either ensemble, however, for low rates the average is dominated by few bad codes, whereas
a typical (high probability) realization of the code has a much better error exponent. For Shannon
ensemble this happens at R < 12 Rx and for the linear ensemble it happens at R < Rx . Furthermore,
the typical linear code in fact has error exponent exactly equal to the expurgated exponent Eex (R),
see [21].
i i
i i
i i
374
1.4
1.2
Err.Exp. (log2)
0.8 Rx = 0.24
0.4
C = 0.86
0.2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Rate
Figure 22.1 Comparison of bounds on the error exponent of the BSC. The MMRW stands for the upper
bound on the minimum distance of a code [214] and Gilbert-Varshamov is a lower bound of Theorem 27.5.
There is a famous conjecture in combinatorics stating that the best possible minimum pairwise
Hamming distance of a code with rate R is given by the Gilbert-Varshamov bound (Theorem 27.5).
If true, this would imply that E(R) = Eex (R) for R < Rx , see e.g. [201].
The most outstanding development in the error exponents since 1967 was a sequence of papers
starting from [201], which proposed a new technique for bounding E(R) from above. Litsyn’s
idea was to first prove a geometric result (that any code of a given rate has a large number of
pairs of codewords at a given distance) and then use de Caen’s inequality to convert it into a lower
bound on the probability of error. The resulting bound was very cumbersome. Thus, it was rather
surprising when Barg and MacGregor [22] were able to show that the new upper bound on E(R)
matched Er (R) for Rcr − ϵ < R < Rcr for some small ϵ > 0. This, for the first time since [278]
extended the range of knowledge of the reliability function. Their amazing result (together with
Gilbert-Varshamov conjecture) reinforced the belief that the typical linear codes achieve optimal
error exponent in the whole range 0 ≤ R ≤ C.
Regarding E(R) for R > C the situation is much simpler. We have
The lower (achievability) bound here is due to [106] (see also [228]), while the harder (converse)
part is by Arimoto [16]. It was later discovered that Arimoto’s converse bound can be derived by a
simple modification of the weak converse (Theorem 17.3): instead of applying data-processing to
i i
i i
i i
1
the KL divergence, one uses Rényi divergence of order α = 1+ρ ; see [243] for details. This sug-
gests a general conjecture that replacing Shannon information measures with Rényi ones upgrades
the (weak) converse proofs to a strong converse.
1 DMC
2 DMC with cost constraint
3 AWGN
4 Parallel AWGN
Let (X∗ , Y∗ ) be the input-output of the channel under the capacity achieving input distribution, and
i(x; y) be the corresponding (single-letter) information density. The following expansion holds for
a fixed 0 < ϵ < 1/2 and n → ∞
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) (22.14)
i i
i i
i i
376
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X∗ ; Y∗ ) = E[i(X∗ ; Y∗ )], and the channel dispersion2 is V = Var[i(X∗ ; Y∗ )|X∗ ].
Proof. The full proofs of these results are somewhat technical, even for the DMC.3 Here we only
sketch the details.
First, in the absence of cost constraints the achievability (lower bound on log M∗ ) part has
already been done by us in Theorem 19.11, where we have shown that log M∗ (n, ϵ) ≥ nC −
√ √
nVQ−1 (ϵ) + o( n) by refining the proof of the noisy channel coding theorem and using the
CLT. Replacing the CLT with its non-asymptotic version (Berry-Esseen inequality [123, Theorem
√
2, Chapter XVI.5]) improves o( n) to O(log n). In the presence of cost constraints, one is inclined
to attempt to use an appropriate version of the achievability bound such as Theorem 20.7. However,
for the AWGN this would require using input distribution that is uniform on the sphere. Since this
distribution is non-product, the information density ceases to be a sum of iid, and CLT is harder
to justify. Instead, there is a different achievability bound known as the κ-β bound [239, Theorem
25] that has become the workhorse of achievability proofs for cost-constrained channels with
continuous input spaces.
The upper (converse) bound requires various special methods depending on the channel. How-
ever, the high-level idea is to always apply the meta-converse bound from (22.12) with an
approriate choice of QY . Most often, QY is taken as the n-th power of the capacity achieving output
distribution for the channel. We illustrate the details for the special case of the BSC. In (22.4) we
have shown that
1
log M∗ (n, ϵ) ≤ − log βα (Ber(δ)⊗n , Ber( )⊗n ) . (22.15)
2
On the other hand, Exc. III.5 shows that
1 1 √ √
− log β1−ϵ (Ber(δ)⊗n , Ber( )⊗n ) = nd(δk ) + nvQ−1 (ϵ) + o( n) ,
2 2
where v is just the variance of the (single-letter) log-likelihood ratio:
" #
δ 1−δ δ δ
v = VarZ∼Ber(δ) Z log 1 + (1 − Z) log 1 = Var[Z log ] = δ(1 − δ) log2 .
2 2
1 − δ 1 − δ
Upon inspection we notice that v = V – the channel dispersion of the BSC, which completes the
proof of the upper bound:
√ √
log M∗ (n, ϵ) ≤ nC − nVQ−1 (ϵ) + o( n)
√
Improving the o( n) to O(log n) is done by applying the Berry-Esseen inequality in place of the
CLT, similar to the upper bound. Many more details on these proofs are contained in [238].
2
There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the one that
minimizes Var[i(X∗ ; Y∗ )|X∗ ]. See [239] for more details.
3
Recently, subtle gaps in [295] and [239] in the treatment of DMCs with non-unique capacity-achieving input distributions
were found and corrected in [57].
i i
i i
i i
Remark 22.1 (Zero dispersion). We notice that V = 0 is entirely possible. For example, consider
an additive-noise channel Y = X + Z over some abelian group G with Z being uniform on some
subset of G, e.g. channel in Exc. IV.13. Among the zero-dispersion channels there is a class of
exotic channels [239], which for ϵ > 1/2 have asymptotic expansions of the form [238, Theorem
51]:
log M∗ (n, ϵ) = nC + Θϵ (n 3 ) .
1
Existence of this special case is why we restricted the theorem above to ϵ < 12 .
Remark 22.2. The expansion (22.14) only applies to certain channels (as described in the theorem).
If, for example, Var[i(X∗ ; Y∗ )] = ∞, then the theorem need not hold and there might be other stable
(non-Gaussian) distributions that the n-letter information density will converge to. Also notice that
in the absence of cost constraints we have
since, by capacity saddle-point (Corollary 5.7), E[i(X∗ ; Y∗ )|X∗ = x] = C for PX∗ -almost all x.
As an example, we have the following dispersion formulas for the common channels that we
discussed so far:
Y i = Hi ( X i + Z i ) ,
Pn
where we have N (0, 1) ∼ Zi ⊥ ⊥ Hi ∼ Ber(1/2) and the usual quadratic cost constraint i=1 x2i ≤
nP.
Multi-antenna (MIMO) channels (20.11) present interesting new challenges as well. For exam-
ple, for coherent channels the capacity achieving input distribution is non-unique [71]. The
quasi-static channels are similar to fading channels but the H1 = H2 = · · · , i.e. the channel
gain matrix in (20.11) is not changing with time. This channel model is often used to model cellu-
lar networks. By leveraging an unexpected amount of differential geometry, it was shown in [339]
i i
i i
i i
378
0.5
0.4
Rate, bit/ch.use
0.3
0.2
0.1 Capacity
Converse
RCU
DT
Gallager
Feinstein
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
where the ϵ-capacity Cϵ is known as outage capacity in this case (and depends on ϵ). The main
implication is that Cϵ is a good predictor of the ultimate performance limits for these practically-
relevant channels (better than C is for the AWGN channel, for example). But some caution must
be taken in approximating log M∗ (n, ϵ) ≈ nCϵ , nevertheless. For example, in the case where H
matrix is known at the transmitter, the same paper demonstrated that the standard water-filling
power allocation (Theorem 20.14) that maximizes Cϵ is rather sub-optimal at finite n.
(The log n term in (22.14) is known to be equal to O(1) for the BEC, and 12 log n for the BSC,
AWGN and binary-input AWGN. For these latter channels, normal approximation is typically
defined with + 12 log n added to the previous display.)
For example, considering the BEC1/2 channel we can easily compute the capacity and disper-
sion to be C = (1 − δ) and V = δ(1 − δ) (in bits and bits2 , resp.). Detailed calculation in Ex. IV.31
i i
i i
i i
0.5
0.4
Rate, bit/ch.use
0.3
0.2
Capacity
Converse
0.1 Normal approximation + 1/2 log n
Normal approximation
Achievability
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
Figure 22.3 Comparing the normal approximation against the best upper and lower bounds on 1
n
log M∗ (n, ϵ)
for the BSCδ channel (δ = 0.11, ϵ = 10−3 ).
p
log M∗ (500, 10−3 ) ≈ nδ̄ − nδ δ̄ Q−1 (10−3 ) ≈ 215.5 bits
i i
i i
i i
380
Pe k1 → n1 Pe k2 → n2
10−4 10−4
P∗ SNR P∗ SNR
After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires a
smaller SNR to achieve the same error probability. However, this ignores the fact that the rate of
this code nk11 might be much smaller as well. The concept of normalized rate allows us to compare
the codes of different blocklengths and coding rates.
Specifically, suppose that a k → n code is given. Fix ϵ > 0 and find the value of the SNR P for
which this code attains probability of error ϵ (for example, by taking a horizontal intercept at level
ϵ on the waterfall plot). The normalized rate is defined as
k k
Rnorm (ϵ) = ≈ p ,
log2 M∗ (n, ϵ, P) nC(P) − nV(P)Q−1 (ϵ)
where log M∗ , capacity and dispersion correspond to the channel over which evaluation is being
made (most often the AWGN, BI-AWGN or the fading channel). We also notice that, of course,
the value of log M∗ is not possible to compute exactly and thus, in practice, we use the normal
approximation to evaluate it.
This idea allows us to clearly see how much different ideas in coding theory over the decades
were driving the value of normalized rate upward to 1. This comparison is show on Fig. 22.4.
A short summary is that at blocklengths corresponding to “data stream” channels in cellular net-
works (n ∼ 104 ) the LDPC codes and non-binary LDPC codes are already achiving 95% of the
information-theoretic limit. At blocklengths corresponding to “control plane” (n ≲ 103 ) the polar
codes and LDPC codes are at similar performance and at 90% of the fundamental limits.
i i
i i
i i
0.95
0.9
Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)
0.6
0.55
0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1
0.95
0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate
Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7
0.65
0.6 2 3 4 5
10 10 10 10
Blocklength, n
Figure 22.4 Normalized rates for various codes. Plots generated via [291] (color version recommended)
i i
i i
i i
So far we have been focusing on the paradigm for one-way communication: data are mapped to
codewords and transmitted, and later decoded based on the received noisy observations. In most
practical settings (except for storage), frequently the communication goes in both ways so that the
receiver can provide certain feedback to the transmitter. As a motivating example, consider the
communication channel of the downlink transmission from a satellite to earth. Downlink transmis-
sion is very expensive (power constraint at the satellite), but the uplink from earth to the satellite
is cheap which makes virtually noiseless feedback readily available at the transmitter (satellite).
In general, channel with noiseless feedback is interesting when such asymmetry exists between
uplink and downlink. Even in less ideal settings, noisy or partial feedbacks are commonly available
that can potentially improve the realiability or complexity of communication.
In the first half of our discussion, we shall follow Shannon to show that even with noiseless
feedback “nothing” can be gained in the conventional setup, while in the second half, we examine
situations where feedback is extremely helpful.
f1 : [ M ] → A
f2 : [ M ] × B → A
..
.
fn : [M] × B n−1 → A
• Decoder:
g : B n → [M]
382
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 383
Figure 23.1 Schematic representation of coding without feedback (left) and with full noiseless feedback
(right).
Here the symbol transmitted at time t depends on both the message and the history of received
symbols:
Xt = ft (W, Yt1−1 ).
W ∼ uniform on [M]
PY|X
X1 = f1 (W) −→ Y1
.. −→ Ŵ = g(Yn )
.
PY|X
Xn = fn (W, Yn1−1 ) −→ Yn
Fig. 23.1 compares the settings of channel coding without feedback and with full feedback:
Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in fact
constructing codes achieving capacity with full feedback can be done directly, without appealing
to a (much harder) problem of non-feedback codes. Let π t (·) ≜ PW|Yt (·|Yt ) with the (random) pos-
terior distribution after t steps. It is clear that due to the knowledge of Yt on both ends, transmitter
and receiver have perfectly synchronized knowledge of π t . Now consider how the transmission
progresses:
i i
i i
i i
384
1 Initialize π 0 (·) = M1
2 At (t + 1)-th step, having knowledge of π t all messages are partitioned into classes Pa , according
to the values ft+1 (·, Yt ):
Then transmitter, possessing the knowledge of the true message W, selects a letter Xt+1 =
ft+1 (W, Yt ).
3 Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:
PY|X (Yt+1 |ft+1 (j, Yt ))
π t+1 (j) ≜ π t (j)Bt+1 (j) , Bt+1 (j) ≜ P .
a∈A π t (Pa )
Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY|X (y|a)
E[log Bt+1 (W)|Yt ] = π t (Pa ) log P = I(π̃ t , PY|X ) (23.1)
a∈A y∈B a∈A π t (Pa )a
Intuitively, we expect that the process log π t (W) resembles a random walk starting from − log M
and having a positive drift. Thus to estimate the time it takes for this process to reach value 0
we need to estimate the upward drift. Appealing to intuition and the law of large numbers we
approximate
X
t
log π t (W) − log π 0 (W) ≈ E[log Bs ] .
s=1
Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃ t ≈ P∗X (capacity-achieving input distribution) and this obtains
implying that the transmission terminates in time ≈ logCM . The important lesson here is the follow-
ing: The optimal transmission scheme should map messages to channel inputs in such a way that
the induced input distribution PXt+1 |Yt is approximately equal to the one maximizing I(X; Y). This
idea is called posterior matching and explored in detail in [282].1
1
Note that the magic of Shannon’s theorem is that this optimal partitioning can also be done blindly. That is, it is possible
to preselect partitions Pa in a way that is independent of π t (but dependent on t) and so that π t (Pa ) ≈ P∗X (a) with
overwhelming probability and for all t ∈ [n].
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 385
Converse: We are left to show that Cfb ≤ C(I) . Recall the key in proving weak converse for
channel coding without feedback: Fano’s inequality plus the graphical model
W → Xn → Yn → Ŵ. (23.2)
Then
With feedback the probabilistic picture becomes more complicated as the following figure
shows for n = 3 (dependence introduced by the extra squiggly arrows):
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
So, while the Markov chain relation in (23.2) is still true, the input-output relation is no longer
memoryless2
Y
n
PYn |Xn (yn |xn ) 6= PY|X (yj |xj ) (!)
j=1
There is still a large degree of independence in the channel, though. Namely, we have
Then
2
To see this, consider the example where X2 = Y1 and thus PY1 |X1 X2 = δX1 is a point mass.
i i
i i
i i
386
≤ nCt
In comparison with Theorem 22.2, the following result shows that, with fixed-length block cod-
ing, feedback does not even improve the speed of approaching capacity and can at most improve
the third-order log n terms.
Theorem 23.4 (Dispersion with feedback). For weakly input-symmetric DMC (e.g. additive noise,
BSC, BEC) we have:
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n)
Proof. It is obvious that Cfb ≥ C, we are left to show that Cfb ≤ C(I) .
1 Recap of the steps of showing the strong converse of C ≤ C(I) previously in Section 22.1: take
any (n, M, ϵ) code, compare the two distributions:
P : W → Xn → Yn → Ŵ (23.5)
Q:W→X n
Y → Ŵ
n
(23.6)
PW,Xn ,Yn ,Ŵ = PW,Xn PYn |Xn PŴ|Yn , QW,Xn ,Yn ,Ŵ = PW,Xn PYn PŴ|Yn
thus D(PkQ) = I(Xn ; Yn ) measures the information flow through the links Xn → Yn .
mem−less,stat X
n
1 DPI
−h(ϵ) + ϵ̄ log M = d(1 − ϵk ) ≤ D(PkQ) = I(Xn ; Yn ) = I(X; Y) ≤ nC(I)
M
i=1
(23.7)
2 Notice that when feedback is present, Xn → Yn is not memoryless due to the transmission
protocol. So let us unfold the probability space over time to see the dependence explicitly. As
an example, the graphical model for n = 3 is given below:
i i
i i
i i
23.2* Alternative proof of Theorem 23.3 and Massey’s directed information 387
If we define Q similarly as in the case without feedback, we will encounter a problem at the
second last inequality in (23.7), as with feedback I(Xn ; Yn ) can be significantly larger than
Pn n n
i=1 I(X; Y). Consider the example where X2 = Y1 , we have I(X ; Y ) = +∞ independent
of I(X; Y).
We also make the observation that if Q is defined in (23.6), D(PkQ) = I(Xn ; Yn ) measures the
information flow through all the 6→ and ⇝ links. This motivates us to find a proper Q such that
D(PkQ) only captures the information flow through all the 6→ links {Xi → Yi : i = 1, . . . , n}, so
⊥ W, so that Q[W 6= Ŵ] = M1 .
that D(PkQ) closely relates to nC(I) , while still guarantees that W ⊥
3 Formally, we shall restrict QW,Xn ,Yn ,Ŵ ∈ Q, where Q is the set of distributions that can be
factorized as follows:
QW,Xn ,Yn ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXn |W,Yn−1 QYn |Yn−1 QŴ|Yn (23.8)
PW,Xn ,Yn ,Ŵ = PW PX1 |W PY1 |X1 PX2 |W,Y1 PY2 |X2 · · · PXn |W,Yn−1 PYn |Xn PŴ|Yn (23.9)
Verify that W ⊥
⊥ W under Q: W and Ŵ are d-separated by Xn .
Notice that in the graphical models, when removing ↛ we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when ↛ are removed, so that Q is the “closest” to P while W ⊥ ⊥ W is satisfied.
Now we have that for Q ∈ Q, d(1 − ϵk M1 ) ≤ D(PkQ), in order to obtain the least upper bound,
in Lemma 23.5 we shall show that:
X
n
inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ ) = I(Xt ; Yt |Yt−1 )
Q∈Q
t=1
X
n
= EYt−1 [I(PXt |Yt−1 , PY|X )]
t=1
X
n
≤ I(EYt−1 [PXt |Yt−1 ], PY|X ) (concavity of I(·, PY|X ))
t=1
i i
i i
i i
388
X
n
= I(PXt , PY|X )
t=1
≤nC . ( I)
nC + h(ϵ) C
−h(ϵ) + ϵ̄ log M ≤ nC(I) ⇒ log M ≤ ⇒ Cfb,ϵ ≤ ⇒ Cfb ≤ C.
1−ϵ 1−ϵ
4 Notice that the above proof is also valid even when cost constraint is present.
Lemma 23.5.
X
n
inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ ) = I(Xt ; Yt |Yt−1 ) (23.10)
Q∈Q
t=1
Pn
Remark 23.1 (Directed information). The quantity ⃗I(Xn ; Yn ) ≜ t=1 I(Xt ; Yt |Yt−1 ) was defined
by Massey and is known as directed information. In some sense, see [211] it quantifies the amount
of causal information transfer from X-process to Y-process.
Proof. By chain rule, we can show that the minimizer Q ∈ Q must satisfy the following
equalities:
QX,W = PX,W ,
QXt |W,Yt−1 = PXt |W,Yt−1 , (exercise)
QŴ|Yn = PW|Yn .
and therefore
= D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Yn−1 kQYn |Yn−1 |Xn , Yn−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 )
i i
i i
i i
denotes the set of input symbols that can lead to the output symbol y.
where (a) and (b) are by definitions, (c) follows from Theorem 23.3, and (d) is due to Theorem 19.9.
All capacity quantities above are defined with (fixed-length) block codes.
Remark 23.3. 1 In DMC for both zero-error capacities (C0 and Cfb,0 ) only the support of the
transition matrix PY|X , i.e., whether PY|X (b|a) > 0 or not, matters. The value of PY|X (b|a) > 0
is irrelevant. That is, C0 and Cfb,0 are functions of a bipartite graph between input and output
alphabets. Furthermore, the C0 (but not Cfb,0 !) is a function of the confusability graph – a simple
undirected graph on A with a 6= a′ connected by an edge iff ∃b ∈ B s.t. PY|X (b|a)PY|X (b|a′ ) > 0.
2 That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next remark) with L = 3 (for which Cfb,0 = log 32 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cfb,0 = 0). Clearly in both cases confusability
graph is the same – a triangle.
3 Oftentimes C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.11).
As an example, consider the following polygon channel:
1
5
4
3
i i
i i
i i
390
– L = 5: C0 = 12 log 5 . For achievability, with blocklength one, one can use {1, 3} to achieve
rate 1 bit; with blocklength two, the codebook {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)} achieves
rate 12 log 5 bits, as pointed out by Shannon in his original 1956 paper [279]. More than
two decades later this was shown optimal by Lovász using a technique now known as
semidefinite programming relaxation [204].
– Even L: C0 = log L2 (Exercise IV.16).
– Odd L: The exact value of C0 is unknown in general. For large L, C0 = log L2 + o(1) [42].
• Zero-error capacity with feedback (Exercise IV.16)
L
Cfb,0 = log , ∀L,
2
which can strictly exceeds C0 .
4 Notice that Cfb,0 is not necessarily equal to Cfb = limϵ→0 Cfb,ϵ = C. Here is an example with
Then
C0 = log 2
2
Cfb,0 = max − log max( δ, 1 − δ) (P∗X = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, the Shannon capacity C = Cfb can be made arbitrarily close to log 4 by
picking the cross-over probabilities arbitrarily close to zero, while the confusability graph stays
the same.
Proof of Theorem 23.6. 1 Fix any (n, M, 0)-code. Denote the confusability set of all possible
messages that could have produced the received signal yt = (y1 , . . . , yt ) for all t = 0, 1, . . . , n
by:
i i
i i
i i
1
Cfb,0 = log .
θfb
By definition, we have
Notice the minimizer distribution P∗X is usually not the capacity-achieving input distribution in
the usual sense. This definition also sheds light on how the encoding and decoding should be
proceeded and serves to lower bound the uncertainty reduction at each stage of the decoding
scheme.
3 “≤” (converse): Let PXn be the joint distribution of the codewords. Denote E0 = [M] – original
message set.
t = 1: For PX1 , by (23.13), ∃y∗1 such that:
4 “≥” (achievability)
Let’s construct a code that achieves (M, n, 0).
i i
i i
i i
392
The above example with |A| = 3 illustrates that the encoder f1 partitions the space of all mes-
sages to 3 groups. The encoder f1 at the first stage encodes the groups of messages into a1 , a2 , a3
correspondingly. When channel outputs y1 and assume that Sy1 = {a1 , a2 }, then the decoder
can eliminate a total number of MP∗X (a3 ) candidate messages in this round. The “confusabil-
ity set” only contains the remaining MP∗X (Sy1 ) messages. By definition of P∗X we know that
MP∗X (Sy1 ) ≤ Mθfb . In the second round, f2 partitions the remaining messages into three groups,
send the group index and repeat.
By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb . After n
iterations, the size of “confusability set” is upper bounded by Mθfbn n
, if Mθfb ≤ 1,3 then zero error
probability is achieved. This is guaranteed by choosing log M = −n log θfb . Therefore we have
shown that −n log θfb bits can be reliably delivered with n + O(1) channel uses with feedback,
thus
i i
i i
i i
Example 23.1. For the channel BSC0.11 without feedback the minimal is n = 3000 needed to
achieve 90% of the capacity C, while there exists a VLF code with l = E[n] = 200 achieving that.
This showcases how much feedback can improve the latency and decoding complexity.
Yk = X k + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[X2k ] ≤ P, power constraint in expectation
A = Ân + Nn , Nn ⊥
⊥ Yn .
Morever, since all operations are lienar and everything is jointly Gaussian, Nn ⊥
⊥ Yn . Since Xn ∝
n−1
Nn−1 ⊥
⊥ Y , the codeword we are sending at each time slot is independent of the history of the
channel output (“innovation”), in order to maximize information transfer.
4 ∑n
Note that if we insist each codeword satisfies power constraint almost surely instead on average, i.e., k=1 X2k ≤ nP a.s.,
then this scheme does not work!
i i
i i
i i
394
Note that Yn → Ân → A, and the optimal estimator Ân (a linear combination of Yn ) is a sufficient
statistic of Yn for A under Gaussianity. Then
I(A; Yn ) =I(A; Ân , Yn )
= I(A; Ân ) + I(A; Yn |Ân )
= I(A; Ân )
1 Var(A)
= log .
2 Var(Nn )
where the last equality uses the fact that N follows a normal distribution. Var(Nn ) can be computed
directly using standard linear MMSE results. Instead, we determine it information theoretically:
Notice that we also have
I(A; Yn ) = I(A; Y1 ) + I(A; Y2 |Y1 ) + · · · + I(A; Yn |Yn−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 )
key
= I(X1 ; Y1 ) + I(X2 ; Y2 ) + · · · + I(Xn ; Yn )
1
= n log(1 + P) = nC
2
Therefore, with Elias’ scheme of sending A ∼ N (0, Var A), after the n-th use of the AWGN(P)
channel with feedback,
P n
Var Nn = Var(Ân − A) = 2−2nC Var A = Var A,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.
Schalkwijk-Kailath Elias’ scheme can also be used to send digital data. Let W ∼ uniform on
M-PAM constellation in ∈ [−1, 1], i.e., {−1, −1 + M2 , · · · , −1 + 2k
M , · · · , 1}. In the very first step
W is sent (after scaling to satisfy the power constraint):
√
X0 = PW, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
e 0 = X0 + Z
Y e0 , e0 ) = 2−2(n−1)C
Var(Z
e0 to the nearest PAM point. Notice that
Finally, the decoder quantizes Y
√ (n−1)C √
e 1 −(n−1)C P 2 P
ϵ ≤ P | Z0 | > =P 2 |Z| > = 2Q
2M 2M 2M
so that
√
P ϵ
log M ≥ (n − 1)C + log − log Q−1 ( ) = nC + O(1).
2 2
i i
i i
i i
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially as
√
n increases. More importantly, we gained an n term in terms of log M, since for the case without
feedback we have (by Theorem 22.2)
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .
As an example, consider P = 1 and (n−thenchannel capacity is C = 0.5 bit per channel use. To
e(n−1)C
−3
achieve error probability 10 , 2Q 2 1) C
2M ≈ 10 , so 2M ≈ 3, and logn M ≈ n−n 1 C − logn 8 .
−3
Notice that the capacity is achieved to within 99% in as few as n = 50 channel uses, whereas the
best possible block codes without feedback require n ≈ 2800 to achieve 90% of capacity.
The take-away message of this chapter is as follows: Feedback is best harnessed with adaptive
strategies. Although it does not increase capacity under block coding, feedback greatly boosts
reliability as well as reduces coding complexity.
i i
i i
i i
1 1
IV.1 A code with M = 2k , average probability of error ϵ < 2 and bit-error probability pb < 2 must
satisfy both
C + h(ϵ)
log M ≤ (IV.1)
1−ϵ
and
C
log M ≤ , (IV.2)
log 2 − h(pb )
where C = supPX I(X; Y). Since pb ≤ ϵ, in the bound (IV.2) we may replace h(pb ) with h(ϵ) to
obtain a new bound. Suppose that a value of k is fixed and the bounds are used to prove a lower
bound on ϵ. When will the new bound be better than (IV.1)?
IV.2 A magician is performing card tricks on stage. In each round he takes a shuffled deck of 52
cards and asks someone to pick a random card N from the deck, which is then revealed to the
audience. Assume the magician can prepare an arbitrary ordering of cards in the deck (before
each round) and that N is distributed binomially on {0, . . . , 51} with mean 51
2 .
(a) What is the maximal number of bits per round that he can send over to his companion in
the room? (in the limit of infinitely many rounds)
(b) Is communication possible if N were uniform on {0, . . . , 51}? (In practice, however, nobody
ever picks the top or the bottom ones)
IV.3 Find the capacity of the erasure-error channel (Fig. 23.2) with channel matrix
1 − 2δ δ δ
W=
δ δ 1 − 2δ
where 0 ≤ δ ≤ 1/2.
FIXME
Figure 23.2 Binary erasure-error channel.
IV.4 Consider a binary symmetric channel with crossover probability δ ∈ (0, 1):
Y = X + Z mod 2 , Z ∼ Bern(δ) .
Suppose that in addition to Y the receiver also gets to observe noise Z through a binary erasure
channel with erasure probability δe ∈ (0, 1). Compute:
(a) Capacity C of the channel.
(b) Zero-error capacity C0 of the channel.
i i
i i
i i
1X 2
n
xi (w) ≤ P, w ∈ {1, 2, . . . , 2nR }.
n
i=1
Using Fano’s inequality, show that the capacity C is equal to zero for this channel.
IV.6 Randomized encoders and decoders may help for maximal probability of error:
(a) Consider a binary asymmetric channel PY|X : {0, 1} → {0, 1} specified by PY|X=0 =
Ber(1/2) and PY|X=1 = Ber(1/3). The encoder f : [M] → {0, 1} tries to transmit 1 bit
of information, i.e., M = 2, with f(1) = 0, f(2) = 1. Show that the optimal decoder which
minimizes the maximal probability of error is necessarily randomized. Find the optimal
decoder and the optimal Pe,max . (Hint: Recall binary hypothesis testing.)
(b) Give an example of PY|X : X → Y , M > 1 and ϵ > 0 such that there is an (M, ϵ)max -code
with a randomized encoder-decoder, but no such code with a deterministic encoder-decoder.
IV.7 Routers A and B are setting up a covert communication channel in which the data is encoded
in the ordering of packets. Formally: router A receives n packets, each of type A or D (for
Ack/Data), where type is i.i.d. Bernoulli(p) with p ≈ 0.9. It encodes k bits of secret data by
reordering these packets. The network between A and B delivers packets in-order with loss rate
δ ≈ 5% (Note: packets have sequence numbers, so each loss is detected by B).
What is the maximum rate nk of reliable communication achievable for large n? Justify your
answer!
IV.8 (Strong converse for BSC) In this exercise we give a combinatorial proof of the strong converse
for the binary symmetric channel. For BSCδ with 0 < δ < 21 ,
(a) Given any (n, M, ϵ)max -code with deterministic encoder f and decoder g, recall that the
decoding regions {Di = g−1 (i)}M i=1 form a partition of the output space. Prove that for
all i ∈ [M],
L
X n
| Di | ≥
j
j=0
i i
i i
i i
i i
i i
i i
IV.10 (Capacity-cost at P = P0 .) Recall that we have shown that for stationary memoryless channels
and P > P0 capacity equals f(P):
where
Show:
(a) If P0 is not admissible, i.e., c(x) > P0 for all x ∈ A, then C(P0 ) is undefined (even M = 1
is not possible)
(b) If there exists a unique x0 such that c(x0 ) = P0 then
C(P0 ) = f(P0 ) = 0 .
(c) If there are more than one x with c(x) = P0 then we still have
C(P0 ) = f(P0 ) .
(d) Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable
cost function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R →
{−1, 0, 1})
IV.11 Consider a stationary memoryless additive non-Gaussian noise channel:
Yi = Xi + Zi , E[Zi ] = 0, Var[Zi ] = 1
i i
i i
i i
we show that without degrading the performance, we can reduce this number to qn by restricting
to Toeplitz generator matrix G, i.e., Gij = Gi−1,j−1 for all i, j > 1.
Prove the following strengthening of Theorem 18.13: Let PY|X be additive noise over Fnq . For
any 1 ≤ k ≤ n, there exists a linear code f : Fkq → Fnq with Toeplitz generator matrix, such that
+
h − n−k−log 1 n i
Pe,max = Pe ≤ E q
q PZn (Z )
for the channel Y = X + Z where X is uniform on F2q , noise Z ∈ F2q has distribution PZ and
P Z ( b − a)
i(a; b) ≜ log .
q− 2
(a) Show that probability of error of the code a 7→ (av, au) + h is the same as that of a 7→
(a, auv−1 ).
(b) Let {Xa , a ∈ Fq } be a random codebook defined as
Xa = (aV, aU) + H ,
with V, U – uniform over non-zero elements of Fq and H – uniform over F2q , the three being
jointly independent. Show that for a 6= a′ we have
1
PXa ,X′a (x21 , x̃21 ) = 1{x1 6= x̃1 , x2 6= x̃2 }
q2 (q − 1)2
(c) Show that for a 6= a′
q2 1
P[i(X′a ; Xa + Z) > log β] ≤ P[i(X̄; Y) > log β] − P[i(X; Y) > log β]
( q − 1) 2 (q − 1)2
q2
≤ P[i(X̄; Y) > log β] ,
( q − 1) 2
where PX̄XY (ā, a, b) = q14 PZ (b − a).
(d) Conclude by following the proof of the DT bound with M = q that the probability of error
averaged over the random codebook {Xa } satisfies (IV.10).
i i
i i
i i
IV.14 (Information density and types.) Let PY|X : A → B be a DMC and let PX be some input
distribution. Take PXn Yn = PnXY and define i(an ; bn ) with respect to this PXn Yn .
(a) Show that i(xn ; yn ) is a function of only the “joint-type” P̂XY of (xn , yn ), which is a
distribution on A × B defined as
1
P̂XY (a, b) = #{i : xi = a, yi = b} ,
n
where a ∈ A and b ∈ B . Therefore { 1n i(xn ; yn ) ≥ γ} can be interpreted as a constraint on
the joint type of (xn , yn ).
(b) Assume also that the input xn is such that P̂X = PX . Show that
1 n n
i(x ; y ) ≤ I(P̂X , P̂Y|X ) .
n
The quantity I(P̂X , P̂Y|X ), sometimes written as I(xn ∧ yn ), is an empirical mutual informa-
tion5 . Hint:
PY|X (Y|X)
EQXY log =
PY (Y)
D(QY|X kQY |QX ) + D(QY kPY ) − D(QY|X kPY|X |QX ) (IV.11)
IV.15 (Fitingof-Goppa universal codes) Consider a finite abelian group X . Define Fitingof norm as
Conclude that dΦ (xn , yn ) ≜ kxn − yn kΦ is a translation invariant (Fitingof) metric on the set
of equivalence classes in X n , with equivalence xn ∼ yn ⇐⇒ kxn − yn kΦ = 0.
(b) Define Fitingof ball Br (xn ) ≜ {yn : dΦ (xn , yn ) ≤ r}. Show that
(d) Conclude that a code C ⊂ X n with Fitingof minimal distance dmin,Φ (C) ≜
minc̸=c′ ∈C dΦ (c, c′ ) ≥ 2λn is decodable with vanishing probability of error on any
additive-noise channel Y = X + Z, as long as H(Z) < λ.
5
Invented by V. Goppa for his maximal mutual information (MMI) decoder: Ŵ = argmaxi I(ci ∧ yn ).
i i
i i
i i
Comment: By Feinstein-lemma like argument it can be shown that there exist codes of size
X n(1−λ) , such that balls of radius λn centered at codewords are almost disjoint. Such codes are
universally capacity achieving for all memoryless additive noise channels on X . Extension to
general (non-additive) channels is done via introducing dΦ (xn , yn ) = nH(xT |yT ), while exten-
sion to channels with Markov memory is done by introducing Markov-type norm kxn kΦ1 =
nH(xT |xT−1 ). See [143, Chapter 3].
IV.16 Consider the polygon channel discussed in Remark 23.3, where the input and output alphabet
are both {1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = (a mod L) + 1. The
confusability graph is a cycle of L vertices. Rigorously prove the following:
(a) For all L, The zero-error capacity with feedback is Cfb,0 = log L2 .
(b) For even L, the zero-error capacity without feedback C0 = log L2 .
(c) Now consider the following channel, where the input and output alphabet are both
{1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = a + 1. In this case the confusability
graph is a path of L vertices. Show that the zero-error capacity is given by
L
C0 = log
2
What is Cfb,0 ?
IV.17 (Input-output cost) Let PY|X : X → Y be a DMC and consider a cost function c : X × Y → R
(note that c(x, y) ≤ L < ∞ for some L). Consider a problem of channel coding, where the
error-event is defined as
( n )
X
{error} ≜ {Ŵ 6= W} ∪ c(Xk , Yk ) > nP ,
k=1
where P is a fixed parameter. Define operational capacity C(P) and show it is given by
for all P > P0 ≜ minx0 E[c(X, Y)|X = x0 ]. Give a counter-example for P = P0 . (Hint: do a
converse directly, and for achievability reduce to an appropriately chosen cost-function c′ (x)).
IV.18 (Expurgated random coding bound)
(a) For any code C show the following bound on probability of error
1 X −dB (c,c′ )
Pe (C) ≤ 2 ,
M ′
c̸=c
Pn
where Bhattacharya distance dB (xn , x̃n ) = j=1 dB (xj , x̃j ) and a single-letter
Xp
dB (x, x̃) = − log2 W(y|x)W(y|x̃) .
y∈Y
i i
i i
i i
′
(b) Fix PX and let E0,x (ρ, PX ) ≜ −ρ log2 E[2− ρ dB (X,X ) ], where X ⊥
1
⊥ X′ ∼ PX . Show by random
coding that there always exists a code C of rate R with
(c) We improve the previous bound as follows. We still generate C by random coding. But this
P ′
time we expurgate all codewords with f(c, C) > med(f(c, C)), where f(c) = c′ ̸=c 2−dB (c,c ) .
Using the bound
med(V) ≤ 2ρ E[V1/ρ ]ρ ∀ρ ≥ 1
show that
(d) Conclude that there must exist a code with rate R − O(1/n) and Pe (C) ≤ 2−nEex (R) , where
IV.19 Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable cost
function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R → {−1, 0, 1})
IV.20 (Sum of channels) Let W1 and W2 denote the channel matrices of discrete memoryless channel
(DMC) PY1 |X1 and PY2 |X2 with capacity C1 and C2 , respectively. The sum of the two channels is
another DMC with channel matrix W01 W02 . Show that the capacity of the sum channel is given
by
IV.21 (Product of channels) For i = 1, 2, let PYi |Xi be a channel with input space Ai , output space Bi ,
and capacity Ci . Their product channel is a channel with input space A1 × A2 , output space
B1 × B2 , and transition kernel PY1 Y2 |X1 X2 = PY1 |X1 PY2 |X2 . Show that the capacity of the product
channel is given by
C = C1 + C2 .
IV.22 Mixtures of DMCs. Consider two DMCs UY|X and VY|X with a common capacity achieving input
distribution and capacities CU < CV . Let T = {0, 1} be uniform and consider a channel PYn |Xn
that uses U if T = 0 and V if T = 1, or more formally:
1 n 1
PYn |Xn (yn |xn ) = U (yn |xn ) + VnY|X (yn |xn ) . (IV.12)
2 Y| X 2
Show:
(a) Is this channel {PYn |Xn }n≥1 stationary? Memoryless?
(b) Show that the Shannon capacity C of this channel is not greater than CU .
(c) The maximal mutual information rate is
1 CU + CV
C(I) = lim sup I(Xn ; Yn ) =
n→∞ n Xn 2
i i
i i
i i
C < C(I) .
IV.23 Compound DMC [37] Compound DMC is a family of DMC’s with common input and output
alphabets PYs |X : A → B, s ∈ S . An (n, M, ϵ) code is an encoder-decoder pair whose probability
of error ≤ ϵ over any channel PYs |X in the family (note that the same encoder and the same
decoder are used for each s ∈ S ). Show that capacity is given by
IV.24 Consider the following (memoryless) channel. It has a side switch U that can be in positions
ON and OFF. If U is on then the channel from X to Y is BSCδ and if U is off then Y is Bernoulli
(1/2) regardless of X. The receiving party sees Y but not U. A design constraint is that U should
be in the ON position no more than the fraction s of all channel uses, 0 ≤ s ≤ 1. Questions:
(a) One strategy is to put U into ON over the first sn time units and ignore the rest of the (1 − s)n
readings of Y. What is the maximal rate in bits per channel use achievable with this strategy?
(b) Can we increase the communication rate if the encoder is allowed to modulate the U switch
together with the input X (while still satisfying the s-constraint on U)?
(c) Now assume nobody has access to U, which is random, independent of X, memoryless
across different channel uses and
P[U = ON] = s.
Find capacity.
IV.25 Let {Zj , j = 1, 2, . . .} be a stationary Gaussian process with variance 1 such that Zj form a
Markov chain Z1 → . . . → Zn → . . . Consider an additive channel
Yn = Xn + Zn
Pn
with power constraint j=1 |xj |2 ≤ nP. Suppose that I(Z1 ; Z2 ) = ϵ 1, then capacity-cost
function
1
C(P) = log(1 + P) + Bϵ + o(ϵ)
2
as ϵ → 0. Compute B and interpret your answer.
How does the frequency spectrum of optimal signal change with increasing ϵ?
IV.26 A semiconductor company offers a random number generator that outputs a block of random n
bits Y1 , . . . , Yn . The company wants to secretly embed a signature in every chip. To that end, it
decides to encode the k-bit signature in n real numbers Xj ∈ [0, 1]. To each individual signature
a chip is manufactured that produces the outputs Yj ∼ Ber(Xj ). In order for the embedding to
be inconspicuous the average bias P should be small:
n
1 X 1
n Xj − 2 ≤ P .
j=1
i i
i i
i i
As a function of P how many signature bits per output (k/n) can be reliably embedded in this
fashion? Is there a simple coding scheme achieving this performance?
IV.27 Consider a DMC with two outputs PY,U|X . Suppose that receiver observes only Y, while U is
(causally) fed back to the transmitter. We know that when Y = U the capacity is not increased.
(a) Show that capacity is not increased in general (even when Y 6= U).
(b) Suppose now that there is a cost function c and c(x0 ) = 0. Show that capacity per unit cost
(with U being fed back) is still given by
D(PY|X=x kPY|X=x0 )
CV = max
x̸=x0 c(x)
IV.28 (Capacity of sneezing) A sick student is sneezing periodically every minute, with each sneeze
happening i.i.d. with probability p. He decides to send k bits to a friend by modulating the
sneezes. For that, every time he realizes he is about to sneeze he chooses to suppress a sneeze
or not. A friend listens for n minutes and then tries to decode k bits.
(a) Find capacity in bits per minute. (Hint: Think how to define the channel so that channel input
at time t were not dependent on the arrival of the sneeze at time t. To rule out strategies that
depend on arrivals of past sneezes, you may invoke Exercise IV.27.)
(b) Suppose sender can suppress at most E sneezes and listener can wait indefinitely (n = ∞).
Show that sender can transmit Cpuc E + o(E) bits reliably as E → ∞ and find Cpuc . Curiously,
Cpuc ≥ 1.44 bits/sneeze regardless of p. (Hint: This is similar to Exercise IV.17.)
(c) (Bonus, hard) Redo 1 and 2 for the case of a clairvoyant student who knows exactly when
sneezes will happen in the future.
IV.29 An inmate has n oranges that he is using to communicate with his conspirators by putting
oranges in trays. Assume that infinitely many trays are available, each can contain zero or more
oranges, and each orange in each tray is eaten by guards independently with probability δ . In
the limit of n → ∞ show that an arbitrary high rate (in bits per orange) is achievable.
IV.30 Recall that in the proof of the DT bound we used the decoder that outputs (for a given channel
output y) the first cm that satisfies
{i(cm ; y) > log β} . (IV.13)
One may consider the following generalization. Fix E ⊂ X × Y and let the decoder output the
first cm which satisfies
( cm , y) ∈ E
By repeating the random coding proof steps (as in the DT bound) show that the average
probability of error satisfies
M−1
E[Pe ] ≤ P[(X, Y) 6∈ E] + P[(X̄, Y) ∈ E] ,
2
where
PXYX̄ (a, b, ā) = PX (a)PY|X (b|a)PX (ā) .
M−1
Conclude that the optimal E is given by (IV.13) with β = 2 .
i i
i i
i i
IV.31 Bounds for the binary erasure channel (BEC). Consider a code with M = 2k operating over the
blocklength n BEC with erasure probability δ ∈ [0, 1).
(a) Show that regardless of the encoder-decoder pair:
+
P[error|#erasures = z] ≥ 1 − 2n−z−k
(b) Conclude by averaging over the distribution of z that the probability of error ϵ must satisfy
X
n
n ℓ
ϵ≥ δ (1 − δ)n−ℓ 1 − 2n−ℓ−k , (IV.14)
ℓ
ℓ=n−k+1
(c) By applying the DT bound with uniform PX show that there exist codes with
X n
n t
δ (1 − δ)n−t 2−|n−t−k+1| .
+
ϵ≤ (IV.15)
t
t=0
(d) Fix n = 500, δ = 1/2. Compute the smallest k for which the right-hand side of (IV.14) is
greater than 10−3 .
(e) Fix n = 500, δ = 1/2. Find the largest k for which the right-hand side of (IV.15) is smaller
than 10−3 .
(f) Express your results in terms of lower and upper bounds on log M∗ (500, 10−3 ).
i i
i i
i i
Part V
i i
i i
i i
i i
i i
i i
409
In Part II we studied lossless data compression (source coding), where the goal is to compress
a random variable (source) X into a minimal number of bits on average (resp. exactly) so that
X can be reconstructed exactly (resp. with high probability) using these bits. In both cases, the
fundamental limit is given by the entropy of the source X. Clearly, this paradigm is confined to
discrete random variables.
In this part we will tackle the next topic, lossy data compression: Given a random variable X,
encode it into a minimal number of bits, such that the decoded version X̂ is a faithful reconstruction
of X, in the sense that the “distance” between X and X̂ is at most some prescribed accuracy either
on average or with high probability.
The motivations for study lossy compression are at least two-fold:
1 Many natural signals (e.g. audio, images, or video) are continuously valued. As such, there is
a need to represent these real-valued random variables or processes using finitely many bits,
which can be fed to downstream digital processing; see Fig. 23.3 for an illustration.
Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time
2 There is a lot to be gained in compression if we allow some reconstruction errors. This is espe-
cially important in applications where certain errors (such as high-frequency components in
natural audio and visual signals) are imperceptible to humans. This observation is the basis of
many important compression algorithms and standards that are widely deployed in practice,
including JPEG for images, MPEG for videos, and MP3 for audios.
The operation of mapping (naturally occurring) continuous time/analog signals into
(electronics-friendly) discrete/digital signals is known as quantization, which is an important sub-
ject in signal processing in its own right (cf. the encyclopedic survey [144]). In information theory,
the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959
[280]. To start, we will take a closer look at quantization next in Section 24.1, followed by the
information-theoretic formulation in Section 24.2. A simple (and tight) converse bound is given
in Section 24.3, with the the matching achievability bound deferred to the next chapter.
Finally, in Chapter 27 we study Kolmogorov’s metric entropy, which is a non-probabilistic
theory of quantization for sets in metric spaces. In addition to connections to the probabilistic the-
ory of quantization in the preceding chapters, this concept has far-reaching consequences in both
probability (e.g. empirical processes, small-ball probability) and statistical learning (e.g. entropic
upper and lower bounds for estimation) that will be explored further in Part VI.
i i
i i
i i
24 Rate-distortion theory
−A A
2 2
where D denotes the average distortion. Often R = log2 N is used instead of N, so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
∆j
410
i i
i i
i i
Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
X
N
DU (R) = E|X − qU (X)| =2
E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (24.1)
j=1
X
N
|∆j |2
(high rate approximation) ≈ P[ X ∈ ∆ j ] (24.2)
12
j=1
( NA )2 A2 −2R
= = 2 , (24.3)
12 12
Var(X)
10 log10 SNR = 10 log10
E|X − qU (X)|2
12Var(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule of
thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR. How-
ever, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
The above discussion deals with X with a bounded support. When X is unbounded, it is wise to
allocate the quantization points to those values that are more likely and saturate the large values at
the dynamic range of the quantizer, resulting in two types of contributions to the quantization error,
known as the granular distortion and overload distortion. This leads us to the question: Perhaps
uniform quantization is not optimal?
i i
i i
i i
412
Often the way such quantizers are implemented is to take a monotone transformation of the source
f(X), perform uniform quantization, then take the inverse function:
f
X U
q qU (24.4)
X̂ qU ( U)
f−1
i.e., q(X) = f−1 (qU (f(X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 24.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieving this is
possible because the human ear inherently uses logarithmic comp anding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not
samples themselves!)) have an approximately Laplace distribution. Due to these two factors, a
very popular and sensible choice for f is the μ-companding function
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called μ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly
different compander called the A-law is used.
Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
i i
i i
i i
piecewise constant:
With ideas like this, in 1957 S. Lloyd developed an algorithm (called Lloyd’s algorithm or
Lloyd’s Method I) for iteratively finding optimal quantization regions and points.1 Suitable for
both the scalar and vector cases, this method proceeds as follows: Initialized with some choice of
N = 2k quantization points, the algorithm iterates between the following two steps:
1 Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
2 Update the quantization points by the centroids E[X|X ∈ D] of each Voronoi region D.
b b
b b
b b
b b
b b
Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results about
Lloyd’s algorithm
1
This work at Bell Labs remained unpublished until 1982 [202].
i i
i i
i i
414
Remark 24.1. The third point tells us that Lloyd’s algorithm is not always guaranteed to give
the optimal quantization strategy.2 One sufficient condition for uniqueness of a CVT is the log-
concavity of the density of X [129], e.g., Gaussians. On the other hand, even for Gaussian, if N > 3,
optimal quantization points are not
Remark 24.2 (k-means). A popular clustering method called k-means is the following: Given n
data points x1 , . . . , xn ∈ Rd , the goal is to find k centers μ1 , . . . , μk ∈ Rd to minimize the objective
function
X n
min kxi − μj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (24.5):
min EkX − q(X)k2
q:|Im(q)|≤k
Pn
where X is distributed according to the empirical distribution over the dataset, namely, 1n i=1 δxi .
Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is a commonly
used heuristic.
X
N Z
|∆j |2 ∆ 2 ( x)
≈ P[ X ∈ ∆ j ] ≈ p ( x) dx
12 12
j=1
Z
1
= p(x)λ−2 (x)dx ,
12N2
2
As a simple example one may consider PX = 13 ϕ(x − 1) + 13 f(x) + 13 f(x + 1) where f(·) is a very narrow pdf, symmetric
around 0. Here the CVT with centers ± 32 is not optimal among binary quantizers (just compare to any quantizer that
quantizes two adjacent spikes to same value).
3
This argument is easy to make rigorous. We only need to define reconstruction points cj as the solution of
∫ cj j
−∞ λ(x) dx = N (quantile).
i i
i i
i i
To find the optimal density λ that gives the best reconstruction (minimum MSE) when X has den-
R R R R R
sity p, we use Hölder’s inequality: p1/3 ≤ ( pλ−2 )1/3 ( λ)2/3 . Therefore pλ−2 ≥ ( p1/3 )3 ,
1/3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) .
p1/3 dx
Therefore when N = 2R ,4
Z 3
1 −2R
Dscalar (R) ≈ 2 p1/3 (x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2
1 −2R 2
Dscalar (R) ≤2 A = DU (R)
12
where the RHS is the uniform quantization error given in (24.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution, uniform
quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives
√
2 −2R π 3
Dscalar (R) ≈ σ 2 (24.6)
2
Remark 24.3. In fact, in scalar case the optimal non-uniform quantizer can be realized using the
compander architecture (24.4) that we discussed in Section 24.1.2: As an exercise, use Taylor
expansion to analyze the quantization
R
error of (24.4) when N → ∞. The optimal compander
t
p1/3 (t)dt
f : R → [0, 1] turns out to be f(x) = R−∞
∞
p1/3 (t)dt
[28, 289].
−∞
4
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” as shown by Zador [344, 345].
i i
i i
i i
416
On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smooth
R cj
function such that −∞ Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
Z
1 1
D≈ pX (x) 2 dx (24.8)
12 Λ ( x)
Z
Λ(x)
H(q(X)) ≈ pX (x) log dx (24.9)
p X ( x)
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ ( x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be con-
2h(X)
sidered next) can not achieve distortion better that 2−2R 22πe . That is, the maximal improvement
they can gain for any i.i.d. source is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar
uniform quantizers followed by lossless compression is an overwhelmingly popular solution in
practice.
Hamming Game. Given 100 unbiased bits, we are asked to inspect them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the original
100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% bit error rate (BER). However, as we will show in this chapter
(Theorem 26.1), the optimal strategy amazingly achieves a BER of 11%. How is this possible?
After all we are guessing independent bits and the loss function (BER) treats all bits equally.
Gaussian example. Given (X1 , . . . , Xn ) drawn independently from N (0, σ 2 ), we are given a
budget of one bit per symbol to compress, so that the decoded version (X̂1 , . . . , X̂n ) has a small
Pn
mean-squared error 1n i=1 E[(Xi − X̂i )2 ].
To this end, a simple strategy is to quantize each coordinate into 1 bit. As worked out in Exam-
ple 24.1, the optimal one-bit quantization error is (1 − π2 )σ 2 ≈ 0.36σ 2 . In comparison, we will
2
show later (Theorem 26.2) that there is a scheme that achieves an MSE of σ4 per coordinate
for large n; furthermore, this is optimal. More generally, given R bits per symbol, by doing opti-
mal vector quantization in high dimensions (namely, compressing (X1 , . . . , Xn ) jointly to nR bits),
rate-distortion theory will tell us that when n is large, we can achieve the per-coordinate MSE:
Dvec (R) = σ 2 2−2R
i i
i i
i i
1 Applying scalar quantization componentwise results in quantization region that are hypercubes,
which may not suboptimal for covering in high dimensions.
2 Concentration of measures effectively removes many atypical source realizations. For example,
when quantizing a single Gaussian X, we need to cover large portion of R in order to deal with
those significant deviations of X from 0. However, when we are quantizing many (X1 , . . . , Xn )
together, the law of large numbers makes sure that many Xj ’s cannot conspire together and all
produce large values. Indeed, (X1 , . . . , Xn ) concentrates near a sphere. As such, we may exclude
large portions of the space Rn from consideration.
where X ∈ X is refereed to as the source, W = f(X) is the compressed discrete data, and X̂ = g(W)
is the reconstruction which takes values in some alphabet X̂ that needs not be the same as X .
A distortion metric (or loss function) is a measurable function d : X × X̂ → R ∪ {+∞}. There
are various formulations of the lossy compression problem:
1 Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].
2 Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].
3 Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize the average length
E[l(W)] or entropy H(W).
In this book we focus on lossy compression with fixed length and are chiefly concerned with
average distortion (with the exception of joint source-channel coding in Section 26.3 where excess
distortion will be needed). The difference between average and excess distortion is analogous to
average and high-probability risk bound in statistics and machine learning. It turns out that under
mild assumptions these two formulations lead to the same fundamental limit (cf. Remark 25.2).
As usual, of particular interest is when the source takes the form of a random vector Sn =
(S1 , . . . , Sn ) ∈ S n and the reconstruction is Ŝn = (S1 , . . . , Sn ) ∈ Ŝ n . We will be focusing on the
so called separable distortion metric defined for n-letter vectors by averaging the single-letter
distortions:
1X
n
d(sn , ŝn ) ≜ d(si , ŝi ). (24.10)
n
i=1
i i
i i
i i
418
Note that, for stationary memoryless (iid) source, the large-blocklength limit in (24.12) in fact
exists and coincides with the infimum over all blocklengths. This is a consequence of the average
distortion criterion and the separability of the distortion metric – see Exercise V.2.
Theorem 24.3 (General Converse). Suppose X → W → X̂, where W ∈ [M] and E[d(X, X̂)] ≤ D.
Then
log M ≥ ϕX (D) ≜ inf I(X; Y).
PY|X :E[d(X,Y)]≤D
Proof.
log M ≥ H(W) ≥ I(X; W) ≥ I(X; X̂) ≥ ϕX (D)
where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).
Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.
Remark 24.4 (The role of D0 and Dmax ). By definition, Dmax is the distortion attainable without any
information. Indeed, if Dmax = Ed(X, x̂) for some fixed x̂, then this x̂ is the “default” reconstruction
of X, i.e., the best estimate when we have no information about X. Therefore D ≥ Dmax can be
achieved for free. This is the reason for the notation Dmax despite that it is defined as an infimum.
On the other hand, D0 should be understood as the minimum distortion one can hope to attain.
Indeed, suppose that X̂ = X and d is a metric on X . In this case, we have D0 = 0, since we can
choose Y to be a finitely-valued approximation of X.
i i
i i
i i
As an example, consider the Gaussian source with MSE distortion, namely, X ∼ N (0, σ 2 ) and
2
d(x, x̂) = (x − x̂)2 . We will show later that ϕX (D) = 21 log+ σD . In this case D0 = 0 which is
however not attained; Dmax = σ 2 and if D ≥ σ 2 , we can simply output 0 as the reconstruction
which requires zero bits.
Proof.
(a) Convexity follows from the convexity of PY|X 7→ I(PX , PY|X ) (Theorem 5.3).
(b) Continuity in the interior of the domain follows from convexity, since D0 =
infPX̂|X E[d(X, X̂)] = inf{D : ϕS (D) < ∞}.
(c) The only way to satisfy the constraint is to take X = Y.
(d) For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.
In channel coding, the main result relates the Shannon capacity, an operational quantity, to the
information capacity. Here we introduce the information rate-distortion function in an analogous
way, which by itself is not an operational quantity.
The reason for defining R(I) (D) is because from Theorem 24.3 we immediately get:
Naturally, the information rate-distortion function inherits the properties of ϕ from Theo-
rem 24.4:
Proof. All properties follow directly from corresponding properties in Theorem 24.4 applied to
ϕSn .
i i
i i
i i
420
Next we show that R(I) (D) can be easily calculated for stationary memoryless (iid) source
without going through the multi-letter optimization problem. This parallels Corollary 20.5 for
channel capacity (with separable cost function).
i.i.d.
Theorem 24.8 (Single-letterization). For stationary memoryless source Si ∼ PS and separable
distortion d in the sense of (24.10), we have for every n,
Thus
Proof. By definition we have that ϕSn (D) ≤ nϕS (D) by choosing a product channel: PŜn |Sn = P⊗ n
Ŝ|S
.
Thus R(I) (D) ≤ ϕS (D).
For the converse, for any PŜn |Sn satisfying the constraint E[d(Sn , Ŝn )] ≤ D, we have
X
n
I(Sn ; Ŝn ) ≥ I(Sj , Ŝj ) (Sn independent)
j=1
Xn
≥ ϕS (E[d(Sj , Ŝj )])
j=1
1 X
n
≥ nϕ S E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1
≥ nϕ S ( D) (ϕS non-increasing)
Theorem 24.9 (Excess-to-Average). Suppose that there exists (f, g) such that W = f(X) ∈ [M] and
P[d(X, g(W)) > D] ≤ ϵ. Assume for some p ≥ 1 and x̂0 ∈ X̂ that (E[d(X, x̂0 )p ])1/p = Dp < ∞.
Then there exists (f′ , g′ ) such that W′ = f′ (X) ∈ [M + 1] and
Remark 24.5. This result is only useful for p > 1, since for p = 1 the right-hand side of (24.13)
does not converge to D as ϵ → 0. However, a different method (as we will see in the proof of
i i
i i
i i
Theorem 25.1) implies that under just Dmax = D1 < ∞ the analog of the second term in (24.13)
is vanishing as ϵ → 0, albeit at an unspecified rate.
Proof. We transform the first code into the second by adding one codeword:
(
f ( x) d(x, g(f(x))) ≤ D
f ′ ( x) =
M + 1 otherwise
(
g(j) j ≤ M
g′ ( j) =
x̂0 j=M+1
Then by Hölder’s inequality,
E[d(X, g′ (W′ )) ≤ E[d(X, g(W))|Ŵ 6= M + 1](1 − ϵ) + E[d(X, x̂0 )1{Ŵ = M + 1}]
≤ D(1 − ϵ) + Dp ϵ1−1/p
i i
i i
i i
where
Pn
and d(Sn , Ŝn ) = 1n i=1 d(Si , Ŝi ) takes a separable form.
We have shown the following general converse in Theorem 24.3: For any [M] 3 W → X → X̂
such that E[d(X, X̂)] ≤ D, we have log M ≥ ϕX (D), which implies in the special case of X =
Sn , log M∗ (n, D) ≥ ϕSn (D) and hence, in the large-n limit, R(D) ≥ R(I) (D). For a stationary
i.i.d.
memoryless source Si ∼ PS , Theorem 24.8 shows that ϕSn single-letterizes as ϕSn (D) = nϕS (D).
As a result, we obtain the converse
In this chapter, we will prove a matching achievability bound and establish the identity R(D) =
R(I) (D) for stationary memoryless sources.
i.i.d.
Theorem 25.1. Consider a stationary memoryless source Sn ∼ PS . Suppose that the distortion
metric d and the target distortion D satisfy:
422
i i
i i
i i
Then
R(D) = R(I) (D) = ϕS (D) = inf I(S; Ŝ). (25.4)
PŜ|S :E[d(S,Ŝ)]≤D
Remark 25.1.
• Note that Dmax < ∞ does not require that d(·, ·) only takes values in R. That is, Theorem 25.1
permits d(s, ŝ) = ∞.
• When Dmax = ∞, typically we have R(D) = ∞ for all D. Indeed, suppose that d(·, ·) is a metric
(i.e. real-valued and satisfies triangle inequality). Then, for any x0 ∈ An we have
d(X, X̂) ≥ d(X, x0 ) − d(x0 , X̂) .
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
E[d(X, X̂)] ≥ E[d(X, x0 )] − max d(x0 , cj ) = ∞ .
j
So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
the absolute impossibility of compressing such sources; it is just not possible with fixed-length
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second moment. But it is easy to see that1 the information rate-distortion function
R(I) (D) < ∞ for any D ∈ (0, ∞). In fact, in this case R(I) (D) is a hyperbola-like curve that never
touches either axis. Using variable-length codes, Sn can be compressed non-trivially into W with
bounded entropy (but unbounded cardinality) H(W). An interesting question is: Is H(W) =
nR(I) (D) + o(n) attainable?
• Techniques for proving (25.4) for memoryless sources can extended to “stationary ergodic”
sources with changes similar to those we have discussed in lossless compression (Chapter 12).
Before giving a formal proof, we give a heuristic derivation emphasizing the connection to large
deviations estimates from Chapter 15.
25.1.1 Intuition
Let us throw M random points C = {c1 , . . . , cM } into the space Ân by generating them indepen-
dently according to a product distribution QnŜ , where QŜ is some distribution on  to be optimized.
Consider the following simple coding strategy:
Encoder : f(sn ) = argmin d(sn , cj ) (25.5)
j∈[M]
1
Indeed, if we take W to be a quantized version of S with small quantization error D and notice that differential entropy of
the Cauchy S is finite, we get from (24.7) that R(I) (D) ≤ H(W) < ∞.
i i
i i
i i
424
The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword is close to the source realization is (exponentially) small, say,
ϵ. However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codewords exist is approximately (1 −ϵ)M ,
which can be made close to zero provided M 1ϵ .
i.i.d.
To explain this intuition further, consider a discrete memoryless source Sn ∼ PS and let us eval-
uate the excess distortion of this random code: P[d(Sn , f(Sn )) > D], where the probability is over
all random codewords c1 , . . . , cM and the source Sn . Define
where the last equality follows from the assumption that c1 , . . . , cM are iid and independent of Sn .
i.i.d.
To simplify notation, let Ŝn ∼ QnŜ independently of Sn , so that PSn ,Ŝn = PnS QnŜ . Then
To evaluate the failure probability, let us consider the special case of PS = Ber( 12 ) and also
choose QŜ = Ber( 12 ) to generate the random codewords, aiming to achieve a normalized Hamming
P P
distortion at most D < 12 . Since nd(Sn , Ŝn ) = i:si =1 (1 − Ŝi ) + i:si =0 Ŝi ∼ Bin(n, 21 ) for any sn ,
the conditional probability (25.7) does not depend on Sn and is given by
1
P[d(S , Ŝ ) > D|S ] = P Bin n,
n n n
≥ nD ≈ 1 − 2−n(1−h(D))+o(n) , (25.8)
2
where in the last step we applied large-deviation estimates from Theorem 15.9 and Example 15.1.
(Note that here we actually need lower estimates on these exponentially small probabilities.) Thus,
Pfailure = (1 − 2−n(1−h(D))+o(n) )M , which vanishes if M = 2n(1−h(D)+δ) for any δ > 0.2 As we will
compute in Theorem 26.1, the rate-distortion function for PS = Ber( 12 ) is precisely ϕS (D) =
1 − h(D), so we have a rigorous proof of the optimal achievability in this special case.
For general distribution PS (or even for PS = Ber(p) for which it is suboptimal to choose
QŜ as Ber( 21 )), the situation is more complicated as the conditional probability (25.7) depends
on the source realization Sn through its empirical distribution (type). Let Tn be the set of typical
realizations whose empirical distribution is close to PS . We have
2
In fact, this argument shows that M = 2n(1−h(D))+o(n) codewords suffice to cover the entire Hamming space within
distance Dn. See (27.9) and Exercise V.11.
i i
i i
i i
where it can be shown (using large deviations analysis similar to information projection in
Chapter 15) that
Thus we conclude that for any choice of QŜ (from which the random codewords were drawn) and
any δ > 0, the above code with M = 2n(E(QŜ )+δ) achieves vanishing excess distortion
= ϕ S ( D)
where the third equality follows from the variational representation of mutual information (Corol-
lary 4.2). This heuristic derivation explains how the constrained mutual information minimization
arises. Below we make it rigorous using a different approach, again via random coding.
Here the first and the third expectations are over (X, Y) ∼ PX,Y = PX PY|X and the information
density i(·; ·) is defined with respect to this joint distribution (cf. Definition 18.1).
• Theorem 25.2 says that from an arbitrary PY|X such that E[d(X, Y)] ≤ D, we can extract a good
code with average distortion D plus some extra terms which will vanish in the asymptotic regime
for memoryless sources.
• The proof uses the random coding argument with codewords drawn independently from PY , the
marginal distribution induced by the source distribution PX and the auxiliary channel PY|X . As
such, PY|X plays no role in the code construction and is used only in analysis (by defining a
coupling between PX and PY ).
i i
i i
i i
426
• The role of the deterministic y0 is a “fail-safe” codeword (think of y0 as the default reconstruc-
tion with Dmax = E[d(X, y0 )]). We add y0 to the random codebook for “damage control”, to
hedge against the (highly unlikely) event that we end up with a terrible codebook.
Proof. Similar to the intuitive argument sketched in Section 25.1.1, we apply random coding and
generate the codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X
and add the “fail-safe” codeword cM+1 = y0 . We adopt the same encoder-decoder pair (25.5) –
(25.6) and let X̂ = g(f(X)). Then by definition,
To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Chapter 18):
PX,Y,Y = PX,Y PY
where PY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
Ra
E[U] = 0 P[U ≥ u]du. Then the average distortion is
where
i i
i i
i i
• (25.12) uses the following trick in dealing with (1 − δ)M for δ 1 and M 1. First, recall the
standard rule of thumb:
(
0, δ M 1
(1 − δ) ≈
M
1, δ M 1
In order to obtain firm bounds of a similar flavor, we apply, for any γ > 0,
• (25.13) is simply a change of measure argument of Proposition 18.3. Namely we apply (18.5)
with f(x, y) = 1{d(x,y)≤u} .
• For (25.14) consider the chain:
As a side product, we have the following achievability result for excess distortion.
Theorem 25.3 (Random coding bound of excess distortion). For any PY|X , there exists a code
X → W → X̂ with W ∈ [M], such that for any γ > 0,
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y) > D} ∪ {i(X; Y) > log γ}]
Proof. Proceed exactly as in the proof of Theorem 25.2 (without using the extra codeword y0 ),
replace (25.11) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, cj ) > D] = EX [(1 − P[d(X, Y) ≤ D|X])M ],
and continue similarly.
Finally, we give a rigorous proof of Theorem 25.1 by applying Theorem 25.2 to the iid source
i.i.d.
X = Sn ∼ PS and n → ∞:
Proof of Theorem 25.1. Our goal is the achievability: R(D) ≤ R(I) (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] is achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed symbol so that the expectation is finite. The
i i
i i
i i
428
default reconstruction for Sn is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(Sn , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ ; such PŜ|S since D > D0 by
assumption. Apply Theorem 25.2 to (X, Y) = (Sn , Ŝn ) with
PX = PSn
PY|X = PŜn |Sn = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
1X
n
d( X , Y ) = d(Sj , Ŝj )
n
j=1
y0 = ŝn0
E[d(Sn , g(f(Sn )))] ≤ E[d(Sn , Ŝn )] + E[d(Sn , ŝn0 )]e−M/γ + E[d(Sn , ŝn0 )1{i(Sn ;Ŝn )>log γ } ]
≤ D − δ + Dmax e− exp(nδ) + E[d(Sn , ŝn0 )1En ], (25.15)
| {z } | {z }
→0 →0 (later)
where
1 X
n
WLLN
En = {i(Sn ; Ŝn ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n
j=1
If we can show the expectation in (25.15) vanishes, then there exists an (n, M, D)-code with:
M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o( 1) ≤ D.
To summarize, ∀PŜ|S such that E[d(S, Ŝ)] ≤ D −δ we have shown that R(D) ≤ I(S; Ŝ). Sending δ ↓
0, we have, by continuity of ϕS (D) in (D0 ∞) (recall Theorem 24.4), R(D) ≤ ϕS (D−) = ϕS (D).
It remains to show the expectation in (25.15) vanishes. This is a simple consequence of the
uniform integrability of the sequence {d(Sn , ŝn0 )}. We need the following lemma.
Lemma 25.4. For any positive random variable U, define g(δ) = supH:P[H]≤δ E[U1H ], where the
δ→0
supremum is over all events measurable with respect to U. Then3 EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U1H ] ≤ E[U1{U>b} ] + bδ , where E[U1{U> √b}
] −−−→ 0 by dominated
convergence theorem. Then the proof is completed by setting b = 1/ δ .
Pn
Now d(Sn , ŝn0 ) = 1n j=1 Uj , where Uj are iid copies of U ≜ d(S, ŝ0 ). Since E[U] = Dmax < ∞
P
by assumption, applying Lemma 25.4 yields E[d(Sn , ŝn0 )1En ] = 1n E[Uj 1En ] ≤ g(P[En ]) → 0,
since P[En ] → 0. This proves the theorem.
3
In fact, ⇒ is ⇔.
i i
i i
i i
Figure 25.1 Description of channel simulation game. The distribution P (left) is to be simulated via the
distribution Q (right) at minimal rate R. Depending on the exact formulation we either require R = I(A; B)
(covering lemma) or R = C(A; B) (soft-covering lemma).
Remark 25.2 (Fundamental limit for excess distortion). Although Theorem 25.1 is stated for the
average distortion, under certain mild extra conditions, it also holds for excess distortion where
the goal is to achieve d(Sn , Ŝn ) ≤ D with probability arbitrarily close to one as opposed to in
expectation. Indeed, the achievability proof of Theorem 25.1 is already stated in high probability.
For converse, assume in addition to (25.3) that Dp ≜ E[d(S, ŝ)p ]1/p < ∞ for some ŝ ∈ Ŝ and p > 1.
Pn
Applying Rosenthal’s inequality [270, 170], we have E[d(S, ŝn )p ] = E[( i=1 d(Si , ŝ))p ] ≤ CDpp
for some constant C = C(p). Then we can apply Theorem 24.9 to convert a code for excess
distortion to one for average distortion and invoke the converse for the latter.
To end this section, we note that in Section 25.1.1 and in Theorem 25.1 it seems we applied
different proof techniques. How come they both turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the under-
lined probabilities in (25.9) and (25.11), respectively. To obtain the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projec-
tion problem (25.10). When PY = (QŜ )n = (PŜ )n with PŜ chosen as the output distribution in the
solution to rate-distortion optimization (25.1), the resulting exponent is precisely given by 2−i(X;Y) .
i i
i i
i i
430
How large a rate R is required depends on how we excatly understand the requirement to “fool
the tester”. If the tester is fixed ahead of time (this just means that we know the set F such that
i.i.d.
(Ai , Bi ) ∼ PA,B is declared whenever (An , Bn ) ∈ F) then this is precisely the setting in which
covering lemma operates. In the next section we show that a higher rate R = C(A; B) is required
if F is not known ahead of time. We leave out the celebrated theorem of Bennett and Shor [27]
which shows that rate R = I(A; B) is also attainable even if F is not known, but if encoder and
decoder are given access to a source of common random bits (independent of An , of course).
Before proceeding, we note some simple corner cases:
1 If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly and
use that information to produce Bn through PBn |An .
2 If R = H(B), “A side” can generate Bn according to PnA,B and send that Bn sequence to the “B
side”.
3 If A ⊥
⊥ B, we know that R = 0, as “B side” can generate Bn independently.
Our previous argument for achieving the rate-distortion turns out to give a sharp answer (that
R = I(A; B) is sufficient) for the F-known case as follows.
i.i.d.
Theorem 25.5 (Covering Lemma). Fix PA,B and let (Aj , Bj ) ∼ PA,B , R > I(A; B) and C =
{c1 , . . . , cM } where each codeword cj is i.i.d. drawn from distribution PnB . ∀ϵ > 0, for M ≥
2n(I(A;B)+ϵ) we have that: ∀F
Remark 25.3. The origin of the name “covering” is due to the fact that sampling the An space at
rate slightly above I(A; B) covers all of it, in the sense of reproducing the joint statistics of (An , Bn ).
Proof. Set γ > M and following similar arguments of the proof for Theorem 25.2, we have
As we explained, the version of the covering lemma that we stated shows only that for one fixed
test set F. However, if both A and B take values on finite alphabets then something stronger can
be stated.
First, in this case i(An ; Bn ) is a sum of bounded iid terms and thus the o(1) is in fact e−Ω(n) . By
applying the previous result to F = {(an , bn ) : #{i : ai = α, bi = β}} with all possible α ∈ A
and β ∈ B we conclude that for every An there must exist a codeword c such that the empirical
joint distribution (joint type) P̂An ,c satisfies
i i
i i
i i
where δn → 0. Thus, by communicating nR bits we are able to fully reproduce the correct empirical
i.i.d.
distribution as if the output were generated ∼ PA,B .
That this is possible to do at rate R ≈ I(A; B) can be explained combinatorially: To generate
Bn , there are around 2nH(B) high probability sequences; for each An sequence, there are around
2nH(B|A) Bn sequences that have the same joint distribution, therefore, it is sufficient to describe
nH(B)
the class of Bn for each An sequence, and there are around 22nH(B|A) = 2nI(A;B) classes.
Let us now denote the selected codeword c by B̂n . From the previous discussion we have shown
that
1X
n
f(Aj , B̂j ) ≈ EA,B∼PA,B [f(A, B)] ,
n
j=1
for any bounded function f. A stronger requirement would be to demand that the joint distribution
PAn ,B̂n fools any permutation invariant tester, i.e.
sup |PAn ,B̂n (F) − PnA,B (F)| → 0
where the supremum is taken over all permutation invariant subset F ⊂ An × B n . This is not
guaranteed by the covering lemma. Indeed, a sufficient statistic for a permutation invariant tester
is a joint type P̂An ,B̂n . The construction above satisfies P̂An ,B̂n ≈ PA,B , but it might happen that
P̂An ,B̂n although close to PA,B still takes highly unlikely values (for example, if we restrict all c to
have the same composition P0 , the tester can easily detect the problem since PnB -measure of all
√
strings of composition P0 cannot exceed O(1/ n)). Formally, to fool permutation invariant tester
we need to have small total variation between the distribution on the joint types under P and Q.
We conjecture, however, that nevertheless the rate R = I(A; B) should be sufficient to achieve
also this stronger requirement. In the next section we show that if one removes the permutation-
invariance constraint, then a larger rate R = C(A; B) is needed.
Theorem 25.6 (Cuff [85]). Let PA,B be an arbitrary distribution on the finite space A×B . Consider
i.i.d.
a coding scheme where Alice observes An ∼ PnA , sends a message W ∈ [2nR ] to Bob, who given
W generates a (possibly random) sequence B̂n . If (25.16) is satisfied for all ϵ > 0 and sufficiently
large n, then we must have
R ≥ C(A; B) ≜ min I(A, B; U) , (25.17)
A→U→B
i i
i i
i i
432
where C(A; B) is known as the Wyner’s common information [336]. Furthermore, for any R >
C(A; B) and ϵ > 0 there exists n0 (ϵ) such that for all n ≥ n0 (ϵ) there exists a scheme
satisfying (25.16).
Note that condition (25.16) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , Bn ) with probability ≥ 1 −ϵ. However, compared to Theorem 25.5,
this requires a higher communication rate since C(A; B) ≥ I(A; B), clearly.
Proof. Showing that Wyner’s common information is a lower-bound is not hard. First, since
PAn ,B̂n ≈ PnA,B (in TV) we have
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P) − H(Q)
with TV(P, Q), cf. (7.18) and Corollary 6.8). Next, we have
≳ nC(A; B) (25.21)
At → W → B̂t
and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B .
To show achievability, let us notice that the problem is equivalent to constructing three random
variables (Ân , W, B̂n ) such that a) W ∈ [2nR ], b) the Markov relation
holds and c) TV(PÂn ,B̂n , PnA,B ) ≤ ϵ/2. Indeed, given such a triple we can use coupling charac-
terization of TV (7.18) and the fact that TV(PÂn , PnA ) ≤ ϵ/2 to extend the probability space
to
An → Ân → W → B̂n
and P[An = Ân ] ≥ 1 − ϵ/2. Again by (7.18) we conclude that TV(PAn ,B̂n , PÂn ,B̂n ) ≤ ϵ/2 and by
triangle inequality we conclude that (25.16) holds.
Finally, construction of the triple satisfying a)-c) follows from the soft-covering lemma
(Corollary 25.8) applied with V = (A, B) and W being uniform on the set of xi ’s there.
i i
i i
i i
Theorem 25.7. Fix PX,Y and for any λ ∈ R let us define Rényi mutual information
Iλ (X; Y) = Dλ (PX,Y kPX PY ) ,
where Dλ is the Rényi-divergence, cf. Definition 7.22. We have for every 1 < λ ≤ 2
1
E[D(PY|X ◦ P̂n kPY )] ≤ log(1 + exp{(λ − 1)(Iλ (X; Y) − log n)}) . (25.23)
λ−1
Note that conditioned on Y we get to analyze a λ-th moment of a sum of iid random variables.
This puts us into a well-known setting of Rosenthal-type inequalities. In particular, we have that
for any iid non-negative Bj we have, provided 1 ≤ λ ≤ 2, that
!λ
X n
E Bi ≤ n E[Bλ ] + (n E[B])λ . (25.26)
i=1
i i
i i
i i
434
This is known to be essentially tight [273]. It can be proven by applying (a + b)λ−1 ≤ aλ−1 + bλ−1
and Jensen’s to get
X
E Bi (Bi + Bj )λ−1 ≤ E[Bλ ] + E[B]((n − 1) E[B])λ−1 .
j̸=i
which implies
1
Iλ (Xn ; Ȳ) ≤ log 1 + n1−λ exp{(λ − 1)Iλ (X; Y)} ,
λ−1
which together with (25.24) recovers the main result (25.23).
Remark 25.4. Hayashi [158] upper bounds the LHS of (25.23) with
λ λ−1
log(1 + exp{ (Kλ (X; Y) − log n)}) ,
λ−1 λ
where Kλ (X; Y) = infQY Dλ (PX,Y kPX QY ) is the so-called Sibson-Csiszár information, cf. [244].
This bound, however, does not have the right rate of convergence as n → ∞, at least for λ = 1 as
comparison with Proposition 7.15 reveals.
We note that [158, 154] also contain direct bounds on
E[TV(PY|X ◦ P̂n , PY )]
P
which do not assume existence of λ-th moment of PYY|X for λ > 1 and instead rely on the distribution
of i(X; Y). We do not discuss these bounds here, however, since for the purpose of discussing finite
alphabets the next corollary is sufficient.
1X
n
D( PY|X=xi kPY ) ≤ exp{−dϵ}
n
i=1
as d → ∞.
Remark 25.5. The origin of the name “soft-covering” is due to the fact that unlike the covering
lemma (Theorem 25.5) which selects one xi (trying to make PY|X=xi as close to PY as possible)
here we mix over n choices uniformly.
i i
i i
i i
i i
i i
i i
In the previous chapters we have proved Shannon’s main theorem for lossy data compression: For
stationary memoryless (iid) sources and separable distortion, under the assumption that Dmax < ∞,
the operational and information rate-distortion functions coincide, namely,
R(D) = R(I) (D) = inf I(S; Ŝ).
PŜ|S :Ed(S,Ŝ)≤D
In addition, we have shown various properties about the rate-distortion function (cf. Theorem 24.4).
In this chapter we compute the rate-distortion function for several important source distributions by
evaluating this constrained minimization of mutual information. Next we extending the paradigm
of joint source-channel coding in Section 19.7 to the lossy setting; this reasoning will later be
found useful in statistical applications in Part VI (cf. Chapter 30).
Theorem 26.1.
R(D) = (h(p) − h(D))+ . (26.1)
For example, when p = 1/2, D = .11, we have R(D) ≈ 1/2 bits. In the Hamming game
described in Section 24.2 where we aim to compress 100 bits down to 50, we indeed can do this
while achieving 11% average distortion, compared to the naive scheme of storing half the string
and guessing on the other half, which achieves 25% average distortion.
Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to
show.
For the converse, consider any PŜ|S such that P[S 6= Ŝ] ≤ D ≤ p ≤ 21 . Then
436
i i
i i
i i
In order to achieve this bound, we need to saturate the above chain of inequalities, in particular,
choose PŜ|S so that the difference S + Ŝ is independent of Ŝ. Let S = Ŝ + Z, where Ŝ ∼ Ber(p′ ) ⊥
⊥
Z ∼ Ber(D), and p′ is such that the convolution gives exactly Ber(p), namely,
p′ ∗ D = p′ (1 − D) + (1 − p′ )D = p,
p−D
i.e., p′ = 1−2D . In other words, the backward channel PS|Ŝ is exactly BSC(D) and the resulting
PŜ|S is our choice of the forward channel PŜ|S . Then, I(S; Ŝ) = H(S) − H(S|Ŝ) = H(S) − H(Z) =
h(p) − h(D), yielding the upper bound R(D) ≤ h(p) − h(D).
Remark 26.1. Here is a more general strategy (which we will later implement in the Gaussian
case.) Denote the optimal forward channel from the achievability proof by P∗Ŝ|S and P∗S|Ŝ the asso-
ciated backward channel (which is BSC(D)). We need to show that there is no better PŜ|S with
P[S 6= Ŝ] ≤ D and a smaller mutual information. Then
Remark 26.2. By WLLN, the distribution PnS = Ber(p)n concentrates near the Hamming sphere of
radius np as n grows large. Recall that in proving Shannon’s rate distortion theorem, the optimal
codebook are drawn independently from PnŜ = Ber(p′ )n with p′ = 1p−−2D D
. Note that p′ = 1/2 if
′
p = 1/2 but p < p if p < 1/2. In the latter case, the reconstruction points concentrate on a smaller
sphere of radius np′ and none of them are typical source realizations, as illustrated in Fig. 26.1.
Theorem 26.2. Let S ∼ N (0, σ 2 ) and d(s, ŝ) = (s − ŝ)2 for s, ŝ ∈ R. Then
1 σ2
R ( D) = log+ . (26.2)
2 D
i i
i i
i i
438
S(0, np)
S(0, np′ )
Hamming Spheres
Figure 26.1 Source realizations (solid sphere) versus codewords (dashed sphere) in compressing Hamming
sources.
d dσ 2
R(D) = log+ . (26.3)
2 D
Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to
show.
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D, and the forward channel can be easily
found to be PŜ|S = N ( σ σ−2 D S, σ σ−2 D D). Then
2 2
1 σ2 1 σ2
I(S; Ŝ) = log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Formally, we can mimic the proof of Theorem 26.1 replacing Shannon entropy by
the differential entropy and applying the maximal entropy result from Theorem 2.7; the caveat is
that for Ŝ (which may be discrete) the differential entropy may not be well-defined. As such, we
follow the alternative proof given in Remark 26.1. Let PŜ|S be any conditional distribution such
that EP [(S − Ŝ)2 ] ≤ D. Denote the forward channel in the above achievability by P∗Ŝ|S . Then
" #
P∗S|Ŝ
I(PS , PŜ|S ) = D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log
PS
" #
P∗S|Ŝ
≥ EP log
PS
(S−Ŝ)2
√ 1 e− 2D
= EP log 2πD
S2
√ 1
2π σ 2
e− 2 σ 2
i i
i i
i i
" #
1 σ2 log e S2 (S − Ŝ)2
= log + EP −
2 D 2 σ2 D
1 σ2
≥ log .
2 D
Finally, for the vector case, (26.3) follows from (26.2) and the same single-letterization argu-
ment in Theorem 24.8 using the convexity of the rate-distortion function in Theorem 24.4(a).
The interpretation of the optimal reconstruction points in the Gaussian case is analogous to that
of the Hamming source previously
√ discussed in Remark 26.2: As n grows, the Gaussian random
vector concentrates on S(0, nσ 2 ) (n-sphere in Euclideanp space rather than Hamming), but each
reconstruction point drawn from (P∗Ŝ )n is close to S(0, n(σ 2 − D)). So again the picture is similar
to Fig. 26.1 of two nested spheres.
Note that the exact expression in Theorem 26.2 relies on the Gaussianity assumption of the
source. How sensitive is the rate-distortion formula to this assumption? The following comparison
result is a counterpart of Theorem 20.12 for channel capacity:
Theorem 26.3. Assume that ES = 0 and Var S = σ 2 . Consider the MSE distortion. Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D
Remark 26.3. A simple consequence of Theorem 26.3 is that for source distributions with a den-
sity, the rate-distortion function grows according to 12 log D1 in the low-distortion regime as long as
D(PS kN (0, σ 2 )) is finite. In fact, the first inequality, known as the Shannon lower bound (SLB),
is asymptotically tight, in the sense that
1 σ2
R(D) = log − D(PS kN (0, σ 2 )) + o(1), D → 0 (26.4)
2 D
under appropriate conditions on PS [200, 177]. Therefore, by comparing (2.21) and (26.4), we
see that, for small distortion, uniform scalar quantization (Section 24.1) is in fact asymptotically
optimal within 12 log(2πe) ≈ 2.05 bits.
Later in Section 30.1 we will apply SLB to derive lower bounds for statistical estimation. For
this we need the following general version of SLB (see Exercise V.6 for a proof): Let k · k be an
arbitrary norm on Rd and r > 0. Let X be a d-dimensional continuous random vector with finite
differential entropy h(X). Then
d d d
inf I(X; X̂) ≥ h(X) + log − log Γ +1 V , (26.5)
PX̂|X :E[∥X̂−X∥r ]≤D r Dre r
distortion function:
R(D) ≤ I(PS , P∗Ŝ|S )
i i
i i
i i
440
σ2 − D σ2 − D
= I(S; S + W) W ∼ N ( 0, D)
σ2 σ2
σ2 − D
≤ I(SG ; SG + W ) by Gaussian saddle point (Theorem 5.11)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let P∗S|Ŝ = N (Ŝ, D) denote the AWGN channel
with noise power D. Then
I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )
" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log − D(PS kPSG )
PSG
(S−Ŝ)2
√ 1 e− 2D
≥ EP log 2πD
S2
− D(PS kPSG )
√ 1
2π σ 2
e− 2 σ 2
1 σ2
≥ log − D(PS kPSG ).
2 D
In fact there are also iterative algorithms (Blahut-Arimoto) that computes R(D). However, for
the peace of mind it is good to know there are some general reasons why tricks like we used in
Hamming/Gaussian actually are guaranteed to work.
Theorem 26.4.
1 Suppose PY∗ and PX|Y∗ PX are such that E[d(X, Y∗ )] ≤ D and for any PX,Y with E[d(X, Y)] ≤
D we have
dPX|Y∗
E log (X|Y) ≥ I(X; Y∗ ) . (26.6)
dPX
Then R(D) = I(X; Y∗ ).
i i
i i
i i
2 Suppose that I(X; Y∗ ) = R(D). Then for any regular branch of conditional probability PX|Y∗
and for any PX,Y satisfying
• E[d(X, Y)] ≤ D and
• PY PY∗ and
• I(X; Y) < ∞
the inequality (26.6) holds.
Remarks:
1 The first part is a sufficient condition for optimality of a given PXY∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PX,Y satisfying those conditions is rich enough to infer from (26.6):
dPX|Y∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.
2 Note that the second part is not valid without assuming PY PY∗ . A counterexample to this
and various other erroneous (but frequently encountered) generalizations is the following: A =
{0, 1}, PX = Bern(1/2), Â = {0, 1, 0′ , 1′ } and
The R(D) = |1 − h(D)|+ , but there exist multiple non-equivalent optimal choices of PY|X , PX|Y
and PY .
Proof. The first part is just a repetition of the proofs above for the Hamming and Gaussian case,
so we focus on the second part. Suppose there exists a counterexample PX,Y achieving
dPX|Y∗
I1 = E log (X|Y) < I∗ = R(D) .
dPX
Notice that whenever I(X; Y) < ∞ we have
and thus
Before going to the actual proof, we describe the principal idea. For every λ we can define a joint
distribution
i i
i i
i i
442
PX|Y∗ (X|Yλ )
= D(PX|Yλ kPX|Y∗ |PYλ ) + E (26.9)
PX
= D(PX|Yλ kPX|Y∗ |PYλ ) + λI1 + (1 − λ)I∗ . (26.10)
From here we will conclude, similar to Proposition 2.18, that the first term is o(λ) and thus for
sufficiently small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ 1 ( y) ≜ ( y) (26.11)
dPY∗
λρ1 (y)
λ(y) ≜ (26.12)
λρ1 (y) + λ̄
(λ)
PX|Y=y = λ(y)PX|Y=y + λ̄(y)PX|Y∗ =y (26.13)
dPYλ = λdPY + λ̄dPY∗ = (λρ1 (y) + λ̄)dPY∗ (26.14)
D(y) = D(PX|Y=y kPX|Y∗ =y ) (26.15)
(λ)
D λ ( y) = D(PX|Y=y kPX|Y∗ =y ) . (26.16)
Notice:
Dλ (y) ≤ λ(y)D(y)
and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (26.7) the function ρ1 (y)D(y) is non-negative and PY∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY∗ Dλ (y)ρ1 (y) = dPY∗ ρ1 (y) lim D λ ( y) = 0 (26.17)
λ→0 {ρ >0}
1
λ( y ) {ρ1 >0} λ→ 0 λ( y)
i i
i i
i i
Z
1 1 (λ)
= dPYλ Dλ (y) = D(PX|Y kPX|Y∗ |PYλ ) ,
λ Y λ
(26.20)
where in the penultimate step we used Dλ (y) = 0 on {ρ1 = 0}. Hence, (26.17) shows
(λ)
D(PX|Y kPX|Y∗ |PYλ ) = o(λ) , λ → 0.
Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
(λ) dPX|Y∗ dPX|Y∗ ∗
I ( X ; Yλ ) = D(PX|Y kPX|Y∗ |PYλ ) + λ E log (X|Y) + λ̄ E log (X|Y ) (26.21)
dPX dPX
= I∗ + λ(I1 − I∗ ) + o(λ) , (26.22)
I(X; Yλ ) ≥ I∗ = R(D) .
Such a pair (f, g) is called a (k, n, D)-JSCC, which transmits k symbols over n channel uses such
that the end-to-end distortion is at most D in expectation. Our goal is to optimize the encoder/de-
coder pair so as to maximize the transmission rate (number of symbols per channel use) R = nk .1
As such, we define the asymptotic fundamental limit as
1
RJSCC (D) ≜ lim inf max {k : ∃(k, n, D)-JSCC} .
n→∞ n
1
Or equivalently, minimize the bandwidth expansion factor ρ = nk .
i i
i i
i i
444
To simplify the exposition, we will focus on JSCC for a stationary memoryless source Sk ∼ P⊗S
k
⊗n
transmitted over a stationary memoryless channel PYn |Xn = PY|X subject to a separable distortion
Pk
function d(sk , ŝk ) = 1k i=1 d(si , ŝi ).
26.3.1 Converse
The converse for the JSCC is quite simple, based on data processing inequality and following the
weak converse of lossless JSCC using Fano’s inequality.
The interpretation of this result is clear: Since we need at least R(D) bits per symbol to recon-
struct the source up to a distortion D and we can transmit at most C bits per channel use, the overall
transmission rate cannot exceeds C/R(D). Note that the above theorem clearly holds for channels
with cost constraint with the corresponding capacity (Chapter 20).
Proof. Consider a (k, n, D)-code which induces the Markov chain Sk → Xn → Yn → Ŝk such
Pk
that E[d(Sk , Ŝk )] = 1k i=1 E[d(Si , Ŝi )] ≤ D. Then
( a) (b) ( c)
kR(D) = inf I(Sk ; Ŝk ) ≤ I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC
PŜk |Sk :E[d(Sk ,Ŝk )]≤D P Xn
where (b) applies data processing inequality for mutual information, (a) and (c) follow from the
respective single-letterization result for lossy compression and channel coding (Theorem 24.8 and
Proposition 19.10).
Remark 26.4. Consider the case where the source is Ber(1/2) with Hamming distortion. Then
Theorem 26.5 coincides with the converse for channel coding under bit error rate Pb in (19.33):
k C
R= ≤
n 1 − h(Pb )
which was previously given in Theorem 19.21 and proved using ad hoc techniques. In the case of
channel with cost constraints, e.g., the AWGN channel with C(SNR) = 12 log(1 + SNR), we have
C(SNR)
Pb ≥ h−1 1 −
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical
codes. (See, e.g., Fig. 2 from [263] for BIAWGN (binary-input) channel.) This is erroneous, since
the pb above refers to the bit error rate of data bits (or systematic bits), not all of the codeword bits.
The latter quantity is what typically called BER (see (19.33)) in the coding-theoretic literature.
i i
i i
i i
Theorem 26.6. For any stationary memoryless source (PS , S, Ŝ, d) with rate-distortion function
R(D) satisfying Assumption 26.1 (below), and for any stationary memoryless channel PY|X with
capacity C,
C
RJSCC (D) = .
R ( D)
Assumption 26.1 on the source (which is rather technical and can be skipped in the first reading)
is to control the distortion incurred by the channel decoder making an error. Despite this being a
low-probability event, without any assumption on the distortion metric, we cannot say much about
its contribution to the end-to-end average distortion. (Note that this issue does not arise in lossless
JSCC). Assumption 26.1 is trivially satisfied by bounded distortion (e.g., Hamming), and can be
shown to hold more generally such as for Gaussian sources and MSE distortion.
• Let (fs , gs ) be a (k, 2kR(D)+o(k) , D)-code for compressing Sk such that E[d(Sk , gs (fs (Sk )] ≤ D.
By Lemma 26.8 (below), we may assume that all reconstruction points are not too far from
some fixed string, namely,
for all i and some constant L, where sk0 = (s0 , . . . , s0 ) is from Assumption 26.1 below.
• Let (fc , gc ) be a (n, 2nC+o(n) , ϵn )max -code for channel PYn |Xn such that kR(D) + o(k) ≤ nC +
o(n) and the maximal probability of error ϵn → 0 as n → ∞. Such as code exists thanks to
Theorem 19.9 and Corollary 19.5.
Let the JSCC encoder and decoder be f = fc ◦ fs and g = gs ◦ gc . So the overall system is
fs fc gc gs
Sk − → Xn −→ Yn −
→W− → Ŵ −
→ Ŝk .
Note that here we need to control the maximal probability of error of the channel code since
when we concatenate these two schemes, W at the input of the channel is the output of the source
compressor, which need not be uniform.
i i
i i
i i
446
To analyze the average distortion, we consider two cases depending on whether the channel
decoding is successful or not:
By assumption on our lossy code, the first term is at most D. For the second term, we have P[W 6=
Ŵ] ≤ ϵn = o(1) by assumption on our channel code. Then
( a)
E[d(Sk , gs (Ŵ))1{W 6= Ŵ}] ≤ E[1{W 6= Ŵ}λ(d(Sk , ŝk0 ) + d(sk0 , gs (Ŵ)))]
(b)
≤ λ · E[1{W 6= Ŵ}d(Sk , ŝk0 )] + λL · P[W 6= Ŵ]
( c)
= o(1),
where (a) follows from the generalized triangle inequality from Assumption 26.1(a) below; (b)
follows from (26.23); in (c) we apply Lemma 25.4 that were used to show the vanishing of the
expectation in (25.15) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for all R >
C/R(D), there exists a sequence of (k, n, D + o(1))-JSCC codes.
Assumption 26.1. Fix D. For a source (PS , S, Ŝ, d), there exists λ ≥ 0, s0 ∈ S, ŝ0 ∈ Ŝ such that
(a) Generalized triangle inequality: d(s, ŝ) ≤ λ(d(s, ŝ0 ) + d(s0 , â)) ∀a, â.
(b) E[d(S, ŝ0 )] < ∞ (so that Dmax < ∞ too).
(c) E[d(s0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D).
(d) d(s0 , ŝ0 ) < ∞.
The interpretation of this assumption is that the spaces S and Ŝ have “nice centers” s0 and ŝ0 ,
in the sense that the distance between any two points is upper bounded by a constant times the
distance from the centers to each point (see figure below).
b
b
s ŝ
b b
s0 ŝ0
S Ŝ
Note that Assumption 26.1 is not straightforward to verify. Next we give some more convenient
sufficient conditions. First of all, Assumption 26.1 holds automatically for bounded distortion
i i
i i
i i
function. In other words, for a discrete source on a finite alphabet S , a finite reconstruction alphabet
Ŝ , and a finite distortion function d(s, ŝ) < ∞, Assumption 26.1 is fulfilled. More generally, we
have the following criterion.
Theorem 26.7. If S = Ŝ and d(s, ŝ) = ρ(s, ŝ)q for some metric ρ and q ≥ 1, and Dmax ≜
infŝ0 E[d(S, ŝ0 )] < ∞, then Assumption 26.1 holds.
Proof. Take s0 = ŝ0 that achieves a finite Dmax = E[d(S, ŝ0 )]. (In fact, any points can serve as
centers in a metric space). Applying triangle inequality and Jensen’s inequality, we have
q q
1 1 1 1 1
ρ(s, ŝ) ≤ ρ(s, s0 ) + ρ(s0 , ŝ) ≤ ρq (s, s0 ) + ρq (s0 , ŝ).
2 2 2 2 2
Thus d(s, ŝ) ≤ 2q−1 (d(s, s0 ) + d(s0 , ŝ)). Taking λ = 2q−1 verifies (a) and (b) in Assumption 26.1.
To verify (c), we can apply this generalized triangle inequality to get d(s0 , Ŝ) ≤ 2q−1 (d(s0 , S) +
d(S, Ŝ)). Then taking the expectation of both sides gives
So we see that metrics raised to powers (e.g. squared norms) satisfy Assumption 26.1. Finally,
we give the lemma used in the proof of Theorem 26.6.
Lemma 26.8. Fix a source satisfying Assumption 26.1 and an arbitrary PŜ|S . Let R > I(S; Ŝ),
L > max{E[d(s0 , Ŝ)], d(s0 , ŝ0 )} and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such
that d(sk0 , ŝk ) ≤ L for every reconstruction point ŝk , where sk0 = (s0 , . . . , s0 ).
For any D′ ∈ (E[d(S, Ŝ)], D), there exist M = 2kR reconstruction points (c1 , . . . , cM ) such that
P min d(S , cj ) > D ≤ P[d1 (Sk , Ŝk ) > D′ ] + o(1),
k ′
j∈[M]
P[d1 (S, Ŝ) > D′ ] ≤ P[d(Sk , Ŝk ) > D′ ] + P[d(sk0 , Ŝk ) > L] → 0
i i
i i
i i
448
as k → ∞ (since E[d(S, Ŝ)] < D′ and E[s0 , Ŝ] < L). Thus we have
′
P min d(S , cj ) > D
k
→0
j∈[M]
and d(sk0 , cj ) ≤ L. Finally, by adding another reconstruction point cM+1 = ŝk0 = (ŝ0 , . . . , ŝ0 ) we
get
h i h i
E min d(Sk , cj ) ≤ D′ + E d(Sk , ŝk0 )1{minj∈[M] d(Sk ,cj )>D′ } = D′ + o(1) ,
j∈[M+1]
where the last estimate follows from the same argument that shows the vanishing of the expectation
in (25.15). Thus, for sufficiently large n the expected distortion is at most D, as required.
i i
i i
i i
As a function of δ the resulting destortion (at large blocklength) will look like the solid and
dashed lines in this graph:
We can see that below δ < δ ∗ the separated solution is much preferred since it achieves zero
distortion. But at δ > δ ∗ it undergoes a catastrophic failure and distortion becomes 1/2 (that is,
we observe pure noise). At the same time the simple “uncoded” JSCC has its distortion decreasing
gracefully. It has been a long-standing problem since the early days of information theory to find
schemes that would interpolate between these two extreme solutions.
Even theoretically the problem of JSCC still contains great many mysteries. For example, in
Section 22.5 we described refined expansion of the channel coding rate as a function of block-
length. However, similar expansions for the JSCC are not available. In fact, even showing that
√
convergence of the nk to the ultimate limit of R(CD) happens at the speed of Θ(1/ n) has only been
demonstrated recently [178] and only for one special case (of a binary source and BSCδ channel
as in the example above).
i i
i i
i i
27 Metric entropy
In the previous chapters of this part we discussed optimal quantization of random vectors in both
fixed and high dimensions. Complementing this average-case perspective, the topic of this chapter
is on the deterministic (worst-case) theory of quantization. The main object of interest is the metric
entropy of a set, which allows us to answer two key questions (a) covering number: the minimum
number of points to cover a set up to a given accuracy; (b) packing number: the maximal number
of elements of a given set with a prescribed minimum pairwise distance.
The foundational theory of metric entropy were put forth by Kolmogorov, who, together with
his students, also determined the behavior of metric entropy in a variety of problems for both finite
and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s 13th
problem, which concerns the possibility or impossibility of representing multi-variable functions
as compositions of functions of fewer variables. It turns out that the theory of metric entropy can
provide a surprisingly simple and powerful resolution to such problems. Over the years, metric
entropy has found numerous connections to and applications in other fields such as approximation
theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In particular, metric entropy will be featured prominently in Part VI of this book, wherein we
discuss its applications to proving both lower and upper bounds for statistical estimation.
This chapter is organized as follows. Section 27.1 provides basic definitions and explains the
fundamental connections between covering and packing numbers. In Section 27.2 we study met-
ric entropy in finite-dimensional spaces and a popular approach for bounding the metric entropy
known as the volume bound. To demonstrate the limitations of the volume method and the associ-
ated high-dimensional phenomenon, in Section 27.3 we discuss a few other approaches through
concrete examples. Infinite-dimensional spaces are treated next for smooth functions in Sec-
tion 27.4 (wherein we also discuss the application to Hilbert’s 13th problem) and Hilbert spaces in
Section 27.5 (wherein we also discuss the application to empirical processes). Section 27.6 gives
an exposition of the connections between metric entropy and the small-ball problem in probabil-
ity theory. Finally, in Section 27.7 we circle back to rate-distortion theory and discuss how it is
related to metric entropy and how information-theoretic methods can be useful for the latter.
450
i i
i i
i i
ϵ
≥ϵ
Θ Θ
Upon defining ϵ-covering and ϵ-packing, a natural question concerns the size of the optimal
covering and packing, leading to the definition of covering and packing numbers:
with min ∅ understood as ∞; we will sometimes abbreviate these as N(ϵ) and M(ϵ) for brevity.
Similar to volume and width, covering and packing numbers provide a meaningful measure for
the “massiveness” of a set. The major focus of this chapter is to understanding their behavior in
both finite and infinite-dimensional spaces as well as their statistical applications.
Some remarks are in order.
1
Notice we imposed strict inequality for convenience.
i i
i i
i i
452
Remark 27.1. Unlike the packing number M(Θ, d, ϵ), the covering number N(Θ, d, ϵ) defined in
(27.1) depends implicitly on the ambient space V ⊃ Θ, since, per Definition 27.1), an ϵ-covering
is required to be a subset of V rather than Θ. Nevertheless, as the next Theorem 27.2 shows, this
dependency on V has almost no effect on the behavior of the covering number.
As an alternative to (27.1), we can define N′ (Θ, d, ϵ) as the size of the minimal ϵ-covering of Θ
that is also a subset of Θ, which is closely related to the original definition as
Here, the left inequality is obvious. To see the right inequality,2 let {θ1 , . . . , θN } be an 2ϵ -covering
of Θ. We can project each θi to Θ by defining θi′ = argminu∈Θ d(θi , u). Then {θ1′ , . . . , θN′ } ⊂ Θ
constitutes an ϵ-covering. Indeed, for any θ ∈ Θ, we have d(θ, θi ) ≤ ϵ/2 for some θi . Then
d(θ, θi′ ) ≤ d(θ, θi ) + d(θi , θi′ ) ≤ 2d(θ, θi ) ≤ ϵ. On the other hand, the N′ covering numbers need
not be monotone with respect to set inclusion.
The relation between the covering and packing numbers is described by the following funda-
mental result.
Proof. To prove the right inequality, fix a maximal packing E = {θ1 , ..., θM }. Then ∀θ ∈ Θ\E,
∃i ∈ [M], such that d(θ, θi ) ≤ ϵ (for otherwise we can obtain a bigger packing by adding θ). Hence
E must an ϵ-covering (which is also a subset of Θ). Since N(Θ, d, ϵ) is the minimal size of all
possible coverings, we have M(Θ, d, ϵ) ≥ N(Θ, d, ϵ).
We next prove the left inequality by contradiction. Suppose there exists a 2ϵ-packing
{θ1 , ..., θM } and an ϵ-covering {x1 , ..., xN } such that M ≥ N + 1. Then by the pigeonhole prin-
ciple, there exist distinct θi and θj belonging to the same ϵ-ball B(xk , ϵ). By triangle inequality,
d(θi , θj ) ≤ 2ϵ, which is a contradiction since d(θi , θj ) > 2ϵ for a 2ϵ-packing. Hence the size of any
2ϵ-packing is at most that of any ϵ-covering, that is, M(Θ, d, 2ϵ) ≤ N(Θ, d, ϵ).
The significance of (27.4) is that it shows that the small-ϵ behavior of the covering and packing
numbers are essentially the same. In addition, the right inequality therein, namely, N(ϵ) ≤ M(ϵ),
deserves some special mention. As we will see next, it is oftentimes easier to prove negative
results (lower bound on the minimal covering or upper bound on the maximal packing) than pos-
itive results which require explicit construction. When used in conjunction with the inequality
N(ϵ) ≤ M(ϵ), these converses turn into achievability statements,3 leading to many useful bounds
on metric entropy (e.g. the volume bound in Theorem 27.3 and the Gilbert-Varshamov bound
2
Another way to see this is from Theorem 27.2: Note that the right inequality in (27.4) yields a ϵ-covering that is included
in Θ. Together with the left inequality, we get N′ (ϵ) ≤ M(ϵ) ≤ N(ϵ/2).
3
This is reminiscent of duality-based argument in optimization: To bound a minimization problem from above, instead of
constructing an explicit feasible solution, a fruitful approach is to equate it with the dual problem (maximization) and
bound this maximum from above.
i i
i i
i i
Theorem 27.5 in the next section). Revisiting the proof of Theorem 27.2, we see that this logic
actually corresponds to a greedy construction (greedily increase the packing until no points can
be added).
Proof. To prove (a), consider an ϵ-covering Θ ⊂ ∪Ni=1 B(θi , ϵ). Applying the union bound yields
XN
vol(Θ) ≤ vol ∪Ni=1 B(θi , ϵ) ≤ vol(B(θi , ϵ)) = Nϵd vol(B),
i=1
where the last step follows from the translation-invariance and scaling property of volume.
To prove (b), consider an ϵ-packing {θ1 , . . . , θM } ⊂ Θ such that the balls B(θi , ϵ/2) are disjoint.
M(ϵ)
Since ∪i=1 B(θi , ϵ/2) ⊂ Θ + 2ϵ B, taking the volume on both sides yields
ϵ ϵ
vol Θ + B ≥ vol ∪M i=1 B(θi , ϵ/2) = Mvol B .
2 2
This proves (b).
Finally, (c) follows from the following two statements: (1) if ϵB ⊂ Θ, then Θ + 2ϵ B ⊂ Θ + 21 Θ;
and (2) if Θ is convex, then Θ+ 12 Θ = 32 Θ. We only prove (2). First, ∀θ ∈ 32 Θ, we have θ = 13 θ+ 32 θ,
where 13 θ ∈ 12 Θ and 32 θ ∈ Θ. Thus 32 Θ ⊂ Θ + 12 Θ. On the other hand, for any x ∈ Θ + 12 Θ, we
have x = y + 21 z with y, z ∈ Θ. By the convexity of Θ, 23 x = 23 y + 31 z ∈ Θ. Hence x ∈ 23 Θ, implying
Θ + 21 Θ ⊂ 32 Θ.
Remark 27.2. Similar to the proof of (a) in Theorem 27.3, we can start from Θ + 2ϵ B ⊂
∪Ni=1 B(θi , 32ϵ ) to conclude that
N(Θ, k · k, ϵ)
(2/3)d ≤ ≤ 2d .
vol(Θ + 2ϵ B)/vol(ϵB)
In other words, the volume of the fattened set Θ + 2ϵ determines the metric entropy up to constants
that only depend on the dimension. We will revisit this reasoning in Section 27.6 to adapt the
volumetric estimates to infinite dimensions where this fattening step becomes necessary.
i i
i i
i i
454
Corollary 27.4 (Metric entropy of balls and spheres). Let k · k be an arbitrary norm on Rd . Let
B ≡ B∥·∥ = {x ∈ Rd : kxk ≤ 1} and S ≡ S∥·∥ = {x ∈ Rd : kxk ≤ 1} be the corresponding unit
ball and unit sphere. Then for ϵ < 1,
d d
1 2
≤ N(B, k · k, ϵ) ≤ 1 + (27.5)
ϵ ϵ
d−1 d−1
1 1
≤ N(S, k · k, ϵ) ≤ 2d 1 + (27.6)
2ϵ ϵ
where the left inequality in (27.6) holds under the extra assumption that k · k is an absolute norm
(invariant to sign changes of coordinates).
Proof. For balls, the estimate (27.5) directly follows from Theorem 27.3 since B + 2ϵ B = (1 + 2ϵ )B.
Next we consider the spheres. Applying (b) in Theorem 27.3 yields
vol(S + ϵB) vol((1 + ϵ)B) − vol((1 − ϵ)B)
N(S, k · k, ϵ) ≤ M(S, k · k, ϵ) ≤ ≤
vol(ϵB) vol(ϵB)
Z ϵ d−1
(1 + ϵ) − (1 − ϵ)
d d
d d−1 1
= = d (1 + x) dx ≤ 2d 1 + .
ϵd ϵ −ϵ ϵ
where the third inequality applies S + ϵB ⊂ ((1 + ϵ)B)\((1 − ϵ)B) by triangle inequality.
Finally, we prove the lower bound in (27.6) for an absolute norm k · k. To this end one cannot
directly invoke the lower bound in Theorem 27.3 as the sphere has zero volume. Note that k · k′ ≜
k(·, 0)k defines a norm on Rd−1 . We claim that every ϵ-packing in k · k′ for the unit k · k′ -ball
induces an ϵ-packing in k · k for the unit k · k-sphere. Fix x ∈ Rd−1 such that k(x, 0)k ≤ 1 and
define f : R+ → R+ by f(y) = k(x, y)k. Using the fact that k · k is an absolute norm, it is easy to
verify that f is a continuous increasing function with f(0) ≤ 1 and f(∞) = ∞. By the mean value
theorem, there exists yx , such that k(x, yx )k = 1. Finally, for any ϵ-packing {x′1 , . . . , x′M } of the unit
ball B∥·∥′ with respect to k·k′ , setting x′i = (xi , yxi ) we have kx′i −x′j k ≥ k(xi −xj , 0)k = kxi −xj k′ ≥ ϵ.
This proves
Then the left inequality of (27.6) follows from those of (27.4) and (27.5).
(a) Using (27.5), we see that for any compact Θ with nonempty interior, we have
1
N(Θ, k · k, ϵ) M(Θ, k · k, ϵ) (27.7)
ϵd
for small ϵ, with proportionality constants depending on both Θ and the norm. In fact, the sharp
constant is also known to exist. It is shown in [? , Theorem IX] that there exists a constant τ
i i
i i
i i
Next we switch our attention to the discrete case of Hamming space. The following theorem
bounds its packing number M(Fd2 , dH , r) ≡ M(Fd2 , r), namely, the maximal number of binary code-
words of length d with a prescribed minimum distance r + 1.5 This is a central question in coding
theory, wherein the lower and upper bounds below are known as the Gilbert-Varshamov bound
and the Hamming bound, respectively.
Proof. Both inequalities in (27.8) follow from the same argument as that in Theorem 27.3, with
Rd replaced by Fd2 and volume by the counting measure (which is translation invariant).
Of particular interest to coding theory is the asymptotic regime of d → ∞ and r = ρd for some
constant ρ ∈ (0, 1). Using the asymptotics of the binomial coefficients (cf. Proposition 1.5), the
4
For example, it is easy to show that τ = 1 for both ℓ∞ and ℓ1 balls in any dimension since cubes can be subdivided into
smaller cubes; for ℓ2 -ball in d = 2, τ = √π is the famous result of L. Fejes Tóth on the optimality of hexagonal
12
arrangement for circle packing [268].
5
Recall that the packing number in Definition 27.1 is defined with a strict inequality.
i i
i i
i i
456
Finding the exact exponent is one of the most significant open questions in coding theory. The best
upper bound to date is due to McEliece, Rodemich, Rumsey and Welch [214] using the technique
of linear programming relaxation.
In contrast, the corresponding covering problem in Hamming space is much simpler, as we
have the following tight result
where R(ρ) = (1 − h(ρ))+ is the rate-distortion function of Ber( 12 ) from Theorem 26.1. Although
this does not automatically follow from the rate-distortion theory, it can be shown using similar
argument – see Exercise V.11.
Finally, we state a lower bound on the packing number of Hamming spheres, which is needed
for subsequent application in sparse estimation (Exercise VI.11) and useful as basic building blocks
for computing metric entropy in more complicated settings (Theorem 27.7).
In particular,
k d
log M(Sdk , k/2) ≥ log . (27.12)
2 2ek
Proof. Again (27.11) follows from the volume argument. To verify (27.12), note that for r ≤ d/2,
Pr
we have i=0 di ≤ exp(dh( dr )) (see Theorem 8.2 or (15.19) with p = 1/2). Using h(x) ≤ x log xe
and dk ≥ ( dk )k , we conclude (27.12) from (27.11).
i i
i i
i i
As a case in point, consider the maximum number of ℓ2 -balls of radius ϵ packed into the unit
ℓ1 -ball, namely, M(B1 , k · k2 , ϵ). (Recall that Bp denotes the unit ℓp -ball in Rd with 1 ≤ p ≤ ∞.)
We have studied the metric entropy of arbitrary norm balls under the same norm in Corollary 27.4,
where the specific value of the volume was canceled from the √
volume ratio. Here, although ℓ1 and
ℓ2 norms are equivalent in the sense that kxk2 ≤ kxk1 ≤ dkxk2 , this relationship is too loose
when d is large.
Let us start by applying the volume method in Theorem 27.3:
vol(B1 ) vol(B1 + 2ϵ B2 )
≤ N(B1 , k · k2 , ϵ) ≤ M(B1 , k · k2 , ϵ) ≤ .
vol(ϵB2 ) vol( 2ϵ B2 )
Applying the formula for the volume of a unit ℓq -ball in Rd :
h id
2Γ 1 + 1q
vol(Bq ) = , (27.13)
Γ 1 + qd
πd
we get6 vol(B1 ) = 2d /d! and vol(B2 ) = Γ(1+d/2) , which yield, by Stirling approximation,
1 1
vol(B1 )1/d , vol(B2 )1/d √ . (27.14)
d d
Then for some absolute constant C,
√ d
vol(B1 + 2ϵ B2 ) vol((1 + ϵ 2 d )B1 ) 1
M(B1 , k · k2 , ϵ) ≤ ≤ ≤ C 1 + √ , (27.15)
vol( 2ϵ B2 ) vol( 2ϵ B2 ) ϵ d
√
where the second inequality follows from B2 ⊂ dB1 by Cauchy-Schwarz inequality. (This step
is tight in the sense that vol(B1 + 2ϵ B2 )1/d ≳ max{vol(B1 )1/d , 2ϵ vol(B2 )1/d } max{ d1 , √ϵd }.) On
the other hand, for some absolute constant c,
d d
vol(B1 ) 1 vol(B1 ) c
M(B1 , k · k2 , ϵ) ≥ = = √ . (27.16)
vol(ϵB2 ) ϵ vol(B2 ) ϵ d
Overall, for ϵ ≤ √1d , we have M(B1 , k · k2 , ϵ)1/d ϵ√1 d ; however, the lower bound trivializes and
the upper bound (which is exponential in d) is loose in the regime of ϵ √1d , which requires
different methods than volume calculation. The following result describes the complete behavior
of this metric entropy. In view of Theorem 27.2, we will go back and forth between the covering
and packing numbers in the argument.
6
For B1 this can be proved directly by noting that B1 consists 2d disjoint “copies” of the simplex whose volume is 1/d! by
induction on d.
i i
i i
i i
458
Proof. The case of ϵ ≤ √1d follows from earlier volume calculation (27.15)–(27.16). Next we
focus on √1d ≤ ϵ < 1.
For the upper bound, we construct an ϵ-covering in ℓ2 by quantizing each coordinate. Without
loss of generality, assume that ϵ < 1/4. Fix some δ < 1. For each θ ∈ B1 , there exists x ∈
(δ Zd ) ∩ B1 such that kx − θk∞ ≤ δ . Then kx − θk22 ≤ kx − θk1 kx − θk∞ ≤ 2δ . Furthermore, x/δ
belongs to the set
( )
X d
Z= z∈Z : d
|zi | ≤ k (27.17)
i=1
with k = b1/δc. Note that each z ∈ Z has at most k nonzeros. By enumerating the number of non-
negative solutions (stars and bars calculation) and the sign pattern, we have7 |Z| ≤ 2k∧d d−k1+k .
Finally, picking δ = ϵ2 /2, we conclude that N(B1 , k · k2 , ϵ) ≤ |Z| ≤ ( 2e(dk+k) )k as desired. (Note
that this method also recovers the volume bound for ϵ ≤ √1d , in which case k ≤ d.)
√
For the lower bound, note that M(B1 , k · k2 , 2) ≥ 2d by considering ±e1 , . . . , ±ed . So it
suffices to consider d ≥ 8. We construct a packing of B1 based on a packing of the Hamming
sphere. Without loss of generality, assume that ϵ > 4√1 d . Fix some 1 ≤ k ≤ d. Applying
the Gilbert-Varshamov bound in Theorem 27.6, in particular, (27.12), there exists a k/2-packing
Pd
{x1 , . . . , xM } ⊂ Sdk = {x ∈ {0, 1}d : i=1 xi = k} and log M ≥ 2k log 2ek d
. Scale the Hamming
sphere to fit the ℓ1 -ball by setting θi = xi /k. Then θi ∈ B1 and kθi − θj k2 = k2 dH (xi , xj ) ≥ 2k
2 1 1
for all
1
i 6= j. Choosing k = ϵ2 which satisfies k ≤ d/8, we conclude that {θ1 , . . . , θM } is a 2 -packing
ϵ
of B1 in k · k2 as desired.
The above elementary proof can be adapted to give the following more general result (see
Exercise V.12): Let 1 ≤ p < q ≤ ∞. For all 0 < ϵ < 1 and d ∈ N,
(
d log ϵes d ϵ ≤ d−1/s 1 1 1
log M(Bp , k · kq , ϵ) p,q 1 , ≜ − . (27.18)
−1/s
s log(eϵ d)
ϵ
s
ϵ≥d s p q
In the remainder of this section, we discuss a few generic results in connection to Theorem 27.7,
in particular, metric entropy upper bounds via the Sudakov minorization and Maurey’s empirical
method, as well as the duality of metric entropy in Euclidean spaces.
7 ∑d (d)( k )
By enumerating the support and counting positive solutions, it is easy to show that |Z| = i=0 2d−i i d−i
.
8
To avoid measurability difficulty, w(Θ) should be understood as supT⊂Θ,|T|<∞ E maxθ∈T hθ, Zi.
i i
i i
i i
For any Θ ⊂ Rd ,
p
w(Θ) ≳ sup ϵ log M(Θ, k · k2 , ϵ). (27.20)
ϵ>0
The preceding theorem relates the Gaussian width to the metric entropy, both of which are
meaningful measure of the massiveness of a set. The following complementary result is due to
R. Dudley. (See [235, Theorem 5.6] for both results.)
Z ∞p
w(Θ) ≲ log M(Θ, k · k2 , ϵ)dϵ. (27.21)
0
Understanding the maximum of a Gaussian process is a field on its own; see the monograph [302].
In this section we focus on the upper bound (27.20) in order to develop upper bound for metric
entropy using the Gaussian width.
The proof of Theorem 27.8 relies on the following Gaussian comparison lemma of Slepian
(whom we have encountered earlier in Theorem 11.13). For a self-contained proof see [62]. See
also [235, Lemma 5.7, p. 70] for a simpler proof of a weaker version E max Xi ≤ 2E max Yi , which
suffices for our purposes.
Lemma 27.9 (Slepian’s lemma). Let X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) be Gaussian random
vectors. If E(Yi − Yj )2 ≤ E(Xi − Xj )2 for all i, j, then E max Yi ≤ E max Xi .
We also need the result bounding the expectation of the maximum of n Gaussian random
variables (see also Exercise I.45).
Therefore
log n t
E[max Zi ] ≤ + .
i t 2
p
Choosing t = 2 log n yields (27.22). Next, assume that Zi are iid. For any t > 0,
i i
i i
i i
460
where Φc (t) = P[Z1 ≥ t] is the normal tail probability. The second term equals
2−(n−1) E[Z1 1{Z1 <0} ] = o(1). For the first term, recall that Φc (t) ≥ 1+t t2 φ(t) (Exercise V.9).
p
Choosing
p t = (2 − ϵ) log n for small ϵ > 0 so that Φc (t) = ω( 1n ) and hence E[maxi Zi ] ≥
(2 − ϵ) log n(1 + o(1)). By the arbitrariness of ϵ > 0, the lower bound part of (27.23)
follows.
Proof of Theorem 27.8. Let {θ1 , . . . , θM } be an optimal ϵ-packing of Θ. Let Xi = hθi , Zi for
i.i.d.
i ∈ [M], where Z ∼ N (0, Id ). Let Yi ∼ N (0, ϵ2 /2). Then
Then
p
E sup hθ, Zi ≥ E max Xi ≥ E max Yi ϵ log M
θ∈Θ 1≤i≤M 1≤i≤M
where the second and third step follows from Lemma 27.9 and Lemma 27.10 respectively.
i i
i i
i i
Proof. Let T = {t1 , t2 , . . . , tm } and denote the Chebyshev center of T by c ∈ H, such that r =
maxi∈[m] kc − ti k. For n ∈ Z+ , let
( ! )
1 X
m X
m
Z= c+ ni ti : ni ∈ Z + , ni = n .
n+1
i=1 i=1
Pm P
For any x = i=1 xi ti ∈ co(T) where xi ≥ 0 and xi = 1, let Z be a discrete random variable
such that Z = ti with probability xi . Then E[Z] = x. Let Z0 = c and Z1 , . . . , Zn be i.i.d. copies of
Pm
Z. Let Z̄ = n+1 1 i=0 Zi , which takes values in the set Z . Since
2
X
n
X
n X
1
1
EkZ̄ − xk22 = E
( Z i − x)
= E kZi − xk2 + EhZi − x, Zj − xi
( n + 1) 2
( n + 1) 2
i=0 i=0 i̸=j
1 X
n
1 r2
= E kZi − xk2 = kc − x k2
+ nE [kZ − x k2
] ≤ ,
( n + 1) 2 ( n + 1) 2 n+1
i=0
Pm
where the last inequality follows from that kc − xk ≤ i=1 xi kc − ti k ≤ r (in other words, rad (T) =
rad (co(T)) and E[kZ − xk2 ] ≤ E[kZ − ck2 ] ≤ r2 . Set n = r2 /ϵ2 − 1 so that r2 /(n + 1) ≤ ϵ2 .
There exists some z ∈ N such that kz − xk ≤ ϵ. Therefore Z is an ϵ-covering of co(T). Similar to
(27.17), we have
n+m−1 m + r2 /ϵ2 − 2
|Z| ≤ = .
n dr2 /ϵ2 e − 1
We now apply Theorem 27.11 to recover the result for the unit ℓ1 -ball B1 in Rd in Theorem 27.7:
Note that B1 = co(T), where T = {±e1 , . . . , ±ed , 0} satisfies rad (T) = 1. Then
2d + d ϵ12 e − 1
N(B1 , k · k2 , ϵ) ≤ , (27.26)
d ϵ12 e − 1
which recovers the optimal upper bound in Theorem 27.7 at both small and big scale.
Then the usual covering number in Definition 27.1 satisfies N(K, k · k, ϵ) = N(K, ϵB), where B is
the corresponding unit norm ball.
i i
i i
i i
462
A deep result of Artstein, Milman, and Szarek [18] establishes the following duality for metric
entropy: There exist absolute constants α and β such that for any symmetric convex body K,9
1 ϵ
log N B2 , K◦ ≤ log N(K, ϵB2 ) ≤ log N(B2 , αϵK◦ ), (27.27)
β α
where B2 is the usual unit ℓ2 -ball, and K◦ = {y : supx∈K hx, yi ≤ 1} is the polar body of K.
As an example, consider p < 2 < q and 1p + 1q = 1. By duality, B◦p = Bq . Then (27.27) shows
that N(Bp , k · k2 , ϵ) and N(B2 , k · kq , ϵ) have essentially the same behavior, as verified by (27.18).
Theorem 27.12. Assume that L, A > 0 and p ∈ [1, ∞] are constants. Then
1
log N(F(A, L), k · kp , ϵ) = Θ . (27.28)
ϵ
Furthermore, for the sup-norm we have the sharp asymptotics:
LA
log2 N(F(A, L), k · k∞ , ϵ) = (1 + o(1)), ϵ → 0. (27.29)
ϵ
Thus, it is sufficient to consider F(A, 1) ≜ F(A), the collection of 1-Lipschitz densities on [0, A].
Next, observe that any such density function f is bounded from above. Indeed, since f(x) ≥ (f(0) −
RA
x)+ and 0 f = 1, we conclude that f(0) ≤ max{A, A2 + A1 } ≜ m.
To show (27.28), it suffices to prove the upper bound for p = ∞ and the lower bound for p = 1.
Specifically, we aim to show, by explicit construction,
C Aϵ
N(F(A), k · k∞ , ϵ) ≤ 2 (27.31)
ϵ
9
A convex body K is a compact convex set with non-empty interior. We say K is symmetric if K = −K.
i i
i i
i i
c
M(F(A), k · k1 , ϵ) ≥ 2 ϵ (27.32)
which imply the desired (27.28) in view of Theorem 27.2. Here and below, c, C are constants
depending on A. We start with the easier (27.32). We construct a packing by perturbing the uniform
density. Define a function T by T(x) = x1{x≤ϵ} + (2ϵ − x)1{x≥ϵ} + A1 on [0, 2ϵ] and zero elsewhere.
Let n = 4Aϵ and a = 2nϵ. For each y ∈ {0, 1}n , define a density fy on [0, A] such that
X
n
f y ( x) = yi T(x − 2(i − 1)ϵ), x ∈ [0, a],
i=1
RA
and we linearly extend fy to [a, A] so that 0 fy = 1; see Fig. 27.2. For sufficiently small ϵ, the
Ra
resulting fy is 1-Lipschitz since 0 fy = 12 + O(ϵ) so that the slope of the linear extension is O(ϵ).
1/A
x
0 ϵ 2ϵ 2nϵ A
Figure 27.2 Packing that achieves (27.32). The solid line represent one such density fy (x) with
y = (1, 0, 1, 1). The dotted line is the density of Unif(0, A).
Thus we conclude that each fy is a valid member of F(A). Furthermore, for y, z ∈ {0, 1}n ,
we have kfy − Fz k1 = dH (y, z)kTk1 = ϵ2 dH (y, z). Invoking the Gilbert-Varshamov bound The-
orem 27.5, we obtain an n2 -packing Y of the Hamming space {0, 1}n with |Y| ≥ 2cn for some
2
absolute constant c. Thus {fy : y ∈ Y} constitutes an n2ϵ -packing of F(A) with respect to the
2
L1 -norm. This is the desired (27.32) since n2ϵ = Θ(ϵ).
m
To construct a covering, set J = ϵ , n = Aϵ , and xk = kϵ for k = 0, . . . , n. Let G be the
collection of all lattice paths (with grid size ϵ) of n steps starting from the coordinate (0, jϵ) for
some j ∈ {0, . . . , J}. In other words, each element g of G is a continuous piecewise linear function
on each subinterval Ik = [xk , xk+1 ) with slope being either +1 or −1. Evidently, the number of
such paths is at most (J + 1)2n = O( 1ϵ 2A/ϵ ). To show that G is an ϵ-covering, for each f ∈ F (A),
we show that there exists g ∈ G such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, A]. This can be shown
by a simple induction. Suppose that there exists g such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, xk ],
which clearly holds for the base case of k = 0. We show that g can be extended to Ik so that this
holds for k + 1. Since |f(xk ) − g(xk )| ≤ ϵ and f is 1-Lipschitz, either f(xk+1 ) ∈ [g(xk ), g(xk ) + 2ϵ]
or [g(xk ) − 2ϵ, g(xk )], in which case we extend g upward or downward, respectively. The resulting
g satisfies |f(x) − g(x)| ≤ ϵ on Ik , completing the induction.
Finally, we prove the sharp bound (27.29) for p = ∞. The upper bound readily follows from
(27.31) plus the scaling relation (27.30). For the lower bound, we apply Theorem 27.2 converting
i i
i i
i i
464
b′ + ϵ1/3
b′
x
0 a′ A
Figure 27.3 Improved packing for (27.33). Here the solid and dashed lines are two lattice paths on a grid of
size ϵ starting from (0, b′ ) and staying in the range of [b′ , b′ + ϵ1/3 ], followed by their respective linear
extensions.
the problem to the construction of 2ϵ-packing. Following the same idea of lattice paths, next we
give an improved packing construction such that
a
M(F(A), k · k∞ , 2ϵ) ≥ Ω(ϵ3/2 2 ϵ ). (27.33)
a b
for any a < A. Choose any b such that A1 < b < A1 + (A− a)2 ′ ′
2A . Let a = ϵ ϵ and b = ϵ ϵ . Consider
a density f on [0, A] of the following form (cf. Fig. 27.3): on [0, a′ ], f is a lattice path from (0, b′ ) to
(a′ , b′ ) that stays in the vertical range of [b′ , b′ + ϵ1/3 ]; on [a′ , A], f is a linear extension chosen so
RA
that 0 f = 1. This is possible because by the 1-Lipschitz constraint we can linearly extend f so that
RA ′ 2 ′ 2 R a′
a′
f takes any value in the interval [b′ (A−a′ )− (A−2a ) , b′ (A−a′ )+ (A−2a ) ]. Since 0 f = ab+o(1),
RA R a′
we need a′ f = 1 − 0 f = 1 − ab + o(1), which is feasible due to the choice of b. The collection
G of all such functions constitute a 2ϵ-packing in the sup norm (for two distinct paths consider the
first subinterval where they differ). Finally, we bound the cardinality of this packing by counting
the number of such paths. This can be accomplished by standard estimates on random walks (see
e.g. [122, Chap. III]). For any constant c > 0, the probability that a symmetric random walk on
Z returns to zero in n (even) steps and stays in the range of [0, n1+c ] is Θ(n−3/2 ); this implies the
desired (27.33). Finally, since a < A is arbitrary, the lower bound part of (27.29) follows in view
of Theorem 27.2.
The following result, due to Birman and Solomjak [36] (cf. [203, Sec. 15.6] for an exposition),
is an extension of Theorem 27.12 to the more general Hölder class.
Theorem 27.13. Fix positive constants A, L and d ∈ N. Let β > 0 and write β = ℓ + α,
where ℓ = bβc and α ∈ [0, 1). Let Fβ (A, L) denote the collection of ℓ-times continuously
differentiable densities f on [0, A]d whose ℓth derivative is (L, α)-Hölder continuous, namely,
i i
i i
i i
1 465
27.5 Hilbert ball has metric entropy ϵ2
βd
1
log N(Fβ (A, L), k · kp , ϵ) . (27.34)
ϵ
The main message of the preceding theorem is that is the entropy of the function class grows
more slowly if the dimension decreases or the smoothness increases. As such, the metric entropy
for very smooth functions can grow subpolynomially in 1ϵ . For example, Vitushkin (cf. [180,
Eq. (129)]) showed that for the class of analytic functions on the unit complex disk D having
analytic extension to a bigger disk rD for r > 1, the metric entropy (with respect to the sup-norm
on D) is Θ((log 1ϵ )2 ); see [180, Sec. 7 and 8] for more such results.
As mentioned at the beginning of this chapter, the conception and development of the subject
on metric entropy, in particular, Theorem 27.13, are motivated by and plays an important role
in the study of Hilbert’s 13th problem. In 1900, Hilbert conjectured that there exist functions of
several variables which cannot be represented as a superposition (composition) of finitely many
functions of fewer variables. This was disproved by Kolmogorov and Arnold in 1950s who showed
that every continuous function of d variables can be represented by sums and superpositions of
single-variable functions; however, their construction does not work if one requires the constituent
functions to have specific smoothness. Subsequently, Hilbert’s conjecture for smooth functions
was positively resolved by Vitushkin [323], who showed that there exist functions of d variables
in the β -Hölder class (in the sense of Theorem 27.13) that cannot be expressed as finitely many
superpositions of functions of d′ variables in the β ′ -Hölder class, provided d/β > d′ /β ′ . The
original proof of Vitushkin is highly involved. Later, Kolmogorov gave a much simplified proof
by proving and applying the k · k∞ -version of Theorem 27.13. As evident in (27.34), the index
d/β provides a complexity measure for the function class; this allows an proof of impossibility
of superposition by an entropy comparison argument. For concreteness, let us prove the follow-
ing simpler version: There exists a 1-Lipschitz function f(x, y, z) of three variables on [0, 1]3 that
cannot be written as g(h1 (x, y), h2 (y, z)) where g, h1 , h2 are 1-Lipschitz functions of two variables
on [0, 1]2 . Suppose, for the sake of contradiction, that this is possible. Fixing an ϵ-covering of
cardinality exp(O( ϵ12 )) for 1-Lipschitz functions on [0, 1]2 and using it to approximate the func-
tions g, h1 , h2 , we obtain by superposition g(h1 , h2 ) an O(ϵ)-covering of cardinality exp(O( ϵ12 )) of
1-Lipschitz functions on [0, 1]3 ; however, this is a contradiction as any such covering must be of
size exp(Ω( ϵ13 )). For stronger and more general results along this line, see [180, Appendix I].
1
27.5 Hilbert ball has metric entropy ϵ2
Consider the following set of linear functions fθ (x) = (θ, x) with θ, x ∈ B – a unit ball in infinite
dimensional Hilbert space with inner product (·, ·).
i i
i i
i i
466
p
Theorem 27.14. Consider any measure P on B and let dP (θ, θ′ ) = EX∼P [|fθ (X) − fθ′ (X)|2 ].
Then we have
1
log N(ϵ, dP ) ≤ 2 .
eϵ
Proof. We have log N(ϵ) ≤ log M(ϵ). By some continuity argument, let’s consider only empirical
Pn
measures Pn = n1 i=1 δxi . First consider the special case when xi ’s are orthogonal basis. Then the
√
ϵ-packing in dP is simply an nϵ-packing of n-dimensional Euclidean unit ball. From Varshamov’s
argument we have
√
log M(ϵ) ≤ −n log nϵ . (27.35)
Thus, we have
1 1
log N(ϵ, dPn ) ≤ max n log √ = 2 .
n nϵ eϵ
√
Now, for a general case, after some linear algebra we get that the goal is to do nϵ-packing in
Euclidean metric of an ellipsoid:
X
n
{yn : y2j /λj ≤ 1} ,
j=1
where λj are eigenvalues of the Gram matrix of {xi , i ∈ [n]}. By calculating the volume of this
ellipsoid the bound (27.35) is then replaced by
X
n
√
log M(ϵ) ≤ log λj − n log nϵ .
j=1
P
Since j λj ≤ n (xi ’s are unit norm!) we get from Jensen’s that the first sum above is ≤ 0 and we
reduced to the previous case.
To see one simple implication of the result, recall the standard bound on empirical processes
s
Z ∞
log N(Θ, L ( P̂ ), ϵ)
E sup E[fθ (X)] − Ên [fθ (X)] ≲ E inf δ + dϵ .
2 n
θ δ>0 δ n
It can be see that when entropy behaves as ϵ−p we get rate n− min(1/p,1/2) except for p = 2 for which
the upper bound yields n− 2 log n. The significance of the previous theorem is that the Hilbert ball
1
i i
i i
i i
10
In particular, if γ is the law of a Gaussian process X on C([0, 1]) with E[kXk22 ] < ∞, the kernel K(s, t) = E[X(s)X(t)]
∑
admits the eigendecomposition K(s, t) = λk ψk (s)ψk (t) (Mercer’s theorem), where {ϕk } is an orthonormal basis for
∑
L2 ([0, 1]) and λk > 0. Then H is the closure of the span of {ϕk } with the inner product hx, yiH = k hx, ψk ihy, ψk i/λk .
i i
i i
i i
468
The following fundamental result due to Kuelbs and Li [187] (see also the earlier work of
Goodman [142]) describes a precise connection between the small-ball probability function ϕ(ϵ)
and the metric entropy of the unit Hilbert ball N(K, k · k, ϵ) ≡ N(ϵ).
λ2
ϕ(2ϵ) + log Φ(λ + Φ−1 (e−ϕ(ϵ) )) ≤ log N(λK, ϵ) ≤ log M(λK, ϵ) ≤ + ϕ(ϵ/2) (27.41)
2
p
To deduce (27.40), choose λ = 2ϕ(ϵ/2) and note that by scaling N(λK, ϵ) = N(K, ϵ/λ).
t) = Φc (t) ≤ e−t /2 (Exercise V.9) yields Φ−1 (e−ϕ(ϵ) ) ≥
2
Applying
p the normal tail bound Φ(−
− 2ϕ(ϵ) ≥ −λ so that Φ(Φ−1 (e−ϕ(ϵ) ) + λ) ≥ Φ(0) = 1/2.
We only give the proof in finite dimensions as the results are dimension-free and extend natu-
rally to infinite-dimensional spaces. Let Z ∼ γ = N(0, Σ) on Rd so that K = Σ1/2 B2 is given in
(27.38). Applying (27.37) to λK and noting that γ is a probability measure, we have
γ (λK + B (0, ϵ)) 1
≤ N(λK, ϵ) ≤ M(λK, ϵ) ≤ . (27.42)
maxθ∈Rd γ (B (θ, 2ϵ)) minθ∈λK γ (B (θ, ϵ/2))
Next we further bound (27.42) using properties native to the Gaussian measure.
• For the upper bound, for any symmetric set A = −A and any θ ∈ λK, by a change of measure
γ(θ + A) = P [Z − θ ∈ A]
1 ⊤ −1
h −1 i
= e− 2 θ Σ θ E e⟨Σ θ,Z⟩ 1{Z∈A}
≥ e−λ
2
/2
P [Z ∈ A] ,
h −1 i
where the last step follows from θ⊤ Σ−1 θ ≤ λ2 and by Jensen’s inequality E e⟨Σ θ,Z⟩ |Z ∈ A ≥
−1
e⟨Σ θ,E[Z|Z∈A]⟩ = 1, using crucially that E [Z|Z ∈ A] = 0 by symmetry. Applying the above to
A = B(0, ϵ/2) yields the right inequality in (27.41).
i i
i i
i i
• For the lower bound, recall Anderson’s lemma (Lemma 28.10) stating that the Gaussian measure
of a ball is maximized when centered at zero, so γ(B(θ, 2ϵ)) ≤ γ(B(0, 2ϵ)) for all θ. To bound
the numerator, recall the Gaussian isoperimetric inequality (see e.g. [46, Theorem 10.15]):11
γ(A + λK) ≥ Φ(Φ−1 (γ(A)) + λ). (27.43)
Applying this with A = B(0, ϵ) proves the left inequality in (27.41) and the theorem.
The implication of Theorem 27.15 is the following. Provided that ϕ(ϵ) ϕ(ϵ/2), then we
should expect that approximately
!
ϵ
log N p ϕ(ϵ)
ϕ(ϵ)
With more effort this can be made precise unconditionally (see e.g. [199, Theorem 3.3], incorporat-
ing the later improvement by [198]), leading to very precise connections between metric entropy
and small-ball probability, for example: for fixed α > 0, β ∈ R,
β 2β
−α 1 − 2+α
2α 1 2+α
ϕ(ϵ) ϵ log ⇐⇒ log N(ϵ) ϵ log (27.44)
ϵ ϵ
As a concrete example, consider the unit ball (27.39) in the RKHS generated by the standard
Brownian motion, which is similar to a Sobolev ball.12 Using (27.36) and (27.44), we conclude
that log N(ϵ) 1ϵ , recovering the metric entropy of Sobolev ball determined in [310]. This result
also coincides with the metric entropy of Lipschitz ball in Theorem 27.13 which requires the
derivative to be bounded everywhere as opposed to on average in L2 . For more applications of
small-ball probability on metric entropy (and vice versa), see [187, 198].
11
The connection between (27.43) and isoperimetry is that if we interpret limλ→0 (γ(A + λK) − γ(A))/λ as the surface
measure of A, then among all sets with the same Gaussian measure, the half space has maximal surface measure.
12
The Sobolev norm is kfkW1,2 ≜ kfk2 + kf′ k2 . Nevertheless, it is simple to verify a priori that the metric entropy of
(27.39) and that of the Sobolev ball share the same behavior (see [187, p. 152]).
i i
i i
i i
470
its rate-distortion function (recall Section 24.3). Denote the worst-case rate-distortion function on
X by
The next theorem relates ϕX to the covering and packing number of X . The lower bound simply
follows from a “Bayesian” argument, which bounds the worst case from below by the average case,
akin to the relationship between minimax and Bayes risk (see Section 28.3). The upper bound was
shown in [174] using the dual representation of rate-distortion functions; here we give a simpler
proof via Fano’s inequality.
ϕX (cϵ) + log 2
ϕX (ϵ) ≤ log N(X , d, ϵ) ≤ log M(X , d, ϵ) ≤ . (27.47)
1 − 2c
Proof. Fix an ϵ-covering of X in d of size N. Let X̂ denote the closest element in the covering to
X. Then d(X, X̂) ≤ ϵ almost surely. Thus ϕX (ϵ) ≤ I(X; X̂) ≤ log N. Optimizing over PX proves the
left inequality.
For the right inequality, let X be uniformly distributed over a maximal ϵ-packing of X . For
any PX̂|X such that E[d(X, X̂)] ≤ cϵ. Let X̃ denote the closest point in the packing to X̂. Then we
have the Markov chain X → X̂ → X̃. By definition, d(X, X̃) ≤ d(X̂, X̃) + d(X̂, X) ≤ 2d(X̂, X)
so E[d(X, X̃)] ≤ 2cϵ. Since either X = X̃ or d(X, X̃) > ϵ, we have P[X 6= X̃] ≤ 2c. On the
other hand, Fano’s inequality (Corollary 6.4) yields P[X 6= X̃] ≥ 1 − I(X;log
X̂)+log 2
M . In all, I(X; X̂) ≥
(1 − 2c) log M − log 2, proving the upper bound.
Remark 27.4. (a) Clearly, Theorem 27.16 can be extended to the case where the distortion
function equals a power of the metric, namely, replacing (27.45) with
Then (27.47) continues to hold with 1 − 2c replaced by 1 − (2c)r . This will be useful, for
example, in the forthcoming applications where second moment constraint is easier to work
with.
(b) In the earlier literature a variant of the rate-distortion function is also considered, known as
the ϵ-entropy of X, where the constraint is d(X, X̂) ≤ ϵ with probability one as opposed to
in expectation (cf. e.g. [180, Appendix II] and [254]). With this definition, it is natural to
conjecture that the maximal ϵ-entropy over all distributions on X coincides with the metric
entropy log N(X , ϵ); nevertheless, this need not be true (see [215, Remark, p. 1708] for a
counterexample).
i i
i i
i i
Theorem 27.16 points out an information-theoretic route to bound the metric entropy by the
worst-case rate-distortion function (27.46).13 Solving this maximization, however, is not easy as
PX 7→ ϕX (D) is in general neither convex nor concave [3].14 Fortunately, for certain spaces, one
can show via a symmetry argument that the “uniform” distribution maximizes the rate-distortion
function at every distortion level; see Exercise V.8 for a formal statement. As a consequence, we
have:
• For Hamming space X = {0, 1}d and Hamming distortion, ϕX (D) is attained by Ber( 12 )d . (We
already knew this from Theorem 26.1 and Theorem 24.8.)
• For the unit sphere X = Sd−1 and distortion function defined by the Euclidean distance, ϕX (D)
is attained by Unif(Sd−1 ).
• For the orthogonal group X = O(d) or unitary group U(d) and distortion function defined by
the Frobenius norm, ϕX (D) is attained by the Haar measure. Similar statements also hold for
the Grassmann manifold (collection of linear subspaces).
Theorem 27.17. Let θ be uniformly distributed over the unit sphere Sd−1 . Then for all 0 < ϵ < 1,
1 1
(d − 1) log − C ≤ inf I(θ; θ̂) ≤ (d − 1) log 1 + + log(2d)
ϵ Pθ̂|θ :E[∥θ̂−θ∥22 ]≤ϵ2 ϵ
Note that the random vector θ have dependent entries so we cannot invoke the single-
d
letterization technique in Theorem 24.8. Nevertheless, we have the representation θ=Z/kZk2 for
Z ∼ N (0, Id ), which allows us to relate the rate-distortion function of θ to that of the Gaussian
found in Theorem 26.2. The resulting lower bound agree with the metric entropy for spheres in
Corollary 27.4, which scales as (d − 1) log 1ϵ . Using similar reduction arguments (see [195, The-
orem VIII.18]), one can obtain tight lower bound for the metric entropy of the orthogonal group
O(d) and the unitary group U(d), which scales as d(d2−1) log 1ϵ and d2 log 1ϵ , with pre-log factors
commensurate with their respective degrees of freedoms. As mentioned in Remark 27.3(b), these
results were obtained by Szarek in [298] using a volume argument with Haar measures; in compar-
ison, the information-theoretic approach is more elementary as we can again reduce to Gaussian
rate-distortion computation.
Proof. The upper bound follows from Theorem 27.16 and Remark 27.4(a), applying the metric
entropy bound for spheres in Corollary 27.4.
13
A striking parallelism between the metric entropy of Sobolev balls and the rate-distortion function of smooth Gaussian
processes has been observed by Donoho in [99]. However, we cannot apply Theorem 27.16 to formally relate one to the
other since it is unclear whether the Gaussian rate-distortion function is maximal.
14
As a counterexample, consider Theorem 26.1 for the binary source.
i i
i i
i i
472
I(Z; Ẑ) = I(θ, A; Ẑ) ≤ I(θ, A; θ̂, Â) = I(θ; θ̂) + I(A, Â).
Furthermore, E[Â2 ] = E[(Â − A)2 ] + E[A2 ] + 2E[(Â − A)(A − E[A])] ≤ d + δ 2 + 2δ ≤ d + 3δ .
Similarly, |E[Â(Â − A)]| ≤ 2δ and E[kZ − Ẑk2 ] ≤ dϵ2 + 7δϵ + δ . Choosing δ = ϵ, we have
E[kZ − Ẑk2 ] ≤ (d + 8)ϵ2 . Combining Theorem 24.8 with the Gaussian rate-distortion function in
Theorem 26.2, we have I(Z; Ẑ) ≥ d2 log (d+d8)ϵ2 , so applying log(1 + x) ≤ x yields
1
I(θ; θ̂) ≥ (d − 1) log − 4 log e.
ϵ2
i i
i i
i i
V.1 Let S = Ŝ = {0, 1} and let the source X10 be fair coin flips. Denote the output of the decom-
1
pressor by X̂10 . Show that it is possible to achieve average Hamming distortion 20 with 512
codewords.
V.2 Assume the distortion function is separable. Show that the minimal number of codewords
M∗ (n, D) required to represent memoryless source Xn with average distortion D satisfies
Conclude that
1 1
lim log M∗ (n, D) = inf log M∗ (n, D) . (V.1)
n→∞ n n n
(i.e. one can always achieve a better compression rate by using a longer blocklength). Neither
claim holds for log M∗ (n, ϵ) in channel coding (with inf replaced by sup in (V.1) of course).
Explain why this different behavior arises.
i.i.d.
V.3 Consider a source Sn ∼ Ber( 12 ). Answer the following questions when n is large.
(a) Suppose the goal is to compress Sn into k bits so that one can reconstruct Sn with at most
one bit of error. That is, the decoded version Ŝn satisfies E[dH (Ŝn , Sn )] ≤ 1. Show that this
can be done (if possible, with an explicit algorithm) with k = n − C log n bits for some
constant C. Is it optimal?
(b) Suppose we are required to compress Sn into only 1 bit. Show that one can achieve (if
√
possible, with an explicit algorithm) a reconstruction error E[dH (Ŝn , Sn )] ≤ n2 − C n for
some constant C. Is it optimal?
Warning: We cannot blindly apply asymptotic the rate-distortion theory to show achievability
since here the distortion changes with n. The converse, however, directly applies.
i.i.d.
V.4 (Noisy source coding [94]) Let Zn ∼ Ber( 21 ). Let Xn be the output of a stationary memoryless
binary erasure channel with erasure probability δ when the input is Zn .
(a) Find the best compression rate for Xn so that the decompressor can reconstruct Zn with bit
error rate D.
(b) What if the input is a Ber(p) sequence?
V.5 (a) Let 0 ≺ ∆ Σ be positive definite matrices. For S ∼ N (0, Σ), show that
1 det Σ
inf I(S; Ŝ) = log .
PŜ|S :E[(S−Ŝ)(S−Ŝ)⊤ ]⪯∆ 2 det ∆
i i
i i
i i
(b) Prove the following extension of (26.3): Let σ12 , . . . , σd2 be the eigenvalues of Σ. Then
1 X + σi2
d
inf I(S; Ŝ) = log
PŜ|S :E[∥S−Ŝ∥22 ]≤D 2 λ
i=1
Pd
where λ > 0 is such that i=1 min{σi2 , λ} = D. This is the counterpart of the waterfilling
solution in Theorem 20.14.
(Hint: First, using the orthogonal invariance of distortion metric we can assume that
Σ is diagonal. Next, apply the same single-letterization argument for (26.3) and solve
Pd σ2
minP Di =D 12 i=1 log+ Dii .)
V.6 (Shannon lower bound) Let k · k be an arbitrary norm on Rd and r > 0. Let X be a Rd -valued
random vector with a probability density function pX . Denote the rate-distortion function
ϕ X ( D) ≜ inf I(X; X̂)
PX̂|X :E[∥X̂−X∥r ]≤D
and this entropy maximization can be solved following the argument in Example 5.2.
V.7 (Uniform distribution minimizes convex symmetric functional.) Let G be a group acting on a
set X such that each g ∈ G sends x ∈ X to gx ∈ X . Suppose G acts transitively, i.e., for each
x, x′ ∈ X there exists g ∈ G such that gx = x′ . Let g be a random element of G with an invariant
i i
i i
i i
d
distribution, namely hg=g for any h ∈ G. (Such a distribution, known as the Haar measure,
exists for compact topological groups.)
(a) Show that for any x ∈ X , gx has the same law, denoted by Unif(X ), the uniform distribution
on X .
(b) Let f : P(X ) → R be convex and G-invariant, i.e., f(PgX ) = f(PX ) for any X -valued random
variable X and any g ∈ G. Show that minPX ∈P(X ) f(PX ) = f(Unif(X )).
V.8 (Uniform distribution maximizes rate-distortion function.) Under the setup of Exercise V.7, let
d : X × X → R be a G-invariant distortion function, i.e., d(gx, gx′ ) = d(x, x′ ) for any g ∈ G.
Denote the rate-distortion function of an X -valued X by ϕX (D) = infP :E[d(X,X̂)]≤D I(X; X̂).
X̂|X
Suppose that ϕX (D) < ∞ for all X and all D > 0.
(a) Let ϕ∗X (λ) = supD {λD − ϕX (D)} denote the conjugate of ϕX . Applying Theorem 24.4 and
Fenchel-Moreau’s biconjugation theorem to conclude that ϕX (D) = supλ {λD − ϕ∗X (λ)}.
(b) Show that
ϕ∗X (λ) = sup{λE[d(X, X̂)] − I(X; X̂)}.
PX̂|X
As such, for each λ, PX 7→ ϕ∗X (λ) is convex and G-invariant. (Hint: Theorem 5.3.)
(c) Applying Exercise V.7 to conclude that ϕ∗U (λ) ≤ ϕ∗X (λ) for U ∼ Unif(X ) and that
ϕX (D) ≤ ϕU (D), ∀ D > 0.
V.9 (Normal tail bound.) Denote the standard normal density and tail probability by φ(x) =
R∞
√1 e−x /2 and Φc (t) =
2
2π t
φ(x)dx. Show that for all t > 0,
t φ(t) −t2 /2
φ( t ) ≤ Φ c
( t ) ≤ min , e . (V.3)
1 + t2 t
(Hint: For Φc (t) ≤ e−t /2 apply the Chernoff bound (15.2); for the rest, note that by integration
2
R∞
by parts Φc (t) = φ(t t) − t φ(x2x) dx.)
V.10 (Small-ball probability II.) In this exercise we prove (27.36). Let {Wt : t ≥ 0} be a standard
Brownian motion. Show that for small ϵ,15
1 1
ϕ(ϵ) = log h i
P supt∈[0,1] |Wt | ≤ ϵ ϵ2
h i h i
(a) By rescaling space and time, show that P supt∈[0,1] |Wt | ≤ ϵ = P supt∈[0,T] |Wt | ≤ 1 ≜
pT , where T = 1/ϵ2 . To show pT = e−Θ(T) , there is no loss of generality to assume that T is
an integer.
(b) (Upper bound) Using the independent increment property, show that pT+1 ≤ apT , where
a = P [|Z| ≤ 1] with Z ∼ N(0, 1). (Hint: g(z) ≜ P [|Z − z| ≤ 1] for z ∈ [−1, 1] is maximized
at z = 0 and minimized at z = ±1.)
15
Using the large-deviations theory developed by Donsker-Varadhan, the sharp constant can be found to be
2
limϵ→0 ϵ2 ϕ(ϵ) = π8 . see for example [199, Sec. 6.2].
i i
i i
i i
h i
(c) (Lower bound) Again by scaling, it is equivalent to show P supt∈[0,T] |Wt | ≤ C ≥ C−T for
h i
some constant C. Let qT ≜ P supt∈[0,T] |Wt | ≤ 2, maxt=1,...,T |Wt | ≤ 1 . Show that qT+1 ≥
bqT , where b = P [|Z − 1| ≤ 1] P[supt∈[0,1] |Bt | ≤ 1], and Bt = Bt − tB1 is a Brownian
bridge. (Hint: {Wt : t ∈ [0, T]}, WT+1 − WT , and {WT+t − (1 − t)WT − tWT+1 : t ∈ [0, 1]}
are mutually independent, with the latter distributed as a Brownian bridge.)
V.11 (Covering radius in Hamming space) In this exercise we prove (27.9), namely, for any fixed
0 ≤ D ≤ 1, as n → ∞,
N(Fn2 , dH , Dn) = 2n(1−h(D))+ +o(n) ,
where h(·) is the binary entropy function.
(a) Prove the lower bound by invoking the volume bound in Theorem 27.3 and the large-
deviations estimate in Example 15.1.
(b) Prove the upper bound using probabilistic construction and a similar argument to (25.8).
(c) Show that for D ≥ 12 , N(Fn2 , dH , Dn) ≤ 2 – cf. Ex. V.3a.
V.12 (Covering ℓp -ball with ℓq -balls)
(a) For 1 ≤ p < q ≤ ∞, prove the bound (27.18) on the metric entropy of the unit ℓp -ball with
respect to the ℓq -norm (Hint: for small ϵ, apply the volume calculation in (27.15)–(27.16)
and the formula in (27.13); for large ϵ, proceed as in the proof of Theorem 27.7 by applying
the quantization argument and the Gilbert-Varshamov bound of Hamming spheres.)
(b) What happens when p > q?
V.13 (Random matrix) Let A be an m × n matrix of iid N (0, 1) entries. Denote its operator norm by
kAkop = maxv∈Sn−1 kAvk, which is also the largest singular value of A.
(a) Show that
kAkop = max hA, uv′ i . (V.4)
u∈Sm−1 ,v∈Sn−1
(b) Let U = {u1 , . . . , uM } and V = {v1 , . . . , vM } be an ϵ-net for the spheres Sm−1 and Sn−1
respectively. Show that
1
kAkop ≤ max hA, uv′ i .
(1 − ϵ)2 u∈U ,v∈V
i i
i i
i i
Part VI
Statistical applications
i i
i i
i i
i i
i i
i i
479
This part gives an exposition on the application of information-theoretic principles and meth-
ods in mathematical statistics; we do so by discussing a selection of topics. To start, Chapter 28
introduces the basic decision-theoretic framework of statistical estimation and the Bayes risk
and the minimax risk as the fundamental limits. Chapter 29 gives an exposition of the classi-
cal large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the
role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classi-
cal lower bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees) from the variational
characterization and the data processing inequality (DPI) of χ2 -divergence in Chapter 7.
Moving into high dimensions, Chapter 30 introduces the mutual information method for sta-
tistical lower bound, based on the DPI for mutual information as well as the theory of capacity
and rate-distortion function from Parts IV and V. This principled approach includes three popular
methods for proving minimax lower bounds (Le Cam, Assouad, and Fano) as special cases, which
are discussed at length in Chapter 31 drawing results from metric entropy in Chapter 27 also.
Complementing the exposition on lower bounds in Chapters 30 and 31, in Chapter 32 we
present three upper bounds on statistical estimation based on metric entropy. These bounds appear
strikingly similar but follow from completely different methodologies.
Chapter 33 introduces strong data processing inequalities (SDPI), which are quantitative
strengthning of DPIs in Part I. As applications we show how to apply SDPI to deduce lower
bounds for various estimation problems on graphs or in distributed settings.
i i
i i
i i
where each distribution is indexed by a parameter θ taking values in the parameter space Θ.
In the decision-theoretic framework, we play the following game: Nature picks some parameter
θ ∈ Θ and generates a random variable X ∼ Pθ . A statistician observes the data X and wants to
infer the parameter θ or its certain attributes. Specifically, consider some functional T : Θ → Y
and the goal is to estimate T(θ) on the basis of the observation X. Here the estimand T(θ) may be
the parameter θ itself, or some function thereof (e.g. T(θ) = 1{θ>0} or kθk).
An estimator (decision rule) is a function T̂ : X → Ŷ . Note the that the action space Ŷ need
not be the same as Y (e.g. T̂ may be a confidence interval). Here T̂ can be either deterministic,
i.e. T̂ = T̂(X), or randomized, i.e., T̂ obtained by passing X through a conditional probability
distribution (Markov transition kernel) PT̂|X , or a channel in the language of Part I. For all practical
purposes, we can write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and
independent of X.
To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such
that ℓ(T, T̂) is the risk of T̂ for estimating T. Since we are dealing with loss (as opposed to reward),
all the negative (converse) results are lower bounds and all the positive (achievable) results are
upper bounds. Note that X is a random variable, so are T̂ and ℓ(T, T̂). Therefore, to make sense of
“minimizing the loss”, we consider the average risk:
Z
Rθ (T̂) = Eθ [ℓ(T, T̂)] = Pθ (dx)PT̂|X (dt̂|x)ℓ(T(θ), t̂), (28.2)
which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect
to which the expectation is taken. Note that the expected risk depends on the estimator as well as
the ground truth.
480
i i
i i
i i
Remark 28.1. We note that the problem of hypothesis testing and inference can be encompassed
as special cases of the estimation paradigm. As previously discussed in Section 16.4, there are
three formulations for testing:
H0 : θ = θ 0 vs. H1 : θ = θ1 , θ0 6= θ1
H0 : θ = θ 0 vs. H1 : θ ∈ Θ 1 , θ0 ∈
/ Θ1
H0 : θ ∈ Θ 0 vs. H1 : θ ∈ Θ 1 , Θ0 ∩ Θ1 = ∅.
For each case one can introduce the appropriate parameter space and loss function. For example,
in the last (most general) case, we may take
(
0 θ ∈ Θ0
Θ = Θ0 ∪ Θ1 , T(θ) = , T̂ ∈ {0, 1}
1 θ ∈ Θ1
and use the zero-one loss ℓ(T, T̂) = 1{T̸=T̂} so that the expected risk Rθ (T̂) = Pθ {θ ∈ / ΘT̂ } is the
probability of error.
For the problem of inference, the goal is to output a confidence interval (or region) which covers
the true parameter with high probability. In this case T̂ is a subset of Θ and we may choose the
loss function ℓ(θ, T̂) = 1{θ∈/ T̂} + λlength(T̂) for some λ > 0, in order to balance the coverage and
the size of the confidence interval.
Remark 28.2 (Randomized versus deterministic estimators). Although most of the estimators used
in practice are deterministic, there are a number of reasons to consider randomized estimators:
• For certain formulations, such as the minimizing worst-case risk (minimax approach), deter-
ministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the
objective is to minimize the average risk (Bayes approach), then it does not lose generality to
restrict to deterministic estimators.
• The space of randomized estimators (viewed as Markov kernels) is convex which is the convex
hull of deterministic estimators. This convexification is needed for example for the treatment
of minimax theorems.
i i
i i
i i
482
By definition, more structure (smaller parameter space) always makes the estimation task easier
(smaller worst-case risk), but not necessarily so in terms of computation.
For estimating θ itself (denoising), it is customary to use a loss function defined by certain
P 1
norms, e.g., ℓ(θ, θ̂) = kθ − θ̂kpα for some 1 ≤ p ≤ ∞ and α > 0, where kθkp ≜ ( |θi |p ) p , with
p = α = 2 corresponding to the commonly used quadratic loss (squared error). Some well-known
estimators include the Maximum Likelihood Estimator (MLE)
θ̂ML = X (28.3)
and the James-Stein estimator based on shrinkage
(d − 2)σ 2
θ̂JS = 1 − X (28.4)
kXk22
The choice of the estimator depends on both the objective and the parameter space. For instance,
if θ is known to be sparse, it makes sense to set the smaller entries in the observed X to zero
(thresholding) in order to better denoise θ (cf. Section 30.2).
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 483
In addition to estimating the vector θ itself, it is also of interest to estimate certain functionals
T(θ) thereof, e.g., T(θ) = kθkp , max{θ1 , . . . , θd }, or eigenvalues in the matrix case. In addition,
the hypothesis testing problem in the GLM has been well-studied. For example, one can consider
detecting the presence of a signal by testing H0 : θ = 0 against H1 : kθk ≥ ϵ, or testing weak signal
H0 : kθk ≤ ϵ0 versus strong signal H1 : kθk ≥ ϵ1 , with or without further structural assumptions
on θ. We refer the reader to the monograph [165] devoted to these problems.
i i
i i
i i
484
An estimator θ̂ is called a Bayes estimator if it attains the Bayes risk, namely, R∗π = Eθ∼π [Rθ (θ̂∗ )].
∗
Remark 28.3. Bayes estimator is always deterministic – this fact holds for any loss function. To
see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external
randomness independent of X and θ, its risk is lower bounded by
Rπ (θ̂) = Eθ,X,U ℓ(θ, θ̂(X, U)) = EU Rπ (θ̂(·, U)) ≥ inf Rπ (θ̂(·, u)).
u
Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic
estimator whose average risk is no worse than that of the randomized estimator.
An alternative way to under this fact is the following: Note that the average risk Rπ (θ̂) defined
in (28.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X )
is affine, whose minimum is achieved at the extremal points. In this case the extremal points of
Markov kernels are simply delta measures, which corresponds to deterministic estimators.
In certain settings the Bayes estimator can be found explicitly. Consider the problem of esti-
mating θ ∈ Rd drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = kθ̂ − θk22 , the Bayes
estimator is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square
error (MMSE)
R∗π = Ekθ − E[θ|X]k22 = Tr(Cov(θ|X)),
where Cov(θ|X = x) is the conditional covariance of θ given X = x.
As a concrete example, let us consider the Gaussian Location Model in Section 28.2 with a
Gaussian prior.
Example 28.1 (Bayes risk in GLM). Consider the scalar case, where X = θ + Z and Z ∼ N(0, σ 2 )
is independent of θ. Consider a Gaussian prior θ ∼ π = N(0, s). One can verify that the posterior
sσ 2
2 x, s+σ 2 ). As such, the Bayes estimator is E[θ|X] = s+σ 2 X and the
s s
distribution Pθ|X=x is N( s+σ
Bayes risk is
sσ 2
R∗π = . (28.6)
s + σ2
Similarly, for multivariate GLM: X = θ + Z, Z ∼ N(0, Id ), if θ ∼ π = N(0, sId ), then we have
sσ 2
R∗π = d. (28.7)
s + σ2
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 485
If there exists θ̂ s.t. supθ∈Θ Rθ (θ̂) = R∗ , then the estimator θ̂ is minimax (minimax optimal).
Finding the value of the minimax risk R∗ entails proving two things, namely,
• a minimax upper bound, by exhibiting an estimator θ̂∗ such that Rθ (θ̂∗ ) ≤ R∗ + ϵ for all θ ∈ Θ;
• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ, such that
Rθ ≥ R∗ − ϵ,
where ϵ > 0 is arbitrary. This task is frequently difficult especially in high dimensions. Instead of
the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call
minimax rate, such that
R∗ Ψ, (28.9)
that is, cΨ ≤ R∗ ≤ CΨ for some universal constants c, C ≥ 0. Establishing Ψ is the minimax rate
still entails proving the minimax upper and lower bounds, albeit within multiplicative constant
factors.
In practice, minimax lower bounds are rarely established according to the original definition.
The next result shows that the Bayes risk is always lower than the minimax risk. Throughout
this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a
sagaciously chosen prior.
Theorem 28.1. Let ∆(Θ) denote the collection of probability distributions on Θ. Then
1 “max ≥ mean”: For any θ̂, Rπ (θ̂) = Eθ∼π Rθ (θ̂) ≤ supθ∈Θ Rθ (θ̂). Taking the infimum over θ̂
completes the proof;
2 “min max ≥ max min”:
R∗ = inf sup Rθ (θ̂) = inf sup Rπ (θ̂) ≥ sup inf Rπ (θ̂) = sup R∗π ,
θ̂ θ∈Θ θ̂ π ∈∆(Θ) π ∈∆(Θ) θ̂ π
where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).
i i
i i
i i
486
Remark 28.4. Unlike Bayes estimators which, as shown in Remark 28.3, are always deterministic,
to minimize the worst-case risk it is sometimes necessary to randomize for example in the context
of hypotheses testing (Chapter 14). Specifically, consider a trivial experiment where θ ∈ {0, 1} and
X is absent, so that we are forced to guess the value of θ under the zero-one loss ℓ(θ, θ̂) = 1{θ̸=θ̂} .
It is clear that in this case the minimax risk is 21 , achieved by random guessing θ̂ ∼ Ber( 21 ) but not
by any deterministic θ̂.
As an application of Theorem 28.1, let us determine the minimax risk of the Gaussian location
model under the quadratic loss function.
Example 28.2 (Minimax quadratic risk of GLM). Consider the Gaussian location model without
structural assumptions, where X ∼ N(θ, σ 2 Id ) with θ ∈ Rd . We show that
By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X which
achieves Rθ (θ̂ML ) = d for all θ. To get a matching minimax lower bound, we consider the prior
θ ∼ N(0, s). Using the Bayes risk previously computed in (28.6), we have R∗ ≥ R∗π = s+ sd
1.
∗
Sending s → ∞ yields R ≥ d.
Remark 28.5 (Non-uniqueness of minimax estimators). In general, estimators that achieve the
minimax risk need not be unique. For instance, as shown in Example 28.2, the MLE θ̂ML = X
is minimax for the unconstrained GLM in any dimension. On the other hand, it is known that
whenever d ≥ 3, the risk of the James-Stein estimator (28.4) is smaller that of the MLE everywhere
(see Fig. 28.2) and thus is also minimax. In fact, there exist a continuum of estimators that are
minimax for (28.11) [196, Theorem 5.5].
3.0
2.8
2.6
2.4
2.2
2 4 6 8
Figure 28.2 Risk of the James-Stein estimator (28.4) in dimension d = 3 and σ = 1 as a function of kθk.
For most of the statistical models, Theorem 28.1 in fact holds with equality; such a result is
known as a minimax theorem. Before discussing this important topic, here is an example where
minimax risk is strictly bigger than the worst-case Bayes risk.
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 487
Example 28.3. Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1{θ̂<θ} , i.e., the statistician loses one dollar
if the Nature’s choice exceeds the statistician’s guess and loses nothing if otherwise. Consider the
extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any θ̂ possibly
randomized, we have Rθ (θ̂) = P(θ̂ < θ). Thus R∗ ≥ limθ→∞ P(θ̂ < θ) = 1, which is clearly
achievable. On the other hand, for any prior π on N, Rπ (θ̂) = P(θ̂ < θ), which vanishes as θ̂ → ∞.
Therefore, we have R∗π = 0. Therefore in this case R∗ = 1 > R∗Bayes = 0.
As an exercise, one can show that the minimax quadratic risk of the GLM X ∼ N(θ, 1) with
parameter space θ ≥ 0 is the same as the unconstrained case. (This might be a bit surprising
because the thresholded estimator X+ = max(X, 0) achieves a better risk pointwise at every θ ≥
0; nevertheless, just like the James-Stein estimator (cf. Fig. 28.2), in the worst case the gain is
asymptotically diminishing.)
R∗ ≥ R∗Bayes .
This result can be interpreted from an optimization perspective. More precisely, R∗ is the value
of a convex optimization problem (primal) and R∗Bayes is precisely the value of its dual program.
Thus the inequality (28.10) is simply weak duality. If strong duality holds, then (28.10) is in fact
an equality, in which case the minimax theorem holds.
For simplicity, we consider the case where Θ is a finite set. Then
This is a convex optimization problem. Indeed, Pθ̂|X 7→ Eθ [ℓ(θ, θ̂)] is affine and the pointwise
supremum of affine functions is convex. To write down its dual problem, first let us rewrite (28.12)
in an augmented form
R∗ = min t (28.13)
Pθ̂|X ,t
Let π θ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The
Lagrangian of (28.13) is
!
X X X
L(Pθ̂|X , t, π ) = t + π θ Eθ [ℓ(θ, θ̂)] − t = 1 − πθ t + π θ Eθ [ℓ(θ, θ̂)].
θ∈Θ θ∈Θ θ∈Θ
P
By definition, we have R∗ ≥ mint,Pθ̂|X L(θ̂, t, π ). Note that unless θ∈Θ π θ = 1, mint∈R L(θ̂, t, π )
is −∞. Thus π = (π θ : θ ∈ Θ) must be a probability measure and the dual problem is
i i
i i
i i
488
Hence, R∗ ≥ R∗Bayes .
In summary, the minimax risk and the worst-case Bayes risk are related by convex duality,
where the primal variables are (randomized) estimators and the dual variables are priors. This
view can in fact be operationalized. For example, [173, 251] showed that for certain problems
dualizing Le Cam’s two-point lower bound (Theorem 31.1) leads to optimal minimax upper bound;
see Exercise VI.16.
This result shows that for virtually all problems encountered in practice, the minimax risk coin-
cides with the least favorable Bayes risk. At the heart of any minimax theorem, there is an
application of the separating hyperplane theorem. Below we give a proof of a special case
illustrating this type of argument.
R∗ = R∗Bayes
Proof. The first case directly follows from the duality interpretation in Section 28.3.3 and the
fact that strong duality holds for finite-dimensional linear programming (see for example [275,
Sec. 7.4].
For the second case, we start by showing that if R∗ = ∞, then R∗Bayes = ∞. To see this, consider
the uniform prior π on Θ. Then for any estimator θ̂, there exists θ ∈ Θ such that R(θ, θ̂) = ∞.
Then Rπ (θ̂) ≥ |Θ|
1
R(θ, θ̂) = ∞.
Next we assume that R∗ < ∞. Then R∗ ∈ R since ℓ is bounded from below (say, by a) by
assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ (θ̂))θ∈Θ . Then its average risk
i i
i i
i i
P
with respect to a prior π is given by the inner product hR(θ̂), π i = θ∈Θ π θ Rθ (θ̂). Define
Note that both S and T are convex (why?) subsets of Euclidean space RΘ and S∩T = ∅ by definition
of R∗ . By the separation hyperplane theorem, there exists a non-zero π ∈ RΘ and c ∈ R, such
that infs∈S hπ , si ≥ c ≥ supt∈T hπ , ti. Obviously, π must be componentwise positive, for otherwise
supt∈T hπ , ti = ∞. Therefore by normalization we may assume that π is a probability vector, i.e.,
a prior on Θ. Then R∗Bayes ≥ R∗π = infs∈S hπ , si ≥ supt∈T hπ , ti ≥ R∗ , completing the proof.
Pn = {P⊗
θ : θ ∈ Θ},
n
n ≥ 1. (28.14)
Clearly, n 7→ R∗n (Θ) is non-increasing since we can always discard the extra observations.
Typically, when Θ is a fixed subset of Rd , R∗n (Θ) vanishes as n → ∞. Thus a natural question is
at what rate R∗n converges to zero. Equivalently, one can consider the sample complexity, namely,
the minimum sample size to attain a prescribed error ϵ even in the worst case:
In the classical large-sample asymptotics (Chapter 29), the rate of convergence for the quadratic
risk is usually Θ( 1n ), which is commonly referred to as the “parametric rate“. In comparison, in this
book we focus on understanding the dependency on the dimension and other structural parameters
nonasymptotically.
As a concrete example, let us revisit the GLM in Section 28.2 with sample size n, in which case
i.i.d.
we observe X = (X1 , . . . , Xn ) ∼ N(0, σ 2 Id ), θ ∈ Rd . In this case, the minimax quadratic risk is1
dσ 2
R∗n = . (28.17)
n
To see this, note that in this case X̄ = n1 (X1 + . . . + Xn ) is a sufficient statistic (cf. Section 3.5) of X
2
for θ. Therefore the model reduces to X̄ ∼ N(θ, σn Id ) and (28.17) follows from the minimax risk
(28.11) for a single observation.
1
See Exercise VI.10 for an extension of this result to nonparametric location models.
i i
i i
i i
490
2
From (28.17), we conclude that the sample complexity is n∗ (ϵ) = d dσϵ e, which grows linearly
with the dimension d. This is the common wisdom that “sample complexity scales proportionally
to the number of parameters”, also known as “counting the degrees of freedom”. Indeed in high
dimensions we typically expect the sample complexity to grow with the ambient dimension; how-
ever, the exact dependency need not be linear as it depends on the loss function and the objective
of estimation. For example, consider the matrix case θ ∈ Rd×d with n independent observations
in Gaussian noise. Let ϵ be a small constant. Then we have
2
• For quadratic loss, namely, kθ − θ̂k2F , we have R∗n = dn and hence n∗ (ϵ) = Θ(d2 );
• If the loss function is kθ − θ̂k2op , then R∗n dn and hence n∗ (ϵ) = Θ(d) (Example 28.4);
• As opposed to θ itself, suppose we are content with p estimating only the scalar functional θmax =
∗
max{θ1 , . . . , θd } up to accuracy ϵ, then n (ϵ) = Θ( log d) (Exercise VI.13).
In the last two examples, the sample complexity scales sublinearly with the dimension.
In this model, the observation X = (X1 , . . . , Xd ) consists of independent (not identically dis-
ind
tributed) Xi ∼ Pθi . This should be contrasted with the multiple-observation model in (28.14), in
which n iid observations drawn from the same distribution are given.
The minimax risk of the tensorized experiment is related to the minimax risk R∗ (Pi ) and worst-
case Bayes risks R∗Bayes (Pi ) ≜ supπ i ∈∆(Θi ) Rπ i (Pi ) of each individual experiment as follows:
Consequently, if minimax theorem holds for each experiment, i.e., R∗ (Pi ) = R∗Bayes (Pi ), then it
also holds for the product experiment and, in particular,
X
d
R∗ (P) = R∗ (Pi ). (28.19)
i=1
i i
i i
i i
Proof. The right inequality of (28.18) simply follows by separately estimating θi on the basis
of Xi , namely, θ̂ = (θ̂1 , . . . , θ̂d ), where θ̂i depends only on Xi . For the left inequality, consider
Qd
a product prior π = i=1 π i , under which θi ’s are independent and so are Xi ’s. Consider any
randomized estimator θ̂i = θ̂i (X, Ui ) of θi based on X, where Ui is some auxiliary randomness
independent of X. We can rewrite it as θ̂i = θ̂i (Xi , Ũi ), where Ũi = (X\i , Ui ) ⊥ ⊥ Xi . Thus θ̂i can
be viewed as it a randomized estimator based on Xi alone and its the average risk must satisfy
Rπ i (θ̂i ) = E[ℓ(θi , θ̂i )] ≥ R∗π i . Summing over i and taking the suprema over priors π i ’s yields the
left inequality of (28.18).
i.i.d.
Theorem 28.4. Consider the Gaussian location model X1 , . . . , Xn ∼ N(θ, Id ). Then for 1 ≤ q <
∞,
E[kZkqq ]
inf sup Eθ [kθ − θ̂kqq ] = , Z ∼ N(0, Id ).
θ̂ θ∈Rd nq/2
i i
i i
i i
492
Proof. Note that N(θ, Id ) is a product distribution and the loss function is separable: kθ − θ̂kqq =
Pd
i=1 |θi − θ̂i | . Thus the experiment is a d-fold tensor product of the one-dimensional version.
q
By Theorem 28.3, it suffices to consider d = 1. The upper bound is achieved by the sample mean
Pn
X = 1n i=1 Xi ∼ N(θ, n1 ), which is a sufficient statistic.
For the lower bound, following Example 28.2, consider a Gaussian prior θ ∼ π = N(0, s). Then
the posterior distribution is also Gaussian: Pθ|X = N(E[θ|X], 1+ssn ). The following lemma shows
that the Bayes estimator is simply the conditional mean:
Lemma 28.5. Let Z ∼ N(0, 1). Then miny∈R E[|y + Z|q ] = E[|Z|q ].
where the inequality follows from the simple observation that for any a > 0, P [|y + Z| ≤ a] ≤
P [|Z| ≤ a], due to the symmetry and unimodality of the normal density.
2
Another example is the multivariate model with the squared error; cf. Exercise VI.7.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 493
Theorem 28.7. Consider the d-dimensional GLM where X1 , . . . , Xn ∼ N(0, Id ) are observed.
Let the loss function be ℓ(θ, θ̂) = ρ(θ − θ̂), where ρ : Rd → R+ is bowl-shaped and lower-
semicontinuous. Then the minimax risk is given by
Z
R∗ ≜ inf sup Eθ [ρ(θ − θ̂)] = Eρ √ , Z ∼ N(0, Id ).
θ̂ θ∈Rd n
Pn
Furthermore, the upper bound is attained by X̄ = 1n i=1 Xi .
Corollary 28.8. Let ρ(·) = k · kq for some q > 0, where k · k is an arbitrary norm on Rd . Then
EkZkq
R∗ = . (28.20)
nq/2
R∗ √d .
n
We can also phrase the result of Corollary 28.8 in terms of the sample complexity n∗ (ϵ) as
defined in (28.16). For example, for q = 2 we have n∗ (ϵ) = E[kZk2 ]/ϵ . The above examples
show that the scaling of n∗ (ϵ) with dimension depends on the loss function and the “rule of thumb”
that the sampling complexity is proportional to the number of parameters need not always hold.
Finally, for the sake of high-probability (as opposed to average) risk bound, consider ρ(θ − θ̂) =
1{kθ − θ̂k > ϵ}, which is lower semicontinuous and bowl-shaped. Then the exact expression
√
R∗ = P kZk ≥ ϵ n . This result is stronger since the sample mean is optimal simultaneously for
all ϵ, so that integrating over ϵ recovers (28.20).
Proof of Theorem 28.7. We only prove the lower bound. We bound the minimax risk R∗ from
below by the Bayes risk R∗π with the prior π = N(0, sId ):
i i
i i
i i
494
In order to prove Lemma 28.9, it suffices to consider ρ being indicator functions. This is done
in the next lemma, which we prove later.
Lemma 28.10. Let K ∈ Rd be a symmetric convex set and X ∼ N(0, Σ). Then maxy∈Rd P(X + y ∈
K) = P(X ∈ K).
Proof of Lemma 28.9. Denote the sublevel set set Kc = {x ∈ Rd : ρ(x) ≤ c}. Since ρ is bowl-
shaped, Kc is convex and symmetric, which satisfies the conditions of Lemma 28.10. So,
Z ∞
E[ρ(y + x)] = P(ρ(y + x) > c)dc,
Z ∞
0
= (1 − P(y + x ∈ Kc ))dc,
Z ∞
0
≥ (1 − P(x ∈ Kc ))dc,
Z ∞
0
= P(ρ(x) ≥ c)dc,
0
= E[ρ(x)].
Hence, miny∈Rd E[ρ(y + x)] = E[ρ(x)].
Before going into the proof of Lemma 28.10, we need the following definition.
The following result, due to Prékopa [255], characterizes the log-concavity of measures in terms
of that of its density function; see also [266] (or [135, Theorem 4.2]) for a proof.
Theorem 28.12. Suppose that μ has a density f with respect to the Lebesgue measure on Rd . Then
μ is log-concave if and only if f is log-concave.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 495
• Lebesgue measure: Let μ = vol be the Lebesgue measure on Rd , which satisfies Theorem 28.12
(f ≡ 1). Then
vol(λA + (1 − λ)B) ≥ vol(A)λ vol(B)1−λ , (28.21)
which implies3 the Brunn-Minkowski inequality:
1 1 1
vol(A + B) d ≥ vol(A) d + vol(B) d . (28.22)
• Gaussian distribution: Let μ = N(0, Σ), with a log-concave density f since log f(x) =
− p2 log(2π ) − 12 log det(Σ) − 21 x⊤ Σ−1 x is concave.
3
Applying (28.21) to A′ = vol(A)−1/d A, B′ = vol(B)−1/d B (both of which have unit volume), and
λ = vol(A)1/d /(vol(A)1/d + vol(B)1/d ) yields (28.22).
i i
i i
i i
In this chapter we give an overview of the classical large-sample theory in the setting of iid obser-
vations in Section 28.4 focusing again on the minimax risk (28.15). These results pertain to smooth
parametric models in fixed dimensions, with the sole asymptotics being the sample size going to
infinity. The main result is that, under suitable conditions, the minimax squared error of estimating
i.i.d.
θ based on X1 , . . . , Xn ∼ Pθ satisfies
1 + o( 1)
inf sup Eθ [kθ̂ − θk22 ] = sup TrJ− 1
F (θ). (29.1)
θ̂ θ∈Θ n θ∈Θ
where JF (θ) is the Fisher information matrix introduced in (2.31) in Chapter 2. This is asymptotic
characterization of the minimax risk with sharp constant. In later chapters, we will proceed to high
dimensions where such precise results are difficult and rare.
Throughout this chapter, we focus on the quadratic risk and assume that Θ is an open set of the
Euclidean space Rd .
496
i i
i i
i i
Theorem 29.1 (HCR lower bound). The quadratic loss of any estimator θ̂ at θ ∈ Θ ⊂ Rd satisfies
(Eθ [θ̂] − Eθ′ [θ̂])2
Rθ (θ̂) = Eθ [(θ̂ − θ)2 ] ≥ Varθ (θ̂) ≥ sup . (29.2)
θ ′ ̸=θ χ2 (Pθ′ kPθ )
Next we apply Theorem 29.1 to unbiased estimators θ̂ that satisfies Eθ [θ̂] = θ for all θ ∈ Θ.
Then
(θ − θ′ )2
Varθ (θ̂) ≥ sup .
θ ′ ̸=θ χ2 (Pθ′ kPθ )
Lower bounding the supremum by the limit of θ′ → θ and recall the asymptotic expansion of
χ2 -divergence from Theorem 7.20, we get, under the regularity conditions in Theorem 7.20, the
celebrated Cramér-Rao (CR) lower bound [78, 259]:
1
Varθ (θ̂) ≥ . (29.4)
JF (θ)
A few more remarks are as follows:
• Note that the HCR lower bound Theorem 29.1 is based on the χ2 -divergence. For a version
based on Hellinger distance which also implies the CR lower bound, see Exercise VI.5.
• Both the HCR and the CR lower bounds extend to the multivariate case as follows. Let θ̂ be
an unbiased estimator of θ ∈ Θ ⊂ Rd . Assume that its covariance matrix Covθ (θ̂) = Eθ [(θ̂ −
θ)(θ̂ − θ)⊤ ] is positive definite. Fix a ∈ Rd . Applying Theorem 29.1 to ha, θ̂i, we get
h a, θ − θ ′ i 2
χ2 (Pθ kPθ′ ) ≥ .
a⊤ Covθ (θ̂)a
Optimizing over a yields1
Covθ (θ̂) J− 1
F (θ). (29.5)
1 ⟨x,y⟩2
For Σ 0, supx̸=0 x⊤ Σx
= y⊤ Σ−1 y, attained at x = Σ−1 y.
i i
i i
i i
498
• For a sample of n iid observations, by the additivity property (2.35), the Fisher information
matrix is equal to nJF (θ). Taking the trace on both sides, we conclude the squared error of any
unbiased estimators satisfies
1
Eθ [kθ̂ − θk22 ] ≥ Tr(J− 1
F (θ)).
n
This is already very close to (29.1), except for the fundamental restriction that of unbiased
estimators.
Similar to (29.3), applying data processing and variational representation of χ2 -divergence yields
(EP [θ − θ̂] − EQ [θ − θ̂])2
χ2 (PθX kQθX ) ≥ χ2 (Pθθ̂ kQθθ̂ ) ≥ χ2 (Pθ−θ̂ kQθ−θ̂ ) ≥ .
VarQ (θ̂ − θ)
Note that by design, PX = QX and thus EP [θ̂] = EQ [θ̂]; on the other hand, EP [θ] = EQ [θ] + δ .
Furthermore, Eπ [(θ̂ − θ)2 ] ≥ VarQ (θ̂ − θ). Since this applies to any estimators, we conclude that
the Bayes risk R∗π (and hence the minimax risk) satisfies
δ2
R∗π ≜ inf Eπ [(θ̂ − θ)2 ] ≥ sup , (29.6)
θ̂ δ̸=0 χ2 (PXθ kQXθ )
which is referred to as the Bayesian HCR lower bound in comparison with (29.2).
Similar to the deduction of CR lower bound from the HCR, we can further lower bound
this supremum by evaluating the small-δ limit. First note the following chain rule for the
χ2 -divergence:
" 2 #
dPθ
χ (PXθ kQXθ ) = χ (Pθ kQθ ) + EQ χ (PX|θ kQX|θ ) ·
2 2 2
.
dQθ
i i
i i
i i
Under suitable regularity conditions in Theorem 7.20, again applying the local expansion of χ2 -
divergence yields
R π ′2
• χ2 (Pθ kQθ ) = χ2 (Tδ π kπ ) = (J(π ) + o(1))δ 2 , where J(π ) ≜ π is the Fisher information of
the prior;
• χ2 (PX|θ kQX|θ ) = [JF (θ) + o(1)]δ 2 .
δ2 δ2 s
R∗π ≥ sup δ 2 (n+ 1s )
= lim
δ 2 (n+ 1s )
= .
δ̸=0 e −1 δ→0 e −1 sn + 1
In view of the Bayes risk found in Example 28.1, we see that in this case the Bayesian HCR and
Bayesian Cramér-Rao lower bounds are exact.
Theorem 29.2 (BCR lower bound). Let π be a differentiable prior density on the interval [θ0 , θ1 ]
such that π (θ0 ) = π (θ1 ) = 0 and
Z θ1 ′ 2
π (θ)
J( π ) ≜ dθ < ∞. (29.8)
θ0 π (θ)
i i
i i
i i
500
Let Pθ (dx) = pθ (x) μ(dx), where the density pθ (x) is differentiable in θ for μ-almost every x.
Assume that for π-almost every θ,
Z
μ(dx)∂θ pθ (x) = 0. (29.9)
Then the Bayes quadratic risk R∗π ≜ infθ̂ E[(θ − θ̂)2 ] satisfies
1
R∗π ≥ . (29.10)
Eθ∼π [JF (θ)] + J(π )
Proof. In view of Remark 28.3, it loses no generality to assume that the estimator θ̂ = θ̂(X) is
deterministic. For each x, integration by parts yields
Z θ1 Z θ1
dθ(θ̂(x) − θ)∂θ (pθ (x)π (θ)) = pθ (x)π (θ)dθ.
θ0 θ0
Then
R∗π ≜ inf Eπ [kθ̂ − θk22 ] ≥ Tr((Eθ∼π [JF (θ)] + J(π ))−1 ), (29.12)
θ̂
where the Fisher information matrices are given by JF (θ) = Eθ [∇θ log pθ (X)∇θ log pθ (X)⊤ ] and
J(π ) = diag(J(π 1 ), . . . , J(π d )).
i i
i i
i i
where ek denotes the kth standard basis. Applying Cauchy-Schwarz and optimizing over u yield
h u , ek i 2
E[(θ̂k (X) − θk )2 ] ≥ sup = Σ− 1
kk ,
u̸=0 u⊤ Σ u
where Σ ≡ E[∇ log(pθ (X)π (θ))∇ log(pθ (X)π (θ))⊤ ] = Eθ∼π [JF (θ)] + J(π ), thanks to (29.11).
Summing over k completes the proof of (29.12).
• The above versions of the BCR bound assume a prior density that vanishes at the boundary.
If we choose a uniform prior, the same derivation leads to a similar lower bound known as
the Chernoff-Rubin-Stein inequality (see Ex. VI.4), which also suffices for proving the optimal
minimax lower bound in (29.1).
• For the purpose of the lower bound, it is advantageous to choose a prior density with the mini-
mum Fisher information. The optimal density with a compact support is known to be a squared
cosine density [160, 315]:
min J( g ) = π 2 ,
g on [−1,1]
attained by
πu
g(u) = cos2 . (29.13)
2
• Suppose the goal is to estimate a smooth functional T(θ) of the unknown parameter θ, where
T : Rd → Rs is differentiable with ∇T(θ) = ( ∂ T∂θi (θ)
j
) its s × d Jacobian matrix. Then under the
same condition of Theorem 29.3, we have the following Bayesian Cramér-Rao lower bound for
functional estimation:
As a consequence of the BCR bound, we prove the lower bound part for the asymptotic minimax
risk in (29.1).
Theorem 29.4. Assume that θ 7→ JF (θ) is continuous. Denote the minimax squared error R∗n ≜
i.i.d.
infθ̂ supθ∈Θ Eθ [kθ̂ − θk22 ], where Eθ is taken over X1 , . . . , Xn ∼ Pθ . Then as n → ∞,
1 + o( 1)
R∗n ≥ sup TrJ− 1
F (θ). (29.15)
n θ∈Θ
Proof. Fix θ ∈ Θ. Then for all sufficiently small δ , B∞ (θ, δ) = θ + [−δ, δ]d ⊂ Θ. Let π i (θi ) =
1 θ−θi Qd
δ g( δ ), where g is the prior density in (29.13). Then the product distribution π = i=1 π i
satisfies the assumption of Theorem 29.3. By the scaling rule of Fisher information (see (2.34)),
2 2
J(π i ) = δ12 J(g) = δπ2 . Thus J(π ) = δπ2 Id .
i i
i i
i i
502
It is known that (see [44, Theorem 2, Appendix V]) the continuity of θ 7→ JF (θ) implies (29.11).
So we are ready to apply the BCR bound in Theorem 29.3. Lower bounding the minimax by the
Bayes risk and also applying the additivity property (2.35) of Fisher information, we obtain
− 1 !
∗ 1 π2
Rn ≥ · Tr Eθ∼π [JF (θ)] + 2 Id .
n nδ
Finally, choosing δ = n−1/4 and applying the continuity of JF (θ) in θ, the desired (29.15) follows.
Similarly, for estimating a smooth functional T(θ), applying (29.14) with the same argument
yields
1 + o(1)
inf sup Eθ [kT̂ − T(θ)k22 ] ≥ sup Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ). (29.16)
T̂ θ∈Θ n θ∈Θ
where
X
n
Lθ (Xn ) = log pθ (Xi )
i=1
is the total log-likelihood and pθ (x) = dP dμ (x) is the density of Pθ with respect to some com-
θ
mon dominating measure μ. For discrete distribution Pθ , the MLE can also be written as the KL
projection2 of the empirical distribution P̂n to the model class: θ̂MLE ∈ arg minθ∈Θ D(P̂n kPθ ).
2
Note that this is the reverse of the information projection studied in Section 15.3.
i i
i i
i i
The main intuition why MLE works is as follows. Assume that the model is identifiable, namely,
θ 7→ Pθ is injective. Then for any θ 6= θ0 , we have by positivity of the KL divergence (Theorem 2.3)
" n #
X pθ ( X i )
E θ 0 [ Lθ − Lθ0 ] = E θ 0 log = −nD(Pθ0 ||Pθ ) < 0.
pθ0 (Xi )
i=1
In other words, Lθ − Lθ0 is an iid sum with a negative mean and thus negative with high probability
for large n. From here the consistency of MLE follows upon assuming appropriate regularity
conditions, among which is Wald’s integrability condition Eθ0 [sup∥θ−θ0 ∥≤ϵ log ppθθ (X)] < ∞ [330,
0
333].
Assuming more conditions one can obtain the asymptotic normality and efficiency of the
MLE. This follows from the local quadratic approximation of the log-likelihood function. Define
V(θ, x) ≜ ∇θ pθ (x) (score) and H(θ, x) ≜ ∇2θ pθ (x). By Taylor expansion,
! !
Xn
1 Xn
⊤ ⊤
Lθ =Lθ0 + (θ − θ0 ) V(θ0 , Xi ) + (θ − θ0 ) H(θ0 , Xi ) (θ − θ0 )
2
i=1 i=1
+ o(n(θ − θ0 ) ).
2
(29.18)
Recall from Section 2.6.2* that, under suitable regularity conditions, we have
Eθ0 [V(θ0 , X)] = 0, Eθ0 [V(θ0 , X)V(θ0 , X)⊤ ] = −Eθ0 [H(θ0 , X)] = JF (θ0 ).
Thus, by the Central Limit Theorem and the Weak Law of Large Numbers, we have
1 X 1X
n n
d P
√ V(θ0 , Xi )−
→N (0, JF (θ0 )), H(θ0 , Xi )−
→ − JF (θ0 ).
n n
i=1 i=1
Substituting these quantities into (29.18), we obtain the following stochastic approximation of the
log-likelihood:
p n
Lθ ≈ Lθ0 + h nJF (θ0 )Z, θ − θ0 i − (θ − θ0 )⊤ JF (θ0 )(θ − θ0 ),
2
where Z ∼ N (0, Id ). Maximizing the right-hand side yields:
1
θ̂MLE ≈ θ0 + √ JF (θ0 )−1/2 Z.
n
From this asymptotic normality, we can obtain Eθ0 [kθ̂MLE − θ0 k22 ] ≤ n1 (TrJF (θ0 )−1 + o(1)), and
for smooth functionals by Taylor expanding T at θ0 (delta method), Eθ0 [kT(θ̂MLE ) − T(θ0 )k22 ] ≤
−1 ⊤
n (Tr(∇T(θ0 )JF (θ0 ) ∇T(θ0 ) ) + o(1)), matching the information bounds (29.15) and (29.16).
1
Of course, the above heuristic derivation requires additional assumptions to justify (for example,
Cramér’s condition, cf. [126, Theorem 18] and [274, Theorem 7.63]). Even stronger assumptions
are needed to ensure the error is uniform in θ in order to achieve the minimax lower bound in
Theorem 29.4; see, e.g., Theorem 34.4 (and also Chapters 36-37) of [44] for the exact conditions
and statements. A more general and abstract theory of MLE and the attainment of information
bound were developed by Hájek and Le Cam; see [152, 193].
Despite its wide applicability and strong optimality properties, the methodology of MLE is not
without limitations. We conclude this section with some remarks along this line.
i i
i i
i i
504
• MLE may not exist even for simple parametric models. For example, consider X1 , . . . , Xn
drawn iid from the location-scale mixture of two Gaussians 12 N ( μ1 , σ12 ) + 12 N ( μ2 , σ22 ), where
( μ1 , μ2 , σ1 , σ2 ) are unknown parameters. Then the likelihood can be made arbitrarily large by
setting for example μ1 = X1 and σ1 → 0.
• MLE may be inconsistent; see [274, Example 7.61] and [125] for examples, both in one-
dimensional parametric family.
• In high dimensions, it is possible that MLE fails to achieve the minimax rate (Exercise VI.14).
Theorem 29.5. Fox fixed k, the minimax squared error of estimating P satisfies
b − Pk22 ] = 1 k − 1 + o(1) , n → ∞.
R∗sq (k, n) ≜ inf sup E[kP (29.19)
b
P P∈Pk n k
diag(θ) − θθ⊤ − Pk θ
∇T(θ)J− 1
F (θ)∇T(θ)
⊤
=
−Pk θ⊤ Pk (1 − Pk ).
Pk Pk
So Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ) = i=1 Pi (1 − Pi ) = 1 −
2
i=1 Pi , which achieves its maximum
1 − 1k at the uniform distribution. Applying the functional form of the BCR bound in (29.16), we
conclude R∗sq (k, n) ≥ n1 (1 − 1k + o(1)).
For the upper bound, consider the MLE, which in this case coincides with the empirical dis-
Pn
tribution P̂ = (P̂i ) (Exercise VI.8). Note that nP̂i = j=1 1{Xj =i} ∼ Bin(n, Pi ). Then for any P,
Pk
E[kP̂ − Pk22 ] = n1 i=1 Pi (1 − Pi ) ≤ 1n (1 − 1k ).
i i
i i
i i
−1/k
• In fact, for any k, n, we have the precise result: R∗sq (k, n) = (11+√ 2 – see Ex. VI.7h. This can be
n)
shown by considering a Dirichlet prior (13.15) and applying the corresponding Bayes estimator,
which is an additively-smoothed empirical distribution (Section 13.5).
• Note that R∗sq (k, n) does not grow with the alphabet size k; this is because squared loss is
too weak for estimating probability vectors. More meaningful loss functions include the f-
divergences in Chapter 7, such as the total variation, KL divergence, χ2 -divergence. These
minimax rates are worked out in Exercise VI.8 and Exercise VI.9, for both small and large
alphabets, and they indeed depend on the alphabet size k. For example, the minimax KL risk
satisfies Θ( nk ) for k ≤ n and grows as Θ(log nk ) for k n. This agrees with the rule of thumb
that consistent estimation requires the sample size to scale faster than the dimension.
As a final application, let us consider the classical problem of entropy estimation in information
theory and statistics [219, 98, 156], where the goal is to estimate the Shannon entropy, a non-linear
functional of P. The following result follows from the functional BCR lower bound (29.16) and
analyzing the MLE (in this case the empirical entropy) [25].
Theorem 29.6. For fixed k, the minimax quadratic risk of entropy estimation satisfies
b (X1 , . . . , Xn ) − H(P))2 ] = 1 max V(P) + o(1) , n → ∞
R∗ent (k, n) ≜ inf sup E[(H
b P∈Pk
H n P∈Pk
Pk
where H(P) = i=1 Pi log P1i = E[log P(1X) ] and V(P) = Var[log P(1X) ] are the Shannon entropy
and varentropy (cf. (10.4)) of P.
Let us analyze the result of Theorem 29.6 and see how it extends to large alphabets. It can be
2
shown that3 maxP∈Pk V(P) log2 k, which suggests that R∗ent ≡ R∗ent (k, n) may satisfy R∗ent logn k
even when the alphabet size k grows with n; however, this result only holds for sufficiently small
alphabet. In fact, back in Lemma 13.2 we have shown that for the empirical entropy which achieves
the bound in Theorem 29.6, its bias is on the order of nk , which is no longer negligible on large
alphabets. Using techniques of polynomial approximation [335, 168], one can reduce this bias to
n log k and further show that consistent entropy estimation is only possible if and only if n log k
k k
3
Indeed, maxP∈Pk V(P) ≤ log2 k for all k ≥ 3 [239, Eq. (464)]. For the lower bound, consider
P = ( 12 , 2(k−1)
1 1
, . . . 2(k−1) ).
i i
i i
i i
In this chapter we describe a strategy for proving statistical lower bound we call the Mutual Infor-
mation Method (MIM), which entails comparing the amount of information data provides with
the minimum amount of information needed to achieve a certain estimation accuracy. Similar to
Section 29.2, the main information-theoretical ingredient is the data-processing inequality, this
time for mutual information as opposed to f-divergences.
Here is the main idea of the MIM: Fix some prior π on Θ and we aim to lower bound the Bayes
risk R∗π of estimating θ ∼ π on the basis of X with respect to some loss function ℓ. Let θ̂ be an
estimator such that E[ℓ(θ, θ̂)] ≤ D. Then we have the Markov chain θ → X → θ̂. Applying the
data processing inequality (Theorem 3.7), we have
Note that
• The leftmost quantity can be interpreted as the minimum amount of information required to
achieve a given estimation accuracy. This is precisely the rate-distortion function ϕ(D) ≡ ϕθ (D)
(recall Section 24.3).
• The rightmost quantity can be interpreted as the amount of information provided by the data
about the latent parameter. Sometimes it suffices to further upper-bound it by the capacity of
the channel PX|θ by maximizing over all priors (Chapter 5):
Therefore, we arrive at the following lower bound on the Bayes and hence the minimax risks
The reasoning of the mutual information method is reminiscent of the converse proof for joint-
source channel coding in Section 26.3. As such, the argument here retains the flavor of “source-
channel separation”, in that the lower bound in (30.1) depends only on the prior (source) and
the loss function, while the capacity upper bound (30.2) depends only on the statistical model
(channel).
In the next few sections, we discuss a sequence of examples to illustrate the MIM and its
execution:
506
i i
i i
i i
• Denoising a vector in Gaussian noise, where we will compute the exact minimax risk;
• Denoising a sparse vector, where we determine the sharp minimax rate;
• Community detection, where the goal is to recover a dense subgraph planted in a bigger Erd�s-
Rényi graph.
In the next chapter we will discuss three popular approaches for, namely, Le Cam’s method,
Assouad’s lemma, and Fano’s method. As illustrated in Fig. 30.1, all three follow from the mutual
Figure 30.1 The three lower bound techniques as consequences of the Mutual Information Method.
information method, corresponding to different choice of prior π for θ, namely, the uniform dis-
tribution over a two-point set {θ0 , θ1 }, the hypercube {0, 1}d , and a packing (recall Section 27.1).
While these methods are highly useful in determining the minimax rate for many problems, they
are often loose with constant factors compared to the MIM. In the last section of this chapter, we
discuss the problem of how and when is non-trivial estimation achievable by applying the MIM;
for this purpose, none of the three methods in the next chapter works.
i i
i i
i i
508
Using the sufficiency of X̄ and the formula of Gaussian channel capacity (cf. Theorem 5.11 or
Theorem 20.11), the mutual information between the parameter and the data can be computed as
d
I(θ; X) = I(θ; X̄) = log(1 + sn).
2
It then follows from (30.3) that R∗π ≥ sd
1+sn , which in fact matches the exact Bayes risk in (28.7).
Sending s → ∞ yields the identity
d
R∗ (Rd ) =
. (30.4)
n
In the above unconstrained GLM, we are able to compute everything in close form when
applying the mutual information method. Such exact expressions are rarely available in more
complicated models in which case various bounds on the mutual information will prove useful.
Next, let us consider the GLM with bounded means, where the parameter space Θ = B(ρ) =
{θ : kθk2 ≤ ρ} is the ℓ2 -ball of radius ρ centered at zero. In this case there is no known close-
form formula for the minimax quadratic risk even in one dimension. Nevertheless, the next result
determines the sharp minimax rate, which characterizes the minimax risk up to universal constant
factors.
where Z ∼ N(0, Id ). Alternatively, we can use Corollary 5.8 to bound the capacity (as information
radius) by the KL diameter, which yields the same bound within constant factors:
1
I(θ; X) ≤ sup I(θ; θ + √ Z) ≤ max D(N(θ, Id /n)kN(θ, Id /n)k) = 2nr2 . (30.7)
Pθ :∥θ∥≤r n θ,θ ′ ∈B(r)
For the lower bound, due to the lack of close-form formula for the rate-distortion function
for uniform distribution over Euclidean balls, we apply the Shannon lower bound (SLB) from
Section 26.1. Since θ has an isotropic distribution, applying Theorem 26.3 yields
d 2πed d cr2
inf I(θ; θ̂) ≥ h(θ) + log ≥ log ,
Pθ̂|θ :E∥θ−θ̂∥2 ≤D 2 D 2 D
i i
i i
i i
for some universal constant c, where the last inequality is because for θ ∼ Unif(B(r)), h(θ) =
log vol(B(r)) = d log r + log vol(B(1)) and the volume of a unit Euclidean ball in d dimensions
satisfies (recall (27.14)) vol(B(1))1/d √1d .
2 2
∗ 2 −nr /d 2
R∗ ≤ 2 , i.e., R ≥ cr e
Finally, applying (30.3) yields 12 log cr nr
. Optimizing over r and
−ax −a
using the fact that sup0<x<1 xe = ea if a ≥ 1 and e if a < 1, we have
1
d
R∗ ≥ sup cr2 e−nr /d
2
∧ ρ2 .
r∈[0,ρ] n
Finally, to further demonstrate the usefulness of the SLB, we consider non-quadratic loss
ℓ(θ, θ̂) = kθ − θ̂kr , the rth power of an arbitrary norm on Rd , for which the SLB was given in
(26.5) (see Exercise V.6). Applying the mutual information method yields the following minimax
lower bound.
i.i.d.
Theorem 30.2 (GLM with norm loss). Let X = (X1 , · · · , Xn ) ∼ N (θ, Id ) and let r > 0 be a
constant. Then
r/ 2 −r/d
d 2πe d − r/ d
inf sup Eθ [kθ̂ − θk ] ≥
r
V∥·∥ Γ 1 + n−r/2 V∥·∥ . (30.8)
θ̂ θ∈Rd re n r
Furthermore,
r
d
≲ nr/2 · inf sup Eθ [kθ̂ − θkr ] ≲ E[kZkr ], (30.9)
E[kZk∗ ] θ̂ θ∈Rd
Proof. Choose a Gaussian prior θ ∼ N (0, sId ). Suppose E[kθ̂ −θkr ] ≤ D. By the data processing
inequality,
( d )
d d Dre r d
log(1 + ns) ≥ I(θ; X) ≥ I(θ; θ̂) ≥ log(2πes) − log V∥·∥ Γ 1+ ,
2 2 d r
where the last inequality follows from (26.5). Rearranging terms and sending s → ∞ yields the
first inequality in (30.8), and the second follows from Stirling’s approximation Γ(x)1/x x for
x → ∞. For (30.9), the upper bound follows from choosing θ̂ = X̄ and the lower bound follows
from applying (30.8) with the following bound of Urysohn (cf., e.g., [235, p. 7]) on the volume of
a symmetric convex body.
Lemma 30.3. For any symmetric convex body K ⊂ Rd , vol(K)1/d ≲ w(K)/d, where w(K) is the
Gaussian width of K defined in (27.19).
Example 30.1. Recall from Theorem 28.7 that the upper bound in (30.9) is an equality. In view
of this, let us evaluate the tightness of the lower bound from SLB. As an example, consider r = 2
P 1/q
d
and the ℓq -norm kxkq = i=1 |xi |
q
with 1 ≤ q ≤ ∞. Recall the formula (27.13) for the
i i
i i
i i
510
In the special case of q = 2, we see that the lower bound in (30.8) is in fact exact and coincides with
2/ q
(30.4). For general q ∈ [1, ∞), (30.8) shows that the minimax rate is d n . However, for q = ∞,
the minimax lower bound we get is 1/p n, independent of the dimension d. In fact, the upper bound
in (30.9) is tight and since EkZk∞ log d (cf. Lemma 27.10), the minimax rate for the squared
ℓ∞ -risk is logn d . We will revisit this example in Section 31.4 and show how to obtain the sharp
logarithmic dependency on the dimension.
Remark 30.2 (SLB versus the volume method). Recall the connection between rate-distortion
function and the metric entropy in Section 27.7. As we have seen in Section 27.2, a common
lower bound for metric entropy is via the volume bound. In fact, the SLB can be interpreted as
a volume-based lower bound to the rate-distortion function. To see this, consider r = 1 and let θ
be uniformly distributed over some compact set Θ, so that h(θ) = log vol(Θ) (Theorem 2.6.(a)).
Applying Stirling’s approximation, the lower bound in (26.5) becomes log vol(vol (Θ)
B∥·∥ (cϵ)) for some
constant c, which has the same form as the volume ratio in Theorem 27.3 for metric entropy. We
will see later in Section 31.4 that in statistical applications, applying SLB yields basically the same
lower bound as applying Fano’s method to a packing obtained from the volume bound, although
SLB does not rely explicitly on a packing.
Next we prove an optimal lower bound applying MIM. (For a different proof using Fano’s method
in Section 31.4, see Exercise VI.11.)
Theorem 30.4.
k ed
R∗n (B0 (k)) ≳ log . (30.10)
n k
A few remarks are in order:
i i
i i
i i
Remark 30.3. • The lower bound (30.10) turns out to be tight, achieved by the maximum
likelihood estimator
which is equivalent to keeping the k entries from X̄ with the largest magnitude and setting the
rest to zero, or the following hard-thresholding estimator θ̂τ with an appropriately chosen τ (see
Exercise VI.12):
• Sharp asymptotics: For sublinear sparsity k = o(d), we have R∗n (B0 (k)) = (2 + o(1)) nk log dk
(Exercise VI.12); for linear sparsity k = (η + o(1))d with η ∈ (0, 1), R∗n (B0 (k)) = (β(η) +
o(1))d for some constant β(η). For the latter and more refined results, we refer the reader to the
monograph [171, Chapter 8].
Proof. First, note that B0 (k) is a union of linear subspace of Rd and thus homogeneous. Therefore
by scaling, we have
1 ∗ 1
R∗n (B0 (k)) = R (B0 (k)) ≜ R∗ (k, d). (30.13)
n 1 n
Thus it suffices to consider n = 1. Denote the observation by X = θ + Z.
Next, note that the following oracle lower bound:
R∗ (k, d) ≥ k,
which is the optimal risk given the extra information of the support of θ, in view of (30.4). Thus
to show (30.10), below it suffices to consider k ≤ d/4.
We now apply the mutual information method. Recall from (27.10) that Sdk denotes the
Hamming sphere, namely,
τ2
kθ − θ̂k22 ≥ dH (b, b̂). (30.14)
4
Let EdH (b, b̂) = δ k. Assume that δ ≤ 14 , for otherwise, we are done.
i i
i i
i i
512
Note the the following Markov chain b → θ → X → θ̂ → b̂ and thus, by the data processing
inequality of mutual information,
d kτ 2 kτ 2 k d
I(b; b̂) ≤ I(θ; X) ≤ log 1 + ≤ = log .
2 d 2 2 k
where the second inequality follows from the fact that kθk22 = kτ 2 and the Gaussian channel
capacity.
Conversely,
I(b̂; b) ≥ min I(b̂; b)
EdH (b,b̂)≤δ d
Theorem 30.5. Assume that k/n is bounded away from one. If almost exact recovery is possible,
then
2 + o( 1) n
d(pkq) ≥ log . (30.16)
k−1 k
i i
i i
i i
which can be shown by a reduction to testing the membership of two nodes given the rest. It turns
out that conditions (30.16) and (30.17) are optimal, in the sense that almost exact recovery can be
achieved (via maximum likelihood) provided that (30.17) holds and d(pkq) ≥ 2k− +ϵ n
1 log k for any
constant ϵ > 0. For details, we refer the readers to [151].
Proof. Suppose Ĉ achieves almost exact recovery of C∗ . Let ξ ∗ , ξˆ ∈ {0, 1}k denote their indicator
vectors, respectively, for example, ξi∗ = 1{i∈C∗ } for each i ∈ [n]. Then Then E[dH (ξ, ξ)]
ˆ = ϵn k for
some ϵn → 0. Applying the mutual information method as before, we have
( a) n ϵn k (b) n
∗ ˆ ∗
I(G; ξ ) ≥ I(ξ; ξ ) ≥ log − nh ≥ k log (1 + o(1)),
k n k
where (a) follows in exact the same manner as (30.15) did from Exercise I.9; (b) follows from the
assumption that k/n ≤ 1 − c for some constant c.
On the other hand, we upper bound the mutual information between the hidden community and
the graph as follows:
(a) (b) (c) k
∗ ⊗(n2)
I(G; ξ ) = min D(PG|ξ∗ kQ|Pξ∗ ) ≤ D(PG|ξ∗ kBer(q) |Pξ∗ ) = d(pkq),
Q 2
where (a) is by the variational representation of mutual information in Corollary 4.2; (b) follows
from choosing Q to be the distribution of the Erd�s-Rényi graph G(n, q); (c) is by the tensorization
property of KL divergence for product distributions (see Theorem 2.14). Combining the last two
displays completes the proof.
i i
i i
i i
514
i.i.d.
Theorem 30.6 (Bounded GLM continued). Suppose X1 , . . . , Xn ∼ N (θ, Id ), where θ belongs to
B, the unit ℓ2 -ball in Rd . Then for some universal constant C0 ,
n+C0 d
e− d−1 ≤ inf sup Eθ [kθ̂ − θk2 ] ≤ .
θ̂ θ∈B d+n
Proof. Without loss of generality, assume that the observation is X = θ+ √Zn , where Z ∼ N (0, Id ).
For the upper bound, applying the shrinkage estimator1 θ̂ = 1+1d/n X yields E[kθ̂ − θk2 ] ≤ n+d d .
For the lower bound, we apply MIM as in Theorem 30.1 with the prior θ ∼ Unif(Sd−1 ). We
still apply the AWGN capacity in (30.6) to get I(θ; X) ≤ n/2. (Here the constant 1/2 is important
and so the diameter-based (30.7) is too loose.) For the rate-distortion function of spherical uniform
distribution, applying Theorem 27.17 yields I(θ; θ̂) ≥ d−2 1 log E[∥θ̂−θ∥
1
2]
− C. Thus the lower bound
on E[kθ̂ − θk2 ] follows from the data processing inequality.
A similar phenomenon also occurs in the problem of estimating a discrete distribution P on k
elements based on n iid observations, which has been studied in Section 29.4 for small alphabet in
the large-sample asymptotics and extended in Exercise VI.7–VI.9 to large alphabets. In particular,
consider the total variation loss, which is at most one. Ex. VI.9f shows that the TV error of any
estimator is 1 − o(1) if n k; conversely, Ex. VI.9b demonstrates an estimator P̂ such that
E[χ2 (PkP̂)] ≤ nk− 1 2
+1 . Applying the joint range (7.29) between TV and χ and Jensen’s inequality,
we have
q
1 k− 1 n ≥ k − 2
E[TV(P, P̂)] ≤ 2 n+1
k− 1 n≤k−2
k+n
which is bounded away from one whenever n = Ω(k). In summary, non-trivial estimation in total
variation is possible if and only if n scales at least proportionally with k.
1
This corresponds to the Bayes estimator (Example 28.1) when we choose θ ∼ N (0, 1d Id ), which is approximately
concentrated on the unit sphere.
i i
i i
i i
In this chapter we study three commonly used techniques for proving minimax lower bounds,
namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in
Chapter 29 geared towards large-sample asymptotics in smooth parametric models, the approach
here is more generic, less tied to mean-squared error, and applicable in nonasymptotic settings
such as nonparametric or high-dimensional problems.
The common rationale of all three methods is reducing statistical estimation to hypothesis test-
ing. Specifically, to lower bound the minimax risk R∗ (Θ) for the parameter space Θ, the first step
is to notice that R∗ (Θ) ≥ R∗ (Θ′ ) for any subcollection Θ′ ⊂ Θ, and Le Cam, Assouad, and Fano’s
methods amount to choosing Θ′ to be a two-point set, a hypercube, or a packing, respectively. In
particular, Le Cam’s method reduces the estimation problem to binary hypothesis testing. This
method is perhaps the easiest to evaluate; however, the disadvantage is that it is frequently loose
in estimating high-dimensional parameters. To capture the correct dependency on the dimension,
both Assouad’s and Fano’s method rely on reduction to testing multiple hypotheses.
As illustrated in Fig. 30.1, all three methods in fact follow from the common principle of the
mutual information method (MIM) in Chapter 30, corresponding to different choice of priors.
The limitation of these methods, compared to the MIM, is that, due to the looseness in constant
factors, they are ineffective for certain problems such as estimation better than chance discussed
in Section 30.4.
Then
ℓ(θ0 , θ1 )
inf sup Eθ ℓ(θ, θ̂) ≥ sup (1 − TV(Pθ0 , Pθ1 )) (31.2)
θ̂ θ∈Θ θ0 ,θ1 ∈Θ 2α
515
i i
i i
i i
516
Proof. Fix θ0 , θ1 ∈ Θ. Given any estimator θ̂, let us convert it into the following (randomized)
test:
θ0 with probability ℓ(θ1 ,θ̂)
,
ℓ(θ0 ,θ̂)+ℓ(θ1 ,θ̂)
θ̃ =
θ1 with probability ℓ(θ0 ,θ̂)
.
ℓ(θ ,θ̂)+ℓ(θ ,θ̂) 0 1
and similarly for θ1 . Consider the prior π = 12 (δθ0 + δθ1 ) and let θ ∼ π. Taking expectation on
both sides yields the following lower bound on the Bayes risk:
ℓ(θ0 , θ1 ) ℓ(θ0 , θ1 )
Eπ [ℓ(θ̂, θ)] ≥ P θ̃ 6= θ ≥ (1 − TV(Pθ0 , Pθ1 ))
α 2α
where the last step follows from the minimum average probability of error in binary hypothesis
testing (Theorem 7.7).
Remark 31.1. As an example where the bound (31.2) is tight (up to constants), consider a binary
hypothesis testing problem with Θ = {θ0 , θ1 } and the Hamming loss ℓ(θ, θ̂) = 1{θ 6= θ̂}, where
θ, θ̂ ∈ {θ0 , θ1 } and α = 1. Then the left side is the minimax probability of error, and the right
side is the optimal average probability of error (cf. (7.17)). These two quantities can coincide (for
example for Gaussian location model).
Another special case of interest is the quadratic loss ℓ(θ, θ̂) = kθ − θ̂k22 , where θ, θ̂ ∈ Rd , which
satisfies the α-triangle inequality with α = 2. In this case, the leading constant 41 in (31.2) makes
sense, because in the extreme case of TV = 0 where Pθ0 and Pθ1 cannot be distinguished, the best
estimate is simply θ0 +θ2 . In addition, the inequality (31.2) can be deduced based on properties of
1
f-divergences and their joint range (Chapter 7). To this end, abbreviate Pθi as Pi for i = 0, 1 and
consider the prior π = 12 (δθ0 + δθ1 ). Then the Bayes estimator (posterior mean) is θ0 dP 0 +θ1 dP1
dP0 +dP1 and
the Bayes risk is given by
Z
kθ0 − θ1 k2 dP0 dP1
R∗π =
2 dP0 + dP1
kθ0 − θ1 k2 kθ0 − θ1 k2
= (1 − LC(P0 , P1 )) ≥ (1 − TV(P0 , P1 )),
4 4
R 0 −dP1 )
2
where LC(P0 , P1 ) = (dP dP0 +dP1 is the Le Cam divergence defined in (7.6) and satisfies LC ≤ TV.
Example 31.1. As a concrete example, consider the one-dimensional GLM with sample size n.
Pn
By considering the sufficient statistic X̄ = 1n i=1 Xi , the model is simply {N(θ, 1n ) : θ ∈ R}.
Applying Theorem 31.1 yields
∗ 1 1 1
R ≥ sup |θ0 − θ1 | 1 − TV N θ0 ,
2
, N θ1 ,
θ0 ,θ1 ∈R 4 n n
( a) 1 ( b) c
= sup s2 (1 − TV(N(0, 1), N(s, 1))) = (31.3)
4n s>0 n
i i
i i
i i
where (a) follows from the shift and scale invariance of the total variation; in (b) c ≈ 0.083 is
some absolute constant, obtained by applying the formula TV(N(0, 1), N(s, 1)) = 2Φ( 2s ) − 1 from
(7.37). On the other hand, we know from Example 28.2 that the minimax risk equals 1n , so the
two-point method is rate-optimal in this case.
In the above example, for two points separated by Θ( √1n ), the corresponding hypothesis cannot
be tested with vanishing probability of error so that the resulting estimation risk (say in squared
error) cannot be smaller than 1n . This convergence rate is commonly known as the “parametric
rate”, which we have studied in Chapter 29 for smooth parametric families focusing on the Fisher
information as the sharp constant. More generally, the 1n rate is not improvable for models with
locally quadratic behavior
(Recall that Theorem 7.21 gives a sufficient condition for this behavior.) Indeed, pick θ0 in the
interior of the parameter space and set θ1 = θ0 + √1n , so that H2 (Pθ0 , Pθ1 ) = Θ( 1n ) thanks to (31.4).
By Theorem 7.8, we have TV(P⊗ ⊗n
θ0 , Pθ1 ) ≤ 1 − c for some constant c and hence Theorem 31.1
n
yields the lower bound Ω(1/n) for the squared error. Furthermore, later we will show that the same
locally quadratic behavior in fact guarantees the achievability of the 1/n rate; see Corollary 32.11.
Example 31.2. As a different example, consider the family Unif(0, θ). Note that as opposed to the
quadratic behavior (31.4), we have
√
H2 (Unif(0, 1), Unif(0, 1 + t)) = 2(1 − 1/ 1 + t) t.
Thus an application of Theorem 31.1 yields an Ω(1/n2 ) lower bound. This rate is not achieved by
the empirical mean estimator (which only achieves 1/n rate), but by the the maximum likelihood
estimator θ̂ = max{X1 , . . . , Xn }. Other types of behavior in t, and hence the rates of convergence,
can occur even in compactly supported location families – see Example 7.1.
The limitation of Le Cam’s two-point method is that it does not capture the correct dependency
on the dimensionality. To see this, let us revisit Example 31.1 for d dimensions.
Example 31.3. Consider the d-dimensional GLM in Corollary 28.8. Again, it is equivalent to con-
sider the reduced model {N(θ, 1n ) : θ ∈ Rd }. We know from Example 28.2 (see also Theorem 28.4)
that for quadratic risk ℓ(θ, θ̂) = kθ − θ̂k22 , the exact minimax risk is R∗ = dn for any d and n. Let
us compare this with the best two-point lower bound. Applying Theorem 31.1 with α = 2,
1 1 1
R∗ ≥ sup kθ0 − θ1 k22 1 − TV N θ0 , Id , N θ1 , Id
θ0 ,θ1 ∈Rd 4 n n
1
= sup kθk22 {1 − TV (N (0, Id ) , N (θ, Id ))}
θ∈Rd 4n
1
= sup s2 (1 − TV(N(0, 1), N(s, 1))),
4n s>0
where the second step applies the shift and scale invariance of the total variation; in the last step,
by rotational invariance of isotropic Gaussians, we can rotate the vector θ align with a coordinate
i i
i i
i i
518
vector (say, e1 = (1, 0 . . . , 0)) which reduces the problem to one dimension, namely,
Comparing the above display with (31.3), we see that the best Le Cam two-point lower bound in
d dimensions coincide with that in one dimension.
Let us mention in passing that although Le Cam’s two-point method is typically suboptimal for
estimating a high-dimensional parameter θ, for functional estimation in high dimensions (e.g. esti-
mating a scalar functional T(θ)), Le Cam’s method is much more effective and sometimes even
optimal. The subtlety is that is that as opposed to testing a pair of simple hypotheses H0 : θ = θ0
versus H1 : θ = θ1 , we need to test H0 : T(θ) = t0 versus H1 : T(θ) = t1 , both of which are
composite hypotheses and require a sagacious choice of priors. See Exercise VI.13 for an example.
Theorem 31.2 (Assouad’s Lemma). Assume that the loss function ℓ satisfies the α-triangle
inequality (31.1). Suppose Θ contains a subset Θ′ = {θb : b ∈ {0, 1}d } indexed by the hypercube,
such that ℓ(θb , θb′ ) ≥ β · dH (b, b′ ) for all b, b′ and some β > 0. Then
βd
inf sup Eθ ℓ(θ, θ̂) ≥ 1 − max TV(Pθb , Pθb′ ) (31.5)
θ̂ θ∈Θ 4α dH (b,b′ )=1
Proof. We lower bound the Bayes risk with respect to the uniform prior over Θ′ . Given any
estimator θ̂ = θ̂(X), define b̂ ∈ argmin ℓ(θ̂, θb ). Then for any b ∈ {0, 1}d ,
β X
d
≥ (1 − TV(PX|bi =0 , PX|bi =1 )),
4α
i=1
i i
i i
i i
where the last step is again by Theorem 7.7, just like in the proof of Theorem 31.1. Each total
variation can be upper bounded as follows:
!
( a) 1 X 1 X (b)
TV(PX|bi =0 , PX|bi =1 ) = TV d− 1
Pθb , d−1 Pθb ≤ max TV(Pθb , Pθb′ )
2 2 dH (b,b′ )=1
b:bi =1 b:bi =0
where (a) follows from the Bayes rule, and (b) follows from the convexity of total variation
(Theorem 7.5). This completes the proof.
Example 31.4. Let us continue the discussion of the d-dimensional GLM in Example 31.3. Con-
sider the quadratic loss first. To apply Theorem 31.2, consider the hypercube θb = ϵb, where
b ∈ {0, 1}d . Then kθb − θb′ k22 = ϵ2 dH (b, b′ ). Applying Theorem 31.2 yields
∗ ϵ2 d 1 ′ 1
R ≥ 1− max TV N ϵb, Id , N ϵb , Id
4 b,b′ ∈{0,1}d ,dH (b,b′ )=1 n n
2
ϵ d 1 1
= 1 − TV N 0, , N ϵ, ,
4 n n
where the last step applies (7.10) for f-divergence between product distributions that only differ
in one coordinate. Setting ϵ = √1n and by the scale-invariance of TV, we get the desired R∗ ≳ nd .
Next, let’s consider the loss function kθb − θb′ k∞ . In the same setup, we only kθb − θb′ k∞ ≥
′ ∗ √1 , which does not depend on d. In fact, R∗
d dH (b, b ). Then Assouad’s lemma yields R ≳
ϵ
q n
log d
n as shown in Corollary 28.8. In the next section, we will discuss Fano’s method which can
resolve this deficiency.
Here τ ′ is related to τ by τ log 2 = h(τ ′ ). Thus, using the same “hypercube embedding b → θb ”,
the bound similar to (31.5) will follow once we can bound I(bd ; X) away from d log 2.
Can we use the pairwise total variation bound in (31.5) to do that? Yes! Notice that thanks to
the independence of bi ’s we have1
1
Equivalently, this also follows from the convexity of the mutual information in the channel (cf. Theorem 5.3).
i i
i i
i i
520
where in the last step we used the fact that whenever B ∼ Ber(1/2),
I(B; X) ≤ TV(PX|B=0 , PX|B=1 ) log 2 , (31.8)
which follows from (7.36) by noting that the mutual information is expressed as the Jensen-
Shannon divergence as 2I(B; X) = JS(PX|B=0 , PX|B=1 ). Combining (31.6) and (31.7), the mutual
information method implies the following version of the Assouad’s lemma: Under the assumption
of Theorem 31.2,
βd −1 (1 − t) log 2
inf sup Eθ ℓ(θ, θ̂) ≥ ·f max TV(Pθ , Pθ′ ) , f(t) ≜ h (31.9)
θ̂ θ∈Θ 4α dH (θ,θ ′ )=1 2
where h−1 : [0, log 2] → [0, 1/2] is the inverse of the binary entropy function. Note that (31.9) is
slightly weaker than (31.5). Nevertheless, as seen in Example 31.4, Assouad’s lemma is typically
applied when the pairwise total variation is bounded away from one by a constant, in which case
(31.9) and (31.5) differ by only a constant factor.
In all, we may summarize Assouad’s lemma as a convenient method for bounding I(bd ; X) away
from the full entropy (d bits) on the basis of distances between PX|bd corresponding to adjacent
bd ’s.
Theorem 31.3. Let d be a metric on Θ. Fix an estimator θ̂. For any T ⊂ Θ and ϵ > 0,
h ϵi radKL (T) + log 2
P d(θ, θ̂) ≥ ≥1− , (31.10)
2 log M(T, d, ϵ)
where radKL (T) ≜ infQ supθ∈T D(Pθ kQ) is the KL radius of the set of distributions {Pθ : θ ∈ T}
(recall Corollary 5.8). Consequently,
ϵ r radKL (T) + log 2
inf sup Eθ [d(θ, θ̂) ] ≥ sup
r
1− , (31.11)
θ̂ θ∈Θ T⊂Θ,ϵ>0 2 log M(T, d, ϵ)
i i
i i
i i
I(θ; X) + log 2
P[θ 6= θ̃] ≥ 1 − .
log M
The proof of (31.10) is completed by noting that I(θ; X) ≤ radKL (T) since the latter equals the
maximal mutual information over the distribution of θ (Corollary 5.8).
As an application of Fano’s method, we revisit the d-dimensional GLM in Corollary 28.8 under
the ℓq loss (1 ≤ q ≤ ∞), with the particular focus on the dependency on the dimension. (For a
different application in sparse setting see Exercise VI.11.)
Example 31.5. Consider GLM with sample size n, where Pθ = N(θ, Id )⊗n . Taking natural logs
here and below, we have
n
D(Pθ kPθ′ ) = kθ − θ′ k22 ;
2
in other words, KL-neighborhoods are ℓ2 -balls. As such, let us apply Theorem 31.3 to T = B2 (ρ)
2
for some ρ > 0 to be specified. Then radKL (T) ≤ supθ∈T D(Pθ kP0 ) = nρ2 . To bound the packing
number from below, we applying the volume bound in Theorem 27.3,
d
ρd vol(B2 ) cq ρd1/q
M(B2 (ρ), k · kq , ϵ) ≥ d ≥ √
ϵ vol(Bq ) ϵ d
for some
p constant cq ,cqwhere the last step follows the volume formula (27.13) for ℓq -balls. Choosing
1/q−1/2
ρ = d/n and ϵ = e2 ρd , an application of Theorem 31.3 yields the minimax lower bound
d1/q
Rq ≡ inf sup Eθ [kθ̂ − θkq ] ≥ Cq √ (31.12)
θ̂ θ∈Rd n
for some constant Cq depending on q. This is the same lower bound as that in Example 30.1
obtained via the mutual information method plus the Shannon lower bound (which is also volume-
based).
For any q ≥ 1, (31.12) is rate-optimal since we can apply the MLE θ̂ = X̄. (Note that at q = ∞,
pq = ∞, (31.12)
the constant Cq is still finite since vol(B∞ ) = 2d .) However, for the special case of
does not depend on the dimension at all, as opposed to the correct dependency log d shown in
Corollary 28.8. In fact, this is the same suboptimal result we previously obtained from applying
Shannon lower bound in Example 30.1 or Assouad’s lemma in Example 31.4. So is it possible to
fix this looseness with Fano’s method? It turns out that the answer is yes and the suboptimality
is due to the volume bound on the metric entropy, which, as we have seen in Section 27.3, can
be ineffective if ϵ scales with dimension. Indeed, if we apply the tight bound of M(B2 , k · k∞ , ϵ)
i i
i i
i i
522
q q
in (27.18),2 with ϵ = c log d
and ρ = c′ logn d for some absolute constants c, c′ , we do get
q n
R∞ ≳ logn d as desired.
We end this section with some comments regarding the application Theorem 31.3:
• It is sometimes convenient to further bound the KL radius by the KL diameter, since radKL (T) ≤
diamKL (T) ≜ supθ,θ′ ∈T D(Pθ′ kPθ ) (cf. Corollary 5.8). This suffices for Example 31.5.
• In Theorem 31.3 we actually lower bound the global minimax risk by that restricted on a param-
eter subspace T ⊂ Θ for the purpose of controlling the mutual information, which is often
difficult to compute. For the GLM considered in Example 31.5, the KL divergence is propor-
tional to squared ℓ2 -distance and T is naturally chosen to be a Euclidean ball. For other models
such as the covariance model (Exercise VI.15) wherein the KL divergence is more complicated,
the KL neighborhood T needs to be chosen carefully. Later in Section 32.4 we will apply the
same Fano’s method to the infinite-dimensional problem of estimating smooth density.
2
In fact, in this case we can also choose the explicit packing {ϵe1 , . . . , ϵed }.
i i
i i
i i
So far our discussion on information-theoretic methods have been mostly focused on statistical
lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis.
In this chapter, we will discuss three information-theoretic upper bounds for statistical estimation.
These three results apply to different loss functions and are obtained using completely different
means; however, they take on exactly the same form involving the appropriate metric entropy of the
model. Specifically, suppose that we observe X1 , . . . , Xn drawn independently from a distribution
Pθ for some unknown parameter θ ∈ Θ, and the goal is to produce an estimate P̂ for the true
distribution Pθ . We have the following entropic minimax upper bounds:
Here N(P, ϵ) refers to the metric entropy (cf. Chapter 27) of the model class P = {Pθ : θ ∈ Θ}
under various distances, which we will formalize along the way.
523
i i
i i
i i
524
depending on Xn . The loss function is the KL divergence D(Pθ kP̂).1 The average risk is thus
Z
Eθ D(Pθ kP̂) = D Pθ kP̂(·|Xn ) P⊗n (dxn ).
If the family has a common dominating measure μ, the problem is equivalent to estimate the
density pθ = dP dμ , commonly referred to as the problem of density estimation in the statistics
θ
literature.
Our objective is to prove the upper bound (32.1) for the minimax KL risk
where the infimum is taken over all estimators P̂ = P̂(·|Xn ) which is a distribution on X ; in
other words, we allow improper estimates in the sense that P̂ can step outside the model class P .
Indeed, the construction we will use in this section (such as predictive density estimators (Bayes)
or their mixtures) need not be a member of P . Later we will see in Sections 32.2 and 32.3 that for
total variation and Hellinger loss we can always restrict to proper estimators;2 however these loss
functions are weaker than the KL divergence.
The main result of this section is the following.
where the supremum is over all distributions (priors) of θ taking values in Θ. Denote by
NKL (P, ϵ) ≜ min N : ∃Q1 , . . . , QN s.t. ∀θ ∈ Θ, ∃i ∈ [N], D(Pθ kQi ) ≤ ϵ2 . (32.6)
Note that the capacity Cn is precisely the redundancy (13.10) which governs the minimax regret
in universal compression; the fact that it bounds the KL risk can be attributed to a generic relation
1
Note the asymmetry in this loss function. Alternatively the loss D(P̂kP) is typically infinite in nonparametric settings,
because it is impossible to estimate the support of the true density exactly.
2
This is in fact a generic observation: Whenever the loss function satisfies an approximate triangle inequality, any
improper estimate can be converted to a proper one by its project on the model class whose risk is inflated by no more
than a constant factor.
i i
i i
i i
between individual and cumulative risks which we explain later in Section 32.1.4. As explained in
Chapter 13, it is in general difficult to compute the exact value of Cn even for models as simple as
Bernoulli (Pθ = Ber(θ)). This is where (32.8) comes in: one can use metric entropy and tools from
Chapter 27 to bound this capacity, leading to useful (and even optimal) risk bounds. We discuss
two types of applications of this result.
Infinite-dimensional models Similar to the results in Section 27.4, for nonparametric models
NKL (ϵ) typically grows super-polynomially in 1ϵ and, in turn, the capacity Cn grows super-
logarithmically. In fact, whenever we have Cn nα for some α > 0, Theorem 32.1 yields the
sharp minimax rate
Cn
R∗KL (n) nα−1 (32.11)
n
which easily follows from combining (32.7) and (32.8) – see (32.23) for details.
As a concrete example, consider the class P of Lipschitz densities on [0, 1] that are bounded
away from zero. Using the L2 -metric entropy previously established in Theorem 27.12, we will
show in Section 32.4 that NKL (ϵ) ϵ−1 and thus Cn ≤ infϵ>0 (nϵ2 + ϵ−1 ) n1/3 and, in turn,
R∗KL (n) ≲ n−2/3 . This rate turns out to be optimal: In Section 32.1.3 we will develop capacity lower
bound based on metric entropy that shows Cn n1/3 and hence, in view of (32.11), R∗KL (n)
n−2/3 .
Next, we explain the intuition behind and the proof of Theorem 32.1.
i i
i i
i i
526
i.i.d.
where θ ∼ π and (X1 , . . . , Xn+1 ) ∼ Pθ conditioned on θ. The Bayes estimator achieving this infi-
mum is given by P̂Bayes (·|xn ) = PXn+1 |Xn =xn . If each Pθ has a density pθ with respect to some
common dominating measure μ, the Bayes estimator has density:
R Qn+1
π (dθ) i=1 pθ (xi )
p̂Bayes (xn+1 |x ) = R
n
Qn . (32.12)
π (dθ) i=1 pθ (xi )
(a)
= EXn D(PXn+1 |θ kPXn+1 |Xn |Pθ|Xn )
= D(PXn+1 |θ kPXn+1 |Xn |Pθ,Xn )
(b)
= I(θ; Xn+1 |Xn ).
where (a) follows from the variational representation of mutual information (Theorem 4.1 and
Corollary 4.2); (b) invokes the definition of the conditional mutual information (Section 3.4) and
the fact that Xn → θ → Xn+1 forms a Markov chain, so that PXn+1 |θ,Xn = PXn+1 |θ . In addition, the
Bayes optimal estimator is given by PXn+1 |Xn .
Note that the operational meaning of I(θ; Xn+1 |Xn ) is the information provided by one extra
observation about θ having already obtained n observations. In most situations, since Xn will have
3
Throughout this chapter, we continue to use the conventional notation Pθ for a parametric family of distributions and use
π to stand for the distribution of θ.
i i
i i
i i
Lemma 32.3 (Diminishing marginal utility in information). n 7→ I(θ; Xn+1 |Xn ) is a decreasing
sequence. Furthermore,
1
I(θ; Xn+1 |Xn ) ≤ I(θ; Xn+1 ). (32.13)
n
Proof. In view of the chain rule for mutual information (Theorem 3.7): I(θ; Xn+1 ) =
Pn+1 i−1
i=1 I(θ; Xi |X ), (32.13) follows from the monotonicity. To show the latter, let us consider
a “sampling channel” where the input is θ and the output is X sampled from Pθ . Let I(π )
denote the mutual information when the input distribution is π, which is a concave function in
π (Theorem 5.3). Then
where the inequality follows from Jensen’s inequality, since Pθ|Xn−1 is a mixture of Pθ|Xn .
Lemma 32.3 allows us to prove the converse bound (32.9): Fix any prior π. Since the minimax
risk dominates any Bayes risk (Theorem 28.1), in view of Lemma 32.2, we have
X
n X
n
R∗KL (t) ≥ I(θ; Xt+1 |Xt ) = I(θ; Xn+1 ).
t=0 t=0
Recall from (32.5) that Cn+1 = supπ ∈∆(Θ) I(θ; Xn+1 ). Optimizing over the prior π yields (32.9).
Now suppose that the minimax theorem holds for (32.4), so that R∗KL = supπ ∈∆(Θ) R∗KL,Bayes (π ).
Then Lemma 32.2 allows us to express the minimax risk as the conditional mutual information
maximized over the prior π:
1 X
n+1
P̂(·|Xn ) ≜ QXi |Xi−1 . (32.14)
n+1
i=1
i i
i i
i i
528
Let us bound the worst-case KL risk of this estimator. Fix θ ∈ Θ and let Xn+1 be drawn
⊗(n+1)
independently from Pθ so that PXn+1 = Pθ . Taking expectations with this law, we have
"
!#
1 X n+1
Eθ [D(Pθ kP̂(·|X ))] = E D Pθ
n
QXi |Xi−1
n + 1
i=1
(a) 1 X
n+1
≤ D(Pθ kQXi |Xi−1 |PXi−1 )
n+1
i=1
(b) 1 ⊗(n+1)
= D(Pθ kQXn+1 ),
n+1
where (a) and (b) follows from the convexity (Theorem 5.1) and the chain rule for KL divergence
(Theorem 2.14(c)). Taking the supremum over θ ∈ Θ bounds the worst-case risk as
1 ⊗(n+1)
R∗KL (n) ≤ sup D(Pθ kQXn+1 ).
n + 1 θ∈Θ
Optimizing over the choice of QXn+1 , we obtain
1 ⊗(n+1) Cn+1
R∗KL (n) ≤ inf sup D(Pθ kQXn+1 ) = ,
n + 1 QXn+1 θ∈Θ n+1
where the last identity applies Theorem 5.9 of Kemperman, completing the proof of (32.7).
Furthermore, Theorem 5.9 asserts that the optimal QXn+1 exists and given uniquely by the capacity-
achieving output distribution P∗Xn+1 . Thus the above minimax upper bound can be attained by
taking the Cesàro average of P∗X1 , P∗X2 |X1 , . . . , P∗Xn+1 |Xn , namely,
1 X ∗
n+1
P̂∗ (·|Xn ) = PXi |Xi−1 . (32.15)
n+1
i=1
Note that in general this is an improper estimate as it steps outside the class P .
In the special case where the capacity-achieving input distribution π ∗ exists, the capacity-
achieving output distribution can be expressed as a mixture over product distributions as P∗Xn+1 =
R ∗ ⊗(n+1)
π (dθ)Pθ . Thus the estimator P̂∗ (·|Xn ) is in fact the average of Bayes estimators (32.12)
under prior π ∗ for sample sizes ranging from 0 to n.
Finally, as will be made clear in the next section, in order to achieve the further upper bound
(32.8) in terms of the KL covering numbers, namely R∗KL (n) ≤ ϵ2 + n+1 1 log NKL (P, ϵ), it suffices to
choose the following QXn+1 as opposed to the exact capacity-achieving output distribution: Pick an
ϵ-KL cover Q1 , . . . , QN for P of size N = NKL (P, ϵ) and choose π to be the uniform distribution
PN ⊗(n+1)
and define QXn+1 = N1 j=1 Qj – this was the original construction in [341]. In this case,
applying the Bayes rule (32.12), we see that the estimator is in fact a convex combination P̂(·|Xn ) =
PN
j=1 wj Qj of the centers Q1 , . . . , QN , with data-driven weights given by
Qi−1
1 X
n+1
t=1 Qj (Xt )
wj = PN Qi−1 .
n+1 Qj ( X t )
i=1 j=1 t=1
i i
i i
i i
Again, except for the extraordinary case where P is convex and the centers Qj belong to P , the
estimate P̂(·|Xn ) is improper.
Proof. Fix ϵ and let N = NKL (Q, ϵ). Then there exist Q1 , . . . , QN that form an ϵ-KL cover, such
that for any a ∈ A there exists i(a) ∈ [N] such that D(PB|A=a kQi(a) ) ≤ ϵ2 . Fix any PA . Then
where the last inequality follows from that i(A) takes at most N values and, by applying
Theorem 4.1,
I(A; B|i(A)) ≤ D PB|A kQi(A) |Pi(A) ≤ ϵ2 .
For the lower bound, note that if C = ∞, then in view of the upper bound above, NKL (Q, ϵ) = ∞
for any ϵ and (32.16) holds with equality. If C < ∞, Theorem 5.9 shows that C is the KL radius of
Q, namely, there exists P∗B , such that C = supPA ∈∆(A) D(PB|A kP∗B |PA ) = supx∈A D(PB|A kP∗B |PA ).
√
In other words, NKL (Q, C + δ) = 1 for any δ > 0. Sending δ → 0 proves the equality of
(32.16).
Next we specialize Theorem 32.4 to our statistical setting (32.5) where the input A is θ and the
output B is Xn ∼ Pθ . Recall that P = {Pθ : θ ∈ Θ}. Let Pn ≜ {P⊗
i.i.d.
θ : θ ∈ Θ}. By tensorization of
n
⊗n ⊗n
KL divergence (Theorem 2.14(d)), D(Pθ kPθ′ ) = nD(Pθ kPθ′ ). Thus
ϵ
NKL (Pn , ϵ) ≤ NKL P, √ .
n
Combining this with Theorem 32.4, we obtain the following upper bound on the capacity Cn in
terms of the KL metric entropy of the (single-letter) family P :
Cn ≤ inf nϵ2 + log NKL (P, ϵ) . (32.17)
ϵ>0
i i
i i
i i
530
Theorem 32.5. Let P = {Pθ : θ ∈ Θ} and MH (ϵ) ≡ M(P, H, ϵ) the Hellinger packing number
of the set P , cf. (27.2). Then Cn defined in (32.5) satisfies
log e 2
Cn ≥ min nϵ , log MH (ϵ) − log 2 (32.18)
2
Proof. The idea of the proof is simple. Given a packing θ1 , . . . , θM ∈ Θ with pairwise distances
2
H2 (Qi , Qj ) ≥ ϵ2 for i 6= j, where Qi ≡ Pθi , we know that one can test Q⊗ n ⊗n
i vs Qj with error e
− nϵ2
,
nϵ 2
cf. Theorem 7.8 and Theorem 32.7. Then by the union bound, if Me− 2 < 12 , we can distinguish
these M hypotheses with error < 12 . Let θ ∼ Unif(θ1 , . . . , θM ). Then from Fano’s inequality we
get I(θ; Xn ) ≳ log M.
To get sharper constants, though, we will proceed via the inequality shown in Ex. I.47. In the
notation of that exercise we take λ = 1/2 and from Definition 7.22 we get that
1
D1/2 (Qi , Qj ) = −2 log(1 − H2 (Qi , Qj )) ≥ H2 (Qi , Qj ) log e ≥ ϵ2 log e i 6= j .
2
X
M
( a) 1M − 1 − nϵ22 1
≥− log e +
M M M
i=1
XM
1 nϵ 2 1 nϵ 2 1
≥− log e− 2 + = − log e− 2 + ,
M M M
i=1
where in (a) we used the fact that pairwise distances are all ≥ nϵ2 except when i = j. Finally, since
A + B ≤ min(A,B) we conclude the result.
1 1 2
We note that since D ≳ H2 (cf. (7.30)), a different (weaker) lower bound on the KL risk also
follows from Section 32.2.4 below.
i i
i i
i i
• Instead of directly studying the risk R∗KL (n), (32.7) relates it to a cumulative risk Cn
• The cumulative risk turns out to be equal to a capacity, which can be conveniently bounded in
terms of covering numbers.
In this subsection we want to point out that while the second step is very special to KL (log-loss),
the first idea is generic. Namely, we have the following result.
Proposition 32.6. Fix a loss function ℓ : P(X ) × P(X ) → R̄ and a class Π of distributions on
X . Define cumulative and one-step minimax risks as follows:
" n #
X
Cn = inf sup E ℓ(P, P̂t (Xt−1 )) (32.19)
{P̂t (·)} P∈Π
t=1
h i
R∗n = inf sup E ℓ(P, P̂(Xn )) (32.20)
P̂(·) P∈Π
where both infima are over measurable (possibly randomized) estimators P̂t : X t−1 → P(X ), and
i.i.d.
the expectations are over Xi ∼ P and the randomness of the estimators. Then we have
X
n−1
nR∗n−1 ≤ Cn ≤ R∗t . (32.21)
t=0
Pn−1
Thus, if the sequence {R∗n } satisfies R∗n 1n t=0 R∗t then Cn nR∗n . Conversely, if nα− ≲ Cn ≲
nα+ for all n and some α+ ≥ α− > 0, then
α
(α− −1) α+
n − ≲ R∗n ≲ nα+ −1 . (32.22)
Remark 32.1. The meaning of the above is that R∗n ≈ 1n Cn within either constant or polylogarith-
mic factors, for most cases of interest.
Proof. To show the first inequality in (32.21), given predictors {P̂t (Xt−1 ) : t ∈ [n]} for Cn ,
consider a randomized predictor P̂(Xn−1 ) for R∗n−1 that equals each of the P̂t (Xt−1 ) with equal
P
probability. The second inequality follows from interchanging supP and t .
To derive (32.22) notice that the upper bound on R∗n follows from (32.21). For the lower bound,
notice that the sequence R∗n is monotone and hence we have for any n < m
X
m−1 X
n−1
Ct
Cm ≤ R∗t ≤ + (m − n)R∗n . (32.23)
t
t=0 t=0
α+
i i
i i
i i
532
For Hellinger loss, the answer is yes, although the metric entropy involved is with respect to
the Hellinger distance not KL divergence. The basic construction is due to Le Cam and further
developed by Birgé. The main idea is as follows: Fix an ϵ-covering {P1 , . . . , PN } of the set of
distributions P . Given n samples drawn from P ∈ P , let us test which ball P belongs to; this
allows us to estimate P up to Hellinger loss ϵ. This can be realized by a pairwise comparison
argument of testing the (composite) hypothesis P ∈ B(Pi , ϵ) versus P ∈ B(Pj , ϵ). This program
can be further refined to involve on the local entropy of the model.
The optimal test that achieves (32.24) is the likelihood ratio given by the worst-case mixtures, that
is, the closest4 pair of mixture (P∗n , Q∗n ) such that TV(P∗n , Q∗n ) = TV(co(P ⊗n ), co(Q⊗n )).
The exact result (32.24) is unwieldy as the RHS involves finding the least favorable priors over
the n-fold product space. However, there are several known examples where much simpler and
4
In case the closest pair does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
explicit results are available. In the case when P and Q are TV-balls around P0 and Q0 , Huber [161]
showed that the minimax optimal test has the form
( n )
X dP0
n ′ ′′
ϕ(x ) = 1 min(c , max(c , log (Xi ))) > t .
dQ0
i=1
(See also Ex. III.20.) However, there are few other examples where minimax optimal tests are
known explicitly. Fortunately, as was shown by Le Cam, there is a general “single-letter” upper
bound in terms of the Hellinger separation between P and Q. It is the consequence of the more
general tensorization property of Rényi divergence in Proposition 7.23 (of which Hellinger is a
special case).
Theorem 32.7.
min sup P(ϕ = 1) + sup Q(ϕ = 0) ≤ e− 2 infP∈P,Q∈Q H (P,Q) ,
n 2
(32.25)
ϕ P∈P Q∈Q
Remark 32.2. For the case when P and Q are Hellinger balls of radius r around P0 and Q0 , respec-
tively, Birgé [35] constructed
nP an explicit test. Namely,
o under the assumption H(P0 , Q0 ) q > 2.01r,
n n α+βψ(Xi ) −nΩ(r2 ) dP0
there is a test ϕ(x ) = 1 i=1 log β+αψ(Xi ) > t attaining error e , where ψ(x) = dQ 0
( x)
and α, β > 0 depend only on H(P0 , Q0 ).
In the sequel we will apply Theorem 32.7 to two disjoint Hellinger balls (both are convex).
i i
i i
i i
534
Theorem 32.8 (Le Cam-Birgé). Denote by NH (P, ϵ) the ϵ-covering number of the set P under
the Hellinger distance (cf. (27.1)). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 1,
and, consequently,
Proof of Theorem 32.8. It suffices to prove the high-probability bound (32.27). Abbreviate ϵ =
ϵn and N = NH (P, ϵn ). Let P1 , · · · , PN be a maximal ϵ-packing of P under the Hellinger distance,
which also serves as an ϵ-covering (cf. Theorem 27.2). Thus, ∀i 6= j,
H(Pi , Pj ) ≥ ϵ,
H(P, Pi ) ≤ ϵ,
Next, consider the following pairwise comparison problem, where we test two Hellinger balls
(composite hypothesis) against each other:
Hi : P ∈ B(Pi , ϵ)
Hj : P ∈ B(Pj , ϵ)
for all i 6= j, s.t. H(Pi , Pj ) ≥ δ = 4ϵ.
Since both B(Pi , ϵ) and B(Pj , ϵ) are convex, applying Theorem 32.7 yields a test ψij =
ψij (X1 , . . . , Xn ), with ψij = 0 corresponding to declaring P ∈ B(Pi , ϵ), and ψij = 1 corresponding
to declaring P ∈ B(Pj , ϵ), such that ψij = 1 − ψji and the following large deviation bound holds:
for all i, j, s.t. H(Pi , Pj ) ≥ δ ,
5
Note that this is not entirely obvious because P 7→ H(P, Q) is not convex (for example, consider
p 7→ H(Ber(p), Ber(0.1)).
i i
i i
i i
where we used the triangle inequality of Hellinger distance: for any P ∈ B(Pi , ϵ) and any Q ∈
B(Pj , ϵ),
Now for the proof of correctness, assume that P ∈ B(P1 , ϵ). The intuition is that, we should
expect, typically, that T1 = 0, and furthermore, Tj ≥ δ 2 for all j such that H(P1 , Pj ) ≥ δ . Note
that by the definition of Ti and the symmetry of the Hellinger distance, for any pair i, j such that
H(Pi , Pj ) ≥ δ , we have
max{Ti , Tj } ≥ H(Pi , Pj ).
Consequently,
where the last equality follows from the definition of i∗ as a global minimizer in (32.30). Thus, for
any t ≥ 1,
i i
i i
i i
536
As usual, such a log factor can be removed using the local entropy argument. To this end, define
the local Hellinger entropy:
Theorem 32.9 (Le Cam-Birgé: local entropy version). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 2,
and hence
Remark 32.3 (Doubling dimension). Suppose that for some d > 0, log Nloc (P, ϵ) ≤ d log 1ϵ holds
for all sufficiently large small ϵ; this is the case for finite-dimensional models where the Hellinger
distance is comparable with the vector norm by the usual volume argument (Theorem 27.3). Then
we say the doubling dimension (also known as the Le Cam dimension [318]) of P is at most d; this
terminology comes from the fact that the local entropy concerns covering Hellinger balls using
balls of half the radius. Then Theorem 32.9 shows that it is possible to achieve the “parametric
rate” O( dn ). In this sense, the doubling dimension serves as the effective dimension of the model
P.
Proof. We proceed by induction on k. The base case of k = 0 follows from the definition (32.33).
For k ≥ 1, assume that (32.37) holds for k − 1 for all P ∈ P . To prove it for k, we construct a cover
of B(P, 2k η) ∩ P as follows: first cover it with 2k−1 η -balls, then cover each ball with η/2-balls. By
the induction hypothesis, the total number of balls is at most
NH (B(P, 2k η) ∩ P, 2k−1 η) · sup NH (B(P′ , 2k−1 η) ∩ P, η/2) ≤ Nloc (ϵ) · Nloc (ϵ)k−1
P′ ∈P
Proof. We analyze the same estimator (32.30) following the proof of Theorem 32.8, except
that the estimate (32.31) is improved as follows: Define the Hellinger shell Ak ≜ {P : 2k δ ≤
i i
i i
i i
H(P1 , P) < 2k+1 δ} and Gk ≜ {P1 , . . . , PN } ∩ Ak . Recall that δ = 4ϵ. Given t ≥ 2, let ℓ = blog2 tc
so that 2ℓ ≤ t < 2ℓ+1 . Then
X
P[T1 ≥ tδ] ≤ P[2k δ ≤ T1 < 2k+1 δ]
k≥ℓ
( a) X
|Gk |e− 8 (2 δ)
n k 2
≤
k≥ℓ
(b) X
Nloc (ϵ)k+3 e−2nϵ 4
2 k
≤
k≥ℓ
( c)
≲ e− 4 ≤ e− t
ℓ 2
where (a) follows from from (32.29); (c) follows from the assumption that log Nloc ≤ nϵ2 and
k ≥ ℓ ≥ log2 t ≥ 1; (b) follows from the following reasoning: since {P1 , . . . , PN } is an ϵ-packing,
we have
where the first and the last inequalities follow from Theorem 27.2 and Lemma 32.10 respectively.
As an application of Theorem 32.9, we show that parametric rate (namely, dimension divided
by the sample size) is achievable for models with locally quadratic behavior, such as those smooth
parametric models (cf. Section 7.11 and in particular Theorem 7.21).
Proof. It suffices to bound the local entropy Nloc (P, ϵ) in (32.33). Fix θ0 ∈ Θ. Indeed, for any
η > t0 , we have NH (B(Pθ0 , η) ∩ P, η/2) ≤ NH (P, t0 ) ≲ 1. For ϵ ≤ η ≤ t0 ,
( a)
NH (B(Pθ0 , η) ∩ P, η/2) ≤ N∥·∥ (B∥·∥ (θ0 , η/c), η/(2C))
d
(b) vol(B∥·∥ (θ0 , η/c + η/(2C))) 2C
≤ = 1+
vol(B∥·∥ (θ0 , η/(2C))) c
where (a) and (b) follow from (32.38) and Theorem 27.3 respectively. This shows that
log Nloc (P, ϵ) ≲ d, completing the proof by applying Theorem 32.9.
i i
i i
i i
538
Theorem 32.12. Suppose that the family P has a finite Dλ radius for some λ > 1, i.e.
where Dλ is the Rényi divergence of order λ (see Definition 7.22). There exists constants c = c(λ)
and ϵ < ϵ0 (λ) such that whenever n and ϵ < ϵ0 are such that
1
c(λ)nϵ2 log 2 + Rλ (P) + 2 log 2 < log Mloc (ϵ), (32.40)
ϵ
ϵ2
sup EP [H2 (P, P̂)] ≥ ,
P∈P 32
i.i.d.
where EP is taken with respect to Xn ∼ P.
Remark 32.4. When log Mloc (ϵ) ϵ−p , a minimax lower bound for the squared Hellinger risk
on the order of (n log n)− p+2 follows. Consider the special case of P being the class of β -smooth
2
densities on the unit cube [0, 1]d as defined in Theorem 27.13. The χ2 -radius of this class is finite
since each density therein is bounded from above and the uniform distribution works as a center for
2β
(32.39). In this case we have p = βd and hence the lower bound Ω((n log n)− d+2β ). Here, however,
we can argue differently by considering the subcollection P ′ = 12 Unif([0, 1]d ) + 12 P , which has
(up to a constant factor) the same minimax risk, but has the advantage that D(PkP′ ) H2 (P, P′ )
for all P, P′ ∈ P ′ (see Section 32.4). Repeating the argument in the proof below, then, yields the
2β
optimal lower bound Ω(n− d+2β ) removing the unnecessary logarithmic factors.
Proof. Let M = Mloc (P, ϵ). From the definition there exists an ϵ/2-packing P1 , . . . , PM in some
Hellinger ball B(R, ϵ).
i.i.d.
Let θ ∼ Unif([M]) and Xn ∼ Pθ conditioned on θ. Then from Fano’s inequality in the form
of Theorem 31.3 we get
i i
i i
i i
ϵ 2 I(θ; Xn ) + log 2
sup E[H (P, P̂)] ≥
2
1−
P∈P 4 log M
It remains to show that
I(θ; Xn ) + log 2 1
≤ . (32.41)
log M 2
To that end for an arbitrary distribution U define
Q = ϵ2 U + ( 1 − ϵ2 )R .
We first notice that from Ex. I.48 we have that for all i ∈ [M]
λ 1
D(Pi kQ) ≤ 8(H (Pi , R) + 2ϵ )
2 2
log 2 + Dλ (Pi kU)
λ−1 ϵ
provided that ϵ < 2− 2(λ−1) ≜ ϵ0 . Since H2 (Pi , R) ≤ ϵ2 , by optimizing U (as the Dλ -center of P )
5λ
we obtain
λ 1 c(λ) 2 1
inf max D(Pi kQ) ≤ 24ϵ 2
log 2 + Rλ ≤ ϵ log 2 + Rλ .
U i∈[M] λ−1 ϵ 2 ϵ
By Theorem 4.1 we have
nc(λ) 2 1
I(θ; Xn ) ≤ max D(P⊗ ⊗n
i kQ ) ≤
n
ϵ log 2 + Rλ .
i∈[M] 2 ϵ
This final bound and condition (32.40) then imply (32.41) and the statement of the theorem.
Finally, we mention that for sufficiently regular models wherein the KL divergence and the
squared Hellinger distances are comparable, the upper bound in Theorem 32.9 based on local
entropy gives the exact minimax rate. Models of this type include GLM and more generally
Gaussian local mixtures with bounded centers in arbitrary dimensions.
Then
i i
i i
i i
540
Theorem 32.14 (Yatracos [342]). There exists a universal constant C such that the following
i.i.d.
holds. Let X1 , . . . , Xn ∼ P ∈ P , where P is a collection of distributions on a common measurable
space (X , E). For any ϵ > 0, there exists a proper estimator P̂ = P̂(X1 , . . . , Xn ) ∈ P , such that
1
sup EP [TV(P̂, P) ] ≤ C ϵ + log N(P, TV, ϵ)
2 2
(32.42)
P∈P n
For loss function that is a distance, a natural idea for obtaining proper estimator is the minimum
distance estimator. In the current context, we compute the minimum-distance projection of the
empirical distribution on the model class P :6
Pmin-dist = argmin TV(P̂n , P)
P∈P
1
Pn
where P̂n = n i=1 δXi is the empirical distribution. However, since the empirical distribution is
discrete, this strategy does not make sense if elements of P have densities. The reason for this
degeneracy is because the total variation distance is too strong. The key idea is to replace TV,
which compares two distributions over all measurable sets, by a proxy, which only inspects a
“low-complexity” family of sets.
To this end, let A ⊂ E be a finite collection of measurable sets to be specified later. Define a
pseudo-distance
dist(P, Q) ≜ sup |P(A) − Q(A)|. (32.43)
A∈A
(Note that if A = E , then this is just TV.) One can verify that dist satisfies the triangle inequality.
As a result, the estimator
P̃ ≜ argmin dist(P, P̂n ), (32.44)
P∈P
as a minimizer, satisfies
dist(P̃, P) ≤ dist(P̃, P̂n ) + dist(P, P̂n ) ≤ 2dist(P, P̂n ). (32.45)
In addition, applying the binomial tail bound and the union bound, we have
C0 log |A|
E[dist(P, P̂n )2 ] ≤ . (32.46)
n
for some absolute constant C0 .
6
Here and below, if the minimizer does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
The main idea of Yatracos [342] boils down to the following choice of A: Consider an
ϵ-covering {Q1 , . . . , QN } of P in TV. Define the set
dQi dQj
Aij ≜ x : ( x) ≥ ( x)
d( Qi + Qj ) d(Qi + Qj )
and the collection (known as the Yatracos class)
A ≜ {Aij : i 6= j ∈ [N]}. (32.47)
Then the corresponding dist approximates the TV on P , in the sense that
dist(P, Q) ≤ TV(P, Q) ≤ dist(P, Q) + 4ϵ, ∀P, Q ∈ P. (32.48)
To see this, we only need to justify the upper bound. For any P, Q ∈ P , there exists i, j ∈ [N], such
that TV(P, Pi ) ≤ ϵ and TV(Q, Qj ) ≤ ϵ. By the key observation that dist(Qi , Qj ) = TV(Qi , Qj ), we
have
TV(P, Q) ≤ TV(P, Qi ) + TV(Qi , Qj ) + TV(Qj , Q)
≤ 2ϵ + dist(Qi , Qj )
≤ 2ϵ + dist(Qi , P) + dist(P, Q) + dist(Q, Qj )
≤ 4ϵ + dist(P, Q).
Finally, we analyze the estimator (32.44) with A given in (32.47). Applying (32.48) and (32.45)
yields
TV(P̃, P) ≤ dist(P, P̃) + 4ϵ
≤ 2dist(P, P̂n ) + 4ϵ.
Squaring both sizes, taking expectation and applying (32.46), we have
8C0 log |N|
E[TV(P̃, P)2 ] ≤ 32ϵ2 + 8E[dist(P, P̂n )2 ] ≤ 32ϵ2 + .
n
Choosing the optimal TV-covering completes the proof of (32.42).
Remark 32.5 (Robust version). Note that Yatracos’ scheme idea works even if the data generating
distribution P 6∈ P but close to P . Indeed, denote Q∗ = argminQ∈{Qi } TV(P, Q) and notice that
i i
i i
i i
542
i.i.d.
Theorem 32.15. Given X1 , · · · , Xn ∼ f ∈ F , the minimax quadratic risk over F satisfies
R∗L2 (n; F) ≜ inf sup E kf − f̂k22 n− 3 .
2
(32.49)
f̂ f∈F
Capitalizing on the metric entropy of smooth densities studied in Section 27.4, we will prove
this result by applying the entropic upper bound in Theorem 32.1 and the minimax lower bound
based on Fano’s inequality in Theorem 31.3. However, Theorem 32.15 pertains to the L2 rather
than KL risk. This can be fixed by a simple reduction.
Lemma 32.16. Let F ′ denote the collection of f ∈ F which is bounded from below by 1/2. Then
R∗L2 (n; F ′ ) ≤ R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
Proof. The left inequality follows because F ′ ⊂ F . For the right inequality, we apply a sim-
i.i.d.
ulation argument. Fix some f ∈ F and we observe X1 , . . . , Xn ∼ f. Let us sample U1 , . . . , Un
independently and uniformly from [0, 1]. Define
(
Ui w.p. 12 ,
Zi =
Xi w.p. 12 .
i.i.d.
Then Z1 , . . . , Zn ∼ g = 12 (1 + f) ∈ F ′ . Let ĝ be an estimator that achieves the minimax risk
R∗L2 (n; F ′ ) on F ′ . Consider the estimator f̂ = 2ĝ − 1. Then kf − f̂k22 = 4kg − ĝk22 . Taking the
supremum over f ∈ F proves R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
Lemma 32.16 allows us to focus on the subcollection F ′ , where each density is lower bounded
by 1/2. In addition, each 1-Lipschitz density is also upper bounded by an absolute constant.
Therefore, the KL divergence and squared L2 distance are in fact equivalent on F ′ , i.e.,
D(fkg) kf − gk22 , f, g ∈ F ′ , (32.50)
as shown by the following lemma:
dQ
Lemma 32.17. Suppose both f = dP dμ and g = dμ are upper and lower bounded by absolute
constants c and C respectively. Then
Z Z
1 1
dμ(f − g)2 ≤ 2H2 (fkg) ≤ D(PkQ) ≤ χ2 (PkQ) ≤ dμ(f − g)2 .
C c
i i
i i
i i
R R
Proof. For the upper bound, applying (7.31), D(PkQ) ≤ χ2 (PkQ) = dμ (f−gg) ≤ 1c dμ (f−gg) .
2 2
R R
For the lower bound, applying (7.30), D(PkQ) ≥ 2H2 (fkg) = 2 dμ √(f−g√) 2 ≥ C1 dμ(f −
2
( f+ g)
g) 2 .
We now prove Theorem 32.15:
Proof. In view of Lemma 32.16, it suffices to consider R∗L2 (n; F ′ ). For the upper bound, we have
( a)
R∗L2 (n; F ′ ) R∗KL (n; F ′ )
(b)
1 ′
≲ inf ϵ + log NKL (F , ϵ)
2
ϵ>0 n
( c) 1 ′
inf ϵ + log N(F , k · k2 , ϵ)
2
ϵ>0 n
(d) 1
inf ϵ + 2
n−2/3 .
ϵ>0 nϵ
where both (a) and (c) apply (32.50), so that both the risk and the metric entropy are equivalent
for KL and L2 distance; (b) follows from Theorem 32.1; (d) applies the metric entropy (under L2 )
of the Lipschitz class from Theorem 27.12 and the fact that the metric entropy of the subclass F ′
is at most that of the full class F .
For the lower bound, we apply Fano’s inequality. Applying Theorem 27.12 and the relation
between covering and packing numbers in Theorem 27.2, we have log N(F, k·k2 , ϵ) log M(F, k·
k2 , ϵ) 1ϵ . Fix ϵ to be specified and let f1 , . . . , fM be an ϵ-packing in F , where M ≥ exp(C/ϵ). Then
g1 , . . . , gM is an 2ϵ -packing in F ′ , with gi = (fi +1)/2. Applying Fano’s inequality in Theorem 31.3,
we have
∗ Cn
RL2 (n; F) ≳ ϵ 1 −
2
.
log M
Using (32.17), we have Cn ≤ infϵ>0 (nϵ2 + ϵ−1 ) n1/3 . Thus choosing ϵ = cn−1/3 for sufficiently
small c ensures Cn ≤ 12 log M and hence R∗L2 (n; F) ≳ ϵ2 n−2/3 .
Remark 32.6. Note that the above proof of Theorem 32.15 relies on the entropic risk bound (32.1),
which, though rate-optimal, is not attained by a computationally efficient estimator. (The same
criticism also applies to (32.2) and (32.3) for Hellinger and total variation.) To remedy this, for
the squared loss, a classical idea is to apply the kernel density estimator (KDE) – cf. Section 7.9.
Pn
Specifically, one compute the convolution of the empirical distribution P̂n = 1n i=1 δXi with a
kernel function K(·) whose shape and bandwidth are chosen according to the smooth constraint.
For Lipschitz density, the optimal rate in Theorem 32.15 can be attained by a box kernel K(·) =
1 −1/3
2h 1{|·|≤h} with bandwidth h = n (cf. e.g. [313, Sec. 1.2]).
i i
i i
i i
In this chapter we explore statistical implications of the following effect. For any Markov chain
U→X→Y→V (33.1)
However, something stronger can often be said. Namely, if the Markov chain (33.1) factor through
a known noisy channel PY|X : X → Y , then oftentimes we can prove strong data processing
inequalities (SDPI):
where coefficients η = η(PY|X ), η (p) (PY|X ) < 1 only depend on the channel and not the (generally
unknown or very complex) PU,X or PY,V . The coefficients η and η (p) approach 0 for channels that
are very noisy (for example, η is always up to a constant factor equal to the Hellinger-squared
diameter of the channel).
The purpose of this chapter is twofold. First, we want to introduce general properties of the
SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility)
bounds on statistical estimation questions. The flavor of the statistical problems in this chapter is
different from the rest of the book in that here the information about unknown parameter θ is thinly
distributed across a high dimensional vector (as in spiked Wigner and tree-coloring examples), or
across different terminals (as in correlation and mean estimation examples).
We point out that SDPIs are an area of current research and multiple topics are not covered by
our brief exposition here. For more, we recommend surveys [250] and [257], of which the latter
explores the functional-theoretic side of SDPIs and their close relation to logarithmic Sobolev
inequalities – a topic we omitted entirely.
544
i i
i i
i i
a a
OR a∨b AND a∧b a NOT a′
b b
Now suppose there are additive noise components on the output of each primitive gate. In this
case, we have a network of the following noisy gates.
Z Z Z
a a
OR ⊕ Y AND ⊕ Y a NOT ⊕ Y
b b
Here, Z ∼ Bern(δ ) and assumed to be independent of the inputs. In other words, with proba-
bility δ , the output of a gate will be flipped no matter what input is given to that gate. Hence, we
sometimes refer to these gates as δ -noisy gates.
In 1950s John von Neumann was laying the groundwork for the digital computers, and he was
bothered by the following question. Can we compute any boolean function f with δ -noisy gates?
Note that any circuit that consists of noisy gates necessarily has noisy (non-deterministic) output.
Therefore, when we say that a noisy gate circuit C computes f we require the existence of some
ϵ0 = ϵ0 (δ) (that cannot depend on f) such that
1
P[C(x1 , . . . , xn ) 6= f(x1 , . . . , xn ) ≤ − ϵ0 (33.2)
2
where C(x1 , . . . , xn ) is the output of the noisy circuit inputs x1 , . . . , xn . If we build the circuit accord-
ing to the classical (Shannon) methods, we would obviously have catastrophic error accumulation
so that deep circuits necessarily have ϵ0 → 0. At the same time, von Neumann was bothered by
the fact that evidently our brains operate with very noisy gates and yet are able to carry very long
computations without mistakes. His thoughts culminated in the following ground-breaking result.
Theorem 33.1 (von Neumann, 1957). There exists δ ∗ > 0 such that for all δ < δ ∗ it is possible
to compute every boolean function f via δ -noisy 3-majority gates.
von Neumann’s original estimate δ ∗ ≈ 0.087 was subsequently improved by Pippenger. The
main (still open) question of this area is to find the largest δ ∗ for which the above theorem holds.
Condition in (33.2) implies the output should be correlated with the inputs. This requires the
mutual information between the inputs (if they are random) and the output to be greater than
zero. We now give a theorem of Evans and Schulman that gives an upper bound to the mutual
information between any of the inputs and the output. We will prove the theorem in Section 33.3
as a consequence of the more general directed information percolation theory.
Theorem 33.2 ([117]). Suppose an n-input noisy boolean circuit composed of gates with at most
K inputs and with noise components having at most δ probability of error. Then, the mutual
information between any input Xi and output Y is upper bounded as
di
I(Xi ; Y) ≤ K(1 − 2δ)2 log 2
i i
i i
i i
546
where di is the minimum length between Xi and Y (i.e, the minimum number of gates required to
be passed through until reaching Y).
Theorem 33.2 implies that noisy computation is only possible for δ < 12 − 2√1 K . This is the best
known threshold. An illustration is given below:
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
The above 9-input circuit has gates with at most 3 inputs. The 3-input gates are G4 , G5 and G6 .
The minimum distance between X3 and Y is d3 = 2, and the minimum distance between X5 and Y
is d5 = 3. If Gi ’s are δ -noisy gates, we can invoke Theorem 33.2 between any input and the output.
Unsurprisingly, Theorem 33.2 also tells us that there are some circuits that are not com-
putable with δ -noisy gates. For instance, take f(X1 , . . . , Xn ) = XOR(X1 , . . . , Xn ). Then for
log n
at least one input Xi , we have di ≥ log K . This shows that I(Xi ; Y) → 0 as n →
∞, hence Xi and Y will be almost independent for large n. Note that XOR(X1 , . . . , Xn ) =
XOR XOR(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Xi . Therefore, it is impossible to compute an n-input
XOR with δ -noisy gates for large n.
Computation with formulas: Note that the graph structure given in Figure 33.1 contains some
undirected loops. A formula is a type of boolean circuits that does not contain any undirected loops
unlike the case in Figure 33.1. In other words, for a formula the underlying graph structure forms
a tree. Removing one of the outputs of G2 of Figure 33.1, we obtain a formula as given below.
In Theorem 1 of [116], it is shown that we can compute reliably any boolean function f that
is represented with a formula with at most K-input gates with K odd and every gate are at most
δ -noisy and δ < δf∗ , and no such computation is possible for δ > δf∗ , where
1 2K− 1
δf∗ = − K−1
2 K K− 1
2
i i
i i
i i
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
where the approximation holds for large K. This threshold is better than the upper-bound on the
threshold given by Theorem 33.2 for general boolean circuits. However, for large K we have
p
∗ 1 π /2
δf ≈ − √ , K 1
2 2 K
showing that the estimate of Evans-Schulman δ ∗ ≤ 1
2 − 1
√
2 K
is order-tight for large K. This
demonstrates the tightness of Theorem 33.2.
Recall that the DPI (Theorem 7.4) states that Df (PX kQX ) ≥ Df (PY kQY ). The concept of the
Strong DPI introduced above quantifies the multiplicative decrease between the two f-divergences.
Example 33.1. Suppose PY|X is a kernel for a time-homogeneous Markov chain with stationary
distribution π (i.e., PY|X = PXt+1 |Xt ). Then for any initial distribution q, SDPI gives the following
bound:
Df (qPn kπ ) ≤ ηfn Df (qkπ )
These type of exponential decreases are frequently encountered in the Markov chains literature,
especially for KL and χ2 divergences. For example, for reversible Markov chains, we have [91,
Prop. 3]
χ2 (PXn kπ ) ≤ γ∗2n χ2 (PXn kπ ) (33.4)
where γ∗ is the absolute spectral gap of P. See Exercise VI.18.
i i
i i
i i
548
We note that in general ηf (PY|X ) is hard to compute. However, total variation is an exception.
This case is obvious. Take PX = δx0 and QX = δx′0 .1 Then from the definition of ηTV , we
have ηTV ≥ TV(PY|X=x0 , PY|X=x′0 ) for any x0 and x′0 , x0 6= x′0 .
• ηTV ≤ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):
0
Define η̃ ≜ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ). We consider the discrete alphabet case for simplicity.
0
Fix any PX , QX and PY = PX ◦ PY|X , QY = QX ◦ PY|X . Observe that for any E ⊆ Y
Now suppose there are random variables X0 and X′0 having some marginals PX and QX respec-
tively. Consider any coupling π X0 ,X′0 with marginals PX and QX respectively. Then averaging
(33.5) and taking the supremum over E, we obtain
Now the left-hand side equals TV(PY , QY ) by Theorem 7.7(a). Taking the infimum over
couplings π the right-hand side evaluates to TV(PX , QX ) by Theorem 7.7(b).
Example 33.2 (ηTV of a Binary Symmetric Channel). The ηTV of the BSCδ is given by
Theorem 33.5.
If (U; Y)
ηf (PY|X ) = sup .
PUX : U→X→Y If (U; X)
1
δx0 is the probability distribution with P(X = x0 ) = 1
i i
i i
i i
Recall that for any Markov chain U → X → Y, DPI states that If (U; Y) ≤ If (U; X) and Theorem
33.5 gives the stronger bound
Proof. First, notice that for any u0 , we have Df (PY|U=u0 kPY ) ≤ ηf Df (PX|U=u0 kPX ). Averaging the
above expression over any PU , we obtain
If (U; Y) ≤ ηf If (U; X)
Second, fix P̃X , Q̃X and let U ∼ Bern(λ) for some λ ∈ [0, 1]. Define the conditional distribution
PX|U as PX|U=1 = P̃X , PX|U=0 = Q̃X . Take λ → 0, then (see [250] for technical subtleties)
Theorem 33.6. In the statements below ηf (and others) corresponds to ηf (PY|X ) for some fixed
PY|X .
where (recall β̄ ≜ 1 − β )
( 1 − x) 2
LCβ (PkQ) = Df (PkQ), f(x) = β̄β
β̄ x + β
is the Le Cam divergence of order β (recall (7.6) for β = 1/2).
(e) Consequently,
1 2 H4 (P0 , P1 )
H (P0 , P1 ) ≤ ηKL ≤ H2 (P0 , P1 ) − . (33.7)
2 4
(f) If the binary-input channel is also input-symmetric (or BMS, see Section 19.4*) then ηKL =
Iχ2 (X; Y) for X ∼ Bern(1/2).
i i
i i
i i
550
(g) For any channel the supremum in (33.3) can be restricted to PX , QX with a common binary
support. In other words, ηf (PY|X ) coincides with that of the least contractive binary subchannel.
Consequently, from (e) we conclude
1 diam H2
diam H2 ≤ ηKL (PY|X ) = diam LCmax ≤ diam H2 − ,
2 4
(in particular ηKL diam H2 ), where diam H2 (PY|X ) = supx,x′ ∈X H2 (PY|X=x , PY|X=x′ ),
diam LCmax = supx,x′ LCmax (PY|X=x , PY|x=x′ ) are Hellinger and Le Cam diameters of the
channel.
Proof. Most proofs in full generality can be found in [250]. For (a) one first shows that ηf ≤ ηTV
for the so-called Eγ divergences corresponding to f(x) = |x − γ|+ − |1 − γ|+ , which is not hard to
believe since Eγ is piecewise linear. Then the general result follows from the fact that any convex
function f can be approximated (as N → ∞) in the form
X
N
aj |x − cj |+ + a0 x + c0 .
j=1
For (b) see [66, Theorem 1] and [70, Proposition II.6.13 and Corollary II.6.16]. The idea of this
proof is as follows: :
• ηKL ≥ ηχ2 by locality. Recall that every f-divergence behaves locally behaves as χ2 –
Theorem 7.18.
R∞
• Using the identity D(PkQ) = 0 χ2 (PkQt )dt where Qt = tP1+ Q
+t , we have
Z ∞ Z ∞
D(PY kQY ) = χ2 (PY kQY t )dt ≤ ηχ2 χ2 (PX kQX t )dt = ηχ2 D(PX kQX ).
0 0
I(U; Y) = I(U; Y|∆) ≤ Eδ∼P∆ [(1 − 2δ)2 I(U; X|∆ = δ) = E[(1 − 2∆)2 ]I(U; X),
where we used the fact that I(U; X|∆ = δ) = I(U; X) and Example 33.3 below.
For (g) see Ex. VI.19.
i i
i i
i i
This example has the following consequence for the KL-divergence geometry.
Proposition 33.7. Consider any distributions P0 and P1 on X and let us consider the interval
in P(X ): Pλ = λP1 + (1 − λ)P0 for λ ∈ [0, 1]. Then divergence (with respect to the midpoint)
behaves subquadratically:
The same statement holds with D replaced by χ2 (and any other Df satisfying Theorem 33.6(b)).
Notice that for any metric d(P, Q) on P(X ) that is induced from the norm on the vector space
M(X ) of all signed measures (such as TV), we must necessarily have d(Pλ , P1−λ ) = |1 −
2λ|d(P0 , P1 ). Thus, the ηKL (BSCλ ) = (1 − 2λ)2 which yields the inequality is rather natural.
i i
i i
i i
552
B
X0 W
A
Example 33.4. Suppose we have a graph G = (V, E) as below This means that we have a joint
distribution factorizing as
PX0 ,A,B,W = PX0 PB|X0 PA,B|X0 PW|A,B .
Then every node has a channel from its parents to itself, for example W corresponds to a noisy
channel PW|A,B , and we can define η ≜ ηKL (PW|A,B ). Now, prepend another random variable U ∼
Bern(λ) at the beginning, the new graph G′ = (V′ , E′ ) is shown below: We want to verify the
B
U X0 W
A
relation
I(U; B, W) ≤ η̄ I(U; B) + η I(U; A, B). (33.8)
Recall that from chain rule we have I(U; B, W) = I(U; B) + I(U; W|B) ≥ I(U; B). Hence, if (33.8)
is correct, then η → 0 implies I(U; B, W) ≈ I(U; B) and symmetrically I(U; A, W) ≈ I(U; A).
Therefore for small δ , observing W, A or W, B does not give advantage over observing solely A or
B, respectively.
Observe that G′ forms a Markov chain U → X0 → (A, B) → W, which allows us to factorize
the joint distribution over E′ as
PU,X0 ,A,B,W = PU PX0 |U PA,B|X0 PW|A,B .
Now consider the joint distribution conditioned on B = b, i.e., PU,X0 ,A,W|B . We claim that the
conditional Markov chain U → X0 → A → W|B = b holds. Indeed, given B and A, X0 is
independent of W, that is PX0 |A,B PW|A,B = PX0 ,W|AB , from which follows the mentioned conditional
Markov chain. Using the conditional Markov chain, SDPI gives us for any b,
I(U; W|B = b) ≤ η I(U; A|B = b).
Averaging over b and adding I(U; B) to both sides we obtain
I(U; W, B) ≤ η I(U; A|B) + I(U; B)
= η I(U; A, B) + η̄ I(U; B).
From the characterization of ηf in Theorem 33.5 we conclude
ηKL (PW,B|X0 ) ≤ η · ηKL (PA,B|X0 ) + (1 − η) · ηKL (PB|X0 ) . (33.9)
Now, we provide another example which has in some sense an analogous setup to Example
33.4.
i i
i i
i i
B
R
X W
A
Example 33.5 (Percolation). Take the graph G = (V, E) in example 33.4 with a small modification.
See Fig. 33.2. Now, suppose X,A,B,W are some cities and the edge set E represents the roads
between these cities. Let R be a random variable denoting the state of the road connecting to W
with P(R is open) = η and P[R is closed] = η̄ . For any Y ∈ V, let the event {X → Y} indicate that
one can drive from X to Y. Then
P[X → B or W] = η P[X → A or B] + η̄ P[X → B]. (33.10)
Observe the resemblance between (33.9) and (33.10).
We will now give a theorem that relates ηKL to percolation probability on a DAG under the
following setting: Consider a DAG G = (V, E).
Under this model, for two subsets T, S ⊂ V we define perc[T → S] = P[∃ open path T → S].
Note that PXv |XPa(v) describe the stochastic recipe for producing Xv based on its parent variables.
We assume that in addition to a DAG we also have been given all these constituent channels (or
at least bounds on their ηKL coefficients).
Theorem 33.8 ([250]). Let G = (V, E) be a DAG and let 0 be a node with in-degree equal to zero
(i.e. a source node). Note that for any 0 63 S ⊂ V we can inductively stitch together constituent
channels PXv |XPa(v) and obtain PXS |X0 . Then we have
ηKL (PXS |X0 ) ≤ perc(0 → S). (33.11)
Proof. For convenience let us denote η(T) = ηKL (PXT |X0 ) and ηv = ηKL (PXv |XPa(v) ). The proof
follows from an induction on the size of G. The statement is clear for the |V(G)| = 1 since
S = ∅ or S = {X0 }. Now suppose the statement is already shown for all graphs smaller than
G. Let v be the node with out-degree 0 in G. If v 6∈ S then we can exclude it from G and the
statement follows from induction hypothesis. Otherwise, define SA = Pa(v) \ S and SB = S \ {v},
A = XSA , B = XSB , W = Xv . (If 0 ∈ A then we can create a fake 0′ with X0′ = X0 and retain
0′ ∈ A while moving 0 out of A. So without loss of generality, 0 6∈ A.) Prepending arbitrary U to
the graph as U → X0 , the joint DAG of random variables (X0 , A, B, W) is then given by precisely
the graph in (33.8). Thus, we obtain from (33.9) the estimate
η(S) ≤ ηv η(SA ∪ SB ) + (1 − ηv )ηKL (SB ) . (33.12)
i i
i i
i i
554
From induction hypothesis η(SA ∪ SB ) ≤ perc(0 → SA ) and η(SB ) ≤ perc(0 → SB ) (they live on
a graph G \ {v}). Thus, from computation (33.10) we see that the right-hand side of (33.12) is
precisely perc(0 → S) and thus η(S) ≤ perc(S) as claimed.
We are now in position to complete the postponed proof.
Proof of Theorem 33.2. First observe the noisy boolean circuit is a form of DAG. Since the gates
are δ -noisy contraction coefficients of constituent channels ηv in the DAG can be bounded by
(1 − 2δ)2 . Thus, in the percolation question all vertices are open with probability (1 − 2δ)2
From SDPI, for each i, we have I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ). From Theorem 33.8, we know
ηKL (PY|Xi ) ≤ perc(Xi → Y). We now want to upper bound perc(Xi → Y). Recall that the minimum
distance between Xi and Y is di . For any path π of length ℓ(π ) from Xi to Y, therefore, the probability
that it will be open is ≤ (1 − 2δ)2ℓ(π ) . We can thus bound
X
perc(Xi → Y) ≤ (1 − 2δ)2ℓ(π ) . (33.13)
π :Xi →Y
Let us now build paths backward starting from Y, which allows us to represent paths X → Yi
as vertices of a K-ary tree with root Yi . By labeling all vertices on a K-ary tree corresponding
to paths X → Yi we observe two facts: the labeled set V is prefix-free (two labeled vertices are
never in ancestral relation) and the depth of each labeled set is at least di . It is easy to see that
P
u∈V c
depth(u)
≤ (Kc)di provided Kc ≤ 1 and attained by taking V to be set of all vertices in the
tree at depth di . We conclude that whenever K(1 − 2δ)2 ≤ 1 the right-hand side of (33.13) is
bounded by (K(1 − 2δ)2 )di , which concludes the proof by upper bounding H(Xi ) ≤ log 2 as
I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ) ≤ Kdi (1 − 2δ)2di log 2
We conclude the section with an example illustrating that Theorem 33.8 may give stronger
bounds when compared to Theorem 33.2.
Example 33.6. Suppose we have the topological restriction on the placement of gates (namely
that the inputs to each gets should be from nearest neighbors to the left), resulting in the following
circuit of 2-input δ -noisy gates. Note that each gate may be a simple passthrough (i.e. serve as
router) or a constant output. Theorem 33.2 states that if (1 − 2δ)2 < 21 , then noisy computation
i i
i i
i i
within arbitrary topology is not possible. Theorem 33.8 improves this to (1 − 2δ)2 < pc , where pc
is the oriented site-percolation threshold for the particular graph we have. Namely, if each vertex
is open with probability p < pc then with probability 1 the connected component emanating from
any given node (and extending to the right) is finite. For the example above the site percolation
threshold is estimated as pc ≈ 0.705 (so called Stavskaya automata).
Definition 33.9 (Input Dependent Contraction Coefficient). For any input distribution PX , Markov
kernel PY|X and convex function f, we define
Df (QY kPY )
ηf (PX , PY|X ) ≜ sup
Df (QX kPX )
where PY = PY|X ◦ PX , QY = PY|X ◦ QX and supremum is over QX satisfying 0 < Df (PX kQX ) < ∞.
We refer to ηf (PX , PY|X ) as the input dependent contraction coefficient, to contrast it with the
input independent contraction coefficient ηf (PY|X ).
Remarks:
• Although we have the equality ηKL (PY|X ) = ηχ2 (PY|X ) when PY|X is a BMS channel, we do not
have the same equality for ηKL (PX , PY|X ).
Example 33.7. (ηKL (PX , PY|X ) for Erasure Channel) We define ECτ as the following channel,
(
X w.p. 1 − τ
Y=
? w.p. τ.
Let us define an auxiliary random variable B = 1{Y =?}. Thus we have the following equality,
I(U; Y) = I(U; Y, B) = I(U; B) +I(U; Y|B) = (1 − τ )I(U; X).
| {z }
0,B⊥
⊥U
i i
i i
i i
556
where the last equality is due to the fact that I(U; Y|B = 1) = 0 and I(U; Y|B = 0) = I(U; X). By
the mutual information characterization of ηKL (PX , PY|X ), we have ηKL (PX , ECτ ) = 1 − τ .
Proposition 33.10 (Tensorization of ηKL ). For a given number n, two measures PX and PY|X we
have
ηKL (P⊗ n ⊗n
X , PY|X ) = ηKL (PX , PY|X )
i.i.d.
In particular, if (Xi , Yi ) ∼ PX,Y , then ∀PU|Xn
I(U; Yn ) ≤ ηKL (PX , PY|X )I(U; Xn )
Proof. Without loss of generality (by induction) it is sufficient to prove the proposition for n = 2.
It is always useful to keep in mind the following diagram Let η = ηKL (PX , PY|X )
X1 Y1
X2 Y2
i i
i i
i i
′ |X X3 ……
PX
X1
P X′ |X X5 ……
PX′ |X
Xρ
PX′ |X X6 ……
PX ′
|X X2
PX ′
|X X4 ……
• We can think of this model as a broadcasting scenario, where the root broadcasts its message
Xρ to the leaves through noisy channels PX′ |X . The condition (33.17) here is only made to avoid
defining the reverse channel. In general, one only requires that π is a stationary distribution of
PX′ |X , in which case the (33.19) should be replaced with ηKL (π , PX|X′ )b < 1.
• This model arises frequently in community detection, sparse codes and statistical physics.
• Under the assumption (33.17), the joint distribution of this tree can also be written as a Gibbs
distribution
1 X X
PXall = exp f(Xp , Xc ) + g(Xv ) , (33.18)
Z
(p,c)∈E v∈V
where Z is the normalization constant, f(xp , xc ) = f(xc , xp ) is symmetric. When X = {0, 1}, this
model is known as the Ising model (on a tree). Note, however, that not every measure factorizing
as (33.18) (with symmetric f) can be written as a broadcasting process for some P and π.
We can define a corresponding inference problem, where we want to reconstruct the root variable
Xρ given the observations XLd = {Xv : v ∈ Ld }, with Ld = {v : v ∈ V, depth(v) = d}. A
natural question is to upper bound the performance of any inference algorithm on this problem.
The following theorem shows that there exists a phase transition depending on the branching factor
b and the contraction coefficient of the kernel PX′ |X .
Theorem 33.11. Consider the broadcasting problem on infinite b-ary tree (b > 1), with root
distribution π and edge kernel PX′ |X . If π is a reversible measure of PX′ |X such that
ηKL (π , PX′ |X )b < 1, (33.19)
then I(Xρ ; XLd ) → 0 as d → 0.
Proof. For every v ∈ L1 , we define the set Ld,v = {u : u ∈ Ld , v ∈ ancestor(u)}. We can upper
bound the mutual information between the root vertex and leaves at depth d
X
I(Xρ ; XLd ) ≤ I(Xρ ; XLd,v ).
v∈L1
i i
i i
i i
558
Due to our assumption on π and PX′ |X , we have PXρ |Xv = PX′ |X and PXv = π. By the definition of
the contraction coefficient, we have
Observe that because PXv = π and all edges have the same kernel, then I(XLd,v ; Xv ) = I(XLd−1 ; Xρ ).
This gives us the inequality
which implies
then no inference algorithm can recover the root nodes as depth of the tree goes to infinity. This
result is originally proved in [39].
Example 33.9 (k-coloring on tree). Given a b-ary tree, we assign a k-coloring Xvall by sampling
uniformly from the ensemble of all valid k-coloring. For this model, we can define a corresponding
inference problem, namely given all the colors of the leaves at a certain depth, i.e., XLd , determine
the color of the root node, i.e., Xρ .
This problem can be modeled as a broadcasting problem on tree where the root distribution π
is given by the uniform distribution on k colors, and the edge kernel PX′ |X is defined as
(
1
a 6= b
PX′ |X (a|b) = k−1
0, a = b.
It can be shown, see Ex. VI.23, that ηKL (Unif, PX′ |X ) = k log k(11+o(1)) . By Theorem 33.11, this
implies that if b < k log k(1 + o(1)) then reliable reconstruction of the root node is not possible.
This result is originally proved in [288] and [32]
The other direction b > k log k(1 + o(1)) can be shown by observing that if b > k log k(1 + o(1))
then the probability of the children of a node taking all available colors (except its own) is close to
1. Thus, an inference algorithm can always determine the color of a node by finding a color that
is not assigned to any of its children. Similarly, when b > (1 + ϵ)k log k even observing (1 − ϵ)-
fraction of the node’s children is sufficient to reconstruct its color exactly. Proceeding recursively
i i
i i
i i
from bottom up, such a reconstruction algorithm will succeed with high probability. In this regime
with positive probability (over the leaf variables) the posterior distribution of the root color is a
delta-function (deterministic). This effect is known as “freezing” of the root given the boundary.
Notice that in this problem we are not sample-limited (each party has infinitely many samples),
but communication-limited (only B bits can be exchanged).
Here is a trivial attempt to solve it. Notice that if Bob sends W = (Y1 , . . . , YB ) then the optimal
PB
estimator is ρ̂(X∞ , W) = 1n i=1 Xi Yi which has minimax error B1 , hence R∗ (B) ≤ B1 . Surprisingly,
this can be improved.
Proof. Fix PW|Y∞ , we get the following decomposition Note that once the messages W are fixed
X1 Y1
.. ..
. .
W Xi Yi
.. ..
. .
we have a parameter estimation problem {Qρ , ρ ∈ [−1, 1]} where Qρ is a distribution of (X∞ , W)
when A∞ , B∞ are ρ-correlated. Since we minimize MMSE, we know from the van Trees inequality
(Theorem 29.2) 2 that R∗ (B) ≥ min1+o(1)
ρ JF (ρ)
≥ 1J+Fo(0(1)) where JF (ρ) is the Fisher Information of the
family {Qρ }.
Recall, that we also know from the local approximation that
ρ2 log e
D(Qρ kQ0 ) = JF (0) + o(ρ2 )
2
2
This requires some technical justification about smoothness of the Fisher information JF (ρ).
i i
i i
i i
560
hence JF (0) ≤ (2 ln 2)B + o(1) which in turns implies the theorem. For full details and the
extension to interactive communication between Alice and Bob see [150].
We comment on the upper bound next. First, notice that by taking blocks of m → ∞ consecutive
Pim−1
bits and setting X̃i = √1m j=(i−1)m Xj and similarly for Ỹi , Alice and Bob can replace ρ-correlated
i.i.d. 1 ρ
bits with ρ-correlated standard Gaussians (X̃i , Ỹi ) ∼ N (0, ). Next, fix some very large N
ρ 1
and let
W = argmax Yj .
1≤j≤N
√
From standard concentration results we know that E[YW ] = 2 ln N(1 + o(1)) and Var[YW ] =
O( ln1N ). Therefore, knowing W Alice can estimate
XW
ρ̂ = .
E [ YW ]
1−ρ2 +o(1)
This is an unbiased estimator and Varρ [ρ̂] = 2 ln N . Finally, setting N = 2B completes the
argument.
Definition 33.13 (partial orders on channels). Let PY|X and PZ|X be two channels. We say that PY|X
is a degradation of PZ|X , denoted PY|X ≤deg PZ|X , if there exists PY|Z such that PY|X = PY|Z ◦PZ|X . We
say that PZ|X is less noisy than PY|X , PY|X ≤ln PZ|X , iff for every PU,X on the following Markov chain
we have I(U; Y) ≤ I(U; Z). We say that PZ|X is more capable than PY|X , denoted PY|X ≤mc PZ|X if
U X
i i
i i
i i
• PY|X ≤deg PZ|X =⇒ PY|X ≤ln PZ|X =⇒ PY|X ≤mc PZ|X . Counter examples for reverse
implications can be found in [81, Problem 15.11].
• For less noisy we also have the equivalent definition in terms of the divergence, namely PY|X ≤ln
PZ|X if and only if for all PX , QX we have D(QY kPY ) ≤ D(QZ kPZ ). We refer to [208, Sections
I.B, II.A] and [250, Section 6] for alternative useful characterizations of the less-noisy order.
• For BMS channels (see Section 19.4*) it turns out that among all channels with a given
Iχ2 (X; Y) = η (with X ∼ Ber(1/2)) the BSC and BEC are the minimal and maximal elements
in the poset of ≤ln ; see Ex. VI.20 for details.
Proposition 33.14. ηKL (PY|X ) ≤ 1 − τ if and only if PY|X ≤LN ECτ , where ECτ was defined in
Example 33.7.
Proposition 33.15. (Tensorization of Less Noisy Ordering) If for all i ∈ [n], PYi |Xi ≤LN PZi |Xi , then
PY1 |X1 ⊗ PY2 |X2 ≤LN PZ1 |X1 ⊗ PZ2 |X2 . Note that P ⊗ Q refers to the product channel of P and Q.
Y1
X1
Z1
U
Y2
X2
Z2
i i
i i
i i
562
It can be seen from the Markov chain that I(U; Y1 , Y2 ) ≤ I(U; Y1 , Z2 ) implies I(U; Y1 , Y2 ) ≤
I(U; Z1 , Z2 ). Consider the following inequalities,
X2 X6
Y2
Y6
6
2
Y5
Y1
Y35 7
X1 X3 X5 X7
Y5
Y1
9
4
Y7
Y3
9
4
X4 X9
Example 33.10 (Community Detection). In this model, we consider a complete graph with n
vertices, i.e. Kn , and the random variables Xv representing the membership of each vertex to one
of the m communities. We assume that Xv is sampled uniformly from [m] and independent of the
other vertices. The observation Yu,v is defined as
(
Ber(a/n) Xu = Xv
Yuv ∼
Ber(b/n) Xu 6= Xv .
Example 33.11 (Z2 Synchronization). For any graph G, we sample Xv uniformly from {−1, +1}
and Ye = BSCδ (Xu Xv ).
Example 33.12 (Spiked Wigner Model). We consider the inference problem of determining the
value of vector (Xi )i∈[n] given the observation (Yij )i,j∈[n],i≤j . The Xi ’s and Yij ’s are related by a
linear model
r
λ
Yij = Xi Xj + Wij ,
n
i i
i i
i i
where Xi is sampled uniformly from {−1, +1} and Wij ∼ N(0, 1). This model can also be written
in matrix form as
r
λ
Y= XXT + W
n
where W is the Wigner matrix, hence the name of the model. It is used as a probabilistic model
for principle component analysis (PCA).
This problem can also be treated as a problem of inference on undirected graph. In this case,
the underlying graph is a complete graph, and we assign Xi to the ith vertex. Under this model, the
edge observations is given by Yij = BIAWGNλ/n (Xi Xj ).
Although seemingly different, these problems share similar characteristics, namely:
(Xu , Xv ) → B → Ye .
In other words, the observation on each edge only depends on whether the random variables on
its endpoints are similar.
We will refer to the problem which have this characteristics as the Special Case (S.C.). Due to
S.C.,the reconstructed Xv ’s is symmetric up to any permutation on X . In the case of alphabet X =
{−1, +1}, this implies that for any realization σ then PXall |Yall (σ|b) = PXall |Yall (−σ|b). Consequently,
our reconstruction metric also needs to accommodate this symmetry. For X = {−1, +1}, this
Pn
leads to the use of n1 | i=1 Xi X̂i | as our reconstruction metric.
Our main theorem for undirected inference problem can be seen as the analogue of the infor-
mation percolation theorem for DAG. However, instead of controlling the contraction coefficient,
the percolation probability is used to directly control the conditional mutual information between
any subsets of vertices in the graph.
Before stating our main theorem, we will need to define the corresponding percolation model
for inference on undirected graph. For any undirected graph G = (V, E) we define a percolation
model on this graph as follows :
• Every edge e ∈ E is open with the probability ηKL (PYe |Xe ), independent of the other edges,
• For any v ∈ V and S ⊂ V , we define the v ↔ S as the event that there exists an open path from
v to any vertex in S,
• For any S1 , S2 ⊂ V , we define the function percu (S1 , S2 ) as
X
percu (S1 , S2 ) ≜ P(v ↔ S2 ).
v∈S1
Notice that this function is different from the percolation function for information percolation
in DAG. Most importantly, this function is not equivalent to the exact percolation probability.
i i
i i
i i
564
Instead, it is an upper bound on the percolation probability by union bounding with respect to
S1 . Hence, it is natural that this function is not symmetric, i.e. percu (S1 , S2 ) 6= percu (S2 , S1 ).
Instead of proving theorem 33.16 in its full generality, we will prove the theorem under S.C.
condition. The main step of the proof utilizes the fact we can upper bound the mutual information
of any channel by its less noisy upper bound.
Theorem 33.17. Consider the problem of inference on undirected graph G = (V, E) with
X1 , ..., Xn are not necessarily independent. If PYe |Xe ≤LN PZe |Xe , then for any S1 , S2 ⊂ V and E ⊂ E
X1 X2 X3 X4
From our assumption and the tensorization property of less noisy ordering (Proposition 33.15),
we have PYE |XS1 ,XS2 ≤LN PZE |XS1 ,XS2 . This implies that for σ as a valid realization of XS2 we will
have
I(XS1 ; YE |XS2 = σ) = I(XS1 , XS2 ; YE |XS2 = σ) ≤ I(XS1 , XS2 ; ZE |XS2 = σ) = I(XS1 ; ZE |XS2 = σ).
As this inequality holds for all realization of XS2 , then the following inequality also holds
Proof of Theorem 33.16. We only give a proof under the S.C. condition above and only for the
case S1 = {i}. For the full proof (that proceeds by induction and does not leverage the less noisy
idea), see [252]. We have the following equalities
i i
i i
i i
Due to our previous result, if ηKL (PYe |Xe ) = 1 − τ then PYe |Xe ≤LN PZe |Xe where PZe |Xe = ECτ .
By tensorization property, this ordering also holds for the channel PYE |XE , thus we have
I(Xi ; YE |XS2 ) ≤ I(Xj ; ZE |XS2 ).
Let us define another auxiliary random variable D = 1{i ↔ S2 }, namely it is the indicator that
there is an open path from i to S2 . Notice that D is fully determined by ZE . By the same argument
as in (33.20), we have
I(Xi ; ZE |XS2 ) = I(Xi ; XS2 |ZE )
= I(Xi ; XS2 |ZE , D)
= (1 − P[i ↔ S2 ]) I(Xi ; XS2 |ZE , D = 0) +P[i ↔ S2 ] I(Xi ; XS2 |ZE , D = 1)
| {z } | {z }
0 ≤log |X |
≤ P[i ↔ S2 ] log |X |
= percu (i, S2 )
Theorem 33.18. Consider the spiked Wigner model. If λ ≤ 1, then for any sequence of estimators
Xˆn (Y),
" n #
X
1
E Xi X̂i → 0 (33.21)
n
i=1
as n → ∞.
p
Proof of Theorem 33.18. Note that by E[|T|] ≤ E[T2 ] the left-hand side of (33.21) can be
upper-bounded by
s X
1
E[ Xi Xj X̂i X̂j ] .
n
i, j
Next, it is clear that we can simplify the task of maximizing (over X̂n ) by allowing to separately
estimate each product by T̂i,j , i.e.
X X
max E[ Xi Xj X̂i X̂j ] ≤ max E[Xi Xj T̂i,j ] .
X̂n T̂i,j
i,j i,j
i i
i i
i i
566
(For example, we may notice I(Xi ; Xj |Y) = I(Xi , Xj ; Y) ≥ I(Xi Xj ; Y) and apply Fano’s inequality).
Thus, from symmetry of the problem it is sufficient to prove I(X1 ; X2 |Y) → 0 as n → ∞.
By using the undirected information percolation theorem, we have
in which the percolation model is defined on a complete graph with edge probability λ+no(1) as
ηKL (BIAWGNλ/n ) = λn (1 + o(1)). We only treat the case of λ < 1 below. For such λ we can over-
′
bound λ+no(1) by λn with λ′ < 1. This percolation random graph is equivalent to the Erd�s-Rényi
random graph with n vertices and λ′ /n edge probability, i.e., ER(n, λ′ /n). Using this observation,
the inequality can be rewritten as
The largest components on ER(n, λ′ /n) contains O(log n) if λ′ < 1. This implies that the proba-
bility that two specific vertices are connected is o(1), hence I(X2 ; X1 |Y) → 0 as n → ∞. To treat
the case of λ = 1 we need slightly more refined information about behavior of giant component
of ER(n, 1+on(1) ) graph, see [252].
Remark 33.2 (Dense-Sparse equivalence). This reduction changes the underlying structure of the
graph. Instead of dealing with a complete graph, the percolation problem is defined on an Erd�s-
Rényi random graph. Moreover, if ηKL is small enough, then the underlying percolation graph
tends to have a locally tree-like structure. This is demonstration of the ubiquitous effect: dense
inference (such as spiked Wigner or sparse regression) with very weak signals (ηKL ≈ 1) is similar
to sparse inference (broadcasting on trees) with moderate signals (ηKL ∈ (ϵ, 1 − ϵ)).
Definition 33.19 (Post-SDPI constant). Given a conditional measure PY|X , define the input-
dependent and input-free contraction coefficients as
(p) I(U; X)
ηKL (PX , PY|X ) = sup :X→Y→U
PU|Y I ( U; Y )
(p) I(U; X)
ηKL (PY|X ) = sup :X→Y→U
PX ,PU|Y I(U; Y)
i i
i i
i i
X Y U
ε̄ 0 τ̄ 0 0
τ
?
τ
ε 1 τ̄ 1 1
where PY = PY|X ◦ PX and PX|Y is the conditional measure corresponding to PX PY|X . From (33.22)
and Prop. 33.10 we also get tensorization property for input-dependent post-SDPI:
(p) (p)
ηKL (PnX , (PY|X )n ) = ηKL (PnX , (PY|X )n ) (33.24)
(p)
It is easy to see that by the data processing inequality, ηKL (PY|X ) ≤ 1. Unlike the ηKL coefficient
(p)
the ηKL can equal to 1 even for a noisy channel PY|X .
(p)
Example 33.13 (ηKL = 1). Let PY|X = BECτ and X → Y → U be defined as on Fig. 33.3 Then
we can compute I(Y; U) = H(U) = h(ετ̄ ) and I(X; U) = H(U) − H(U|X) = h(ετ̄ ) − εh(τ ) hence
(p) I(X; U)
ηKL (PY|X ) ≥
I(Y; U)
ε
= 1 − h(τ )
h(ετ̄ )
This last term tends to 1 when ε tends to 0 hence
( p)
ηKL (BECτ ) = 1
Theorem 33.20.
(p)
ηKL (BSCδ ) = (1 − 2δ)2
i i
i i
i i
568
Theorem 33.22 (Post-SDPI for BI-AWGN). Let 0 ≤ ϵ ≤ 1 and consider the channel PY|X with
X ∈ {±1} given by
Y = ϵX + Z, Z ∼ N ( 0, 1) .
Then for any π ∈ (0, 1) taking PX = Ber(π ) we have for some absolute constant K the estimate
(p) ϵ2
ηKL (PX , PY|X ) ≤ K .
π (1 − π )
Proof. In this proof we assume all information measures are used to base-e. First, notice that
1
v( y) ≜ P [ X = 1 | Y = y] = 1−π −2yϵ
.
1+ π e
(p)
Then, the optimization defining ηKL can be written as
(p) d(EQY [v(Y)]kπ )
ηKL (PX , PY|X ) ≤ sup . (33.25)
QY D(QY kPY )
From (7.31) we have
(p) 1 (EQY [v(Y)] − π )2
ηKL (PX , PY|X ) ≤ sup . (33.26)
π (1 − π ) QY D(QY kPY )
To proceed, we need to introduce a new concept. The T1 -transportation inequality for the
measure PY states the following: For every QY we have for some c = c(PY )
p
W1 (QY , PY ) ≤ 2cD(QY kPY ) , (33.27)
i i
i i
i i
Numerically, for π = 1/2 it turns out that the optimal value is λ → 12 , justifying our overbounding
of d by χ2 , and surprisingly giving
(p)
ηKL (Ber(1/2), PY|X ) = 4 EPY [tanh2 (ϵY)] = ηKL (PY|X ) ,
where in the last equality we used Theorem 33.6(f)).
i i
i i
i i
570
estimator is to minimize supθ E[kθ − θ̂k2 ] over θ̂. If we denote by Ui ∈ Ui the messages then
P
i log2 |Ui | ≤ B and the diagram is
Y1 U1
θ .. ..
. . θ̂
Ym Um
Finally, let
Observations:
• Without constraint on the magnitude of θ ∈ [−1, 1]d , we could give θ ∼ N (0, bId ) and from
rate-distortion quickly conclude that estimating θ within risk R requires communicating at least
2 log R bits, which diverges as b → ∞. Thus, restricting the magnitude of θ is necessary in
d bd
i i
i i
i i
dϵ2
Theorem 33.23. There exists a constant c1 > 0 such that if R∗ (m, d, σ 2 , B) ≤ 9 then B ≥ c1 d
ϵ2
.
Proof. Let X ∼ Unif({±1}d ) and set θ = ϵX. Given an estimate θ̂ we can convert it into an
estimator of X via X̂ = sign(θ̂) (coordinatewise). Then, clearly
ϵ2 dϵ 2
E[dH (X, X̂)] ≤ E[kθ̂ − θk2 ] ≤ .
4 9
Thus, we have an estimator of X within Hamming distance 94 d. From Rate-Distortion (Theo-
rem 26.1) we conclude that I(X; X̂) ≥ cd for some constant c > 0. On the other hand, from
the standard DPI we have
X
m
cd ≤ I(X; X̂) ≤ I(X; U1 , . . . , Um ) ≤ I ( X ; Uj ) , (33.29)
j=1
where we also applied Theorem 6.1. Next we estimate I(X; Uj ) via I(Yj ; Uj ) by applying the Post-
SDPI. To do this we need to notice that the channel X → Yj for each j is just a memoryless
extension of the binary-input AWGN channel with SNR ϵ. Since each coordinate of X is uniform,
we can apply Theorem 33.22 (with π = 1/2) together with tensorization (33.24) to conclude that
I(X; Uj ) ≤ 4Kϵ2 I(Yj ; Uj ) ≤ 4Kϵ2 log |Uj |
Together with (33.29) we thus obtain
cd ≤ I(X; X̂) ≤ 4Kϵ2 B log 2 (33.30)
i i
i i
i i
i.i.d.
VI.1 Let X1 , . . . , Xn ∼ Exp(exp(θ)), where θ follows the Cauchy distribution π with parameter s,
whose pdf is given by p(θ) = 1
θ2
for θ ∈ R. Show that the Bayes risk
πs(1+ )
s2
3. Argue that the hardest regime for system identification is when θ ≈ 0, and that instability
(|θ| > 1) is in fact helpful.
VI.3 (Linear regression) Consider the model
Y = Xβ + Z
where the design matrix X ∈ Rn×d is known and Z ∼ N(0, In ). Define the minimax mean-square
error of estimating the regression coefficient β ∈ Rd based on X and Y as follows:
R∗est = inf sup Ekβ̂ − βk22 . (VI.1)
β̂ β∈Rd
Redo (a) and (b) by finding the value of R∗pred and identify the minimax estimator. Explain
intuitively why R∗pred is always finite even when d exceeds n.
i i
i i
i i
i.i.d.
VI.4 (Chernoff-Rubin-Stein lower bound.) Let X1 , . . . , Xn ∼ Pθ and θ ∈ [−a, a].
(a) State the appropriate regularity conditions and prove the following minimax lower bound:
2 2 (1 − ϵ)
2
inf sup Eθ [(θ − θ̂) ] ≥ min max ϵ a ,
2
,
θ̂ θ∈[−a,a] 0<ϵ<1 nJ̄F
1
Ra
where J̄F = 2a J (θ)dθ is the average Fisher information. (Hint: Consider the uniform
−a F
prior on [−a, a] and proceed as in the proof of Theorem 29.2 by applying integration by
parts.)
(b) Simplify the above bound and show that
1
inf sup Eθ [(θ − θ̂)2 ] ≥ p . (VI.3)
θ̂ θ∈[−a,a] ( a− 1 + nJ̄F )2
(c) Assuming the continuity of θ 7→ JF (θ), show that the above result also leads to the optimal
local minimax lower bound in Theorem 29.4 obtained from Bayesian Cramér-Rao:
1 + o( 1)
inf sup Eθ [(θ − θ̂)2 ] ≥ .
θ̂ θ∈[θ0 ±n−1/4 ] nJF (θ0 )
Note: (VI.3) is an improvement of the inequality given in [65, Lemma 1] without proof and
credited to Rubin and Stein.
VI.5 In this exercise we give a Hellinger-based lower bound analogous to the χ2 -based HCR lower
bound in Theorem 29.1. Let θ̂ be an unbiased estimator for θ ∈ Θ ⊂ R.
(a) For any θ, θ′ ∈ Θ, show that [283]
1 (θ − θ′ )2 1
(Varθ (θ̂) + Varθ′ (θ̂)) ≥ −1 . (VI.4)
2 4 H2 (Pθ , Pθ′ )
R √ √ √ √
(Hint: For any c, θ − θ′ = (θ̂ − c)( pθ + pθ′ )( pθ − pθ′ ). Apply Cauchy-Schwarz
and optimize over c.)
(b) Show that
1
H2 (Pθ , Pθ′ ) ≤ (θ − θ′ )2 J̄F (VI.5)
4
R θ′
where J̄F = θ′ 1−θ θ JF (u)du is the average Fisher information.
(c) State the needed regularity conditions and deduce the Cramér-Rao lower bound from (VI.4)
and (VI.5) with θ′ → θ.
(d) Extend the previous parts to the multivariate case.
VI.6 (Bayesian distribution estimation.) Let {Pθ : θ ∈ Θ} be a family of distributions on X
with a common dominating measure μ and density pθ (x) = dP n
dμ (x). Given a sample X =
θ
i.i.d.
(X1 , . . . , Xn ) ∼ Pθ for some θ ∈ Θ, the goal is to estimate the data-generating distribution Pθ by
some estimator P̂(·) = P̂(·; Xn ) with respect to some loss function ℓ(P, P̂). Suppose we are in
a Bayesian setting where θ is drawn from a prior π. Let’s find the form of the Bayes estimator
and the Bayes risk
i i
i i
i i
(a) For convenience, let Xn+1 denote a test data point (unseen) drawn from Pθ and independent
of the observed data Xn . Convince yourself that every estimator P̂ can be formally identified
as a conditional distribution QXn+1 |Xn .
(b) Consider the KL loss ℓ(P, P̂) = D(PkP̂). Using Corollary 4.2, show that the Bayes estimator
minimizing the average KL risk is the posterior (conditional mean). Namely,
is achieved at QXn+1 |Xn = PXn+1 |Xn . In other words, the estimated density value at xn+1 is
Qn+1
dQXn+1 |Xn (xn+1 |xn ) Eθ∼π [ i=1 pθ (xi )]
= Qn . (VI.6)
dμ Eθ∼π [ i=1 pθ (xi )]
is achieved by
" n #
dQXn+1 |Xn (xn+1 |xn ) Y
∝ Eθ [pθ (xn+1 ) |X = x ] ∝ Eθ∼π
2 n n 2
pθ (xi )pθ (xn+1 ) . (VI.7)
dμ
i=1
i.i.d.
(e) Consider the discrete alphabet [k] and Xn ∼ P, where P = (P1 , . . . , Pk ) is drawn from
the Dirichlet prior Dirichlet(α, . . . , α). Applying (VI.6) and (VI.7) (with μ the count-
ing measure), show that the Bayes estimator for the KL loss is the add-α estimator
(Section 13.5):
nj + α
Pbj = , (VI.9)
n + kα
Pn
where nj = i=1 1{Xi =j} is the empirical count, and for the χ2 loss is
p
(nj + α)(nj + α + 1)
Pbj = Pk p . (VI.10)
j=1 (nj + α)(nj + α + 1)
i.i.d.
VI.7 (Coin flips) Given X1 , . . . , Xn ∼ Ber(θ) with θ ∈ Θ = [0, 1], we aim to estimate θ with respect
to the quadratic loss function ℓ(θ, θ̂) = (θ − θ̂)2 . Denote the minimax risk by R∗n .
i i
i i
i i
(a) Use the empirical frequency θ̂emp = X̄ to estimate θ. Compute and plot the risk Rθ (θ̂) and
show that
1
R∗n ≤ .
4n
(b) Compute the Fisher information of Pθ = Ber(θ)⊗n and Qθ = Bin(n, θ). Explain why they
are equal.
(c) Invoke the Bayesian Cramér-Rao lower bound Theorem 29.2 to show that
1 + o( 1)
R∗n = .
4n
(d) Notice that the risk of θ̂emp is maximized at 1/2 (fair coin), which suggests that it might be
possible to hedge against this situation by the following randomized estimator
(
θ̂emp , with probability δ
θ̂rand = 1 (VI.11)
2 with probability 1 − δ
Find the worst-case risk of θ̂rand as a function of δ . Optimizing over δ , show the improved
upper bound:
1
R∗n ≤ .
4( n + 1)
(e) As discussed in Remark 28.3, randomized estimator can always be improved if the loss is
convex; so we should average out the randomness in (VI.11) by considering the estimator
1
θ̂∗ = E[θ̂rand |X] = X̄δ + (1 − δ). (VI.12)
2
Optimizing over δ to minimize the worst-case risk, find the resulting estimator θ̂∗ and its
risk, show that it is constant (independent of θ), and conclude
1
R∗n ≤ √ .
4(1 + n)2
(f) Next we show θ̂∗ found in part (e) is exactly minimax and hence
1
R∗n = √ .
4(1 + n)2
Consider the following prior Beta(a, b) with density:
Γ(a + b) a−1
π (θ) = θ (1 − θ)b−1 , θ ∈ [0, 1],
Γ(a)Γ(b)
R∞ √
where Γ(a) ≜ 0 xa−1 e−x dx. Show that if a = b = 2n , θ̂∗ coincides with the Bayes
estimator for this prior, which is therefore least favorable. (Hint: work with the sufficient
statistic S = X1 + . . . + Xn .)
(g) Show that the least favorable prior is not unique; in fact, there is a continuum of them. (Hint:
consider the Bayes estimator E[θ|X] and show that it only depends on the first n + 1 moments
of π.)
i i
i i
i i
i.i.d.
(h) (Larger alphabet) Suppose X1 , . . . , Xn ∼ P on [k]. Show that for any k, n, the minimax
squared risk of estimating P in Theorem 29.5 is exactly
b − Pk22 ] = √ 1 k−1
R∗sq (k, n) = inf sup E[kP 2
, (VI.13)
b
P P∈Pk ( n + 1) k
√
achieved by the add- kn estimator. (Hint: For the lower bound, show that the Bayes estimator
for the squared loss and the KL loss coincide, then apply (VI.9) in Exercise VI.6.)
i.i.d.
(i) (Nonparametric extension) Suppose X1 , . . . , Xn ∼ P, where P is an arbitrary probability
distribution on [0, 1]. The goal is to estimate the mean of P under the quadratic loss. Show
that the minimax risk equals 4(1+1√n)2 .
VI.8 (Distribution estimation in TV) Continuing (VI.13), we show that the minimax rate for
estimating P with respect to the total variation loss is
r
∗ k
RTV (k, n) ≜ inf sup EP [TV(P̂, P)] ∧ 1, ∀ k ≥ 2, n ≥ 1, (VI.14)
P̂ P∈Pk ) n
(a) Show that the MLE coincides with the empirical distribution.
(b) Show that the MLE achieves the RHS of (VI.14) within constant factors.
(c) Establish the minimax lower bound. (Hint: apply Assouad’s lemma, or Fano’s inequality
(with volume method or explicit construction of packing), or the mutual information method
directly.)
VI.9 (Distribution estimation in KL and χ2 ) Continuing Exercise VI.8, let us consider estimating the
distribution P in KL and χ2 divergence, which are unbounded loss. We show that
(k
∗ k k≤n
RKL (k, n) ≜ inf sup EP [D(P̂kP)] log 1 + n k (VI.15)
P̂ P∈Pk n log n k ≥ n
and
k
R∗χ2 (k, n) ≜ inf sup EP [χ2 (P̂kP)] . (VI.16)
P̂ P∈Pk ) n
To this end, we will apply results on Bayes risk in Exercise VI.6 as well as multiple inequalities
between f-divergences from Chapter 7.
(a) Show that the empirical distribution, which has been shown optimal for the TV loss in
Exercise VI.8, achieves infinite KL and χ2 loss in the worst case.
(b) To show the upper bound in (VI.16), consider the add-α estimator P̂ in (VI.9) with α = 1.
Show that
k−1
E[χ2 (PkP̂)] ≤ .
n+1
Using D ≤ log(1 + χ2 ) – cf. (7.31), conclude the upper bound part of (VI.15). (Hint:
EN∼Bin(n,p) [ N+
1
1 ] = (n+1)p (1 − p̄
1 n+1
).
(c) Show that for the small alphabet regime of k ≲ n, the lower bound follows from that of
(VI.15) and Pinsker’s inequality (7.25).
i i
i i
i i
(d) Next assume k ≥ 4n. Consider a Dirichlet(α, . . . , α) prior in (13.15). Applying the formula
(VI.7) and (VI.8) for the Bayes χ2 risk and choosing α = n/k, show the lower bound in
(VI.16).
(e) Finally, we prove the lower bound in (VI.15). Consider the prior under which P is uniform
over a set S chosen uniformly at random from all s-subsets of [k] and s is some constant to
be specified. Applying (VI.6), show that for this prior the Bayes estimator for KL loss takes
a natural form:
(
1
i ∈ Ŝ
P̂j = 1s −ŝ/s
k−ŝ i∈/ Ŝ
(g) Using (7.28), show that D(PkP̂) ≥ Ω(log nk ). (Note that (7.28) is convex in TV so Jensen’s
inequality applies.)
Note: The following refinement of (VI.15) was known:
• For fixed k, a deep result of [49, 48] is that R∗KL (k, n) = k−12n+o(1)
, achieved by an add-c
estimator where c is a function of the empirical count chosen using polynomial approximation
arguments.
• When k n, R∗KL (k, n) = log nk (1 + o(1)), shown in [230] by a careful analysis of the
Dirichlet prior.
VI.10 (Nonparametric location model) In this exercise we consider some nonparametric extensions
i.i.d.
of the Gaussian location model and the Bernoulli model. Observing X1 , . . . , Xn ∼ P for some
P ∈ P , where P is a collection of distributions on the real line, our goal is to estimate the mean
R
of the distribution P: μ(P) ≜ xP(dx), which is a linear functional of P. Denote the minimax
quadratic risk by
(a) Let P be the class of distributions (which need not have a density) on the real line with
2
variance at most σ 2 . Show that R∗n = σn .
(b) Let P = P([0, 1]), the collection of all probability distributions on [0, 1]. Show that
R∗n = 4(1+1√n)2 . (Hint: For the upper bound, using the fact that, for any [0, 1]-valued ran-
dom variable Z, Var(Z) ≤ E[Z](1 − E[Z]), mimic the analysis of the estimator (VI.12) in
Ex. VI.7e.)
VI.11 Prove Theorem 30.4 using Fano’s method. (Hint: apply Theorem 31.3 with T = ϵ · Skd , where
Sdk denotes the Hamming sphere of radius k in d dimensions. Choose ϵ appropriately and apply
the Gilbert-Varshamov bound for the packing number of Sdk in Theorem 27.6.)
VI.12 (Sharp minimax rate in sparse denoising) Continuing Theorem 30.4, in this exercise we deter-
mine the sharp minimax risk for denoising a high-dimensional sparse vector. In the notation of
(30.13), we show that, for the d-dimensional GLM model X ∼ N (θ, Id ), the following minimax
i i
i i
i i
For the lower bound, consider the prior π under which θ is uniformly p distributed over
{τ e1 , . . . , τ ed }, where ei ’s denote the standard basis. Let τ = (2 − ϵ) log d. Show that
for any ϵ > 0, the Bayes risk is given by
(Hint: either apply the mutual information method, or directly compute the Bayes risk by
evaluating the conditional mean and conditional variance.)
(b) Demonstrate an estimator θ̂ that achieves the RHS of (VI.18) asymptotically. (Hint: consider
the hard-thresholding estimator (30.13) or the MLE (30.11).)
(c) To prove the lower bound part of (VI.17), prove the following generic result
d
R∗ (k, d) ≥ kR∗ 1,
k
and then apply (VI.18). (Hint: consider a prior of d/k blocks each of which is 1-sparse.)
(d) Similar to the 1-sparse case, demonstrate an estimator θ̂ that achieves the RHS of (VI.17)
asymptotically.
Note: For both the upper and lower bound, the normal tail bound in Exercise V.8 is helpful.
VI.13 Consider the following functional estimation problem in GLM. Observing X ∼ N(θ, Id ), we
intend to estimate the maximal coordinate of θ: T(θ) = θmax ≜ max{θ1 , . . . , θd }. Prove the
minimax rate:
(a) Prove the upper bound by considering T̂ = Xmax , the plug-in estimator with the MLE.
(b) For the lower bound, consider two hypotheses:
H0 : θ = 0, H1 : θ ∼ Unif {τ e1 , τ e2 , . . . , τ ed } .
where ei ’s are the standard bases and τ > 0. Then under H0 , X ∼ P0 = N(0, Id ); under H1 ,
Pd
X ∼ P1 = 1d i=1 N(τ ei , Id ). Show that
2
eτ − 1
χ2 (P1 kP0 ) = .
d
(c) Applying the joint range (7.29) (or (7.35)) to bound TV(P0 , P1 ), conclude the lower bound
part of (VI.19) via Le Cam’s method (Theorem 31.1).
i i
i i
i i
(d) By improving both the upper and lower bound prove the sharp version:
1
inf sup Eθ (T̂ − θmax )2 = + o(1) log d, d → ∞. (VI.20)
T̂ θ∈Rd 2
VI.14 (Suboptimality of MLE in high dimensions) Consider the d-dimensional GLM: X ∼ N (θ, Id ),
where θ belongs to the parameter space
n o
Θ = θ ∈ Rd : |θ1 | ≤ d1/4 , kθ\1 k2 ≤ 2(1 − d−1/4 |θ1 |)
with θ\1 ≡ (θ2 , . . . , θd ). For the square loss, prove the following for sufficiently large d.
(a) The minimax risk is bounded:
inf sup Eθ [kθ̂ − θk22 ] ≲ 1.
θ̂ θ∈Θ
is unbounded:
√
sup Eθ [kθ̂MLE − θk22 ] ≳ d.
θ∈Θ
i.i.d.
VI.15 (Covariance model) Let X1 , . . . , Xn ∼ N(0, Σ), where Σ is a d × d covariance matrix. Let us
show that the minimax quadratic risk of estimating Σ using X1 , . . . , Xn satisfies
d
inf sup E[kΣ̂ − Σk2F ] ∧ 1 r2 , ∀ r > 0, d, n ∈ N . (VI.21)
Σ̂ ∥Σ∥F ≤r n
P
where kΣ̂ − Σk2F = ij (Σ̂ij − Σij )2 .
(a) Show that unlike location model, without restricting to a compact parameter space for Σ,
the minimax risk in (VI.21) is infinite.
Pn
(b) Consider the sample covariance matrix Σ̂ = 1n i=1 Xi X⊤ i . Show that
1
E[kΣ̂ − Σk2F ] =kΣk2F + Tr(Σ)2
n
and use this to deduce the minimax upper bound in (VI.21).
(c) To prove the minimax lower bound, we can proceed in several steps. Show that for any
positive semidefinite (PSD) Σ0 , Σ1 0, the KL divergence satisfies
1 1/2 1/2
D(N(0, Id + Σ0 )kN(0, Id + Σ1 )) ≤ kΣ − Σ1 k2F , (VI.22)
2 0
where Id is the d × d identity matrix.
(d) Let B(δ) denote the Frobenius ball of radius δ centered at the zero matrix. Let PSD = {X :
X 0} denote the collection of d× PSD matrices. Show that
vol(B(δ) ∩ PSD)
= P [ Z 0] , (VI.23)
vol(B(δ))
i i
i i
i i
i.i.d.
where Z is a GOE matrix, that is, Z is symmetric with independent diagonals Zii ∼ N(0, 2)
i.i.d.
and off-diagonals Zij ∼ N(0, 1).
2
(e) Show that P [Z 0] ≥ cd for some absolute constant c.3
(f) Prove the following lower bound on the packing number on the set of PSD matrices:
d2 /2
c′ δ
M(B(δ) ∩ PSD, k · kF , ϵ) ≥ (VI.24)
ϵ
for some absolute constant c′ . (Hint: Use the volume bound and the result of Part (d) and
(e).) √
(g) Complete the proof of lower bound of (VI.21). (Hint: WLOG, we can consider r d and
2
aim for the lower bound Ω( dn ∧ d). Take the packing from (VI.24) and shift by the identity
matrix I. Then apply Fano’s method and use (VI.22).)
VI.16 For a family of probability distributions P and a functional T : P → R define its χ2 -modulus
of continuity as
When the functional T is affine, and continuous, and P is compact4 it can be shown [251] that
1
δ 2 (1/n)2 ≤ inf sup E i.i.d. (T(P) − T̂n (X1 , . . . , Xn ))2 ≤ δχ2 (1/n)2 . (VI.25)
7 χ T̂n P∈P Xi ∼ P
Consider the following problem (interval censored model): In i-th mouse a tumour develops at
i.i.d.
time Ai ∈ [0, 1] with Ai ∼ π where π is a pdf on [0, 1] bounded by 21 ≤ π ≤ 2. For each i the
i.i.d.
existence of tumour is checked at another random time Bi ∼ Unif(0, 1) with Bi ⊥ ⊥ Ai . Given
observations Xi = (1{Ai ≤ Bi }, Bi ) one is trying to estimate T(π ) = π [A ≤ 1/2]. Show that
VI.17 (Comparison between contraction coefficients.) Let X be a random variable with distribution
PX , and let PY|X be a Markov kernel. For an f-divergence, define
Df (PY|X ◦ QX kPY|X ◦ PX )
ηf (PY|X , PX ) ≜ sup .
QX :0<Df (QX ∥PX )<∞ Df (QX kPX )
Prove that
3
Getting the exact exponent is a difficult result (cf. [17]). Here we only need some crude estimate.
4
Both under the same, but otherwise arbitrary topology on P.
i i
i i
i i
VI.18 (χ2 -contraction for Markov chains.) In this exercise we prove (33.4). Let P = (P(x, y)) denote
the transition matrix of a time-reversible Markov chain with finite state space X and stationary
distribution π, so that π (x)P(x, y) = π (y)P(y, x) for all x, y ∈ X . It is known that the k = |X |
eigenvalues of P satisfy 1 = λ1 ≥ λ2 ≥ . . . ≥ λk ≥ −1. Define by γ∗ ≜ max{λ2 , |λk |} the
absolute spectral gap.
(a) Show that
VI.20 (BMS channel comparison [269]) Below X ∼ Ber(1/2) and PY|X is an input-symmetric chan-
nel (BMS). It turns out that BSC and BEC are extremal for various partial orders. Prove the
following statements.
(a) If ITV (X; Y) = 12 (1 − 2δ), then
i i
i i
i i
VI.21 (Broadcasting on Trees with BSC [149]) We have seen that Broadcasting on Trees with BSCδ
has non-reconstruction when b(1 − 2δ)2 < 1. In this exercise we prove the achievability bound
(known as the Kesten-Stigum bound [176]) using channel comparison.
We work with an infinite b-ary tree with BSCδ edge channels. Let ρ be the root and Lk be the
set of nodes at distance k to ρ. Let Mk denote the channel Xρ → XLk .
In the following, assume that b(1 − 2δ)2 > 1.
1
(a) Prove that there exists τ < 2 such that
Note: It is known that pc ≈ 0.645 (e.g. [167]). Using site percolation we can prove
non-reconstruction whenever
p
1 − 2δ + 4δ 3 − 2δ 4 − 2δ(1 − δ) δ(1 + δ)(1 − δ)(2 − δ) < p′c ,
where p′c ≈ 0.705 is the directed site percolation threshold. One can check that the bound from
bond percolation is stronger.
VI.23 (Input-dependent contraction coefficient for coloring channel [148]) Fix an integer q ≥ 3 and
let X = [q]. Consider the following coloring channel K : X → X :
(
0 y = x,
K(y|x) = 1
q−1 y 6= x.
i i
i i
i i
VI.24 ([148]) Fix an integer q ≥ 2 and let X = [q]. Let λ ∈ [− q−1 1 , 1] be a real number. Define the
Potts channel Pλ : X → X as
(
λ + 1−λ y = x,
P λ ( y| x) = 1−λ
q
q y 6= x.
Prove that
qλ 2
ηKL (Pλ ) = .
(q − 2)λ + 2
VI.25 (Spectral Independence) Say a probability distribution μ = μXn supported on [q]n is c-pairwise
independent if for every T ⊂ [n], σT ∈ [q]T the conditional measure μ(σT ) ≜ μXTc |XT =σT and
every νXcT satisfies
X (σ ) c X (σ )
D(νXi,j || μXi,Tj ) ≥ (2 − ) D(νXi || μXi T ) .
n − | T | − 1
i̸=j∈T
c i∈ T c
where ECτ is the erasure channel, cf. Example 33.7. (Hint: Define f(τ ) = D(ECτ ◦ ν||ECτ ◦ μ)
and prove f′′ (τ ) ≥ τc f′ (τ ).)
Remark: Applying the above with τ = 1n shows that a Markov chain known as (small-block)
Glauber dynamics for μ is mixing in O(nc+1 log n) time. It is known that c-pairwise indepen-
dence is implied (under some additional conditions on μ and q = 2) by the uniform boundedness
of the operator norms of the covariance matrices of all μ(σT ) (see [64] for details).
i i
i i
i i
References
i i
i i
i i
References 585
[18] S. Artstein, V. Milman, and S. J. Szarek, "Duality of metric entropy," Annals of Mathematics, pp. 1313–1328, 2004.
[19] R. B. Ash, Information Theory. New York, NY: Dover Publications Inc., 1965.
[20] A. V. Banerjee, "A simple model of herd behavior," The Quarterly Journal of Economics, vol. 107, no. 3, pp. 797–817, 1992.
[21] A. Barg and G. D. Forney, "Random codes: Minimum distances and error exponents," IEEE Transactions on Information Theory, vol. 48, no. 9, pp. 2568–2573, 2002.
[22] A. Barg and A. McGregor, "Distance distribution of binary codes and the error probability of decoding," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4237–4246, 2005.
[23] S. Barman and O. Fawzi, "Algorithmic aspects of optimal channel coding," IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 1038–1045, 2017.
[24] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–945, 1993.
[25] G. Basharin, "On a statistical estimate for the entropy of a sequence of independent random variables," Theory of Probability & Its Applications, vol. 4, no. 3, pp. 333–336, 1959.
[26] A. Beirami and F. Fekri, "Fundamental limits of universal lossless one-to-one compression of parametric sources," in Information Theory Workshop (ITW), 2014 IEEE. IEEE, 2014, pp. 212–216.
[27] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal, "Entanglement-assisted classical capacity of noisy quantum channels," Physical Review Letters, vol. 83, no. 15, p. 3081, 1999.
[28] W. R. Bennett, "Spectra of quantized signals," The Bell System Technical Journal, vol. 27, no. 3, pp. 446–472, 1948.
[29] J. M. Bernardo, "Reference posterior distributions for Bayesian inference," Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 113–128, 1979.
[30] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1," in Proceedings of ICC'93-IEEE International Conference on Communications, vol. 2. IEEE, 1993, pp. 1064–1070.
[31] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex analysis and optimization. Belmont, MA, USA: Athena Scientific, 2003.
[32] N. Bhatnagar, J. Vera, E. Vigoda, and D. Weitz, "Reconstruction for colorings on trees," SIAM Journal on Discrete Mathematics, vol. 25, no. 2, pp. 809–826, 2011. [Online]. Available: https://doi.org/10.1137/090755783
[33] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99–109, 1943.
[34] L. Birgé, "Approximation dans les espaces métriques et théorie de l'estimation," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 65, no. 2, pp. 181–237, 1983.
[35] ——, "Robust tests for model selection," From probability to statistics and back: high-dimensional models and processes–A Festschrift in honor of Jon A. Wellner, IMS Collections, Volume 9, pp. 47–64, 2013.
[36] M. Š. Birman and M. Solomjak, "Piecewise-polynomial approximations of functions of the classes," Mathematics of the USSR-Sbornik, vol. 2, no. 3, p. 295, 1967.
[37] D. Blackwell, L. Breiman, and A. Thomasian, "The capacity of a class of channels," The Annals of Mathematical Statistics, pp. 1229–1241, 1959.
[38] R. E. Blahut, "Hypothesis testing and information theory," IEEE Trans. Inf. Theory, vol. 20, no. 4, pp. 405–417, 1974.
[39] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov, "On the purity of the limiting Gibbs state for the Ising model on the Bethe lattice," Journal of Statistical Physics, vol. 79, no. 1, pp. 473–482, Apr 1995. [Online]. Available: https://doi.org/10.1007/BF02179399
[40] S. G. Bobkov and F. Götze, "Exponential integrability and transportation cost related to logarithmic Sobolev inequalities," Journal of Functional Analysis, vol. 163, no. 1, pp. 1–28, 1999.
[41] S. Bobkov and G. P. Chistyakov, "Entropy power inequality for the Rényi entropy," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 708–714, 2015.
[42] T. Bohman, "A limit theorem for the Shannon capacities of odd cycles I," Proceedings of the American Mathematical Society, vol. 131, no. 11, pp. 3559–3569, 2003.
[43] H. F. Bohnenblust, "Convex regions and projections in Minkowski spaces," Ann. Math., vol. 39, no. 2, pp. 301–308, 1938.
[44] A. Borovkov, Mathematical Statistics. CRC Press, 1999.
[45] S. Boucheron, G. Lugosi, and O. Bousquet, "Concentration inequalities," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds. Springer, 2004, pp. 208–240.
[46] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[47] O. Bousquet, D. Kane, and S. Moran, "The optimal approximation factor in density estimation," in Conference on Learning Theory. PMLR, 2019, pp. 318–341.
[48] D. Braess and T. Sauer, "Bernstein polynomials and learning theory," Journal of Approximation Theory, vol. 128, no. 2, pp. 187–206, 2004.
[49] D. Braess, J. Forster, T. Sauer, and H. U. Simon, "How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution," in Algorithmic Learning Theory. Springer, 2002, pp. 380–394.
[50] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, "Communication lower bounds for statistical estimation problems via a distributed data processing inequality," in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. ACM, 2016, pp. 1011–1020.
[51] L. M. Bregman, "Some properties of nonnegative matrices and their permanents," Soviet Math. Dokl., vol. 14, no. 4, pp. 945–949, 1973.
[52] L. Breiman, "The individual ergodic theorem of information theory," Ann. Math. Stat., vol. 28, no. 3, pp. 809–811, 1957.
[53] L. Brillouin, Science and information theory, 2nd Ed. Academic Press, 1962.
[54] L. D. Brown, "Fundamentals of statistical exponential families with applications in statistical decision theory," in Lecture Notes-Monograph Series, S. S. Gupta, Ed. Hayward, CA: Institute of Mathematical Statistics, 1986, vol. 9.
[55] P. Bühlmann and S. van de Geer, Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[56] G. Calinescu, C. Chekuri, M. Pal, and J. Vondrák, "Maximizing a monotone submodular function subject to a matroid constraint," SIAM Journal on Computing, vol. 40, no. 6, pp. 1740–1766, 2011.
[57] M. X. Cao and M. Tomamichel, "On the quadratic decaying property of the information rate function," arXiv preprint arXiv:2208.12945, 2022.
[58] O. Catoni, "PAC-Bayesian supervised classification: the thermodynamics of statistical learning," Lecture Notes-Monograph Series. IMS, vol. 1277, 2007.
[59] E. Çinlar, Probability and Stochastics. New York: Springer, 2011.
[60] N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games. Cambridge University Press, 2006.
[61] D. G. Chapman and H. Robbins, "Minimum variance estimation without regularity
[86] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," Advances in neural information processing systems, vol. 26, pp. 2292–2300, 2013.
[87] A. Dembo and O. Zeitouni, Large deviations techniques and applications. New York: Springer Verlag, 2009.
[88] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[89] P. Diaconis and L. Saloff-Coste, "Logarithmic Sobolev inequalities for finite Markov chains," Ann. Probab., vol. 6, no. 3, pp. 695–750, 1996.
[90] P. Diaconis and D. Freedman, "Finite exchangeable sequences," The Annals of Probability, vol. 8, no. 4, pp. 745–764, 1980.
[91] P. Diaconis and D. Stroock, "Geometric bounds for eigenvalues of Markov chains," The Annals of Applied Probability, vol. 1, no. 1, pp. 36–61, 1991.
[92] H. Djellout, A. Guillin, and L. Wu, "Transportation cost-information inequalities and applications to random dynamical systems and diffusions," The Annals of Probability, vol. 32, no. 3B, pp. 2702–2732, 2004.
[93] R. Dobrushin, "Central limit theorem for nonstationary Markov chains, I," Theory Probab. Appl., vol. 1, no. 1, pp. 65–80, 1956.
[94] R. Dobrushin and B. Tsybakov, "Information transmission with additional noise," IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[95] R. L. Dobrushin, "A general formulation of the fundamental theorem of Shannon in the theory of information," Uspekhi Mat. Nauk, vol. 14, no. 6, pp. 3–104, 1959, English translation in Eleven Papers in Analysis: Nine Papers on Differential Equations, Two on Information Theory, American Mathematical Society Translations: Series 2, Volume 33, 1963.
[96] ——, "Mathematical problems in the Shannon theory of optimal coding of information," in Proc. 4th Berkeley Symp. Mathematics, Statistics, and Probability, vol. 1, Berkeley, CA, USA, 1961, pp. 211–252.
[97] ——, "Asymptotic bounds on error probability for transmission over DMC with symmetric transition probabilities," Theor. Probability Appl., vol. 7, pp. 283–311, 1962.
[98] R. Dobrushin, "A simplified method of experimentally evaluating the entropy of a stationary sequence," Theory of Probability & Its Applications, vol. 3, no. 4, pp. 428–430, 1958.
[99] D. L. Donoho, "Wald lecture I: Counting bits with Kolmogorov and Shannon," Note for the Wald Lectures, IMS Annual Meeting, July 1997.
[100] M. D. Donsker and S. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time. IV," Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983.
[101] J. L. Doob, Stochastic Processes. New York: Wiley, 1953.
[102] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and Y. Zhang, "Optimality guarantees for distributed statistical estimation," arXiv preprint arXiv:1405.0782, 2014.
[103] J. Duda, "Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding," arXiv preprint arXiv:1311.2540, 2013.
[104] R. M. Dudley, Uniform central limit theorems. Cambridge University Press, 1999, no. 63.
[105] G. Dueck, "The strong converse to the coding theorem for the multiple–access channel," J. Comb. Inform. Syst. Sci, vol. 6, no. 3, pp. 187–196, 1981.
[106] G. Dueck and J. Korner, "Reliability function of a discrete memoryless channel at rates above capacity (corresp.)," IEEE Transactions on Information Theory, vol. 25, no. 1, pp. 82–85, 1979.
[107] N. Dunford and J. T. Schwartz, Linear operators, part 1: general theory. John Wiley & Sons, 1988, vol. 10.
[108] R. Durrett, Probability: Theory and Examples, 4th ed. Cambridge University Press, 2010.
[109] A. Dytso, S. Yagli, H. V. Poor, and S. S. Shitz, "The capacity achieving distribution for the amplitude constrained additive Gaussian channel: An upper bound on the number of mass points," IEEE Transactions on Information Theory, vol. 66, no. 4, pp. 2006–2022, 2019.
[110] H. G. Eggleston, Convexity, ser. Tracts in Math and Math. Phys. Cambridge University Press, 1958, vol. 47.
[111] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge University Press, 2011.
[112] P. Elias, "The efficient construction of an unbiased random sequence," Annals of Mathematical Statistics, vol. 43, no. 3, pp. 865–870, 1972.
[113] ——, "Coding for noisy channels," IRE Convention Record, vol. 3, pp. 37–46, 1955.
[114] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003.
[115] K. Eswaran and M. Gastpar, "Remote source coding under Gaussian noise: Dueling roles of power and entropy power," IEEE Transactions on Information Theory, 2019.
[116] W. Evans and N. Pippenger, "On the maximum tolerable noise for reliable computation by formulas," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1299–1305, May 1998.
[117] W. S. Evans and L. J. Schulman, "Signal propagation and noisy circuits," IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2367–2373, Nov 1999.
[118] M. Feder, "Gambling using a finite state machine," IEEE Transactions on Information Theory, vol. 37, no. 5, pp. 1459–1465, 1991.
[119] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Trans. Inf. Theory, vol. 38, no. 4, pp. 1258–1270, 1992.
[120] M. Feder and Y. Polyanskiy, "Sequential prediction under log-loss and misspecification," arXiv preprint arXiv:2102.00050, 2021.
[121] A. A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, Jun. 2003.
[122] W. Feller, An Introduction to Probability Theory and Its Applications, 3rd ed. New York: Wiley, 1970, vol. I.
[123] ——, An Introduction to Probability Theory and Its Applications, 2nd ed. New York: Wiley, 1971, vol. II.
[124] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach. New York, NY: Academic Press, 1967.
[125] ——, "An inconsistent maximum likelihood estimate," Journal of the American Statistical Association, vol. 77, no. 380, pp. 831–834, 1982.
[126] ——, A course in large sample theory. CRC Press, 1996.
[127] R. A. Fisher, "The logic of inductive inference," Journal of the Royal Statistical Society, vol. 98, no. 1, pp. 39–82, 1935.
[128] B. M. Fitingof, "The compression of discrete information," Problemy Peredachi Informatsii, vol. 3, no. 3, pp. 28–36, 1967.
[129] P. Fleisher, "Sufficient conditions for achieving minimum distortion in a quantizer," IEEE Int. Conv. Rec., pp. 104–111, 1964.
[130] G. D. Forney, "Concatenated codes," MIT RLE Technical Rep., vol. 440, 1965.
[131] E. Friedgut and J. Kahn, "On the number of copies of one hypergraph in another," Israel J. Math., vol. 105, pp. 251–256, 1998. [Online]. Available: http://dx.doi.org/10.1007/BF02780332
[132] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inf. Theory, vol. 11, no. 1, pp. 3–18, 1965.
[133] ——, Information Theory and Reliable Communication. New York: Wiley, 1968.
[134] R. Gallager, "The random coding bound is tight for the average code (corresp.)," IEEE Transactions on Information Theory, vol. 19, no. 2, pp. 244–246, 1973.
[135] R. Gardner, "The Brunn-Minkowski inequality," Bulletin of the American Mathematical Society, vol. 39, no. 3, pp. 355–405, 2002.
[136] I. M. Gel'fand, A. N. Kolmogorov, and A. M. Yaglom, "On the general definition of the amount of information," Dokl. Akad. Nauk. SSSR, vol. 11, pp. 745–748, 1956.
[137] G. L. Gilardoni, "On a Gel'fand-Yaglom-Peres theorem for f-divergences," arXiv preprint arXiv:0911.1934, 2009.
[138] ——, "On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5377–5386, 2010.
[139] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: a Bayesian Cramér-Rao bound," Bernoulli, vol. 1, no. 1–2, pp. 59–79, 1995.
[140] C. Giraud, Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2014.
[141] O. Goldreich, Introduction to property testing. Cambridge University Press, 2017.
[142] V. Goodman, "Characteristics of normal samples," The Annals of Probability, vol. 16, no. 3, pp. 1281–1290, 1988.
[143] V. D. Goppa, "Codes and information," Russian Mathematical Surveys, vol. 39, no. 1, p. 87, 1984.
[144] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2325–2383, 1998.
[145] R. M. Gray, Entropy and Information Theory. New York, NY: Springer-Verlag, 1990.
[146] U. Grenander and G. Szegö, Toeplitz forms and their applications, 2nd ed. New York: Chelsea Publishing Company, 1984.
[147] L. Gross, "Logarithmic Sobolev inequalities," American Journal of Mathematics, vol. 97, no. 4, pp. 1061–1083, 1975.
[148] Y. Gu and Y. Polyanskiy, "Non-linear log-Sobolev inequalities for the Potts semigroup and applications to reconstruction problems," arXiv preprint arXiv:2005.05444, 2020.
[149] Y. Gu, H. Roozbehani, and Y. Polyanskiy, "Broadcasting on trees near criticality," in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 1504–1509.
[150] U. Hadar, J. Liu, Y. Polyanskiy, and O. Shayevitz, "Communication complexity of estimating correlations," in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. ACM, 2019, pp. 792–803.
[151] B. Hajek, Y. Wu, and J. Xu, "Information limits for recovering a hidden community," IEEE Trans. on Information Theory, vol. 63, no. 8, pp. 4729–4745, 2017.
[152] J. Hájek, "Local asymptotic minimax and admissibility in estimation," in Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, vol. 1, 1972, pp. 175–194.
[153] J. M. Hammersley, "On estimating restricted parameters," Journal of the Royal Statistical Society, Series B (Methodological), vol. 12, no. 2, pp. 192–240, 1950.
[154] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 752–772, 1993.
[155] P. Harremoës and I. Vajda, "On pairs of f-divergences and their joint range," IEEE Trans. Inf. Theory, vol. 57, no. 6, pp. 3230–3235, Jun. 2011.
[156] B. Harris, "The statistical estimation of entropy in the non-parametric case," in Topics in Information Theory, I. Csiszár and P. Elias, Eds. Springer Netherlands, 1975, vol. 16, pp. 323–355.
[157] D. Haussler and M. Opper, "Mutual information, metric entropy and cumulative relative entropy risk," The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
[158] M. Hayashi, "General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1562–1575, 2006.
[159] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," The Annals of Mathematical Statistics, pp. 369–401, 1965.
[160] P. J. Huber, "Fisher information and spline interpolation," Annals of Statistics, pp. 1029–1033, 1974.
[161] ——, "A robust version of the probability ratio test," The Annals of Mathematical Statistics, pp. 1753–1758, 1965.
[162] I. A. Ibragimov and R. Z. Khas'minskii, Statistical Estimation: Asymptotic Theory. Springer, 1981.
[163] S. Ihara, "On the capacity of channels with additive non-Gaussian noise," Information and Control, vol. 37, no. 1, pp. 34–39, 1978.
[164] ——, Information theory for continuous systems. World Scientific, 1993, vol. 2.
[165] Y. I. Ingster and I. A. Suslina, Nonparametric goodness-of-fit testing under Gaussian models. New York, NY: Springer, 2003.
[166] Y. I. Ingster, "Minimax testing of nonparametric hypotheses on a distribution density in the lp metrics," Theory of Probability & Its Applications, vol. 31, no. 2, pp. 333–337, 1987.
[167] I. Jensen and A. J. Guttmann, "Series expansions of the percolation probability for directed square and honeycomb lattices," Journal of Physics A: Mathematical and General, vol. 28, no. 17, p. 4813, 1995.
[168] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[169] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan, "Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences," in Advances in neural information processing systems, 2016, pp. 4116–4124.
[170] W. B. Johnson, G. Schechtman, and J. Zinn, "Best constants in moment inequalities for linear combinations of independent and exchangeable random variables," The Annals of Probability, pp. 234–253, 1985.
[171] I. Johnstone, Gaussian estimation: Sequence and wavelet models, 2011, available at http://www-stat.stanford.edu/~imj/.
[172] L. K. Jones, "A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training," The Annals of Statistics, pp. 608–613, 1992.
[173] A. B. Juditsky and A. S. Nemirovski, "Nonparametric estimation by convex programming," The Annals of Statistics, vol. 37, no. 5A, pp. 2278–2300, 2009.
[174] T. Kawabata and A. Dembo, "The rate-distortion dimension of sets and measures," IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
[175] M. Keane and G. O'Brien, "A Bernoulli factory," ACM Transactions on Modeling and Computer Simulation, vol. 4, no. 2, pp. 213–219, 1994.
[176] H. Kesten and B. P. Stigum, "Additional limit theorems for indecomposable multidimensional Galton-Watson processes," The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1463–1481, 1966.
[177] T. Koch, "The Shannon lower bound is asymptotically tight," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6155–6161, 2016.
[178] Y. Kochman, O. Ordentlich, and Y. Polyanskiy, "A lower bound on the expected distortion of joint source-channel coding," IEEE Transactions on Information Theory, vol. 66, no. 8, pp. 4722–4741, 2020.
[179] A. Kolchinsky and B. D. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[180] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces," Uspekhi Matematicheskikh Nauk, vol. 14, no. 2, pp. 3–86, 1959, reprinted in Shiryayev, A. N., ed., Selected Works of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, Vol. 27, Springer Netherlands, 1993, pp. 86–170.
[181] I. Kontoyiannis and S. Verdú, "Optimal lossless data compression: Non-asymptotics and asymptotics," IEEE Trans. Inf. Theory, vol. 60, no. 2, pp. 777–795, 2014.
[182] J. Körner and A. Orlitsky, "Zero-error information theory," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2207–2229, 1998.
[183] V. Koshelev, "Quantization with minimal entropy," Probl. Pered. Inform, vol. 14, pp. 151–156, 1963.
[184] O. Kosut and L. Sankar, "Asymptotics and non-asymptotics for universal fixed-to-variable source coding," arXiv preprint arXiv:1412.4444, 2014.
[185] A. Krause and D. Golovin, "Submodular function maximization," Tractability, vol. 3, pp. 71–104, 2014.
[186] J. Kuelbs, "A strong convergence theorem for Banach space valued random variables," The Annals of Probability, vol. 4, no. 5, pp. 744–771, 1976.
[187] J. Kuelbs and W. V. Li, "Metric entropy and the small ball problem for Gaussian measures," Journal of Functional Analysis, vol. 116, no. 1, pp. 133–157, 1993.
[188] S. Kullback, Information theory and statistics. Mineola, NY: Dover Publications, 1968.
[189] C. Külske and M. Formentin, "A symmetric entropy bound on the non-reconstruction regime of Markov chains on Galton-Watson trees," Electronic Communications in Probability, vol. 14, pp. 587–596, 2009.
[190] A. Lapidoth, A foundation in digital communication. Cambridge University Press, 2017.
[191] A. Lapidoth and S. M. Moser, "Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels," IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2426–2467, 2003.
[192] L. Le Cam, "Convergence of estimates under dimensionality restrictions," Annals of Statistics, vol. 1, no. 1, pp. 38–53, 1973.
[193] ——, Asymptotic methods in statistical decision theory. New York, NY: Springer-Verlag, 1986.
[194] C. C. Leang and D. H. Johnson, "On the asymptotics of m-hypothesis Bayesian detection," IEEE Transactions on Information Theory, vol. 43, no. 1, pp. 280–282, 1997.
[195] K. Lee, Y. Wu, and Y. Bresler, "Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1666–1698, Mar. 2018.
[196] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York, NY: Springer, 1998.
[197] E. Lehmann and J. Romano, Testing Statistical Hypotheses, 3rd ed. Springer, 2005.
[198] W. V. Li and W. Linde, "Approximation, metric entropy and small ball estimates for Gaussian measures," The Annals of Probability, vol. 27, no. 3, pp. 1556–1578, 1999.
[199] W. V. Li and Q.-M. Shao, "Gaussian processes: inequalities, small ball probabilities and applications," Handbook of Statistics, vol. 19, pp. 533–597, 2001.
[200] T. Linder and R. Zamir, "On the asymptotic tightness of the Shannon lower bound,"
maximizing submodular set functions–I," Mathematical programming, vol. 14, no. 1, pp. 265–294, 1978.
[224] J. Neveu, Mathematical foundations of the calculus of probability. Holden-Day, 1965.
[225] M. E. Newman, "Power laws, Pareto distributions and Zipf's law," Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
[226] M. Okamoto, "Some inequalities relating to the partial sum of binomial probabilities," Annals of the Institute of Statistical Mathematics, vol. 10, no. 1, pp. 29–35, 1959.
[227] B. Oliver, J. Pierce, and C. Shannon, "The philosophy of PCM," Proceedings of the IRE, vol. 36, no. 11, pp. 1324–1331, 1948.
[228] Y. Oohama, "On two strong converse theorems for discrete memoryless channels," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 98, no. 12, pp. 2471–2475, 2015.
[229] O. Ordentlich and Y. Polyanskiy, "Strong data processing constant is achieved by binary inputs," IEEE Trans. Inf. Theory, vol. 68, no. 3, pp. 1480–1481, Mar. 2022.
[230] L. Paninski, "Variational minimax estimation of discrete distributions under KL loss," Advances in Neural Information Processing Systems, vol. 17, 2004.
[231] P. Panter and W. Dite, "Quantization distortion in pulse-count modulation with nonuniform spacing of levels," Proceedings of the IRE, vol. 39, no. 1, pp. 44–48, 1951.
[232] M. Pardo and I. Vajda, "About distances of discrete distributions satisfying the data processing theorem of information theory," IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1288–1293, 1997.
[233] Y. Peres, "Iterating von Neumann's procedure for extracting random bits," Annals of Statistics, vol. 20, no. 1, pp. 590–597, 1992.
[234] M. S. Pinsker, "Optimal filtering of square-integrable signals in Gaussian noise," Problemy Peredachi Informatsii, vol. 16, no. 2, pp. 52–68, 1980.
[235] G. Pisier, The volume of convex bodies and Banach space geometry. Cambridge University Press, 1999.
[236] J. Pitman, "Probabilistic bounds on the coefficients of polynomials with only real zeros," Journal of Combinatorial Theory, Series A, vol. 77, no. 2, pp. 279–303, 1997.
[237] E. Plotnik, M. J. Weinberger, and J. Ziv, "Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm," IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 66–72, 1992.
[238] Y. Polyanskiy, "Channel coding: non-asymptotic fundamental limits," Ph.D. dissertation, Princeton Univ., Princeton, NJ, USA, 2010.
[239] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[240] ——, "Dispersion of the Gilbert-Elliott channel," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 1829–1848, Apr. 2011.
[241] ——, "Feedback in the non-asymptotic regime," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 4903–4925, Apr. 2011.
[242] ——, "Minimum energy to send k bits with and without feedback," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4880–4902, Aug. 2011.
[243] Y. Polyanskiy and S. Verdú, "Arimoto channel coding converse and Rényi divergence," in Proceedings of the Forty-eighth Annual Allerton Conference on Communication, Control, and Computing, 2010, pp. 1327–1333.
[244] Y. Polyanskiy and S. Verdú, "Arimoto channel coding converse and Rényi divergence," in Proc. 2010 48th Allerton Conference, Allerton Retreat Center, Monticello, IL, USA, Sep. 2010.
[245] Y. Polyanskiy and S. Verdu, "Binary hypothesis testing with feedback," in Information Theory and Applications Workshop (ITA), 2011.
[246] Y. Polyanskiy and S. Verdú, "Empirical distribution of good channel codes with non-vanishing error probability," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 5–21, Jan. 2014.
[247] Y. Polyanskiy and Y. Wu, "Peak-to-average power ratio of good codes for Gaussian channel," IEEE Trans. Inf. Theory, vol. 60, no. 12, pp. 7655–7660, Dec. 2014.
[248] Y. Polyanskiy, "Saddle point in the minimax converse for channel coding," IEEE Transactions on Information Theory, vol. 59, no. 5, pp. 2576–2595, 2012.
[249] ——, "On dispersion of compound DMCs," in 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2013, pp. 26–32.
[250] Y. Polyanskiy and Y. Wu, "Strong data-processing inequalities for channels and Bayesian networks," in Convexity and Concentration, The IMA Volumes in Mathematics and its Applications, vol. 161, E. Carlen, M. Madiman, and E. M. Werner, Eds. New York, NY: Springer, 2017, pp. 211–249.
[251] ——, "Dualizing Le Cam's method for functional estimation, with applications to estimating the unseens," arXiv preprint arXiv:1902.05616, 2019.
[252] ——, "Application of the information-percolation method to reconstruction problems on graphs," Mathematical Statistics and Learning, vol. 2, no. 1, pp. 1–24, 2020.
[253] ——, "Self-regularizing property of nonparametric maximum likelihood estimator in mixture models," arXiv preprint arXiv:2008.08244, 2020.
[254] E. C. Posner and E. R. Rodemich, "Epsilon entropy and data compression," Annals of Mathematical Statistics, vol. 42, no. 6, pp. 2079–2125, Dec. 1971.
[255] A. Prékopa, "Logarithmic concave measures with application to stochastic programming," Acta Scientiarum Mathematicarum, vol. 32, pp. 301–316, 1971.
[256] J. Radhakrishnan, "An entropy proof of Bregman's theorem," J. Combin. Theory Ser. A, vol. 77, no. 1, pp. 161–164, 1997.
[257] M. Raginsky, "Strong data processing inequalities and ϕ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[258] M. Raginsky and I. Sason, "Concentration of measure inequalities in information theory, communications, and coding," Foundations and Trends® in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013.
[259] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bull. Calc. Math. Soc., vol. 37, pp. 81–91, 1945.
[260] A. H. Reeves, "The past present and future of PCM," IEEE Spectrum, vol. 2, no. 5, pp. 58–62, 1965.
[261] A. Rényi, "On the dimension and entropy of probability distributions," Acta Mathematica Hungarica, vol. 10, no. 1–2, Mar. 1959.
[262] R. B. Reznikova Zh., "Analysis of the language of ants by information-theoretical methods," Problemi Peredachi Informatsii, vol. 22, no. 3, pp. 103–108, 1986, English translation: http://reznikova.net/R-R-entropy-09.pdf.
[263] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.
[264] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge University Press, 2008.
[265] P. Rigollet and J.-C. Hütter, "High dimensional statistics," Lecture Notes for 18.657, MIT, 2017, https://math.mit.edu/~rigollet/PDFs/RigNotes17.pdf.
[266] Y. Rinott, "On convexity of measures," Annals of Probability, vol. 4, no. 6, pp. 1020–1026, 1976.
[267] J. J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[268] C. Rogers, Packing and Covering, ser. Cambridge tracts in mathematics and mathematical physics. Cambridge University Press, 1964.
[269] H. Roozbehani and Y. Polyanskiy, "Low density majority codes and the problem of graceful degradation," arXiv preprint arXiv:1911.12263, 2019.
[270] H. P. Rosenthal, "On the subspaces of lp (p > 2) spanned by sequences of independent random variables," Israel Journal of Mathematics, vol. 8, no. 3, pp. 273–303, 1970.
[271] D. Russo and J. Zou, "Controlling bias in adaptive data analysis using information theory," in Artificial Intelligence and Statistics. PMLR, 2016, pp. 1232–1240.
[272] I. Sason and S. Verdu, "f-divergence inequalities," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
[273] G. Schechtman, "Extremal configurations for moments of sums of independent positive random variables," in Banach Spaces and their Applications in Analysis. De Gruyter, 2011, pp. 183–192.
[274] M. J. Schervish, Theory of statistics. Springer-Verlag New York, 1995.
[275] A. Schrijver, Theory of linear and integer programming. John Wiley & Sons, 1998.
[276] C. E. Shannon, "A symbolic analysis of relay and switching circuits," Electrical Engineering, vol. 57, no. 12, pp. 713–723, Dec 1938.
[277] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, Jul./Oct. 1948.
[278] C. E. Shannon, R. G. Gallager, and E. R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels I," Inf. Contr., vol. 10, pp. 65–103, 1967.
[279] C. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956.
[280] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Nat. Conv. Rec, vol. 4, no. 142-163, p. 1, 1959.
[281] O. Shayevitz, "On Rényi measures and hypothesis testing," in 2011 IEEE International Symposium on Information Theory Proceedings. IEEE, 2011, pp. 894–898.
[282] O. Shayevitz and M. Feder, "Optimal feedback communication via posterior matching," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1186–1222, 2011.
[283] G. Simons and M. Woodroofe, "The Cramér-Rao inequality holds almost everywhere," in Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday. Academic, New York, 1983, pp. 69–93.
[284] R. Sinkhorn, "A relationship between arbitrary positive matrices and doubly stochastic matrices," Ann. Math. Stat., vol. 35, no. 2, pp. 876–879, 1964.
[285] M. Sion, "On general minimax theorems," Pacific J. Math, vol. 8, no. 1, pp. 171–176, 1958.
[286] M.-K. Siu, "Which Latin squares are Cayley tables?" Amer. Math. Monthly, vol. 98, no. 7, pp. 625–627, Aug. 1991.
[287] D. Slepian and H. O. Pollak, "Prolate spheroidal wave functions, Fourier analysis and uncertainty—I," Bell System Technical Journal, vol. 40, no. 1, pp. 43–63, 1961.
[288] A. Sly, "Reconstruction of random colourings," Communications in Mathematical Physics, vol. 288, no. 3, pp. 943–961, Jun 2009. [Online]. Available: https://doi.org/10.1007/s00220-009-0783-7
[289] B. Smith, "Instantaneous companding of quantized signals," Bell System Technical Journal, vol. 36, no. 3, pp. 653–709, 1957.
[290] J. G. Smith, "The information capacity of amplitude and variance-constrained scalar Gaussian channels," Information and Control, vol. 18, pp. 203–219, 1971.
[291] Spectre, "SPECTRE: Short packet communication toolbox," https://github.com/yp-mit/spectre, 2015, GitHub repository.
[292] R. Speer, J. Chin, A. Lin, S. Jewett, and L. Nathan, "Luminosoinsight/wordfreq: v2.2," Oct. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1443582
[293] A. J. Stam, "Distance between sampling with and without replacement," Statistica Neerlandica, vol. 32, no. 2, pp. 81–91, 1978.
[294] M. Steiner, "The strong simplex conjecture is false," IEEE Transactions on Information Theory, vol. 40, no. 3, pp. 721–731, 1994.
[295] V. Strassen, "Asymptotische Abschätzungen in Shannon's Informationstheorie," in Trans. 3d Prague Conf. Inf. Theory, Prague, 1962, pp. 689–723.
[296] ——, "The existence of probability measures with given marginals," Annals of Mathematical Statistics, vol. 36, no. 2, pp. 423–439, 1965.
[297] H. Strasser, Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Berlin, Germany: Walter de Gruyter, 1985.
[298] S. Szarek, "Nets of Grassmann manifold and orthogonal groups," in Proceedings of Banach Space Workshop. University of Iowa Press, 1982, pp. 169–185.
[299] ——, "Metric entropy of homogeneous spaces," Banach Center Publications, vol. 43, no. 1, pp. 395–410, 1998.
[300] W. Szpankowski and S. Verdú, "Minimum expected length of fixed-to-variable lossless compression without prefix constraints," IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4017–4025, 2011.
[301] I. Tal and A. Vardy, "List decoding of polar codes," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2213–2226, 2015.
[302] M. Talagrand, Upper and lower bounds for stochastic processes. Springer, 2014.
[303] W. Tang and F. Tang, "The Poisson binomial distribution–old & new," arXiv preprint arXiv:1908.10024, 2019.
[304] G. Taricco and M. Elia, "Capacity of fading channel with no side information," Electronics Letters, vol. 33, no. 16, pp. 1368–1370, 1997.
[305] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, "Space-time block codes from orthogonal designs," IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1456–1467, 1999.
[306] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: Performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, 1998.
[307] H. Te Sun, Information-spectrum methods in information theory. Springer Science & Business Media, 2003.
[308] E. Telatar, "Capacity of multi-antenna Gaussian channels," European trans. telecom., vol. 10, no. 6, pp. 585–595, 1999.
[309] ——, "Wringing lemmas and multiple descriptions," 2016, unpublished draft.
[310] V. N. Temlyakov, "On estimates of ϵ-entropy and widths of classes of functions with a bounded mixed derivative or difference," Doklady Akademii Nauk, vol. 301, no. 2, pp. 288–291, 1988.
[311] F. Topsøe, "Some inequalities for information divergence and related measures of discrimination," IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1602–1609, 2000.
[312] D. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge University Press, 2005. [Online]. Available: http://www.eecs.berkeley.edu/~dtse/book.html
[313] A. B. Tsybakov, Introduction to Nonparametric Estimation. New York, NY: Springer Verlag, 2009.
[314] B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Institute of Technology, 1967.