Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Information Theory From-Coding To - Learning, Yury Polyanskiy, Yihong Wu, Cambridge University Press, 2022
Book Heading
This textbook introduces the subject of information theory at a level suitable for advanced
undergraduate and graduate students. It develops both the classical Shannon theory and recent
applications in statistical learning. There are five parts covering foundations of information mea-
sures; (lossless) data compression; binary hypothesis testing and large deviations theory; channel
coding and channel capacity; lossy data compression; and, finally, statistical applications. There
are over 150 exercises included to help the reader learn about and bring attention to recent
discoveries in the literature.
Yihong Wu is a Professor in the Department of Statistics and Data Science at Yale University.
He obtained his B.E. degree from Tsinghua University in 2006 and Ph.D. degree from Princeton
University in 2011. He is a recipient of the NSF CAREER award in 2017 and the Sloan Research
Fellowship in Mathematics in 2018. He is broadly interested in the theoretical and algorithmic
aspects of high-dimensional statistics, information theory, and optimization.
i i
i i
i i
i i
i i
i i
Information Theory
From Coding to Learning
FIRS T E DI TI ON
Yury Polyanskiy
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Yihong Wu
Department of Statistics and Data Science
Yale University
i i
i i
i i
www.cambridge.org
Information on this title: www.cambridge.org/XXX-X-XXX-XXXXX-X
DOI: 10.1017/XXX-X-XXX-XXXXX-X
© Author name XXXX
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published XXXX
Printed in <country> by <printer>
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN XXX-X-XXX-XXXXX-X Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
i i
i i
i i
Dedicated to
Names
i i
i i
i i
Contents
Preface page xv
Introduction xvi
2 Divergence 17
2.1 Divergence and Radon-Nikodym derivatives 17
2.2 Divergence: main inequality and equivalent expressions 21
2.3 Differential entropy 23
2.4 Markov kernels 25
2.5 Conditional divergence, chain rule, data-processing inequality 27
2.6* Local behavior of divergence and Fisher information 32
2.6.1* Local behavior of divergence for mixtures 32
2.6.2* Parametrized family 34
3 Mutual information 37
3.1 Mutual information 37
3.2 Mutual information as difference of entropies 40
3.3 Examples of computing mutual information 42
3.4 Conditional mutual information and conditional independence 45
3.5 Sufficient statistics and data processing 48
i i
i i
i i
Contents vii
7 f-divergences 88
7.1 Definition and basic properties of f-divergences 88
7.2 Data-processing inequality; approximation by finite partitions 91
7.3 Total variation and Hellinger distance in hypothesis testing 95
7.4 Inequalities between f-divergences and joint range 98
7.5 Examples of computing joint range 102
7.5.1 Hellinger distance versus total variation 102
7.5.2 KL divergence versus total variation 103
7.5.3 Chi-squared versus total variation 103
7.6 A selection of inequalities between various divergences 104
7.7 Divergences between Gaussians 105
7.8 Mutual information based on f-divergence 106
7.9 Empirical distribution and χ2 -information 107
7.10 Most f-divergences are locally χ2 -like 109
7.11 f-divergences in parametric families: Fisher information 111
7.12 Rényi divergences and tensorization 115
7.13 Variational representation of f-divergences 118
7.14*Technical proofs: convexity, local expansions and variational representations 121
i i
i i
i i
viii Contents
i i
i i
i i
Contents ix
i i
i i
i i
x Contents
i i
i i
i i
Contents xi
i i
i i
i i
xii Contents
i i
i i
i i
Contents xiii
i i
i i
i i
xiv Contents
i i
i i
i i
Preface
This book is a modern introduction to the field of information theory. In the last two decades,
information theory has evolved from a discipline primarily dealing with problems of information
storage and transmission (“coding”) to one focusing increasingly on information extraction and
denoising (“learning”). This transformation is reflected in the title and content of this book.
The content grew out of the lecture notes accumulated over a decade of the authors’ teaching
regular courses at MIT, University of Illinois, and Yale, as well as short courses at EPFL (Switzer-
land) and ENSAE (France). Our intention is to use this manuscript as a textbook for a first course
on information theory for graduate (and advanced undergraduate) students, or for a second (topics)
course delving deeper into specific areas. A significant part of the book is devoted to the exposition
of information-theoretic methods which have found influential applications in other fields such as
statistical learning and computer science. (Specifically, we cover Kolmogorov’s metric entropy,
strong data processing inequalities, and entropic upper bounds for statistical estimation). We also
include some lesser known classical material (for example, connections to ergodicity) along with
the latest developments, which are often covered by the exercises (following the style of Csiszár
and Körner [81]).
It is hard to mention everyone, who helped us start and finish this work, but some stand out
especially. First and foremost, we owe our debt to Sergio Verdú, whose course at Princeton is
responsible for our life-long admiration of the subject. Furthermore, some techical choices, such
as the “one-shot” approach to coding theorems and simultaneous treatment of discrete and contin-
uous alphabets, reflect the style we learned from his courses. Next, we were fortunate to have many
bright students contribute to typing the lecture notes (precursor of this book), as well as to cor-
recting and extending the content. Among them, we especially thank Ganesh Ajjanagadde, Austin
Collins, Yuzhou Gu, Richard Guo, Qingqing Huang, Yunus Inan, Reka Inovan, Jason Klusowski,
Anuran Makur, Pierre Quinton, Aolin Xu, Sheng Xu, Pengkun Yang, Junhui Zhang.
Y. Polyanskiy <[email protected]>
MIT
Y. Wu <[email protected]>
Yale
i i
i i
i i
Introduction
What is information?
The Oxford English Dictionary lists 18 definitions of the word information, while the Merriam-
Webster Dictionary lists 17. This emphasizes the diversity of meaning and domains in which the
word information may appear. This book, however, is only concerned with a precise mathematical
understanding of information, independent of the application context.
How can we measure something that we cannot even define well? Among the earliest attempts
of quantifying information we can list R.A. Fisher’s works on the uncertainty of statistical esti-
mates (“confidence intervals”) and R. Hartley’s definition of information as the logarithm of the
number of possibilities. Around the same time, Fisher [127] and others identified connection
between information and thermodynamic entropy. This line of thinking culminated in Claude
Shannon’s magnum opus [277], where he formalized the concept of (what we call today the)
Shannon information and forever changed the human language by accepting John Tukey’s word
bit as the unit of its measurement. In addition to possessing a number of elegant properties, Shan-
non information turned out to also answer certain rigorous mathematical questions (such as the
optimal rate of data compression and data transmission). This singled out Shannon’s definition as
the right way of quantifying information. Classical information theory, as taught in [76, 81, 133],
focuses exclusively on this point of view.
In this book, however, we take a slightly more general point of view. To introduce it, let us
quote an emminent physicist L. Brillouin [53]:
We must start with a precise definition of the word “information”. We consider a problem involving a certain
number of possible answers, if we have no special information on the actual situation. When we happen to be
in possession of some information on the problem, the number of possible answers is reduced, and complete
information may even leave us with only one possible answer. Information is a function of the ratio of the
number of possible answers before and after, and we choose a logarithmic law in order to insure additivity of
the information contained in independent situations.
Note that only the last sentence specializes the more general term information to the Shannon’s
special version. In this book, we think of information without that last sentence. Namely, for us
information is a measure of difference between two beliefs about the system state. For example, it
could be the amount of change in our worldview following an observation or an event. Specifically,
suppose that initially the probability distribution P describes our understanding of the world (e.g.,
P allows us to answer questions such as how likely it is to rain today). Following an observation our
distribution changes to Q (e.g., upon observing clouds or a clear sky). The amount of information in
the observation is the dissimilarity between P and Q. How to quantify dissimilarity depends on the
particular context. As argued by Shannon, in many cases the right choice is the Kullback-Leibler
i i
i i
i i
Introduction xvii
(KL) divergence D(QkP), see Definition 2.1. Indeed, if the prior belief is described by a probability
mass function P = (p1 , . . . , pk ) on the set of k possible outcomes, then the observation of the first
outcome results in the new (posterior) belief vector Q = (1, 0, . . . , 0) giving D(QkP) = log p11 ,
and similarly for other outcomes. Since the outcome i happens with probability pi we see that the
average dissimilarity between the prior and posterior beliefs is
X
k
1
pi log ,
pi
i=1
1
For Institute of Electrical and Electronics Engineers; pronounced “Eye-triple-E”.
i i
i i
i i
xviii Introduction
have been heavily influenced by the information theory. Many more topics ranging from biology,
neuroscience and thermodynamics to pattern recognition, artificial intelligence and control theory
all regularly appear in information-theoretic conferences and journals.
It seems that objectively circumscribing the territory claimed by information theory is futile.
Instead, we highlight what we believe to be the most interesting developments of late.
First, information processing systems of today are much more varied compared to those of last
century. A modern controller (robot) is not just reacting to a few-dimensional vector of observa-
tions, modeled as a linear time-invariant system. Instead, it has million-dimensional inputs (e.g., a
rasterized image), delayed and quantized, which also need to be communicated across noisy links.
The target of statistical inference is no longer a low-dimensional parameter, but rather a high-
dimensional (possibly discrete) object with structure (e.g. a sparse matrix, or a graph between
communities). Furthermore, observations arrive to a statistician from spatially or temporally sep-
arated sources, which need to be transmitted cognizant of rate limitations. Recognizing these new
challenges, multiple communities simultaneously started re-investigating classical results (Chap-
ter 29) on the optimality of maximum-likelihood and the (optimal) variance bounds given by the
Fisher information. These developments in high-dimensional statistics, computer science and sta-
tistical learning depend on the mastery of the f-divergences (Chapter 7), the mutual-information
method (Chapter 30), and the strong version of the data-processing inequality (Chapter 33).
Second, since the 1990s technological advances have brought about a slew of new noisy channel
models. While classical theory addresses the so-called memoryless channels, the modern channels,
such as in flash storage, or urban wireless (multi-path, multi-antenna) communication, are far from
memoryless. In order to analyze these, the classical “asymptotic i.i.d.” theory is insufficient. The
resolution is the so-called “one-shot” approach to information theory, in which all main results are
developed while treating the channel inputs and outputs as abstract [307]. Only at the last step those
inputs are given the structure of long sequences and the asymptotic values are calculated. This
new “one-shot” approach has additional relevance to anyone willing to learn quantum information
theory, where it is in fact necessary.
Third, and perhaps the most important, is the explosion in the interest of understanding the meth-
ods and limits of machine learning from data. Information-theoretic methods were instrumental for
several discoveries in this area. As examples, we recall the concept of metric entropy (Chapter 27)
that is a cornerstone of Vapnik’s approach to supervised learning (known as empirical risk mini-
mization). In addition, metric entropy turns out to govern the fundamental limits of, and suggest
algorithms for, the problem of density estimation, the canonical building block of unsupervised
learning (Chapter 32). Another fascinating connection is that the optimal prediction performance
of online-learning algorithms is given by the maximum of the mutual information. This is shown
through a deep connection between prediction and universal compression (Chapter 13), which lead
to the multiplicative weight update algorithms [327, 74]. Finally, there is a common information-
theoretic method for solving a series of problems in distributed estimation, community detection
(in graphs), and computation with noisy logic gates. This method is a strong version of the classical
data-processing inequality (see Chapter 33), and is being actively developed and applied.
i i
i i
i i
Introduction xix
i i
i i
i i
xx Introduction
A note to statisticians
The interplay between information theory and statistics is a constant theme in the development of
both fields. Since its inception, information theory has been indispensable for understanding the
fundamental limits of statistical estimation. The prominent role of information-theoretic quanti-
ties, such as mutual information, f-divergence, metric entropy, and capacity, in establishing the
minimax rates of estimation has long been recognized since the seminal work of Le Cam [192],
Ibragimov and Khas’minski [162], Pinsker [234], Birgé [34], Haussler and Opper [157], Yang and
Barron [341], among many others. In Part VI of this book we give an exposition to some of the
most influential information-theoretic ideas and their applications in statistics. Of course, this is
not meant to be a thorough treatment of decision theory or mathematical statistics; for that purpose,
we refer to the classics [162, 196, 44, 313] and the more recent monographs [55, 265, 140, 328]
focusing on high dimensions. Instead, we apply the theory developed in previous Parts I–V of
this book to several concrete and carefully chosen examples of determining the minimax risk
in both classical (fixed-dimensional, large-sample asymptotic) and modern (high-dimensional,
non-asymptotic) settings.
At a high level, the connection between information theory (in particular, data transmission)
and statistical inference is that both problems are defined by a conditional distribution PY|X , which
is referred to as the channel for the former and the statistical model or experiment for the latter. In
data transmission we optimize the encoder, which maps messages to codewords, chosen in a way
that permits the decoder to reconstruct the message based on the noisy observation Y. In statistical
settings, Y is still the observation while X plays the role of the parameter which determines the
distribution of Y via PY|X ; the major distinction is that here we no longer have the freedom to
preselect X and the only task is to smartly estimate X (in either the average or the worst case) on
the basis of the data Y. Despite this key difference, many information-theoretic ideas still have
influential and fruitful applications for statistical problems, as we shall see next.
In Chapter 29 we show how the data processing inequality can be used to deduce classical lower
bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees). In Chapter 30 we introduce the
mutual information method, based on the reasoning in joint source-channel coding. Namely, by
comparing the amount of information contained in the data and the amount of information required
for achieving a given estimation accuracy, both measured in bits, this method allows us to apply
the theory of capacity and rate-distortion function developed in Parts IV and V to lower bound the
statistical risk. Besides being principled, this approach also unifies the three popular methods for
proving minimax lower bounds due to Le Cam, Assouad, and Fano respectively (Chapter 31).
It is a common misconception that information theory only supplies techniques for proving
negative results in statistics. In Chapter 32 we present three upper bounds on statistical estimation
risk based on metric entropy: Yang-Barron’s construction inspired by universal compression, Le
Cam-Birgé’s tournament based on pairwise hypothesis testing, and Yatracos’ minimum-distance
approach. These powerful methods are responsible for some of the strongest and most general
results in statistics and applicable for both high-dimensional and nonparametric problems. Finally,
in Chapter 33 we introduce the method based on strong data processing inequalities and apply it to
resolve an array of contemporary problems including community detection on graphs, distributed
i i
i i
i i
Introduction xxi
estimation with communication constraints and generating random tree colorings. These problems
are increasingly captivating the minds of computer scientists as well.
• Part I: Chapters 1–3, Sections 4.1, 5.1–5.3, 6.1, and 6.3, focusing only on discrete prob-
ability space and ignoring Radon-Nikodym derivatives. Some mention of applications in
combinatorics and cryptography (Chapters 8, 9) is recommended.
• Part II: Chapter 10, Sections 11.1–11.4.
• Part III: Chapter 14, Sections 15.1–15.3, and 16.1.
• Part IV: Chapters 17–18, Sections 19.1–19.3, 19.7, 20.1–20.2, 23.1.
• Part V: Sections 24.1–24.3, 25.1, 26.1, and 26.3.
• Conclude with a few applications of information theory outside the classical domain (Chap-
ters 30 and 33).
i i
i i
i i
General conventions
Analysis
• Let int(E) and cl(E) denote the interior and closure of a set E, namely, the largest open set
contained in and smallest closed set containing E, respectively.
• Let co(E) denote the convex hull of E (without topology), namely, the smallest convex set
Pn Pn
containing E, given by co(E) = { i=1 αi xi : αi ≥ 0, i=1 αi = 1, xi ∈ E, n ∈ N}.
• For subsets A, B of a real vector space and λ ∈ R, denote the dilation λA = {λa : a ∈ A} and
the Minkowski sum A + B = {a + b : a ∈ A, B ∈ B}.
• For a metric space (X , d), a function f : X → R is called C-Lipschiptz if |f(x) − f(y)| ≤ Cd(x, y)
for all x, y ∈ X . We set kfkLip(X ) = inf{C : f is C-Lipschitz}.
• The Lebesgue measure on Euclidean spaces is denoted by Leb and also by vol (volume).
• Throughout the book, all measurable spaces (X , E) are standard Borel spaces. Unless explicitly
needed, we suppress the underlying σ -algebra E .
i i
i i
i i
• The collection of all probability measures on X is denoted by ∆(X ). For finite spaces we
abbreviate ∆k ≡ ∆([k]), a (k − 1)-dimensional simplex.
• For measures P and Q, their product measure is denoted by P × Q or P ⊗ Q. The n-fold product
of P is denoted by Pn or P⊗n .
• Let P be absolutely continuous with respect to Q, denoted by P Q. The Radon-Nikodym
dP dP
derivative of P with respect to Q is denoted by dQ . For a probability measure P, if Q = Leb, dQ
is referred to the probability density function (pdf); if Q is the counting measure on a countable
X , dQ
dP
is the probability mass function (pmf).
• Let P ⊥ Q denote their mutual singularity, namely, P(A) = 0 and Q(A) = 1 for some A.
• The support of a probability measure P, denoted by supp(P), is the smallest closed set C such
that P(C) = 1. An atom x of P is such that P({x}) > 0. A distribution P is discrete if supp(P)
is a countable set (consisting of its atoms).
• Let X be a random variable taking values on X , which is referred to as the alphabet of X. Typi-
cally upper case, lower case, and script case are reserved for random variables, realizations, and
alphabets. Oftentimes X and Y are automatically assumed to be the alphabet of X and Y, etc.
• Let PX denote the distribution (law) of the random variable X, PX,Y the joint distribution of X
and Y, and PY|X the conditional distribution of Y given X.
• The independence of random variables X and Y is denoted by X ⊥ ⊥ Y, in which case PX,Y =
PX × PY . Similarly, X ⊥ ⊥ Y|Z denotes their conditional independence given Z, in which case
PX,Y|Z = PX|Z × PY|Z .
• Throughout the book, Xn ≡ Xn1 ≜ (X1 , . . . , Xn ) denotes an n-dimensional random vector. We
i.i.d.
write X1 , . . . , Xn ∼ P if they are independently and identically distributed (iid) as P, in which
case PXn = Pn .
• The empirical distribution of a sequence x1 , . . . , xn denoted by P̂xn ; empirical distribution of a
random sample X1 , . . . , Xn denoted by P̂n ≡ P̂Xn .
a.s. P d
• Let −−→, − →, − → denote convergence almost surely, in probability, and in distribution (law),
d
respectively. Let = denote equality in distribution.
• Some commonly used distributions are as follows:
– Ber(p): Bernoulli distribution with mean p.
– Bin(n, p): Binomial distribution with n trials and success probability p.
– Poisson(λ): Poisson distribution with mean λ.
– Let N ( μ, σ 2 ) denote the Gaussian (normal) distribution on R with mean μ and σ 2 and
N ( μ, Σ) the Gaussian distribution on Rd with mean μ and covariance matrix Σ. Denote
the standard normal density by φ(x) = √12π e−x /2 , the CDF and complementary CDF by
2
Rt
Φ(t) = −∞ φ(x)dx and Q(t) = Φc (t) = 1 − Φ(t). The inverse of Q is denoted by Q−1 (ϵ).
– Z ∼ Nc ( μ, σ 2 ) denotes the complex-valued circular symmetric normal distribution with
expectation E[Z] = μ ∈ C and E[|Z − μ|2 ] = σ 2 .
– For a compact subset X of Rd with non-empty interior, Unif(X ) denotes the uniform distri-
bution on X , with Unif(a, b) ≡ Unif([a, b]) for interval [a, b]. We also use Unif(X ) to denote
the uniform (equiprobable) distribution on a finite set X .
i i
i i
i i
Part I
Information measures
i i
i i
i i
i i
i i
i i
Information measures form the backbone of information theory. The first part of this book
is devoted to an in-depth study of various information measures, notably, entropy, divergence,
mutual information, as well as their conditional versions (Chapters 1–3). In addition to basic
definitions illustrated through concrete examples, we will also study various aspects including
regularity, tensorization, variational representation, local expansion, convexity and optimization
properties, as well as the data processing principle (Chapters 4–6). These information measures
will be imbued with operational meaning when we proceed to classical topics in information theory
such as data compression and transmission, in subsequent parts of the book.
In addition to the classical (Shannon) information measures, Chapter 7 provides a systematic
treatment of f-divergences, a generalization of (Shannon) measures introduced by Csíszar that
plays an important role in many statistical problems (see Parts III and VI). Finally, towards the
end of this part we will discuss two operational topics: random number generators in Chapter 9
and the application of entropy method to combinatorics and geometry Chapter 8.
i i
i i
i i
1 Entropy
This chapter introduces the first information measure – Shannon entropy. After studying its stan-
dard properties (chain rule, conditioning), we will briefly describe how one could arrive at its
definition. We discuss axiomatic characterization, the historical development in statistical mechan-
ics, as well as the underlying combinatorial foundation (“method of types”). We close the chapter
with Han’s and Shearer’s inequalities, that both exploit submodularity of entropy. After this Chap-
ter, the reader is welcome to consult the applications in combinatorics (Chapter 8) and random
number generation (Chapter 9), which are independent of the rest of this Part.
log2 ↔ bits
loge ↔ nats
log256 ↔ bytes
log ↔ arbitrary units, base always matches exp
Different units will be convenient in different cases and so most of the general results in this book
are stated with “baseless” log/exp.
i i
i i
i i
Definition 1.2 (Joint entropy). The joint entropy of n discrete random variables Xn ≜
(X1 , X2 , . . . , Xn ) is
h 1 i
H(Xn ) = H(X1 , . . . , Xn ) = E log .
PX1 ,...,Xn (X1 , . . . , Xn )
Note that joint entropy is a special case of Definition 1.1 applied to the random vector Xn =
(X1 , X2 , . . . , Xn ) taking values in the product space.
Remark 1.1. The name “entropy” originates from thermodynamics – see Section 1.3, which
also provides combinatorial justification for this definition. Another common justification is to
derive H(X) as a consequence of natural axioms for any measure of “information content” – see
Section 1.2. There are also natural experiments suggesting that H(X) is indeed the amount of
“information content” in X. For example, one can measure time it takes for ant scouts to describe
the location of the food to ants-workers. It was found that when nest is placed at the root of a full
binary tree of depth d and food at one of the leaves, the time was proportional to the entropy of a
random variable describing the food location [262]. (It was also estimated that ants communicate
with about 0.7–1 bit/min and that communication time reduces if there are some regularities in
path-description: paths like “left,right,left,right,left,right” are described by scouts faster).
Entropy measures the intrinsic randomness or uncertainty of a random variable. In the simple
setting where X takes values uniformly over a finite set X , the entropy is simply given by log-
cardinality: H(X) = log |X |. In general, the more spread out (resp. concentrated) a probability
mass function is, the higher (resp. lower) is its entropy, as demonstrated by the following example.
h(p)
Example 1.1 (Bernoulli). Let X ∼ Ber(p), with PX (1) = p
and PX (0) = p ≜ 1 − p. Then
log 2
1 1
H(X) = h(p) ≜ p log + p log .
p p
Here h(·) is called the binary entropy function, which is
continuous, concave on [0, 1], symmetric around 12 , and sat-
isfies h′ (p) = log pp , with infinite slope at 0 and 1. The
highest entropy is achieved at p = 21 (uniform), while the
lowest entropy is achieved at p = 0 or 1 (deterministic).
It is instructive to compare the plot of the binary entropy
p
function with the variance p(1 − p). 0 1
2
1
Example 1.2 (Geometric). Let X be geometrically distributed, with PX (i) = ppi , i = 0, 1, . . .. Then
E[X] = p̄p and
1 1 1 h( p)
H(X) = E[log ] = log + E[X] log = .
pp̄X p p̄ p
Example 1.3 (Infinite entropy). Is it possible that H(X) = +∞? Yes, for example, P[X = k] ∝
1
k ln2 k
, k = 2, 3, · · · .
i i
i i
i i
Many commonly used information measures have their conditional counterparts, defined
by applying the original definition to a conditional probability measure followed by a further
averaging. For entropy this is defined as follows.
Definition 1.3 (Conditional entropy). Let X be a discrete random variable and Y arbitrary. Denote
by PX|Y=y (·) or PX|Y (·|y) the conditional distribution of X given Y = y. The conditional entropy of
X given Y is
h 1 i
H(X|Y) = Ey∼PY [H(PX|Y=y )] = E log ,
PX|Y (X|Y)
Similar to entropy, conditional entropy measures the remaining randomness of a random vari-
able when another is revealed. As such, H(X|Y) = H(X) whenever Y is independent of X. But
when Y depends on X, observing Y does lower the entropy of X. Before formalizing this in the
next theorem, here is a concrete example.
Example 1.4 (Conditional entropy and noisy channel). Let Y be a noisy observation of X ∼ Ber( 21 )
as follows.
1.0
0.8
0.6
0.4
0.2
0.0
Before discussing various properties of entropy and conditional entropy, let us first review some
relevant facts from convex analysis, which will be used extensively throughout the book.
i i
i i
i i
Review: Convexity
(f) (Entropy under deterministic transformation) H(X) = H(X, f(X)) ≥ H(f(X)) with equality iff
f is one-to-one on the support of PX .
(g) (Full chain rule)
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |Xi−1 ) ≤ H(Xi ), (1.3)
i=1 i=1
i i
i i
i i
10
Proof. (a) Since log PX1(X) is a positive random variable, its expectation H(X) is also positive,
with H(X) = 0 if and only if log PX1(X) = 0 almost surely, namely, PX is a point mass.
(b) Apply Jensen’s inequality to the strictly concave function x 7→ log x:
1 1
H(X) = E log ≤ log E = log |X |.
PX (X) PX (X)
(c) H(X) as a summation only depends on the values of PX , not locations.
(d) Abbreviate P(x) ≡ PX (x) and P(x|y) ≡ PX|Y (x|y). Using P(x) = EY [P(x|Y)] and applying
Jensen’s inequality to the strictly concave function x 7→ x log 1x ,
X 1
X
1
H(X|Y) = EY P(x|Y) log ≤ P(x) log = H(X).
P(x|Y) P ( x)
x∈X x∈X
Additionally, this also follows from (and is equivalent to) Corollary 3.5 in Chapter 3 or
Theorem 5.2 in Chapter 5.
(e) Telescoping PX,Y (X, Y) = PY|X (Y|X)PX (X) and noting that both sides are positive PX,Y -almost
surely, we have
1 h 1 i h 1 i h 1 i
E[log ] = E log = E log + E log
PX,Y (X, Y) PX (X) · PY|X (Y|X) PX (X) PY|X (Y|X)
| {z } | {z }
H(X) H(Y|X)
(f) The intuition is that (X, f(X)) contains the same amount of information as X. Indeed, x 7→
(x, f(x)) is one-to-one. Thus by (c) and (e):
i i
i i
i i
Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original
state periodically), produces useful work and whose only other effect on the outside world is drawing heat from
a warm body. (That is, every such machine, should expend some amount of heat to some cold body too!)1
1
Note that the reverse effect (that is converting work into heat) is rather easy: friction is an example.
i i
i i
i i
12
Equivalent formulation is as follows: “There does not exist a cyclic process that transfers heat
from a cold body to a warm body”. That is, every such process needs to be helped by expending
some amount of external work; for example, the air conditioners, sadly, will always need to use
some electricity.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy
of it (initially described as “transformative content” or Verwandlungsinhalt in German), whose
value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or
“very slow”) process operated on this machine the change of entropy is proportional to the ratio
of absorbed heat and the temperature of the machine:
∆Q
∆S = . (1.4)
T
If heat Q is absorbed at temperature Thot then to return to the original state, one must return some
amount of heat Q′ , where Q′ can be significantly smaller than Q but never zero if Q′ is returned
at temperature 0 < Tcold < Thot . Further logical arguments can convince one that for irreversible
cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy
cannot reduce.
There were great many experimentally verified consequences that second law produced. How-
ever, what is surprising is that the mysterious entropy did not have any formula for it (unlike, say,
energy), and thus had to be computed indirectly on the basis of relation (1.4). This was changed
with the revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation
of the second law based on statistical physics principles and showed that, e.g., for a system of n
independent particles (as in ideal gas) the entropy of a given macro-state can be computed as
X
ℓ
1
S = kn pj log , (1.5)
pj
j=1
where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is
the fraction of particles in j-th molecular state.
More explicitly, their innovation was two-fold. First, they separated the concept of a micro-
state (which in our example above corresponds to a tuple of n states, one for each particle) and the
macro-state (a list {pj } of proportions of particles in each state). Second, they postulated that for
experimental observations only the macro-state matters, but the multiplicity of the macro-state
(number of micro-states that correspond to a given macro-state) is precisely the (exponential
of the) entropy. The formula (1.5) then follows from the following explicit result connecting
combinatorics and entropy.
i i
i i
i i
1.4* Submodularity 13
Pk
Proposition 1.5 (Method of types). Let n1 , . . . , nk be non-negative integers with i=1 ni = n,
nk ≜
n
and denote the distribution P = (p1 , . . . , pk ), pi = nni . Then the multinomial coefficient n1 ,...
n!
n1 !···nk ! satisfies
1 n
exp{nH(P)} ≤ ≤ exp{nH(P)} .
( 1 + n) k − 1 n1 , . . . nk
i.i.d. Pn
Proof. For the upper bound, let X1 , . . . , Xn ∼ P and let Ni = i=1 1{Xj =i} denote the number of
occurences of i. Then (N1 , . . . , Nk ) has a multinomial distribution:
Y
k
n n′
P[N1 = n′1 , . . . , Nk = n′k ] = ′ ′ pi i ,
n1 , . . . , nk
i=1
for any nonnegative integers n′i such that n′1 + · · · + n′k = n. Recalling that pi = ni /n, the upper
bound follows from P[N1 = n1 , . . . , Nk = nk ] ≤ 1. In addition, since (N1 , . . . , Nk ) takes at most
(n + 1)k−1 values, the lower bound follows if we can show that (n1 , . . . , nk ) is its mode. Indeed,
for any n′i with n′1 + · · · + n′k = n, defining ∆i = n′i − ni we have
Proposition 1.5 shows that the multinomial coefficient can be approximated up to a polynomial
(in n) term by exp(nH(P)). More refined estimates can be obtained; see Ex. I.2. In particular, the
binomial coefficient can be approximated using the binary entropy function as follows: Provided
that p = nk ∈ (0, 1),
n
e−1/6 ≤ k
≤ 1. (1.6)
√ 1
enh(p)
2πnp(1−p)
For more on combinatorics and entropy, see Ex. I.1, I.3 and Chapter 8.
1.4* Submodularity
Recall that [n] denotes a set {1, . . . , n}, Sk denotes subsets of S of size k and 2S denotes all subsets
of S. A set function f : 2S → R is called submodular if for any T1 , T2 ⊂ S
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing
returns”. Indeed consider T′ ⊂ T and b 6∈ T. Then
f( T ∪ b) − f( T ) ≤ f( T ′ ∪ b) − f( T ′ ) .
i i
i i
i i
14
Proof. Let A = XT1 \T2 , B = XT1 ∩T2 , C = XT2 \T1 . Then we need to show
H(A, B, C) + H(B) ≤ H(A, B) + H(B, C) .
This follows from a simple chain
H(A, B, C) + H(B) = H(A, C|B) + 2H(B) (1.8)
≤ H(A|B) + H(C|B) + 2H(B) (1.9)
= H(A, B) + H(B, C) (1.10)
on Xn . Let us also denote Γ̄∗n the closure of Γ∗n . It is not hard to show, cf. [347], that Γ̄∗n is also a
closed convex cone and that
Γ∗n ⊂ Γ̄∗n ⊂ Γn .
The astonishing result of [348] is that
Γ∗2 = Γ̄∗2 = Γ2 (1.11)
Γ∗3 ⊊ Γ̄∗3 = Γ3 (1.12)
Γ∗n ⊊ Γ̄∗n ⊊Γn n ≥ 4. (1.13)
This follows from the fundamental new information inequality not implied by the submodularity
of entropy (and thus called non-Shannon inequality). Namely, [348] showed that for any 4-tuple
of discrete random variables:
1 1 1
I(X3 ; X4 ) − I(X3 ; X4 |X1 ) − I(X3 ; X4 |X2 ) ≤ I(X1 ; X2 ) + I(X1 ; X3 , X4 ) + I(X2 ; X3 , X4 ) .
2 4 4
(This can be restated in the form of an entropy inequality using Theorem 3.4 but the resulting
expression is too cumbersome).
1 1
H̄n ≤ · · · ≤ H̄k · · · ≤ H̄1 . (1.14)
n k
i i
i i
i i
Furthermore, the sequence H̄k is increasing and concave in the sense of decreasing slope:
H̄m
Proof. Denote for convenience H̄0 = 0. Note that m is an average of differences:
1X
m
1
H̄m = (H̄k − H̄k−1 )
m m
k=1
Thus, it is clear that (1.15) implies (1.14) since increasing m by one adds a smaller element to the
average. To prove (1.15) observe that from submodularity
Now average this inequality over all n! permutations of indices {1, . . . , n} to get
as claimed by (1.15).
Alternative proof: Notice that by “conditioning decreases entropy” we have
Theorem 1.8 (Shearer’s Lemma). Let Xn be discrete n-dimensional RV and let S ⊂ [n] be a
random variable independent of Xn and taking values in subsets of [n]. Then
Remark 1.2. In the special case where S is uniform over all subsets of cardinality k, (1.16) reduces
to Han’s inequality 1n H(Xn ) ≤ 1k H̄k . The case of n = 3 and k = 2 can be used to give an entropy
proof of the following well-known geometry result that relates the size of 3-D object to those
of its 2-D projections: Place N points in R3 arbitrarily. Let N1 , N2 , N3 denote the number of dis-
tinct points projected onto the xy, xz and yz-plane, respectively. Then N1 N2 N3 ≥ N2 . For another
application, see Section 8.2.
Proof. We will prove an equivalent (by taking a suitable limit) version: If C = (S1 , . . . , SM ) is a
list (possibly with repetitions) of subsets of [n] then
X
H(XSj ) ≥ H(Xn ) · min deg(i) , (1.17)
i
j
where deg(i) ≜ #{j : i ∈ Sj }. Let us call C a chain if all subsets can be rearranged so that
S1 ⊆ S2 · · · ⊆ SM . For a chain, (1.17) is trivial, since the minimum on the right-hand side is either
i i
i i
i i
16
For the case of C not a chain, consider a pair of sets S1 , S2 that are not related by inclusion and
replace them in the collection with S1 ∩ S2 , S1 ∪ S2 . Submodularity (1.7) implies that the sum on the
left-hand side of (1.17) does not increase under this replacement, values deg(i) are not changed.
Notice that the total number of pairs that are not related by inclusion strictly decreases by this
replacement: if T was related by inclusion to S1 then it will also be related to at least one of S1 ∪ S2
or S1 ∩ S2 ; if T was related to both S1 , S2 then it will be related to both of the new sets as well.
Therefore, by applying this operation we must eventually arrive to a chain, for which (1.17) has
already been shown.
Remark 1.3. Han’s inequality (1.15) holds for any submodular set-function. For Han’s inequal-
ity (1.14) we also need f(∅) = 0 (this can be achieved by adding a constant to all values of f).
Shearer’s lemma holds for any submodular set-function that is also non-negative.
Example 1.5 (Non-entropy submodular function). Another submodular set-function is
S 7→ I(XS ; XSc ) .
Han’s inequality for this one reads
1 1
0= In ≤ · · · ≤ Ik · · · ≤ I1 ,
n k
1
P
where Ik = S:|S|=k I(XS ; XSc ) measures the amount of k-subset coupling in the random vector
(nk)
Xn .
2
Note that, consequently, for Xn without constant coordinates, and if C is a chain, (1.17) is only tight if C consists of only ∅
and [n] (with multiplicities). Thus if degrees deg(i) are known and non-constant, then (1.17) can be improved, cf. [206].
i i
i i
i i
2 Divergence
In this chapter we study divergence D(PkQ) (also known as information divergence, Kullback-
Leibler (KL) divergence, relative entropy), which is the first example of dissimilarity (information)
measure between a pair of distributions P and Q. As we will see later in Chapter 7, KL diver-
gence is a special case of f-divergences. Defining KL divergence and its conditional version in full
generality requires some measure-theoretic acrobatics (Radon-Nikodym derivatives and Markov
kernels), that we spend some time on. (We stress again that all this abstraction can be ignored if
one is willing to only work with finite or countably-infinite alphabets.)
Besides definitions we prove the “main inequality” showing that KL-divergence is non-negative.
Coupled with the chain rule for divergence, this inequality implies the data-processing inequality,
which is arguably the central pillar of information theory and this book. We conclude the chapter
by studying local behavior of divergence when P and Q are close. In the special case when P and
Q belong to a parametric family, we will see that divergence is locally quadratic with Hessian
being the Fisher information, explaining the fundamental role of the latter in classical statistics.
Review: Measurability
• All complete separable metric spaces, endowed with Borel σ -algebras are standard
Borel. In particular, countable alphabets and Rn and R∞ (space of sequences) are
standard Borel.
Q∞
• If Xi , i = 1, . . . are standard Borel, then so is i=1 Xi .
• Singletons {x} are measurable sets.
• The diagonal {(x, x) : x ∈ X } is measurable in X × X .
17
i i
i i
i i
18
We now need to define the second central concept of this book: the relative entropy, or Kullback-
Leibler divergence. Before giving the formal definition, we start with special cases. For that we
fix some alphabet A. The relative entropy from between distributions P and Q on X is denoted by
D(PkQ), defined as follows.
• Suppose A = Rk , P and Q have densities (pdfs) p and q with respect to the Lebesgue measure.
Then
(R
{p>0,q>0}
p(x) log qp((xx)) dx Leb{p > 0, q = 0} = 0
D(PkQ) = (2.2)
+∞ otherwise
These two special cases cover a vast majority of all cases that we encounter in this book. How-
ever, mathematically it is not very satisfying to restrict to these two special cases. For example, it
is not clear how to compute D(PkQ) when P and Q are two measures on a manifold (such as a
unit sphere) embedded in Rk . Another problematic case is computing D(PkQ) between measures
on the space of sequences (stochastic processes). To address these cases we need to recall the
concepts of Radon-Nikodym derivative and absolute continuity.
Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by
P Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P Q, then Radon-Nikodym
theorem show that there exists a function f : X → R+ such that for any measurable set E,
Z
P(E) = fdQ. [change of measure] (2.3)
E
dP
Such f is called a relative density or a Radon-Nikodym derivative of P w.r.t. Q, denoted by dQ . Not
dP dP
that dQ may not be unique. In the simple cases, dQ is just the familiar likelihood ratio:
We can see that the two special cases of D(PkQ) were both computing EP [log dQdP
]. This turns
out to be the most general definition that we are looking for. However, we will state it slightly
differently, following the tradition.
i i
i i
i i
Below we will show (Lemma 2.4) that the expectation in (2.4) is well-defined (but possibly
infinite) and coincides with EP [log dQdP
] whenever P Q.
To demonstrate the general definition in the case not covered by discrete/continuous special-
izations, consider the situation in which both P and Q are given as densities with respect to a
common dominating measure μ, written as dP = fP dμ and dQ = fQ dμ for some non-negative
fP , fQ . (In other words, P μ and fP = dP dμ .) For example, taking μ = P + Q always allows one to
specify P and Q in this form. In this case, we have the following expression for divergence:
(R
dμ fP log ffQP μ({fQ = 0, fP > 0}) = 0,
D(PkQ) = fQ >0,fP >0
(2.5)
+∞ otherwise
Indeed, first note that, under the assumption of P μ and Q μ, we have P Q iff
μ({fQ = 0, fP > 0}) = 0. Furthermore, if P Q, then dQdP
= ffQP Q-a.e, in which case apply-
ing (2.3) and (1.1) reduces (2.5) to (2.4). Namely, D(PkQ) = EQ [ dQ dP dP
log dQ ] = EQ [ ffQP log ffQP ] =
R R
dμfP log ffQP 1{fQ >0} = dμfP log ffQP 1{fQ >0,fP >0} .
Note that D(PkQ) was defined to be +∞ if P 6 Q. However, it can also be +∞ even when
P Q. For example, D(CauchykGaussian) = ∞. However, it does not mean that there are
somehow two different ways in which D can be infinite. Indeed, what can be shown is that in
both cases there exists a sequence of (finer and finer) finite partitions Π of the space A such that
evaluating KL divergence between the induced discrete distributions P|Π and Q|Π grows without
a bound. This will be subject of Theorem 4.5 below.
Our next observation is that, generally, D(PkQ) 6= D(QkP) and, therefore, divergence is not a
distance. We will see later, that this is natural in many cases; for example it reflects the inherent
asymmetry of hypothesis testing (see Part III and, in particular, Section 14.5). Consider the exam-
ple of coin tossing where under P the coin is fair and under Q the coin always lands on the head.
Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon
observing HHT, one knows for sure it is P. Indeed, D(PkQ) = ∞, D(QkP) = 1 bit.
Having made these remarks we proceed to some examples. First, we show that D is unsurpris-
ingly a generalization of entropy.
i i
i i
i i
20
1
log q
d(p∥q) d(p∥q)
1
log q̄
q p
0 p 1 0 q 1
i i
i i
i i
D(PkQ) ≥ 0,
Proof. In view of the definition (2.4), it suffices to consider P Q. Let φ(x) ≜ x log x, which
is strictly convex on R+ . Applying Jensen’s Inequality:
h dP i h dP i
D(PkQ) = EQ φ ≥ φ EQ = φ(1) = 0,
dQ dQ
dP
with equality iff dQ = 1 Q-a.e., namely, P = Q.
Lemma 2.4. Let P, Q, R μ and fP , fQ , fR denote their densities relative to μ. Define a bivariate
function Log ab : R+ × R+ → R ∪ {±∞} by
−∞ a = 0, b > 0
a +∞ a > 0, b = 0
Log = (2.10)
b 0 a = 0, b = 0
log ab a > 0, b > 0.
Then the following results hold:
• First,
fR
EP Log = D(PkQ) − D(PkR) , (2.11)
fQ
provided at least one of the hdivergences
i is finite.
• Second, the expectation EP Log fQ is well-defined (but possibly infinite) and, furthermore,
fP
fP
D(PkQ) = EP Log . (2.12)
fQ
In particular, when P Q we have
dP
D(PkQ) = EP log . (2.13)
dQ
i i
i i
i i
22
Remark 2.1. Note that ignoring the issue of dividing by or taking a log of 0, the proof of (2.12)
dR
is just the simple identity log dQ dRdP
= log dQdP = log dQdP
− log dR
dP
. What permits us to handle zeros
is the Log function, which satisfies several natural properties of the log: for every a, b ∈ R+
a b
Log = −Log
b a
and for every c > 0 we have
a a c ac
Log = Log + Log = Log − log(c)
b c b b
except for the case a = b = 0.
Proof. First, suppose D(PkQ) = ∞ and D(PkR) < ∞. Then P[fR (Y) = 0] = 0, and hence in
computation of the expectation in (2.11) only the second part of convention (2.10) can possibly
apply. Since also fP > 0 P-almost surely, we have
fR fR fP
Log = Log + Log , (2.14)
fQ fP fQ
with both logarithms evaluated according to (2.10). Taking expectation over P we see that the
first term, equal to −D(PkR), is finite, whereas the second term is infinite. Thus, the expectation
in (2.11) is well-defined and equal to +∞, as is the LHS of (2.11).
Now consider D(PkQ) < ∞. This implies that P[fQ (Y) = 0] = 0 and this time in (2.11) only
the first part of convention (2.10) can apply. Thus, again we have identity (2.14). Since the P-
expectation of the second term is finite, and of the first term non-negative, we again conclude that
expectation in (2.11) is well-defined, equals the LHS of (2.11) (and both sides are possibly equal
to −∞).
For the second part, we first show that
fP log e
EP min(Log , 0) ≥ − . (2.15)
fQ e
Let g(x) = min(x log x, 0). It is clear − loge e ≤ g(x) ≤ 0 for all x. Since fP (Y) > 0 for P-almost
all Y, in convention (2.10) only the 10 case is possible, which is excluded by the min(·, 0) from the
expectation in (2.15). Thus, the LHS in (2.15) equals
Z Z
f P ( y) f P ( y) f P ( y)
fP (y) log dμ = f Q ( y) log dμ
{fP >fQ >0} f Q ( y ) {fP >fQ >0} f Q ( y ) f Q ( y)
Z
f P ( y)
= f Q ( y) g dμ
{fQ >0} f Q ( y)
log e
≥− .
e
h i h i
Since the negative part of EP Log ffQP is bounded, the expectation EP Log ffQP is well-defined. If
P[fQ = 0] > 0 then it is clearly +∞, as is D(PkQ) (since P 6 Q). Otherwise, let E = {fP >
i i
i i
i i
From here, we notice that Q[fQ > 0] = 1 and on {fP = 0, fQ > 0} we have φ( ffQP ) = 0. Thus, the
term 1E can be dropped and we obtain the desired (2.12).
The final statement of the Lemma follows from taking μ = Q and noticing that P-almost surely
we have
dP
dQ dP
Log = log .
1 dQ
In particular, if X has probability density function (pdf) p, then h(X) = E log p(1X) ; otherwise
h(X) = −∞. The conditional differential entropy is h(X|Y) ≜ E log pX|Y (1X|Y) where pX|Y is a
conditional pdf.
1 n c −(−1)n n
For an example, consider a piecewise-constant pdf taking value e(−1) n on the n-th interval of width ∆n = n2
e .
i i
i i
i i
24
for sums of independent random variables, for integer-valued X and Y, H(X + Y) is finite whenever
H(X) and H(Y) are, because H(X + Y) ≤ H(X, Y) = H(X) + H(Y). This again fails for differential
entropy. In fact, there exists real-valued X with finite h(X) such that h(X + Y) = ∞ for any
independent Y such that h(Y) > −∞; there also exist X and Y with finite differential entropy such
that h(X + Y) does not exist (cf. [41, Section V]).
Nevertheless, differential entropy shares many functional properties with the usual Shannon
entropy. For a short application to Euclidean geometry see Section 8.4.
Theorem 2.6 (Properties of differential entropy). Assume that all differential entropies appearing
below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
X
n
h( X n ) = h(Xk |Xk−1 ) .
k=1
Proof. Parts (a), (c), and (d) follow from the similar argument in the proof (b), (d), and (g) of
Theorem 1.4. Part (b) is by a change of variable in the density. Finall, (e) and (f) are analogous to
Theorems 1.6 and 1.7.
Interestingly, the first property is robust to small additive perturbations, cf. Ex. I.6. Regard-
ing maximizing entropy under quadratic constraints, we have the following characterization of
Gaussians.
Theorem 2.7. Let Cov(X) = E[XX⊤ ] − E[X]E[X]⊤ denote the covariance matrix of a random
vector X. For any d × d positive definite matrix Σ,
1
max h(X) = h(N(0, Σ)) = log((2πe)d det Σ) (2.19)
PX :Cov(X)⪯Σ 2
i i
i i
i i
Proof. To show (2.19), without loss of generality, assume that E[X] = 0. By comparing to
Gaussian, we have
where in the last step we apply E[X⊤ Σ−1 X] = Tr(E[XX⊤ ]Σ−1 ) ≤ Tr(I) due to the constraint
Cov(X) Σ and the formula (2.18). The inequality (2.20) follows analogously by choosing the
reference measure to be N(0, ad Id ).
Finally, let us mention a connection between the differential entropy and the Shannon entropy.
Let X be a continuous random vector in Rd . Denote its discretized version by Xm = m1 bmXc
for m ∈ N, where b·c is taken componentwise. Rényi showed that [261, Theorem 1] provided
H(bXc) < ∞ and h(X) is defined, we have
To interpret this result, consider, for simplicity, d = 1, m = 2k and assume that X takes values in
the unit interval, in which case X2k is the k-bit uniform quantization of X. Then (2.21) suggests
that for large k, the quantized bits behave as independent fair coin flips. The underlying reason is
that for “nice” density functions, the restriction to small intervals is approximately uniform. For
more on quantization see Section 24.1 (notably Section 24.1.5) in Chapter 24.
The kernel K can be viewed as a random transformation acting from X to Y , which draws
Y from a distribution depending on the realization of X, including deterministic transformations
PY|X
as special cases. For this reason, we write PY|X : X → Y and also X −−→ Y. In information-
theoretic context, we also refer to PY|X as a channel, where X and Y are the channel input and
output respectively. There are two ways of obtaining Markov kernels. The first way is defining
them explicitly. Here are some examples of that:
i i
i i
i i
26
Note that above we have implicitly used the facts that the slices Ex of E are measurable subsets
of Y for each x and that the function x 7→ K(Ex |x) is measurable (cf. [59, Chapter I, Prop. 6.8 and
6.9], respectively). We also notice that one joint distribution PX,Y can have many different versions
of PY|X differing on a measure-zero set of x’s.
The operation of combining an input distribution on X and a kernel K : X → Y as we did
in (2.22) is going to appear extensively in this book. We will usually denote it as multiplication:
Given PX and kernel PY|X we can multiply them to obtain PX,Y ≜ PX PY|X , which in the discrete
case simply means that the joint PMF factorizes as product of marginal and conditional PMFs:
To denote this (linear) relation between the input PX and the output PY we sometimes also write
PY|X
PX −−→ PY .
We must remark that technical assumptions such as restricting to standard Borel spaces are
really necessary for constructing any sensible theory of distintegration/conditioning and multi-
plication. To emphasize this point we consider a (cautionary!) example involving a pathological
measurable space Y .
i i
i i
i i
Example 2.5 (X ⊥⊥ Y but PY|X=x 6 PY for all x). Consider X a unit interval with Borel σ -algebra
and Y a unit interval with the σ -algebra σY consisting of all sets which are either countable or
have a countable complement. Clearly σY is a sub-σ -algebra of Borel one. We define the following
kernel K : X → Y :
K(A|x) ≜ 1{x ∈ A} .
This is simply saying that Y is produced from X by setting Y = X. It should be clear that for
every A ∈ σY the map x 7→ K(A|x) is measurable, and thus K is a valid Markov kernel. Letting
X ∼ Unif(0, 1) and using formula (2.22) we can define a joint distribution PX,Y . But what is the
conditional distribution PY|X ? On one hand, clearly we can set PY|X (A|x) = K(A|x), since this
was how PX,Y was constructed. On the other hand, we will show that PX,Y = PX PY , i.e. X ⊥ ⊥ Y
and X = Y at the same time! Indeed, consider any set E = B × C ⊂ X × Y . We always have
PX,Y [B × C] = PX [B ∩ C]. Thus if C is countable then PX,Y [E] = 0 and so is PX PY [E] = 0. On the
other hand, if Cc is countable then PX [C] = PY [C] = 1 and PX,Y [E] = PX PY [E] again. Thus, both
PY|X = K and PY|X = PY are valid conditional distributions. But notice that since PY [{x}] = 0, we
have K(·|x) 6 PY for every x ∈ X . In particular, the value of D(PY|X=x kPY ) can either be 0 or
+∞ for every x depending on the choice of the version of PY|X . It is, thus, advisable to stay within
the realm of standard Borel spaces.
We will also need to use the following result extensively. We remind that a σ -algebra is called
separable if it is generated by a countable collection of sets. Any standard Borel space’s σ -algebra
is separable. The following is another useful result about Markov kernels, cf. [59, Chapter 5,
Theorem 4.44]:
dPY|X=x
The meaning of this theorem is that the Radon-Nikodym derivative dRY|X=x can be made jointly
measurable with respect to (x, y).
i i
i i
i i
28
In order to extend the above definition to more general X , we need to first understand whether
the map x 7→ D(PY|X=x kQY|X=x ) is even measurable.
Lemma 2.11. Suppose that Y is standard Borel. The set A0 ≜ {x : PY|X=x QY|X=x } and the
function
x 7→ D(PY|X=x kQY|X=x )
dPY|X=x dQY|X=x
Proof. Take RY|X = 1
2 PY|X + 12 QY|X and define fP (y|x) ≜ dRY|X=x (y) and fQ (y|x) ≜ dRY|X=x (y).
By Theorem 2.10 these can be chosen to be jointly measurable on X × Y . Let us define B0 ≜
{(x, y) : fP (y|x) > 0, fQ (y|x) = 0} and its slice Bx0 = {y : (x, y) ∈ B0 }. Then note that PY|X=x
QY|X=x iff RY|X=x [Bx0 ] = 0. In other words, x ∈ A0 iff RY|X=x [Bx0 ] = 0. The measurability of B0
implies that of x 7→ RY|X=x [Bx0 ] and thus that of A0 . Finally, from (2.12) we get that
f P ( Y | x)
D(PY|X=x kQY|X=x ) = EY∼PY|X=x Log , (2.23)
f Q ( Y | x)
Theorem 2.13 (Chain rule). For any pair of measures PX,Y and QX,Y we have
regardless of the versions of conditional distributions PY|X and QY|X one chooses.
Proof. First, let us consider the simplest case: X , Y are discrete and QX,Y (x, y) > 0 for all x, y.
Letting (X, Y) ∼ PX,Y we get
PX,Y (X, Y) PX (X)PY|X (Y|X)
D(PX,Y kQX,Y ) = E log = E log
QX,Y (X, Y) QX (X)QY|X (Y|X)
PY|X (Y|X) PX (X)
= E log + E log
QY|X (Y|X) QX (X)
i i
i i
i i
Next, let us address the general case. If PX 6 QX then PX,Y 6 QX,Y and both sides of (2.24) are
infinity. Thus, we assume PX QX and set λP (x) ≜ dQ dPX
X
(x). Define fP (y|x), fQ (y|x) and RY|X as in
the proof of Lemma 2.11. Then we have PX,Y , QX,Y RX,Y ≜ QX RY|X , and for any measurable E
Z Z
PX,Y [E] = λP (x)fP (y|x)RX,Y (dx dy) , QX,Y [E] = fQ (y|x)RX,Y (dx dy) .
E E
The chain rule has a number of useful corollaries, which we summarize below.
Theorem 2.14 (Properties of Divergence). Assume that X and Y are standard Borel. Then
i i
i i
i i
30
X
n
≥ D(PXi kQXi ), (2.26)
i=1
Qn
where the inequality holds with equality if and only if PXn = j=1 PXj .
(d) (Tensorization)
Yn
Y n X n
D PXj
QXj = D(PXj kQXj ).
j=1 j=1 j=1
(e) (Conditioning increases divergence) Given PY|X , QY|X and PX , let PY = PY|X ◦ PX and QY =
QY|X ◦ PX , as represented by the diagram:
PY |X PY
PX
QY |X QY
Then D(PY kQY ) ≤ D(PY|X kQY|X |PX ), with equality iff D(PX|Y kQX|Y |PY ) = 0.
We remark that as before without the standard Borel assumption even the first property can
fail. For example, Example 2.5 shows an example where PX PY|X = PX QY|X but PY|X 6= QY|X and
D(PY|X kQY|X |PX ) = ∞.
Proof. (a) This follows from the chain rule (2.24) since PX = QX .
(b) Apply (2.24), with X and Y interchanged and use the fact that conditional divergence is non-
negative.
Qn Qn
(c) By telescoping PXn = i=1 PXi |Xi−1 and QXn = i=1 QXi |Xi−1 .
(d) Apply (c).
(e) The inequality follows from (a) and (b). To get conditions for equality, notice that by the chain
rule for D:
• There is a nice interpretation of the full chain rule as a decomposition of the “distance” from
PXn to QXn as a sum of “distances” between intermediate distributions, cf. Ex. I.33.
• In general, D(PX,Y kQX,Y ) and D(PX kQX ) + D(PY kQY ) are incomparable. For example, if X = Y
under P and Q, then D(PX,Y kQX,Y ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PX,Y 6= QX,Y we have D(PX,Y kQX,Y ) > 0 = D(PX kQX ) + D(PY kQY ).
i i
i i
i i
The following result, known as the Data-Processing Inequality (DPI), is an important principle
in all of information theory. In many ways, it underpins the whole concept of information. The
intuitive interpretation is that it is easier to distinguish two distributions using clean (resp. full) data
as opposed to noisy (resp. partial) data. DPI is a recurring theme in this book, and later we will
study DPI for other information measures such as those for mutual information and f-divergences.
Theorem 2.15 (DPI for KL divergence). Let PY = PY|X ◦ PX and QY = PY|X ◦ QX , as represented
by the diagram:
PX PY
PY|X
QX QY
Then
Note that D(Pf(X) kQf(X) ) = D(PX kQX ) does not imply that f is one-to-one; as an example,
consider PX = Gaussian, QX = Laplace, Y = |X|. In fact, the equality happens precisely when
f(X) is a sufficient statistic for testing P against Q; in other words, there is no loss of information
in summarizing X into f(X) as far as testing these two hypotheses is concerned. See Example 3.8
for details.
A particular useful application of Corollary 2.16 is when we take f to be an indicator function:
i i
i i
i i
32
Proof.
1 1
D(λP + λ̄QkQ) = EQ (λf + λ̄) log(λf + λ̄)
λ λ
dP
where f = . As λ → 0 the function under expectation decreases to (f − 1) log e monotonically.
dQ
Indeed, the function
λ 7→ g(λ) ≜ (λf + λ̄) log(λf + λ̄)
g(λ)
is convex and equals zero at λ = 0. Thus λ is increasing in λ. Moreover, by the convexity of
x 7→ x log x:
1 1
(λf + λ)(log(λf + λ)) ≤ (λf log f + λ1 log 1) = f log f
λ λ
i i
i i
i i
and by assumption f log f is Q-integrable. Thus the Monotone Convergence Theorem applies.
To prove (2.28) first notice that if P 6 Q then there is a set E with p = P[E] > 0 = Q[E].
Applying data-processing for divergence to X 7→ 1E (X), we get
1
D(QkλP + λ̄Q) ≥ d(0kλp) = log
1 − λp
λ 7→ D(λP + λ̄QkQ) ,
Our second result about the local behavior of KL-divergence is the following (see Section 7.10
for generalizations):
i i
i i
i i
34
i i
i i
i i
where μ is some common dominating measure (e.g. Lebesgue or counting measure). If for each
fixed x, the density pθ (x) depends smoothly on θ, one can define the Fisher information matrix
with respect to the parameter θ as
JF (θ) ≜ Eθ VV⊤ , V ≜ ∇θ ln pθ (X) , (2.31)
E θ [ V] = 0 (2.32)
JF (θ) = cov(V)
θ
Z p p
= 4 μ(dx)(∇θ pθ (x))(∇θ pθ (x))⊤
where the last identity is obtained by differentiating (2.32) with respect to each θj .
The significance of Fisher information matrix arises from the fact that it gauges the local
behaviour of divergence for smooth parametric families. Namely, we have (again under suitable
technical conditions):2
log e ⊤
D(Pθ0 kPθ0 +ξ ) = ξ JF (θ0 )ξ + o(kξk2 ) , (2.33)
2
which is obtained by integrating the Taylor expansion:
1
ln pθ0 +ξ (x) = ln pθ0 (x) + ξ ⊤ ∇θ ln pθ0 (x) + ξ ⊤ Hessθ (ln pθ0 (x))ξ + o(kξk2 ) .
2
We will establish this fact rigorously later in Section 7.11. Property (2.33) is of paramount impor-
tance in statistics. We should remember it as: Divergence is locally quadratic on the parameter
space, with Hessian given by the Fisher information matrix. Note that for the Gaussian location
model Pθ = N (θ, Σ), (2.33) is in fact exact with JF (θ) ≡ Σ−1 – cf. Example 2.2.
As another example, note that Proposition 2.19 is a special case of (2.33) by considering Pλ =
λ̄Q + λP parametrized by λ ∈ [0, 1]. In this case, the Fisher information at λ = 0 is simply
χ2 (PkQ). Nevertheless, Proposition 2.19 is completely general while the asymptotic expansion
(2.33) is not without regularity conditions (see Section 7.11).
Remark 2.3. Some useful properties of Fisher information are as follows:
2
To illustrate the subtlety here, consider a scalar location family, i.e. pθ (x) = f0 (x − θ) for some density f0 . In this case
∫ (f′0 )2
Fisher information JF (θ0 ) = f0
does not depend on θ0 and is well-defined even for compactly supported f0 ,
provided f′0 vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (2.33) is infinite for any
ξ > 0. In such cases, a better interpretation for Fisher information is as the coefficient of the expansion
ξ2
D(Pθ0 k 12 Pθ0 + 12 Pθ0 +ξ ) = J (θ )
8 F 0
+ o(ξ 2 ). We will discuss this in more detail in Section 7.11.
i i
i i
i i
36
Example 2.6. Let Pθ = (θ0 , . . . , θd ) be a probability distribution on the finite alphabet {0, . . . , d}.
Pd
We will take θ = (θ1 , . . . , θd ) as the free parameter and set θ0 = 1 − i=1 θi . So all derivatives
are with respect to θ1 , . . . , θd only. Then we have
(
θi , i = 1, . . . , d
pθ (i) = Pd
1 − i=1 θi , i = 0
and for Fisher information matrix we get
1 1 1
JF (θ) = diag ,..., + Pd 11⊤ , (2.36)
θ1 θd 1 − i=1 θi
where 1 is the d × 1 vector of all ones. For future references (see Sections 29.4 and 13.4*), we also
compute the inverse and determinant of JF (θ). By the matrix inversion lemma (A + UCV)−1 =
A−1 − A−1 U(C−1 + VA−1 U)−1 VA−1 , we have
J− ⊤
F (θ) = diag(θ) − θθ .
1
(2.37)
For the determinant, notice that det(A + xy⊤ ) = det A · det(I + A−1 xy⊤ ) = det A · (1 + y⊤ A−1 x),
where we used the identity det(I + AB) = det(I + BA). Thus, we have
Y
d
1
det JF (θ) = . (2.38)
θi
i=0
i i
i i
i i
3 Mutual information
After technical preparations in previous chapters we define perhaps the most famous concept in
the entire field of information theory, the mutual information. It was originally defined by Shan-
non, although the name was coined later by Robert Fano1 It has two equivalent expressions (as a
KL divergence and as difference of entropies), both having its merits. In this chapter, we prove
first properties of mutual information (non-negativity, chain rule and the data-processing inequal-
ity). While defining conditional information, we also introduce the language of directed graphical
models, and connect the equality case in the data-processing inequality with Fisher’s concept of
sufficient statistics.
Definition 3.1 (Mutual information). For a pair of random variables X and Y we define
The intuitive interpretation of mutual information is that I(X; Y) measures the dependency
between X and Y by comparing their joint distribution to the product of the marginals in the KL
divergence, which, as we show next, is also equivalent to comparing the conditional distribution
to the unconditional.
The way we defined I(X; Y) it is a functional of the joint distribution PX,Y . However, it is also
rather fruitful to look at it as a functional of the pair (PX , PY|X ) – more on this in Section 5.1.
In general, the divergence D(PX,Y kPX PY ) should be evaluated using the general definition (2.4).
Note that PX,Y PX PY need not always hold. Let us consider the following examples, though.
1
Professor of electrical engineering at MIT, who developed the first course on information theory and as part of it
formalized and rigorized much of Shannon’s ideas. Most famously, he showed the “converse part” of the noisy channel
coding theorem, see Section 17.4.
37
i i
i i
i i
38
Example 3.1. If X = Y ∼ N(0, 1) then PX,Y 6 PX PY and I(X; Y) = ∞. This reflects our intuition
that X contains an “infinite” amount of information requiring infinitely many bits to describe. On
the other hand, if even one of X or Y is discrete, then we always have PX,Y PX PY . Indeed,
consider any E ⊂ X × Y measurable in the product sigma algebra with PX,Y (E) > 0. Since
P
x∈S P[(X, Y) ∈ S, X = x], there exists some x0 ∈ S such that PY (E ) ≥ P[X =
x0
PX,Y (E) =
x0 , Y ∈ E ] > 0, where E ≜ {y : (x0 , y) ∈ E} is a section of E (measurable for every x0 ). But
x0 x0
then PX PY (E) ≥ PX PY ({x0 } × Ex0 ) = PX ({x0 })PY (Ex0 ) > 0, implying that PX,Y PX PY .
It is clear that the two sides correspond to the two mutual informations. For bijective f, simply
apply the inequality to f and f−1 .
(e) Apply (d) with f(X1 , X2 ) = X1 .
P P
Proof. (a) I(X; Y) = E log PPXXP,YY = E log PYY|X = E log PXX|Y .
(b) Apply data-processing inequality twice to the map (x, y) → (y, x) to get D(PX,Y kPX PY ) =
D(PYX kPY PX ).
(c) By definition and Theorem 2.3.
i i
i i
i i
(d) We will use the data-processing inequality of mutual information (to be proved shortly in
Theorem 3.7(c)). For bijective f, consider the chain of data processing: (x, y) 7→ (f(x), y) 7→
(f−1 (f(x)), y). Then I(X; Y) ≥ I(f(X); Y) ≥ I(f−1 (f(X)); Y) = I(X; Y).
(e) Apply (d) with f(X1 , X2 ) = X1 .
Of the results above, the one we will use the most is (3.1). Note that it implies that
D(PX,Y kPX PY ) < ∞ if and only if
x 7→ D(PY|X=x kPY )
Proof. Suppose PX,Y PX PY . We need to prove that any version of the conditional probability
satisfies PY|X=x PY for almost every x. Note, however, that if we prove this for some version P̃Y|X
then the statement for any version follows, since PY|X=x = P̃Y|X=x for PX -a.e. x. (This measure-
theoretic fact can be derived from the chain rule (2.24): since PX P̃Y|X = PX,Y = PX PY|X we must
have 0 = D(PX,Y kPX,Y ) = D(P̃Y|X kPY|X |PX ) = Ex∼PX [D(P̃Y|X=x kPY|X=x )], implying the stated
dPX,Y R
fact.) So let g(x, y) = dP X PY
(x, y) and ρ(x) ≜ Y g(x, y)PY (dy). Fix any set E ⊂ X and notice
Z Z
PX [E] = 1E (x)g(x, y)PX (dx) PY (dy) = 1E (x)ρ(x)PX (dx) .
X ×Y X
R
On the other hand, we also have PX [E] = 1E dPX , which implies ρ(x) = 1 for PX -a.e. x. Now
define
(
g(x, y)PY (dy), ρ(x) = 1
P̃Y|X (dy|x) =
PY (dy), ρ(x) 6= 1 .
Directly plugging P̃Y|X into (2.22) shows that P̃Y|X does define a valid version of the conditional
probability of Y given X. Since by construction P̃Y|X=x PY for every x, the result follows.
Conversely, let PY|X be a kernel such that PX [E] = 1, where E = {x : PY|X=x PY } (recall that
E is measurable by Lemma 2.11). Define P̃Y|X=x = PY|X=x if x ∈ E and P̃Y|X=x = PY , otherwise.
By construction PX P̃Y|X = PX PY|X = PX,Y and P̃Y|X=x PY for every x. Thus, by Theorem 2.10
there exists a jointly measurable f(y|x) such that
i i
i i
i i
40
Theorem 3.4.
(
H(X) X discrete
(a) I(X; X) =
+∞ otherwise.
(b) If X is discrete, then
I(X; Y) + H(X|Y) = H(X) . (3.2)
Consequently, either H(X|Y) = H(X) = ∞,2 or H(X|Y) < ∞ and
I(X; Y) = H(X) − H(X|Y). (3.3)
(d) Similarly, if X, Y are real-valued random vectors with a joint PDF, then
I(X; Y) = h(X) + h(Y) − h(X, Y)
provided that h(X, Y) < ∞. If X has a marginal PDF pX and a conditional PDF pX|Y (x|y),
then
I(X; Y) = h(X) − h(X|Y) ,
provided h(X|Y) < ∞.
(e) If X or Y are discrete then I(X; Y) ≤ min (H(X), H(Y)), with equality iff H(X|Y) = 0 or
H(Y|X) = 0, or, equivalently, iff one is a deterministic function of the other.
Proof. (a) By Theorem 3.2.(a), I(X; X) = D(PX|X kPX |PX ) = Ex∼X D(δx kPX ). If PX is discrete,
then D(δx kPX ) = log PX1(x) and I(X; X) = H(X). If PX is not discrete, let A = {x : PX (x) > 0}
denote the set of atoms of PX . Let ∆ = {(x, x) : x 6∈ A} ⊂ X × X . (∆ is measurable since it’s
2
This is indeed possible if one takes Y = 0 (constant) and X from Example 1.3, demonstrating that (3.3) does not always
hold.
i i
i i
i i
the intersection of Ac × Ac with the diagonal {(x, x) : x ∈ X }.) Then PX,X (∆) = PX (Ac ) > 0
but since
Z Z
(PX × PX )(E) ≜ PX (dx1 ) PX (dx2 )1{(x1 , x2 ) ∈ E}
X X
we have by taking E = ∆ that (PX × PX )(∆) = 0. Thus PX,X 6 PX × PX and thus by definition
I(X; X) = D(PX,X kPX PX ) = +∞ .
(b) Since X is discrete there exists a countable set S such that P[X ∈ S] = 1, and for any x0 ∈ S we
have P[X = x0 ] > 0. Let λ be a counting measure on S and let μ = λ×PY , so that PX PY μ. As
shown in Example 3.1 we also have PX,Y μ. Furthermore, fP (x, y) ≜ dPdμX,Y (x, y) = pX|Y (x|y),
where the latter denotes conditional pmf of X given Y (which is a proper pmf for almost every
y, since P[X ∈ S|Y = y] = 1 for a.e. y). We also have fQ (x, y) = dPdμ
X PY
(x, y) = dP
dλ (x) = pX (x),
X
Note that PX,Y -almost surely both pX|Y (X|Y) > 0 and PX (x) > 0, so we can replace Log with
log in the above. On the other hand,
X 1
H(X|Y) = Ey∼PY pX|Y (x|y) log .
pX|Y (x|y)
x∈ S
From (3.2) we deduce the following result, which was previously shown in Theorem 1.4(d).
Corollary 3.5 (Conditioning reduces entropy). For discrete X, H(X|Y) ≤ H(X), with equality iff
X⊥
⊥ Y.
i i
i i
i i
42
H(X, Y )
H(Y ) H(X)
As an example, we have
H(X1 |X2 , X3 ) = μ(E1 \ (E2 ∪ E3 )) , (3.6)
I(X1 , X2 ; X3 |X4 ) = μ(((E1 ∪ E2 ) ∩ E3 ) \ E4 ) . (3.7)
By inclusion-exclusion, the quantity in (3.5) corresponds to μ(E1 ∩ E2 ∩ E3 ), which explains why
μ is not necessarily a positive measure. For an extensive discussion, see [80, Chapter 1.3].
i i
i i
i i
I(X; Y )
ρ
-1 0 1
show (3.8), by shifting and scaling if necessary, we can assume without loss of generality that
EX = EY = 0 and EX2 = EY2 = 1. Then ρ = EXY. By joint Gaussianity, Y = ρX + Z for some
Z ∼ N ( 0, 1 − ρ 2 ) ⊥
⊥ X. Then using the divergence formula for Gaussians (2.7), we get
I(X; Y) = D(PY|X kPY |PX )
= ED(N (ρX, 1 − ρ2 )kN (0, 1))
1 1 log e
=E log + (ρX) + 1 − ρ − 1
2 2
2 1 − ρ2 2
1 1
= log .
2 1 − ρ2
Alternatively, we can use the differential entropy representation in Theorem 3.4(d) and the entropy
formula (2.17) for Gaussians:
I(X; Y) = h(Y) − h(Y|X)
= h( Y ) − h( Z )
1 1 1 1
= log(2πe) − log(2πe(1 − ρ2 )) = log .
2 2 2 1 − ρ2
where the second equality follows h(Y|X) = h(Y − X|X) = h(Z|X) = h(Z) applying the shift-
invariance of h and the independence between X and Z.
Similar to the role of mutual information, the correlation coefficient also measures the depen-
dency between random variables which are real-valued (more generally, on an inner-product
space) in a certain sense. In contrast, mutual information is invariant to bijections and thus more
general: it can be defined not just for numerical but for arbitrary random variables.
i i
i i
i i
44
X + Y
Then
1 σ2
I(X; Y) = log 1 + X2 ,
2 σN
σX2
where σN2
is frequently referred to as the signal-to-noise ratio (SNR).
1 det ΣX det ΣY
I(X; Y) = log
2 det Σ[X,Y]
where ΣX ≜ E (X − EX)(X − EX)⊤ denotes the covariance matrix of X ∈ Rm , and Σ[X,Y]
denotes the covariance matrix of the random vector [X, Y] ∈ Rm+n .
In the special case of additive noise: Y = X + N for N ⊥
⊥ X, we have
1 det(ΣX + ΣN )
I(X; X + N) = log
2 det ΣN
ΣX ΣX
why?
since det Σ[X,X+N] = det ΣX ΣX +ΣN = det ΣX det ΣN .
Example 3.5 (Binary symmetric channel). Recall the setting in Example 1.4(1). Let X ∼ Ber( 21 )
and N ∼ Ber(δ) be independent. Let Y = X ⊕ N; or equivalently, Y is obtained by flipping X with
probability δ .
N
1− δ
0 0
X δ Y X + Y
1 1
1− δ
The channel PY|X , called the binary symmetric channel with parameter δ and denoted by BSCδ ,
will be encountered frequently in this book.
i i
i i
i i
Example 3.6 (Addition over finite groups). Generalizing Example 3.5, let X and Z take values on
a finite group G. If X is uniform on G and independent of Z, then
where the product PX|Z PY|Z is a conditional distribution such that (PX|Z PY|Z )(A × B|z) =
PX|Z (A|z)PY|Z (B|z), under which X and Y are independent conditioned on Z.
Denoting I(X; Y) as a functional I(PX,Y ) of the joint distribution PX,Y , we have I(X; Y|Z) =
Ez∼PZ [I(PX,Y|Z=z )]. As such, I(X; Y|Z) is a linear functional in PZ . Measurability of the map z 7→
I(PX,Y|Z=z ) is not obvious, but follows from Lemma 2.11.
To further discuss properties of the conditional mutual information, let us first introduce the
notation for conditional independence. A family of joint distributions can be represented by a
directed acyclic graph encoding the dependency structure of the underlying random variables. A
simple example is a Markov chain (path graph) X → Y → Z, which represents distributions that
factor as {PX,Y,Z : PX,Y,Z = PX PY|X PZ|Y }. We have the following equivalent descriptions:
Theorem 3.7 (Further properties of mutual information). Suppose that all random variables are
valued in standard Borel spaces. Then:
i i
i i
i i
46
(f) (Permutation invariance) If f and g are one-to-one (with measurable inverses), then
I(f(X); g(Y)) = I(X; Y).
On the other hand, from the chain rule for D, (2.24), we have
(b)
D(PX,Y,Z kPX,Y PZ ) = D(PX,Z kPX PZ ) + D(PY|X,Z kPY|X |PX,Z ) ,
where in the second term we noticed that conditioning on X, Z under the measure PX,Y PZ
results in PY|X (independent of Z). Putting (a) and (b) together completes the proof.
(c) Apply Kolmogorov identity to I(Y, Z; X):
Remark 3.2. In general, I(X; Y|Z) and I(X; Y) are incomparable. Indeed, consider the following
examples:
3
Also known as “Kolmogorov identities”.
i i
i i
i i
• I(X; Y|Z) > I(X; Y): We need to find an example of X, Y, Z, which do not form a Markov chain.
To that end notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z,
i.i.d.
namely X → Y ← Z. With this idea in mind, we construct X, Z ∼ Bern( 12 ) and Y = X ⊕ Z. Then
I(X; Y) = 0 since X ⊥⊥ Y; however, I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X) = 1 bit.
• I(X; Y|Z) < I(X; Y): Simply take X, Y, Z to be any random variables on finite alphabets and
Z = Y. Then I(X; Y|Z) = I(X; Y|Y) = H(Y|Y) − H(Y|X, Y) = 0 by a conditional version of (3.3).
Remark 3.3 (Chain rule for I ⇒ Chain rule for H). Set Y = Xn . Then H(Xn ) = I(Xn ; Xn ) =
Pn n k− 1
Pn
k=1 I(Xk ; X |X ) = k=1 H(Xk |Xk−1 ), since H(Xk |Xn , Xk−1 ) = 0.
Remark 3.4 (DPI for divergence =⇒ DPI for mutual information). We proved DPI for mutual
information in Theorem 3.7 using Kolmogorov’s identity. In fact, DPI for mutual information is
implied by that for divergence in Theorem 2.15:
P Z| Y PZ|Y
where note that for each x, we have PY|X=x −−→ PZ|X=x and PY −−→ PZ . Therefore if we have a
bi-variate functional of distributions D(PkQ) which satisfies DPI, then we can define a “mutual
information-like” quantity via ID (X; Y) ≜ D(PY|X kPY |PX ) ≜ Ex∼PX D(PY|X=x kPY ) which will
satisfy DPI on Markov chains. A rich class of examples arises by taking D = Df (an f-divergence
– see Chapter 7).
Remark 3.5 (Strong data-processing inequalities). For many channels PY|X , it is possible to
strengthen the data-processing inequality (2.27) as follows: For any PX , QX we have
where ηKL < 1 and depends on the channel PY|X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information in Theorem 3.7(c): For any PU,X we have
For example, for PY|X = BSCδ we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
(SDPIs) quantify the intuitive observation that noise intrinsict in the channel PY|X must reduce the
information that Y carries about the data U, regardless of how we optimize the encoding U 7→ X.
We explore SDPI further in Chapter 33 as well as their ramifications in statistics.
In addition to the case of strict inequality in DPI, the case of equality is also worth taking a closer
look. If U → X → Y and I(U; X) = I(U; Y), intuitively it means that, as far as U is concerned,
there is no loss of information in summarizing X into Y. In statistical parlance, we say that Y is a
sufficient statistic of X for U. This is the topic for the next section.
i i
i i
i i
48
Definition 3.8 (Sufficient statistic). We say that T is a sufficient statistic of X for θ if there exists a
transition probability kernel PX|T so that PθX PT|X = PθT PX|T , i.e., PX|T can be chosen to not depend
on θ.
The intuitive interpretation of T being sufficient is that, with T at hand, one can ignore X; in
other words, T contains all the relevant information to infer about θ. This is because X can be
simulated on the sole basis of T without knowing θ. As such, X provides no extra information
for identification of θ. Any one-to-one transformation of X is sufficient, however, this is not the
interesting case. In the interesting cases dimensionality of T will be much smaller (typically equal
to that of θ) than that of X. See examples below.
Observe also that the parameter θ need not be a random variable, as Definition 3.8 does not
involve any distribution (prior) on θ. This is a so-called frequentist point of view on the problem
of parameter estimation.
Theorem 3.9. Let θ, X, T be as in the setting above. Then the following are equivalent
Proof. We omit the details, which amount to either restating the conditions in terms of condi-
tional independence, or invoking equality cases in the properties stated in Theorem 3.7.
Theorem 3.10 (Fisher’s factorization theorem). For all θ ∈ Θ, let PθX have a density pθ with
respect to a common dominating measure μ. Let T = T(X) be a deterministic function of X. Then
T is a sufficient statistic of X for θ iff
pθ (x) = gθ (T(x))h(x)
i i
i i
i i
Proof. We only give the proof in the discrete case where pθ represents the PMF. (The argument
P R
for the general case is similar replacing by dμ). Let t = T(x).
“⇒”: Suppose T is a sufficient statistic of X for θ. Then pθ (x) = Pθ (X = x) = Pθ (X = x, T =
t) = Pθ (X = x|T = t)Pθ (T = t) = P(X = x|T = T(x)) Pθ (T = T(x))
| {z }| {z }
h ( x) gθ (T(x))
“⇐”: Suppose the factorization holds. Then
p θ ( x) gθ (t)h(x) h ( x)
Pθ (X = x|T = t) = P =P =P ,
x 1{T(x)=t} pθ (x) x 1{T(x)=t} gθ (t)h(x) x 1{T(x)=t} h(x)
free of θ.
Example 3.7 (Independent observations). In the following examples, a parametrized distribution
generates an independent sample of size n, which can be summarized into a scalar-valued sufficient
statistic. These can be verified by checking the factorization of the n-fold product distribution and
applying Theorem 3.10.
i.i.d.
• Normal mean model. Let θ ∈ R and observations X1 , . . . , Xn ∼ N (θ, 1). Then the sample mean
Pn
X̄ = 1n j=1 Xj is a sufficient statistic of Xn for θ.
i.i.d. Pn
• Coin flips. Let Bi ∼ Ber(θ). Then i=1 Bi is a sufficient statistic of Bn for θ.
i.i.d.
• Uniform distribution. Let Ui ∼ Unif(0, θ). Then maxi∈[n] Ui is a sufficient statistic of Un for θ.
Example 3.8 (Sufficient statistic for hypothesis testing). Let Θ = {0, 1}. Given θ = 0 or 1,
X ∼ PX or QX , respectively. Then Y – the output of PY|X – is a sufficient statistic of X for θ iff
D(PX|Y kQX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s. Indeed, the latter means that for kernel QX|Y
we have
PX PY|X = PY QX|Y and QX PY|X = QY QX|Y ,
which is precisely the definition of sufficient statistic when θ ∈ {0, 1}. This example explains
the condition for equality in the data-processing for divergence in Theorem 2.15. Then assuming
D(PY kQY ) < ∞ we have:
D(PX kQX ) = D(PY kQY ) ⇐⇒ Y is a sufficient statistic for testing PX vs. QX
Proof. Let QX,Y = QX PY|X , PX,Y = PX PY|X , then
D(PX,Y kQX,Y ) = D(PY|X kQY|X |PX ) +D(PX kQX )
| {z }
=0
= D(PX|Y kQX|Y |PY ) + D(PY kQY )
≥ D(PY kQY )
with equality iff D(PX|Y kQX|Y |PY ) = 0, which is equivalent to Y being a sufficient statistic for
testing PX vs QX as desired.
i i
i i
i i
In this chapter we collect some results on variational characterizations. It is a well known method
in analysis to study a functional by proving a variational characterization of the form F(x) =
supλ∈Λ fλ (x) or F(x) = infμ∈M gμ (x). Such representations can be useful for multiple purposes:
We will see in this chapter that divergence has two different sup characterizations (over partitions
and over functions). The mutual information is even more special. In addition to inheriting the
ones from KL divergence, it possesses two very special ones: an inf over (centroid) measures QY
and a sup over Markov kernels.
As the main applications of these variational characetizations, we will first pursue the topic of
continuity. In fact, we will discuss several types of continuity.
First, is the continuity in discretization. This is related to the issue of computation. For compli-
cated P and Q direct computation of D(PkQ) might be hard. Instead, one may want to discretize the
infinite alphabet and compute numerically the finite sum. Is this procedure stable, i.e., as the quan-
tization becomes finer, does this procedure guarantee to converge to the true value? The answer
is positive and this continuity with respect to discretization is guaranteed by Theorem 4.5.
Second, is the continuity under change of the distribution. For example, this is arises in the
problem of estimating information measures. In many statistical setups, oftentimes we do not
know P or Q, and we estimate the distribution by P̂n using n iid observations sampled from P (in
discrete cases we may set P̂n to be simply the empirical distribution). Does D(P̂n kQ) provide a
good estimator for D(PkQ)? Does D(P̂n kQ) → D(PkQ) if P̂n → P? The answer is delicate – see
Section 4.4.
Third, there is yet another kind of continuity: continuity “in the σ -algebra”. Despite the scary
name, this one is useful even in the most “discrete” situations. For example, imagine that θ ∼
i.i.d.
Unif(0, 1) and Xi ∼ Ber(θ). Suppose that you observe a sequence of Xi ’s until the random moment
τ equal to the first occurence of the pattern 0101. How much information about θ did you learn
by time τ ? We can encode these observations as
(
Xj , j≤τ,
Zj = ,
?, j>τ
50
i i
i i
i i
where ? designates the fact that we don’t know the value of Xj on those times. Then the question
we asked above is to compute I(θ; Z∞ ). We will show in this chapter that
X
∞
I(θ; Z∞ ) = lim I(θ; Zn ) = I(θ; Zn |Zn−1 ) (4.1)
n→∞
n=1
thus reducing computation to evaluating an infinite sum of simpler terms (not involving infinite-
dimensional vectors). Thus, even in this simple question about biased coin flips we have to
understand how to safely work with infinite-dimensional vectors.
Furthermore, it turns out that PY , similar to the center of gravity, minimizes this weighted distance
and thus can be thought as the best approximation for the “center” of the collection of distributions
{PY|X=x : x ∈ X } with weights given by PX . We formalize these results in this section and start
with the proof of a “golden formula”. Its importance is in bridging the two points of view on
mutual information: (4.3) is the difference of (relative) entropies in the style of Shannon, while
retaining applicability to continuous spaces in the style of Fano.
Proof. In the discrete case and ignoring the possibility of dividing by zero, the argument is really
simple. We just need to write
(3.1) PY|X PY|X QY
I(X; Y) = EPX,Y log = EPX,Y log
PY PY QY
P Q P
and then expand log PYY|XQYY = log QY|YX − log QPYY . The argument below is a rigorous implementation
of this idea.
First, notice that by Theorem 2.14(e) we have D(PY|X kQY |PX ) ≥ D(PY kQY ) and thus if
D(PY kQY ) = ∞ then both sides of (4.2) are infinite. Thus, we assume D(PY kQY ) < ∞ and
in particular PY QY . Rewriting LHS of (4.2) via the chain rule (2.24) we see that Theorem
amounts to proving
D(PX,Y kPX QY ) = D(PX,Y kPX PY ) + D(PY kQY ) .
i i
i i
i i
52
The case of D(PX,Y kPX QY ) = D(PX,Y kPX PY ) = ∞ is clear. Thus, we can assume at least one of
these divergences is finite, and, hence, also PX,Y PX QY .
dPY
Let λ(y) = dQ Y
(y). Since λ(Y) > 0 PY -a.s., applying the definition of Log in (2.10), we can
write
λ(Y)
EPY [log λ(Y)] = EPX,Y Log . (4.4)
1
dPX PY
Notice that the same λ(y) is also the density dPX QY
(x, y) of the product measure PX PY with respect
to PX QY . Therefore, the RHS of (4.4) by (2.11) applied with μ = PX QY coincides with
while the LHS of (4.4) by (2.13) equals D(PY kQY ). Thus, we have shown the required
and, consequently,
Remark 4.1. The variational representation (4.5) is useful for upper bounding mutual information
by choosing an appropriate QY . Indeed, often each distribution in the collection PY|X=x is simple,
but their mixture, PY , is very hard to work with. In these cases, choosing a suitable QY in (4.5)
provides a convenient upper bound. As an example, consider the AWGN channel Y = X + Z in
Example 3.3, where Var(X) = σ 2 , Z ∼ N (0, 1). Then, choosing the best possible Gaussian Q and
applying the above bound, we have:
1
I(X; Y) ≤ inf E[D(N (X, 1)kN ( μ, s))] = log(1 + σ 2 ),
μ∈R,s≥0 2
which is tight when X is Gaussian. For more examples and statistical applications, see Chapter 30.
i i
i i
i i
Proof. We only need to use the previous corollary and the chain rule (2.24):
(2.24)
D(PX,Y kQX QY ) = D(PY|X kQY |PX ) + D(PX kQX ) ≥ I(X; Y) .
Interestingly, the point of view in the previous result extends to conditional mutual information
as follows: We have
where the minimization is over all QX,Y,Z = QX QY|X QZ|Y , cf. Section 3.4. Showing this character-
ization is very similar to the previous theorem. By repeating the same argument as in (4.2) we
get
≥ I ( X ; Z| Y) .
Characterization (4.6) can be understood as follows. The most general graphical model for the
triplet (X, Y, Z) is a 3-clique (triangle).
Y X
What is the information flow on the dashed edge X → Z? To answer this, notice that removing
this edge restricts the joint distribution to a Markov chain X → Y → Z. Thus, it is natural to
ask what is the minimum (KL-divergence) distance between a given PX,Y,Z and the set of all
distributions QX,Y,Z satisfying the Markov chain constraint. By the above calculation, optimal
QX,Y,Z = PY PX|Y PZ|Y and hence the distance is I(X; Z|Y). For this reason, we may interpret I(X; Z|Y)
as the amount of information flowing through the X → Z edge.
In addition to inf-characterization, mutual information also has a sup-characterization.
Theorem 4.4. For any Markov kernel QX|Y such that QX|Y=y PX for PY -a.e. y we have
dQX|Y
I(X; Y) ≥ EPX,Y log .
dPX
i i
i i
i i
54
Remark 4.2. Similar to how Theorem 4.1 is used to upper-bound I(X; Y) by choosing a good
approximation to PY , this result is used to lower-bound I(X; Y) by selecting a good (but com-
putable) approximation QX|Y to usually a very complicated posterior PX|Y . See Section 5.6 for
applications.
Proof. Since modifying QX|Y=y on a negligible set of y’s does not change the expectations, we
will assume that QX|Y=y PY for every y. If I(X; Y) then there is nothing to prove. So we assume
I(X; Y) < ∞, which implies PX,Y PX PY . Then by Lemma 3.3 we have that PX|Y=y PX for
dQX|Y=y /dPX
almost every y. Choose any such y and apply (2.11) with μ = PX and noticing Log 1 =
dQX|Y=y
log dP X
we get
dQX|Y=y
EPX|Y=y log = D(PX|Y=y kPX ) − D(PX|Y=y kQX|Y=y ) ,
dPX
which is applicable since the first term is finite for a.e. y by (3.1). Taking expectation of the previous
identity over y we obtain
dQX|Y
EPX,Y log = I(X; Y) − D(PX|Y kQX|Y |PY ) ≤ I(X; Y) , (4.8)
dPX
implying the first part. The equality case in (4.7) follows by taking QX|Y = PX|Y , which satisfies
the conditions on Q when I(X; Y) < ∞.
i i
i i
i i
Remark 4.3. This theorem, in particular, allows us to prove all general identities and inequalities
for the cases of discrete random variables and then pass to the limit. In case of mutual information
I(X; Y) = D(PX,Y kPX PY ), the partitions over X and Y can be chosen separately, see (4.29).
“≤”: To show D(PkQ) is indeed achievable, first note that if P 6 Q, then by definition, there
exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = Bc , we have
P2 P[Ei ]
D(PkQ) = ∞ = i=1 P[Ei ] log Q[Ei ] . In the sequel we assume that P Q and let X = dQ .
dP
Then D(PkQ) = EQ [X log X] = EQ [φ(X)] by (2.4). Note that φ(x) ≥ 0 if and only if x ≥ 1. By
monotone convergence theorem, we have EQ [φ(X)1{X<c} ] → D(PkQ) as c → ∞, regardless of
the finiteness of D(PkQ).
Next, we construct a finite partition. Let n = c/ϵ be an integer and for j = 0, . . . , n − 1, let
Ej = {jϵ ≤ X(j + 1)ϵ} and En = {X ≥ c}. Define Y = ϵbX/ϵc as the quantized version. Since φ is
uniformly continuous on [0, c], for any x, y ∈ [0, c] such |x − y| ≤ ϵ, we have |φ(x) − φ(y)| ≤ ϵ′
for some ϵ′ = ϵ′ (ϵ, c) such as ϵ′ → 0 as ϵ → 0. Then EQ [φ(Y)1{X<c} ] ≥ EQ [φ(X)1{X<c} ] − ϵ′ .
Morever,
X
n−1 n−1
X
P(Ej )
EQ [φ(Y)1{X<c} ] = φ(jϵ)Q(Ej ) ≤ ϵ′ + φ Q(Ej )
Q( E j )
j=0 j=0
X
n
P(Ej )
≤ ϵ′ + Q(X ≥ c) log e + P(Ej ) log ,
Q( E j )
j=0
P(E )
where the first inequality applies the uniform continuity of φ since jϵ ≤ Q(Ejj ) < (j + 1)ϵ, and the
second applies φ ≥ − log e. As Q(X ≥ c) → 0 as c → ∞, the proof is completed by first sending
ϵ → 0 then c → ∞.
i i
i i
i i
56
Theorem 4.6 (Donsker-Varadhan [100]). Let P, Q be probability measures on X and let CQ denote
the set of functions f : X → R such that EQ [exp{f(X)}] < ∞. We have
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] . (4.11)
f∈CQ
In particular, if D(PkQ) < ∞ then EP [f(X)] is finite for every f ∈ CQ . The identity (4.11) holds
with CQ replaced by the class of all simple functions. If X is a normal topological space (e.g., a
metric space) with Borel σ -algebra, then also
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] , (4.12)
f∈Cb
Proof. “≥”: We can assume for this part that D(PkQ) < ∞, since otherwise there is nothing to
prove. Then fix f ∈ CQ and define a probability measure Qf (tilted version of Q) via
Qf (dx) = exp{f(x) − Zf }Q(dx) , Zf ≜ log EQ [exp{f(X)}] .
Then, obviously Qf Q and we have
dQf dPdQf
EP [f(X)] − Zf = EP log = EP log = D(PkQ) − D(PkQf ) ≤ D(PkQ) .
dQ dQdP
“≤”: The idea is to just take f = log dQ
dP
; however to handle the edge cases we proceed carefully.
First, notice that if P 6 Q then for some E with Q[E] = 0 < P[E] and c → ∞ taking f = c1E shows
that both sides of (4.11) are infinite. Thus, we assume P Q. For any partition of X = ∪nj=1 Ej
Pn P[ E ]
we set f = j=1 1Ej log Q[Ejj ] . Then the right-hand sides of (4.11) and (4.9) evaluate to the same
value and hence by Theorem 4.5 we obtain that supremum over simple functions (and thus over
CQ ) is at least as large as D(PkQ).
Finally, to show (4.12), we show that for every simple function f there exists a continuous
bounded f′ such that EP [f′ ] − log EQ [exp{f′ }] is arbitrarily close to the same functional evaluated
at f. Clearly, for that it is enough to show that for any a ∈ R and measurable A ⊂ X there exists a
sequence of continuous bounded fn such that
EP [fn ] → aP[A], and EQ [exp{fn }] → exp{a}Q[A] (4.13)
hold simultaneously. We only consider the case of a > 0 below. Let compact F and open U be
such that F ⊂ A ⊂ U and max(P[U] − P[F], Q[U] − Q[F]) ≤ ϵ. Such F and U exist whenever P and
Q are so-called regular measures. Without going into details, we just notice that finite measures
on Polish spaces are automatically regular. Then by Urysohn’s lemma there exists a continuous
function fϵ : X → [0, a] equal to a on F and 0 on Uc . Then we have
aP[F] ≤ EP [fϵ ] ≤ aP[U]
exp{a}Q[F] ≤ EQ [exp{fϵ }] ≤ exp{a}Q[U] .
Subtracting aP[A] and exp{a}Q[A] for each of these inequalities, respectively, we see that taking
ϵ → 0 indeed results in a sequence of functions satisfying (4.13).
i i
i i
i i
Remark 4.4. 1 What is the Donsker-Varadhan representation useful for? By setting f(x) = ϵ · g(x)
with ϵ 1 and linearizing exp and log we can see that when D(PkQ) is small, expecta-
tions under P can be approximated by expectations over Q (change of measure): EP [g(X)] ≈
EQ [g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation
distance provides a similar bound, but for a narrower class of bounded functions:
| EP [g(X)] − EQ [g(X)]| ≤ kgk∞ TV(P, Q) .
2 More formally, the inequality EP [f(X)] ≤ log EQ [exp f(X)] + D(PkQ) is useful in estimating
EP [f(X)] for complicated distribution P (e.g. over high-dimensional X with weakly dependent
coordinates) by making a smart choice of Q (e.g. with iid components).
3 In Chapter 5 we will show that D(PkQ) is convex in P (in fact, in the pair). A general method
of obtaining variational formulas like (4.11) is via the Young-Fenchel duality. Indeed, (4.11) is
exactly this inequality since the Fenchel-Legendre conjugate of D(·kQ) is given by a convex
map f 7→ Zf . For more details, see Section 7.13.
4 Donsker-Varadhan should also be seen as an “improved version” of the DPI. For example, one
of the main applications of the DPI in this book is in obtaining estimates like
1
P[A] log ≤ D(PkQ) + log 2 , (4.14)
Q[ A ]
which is the basis of the large deviations theory (Corollary 2.17) and Fano’s inequality
(Theorem 6.3). The same estimate can be obtained by applying (4.11) via f(x) = 1{x∈A} log Q[1A] .
Proposition 4.7. Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then the
map
P 7→ D(PkQ)
is continuous. In particular,
P 7→ H(P) (4.15)
is continuous.
Warning: Divergence is never continuous in the pair, even for finite alphabets. For example,
as n → ∞, d( 1n k2−n ) 6→ 0.
Proof. Notice that
X P ( x)
D(PkQ) = P(x) log
x
Q ( x)
i i
i i
i i
58
Our next goal is to study continuity properties of divergence for general alphabets. We start
with a negative observation.
Remark 4.5. In general, D(PkQ) is not continuous in either P or Q. For example, let X1 , . . . , Xn
Pn d
be iid and equally likely to be {±1}. Then by central limit theorem, Sn = √1n i=1 Xi −
→N (0, 1)
as n → ∞. But
for all n. Note that this is an example for strict inequality in (4.16).
Nevertheless, there is a very useful semicontinuity property.
Theorem 4.8 (Lower semicontinuity of divergence). Let X be a metric space with Borel σ -algebra
H. If Pn and Qn converge weakly to P and Q, respectively,1 then
On a general space if Pn → P and Qn → Q pointwise2 (i.e. Pn [E] → P[E] and Qn [E] → Q[E] for
every measurable E) then (4.16) also holds.
Proof. This simply follows from (4.12) since EPn [f] → EP [f] and EQn [exp{f}] → EQ [exp{f}] for
every f ∈ Cb .
D(PF kQF ) .
1
Recall that sequence of random variables Xn converges in distribution to X if and only if their laws PXn converge weakly
to PX .
2
Pointwise convergence is weaker than convergence in total variation and stronger than weak convergence.
i i
i i
i i
For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to (a) any algebra of sets F and (b) two positive additive (not necessarily σ -additive)
set-functions P, Q on F .
Definition 4.9 (KL divergence over an algebra). Let P and Q be two positive, additive (not nec-
essarily σ -additive) set-functions defined over an algebra F of subsets of X (not necessarily a
σ -algebra). We define
X
n
P[Ei ]
D(PF kQF ) ≜ sup P[Ei ] log ,
{E1 ,...,En } i=1 Q[Ei ]
Sn
where the supremum is over all finite F -measurable partitions: j=1 Ej = X , Ej ∩ Ei = ∅, and
0 log 01 = 0 and log 10 = ∞ per our usual convention.
Note that when F is not a σ -algebra or P, Q are not σ -additive, we do not have Radon-Nikodym
theorem and thus our original definition of KL-divergence is not applicable.
• If F is (P + Q)-dense in G then3
and, in particular,
Proof. The first two items are straightforward applications of the definition. The third follows
from the following fact: if F is dense in G then any G -measurable partition {E1 , . . . , En } can
be approximated by a F -measurable partition {E′1 , . . . , E′n } with (P + Q)[Ei 4E′i ] ≤ ϵ. Indeed,
first we set E′1 to be an element of F with (P + Q)(E1 4E′1 ) ≤ 2n ϵ
. Then, we set E′2 to be
3
Recall that F is μ-dense in G if ∀E ∈ G, ϵ > 0∃E′ ∈ F s.t. μ[E∆E′ ] ≤ ϵ.
i i
i i
i i
60
an ϵ
2n -approximation of E2 \ E′1 , etc. Finally, E′n = (∪j≤1 E′j )c . By taking ϵ → 0 we obtain
P ′ P[E′i ] P
P[Ei ] log QP[[EEii]] .
i P[Ei ] log Q[E′i ] → i
The last statement follows from the previous one and the fact that any algebra F is μ-dense in
the σ -algebra σ{F} it generates for any bounded μ on (X , H) (cf. [107, Lemma III.7.1].)
Finally, we address the continuity under the decreasing σ -algebra, i.e. (4.18).
i i
i i
i i
Further properties of mutual information follow from I(X; Y) = D(PX,Y kPX PY ) and correspond-
ing properties of divergence, e.g.
4
Here we only assume that topology on the space of measures is compatible with the linear structure, so that all linear
operations on measures are continuous.
i i
i i
i i
62
from which (4.24) follows by moving the outer expectation inside the log. Both of these can
be used to show that E[f(X, Y)] ≈ E[f(X, Ȳ)] as long as the dependence between X and Y (as
measured by I(X; Y)) is weak. For example, suppose that for every x the random variable h(x, Ȳ)
is ϵ-subgaussian, i.e.
1
log E[exp{λh(x, Ȳ)}] ≤ λ E[h(x, Ȳ)] + ϵ2 λ2 .
2
Then plugging f = λh into (4.25) and optimizing λ shows
p
E[h(X, Y)] − E[h(X, Ȳ)] ≤ 2ϵ2 I(X; Y) . (4.26)
This allows one to control expectations of functions of dependent random variables by replac-
ing them with independent pairs at the expense of (square-root of the) mutual information
slack [338]. Variant of this idea for bounding deviations with high-probability is a foundation of
the PAC-Bayes bounds on generalization of learning algorithims (in there, Y becomes training
data, X is the selected hypothesis/predictor, PX|Y the learning algorithm, E[h(X, Ȳ)] the test loss,
etc); see Ex. I.44 and [58] for more.
2 (Uniform convergence and Donsker-Varadhan) There is an interesting other consequence
of (4.25). By Theorem 4.1 we have I(X; Y) ≤ D(PX|Y kQX |PY ) for any fixed QX . This lets us con-
vert (4.25) into the following inequality: (we denote by EY and EX|Y the respective uncoditional
and conditional expectations): For every f, PY , QX and PX|Y we have
EY EX|Y [f(X, Y) − log EȲ [exp f(X, Ȳ)] − D(PX|Y=Y kQX ) ≤ 0 .
Now because of the arbitrariness of PX|Y , setting measurability issues aside, we get: For every
f, PY and QX
EY sup EX∼PX [f(X, Y) − log EȲ [exp f(X, Ȳ)] − D(PX kQX ) ≤ 0 .
PX
5
Just apply Donsker-Varadhan to D(PY|X=x0 kPY ) and average over x0 ∼ PX .
i i
i i
i i
For example, taking QX to be uniform on N elements recovers the standard bound on the
maximum of subgaussian random variables: if H1 , . . . , HN are ϵ-subgaussian, then
p
E max (Hi − E[Hi ]) ≤ 2ϵ2 log N . (4.27)
1≤i≤N
d
• Good example of strict inequality: Xn = Yn = 1n Z. In this case (Xn , Yn ) → (0, 0) but
I(Xn ; Yn ) = H(Z) > 0 = I(0; 0).
• Even more impressive example: Let (Xp , Yp ) be uniformly distributed on the unit ℓp -ball on
d
the plane: {x, y : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) → (0, 0), but I(Xp ; Yp ) → ∞. (See
Ex. I.36)
4 Mutual information as supremum over partitions:
X PX,Y [Ei × Fj ]
I(X; Y) = sup PX,Y [Ei × Fj ] log , (4.29)
{Ei }×{Fj } PX [Ei ]PY [Fj ]
i,j
This implies that the full amount of mutual information between two processes X∞ and Y∞
is contained in their finite-dimensional projections, leaving nothing in the tail σ -algebra. Note
also that applying the (finite-n) chain rule to (4.30) recovers (4.1).
T
6 (Monotone convergence II): Let Xtail be a random variable such that σ(Xtail ) = n≥1 σ(X∞ n ).
Then
I(Xtail ; Y) = lim I(X∞
n ; Y) , (4.32)
n→∞
whenever the right-hand side is finite. This is a consequence of Proposition 4.11. Without the
i.i.d.
finiteness assumption the statement is incorrect. Indeed, consider Xj ∼ Ber(1/2) and Y = X∞ 0 .
Then each I(X∞ n ; Y) = ∞ , but Xtail = const a.e. by Kolmogorov’s 0-1 law, and thus the left-hand
side of (4.32) is zero.
6
To prove this from (4.9) one needs to notice that algebra of measurable rectangles is dense in the product σ-algebra. See
[95, Sec. 2.2].
i i
i i
i i
• I-projection: Given Q minimize D(PkQ) over convex class of P. (See Chapter 15.)
• Maximum likelihood: Given P minimize D(PkQ) over some class of Q. (See Section 29.3.)
• Rate-Distortion: Given PX minimize I(X; Y) over a convex class of PY|X . (See Chapter 26.)
• Capacity: Given PY|X maximize I(X; Y) over a convex class of PX . (This chapter.)
In this chapter we show that all these problems have convex/concave objective functions,
discuss iterative algorithms for solving them, and study the capacity problem in more detail.
Remark 5.1. The proof shows that for an arbitrary measure of similarity D(PkQ) convexity of
(P, Q) 7→ D(PkQ) is equivalent to “conditioning increases divergence” property of D. Convexity
can also be understood as “mixing decreases divergence”.
Remark 5.2. There are a number of alternative arguments possible. For example, (p, q) 7→ p log pq
is convex on R2+ , which isa manifestation
of a general phenomenon: for a convex f(·) the perspec-
tive function (p, q) 7→ qf pq is convex too. Yet another way is to invoke the Donsker-Varadhan
variational representation Theorem 4.6 and notice that supremum of convex functions is convex.
64
i i
i i
i i
Theorem 5.2. The map PX 7→ H(X) is concave. Furthermore, if PY|X is any channel, then PX 7→
H(X|Y) is concave. If X is finite, then PX 7→ H(X|Y) is continuous.
Proof. For the special case of the first claim, when PX is on a finite alphabet, the proof is complete
by H(X) = log |X | − D(PX kUX ). More generally, we prove the second claim as follows. Let
f(PX ) = H(X|Y). Introduce a random variable U ∼ Ber(λ) and define the transformation
P0 U = 0
PX|U =
P1 U = 1
Consider the probability space U → X → Y. Then we have f(λP1 + (1 − λ)P0 ) = H(X|Y) and
λf(P1 ) + (1 − λ)f(P0 ) = H(X|Y, U). Since H(X|Y, U) ≤ H(X|Y), the proof is complete. Continuity
follows from Proposition 4.12.
Recall that I(X; Y) is a function of PX,Y , or equivalently, (PX , PY|X ). Denote I(PX , PY|X ) =
I(X; Y).
Proof. There are several ways to prove the first statement, all having their merits.
• First proof : Introduce θ ∈ Ber(λ). Define PX|θ=0 = P0X and PX|θ=1 = P1X . Then θ → X → Y.
Then PX = λ̄P0X + λP1X . I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our
desired I(λ̄P0X + λP1X , PY|X ) ≥ λ̄I(P0X , PY|X ) + λI(P0X , PY|X ).
• Second proof : I(X; Y) = minQ D(PY|X kQ|PX ), which is a pointwise minimum of affine functions
in PX and hence concave.
• Third proof : Pick a Q and use the golden formula: I(X; Y) = D(PY|X kQ|PX ) − D(PY kQ), where
PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→ D(PY kQ)
(convex).
The argument PY is a linear function of PY|X and thus the statement follows from convexity of D
in the pair.
i i
i i
i i
66
Theorem 5.4 (Saddle point). Let P be a convex set of distributions on X . Suppose there exists
P∗X ∈ P , called a capacity-achieving input distribution, such that
Let P∗Y = P∗X PY|X , called a capacity-achieving output distribution. Then for all PX ∈ P and for all
QY , we have
D(PY|X kP∗Y |PX ) ≤ D(PY|X kP∗Y |P∗X ) ≤ D(PY|X kQY |P∗X ). (5.1)
Proof. Right inequality: obvious from C = I(P∗X , PY|X ) = minQY D(PY|X kQY |P∗X ).
Left inequality: If C = ∞, then trivial. In the sequel assume that C < ∞, hence I(PX , PY|X ) < ∞
for all PX ∈ P . Let PXλ = λPX + λP∗X ∈ P and PYλ = PY|X ◦ PXλ . Clearly, PYλ = λPY + λP∗Y ,
where PY = PY|X ◦ PX .
We have the following chain then:
where inequality is by the right part of (5.1) (already shown). Thus, subtracting λ̄C and dividing
by λ we get
and the proof is completed by taking lim infλ→0 and applying the lower semincontinuity of
divergence (Theorem 4.8).
Corollary 5.5. In addition to the assumptions of Theorem 5.4, suppose C < ∞. Then the capacity-
achieving output distribution P∗Y is unique. It satisfies the property that for any PY induced by some
PX ∈ P (i.e. PY = PY|X ◦ PX ) we have
i i
i i
i i
Remark 5.3. • The finiteness of C is necessary for Corollary 5.5 to hold. For a counterexample,
consider the identity channel Y = X, where X takes values on integers. Then any distribution
with infinite entropy is a capacity-achieving input (and output) distribution.
• Unlike the output distribution, capacity-achieving input distribution need not be unique. For
example, consider Y1 = X1 ⊕ Z1 and Y2 = X2 where Z1 ∼ Ber( 12 ) is independent of X1 . Then
maxPX1 X2 I(X1 , X2 ; Y1 , Y2 ) = log 2, achieved by PX1 X2 = Ber(p) × Ber( 21 ) for any p. Note that
the capacity-achieving output distribution is unique: P∗Y1 Y2 = Ber( 12 ) × Ber( 21 ).
Suppose we have a bivariate function f. Then we always have the minimax inequality:
inf sup f(x, y) ≥ sup inf f(x, y).
y x x y
1 It turns out minimax equality is implied by the existence of a saddle point (x∗ , y∗ ),
i.e.,
f ( x, y∗ ) ≤ f ( x∗ , y∗ ) ≤ f ( x∗ , y) ∀ x, y
Furthermore, minimax equality also implies existence of saddle point if inf and sup
are achieved c.f. [31, Section 2.6]) for all x, y [Straightforward to check. See proof
of corollary below].
2 There are a number of known criteria establishing
inf sup f(x, y) = sup inf f(x, y)
y x x y
i i
i i
i i
68
Proof. This follows from the saddle-point: Maximizing/minimizing the leftmost/rightmost sides
of (5.1) gives
min sup D(PY|X kQY |PX ) ≤ max D(PY|X kP∗Y |PX ) = D(PY|X kP∗Y |P∗X )
QY PX ∈P PX ∈P
but by definition min max ≥ max min. Note that we were careful to only use max and min for the
cases where we know the optimum is achievable.
i i
i i
i i
1 Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e.,
rad (A) = inf sup d(x, y). (5.3)
y∈ X x∈ A
2 Diameter of A:
diam (A) = sup d(x, y). (5.4)
x, y∈ A
Note that the radius and the diameter both measure the massiveness/richness of a
set.
3 From definition and triangle inequality we have
1
diam (A) ≤ rad (A) ≤ diam (A). (5.5)
2
The lower and upper bounds are achieved when A is, for example, a Euclidean ball
and the Hamming space, respectively.
4 In many special cases, the upper bound in (5.5) can be improved:
• A result of Bohnenblust [43] shows that in Rn equipped with any norm we always
have rad (A) ≤ n+n 1 diam (A).
q
• For Rn with Euclidean distance Jung proved rad (A) ≤ n
2(n+1) diam (A),
attained by simplex. The best constant is sometimes called the Jung constant
of the space.
• For Rn with ℓ∞ -norm the situation is even simpler: rad (A) = 12 diam (A); such
spaces are called centrable.
The next simple corollary shows that capacity is just the radius of a finite collection of dis-
tributions {PY|X=x : x ∈ X } when distances are measured by divergence (although, we remind,
divergence is not a metric).
Corollary 5.7. For any finite X and any kernel PY|X , the maximal mutual information over all
distributions PX on X satisfies
i i
i i
i i
70
The last corollary gives a geometric interpretation to capacity: It equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY|X=x : x ∈ X }. Moreover, the optimal
center P∗Y is a convex combination of some PY|X=x and is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence)
for arbitrary input space (see Theorem 32.4 for a related representation):
I(X; Y) = inf D(PY|X kQ|PX ) ≤ inf sup D(PY|X=x kQ) ≤ ′inf sup D(PY|X=x kPY|X=x′ ).
Q Q x∈X x ∈X x∈X
C = sup I(X; Y)
PX ∈P
can be (a) interpreted as a saddle point; (b) written in the minimax form; and (c) that the capacity-
achieving output distribution P∗Y is unique. This was all done under the extra assumption that the
supremum over PX is attainable. It turns out, properties b) and c) can be shown without that extra
assumption.
Theorem 5.9 (Kemperman). For any PY|X and a convex set of distributions P such that
Furthermore,
i i
i i
i i
Note that Condition (5.6) is automatically satisfied if there exists a QY such that
sup D(PY|X kQY |PX ) < ∞ . (5.11)
PX ∈P
Without the constraint E[X4 ] = s, the capacity is uniquely achieved at the input distribution PX =
N (0, P); see Theorem 5.11. When s 6= 3P2 , such PX is no longer feasible. However, for s > 3P2
the maximum
1
C = log(1 + P)
2
is still attainable. Indeed, we can add a small “bump” to the gaussian distribution as follows:
PX = (1 − p)N (0, P) + pδx ,
where p → 0 and x → ∞ such that px2 → 0 but px4 → s − 3P2 > 0. This shows that for the
problem (5.12) with s > 3P2 , the capacity-achieving input distribution does not exist, but the
capacity-achieving output distribution P∗Y = N (0, 1 + P) exists and is unique as Theorem 5.9
shows.
Proof of Theorem 5.9. Let P′Xn be a sequence of input distributions achieving C, i.e.,
I(P′Xn , PY|X ) → C. Let Pn be the convex hull of {P′X1 , . . . , P′Xn }. Since Pn is a finite-dimensional
simplex, the (concave) function PX 7→ I(PX , PY|X ) is continuous (Proposition 4.12) and attains its
maximum at some point PXn ∈ Pn , i.e.,
In ≜ I(PXn , PY|X ) = max I(PX , PY|X ) .
PX ∈Pn
Since the space of all probability distributions on a fixed alphabet is complete in total variation,
the sequence must have a limit point PYn → P∗Y . Convergence in TV implies weak convergence,
i i
i i
i i
72
and thus by taking a limit as k → ∞ in (5.15) and applying the lower semicontinuity of divergence
(Theorem 4.8) we get
and therefore, PYn → P∗Y in the (stronger) sense of D(PYn kP∗Y ) → 0. By Theorem 4.1,
To prove that (5.18) holds for arbitrary PX ∈ P , we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃∗Y , and obtain:
where (5.20) follows from (5.18) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃∗Y = P∗Y
and therefore (5.18) holds.
To see the uniqueness of P∗Y , assuming there exists Q∗Y that fulfills C = supPX ∈P D(PY|X kQ∗Y |PX ),
we show Q∗Y = P∗Y . Indeed,
C ≥ D(PY|X kQ∗Y |PXn ) = D(PY|X kPYn |PXn ) + D(PYn kQ∗Y ) = In + D(PYn kQ∗Y ).
Since In → C, we have D(PYn kQ∗Y ) → 0. Since we have already shown that D(PYn kP∗Y ) → 0,
we conclude P∗Y = Q∗Y (this can be seen, for example, from Pinsker’s inequality and the triangle
inequality TV(P∗Y , Q∗Y ) ≤ TV(PYn , Q∗Y ) + TV(PYn , P∗Y ) → 0).
Finally, to see (5.9), note that by definition capacity as a max-min is at most the min-max, i.e.,
C = sup min D(PY|X kQY |PX ) ≤ min sup D(PY|X kQY |PX ) ≤ sup D(PY|X kP∗Y |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P
Corollary 5.10. Let X be countable and P a convex set of distributions on X . If supPX ∈P H(X) <
∞ then
X 1
sup H(X) = min sup PX (x) log <∞
PX ∈P QX PX ∈P
x
Q X ( x)
and the optimizer Q∗X exists and is unique. If Q∗X ∈ P , then it is also the unique maximizer of H(X).
i i
i i
i i
P
Example 5.2 (Max entropy). Assume that f : Z → R is such that Z(λ) ≜ n∈Z exp{−λf(n)} < ∞
for all λ > 0. Then
This follows from taking QX (n) = Z(λ)−1 exp{−λf(n)} in Corollary 5.10. Distributions of this
form are known as Gibbs distributions for the energy function f. This bound is often tight and
achieved by PX (n) = Z(λ∗ )−1 exp{−λ∗ f(n)} with λ∗ being the minimizer.
1. “Gaussian capacity”:
1 σ2
C = I(Xg ; Xg + Ng ) = log 1 + X2
2 σN
I(X; X + Ng ) ≤ I(Xg ; Xg + Ng ),
d
with equality iff X=Xg .
3. “Gaussian noise is the worst for Gaussian input”: For for all N s.t. E[Xg N] = 0 and EN2 ≤ σN2 ,
I(Xg ; Xg + N) ≥ I(Xg ; Xg + Ng ),
d
with equality iff N=Ng and independent of Xg .
Interpretations:
1 For AWGN channel, Gaussian input is the most favorable. Indeed, immediately from the second
statement we have
1 σ2
max I(X; X + Ng ) = log 1 + X2
X:Var X≤σX2 2 σN
i i
i i
i i
74
Proof. WLOG, assume all random variables have zero mean. Let Yg = Xg + Ng . Define
1 σ 2 log e x2 − σX2
f(x) ≜ D(PYg |Xg =x kPYg ) = D(N (x, σN2 )kN (0, σX2 + σN2 )) = log 1 + X2 +
2 σN 2 σX2 + σN2
| {z }
=C
3. Let Y = Xg + N and let PY|Xg be the respective kernel. Note that here we only assume that N is
uncorrelated with Xg , i.e., E [NXg ] = 0, not necessarily independent. Then
dPXg |Yg (Xg |Y)
I(Xg ; Xg + N) ≥ E log (5.21)
dPXg (Xg )
dPYg |Xg (Y|Xg )
= E log (5.22)
dPYg (Y)
log e h Y2 N2 i
=C+ E 2 2
− 2 (5.23)
2 σX + σN σN
log e σX 2 EN2
=C+ 1 − (5.24)
2 σX2 + σN2 σN2
≥ C, (5.25)
where
• (5.21): follows from (4.7),
dPX |Y dPY |X
• (5.22): dPgX g = dPgY g
g g
i i
i i
i i
Note that there is a steady improvement at each step (the value F(sk , tk ) is decreasing), so it
can be often proven that the algorithm converges to a local minimum, or even a global minimum
under appropriate conditions (e.g. the convexity of f). Below we discuss several applications of
this idea, and refer to [82] for proofs of convergence. We need a result, which will be derived
in Chapter 15: for any function c : Y → R and any QY on Y , under the integrability condition
Z = ∫ QY(dy) exp{−c(y)} < ∞,
  min_{PY} { D(PY ‖ QY) + E_{Y∼PY}[c(Y)] }    (5.26)
Maximizing mutual information (capacity). We have a fixed PY|X and the optimization problem
  C = max_{PX} I(X; Y) = max_{PX} max_{QX|Y} E_{PX,Y}[ log( QX|Y(X|Y)/PX(X) ) ].
This results in the iterations:
  QX|Y(x|y) ← (1/Z(y)) PX(x) PY|X(y|x),
  PX(x) ← Q′(x) ≜ (1/Z) exp{ Σ_y PY|X(y|x) log QX|Y(x|y) },
where Z(y) and Z are normalization constants. To derive this, notice that for a fixed PX the optimal
QX|Y = PX|Y . For a fixed QX|Y , we can see that
  E_{PX,Y}[ log( QX|Y(X|Y)/PX(X) ) ] = log Z − D(PX ‖ Q′),
and thus the optimal PX = Q′ .
Denoting Pn to be the value of PX at the nth iteration, we observe that
  I(Pn, PY|X) ≤ C ≤ sup_x D(PY|X=x ‖ PY|X ◦ Pn).    (5.27)
This is useful since at every iteration we obtain not only an estimate Pn of the optimizer, but also a bound on the gap to optimality, C − I(Pn, PY|X) ≤ RHS − LHS of (5.27). It can be shown, furthermore, that both the RHS and the LHS in (5.27) monotonically converge to C as n → ∞; see [82] for details.
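The two updates above are the Blahut–Arimoto iterations. Below is a minimal Python sketch for a finite channel; the BSC test channel, tolerance and iteration limit are our own illustrative choices.

import numpy as np

def blahut_arimoto(P_yx, tol=1e-9, max_iter=2000):
    """Capacity (in bits) of a channel P_yx[y, x] = P(Y=y|X=x) via the iterations above."""
    n_y, n_x = P_yx.shape
    p_x = np.full(n_x, 1.0 / n_x)                 # initial input distribution
    for _ in range(max_iter):
        q_xy = P_yx * p_x                         # Q(x|y) proportional to p(x) P(y|x)
        q_xy /= q_xy.sum(axis=1, keepdims=True)
        # p(x) proportional to exp( sum_y P(y|x) log Q(x|y) )
        terms = np.where(P_yx > 0, P_yx * np.log(q_xy + 1e-300), 0.0)
        new_px = np.exp(terms.sum(axis=0))
        new_px /= new_px.sum()
        if np.max(np.abs(new_px - p_x)) < tol:
            p_x = new_px
            break
        p_x = new_px
    # mutual information I(p_x, P_yx) in bits
    p_y = P_yx @ p_x
    mask = (P_yx > 0) & (p_x[None, :] > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        kl_terms = np.where(mask, P_yx * np.log2(P_yx / p_y[:, None]), 0.0)
    return float(p_x @ kl_terms.sum(axis=0)), p_x

# Binary symmetric channel with crossover 0.11: capacity = 1 - h(0.11), about 0.5 bit
bsc = np.array([[0.89, 0.11], [0.11, 0.89]])
C, px = blahut_arimoto(bsc)
print(C, px)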
where QX|Y is a given channel. This is a problem arising in the maximum likelihood estimation
Pn
for mixture models where QY is the unknown mixing distribution and PX = 1n i=1 δxi is the
empirical distribution of the sample (x1 , . . . , xn ).1
1 (θ)
Note that EM algorithm is also applicable more generally, when QX|Y itself depends on the unknown parameter θ and the
∑ (θ)
goal (see Section 29.3) is to maximize the total log likelihood ni=1 log QX (xi ) joint over (QY , θ). A canonical example
(which was one of the original motivations for the EM algorithm) a k-component Gaussian mixture
(θ) ∑ (θ)
QX = kj=1 wj N (μj , 1); in other words, QY = (w1 , . . . , wk ), QX|Y=j = N (μj , 1) and θ = (μ1 , . . . , μk ). If the centers
μj ’s are known and only the weights wj ’s are to be estimated, then we get the simple convex case in (5.29). In general the
log likelihood function is non-convex in the centers and EM iterations may not converge to the global optimum even with
infinite sample size (see [169] for an example with k = 3).
(Note that taking d(x, y) = −log (dQX|Y=y/dPX)(x) shows that this problem is equivalent to (5.28).) By the chain rule, thus, we find the iterations
  PY|X(y|x) ← (1/Z(x)) QY(y) QX|Y(x|y),
  QY ← PY|X ◦ PX .
Denote by Qn the value of QX = QX|Y ◦ QY at the nth iteration. Notice that for any n and all QX we have from Jensen’s inequality
  D(PX ‖ QX) − D(PX ‖ Qn) = E_{X∼PX}[ −log E_{Y∼QY}[ dQX|Y/dQn ] ] ≥ gap(Qn),
where we defined gap(Qn) = −log esssup_y E_{X∼PX}[ dQX|Y=y/dQn ]. In all, we get the following sandwich bound:
  D(PX ‖ Qn) + gap(Qn) ≤ L ≤ D(PX ‖ Qn),    (5.30)
and it can be shown that as n → ∞ both sides converge to L.
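As a concrete illustration of these iterations, here is a minimal Python sketch of the convex case discussed in the footnote (Gaussian mixture with known unit-variance centers, unknown weights only); the centers, weights and sample size below are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1-D Gaussian mixture with known unit-variance centers;
# only the mixing weights Q_Y are unknown (the convex case).
centers = np.array([-2.0, 0.0, 3.0])
true_w = np.array([0.2, 0.5, 0.3])
x = rng.normal(centers[rng.choice(3, size=2000, p=true_w)], 1.0)

# Q_{X|Y=j}(x_i): likelihood of sample i under component j
lik = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2) / np.sqrt(2 * np.pi)

w = np.full(3, 1.0 / 3.0)                     # initial mixing weights Q_Y
for _ in range(200):
    post = lik * w                            # E-step: P_{Y|X}(j|x_i) prop. to Q_Y(j) Q_{X|Y=j}(x_i)
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                     # M-step: Q_Y <- P_{Y|X} o P_X (empirical average)

print(np.round(w, 3))                         # should approach true_w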
Sinkhorn’s algorithm. This algorithm [284] is very similar, but not exactly the same as the ones above. We fix QX,Y and two marginals VX, VY, and solve the problem
  S = min{ D(PX,Y ‖ QX,Y) : PX = VX, PY = VY }.
From the results of Chapter 15 it is clear that the optimal distribution PX,Y is given by
  P∗X,Y(x, y) = A(x) QX,Y(x, y) B(y),
for some A, B ≥ 0. In order to find the functions A, B we notice that under a fixed B the value of A that makes PX = VX is given by
  A(x) ← VX(x) / Σ_y QX,Y(x, y) B(y).
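A minimal Python sketch of the resulting alternating scaling (with the analogous update for B under a fixed A); the toy QX,Y and marginals VX, VY below are made up.

import numpy as np

# Sketch of Sinkhorn scaling for min{ D(P_{X,Y} || Q_{X,Y}) : P_X = V_X, P_Y = V_Y }.
rng = np.random.default_rng(1)
Q = rng.random((4, 5))
Q /= Q.sum()
V_X = np.array([0.1, 0.2, 0.3, 0.4])
V_Y = np.full(5, 0.2)

A = np.ones(4)
B = np.ones(5)
for _ in range(500):
    A = V_X / (Q @ B)            # enforce P_X = V_X for the current B
    B = V_Y / (A @ Q)            # enforce P_Y = V_Y for the current A

P = A[:, None] * Q * B[None, :]  # P*_{X,Y}(x, y) = A(x) Q(x, y) B(y)
print(np.abs(P.sum(axis=1) - V_X).max(), np.abs(P.sum(axis=0) - V_Y).max())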
In this chapter we start with explaining the important property of mutual information known as
tensorization (or single-letterization), which allows one to maximize and minimize mutual infor-
mation between two high-dimensional vectors. So far in this book we have tacitly failed to give
any operational meaning to the value of I(X; Y). In this chapter, we give one fundamental such
justification in the form of Fano’s inequality. It states that whenever I(X; Y) is small, one should
not be able to predict X on the basis of Y with a small probability of error. As such, this inequality
will be applied countless times in the rest of the book. We also define concepts of entropy rate (for
a stochastic process) and of mutual information rate (for a pair of stochastic processes). For the
former, it is shown that two processes that coincide often must have close entropy rates – a fact
to be used later in the discussion of ergodicity. For the latter we give a closed form expression for
the pair of Gaussian processes in terms of their spectral density.
(2) If X1 ⊥⊥ ... ⊥⊥ Xn then
  I(X^n; Y) ≥ Σ_{i=1}^n I(Xi; Y)    (6.2)
with equality iff PXn|Y = ∏_i PXi|Y, PY-almost surely.¹ Consequently,
  min_{PYn|Xn} I(X^n; Y^n) = Σ_{i=1}^n min_{PYi|Xi} I(Xi; Yi).
Proof. (1) Use I(X^n; Y^n) − Σ_i I(Xi; Yi) = D(PYn|Xn ‖ ∏_i PYi|Xi | PXn) − D(PYn ‖ ∏_i PYi).
(2) Reverse the roles of X and Y: I(X^n; Y) − Σ_i I(Xi; Y) = D(PXn|Y ‖ ∏_i PXi|Y | PY) − D(PXn ‖ ∏_i PXi).
1 For product channel, the input maximizing the mutual information is a product distribution.
2 For product source, the channel minimizing the mutual information is a product channel.
Example 6.1. 1. (6.1) fails for non-product channels. Let X1 ⊥⊥ X2 ∼ Bern(1/2) on {0, 1} = F2 and
  Y1 = X1 + X2,   Y2 = X1.
Then I(X1; Y1) = I(X2; Y2) = 0 but I(X^2; Y^2) = 2 bits.
Similarly, with Y1 = X2, Y2 = X3, . . . , Yn = X1 we have I(Xk; Yk) = 0 for every k, yet
  I(X^n; Y^n) = Σ_i H(Xi) > 0 = Σ_k I(Xk; Yk).
¹ That is, if PXn,Y = PY ∏_{i=1}^n PXi|Y as joint distributions.
  max_{Σ_k E[X_k²] ≤ nP} I(X^n; X^n + Z^n) ≤ max_{Σ_k E[X_k²] ≤ nP} Σ_{k=1}^n I(Xk; Xk + Zk)
Given a distribution PX1 · · · PXn satisfying the constraint, form the “average of marginals” distribution P̄X = (1/n) Σ_{k=1}^n PXk, which also satisfies the single-letter constraint E[X²] = (1/n) Σ_{k=1}^n E[X_k²] ≤ P. Then from the concavity in PX of I(PX, PY|X),
  I(P̄X, PY|X) ≥ (1/n) Σ_{k=1}^n I(PXk, PY|X).
So P̄X gives the same or better mutual information, which shows that the extremization above
ought to have the form nC(P) where C(P) is the single letter capacity. Now suppose Yn = Xn + ZnG
where ZnG ∼ N (0, In ). Since an isotropic Gaussian is rotationally symmetric, for any orthogo-
nal transformation U ∈ O(n), the additive noise has the same distribution ZnG ∼ UZnG , so that
PUYn |UXn = PYn |Xn , and
From the “average of marginal” argument above, averaging over many rotations of Xn can only
make the mutual information larger. Therefore, the optimal input distribution PXn can be chosen
to be invariant under orthogonal transformations. Consequently, the (unique!) capacity achiev-
ing output distribution P∗Yn must be rotationally invariant. Furthermore, from the conditions for
equality in (6.1) we conclude that P∗Yn must have independent components. Since the only product
distribution satisfying the power constraints and having rotational symmetry is an isotropic Gaus-
sian, we conclude that P∗Yn = (P∗Y)⊗n with P∗Y = N(0, 1 + P); correspondingly, the optimal input is X^n ∼ N(0, P In).
  min_{PN: E[N²]=1} I(XG; XG + N)
This uses the same trick, except here the input distribution is automatically invariant under
orthogonal transformations.
Our goal is to draw converse statements: for example, if the uncertainty of W is too high or if the
information provided by the data is too scarce, then it is difficult to guess the value of W.
The function FM (·) is shown in Fig. 6.1. Notice that due to its non-monotonicity the state-
ment (6.5) does not imply (6.3), even though P[X = X̂] ≤ Pmax .
Figure 6.1 The function FM in (6.4) is concave with maximum log M at maximizer 1/M, but not monotone.
Proof. To show (6.3) consider an auxiliary distribution QX,X̂ = UX PX̂ , where UX is uniform on
X . Then Q[X = X̂] = 1/M. Denoting P[X = X̂] ≜ PS , applying the DPI for divergence to the data
processor (X, X̂) 7→ 1{X=X̂} yields d(PS k1/M) ≤ D(PXX̂ kQXX̂ ) = log M − H(X).
To show the second part, suppose one is trying to guess the value of X without any side informa-
tion. Then the best bet is obviously the most likely outcome (mode) and the maximal probability
of success is
  max_{X̂ ⊥⊥ X} P[X = X̂] = Pmax.    (6.6)
Thus, applying (6.3) with X̂ being the mode yields (6.5). Finally, suppose that P = (Pmax, P2, . . . , PM) and introduce Q = (Pmax, (1 − Pmax)/(M − 1), . . . , (1 − Pmax)/(M − 1)). Then the difference of the right and left sides of (6.5) equals D(P‖Q) ≥ 0, with equality iff P = Q.
Remark 6.1. Let us discuss the unusual proof technique. Instead of studying directly the prob-
ability space PX,X̂ given to us, we introduced an auxiliary one: QX,X̂ . We then drew conclusions
about the target metric (probability of error) for the auxiliary problem (the probability of error
= 1 − 1/M). Finally, we used the DPI to transport a statement about Q to a statement about P: if D(P‖Q) is small, then the probabilities of events (e.g., {X ≠ X̂}) under P and Q should be close as well. This is a general method, known as the meta-converse, that we develop in more detail later in this book. For this result, however, there are much more explicit ways to derive it – see Ex. I.42.
Similar to Shannon entropy H, Pmax is also a reasonable measure for randomness of P. In fact,
  H∞(P) ≜ log(1/Pmax)    (6.7)
is known as the Rényi entropy of order ∞ (or the min-entropy in the cryptography literature). Note that H∞(P) = log M iff P is uniform, and H∞(P) = 0 iff P is a point mass. In this regard, Fano's inequality can be thought of as our first example of a comparison of information measures: it compares H and H∞.
Theorem 6.3 (Fano’s inequality). Let |X| = M < ∞ and X → Y → X̂. Let Pe = P[X ≠ X̂], then
Proof. The benefit of the previous proof is that it trivially generalizes to this new case of (possibly
randomized) estimators X̂, which may depend on some observation Y correlated with X. Note that
it is clear that the best predictor of X given Y is the maximum a posteriori (MAP) rule, i.e., the posterior mode: X̂(y) = argmax_x PX|Y(x|y).
To show (6.8) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs. QX,Y,X̂ =
UX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1{X̸=X̂} (note that PX̂|Y is identical for both).
To show (6.9) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs. QX,Y,X̂ =
PX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1{X̸=X̂} to obtain:
where the last step follows from Q[X = X̂] ≤ Pmax since X ⊥
⊥ X̂ under Q. (Again, we refer to
Ex. I.42 for a direct proof.)
The following corollary of the previous result emphasizes its role in providing converses (or
impossibility results) for statistics and data transmission.
Proof. Apply Theorem 6.3 and the data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
A sufficient condition for the entropy rate to exist is stationarity, which essentially means invariance with respect to time shifts. Formally, X is stationary if (Xt1, . . . , Xtn) has the same distribution as (Xt1+k, . . . , Xtn+k) for any t1, . . . , tn, k ∈ N. This definition naturally extends to two-sided processes.
Proof.
(a) Further conditioning + stationarity: H(Xn | X^{n−1}) ≤ H(Xn | X_2^{n−1}) = H(X_{n−1} | X^{n−2}).
(b) Using the chain rule: (1/n) H(X^n) = (1/n) Σ_{i=1}^n H(Xi | X^{i−1}) ≥ H(Xn | X^{n−1}).
(c) H(X^n) = H(X^{n−1}) + H(Xn | X^{n−1}) ≤ H(X^{n−1}) + (1/n) H(X^n).
(d) n ↦ (1/n) H(X^n) is a decreasing sequence and lower bounded by zero, hence has a limit H(X). Moreover, by the chain rule, (1/n) H(X^n) = (1/n) Σ_{i=1}^n H(Xi | X^{i−1}). From here we claim that H(Xn | X^{n−1}) converges to the same limit H(X). Indeed, from the monotonicity shown in part (a), lim_n H(Xn | X^{n−1}) = H′ exists. Next, recall the following fact from calculus: if an → a, then the Cesàro mean (1/n) Σ_{i=1}^n ai → a as well. Thus, H′ = H(X).
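As a quick illustration of parts (a)–(d), the following Python sketch computes (1/n)H(X^n) and the entropy rate for a stationary two-state Markov chain (the transition matrix is made up); for a Markov chain H(Xi | X^{i−1}) = H(X2 | X1) for all i ≥ 2, so the chain-rule average converges to that value.

import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # transition matrix (made-up)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

rate = sum(pi[i] * entropy(P[i]) for i in range(2))   # entropy rate H(X) in bits

# H(X^n) via the chain rule: H(X_1) + (n - 1) * H(X_2 | X_1) for a stationary Markov chain
for n in (1, 2, 5, 50):
    H_n = entropy(pi) + (n - 1) * rate
    print(n, H_n / n, rate)                     # (1/n) H(X^n) decreases to the rate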
  lim_{n→∞} [H(X1) − H(X1 | X^0_{−n})] = lim_{n→∞} I(X1; X^0_{−n}) = I(X1; X^0_{−∞}) = H(X1) − H(X1 | X^0_{−∞})
  (1/n) Σ_{j=1}^n P[Xj ≠ Yj] ≤ ϵ.    (6.13)
For binary alphabet this quantity is known as the bit error rate, which is one of the performance
metrics we consider for reliable data transmission in Part IV (see Section 17.1 and Section 19.6).
Notice that if we define the Hamming distance as
  dH(x^n, y^n) ≜ Σ_{j=1}^n 1{xj ≠ yj}    (6.14)
  δ = (1/n) E[dH(X^n, Y^n)] = (1/n) Σ_{j=1}^n P[Xj ≠ Yj].
Proof. For each j ∈ [n], applying (6.8) to the Markov chain Xj → Yn → Yj yields
where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.3),
and combining with (6.16), we get
  H(X^n | Y^n) ≤ Σ_{j=1}^n H(Xj | Y^n)    (6.17)
              ≤ Σ_{j=1}^n FM(P[Xj = Yj])    (6.18)
              ≤ n FM( (1/n) Σ_{j=1}^n P[Xj = Yj] ),    (6.19)
where in the last step we used the concavity of FM and Jensen’s inequality.
Corollary 6.8. Consider two processes X and Y with entropy rates H(X) and H(Y). If P[Xj ≠ Yj] ≤ ϵ for every j, then
  H(X) − H(Y) ≤ FM(1 − ϵ).
and apply (6.15). For the last statement just recall the expression for FM .
Definition 6.9 (Contiguity). Let {Pn } and {Qn } be sequences of probability measures on some
Ωn . We say Pn is contiguous with respect to Qn (denoted by Pn ◁ Qn ) if for any sequence {An } of
measurable sets, Qn (An ) → 0 implies that Pn (An ) → 0. We say Pn and Qn are mutually contiguous
(denoted by Pn ◁▷ Qn ) if Pn ◁ Qn and Qn ◁ Pn .
Theorem 6.10. Let X be a finite set and Qn the uniform distribution on X n . If Pn ◁ Qn , then
H(Pn ) = H(Qn ) + o(n) = n log |X | + o(n). Equivalently, D(Pn kQn ) = o(n).
Proof. Suppose for the sake of contradiction that H(Pn) ≤ (1 − ϵ) n log|X| for some constant ϵ. Let ϵ′ < ϵ and define An ≜ {x^n ∈ X^n : Pn(x^n) ≥ |X|^{−(1−ϵ′)n}}. Then |An| ≤ |X|^{(1−ϵ′)n} and hence Qn(An) ≤ |X|^{−ϵ′n}. Since Pn ◁ Qn, we have Pn(An) → 0. On the other hand, H(Pn) ≥ E_{Pn}[log(1/Pn) 1_{A_n^c}] ≥ (1 − ϵ′) n log|X| · Pn(A_n^c). Thus Pn(A_n^c) ≤ (1 − ϵ)/(1 − ϵ′), which is a contradiction.
Remark 6.2. It is natural to ask whether Theorem 6.10 holds for non-uniform Qn, that is, whether Pn ◁▷ Qn implies H(Pn) = H(Qn) + o(n). This turns out to be false. To see this, choose any μn, νn and set Pn ≜ (1/2)μn + (1/2)νn and Qn ≜ (1/3)μn + (2/3)νn. Then we always have Pn ◁▷ Qn since 3/4 ≤ dPn/dQn ≤ 3/2. Using conditional entropy, it is clear that H(Pn) = (1/2)H(μn) + (1/2)H(νn) + O(1) and H(Qn) = (1/3)H(μn) + (2/3)H(νn) + O(1). Choosing, say, μn = Ber(1/2)^⊗n and νn = Ber(1/3)^⊗n leads to |H(Pn) − H(Qn)| = Ω(n).
Example 6.3 (Gaussian processes). Consider X, N two stationary Gaussian processes, independent
of each other. Assume that their auto-covariance functions are absolutely summable and thus there
exist continuous power spectral density functions fX and fN . Without loss of generality, assume all
means are zero. Let cX (k) = E [X1 Xk+1 ]. Then fX is the Fourier transform of the auto-covariance
function cX, i.e., fX(ω) = Σ_{k=−∞}^∞ cX(k) e^{iωk}. Finally, assume fN ≥ δ > 0. Then recall from Example 3.4:
  I(X^n; X^n + N^n) = (1/2) log( det(ΣXn + ΣNn) / det ΣNn )
  = (1/2) Σ_{i=1}^n log σi − (1/2) Σ_{i=1}^n log λi,
where σj , λj are the eigenvalues of the covariance matrices ΣYn = ΣXn + ΣNn and ΣNn , which are
all Toeplitz matrices, e.g., (ΣXn )ij = E [Xi Xj ] = cX (i − j). By Szegö’s theorem [146, Sec. 5.2]:
  (1/n) Σ_{i=1}^n log σi → (1/2π) ∫_0^{2π} log fY(ω) dω    (6.20)
Note that cY (k) = E [(X1 + N1 )(Xk+1 + Nk+1 )] = cX (k) + cN (k) and hence fY = fX + fN . Thus, we
have
  (1/n) I(X^n; X^n + N^n) → I(X; X + N) = (1/4π) ∫_0^{2π} log( (fX(ω) + fN(ω)) / fN(ω) ) dω.
Maximizing this over fX (ω) leads to the famous water-filling solution f∗X (ω) = |T − fN (ω)|+ .
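The following Python sketch checks the spectral formula numerically against the finite-n Toeplitz computation, using a made-up geometrically decaying auto-covariance for X and white noise N.

import numpy as np

def toeplitz_cov(c, n):
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return c[idx]

n = 400
k = np.arange(n)
c_X = 2.0 * 0.8 ** k              # c_X(k) = 2 * 0.8^|k| (made-up, AR(1)-like)
c_N = np.zeros(n); c_N[0] = 1.0   # white noise, f_N = 1

Sigma_X = toeplitz_cov(c_X, n)
Sigma_N = toeplitz_cov(c_N, n)
_, logdet_Y = np.linalg.slogdet(Sigma_X + Sigma_N)
_, logdet_N = np.linalg.slogdet(Sigma_N)
rate_finite = 0.5 * (logdet_Y - logdet_N) / n       # (1/n) I(X^n; X^n + N^n) in nats

omega = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
f_X = np.real(sum(c_X[abs(j)] * np.exp(1j * omega * j) for j in range(-n + 1, n)))
f_N = np.ones_like(omega)
rate_spectral = np.mean(0.5 * np.log((f_X + f_N) / f_N))  # = (1/4 pi) * integral

print(rate_finite, rate_spectral)                   # should be close for large n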
7 f-divergences
In Chapter 2 we introduced the KL divergence that measures the dissimilarity between two dis-
tributions. This turns out to be a special case of the family of f-divergence between probability
distributions, introduced by Csiszár [79]. Like KL-divergence, f-divergences satisfy a number of
useful properties:
The purpose of this chapter is to establish these properties and prepare the ground for appli-
cations in subsequent chapters. The important highlight is a joint range Theorem of Harremoës
and Vajda [155], which gives the sharpest possible comparison inequality between arbitrary f-
divergences (and puts an end to a long sequence of results starting from Pinsker’s inequality –
Theorem 7.9). This material can be skimmed on the first reading and referenced later upon need.
with the agreement that if P[q = 0] = 0 the last term is taken to be zero regardless of the value of
f′ (∞) (which could be infinite).
Remark 7.1. For the discrete case, with Q(x) and P(x) being the respective pmfs, we can also write
  Df(P ‖ Q) = Σ_x Q(x) f( P(x)/Q(x) ),
where we adopt the conventions
• f(0) = f(0+),
• 0·f(0/0) = 0, and
• 0·f(a/0) = lim_{x↓0} x f(a/x) = a f′(∞) for a > 0.
Remark 7.2. A nice property of Df (PkQ) is that the definition is invariant to the choice of the
dominating measure μ in (7.2). This is not the case for other dissimilarity measures, e.g., the squared L²-distance between the densities ‖p − q‖²_{L²(dμ)}, which is a popular loss function for density estimation in the statistics literature.
The following are common f-divergences:
Note that we can also choose f(x) = x2 − 1. Indeed, f’s differing by a linear term lead to the
same f-divergence, cf. Proposition 7.2.
• Squared Hellinger distance: f(x) = (1 − √x)²,
  H²(P, Q) ≜ E_Q[ (1 − √(dP/dQ))² ] = ∫ (√dP − √dQ)² = 2 − 2 ∫ √(dP dQ).    (7.5)
¹ In (7.3), ∫ d(P ∧ Q) is the usual shorthand for ∫ (dP/dμ ∧ dQ/dμ) dμ, where μ is any dominating measure. The expressions in (7.4) and (7.5) are understood in a similar sense.
Here the quantity B(P, Q) ≜ ∫ √(dP dQ) is known as the Bhattacharyya coefficient (or Hellinger affinity) [33]. Note that H(P, Q) = √(H²(P, Q)) defines a metric on the space of probability distributions: indeed, the triangle inequality follows from that of L²(μ) for a common dominating measure. Note, however, that (P, Q) ↦ H(P, Q) is not convex. (This is because the metric H is not induced by a Banach norm on the space of measures.)
• Le Cam distance [193, p. 47]: f(x) = (1 − x)²/(2x + 2),
  LC(P, Q) = (1/2) ∫ (dP − dQ)²/(dP + dQ).    (7.6)
Moreover, √(LC(P, Q)) is a metric on the space of probability distributions [114].
• Jensen-Shannon divergence: f(x) = x log(2x/(x+1)) + log(2/(x+1)),
  JS(P, Q) = D( P ‖ (P+Q)/2 ) + D( Q ‖ (P+Q)/2 ).    (7.7)
Moreover, √(JS(P, Q)) is a metric on the space of probability distributions [114].
Moreover, JS(PkQ) is a metric on the space of probability distributions [114].
Remark 7.3. If Df (PkQ) is an f-divergence, then it is easy to verify that Df (λP + λ̄QkQ) and
Df (PkλP + λ̄Q) are f-divergences for all λ ∈ [0, 1]. In particular, Df (QkP) = Df̃ (PkQ) with
f̃(x) ≜ x f(1/x).
We start by summarizing some formal observations about f-divergences:
the latter referred to as the conditional f-divergence (similar to Definition 2.12 for conditional
KL divergence).
5 If PX,Y = PX PY|X and QX,Y = QX PY|X then
In particular,
Df ( P X P Y k QX P Y ) = Df ( P X k QX ) . (7.10)
In particular, we can always assume that f ≥ 0 and (if f is differentiable at 1) that f′ (1) = 0.
Proof. The first and second are clear. For the third property, verify explicitly that Df (PkQ) = 0
for f = c(x − 1). Next consider general f and observe that for P ⊥ Q, by definition we have
which is well-defined (i.e., ∞ − ∞ is not possible) since by convexity f(0) > −∞ and f′ (∞) >
−∞. So all we need to verify is that f(0) + f′ (∞) = 0 if and only if f = c(x − 1) for some c ∈ R.
Indeed, since f(1) = 0, the convexity of f implies that x ↦ g(x) ≜ f(x)/(x − 1) is non-decreasing. By assumption, we have g(0+) = g(∞) and hence g(x) is constant on x > 0, as desired.
For property 4, let RY|X = (1/2)PY|X + (1/2)QY|X. By Theorem 2.10 there exist jointly measurable p(y|x) and q(y|x) such that dPY|X=x = p(y|x)dRY|X=x and dQY|X=x = q(y|x)dRY|X=x. We can then take μ in (7.2) to be μ = PX RY|X, which gives dPX,Y = p(y|x)dμ and dQX,Y = q(y|x)dμ, and thus
  Df(PX,Y ‖ QX,Y)
  = ∫_{X×Y} dμ 1{q(y|x) > 0} q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{X×Y} dμ 1{q(y|x) = 0} p(y|x)
  = ∫_X dPX [ ∫_{y: q(y|x)>0} dRY|X=x q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{y: q(y|x)=0} dRY|X=x p(y|x) ],
where by (7.2) the bracketed term equals Df(PY|X=x ‖ QY|X=x).
Proof. Note that in the case PX,Y ≪ QX,Y (and thus PX ≪ QX), the proof is a simple application of Jensen’s inequality to definition (7.1):
  Df(PX,Y ‖ QX,Y) = E_{X∼QX} E_{Y∼QY|X}[ f( (dPY|X/dQY|X) · (dPX/dQX) ) ]
                  ≥ E_{X∼QX}[ f( E_{Y∼QY|X}[ dPY|X/dQY|X ] · (dPX/dQX) ) ]
                  = E_{X∼QX}[ f( dPX/dQX ) ].
To prove the general case we need to be more careful. Let RX = (1/2)(PX + QX) and RY|X = (1/2)PY|X + (1/2)QY|X. It should be clear that PX,Y, QX,Y ≪ RX,Y ≜ RX RY|X and that for every x: PY|X=x, QY|X=x ≪ RY|X=x. By Theorem 2.10 there exist measurable functions p1, p2, q1, q2 so that dPX = p1(x)dRX, dQX = q1(x)dRX, and dPY|X=x = p2(y|x)dRY|X=x, dQY|X=x = q2(y|x)dRY|X=x. We also denote p(x, y) = p1(x)p2(y|x), q(x, y) = q1(x)q2(y|x).
Fix t > 0 and consider a supporting line to f at t with slope μ, so that
f( u) ≥ f( t) + μ ( u − t) , ∀u ≥ 0 .
Note that we added t = 0 case as well, since for t = 0 the statement is obvious (recall, though,
that f(0) ≜ f(0+) can be equal to +∞).
Next, fix some x with q1(x) > 0 and consider the chain
  ∫_{y: q2(y|x)>0} dRY|X=x q2(y|x) f( (p1(x) p2(y|x)) / (q1(x) q2(y|x)) ) + (p1(x)/q1(x)) PY|X=x[q2(Y|x) = 0] f′(∞)
  (a)≥ f( (p1(x)/q1(x)) PY|X=x[q2(Y|x) > 0] ) + (p1(x)/q1(x)) PY|X=x[q2(Y|x) = 0] f′(∞)
  (b)≥ f( p1(x)/q1(x) ),
where (a) is by Jensen’s inequality and the convexity of f, and (b) by taking t = p1(x)/q1(x) and λ = PY|X=x[q2(Y|x) > 0] in (7.13). Now multiplying the obtained inequality by q1(x) and integrating over {x : q1(x) > 0} we get
  ∫_{q>0} dRX,Y q(x, y) f( p(x, y)/q(x, y) ) + f′(∞) PX,Y[q1(X) > 0, q2(Y|X) = 0]
  ≥ ∫_{q1>0} dRX q1(x) f( p1(x)/q1(x) ).
Adding f′(∞) PX[q1(X) = 0] to both sides we obtain (7.12), since both sides then evaluate to definition (7.2).
Theorem 7.4 (Data processing). Consider a channel that produces Y given X based on the
conditional law PY|X (shown below).
  PX → [PY|X] → PY,    QX → [PY|X] → QY
Let PY (resp. QY) denote the distribution of Y when X is distributed as PX (resp. QX). For any f-divergence Df(·‖·),
  Df(PY ‖ QY) ≤ Df(PX ‖ QX).
Next we discuss some of the more useful properties of f-divergence that parallel those of KL
divergence in Theorem 2.14:
  PX → [PY|X] → PY,    PX → [QY|X] → QY
Then
  Df(PY ‖ QY) ≤ Df(PY|X ‖ QY|X | PX).
Proof. (a) Non-negativity follows from monotonicity by taking X to be unary. To show strict
positivity, suppose for the sake of contradiction that Df (PkQ) = 0 for some P 6= Q. Then
there exists some measurable A such that p = P(A) 6= q = Q(A) > 0. Applying the data
2
By strict convexity at 1, we mean for all s, t ∈ [0, ∞) and α ∈ (0, 1) such that αs + ᾱt = 1, we have
αf(s) + (1 − α)f(t) > f(1).
Remark 7.4 (Strict convexity). Note that even when f is strictly convex at 1, the map (P, Q) 7→
Df (PkQ) may not be strictly convex (e.g. TV(Ber(p), Ber(q)) = |p − q| is piecewise linear).
However, if f is strictly convex everywhere on R+ then so is Df . Indeed, if P 6= Q, then there
exists E such that P(E) 6= Q(E). By the DPI and the strict convexity of f, we have Df (PkQ) ≥
Df (Ber(P(E))kBer(Q(E))) > 0. Strict convexity of f is also related to other desirable properties
of If (X; Y), see Ex. I.31.
Remark 7.5 (g-divergences). We note that, more generally, we may call functional D(PkQ) a
“g-divergence”, or a generalized dissimilarity measure, if it satisfies the following properties: pos-
itivity, monotonicity, data processing inequality (DPI), conditioning increases divergence (CID)
and convexity in the pair. As we have seen in the proof of Theorem 5.1 the latter two are exactly
equivalent. Furthermore, our proof demonstrated that DPI and CID are both implied by monotonic-
ity. If D(PkP) = 0 then monotonicity, as in (7.12), also implies positivity by taking X to be unary.
Finally, notice that DPI also implies monotonicity by applying it to the (deterministic) channel
(X, Y) 7→ X. Thus, requiring DPI (or monotonicity) for D automatically implies all the other main
properties. We remark also that there exist g-divergences which are not monotone transformations
of any f-divergence, cf. [243, Section V]. On the other hand, for finite alphabets, [232] shows that any D(P‖Q) = Σ_i ϕ(Pi, Qi) is a g-divergence iff it is an f-divergence.
The following convenient property, a counterpart of Theorem 4.5, allows us to reduce any gen-
eral problem about f-divergences to the problem on finite alphabets. The proof is in Section 7.14*.
Theorem 7.6. Let P, Q be two probability measures on X with σ-algebra F. Given a finite F-measurable partition E = {E1, . . . , En}, define the distributions PE and QE on [n] by PE(i) = P[Ei] and QE(i) = Q[Ei]. Then
  Df(P ‖ Q) = sup_E Df(PE ‖ QE),
where the supremum is over all finite F-measurable partitions E.
is minimized.
In this section we first show that optimization over ϕ naturally leads to the concept of TV. Subsequently, we will see that asymptotic considerations (when P and Q are replaced with P⊗n and Q⊗n) lead to H². We start with the former case.
where joint distributions PX,Y with the property PX = P and PY = Q are called couplings of P and Q.
³ The extension of (7.17) from simple to composite hypothesis testing is in (32.24).
which establishes that the second supremum in (7.16) lower bounds TV, and hence (by taking
f(x) = 2 · 1E (x) − 1) so does the first. For the other direction, let E = {x : p(x) > q(x)} and notice
  0 = ∫ (p(x) − q(x)) dμ = ( ∫_E + ∫_{E^c} ) (p(x) − q(x)) dμ,
implying that ∫_{E^c} (q(x) − p(x)) dμ = ∫_E (p(x) − q(x)) dμ. But the sum of these two integrals precisely equals 2 · TV, which implies that this choice of E attains equality in (7.16).
For the inf-representation, we notice that given a coupling PX,Y, for any ‖f‖∞ ≤ 1, we have
which, in view of (7.16), shows that the inf-representation is always an upper bound. To show that this bound is tight one constructs X, Y as follows: with probability π ≜ ∫ min(p(x), q(x)) dμ we take X = Y = c with c sampled from the distribution with density r(x) = (1/π) min(p(x), q(x)), whereas with probability 1 − π we take X, Y sampled independently from the distributions p1(x) = (1/(1−π))(p(x) − min(p(x), q(x))) and q1(x) = (1/(1−π))(q(x) − min(p(x), q(x))), respectively. The result follows upon verifying that this PX,Y indeed defines a coupling of P and Q and applying the last identity of (7.3).
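The construction in the last paragraph is easy to implement. Here is a minimal Python sketch for two made-up pmfs on a three-letter alphabet, verifying that the sampled coupling attains P[X ≠ Y] = TV(P, Q).

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])

m = np.minimum(p, q)
pi = m.sum()                      # probability of the "X = Y" component; pi = 1 - TV
tv = 0.5 * np.abs(p - q).sum()

n = 200_000
same = rng.random(n) < pi
x = np.empty(n, dtype=int)
y = np.empty(n, dtype=int)
# with prob. pi: X = Y = c, where c ~ r = m / pi
x[same] = y[same] = rng.choice(3, size=same.sum(), p=m / pi)
# with prob. 1 - pi: X ~ (p - m)/(1 - pi) and Y ~ (q - m)/(1 - pi), independently
x[~same] = rng.choice(3, size=(~same).sum(), p=(p - m) / (1 - pi))
y[~same] = rng.choice(3, size=(~same).sum(), p=(q - m) / (1 - pi))

print(tv, np.mean(x != y))        # the two numbers should agree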
Remark 7.6 (Variational representation). The sup-representation (7.16) of the total variation will
be extended to general f-divergences in Section 7.13. In turn, the inf-representation (7.18) has
no analogs for other f-divergences, with the notable exception of Marton’s d2 , see Remark 7.15.
Distances defined via inf-representations over couplings are often called Wasserstein distances,
and hence we may think of TV as the Wasserstein distance with respect to Hamming distance
d(x, x′ ) = 1{x 6= x′ } on X . The benefit of variational representations is that choosing a particular
coupling in (7.18) gives an upper bound on TV(P, Q), and choosing a particular f in (7.16) yields
a lower bound.
Of particular relevance is the special case of testing with multiple observations, where the data
X = (X1 , . . . , Xn ) are i.i.d. drawn from either P or Q. In other words, the goal is to test
H0 : X ∼ P⊗n vs H1 : X ∼ Q⊗n .
By Theorem 7.7, the optimal total probability of error is given by 1 − TV(P⊗n , Q⊗n ). By the data
processing inequality, TV(P⊗n , Q⊗n ) is a non-decreasing sequence in n (and bounded by 1 by
definition) and hence converges. One would expect that as n → ∞, TV(P⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test vanishes. It turns out that for fixed
distributions P 6= Q, large deviation theory (see Chapter 16) shows that TV(P⊗n , Q⊗n ) indeed
converges to one as n → ∞ and, in fact, exponentially fast:
where the exponent C(P, Q) > 0 is known as the Chernoff Information of P and Q given in (16.2).
However, as frequently encountered in high-dimensional statistical problems, if the distributions
P = Pn and Q = Qn depend on n, then the large-deviation asymptotics in (7.19) can no longer be
directly applied. Since computing the total variation between two n-fold product distributions is
typically difficult, understanding how a more tractable f-divergence is related to the total variation
may give insight on its behavior. It turns out Hellinger distance is precisely suited for this task.
Shortly, we will show the following relation between TV and the Hellinger divergence:
  (1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √(1 − H²(P, Q)/4) ≤ 1.    (7.20)
Direct consequences of the bound (7.20) are:
• H2 (P, Q) = 2, if and only if TV(P, Q) = 1. In this case, the probability of error is zero since
essentially P and Q have disjoint supports.
• H2 (P, Q) = 0 if and only if TV(P, Q) = 0. In this case, the smallest total probability of error is
one, meaning the best test is random guessing.
• Hellinger consistency is equivalent to TV consistency: we have
Proof. For convenience, let X1, X2, . . . , Xn be i.i.d. ∼ Qn. Then
  H²(Pn^{⊗n}, Qn^{⊗n}) = 2 − 2 E[ √( ∏_{i=1}^n (Pn/Qn)(Xi) ) ]
                       = 2 − 2 ∏_{i=1}^n E[ √( (Pn/Qn)(Xi) ) ] = 2 − 2 ( E[ √(Pn/Qn) ] )^n
                       = 2 − 2 ( 1 − (1/2) H²(Pn, Qn) )^n.    (7.23)
We now use (7.23) to conclude the proof. Recall from (7.21) that TV(Pn^{⊗n}, Qn^{⊗n}) → 0 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 0, which happens precisely when H²(Pn, Qn) = o(1/n). Similarly, by (7.22), TV(Pn^{⊗n}, Qn^{⊗n}) → 1 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 2, which is further equivalent to H²(Pn, Qn) = ω(1/n).
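Identity (7.23) is easy to check numerically; the following Python sketch enumerates the product space for a made-up pair of pmfs and small n.

import numpy as np
from itertools import product

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

def hellinger_sq(a, b):
    return float(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

h2 = hellinger_sq(p, q)
for n in (1, 2, 3, 4):
    pn = np.array([np.prod(p[list(idx)]) for idx in product(range(3), repeat=n)])
    qn = np.array([np.prod(q[list(idx)]) for idx in product(range(3), repeat=n)])
    lhs = hellinger_sq(pn, qn)               # H^2 of the n-fold products
    rhs = 2 - 2 * (1 - h2 / 2) ** n          # right-hand side of (7.23)
    print(n, lhs, rhs)                       # the two columns should agree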
While some other f-divergences also satisfy tensorization, see Section 7.12, the H2 has the advan-
tage of a sandwich bound (7.20) making it the most convenient tool for checking asymptotic
testability of hypotheses.
Remark 7.8 (Kakutani’s dichotomy). Let P = ∏_{i≥1} Pi and Q = ∏_{i≥1} Qi, where Pi ≪ Qi. Kakutani’s theorem shows the following dichotomy between these two distributions on the infinite sequence space:
• If Σ_{i≥1} H²(Pi, Qi) = ∞, then P and Q are mutually singular.
• If Σ_{i≥1} H²(Pi, Qi) < ∞, then P and Q are equivalent (i.e. absolutely continuous with respect to each other).
In the Gaussian case, say, Pi = N(μi, 1) and Qi = N(0, 1), the equivalence condition simplifies to Σ μi² < ∞.
To understand Kakutani’s criterion, note that by the tensorization property (7.24), we have
  H²(P, Q) = 2 − 2 ∏_{i≥1} ( 1 − H²(Pi, Qi)/2 ).
Thus, if ∏_{i≥1} (1 − H²(Pi, Qi)/2) = 0, or equivalently, Σ_{i≥1} H²(Pi, Qi) = ∞, then H²(P, Q) = 2, which, by (7.20), is equivalent to TV(P, Q) = 1 and hence P ⊥ Q. If Σ_{i≥1} H²(Pi, Qi) < ∞, then H²(P, Q) < 2. To conclude the equivalence between P and Q, note that the likelihood ratio dP/dQ = ∏_{i≥1} dPi/dQi satisfies that either Q(dP/dQ = 0) = 0 or 1 by Kolmogorov’s 0-1 law. See [108, Theorem 5.3.5] for details.
Proof. It suffices to consider the natural logarithm for the KL divergence. First we show that, by
the data processing inequality, it suffices to prove the result for Bernoulli distributions. For any
event E, let Y = 1{X∈E} which is Bernoulli with parameter P(E) or Q(E). By the DPI, D(PkQ) ≥
d(P(E)kQ(E)). If Pinsker’s inequality holds for all Bernoulli distributions, we have
  √( (1/2) D(P‖Q) ) ≥ TV(Ber(P(E)), Ber(Q(E))) = |P(E) − Q(E)|.
Taking the supremum over E gives √( (1/2) D(P‖Q) ) ≥ sup_E |P(E) − Q(E)| = TV(P, Q), in view of Theorem 7.7.
The binary case follows easily from a second-order Taylor expansion (with integral remainder form) of p ↦ d(p‖q):
  d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)².
Pinsker’s inequality is sharp in the sense that the constant (2 log e) in (7.25) is not improvable, i.e., there exist {Pn, Qn}, e.g., Pn = Ber(1/2 + 1/n) and Qn = Ber(1/2), such that the ratio of the two sides of (7.25) tends to 1 as n → ∞.
(This is best seen by inspecting the local quadratic behavior in Proposition 2.19.) Nevertheless,
this does not mean that the inequality (7.25) is not improvable, as the RHS can be replaced by some
other function of TV(P, Q) with additional higher-order terms. Indeed, several such improvements
of Pinsker’s inequality are known. But what is the best inequality? In addition, another natural
question is the reverse inequality: can we upper-bound D(PkQ) in terms of TV(P, Q)? Settling
these questions rests on characterizing the joint range (the set of possible values) of a given pair of
f-divergences. This systematic approach to comparing f-divergences (as opposed to the ad hoc
proof of Theorem 7.9 we presented above) is the subject of this section.
Definition 7.10 (Joint range). Consider two f-divergences Df (PkQ) and Dg (PkQ). Their joint
range is a subset of [0, ∞]2 defined by
As an example, Fig. 7.1 gives the joint range R between the KL divergence and the total varia-
tion. By definition, the lower boundary of the region R gives the optimal refinement of Pinsker’s
inequality:
Also from Fig. 7.1 we see that it is impossible to bound D(PkQ) from above in terms of TV(P, Q)
due to the lack of upper boundary.
The joint range R may appear difficult to characterize since we need to consider P, Q over
all measurable spaces; on the other hand, the region Rk for small k is easy to obtain (at least
numerically). Revisiting the proof of Pinsker’s inequality in Theorem 7.9, we see that the key
step is the reduction to Bernoulli distributions. It is natural to ask: to obtain full joint range is it
possible to reduce to the binary case? It turns out that it is always sufficient to consider quaternary
distributions, or the convex hull of that of binary distributions.
Figure 7.1 Joint range of TV and KL divergence. The dashed line is the quadratic lower bound given by
Pinsker’s inequality (7.25).
R = co(R2 ) = R4 .
where co denotes the convex hull with a natural extension of convex operations to [0, ∞]2 .
We will rely on the following famous result from convex analysis (cf. e.g. [110, Chapter 2,
Theorem 18]).
• Claim 1: co(R2 ) ⊂ R4 ;
• Claim 2: Rk ⊂ co(R2 );
• Claim 3: R = R4 .
Note that Claims 1-2 prove the most interesting part: ∪_{k=1}^∞ Rk = co(R2). Claim 3 is more technical and its proof can be found in [155]. However, the approximation result in Theorem 7.6 shows that R is the closure of ∪_{k=1}^∞ Rk. Thus for the purpose of obtaining inequalities between Df and Dg, Claims 1-2 are sufficient.
i i
i i
i i
We start with Claim 1. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space
X and given any α ∈ [0, 1], define two joint distributions of the random variables (X, B) where
PB = QB = Ber(α), PX|B=i = Pi and QX|B=i = Qi for i = 0, 1. Then by (7.8) we get
R2 = R̃2 ∪ {(pf′ (∞), pg′ (∞)) : p ∈ (0, 1]} ∪ {(qf(0), qg(0)) : q ∈ (0, 1]} ,
Since (0, 0) ∈ R̃2, we see that regardless of which of f(0), f′(∞), g(0), g′(∞) are infinite, the set R2 ∩ ℝ² is connected. Thus, by Lemma 7.12 any point in co(R2 ∩ ℝ²) is a combination of two points in R2 ∩ ℝ², which, by the argument above, is a subset of R4. Finally, it is not hard to see that co(R2) \ ℝ² ⊂ R4, which concludes the proof of co(R2) ⊂ R4.
Next, we prove Claim 2. Fix P, Q on [k] and denote their PMFs (pj ) and (qj ), respectively. Note
that without changing either Df (PkQ) or Dg (PkQ) (but perhaps, by increasing k by 1), we can
make qj > 0 for j > 1 and q1 = 0, which we thus assume. Denote ϕj = pj/qj for j > 1 and consider the set
  S = { Q̃ = (q̃j)_{j∈[k]} : q̃j ≥ 0, Σ_j q̃j = 1, q̃1 = 0, Σ_{j=2}^k q̃j ϕj ≤ 1 }.
affinely maps S to [0, ∞] (note that f(0) or f′ (∞) can equal ∞). In particular, if we denote P̃i =
P̃(Q̃i ) corresponding to Q̃i in decomposition (7.26), we get
  Df(P ‖ Q) = Σ_{i=1}^m αi Df(P̃i ‖ Q̃i),
and similarly for Dg (PkQ). We are left to show that (P̃i , Q̃i ) are supported on at most two points,
which verifies that any element of Rk is a convex combination of k elements of R2 . Indeed, for
Q̃ ∈ Se the set {j ∈ [k] : q̃j > 0 or p̃j > 0} has cardinality at most two (for the second type
extremal points we notice p̃j1 + p̃j2 = 1 implying p̃1 = 0). This concludes the proof of Claim
2.
shown as non-convex grey region in Fig. 7.2. By Theorem 7.11, their full joint range R is the
convex hull of R2 , which turns out to be exactly described by the sandwich bound (7.20) shown
earlier in Section 7.3. This means that (7.20) is not improvable. Indeed, with t ranging from 0 to
1,
• the upper boundary is achieved by P = Ber((1+t)/2), Q = Ber((1−t)/2),
• the lower boundary is achieved by P = (1 − t, t, 0), Q = (1 − t, 0, t).
Figure 7.2 The joint range R of TV and H2 is characterized by (7.20), which is the convex hull of the grey
region R2 .
where we take the natural logarithm. Here is a corollary (weaker bound) due to [316]:
  D(P‖Q) ≥ log( (1 + TV(P, Q))/(1 − TV(P, Q)) ) − 2TV(P, Q)/(1 + TV(P, Q)).    (7.28)
Both bounds are stronger than Pinsker’s inequality (7.25). Note the following consequences:
where the function f is a convex increasing bijection of [0, 1) onto [0, ∞). Furthermore, for every
s ≥ f(t) there exists a pair of distributions such that χ2 (PkQ) = s and TV(P, Q) = t.
  TV(Ber(p), Ber(q)) = |p − q| ≜ t,    χ²(Ber(p)‖Ber(q)) = (p − q)²/(q(1 − q)) = t²/(q(1 − q)).
Given |p − q| = t, let us determine the possible range of q(1 − q). The smallest value of q(1 − q) is always 0, by choosing p = t, q = 0. The largest value is 1/4 if t ≤ 1/2 (by choosing p = 1/2 − t, q = 1/2). If t > 1/2 then we can at most get t(1 − t) (by setting p = 0 and q = t). Thus we get χ²(Ber(p)‖Ber(q)) ≥ f(|p − q|) as claimed. The convexity of f follows since its derivative is monotonically increasing. Clearly, f(t) ≥ 4t² because t(1 − t) ≤ 1/4.
• KL vs TV: see (7.27). For discrete distributions there is a partial comparison in the other direction (“reverse Pinsker”, cf. [272, Section VI]):
  D(P‖Q) ≤ log( 1 + (2/Qmin) TV(P, Q)² ) ≤ (2 log e/Qmin) TV(P, Q)²,    Qmin = min_x Q(x).
• KL vs Hellinger:
  D(P‖Q) ≥ 2 log( 2/(2 − H²(P, Q)) ).    (7.30)
This is tight at P = Ber(0), Q = Ber(q). For a fixed H², in general D(P‖Q) has no finite upper bound, as seen from P = Ber(p), Q = Ber(0). Therefore (7.30) gives the joint range.
There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami-Beckner semigroup, cf. [89, Theorem A.1]):
  D(P‖Q) ≤ ( log(1/Qmin − 1)/(1 − 2Qmin) ) ( 1 − (1 − H²(P, Q))² ),    Qmin = min_x Q(x).
• χ2 and TV: The full joint range is given by (7.29). Two simple consequences are:
  TV(P, Q) ≤ (1/2) √(χ²(P‖Q)),    (7.34)
  TV(P, Q) ≤ max{ 1/2, χ²(P‖Q)/(1 + χ²(P‖Q)) },    (7.35)
where the second is useful for bounding TV away from one.
• JS and TV: The full joint region is given by
  2 d( (1 − TV(P, Q))/2 ‖ 1/2 ) ≤ JS(P, Q) ≤ TV(P, Q) · 2 log 2.    (7.36)
The lower bound is a consequence of Fano’s inequality. For the upper bound notice that for p, q ∈ [0, 1] and |p − q| = τ the maximum of d(p ‖ (p+q)/2) is attained at p = 0, q = τ (from the convexity of d(·‖·)) and, thus, the binary joint range is given by τ ↦ d(τ‖τ/2) + d(1 − τ ‖ 1 − τ/2). Since the latter is convex, its concave envelope is a straight line connecting the endpoints at τ = 0 and τ = 1.
1 Total variation:
  TV(N(0, σ²), N(μ, σ²)) = 2Φ(|μ|/(2σ)) − 1 = ∫_{−|μ|/(2σ)}^{|μ|/(2σ)} φ(x) dx = |μ|/(√(2π) σ) + O(μ²), μ → 0.    (7.37)
2 Hellinger distance:
  H²(N(0, σ²) ‖ N(μ, σ²)) = 2 − 2 e^{−μ²/(8σ²)} = μ²/(4σ²) + O(μ³), μ → 0.    (7.38)
More generally,
  H²(N(μ1, Σ1) ‖ N(μ2, Σ2)) = 2 − 2 ( |Σ1|^{1/4} |Σ2|^{1/4} / |Σ̄|^{1/2} ) exp{ −(1/8) (μ1 − μ2)′ Σ̄^{−1} (μ1 − μ2) },
where Σ̄ = (Σ1 + Σ2)/2.
3 KL divergence:
  D(N(μ1, σ1²) ‖ N(μ2, σ2²)) = (1/2) log(σ2²/σ1²) + (1/2)( (μ1 − μ2)²/σ2² + σ1²/σ2² − 1 ) log e.    (7.39)
Proof. Note that If (U; X) = Df (PU,X kPU PX ) ≥ Df (PU,Y kPU PY ) = If (U; Y), where we
applied the data-processing Theorem 7.4 to the (possibly stochastic) map (U, X) 7→ (U, Y). See
also Remark 3.4.
is not subadditive.
In other words, an additional observation does not improve TV-information at all. This is the
main reason for the famous herding effect in economics [20].
3 The symmetric KL-divergence4 ISKL (X; Y) ≜ D(PX,Y kPX PY ) + D(PX PY kPX,Y ) satisfies, quite
amazingly [189], the additivity property:
Let us prove this in the discrete case. First notice the following equivalent expression for ISKL :
  ISKL(X; Y) = Σ_{x,x′} PX(x) PX(x′) D(PY|X=x ‖ PY|X=x′).    (7.46)
From (7.46) we get (7.45) by the additivity D(PA,B|X=x ‖ PA,B|X=x′) = D(PA|X=x ‖ PA|X=x′) + D(PB|X=x ‖ PB|X=x′). To prove (7.46) first consider the obvious identity:
  Σ_{x,x′} PX(x) PX(x′) [ D(PY ‖ PY|X=x′) − D(PY ‖ PY|X=x) ] = 0,
which is rewritten as
  Σ_{x,x′} PX(x) PX(x′) Σ_y PY(y) log( PY|X(y|x)/PY|X(y|x′) ) = 0.    (7.47)
Next, by definition,
  ISKL(X; Y) = Σ_{x,y} [PX,Y(x, y) − PX(x)PY(y)] log( PX,Y(x, y)/(PX(x)PY(y)) ).
Since the marginals of PX,Y and PX PY coincide, we can replace log( PX,Y(x, y)/(PX(x)PY(y)) ) = log( PY|X(y|x)/PY(y) ) by log( PY|X(y|x)/f(y) ) for any f. We choose f(y) = PY|X(y|x′) to get
  ISKL(X; Y) = Σ_{x,y} [PX,Y(x, y) − PX(x)PY(y)] log( PY|X(y|x)/PY|X(y|x′) ).
Now averaging this over PX(x′) and applying (7.47) to get rid of the second term in [· · ·], we obtain (7.46). For another interesting property of ISKL, see Ex. I.43.
4
This is the f-information corresponding to the Jeffreys divergence D(PkQ) + D(QkP).
  P̂n = (1/n) Σ_{i=1}^n δ_{Xi}
denote the empirical distribution corresponding to this sample. Let PY = PY|X ◦ PX be the output distribution corresponding to PX and PY|X ◦ P̂n be the output distribution corresponding to P̂n (a random distribution). Note that when PY|X=x(·) = ϕ(· − x), where ϕ is a fixed density, we can think of PY|X ◦ P̂n as a kernel density estimator (KDE), whose density is p̂n(x) = (ϕ ∗ P̂n)(x) = (1/n) Σ_{i=1}^n ϕ(x − Xi). Furthermore, using the fact that E[PY|X ◦ P̂n] = PY, we have
where the first term represents the bias of the KDE due to convolution and increases with band-
width of ϕ, while the second term represents the variability of the KDE and decreases with the
bandwidth of ϕ. Surprisingly, the second term is sharply (within a factor of two) given by the Iχ² information. More exactly, we prove the following result.
In Section 25.4* we will discuss an extension of this simple bound, in particular showing that
in many cases about n = exp{I(X; Y) + K} samples are sufficient to get e−O(K) bound on D(PY|X ◦
P̂n kPY ).
  I(X^n; Ȳ) ≥ Σ_{i=1}^n I(Xi; Ȳ) = n I(X1; Ȳ),
and apply the local expansion of KL divergence (Proposition 2.19) to get (7.49).
In the discrete case, by taking PY|X to be the identity (Y = X) we obtain the following guarantee
on the closeness between the empirical and the population distribution. This fact can be used to
test whether the sample was truly generated by the distribution PX .
Otherwise, we have
  E[D(P̂n ‖ PX)] ≤ (|X| − 1) (log e)/n.    (7.52)
  lim_{n→∞} n E[D(P̂n ‖ PX)] = (|supp(PX)| − 1) (log e)/2.    (7.53)
See Lemma 13.2 below.
Corollary 7.16 is also useful for the statistical application of entropy estimation. Given n iid
samples, a natural estimator of the entropy of PX is the empirical entropy Ĥemp = H(P̂n ) (plug-in
estimator). It is clear that empirical entropy is an underestimate, in the sense that the bias
is always non-negative. For fixed PX , Ĥemp is known to be consistent even on countably infinite
alphabets [14], although the convergence rate can be arbitrarily slow, which aligns with the con-
clusion of (7.51). However, for a large alphabet of size Θ(n), the upper bound (7.52) does not vanish (this is tight for, e.g., the uniform distribution). In this case, one needs to de-bias the empirical entropy (e.g. on the basis of (7.53)) or employ different techniques in order to achieve consistent estimation.
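A small Monte Carlo sketch in Python of the bias terms in (7.52)–(7.53), for a uniform PX on k symbols (k and the sample sizes are made up): n E[D(P̂n‖PX)] should approach (k − 1)/2 nats and stay below (k − 1) nats.

import numpy as np

rng = np.random.default_rng(0)
k = 5
p = np.full(k, 1.0 / k)

def kl_nats(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

for n in (10, 100, 1000):
    vals = []
    for _ in range(2000):
        counts = rng.multinomial(n, p)
        vals.append(kl_nats(counts / n, p))
    print(n, n * np.mean(vals), (k - 1) / 2, k - 1)   # middle value -> (k-1)/2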
Theorem 7.17. Suppose that Df(P‖Q) < ∞ and the derivative of f(x) at x = 1 exists. Then
  lim_{λ→0} (1/λ) Df(λP + λ̄Q ‖ Q) = (1 − P[supp(Q)]) f′(∞),
where as usual we take 0 · ∞ = 0.
Remark 7.10. Note that we do not need a separate theorem for Df (QkλP + λ̄Q) since the exchange
of arguments leads to another f-divergence with f(x) replaced by xf(1/x).
Proof. Without loss of generality we may assume f(1) = f′(1) = 0 and f ≥ 0. Then, decomposing P = μP1 + μ̄P0 with P0 ⊥ Q and P1 ≪ Q, we have
  (1/λ) Df(λP + λ̄Q ‖ Q) = μ̄ f′(∞) + (1/λ) ∫ dQ f( 1 + λ( μ dP1/dQ − 1 ) ).
Note that g(λ) = f(1 + λt) is positive and convex for every t ∈ R and hence (1/λ) g(λ) is monotonically decreasing to g′(0) = 0 as λ ↓ 0. Since for λ = 1 the integrand is assumed to be Q-integrable, the dominated convergence theorem applies and we get the result.
If χ²(P‖Q) < ∞, then Df(λ̄Q + λP ‖ Q) < ∞ for all 0 ≤ λ < 1 and
  lim_{λ→0} (1/λ²) Df(λ̄Q + λP ‖ Q) = (f′′(1)/2) χ²(P‖Q).    (7.54)
If χ²(P‖Q) = ∞ and f′′(1) > 0 then (7.54) also holds, i.e. Df(λ̄Q + λP ‖ Q) = ω(λ²).
Remark 7.11. Conditions of the theorem include D, DSKL, H², JS, LC and all Rényi-type divergences, with f(x) = (x^p − 1)/(p − 1), of orders p < 2. A similar result holds also for the case when f′′(x) → ∞ as x → +∞ (e.g. Rényi-type divergences with p > 2), but then we need to make extra assumptions in order to guarantee applicability of the dominated convergence theorem (often just the finiteness of Df(P‖Q) is sufficient).
Proof. Assuming that χ²(P‖Q) < ∞, we must have P ≪ Q and hence we can use (7.1) as the definition of Df. Note that under (7.1), without loss of generality we may assume f′(1) = f(1) = 0 (indeed, for that we can just add a multiple of (x − 1) to f(x), which does not change the value of Df(P‖Q)). From the Taylor expansion we have then
  f(1 + u) = u² ∫_0^1 (1 − t) f′′(1 + tu) dt.
We can see that with the exception of TV, other f-divergences behave quadratically under small
displacement t → 0. This turns out to be a general fact, and furthermore the coefficient in front
of t2 is given by the Fisher information (at t = 0). To proceed carefully, we need some technical
assumptions on the family Pt .
Definition 7.19 (Regular single-parameter families). Fix τ > 0, space X and a family Pt of
distributions on X , t ∈ [0, τ ). We define the following types of conditions that we call regularity
at t = 0:
(a) Pt (dx) = pt (x) μ(dx), for some measurable (t, x) 7→ pt (x) ∈ R+ and a fixed measure μ on X ;
(b0) There exists a measurable function (s, x) ↦ ṗs(x), s ∈ [0, τ), x ∈ X, such that for μ-almost every x0 we have ∫_0^τ |ṗs(x0)| ds < ∞ and
  pt(x0) = p0(x0) + ∫_0^t ṗs(x0) ds.    (7.57)
Furthermore, for μ-almost every x0 we have lim_{t↘0} ṗt(x0) = ṗ0(x0).
(b1) We have ṗt(x) = 0 whenever p0(x) = 0 and, furthermore,
  ∫_X μ(dx) sup_{0≤t<τ} (ṗt(x))²/p0(x) < ∞.    (7.58)
(c0) There exists a measurable function (s, x) ↦ ḣs(x), s ∈ [0, τ), x ∈ X, such that for μ-almost every x0 we have ∫_0^τ |ḣs(x0)| ds < ∞ and
  ht(x0) ≜ √(pt(x0)) = √(p0(x0)) + ∫_0^t ḣs(x0) ds.    (7.59)
Furthermore, for μ-almost every x0 we have lim_{t↘0} ḣt(x0) = ḣ0(x0).
(c1 ) The family of functions {(ḣt (x))2 : t ∈ [0, τ )} is uniformly μ-integrable.
Remark 7.12. Recall that the uniform integrability condition (c1 ) is implied by the following
stronger (but easier to verify) condition:
  ∫_X μ(dx) sup_{0≤t<τ} (ḣt(x))² < ∞.    (7.60)
Impressively, if one also assumes the continuous differentiability of ht then the uniform integrability condition becomes equivalent to the continuity of the Fisher information
  t ↦ JF(t) ≜ 4 ∫ μ(dx) (ḣt(x))².    (7.61)
Theorem 7.20. Let the family of distributions {Pt : t ∈ [0, τ )} satisfy the conditions (a), (b0 ) and
(b1 ) in Definition 7.19. Then we have
  D(Pt ‖ P0) = (log e/2) JF(0) t² + o(t²),    (7.63)
where JF(0) ≜ ∫_X μ(dx) (ṗ0(x))²/p0(x) < ∞ is the Fisher information at t = 0.
Proof. From assumption (b1) we see that for any x0 with p0(x0) = 0 we must have ṗt(x0) = 0 and thus pt(x0) = 0 for all t ∈ [0, τ). Hence, we may restrict all integrals below to the subset {x : p0(x) > 0}, on which the ratio (pt(x0) − p0(x0))²/p0(x0) is well-defined. Consequently, we have by (7.57)
  (1/t²) χ²(Pt ‖ P0) = (1/t²) ∫ μ(dx) (pt(x) − p0(x))²/p0(x)
                     = (1/t²) ∫ μ(dx) (1/p0(x)) ( t ∫_0^1 du ṗ_{tu}(x) )²
                    (a)= ∫ μ(dx) ∫_0^1 du1 ∫_0^1 du2 ṗ_{tu1}(x) ṗ_{tu2}(x) / p0(x).
The first fraction inside the bracket is between 0 and 1 and the second is bounded by sup_{0<t<τ} (ṗt(X)/p0(X))², which is P0-integrable by (b1). Thus, the dominated convergence theorem applies to the double integral
Remark 7.13. Theorem 7.20 extends to the case of multi-dimensional parameters as follows.
Define the Fisher information matrix at θ ∈ Rd :
  JF(θ) ≜ 4 ∫ μ(dx) ∇θ √(pθ(x)) ∇θ √(pθ(x))^⊤.    (7.67)
Then (7.62) becomes χ²(Pt ‖ P0) = t^⊤ JF(0) t + o(‖t‖²) as t → 0, and similarly for (7.63), which has previously appeared in (2.33).
Theorem 7.20 applies to many cases (e.g. to smooth subfamilies of exponential families, for
which one can take μ = P0 and p0 (x) ≡ 1), but it is not sufficiently general. To demonstrate the
issue, consider the following example.
Example 7.1 (Location families with compact support). We say that family Pt is a (scalar) location
family if X = R, μ = Leb and pt (x) = p0 (x − t). Consider the following example, for α > −1:
  p0(x) = Cα × { x^α for x ∈ [0, 1];  (2 − x)^α for x ∈ [1, 2];  0 otherwise },
with Cα chosen by normalization. Clearly, here condition (7.58) is not satisfied and both χ²(Pt‖P0) and D(Pt‖P0) are infinite for t > 0, since Pt is not absolutely continuous with respect to P0. But JF(0) < ∞ whenever α > 1 and thus one expects that a certain remedy should be possible. Indeed, one can compute those f-divergences that remain finite when Pt is not absolutely continuous with respect to P0 and find that for α > 1 they are quadratic in t. As an illustration, we have
  H²(Pt, P0) = Θ(t^{1+α}) for 0 ≤ α < 1,   Θ(t² log(1/t)) for α = 1,   Θ(t²) for α > 1,    (7.68)
as t → 0. This can be computed directly, or from a more general result of [162, Theorem VI.1.1].⁵
⁵ The statistical significance of this calculation is that if we were to estimate the location parameter t from n iid samples, then the precision δ*n of the optimal estimator up to constant factors is given by solving H²(P_{δ*n}, P0) ≍ 1/n, cf. [162, Chapter VI]. For α < 1 we have δ*n ≍ n^{−1/(1+α)}, which is notably better than the empirical mean estimator (attaining precision of only n^{−1/2}). For α = 1/2 this fact was noted by D. Bernoulli in 1777 as a consequence of his (newly proposed) maximum likelihood estimation.
The previous example suggests that quadratic behavior as t → 0 can hold even when Pt is not absolutely continuous with respect to P0, which is the case handled by the next (more technical) result, whose proof we placed in Section 7.14*. One can verify that condition (c1) is indeed satisfied for all α > 1 in Example 7.1, thus establishing the quadratic behavior. Also note that the stronger (7.60) only applies to α ≥ 2.
Theorem 7.21. Given a family of distributions {Pt : t ∈ [0, τ)} satisfying the conditions (a), (c0) and (c1) of Definition 7.19, we have
  χ²(Pt ‖ ϵ̄P0 + ϵPt) = t² ϵ̄² [ JF(0) + ((1 − 4ϵ)/ϵ) J#(0) ] + o(t²),  ∀ϵ ∈ (0, 1),    (7.69)
  H²(Pt, P0) = (t²/4) JF(0) + o(t²),    (7.70)
where JF(0) = 4 ∫ ḣ0² dμ < ∞ is the Fisher information and J#(0) = ∫ ḣ0² 1{h0=0} dμ can be called the Fisher defect at t = 0.
Example 7.2 (On Fisher defect). Note that in most cases of interest we will have the situation that t ↦ ht(x) is actually differentiable for all t in some two-sided neighborhood (−τ, τ) of 0. In such cases, h0(x) = 0 implies that t = 0 is a local minimum and thus ḣ0(x) = 0, implying that the defect J#(0) = 0. However, for other families this will not be so, sometimes even when pt(x) is smooth on t ∈ (−τ, τ) (but not ht). Here is such an example.
Consider Pt = Ber(t²). A straightforward calculation shows:
  χ²(Pt ‖ ϵ̄P0 + ϵPt) = (ϵ̄²/ϵ) t² + O(t⁴),    H²(Pt, P0) = 2(1 − √(1 − t²)) = t² + O(t⁴).
Note that if we view Pt as a family on t ∈ [0, τ ) for small τ , then all conditions (a), c0 ) and (c1 ) are
clearly satisfied (ḣt is bounded on t ∈ (−τ, τ )). We have JF (0) = 4 and J# (0) = 1 and thus (7.69)
recovers the correct expansion for χ2 and (7.70) for H2 .
Notice that the non-smoothness of ht only becomes visible if we extend the domain to t ∈
(−τ, τ ). In fact, this issue is not seen in terms of densities pt . Indeed, let us compute the density pt
and its derivative ṗt explicitly too:
  pt(x) = { 1 − t² for x = 0;  t² for x = 1 },    ṗt(x) = { −2t for x = 0;  2t for x = 1 }.
is discontinuous at t = 0. To make things worse, at t = 0 this expectation does not match our
definition of the Fisher information JF (0) in Theorem 7.21, and thus does not yield the correct
small-t behavior for either χ2 or H2 . In general, to avoid difficulties one should restrict to those
families with t 7→ ht (x) continuously differentiable in t ∈ (−τ, τ ).
where EQ [·] is understood as an f-divergence Df (PkQ) with f(x) = xλ , see Definition 7.1.
Conditional Rényi divergence is defined as
Numerous properties of Rényi divergences are known, see [319]. Here we only notice a few:
which is a simple consequence of (7.71). Dλ ’s are the only divergences satisfying DPI and
tensorization [222]. The most well-known special cases of (7.73) are for Hellinger distance,
see (7.24) and for χ2 :
  1 + χ²( ∏_{i=1}^n Pi ‖ ∏_{i=1}^n Qi ) = ∏_{i=1}^n ( 1 + χ²(Pi ‖ Qi) ).
We can also obtain additive bounds for non-product distributions, see Ex. I.32 and I.33.
The following consequence of the chain rule will be crucial in statistical applications later (see
Section 32.2, in particular, Theorem 32.7).
Proposition 7.23. Consider product channels PYn|Xn = ∏ PYi|Xi and QYn|Xn = ∏ QYi|Xi. We have (with all optimizations over all possible distributions)
  inf_{PXn,QXn} Dλ(PYn ‖ QYn) = Σ_{i=1}^n inf_{PXi,QXi} Dλ(PYi ‖ QYi),    (7.74)
  sup_{PXn,QXn} Dλ(PYn ‖ QYn) = Σ_{i=1}^n sup_{PXi,QXi} Dλ(PYi ‖ QYi) = Σ_{i=1}^n sup_{x,x′} Dλ(PYi|Xi=x ‖ QYi|Xi=x′).    (7.75)
Remark 7.14. The mnemonic for (7.76)-(7.77) is that “mixtures of products are less distinguishable than products of mixtures”. The former arise in statistical settings where iid observations are drawn from a single distribution whose parameter is drawn from a prior.
Proof. The second equality in (7.75) follows from the fact that Dλ is an increasing function
of an f-divergence, and thus maximization should be attained at an extreme point of the space
of probabilities, which are just the single-point masses. The main equalities (7.74)-(7.75) follow
from a) restricting optimizations to product distributions and invoking (7.73); and b) the chain rule
for Dλ . For example for n = 2, we fix PX2 and QX2 , which (via channels) induce joint distributions
PX2 ,Y2 and QX2 ,Y2 . Then we have
since PY1 |Y2 =y is a distribution induced by taking P̃X1 = PX1 |Y2 =y , and similarly for QY1 |Y2 =y′ . In
all, we get
  Dλ(PY1,Y2 ‖ QY1,Y2) = Dλ(PY2 ‖ QY2) + Dλ(PY1|Y2 ‖ QY1|Y2 | P(λ)_{Y2}) ≥ Σ_{i=1}^2 inf_{PXi,QXi} Dλ(PYi ‖ QYi),
Denote the domain of f* by dom(f*) ≜ {y : f*(y) < ∞}. Two important properties of the convex conjugates are
  f(x) + f*(y) ≥ xy.
Similarly, we can define a convex conjugate for any convex functional Ψ(P) defined on the space of measures, by setting
  Ψ*(g) = sup_P ∫ g dP − Ψ(P).    (7.79)
Under appropriate conditions (e.g. finite X ), biconjugation then yields the sought-after variational
representation
  Ψ(P) = sup_g ∫ g dP − Ψ*(g).    (7.80)
Next we compute these conjugates for Ψ(P) = Df(P‖Q). It turns out to be convenient to first extend the definition of Df(P‖Q) to all finite signed measures P and then compute the conjugate. To this end, let fext : R → R ∪ {+∞} be an extension of f, such that fext(x) = f(x) for x ≥ 0 and fext is convex on R. In general, we can always choose fext(x) = ∞ for all x < 0. In special cases, e.g. f(x) = |x − 1|/2 or f(x) = (x − 1)², we can directly take fext(x) = f(x) for all x. Now we can define Df(P‖Q) for all signed measures P in the same way as in Definition 7.1, using fext in place of f.
For each choice of fext we have a variational representation of f-divergence:
Theorem 7.24. Let P and Q be probability measures on X. Fix an extension fext of f and let f*ext be the conjugate of fext, i.e., f*ext(y) = sup_{x∈R} xy − fext(x). Denote dom(f*ext) ≜ {y : f*ext(y) < ∞}. Then
  Df(P‖Q) = sup_{g: X → dom(f*ext)} EP[g(X)] − EQ[f*ext(g(X))],    (7.81)
where the supremum can be taken either (a) over all simple g or (b) over all g satisfying EQ[f*ext(g(X))] < ∞.
We remark that when P Q then both results (a) and (b) also hold for supremum over g :
X → R, i.e. without restricting g(x) ∈ dom(f∗ext ).
As a consequence of the variational characterization, we get the following properties for f-
divergences:
1 Convexity: First of all, note that Df (PkQ) is expressed as a supremum of affine functions (since
the expectation is a linear operation). As a result, we get that (P, Q) 7→ Df (PkQ) is convex,
which was proved previously in Theorem 7.5 using different method.
2 Weak lower semicontinuity: Recall the example in Remark 4.5, where {Xi } are i.i.d. Rademach-
ers (±1), and
  (1/√n) Σ_{i=1}^n Xi → N(0, 1) in distribution
by the central limit theorem; however, by Proposition 7.2, for all n,
  Df( P_{(X1+X2+...+Xn)/√n} ‖ N(0, 1) ) = f(0) + f′(∞) > 0,
since the former distribution is discrete and the latter is continuous. Therefore similar to the
KL divergence, the best we can hope for f-divergence is semicontinuity. Indeed, if X is a nice
space (e.g., Euclidean space), in (7.81) we can restrict the function g to continuous bounded
functions, in which case Df (PkQ) is expressed as a supremum of weakly continuous functionals
(note that f* ◦ g is also continuous and bounded since f* is continuous) and is hence weakly lower semicontinuous, i.e., for any sequence of distributions Pn and Qn such that Pn → P and Qn → Q weakly, we have
which previously appeared in (7.16). A similar calculation for the squared Hellinger distance yields (after changing from g to h = 1 − g in (7.81)):
  H²(P, Q) = 2 − inf_{h>0} ( EP[h] + EQ[1/h] ).
Example 7.4 (χ²-divergence). For the χ²-divergence we have f(x) = (x − 1)². Take fext(x) = (x − 1)², whose conjugate is f*ext(y) = y + y²/4. Applying (7.81) yields
  χ²(P‖Q) = sup_{g: X→R} EP[g(X)] − EQ[ g(X) + g²(X)/4 ]    (7.83)
          = sup_{g: X→R} 2 EP[g(X)] − EQ[g²(X)] − 1,    (7.84)
i i
i i
i i
Example 7.5 (KL-divergence). In this case we have f(x) = x log x. Consider the extension fext (x) =
∞ for x < 0, whose convex conjugate is f∗ (y) = loge e exp(y). Hence (7.81) yields
Note that in the last example, the variational representation (7.86) we obtained for the KL
divergence is not the same as the Donsker-Varadhan identity in Theorem 4.6, that is,
D(PkQ) = sup EP [g(X)] − log EQ [exp{g(X)}] . (7.87)
g:X →R
In fact, (7.86) is weaker than (7.87) in the sense that for each choice of g, the obtained lower bound
on D(PkQ) in the RHS is smaller. Furthermore, regardless of the choice of fext , the Donsker-
Varadhan representation can never be obtained from Theorem 7.24 because, unlike (7.87), the
second term in (7.81) is always linear in Q. It turns out if we define Df (PkQ) = ∞ for all non-
probability measure P, and compute its convex conjugate, we obtain in the next theorem a different
type of variational representation, which, specialized to KL divergence in Example 7.5, recovers
exactly the Donsker-Varadhan identity.
Theorem 7.25. Consider the extension fext of f such that fext (x) = ∞ for x < 0. Let S = {x :
q(x) > 0} where q is as in (7.2). Then
Df (PkQ) = f′ (∞)P[Sc ] + sup EP [g1S ] − Ψ∗Q,P (g) , (7.88)
g
where
Ψ∗Q,P (g) ≜ inf EQ [f∗ext (g(X) − a)] + aP[S].
a∈R
′
In the special case f (∞) = ∞, we have
Df (PkQ) = sup EP [g] − Ψ∗Q (g), Ψ∗Q (g) ≜ inf EQ [f∗ext (g(X) − a)] + a. (7.89)
g a∈R
Remark 7.15 (Marton’s divergence). Recall that in Theorem 7.7 we have shown both the sup and
inf characterizations for the TV. Do other f-divergences also possess inf characterizations? The
only other known example (to us) is due to Marton. Let
Z 2
dP
Dm (PkQ) = dQ 1 − ,
dQ +
which is clearly an f-divergence with f(x) = (1 − x)2+ . We have the following [45, Lemma 8.3]:
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : X ∼ P, Y ∼ Q} ,
where the infimum is over all couplings of P and Q. See Ex. I.34.
Marton’s Dm divergence plays a crucial role in the theory of concentration of measure [45,
Chapter 8]. Note also that while Theorem 7.18 does not apply to Dm , due to the absence of twice
continuous differentiability, it does apply to the symmetrized Marton divergence Dsm (PkQ) ≜
Dm (PkQ) + Dm (QkP).
i i
i i
i i
122
Since |ġ(s, x)| ≤ C|hs (x)| for some C = C(ϵ), we have from Cauchy-Schwarz
Z Z
μ(dx)˙|g(s1 , x)ġ(s2 , x)| ≤ C2 sup μ(dx)ḣt (x)2 < ∞ . (7.94)
t X
where the last inequality follows from the uniform integrability assumption (c1 ). This implies that
Fubini’s theorem applies in (7.93) and we obtain
Z 1 Z 1 Z
L ( t) = du1 du2 G(tu1 , tu2 ) , G(s1 , s2 ) ≜ μ(dx)ġ(s1 , x)ġ(s2 , x) .
0 0
Notice that if a family of functions {fα (x) : α ∈ I} is uniformly square-integrable, then the family
{fα (x)fβ (x) : α ∈ I, β ∈ I} is uniformly integrable simply because apply |fα fβ | ≤ 21 (f2α + f2β ).
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 123
Consequently, from the assumption (c1 ) we see that the integral defining G(s1 , s2 ) allows passing
the limit over s1 , s2 inside the integral. From (7.92) we get as t → 0
Z
1 1 − 4ϵ #
G(tu1 , tu2 ) → G(0, 0) = μ(dx)ḣ0 (x) 4 · 1{h0 > 0} + 1{h0 = 0} = JF (0)+
2
J ( 0) .
ϵ ϵ
From (7.94) we see that G(s1 , s2 ) is bounded and thus, the bounded convergence theorem applies
and
Z 1 Z 1
lim du1 du2 G(tu1 , tu2 ) = G(0, 0) ,
t→0 0 0
which thus concludes the proof of L(t) → JF (0) and of (7.69) assuming facts about ϕ. Let us
verify those.
For simplicity, in the next paragraph we omit the argument x in h0 (x) and ϕ(·; x). A straightfor-
ward differentiation yields
h20 (1 − 2ϵ ) + 2ϵ h2
ϕ′ (h) = 2h .
(ϵ̄h20 + ϵh2 )3/2
h20 (1− ϵ2 )+ ϵ2 h2 1−ϵ/2
Since √ h
≤ √1
ϵ
and ϵ̄h20 +ϵh2
≤ 1−ϵ we obtain the finiteness of ϕ′ . For the continuity
ϵ̄h20 +ϵh2
of ϕ′ notice that if h0 > 0 then clearly the function is continuous, whereas for h0 = 0 we have
ϕ′ (h) = √1ϵ for all h.
We next proceed to the Hellinger distance. Just like in the argument above, we define
Z Z 1 Z 1
1
M(t) ≜ 2 H2 (Pt , P0 ) = μ(dx) du1 du2 ḣtu1 (x)ḣtu2 (x) .
t 0 0
R
Exactly as above from Cauchy-Schwarz and supt μ(dx)ḣt (x)2 < ∞ we conclude that Fubini
applies and hence
Z 1 Z 1 Z
M(t) = du1 du2 H(tu1 , tu2 ) , H(s1 , s2 ) ≜ μ(dx)ḣs1 (x)ḣs2 (x) .
0 0
Again, the family {ḣs1 ḣs2 : s1 ∈ [0, τ ), s2 ∈ [0, τ } is uniformly integrable and thus from c0 ) we
conclude H(tu1 , tu2 ) → 14 JF (0). Furthermore, similar to (7.94) we see that H(s1 , s2 ) is bounded
and thus
Z 1 Z 1
1
lim M(t) = du1 du2 lim H(tu1 , tu2 ) = JF (0) ,
t→ 0 0 0 t→ 0 4
concluding the proof of (7.70).
Proof of Theorem 7.6. The lower bound Df (PkQ) ≥ Df (PE kQE ) follows from the DPI. To prove
an upper bound, first we reduce to the case of f ≥ 0 by property 6 in Proposition 7.2. Then define
i i
i i
i i
124
sets S = suppQ, F∞ = { dQ
dP
= 0} and for a fixed ϵ > 0 let
dP
Fm = ϵm ≤ f < ϵ(m + 1) , m = 0, 1, . . . .
dQ
We have
X Z X
dP
ϵ mQ[Fm ] ≤ dQf ≤ϵ (m + 1)Q[Fm ] + f(0)Q[F∞ ]
m S dQ m
X
≤ϵ mQ[Fm ] + f(0)Q[F∞ ] + ϵ . (7.95)
m
Notice that on the interval I+m = {x > 1 : ϵm ≤ f(x) < ϵ(m + 1)} the function f is increasing and
on I−
m = { x ≤ 1 : ϵ m ≤ f ( x ) < ϵ(m + 1)} it is decreasing. Thus partition further every Fm into
−
Fm = { dQ ∈ Im } and Fm = { dQ
+ dP + dP
∈ I−
m }. Then, we see that
P[F±m]
f ≥ ϵm .
Q[ F ±
m]
− −
0 , F0 , . . . , Fn , Fn , F∞ , S , ∪m>n Fm }.
Consequently, for a fixed n define the partition consisting of sets E = {F+ + c
We next show that with sufficiently large n and sufficiently small ϵ the RHS of (7.96) approaches
Df (PkQ). If f(0)Q[F∞ ] = ∞ (and hence Df (PkQ) = ∞) then clearly (7.96) is also infinite. Thus,
assume thatf(0)Q[F∞ ] < ∞.
R
If S dQf dQdP
= ∞ then the sum over m on the RHS of (7.95) is also infinite, and hence for any
P
N > 0 there exists some n such that m≤n mQ[Fm ] ≥ N, thus showing that RHS for (7.96) can be
R
made arbitrarily large. Thus assume S dQf dQ dP
< ∞. Considering LHS of (7.95) we conclude
P
that for some large n we have m>n mQ[Fm ] ≤ 12 . Then, we must have again from (7.95)
X Z
dP 3
ϵ mQ[Fm ] + f(0)Q[F∞ ] ≥ dQf − ϵ.
S dQ 2
m≤n
Thus, we have shown that for arbitrary ϵ > 0 the RHS of (7.96) can be made greater than
Df (PkQ) − 32 ϵ.
Proof of Theorem 7.24. First, we show that for any g : X → dom(f∗ext ) we must have
Let p(·) and q(·) be the densities of P and Q. Then, from the definition of f∗ext we have for every x
s.t. q(x) > 0:
p ( x) p ( x)
f∗ext (g(x)) + fext ( ) ≥ g ( x) .
q ( x) q ( x)
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 125
where the last step follows is because the two sumprema combined is equivalent to the supremum
over all simple (finitely-valued) functions g.
Next, consider finite X . Let S = {x ∈ X : Q(x) > 0} denote the support of Q. We show the
following statement
Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))] + f′ (∞)P(Sc ), (7.100)
g:S→dom(f∗
ext )
Consider the functional Ψ(P) defined above where P takes values over all signed measures on S,
which can be identified with RS . The convex conjugate of Ψ(P) is as follows: for any g : S → R,
( )
X P( x )
Ψ∗ (g) = sup P(x)g(x) − Q(x) sup h − f∗ext (h)
P h∈ dom( f∗ ) Q ( x)
x ext
X
= sup inf ∗ P(x)(g(x) − h(x)) + Q(x)f∗ext (h(x))
P h:S→dom(fext ) x
( a) X
= inf sup P(x)(g(x) − h(x)) + EQ [f∗ext (h)]
h:S→dom(f∗
ext ) P
x
i i
i i
i i
126
(
EQ [f∗ext (g(X))] g : S → dom(f∗ext )
= .
+∞ otherwise
where (a) follows from the minimax theorem (which applies due to finiteness of X ). Applying
the convex duality in (7.80) yields the proof of the desired (7.100).
Proof of Theorem 7.25. First we argue that the supremum in the right-hand side of (7.88) can
be taken over all simple functions g. Then thanks to Theorem 7.6, it will suffice to consider finite
alphabet X . To that end, fix any g. For any δ , there exists a such that EQ [f∗ext (g − a)] − aP[S] ≤
Ψ∗Q,P (g) + δ . Since EQ [f∗ext (g − an )] can be approximated arbitrarily well by simple functions we
conclude that there exists a simple function g̃ such that simultaneously EP [g̃1S ] ≥ EP [g1S ] − δ and
This implies that restricting to simple functions in the supremization in (7.88) does not change the
right-hand side.
Next consider finite X . We proceed to compute the conjugate of Ψ, where Ψ(P) ≜ Df (PkQ) if
P is a probability measure on X and +∞ otherwise. Then for any g : X → R, maximizing over
all probability measures P we have:
X
Ψ∗ (g) = sup P(x)g(x) − Df (PkQ)
P x∈X
X X X
P(x)
= sup P(x)g(x) − P(x)g(x) − Q ( x) f
P x∈X Q ( x)
x∈Sc x∈ S
X X X
= sup inf P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + Q(x)f∗ext (h(x))
P h:S→R x∈S x∈Sc x∈S
( ! )
( a) X X
= inf sup P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + EQ [f∗ext (h(X))]
h:S→R P x∈ S x∈Sc
(b) ′ ∗
= inf max max g(x) − h(x), maxc g(x) − f (∞) + EQ [fext (h(X))]
h:S→R x∈ S x∈ S
( c) ′ ∗
= inf max a, maxc g(x) − f (∞) + EQ [fext (g(X) − a)]
a∈ R x∈ S
where (a) follows from the minimax theorem; (b) is due to P being a probability measure; (c)
follows since we can restrict to h(x) = g(x) − a for x ∈ S, thanks to the fact that f∗ext is non-
decreasing (since dom(fext ) = R+ ).
From convex duality we have shown that Df (PkQ) = supg EP [g] − Ψ∗ (g). Notice that without
loss of generality we may take g(x) = f′ (∞) + b for x ∈ Sc . Interchanging the optimization over
b with that over a we find that
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 127
which then recovers (7.88). To get (7.89) simply notice that if P[Sc ] > 0, then both sides of (7.89)
are infinite (since Ψ∗Q (g) does not depend on the values of g outside of S). Otherwise, (7.89)
coincides with (7.88).
i i
i i
i i
A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. This method typically proceeds as follows: in
order to count the cardinality of a given set C , we draw an element uniformly at random from C ,
whose entropy is given by log |C|. To bound |C| from above, we describe this random object by a
random vector X = (X1 , . . . , Xn ), e.g., an indicator vector, then proceed to compute or upper-bound
the joint entropy H(X1 , . . . , Xn ).
Notably, three methods of increasing precision are as follows:
• Marginal bound:
X
n
H(X1 , . . . , Xn ) ≤ H( X i )
i=1
1 X
H(X1 , . . . , Xn ) ≤ H(Xi , Xj )
n−1
i< j
X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 )
i=1
We give three applications using the above three methods, respectively, in the order of increasing
difficulty:
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties
of differential entropy (Section 2.3).
128
i i
i i
i i
where wH (x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ 2nh(p) .
where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that
1X
n
p= pi ,
n
i=1
since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
!
Xn
1X
n
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1
Theorem 8.2.
k
X n
≤ 2nh(k/n) , k ≤ n/2.
j
j=0
Proof. We take C = {x ∈ {0, 1}n : wH (x) ≤ k} and invoke the previous lemma, which says that
k
X n
= |C| ≤ 2nh(p) ≤ 2nh(k/n) ,
j
j=0
where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.
Remark 8.2. Alternatively, we can prove Theorem 8.2 using the large-deviation bound in Part III.
By the Chernoff bound on the binomial tail (see (15.19) in Example 15.1),
LHS RHS
= P(Bin(n, 1/2) ≤ k) ≤ 2−nd( n ∥ 2 ) = 2−n(1−h(k/n)) = n .
k 1
2n 2
i i
i i
i i
130
N( , ) = 4, N( , ) = 8.
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, let’s define
1
To be precise, here N(H, G) is the number of subgraphs of G (subsets of edges) isomorphic to H. If we denote by inj(H, G)
the number of injective maps V(H) → V(G) mapping edges of H to edges of G, then N(H, G) = |Aut(H)| 1
inj(H, G).
i i
i i
i i
number of a graph. For a graph H = (V, E), define the fractional covering number as the value of
the following linear program:2
( )
X X
ρ∗ (H) = min wH (e) : wH (e) ≥ 1, ∀v ∈ V, wH (e) ∈ [0, 1] (8.3)
w
e∈E e∈E, v∈e
Theorem 8.3.
∗ ∗
c0 ( H ) m ρ (H)
≤ N(H, m) ≤ c1 (H)mρ (H)
. (8.4)
For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 8.3 is consistent with (8.1).
Proof. Upper bound: Let V(H) = [n] and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H. Given
such a random subgraph set Xi ∈ V(G) to be the vertex corresponding to an i-th vertex of H, i ∈ [n].
∗
Now define a random 2-subset S of [n] by sampling an edge e from E(H) with probability ρw∗ ((He)) .
By the definition of ρ∗ (H) we have for any i ∈ [n] that P[i ∈ S] ≥ ρ∗1(H) . We are now ready to
apply Shearer’s Theorem 1.8:
where the last bound is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values. Overall,
∗
we get3 N(H, G) ≤ (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N(H, G) ≥
∗
c(H)|e(G)|ρ (H) . Consider the dual LP of (8.3)
X
α∗ (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (8.5)
ψ
v∈V(H)
i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is
X
|E(G)| = m(v)m(w).
(vw)∈E(H)
2
If the “∈ [0, 1]” constraints in (8.3) and (8.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H) and the
independence number α(H) of H, respectively.
3
Note that for H = K3 this gives a bound weaker than (8.2). To recover (8.2) we need to take X = (X1 , . . . , Xn ) be
uniform on all injective homomorphisms H → G.
i i
i i
i i
132
Q
Furthermore, N(G, H) ≥ v∈V(H) m(v). To minimize the exponent log N(G,H)
log |E(G)| , fix a large number
ψ(v)
M and let m(v) = M , where ψ is the maximizer in (8.5). Then
X
|E(G)| ≤ 4Mψ(v)+ψ(w) ≤ 4M|E(H)|
(vw)∈E(H)
Y ∗
N(G, H) ≥ Mψ(v) = Mα (H)
v∈V(H)
XY
n
perm(A) ≜ aiπ (i) ,
π ∈Sn i=1
where Sn denotes the group of all permutations of [n]. For a bipartite graph G with n vertices on
the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix. For example,
perm = 1, perm =2
Theorem 8.4 (Brégman’s Theorem). For any n × n bipartite graph with adjacency matrix A,
Y
n
1
perm(A) ≤ (di !) di ,
i=1
where di is the degree of left vertex i (i.e. sum of the ith row of A).
As an example, consider G = Kn,n . Then perm(G) = n!, which coincides with the RHS
[(n!)1/n ]n = n!. More generally, if G consists of n/d copies of Kd,d , then Bregman’s bound is
tight and perm = (d!)n/d .
As a first attempt of proving Theorem 8.4 using the entropy method, we select a perfect
matching uniformly at random which matches the ith left vertex to the Xi th right one. Let X =
i i
i i
i i
(X1 , . . . , Xn ). Then
X
n X
n
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H(Xi ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is worse than Brégman’s bound by an exponential factor, since by
Stirling’s formula
!
Y
n
1 Y
n
(di !) di
∼ di e− n .
i=1 i=1
Here is our second attempt. The hope is to use the chain rule to expand the joint entropy and
bound the conditional entropy more carefully. Let’s write
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1
where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, there is no reason why we should order from 1 to n. The key idea is to label the vertices
randomly, apply chain rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then
where Nk denotes the number of possible matchings for vertex k given the outcomes of {Xj :
π −1 (j) < π −1 (k)} and the expectation is with respect to (X, π ). The key observation is:
i i
i i
i i
134
1 X
dk
1
E(X,π ) log Nk = log i = log(di !) di
dk
i=1
and hence
X
n
1 Y
n
1
log perm(A) ≤ log(di !) di = log (di !) di .
k=1 i=1
Proof. Note that Xi = σ(i) for some random permutation σ . Let T = ∂(k) be the neighbors of k.
Then
which is a function of (σ, π ). In fact, conditioned on any realization of σ , Nk is uniform over [dk ].
To see this, note that σ −1 (T) is a fixed subset of [n] of cardinality dk , and k ∈ σ −1 (T). On the other
hand, S ≜ {j : π −1 (j) < π −1 (k)} is a uniformly random subset of [n]\{k}. Then
Theorem 8.6 (Bollobás-Thomason Box Theorem). Let K ⊂ Rn be a compact set. For S ⊂ [n],
denote by KS ⊂ RS the projection of K onto those coordinates indexed by S. Then there exists a
rectangle A s.t. Leb(A) = Leb(K) and for all S ⊂ [n]:
Leb(AS ) ≤ Leb(KS )
4
Note that since K is compact, its projection and slices are all compact and hence measurable.
i i
i i
i i
Proof. Let Xn be uniformly distributed on K. Then h(Xn ) = log Leb(K). Let A be a rectangle of
size a1 × · · · × an where
log ai = h(Xi |Xi−1 ) .
Then, we have by Theorem 2.6(a)
h(XS ) ≤ log Leb(KS ).
On the other hand, by the chain rule and the fact that conditioning reduces differential entropy
(recall Theorem 2.6(a) and (c)),
X
n
h( X S ) = 1{i ∈ S}h(Xi |X[i−1]∩S )
i=1
X
≥ h(Xi |Xi−1 )
i∈S
Y
= log ai
i∈S
= log Leb(AS )
The following result is a continuous counterpart of Shearer’s lemma (see Theorem 1.8 and
Remark 1.2):
Corollary 8.7 (Loomis-Whitney). Let K be a compact subset of Rn and let Kjc denote the
projection of K onto coordinates in [n] \ j. Then
Y
n
1
Leb(K) ≤ Leb(Kjc ) n−1 . (8.6)
j=1
Y
n
Leb(K) ≥ wj ,
j=1
i.e. that volume of K is greater than that of the rectangle of average widths.
i i
i i
i i
Consider the following problem: Given a stream of independent Ber(p) bits, with unknown p, we
want to turn them into pure random bits, i.e., independent Ber(1/2) bits; Our goal is to find a
universal way to extract the most number of bits. In other words, we want to extract as many fair
coin flips as possible from possibly biased coin flips, without knowing the actual bias.
In 1951 von Neumann [326] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both
01 and 10 occur with probability pq (where q ≜ 1 − p throughout this chapter), regardless of the
value of p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s
scheme, note that, on average, we have 2n bits in and 2pqn bits out. So the efficiency (rate) is pq.
The question is: Can we do better?
There are several choices to be made in the problem formulation. Universal v.s. non-universal:
the source distribution can be unknown or partially known, respectively. Exact v.s. approximately
fair coin flips: whether the generated coin flips are exactly fair or approximately, as measured by
one of the f-divergences studied in Chapter 7 (e.g., the total variation or KL divergence). In this
chapter, we only focus on the universal generation of exactly fair coins.
9.1 Setup
Let {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denote the set of all finite-length binary strings,
where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , l(x) denotes the length of x.
Let us first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is the
following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
136
i i
i i
i i
In other words, Ψ consumes a stream of n coins with bias p and outputs on average nrΨ (p) fair
coins.
Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN (x2n+1 ) =
ΨvN (x2n )), whose rate is rvN (p) = pq. Clearly this is wasteful, because even if the input bits are
already fair, we only get 25% in return.
9.2 Converse
We show that no extractor has a rate higher than the binary entropy function h(p), even if the
extractor is allowed to be non-universal (depending on p). The intuition is that the “information
content” contained in each Ber(p) variable is h(p) bits; as such, it is impossible to extract more
than that. This is easily made precise by the data processing inequality for entropy (since extractors
are deterministic functions).
1 1
rΨ (p) ≥ h(p) = p log2 + q log2 .
p q
nh(p) = H(Xn ) ≥ H(Ψ(Xn )) = H(Ψ(Xn )|L) + H(L) ≥ H(Ψ(Xn )|L) = E [L] bits,
where the last step follows from the assumption on Ψ that Ψ(Xn ) is uniform over {0, 1}k
conditioned on L = k.
The rate of von Neumann extractor and the entropy bound are plotted below. Next we present
two extractors, due to Elias [112] and Peres [233] respectively, that attain the binary entropy func-
tion. (More precisely, both construct a sequence of extractors whose rate approaches the entropy
bound).
i i
i i
i i
138
rate
1 bit
rvN
p
0 1 1
2
1 For iid Xn , the probability of each string only depends on its type, i.e., the number of 1’s. (This is
the main idea of the method of types for data compression.) Therefore conditioned on the num-
ber of 1’s, Xn is uniformly distributed (over the type class). This observation holds universally
for any p.
2 Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length string of fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and the remaining
eight numbers 4, . . . , 11 are assigned to 3-bit strings.
Lemma 9.3. Given U uniformly distributed on [M], there exists f : [M] → {0, 1}∗ such that
conditioned on l(f(U)) = k, f(U) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f(U))] ≤ log2 M bits.
Proof. We defined f by partitioning [M] into subsets whose cardinalities are powers of two, and
assign elements in each subset to binary strings of that length. Formally, denote the binary expan-
Pn
sion of M by M = i=0 mi 2i , where the most significant bit mn = 1 and n = blog2 Mc + 1. Those
non-zero mi ’s defines a partition [M] = ∪tj=0 Mj , where |Mi | = 2ij . Map the elements of Mj to
{0, 1}ij . Finally, notice that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower bound
follows from
1 X 1 X 2n X i−n
n n n
2n+1
E[l(f(U))] = mi 2 · i = n −
i
mi 2 ( n − i) ≥ n −
i
2 ( n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0
i i
i i
i i
Elias’ extractor Fix n ≥ 1. Let wH (xn ) define the Hamming weight (number of ones) of a
binary string xn . Let Tk = {xn ∈ {0, 1}n : wH (xn ) = k} define the Hamming sphere of radius k.
For each 0 ≤ k ≤ n, we apply the function f from Lemma 9.3 to each Tk . This defines a mapping
ΨE : {0, 1}n → {0, 1}∗ and then we extend it to ΨE : {0, 1}∗ → {0, 1}∗ by applying the mapping
per n-bit block and discard the last incomplete block. Then it is clear that the rate is given by
n E[l(ΨE (X ))]. By Lemma 9.3, we have
1 n
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
wH (Xn ) wH (Xn )
Using Stirling’s approximation, we can show (see, e.g., [19, Lemma 4.7.1])
2nh(p) n 2nh(p)
p ≤ ≤p (9.1)
8k(n − k)/n k 2πk(n − k)/n
• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .
Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Note
that un is important because it encodes the location of the yk and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
vn−k and un .
i i
i i
i i
140
Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits that
enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to introduce
the notion of exchangeability. We say Xn are exchangeable if the joint distribution is invariant
under permutation, that is, PX1 ,...,Xn = PXπ (1) ,...,Xπ (n) for any permutation π on [n]. In particular, if
Xi ’s are binary, then Xn are exchangeable if and only if the joint distribution only depends on the
Hamming weight, i.e., PXn (xn ) = f(wH (xn )) for some function f. Examples: Xn is iid Ber(p); Xn is
uniform over the Hamming sphere Tk .
As an example, if X2n are i.i.d. Ber(p), then conditioned on L = k, Vn−k is iid Ber(p2 /(p2 + q2 )),
since L ∼ Binom(n, 2pq) and
pk+2m qn−k−2m
P[Yk = y, Un = u, Vn−k = v|L = k] =
n
k(p2 + q2 )n−k (2pq)k
− 1
n p2 m q2 n − k − m
= 2− k · · 2
k p + q2 p2 + q2
= P[Yk = y|L = k]P[Un = u|L = k]P[Vn−k = v|L = k],
where m = wH (v). In general, when X2n are only exchangeable, we have the following:
Lemma 9.4 (Ψt preserves exchangeability). Let X2n be exchangeable and L = Ψ1 (X2n ). Then
conditioned on L = k, Yk , Un and Vn−k are independent, each having an exchangeable distribution.
i.i.d.
Furthermore, Yk ∼ Ber( 12 ) and Un is uniform over Tk .
Proof. If suffices to show that ∀y, y′ ∈ {0, 1}k , u, u′ ∈ Tk and v, v′ ∈ {0, 1}n−k such that wH (v) =
wH (v′ ), we have
which implies that P[Yk = y, Un = u, Vn−k = v|L = k] = f(wH (v)) for some function f. Note that
the string X2n and the triple (Yk , Un , Vn−k ) are in one-to-one correspondence of each other. Indeed,
to reconstruct X2n , simply read the k distinct pairs from Y and fill them according to the locations of
ones in U and fill the remaining equal pairs from V. [Examples: (y, u, v) = (01, 1100, 01) ⇒ x =
(10010011), (y, u, v) = (11, 1010, 10) ⇒ x′ = (01110100).] Finally, note that u, y, v and u′ , y′ , v′
correspond to two input strings x and x′ of identical Hamming weight (wH (x) = k + 2wH (v)) and
hence of identical probability due to the exchangeability of X2n .
i.i.d.
Lemma 9.5 (Ψt is an extractor). Let X2n be exchangeable. Then Ψt (X2n ) ∼ Ber(1/2) conditioned
on l(Ψt (X2n )) = m.
i i
i i
i i
Proof. Note that Ψt (X2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,
Proceed by induction on t. The base case of t = 1 follows from Lemma 9.4 (the distribution of
the Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X2n ) = (Ψ1 (X2n ), Ψt−1 (Un ), Ψt−1 (Vn−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 9.4. Then
P[Ψt (X2n ) = sm ]
Xm
= P[Ψt (X2n ) = sm |L1 = k]P[L1 = k]
k=0
m X
X m−k
Lemma 9.4 n−k
= P[L1 = k]P[Yk = sk |L1 = k]P[Ψt−1 (Un ) = skk+1 |L1 = k]P[Ψt−1 (V
+r
k+r+1 |L1 = k]
) = sm
k=0 r=0
X
m X
m−k
P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
induction
=
k=0 r=0
Then
1 Ln 1 1
l(Ψt (X2n )) = + l(Ψt−1 (Un )) + l(Ψt−1 (Vn−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a. s .
Note that Un ∼ Ber(2pq), Vn−Ln |Ln ∼ Ber(p2 /(p2 + q2 )) and Ln −−→∞. Then the induction hypoth-
a. s . a. s .
esis implies that 1n l(Ψt−1 (Un ))−−→rt−1 (2pq) and 2(n−1 Ln ) l(Ψt−1 (Vn−Ln ))−−→rt−1 (p2 /(p2 +q2 )). We
obtain the recursion:
1 p2 + q2 p2
rt (p) = pq + rt−1 (2pq) + rt−1 ≜ (Trt−1 )(p), (9.2)
2 2 p2 + q2
where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is mono-
tone in the senes that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt converges
monotonically from below to the fixed point of T, which turns out to be exactly the binary
entropy function h. Instead of directly verifying Th = h, here is a simple proof: Consider
i.i.d.
X1 , X2 ∼ Ber(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(2pq) + 2pqh( 12 ) + (p2 + q2 )h( p2p+q2 ).
The convergence of rt to h are shown in Fig. 9.1.
i i
i i
i i
142
1.0
0.8
0.6
0.4
0.2
Figure 9.1 Rate function rt for t = 1, 4, 10 versus the binary entropy function.
example is whether we can simulate Ber(2p) from Ber(p), i.e., f(p) = 2p ∧ 1. Keane and O’Brien
[175] showed that all f that can be simulated are either constants or “polynomially bounded away
from 0 or 1”: for all 0 < p < 1, min{f(p), 1 − f(p)} ≥ min{p, 1 − p}n for some n ∈ N. In particular,
doubling the bias is impossible.
The above result deals with what f(p) can be simulated in principle. What type of computational
devices are needed for such as task? Note that since r1 (p) is quadratic in p, all rate functions rt
that arise from the iteration (9.2) are rational functions (ratios of polynomials), converging to the
binary entropy function as Fig. 9.1 shows. It turns out that for any rational function f that satisfies
0 < f < 1 on (0, 1), we can generate independent Ber(f(p)) from Ber(p) using either of the
following schemes with finite memory [221]:
1 Finite-state machine (FSM): initial state (red), intermediate states (white) and final states (blue,
output 0 or 1 then reset to initial state).
2 Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f.
The next table gives some examples of f that can be realized with these two architectures. (Exercise:
How to generate f(p) = 1/3?)
It turns out that the only type of f that can be simulated using either FSM or block simulation
√
is rational function. For f(p) = p, which satisfies Keane-O’Brien’s characterization, it cannot
be simulated by FSM or block simulation, but it can be simulated by the so-called pushdown
automata, which is a FSM operating with a stack (infinite memory) [221].
It is unknown how to find the optimal Bernoulli factory with the best rate. Clearly, a converse
is the entropy bound h(hf((pp))) , which can be trivial (bigger than one).
i i
i i
i i
1
1
0
0
f(p) = 1/2 A0 = 10; A1 = 01
1
1 0
0
0 0
1
f(p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1
0 0
0
0
1
1
p3
f(p) = p3 +q3
A0 = 000; A1 = 111
0
0
1
1
1 1
i i
i i
i i
where C(n, k) is bounded by two universal constants C0 ≤ C(n, k) ≤ C1 , and h(·) is the
binary entropy. Conclude that for all 0 ≤ k ≤ n we have
3* More generally, let X be a finite alphabet, P̂, Q distributions on X , and TP̂ a set of all strings
in X n with composition P̂. If TP̂ is non-empty (i.e. if nP̂(·) is integral) then
and furthermore, both O(log n) terms can be bounded as |O(log n)| ≤ |X | log(n + 1). (Hint:
show that number of non-empty TP̂ is ≤ (n + 1)|X | .)
I.2 (Refined method of types) The following refines Proposition 1.5. Let n1 , . . . , be non-negative
P
integers with i ni = n and let k+ be the number of non-zero ni ’s. Then
n k+ − 1 1 X
log = nH(P̂) − log(2πn) − log P̂i − Ck+ ,
n1 , n2 , . . . 2 2
i:ni >0
i i
i i
i i
where we define xn+1 = x1 (cyclic continuation). Show that 1n Nxn (·, ·) defines a probability
distribution PA,B on X ×X with equal marginals PA = PB . Conclude that H(A|B) = H(B|A).
Is PA|B = PB|A ?
(2)
(b) Let Txn (Markov type-class of xn ) be defined as
(2)
Txn = {x̃n ∈ X n : Nx̃n = Nxn } .
(2)
Show that elements of Txn can be identified with cycles in the complete directed graph G
on X , such that for each (a, b) ∈ X × X the cycle passes Nxn (a, b) times through edge
( a, b) .
(c) Show that each such cycle can be uniquely specified by indentifying the first node and by
choosing at each vertex of the graph the order in which the outgoing edges are taken. From
this and Stirling’s approximation conclude that
(2)
log |Txn | = nH(xT+1 |xT ) + O(log n) , T ∼ Unif([n]) .
I.4 Find the entropy rate of a stationary ergodic Markov chain with transition probability matrix
1 1 1
2 4 4
P= 0 1
2
1
2
1 0 0
I.5 Let X = X∞
0 be a stationary Markov chain. Let PY|X be a Markov kernel. Define a new process
Y = Y∞0 where Yi ∼ PY|X=Xi conditionally independent of all other Xj , j 6= i. Prove that
and
I.6 (Robust version of the maximal entropy) Maximal differential entropy among all variables X
supported on [−b, b] is attained by a uniform distribution. Prove that as ϵ → 0+ we have
where supremization is over all (not necessarily independent) random variables M, Z such that
M + Z possesses a density. (Hint: [120, Appendix C] proves o(1) = O(ϵ1/3 log 1ϵ ) bound.)
I.7 (Maximum entropy.) Prove that for any X taking values on N = {1, 2, . . .} such that E[X] < ∞,
1
H(X) ≤ E[X]h ,
E [ X]
i i
i i
i i
maximized uniquely by the geometric distribution. Here as usual h(·) denotes the binary entropy
function. Hint: Find an appropriate Q such that RHS - LHS = D(PX kQ).
I.8 (Finiteness of entropy) We have shown that any N-valued random variable X, with E[X] < ∞
has H(X) ≤ E[X]h(1/ E[X]) < ∞. Next let us improve this result.
(a) Show that E[log X] < ∞ ⇒ H(X) < ∞.
Moreover, show that the condition of X being integer-valued is not superfluous by giving a
counterexample.
(b) Show that if k 7→ PX (k) is a decreasing sequence, then H(X) < ∞ ⇒ E[log X] < ∞.
Moreover, show that the monotonicity assumption is not superfluous by giving a counterex-
ample.
I.9 (Maximum entropy under Hamming weight constraint.) For any α ≤ 1/2 and d ∈ N,
achieved by the product distribution Y ∼ Ber(α)⊗d . Hint: Find an appropriate Q such that RHS
- LHS = D(PY kQ).
I.10 Let N (m, �) be the Gaussian distribution on Rn with mean m ∈ Rn and covariance matrix �.
(a) Under what conditions on m0 , �0 , m1 , �1 is
(b) Compute D(N (m, �)kN (0, In )), where In is the n × n identity matrix.
(c) Compute D( N (m1 , �1 ) k N (m0 , �0 ) ) for non-singular �0 . (Hint: think how Gaussian dis-
tribution changes under shifts x 7→ x + a and non-singular linear transformations x 7→ Ax.
Apply data-processing to reduce to previous case.)
I.11 (Information lost in erasures) Let X, Y be a pair of random variables with I(X; Y) < ∞. Let Z
be obtained from Y by passing the latter through an erasure channel, i.e., X → Y → Z where
(
1 − δ, z = y ,
PZ|Y (z|y) =
δ, z =?
I.13 The Hewitt-Savage 0-1 law states that certain symmetric events have no randomness. Let
{Xi }i≥1 be a sequence be iid random variables. Let E be an event determined by this sequence.
We say E is exchangeable if it is invariant under permutation of finitely many indices in
the sequence of {Xi }’s, e.g., the occurance of E is unchanged if we permute the values of
(X1 , X4 , X7 ), etc.
Let’s prove the Hewitt-Savage 0-1 law information-theoretically in the following steps:
P Pn
(a) (Warm-up) Verify that E = { i≥1 Xi converges} and E = {limn→∞ n1 i=1 Xi = E[X1 ]}
are exchangeable events.
i i
i i
i i
(b) Let E be an exchangeable event and W = 1E is its indicator random variable. Show that
for any k, I(W; X1 , . . . , Xk ) = 0. (Hint: Use tensorization (6.2) to show that for arbitrary n,
nI(W; X1 , . . . , Xk ) ≤ 1 bit.)
(c) Since E is determined by the sequence {Xi }i≥1 , we have by continuity of mutual informa-
tion:
xk
PY|X [k|x] = e−x , k = 0, 1, 2, . . .
k!
Let X be an exponential random variable with unit mean. Find I(X; Y).
I.15 Consider the following Z-channel given by PY|X [1|1] = 1 and PY|X [1|0] = PY|X [0|0] = 1/2.
1 1
0 0
C = max I(X; Y) .
X
(b) Find D(PY|X=0 kP∗Y ) and D(PY|X=1 kP∗Y ) where P∗Y is the capacity-achieving output distribu-
tion, or caod, i.e., the distribution of Y induced by the maximizer of I(X; Y).
I.16 (a) For any X such that E [|X|] < ∞, show that
(E[X])2
D(PX kN (0, 1)) ≥ nats.
2
(b) For a > 0, find the minimum and minimizer of
i i
i i
i i
I.17 (Entropy numbers and capacity.) Let {PY|X=x : x ∈ X } be a set of distributions and let C =
supPX I(X; Y) be its capacity. For every ϵ ≥ 0, define1
C = inf ϵ2 + log N(ϵ) . (I.5)
ϵ≥0
Comments: The reason these estimates are useful is because N(ϵ) for small ϵ roughly speaking
depends on local (differential) properties of the map x 7→ PY|X=x , unlike C which is global.
I.18 Consider the channel PYm |X : [0, 1] 7→ {0, 1}m , where given x ∈ [0, 1], Ym is i.i.d. Ber(x). Using
the upper bound from Ex. I.17 prove
1
C(m) ≜ max I(X; Ym ) ≤ log m + O(1) , m → ∞.
PX 2
Hint: Find a covering of the input space.
Show a lower bound to establish
1
C(m) ≥ log m + o(log m) , m → ∞.
2
You may use without proof that ∀ϵ > 0 there exists K(ϵ) such that for all m ≥ 1 and all
p ∈ [ϵ, 1 − ϵ] we have |H(Binom(m, p)) − 12 log m| ≤ K(ϵ).
I.19 Show that
Y
n
PY1 ···Yn |X1 ···Xn = PYi |Xi (I.7)
i=1
1
N(ϵ) is the minimum number of points that cover the set {PY|X=x : x ∈ X } to within ϵ in divergence; log N(ϵ) would be
called (Kolmogorov) metric ϵ-entropy of the set {PY|X=x : x ∈ X } – see Chapter 27.
i i
i i
i i
Pn
I.20 Suppose Z1 , . . . Zn are independent Poisson random variables with mean λ. Show that i=1 Zi
is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.21 Suppose Z1 , . . . Zn are independent uniformly distributed on the interval [0, λ]. Show that
max1≤i≤n Zi is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.22 Consider a binary symmetric random walk Xn on Z that starts at zero. In other words, Xn =
Pn
j=1 Bj , where (B1 , B2 , . . .) are independent and equally likely to be ±1.
(a) When n 1 does knowing X2n provide any information about Xn ? More exactly, prove
I.23 (Continuity of entropy on finite alphabet.) We have shown that entropy is continuous on on
finite alphabet. Now let us study how continuous it is with respect to the total variation. Prove
You may include only the minimal DAGs (recall: the DAG is minimal for a given
distribution if removal of any edge leads to a graphical model incompatible with the
distribution).2
(b) Draw the DAG describing the set of distributions PXn Yn satisfying:
Y
n
PYn |Xn = PYi |Xi
i=1
(c) Recall that two DAGs G1 and G2 are called equivalent if they have the same vertex sets and
each distribution factorizes w.r.t. G1 if and only if it does so w.r.t. G2 . For example, it is
2
Note: {X → Y}, {X ← Y} and {X Y} are the three possible directed graphical modelss for two random variables. For
example, the third graph describes the set of distributions for which X and Y are independent: PXY = PX PY . In fact, PX PY
factorizes according to any of the three DAGs, but {X Y} is the unique minimal DAG.
i i
i i
i i
well known
X→Y→Z ⇐⇒ X←Y←Z ⇐⇒ X ← Y → Z.
X1 → X2 → · · · → Xn → · · ·
X1 ← X2 ← · · · ← Xn ← · · ·
A→B→C
A→B→C
=⇒ A ⊥
⊥ (B, C)
A→C→B
Discuss implications for sufficient statistics.
Bonus: for binary (A, B, C) characterize all counter-examples.
Comment: Thus, a popular positivity condition PABC > 0 allows to infer conditional indepen-
dence relations, which are not true in general. Wisdom: This example demonstrates that a set
of distributions satisfying certain (conditional) independence relations does not equal to the
closure of its intersection with {PABC > 0}.
I.27 Show that for jointly gaussian (A, B, C)
I( A; C ) = I( B; C ) = 0 =⇒ I(A, B; C) = 0 . (I.11)
i i
i i
i i
eϵ
I.29 (Rényi divergences and Blackwell order) Let pϵ = 1+eϵ . Show that for all ϵ > 0 and all α > 0
we have
Note: This shows that domination under all Rényi divergences does not imply a similar
comparison in other f-divergences [? ]. On the other hand, we have the equivalence [222]:
Whenever the LHS is finite, derive the explicit form of a unique minimizer R.
I.31 For an f-divergence, consider the following statements:
(i) If If (X; Y) = 0, then X ⊥
⊥ Y.
(ii) If X − Y − Z and If (X; Y) = If (X; Z) < ∞, then X − Z − Y.
Recall that f : (0, ∞) → R is a convex function with f(1) = 0.
(a) Choose an f-divergence which is not a multiple of the KL divergence (i.e., f cannot be of
form c1 x log x + c2 (x − 1) for any c1 , c2 ∈ R). Prove both statements for If .
(b) Choose an f-divergence which is non-linear (i.e., f cannot be of form c(x − 1) for any c ∈ R)
and provide examples that violate (i) and (ii).
(c) Choose an f-divergence. Prove that (i) holds, and provide an example that violates (ii).
I.32 (Chain rules I)
(a) Show using (I.12) and the chain rule for KL that
X
n
(1 − α)Dα (PXn kQXn ) ≥ inf(1 − α)Dα (PXi |Xi−1 =a kQXi |Xi−1 =a )
a
i=1
1 Y n
1
1 − H2 (PXn , QXn ) ≤ sup(1 − H2 (PXi |Xi−1 =a , QXi |Xi−1 =a ))
2 a 2
i=1
Y
n
1 + χ2 (PXn kQXn ) ≤ sup(1 + χ2 (PXi |Xi−1 =a kQXi |Xi−1 =a ))
a
i=1
i i
i i
i i
(a) Show that the chain rule for divergence can be restated as
X
n
D(PXn kQXn ) = D(Pi kPi−1 ),
i=1
where Pi = PXi QXni+1 |Xi , with Pn = PXn and P0 = QXn . The identity above shows how
KL-distance from PXn to QXn can be traversed by summing distances between intermediate
Pi ’s.
(b) Using the same path and triangle inequality show that
X
n
TV(PXn , QXn ) ≤ EPXi−1 TV(PXi |Xi−1 , QXi |Xi−1 )
i=1
Prove that
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : PX = P, PY = Q}
PXY
where the infimum is over all couplings. (Hint: For one direction use the same coupling
achieving TV. For the other direction notice that P[X 6= Y|Y] ≥ 1 − QP((YY)) .)
(b) Define symmetrized Marton’s divergence
Dsm (PkQ) = Dm (PkQ) + Dm (QkP).
Prove that
Dsm (PkQ) = inf{E[P2 [X 6= Y|Y]] + E[P2 [X 6= Y|X]] : PX = P, PY = Q}.
PXY
I.35 (Center of gravity under f-divergences.) Recall from Corollary 4.2 the fact that
min D(PY|X kQY |PX ) = I(X; Y)
QY
3
Note that the results do not depend on the choice of μ, so we can take for example μ = PY , in view of Lemma 3.3.
i i
i i
i i
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ exp(E[log p(y|X)]) μ(dy).
Note: This exercise shows that the center of gravity with respect to other f-divergences need not
be PY but its reweighted version. For statistical applications, see Exercise VI.6 and Exercise VI.9,
where (I.13) is used to determine the form of the Bayes estimator.
I.36 Let (X, Y) be uniformly distributed in the unit ℓp -ball Bp ≜ {(x, y) : |x|p + |y|p ≤ 1}, where
p ∈ (0, ∞). Also define the ℓ∞ -ball B∞ ≜ {(x, y) : |x| ≤ 1, |y| ≤ 1}.
(a) Compute I(X; Y) for p = 1/2, p = 1 and p = ∞.
(b) (Bonus) What do you think I(X; Y) converges to as p → 0. Can you prove it?
I.37 (Divergence of order statistics) Given xn = (x1 , . . . , xn ) ∈ Rn , let x(1) ≤ . . . ≤ x(n) denote the
ordered entries. Let P, Q be distributions on R and PXn = Pn , QXn = Qn .
(a) Prove that
I.38 (Sampling without replacement I, [293]) Consider two ways of generating a random vector
Xn = (X1 , . . . , Xn ): Under P, Xn are sampled from the set [n] = {1, . . . , n} without replacement;
under Q, Xn are sampled from [n] with replacement. Let’s compare the joint distribution of the
first k draws X1 , . . . , Xk for some 1 ≤ k ≤ n.
(a) Show that
k! n
TV(PXk , QXk ) = 1 − k
n k
k! n
D(PXk kQXk ) = − log k .
n k
√
Conclude that D and TV are o(1) iff k = o( n). You may use the fact that TV between two
discrete distributions is equal to half the ℓ1 -distance between their PMFs.
√
(b) Explain the specialness of n by find an explicit test that distinguishes P and Q with high
√
probability when k n. Hint: Birthday problem.
I.39 (Sampling without replacement II, [293]) Let X1 , . . . , Xk be a random sample of balls without
Pq
replacement from an urn containing ai balls of color i ∈ [q], i=1 ai = n. Let QX (i) = ani . Show
that
k2 ( q − 1 ) log e
D(PXk kQkX ) ≤ c , c= .
(n − 1)(n − k + 1) 2
Let Rm,b0 ,b1 be the distribution of the number of 1’s in the first m ≤ b0 + b1 coordinates of a
randomly permuted binary strings with b0 zeros and b1 ones.
i i
i i
i i
(b) Show that there must exist some t ∈ {k, k + 1, . . . , n} such that
H( X k − 1 )
I(Xk−1 ; Xk |Xnt+1 ) ≤ .
n−k+1
(Hint: Expand I(Xk−1 ; Xnk ) via chain rule.)
(c) Show from 1 and 2 that
Y kH(Xk−1 )
D PXk |T
PXj |T |PT ≤
n−k+1
where T = Xnt+1 .
(d) By Pinsker’s inequality
h i r
Y kH(Xk−1 )|X | 1
ET TV PXk |T , PXj |T ≤ c , c= p .
n−k+1 2 log e
Conclude the proof of (I.16) by convexity of total variation.
Note: Another estimate [293, 90] is easy to deduce from Exercise I.39 and Exercise I.38: there
exists a mixture of iid QXk such that
k
TV(QXk , PXk ) ≤ min(2|X |, k − 1) .
n
The bound (I.16) improves the above only when H(X1 ) ≲ 1.
i i
i i
i i
I.41 (Wringing Lemma [105, 309]) Prove that for any δ > 0 and any (Un , Vn ) there exists an index
n n
set I ⊂ [n] of size |I| ≤ I(U δ;V ) such that
I(Ut ; Vt |UI , VI ) ≤ δ ∀ t ∈ [ n] .
When I(Un ; Vn ) n, this shows that conditioning on a (relatively few) entries, one can make
individual coordinates almost independent. (Hint: Show I(A, B; C, D) ≥ I(A; C) + I(B; D|A, C)
first. Then start with I = ∅ and if there is any index t s.t. I(Ut ; Vt |UI , VI ) > δ then add it to I and
repeat.)
I.42 This exercise shows other ways of proving the Fano’s inequality in its various forms.
(a) Prove (6.5) as follows. Given any P = (Pmax , P2 , . . . , PM ), apply a random permutation π
to the last M − 1 atoms to obtain the distribution Pπ . By comparing H(P) and H(Q), where
Q is the average of Pπ over all permutations, complete the proof.
(b) Prove (6.5) by directly solving the convex optimization max{H(P) : 0 ≤ pi ≤ Pmax , i =
P
1, . . . , M, i pi = 1}.
(c) Prove (6.9) as follows. Let Pe = P[X 6= X̂]. First show that
I(X; Y) ≥ I(X; X̂) ≥ min{I(PX , PZ|X ) : P[X = Z] ≥ 1 − Pe }.
PZ|X
Notice that the minimum is non-zero unless Pe = Pmax . Second, solve the stated convex
optimization problem. (Hint: look for invariants that the matrix PZ|X must satisfy under
permutations (X, Z) 7→ (π (X), π (Z)) then apply the convexity of I(PX , ·)).
I.43 (Generalization gap = ISKL , [12]) A learning algorithm selects a parameter W based on observing
(not necessarily independent) samples S1 , . . . , Sn , where all Si have a common marginal law PS ,
with the goal of minimizing the loss on a fresh sample = E[ℓ(W, S)], where Sn ⊥ ⊥ S ∼ PS
and ℓ is an arbitrary loss function4 . Consider a Gibbs algorithm (generalizing ERM and various
regularizations) which chooses
αX
n
1
W ∼ PW|Sn (w|sn ) = n
π (w) exp{− ℓ(w, si )} ,
Z( s ) n
i=1
where π (·) is a fixed prior on weights and Z(·) – normalization constant. Show that generaliza-
tion gap of this algorithm is given by
1X
n
1
E[ℓ(W, S)] − E[ ℓ(W, Si )] = ISKL (W; Sn ) .
n α
i=1
4
For example, if S = (X, Y) we may have ℓ(w, (x, y)) = 1{fw (x) 6= y} where fw denotes a neural network with weights w.
i i
i i
i i
q
(in other words, the typical deviation is of order ϵ2 log δ1 + D(· · · )). Prove this inequality in
two steps (assume that (X, Y, Ȳ) are all discrete):
• For convenience, let EX|Y and EȲ denote the respective (conditional) expectation operators.
Show that the result follows from the following inequality (valid for all f and QX ):
h i
EY eEX|Y [f(X,Y)−ln EȲ e )]−D(PX|Y=Y ∥QX ) ≤ 1
f(X,Ȳ
(I.18)
i i
i i
i i
λ 1
D(PkϵU + ϵ̄R) ≤ 8(H (P, R) + 2ϵ)
2
log + Dλ (PkU) .
λ−1 ϵ
Thus, a Hellinger ϵ-net for a set of P’s can be converted into a KL (ϵ2 log 1ϵ )-net; see
Section 32.2.4.)
−1
(a) Start by proving the tail estimate for the divergence: For any λ > 1 and b > e(λ−1)
dP dP log b
EP log · 1{ > b} ≤ λ−1 exp{(λ − 1)Dλ (PkQ)}
dQ dQ b
(b) Show that for any b > 1 we have
b log b dP dP
D(PkQ) ≤ H2 (P, Q) √ + EP log · 1{ > b}
( b − 1)2 dQ dQ
h(x)
(Hint: Write D(PkQ) = EP [h( dQ
dP )] for h(x) = − log x + x − 1 and notice that
√
( x−1)2
is
monotonically decreasing on R+ .)
(c) Set Q = ϵU + ϵ̄R and show that for every δ < e− λ−1 ∧ 14
1
1
D(PkQ) ≤ 4H2 (P, R) + 8ϵ + cλ ϵ1−λ δ λ−1 log ,
δ
where cλ = exp{(λ − 1)Dλ (PkU). (Notice H2 (P, Q) ≤ H2 (P, R) + 2ϵ, Dλ (PkQ) ≤
Dλ (PkU) + log 1ϵ and set b = 1/δ .)
2
(d) Complete the proof by setting δ λ−1 = 4H c(λPϵ,λ−
R)+2ϵ
1 .
I.49 Let G = (V, E) be a finite directed graph. Let
4 = (x, y, z) ∈ V3 : (x, y), (y, z), (z, x) ∈ E ,
∧ = (x, y, z) ∈ V3 : (x, y), (x, z) ∈ E .
Prove that 4 ≤ ∧.
Hint: Prove H(X, Y, Z) ≤ H(X) + 2H(Y|X) for random variables (X, Y, Z) distributed uniformly
over the set of directed 3-cycles, i.e. subsets X → Y → Z → X.
i i
i i
i i
i i
i i
i i
Part II
i i
i i
i i
i i
i i
i i
161
• Variable-length lossless compression. Here we require P[X 6= X̂] = 0, where X̂ is the decoded
version. To make the question interesting, we compress X into a variable-length binary string. It
will turn out that optimal compression length is H(X) − O(log(1 + H(X))). If we further restrict
attention to so-called prefix-free or uniquely decodable codes, then the optimal compression
length is H(X) + O(1). Applying these results to n-letter variables X = Sn we see that optimal
5
Of course, one should not take these “laws” too far. In regards to language modeling, (finite-state) Markov assumption is
too simplistic to truly generate all proper sentences, cf. Chomsky [67].
i i
i i
i i
162
compression length normalized by n converges to the entropy rate (Section 6.4) of the process
{Sj }.
• Fixed-length, almost lossless compression. Here, we allow some very small (or vanishing with
n → ∞ when X = Sn ) probability of error, i.e. P[X 6= X̂] ≤ ϵ. It turns out that under mild
assumptions on the process {Sj }, here again we can compress to entropy rate but no more.
This mode of compression permits various beautiful results in the presence of side-information
(Slepian-Wolf, etc).
• Lossy compression. Here we require only E[d(X, X̂)] ≤ ϵ where d(·, ·) is some loss function.
This type of compression problems is the central topic of Part V.
Note that more correctly we should have called all the examples above as “fixed-to-variable”,
“fixed-to-fixed” and “fixed-to-lossy” codes, because they take fixed number of input letters. We
omit descussion of the beautiful class of variable-to-fixed compressors, such as the famous
Tunstall code [314], which consume an incoming stream of letters in variable-length chunks.
i i
i i
i i
X Compressor
{0, 1}∗ Decompressor X
f: X →{0,1}∗ g: {0,1}∗ →X
sor is a function f that maps each symbol x ∈ X into a variable-length string f(x) in {0, 1}∗ ≜
∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . }. Each f(x) is referred to as a codeword and the collection of
codewords the codebook. We say f is a lossless compressor for a random variable X if there exists
a decompressor g : {0, 1}∗ → X such that P [X = g(f(X))] = 1, i.e., g(f(x)) = x for all x ∈ X
such that PX (x) > 0. (As such, f is injective on the support of PX ). We are interested in the most
economical way to compress the data. So let us introduce the length function l : {0, 1}∗ → Z+ ,
e.g., l(∅) = 0, l(01001) = 5.
Notice that since {0, 1}∗ is countable, lossless compression is only possible for discrete X.
Also, without loss of generality, we can relabel X such that X = N = {1, 2, . . . } and sort the
PMF decreasingly: PX (1) ≥ PX (2) ≥ · · · . At this point we do not impose any other constraints on
the map f; later in Section 10.3 we will introduce conditions such as prefix-freeness and unique-
decodability. The unconstrained setting is sometimes called a single-shot compression setting,
cf. [181].
We could consider different objectives for selecting the best compressor f, for example, min-
imizing any of E[l(f(X))], esssup l(f(X)), median[l(f(X))] would be reasonable. It turns out that
there is a compressor f∗ that minimizes all objectives simultaneously. As mentioned in the preface
of this chapter, the main idea is to assign longer codewords to less likely symbols, and reserve
the shorter codewords for more probable symbols. To make precise of the optimality of f∗ , let us
recall the concept of stochastic dominance.
Definition 10.1 (Stochastic dominance). For real-valued random variables X and Y, we say Y
st.
stochastically dominates (or, is stochastically larger than) X, denoted by X ≤ Y, if P [Y ≤ t] ≤
P [X ≤ t] for all t ∈ R.
163
i i
i i
i i
164
PX (i)
i
1 2 3 4 5 6 7 ···
∗
f
∅ 0 1 00 01 10 11 ···
st.
By definition, X ≤ Y if and only if the CDF of X is larger than that of Y pointwise; in other words,
the distribution of X assigns more probability to lower values than that of Y does. In particular, if
X is dominated by Y stochastically, so are their means, medians, supremum, etc.
Theorem 10.2 (Optimal f∗ ). Consider the compressor f∗ defined (for a down-sorted PMF PX )
by f∗ (1) = ∅, f∗ (2) = 0, f∗ (3) = 1, f∗ (4) = 00, etc, assigning strings with increasing lengths to
symbols i ∈ X . (See Fig. 10.1 for an illustration.) Then
1 Length of codeword:
2 l(f∗ (X)) is stochastically the smallest: For any lossless compressor f : X → {0, 1}∗ ,
st.
l(f∗ (X)) ≤ l(f(X))
i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗ (X)) ≤ k]. As a result, E[l(f∗ (X))] ≤ E[l(f(X))].
Here the inequality is because f is lossless so that |Ak | can at most be the total number of binary
strings of length up to k. Then
X X
P[l(f(X)) ≤ k] = P X ( x) ≤ PX (x) = P[l(f∗ (X)) ≤ k], (10.1)
x∈Ak x∈A∗
k
since |Ak | ≤ |A∗k | and A∗k contains all 2k+1 − 1 most likely symbols.
The following lemma (see Ex. I.7) is useful in bounding the expected code length of f∗ . It says
if the random variable is integer-valued, then its entropy can be controlled using its mean.
i i
i i
i i
Lemma 10.3. For any Z ∈ N s.t. E[Z] < ∞, H(Z) ≤ E[Z]h( E[1Z] ), where h(·) is the binary entropy
function.
Theorem 10.4 (Optimal average code length: exact expression). Suppose X ∈ N and PX (1) ≥
PX (2) ≥ . . .. Then
X
∞
E[l(f∗ (X))] = P[X ≥ 2k ].
k=1
P
Proof. Recall that expectation of U ∈ Z+ can be written as E [U] = k≥1 P [U ≥ k]. Then by
P P
Theorem 10.2, E[l(f∗ (X))] = E [blog2 Xc] = k≥1 P [blog2 Xc ≥ k] = k≥1 P [log2 X ≥ k].
Remark 10.1. Theorem 10.5 is the first example of a coding theorem in this book, which relates
the fundamental limit E[l(f∗ (X))] (an operational quantity) to the entropy H(X) (an information
measure).
Proof. Define L(X) = l(f∗ (X))). For the upper bound, observe that since the PMF are ordered
decreasingly by assumption, PX (m) ≤ 1/m, so L(m) ≤ log2 m ≤ log2 (1/PX (m)). Taking
expectation yields E[L(X)] ≤ H(X).
For the lower bound,
( a)
H(X) = H(X, L) = H(X|L) + H(L) ≤ E[L] + H(L)
(b) 1
≤ E [ L] + h (1 + E[L])
1 + E[L]
1
= E[L] + log2 (1 + E[L]) + E[L] log2 1 + (10.2)
E [ L]
( c)
≤ E[L] + log2 (1 + E[L]) + log2 e
(d)
≤ E[L] + log2 (e(1 + H(X)))
where in (a) we have used the fact that H(X|L = k) ≤ k bits, because f∗ is lossless, so that given
f∗ (X) ∈ {0, 1}k , X can take at most 2k values; (b) follows by Lemma 10.3; (c) is via x log(1+1/x) ≤
log e, ∀x > 0; and (d) is by the previously shown upper bound H(X) ≤ E[L].
i i
i i
i i
166
1
For the case of sources for which log2 PS has non-lattice distribution, it is further shown in [300,
Theorem 3]:
1
E[l(f∗ (Sn ))] = nH(S) − log2 (8πeV(S)n) + o(1) , (10.3)
2
where V(S) is the varentropy of the source S:
1
V(S) ≜ Var log2 . (10.4)
PS (S)
Theorem 10.5 relates the mean of l(f∗ (X)) to that of log2 PX1(X) (entropy). It turns out that
distributions of these random variables are also closely related.
Proof. Lower bound (achievability): Use PX (m) ≤ 1/m. Then similarly as in Theorem 10.5,
L(m) = blog2 mc ≤ log2 m ≤ log2 PX 1(m) . Hence L(X) ≤ log2 PX1(X) a.s.
Upper bound (converse): By truncation,
1 1
P [L ≤ k] = P L ≤ k, log2 ≤ k + τ + P L ≤ k, log2 >k+τ
PX (X) PX (X)
X
1
≤ P log2 ≤k+τ + PX (x)1{l(f∗ (x))≤k} 1{PX (x)≤2−k−τ }
PX (X)
x∈X
1
≤ P log2 ≤ k + τ + (2k+1 − 1) · 2−k−τ
PX (X)
So far our discussion applies to an arbitrary random variable X. Next we consider the source as
a random process (S1 , S2 , . . .) and introduce blocklength n. We apply our results to X = Sn , that is,
by treating the first n symbols as a supersymbol. The following corollary states that the limiting
behavior of l(f∗ (Sn )) and log PSn 1(Sn ) always coincide.
Corollary 10.7. Let (S1 , S2 , . . .) be a random process and U, V real-valued random variable. Then
1 1 d 1 ∗ n d
log2 →U
− ⇔ l(f (S ))−
→U (10.5)
n PSn (Sn ) n
i i
i i
i i
and
1 1 1
√ (l(f∗ (Sn )) − H(Sn ))−
d d
√ log2 − H( S ) →
n
−V ⇔ →V (10.6)
n PSn (Sn ) n
Proof. First recall that convergence in distribution is equivalent to convergence of CDF at all
d
→U ⇔ P [Un ≤ u] → P [U ≤ u] for all u at which point the CDF of U is
continuity point, i.e., Un −
continuous (i.e., not an atom of U).
√
To get (10.5), apply Theorem 10.6 with k = un and τ = n:
1 1 1 ∗ 1 1 1 √
P log2 ≤ u ≤ P l(f (X)) ≤ u ≤ P log2 ≤ u + √ + 2− n+1 .
n PX (X) n n PX (X) n
√
To get (10.6), apply Theorem 10.6 with k = H(Sn ) + nu and τ = n1/4 :
∗ n
1 1 l(f (S )) − H(Sn )
P √ log − H( S ) ≤ u ≤ P
n
√ ≤u
n PSn (Sn ) n
1 1 −1/4
+ 2−n +1
1/ 4
≤P √ log n
− H( S ) ≤ u + n
n
n PSn (S )
Now let us particularize the preceding theorem to memoryless sources of i.i.d. Sj ’s. The
important observation is that the log likihood becomes an i.i.d. sum:
1 X n
1
log n
= log .
PSn (S ) PS (Si )
i=1 | {z }
i.i.d.
P
1 By the Law of Large Numbers (LLN), we know that n1 log PSn 1(Sn ) − →E log PS1(S) = H(S).
Therefore in (10.5) the limiting distribution U is degenerate, i.e., U = H(S), and we
P
have 1n l(f∗ (Sn ))−
→E log PS1(S) = H(S). [Note: convergence in distribution to a constant ⇔
convergence in probability to a constant]
2 By the Central Limit Theorem (CLT), if varentropy V(S) < ∞, then we know that V in (10.6)
is Gaussian, i.e.,
1 1 d
p log − nH(S) −→N (0, 1).
nV(S) PSn (Sn )
Consequently, we have the following Gaussian approximation for the probability law of the
optimal code length
1
(l(f∗ (Sn )) − nH(S))−
d
p →N (0, 1),
nV(S)
or, in shorthand,
p
l(f∗ (Sn )) ∼ nH(S) + nV(S)N (0, 1) in distribution.
1 ∗ n
Gaussian approximation tells us the speed of n l(f (S )) to entropy and give us a good
approximation at finite n.
i i
i i
i i
168
Optimal compression: CDF, n = 200, PS = [0.445 0.445 0.110] Optimal compression: PMF, n = 200, P S = [0.445 0.445 0.110]
1 0.06
True PMF
Gaussian approximation
Gaussian approximation (mean adjusted)
0.9
0.05
0.8
0.7
0.04
0.6
0.5
P
0.03
P
0.4
0.02
0.3
0.2
0.01
True CDF
0.1 Lower bound
Upper bound
Gaussian approximation
Gaussian approximation (mean adjusted)
0 0
1.25 1.3 1.35 1.4 1.45 1.5 1.25 1.3 1.35 1.4 1.45 1.5
Rate Rate
Figure 10.2 Left plot: Comparison of the true CDF of l(f∗ (Sn )), bounds of Theorem 10.6 (optimized over τ ),
and the Gaussian approximations in (10.7) and (10.8). Right plot: PMF of the optimal compression length
l(f∗ (Sn )) and the two Gaussian approximations.
Example 10.1 (Ternary source). Next we apply our bounds to approximate the distribution of
l(f∗ (Sn )) in a concrete example. Consider a memoryless ternary source outputing i.i.d. n symbols
from the distribution PS = [0.445, 0.445, 0.11]. We first compare different results on the minimal
expected length E[l(f∗ (Sn ))] in the following table:
Blocklength Lower bound (10.5) E[l(f∗ (Sn ))] H(Sn ) (upper bound) asymptotics (10.3)
n = 20 21.5 24.3 27.8 23.3 + o(1)
n = 100 130.4 134.4 139.0 133.3 + o(1)
n = 500 684.1 689.2 695.0 688.1 + o(1)
In all cases above E[l(f∗ (S))] is close to a midpoint between the bounds.
Next we consider the distribution of l(f∗ (Sn ). Its Gaussian approximation is defined as
p
nH(S) + nV(S)Z , Z ∼ N ( 0, 1) . (10.7)
However, in view of (10.3) we also define the mean-adjusted Gaussian approximation as
1 p
nH(S) − log2 (8πeV(S)n) + nV(S)Z , Z ∼ N (0, 1) . (10.8)
2
Fig. 10.2 compares the true distribution of l(f∗ (Sn )) with bounds and two Gaussian approxima-
tions.
i i
i i
i i
Figure 10.3 The log-log frequency-rank plots of the most used words in various languages exhibit a power
law tail with exponent close to 1, as popularized by Zipf [350]. Data from [292].
pr r−α for some value of α. Remarkably, this holds across various corpi of text in multiple
different languages (and with α ≈ 1) – see Fig. 10.3 for an illustration. Even more surprisingly, a
lot of other similar tables possess the power-law distribution: “city populations, the sizes of earth-
quakes, moon craters, solar flares, computer files, wars, personal names in most cultures, number
of papers scientists write, number of citations a paper receives, number of hits on web pages, sales
of books and music recordings, number of species in biological taxa, people’s incomes” (quoting
from [225], which gives references for each study). This spectacular universality of the power law
continues to provoke scientiests from many disciplines to suggest explanations for its occurrence;
see [220] for a survey of such. One of the earliest (in the context of natural language of Zipf) is
due to Mandelbrot [209] and is in fact intimately related to the topic of this Chapter.
Let us go back to the question of minimal expected length of the representation of source X. We
have shown bounds on this quantity in terms of the entropy of X in Theorem 10.5. Let us introduce
the following function
i i
i i
i i
170
where optimization is over lossless encoders and probability distributions PX = {pj : j = 1, . . .}.
Theorem 10.5 (or more precisely, the intermediate result (10.2)) shows that
It turns out that the upper bound is in fact tight. Furthermore, among all distributions the optimal
tradeoff between entropy and minimal compression length is attained at power law distributions.
To show that, notice that in computing H(Λ), we can restrict attention to sorted PMFs p1 ≥ p2 ≥
· · · (call this class P ↓ ), for which the optimal encoder is such that l(f(j)) = blog2 jc (Theorem 10.2).
Thus, we have shown
X
H(Λ) = sup {H(P) : pj blog2 jc ≤ Λ} .
P∈P ↓ j
Next, let us fix the base of the logarithm of H to be 2, for convenience. (We will convert to arbitrary
base at the end). Applying Example 5.2 we obtain:
Comparing with (10.9) we find that the upper bound in (10.9) is tight and attained by Pλ∗ . From
the first equation above, we also find λ∗ = log2 2+Λ2Λ . Altogether this yields
The argument of Mandelbrot [209] The above derivation shows a special (extremality) prop-
erty of the power law, but falls short of explaining its empirical ubiquity. Here is a way to connect
the optimization problem H(Λ) to the evolution of the natural language. Suppose that there is a
countable set S of elementary concepts that are used by the brain as building blocks of perception
and communication with the outside world. As an approximation we can think that concepts are
in one-to-one correspondence with language words. Now every concept x is represented internally
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 171
by the brain as a certain pattern, in the simplest case – a sequence of zeros and ones of length l(f(x))
([209] considers more general representations). Now we have seen that the number of sequences
of concepts with a composition P grows exponentially (in length) with the exponent given by
H(P), see Proposition 1.5. Thus in the long run the probability distribution P over the concepts
results in the rate of information transfer equal to EP [Hl((fP(X) ))] . Mandelbrot concludes that in order
to transfer maximal information per unit, language and brain representation co-evolve in such a
way as to maximize this ratio. Note that
H(P) H(Λ)
sup = sup .
P,f EP [l(f(X))] Λ Λ
It is not hard to show that H(Λ) is concave and thus the supremum is achieved at Λ = 0+ and
equals infinity. This appears to have not been observed by Mandelbrot. To fix this issue, we can
postulate that for some unknown reason there is a requirement of also having a certain minimal
entropy H(P) ≥ h0 . In this case
H(P) h0
sup = −1
P,f:H(P)≥h0 EP [l(f(X))] H ( h0 )
and the supremum is achieved at a power law distribution P. Thus, the implication is that the fre-
quency of word usage in human languages evolves until a power law is attained, at which point it
maximizes information transfer within the brain. That’s the gist of the argument of [209]. It is clear
that this does not explain appearance of the power law in other domains, for which other explana-
tions such as preferential attachment models are more plausible, see [220]. Finally, we mention
that the Pλ distributions take discrete values 2−λm−log2 Z(λ) , m = 0, 1, 2, . . . with multiplicities 2m .
Thus Pλ appears as a rather unsightly staircase on frequency-rank plots such as Fig. 10.3. This
artifact can be alleviated by considering non-binary brain representations with unequal lengths of
signals.
i i
i i
i i
172
Definition 10.9 (Uniquely decodable codes). f : A → {0, 1}∗ is uniquely decodable if its
extension f : A+ → {0, 1}∗ is injective.
Definition 10.10 (Prefix codes). f : A → {0, 1}∗ is a prefix code1 if no codeword is a prefix of
another (e.g., 010 is a prefix of 0101).
• f(a) = 0, f(b) = 1, f(c) = 10. Not uniquely decodable, since f(ba) = f(c) = 10.
• f(a) = 0, f(b) = 10, f(c) = 11. Uniquely decodable and a prefix code.
• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 Uniquely decodable but not a prefix code, since
as long as 0 appears, we know that the previous codeword has terminated.2
Remark 10.2.
1 Prefix codes are uniquely decodable and hence lossless, as illustrated in the following picture:
prefix codes
Huffman
code
2 Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3 By definition, any uniquely decodable code does not have the empty string as a codeword. Hence
f : X → {0, 1}+ in both Definition 10.9 and Definition 10.10.
4 Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
1
Also known as prefix-free/comma-free/self-punctuatingf/instantaneous code.
2
In this example, if 0 is placed at the very end of each codeword, the code is uniquely decodable, known as the unary code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 173
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5 Prefix codes are in one-to-one correspondence with binary trees (with codewords at leaves). It
is also equivalent to strategies to ask “yes/no” questions previously mentioned at the end of
Section 1.1.
1 Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft inequality
X
2−la ≤ 1. (10.10)
a∈A
2 Conversely, for any set of code length {la : a ∈ A} satisfying (10.10), there exists a prefix code
f, such that la = l(f(a)). Moreover, such an f can be computed efficiently.
Remark 10.3. The consequence of Theorem 10.11 is that as far as compression efficiency is
concerned, we can ignore those uniquely decodable codes that are not prefix codes.
Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The proof for the former is probabilistic, following ideas in [10, Exercise 1.8, p. 12]. Let f be a
prefix code. Let us construct a probability space such that the LHS of (10.10) is the probability
of some event, which cannot exceed one. To this end, consider the following scenario: Generate
independent Ber( 12 ) bits. Stop if a codeword has been written, otherwise continue. This process
P
terminates with probability a∈A 2−la . The summation makes sense because the events that a
given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
P PL
finite. Then L = maxa∈A la is finite. Let Gf (z) = a∈A zla = l=0 Al (f)zl , where Al (f) denotes
the number of codewords of length l in f. For k ≥ 1, define fk : Ak → {0, 1}+ as the symbol-
P k k P P
by-symbol extension of f. Then Gfk (z) = ak ∈Ak zl(f (a )) = a1 · · · ak zla1 +···+lak = [Gf (z)]k =
PkL k l
l=0 Al (f )z . By the unique decodability of f, fk is lossless. Hence Al (fk ) ≤ 2l . Therefore we have
P
Gf (1/2) = Gfk (1/2) ≤ kL for all k. Then a∈A 2−la = Gf (1/2) ≤ limk→∞ (kL)1/k = 1. If A is
k
P
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives a∈A′ 2−la ≤
1. The proof is complete by the arbitrariness of A′ .
P
Conversely, given a set of code lengths {la : a ∈ A} s.t. a∈A 2−la ≤ 1, construct a prefix
code f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define
X
i− 1
ai ≜ 2− l k
k=1
with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}+ as the
first li bits in the binary expansion of ai . Finally, we prove that f is a prefix code by contradiction:
i i
i i
i i
174
Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li . Then aj − ai ≤ 2−li , since they agree
on the most significant li bits. But aj − ai = 2−li + 2−li+1 +. . . > 2−li , which is a contradiction.
Remark 10.4. A conjecture of Ahslwede et al [4] states that for any set of lengths for which
P −la
2 ≤ 34 there exists a fix-free code (i.e. one which is simultaneously prefix-free and suffix-
free). So far, existence has only been shown when the Kraft sum is ≤ 58 , cf. [343].
In view of Theorem 10.11, the optimal average code length among all prefix (or uniquely
decodable) codes is given by the following optimization problem
X
L∗ (X) ≜ min P X ( a) la (10.11)
a∈A
X
s.t. 2−la ≤ 1
a∈A
la ∈ N
This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:
Theorem 10.12.
Light inequality: We give two proofs for this converse. One of the commonly used ideas to deal
with combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in
(10.11) and relax it into the following optimization problem, which obviously provides a lower
bound
X
L∗ (X) ≜ min PX (a)la (10.13)
a∈A
X
s.t. 2−la ≤ 1 (10.14)
a∈A
This is a nice convex optimization problem, with affine objective function and a convex feasible
set. Solving (10.13) by Lagrange multipliers (Exercise!) yields that the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
3
Such a code is called a Shannon code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 175
Another proof is the following: For any f whose codelengths {la } satisfying the Kraft inequality,
− la
define a probability measure Q(a) = P 2 2−la . Then
a∈A
X
El(f(X)) − H(X) = D(PkQ) − log 2−la ≥ 0.
a∈A
Next we describe the Huffman code, which achieves the optimum in (10.11). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of the Huffman code is to build
the binary tree from the bottom up: Given a PMF {PX (a) : a ∈ A},
The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted PMF)
or O(|A| log |A|) (unsorted PMF).
Example 10.3. A = {a, b, c, d, e}, PX = {0.25, 0.25, 0.2, 0.15, 0.15}.
Huffman tree: Codebook:
0 1 f(a) = 00
0.55 0.45 f(b) = 10
0 1 0 1 f(c) = 11
f(d) = 010
a 0.3 b c
0 1 f(e) = 011
d e
Theorem 10.13 (Optimality of Huffman codes). The Huffman code achieves the minimal average
code length (10.11) among all prefix (or uniquely decodable) codes.
1 As Shannon pointed out in his 1948 paper, in compressing English texts, in addition to exploit-
ing the nonequiprobability of English letters, working with pairs (or more generally, n-grams)
of letters achieves even more compression. To compress a block of symbols (S1 , . . . , Sn ), while
a natural idea is to apply the Huffman codes on a symbol-by-symbol basis (i.e., applying the cor-
responding Huffman code for each PSi ). By Theorem 10.12, this is only guaranteed to achieve an
Pn
average length at most i=1 H(Si ) + n bits, which also fails to exploit the memory in the source
i i
i i
i i
176
Pn
when i=1 H(Si ) is significantly larger than H(S1 , . . . , Sn ). The solution is to apply block Huff-
man coding. Indeed, compressing the block (S1 , . . . , Sn ) using its Huffman code (designed for
PS1 ,...,Sn ) achieves H(S1 , . . . , Sn ) within one bit, but the complexity is |A|n !
2 Constructing the Huffman code requires knowing the source distribution. This brings us the
question: Is it possible to design universal compressor which achieves entropy for a class of
source distributions? And what is the price to pay? These questions are addressed in Chapter 13.
There are much more elegant solutions, e.g.,
i i
i i
i i
In the previous chapter we introduced the concept of variable-length compression and studied
its fundamental limits (with and without prefix-free condition). In some situations, however, one
may desire that the output of the compressor always has a fixed length, say, k bits. Unless k is
unreasonably large, then, this will require relaxing the losslessness condition. This is the focus of
this chapter: compression in the presence of (typically vanishingly small) probability of error. It
turns out allowing even very small error enables several beautiful effects:
• The possibility to compress data via matrix multiplication over finite fields (Linear Compres-
sion).
• The possibility to reduce compression length from H(X) to H(X|Y) if side information Y is
available at the decompressor (Slepian-Wolf).
• The possibility to reduce compression length below H(X) if access to a compressed representa-
tion of side-information Y is available at the decompressor (Ahslwede-Körner-Wyner).
that g(f(X)) = X with probability one, then k ≥ log2 |supp(PX )| and no meaningful compression
can be achieved. It turns out that by tolerating a small error probability, we can gain a lot in
terms of code length! So, instead of requiring g(f(x)) = x for all x ∈ X , consider only lossless
decompression for a subset S ⊂ X :
(
x x∈S
g(f(x)) =
e x 6∈ S
P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈
/ S] .
177
i i
i i
i i
178
f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}
The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Chapter 10):
Theorem 11.2 (Fundamental limit of fixed-length compression). Recall the optimal variable-
length compressor f∗ defined in Theorem 10.2. Then
Comparing Theorems 10.2 and 11.2, we see that the optimal codes in these two settings work
as follows:
Remark 11.1. In Definition 11.1 we require that the errors are always detectable, i.e., g(f(x)) = x
or e. Alternatively, we can drop this requirement and allow undetectable errors, in which case we
can of course do better since we have more freedom in designing codes. It turns out that we do
not gain much by this relaxation. Indeed, if we define
then ϵ̃∗ (X, k) = 1 − sum of 2k largest masses of X. This follows immediately from
P
P [g(f(X)) = X] = x∈S PX (x) where S ≜ {x : g(f(x)) = x} satisfies |S| ≤ 2k , because f takes no
more than 2k values. Compared to Theorem 11.2, we see that ϵ̃∗ (X, k) and ϵ∗ (X, k) do not differ
much. In particular, ϵ∗ (X, k + 1) ≤ ϵ̃∗ (X, k) ≤ ϵ∗ (X, k).
i i
i i
i i
11.1 Fixed-length almost lossless code. Asymptotic Equipartition Property (AEP). 179
p
lim ϵ∗ (Sn , nH(S) + nV(S)γ) = 1 − Φ(γ).
n→∞
where Φ(·) is the CDF of N (0, 1), H(S) = E[log PS1(S) ] is the entropy, V(S) = Var[log PS1(S) ] is the
varentropy which is assumed to be finite.
Next we give separate achievability and converse bounds complementing the exact result in
Theorem 11.2.
Theorem 11.5.
∗ 1
ϵ (X, k) ≤ P log2 ≥k . (11.1)
PX (X)
Theorem 11.6.
∗ 1
ϵ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0. (11.2)
PX (X)
Note that Theorem 11.5 is in fact always stronger than Theorem 11.6. Still, we present the proof
of Theorem 11.6 and the technology behind it – random coding – a powerful technique introduced
by Shannon for proving existence of good codes (achievability). This technique is used throughout
in this book and Theorem 11.6 is its first appearance. To see that Theorem 11.5 gives a better bound,
note that even the first term in (11.2) exceeds (11.1). Nevertheless, the random coding argument for
proving this weaker bound is much more important and generalizable. We will apply it again for
linear compression in Section 11.2 and the Slepian-Wolf problem in Section 11.4 in this chapter;
i i
i i
i i
180
later for data transmission and lossy data compression in Parts IV and V it will take the central
stage as the method of choice for most achievability proofs.
Proof of Theorem 11.5. Construction: use those 2k − 1 symbols with the highest probabilities.
The analysis is essentially the same as the lower bound in Theorem 10.6 from Chapter 10. Note
that the mth largest mass PX (m) ≤ m1 . Therefore
X X X
ϵ∗ (X, k) = P X ( m) = 1{m≥2k } PX (m) ≤ 1n 1 ≥2k o PX (m) = E1nlog 1 ≥ko .
PX (m) 2 PX (X)
m≥2k
Proof of Theorem 11.6. (Random coding.) For a given compressor f, the optimal decompressor
which minimizes the error probability is the maximum a posteriori (MAP) decoder, i.e.,
which can be hard to analyze. Instead, let us consider the following (suboptimal) decompressor g:
x, ∃! x ∈ X s.t. f(x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)
e, o.w.
i i
i i
i i
11.1 Fixed-length almost lossless code. Asymptotic Equipartition Property (AEP). 181
X
= 2− k E X 1{PX (x′ )≥2−k+τ }
x′ ̸=X
X
≤ 2− k 1{PX (x′ )≥2−k+τ }
x′ ∈X
Remark 11.2 (Why random coding works). The compressor f(x) = cx can be thought as hashing
x ∈ X to a random k-bit string cx ∈ {0, 1}k , as illustrated below:
Here, x has high probability ⇔ log2 PX1(x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the number of
those high-probability x’s is at most 2k−τ , which is far smaller than 2k , the total number of k-bit
codewords. Hence the chance of collision is small.
Remark 11.3. The random coding argument is a canonical example of probabilistic method: To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object.
Remark 11.4. This is a subtle point: Notice that in the proof we choose the random codebook to
be uniform over all possible codebooks. In other words, C = {cx : x ∈ X } consists of iid k-bit
strings. In fact, in the proof we only need pairwise independence, i.e., cx ⊥ ⊥ cx′ for any x 6= x′
(Why?). Now, why should we care about this? In fact, having access to external randomness is
also a lot of resources. It is more desirable to use less randomness in the random coding argument.
Indeed, if we use zero randomness, then it is a deterministic construction, which is the best situa-
tion! Using pairwise independent codebook requires significantly less randomness than complete
random coding which needs |X |k bits. To see this intuitively, note that one can use 2 independent
random bits to generate 3 random bits that is pairwise independent but not mutually independent,
e.g., {b1 , b2 , b1 ⊕ b2 }. This observation is related to linear compression studied in the next section,
where the codeword we generated are not iid, but elements of a random linear subspace.
i i
i i
i i
182
sequences whose Hamming is close to the expectation: Tδn = {sn ∈ {0, 1}n : w(sn ) ∈ [p ± δ ′ ]n},
where δ ′ is a constant depending on δ .
As a consequence of (11.3),
1 P Sn ∈ Tδn → 1 as n → ∞.
2 |Tδn | ≤ 2(H(S)+δ)n |S|n .
In other words, Sn is concentrated on the set Tδn which is exponentially smaller than the whole
space. In almost lossless compression we can simply encode this set losslessly. Although this is
different than the optimal encoding, Corollary 11.3 indicates that in the large-n limit the optimal
compressor is no better.
The property (11.3) is often referred as the Asymptotic Equipartition Property (AEP), in the
sense that the random vector is concentrated on a set wherein each reliazation is roughly equally
likely up to the exponent. Indeed, Note that for any sn ∈ Tδn , its likelihood is concentrated around
PSn (sn ) ∈ 2−(H(S)±δ)n , called δ -typical sequences.
Definition 11.7 (Galois field). F is a finite set with operations (+, ·) where
i i
i i
i i
A linear compressor is a linear function H : Fnq → Fkq (represented by a matrix H ∈ Fqk×n ) that
maps each x ∈ Fnq to its codeword w = Hx, namely
w1 h11 . . . h1n x1
.. .. .. ..
. = . . .
wk hk1 ... hkn xn
Compression is achieved if k < n, i.e., H is a fat matrix, which, again, is only possible in the
almost lossless sense.
Theorem 11.8 (Achievability). Let X ∈ Fnq be a random vector. ∀τ > 0, ∃ linear compressor
H : Fnq → Fkq and decompressor g : Fkq → Fnq ∪ {e}, s.t.
1
P [g(HX) 6= X] ≤ P logq > k − τ + q−τ
PX (X)
Remark 11.6. Consider the Hamming space q = 2. In comparison with Shannon’s random coding
achievability, which uses k2n bits to construct a completely random codebook, here for linear codes
we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional linear
subspace of the Hamming space.
Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 11.6),
given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) =
argmaxx:Hx=w PX (x), which outputs the most likely symbol that is compatible with the codeword
received. Instead, let us consider the following (suboptimal) decoder for its ease of analysis:
(
x ∃!x ∈ Fnq : w = Hx, x − h.p.
g( w ) =
e otherwise
where we used the short-hand:
1
x − h.p. (high probability) ⇔ logq < k − τ ⇔ PX (x) ≥ q−k+τ .
P X ( x)
Note that this decoder is the same as in the proof of Theorem 11.6. The proof is also mostly the
same, except now hash collisions occur under the linear map H. By union bound,
1
P [g(f(X)) = e] ≤ P logq > k − τ + P [∃x′ − h.p. : x′ 6= X, Hx′ = HX]
P X ( x)
i i
i i
i i
184
X X
1
(union bound) ≤ P logq >k−τ + PX (x) 1{Hx′ = Hx}
PX (x) x x′ −h.p.,x′ ̸=x
Now we use random coding to average the second term over all possible choices of H. Specif-
ically, choose H as a matrix independent of X where each entry is iid and uniform on Fq . For
distinct x0 and x1 , the collision probability is
PH [Hx1 = Hx0 ] = PH [Hx2 = 0] (x2 ≜ x1 − x0 6= 0)
= P H [ H 1 · x2 = 0 ] k
(iid rows)
where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality qn−1 . So the probability for the first row to lie in this subspace is
qn−1 /qn = 1/q, hence the collision probability 1/qk . Averaging over H gives
X X
EH 1{Hx′ = Hx} = PH [Hx′ = Hx] = |{x′ : x′ − h.p., x′ 6= x}|q−k ≤ qk−τ q−k = q−τ
x′ −h.p.,x′ ̸=x x′ −h.p.,x′ ̸=x
• f : X × Y → {0, 1}k
• g : {0, 1}k × Y → X ∪ {e}
• P[g(f(X, Y), Y) 6= X] < ϵ
• Fundamental Limit: ϵ∗ (X|Y, k) = inf{ϵ : ∃(k, ϵ) − S.I. code}
i i
i i
i i
Note that here unlike the source X, the side information Y need not be discrete. Conditioned
on Y = y, the problem reduces to compression without side information studied in Section 11.1,
where the source X is distributed according to PX|Y=y . Since Y is known to both the compressor
and decompressor, they can use the best code tailored for this distribution. Recall ϵ∗ (X, k) defined
in Definition 11.1, the optimal probability of error for compressing X using k bits, which can also
be denoted by ϵ∗ (PX , k). Then we have the following relationship
ϵ∗ (X|Y, k) = Ey∼PY [ϵ∗ (PX|Y=y , k)],
Theorem 11.10.
1 1
P log > k + τ − 2−τ ≤ ϵ∗ (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y) PX|Y (X|Y)
i.i.d.
Corollary 11.11. Let (X, Y) = (Sn , Tn ) where the pairs (Si , Ti ) ∼ PST . Then
(
∗ n n 0 R > H(S|T)
lim ϵ (S |T , nR) =
n→∞ 1 R < H(S|T)
Proof. Using the converse Theorem 11.4 and achievability Theorem 11.6 (or Theorem 11.5) for
compression without side information, we have
1 1
P log > k + τ Y = y − 2−τ ≤ ϵ∗ (PX|Y=y , k) ≤ P log > k Y = y
PX|Y (X|y) PX|Y (X|y)
By taking the average over all y ∼ PY , we get the theorem. For the corollary
1X
n
1 1 1 P
log = log −
→H(S|T)
n PSn |Tn (S |T )
n n n PS|T (Si |Ti )
i=1
i i
i i
i i
186
• f : X → {0, 1}k
• g : {0, 1}k × Y → X ∪ {e}
• P[g(f(X), Y) 6= X] ≤ ϵ
• Fundamental Limit: ϵ∗SW (X|Y, k) = inf{ϵ : ∃(k, ϵ)-S.W. code}
Now the very surprising result: Even without side information at the compressor, we can still
compress down to the conditional entropy!
Corollary 11.14.
(
0 R > H(S|T)
lim ϵ∗SW (Sn |Tn , nR) =
n→∞ 1 R < H(S|T)
Remark 11.8. Definition 11.12 does not include the zero-undected-error condition (that is
g(f(x), y) = x or e). In other words, we allow for the possibility of undetected errors. Indeed,
if we require this condition, the side-information savings will be mostly gone. Indeed, assuming
PX,Y (x, y) > 0 for all (x, y) it is clear that under zero-undetected-error condition, if f(x1 ) = f(x2 ) =
c then g(c) = e. Thus except for c all other elements in {0, 1}k must have unique preimages. Sim-
ilarly, one can show that Slepian-Wolf theorem does not hold in the setting of variable-length
lossless compression (i.e. average length is H(X) not H(X|Y).)
Proof. LHS is obvious, since side information at the compressor and decompressor is better than
only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y), then define the compressor and decoder as
f ( x) = cx
i i
i i
i i
(
x ∃!x : cx = w, x − h.p.|y
g(w, y) =
0 o.w.
where we used the shorthand x − h.p.|y ⇔ log2 PX|Y1(x|y) < k − τ . The error probability of this
scheme, as a function of the code book C, is
1
E(C) = P log ≥ k − τ or J(X, C|Y) 6= ∅
PX|Y (X|Y)
1
≤ P log ≥ k − τ + P [J(X, C|Y) 6= ∅]
PX|Y (X|Y)
X
1
= P log ≥k−τ + PX,Y (x, y)1{J(x,C|y)̸=∅} .
PX|Y (X|Y) x, y
= 2k−τ P[cx′ = cx ]
= 2−τ
Hence the theorem follows as usual from two terms in the union bound.
X {0, 1}k1
Compressor f1
Decompressor g
(X̂, Ŷ)
Y {0, 1}k2
Compressor f2
i i
i i
i i
188
• (f1 , f2 , g) is (k1 , k2 , ϵ)-code if f1 : X → {0, 1}k1 , f2 : Y → {0, 1}k2 , g : {0, 1}k1 × {0, 1}k2 →
X × Y , s.t. P[(X̂, Ŷ) 6= (X, Y)] ≤ ϵ, where (X̂, Ŷ) = g(f1 (X), f2 (Y)).
• Fundamental limit: ϵ∗SW (X, Y, k1 , k2 ) = inf{ϵ : ∃(k1 , k2 , ϵ)-code}.
R2
Achievable
H(T )
Region
H(T |S)
R1
H(S|T ) H(S)
Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope is −1.
Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:
1 R1 < H(S|T). Then even if encoder and decoder had full Tn , still can’t achieve this (from
compression with side info result – Corollary 11.11).
2 R2 < H(T|S) (same).
3 R1 + R2 < H(S, T). Can’t compress below the joint entropy of the pair (S, T).
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T|S) + δ). By
Corollary 11.3, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (Sn )) 6= Sn ] ≤
ϵn → 0. By Theorem 11.13, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (Tn ), Sn ) 6= Tn ] ≤ ϵn → 0. Now that Sn is not available, feed the S.W. decompres-
sor with g(f(Sn )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 ))) (see
below):
i i
i i
i i
Sn Ŝn
f1 g1
Tn T̂n
f2 g2
(Exercise: Write down the details rigorously.) Therefore, all convex combinations of points in the
achievable regions are also achievable, so the achievable region must be convex.
Theorem 11.17 (Ahlswede-Körner-Wyner). Consider i.i.d. source (Xn , Yn ) ∼ PX,Y with X dis-
crete. If rate pair (R1 , R2 ) is achievable with vanishing probability of error P[X̂n 6= Xn ] → 0, then
there exists an auxiliary random variable U taking values on alphabet of cardinality |Y| + 1 such
that PX,Y,U = PX,Y PU|X,Y and
i i
i i
i i
190
X {0, 1}k1
Compressor f1
Decompressor g
X̂
Y {0, 1}k2
Compressor f2
Furthermore, for every such random variable U the rate pair (H(X|U), I(Y; U)) is achievable with
vanishing error.
where (11.9) follows from I(W2 , Xk−1 ; Yk |Yk−1 ) = I(W2 ; Yk |Yk−1 ) + I(Xk−1 ; Yk |W2 , Yk−1 ) and the
⊥ Xk−1 |Yk−1 ; and (11.10) from Yk−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (11.7) and (11.10) we
i i
i i
i i
and thus (from convexity) the rate pair must belong to the region spanned by all pairs
(H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be chosen to take
at most |Y| + 1 values, one can invoke Carathéodory’s theorem (see Lemma 7.12). We omit the
details.
Finally, showing that for each U the mentioned rate-pair is achievable, we first notice that if there
were side information at the decompressor in the form of the i.i.d. sequence Un correlated to Xn ,
then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be sufficient to reconstruct
Xn . Thus, the question boils down to creating a correlated sequence Un at the decompressor by
using the minimal rate R2 . This is the content of the so called covering lemma – see Theorem 25.5:
It is sufficient to use rate I(U; Y) to do so. We omit further details.
i i
i i
i i
In this chapter, we shall examine similar results for ergodic processes and we first state the main
theory as follows:
Corollary 12.2. For any stationary and ergodic discrete process {S1 , S2 , . . . }, (12.1) – (12.2) hold
with H(S) replaced by H.
Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 10.6 + Theo-
rem 11.2 which tie together the respective CDF of the random variable l(f∗ (Sn )) and log PSn1(sn ) .
In Chapter 11 we learned the asymptotic equipartition property (AEP) for iid sources. Here we
generalize it to stationary ergodic sources thanks to Shannon-McMillan.
Corollary 12.3 (AEP for stationary ergodic sources). Let {S1 , S2 , . . . } be a stationary and ergodic
discrete process. For any δ > 0, define the set
1 1
Tδn = sn : log − H ≤δ .
n PSn (sn )
Then
1 P Sn ∈ Tδn → 1 as n → ∞.
2 2n(H−δ) (1 + o(1)) ≤ |Tδn | ≤ 2(H+δ)n (1 + o(1)).
192
i i
i i
i i
Some historical notes are in order. Convergence in probability for stationary ergodic Markov
chains was already shown in [277]. The extension to convergence in L1 for all stationary ergodic
processes is due to McMillan in [216], and to almost sure convergence to Breiman [52]. A modern
proof is in [6]. Note also that for a Markov chain, existence of typical sequences and the AEP can
be anticipated by thinking of a Markov process as a sequence of independent decisions regarding
which transitions to take at each state. It is then clear that Markov process’s trajectory is simply a
transformation of trajectories of an iid process, hence must concentrate similarly.
The set E is called τ -invariant if E = τ −1 E. The set of all τ -invariant sets forms a σ -algrebra
(exercise) denoted Finv .
Sj = Sj−1 ◦ τ = S0 ◦ τ j
Remark 12.1.
(s1 , s2 , . . . ) ∈ E ⇒ (s0 , s1 , s2 , . . . ) ∈ E, s0
i i
i i
i i
194
It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event
{∃n : xi = 0, ∀ odd i ≥ n}
Proposition 12.6 (Poincare recurrence). Let τ be measure-preserving for (Ω, F, P). Then for any
measurable A with P[A] > 0 we have
[
P[ τ −k A|A] = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥ 1
S
Proof. Let B = k≥ 1 τ −k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently
P[ A ∪ B] = P[ B] . (12.5)
but the left-hand side equals P[A ∪ B] by the measure-preservation of τ , proving (12.5).
Consider τ mapping initial state of the conservative (Hamiltonian) mechanical system to its
state after passage of a given unit of time. It is known that τ preserves Lebesgue measure in
phase space (Liouville’s theorem). Thus Poincare recurrence leads to a rather counter-intuitive
conclusions. For example, opening the barrier separating two gases in a cylinder allows them to
mix. Poincare recurrence says that eventually they will return back to the original separated state
(with each gas occupying roughly its half of the cylinder). Of course, the “paradox” is resolved
by observing that it will take unphysically long for this to happen.
i i
i i
i i
P[A ∩ τ −n B] → P[A]P[B] .
verify this process is ergodic (in the sense defined above!). Note however, that in Markov-chain
literature a chain is called ergodic if it is irreducible, aperiodic and recurrent. This example does
not satisfy this definition (this clash of terminology is a frequent source of confusion).
• (optional) {Si }: stationary zero-mean Gaussian process with autocovariance function R(n) =
E[S0 S∗n ].
1 X
n
lim R[t] = 0 ⇔ {Si } ergodic ⇔ {Si } weakly mixing
n→∞ n + 1
t=0
Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesaro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {R(n)}, in the sense that there exists a unique probability measure μ on [− 12 , 21 ] such
that R(n) = E exp(i2nπX) where X ∼ μ. The spectral criteria can be formulated as follows:
i i
i i
i i
196
Detailed exposition on stationary Gaussian processes can be found in [101, Theorem 9.3.2, pp.
474, Theorem 9.7.1, pp. 493–494].
Theorem 12.8 (Birkhoff-Khintchine’s Ergodic Theorem). Let {Si } be a stationary and ergodic
process. For any integral function f, i.e., E |f(S1 , . . . )| < ∞,
1X
n
lim f(Sk , . . . ) = E f(S1 , . . . ). a.s. and in L1 .
n→∞ n
k=1
In the special case where f depends on finitely many coordinates, say, f = f(S1 , . . . , Sm ),
1X
n
lim f(Sk , . . . , Sk+m−1 ) = E f(S1 , . . . , Sm ). a.s. and in L1 .
n→∞ n
k=1
Definition 12.9. {Si : i ∈ N} is an mth order Markov chain if PSt+1 |St1 = PSt+1 |Stt−m+1 for all t ≥ m.
It is called time homogeneous if PSt+1 |Stt−m+1 = PSm+1 |Sm1 .
Remark 12.2. Showing (12.3) for an mth order time homogeneous Markov chain {Si } is a direct
application of Birkhoff-Khintchine.
1X
n
1 1 1
log = log
n n
PSn (S ) n PSt |St−1 (St |St−1 )
t=1
1 X
n
1 1 1
= log + log
n PSm (Sm ) n
t=m+1
PSt |St−1 (Sl |Sll− 1
−m )
t−m
1 X
n
1 1 1
= log + log t−1
, (12.6)
n PS1 (Sm ) n P | m ( S | S − )
| {z 1
} | t=m+1 S
{z
m+ 1 S1
t t m
}
→0
→H(Sm+1 |Sm
1 ) by Birkhoff-Khintchine
1
where we applied Theorem 12.8 with f(s1 , s2 , . . .) = log PS m (sm+1 |sm )
.
m+1 |S1 1
i i
i i
i i
Now let’s prove (12.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to approximate the distribution of that ergodic process by an m-th order MC
(finite memory) and make use of (12.6); then let m → ∞ to make the approximation accurate
(Markov approximation).
Proof of Theorem 12.1 in L1 . To show that (12.3) converges in L1 , we want to show that
1
1
E log − H → 0, n → ∞.
n PSn (Sn )
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
Y
∞
Q(m) (S∞ m
1 ) = PS1m (S1 ) PSt |St−1 (St |Stt− 1
−m )
t− m
t=m+1
Y∞
stat.
= PSm1 (Sm
1) PSm+1 |Sm1 (St |Stt− 1
−m )
t=m+1
By triangle inequality,
1 1
1 1 1 1
E log n
− H ≤E log n
− log (m)
n PSn (S ) n PSn (S ) n n
QSn (S )
| {z }
≜A
1
1
+ E log (m) − Hm + |Hm − H|
n n
QSn (S ) | {z }
| {z } ≜C
≜B
stat. 1 (−H(Sn )
+ H( S ) + ( n − m) Hm )
m
=
n
→ Hm − H as n → ∞
i i
i i
i i
198
Lemma 12.10.
dP 2 log e
EP log ≤ D(PkQ) + .
dQ e
Intuitively:
1X k 1
n
An = T = (I − Tn )(I − T)−1
n n
k=1
Then, if f ⊥ ker(I − T) we should have An f → 0, since only components in the kernel can blow
up. This intuition is formalized in the proof below.
Let’s further decompose f into two parts f = f1 + f2 , where f1 ∈ ker(I − T) and f2 ∈ ker(I − T)⊥ .
Observations:
• if g ∈ ker(I − T), g must be a constant function. This is due to the ergodicity. Consider indicator
function 1A , if 1A = 1A ◦ τ = 1τ −1 A , then P[A] = 0 or 1. For a general case, suppose g = Tg and
g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and have non-trivial
measure, violating ergodicity.
i i
i i
i i
where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ kfk · kgk only holds with
equality for g = cf for some constant c.
• ker(I − T)⊥ = ker(I − T∗ )⊥ = [Im(I − T)], where [Im(I − T)] is an L2 closure.
• g ∈ ker(I − T)⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.
With these observations, we know that f1 = m is a const. Also, f2 ∈ [Im(I − T)] so we further
approximate it by f2 = f0 + h1 , where f0 ∈ Im(I − T), namely f0 = g − g ◦ τ for some function
g ∈ L2 , and kh1 k1 ≤ kh1 k2 < ϵ. Therefore we have
An f1 = f1 = E[f]
1
An f0 = (g − g ◦ τ n ) → 0 a.s. and L1
n
P g◦τ n 2 P 1 a. s .
since E[ n≥1 ( n ) ] = E[g ] n2 < ∞ =⇒ 1n g ◦ τ n −−→0.
2
as required.
Proof of (12.7) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 12.11 (Maximal Ergodic Lemma). Let (P, τ ) be a probability measure and a measure-
preserving transformation. Then for any f ∈ L1 (P) we have
E[f1
{supn≥1 An f>a} ] kfk1
P sup An f > a ≤ ≤
n≥1 a a
Pn−1
where An f = 1
n k=0 f ◦ τ k.
This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this theorem
is exactly equivalent to the following result:
Lemma 12.12 (Estimate for the maximum of averages). Let {Zn , n = 1, . . .} be a stationary
process with E[|Z|] < ∞ then
|Z1 + . . . + Zn | E[|Z|]
P sup >a ≤ ∀a > 0
n≥1 n a
i i
i i
i i
200
Proof. The argument for this Lemma has originally been quite involved, until a dramatically
simple proof (below) was found by A. Garcia.
Define
Xn
Sn = Zk (12.8)
k=1
Ln = max{0, Z1 , . . . , Z1 + · · · + Zn } (12.9)
Mn = max{0, Z2 , Z2 + Z3 , . . . , Z2 + · · · + Zn } (12.10)
Sn
Z∗ = sup (12.11)
n≥1 n
i i
i i
i i
Note that every random variable X0 generates a stationary process adapted to τ , that is
Xk ≜ X0 ◦ τ k .
In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:
Theorem 12.14 (Sinai’s generator theorem). Let Y be the generator of a p.p.t. (Ω, F, P, τ ). Let
H(Y) be the entropy rate of the process Y = {Yk = Y ◦ τ k , k = 0, . . .}. If H(Y) is finite, then
H(τ ) = H(Y).
Proof. Notice that since H(Y) is finite, we must have H(Y0n ) < ∞ and thus H(Y) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞
we have from lower semicontinuity of mutual information, cf. (4.28), that
lim I(Y; Ỹm ) ≥ H(Y) ,
m→∞
i i
i i
i i
202
X
n
= H(Ỹn0 ) + H(Yi |Ỹn0 , Yi0−1 )
i=0
X
n
≤ H(Ỹn0 ) + H(Yi |Ỹi )
i=0
Thus, entropy rate of Ỹ (which has finite-alphabet) can be made arbitrarily close to the entropy
rate of Y, concluding that H(τ ) ≥ H(Y).
The main part is showing that for any stationary process X adapted to τ the entropy rate is
upper bounded by H(Y). To that end, consider X : Ω → X with finite X and define as usual the
process X = {X ◦ τ k , k = 0, 1, . . .}. By generating property of Y we have that X (perhaps after
modification on a set of measure zero) is a function of Y∞0 . So are all Xk . Thus
where we used the continuity-in-σ -algebra property of mutual information, cf. (4.30). Rewriting
the latter limit differently, we have
lim H(X0 |Yn0 ) = 0 .
n→∞
where we used stationarity of (Xk , Yk ) and the fact that H(X0 |Yn0−i ) < ϵ for i ≤ n − m. After
dividing by n and passing to the limit our argument implies
H( X ) ≤ H( Y ) + ϵ .
Taking here ϵ → 0 completes the proof.
Alternative proof: Suppose X0 is taking values on a finite alphabet X and X0 = f(Y∞
0 ). Then (this
is a measure-theoretic fact) for every ϵ > 0 there exists m = m(ϵ) and a function fϵ : Y m+1 → X
s.t.
P [ f( Y ∞
0 ) 6= fϵ (Y0 )] ≤ ϵ .
m
S
(This is just another way to say that n σ(Yn0 ) is P-dense in σ(Y∞ 0 ).) Define a stationary process
X̃ as
X̃j ≜ fϵ (Ym
j
+j
).
i i
i i
i i
It is easy to show that Y(ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
• Let Ω be the unit circle S1 , F the Borel σ -algebra, and P the normalized length and
τ (ω) = ω + γ
γ
i.e. τ is a rotation by the angle γ . (When 2π is irrational, this is known to be an ergodic p.p.t.).
Here Y = 1{|ω| < 2π ϵ} is a generator for arbitrarily small ϵ and hence
H(τ ) ≤ H(X) ≤ H(Y0 ) = h(ϵ) → 0 as ϵ → 0 .
This is an example of a zero-entropy p.p.t.
Remark 12.3. Two p.p.t.’s (Ω1 , τ1 , P1 ) and (Ω0 , τ0 , P0 ) are called isomorphic if there exists fi :
Ωi → Ω1−i defined Pi -almost everywhere and such that 1) τ1−i ◦ fi = f1−i ◦ τi ; 2) fi ◦ f1−i is identity
on Ωi (a.e.); 3) Pi [f−
1−i E] = P1−i [E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphic
1
p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revoluationary, since
it allowed to show that p.p.t.s corresponding shifts of iid Ber(1/2) and iid Ber(1/3) procceses are
not isomorphic. Before, the only invariants known were those obtained from studying the spectrum
of a unitary operator
Uτ : L 2 ( Ω , P ) → L 2 ( Ω , P ) (12.16)
i i
i i
i i
204
1
To see the statement about the spectrum, let Xi be iid with zero mean and unit variance. Then consider ϕ(x∞1 ) defined as
∑m iωk x . This ϕ has unit energy and as m → ∞ we have kU ϕ − eiω ϕk
√1 L2 → 0. Hence every e
iω belongs to
m k=1 e k τ
the spectrum of Uτ .
i i
i i
i i
13 Universal compression
In this chapter we will discuss how to produce compression schemes that do not require apriori
knowledge of the distribution. Here, compressor is a map X n → {0, 1}∗ . Now, however, there is
no one fixed probability distribution PXn on X n . The plan for this chapter is as follows:
1 We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymp-
totically optimal simulatenously for all i.i.d. distributions and with small modifications for all
finite-order Markov chains.
2 Next class of universal compressors is based on assuming that the true distribution PXn belongs
to a given class. These methods proceed by choosing a good model distribution QXn serving as
the minimax approximation to each distribution in the class. The compression algorithm for a
single distribution QXn is then designed as in previous chapters.
3 Finally, an entirely different idea are algorithms of Lempel-Ziv type. These automatically adapt
to the distribution of the source, without any prior assumptions required.
Throughout this chapter, all logarithms are binary. Instead of describing each compres-
sion algorithm, we will merely specify some distribution QXn and apply one of the following
constructions:
• Sort all xn in the order of decreasing QXn (xn ) and assign values from {0, 1}∗ as in Theorem 10.2,
this compressor has lengths satisfying
1
ℓ(f(xn )) ≤ log .
QXn (xn )
• Set lengths to be
1
ℓ(f(x )) ≜ log
n
Q X ( xn )
n
205
i i
i i
i i
206
and in this way we may and will always replace lengths with log QXn1(xn ) . In this architecture, the
only task of a universal compression algorithm is to specify the probability assignment QXn .
Remark 13.1. Furthermore, if we only restrict attention to prefix codes, then any code f : X n →
{0, 1}∗ defines a distribution QXn (xn ) = 2−ℓ(f(x )) . (We assume the code’s binary tree is full such
n
that the Kraft sum equals one). In this way, for prefix-free codes results on redundancy, stated
in terms of optimizing the choice of QXn , imply tight converses too. For one-shot codes without
prefix constraints the optimal answers are slightly different, however. (For example, the optimal
universal code for all i.i.d. sources satisfies E[ℓ(f(Xn ))] ≈ H(Xn ) + |X 2|−3 log n in contrast with
|X |−1
2 log n for prefix-free codes, cf. [26, 184].)
Qn
If one factorizes QXn = t=1 QXt |Xt−1 then we arrive at a crucial conclusion: (universal) com-
1
pression is equivalent to sequential (online) prediction under the log-loss. As of 2022 the best
performing text compression algorithms (cf. the leaderboard at [207]) use a deep neural network
(transformer model) that starts from a fixed initialization. As the input text is processed, parame-
ters of the network are continuously updated via stochastic gradient descent causing progressively
better prediction (and hence compression) performance.
Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QXn (xn )). These intervals are disjoint
subintervals of [0, 1). As such, each xn can be represented uniquely by any point in the interval Ixn .
A specific choice is as follows. Encode
and we agree to select the left-most dyadic interval when there are two possibilities. Recall that
dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ) where m is an integer. We encode
such interval by the k-bit (zero-padded) binary expansion of the fractional number m2−k =
Pk
0.b1 b2 . . . bk = i=1 bi 2−i . For example, [3/4, 7/8) 7→ 110, [3/4, 13/16) 7→ 1100. We set the
i i
i i
i i
codeword f(xn ) to be that string. The resulting code is a prefix code satisfying
1 1
log2 ≤ ℓ(f(x )) ≤ log2
n
+ 1. (13.2)
QXn (xn ) QXn (xn )
(This is an exercise, see Ex. II.11.)
Observe that
X
Fn (xn ) = Fn−1 (xn−1 ) + QXn−1 (xn−1 ) QXn |Xn−1 (y|xn−1 )
y<xn
and thus Fn (xn ) can be computed sequentially if QXn−1 and QXn |Xn−1 are easy to compute. This
method is the method of choice in many modern compression algorithms because it allows to
dynamically incorporate the learned information about the stream, in the form of updating QXn |Xn−1
(e.g. if the algorithm detects that an executable file contains a long chunk of English text, it may
temporarily switch to QXn |Xn−1 modeling the English language).
We note that efficient implementation of arithmetic encoder and decoder is a continuing
research area. Indeed, performance depends on number-theoretic properties of denominators of
distributions QXt |Xt−1 , because as encoder/decoder progress along the string, they need to periodi-
cally renormalize the current interval Ixt to be [0, 1) but this requires carefully realigning the dyadic
boundaries. A recent idea, known as asymmetric numeral system (ANS) [103], lead to such impres-
sive computational gains that in less than a decade it was adopted by most compression libraries
handling diverse data streams (e.g., the Linux kernel images, Dropbox and Facebook traffic, etc).
1X
n
P̂xn (a) ≜ 1 { xi = a } . (13.5)
n
i=1
Then Fitingof argues that it should be possible to produce a prefix code with
This can be done in many ways. In the spirit of what comes next, let us define
i i
i i
i i
208
cn = O(n−(|X |−1) ) ,
and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (13.6).1 Now
i.i.d.
taking expectation over Xn ∼ PX we get
Universal compressor for all finite-order Markov chains. Fitingof’s idea can be extended as
follows. Define now the first order information content Φ1 (xn ) to be the log of the number of all
sequences, obtainable by permuting xn with extra restriction that the new sequence should have
the same statistics on digrams. Asymptotically, Φ1 is just the conditional entropy
where T − 1 is understood in the sense of modulo n. Again, it can be shown that there exists a code
such that lengths
This implies that for every first order stationary Markov chain X1 → X2 → · · · → Xn we have
This can be further continued to define Φ2 (xn ) and build a universal code, asymptotically
optimal for all second order Markov chains etc.
simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we do
next? Krichevsky suggested that the next barrier should be to minimize the regret, or redundancy:
1
Explicitly, we can do a two-part encoding: first describe the type class of xn (takes (|X | − 1) log n bits) and then describe
the element of the class (takes Φ0 (xn ) bits).
i i
i i
i i
Replacing code lengths with log Q1Xn , we define redundancy of the distribution QXn as
Thus, the question of designing the best universal compressor (in the sense of optimizing worst-
case deviation of the average length from the entropy) becomes the question of finding solution
of:
Definition 13.1 (Redundancy in universal compression). Given a class of sources {PXn |θ=θ0 : θ0 ∈
Θ, n = 1, . . .} we define its minimax redundancy as
Assuming the finiteness of R∗n , Theorem 5.9 gives the maximin and capacity representation
where optimization is over priors π ∈ P(Θ) on θ. Thus redundancy is simply the capacity of
the channel θ → Xn . This result, obvious in hindsight, was rather surprising in the early days of
universal compression. It is known as capacity-redundancy theorem.
Finding exact QXn -minimizer in (13.8) is a daunting task even for the simple class of all i.i.d.
Bernoulli sources (i.e. Θ = [0, 1], PXn |θ = Bern (θ)). In fact, for smooth parametric families the
capacity-achieving input distribution is rather ugly: it is a discrete distribution with a kn atoms, kn
slowly growing as n → ∞. A provocative conjecture was put forward by physicists [213, 1] that
there is a certain universality relation:
3
R∗n = log kn + o(log kn )
4
satisfied for all parametric families simultaneously. For the Bernoulli example this implies kn
n2/3 , but even this is open. However, as we will see below it turns out that these unwieldy capacity-
achieving input distributions converge as n → ∞ to a beautiful limiting law, known as the Jeffreys
prior.
Remark 13.2. (Shtarkov, Fitingof and individual sequence approach) There is a connection
between the combinatorial method of Fitingof and the method of optimality for a class. Indeed,
i i
i i
i i
210
(S)
following Shtarkov we may want to choose distribution QXn so as to minimize the worst-case
redundancy for each realization xn (not average!):
PXn |θ (xn |θ0 )
min max sup log (13.11)
QXn n x θ0 Q X n ( xn )
This leads to Shtarkov’s distribution (also known as the normalized maximal likehood (NML)
code):
(S)
QXn (xn ) = c sup PXn |θ (xn |θ0 ) , (13.12)
θ0
where c is the normalization constant. If class {PXn |θ , θ ∈ Θ} is chosen to be all i.i.d. distributions
on X then
(S)
i.i.d. QXn (xn ) = c exp{−nH(P̂xn )} , (13.13)
(S)
and thus compressing w.r.t. QXn recovers Fitingof’s construction Φ0 up to O(log n) differences
between nH(P̂xn ) and Φ0 (xn ). If we take PXn |θ to be all first order Markov chains, then we get
construction Φ1 etc. Note also, that the problem (13.11) can also be written as minimization of the
regret for each individual sequence (under log-loss, with respect to a parameter class PXn |θ ):
1 1
min max log − inf log . (13.14)
QXn xn QXn (xn ) θ0 PXn |θ (xn |θ0 )
The gospel is that if there is a reason to believe that real-world data xn are likely to be generated
by one of the models PXn |θ , then using minimizer of (13.14) will result in the compressor that both
learns the right model (in the sense of QXn |Xn−1 ≈ true PXn |Xn−1 ) and compresses with respect to it.
See more in Section 13.6.
Γ(α0 +...+αd )
where c(α0 , . . . , αd ) = Qd is the normalizing constant.
j=0 Γ(αj )
First, we give the formal setting as follows:
i i
i i
i i
Pd
• As in Example 2.6, let Θ = {(θ1 , . . . , θd ) : j=1 θj ≤ 1, θj ≥ 0} parametrizes the collection of
all probability distributions on X . Note that Θ is a d-dimensional simplex. We will also define
X
d
θ0 ≜ 1 − θj .
j=1
In order to find the (near) optimal QXn , we need to guess an (almost) optimal prior π ∈ P(Θ)
in (13.10) and take QXn to be the mixture of PXn |θ ’s. We will search for π in the class of smooth
densities on Θ and set
Z
QXn (xn ) ≜ PXn |θ (xn |θ′ )π (θ′ )dθ′ . (13.16)
Θ
Before proceeding further, we recall the Laplace method of approximating exponential inte-
grals. Suppose that f(θ) has a unique minimum at the interior point θ̂ of Θ and that Hessian Hessf
is uniformly lower-bounded by a multiple of identity (in particular, f(θ) is strongly convex). Then
taking Taylor expansion of π and f we get
Z Z
−nf(θ)
dθ = (π (θ̂) + O(ktk))e−n(f(θ̂)− 2 t Hessf(θ̂)t+o(∥t∥ )) dt
1 T 2
π (θ)e (13.17)
Θ
Z
dx
= π (θ̂)e−nf(θ̂) e−x Hessf(θ̂)x √ (1 + O(n−1/2 ))
T
(13.18)
Rd nd
d2
−nf(θ̂) 2π 1
= π (θ̂)e q (1 + O(n−1/2 )) (13.19)
n
det Hessf(θ̂)
θ̂(xn ) ≜ P̂xn
d 2π Pθ (θ̂)
+ O( n− 2 ) ,
1
log QXn (xn ) = −nH(θ̂) + log + log q
2 n log e
det JF (θ̂)
i i
i i
i i
212
where we used the fact that Hessθ′ D(P̂kPX|θ=θ′ )|θ′ =θ̂ = log1 e JF (θ̂) with JF being the Fisher infor-
mation matrix introduced previously in (2.33). From here, using the fact that under Xn ∼ PXn |θ=θ′
the random variable θ̂ = θ′ + O(n−1/2 ) we get by approximating JF (θ̂) and Pθ (θ̂)
d Pθ (θ′ )
D(PXn |θ=θ′ kQXn ) = n(E[H(θ̂)]−H(X|θ = θ′ ))+ log n−log p +C+O(n− 2 ) , (13.20)
1
2 ′
det JF (θ )
where C is some constant (independent of the prior Pθ or θ′ ). The first term is handled by the next
result, refining Corollary 7.16.
i.i.d.
Lemma 13.2. Let Xn ∼ P on a finite alphabet X such that P(x) > 0 for all x ∈ X . Let P̂ = P̂Xn
be the empirical distribution of Xn , then
|X | − 1 1
E[D(P̂kP)] = log e + o .
2n n
log e 2
In fact, nD(P̂kP) → 2 χ (|X | − 1) in distribution.
√
Proof. By Central Limit Theorem, n(P̂ − P) converges in distribution to N (0, Σ), where Σ =
diag(P) − PPT , where P is an |X |-by-1 column vector. Thus, computing second-order Taylor
expansion of D(·kP), cf. (2.33) and (2.36), we get the result.
provided that the right side is integrable. A prior proportional to the square root of the determinant of the Fisher information matrix is known as the Jeffreys prior. In our case, using the explicit expression for the Fisher information (2.38), we conclude that π* is the Dirichlet(1/2, 1/2, · · · , 1/2) prior, with density
π*(θ) = c_d / √( ∏_{j=0}^{d} θ_j ) ,   (13.22)
where c_d = Γ((d+1)/2) / Γ(1/2)^{d+1} is the normalization constant. The corresponding redundancy is then
R*_n = (d/2) log (n/(2πe)) − log ( Γ((d+1)/2) / Γ(1/2)^{d+1} ) + o(1) .   (13.23)
Making the above derivation rigorous is far from trivial, and was completed in [337]. Surprisingly,
while the Jeffreys prior π ∗ that we derived does attain the claimed value (13.23) of the mutual
information I(θ; X^n), the corresponding mixture Q_{X^n} does not yield (13.23). In other words, when this Q_{X^n} is plugged into (13.8), the resulting value of the supremum over θ0 is much larger than the optimal value (13.23). The way (13.23) was proved is by patching the Jeffreys prior near the boundary of the simplex.
Extension to general smooth parametric families. The fact that Jeffreys prior θ ∼ π maxi-
mizes the value of mutual information I(θ; Xn ) for general parametric families was conjectured
in [29] in the context of selecting priors in Bayesian inference. This result was proved rigorously
in [68, 69]. We briefly summarize the results of the latter.
Let {Pθ : θ ∈ Θ0 } be a smooth parametric family admitting a continuous and bounded Fisher
information matrix JF (θ) everywhere on the interior of Θ0 ⊂ Rd . Then for every compact Θ
contained in the interior of Θ0 we have
R*_n(Θ) = (d/2) log (n/(2πe)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .   (13.24)
Although the Jeffreys prior on Θ achieves (up to o(1)) the optimal value of sup_π I(θ; X^n), to produce an approximately capacity-achieving output distribution Q_{X^n} one needs to take a mixture with respect to a Jeffreys prior on a slightly larger set Θ_ϵ = {θ : d(θ, Θ) ≤ ϵ} and let ϵ → 0 slowly as n → ∞. This sequence of Q_{X^n}'s does achieve the optimal redundancy up to o(1).
Remark 13.3. In statistics the Jeffreys prior is justified as being invariant to smooth reparametrization, as evidenced by (2.34). For example, in answering "will the sun rise tomorrow", Laplace proposed to estimate the probability by modeling sunrise as an i.i.d. Bernoulli process with a uniform prior on θ ∈ [0, 1]. However, this is clearly not very logical, as one may equally well postulate uniformity of α = θ^{10} or β = √θ. The Jeffreys prior θ ∼ 1/√(θ(1−θ)) is invariant to reparametrization in the sense that if one computed √(det J_F(α)) under the α-parametrization, the result would be exactly the pushforward of 1/√(θ(1−θ)) along the map θ ↦ θ^{10}.
2 This is obtained from the identity ∫_0^1 θ^a (1−θ)^b / √(θ(1−θ)) dθ = π · (1·3···(2a−1)) · (1·3···(2b−1)) / (2^{a+b} (a+b)!) for integer a, b ≥ 0. This identity can be derived by the change of variable z = θ/(1−θ) and using the standard keyhole contour in the complex plane.
This assignment can now be used to create a universal compressor via one of the methods outlined in the beginning of this chapter. However, what is remarkable is that it has a very nice sequential interpretation (as does any assignment obtained via Q_{X^n} = ∫ P_θ(dθ) P_{X^n|θ} with P_θ not depending on n).
Q^{(KT)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1/2)/n ,   t_1 = #{j ≤ n − 1 : x_j = 1}   (13.27)
Q^{(KT)}_{X_n|X^{n−1}}(0|x^{n−1}) = (t_0 + 1/2)/n ,   t_0 = #{j ≤ n − 1 : x_j = 0}   (13.28)
This is the famous "add 1/2" rule of Krichevsky and Trofimov. As mentioned in Section 13.1, this sequential assignment is very convenient for use in prediction as well as in implementing an arithmetic coder. The version for a general (non-binary) alphabet is equally simple:
Q^{(KT)}_{X_n|X^{n−1}}(a|x^{n−1}) = (t_a + 1/2)/(n + (|X| − 2)/2) ,   t_a = #{j ≤ n − 1 : x_j = a} .
Remark 13.4 (Laplace "add 1" rule). A slightly less optimal choice of Q_{X^n} results from the Laplace prior: just take P_θ to be uniform on [0, 1]. Then, in the Bernoulli (d = 1) case we get
Q^{(Lap)}_{X^n}(x^n) = 1 / ( C(n, w) (n + 1) ) ,   w = #{j : x_j = 1} .   (13.29)
The corresponding successive probability is given by
Q^{(Lap)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1)/(n + 1) ,   t_1 = #{j ≤ n − 1 : x_j = 1} .
We notice two things. First, the distribution (13.29) is exactly the same as Fitingof's (13.7). Second, this distribution "almost" attains the optimal first-order term in (13.23). Indeed, when X^n is iid Ber(θ) we have for the redundancy:
E[ log 1/Q^{(Lap)}_{X^n}(X^n) ] − H(X^n) = log(n + 1) + E[ log C(n, W) ] − nh(θ) ,   W ∼ Bin(n, θ) .   (13.30)
From Stirling's expansion we know that as n → ∞ this redundancy evaluates to (1/2) log n + O(1), uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redundancy (13.30) clearly equals log(n + 1). Thus, the supremum over θ ∈ [0, 1] is achieved close to the endpoints and results in a suboptimal redundancy log n + O(1). The Jeffreys prior (13.25) fixes the problem at the endpoints.
Consider the following problem: a sequence x^n is observed sequentially and our goal is to predict (by making a soft decision) the next symbol given the past observations. The experiment proceeds as follows:
Note that to make this goal formal, we need to explain how x^n is generated. Consider first a naive requirement that the worst-case loss is minimized:
min_{{Q_t}_{t=1}^n} max_{x^n} ℓ({Q_t}, x^n) .
This is clearly hopeless. Indeed, at any step t the distribution Q_t must have at least one atom with weight ≤ 1/|X|, and hence for any predictor
max_{x^n} ℓ({Q_t}, x^n) ≥ n log |X| ,
with equality iff Q_t(·) ≡ 1/|X|, i.e. if the predictor simply makes uniform random guesses. This triviality is not surprising: in the absence of any prior information on x^n it is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [119, 218], is to replace the loss with the regret, i.e. the gap to the best possible static oracle. More precisely, suppose a non-causal oracle can examine the entire string x^n and output a constant Q_t ≡ Q. From non-negativity of divergence this non-causal oracle achieves
ℓ_oracle(x^n) = min_Q Σ_{t=1}^n log 1/Q(x_t) = nH(P̂_{x^n}) .
Can a causal (but time-varying) predictor come close to this performance? In other words, we define the regret of a sequential predictor as the excess risk over the static oracle
reg({Q_t}, x^n) ≜ ℓ({Q_t}, x^n) − nH(P̂_{x^n})
and ask to minimize the worst-case regret:
Reg*_n ≜ min_{{Q_t}} max_{x^n} reg({Q_t}, x^n) .   (13.31)
Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do not rely on any assumptions on the prior distribution of x^n.
We next consider the case of X = {0, 1} for simplicity. To solve (13.31), first notice that designing a sequence {Q_t(·|x^{t−1})} is equivalent to defining one joint distribution Q_{X^n} and then factorizing the latter as Q_{X^n}(x^n) = ∏_t Q_t(x_t|x^{t−1}). Then the problem (13.31) becomes simply
Reg*_n = min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − nH(P̂_{x^n}) ] .
First, we notice that generally the optimal Q_{X^n} is the Shtarkov distribution (13.12), which implies that the regret is just the log of the normalization constant in the Shtarkov distribution. In the iid case we are considering, we get
Reg*_n = log Σ_{x^n} max_Q ∏_{i=1}^n Q(x_i) = log Σ_{x^n} exp{−nH(P̂_{x^n})} .
This expression is, however, frequently not very convenient to analyze, so instead we consider upper and lower bounds. We may lower-bound the max over x^n by the average over X^n ∼ Ber(θ)^n and obtain (also applying Lemma 13.2):
Reg*_n ≥ R*_n + ((|X| − 1)/2) log e + o(1) ,
where R*_n is the universal compression redundancy defined in (13.8), whose asymptotics we derived in (13.23).
On the other hand, taking Q^{(KT)}_{X^n} from Krichevsky-Trofimov (13.26) we find after some algebra and Stirling's expansion:
max_{x^n} [ log 1/Q^{(KT)}_{X^n}(x^n) − nH(P̂_{x^n}) ] = (1/2) log n + O(1) .
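In the binary case the Shtarkov sum can be evaluated exactly by grouping sequences according to their type, which gives a quick numerical check of the (1/2) log n behaviour; the Python sketch below (our own illustration) does this, and the gap between the two printed columns stays bounded in n, which is the O(1) term above.

import math

def shtarkov_regret_binary(n):
    """Reg*_n = log2 sum_{x^n} exp{-n H(P_hat)}, computed by summing over types k."""
    total = 0.0
    for k in range(n + 1):
        # max_Q prod_i Q(x_i) = (k/n)^k ((n-k)/n)^(n-k) for a sequence with k ones
        p_max = (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0
        total += math.comb(n, k) * p_max
    return math.log2(total)

for n in [10, 100, 1000]:
    print(n, round(shtarkov_regret_binary(n), 3), "  (1/2) log2 n =", round(0.5 * math.log2(n), 3))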
Reg*_n(Θ) = min_{Q_{X^n}} sup_{x^n} [ log 1/Q_{X^n}(x^n) − inf_{θ0∈Θ} log 1/P_{X^n|θ=θ0}(x^n) ] ,
This regret can be interpreted as worst-case loss of a given estimator compared to the best possible
one from a class PXn |θ , when the latter is selected optimally for each sequence. In this sense, regret
gives a uniform (in xn ) bound on the performance of an algorithm against a class.
It turns out that similarly to (13.24) the individual sequence redundancy for general d-
parametric families (under smoothness conditions) can be shown to satisfy [267]:
Reg*_n(Θ) = R*_n(Θ) + (d/2) log e + o(1) = (d/2) log (n/(2π)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .
In machine learning terms, we say that R∗n (Θ) in (13.8) is a cumulative sequential prediction
regret under the well-specified setting (i.e. data Xn is generated by a distribution inside the model
class Θ), while here Reg∗n (Θ) corresponds to a fully mis-specified setting (i.e. data is completely
arbitrary). There are also interesting settings in between these extremes, e.g. when data is iid but
not from a model class Θ, cf. [120].
the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we will attain the entropy rate of H(X_n | X_{n−r}^{n−1}). Note that the Krichevsky-Trofimov assignment (13.28) is clearly learning the distribution too: as n grows, the estimator Q_{X_n|X^{n−1}} converges to the true P_X (provided that the sequence is i.i.d.). So in some sense the converse is also true: any good universal compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the class of sources to include those with memory, we are invariably led to the problem of learning the joint distribution P_{X_0^{r−1}} of r-blocks. However, the number of samples required to obtain a good estimate of P_{X_0^{r−1}} is exponential in r. Thus learning may proceed rather slowly. The Lempel-Ziv family of algorithms works around this in an ingeniously elegant way:
• First, estimating probabilities of rare substrings takes longest, but it is also the least useful, as
these substrings almost never appear at the input.
• Second, and the most crucial, point is that an unbiased estimate of P_{X^r}(x^r) is given by the reciprocal of the time since the last observation of x^r in the data stream.
• Third, there is a prefix code³ mapping any integer n to a binary string of length roughly log₂ n; this integer code, denoted f_int, is referenced in (13.32) below (a sketch of one such construction follows this list).
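Here is a minimal Python sketch of one such prefix-free integer code (an Elias-gamma-style construction in the spirit of Exercise II.14; the function names are ours). It spends roughly 2 log₂ n bits per integer; the construction in the footnote, with length about log₂ n + 2 log₂ log n, is a refinement of the same idea.

def elias_gamma_encode(n: int) -> str:
    """Prefix-free code for n >= 1: (len-1) zeros, then the binary representation of n."""
    assert n >= 1
    b = bin(n)[2:]                    # binary representation, starts with '1'
    return "0" * (len(b) - 1) + b

def elias_gamma_decode_stream(bits: str):
    """Decode a concatenation of codewords back into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":         # leading zeros = len(binary) - 1
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

codeword = elias_gamma_encode(10)     # '0001010' (7 bits, about 2*log2(10) + 1)
print(codeword, elias_gamma_decode_stream(codeword + elias_gamma_encode(3)))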
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
We proceed to formal details. First, we need to establish Kac’s lemma.
3 For this just notice that Σ_{k≥1} 2^{−log₂ k − 2 log₂ log(k+1)} < ∞ and use Kraft's inequality. See also Ex. II.14.
= (1/P[X_0 = u]) · P[∃t ≥ 0 : X_t = u]   (13.38)
= 1/P[X_0 = u] ,   (13.39)
where (13.34) is the standard expression for the expectation of a Z_+-valued random variable, (13.37) is from stationarity, (13.38) is because the events corresponding to different t are disjoint, and (13.39) is from (13.33).
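Kac's lemma is easy to verify empirically: along a stationary trajectory the average time between consecutive visits to a state u converges to 1/P[X_0 = u]. The following Python sketch (the chain and its parameters are our own arbitrary choices) illustrates this.

import numpy as np

rng = np.random.default_rng(1)
# hypothetical 3-state transition matrix (rows sum to 1)
P = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.6,  0.2 ],
              [0.3, 0.3,  0.4 ]])
w, v = np.linalg.eig(P.T)                       # stationary law = eigenvector for eigenvalue 1
pi = np.real(v[:, np.argmax(np.real(w))]); pi = pi / pi.sum()

u, steps = 0, 300_000
state, last, gaps = rng.choice(3, p=pi), None, []
for t in range(steps):
    if state == u:
        if last is not None:
            gaps.append(t - last)
        last = t
    state = rng.choice(3, p=P[state])

print("mean return time to state", u, "≈", np.mean(gaps), "   1/pi[u] =", 1 / pi[u])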
The following proposition serves to explain the basic principle behind operation of Lempel-Ziv:
(1/n) E[ ℓ(f_n(X_0^{n−1}, X_{−∞}^{−1})) ] → H ,
Proof. Let L_n be the last occurrence of the block x_0^{n−1} in the string x_{−∞}^{−1} (recall that the latter is known to the decoder), namely
L_n = inf{t > 0 : x_{−t}^{−t+n−1} = x_0^{n−1}} .
Then, by Kac's lemma applied to the process Y_t^{(n)} = X_t^{t+n−1} we have
E[L_n | X_0^{n−1} = x_0^{n−1}] = 1/P[X_0^{n−1} = x_0^{n−1}] .
We now encode L_n using the code (13.32). Note that there is a crucial subtlety: even if L_n < n and thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring x_0^{n−1} can still be decoded from the knowledge of L_n.
We have, by applying Jensen's inequality twice and noticing that (1/n) H(X_0^{n−1}) → H and (1/n) log H(X_0^{n−1}) → 0, that
(1/n) E[ℓ(f_int(L_n))] ≤ (1/n) E[ log 1/P_{X_0^{n−1}}(X_0^{n−1}) ] + o(1) → H .
From Kraft's inequality we know that for any prefix code we must have
(1/n) E[ℓ(f_int(L_n))] ≥ (1/n) H(X_0^{n−1} | X_{−∞}^{−1}) = H .
The result shown above demonstrates that LZ algorithm has asymptotically optimal com-
pression rate for every stationary ergodic process. Recall, however, that previously discussed
compressors also enjoyed non-stochastic (individual sequence) guarantees. For example, we have
seen in Section 13.6 that Krichevsky-Trofimov’s compressor achieves on every input sequence a
compression ratio that is at most O( logn n ) worse than the arithmetic encoder built with the best
possible (for this sequence!) static probability assignment. It turns out that LZ algorithm is also
i i
i i
i i
220
special from this point of view. In [237] (see also [118, Theorem 4]) it was shown that the LZ
compression rate on every input sequence is better than that achieved by any finite state machine
(FSM) up to correction terms O( logloglogn n ). Consequently, investing via LZ achieves capital growth
that is competitive against any possible FSM investor [118].
Altogether we can see that LZ compression enjoys certain optimality guarantees in both the
stochastic and individual sequence senses.
Yj = Sj Ej .
Find entropy rate of Yj (you can give answer in the form of a convergent series). Evaluate at
τ = 0.11, δ = 1/2 and compare with H(Y1 ).
II.2 Recall that the entropy rate of a process {Xj : j = 1, . . .} is defined, provided the limit exists, as
H = lim_{n→∞} (1/n) H(X^n) .
Consider a 4-state Markov chain with transition probability matrix
0.89 0.11 0 0
0.11 0.89 0 0
0 0 0.11 0.89
0 0 0.89 0.11
The distribution of the initial state is [p, 0, 0, 1 − p].
(a) Does the entropy rate of such a Markov chain exist? If it does, find it.
(b) Describe the asymptotic behavior of the optimum variable-length rate (1/n) ℓ(f*(X1, . . . , Xn)).
Consider convergence in probability and in distribution.
(c) Repeat with transition matrix:
0.89 0.11 0 0
0.11 0.89 0 0
0 0 0. 5 0. 5
0 0 0. 5 0. 5
II.3 Consider a three-state Markov chain S1, S2, . . . with the following transition probability matrix
P = [ 1/2  1/4  1/4
      0    1/2  1/2
      1    0    0   ] .
Compute the limit of (1/n) E[ℓ(f*(S^n))] as n → ∞. Does your answer depend on the distribution of the initial state S1?
II.4 (a) Let X take values on a finite alphabet X . Prove that
ϵ*(X, k) ≥ (H(X) − k − 1) / log(|X| − 1) .
(b) Deduce the following converse result: For a stationary process {S_k : k ≥ 1} on a finite alphabet S,
lim inf_{n→∞} ϵ*(S^n, nR) ≥ (H − R) / log |S| ,
where H = lim_{n→∞} H(S^n)/n is the entropy rate of the process.
II.5 Run-length encoding is a popular variable-length lossless compressor used in fax machines, image compression, etc. Consider compression of S^n – an i.i.d. Ber(δ) source with very small δ = 1/128 – using run-length encoding: a chunk of consecutive r ≤ 255 zeros (resp. ones) is encoded into a zero (resp. one) followed by an 8-bit binary encoding of r (if there are > 255 consecutive zeros then two or more 9-bit blocks will be output). Compute the average achieved compression rate
lim_{n→∞} (1/n) E[ℓ(f(S^n))] .
How does it compare with the optimal lossless compressor?
Hint: Compute the expected number of 9-bit blocks output per chunk of consecutive zeros/ones; normalize by the expected length of the chunk.
II.6 Draw n random points independently and uniformly from the vertices of the following square.
Denote the coordinates by (X1 , Y1 ), . . . , (Xn , Yn ). Suppose Alice only observes Xn and Bob only
observes Yn . They want to encode their observation using RX and RY bits per symbol respectively
and send the codewords to Charlie who will be able to reconstruct the sequence of pairs.
(a) Find the optimal rate region for (RX , RY ).
(b) What if the square is rotated by 45◦ ?
II.7 Recall a bound on the probability of error for the Slepian-Wolf compression to k bits:
ϵ*_SW(k) ≤ min_{τ>0} P[ log_{|A|} 1/P_{X^n|Y}(X^n|Y) > k − τ ] + |A|^{−τ}   (II.1)
(1/n) Σ_{k=0}^{n−1} P[A ∩ τ^{−k}B] → P[A] P[B] .
For example, (0, 0, 0, 0), (0, 0, 0, 2) and (1, 1, 0, 0) satisfy the constraint but (0, 0, 1, 2) does not.
Let ϵ∗ (Sn , k) denote the minimum probability of error among all possible compressors of Sn =
{Sj , j = 1, . . . , n} with i.i.d. entries of finite entropy H(S) < ∞. Compute
as a function of R ≥ 0.
Hint: Relate to P[ℓ(f∗ (Sn )) ≥ γ n] and use Stirling’s formula (or Theorems 11.1.1, 11.1.3 in [76])
to find γ .
II.10 Mismatched compression. Let P, Q be distributions on some discrete alphabet A.
(a) Let f*_P : A → {0, 1}* denote the optimal variable-length lossless compressor for X ∼ P. Show that under Q,
E_Q[l(f*_P(X))] ≤ H(Q) + D(Q‖P).
(b) The Shannon code for X ∼ P is a prefix code f_P with the code length l(f_P(a)) = ⌈log₂ 1/P(a)⌉, a ∈ A. Show that if X is distributed according to Q instead, then
(g) Assume that X = (X1 , . . . , Xn ) is not iid but PX1 , PX2 |X1 , . . . , PXn |Xn−1 are known. How would
you modify the scheme so that we have
II.12 Enumerative Codes. Consider the following simple universal compressor for binary sequences: Given x^n ∈ {0, 1}^n, denote by n_1 = Σ_{i=1}^n x_i and n_0 = n − n_1 the number of ones and zeros in x^n. First encode n_1 ∈ {0, 1, . . . , n} using ⌈log₂(n + 1)⌉ bits, then encode the index of x^n in the set of all strings with n_1 ones using ⌈log₂ C(n, n_1)⌉ bits. Concatenating the two binary strings, we obtain the codeword of x^n. This defines a lossless compressor f : {0, 1}^n → {0, 1}*.
(a) Verify that f is a prefix code.
(b) Let S^n_θ be i.i.d. ∼ Ber(θ). Show that for any θ ∈ [0, 1],
[Optional: Explain why enumerative coding fails to achieve the optimal redundancy.]
Hint: The following non-asymptotic version of Stirling's approximation might be useful:
1 ≤ n! / ( √(2πn) (n/e)^n ) ≤ e/√(2π) ,   ∀n ∈ N.
II.13 Krichevsky-Trofimov codes. From Kraft's inequality we know that any probability distribution Q_{X^n} on {0, 1}^n gives rise to a prefix code f such that l(f(x^n)) = ⌈log₂ 1/Q_{X^n}(x^n)⌉ for all x^n. Consider the following Q_{X^n} defined by the factorization Q_{X^n} = Q_{X_1} Q_{X_2|X_1} · · · Q_{X_n|X^{n−1}},
Q_{X_1}(1) = 1/2 ,   Q_{X_{t+1}|X^t}(1|x^t) = (n_1(x^t) + 1/2)/(t + 1) ,   (II.3)
where n_1(x^t) denotes the number of ones in x^t. Denote the prefix code corresponding to this Q_{X^n} by f_KT : {0, 1}^n → {0, 1}*.
(a) Prove that for any n and any x^n ∈ {0, 1}^n,
Q_{X^n}(x^n) ≥ (1/(2√(n_0 + n_1))) · (n_0/(n_0 + n_1))^{n_0} (n_1/(n_0 + n_1))^{n_1} ,
where n_0 = n_0(x^n) and n_1 = n_1(x^n) denote the number of zeros and ones in x^n.
Hint: Use induction on n.
(b) Conclude that the K-T code length satisfies:
l(f_KT(x^n)) ≤ n h(n_1/n) + (1/2) log n + 2 ,   ∀x^n ∈ {0, 1}^n .
(c) Conclude that for K-T codes:
sup_{0≤θ≤1} { E[l(f_KT(S^n_θ))] − nh(θ) } ≤ (1/2) log n + O(1).
This value is known as the redundancy of a universal code. It turns out that (1/2) log n + O(1) is optimal for the class of all Bernoulli sources (see (13.23)).
Comments:
(a) The probability assignment (II.3) is known as the "add-1/2" estimator: upon observing x^t which contains n_1 ones, a natural probability assignment for x_{t+1} = 1 is the empirical average n_1/t. Instead, K-T codes assign probability (n_1 + 1/2)/(t + 1), or equivalently, add 1/2 to both n_0 and n_1. This is a crucial modification to Laplace's "add-one" estimator.⁴
(b) By construction, the probability assignment Q_{X^n} can be sequentially computed, which allows us to implement sequential encoding and encode a stream of bits on the fly. This is a highly desirable feature of the K-T codes. Of course, we need to resort to a construction other than the one used in the proof of Kraft's inequality, e.g., arithmetic coding.
II.14 (Elias coding) In this problem all logarithms and entropy units are binary.
(a) Consider the following universal compressor for natural numbers: For x ∈ N = {1, 2, . . .},
let k(x) denote the length of its binary representation. Define its codeword c(x) to be k(x)
zeros followed by the binary representation of x. Compute c(10). Show that c is a prefix
code and describe how to decode a stream of codewords.
4
Interested readers should check Laplace’s rule of succession and the sunrise problem
https://en.wikipedia.org/wiki/Rule_of_succession.
(b) Next we construct another code using the one above: Define the codeword c′ (x) to be c(k(x))
followed by the binary representation of x. Compute c′ (10). Show that c′ is a prefix code
and describe how to decode a stream of codewords.
(c) Let X be a random variable on N whose probability mass function is decreasing. Show that
E[log(X)] ≤ H(X).
(d) Show that the average code length of c satisfies E[ℓ(c(X))] ≤ 2H(X) + 2 bit.
(e) Show that the average code length of c′ satisfies E[ℓ(c′ (X))] ≤ H(X) + 2 log(H(X) + 1) + 3
bit.
Comments: The two coding schemes are known as Elias γ-codes and δ-codes.
Part III
In this part we study the topic of binary hypothesis testing (BHT). This is an important area
of statistics, with a definitive treatment given in [197]. Historically, there have been two schools
of thought on how to approach this question. One is the so-called significance testing of Karl
Pearson and Ronald Fisher. This is perhaps the most widely used approach in modern biomedical
and social sciences. The concepts of null hypothesis, p-value, χ2 -test, goodness-of-fit belong to
this world. We will not be discussing these.
The other school was pioneered by Jerzy Neyman and Egon Pearson, and is our topic in this part.
The concepts of type-I and type-II errors, likelihood-ratio tests, Chernoff exponent are from this
domain. This is, arguably, a more popular way of looking at the problem among the engineering
disciplines (perhaps explained by its foundational role in radar and electronic signal detection.)
The conceptual difference between the two is that in the first approach the full probabilistic model is specified only under the null hypothesis. (It still could be very specific, like X_i i.i.d. ∼ N(0, 1); contain unknown parameters, like X_i i.i.d. ∼ N(θ, 1) with θ ∈ R arbitrary; or be nonparametric, like (X_i, Y_i) i.i.d. ∼ P_{X,Y} = P_X P_Y denoting that observables X and Y are statistically independent.) The main goal of the statistician in this setting is inventing a testing process that is able to find statistically significant deviations from the postulated null behavior. If such a deviation is found then the null is rejected and (in scientific fields) a discovery is announced. The role of the alternative hypothesis (if one is specified at all) is to roughly suggest which features of the null are most likely to be violated and to motivate the choice of test procedures. For example, if under the null X_i i.i.d. ∼ N(0, 1), then both of the following are reasonable tests:
(1/n) Σ_{i=1}^n X_i ≈? 0    and    (1/n) Σ_{i=1}^n X_i² ≈? 1 .
However, the first one would be preferred if, under the alternative, “data has non-zero mean”, and
the second if “data has zero mean but variance not equal to one”. Whichever of the alternatives is
selected does not imply in any way the validity of the alternative. In addition, theoretical properties
of the test are mostly studied under the null rather than the alternative. For this approach the null
hypothesis (out of the two) plays a very special role.
The second approach treats hypotheses in complete symmetry. Exact specifications of proba-
bility distributions are required for both hypotheses and the precision of a proposed test is to be
analyzed under both. This is the setting that is most useful for our treatment of forthcoming topics
of channel coding (Part IV) and statistical estimation (Part VI).
The outline of this part is the following. First, we define the performance metric R(P, Q) giving
a full description of the BHT problem. A key result in this theory, the Neyman-Pearson lemma, determines the form of the optimal test and, at the same time, characterizes R(P, Q). We then
specialize to the setting of iid observations and consider two types of asymptotics (as the sam-
ple size n goes to infinity): Stein’s regime (where type-I error is held constant) and Chernoff’s
regime (where errors of both types are required to decay exponentially). The fundamental limit
in the former regime is simply a scalar (given by D(PkQ)), while in the latter it is a region. To
describe this region (Chapter 16) we will need to understand the problem of large deviations and
the information projection (Chapter 15).
14 Neyman-Pearson lemma
H0 : X ∼ P
H1 : X ∼ Q .
What this means is that under hypothesis H0 (the null hypothesis) X is distributed according to P,
and under H1 (the alternative hypothesis) X is distributed according to Q. A test (or decision rule)
between two distributions chooses either H0 or H1 based on an observation of X. We will consider randomized tests, described by a Markov kernel P_{Z|X} : X → {0, 1}. Let Z = 0 denote that the test chooses P (accepting the null) and Z = 1 that the test chooses Q (rejecting the null).
(rejecting the null).
This setting is called “testing simple hypothesis against simple hypothesis”. Here “simple”
refers to the fact that under each hypothesis there is only one distribution that could generate
the data. In comparison, a composite hypothesis postulates that X ∼ P for some P in a given class
of distributions; see Section 32.2.1.
In order to quantify the “effectiveness” of a test, we focus on two metrics. Let π i|j denote the
probability of the test choosing i when the correct hypothesis is j, with i, j ∈ {0, 1}. For every test
PZ|X we associate a pair of numbers:
• Bayesian: Assuming the prior distribution P[H0] = π_0 and P[H1] = π_1, we minimize the average probability of error:
P*_b = min_{P_{Z|X} : X → {0,1}} π_0 π_{1|0} + π_1 π_{0|1} .   (14.1)
• Minimax: Assuming there is an unknown prior distribution, we choose the test that performs the best for the worst-case prior
P*_m = min_{P_{Z|X} : X → {0,1}} max{π_{1|0}, π_{0|1}} .
• Neyman-Pearson: Minimize the type-II error β subject to the constraint that the success probability under the null is at least α.
In this book the Neyman-Pearson formulation and the following quantities play important roles:
Definition 14.1. Given (P, Q), the Neyman-Pearson region consists of the achievable points for all randomized tests:
R(P, Q) = { (P[Z = 0], Q[Z = 0]) : P_{Z|X} : X → {0, 1} } ⊂ [0, 1]² .   (14.2)
In particular, its lower boundary is defined as (see Fig. 14.1 for an illustration)
β_α(P, Q) ≜ inf_{P[Z=0]≥α} Q[Z = 0] .   (14.3)
(Figure 14.1: the Neyman-Pearson region R(P, Q) and its lower boundary β_α(P, Q).)
Remark 14.1. The Neyman-Pearson region encodes much useful information about the relation-
ship between P and Q. For example, we have the following extreme cases1
1
Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P[E] = 0 and Q[E] = 1 for some E.
P = Q ⇔ R(P, Q) is the diagonal segment connecting (0, 0) and (1, 1);  P ⊥ Q ⇔ R(P, Q) is the full unit square [0, 1]².
Moreover, TV(P, Q) coincides with half the length of the longest vertical segment contained in R(P, Q) (Exercise III.2).
Proof. (a) For convexity, suppose that (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), corresponding to tests
PZ0 |X , PZ1 |X , respectively. Randomizing between these two tests, we obtain the test λPZ0 |X +
λ̄PZ1 |X for λ ∈ [0, 1], which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
The closedness of R(P, Q) will follow from the explicit determination of all boundary
points via the Neyman-Pearson lemma – see Remark 14.3. In more complicated situations
(e.g. in testing against composite hypothesis) simple explicit solutions similar to Neyman-
Pearson Lemma are not available but closedness of the region can frequently be argued
still. The basic reason is that the collection of bounded functions {g : X → [0, 1]} (with g(x) = P_{Z|X}(0|x)) forms a weakly compact set and hence its image under the linear functional g ↦ (∫ g dP, ∫ g dQ) is closed.
(b) Testing by random guessing, i.e., Z ∼ Ber(1 − α) ⊥ ⊥ X, achieves the point (α, α).
(c) If (α, β) ∈ R(P, Q) is achieved by PZ|X , P1−Z|X achieves (1 − α, 1 − β).
The region R(P, Q) consists of the operating points of all randomized tests, which include as special cases those of deterministic tests, namely
R_det(P, Q) ≜ { (P[E], Q[E]) : E ⊂ X measurable } .   (14.4)
As the next result shows, the former is in fact the closed convex hull of the latter. Recall that cl(E) (resp. co(E)) denotes the closure (resp. convex hull) of a set E, namely, the smallest closed (resp. convex) set containing E. A useful example: for a subset E of a Euclidean space and measurable functions f, g with (f(x), g(x)) ∈ E for all x, we have (E[f(X)], E[g(X)]) ∈ cl(co(E)) for any real-valued random variable X. Consequently, if P and Q are on a finite alphabet X, then R(P, Q) is a polygon with at most 2^{|X|} vertices.
Proof. "⊃": Comparing (14.2) and (14.4), by definition R(P, Q) ⊃ R_det(P, Q), and the former is closed and convex by Theorem 14.2.
"⊂": Given any randomized test P_{Z|X}, define a measurable function g : X → [0, 1] by g(x) = P_{Z|X}(0|x). Then
P[Z = 0] = Σ_x g(x)P(x) = E_P[g(X)] = ∫_0^1 P[g(X) ≥ t] dt
Q[Z = 0] = Σ_x g(x)Q(x) = E_Q[g(X)] = ∫_0^1 Q[g(X) ≥ t] dt
where we applied the "area rule" E[U] = ∫_{R_+} P[U ≥ t] dt for any non-negative random variable U. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of points (P[g(X) ≥ t], Q[g(X) ≥ t]) ∈ R_det, averaged according to t uniformly distributed on the unit interval. Hence R ⊂ cl(co(R_det)).
The last claim follows because there are at most 2^{|X|} subsets in (14.4).
Example 14.1 (Testing Ber(p) versus Ber(q)). Assume that p < 1/2 < q. Using Theorem 14.3, note that there are 2² = 4 events E = ∅, {0}, {1}, {0, 1}. Then R(Ber(p), Ber(q)) is given by the closed convex hull of the four points (0, 0), (p, q), (p̄, q̄), (1, 1):
(Figure: the region R(Ber(p), Ber(q)), a quadrilateral with vertices (0, 0), (p, q), (1, 1) and (p̄, q̄) in the (α, β) unit square.)
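To make Example 14.1 concrete, the following Python sketch (ours, with hypothetical p and q) computes the vertices of the lower boundary β_α(P, Q) for two distributions on a finite alphabet: sort the symbols by decreasing likelihood ratio P/Q and take cumulative sums, exactly as the Neyman-Pearson lemma below prescribes; randomization interpolates linearly between consecutive vertices.

import numpy as np

def np_lower_boundary(P, Q):
    """Vertices (alpha, beta) of the lower boundary of R(P, Q); assumes Q > 0 on each symbol."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    order = np.argsort(-(P / Q))                 # decreasing likelihood ratio
    alpha = np.concatenate([[0.0], np.cumsum(P[order])])
    beta = np.concatenate([[0.0], np.cumsum(Q[order])])
    return alpha, beta

p, q = 0.3, 0.8                                  # hypothetical Ber(p) vs Ber(q)
alpha, beta = np_lower_boundary([1 - p, p], [1 - q, q])
for a, b in zip(alpha, beta):
    print(f"alpha = {a:.2f}   beta_alpha = {b:.2f}")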
Definition 14.4 (Extended log likelihood ratio). Assume that dP = p(x)dμ and dQ = q(x)dμ for
some dominating measure μ (e.g. μ = P + Q.) Recalling the definition of Log from (2.10) we
We see that taking the expectation over P and over Q are equivalent upon multiplying the quantity inside the expectation by exp(±T). The next result gives the precise details in the general case.
=^{(c)} ∫_{{−∞<T(x)≤∞}} p(x) exp(−T(x)) h(x) dμ = E_P[exp(−T) g(T)] ,
where in (a) we used (14.8) to justify the restriction to finite values of T; in (b) we used exp(−T(x)) = q(x)/p(x) for p, q > 0; and (c) follows from the fact that exp(−T(x)) = 0 whenever T = ∞. Exchanging the roles of P and Q proves (14.6).
The last part follows upon taking h(x) = f(x)1{T(x) ≥ τ} and h(x) = f(x)1{T(x) ≤ τ} in (14.5) and (14.6), respectively.
The importance of the LLR is that it is a sufficient statistic for testing the two hypotheses (recall
Section 3.5 and in particular Example 3.8), as the following result shows.
Proof. For part 2, sufficiency of T would be implied by P_{X|T} = Q_{X|T}. For the case of X being discrete we have:
From Theorem 14.3 we know that to obtain the achievable region R(P, Q), one can iterate over
all decision regions and compute the region Rdet (P, Q) first, then take its closed convex hull. But
this is a formidable task if the alphabet is large or infinite. On the other hand, we know that the
LLR is a sufficient statistic. Next we give bounds to the region R(P, Q) in terms of the statistics
of the LLR. As usual, there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy certain constraints;
• Achievability (inner bounds): points satisfying certain constraints belong to R(P, Q).
d(α‖β) ≤ D(P‖Q) ,   d(β‖α) ≤ D(Q‖P) .
We will strengthen this bound with the aid of the following result.
Note that we do not need to assume P ≪ Q precisely because ±∞ are admissible values for the (extended) LLR.
Proof. Defining τ = log γ and g(x) = P_{Z|X}(0|x) we get from (14.7):
P[Z = 0, T ≤ τ] − γ Q[Z = 0, T ≤ τ] ≤ 0 .
Decomposing P[Z = 0] = P[Z = 0, T ≤ τ] + P[Z = 0, T > τ] and similarly for Q, we then obtain
P[Z = 0] − γ Q[Z = 0] ≤ P[T > log γ, Z = 0] − γ Q[T > log γ, Z = 0] ≤ P[T > log γ] .
• Theorem 14.9 provides an outer bound for the region R(P, Q) in terms of half-spaces. To see this, fix γ > 0 and consider the line α − γβ = c, gradually increasing c from zero. There exists a maximal c, say c*, at which point the line touches the lower boundary of the region. Then (14.9) says that c* cannot exceed P[log (dP/dQ) > log γ]. Hence R must lie to the left of the line. Similarly, (14.10) provides bounds for the upper boundary. Altogether Theorem 14.9 states that R(P, Q) is contained in the intersection of an infinite collection of half-spaces indexed by γ.
• To apply the strong converse Theorem 14.9, we need to know the CDF of the LLR, whereas to apply the weak converse Theorem 14.7 we only need to know the expectation of the LLR, i.e., the divergence.
which is equivalent to minimizing the average probability of error in (14.1), with t = π_1/π_0. This can be solved without much effort. For simplicity, consider the discrete case. Then
α* − tβ* = max_{(α,β)∈R} (α − tβ) = max_{P_{Z|X}} Σ_{x∈X} (P(x) − tQ(x)) P_{Z|X}(0|x) = Σ_{x∈X} [P(x) − tQ(x)]⁺ ,
where the last equality follows from the fact that we are free to choose P_{Z|X}(0|x), and the best choice is obvious:
P_{Z|X}(0|x) = 1{ log (P(x)/Q(x)) ≥ log t } .
Thus, we have shown that all supporting hyperplanes are parameterized by LRTs. This completely recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the region. The precise result is stated as follows:
Theorem 14.10 (Neyman-Pearson Lemma: "LRT is optimal"). For each α, the infimum β_α in (14.3) is attained by the following test:
P_{Z|X}(0|x) = { 1,  if log (dP/dQ) > τ ;   λ,  if log (dP/dQ) = τ ;   0,  if log (dP/dQ) < τ }   (14.11)
where τ ∈ R ∪ {±∞} and λ ∈ [0, 1] are chosen so that P[Z = 0] = α.
Proof of Theorem 14.10. Let t = exp(τ). Given any test P_{Z|X}, let g(x) = P_{Z|X}(0|x) ∈ [0, 1]. We want to show that
α = P[Z = 0] = E_P[g(X)] = P[dP/dQ > t] + λ P[dP/dQ = t]   (14.12)
implies (this is the goal)
β = Q[Z = 0] = E_Q[g(X)] ≥ Q[dP/dQ > t] + λ Q[dP/dQ = t] .   (14.13)
Using the simple fact that E_Q[f(X)1{dP/dQ ≤ t}] ≥ t^{−1} E_P[f(X)1{dP/dQ ≤ t}] for any f ≥ 0 twice, we have
β = E_Q[g(X)] ≥ (1/t) E_P[g(X)1{dP/dQ ≤ t}] + E_Q[g(X)1{dP/dQ > t}]
  =_{(14.12)} (1/t) ( E_P[(1 − g(X))1{dP/dQ > t}] + λ P[dP/dQ = t] ) + E_Q[g(X)1{dP/dQ > t}]
  ≥ E_Q[(1 − g(X))1{dP/dQ > t}] + λ Q[dP/dQ = t] + E_Q[g(X)1{dP/dQ > t}]
  = Q[dP/dQ > t] + λ Q[dP/dQ = t] .
Remark 14.3. As a consequence of the Neyman-Pearson lemma, all the points on the boundary of the region R(P, Q) are attainable. Since α ↦ β_α is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set, as previously stated in Theorem 14.2. Consequently, the infimum in the definition of β_α is in fact a minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the following two sets:
{ (α, β) : α = P[log (dP/dQ) > τ], β = Q[log (dP/dQ) > τ] } ,   τ ∈ R ∪ {±∞},
and
{ (α, β) : α = P[log (dP/dQ) ≥ τ], β = Q[log (dP/dQ) ≥ τ] } ,   τ ∈ R ∪ {±∞}.
Therefore it does not lose optimality to restrict our attention to tests of the form 1{log (dP/dQ) ≥ τ} or 1{log (dP/dQ) > τ};² convex combinations (randomization) of these two types of tests trace out the full lower boundary.
(Figure: the complementary CDF τ ↦ P[log (dP/dQ) > τ], showing how a target level α determines the threshold τ.)
2
Note that it so happens that in Definition 14.4 the LRT is defined with an ≤ instead of <.
β ≤ exp(−τ) P[ log (dP/dQ) > τ ] ≤ exp(−τ)
• Stein regime: When π 1|0 is constrained to be at most ϵ, what is the best exponential rate of
convergence for π 0|1 ?
• Chernoff regime: When both π 1|0 and π 0|1 are required to vanish exponentially, what is the
optimal tradeoff between their exponents?
Theorem 14.13 (Stein's lemma). Consider the iid setting (14.14) where P_{X^n} = Pⁿ and Q_{X^n} = Qⁿ. Then, for every ϵ ∈ (0, 1), Vϵ = D(P‖Q). Consequently, V = D(P‖Q).
The way to use this result in practice is the following. Suppose it is required that α ≥ 0.999 and β ≤ 10^{−40}; what is the required sample size? Stein's lemma provides a rule of thumb: n ≥ (−log 10^{−40}) / D(P‖Q).
F_n = log (dP_{X^n}/dQ_{X^n}) = Σ_{i=1}^n log (dP/dQ)(X_i) ,   (14.15)
i=1
Note that both convergence results hold even if the divergence is infinite.
(Achievability) We show that Vϵ ≥ D(P‖Q) ≡ D for any ϵ > 0. First assume that D < ∞. Pick τ = n(D − δ) for some small δ > 0. Then Corollary 14.11 yields β ≤ exp(−n(D − δ)); picking n large enough (depending on ϵ, δ) so that α ≥ 1 − ϵ, the exponent E = D − δ is achievable and Vϵ ≥ E. Sending δ → 0 yields Vϵ ≥ D. Finally, if D = ∞, the above argument holds for arbitrary τ > 0, proving that Vϵ = ∞.
(Converse) We show that Vϵ ≤ D for any ϵ < 1, to which end it suffices to consider D < ∞. As
a warm-up, we first show a weak converse by applying Theorem 14.7 based on data processing
inequality. For any (α, β) ∈ R(PXn , QXn ), we have
−h(α) + α log (1/β) ≤ d(α‖β) ≤ D(P_{X^n}‖Q_{X^n})   (14.18)
For any achievable exponent E < Vϵ, by definition, there exists a sequence of tests such that α_n ≥ 1 − ϵ and β_n ≤ exp(−nE). Plugging this into (14.18) and using h ≤ log 2, we have E ≤ D(P‖Q)/(1 − ϵ) + log 2/(n(1 − ϵ)). Sending n → ∞ yields
Vϵ ≤ D(P‖Q)/(1 − ϵ) ,
which is weaker than what we set out to prove; nevertheless, this weak converse is tight for ϵ → 0, so that for Stein's exponent we have succeeded in proving the desired result V = lim_{ϵ→0} Vϵ = D(P‖Q). So the question remains: if we allow the type-I error to be ϵ = 0.999, is it possible for the type-II error to decay faster? This is shown to be impossible by the strong converse next.
To this end, note that in proving the weak converse we only made use of the expectation of F_n in (14.18); to obtain a better result we need to make use of its entire distribution (CDF). Applying the strong converse Theorem 14.9 to testing P_{X^n} versus Q_{X^n} with α = 1 − ϵ and β = exp(−nE), we have
1 − ϵ − γ exp(−nE) ≤ P[F_n > log γ] .
Pick γ = exp(n(D + δ)) for δ > 0; by the WLLN (14.16) the probability on the right side goes to 0, which implies that for any fixed ϵ < 1 we have E ≤ D + δ and hence Vϵ ≤ D + δ. Sending δ → 0 completes the proof.
Finally, let us address the case P ̸≪ Q, in which case D(P‖Q) = ∞. By definition, there exists a subset A such that Q(A) = 0 but P(A) > 0. Consider the test that selects P if X_i ∈ A for some i ∈ [n]. It is clear that this test achieves β = 0 and 1 − α = (1 − P(A))^n, which can be made less than any ϵ for large n. This shows Vϵ = ∞, as desired.
Remark 14.5 (Non-iid data). Just like in Chapter 12 on data compression, Theorem 14.13 can be
extended to stationary ergodic processes:
Vϵ = lim_{n→∞} (1/n) D(P_{X^n}‖Q_{X^n})
where {Xi } is stationary and ergodic under both P and Q. Indeed, the counterpart of (14.16) based
on WLLN, which is the key for choosing the appropriate threshold τ , for ergodic processes is the
Birkhoff-Khintchine convergence theorem (cf. Theorem 12.8).
Thus knowledge of Stein’s exponent Vϵ allows one to prove exponential bounds on probabilities
of arbitrary sets; this technique is known as “change of measure”, which will be applied in large
deviations analysis in Chapter 15.
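As an illustration of Stein's lemma (ours, with arbitrary Bernoulli parameters), the Python sketch below computes the exact β_{1−ϵ}(Pⁿ, Qⁿ) for P = Ber(p), Q = Ber(q) by applying the Neyman-Pearson test to the type of the sequence, and shows −(1/n) log β approaching D(P‖Q).

import numpy as np
from scipy.stats import binom

def beta_alpha_bernoulli(p, q, n, alpha):
    """Exact beta_alpha(P^n, Q^n) for P = Ber(p), Q = Ber(q) via the NP test on the type."""
    k = np.arange(n + 1)
    Pk, Qk = binom.pmf(k, n, p), binom.pmf(k, n, q)
    llr = k * np.log(p / q) + (n - k) * np.log((1 - p) / (1 - q))
    order = np.argsort(-llr)                  # accept H0 on the highest-LLR types first
    Pc, Qc = np.cumsum(Pk[order]), np.cumsum(Qk[order])
    j = np.searchsorted(Pc, alpha)            # first index where P-mass reaches alpha
    lam = (alpha - (Pc[j-1] if j > 0 else 0.0)) / Pk[order][j]   # randomize on the boundary type
    return (Qc[j-1] if j > 0 else 0.0) + lam * Qk[order][j]

p, q, eps = 0.6, 0.3, 0.1                     # hypothetical choices
D = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))      # D(P||Q) in nats
for n in [50, 200, 800]:
    beta = beta_alpha_bernoulli(p, q, n, 1 - eps)
    print(n, "-(1/n) log beta =", round(-np.log(beta) / n, 4), "  D(P||Q) =", round(D, 4))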
H0 : X^n ∼ P^n   versus   H1 : X^n ∼ Q^n ,
but the objective in the Chernoff regime is to achieve exponentially small error probability of both
types simultaneously. We say a pair of exponents (E0 , E1 ) is achievable if there exists a sequence
of tests such that
1 − α = π 1|0 ≤ exp(−nE0 )
β = π 0|1 ≤ exp(−nE1 ).
Intuitively, one exponent can be made large at the expense of making the other small. So the interest-
ing question is to find their optimal tradeoff by characterizing the achievable region of (E0 , E1 ).
This problem was solved by [159, 38] and is the topic of Chapter 16. (See Fig. 16.2 for an
illustration of the optimal (E0 , E1 )-tradeoff.)
Let us explain what we already know about the region of achievable pairs of exponents (E0 , E1 ).
First, Stein’s regime corresponds to corner points of this achievable region. Indeed, Theo-
rem 14.13 tells us that when fixing αn = 1 − ϵ, namely E0 = 0, picking τ = D(PkQ) − δ
(δ → 0) gives the exponential convergence rate of β_n as E1 = D(P‖Q). Similarly, exchanging the roles of P and Q, we can achieve the point (E0, E1) = (D(Q‖P), 0).
Second, we have shown in Section 7.3 that the minimum total error probability over all tests satisfies
min_{(α,β)∈R(P^n,Q^n)} 1 − α + β = 1 − TV(P^n, Q^n) ,
where we denoted
E_H ≜ log 1/(1 − H²(P, Q)/2) ,
and the exponent E of the minimum total error probability satisfies
E_H ≤ E ≤ 2E_H .
This characterization is valid even if P and Q depend on the sample size n, which will prove useful later when we study composite hypothesis testing in Section 32.2.1. However, for fixed P and Q this is not precise enough. In order to determine the full set of achievable pairs, we need to make
a detour into the topic of large deviations next. To see how this connection arises, notice that the (optimal) likelihood ratio tests give us explicit expressions for both error probabilities:
1 − α_n = P[ (1/n) F_n ≤ τ ] ,   β_n = Q[ (1/n) F_n > τ ] ,
where F_n is the LLR in (14.15). When τ falls in the range (−D(Q‖P), D(P‖Q)), both probabilities are vanishing thanks to the WLLN – see (14.16) and (14.17) – and we are interested in their exponential convergence rate. This falls under the purview of large deviations theory.
1 Basics of large deviation: log moment generating function (MGF) ψX and its conjugate (rate
function) ψX∗ , tilting.
2 Information projection problem:
3 Use information projection to prove a tight Chernoff bound: for iid copies X1, . . . , Xn of X,
P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] = exp(−nψ*(γ) + o(n)) .
In the next chapter, we apply these results to characterize the achievable (E0 , E1 )-region (as defined
in Section 14.6) to get
The full account of such theory requires delicate consideration of topological properties of E , and
is the subject of classical treatments e.g. [87]. We focus here on a simple special case which,
however, suffices for the purpose of establishing the Chernoff exponents in hypothesis testing,
and also showcases all the relevant information-theoretic ideas. Our ultimate goal is to show the
following result:
Theorem 15.1. Consider a random variable X whose log MGF ψ_X(λ) = log E[exp(λX)] is finite for all λ ∈ R. Let B = esssup X and let E[X] < γ < B. Then
P[ Σ_{i=1}^n X_i ≥ nγ ] = exp{−nE(γ) + o(n)} ,
where E(γ) = sup_{λ≥0} (λγ − ψ_X(λ)) = ψ*_X(γ), known as the rate function.
The concepts of log MGF and the rate function will be elaborated in subsequent sections. We
provide the proof below that should be revisited after reading the rest of the chapter.
Proof. Let us recall the usual Chernoff bound: for iid X^n and any λ ≥ 0,
P[ Σ_{i=1}^n X_i ≥ nγ ] = P[ exp(λ Σ_{i=1}^n X_i) ≥ exp(nλγ) ]
  ≤ exp(−nλγ) E[ exp(λ Σ_{i=1}^n X_i) ]   (Markov inequality)
  = exp(−n(λγ − ψ_X(λ))) .
Optimizing over λ ≥ 0 gives the non-asymptotic upper bound (concentration inequality), which holds for any n:
P[ Σ_{i=1}^n X_i ≥ nγ ] ≤ exp{ −n sup_{λ≥0} (λγ − ψ_X(λ)) } .   (15.1)
This proves the upper bound part of Theorem 15.1. Our goal, thus, is to show the lower bound. This
will be accomplished by first expressing E(γ) as a certain KL-minimization problem (see Theo-
rem 15.9), known as information projection, and then solving this problem (see (15.26)) to obtain
the desired value of E(γ). In the process of this proof we will also understand why the apparently
naive Chernoff bound is in fact sharp. The explanation is that, essentially, inequality (15.1) per-
forms a change of measure to a tilted distribution Pλ , which is the closest to P (in KL divergence)
among all distributions Q with EQ [X] ≥ γ .
As an example, for a standard Gaussian Z ∼ N(0, 1), we have ψ_Z(λ) = λ²/2. Taking X = Z³ yields a random variable such that ψ_X(λ) is infinite for all non-zero λ.
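Numerically, the rate function and the Chernoff bound are easy to explore; the following Python sketch (ours, with hypothetical p, γ and n) computes ψ*_X(γ) for X ∼ Ber(p) by a one-dimensional optimization and compares it with a Monte Carlo estimate of −(1/n) log P[Σ X_i ≥ nγ]; the residual gap reflects the sub-exponential o(n) term.

import numpy as np
from scipy.optimize import minimize_scalar

p, gamma, n = 0.3, 0.45, 200           # hypothetical Ber(p), threshold gamma > E[X] = p

def psi(lmbda):                        # log MGF of Ber(p), in nats
    return np.log(1 - p + p * np.exp(lmbda))

# rate function psi*(gamma) = sup_{lambda >= 0} (lambda*gamma - psi(lambda))
res = minimize_scalar(lambda l: -(l * gamma - psi(l)), bounds=(0.0, 50.0), method="bounded")
rate = -res.fun

rng = np.random.default_rng(0)
hits = np.mean(rng.binomial(n, p, size=2_000_000) >= n * gamma)
print("psi*(gamma)                    =", round(rate, 4))
print("-(1/n) log P[sum >= n*gamma]  ≈", round(-np.log(hits) / n, 4))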
In the remainder of the chapter, we shall assume that the MGF of the random variable X is finite, namely ψ_X(λ) < ∞ for all λ ∈ R. This, in particular, implies that all moments of X are finite.
then A ≤ X ≤ B a.s.;
(f) If X is not a constant, then ψX is strictly convex, and consequently, ψX′ is strictly increasing.
(g) Chernoff bound:
Remark 15.1. The slope of log MGF encodes the range of X. Indeed, Theorem 15.3(d) and (e)
together show that the smallest closed interval containing the support of PX equals (closure of) the
range of ψX′ . In other words, A and B coincide with the essential infimum and supremum (min and
max of RV in the probabilistic sense) of X respectively,
Proof. Note that (g) is already proved in (15.1). The proof of (e)–(f) relies on Theorem 15.8 and
can be revisited later.
E[exp((λ1/p + λ2/q)X)] ≤ ‖exp(λ1X/p)‖_p ‖exp(λ2X/q)‖_q = E[exp(λ1X)]^θ E[exp(λ2X)]^{θ̄} ,
(Figure 15.1: Example of a log MGF ψ_X(λ) with P_X supported on [A, B]. The limiting minimal and maximal slopes are A and B respectively; the slope at λ = 0 is ψ′_X(0) = E[X]. The plot is for X = ±1 with P[X = 1] = 1/3.)
(c) The subtlety here is that we need to be careful when exchanging the order of differentiation and expectation.
Assume without loss of generality that λ ≥ 0. First, we show that E[|Xe^{λX}|] exists. Since |X| ≤ e^{|X|} ≤ e^X + e^{−X}, we have
|Xe^{λX}| ≤ e^{|(λ+1)X|} ≤ e^{(λ+1)X} + e^{−(λ+1)X}
≤ E[ |X| e^{λ(B−ϵ)−c−λ(B−ϵ/2)} ]
= E[|X|] e^{−λϵ/2−c} → 0 as λ → ∞,
where the first inequality is from (15.4) and the second from X < B − ϵ. Thus, the first term in (15.3) goes to 0, implying the desired contradiction.
(f) Suppose ψ_X is not strictly convex. Since ψ_X is convex, it must then be "flat" (affine) near some point: there exists a small neighborhood of some λ0 such that ψ_X(λ0 + u) = ψ_X(λ0) + ur for some r ∈ R. Then ψ_{P_{λ0}}(u) = ur for all u in a small neighborhood of zero, or equivalently E_{P_{λ0}}[e^{u(X−r)}] = 1 for small u. The following Lemma 15.4 implies P_{λ0}[X = r] = 1, but then P[X = r] = 1, contradicting the assumption X ≠ const.
Definition 15.5 (Rate function). The rate function ψ*_X : R → R ∪ {+∞} is given by the Legendre-Fenchel transform of the log MGF:
ψ*_X(γ) ≜ sup_{λ∈R} (λγ − ψ_X(λ)) .   (15.5)
Note that the maximization (15.5) is a convex optimization problem since ψ_X is strictly convex, so we can find the maximum by taking the derivative and finding the stationary point. In fact, ψ*_X is precisely the convex conjugate of ψ_X; cf. (7.78).
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in the
vertical strip {z : |Rez| < 1}.
The next result describes useful properties of the rate function. See Fig. 15.2 for an illustration.
(Figure 15.2: the log MGF ψ_X(λ) and its conjugate (rate function) ψ*_X(γ), which is finite on (A, B) and equals +∞ outside [A, B], for X taking values in [A, B]; continuing the example in Fig. 15.1.)
(b) ψX∗ is strictly convex and strictly positive except ψX∗ (E[X]) = 0.
(c) ψ*_X is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B).
Proof. By Theorem 15.3(d), since A ≤ X ≤ B a.s., we have A ≤ ψ′_X ≤ B. When γ ∈ (A, B), the strictly concave function λ ↦ λγ − ψ_X(λ) has a single stationary point which achieves the unique maximum. When γ > B (resp. < A), λ ↦ λγ − ψ_X(λ) increases (resp. decreases) without bound. When γ = B, since X ≤ B a.s., we have
ψ*_X(B) = sup_{λ∈R} ( λB − log E[exp(λX)] ) = − log inf_{λ∈R} E[exp(λ(X − B))]
P_λ(dx) = ( e^{λx} / E[e^{λX}] ) P(dx) = e^{λx − ψ_X(λ)} P(dx) .   (15.6)
In particular, if P has a PDF p, then the PDF of P_λ is given by p_λ(x) = e^{λx − ψ_X(λ)} p(x).
In the above examples we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0
(resp. < 0). Indeed, this is a general property of tilting.
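A short Python sketch (ours, with an arbitrary finitely supported P) illustrates tilting: as λ grows the mean of P_λ increases, and D(P_λ‖P) = λ E_{P_λ}[X] − ψ_X(λ), the identity used in the proof of Corollary 15.12 later in this section.

import numpy as np

# A hypothetical distribution P on a few real-valued atoms
xs = np.array([-1.0, 0.0, 0.5, 2.0])
P  = np.array([0.2, 0.4, 0.3, 0.1])

def tilt(lmbda):
    """Tilted distribution P_lambda(x) proportional to exp(lambda*x) P(x)."""
    w = P * np.exp(lmbda * xs)
    return w / w.sum()

def psi(lmbda):                       # log MGF, in nats
    return np.log(np.sum(P * np.exp(lmbda * xs)))

for lmbda in [-1.0, 0.0, 1.0, 2.0]:
    Pl = tilt(lmbda)
    mean = np.dot(Pl, xs)                                    # equals psi'(lambda)
    kl = np.sum(Pl * np.log(Pl / P))                         # D(P_lambda || P)
    print(f"lambda={lmbda:+.1f}  E_Pl[X]={mean:+.3f}  D(Pl||P)={kl:.4f}  "
          f"lambda*mean - psi = {lmbda*mean - psi(lmbda):.4f}")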
= ψ*_X(ψ′_X(λ)) , where the last equality follows from Theorem 15.6(a).
Theorem 15.9. Let X1, X2, . . . be i.i.d. ∼ P. Then for any γ ∈ R,
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k > γ ] = inf_{Q : E_Q[X]>γ} D(Q‖P)   (15.9)
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] = inf_{Q : E_Q[X]≥γ} D(Q‖P)   (15.10)
Remark 15.2 (Subadditivity). It is possible to argue from first principles that the limits (15.9) and (15.10) exist. Indeed, the sequence p_n ≜ P[(1/n) Σ_{k=1}^n X_k ≥ γ] satisfies p_{n+m} ≥ p_n p_m, and hence log(1/p_n) is subadditive. As such, lim_{n→∞} (1/n) log(1/p_n) = inf_n (1/n) log(1/p_n) by Fekete's lemma.
Proof. First note that if the events have zero probability, then both sides coincide with infinity. Indeed, if P[(1/n) Σ_{k=1}^n X_k > γ] = 0, then P[X > γ] = 0. Then E_Q[X] > γ ⇒ Q[X > γ] > 0 ⇒ Q ̸≪ P ⇒ D(Q‖P) = ∞ and hence (15.9) holds trivially. The case for (15.10) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (15.9). Set P[E_n] = P[(1/n) Σ_{k=1}^n X_k > γ].
Lower Bound on P[E_n]: Fix a Q such that E_Q[X] > γ and let X^n be iid under Q. Then by the WLLN,
Q[E_n] = Q[ Σ_{k=1}^n X_k > nγ ] = 1 − o(1) .
Upper Bound on P[E_n]: The key observation is that for any X and any event E with P_X(E) > 0, the probability can be expressed via the divergence between the conditional and unconditional distributions as log 1/P_X(E) = D(P_{X|X∈E}‖P_X). Define P̃_{X^n} = P_{X^n | Σ X_i > nγ}, under which Σ X_i > nγ holds a.s. Then
log 1/P[E_n] = D(P̃_{X^n}‖P_{X^n}) ≥ inf_{Q_{X^n} : E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n})   (15.13)
We now show that the last problem "single-letterizes", i.e., reduces to n = 1. Note that this is a special case of a more general phenomenon – see Ex. III.7. Consider the following two steps:
D(Q_{X^n}‖P_{X^n}) ≥ Σ_{j=1}^n D(Q_{X_j}‖P) ≥ n D(Q̄‖P) ,   Q̄ ≜ (1/n) Σ_{j=1}^n Q_{X_j} ,   (15.14)
where the first step follows from (2.26) in Theorem 2.14, after noticing that P_{X^n} = P^n, and the second step is by convexity of divergence (Theorem 5.1). From this argument we conclude that
inf_{Q_{X^n} : E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q : E_Q[X]>γ} D(Q‖P)   (15.15)
inf_{Q_{X^n} : E_Q[Σ X_i]≥nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q : E_Q[X]≥γ} D(Q‖P)   (15.16)
In particular, (15.13) and (15.15) imply the required lower bound in (15.9).
Next we prove (15.10). First, notice that the lower bound argument (15.13) applies equally well, so that for each n we have
(1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] ≥ inf_{Q : E_Q[X]≥γ} D(Q‖P) .
• Case I: P[X > γ] = 0. If P[X ≥ γ] = 0, then both sides of (15.10) are +∞. If P[X = γ] > 0, then P[Σ X_k ≥ nγ] = P[X_1 = . . . = X_n = γ] = P[X = γ]^n. For the right-hand side, since D(Q‖P) < ∞ ⟹ Q ≪ P ⟹ Q(X ≤ γ) = 1, the only possibility for E_Q[X] ≥ γ is that Q(X = γ) = 1, i.e., Q = δ_γ. Then inf_{E_Q[X]≥γ} D(Q‖P) = log 1/P(X = γ).
• Case II: P[X > γ] > 0. Since P[Σ X_k ≥ nγ] ≥ P[Σ X_k > nγ], from (15.9) we know that
lim sup_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] ≤ inf_{Q : E_Q[X]>γ} D(Q‖P) .
Indeed, let P̃ = P_{X|X>γ}, which is well defined since P[X > γ] > 0. For any Q such that E_Q[X] ≥ γ, set Q̃ = ϵ̄Q + ϵP̃, which satisfies E_{Q̃}[X] > γ. Then by convexity, D(Q̃‖P) ≤ ϵ̄D(Q‖P) + ϵD(P̃‖P) = ϵ̄D(Q‖P) + ϵ log 1/P[X > γ]. Sending ϵ → 0, we conclude the proof of (15.17).
Remark 15.3. Note that the upper bound (15.11) also holds for independent non-identically distributed X_i. Indeed, we only need to replace the step (15.14) with D(Q_{X^n}‖P_{X^n}) ≥ Σ_{i=1}^n D(Q_{X_i}‖P_{X_i}) ≥ nD(Q̄‖P̄), where P̄ = (1/n) Σ_{i=1}^n P_{X_i}. This yields the bound (15.11) with P replaced by P̄ on the right-hand side.
where f(u) ≜ u log u − (u − 1) log e ≥ 0. These follow from (15.18)-(15.19) via the following useful estimate:
d(up‖p) ≥ p f(u)   ∀p ∈ [0, 1], u ∈ [0, 1/p] .   (15.20)
Indeed, consider the elementary inequality
x log (x/y) ≥ (x − y) log e
for all x, y ∈ [0, 1] (since the difference between the left and right sides is minimized over y at y = x). Using x = 1 − up and y = 1 − p establishes (15.20).
• Bernstein's inequality:
P[X > np + t] ≤ e^{−t²/(2(t+np))}   ∀t > 0 .
This follows from the previous bound for u > 1 by bounding f(u)/log e = ∫_1^u (u − x)/x dx ≥ (1/u) ∫_1^u (u − x) dx = (u−1)²/(2u).
• Okamoto's inequality: For all 0 < p < 1 and t > 0,
P[ √X − √(np) ≥ t ] ≤ e^{−t²} ,   (15.21)
P[ √X − √(np) ≤ −t ] ≤ e^{−t²} .   (15.22)
These simply follow from the inequality between KL divergence and Hellinger distance in (7.30). Indeed, we get d(x‖p) ≥ H²(Ber(x), Ber(p)) ≥ (√x − √p)². Plugging x = (√(np)+t)²/n into (15.18)-(15.19) we obtain the result. We note that [226, Theorem 3] shows a stronger bound of e^{−2t²} in (15.21).
Remarkably, the bounds in (15.21) and (15.22) do not depend on n or p. This is due to the variance-stabilizing effect of the square-root transformation for binomials: Var(√X) is at most a constant for all n, p. In addition, √X − √(np) = (X − np)/(√X + √(np)) is of a self-normalizing form: the denominator is on par with the standard deviation of the numerator. For more on self-normalized sums, see [45, Problem 12.2].
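These binomial tail bounds are easy to sanity-check; the Python sketch below (ours, with arbitrary n, p, t) compares the exact tail P[X > np + t] with the Bernstein bound and with the Okamoto bound applied to the same event.

import numpy as np
from scipy.stats import binom

n, p = 100, 0.2
for t in [5, 10, 20]:
    exact = binom.sf(n * p + t, n, p)                 # P[X > np + t]
    bernstein = np.exp(-t**2 / (2 * (t + n * p)))
    s = np.sqrt(n * p + t) - np.sqrt(n * p)           # same event in the square-root scale
    okamoto = np.exp(-s**2)                           # bound on P[sqrt(X) - sqrt(np) >= s]
    print(f"t={t:2d}  exact={exact:.3e}  Bernstein={bernstein:.3e}  Okamoto={okamoto:.3e}")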
inf_{Q∈E} D(Q‖P)
Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E .
(Figure: the I-projection Q* of P onto the convex set E within the space of distributions on X.)
Theorem 15.10. Suppose ∃Q* ∈ E such that D(Q*‖P) = min_{Q∈E} D(Q‖P). Then ∀Q ∈ E,
D(Q‖P) ≥ D(Q‖Q*) + D(Q*‖P) .
Proof. If D(QkP) = ∞, then there is nothing to prove. So we assume that D(QkP) < ∞, which
also implies that D(Q∗ kP) < ∞. For λ ∈ [0, 1], form the convex combination Q(λ) = λ̄Q∗ +λQ ∈
E . Since Q∗ is the minimizer of D(QkP), then
0 ≤ (d/dλ) D(Q^{(λ)}‖P) |_{λ=0} = D(Q‖P) − D(Q‖Q*) − D(Q*‖P) .
The rigorous analysis requires an argument for interchanging derivatives and integrals (via domi-
nated convergence theorem) and is similar to the proof of Proposition 2.18. The details are in [83,
Theorem 2.2].
Remark 15.4. If we view the picture above in the Euclidean setting, the “triangle” formed by P,
Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The relevant set E of Q's that we will focus on next is the "half-space" of distributions E = {Q : E_Q[X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by the relation with the large-deviations exponent in Theorem 15.9. First, we solve this I-projection problem explicitly.
2 Whenever the minimum is finite, the minimizing distribution is unique and equal to tilting of P
along X, namely2
Remark 15.5. Both Theorem 15.9 and Theorem 15.11 are stated for the right tail, where the sample mean exceeds the population mean. For the left tail, simply apply these results to −X_i to obtain, for γ < E[X],
lim_{n→∞} (1/n) log 1/P[ (1/n) Σ_{k=1}^n X_k < γ ] = inf_{Q : E_Q[X]<γ} D(Q‖P) = ψ*_X(γ).
In other words, the large deviation exponent is still given by the rate function (15.5), except that the optimal tilting parameter λ is negative.
Proof. We first prove (15.25).
2
Note that unlike the setting of Theorems 15.1 and 15.9 here P and Pλ are measures on an abstract space Ω, not necessarily
on the real line.
• Third case: If P(X = B) = 0, then X < B a.s. under P, and Q ̸≪ P for any Q s.t. E_Q[X] ≥ B. Then the minimum is ∞. Now assume P(X = B) > 0. Since D(Q‖P) < ∞ ⟹ Q ≪ P ⟹ Q(X ≤ B) = 1, the only possibility for E_Q[X] ≥ B is that Q(X = B) = 1, i.e., Q = δ_B. Then D(Q‖P) = log 1/P(X = B).
• Second case: Fix E_P[X] ≤ γ < B, and find the unique λ such that ψ′_X(λ) = γ = E_{P_λ}[X], where dP_λ = exp(λX − ψ_X(λ)) dP. This corresponds to tilting P far enough to the right to increase its mean from E_P[X] to γ; in particular λ ≥ 0. Moreover, ψ*_X(γ) = λγ − ψ_X(λ). Take any Q such that E_Q[X] ≥ γ; then
D(Q‖P) = E_Q[ log ( (dQ/dP_λ) · (dP_λ/dP) ) ]   (15.28)
= D(Q‖P_λ) + E_Q[ log (dP_λ/dP) ]
= D(Q‖P_λ) + E_Q[λX − ψ_X(λ)]
≥ D(Q‖P_λ) + λγ − ψ_X(λ)
= D(Q‖P_λ) + ψ*_X(γ)
≥ ψ*_X(γ) ,   (15.29)
where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows
the minimizer is unique, proving the second claim. Note that even in the corner case of γ = B
(assuming P(X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure
(P∞ ), since Pλ → δB as λ → ∞, cf. Theorem 15.8(c).
An alternative version of the solution, given by expression (15.26), follows from Theorem 15.6.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X] we
have just shown
The final step is to notice that ψ*_X is increasing and continuous by Theorem 15.6, and hence the right-hand side infimum equals ψ*_X(γ). The case of min_{Q : E_Q[X]=γ} is handled similarly.
Corollary 15.12. For any Q with E_Q[X] ∈ (A, B), there exists a unique λ ∈ R such that the tilted distribution P_λ satisfies
E_{P_λ}[X] = E_Q[X]   and   D(P_λ‖P) ≤ D(Q‖P) ,
and furthermore the gap in the last inequality equals D(Q‖P_λ) = D(Q‖P) − D(P_λ‖P).
Proof. Proceed as in the proof of Theorem 15.11, and find the unique λ s.t. EPλ [X] = ψX′ (λ) =
EQ [X]. Then D(Pλ kP) = ψX∗ (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (15.28)-(15.29)
obtaining D(QkP) = D(QkPλ ) + D(Pλ kP).
Remark: For any Q, this allows us to find a tilted measure Pλ that has the same mean yet
smaller (or equal) divergence.
(Figure: the space of distributions on R, sliced by the sets {Q : E_Q[X] = γ} as γ ranges from A to B; the one-parameter family {P_λ} starts at P (λ = 0) and meets each slice at Q* = P_λ, with D(P_λ‖P) = ψ*(γ); the regions γ < A and γ > B contain only Q ̸≪ P.)
• Each set {Q : E_Q[X] = γ} corresponds to a slice. As γ varies from A to B, the slices fill the entire space except for the corner regions.
• When γ < A or γ > B, Q ̸≪ P.
• As γ varies, the P_λ's trace out a curve via ψ*(γ) = D(P_λ‖P). This set of distributions is called a one-parameter family, or exponential family.
Key Point: The one-parameter family curve intersects each γ-slice E = {Q : E_Q[X] = γ} "orthogonally" at the minimizing Q* ∈ E, and the distance from P to Q* is given by ψ*(γ). To see this, note that applying Theorem 15.10 to the convex set E gives us D(Q‖P) ≥ D(Q‖Q*) + D(Q*‖P). Now thanks to Corollary 15.12, we in fact have equality D(Q‖P) = D(Q‖Q*) + D(Q*‖P) and Q* = P_λ for some tilted measure.
Examples of regularity conditions in the above theorem include: (a) X is finite and E is closed
with non-empty interior – see Exercise III.12 for a full proof in this case; (b) X is a Polish space
and the set E is weakly closed and has non-empty interior.
Proof sketch. The lower bound is proved as in Theorem 15.9: Just take an arbitrary Q ∈ E and
apply a suitable version of WLLN to conclude Qn [P̂ ∈ E] = 1 + o(1).
For the upper bound we can again adapt the proof from Theorem 15.9. Alternatively, we can
write the convex set E as an intersection of half spaces. Then we have already solved the problem
for half-spaces {Q : EQ [X] ≥ γ}. The general case follows by the following consequence of
Theorem 15.10: if Q∗ is projection of P onto E1 and Q∗∗ is projection of Q∗ on E2 , then Q∗∗ is
also projection of P onto E1 ∩ E2 :
(
∗∗ D(Q∗ kP) = minQ∈E1 D(QkP)
D(Q kP) = min D(QkP) ⇐
Q∈E1 ∩E2 D(Q∗∗ kQ∗ ) = minQ∈E2 D(QkQ∗ )
(Repeated projection property)
Indeed, by first tilting from P to Q∗ we find
P[P̂ ∈ E1 ∩ E2 ] ≤ exp (−nD(Q∗ kP)) Q∗ [P̂ ∈ E1 ∩ E2 ]
≤ exp (−nD(Q∗ kP)) Q∗ [P̂ ∈ E2 ]
and from here proceed by tilting from Q∗ to Q∗∗ and note that D(Q∗ kP) + D(Q∗∗ kQ∗ ) =
D(Q∗∗ kP).
i i
i i
i i
In this chapter our goal is to determine the achievable region of the exponent pairs (E0 , E1 ) for
the Type-I and Type-II error probabilities. Our strategy is to apply the achievability and (strong)
converse bounds from Chapter 14 in conjunction with the large deviation theory developed in
Chapter 15.
260
i i
i i
i i
ψP (λ)
0 1
λ
E0 = ψP∗ (θ)
E1 = ψP∗ (θ) − θ
slope θ
Figure 16.1 Geometric interpretation of Theorem 16.1 relies on the properties of ψP (λ) and ψP∗ (θ). Note that
ψP (0) = ψP (1) = 0. Moreover, by Theorem 15.6, θ 7→ E0 (θ) is increasing, θ 7→ E1 (θ) is decreasing.
apply under the (milder) conditions of P Q and Q P, we will only present proofs under
the (stronger) condition that log-MGF exists for all λ, following the convention of the previous
chapter. The following result determines the optimal (E0 , E1 )-tradeoff in a parametric form. For a
concrete example, see Exercise III.11 for testing two Gaussians.
Remark 16.1 (Rényi divergence). In Definition 7.22 we defined Rényi divergences Dλ . Note that
ψP (λ) = (λ − 1)Dλ (QkP) = −λD1−λ (PkQ). This provides another explanation that ψP (λ) is
negative for λ between 0 and 1, and the slope at endpoints is: ψP′ (0) = −D(PkQ) and ψP′ (1) =
D(QkP). See also Ex. I.30.
Corollary 16.2 (Bayesian criterion). Fix a prior (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 < π 0 < 1.
Denote the optimal Bayesian (average) error probability by
P∗e (n) ≜ inf π 0 π 1|0 + π 1 π 0|1
PZ|Xn
with exponent
1 1
E ≜ lim log ∗ .
n→∞ n P e ( n)
Then
E = max min(E0 (θ), E1 (θ)) = ψP∗ (0)
θ
i i
i i
i i
262
If |X | = 2 and if the compositions (types) of xn and x̃n are equal (!), the expression is invariant
under λ ↔ 1 − λ and thus from the convexity in λ we conclude that λ = 12 is optimal,2 yielding
E = 1n dB (xn , x̃n ), where
X
n Xq
dB (x , x̃ ) = −
n n
log PY|X (y|xt )PY|X (y|x̃t ) (16.3)
t=1 y∈Y
is known as the Bhattacharyya distance between codewords xn and x̃n . (Compare with the Bhat-
tacharyya coefficient defined after (7.5).) Without the two assumptions stated, dB (·, ·) does not
necessarily give the optimal error exponent. We do, however, always have the bounds, see (14.19):
1
exp (−2dB (xn , x̃n )) ≤ P∗e (xn , x̃n ) ≤ exp (−dB (xn , x̃n )) ,
4
where the upper bound becomes tighter when the joint composition of (xn , x̃n ) and that of (x̃n , xn )
are closer.
Pn
Proof of Theorem 16.1. The idea is to apply the large deviation theory to the iid sum k=1 Tk .
Specifically, let’s rewrite the achievability and converse bounds from Chapter 14 in terms of T:
1
In short, this is because the optimal tilting parameter λ does not need to be chosen differently for different values of
(xt , x̃t ).
2 1
For another example where λ = 2
achieves the optimal in the Chernoff information, see Exercise III.19.
i i
i i
i i
• Achievability (Neyman-Pearson): Applying Theorem 14.10 with τ = −nθ, the LRT achieves
the following
" n # " n #
X X
π 1|0 = P Tk ≥ nθ π 0|1 = Q Tk < nθ (16.4)
k=1 k=1
• Converse (strong): Applying Theorem 14.9 with γ = exp (−nθ), any achievable π 1|0 and π 0|1
satisfy
" n #
X
π 1|0 + exp (−nθ) π 0|1 ≥ P T k ≥ nθ . (16.5)
k=1
For achievability, applying the nonasymptotic large deviations upper bound in Theorem 15.9
(and Theorem 15.11) to (16.4), we obtain that for any n,
" n #
X
π 1|0 = P Tk ≥ nθ ≤ exp (−nψP∗ (θ)) , for θ ≥ EP T = −D(PkQ)
k=1
" #
Xn
π 0|1 = Q Tk < nθ ≤ exp −nψQ∗ (θ) , for θ ≤ EQ T = D(QkP)
k=1
Theorem 16.3. (a) The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as
E0 = D(Pλ kP), E1 = D(Pλ kQ) (16.6)
i i
i i
i i
264
where the distribution Pλ 3 is tilting of P along T given in (15.27), which moves from P0 = P
to P1 = Q as λ ranges from 0 to 1:
dPλ = (dP)1−λ (dQ)λ exp{−ψP (λ)}.
(b) Yet another characterization of the boundary is
E∗1 (E0 ) = min D(Q′ kQ) , 0 ≤ E0 ≤ D(QkP) (16.7)
Q′ :D(Q′ ∥P)≤E0
Remark 16.3. The interesting consequence of this point of view is that it also suggests how
typical error event looks like. Namely, consider an optimal hypothesis test achieving the pair
of exponents (E0 , E1 ). Then conditioned on the error event (under either P or Q) we have that
the empirical distribution of the sample will be close to Pλ . For example, if P = Bin(m, p) and
Q = Bin(m, q), then the typical error event will correspond to a sample whose empirical distribu-
tion P̂n is approximately Bin(m, r) for some r = r(p, q, λ) ∈ (p, q), and not any other distribution
on {0, . . . , m}.
Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) ≜ EPλ [T], then
from (15.8) we have
D(Pλ kP) = ψP∗ (θ) ,
whereas
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP) − EPλ [T] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (15.7) we know that as λ ranges in [0, 1] the mean θ = EPλ [T] ranges from −D(PkQ)
to D(QkP).
To prove the second claim (16.7), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
dPλ = exp{λT − ψP (λ)} · dP
dQλ′ = exp{λ′ T − ψQ (λ′ )} · dQ
are in fact the same family with Qλ′ = Pλ′ +1 .
Now, suppose that Q∗ achieves the minimum in (16.7) and that Q∗ 6= Q, Q∗ 6= P (these cases
should be verified separately). Note that we have not shown that this minimum is achieved, but it
will be clear that our argument can be extended to the case of when Q′n is a sequence achieving
the infimum. Then, on one hand, obviously
D(Q∗ kQ) = min D(Q′ kQ) ≤ D(PkQ)
Q′ :D(Q′ ∥P)≤E0
3
This is called a geometric mixture of P and Q.
i i
i i
i i
Therefore,
dQ∗ dQ
EQ∗ [T] = EQ∗ log = D(Q∗ kP) − D(Q∗ kQ) ∈ [−D(PkQ), D(QkP)] . (16.8)
dP dQ∗
Next, we have from Corollary 15.12 that there exists a unique Pλ with the following three
properties:4
Remark 16.4. A geometric interpretation of (16.7) is given in Fig. 16.2: As λ increases from 0 to
1, or equivalently, θ increases from −D(PkQ) to D(QkP), the optimal distribution Pλ traverses
down the dotted path from P and Q. Note that there are many ways to interpolate between P and
Q, e.g., by taking their (arithmetic) mixture (1 − λ)P + λQ. In contrast, Pλ is a geometric mixture
of P and Q, and this special path is in essence a geodesic connecting P to Q and the exponents
E0 and E1 measures its respective distances to P and Q. Unlike Riemannian geometry, though,
here the sum of distances to the two endpoints from an intermediate Pλ actually varies along the
geodesic.
4
A subtlety: In Corollary 15.12 we ask EQ∗ [T] ∈ (A, B). But A, B – the essential range of T – depend on the distribution
under which the essential range is computed, cf. (15.23). Fortunately, we have Q P and P Q, so the essential range
is the same under both P and Q. And furthermore (16.8) implies that EQ∗ [T] ∈ (A, B).
i i
i i
i i
266
E1
P
D(PkQ) Pλ
space of distributions
D(Pλ kQ)
E0
0 D(Pλ kP) D(QkP)
Figure 16.2 Geometric interpretation of (16.7). Here the shaded circle represents {Q′ : D(Q′ kP) ≤ E0 }, the
KL divergence “ball” of radius E0 centered at P. The optimal E∗1 (E0 ) in (16.7) is given by the divergence from
Q to the closest element of this ball, attained by some tilted distribution Pλ . The tilted family Pλ is the
geodesic traversing from P to Q as λ increases from 0 to 1.
i i
i i
i i
So far we have always been working with a fixed number of observations n. However, different
realizations of Xn are informative to different levels, i.e. under some realizations we are very certain
about declaring the true hypothesis, whereas some other realizations leave us more doubtful. In
the fixed n setting, the tester is forced to take a guess in the latter case. In the sequential setting,
pioneered by Wald [329], the tester is allowed to request more samples. We show in this section that
the optimal test in this setting is something known as sequential probability ratio test (SPRT) [331].
It will also be shown that the resulting tradeoff between the exponents E0 and E1 is much improved
in the sequential setting.
We start with the concept of a sequential test. Informally, at each time t, upon receiving the
observation Xt , a sequential test either declares H0 , declares H1 , or requests one more observation.
The rigorous definition is as follows: a sequential hypothesis test consists of (a) a stopping time
τ with respect to the filtration {Fk , k ∈ Z+ }, where Fk ≜ σ{X1 , . . . , Xn } is generated by the first
n observations; and (b) a random variable (decision) Z ∈ {0, 1} measurable with respect to Fτ .
Each sequential test is associated with the following performance metrics:
α = P[Z = 0], β = Q [ Z = 0] (16.9)
l0 = EP [τ ], l1 = EQ [τ ] (16.10)
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 12 δ−1 . Since P 6⊥ Q,
we also have Pn 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both
hypotheses. However, an obvious sequential test (wait for the first appearance of ±1) achieves zero
error probability with finite average number of samples (2) under both hypotheses. This advantage
is also very clear in the achievable error exponents as Fig. 16.3 shows.
The following result is due to [331] (for the special case of E0 = D(QkP) and E1 = D(PkQ))
and [245] (for the generalization).
5
This assumption is satisfied for example for a pair of full support discrete distributions on finite alphabets.
i i
i i
i i
268
E1
Sequential test
D(PkQ)
E0
0 D(QkP)
Figure 16.3 Tradeoff between Type-I and Type-II error exponents. The bottom curve corresponds to optimal
tests with fixed sample size (Theorem 16.1) and the upper curve to optimal sequential tests (Theorem 16.4).
0, if Sτ ≥ B
Z=
1, if Sτ < −A
where
X
n
P(Xk )
Sn = log
Q(Xk )
k=1
Remark 16.5 (Interpretation of SPRT). Under the usual setup of hypothesis testing, we collect
a sample of n iid observations, evaluate the LLR Sn , and compare it to the threshold to give the
optimal test. Under the sequential setup, {Sn : n ≥ 1} is a random walk, which has positive
(resp. negative) drift D(PkQ) (resp. −D(QkP)) under the null (resp. alternative)! SPRT simply
declares P if the random walk crosses the upper boundary B, or Q if the random walk crosses the
upper boundary −A. See Fig. 16.4 for an illustration.
i i
i i
i i
Sn
0 n
τ
−A
Figure 16.4 Illustration of the SPRT(A, B) test. Here, at the stopping time τ , the LLR process Sn reaches B
before reaching −A and the decision is Z = 1.
EQ [Sτ ] = − EQ [τ ]D(QkP) .
Mn = Sn − nD(PkQ)
M̃n ≜ Mmin(τ,n)
E[M̃n ] = E[M̃0 ] = 0 ,
or, equivalently,
This holds for every n ≥ 0. From boundedness assumption we have |Sn | ≤ nc and thus
|Smin(n,τ ) | ≤ nτ , implying that collection {Smin(n,τ ) , n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (16.12) and interchange expectation and limit safely to conclude (16.11).
i i
i i
i i
270
By monotone convergence theorem applied to the both sides of (16.13) it is then sufficient to
verify that for every n
Next, we denote τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas the expectation of τ0 can
be bounded using (16.11) as:
EP [τ ] ≤ EP [τ0 ] = EP [Sτ0 ] ≤ B + c0 ,
S τ 0 ≤ B + c0 .
Thus
B + c0 B
l0 = EP [τ ] ≤ EP [τ0 ] ≤ ≈ . for large B
D(PkQ) D(PkQ)
Similarly we can show π 0|1 ≤ e−B and l1 ≤ D(QA∥P) for large A. Take B = l0 D(PkQ), A =
l1 D(QkP), this shows the achievability.
Converse: Assume (E0 , E1 ) achievable for large l0 , l1 . Recall from Section 4.5* that
D(PFτ kQFτ ) denotes the divergence between P and Q when viewed as measures on σ -algebra
Fτ . We apply the data processing inequality for divergence to obtain:
(16.11)
d(P(Z = 1)kQ(Z = 1)) ≤ D(PFτ kQFτ ) = EP [Sτ ] = EP [τ ]D(PkQ) = l0 D(PkQ),
i i
i i
i i
Notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) = l1 E1 (1 + o(1)), therefore
l1 E1 ≤ (1 + o(1))l0 D(PkQ). Similarly we can show that l0 E0 ≤ (1 + o(1))l1 D(QkP). Thus taking
ℓ0 , ℓ1 → ∞ we conclude
E0 E1 ≤ D(PkQ)D(QkP) .
Unlike testing simple hypotheses for which Neyman-Pearson’s test is optimal (Theorem 14.10), in
general there is no explicit description for the optimal test of composite hypotheses (cf. (32.24)).
The popular choice is a generalized likelihood-ratio test (GLRT) that proposes to threshold the
GLR
supP∈P P⊗n (Xn )
T( X n ) = .
supQ∈Q Q⊗n (Xn )
For examples and counterexamples of the optimality of GLRT in terms of error exponents, see,
e.g. [346].
Sometimes the families P and Q are small balls (in some metric) surrounding the center dis-
tributions P and Q, respectively. In this case, testing P against Q is known as robust hypothesis
testing (since the test is robust to small deviations of the data distribution). There is a notable
finite-sample optimality result in this case due to Huber [161], Exercise III.20. Asymptotically, it
turns out that if P and Q are separated in the Hellinger distance, then the probability of error can
be made exponentially small: see Theorem 32.7.
Sometimes in the setting of composite testing the distance between P and Q is zero. This is
the case, for example, for the most famous setting of a Student t-test: P = {N (0, σ 2 ) : σ 2 > 0},
Q = {N ( μ, σ 2 ) : μ 6= 0, σ 2 > 0}. It is clear that in this case there is no way to construct a test with
α + β < 1, since the data distribution under H1 can be arbitrarily close to P0 . Here, thus, instead
of minimizing worst-case β , one tries to find a test statistic T(X1 , . . . , Xn ) which is a) pivotal in
i i
i i
i i
272
the sense that its distribution under the H0 is (asymptotically) independent of the choice P0 ∈ P ;
and b) consistent, in the sense that T → ∞ as n → ∞ under any Q ∈ Q. Optimality questions are
studied by minimizing β as a function of Q ∈ Q (known as the power curve). The uniform most
powerful tests are the gold standard in this area [197, Chapter 3], although besides a few classical
settings (such as the one above) their existence is unknown.
In other settings, known as the goodness-of-fit testing [197, Chapter 14], instead of relatively
low-complexity parametric families P and Q one is interested in a giant set of alternatives Q. For
i.i.d. i.i.d.
example, the simplest setting is to distinguish H0 : Xi ∼ P0 vs H1 : Xi ∼ Q, TV(P0 , Q) > δ . If
δ = 0, then in this case again the worst case α + β = 1 for any test and one may only ask for a
statistic T(Xn ) with a known distribution under H0 and T → ∞ for any Q in the alternative. For
δ > 0 the problem is known as nonparametric detection [165, 166] and related to that of property
testing [141].
i i
i i
i i
III.1 Let P0 and P1 be distributions on X . Recall that the region of achievable pairs (P0 [Z =
0], P1 [Z = 0]) via randomized tests PZ|X : X → {0, 1} is denoted
[
R(P0 , P1 ) ≜ (P0 [Z = 0], P1 [Z = 0]) ⊆ [0, 1]2 .
PZ|X
PY|X
Let PY|X : X → Y be a Markov kernel, which maps Pj to Qj according to Pj −−→ Qj , j =
0, 1. Compare the regions R(P0 , P1 ) and R(Q0 , Q1 ). What does this say about βα (P0 , P1 ) vs.
β α ( Q0 , Q1 ) ?
Comment: This is the most general form of data-processing, all the other ones (divergence,
mutual information, f-divergence, total-variation, Rényi-divergence, etc) are corollaries.
Bonus: Prove that R(P0 , P1 ) ⊃ R(Q0 , Q1 ) implies existence of some PY|X carrying Pj to Qj
(“inclusion of R is equivalent to degradation”).
III.2 Recall the total variation distance
Explain how to read the value TV(P, Q) from the region R(P, Q). Does it equal half the
maximal vertical segment in R(P, Q)?
(b) (Bayesian criteria) Fix a prior π = (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 < π 0 < 1. Denote
the optimal average error probability by
1
Pe = (1 − TV(P, Q)).
2
Find the optimal test.
(c) Find the optimal test for general prior π (not necessarily equiprobable).
(d) Why is it always sufficient to focus on deteministic test in order to minimize the Bayesian
error probability?
i i
i i
i i
βα (P, Q) ≜ min Q[ Z = 0] = α 2 .
PZ|X :P[Z=0]≥α
III.5 We have shown that for testing iid products and any fixed ϵ ∈ (0, 1):
which is equivalent to Stein’s lemma. Show furthermore that assuming V(PkQ) < ∞ we have
p √
log β1−ϵ (Pn , Qn ) = −nD(PkQ) + nV(PkQ)Q−1 (ϵ) + o( n) , (III.2)
R ∞
where Q−1 (·) is the functional inverse of Q(x) = x √12π e−t /2 dt and
2
dP
V(PkQ) ≜ VarP log .
dQ
III.6 (Inverse Donsker-Varadhan) Verify for positive discrete random variables X that,
for some F ′ .
Conclude that in the case when PYj = P and
X
n
F = QYn : EQ f(Yj ) ≥ nγ
j=1
i i
i i
i i
we have (single-letterization)
6
In this exercise and the next, you may assume all log’s and exp’s are to the natural basis and that MGF exists for all λ.
i i
i i
i i
i i
i i
i i
(b) Consider the conditional distribution P̃Xn = PXn |π n ∈E . Show that P̃Xn ∈ En .
(c) Prove the following nonasymptotic upper bound:
P(π n ∈ E) ≤ exp − n inf D(QkP) , ∀n.
Q∈E
(Hint: Use data processing as in the proof of the large deviations theorem.)
(e) Conclude (III.7).
III.14 Error exponents of data compression. Let Xn be iid according to P on a finite alphabet X . Let
ϵ∗n (R) denote the minimal probability of error achieved by fixed-length compressors and decom-
pressors for Xn of compression rate R. We have learned that the if R < H(P), then ϵ∗n (R) tends
to zero. The goal of this exercise is to show it converges exponentially fast and find the best
exponent.
(a) For any sequence xn , denote by π (xn ) its empirical distribution and by Ĥ(xn ) its empirical
entropy, i.e., the entropy of the empirical distribution.7 For each R > 0, define the set
T = {xn : Ĥ(xn ) < R}. Show that
Specify the achievable scheme. (Hint: Use Sanov’s theorem in Exercise III.13.)
(c) Prove that the above exponent is asymptotically optimal:
1 1
lim sup log ∗ ≤ inf D(QkP).
n→∞ n ϵn (R) Q:H(Q)>R
(Hint: Recall that any compression scheme for memoryless source with rate below the
entropy fails with probability tending to one. Use data processing inequality. )
III.15 Denote by N( μ, σ 2 ) the one-dimensional Gaussian distribution with mean μ and variance σ 2 .
Let a > 0. All logarithms below are natural.
(a) Show that
a2
min D(QkN(0, 1)) = .
Q:EQ [X]≥a 2
(b) Let X1 , . . . , Xn be drawn iid from N(0, 1). Using part (a) show that
1 1 a2
lim log = . (III.8)
n→∞ n P [X1 + · · · + Xn ≥ na] 2
7
For example, for the binary sequence xn = (010110), the empirical distribution is Ber(1/2) and the empirical entropy is
1 bit.
i i
i i
i i
R∞
(c) Let Φ̄(x) = x √12π e−t /2 dt denote the complementary CDF of the standard Gaussian
2
distribution. Express P [X1 + · · · + Xn ≥ na] in terms of the Φ̄ function. Using the fact that
Φ̄(x) = e−x /2+o(x ) as x → ∞, reprove (III.8).
2 2
(d) Let Y be a continuous random variable with zero mean and unit variance. Show that
III.16 (Gibbs distribution) Let X be finite alphabet, f : X → R some function and Emin = min f(x).
(a) Using I-projection show that for any E ≥ Emin the solution of
β 0 ( E0 ) = β 1 ( E1 ) (III.10)
Hint: Let h = f1{ρ ≤ exp(L + t/2)}. Use triangle inequality and bound E |In (h) − Eν h|,
E |In (h) − In (f)|, | Eν f − Eν h| separately.
i i
i i
i i
Pμ (log ρ ≤ L − t/2)
P(In (1) ≥ 1 − δ)| ≤ exp(−t/2) + ,
1−δ
for all δ ∈ (0, 1), where 1 is the constant-1 function.
Hint: Divide into two cases depending on whether max1≤i≤n ρ(Xi ) ≤ exp(L − t/2).
This shows that a sample of size exp(D(νk μ) + Θ(1)) is both necessary and sufficient for
accurate estimation by importance sampling.
III.18 M-ary hypothesis testing.8 The following result [194] generalizes Corollary 16.2 on the best
average probability of error for testing two hypotheses to multiple hypotheses.
Fix a collection of distributions {P1 , . . . , PM }. Conditioned on θ, which takes value i with prob-
i.i.d.
ability π i > 0 for i = 1, . . . , M, let X1 , . . . , Xn ∼ Pθ . Denote the optimal average probability of
error by p∗n = inf P[θ̂ 6= θ], where the infimum is taken over all decision rules θ̂ = θ̂(X1 , . . . , Xn ).
(a) Show that
1 1
lim log ∗ = min C(Pi , Pj ), (III.12)
n→∞ n pn 1≤i<j≤M
8
Not to be confused with multiple testing in the statistics literature, which refers to testing multiple pairs of binary
hypotheses simultaneously.
i i
i i
i i
III.20 (Stochastic dominance and robust LRT) Let P0 , P1 be two families of probability distributions
on X . Suppose that there is a least favorable pair (LFP) (Q0 , Q1 ) ∈ P0 × P1 such that
Q0 [π > t] ≥ Q′0 [π > t]
Q1 [π > t] ≤ Q′1 [π > t],
for all t ≥ 0 and Q′i ∈ Pi , where π = dQ1 /dQ0 . Prove that (Q0 , Q1 ) simultaneously minimizes
all f-divergences between P0 and P1 , i.e.
Df (Q1 kQ0 ) ≤ Df (Q′1 kQ′0 ) ∀Q′0 ∈ P0 , Q′1 ∈ P1 . (III.13)
Hint: Interpolate between (Q0 , Q1 ) and (Q′0 , Q′1 ) and differentiate.
Remark: For the case of two TV-balls, i.e. Pi = {Q : TV(Q, Pi ) ≤ ϵ}, the existence of LFP is
shown in [161], in which case π = min(c′ , max(c′′ , dP ′ ′′
dP1 )) for some 0 ≤ c < c ≤ ∞ giving
0
i i
i i
i i
Part IV
Channel coding
i i
i i
i i
i i
i i
i i
283
In this Part we study a new type of problem known as “channel coding”. Historically, this
was the first application area of information theory that lead to widely recognized and surprising
results [277]. To explain the relation of this Part to others, let us revisit what problems we have
studied so far.
In Part II our objective was data compression. The main object there was a single distribution
PX and the fundamental limit E[ℓ(f∗ (X))] – the minimal compression length. The main result was
connection between the fundamental limit and an information quantity, that we can summarize as
E[ℓ(f∗ (X))] ≈ H(X)
In Part III we studied binary hypothesis testing. There the main object was a pair of distributions
(P, Q), the fundamental limit was the Neyman-Pearson curve β1−ϵ (Pn , Qn ) and the main result
i i
i i
i i
Definition 17.1. An M-code for PY|X is an encoder/decoder pair (f, g) of (randomized) functions1
• encoder f : [M] → X
• decoder g : Y → [M] ∪ {e}
In most cases f and g are deterministic functions, in which case we think of them, equivalently,
in terms of codewords, codebooks, and decoding regions (see Fig. 17.1 for an illustration)
c1 b
b
b
D1 b
b b
b cM
b b
b
DM
Figure 17.1 When X = Y, the decoding regions can be pictured as a partition of the space, each containing
one codeword.
Given an M-code we can define a probability space, underlying all the subsequent developments
in this Part. For that we chain the three objects – message W, the encoder and the decoder – together
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ|Y .
284
i i
i i
i i
where we set W ∼ Unif([M]). In the case of discrete spaces, we can explicitly write out the joint
distribution of these variables as follows:
1
(general) PW,X,Y,Ŵ (m, a, b, m̂) = P (a|m)PY|X (b|a)PŴ|Y (m̂|b)
M X|W
1
(deterministic f, g) PW,X,Y,Ŵ (m, cm , b, m̂) = PY|X (b|cm )1{b ∈ Dm̂ }
M
Throughout these sections, these random variables will be referred to by their traditional names:
W – original (true) message, X - (induced) channel input, Y - channel output and Ŵ - decoded
message.
Although any pair (f, g) is called an M-code, in reality we are only interested in those that satisfy
certain “error-correcting” properties. To assess their quality we define the following performance
metrics:
Note that, clearlym, Pe ≤ Pe,max . Therefore, requirement of the small maximum error probability
is a more stringent criterion, and offers uniform protection for all codewords. Some codes (such
as linear codes, see Section 18.6) have the property of Pe = Pe,max by construction, but generally
these two metrics could be very different.
Having defined the concept of an M-code and the performance metrics, we can finally define
the fundamental limits for a given channel PY|X .
Definition 17.2. A code (f, g) is an (M, ϵ)-code for PY|X if Pe (f, g) ≤ ϵ. Similarly, an (M, ϵ)max -
code must satisfy Pe,max ≤ ϵ. The fundamental limits of channel coding are defined as
The argument PY|X will be omitted when PY|X is clear from the context.
In other words, the quantity log2 M∗ (ϵ) gives the maximum number of bits that we can
push through a noisy transformation PY|X , while still guaranteeing the error probability in the
appropriate sense to be at most ϵ.
Example 17.1. The channel BSC⊗ n
δ (recall from Example 3.5 that BSC stands for binary symmet-
ric channel) acts between X = {0, 1}n and Y = {0, 1}n , where the input Xn is contaminated by
i.i.d.
additive noise Zn ∼ Ber(δ) independent of Xn , resulting in the channel output
Yn = Xn ⊕ Zn .
i i
i i
i i
286
0 1 0 0 1 1 0 0 1 1
PY n |X n
1 1 0 1 0 1 0 0 0 1
In the next section we discuss coding for the BSC channel in more detail.
0 0 1 0
Decoding can be done by taking a majority vote inside each ℓ-block. Thus, each data bit is decoded
with probability of bit error Pb = P[Binom(l, δ) > l/2]. However, the probability of block error of
this scheme is Pe ≤ kP[Binom(l, δ) > l/2]. (This bound is essentially tight in the current regime).
Consequently, to satisfy Pe ≤ 10−3 we must solve for k and ℓ satisfying kl ≤ n = 1000 and also
i i
i i
i i
This gives l = 21, k = 47 bits. So we can see that using repetition coding we can send 47 data
bits by using 1000 channel uses.
Repetition coding is a natural idea. It also has a very natural tradeoff: if you want better reliabil-
ity, then the number ℓ needs to increase and hence the ratio nk = 1ℓ should drop. Before Shannon’s
groundbreaking work, it was almost universally accepted that this is fundamentally unavoidable:
vanishing error probability should imply vanishing communication rate nk .
Before delving into optimal codes let us offer a glimpse of more sophisticated ways of injecting
redundancy into the channel input n-sequence than simple repetition. For that, consider the so-
called first-order Reed-Muller codes (1, r). We interpret a sequence of r data bits a0 , . . . , ar−1 ∈ Fr2
as a degree-1 polynomial in (r − 1) variables:
X
r− 1
a = (a0 , . . . , ar−1 ) 7→ fa (x) ≜ a i xi + a 0 .
i=1
In order to transmit these r bits of data we simply evaluate fa (·) at all possible values of the variables
xr−1 ∈ Fr2−1 . This code, which maps r bits to 2r−1 bits, has minimum distance dmin = 2r−2 . That
is, for two distinct a 6= a′ the number of positions in which fa and fa′ disagree is at least 2r−2 . In
coding theory notation [n, k, dmin ] we say that the first-order Reed-Muller code (1, 7) is a [64, 7, 32]
code. It can be shown that the optimal decoder for this code achieves over the BSC0.11 ⊗ 64 channel
a probability of error at most 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7 data bits
and occupying 64 bits on the channel) over the BSC⊗ δ
1024
, and still have (by the union bound)
−4 −3
overall probability of block error Pe ≲ 10 < 10 . Thus, with the help of Reed-Muller codes
we can send 7 · 16 = 112 bits in 1024 channel uses, more than doubling that of the repetition code.
Shannon’s noisy channel coding theorem (Theorem 19.9) – a crown jewel of information theory
– tells us that over memoryless channel PYn |Xn = (PY|X )n of blocklength n the fundamental limit
satisfies
as n → ∞ and for arbitrary ϵ ∈ (0, 1). Here C = maxPX1 I(X1 ; Y1 ) is the capacity of the single-letter
channel. In our case of BSC we have
1
C = log 2 − h(δ) ≈ bit ,
2
since the optimal input distribution is uniform (from symmetry) – see Section 19.3. Shannon’s
expansion (17.2) can be used to predict (not completely rigorously, of course, because of the
o(n) residual) that it should be possible to send around 500 bits reliably. As it turns out, for the
blocklength n = 1000 this is not quite possible.
Note that computing M∗ exactly requries iterating over all possible encoders and decoder –
an impossible task even for small values of n. However, there exist rigorous and computation-
ally tractable finite blocklength bounds [239] that demonstrate for our choice of n = 1000, δ =
0.11, ϵ = 10−3 :
i i
i i
i i
288
Thus we can see that Shannon’s prediction is about 20% too optimistic. We will see below some
of such finite-length bounds. Notice, however, that while the guarantee existence of an encoder-
decoder pair achieving a prescribed performance, building an actual f and g implementable with
a modern software/hardware is a different story.
It took about 60 years after Shannon’s discovery of (17.2) to construct practically imple-
mentable codes achieving that performance. The first codes that approach the bounds on log M∗
are called Turbo codes [30] (after the turbocharger engine, where the exhaust is fed back in to
power the engine). This class of codes is known as sparse graph codes, of which the low-density
parity check (LDPC) codes invented by Gallager are particularly well studied [264]. As a rule of
thumb, these codes typically approach 80 . . . 90% of log M∗ when n ≈ 103 . . . 104 . For shorter
blocklengths in the range of n = 100 . . . 1000 there is an exciting alternative to LDPC codes: the
polar codes of Arıkan [15], which are most typically used together with the list-decoding idea
of Tal and Vardy [301]. And of course, the story is still evolving today as new channel models
become relevant and new hardware possibilities open up.
We wanted to point out a subtle but very important conceptual paradigm shift introduced by
Shannon’s insistence on coding over many (information) bits together. Indeed, consider the sit-
uation discussed above, where we constructed a powerful code with M ≈ 2400 codewords and
n = 1000. Now, one might imagine this code as a constellation of 2400 points carefully arranged
inside a hypercube {0, 1}1000 to guarantee some degree of separation between them, cf. (17.6).
Next, suppose one was using this code every second for the lifetime of the universe (≈ 1018 sec).
Yet, even after this laborious process she will have explored at most 260 different codewords from
among an overwhelmingly large codebook 2400 . So a natural question arises: why did we need
to carefully place all these many codewords if majority of them will never be used by anyone?
The answer is at the heart of the concept of information: to transmit information is to convey a
selection of one element (W) from a collection of possibilities ([M]). The fact that we do not know
which W will be selected forces us to apriori prepare for every one of the possibilities. This simple
idea, proposed in the first paragraph of [277], is now tacitly assumed by everyone, but was one of
the subtle ways in which Shannon revolutionized scientific approach to the study of information
exchange.
Notice that the optimal decoder is deterministic. For the special case of deterministic encoder,
where we can identify the encoder with its image C the minimal (MAP) probability of error for
i i
i i
i i
Consequently, the optimal decoding regions – see Fig. 17.1 – become the Voronoi cells tesselating
the Hamming space {0, 1}n . Similarly, the MAP decoder for the AWGN channel induces a Voronoi
tesselation of Rn – see Section 20.3.
So we have seen that the optimal decoder is without loss of generality can be assumed to be
deterministic. Similarly, we can represent any randomized encoder f as a function of two argu-
ments: the true message W and an external randomness U ⊥ ⊥ W, so that X = f(W, U) where this
time f is a deterministic function. Then we have
which implies that if P[W 6= Ŵ] ≤ ϵ then there must exist some choice u0 such that P[W 6= Ŵ|U =
u0 ] ≤ ϵ. In other words, the fundamental limit M∗ (ϵ) is unchanged if we restrict our attention to
deterministic encoders and decoders only.
Note, however, that neither of the above considerations apply to the maximal probability of
error Pe,max . Indeed, the fundamental limit M∗max (ϵ) does indeed require considering randomized
encoders and decoders. For example, when M = 2 from the decoding point of view we are back to
the setting of binary hypotheses testing in Part III. The optimal decoder (test) that minimizes the
maximal Type-I and II error probability, i.e., max{1 − α, β}, will not be deterministic if max{1 −
α, β} is not achieved at a vertex of the Neyman-Pearson region R(PY|W=1 , PY|W=2 ).
i i
i i
i i
290
Theorem 17.3 (Weak converse). Any (M, ϵ)-code for PY|X satisfies
supPX I(X; Y) + h(ϵ)
log M ≤ ,
1−ϵ
where h(x) = H(Ber(x)) is the binary entropy function.
Proof. This can be derived as a one-line application of Fano’s inequalty (Theorem 6.3), but we
proceed slightly differently. Consider an M-code with probability of error Pe and its corresponding
probability space: W → X → Y → Ŵ. We want to show that this code can be used as a hypothesis
test between distributions PX,Y and PX PY . Indeed, given a pair (X, Y) we can sample (W, Ŵ) from
PW,Ŵ|X,Y = PW|X PŴ|Y and compute the binary value Z = 1{W 6= Ŵ}. (Note that in the most
interesting cases when encoder and decoder are deterministic and the encoder is injective, the value
Z is a deterministic function of (X, Y).) Let us compute performance of this binary hypothesis test
under two hypotheses. First, when (X, Y) ∼ PX PY we have that Ŵ ⊥ ⊥ W ∼ Unif([M]) and therefore:
1
PX PY [Z = 1] = .
M
Second, when (X, Y) ∼ PX,Y then by definition we have
PX,Y [Z = 1] = 1 − Pe .
Thus, we can now apply the data-processing inequality for divergence to conclude: Since W →
X → Y → Ŵ, we have the following chain of inequalities (cf. Fano’s inequality Theorem 6.3):
DPI 1
D(PX,Y kPX PY ) ≥ d(1 − Pe k )
M
≥ −h(P[W 6= Ŵ]) + (1 − Pe ) log M
i i
i i
i i
So far our discussion of channel coding was mostly following the same lines as the M-ary hypothe-
sis testing (HT) in statistics. In this chapter we introduce the key departure: the principal and most
interesting goal in information theory is the design of the encoder f : [M] → X or the codebook
{ci ≜ f(i), i ∈ [M]}. Once the codebook is chosen, the problem indeed becomes that of M-ary HT
and can be tackled by the standard statistical methods. However, the task of choosing the encoder
f has no exact analogs in statistical theory (the closest being design of experiments). Each f gives
rise to a different HT problem and the goal is to choose these M hypotheses PX|c1 , . . . , PX|cM to
ensure maximal testability. It turns out that the problem of choosing a good f will be much sim-
plified if we adopt a suboptimal way of testing M-ary HT. Namely, roughly speaking we will run
M binary HTs testing PY|X=cm against PY , which tries to distinguish the channel output induced by
the message m from an “average background noise” PY . An optimal such test, as we know from
Neyman-Pearson (Theorem 14.10), thresholds the following quantity
PY|X=x
log
PY
This explains the central role played by the information density (see below) in these achievability
bounds.
In this chapter it will be convenient to introduce the following independent pairs (X, Y) ⊥
⊥ (X, Y)
with their joint distribution given by:
We will often call X the sent codeword and X̄ the unsent codeword.
291
i i
i i
i i
292
Definition 18.1 (Information density). Let PX,Y μ and PX PY μ for some dominating measure
μ, and denote by f(x, y) = dPdμX,Y and f̄(x, y) = dPdμ
X PY
the Radon-Nikodym derivatives of PX,Y and
PX PY with respect to μ, respectively. Then recalling the Log definition (2.10) we set
log ff̄((xx,,yy)) , f(x, y) > 0, f̄(x, y) > 0
f(x, y) +∞, f(x, y) > 0, f̄(x, y) = 0
iPX,Y (x; y) ≜ Log = (18.2)
f̄(x, y) −∞, f(x, y) = 0, f̄(x, y) > 0
0, f(x, y) = f̄(x, y) = 0 ,
Note an important observation: (18.3) holds regardless of the input distribution PX used for the def-
PM
inition of i(x; y), in particular we do not have to use the code-induced distribution PX = M1 i=1 δci .
However, if we are to threshold information density, different choices of PX will result in different
decoders, so we need to justify the choice of PX .
To that end, recall that to distinguish between two codewords ci and cj , one can apply (as we
P
learned in Part III for binary HT) the likelihood ratio test, namely thresholding the LLR log PYY||XX=
=c
ci
.
j
As we explained at the beginning of this Part, a (possibly suboptimal) approach in M-ary HT
is to run binary tests by thresholding each information density i(ci ; y). This, loosely speaking,
i i
i i
i i
evaluates the likelihood of ci against the average distribution of the other M − 1 codewords, which
1
P
we approximate by PY (as opposed to the more precise form M− 1 j̸=i PY|X=cj ). Putting these ideas
together we can propose the decoder as
where λ is a threshold and PX is judiciously chosen (to maximize I(X; Y) as we will see soon).
We proceed to show some elementary properties of the information density. The next result
explains the name “information density”1
Proposition 18.2. The expectation E[i(X; Y)] is well-defined and non-negative (but possibly
infinite). In any case, we have I(X; Y) = E[i(X; Y)].
Proof. This is follows from (2.12) and the definition of i(x; y) as log-ratio.
Being defined as log-likelihood, information density possesses the standard properties of the
latter, cf. Theorem 14.5. However, because its defined in terms of two variables (X, Y), there are
also very useful conditional expectation versions. To illustrate the meaning of the next proposition,
let us consider the case of discrete X, Y and PX,Y PX PY . Then we have for every x:
X X
f(x, y)PX (x)PY (y) = f(x, y) exp{−i(x; y)}PX,Y (x, y) .
y y
E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = E[f+ (X, Y) exp{−i(X; Y)}|X = x] (18.5)
Proof. The first part (18.4) is simply a restatement of (14.5). For the second part, let us define
a(x) ≜ E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x], b(x) ≜ E[f+ (X, Y) exp{−i(X; Y)}|X = x]
We first additionally assume that f is bounded. Fix ϵ > 0 and denote Sϵ = {x : a(x) ≥ b(x) + ϵ}.
As ϵ → 0 we have Sϵ % {x : a(x) > b(x)} and thus if we show PX [Sϵ ] = 0 this will imply that
a(x) ≤ b(x) for PX -a.e. x. The symmetric argument shows b(x) ≤ a(x) and completes the proof
of the equality.
1
Still an unfortunate name for a quantity that can be negative, though.
i i
i i
i i
294
To show PX [Sϵ ] = 0 let us apply (18.4) to the function f(x, y) = f+ (x, y)1{x ∈ Sϵ }. Then we get
E[f+ (X, Y)1{X ∈ Sϵ } exp{−i(X; Y)}] = E[f+ (X̄, Y)1{i(X̄; Y) > −∞}1{X ∈ Sϵ }] .
Let us re-express both sides of this equality by taking the conditional expectations over Y to get:
E[b(X)1{X ∈ Sϵ }] = E[a(X̄)1{X̄ ∈ Sϵ }] .
But from the definition of Sϵ we have
E[b(X)1{X ∈ Sϵ }] ≥ E[(b(X̄) + ϵ)1{X̄ ∈ Sϵ }] .
(d)
Recall that X = X̄ and hence
E[b(X)1{X ∈ Sϵ }] ≥ E[b(X)1{X ∈ Sϵ }] + ϵPX [Sϵ ] .
Since f+ (and therefore b) was assumed to be bounded we can cancel the common term from both
sides and conclude PX [Sϵ ] = 0 as required.
Finally, to show (18.5) in full generality, given an unbounded f+ we define fn (x, y) =
min(f+ (x, y), n). Since (18.5) holds for fn we can take limit as n → ∞ on both sides of it:
lim E[fn (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = lim E[fn (X, Y) exp{−i(X; Y)}|X = x]
n→∞ n→∞
By the monotone convergence theorem (for conditional expectations!) we can take the limits inside
the expectations to conclude the proof.
i i
i i
i i
and no tools for constructing them available to Shannon. So facing the problem of understanding
if error-correction is even possible, Shannon decided to check if placing codewords randomly
in space will somehow result in favorable geometric arrangement. To everyone’s astonishment,
which is still producing aftershocks today, this method not only produced reasonable codes, but
in fact turned out to be optimal asymptotically (and almost-optimal non-asymptotically [239]).
We also remark that the method of proving existence of certain combinatorial objects by random
selection is known as Erdös’s probabilistic method [10], which Shannon apparently discovered
independently and, perhaps, earlier.
Theorem 18.5 (Shannon’s achievability bound). Fix a channel PY|X and an arbitrary input
distribution PX . Then for every τ > 0 there exists an (M, ϵ)-code with
ϵ ≤ P[i(X; Y) ≤ log M + τ ] + exp(−τ ). (18.9)
Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP and is equiv-
alent to maximizing information density, cf. (18.3). The step of maximizing the i(cm ; Y) makes
analyzing the error probability difficult. Similar to what we did in almost loss compression, cf. The-
orem 11.6, the first important step for showing the achievability bound is to consider a suboptimal
decoder. In Shannon’s bound, we consider a threshold-based suboptimal decoder g(y) as follows:
m, ∃! cm s.t. i(cm ; y) ≥ log M + τ
g ( y) = (18.10)
e, o.w.
In words, decoder g reports m as decoded message if and only if codeword cm is a unique one
with information density exceeding the threshold log M + τ . If there are multiple or none such
codewords, then decoder outputs a special value of e, which always results in error since W 6= e
ever. (We could have decreased probability of error slightly by allowing the decoder to instead
output a random message, or to choose any one of the messages exceeding the threshold, or any
other clever ideas. The point, however, is that even the simplistic resolution of outputting e already
achieves all qualitative goals, while simplifying the analysis considerably.)
For a given codebook (c1 , . . . , cM ), the error probability is:
Pe (c1 , . . . , cM ) = P[{i(cW ; Y) ≤ log M + τ } ∪ {∃m 6= W, i(cm ; Y) > log M + τ }]
where W is uniform on [M] and the probability space is as in (17.1).
The second (and most ingenious) step proposed by Shannon was to forego the complicated
discrete optimization of the codebook. His proposal is to generate the codebook (c1 , . . . , cM ) ran-
domly with cm ∼ PX i.i.d. for m ∈ [M] and then try to reason about the average E[Pe (c1 , . . . , cM )].
By symmetry, this averaged error probability over all possible codebooks is unchanged if we con-
dition on W = 1. Considering also the random variables (X, Y, X̄) as in (18.1), we get the following
chain:
E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y) > log M + τ }|W = 1]
i i
i i
i i
296
X
M
≤ P[i(c1 ; Y) ≤ log M + τ |W = 1] + P[i(cm̄ ; Y) > log M + τ |W = 1] (union bound)
m̄=2
( a)
= P [i(X; Y) ≤ log M + τ ] + (M − 1)P i(X; Y) > log M + τ
≤ P [i(X; Y) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 18.4)
≤ P [i(X; Y) ≤ log M + τ ] + exp(−τ ) ,
where the crucial step (a) follows from the fact that given W = 1 and m̄ 6= 1 we have
d
(c1 , cm̄ , Y) = (X, X̄, Y)
Remark 18.2 (Joint typicality). Shortly in Chapter 19, we will apply this theorem for the case
of PX = P⊗ n ⊗n
X1 (the iid input) and PY|X = PY1 |X1 (the memoryless channel). Traditionally, cf. [81],
decoders in such settings were defined with the help of so called “joint typicality”. Those decoders
given y = yn search for the codeword xn (both of which are an n-letter vectors) such that the
empirical joint distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PX1 ,Y1 , where
1
P̂xn ,yn (a, b) = · |{j ∈ [n] : xj = a, yj = b}|
n
is the joint empirical distribution of (xn , yn ). This definition is used for the case when random
coding is done with cj ∼ uniform on the type class {xn : P̂xn ≈ PX }. Another alternative, “entropic
Pn
typicality”, cf. [76], is to search for a codeword with j=1 log PX ,Y 1(xj ,yj ) ≈ H(X, Y). We think of
1 1
our requirement, {i(xn ; yn ) ≥ nγ1 }, as a version of “joint typicality” that is applicable to much
wider generality of channels (not necessarily over product alphabets, or memoryless).
Theorem 18.6 (DT bound). Fix a channel PY|X and an arbitrary input distribution PX . Then for
every τ > 0 there exists an (M, ϵ)-code with
M − 1 +
ϵ ≤ E exp − i(X; Y) − log (18.11)
2
i i
i i
i i
Setting Ŵ = g(Y) we note that given a codebook {c1 , . . . , cM }, we have by union bound
P[Ŵ 6= j|W = j] = P[i(cj ; Y) ≤ γ|W = j] + P[i(cj ; Y) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y) > γ]
j−1
X
≤ P[i(cj ; Y) ≤ γ|W = j] + P[i(ck ; Y) > γ|W = j].
k=1
Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:
1 X
M
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1
1 X
j−1
M X
≤ P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
M
j=1 k=1
M−1
= P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
2
M−1
= P[i(X; Y) ≤ γ] + E[exp(−i(X; Y))1 {i(X; Y) > γ}] (by (18.4))
2
h M−1 i
= E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
2
To optimize over γ , note the simple observation that U1E + V1Ec ≥ min{U, V}, with equal-
ity iff U ≥ V on E. Therefore for any x, y, 1[i(x; y) ≤ γ] + M− 1 −i(x;y)
2 e 1[i(x; y) > γ] ≥
M−1 −i(x;y) M−1
min(1, 2 e ), achieved by γ = log 2 regardless of x, y. Thus, we continue the bounding
as follows
h M−1 i
inf E[Pe (c1 , . . . , cM )] ≤ inf E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
γ γ 2
h M−1 i
= E min 1, exp(−i(X; Y))
2
M − 1 +
= E exp − i(X; Y) − log .
2
H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M−1
prior prob.: π 0 = , π1 = .
M+1 M+1
i i
i i
i i
298
Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent code-
word. As we know from binary hypothesis testing, the best threshold for the likelihood ratio test
(minimizing the weighted probability of error) is log ππ 10 , as we indeed found out.
One of the immediate benefits of Theorem 18.6 compared to Theorem 18.5 is precisely the fact
that we do not need to perform a cumbersome minimization over τ in (18.9) to get the minimum
upper bound in Theorem 18.5. Nevertheless, it can be shown that the DT bound is stronger than
Shannon’s bound with optimized τ . See also Exc. IV.30.
Finally, we remark (and will develop this below in our treatment of linear codes) that DT bound
and Shannon’s bound both hold without change if we generate {ci } by any other (non-iid) pro-
cedure with a prescribed marginal and pairwise independent codewords – see Theorem 18.13
below.
Theorem 18.7 (Feinstein’s lemma). Fix a channel PY|X and an arbitrary input distribution PX .
Then for every γ > 0 and for every ϵ ∈ (0, 1) there exists an (M, ϵ)max -code with
M ≥ γ(ϵ − P[i(X; Y) < log γ]) (18.12)
Remark 18.4 (Comparison with Shannon’s bound). We can also interpret (18.12) differently: for
any fixed M, there exists an (M, ϵ)max -code that achieves the maximal error probability bounded
as follows:
M
ϵ ≤ P[i(X; Y) < log γ] +
γ
If we take log γ = log M + τ , this gives the bound of exactly the same form as Shannon’s (18.9). It
is rather surprising that two such different methods of proof produced essentially the same bound
(modulo the difference between maximal and average probability of error). We will discuss the
reason for this phenomenon in Section 18.7.
Proof. From the definition of (M, ϵ)max -code, we recall that our goal is to find codewords
c1 , . . . , cM ∈ X and disjoint subsets (decoding regions) D1 , . . . , DM ⊂ Y , s.t.
PY|X (Di |ci ) ≥ 1 − ϵ, ∀i ∈ [M].
Feinstein’s idea is to construct a codebook of size M in a sequential greedy manner.
2
Nevertheless, we should point out that this is not a serious advantage: from any (M, ϵ) code we can extract an
(M′ , ϵ′ )max -subcode with a smaller M′ and larger ϵ′ – see Theorem 19.4.
i i
i i
i i
Ex ≜ {y ∈ Y : i(x; y) ≥ log γ}
Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them
into final decoding regions {Dx }, which will be disjoint. Next, we apply Corollary 18.4 and find
out that there is a set F ⊂ X with two properties: a) PX [F] = 1 and b) for every x ∈ F we have
1
PY ( Ex ) ≤ . (18.13)
γ
We can assume that P[i(X; Y) < log γ] ≤ ϵ, for otherwise the RHS of (18.12) is negative and
there is nothing to prove. We first claim that there exists some c ∈ F such that P[Y ∈ Ec |X =
c] = PY|X (Ec |c) ≥ 1 − ϵ. Indeed, assume (for the sake of contradiction) that ∀c ∈ F, P[i(c; Y) ≥
log γ|X = c] < 1 − ϵ. Note that since PX (F) = 1 we can average this inequality over c ∼ PX . Then
we arrive at P[i(X; Y) ≥ log γ] < 1 − ϵ, which is a contradiction.
With these preparations we construct the codebook in the following way:
1 Pick c1 to be any codeword in F such that PY|X (Ec1 |c1 ) ≥ 1 − ϵ, and set D1 = Ec1 ;
2 Pick c2 to be any codeword in F such that PY|X (Ec2 \D1 |c2 ) ≥ 1 − ϵ, and set D2 = Ec2 \D1 ;
...
−1
3 Pick cM to be any codeword in F such that PY|X (EcM \ ∪M j=1 Dj |cM ] ≥ 1 − ϵ, and set DM =
M−1
EcM \ ∪j=1 Dj .
We stop if cM+1 codeword satisfying the requirement cannot be found. Thus, M is determined by
the stopping condition:
∀c ∈ F, PY|X (Ec \ ∪M
j=1 Dj |c) < 1 − ϵ
Averaging the stopping condition over c ∼ PX (which is permissible due to PX (F) = 1), we
obtain
[
M
P i(X; Y) ≥ log γ and Y 6∈ Dj < 1 − ϵ,
j=1
or, equivalently,
[
M
ϵ < P i(X; Y) < log γ or Y ∈ Dj .
j=1
X
M
≤ P[i(X; Y) < log γ] + PY (Ecj )
j=1
i i
i i
i i
300
M
≤ P[i(X; Y) < log γ] +
γ
where the last step makes use of (18.13).Evidently, this completes the proof.
Theorem 18.8 (RCU bound). Fix a channel PY|X and an arbitrary input distribution PX . Then for
every integer M ≥ 1 there exists an (M, ϵ)-code with
ϵ ≤ E min 1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y , (18.14)
Proof. For a given codebook (c1 , . . . cM ) the average probability of error for the maximum
likelihood decoder, cf. (18.3), is upper bounded by
1 X M [
M
ϵ≤ P {i(cj ; Y) ≥ i(cm ; Y)} |X = cm .
M
m=1 j=1;j̸=m
Note that we do not necessarily have equality here, since the maximum likelihood decoder will
resolves ties (i.e. the cases when multiple codewords maximize information density) in favor of
the correct codeword, whereas in the expression above we pessimistically assume that all ties are
resolved incorrectly. Now, similar to Shannon’s bound in Theorem 18.5 we prove existence of a
i.i.d.
good code by averaging the last expression over cj ∼ PX .
To that end, notice that expectations of each term in the sum coincide (by symmetry). To evalu-
ate this expectation, let us take m = 1 condition on W = 1 and observe that under this conditioning
we have
Y
M
(c1 , Y, c2 , . . . , cM ) ∼ PX,Y PX .
j=2
i i
i i
i i
[
M
= E(x,y)∼PX,Y P {i(cj ; Y) ≥ i(c1 ; Y)} c1 = x, Y = y, W = 1
(a)
j=2
( b)
≤ E min{1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y }
where (a) is just expressing the probability by first conditioning on the values of (c1 , Y); and (b)
corresponds to applying the union bound but capping the result by 1. This completes the proof
of the bound. We note that the step (b) is the essence of the RCU bound and corresponds to the
self-evident fact that for any collection of events Ej we have
X
P[∪Ej ] ≤ min{1, P[Ej ]} .
j
What is makes its application clever is that we first conditioned on (c1 , Y). If we applied the union
bound right from the start without conditioning, the resulting estimate on ϵ would have been much
weaker (in particular, would not have lead to achieving capacity).
It turns out that Shannon’s bound Theorem 18.5 is just a weaking of (18.14) obtained by split-
ting the expectation according to whether or not i(X; Y) ≤ log β and upper bounding min{x, 1}
by 1 when i(X; Y) ≤ log β and by x otherwise. Another such weaking is a famous Gallager’s
bound [132]:
Theorem 18.9 (Gallager’s bound). Fix a channel PY|X , an arbitrary input distribution PX and
ρ ∈ [0, 1]. Then there exists an (M, ϵ) code such that
" 1+ρ #
i ( X̄ ; Y)
ϵ ≤ Mρ E E exp Y (18.15)
1+ρ
Proof. We first notice that by Proposition 18.3 applied with f+ (x, y) = exp{ i1(+ρ
x; y)
} and
interchanged X and Y we have for PY -almost every y
ρ 1 1
E[exp{−i(X; Y) }|Y = y] = E[exp{i(X; Ȳ) }|Ȳ = y] = E[exp{i(X̄; Y) }|Y = y] ,
1+ρ 1+ρ 1+ρ
(18.16)
d
where we also used the fact that (X, Ȳ) = (X̄, Y) under (18.1).
Now, consider the bound (18.14) and replace the min via the bound
min{t, 1} ≤ tρ ∀t ≥ 0 . (18.17)
this results in
ϵ ≤ Mρ E P[i(X̄; Y) > i(X; Y)|X, Y]ρ . (18.18)
We apply the Chernoff bound
1 1
P[i(X̄; Y) > i(X; Y)|X, Y] ≤ exp{− i(X; Y)} E[exp{ i(X̄; Y)}|Y] .
1+ρ 1+ρ
i i
i i
i i
302
The key innovation of Gallager – a step (18.17), which became know as the ρ-trick – cor-
responds to the following version of the union bound: For any events Ej and 0 ≤ ρ ≤ 1 we
have
ρ
X X
P[∪Ej ] ≤ min{1, P[Ej ]} ≤ P [ Ej ] .
j j
Now to understand properly the significance of Gallager’s bound we need to first define the concept
of the memoryless channels (see (19.1) below). For such channels and using the iid inputs, the
expression (18.15) turns, after optimization over ρ, into
ϵ ≤ exp{−nEr (R)} ,
where R = logn M is the rate and Er (R) is the Gallager’s random coding exponent. This shows that
not only the error probability at a fixed rate can be made to vanish, but in fact it can be made to
vanish exponentially fast in the blocklength. We will discuss such exponential estimates in more
detail in Section 22.4*.
Definition 18.10 (Linear codes). Let Fq denote the finite field of cardinality q (cf. Definition 11.7).
Let the input and output space of the channel be X = Y = Fnq . We say a codebook C = {cu : u ∈
Fkq } of size M = qk is a linear code if C is a k-dimensional linear subspace of Fnq .
i i
i i
i i
• Generator matrix G ∈ Fkq×n , so that the codeword for each u ∈ Fkq is given by cu = uG
(row-vector convention) and the codebook C is the row-span of G, denoted by Im(G);
(n−k)×n
• Parity-check matrix H ∈ Fq , so that each codeword c ∈ C satisfies Hc⊤ = 0. Thus C is
the nullspace of H, denoted by Ker(H). We have HG⊤ = 0.
Example 18.1 (Hamming code). The [7, 4, 3]2 Hamming code over F2 is a linear code with the
following generator and parity check matrices:
1 0 0 0 1 1 0
0 1 1 0 1 1 0 0
1 0 0 1 0 1
G=
0
, H= 1 0 1 1 0 1 0
0 1 0 0 1 1
0 1 1 1 0 0 1
0 0 0 1 1 1 1
In particular, G and H are of the form G = [I; P] and H = [−P⊤ ; I] (systematic codes) so that
HG⊤ = 0. The following picture helps to visualize the parity check operation:
x5
x2 x1
x4
x7 x6
x3
Note that all four bits in each circle (corresponding to a row of H) sum up to zero. One can verify
that the minimum distance of this code is 3 bits. As such, it can correct 1 bit of error and detect 2
bits of error.
Linear codes are almost always examined with channels of additive noise, a precise definition
of which is given below:
Definition 18.11 (Additive noise). A channel PY|X with input and output space Fnq is called
additive-noise if
PY|X (y|x) = PZ (y − x)
for some random vector Z taking values in Fnq . In other words, Y = X + Z, where Z ⊥
⊥ X.
Given a linear code and an additive-noise channel PY|X , it turns out that there is a special
“syndrome decoder” that is optimal.
i i
i i
i i
304
Theorem 18.12. Any [k, n]Fq linear code over an additive-noise PY|X has a maximum likelihood
(ML) decoder g : Fnq → Fkq such that:
1 g(y) = y − gsynd (Hy⊤ ), i.e., the decoder is a function of the “syndrome” Hy⊤ only. Here gsynd :
Fnq−k → Fnq , defined by gsynd (s) ≜ argmaxz:Hz⊤ =s PZ (z), is called the “syndrome decoder”,
which decodes the most likely realization of the noise.
2 (Geometric uniformity) Decoding regions are translates of D0 = Im(gsynd ): Du = cu + D0 for
any u ∈ Fkq .
3 Pe,max = Pe .
In other words, syndrome is a sufficient statistic (Definition 3.8) for decoding a linear code.
Remark 18.5. As a concrete example, consider the binary symmetric channel BSC⊗ n
δ previously
considered in Example 17.1 and Section 17.2. This is an additive-noise channel over Fn2 , where
i.i.d.
Y = X + Z and Z = (Z1 , . . . , Zn ) ∼ Ber(δ). Assuming δ < 1/2, the syndrome decoder aims
to find the noise realization with the fewest number of flips that is compatible with the received
codeword, namely gsynd (s) = argminz:Hz⊤ =s wH (z), where wH denotes the Hamming weight. In
this case elements of the image of gsynd , which we deonted by D0 , are known as “minimal weight
coset leaders”. Counting how many of them occur at each Hamming weight is a difficult open
problem even for the most well-studied codes such as Reed-Muller ones. In Hamming space D0
looks like a Voronoi region of a lattice and Du ’s constitute a Voronoi tesselation of Fnq .
Remark 18.6. Overwhelming majority of practically used codes are in fact linear codes. Early in
the history of coding, linearity was viewed as a way towards fast and low-complexity encoding (just
binary matrix multiplication) and slightly lower complexity of the maximum-likelihood decoding
(via the syndrome decoder). As codes became longer and longer, though, the syndrome decoding
became impractical and today only those codes are used in practice for which there are fast and
low-complexity (suboptimal) decoders.
i i
i i
i i
Theorem 18.13 (DT bound for linear codes). Let PY|X be an additive noise channel over Fnq . For
all integers k ≥ 1 there exists a linear code f : Fkq → Fnq with error probability:
+
− n−k−logq 1
Pe,max = Pe ≤ E q .
P Z ( Z)
(18.19)
Remark 18.7. The bound above is the same as Theorem 18.6 evaluated with PX = Unif(Fnq ). The
analogy between Theorems 18.6 and 18.13 is the same as that between Theorems 11.6 and 11.8
(full random coding vs random linear codes).
Proof. Recall that in proving DT bound (Theorem 18.6), we selected the codewords
i.i.d.
c1 , . . . , cM ∼ PX and showed that
M−1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y) ≤ γ] + P[i(X; Y) ≥ γ]
2
Here we will adopt the same approach and take PX = Unif(Fnq ) and M = qk .
By Theorem 18.12 the optimal decoding regions are translational invariant, i.e. Du = cu +
D0 , ∀u, and therefore:
cu = uG + h, ∀u ∈ Fkq
where random G and h are drawn as follows: the k × n entries of G and the 1 × n entries
of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role that the
all-zero codeword plays (since it is contained in any linear codebook).
Step 2: We claim that the codewords are pairwise independent and uniform, i.e. ∀u 6= u′ ,
(cu , cu′ ) ∼ (X, X), where PX,X (x, x) = 1/q2n . To see this note that
cu ∼ uniform on Fnq
cu′ = u′ G + h = uG + h + (u′ − u)G = cu + (u′ − u)G
i i
i i
i i
306
Step 5: Remove dithering h. We claim that there exists a linear code without dithering such
that (18.20) is satisfied. The intuition is that shifting a codebook has no effect on its
performance. Indeed,
• Before, with dithering, the encoder maps u to uG + h, the channel adds noise to produce
Y = uG + h + Z, and the decoder g outputs g(Y).
• Now, without dithering, we encode u to uG, the channel adds noise to produce Y =
uG + Z, and we apply decode g′ defined by g′ (Y) = g(Y + h).
By doing so, we “simulate” dithering at the decoder end and the probability of error
remains the same as before. Note that this is possible thanks to the additivity of the noisy
channel.
We see that random coding can be done with different ensembles of codebooks. For example,
we have
i.i.d.
• Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random ensemble.
• Elias ensemble [113]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G drawn uniformly
at random from the set of all matrices. (This ensemble is used in the proof of Theorem 18.13.)
• Gallager ensemble: C = {c : Hc⊤ = 0}, with the (n − k) × n parity-check matrix H drawn
uniformly at random. Note this is not the same as the Elias ensemble.
• One issue with Elias ensemble is that with some non-zero probability G may fail to be full rank.
(It is a good exercise to find P [rank(G) < k] as a function of n, k, q.) If G is not full rank, then
there are two identical codewords and hence Pe,max ≥ 1/2. To fix this issue, one may let the
generator matrix G be uniform on the set of all k × n matrices of full (row) rank.
• Similarly, we may modify Gallager’s ensemble by taking the parity-check matrix H to be
uniform on all n × (n − k) full rank matrices.
For the modified Elias and Gallager’s ensembles, we could still do the analysis of random coding.
A small modification would be to note that this time (X, X̄) would have distribution
1
PX,X̄ = 1{X̸=X′ }
q2n − qn
uniform on all pairs of distinct codewords and are not pairwise independent.
Finally, we note that the Elias ensemble with dithering, cu = uG + h, has pairwise independence
property and its joint entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q. This is significantly
smaller than for Shannon’s fully random ensemble that we used in Theorem 18.5. Indeed, when
i i
i i
i i
i.i.d.
cj ∼ Unif(Fnq ) we have H(c1 , . . . , cM ) = qk n log q. An interesting question, thus, is to find
min H(c1 , . . . , cM )
where the minimum is over all distributions with P[ci = a, cj = b] = q−2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly, we
may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case, the Wozencraft
ensemble (see Exercise IV.13) for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q, which is
essentially our lower bound.
In short, we will see that the answer is that both of these methods are well-known to be (almost)
optimal for submodular function maximization, and this is exactly what channel coding is about.
Before proceeding, we notice that in the second question it is important to qualify that PX
is simple, since taking PX to be supported on the optimal M∗ (ϵ)-achieving codebook would of
course result in very good performance. However, instead we will see that choosing rather simple
PX already achieves a rather good lower bound on M∗ (ϵ). More explicitly, by simple we mean a
product distribution for the memoryless channel. Or, as an even better example to have in mind,
consider an additive-noise vector channel:
Yn = Xn + Zn
with addition over a product abelian group and arbitrary (even non-memoryless) noise Zn . In this
case the choice of uniform PX in random coding bound works, and is definitely “simple”.
The key observation of [23] is submodularity of the function mapping a codebook C ⊂ X to
the |C|(1 − Pe,MAP (C)), where Pe,MAP (C) is the probability of error under the MAP decoder (17.5).
(Recall (1.7) for the definition of submodularity.) More expicitly, consider a discrete Y and define
X
S(C) ≜ max PY|X (y|x) , S(∅) = 0
x∈C
y∈Y
i i
i i
i i
308
and set
Ct+1 = Ct ∪ {xt+1 } .
In other words, the probability of successful (MAP) decoding for the greedily constructed code-
book is at most a factor (1 − 1/e) away from the largest possible probability of success among all
codebooks of the same cardinality. Since we are mostly interested in success probabilities very
close to 1, this result may not appear very exciting. However, a small modification of the argument
yields the following (see [185, Theorem 1.5] for the proof):
Theorem 18.14 ([223]). For any non-negative submodular set-function f and a greedy sequence
Ct we have for all ℓ, t:
Applying this to the special case of f(·) = S(·) we obtain the result of [23]: The greedily
constructed codebook C ′ with M′ codewords satisfies
M ′
1 − Pe,MAP (C ′ ) ≥ ′
(1 − e−M /M )(1 − ϵ∗ (M)) .
M
In particular, the greedily constructed code with M′ = M2−10 achieves probability of success that
is ≥ 0.9995(1 −ϵ∗ (M)). In other words, compared to the best possible code a greedy code carrying
10 bits fewer of data suffers at most 5 · 10−4 worse probability of error. This is a very compelling
evidence for why greedy construction is so good. We do note, however, that Feinstein’s bound
does greedy construction not with the MAP decoder, but with a suboptimal one.
Next we address the question of random coding. Recall that our goal is to explain how can
selecting codewords uniformly at random from a “simple” distribution PX be any good. The key
i i
i i
i i
idea is again contained in [223]. The set-function S(C) can also be understood as a function with
domain {0, 1}|X | . Here is a natural extension of this function to the entire solid hypercube [0, 1]|X | :
X X
SLP (π ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} . (18.21)
x, y x
Indeed, it is easy to see that SLP (1C ) = S(C) and that SLP is a concave function.3
Since SLP is an extension of S it is clear that
X
S∗ (M) ≤ S∗LP (M) ≜ max{SLP (π ) : 0 ≤ π x ≤ 1, π x ≤ M} . (18.22)
x
In fact, we will see later in Section 22.3 that this bound coincides with the bound known as
meta-converse. Surprisingly, [223] showed that the greedy construction not only achieves a large
multiple of S∗ (M) but also of S∗LP (M):
To connect to the concept of random coding, though, we need the following result of [23]:4
P i.i.d.
Theorem 18.15. Fix π ∈ [0, 1]|X | and let M = x∈X π x . Let C = {c1 , . . . , cM′ } with cj ∼ PX (x) =
πx
M . Then we have
′
E[S(C)] ≥ (1 − e−M /M )SLP (π ) .
The proof of this result trivially follows from applying the following lemma with g(x) =
PY|X (y|x), summing over y and recalling the definition of SLP in (18.21).
Lemma 18.16. Let π and C be as in Theorem. Let g : X → R be any function and denote
P P
T(π , g) = max{ x rx g(x) : 0 ≤ rx ≤ π x , x rx = 1}. Then
′
E[max g(x)] ≥ (1 − e−M /M )T(π , g) .
x∈C
3
There are a number of standard extensions of a submodular function f to a hypercube. The largest convex interpolant f+ ,
also known as Lovász extension, the least concave interpolant f− , and multi-linear extension [56]. However, SLP does not
coincide with any of these and in particular strictly larger (in general) than f− .
4
There are other ways of doing “random coding” to produce an integer solution from a fractional one. For example, see the
multi-linear extension based one in [56].
i i
i i
i i
310
Proof. Without loss of generality we take X = [m] and g(1) ≥ g(2) ≥ · · · ≥ g(m) ≥ g(m + 1) ≜
′ ′
0. Denote for convenience a = 1 − (1 − M1 )M ≥ 1 − e−M /M , b(j) ≜ P[{1, . . . , j} ∩ C 6= ∅]. Then
P[max g(x) = g(j)] = b(j) − b(j − 1) ,
x∈C
b(j) ≥ ac(j) .
Plugging this into (18.24) we conclude the proof by noticing that rj = c(j) − c(j − 1) attains the
maximum in the definition of T(π , g).
Theorem 18.15 completes this section’s goal and shows that the random coding (as well as the
greedy/maximal coding) attains an almost optimal value of S∗ (M). Notice also that the random
coding distribution that we should be using is the one that attains the definition of S∗LP (M). For input
symmetric channels (such as additive noise ones) it is easy to show that the optimal π ∈ [0, 1]X is
a constant vector, and hence the codewords are to be generated iid uniformly on X .
i i
i i
i i
19 Channel capacity
In this chapter we apply methods developed in the previous chapters (namely the weak converse
and the random/maximal coding achievability) to compute the channel capacity. This latter notion
quantifies the maximal amount of (data) bits that can be reliably communicated per single channel
use in the limit of using the channel many times. Formalizing the latter statement will require
introducing the concept of a communication channel. Then for special kinds of channels (the
memoryless and the information stable ones) we will show that computing the channel capacity
reduces to maximizing the (sequence of the) mutual informations. This result, known as Shannon’s
noisy channel coding theorem, is very special as it relates the value of a (discrete, combinatorial)
optimization problem over codebooks to that of a (convex) optimization problem over information
measures. It builds a bridge between the abstraction of Information Measures (Part I) and the
practical engineering problems.
Information theory as a subject is sometimes accused of “asymptopia”, or the obsession with
asymptotic results and computing various limits. Although in this book we mostly refrain from
asymptopia, the topic of this chapter requires committing this sin ipso facto.
Definition 19.1. Fix an input alphabet A and an output alphabet B . A sequence of Markov kernels
PYn |Xn : An → B n indexed by the integer n = 1, 2 . . . is called a channel. The length of the input
n is known as blocklength.
To give this abstract notion more concrete form one should recall Section 17.2, in which we
described the BSC channel. Note, however, that despite this definition, it is customary to use the
term channel to refer to a single Markov kernel (as we did before in this book). An even worse,
311
i i
i i
i i
312
yet popular, abuse of terminology is to refer to n-th element of the sequence, the kernel PYn |Xn , as
the n-letter channel.
Although we have not imposed any requirements on the sequence of kernels PYn |Xn , one is never
interested in channels at this level of generality. Most of the time the elements of the channel input
Xn = (X1 , . . . , Xn ) are thought as indexed by time. That is the Xt corresponds to the letter that is
transmitted at time t, while Yt is the letter received at time t. The channel’s action is that of “adding
noise” to Xt and outputting Yt . However, the generality of the previous definition allows to model
situations where the channel has internal state, so that the amount and type of noise added to Xt
depends on the previous inputs and in principle even on the future inputs. The interpretation of t
as time, however, is not exclusive. In storage (magnetic, non-volatile or flash) t indexes space. In
those applications, the noise may have a rather complicated structure with transformation Xt → Yt
depending on both the “past” X<t and the “future” X>t .
Almost all channels of interest satisfy one or more of the restrictions that we list next:
• A channel is called non-anticipatory if it has the following extension property. Under the n-letter
kernel PYn |Xn , the conditional distribution of the first k output symbols Yk only depends on Xk
(and not on Xnk+1 ) and coincides with the kernel PYk |Xk (the k-th element of the channel sequence)
the k-th channel transition kernel in the sequence. This requirement models the scenario wherein
channel outputs depend causally on the inputs.
• A channel is discrete if A and B are finite.
• A channel is additive-noise if A = B are abelian group and Yn = Xn + Zn for some Zn
independent of Xn (see Definition 18.11). Thus
Y
n
PYn |Xn = PYk |Xk . (19.1)
k=1
where each PYk |Xk : A → B ; in particular, PYn |Xn are compatible at different blocklengths n.
• A channel is stationary memoryless if (19.1) is satisfied with PYk |Xk not depending on k, denoted
commonly by PY|X . In other words,
The interpretation is that each coordinate of the transmitted codeword Xn is corrupted by noise
independently with the same noise statistic.
• Discrete memoryless stationary channel (DMC): A DMC is a channel that is both discrete and
stationary memoryless. It can be specified in two ways:
i i
i i
i i
– an |A| × |B|-dimensional (row-stochastic) matrix PY|X where elements specify the transition
probabilities;
– a bipartite graph with edge weight specifying the transition probabilities.
Fig. 19.1 lists some common binary-input binary-output DMCs.
Let us recall the example of the AWGN channel Example 3.3: the alphabets A = B = R and
Yn = Xn + Zn , with Xn ⊥ ⊥ Zn ∼ N (0, σ 2 In ). This channel is a non-discrete, stationary memoryless,
additive-noise channel.
Having defined the notion of the channel, we can define next the operational problem that the
communication engineer faces when tasked with establishing a data link across the channel. Since
the channel is noisy, the data is not going to pass unperturbed and the error-correcting codes are
naturally to be employed. To send one of M = 2k messages (or k data bits) with low probabil-
ity of error, it is often desirable to use the shortest possible length of the input sequence. This
desire explains the following definitions, which extend the fundamental limits in Definition 17.2
to involve the blocklength n.
• An (n, M, ϵ)-code is an (M, ϵ)-code for PYn |Xn , consisting of an encoder f : [M] → An and a
decoder g : B n → [M] ∪ {e}.
• An (n, M, ϵ)max -code is analogously defined for the maximum probability of error.
How to understand the behaviour of M∗ (n, ϵ)? Recall that blocklength n measures the amount
of time or space resource used by the code. Thus, it is natural to maximize the ratio of the data
i i
i i
i i
314
transmitted to the resource used, and that leads us to the notion of the transmission rate defined as
log M
R = n2 and equal to the number of bits transmitted per channel use. Consequently, instead of
studying M∗ (n, ϵ) one is lead to the study of 1n log M∗ (n, ϵ). A natural first question is to determine
the first-order asymptotics of this quantity and this motivates the final definition of the Section.
Definition 19.3 (Channel capacity). The ϵ-capacity Cϵ and Shannon capacity C are defined as
follows
1
Cϵ ≜ lim inf log M∗ (n, ϵ);
n→∞ n
C = lim Cϵ .
ϵ→0+
The operational meaning of Cϵ (resp. C) is the maximum achievable rate at which one can
communicate through a noisy channel with probability of error at most ϵ (resp. o(1)). In other
words, for any R < C, there exists an (n, exp(nR), ϵn )-code, such that ϵn → 0. In this vein, Cϵ and
C can be equivalently defined as follows:
The reason that capacity is defined as a large-n limit (as opposed to a supremum over n) is because
we are concerned with rate limit of transmitting large amounts of data without errors (such as in
communication and storage).
The case of zero-error (ϵ = 0) is so different from ϵ > 0 that the topic of ϵ = 0 constitutes a
separate subfield of its own (cf. the survey [182]). Introduced by Shannon in 1956 [279], the value
1
C0 ≜ lim inf log M∗ (n, 0) (19.6)
n→∞ n
is known as the zero-error capacity and represents the maximal achievable rate with no error
whatsoever. Characterizing the value of C0 is often a hard combinatorial problem. However, for
many practically relevant channels it is quite trivial to show C0 = 0. This is the case, for example,
for the DMCs we considered before: the BSC or BEC. Indeed, for them we have log M∗ (n, 0) = 0
for all n, meaning transmitting any amount of information across these channels requires accepting
some (perhaps vanishingly small) probability of error. Nevertheless, there are certain interesting
and important channels for which C0 is positive, cf. Section 23.3.1 for more.
As a function of ϵ the Cϵ could (most generally) behave like the plot below on the left-hand
side below. It may have a discontinuity at ϵ = 0 and may be monotonically increasing (possibly
even with jump discontinuities) in ϵ. Typically, however, Cϵ is zero at ϵ = 0 and stays constant for
all 0 < ϵ < 1 and, hence, coincides with C (see the plot on the right-hand side). In such cases we
say that the strong converse holds (more on this later in Section 22.1).
i i
i i
i i
Cǫ Cǫ
strong converse
holds
Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1
In Definition 19.3, the capacities Cϵ and C are defined with respect to the average probabil-
ity of error. By replacing M∗ with M∗max , we can define, analogously, the capacities Cϵ
(max)
and
C(max) with respect to the maximal probability of error. It turns out that these two definitions are
equivalent, as the next theorem shows.
Proof. The second inequality is obvious, since any code that achieves a maximum error
probability ϵ also achieves an average error probability of ϵ.
For the first inequality, take an (n, M, ϵ(1 − τ ))-code, and define the error probability for the jth
codeword as
λj ≜ P[Ŵ 6= j|W = j]
Then
X X X
M(1 − τ )ϵ ≥ λj = λj 1{λj ≤ϵ} + λj 1{λj >ϵ} ≥ ϵ|{j : λj > ϵ}|.
Hence |{j : λj > ϵ}| ≤ (1 − τ )M. (Note that this is exactly Markov inequality.) Now by removing
those codewords1 whose λj exceeds ϵ, we can extract an (n, τ M, ϵ)max -code. Finally, take M =
M∗ (n, ϵ(1 − τ )) to finish the proof.
(max)
Corollary 19.5 (Capacity under maximal probability of error). Cϵ = Cϵ for all ϵ > 0 such
thatIn particular, C(max) = C.
Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0
1
Cϵ ≥ C(ϵmax) ≥ lim inf log τ M∗ (n, ϵ(1 − τ )) ≥ Cϵ(1−τ )
n→∞ n
(max)
Sending τ → 0 yields Cϵ ≥ Cϵ ≥ Cϵ− .
1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to reach a
desired property.
i i
i i
i i
316
Note that information capacity C(I) so defined is not the same as the Shannon capacity C in Def-
inition 19.3; as such, from first principles it has no direct interpretation as an operational quantity
related to coding. Nevertheless, they are related by the following coding theorems. We start with
a converse result:
C( I )
Theorem 19.7 (Upper Bound for Cϵ ). For any channel, ∀ϵ ∈ [0, 1), Cϵ ≤ 1−ϵ and C ≤ C(I) .
Proof. Applying the general weak converse bound in Theorem 17.3 to PYn |Xn yields
supPXn I(Xn ; Yn ) + h(ϵ)
log M∗ (n, ϵ) ≤
1−ϵ
Normalizing this by n and taking the lim inf as n → ∞, we have
1 1 supPXn I(Xn ; Yn ) + h(ϵ) C(I)
Cϵ = lim inf log M∗ (n, ϵ) ≤ lim inf = .
n→∞ n n→∞ n 1−ϵ 1−ϵ
Theorem 19.8 (Lower Bound for Cϵ ). For a stationary memoryless channel, Cϵ ≥ supPX I(X; Y),
for any ϵ ∈ (0, 1].
Here the information density is defined with respect to the distribution PXn ,Yn = P⊗ n
X,Y and, therefore,
X
n
dPX,Y Xn
i(Xn ; Yn ) = log (Xk , Yk ) = i(Xk ; Yk ),
dPX PY
k=1 k=1
i i
i i
i i
where i(x; y) = iPX,Y (x; y) and i(xn ; yn ) = iPXn ,Yn (xn ; yn ). What is important is that under PXn ,Yn the
random variable i(Xn ; Yn ) is a sum of iid random variables with mean I(X; Y). Thus, by the weak
law of large numbers we have
P[i(Xn ; Yn ) < n(I(X; Y) − δ)] → 0
for any δ > 0.
With this in minde, let us set log M = n(I(X; Y) − 2δ) for some δ > 0, and take τ = δ n in
Shannon’s bound. Then for the error bound we have
" n #
X n→∞
ϵn ≤ P i(Xk ; Yk ) ≤ nI(X; Y) − δ n + exp(−δ n) −−−→ 0, (19.7)
k=1
Since the bound converges to 0, we have shown that there exists a sequence of (n, Mn , ϵn )-codes
with ϵn → 0 and log Mn = n(I(X; Y) − 2δ). Hence, for all n such that ϵn ≤ ϵ
log M∗ (n, ϵ) ≥ n(I(X; Y) − 2δ)
And so
1
Cϵ = lim inf log M∗ (n, ϵ) ≥ I(X; Y) − 2δ
n→∞ n
Since this holds for all PX and all δ > 0, we conclude Cϵ ≥ supPX I(X; Y).
The following result follows from pairing the upper and lower bounds on Cϵ .
Theorem 19.9 (Shannon’s Noisy Channel Coding Theorem [277]). For a stationary memoryless
channel,
C = C(I) = sup I(X; Y). (19.8)
PX
As we mentioned several times already this result is among the most significant results in
information theory. From the engineering point of view, the major surprise was that C > 0,
i.e. communication over a channel is possible with strictly positive rate for any arbitrarily small
probability of error. The way to achieve this is to encode the input data jointly (i.e. over many
input bits together). This is drastically different from the pre-1948 methods, which operated on
a letter-by-letter bases (such as Morse code). This theoretical result gave impetus (and still gives
guidance) to the evolution of practical communication systems – quite a rare achievement for an
asymptotic mathematical fact.
Proof. Statement (19.8) contains two equalities. The first one follows automatically from the
second and Theorems 19.7 and 19.8. To show the second equality C(I) = supPX I(X; Y), we note
that for stationary memoryless channels C(I) is in fact easy to compute. Indeed, rather than solving
a sequence of optimization problems (one for each n) and taking the limit of n → ∞, memoryless-
ness of the channel implies that only the n = 1 problem needs to be solved. This type of result is
known as “single-letterization” in information theory and we show it formally in the following
proposition, which concludes the proof.
i i
i i
i i
318
Q
Proof. Recall that from Theorem 6.1 we know that for product kernels PYn |Xn = PYi |Xi , mutual
P n
information satisfies I(Xn ; Yn ) ≤ k=1 I(Xk ; Yk ) with equality whenever Xi ’s are independent.
Then
1
C(I) = lim inf sup I(Xn ; Yn ) = lim inf sup I(X; Y) = sup I(X; Y).
n→∞ n P Xn n→∞ PX PX
Shannon’s noisy channel theorem shows that by employing codes of large blocklength, we can
approach the channel capacity arbitrarily close. Given the asymptotic nature of this result (or any
other asymptotic result), a natural question is understanding the price to pay for reaching capacity.
This can be understood in two ways:
The main tool in the proof of Theorem 19.8 was the law of large numbers. The lower bound
Cϵ ≥ C(I) in Theorem 19.8 shows that log M∗ (n, ϵ) ≥ nC + o(n) (this just restates the fact
that normalizing by n and taking the lim inf must result in something ≥ C). If instead we apply
i i
i i
i i
a more careful analysis using the central-limit theorem (CLT), we obtain the following refined
achievability result.
Theorem 19.11. Consider a stationary memoryless channel with a capacity-achieving input dis-
tribution. Namely, C = maxPX I(X; Y) = I(X∗ ; Y∗ ) is attained at P∗X , which induces PX∗ Y∗ =
PX∗ PY|X . Assume that V = Var[i(X∗ ; Y∗ )] < ∞. Then
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n),
where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.
Proof. Writing the little-o notation in terms of lim inf, our goal is
log M∗ (n, ϵ) − nC
lim inf √ ≥ −Q−1 (ϵ) = Φ−1 (ϵ),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound
∃(n, M, ϵ)max : M ≥ β (ϵ − P[i(Xn ; Yn ) ≤ log β])
√
Take log β = nC + nVt, then applying the CLT gives
√ hX √ i
log M ≥ nC + nVt + log ϵ − P i(Xk ; Yk ) ≤ nC + nVt
√
=⇒ log M ≥ nC + nVt + log (ϵ − Φ(t)) ∀Φ(t) < ϵ
log M − nC log(ϵ − Φ(t))
=⇒ √ ≥t+ √ ,
nV nV
where Φ(t) is the standard normal CDF. Taking the liminf of both sides
log M∗ (n, ϵ) − nC
lim inf √ ≥ t,
n→∞ nV
for all t such that Φ(t) < ϵ. Finally, taking t % Φ−1 (ϵ), and writing the liminf in little-oh notation
completes the proof
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n).
Remark 19.1. Theorem 19.9 implies that for any R < C, there exists a sequence of (n, exp(nR), ϵn )-
codes such that the probability of error ϵn vanishes as n → ∞. Examining the upper bound (19.7),
we see that the probability of error actually vanishes exponentially fast, since the event in the first
term is of large-deviations type (recall Chapter 15) so that both terms are exponentially small.
Finding the value of the optimal exponent (or even the existence thereof) has a long history (but
remains generally open) in information theory, see Section 22.4*. Recently, however, it was under-
stood that a practically more relevant, and also much easier to analyze, is the regime of fixed
(non-vanishing) error ϵ, in which case the main question is to bound the speed of convergence of
R → Cϵ = C. Previous theorem shows one bound on this speed of convergence. See Sections 22.5
i i
i i
i i
320
√
and 22.6 for more. In particular, we will show that the bound on the n term in Theorem 19.11
is often tight.
C
C C
1 bit
1 bit 1 bit
δ
0 1 1 δ δ
2 0 1 0 1
BSCδ BECδ Z-channel
First for the BSCδ we have the following description of the input-output law:
Y = X + Z mod 2, Z ∼ Ber(δ) ⊥
⊥X
More generally, for all additive-noise channel over a finite abelian group G, C = supPX I(X; X +
Z) = log |G| − H(Z), achieved by X ∼ Unif(G).
Next we consider the binary erasure channel (BEC). BECδ is a multiplicative channel. Indeed,
if we define the input X ∈ {±1} and output Y ∈ {±1, 0}, then BEC relation can be written as
Y = XZ, Z ∼ Ber(δ) ⊥
⊥ X.
To compute the capacity, we first notice that even without evaluating Shannon’s formula, it is
clear that C ≤ 1 − δ (bit), because for a large blocklength n about δ -fraction of the message is
completely lost (even if the encoder knows a priori where the erasures are going to occur, the rate
still cannot exceed 1 − δ ). More formally, we notice that P[X = +1|Y = e] = P[X= δ
1]δ
= P[X = 1]
and therefore
i i
i i
i i
Finally, the Z-channel can be thought of as a multiplicative channel with transition law
Y = XZ, X ∈ { 0, 1} ⊥
⊥ Z ∼ Ber(1 − δ) ,
PY|X (f−
o (E)|fi (x)) = PY|X (E|x) ,
1
for all measurable E ⊂ Y and x ∈ X . Two symmetries f and g can be composed to produce another
symmetry as
( gi , go ) ◦ ( fi , fo ) ≜ ( gi ◦ fi , fo ◦ go ) . (19.9)
Note that both components of an automorphism f = (fi , fo ) are bimeasurable bijections, that is
fi , f− 1 −1
i , fo , fo are all measurable and well-defined functions.
Naturally, every symmetry group G possesses a canonical left action on X × Y defined as
Since the action on X × Y splits into actions on X and Y , we will abuse notation slightly and write
g · ( x, y) ≜ ( g x , g y ) .
For the cases of infinite X , Y we need to impose certain additional regularity conditions:
i i
i i
i i
322
G×X ×Y →X ×Y
is measurable.
Note that under the regularity assumption the action (19.10) also defines a left action of G on
P(X ) and P(Y) according to
or, in words, if X ∼ PX then gX ∼ gPX , and similarly for Y and gY. For every distribution PX we
define an averaged distribution P̄X as
Z
P̄X [E] ≜ PX [g−1 E]ν(dg) , (19.13)
G
which is the distribution of random variable gX when g ∼ ν and X ∼ PX . The measure P̄X is G-
invariant, in the sense that gP̄X = P̄X . Indeed, by left-invariance of ν we have for every bounded
function f
Z Z
f(g)ν(dg) = f(hg)ν(dg) ∀h ∈ G ,
G G
and therefore
Z
P̄X [h−1 E] = PX [(hg)−1 E]ν(dg) = P̄X [E] .
G
In other words, if the pair (X, Y) is generated by taking X ∼ PX and applying PY|X , then the pair
(gX, gY) has marginal distribution gPX but conditional kernel is still PY|X . For finite X , Y this is
equivalent to
i i
i i
i i
which may also be taken as the definition of the automorphism. In terms of the G-action on P(Y)
we may also say:
It is not hard to show that for any channel and a regular group of symmetries G the capacity-
achieving output distribution must be G-invariant, and capacity-achieving input distribution can
be chosen to be G-invariant. That is, the saddle point equation
inf sup D(PY|X kQY |PX ) = sup inf D(PY|X kQY |PX ) ,
PX QY QY PX
can be solved in the class of G-invariant distribution. Often, the action of G is transitive on X (Y ),
in which case the capacity-achieving input (output) distribution can be taken to be uniform.
Below we systematize many popular notions of channel symmetry and explain relationship
between them.
Y = X ◦ Z,
• Note that it is an easy consequence of the definitions that any input-symmetric (resp. output-
symmetric) channel’s PY|X has all rows (resp. columns) – permutations of the first row (resp.
column). Hence,
i i
i i
i i
324
• Since Gallager symmetry implies all rows are permutations of the first one, while output
symmetry implies the same statement for columns we have
• Clearly, not every Dobrushin-symmetric channel is square. One may wonder, however, whether
every square Dobrushin channel is a group-noise channel. This is not so. Indeed, according
to [286] the latin squares that are Cayley tables are precisely the ones in which composition of
two rows (as permutations) gives another row. An example of the latin square which is not a
Cayley table is the following:
1 2 3 4 5
2 5 4 1 3
3 1 2 5 4 . (19.21)
4 3 5 2 1
5 4 1 3 2
1
Thus, by multiplying this matrix by 15 we obtain a counter-example:
In fact, this channel is not even input-symmetric. Indeed, suppose there is g ∈ G such that
g4 = 1 (on X ). Then, applying (19.16) with x = 4 we figure out that on Y the action of g must
be:
1 7→ 4, 2 7→ 3, 3 7→ 5, 4 7→ 2, 5 7→ 1 .
1 7→ 2, 2 7→ 5, 3 7→ 1, 4 7→ 3, 5 7→ 4 ,
which implies via (19.16) that PY|X (g1|x) is not a column of (19.21). Thus:
i i
i i
i i
• Clearly, not every input-symmetric channel is Dobrushin (e.g., BEC). One may even find a
counter-example in the class of square channels:
1 2 3 4
1 3 2 4 1
4 2 3 1 · 10 (19.22)
4 3 2 1
This shows:
input-symmetric, square 6=⇒ Dobrushin
• Channel (19.22) also demonstrates:
Gallager-symmetric, square 6=⇒ Dobrushin .
• Example (19.22) naturally raises the question of whether every input-symmetric channel is
Gallager symmetric. The answer is positive: by splitting Y into the orbits of G we see that a
subchannel X → {orbit} is input and output symmetric. Thus by (19.18) we have:
input-symmetric =⇒ Gallager-symmetric =⇒ weakly input-symmetric (19.23)
(The second implication is evident).
• However, not all weakly input-symmetric channels are Gallager-symmetric. Indeed, consider
the following channel
1/7 4/7 1/7 1/7
4/7 1/7 0 4/7
W= . (19.24)
0 0 4 /7 2 / 7
2/7 2/7 2/7 0
Since det W 6= 0, the capacity achieving input distribution is unique. Since H(Y|X = x) is
independent of x and PX = [1/4, 1/4, 3/8, 1/8] achieves uniform P∗Y it must be the unique
optimum. Clearly any permutation Tx fixes a uniform P∗Y and thus the channel is weakly input-
symmetric. At the same time it is not Gallager-symmetric since no row is a permutation of
another.
• For more on the properties of weakly input-symmetric channels see [238, Section 3.4.5].
i i
i i
i i
326
Gallager
1010
1111111
0000000
0000000
1111111
0000000
1111111 0
1 Dobrushin
0000000
1111111 101111
0000000000
1111111111
0000
0000000
1111111 101111
0000000000
1111111111
0000
0000
1111
0000000
1111111 101111
0000000000
1111111111
0000
000
111
0000
1111 000
111
0000
1111
0000000
1111111 0
1
0000000000
1111111111
0000
1111
000
111
000
111
0000
1111 000
111
0000
1111
0000000
1111111 0
1
0000000000
1111111111
0000
1111
000
111
000
111
0000
1111
000
111 000
111
0000
1111
0000000
1111111
0000
1111 0000
1111
0000
1111 000
111
000input−symmetric
111
output−symmetric group−noise
Definition 19.14. A channel is called information stable if there exists a sequence of input
distributions {PXn , n = 1, 2, . . .} such that
1 n n P (I)
i( X ; Y ) −
→C .
n
For example, we can pick PXn = (P∗X )n for stationary memoryless channels. Therefore
stationary memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Proof. Like the stationary, memoryless case, the upper bound comes from the general con-
verse Theorem 17.3, and the lower bound uses a similar strategy as Theorem 19.8, except utilizing
the definition of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 19.16. A memoryless channel is information stable if there exists {X∗k : k ≥ 1} such
that both of the following hold:
1X ∗ ∗
n
I(Xk ; Yk ) → C(I) (19.25)
n
k=1
X
∞
1
Var[i(X∗n ; Y∗n )] < ∞ . (19.26)
n2
n=1
i i
i i
i i
where convergence to 0 follows from Kronecker lemma (Lemma 19.17 to follow) applied with
bn = n2 , xn = Var[i(X∗n ; Y∗n )]/n2 .
The second part follows from the first. Indeed, notice that
1X
n
C(I) = lim inf sup I(Xk ; Yk ) .
n→∞ n PXk
k=1
(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k
X
n X
n
I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 1 ,
PXk
k=1 k=1
and hence normalizing by n we get (19.25). We next show that for any joint distribution PX,Y we
have
Var[i(X; Y)] ≤ 2 log2 (min(|A|, |B|)) . (19.28)
The argument is symmetric in X and Y, so assume for concreteness that |B| < ∞. Then
E[i2 (X; Y)]
Z X
≜ dPX (x) PY|X (y|x) log PY|X (y|x) + log PY (y) − 2 log PY|X (y|x) · log PY (y)
2 2
A y∈B
Z X h i
≤ dPX (x) PY|X (y|x) log2 PY|X (y|x) + log2 PY (y) (19.29)
A y∈B
Z X X
= dPX (x) PY|X (y|x) log2 PY|X (y|x) + PY (y) log2 PY (y)
A y∈B y∈B
Z
≤ dPX (x)g(|B|) + g(|B|) (19.30)
A
i i
i i
i i
328
=2g(|B|) ,
where (19.29) is because 2 log PY|X (y|x) · log PY (y) is always non-negative, and (19.30) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X
n
g(n) ≜ sup
Pn
aj log2 aj . (19.31)
aj ≥0: j=1 aj =1 j=1
Since the x log2 x has unbounded derivative at the origin, the solution of (19.31) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = 1n .
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,
Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.
Lemma 19.17 (Kronecker Lemma). Let a sequence 0 < bn % ∞ and a non-negative sequence
P∞
{xn } such that n=1 xn < ∞, then
1 X
n
bj xj → 0
bn
j=1
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from
above
X
n X
m X
n
bk xk ≤ bm xk + b k xk
k=1 k=1 k=m+1
1 X bm X X bm X X
n ∞ n ∞ ∞
bk
=⇒ b k xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1
1 X X
n ∞
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1
Since this holds for any m, we can make the last term arbitrarily small.
How to show information stability? One important class of channels with memory for which
information stability can be shown easily are Gaussian channels. The complete details will be
shown below (see Sections 20.5* and 20.6*), but here we demonstrate a crucial fact.
For jointly Gaussian (X, Y) we always have bounded variance:
cov[X, Y]
Var[i(X; Y)] = ρ2 (X, Y) log2 e ≤ log2 e , ρ(X, Y) = p . (19.32)
Var[X] Var[Y]
i i
i i
i i
Indeed, first notice that we can always represent Y = X̃ + Z with X̃ = aX ⊥⊥ Z. On the other hand,
we have
log e x̃2 + 2x̃z σ2 2
i(x̃; y) = − z , z ≜ y − x̃ .
2 σY2 σY2 σZ2
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
σX̃2
log e X̃ − σZ2
2
E[i(X̃; Y)|X̃] = ,
2 σY2
and hence
2 log2 e 4
Var[E[i(X̃; Y)|X̃]] = σ .
4σY4 X̃
On the other hand,
2 log2 e 2 2
Var[i(X̃; Y)|X̃] = [4σX̃ σZ + 2σX̃4 ] .
4σY4
Putting it all together we get (19.32). Inequality (19.32) justifies information stability of all sorts
of Gaussian channels (memoryless and with memory), as we will see shortly.
1X
k
1
Pb ≜ P[Sj 6= Ŝj ] = E[dH (Sk , Ŝk )] , (19.33)
k k
j=1
i i
i i
i i
330
1X X
k k
1{Si 6= Ŝi } ≤ 1{Sk 6= Ŝk } ≤ 1{Si 6= Ŝi },
k
i=1 i=1
where the first inequality is obvious and the second follow from the union bound. Taking
expectation of the above expression gives the theorem.
Next, the following pair of results is often useful for lower bounding Pb for some specific codes.
Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then
X
k X
k
1{Si =
6 Ŝi } ≥ 1{Sk = Ŝk + ei }
i=1 i=1
1X
k
Pb ≥ P[Sk = Ŝk + ei ]
k
i=1
Theorem 19.20. If A, B ∈ {0, 1}k (with arbitrary marginals!) then for every r ≥ 1 we have
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (19.34)
k r−1
Pr,min ≜ min{P[B = c′ |A = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = r} (19.35)
Next, notice
dH (x, y) ≥ r1{dH (x, y) = r}
and take the expectation with x ∼ A, y ∼ B.
In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the minimax risk
of an estimator. See Section 31.2 for more.
The following is a converse bound for channel coding under BER constraint.
i i
i i
i i
Theorem 19.21 (Converse under BER). Any M-code with M = 2k and bit-error rate Pb satisfies
supPX I(X; Y)
log M ≤ .
log 2 − h(Pb )
i.i.d.
Proof. Note that Sk → X → Y → Ŝk , where Sk ∼ Ber( 12 ). Recall from Theorem 6.1 that for iid
P
Sn , I(Si ; Ŝi ) ≤ I(Sk ; Ŝk ). This gives us
X
k
sup I(X; Y) ≥ I(X; Y) ≥ I(Si ; Ŝi )
PX
i=1
1
1X
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
1
= kd 1 − Pb
= k(log 2 − h(Pb ))
2
where the second line used Fano’s inequality (Theorem 6.3) for binary random variables (or data
processing inequality for divergence), and the third line used the convexity of divergence.2
Pairing this bound with Proposition 19.10 shows that any sequence of codes with Pb → 0 (for
a memoryless channel) must have rate R < C. In other words, relaxing the constraint from Pe to
Pb does not yield any higher rates.
Later in Section 26.3 we will see that channel coding under BER constraint is a special case
of a more general paradigm known as lossy joint source channel coding so that Theorem 19.21
follows from Theorem 26.5. Furthermore, this converse bound is in fact achievable asymptotically
for stationary memoryless channels.
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)
2
Note that this last chain of inequalities is similar to the proof of Proposition 6.7.
i i
i i
i i
332
probability of error P[Sk 6= Ŝk ] is small. The fundamental limit (optimal probability of error) is
defined as
In channel coding we are interested in transmitting M messages and all messages are born equal.
Here we want to convey the source realizations which might not be equiprobable (has redundancy).
Indeed, if Sk is uniformly distributed on, say, {0, 1}k , then we are back to the channel coding setup
with M = 2k under average probability of error, and ϵ∗JSCC (k, n) coincides with ϵ∗ (n, 2k ) defined
in Section 22.1.
Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix
of two problems we’ve seen: compressing a source and coding over a channel. The following
theorem shows that compressing and channel coding separately is optimal. This is a relief, since
it implies that we do not need to develop any new theory or architectures to solve the Joint Source
Channel Coding problem. As far as the leading term in the asymptotics is concerned, the following
two-stage scheme is optimal: First use the optimal compressor to eliminate all the redundancy in
the source, then use the optimal channel code to add redundancy to combat the noise in the data
transmission.
Theorem 19.22. Let the source {Sk } be stationary memoryless on a finite alphabet with entropy
H. Let the channel be stationary memoryless with finite capacity C. Then
(
∗ → 0 R < C/H
ϵJSCC (nR, n) n → ∞.
6→ 0 R > C/H
The interpretation of this result is as follows: Each source symbol has information content
(entropy) H bits. Each channel use can convey C bits. Therefore to reliably transmit k symbols
over n channel uses, we need kH ≤ nC.
Proof. (Achievability.) The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 P Yn | X n g2 g1
Sk −→ W −→ Xn −→ Yn −→ Ŵ −→ Ŝk
Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
from Corollary 11.3 and Theorem 19.9 we get:
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝk 6= Sk (W)] ≤ ϵ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ϵ ∀m, ∀k ≥ k0
n
i i
i i
i i
And therefore if R(H + δ) < C − δ , then ϵ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
(Converse.) To prove the converse notice that any JSCC encoder/decoder induces a Markov
chain
Sk → Xn → Yn → Ŝk .
On the other hand, since P[Sk 6= Ŝk ] ≤ ϵn , Fano’s inequality (Theorem 6.3) yields
We remark that instead of using Fano’s inequality we could have lower bounded I(Sk ; Ŝk ) as in
the proof of Theorem 17.3 by defining QSk Ŝk = USk PŜk (with USk = Unif({0, 1}k ) and applying the
data processing inequality to the map (Sk , Ŝk ) 7→ 1{Sk = Ŝk }:
D(PSk Ŝk kQSk Ŝk ) = D(PSk kUSk ) + D(PŜ|Sk kPŜ |PSk ) ≥ d(1 − ϵn k|A|−k )
Rearranging terms yields (19.36). As we discussed in Remark 17.2, replacing D with other f-
divergences can be very fruitful.
In a very similar manner, by invoking Corollary 12.2 and Theorem 19.15 we obtain:
Theorem 19.23. Let source {Sk } be ergodic on a finite alphabet, and have entropy rate H. Let
the channel have capacity C and be information stable. Then
(
∗ = 0 R > H/C
lim ϵJSCC (nR, n)
n→∞ > 0 R < H/C
i i
i i
i i
334
i i
i i
i i
In this chapter we study data transmission with constraints on the channel input. Namely, in pre-
vious chapter the encoder for blocklength n code was permitted to produce arbitrary sequences of
inputs, i.e. elements of An . However, in many practical problem only a subset of An is allowed
to be used. As a motivation, consider the setting of the AWGN channel Example 3.3. Without
restricting the input, i.e. allowing arbitrary elements of Rn as input, the channel capacity is infi-
nite: supPX I(X; X + Z) = ∞ (for example, take X ∼ N (0, P) and P → ∞). Indeed, one can
transmit arbitrarily many messages with arbitrarily small error probability by choosing elements
of Rn with giant pairwise distance. In reality, however, one is limited by the available power. In
other words, only the elements xn ∈ Rn are allowed satisfying
1X 2
n
xt ≤ P ,
n
t=1
where P > 0 is known as the power constraint. How many bits per channel use can we transmit
under this constraint on the codewords? To answer this question in general, we need to extend
the setup and coding theorems to channels with input constraints. After doing that we will apply
these results to compute capacities of various Gaussian channels (memoryless, with inter-symbol
interference and subject to fading).
An
b b
b
b Fn b
b b
b b b
b b
b
We will say that an (n, M, ϵ)-code satisfies the input constraint Fn ⊂ An if the encoder maps
[M] into Fn , i.e. f : [M] → Fn . What subsets Fn are of interest?
In the context of Gaussian channels, we have A = R. Then one often talks about the following
constraints:
335
i i
i i
i i
336
1X 2 √
n
| xi | ≤ P ⇔ kxn k2 ≤ nP.
n
i=1
√
In other words, codewords must lie in a ball of radius nP.
• Peak power constraint :
Notice that the second type of constraint does not introduce any new problems: we can simply
restrict the input space from A = R to A = [−A, A] and be back into the setting of input-
unconstrained coding. The first type of the constraint is known as a separable cost-constraint.
We will restrict our attention from now on to it exclusively.
1 A, B : input/output spaces
2 PYn |Xn : An → B n , n = 1, 2, . . .
3 Cost function c : A → R ∪ {±∞}.
1X
n
c(xn ) ≜ c(xk )
n
k=1
• Information capacity
1
C(I) (P) = lim inf sup I(Xn ; Yn )
n→∞ n PXn :E[Pnk=1 c(Xk )]≤nP
i i
i i
i i
• Information stability: Channel is information stable if for all (admissible) P, there exists a
sequence of channel input distributions PXn such that the following two properties hold:
1 P
iP n n (Xn ; Yn )−
→C(I) (P) (20.1)
n X ,Y
P[c(Xn ) > P + δ] → 0 ∀δ > 0 . (20.2)
These definitions clearly parallel those of Definitions 19.3 and 19.6 for channels without input
constraints. A notable and crucial exception is the definition of the information capacity C(I) (P).
Indeed, under input constraints instead of maximizing I(Xn ; Yn ) over distributions supported on
Fn we extend maximization to a richer set of distributions, namely, those satisfying
X
n
E[ c(Xk )] ≤ nP .
k=1
Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with 1 codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence ϕ(λ̄P0 + λP1 ) ≥
λ̄ϕ(P0 ) + λϕ(P1 ). The second claim follows from concavity of ϕ(·).
To extend these results to C(I) (P) observe that for every n
1
P 7→ sup I(Xn ; Yn )
n PXn :E[c(Xn )]≤P
is concave. Then taking lim infn→∞ the same holds for C(I) (P).
An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
i i
i i
i i
338
Proof. C(I) (P) ≥ ϕ(P) is obvious by using PXn = (PX )⊗n . For “≤”, fix any PXn satisfying the
cost constraint. Consider the chain
( a) X (b) X X
n n ( c)
n
1
I(Xn ; Yn ) ≤ I(Xj ; Yj ) ≤ ϕ(E[c(Xj )]) ≤ nϕ E[c(Xj )] ≤ nϕ(P) ,
n
j=1 j=1 j=1
where (a) follows from Theorem 6.1; (b) from the definition of ϕ; and (c) from Jensen’s inequality
and concavity of ϕ.
Proof. The argument is the same as we used in Theorem 17.3. Take any (n, M, ϵ, P)-code, W →
Xn → Yn → Ŵ. Applying Fano’s inequality and the data-processing, we get
Normalizing both sides by n and taking lim infn→∞ we obtain the result.
Next we need to extend one of the coding theorems to the case of input constraints. We do so for
the Feinstein’s lemma (Theorem 18.7). Note that when F = X , it reduces to the original version.
Theorem 20.7 (Extended Feinstein’s lemma). Fix a Markov kernel PY|X and an arbitrary PX .
Then for any measurable subset F ⊂ X , everyγ > 0 and any integer M ≥ 1, there exists an
(M, ϵ)max -code such that
i i
i i
i i
Proof. Similar to the proof of the original Feinstein’s lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Next, we apply Corollary 18.4 and find out
that there is a set F0 ⊂ X with two properties: a) PX [F0 ] = 1 and b) for every x ∈ F0 we have
PY (Ex ) ≤ γ1 . We now let F′ = F ∩ F0 and notice that PX [F′ ] = PX [F].
We sequentially pick codewords {c1 , . . . , cM } from the set F′ (!) and define the decoding regions
{D1 , . . . , DM } as Dj ≜ Ecj \ ∪jk− 1
=1 Dk . The stopping criterion is that M is maximal, i.e.,
∀x0 ∈ F′ , PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − ϵ
⇔ ∀x0 ∈ X , PY [Ex0 \ ∪M ′
j=1 Dj X = x0 ] < (1 − ϵ)1[x0 ∈ F ] + 1[x0 ∈ F ]
′c
From here, we complete the proof by following the same steps as in the proof of original Feinstein’s
lemma (Theorem 18.7).
Given the coding theorem we can establish a lower bound on capacity
Theorem 20.8 (Capacity lower bound). For any information stable channel with input constraints
and P > P0 we have
C(P) ≥ C(I) (P). (20.3)
Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). Thus, we assume PYn |Xn = (PY|X )⊗n .
Fix n ≥ 1. Choose a PX such that E[c(X)] < P, Pick log M = n(I(X; Y) − 2δ) and log γ =
n(I(X; Y) − δ).
P
With the input constraint set Fn = {xn : 1n c(xk ) ≤ P}, and iid input distribution PXn = P⊗ n
X ,
we apply the extended Feinstein’s lemma. This shows existence of an (n, M, ϵn , P)max -code with
the encoder satisfying input constraint Fn and vanishing (maximal) error probability
ϵn PXn [Fn ] ≤ P[i(Xn ; Yn ) ≤ n(I(X; Y) − δ)] + exp(−nδ)
| {z } | {z } | {z }
→1 →0 as n→∞ by WLLN and stationary memoryless assumption →0
Indeed, the first term is vanishing by the weak law of large numbers: since E[c(X)] < P, we have
P
PXn (Fn ) = P[ 1n c(xk ) ≤ P] → 1. Since ϵn → 0 this implies that for every ϵ > 0 we have
1
Cϵ (P) ≥ log M = I(X; Y) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ Cϵ (P) ≥ sup lim (I(X; Y) − 2δ)
PX :E[c(X)]<P δ→0
where the last equality is from the continuity of C(I) on (P0 , ∞) by Proposition 20.4.
For a general information stable channel, we just need to use the definition to show that
P[i(Xn ; Yn ) ≤ n(C(I) − δ)] → 0, and the rest of the proof follows similarly.
i i
i i
i i
340
Theorem 20.9 (Channel capacity under cost constraint). For an information stable channel with
cost constraint and for any admissible constraint P we have
Proof. The boundary case of P = P0 is treated in Ex. IV.10, which shows that C(P0 ) = C(I) (P0 )
even though C(I) (P) may be discontinuous at P0 . So assume P > P0 next. Theorem 20.6 shows
(I)
Cϵ (P) ≤ C1−ϵ (P)
, thus C(P) ≤ C(I) (P). On the other hand, from Theorem 20.8 we have C(P) ≥
C(I) (P).
Z ∼ N (0, σ 2 )
X Y
+
Definition 20.10 (The stationary AWGN channel). The Additive White Gaussian Noise (AWGN)
channel is a stationary memoryless additive-noise channel with separable cost constraint: A =
B = R, c(x) = x2 , and a single-letter kernel PY|X given by Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥⊥ X.
The n-letter kernel is given by a product extension, i.e. Yn = Xn + Zn with Zn ∼ N (0, In ). When
the power constraint is E[c(X)] ≤ P we say that the signal-to-noise ratio (SNR) equals σP2 .
The terminology white noise refers to the fact that the noise variables are uncorrelated across
time. This makes the power spectral density of the process {Zj } constant in frequency (or “white”).
We often drop the word stationary when referring to this channel. The definition we gave above is
more correctly should be called the real AWGN, or R-AWGN, channel. The complex AWGN, or
C-channel is defined similarly: A = B = C, c(x) = |x|2 , and Yn = Xn + Zn , with Zn ∼ Nc (0, In )
being the circularly symmetric complex gaussian.
Theorem 20.11. For the stationary AWGN channel, the channel capacity is equal to information
capacity, and is given by:
1 P
( I)
C(P) = C (P) = log 1 + 2 for R-AWGN (20.4)
2 σ
P
C(P) = C(I) (P) = log 1 + 2 for C-AWGN
σ
i i
i i
i i
Then using Theorem 5.11 (the Gaussian saddle point) to conclude X ∼ N (0, P) (or Nc (0, P)) is
the unique capacity-achieving input distribution.
At this point it is also instructive to revisit Section 6.2* which shows that Gaussian capacity
can in fact be derived essentially without solving the maximization of mutual information: the
Euclidean rotational symmetry implies the optimal input should be Gaussian.
There is a great deal of deep knowledge embedded in the simple looking formula of Shan-
non (20.4). First, from the engineering point of view we immediately see that to transmit
information faster (per unit time) one needs to pay with radiating at higher power. Second, the
amount of energy spent per transmitted information bit is minimized by solving
P log 2
inf = 2σ 2 loge 2 (20.5)
P>0 C(P)
and is achieved by taking P → 0. (We will discuss the notion of energy-per-bit more in
Section 21.1.) Thus, we see that in order to maximize communication rate we need to send
powerful, high-power waveforms. But in order to minimize energy-per-bit we need to send in
very quiet “whisper” and at very low communication rate.1 In either case the waveforms of good
error-correcting codes should look like samples of the white gaussian process.
Third, from the mathematical point of view, formula (20.4) reveals certain properties of high-
dimensional Euclidean geometry
√ as follows. Since Zn ∼ N (0, σ 2 ), then with high probability,
kZ k2 concentrates around nσ 2 . Similarly, due the power constraint and the fact that Zn ⊥
n
⊥ Xn , we
n 2 n 2 n 2
have E kY k = E kY p k + E kZ k ≤ n(P + σ 2 ) and the received vector Yn lies in an ℓ√ 2 -ball
of radius approximately n(P + σ 2 ). Since the noise √ can at most perturb the codeword p by nσ 2
in Euclidean distance, if we can pack M balls of radius nσ 2 into the ℓ2 -ball of radius n(P + σ 2 )
centered at the origin, this yields a good codebook and decoding regions – see Fig. 20.1 for an
illustration. So how large can M be? Note that the volume of an ℓ2 -ball of radius r in Rn is given by
2 n/ 2 n/2
cn rn for some constant cn . Then cn (cnn((Pn+σ ))
= 1 + σP2 . Taking the log and dividing by n, we
σ 2 ) n/ 2
∗
get n log M ≈ 2 log 1 + σ2 . This tantalazingly convincing reasoning, however, is flawed in at
1 1 P
least two ways. (a) Computing the volume ratio only gives an upper bound on the maximal number
of disjoint balls (See Section 27.2 for an extensive discussion on this topic.) (b) Codewords need
not correspond to centers of disjoint ℓ2 -balls. √ Indeed, the fact that we allow some vanishing (but
non-zero) probability of error means that the nσ 2 -balls are slightly overlapping and Shannon’s
formula establishes the maximal number of such partially overlapping balls that we can pack so
that they are (mostly) inside a larger ball.
Theorem 20.11 applies to Gaussian noise. What if the noise is non-Gaussian and how sensi-
tive is the capacity formula 12 log(1 + SNR) to the Gaussian assumption? Recall the Gaussian
1
This explains why, for example, the deep space probes communicate with earth via very low-rate codes and very long
blocklengths.
i i
i i
i i
342
c3
c4
p n
√ c2
nσ 2
(P
c1
+
σ
2
)
c5
c8
···
c6
c7
cM
saddlepoint result we have studied in Chapter 5 where we showed that for the same variance,
Gaussian noise is the worst which shows that the capacity of any non-Gaussian noise is at least
1
2 log(1 + SNR). Conversely, it turns out the increase of the capacity can be controlled by how
non-Gaussian the noise is (in terms of KL divergence). The following result is due to Ihara [163].
Theorem 20.12 (Additive Non-Gaussian noise). Let Z be a real-valued random variable indepen-
dent of X and EZ2 < ∞. Let σ 2 = Var Z. Then
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (EZ, σ 2 )).
2 σ PX :EX2 ≤P 2 σ
Remark 20.1. The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z,
where N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian
density, say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to
AWGN, which still scales as 21 log SNR in the high-SNR regime. On the other hand, if Z is discrete,
then D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite
because the noise is “too weak”.
i i
i i
i i
Figure 20.2 Power allocation via water-filling. Here, the second branch is too noisy (σ2 too big) for the
amount of available power P and the optimal coding should discard (input zeros to) this branch altogether.
1X + T
L
C = log
2 σj2
j=1
X
L
P = |T − σj2 |+
j=1
Proof.
X
L
≤ P
sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[X2k ]≤Pk
X
L
1 Pk
= P
sup log(1 + )
Pk ≤P,Pk ≥0 k=1 2 σk2
with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last maximiza-
P
tion problem – power allocation: Denote the Lagragian multipliers for the constraint Pk ≤ P by
P1 P
λ and for the constraint Pk ≥ 0 by μk . We want to solve max 2 log(1 + σk2 )− μk Pk +λ(P −
Pk
Pk ).
First-order condition on Pk gives that
1 1
2
= λ − μk , μk Pk = 0
2 σk + Pk
X
L
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1
i i
i i
i i
344
On Fig. 20.2 we give a visual interpretation of the waterfilling solution. It has a number of
practically important conclusions. First, it gives a precise recipee for how much power to allocate
to different frequency bands. This solution, simple and elegant, was actually pivotal for bringing
high-speed internet to many homes (via cable modems): initially, before information theorists
had a say, power allocations were chosen on the basis of costly and imprecise simulations of real
codes. Simplicity of the waterfilling power allocation allowed to make power allocation dynamic
and enable instantaneous reaction to changing noise environments.
Second, there is a very important consequence for multiple-antenna (MIMO) communication.
Given nr receive antennas and nt transmit antennas, very often one gets as a result a parallel AWGN
with L = min(nr , nt ) branches. For a single-antenna system the capacity then scales as 12 log P with
increasing power (Theorem 20.11), while the capacity for a MIMO AWGN channel is approxi-
mately L2 log( PL ) ≈ L2 log P for large P. This results in a L-fold increase in capacity at high SNR.
This is the basis of a powerful technique of spatial multiplexing in MIMO, largely behind much
of advance in 4G, 5G cellular (3GPP) and post-802.11n WiFi systems.
Notice that spatial diversity (requiring both receive and transmit antennas) is different from a
so-called multipath diversity (which works even if antennas are added on just one side). Indeed,
if a single stream of data is sent through every parallel channel simultaneously, then sufficient
statistic would be to average the received vectors, resulting in a the effective noise level reduced
by L1 factor. The result is capacity increase from 12 log P to 12 log(LP) – a far cry from the L-fold
increase of spatial multiplexing. These exciting topics are explored in excellent books [312, 190].
Theorem 20.16. Assume that for every T the following limits exist:
1X1
n
T
C̃(I) (T) = lim log+ 2
n→∞ n 2 σj
j=1
1X
n
P̃(T) = lim |T − σj2 |+ .
n→∞ n
j=1
Then the capacity of the non-stationary AWGN channel is given by the parameterized form:
C(T) = C̃(I) (T) with input power constraint P̃(T).
Proof. Fix T > 0. Then it is clear from the waterfilling solution that
X
n
1 T
sup I(Xn ; Yn ) = log+ , (20.6)
2 σj2
j=1
i i
i i
i i
1X
n
E[c(Xn )] ≤ |T − σj2 |+ . (20.7)
n
j=1
Now, by assumption, the LHS of (20.7) converges to P̃(T). Thus, we have that for every δ > 0
Taking δ → 0 and invoking continuity of P 7→ C(I) (P), we get that the information capacity
satisfies
and thus
X
n
1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
Non-stationary AWGN is primarily interesting due to its relationship to the additive colored
gaussian noise channel in the following section.
Theorem 20.18. The capacity of the ACGN channel with fZ (ω) > 0 for almost every ω ∈ [−π , π ]
is given by the following parametric form:
Z 2π
1 1 T
C ( T) = log+ dω,
2π 0 2 fZ (ω)
Z 2π
1
P ( T) = |T − fZ (ω)|+ dω.
2π 0
i i
i i
i i
346
Figure 20.3 The ACGN channel: the “whitening” process used in the capacity proof and the waterfilling
solution.
en = UXn and Y
Since Cov(Zn ) is positive semi-definite, U is a unitary matrix. Define X en = UYn ,
the channel between Xen and Yen is thus
en = X
Y en + UZn ,
e
Cov(UZn ) = U · Cov(Zn ) · U∗ = Σ
1 X
n
lim |T − σj2 |+ = P(T).
n→∞ n
j=1
e
Finally since U is unitary, C = C.
The idea used in the proof as well as the waterfilling power allocation are illustrated on Fig. 20.3.
Note that most of the time the noise that impacts real-world systems is actually “born” white
(because it’s a thermal noise). However, between the place of its injection and the processing there
are usually multiple circuit elements. If we model them linearly then their action can equivalently
be described as the ACGN channel, since the effective noise added becomes colored. In fact, this
i i
i i
i i
20.7* Additive White Gaussian Noise channel with Intersymbol Interference 347
filtering can be inserted deliberately in order to convert the actual channel into an additive noise
one. This is the content of the next section.
Definition 20.19 (AWGN with ISI). An AWGN channel with ISI is a channel with memory that
is defined as follows: the alphabets are A = B = R, and the separable cost is c(x) = x2 . The
channel law PYn |Xn is given by
X
n
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1
i.i.d.
where Zk ∼ N (0, σ 2 ) is white Gaussian noise, {hk , k = −∞, . . . , ∞} are coefficients of a discrete-
time channel filter.
The coefficients {hk } describe the action of the environment. They are often learned by the
receiver during the “handshake” process of establishing a communication link.
Theorem 20.20. Suppose that the sequence {hk } is the inverse Fourier transform of a frequency
response H(ω):
Z 2π
1
hk = eiωk H(ω)dω .
2π 0
Assume also that H(ω) is a continuous function on [0, 2π ]. Then the capacity of the AWGN channel
with ISI is given by
Z 2π
1 1
C ( T) = log+ (T|H(ω)|2 )dω
2π 0 2
Z 2π +
1 1
P ( T) = T − dω
2π 0 |H(ω)| 2
Proof sketch. At the decoder apply the inverse filter with frequency response ω 7→ 1
H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
i i
i i
i i
348
The capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. The
number of atoms is Ω(A) and O(A2 ) as A → ∞. Moreover,
1 2A2 1
log 1 + ≤ C(A) ≤ log 1 + A2
2 eπ 2
Capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. Moreover,
the convergence speed of limA→∞ C(A, P) = 21 log(1 + P) is of the order e−O(A ) .
2
For details, see [290], [247, Section III] and [109, 253] for the O(A2 ) bound on the number of
atoms.
i i
i i
i i
There are two drastically different cases of fading channels, depending on the presence or
absence of the dashed link on Fig. 20.4. In the first case, known as the coherent case or the CSIR
case (for channel state information at the receiver), the receiver is assumed to have perfect esti-
mate of the channel state information Hi at every time i. In other words, the channel output is
effectively (Yi , Hi ). This situation occurs, for example, when there are pilot signals sent period-
ically and are used at the receiver to estimate the channel. in some cases, the i then references
different frequencies or sub-channels of an OFDM frame).
Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
1 P|H|2
C(P) = E log 1 +
2 σ2
and the capacity achieving input distribution is the usual PX = N (0, P). Note that the capacity
C(P) is in the order of log P and we call the channel “energy efficient”.
In the second case, known as non-coherent or no-CSIR, the receiver does not have any knowl-
edge of the Hi . In this case, there is no simple expression for the channel capacity. Most of the
i.i.d.
known results were shown for the case of Hi ∼ according to the Rayleigh distribution. In this
case, the capacity achieving input distribution is discrete [2], and the capacity scales as [304, 191]
C(P) = O(log log P), P→∞ (20.10)
This channel is said to be “energy inefficient” since increase in communication rate requires
dramatic expenditures in power.
Further generalization of the Gaussian channel models requires introducing multiple input and
output antennas. In this case, the single-letter input Xi ∈ Cnt and the output Yi ∈ Cnr are related
by
Yi = Hi Xi + Zi , (20.11)
i.i.d.
where Zi ∼ CN (0, σ 2 Inr ), nt and nr are the number of transmit and receive antennas, and Hi ∈
Cnt ×nr is a matrix-valued channel gain process. An incredible effort in the 1990s and 2000s was
i i
i i
i i
350
i i
i i
i i
In this chapter we will consider an interesting variation of the channel coding problem. Instead
of constraining the blocklength (i.e. the number of channel uses), we will constrain the total cost
incurred by the codewords. The motivation is the following. Consider a deep space probe which
has a k bit message that needs to be delivered to Earth (or a satellite orbiting it). The duration of
transmission is of little worry for the probe, but what is really limited is the amount of energy it has
stored in its battery. In this chapter we will learn how to study this question abstractly, how coding
over large number of bits k → ∞ reduces the energy spent (per bit), and how this fundamental
limit is related to communication over continuous-time channels.
Note that in this chapter we have denoted the noise level for Zi to be N20 . There is a long tradition for
such a notation. Indeed, most of the noise in communication systems is a white thermal noise at the
receiver. The power spectral density of that noise is flat and denoted by N0 (in Joules per second
per Hz). However, recall that received signal is complex-valued and, thus, each real component
has power N20 . Note also that thermodynamics suggests that N0 = kT, where k = 1.38 × 10−23 is
the Boltzmann constant, and T is the absolute temperature in Kelvins.
In previous chapter, we analyzed the maximum number of information messages (M∗ (n, ϵ, P))
that can be sent through this channel for a given n number of channel uses and under the power
constraint P. We have also hinted that in (20.5) that there is a fundamental minimal required cost
to send each (data) bit. Here we develop this question more rigorously. Everywhere in this chapter
for v ∈ R∞ or u ∈ C∞ we define
X
∞ X
∞
kvk22 = v2j , kuk22 = | uj | 2 .
j=1 j=1
Definition 21.1 ((E, 2k , ϵ)-code). Given a Markov kernel with input space R∞ or C∞ we define
an (E, 2k , ϵ)-code to be an encoder-decoder pair, f : [2k ] → R∞ and g : R∞ → [2k ] (or similar
351
i i
i i
i i
352
randomized versions), such that for all messages m ∈ [2k ] we have kf(m)k22 ≤ E and
P[g(Y∞ ) 6= W] ≤ ϵ ,
The operational meaning of E∗ (k, ϵ) should be apparent: it is the minimal amount of energy the
space probe needs to draw from the battery in order to send k bits of data.
Theorem 21.2 ((Eb /N0 )min = −1.6dB). For the AWGN channel we have
E∗ (k, ϵ) N0
lim lim sup = . (21.2)
ϵ→0 k→∞ k log2 e
Remark 21.1. This result, first obtained by Shannon [277], is colloquially referred to as: mini-
mal Eb /N0 (pronounced “eebee over enzero” or “ebno”) is −1.6 dB. The latter value is simply
10 log10 ( log1 e ) ≈ −1.592. Achieving this value of the ebno was an ultimate quest for coding
2
theory, first resolved by the turbo codes [30]. See [73] for a review of this long conquest.
Proof. We start with a lower bound (or the “converse” part). As usual, we have the working
probability space
W → X∞ → Y∞ → Ŵ .
log e X EX2i
∞
≤ linearization of log
2 N0 /2
i=1
E
≤ log e
N0
Thus, we have shown
E∗ (k, ϵ) N0 h(ϵ)
≥ (ϵ − )
k log e k
i i
i i
i i
E∗ (kn , ϵ) nP
lim sup ≤ lim sup ∗ (n, ϵ, P)
n→∞ kn n→∞ log M
P
=
lim infn→∞ 1n log M∗max (n, ϵ, P)
P
= 1 ,
2 log( 1 + N0P/2 )
where in the last step we applied Theorem 20.11. Now the above statement holds for every P > 0,
so let us optimize it to get the best bound:
E∗ (kn , ϵ) P
lim sup ≤ inf 1 P
n→∞ kn P≥0
2 log(1 + N0 / 2 )
P
= lim
P→0 1 log(1 + P
2 N0 / 2 )
N0
= (21.3)
log2 e
Note that the fact that minimal energy per bit is attained at P → 0 implies that in order to send
information reliably at the Shannon limit of −1.6dB, infinitely many time slots are needed. In
other words, the information rate (also known as spectral efficiency) should be vanishingly small.
Conversely, in order to have non-zero spectral efficiency, one necessarily has to step away from
the −1.6 dB. This tradeoff is known as spectral efficiency vs. energy-per-bit.
We next can give a simpler and more explicit construction of the code, not relying on the random
coding implicit in Theorem 20.11. Let M = 2k and consider the following code, known as the
pulse-position modulation (PPM):
√
PPM encoder: ∀m, f(m) = cm ≜ (0, 0, . . . , |{z} E ,...) (21.4)
m-th location
It is not hard to derive an upper bound on the probability of error that this code achieves [242,
Theorem 2]:
" ( r ! )#
2E
ϵ ≤ E min MQ + Z ,1 , Z ∼ N (0, 1) . (21.5)
N0
i i
i i
i i
354
Indeed, our orthogonal codebook under a maximum likelihood decoder has probability of error
equal to
Z ∞" r !#M−1 √
(z− E)2
1 2 − N
Pe = 1 − √ 1−Q z e 0 dz , (21.6)
πN0 −∞ N0
which is obtained by observing that conditioned on (W = j,q Zj ) the events {||cj + z||2 ≤ ||cj +
z − ci ||2 }, i 6= j are independent. A change of variables x = N20 z and application of the bound
1 − (1 − y)M−1 ≤ min{My, 1} weakens (21.6) to (21.5).
To see that (21.5) implies (21.3), fix c > 0 and condition on |Z| ≤ c in (21.5) to relax it to
r
2E
ϵ ≤ MQ( − c) + 2Q(c) .
N0
Recall the expansion for the Q-function [322, (3.53)]:
x2 log e 1
log Q(x) = − − log x − log 2π + o(1) , x→∞ (21.7)
2 2
Thus, choosing τ > 0 and setting E = (1 + τ )k logN0 e we obtain
2
r
2E
2k Q( − c) → 0
N0
as k → ∞. Thus choosing c > 0 sufficiently large we obtain that lim supk→∞ E∗ (k, ϵ) ≤ (1 +
τ ) logN0 e for every τ > 0. Taking τ → 0 implies (21.3).
2
Remark 21.2 (Simplex conjecture). The code (21.4) in fact achieves the first three terms in the
large-k expansion of E∗ (k, ϵ), cf. [242, Theorem 3]. In fact, the code can be further slightly opti-
√ √
mized by subtracting the common center of gravity (2−k E, . . . , 2−k E . . .) and rescaling each
codeword to satisfy the power constraint. The resulting constellation is known as the simplex code.
It is conjectured to be the actual optimal code achieving E∗ (k, ϵ) for a fixed k and ϵ, see [75, Section
3.16] and [294] for more.
i i
i i
i i
P[g(Y∞ ) 6= W] ≤ ϵ ,
Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined
in (20.1)). Assuming P0 = 0 and C(0) = 0 it is not hard to show (basically following the steps of
Theorem 21.2) that:
C(P) C(P) d
Cpuc = sup = lim = C(P) .
P P P→0 P dP P=0
The surprising discovery of Verdú [321] is that one can avoid computing C(P) and derive the Cpuc
directly. This is a significant help, as for many practical channels C(P) is unknown. Additionally,
this gives a yet another fundamental meaning to divergence.
Q
Theorem 21.4. For a stationary memoryless channel PY∞ |X∞ = PY|X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
D(PY|X=x kPY|X=x0 )
Cpuc = sup .
x̸=x0 c(x)
Proof. Let
D(PY|X=x kPY|X=x0 )
CV = sup .
x̸=x0 c(x)
i i
i i
i i
356
where we denoted for convenience d(x) ≜ D(PY|X=x kPY|X=x0 ). By the definition of CV we have
d(x) ≤ c(x)CV .
Thus, continuing (21.9) we obtain
" #
X
∞
(1 − ϵ) log M + h(ϵ) ≤ CV E c(Xt ) ≤ CV · E ,
t=1
where the last step is by the cost constraint (21.8). Thus, dividing by E and taking limits we get
Cpuc ≤ CV .
Achievability: We generalize the PPM code (21.4). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:
f ( 1 ) = ( x1 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.10)
| {z } | {z }
n-times n(M−1)-times
f ( 2 ) = ( x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.11)
| {z } | {z } | {z }
n-times n-times n(M−2)-times
··· (21.12)
f(M) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 ) (21.13)
| {z } | {z }
n(M−1)-times n-times
Now, by Stein’s lemma (Theorem 14.13) there exists a subset S ⊂ Y n with the property that
P[Yn ∈ S|Xn = (x1 , . . . , x1 )] ≥ 1 − ϵ1 (21.14)
P[Yn ∈ S|Xn = (x0 , . . . , x0 )] ≤ exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} . (21.15)
Therefore, we propose the following (suboptimal!) decoder:
Yn ∈ S =⇒ Ŵ = 1 (21.16)
n+1 ∈ S
Y2n =⇒ Ŵ = 2 (21.17)
··· (21.18)
From the union bound we find that the overall probability of error is bounded by
ϵ ≤ ϵ1 + M exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} .
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and
after straightforward manipulations, we conclude that
D(PY|X=x1 kPY|X=x0 )
Cpuc ≥ .
c(x1 )
i i
i i
i i
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥
CV , as required.
Yj = Hj Xj + Zj , Hj ∼ N c ( 0, 1) ⊥
⊥ Zj ∼ Nc (0, N0 )
(we use here a more convenient C-valued fading channel, the Hj ∼ Nc is known as the Rayleigh
fading). The cost function is the usual quadratic one c(x) = |x|2 . As we discussed previously,
cf. (20.10), the capacity-cost function C(P) is unknown in closed form, but is known to behave
drastically different from the case of non-fading AWGN (i.e. when Hj = 1). So here Theorem 21.4
comes handy. Let us perform a simple computation required, cf. (2.9):
D(Nc (0, |x|2 + N0 )kNc (0, N0 ))
Cpuc = sup (21.19)
x̸=0 | x| 2
| x| 2
1 log ( 1 + )
= sup log e − | x| 2
N0
(21.20)
N0 x̸=0
N0
log e
= (21.21)
N0
Comparing with Theorem 21.2 we discover that surprisingly, the capacity-per-unit-cost is unaf-
fected by the presence of fading. In other words, the random multiplicative noise which is so
detrimental at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C′ (0)
and thus computing Cpuc corresponds to computing C(P) at P → 0). There is one important differ-
ence: the supremization over x in (21.20) is solved at x = ∞. Following the proof of the converse
bound, we conclude that any code hoping to achieve optimal Cpuc must satisfy a strange constraint:
X X
|xt |2 1{|xt | ≥ A} ≈ | xt | 2 ∀A > 0
t t
i.e. the total energy expended by each codeword must be almost entirely concentrated in very
large spikes. Such a coding method is called “flash signalling”. Thus, we can see that unlike the
non-fading AWGN (for which due to rotational symmetry all codewords can be made relatively
non-spiky), the only hope of achieving full Cpuc in the presence of fading is by signalling in short
bursts of energy.
This effect manifests itself in the speed of convergence to Cpuc with increasing constellation
∗
sizes. Namely, the energy-per-bit E (kk,ϵ) behaves asymptotically as
r
E∗ (k, ϵ) const −1
= (−1.59 dB) + Q (ϵ) (AWGN) (21.22)
k k
i i
i i
i i
358
14
12
10
Achievability
8
Converse
dB
2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k
Figure 21.1 Comparing the energy-per-bit required to send a packet of k-bits for different channel models
∗
(curves represent upper and lower bounds on the unknown optimal value E (k,ϵ) k
). As a comparison: to get to
−1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading AWGN or fading AWGN with
Hj known perfectly at the receiver. For fading AWGN without knowledge of Hj (noCSI), one has to code over
at least 7 · 107 data bits to get to the same −1.5 dB. Plot generated via [291].
r
E∗ (k, ϵ) 3 log k −1 2
= (−1.59 dB) + (Q (ϵ)) (non-coherent fading) (21.23)
k k
That is we see that the speed of convergence to Shannon limit is much slower under fading.
Fig. 21.1 shows this effect numerically by plotting evaluation of (the upper and lower bounds
for) E∗ (k, ϵ) for the fading and non-fading channels. See [340] for details.
i i
i i
i i
E[Ws Wt ] = min(s, t) .
Let M∗ (T, ϵ, P) the maximum number of messages that can be sent through this channel such that
given an encoder f : [M] → L2 [0, T] for each m ∈ [M] the waveform x(t) ≜ f(m)
and the decoding error probability P[Ŵ 6= W] ≤ ϵ. This is a natural extension of the previously
defined log M∗ functions to continuous-time setting.
We prove the capacity result for this channel next.
Theorem 21.5. The maximal reliable rate of communication across the continuous-time AWGN
channel is NP0 log e (per unit of time). More formally, we have
1 P
lim lim inf log M∗ (T, ϵ, P) = log e (21.24)
ϵ→0 T→∞ T N0
Proof. Note that the space of all square-integrable functions on [0, T], denoted L2 [0, T] has count-
able basis (e.g. sinusoids). Thus, by expanding our input and output waveforms in that basis we
obtain an equivalent channel model:
N0
Ỹj = X̃j + Z̃j , Z̃j ∼ N (0, ),
2
and energy constraint (dependent upon duration T):
X
∞
X̃2j ≤ PT .
j=1
But then the problem is equivalent to the energy-per-bit for the (discrete-time) AWGN channel
(see Theorem 21.2) and hence
Thus,
1 P P
lim lim inf log2 M∗ (T, ϵ, P) = E∗ (k,ϵ)
= log2 e ,
ϵ→0 n→∞ T limϵ→0 lim supk→∞ N0
k
i i
i i
i i
360
In other words, the capacity of this channel is B log(1 + NP0 B ). To understand the idea of the proof,
we need to recall the concept of modulation first. Every signal X(t) that is required to live in
[fc − B/2, fc + B/2] frequency band can be obtained by starting with a complex-valued signal XB (t)
with frequency content in [−B/2, B/2] and mapping it to X(t) via the modulation:
√
X(t) = Re(XB (t) 2ejωc t ) ,
where ωc = 2πfc . Upon receiving the sum Y(t) = X(t) + N(t) of the signal and the white noise
N(t) we may demodulate Y by computing
√
YB (t) = 2LPF(Y(t)ejωc t ), ,
where the LPF is a low-pass filter removing all frequencies beyond [−B/2, B/2]. The important
fact is that converting from Y(t) to YB (t) does not lose information.
Overall we have the following input-output relation:
e ( t) ,
YB (t) = XB (t) + N
e ( t) N
E[ N e (s)∗ ] = N0 δ(t − s).
1
Here we already encounter a major issue: the waveform x(t) supported on a finite interval (0, T] cannot have spectrum
supported on a compact. The requiremes of finite duration and finite spectrum are only satisfied by the zero waveform.
Rigorously, one should relax the bandwidth constraint to requiring that the signal have a vanishing out-of-band energy as
T → ∞. As we said, rigorous treatment of this issue lead to the theory of prolate spheroidal functions [287].
i i
i i
i i
where sincB (x) = sin(xBx) and Xi = XB (i/B). After the Nyquist sampling on XB and YB we get the
following equivalent input-output relation:
Yi = Xi + Zi , Zi ∼ Nc (0, N0 ) (21.26)
R∞
where the noise Zi = t=−∞ N e (t)sincB (t − i )dt. Finally, given that XB (t) is only non-zero for
B
t ∈ (0, T] we see that the C-AWGN channel (21.26) is only allowed to be used for i = 1, . . . , TB.
This fact is known in communication theory as “bandwidth B and duration T signal has BT complex
degrees of freedom”.
Let us summarize what we obtained so far:
i i
i i
i i
362
i i
i i
i i
In previous chapters our main object of study was the fundamental limit of blocklenght-n coding:
Finally, the finite blocklength information theory strives to prove the sharpest possible computa-
tional bounds on log M∗ (n, ϵ) at finite n, which allows evaluating real-world codes’ performance
taking their latency n into account. These results are surveyed in this chapter.
Theorem 22.1. For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we have
Cϵ = C for 0 < ϵ < 1. Equivalently, for every 0 < ϵ < 1 we have
363
i i
i i
i i
364
Pe
1
10−1
10−2
10−3
10−4
10−5
SNR
In other words, below a certain critical SNR, the probability of error quickly approaches 1, so that
the receiver cannot decode anything meaningful. Above the critical SNR the probability of error
quickly approaches 0 (unless there is an effect known as the error floor, in which case probability
of error decreases reaches that floor value and stays there regardless of the further SNR increase).
Proof. We will improve the method used in the proof of Theorem 17.3. Take an (n, M, ϵ)-code
and consider the usual probability space
W → Xn → Yn → Ŵ ,
where W ∼ Unif([M]). Note that PXn is the empirical distribution induced by the encoder at the
channel input. Our goal is to replace this probability space with a different one where the true
channel PYn |Xn = P⊗ n
Y|X is replaced with a “dummy” channel:
i i
i i
i i
Therefore, the random variable 1{Ŵ=W} is likely to be 1 under P and likely to be 0 under Q. It
thus looks like a rather good choice for a binary hypothesis test statistic distinguishing the two
distributions, PW,Xn ,Yn ,Ŵ and QW,Xn ,Yn ,Ŵ . Since no hypothesis test can beat the optimal (Neyman-
Pearson) test, we get the upper bound
1
β1−ϵ (PW,Xn ,Yn ,Ŵ , QW,Xn ,Yn ,Ŵ ) ≤ (22.2)
M
(Recall the definition of β from (14.3).) The likelihood ratio is a sufficient statistic for this
hypothesis test, so let us compute it:
PW,Xn ,Yn ,Ŵ PW PXn |W PYn |Xn PŴ|Yn PW|Xn PXn ,Yn PŴ|Yn PXn ,Yn
= ⊗
= =
QW,Xn ,Yn ,Ŵ n
PW PXn |W (QY ) PŴ|Yn PW|Xn PXn (QY )⊗n PŴ|Yn PXn (QY )⊗n
From the Neyman Pearson test, the optimal HT takes the form
⊗n PXn Yn PXn Yn
βα (PXn Yn , PXn (QY ) ) = Q log ≥ γ where α = P log ≥γ
| {z } | {z } PXn (QY )⊗n PXn (QY )⊗n
P Q
i i
i i
i i
366
Putting this together with our main bound (22.3), we see that any (n, M, ϵ) code for the BSC
satisfies
1
log M ≤ nD(Ber(δ)kBer( )) + o(n) = nC + o(n) .
2
Clearly, this implies the strong converse for the BSC.
For the general channel, let us denote by P∗Y the capacity achieving output distribution. Recall
that by Corollary 5.5 it is unique and by (5.1) we have for every x ∈ A:
This property will be very useful. We next consider two cases separately:
1 If |B| < ∞ we take QY = P∗Y and note that from (19.31) we have
X
PY|X (y|x0 ) log2 PY|X (y|x0 ) ≤ log2 |B| ∀ x0 ∈ A
y
and since miny P∗Y (y) > 0 (without loss of generality), we conclude that for some constant
K > 0 and for all x0 ∈ A we have
PY|X (Y|X = x0 )
Var log | X = x0 ≤ K < ∞ .
QY ( Y )
Thus, if we let
X
n
PY|X (Yi |Xi )
Sn = log ,
P∗Y (Yi )
i=1
then we have
1X
n
P̂c (x) ≜ 1{cj = x} .
n
j=1
By simple counting it is clear that from any (n, M, ϵ) code, it is possible to select an (n, M′ , ϵ)
subcode, such that a) all codeword have the same composition P0 ; and b) M′ > (n+1M )|A|−1
. Note
′ ′
that, log M = log M + O(log n) and thus we may replace M with M and focus on the analysis of
the chosen subcode. Then we set QY = PY|X ◦ P0 . From now on we also assume that P0 (x) > 0
for all x ∈ A (otherwise just reduce A). Let i(x; y) denote the information density with respect
i i
i i
i i
to P0 PY|X . If X ∼ P0 then I(X; Y) = D(PY|X kQY |P0 ) ≤ log |A| < ∞ and we conclude that
PY|X=x QY for each x and thus
dPY|X=x
i(x; y) = log ( y) .
dQY
From (19.28) we have
So if we define
X
n
dPY|X=Xi (Yi |Xi ) X n
Sn = log ( Yi ) = i(Xi ; Yi ) ,
dQY
i=1 i=1
1−ϵ
γβ1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≥ ,
2
which then implies
√
log β1−ϵ (PXn Yn , PXn (QY )n ) ≥ −nC + O( n) .
We note several lessons from this proof. First, we basically followed the same method as in the
proof of the weak converse, except instead of invoking data-processing inequality for divergence,
we analyzed the hypothesis testing problem explicitly. Second, the bound on variance of the infor-
mation density is important. Thus, while the AWGN channel is excluded by the assumptions of
the Theorem, the strong converse for it does hold as well (see Ex. IV.9). Third, this method of
proof is also known as “sphere-packing”, for the reason that becomes clear if we do the example
of the BSC slightly differently (see Ex. IV.8).
i i
i i
i i
368
Thus by Theorem 19.9 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the ϵ-capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
single-letter information density is given by
( (
log am , w.p. amm 2nC + O(log n), w.p. amm
i(X; Y) = = .
log bm , w.p. 1 − amm O( 1n ), w.p. 1 − amm
Thus, for blocklength-n density we have
1X
n
1 n n d 1 1 am d
i( X ; Y ) = i(Xi ; Yi ) = O( ) + (2C + O( log n)) · Bin(n, )−→2C · Poisson(1/2) ,
n n n n m
i=1
i i
i i
i i
In particular,
Cϵ ≥ 2C ∀ϵ > e−1/2
22.3 Meta-converse
We have seen various ways in which one can derive upper (impossibility or converse) bounds on
the fundamental limits such as log M∗ (n, ϵ). In Theorem 17.3 we used data-processing and Fano’s
inequalities. In the proof of Theorem 22.1 we reduced the problem to that of hypothesis testing.
There are many other converse bounds that were developed over the years. It turns out that there
is a very general approach that encompasses all of them. For its versatility it is sometimes referred
to as the “meta-converse”.
To describe it, let us fix a Markov kernel PY|X (usually, it will be the n-letter channel PYn |Xn ,
but in the spirit of “one-shot” approach, we avoid introducing blocklength). We are also given a
certain (M, ϵ) code and the goal is to show that there is an upper bound on M in terms of PY|X and
ϵ. The essence of the meta-converse is described by the following diagram:
PY |X
W Xn Yn Ŵ
QY |X
Here the W → X and Y → Ŵ represent encoder and decoder of our fixed (M, ϵ) code. The upper
arrow X → Y corresponds to the actual channel, whose fundamental limits we are analyzing. The
lower arrow is an auxiliary channel that we are free to select.
The PY|X or QY|X together with PX (distribution induced by the code) define two distribu-
tions: PX,Y and QX,Y . Consider a map (X, Y) 7→ Z ≜ 1{W 6= Ŵ} defined by the encoder and
decoder pair (if decoders are randomized or W → X is not injective, we consider a Markov kernel
PZ|X,Y (1|x, y) = P[Z = 1|X = x, Y = y] instead). We have
PX,Y [Z = 0] = 1 − ϵ, QX , Y [ Z = 0] = 1 − ϵ ′ ,
where ϵ and ϵ′ are the average probilities of error of the given code under the PY|X and QY|X
respectively. This implies the following relation for the binary HT problem of testing PX,Y vs
QX,Y :
i i
i i
i i
370
The high-level idea of the meta-converse is to select a convenient QY|X , bound 1 − ϵ′ from above
(i.e. prove a converse result for the QY|X ), and then use the Neyman-Pearson β -function to lift the
Q-channel converse to P-channel.
How one chooses QY|X is a matter of art. For example, in the proof of Case 2 of Theorem 22.1
we used the trick of reducing to the constant-composition subcode. This can instead be done by
taking QYn |Xn =c = (PY|X ◦ P̂c )⊗n . Since there are at most (n + 1)|A|−1 different output distributions,
we can see that
(n + 1)∥A|−1
1 − ϵ′ ≤ ,
M
and bounding of β can be done similar to Case 2 proof of Theorem 22.1. For channels with
|A| = ∞ the technique of reducing to constant-composition codes is not available, but the meta-
converse can still be applied. Examples include proof of parallel AWGN channel’s dispersion [238,
Theorem 78] and the study of the properties of good codes [246, Theorem 21].
However, the most common way of using meta-converse is to apply it with the trivial channel
QY|X = QY . We have already seen this idea in Section 22.1. Indeed, with this choice the proof
of the converse for the Q-channel is trivial, because we always have: 1 − ϵ′ = M1 . Therefore, we
conclude that any (M, ϵ) code must satisfy
1
≥ β1−ϵ (PX,Y , PX QY ) . (22.12)
M
Or, after optimization we obtain
1
≥ inf sup β1−ϵ (PX,Y , PX QY ) .
M∗ (ϵ) PX QY
This is a special case of the meta-coverse known as the minimax meta-converse. It has a number of
interesting properties. First, the minimax problem in question possesses a saddle-point and is of
convex-concave type [248]. It, thus, can be seen as a stronger version of the capacity saddle-point
result for divergence in Theorem 5.4.
Second, the bound given by the minimax meta-converse coincides with the bound we obtained
before via linear programming relaxation (18.22), as discovered by [212]. To see this connection,
instead of writing the meta-converse as an upper bound M (for a given ϵ) let us think of it as an
upper bound on 1 − ϵ (for a given M).
We have seen that existence of an (M, ϵ)-code for PY|X implies existence of the (stochastic) map
(X, Y) 7→ Z ∈ {0, 1}, denoted by PZ|X,Y , with the following property:
1
PX,Y [Z = 0] ≥ 1 − ϵ, and PX QY [Z = 0] ≤ ∀ QY .
M
That is PZ|X,Y is a test of a simple null hypothesis (X, Y) ∼ PX,Y against a composite alternative
(X, Y) ∼ PX QY for an arbitrary QY . In other words every (M, ϵ) code must satisfy
1 − ϵ ≤ α̃(M; PX ) ,
i i
i i
i i
Let us now replace PX with a π x ≜ MPX (x), x ∈ X . It is clear that π ∈ [0, 1]X . Let us also
replace the optimization variable with rx,y ≜ MPZ|X,Y (0|x, y)PX (x). With these notational changes
we obtain
1 X X
α̃(M; PX ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} .
M x, y x
It is now obvious that α̃(M; PX ) = SLP (π ) defined in (18.21). Optimizing over the choice of PX
P
(or equivalently π with x π x ≤ M) we obtain
1 1 X S∗ ( M )
1−ϵ≤ SLP (π ) ≤ sup{SLP (π ) : π x ≤ M} = LP .
M M x
M
Now recall that in (18.23) we showed that a greedy procedure (essentially, the same as the one we
used in the Feinstein’s bound Theorem 18.7) produces a code with probability of success
1 S∗ (M)
1 − ϵ ≥ (1 − ) LP .
e M
This indicates that in the regime of a fixed ϵ the bound based on minimax metaconverse should
be very sharp. This, of course, provided we can select the best QY in applying it. Fortunately, for
symmetric channels optimal QY can be guessed fairly easily, cf. [248] for more.
i i
i i
i i
372
What is the best value of Ẽ(R) for each R? This is perhaps the most famous open question in all of
channel coding. More formally, let us define what is known as reliability function of a channel as
(
limn→∞ − 1n log ϵ∗ (n, exp{nR}) R<C
E( R) =
∗
limn→∞ − n log(1 − ϵ (n, exp{nR})) R > C .
1
We leave E(R) as undefined if the limit does not exist. Unfortunately, there is no general argument
showing that this limit exist. The only way to show its existence is to prove an achievability bound
1
lim inf − log ϵ∗ (n, exp{nR}) ≥ Elower (R) ,
n→∞ n
a converse bound
1
lim sup − log ϵ∗ (n, exp{nR}) ≤ Eupper (R) ,
n→∞ n
and conclude that the limit exist whenever Elower = Eupper . It is common to abuse notation and
write such pair of bounds as
even though, as we said, the E(R) is not known to exist unless the two bounds match unless the
two bounds match.
From now on we restrict our discussion to the case of a DMC. An important object to
define is the Gallager’s E0 function, which is nothing else than the right-hand side of Gallager’s
bound (18.15). For the DMC it has the following expression:
!1+ρ
X X 1
E0 (ρ, PX ) = − log PX (x)PY|X (y|x)
1+ρ
y∈B x∈A
This expression is defined in terms of the single-letter channel PY|X . It is not hard to see that E0
function for the n-letter extension evaluated with P⊗ n
X just equals nE0 (ρ, PX ), i.e. it tensorizes
1
similar to mutual information. From this observation we can apply Gallager’s random coding
bound (Theorem 18.9) with P⊗ n
X to obtain
Optimizing the choice of PX we obtain our first estimate on the reliability function
1
There is one more very pleasant analogy with mutual information: the optimization problems in the definition of E0 (ρ)
also tensorize. That is, the optimal distribution for the n-letter channel is just P⊗n
X , where PX is optimal for a single-letter
one.
i i
i i
i i
An analysis, e.g. [133, Section 5.6], shows that the function Er (R) is a convex, decreasing and
strictly positive on 0 ≤ R < C. Therefore, Gallager’s bound provides a non-trivial estimate of
the reliability function for the entire range of rates below capacity. At rates R → C the optimal
choice of ρ → 0. As R departs further away from the capacity the optimal ρ reaches 1 at a certain
rate R = Rcr known as the critical rate, so that for R < Rcr we have Er (R) = E0 (1) − R behaving
linearly.
Going to the upper bounds, taking QY to be the iid product distribution in (22.12) and optimizing
yields the bound [278] known as the sphere-packing bound:
E(R) ≤ Esp (R) ≜ sup E0 (ρ) − ρR .
ρ≥0
Comparing the definitions of Esp and Er we can see that for Rcr < R < C we have
Esp (R) = E(R) = Er (R)
thus establishing reliability function value for high rates. However, for R < Rcr we have Esp (R) >
Er (R), so that E(R) remains unknown.
Both upper and lower bounds have classical improvements. The random coding bound can be
improved via technique known as expurgation showing
E(R) ≥ Eex (R) ,
and Eex (R) > Er (R) for rates R < Rx where Rx ≤ Rcr is the second critical rate; see Exc. IV.18.
The sphere packing bound can also be improved at low rates by analyzing a combinatorial packing
problem by showing that any code must have a pair of codewords which are close (in terms of
Hellinger distance between the induced output distributions) and concluding that confusing these
two leads to a lower bound on probability of error (via (16.3)). This class of bounds is known
as “minimum distance” based bounds. The straight-line bound [133, Theorem 5.8.2] allows to
interpolate between any minimum distance bound and the Esp (R). Unfortunately, these (classical)
improvements tightly bound E(R) at only one additional rate point R = 0. This state of affairs
remains unchanged (for a general DMC) since 1967. As far as we know, the common belief is that
Eex (R) is in fact the true value of E(R) for all rates.
We demonstrate these bounds (and some others, but not the straight-line bound) on the reli-
ability function on Fig. 22.1 for the case of the BSCδ . For this channel, there is an interesting
interpretation of the expurgated bound. To explain it, let us recall the different ensembles of ran-
dom codes that we discussed in Section 18.6. In particular, we had the Shannon ensemble (as used
in Theorem 18.5) and the random linear code (either Elias or Gallager ensembles, we do not need
to make a distinction here).
For either ensemble, it it is known [134] that Er (R) is not just an estimate, but in fact the exact
value of the exponent of the average probability of error (averaged over a code in the ensemble).
For either ensemble, however, for low rates the average is dominated by few bad codes, whereas
a typical (high probability) realization of the code has a much better error exponent. For Shannon
ensemble this happens at R < 12 Rx and for the linear ensemble it happens at R < Rx . Furthermore,
the typical linear code in fact has error exponent exactly equal to the expurgated exponent Eex (R),
see [21].
i i
i i
i i
374
1.4
1.2
Err.Exp. (log2)
0.8 Rx = 0.24
0.4
C = 0.86
0.2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Rate
Figure 22.1 Comparison of bounds on the error exponent of the BSC. The MMRW stands for the upper
bound on the minimum distance of a code [214] and Gilbert-Varshamov is a lower bound of Theorem 27.5.
There is a famous conjecture in combinatorics stating that the best possible minimum pairwise
Hamming distance of a code with rate R is given by the Gilbert-Varshamov bound (Theorem 27.5).
If true, this would imply that E(R) = Eex (R) for R < Rx , see e.g. [201].
The most outstanding development in the error exponents since 1967 was a sequence of papers
starting from [201], which proposed a new technique for bounding E(R) from above. Litsyn’s
idea was to first prove a geometric result (that any code of a given rate has a large number of
pairs of codewords at a given distance) and then use de Caen’s inequality to convert it into a lower
bound on the probability of error. The resulting bound was very cumbersome. Thus, it was rather
surprising when Barg and MacGregor [22] were able to show that the new upper bound on E(R)
matched Er (R) for Rcr − ϵ < R < Rcr for some small ϵ > 0. This, for the first time since [278]
extended the range of knowledge of the reliability function. Their amazing result (together with
Gilbert-Varshamov conjecture) reinforced the belief that the typical linear codes achieve optimal
error exponent in the whole range 0 ≤ R ≤ C.
Regarding E(R) for R > C the situation is much simpler. We have
The lower (achievability) bound here is due to [106] (see also [228]), while the harder (converse)
part is by Arimoto [16]. It was later discovered that Arimoto’s converse bound can be derived by a
simple modification of the weak converse (Theorem 17.3): instead of applying data-processing to
i i
i i
i i
1
the KL divergence, one uses Rényi divergence of order α = 1+ρ ; see [243] for details. This sug-
gests a general conjecture that replacing Shannon information measures with Rényi ones upgrades
the (weak) converse proofs to a strong converse.
1 DMC
2 DMC with cost constraint
3 AWGN
4 Parallel AWGN
Let (X∗ , Y∗ ) be the input-output of the channel under the capacity achieving input distribution, and
i(x; y) be the corresponding (single-letter) information density. The following expansion holds for
a fixed 0 < ϵ < 1/2 and n → ∞
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) (22.14)
i i
i i
i i
376
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X∗ ; Y∗ ) = E[i(X∗ ; Y∗ )], and the channel dispersion2 is V = Var[i(X∗ ; Y∗ )|X∗ ].
Proof. The full proofs of these results are somewhat technical, even for the DMC.3 Here we only
sketch the details.
First, in the absence of cost constraints the achievability (lower bound on log M∗ ) part has
already been done by us in Theorem 19.11, where we have shown that log M∗ (n, ϵ) ≥ nC −
√ √
nVQ−1 (ϵ) + o( n) by refining the proof of the noisy channel coding theorem and using the
CLT. Replacing the CLT with its non-asymptotic version (Berry-Esseen inequality [123, Theorem
√
2, Chapter XVI.5]) improves o( n) to O(log n). In the presence of cost constraints, one is inclined
to attempt to use an appropriate version of the achievability bound such as Theorem 20.7. However,
for the AWGN this would require using input distribution that is uniform on the sphere. Since this
distribution is non-product, the information density ceases to be a sum of iid, and CLT is harder
to justify. Instead, there is a different achievability bound known as the κ-β bound [239, Theorem
25] that has become the workhorse of achievability proofs for cost-constrained channels with
continuous input spaces.
The upper (converse) bound requires various special methods depending on the channel. How-
ever, the high-level idea is to always apply the meta-converse bound from (22.12) with an
approriate choice of QY . Most often, QY is taken as the n-th power of the capacity achieving output
distribution for the channel. We illustrate the details for the special case of the BSC. In (22.4) we
have shown that
1
log M∗ (n, ϵ) ≤ − log βα (Ber(δ)⊗n , Ber( )⊗n ) . (22.15)
2
On the other hand, Exc. III.5 shows that
1 1 √ √
− log β1−ϵ (Ber(δ)⊗n , Ber( )⊗n ) = nd(δk ) + nvQ−1 (ϵ) + o( n) ,
2 2
where v is just the variance of the (single-letter) log-likelihood ratio:
" #
δ 1−δ δ δ
v = VarZ∼Ber(δ) Z log 1 + (1 − Z) log 1 = Var[Z log ] = δ(1 − δ) log2 .
2 2
1 − δ 1 − δ
Upon inspection we notice that v = V – the channel dispersion of the BSC, which completes the
proof of the upper bound:
√ √
log M∗ (n, ϵ) ≤ nC − nVQ−1 (ϵ) + o( n)
√
Improving the o( n) to O(log n) is done by applying the Berry-Esseen inequality in place of the
CLT, similar to the upper bound. Many more details on these proofs are contained in [238].
2
There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the one that
minimizes Var[i(X∗ ; Y∗ )|X∗ ]. See [239] for more details.
3
Recently, subtle gaps in [295] and [239] in the treatment of DMCs with non-unique capacity-achieving input distributions
were found and corrected in [57].
i i
i i
i i
Remark 22.1 (Zero dispersion). We notice that V = 0 is entirely possible. For example, consider
an additive-noise channel Y = X + Z over some abelian group G with Z being uniform on some
subset of G, e.g. channel in Exc. IV.13. Among the zero-dispersion channels there is a class of
exotic channels [239], which for ϵ > 1/2 have asymptotic expansions of the form [238, Theorem
51]:
log M∗ (n, ϵ) = nC + Θϵ (n 3 ) .
1
Existence of this special case is why we restricted the theorem above to ϵ < 12 .
Remark 22.2. The expansion (22.14) only applies to certain channels (as described in the theorem).
If, for example, Var[i(X∗ ; Y∗ )] = ∞, then the theorem need not hold and there might be other stable
(non-Gaussian) distributions that the n-letter information density will converge to. Also notice that
in the absence of cost constraints we have
since, by capacity saddle-point (Corollary 5.7), E[i(X∗ ; Y∗ )|X∗ = x] = C for PX∗ -almost all x.
As an example, we have the following dispersion formulas for the common channels that we
discussed so far:
Y i = Hi ( X i + Z i ) ,
Pn
where we have N (0, 1) ∼ Zi ⊥ ⊥ Hi ∼ Ber(1/2) and the usual quadratic cost constraint i=1 x2i ≤
nP.
Multi-antenna (MIMO) channels (20.11) present interesting new challenges as well. For exam-
ple, for coherent channels the capacity achieving input distribution is non-unique [71]. The
quasi-static channels are similar to fading channels but the H1 = H2 = · · · , i.e. the channel
gain matrix in (20.11) is not changing with time. This channel model is often used to model cellu-
lar networks. By leveraging an unexpected amount of differential geometry, it was shown in [339]
i i
i i
i i
378
0.5
0.4
Rate, bit/ch.use
0.3
0.2
0.1 Capacity
Converse
RCU
DT
Gallager
Feinstein
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
where the ϵ-capacity Cϵ is known as outage capacity in this case (and depends on ϵ). The main
implication is that Cϵ is a good predictor of the ultimate performance limits for these practically-
relevant channels (better than C is for the AWGN channel, for example). But some caution must
be taken in approximating log M∗ (n, ϵ) ≈ nCϵ , nevertheless. For example, in the case where H
matrix is known at the transmitter, the same paper demonstrated that the standard water-filling
power allocation (Theorem 20.14) that maximizes Cϵ is rather sub-optimal at finite n.
(The log n term in (22.14) is known to be equal to O(1) for the BEC, and 12 log n for the BSC,
AWGN and binary-input AWGN. For these latter channels, normal approximation is typically
defined with + 12 log n added to the previous display.)
For example, considering the BEC1/2 channel we can easily compute the capacity and disper-
sion to be C = (1 − δ) and V = δ(1 − δ) (in bits and bits2 , resp.). Detailed calculation in Ex. IV.31
i i
i i
i i
0.5
0.4
Rate, bit/ch.use
0.3
0.2
Capacity
Converse
0.1 Normal approximation + 1/2 log n
Normal approximation
Achievability
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
Figure 22.3 Comparing the normal approximation against the best upper and lower bounds on 1
n
log M∗ (n, ϵ)
for the BSCδ channel (δ = 0.11, ϵ = 10−3 ).
p
log M∗ (500, 10−3 ) ≈ nδ̄ − nδ δ̄ Q−1 (10−3 ) ≈ 215.5 bits
i i
i i
i i
380
Pe k1 → n1 Pe k2 → n2
10−4 10−4
P∗ SNR P∗ SNR
After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires a
smaller SNR to achieve the same error probability. However, this ignores the fact that the rate of
this code nk11 might be much smaller as well. The concept of normalized rate allows us to compare
the codes of different blocklengths and coding rates.
Specifically, suppose that a k → n code is given. Fix ϵ > 0 and find the value of the SNR P for
which this code attains probability of error ϵ (for example, by taking a horizontal intercept at level
ϵ on the waterfall plot). The normalized rate is defined as
k k
Rnorm (ϵ) = ≈ p ,
log2 M∗ (n, ϵ, P) nC(P) − nV(P)Q−1 (ϵ)
where log M∗ , capacity and dispersion correspond to the channel over which evaluation is being
made (most often the AWGN, BI-AWGN or the fading channel). We also notice that, of course,
the value of log M∗ is not possible to compute exactly and thus, in practice, we use the normal
approximation to evaluate it.
This idea allows us to clearly see how much different ideas in coding theory over the decades
were driving the value of normalized rate upward to 1. This comparison is show on Fig. 22.4.
A short summary is that at blocklengths corresponding to “data stream” channels in cellular net-
works (n ∼ 104 ) the LDPC codes and non-binary LDPC codes are already achiving 95% of the
information-theoretic limit. At blocklengths corresponding to “control plane” (n ≲ 103 ) the polar
codes and LDPC codes are at similar performance and at 90% of the fundamental limits.
i i
i i
i i
0.95
0.9
Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)
0.6
0.55
0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1
0.95
0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate
Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7
0.65
0.6 2 3 4 5
10 10 10 10
Blocklength, n
Figure 22.4 Normalized rates for various codes. Plots generated via [291] (color version recommended)
i i
i i
i i
So far we have been focusing on the paradigm for one-way communication: data are mapped to
codewords and transmitted, and later decoded based on the received noisy observations. In most
practical settings (except for storage), frequently the communication goes in both ways so that the
receiver can provide certain feedback to the transmitter. As a motivating example, consider the
communication channel of the downlink transmission from a satellite to earth. Downlink transmis-
sion is very expensive (power constraint at the satellite), but the uplink from earth to the satellite
is cheap which makes virtually noiseless feedback readily available at the transmitter (satellite).
In general, channel with noiseless feedback is interesting when such asymmetry exists between
uplink and downlink. Even in less ideal settings, noisy or partial feedbacks are commonly available
that can potentially improve the realiability or complexity of communication.
In the first half of our discussion, we shall follow Shannon to show that even with noiseless
feedback “nothing” can be gained in the conventional setup, while in the second half, we examine
situations where feedback is extremely helpful.
f1 : [ M ] → A
f2 : [ M ] × B → A
..
.
fn : [M] × B n−1 → A
• Decoder:
g : B n → [M]
382
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 383
Figure 23.1 Schematic representation of coding without feedback (left) and with full noiseless feedback
(right).
Here the symbol transmitted at time t depends on both the message and the history of received
symbols:
Xt = ft (W, Yt1−1 ).
W ∼ uniform on [M]
PY|X
X1 = f1 (W) −→ Y1
.. −→ Ŵ = g(Yn )
.
PY|X
Xn = fn (W, Yn1−1 ) −→ Yn
Fig. 23.1 compares the settings of channel coding without feedback and with full feedback:
Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in fact
constructing codes achieving capacity with full feedback can be done directly, without appealing
to a (much harder) problem of non-feedback codes. Let π t (·) ≜ PW|Yt (·|Yt ) with the (random) pos-
terior distribution after t steps. It is clear that due to the knowledge of Yt on both ends, transmitter
and receiver have perfectly synchronized knowledge of π t . Now consider how the transmission
progresses:
i i
i i
i i
384
1 Initialize π 0 (·) = M1
2 At (t + 1)-th step, having knowledge of π t all messages are partitioned into classes Pa , according
to the values ft+1 (·, Yt ):
Then transmitter, possessing the knowledge of the true message W, selects a letter Xt+1 =
ft+1 (W, Yt ).
3 Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:
PY|X (Yt+1 |ft+1 (j, Yt ))
π t+1 (j) ≜ π t (j)Bt+1 (j) , Bt+1 (j) ≜ P .
a∈A π t (Pa )
Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY|X (y|a)
E[log Bt+1 (W)|Yt ] = π t (Pa ) log P = I(π̃ t , PY|X ) (23.1)
a∈A y∈B a∈A π t (Pa )a
Intuitively, we expect that the process log π t (W) resembles a random walk starting from − log M
and having a positive drift. Thus to estimate the time it takes for this process to reach value 0
we need to estimate the upward drift. Appealing to intuition and the law of large numbers we
approximate
X
t
log π t (W) − log π 0 (W) ≈ E[log Bs ] .
s=1
Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃ t ≈ P∗X (capacity-achieving input distribution) and this obtains
implying that the transmission terminates in time ≈ logCM . The important lesson here is the follow-
ing: The optimal transmission scheme should map messages to channel inputs in such a way that
the induced input distribution PXt+1 |Yt is approximately equal to the one maximizing I(X; Y). This
idea is called posterior matching and explored in detail in [282].1
1
Note that the magic of Shannon’s theorem is that this optimal partitioning can also be done blindly. That is, it is possible
to preselect partitions Pa in a way that is independent of π t (but dependent on t) and so that π t (Pa ) ≈ P∗X (a) with
overwhelming probability and for all t ∈ [n].
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 385
Converse: We are left to show that Cfb ≤ C(I) . Recall the key in proving weak converse for
channel coding without feedback: Fano’s inequality plus the graphical model
W → Xn → Yn → Ŵ. (23.2)
Then
With feedback the probabilistic picture becomes more complicated as the following figure
shows for n = 3 (dependence introduced by the extra squiggly arrows):
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
So, while the Markov chain relation in (23.2) is still true, the input-output relation is no longer
memoryless2
Y
n
PYn |Xn (yn |xn ) 6= PY|X (yj |xj ) (!)
j=1
There is still a large degree of independence in the channel, though. Namely, we have
Then
2
To see this, consider the example where X2 = Y1 and thus PY1 |X1 X2 = δX1 is a point mass.
i i
i i
i i
386
≤ nCt
In comparison with Theorem 22.2, the following result shows that, with fixed-length block cod-
ing, feedback does not even improve the speed of approaching capacity and can at most improve
the third-order log n terms.
Theorem 23.4 (Dispersion with feedback). For weakly input-symmetric DMC (e.g. additive noise,
BSC, BEC) we have:
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n)
Proof. It is obvious that Cfb ≥ C, we are left to show that Cfb ≤ C(I) .
1 Recap of the steps of showing the strong converse of C ≤ C(I) previously in Section 22.1: take
any (n, M, ϵ) code, compare the two distributions:
P : W → Xn → Yn → Ŵ (23.5)
Q:W→X n
Y → Ŵ
n
(23.6)
PW,Xn ,Yn ,Ŵ = PW,Xn PYn |Xn PŴ|Yn , QW,Xn ,Yn ,Ŵ = PW,Xn PYn PŴ|Yn
thus D(PkQ) = I(Xn ; Yn ) measures the information flow through the links Xn → Yn .
mem−less,stat X
n
1 DPI
−h(ϵ) + ϵ̄ log M = d(1 − ϵk ) ≤ D(PkQ) = I(Xn ; Yn ) = I(X; Y) ≤ nC(I)
M
i=1
(23.7)
2 Notice that when feedback is present, Xn → Yn is not memoryless due to the transmission
protocol. So let us unfold the probability space over time to see the dependence explicitly. As
an example, the graphical model for n = 3 is given below:
i i
i i
i i
23.2* Alternative proof of Theorem 23.3 and Massey’s directed information 387
If we define Q similarly as in the case without feedback, we will encounter a problem at the
second last inequality in (23.7), as with feedback I(Xn ; Yn ) can be significantly larger than
Pn n n
i=1 I(X; Y). Consider the example where X2 = Y1 , we have I(X ; Y ) = +∞ independent
of I(X; Y).
We also make the observation that if Q is defined in (23.6), D(PkQ) = I(Xn ; Yn ) measures the
information flow through all the 6→ and ⇝ links. This motivates us to find a proper Q such that
D(PkQ) only captures the information flow through all the 6→ links {Xi → Yi : i = 1, . . . , n}, so
⊥ W, so that Q[W 6= Ŵ] = M1 .
that D(PkQ) closely relates to nC(I) , while still guarantees that W ⊥
3 Formally, we shall restrict QW,Xn ,Yn ,Ŵ ∈ Q, where Q is the set of distributions that can be
factorized as follows:
QW,Xn ,Yn ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXn |W,Yn−1 QYn |Yn−1 QŴ|Yn (23.8)
PW,Xn ,Yn ,Ŵ = PW PX1 |W PY1 |X1 PX2 |W,Y1 PY2 |X2 · · · PXn |W,Yn−1 PYn |Xn PŴ|Yn (23.9)
Verify that W ⊥
⊥ W under Q: W and Ŵ are d-separated by Xn .
Notice that in the graphical models, when removing ↛ we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when ↛ are removed, so that Q is the “closest” to P while W ⊥ ⊥ W is satisfied.
Now we have that for Q ∈ Q, d(1 − ϵk M1 ) ≤ D(PkQ), in order to obtain the least upper bound,
in Lemma 23.5 we shall show that:
X
n
inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ ) = I(Xt ; Yt |Yt−1 )
Q∈Q
t=1
X
n
= EYt−1 [I(PXt |Yt−1 , PY|X )]
t=1
X
n
≤ I(EYt−1 [PXt |Yt−1 ], PY|X ) (concavity of I(·, PY|X ))
t=1
i i
i i
i i
388
X
n
= I(PXt , PY|X )
t=1
≤nC . ( I)
nC + h(ϵ) C
−h(ϵ) + ϵ̄ log M ≤ nC(I) ⇒ log M ≤ ⇒ Cfb,ϵ ≤ ⇒ Cfb ≤ C.
1−ϵ 1−ϵ
4 Notice that the above proof is also valid even when cost constraint is present.
Lemma 23.5.
X
n
inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ ) = I(Xt ; Yt |Yt−1 ) (23.10)
Q∈Q
t=1
Pn
Remark 23.1 (Directed information). The quantity ⃗I(Xn ; Yn ) ≜ t=1 I(Xt ; Yt |Yt−1 ) was defined
by Massey and is known as directed information. In some sense, see [211] it quantifies the amount
of causal information transfer from X-process to Y-process.
Proof. By chain rule, we can show that the minimizer Q ∈ Q must satisfy the following
equalities:
QX,W = PX,W ,
QXt |W,Yt−1 = PXt |W,Yt−1 , (exercise)
QŴ|Yn = PW|Yn .
and therefore
= D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Yn−1 kQYn |Yn−1 |Xn , Yn−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 )
i i
i i
i i
denotes the set of input symbols that can lead to the output symbol y.
where (a) and (b) are by definitions, (c) follows from Theorem 23.3, and (d) is due to Theorem 19.9.
All capacity quantities above are defined with (fixed-length) block codes.
Remark 23.3. 1 In DMC for both zero-error capacities (C0 and Cfb,0 ) only the support of the
transition matrix PY|X , i.e., whether PY|X (b|a) > 0 or not, matters. The value of PY|X (b|a) > 0
is irrelevant. That is, C0 and Cfb,0 are functions of a bipartite graph between input and output
alphabets. Furthermore, the C0 (but not Cfb,0 !) is a function of the confusability graph – a simple
undirected graph on A with a 6= a′ connected by an edge iff ∃b ∈ B s.t. PY|X (b|a)PY|X (b|a′ ) > 0.
2 That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next remark) with L = 3 (for which Cfb,0 = log 32 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cfb,0 = 0). Clearly in both cases confusability
graph is the same – a triangle.
3 Oftentimes C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.11).
As an example, consider the following polygon channel:
1
5
4
3
i i
i i
i i
390
– L = 5: C0 = 12 log 5 . For achievability, with blocklength one, one can use {1, 3} to achieve
rate 1 bit; with blocklength two, the codebook {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)} achieves
rate 12 log 5 bits, as pointed out by Shannon in his original 1956 paper [279]. More than
two decades later this was shown optimal by Lovász using a technique now known as
semidefinite programming relaxation [204].
– Even L: C0 = log L2 (Exercise IV.16).
– Odd L: The exact value of C0 is unknown in general. For large L, C0 = log L2 + o(1) [42].
• Zero-error capacity with feedback (Exercise IV.16)
L
Cfb,0 = log , ∀L,
2
which can strictly exceeds C0 .
4 Notice that Cfb,0 is not necessarily equal to Cfb = limϵ→0 Cfb,ϵ = C. Here is an example with
Then
C0 = log 2
2
Cfb,0 = max − log max( δ, 1 − δ) (P∗X = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, the Shannon capacity C = Cfb can be made arbitrarily close to log 4 by
picking the cross-over probabilities arbitrarily close to zero, while the confusability graph stays
the same.
Proof of Theorem 23.6. 1 Fix any (n, M, 0)-code. Denote the confusability set of all possible
messages that could have produced the received signal yt = (y1 , . . . , yt ) for all t = 0, 1, . . . , n
by:
i i
i i
i i
1
Cfb,0 = log .
θfb
By definition, we have
Notice the minimizer distribution P∗X is usually not the capacity-achieving input distribution in
the usual sense. This definition also sheds light on how the encoding and decoding should be
proceeded and serves to lower bound the uncertainty reduction at each stage of the decoding
scheme.
3 “≤” (converse): Let PXn be the joint distribution of the codewords. Denote E0 = [M] – original
message set.
t = 1: For PX1 , by (23.13), ∃y∗1 such that:
4 “≥” (achievability)
Let’s construct a code that achieves (M, n, 0).
i i
i i
i i
392
The above example with |A| = 3 illustrates that the encoder f1 partitions the space of all mes-
sages to 3 groups. The encoder f1 at the first stage encodes the groups of messages into a1 , a2 , a3
correspondingly. When channel outputs y1 and assume that Sy1 = {a1 , a2 }, then the decoder
can eliminate a total number of MP∗X (a3 ) candidate messages in this round. The “confusabil-
ity set” only contains the remaining MP∗X (Sy1 ) messages. By definition of P∗X we know that
MP∗X (Sy1 ) ≤ Mθfb . In the second round, f2 partitions the remaining messages into three groups,
send the group index and repeat.
By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb . After n
iterations, the size of “confusability set” is upper bounded by Mθfbn n
, if Mθfb ≤ 1,3 then zero error
probability is achieved. This is guaranteed by choosing log M = −n log θfb . Therefore we have
shown that −n log θfb bits can be reliably delivered with n + O(1) channel uses with feedback,
thus
i i
i i
i i
Example 23.1. For the channel BSC0.11 without feedback the minimal is n = 3000 needed to
achieve 90% of the capacity C, while there exists a VLF code with l = E[n] = 200 achieving that.
This showcases how much feedback can improve the latency and decoding complexity.
Yk = X k + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[X2k ] ≤ P, power constraint in expectation
A = Ân + Nn , Nn ⊥
⊥ Yn .
Morever, since all operations are lienar and everything is jointly Gaussian, Nn ⊥
⊥ Yn . Since Xn ∝
n−1
Nn−1 ⊥
⊥ Y , the codeword we are sending at each time slot is independent of the history of the
channel output (“innovation”), in order to maximize information transfer.
4 ∑n
Note that if we insist each codeword satisfies power constraint almost surely instead on average, i.e., k=1 X2k ≤ nP a.s.,
then this scheme does not work!
i i
i i
i i
394
Note that Yn → Ân → A, and the optimal estimator Ân (a linear combination of Yn ) is a sufficient
statistic of Yn for A under Gaussianity. Then
I(A; Yn ) =I(A; Ân , Yn )
= I(A; Ân ) + I(A; Yn |Ân )
= I(A; Ân )
1 Var(A)
= log .
2 Var(Nn )
where the last equality uses the fact that N follows a normal distribution. Var(Nn ) can be computed
directly using standard linear MMSE results. Instead, we determine it information theoretically:
Notice that we also have
I(A; Yn ) = I(A; Y1 ) + I(A; Y2 |Y1 ) + · · · + I(A; Yn |Yn−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 )
key
= I(X1 ; Y1 ) + I(X2 ; Y2 ) + · · · + I(Xn ; Yn )
1
= n log(1 + P) = nC
2
Therefore, with Elias’ scheme of sending A ∼ N (0, Var A), after the n-th use of the AWGN(P)
channel with feedback,
P n
Var Nn = Var(Ân − A) = 2−2nC Var A = Var A,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.
Schalkwijk-Kailath Elias’ scheme can also be used to send digital data. Let W ∼ uniform on
M-PAM constellation in ∈ [−1, 1], i.e., {−1, −1 + M2 , · · · , −1 + 2k
M , · · · , 1}. In the very first step
W is sent (after scaling to satisfy the power constraint):
√
X0 = PW, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
e 0 = X0 + Z
Y e0 , e0 ) = 2−2(n−1)C
Var(Z
e0 to the nearest PAM point. Notice that
Finally, the decoder quantizes Y
√ (n−1)C √
e 1 −(n−1)C P 2 P
ϵ ≤ P | Z0 | > =P 2 |Z| > = 2Q
2M 2M 2M
so that
√
P ϵ
log M ≥ (n − 1)C + log − log Q−1 ( ) = nC + O(1).
2 2
i i
i i
i i
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially as
√
n increases. More importantly, we gained an n term in terms of log M, since for the case without
feedback we have (by Theorem 22.2)
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .
As an example, consider P = 1 and (n−thenchannel capacity is C = 0.5 bit per channel use. To
e(n−1)C
−3
achieve error probability 10 , 2Q 2 1) C
2M ≈ 10 , so 2M ≈ 3, and logn M ≈ n−n 1 C − logn 8 .
−3
Notice that the capacity is achieved to within 99% in as few as n = 50 channel uses, whereas the
best possible block codes without feedback require n ≈ 2800 to achieve 90% of capacity.
The take-away message of this chapter is as follows: Feedback is best harnessed with adaptive
strategies. Although it does not increase capacity under block coding, feedback greatly boosts
reliability as well as reduces coding complexity.
i i
i i
i i
1 1
IV.1 A code with M = 2k , average probability of error ϵ < 2 and bit-error probability pb < 2 must
satisfy both
C + h(ϵ)
log M ≤ (IV.1)
1−ϵ
and
C
log M ≤ , (IV.2)
log 2 − h(pb )
where C = supPX I(X; Y). Since pb ≤ ϵ, in the bound (IV.2) we may replace h(pb ) with h(ϵ) to
obtain a new bound. Suppose that a value of k is fixed and the bounds are used to prove a lower
bound on ϵ. When will the new bound be better than (IV.1)?
IV.2 A magician is performing card tricks on stage. In each round he takes a shuffled deck of 52
cards and asks someone to pick a random card N from the deck, which is then revealed to the
audience. Assume the magician can prepare an arbitrary ordering of cards in the deck (before
each round) and that N is distributed binomially on {0, . . . , 51} with mean 51
2 .
(a) What is the maximal number of bits per round that he can send over to his companion in
the room? (in the limit of infinitely many rounds)
(b) Is communication possible if N were uniform on {0, . . . , 51}? (In practice, however, nobody
ever picks the top or the bottom ones)
IV.3 Find the capacity of the erasure-error channel (Fig. 23.2) with channel matrix
1 − 2δ δ δ
W=
δ δ 1 − 2δ
where 0 ≤ δ ≤ 1/2.
FIXME
Figure 23.2 Binary erasure-error channel.
IV.4 Consider a binary symmetric channel with crossover probability δ ∈ (0, 1):
Y = X + Z mod 2 , Z ∼ Bern(δ) .
Suppose that in addition to Y the receiver also gets to observe noise Z through a binary erasure
channel with erasure probability δe ∈ (0, 1). Compute:
(a) Capacity C of the channel.
(b) Zero-error capacity C0 of the channel.
i i
i i
i i
1X 2
n
xi (w) ≤ P, w ∈ {1, 2, . . . , 2nR }.
n
i=1
Using Fano’s inequality, show that the capacity C is equal to zero for this channel.
IV.6 Randomized encoders and decoders may help for maximal probability of error:
(a) Consider a binary asymmetric channel PY|X : {0, 1} → {0, 1} specified by PY|X=0 =
Ber(1/2) and PY|X=1 = Ber(1/3). The encoder f : [M] → {0, 1} tries to transmit 1 bit
of information, i.e., M = 2, with f(1) = 0, f(2) = 1. Show that the optimal decoder which
minimizes the maximal probability of error is necessarily randomized. Find the optimal
decoder and the optimal Pe,max . (Hint: Recall binary hypothesis testing.)
(b) Give an example of PY|X : X → Y , M > 1 and ϵ > 0 such that there is an (M, ϵ)max -code
with a randomized encoder-decoder, but no such code with a deterministic encoder-decoder.
IV.7 Routers A and B are setting up a covert communication channel in which the data is encoded
in the ordering of packets. Formally: router A receives n packets, each of type A or D (for
Ack/Data), where type is i.i.d. Bernoulli(p) with p ≈ 0.9. It encodes k bits of secret data by
reordering these packets. The network between A and B delivers packets in-order with loss rate
δ ≈ 5% (Note: packets have sequence numbers, so each loss is detected by B).
What is the maximum rate nk of reliable communication achievable for large n? Justify your
answer!
IV.8 (Strong converse for BSC) In this exercise we give a combinatorial proof of the strong converse
for the binary symmetric channel. For BSCδ with 0 < δ < 21 ,
(a) Given any (n, M, ϵ)max -code with deterministic encoder f and decoder g, recall that the
decoding regions {Di = g−1 (i)}M i=1 form a partition of the output space. Prove that for
all i ∈ [M],
L
X n
| Di | ≥
j
j=0
i i
i i
i i
i i
i i
i i
IV.10 (Capacity-cost at P = P0 .) Recall that we have shown that for stationary memoryless channels
and P > P0 capacity equals f(P):
where
Show:
(a) If P0 is not admissible, i.e., c(x) > P0 for all x ∈ A, then C(P0 ) is undefined (even M = 1
is not possible)
(b) If there exists a unique x0 such that c(x0 ) = P0 then
C(P0 ) = f(P0 ) = 0 .
(c) If there are more than one x with c(x) = P0 then we still have
C(P0 ) = f(P0 ) .
(d) Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable
cost function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R →
{−1, 0, 1})
IV.11 Consider a stationary memoryless additive non-Gaussian noise channel:
Yi = Xi + Zi , E[Zi ] = 0, Var[Zi ] = 1
i i
i i
i i
we show that without degrading the performance, we can reduce this number to qn by restricting
to Toeplitz generator matrix G, i.e., Gij = Gi−1,j−1 for all i, j > 1.
Prove the following strengthening of Theorem 18.13: Let PY|X be additive noise over Fnq . For
any 1 ≤ k ≤ n, there exists a linear code f : Fkq → Fnq with Toeplitz generator matrix, such that
+
h − n−k−log 1 n i
Pe,max = Pe ≤ E q
q PZn (Z )
for the channel Y = X + Z where X is uniform on F2q , noise Z ∈ F2q has distribution PZ and
P Z ( b − a)
i(a; b) ≜ log .
q− 2
(a) Show that probability of error of the code a 7→ (av, au) + h is the same as that of a 7→
(a, auv−1 ).
(b) Let {Xa , a ∈ Fq } be a random codebook defined as
Xa = (aV, aU) + H ,
with V, U – uniform over non-zero elements of Fq and H – uniform over F2q , the three being
jointly independent. Show that for a 6= a′ we have
1
PXa ,X′a (x21 , x̃21 ) = 1{x1 6= x̃1 , x2 6= x̃2 }
q2 (q − 1)2
(c) Show that for a 6= a′
q2 1
P[i(X′a ; Xa + Z) > log β] ≤ P[i(X̄; Y) > log β] − P[i(X; Y) > log β]
( q − 1) 2 (q − 1)2
q2
≤ P[i(X̄; Y) > log β] ,
( q − 1) 2
where PX̄XY (ā, a, b) = q14 PZ (b − a).
(d) Conclude by following the proof of the DT bound with M = q that the probability of error
averaged over the random codebook {Xa } satisfies (IV.10).
i i
i i
i i
IV.14 (Information density and types.) Let PY|X : A → B be a DMC and let PX be some input
distribution. Take PXn Yn = PnXY and define i(an ; bn ) with respect to this PXn Yn .
(a) Show that i(xn ; yn ) is a function of only the “joint-type” P̂XY of (xn , yn ), which is a
distribution on A × B defined as
1
P̂XY (a, b) = #{i : xi = a, yi = b} ,
n
where a ∈ A and b ∈ B . Therefore { 1n i(xn ; yn ) ≥ γ} can be interpreted as a constraint on
the joint type of (xn , yn ).
(b) Assume also that the input xn is such that P̂X = PX . Show that
1 n n
i(x ; y ) ≤ I(P̂X , P̂Y|X ) .
n
The quantity I(P̂X , P̂Y|X ), sometimes written as I(xn ∧ yn ), is an empirical mutual informa-
tion5 . Hint:
PY|X (Y|X)
EQXY log =
PY (Y)
D(QY|X kQY |QX ) + D(QY kPY ) − D(QY|X kPY|X |QX ) (IV.11)
IV.15 (Fitingof-Goppa universal codes) Consider a finite abelian group X . Define Fitingof norm as
Conclude that dΦ (xn , yn ) ≜ kxn − yn kΦ is a translation invariant (Fitingof) metric on the set
of equivalence classes in X n , with equivalence xn ∼ yn ⇐⇒ kxn − yn kΦ = 0.
(b) Define Fitingof ball Br (xn ) ≜ {yn : dΦ (xn , yn ) ≤ r}. Show that
(d) Conclude that a code C ⊂ X n with Fitingof minimal distance dmin,Φ (C) ≜
minc̸=c′ ∈C dΦ (c, c′ ) ≥ 2λn is decodable with vanishing probability of error on any
additive-noise channel Y = X + Z, as long as H(Z) < λ.
5
Invented by V. Goppa for his maximal mutual information (MMI) decoder: Ŵ = argmaxi I(ci ∧ yn ).
i i
i i
i i
Comment: By Feinstein-lemma like argument it can be shown that there exist codes of size
X n(1−λ) , such that balls of radius λn centered at codewords are almost disjoint. Such codes are
universally capacity achieving for all memoryless additive noise channels on X . Extension to
general (non-additive) channels is done via introducing dΦ (xn , yn ) = nH(xT |yT ), while exten-
sion to channels with Markov memory is done by introducing Markov-type norm kxn kΦ1 =
nH(xT |xT−1 ). See [143, Chapter 3].
IV.16 Consider the polygon channel discussed in Remark 23.3, where the input and output alphabet
are both {1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = (a mod L) + 1. The
confusability graph is a cycle of L vertices. Rigorously prove the following:
(a) For all L, The zero-error capacity with feedback is Cfb,0 = log L2 .
(b) For even L, the zero-error capacity without feedback C0 = log L2 .
(c) Now consider the following channel, where the input and output alphabet are both
{1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = a + 1. In this case the confusability
graph is a path of L vertices. Show that the zero-error capacity is given by
L
C0 = log
2
What is Cfb,0 ?
IV.17 (Input-output cost) Let PY|X : X → Y be a DMC and consider a cost function c : X × Y → R
(note that c(x, y) ≤ L < ∞ for some L). Consider a problem of channel coding, where the
error-event is defined as
( n )
X
{error} ≜ {Ŵ 6= W} ∪ c(Xk , Yk ) > nP ,
k=1
where P is a fixed parameter. Define operational capacity C(P) and show it is given by
for all P > P0 ≜ minx0 E[c(X, Y)|X = x0 ]. Give a counter-example for P = P0 . (Hint: do a
converse directly, and for achievability reduce to an appropriately chosen cost-function c′ (x)).
IV.18 (Expurgated random coding bound)
(a) For any code C show the following bound on probability of error
1 X −dB (c,c′ )
Pe (C) ≤ 2 ,
M ′
c̸=c
Pn
where Bhattacharya distance dB (xn , x̃n ) = j=1 dB (xj , x̃j ) and a single-letter
Xp
dB (x, x̃) = − log2 W(y|x)W(y|x̃) .
y∈Y
i i
i i
i i
′
(b) Fix PX and let E0,x (ρ, PX ) ≜ −ρ log2 E[2− ρ dB (X,X ) ], where X ⊥
1
⊥ X′ ∼ PX . Show by random
coding that there always exists a code C of rate R with
(c) We improve the previous bound as follows. We still generate C by random coding. But this
P ′
time we expurgate all codewords with f(c, C) > med(f(c, C)), where f(c) = c′ ̸=c 2−dB (c,c ) .
Using the bound
med(V) ≤ 2ρ E[V1/ρ ]ρ ∀ρ ≥ 1
show that
(d) Conclude that there must exist a code with rate R − O(1/n) and Pe (C) ≤ 2−nEex (R) , where
IV.19 Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable cost
function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R → {−1, 0, 1})
IV.20 (Sum of channels) Let W1 and W2 denote the channel matrices of discrete memoryless channel
(DMC) PY1 |X1 and PY2 |X2 with capacity C1 and C2 , respectively. The sum of the two channels is
another DMC with channel matrix W01 W02 . Show that the capacity of the sum channel is given
by
IV.21 (Product of channels) For i = 1, 2, let PYi |Xi be a channel with input space Ai , output space Bi ,
and capacity Ci . Their product channel is a channel with input space A1 × A2 , output space
B1 × B2 , and transition kernel PY1 Y2 |X1 X2 = PY1 |X1 PY2 |X2 . Show that the capacity of the product
channel is given by
C = C1 + C2 .
IV.22 Mixtures of DMCs. Consider two DMCs UY|X and VY|X with a common capacity achieving input
distribution and capacities CU < CV . Let T = {0, 1} be uniform and consider a channel PYn |Xn
that uses U if T = 0 and V if T = 1, or more formally:
1 n 1
PYn |Xn (yn |xn ) = U (yn |xn ) + VnY|X (yn |xn ) . (IV.12)
2 Y| X 2
Show:
(a) Is this channel {PYn |Xn }n≥1 stationary? Memoryless?
(b) Show that the Shannon capacity C of this channel is not greater than CU .
(c) The maximal mutual information rate is
1 CU + CV
C(I) = lim sup I(Xn ; Yn ) =
n→∞ n Xn 2
i i
i i
i i
C < C(I) .
IV.23 Compound DMC [37] Compound DMC is a family of DMC’s with common input and output
alphabets PYs |X : A → B, s ∈ S . An (n, M, ϵ) code is an encoder-decoder pair whose probability
of error ≤ ϵ over any channel PYs |X in the family (note that the same encoder and the same
decoder are used for each s ∈ S ). Show that capacity is given by
IV.24 Consider the following (memoryless) channel. It has a side switch U that can be in positions
ON and OFF. If U is on then the channel from X to Y is BSCδ and if U is off then Y is Bernoulli
(1/2) regardless of X. The receiving party sees Y but not U. A design constraint is that U should
be in the ON position no more than the fraction s of all channel uses, 0 ≤ s ≤ 1. Questions:
(a) One strategy is to put U into ON over the first sn time units and ignore the rest of the (1 − s)n
readings of Y. What is the maximal rate in bits per channel use achievable with this strategy?
(b) Can we increase the communication rate if the encoder is allowed to modulate the U switch
together with the input X (while still satisfying the s-constraint on U)?
(c) Now assume nobody has access to U, which is random, independent of X, memoryless
across different channel uses and
P[U = ON] = s.
Find capacity.
IV.25 Let {Zj , j = 1, 2, . . .} be a stationary Gaussian process with variance 1 such that Zj form a
Markov chain Z1 → . . . → Zn → . . . Consider an additive channel
Yn = Xn + Zn
Pn
with power constraint j=1 |xj |2 ≤ nP. Suppose that I(Z1 ; Z2 ) = ϵ 1, then capacity-cost
function
1
C(P) = log(1 + P) + Bϵ + o(ϵ)
2
as ϵ → 0. Compute B and interpret your answer.
How does the frequency spectrum of optimal signal change with increasing ϵ?
IV.26 A semiconductor company offers a random number generator that outputs a block of random n
bits Y1 , . . . , Yn . The company wants to secretly embed a signature in every chip. To that end, it
decides to encode the k-bit signature in n real numbers Xj ∈ [0, 1]. To each individual signature
a chip is manufactured that produces the outputs Yj ∼ Ber(Xj ). In order for the embedding to
be inconspicuous the average bias P should be small:
n
1 X 1
n Xj − 2 ≤ P .
j=1
i i
i i
i i
As a function of P how many signature bits per output (k/n) can be reliably embedded in this
fashion? Is there a simple coding scheme achieving this performance?
IV.27 Consider a DMC with two outputs PY,U|X . Suppose that receiver observes only Y, while U is
(causally) fed back to the transmitter. We know that when Y = U the capacity is not increased.
(a) Show that capacity is not increased in general (even when Y 6= U).
(b) Suppose now that there is a cost function c and c(x0 ) = 0. Show that capacity per unit cost
(with U being fed back) is still given by
D(PY|X=x kPY|X=x0 )
CV = max
x̸=x0 c(x)
IV.28 (Capacity of sneezing) A sick student is sneezing periodically every minute, with each sneeze
happening i.i.d. with probability p. He decides to send k bits to a friend by modulating the
sneezes. For that, every time he realizes he is about to sneeze he chooses to suppress a sneeze
or not. A friend listens for n minutes and then tries to decode k bits.
(a) Find capacity in bits per minute. (Hint: Think how to define the channel so that channel input
at time t were not dependent on the arrival of the sneeze at time t. To rule out strategies that
depend on arrivals of past sneezes, you may invoke Exercise IV.27.)
(b) Suppose sender can suppress at most E sneezes and listener can wait indefinitely (n = ∞).
Show that sender can transmit Cpuc E + o(E) bits reliably as E → ∞ and find Cpuc . Curiously,
Cpuc ≥ 1.44 bits/sneeze regardless of p. (Hint: This is similar to Exercise IV.17.)
(c) (Bonus, hard) Redo 1 and 2 for the case of a clairvoyant student who knows exactly when
sneezes will happen in the future.
IV.29 An inmate has n oranges that he is using to communicate with his conspirators by putting
oranges in trays. Assume that infinitely many trays are available, each can contain zero or more
oranges, and each orange in each tray is eaten by guards independently with probability δ . In
the limit of n → ∞ show that an arbitrary high rate (in bits per orange) is achievable.
IV.30 Recall that in the proof of the DT bound we used the decoder that outputs (for a given channel
output y) the first cm that satisfies
{i(cm ; y) > log β} . (IV.13)
One may consider the following generalization. Fix E ⊂ X × Y and let the decoder output the
first cm which satisfies
( cm , y) ∈ E
By repeating the random coding proof steps (as in the DT bound) show that the average
probability of error satisfies
M−1
E[Pe ] ≤ P[(X, Y) 6∈ E] + P[(X̄, Y) ∈ E] ,
2
where
PXYX̄ (a, b, ā) = PX (a)PY|X (b|a)PX (ā) .
M−1
Conclude that the optimal E is given by (IV.13) with β = 2 .
i i
i i
i i
IV.31 Bounds for the binary erasure channel (BEC). Consider a code with M = 2k operating over the
blocklength n BEC with erasure probability δ ∈ [0, 1).
(a) Show that regardless of the encoder-decoder pair:
+
P[error|#erasures = z] ≥ 1 − 2n−z−k
(b) Conclude by averaging over the distribution of z that the probability of error ϵ must satisfy
X
n
n ℓ
ϵ≥ δ (1 − δ)n−ℓ 1 − 2n−ℓ−k , (IV.14)
ℓ
ℓ=n−k+1
(c) By applying the DT bound with uniform PX show that there exist codes with
X n
n t
δ (1 − δ)n−t 2−|n−t−k+1| .
+
ϵ≤ (IV.15)
t
t=0
(d) Fix n = 500, δ = 1/2. Compute the smallest k for which the right-hand side of (IV.14) is
greater than 10−3 .
(e) Fix n = 500, δ = 1/2. Find the largest k for which the right-hand side of (IV.15) is smaller
than 10−3 .
(f) Express your results in terms of lower and upper bounds on log M∗ (500, 10−3 ).
i i
i i
i i
Part V
i i
i i
i i
i i
i i
i i
409
In Part II we studied lossless data compression (source coding), where the goal is to compress
a random variable (source) X into a minimal number of bits on average (resp. exactly) so that
X can be reconstructed exactly (resp. with high probability) using these bits. In both cases, the
fundamental limit is given by the entropy of the source X. Clearly, this paradigm is confined to
discrete random variables.
In this part we will tackle the next topic, lossy data compression: Given a random variable X,
encode it into a minimal number of bits, such that the decoded version X̂ is a faithful reconstruction
of X, in the sense that the “distance” between X and X̂ is at most some prescribed accuracy either
on average or with high probability.
The motivations for study lossy compression are at least two-fold:
1 Many natural signals (e.g. audio, images, or video) are continuously valued. As such, there is
a need to represent these real-valued random variables or processes using finitely many bits,
which can be fed to downstream digital processing; see Fig. 23.3 for an illustration.
Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time
2 There is a lot to be gained in compression if we allow some reconstruction errors. This is espe-
cially important in applications where certain errors (such as high-frequency components in
natural audio and visual signals) are imperceptible to humans. This observation is the basis of
many important compression algorithms and standards that are widely deployed in practice,
including JPEG for images, MPEG for videos, and MP3 for audios.
The operation of mapping (naturally occurring) continuous time/analog signals into
(electronics-friendly) discrete/digital signals is known as quantization, which is an important sub-
ject in signal processing in its own right (cf. the encyclopedic survey [144]). In information theory,
the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959
[280]. To start, we will take a closer look at quantization next in Section 24.1, followed by the
information-theoretic formulation in Section 24.2. A simple (and tight) converse bound is given
in Section 24.3, with the the matching achievability bound deferred to the next chapter.
Finally, in Chapter 27 we study Kolmogorov’s metric entropy, which is a non-probabilistic
theory of quantization for sets in metric spaces. In addition to connections to the probabilistic the-
ory of quantization in the preceding chapters, this concept has far-reaching consequences in both
probability (e.g. empirical processes, small-ball probability) and statistical learning (e.g. entropic
upper and lower bounds for estimation) that will be explored further in Part VI.
i i
i i
i i
24 Rate-distortion theory
−A A
2 2
where D denotes the average distortion. Often R = log2 N is used instead of N, so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
∆j
410
i i
i i
i i
Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
X
N
DU (R) = E|X − qU (X)| =2
E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (24.1)
j=1
X
N
|∆j |2
(high rate approximation) ≈ P[ X ∈ ∆ j ] (24.2)
12
j=1
( NA )2 A2 −2R
= = 2 , (24.3)
12 12
Var(X)
10 log10 SNR = 10 log10
E|X − qU (X)|2
12Var(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule of
thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR. How-
ever, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
The above discussion deals with X with a bounded support. When X is unbounded, it is wise to
allocate the quantization points to those values that are more likely and saturate the large values at
the dynamic range of the quantizer, resulting in two types of contributions to the quantization error,
known as the granular distortion and overload distortion. This leads us to the question: Perhaps
uniform quantization is not optimal?
i i
i i
i i
412
Often the way such quantizers are implemented is to take a monotone transformation of the source
f(X), perform uniform quantization, then take the inverse function:
f
X U
q qU (24.4)
X̂ qU ( U)
f−1
i.e., q(X) = f−1 (qU (f(X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 24.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieving this is
possible because the human ear inherently uses logarithmic comp anding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not
samples themselves!)) have an approximately Laplace distribution. Due to these two factors, a
very popular and sensible choice for f is the μ-companding function
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called μ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly
different compander called the A-law is used.
Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
i i
i i
i i
piecewise constant:
With ideas like this, in 1957 S. Lloyd developed an algorithm (called Lloyd’s algorithm or
Lloyd’s Method I) for iteratively finding optimal quantization regions and points.1 Suitable for
both the scalar and vector cases, this method proceeds as follows: Initialized with some choice of
N = 2k quantization points, the algorithm iterates between the following two steps:
1 Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
2 Update the quantization points by the centroids E[X|X ∈ D] of each Voronoi region D.
b b
b b
b b
b b
b b
Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results about
Lloyd’s algorithm
1
This work at Bell Labs remained unpublished until 1982 [202].
i i
i i
i i
414
Remark 24.1. The third point tells us that Lloyd’s algorithm is not always guaranteed to give
the optimal quantization strategy.2 One sufficient condition for uniqueness of a CVT is the log-
concavity of the density of X [129], e.g., Gaussians. On the other hand, even for Gaussian, if N > 3,
optimal quantization points are not
Remark 24.2 (k-means). A popular clustering method called k-means is the following: Given n
data points x1 , . . . , xn ∈ Rd , the goal is to find k centers μ1 , . . . , μk ∈ Rd to minimize the objective
function
X n
min kxi − μj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (24.5):
min EkX − q(X)k2
q:|Im(q)|≤k
Pn
where X is distributed according to the empirical distribution over the dataset, namely, 1n i=1 δxi .
Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is a commonly
used heuristic.
X
N Z
|∆j |2 ∆ 2 ( x)
≈ P[ X ∈ ∆ j ] ≈ p ( x) dx
12 12
j=1
Z
1
= p(x)λ−2 (x)dx ,
12N2
2
As a simple example one may consider PX = 13 ϕ(x − 1) + 13 f(x) + 13 f(x + 1) where f(·) is a very narrow pdf, symmetric
around 0. Here the CVT with centers ± 32 is not optimal among binary quantizers (just compare to any quantizer that
quantizes two adjacent spikes to same value).
3
This argument is easy to make rigorous. We only need to define reconstruction points cj as the solution of
∫ cj j
−∞ λ(x) dx = N (quantile).
i i
i i
i i
To find the optimal density λ that gives the best reconstruction (minimum MSE) when X has den-
R R R R R
sity p, we use Hölder’s inequality: p1/3 ≤ ( pλ−2 )1/3 ( λ)2/3 . Therefore pλ−2 ≥ ( p1/3 )3 ,
1/3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) .
p1/3 dx
Therefore when N = 2R ,4
Z 3
1 −2R
Dscalar (R) ≈ 2 p1/3 (x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2
1 −2R 2
Dscalar (R) ≤2 A = DU (R)
12
where the RHS is the uniform quantization error given in (24.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution, uniform
quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives
√
2 −2R π 3
Dscalar (R) ≈ σ 2 (24.6)
2
Remark 24.3. In fact, in scalar case the optimal non-uniform quantizer can be realized using the
compander architecture (24.4) that we discussed in Section 24.1.2: As an exercise, use Taylor
expansion to analyze the quantization
R
error of (24.4) when N → ∞. The optimal compander
t
p1/3 (t)dt
f : R → [0, 1] turns out to be f(x) = R−∞
∞
p1/3 (t)dt
[28, 289].
−∞
4
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” as shown by Zador [344, 345].
i i
i i
i i
416
On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smooth
R cj
function such that −∞ Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
Z
1 1
D≈ pX (x) 2 dx (24.8)
12 Λ ( x)
Z
Λ(x)
H(q(X)) ≈ pX (x) log dx (24.9)
p X ( x)
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ ( x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be con-
2h(X)
sidered next) can not achieve distortion better that 2−2R 22πe . That is, the maximal improvement
they can gain for any i.i.d. source is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar
uniform quantizers followed by lossless compression is an overwhelmingly popular solution in
practice.
Hamming Game. Given 100 unbiased bits, we are asked to inspect them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the original
100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% bit error rate (BER). However, as we will show in this chapter
(Theorem 26.1), the optimal strategy amazingly achieves a BER of 11%. How is this possible?
After all we are guessing independent bits and the loss function (BER) treats all bits equally.
Gaussian example. Given (X1 , . . . , Xn ) drawn independently from N (0, σ 2 ), we are given a
budget of one bit per symbol to compress, so that the decoded version (X̂1 , . . . , X̂n ) has a small
Pn
mean-squared error 1n i=1 E[(Xi − X̂i )2 ].
To this end, a simple strategy is to quantize each coordinate into 1 bit. As worked out in Exam-
ple 24.1, the optimal one-bit quantization error is (1 − π2 )σ 2 ≈ 0.36σ 2 . In comparison, we will
2
show later (Theorem 26.2) that there is a scheme that achieves an MSE of σ4 per coordinate
for large n; furthermore, this is optimal. More generally, given R bits per symbol, by doing opti-
mal vector quantization in high dimensions (namely, compressing (X1 , . . . , Xn ) jointly to nR bits),
rate-distortion theory will tell us that when n is large, we can achieve the per-coordinate MSE:
Dvec (R) = σ 2 2−2R
i i
i i
i i
1 Applying scalar quantization componentwise results in quantization region that are hypercubes,
which may not suboptimal for covering in high dimensions.
2 Concentration of measures effectively removes many atypical source realizations. For example,
when quantizing a single Gaussian X, we need to cover large portion of R in order to deal with
those significant deviations of X from 0. However, when we are quantizing many (X1 , . . . , Xn )
together, the law of large numbers makes sure that many Xj ’s cannot conspire together and all
produce large values. Indeed, (X1 , . . . , Xn ) concentrates near a sphere. As such, we may exclude
large portions of the space Rn from consideration.
where X ∈ X is refereed to as the source, W = f(X) is the compressed discrete data, and X̂ = g(W)
is the reconstruction which takes values in some alphabet X̂ that needs not be the same as X .
A distortion metric (or loss function) is a measurable function d : X × X̂ → R ∪ {+∞}. There
are various formulations of the lossy compression problem:
1 Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].
2 Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].
3 Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize the average length
E[l(W)] or entropy H(W).
In this book we focus on lossy compression with fixed length and are chiefly concerned with
average distortion (with the exception of joint source-channel coding in Section 26.3 where excess
distortion will be needed). The difference between average and excess distortion is analogous to
average and high-probability risk bound in statistics and machine learning. It turns out that under
mild assumptions these two formulations lead to the same fundamental limit (cf. Remark 25.2).
As usual, of particular interest is when the source takes the form of a random vector Sn =
(S1 , . . . , Sn ) ∈ S n and the reconstruction is Ŝn = (S1 , . . . , Sn ) ∈ Ŝ n . We will be focusing on the
so called separable distortion metric defined for n-letter vectors by averaging the single-letter
distortions:
1X
n
d(sn , ŝn ) ≜ d(si , ŝi ). (24.10)
n
i=1
i i
i i
i i
418
Note that, for stationary memoryless (iid) source, the large-blocklength limit in (24.12) in fact
exists and coincides with the infimum over all blocklengths. This is a consequence of the average
distortion criterion and the separability of the distortion metric – see Exercise V.2.
Theorem 24.3 (General Converse). Suppose X → W → X̂, where W ∈ [M] and E[d(X, X̂)] ≤ D.
Then
log M ≥ ϕX (D) ≜ inf I(X; Y).
PY|X :E[d(X,Y)]≤D
Proof.
log M ≥ H(W) ≥ I(X; W) ≥ I(X; X̂) ≥ ϕX (D)
where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).
Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.
Remark 24.4 (The role of D0 and Dmax ). By definition, Dmax is the distortion attainable without any
information. Indeed, if Dmax = Ed(X, x̂) for some fixed x̂, then this x̂ is the “default” reconstruction
of X, i.e., the best estimate when we have no information about X. Therefore D ≥ Dmax can be
achieved for free. This is the reason for the notation Dmax despite that it is defined as an infimum.
On the other hand, D0 should be understood as the minimum distortion one can hope to attain.
Indeed, suppose that X̂ = X and d is a metric on X . In this case, we have D0 = 0, since we can
choose Y to be a finitely-valued approximation of X.
i i
i i
i i
As an example, consider the Gaussian source with MSE distortion, namely, X ∼ N (0, σ 2 ) and
2
d(x, x̂) = (x − x̂)2 . We will show later that ϕX (D) = 21 log+ σD . In this case D0 = 0 which is
however not attained; Dmax = σ 2 and if D ≥ σ 2 , we can simply output 0 as the reconstruction
which requires zero bits.
Proof.
(a) Convexity follows from the convexity of PY|X 7→ I(PX , PY|X ) (Theorem 5.3).
(b) Continuity in the interior of the domain follows from convexity, since D0 =
infPX̂|X E[d(X, X̂)] = inf{D : ϕS (D) < ∞}.
(c) The only way to satisfy the constraint is to take X = Y.
(d) For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.
In channel coding, the main result relates the Shannon capacity, an operational quantity, to the
information capacity. Here we introduce the information rate-distortion function in an analogous
way, which by itself is not an operational quantity.
The reason for defining R(I) (D) is because from Theorem 24.3 we immediately get:
Naturally, the information rate-distortion function inherits the properties of ϕ from Theo-
rem 24.4:
Proof. All properties follow directly from corresponding properties in Theorem 24.4 applied to
ϕSn .
i i
i i
i i
420
Next we show that R(I) (D) can be easily calculated for stationary memoryless (iid) source
without going through the multi-letter optimization problem. This parallels Corollary 20.5 for
channel capacity (with separable cost function).
i.i.d.
Theorem 24.8 (Single-letterization). For stationary memoryless source Si ∼ PS and separable
distortion d in the sense of (24.10), we have for every n,
Thus
Proof. By definition we have that ϕSn (D) ≤ nϕS (D) by choosing a product channel: PŜn |Sn = P⊗ n
Ŝ|S
.
Thus R(I) (D) ≤ ϕS (D).
For the converse, for any PŜn |Sn satisfying the constraint E[d(Sn , Ŝn )] ≤ D, we have
X
n
I(Sn ; Ŝn ) ≥ I(Sj , Ŝj ) (Sn independent)
j=1
Xn
≥ ϕS (E[d(Sj , Ŝj )])
j=1
1 X
n
≥ nϕ S E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1
≥ nϕ S ( D) (ϕS non-increasing)
Theorem 24.9 (Excess-to-Average). Suppose that there exists (f, g) such that W = f(X) ∈ [M] and
P[d(X, g(W)) > D] ≤ ϵ. Assume for some p ≥ 1 and x̂0 ∈ X̂ that (E[d(X, x̂0 )p ])1/p = Dp < ∞.
Then there exists (f′ , g′ ) such that W′ = f′ (X) ∈ [M + 1] and
Remark 24.5. This result is only useful for p > 1, since for p = 1 the right-hand side of (24.13)
does not converge to D as ϵ → 0. However, a different method (as we will see in the proof of
i i
i i
i i
Theorem 25.1) implies that under just Dmax = D1 < ∞ the analog of the second term in (24.13)
is vanishing as ϵ → 0, albeit at an unspecified rate.
Proof. We transform the first code into the second by adding one codeword:
(
f ( x) d(x, g(f(x))) ≤ D
f ′ ( x) =
M + 1 otherwise
(
g(j) j ≤ M
g′ ( j) =
x̂0 j=M+1
Then by Hölder’s inequality,
E[d(X, g′ (W′ )) ≤ E[d(X, g(W))|Ŵ 6= M + 1](1 − ϵ) + E[d(X, x̂0 )1{Ŵ = M + 1}]
≤ D(1 − ϵ) + Dp ϵ1−1/p
i i
i i
i i
where
Pn
and d(Sn , Ŝn ) = 1n i=1 d(Si , Ŝi ) takes a separable form.
We have shown the following general converse in Theorem 24.3: For any [M] 3 W → X → X̂
such that E[d(X, X̂)] ≤ D, we have log M ≥ ϕX (D), which implies in the special case of X =
Sn , log M∗ (n, D) ≥ ϕSn (D) and hence, in the large-n limit, R(D) ≥ R(I) (D). For a stationary
i.i.d.
memoryless source Si ∼ PS , Theorem 24.8 shows that ϕSn single-letterizes as ϕSn (D) = nϕS (D).
As a result, we obtain the converse
In this chapter, we will prove a matching achievability bound and establish the identity R(D) =
R(I) (D) for stationary memoryless sources.
i.i.d.
Theorem 25.1. Consider a stationary memoryless source Sn ∼ PS . Suppose that the distortion
metric d and the target distortion D satisfy:
422
i i
i i
i i
Then
R(D) = R(I) (D) = ϕS (D) = inf I(S; Ŝ). (25.4)
PŜ|S :E[d(S,Ŝ)]≤D
Remark 25.1.
• Note that Dmax < ∞ does not require that d(·, ·) only takes values in R. That is, Theorem 25.1
permits d(s, ŝ) = ∞.
• When Dmax = ∞, typically we have R(D) = ∞ for all D. Indeed, suppose that d(·, ·) is a metric
(i.e. real-valued and satisfies triangle inequality). Then, for any x0 ∈ An we have
d(X, X̂) ≥ d(X, x0 ) − d(x0 , X̂) .
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
E[d(X, X̂)] ≥ E[d(X, x0 )] − max d(x0 , cj ) = ∞ .
j
So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
the absolute impossibility of compressing such sources; it is just not possible with fixed-length
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second moment. But it is easy to see that1 the information rate-distortion function
R(I) (D) < ∞ for any D ∈ (0, ∞). In fact, in this case R(I) (D) is a hyperbola-like curve that never
touches either axis. Using variable-length codes, Sn can be compressed non-trivially into W with
bounded entropy (but unbounded cardinality) H(W). An interesting question is: Is H(W) =
nR(I) (D) + o(n) attainable?
• Techniques for proving (25.4) for memoryless sources can extended to “stationary ergodic”
sources with changes similar to those we have discussed in lossless compression (Chapter 12).
Before giving a formal proof, we give a heuristic derivation emphasizing the connection to large
deviations estimates from Chapter 15.
25.1.1 Intuition
Let us throw M random points C = {c1 , . . . , cM } into the space Ân by generating them indepen-
dently according to a product distribution QnŜ , where QŜ is some distribution on  to be optimized.
Consider the following simple coding strategy:
Encoder : f(sn ) = argmin d(sn , cj ) (25.5)
j∈[M]
1
Indeed, if we take W to be a quantized version of S with small quantization error D and notice that differential entropy of
the Cauchy S is finite, we get from (24.7) that R(I) (D) ≤ H(W) < ∞.
i i
i i
i i
424
The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword is close to the source realization is (exponentially) small, say,
ϵ. However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codewords exist is approximately (1 −ϵ)M ,
which can be made close to zero provided M 1ϵ .
i.i.d.
To explain this intuition further, consider a discrete memoryless source Sn ∼ PS and let us eval-
uate the excess distortion of this random code: P[d(Sn , f(Sn )) > D], where the probability is over
all random codewords c1 , . . . , cM and the source Sn . Define
where the last equality follows from the assumption that c1 , . . . , cM are iid and independent of Sn .
i.i.d.
To simplify notation, let Ŝn ∼ QnŜ independently of Sn , so that PSn ,Ŝn = PnS QnŜ . Then
To evaluate the failure probability, let us consider the special case of PS = Ber( 12 ) and also
choose QŜ = Ber( 12 ) to generate the random codewords, aiming to achieve a normalized Hamming
P P
distortion at most D < 12 . Since nd(Sn , Ŝn ) = i:si =1 (1 − Ŝi ) + i:si =0 Ŝi ∼ Bin(n, 21 ) for any sn ,
the conditional probability (25.7) does not depend on Sn and is given by
1
P[d(S , Ŝ ) > D|S ] = P Bin n,
n n n
≥ nD ≈ 1 − 2−n(1−h(D))+o(n) , (25.8)
2
where in the last step we applied large-deviation estimates from Theorem 15.9 and Example 15.1.
(Note that here we actually need lower estimates on these exponentially small probabilities.) Thus,
Pfailure = (1 − 2−n(1−h(D))+o(n) )M , which vanishes if M = 2n(1−h(D)+δ) for any δ > 0.2 As we will
compute in Theorem 26.1, the rate-distortion function for PS = Ber( 12 ) is precisely ϕS (D) =
1 − h(D), so we have a rigorous proof of the optimal achievability in this special case.
For general distribution PS (or even for PS = Ber(p) for which it is suboptimal to choose
QŜ as Ber( 21 )), the situation is more complicated as the conditional probability (25.7) depends
on the source realization Sn through its empirical distribution (type). Let Tn be the set of typical
realizations whose empirical distribution is close to PS . We have
2
In fact, this argument shows that M = 2n(1−h(D))+o(n) codewords suffice to cover the entire Hamming space within
distance Dn. See (27.9) and Exercise V.11.
i i
i i
i i
where it can be shown (using large deviations analysis similar to information projection in
Chapter 15) that
Thus we conclude that for any choice of QŜ (from which the random codewords were drawn) and
any δ > 0, the above code with M = 2n(E(QŜ )+δ) achieves vanishing excess distortion
= ϕ S ( D)
where the third equality follows from the variational representation of mutual information (Corol-
lary 4.2). This heuristic derivation explains how the constrained mutual information minimization
arises. Below we make it rigorous using a different approach, again via random coding.
Here the first and the third expectations are over (X, Y) ∼ PX,Y = PX PY|X and the information
density i(·; ·) is defined with respect to this joint distribution (cf. Definition 18.1).
• Theorem 25.2 says that from an arbitrary PY|X such that E[d(X, Y)] ≤ D, we can extract a good
code with average distortion D plus some extra terms which will vanish in the asymptotic regime
for memoryless sources.
• The proof uses the random coding argument with codewords drawn independently from PY , the
marginal distribution induced by the source distribution PX and the auxiliary channel PY|X . As
such, PY|X plays no role in the code construction and is used only in analysis (by defining a
coupling between PX and PY ).
i i
i i
i i
426
• The role of the deterministic y0 is a “fail-safe” codeword (think of y0 as the default reconstruc-
tion with Dmax = E[d(X, y0 )]). We add y0 to the random codebook for “damage control”, to
hedge against the (highly unlikely) event that we end up with a terrible codebook.
Proof. Similar to the intuitive argument sketched in Section 25.1.1, we apply random coding and
generate the codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X
and add the “fail-safe” codeword cM+1 = y0 . We adopt the same encoder-decoder pair (25.5) –
(25.6) and let X̂ = g(f(X)). Then by definition,
To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Chapter 18):
PX,Y,Y = PX,Y PY
where PY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
Ra
E[U] = 0 P[U ≥ u]du. Then the average distortion is
where
i i
i i
i i
• (25.12) uses the following trick in dealing with (1 − δ)M for δ 1 and M 1. First, recall the
standard rule of thumb:
(
0, δ M 1
(1 − δ) ≈
M
1, δ M 1
In order to obtain firm bounds of a similar flavor, we apply, for any γ > 0,
• (25.13) is simply a change of measure argument of Proposition 18.3. Namely we apply (18.5)
with f(x, y) = 1{d(x,y)≤u} .
• For (25.14) consider the chain:
As a side product, we have the following achievability result for excess distortion.
Theorem 25.3 (Random coding bound of excess distortion). For any PY|X , there exists a code
X → W → X̂ with W ∈ [M], such that for any γ > 0,
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y) > D} ∪ {i(X; Y) > log γ}]
Proof. Proceed exactly as in the proof of Theorem 25.2 (without using the extra codeword y0 ),
replace (25.11) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, cj ) > D] = EX [(1 − P[d(X, Y) ≤ D|X])M ],
and continue similarly.
Finally, we give a rigorous proof of Theorem 25.1 by applying Theorem 25.2 to the iid source
i.i.d.
X = Sn ∼ PS and n → ∞:
Proof of Theorem 25.1. Our goal is the achievability: R(D) ≤ R(I) (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] is achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed symbol so that the expectation is finite. The
i i
i i
i i
428
default reconstruction for Sn is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(Sn , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ ; such PŜ|S since D > D0 by
assumption. Apply Theorem 25.2 to (X, Y) = (Sn , Ŝn ) with
PX = PSn
PY|X = PŜn |Sn = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
1X
n
d( X , Y ) = d(Sj , Ŝj )
n
j=1
y0 = ŝn0
E[d(Sn , g(f(Sn )))] ≤ E[d(Sn , Ŝn )] + E[d(Sn , ŝn0 )]e−M/γ + E[d(Sn , ŝn0 )1{i(Sn ;Ŝn )>log γ } ]
≤ D − δ + Dmax e− exp(nδ) + E[d(Sn , ŝn0 )1En ], (25.15)
| {z } | {z }
→0 →0 (later)
where
1 X
n
WLLN
En = {i(Sn ; Ŝn ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n
j=1
If we can show the expectation in (25.15) vanishes, then there exists an (n, M, D)-code with:
M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o( 1) ≤ D.
To summarize, ∀PŜ|S such that E[d(S, Ŝ)] ≤ D −δ we have shown that R(D) ≤ I(S; Ŝ). Sending δ ↓
0, we have, by continuity of ϕS (D) in (D0 ∞) (recall Theorem 24.4), R(D) ≤ ϕS (D−) = ϕS (D).
It remains to show the expectation in (25.15) vanishes. This is a simple consequence of the
uniform integrability of the sequence {d(Sn , ŝn0 )}. We need the following lemma.
Lemma 25.4. For any positive random variable U, define g(δ) = supH:P[H]≤δ E[U1H ], where the
δ→0
supremum is over all events measurable with respect to U. Then3 EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U1H ] ≤ E[U1{U>b} ] + bδ , where E[U1{U> √b}
] −−−→ 0 by dominated
convergence theorem. Then the proof is completed by setting b = 1/ δ .
Pn
Now d(Sn , ŝn0 ) = 1n j=1 Uj , where Uj are iid copies of U ≜ d(S, ŝ0 ). Since E[U] = Dmax < ∞
P
by assumption, applying Lemma 25.4 yields E[d(Sn , ŝn0 )1En ] = 1n E[Uj 1En ] ≤ g(P[En ]) → 0,
since P[En ] → 0. This proves the theorem.
3
In fact, ⇒ is ⇔.
i i
i i
i i
Figure 25.1 Description of channel simulation game. The distribution P (left) is to be simulated via the
distribution Q (right) at minimal rate R. Depending on the exact formulation we either require R = I(A; B)
(covering lemma) or R = C(A; B) (soft-covering lemma).
Remark 25.2 (Fundamental limit for excess distortion). Although Theorem 25.1 is stated for the
average distortion, under certain mild extra conditions, it also holds for excess distortion where
the goal is to achieve d(Sn , Ŝn ) ≤ D with probability arbitrarily close to one as opposed to in
expectation. Indeed, the achievability proof of Theorem 25.1 is already stated in high probability.
For converse, assume in addition to (25.3) that Dp ≜ E[d(S, ŝ)p ]1/p < ∞ for some ŝ ∈ Ŝ and p > 1.
Pn
Applying Rosenthal’s inequality [270, 170], we have E[d(S, ŝn )p ] = E[( i=1 d(Si , ŝ))p ] ≤ CDpp
for some constant C = C(p). Then we can apply Theorem 24.9 to convert a code for excess
distortion to one for average distortion and invoke the converse for the latter.
To end this section, we note that in Section 25.1.1 and in Theorem 25.1 it seems we applied
different proof techniques. How come they both turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the under-
lined probabilities in (25.9) and (25.11), respectively. To obtain the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projec-
tion problem (25.10). When PY = (QŜ )n = (PŜ )n with PŜ chosen as the output distribution in the
solution to rate-distortion optimization (25.1), the resulting exponent is precisely given by 2−i(X;Y) .
i i
i i
i i
430
How large a rate R is required depends on how we excatly understand the requirement to “fool
the tester”. If the tester is fixed ahead of time (this just means that we know the set F such that
i.i.d.
(Ai , Bi ) ∼ PA,B is declared whenever (An , Bn ) ∈ F) then this is precisely the setting in which
covering lemma operates. In the next section we show that a higher rate R = C(A; B) is required
if F is not known ahead of time. We leave out the celebrated theorem of Bennett and Shor [27]
which shows that rate R = I(A; B) is also attainable even if F is not known, but if encoder and
decoder are given access to a source of common random bits (independent of An , of course).
Before proceeding, we note some simple corner cases:
1 If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly and
use that information to produce Bn through PBn |An .
2 If R = H(B), “A side” can generate Bn according to PnA,B and send that Bn sequence to the “B
side”.
3 If A ⊥
⊥ B, we know that R = 0, as “B side” can generate Bn independently.
Our previous argument for achieving the rate-distortion turns out to give a sharp answer (that
R = I(A; B) is sufficient) for the F-known case as follows.
i.i.d.
Theorem 25.5 (Covering Lemma). Fix PA,B and let (Aj , Bj ) ∼ PA,B , R > I(A; B) and C =
{c1 , . . . , cM } where each codeword cj is i.i.d. drawn from distribution PnB . ∀ϵ > 0, for M ≥
2n(I(A;B)+ϵ) we have that: ∀F
Remark 25.3. The origin of the name “covering” is due to the fact that sampling the An space at
rate slightly above I(A; B) covers all of it, in the sense of reproducing the joint statistics of (An , Bn ).
Proof. Set γ > M and following similar arguments of the proof for Theorem 25.2, we have
As we explained, the version of the covering lemma that we stated shows only that for one fixed
test set F. However, if both A and B take values on finite alphabets then something stronger can
be stated.
First, in this case i(An ; Bn ) is a sum of bounded iid terms and thus the o(1) is in fact e−Ω(n) . By
applying the previous result to F = {(an , bn ) : #{i : ai = α, bi = β}} with all possible α ∈ A
and β ∈ B we conclude that for every An there must exist a codeword c such that the empirical
joint distribution (joint type) P̂An ,c satisfies
i i
i i
i i
where δn → 0. Thus, by communicating nR bits we are able to fully reproduce the correct empirical
i.i.d.
distribution as if the output were generated ∼ PA,B .
That this is possible to do at rate R ≈ I(A; B) can be explained combinatorially: To generate
Bn , there are around 2nH(B) high probability sequences; for each An sequence, there are around
2nH(B|A) Bn sequences that have the same joint distribution, therefore, it is sufficient to describe
nH(B)
the class of Bn for each An sequence, and there are around 22nH(B|A) = 2nI(A;B) classes.
Let us now denote the selected codeword c by B̂n . From the previous discussion we have shown
that
1X
n
f(Aj , B̂j ) ≈ EA,B∼PA,B [f(A, B)] ,
n
j=1
for any bounded function f. A stronger requirement would be to demand that the joint distribution
PAn ,B̂n fools any permutation invariant tester, i.e.
sup |PAn ,B̂n (F) − PnA,B (F)| → 0
where the supremum is taken over all permutation invariant subset F ⊂ An × B n . This is not
guaranteed by the covering lemma. Indeed, a sufficient statistic for a permutation invariant tester
is a joint type P̂An ,B̂n . The construction above satisfies P̂An ,B̂n ≈ PA,B , but it might happen that
P̂An ,B̂n although close to PA,B still takes highly unlikely values (for example, if we restrict all c to
have the same composition P0 , the tester can easily detect the problem since PnB -measure of all
√
strings of composition P0 cannot exceed O(1/ n)). Formally, to fool permutation invariant tester
we need to have small total variation between the distribution on the joint types under P and Q.
We conjecture, however, that nevertheless the rate R = I(A; B) should be sufficient to achieve
also this stronger requirement. In the next section we show that if one removes the permutation-
invariance constraint, then a larger rate R = C(A; B) is needed.
Theorem 25.6 (Cuff [85]). Let PA,B be an arbitrary distribution on the finite space A×B . Consider
i.i.d.
a coding scheme where Alice observes An ∼ PnA , sends a message W ∈ [2nR ] to Bob, who given
W generates a (possibly random) sequence B̂n . If (25.16) is satisfied for all ϵ > 0 and sufficiently
large n, then we must have
R ≥ C(A; B) ≜ min I(A, B; U) , (25.17)
A→U→B
i i
i i
i i
432
where C(A; B) is known as the Wyner’s common information [336]. Furthermore, for any R >
C(A; B) and ϵ > 0 there exists n0 (ϵ) such that for all n ≥ n0 (ϵ) there exists a scheme
satisfying (25.16).
Note that condition (25.16) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , Bn ) with probability ≥ 1 −ϵ. However, compared to Theorem 25.5,
this requires a higher communication rate since C(A; B) ≥ I(A; B), clearly.
Proof. Showing that Wyner’s common information is a lower-bound is not hard. First, since
PAn ,B̂n ≈ PnA,B (in TV) we have
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P) − H(Q)
with TV(P, Q), cf. (7.18) and Corollary 6.8). Next, we have
≳ nC(A; B) (25.21)
At → W → B̂t
and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B .
To show achievability, let us notice that the problem is equivalent to constructing three random
variables (Ân , W, B̂n ) such that a) W ∈ [2nR ], b) the Markov relation
holds and c) TV(PÂn ,B̂n , PnA,B ) ≤ ϵ/2. Indeed, given such a triple we can use coupling charac-
terization of TV (7.18) and the fact that TV(PÂn , PnA ) ≤ ϵ/2 to extend the probability space
to
An → Ân → W → B̂n
and P[An = Ân ] ≥ 1 − ϵ/2. Again by (7.18) we conclude that TV(PAn ,B̂n , PÂn ,B̂n ) ≤ ϵ/2 and by
triangle inequality we conclude that (25.16) holds.
Finally, construction of the triple satisfying a)-c) follows from the soft-covering lemma
(Corollary 25.8) applied with V = (A, B) and W being uniform on the set of xi ’s there.
i i
i i
i i
Theorem 25.7. Fix PX,Y and for any λ ∈ R let us define Rényi mutual information
Iλ (X; Y) = Dλ (PX,Y kPX PY ) ,
where Dλ is the Rényi-divergence, cf. Definition 7.22. We have for every 1 < λ ≤ 2
1
E[D(PY|X ◦ P̂n kPY )] ≤ log(1 + exp{(λ − 1)(Iλ (X; Y) − log n)}) . (25.23)
λ−1
Note that conditioned on Y we get to analyze a λ-th moment of a sum of iid random variables.
This puts us into a well-known setting of Rosenthal-type inequalities. In particular, we have that
for any iid non-negative Bj we have, provided 1 ≤ λ ≤ 2, that
!λ
X n
E Bi ≤ n E[Bλ ] + (n E[B])λ . (25.26)
i=1
i i
i i
i i
434
This is known to be essentially tight [273]. It can be proven by applying (a + b)λ−1 ≤ aλ−1 + bλ−1
and Jensen’s to get
X
E Bi (Bi + Bj )λ−1 ≤ E[Bλ ] + E[B]((n − 1) E[B])λ−1 .
j̸=i
which implies
1
Iλ (Xn ; Ȳ) ≤ log 1 + n1−λ exp{(λ − 1)Iλ (X; Y)} ,
λ−1
which together with (25.24) recovers the main result (25.23).
Remark 25.4. Hayashi [158] upper bounds the LHS of (25.23) with
λ λ−1
log(1 + exp{ (Kλ (X; Y) − log n)}) ,
λ−1 λ
where Kλ (X; Y) = infQY Dλ (PX,Y kPX QY ) is the so-called Sibson-Csiszár information, cf. [244].
This bound, however, does not have the right rate of convergence as n → ∞, at least for λ = 1 as
comparison with Proposition 7.15 reveals.
We note that [158, 154] also contain direct bounds on
E[TV(PY|X ◦ P̂n , PY )]
P
which do not assume existence of λ-th moment of PYY|X for λ > 1 and instead rely on the distribution
of i(X; Y). We do not discuss these bounds here, however, since for the purpose of discussing finite
alphabets the next corollary is sufficient.
1X
n
D( PY|X=xi kPY ) ≤ exp{−dϵ}
n
i=1
as d → ∞.
Remark 25.5. The origin of the name “soft-covering” is due to the fact that unlike the covering
lemma (Theorem 25.5) which selects one xi (trying to make PY|X=xi as close to PY as possible)
here we mix over n choices uniformly.
i i
i i
i i
i i
i i
i i
In the previous chapters we have proved Shannon’s main theorem for lossy data compression: For
stationary memoryless (iid) sources and separable distortion, under the assumption that Dmax < ∞,
the operational and information rate-distortion functions coincide, namely,
R(D) = R(I) (D) = inf I(S; Ŝ).
PŜ|S :Ed(S,Ŝ)≤D
In addition, we have shown various properties about the rate-distortion function (cf. Theorem 24.4).
In this chapter we compute the rate-distortion function for several important source distributions by
evaluating this constrained minimization of mutual information. Next we extending the paradigm
of joint source-channel coding in Section 19.7 to the lossy setting; this reasoning will later be
found useful in statistical applications in Part VI (cf. Chapter 30).
Theorem 26.1.
R(D) = (h(p) − h(D))+ . (26.1)
For example, when p = 1/2, D = .11, we have R(D) ≈ 1/2 bits. In the Hamming game
described in Section 24.2 where we aim to compress 100 bits down to 50, we indeed can do this
while achieving 11% average distortion, compared to the naive scheme of storing half the string
and guessing on the other half, which achieves 25% average distortion.
Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to
show.
For the converse, consider any PŜ|S such that P[S 6= Ŝ] ≤ D ≤ p ≤ 21 . Then
436
i i
i i
i i
In order to achieve this bound, we need to saturate the above chain of inequalities, in particular,
choose PŜ|S so that the difference S + Ŝ is independent of Ŝ. Let S = Ŝ + Z, where Ŝ ∼ Ber(p′ ) ⊥
⊥
Z ∼ Ber(D), and p′ is such that the convolution gives exactly Ber(p), namely,
p′ ∗ D = p′ (1 − D) + (1 − p′ )D = p,
p−D
i.e., p′ = 1−2D . In other words, the backward channel PS|Ŝ is exactly BSC(D) and the resulting
PŜ|S is our choice of the forward channel PŜ|S . Then, I(S; Ŝ) = H(S) − H(S|Ŝ) = H(S) − H(Z) =
h(p) − h(D), yielding the upper bound R(D) ≤ h(p) − h(D).
Remark 26.1. Here is a more general strategy (which we will later implement in the Gaussian
case.) Denote the optimal forward channel from the achievability proof by P∗Ŝ|S and P∗S|Ŝ the asso-
ciated backward channel (which is BSC(D)). We need to show that there is no better PŜ|S with
P[S 6= Ŝ] ≤ D and a smaller mutual information. Then
Remark 26.2. By WLLN, the distribution PnS = Ber(p)n concentrates near the Hamming sphere of
radius np as n grows large. Recall that in proving Shannon’s rate distortion theorem, the optimal
codebook are drawn independently from PnŜ = Ber(p′ )n with p′ = 1p−−2D D
. Note that p′ = 1/2 if
′
p = 1/2 but p < p if p < 1/2. In the latter case, the reconstruction points concentrate on a smaller
sphere of radius np′ and none of them are typical source realizations, as illustrated in Fig. 26.1.
Theorem 26.2. Let S ∼ N (0, σ 2 ) and d(s, ŝ) = (s − ŝ)2 for s, ŝ ∈ R. Then
1 σ2
R ( D) = log+ . (26.2)
2 D
i i
i i
i i
438
S(0, np)
S(0, np′ )
Hamming Spheres
Figure 26.1 Source realizations (solid sphere) versus codewords (dashed sphere) in compressing Hamming
sources.
d dσ 2
R(D) = log+ . (26.3)
2 D
Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to
show.
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D, and the forward channel can be easily
found to be PŜ|S = N ( σ σ−2 D S, σ σ−2 D D). Then
2 2
1 σ2 1 σ2
I(S; Ŝ) = log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Formally, we can mimic the proof of Theorem 26.1 replacing Shannon entropy by
the differential entropy and applying the maximal entropy result from Theorem 2.7; the caveat is
that for Ŝ (which may be discrete) the differential entropy may not be well-defined. As such, we
follow the alternative proof given in Remark 26.1. Let PŜ|S be any conditional distribution such
that EP [(S − Ŝ)2 ] ≤ D. Denote the forward channel in the above achievability by P∗Ŝ|S . Then
" #
P∗S|Ŝ
I(PS , PŜ|S ) = D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log
PS
" #
P∗S|Ŝ
≥ EP log
PS
(S−Ŝ)2
√ 1 e− 2D
= EP log 2πD
S2
√ 1
2π σ 2
e− 2 σ 2
i i
i i
i i
" #
1 σ2 log e S2 (S − Ŝ)2
= log + EP −
2 D 2 σ2 D
1 σ2
≥ log .
2 D
Finally, for the vector case, (26.3) follows from (26.2) and the same single-letterization argu-
ment in Theorem 24.8 using the convexity of the rate-distortion function in Theorem 24.4(a).
The interpretation of the optimal reconstruction points in the Gaussian case is analogous to that
of the Hamming source previously
√ discussed in Remark 26.2: As n grows, the Gaussian random
vector concentrates on S(0, nσ 2 ) (n-sphere in Euclideanp space rather than Hamming), but each
reconstruction point drawn from (P∗Ŝ )n is close to S(0, n(σ 2 − D)). So again the picture is similar
to Fig. 26.1 of two nested spheres.
Note that the exact expression in Theorem 26.2 relies on the Gaussianity assumption of the
source. How sensitive is the rate-distortion formula to this assumption? The following comparison
result is a counterpart of Theorem 20.12 for channel capacity:
Theorem 26.3. Assume that ES = 0 and Var S = σ 2 . Consider the MSE distortion. Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D
Remark 26.3. A simple consequence of Theorem 26.3 is that for source distributions with a den-
sity, the rate-distortion function grows according to 12 log D1 in the low-distortion regime as long as
D(PS kN (0, σ 2 )) is finite. In fact, the first inequality, known as the Shannon lower bound (SLB),
is asymptotically tight, in the sense that
1 σ2
R(D) = log − D(PS kN (0, σ 2 )) + o(1), D → 0 (26.4)
2 D
under appropriate conditions on PS [200, 177]. Therefore, by comparing (2.21) and (26.4), we
see that, for small distortion, uniform scalar quantization (Section 24.1) is in fact asymptotically
optimal within 12 log(2πe) ≈ 2.05 bits.
Later in Section 30.1 we will apply SLB to derive lower bounds for statistical estimation. For
this we need the following general version of SLB (see Exercise V.6 for a proof): Let k · k be an
arbitrary norm on Rd and r > 0. Let X be a d-dimensional continuous random vector with finite
differential entropy h(X). Then
d d d
inf I(X; X̂) ≥ h(X) + log − log Γ +1 V , (26.5)
PX̂|X :E[∥X̂−X∥r ]≤D r Dre r
distortion function:
R(D) ≤ I(PS , P∗Ŝ|S )
i i
i i
i i
440
σ2 − D σ2 − D
= I(S; S + W) W ∼ N ( 0, D)
σ2 σ2
σ2 − D
≤ I(SG ; SG + W ) by Gaussian saddle point (Theorem 5.11)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let P∗S|Ŝ = N (Ŝ, D) denote the AWGN channel
with noise power D. Then
I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )
" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log − D(PS kPSG )
PSG
(S−Ŝ)2
√ 1 e− 2D
≥ EP log 2πD
S2
− D(PS kPSG )
√ 1
2π σ 2
e− 2 σ 2
1 σ2
≥ log − D(PS kPSG ).
2 D
In fact there are also iterative algorithms (Blahut-Arimoto) that computes R(D). However, for
the peace of mind it is good to know there are some general reasons why tricks like we used in
Hamming/Gaussian actually are guaranteed to work.
Theorem 26.4.
1 Suppose PY∗ and PX|Y∗ PX are such that E[d(X, Y∗ )] ≤ D and for any PX,Y with E[d(X, Y)] ≤
D we have
dPX|Y∗
E log (X|Y) ≥ I(X; Y∗ ) . (26.6)
dPX
Then R(D) = I(X; Y∗ ).
i i
i i
i i
2 Suppose that I(X; Y∗ ) = R(D). Then for any regular branch of conditional probability PX|Y∗
and for any PX,Y satisfying
• E[d(X, Y)] ≤ D and
• PY PY∗ and
• I(X; Y) < ∞
the inequality (26.6) holds.
Remarks:
1 The first part is a sufficient condition for optimality of a given PXY∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PX,Y satisfying those conditions is rich enough to infer from (26.6):
dPX|Y∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.
2 Note that the second part is not valid without assuming PY PY∗ . A counterexample to this
and various other erroneous (but frequently encountered) generalizations is the following: A =
{0, 1}, PX = Bern(1/2), Â = {0, 1, 0′ , 1′ } and
The R(D) = |1 − h(D)|+ , but there exist multiple non-equivalent optimal choices of PY|X , PX|Y
and PY .
Proof. The first part is just a repetition of the proofs above for the Hamming and Gaussian case,
so we focus on the second part. Suppose there exists a counterexample PX,Y achieving
dPX|Y∗
I1 = E log (X|Y) < I∗ = R(D) .
dPX
Notice that whenever I(X; Y) < ∞ we have
and thus
Before going to the actual proof, we describe the principal idea. For every λ we can define a joint
distribution
i i
i i
i i
442
PX|Y∗ (X|Yλ )
= D(PX|Yλ kPX|Y∗ |PYλ ) + E (26.9)
PX
= D(PX|Yλ kPX|Y∗ |PYλ ) + λI1 + (1 − λ)I∗ . (26.10)
From here we will conclude, similar to Proposition 2.18, that the first term is o(λ) and thus for
sufficiently small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ 1 ( y) ≜ ( y) (26.11)
dPY∗
λρ1 (y)
λ(y) ≜ (26.12)
λρ1 (y) + λ̄
(λ)
PX|Y=y = λ(y)PX|Y=y + λ̄(y)PX|Y∗ =y (26.13)
dPYλ = λdPY + λ̄dPY∗ = (λρ1 (y) + λ̄)dPY∗ (26.14)
D(y) = D(PX|Y=y kPX|Y∗ =y ) (26.15)
(λ)
D λ ( y) = D(PX|Y=y kPX|Y∗ =y ) . (26.16)
Notice:
Dλ (y) ≤ λ(y)D(y)
and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (26.7) the function ρ1 (y)D(y) is non-negative and PY∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY∗ Dλ (y)ρ1 (y) = dPY∗ ρ1 (y) lim D λ ( y) = 0 (26.17)
λ→0 {ρ >0}
1
λ( y ) {ρ1 >0} λ→ 0 λ( y)
i i
i i
i i
Z
1 1 (λ)
= dPYλ Dλ (y) = D(PX|Y kPX|Y∗ |PYλ ) ,
λ Y λ
(26.20)
where in the penultimate step we used Dλ (y) = 0 on {ρ1 = 0}. Hence, (26.17) shows
(λ)
D(PX|Y kPX|Y∗ |PYλ ) = o(λ) , λ → 0.
Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
(λ) dPX|Y∗ dPX|Y∗ ∗
I ( X ; Yλ ) = D(PX|Y kPX|Y∗ |PYλ ) + λ E log (X|Y) + λ̄ E log (X|Y ) (26.21)
dPX dPX
= I∗ + λ(I1 − I∗ ) + o(λ) , (26.22)
I(X; Yλ ) ≥ I∗ = R(D) .
Such a pair (f, g) is called a (k, n, D)-JSCC, which transmits k symbols over n channel uses such
that the end-to-end distortion is at most D in expectation. Our goal is to optimize the encoder/de-
coder pair so as to maximize the transmission rate (number of symbols per channel use) R = nk .1
As such, we define the asymptotic fundamental limit as
1
RJSCC (D) ≜ lim inf max {k : ∃(k, n, D)-JSCC} .
n→∞ n
1
Or equivalently, minimize the bandwidth expansion factor ρ = nk .
i i
i i
i i
444
To simplify the exposition, we will focus on JSCC for a stationary memoryless source Sk ∼ P⊗S
k
⊗n
transmitted over a stationary memoryless channel PYn |Xn = PY|X subject to a separable distortion
Pk
function d(sk , ŝk ) = 1k i=1 d(si , ŝi ).
26.3.1 Converse
The converse for the JSCC is quite simple, based on data processing inequality and following the
weak converse of lossless JSCC using Fano’s inequality.
The interpretation of this result is clear: Since we need at least R(D) bits per symbol to recon-
struct the source up to a distortion D and we can transmit at most C bits per channel use, the overall
transmission rate cannot exceeds C/R(D). Note that the above theorem clearly holds for channels
with cost constraint with the corresponding capacity (Chapter 20).
Proof. Consider a (k, n, D)-code which induces the Markov chain Sk → Xn → Yn → Ŝk such
Pk
that E[d(Sk , Ŝk )] = 1k i=1 E[d(Si , Ŝi )] ≤ D. Then
( a) (b) ( c)
kR(D) = inf I(Sk ; Ŝk ) ≤ I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC
PŜk |Sk :E[d(Sk ,Ŝk )]≤D P Xn
where (b) applies data processing inequality for mutual information, (a) and (c) follow from the
respective single-letterization result for lossy compression and channel coding (Theorem 24.8 and
Proposition 19.10).
Remark 26.4. Consider the case where the source is Ber(1/2) with Hamming distortion. Then
Theorem 26.5 coincides with the converse for channel coding under bit error rate Pb in (19.33):
k C
R= ≤
n 1 − h(Pb )
which was previously given in Theorem 19.21 and proved using ad hoc techniques. In the case of
channel with cost constraints, e.g., the AWGN channel with C(SNR) = 12 log(1 + SNR), we have
C(SNR)
Pb ≥ h−1 1 −
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical
codes. (See, e.g., Fig. 2 from [263] for BIAWGN (binary-input) channel.) This is erroneous, since
the pb above refers to the bit error rate of data bits (or systematic bits), not all of the codeword bits.
The latter quantity is what typically called BER (see (19.33)) in the coding-theoretic literature.
i i
i i
i i
Theorem 26.6. For any stationary memoryless source (PS , S, Ŝ, d) with rate-distortion function
R(D) satisfying Assumption 26.1 (below), and for any stationary memoryless channel PY|X with
capacity C,
C
RJSCC (D) = .
R ( D)
Assumption 26.1 on the source (which is rather technical and can be skipped in the first reading)
is to control the distortion incurred by the channel decoder making an error. Despite this being a
low-probability event, without any assumption on the distortion metric, we cannot say much about
its contribution to the end-to-end average distortion. (Note that this issue does not arise in lossless
JSCC). Assumption 26.1 is trivially satisfied by bounded distortion (e.g., Hamming), and can be
shown to hold more generally such as for Gaussian sources and MSE distortion.
• Let (fs , gs ) be a (k, 2kR(D)+o(k) , D)-code for compressing Sk such that E[d(Sk , gs (fs (Sk )] ≤ D.
By Lemma 26.8 (below), we may assume that all reconstruction points are not too far from
some fixed string, namely,
for all i and some constant L, where sk0 = (s0 , . . . , s0 ) is from Assumption 26.1 below.
• Let (fc , gc ) be a (n, 2nC+o(n) , ϵn )max -code for channel PYn |Xn such that kR(D) + o(k) ≤ nC +
o(n) and the maximal probability of error ϵn → 0 as n → ∞. Such as code exists thanks to
Theorem 19.9 and Corollary 19.5.
Let the JSCC encoder and decoder be f = fc ◦ fs and g = gs ◦ gc . So the overall system is
fs fc gc gs
Sk − → Xn −→ Yn −
→W− → Ŵ −
→ Ŝk .
Note that here we need to control the maximal probability of error of the channel code since
when we concatenate these two schemes, W at the input of the channel is the output of the source
compressor, which need not be uniform.
i i
i i
i i
446
To analyze the average distortion, we consider two cases depending on whether the channel
decoding is successful or not:
By assumption on our lossy code, the first term is at most D. For the second term, we have P[W 6=
Ŵ] ≤ ϵn = o(1) by assumption on our channel code. Then
( a)
E[d(Sk , gs (Ŵ))1{W 6= Ŵ}] ≤ E[1{W 6= Ŵ}λ(d(Sk , ŝk0 ) + d(sk0 , gs (Ŵ)))]
(b)
≤ λ · E[1{W 6= Ŵ}d(Sk , ŝk0 )] + λL · P[W 6= Ŵ]
( c)
= o(1),
where (a) follows from the generalized triangle inequality from Assumption 26.1(a) below; (b)
follows from (26.23); in (c) we apply Lemma 25.4 that were used to show the vanishing of the
expectation in (25.15) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for all R >
C/R(D), there exists a sequence of (k, n, D + o(1))-JSCC codes.
Assumption 26.1. Fix D. For a source (PS , S, Ŝ, d), there exists λ ≥ 0, s0 ∈ S, ŝ0 ∈ Ŝ such that
(a) Generalized triangle inequality: d(s, ŝ) ≤ λ(d(s, ŝ0 ) + d(s0 , â)) ∀a, â.
(b) E[d(S, ŝ0 )] < ∞ (so that Dmax < ∞ too).
(c) E[d(s0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D).
(d) d(s0 , ŝ0 ) < ∞.
The interpretation of this assumption is that the spaces S and Ŝ have “nice centers” s0 and ŝ0 ,
in the sense that the distance between any two points is upper bounded by a constant times the
distance from the centers to each point (see figure below).
b
b
s ŝ
b b
s0 ŝ0
S Ŝ
Note that Assumption 26.1 is not straightforward to verify. Next we give some more convenient
sufficient conditions. First of all, Assumption 26.1 holds automatically for bounded distortion
i i
i i
i i
function. In other words, for a discrete source on a finite alphabet S , a finite reconstruction alphabet
Ŝ , and a finite distortion function d(s, ŝ) < ∞, Assumption 26.1 is fulfilled. More generally, we
have the following criterion.
Theorem 26.7. If S = Ŝ and d(s, ŝ) = ρ(s, ŝ)q for some metric ρ and q ≥ 1, and Dmax ≜
infŝ0 E[d(S, ŝ0 )] < ∞, then Assumption 26.1 holds.
Proof. Take s0 = ŝ0 that achieves a finite Dmax = E[d(S, ŝ0 )]. (In fact, any points can serve as
centers in a metric space). Applying triangle inequality and Jensen’s inequality, we have
q q
1 1 1 1 1
ρ(s, ŝ) ≤ ρ(s, s0 ) + ρ(s0 , ŝ) ≤ ρq (s, s0 ) + ρq (s0 , ŝ).
2 2 2 2 2
Thus d(s, ŝ) ≤ 2q−1 (d(s, s0 ) + d(s0 , ŝ)). Taking λ = 2q−1 verifies (a) and (b) in Assumption 26.1.
To verify (c), we can apply this generalized triangle inequality to get d(s0 , Ŝ) ≤ 2q−1 (d(s0 , S) +
d(S, Ŝ)). Then taking the expectation of both sides gives
So we see that metrics raised to powers (e.g. squared norms) satisfy Assumption 26.1. Finally,
we give the lemma used in the proof of Theorem 26.6.
Lemma 26.8. Fix a source satisfying Assumption 26.1 and an arbitrary PŜ|S . Let R > I(S; Ŝ),
L > max{E[d(s0 , Ŝ)], d(s0 , ŝ0 )} and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such
that d(sk0 , ŝk ) ≤ L for every reconstruction point ŝk , where sk0 = (s0 , . . . , s0 ).
For any D′ ∈ (E[d(S, Ŝ)], D), there exist M = 2kR reconstruction points (c1 , . . . , cM ) such that
P min d(S , cj ) > D ≤ P[d1 (Sk , Ŝk ) > D′ ] + o(1),
k ′
j∈[M]
P[d1 (S, Ŝ) > D′ ] ≤ P[d(Sk , Ŝk ) > D′ ] + P[d(sk0 , Ŝk ) > L] → 0
i i
i i
i i
448
as k → ∞ (since E[d(S, Ŝ)] < D′ and E[s0 , Ŝ] < L). Thus we have
′
P min d(S , cj ) > D
k
→0
j∈[M]
and d(sk0 , cj ) ≤ L. Finally, by adding another reconstruction point cM+1 = ŝk0 = (ŝ0 , . . . , ŝ0 ) we
get
h i h i
E min d(Sk , cj ) ≤ D′ + E d(Sk , ŝk0 )1{minj∈[M] d(Sk ,cj )>D′ } = D′ + o(1) ,
j∈[M+1]
where the last estimate follows from the same argument that shows the vanishing of the expectation
in (25.15). Thus, for sufficiently large n the expected distortion is at most D, as required.
i i
i i
i i
As a function of δ the resulting destortion (at large blocklength) will look like the solid and
dashed lines in this graph:
We can see that below δ < δ ∗ the separated solution is much preferred since it achieves zero
distortion. But at δ > δ ∗ it undergoes a catastrophic failure and distortion becomes 1/2 (that is,
we observe pure noise). At the same time the simple “uncoded” JSCC has its distortion decreasing
gracefully. It has been a long-standing problem since the early days of information theory to find
schemes that would interpolate between these two extreme solutions.
Even theoretically the problem of JSCC still contains great many mysteries. For example, in
Section 22.5 we described refined expansion of the channel coding rate as a function of block-
length. However, similar expansions for the JSCC are not available. In fact, even showing that
√
convergence of the nk to the ultimate limit of R(CD) happens at the speed of Θ(1/ n) has only been
demonstrated recently [178] and only for one special case (of a binary source and BSCδ channel
as in the example above).
i i
i i
i i
27 Metric entropy
In the previous chapters of this part we discussed optimal quantization of random vectors in both
fixed and high dimensions. Complementing this average-case perspective, the topic of this chapter
is on the deterministic (worst-case) theory of quantization. The main object of interest is the metric
entropy of a set, which allows us to answer two key questions (a) covering number: the minimum
number of points to cover a set up to a given accuracy; (b) packing number: the maximal number
of elements of a given set with a prescribed minimum pairwise distance.
The foundational theory of metric entropy were put forth by Kolmogorov, who, together with
his students, also determined the behavior of metric entropy in a variety of problems for both finite
and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s 13th
problem, which concerns the possibility or impossibility of representing multi-variable functions
as compositions of functions of fewer variables. It turns out that the theory of metric entropy can
provide a surprisingly simple and powerful resolution to such problems. Over the years, metric
entropy has found numerous connections to and applications in other fields such as approximation
theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In particular, metric entropy will be featured prominently in Part VI of this book, wherein we
discuss its applications to proving both lower and upper bounds for statistical estimation.
This chapter is organized as follows. Section 27.1 provides basic definitions and explains the
fundamental connections between covering and packing numbers. In Section 27.2 we study met-
ric entropy in finite-dimensional spaces and a popular approach for bounding the metric entropy
known as the volume bound. To demonstrate the limitations of the volume method and the associ-
ated high-dimensional phenomenon, in Section 27.3 we discuss a few other approaches through
concrete examples. Infinite-dimensional spaces are treated next for smooth functions in Sec-
tion 27.4 (wherein we also discuss the application to Hilbert’s 13th problem) and Hilbert spaces in
Section 27.5 (wherein we also discuss the application to empirical processes). Section 27.6 gives
an exposition of the connections between metric entropy and the small-ball problem in probabil-
ity theory. Finally, in Section 27.7 we circle back to rate-distortion theory and discuss how it is
related to metric entropy and how information-theoretic methods can be useful for the latter.
450
i i
i i
i i
ϵ
≥ϵ
Θ Θ
Upon defining ϵ-covering and ϵ-packing, a natural question concerns the size of the optimal
covering and packing, leading to the definition of covering and packing numbers:
with min ∅ understood as ∞; we will sometimes abbreviate these as N(ϵ) and M(ϵ) for brevity.
Similar to volume and width, covering and packing numbers provide a meaningful measure for
the “massiveness” of a set. The major focus of this chapter is to understanding their behavior in
both finite and infinite-dimensional spaces as well as their statistical applications.
Some remarks are in order.
1
Notice we imposed strict inequality for convenience.
i i
i i
i i
452
Remark 27.1. Unlike the packing number M(Θ, d, ϵ), the covering number N(Θ, d, ϵ) defined in
(27.1) depends implicitly on the ambient space V ⊃ Θ, since, per Definition 27.1), an ϵ-covering
is required to be a subset of V rather than Θ. Nevertheless, as the next Theorem 27.2 shows, this
dependency on V has almost no effect on the behavior of the covering number.
As an alternative to (27.1), we can define N′ (Θ, d, ϵ) as the size of the minimal ϵ-covering of Θ
that is also a subset of Θ, which is closely related to the original definition as
Here, the left inequality is obvious. To see the right inequality,2 let {θ1 , . . . , θN } be an 2ϵ -covering
of Θ. We can project each θi to Θ by defining θi′ = argminu∈Θ d(θi , u). Then {θ1′ , . . . , θN′ } ⊂ Θ
constitutes an ϵ-covering. Indeed, for any θ ∈ Θ, we have d(θ, θi ) ≤ ϵ/2 for some θi . Then
d(θ, θi′ ) ≤ d(θ, θi ) + d(θi , θi′ ) ≤ 2d(θ, θi ) ≤ ϵ. On the other hand, the N′ covering numbers need
not be monotone with respect to set inclusion.
The relation between the covering and packing numbers is described by the following funda-
mental result.
Proof. To prove the right inequality, fix a maximal packing E = {θ1 , ..., θM }. Then ∀θ ∈ Θ\E,
∃i ∈ [M], such that d(θ, θi ) ≤ ϵ (for otherwise we can obtain a bigger packing by adding θ). Hence
E must an ϵ-covering (which is also a subset of Θ). Since N(Θ, d, ϵ) is the minimal size of all
possible coverings, we have M(Θ, d, ϵ) ≥ N(Θ, d, ϵ).
We next prove the left inequality by contradiction. Suppose there exists a 2ϵ-packing
{θ1 , ..., θM } and an ϵ-covering {x1 , ..., xN } such that M ≥ N + 1. Then by the pigeonhole prin-
ciple, there exist distinct θi and θj belonging to the same ϵ-ball B(xk , ϵ). By triangle inequality,
d(θi , θj ) ≤ 2ϵ, which is a contradiction since d(θi , θj ) > 2ϵ for a 2ϵ-packing. Hence the size of any
2ϵ-packing is at most that of any ϵ-covering, that is, M(Θ, d, 2ϵ) ≤ N(Θ, d, ϵ).
The significance of (27.4) is that it shows that the small-ϵ behavior of the covering and packing
numbers are essentially the same. In addition, the right inequality therein, namely, N(ϵ) ≤ M(ϵ),
deserves some special mention. As we will see next, it is oftentimes easier to prove negative
results (lower bound on the minimal covering or upper bound on the maximal packing) than pos-
itive results which require explicit construction. When used in conjunction with the inequality
N(ϵ) ≤ M(ϵ), these converses turn into achievability statements,3 leading to many useful bounds
on metric entropy (e.g. the volume bound in Theorem 27.3 and the Gilbert-Varshamov bound
2
Another way to see this is from Theorem 27.2: Note that the right inequality in (27.4) yields a ϵ-covering that is included
in Θ. Together with the left inequality, we get N′ (ϵ) ≤ M(ϵ) ≤ N(ϵ/2).
3
This is reminiscent of duality-based argument in optimization: To bound a minimization problem from above, instead of
constructing an explicit feasible solution, a fruitful approach is to equate it with the dual problem (maximization) and
bound this maximum from above.
i i
i i
i i
Theorem 27.5 in the next section). Revisiting the proof of Theorem 27.2, we see that this logic
actually corresponds to a greedy construction (greedily increase the packing until no points can
be added).
Proof. To prove (a), consider an ϵ-covering Θ ⊂ ∪Ni=1 B(θi , ϵ). Applying the union bound yields
XN
vol(Θ) ≤ vol ∪Ni=1 B(θi , ϵ) ≤ vol(B(θi , ϵ)) = Nϵd vol(B),
i=1
where the last step follows from the translation-invariance and scaling property of volume.
To prove (b), consider an ϵ-packing {θ1 , . . . , θM } ⊂ Θ such that the balls B(θi , ϵ/2) are disjoint.
M(ϵ)
Since ∪i=1 B(θi , ϵ/2) ⊂ Θ + 2ϵ B, taking the volume on both sides yields
ϵ ϵ
vol Θ + B ≥ vol ∪M i=1 B(θi , ϵ/2) = Mvol B .
2 2
This proves (b).
Finally, (c) follows from the following two statements: (1) if ϵB ⊂ Θ, then Θ + 2ϵ B ⊂ Θ + 21 Θ;
and (2) if Θ is convex, then Θ+ 12 Θ = 32 Θ. We only prove (2). First, ∀θ ∈ 32 Θ, we have θ = 13 θ+ 32 θ,
where 13 θ ∈ 12 Θ and 32 θ ∈ Θ. Thus 32 Θ ⊂ Θ + 12 Θ. On the other hand, for any x ∈ Θ + 12 Θ, we
have x = y + 21 z with y, z ∈ Θ. By the convexity of Θ, 23 x = 23 y + 31 z ∈ Θ. Hence x ∈ 23 Θ, implying
Θ + 21 Θ ⊂ 32 Θ.
Remark 27.2. Similar to the proof of (a) in Theorem 27.3, we can start from Θ + 2ϵ B ⊂
∪Ni=1 B(θi , 32ϵ ) to conclude that
N(Θ, k · k, ϵ)
(2/3)d ≤ ≤ 2d .
vol(Θ + 2ϵ B)/vol(ϵB)
In other words, the volume of the fattened set Θ + 2ϵ determines the metric entropy up to constants
that only depend on the dimension. We will revisit this reasoning in Section 27.6 to adapt the
volumetric estimates to infinite dimensions where this fattening step becomes necessary.
i i
i i
i i
454
Corollary 27.4 (Metric entropy of balls and spheres). Let k · k be an arbitrary norm on Rd . Let
B ≡ B∥·∥ = {x ∈ Rd : kxk ≤ 1} and S ≡ S∥·∥ = {x ∈ Rd : kxk ≤ 1} be the corresponding unit
ball and unit sphere. Then for ϵ < 1,
d d
1 2
≤ N(B, k · k, ϵ) ≤ 1 + (27.5)
ϵ ϵ
d−1 d−1
1 1
≤ N(S, k · k, ϵ) ≤ 2d 1 + (27.6)
2ϵ ϵ
where the left inequality in (27.6) holds under the extra assumption that k · k is an absolute norm
(invariant to sign changes of coordinates).
Proof. For balls, the estimate (27.5) directly follows from Theorem 27.3 since B + 2ϵ B = (1 + 2ϵ )B.
Next we consider the spheres. Applying (b) in Theorem 27.3 yields
vol(S + ϵB) vol((1 + ϵ)B) − vol((1 − ϵ)B)
N(S, k · k, ϵ) ≤ M(S, k · k, ϵ) ≤ ≤
vol(ϵB) vol(ϵB)
Z ϵ d−1
(1 + ϵ) − (1 − ϵ)
d d
d d−1 1
= = d (1 + x) dx ≤ 2d 1 + .
ϵd ϵ −ϵ ϵ
where the third inequality applies S + ϵB ⊂ ((1 + ϵ)B)\((1 − ϵ)B) by triangle inequality.
Finally, we prove the lower bound in (27.6) for an absolute norm k · k. To this end one cannot
directly invoke the lower bound in Theorem 27.3 as the sphere has zero volume. Note that k · k′ ≜
k(·, 0)k defines a norm on Rd−1 . We claim that every ϵ-packing in k · k′ for the unit k · k′ -ball
induces an ϵ-packing in k · k for the unit k · k-sphere. Fix x ∈ Rd−1 such that k(x, 0)k ≤ 1 and
define f : R+ → R+ by f(y) = k(x, y)k. Using the fact that k · k is an absolute norm, it is easy to
verify that f is a continuous increasing function with f(0) ≤ 1 and f(∞) = ∞. By the mean value
theorem, there exists yx , such that k(x, yx )k = 1. Finally, for any ϵ-packing {x′1 , . . . , x′M } of the unit
ball B∥·∥′ with respect to k·k′ , setting x′i = (xi , yxi ) we have kx′i −x′j k ≥ k(xi −xj , 0)k = kxi −xj k′ ≥ ϵ.
This proves
Then the left inequality of (27.6) follows from those of (27.4) and (27.5).
(a) Using (27.5), we see that for any compact Θ with nonempty interior, we have
1
N(Θ, k · k, ϵ) M(Θ, k · k, ϵ) (27.7)
ϵd
for small ϵ, with proportionality constants depending on both Θ and the norm. In fact, the sharp
constant is also known to exist. It is shown in [? , Theorem IX] that there exists a constant τ
i i
i i
i i
Next we switch our attention to the discrete case of Hamming space. The following theorem
bounds its packing number M(Fd2 , dH , r) ≡ M(Fd2 , r), namely, the maximal number of binary code-
words of length d with a prescribed minimum distance r + 1.5 This is a central question in coding
theory, wherein the lower and upper bounds below are known as the Gilbert-Varshamov bound
and the Hamming bound, respectively.
Proof. Both inequalities in (27.8) follow from the same argument as that in Theorem 27.3, with
Rd replaced by Fd2 and volume by the counting measure (which is translation invariant).
Of particular interest to coding theory is the asymptotic regime of d → ∞ and r = ρd for some
constant ρ ∈ (0, 1). Using the asymptotics of the binomial coefficients (cf. Proposition 1.5), the
4
For example, it is easy to show that τ = 1 for both ℓ∞ and ℓ1 balls in any dimension since cubes can be subdivided into
smaller cubes; for ℓ2 -ball in d = 2, τ = √π is the famous result of L. Fejes Tóth on the optimality of hexagonal
12
arrangement for circle packing [268].
5
Recall that the packing number in Definition 27.1 is defined with a strict inequality.
i i
i i
i i
456
Finding the exact exponent is one of the most significant open questions in coding theory. The best
upper bound to date is due to McEliece, Rodemich, Rumsey and Welch [214] using the technique
of linear programming relaxation.
In contrast, the corresponding covering problem in Hamming space is much simpler, as we
have the following tight result
where R(ρ) = (1 − h(ρ))+ is the rate-distortion function of Ber( 12 ) from Theorem 26.1. Although
this does not automatically follow from the rate-distortion theory, it can be shown using similar
argument – see Exercise V.11.
Finally, we state a lower bound on the packing number of Hamming spheres, which is needed
for subsequent application in sparse estimation (Exercise VI.11) and useful as basic building blocks
for computing metric entropy in more complicated settings (Theorem 27.7).
In particular,
k d
log M(Sdk , k/2) ≥ log . (27.12)
2 2ek
Proof. Again (27.11) follows from the volume argument. To verify (27.12), note that for r ≤ d/2,
Pr
we have i=0 di ≤ exp(dh( dr )) (see Theorem 8.2 or (15.19) with p = 1/2). Using h(x) ≤ x log xe
and dk ≥ ( dk )k , we conclude (27.12) from (27.11).
i i
i i
i i
As a case in point, consider the maximum number of ℓ2 -balls of radius ϵ packed into the unit
ℓ1 -ball, namely, M(B1 , k · k2 , ϵ). (Recall that Bp denotes the unit ℓp -ball in Rd with 1 ≤ p ≤ ∞.)
We have studied the metric entropy of arbitrary norm balls under the same norm in Corollary 27.4,
where the specific value of the volume was canceled from the √
volume ratio. Here, although ℓ1 and
ℓ2 norms are equivalent in the sense that kxk2 ≤ kxk1 ≤ dkxk2 , this relationship is too loose
when d is large.
Let us start by applying the volume method in Theorem 27.3:
vol(B1 ) vol(B1 + 2ϵ B2 )
≤ N(B1 , k · k2 , ϵ) ≤ M(B1 , k · k2 , ϵ) ≤ .
vol(ϵB2 ) vol( 2ϵ B2 )
Applying the formula for the volume of a unit ℓq -ball in Rd :
h id
2Γ 1 + 1q
vol(Bq ) = , (27.13)
Γ 1 + qd
πd
we get6 vol(B1 ) = 2d /d! and vol(B2 ) = Γ(1+d/2) , which yield, by Stirling approximation,
1 1
vol(B1 )1/d , vol(B2 )1/d √ . (27.14)
d d
Then for some absolute constant C,
√ d
vol(B1 + 2ϵ B2 ) vol((1 + ϵ 2 d )B1 ) 1
M(B1 , k · k2 , ϵ) ≤ ≤ ≤ C 1 + √ , (27.15)
vol( 2ϵ B2 ) vol( 2ϵ B2 ) ϵ d
√
where the second inequality follows from B2 ⊂ dB1 by Cauchy-Schwarz inequality. (This step
is tight in the sense that vol(B1 + 2ϵ B2 )1/d ≳ max{vol(B1 )1/d , 2ϵ vol(B2 )1/d } max{ d1 , √ϵd }.) On
the other hand, for some absolute constant c,
d d
vol(B1 ) 1 vol(B1 ) c
M(B1 , k · k2 , ϵ) ≥ = = √ . (27.16)
vol(ϵB2 ) ϵ vol(B2 ) ϵ d
Overall, for ϵ ≤ √1d , we have M(B1 , k · k2 , ϵ)1/d ϵ√1 d ; however, the lower bound trivializes and
the upper bound (which is exponential in d) is loose in the regime of ϵ √1d , which requires
different methods than volume calculation. The following result describes the complete behavior
of this metric entropy. In view of Theorem 27.2, we will go back and forth between the covering
and packing numbers in the argument.
6
For B1 this can be proved directly by noting that B1 consists 2d disjoint “copies” of the simplex whose volume is 1/d! by
induction on d.
i i
i i
i i
458
Proof. The case of ϵ ≤ √1d follows from earlier volume calculation (27.15)–(27.16). Next we
focus on √1d ≤ ϵ < 1.
For the upper bound, we construct an ϵ-covering in ℓ2 by quantizing each coordinate. Without
loss of generality, assume that ϵ < 1/4. Fix some δ < 1. For each θ ∈ B1 , there exists x ∈
(δ Zd ) ∩ B1 such that kx − θk∞ ≤ δ . Then kx − θk22 ≤ kx − θk1 kx − θk∞ ≤ 2δ . Furthermore, x/δ
belongs to the set
( )
X d
Z= z∈Z : d
|zi | ≤ k (27.17)
i=1
with k = b1/δc. Note that each z ∈ Z has at most k nonzeros. By enumerating the number of non-
negative solutions (stars and bars calculation) and the sign pattern, we have7 |Z| ≤ 2k∧d d−k1+k .
Finally, picking δ = ϵ2 /2, we conclude that N(B1 , k · k2 , ϵ) ≤ |Z| ≤ ( 2e(dk+k) )k as desired. (Note
that this method also recovers the volume bound for ϵ ≤ √1d , in which case k ≤ d.)
√
For the lower bound, note that M(B1 , k · k2 , 2) ≥ 2d by considering ±e1 , . . . , ±ed . So it
suffices to consider d ≥ 8. We construct a packing of B1 based on a packing of the Hamming
sphere. Without loss of generality, assume that ϵ > 4√1 d . Fix some 1 ≤ k ≤ d. Applying
the Gilbert-Varshamov bound in Theorem 27.6, in particular, (27.12), there exists a k/2-packing
Pd
{x1 , . . . , xM } ⊂ Sdk = {x ∈ {0, 1}d : i=1 xi = k} and log M ≥ 2k log 2ek d
. Scale the Hamming
sphere to fit the ℓ1 -ball by setting θi = xi /k. Then θi ∈ B1 and kθi − θj k2 = k2 dH (xi , xj ) ≥ 2k
2 1 1
for all
1
i 6= j. Choosing k = ϵ2 which satisfies k ≤ d/8, we conclude that {θ1 , . . . , θM } is a 2 -packing
ϵ
of B1 in k · k2 as desired.
The above elementary proof can be adapted to give the following more general result (see
Exercise V.12): Let 1 ≤ p < q ≤ ∞. For all 0 < ϵ < 1 and d ∈ N,
(
d log ϵes d ϵ ≤ d−1/s 1 1 1
log M(Bp , k · kq , ϵ) p,q 1 , ≜ − . (27.18)
−1/s
s log(eϵ d)
ϵ
s
ϵ≥d s p q
In the remainder of this section, we discuss a few generic results in connection to Theorem 27.7,
in particular, metric entropy upper bounds via the Sudakov minorization and Maurey’s empirical
method, as well as the duality of metric entropy in Euclidean spaces.
7 ∑d (d)( k )
By enumerating the support and counting positive solutions, it is easy to show that |Z| = i=0 2d−i i d−i
.
8
To avoid measurability difficulty, w(Θ) should be understood as supT⊂Θ,|T|<∞ E maxθ∈T hθ, Zi.
i i
i i
i i
For any Θ ⊂ Rd ,
p
w(Θ) ≳ sup ϵ log M(Θ, k · k2 , ϵ). (27.20)
ϵ>0
The preceding theorem relates the Gaussian width to the metric entropy, both of which are
meaningful measure of the massiveness of a set. The following complementary result is due to
R. Dudley. (See [235, Theorem 5.6] for both results.)
Z ∞p
w(Θ) ≲ log M(Θ, k · k2 , ϵ)dϵ. (27.21)
0
Understanding the maximum of a Gaussian process is a field on its own; see the monograph [302].
In this section we focus on the upper bound (27.20) in order to develop upper bound for metric
entropy using the Gaussian width.
The proof of Theorem 27.8 relies on the following Gaussian comparison lemma of Slepian
(whom we have encountered earlier in Theorem 11.13). For a self-contained proof see [62]. See
also [235, Lemma 5.7, p. 70] for a simpler proof of a weaker version E max Xi ≤ 2E max Yi , which
suffices for our purposes.
Lemma 27.9 (Slepian’s lemma). Let X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) be Gaussian random
vectors. If E(Yi − Yj )2 ≤ E(Xi − Xj )2 for all i, j, then E max Yi ≤ E max Xi .
We also need the result bounding the expectation of the maximum of n Gaussian random
variables (see also Exercise I.45).
Therefore
log n t
E[max Zi ] ≤ + .
i t 2
p
Choosing t = 2 log n yields (27.22). Next, assume that Zi are iid. For any t > 0,
i i
i i
i i
460
where Φc (t) = P[Z1 ≥ t] is the normal tail probability. The second term equals
2−(n−1) E[Z1 1{Z1 <0} ] = o(1). For the first term, recall that Φc (t) ≥ 1+t t2 φ(t) (Exercise V.9).
p
Choosing
p t = (2 − ϵ) log n for small ϵ > 0 so that Φc (t) = ω( 1n ) and hence E[maxi Zi ] ≥
(2 − ϵ) log n(1 + o(1)). By the arbitrariness of ϵ > 0, the lower bound part of (27.23)
follows.
Proof of Theorem 27.8. Let {θ1 , . . . , θM } be an optimal ϵ-packing of Θ. Let Xi = hθi , Zi for
i.i.d.
i ∈ [M], where Z ∼ N (0, Id ). Let Yi ∼ N (0, ϵ2 /2). Then
Then
p
E sup hθ, Zi ≥ E max Xi ≥ E max Yi ϵ log M
θ∈Θ 1≤i≤M 1≤i≤M
where the second and third step follows from Lemma 27.9 and Lemma 27.10 respectively.
i i
i i
i i
Proof. Let T = {t1 , t2 , . . . , tm } and denote the Chebyshev center of T by c ∈ H, such that r =
maxi∈[m] kc − ti k. For n ∈ Z+ , let
( ! )
1 X
m X
m
Z= c+ ni ti : ni ∈ Z + , ni = n .
n+1
i=1 i=1
Pm P
For any x = i=1 xi ti ∈ co(T) where xi ≥ 0 and xi = 1, let Z be a discrete random variable
such that Z = ti with probability xi . Then E[Z] = x. Let Z0 = c and Z1 , . . . , Zn be i.i.d. copies of
Pm
Z. Let Z̄ = n+1 1 i=0 Zi , which takes values in the set Z . Since
2
X
n
X
n X
1
1
EkZ̄ − xk22 = E
( Z i − x)
= E kZi − xk2 + EhZi − x, Zj − xi
( n + 1) 2
( n + 1) 2
i=0 i=0 i̸=j
1 X
n
1 r2
= E kZi − xk2 = kc − x k2
+ nE [kZ − x k2
] ≤ ,
( n + 1) 2 ( n + 1) 2 n+1
i=0
Pm
where the last inequality follows from that kc − xk ≤ i=1 xi kc − ti k ≤ r (in other words, rad (T) =
rad (co(T)) and E[kZ − xk2 ] ≤ E[kZ − ck2 ] ≤ r2 . Set n = r2 /ϵ2 − 1 so that r2 /(n + 1) ≤ ϵ2 .
There exists some z ∈ N such that kz − xk ≤ ϵ. Therefore Z is an ϵ-covering of co(T). Similar to
(27.17), we have
n+m−1 m + r2 /ϵ2 − 2
|Z| ≤ = .
n dr2 /ϵ2 e − 1
We now apply Theorem 27.11 to recover the result for the unit ℓ1 -ball B1 in Rd in Theorem 27.7:
Note that B1 = co(T), where T = {±e1 , . . . , ±ed , 0} satisfies rad (T) = 1. Then
2d + d ϵ12 e − 1
N(B1 , k · k2 , ϵ) ≤ , (27.26)
d ϵ12 e − 1
which recovers the optimal upper bound in Theorem 27.7 at both small and big scale.
Then the usual covering number in Definition 27.1 satisfies N(K, k · k, ϵ) = N(K, ϵB), where B is
the corresponding unit norm ball.
i i
i i
i i
462
A deep result of Artstein, Milman, and Szarek [18] establishes the following duality for metric
entropy: There exist absolute constants α and β such that for any symmetric convex body K,9
1 ϵ
log N B2 , K◦ ≤ log N(K, ϵB2 ) ≤ log N(B2 , αϵK◦ ), (27.27)
β α
where B2 is the usual unit ℓ2 -ball, and K◦ = {y : supx∈K hx, yi ≤ 1} is the polar body of K.
As an example, consider p < 2 < q and 1p + 1q = 1. By duality, B◦p = Bq . Then (27.27) shows
that N(Bp , k · k2 , ϵ) and N(B2 , k · kq , ϵ) have essentially the same behavior, as verified by (27.18).
Theorem 27.12. Assume that L, A > 0 and p ∈ [1, ∞] are constants. Then
1
log N(F(A, L), k · kp , ϵ) = Θ . (27.28)
ϵ
Furthermore, for the sup-norm we have the sharp asymptotics:
LA
log2 N(F(A, L), k · k∞ , ϵ) = (1 + o(1)), ϵ → 0. (27.29)
ϵ
Thus, it is sufficient to consider F(A, 1) ≜ F(A), the collection of 1-Lipschitz densities on [0, A].
Next, observe that any such density function f is bounded from above. Indeed, since f(x) ≥ (f(0) −
RA
x)+ and 0 f = 1, we conclude that f(0) ≤ max{A, A2 + A1 } ≜ m.
To show (27.28), it suffices to prove the upper bound for p = ∞ and the lower bound for p = 1.
Specifically, we aim to show, by explicit construction,
C Aϵ
N(F(A), k · k∞ , ϵ) ≤ 2 (27.31)
ϵ
9
A convex body K is a compact convex set with non-empty interior. We say K is symmetric if K = −K.
i i
i i
i i
c
M(F(A), k · k1 , ϵ) ≥ 2 ϵ (27.32)
which imply the desired (27.28) in view of Theorem 27.2. Here and below, c, C are constants
depending on A. We start with the easier (27.32). We construct a packing by perturbing the uniform
density. Define a function T by T(x) = x1{x≤ϵ} + (2ϵ − x)1{x≥ϵ} + A1 on [0, 2ϵ] and zero elsewhere.
Let n = 4Aϵ and a = 2nϵ. For each y ∈ {0, 1}n , define a density fy on [0, A] such that
X
n
f y ( x) = yi T(x − 2(i − 1)ϵ), x ∈ [0, a],
i=1
RA
and we linearly extend fy to [a, A] so that 0 fy = 1; see Fig. 27.2. For sufficiently small ϵ, the
Ra
resulting fy is 1-Lipschitz since 0 fy = 12 + O(ϵ) so that the slope of the linear extension is O(ϵ).
1/A
x
0 ϵ 2ϵ 2nϵ A
Figure 27.2 Packing that achieves (27.32). The solid line represent one such density fy (x) with
y = (1, 0, 1, 1). The dotted line is the density of Unif(0, A).
Thus we conclude that each fy is a valid member of F(A). Furthermore, for y, z ∈ {0, 1}n ,
we have kfy − Fz k1 = dH (y, z)kTk1 = ϵ2 dH (y, z). Invoking the Gilbert-Varshamov bound The-
orem 27.5, we obtain an n2 -packing Y of the Hamming space {0, 1}n with |Y| ≥ 2cn for some
2
absolute constant c. Thus {fy : y ∈ Y} constitutes an n2ϵ -packing of F(A) with respect to the
2
L1 -norm. This is the desired (27.32) since n2ϵ = Θ(ϵ).
m
To construct a covering, set J = ϵ , n = Aϵ , and xk = kϵ for k = 0, . . . , n. Let G be the
collection of all lattice paths (with grid size ϵ) of n steps starting from the coordinate (0, jϵ) for
some j ∈ {0, . . . , J}. In other words, each element g of G is a continuous piecewise linear function
on each subinterval Ik = [xk , xk+1 ) with slope being either +1 or −1. Evidently, the number of
such paths is at most (J + 1)2n = O( 1ϵ 2A/ϵ ). To show that G is an ϵ-covering, for each f ∈ F (A),
we show that there exists g ∈ G such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, A]. This can be shown
by a simple induction. Suppose that there exists g such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, xk ],
which clearly holds for the base case of k = 0. We show that g can be extended to Ik so that this
holds for k + 1. Since |f(xk ) − g(xk )| ≤ ϵ and f is 1-Lipschitz, either f(xk+1 ) ∈ [g(xk ), g(xk ) + 2ϵ]
or [g(xk ) − 2ϵ, g(xk )], in which case we extend g upward or downward, respectively. The resulting
g satisfies |f(x) − g(x)| ≤ ϵ on Ik , completing the induction.
Finally, we prove the sharp bound (27.29) for p = ∞. The upper bound readily follows from
(27.31) plus the scaling relation (27.30). For the lower bound, we apply Theorem 27.2 converting
i i
i i
i i
464
b′ + ϵ1/3
b′
x
0 a′ A
Figure 27.3 Improved packing for (27.33). Here the solid and dashed lines are two lattice paths on a grid of
size ϵ starting from (0, b′ ) and staying in the range of [b′ , b′ + ϵ1/3 ], followed by their respective linear
extensions.
the problem to the construction of 2ϵ-packing. Following the same idea of lattice paths, next we
give an improved packing construction such that
a
M(F(A), k · k∞ , 2ϵ) ≥ Ω(ϵ3/2 2 ϵ ). (27.33)
a b
for any a < A. Choose any b such that A1 < b < A1 + (A− a)2 ′ ′
2A . Let a = ϵ ϵ and b = ϵ ϵ . Consider
a density f on [0, A] of the following form (cf. Fig. 27.3): on [0, a′ ], f is a lattice path from (0, b′ ) to
(a′ , b′ ) that stays in the vertical range of [b′ , b′ + ϵ1/3 ]; on [a′ , A], f is a linear extension chosen so
RA
that 0 f = 1. This is possible because by the 1-Lipschitz constraint we can linearly extend f so that
RA ′ 2 ′ 2 R a′
a′
f takes any value in the interval [b′ (A−a′ )− (A−2a ) , b′ (A−a′ )+ (A−2a ) ]. Since 0 f = ab+o(1),
RA R a′
we need a′ f = 1 − 0 f = 1 − ab + o(1), which is feasible due to the choice of b. The collection
G of all such functions constitute a 2ϵ-packing in the sup norm (for two distinct paths consider the
first subinterval where they differ). Finally, we bound the cardinality of this packing by counting
the number of such paths. This can be accomplished by standard estimates on random walks (see
e.g. [122, Chap. III]). For any constant c > 0, the probability that a symmetric random walk on
Z returns to zero in n (even) steps and stays in the range of [0, n1+c ] is Θ(n−3/2 ); this implies the
desired (27.33). Finally, since a < A is arbitrary, the lower bound part of (27.29) follows in view
of Theorem 27.2.
The following result, due to Birman and Solomjak [36] (cf. [203, Sec. 15.6] for an exposition),
is an extension of Theorem 27.12 to the more general Hölder class.
Theorem 27.13. Fix positive constants A, L and d ∈ N. Let β > 0 and write β = ℓ + α,
where ℓ = bβc and α ∈ [0, 1). Let Fβ (A, L) denote the collection of ℓ-times continuously
differentiable densities f on [0, A]d whose ℓth derivative is (L, α)-Hölder continuous, namely,
i i
i i
i i
1 465
27.5 Hilbert ball has metric entropy ϵ2
βd
1
log N(Fβ (A, L), k · kp , ϵ) . (27.34)
ϵ
The main message of the preceding theorem is that is the entropy of the function class grows
more slowly if the dimension decreases or the smoothness increases. As such, the metric entropy
for very smooth functions can grow subpolynomially in 1ϵ . For example, Vitushkin (cf. [180,
Eq. (129)]) showed that for the class of analytic functions on the unit complex disk D having
analytic extension to a bigger disk rD for r > 1, the metric entropy (with respect to the sup-norm
on D) is Θ((log 1ϵ )2 ); see [180, Sec. 7 and 8] for more such results.
As mentioned at the beginning of this chapter, the conception and development of the subject
on metric entropy, in particular, Theorem 27.13, are motivated by and plays an important role
in the study of Hilbert’s 13th problem. In 1900, Hilbert conjectured that there exist functions of
several variables which cannot be represented as a superposition (composition) of finitely many
functions of fewer variables. This was disproved by Kolmogorov and Arnold in 1950s who showed
that every continuous function of d variables can be represented by sums and superpositions of
single-variable functions; however, their construction does not work if one requires the constituent
functions to have specific smoothness. Subsequently, Hilbert’s conjecture for smooth functions
was positively resolved by Vitushkin [323], who showed that there exist functions of d variables
in the β -Hölder class (in the sense of Theorem 27.13) that cannot be expressed as finitely many
superpositions of functions of d′ variables in the β ′ -Hölder class, provided d/β > d′ /β ′ . The
original proof of Vitushkin is highly involved. Later, Kolmogorov gave a much simplified proof
by proving and applying the k · k∞ -version of Theorem 27.13. As evident in (27.34), the index
d/β provides a complexity measure for the function class; this allows an proof of impossibility
of superposition by an entropy comparison argument. For concreteness, let us prove the follow-
ing simpler version: There exists a 1-Lipschitz function f(x, y, z) of three variables on [0, 1]3 that
cannot be written as g(h1 (x, y), h2 (y, z)) where g, h1 , h2 are 1-Lipschitz functions of two variables
on [0, 1]2 . Suppose, for the sake of contradiction, that this is possible. Fixing an ϵ-covering of
cardinality exp(O( ϵ12 )) for 1-Lipschitz functions on [0, 1]2 and using it to approximate the func-
tions g, h1 , h2 , we obtain by superposition g(h1 , h2 ) an O(ϵ)-covering of cardinality exp(O( ϵ12 )) of
1-Lipschitz functions on [0, 1]3 ; however, this is a contradiction as any such covering must be of
size exp(Ω( ϵ13 )). For stronger and more general results along this line, see [180, Appendix I].
1
27.5 Hilbert ball has metric entropy ϵ2
Consider the following set of linear functions fθ (x) = (θ, x) with θ, x ∈ B – a unit ball in infinite
dimensional Hilbert space with inner product (·, ·).
i i
i i
i i
466
p
Theorem 27.14. Consider any measure P on B and let dP (θ, θ′ ) = EX∼P [|fθ (X) − fθ′ (X)|2 ].
Then we have
1
log N(ϵ, dP ) ≤ 2 .
eϵ
Proof. We have log N(ϵ) ≤ log M(ϵ). By some continuity argument, let’s consider only empirical
Pn
measures Pn = n1 i=1 δxi . First consider the special case when xi ’s are orthogonal basis. Then the
√
ϵ-packing in dP is simply an nϵ-packing of n-dimensional Euclidean unit ball. From Varshamov’s
argument we have
√
log M(ϵ) ≤ −n log nϵ . (27.35)
Thus, we have
1 1
log N(ϵ, dPn ) ≤ max n log √ = 2 .
n nϵ eϵ
√
Now, for a general case, after some linear algebra we get that the goal is to do nϵ-packing in
Euclidean metric of an ellipsoid:
X
n
{yn : y2j /λj ≤ 1} ,
j=1
where λj are eigenvalues of the Gram matrix of {xi , i ∈ [n]}. By calculating the volume of this
ellipsoid the bound (27.35) is then replaced by
X
n
√
log M(ϵ) ≤ log λj − n log nϵ .
j=1
P
Since j λj ≤ n (xi ’s are unit norm!) we get from Jensen’s that the first sum above is ≤ 0 and we
reduced to the previous case.
To see one simple implication of the result, recall the standard bound on empirical processes
s
Z ∞
log N(Θ, L ( P̂ ), ϵ)
E sup E[fθ (X)] − Ên [fθ (X)] ≲ E inf δ + dϵ .
2 n
θ δ>0 δ n
It can be see that when entropy behaves as ϵ−p we get rate n− min(1/p,1/2) except for p = 2 for which
the upper bound yields n− 2 log n. The significance of the previous theorem is that the Hilbert ball
1
i i
i i
i i
10
In particular, if γ is the law of a Gaussian process X on C([0, 1]) with E[kXk22 ] < ∞, the kernel K(s, t) = E[X(s)X(t)]
∑
admits the eigendecomposition K(s, t) = λk ψk (s)ψk (t) (Mercer’s theorem), where {ϕk } is an orthonormal basis for
∑
L2 ([0, 1]) and λk > 0. Then H is the closure of the span of {ϕk } with the inner product hx, yiH = k hx, ψk ihy, ψk i/λk .
i i
i i
i i
468
The following fundamental result due to Kuelbs and Li [187] (see also the earlier work of
Goodman [142]) describes a precise connection between the small-ball probability function ϕ(ϵ)
and the metric entropy of the unit Hilbert ball N(K, k · k, ϵ) ≡ N(ϵ).
λ2
ϕ(2ϵ) + log Φ(λ + Φ−1 (e−ϕ(ϵ) )) ≤ log N(λK, ϵ) ≤ log M(λK, ϵ) ≤ + ϕ(ϵ/2) (27.41)
2
p
To deduce (27.40), choose λ = 2ϕ(ϵ/2) and note that by scaling N(λK, ϵ) = N(K, ϵ/λ).
t) = Φc (t) ≤ e−t /2 (Exercise V.9) yields Φ−1 (e−ϕ(ϵ) ) ≥
2
Applying
p the normal tail bound Φ(−
− 2ϕ(ϵ) ≥ −λ so that Φ(Φ−1 (e−ϕ(ϵ) ) + λ) ≥ Φ(0) = 1/2.
We only give the proof in finite dimensions as the results are dimension-free and extend natu-
rally to infinite-dimensional spaces. Let Z ∼ γ = N(0, Σ) on Rd so that K = Σ1/2 B2 is given in
(27.38). Applying (27.37) to λK and noting that γ is a probability measure, we have
γ (λK + B (0, ϵ)) 1
≤ N(λK, ϵ) ≤ M(λK, ϵ) ≤ . (27.42)
maxθ∈Rd γ (B (θ, 2ϵ)) minθ∈λK γ (B (θ, ϵ/2))
Next we further bound (27.42) using properties native to the Gaussian measure.
• For the upper bound, for any symmetric set A = −A and any θ ∈ λK, by a change of measure
γ(θ + A) = P [Z − θ ∈ A]
1 ⊤ −1
h −1 i
= e− 2 θ Σ θ E e⟨Σ θ,Z⟩ 1{Z∈A}
≥ e−λ
2
/2
P [Z ∈ A] ,
h −1 i
where the last step follows from θ⊤ Σ−1 θ ≤ λ2 and by Jensen’s inequality E e⟨Σ θ,Z⟩ |Z ∈ A ≥
−1
e⟨Σ θ,E[Z|Z∈A]⟩ = 1, using crucially that E [Z|Z ∈ A] = 0 by symmetry. Applying the above to
A = B(0, ϵ/2) yields the right inequality in (27.41).
i i
i i
i i
• For the lower bound, recall Anderson’s lemma (Lemma 28.10) stating that the Gaussian measure
of a ball is maximized when centered at zero, so γ(B(θ, 2ϵ)) ≤ γ(B(0, 2ϵ)) for all θ. To bound
the numerator, recall the Gaussian isoperimetric inequality (see e.g. [46, Theorem 10.15]):11
γ(A + λK) ≥ Φ(Φ−1 (γ(A)) + λ). (27.43)
Applying this with A = B(0, ϵ) proves the left inequality in (27.41) and the theorem.
The implication of Theorem 27.15 is the following. Provided that ϕ(ϵ) ϕ(ϵ/2), then we
should expect that approximately
!
ϵ
log N p ϕ(ϵ)
ϕ(ϵ)
With more effort this can be made precise unconditionally (see e.g. [199, Theorem 3.3], incorporat-
ing the later improvement by [198]), leading to very precise connections between metric entropy
and small-ball probability, for example: for fixed α > 0, β ∈ R,
β 2β
−α 1 − 2+α
2α 1 2+α
ϕ(ϵ) ϵ log ⇐⇒ log N(ϵ) ϵ log (27.44)
ϵ ϵ
As a concrete example, consider the unit ball (27.39) in the RKHS generated by the standard
Brownian motion, which is similar to a Sobolev ball.12 Using (27.36) and (27.44), we conclude
that log N(ϵ) 1ϵ , recovering the metric entropy of Sobolev ball determined in [310]. This result
also coincides with the metric entropy of Lipschitz ball in Theorem 27.13 which requires the
derivative to be bounded everywhere as opposed to on average in L2 . For more applications of
small-ball probability on metric entropy (and vice versa), see [187, 198].
11
The connection between (27.43) and isoperimetry is that if we interpret limλ→0 (γ(A + λK) − γ(A))/λ as the surface
measure of A, then among all sets with the same Gaussian measure, the half space has maximal surface measure.
12
The Sobolev norm is kfkW1,2 ≜ kfk2 + kf′ k2 . Nevertheless, it is simple to verify a priori that the metric entropy of
(27.39) and that of the Sobolev ball share the same behavior (see [187, p. 152]).
i i
i i
i i
470
its rate-distortion function (recall Section 24.3). Denote the worst-case rate-distortion function on
X by
The next theorem relates ϕX to the covering and packing number of X . The lower bound simply
follows from a “Bayesian” argument, which bounds the worst case from below by the average case,
akin to the relationship between minimax and Bayes risk (see Section 28.3). The upper bound was
shown in [174] using the dual representation of rate-distortion functions; here we give a simpler
proof via Fano’s inequality.
ϕX (cϵ) + log 2
ϕX (ϵ) ≤ log N(X , d, ϵ) ≤ log M(X , d, ϵ) ≤ . (27.47)
1 − 2c
Proof. Fix an ϵ-covering of X in d of size N. Let X̂ denote the closest element in the covering to
X. Then d(X, X̂) ≤ ϵ almost surely. Thus ϕX (ϵ) ≤ I(X; X̂) ≤ log N. Optimizing over PX proves the
left inequality.
For the right inequality, let X be uniformly distributed over a maximal ϵ-packing of X . For
any PX̂|X such that E[d(X, X̂)] ≤ cϵ. Let X̃ denote the closest point in the packing to X̂. Then we
have the Markov chain X → X̂ → X̃. By definition, d(X, X̃) ≤ d(X̂, X̃) + d(X̂, X) ≤ 2d(X̂, X)
so E[d(X, X̃)] ≤ 2cϵ. Since either X = X̃ or d(X, X̃) > ϵ, we have P[X 6= X̃] ≤ 2c. On the
other hand, Fano’s inequality (Corollary 6.4) yields P[X 6= X̃] ≥ 1 − I(X;log
X̂)+log 2
M . In all, I(X; X̂) ≥
(1 − 2c) log M − log 2, proving the upper bound.
Remark 27.4. (a) Clearly, Theorem 27.16 can be extended to the case where the distortion
function equals a power of the metric, namely, replacing (27.45) with
Then (27.47) continues to hold with 1 − 2c replaced by 1 − (2c)r . This will be useful, for
example, in the forthcoming applications where second moment constraint is easier to work
with.
(b) In the earlier literature a variant of the rate-distortion function is also considered, known as
the ϵ-entropy of X, where the constraint is d(X, X̂) ≤ ϵ with probability one as opposed to
in expectation (cf. e.g. [180, Appendix II] and [254]). With this definition, it is natural to
conjecture that the maximal ϵ-entropy over all distributions on X coincides with the metric
entropy log N(X , ϵ); nevertheless, this need not be true (see [215, Remark, p. 1708] for a
counterexample).
i i
i i
i i
Theorem 27.16 points out an information-theoretic route to bound the metric entropy by the
worst-case rate-distortion function (27.46).13 Solving this maximization, however, is not easy as
PX 7→ ϕX (D) is in general neither convex nor concave [3].14 Fortunately, for certain spaces, one
can show via a symmetry argument that the “uniform” distribution maximizes the rate-distortion
function at every distortion level; see Exercise V.8 for a formal statement. As a consequence, we
have:
• For Hamming space X = {0, 1}d and Hamming distortion, ϕX (D) is attained by Ber( 12 )d . (We
already knew this from Theorem 26.1 and Theorem 24.8.)
• For the unit sphere X = Sd−1 and distortion function defined by the Euclidean distance, ϕX (D)
is attained by Unif(Sd−1 ).
• For the orthogonal group X = O(d) or unitary group U(d) and distortion function defined by
the Frobenius norm, ϕX (D) is attained by the Haar measure. Similar statements also hold for
the Grassmann manifold (collection of linear subspaces).
Theorem 27.17. Let θ be uniformly distributed over the unit sphere Sd−1 . Then for all 0 < ϵ < 1,
1 1
(d − 1) log − C ≤ inf I(θ; θ̂) ≤ (d − 1) log 1 + + log(2d)
ϵ Pθ̂|θ :E[∥θ̂−θ∥22 ]≤ϵ2 ϵ
Note that the random vector θ have dependent entries so we cannot invoke the single-
d
letterization technique in Theorem 24.8. Nevertheless, we have the representation θ=Z/kZk2 for
Z ∼ N (0, Id ), which allows us to relate the rate-distortion function of θ to that of the Gaussian
found in Theorem 26.2. The resulting lower bound agree with the metric entropy for spheres in
Corollary 27.4, which scales as (d − 1) log 1ϵ . Using similar reduction arguments (see [195, The-
orem VIII.18]), one can obtain tight lower bound for the metric entropy of the orthogonal group
O(d) and the unitary group U(d), which scales as d(d2−1) log 1ϵ and d2 log 1ϵ , with pre-log factors
commensurate with their respective degrees of freedoms. As mentioned in Remark 27.3(b), these
results were obtained by Szarek in [298] using a volume argument with Haar measures; in compar-
ison, the information-theoretic approach is more elementary as we can again reduce to Gaussian
rate-distortion computation.
Proof. The upper bound follows from Theorem 27.16 and Remark 27.4(a), applying the metric
entropy bound for spheres in Corollary 27.4.
13
A striking parallelism between the metric entropy of Sobolev balls and the rate-distortion function of smooth Gaussian
processes has been observed by Donoho in [99]. However, we cannot apply Theorem 27.16 to formally relate one to the
other since it is unclear whether the Gaussian rate-distortion function is maximal.
14
As a counterexample, consider Theorem 26.1 for the binary source.
i i
i i
i i
472
I(Z; Ẑ) = I(θ, A; Ẑ) ≤ I(θ, A; θ̂, Â) = I(θ; θ̂) + I(A, Â).
Furthermore, E[Â2 ] = E[(Â − A)2 ] + E[A2 ] + 2E[(Â − A)(A − E[A])] ≤ d + δ 2 + 2δ ≤ d + 3δ .
Similarly, |E[Â(Â − A)]| ≤ 2δ and E[kZ − Ẑk2 ] ≤ dϵ2 + 7δϵ + δ . Choosing δ = ϵ, we have
E[kZ − Ẑk2 ] ≤ (d + 8)ϵ2 . Combining Theorem 24.8 with the Gaussian rate-distortion function in
Theorem 26.2, we have I(Z; Ẑ) ≥ d2 log (d+d8)ϵ2 , so applying log(1 + x) ≤ x yields
1
I(θ; θ̂) ≥ (d − 1) log − 4 log e.
ϵ2
i i
i i
i i
V.1 Let S = Ŝ = {0, 1} and let the source X10 be fair coin flips. Denote the output of the decom-
1
pressor by X̂10 . Show that it is possible to achieve average Hamming distortion 20 with 512
codewords.
V.2 Assume the distortion function is separable. Show that the minimal number of codewords
M∗ (n, D) required to represent memoryless source Xn with average distortion D satisfies
Conclude that
1 1
lim log M∗ (n, D) = inf log M∗ (n, D) . (V.1)
n→∞ n n n
(i.e. one can always achieve a better compression rate by using a longer blocklength). Neither
claim holds for log M∗ (n, ϵ) in channel coding (with inf replaced by sup in (V.1) of course).
Explain why this different behavior arises.
i.i.d.
V.3 Consider a source Sn ∼ Ber( 12 ). Answer the following questions when n is large.
(a) Suppose the goal is to compress Sn into k bits so that one can reconstruct Sn with at most
one bit of error. That is, the decoded version Ŝn satisfies E[dH (Ŝn , Sn )] ≤ 1. Show that this
can be done (if possible, with an explicit algorithm) with k = n − C log n bits for some
constant C. Is it optimal?
(b) Suppose we are required to compress Sn into only 1 bit. Show that one can achieve (if
√
possible, with an explicit algorithm) a reconstruction error E[dH (Ŝn , Sn )] ≤ n2 − C n for
some constant C. Is it optimal?
Warning: We cannot blindly apply asymptotic the rate-distortion theory to show achievability
since here the distortion changes with n. The converse, however, directly applies.
i.i.d.
V.4 (Noisy source coding [94]) Let Zn ∼ Ber( 21 ). Let Xn be the output of a stationary memoryless
binary erasure channel with erasure probability δ when the input is Zn .
(a) Find the best compression rate for Xn so that the decompressor can reconstruct Zn with bit
error rate D.
(b) What if the input is a Ber(p) sequence?
V.5 (a) Let 0 ≺ ∆ Σ be positive definite matrices. For S ∼ N (0, Σ), show that
1 det Σ
inf I(S; Ŝ) = log .
PŜ|S :E[(S−Ŝ)(S−Ŝ)⊤ ]⪯∆ 2 det ∆
i i
i i
i i
(b) Prove the following extension of (26.3): Let σ12 , . . . , σd2 be the eigenvalues of Σ. Then
1 X + σi2
d
inf I(S; Ŝ) = log
PŜ|S :E[∥S−Ŝ∥22 ]≤D 2 λ
i=1
Pd
where λ > 0 is such that i=1 min{σi2 , λ} = D. This is the counterpart of the waterfilling
solution in Theorem 20.14.
(Hint: First, using the orthogonal invariance of distortion metric we can assume that
Σ is diagonal. Next, apply the same single-letterization argument for (26.3) and solve
Pd σ2
minP Di =D 12 i=1 log+ Dii .)
V.6 (Shannon lower bound) Let k · k be an arbitrary norm on Rd and r > 0. Let X be a Rd -valued
random vector with a probability density function pX . Denote the rate-distortion function
ϕ X ( D) ≜ inf I(X; X̂)
PX̂|X :E[∥X̂−X∥r ]≤D
and this entropy maximization can be solved following the argument in Example 5.2.
V.7 (Uniform distribution minimizes convex symmetric functional.) Let G be a group acting on a
set X such that each g ∈ G sends x ∈ X to gx ∈ X . Suppose G acts transitively, i.e., for each
x, x′ ∈ X there exists g ∈ G such that gx = x′ . Let g be a random element of G with an invariant
i i
i i
i i
d
distribution, namely hg=g for any h ∈ G. (Such a distribution, known as the Haar measure,
exists for compact topological groups.)
(a) Show that for any x ∈ X , gx has the same law, denoted by Unif(X ), the uniform distribution
on X .
(b) Let f : P(X ) → R be convex and G-invariant, i.e., f(PgX ) = f(PX ) for any X -valued random
variable X and any g ∈ G. Show that minPX ∈P(X ) f(PX ) = f(Unif(X )).
V.8 (Uniform distribution maximizes rate-distortion function.) Under the setup of Exercise V.7, let
d : X × X → R be a G-invariant distortion function, i.e., d(gx, gx′ ) = d(x, x′ ) for any g ∈ G.
Denote the rate-distortion function of an X -valued X by ϕX (D) = infP :E[d(X,X̂)]≤D I(X; X̂).
X̂|X
Suppose that ϕX (D) < ∞ for all X and all D > 0.
(a) Let ϕ∗X (λ) = supD {λD − ϕX (D)} denote the conjugate of ϕX . Applying Theorem 24.4 and
Fenchel-Moreau’s biconjugation theorem to conclude that ϕX (D) = supλ {λD − ϕ∗X (λ)}.
(b) Show that
ϕ∗X (λ) = sup{λE[d(X, X̂)] − I(X; X̂)}.
PX̂|X
As such, for each λ, PX 7→ ϕ∗X (λ) is convex and G-invariant. (Hint: Theorem 5.3.)
(c) Applying Exercise V.7 to conclude that ϕ∗U (λ) ≤ ϕ∗X (λ) for U ∼ Unif(X ) and that
ϕX (D) ≤ ϕU (D), ∀ D > 0.
V.9 (Normal tail bound.) Denote the standard normal density and tail probability by φ(x) =
R∞
√1 e−x /2 and Φc (t) =
2
2π t
φ(x)dx. Show that for all t > 0,
t φ(t) −t2 /2
φ( t ) ≤ Φ c
( t ) ≤ min , e . (V.3)
1 + t2 t
(Hint: For Φc (t) ≤ e−t /2 apply the Chernoff bound (15.2); for the rest, note that by integration
2
R∞
by parts Φc (t) = φ(t t) − t φ(x2x) dx.)
V.10 (Small-ball probability II.) In this exercise we prove (27.36). Let {Wt : t ≥ 0} be a standard
Brownian motion. Show that for small ϵ,15
1 1
ϕ(ϵ) = log h i
P supt∈[0,1] |Wt | ≤ ϵ ϵ2
h i h i
(a) By rescaling space and time, show that P supt∈[0,1] |Wt | ≤ ϵ = P supt∈[0,T] |Wt | ≤ 1 ≜
pT , where T = 1/ϵ2 . To show pT = e−Θ(T) , there is no loss of generality to assume that T is
an integer.
(b) (Upper bound) Using the independent increment property, show that pT+1 ≤ apT , where
a = P [|Z| ≤ 1] with Z ∼ N(0, 1). (Hint: g(z) ≜ P [|Z − z| ≤ 1] for z ∈ [−1, 1] is maximized
at z = 0 and minimized at z = ±1.)
15
Using the large-deviations theory developed by Donsker-Varadhan, the sharp constant can be found to be
2
limϵ→0 ϵ2 ϕ(ϵ) = π8 . see for example [199, Sec. 6.2].
i i
i i
i i
h i
(c) (Lower bound) Again by scaling, it is equivalent to show P supt∈[0,T] |Wt | ≤ C ≥ C−T for
h i
some constant C. Let qT ≜ P supt∈[0,T] |Wt | ≤ 2, maxt=1,...,T |Wt | ≤ 1 . Show that qT+1 ≥
bqT , where b = P [|Z − 1| ≤ 1] P[supt∈[0,1] |Bt | ≤ 1], and Bt = Bt − tB1 is a Brownian
bridge. (Hint: {Wt : t ∈ [0, T]}, WT+1 − WT , and {WT+t − (1 − t)WT − tWT+1 : t ∈ [0, 1]}
are mutually independent, with the latter distributed as a Brownian bridge.)
V.11 (Covering radius in Hamming space) In this exercise we prove (27.9), namely, for any fixed
0 ≤ D ≤ 1, as n → ∞,
N(Fn2 , dH , Dn) = 2n(1−h(D))+ +o(n) ,
where h(·) is the binary entropy function.
(a) Prove the lower bound by invoking the volume bound in Theorem 27.3 and the large-
deviations estimate in Example 15.1.
(b) Prove the upper bound using probabilistic construction and a similar argument to (25.8).
(c) Show that for D ≥ 12 , N(Fn2 , dH , Dn) ≤ 2 – cf. Ex. V.3a.
V.12 (Covering ℓp -ball with ℓq -balls)
(a) For 1 ≤ p < q ≤ ∞, prove the bound (27.18) on the metric entropy of the unit ℓp -ball with
respect to the ℓq -norm (Hint: for small ϵ, apply the volume calculation in (27.15)–(27.16)
and the formula in (27.13); for large ϵ, proceed as in the proof of Theorem 27.7 by applying
the quantization argument and the Gilbert-Varshamov bound of Hamming spheres.)
(b) What happens when p > q?
V.13 (Random matrix) Let A be an m × n matrix of iid N (0, 1) entries. Denote its operator norm by
kAkop = maxv∈Sn−1 kAvk, which is also the largest singular value of A.
(a) Show that
kAkop = max hA, uv′ i . (V.4)
u∈Sm−1 ,v∈Sn−1
(b) Let U = {u1 , . . . , uM } and V = {v1 , . . . , vM } be an ϵ-net for the spheres Sm−1 and Sn−1
respectively. Show that
1
kAkop ≤ max hA, uv′ i .
(1 − ϵ)2 u∈U ,v∈V
i i
i i
i i
Part VI
Statistical applications
i i
i i
i i
i i
i i
i i
479
This part gives an exposition on the application of information-theoretic principles and meth-
ods in mathematical statistics; we do so by discussing a selection of topics. To start, Chapter 28
introduces the basic decision-theoretic framework of statistical estimation and the Bayes risk
and the minimax risk as the fundamental limits. Chapter 29 gives an exposition of the classi-
cal large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the
role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classi-
cal lower bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees) from the variational
characterization and the data processing inequality (DPI) of χ2 -divergence in Chapter 7.
Moving into high dimensions, Chapter 30 introduces the mutual information method for sta-
tistical lower bound, based on the DPI for mutual information as well as the theory of capacity
and rate-distortion function from Parts IV and V. This principled approach includes three popular
methods for proving minimax lower bounds (Le Cam, Assouad, and Fano) as special cases, which
are discussed at length in Chapter 31 drawing results from metric entropy in Chapter 27 also.
Complementing the exposition on lower bounds in Chapters 30 and 31, in Chapter 32 we
present three upper bounds on statistical estimation based on metric entropy. These bounds appear
strikingly similar but follow from completely different methodologies.
Chapter 33 introduces strong data processing inequalities (SDPI), which are quantitative
strengthning of DPIs in Part I. As applications we show how to apply SDPI to deduce lower
bounds for various estimation problems on graphs or in distributed settings.
i i
i i
i i
where each distribution is indexed by a parameter θ taking values in the parameter space Θ.
In the decision-theoretic framework, we play the following game: Nature picks some parameter
θ ∈ Θ and generates a random variable X ∼ Pθ . A statistician observes the data X and wants to
infer the parameter θ or its certain attributes. Specifically, consider some functional T : Θ → Y
and the goal is to estimate T(θ) on the basis of the observation X. Here the estimand T(θ) may be
the parameter θ itself, or some function thereof (e.g. T(θ) = 1{θ>0} or kθk).
An estimator (decision rule) is a function T̂ : X → Ŷ . Note the that the action space Ŷ need
not be the same as Y (e.g. T̂ may be a confidence interval). Here T̂ can be either deterministic,
i.e. T̂ = T̂(X), or randomized, i.e., T̂ obtained by passing X through a conditional probability
distribution (Markov transition kernel) PT̂|X , or a channel in the language of Part I. For all practical
purposes, we can write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and
independent of X.
To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such
that ℓ(T, T̂) is the risk of T̂ for estimating T. Since we are dealing with loss (as opposed to reward),
all the negative (converse) results are lower bounds and all the positive (achievable) results are
upper bounds. Note that X is a random variable, so are T̂ and ℓ(T, T̂). Therefore, to make sense of
“minimizing the loss”, we consider the average risk:
Z
Rθ (T̂) = Eθ [ℓ(T, T̂)] = Pθ (dx)PT̂|X (dt̂|x)ℓ(T(θ), t̂), (28.2)
which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect
to which the expectation is taken. Note that the expected risk depends on the estimator as well as
the ground truth.
480
i i
i i
i i
Remark 28.1. We note that the problem of hypothesis testing and inference can be encompassed
as special cases of the estimation paradigm. As previously discussed in Section 16.4, there are
three formulations for testing:
H0 : θ = θ 0 vs. H1 : θ = θ1 , θ0 6= θ1
H0 : θ = θ 0 vs. H1 : θ ∈ Θ 1 , θ0 ∈
/ Θ1
H0 : θ ∈ Θ 0 vs. H1 : θ ∈ Θ 1 , Θ0 ∩ Θ1 = ∅.
For each case one can introduce the appropriate parameter space and loss function. For example,
in the last (most general) case, we may take
(
0 θ ∈ Θ0
Θ = Θ0 ∪ Θ1 , T(θ) = , T̂ ∈ {0, 1}
1 θ ∈ Θ1
and use the zero-one loss ℓ(T, T̂) = 1{T̸=T̂} so that the expected risk Rθ (T̂) = Pθ {θ ∈ / ΘT̂ } is the
probability of error.
For the problem of inference, the goal is to output a confidence interval (or region) which covers
the true parameter with high probability. In this case T̂ is a subset of Θ and we may choose the
loss function ℓ(θ, T̂) = 1{θ∈/ T̂} + λlength(T̂) for some λ > 0, in order to balance the coverage and
the size of the confidence interval.
Remark 28.2 (Randomized versus deterministic estimators). Although most of the estimators used
in practice are deterministic, there are a number of reasons to consider randomized estimators:
• For certain formulations, such as the minimizing worst-case risk (minimax approach), deter-
ministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the
objective is to minimize the average risk (Bayes approach), then it does not lose generality to
restrict to deterministic estimators.
• The space of randomized estimators (viewed as Markov kernels) is convex which is the convex
hull of deterministic estimators. This convexification is needed for example for the treatment
of minimax theorems.
i i
i i
i i
482
By definition, more structure (smaller parameter space) always makes the estimation task easier
(smaller worst-case risk), but not necessarily so in terms of computation.
For estimating θ itself (denoising), it is customary to use a loss function defined by certain
P 1
norms, e.g., ℓ(θ, θ̂) = kθ − θ̂kpα for some 1 ≤ p ≤ ∞ and α > 0, where kθkp ≜ ( |θi |p ) p , with
p = α = 2 corresponding to the commonly used quadratic loss (squared error). Some well-known
estimators include the Maximum Likelihood Estimator (MLE)
θ̂ML = X (28.3)
and the James-Stein estimator based on shrinkage
(d − 2)σ 2
θ̂JS = 1 − X (28.4)
kXk22
The choice of the estimator depends on both the objective and the parameter space. For instance,
if θ is known to be sparse, it makes sense to set the smaller entries in the observed X to zero
(thresholding) in order to better denoise θ (cf. Section 30.2).
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 483
In addition to estimating the vector θ itself, it is also of interest to estimate certain functionals
T(θ) thereof, e.g., T(θ) = kθkp , max{θ1 , . . . , θd }, or eigenvalues in the matrix case. In addition,
the hypothesis testing problem in the GLM has been well-studied. For example, one can consider
detecting the presence of a signal by testing H0 : θ = 0 against H1 : kθk ≥ ϵ, or testing weak signal
H0 : kθk ≤ ϵ0 versus strong signal H1 : kθk ≥ ϵ1 , with or without further structural assumptions
on θ. We refer the reader to the monograph [165] devoted to these problems.
i i
i i
i i
484
An estimator θ̂ is called a Bayes estimator if it attains the Bayes risk, namely, R∗π = Eθ∼π [Rθ (θ̂∗ )].
∗
Remark 28.3. Bayes estimator is always deterministic – this fact holds for any loss function. To
see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external
randomness independent of X and θ, its risk is lower bounded by
Rπ (θ̂) = Eθ,X,U ℓ(θ, θ̂(X, U)) = EU Rπ (θ̂(·, U)) ≥ inf Rπ (θ̂(·, u)).
u
Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic
estimator whose average risk is no worse than that of the randomized estimator.
An alternative way to under this fact is the following: Note that the average risk Rπ (θ̂) defined
in (28.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X )
is affine, whose minimum is achieved at the extremal points. In this case the extremal points of
Markov kernels are simply delta measures, which corresponds to deterministic estimators.
In certain settings the Bayes estimator can be found explicitly. Consider the problem of esti-
mating θ ∈ Rd drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = kθ̂ − θk22 , the Bayes
estimator is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square
error (MMSE)
R∗π = Ekθ − E[θ|X]k22 = Tr(Cov(θ|X)),
where Cov(θ|X = x) is the conditional covariance of θ given X = x.
As a concrete example, let us consider the Gaussian Location Model in Section 28.2 with a
Gaussian prior.
Example 28.1 (Bayes risk in GLM). Consider the scalar case, where X = θ + Z and Z ∼ N(0, σ 2 )
is independent of θ. Consider a Gaussian prior θ ∼ π = N(0, s). One can verify that the posterior
sσ 2
2 x, s+σ 2 ). As such, the Bayes estimator is E[θ|X] = s+σ 2 X and the
s s
distribution Pθ|X=x is N( s+σ
Bayes risk is
sσ 2
R∗π = . (28.6)
s + σ2
Similarly, for multivariate GLM: X = θ + Z, Z ∼ N(0, Id ), if θ ∼ π = N(0, sId ), then we have
sσ 2
R∗π = d. (28.7)
s + σ2
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 485
If there exists θ̂ s.t. supθ∈Θ Rθ (θ̂) = R∗ , then the estimator θ̂ is minimax (minimax optimal).
Finding the value of the minimax risk R∗ entails proving two things, namely,
• a minimax upper bound, by exhibiting an estimator θ̂∗ such that Rθ (θ̂∗ ) ≤ R∗ + ϵ for all θ ∈ Θ;
• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ, such that
Rθ ≥ R∗ − ϵ,
where ϵ > 0 is arbitrary. This task is frequently difficult especially in high dimensions. Instead of
the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call
minimax rate, such that
R∗ Ψ, (28.9)
that is, cΨ ≤ R∗ ≤ CΨ for some universal constants c, C ≥ 0. Establishing Ψ is the minimax rate
still entails proving the minimax upper and lower bounds, albeit within multiplicative constant
factors.
In practice, minimax lower bounds are rarely established according to the original definition.
The next result shows that the Bayes risk is always lower than the minimax risk. Throughout
this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a
sagaciously chosen prior.
Theorem 28.1. Let ∆(Θ) denote the collection of probability distributions on Θ. Then
1 “max ≥ mean”: For any θ̂, Rπ (θ̂) = Eθ∼π Rθ (θ̂) ≤ supθ∈Θ Rθ (θ̂). Taking the infimum over θ̂
completes the proof;
2 “min max ≥ max min”:
R∗ = inf sup Rθ (θ̂) = inf sup Rπ (θ̂) ≥ sup inf Rπ (θ̂) = sup R∗π ,
θ̂ θ∈Θ θ̂ π ∈∆(Θ) π ∈∆(Θ) θ̂ π
where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).
i i
i i
i i
486
Remark 28.4. Unlike Bayes estimators which, as shown in Remark 28.3, are always deterministic,
to minimize the worst-case risk it is sometimes necessary to randomize for example in the context
of hypotheses testing (Chapter 14). Specifically, consider a trivial experiment where θ ∈ {0, 1} and
X is absent, so that we are forced to guess the value of θ under the zero-one loss ℓ(θ, θ̂) = 1{θ̸=θ̂} .
It is clear that in this case the minimax risk is 21 , achieved by random guessing θ̂ ∼ Ber( 21 ) but not
by any deterministic θ̂.
As an application of Theorem 28.1, let us determine the minimax risk of the Gaussian location
model under the quadratic loss function.
Example 28.2 (Minimax quadratic risk of GLM). Consider the Gaussian location model without
structural assumptions, where X ∼ N(θ, σ 2 Id ) with θ ∈ Rd . We show that
By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X which
achieves Rθ (θ̂ML ) = d for all θ. To get a matching minimax lower bound, we consider the prior
θ ∼ N(0, s). Using the Bayes risk previously computed in (28.6), we have R∗ ≥ R∗π = s+ sd
1.
∗
Sending s → ∞ yields R ≥ d.
Remark 28.5 (Non-uniqueness of minimax estimators). In general, estimators that achieve the
minimax risk need not be unique. For instance, as shown in Example 28.2, the MLE θ̂ML = X
is minimax for the unconstrained GLM in any dimension. On the other hand, it is known that
whenever d ≥ 3, the risk of the James-Stein estimator (28.4) is smaller that of the MLE everywhere
(see Fig. 28.2) and thus is also minimax. In fact, there exist a continuum of estimators that are
minimax for (28.11) [196, Theorem 5.5].
3.0
2.8
2.6
2.4
2.2
2 4 6 8
Figure 28.2 Risk of the James-Stein estimator (28.4) in dimension d = 3 and σ = 1 as a function of kθk.
For most of the statistical models, Theorem 28.1 in fact holds with equality; such a result is
known as a minimax theorem. Before discussing this important topic, here is an example where
minimax risk is strictly bigger than the worst-case Bayes risk.
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 487
Example 28.3. Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1{θ̂<θ} , i.e., the statistician loses one dollar
if the Nature’s choice exceeds the statistician’s guess and loses nothing if otherwise. Consider the
extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any θ̂ possibly
randomized, we have Rθ (θ̂) = P(θ̂ < θ). Thus R∗ ≥ limθ→∞ P(θ̂ < θ) = 1, which is clearly
achievable. On the other hand, for any prior π on N, Rπ (θ̂) = P(θ̂ < θ), which vanishes as θ̂ → ∞.
Therefore, we have R∗π = 0. Therefore in this case R∗ = 1 > R∗Bayes = 0.
As an exercise, one can show that the minimax quadratic risk of the GLM X ∼ N(θ, 1) with
parameter space θ ≥ 0 is the same as the unconstrained case. (This might be a bit surprising
because the thresholded estimator X+ = max(X, 0) achieves a better risk pointwise at every θ ≥
0; nevertheless, just like the James-Stein estimator (cf. Fig. 28.2), in the worst case the gain is
asymptotically diminishing.)
R∗ ≥ R∗Bayes .
This result can be interpreted from an optimization perspective. More precisely, R∗ is the value
of a convex optimization problem (primal) and R∗Bayes is precisely the value of its dual program.
Thus the inequality (28.10) is simply weak duality. If strong duality holds, then (28.10) is in fact
an equality, in which case the minimax theorem holds.
For simplicity, we consider the case where Θ is a finite set. Then
This is a convex optimization problem. Indeed, Pθ̂|X 7→ Eθ [ℓ(θ, θ̂)] is affine and the pointwise
supremum of affine functions is convex. To write down its dual problem, first let us rewrite (28.12)
in an augmented form
R∗ = min t (28.13)
Pθ̂|X ,t
Let π θ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The
Lagrangian of (28.13) is
!
X X X
L(Pθ̂|X , t, π ) = t + π θ Eθ [ℓ(θ, θ̂)] − t = 1 − πθ t + π θ Eθ [ℓ(θ, θ̂)].
θ∈Θ θ∈Θ θ∈Θ
P
By definition, we have R∗ ≥ mint,Pθ̂|X L(θ̂, t, π ). Note that unless θ∈Θ π θ = 1, mint∈R L(θ̂, t, π )
is −∞. Thus π = (π θ : θ ∈ Θ) must be a probability measure and the dual problem is
i i
i i
i i
488
Hence, R∗ ≥ R∗Bayes .
In summary, the minimax risk and the worst-case Bayes risk are related by convex duality,
where the primal variables are (randomized) estimators and the dual variables are priors. This
view can in fact be operationalized. For example, [173, 251] showed that for certain problems
dualizing Le Cam’s two-point lower bound (Theorem 31.1) leads to optimal minimax upper bound;
see Exercise VI.16.
This result shows that for virtually all problems encountered in practice, the minimax risk coin-
cides with the least favorable Bayes risk. At the heart of any minimax theorem, there is an
application of the separating hyperplane theorem. Below we give a proof of a special case
illustrating this type of argument.
R∗ = R∗Bayes
Proof. The first case directly follows from the duality interpretation in Section 28.3.3 and the
fact that strong duality holds for finite-dimensional linear programming (see for example [275,
Sec. 7.4].
For the second case, we start by showing that if R∗ = ∞, then R∗Bayes = ∞. To see this, consider
the uniform prior π on Θ. Then for any estimator θ̂, there exists θ ∈ Θ such that R(θ, θ̂) = ∞.
Then Rπ (θ̂) ≥ |Θ|
1
R(θ, θ̂) = ∞.
Next we assume that R∗ < ∞. Then R∗ ∈ R since ℓ is bounded from below (say, by a) by
assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ (θ̂))θ∈Θ . Then its average risk
i i
i i
i i
P
with respect to a prior π is given by the inner product hR(θ̂), π i = θ∈Θ π θ Rθ (θ̂). Define
Note that both S and T are convex (why?) subsets of Euclidean space RΘ and S∩T = ∅ by definition
of R∗ . By the separation hyperplane theorem, there exists a non-zero π ∈ RΘ and c ∈ R, such
that infs∈S hπ , si ≥ c ≥ supt∈T hπ , ti. Obviously, π must be componentwise positive, for otherwise
supt∈T hπ , ti = ∞. Therefore by normalization we may assume that π is a probability vector, i.e.,
a prior on Θ. Then R∗Bayes ≥ R∗π = infs∈S hπ , si ≥ supt∈T hπ , ti ≥ R∗ , completing the proof.
Pn = {P⊗
θ : θ ∈ Θ},
n
n ≥ 1. (28.14)
Clearly, n 7→ R∗n (Θ) is non-increasing since we can always discard the extra observations.
Typically, when Θ is a fixed subset of Rd , R∗n (Θ) vanishes as n → ∞. Thus a natural question is
at what rate R∗n converges to zero. Equivalently, one can consider the sample complexity, namely,
the minimum sample size to attain a prescribed error ϵ even in the worst case:
In the classical large-sample asymptotics (Chapter 29), the rate of convergence for the quadratic
risk is usually Θ( 1n ), which is commonly referred to as the “parametric rate“. In comparison, in this
book we focus on understanding the dependency on the dimension and other structural parameters
nonasymptotically.
As a concrete example, let us revisit the GLM in Section 28.2 with sample size n, in which case
i.i.d.
we observe X = (X1 , . . . , Xn ) ∼ N(0, σ 2 Id ), θ ∈ Rd . In this case, the minimax quadratic risk is1
dσ 2
R∗n = . (28.17)
n
To see this, note that in this case X̄ = n1 (X1 + . . . + Xn ) is a sufficient statistic (cf. Section 3.5) of X
2
for θ. Therefore the model reduces to X̄ ∼ N(θ, σn Id ) and (28.17) follows from the minimax risk
(28.11) for a single observation.
1
See Exercise VI.10 for an extension of this result to nonparametric location models.
i i
i i
i i
490
2
From (28.17), we conclude that the sample complexity is n∗ (ϵ) = d dσϵ e, which grows linearly
with the dimension d. This is the common wisdom that “sample complexity scales proportionally
to the number of parameters”, also known as “counting the degrees of freedom”. Indeed in high
dimensions we typically expect the sample complexity to grow with the ambient dimension; how-
ever, the exact dependency need not be linear as it depends on the loss function and the objective
of estimation. For example, consider the matrix case θ ∈ Rd×d with n independent observations
in Gaussian noise. Let ϵ be a small constant. Then we have
2
• For quadratic loss, namely, kθ − θ̂k2F , we have R∗n = dn and hence n∗ (ϵ) = Θ(d2 );
• If the loss function is kθ − θ̂k2op , then R∗n dn and hence n∗ (ϵ) = Θ(d) (Example 28.4);
• As opposed to θ itself, suppose we are content with p estimating only the scalar functional θmax =
∗
max{θ1 , . . . , θd } up to accuracy ϵ, then n (ϵ) = Θ( log d) (Exercise VI.13).
In the last two examples, the sample complexity scales sublinearly with the dimension.
In this model, the observation X = (X1 , . . . , Xd ) consists of independent (not identically dis-
ind
tributed) Xi ∼ Pθi . This should be contrasted with the multiple-observation model in (28.14), in
which n iid observations drawn from the same distribution are given.
The minimax risk of the tensorized experiment is related to the minimax risk R∗ (Pi ) and worst-
case Bayes risks R∗Bayes (Pi ) ≜ supπ i ∈∆(Θi ) Rπ i (Pi ) of each individual experiment as follows:
Consequently, if minimax theorem holds for each experiment, i.e., R∗ (Pi ) = R∗Bayes (Pi ), then it
also holds for the product experiment and, in particular,
X
d
R∗ (P) = R∗ (Pi ). (28.19)
i=1
i i
i i
i i
Proof. The right inequality of (28.18) simply follows by separately estimating θi on the basis
of Xi , namely, θ̂ = (θ̂1 , . . . , θ̂d ), where θ̂i depends only on Xi . For the left inequality, consider
Qd
a product prior π = i=1 π i , under which θi ’s are independent and so are Xi ’s. Consider any
randomized estimator θ̂i = θ̂i (X, Ui ) of θi based on X, where Ui is some auxiliary randomness
independent of X. We can rewrite it as θ̂i = θ̂i (Xi , Ũi ), where Ũi = (X\i , Ui ) ⊥ ⊥ Xi . Thus θ̂i can
be viewed as it a randomized estimator based on Xi alone and its the average risk must satisfy
Rπ i (θ̂i ) = E[ℓ(θi , θ̂i )] ≥ R∗π i . Summing over i and taking the suprema over priors π i ’s yields the
left inequality of (28.18).
i.i.d.
Theorem 28.4. Consider the Gaussian location model X1 , . . . , Xn ∼ N(θ, Id ). Then for 1 ≤ q <
∞,
E[kZkqq ]
inf sup Eθ [kθ − θ̂kqq ] = , Z ∼ N(0, Id ).
θ̂ θ∈Rd nq/2
i i
i i
i i
492
Proof. Note that N(θ, Id ) is a product distribution and the loss function is separable: kθ − θ̂kqq =
Pd
i=1 |θi − θ̂i | . Thus the experiment is a d-fold tensor product of the one-dimensional version.
q
By Theorem 28.3, it suffices to consider d = 1. The upper bound is achieved by the sample mean
Pn
X = 1n i=1 Xi ∼ N(θ, n1 ), which is a sufficient statistic.
For the lower bound, following Example 28.2, consider a Gaussian prior θ ∼ π = N(0, s). Then
the posterior distribution is also Gaussian: Pθ|X = N(E[θ|X], 1+ssn ). The following lemma shows
that the Bayes estimator is simply the conditional mean:
Lemma 28.5. Let Z ∼ N(0, 1). Then miny∈R E[|y + Z|q ] = E[|Z|q ].
where the inequality follows from the simple observation that for any a > 0, P [|y + Z| ≤ a] ≤
P [|Z| ≤ a], due to the symmetry and unimodality of the normal density.
2
Another example is the multivariate model with the squared error; cf. Exercise VI.7.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 493
Theorem 28.7. Consider the d-dimensional GLM where X1 , . . . , Xn ∼ N(0, Id ) are observed.
Let the loss function be ℓ(θ, θ̂) = ρ(θ − θ̂), where ρ : Rd → R+ is bowl-shaped and lower-
semicontinuous. Then the minimax risk is given by
Z
R∗ ≜ inf sup Eθ [ρ(θ − θ̂)] = Eρ √ , Z ∼ N(0, Id ).
θ̂ θ∈Rd n
Pn
Furthermore, the upper bound is attained by X̄ = 1n i=1 Xi .
Corollary 28.8. Let ρ(·) = k · kq for some q > 0, where k · k is an arbitrary norm on Rd . Then
EkZkq
R∗ = . (28.20)
nq/2
R∗ √d .
n
We can also phrase the result of Corollary 28.8 in terms of the sample complexity n∗ (ϵ) as
defined in (28.16). For example, for q = 2 we have n∗ (ϵ) = E[kZk2 ]/ϵ . The above examples
show that the scaling of n∗ (ϵ) with dimension depends on the loss function and the “rule of thumb”
that the sampling complexity is proportional to the number of parameters need not always hold.
Finally, for the sake of high-probability (as opposed to average) risk bound, consider ρ(θ − θ̂) =
1{kθ − θ̂k > ϵ}, which is lower semicontinuous and bowl-shaped. Then the exact expression
√
R∗ = P kZk ≥ ϵ n . This result is stronger since the sample mean is optimal simultaneously for
all ϵ, so that integrating over ϵ recovers (28.20).
Proof of Theorem 28.7. We only prove the lower bound. We bound the minimax risk R∗ from
below by the Bayes risk R∗π with the prior π = N(0, sId ):
i i
i i
i i
494
In order to prove Lemma 28.9, it suffices to consider ρ being indicator functions. This is done
in the next lemma, which we prove later.
Lemma 28.10. Let K ∈ Rd be a symmetric convex set and X ∼ N(0, Σ). Then maxy∈Rd P(X + y ∈
K) = P(X ∈ K).
Proof of Lemma 28.9. Denote the sublevel set set Kc = {x ∈ Rd : ρ(x) ≤ c}. Since ρ is bowl-
shaped, Kc is convex and symmetric, which satisfies the conditions of Lemma 28.10. So,
Z ∞
E[ρ(y + x)] = P(ρ(y + x) > c)dc,
Z ∞
0
= (1 − P(y + x ∈ Kc ))dc,
Z ∞
0
≥ (1 − P(x ∈ Kc ))dc,
Z ∞
0
= P(ρ(x) ≥ c)dc,
0
= E[ρ(x)].
Hence, miny∈Rd E[ρ(y + x)] = E[ρ(x)].
Before going into the proof of Lemma 28.10, we need the following definition.
The following result, due to Prékopa [255], characterizes the log-concavity of measures in terms
of that of its density function; see also [266] (or [135, Theorem 4.2]) for a proof.
Theorem 28.12. Suppose that μ has a density f with respect to the Lebesgue measure on Rd . Then
μ is log-concave if and only if f is log-concave.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 495
• Lebesgue measure: Let μ = vol be the Lebesgue measure on Rd , which satisfies Theorem 28.12
(f ≡ 1). Then
vol(λA + (1 − λ)B) ≥ vol(A)λ vol(B)1−λ , (28.21)
which implies3 the Brunn-Minkowski inequality:
1 1 1
vol(A + B) d ≥ vol(A) d + vol(B) d . (28.22)
• Gaussian distribution: Let μ = N(0, Σ), with a log-concave density f since log f(x) =
− p2 log(2π ) − 12 log det(Σ) − 21 x⊤ Σ−1 x is concave.
3
Applying (28.21) to A′ = vol(A)−1/d A, B′ = vol(B)−1/d B (both of which have unit volume), and
λ = vol(A)1/d /(vol(A)1/d + vol(B)1/d ) yields (28.22).
i i
i i
i i
In this chapter we give an overview of the classical large-sample theory in the setting of iid obser-
vations in Section 28.4 focusing again on the minimax risk (28.15). These results pertain to smooth
parametric models in fixed dimensions, with the sole asymptotics being the sample size going to
infinity. The main result is that, under suitable conditions, the minimax squared error of estimating
i.i.d.
θ based on X1 , . . . , Xn ∼ Pθ satisfies
1 + o( 1)
inf sup Eθ [kθ̂ − θk22 ] = sup TrJ− 1
F (θ). (29.1)
θ̂ θ∈Θ n θ∈Θ
where JF (θ) is the Fisher information matrix introduced in (2.31) in Chapter 2. This is asymptotic
characterization of the minimax risk with sharp constant. In later chapters, we will proceed to high
dimensions where such precise results are difficult and rare.
Throughout this chapter, we focus on the quadratic risk and assume that Θ is an open set of the
Euclidean space Rd .
496
i i
i i
i i
Theorem 29.1 (HCR lower bound). The quadratic loss of any estimator θ̂ at θ ∈ Θ ⊂ Rd satisfies
(Eθ [θ̂] − Eθ′ [θ̂])2
Rθ (θ̂) = Eθ [(θ̂ − θ)2 ] ≥ Varθ (θ̂) ≥ sup . (29.2)
θ ′ ̸=θ χ2 (Pθ′ kPθ )
Next we apply Theorem 29.1 to unbiased estimators θ̂ that satisfies Eθ [θ̂] = θ for all θ ∈ Θ.
Then
(θ − θ′ )2
Varθ (θ̂) ≥ sup .
θ ′ ̸=θ χ2 (Pθ′ kPθ )
Lower bounding the supremum by the limit of θ′ → θ and recall the asymptotic expansion of
χ2 -divergence from Theorem 7.20, we get, under the regularity conditions in Theorem 7.20, the
celebrated Cramér-Rao (CR) lower bound [78, 259]:
1
Varθ (θ̂) ≥ . (29.4)
JF (θ)
A few more remarks are as follows:
• Note that the HCR lower bound Theorem 29.1 is based on the χ2 -divergence. For a version
based on Hellinger distance which also implies the CR lower bound, see Exercise VI.5.
• Both the HCR and the CR lower bounds extend to the multivariate case as follows. Let θ̂ be
an unbiased estimator of θ ∈ Θ ⊂ Rd . Assume that its covariance matrix Covθ (θ̂) = Eθ [(θ̂ −
θ)(θ̂ − θ)⊤ ] is positive definite. Fix a ∈ Rd . Applying Theorem 29.1 to ha, θ̂i, we get
h a, θ − θ ′ i 2
χ2 (Pθ kPθ′ ) ≥ .
a⊤ Covθ (θ̂)a
Optimizing over a yields1
Covθ (θ̂) J− 1
F (θ). (29.5)
1 ⟨x,y⟩2
For Σ 0, supx̸=0 x⊤ Σx
= y⊤ Σ−1 y, attained at x = Σ−1 y.
i i
i i
i i
498
• For a sample of n iid observations, by the additivity property (2.35), the Fisher information
matrix is equal to nJF (θ). Taking the trace on both sides, we conclude the squared error of any
unbiased estimators satisfies
1
Eθ [kθ̂ − θk22 ] ≥ Tr(J− 1
F (θ)).
n
This is already very close to (29.1), except for the fundamental restriction that of unbiased
estimators.
Similar to (29.3), applying data processing and variational representation of χ2 -divergence yields
(EP [θ − θ̂] − EQ [θ − θ̂])2
χ2 (PθX kQθX ) ≥ χ2 (Pθθ̂ kQθθ̂ ) ≥ χ2 (Pθ−θ̂ kQθ−θ̂ ) ≥ .
VarQ (θ̂ − θ)
Note that by design, PX = QX and thus EP [θ̂] = EQ [θ̂]; on the other hand, EP [θ] = EQ [θ] + δ .
Furthermore, Eπ [(θ̂ − θ)2 ] ≥ VarQ (θ̂ − θ). Since this applies to any estimators, we conclude that
the Bayes risk R∗π (and hence the minimax risk) satisfies
δ2
R∗π ≜ inf Eπ [(θ̂ − θ)2 ] ≥ sup , (29.6)
θ̂ δ̸=0 χ2 (PXθ kQXθ )
which is referred to as the Bayesian HCR lower bound in comparison with (29.2).
Similar to the deduction of CR lower bound from the HCR, we can further lower bound
this supremum by evaluating the small-δ limit. First note the following chain rule for the
χ2 -divergence:
" 2 #
dPθ
χ (PXθ kQXθ ) = χ (Pθ kQθ ) + EQ χ (PX|θ kQX|θ ) ·
2 2 2
.
dQθ
i i
i i
i i
Under suitable regularity conditions in Theorem 7.20, again applying the local expansion of χ2 -
divergence yields
R π ′2
• χ2 (Pθ kQθ ) = χ2 (Tδ π kπ ) = (J(π ) + o(1))δ 2 , where J(π ) ≜ π is the Fisher information of
the prior;
• χ2 (PX|θ kQX|θ ) = [JF (θ) + o(1)]δ 2 .
δ2 δ2 s
R∗π ≥ sup δ 2 (n+ 1s )
= lim
δ 2 (n+ 1s )
= .
δ̸=0 e −1 δ→0 e −1 sn + 1
In view of the Bayes risk found in Example 28.1, we see that in this case the Bayesian HCR and
Bayesian Cramér-Rao lower bounds are exact.
Theorem 29.2 (BCR lower bound). Let π be a differentiable prior density on the interval [θ0 , θ1 ]
such that π (θ0 ) = π (θ1 ) = 0 and
Z θ1 ′ 2
π (θ)
J( π ) ≜ dθ < ∞. (29.8)
θ0 π (θ)
i i
i i
i i
500
Let Pθ (dx) = pθ (x) μ(dx), where the density pθ (x) is differentiable in θ for μ-almost every x.
Assume that for π-almost every θ,
Z
μ(dx)∂θ pθ (x) = 0. (29.9)
Then the Bayes quadratic risk R∗π ≜ infθ̂ E[(θ − θ̂)2 ] satisfies
1
R∗π ≥ . (29.10)
Eθ∼π [JF (θ)] + J(π )
Proof. In view of Remark 28.3, it loses no generality to assume that the estimator θ̂ = θ̂(X) is
deterministic. For each x, integration by parts yields
Z θ1 Z θ1
dθ(θ̂(x) − θ)∂θ (pθ (x)π (θ)) = pθ (x)π (θ)dθ.
θ0 θ0
Then
R∗π ≜ inf Eπ [kθ̂ − θk22 ] ≥ Tr((Eθ∼π [JF (θ)] + J(π ))−1 ), (29.12)
θ̂
where the Fisher information matrices are given by JF (θ) = Eθ [∇θ log pθ (X)∇θ log pθ (X)⊤ ] and
J(π ) = diag(J(π 1 ), . . . , J(π d )).
i i
i i
i i
where ek denotes the kth standard basis. Applying Cauchy-Schwarz and optimizing over u yield
h u , ek i 2
E[(θ̂k (X) − θk )2 ] ≥ sup = Σ− 1
kk ,
u̸=0 u⊤ Σ u
where Σ ≡ E[∇ log(pθ (X)π (θ))∇ log(pθ (X)π (θ))⊤ ] = Eθ∼π [JF (θ)] + J(π ), thanks to (29.11).
Summing over k completes the proof of (29.12).
• The above versions of the BCR bound assume a prior density that vanishes at the boundary.
If we choose a uniform prior, the same derivation leads to a similar lower bound known as
the Chernoff-Rubin-Stein inequality (see Ex. VI.4), which also suffices for proving the optimal
minimax lower bound in (29.1).
• For the purpose of the lower bound, it is advantageous to choose a prior density with the mini-
mum Fisher information. The optimal density with a compact support is known to be a squared
cosine density [160, 315]:
min J( g ) = π 2 ,
g on [−1,1]
attained by
πu
g(u) = cos2 . (29.13)
2
• Suppose the goal is to estimate a smooth functional T(θ) of the unknown parameter θ, where
T : Rd → Rs is differentiable with ∇T(θ) = ( ∂ T∂θi (θ)
j
) its s × d Jacobian matrix. Then under the
same condition of Theorem 29.3, we have the following Bayesian Cramér-Rao lower bound for
functional estimation:
As a consequence of the BCR bound, we prove the lower bound part for the asymptotic minimax
risk in (29.1).
Theorem 29.4. Assume that θ 7→ JF (θ) is continuous. Denote the minimax squared error R∗n ≜
i.i.d.
infθ̂ supθ∈Θ Eθ [kθ̂ − θk22 ], where Eθ is taken over X1 , . . . , Xn ∼ Pθ . Then as n → ∞,
1 + o( 1)
R∗n ≥ sup TrJ− 1
F (θ). (29.15)
n θ∈Θ
Proof. Fix θ ∈ Θ. Then for all sufficiently small δ , B∞ (θ, δ) = θ + [−δ, δ]d ⊂ Θ. Let π i (θi ) =
1 θ−θi Qd
δ g( δ ), where g is the prior density in (29.13). Then the product distribution π = i=1 π i
satisfies the assumption of Theorem 29.3. By the scaling rule of Fisher information (see (2.34)),
2 2
J(π i ) = δ12 J(g) = δπ2 . Thus J(π ) = δπ2 Id .
i i
i i
i i
502
It is known that (see [44, Theorem 2, Appendix V]) the continuity of θ 7→ JF (θ) implies (29.11).
So we are ready to apply the BCR bound in Theorem 29.3. Lower bounding the minimax by the
Bayes risk and also applying the additivity property (2.35) of Fisher information, we obtain
− 1 !
∗ 1 π2
Rn ≥ · Tr Eθ∼π [JF (θ)] + 2 Id .
n nδ
Finally, choosing δ = n−1/4 and applying the continuity of JF (θ) in θ, the desired (29.15) follows.
Similarly, for estimating a smooth functional T(θ), applying (29.14) with the same argument
yields
1 + o(1)
inf sup Eθ [kT̂ − T(θ)k22 ] ≥ sup Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ). (29.16)
T̂ θ∈Θ n θ∈Θ
where
X
n
Lθ (Xn ) = log pθ (Xi )
i=1
is the total log-likelihood and pθ (x) = dP dμ (x) is the density of Pθ with respect to some com-
θ
mon dominating measure μ. For discrete distribution Pθ , the MLE can also be written as the KL
projection2 of the empirical distribution P̂n to the model class: θ̂MLE ∈ arg minθ∈Θ D(P̂n kPθ ).
2
Note that this is the reverse of the information projection studied in Section 15.3.
i i
i i
i i
The main intuition why MLE works is as follows. Assume that the model is identifiable, namely,
θ 7→ Pθ is injective. Then for any θ 6= θ0 , we have by positivity of the KL divergence (Theorem 2.3)
" n #
X pθ ( X i )
E θ 0 [ Lθ − Lθ0 ] = E θ 0 log = −nD(Pθ0 ||Pθ ) < 0.
pθ0 (Xi )
i=1
In other words, Lθ − Lθ0 is an iid sum with a negative mean and thus negative with high probability
for large n. From here the consistency of MLE follows upon assuming appropriate regularity
conditions, among which is Wald’s integrability condition Eθ0 [sup∥θ−θ0 ∥≤ϵ log ppθθ (X)] < ∞ [330,
0
333].
Assuming more conditions one can obtain the asymptotic normality and efficiency of the
MLE. This follows from the local quadratic approximation of the log-likelihood function. Define
V(θ, x) ≜ ∇θ pθ (x) (score) and H(θ, x) ≜ ∇2θ pθ (x). By Taylor expansion,
! !
Xn
1 Xn
⊤ ⊤
Lθ =Lθ0 + (θ − θ0 ) V(θ0 , Xi ) + (θ − θ0 ) H(θ0 , Xi ) (θ − θ0 )
2
i=1 i=1
+ o(n(θ − θ0 ) ).
2
(29.18)
Recall from Section 2.6.2* that, under suitable regularity conditions, we have
Eθ0 [V(θ0 , X)] = 0, Eθ0 [V(θ0 , X)V(θ0 , X)⊤ ] = −Eθ0 [H(θ0 , X)] = JF (θ0 ).
Thus, by the Central Limit Theorem and the Weak Law of Large Numbers, we have
1 X 1X
n n
d P
√ V(θ0 , Xi )−
→N (0, JF (θ0 )), H(θ0 , Xi )−
→ − JF (θ0 ).
n n
i=1 i=1
Substituting these quantities into (29.18), we obtain the following stochastic approximation of the
log-likelihood:
p n
Lθ ≈ Lθ0 + h nJF (θ0 )Z, θ − θ0 i − (θ − θ0 )⊤ JF (θ0 )(θ − θ0 ),
2
where Z ∼ N (0, Id ). Maximizing the right-hand side yields:
1
θ̂MLE ≈ θ0 + √ JF (θ0 )−1/2 Z.
n
From this asymptotic normality, we can obtain Eθ0 [kθ̂MLE − θ0 k22 ] ≤ n1 (TrJF (θ0 )−1 + o(1)), and
for smooth functionals by Taylor expanding T at θ0 (delta method), Eθ0 [kT(θ̂MLE ) − T(θ0 )k22 ] ≤
−1 ⊤
n (Tr(∇T(θ0 )JF (θ0 ) ∇T(θ0 ) ) + o(1)), matching the information bounds (29.15) and (29.16).
1
Of course, the above heuristic derivation requires additional assumptions to justify (for example,
Cramér’s condition, cf. [126, Theorem 18] and [274, Theorem 7.63]). Even stronger assumptions
are needed to ensure the error is uniform in θ in order to achieve the minimax lower bound in
Theorem 29.4; see, e.g., Theorem 34.4 (and also Chapters 36-37) of [44] for the exact conditions
and statements. A more general and abstract theory of MLE and the attainment of information
bound were developed by Hájek and Le Cam; see [152, 193].
Despite its wide applicability and strong optimality properties, the methodology of MLE is not
without limitations. We conclude this section with some remarks along this line.
i i
i i
i i
504
• MLE may not exist even for simple parametric models. For example, consider X1 , . . . , Xn
drawn iid from the location-scale mixture of two Gaussians 12 N ( μ1 , σ12 ) + 12 N ( μ2 , σ22 ), where
( μ1 , μ2 , σ1 , σ2 ) are unknown parameters. Then the likelihood can be made arbitrarily large by
setting for example μ1 = X1 and σ1 → 0.
• MLE may be inconsistent; see [274, Example 7.61] and [125] for examples, both in one-
dimensional parametric family.
• In high dimensions, it is possible that MLE fails to achieve the minimax rate (Exercise VI.14).
Theorem 29.5. Fox fixed k, the minimax squared error of estimating P satisfies
b − Pk22 ] = 1 k − 1 + o(1) , n → ∞.
R∗sq (k, n) ≜ inf sup E[kP (29.19)
b
P P∈Pk n k
diag(θ) − θθ⊤ − Pk θ
∇T(θ)J− 1
F (θ)∇T(θ)
⊤
=
−Pk θ⊤ Pk (1 − Pk ).
Pk Pk
So Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ) = i=1 Pi (1 − Pi ) = 1 −
2
i=1 Pi , which achieves its maximum
1 − 1k at the uniform distribution. Applying the functional form of the BCR bound in (29.16), we
conclude R∗sq (k, n) ≥ n1 (1 − 1k + o(1)).
For the upper bound, consider the MLE, which in this case coincides with the empirical dis-
Pn
tribution P̂ = (P̂i ) (Exercise VI.8). Note that nP̂i = j=1 1{Xj =i} ∼ Bin(n, Pi ). Then for any P,
Pk
E[kP̂ − Pk22 ] = n1 i=1 Pi (1 − Pi ) ≤ 1n (1 − 1k ).
i i
i i
i i
−1/k
• In fact, for any k, n, we have the precise result: R∗sq (k, n) = (11+√ 2 – see Ex. VI.7h. This can be
n)
shown by considering a Dirichlet prior (13.15) and applying the corresponding Bayes estimator,
which is an additively-smoothed empirical distribution (Section 13.5).
• Note that R∗sq (k, n) does not grow with the alphabet size k; this is because squared loss is
too weak for estimating probability vectors. More meaningful loss functions include the f-
divergences in Chapter 7, such as the total variation, KL divergence, χ2 -divergence. These
minimax rates are worked out in Exercise VI.8 and Exercise VI.9, for both small and large
alphabets, and they indeed depend on the alphabet size k. For example, the minimax KL risk
satisfies Θ( nk ) for k ≤ n and grows as Θ(log nk ) for k n. This agrees with the rule of thumb
that consistent estimation requires the sample size to scale faster than the dimension.
As a final application, let us consider the classical problem of entropy estimation in information
theory and statistics [219, 98, 156], where the goal is to estimate the Shannon entropy, a non-linear
functional of P. The following result follows from the functional BCR lower bound (29.16) and
analyzing the MLE (in this case the empirical entropy) [25].
Theorem 29.6. For fixed k, the minimax quadratic risk of entropy estimation satisfies
b (X1 , . . . , Xn ) − H(P))2 ] = 1 max V(P) + o(1) , n → ∞
R∗ent (k, n) ≜ inf sup E[(H
b P∈Pk
H n P∈Pk
Pk
where H(P) = i=1 Pi log P1i = E[log P(1X) ] and V(P) = Var[log P(1X) ] are the Shannon entropy
and varentropy (cf. (10.4)) of P.
Let us analyze the result of Theorem 29.6 and see how it extends to large alphabets. It can be
2
shown that3 maxP∈Pk V(P) log2 k, which suggests that R∗ent ≡ R∗ent (k, n) may satisfy R∗ent logn k
even when the alphabet size k grows with n; however, this result only holds for sufficiently small
alphabet. In fact, back in Lemma 13.2 we have shown that for the empirical entropy which achieves
the bound in Theorem 29.6, its bias is on the order of nk , which is no longer negligible on large
alphabets. Using techniques of polynomial approximation [335, 168], one can reduce this bias to
n log k and further show that consistent entropy estimation is only possible if and only if n log k
k k
3
Indeed, maxP∈Pk V(P) ≤ log2 k for all k ≥ 3 [239, Eq. (464)]. For the lower bound, consider
P = ( 12 , 2(k−1)
1 1
, . . . 2(k−1) ).
i i
i i
i i
In this chapter we describe a strategy for proving statistical lower bound we call the Mutual Infor-
mation Method (MIM), which entails comparing the amount of information data provides with
the minimum amount of information needed to achieve a certain estimation accuracy. Similar to
Section 29.2, the main information-theoretical ingredient is the data-processing inequality, this
time for mutual information as opposed to f-divergences.
Here is the main idea of the MIM: Fix some prior π on Θ and we aim to lower bound the Bayes
risk R∗π of estimating θ ∼ π on the basis of X with respect to some loss function ℓ. Let θ̂ be an
estimator such that E[ℓ(θ, θ̂)] ≤ D. Then we have the Markov chain θ → X → θ̂. Applying the
data processing inequality (Theorem 3.7), we have
Note that
• The leftmost quantity can be interpreted as the minimum amount of information required to
achieve a given estimation accuracy. This is precisely the rate-distortion function ϕ(D) ≡ ϕθ (D)
(recall Section 24.3).
• The rightmost quantity can be interpreted as the amount of information provided by the data
about the latent parameter. Sometimes it suffices to further upper-bound it by the capacity of
the channel PX|θ by maximizing over all priors (Chapter 5):
Therefore, we arrive at the following lower bound on the Bayes and hence the minimax risks
The reasoning of the mutual information method is reminiscent of the converse proof for joint-
source channel coding in Section 26.3. As such, the argument here retains the flavor of “source-
channel separation”, in that the lower bound in (30.1) depends only on the prior (source) and
the loss function, while the capacity upper bound (30.2) depends only on the statistical model
(channel).
In the next few sections, we discuss a sequence of examples to illustrate the MIM and its
execution:
506
i i
i i
i i
• Denoising a vector in Gaussian noise, where we will compute the exact minimax risk;
• Denoising a sparse vector, where we determine the sharp minimax rate;
• Community detection, where the goal is to recover a dense subgraph planted in a bigger Erd�s-
Rényi graph.
In the next chapter we will discuss three popular approaches for, namely, Le Cam’s method,
Assouad’s lemma, and Fano’s method. As illustrated in Fig. 30.1, all three follow from the mutual
Figure 30.1 The three lower bound techniques as consequences of the Mutual Information Method.
information method, corresponding to different choice of prior π for θ, namely, the uniform dis-
tribution over a two-point set {θ0 , θ1 }, the hypercube {0, 1}d , and a packing (recall Section 27.1).
While these methods are highly useful in determining the minimax rate for many problems, they
are often loose with constant factors compared to the MIM. In the last section of this chapter, we
discuss the problem of how and when is non-trivial estimation achievable by applying the MIM;
for this purpose, none of the three methods in the next chapter works.
i i
i i
i i
508
Using the sufficiency of X̄ and the formula of Gaussian channel capacity (cf. Theorem 5.11 or
Theorem 20.11), the mutual information between the parameter and the data can be computed as
d
I(θ; X) = I(θ; X̄) = log(1 + sn).
2
It then follows from (30.3) that R∗π ≥ sd
1+sn , which in fact matches the exact Bayes risk in (28.7).
Sending s → ∞ yields the identity
d
R∗ (Rd ) =
. (30.4)
n
In the above unconstrained GLM, we are able to compute everything in close form when
applying the mutual information method. Such exact expressions are rarely available in more
complicated models in which case various bounds on the mutual information will prove useful.
Next, let us consider the GLM with bounded means, where the parameter space Θ = B(ρ) =
{θ : kθk2 ≤ ρ} is the ℓ2 -ball of radius ρ centered at zero. In this case there is no known close-
form formula for the minimax quadratic risk even in one dimension. Nevertheless, the next result
determines the sharp minimax rate, which characterizes the minimax risk up to universal constant
factors.
where Z ∼ N(0, Id ). Alternatively, we can use Corollary 5.8 to bound the capacity (as information
radius) by the KL diameter, which yields the same bound within constant factors:
1
I(θ; X) ≤ sup I(θ; θ + √ Z) ≤ max D(N(θ, Id /n)kN(θ, Id /n)k) = 2nr2 . (30.7)
Pθ :∥θ∥≤r n θ,θ ′ ∈B(r)
For the lower bound, due to the lack of close-form formula for the rate-distortion function
for uniform distribution over Euclidean balls, we apply the Shannon lower bound (SLB) from
Section 26.1. Since θ has an isotropic distribution, applying Theorem 26.3 yields
d 2πed d cr2
inf I(θ; θ̂) ≥ h(θ) + log ≥ log ,
Pθ̂|θ :E∥θ−θ̂∥2 ≤D 2 D 2 D
i i
i i
i i
for some universal constant c, where the last inequality is because for θ ∼ Unif(B(r)), h(θ) =
log vol(B(r)) = d log r + log vol(B(1)) and the volume of a unit Euclidean ball in d dimensions
satisfies (recall (27.14)) vol(B(1))1/d √1d .
2 2
∗ 2 −nr /d 2
R∗ ≤ 2 , i.e., R ≥ cr e
Finally, applying (30.3) yields 12 log cr nr
. Optimizing over r and
−ax −a
using the fact that sup0<x<1 xe = ea if a ≥ 1 and e if a < 1, we have
1
d
R∗ ≥ sup cr2 e−nr /d
2
∧ ρ2 .
r∈[0,ρ] n
Finally, to further demonstrate the usefulness of the SLB, we consider non-quadratic loss
ℓ(θ, θ̂) = kθ − θ̂kr , the rth power of an arbitrary norm on Rd , for which the SLB was given in
(26.5) (see Exercise V.6). Applying the mutual information method yields the following minimax
lower bound.
i.i.d.
Theorem 30.2 (GLM with norm loss). Let X = (X1 , · · · , Xn ) ∼ N (θ, Id ) and let r > 0 be a
constant. Then
r/ 2 −r/d
d 2πe d − r/ d
inf sup Eθ [kθ̂ − θk ] ≥
r
V∥·∥ Γ 1 + n−r/2 V∥·∥ . (30.8)
θ̂ θ∈Rd re n r
Furthermore,
r
d
≲ nr/2 · inf sup Eθ [kθ̂ − θkr ] ≲ E[kZkr ], (30.9)
E[kZk∗ ] θ̂ θ∈Rd
Proof. Choose a Gaussian prior θ ∼ N (0, sId ). Suppose E[kθ̂ −θkr ] ≤ D. By the data processing
inequality,
( d )
d d Dre r d
log(1 + ns) ≥ I(θ; X) ≥ I(θ; θ̂) ≥ log(2πes) − log V∥·∥ Γ 1+ ,
2 2 d r
where the last inequality follows from (26.5). Rearranging terms and sending s → ∞ yields the
first inequality in (30.8), and the second follows from Stirling’s approximation Γ(x)1/x x for
x → ∞. For (30.9), the upper bound follows from choosing θ̂ = X̄ and the lower bound follows
from applying (30.8) with the following bound of Urysohn (cf., e.g., [235, p. 7]) on the volume of
a symmetric convex body.
Lemma 30.3. For any symmetric convex body K ⊂ Rd , vol(K)1/d ≲ w(K)/d, where w(K) is the
Gaussian width of K defined in (27.19).
Example 30.1. Recall from Theorem 28.7 that the upper bound in (30.9) is an equality. In view
of this, let us evaluate the tightness of the lower bound from SLB. As an example, consider r = 2
P 1/q
d
and the ℓq -norm kxkq = i=1 |xi |
q
with 1 ≤ q ≤ ∞. Recall the formula (27.13) for the
i i
i i
i i
510
In the special case of q = 2, we see that the lower bound in (30.8) is in fact exact and coincides with
2/ q
(30.4). For general q ∈ [1, ∞), (30.8) shows that the minimax rate is d n . However, for q = ∞,
the minimax lower bound we get is 1/p n, independent of the dimension d. In fact, the upper bound
in (30.9) is tight and since EkZk∞ log d (cf. Lemma 27.10), the minimax rate for the squared
ℓ∞ -risk is logn d . We will revisit this example in Section 31.4 and show how to obtain the sharp
logarithmic dependency on the dimension.
Remark 30.2 (SLB versus the volume method). Recall the connection between rate-distortion
function and the metric entropy in Section 27.7. As we have seen in Section 27.2, a common
lower bound for metric entropy is via the volume bound. In fact, the SLB can be interpreted as
a volume-based lower bound to the rate-distortion function. To see this, consider r = 1 and let θ
be uniformly distributed over some compact set Θ, so that h(θ) = log vol(Θ) (Theorem 2.6.(a)).
Applying Stirling’s approximation, the lower bound in (26.5) becomes log vol(vol (Θ)
B∥·∥ (cϵ)) for some
constant c, which has the same form as the volume ratio in Theorem 27.3 for metric entropy. We
will see later in Section 31.4 that in statistical applications, applying SLB yields basically the same
lower bound as applying Fano’s method to a packing obtained from the volume bound, although
SLB does not rely explicitly on a packing.
Next we prove an optimal lower bound applying MIM. (For a different proof using Fano’s method
in Section 31.4, see Exercise VI.11.)
Theorem 30.4.
k ed
R∗n (B0 (k)) ≳ log . (30.10)
n k
A few remarks are in order:
i i
i i
i i
Remark 30.3. • The lower bound (30.10) turns out to be tight, achieved by the maximum
likelihood estimator
which is equivalent to keeping the k entries from X̄ with the largest magnitude and setting the
rest to zero, or the following hard-thresholding estimator θ̂τ with an appropriately chosen τ (see
Exercise VI.12):
• Sharp asymptotics: For sublinear sparsity k = o(d), we have R∗n (B0 (k)) = (2 + o(1)) nk log dk
(Exercise VI.12); for linear sparsity k = (η + o(1))d with η ∈ (0, 1), R∗n (B0 (k)) = (β(η) +
o(1))d for some constant β(η). For the latter and more refined results, we refer the reader to the
monograph [171, Chapter 8].
Proof. First, note that B0 (k) is a union of linear subspace of Rd and thus homogeneous. Therefore
by scaling, we have
1 ∗ 1
R∗n (B0 (k)) = R (B0 (k)) ≜ R∗ (k, d). (30.13)
n 1 n
Thus it suffices to consider n = 1. Denote the observation by X = θ + Z.
Next, note that the following oracle lower bound:
R∗ (k, d) ≥ k,
which is the optimal risk given the extra information of the support of θ, in view of (30.4). Thus
to show (30.10), below it suffices to consider k ≤ d/4.
We now apply the mutual information method. Recall from (27.10) that Sdk denotes the
Hamming sphere, namely,
τ2
kθ − θ̂k22 ≥ dH (b, b̂). (30.14)
4
Let EdH (b, b̂) = δ k. Assume that δ ≤ 14 , for otherwise, we are done.
i i
i i
i i
512
Note the the following Markov chain b → θ → X → θ̂ → b̂ and thus, by the data processing
inequality of mutual information,
d kτ 2 kτ 2 k d
I(b; b̂) ≤ I(θ; X) ≤ log 1 + ≤ = log .
2 d 2 2 k
where the second inequality follows from the fact that kθk22 = kτ 2 and the Gaussian channel
capacity.
Conversely,
I(b̂; b) ≥ min I(b̂; b)
EdH (b,b̂)≤δ d
Theorem 30.5. Assume that k/n is bounded away from one. If almost exact recovery is possible,
then
2 + o( 1) n
d(pkq) ≥ log . (30.16)
k−1 k
i i
i i
i i
which can be shown by a reduction to testing the membership of two nodes given the rest. It turns
out that conditions (30.16) and (30.17) are optimal, in the sense that almost exact recovery can be
achieved (via maximum likelihood) provided that (30.17) holds and d(pkq) ≥ 2k− +ϵ n
1 log k for any
constant ϵ > 0. For details, we refer the readers to [151].
Proof. Suppose Ĉ achieves almost exact recovery of C∗ . Let ξ ∗ , ξˆ ∈ {0, 1}k denote their indicator
vectors, respectively, for example, ξi∗ = 1{i∈C∗ } for each i ∈ [n]. Then Then E[dH (ξ, ξ)]
ˆ = ϵn k for
some ϵn → 0. Applying the mutual information method as before, we have
( a) n ϵn k (b) n
∗ ˆ ∗
I(G; ξ ) ≥ I(ξ; ξ ) ≥ log − nh ≥ k log (1 + o(1)),
k n k
where (a) follows in exact the same manner as (30.15) did from Exercise I.9; (b) follows from the
assumption that k/n ≤ 1 − c for some constant c.
On the other hand, we upper bound the mutual information between the hidden community and
the graph as follows:
(a) (b) (c) k
∗ ⊗(n2)
I(G; ξ ) = min D(PG|ξ∗ kQ|Pξ∗ ) ≤ D(PG|ξ∗ kBer(q) |Pξ∗ ) = d(pkq),
Q 2
where (a) is by the variational representation of mutual information in Corollary 4.2; (b) follows
from choosing Q to be the distribution of the Erd�s-Rényi graph G(n, q); (c) is by the tensorization
property of KL divergence for product distributions (see Theorem 2.14). Combining the last two
displays completes the proof.
i i
i i
i i
514
i.i.d.
Theorem 30.6 (Bounded GLM continued). Suppose X1 , . . . , Xn ∼ N (θ, Id ), where θ belongs to
B, the unit ℓ2 -ball in Rd . Then for some universal constant C0 ,
n+C0 d
e− d−1 ≤ inf sup Eθ [kθ̂ − θk2 ] ≤ .
θ̂ θ∈B d+n
Proof. Without loss of generality, assume that the observation is X = θ+ √Zn , where Z ∼ N (0, Id ).
For the upper bound, applying the shrinkage estimator1 θ̂ = 1+1d/n X yields E[kθ̂ − θk2 ] ≤ n+d d .
For the lower bound, we apply MIM as in Theorem 30.1 with the prior θ ∼ Unif(Sd−1 ). We
still apply the AWGN capacity in (30.6) to get I(θ; X) ≤ n/2. (Here the constant 1/2 is important
and so the diameter-based (30.7) is too loose.) For the rate-distortion function of spherical uniform
distribution, applying Theorem 27.17 yields I(θ; θ̂) ≥ d−2 1 log E[∥θ̂−θ∥
1
2]
− C. Thus the lower bound
on E[kθ̂ − θk2 ] follows from the data processing inequality.
A similar phenomenon also occurs in the problem of estimating a discrete distribution P on k
elements based on n iid observations, which has been studied in Section 29.4 for small alphabet in
the large-sample asymptotics and extended in Exercise VI.7–VI.9 to large alphabets. In particular,
consider the total variation loss, which is at most one. Ex. VI.9f shows that the TV error of any
estimator is 1 − o(1) if n k; conversely, Ex. VI.9b demonstrates an estimator P̂ such that
E[χ2 (PkP̂)] ≤ nk− 1 2
+1 . Applying the joint range (7.29) between TV and χ and Jensen’s inequality,
we have
q
1 k− 1 n ≥ k − 2
E[TV(P, P̂)] ≤ 2 n+1
k− 1 n≤k−2
k+n
which is bounded away from one whenever n = Ω(k). In summary, non-trivial estimation in total
variation is possible if and only if n scales at least proportionally with k.
1
This corresponds to the Bayes estimator (Example 28.1) when we choose θ ∼ N (0, 1d Id ), which is approximately
concentrated on the unit sphere.
i i
i i
i i
In this chapter we study three commonly used techniques for proving minimax lower bounds,
namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in
Chapter 29 geared towards large-sample asymptotics in smooth parametric models, the approach
here is more generic, less tied to mean-squared error, and applicable in nonasymptotic settings
such as nonparametric or high-dimensional problems.
The common rationale of all three methods is reducing statistical estimation to hypothesis test-
ing. Specifically, to lower bound the minimax risk R∗ (Θ) for the parameter space Θ, the first step
is to notice that R∗ (Θ) ≥ R∗ (Θ′ ) for any subcollection Θ′ ⊂ Θ, and Le Cam, Assouad, and Fano’s
methods amount to choosing Θ′ to be a two-point set, a hypercube, or a packing, respectively. In
particular, Le Cam’s method reduces the estimation problem to binary hypothesis testing. This
method is perhaps the easiest to evaluate; however, the disadvantage is that it is frequently loose
in estimating high-dimensional parameters. To capture the correct dependency on the dimension,
both Assouad’s and Fano’s method rely on reduction to testing multiple hypotheses.
As illustrated in Fig. 30.1, all three methods in fact follow from the common principle of the
mutual information method (MIM) in Chapter 30, corresponding to different choice of priors.
The limitation of these methods, compared to the MIM, is that, due to the looseness in constant
factors, they are ineffective for certain problems such as estimation better than chance discussed
in Section 30.4.
Then
ℓ(θ0 , θ1 )
inf sup Eθ ℓ(θ, θ̂) ≥ sup (1 − TV(Pθ0 , Pθ1 )) (31.2)
θ̂ θ∈Θ θ0 ,θ1 ∈Θ 2α
515
i i
i i
i i
516
Proof. Fix θ0 , θ1 ∈ Θ. Given any estimator θ̂, let us convert it into the following (randomized)
test:
θ0 with probability ℓ(θ1 ,θ̂)
,
ℓ(θ0 ,θ̂)+ℓ(θ1 ,θ̂)
θ̃ =
θ1 with probability ℓ(θ0 ,θ̂)
.
ℓ(θ ,θ̂)+ℓ(θ ,θ̂) 0 1
and similarly for θ1 . Consider the prior π = 12 (δθ0 + δθ1 ) and let θ ∼ π. Taking expectation on
both sides yields the following lower bound on the Bayes risk:
ℓ(θ0 , θ1 ) ℓ(θ0 , θ1 )
Eπ [ℓ(θ̂, θ)] ≥ P θ̃ 6= θ ≥ (1 − TV(Pθ0 , Pθ1 ))
α 2α
where the last step follows from the minimum average probability of error in binary hypothesis
testing (Theorem 7.7).
Remark 31.1. As an example where the bound (31.2) is tight (up to constants), consider a binary
hypothesis testing problem with Θ = {θ0 , θ1 } and the Hamming loss ℓ(θ, θ̂) = 1{θ 6= θ̂}, where
θ, θ̂ ∈ {θ0 , θ1 } and α = 1. Then the left side is the minimax probability of error, and the right
side is the optimal average probability of error (cf. (7.17)). These two quantities can coincide (for
example for Gaussian location model).
Another special case of interest is the quadratic loss ℓ(θ, θ̂) = kθ − θ̂k22 , where θ, θ̂ ∈ Rd , which
satisfies the α-triangle inequality with α = 2. In this case, the leading constant 41 in (31.2) makes
sense, because in the extreme case of TV = 0 where Pθ0 and Pθ1 cannot be distinguished, the best
estimate is simply θ0 +θ2 . In addition, the inequality (31.2) can be deduced based on properties of
1
f-divergences and their joint range (Chapter 7). To this end, abbreviate Pθi as Pi for i = 0, 1 and
consider the prior π = 12 (δθ0 + δθ1 ). Then the Bayes estimator (posterior mean) is θ0 dP 0 +θ1 dP1
dP0 +dP1 and
the Bayes risk is given by
Z
kθ0 − θ1 k2 dP0 dP1
R∗π =
2 dP0 + dP1
kθ0 − θ1 k2 kθ0 − θ1 k2
= (1 − LC(P0 , P1 )) ≥ (1 − TV(P0 , P1 )),
4 4
R 0 −dP1 )
2
where LC(P0 , P1 ) = (dP dP0 +dP1 is the Le Cam divergence defined in (7.6) and satisfies LC ≤ TV.
Example 31.1. As a concrete example, consider the one-dimensional GLM with sample size n.
Pn
By considering the sufficient statistic X̄ = 1n i=1 Xi , the model is simply {N(θ, 1n ) : θ ∈ R}.
Applying Theorem 31.1 yields
∗ 1 1 1
R ≥ sup |θ0 − θ1 | 1 − TV N θ0 ,
2
, N θ1 ,
θ0 ,θ1 ∈R 4 n n
( a) 1 ( b) c
= sup s2 (1 − TV(N(0, 1), N(s, 1))) = (31.3)
4n s>0 n
i i
i i
i i
where (a) follows from the shift and scale invariance of the total variation; in (b) c ≈ 0.083 is
some absolute constant, obtained by applying the formula TV(N(0, 1), N(s, 1)) = 2Φ( 2s ) − 1 from
(7.37). On the other hand, we know from Example 28.2 that the minimax risk equals 1n , so the
two-point method is rate-optimal in this case.
In the above example, for two points separated by Θ( √1n ), the corresponding hypothesis cannot
be tested with vanishing probability of error so that the resulting estimation risk (say in squared
error) cannot be smaller than 1n . This convergence rate is commonly known as the “parametric
rate”, which we have studied in Chapter 29 for smooth parametric families focusing on the Fisher
information as the sharp constant. More generally, the 1n rate is not improvable for models with
locally quadratic behavior
(Recall that Theorem 7.21 gives a sufficient condition for this behavior.) Indeed, pick θ0 in the
interior of the parameter space and set θ1 = θ0 + √1n , so that H2 (Pθ0 , Pθ1 ) = Θ( 1n ) thanks to (31.4).
By Theorem 7.8, we have TV(P⊗ ⊗n
θ0 , Pθ1 ) ≤ 1 − c for some constant c and hence Theorem 31.1
n
yields the lower bound Ω(1/n) for the squared error. Furthermore, later we will show that the same
locally quadratic behavior in fact guarantees the achievability of the 1/n rate; see Corollary 32.11.
Example 31.2. As a different example, consider the family Unif(0, θ). Note that as opposed to the
quadratic behavior (31.4), we have
√
H2 (Unif(0, 1), Unif(0, 1 + t)) = 2(1 − 1/ 1 + t) t.
Thus an application of Theorem 31.1 yields an Ω(1/n2 ) lower bound. This rate is not achieved by
the empirical mean estimator (which only achieves 1/n rate), but by the the maximum likelihood
estimator θ̂ = max{X1 , . . . , Xn }. Other types of behavior in t, and hence the rates of convergence,
can occur even in compactly supported location families – see Example 7.1.
The limitation of Le Cam’s two-point method is that it does not capture the correct dependency
on the dimensionality. To see this, let us revisit Example 31.1 for d dimensions.
Example 31.3. Consider the d-dimensional GLM in Corollary 28.8. Again, it is equivalent to con-
sider the reduced model {N(θ, 1n ) : θ ∈ Rd }. We know from Example 28.2 (see also Theorem 28.4)
that for quadratic risk ℓ(θ, θ̂) = kθ − θ̂k22 , the exact minimax risk is R∗ = dn for any d and n. Let
us compare this with the best two-point lower bound. Applying Theorem 31.1 with α = 2,
1 1 1
R∗ ≥ sup kθ0 − θ1 k22 1 − TV N θ0 , Id , N θ1 , Id
θ0 ,θ1 ∈Rd 4 n n
1
= sup kθk22 {1 − TV (N (0, Id ) , N (θ, Id ))}
θ∈Rd 4n
1
= sup s2 (1 − TV(N(0, 1), N(s, 1))),
4n s>0
where the second step applies the shift and scale invariance of the total variation; in the last step,
by rotational invariance of isotropic Gaussians, we can rotate the vector θ align with a coordinate
i i
i i
i i
518
vector (say, e1 = (1, 0 . . . , 0)) which reduces the problem to one dimension, namely,
Comparing the above display with (31.3), we see that the best Le Cam two-point lower bound in
d dimensions coincide with that in one dimension.
Let us mention in passing that although Le Cam’s two-point method is typically suboptimal for
estimating a high-dimensional parameter θ, for functional estimation in high dimensions (e.g. esti-
mating a scalar functional T(θ)), Le Cam’s method is much more effective and sometimes even
optimal. The subtlety is that is that as opposed to testing a pair of simple hypotheses H0 : θ = θ0
versus H1 : θ = θ1 , we need to test H0 : T(θ) = t0 versus H1 : T(θ) = t1 , both of which are
composite hypotheses and require a sagacious choice of priors. See Exercise VI.13 for an example.
Theorem 31.2 (Assouad’s Lemma). Assume that the loss function ℓ satisfies the α-triangle
inequality (31.1). Suppose Θ contains a subset Θ′ = {θb : b ∈ {0, 1}d } indexed by the hypercube,
such that ℓ(θb , θb′ ) ≥ β · dH (b, b′ ) for all b, b′ and some β > 0. Then
βd
inf sup Eθ ℓ(θ, θ̂) ≥ 1 − max TV(Pθb , Pθb′ ) (31.5)
θ̂ θ∈Θ 4α dH (b,b′ )=1
Proof. We lower bound the Bayes risk with respect to the uniform prior over Θ′ . Given any
estimator θ̂ = θ̂(X), define b̂ ∈ argmin ℓ(θ̂, θb ). Then for any b ∈ {0, 1}d ,
β X
d
≥ (1 − TV(PX|bi =0 , PX|bi =1 )),
4α
i=1
i i
i i
i i
where the last step is again by Theorem 7.7, just like in the proof of Theorem 31.1. Each total
variation can be upper bounded as follows:
!
( a) 1 X 1 X (b)
TV(PX|bi =0 , PX|bi =1 ) = TV d− 1
Pθb , d−1 Pθb ≤ max TV(Pθb , Pθb′ )
2 2 dH (b,b′ )=1
b:bi =1 b:bi =0
where (a) follows from the Bayes rule, and (b) follows from the convexity of total variation
(Theorem 7.5). This completes the proof.
Example 31.4. Let us continue the discussion of the d-dimensional GLM in Example 31.3. Con-
sider the quadratic loss first. To apply Theorem 31.2, consider the hypercube θb = ϵb, where
b ∈ {0, 1}d . Then kθb − θb′ k22 = ϵ2 dH (b, b′ ). Applying Theorem 31.2 yields
∗ ϵ2 d 1 ′ 1
R ≥ 1− max TV N ϵb, Id , N ϵb , Id
4 b,b′ ∈{0,1}d ,dH (b,b′ )=1 n n
2
ϵ d 1 1
= 1 − TV N 0, , N ϵ, ,
4 n n
where the last step applies (7.10) for f-divergence between product distributions that only differ
in one coordinate. Setting ϵ = √1n and by the scale-invariance of TV, we get the desired R∗ ≳ nd .
Next, let’s consider the loss function kθb − θb′ k∞ . In the same setup, we only kθb − θb′ k∞ ≥
′ ∗ √1 , which does not depend on d. In fact, R∗
d dH (b, b ). Then Assouad’s lemma yields R ≳
ϵ
q n
log d
n as shown in Corollary 28.8. In the next section, we will discuss Fano’s method which can
resolve this deficiency.
Here τ ′ is related to τ by τ log 2 = h(τ ′ ). Thus, using the same “hypercube embedding b → θb ”,
the bound similar to (31.5) will follow once we can bound I(bd ; X) away from d log 2.
Can we use the pairwise total variation bound in (31.5) to do that? Yes! Notice that thanks to
the independence of bi ’s we have1
1
Equivalently, this also follows from the convexity of the mutual information in the channel (cf. Theorem 5.3).
i i
i i
i i
520
where in the last step we used the fact that whenever B ∼ Ber(1/2),
I(B; X) ≤ TV(PX|B=0 , PX|B=1 ) log 2 , (31.8)
which follows from (7.36) by noting that the mutual information is expressed as the Jensen-
Shannon divergence as 2I(B; X) = JS(PX|B=0 , PX|B=1 ). Combining (31.6) and (31.7), the mutual
information method implies the following version of the Assouad’s lemma: Under the assumption
of Theorem 31.2,
βd −1 (1 − t) log 2
inf sup Eθ ℓ(θ, θ̂) ≥ ·f max TV(Pθ , Pθ′ ) , f(t) ≜ h (31.9)
θ̂ θ∈Θ 4α dH (θ,θ ′ )=1 2
where h−1 : [0, log 2] → [0, 1/2] is the inverse of the binary entropy function. Note that (31.9) is
slightly weaker than (31.5). Nevertheless, as seen in Example 31.4, Assouad’s lemma is typically
applied when the pairwise total variation is bounded away from one by a constant, in which case
(31.9) and (31.5) differ by only a constant factor.
In all, we may summarize Assouad’s lemma as a convenient method for bounding I(bd ; X) away
from the full entropy (d bits) on the basis of distances between PX|bd corresponding to adjacent
bd ’s.
Theorem 31.3. Let d be a metric on Θ. Fix an estimator θ̂. For any T ⊂ Θ and ϵ > 0,
h ϵi radKL (T) + log 2
P d(θ, θ̂) ≥ ≥1− , (31.10)
2 log M(T, d, ϵ)
where radKL (T) ≜ infQ supθ∈T D(Pθ kQ) is the KL radius of the set of distributions {Pθ : θ ∈ T}
(recall Corollary 5.8). Consequently,
ϵ r radKL (T) + log 2
inf sup Eθ [d(θ, θ̂) ] ≥ sup
r
1− , (31.11)
θ̂ θ∈Θ T⊂Θ,ϵ>0 2 log M(T, d, ϵ)
i i
i i
i i
I(θ; X) + log 2
P[θ 6= θ̃] ≥ 1 − .
log M
The proof of (31.10) is completed by noting that I(θ; X) ≤ radKL (T) since the latter equals the
maximal mutual information over the distribution of θ (Corollary 5.8).
As an application of Fano’s method, we revisit the d-dimensional GLM in Corollary 28.8 under
the ℓq loss (1 ≤ q ≤ ∞), with the particular focus on the dependency on the dimension. (For a
different application in sparse setting see Exercise VI.11.)
Example 31.5. Consider GLM with sample size n, where Pθ = N(θ, Id )⊗n . Taking natural logs
here and below, we have
n
D(Pθ kPθ′ ) = kθ − θ′ k22 ;
2
in other words, KL-neighborhoods are ℓ2 -balls. As such, let us apply Theorem 31.3 to T = B2 (ρ)
2
for some ρ > 0 to be specified. Then radKL (T) ≤ supθ∈T D(Pθ kP0 ) = nρ2 . To bound the packing
number from below, we applying the volume bound in Theorem 27.3,
d
ρd vol(B2 ) cq ρd1/q
M(B2 (ρ), k · kq , ϵ) ≥ d ≥ √
ϵ vol(Bq ) ϵ d
for some
p constant cq ,cqwhere the last step follows the volume formula (27.13) for ℓq -balls. Choosing
1/q−1/2
ρ = d/n and ϵ = e2 ρd , an application of Theorem 31.3 yields the minimax lower bound
d1/q
Rq ≡ inf sup Eθ [kθ̂ − θkq ] ≥ Cq √ (31.12)
θ̂ θ∈Rd n
for some constant Cq depending on q. This is the same lower bound as that in Example 30.1
obtained via the mutual information method plus the Shannon lower bound (which is also volume-
based).
For any q ≥ 1, (31.12) is rate-optimal since we can apply the MLE θ̂ = X̄. (Note that at q = ∞,
pq = ∞, (31.12)
the constant Cq is still finite since vol(B∞ ) = 2d .) However, for the special case of
does not depend on the dimension at all, as opposed to the correct dependency log d shown in
Corollary 28.8. In fact, this is the same suboptimal result we previously obtained from applying
Shannon lower bound in Example 30.1 or Assouad’s lemma in Example 31.4. So is it possible to
fix this looseness with Fano’s method? It turns out that the answer is yes and the suboptimality
is due to the volume bound on the metric entropy, which, as we have seen in Section 27.3, can
be ineffective if ϵ scales with dimension. Indeed, if we apply the tight bound of M(B2 , k · k∞ , ϵ)
i i
i i
i i
522
q q
in (27.18),2 with ϵ = c log d
and ρ = c′ logn d for some absolute constants c, c′ , we do get
q n
R∞ ≳ logn d as desired.
We end this section with some comments regarding the application Theorem 31.3:
• It is sometimes convenient to further bound the KL radius by the KL diameter, since radKL (T) ≤
diamKL (T) ≜ supθ,θ′ ∈T D(Pθ′ kPθ ) (cf. Corollary 5.8). This suffices for Example 31.5.
• In Theorem 31.3 we actually lower bound the global minimax risk by that restricted on a param-
eter subspace T ⊂ Θ for the purpose of controlling the mutual information, which is often
difficult to compute. For the GLM considered in Example 31.5, the KL divergence is propor-
tional to squared ℓ2 -distance and T is naturally chosen to be a Euclidean ball. For other models
such as the covariance model (Exercise VI.15) wherein the KL divergence is more complicated,
the KL neighborhood T needs to be chosen carefully. Later in Section 32.4 we will apply the
same Fano’s method to the infinite-dimensional problem of estimating smooth density.
2
In fact, in this case we can also choose the explicit packing {ϵe1 , . . . , ϵed }.
i i
i i
i i
So far our discussion on information-theoretic methods have been mostly focused on statistical
lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis.
In this chapter, we will discuss three information-theoretic upper bounds for statistical estimation.
These three results apply to different loss functions and are obtained using completely different
means; however, they take on exactly the same form involving the appropriate metric entropy of the
model. Specifically, suppose that we observe X1 , . . . , Xn drawn independently from a distribution
Pθ for some unknown parameter θ ∈ Θ, and the goal is to produce an estimate P̂ for the true
distribution Pθ . We have the following entropic minimax upper bounds:
Here N(P, ϵ) refers to the metric entropy (cf. Chapter 27) of the model class P = {Pθ : θ ∈ Θ}
under various distances, which we will formalize along the way.
523
i i
i i
i i
524
depending on Xn . The loss function is the KL divergence D(Pθ kP̂).1 The average risk is thus
Z
Eθ D(Pθ kP̂) = D Pθ kP̂(·|Xn ) P⊗n (dxn ).
If the family has a common dominating measure μ, the problem is equivalent to estimate the
density pθ = dP dμ , commonly referred to as the problem of density estimation in the statistics
θ
literature.
Our objective is to prove the upper bound (32.1) for the minimax KL risk
where the infimum is taken over all estimators P̂ = P̂(·|Xn ) which is a distribution on X ; in
other words, we allow improper estimates in the sense that P̂ can step outside the model class P .
Indeed, the construction we will use in this section (such as predictive density estimators (Bayes)
or their mixtures) need not be a member of P . Later we will see in Sections 32.2 and 32.3 that for
total variation and Hellinger loss we can always restrict to proper estimators;2 however these loss
functions are weaker than the KL divergence.
The main result of this section is the following.
where the supremum is over all distributions (priors) of θ taking values in Θ. Denote by
NKL (P, ϵ) ≜ min N : ∃Q1 , . . . , QN s.t. ∀θ ∈ Θ, ∃i ∈ [N], D(Pθ kQi ) ≤ ϵ2 . (32.6)
Note that the capacity Cn is precisely the redundancy (13.10) which governs the minimax regret
in universal compression; the fact that it bounds the KL risk can be attributed to a generic relation
1
Note the asymmetry in this loss function. Alternatively the loss D(P̂kP) is typically infinite in nonparametric settings,
because it is impossible to estimate the support of the true density exactly.
2
This is in fact a generic observation: Whenever the loss function satisfies an approximate triangle inequality, any
improper estimate can be converted to a proper one by its project on the model class whose risk is inflated by no more
than a constant factor.
i i
i i
i i
between individual and cumulative risks which we explain later in Section 32.1.4. As explained in
Chapter 13, it is in general difficult to compute the exact value of Cn even for models as simple as
Bernoulli (Pθ = Ber(θ)). This is where (32.8) comes in: one can use metric entropy and tools from
Chapter 27 to bound this capacity, leading to useful (and even optimal) risk bounds. We discuss
two types of applications of this result.
Infinite-dimensional models Similar to the results in Section 27.4, for nonparametric models
NKL (ϵ) typically grows super-polynomially in 1ϵ and, in turn, the capacity Cn grows super-
logarithmically. In fact, whenever we have Cn nα for some α > 0, Theorem 32.1 yields the
sharp minimax rate
Cn
R∗KL (n) nα−1 (32.11)
n
which easily follows from combining (32.7) and (32.8) – see (32.23) for details.
As a concrete example, consider the class P of Lipschitz densities on [0, 1] that are bounded
away from zero. Using the L2 -metric entropy previously established in Theorem 27.12, we will
show in Section 32.4 that NKL (ϵ) ϵ−1 and thus Cn ≤ infϵ>0 (nϵ2 + ϵ−1 ) n1/3 and, in turn,
R∗KL (n) ≲ n−2/3 . This rate turns out to be optimal: In Section 32.1.3 we will develop capacity lower
bound based on metric entropy that shows Cn n1/3 and hence, in view of (32.11), R∗KL (n)
n−2/3 .
Next, we explain the intuition behind and the proof of Theorem 32.1.
i i
i i
i i
526
i.i.d.
where θ ∼ π and (X1 , . . . , Xn+1 ) ∼ Pθ conditioned on θ. The Bayes estimator achieving this infi-
mum is given by P̂Bayes (·|xn ) = PXn+1 |Xn =xn . If each Pθ has a density pθ with respect to some
common dominating measure μ, the Bayes estimator has density:
R Qn+1
π (dθ) i=1 pθ (xi )
p̂Bayes (xn+1 |x ) = R
n
Qn . (32.12)
π (dθ) i=1 pθ (xi )
(a)
= EXn D(PXn+1 |θ kPXn+1 |Xn |Pθ|Xn )
= D(PXn+1 |θ kPXn+1 |Xn |Pθ,Xn )
(b)
= I(θ; Xn+1 |Xn ).
where (a) follows from the variational representation of mutual information (Theorem 4.1 and
Corollary 4.2); (b) invokes the definition of the conditional mutual information (Section 3.4) and
the fact that Xn → θ → Xn+1 forms a Markov chain, so that PXn+1 |θ,Xn = PXn+1 |θ . In addition, the
Bayes optimal estimator is given by PXn+1 |Xn .
Note that the operational meaning of I(θ; Xn+1 |Xn ) is the information provided by one extra
observation about θ having already obtained n observations. In most situations, since Xn will have
3
Throughout this chapter, we continue to use the conventional notation Pθ for a parametric family of distributions and use
π to stand for the distribution of θ.
i i
i i
i i
Lemma 32.3 (Diminishing marginal utility in information). n 7→ I(θ; Xn+1 |Xn ) is a decreasing
sequence. Furthermore,
1
I(θ; Xn+1 |Xn ) ≤ I(θ; Xn+1 ). (32.13)
n
Proof. In view of the chain rule for mutual information (Theorem 3.7): I(θ; Xn+1 ) =
Pn+1 i−1
i=1 I(θ; Xi |X ), (32.13) follows from the monotonicity. To show the latter, let us consider
a “sampling channel” where the input is θ and the output is X sampled from Pθ . Let I(π )
denote the mutual information when the input distribution is π, which is a concave function in
π (Theorem 5.3). Then
where the inequality follows from Jensen’s inequality, since Pθ|Xn−1 is a mixture of Pθ|Xn .
Lemma 32.3 allows us to prove the converse bound (32.9): Fix any prior π. Since the minimax
risk dominates any Bayes risk (Theorem 28.1), in view of Lemma 32.2, we have
X
n X
n
R∗KL (t) ≥ I(θ; Xt+1 |Xt ) = I(θ; Xn+1 ).
t=0 t=0
Recall from (32.5) that Cn+1 = supπ ∈∆(Θ) I(θ; Xn+1 ). Optimizing over the prior π yields (32.9).
Now suppose that the minimax theorem holds for (32.4), so that R∗KL = supπ ∈∆(Θ) R∗KL,Bayes (π ).
Then Lemma 32.2 allows us to express the minimax risk as the conditional mutual information
maximized over the prior π:
1 X
n+1
P̂(·|Xn ) ≜ QXi |Xi−1 . (32.14)
n+1
i=1
i i
i i
i i
528
Let us bound the worst-case KL risk of this estimator. Fix θ ∈ Θ and let Xn+1 be drawn
⊗(n+1)
independently from Pθ so that PXn+1 = Pθ . Taking expectations with this law, we have
"
!#
1 X n+1
Eθ [D(Pθ kP̂(·|X ))] = E D Pθ
n
QXi |Xi−1
n + 1
i=1
(a) 1 X
n+1
≤ D(Pθ kQXi |Xi−1 |PXi−1 )
n+1
i=1
(b) 1 ⊗(n+1)
= D(Pθ kQXn+1 ),
n+1
where (a) and (b) follows from the convexity (Theorem 5.1) and the chain rule for KL divergence
(Theorem 2.14(c)). Taking the supremum over θ ∈ Θ bounds the worst-case risk as
1 ⊗(n+1)
R∗KL (n) ≤ sup D(Pθ kQXn+1 ).
n + 1 θ∈Θ
Optimizing over the choice of QXn+1 , we obtain
1 ⊗(n+1) Cn+1
R∗KL (n) ≤ inf sup D(Pθ kQXn+1 ) = ,
n + 1 QXn+1 θ∈Θ n+1
where the last identity applies Theorem 5.9 of Kemperman, completing the proof of (32.7).
Furthermore, Theorem 5.9 asserts that the optimal QXn+1 exists and given uniquely by the capacity-
achieving output distribution P∗Xn+1 . Thus the above minimax upper bound can be attained by
taking the Cesàro average of P∗X1 , P∗X2 |X1 , . . . , P∗Xn+1 |Xn , namely,
1 X ∗
n+1
P̂∗ (·|Xn ) = PXi |Xi−1 . (32.15)
n+1
i=1
Note that in general this is an improper estimate as it steps outside the class P .
In the special case where the capacity-achieving input distribution π ∗ exists, the capacity-
achieving output distribution can be expressed as a mixture over product distributions as P∗Xn+1 =
R ∗ ⊗(n+1)
π (dθ)Pθ . Thus the estimator P̂∗ (·|Xn ) is in fact the average of Bayes estimators (32.12)
under prior π ∗ for sample sizes ranging from 0 to n.
Finally, as will be made clear in the next section, in order to achieve the further upper bound
(32.8) in terms of the KL covering numbers, namely R∗KL (n) ≤ ϵ2 + n+1 1 log NKL (P, ϵ), it suffices to
choose the following QXn+1 as opposed to the exact capacity-achieving output distribution: Pick an
ϵ-KL cover Q1 , . . . , QN for P of size N = NKL (P, ϵ) and choose π to be the uniform distribution
PN ⊗(n+1)
and define QXn+1 = N1 j=1 Qj – this was the original construction in [341]. In this case,
applying the Bayes rule (32.12), we see that the estimator is in fact a convex combination P̂(·|Xn ) =
PN
j=1 wj Qj of the centers Q1 , . . . , QN , with data-driven weights given by
Qi−1
1 X
n+1
t=1 Qj (Xt )
wj = PN Qi−1 .
n+1 Qj ( X t )
i=1 j=1 t=1
i i
i i
i i
Again, except for the extraordinary case where P is convex and the centers Qj belong to P , the
estimate P̂(·|Xn ) is improper.
Proof. Fix ϵ and let N = NKL (Q, ϵ). Then there exist Q1 , . . . , QN that form an ϵ-KL cover, such
that for any a ∈ A there exists i(a) ∈ [N] such that D(PB|A=a kQi(a) ) ≤ ϵ2 . Fix any PA . Then
where the last inequality follows from that i(A) takes at most N values and, by applying
Theorem 4.1,
I(A; B|i(A)) ≤ D PB|A kQi(A) |Pi(A) ≤ ϵ2 .
For the lower bound, note that if C = ∞, then in view of the upper bound above, NKL (Q, ϵ) = ∞
for any ϵ and (32.16) holds with equality. If C < ∞, Theorem 5.9 shows that C is the KL radius of
Q, namely, there exists P∗B , such that C = supPA ∈∆(A) D(PB|A kP∗B |PA ) = supx∈A D(PB|A kP∗B |PA ).
√
In other words, NKL (Q, C + δ) = 1 for any δ > 0. Sending δ → 0 proves the equality of
(32.16).
Next we specialize Theorem 32.4 to our statistical setting (32.5) where the input A is θ and the
output B is Xn ∼ Pθ . Recall that P = {Pθ : θ ∈ Θ}. Let Pn ≜ {P⊗
i.i.d.
θ : θ ∈ Θ}. By tensorization of
n
⊗n ⊗n
KL divergence (Theorem 2.14(d)), D(Pθ kPθ′ ) = nD(Pθ kPθ′ ). Thus
ϵ
NKL (Pn , ϵ) ≤ NKL P, √ .
n
Combining this with Theorem 32.4, we obtain the following upper bound on the capacity Cn in
terms of the KL metric entropy of the (single-letter) family P :
Cn ≤ inf nϵ2 + log NKL (P, ϵ) . (32.17)
ϵ>0
i i
i i
i i
530
Theorem 32.5. Let P = {Pθ : θ ∈ Θ} and MH (ϵ) ≡ M(P, H, ϵ) the Hellinger packing number
of the set P , cf. (27.2). Then Cn defined in (32.5) satisfies
log e 2
Cn ≥ min nϵ , log MH (ϵ) − log 2 (32.18)
2
Proof. The idea of the proof is simple. Given a packing θ1 , . . . , θM ∈ Θ with pairwise distances
2
H2 (Qi , Qj ) ≥ ϵ2 for i 6= j, where Qi ≡ Pθi , we know that one can test Q⊗ n ⊗n
i vs Qj with error e
− nϵ2
,
nϵ 2
cf. Theorem 7.8 and Theorem 32.7. Then by the union bound, if Me− 2 < 12 , we can distinguish
these M hypotheses with error < 12 . Let θ ∼ Unif(θ1 , . . . , θM ). Then from Fano’s inequality we
get I(θ; Xn ) ≳ log M.
To get sharper constants, though, we will proceed via the inequality shown in Ex. I.47. In the
notation of that exercise we take λ = 1/2 and from Definition 7.22 we get that
1
D1/2 (Qi , Qj ) = −2 log(1 − H2 (Qi , Qj )) ≥ H2 (Qi , Qj ) log e ≥ ϵ2 log e i 6= j .
2
X
M
( a) 1M − 1 − nϵ22 1
≥− log e +
M M M
i=1
XM
1 nϵ 2 1 nϵ 2 1
≥− log e− 2 + = − log e− 2 + ,
M M M
i=1
where in (a) we used the fact that pairwise distances are all ≥ nϵ2 except when i = j. Finally, since
A + B ≤ min(A,B) we conclude the result.
1 1 2
We note that since D ≳ H2 (cf. (7.30)), a different (weaker) lower bound on the KL risk also
follows from Section 32.2.4 below.
i i
i i
i i
• Instead of directly studying the risk R∗KL (n), (32.7) relates it to a cumulative risk Cn
• The cumulative risk turns out to be equal to a capacity, which can be conveniently bounded in
terms of covering numbers.
In this subsection we want to point out that while the second step is very special to KL (log-loss),
the first idea is generic. Namely, we have the following result.
Proposition 32.6. Fix a loss function ℓ : P(X ) × P(X ) → R̄ and a class Π of distributions on
X . Define cumulative and one-step minimax risks as follows:
" n #
X
Cn = inf sup E ℓ(P, P̂t (Xt−1 )) (32.19)
{P̂t (·)} P∈Π
t=1
h i
R∗n = inf sup E ℓ(P, P̂(Xn )) (32.20)
P̂(·) P∈Π
where both infima are over measurable (possibly randomized) estimators P̂t : X t−1 → P(X ), and
i.i.d.
the expectations are over Xi ∼ P and the randomness of the estimators. Then we have
X
n−1
nR∗n−1 ≤ Cn ≤ R∗t . (32.21)
t=0
Pn−1
Thus, if the sequence {R∗n } satisfies R∗n 1n t=0 R∗t then Cn nR∗n . Conversely, if nα− ≲ Cn ≲
nα+ for all n and some α+ ≥ α− > 0, then
α
(α− −1) α+
n − ≲ R∗n ≲ nα+ −1 . (32.22)
Remark 32.1. The meaning of the above is that R∗n ≈ 1n Cn within either constant or polylogarith-
mic factors, for most cases of interest.
Proof. To show the first inequality in (32.21), given predictors {P̂t (Xt−1 ) : t ∈ [n]} for Cn ,
consider a randomized predictor P̂(Xn−1 ) for R∗n−1 that equals each of the P̂t (Xt−1 ) with equal
P
probability. The second inequality follows from interchanging supP and t .
To derive (32.22) notice that the upper bound on R∗n follows from (32.21). For the lower bound,
notice that the sequence R∗n is monotone and hence we have for any n < m
X
m−1 X
n−1
Ct
Cm ≤ R∗t ≤ + (m − n)R∗n . (32.23)
t
t=0 t=0
α+
i i
i i
i i
532
For Hellinger loss, the answer is yes, although the metric entropy involved is with respect to
the Hellinger distance not KL divergence. The basic construction is due to Le Cam and further
developed by Birgé. The main idea is as follows: Fix an ϵ-covering {P1 , . . . , PN } of the set of
distributions P . Given n samples drawn from P ∈ P , let us test which ball P belongs to; this
allows us to estimate P up to Hellinger loss ϵ. This can be realized by a pairwise comparison
argument of testing the (composite) hypothesis P ∈ B(Pi , ϵ) versus P ∈ B(Pj , ϵ). This program
can be further refined to involve on the local entropy of the model.
The optimal test that achieves (32.24) is the likelihood ratio given by the worst-case mixtures, that
is, the closest4 pair of mixture (P∗n , Q∗n ) such that TV(P∗n , Q∗n ) = TV(co(P ⊗n ), co(Q⊗n )).
The exact result (32.24) is unwieldy as the RHS involves finding the least favorable priors over
the n-fold product space. However, there are several known examples where much simpler and
4
In case the closest pair does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
explicit results are available. In the case when P and Q are TV-balls around P0 and Q0 , Huber [161]
showed that the minimax optimal test has the form
( n )
X dP0
n ′ ′′
ϕ(x ) = 1 min(c , max(c , log (Xi ))) > t .
dQ0
i=1
(See also Ex. III.20.) However, there are few other examples where minimax optimal tests are
known explicitly. Fortunately, as was shown by Le Cam, there is a general “single-letter” upper
bound in terms of the Hellinger separation between P and Q. It is the consequence of the more
general tensorization property of Rényi divergence in Proposition 7.23 (of which Hellinger is a
special case).
Theorem 32.7.
min sup P(ϕ = 1) + sup Q(ϕ = 0) ≤ e− 2 infP∈P,Q∈Q H (P,Q) ,
n 2
(32.25)
ϕ P∈P Q∈Q
Remark 32.2. For the case when P and Q are Hellinger balls of radius r around P0 and Q0 , respec-
tively, Birgé [35] constructed
nP an explicit test. Namely,
o under the assumption H(P0 , Q0 ) q > 2.01r,
n n α+βψ(Xi ) −nΩ(r2 ) dP0
there is a test ϕ(x ) = 1 i=1 log β+αψ(Xi ) > t attaining error e , where ψ(x) = dQ 0
( x)
and α, β > 0 depend only on H(P0 , Q0 ).
In the sequel we will apply Theorem 32.7 to two disjoint Hellinger balls (both are convex).
i i
i i
i i
534
Theorem 32.8 (Le Cam-Birgé). Denote by NH (P, ϵ) the ϵ-covering number of the set P under
the Hellinger distance (cf. (27.1)). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 1,
and, consequently,
Proof of Theorem 32.8. It suffices to prove the high-probability bound (32.27). Abbreviate ϵ =
ϵn and N = NH (P, ϵn ). Let P1 , · · · , PN be a maximal ϵ-packing of P under the Hellinger distance,
which also serves as an ϵ-covering (cf. Theorem 27.2). Thus, ∀i 6= j,
H(Pi , Pj ) ≥ ϵ,
H(P, Pi ) ≤ ϵ,
Next, consider the following pairwise comparison problem, where we test two Hellinger balls
(composite hypothesis) against each other:
Hi : P ∈ B(Pi , ϵ)
Hj : P ∈ B(Pj , ϵ)
for all i 6= j, s.t. H(Pi , Pj ) ≥ δ = 4ϵ.
Since both B(Pi , ϵ) and B(Pj , ϵ) are convex, applying Theorem 32.7 yields a test ψij =
ψij (X1 , . . . , Xn ), with ψij = 0 corresponding to declaring P ∈ B(Pi , ϵ), and ψij = 1 corresponding
to declaring P ∈ B(Pj , ϵ), such that ψij = 1 − ψji and the following large deviation bound holds:
for all i, j, s.t. H(Pi , Pj ) ≥ δ ,
5
Note that this is not entirely obvious because P 7→ H(P, Q) is not convex (for example, consider
p 7→ H(Ber(p), Ber(0.1)).
i i
i i
i i
where we used the triangle inequality of Hellinger distance: for any P ∈ B(Pi , ϵ) and any Q ∈
B(Pj , ϵ),
Now for the proof of correctness, assume that P ∈ B(P1 , ϵ). The intuition is that, we should
expect, typically, that T1 = 0, and furthermore, Tj ≥ δ 2 for all j such that H(P1 , Pj ) ≥ δ . Note
that by the definition of Ti and the symmetry of the Hellinger distance, for any pair i, j such that
H(Pi , Pj ) ≥ δ , we have
max{Ti , Tj } ≥ H(Pi , Pj ).
Consequently,
where the last equality follows from the definition of i∗ as a global minimizer in (32.30). Thus, for
any t ≥ 1,
i i
i i
i i
536
As usual, such a log factor can be removed using the local entropy argument. To this end, define
the local Hellinger entropy:
Theorem 32.9 (Le Cam-Birgé: local entropy version). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 2,
and hence
Remark 32.3 (Doubling dimension). Suppose that for some d > 0, log Nloc (P, ϵ) ≤ d log 1ϵ holds
for all sufficiently large small ϵ; this is the case for finite-dimensional models where the Hellinger
distance is comparable with the vector norm by the usual volume argument (Theorem 27.3). Then
we say the doubling dimension (also known as the Le Cam dimension [318]) of P is at most d; this
terminology comes from the fact that the local entropy concerns covering Hellinger balls using
balls of half the radius. Then Theorem 32.9 shows that it is possible to achieve the “parametric
rate” O( dn ). In this sense, the doubling dimension serves as the effective dimension of the model
P.
Proof. We proceed by induction on k. The base case of k = 0 follows from the definition (32.33).
For k ≥ 1, assume that (32.37) holds for k − 1 for all P ∈ P . To prove it for k, we construct a cover
of B(P, 2k η) ∩ P as follows: first cover it with 2k−1 η -balls, then cover each ball with η/2-balls. By
the induction hypothesis, the total number of balls is at most
NH (B(P, 2k η) ∩ P, 2k−1 η) · sup NH (B(P′ , 2k−1 η) ∩ P, η/2) ≤ Nloc (ϵ) · Nloc (ϵ)k−1
P′ ∈P
Proof. We analyze the same estimator (32.30) following the proof of Theorem 32.8, except
that the estimate (32.31) is improved as follows: Define the Hellinger shell Ak ≜ {P : 2k δ ≤
i i
i i
i i
H(P1 , P) < 2k+1 δ} and Gk ≜ {P1 , . . . , PN } ∩ Ak . Recall that δ = 4ϵ. Given t ≥ 2, let ℓ = blog2 tc
so that 2ℓ ≤ t < 2ℓ+1 . Then
X
P[T1 ≥ tδ] ≤ P[2k δ ≤ T1 < 2k+1 δ]
k≥ℓ
( a) X
|Gk |e− 8 (2 δ)
n k 2
≤
k≥ℓ
(b) X
Nloc (ϵ)k+3 e−2nϵ 4
2 k
≤
k≥ℓ
( c)
≲ e− 4 ≤ e− t
ℓ 2
where (a) follows from from (32.29); (c) follows from the assumption that log Nloc ≤ nϵ2 and
k ≥ ℓ ≥ log2 t ≥ 1; (b) follows from the following reasoning: since {P1 , . . . , PN } is an ϵ-packing,
we have
where the first and the last inequalities follow from Theorem 27.2 and Lemma 32.10 respectively.
As an application of Theorem 32.9, we show that parametric rate (namely, dimension divided
by the sample size) is achievable for models with locally quadratic behavior, such as those smooth
parametric models (cf. Section 7.11 and in particular Theorem 7.21).
Proof. It suffices to bound the local entropy Nloc (P, ϵ) in (32.33). Fix θ0 ∈ Θ. Indeed, for any
η > t0 , we have NH (B(Pθ0 , η) ∩ P, η/2) ≤ NH (P, t0 ) ≲ 1. For ϵ ≤ η ≤ t0 ,
( a)
NH (B(Pθ0 , η) ∩ P, η/2) ≤ N∥·∥ (B∥·∥ (θ0 , η/c), η/(2C))
d
(b) vol(B∥·∥ (θ0 , η/c + η/(2C))) 2C
≤ = 1+
vol(B∥·∥ (θ0 , η/(2C))) c
where (a) and (b) follow from (32.38) and Theorem 27.3 respectively. This shows that
log Nloc (P, ϵ) ≲ d, completing the proof by applying Theorem 32.9.
i i
i i
i i
538
Theorem 32.12. Suppose that the family P has a finite Dλ radius for some λ > 1, i.e.
where Dλ is the Rényi divergence of order λ (see Definition 7.22). There exists constants c = c(λ)
and ϵ < ϵ0 (λ) such that whenever n and ϵ < ϵ0 are such that
1
c(λ)nϵ2 log 2 + Rλ (P) + 2 log 2 < log Mloc (ϵ), (32.40)
ϵ
ϵ2
sup EP [H2 (P, P̂)] ≥ ,
P∈P 32
i.i.d.
where EP is taken with respect to Xn ∼ P.
Remark 32.4. When log Mloc (ϵ) ϵ−p , a minimax lower bound for the squared Hellinger risk
on the order of (n log n)− p+2 follows. Consider the special case of P being the class of β -smooth
2
densities on the unit cube [0, 1]d as defined in Theorem 27.13. The χ2 -radius of this class is finite
since each density therein is bounded from above and the uniform distribution works as a center for
2β
(32.39). In this case we have p = βd and hence the lower bound Ω((n log n)− d+2β ). Here, however,
we can argue differently by considering the subcollection P ′ = 12 Unif([0, 1]d ) + 12 P , which has
(up to a constant factor) the same minimax risk, but has the advantage that D(PkP′ ) H2 (P, P′ )
for all P, P′ ∈ P ′ (see Section 32.4). Repeating the argument in the proof below, then, yields the
2β
optimal lower bound Ω(n− d+2β ) removing the unnecessary logarithmic factors.
Proof. Let M = Mloc (P, ϵ). From the definition there exists an ϵ/2-packing P1 , . . . , PM in some
Hellinger ball B(R, ϵ).
i.i.d.
Let θ ∼ Unif([M]) and Xn ∼ Pθ conditioned on θ. Then from Fano’s inequality in the form
of Theorem 31.3 we get
i i
i i
i i
ϵ 2 I(θ; Xn ) + log 2
sup E[H (P, P̂)] ≥
2
1−
P∈P 4 log M
It remains to show that
I(θ; Xn ) + log 2 1
≤ . (32.41)
log M 2
To that end for an arbitrary distribution U define
Q = ϵ2 U + ( 1 − ϵ2 )R .
We first notice that from Ex. I.48 we have that for all i ∈ [M]
λ 1
D(Pi kQ) ≤ 8(H (Pi , R) + 2ϵ )
2 2
log 2 + Dλ (Pi kU)
λ−1 ϵ
provided that ϵ < 2− 2(λ−1) ≜ ϵ0 . Since H2 (Pi , R) ≤ ϵ2 , by optimizing U (as the Dλ -center of P )
5λ
we obtain
λ 1 c(λ) 2 1
inf max D(Pi kQ) ≤ 24ϵ 2
log 2 + Rλ ≤ ϵ log 2 + Rλ .
U i∈[M] λ−1 ϵ 2 ϵ
By Theorem 4.1 we have
nc(λ) 2 1
I(θ; Xn ) ≤ max D(P⊗ ⊗n
i kQ ) ≤
n
ϵ log 2 + Rλ .
i∈[M] 2 ϵ
This final bound and condition (32.40) then imply (32.41) and the statement of the theorem.
Finally, we mention that for sufficiently regular models wherein the KL divergence and the
squared Hellinger distances are comparable, the upper bound in Theorem 32.9 based on local
entropy gives the exact minimax rate. Models of this type include GLM and more generally
Gaussian local mixtures with bounded centers in arbitrary dimensions.
Then
i i
i i
i i
540
Theorem 32.14 (Yatracos [342]). There exists a universal constant C such that the following
i.i.d.
holds. Let X1 , . . . , Xn ∼ P ∈ P , where P is a collection of distributions on a common measurable
space (X , E). For any ϵ > 0, there exists a proper estimator P̂ = P̂(X1 , . . . , Xn ) ∈ P , such that
1
sup EP [TV(P̂, P) ] ≤ C ϵ + log N(P, TV, ϵ)
2 2
(32.42)
P∈P n
For loss function that is a distance, a natural idea for obtaining proper estimator is the minimum
distance estimator. In the current context, we compute the minimum-distance projection of the
empirical distribution on the model class P :6
Pmin-dist = argmin TV(P̂n , P)
P∈P
1
Pn
where P̂n = n i=1 δXi is the empirical distribution. However, since the empirical distribution is
discrete, this strategy does not make sense if elements of P have densities. The reason for this
degeneracy is because the total variation distance is too strong. The key idea is to replace TV,
which compares two distributions over all measurable sets, by a proxy, which only inspects a
“low-complexity” family of sets.
To this end, let A ⊂ E be a finite collection of measurable sets to be specified later. Define a
pseudo-distance
dist(P, Q) ≜ sup |P(A) − Q(A)|. (32.43)
A∈A
(Note that if A = E , then this is just TV.) One can verify that dist satisfies the triangle inequality.
As a result, the estimator
P̃ ≜ argmin dist(P, P̂n ), (32.44)
P∈P
as a minimizer, satisfies
dist(P̃, P) ≤ dist(P̃, P̂n ) + dist(P, P̂n ) ≤ 2dist(P, P̂n ). (32.45)
In addition, applying the binomial tail bound and the union bound, we have
C0 log |A|
E[dist(P, P̂n )2 ] ≤ . (32.46)
n
for some absolute constant C0 .
6
Here and below, if the minimizer does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
The main idea of Yatracos [342] boils down to the following choice of A: Consider an
ϵ-covering {Q1 , . . . , QN } of P in TV. Define the set
dQi dQj
Aij ≜ x : ( x) ≥ ( x)
d( Qi + Qj ) d(Qi + Qj )
and the collection (known as the Yatracos class)
A ≜ {Aij : i 6= j ∈ [N]}. (32.47)
Then the corresponding dist approximates the TV on P , in the sense that
dist(P, Q) ≤ TV(P, Q) ≤ dist(P, Q) + 4ϵ, ∀P, Q ∈ P. (32.48)
To see this, we only need to justify the upper bound. For any P, Q ∈ P , there exists i, j ∈ [N], such
that TV(P, Pi ) ≤ ϵ and TV(Q, Qj ) ≤ ϵ. By the key observation that dist(Qi , Qj ) = TV(Qi , Qj ), we
have
TV(P, Q) ≤ TV(P, Qi ) + TV(Qi , Qj ) + TV(Qj , Q)
≤ 2ϵ + dist(Qi , Qj )
≤ 2ϵ + dist(Qi , P) + dist(P, Q) + dist(Q, Qj )
≤ 4ϵ + dist(P, Q).
Finally, we analyze the estimator (32.44) with A given in (32.47). Applying (32.48) and (32.45)
yields
TV(P̃, P) ≤ dist(P, P̃) + 4ϵ
≤ 2dist(P, P̂n ) + 4ϵ.
Squaring both sizes, taking expectation and applying (32.46), we have
8C0 log |N|
E[TV(P̃, P)2 ] ≤ 32ϵ2 + 8E[dist(P, P̂n )2 ] ≤ 32ϵ2 + .
n
Choosing the optimal TV-covering completes the proof of (32.42).
Remark 32.5 (Robust version). Note that Yatracos’ scheme idea works even if the data generating
distribution P 6∈ P but close to P . Indeed, denote Q∗ = argminQ∈{Qi } TV(P, Q) and notice that
i i
i i
i i
542
i.i.d.
Theorem 32.15. Given X1 , · · · , Xn ∼ f ∈ F , the minimax quadratic risk over F satisfies
R∗L2 (n; F) ≜ inf sup E kf − f̂k22 n− 3 .
2
(32.49)
f̂ f∈F
Capitalizing on the metric entropy of smooth densities studied in Section 27.4, we will prove
this result by applying the entropic upper bound in Theorem 32.1 and the minimax lower bound
based on Fano’s inequality in Theorem 31.3. However, Theorem 32.15 pertains to the L2 rather
than KL risk. This can be fixed by a simple reduction.
Lemma 32.16. Let F ′ denote the collection of f ∈ F which is bounded from below by 1/2. Then
R∗L2 (n; F ′ ) ≤ R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
Proof. The left inequality follows because F ′ ⊂ F . For the right inequality, we apply a sim-
i.i.d.
ulation argument. Fix some f ∈ F and we observe X1 , . . . , Xn ∼ f. Let us sample U1 , . . . , Un
independently and uniformly from [0, 1]. Define
(
Ui w.p. 12 ,
Zi =
Xi w.p. 12 .
i.i.d.
Then Z1 , . . . , Zn ∼ g = 12 (1 + f) ∈ F ′ . Let ĝ be an estimator that achieves the minimax risk
R∗L2 (n; F ′ ) on F ′ . Consider the estimator f̂ = 2ĝ − 1. Then kf − f̂k22 = 4kg − ĝk22 . Taking the
supremum over f ∈ F proves R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
Lemma 32.16 allows us to focus on the subcollection F ′ , where each density is lower bounded
by 1/2. In addition, each 1-Lipschitz density is also upper bounded by an absolute constant.
Therefore, the KL divergence and squared L2 distance are in fact equivalent on F ′ , i.e.,
D(fkg) kf − gk22 , f, g ∈ F ′ , (32.50)
as shown by the following lemma:
dQ
Lemma 32.17. Suppose both f = dP dμ and g = dμ are upper and lower bounded by absolute
constants c and C respectively. Then
Z Z
1 1
dμ(f − g)2 ≤ 2H2 (fkg) ≤ D(PkQ) ≤ χ2 (PkQ) ≤ dμ(f − g)2 .
C c
i i
i i
i i
R R
Proof. For the upper bound, applying (7.31), D(PkQ) ≤ χ2 (PkQ) = dμ (f−gg) ≤ 1c dμ (f−gg) .
2 2
R R
For the lower bound, applying (7.30), D(PkQ) ≥ 2H2 (fkg) = 2 dμ √(f−g√) 2 ≥ C1 dμ(f −
2
( f+ g)
g) 2 .
We now prove Theorem 32.15:
Proof. In view of Lemma 32.16, it suffices to consider R∗L2 (n; F ′ ). For the upper bound, we have
( a)
R∗L2 (n; F ′ ) R∗KL (n; F ′ )
(b)
1 ′
≲ inf ϵ + log NKL (F , ϵ)
2
ϵ>0 n
( c) 1 ′
inf ϵ + log N(F , k · k2 , ϵ)
2
ϵ>0 n
(d) 1
inf ϵ + 2
n−2/3 .
ϵ>0 nϵ
where both (a) and (c) apply (32.50), so that both the risk and the metric entropy are equivalent
for KL and L2 distance; (b) follows from Theorem 32.1; (d) applies the metric entropy (under L2 )
of the Lipschitz class from Theorem 27.12 and the fact that the metric entropy of the subclass F ′
is at most that of the full class F .
For the lower bound, we apply Fano’s inequality. Applying Theorem 27.12 and the relation
between covering and packing numbers in Theorem 27.2, we have log N(F, k·k2 , ϵ) log M(F, k·
k2 , ϵ) 1ϵ . Fix ϵ to be specified and let f1 , . . . , fM be an ϵ-packing in F , where M ≥ exp(C/ϵ). Then
g1 , . . . , gM is an 2ϵ -packing in F ′ , with gi = (fi +1)/2. Applying Fano’s inequality in Theorem 31.3,
we have
∗ Cn
RL2 (n; F) ≳ ϵ 1 −
2
.
log M
Using (32.17), we have Cn ≤ infϵ>0 (nϵ2 + ϵ−1 ) n1/3 . Thus choosing ϵ = cn−1/3 for sufficiently
small c ensures Cn ≤ 12 log M and hence R∗L2 (n; F) ≳ ϵ2 n−2/3 .
Remark 32.6. Note that the above proof of Theorem 32.15 relies on the entropic risk bound (32.1),
which, though rate-optimal, is not attained by a computationally efficient estimator. (The same
criticism also applies to (32.2) and (32.3) for Hellinger and total variation.) To remedy this, for
the squared loss, a classical idea is to apply the kernel density estimator (KDE) – cf. Section 7.9.
Pn
Specifically, one compute the convolution of the empirical distribution P̂n = 1n i=1 δXi with a
kernel function K(·) whose shape and bandwidth are chosen according to the smooth constraint.
For Lipschitz density, the optimal rate in Theorem 32.15 can be attained by a box kernel K(·) =
1 −1/3
2h 1{|·|≤h} with bandwidth h = n (cf. e.g. [313, Sec. 1.2]).
i i
i i
i i
In this chapter we explore statistical implications of the following effect. For any Markov chain
U→X→Y→V (33.1)
However, something stronger can often be said. Namely, if the Markov chain (33.1) factor through
a known noisy channel PY|X : X → Y , then oftentimes we can prove strong data processing
inequalities (SDPI):
where coefficients η = η(PY|X ), η (p) (PY|X ) < 1 only depend on the channel and not the (generally
unknown or very complex) PU,X or PY,V . The coefficients η and η (p) approach 0 for channels that
are very noisy (for example, η is always up to a constant factor equal to the Hellinger-squared
diameter of the channel).
The purpose of this chapter is twofold. First, we want to introduce general properties of the
SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility)
bounds on statistical estimation questions. The flavor of the statistical problems in this chapter is
different from the rest of the book in that here the information about unknown parameter θ is thinly
distributed across a high dimensional vector (as in spiked Wigner and tree-coloring examples), or
across different terminals (as in correlation and mean estimation examples).
We point out that SDPIs are an area of current research and multiple topics are not covered by
our brief exposition here. For more, we recommend surveys [250] and [257], of which the latter
explores the functional-theoretic side of SDPIs and their close relation to logarithmic Sobolev
inequalities – a topic we omitted entirely.
544
i i
i i
i i
a a
OR a∨b AND a∧b a NOT a′
b b
Now suppose there are additive noise components on the output of each primitive gate. In this
case, we have a network of the following noisy gates.
Z Z Z
a a
OR ⊕ Y AND ⊕ Y a NOT ⊕ Y
b b
Here, Z ∼ Bern(δ ) and assumed to be independent of the inputs. In other words, with proba-
bility δ , the output of a gate will be flipped no matter what input is given to that gate. Hence, we
sometimes refer to these gates as δ -noisy gates.
In 1950s John von Neumann was laying the groundwork for the digital computers, and he was
bothered by the following question. Can we compute any boolean function f with δ -noisy gates?
Note that any circuit that consists of noisy gates necessarily has noisy (non-deterministic) output.
Therefore, when we say that a noisy gate circuit C computes f we require the existence of some
ϵ0 = ϵ0 (δ) (that cannot depend on f) such that
1
P[C(x1 , . . . , xn ) 6= f(x1 , . . . , xn ) ≤ − ϵ0 (33.2)
2
where C(x1 , . . . , xn ) is the output of the noisy circuit inputs x1 , . . . , xn . If we build the circuit accord-
ing to the classical (Shannon) methods, we would obviously have catastrophic error accumulation
so that deep circuits necessarily have ϵ0 → 0. At the same time, von Neumann was bothered by
the fact that evidently our brains operate with very noisy gates and yet are able to carry very long
computations without mistakes. His thoughts culminated in the following ground-breaking result.
Theorem 33.1 (von Neumann, 1957). There exists δ ∗ > 0 such that for all δ < δ ∗ it is possible
to compute every boolean function f via δ -noisy 3-majority gates.
von Neumann’s original estimate δ ∗ ≈ 0.087 was subsequently improved by Pippenger. The
main (still open) question of this area is to find the largest δ ∗ for which the above theorem holds.
Condition in (33.2) implies the output should be correlated with the inputs. This requires the
mutual information between the inputs (if they are random) and the output to be greater than
zero. We now give a theorem of Evans and Schulman that gives an upper bound to the mutual
information between any of the inputs and the output. We will prove the theorem in Section 33.3
as a consequence of the more general directed information percolation theory.
Theorem 33.2 ([117]). Suppose an n-input noisy boolean circuit composed of gates with at most
K inputs and with noise components having at most δ probability of error. Then, the mutual
information between any input Xi and output Y is upper bounded as
di
I(Xi ; Y) ≤ K(1 − 2δ)2 log 2
i i
i i
i i
546
where di is the minimum length between Xi and Y (i.e, the minimum number of gates required to
be passed through until reaching Y).
Theorem 33.2 implies that noisy computation is only possible for δ < 12 − 2√1 K . This is the best
known threshold. An illustration is given below:
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
The above 9-input circuit has gates with at most 3 inputs. The 3-input gates are G4 , G5 and G6 .
The minimum distance between X3 and Y is d3 = 2, and the minimum distance between X5 and Y
is d5 = 3. If Gi ’s are δ -noisy gates, we can invoke Theorem 33.2 between any input and the output.
Unsurprisingly, Theorem 33.2 also tells us that there are some circuits that are not com-
putable with δ -noisy gates. For instance, take f(X1 , . . . , Xn ) = XOR(X1 , . . . , Xn ). Then for
log n
at least one input Xi , we have di ≥ log K . This shows that I(Xi ; Y) → 0 as n →
∞, hence Xi and Y will be almost independent for large n. Note that XOR(X1 , . . . , Xn ) =
XOR XOR(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Xi . Therefore, it is impossible to compute an n-input
XOR with δ -noisy gates for large n.
Computation with formulas: Note that the graph structure given in Figure 33.1 contains some
undirected loops. A formula is a type of boolean circuits that does not contain any undirected loops
unlike the case in Figure 33.1. In other words, for a formula the underlying graph structure forms
a tree. Removing one of the outputs of G2 of Figure 33.1, we obtain a formula as given below.
In Theorem 1 of [116], it is shown that we can compute reliably any boolean function f that
is represented with a formula with at most K-input gates with K odd and every gate are at most
δ -noisy and δ < δf∗ , and no such computation is possible for δ > δf∗ , where
1 2K− 1
δf∗ = − K−1
2 K K− 1
2
i i
i i
i i
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
where the approximation holds for large K. This threshold is better than the upper-bound on the
threshold given by Theorem 33.2 for general boolean circuits. However, for large K we have
p
∗ 1 π /2
δf ≈ − √ , K 1
2 2 K
showing that the estimate of Evans-Schulman δ ∗ ≤ 1
2 − 1
√
2 K
is order-tight for large K. This
demonstrates the tightness of Theorem 33.2.
Recall that the DPI (Theorem 7.4) states that Df (PX kQX ) ≥ Df (PY kQY ). The concept of the
Strong DPI introduced above quantifies the multiplicative decrease between the two f-divergences.
Example 33.1. Suppose PY|X is a kernel for a time-homogeneous Markov chain with stationary
distribution π (i.e., PY|X = PXt+1 |Xt ). Then for any initial distribution q, SDPI gives the following
bound:
Df (qPn kπ ) ≤ ηfn Df (qkπ )
These type of exponential decreases are frequently encountered in the Markov chains literature,
especially for KL and χ2 divergences. For example, for reversible Markov chains, we have [91,
Prop. 3]
χ2 (PXn kπ ) ≤ γ∗2n χ2 (PXn kπ ) (33.4)
where γ∗ is the absolute spectral gap of P. See Exercise VI.18.
i i
i i
i i
548
We note that in general ηf (PY|X ) is hard to compute. However, total variation is an exception.
This case is obvious. Take PX = δx0 and QX = δx′0 .1 Then from the definition of ηTV , we
have ηTV ≥ TV(PY|X=x0 , PY|X=x′0 ) for any x0 and x′0 , x0 6= x′0 .
• ηTV ≤ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):
0
Define η̃ ≜ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ). We consider the discrete alphabet case for simplicity.
0
Fix any PX , QX and PY = PX ◦ PY|X , QY = QX ◦ PY|X . Observe that for any E ⊆ Y
Now suppose there are random variables X0 and X′0 having some marginals PX and QX respec-
tively. Consider any coupling π X0 ,X′0 with marginals PX and QX respectively. Then averaging
(33.5) and taking the supremum over E, we obtain
Now the left-hand side equals TV(PY , QY ) by Theorem 7.7(a). Taking the infimum over
couplings π the right-hand side evaluates to TV(PX , QX ) by Theorem 7.7(b).
Example 33.2 (ηTV of a Binary Symmetric Channel). The ηTV of the BSCδ is given by
Theorem 33.5.
If (U; Y)
ηf (PY|X ) = sup .
PUX : U→X→Y If (U; X)
1
δx0 is the probability distribution with P(X = x0 ) = 1
i i
i i
i i
Recall that for any Markov chain U → X → Y, DPI states that If (U; Y) ≤ If (U; X) and Theorem
33.5 gives the stronger bound
Proof. First, notice that for any u0 , we have Df (PY|U=u0 kPY ) ≤ ηf Df (PX|U=u0 kPX ). Averaging the
above expression over any PU , we obtain
If (U; Y) ≤ ηf If (U; X)
Second, fix P̃X , Q̃X and let U ∼ Bern(λ) for some λ ∈ [0, 1]. Define the conditional distribution
PX|U as PX|U=1 = P̃X , PX|U=0 = Q̃X . Take λ → 0, then (see [250] for technical subtleties)
Theorem 33.6. In the statements below ηf (and others) corresponds to ηf (PY|X ) for some fixed
PY|X .
where (recall β̄ ≜ 1 − β )
( 1 − x) 2
LCβ (PkQ) = Df (PkQ), f(x) = β̄β
β̄ x + β
is the Le Cam divergence of order β (recall (7.6) for β = 1/2).
(e) Consequently,
1 2 H4 (P0 , P1 )
H (P0 , P1 ) ≤ ηKL ≤ H2 (P0 , P1 ) − . (33.7)
2 4
(f) If the binary-input channel is also input-symmetric (or BMS, see Section 19.4*) then ηKL =
Iχ2 (X; Y) for X ∼ Bern(1/2).
i i
i i
i i
550
(g) For any channel the supremum in (33.3) can be restricted to PX , QX with a common binary
support. In other words, ηf (PY|X ) coincides with that of the least contractive binary subchannel.
Consequently, from (e) we conclude
1 diam H2
diam H2 ≤ ηKL (PY|X ) = diam LCmax ≤ diam H2 − ,
2 4
(in particular ηKL diam H2 ), where diam H2 (PY|X ) = supx,x′ ∈X H2 (PY|X=x , PY|X=x′ ),
diam LCmax = supx,x′ LCmax (PY|X=x , PY|x=x′ ) are Hellinger and Le Cam diameters of the
channel.
Proof. Most proofs in full generality can be found in [250]. For (a) one first shows that ηf ≤ ηTV
for the so-called Eγ divergences corresponding to f(x) = |x − γ|+ − |1 − γ|+ , which is not hard to
believe since Eγ is piecewise linear. Then the general result follows from the fact that any convex
function f can be approximated (as N → ∞) in the form
X
N
aj |x − cj |+ + a0 x + c0 .
j=1
For (b) see [66, Theorem 1] and [70, Proposition II.6.13 and Corollary II.6.16]. The idea of this
proof is as follows: :
• ηKL ≥ ηχ2 by locality. Recall that every f-divergence behaves locally behaves as χ2 –
Theorem 7.18.
R∞
• Using the identity D(PkQ) = 0 χ2 (PkQt )dt where Qt = tP1+ Q
+t , we have
Z ∞ Z ∞
D(PY kQY ) = χ2 (PY kQY t )dt ≤ ηχ2 χ2 (PX kQX t )dt = ηχ2 D(PX kQX ).
0 0
I(U; Y) = I(U; Y|∆) ≤ Eδ∼P∆ [(1 − 2δ)2 I(U; X|∆ = δ) = E[(1 − 2∆)2 ]I(U; X),
where we used the fact that I(U; X|∆ = δ) = I(U; X) and Example 33.3 below.
For (g) see Ex. VI.19.
i i
i i
i i
This example has the following consequence for the KL-divergence geometry.
Proposition 33.7. Consider any distributions P0 and P1 on X and let us consider the interval
in P(X ): Pλ = λP1 + (1 − λ)P0 for λ ∈ [0, 1]. Then divergence (with respect to the midpoint)
behaves subquadratically:
The same statement holds with D replaced by χ2 (and any other Df satisfying Theorem 33.6(b)).
Notice that for any metric d(P, Q) on P(X ) that is induced from the norm on the vector space
M(X ) of all signed measures (such as TV), we must necessarily have d(Pλ , P1−λ ) = |1 −
2λ|d(P0 , P1 ). Thus, the ηKL (BSCλ ) = (1 − 2λ)2 which yields the inequality is rather natural.
i i
i i
i i
552
B
X0 W
A
Example 33.4. Suppose we have a graph G = (V, E) as below This means that we have a joint
distribution factorizing as
PX0 ,A,B,W = PX0 PB|X0 PA,B|X0 PW|A,B .
Then every node has a channel from its parents to itself, for example W corresponds to a noisy
channel PW|A,B , and we can define η ≜ ηKL (PW|A,B ). Now, prepend another random variable U ∼
Bern(λ) at the beginning, the new graph G′ = (V′ , E′ ) is shown below: We want to verify the
B
U X0 W
A
relation
I(U; B, W) ≤ η̄ I(U; B) + η I(U; A, B). (33.8)
Recall that from chain rule we have I(U; B, W) = I(U; B) + I(U; W|B) ≥ I(U; B). Hence, if (33.8)
is correct, then η → 0 implies I(U; B, W) ≈ I(U; B) and symmetrically I(U; A, W) ≈ I(U; A).
Therefore for small δ , observing W, A or W, B does not give advantage over observing solely A or
B, respectively.
Observe that G′ forms a Markov chain U → X0 → (A, B) → W, which allows us to factorize
the joint distribution over E′ as
PU,X0 ,A,B,W = PU PX0 |U PA,B|X0 PW|A,B .
Now consider the joint distribution conditioned on B = b, i.e., PU,X0 ,A,W|B . We claim that the
conditional Markov chain U → X0 → A → W|B = b holds. Indeed, given B and A, X0 is
independent of W, that is PX0 |A,B PW|A,B = PX0 ,W|AB , from which follows the mentioned conditional
Markov chain. Using the conditional Markov chain, SDPI gives us for any b,
I(U; W|B = b) ≤ η I(U; A|B = b).
Averaging over b and adding I(U; B) to both sides we obtain
I(U; W, B) ≤ η I(U; A|B) + I(U; B)
= η I(U; A, B) + η̄ I(U; B).
From the characterization of ηf in Theorem 33.5 we conclude
ηKL (PW,B|X0 ) ≤ η · ηKL (PA,B|X0 ) + (1 − η) · ηKL (PB|X0 ) . (33.9)
Now, we provide another example which has in some sense an analogous setup to Example
33.4.
i i
i i
i i
B
R
X W
A
Example 33.5 (Percolation). Take the graph G = (V, E) in example 33.4 with a small modification.
See Fig. 33.2. Now, suppose X,A,B,W are some cities and the edge set E represents the roads
between these cities. Let R be a random variable denoting the state of the road connecting to W
with P(R is open) = η and P[R is closed] = η̄ . For any Y ∈ V, let the event {X → Y} indicate that
one can drive from X to Y. Then
P[X → B or W] = η P[X → A or B] + η̄ P[X → B]. (33.10)
Observe the resemblance between (33.9) and (33.10).
We will now give a theorem that relates ηKL to percolation probability on a DAG under the
following setting: Consider a DAG G = (V, E).
Under this model, for two subsets T, S ⊂ V we define perc[T → S] = P[∃ open path T → S].
Note that PXv |XPa(v) describe the stochastic recipe for producing Xv based on its parent variables.
We assume that in addition to a DAG we also have been given all these constituent channels (or
at least bounds on their ηKL coefficients).
Theorem 33.8 ([250]). Let G = (V, E) be a DAG and let 0 be a node with in-degree equal to zero
(i.e. a source node). Note that for any 0 63 S ⊂ V we can inductively stitch together constituent
channels PXv |XPa(v) and obtain PXS |X0 . Then we have
ηKL (PXS |X0 ) ≤ perc(0 → S). (33.11)
Proof. For convenience let us denote η(T) = ηKL (PXT |X0 ) and ηv = ηKL (PXv |XPa(v) ). The proof
follows from an induction on the size of G. The statement is clear for the |V(G)| = 1 since
S = ∅ or S = {X0 }. Now suppose the statement is already shown for all graphs smaller than
G. Let v be the node with out-degree 0 in G. If v 6∈ S then we can exclude it from G and the
statement follows from induction hypothesis. Otherwise, define SA = Pa(v) \ S and SB = S \ {v},
A = XSA , B = XSB , W = Xv . (If 0 ∈ A then we can create a fake 0′ with X0′ = X0 and retain
0′ ∈ A while moving 0 out of A. So without loss of generality, 0 6∈ A.) Prepending arbitrary U to
the graph as U → X0 , the joint DAG of random variables (X0 , A, B, W) is then given by precisely
the graph in (33.8). Thus, we obtain from (33.9) the estimate
η(S) ≤ ηv η(SA ∪ SB ) + (1 − ηv )ηKL (SB ) . (33.12)
i i
i i
i i
554
From induction hypothesis η(SA ∪ SB ) ≤ perc(0 → SA ) and η(SB ) ≤ perc(0 → SB ) (they live on
a graph G \ {v}). Thus, from computation (33.10) we see that the right-hand side of (33.12) is
precisely perc(0 → S) and thus η(S) ≤ perc(S) as claimed.
We are now in position to complete the postponed proof.
Proof of Theorem 33.2. First observe the noisy boolean circuit is a form of DAG. Since the gates
are δ -noisy contraction coefficients of constituent channels ηv in the DAG can be bounded by
(1 − 2δ)2 . Thus, in the percolation question all vertices are open with probability (1 − 2δ)2
From SDPI, for each i, we have I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ). From Theorem 33.8, we know
ηKL (PY|Xi ) ≤ perc(Xi → Y). We now want to upper bound perc(Xi → Y). Recall that the minimum
distance between Xi and Y is di . For any path π of length ℓ(π ) from Xi to Y, therefore, the probability
that it will be open is ≤ (1 − 2δ)2ℓ(π ) . We can thus bound
X
perc(Xi → Y) ≤ (1 − 2δ)2ℓ(π ) . (33.13)
π :Xi →Y
Let us now build paths backward starting from Y, which allows us to represent paths X → Yi
as vertices of a K-ary tree with root Yi . By labeling all vertices on a K-ary tree corresponding
to paths X → Yi we observe two facts: the labeled set V is prefix-free (two labeled vertices are
never in ancestral relation) and the depth of each labeled set is at least di . It is easy to see that
P
u∈V c
depth(u)
≤ (Kc)di provided Kc ≤ 1 and attained by taking V to be set of all vertices in the
tree at depth di . We conclude that whenever K(1 − 2δ)2 ≤ 1 the right-hand side of (33.13) is
bounded by (K(1 − 2δ)2 )di , which concludes the proof by upper bounding H(Xi ) ≤ log 2 as
I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ) ≤ Kdi (1 − 2δ)2di log 2
We conclude the section with an example illustrating that Theorem 33.8 may give stronger
bounds when compared to Theorem 33.2.
Example 33.6. Suppose we have the topological restriction on the placement of gates (namely
that the inputs to each gets should be from nearest neighbors to the left), resulting in the following
circuit of 2-input δ -noisy gates. Note that each gate may be a simple passthrough (i.e. serve as
router) or a constant output. Theorem 33.2 states that if (1 − 2δ)2 < 21 , then noisy computation
i i
i i
i i
within arbitrary topology is not possible. Theorem 33.8 improves this to (1 − 2δ)2 < pc , where pc
is the oriented site-percolation threshold for the particular graph we have. Namely, if each vertex
is open with probability p < pc then with probability 1 the connected component emanating from
any given node (and extending to the right) is finite. For the example above the site percolation
threshold is estimated as pc ≈ 0.705 (so called Stavskaya automata).
Definition 33.9 (Input Dependent Contraction Coefficient). For any input distribution PX , Markov
kernel PY|X and convex function f, we define
Df (QY kPY )
ηf (PX , PY|X ) ≜ sup
Df (QX kPX )
where PY = PY|X ◦ PX , QY = PY|X ◦ QX and supremum is over QX satisfying 0 < Df (PX kQX ) < ∞.
We refer to ηf (PX , PY|X ) as the input dependent contraction coefficient, to contrast it with the
input independent contraction coefficient ηf (PY|X ).
Remarks:
• Although we have the equality ηKL (PY|X ) = ηχ2 (PY|X ) when PY|X is a BMS channel, we do not
have the same equality for ηKL (PX , PY|X ).
Example 33.7. (ηKL (PX , PY|X ) for Erasure Channel) We define ECτ as the following channel,
(
X w.p. 1 − τ
Y=
? w.p. τ.
Let us define an auxiliary random variable B = 1{Y =?}. Thus we have the following equality,
I(U; Y) = I(U; Y, B) = I(U; B) +I(U; Y|B) = (1 − τ )I(U; X).
| {z }
0,B⊥
⊥U
i i
i i
i i
556
where the last equality is due to the fact that I(U; Y|B = 1) = 0 and I(U; Y|B = 0) = I(U; X). By
the mutual information characterization of ηKL (PX , PY|X ), we have ηKL (PX , ECτ ) = 1 − τ .
Proposition 33.10 (Tensorization of ηKL ). For a given number n, two measures PX and PY|X we
have
ηKL (P⊗ n ⊗n
X , PY|X ) = ηKL (PX , PY|X )
i.i.d.
In particular, if (Xi , Yi ) ∼ PX,Y , then ∀PU|Xn
I(U; Yn ) ≤ ηKL (PX , PY|X )I(U; Xn )
Proof. Without loss of generality (by induction) it is sufficient to prove the proposition for n = 2.
It is always useful to keep in mind the following diagram Let η = ηKL (PX , PY|X )
X1 Y1
X2 Y2
i i
i i
i i
′ |X X3 ……
PX
X1
P X′ |X X5 ……
PX′ |X
Xρ
PX′ |X X6 ……
PX ′
|X X2
PX ′
|X X4 ……
• We can think of this model as a broadcasting scenario, where the root broadcasts its message
Xρ to the leaves through noisy channels PX′ |X . The condition (33.17) here is only made to avoid
defining the reverse channel. In general, one only requires that π is a stationary distribution of
PX′ |X , in which case the (33.19) should be replaced with ηKL (π , PX|X′ )b < 1.
• This model arises frequently in community detection, sparse codes and statistical physics.
• Under the assumption (33.17), the joint distribution of this tree can also be written as a Gibbs
distribution
1 X X
PXall = exp f(Xp , Xc ) + g(Xv ) , (33.18)
Z
(p,c)∈E v∈V
where Z is the normalization constant, f(xp , xc ) = f(xc , xp ) is symmetric. When X = {0, 1}, this
model is known as the Ising model (on a tree). Note, however, that not every measure factorizing
as (33.18) (with symmetric f) can be written as a broadcasting process for some P and π.
We can define a corresponding inference problem, where we want to reconstruct the root variable
Xρ given the observations XLd = {Xv : v ∈ Ld }, with Ld = {v : v ∈ V, depth(v) = d}. A
natural question is to upper bound the performance of any inference algorithm on this problem.
The following theorem shows that there exists a phase transition depending on the branching factor
b and the contraction coefficient of the kernel PX′ |X .
Theorem 33.11. Consider the broadcasting problem on infinite b-ary tree (b > 1), with root
distribution π and edge kernel PX′ |X . If π is a reversible measure of PX′ |X such that
ηKL (π , PX′ |X )b < 1, (33.19)
then I(Xρ ; XLd ) → 0 as d → 0.
Proof. For every v ∈ L1 , we define the set Ld,v = {u : u ∈ Ld , v ∈ ancestor(u)}. We can upper
bound the mutual information between the root vertex and leaves at depth d
X
I(Xρ ; XLd ) ≤ I(Xρ ; XLd,v ).
v∈L1
i i
i i
i i
558
Due to our assumption on π and PX′ |X , we have PXρ |Xv = PX′ |X and PXv = π. By the definition of
the contraction coefficient, we have
Observe that because PXv = π and all edges have the same kernel, then I(XLd,v ; Xv ) = I(XLd−1 ; Xρ ).
This gives us the inequality
which implies
then no inference algorithm can recover the root nodes as depth of the tree goes to infinity. This
result is originally proved in [39].
Example 33.9 (k-coloring on tree). Given a b-ary tree, we assign a k-coloring Xvall by sampling
uniformly from the ensemble of all valid k-coloring. For this model, we can define a corresponding
inference problem, namely given all the colors of the leaves at a certain depth, i.e., XLd , determine
the color of the root node, i.e., Xρ .
This problem can be modeled as a broadcasting problem on tree where the root distribution π
is given by the uniform distribution on k colors, and the edge kernel PX′ |X is defined as
(
1
a 6= b
PX′ |X (a|b) = k−1
0, a = b.
It can be shown, see Ex. VI.23, that ηKL (Unif, PX′ |X ) = k log k(11+o(1)) . By Theorem 33.11, this
implies that if b < k log k(1 + o(1)) then reliable reconstruction of the root node is not possible.
This result is originally proved in [288] and [32]
The other direction b > k log k(1 + o(1)) can be shown by observing that if b > k log k(1 + o(1))
then the probability of the children of a node taking all available colors (except its own) is close to
1. Thus, an inference algorithm can always determine the color of a node by finding a color that
is not assigned to any of its children. Similarly, when b > (1 + ϵ)k log k even observing (1 − ϵ)-
fraction of the node’s children is sufficient to reconstruct its color exactly. Proceeding recursively
i i
i i
i i
from bottom up, such a reconstruction algorithm will succeed with high probability. In this regime
with positive probability (over the leaf variables) the posterior distribution of the root color is a
delta-function (deterministic). This effect is known as “freezing” of the root given the boundary.
Notice that in this problem we are not sample-limited (each party has infinitely many samples),
but communication-limited (only B bits can be exchanged).
Here is a trivial attempt to solve it. Notice that if Bob sends W = (Y1 , . . . , YB ) then the optimal
PB
estimator is ρ̂(X∞ , W) = 1n i=1 Xi Yi which has minimax error B1 , hence R∗ (B) ≤ B1 . Surprisingly,
this can be improved.
Proof. Fix PW|Y∞ , we get the following decomposition Note that once the messages W are fixed
X1 Y1
.. ..
. .
W Xi Yi
.. ..
. .
we have a parameter estimation problem {Qρ , ρ ∈ [−1, 1]} where Qρ is a distribution of (X∞ , W)
when A∞ , B∞ are ρ-correlated. Since we minimize MMSE, we know from the van Trees inequality
(Theorem 29.2) 2 that R∗ (B) ≥ min1+o(1)
ρ JF (ρ)
≥ 1J+Fo(0(1)) where JF (ρ) is the Fisher Information of the
family {Qρ }.
Recall, that we also know from the local approximation that
ρ2 log e
D(Qρ kQ0 ) = JF (0) + o(ρ2 )
2
2
This requires some technical justification about smoothness of the Fisher information JF (ρ).
i i
i i
i i
560
hence JF (0) ≤ (2 ln 2)B + o(1) which in turns implies the theorem. For full details and the
extension to interactive communication between Alice and Bob see [150].
We comment on the upper bound next. First, notice that by taking blocks of m → ∞ consecutive
Pim−1
bits and setting X̃i = √1m j=(i−1)m Xj and similarly for Ỹi , Alice and Bob can replace ρ-correlated
i.i.d. 1 ρ
bits with ρ-correlated standard Gaussians (X̃i , Ỹi ) ∼ N (0, ). Next, fix some very large N
ρ 1
and let
W = argmax Yj .
1≤j≤N
√
From standard concentration results we know that E[YW ] = 2 ln N(1 + o(1)) and Var[YW ] =
O( ln1N ). Therefore, knowing W Alice can estimate
XW
ρ̂ = .
E [ YW ]
1−ρ2 +o(1)
This is an unbiased estimator and Varρ [ρ̂] = 2 ln N . Finally, setting N = 2B completes the
argument.
Definition 33.13 (partial orders on channels). Let PY|X and PZ|X be two channels. We say that PY|X
is a degradation of PZ|X , denoted PY|X ≤deg PZ|X , if there exists PY|Z such that PY|X = PY|Z ◦PZ|X . We
say that PZ|X is less noisy than PY|X , PY|X ≤ln PZ|X , iff for every PU,X on the following Markov chain
we have I(U; Y) ≤ I(U; Z). We say that PZ|X is more capable than PY|X , denoted PY|X ≤mc PZ|X if
U X
i i
i i
i i
• PY|X ≤deg PZ|X =⇒ PY|X ≤ln PZ|X =⇒ PY|X ≤mc PZ|X . Counter examples for reverse
implications can be found in [81, Problem 15.11].
• For less noisy we also have the equivalent definition in terms of the divergence, namely PY|X ≤ln
PZ|X if and only if for all PX , QX we have D(QY kPY ) ≤ D(QZ kPZ ). We refer to [208, Sections
I.B, II.A] and [250, Section 6] for alternative useful characterizations of the less-noisy order.
• For BMS channels (see Section 19.4*) it turns out that among all channels with a given
Iχ2 (X; Y) = η (with X ∼ Ber(1/2)) the BSC and BEC are the minimal and maximal elements
in the poset of ≤ln ; see Ex. VI.20 for details.
Proposition 33.14. ηKL (PY|X ) ≤ 1 − τ if and only if PY|X ≤LN ECτ , where ECτ was defined in
Example 33.7.
Proposition 33.15. (Tensorization of Less Noisy Ordering) If for all i ∈ [n], PYi |Xi ≤LN PZi |Xi , then
PY1 |X1 ⊗ PY2 |X2 ≤LN PZ1 |X1 ⊗ PZ2 |X2 . Note that P ⊗ Q refers to the product channel of P and Q.
Y1
X1
Z1
U
Y2
X2
Z2
i i
i i
i i
562
It can be seen from the Markov chain that I(U; Y1 , Y2 ) ≤ I(U; Y1 , Z2 ) implies I(U; Y1 , Y2 ) ≤
I(U; Z1 , Z2 ). Consider the following inequalities,
X2 X6
Y2
Y6
6
2
Y5
Y1
Y35 7
X1 X3 X5 X7
Y5
Y1
9
4
Y7
Y3
9
4
X4 X9
Example 33.10 (Community Detection). In this model, we consider a complete graph with n
vertices, i.e. Kn , and the random variables Xv representing the membership of each vertex to one
of the m communities. We assume that Xv is sampled uniformly from [m] and independent of the
other vertices. The observation Yu,v is defined as
(
Ber(a/n) Xu = Xv
Yuv ∼
Ber(b/n) Xu 6= Xv .
Example 33.11 (Z2 Synchronization). For any graph G, we sample Xv uniformly from {−1, +1}
and Ye = BSCδ (Xu Xv ).
Example 33.12 (Spiked Wigner Model). We consider the inference problem of determining the
value of vector (Xi )i∈[n] given the observation (Yij )i,j∈[n],i≤j . The Xi ’s and Yij ’s are related by a
linear model
r
λ
Yij = Xi Xj + Wij ,
n
i i
i i
i i
where Xi is sampled uniformly from {−1, +1} and Wij ∼ N(0, 1). This model can also be written
in matrix form as
r
λ
Y= XXT + W
n
where W is the Wigner matrix, hence the name of the model. It is used as a probabilistic model
for principle component analysis (PCA).
This problem can also be treated as a problem of inference on undirected graph. In this case,
the underlying graph is a complete graph, and we assign Xi to the ith vertex. Under this model, the
edge observations is given by Yij = BIAWGNλ/n (Xi Xj ).
Although seemingly different, these problems share similar characteristics, namely:
(Xu , Xv ) → B → Ye .
In other words, the observation on each edge only depends on whether the random variables on
its endpoints are similar.
We will refer to the problem which have this characteristics as the Special Case (S.C.). Due to
S.C.,the reconstructed Xv ’s is symmetric up to any permutation on X . In the case of alphabet X =
{−1, +1}, this implies that for any realization σ then PXall |Yall (σ|b) = PXall |Yall (−σ|b). Consequently,
our reconstruction metric also needs to accommodate this symmetry. For X = {−1, +1}, this
Pn
leads to the use of n1 | i=1 Xi X̂i | as our reconstruction metric.
Our main theorem for undirected inference problem can be seen as the analogue of the infor-
mation percolation theorem for DAG. However, instead of controlling the contraction coefficient,
the percolation probability is used to directly control the conditional mutual information between
any subsets of vertices in the graph.
Before stating our main theorem, we will need to define the corresponding percolation model
for inference on undirected graph. For any undirected graph G = (V, E) we define a percolation
model on this graph as follows :
• Every edge e ∈ E is open with the probability ηKL (PYe |Xe ), independent of the other edges,
• For any v ∈ V and S ⊂ V , we define the v ↔ S as the event that there exists an open path from
v to any vertex in S,
• For any S1 , S2 ⊂ V , we define the function percu (S1 , S2 ) as
X
percu (S1 , S2 ) ≜ P(v ↔ S2 ).
v∈S1
Notice that this function is different from the percolation function for information percolation
in DAG. Most importantly, this function is not equivalent to the exact percolation probability.
i i
i i
i i
564
Instead, it is an upper bound on the percolation probability by union bounding with respect to
S1 . Hence, it is natural that this function is not symmetric, i.e. percu (S1 , S2 ) 6= percu (S2 , S1 ).
Instead of proving theorem 33.16 in its full generality, we will prove the theorem under S.C.
condition. The main step of the proof utilizes the fact we can upper bound the mutual information
of any channel by its less noisy upper bound.
Theorem 33.17. Consider the problem of inference on undirected graph G = (V, E) with
X1 , ..., Xn are not necessarily independent. If PYe |Xe ≤LN PZe |Xe , then for any S1 , S2 ⊂ V and E ⊂ E
X1 X2 X3 X4
From our assumption and the tensorization property of less noisy ordering (Proposition 33.15),
we have PYE |XS1 ,XS2 ≤LN PZE |XS1 ,XS2 . This implies that for σ as a valid realization of XS2 we will
have
I(XS1 ; YE |XS2 = σ) = I(XS1 , XS2 ; YE |XS2 = σ) ≤ I(XS1 , XS2 ; ZE |XS2 = σ) = I(XS1 ; ZE |XS2 = σ).
As this inequality holds for all realization of XS2 , then the following inequality also holds
Proof of Theorem 33.16. We only give a proof under the S.C. condition above and only for the
case S1 = {i}. For the full proof (that proceeds by induction and does not leverage the less noisy
idea), see [252]. We have the following equalities
i i
i i
i i
Due to our previous result, if ηKL (PYe |Xe ) = 1 − τ then PYe |Xe ≤LN PZe |Xe where PZe |Xe = ECτ .
By tensorization property, this ordering also holds for the channel PYE |XE , thus we have
I(Xi ; YE |XS2 ) ≤ I(Xj ; ZE |XS2 ).
Let us define another auxiliary random variable D = 1{i ↔ S2 }, namely it is the indicator that
there is an open path from i to S2 . Notice that D is fully determined by ZE . By the same argument
as in (33.20), we have
I(Xi ; ZE |XS2 ) = I(Xi ; XS2 |ZE )
= I(Xi ; XS2 |ZE , D)
= (1 − P[i ↔ S2 ]) I(Xi ; XS2 |ZE , D = 0) +P[i ↔ S2 ] I(Xi ; XS2 |ZE , D = 1)
| {z } | {z }
0 ≤log |X |
≤ P[i ↔ S2 ] log |X |
= percu (i, S2 )
Theorem 33.18. Consider the spiked Wigner model. If λ ≤ 1, then for any sequence of estimators
Xˆn (Y),
" n #
X
1
E Xi X̂i → 0 (33.21)
n
i=1
as n → ∞.
p
Proof of Theorem 33.18. Note that by E[|T|] ≤ E[T2 ] the left-hand side of (33.21) can be
upper-bounded by
s X
1
E[ Xi Xj X̂i X̂j ] .
n
i, j
Next, it is clear that we can simplify the task of maximizing (over X̂n ) by allowing to separately
estimate each product by T̂i,j , i.e.
X X
max E[ Xi Xj X̂i X̂j ] ≤ max E[Xi Xj T̂i,j ] .
X̂n T̂i,j
i,j i,j
i i
i i
i i
566
(For example, we may notice I(Xi ; Xj |Y) = I(Xi , Xj ; Y) ≥ I(Xi Xj ; Y) and apply Fano’s inequality).
Thus, from symmetry of the problem it is sufficient to prove I(X1 ; X2 |Y) → 0 as n → ∞.
By using the undirected information percolation theorem, we have
in which the percolation model is defined on a complete graph with edge probability λ+no(1) as
ηKL (BIAWGNλ/n ) = λn (1 + o(1)). We only treat the case of λ < 1 below. For such λ we can over-
′
bound λ+no(1) by λn with λ′ < 1. This percolation random graph is equivalent to the Erd�s-Rényi
random graph with n vertices and λ′ /n edge probability, i.e., ER(n, λ′ /n). Using this observation,
the inequality can be rewritten as
The largest components on ER(n, λ′ /n) contains O(log n) if λ′ < 1. This implies that the proba-
bility that two specific vertices are connected is o(1), hence I(X2 ; X1 |Y) → 0 as n → ∞. To treat
the case of λ = 1 we need slightly more refined information about behavior of giant component
of ER(n, 1+on(1) ) graph, see [252].
Remark 33.2 (Dense-Sparse equivalence). This reduction changes the underlying structure of the
graph. Instead of dealing with a complete graph, the percolation problem is defined on an Erd�s-
Rényi random graph. Moreover, if ηKL is small enough, then the underlying percolation graph
tends to have a locally tree-like structure. This is demonstration of the ubiquitous effect: dense
inference (such as spiked Wigner or sparse regression) with very weak signals (ηKL ≈ 1) is similar
to sparse inference (broadcasting on trees) with moderate signals (ηKL ∈ (ϵ, 1 − ϵ)).
Definition 33.19 (Post-SDPI constant). Given a conditional measure PY|X , define the input-
dependent and input-free contraction coefficients as
(p) I(U; X)
ηKL (PX , PY|X ) = sup :X→Y→U
PU|Y I ( U; Y )
(p) I(U; X)
ηKL (PY|X ) = sup :X→Y→U
PX ,PU|Y I(U; Y)
i i
i i
i i
X Y U
ε̄ 0 τ̄ 0 0
τ
?
τ
ε 1 τ̄ 1 1
where PY = PY|X ◦ PX and PX|Y is the conditional measure corresponding to PX PY|X . From (33.22)
and Prop. 33.10 we also get tensorization property for input-dependent post-SDPI:
(p) (p)
ηKL (PnX , (PY|X )n ) = ηKL (PnX , (PY|X )n ) (33.24)
(p)
It is easy to see that by the data processing inequality, ηKL (PY|X ) ≤ 1. Unlike the ηKL coefficient
(p)
the ηKL can equal to 1 even for a noisy channel PY|X .
(p)
Example 33.13 (ηKL = 1). Let PY|X = BECτ and X → Y → U be defined as on Fig. 33.3 Then
we can compute I(Y; U) = H(U) = h(ετ̄ ) and I(X; U) = H(U) − H(U|X) = h(ετ̄ ) − εh(τ ) hence
(p) I(X; U)
ηKL (PY|X ) ≥
I(Y; U)
ε
= 1 − h(τ )
h(ετ̄ )
This last term tends to 1 when ε tends to 0 hence
( p)
ηKL (BECτ ) = 1
Theorem 33.20.
(p)
ηKL (BSCδ ) = (1 − 2δ)2
i i
i i
i i
568
Theorem 33.22 (Post-SDPI for BI-AWGN). Let 0 ≤ ϵ ≤ 1 and consider the channel PY|X with
X ∈ {±1} given by
Y = ϵX + Z, Z ∼ N ( 0, 1) .
Then for any π ∈ (0, 1) taking PX = Ber(π ) we have for some absolute constant K the estimate
(p) ϵ2
ηKL (PX , PY|X ) ≤ K .
π (1 − π )
Proof. In this proof we assume all information measures are used to base-e. First, notice that
1
v( y) ≜ P [ X = 1 | Y = y] = 1−π −2yϵ
.
1+ π e
(p)
Then, the optimization defining ηKL can be written as
(p) d(EQY [v(Y)]kπ )
ηKL (PX , PY|X ) ≤ sup . (33.25)
QY D(QY kPY )
From (7.31) we have
(p) 1 (EQY [v(Y)] − π )2
ηKL (PX , PY|X ) ≤ sup . (33.26)
π (1 − π ) QY D(QY kPY )
To proceed, we need to introduce a new concept. The T1 -transportation inequality for the
measure PY states the following: For every QY we have for some c = c(PY )
p
W1 (QY , PY ) ≤ 2cD(QY kPY ) , (33.27)
i i
i i
i i
Numerically, for π = 1/2 it turns out that the optimal value is λ → 12 , justifying our overbounding
of d by χ2 , and surprisingly giving
(p)
ηKL (Ber(1/2), PY|X ) = 4 EPY [tanh2 (ϵY)] = ηKL (PY|X ) ,
where in the last equality we used Theorem 33.6(f)).
i i
i i
i i
570
estimator is to minimize supθ E[kθ − θ̂k2 ] over θ̂. If we denote by Ui ∈ Ui the messages then
P
i log2 |Ui | ≤ B and the diagram is
Y1 U1
θ .. ..
. . θ̂
Ym Um
Finally, let
Observations:
• Without constraint on the magnitude of θ ∈ [−1, 1]d , we could give θ ∼ N (0, bId ) and from
rate-distortion quickly conclude that estimating θ within risk R requires communicating at least
2 log R bits, which diverges as b → ∞. Thus, restricting the magnitude of θ is necessary in
d bd
i i
i i
i i
dϵ2
Theorem 33.23. There exists a constant c1 > 0 such that if R∗ (m, d, σ 2 , B) ≤ 9 then B ≥ c1 d
ϵ2
.
Proof. Let X ∼ Unif({±1}d ) and set θ = ϵX. Given an estimate θ̂ we can convert it into an
estimator of X via X̂ = sign(θ̂) (coordinatewise). Then, clearly
ϵ2 dϵ 2
E[dH (X, X̂)] ≤ E[kθ̂ − θk2 ] ≤ .
4 9
Thus, we have an estimator of X within Hamming distance 94 d. From Rate-Distortion (Theo-
rem 26.1) we conclude that I(X; X̂) ≥ cd for some constant c > 0. On the other hand, from
the standard DPI we have
X
m
cd ≤ I(X; X̂) ≤ I(X; U1 , . . . , Um ) ≤ I ( X ; Uj ) , (33.29)
j=1
where we also applied Theorem 6.1. Next we estimate I(X; Uj ) via I(Yj ; Uj ) by applying the Post-
SDPI. To do this we need to notice that the channel X → Yj for each j is just a memoryless
extension of the binary-input AWGN channel with SNR ϵ. Since each coordinate of X is uniform,
we can apply Theorem 33.22 (with π = 1/2) together with tensorization (33.24) to conclude that
I(X; Uj ) ≤ 4Kϵ2 I(Yj ; Uj ) ≤ 4Kϵ2 log |Uj |
Together with (33.29) we thus obtain
cd ≤ I(X; X̂) ≤ 4Kϵ2 B log 2 (33.30)
i i
i i
i i
i.i.d.
VI.1 Let X1 , . . . , Xn ∼ Exp(exp(θ)), where θ follows the Cauchy distribution π with parameter s,
whose pdf is given by p(θ) = 1
θ2
for θ ∈ R. Show that the Bayes risk
πs(1+ )
s2
3. Argue that the hardest regime for system identification is when θ ≈ 0, and that instability
(|θ| > 1) is in fact helpful.
VI.3 (Linear regression) Consider the model
Y = Xβ + Z
where the design matrix X ∈ Rn×d is known and Z ∼ N(0, In ). Define the minimax mean-square
error of estimating the regression coefficient β ∈ Rd based on X and Y as follows:
R∗est = inf sup Ekβ̂ − βk22 . (VI.1)
β̂ β∈Rd
Redo (a) and (b) by finding the value of R∗pred and identify the minimax estimator. Explain
intuitively why R∗pred is always finite even when d exceeds n.
i i
i i
i i
i.i.d.
VI.4 (Chernoff-Rubin-Stein lower bound.) Let X1 , . . . , Xn ∼ Pθ and θ ∈ [−a, a].
(a) State the appropriate regularity conditions and prove the following minimax lower bound:
2 2 (1 − ϵ)
2
inf sup Eθ [(θ − θ̂) ] ≥ min max ϵ a ,
2
,
θ̂ θ∈[−a,a] 0<ϵ<1 nJ̄F
1
Ra
where J̄F = 2a J (θ)dθ is the average Fisher information. (Hint: Consider the uniform
−a F
prior on [−a, a] and proceed as in the proof of Theorem 29.2 by applying integration by
parts.)
(b) Simplify the above bound and show that
1
inf sup Eθ [(θ − θ̂)2 ] ≥ p . (VI.3)
θ̂ θ∈[−a,a] ( a− 1 + nJ̄F )2
(c) Assuming the continuity of θ 7→ JF (θ), show that the above result also leads to the optimal
local minimax lower bound in Theorem 29.4 obtained from Bayesian Cramér-Rao:
1 + o( 1)
inf sup Eθ [(θ − θ̂)2 ] ≥ .
θ̂ θ∈[θ0 ±n−1/4 ] nJF (θ0 )
Note: (VI.3) is an improvement of the inequality given in [65, Lemma 1] without proof and
credited to Rubin and Stein.
VI.5 In this exercise we give a Hellinger-based lower bound analogous to the χ2 -based HCR lower
bound in Theorem 29.1. Let θ̂ be an unbiased estimator for θ ∈ Θ ⊂ R.
(a) For any θ, θ′ ∈ Θ, show that [283]
1 (θ − θ′ )2 1
(Varθ (θ̂) + Varθ′ (θ̂)) ≥ −1 . (VI.4)
2 4 H2 (Pθ , Pθ′ )
R √ √ √ √
(Hint: For any c, θ − θ′ = (θ̂ − c)( pθ + pθ′ )( pθ − pθ′ ). Apply Cauchy-Schwarz
and optimize over c.)
(b) Show that
1
H2 (Pθ , Pθ′ ) ≤ (θ − θ′ )2 J̄F (VI.5)
4
R θ′
where J̄F = θ′ 1−θ θ JF (u)du is the average Fisher information.
(c) State the needed regularity conditions and deduce the Cramér-Rao lower bound from (VI.4)
and (VI.5) with θ′ → θ.
(d) Extend the previous parts to the multivariate case.
VI.6 (Bayesian distribution estimation.) Let {Pθ : θ ∈ Θ} be a family of distributions on X
with a common dominating measure μ and density pθ (x) = dP n
dμ (x). Given a sample X =
θ
i.i.d.
(X1 , . . . , Xn ) ∼ Pθ for some θ ∈ Θ, the goal is to estimate the data-generating distribution Pθ by
some estimator P̂(·) = P̂(·; Xn ) with respect to some loss function ℓ(P, P̂). Suppose we are in
a Bayesian setting where θ is drawn from a prior π. Let’s find the form of the Bayes estimator
and the Bayes risk
i i
i i
i i
(a) For convenience, let Xn+1 denote a test data point (unseen) drawn from Pθ and independent
of the observed data Xn . Convince yourself that every estimator P̂ can be formally identified
as a conditional distribution QXn+1 |Xn .
(b) Consider the KL loss ℓ(P, P̂) = D(PkP̂). Using Corollary 4.2, show that the Bayes estimator
minimizing the average KL risk is the posterior (conditional mean). Namely,
is achieved at QXn+1 |Xn = PXn+1 |Xn . In other words, the estimated density value at xn+1 is
Qn+1
dQXn+1 |Xn (xn+1 |xn ) Eθ∼π [ i=1 pθ (xi )]
= Qn . (VI.6)
dμ Eθ∼π [ i=1 pθ (xi )]
is achieved by
" n #
dQXn+1 |Xn (xn+1 |xn ) Y
∝ Eθ [pθ (xn+1 ) |X = x ] ∝ Eθ∼π
2 n n 2
pθ (xi )pθ (xn+1 ) . (VI.7)
dμ
i=1
i.i.d.
(e) Consider the discrete alphabet [k] and Xn ∼ P, where P = (P1 , . . . , Pk ) is drawn from
the Dirichlet prior Dirichlet(α, . . . , α). Applying (VI.6) and (VI.7) (with μ the count-
ing measure), show that the Bayes estimator for the KL loss is the add-α estimator
(Section 13.5):
nj + α
Pbj = , (VI.9)
n + kα
Pn
where nj = i=1 1{Xi =j} is the empirical count, and for the χ2 loss is
p
(nj + α)(nj + α + 1)
Pbj = Pk p . (VI.10)
j=1 (nj + α)(nj + α + 1)
i.i.d.
VI.7 (Coin flips) Given X1 , . . . , Xn ∼ Ber(θ) with θ ∈ Θ = [0, 1], we aim to estimate θ with respect
to the quadratic loss function ℓ(θ, θ̂) = (θ − θ̂)2 . Denote the minimax risk by R∗n .
i i
i i
i i
(a) Use the empirical frequency θ̂emp = X̄ to estimate θ. Compute and plot the risk Rθ (θ̂) and
show that
1
R∗n ≤ .
4n
(b) Compute the Fisher information of Pθ = Ber(θ)⊗n and Qθ = Bin(n, θ). Explain why they
are equal.
(c) Invoke the Bayesian Cramér-Rao lower bound Theorem 29.2 to show that
1 + o( 1)
R∗n = .
4n
(d) Notice that the risk of θ̂emp is maximized at 1/2 (fair coin), which suggests that it might be
possible to hedge against this situation by the following randomized estimator
(
θ̂emp , with probability δ
θ̂rand = 1 (VI.11)
2 with probability 1 − δ
Find the worst-case risk of θ̂rand as a function of δ . Optimizing over δ , show the improved
upper bound:
1
R∗n ≤ .
4( n + 1)
(e) As discussed in Remark 28.3, randomized estimator can always be improved if the loss is
convex; so we should average out the randomness in (VI.11) by considering the estimator
1
θ̂∗ = E[θ̂rand |X] = X̄δ + (1 − δ). (VI.12)
2
Optimizing over δ to minimize the worst-case risk, find the resulting estimator θ̂∗ and its
risk, show that it is constant (independent of θ), and conclude
1
R∗n ≤ √ .
4(1 + n)2
(f) Next we show θ̂∗ found in part (e) is exactly minimax and hence
1
R∗n = √ .
4(1 + n)2
Consider the following prior Beta(a, b) with density:
Γ(a + b) a−1
π (θ) = θ (1 − θ)b−1 , θ ∈ [0, 1],
Γ(a)Γ(b)
R∞ √
where Γ(a) ≜ 0 xa−1 e−x dx. Show that if a = b = 2n , θ̂∗ coincides with the Bayes
estimator for this prior, which is therefore least favorable. (Hint: work with the sufficient
statistic S = X1 + . . . + Xn .)
(g) Show that the least favorable prior is not unique; in fact, there is a continuum of them. (Hint:
consider the Bayes estimator E[θ|X] and show that it only depends on the first n + 1 moments
of π.)
i i
i i
i i
i.i.d.
(h) (Larger alphabet) Suppose X1 , . . . , Xn ∼ P on [k]. Show that for any k, n, the minimax
squared risk of estimating P in Theorem 29.5 is exactly
b − Pk22 ] = √ 1 k−1
R∗sq (k, n) = inf sup E[kP 2
, (VI.13)
b
P P∈Pk ( n + 1) k
√
achieved by the add- kn estimator. (Hint: For the lower bound, show that the Bayes estimator
for the squared loss and the KL loss coincide, then apply (VI.9) in Exercise VI.6.)
i.i.d.
(i) (Nonparametric extension) Suppose X1 , . . . , Xn ∼ P, where P is an arbitrary probability
distribution on [0, 1]. The goal is to estimate the mean of P under the quadratic loss. Show
that the minimax risk equals 4(1+1√n)2 .
VI.8 (Distribution estimation in TV) Continuing (VI.13), we show that the minimax rate for
estimating P with respect to the total variation loss is
r
∗ k
RTV (k, n) ≜ inf sup EP [TV(P̂, P)] ∧ 1, ∀ k ≥ 2, n ≥ 1, (VI.14)
P̂ P∈Pk ) n
(a) Show that the MLE coincides with the empirical distribution.
(b) Show that the MLE achieves the RHS of (VI.14) within constant factors.
(c) Establish the minimax lower bound. (Hint: apply Assouad’s lemma, or Fano’s inequality
(with volume method or explicit construction of packing), or the mutual information method
directly.)
VI.9 (Distribution estimation in KL and χ2 ) Continuing Exercise VI.8, let us consider estimating the
distribution P in KL and χ2 divergence, which are unbounded loss. We show that
(k
∗ k k≤n
RKL (k, n) ≜ inf sup EP [D(P̂kP)] log 1 + n k (VI.15)
P̂ P∈Pk n log n k ≥ n
and
k
R∗χ2 (k, n) ≜ inf sup EP [χ2 (P̂kP)] . (VI.16)
P̂ P∈Pk ) n
To this end, we will apply results on Bayes risk in Exercise VI.6 as well as multiple inequalities
between f-divergences from Chapter 7.
(a) Show that the empirical distribution, which has been shown optimal for the TV loss in
Exercise VI.8, achieves infinite KL and χ2 loss in the worst case.
(b) To show the upper bound in (VI.16), consider the add-α estimator P̂ in (VI.9) with α = 1.
Show that
k−1
E[χ2 (PkP̂)] ≤ .
n+1
Using D ≤ log(1 + χ2 ) – cf. (7.31), conclude the upper bound part of (VI.15). (Hint:
EN∼Bin(n,p) [ N+
1
1 ] = (n+1)p (1 − p̄
1 n+1
).
(c) Show that for the small alphabet regime of k ≲ n, the lower bound follows from that of
(VI.15) and Pinsker’s inequality (7.25).
i i
i i
i i
(d) Next assume k ≥ 4n. Consider a Dirichlet(α, . . . , α) prior in (13.15). Applying the formula
(VI.7) and (VI.8) for the Bayes χ2 risk and choosing α = n/k, show the lower bound in
(VI.16).
(e) Finally, we prove the lower bound in (VI.15). Consider the prior under which P is uniform
over a set S chosen uniformly at random from all s-subsets of [k] and s is some constant to
be specified. Applying (VI.6), show that for this prior the Bayes estimator for KL loss takes
a natural form:
(
1
i ∈ Ŝ
P̂j = 1s −ŝ/s
k−ŝ i∈/ Ŝ
(g) Using (7.28), show that D(PkP̂) ≥ Ω(log nk ). (Note that (7.28) is convex in TV so Jensen’s
inequality applies.)
Note: The following refinement of (VI.15) was known:
• For fixed k, a deep result of [49, 48] is that R∗KL (k, n) = k−12n+o(1)
, achieved by an add-c
estimator where c is a function of the empirical count chosen using polynomial approximation
arguments.
• When k n, R∗KL (k, n) = log nk (1 + o(1)), shown in [230] by a careful analysis of the
Dirichlet prior.
VI.10 (Nonparametric location model) In this exercise we consider some nonparametric extensions
i.i.d.
of the Gaussian location model and the Bernoulli model. Observing X1 , . . . , Xn ∼ P for some
P ∈ P , where P is a collection of distributions on the real line, our goal is to estimate the mean
R
of the distribution P: μ(P) ≜ xP(dx), which is a linear functional of P. Denote the minimax
quadratic risk by
(a) Let P be the class of distributions (which need not have a density) on the real line with
2
variance at most σ 2 . Show that R∗n = σn .
(b) Let P = P([0, 1]), the collection of all probability distributions on [0, 1]. Show that
R∗n = 4(1+1√n)2 . (Hint: For the upper bound, using the fact that, for any [0, 1]-valued ran-
dom variable Z, Var(Z) ≤ E[Z](1 − E[Z]), mimic the analysis of the estimator (VI.12) in
Ex. VI.7e.)
VI.11 Prove Theorem 30.4 using Fano’s method. (Hint: apply Theorem 31.3 with T = ϵ · Skd , where
Sdk denotes the Hamming sphere of radius k in d dimensions. Choose ϵ appropriately and apply
the Gilbert-Varshamov bound for the packing number of Sdk in Theorem 27.6.)
VI.12 (Sharp minimax rate in sparse denoising) Continuing Theorem 30.4, in this exercise we deter-
mine the sharp minimax risk for denoising a high-dimensional sparse vector. In the notation of
(30.13), we show that, for the d-dimensional GLM model X ∼ N (θ, Id ), the following minimax
i i
i i
i i
For the lower bound, consider the prior π under which θ is uniformly p distributed over
{τ e1 , . . . , τ ed }, where ei ’s denote the standard basis. Let τ = (2 − ϵ) log d. Show that
for any ϵ > 0, the Bayes risk is given by
(Hint: either apply the mutual information method, or directly compute the Bayes risk by
evaluating the conditional mean and conditional variance.)
(b) Demonstrate an estimator θ̂ that achieves the RHS of (VI.18) asymptotically. (Hint: consider
the hard-thresholding estimator (30.13) or the MLE (30.11).)
(c) To prove the lower bound part of (VI.17), prove the following generic result
d
R∗ (k, d) ≥ kR∗ 1,
k
and then apply (VI.18). (Hint: consider a prior of d/k blocks each of which is 1-sparse.)
(d) Similar to the 1-sparse case, demonstrate an estimator θ̂ that achieves the RHS of (VI.17)
asymptotically.
Note: For both the upper and lower bound, the normal tail bound in Exercise V.8 is helpful.
VI.13 Consider the following functional estimation problem in GLM. Observing X ∼ N(θ, Id ), we
intend to estimate the maximal coordinate of θ: T(θ) = θmax ≜ max{θ1 , . . . , θd }. Prove the
minimax rate:
(a) Prove the upper bound by considering T̂ = Xmax , the plug-in estimator with the MLE.
(b) For the lower bound, consider two hypotheses:
H0 : θ = 0, H1 : θ ∼ Unif {τ e1 , τ e2 , . . . , τ ed } .
where ei ’s are the standard bases and τ > 0. Then under H0 , X ∼ P0 = N(0, Id ); under H1 ,
Pd
X ∼ P1 = 1d i=1 N(τ ei , Id ). Show that
2
eτ − 1
χ2 (P1 kP0 ) = .
d
(c) Applying the joint range (7.29) (or (7.35)) to bound TV(P0 , P1 ), conclude the lower bound
part of (VI.19) via Le Cam’s method (Theorem 31.1).
i i
i i
i i
(d) By improving both the upper and lower bound prove the sharp version:
1
inf sup Eθ (T̂ − θmax )2 = + o(1) log d, d → ∞. (VI.20)
T̂ θ∈Rd 2
VI.14 (Suboptimality of MLE in high dimensions) Consider the d-dimensional GLM: X ∼ N (θ, Id ),
where θ belongs to the parameter space
n o
Θ = θ ∈ Rd : |θ1 | ≤ d1/4 , kθ\1 k2 ≤ 2(1 − d−1/4 |θ1 |)
with θ\1 ≡ (θ2 , . . . , θd ). For the square loss, prove the following for sufficiently large d.
(a) The minimax risk is bounded:
inf sup Eθ [kθ̂ − θk22 ] ≲ 1.
θ̂ θ∈Θ
is unbounded:
√
sup Eθ [kθ̂MLE − θk22 ] ≳ d.
θ∈Θ
i.i.d.
VI.15 (Covariance model) Let X1 , . . . , Xn ∼ N(0, Σ), where Σ is a d × d covariance matrix. Let us
show that the minimax quadratic risk of estimating Σ using X1 , . . . , Xn satisfies
d
inf sup E[kΣ̂ − Σk2F ] ∧ 1 r2 , ∀ r > 0, d, n ∈ N . (VI.21)
Σ̂ ∥Σ∥F ≤r n
P
where kΣ̂ − Σk2F = ij (Σ̂ij − Σij )2 .
(a) Show that unlike location model, without restricting to a compact parameter space for Σ,
the minimax risk in (VI.21) is infinite.
Pn
(b) Consider the sample covariance matrix Σ̂ = 1n i=1 Xi X⊤ i . Show that
1
E[kΣ̂ − Σk2F ] =kΣk2F + Tr(Σ)2
n
and use this to deduce the minimax upper bound in (VI.21).
(c) To prove the minimax lower bound, we can proceed in several steps. Show that for any
positive semidefinite (PSD) Σ0 , Σ1 0, the KL divergence satisfies
1 1/2 1/2
D(N(0, Id + Σ0 )kN(0, Id + Σ1 )) ≤ kΣ − Σ1 k2F , (VI.22)
2 0
where Id is the d × d identity matrix.
(d) Let B(δ) denote the Frobenius ball of radius δ centered at the zero matrix. Let PSD = {X :
X 0} denote the collection of d× PSD matrices. Show that
vol(B(δ) ∩ PSD)
= P [ Z 0] , (VI.23)
vol(B(δ))
i i
i i
i i
i.i.d.
where Z is a GOE matrix, that is, Z is symmetric with independent diagonals Zii ∼ N(0, 2)
i.i.d.
and off-diagonals Zij ∼ N(0, 1).
2
(e) Show that P [Z 0] ≥ cd for some absolute constant c.3
(f) Prove the following lower bound on the packing number on the set of PSD matrices:
d2 /2
c′ δ
M(B(δ) ∩ PSD, k · kF , ϵ) ≥ (VI.24)
ϵ
for some absolute constant c′ . (Hint: Use the volume bound and the result of Part (d) and
(e).) √
(g) Complete the proof of lower bound of (VI.21). (Hint: WLOG, we can consider r d and
2
aim for the lower bound Ω( dn ∧ d). Take the packing from (VI.24) and shift by the identity
matrix I. Then apply Fano’s method and use (VI.22).)
VI.16 For a family of probability distributions P and a functional T : P → R define its χ2 -modulus
of continuity as
When the functional T is affine, and continuous, and P is compact4 it can be shown [251] that
1
δ 2 (1/n)2 ≤ inf sup E i.i.d. (T(P) − T̂n (X1 , . . . , Xn ))2 ≤ δχ2 (1/n)2 . (VI.25)
7 χ T̂n P∈P Xi ∼ P
Consider the following problem (interval censored model): In i-th mouse a tumour develops at
i.i.d.
time Ai ∈ [0, 1] with Ai ∼ π where π is a pdf on [0, 1] bounded by 21 ≤ π ≤ 2. For each i the
i.i.d.
existence of tumour is checked at another random time Bi ∼ Unif(0, 1) with Bi ⊥ ⊥ Ai . Given
observations Xi = (1{Ai ≤ Bi }, Bi ) one is trying to estimate T(π ) = π [A ≤ 1/2]. Show that
VI.17 (Comparison between contraction coefficients.) Let X be a random variable with distribution
PX , and let PY|X be a Markov kernel. For an f-divergence, define
Df (PY|X ◦ QX kPY|X ◦ PX )
ηf (PY|X , PX ) ≜ sup .
QX :0<Df (QX ∥PX )<∞ Df (QX kPX )
Prove that
3
Getting the exact exponent is a difficult result (cf. [17]). Here we only need some crude estimate.
4
Both under the same, but otherwise arbitrary topology on P.
i i
i i
i i
VI.18 (χ2 -contraction for Markov chains.) In this exercise we prove (33.4). Let P = (P(x, y)) denote
the transition matrix of a time-reversible Markov chain with finite state space X and stationary
distribution π, so that π (x)P(x, y) = π (y)P(y, x) for all x, y ∈ X . It is known that the k = |X |
eigenvalues of P satisfy 1 = λ1 ≥ λ2 ≥ . . . ≥ λk ≥ −1. Define by γ∗ ≜ max{λ2 , |λk |} the
absolute spectral gap.
(a) Show that
VI.20 (BMS channel comparison [269]) Below X ∼ Ber(1/2) and PY|X is an input-symmetric chan-
nel (BMS). It turns out that BSC and BEC are extremal for various partial orders. Prove the
following statements.
(a) If ITV (X; Y) = 12 (1 − 2δ), then
i i
i i
i i
VI.21 (Broadcasting on Trees with BSC [149]) We have seen that Broadcasting on Trees with BSCδ
has non-reconstruction when b(1 − 2δ)2 < 1. In this exercise we prove the achievability bound
(known as the Kesten-Stigum bound [176]) using channel comparison.
We work with an infinite b-ary tree with BSCδ edge channels. Let ρ be the root and Lk be the
set of nodes at distance k to ρ. Let Mk denote the channel Xρ → XLk .
In the following, assume that b(1 − 2δ)2 > 1.
1
(a) Prove that there exists τ < 2 such that
Note: It is known that pc ≈ 0.645 (e.g. [167]). Using site percolation we can prove
non-reconstruction whenever
p
1 − 2δ + 4δ 3 − 2δ 4 − 2δ(1 − δ) δ(1 + δ)(1 − δ)(2 − δ) < p′c ,
where p′c ≈ 0.705 is the directed site percolation threshold. One can check that the bound from
bond percolation is stronger.
VI.23 (Input-dependent contraction coefficient for coloring channel [148]) Fix an integer q ≥ 3 and
let X = [q]. Consider the following coloring channel K : X → X :
(
0 y = x,
K(y|x) = 1
q−1 y 6= x.
i i
i i
i i
VI.24 ([148]) Fix an integer q ≥ 2 and let X = [q]. Let λ ∈ [− q−1 1 , 1] be a real number. Define the
Potts channel Pλ : X → X as
(
λ + 1−λ y = x,
P λ ( y| x) = 1−λ
q
q y 6= x.
Prove that
qλ 2
ηKL (Pλ ) = .
(q − 2)λ + 2
VI.25 (Spectral Independence) Say a probability distribution μ = μXn supported on [q]n is c-pairwise
independent if for every T ⊂ [n], σT ∈ [q]T the conditional measure μ(σT ) ≜ μXTc |XT =σT and
every νXcT satisfies
X (σ ) c X (σ )
D(νXi,j || μXi,Tj ) ≥ (2 − ) D(νXi || μXi T ) .
n − | T | − 1
i̸=j∈T
c i∈ T c
where ECτ is the erasure channel, cf. Example 33.7. (Hint: Define f(τ ) = D(ECτ ◦ ν||ECτ ◦ μ)
and prove f′′ (τ ) ≥ τc f′ (τ ).)
Remark: Applying the above with τ = 1n shows that a Markov chain known as (small-block)
Glauber dynamics for μ is mixing in O(nc+1 log n) time. It is known that c-pairwise indepen-
dence is implied (under some additional conditions on μ and q = 2) by the uniform boundedness
of the operator norms of the covariance matrices of all μ(σT ) (see [64] for details).
i i
i i
i i
References
i i
i i
i i
References 585
[18] S. Artstein, V. Milman, and S. J. Szarek, "Duality of metric entropy," Annals of Mathematics, pp. 1313–1328, 2004.
[19] R. B. Ash, Information Theory. New York, NY: Dover Publications Inc., 1965.
[20] A. V. Banerjee, "A simple model of herd behavior," The Quarterly Journal of Economics, vol. 107, no. 3, pp. 797–817, 1992.
[21] A. Barg and G. D. Forney, "Random codes: Minimum distances and error exponents," IEEE Transactions on Information Theory, vol. 48, no. 9, pp. 2568–2573, 2002.
[22] A. Barg and A. McGregor, "Distance distribution of binary codes and the error probability of decoding," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4237–4246, 2005.
[23] S. Barman and O. Fawzi, "Algorithmic aspects of optimal channel coding," IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 1038–1045, 2017.
[24] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–945, 1993.
[25] G. Basharin, "On a statistical estimate for the entropy of a sequence of independent random variables," Theory of Probability & Its Applications, vol. 4, no. 3, pp. 333–336, 1959.
[26] A. Beirami and F. Fekri, "Fundamental limits of universal lossless one-to-one compression of parametric sources," in Information Theory Workshop (ITW), 2014 IEEE. IEEE, 2014, pp. 212–216.
[27] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal, "Entanglement-assisted classical capacity of noisy quantum channels," Physical Review Letters, vol. 83, no. 15, p. 3081, 1999.
[28] W. R. Bennett, "Spectra of quantized signals," The Bell System Technical Journal, vol. 27, no. 3, pp. 446–472, 1948.
[29] J. M. Bernardo, "Reference posterior distributions for Bayesian inference," Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 113–128, 1979.
[30] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1," in Proceedings of ICC'93-IEEE International Conference on Communications, vol. 2. IEEE, 1993, pp. 1064–1070.
[31] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex analysis and optimization. Belmont, MA, USA: Athena Scientific, 2003.
[32] N. Bhatnagar, J. Vera, E. Vigoda, and D. Weitz, "Reconstruction for colorings on trees," SIAM Journal on Discrete Mathematics, vol. 25, no. 2, pp. 809–826, 2011. [Online]. Available: https://doi.org/10.1137/090755783
[33] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99–109, 1943.
[34] L. Birgé, "Approximation dans les espaces métriques et théorie de l'estimation," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 65, no. 2, pp. 181–237, 1983.
[35] ——, "Robust tests for model selection," From probability to statistics and back: high-dimensional models and processes–A Festschrift in honor of Jon A. Wellner, IMS Collections, Volume 9, pp. 47–64, 2013.
[36] M. Š. Birman and M. Solomjak, "Piecewise-polynomial approximations of functions of the classes," Mathematics of the USSR-Sbornik, vol. 2, no. 3, p. 295, 1967.
[37] D. Blackwell, L. Breiman, and A. Thomasian, "The capacity of a class of channels," The Annals of Mathematical Statistics, pp. 1229–1241, 1959.
[38] R. E. Blahut, "Hypothesis testing and information theory," IEEE Trans. Inf. Theory, vol. 20, no. 4, pp. 405–417, 1974.
[39] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov, "On the purity of the limiting Gibbs state for the Ising model on the Bethe lattice," Journal of Statistical Physics, vol. 79, no. 1, pp. 473–482, Apr 1995. [Online]. Available: https://doi.org/10.1007/BF02179399
[40] S. G. Bobkov and F. Götze, "Exponential integrability and transportation cost related to logarithmic Sobolev inequalities," Journal of Functional Analysis, vol. 163, no. 1, pp. 1–28, 1999.
[41] S. Bobkov and G. P. Chistyakov, "Entropy power inequality for the Rényi entropy," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 708–714, 2015.
[42] T. Bohman, "A limit theorem for the Shannon capacities of odd cycles I," Proceedings of the American Mathematical Society, vol. 131, no. 11, pp. 3559–3569, 2003.
[43] H. F. Bohnenblust, "Convex regions and projections in Minkowski spaces," Ann. Math., vol. 39, no. 2, pp. 301–308, 1938.
[44] A. Borovkov, Mathematical Statistics. CRC Press, 1999.
[45] S. Boucheron, G. Lugosi, and O. Bousquet, "Concentration inequalities," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds. Springer, 2004, pp. 208–240.
[46] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[47] O. Bousquet, D. Kane, and S. Moran, "The optimal approximation factor in density estimation," in Conference on Learning Theory. PMLR, 2019, pp. 318–341.
[48] D. Braess and T. Sauer, "Bernstein polynomials and learning theory," Journal of Approximation Theory, vol. 128, no. 2, pp. 187–206, 2004.
[49] D. Braess, J. Forster, T. Sauer, and H. U. Simon, "How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution," in Algorithmic Learning Theory. Springer, 2002, pp. 380–394.
[50] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, "Communication lower bounds for statistical estimation problems via a distributed data processing inequality," in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. ACM, 2016, pp. 1011–1020.
[51] L. M. Bregman, "Some properties of nonnegative matrices and their permanents," Soviet Math. Dokl., vol. 14, no. 4, pp. 945–949, 1973.
[52] L. Breiman, "The individual ergodic theorem of information theory," Ann. Math. Stat., vol. 28, no. 3, pp. 809–811, 1957.
[53] L. Brillouin, Science and information theory, 2nd Ed. Academic Press, 1962.
[54] L. D. Brown, "Fundamentals of statistical exponential families with applications in statistical decision theory," in Lecture Notes-Monograph Series, S. S. Gupta, Ed. Hayward, CA: Institute of Mathematical Statistics, 1986, vol. 9.
[55] P. Bühlmann and S. van de Geer, Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[56] G. Calinescu, C. Chekuri, M. Pal, and J. Vondrák, "Maximizing a monotone submodular function subject to a matroid constraint," SIAM Journal on Computing, vol. 40, no. 6, pp. 1740–1766, 2011.
[57] M. X. Cao and M. Tomamichel, "On the quadratic decaying property of the information rate function," arXiv preprint arXiv:2208.12945, 2022.
[58] O. Catoni, "PAC-Bayesian supervised classification: the thermodynamics of statistical learning," Lecture Notes-Monograph Series. IMS, vol. 1277, 2007.
[59] E. Çinlar, Probability and Stochastics. New York: Springer, 2011.
[60] N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games. Cambridge University Press, 2006.
[61] D. G. Chapman and H. Robbins, "Minimum variance estimation without regularity
[86] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," Advances in neural information processing systems, vol. 26, pp. 2292–2300, 2013.
[87] A. Dembo and O. Zeitouni, Large deviations techniques and applications. New York: Springer Verlag, 2009.
[88] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[89] P. Diaconis and L. Saloff-Coste, "Logarithmic Sobolev inequalities for finite Markov chains," Ann. Probab., vol. 6, no. 3, pp. 695–750, 1996.
[90] P. Diaconis and D. Freedman, "Finite exchangeable sequences," The Annals of Probability, vol. 8, no. 4, pp. 745–764, 1980.
[91] P. Diaconis and D. Stroock, "Geometric bounds for eigenvalues of Markov chains," The Annals of Applied Probability, vol. 1, no. 1, pp. 36–61, 1991.
[92] H. Djellout, A. Guillin, and L. Wu, "Transportation cost-information inequalities and applications to random dynamical systems and diffusions," The Annals of Probability, vol. 32, no. 3B, pp. 2702–2732, 2004.
[93] R. Dobrushin, "Central limit theorem for nonstationary Markov chains, I," Theory Probab. Appl., vol. 1, no. 1, pp. 65–80, 1956.
[94] R. Dobrushin and B. Tsybakov, "Information transmission with additional noise," IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[95] R. L. Dobrushin, "A general formulation of the fundamental theorem of Shannon in the theory of information," Uspekhi Mat. Nauk, vol. 14, no. 6, pp. 3–104, 1959, English translation in Eleven Papers in Analysis: Nine Papers on Differential Equations, Two on Information Theory, American Mathematical Society Translations: Series 2, Volume 33, 1963.
[96] ——, "Mathematical problems in the Shannon theory of optimal coding of information," in Proc. 4th Berkeley Symp. Mathematics, Statistics, and Probability, vol. 1, Berkeley, CA, USA, 1961, pp. 211–252.
[97] ——, "Asymptotic bounds on error probability for transmission over DMC with symmetric transition probabilities," Theor. Probability Appl., vol. 7, pp. 283–311, 1962.
[98] R. Dobrushin, "A simplified method of experimentally evaluating the entropy of a stationary sequence," Theory of Probability & Its Applications, vol. 3, no. 4, pp. 428–430, 1958.
[99] D. L. Donoho, "Wald lecture I: Counting bits with Kolmogorov and Shannon," Note for the Wald Lectures, IMS Annual Meeting, July 1997.
[100] M. D. Donsker and S. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time. IV," Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983.
[101] J. L. Doob, Stochastic Processes. New York: Wiley, 1953.
[102] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and Y. Zhang, "Optimality guarantees for distributed statistical estimation," arXiv preprint arXiv:1405.0782, 2014.
[103] J. Duda, "Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding," arXiv preprint arXiv:1311.2540, 2013.
[104] R. M. Dudley, Uniform central limit theorems. Cambridge University Press, 1999, no. 63.
[105] G. Dueck, "The strong converse to the coding theorem for the multiple–access channel," J. Comb. Inform. Syst. Sci, vol. 6, no. 3, pp. 187–196, 1981.
[106] G. Dueck and J. Korner, "Reliability function of a discrete memoryless channel at rates above capacity (corresp.)," IEEE Transactions on Information Theory, vol. 25, no. 1, pp. 82–85, 1979.
[107] N. Dunford and J. T. Schwartz, Linear operators, part 1: general theory. John Wiley & Sons, 1988, vol. 10.
[108] R. Durrett, Probability: Theory and Examples, 4th ed. Cambridge University Press, 2010.
[109] A. Dytso, S. Yagli, H. V. Poor, and S. S. Shitz, "The capacity achieving distribution for the amplitude constrained additive Gaussian channel: An upper bound on the number of mass points," IEEE Transactions on Information Theory, vol. 66, no. 4, pp. 2006–2022, 2019.
[110] H. G. Eggleston, Convexity, ser. Tracts in Math and Math. Phys. Cambridge University Press, 1958, vol. 47.
[111] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge University Press, 2011.
[112] P. Elias, "The efficient construction of an unbiased random sequence," Annals of Mathematical Statistics, vol. 43, no. 3, pp. 865–870, 1972.
[113] ——, "Coding for noisy channels," IRE Convention Record, vol. 3, pp. 37–46, 1955.
[114] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003.
[115] K. Eswaran and M. Gastpar, "Remote source coding under Gaussian noise: Dueling roles of power and entropy power," IEEE Transactions on Information Theory, 2019.
[116] W. Evans and N. Pippenger, "On the maximum tolerable noise for reliable computation by formulas," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1299–1305, May 1998.
[117] W. S. Evans and L. J. Schulman, "Signal propagation and noisy circuits," IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2367–2373, Nov 1999.
[118] M. Feder, "Gambling using a finite state machine," IEEE Transactions on Information Theory, vol. 37, no. 5, pp. 1459–1465, 1991.
[119] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Trans. Inf. Theory, vol. 38, no. 4, pp. 1258–1270, 1992.
[120] M. Feder and Y. Polyanskiy, "Sequential prediction under log-loss and misspecification," arXiv preprint arXiv:2102.00050, 2021.
[121] A. A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, Jun. 2003.
[122] W. Feller, An Introduction to Probability Theory and Its Applications, 3rd ed. New York: Wiley, 1970, vol. I.
[123] ——, An Introduction to Probability Theory and Its Applications, 2nd ed. New York: Wiley, 1971, vol. II.
[124] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach. New York, NY: Academic Press, 1967.
[125] ——, "An inconsistent maximum likelihood estimate," Journal of the American Statistical Association, vol. 77, no. 380, pp. 831–834, 1982.
[126] ——, A course in large sample theory. CRC Press, 1996.
[127] R. A. Fisher, "The logic of inductive inference," Journal of the Royal Statistical Society, vol. 98, no. 1, pp. 39–82, 1935.
[128] B. M. Fitingof, "The compression of discrete information," Problemy Peredachi Informatsii, vol. 3, no. 3, pp. 28–36, 1967.
[129] P. Fleisher, "Sufficient conditions for achieving minimum distortion in a quantizer," IEEE Int. Conv. Rec., pp. 104–111, 1964.
[130] G. D. Forney, "Concatenated codes," MIT RLE Technical Rep., vol. 440, 1965.
[131] E. Friedgut and J. Kahn, "On the number of copies of one hypergraph in another," Israel J. Math., vol. 105, pp. 251–256, 1998. [Online]. Available: http://dx.doi.org/10.1007/BF02780332
[132] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inf. Theory, vol. 11, no. 1, pp. 3–18, 1965.
[133] ——, Information Theory and Reliable Communication. New York: Wiley, 1968.
[134] R. Gallager, "The random coding bound is tight for the average code (corresp.)," IEEE Transactions on Information Theory, vol. 19, no. 2, pp. 244–246, 1973.
[135] R. Gardner, "The Brunn-Minkowski inequality," Bulletin of the American Mathematical Society, vol. 39, no. 3, pp. 355–405, 2002.
[136] I. M. Gel'fand, A. N. Kolmogorov, and A. M. Yaglom, "On the general definition of the amount of information," Dokl. Akad. Nauk. SSSR, vol. 11, pp. 745–748, 1956.
[137] G. L. Gilardoni, "On a Gel'fand-Yaglom-Peres theorem for f-divergences," arXiv preprint arXiv:0911.1934, 2009.
[138] ——, "On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5377–5386, 2010.
[139] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: a Bayesian Cramér-Rao bound," Bernoulli, vol. 1, no. 1–2, pp. 59–79, 1995.
[140] C. Giraud, Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2014.
[141] O. Goldreich, Introduction to property testing. Cambridge University Press, 2017.
[142] V. Goodman, "Characteristics of normal samples," The Annals of Probability, vol. 16, no. 3, pp. 1281–1290, 1988.
[143] V. D. Goppa, "Codes and information," Russian Mathematical Surveys, vol. 39, no. 1, p. 87, 1984.
[144] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2325–2383, 1998.
[145] R. M. Gray, Entropy and Information Theory. New York, NY: Springer-Verlag, 1990.
[146] U. Grenander and G. Szegö, Toeplitz forms and their applications, 2nd ed. New York: Chelsea Publishing Company, 1984.
[147] L. Gross, "Logarithmic Sobolev inequalities," American Journal of Mathematics, vol. 97, no. 4, pp. 1061–1083, 1975.
[148] Y. Gu and Y. Polyanskiy, "Non-linear log-Sobolev inequalities for the Potts semigroup and applications to reconstruction problems," arXiv preprint arXiv:2005.05444, 2020.
[149] Y. Gu, H. Roozbehani, and Y. Polyanskiy, "Broadcasting on trees near criticality," in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 1504–1509.
[150] U. Hadar, J. Liu, Y. Polyanskiy, and O. Shayevitz, "Communication complexity of estimating correlations," in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. ACM, 2019, pp. 792–803.
[151] B. Hajek, Y. Wu, and J. Xu, "Information limits for recovering a hidden community," IEEE Trans. on Information Theory, vol. 63, no. 8, pp. 4729–4745, 2017.
[152] J. Hájek, "Local asymptotic minimax and admissibility in estimation," in Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, vol. 1, 1972, pp. 175–194.
[153] J. M. Hammersley, "On estimating restricted parameters," Journal of the Royal Statistical Society, Series B (Methodological), vol. 12, no. 2, pp. 192–240, 1950.
[154] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 752–772, 1993.
[155] P. Harremoës and I. Vajda, "On pairs of f-divergences and their joint range," IEEE Trans. Inf. Theory, vol. 57, no. 6, pp. 3230–3235, Jun. 2011.
[156] B. Harris, "The statistical estimation of entropy in the non-parametric case," in Topics in Information Theory, I. Csiszár and P. Elias, Eds. Springer Netherlands, 1975, vol. 16, pp. 323–355.
[157] D. Haussler and M. Opper, "Mutual information, metric entropy and cumulative relative entropy risk," The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
[158] M. Hayashi, "General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1562–1575, 2006.
[159] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," The Annals of Mathematical Statistics, pp. 369–401, 1965.
[160] P. J. Huber, "Fisher information and spline interpolation," Annals of Statistics, pp. 1029–1033, 1974.
[161] ——, "A robust version of the probability ratio test," The Annals of Mathematical Statistics, pp. 1753–1758, 1965.
[162] I. A. Ibragimov and R. Z. Khas'minskii, Statistical Estimation: Asymptotic Theory. Springer, 1981.
[163] S. Ihara, "On the capacity of channels with additive non-Gaussian noise," Information and Control, vol. 37, no. 1, pp. 34–39, 1978.
[164] ——, Information theory for continuous systems. World Scientific, 1993, vol. 2.
[165] Y. I. Ingster and I. A. Suslina, Nonparametric goodness-of-fit testing under Gaussian models. New York, NY: Springer, 2003.
[166] Y. I. Ingster, "Minimax testing of nonparametric hypotheses on a distribution density in the lp metrics," Theory of Probability & Its Applications, vol. 31, no. 2, pp. 333–337, 1987.
[167] I. Jensen and A. J. Guttmann, "Series expansions of the percolation probability for directed square and honeycomb lattices," Journal of Physics A: Mathematical and General, vol. 28, no. 17, p. 4813, 1995.
[168] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[169] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan, "Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences," in Advances in neural information processing systems, 2016, pp. 4116–4124.
[170] W. B. Johnson, G. Schechtman, and J. Zinn, "Best constants in moment inequalities for linear combinations of independent and exchangeable random variables," The Annals of Probability, pp. 234–253, 1985.
[171] I. Johnstone, Gaussian estimation: Sequence and wavelet models, 2011, available at http://www-stat.stanford.edu/~imj/.
[172] L. K. Jones, "A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training," The Annals of Statistics, pp. 608–613, 1992.
[173] A. B. Juditsky and A. S. Nemirovski, "Nonparametric estimation by convex programming," The Annals of Statistics, vol. 37, no. 5A, pp. 2278–2300, 2009.
[174] T. Kawabata and A. Dembo, "The rate-distortion dimension of sets and measures," IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
[175] M. Keane and G. O'Brien, "A Bernoulli factory," ACM Transactions on Modeling and Computer Simulation, vol. 4, no. 2, pp. 213–219, 1994.
[176] H. Kesten and B. P. Stigum, "Additional limit theorems for indecomposable multidimensional Galton-Watson processes," The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1463–1481, 1966.
[177] T. Koch, "The Shannon lower bound is asymptotically tight," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6155–6161, 2016.
[178] Y. Kochman, O. Ordentlich, and Y. Polyanskiy, "A lower bound on the expected distortion of joint source-channel coding," IEEE Transactions on Information Theory, vol. 66, no. 8, pp. 4722–4741, 2020.
[179] A. Kolchinsky and B. D. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[180] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces," Uspekhi Matematicheskikh Nauk, vol. 14, no. 2, pp. 3–86, 1959, reprinted in Shiryayev, A. N., ed., Selected Works of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, Vol. 27, Springer Netherlands, 1993, pp. 86–170.
[181] I. Kontoyiannis and S. Verdú, "Optimal lossless data compression: Non-asymptotics and asymptotics," IEEE Trans. Inf. Theory, vol. 60, no. 2, pp. 777–795, 2014.
[182] J. Körner and A. Orlitsky, "Zero-error information theory," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2207–2229, 1998.
[183] V. Koshelev, "Quantization with minimal entropy," Probl. Pered. Inform, vol. 14, pp. 151–156, 1963.
[184] O. Kosut and L. Sankar, "Asymptotics and non-asymptotics for universal fixed-to-variable source coding," arXiv preprint arXiv:1412.4444, 2014.
[185] A. Krause and D. Golovin, "Submodular function maximization," Tractability, vol. 3, pp. 71–104, 2014.
[186] J. Kuelbs, "A strong convergence theorem for Banach space valued random variables," The Annals of Probability, vol. 4, no. 5, pp. 744–771, 1976.
[187] J. Kuelbs and W. V. Li, "Metric entropy and the small ball problem for Gaussian measures," Journal of Functional Analysis, vol. 116, no. 1, pp. 133–157, 1993.
[188] S. Kullback, Information theory and statistics. Mineola, NY: Dover Publications, 1968.
[189] C. Külske and M. Formentin, "A symmetric entropy bound on the non-reconstruction regime of Markov chains on Galton-Watson trees," Electronic Communications in Probability, vol. 14, pp. 587–596, 2009.
[190] A. Lapidoth, A foundation in digital communication. Cambridge University Press, 2017.
[191] A. Lapidoth and S. M. Moser, "Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels," IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2426–2467, 2003.
[192] L. Le Cam, "Convergence of estimates under dimensionality restrictions," Annals of Statistics, vol. 1, no. 1, pp. 38–53, 1973.
[193] ——, Asymptotic methods in statistical decision theory. New York, NY: Springer-Verlag, 1986.
[194] C. C. Leang and D. H. Johnson, "On the asymptotics of m-hypothesis Bayesian detection," IEEE Transactions on Information Theory, vol. 43, no. 1, pp. 280–282, 1997.
[195] K. Lee, Y. Wu, and Y. Bresler, "Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization," IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1666–1698, Mar. 2018.
[196] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York, NY: Springer, 1998.
[197] E. Lehmann and J. Romano, Testing Statistical Hypotheses, 3rd ed. Springer, 2005.
[198] W. V. Li and W. Linde, "Approximation, metric entropy and small ball estimates for Gaussian measures," The Annals of Probability, vol. 27, no. 3, pp. 1556–1578, 1999.
[199] W. V. Li and Q.-M. Shao, "Gaussian processes: inequalities, small ball probabilities and applications," Handbook of Statistics, vol. 19, pp. 533–597, 2001.
[200] T. Linder and R. Zamir, "On the asymptotic tightness of the Shannon lower bound,"
maximizing submodular set functions–I," Mathematical programming, vol. 14, no. 1, pp. 265–294, 1978.
[224] J. Neveu, Mathematical foundations of the calculus of probability. Holden-Day, 1965.
[225] M. E. Newman, "Power laws, Pareto distributions and Zipf's law," Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
[226] M. Okamoto, "Some inequalities relating to the partial sum of binomial probabilities," Annals of the Institute of Statistical Mathematics, vol. 10, no. 1, pp. 29–35, 1959.
[227] B. Oliver, J. Pierce, and C. Shannon, "The philosophy of PCM," Proceedings of the IRE, vol. 36, no. 11, pp. 1324–1331, 1948.
[228] Y. Oohama, "On two strong converse theorems for discrete memoryless channels," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 98, no. 12, pp. 2471–2475, 2015.
[229] O. Ordentlich and Y. Polyanskiy, "Strong data processing constant is achieved by binary inputs," IEEE Trans. Inf. Theory, vol. 68, no. 3, pp. 1480–1481, Mar. 2022.
[230] L. Paninski, "Variational minimax estimation of discrete distributions under KL loss," Advances in Neural Information Processing Systems, vol. 17, 2004.
[231] P. Panter and W. Dite, "Quantization distortion in pulse-count modulation with nonuniform spacing of levels," Proceedings of the IRE, vol. 39, no. 1, pp. 44–48, 1951.
[232] M. Pardo and I. Vajda, "About distances of discrete distributions satisfying the data processing theorem of information theory," IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1288–1293, 1997.
[233] Y. Peres, "Iterating von Neumann's procedure for extracting random bits," Annals of Statistics, vol. 20, no. 1, pp. 590–597, 1992.
[234] M. S. Pinsker, "Optimal filtering of square-integrable signals in Gaussian noise," Problemy Peredachi Informatsii, vol. 16, no. 2, pp. 52–68, 1980.
[235] G. Pisier, The volume of convex bodies and Banach space geometry. Cambridge University Press, 1999.
[236] J. Pitman, "Probabilistic bounds on the coefficients of polynomials with only real zeros," Journal of Combinatorial Theory, Series A, vol. 77, no. 2, pp. 279–303, 1997.
[237] E. Plotnik, M. J. Weinberger, and J. Ziv, "Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm," IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 66–72, 1992.
[238] Y. Polyanskiy, "Channel coding: non-asymptotic fundamental limits," Ph.D. dissertation, Princeton Univ., Princeton, NJ, USA, 2010.
[239] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[240] ——, "Dispersion of the Gilbert-Elliott channel," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 1829–1848, Apr. 2011.
[241] ——, "Feedback in the non-asymptotic regime," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 4903–4925, Apr. 2011.
[242] ——, "Minimum energy to send k bits with and without feedback," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4880–4902, Aug. 2011.
[243] Y. Polyanskiy and S. Verdú, "Arimoto channel coding converse and Rényi divergence," in Proceedings of the Forty-eighth Annual Allerton Conference on Communication, Control, and Computing, 2010, pp. 1327–1333.
[244] Y. Polyanskiy and S. Verdú, "Arimoto channel coding converse and Rényi divergence," in Proc. 2010 48th Allerton Conference, Allerton Retreat Center, Monticello, IL, USA, Sep. 2010.
[245] Y. Polyanskiy and S. Verdu, "Binary hypothesis testing with feedback," in Information Theory and Applications Workshop (ITA), 2011.
[246] Y. Polyanskiy and S. Verdú, "Empirical distribution of good channel codes with non-vanishing error probability," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 5–21, Jan. 2014.
[247] Y. Polyanskiy and Y. Wu, "Peak-to-average power ratio of good codes for Gaussian channel," IEEE Trans. Inf. Theory, vol. 60, no. 12, pp. 7655–7660, Dec. 2014.
[248] Y. Polyanskiy, "Saddle point in the minimax converse for channel coding," IEEE Transactions on Information Theory, vol. 59, no. 5, pp. 2576–2595, 2012.
[249] ——, "On dispersion of compound DMCs," in 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2013, pp. 26–32.
[250] Y. Polyanskiy and Y. Wu, "Strong data-processing inequalities for channels and Bayesian networks," in Convexity and Concentration, The IMA Volumes in Mathematics and its Applications, vol. 161, E. Carlen, M. Madiman, and E. M. Werner, Eds. New York, NY: Springer, 2017, pp. 211–249.
[251] ——, "Dualizing Le Cam's method for functional estimation, with applications to estimating the unseens," arXiv preprint arXiv:1902.05616, 2019.
[252] ——, "Application of the information-percolation method to reconstruction problems on graphs," Mathematical Statistics and Learning, vol. 2, no. 1, pp. 1–24, 2020.
[253] ——, "Self-regularizing property of nonparametric maximum likelihood estimator in mixture models," arXiv preprint arXiv:2008.08244, 2020.
[254] E. C. Posner and E. R. Rodemich, "Epsilon entropy and data compression," Annals of Mathematical Statistics, vol. 42, no. 6, pp. 2079–2125, Dec. 1971.
[255] A. Prékopa, "Logarithmic concave measures with application to stochastic programming," Acta Scientiarum Mathematicarum, vol. 32, pp. 301–316, 1971.
[256] J. Radhakrishnan, "An entropy proof of Bregman's theorem," J. Combin. Theory Ser. A, vol. 77, no. 1, pp. 161–164, 1997.
[257] M. Raginsky, "Strong data processing inequalities and ϕ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[258] M. Raginsky and I. Sason, "Concentration of measure inequalities in information theory, communications, and coding," Foundations and Trends® in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013.
[259] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bull. Calc. Math. Soc., vol. 37, pp. 81–91, 1945.
[260] A. H. Reeves, "The past present and future of PCM," IEEE Spectrum, vol. 2, no. 5, pp. 58–62, 1965.
[261] A. Rényi, "On the dimension and entropy of probability distributions," Acta Mathematica Hungarica, vol. 10, no. 1–2, Mar. 1959.
[262] R. B. Reznikova Zh., "Analysis of the language of ants by information-theoretical methods," Problemi Peredachi Informatsii, vol. 22, no. 3, pp. 103–108, 1986, English translation: http://reznikova.net/R-R-entropy-09.pdf.
[263] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.
[264] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge University Press, 2008.
[265] P. Rigollet and J.-C. Hütter, "High dimensional statistics," Lecture Notes for 18.657, MIT, 2017, https://math.mit.edu/~rigollet/PDFs/RigNotes17.pdf.
[266] Y. Rinott, "On convexity of measures," Annals of Probability, vol. 4, no. 6, pp. 1020–1026, 1976.
[267] J. J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[268] C. Rogers, Packing and Covering, ser. Cambridge tracts in mathematics and mathematical physics. Cambridge University Press, 1964.
[269] H. Roozbehani and Y. Polyanskiy, "Low density majority codes and the problem of graceful degradation," arXiv preprint arXiv:1911.12263, 2019.
[270] H. P. Rosenthal, "On the subspaces of lp (p > 2) spanned by sequences of independent random variables," Israel Journal of Mathematics, vol. 8, no. 3, pp. 273–303, 1970.
[271] D. Russo and J. Zou, "Controlling bias in adaptive data analysis using information theory," in Artificial Intelligence and Statistics. PMLR, 2016, pp. 1232–1240.
[272] I. Sason and S. Verdu, "f-divergence inequalities," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
[273] G. Schechtman, "Extremal configurations for moments of sums of independent positive random variables," in Banach Spaces and their Applications in Analysis. De Gruyter, 2011, pp. 183–192.
[274] M. J. Schervish, Theory of statistics. Springer-Verlag New York, 1995.
[275] A. Schrijver, Theory of linear and integer programming. John Wiley & Sons, 1998.
[276] C. E. Shannon, "A symbolic analysis of relay and switching circuits," Electrical Engineering, vol. 57, no. 12, pp. 713–723, Dec 1938.
[277] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, Jul./Oct. 1948.
[278] C. E. Shannon, R. G. Gallager, and E. R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels I," Inf. Contr., vol. 10, pp. 65–103, 1967.
[279] C. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956.
[280] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Nat. Conv. Rec, vol. 4, no. 142-163, p. 1, 1959.
[281] O. Shayevitz, "On Rényi measures and hypothesis testing," in 2011 IEEE International Symposium on Information Theory Proceedings. IEEE, 2011, pp. 894–898.
[282] O. Shayevitz and M. Feder, "Optimal feedback communication via posterior matching," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1186–1222, 2011.
[283] G. Simons and M. Woodroofe, "The Cramér-Rao inequality holds almost everywhere," in Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday. Academic, New York, 1983, pp. 69–93.
[284] R. Sinkhorn, "A relationship between arbitrary positive matrices and doubly stochastic matrices," Ann. Math. Stat., vol. 35, no. 2, pp. 876–879, 1964.
[285] M. Sion, "On general minimax theorems," Pacific J. Math, vol. 8, no. 1, pp. 171–176, 1958.
[286] M.-K. Siu, "Which Latin squares are Cayley tables?" Amer. Math. Monthly, vol. 98, no. 7, pp. 625–627, Aug. 1991.
[287] D. Slepian and H. O. Pollak, "Prolate spheroidal wave functions, Fourier analysis and uncertainty—I," Bell System Technical Journal, vol. 40, no. 1, pp. 43–63, 1961.
[288] A. Sly, "Reconstruction of random colourings," Communications in Mathematical Physics, vol. 288, no. 3, pp. 943–961, Jun 2009. [Online]. Available: https://doi.org/10.1007/s00220-009-0783-7
[289] B. Smith, "Instantaneous companding of quantized signals," Bell System Technical Journal, vol. 36, no. 3, pp. 653–709, 1957.
[290] J. G. Smith, "The information capacity of amplitude and variance-constrained scalar Gaussian channels," Information and Control, vol. 18, pp. 203–219, 1971.
[291] Spectre, "SPECTRE: Short packet communication toolbox," https://github.com/yp-mit/spectre, 2015, GitHub repository.
[292] R. Speer, J. Chin, A. Lin, S. Jewett, and L. Nathan, "Luminosoinsight/wordfreq: v2.2," Oct. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1443582
[293] A. J. Stam, "Distance between sampling with and without replacement," Statistica Neerlandica, vol. 32, no. 2, pp. 81–91, 1978.
[294] M. Steiner, "The strong simplex conjecture is false," IEEE Transactions on Information Theory, vol. 40, no. 3, pp. 721–731, 1994.
[295] V. Strassen, "Asymptotische Abschätzungen in Shannon's Informationstheorie," in Trans. 3d Prague Conf. Inf. Theory, Prague, 1962, pp. 689–723.
[296] ——, "The existence of probability measures with given marginals," Annals of Mathematical Statistics, vol. 36, no. 2, pp. 423–439, 1965.
[297] H. Strasser, Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Berlin, Germany: Walter de Gruyter, 1985.
[298] S. Szarek, "Nets of Grassmann manifold and orthogonal groups," in Proceedings of Banach Space Workshop. University of Iowa Press, 1982, pp. 169–185.
[299] ——, "Metric entropy of homogeneous spaces," Banach Center Publications, vol. 43, no. 1, pp. 395–410, 1998.
[300] W. Szpankowski and S. Verdú, "Minimum expected length of fixed-to-variable lossless compression without prefix constraints," IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4017–4025, 2011.
[301] I. Tal and A. Vardy, "List decoding of polar codes," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2213–2226, 2015.
[302] M. Talagrand, Upper and lower bounds for stochastic processes. Springer, 2014.
[303] W. Tang and F. Tang, "The Poisson binomial distribution–old & new," arXiv preprint arXiv:1908.10024, 2019.
[304] G. Taricco and M. Elia, "Capacity of fading channel with no side information," Electronics Letters, vol. 33, no. 16, pp. 1368–1370, 1997.
[305] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, "Space-time block codes from orthogonal designs," IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1456–1467, 1999.
[306] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: Performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, 1998.
[307] H. Te Sun, Information-spectrum methods in information theory. Springer Science & Business Media, 2003.
[308] E. Telatar, "Capacity of multi-antenna Gaussian channels," European trans. telecom., vol. 10, no. 6, pp. 585–595, 1999.
[309] ——, "Wringing lemmas and multiple descriptions," 2016, unpublished draft.
[310] V. N. Temlyakov, "On estimates of ϵ-entropy and widths of classes of functions with a bounded mixed derivative or difference," Doklady Akademii Nauk, vol. 301, no. 2, pp. 288–291, 1988.
[311] F. Topsøe, "Some inequalities for information divergence and related measures of discrimination," IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1602–1609, 2000.
[312] D. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge University Press, 2005. [Online]. Available: http://www.eecs.berkeley.edu/~dtse/book.html
[313] A. B. Tsybakov, Introduction to Nonparametric Estimation. New York, NY: Springer Verlag, 2009.
[314] B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Institute of Technology, 1967.