Marco Taboga
Contents
I Mathematical tools 1
1 Set theory 3
1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Set membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Set inclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 De Morgan’s Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Permutations 9
2.1 Permutations without repetition . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Definition of permutation without repetition . . . . . . . . . 9
2.1.2 Number of permutations without repetition . . . . . . . . . . 10
2.2 Permutations with repetition . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Definition of permutation with repetition . . . . . . . . . . 11
2.2.2 Number of permutations with repetition . . . . . . . . . . . . 11
2.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 k-permutations 15
3.1 k-permutations without repetition . . . . . . . . . . . . . . . . . . . 15
3.1.1 Definition of k-permutation without repetition . . . . . . . . 15
3.1.2 Number of k-permutations without repetition . . . . . . . . . 16
3.2 k-permutations with repetition . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Definition of k-permutation with repetition . . . . . . . . . 17
3.2.2 Number of k-permutations with repetition . . . . . . . . . . . 18
3.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Combinations 21
4.1 Combinations without repetition . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Definition of combination without repetition . . . . . . . . . 21
4.1.2 Number of combinations without repetition . . . . . . . . . . 22
4.2 Combinations with repetition . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Definition of combination with repetition . . . . . . . . . . 23
4.2.2 Number of combinations with repetition . . . . . . . . . . . . 23
4.3 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3.1 Binomial coefficients and binomial expansions . . . . . . . . 25
4.3.2 Recursive formula for binomial coefficients . . . . . . . . . 25
9 Special functions 55
9.1 Gamma function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.1.2 Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.1.3 Relation to the factorial function . . . . . . . . . . . . . . . . 56
9.1.4 Values of the Gamma function . . . . . . . . . . . . . . . . . 57
II Fundamentals of probability 67
10 Probability 69
10.1 Sample space, sample points and events . . . . . . . . . . . . . . . . 69
10.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
10.3 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.3.1 Probability of the empty set . . . . . . . . . . . . . . . . . . . 71
10.3.2 Additivity and sigma-additivity . . . . . . . . . . . . . . . . . 72
10.3.3 Probability of the complement . . . . . . . . . . . . . . . . . 72
10.3.4 Probability of a union . . . . . . . . . . . . . . . . . . . . . . 73
10.3.5 Monotonicity of probability . . . . . . . . . . . . . . . . . . . 73
10.4 Interpretations of probability . . . . . . . . . . . . . . . . . . . . . . 74
10.4.1 Classical interpretation of probability . . . . . . . . . . . . . 74
10.4.2 Frequentist interpretation of probability . . . . . . . . . . . . 74
10.4.3 Subjectivist interpretation of probability . . . . . . . . . . . . 74
10.5 More rigorous definitions . . . . . . . . . . . . . . . . . . . . . . . 75
10.5.1 A more rigorous definition of event . . . . . . . . . . . . . . . 75
10.5.2 A more rigorous definition of probability . . . . . . . . . . . . 76
10.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
11 Zero-probability events 79
11.1 Definition and discussion . . . . . . . . . . . . . . . . . . . . . . . 79
11.2 Almost sure and almost surely . . . . . . . . . . . . . . . . . . . . . 80
11.3 Almost sure events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
12 Conditional probability 85
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
12.2 The case of equally likely sample points . . . . . . . . . . . . . . . . 85
12.3 A more general approach . . . . . . . . . . . . . . . . . . . . . . . . 87
12.4 Tackling division by zero . . . . . . . . . . . . . . . . . . . . . . . . . 90
12.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
12.5.1 The law of total probability . . . . . . . . . . . . . . . . . . . 90
12.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
13 Bayes' rule 95
13.1 Statement of Bayes' rule . . . . . . . . . . . . . . . . . . . . . . . 95
13.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
13.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
14 Independent events 99
14.1 Definition of independent event . . . . . . . . . . . . . . . . . . . 99
14.2 Mutually independent events . . . . . . . . . . . . . . . . . . . . . . 100
14.3 Zero-probability events and independence . . . . . . . . . . . . . . . 101
14.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
20 Variance 155
20.1 Definition of variance . . . . . . . . . . . . . . . . . . . . . . . . 155
20.2 Interpretation of variance . . . . . . . . . . . . . . . . . . . . . . . . 155
20.3 Computation of variance . . . . . . . . . . . . . . . . . . . . . . . . . 155
20.4 Variance formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.6 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
20.6.1 Variance and standard deviation . . . . . . . . . . . . . . . . 157
20.6.2 Addition to a constant . . . . . . . . . . . . . . . . . . . . . . 157
20.6.3 Multiplication by a constant . . . . . . . . . . . . . . . . . . 158
20.6.4 Linear transformations . . . . . . . . . . . . . . . . . . . . . . 158
20.6.5 Square integrability . . . . . . . . . . . . . . . . . . . . . . . 159
20.7 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
21 Covariance 163
21.1 Definition of covariance . . . . . . . . . . . . . . . . . . . . . . . 163
21.2 Interpretation of covariance . . . . . . . . . . . . . . . . . . . . . . . 163
21.3 Covariance formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.1 Covariance of a random variable with itself . . . . . . . . . . 166
21.5.2 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.3 Bilinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.4 Variance of the sum of two random variables . . . . . . . . . 167
21.5.5 Variance of the sum of n random variables . . . . . . . . . . . 168
21.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
51 F distribution 421
51.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
51.2 Relation to the Gamma distribution . . . . . . . . . . . . . . . . . . 422
51.3 Relation to the Chi-square distribution . . . . . . . . . . . . . . . . . 424
51.4 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
51.5 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
51.6 Higher moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
51.7 Moment generating function . . . . . . . . . . . . . . . . . . . . . . . 428
51.8 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . 428
51.9 Distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
51.10 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Dedication
This book is dedicated to Emanuela and Anna.
Part I
Mathematical tools
Chapter 1
Set theory
1.1 Sets
A set is a collection of objects. Sets are usually denoted by a letter and the objects
(or elements) belonging to a set are usually listed within curly brackets.
Example 1 Denote by the letter S the set of the natural numbers less than or
equal to 5. Then, we can write
S = {1, 2, 3, 4, 5}
Example 2 Denote by the letter A the set of the first five letters of the alphabet.
Then, we can write

A = {a, b, c, d, e}
Note that a set is an unordered collection of objects, i.e., the order in which
the elements of a set are listed does not matter.
A set can also be defined by stating a property that its elements satisfy:
instead of writing

S = {1, 2, 3, 4, 5}

we can write

S = {n ∈ ℕ : n ≤ 5}

which reads as follows: "S is the set of all natural numbers n such that n is less
than or equal to 5", where the colon symbol (:) means "such that" and precedes a
list of conditions that the elements of the set need to satisfy.
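As an aside for readers who like to experiment, the two ways of describing S above mirror listing a set versus a set comprehension in Python; the snippet below is a small illustration added to this text, not part of the original lecture:

```python
# Two equivalent ways to build S = {n in N : n <= 5}.
S_listed = {1, 2, 3, 4, 5}                        # listing the elements
S_builder = {n for n in range(1, 100) if n <= 5}  # the "such that" condition
assert S_listed == S_builder

# Order does not matter: sets are unordered collections.
assert S_listed == {5, 4, 3, 2, 1}
```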
1.2 Set membership

If an object a belongs to a set A, we write

a ∈ A

which reads "a belongs to A" or "a is a member of A". If a does not belong to A,
we write

a ∉ A

which reads "a does not belong to A" or "a is not a member of A".
For example, if A = {2, 4, 6, 8, 10}, then 2 ∈ A, while 3 ∉ A.
1.3 Set inclusion

If every element of a set A also belongs to a set B, we say that A is included
in B and write

A ⊆ B

We also say that A is a subset of B. Equivalently, we write

B ⊇ A

which reads "B includes A".
When A ⊆ B but A is not the same as B, i.e., there are elements of B that do
not belong to A, then we write

A ⊂ B

which reads "A is strictly included in B", or

B ⊃ A

We also say that A is a proper subset of B.
Example 7 Given the sets

A = {2, 3}
B = {1, 2, 3, 4}
C = {2, 3}

we have that

A ⊂ B
A ⊆ C

but we cannot write

A ⊂ C
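Python's set comparison operators follow exactly these conventions, which makes Example 7 easy to check numerically (an illustrative aside, not part of the original lecture):

```python
A = {2, 3}
B = {1, 2, 3, 4}
C = {2, 3}

assert A <= B        # A is included in B
assert A < B         # strict (proper) inclusion: B has elements not in A
assert A <= C        # A is included in C ...
assert not (A < C)   # ... but not strictly, because A and C coincide
```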
1.4 Union
The union of two sets A and B is the set of all elements that belong to at least one
of them, and it is denoted by
A ∪ B
Example 8 Define two sets A and B as follows:

A = {a, b, c, d}
B = {c, d, e, f}

Their union is

A ∪ B = {a, b, c, d, e, f}
If A₁, A₂, …, Aₙ are n sets, their union is the set of all elements that belong
to at least one of them, and it is denoted by

⋃ᵢ₌₁ⁿ Aᵢ = A₁ ∪ A₂ ∪ … ∪ Aₙ
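Both the two-set union of Example 8 and the n-set union above can be reproduced with Python sets (a small sketch added here for illustration only):

```python
A = {'a', 'b', 'c', 'd'}
B = {'c', 'd', 'e', 'f'}
assert A | B == {'a', 'b', 'c', 'd', 'e', 'f'}   # the union in Example 8

# Union of n sets: all elements belonging to at least one of them.
sets = [{1, 2}, {2, 3}, {4}]
assert set().union(*sets) == {1, 2, 3, 4}
```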
1.5 Intersection
The intersection of two sets A and B is the set of all elements that belong to both
of them, and it is denoted by
A ∩ B
For example, define two sets A and B as follows:

A = {a, b, c, d}
B = {c, d, e, f}

Their intersection is

A ∩ B = {c, d}
Similarly, define three sets A₁, A₂ and A₃ as follows:

A₁ = {a, b, c, d}
A₂ = {c, d, e, f}
A₃ = {c, f, g}

Their intersection is

⋂ᵢ₌₁³ Aᵢ = A₁ ∩ A₂ ∩ A₃ = {c}
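The two intersections above can be verified directly with Python's `&` operator (an illustrative aside, not part of the original text):

```python
A1 = {'a', 'b', 'c', 'd'}
A2 = {'c', 'd', 'e', 'f'}
A3 = {'c', 'f', 'g'}

assert A1 & A2 == {'c', 'd'}   # elements belonging to both sets
assert A1 & A2 & A3 == {'c'}   # elements belonging to all three sets
```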
1.6 Complement

Suppose that our attention is confined to sets that are all included in a larger
set Ω, called the universal set. Let A be one of these sets. The complement of A
is the set of all elements of Ω that do not belong to A, and it is indicated by

Aᶜ

For example, define the universal set

Ω = {a, b, c, d, e, f, g, h}

and the two sets

A = {b, c, d}
B = {c, d, e}

Their complements are

Aᶜ = {a, e, f, g, h}
Bᶜ = {a, b, f, g, h}
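Relative to a universal set, complementation is simply set difference, so the example above can be checked as follows (an illustrative aside; the variable names are ours):

```python
omega = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'}  # the universal set
A = {'b', 'c', 'd'}
B = {'c', 'd', 'e'}

A_complement = omega - A   # elements of omega not belonging to A
B_complement = omega - B

assert A_complement == {'a', 'e', 'f', 'g', 'h'}
assert B_complement == {'a', 'b', 'f', 'g', 'h'}
```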
1.8 Solved exercises

Exercise 1

Define the following sets:

A₁ = {a, b, c}
A₂ = {b, c, d, e, f}
A₃ = {b, f}
A₄ = {a, b, d}

List all the elements belonging to the set

A = A₂ ∪ A₃ ∪ A₄
Solution
The union can be written as

A = A₂ ∪ A₃ ∪ A₄

The union of the three sets A₂, A₃ and A₄ is the set of all elements that belong to
at least one of them:

A = A₂ ∪ A₃ ∪ A₄ = {a, b, c, d, e, f}
Exercise 2
Given the sets defined in the previous exercise, list all the elements belonging to
the set

A = ⋂ᵢ₌₁⁴ Aᵢ
Solution
The intersection can be written as

A = A₁ ∩ A₂ ∩ A₃ ∩ A₄

The intersection of the four sets A₁, A₂, A₃ and A₄ is the set of elements that are
members of all the four sets:

A = A₁ ∩ A₂ ∩ A₃ ∩ A₄ = {b}
Exercise 3
Suppose that A and B are two subsets of a universal set Ω, and that

Aᶜ = {a, b, c}
Bᶜ = {b, c, d}

List all the elements belonging to the set (A ∪ B)ᶜ.
Solution
Using De Morgan's laws, we obtain

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ = {a, b, c} ∩ {b, c, d} = {b, c}
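Exercise 3 can also be replayed numerically: build any sets with the given complements and check De Morgan's identity with set operations (a sketch added for illustration; the concrete universal set is our assumption):

```python
omega = set('abcdefgh')          # an assumed universal set containing a, b, c, d
A = omega - {'a', 'b', 'c'}      # chosen so that the complement of A is {a, b, c}
B = omega - {'b', 'c', 'd'}      # chosen so that the complement of B is {b, c, d}

lhs = omega - (A | B)            # (A ∪ B)^c
rhs = (omega - A) & (omega - B)  # A^c ∩ B^c

assert lhs == rhs == {'b', 'c'}  # De Morgan's law, and the answer to Exercise 3
```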
Chapter 2
Permutations
This lecture introduces permutations, one of the most important concepts in
combinatorial analysis.
We first deal with permutations without repetition, also called simple
permutations, and then with permutations with repetition.
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. There were n objects, but one
has already been assigned to a slot. So, we are left with n − 1 objects that
can be assigned to the second slot. Thus, there are

n − 1 possible ways to fill the second slot

and

n · (n − 1) possible ways to fill the first two slots

3. Then, we assign an object to the third slot. There were n objects, but two
have already been assigned to a slot. So, we are left with n − 2 objects that
can be assigned to the third slot. Thus, there are

n − 2 possible ways to fill the third slot

and

n · (n − 1) · (n − 2) possible ways to fill the first three slots

4. And so on, until only one object and one free slot remain.

5. Finally, when only one free slot remains, we assign the remaining object to
it. There is only one way to do this. Thus, there is

1 possible way to fill the last slot

and

Pₙ = n · (n − 1) · (n − 2) · … · 2 · 1 possible ways to fill the n available slots

Using the factorial notation, we can write

Pₙ = n!

where, by convention, 0! = 1. For example, the number of permutations of 5
objects is

P₅ = 5! = 5 · 4 · 3 · 2 · 1 = 120
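The count P₅ = 120 is small enough to verify by brute-force enumeration; the following Python snippet (added for illustration) lists every simple permutation and compares the total with 5!:

```python
from itertools import permutations
from math import factorial

objects = [1, 2, 3, 4, 5]
all_orderings = list(permutations(objects))  # every permutation without repetition
assert len(all_orderings) == factorial(5) == 120  # P_5 = 5! = 120
```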
Thus, the difference between simple permutations and permutations with
repetition is that objects can be selected only once in the former, while they can
be selected more than once in the latter.
The following subsections give a slightly more formal definition of permutation
with repetition and deal with the problem of counting the number of possible
permutations with repetition.
Example 15 Consider two objects, a₁ and a₂. There are two slots to fill, s₁ and
s₂. There are four possible permutations with repetition of the two objects, that is,
four possible ways to assign an object to each slot, being allowed to assign the same
object to more than one slot:

Slots           s₁  s₂
Permutation 1   a₁  a₁
Permutation 2   a₁  a₂
Permutation 3   a₂  a₁
Permutation 4   a₂  a₂
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. Even if one object has been
assigned to a slot in the previous step, we can still choose among n objects,
because we are allowed to choose an object more than once. So, there are n
objects that can be assigned to the second slot and

n · n possible ways to fill the first two slots

3. Then, we assign an object to the third slot. Even if two objects have been
assigned to a slot in the previous two steps, we can still choose among n
objects, because we are allowed to choose an object more than once. So,
there are n objects that can be assigned to the third slot and

n · n · n possible ways to fill the first three slots

4. And so on, until we are left with only one free slot (the n-th).

5. When only one free slot remains, we assign one of the n objects to it. Thus,
there are

n possible ways to fill the last slot

and

n · n · … · n (n times) possible ways to fill the n available slots

Therefore, the number of permutations with repetition of n objects is

P′ₙ = nⁿ

For example, the number of permutations with repetition of 3 objects is

P′₃ = 3³ = 27
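Assigning any of n objects to each of n slots is exactly what `itertools.product` enumerates, so P′₃ = 27 can be checked directly (an illustrative sketch added to the text):

```python
from itertools import product

objects = ['a1', 'a2', 'a3']
# Each of the 3 slots can hold any of the 3 objects (repetition allowed).
arrangements = list(product(objects, repeat=3))
assert len(arrangements) == 3 ** 3 == 27  # P'_3 = 3^3 = 27
```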
Exercise 1
There are 5 seats around a table and 5 people to be seated at the table. In how
many different ways can they seat themselves?
Solution
Sitting 5 people at the table is a sequential problem. We need to assign a person
to the first chair. There are 5 possible ways to do this. Then we need to assign
a person to the second chair. There are 4 possible ways to do this, because one
person has already been assigned. And so on, until there remain one free chair and
one person to be seated. Therefore, the number of ways to seat the 5 people at the
table is equal to the number of permutations of 5 objects (without repetition). If
we denote it by P₅, then

P₅ = 5! = 5 · 4 · 3 · 2 · 1 = 120
Exercise 2
Bob, John, Luke and Tim play a tennis tournament. The rules of the tournament
are such that at the end of the tournament a ranking will be made and there will
be no ties. How many different rankings can there be?
Solution
Ranking 4 people is a sequential problem. We need to assign a person to the first
place. There are 4 possible ways to do this. Then we need to assign a person to the
second place. There are 3 possible ways to do this, because one person has already
been assigned. And so on, until there remains one person to be assigned. Therefore,
the number of ways to rank the 4 people participating in the tournament is equal
to the number of permutations of 4 objects (without repetition). If we denote it
by P₄, then

P₄ = 4! = 4 · 3 · 2 · 1 = 24
Exercise 3
A byte is a number consisting of 8 digits that can be equal either to 0 or to 1. How
many different bytes are there?
Solution
To answer this question we need to follow a line of reasoning similar to the one we
followed when we derived the number of permutations with repetition. There are
2 possible ways to choose the first digit and 2 possible ways to choose the second
digit. So, there are 4 possible ways to choose the first two digits. There are 2
possible ways to choose the third digit and 4 possible ways to choose the first
two. Thus, there are 8 possible ways to choose the first three digits. And so on,
until we have chosen all digits. Therefore, the number of ways to choose the 8
digits is equal to

2 · 2 · … · 2 (8 times) = 2⁸ = 256
Chapter 3
k-permutations
1. the order of selection matters (the same k objects selected in different orders
are regarded as different k-permutations);

2. each object can be selected only once.
Example 17 Consider three objects, a₁, a₂ and a₃. There are two slots, s₁ and
s₂, to which we can assign two of the three objects. There are six possible 2-
permutations of the three objects, that is, six possible ways to choose two objects
and fill the two slots with the two objects:
1 See p. 9.
Slots s1 s2
2-permutation 1 a1 a2
2-permutation 2 a1 a3
2-permutation 3 a2 a1
2-permutation 4 a2 a3
2-permutation 5 a3 a1
2-permutation 6 a3 a2
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. There were n objects, but one
has already been assigned to a slot. So, we are left with n − 1 objects that
can be assigned to the second slot. Thus, there are

n − 1 possible ways to fill the second slot

and

n · (n − 1) possible ways to fill the first two slots

3. Then, we assign an object to the third slot. There were n objects, but two
have already been assigned to a slot. So, we are left with n − 2 objects that
can be assigned to the third slot. Thus, there are

n − 2 possible ways to fill the third slot

and

n · (n − 1) · (n − 2) possible ways to fill the first three slots

4. And so on, until we are left with n − k + 1 objects and only one free slot (the
k-th).

5. Finally, when only one free slot remains, we assign one of the remaining
n − k + 1 objects to it. Thus, there are

n − k + 1 possible ways to fill the last slot

and

Pₙ,ₖ = n · (n − 1) · (n − 2) · … · (n − k + 1)

possible ways to fill the k available slots. Multiplying and dividing by
(n − k) · (n − k − 1) · … · 2 · 1, we obtain

Pₙ,ₖ = [n · (n − 1) · (n − 2) · … · (n − k + 1) · (n − k) · (n − k − 1) · … · 2 · 1] / [(n − k) · (n − k − 1) · … · 2 · 1]

so that, using the factorial notation,

Pₙ,ₖ = n!/(n − k)!
The number Pₙ,ₖ is sometimes denoted by nₖ and called the falling factorial of n.
For example,

P₅,₃ = 5!/2! = (5 · 4 · 3 · 2 · 1)/(2 · 1) = 5 · 4 · 3 = 60
1. the order of selection matters (the same k objects selected in different orders
are regarded as different k-permutations);

2. each object can be selected more than once.
Example 19 Consider three objects a₁, a₂ and a₃ and two slots, s₁ and s₂. There
are nine possible 2-permutations with repetition of the three objects, that is, nine
possible ways to choose two objects and fill the two slots with the two objects, being
allowed to pick the same object more than once:
Slots s1 s2
2-permutation 1 a1 a1
2-permutation 2 a1 a2
2-permutation 3 a1 a3
2-permutation 4 a2 a1
2-permutation 5 a2 a2
2-permutation 6 a2 a3
2-permutation 7 a3 a1
2-permutation 8 a3 a2
2-permutation 9 a3 a3
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. Even if one object has been
assigned to a slot in the previous step, we can still choose among n objects,
because we are allowed to choose an object more than once. So, there are n
objects that can be assigned to the second slot and

n · n possible ways to fill the first two slots

3. Then, we assign an object to the third slot. Even if two objects have been
assigned to a slot in the previous two steps, we can still choose among n
objects, because we are allowed to choose an object more than once. So,
there are n objects that can be assigned to the third slot and

n · n · n possible ways to fill the first three slots

4. And so on, until we are left with only one free slot (the k-th).

5. When only one free slot remains, we assign one of the n objects to it. Thus,
there are

n possible ways to fill the last slot

and

n · n · … · n (k times) possible ways to fill the k available slots

Therefore, the number of k-permutations with repetition of n objects is

P′ₙ,ₖ = nᵏ
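The count nᵏ matches what direct enumeration of the slots produces; for the three objects and two slots of Example 19 (an illustrative aside):

```python
from itertools import product

n, k = 3, 2
# Ordered assignments with repetition allowed: one of n objects per slot.
slots = list(product(range(n), repeat=k))
assert len(slots) == n ** k == 9  # matches Example 19
```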
Exercise 1
There is a basket of fruit containing an apple, a banana and an orange and there
are five girls who want to eat one fruit. How many ways are there to give three of
the five girls one fruit each and leave two of them without a fruit to eat?
Solution
Giving the three fruits to three of the five girls is a sequential problem. We first
give the apple to one of the girls. There are 5 possible ways to do this. Then we
give the banana to one of the remaining girls. There are 4 possible ways to do this,
because one girl has already been given a fruit. Finally, we give the orange to one
of the remaining girls. There are 3 possible ways to do this, because two girls have
already been given a fruit. Summing up, the number of ways to assign the three
fruits is equal to the number of 3-permutations of 5 objects (without repetition).
If we denote it by P₅,₃, then

P₅,₃ = 5!/(5 − 3)! = (5 · 4 · 3 · 2 · 1)/(2 · 1) = 5 · 4 · 3 = 60
Exercise 2
A hexadecimal number is a number whose digits can take sixteen different values:
either one of the ten numbers from 0 to 9, or one of the six letters from A to F. How
many different 8-digit hexadecimal numbers are there, if a hexadecimal number
is allowed to begin with any number of zeros?

Solution

Choosing the 8 digits of the hexadecimal number is a sequential problem. There
are 16 possible ways to choose the first digit and 16 possible ways to choose the
second digit. So, there are 16 · 16 possible ways to choose the first two digits.
There are 16 possible ways to choose the third digit and 16 · 16 possible ways to
choose the first two. Thus, there are 16 · 16 · 16 possible ways to choose the first
three digits. And so on, until we have chosen all digits. Therefore, the number of
ways to choose the 8 digits is equal to the number of 8-permutations with repetition
of 16 objects:

P′₁₆,₈ = 16⁸
Exercise 3
An urn contains ten balls, each representing one of the ten numbers from 0 to 9.
Three balls are drawn at random from the urn and the corresponding numbers are
written down to form a 3-digit number, writing down the digits from left to right
in the order in which they have been extracted. When a ball is drawn from the
urn it is set aside, so that it cannot be extracted again. If one were to write down
all the 3-digit numbers that could possibly be formed, how many would they be?
Solution
The 3 balls are drawn sequentially. At the first draw there are 10 balls, hence 10
possible values for the first digit of our 3-digit number. At the second draw there
are 9 balls left, hence 9 possible values for the second digit of our 3-digit number.
At the third and last draw there are 8 balls left, hence 8 possible values for the third
digit of our 3-digit number. Summing up, the number of possible 3-digit numbers
is equal to the number of 3-permutations of 10 objects (without repetition). If we
denote it by P₁₀,₃, then

P₁₀,₃ = 10!/(10 − 3)! = (10 · 9 · … · 2 · 1)/(7 · 6 · … · 2 · 1) = 10 · 9 · 8 = 720
Chapter 4
Combinations
This lecture introduces combinations, one of the most important concepts in
combinatorial analysis. Before reading this lecture, you should be familiar with the
concept of permutation¹.
We first deal with combinations without repetition and then with combinations
with repetition.
Other combinations are not possible, because, for example, {a₂, a₁} is the same as
{a₁, a₂}.
Cₙ,ₖ = Pₙ,ₖ/Pₖ = n!/((n − k)! k!)

The number of possible combinations is often denoted by

Cₙ,ₖ = (ⁿₖ)

and (ⁿₖ) is called a binomial coefficient.
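The formula Cₙ,ₖ = n!/((n − k)! k!) is implemented in Python's standard library as `math.comb`, and counting the unordered selections enumerated by `itertools.combinations` gives the same number (an illustrative aside):

```python
from itertools import combinations
from math import comb, factorial

n, k = 5, 3
# C_{n,k} = n! / ((n-k)! k!)
assert comb(n, k) == factorial(n) // (factorial(n - k) * factorial(k)) == 10
# Enumerating unordered selections without repetition gives the same count.
assert len(list(combinations(range(n), k))) == 10
```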
1. the order of selection does not matter (the same objects selected in different
orders are regarded as the same combination);

2. each object can be selected more than once.

² See the lectures entitled Permutations (p. 9) and k-permutations (p. 15).
³ Pₖ is the number of all possible ways to order the k objects, i.e., the number of
permutations of k objects.
Thus, the difference between simple combinations and combinations with
repetition is that objects can be selected only once in the former, while they can
be selected more than once in the latter.
The following subsections give a slightly more formal definition of combination
with repetition and deal with the problem of counting the number of possible
combinations with repetition.
{a, b, c, a}

is a valid multiset, but not a valid set, because the letter a appears more than once.
Like sets, multisets are unordered collections of objects, i.e., the order in which the
elements of a multiset are listed does not matter.
Let a₁, a₂, …, aₙ be n objects. A combination with repetition of k objects
from the n objects is one of the possible ways to form a multiset containing k
objects taken from the set {a₁, a₂, …, aₙ}.
Example 23 Consider three objects, a₁, a₂ and a₃. There are six possible
combinations with repetition of two objects from a₁, a₂ and a₃, that is, six possible
ways to choose two objects from this set of three, allowing for repetitions:

Combination 1   a₁ and a₂
Combination 2   a₁ and a₃
Combination 3   a₂ and a₃
Combination 4   a₁ and a₁
Combination 5   a₂ and a₂
Combination 6   a₃ and a₃

Other combinations are not possible, because, for example, {a₂, a₁} is the same as
{a₁, a₂}.
Example 24 We need to order two scoops of ice cream, choosing among four
flavours: chocolate, pistachio, strawberry and vanilla. It is possible to order two
scoops of the same flavour. How many different combinations can we order? The
4 See the lecture entitled Set theory (p. 3).
number of different combinations we can order is equal to the number of possible
combinations with repetition of 2 objects from 4. Let us represent an order as a
string of crosses (×) and vertical bars (|), where a vertical bar delimits two adjacent
flavours and a cross denotes a scoop of a given flavour. For example,

×|||×   1 chocolate, 1 vanilla
||×|×   1 strawberry, 1 vanilla
××|||   2 chocolate
||××|   2 strawberry

where the first vertical bar (the leftmost one) delimits chocolate and pistachio, the
second one delimits pistachio and strawberry and the third one delimits strawberry
and vanilla. Each string contains three vertical bars, one less than the number of
flavours, and two crosses, one for each scoop. Therefore, each string contains a
total of five symbols. Making an order is equivalent to choosing which two of the
five symbols will be a cross (the remaining will be vertical bars). So, to make an
order, we need to choose 2 objects from 5. The number of possible ways to choose
2 objects from 5 is equal to the number of possible combinations without repetition⁵
of 2 objects from 5. Therefore, there are

(⁵₂) = 5!/((5 − 2)! 2!) = 10

different orders we can make.
In general, choosing k objects from n with repetition is equivalent to writing
a string with n + k − 1 symbols, of which n − 1 are vertical bars (|) and k are
crosses (×). In turn, this is equivalent to choosing the k positions in the string
(among the available n + k − 1) that will contain a cross (the remaining ones will
contain vertical bars). But choosing k positions from n + k − 1 is like choosing a
combination without repetition of k objects from n + k − 1. Therefore, the number
of possible combinations with repetition is

C′ₙ,ₖ = Cₙ₊ₖ₋₁,ₖ = (ⁿ⁺ᵏ⁻¹ₖ) = (n + k − 1)!/((n + k − 1 − k)! k!) = (n + k − 1)!/((n − 1)! k!)

The number of possible combinations with repetition is often denoted by

C′ₙ,ₖ = ((ⁿₖ))

and ((ⁿₖ)) is called a multiset coefficient.
Example 25 The number of possible combinations with repetition of 3 objects from
5 is

C′₅,₃ = (5 + 3 − 1)!/((5 − 1)! 3!) = 7!/(4! 3!)
     = (7 · 6 · 5 · 4 · 3 · 2 · 1)/((4 · 3 · 2 · 1) · (3 · 2 · 1))
     = (7 · 6 · 5)/(3 · 2 · 1) = 7 · 5 = 35
5 See p. 21.
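The stars-and-bars identity C′ₙ,ₖ = C(n + k − 1, k) used in Example 25 can be cross-checked against direct enumeration with `itertools.combinations_with_replacement` (an illustrative aside added to the text):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 5, 3  # as in Example 25
# Stars and bars: combinations with repetition = C(n + k - 1, k).
assert comb(n + k - 1, k) == 35
# Enumerating multisets of size k drawn from n objects gives the same count.
assert len(list(combinations_with_replacement(range(n), k))) == 35
```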
Exercise 1
3 cards are drawn from a standard deck of 52 cards. How many different 3-card
hands can possibly be drawn?
Solution
First of all, the order in which the 3 cards are drawn does not matter, that is,
the same cards drawn in different orders are regarded as the same 3-card hand.
Furthermore, each card can be drawn only once. Therefore the number of different
3-card hands that can possibly be drawn is equal to the number of possible
combinations without repetition of 3 objects from 52. If we denote it by C₅₂,₃,
then

C₅₂,₃ = 52!/((52 − 3)! 3!) = 52!/(49! 3!) = (52 · 51 · 50)/3! = (52 · 51 · 50)/(3 · 2 · 1) = 22100
Exercise 2
John has got one dollar, with which he can buy green, red and yellow candies.
Each candy costs 50 cents. John will spend all the money he has on candies. How
many different combinations of green, red and yellow candies can he buy?
Solution
First of all, the order in which the 3 different colors are chosen does not matter.
Furthermore, each color can be chosen more than once. Therefore, the number of
different combinations of colored candies John can choose is equal to the number
of possible combinations with repetition of 2 objects from 3. If we denote it by
C′₃,₂, then

C′₃,₂ = ((³₂)) = (³⁺²⁻¹₂) = (⁴₂) = 4!/((4 − 2)! 2!) = 4!/(2! 2!) = (4 · 3)/2! = (4 · 3)/(2 · 1) = 6
Exercise 3
The board of directors of a corporation comprises 10 members. An executive board,
formed by 4 directors, needs to be elected. How many possible ways are there to
form the executive board?
Solution
First of all, the order in which the 4 directors are selected does not matter.
Furthermore, each director can be elected to the executive board only once.
Therefore, the number of different ways to form the executive board is equal to the
number of possible combinations without repetition of 4 objects from 10. If we
denote it by C₁₀,₄, then

C₁₀,₄ = 10!/((10 − 4)! 4!) = 10!/(6! 4!) = (10 · 9 · 8 · 7)/4! = (10 · 9 · 8 · 7)/(4 · 3 · 2 · 1) = 210
Chapter 5

Partitions into groups

This lecture introduces partitions into groups. Before reading this lecture, you
should read the lectures entitled Permutations (p. 9) and Combinations (p. 21).
A partition of n objects into k groups is one of the possible ways of subdividing
the n objects into k groups (k ≤ n). The rules are:

1. the order in which objects are assigned to a group does not matter;

2. each object can be assigned to only one group.

The following subsections give a slightly more formal definition of partition into
groups and deal with the problem of counting the number of possible partitions
into groups.
1. First, we assign n₁ objects to the first group. The number of possible ways
to choose n₁ of the n objects is equal to the number of combinations of n₁
elements from n. So there are

C(n, n₁) = n!/(n₁! (n − n₁)!)

possible ways to form the first group.
2. Then, we assign n₂ objects to the second group. There were n objects, but
n₁ have already been assigned to the first group. So, there are n − n₁ objects
left, that can be assigned to the second group. The number of possible
ways to choose n₂ of the remaining n − n₁ objects is equal to the number of
combinations of n₂ elements from n − n₁. So there are

C(n − n₁, n₂) = (n − n₁)!/(n₂! (n − n₁ − n₂)!)

possible ways to form the second group and

C(n, n₁) · C(n − n₁, n₂) = [n!/(n₁! (n − n₁)!)] · [(n − n₁)!/(n₂! (n − n₁ − n₂)!)]
                         = n!/(n₁! n₂! (n − n₁ − n₂)!)

possible ways to form the first two groups.
3. Then, we assign n₃ objects to the third group. There were n objects, but
n₁ + n₂ have already been assigned to the first two groups. So, there are
n − n₁ − n₂ objects left, that can be assigned to the third group. The number
of possible ways to choose n₃ of the remaining n − n₁ − n₂ objects is equal
to the number of combinations of n₃ elements from n − n₁ − n₂. So there are

C(n − n₁ − n₂, n₃) = (n − n₁ − n₂)!/(n₃! (n − n₁ − n₂ − n₃)!)

possible ways to form the third group and

C(n, n₁) · C(n − n₁, n₂) · C(n − n₁ − n₂, n₃)
  = [n!/(n₁! n₂! (n − n₁ − n₂)!)] · [(n − n₁ − n₂)!/(n₃! (n − n₁ − n₂ − n₃)!)]
  = n!/(n₁! n₂! n₃! (n − n₁ − n₂ − n₃)!)

possible ways to form the first three groups.
1 See the lecture entitled Combinations (p. 21).
5.3. MORE DETAILS 29
4. And so on, until we are left with n_k objects and the last group. There is only
one way to form the last group, which can also be written as

\binom{n - n_1 - n_2 - \ldots - n_{k-1}}{n_k} = \frac{(n - n_1 - n_2 - \ldots - n_{k-1})!}{n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}

As a consequence, there are

\binom{n}{n_1} \binom{n - n_1}{n_2} \binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - n_2 - \ldots - n_{k-1}}{n_k}
= \frac{n!}{n_1! \, n_2! \ldots n_{k-1}! \, (n - n_1 - n_2 - \ldots - n_{k-1})!} \cdot \frac{(n - n_1 - n_2 - \ldots - n_{k-1})!}{n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}
= \frac{n!}{n_1! \, n_2! \ldots n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}
= \frac{n!}{n_1! \, n_2! \ldots n_k! \, 0!}
= \frac{n!}{n_1! \, n_2! \ldots n_k!}

possible ways to form all the groups.
Therefore, by the above sequential argument, the total number of possible
partitions into the k groups is

P_{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1! \, n_2! \ldots n_k!}

The number P_{n_1, n_2, \ldots, n_k} is often indicated as follows:

P_{n_1, n_2, \ldots, n_k} = \binom{n}{n_1, n_2, \ldots, n_k}

and \binom{n}{n_1, n_2, \ldots, n_k} is called a multinomial coefficient.
Sometimes the following notation is also used:

P_{n_1, n_2, \ldots, n_k} = (n_1, n_2, \ldots, n_k)!
Example 27 The number of possible partitions of 4 objects into 2 groups of 2
objects is

P_{2,2} = \binom{4}{2, 2} = \frac{4!}{2! \, 2!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = 6
Exercise 1

John has a basket of fruit containing one apple, one banana, one orange and one
kiwi. He wants to give one fruit to each of his two little sisters and two fruits to
his big brother. In how many different ways can he do this?

Solution

John needs to decide how to partition 4 objects into 3 groups. The first two groups
will contain one object each and the third one will contain two objects. The total
number of partitions is

P_{1,1,2} = \binom{4}{1, 1, 2} = \frac{4!}{1! \, 1! \, 2!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{1 \cdot 1 \cdot 2 \cdot 1} = \frac{24}{2} = 12
Exercise 2

Ten friends want to play basketball. They need to divide into two teams of five
players. In how many different ways can they do this?

Solution

They need to decide how to partition 10 objects into 2 groups. Each group will
contain 5 objects. The total number of partitions is

P_{5,5} = \binom{10}{5, 5} = \frac{10!}{5! \, 5!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = \frac{9 \cdot 8 \cdot 7 \cdot 6}{4 \cdot 3} = 9 \cdot 2 \cdot 7 \cdot 2 = 252
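The multinomial-coefficient formula derived above is easy to turn into a small helper; the sketch below (not part of the original text) checks it against Example 27 and the two exercises:

```python
from math import factorial

def multinomial(*groups):
    """Number of partitions of sum(groups) objects into groups of the given sizes:
    n! / (n_1! * n_2! * ... * n_k!)."""
    count = factorial(sum(groups))
    for size in groups:
        count //= factorial(size)  # each intermediate quotient is an integer
    return count

print(multinomial(2, 2))     # 6   (Example 27)
print(multinomial(1, 1, 2))  # 12  (Exercise 1)
print(multinomial(5, 5))     # 252 (Exercise 2)
```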
Chapter 6

Sequences and limits

Thus, a is a limit of {a_n} if, by dropping a sufficiently high number of initial
terms of {a_n}, we can make the remaining terms of {a_n} as close to a as we like.
Intuitively, a is a limit of {a_n} if a_n becomes closer and closer to a as n goes
to infinity.
d(a_n, a) = |a_n - a|

Using the concept of distance, the above informal definition can be made rigorous.
For those unfamiliar with the quantifiers \forall (for all) and \exists (there exists), the
notation

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d(a_n, a) < \varepsilon, \; \forall n > n_0

reads as follows: "For any arbitrarily small number \varepsilon, there exists a natural number
n_0 such that the distance between a_n and a is less than \varepsilon for all the terms a_n with
n > n_0", which can also be restated as "For any arbitrarily small number \varepsilon, you
can find a subsequence {a_n}_{n > n_0} such that the distance between any term of the
subsequence and a is less than \varepsilon", or as "By dropping a sufficiently high number
of initial terms of {a_n}, you can make the remaining terms as close to a as you
wish".
It is possible to prove that a convergent sequence has a unique limit, that is, if
{a_n} has a limit a, then a is the unique limit of {a_n}.

where the last equality holds because all the terms of the sequence are positive and
hence equal to their absolute values. Therefore, we need to find an n_0 \in \mathbb{N} such
that all the terms of the subsequence {a_n}_{n > n_0} satisfy

d(a_n, 0) = a_n < \varepsilon     (6.2)

Since

a_n < a_{n_0}, \; \forall n > n_0

condition (6.2) is satisfied if a_{n_0} < \varepsilon, which is equivalent to 1/n_0 < \varepsilon. As a
consequence, it suffices to pick any n_0 such that n_0 > 1/\varepsilon to satisfy condition (6.1).
Summing up, we have just shown that, for any \varepsilon, we are able to find n_0 \in \mathbb{N} such
that all terms of the subsequence {a_n}_{n > n_0} have distance from zero less than \varepsilon,
which implies that 0 is the limit of the sequence {a_n}.
The definition is the same as the one given in Definition 31, except for the fact that
now both a and the terms of the sequence {a_n} belong to a generic set of objects A.

1. non-negativity: d(a, a') \geq 0;

2. identity of indiscernibles: d(a, a') = 0 if and only if a = a';

3. symmetry: d(a, a') = d(a', a);

4. triangle inequality: d(a, a') + d(a', a'') \geq d(a, a'').

All four properties are very intuitive: property 1) says that the distance between
two points cannot be a negative number; property 2) says that the distance between
two points is zero if and only if the two points coincide; property 3) says that the
distance from a to a' is the same as the distance from a' to a; property 4) says that
the distance you cover when you go from a to a'' directly is less than or equal to
the distance you cover when you go from a to a'' passing through a third point a'
(in other words, if a' is not on the way from a to a'', you are increasing the distance
covered).

which coincides with the definition of distance between real numbers already given
above.
Whenever we are faced with a sequence of objects and we want to assess whether
it is convergent, we need to define a distance function on the set of objects to which
the terms of the sequence belong, and verify that the proposed distance function
satisfies all the properties of a proper distance function (a metric). For example, in
probability theory and statistics we often deal with sequences of random variables.
To assess whether these sequences are convergent, we need to define a metric to
measure the distance between two random variables. As we will see in the lecture
entitled Sequences of random variables (see p. 491), there are several ways of
defining the concept of distance between two random variables. All these ways are
legitimate and are useful in different situations.
If a is a limit of the sequence {a_n}, we say that the sequence {a_n} is a convergent
sequence and that it converges to a. We indicate the fact that a is a limit of
{a_n} by

a = \lim_{n \to \infty} a_n

Also in this case, it is possible to prove that a convergent sequence has a unique
limit.
Proof. The proof is by contradiction. Suppose that a and a' are two limits of
a sequence {a_n} and a \neq a'. By combining properties 1) and 2) of a metric (see
above), we obtain

d(a, a') > 0

i.e., d(a, a') = d, where d is a strictly positive constant. Pick any term a_n of the
sequence. By property 4) of a metric (the triangle inequality), we have

d(a', a_n) \geq d(a, a') - d(a, a_n) = d - d(a, a_n)

Now, take any \varepsilon < d. Since a is a limit of the sequence, we can find n_0 such that
d(a, a_n) < \varepsilon, \forall n > n_0, which means that

d - d(a, a_n) > d - \varepsilon, \; \forall n > n_0

and

d(a', a_n) \geq d - \varepsilon > 0, \; \forall n > n_0

Therefore, d(a', a_n) cannot be made smaller than d - \varepsilon and, as a consequence, a'
cannot be a limit of the sequence.
Convergence criterion

In practice, it is usually difficult to assess the convergence of a sequence using
Definition 37. Instead, convergence can be assessed using the following criterion:
{a_n} converges to a if and only if

\lim_{n \to \infty} d(a_n, a) = 0

Proof. This is easily proved by defining a sequence of real numbers {d_n} whose
generic term is

d_n = d(a_n, a)

and noting that the definition of convergence of {a_n} to a, which is

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d(a_n, a) < \varepsilon, \; \forall n > n_0

can be written as

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d_n < \varepsilon, \; \forall n > n_0

which is the definition of convergence of {d_n} to 0. In summary, to assess the
convergence of a sequence {a_n} to a, one needs to:

1. find a metric d(a_n, a) to measure the distance between the terms of the
sequence a_n and the candidate limit a;

2. define a new sequence {d_n}, where d_n = d(a_n, a);

3. study the convergence of the sequence {d_n}, which is a simple problem,
because {d_n} is a sequence of real numbers.
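The three steps above can be sketched numerically. The snippet below (an illustration, not part of the original text) applies the criterion to the sequence a_n = 1/n with candidate limit a = 0:

```python
# Step 1: choose a metric on the real line
def d(a, b):
    return abs(a - b)

# Step 2: build the sequence of distances d_n = d(a_n, a) for a_n = 1/n, a = 0
a = 0
d_n = [d(1 / n, a) for n in range(1, 10001)]

# Step 3: the distances decrease monotonically toward zero
print(d_n[0], d_n[-1])  # 1.0 0.0001
```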
Chapter 7

Review of differentiation rules

This lecture contains a summary of differentiation rules, i.e. of rules for computing
the derivative of a function. This review is neither detailed nor rigorous and it
is not meant to be a substitute for a proper lecture on differentiation. Its only
purpose is to serve as a quick review of differentiation rules.
In what follows, f(x) will denote a function of one variable and \frac{d}{dx} f(x) will
denote its first derivative.

If f(x) = c, where c is a constant, then

\frac{d}{dx} f(x) = 0

If f(x) = x^n, where n is a constant exponent, then

\frac{d}{dx} f(x) = n x^{n-1}

Example 40 Define

f(x) = x^5

The derivative of f(x) is

\frac{d}{dx} f(x) = 5 x^{5-1} = 5 x^4
Example 41 Define

f(x) = \sqrt[3]{x^4}

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} \sqrt[3]{x^4} = \frac{d}{dx} x^{4/3} = \frac{4}{3} x^{4/3 - 1} = \frac{4}{3} x^{1/3}
1. Multiplication by a constant:

\frac{d}{dx} (c_1 f_1(x)) = c_1 \frac{d}{dx} f_1(x)

2. Addition:

\frac{d}{dx} (f_1(x) + f_2(x)) = \frac{d}{dx} f_1(x) + \frac{d}{dx} f_2(x)
Example 44 Define

f(x) = 2 + \exp(x)

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} (2 + \exp(x)) = \frac{d}{dx} (2) + \frac{d}{dx} (\exp(x))

The first summand is

\frac{d}{dx} (2) = 0

because the derivative of a constant is 0. The second summand is

\frac{d}{dx} (\exp(x)) = \exp(x)

by the rule for differentiating exponentials. Therefore

\frac{d}{dx} f(x) = \frac{d}{dx} (2) + \frac{d}{dx} (\exp(x)) = 0 + \exp(x) = \exp(x)
The derivative of a product of two functions is

\frac{d}{dx} (f_1(x) f_2(x)) = \left( \frac{d}{dx} f_1(x) \right) f_2(x) + f_1(x) \left( \frac{d}{dx} f_2(x) \right)

Example 45 Define

f(x) = x \ln(x)

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} (x \ln(x)) = \left( \frac{d}{dx} (x) \right) \ln(x) + x \left( \frac{d}{dx} (\ln(x)) \right) = 1 \cdot \ln(x) + x \cdot \frac{1}{x} = \ln(x) + 1
What does the chain rule mean in practice? It means that first you need to
compute the derivative of g(y):

\frac{d}{dy} g(y)

Then, you substitute y with h(x):

\left. \frac{d}{dy} g(y) \right|_{y = h(x)}

Finally, you multiply the result by the derivative of the inner function:

\frac{d}{dx} h(x)
Example 46 Define

f(x) = \ln(x^2)

The function f(x) is a composite function:

f(x) = g(h(x))

where

g(y) = \ln(y)

and

h(x) = x^2

The derivative of h(x) is

\frac{d}{dx} h(x) = \frac{d}{dx} x^2 = 2x

The derivative of g(y) is

\frac{d}{dy} g(y) = \frac{d}{dy} (\ln(y)) = \frac{1}{y}

Substituting y with h(x), we obtain

\left. \frac{d}{dy} g(y) \right|_{y = h(x)} = \frac{1}{h(x)} = \frac{1}{x^2}

Therefore

\frac{d}{dx} (g(h(x))) = \left( \left. \frac{d}{dy} g(y) \right|_{y = h(x)} \right) \frac{d}{dx} h(x) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}
\left. \frac{d}{dy} g(y) \right|_{y = h(x)} = -\sin(h(x)) = -\sin(x^2)

Therefore

\frac{d}{dx} (g(h(x))) = \left( \left. \frac{d}{dy} g(y) \right|_{y = h(x)} \right) \frac{d}{dx} h(x) = -\sin(x^2) \cdot 2x
If a function f(x) is invertible and differentiable, then its inverse x = f^{-1}(y) has
derivative

\frac{d}{dy} f^{-1}(y) = \left( \left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} \right)^{-1}

Example 48 Define

f(x) = \exp(3x)

Its inverse is

f^{-1}(y) = \frac{1}{3} \ln(y)

The derivative of f(x) is

\frac{d}{dx} f(x) = 3 \exp(3x)

As a consequence

\left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} = 3 \exp(3x) \Big|_{x = \frac{1}{3} \ln(y)} = 3 \exp\left( 3 \cdot \frac{1}{3} \ln(y) \right) = 3y

and

\frac{d}{dy} f^{-1}(y) = \left( \left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} \right)^{-1} = (3y)^{-1} = \frac{1}{3y}
Chapter 8

Review of integration rules

This lecture contains a summary of integration rules, i.e. of rules for computing
definite and indefinite integrals of a function. This review is neither detailed nor
rigorous and it is not meant to be a substitute for a proper lecture on integration.
Its only purpose is to serve as a quick review of integration rules.
In what follows, f(x) will denote a function of one variable and \frac{d}{dx} f(x) will
denote its first derivative. A function F(x) is an indefinite integral of f(x) if and
only if

\frac{d}{dx} F(x) = f(x)

An indefinite integral F(x) is denoted by

F(x) = \int f(x) \, dx

Example 49 Let

f(x) = x^3

The function

F(x) = \frac{1}{4} x^4

is an indefinite integral of f(x) because

\frac{d}{dx} F(x) = \frac{d}{dx} \left( \frac{1}{4} x^4 \right) = \frac{1}{4} \frac{d}{dx} x^4 = \frac{1}{4} \cdot 4 x^3 = x^3
Also the function

G(x) = \frac{1}{2} + \frac{1}{4} x^4

is an indefinite integral of f(x), because

\frac{d}{dx} G(x) = \frac{d}{dx} \left( \frac{1}{2} + \frac{1}{4} x^4 \right) = \frac{d}{dx} \left( \frac{1}{2} \right) + \frac{1}{4} \frac{d}{dx} x^4 = 0 + \frac{1}{4} \cdot 4 x^3 = x^3
Note that if a function F(x) is an indefinite integral of f(x), then also the
function

G(x) = F(x) + c

is an indefinite integral of f(x) for any constant c \in \mathbb{R}, because

\frac{d}{dx} G(x) = \frac{d}{dx} (F(x) + c) = \frac{d}{dx} (F(x)) + \frac{d}{dx} (c) = f(x) + 0 = f(x)

This is also the reason why the adjective indefinite is used: indefinite
integrals are defined only up to an additive constant.
The following subsections contain some rules for computing the indefinite
integrals of functions that are frequently encountered in probability theory and
statistics. In all these subsections, c will denote a constant and the integration rules
will be reported without proof. Proofs are trivial and can be easily performed
by the reader: it suffices to compute the first derivative of F(x) and verify that it
equals f(x).
If f(x) = a, where a is a constant, then

F(x) = ax + c

If f(x) = x^n, then

F(x) = \frac{1}{n+1} x^{n+1} + c

when n \neq -1. When n = -1, i.e. when

f(x) = \frac{1}{x}

the integral is

F(x) = \ln(x) + c
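As the text notes, each rule can be verified by differentiating F(x) and recovering f(x). The snippet below (a sanity check, not part of the original text) does this numerically for the power rule, including the special case n = -1:

```python
import math

def F(x, n):
    """An antiderivative of f(x) = x^n (constant of integration omitted)."""
    if n == -1:
        return math.log(x)
    return x ** (n + 1) / (n + 1)

# check that d/dx F(x) = x^n by central finite differences
h = 1e-6
for n in (-1, 0, 1, 3):
    for x in (0.5, 2.0):
        numeric = (F(x + h, n) - F(x - h, n)) / (2 * h)
        assert abs(numeric - x ** n) < 1e-5
```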
In other words, the integral of a linear combination is equal to the linear
combination of the integrals. This property is called "linearity of the integral".
Two special cases of this rule are:

1. Multiplication by a constant:

\int c_1 f_1(x) \, dx = c_1 \int f_1(x) \, dx

2. Addition:

\int (f_1(x) + f_2(x)) \, dx = \int f_1(x) \, dx + \int f_2(x) \, dx

1 Remember that \log_b(x) = \ln(x) / \ln(b).
2 Remember that b^x = \exp(x \ln(b)).
f(x) is called the integrand function and a and b are called the lower bound of
integration and the upper bound of integration.
The following subsections contain some properties of definite integrals, which
are also often utilized to actually compute definite integrals. If

F(x) = \int_a^x f(t) \, dt

then

\frac{d}{dx} F(x) = f(x)

In other words, if you differentiate a definite integral with respect to its upper
bound of integration, then you obtain the integrand function.

Example 50 Define

F(x) = \int_a^x \exp(2t) \, dt

Then:

\frac{d}{dx} F(x) = \exp(2x)
2. Addition:

\int_a^b (f_1(x) + f_2(x)) \, dx = \int_a^b f_1(x) \, dx + \int_a^b f_2(x) \, dx
1. Define the new variable

t = g(x)

Differentiating it, obtain

dt = \frac{d}{dx} g(x) \, dx

2. Recompute the bounds of integration:

x = a \Rightarrow t = g(a)
x = b \Rightarrow t = g(b)

3. Substitute g(x) and \frac{d}{dx} g(x) \, dx in the integral:

\int_a^b f(g(x)) \frac{d}{dx} g(x) \, dx = \int_{g(a)}^{g(b)} f(t) \, dt
We define the new variable

t = \ln(x)

and obtain

dt = \frac{d}{dx} \ln(x) \, dx = \frac{1}{x} \, dx

The new bounds of integration are

x = 1 \Rightarrow t = \ln(1) = 0
x = 2 \Rightarrow t = \ln(2)
g (x) = 1
where both the lower bound of integration a and the upper bound of integration b
may depend on y, under appropriate technical conditions (not discussed here) the
first derivative of the function I(y) with respect to y can be computed as follows:

\frac{d}{dy} I(y) = \left( \frac{d}{dy} b(y) \right) f(b(y), y) - \left( \frac{d}{dy} a(y) \right) f(a(y), y) + \int_{a(y)}^{b(y)} \frac{\partial}{\partial y} f(x, y) \, dx

where \frac{\partial}{\partial y} f(x, y) is the first partial derivative of f(x, y) with respect to y.
is

\frac{d}{dy} I(y) = \left( \frac{d}{dy} (y^2 + 1) \right) \exp((y^2 + 1) y) - \left( \frac{d}{dy} y^2 \right) \exp(y^2 \cdot y) + \int_{y^2}^{y^2 + 1} \frac{\partial}{\partial y} (\exp(xy)) \, dx
= 2y \exp(y^3 + y) - 2y \exp(y^3) + \int_{y^2}^{y^2 + 1} x \exp(xy) \, dx
Exercise 1

Compute the following integral:

\int_0^\infty \cos(x) \exp(-x) \, dx

Solution

Performing two integrations by parts, we obtain:

\int_0^\infty \cos(x) \exp(-x) \, dx
= [\sin(x) \exp(-x)]_0^\infty - \int_0^\infty \sin(x) (-\exp(-x)) \, dx
= 0 - 0 + \int_0^\infty \sin(x) \exp(-x) \, dx
= [-\cos(x) \exp(-x)]_0^\infty - \int_0^\infty (-\cos(x)) (-\exp(-x)) \, dx
= 0 - (-1) - \int_0^\infty \cos(x) \exp(-x) \, dx
= 1 - \int_0^\infty \cos(x) \exp(-x) \, dx

Therefore

2 \int_0^\infty \cos(x) \exp(-x) \, dx = 1

or

\int_0^\infty \cos(x) \exp(-x) \, dx = \frac{1}{2}
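The value 1/2 can be confirmed numerically; the sketch below (not part of the original text) truncates the improper integral at x = 40, where the exp(-x) factor makes the tail negligible:

```python
def riemann(f, a, b, n=200000):
    """Midpoint Riemann-sum approximation of the integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w

import math

# integral_0^inf cos(x) exp(-x) dx, truncated at 40 (tail is of order exp(-40))
value = riemann(lambda x: math.cos(x) * math.exp(-x), 0.0, 40.0)
assert abs(value - 0.5) < 1e-6
```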
Exercise 2

Use the Leibniz integral rule to compute the derivative with respect to y of the
following integral:

I(y) = \int_0^{y^2} \exp(-xy) \, dx

Solution

The Leibniz integral rule is

\frac{d}{dy} \int_{a(y)}^{b(y)} f(x, y) \, dx = \left( \frac{d}{dy} b(y) \right) f(b(y), y) - \left( \frac{d}{dy} a(y) \right) f(a(y), y) + \int_{a(y)}^{b(y)} \frac{\partial}{\partial y} f(x, y) \, dx

Applying it with a(y) = 0 and b(y) = y^2, we obtain

\frac{d}{dy} I(y) = 2y \exp(-y^3) + \int_0^{y^2} (-x) \exp(-xy) \, dx
Exercise 3

Compute the following integral:

\int_0^1 x (1 + x^2)^{-2} \, dx

Solution

This integral can be solved using the change of variable technique. Setting t = x^2,
so that dt = 2x \, dx, we obtain

\int_0^1 x (1 + x^2)^{-2} \, dx = \int_0^1 \frac{1}{2} (1 + t)^{-2} \, dt = \left[ -\frac{1}{2} (1 + t)^{-1} \right]_0^1 = -\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} = \frac{1}{4}
Chapter 9

Special functions

This chapter briefly introduces some special functions that are frequently used in
probability and statistics. Among them is the Gamma function, which generalizes
the factorial

n! = 1 \cdot 2 \cdot \ldots \cdot (n-1) \cdot n

and its recursion

n! = (n-1)! \cdot n

to non-integer arguments, via the recursive property

\Gamma(z) = (z-1) \Gamma(z-1)
9.1.1 Definition

The following is a possible definition of the Gamma function.

While the domain of definition of the Gamma function can be extended beyond
the set R_{++} of strictly positive real numbers, for example, to complex numbers,
the somewhat restrictive definition given above is more than sufficient to address
all the problems involving the Gamma function that are found in these lectures.

1 See p. 10.
9.1.2 Recursion

The next proposition states a recursive property that is used to derive several other
properties of the Gamma function:

\Gamma(z) = (z-1) \Gamma(z-1)     (9.1)

By iterating the recursion, one obtains the relation to the factorial function:

\Gamma(n) = (n-1)!

for n \in \mathbb{N}, because

\Gamma(1) = 1 = 0!
\Gamma(2) = (2-1) \Gamma(2-1) = \Gamma(1) \cdot 1 = 1 = 1!
\Gamma(3) = (3-1) \Gamma(3-1) = \Gamma(2) \cdot 2 = 1 \cdot 2 = 2!
\Gamma(4) = (4-1) \Gamma(4-1) = \Gamma(3) \cdot 3 = 1 \cdot 2 \cdot 3 = 3!
\vdots
\Gamma(n) = (n-1) \Gamma(n-1) = 1 \cdot 2 \cdot 3 \cdot \ldots \cdot (n-1) = (n-1)!

2 See p. 51.
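Both the recursion and the relation to the factorial can be checked with Python's standard-library `math.gamma`; the snippet below (not part of the original text) is a quick sanity check:

```python
import math

# Gamma reproduces the factorial: Gamma(n) = (n - 1)!
for n in range(1, 10):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# and satisfies the recursion Gamma(z) = (z - 1) Gamma(z - 1) for non-integer z too
z = 4.3
assert math.isclose(math.gamma(z), (z - 1) * math.gamma(z - 1))
```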
\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}

Proof. Using the definition of the Gamma function and performing the change of
variable x = t^2, we obtain

\Gamma\left( \frac{1}{2} \right) = \int_0^\infty x^{1/2 - 1} \exp(-x) \, dx = \int_0^\infty x^{-1/2} \exp(-x) \, dx = 2 \int_0^\infty \exp(-t^2) \, dt

The last integral can be computed by writing it as the square root of a double
integral:

2 \int_0^\infty \exp(-t^2) \, dt
= 2 \left( \int_0^\infty \exp(-t^2) \, dt \int_0^\infty \exp(-t^2) \, dt \right)^{1/2}
= 2 \left( \int_0^\infty \exp(-t^2) \, dt \int_0^\infty \exp(-s^2) \, ds \right)^{1/2}
= 2 \left( \int_0^\infty \int_0^\infty \exp(-t^2 - s^2) \, dt \, ds \right)^{1/2}

Performing the change of variable t = us, this becomes

2 \left( \int_0^\infty \int_0^\infty \exp(-s^2 - u^2 s^2) \, s \, du \, ds \right)^{1/2}
= 2 \left( \int_0^\infty \int_0^\infty \exp(-(1 + u^2) s^2) \, s \, ds \, du \right)^{1/2}
= 2 \left( \int_0^\infty \left[ -\frac{1}{2(1 + u^2)} \exp(-(1 + u^2) s^2) \right]_0^\infty du \right)^{1/2}
= 2 \left( \int_0^\infty \left( 0 + \frac{1}{2(1 + u^2)} \right) du \right)^{1/2}
= 2^{1/2} \left( \int_0^\infty \frac{1}{1 + u^2} \, du \right)^{1/2}
= 2^{1/2} \left( [\arctan(u)]_0^\infty \right)^{1/2}
= 2^{1/2} \left( \arctan(\infty) - \arctan(0) \right)^{1/2}
= 2^{1/2} \left( \frac{\pi}{2} - 0 \right)^{1/2} = \pi^{1/2}
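The value Gamma(1/2) = sqrt(pi) can be checked both against the library implementation and against the integral representation used in the proof; the sketch below (not part of the original text) truncates the improper integral at t = 10, where the tail is of order exp(-100):

```python
import math

assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# the same value via 2 * integral_0^inf exp(-t^2) dt, midpoint rule, truncated at 10
n, top = 400000, 10.0
w = top / n
integral = 2 * sum(math.exp(-((i + 0.5) * w) ** 2) for i in range(n)) * w
assert abs(integral - math.sqrt(math.pi)) < 1e-6
```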
\Gamma\left( n + \frac{1}{2} \right) = \sqrt{\pi} \prod_{j=0}^{n-1} \left( j + \frac{1}{2} \right)

for n \in \mathbb{N}. Proof. Repeatedly applying the recursion \Gamma(z) = (z-1) \Gamma(z-1), we
obtain

\Gamma\left( n + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \Gamma\left( n - 1 + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \Gamma\left( n - 2 + \frac{1}{2} \right)
\vdots
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \cdots \left( n - n + \frac{1}{2} \right) \Gamma\left( n - n + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \cdots \frac{1}{2} \, \Gamma\left( \frac{1}{2} \right)
= \sqrt{\pi} \prod_{j=0}^{n-1} \left( j + \frac{1}{2} \right)
There are also other special cases in which the value of the Gamma function can
be derived analytically, but it is not possible to express \Gamma(z) in terms of elementary
functions for every z. As a consequence, one often needs to resort to numerical
algorithms to compute \Gamma(z). For example, the Matlab command

gamma(z)

returns the value of the Gamma function at z.
When the upper bound of integration in the definition of the Gamma function
is replaced by a finite number y, the function \gamma(z, y) thus obtained is called the
lower incomplete Gamma function.

3 Abramowitz, M. and I. A. Stegun (1965) Handbook of mathematical functions: with
formulas, graphs, and mathematical tables, Courier Dover Publications.
9.2.1 Definition

The following is a possible definition of the Beta function.

B(x, y) = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}

While the domain of definition of the Beta function can be extended beyond
the set R_{++}^2 of couples of strictly positive real numbers, for example, to couples
of complex numbers, the somewhat restrictive definition given above is more than
sufficient to address all the problems involving the Beta function that are found in
these lectures.
Proof. Given the definition of the Beta function as a ratio of Gamma functions,
the equality holds if and only if

\int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}

or

\Gamma(x + y) \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt = \Gamma(x) \Gamma(y)

The latter equality can be proved as follows:

\Gamma(x) \Gamma(y)
= \int_0^\infty u^{x-1} \exp(-u) \, du \int_0^\infty v^{y-1} \exp(-v) \, dv
= \int_0^\infty \int_0^\infty v^{y-1} \exp(-v) \, u^{x-1} \exp(-u) \, du \, dv

Performing the change of variable u = vt in the inner integral, this becomes

\int_0^\infty \int_0^\infty v^{y-1} \exp(-v) (vt)^{x-1} \exp(-vt) \, v \, dt \, dv
= \int_0^\infty \int_0^\infty v^{y-1} \exp(-v) \, v^x t^{x-1} \exp(-vt) \, dt \, dv
= \int_0^\infty \int_0^\infty v^{x+y-1} \exp(-v) \, t^{x-1} \exp(-vt) \, dt \, dv
= \int_0^\infty \int_0^\infty v^{x+y-1} t^{x-1} \exp(-(1 + t) v) \, dt \, dv
= \int_0^\infty t^{x-1} \int_0^\infty v^{x+y-1} \exp(-(1 + t) v) \, dv \, dt

Performing the change of variable v = s / (1 + t) in the inner integral, this becomes

\int_0^\infty t^{x-1} \int_0^\infty \left( \frac{s}{1 + t} \right)^{x+y-1} \exp(-s) \frac{1}{1 + t} \, ds \, dt
= \int_0^\infty t^{x-1} (1 + t)^{-x-y} \int_0^\infty s^{x+y-1} \exp(-s) \, ds \, dt
= \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, \Gamma(x + y) \, dt
= \Gamma(x + y) \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt
Another integral representation can be obtained with the change of variable

s = \frac{t}{1 + t} = 1 - \frac{1}{1 + t}

Before performing it, note that

\lim_{t \to \infty} \frac{t}{1 + t} = 1

and that

t = \frac{1}{1 - s} - 1 = \frac{s}{1 - s}

Furthermore, by differentiating the previous expression, we obtain

dt = \left( \frac{1}{1 - s} \right)^2 ds

We are now ready to perform the change of variable:

B(x, y) = \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt
= \int_0^1 \left( \frac{s}{1 - s} \right)^{x-1} \left( 1 + \frac{s}{1 - s} \right)^{-x-y} \left( \frac{1}{1 - s} \right)^2 ds
= \int_0^1 \left( \frac{s}{1 - s} \right)^{x-1} \left( \frac{1}{1 - s} \right)^{-x-y} \left( \frac{1}{1 - s} \right)^2 ds
= \int_0^1 s^{x-1} \left( \frac{1}{1 - s} \right)^{x-1-x-y+2} ds
= \int_0^1 s^{x-1} \left( \frac{1}{1 - s} \right)^{1-y} ds
= \int_0^1 s^{x-1} (1 - s)^{y-1} \, ds
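The representation on [0, 1] can be checked against the definition of the Beta function in terms of Gamma functions; the sketch below (not part of the original text) compares the two for one choice of arguments:

```python
import math

def beta(x, y):
    """Beta function via its definition as a ratio of Gamma functions."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def riemann(f, a, b, n=200000):
    """Midpoint Riemann-sum approximation of the integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w

x, y = 2.5, 3.0
integral = riemann(lambda s: s ** (x - 1) * (1 - s) ** (y - 1), 0.0, 1.0)
assert abs(integral - beta(x, y)) < 1e-6
```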
Note that the two representations (9.2) and (9.3) involve improper integrals
that converge if x > 0 and y > 0. This might help you to see why the arguments
of the Beta function are required to be strictly positive in Definition 61.
Exercise 1

Compute the ratio

\frac{\Gamma(16/3)}{\Gamma(10/3)}

Solution

We need to repeatedly apply the recursive formula

\Gamma(z) = (z-1) \Gamma(z-1)

to the numerator of the ratio, as follows:

\frac{\Gamma(16/3)}{\Gamma(10/3)} = \frac{(16/3 - 1) \Gamma(16/3 - 1)}{\Gamma(10/3)} = \frac{(13/3) \Gamma(13/3)}{\Gamma(10/3)}
= \frac{(13/3)(13/3 - 1) \Gamma(13/3 - 1)}{\Gamma(10/3)} = \frac{(13/3)(10/3) \Gamma(10/3)}{\Gamma(10/3)} = \frac{130}{9}
Exercise 2

Compute

\Gamma(5)

Solution

We need to use the relation of the Gamma function to the factorial function:

\Gamma(n) = (n-1)!

which gives

\Gamma(5) = (5-1)! = 4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24
Exercise 3

Express the integral

\int_0^\infty x^{9/2} \exp\left( -\frac{1}{2} x \right) dx

in terms of the Gamma function.

Solution

This is accomplished with the change of variable x = 2t:

\int_0^\infty x^{9/2} \exp\left( -\frac{1}{2} x \right) dx
= \int_0^\infty (2t)^{9/2} \exp(-t) \, 2 \, dt
= 2^{11/2} \int_0^\infty t^{11/2 - 1} \exp(-t) \, dt
= 2^{11/2} \, \Gamma(11/2)
Exercise 4

Compute the product

\Gamma\left( \frac{5}{2} \right) B\left( \frac{3}{2}, 1 \right)

where \Gamma() is the Gamma function and B() is the Beta function.

Solution

We need to write the Beta function in terms of Gamma functions:

\Gamma\left( \frac{5}{2} \right) B\left( \frac{3}{2}, 1 \right)
= \Gamma\left( \frac{5}{2} \right) \frac{\Gamma(3/2) \Gamma(1)}{\Gamma(3/2 + 1)}
= \Gamma\left( \frac{5}{2} \right) \frac{\Gamma(3/2) \Gamma(1)}{\Gamma(5/2)}
= \Gamma\left( \frac{3}{2} \right) \Gamma(1)
\overset{A}{=} \Gamma\left( \frac{3}{2} \right)
\overset{B}{=} \left( \frac{3}{2} - 1 \right) \Gamma\left( \frac{3}{2} - 1 \right)
= \frac{1}{2} \Gamma\left( \frac{1}{2} \right)
\overset{C}{=} \frac{1}{2} \sqrt{\pi}

where: in step A we have used the fact that \Gamma(1) = 1; in step B we have used
the recursive formula for the Gamma function; in step C we have used the fact
that

\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}
Exercise 5

Compute the ratio

\frac{B(7/2, 9/2)}{B(5/2, 11/2)}

Solution

This is achieved by rewriting the numerator of the ratio in terms of Gamma
functions:

\frac{B(7/2, 9/2)}{B(5/2, 11/2)}
= \frac{1}{B(5/2, 11/2)} \cdot \frac{\Gamma(7/2) \Gamma(9/2)}{\Gamma(7/2 + 9/2)}
\overset{A}{=} \frac{1}{B(5/2, 11/2)} \cdot \frac{(7/2 - 1) \Gamma(7/2 - 1) \Gamma(9/2)}{\Gamma(7/2 + 9/2)}
\overset{B}{=} \frac{1}{B(5/2, 11/2)} \cdot \frac{(5/2) \Gamma(5/2) \cdot \frac{2}{9} \Gamma(11/2)}{\Gamma(5/2 + 11/2)}
= \frac{5}{2} \cdot \frac{2}{9} \cdot \frac{1}{B(5/2, 11/2)} \cdot \frac{\Gamma(5/2) \Gamma(11/2)}{\Gamma(5/2 + 11/2)}
\overset{C}{=} \frac{5}{9} \cdot \frac{1}{B(5/2, 11/2)} \cdot B\left( \frac{5}{2}, \frac{11}{2} \right) = \frac{5}{9}

where: in steps A and B we have used the recursive formula for the Gamma
function (in step B, \Gamma(11/2) = (9/2) \Gamma(9/2) implies \Gamma(9/2) = \frac{2}{9} \Gamma(11/2)); in
step C we have used the definition of the Beta function.
Exercise 6

Compute the integral

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx

Solution

We first express the integral in terms of the Beta function, using the change of
variable x = t/2:

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx
\overset{A}{=} \int_0^\infty \left( \frac{1}{2} t \right)^{3/2} (1 + t)^{-5} \frac{1}{2} \, dt
= \left( \frac{1}{2} \right)^{5/2} \int_0^\infty t^{3/2} (1 + t)^{-5} \, dt
= \left( \frac{1}{2} \right)^{5/2} \int_0^\infty t^{5/2 - 1} (1 + t)^{-5/2 - 5/2} \, dt
\overset{B}{=} \left( \frac{1}{2} \right)^{5/2} B\left( \frac{5}{2}, \frac{5}{2} \right)

The Beta function can be written in terms of Gamma functions:

B\left( \frac{5}{2}, \frac{5}{2} \right) = \frac{\Gamma(5/2) \Gamma(5/2)}{\Gamma(5)} = \frac{\left( \frac{3}{2} \cdot \frac{1}{2} \Gamma(1/2) \right)^2}{4!} = \frac{\frac{9}{16} \pi}{24} = \frac{9\pi}{384}

where we have used the recursive formula for the Gamma function and the fact
that

\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}

Substituting the above number into the previous expression for the integral, we
obtain

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx = \left( \frac{1}{2} \right)^{5/2} B\left( \frac{5}{2}, \frac{5}{2} \right) = \frac{1}{8} \sqrt{2} \cdot \frac{9\pi}{384} = \frac{9 \sqrt{2} \pi}{3072} = \frac{3 \sqrt{2} \pi}{1024}
If you wish, you can check the above result using the MATLAB commands

syms x
f = (x^(3/2))*((1 + 2*x)^(-5))
int(f, 0, Inf)
Part II
Fundamentals of probability
Chapter 10
Probability
Probability is used to quantify the likelihood of things that can happen, when it
is not yet known whether they will happen. Sometimes probability is also used to
quantify the likelihood of things that could have happened in the past, when it is
not yet known whether they actually happened.
Since we usually speak of the "probability of an event", the next section introduces
a formal definition of the concept of event. We then discuss the properties
that probability needs to satisfy. Finally, we discuss some possible interpretations
of the concept of probability.
Example 64 Suppose that we toss a die. Six numbers, from 1 to 6, can appear
face up, but we do not yet know which one of them will appear. The sample space
is

\Omega = \{1, 2, 3, 4, 5, 6\}
1 In this lecture we are going to use the Greek letter \Omega (Omega), which is often used in
probability theory. \Omega is upper case, while \omega is lower case.
2 P. 75.
Each of the six numbers is a sample point. The outcomes are mutually exclusive,
because only one number at a time can appear face up. The outcomes are also
exhaustive, because at least one of the six numbers in \Omega will appear face up after
we toss the die. Define

E = \{1, 3, 5\}

E is an event (a subset of \Omega). In words, the event E can be described as "an odd
number appears face up". Now, define

F = \{6\}
10.2 Probability

The probability of an event is a real number, attached to the event, that tells us
how likely that event is. Suppose E is an event. We denote the probability of E
by P(E).
Probability needs to satisfy the following properties:

1. Range. 0 \leq P(E) \leq 1 for any event E.

2. Sure thing. P(\Omega) = 1.

3. Sigma-additivity. If E_1, E_2, \ldots is a sequence of disjoint events, then

P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n)
Example 65 Suppose that we flip a coin. The possible outcomes are either tail
(T) or head (H), i.e.,

\Omega = \{T, H\}

3 See p. 31.
There are a total of four subsets of \Omega (events): \Omega itself, the empty set \emptyset, the event
\{T\} and the event \{H\}. The following assignment of probabilities satisfies the
properties enumerated above:

P(\Omega) = 1, \quad P(\emptyset) = 0, \quad P(\{T\}) = \frac{1}{2}, \quad P(\{H\}) = \frac{1}{2}

All these probabilities are between 0 and 1, so the range property is satisfied.
P(\Omega) = 1, so the sure thing property is satisfied. Also sigma-additivity is satisfied,
because

P(\{T\} \cup \{H\}) = P(\Omega) = 1 = \frac{1}{2} + \frac{1}{2} = P(\{T\}) + P(\{H\})
P(\Omega \cup \emptyset) = P(\Omega) = 1 = 1 + 0 = P(\Omega) + P(\emptyset)
P(\{T\} \cup \emptyset) = P(\{T\}) = \frac{1}{2} = \frac{1}{2} + 0 = P(\{T\}) + P(\emptyset)
P(\{H\} \cup \emptyset) = P(\{H\}) = \frac{1}{2} = \frac{1}{2} + 0 = P(\{H\}) + P(\emptyset)

and the four couples (\{T\}, \{H\}), (\Omega, \emptyset), (\{T\}, \emptyset), (\{H\}, \emptyset) are the only four
possible couples of disjoint sets.
Before ending this section, two remarks are in order. First, we have not discussed
the interpretations of probability, but below you can find a brief discussion
of the interpretations of probability. Second, we have been somewhat sloppy in
defining events and probability, but you can find a more rigorous definition of
probability below.
P(\emptyset) = 0

Proof. Define a sequence of events E_1 = \Omega and E_n = \emptyset for n \geq 2, so that the
events in the sequence are disjoint and their union is \Omega. By the sure thing property
and sigma-additivity,

1 = P(\Omega) = P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n) = P(\Omega) + \sum_{n=2}^{\infty} P(\emptyset)

that is,

P(\Omega) + \sum_{n=2}^{\infty} P(\emptyset) = 1

Since P(\Omega) = 1, this implies \sum_{n=2}^{\infty} P(\emptyset) = 0, which can hold only if P(\emptyset) = 0.
Proof. Let E and F be two disjoint events. Define a sequence of events E_1 = E,
E_2 = F and E_n = \emptyset for n \geq 3. By sigma-additivity,

P(E \cup F) = P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n) = P(E) + P(F) + \sum_{n=3}^{\infty} P(\emptyset) = P(E) + P(F)

since P(\emptyset) = 0.
P(E^c) = 1 - P(E)

In words, the probability that an event does not occur (P(E^c)) is equal to one
minus the probability that it occurs.
Proof. Note that

\Omega = E \cup E^c     (10.2)

and that E and E^c are disjoint sets. Then, using the sure thing property and finite
additivity, we obtain

1 = P(\Omega) = P(E \cup E^c) = P(E) + P(E^c)
P(E \cup F) = P(E) + P(F) - P(E \cap F)

Proof. First note that

P(E) = P(E \cap \Omega) = P(E \cap (F \cup F^c)) = P((E \cap F) \cup (E \cap F^c)) = P(E \cap F) + P(E \cap F^c)

and

P(F) = P(F \cap \Omega) = P(F \cap (E \cup E^c)) = P((F \cap E) \cup (F \cap E^c)) = P(F \cap E) + P(F \cap E^c)

so that

P(E \cap F^c) = P(E) - P(E \cap F)
P(F \cap E^c) = P(F) - P(E \cap F)

Furthermore,

E \cup F = (E \cap F) \cup (E \cap F^c) \cup (F \cap E^c)

and the three events on the right hand side are disjoint. Thus,

P(E \cup F) = P((E \cap F) \cup (E \cap F^c) \cup (F \cap E^c))
= P(E \cap F) + P(E \cap F^c) + P(F \cap E^c)
= P(E \cap F) + P(E) - P(E \cap F) + P(F) - P(E \cap F)
= P(E) + P(F) - P(E \cap F)
If E \subseteq F, then

P(E) \leq P(F)

Proof. Since E \subseteq F, we have F = (F \cap E) \cup (F \cap E^c), where the two events
on the right hand side are disjoint and F \cap E = E. By finite additivity,

P(F) = P((F \cap E) \cup (F \cap E^c)) = P(F \cap E) + P(F \cap E^c) = P(E) + P(F \cap E^c)

Since P(F \cap E^c) \geq 0, this implies

P(F) = P(E) + P(F \cap E^c) \geq P(E)
1. Whole set. \Omega \in F.

2. Closure under complementation. If E \in F, then also E^c \in F (E^c, the
complement of E with respect to \Omega, is the set of all elements of \Omega that do
not belong to E).

3. Closure under countable unions. If E_1, E_2, \ldots, E_n, \ldots is a sequence of
subsets of \Omega belonging to F, then

\bigcup_{n=1}^{\infty} E_n \in F
If E \in F and F \in F, then (E \cup F) \in F

It means that if "one of the things in E will happen" and "one of the things in F
will happen" are considered two events, then also "one of the things in E or one
of the things in F will happen" must be considered an event. This simply means
that if you are able to separately assess the possibility of two events E and F
happening, then, of course, you must be able to assess the possibility of one or the
other happening. Property 3) simply extends this intuitive property to countable
collections of events: the extension is needed for mathematical reasons, to derive
certain continuity properties of probability measures.
1. Sure thing. P(\Omega) = 1.

2. Sigma-additivity. If E_1, E_2, \ldots is a sequence of disjoint events belonging
to F, then P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n).

Nothing new has been added to the definition given above. This definition
just clarifies that a probability measure is a function defined on a sigma-algebra of
events. Hence, it is not possible to properly speak of probability for subsets of \Omega
that do not belong to the sigma-algebra.
A triple (\Omega, F, P) is called a probability space and the sets belonging to the
sigma-algebra F are called measurable sets.
Exercise 1

A ball is drawn at random from an urn containing colored balls. The balls can be
either red or blue (no other colors are possible). The probability of drawing a blue
ball is 1/3. What is the probability of drawing a red ball?

Solution

The sample space \Omega can be represented as the union of two disjoint events E and
F:

\Omega = E \cup F

where the event E can be described as "a red ball is drawn" and the event F can
be described as "a blue ball is drawn". Note that E is the complement of F:

E = F^c

Therefore, using the formula for the probability of a complement,

P(E) = 1 - P(F) = 1 - \frac{1}{3} = \frac{2}{3}

5 See p. 32.
Exercise 2

Consider a sample space \Omega comprising three possible outcomes:

\Omega = \{\omega_1, \omega_2, \omega_3\}

Solution

There are two events whose probability is 3/4.
The first one is

E = \{\omega_1, \omega_3\}

By using the formula for the probability of a union of disjoint events, we get
Exercise 3

Consider a sample space \Omega comprising four possible outcomes:

\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}

and the events

E = \{\omega_1\}
F = \{\omega_1, \omega_2\}
G = \{\omega_1, \omega_2, \omega_3\}
H = \{\omega_2, \omega_4\}

Find P(H).

Solution

First note that, by additivity,

P(H) = P(\{\omega_2\} \cup \{\omega_4\}) = P(\{\omega_2\}) + P(\{\omega_4\})

Therefore, in order to compute P(H), we need to compute P(\{\omega_2\}) and P(\{\omega_4\}).
P(\{\omega_2\}) is found using additivity on F:

\frac{5}{10} = P(F) = P(\{\omega_1\} \cup \{\omega_2\}) = P(\{\omega_1\}) + P(\{\omega_2\}) = P(E) + P(\{\omega_2\}) = \frac{1}{10} + P(\{\omega_2\})

so that

P(\{\omega_2\}) = \frac{5}{10} - \frac{1}{10} = \frac{4}{10}

P(\{\omega_4\}) is found using the fact that one minus the probability of an event is
equal to the probability of its complement and the fact that \{\omega_4\} = G^c:

P(\{\omega_4\}) = P(G^c) = 1 - P(G) = 1 - \frac{7}{10} = \frac{3}{10}

As a consequence,

P(H) = P(\{\omega_2\}) + P(\{\omega_4\}) = \frac{4}{10} + \frac{3}{10} = \frac{7}{10}
Chapter 11

Zero-probability events

The notion of a zero-probability event plays a special role in probability theory and
statistics, because it underpins the important concepts of almost sure property and
almost sure event. In this lecture, we define zero-probability events and discuss
some counterintuitive aspects of their apparently simple definition, in particular
the fact that a zero-probability event is not an event that never happens: there are
common probabilistic settings where zero-probability events do happen all the time!
After discussing this matter, we introduce the concepts of almost sure property and
almost sure event.
E is a zero-probability event if and only if

P(E) = 0

Despite the simplicity of this definition, there are some features of zero-probability
events that might seem paradoxical. We illustrate these features with
the following example. Consider a sample space equal to the unit interval:

\Omega = [0, 1]

It is possible to assign probabilities in such a way that each sub-interval has
probability equal to its length.
Stated differently, every possible outcome is a zero-probability event. This might
seem counterintuitive. In everyday language, a zero-probability event is an event
that never happens. However, this example illustrates that a zero-probability event
can indeed happen. Since the sample space provides an exhaustive description of
the possible outcomes, one and only one of the sample points \omega \in \Omega will be the
realized outcome. But we have just demonstrated that all the sample points are
zero-probability events; as a consequence, the realized outcome can only be a
zero-probability event. Another apparently paradoxical aspect of this probability
model is that the sample space \Omega can be obtained as the union of disjoint
zero-probability events:

\Omega = \bigcup_{\omega \in \Omega} \{\omega\}

where each \omega \in \Omega is a zero-probability event and all events in the union are disjoint.
If we forgot that the additivity property of probability applies only to countable
collections of subsets, we would mistakenly deduce that

P(\Omega) = P\left( \bigcup_{\omega \in \Omega} \{\omega\} \right) = \sum_{\omega \in \Omega} P(\{\omega\}) = 0

The main lesson to be taken from this example is that a zero-probability event
is not an event that never happens: in some probability models, where the sample
space is not countable, zero-probability events do happen all the time!
Definition 68 Let a property be given that a sample point \omega \in \Omega can either satisfy
or not satisfy. Let F be the set of all sample points that satisfy the property:

F = \{\omega \in \Omega : \omega \text{ satisfies the property}\}

2 Williams, D. (1991) Probability with martingales, Cambridge University Press.
3 See p. 69.
4 See p. 69.
5 See p. 70.

Denote its complement, that is, the set of all points not satisfying the property, by
F^c. The property is said to be almost sure if there exists a zero-probability event
E such that F^c \subseteq E.

By monotonicity,

F^c \subseteq E \Rightarrow P(F^c) \leq P(E)

which in turn implies P(F^c) = 0. Finally, recalling the formula for the probability
of a complement, we obtain

P(F) = 1 - P(F^c) = 1 - 0 = 1
Example 69 Consider the sample space \Omega = [0, 1] and the assignment of
probabilities introduced in the previous example. Define the event

E = \{\omega \in \Omega : \omega \text{ is a rational number}\}

Since the set of rational numbers in [0, 1] is countable, its elements can be arranged
into a sequence:

E = \{\omega_1, \ldots, \omega_n, \ldots\}

6 In other words, the set F^c of all points that do not satisfy the property is included in a
zero-probability event.
7 See the lecture entitled Probability (p. 69).
8 See p. 73.
9 See p. 72.
10 See p. 32.

P(F) = P(E^c) = 1 - P(E) = 1 - 0 = 1
Exercise 1
2
Let E and F be two events. Let E c be a zero-probability event and P (F ) = 3.
Compute P (E [ F ).
Solution
E^c is a zero-probability event, which means that
P(E^c) = 0
By the formula for the probability of a complement,
P(E) = 1 − P(E^c) = 1 − 0 = 1
Since E ⊆ (E ∪ F), by monotonicity we obtain
P(E ∪ F) ≥ P(E) = 1
and, since no probability can exceed 1,
P(E ∪ F) = 1
Exercise 2
Let E and F be two events. Let E^c be a zero-probability event and P(F) = 1/2.
Compute P(E ∩ F).
1 1 See p. 72.
Solution
E^c is a zero-probability event, which means that
P(E^c) = 0
By the formula for the probability of a complement,
P(E) = 1 − P(E^c) = 1 − 0 = 1
By the formula for the probability of a union,
P(E ∩ F) = P(E) + P(F) − P(E ∪ F) = 1 + 1/2 − P(E ∪ F) = 3/2 − P(E ∪ F)
Since E ⊆ (E ∪ F), by monotonicity, we obtain
P(E ∪ F) ≥ P(E) = 1
so that P(E ∪ F) = 1 and
P(E ∩ F) = 3/2 − 1 = 1/2
Chapter 12
Conditional probability
12.1 Introduction
Let Ω be a sample space and let P(E) denote the probability assigned to the events
E ⊆ Ω. Suppose that, after assigning the probabilities P(E) to the events in Ω, we
receive new information about the things that will happen (the possible outcomes).
In particular, suppose that we are told that the realized outcome will belong to a
set I ⊆ Ω. How should we revise the probabilities assigned to the events in Ω, to
properly take the new information into account?
Denote by P(E|I) the revised probability assigned to an event E ⊆ Ω after
learning that the realized outcome will be an element of I. P(E|I) is called the
conditional probability of E given I.
Despite being an intuitive concept, conditional probability is quite difficult to
define in a rigorous way. We take a gradual approach in this lecture. We first
discuss conditional probability for the very special case in which all the sample
points are equally likely. We then give a more general definition. Finally, we refer
the reader to other lectures where conditional probability is defined in even more
abstract ways.
Suppose that the sample space comprises n sample points:
Ω = {ω₁, …, ωₙ}
Suppose also that each sample point is assigned the same probability:
P({ω₁}) = … = P({ωₙ}) = 1/n
P(E) = card(E) / card(Ω)
where card denotes the cardinality of a set, i.e. the number of its elements. In
other words, the probability of an event E is obtained in two steps:
1. counting the number of "cases that are favorable to the event E", i.e. the
number of elements ! i belonging to E;
2. dividing the number thus obtained by the number of "all possible cases", i.e.
the number of elements ! i belonging to .
For example, if E = {ω₁, ω₂}, then
P(E) = card({ω₁, ω₂}) / card(Ω) = 2/n
When we learn that the realized outcome will belong to a set I ⊆ Ω, we still
apply the rule
probability of an event = (number of cases that are favorable to the event) / (number of all possible cases)
However, the number of all possible cases is now equal to the number of elements
of I, because only the outcomes belonging to I are still possible. Furthermore, the
number of favorable cases is now equal to the number of elements of E ∩ I, because
the outcomes in E ∩ I^c are no longer possible. As a consequence:
P(E|I) = card(E ∩ I) / card(I)
Dividing numerator and denominator by card(Ω), we obtain
P(E|I) = [card(E ∩ I) / card(Ω)] / [card(I) / card(Ω)] = P(E ∩ I) / P(I)
Therefore, when all sample points are equally likely, conditional probabilities
are computed as
P(E|I) = P(E ∩ I) / P(I)
Example 70 Suppose that we toss a die. Six numbers (from 1 to 6) can appear
face up, but we do not yet know which one of them will appear. The sample space
is
Ω = {1, 2, 3, 4, 5, 6}
Each of the six numbers is a sample point and is assigned probability 1/6. Define
the event E as follows:
E = {1, 3, 5}
where the event E could be described as "an odd number appears face up". Define
the event I as follows:
I = {4, 5, 6}
where the event I could be described as "a number greater than 3 appears face up".
The probability of I is
P(I) = P({4}) + P({5}) + P({6}) = 1/6 + 1/6 + 1/6 = 1/2
Suppose we are told that the realized outcome will belong to I. How do we have
to revise our assessment of the probability of the event E, according to the rules
of conditional probability? First of all, we need to compute the probability of the
event E ∩ I:
P(E ∩ I) = P({5}) = 1/6
Then, the conditional probability of E given I is
P(E|I) = P(E ∩ I) / P(I) = (1/6) / (1/2) = 2/6 = 1/3
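Since every quantity in this example is a ratio of cardinalities, the computation is easy to verify with a few lines of code. The sketch below (the helper function is ours, not part of the text) recomputes Example 70 by counting favorable cases with exact fractions.

```python
from fractions import Fraction

def cond_prob(E, I, omega):
    """P(E|I) = card(E ∩ I) / card(I), valid when all points of omega are equally likely."""
    E, I = set(E) & set(omega), set(I) & set(omega)
    return Fraction(len(E & I), len(I))

omega = {1, 2, 3, 4, 5, 6}   # the die roll of Example 70
E = {1, 3, 5}                # "an odd number appears face up"
I = {4, 5, 6}                # "a number greater than 3 appears face up"

print(cond_prob(E, I, omega))   # 1/3, matching the result above
```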
In the next section, we will show that the conditional probability formula
P(E|I) = P(E ∩ I) / P(I)
is valid also for more general cases (i.e. when the sample points are not all equally
likely). However, this formula already allows us to understand why defining con-
ditional probability is a challenging task. In the conditional probability formula,
a division by P(I) is performed. This division is impossible when I is a zero-
probability event¹. If we want to be able to define P(E|I) also when P(I) = 0,
then we need to give a more complicated definition of conditional probability. We
will return to this point later.
2. Sure thing: P(I|I) = 1.
Moreover, since (E ∩ I) ⊆ I, by monotonicity,
P(E ∩ I) ≤ P(I)
Hence:
P(E ∩ I) / P(I) ≤ 1
Furthermore, since P(E ∩ I) ≥ 0 and P(I) ≥ 0, also
P(E ∩ I) / P(I) ≥ 0
4 See p. 73.
But
P(⋃_{n=1}^∞ Eₙ | I) = P((⋃_{n=1}^∞ Eₙ) ∩ I) / P(I)
= P(⋃_{n=1}^∞ (Eₙ ∩ I)) / P(I)
= (∑_{n=1}^∞ P(Eₙ ∩ I)) / P(I)
= ∑_{n=1}^∞ P(Eₙ ∩ I) / P(I)
= ∑_{n=1}^∞ P(Eₙ | I)
P(E | I) = P(E ∩ I | I)
because
P(E | I) = P((E ∩ I) ∪ (E ∩ I^c) | I)
= P(E ∩ I | I) + P(E ∩ I^c | I)
= P(E ∩ I | I)
where the last equality follows from the fact that (E ∩ I^c) ∩ I = ∅, so that
P(E ∩ I^c | I) = 0.
Suppose that I₁, …, Iₙ is a partition of Ω.
The law of total probability states that, for any event E, the following holds:
P(E) = P(E ∩ Ω)
= P(E ∩ (I₁ ∪ … ∪ Iₙ))
= P((E ∩ I₁) ∪ … ∪ (E ∩ Iₙ))
= P(E ∩ I₁) + … + P(E ∩ Iₙ)    (A)
= P(E|I₁) P(I₁) + … + P(E|Iₙ) P(Iₙ)    (B)
where step (A) follows from the additivity of probability (the events E ∩ Iⱼ are
disjoint because the sets Iⱼ are disjoint), and step (B) follows from the conditional
probability formula.
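The two steps of the derivation translate directly into code. The following sketch uses a made-up three-event partition (the labels and numbers are ours, purely illustrative) to compute P(E) from the conditional probabilities P(E|I_j) and the probabilities P(I_j):

```python
from fractions import Fraction as Fr

def total_probability(cond, part):
    """Law of total probability: P(E) = sum_j P(E|I_j) * P(I_j)."""
    assert sum(part.values()) == 1       # I_1, ..., I_n must partition the sample space
    return sum(cond[j] * part[j] for j in part)

part = {"I1": Fr(1, 2), "I2": Fr(1, 3), "I3": Fr(1, 6)}   # P(I_j)
cond = {"I1": Fr(1, 4), "I2": Fr(1, 2), "I3": Fr(1, 1)}   # P(E|I_j)
print(total_probability(cond, part))   # 1/8 + 1/6 + 1/6 = 11/24
```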
Exercise 1
Consider a sample space Ω comprising three possible outcomes ω₁, ω₂, ω₃:
Ω = {ω₁, ω₂, ω₃}
Suppose the three possible outcomes are assigned the following probabilities:
P(ω₁) = 1/5, P(ω₂) = 2/5, P(ω₃) = 2/5
Define the events
E = {ω₁, ω₂}
F = {ω₁, ω₃}
Compute P(F|E^c), the conditional probability of F given E^c.
Solution
We need to use the conditional probability formula
P(F|E^c) = P(F ∩ E^c) / P(E^c)
The numerator is
P(F ∩ E^c) = P({ω₁, ω₃} ∩ {ω₃}) = P({ω₃}) = 2/5
and the denominator is
P(E^c) = P({ω₃}) = 2/5
As a consequence:
P(F|E^c) = P(F ∩ E^c) / P(E^c) = (2/5) / (2/5) = 1
Exercise 2
Consider a sample space Ω comprising four possible outcomes ω₁, ω₂, ω₃, ω₄:
Ω = {ω₁, ω₂, ω₃, ω₄}
Suppose the four possible outcomes are assigned the following probabilities:
P(ω₁) = 1/10, P(ω₂) = 4/10, P(ω₃) = 3/10, P(ω₄) = 2/10
Define two events
E = {ω₁, ω₂}
F = {ω₂, ω₃}
Compute P(E|F), the conditional probability of E given F.
Solution
We need to use the formula
P(E|F) = P(E ∩ F) / P(F)
But
P(E ∩ F) = P({ω₂}) = 4/10
while, using additivity:
P(F) = P({ω₂, ω₃}) = P({ω₂}) + P({ω₃}) = 4/10 + 3/10 = 7/10
Therefore:
P(E|F) = P(E ∩ F) / P(F) = (4/10) / (7/10) = 4/7
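The same computation can be carried out mechanically from the probabilities of the sample points. A minimal sketch (the dictionary keys are our own labels for ω₁, …, ω₄):

```python
from fractions import Fraction as Fr

p = {"w1": Fr(1, 10), "w2": Fr(4, 10), "w3": Fr(3, 10), "w4": Fr(2, 10)}

def prob(event):
    """P(A) = sum of the probabilities of the sample points in A."""
    return sum(p[w] for w in event)

E = {"w1", "w2"}
F = {"w2", "w3"}
p_E_given_F = prob(E & F) / prob(F)
print(p_E_given_F)   # 4/7
```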
Exercise 3
The Census Bureau has estimated the following survival probabilities for men:
the probability of living at least 70 years is 80%, and the probability of living at
least 80 years is 50%. What is the conditional probability that a man lives at least
80 years given that he has just celebrated his 70th birthday?
Solution
Given a hypothetical sample space Ω, define the two events
E = {he lives at least 70 years}
F = {he lives at least 80 years}
We need to compute
P(F|E) = P(F ∩ E) / P(E)
Since living at least 80 years implies living at least 70 years, F ⊆ E, so that
F ∩ E = F
Therefore:
P(F ∩ E) = P(F) = 50% = 1/2
Thus:
P(F|E) = P(F ∩ E) / P(E) = (1/2) / (4/5) = 5/8
Chapter 13
Bayes' rule
This lecture introduces Bayes' rule. Before reading this lecture, make sure you are
familiar with the concept of conditional probability (p. 85).
Bayes' rule states that
P(A|B) = P(B|A) P(A) / P(B)
It is a simple consequence of the conditional probability formula:
P(A|B) = P(A ∩ B) / P(B)
and
P(B|A) = P(A ∩ B) / P(A)
Re-arrange the second formula:
P(A ∩ B) = P(B|A) P(A)
and substitute it into the first:
P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)
The following example shows how Bayes’ rule can be applied in a practical
situation.
Example 72 An HIV test gives a positive result with probability 98% when the
patient is indeed affected by HIV, while it gives a negative result with 99% probability
when the patient is not affected by HIV. If a patient is drawn at random from a
population in which 0.1% of individuals are affected by HIV and he is found positive,
what is the probability that he is indeed affected by HIV? In probabilistic terms,
what we know about this problem can be formalized as follows:
P(positive | HIV) = 0.98
P(positive | no HIV) = 1 − 0.99 = 0.01
P(HIV) = 0.001
P(no HIV) = 1 − 0.001 = 0.999
The unconditional probability of being found positive can be derived using the law
of total probability¹:
P(positive) = P(positive | HIV) P(HIV) + P(positive | no HIV) P(no HIV)
= 0.98 × 0.001 + 0.01 × 0.999 = 0.01097
By Bayes' rule,
P(HIV | positive) = P(positive | HIV) P(HIV) / P(positive) = 0.00098 / 0.01097 ≈ 0.0893
Therefore, even if the test is conditionally very accurate, the unconditional proba-
bility of being affected by HIV when found positive is less than 10 per cent!
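The arithmetic of Example 72 is worth automating, because the result surprises many readers. The sketch below (the function and argument names are ours, not the book's) applies Bayes' rule, with the marginal computed by the law of total probability:

```python
def posterior(sensitivity, false_positive_rate, prior):
    """P(HIV | positive) via Bayes' rule and the law of total probability."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(sensitivity=0.98, false_positive_rate=0.01, prior=0.001)
print(round(p, 4))   # 0.0893: less than 10 per cent
```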
13.2 Terminology
The quantities involved in Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)
are given special names: P(A) is called the prior probability, P(A|B) the posterior
probability, P(B|A) the conditional probability (or likelihood), and P(B) the
marginal probability.
Exercise 1
There are two urns containing colored balls. The first urn contains 50 red balls and
50 blue balls. The second urn contains 30 red balls and 70 blue balls. One of the
two urns is randomly chosen (both urns have probability 50% of being chosen) and
then a ball is drawn at random from one of the two urns. If a red ball is drawn,
what is the probability that it comes from the first urn?
Solution
In probabilistic terms, what we know about this problem can be formalized as
follows:
P(red | urn 1) = 1/2
P(red | urn 2) = 3/10
P(urn 1) = 1/2
P(urn 2) = 1/2
The unconditional probability of drawing a red ball can be derived using the law
of total probability:
P(red) = P(red | urn 1) P(urn 1) + P(red | urn 2) P(urn 2)
= (1/2)(1/2) + (3/10)(1/2) = 2/5
By Bayes' rule, the probability that the ball comes from the first urn is
P(urn 1 | red) = P(red | urn 1) P(urn 1) / P(red) = (1/4) / (2/5) = 5/8
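A short script (exact rational arithmetic, with our own variable names) confirms both steps of the solution: the marginal probability of drawing a red ball and the posterior probability of the first urn.

```python
from fractions import Fraction as Fr

prior = {"urn1": Fr(1, 2), "urn2": Fr(1, 2)}    # each urn is chosen with probability 1/2
p_red = {"urn1": Fr(1, 2), "urn2": Fr(3, 10)}   # P(red | urn)

marginal = sum(p_red[u] * prior[u] for u in prior)      # law of total probability
post_urn1 = p_red["urn1"] * prior["urn1"] / marginal    # Bayes' rule
print(marginal, post_urn1)   # 2/5 5/8
```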
Exercise 2
An economics consulting firm has created a model to predict recessions. The model
predicts a recession with probability 80% when a recession is indeed coming and
with probability 10% when no recession is coming. The unconditional probability
of falling into a recession is 20%. If the model predicts a recession, what is the
probability that a recession will indeed come?
Solution
What we know about this problem can be formalized as follows:
P(rec. pred. | rec. coming) = 8/10
P(rec. pred. | rec. not coming) = 1/10
P(rec. coming) = 2/10
P(rec. not coming) = 1 − P(rec. coming) = 1 − 2/10 = 8/10
The unconditional probability of predicting a recession can be derived using the
law of total probability:
P(rec. pred.) = P(rec. pred. | rec. coming) P(rec. coming)
+ P(rec. pred. | rec. not coming) P(rec. not coming)
= (8/10)(2/10) + (1/10)(8/10) = 24/100
By Bayes' rule,
P(rec. coming | rec. pred.) = P(rec. pred. | rec. coming) P(rec. coming) / P(rec. pred.)
= (16/100) / (24/100) = 2/3
Exercise 3
Alice has two coins in her pocket, a fair coin (head on one side and tail on the other
side) and a two-headed coin. She picks one at random from her pocket, tosses it
and obtains head. What is the probability that she flipped the fair coin?
Solution
What we know about this problem can be formalized as follows:
P(head | fair coin) = 1/2
P(head | unfair coin) = 1
P(fair coin) = 1/2
P(unfair coin) = 1/2
The unconditional probability of obtaining head can be derived using the law of
total probability:
P(head) = P(head | fair coin) P(fair coin) + P(head | unfair coin) P(unfair coin)
= (1/2)(1/2) + 1 × (1/2) = 3/4
By Bayes' rule, the probability that she flipped the fair coin is
P(fair coin | head) = P(head | fair coin) P(fair coin) / P(head) = (1/4) / (3/4) = 1/3
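Another way to check the answer is to enumerate the elementary outcomes (coin picked, face obtained) together with their probabilities; conditioning then amounts to renormalizing over the outcomes whose face is head. A minimal sketch:

```python
from fractions import Fraction as Fr

# elementary outcomes: (coin, face) -> probability
outcomes = {
    ("fair", "head"):   Fr(1, 2) * Fr(1, 2),
    ("fair", "tail"):   Fr(1, 2) * Fr(1, 2),
    ("unfair", "head"): Fr(1, 2) * 1,        # the two-headed coin always shows head
}

p_head = sum(p for (coin, face), p in outcomes.items() if face == "head")
p_fair_given_head = outcomes[("fair", "head")] / p_head
print(p_head, p_fair_given_head)   # 3/4 1/3
```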
Chapter 14
Independent events
This lecture introduces the notion of independent event. Before reading this lecture,
make sure you are familiar with the concept of conditional probability¹.
P(E ∩ F) = P(E) P(F)
This definition implies properties (14.1) and (14.2) above: if E and F are
independent, and (say) P(E) > 0, then
P(F|E) = P(E ∩ F) / P(E) = P(E) P(F) / P(E) = P(F)
Ω = {B₁, B₂, B₃, B₄}
1 See p. 85.
Each of the four balls has the same probability of being drawn, equal to 1/4, i.e.,
P({B₁}) = P({B₂}) = P({B₃}) = P({B₄}) = 1/4
Define the events E and F as follows:
E = fB1 ; B2 g
F = fB2 ; B3 g
E = fB1 ; B2 g
F = fB2 ; B3 g
G = fB2 ; B4 g
Exercise 1
Suppose that we toss a die. Six numbers (from 1 to 6) can appear face up, but we
do not yet know which one of them will appear. The sample space is
Ω = {1, 2, 3, 4, 5, 6}
Each of the six numbers is a sample point and is assigned probability 1/6. Define
the events E and F as follows:
E = {1, 3, 4}
F = {3, 4, 5, 6}
Prove that E and F are independent.
2 See p. 79.
3 See p. 73.
Solution
The probability of E is
P(E) = P({1, 3, 4}) = 3/6 = 1/2
The probability of F is
P(F) = P({3, 4, 5, 6}) = 4/6 = 2/3
Their intersection is E ∩ F = {3, 4}, so that
P(E ∩ F) = 2/6 = 1/3 = (1/2) × (2/3) = P(E) P(F)
Hence, E and F are independent.
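With equally likely sample points, independence can be verified mechanically by counting. The helper below (our own sketch) checks the defining identity P(E ∩ F) = P(E) P(F):

```python
from fractions import Fraction

def independent(E, F, omega):
    """True iff P(E ∩ F) == P(E) * P(F) under equally likely sample points."""
    p = lambda A: Fraction(len(set(A) & set(omega)), len(omega))
    return p(set(E) & set(F)) == p(E) * p(F)

omega = {1, 2, 3, 4, 5, 6}
print(independent({1, 3, 4}, {3, 4, 5, 6}, omega))   # True  (this exercise)
print(independent({1, 3, 5}, {4, 5, 6}, omega))      # False (the events of Example 70)
```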
Exercise 2
A firm undertakes two projects, A and B. The probabilities of having a successful
outcome are 3/4 for project A and 1/2 for project B. The probability that both
projects will have a successful outcome is 7/16. Are the outcomes of the two projects
independent?
Solution
Denote by E the event "project A is successful", by F the event "project B is
successful" and by G the event "both projects are successful". The event G can
be expressed as
G=E\F
If E and F are independent, it must be that
P(G) = P(E ∩ F) = P(E) P(F) = (3/4) × (1/2) = 3/8 ≠ 7/16
Therefore, the outcomes of the two projects are not independent.
Exercise 3
A firm undertakes two projects, A and B. The probabilities of having a successful
outcome are 2/3 for project A and 4/5 for project B. What is the probability that
neither of the two projects will have a successful outcome if their outcomes are
independent?
Solution
Denote by E the event "project A is successful", by F the event "project B is
successful" and by G the event "neither of the two projects is successful". The
event G can be expressed as
G = E^c ∩ F^c
where E^c and F^c are the complements of E and F. Thus, the probability that
neither of the two projects will have a successful outcome is
P(G) = P(E^c ∩ F^c)
= P((E ∪ F)^c)    (A)
= 1 − P(E ∪ F)    (B)
= 1 − (P(E) + P(F) − P(E ∩ F))    (C)
= 1 − P(E) − P(F) + P(E ∩ F)
= 1 − P(E) − P(F) + P(E) P(F)    (D)
= 1 − 2/3 − 4/5 + (2/3)(4/5)
= (15 − 10 − 12 + 8)/15 = 1/15
where: in step (A) we have used De Morgan's law⁴; in step (B) we have used the
formula for the probability of a complement⁵; in step (C) we have used the formula
for the probability of a union⁶; in step (D) we have used the fact that E and F are
independent.
4 See p. 7.
5 See p. 72.
6 See p. 73.
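The chain of steps (A)-(D) can be reproduced with exact fractions, and the independence of the complements gives a one-line cross-check. A minimal sketch:

```python
from fractions import Fraction as Fr

pA, pB = Fr(2, 3), Fr(4, 5)

p_union = pA + pB - pA * pB   # step (D): P(A ∩ B) = P(A) P(B) by independence
p_neither = 1 - p_union       # steps (A)-(B): De Morgan plus the complement rule
print(p_neither)              # 1/15

# cross-check: the complements of independent events are independent too
assert p_neither == (1 - pA) * (1 - pB)
```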
Chapter 15
Random variables
This lecture introduces the concept of random variable. Before reading this lec-
ture, make sure you are familiar with the concepts of sample space, sample point,
event, possible outcome, realized outcome, and probability (see the lecture entitled
Probability - p. 69).
Example 79 Suppose that we flip a coin. The possible outcomes are either tail
(T) or head (H), i.e.,
Ω = {T, H}
The two outcomes are assigned equal probabilities:
P({T}) = P({H}) = 1/2
If tail (T) is the outcome, we win one dollar, if head (H) is the outcome we lose
one dollar. The amount X we win (or lose) is a random variable, defined as
X(ω) = 1 if ω = T
X(ω) = −1 if ω = H
Most of the time, statisticians deal with two special kinds of random variables:
discrete random variables and absolutely continuous random variables.
p_X(x) = P(X = x) if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
2 See the lecture entitled Sequences and Limits (p. 32) for a de…nition of countable set.
Example 81 Let X be a discrete random variable that can take only two values:
1 with probability q and 0 with probability 1 − q, where 0 ≤ q ≤ 1. Its support is
R_X = {0, 1}
The properties of probability mass functions are discussed in more detail in the
lecture entitled Legitimate probability mass functions (p. 247). We anticipate here
that probability mass functions are characterized by two fundamental properties:
1. non-negativity: p_X(x) ≥ 0 for any x ∈ ℝ;
2. sum over the support equals 1: ∑_{x∈R_X} p_X(x) = 1.
It turns out not only that any probability mass function must satisfy these two
properties, but also that any function satisfying these two properties is a legitimate
probability mass function. You can find a detailed discussion of this fact in the
aforementioned lecture.
f_X(x) = 1 if x ∈ [0, 1]
f_X(x) = 0 otherwise
The probability that the realization of X belongs, for example, to the interval
[1/4, 3/4] is
P(X ∈ [1/4, 3/4]) = ∫_{1/4}^{3/4} f_X(x) dx = ∫_{1/4}^{3/4} dx = [x]_{1/4}^{3/4} = 3/4 − 1/4 = 1/2
The properties of probability density functions are discussed in more detail in
the lecture entitled Legitimate probability density functions (p. 251). We anticipate
here that probability density functions are characterized by two fundamental
properties:
1. non-negativity: f_X(x) ≥ 0 for any x ∈ ℝ;
2. integral over ℝ equals 1: ∫₋∞^∞ f_X(x) dx = 1.
It turns out not only that any probability density function must satisfy these
two properties, but also that any function satisfying these two properties is a
legitimate probability density function. You can find a detailed discussion of this
fact in the aforementioned lecture.
Hence, by taking the derivative with respect to x of both sides of the above equa-
tion, we obtain
dF_X(x)/dx = f_X(x)
{ω ∈ Ω : X(ω) ∈ B} ∈ F
for any B ∈ B(ℝ), and the probability on the right hand side is well-defined because
the set
{ω ∈ Ω : X(ω) ∈ B}
is measurable by the very definition of random variable.
Exercise 1
Let X be a discrete random variable. Let its support RX be
R_X = {0, 1, 2, 3, 4}
and let its probability mass function p_X(x) be
p_X(x) = 1/5 if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
Compute
P(1 ≤ X < 4)
Solution
By using the additivity of probability, we have
P(1 ≤ X < 4) = p_X(1) + p_X(2) + p_X(3) = 1/5 + 1/5 + 1/5 = 3/5
Exercise 2
Let X be a discrete random variable. Let its support R_X be the set of the first 20
natural numbers:
R_X = {1, 2, …, 19, 20}
Let its probability mass function p_X(x) be
p_X(x) = x/210 if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
Solution
By the additivity of probability, we have
Exercise 3
Let X be a discrete random variable. Let its support R_X be
R_X = {0, 1, 2, 3}
and let its probability mass function be the binomial pmf
p_X(x) = C(3, x) (1/4)^x (3/4)^(3−x) if x ∈ R_X, and p_X(x) = 0 otherwise.
Compute P(X < 3).
Solution
First note that, by additivity, we have
P(X < 3) = P({X = 0} ∪ {X = 1} ∪ {X = 2})
= P({X = 0}) + P({X = 1}) + P({X = 2})
= p_X(0) + p_X(1) + p_X(2)
Therefore, in order to compute P(X < 3), we need to evaluate the probability
mass function at the three points x = 0, x = 1 and x = 2:
p_X(0) = C(3, 0) (1/4)^0 (3/4)^3 = (3!/(0!3!)) × (27/64) = 27/64
p_X(1) = C(3, 1) (1/4)^1 (3/4)^2 = (3!/(1!2!)) × (1/4) × (9/16) = 3 × 9/64 = 27/64
p_X(2) = C(3, 2) (1/4)^2 (3/4)^1 = (3!/(2!1!)) × (1/16) × (3/4) = 9/64
Finally,
P(X < 3) = p_X(0) + p_X(1) + p_X(2) = 27/64 + 27/64 + 9/64 = 63/64
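The binomial probabilities above are easy to reproduce exactly. The sketch below uses `math.comb` for the binomial coefficient and rational arithmetic throughout:

```python
from fractions import Fraction as Fr
from math import comb

def binom_pmf(x, n, p):
    """p_X(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p = Fr(1, 4)
values = [binom_pmf(x, 3, p) for x in (0, 1, 2)]
print(values)        # [Fraction(27, 64), Fraction(27, 64), Fraction(9, 64)]
print(sum(values))   # 63/64
```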
Exercise 4
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, 1]
Let its probability density function f_X(x) be
f_X(x) = 1 if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
Compute
P(X ≥ 1/2)
4 See p. 22.
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(X ≥ 1/2) = P(X ∈ [1/2, 1]) = ∫_{1/2}^{1} f_X(x) dx = ∫_{1/2}^{1} dx = [x]_{1/2}^{1} = 1 − 1/2 = 1/2
Exercise 5
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, 1]
Let its probability density function f_X(x) be
f_X(x) = 2x if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
Compute
P(1/4 ≤ X ≤ 1/2)
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(1/4 ≤ X ≤ 1/2) = P(X ∈ [1/4, 1/2]) = ∫_{1/4}^{1/2} f_X(x) dx
= ∫_{1/4}^{1/2} 2x dx = [x²]_{1/4}^{1/2} = 1/4 − 1/16 = 3/16
Exercise 6
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, ∞)
Let its probability density function f_X(x) be
f_X(x) = λ exp(−λx) if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
where λ > 0.
Compute
P(X ≥ 1)
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(X ≥ 1) = P(X ∈ [1, ∞)) = ∫₁^∞ f_X(x) dx = ∫₁^∞ λ exp(−λx) dx
= [−exp(−λx)]₁^∞ = 0 − (−exp(−λ)) = exp(−λ)
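The closed-form answer exp(−λ) can be checked against a numerical integration of the density. The sketch below uses a midpoint Riemann sum with a large truncation point (λ = 2 is an arbitrary choice of ours):

```python
import math

def tail_numeric(lam, a=1.0, upper=50.0, n=200_000):
    """Midpoint-rule approximation of the integral of lam*exp(-lam*x) over [a, upper]."""
    h = (upper - a) / n
    return sum(lam * math.exp(-lam * (a + (i + 0.5) * h)) * h for i in range(n))

lam = 2.0
print(abs(tail_numeric(lam) - math.exp(-lam)) < 1e-6)   # True: the integral matches exp(-λ)
```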
Chapter 16
Random vectors
Example 86 Two coins are tossed. The possible outcomes of each toss can be
either tail (T) or head (H). The sample space is
Ω = {TT, TH, HT, HH}
and each of the four possible outcomes is assigned the same probability:
P({TT}) = P({TH}) = P({HT}) = P({HH}) = 1/4
If tail (T) is the outcome, we win one dollar, if head (H) is the outcome we lose
one dollar. A 2-dimensional random vector X indicates the amount we win (or
lose) on each toss:
X(ω) = [1 1]ᵀ if ω = TT
X(ω) = [1 −1]ᵀ if ω = TH
X(ω) = [−1 1]ᵀ if ω = HT
X(ω) = [−1 −1]ᵀ if ω = HH
For example, the probability of winning one dollar on both tosses is
P(X = [1 1]ᵀ) = P({ω ∈ Ω : X(ω) = [1 1]ᵀ}) = P({TT}) = 1/4
The probability of losing one dollar on the second toss is
P(X₂ = −1) = P({TH, HH}) = P({TH}) + P({HH}) = 1/4 + 1/4 = 1/2
The following notations are used interchangeably to indicate the joint proba-
bility mass function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
The following notations are used interchangeably to indicate the joint proba-
bility density function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
R_X = [0, 1] × [0, 1]
The probability that the realization of X falls in the rectangle [0, 1/2] × [0, 1/2] is
P(X ∈ [0, 1/2] × [0, 1/2]) = ∫₀^{1/2} ∫₀^{1/2} f_X(x₁, x₂) dx₂ dx₁
= ∫₀^{1/2} ∫₀^{1/2} dx₂ dx₁ = ∫₀^{1/2} [x₂]₀^{1/2} dx₁
= ∫₀^{1/2} (1/2) dx₁ = (1/2) [x₁]₀^{1/2}
= (1/2) × (1/2) = 1/4
F_X(x) = P(X₁ ≤ x₁, …, X_K ≤ x_K), ∀x ∈ ℝᴷ
The following notations are used interchangeably to indicate the joint distrib-
ution function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
Sometimes, we talk about the joint distribution of a random vector, without
specifying whether we are referring to the joint distribution function, or to the joint
probability mass function (in the case of discrete random vectors), or to the joint
probability density function (in the case of absolutely continuous random vectors).
This ambiguity is legitimate, since:
In the remainder of this lecture, we use the term joint distribution when we
are making statements that apply both to the distribution function and to the
probability mass (or density) function of a random vector.
A = [A₁₁ A₁₂; A₂₁ A₂₂]
When vec (A) is a discrete random vector, then we say that A is a discrete
random matrix and the joint probability mass function of A is just the joint prob-
ability mass function of vec (A). By the same token, when vec (A) is an absolutely
continuous random vector, then we say that A is an absolutely continuous random
matrix and the joint probability density function of A is just the joint probability
density function of vec (A).
In other words, the joint probability mass function of X₋ᵢ is obtained by summing
the joint probability mass function of X over all values xᵢ that belong to the support
of Xᵢ, where
X₋ᵢ = [X₁ … Xᵢ₋₁ Xᵢ₊₁ … X_K]
∂ᴷF_X(x) / (∂x₁ … ∂x_K) = f_X(x)
{ω ∈ Ω : X(ω) ∈ B} ∈ F
Exercise 1
Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂.
Let the support of X be the set of all 2×1 vectors such that their entries belong
to the set of the first three natural numbers, i.e.,
R_X = {x = [x₁ x₂]ᵀ : x₁ ∈ N₃ and x₂ ∈ N₃}
where
N₃ = {1, 2, 3}
Let the joint probability mass function of X be
p_X(x₁, x₂) = (1/36) x₁ x₂ if [x₁ x₂]ᵀ ∈ R_X
p_X(x₁, x₂) = 0 if [x₁ x₂]ᵀ ∉ R_X
Compute P(X₁ = 2 and X₂ = 3).
Solution
Trivially, we need to evaluate the joint probability mass function at the point (2, 3),
i.e.,
P(X₁ = 2 and X₂ = 3) = p_X(2, 3) = (1/36) × 2 × 3 = 6/36 = 1/6
Exercise 2
Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂.
Let the support of X be the set of all 2×1 vectors such that their entries belong
to the set of the first three natural numbers, i.e.,
R_X = {x = [x₁ x₂]ᵀ : x₁ ∈ N₃ and x₂ ∈ N₃}
where
N₃ = {1, 2, 3}
Let the joint probability mass function of X be
p_X(x₁, x₂) = (1/36)(x₁ + x₂) if [x₁ x₂]ᵀ ∈ R_X
p_X(x₁, x₂) = 0 if [x₁ x₂]ᵀ ∉ R_X
Compute P(X₁ + X₂ = 3).
Solution
There are only two possible cases that give rise to the occurrence X₁ + X₂ = 3.
These cases are
X = [1 2]ᵀ
and
X = [2 1]ᵀ
Since these two cases are disjoint events, we can use the additivity property of
probability⁷:
P(X₁ + X₂ = 3) = P({X = [1 2]ᵀ} ∪ {X = [2 1]ᵀ})
= P({X = [1 2]ᵀ}) + P({X = [2 1]ᵀ})
= p_X(1, 2) + p_X(2, 1)
= (1/36)(1 + 2) + (1/36)(2 + 1) = 6/36 = 1/6
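Exhaustive enumeration of the (small) support both validates that the joint pmf sums to one and reproduces the answer. A minimal sketch:

```python
from fractions import Fraction as Fr

def p_joint(x1, x2):
    """Joint pmf of this exercise: (x1 + x2)/36 on {1,2,3} x {1,2,3}."""
    return Fr(x1 + x2, 36) if x1 in (1, 2, 3) and x2 in (1, 2, 3) else Fr(0)

support = [(x1, x2) for x1 in (1, 2, 3) for x2 in (1, 2, 3)]
assert sum(p_joint(*pt) for pt in support) == 1   # a legitimate joint pmf

p_sum_3 = sum(p_joint(x1, x2) for (x1, x2) in support if x1 + x2 == 3)
print(p_sum_3)   # 1/6
```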
Exercise 3
Let X be a 2×1 discrete random vector and denote its components by X₁ and
X₂. Let the support of X be
R_X = {[1 1]ᵀ, [2 0]ᵀ, [0 0]ᵀ}
and let each point of the support be assigned probability 1/3. Derive the marginal
probability mass functions of X₁ and X₂.
Solution
The support of X₁ is
R_X₁ = {0, 1, 2}
We need to compute the probability of each element of the support of X₁:
p_X₁(0) = ∑_{(x₁,x₂)∈R_X : x₁=0} p_X(x₁, x₂) = p_X(0, 0) = 1/3
p_X₁(1) = ∑_{(x₁,x₂)∈R_X : x₁=1} p_X(x₁, x₂) = p_X(1, 1) = 1/3
p_X₁(2) = ∑_{(x₁,x₂)∈R_X : x₁=2} p_X(x₁, x₂) = p_X(2, 0) = 1/3
The support of X₂ is
R_X₂ = {0, 1}
and, proceeding in the same way,
p_X₂(0) = p_X(2, 0) + p_X(0, 0) = 1/3 + 1/3 = 2/3
p_X₂(1) = p_X(1, 1) = 1/3
7 See p. 72.
Exercise 4
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = [0, 2] × [0, 3]
i.e. the set of all 2×1 vectors such that the first component belongs to the
interval [0, 2] and the second component belongs to the interval [0, 3]. Let the joint
probability density function of X be
f_X(x) = 1/6 if x ∈ R_X
f_X(x) = 0 otherwise
Compute P(1 ≤ X₁ ≤ 3, −1 ≤ X₂ ≤ 1).
Solution
By the very definition of joint probability density function:
P(1 ≤ X₁ ≤ 3, −1 ≤ X₂ ≤ 1) = ∫₁³ ∫₋₁¹ f_X(x₁, x₂) dx₂ dx₁
Since the density vanishes outside the support, the integral reduces to
∫₁² ∫₀¹ (1/6) dx₂ dx₁ = (1/6) ∫₁² [x₂]₀¹ dx₁
= (1/6) ∫₁² 1 dx₁ = (1/6) [x₁]₁² = 1/6
Exercise 5
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = [0, ∞) × [0, 2]
i.e. the set of all 2×1 vectors such that the first component belongs to the
interval [0, ∞) and the second component belongs to the interval [0, 2]. Let the
joint probability density function of X be
f_X(x) = f_X(x₁, x₂) = exp(−2x₁) if x ∈ R_X
f_X(x) = 0 otherwise
Compute P(X₁ + X₂ ≤ 3).
Solution
First of all note that X₁ + X₂ ≤ 3 if and only if X₂ ≤ 3 − X₁. Using the definition
of joint probability density function, we obtain
P(X₁ + X₂ ≤ 3) = ∫₋∞^∞ ∫₋∞^{3−x₁} f_X(x₁, x₂) dx₂ dx₁
Since the density vanishes outside the support, x₂ ranges over the whole interval
[0, 2] when 0 ≤ x₁ ≤ 1, and only over [0, 3 − x₁] when 1 ≤ x₁ ≤ 3. Therefore,
P(X₁ + X₂ ≤ 3)
= ∫₀¹ ∫₀² exp(−2x₁) dx₂ dx₁ + ∫₁³ ∫₀^{3−x₁} exp(−2x₁) dx₂ dx₁
= ∫₀¹ 2 exp(−2x₁) dx₁ + ∫₁³ (3 − x₁) exp(−2x₁) dx₁
= [−exp(−2x₁)]₀¹ + ∫₁³ 3 exp(−2x₁) dx₁ − ∫₁³ x₁ exp(−2x₁) dx₁
= 1 − exp(−2) + (3/2)(exp(−2) − exp(−6)) − [−(x₁/2) exp(−2x₁) − (1/4) exp(−2x₁)]₁³
= 1 − exp(−2) + (3/2) exp(−2) − (3/2) exp(−6) − ((3/4) exp(−2) − (7/4) exp(−6))
= 1 − (1/4) exp(−2) + (1/4) exp(−6)
Exercise 6
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = ℝ²₊
i.e., the set of all 2-dimensional vectors with positive entries. Let its joint proba-
bility density function be
f_X(x) = f_X(x₁, x₂) = exp(−x₁ − x₂) if x₁ ≥ 0 and x₂ ≥ 0
f_X(x) = 0 otherwise
Derive the marginal probability density functions of X₁ and X₂. The support of
X₁ is
R_X₁ = ℝ₊
We can find the marginal density of X₁ by integrating the joint density with respect
to x₂:
f_X₁(x) = ∫₋∞^∞ f_X(x, x₂) dx₂
When x < 0, then f_X(x, x₂) = 0 and the above integral is trivially equal to 0.
Thus, when x < 0, then f_X₁(x) = 0.
When x > 0, then
f_X₁(x) = ∫₋∞^∞ f_X(x, x₂) dx₂ = ∫₋∞⁰ f_X(x, x₂) dx₂ + ∫₀^∞ f_X(x, x₂) dx₂
but the first of the two integrals is zero since f_X(x, x₂) = 0 when x₂ < 0; as a
consequence,
f_X₁(x) = ∫₀^∞ f_X(x, x₂) dx₂ = ∫₀^∞ exp(−x − x₂) dx₂
= exp(−x) ∫₀^∞ exp(−x₂) dx₂ = exp(−x) [−exp(−x₂)]₀^∞
= exp(−x) (0 − (−1)) = exp(−x)
Summing up:
f_X₁(x) = exp(−x) if x ≥ 0, and f_X₁(x) = 0 otherwise
By symmetry,
f_X₂(x) = exp(−x) if x ≥ 0, and f_X₂(x) = 0 otherwise
Chapter 17
Expected value
The concept of expected value of a random variable¹ is one of the most important
concepts in probability theory. It was first devised in the 17th century to analyze
gambling games and answer questions such as: how much do I gain - or lose - on
average, if I repeatedly play a given gambling game? how much can I expect to
gain - or lose - by performing a certain bet? If the possible outcomes of the game
(or the bet) and their associated probabilities are described by a random variable,
then these questions can be answered by computing its expected value, which is
equal to a weighted average of the outcomes, in which each outcome is weighted by
its probability. For example, if you play a game where you gain 2$ with probability
1/2 and you lose 1$ with probability 1/2, then the expected value of the game is
half a dollar:
2$ × (1/2) + (−1$) × (1/2) = 1/2$
What does this mean? Roughly speaking, it means that if you play this game
many times and the number of times each of the two possible outcomes occurs is
proportional to its probability, then, on average you gain 1/2$ each time you play
the game. For instance, if you play the game 100 times, win 50 times and lose the
remaining 50, then your average winning is equal to the expected value:
(2$ × 50 − 1$ × 50) / 100 = 1/2$
provided that
∑_{x∈R_X} |x| p_X(x) < ∞
The symbol ∑_{x∈R_X} indicates summation over all the elements of the support
R_X. So, for example, if
R_X = {1, 2, 3}
then:
∑_{x∈R_X} x p_X(x) = 1 × p_X(1) + 2 × p_X(2) + 3 × p_X(3)
is well-defined also when the support R_X contains infinitely many elements. When
summing infinitely many terms, the order in which you sum them can change the
result of the sum. However, if the terms are absolutely summable, then the order
in which you sum becomes irrelevant. In the above definition of expected value,
the order of the sum
∑_{x∈R_X} x p_X(x)
therefore does not matter, precisely because absolute summability is required.
R_X = {0, 1}
provided that
∫₋∞^∞ |x| f_X(x) dx < ∞
This integral can be thought of as the limiting case of the sum (17.1) found in
the discrete case. Here p_X(x) is replaced by f_X(x) dx (the infinitesimal probability
of x) and the integral sign ∫₋∞^∞ replaces the summation sign ∑_{x∈R_X}.
The requirement that
∫₋∞^∞ |x| f_X(x) dx < ∞
is called absolute integrability. The improper integral defining E[X] is a limit of
integrals over bounded intervals, and it is well-defined only if both limits are finite.
Absolute integrability guarantees that the latter condition is met and that the
expected value is well-defined.
When the absolute integrability condition is not satisfied, we say that the ex-
pected value of X is not well-defined or that it does not exist.
R_X = [0, ∞)
f_X(x) = exp(−x) if x ∈ [0, ∞)
f_X(x) = 0 otherwise
where the integral is a Riemann-Stieltjes integral and the expected value exists and
is well-defined only as long as the integral is well-defined.
Also this integral is the limiting case of formula (17.1) for the expected value
of a discrete random variable. Here dF_X(x) replaces p_X(x) (the probability of x)
and the integral sign ∫₋∞^∞ replaces the summation sign ∑_{x∈R_X}.
2 See p. 108.
The following section contains a brief and informal introduction to the Riemann-
Stieltjes integral and an explanation of the above formula. Less technically oriented
readers can safely skip it: when they encounter a Riemann-Stieltjes integral, they
can just think of it as a formal notation which allows a unified treatment of discrete
and absolutely continuous random variables and can be treated as a sum in one
case and as an ordinary Riemann integral in the other.
17.4.1 Intuition
As we have already seen above, the expected value of a discrete random variable
is straightforward to compute: the expected value of a discrete variable X is the
weighted average of the values that X can take on (the elements of the support
RX ), where each possible value x is weighted by its respective probability pX (x):
E[X] = ∑_{x∈R_X} x p_X(x)
When X is not discrete the above summation does not make any sense. However,
there is a workaround that allows us to extend the formula to random variables
that are not discrete. The workaround entails approximating X with discrete
variables that can take on only finitely many values.
Let x₀, x₁, …, xₙ be n + 1 real numbers (n ∈ ℕ) such that x₀ < x₁ < … < xₙ,
and define a discrete variable Xₙ that takes the value xᵢ whenever X belongs to
the interval (xᵢ₋₁, xᵢ].
As the number n of points increases and the points become closer and closer (the
maximum distance between two successive points tends to zero), Xn becomes a
very good approximation of X, until, in the limit, it is indistinguishable from X.
The expected value of Xₙ is easy to compute:
E[Xₙ] = ∑ᵢ₌₁ⁿ xᵢ P(Xₙ = xᵢ)
= ∑ᵢ₌₁ⁿ xᵢ P(X ∈ (xᵢ₋₁, xᵢ])
= ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
The expected value of X is then defined as the limit of E[Xₙ] when n tends to
infinity (i.e. when the approximation becomes better and better):
E[X] = lim_{n→∞} E[Xₙ] = lim_{n→∞} ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
When the latter limit exists and is well-defined, it is called the Riemann-Stieltjes
integral of x with respect to F_X(x) and it is indicated as follows:
∫₋∞^∞ x dF_X(x) = lim_{n→∞} ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
Roughly speaking, the integral notation ∫₋∞^∞ can be thought of as a shorthand
for lim_{n→∞} ∑ᵢ₌₁ⁿ and the differential notation dF_X(x) can be thought of as a
shorthand for [F_X(xᵢ) − F_X(xᵢ₋₁)].
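The limiting construction above can be watched converging numerically. The sketch below approximates E[X] for a distribution with known mean (the unit-rate exponential, our own choice of example) by the finite sum ∑ᵢ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)] on increasingly fine grids:

```python
import math

def stieltjes_mean(cdf, lo, hi, n):
    """Approximate E[X] by sum_i x_i * (F(x_i) - F(x_{i-1})) on an n-cell grid."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return sum(xs[i] * (cdf(xs[i]) - cdf(xs[i - 1])) for i in range(1, n + 1))

cdf = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0   # exponential(1): E[X] = 1

for n in (10, 100, 10_000):
    print(n, stieltjes_mean(cdf, 0.0, 40.0, n))   # approaches 1 as the grid refines
```

Because each cell reports its right endpoint, the approximation overestimates the mean and decreases toward it as the maximum cell width shrinks, exactly as the limit definition suggests.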
∫_{c₁}^{c₂} g(x) f_X(x) dx + g(c₂) [F_X(c₂) − lim_{x→c₂, x<c₂} F_X(x)]
+ …
+ ∫_{cₙ₋₁}^{cₙ} g(x) f_X(x) dx + g(cₙ) [F_X(cₙ) − lim_{x→cₙ, x<cₙ} F_X(x)]
+ ∫_{cₙ}^{b} g(x) f_X(x) dx
$$R_X = [0, 1]$$
provided $\int X \, dP$ (the Lebesgue integral of X with respect to P) exists and is well-defined.
3 See p. 69.
It is possible (albeit non-trivial) to prove that the above two formulae also hold
when X is a K-dimensional random vector, $g : \mathbb{R}^K \to \mathbb{R}$ is a real function of K
variables, and $Y = g(X)$. When X is a discrete random vector and $p_X(x)$ is its
joint probability mass function, then
$$E[Y] = \sum_{x \in R_X} g(x) \, p_X(x)$$
When X is an absolutely continuous random vector and $f_X(x)$ is its joint proba-
bility density function, then
$$E[Y] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_K) f_X(x_1, \ldots, x_K) \, dx_1 \cdots dx_K$$
Consider the linear transformation
$$Y = a + bX$$
Then
$$E[Y] = a + bE[X]$$
When X is discrete:
$$\begin{aligned}
E[Y] &= \sum_{x \in R_X} (a + bx)\, p_X(x) \\
&= \sum_{x \in R_X} a\, p_X(x) + \sum_{x \in R_X} bx\, p_X(x) \\
&= a \sum_{x \in R_X} p_X(x) + b \sum_{x \in R_X} x\, p_X(x) \\
&= a + b \sum_{x \in R_X} x\, p_X(x) = a + bE[X]
\end{aligned}$$
When X is absolutely continuous:
$$\begin{aligned}
E[Y] &= \int_{-\infty}^{\infty} (a + bx) f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} a f_X(x)\, dx + \int_{-\infty}^{\infty} bx f_X(x)\, dx \\
&= a \int_{-\infty}^{\infty} f_X(x)\, dx + b \int_{-\infty}^{\infty} x f_X(x)\, dx \\
&= a + b \int_{-\infty}^{\infty} x f_X(x)\, dx = a + bE[X]
\end{aligned}$$
A stronger linearity property holds, which involves two or more random vari-
ables. The property can be proved only by using the Lebesgue integral⁴. The property
is as follows: let X₁ and X₂ be two random variables and let $c_1 \in \mathbb{R}$ and $c_2 \in \mathbb{R}$
be two constants; then
$$E[c_1 X_1 + c_2 X_2] = c_1 E[X_1] + c_2 E[X_2]$$
17.6.5 Integrability
Denote the absolute value of a random variable X by $|X|$. If $E[|X|]$ exists and
is finite, we say that X is an integrable random variable, or just that X is
integrable.
17.6.6 Lp spaces
Let $1 \le p < \infty$. The space of all random variables X such that $E[|X|^p]$ exists
and is finite is denoted by $L^p$ or $L^p(\Omega, \mathcal{F}, P)$, where the triple $(\Omega, \mathcal{F}, P)$ makes the
dependence on the underlying probability space⁵ explicit. If X belongs to $L^p$, we
write $X \in L^p(\Omega, \mathcal{F}, P)$. Hence, if X is integrable, we write $X \in L^1(\Omega, \mathcal{F}, P)$.
17.7 Solved exercises

Exercise 1
Let X be a discrete random variable. Let its support be
$$R_X = \{0, 1, 2, 3, 4\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} 1/5 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
5 See p. 76.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$\begin{aligned}
E[X] &= \sum_{x \in R_X} x\, p_X(x) = 0 \cdot p_X(0) + 1 \cdot p_X(1) + 2 \cdot p_X(2) + 3 \cdot p_X(3) + 4 \cdot p_X(4) \\
&= 0 \cdot \tfrac{1}{5} + 1 \cdot \tfrac{1}{5} + 2 \cdot \tfrac{1}{5} + 3 \cdot \tfrac{1}{5} + 4 \cdot \tfrac{1}{5} = \frac{1+2+3+4}{5} = \frac{10}{5} = 2
\end{aligned}$$
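A minimal numerical check of this computation (an illustrative sketch; the helper name `expected_value` is a hypothetical choice, not from the text):

```python
def expected_value(pmf):
    """E[X] = sum over the support of x * p_X(x), for a pmf given as a dict."""
    return sum(x * p for x, p in pmf.items())

# uniform pmf on {0, 1, 2, 3, 4}, as in the exercise
pmf = {x: 1 / 5 for x in range(5)}
ev = expected_value(pmf)   # (0 + 1 + 2 + 3 + 4) / 5 = 2
```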
Exercise 2
Let X be a discrete random variable. Let its support be
$$R_X = \{1, 2, 3\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} x/6 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{2}{6} + 3 \cdot \tfrac{3}{6} = \frac{1 + 4 + 9}{6} = \frac{14}{6} = \frac{7}{3}$$
Exercise 3
Let X be a discrete random variable. Let its support be
$$R_X = \{2, 4\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} x^2/20 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 2 \cdot \tfrac{4}{20} + 4 \cdot \tfrac{16}{20} = \frac{8}{20} + \frac{64}{20} = \frac{72}{20} = \frac{18}{5}$$
Exercise 4
Let X be an absolutely continuous random variable with uniform distribution on
the interval [1, 3].
Its support is
$$R_X = [1, 3]$$
Its probability density function is
$$f_X(x) = \begin{cases} 1/2 & \text{if } x \in [1, 3] \\ 0 & \text{otherwise} \end{cases}$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_{-\infty}^{1} x f_X(x)\, dx + \int_{1}^{3} x f_X(x)\, dx + \int_{3}^{\infty} x f_X(x)\, dx \\
&= \int_{-\infty}^{1} x \cdot 0\, dx + \int_{1}^{3} x \cdot \tfrac{1}{2}\, dx + \int_{3}^{\infty} x \cdot 0\, dx \\
&= 0 + \left[\tfrac{1}{4} x^2\right]_1^3 + 0 = \frac{9}{4} - \frac{1}{4} = \frac{8}{4} = 2
\end{aligned}$$
Note that the trick is to: 1) subdivide the interval of integration to isolate
the sub-intervals where the density is zero; 2) split up the integral among the
sub-intervals thus identified.
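The same integral can be approximated by a midpoint Riemann sum, confirming the value 2; the helper name `riemann_expectation` and the grid size are illustrative choices:

```python
def riemann_expectation(f, a, b, n):
    """Midpoint Riemann sum of the integral of x * f(x) over [a, b]."""
    h = (b - a) / n
    return sum((a + (i + 0.5) * h) * f(a + (i + 0.5) * h) * h for i in range(n))

# uniform density on [1, 3]; outside [1, 3] the density is zero, so it is
# enough to integrate over [1, 3], exactly as in the exercise
fX = lambda x: 0.5 if 1 <= x <= 3 else 0.0
ev = riemann_expectation(fX, 1.0, 3.0, 10_000)
```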
Exercise 5
Let X be an absolutely continuous random variable. Its support is
$$R_X = \mathbb{R}$$
Its probability density function is
$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\, dx = (2\pi)^{-1/2} \int_{-\infty}^{\infty} x \exp\left(-\tfrac{1}{2}x^2\right) dx \\
&= (2\pi)^{-1/2} \int_{-\infty}^{0} x \exp\left(-\tfrac{1}{2}x^2\right) dx + (2\pi)^{-1/2} \int_{0}^{\infty} x \exp\left(-\tfrac{1}{2}x^2\right) dx \\
&= (2\pi)^{-1/2} \left[-\exp\left(-\tfrac{1}{2}x^2\right)\right]_{-\infty}^{0} + (2\pi)^{-1/2} \left[-\exp\left(-\tfrac{1}{2}x^2\right)\right]_{0}^{\infty} \\
&= (2\pi)^{-1/2} [-1 + 0] + (2\pi)^{-1/2} [0 + 1] = -(2\pi)^{-1/2} + (2\pi)^{-1/2} = 0
\end{aligned}$$
Exercise 6
Let X be an absolutely continuous random variable. Its support is
$$R_X = [0, 1]$$
Its probability density function is
$$f_X(x) = 3x^2$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_{0}^{1} x \cdot 3x^2\, dx = 3 \int_{0}^{1} x^3\, dx = 3 \left[\tfrac{1}{4} x^4\right]_0^1 = 3\left(\tfrac{1}{4} - 0\right) = \frac{3}{4}$$
Exercise 7
Let $F_X(x)$ be defined as follows:
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - \exp(-\lambda x) & \text{if } x \ge 0 \end{cases}$$
where $\lambda > 0$.
Compute the following integral:
$$\int_{1}^{2} x\, dF_X(x)$$
Solution
$F_X(x)$ is continuously differentiable on the interval [1, 2]. Its derivative $f_X(x)$ is
$$f_X(x) = \lambda \exp(-\lambda x)$$
The integral can therefore be computed by integration by parts:
$$\begin{aligned}
\int_{1}^{2} x\, dF_X(x) &= \int_{1}^{2} x\, \lambda \exp(-\lambda x)\, dx = \left[-x \exp(-\lambda x)\right]_1^2 + \int_{1}^{2} \exp(-\lambda x)\, dx \\
&= \frac{1}{\lambda}\left\{\lambda\left(-2\exp(-2\lambda) + \exp(-\lambda)\right) + \left[-\exp(-\lambda t)\right]_{t=1}^{t=2}\right\} \\
&= \frac{1}{\lambda}\left\{-2\lambda \exp(-2\lambda) + \lambda \exp(-\lambda) - \exp(-2\lambda) + \exp(-\lambda)\right\} \\
&= \frac{1}{\lambda}\left\{(-2\lambda - 1)\exp(-2\lambda) + (\lambda + 1)\exp(-\lambda)\right\}
\end{aligned}$$
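Assuming a concrete value of the rate parameter (λ = 1, an arbitrary choice for illustration only), the closed-form result can be checked against a direct midpoint Riemann sum of $\int_1^2 x\,\lambda e^{-\lambda x}\,dx$:

```python
import math

lam = 1.0  # illustrative value of the rate parameter (lambda in the text)

# closed form: (1/lam) * [(-2*lam - 1) e^{-2 lam} + (lam + 1) e^{-lam}]
closed_form = ((-2 * lam - 1) * math.exp(-2 * lam)
               + (lam + 1) * math.exp(-lam)) / lam

# midpoint Riemann sum of the integral of x * lam * exp(-lam * x) over [1, 2]
n = 10_000
h = 1.0 / n
approx = sum((1 + (i + 0.5) * h) * lam * math.exp(-lam * (1 + (i + 0.5) * h)) * h
             for i in range(n))
```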
Chapter 18

Expected value and the Lebesgue integral

18.1 Intuition
Let us recall the informal definition of expected value we have given in the lecture
entitled Expected value (p. 127):
Definition 102 The expected value of a random variable X is the weighted av-
erage of the values that X can take on, where each possible value is weighted by its
respective probability.
When X is discrete and can take on only finitely many values, it is straightfor-
ward to compute the expected value of X by just applying the above definition.
Denote by $x_1, \ldots, x_n$ the n values that X can take on (the n elements of its sup-
port). Let Ω be the sample space on which X is defined. Also define the following
events:
$$E_1 = \{\omega \in \Omega : X(\omega) = x_1\}, \quad \ldots, \quad E_n = \{\omega \in \Omega : X(\omega) = x_n\}$$
Then
$$E[X] = \sum_{i=1}^{n} x_i P(E_i)$$
i.e. the expected value of X is the weighted average of the values that X can take
on, where each possible value $x_i$ is weighted by its respective probability $P(E_i)$.
Note that this is a way of expressing the expected value that uses neither $F_X(x)$,
the distribution function¹ of X, nor its probability mass function² $p_X(x)$. Instead,
the above way of expressing the expected value uses only the probabilities P(E)
defined on the events $E \subseteq \Omega$. In many applications, it turns out that this is a very
convenient way of expressing (and calculating) the expected value: for example,
when the distribution function $F_X(x)$ is not directly known and is difficult to
derive, it is sometimes easier to directly compute the probabilities P(E) defined
on the events $E \subseteq \Omega$. Below, this will be illustrated with an example.
When X is discrete but can take on infinitely many values, in a similar fashion
we can write
$$E[X] = \sum_{i=1}^{\infty} x_i P(E_i) \tag{18.1}$$
In this case, however, there is a possibility that E[X] is not well-defined: this
happens when the infinite series above does not converge, i.e. when the limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} x_i P(E_i)$$
does not exist. In the next section we will show how to take care of this possibility.
In the case in which X is not discrete (its support has the power of the con-
tinuum), things are much more complicated. In this case, the summation in (18.1)
does not make any sense, because the support of X cannot be arranged into a se-
quence³ and so there is no sequence over which we can sum. Thus, we have to find
a workaround. The workaround is similar to the one we have discussed in the pre-
sentation of the Stieltjes integral⁴: first, we build a simpler random variable Y that
is a good approximation of X and whose expected value can easily be computed;
then, we make the approximation better and better; finally, we define the expected
value of X to be equal to the expected value of Y when the approximation tends
to become perfect.
How does the approximation work, intuitively? We illustrate it in three steps:
1. in the first step, we partition the sample space Ω into n events $E_1, \ldots, E_n$,
such that $E_i \cap E_j = \emptyset$ for $i \ne j$ and
$$\Omega = \bigcup_{i=1}^{n} E_i$$
2. in the second step we find, for each event $E_i$, the smallest value that X can
take on when the event $E_i$ happens:
$$y_i = \inf \{X(\omega) : \omega \in E_i\}$$
In this way, we have built a random variable Y such that $Y(\omega) \le X(\omega)$ for
any ω. The finer the partition $E_1, \ldots, E_n$ is, the better the approximation is:
intuitively, when the sets $E_i$ become smaller, then $y_i$ becomes on average closer to
the values that X can take on when $E_i$ happens.
The expected value of Y is, of course, easy to compute:
$$E[Y] = \sum_{i=1}^{n} y_i P(E_i)$$
and the integral is called the Lebesgue integral of X with respect to the probability
measure P. The notation dP (or dω) indicates that the sets $E_i$ become very small by
improving the approximation (making the partition $E_1, \ldots, E_n$ finer); the integral
notation $\int$ can be thought of as a shorthand for $\lim_{Y \to X} \sum_{i=1}^{n}$; X appears in place
of Y in the integral, because the two tend to coincide when the approximation
becomes better and better.
The next example shows an important application of the linearity of the Lebesgue
integral. The example also shows how the Lebesgue integral can, in certain sit-
uations, be much simpler to use than the Stieltjes integral when computing the
expected value of a random variable.
Example 104 Let X₁ and X₂ be two random variables defined on a sample space
Ω. We want to define (and compute) the expected value of the sum X₁ + X₂. Define
a new random variable
$$Z = X_1 + X_2$$
Using the Stieltjes integral⁵, the expected value is defined as follows:
$$E[Z] = \int_{-\infty}^{\infty} z\, dF_Z(z)$$
5 See p. 130.
where $F_Z(z)$ is the distribution function of Z. Hence, to compute the above integral,
we first need to know the distribution function of Z (which might be extremely
difficult to derive). Using the Lebesgue integral, the expected value is defined as
follows:
$$E[Z] = \int_{\Omega} Z\, dP$$
By the linearity of the Lebesgue integral, this equals
$$\int_{\Omega} X_1\, dP + \int_{\Omega} X_2\, dP = E[X_1] + E[X_2]$$
Thus, to compute the expected value of Z, we do not need to know the distribution
function of Z, but only the expected values of X₁ and X₂.
This is just an example of how linearity of the Lebesgue integral translates into
linearity of the expected value.

18.3 A more rigorous definition

We now define the Lebesgue integral more rigorously, starting from simple random
variables. A simple random variable Y can be written as
$$Y(\omega) = \begin{cases} y_1 & \text{when } \omega \in E_1 \\ \vdots & \\ y_n & \text{when } \omega \in E_n \end{cases}$$
with $y_i \ge 0$ for all i.
Note that a simple random variable is also a discrete random variable. Hence,
the expected value of a simple random variable is easy to compute, because it is
just the weighted sum of the elements of its support.
The Lebesgue integral of a simple random variable Y is defined to be equal to
its expected value:
$$\int_{\Omega} Y\, dP = \sum_{i=1}^{n} y_i P(E_i)$$
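The defining formula can be sketched in a few lines of code; the partition probabilities and values below are arbitrary illustrative choices:

```python
def lebesgue_integral_simple(values, probs):
    """Lebesgue integral of a simple random variable:
    integral of Y dP = sum_i y_i * P(E_i) over a partition E_1, ..., E_n."""
    # the events must partition the sample space, so the probabilities sum to 1
    assert abs(sum(probs) - 1.0) < 1e-12
    return sum(y * p for y, p in zip(values, probs))

# Y takes the value 0 with prob. 1/2, 1 with prob. 1/4, 4 with prob. 1/4
integral = lebesgue_integral_simple([0.0, 1.0, 4.0], [0.5, 0.25, 0.25])
# 0*0.5 + 1*0.25 + 4*0.25 = 1.25
```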
$$X = X^+ - X^-$$
Finally, the Lebesgue integral of X is defined as the difference between the integrals
of its positive and negative parts:
$$\int_{\Omega} X\, dP = \int_{\Omega} X^+\, dP - \int_{\Omega} X^-\, dP$$
provided the difference makes sense; in case both $\int X^+\, dP$ and $\int X^-\, dP$ are equal to
infinity, the difference is not well-defined and we say that X is not integrable.
Chapter 19

Properties of the expected value

This lecture discusses some fundamental properties of the expected value operator.
Although most of these properties can be understood and proved using the material
presented in previous lectures, some properties are gathered here for convenience,
but can be proved and understood only after reading the material presented in
subsequent lectures.

19.1 Linearity of the expected value

Multiplication by a constant: if a is a constant, then
$$E[aX] = aE[X]$$
This property has already been discussed in the lecture entitled Expected value (p.
134).
For example, let X be a random variable with $E[X] = 3$ and let $Y = 2X$. Then:
$$E[Y] = E[2X] = 2E[X] = 2 \cdot 3 = 6$$
Sums: the expected value of a sum of random variables is the sum of their
expected values. This property too has already been discussed in the lecture entitled
Expected value (p. 134).
Example 108 Let X and Y be two random variables with expected values
$$E[X] = 2, \qquad E[Y] = 5$$
and let $Z = X + Y$. Then
$$E[Z] = E[X + Y] = E[X] + E[Y] = 2 + 5 = 7$$
Linear combinations: for constants $a_1, \ldots, a_K$ and random variables $X_1, \ldots, X_K$,
$$E[a_1 X_1 + \ldots + a_K X_K] = a_1 E[X_1] + \ldots + a_K E[X_K]$$
This can be trivially obtained by combining the two properties above (scalar multi-
plication and sum). Considering $a_1, a_2, \ldots, a_K$ as the K entries of a $1 \times K$ vector
a and $X_1, X_2, \ldots, X_K$ as the K entries of a $K \times 1$ random vector X, the above
property can be written as
$$E[aX] = aE[X]$$
which is a multivariate generalization of the scalar multiplication property above.
Example 109 Let X and Y be two random variables with expected values
$$E[X] = 1, \qquad E[Y] = 4$$
and let $Z = X + 3Y$. Then
$$E[Z] = E[X + 3Y] = E[X] + 3E[Y] = 1 + 3 \cdot 4 = 1 + 12 = 13$$
Addition of a constant matrix: if A is a matrix of constants and Σ is a random
matrix, then
$$E[A + \Sigma] = A + E[\Sigma]$$
This is easily proved by applying the linearity properties above to each entry of
the random matrix $A + \Sigma$.
1 See p. 119.
Example 110 Let X be a $2 \times 1$ random vector such that its two entries X₁ and
X₂ have expected values
$$E[X_1] = 0, \qquad E[X_2] = 2$$
Let A be the constant vector
$$A = \begin{bmatrix} 1 \\ 7 \end{bmatrix}$$
and let $Y = A + X$. Then
$$E[Y] = E[A + X] = A + E[X] = \begin{bmatrix} 1 \\ 7 \end{bmatrix} + \begin{bmatrix} E[X_1] \\ E[X_2] \end{bmatrix} = \begin{bmatrix} 1 \\ 7 \end{bmatrix} + \begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ 9 \end{bmatrix}$$
Multiplication by constant matrices: if B and C are matrices of constants and Σ
is a random matrix, then
$$E[B\Sigma] = B\,E[\Sigma], \qquad E[\Sigma C] = E[\Sigma]\,C$$
and, combining the two,
$$E[B \Sigma C] = E[B(\Sigma C)] = B\,E[\Sigma C] = B\,E[\Sigma]\,C$$
As an example, let X be a $1 \times 2$ random vector with
$$E[X_1] = E[X_2] = 3$$
where X₁ and X₂ are the two components of X. Let A be the following $2 \times 2$ matrix
of constants:
$$A = \begin{bmatrix} 2 & 0 \\ 3 & 1 \end{bmatrix}$$
Let $Y = XA$. Then
$$E[Y] = E[XA] = E[X]\,A = \begin{bmatrix} 3 & 3 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 3 \cdot 2 + 3 \cdot 3 & 3 \cdot 0 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 15 & 3 \end{bmatrix}$$
Expectation of a non-negative random variable: if
$$X(\omega) \ge 0, \quad \forall \omega \in \Omega$$
then
$$E[X] \ge 0$$
Proof. Intuitively, this is obvious: the expected value of X is a weighted average
of the values that X can take on; X can take on only non-negative values; therefore,
its expected value must also be non-negative. Formally, the expected value is the
Lebesgue integral³ of X; X can be approximated to any degree of accuracy by non-
negative simple random variables whose Lebesgue integral is obviously non-negative;
therefore, the Lebesgue integral of X must also be non-negative.
Preservation of almost-sure inequalities: if $X \le Y$ almost surely, i.e. if the event
$E = \{\omega \in \Omega : Y(\omega) < X(\omega)\}$ has zero probability, then
$$E[X] \le E[Y]$$
Proof. Denote by $1_E$ the indicator of the event E. Then
$$E[Y - X] = E[(Y - X) \cdot 1] = E[(Y - X) 1_E] + E[(Y - X) 1_{E^c}] = E[(Y - X) 1_{E^c}] \tag{19.1}$$
because
$$E[(Y - X) 1_E] = 0$$
by the properties of indicators of zero-probability events. If $\omega \in E^c$, then
$$(Y - X) \ge 0 \quad \text{and} \quad (Y - X) 1_{E^c} \ge 0$$
On the contrary, if $\omega \in E$, then
$$1_{E^c} = 0 \quad \text{and} \quad (Y - X) 1_{E^c} = 0$$
Therefore
$$(Y - X) 1_{E^c} \ge 0, \quad \forall \omega \in \Omega$$
which means that $(Y - X) 1_{E^c}$ is a non-negative random variable. Thus
$$E[(Y - X) 1_{E^c}] \ge 0 \tag{19.2}$$
and hence, by (19.1),
$$E[Y - X] \ge 0$$
By linearity,
$$E[Y - X] = E[Y] - E[X] \ge 0$$
Therefore
$$E[X] \le E[Y]$$
19.3 Solved exercises

Exercise 1
Let X and Y be two random variables, having expected values
$$E[X] = \sqrt{2}, \qquad E[Y] = 1$$
Solution
Exercise 2
Let X be a $2 \times 1$ random vector such that its two entries X₁ and X₂ have expected
values
$$E[X_1] = 2, \qquad E[X_2] = 3$$
Let A be the constant matrix
$$A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$$
and let $Y = AX$. Compute E[Y].
Solution
The linearity property of the expected value also applies to the multiplication of a
constant matrix and a random vector:
$$E[Y] = E[AX] = A\,E[X] = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \cdot 2 + 2 \cdot 3 \\ 0 \cdot 2 + 1 \cdot 3 \end{bmatrix} = \begin{bmatrix} 8 \\ 3 \end{bmatrix}$$
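A quick numerical sketch of the matrix computation above, using plain lists instead of a linear-algebra library:

```python
def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2], [0, 1]]
EX = [2, 3]            # E[X1], E[X2]
EY = mat_vec(A, EX)    # [1*2 + 2*3, 0*2 + 1*3] = [8, 3]
```

Because expectation is linear, multiplying the vector of expected values by A gives the expected value of the transformed vector, exactly as in the exercise.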
Exercise 3
Let Σ be a $2 \times 2$ matrix with random entries, such that all its entries have expected
value equal to 1. Let A be the following $1 \times 2$ constant vector:
$$A = \begin{bmatrix} 2 & 3 \end{bmatrix}$$
and let $Y = A\Sigma$. Compute E[Y].
Solution
The linearity property of the expected value also applies to the multiplication of a
constant vector and a matrix with random entries:
$$E[Y] = E[A\Sigma] = A\,E[\Sigma] = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 \cdot 1 + 3 \cdot 1 & 2 \cdot 1 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 5 & 5 \end{bmatrix}$$
Chapter 20
Variance
This lecture introduces the concept of variance. Before reading this lecture, make
sure you are familiar with the concept of random variable (see the lecture entitled
Random variables - p. 105) and with the concept of expected value (see the lecture
entitled Expected value - p. 127).
$$Y = X - E[X]$$
3. take the square of Y, so that positive and negative deviations from the mean
having the same magnitude yield the same measure of distance from E[X];
4. finally, compute the expected value of the squared deviation Y² to know how
much on average X deviates from E[X]:
$$\mathrm{Var}[X] = E\left[Y^2\right] = E\left[(X - E[X])^2\right]$$
Proof. The variance formula is derived as follows. First, expand the square:
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E\left[X^2 + E[X]^2 - 2E[X]X\right]$$
Then, by linearity of the expected value (E[X] is a constant),
$$\mathrm{Var}[X] = E[X^2] + E[X]^2 - 2E[X]E[X] = E[X^2] - E[X]^2$$
The above variance formula also makes clear that variance exists and is well-
defined only as long as E[X] and E[X²] exist and are well-defined.
20.5 Example
The following example shows how to compute the variance of a discrete random
variable² using both the definition and the variance formula above.
Let X be a random variable with support
$$R_X = \{0, 1\}$$
and probability mass function $p_X(1) = q$, $p_X(0) = 1 - q$, where $0 \le q \le 1$. Its
expected value is $E[X] = 1 \cdot q + 0 \cdot (1 - q) = q$, and the expected value of its
square is
$$E[X^2] = 1^2 \cdot p_X(1) + 0^2 \cdot p_X(0) = 1 \cdot q + 0 \cdot (1 - q) = q$$
Its variance is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = q - q^2 = q(1 - q)$$
Alternatively, we can compute the variance of X using the definition. Define a new
random variable, the squared deviation of X from E[X], as
$$Z = (X - E[X])^2$$
The support of Z is
$$R_Z = \left\{(1 - q)^2, \; q^2\right\}$$
and its probability mass function is
$$p_Z(z) = \begin{cases} q & \text{if } z = (1 - q)^2 \\ 1 - q & \text{if } z = q^2 \\ 0 & \text{otherwise} \end{cases}$$
The exercises at the end of this lecture provide more examples of how variance
can be computed.
In deriving the properties $\mathrm{Var}[a + X] = \mathrm{Var}[X]$ and $\mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$, one
uses the fact that, by linearity of the expected value,
$$E[a + X] = a + E[X] \qquad \text{and} \qquad E[bX] = bE[X]$$
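The variance formula from the example above can be checked numerically; the value of q below is an arbitrary illustrative choice:

```python
def variance(pmf):
    """Var[X] = E[X^2] - E[X]^2, for a pmf given as a dict value -> probability."""
    ev = sum(x * p for x, p in pmf.items())
    ev2 = sum(x * x * p for x, p in pmf.items())
    return ev2 - ev ** 2

q = 0.3                                # illustrative success probability
var = variance({1: q, 0: 1 - q})       # should equal q * (1 - q) = 0.21
```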
20.7 Solved exercises

Exercise 1
Let X be a discrete random variable with support
$$R_X = \{0, 1, 2, 3\}$$
and probability mass function
$$p_X(x) = \begin{cases} 1/4 & \text{if } x \in R_X \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{4} + 2 \cdot \tfrac{1}{4} + 3 \cdot \tfrac{1}{4} = \frac{6}{4} = \frac{3}{2}$$
The expected value of X² is
$$E[X^2] = \sum_{x \in R_X} x^2 p_X(x) = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{1}{4} + 9 \cdot \tfrac{1}{4} = \frac{14}{4} = \frac{7}{2}$$
Therefore
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{7}{2} - \frac{9}{4} = \frac{5}{4}$$
Exercise 2
Let X be a discrete random variable with support
$$R_X = \{1, 2, 3, 4\}$$
and probability mass function
$$p_X(x) = \begin{cases} x^2/30 & \text{if } x \in R_X \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 1 \cdot \tfrac{1}{30} + 2 \cdot \tfrac{4}{30} + 3 \cdot \tfrac{9}{30} + 4 \cdot \tfrac{16}{30} = \frac{1 + 8 + 27 + 64}{30} = \frac{100}{30} = \frac{10}{3}$$
The expected value of X² is
$$E[X^2] = \sum_{x \in R_X} x^2 p_X(x) = 1 \cdot \tfrac{1}{30} + 4 \cdot \tfrac{4}{30} + 9 \cdot \tfrac{9}{30} + 16 \cdot \tfrac{16}{30} = \frac{354}{30} = \frac{59}{5}$$
Therefore
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{59}{5} - \frac{100}{9} = \frac{531 - 500}{45} = \frac{31}{45}$$
Exercise 3
Read and try to understand how the variance of a Poisson random variable is
derived in the lecture entitled Poisson distribution (p. 349).
Exercise 4
Let X be an absolutely continuous random variable⁴ with support
$$R_X = [0, 1]$$
and probability density function
$$f_X(x) = \begin{cases} 1 & \text{if } x \in [0, 1] \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
4 See p. 107.
Solution
The expected value of X is
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^1 x\, dx = \left[\tfrac{1}{2} x^2\right]_0^1 = \frac{1}{2}$$
The expected value of X² is
$$E[X^2] = \int_0^1 x^2\, dx = \left[\tfrac{1}{3} x^3\right]_0^1 = \frac{1}{3}$$
The variance of X is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{4 - 3}{12} = \frac{1}{12}$$
Exercise 5
Let X be an absolutely continuous random variable with support
$$R_X = [0, 1]$$
and probability density function
$$f_X(x) = \begin{cases} 3x^2 & \text{if } x \in [0, 1] \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^1 x \cdot 3x^2\, dx = \int_0^1 3x^3\, dx = \left[\tfrac{3}{4} x^4\right]_0^1 = \frac{3}{4}$$
The expected value of X² is
$$E[X^2] = \int_0^1 x^2 \cdot 3x^2\, dx = \left[\tfrac{3}{5} x^5\right]_0^1 = \frac{3}{5}$$
The variance of X is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{3}{5} - \left(\frac{3}{4}\right)^2 = \frac{3}{5} - \frac{9}{16} = \frac{48 - 45}{80} = \frac{3}{80}$$
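The value 3/80 can be checked by approximating both moments with midpoint Riemann sums (an illustrative sketch; the helper name `moment` and the grid size are hypothetical choices):

```python
def moment(f, k, a, b, n=50_000):
    """Midpoint Riemann sum of the integral of x^k * f(x) over [a, b]."""
    h = (b - a) / n
    return sum(((a + (i + 0.5) * h) ** k) * f(a + (i + 0.5) * h) * h
               for i in range(n))

fX = lambda x: 3 * x * x          # density of the exercise on [0, 1]
m1 = moment(fX, 1, 0.0, 1.0)      # E[X]   ~ 3/4
m2 = moment(fX, 2, 0.0, 1.0)      # E[X^2] ~ 3/5
var = m2 - m1 ** 2                # ~ 3/80 = 0.0375
```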
Exercise 6
Read and try to understand how the variance of a Chi-square random variable is
derived in the lecture entitled Chi-square distribution (p. 387).
Chapter 21
Covariance
This lecture introduces the concept of covariance. Before reading this lecture, make
sure you are familiar with the concepts of random variable (p. 105), expected value
(p. 127) and variance (p. 155).
Define the deviations of X and Y from their respective means:
$$\bar{X} = X - E[X], \qquad \bar{Y} = Y - E[Y]$$
When the product $\bar{X}\bar{Y}$ is positive, the two deviations have the same sign; when
it is negative, they have opposite signs. The covariance between X and Y is defined
as the expected value of this product:
$$\mathrm{Cov}[X, Y] = E\left[\bar{X}\bar{Y}\right]$$
Cov[X, Y] > 0 implies that X tends to be high when Y is high and low when
Y is low; Cov[X, Y] < 0 implies that X tends to be high when Y is low and vice
versa. When Cov[X, Y] = 0, X and Y do not display either of these two tendencies.
Proposition 118 The covariance between two random variables X and Y can be
expressed as
$$\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y]$$
21.4 Example
The following example shows how to compute the covariance between two discrete
random variables.
1 See p. 134.
Example 119 Let X be a $2 \times 1$ discrete random vector² and denote its components
by X₁ and X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 1/3 & \text{if } x = (1, 1) \\ 1/3 & \text{if } x = (2, 0) \\ 1/3 & \text{if } x = (0, 0) \\ 0 & \text{otherwise} \end{cases}$$
The support of X₁ is
$$R_{X_1} = \{0, 1, 2\}$$
and its marginal probability mass function³ is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x \in \{0, 1, 2\} \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₁ is
$$E[X_1] = \sum_{x \in R_{X_1}} x\, p_{X_1}(x) = \tfrac{1}{3} \cdot 0 + \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 2 = 1$$
The support of X₂ is
$$R_{X_2} = \{0, 1\}$$
and its marginal probability mass function is
$$p_{X_2}(x) = \sum_{\{(x_1, x_2) \in R_X : x_2 = x\}} p_X(x_1, x_2) = \begin{cases} 2/3 & \text{if } x = 0 \\ 1/3 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₂ is
$$E[X_2] = \sum_{x \in R_{X_2}} x\, p_{X_2}(x) = \tfrac{2}{3} \cdot 0 + \tfrac{1}{3} \cdot 1 = \frac{1}{3}$$
Proposition 120 Let Cov[X, X] be the covariance of a random variable with it-
self. Then
$$\mathrm{Cov}[X, X] = \mathrm{Var}[X]$$
Proof. By the definition of covariance,
$$\mathrm{Cov}[X, X] = E[(X - E[X])(X - E[X])] = E\left[(X - E[X])^2\right] = \mathrm{Var}[X]$$
where in the last step we have used the very definition of variance.
21.5.2 Symmetry
The covariance operator is symmetric.
Proposition 121 Let Cov[X, Y] be the covariance between two random variables
X and Y. Then
$$\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$$
21.5.3 Bilinearity
The covariance operator is linear in both of its arguments: if $a_1$ and $a_2$ are two
constants and X₁, X₂ and Y are three random variables, then
$$\mathrm{Cov}[a_1 X_1 + a_2 X_2, Y] = a_1 \mathrm{Cov}[X_1, Y] + a_2 \mathrm{Cov}[X_2, Y]$$
and
$$\mathrm{Cov}[Y, a_1 X_1 + a_2 X_2] = a_1 \mathrm{Cov}[Y, X_1] + a_2 \mathrm{Cov}[Y, X_2]$$
Proof. That the first argument is linear is proved by using the linearity of the
expected value:
$$\begin{aligned}
\mathrm{Cov}[a_1 X_1 + a_2 X_2, Y] &= E[(a_1 X_1 + a_2 X_2 - E[a_1 X_1 + a_2 X_2])(Y - E[Y])] \\
&= E[(a_1 X_1 - E[a_1 X_1])(Y - E[Y]) + (a_2 X_2 - E[a_2 X_2])(Y - E[Y])] \\
&= a_1 E[(X_1 - E[X_1])(Y - E[Y])] + a_2 E[(X_2 - E[X_2])(Y - E[Y])] \\
&= a_1 \mathrm{Cov}[X_1, Y] + a_2 \mathrm{Cov}[X_2, Y]
\end{aligned}$$
Proposition 123 Let X₁ and X₂ be two random variables. If the variance of their
sum exists and is well-defined, then
$$\begin{aligned}
\mathrm{Var}[X_1 + X_2] &= E\left[(X_1 + X_2 - E[X_1 + X_2])^2\right] \\
&= E\left[((X_1 - E[X_1]) + (X_2 - E[X_2]))^2\right] \\
&= E\left[(X_1 - E[X_1])^2 + (X_2 - E[X_2])^2 + 2(X_1 - E[X_1])(X_2 - E[X_2])\right] \\
&= E\left[(X_1 - E[X_1])^2\right] + E\left[(X_2 - E[X_2])^2\right] + 2E[(X_1 - E[X_1])(X_2 - E[X_2])] \\
&= \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + 2\,\mathrm{Cov}[X_1, X_2]
\end{aligned}$$
Thus, to compute the variance of the sum of two random variables we need to
know their covariance.
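The identity $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + 2\,\mathrm{Cov}[X_1, X_2]$ can be verified by direct enumeration over a small joint pmf; the joint distribution below is an arbitrary illustrative choice, not taken from the text:

```python
# joint pmf of (X1, X2) on a small support: (value pair) -> probability
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def E(g):
    """Expected value of g(X1, X2) under the joint pmf."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

ex1, ex2 = E(lambda a, b: a), E(lambda a, b: b)
var1 = E(lambda a, b: (a - ex1) ** 2)
var2 = E(lambda a, b: (b - ex2) ** 2)
cov = E(lambda a, b: (a - ex1) * (b - ex2))
var_sum = E(lambda a, b: (a + b - ex1 - ex2) ** 2)
# var_sum equals var1 + var2 + 2 * cov, as the proposition states
```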
More generally, for n random variables the following formula holds:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] \tag{21.1}$$
Proof. The formula is proved by using the bilinearity of the covariance operator
(see 21.5.3):
$$\begin{aligned}
\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] &= \mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i\right] = \mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j\right] \\
&= \sum_{i=1}^{n} \mathrm{Cov}\left[X_i, \sum_{j=1}^{n} X_j\right] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[X_i, X_j] \\
&= \sum_{i=1}^{n} \mathrm{Cov}[X_i, X_i] + \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{Cov}[X_i, X_j] \\
&= \sum_{i=1}^{n} \mathrm{Var}[X_i] + \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] + \sum_{j=2}^{n} \sum_{i=1}^{j-1} \mathrm{Cov}[X_j, X_i] \\
&= \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j]
\end{aligned}$$
where the last step uses the symmetry of covariance, $\mathrm{Cov}[X_j, X_i] = \mathrm{Cov}[X_i, X_j]$.
Formula (21.1) implies that when all the random variables in the sum have
zero covariance with each other, then the variance of the sum is just the sum of
the variances:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i]$$
This is true, for example, when the random variables in the sum are mutually
independent⁵, because independence implies zero covariance⁶.
5 See p. 233.
6 See p. 234.
21.6 Solved exercises

Exercise 1
Let X be a $2 \times 1$ discrete random vector and denote its components by X₁ and
X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 2/3 & \text{if } x = (1, 3) \\ 1/3 & \text{if } x = (2, 1) \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X₁ and X₂.
Solution
The support of X₁ is
$$R_{X_1} = \{1, 2\}$$
and its marginal probability mass function⁷ is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 2/3 & \text{if } x = 1 \\ 1/3 & \text{if } x = 2 \\ 0 & \text{otherwise} \end{cases}$$
The support of X₂ is
$$R_{X_2} = \{1, 3\}$$
and its marginal probability mass function is
$$p_{X_2}(x) = \sum_{\{(x_1, x_2) \in R_X : x_2 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x = 1 \\ 2/3 & \text{if } x = 3 \\ 0 & \text{otherwise} \end{cases}$$
7 See p. 120.
Exercise 2
Let X be a $2 \times 1$ discrete random vector and denote its entries by X₁ and X₂. Let
the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 4 \\ 4 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 1/3 & \text{if } x = (1, 2) \\ 1/3 & \text{if } x = (2, 1) \\ 1/3 & \text{if } x = (4, 4) \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X₁ and X₂.
Solution
The support of X₁ is
$$R_{X_1} = \{1, 2, 4\}$$
and its marginal probability mass function is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x \in \{1, 2, 4\} \\ 0 & \text{otherwise} \end{cases}$$
The mean of X₁ is
$$E[X_1] = \sum_{x \in R_{X_1}} x\, p_{X_1}(x) = \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 2 + \tfrac{1}{3} \cdot 4 = \frac{7}{3}$$
The support of X₂ is
$$R_{X_2} = \{1, 2, 4\}$$
and, by the same computation, $E[X_2] = 7/3$.
The expected value of the product X₁X₂ can be derived thanks to the transforma-
tion theorem⁸:
$$E[X_1 X_2] = \sum_{x \in R_X} x_1 x_2\, p_X(x_1, x_2) = (1 \cdot 2)\, p_X(1, 2) + (2 \cdot 1)\, p_X(2, 1) + (4 \cdot 4)\, p_X(4, 4) = 2 \cdot \tfrac{1}{3} + 2 \cdot \tfrac{1}{3} + 16 \cdot \tfrac{1}{3} = \frac{20}{3}$$
By putting the pieces together, we obtain the covariance between X₁ and X₂:
$$\mathrm{Cov}[X_1, X_2] = E[X_1 X_2] - E[X_1] E[X_2] = \frac{20}{3} - \frac{7}{3} \cdot \frac{7}{3} = \frac{60 - 49}{9} = \frac{11}{9}$$
8 See p. 134.
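The covariance just computed can be checked by enumerating the joint pmf directly (a short sketch using the distribution of this exercise):

```python
# joint pmf of (X1, X2) from the exercise: each point has probability 1/3
joint = {(1, 2): 1 / 3, (2, 1): 1 / 3, (4, 4): 1 / 3}

def E(g):
    """Expected value of g(X1, X2) under the joint pmf."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

cov = E(lambda a, b: a * b) - E(lambda a, b: a) * E(lambda a, b: b)
# E[X1 X2] = 20/3, E[X1] = E[X2] = 7/3, so cov = 20/3 - 49/9 = 11/9
```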
Exercise 3
Let X and Y be two random variables such that
$$\mathrm{Var}[X] = 2, \qquad \mathrm{Cov}[X, Y] = 1$$
Compute the covariance
$$\mathrm{Cov}[5X, 2X + 3Y]$$
Solution
By exploiting the bilinearity of the covariance operator, we obtain
$$\mathrm{Cov}[5X, 2X + 3Y] = 5\,\mathrm{Cov}[X, 2X + 3Y] = 10\,\mathrm{Cov}[X, X] + 15\,\mathrm{Cov}[X, Y] = 10\,\mathrm{Var}[X] + 15\,\mathrm{Cov}[X, Y] = 10 \cdot 2 + 15 \cdot 1 = 35$$
Exercise 4
Let [X Y] be an absolutely continuous random vector⁹ with support
$$R_{XY} = \{(x, y) : 0 \le x \le y \le 2\}$$
In other words, the support $R_{XY}$ is the set of all couples (x, y) such that $0 \le y \le 2$
and $0 \le x \le y$. Let the joint probability density function of [X Y] be
$$f_{XY}(x, y) = \begin{cases} \frac{3}{8} y & \text{if } (x, y) \in R_{XY} \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X and Y.
9 See p. 117.
Solution
The support of X is
$$R_X = [0, 2]$$
Thus, when $x \notin [0, 2]$, the marginal probability density function¹⁰ of X is 0, while,
when $x \in [0, 2]$, the marginal probability density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy = \int_x^2 \frac{3}{8} y\, dy = \left[\frac{3}{16} y^2\right]_x^2 = \frac{3}{4} - \frac{3}{16} x^2$$
Using this density one finds $E[X] = 3/4$; a similar computation for the marginal
density of Y gives $E[Y] = 3/2$.
The expected value of the product XY can be computed by using the transforma-
tion theorem:
$$\begin{aligned}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\, dy\, dx = \int_0^2 \left(\int_0^y xy \cdot \frac{3}{8} y\, dx\right) dy \\
&= \int_0^2 \frac{3}{8} y^2 \left(\int_0^y x\, dx\right) dy = \int_0^2 \frac{3}{8} y^2 \left[\frac{1}{2} x^2\right]_0^y dy \\
&= \int_0^2 \frac{3}{8} y^2 \cdot \frac{1}{2} y^2\, dy = \int_0^2 \frac{3}{16} y^4\, dy = \left[\frac{3}{80} y^5\right]_0^2 = \frac{3}{16} \cdot \frac{32}{5} = \frac{6}{5}
\end{aligned}$$
Hence, by using the covariance formula, the covariance between X and Y can be
computed as
$$\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y] = \frac{6}{5} - \frac{3}{4} \cdot \frac{3}{2} = \frac{6}{5} - \frac{9}{8} = \frac{48 - 45}{40} = \frac{3}{40}$$
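The value 3/40 can be checked with a double midpoint Riemann sum over the triangular support (the grid sizes are illustrative choices):

```python
def E(g, n=500):
    """Midpoint double Riemann sum of E[g(X, Y)] under the density
    f(x, y) = 3y/8 on the triangle 0 <= x <= y <= 2."""
    hy = 2.0 / n
    total = 0.0
    for j in range(n):
        y = (j + 0.5) * hy
        hx = y / n                 # inner grid adapts to the triangular support
        for i in range(n):
            x = (i + 0.5) * hx
            total += g(x, y) * (3.0 * y / 8.0) * hx * hy
    return total

cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
# approximately 6/5 - (3/4)*(3/2) = 3/40
```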
Exercise 5
Let [X Y] be an absolutely continuous random vector with support
$$R_{XY} = [0, \infty) \times [1, 4]$$
and let its joint probability density function be
$$f_{XY}(x, y) = \begin{cases} \frac{1}{3} y \exp(-xy) & \text{if } x \in [0, \infty) \text{ and } y \in [1, 4] \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X and Y.
Solution
The support of Y is
$$R_Y = [1, 4]$$
When $y \notin [1, 4]$, the marginal probability density function of Y is 0, while, when
$y \in [1, 4]$, the marginal probability density function of Y is
$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \int_0^{\infty} \frac{1}{3} y \exp(-xy)\, dx = \frac{1}{3} \left[-\exp(-xy)\right]_0^{\infty} = \frac{1}{3} [0 - (-1)] = \frac{1}{3}$$
Thus, the marginal probability density function of Y is
$$f_Y(y) = \begin{cases} 1/3 & \text{if } y \in [1, 4] \\ 0 & \text{otherwise} \end{cases}$$
The expected value of Y is
$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\, dy = \int_1^4 \frac{1}{3} y\, dy = \left[\frac{1}{6} y^2\right]_1^4 = \frac{1}{6} \cdot 16 - \frac{1}{6} = \frac{15}{6} = \frac{5}{2}$$
The support of X is
$$R_X = [0, \infty)$$
When $x \notin [0, \infty)$, the marginal probability density function of X is 0, while, when
$x \in [0, \infty)$, the marginal probability density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy = \int_1^4 \frac{1}{3} y \exp(-xy)\, dy$$
We do not explicitly compute the integral, but we write the marginal probability
density function of X as follows:
$$f_X(x) = \begin{cases} \int_1^4 \frac{1}{3} y \exp(-xy)\, dy & \text{if } x \in [0, \infty) \\ 0 & \text{otherwise} \end{cases}$$
The expected value of the product XY is
$$\begin{aligned}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\, dy\, dx = \int_0^{\infty} \left(\int_1^4 xy \cdot \frac{1}{3} y \exp(-xy)\, dy\right) dx \\
&\overset{A}{=} \frac{1}{3} \int_1^4 y \left(\int_0^{\infty} xy \exp(-xy)\, dx\right) dy \\
&\overset{B}{=} \frac{1}{3} \int_1^4 y \cdot \frac{1}{y} \left(\int_0^{\infty} t \exp(-t)\, dt\right) dy \\
&\overset{C}{=} \frac{1}{3} \int_1^4 \left(\left[-t \exp(-t)\right]_0^{\infty} + \int_0^{\infty} \exp(-t)\, dt\right) dy \\
&= \frac{1}{3} \int_1^4 \left(0 + \left[-\exp(-t)\right]_0^{\infty}\right) dy = \frac{1}{3} \int_1^4 1\, dy = \frac{1}{3} \cdot 3 = 1
\end{aligned}$$
where in step A we have exchanged the order of integration, in step B we have
performed the change of variable $t = xy$, and in step C we have integrated by parts.
Exercise 6
Let X and Y be two random variables such that
$$\mathrm{Var}[X] = 4, \qquad \mathrm{Cov}[X, Y] = 2$$
Compute the covariance
$$\mathrm{Cov}[3X, X + 3Y]$$
Solution
The bilinearity of the covariance operator implies that
$$\mathrm{Cov}[3X, X + 3Y] = 3\,\mathrm{Cov}[X, X] + 9\,\mathrm{Cov}[X, Y] = 3\,\mathrm{Var}[X] + 9\,\mathrm{Cov}[X, Y] = 3 \cdot 4 + 9 \cdot 2 = 30$$
Chapter 22

Linear correlation

This lecture introduces the linear correlation coefficient. Before reading this lec-
ture, make sure you are familiar with the concept of covariance (p. 163).

The linear correlation coefficient between two random variables X and Y is
$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\mathrm{stdev}[X]\, \mathrm{stdev}[Y]}$$
where Cov[X, Y] is the covariance between X and Y and stdev[X] and stdev[Y]
are the standard deviations¹ of X and Y.
Of course, the linear correlation coefficient is well-defined only as long as the
three quantities Cov[X, Y], stdev[X] and stdev[Y] exist and are well-defined.
Moreover, while the ratio is well-defined only if stdev[X] and stdev[Y] are
strictly greater than zero, it is often assumed that Corr[X, Y] = 0 when one of
the two standard deviations is zero. This is equivalent to assuming that 0/0 = 0,
because Cov[X, Y] = 0 when one of the two standard deviations is zero.
22.2 Interpretation
Linear correlation is a measure of dependence, or association, between two random
variables. Its interpretation is similar to the interpretation of covariance².
The correlation between X and Y provides a measure of the degree to which
X and Y tend to "move together": Corr[X, Y] > 0 indicates that deviations of X
and Y from their respective means tend to have the same sign; Corr[X, Y] < 0
indicates that deviations of X and Y from their respective means tend to have
opposite signs; when Corr[X, Y] = 0, X and Y do not display either of these two
tendencies.
1 See p. 157.
2 See the lecture entitled Covariance (p. 163) for a detailed explanation.
22.3 Terminology
The following terminology is often used:
1. If Corr[X, Y] > 0 then X and Y are said to be positively linearly corre-
lated (or simply positively correlated).
2. If Corr[X, Y] < 0 then X and Y are said to be negatively linearly corre-
lated (or simply negatively correlated).
3. If Corr[X, Y] ≠ 0 then X and Y are said to be linearly correlated (or
simply correlated).
4. If Corr[X, Y] = 0 then X and Y are said to be uncorrelated. Also note
that Cov[X, Y] = 0 ⟹ Corr[X, Y] = 0; therefore, two random variables X
and Y are uncorrelated whenever Cov[X, Y] = 0.
22.4 Example
The following example shows how to compute the coefficient of linear correlation
between two discrete random variables.
Example 125 Let X be a $2 \times 1$ random vector and denote its components by X₁
and X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\}$$
with each of the three points having probability 1/3. The marginal probability mass
function of X₁ assigns probability 2/3 to $x = -1$ and 1/3 to $x = 1$, so that
$$E[X_1] = -\frac{2}{3} + \frac{1}{3} = -\frac{1}{3}, \qquad E[X_1^2] = 1$$
The variance of X₁ is
$$\mathrm{Var}[X_1] = E[X_1^2] - E[X_1]^2 = 1 - \left(\frac{1}{3}\right)^2 = \frac{8}{9}$$
The standard deviation of X₁ is
$$\mathrm{stdev}[X_1] = \sqrt{\mathrm{Var}[X_1]} = \sqrt{\frac{8}{9}}$$
The support of X₂ is
$$R_{X_2} = \{-1, 1\}$$
and its probability mass function is
$$p_{X_2}(x) = \begin{cases} 1/3 & \text{if } x = -1 \\ 2/3 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₂ is
$$E[X_2] = \sum_{x \in R_{X_2}} x\, p_{X_2}(x) = \frac{1}{3} \cdot (-1) + \frac{2}{3} \cdot 1 = \frac{1}{3}$$
The variance of X₂ is
$$\mathrm{Var}[X_2] = E[X_2^2] - E[X_2]^2 = 1 - \left(\frac{1}{3}\right)^2 = \frac{8}{9}$$
The standard deviation of X₂ is
$$\mathrm{stdev}[X_2] = \sqrt{\mathrm{Var}[X_2]} = \sqrt{\frac{8}{9}}$$
By using the transformation theorem⁴, we can compute the expected value of the
product X₁X₂:
$$E[X_1 X_2] = \sum_{x \in R_X} x_1 x_2\, p_X(x_1, x_2) = (-1 \cdot 1) \cdot \frac{1}{3} + ((-1) \cdot (-1)) \cdot \frac{1}{3} + (1 \cdot 1) \cdot \frac{1}{3} = \frac{1}{3}$$
Hence, the covariance between X₁ and X₂ is
$$\mathrm{Cov}[X_1, X_2] = E[X_1 X_2] - E[X_1] E[X_2] = \frac{1}{3} - \left(-\frac{1}{3}\right) \cdot \frac{1}{3} = \frac{1}{3} + \frac{1}{9} = \frac{4}{9}$$
4 See p. 134.
Proposition 126 If the correlation coefficient of a random variable with itself
exists and is well-defined, then
$$\mathrm{Corr}[X, X] = 1$$
22.5.2 Symmetry
The linear correlation coefficient is symmetric.
Proposition 127 If the correlation coefficient between two random variables exists
and is well-defined, then it satisfies
$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\mathrm{stdev}[X]\, \mathrm{stdev}[Y]} = \frac{\mathrm{Cov}[Y, X]}{\mathrm{stdev}[Y]\, \mathrm{stdev}[X]} = \mathrm{Corr}[Y, X]$$
where we have used the fact that covariance is symmetric⁶:
$$\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$$
Exercise 1
Let X be a 2 1 discrete random vector and denote its components by X1 and
X2 . Let the support of X be
1 2
RX = ;
5 1
and its joint probability mass function be
8
>
>
< 4=5 if x = 1 5
>
pX (x) = 1=5 if x = 2 1
>
: 0 otherwise
Compute the coe¢ cient of linear correlation between X1 and X2 .
Solution
The support of X1 is
RX1 = f1; 2g
and its marginal probability mass function7 is
8
X < 4=5 if x = 1
pX1 (x) = pX (x1 ; x2 ) = 1=5 if x = 2
:
f(x1 ;x2 )2RX :x1 =xg 0 otherwise
The expected value of X1 is
X 4 1 6
E [X1 ] = xpX1 (x) = 1 +2 =
5 5 5
x2RX1
6 See p. 166.
7 See p. 120.
182 CHAPTER 22. LINEAR CORRELATION
The variance of X1 is
2
2 8 6 40 36 4
Var [X1 ] = E X12 E [X1 ] = = =
5 5 25 25
The standard deviation of X1 is
r
4 2
stdev [X1 ] = =
25 5
The support of X2 is
RX2 = f1; 5g
and its marginal probability mass function is
8
X < 1=5 if x = 1
pX2 (x) = pX (x1 ; x2 ) = 4=5 if x = 5
:
f(x1 ;x2 )2RX :x2 =xg 0 otherwise
The expected value of X2 is
X 1 4 21
E [X2 ] = xpX2 (x) = 1 +5 =
5 5 5
x2RX2
The variance of X2 is
2
2 101 21 505 441 64
Var [X2 ] = E X22 E [X2 ] = = =
5 5 25 25
The standard deviation of X1 is
r
64 8
stdev [X2 ] = =
25 5
By using the transformation theorem, we can compute the expected value of X1 X2 :
X
E [X1 X2 ] = x1 x2 pX (x1 ; x2 ) = (1 5) pX (1; 5) + (2 1) pX (2; 1)
x2RX
4 1 22
= 5 +2 =
5 5 5
Hence, the covariance between X1 and X2 is
Cov [X1; X2] = E [X1 X2] − E [X1] E [X2] = 22/5 − (6/5) · (21/5)
             = 110/25 − 126/25 = −16/25
and the coefficient of linear correlation between X1 and X2 is
Corr [X1; X2] = Cov [X1; X2] / (stdev [X1] stdev [X2]) = (−16/25) / ((2/5) · (8/5)) = −1
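The whole computation can be checked numerically. The following is a minimal Python sketch (not part of the original text) that redoes the exercise with exact rational arithmetic:

```python
from fractions import Fraction as F
import math

# Joint pmf of the vector [X1 X2]' from Exercise 1:
# P(X = [1 5]') = 4/5, P(X = [2 1]') = 1/5.
pmf = {(1, 5): F(4, 5), (2, 1): F(1, 5)}

E1  = sum(p * x1 for (x1, x2), p in pmf.items())           # E[X1]    = 6/5
E2  = sum(p * x2 for (x1, x2), p in pmf.items())           # E[X2]    = 21/5
E12 = sum(p * x1 * x2 for (x1, x2), p in pmf.items())      # E[X1 X2] = 22/5

var1 = sum(p * x1**2 for (x1, x2), p in pmf.items()) - E1**2   # 4/25
var2 = sum(p * x2**2 for (x1, x2), p in pmf.items()) - E2**2   # 64/25
cov  = E12 - E1 * E2                                           # -16/25

corr = float(cov) / math.sqrt(float(var1 * var2))
assert cov == F(-16, 25)
assert abs(corr + 1.0) < 1e-9   # correlation is -1
```

A correlation of −1 is consistent with the support: the two points lie on a single straight line with negative slope.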
22.6. SOLVED EXERCISES 183
Exercise 2
Let X be a 2×1 discrete random vector and denote its entries by X1 and X2. Let
the support of X be
RX = { [1 2]ᵀ, [2 1]ᵀ, [3 3]ᵀ }
and its joint probability mass function be
pX (x) = 1/3 if x = [1 2]ᵀ
         1/3 if x = [2 1]ᵀ
         1/3 if x = [3 3]ᵀ
         0   otherwise
Solution
The support of X1 is
RX1 = {1, 2, 3}
and its marginal probability mass function is
pX1 (x) = Σ_{(x1,x2)∈RX : x1=x} pX (x1, x2) = 1/3 if x = 1
                                              1/3 if x = 2
                                              1/3 if x = 3
                                              0   otherwise
The mean of X1 is
E [X1] = Σ_{x∈RX1} x pX1 (x) = 1 · (1/3) + 2 · (1/3) + 3 · (1/3) = 6/3 = 2
The variance of X1 is
Var [X1] = E [X1²] − (E [X1])² = 14/3 − 2² = (14 − 12)/3 = 2/3
The standard deviation of X1 is
stdev [X1] = sqrt(2/3)
The support of X2 is
RX2 = {1, 2, 3}
Exercise 3
Let [X Y] be an absolutely continuous random vector with support
RXY = [0, ∞) × [1, 2]
and let its joint probability density function⁸ be
fXY (x, y) = 2y exp(−2xy) if x ∈ [0, ∞) and y ∈ [1, 2]
             0             otherwise
Compute the covariance between X and Y.
8 See p. 117.
Solution
The support of Y is
RY = [1, 2]
When y ∉ RY, the marginal probability density function⁹ of Y is 0, while, when y ∈
RY, the marginal probability density function of Y can be obtained by integrating
x out of the joint probability density as follows:
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} 2y exp(−2xy) dx
       = [−exp(−2xy)]_0^{∞} = −[0 − 1] = 1
Thus,
fY (y) = 1 if y ∈ [1, 2]
         0 otherwise
The variance of Y is
Var [Y] = E [Y²] − (E [Y])² = 7/3 − (3/2)² = 28/12 − 27/12 = 1/12
We do not explicitly compute the integral, but we write the marginal probability
density function of X as follows:
fX (x) = ∫_1^2 2y exp(−2xy) dy if x ∈ [0, ∞)
         0                      otherwise
9 See p. 120.
The variance of X is
Var [X] = E [X²] − (E [X])² = 1/4 − ((1/2) ln(2))² = (1/4) [1 − (ln(2))²]
and the coefficient of linear correlation between X and Y is
Corr [X; Y] = (2 − 3 ln(2)) / ( sqrt(1 − (ln(2))²) · sqrt(1/3) )
Chapter 23
Covariance matrix
This lecture introduces the covariance matrix of a random vector, which is a
multivariate generalization of the concept of variance of a random variable. Before
reading this lecture, make sure you are familiar with the concepts of variance (p.
155) and covariance (p. 163).
23.1 Definition
Let X be a K×1 random vector. The covariance matrix of X, or variance-
covariance matrix of X, denoted by Var [X], is defined as follows:
Var [X] = E [(X − E [X]) (X − E [X])ᵀ]
Var [X] = E [(X − E [X]) (X − E [X])ᵀ]
      (A) = E [X Xᵀ − 2 X E [X]ᵀ + E [X] E [X]ᵀ]
      (B) = E [X Xᵀ] − 2 E [X] E [X]ᵀ + E [X] E [X]ᵀ
          = E [X Xᵀ] − E [X] E [X]ᵀ
where: in step A we have used the fact that a scalar is equal to its transpose; in
step B we have used the linearity of the expected value².
This formula also makes clear that the covariance matrix exists and is well-
defined only as long as the vector of expected values E [X] and the matrix of
second cross-moments³ E [X Xᵀ] exist and are well-defined.
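The equivalence between the definition and the shortcut formula can be illustrated numerically. The sketch below (a hypothetical 2×1 discrete vector, not taken from this section) computes the covariance matrix both ways with exact rationals and checks that they agree:

```python
from fractions import Fraction as F

# A hypothetical discrete 2x1 random vector, used only to check the
# identity Var[X] = E[XX'] - E[X]E[X]'.
pmf = {(1, 5): F(4, 5), (2, 1): F(1, 5)}
K = 2

def outer(u, v):
    """Outer product u v' as a K x K list of lists."""
    return [[ui * vj for vj in v] for ui in u]

def weighted_sum(mats, weights):
    """Entrywise sum of w_i * M_i over the support."""
    return [[sum(w * m[i][j] for m, w in zip(mats, weights))
             for j in range(K)] for i in range(K)]

mu = [sum(p * x[i] for x, p in pmf.items()) for i in range(K)]
ws = list(pmf.values())

# Definition: E[(X - E[X])(X - E[X])']
V_def = weighted_sum(
    [outer([xi - mi for xi, mi in zip(x, mu)],
           [xi - mi for xi, mi in zip(x, mu)]) for x in pmf], ws)

# Shortcut: E[XX'] - E[X]E[X]'
Exx = weighted_sum([outer(x, x) for x in pmf], ws)
V_alt = [[Exx[i][j] - mu[i] * mu[j] for j in range(K)] for i in range(K)]

assert V_def == V_alt  # the two formulas agree exactly
```

The diagonal entries of the resulting matrix are the variances of the components; the off-diagonal entries are their covariance.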
Proof. This is easily proved using the fact that⁵ E [bX] = b E [X]:
Var [bX] = E [(bX − E [bX]) (bX − E [bX])ᵀ]
         = E [(bX − b E [X]) (bX − b E [X])ᵀ]
         = E [b (X − E [X]) (X − E [X])ᵀ bᵀ]
         = b E [(X − E [X]) (X − E [X])ᵀ] bᵀ
         = b Var [X] bᵀ
23.4.4 Symmetry
The covariance matrix is a symmetric matrix, i.e., it is equal to its transpose.
where the last inequality follows from the fact that variance is always positive.
(C) = E [a (X − E [X]) (X − E [X])ᵀ bᵀ]
(D) = a E [(X − E [X]) (X − E [X])ᵀ] bᵀ
23.4.7 Cross-covariance
The term covariance matrix is sometimes also used to refer to the matrix of
covariances between the elements of two vectors. Let X be a K×1 random vector
and Y be an L×1 random vector. The covariance matrix between X and Y, or
cross-covariance between X and Y, denoted by Cov [X; Y], is defined as follows:
Cov [X; Y] = E [(X − E [X]) (Y − E [Y])ᵀ]
where the (i, j)-th entry of the matrix is equal to the covariance between Xi and
Yj:
Vij = E [(Xi − E [Xi]) (Yj − E [Yj])] = Cov [Xi; Yj]
Note that Cov [X; Y] is not the same as Cov [Y; X]. In fact, Cov [Y; X] is an
L×K matrix equal to the transpose of Cov [X; Y]:
Cov [Y; X] = E [(Y − E [Y]) (X − E [X])ᵀ]
           = E [ ((X − E [X]) (Y − E [Y])ᵀ)ᵀ ]
           = (Cov [X; Y])ᵀ
Exercise 1
Let X be a 2×1 random vector and denote its components by X1 and X2. The
covariance matrix of X is
Var [X] = [ 4 1
            1 2 ]
Compute the variance of the random variable Y defined as
Y = 3 X1 + 4 X2
Solution
Using matrix notation, Y can be written as
Y = [3 4] [ X1
            X2 ] = bX
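Completing the exercise with the formula Var [bX] = b Var [X] bᵀ established above, a minimal sketch of the computation (assuming the stated covariance matrix):

```python
# Var[Y] = b Var[X] b' for b = [3 4] and the covariance matrix of Exercise 1.
Sigma = [[4, 1], [1, 2]]
b = [3, 4]

# first b * Sigma (a row vector), then its dot product with b'
bSigma = [sum(b[i] * Sigma[i][j] for i in range(2)) for j in range(2)]  # [16, 11]
var_Y = sum(bSigma[j] * b[j] for j in range(2))

print(var_Y)  # 92
```

So Var [Y] = 3²·4 + 2·(3·4)·1 + 4²·2 = 36 + 24 + 32 = 92, the same number obtained by expanding the bilinearity of covariance directly.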
Exercise 2
Let X be a 3×1 random vector and denote its components by X1, X2 and X3.
The covariance matrix of X is
Var [X] = [ 3 1 0
            1 2 0
            0 0 1 ]
Compute Cov [X1 + 2X3; 3X2].
Solution
Using the bilinearity of the covariance operator⁶, we obtain the result. The same
result can be obtained using the formula for the covariance between two
linear transformations. Let us define
a = [1 0 2]
b = [0 3 0]
Then, we have
Cov [X1 + 2X3; 3X2] = Cov [aX; bX] = a Var [X] bᵀ
= [1 0 2] [ 3 1 0 ] [ 0 ]
          [ 1 2 0 ] [ 3 ]
          [ 0 0 1 ] [ 0 ]
= [1 0 2] [ 3·0 + 1·3 + 0·0 ]
          [ 1·0 + 2·3 + 0·0 ]
          [ 0·0 + 0·3 + 1·0 ]
= [1 0 2] [ 3 ]
          [ 6 ]  = 1·3 + 0·6 + 2·0 = 3
          [ 0 ]
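The same matrix product can be sketched in a few lines of Python, which reproduces the intermediate column vector [3 6 0]ᵀ and the final value:

```python
# Cov[aX, bX] = a Var[X] b' for the matrix of Exercise 2.
Sigma = [[3, 1, 0], [1, 2, 0], [0, 0, 1]]
a = [1, 0, 2]
b = [0, 3, 0]

# Sigma * b' first (a 3x1 column), then a * (Sigma b')
Sigma_bT = [sum(Sigma[i][j] * b[j] for j in range(3)) for i in range(3)]  # [3, 6, 0]
cov_ab = sum(a[i] * Sigma_bT[i] for i in range(3))

print(cov_ab)  # 3
```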
Exercise 3
Let X be a K×1 random vector whose covariance matrix is equal to the identity
matrix:
Var [X] = I
Define a new random vector Y as follows:
Y = AX
where A is a K×K matrix such that
A Aᵀ = I
Solution
By using the formula for the covariance matrix of a linear transformation, we obtain
Var [Y] = A Var [X] Aᵀ = A I Aᵀ = A Aᵀ = I
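As a numerical check, any orthogonal matrix A satisfies A Aᵀ = I. The sketch below uses a hypothetical rotation matrix (an assumption for illustration, not part of the exercise) and verifies that the transformed covariance matrix is again the identity:

```python
import math

# A hypothetical orthogonal matrix A (a 2D rotation), so that A A' = I.
t = math.pi / 6
A = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
I = [[1.0, 0.0], [0.0, 1.0]]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(r) for r in zip(*M)]

# Var[AX] = A Var[X] A' = A I A' = A A' = I
var_Y = matmul(matmul(A, I), transpose(A))
assert all(abs(var_Y[i][j] - I[i][j]) < 1e-12 for i in range(2) for j in range(2))
```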
Chapter 24
Indicator function
This lecture introduces the concept of indicator function. Before reading this
lecture, make sure you are familiar with the concepts of random variable (p. 105)
and expected value (p. 127).
24.1 Definition
Let Ω be a sample space¹, let E be an event and denote by P(E) the probabil-
ity assigned to the event E. The indicator function of the event E (or indicator
random variable of the event E), denoted by 1E, is a random variable defined
as follows:
1E (ω) = 1 if ω ∈ E
         0 if ω ∉ E
In other words, the indicator function of the event E is a random variable that
takes value 1 when the event E happens and value 0 when the event E does not
happen.
Example 136 We toss a die and one of the six numbers from 1 to 6 can appear
face up. The sample space is:
Ω = {1, 2, 3, 4, 5, 6}
Define the event
E = {1, 3, 5}
i.e. E is the event "An odd number appears face up". A random variable that takes
value 1 when an odd number appears face up and value 0 otherwise is an indicator
of the event E.
From the above definition, it can easily be seen that 1E is a discrete random
variable² with support R1E = {0, 1} and probability mass function:
p1E (x) = P(E)               if x = 1
          P(Eᶜ) = 1 − P(E)   if x = 0
          0                  otherwise
Indicator functions are heavily used in probability theory to simplify notation
and to prove theorems.
1 See p. 69.
2 See p. 106.
24.2.1 Powers
The n-th power of 1E is equal to 1E:
(1E (ω))ⁿ = 1E (ω), ∀n, ω
because 1E can be either 0 or 1 and
0ⁿ = 0
1ⁿ = 1
24.2.3 Variance
The variance of 1E is equal to P(E) (1 − P(E)). Using the powers property above
and the formula for computing the variance³:
Var [1E] = E [(1E)²] − (E [1E])²
         = E [1E] − (E [1E])²
         = P(E) − P(E)²
         = P(E) (1 − P(E))
24.2.4 Intersections
If E and F are two events, then:
1E∩F = 1E · 1F
In fact:
1. if ω ∈ E ∩ F, then
1E∩F (ω) = 1
and
ω ∈ E, ω ∈ F
⟹ 1E (ω) = 1, 1F (ω) = 1
⟹ 1E (ω) 1F (ω) = 1
3 See p. 156.
2. if ω ∉ E ∩ F, then
1E∩F (ω) = 0
and
either ω ∉ E or ω ∉ F or both
⟹ either 1E (ω) = 0 or 1F (ω) = 0 or both
⟹ 1E (ω) 1F (ω) = 0
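The powers, variance and intersection properties can all be verified by enumerating a finite sample space. The sketch below uses the die example from this chapter together with a second, hypothetical event F:

```python
from fractions import Fraction as Fr

# Die example from this chapter; F is a second, hypothetical event.
omega = {1, 2, 3, 4, 5, 6}
E, F = {1, 3, 5}, {1, 2}

one = lambda A: {w: int(w in A) for w in omega}   # indicator as a map w -> 0/1
one_E, one_F, one_EF = one(E), one(F), one(E & F)

# intersections: 1_{E n F}(w) = 1_E(w) * 1_F(w) for every sample point
assert all(one_EF[w] == one_E[w] * one_F[w] for w in omega)

# powers: 1_E^n = 1_E
assert all(one_E[w] ** 7 == one_E[w] for w in omega)

# variance: Var[1_E] = P(E)(1 - P(E))
pE = Fr(len(E), 6)                                 # P(E) = 1/2
var = sum(Fr(1, 6) * (one_E[w] - pE) ** 2 for w in omega)
assert var == pE * (1 - pE) == Fr(1, 4)
```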
If E is a zero-probability event, then
E [X 1E] = 0
While a rigorous proof of this fact is beyond the scope of this introductory expo-
sition, this property should be intuitive. The random variable X 1E is equal
to zero for all sample points ω except possibly for the points ω ∈ E. The expected
value is a weighted average of the values X 1E can take on, where each value is
weighted by its respective probability. The non-zero values X 1E can take on are
weighted by zero probabilities, so E [X 1E] must be zero.
Exercise 1
Consider a random variable X and another random variable Y defined as a function
of X:
Y = 2 if X < 2
    X if X ≥ 2
Express Y using the indicator functions of the events {X < 2} and {X ≥ 2}.
Solution
Denote by 1{X<2} the indicator of the event {X < 2} and denote by 1{X≥2} the
indicator of the event {X ≥ 2}. We can write Y as:
Y = 2 · 1{X<2} + X · 1{X≥2}
Exercise 2
Let X be a positive random variable, i.e. a random variable that can take on only
positive values. Let c be a constant. Prove that
E [X] ≥ E [X 1{X≥c}]
Solution
First note that the sum of the indicators 1{X≥c} and 1{X<c} is always equal to 1:
1{X≥c} + 1{X<c} = 1
Therefore,
E [X] = E [X · 1]
      = E [X (1{X≥c} + 1{X<c})]
      = E [X 1{X≥c}] + E [X 1{X<c}]
Now, note that X 1{X<c} is a positive random variable and that the expected value
of a positive random variable is positive⁶:
E [X 1{X<c}] ≥ 0
Thus:
E [X] = E [X 1{X≥c}] + E [X 1{X<c}] ≥ E [X 1{X≥c}]
Exercise 3
Let E be an event and denote its indicator function by 1E. Let Eᶜ be the
complement of E and denote its indicator function by 1Eᶜ. Can you express 1Eᶜ as
a function of 1E?
Solution
The sum of the two indicators is always equal to 1:
1E + 1Eᶜ = 1
Therefore:
1Eᶜ = 1 − 1E
6 See p. 150.
Chapter 25
Conditional probability as a random variable
In the lecture entitled Conditional probability (p. 85) we have stated a number of
properties that conditional probabilities should satisfy to be rational in some sense.
We have proved that, whenever P(G) > 0, these properties are satisfied if and only
if
P(E | G) = P(E ∩ G) / P(G)
but we have not been able to derive a formula for probabilities conditional on
zero-probability events¹, i.e. we have not been able to find a way to compute
P(E | G) when P(G) = 0.
Thus, we have concluded that the above elementary formula cannot be taken
as a general definition of conditional probability, because it does not cover zero-
probability events.
In this lecture we discuss a completely general definition of conditional proba-
bility, which covers also the case in which P(G) = 0.
The plan of the lecture is as follows.
2. if G, F ∈ G, then either G = F or G ∩ F = ∅;
3. Ω = ∪_{G∈G} G.
Example 138 Suppose that we toss a die. Six numbers (from 1 to 6) can appear
face up, but we do not yet know which one of them will appear. The sample space
is
Ω = {1, 2, 3, 4, 5, 6}
Let any subset of Ω be considered an event. Define the two events:
G1 = {1, 2, 3}
G2 = {4, 5, 6}
F1 = {1, 2}
F2 = {3, 4, 5}
F3 = {5, 6}
F2 ∩ F3 = {5} ≠ ∅ and F2 ≠ F3
2 See p. 69.
25.2 Probabilities conditional on a partition
E [P(E | G)] = (2/3) · p_{P(E|G)}(2/3) + (1/3) · p_{P(E|G)}(1/3)
             = (2/3) · (1/2) + (1/3) · (1/2) = 1/2 = P(E)
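A computation of this kind can be checked on a concrete finite example. The sketch below uses a hypothetical die setup (events chosen so that P(E | G1) = 2/3 and P(E | G2) = 1/3, matching the numbers above) and verifies that the expected value of P(E | G) equals P(E):

```python
from fractions import Fraction as F

# Hypothetical die example: G1 = {1,2,3}, G2 = {4,5,6}, E = {1,2,4},
# each outcome having probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
G1, G2, E = {1, 2, 3}, {4, 5, 6}, {1, 2, 4}

P = lambda A: F(len(A & omega), 6)
P_E_given = lambda G: P(E & G) / P(G)

assert P_E_given(G1) == F(2, 3) and P_E_given(G2) == F(1, 3)

# E[P(E|G)] = sum over the cells of the partition of P(E|Gi) * P(Gi) = P(E)
expected = P_E_given(G1) * P(G1) + P_E_given(G2) * P(G2)
assert expected == P(E) == F(1, 2)
```

This is the law of total probability written as an expected value of the random variable P(E | G).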
The above property can be generalized as follows. Let
G = {G1, G2, ..., Gn}
be a finite partition of events of Ω such that P(Gi) > 0 for every i. Let H be any
event obtained as a union of events Gi ∈ G. Let 1H be the indicator function⁴ of
H. Let P(E | G) be defined as in (25.1). Then:
E [1H P(E | G)] = P(E ∩ H)   (25.3)
4 See p. 197.
Proof. Without loss of generality, we can assume that H is obtained as the union
of the first k (k ≤ n) sets of the partition G (we can always re-arrange the sets Gi
by changing their indices):
H = ∪_{i=1}^{k} Gi
E [1H Y] = P(E ∩ H)
E [1H Y1] = P(E ∩ H)
E [1H Y2] = P(E ∩ H)
E [1H P(E | G)] = P(E ∩ H)
P(E | I) is I-measurable
E [1H P(E | I)] = P(E ∩ H) for any H ∈ I
It can be shown that this definition is completely equivalent to our definition above,
provided I is the smallest σ-algebra containing all the events H ∈ F obtainable as
unions of events G ∈ G (where G is a partition of events of Ω).
5 In other words, there exists a zero-probability event F such that:
{ω ∈ Ω : Y1(ω) ≠ Y2(ω)} ⊆ F
See the lecture entitled Zero-probability events (p. 79) for a definition of almost sure events and
zero-probability events.
6 See p. 75.
7 See p. 76.
Chapter 26
Conditional probability distributions
pX|Y=y (x) = P(X = x | Y = y)
How do we derive the conditional probability mass function from the joint prob-
ability mass function⁵ pXY (x, y)? The following proposition provides an answer
to this question:
Proposition 145 Let [X Y] be a discrete random vector. Let pXY (x, y) be its
joint probability mass function and let pY (y) be the marginal probability mass
function⁶ of Y. The conditional probability mass function of X given Y = y is
pX|Y=y (x) = pXY (x, y) / pY (y)
provided pY (y) > 0.
Proof. This is just the usual formula for computing conditional probabilities
(conditional probability equals joint probability divided by marginal probability):
pX|Y=y (x) = P(X = x | Y = y)
           = P(X = x and Y = y) / P(Y = y)
           = pXY (x, y) / pY (y)
Note that the above proposition assumes knowledge of the marginal probability
mass function pY (y), which can be derived from the joint probability mass function
pXY (x, y) by marginalization⁷.
Ω = [0, 1]
i.e. the sample space is the set of all real numbers between 0 and 1. It is possible
to build a probability measure P on Ω, such that P assigns to each sub-interval of
[0, 1] a probability equal to its length, i.e.:
P([a, b]) = b − a for 0 ≤ a ≤ b ≤ 1
This is the same sample space discussed in the lecture on zero-probability events⁸.
Define a random variable X as follows:
X (ω) = 1 if ω = 0
        0 otherwise
Both X and Y are discrete random variables and, considered together, they con-
stitute a discrete random vector [X Y]. Suppose we want to compute the condi-
tional probability mass function of X conditional on Y = 1. It is easy to see that
pY (1) = 0. As a consequence, we cannot use the formula:
pX|Y=1 (x) = pXY (x, 1) / pY (1)
because division by zero is not possible. It turns out that also the technique of
implicitly deriving a conditional probability as a realization of a random variable
satisfying the definition of a conditional probability with respect to a partition (see
the lecture entitled Conditional probability as a random variable - p. 201) does not
allow us to unambiguously derive pX|Y=1 (x). In this case, the partition of interest
is G = {G1, G2}, where:
G1 = {ω ∈ Ω : Y(ω) = 1} = {0, 1}
G2 = {ω ∈ Ω : Y(ω) = 0} = (0, 1)
and pX|Y=1 (x) can be viewed as the realization of the conditional probability
P(X = x | G)(ω)
which implies:
pX|Y=1 (x) · 0 = 0
pX|Y=0 (x) = pXY (x, 0)
The second equation does not help to determine pX|Y=1 (x). So, from the first
equation it is evident that pX|Y=1 (x) is undetermined (any number, when multiplied
by zero, gives zero). One can show that also the requirement that
P(X = x | G)
What does it mean that (26.1) is undetermined? It means that any choice of (26.1)
is legitimate, provided the requirement
0 ≤ pX|Y=1 (x) ≤ 1
is satisfied. Is this really a paradox? No, because conditional probability with respect
to a partition is defined up to almost sure equality, G1 is a zero-probability event,
so the value that P(E | G) takes on G1 does not matter (roughly speaking, we do not
really need to care about zero-probability events, provided there is only a countable
number of them).
9 E [1H P(E | G)] = P(E ∩ H) - see p. 204.
10 See p. 207.
26.2 Conditional probability density function
How do we derive the conditional probability density function from the joint
probability density function¹² fXY (x, y)?
Deriving the conditional distribution of X given Y = y is far from obvious:
whatever value of y we choose, we are conditioning on a zero-probability event
(P(Y = y) = 0 - see p. 109 for an explanation); therefore, the standard formula
(conditional probability equals joint probability divided by marginal probability)
cannot be used. However, it turns out that the definition of conditional probability
with respect to a partition¹³ can be fruitfully applied in this case to derive the
conditional probability density function of X given Y = y:
E [1H P(E | G)] = P(E ∩ H)
for any H and E. Thanks to some basic results in measure theory, we can confine
our attention to the events H and E that can be written as follows:
H = {ω ∈ Ω : Y ∈ [y1, y2]}, [y1, y2] ⊆ RY
E = {ω ∈ Ω : X ∈ [x1, x2]}, [x1, x2] ⊆ RX
11 See p. 107.
12 See p. 117.
13 See p. 206.
For these events, it is immediate to verify that the fundamental property of condi-
tional probability holds. First, by the very definition of a conditional probability
density function:
P(E | G) = ∫_{x1}^{x2} fX|Y=y (x) dx
fY (y) = 1/4 if y ∈ [1, 5]
         0   otherwise
14 See p. 134.
When evaluated at y = 1, it is
fY (1) = 1/4
The support of X is
RX = [0, ∞)
FX|Y=y (x) = P(X ≤ x | Y = y), ∀x ∈ R
Gy = {ω ∈ Ω : Y = y}
G = {Gy : y ∈ RY}
Exercise 1
Let [X Y] be a discrete random vector with support:
RXY = {[1 0], [2 0], [1 1], [1 2]}
and joint probability mass function:
pXY (x, y) = 1/4 if x = 1 and y = 0
             1/4 if x = 2 and y = 0
             1/4 if x = 1 and y = 1
             1/4 if x = 1 and y = 2
             0   otherwise
Compute the conditional probability mass function of X given Y = 0.
Solution
The marginal probability mass function of Y evaluated at y = 0 is
pY (0) = Σ_{(x,y)∈RXY : y=0} pXY (x, y)
       = pXY (1, 0) + pXY (2, 0) = 1/4 + 1/4 = 1/2
The support of X is:
RX = {1, 2}
Thus, the conditional probability mass function of X given Y = 0 is
pX|Y=0 (x) = pXY (1, 0) / pY (0) = (1/4)/(1/2) = 1/2 if x = 1
             pXY (2, 0) / pY (0) = (1/4)/(1/2) = 1/2 if x = 2
             pXY (x, 0) / pY (0) = 0/(1/2) = 0       if x ∉ RX
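The marginalization and conditioning steps of this solution can be sketched in a few lines of Python with exact rationals:

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1.
pXY = {(1, 0): F(1, 4), (2, 0): F(1, 4), (1, 1): F(1, 4), (1, 2): F(1, 4)}

# marginal of Y at y = 0, then the conditional pmf of X given Y = 0
pY0 = sum(p for (x, y), p in pXY.items() if y == 0)
pX_given_Y0 = {x: pXY.get((x, 0), F(0)) / pY0 for x in (1, 2)}

assert pY0 == F(1, 2)
assert pX_given_Y0 == {1: F(1, 2), 2: F(1, 2)}
```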
Exercise 2
Let [X Y] be an absolutely continuous random vector with support:
Solution
The support of Y is:
RY = [1, 2]
When y ∉ [1, 2], the marginal probability density function of Y is fY (y) = 0; when
y ∈ [1, 2], the marginal probability density function of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} (2/3) y² exp(−xy) dx
       = [−(2/3) y exp(−xy)]_0^{∞} = (2/3) y
The support of X is
RX = [0, ∞)
Thus, the conditional probability density function of X given Y = 2 is
fX|Y=2 (x) = fXY (x, 2) / fY (2)
           = (2/3) · 2² · exp(−2x) / (4/3) = 2 exp(−2x) if x ∈ [0, ∞)
             0                                            otherwise
Exercise 3
Let X be an absolutely continuous random variable with support
RX = [0, 1]
and probability density function
fX (x) = 1 if x ∈ [0, 1]
         0 otherwise
Let the support of Y be
RY = [0, 1]
and the conditional probability density function of Y given X = x be
fY|X=x (y) = (1 + 2xy) / (1 + x) if y ∈ [0, 1]
             0                    otherwise
Solution
The support of the vector [X Y] is RXY = [0, 1] × [0, 1], and for y ∈ [0, 1] the
marginal probability density function of Y is
fY (y) = ∫_0^1 (1 + 2xy) / (1 + x) dx
    (A) = ∫_0^1 (1 + x − x + 2yx) / (1 + x) dx
        = ∫_0^1 [1 + (2y − 1) x / (1 + x)] dx
    (B) = ∫_0^1 dx + (2y − 1) ∫_0^1 x / (1 + x) dx
    (C) = [x]_0^1 + (2y − 1) ∫_0^1 (1 + x − 1) / (1 + x) dx
        = 1 + (2y − 1) ∫_0^1 [1 − 1 / (1 + x)] dx
    (D) = 1 + (2y − 1) ( ∫_0^1 dx − ∫_0^1 1 / (1 + x) dx )
        = 1 + (2y − 1) ( [x]_0^1 − [ln(1 + x)]_0^1 )
        = 1 + (2y − 1) [1 − ln(2)]
        = 1 + 2 [1 − ln(2)] y − 1 + ln(2)
        = ln(2) + 2 [1 − ln(2)] y
where: in step A we have added and subtracted x from the numerator; in step B
we have used the linearity of the integral; in step C we have added and subtracted
1 from the numerator; in step D we have used the linearity of the integral. Thus,
the marginal probability density function of Y is
Chapter 27
Conditional expectation
27.1 Definition
The following informal definition is very similar to the definition of expected value
we have given in the lecture entitled Expected value (p. 127).
Definition 152 (informal) Let X and Y be two random variables. The condi-
tional expectation of X given Y = y is the weighted average of the values that
X can take on, where each possible value is weighted by its respective conditional
probability (conditional on the information that Y = y).
E [X | Y = y]
Definition 152: the weights of the average are given by the conditional probability
mass function³ of X.
Definition 153 Let X and Y be two discrete random variables. Let RX be the
support of X and let pX|Y=y (x) be the conditional probability mass function of X
given Y = y. The conditional expectation of X given Y = y is
E [X | Y = y] = Σ_{x∈RX} x pX|Y=y (x)
provided that
Σ_{x∈RX} |x| pX|Y=y (x) < ∞
If you do not understand the symbol Σ_{x∈RX} and the finiteness condition above
(absolute summability), go back to the lecture entitled Expected value (p. 127),
where they are explained.
The support of X is
RX = {0, 1, 2}
Thus, the conditional probability mass function of X given Y = 0 is
pX|Y=0 (x) = pXY (0, 0) / pY (0) = (1/3)/(2/3) = 1/2 if x = 0
             pXY (1, 0) / pY (0) = 0/(2/3) = 0       if x = 1
             pXY (2, 0) / pY (0) = (1/3)/(2/3) = 1/2 if x = 2
             0                                        if x ∉ RX
Definition 155 Let X and Y be two absolutely continuous random variables. Let
RX be the support of X and let fX|Y=y (x) be the conditional probability density
function⁵ of X given Y = y. The conditional expectation of X given Y = y is
E [X | Y = y] = ∫_{−∞}^{∞} x fX|Y=y (x) dx
provided that
∫_{−∞}^{∞} |x| fX|Y=y (x) dx < ∞
If you do not understand why an integration is required and why the finiteness
condition above (absolute integrability) is imposed, you can find an explanation in
the lecture entitled Expected value (p. 127).
fY (y) = 1/2 if y ∈ [2, 4]
         0   otherwise
When evaluated at y = 2, it is
fY (2) = 1/2
4 See p. 117.
5 See p. 213.
The support of X is
RX = [0, ∞)
Thus, the conditional probability density function of X given Y = 2 is
fX|Y=2 (x) = fXY (x, 2) / fY (2) = 2 exp(−2x) if x ∈ [0, ∞)
             0                                  otherwise
The conditional expectation of X given Y = 2 is
E [X | Y = 2] = ∫_{−∞}^{∞} x fX|Y=2 (x) dx
              = ∫_0^{∞} 2x exp(−2x) dx
          (A) = (1/2) ∫_0^{∞} t exp(−t) dt
          (B) = (1/2) { [−t exp(−t)]_0^{∞} + ∫_0^{∞} exp(−t) dt }
              = (1/2) { 0 − 0 + [−exp(−t)]_0^{∞} }
              = (1/2) { 0 + 1 } = 1/2
where: in step A we have performed a change of variable (t = 2x); in step B we
have performed an integration by parts.
where the integral is a Riemann-Stieltjes integral and the expected value exists and
is well-defined only as long as the integral is well-defined.
The above formula follows the same logic as the formula for the expected value:
E [X] = ∫_{−∞}^{∞} x dFX (x)
with the only difference that the unconditional distribution function FX (x) has
now been replaced with the conditional distribution function FX|Y=y (x). The
reader who feels unfamiliar with this formula can go back to the lecture entitled
Expected value (p. 127) and read an intuitive introduction to the Riemann-Stieltjes
integral and its use in probability theory.
6 See p. 215.
E [E [X | Y]]
  (A) = Σ_{y∈RY} E [X | Y = y] pY (y)
  (B) = Σ_{y∈RY} Σ_{x∈RX} x pX|Y=y (x) pY (y)
  (C) = Σ_{y∈RY} Σ_{x∈RX} x pXY (x, y)
      = Σ_{x∈RX} x Σ_{y∈RY} pXY (x, y)
  (D) = Σ_{x∈RX} x pX (x)
  (E) = E [X]
E [E [X | Y]]
7 See p. 120.
  (A) = ∫_{−∞}^{∞} E [X | Y = y] fY (y) dy
  (B) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x fX|Y=y (x) dx ) fY (y) dy
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX|Y=y (x) fY (y) dx dy
  (C) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fXY (x, y) dx dy
      = ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} fXY (x, y) dy ) dx
  (D) = ∫_{−∞}^{∞} x fX (x) dx
  (E) = E [X]
Exercise 1
Let [X Y] be a random vector with support
RXY = {[2 2], [2 0], [1 2], [0 2]}
and joint probability mass function
pXY (x, y) = 1/4 if x = 2 and y = 2
             1/4 if x = 2 and y = 0
             1/4 if x = 1 and y = 2
             1/4 if x = 0 and y = 2
             0   otherwise
What is the conditional expectation of X given Y = 2?
Solution
Let us compute the conditional probability mass function of X given Y = 2. The
marginal probability mass function of Y evaluated at y = 2 is
pY (2) = Σ_{(x,y)∈RXY : y=2} pXY (x, y) = pXY (2, 2) + pXY (1, 2) + pXY (0, 2) = 3/4
8 See p. 120.
The support of X is
RX = {0, 1, 2}
Thus, the conditional probability mass function of X given Y = 2 is
pX|Y=2 (x) = pXY (0, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 0
             pXY (1, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 1
             pXY (2, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 2
             0                                        if x ∉ RX
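From the conditional pmf just derived, the conditional expectation follows directly; a minimal Python sketch of the computation:

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1.
pXY = {(2, 2): F(1, 4), (2, 0): F(1, 4), (1, 2): F(1, 4), (0, 2): F(1, 4)}

pY2 = sum(p for (x, y), p in pXY.items() if y == 2)                     # 3/4
E_X_given_Y2 = sum(x * p for (x, y), p in pXY.items() if y == 2) / pY2

assert pY2 == F(3, 4)
assert E_X_given_Y2 == 1   # (0 + 1 + 2) / 3 = 1
```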
Exercise 2
Let X and Y be two random variables. Remember that the variance of X can be
computed as
Var [X] = E [X²] − (E [X])²   (27.1)
In a similar manner, the conditional variance of X, given Y = y, can be defined as
Var [X | Y = y] = E [X² | Y = y] − (E [X | Y = y])²   (27.2)
Prove that
Var [X] = E [Var [X | Y = y]] + Var [E [X | Y = y]]
Solution
This is proved as follows:
Var [X]
    = E [X²] − (E [X])²
(A) = E [E [X² | Y = y]] − (E [E [X | Y = y]])²
(B) = E [ Var [X | Y = y] + (E [X | Y = y])² ] − (E [E [X | Y = y]])²
(C) = E [Var [X | Y = y]] + E [ (E [X | Y = y])² ] − (E [E [X | Y = y]])²
(D) = E [Var [X | Y = y]] + Var [E [X | Y = y]]
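The decomposition can be checked numerically on the joint pmf of Exercise 1 above (a hypothetical check, using exact rationals):

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1, reused to verify the variance decomposition.
pXY = {(2, 2): F(1, 4), (2, 0): F(1, 4), (1, 2): F(1, 4), (0, 2): F(1, 4)}
ys = {y for _, y in pXY}

pY = {y: sum(p for (x, yy), p in pXY.items() if yy == y) for y in ys}
E_cond = {y: sum(x * p for (x, yy), p in pXY.items() if yy == y) / pY[y]
          for y in ys}
E2_cond = {y: sum(x * x * p for (x, yy), p in pXY.items() if yy == y) / pY[y]
           for y in ys}
Var_cond = {y: E2_cond[y] - E_cond[y] ** 2 for y in ys}

EX = sum(x * p for (x, _), p in pXY.items())
VarX = sum(x * x * p for (x, _), p in pXY.items()) - EX ** 2

E_of_var = sum(pY[y] * Var_cond[y] for y in ys)             # E[Var[X|Y]]
var_of_E = sum(pY[y] * E_cond[y] ** 2 for y in ys) - EX ** 2  # Var[E[X|Y]]

assert VarX == E_of_var + var_of_E == F(11, 16)
```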
Chapter 28
Independent random variables
Two random variables are independent if they convey no information about each
other and, as a consequence, receiving information about one of the two does not
change our assessment of the probability distribution of the other.
This lecture provides a formal definition of independence and discusses how to
verify whether two or more random variables are independent.
28.1 Definition
Recall (see the lecture entitled Independent events - p. 99) that two events A and
B are independent if and only if
P(A ∩ B) = P(A) P(B)
In other words, two random variables are independent if and only if the events
related to those random variables are independent events.
The independence between two random variables is also called statistical inde-
pendence.
Proposition 159 Two random variables X and Y are independent if and only if
FXY (x, y) = FX (x) FY (y), ∀x, y ∈ R
where FXY (x, y) is their joint distribution function¹ and FX (x) and FY (y) are
their marginal distribution functions².
Proof. Using some facts from measure theory (not proved here), it is possible to
demonstrate that, when checking for the condition
P({X ∈ A} ∩ {Y ∈ B}) = P({X ∈ A}) P({Y ∈ B})
it is sufficient to confine attention to sets A and B taking the form
A = (−∞, x]
B = (−∞, y]
Thus, two random variables are independent if and only if
P({X ∈ (−∞, x]} ∩ {Y ∈ (−∞, y]}) = P({X ∈ (−∞, x]}) P({Y ∈ (−∞, y]})
for any x, y ∈ R. Using the definitions of joint and marginal distribution function,
this condition can be written as
FXY (x, y) = FX (x) FY (y), ∀x, y ∈ R
Example 160 Let X and Y be two random variables with marginal distribution
functions
FX (x) = 0            if x < 0
         1 − exp(−x)  if x ≥ 0
FY (y) = 0            if y < 0
         1 − exp(−y)  if y ≥ 0
and joint distribution function
FXY (x, y) = 0                                    if x < 0 or y < 0
             1 − exp(−x) − exp(−y) + exp(−x − y)  if x ≥ 0 and y ≥ 0
X and Y are independent if and only if
FXY (x, y) = FX (x) FY (y)
which is straightforward to verify. When x < 0 or y < 0, then
FX (x) FY (y) = 0 = FXY (x, y)
When x ≥ 0 and y ≥ 0, then:
FX (x) FY (y) = [1 − exp(−x)] [1 − exp(−y)]
              = 1 − exp(−x) − exp(−y) + exp(−x) exp(−y)
              = 1 − exp(−x) − exp(−y) + exp(−x − y)
              = FXY (x, y)
1 See p. 118.
2 See p. 119.
where pXY (x, y) is their joint probability mass function³ and pX (x) and pY (y)
are their marginal probability mass functions⁴.
In order to verify whether X and Y are independent, we first need to derive the
marginal probability mass functions of X and Y. The support of X is
RX = {0, 1, 2}
which is obviously different from pXY (x, y). Therefore, X and Y are not indepen-
dent.
where fXY (x, y) is their joint probability density function⁵ and fX (x) and fY (y)
are their marginal probability density functions⁶.
fXY (x, y) = 0 if x ∉ [0, 1] or y ∉ [0, 1]
             1 if x ∈ [0, 1] and y ∈ [0, 1]
5 See p. 117.
6 See p. 119.
and
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx
       = ∫_{−∞}^{0} fXY (x, y) dx + ∫_0^1 fXY (x, y) dx + ∫_1^{∞} fXY (x, y) dx
       = 0 + ∫_0^1 fXY (x, y) dx + 0
       = 0 if y ∉ [0, 1]
         1 if y ∈ [0, 1]
Verifying that
fXY (x, y) = fX (x) fY (y)
is straightforward. When x ∉ [0, 1] or y ∉ [0, 1], then
for any sub-collection of k random variables Xi1, ..., Xik (where k ≤ n) and for
any collection of events {Xi1 ∈ A1}, ..., {Xik ∈ Ak}, where A1, ..., Ak ⊆ R.
for any n functions g1, ..., gn such that the above expected values exist and are
well-defined.
Cov [X1; X2] = 0
(see the Mutual independence via expectations property above). When g1 and g2
are identity functions (g1 (X1) = X1 and g2 (X2) = X2), then:
The converse is not true: two random variables that have zero covariance are
not necessarily independent.
for any sub-collection of k random vectors Xi1, ..., Xik (where k ≤ n) and for
any collection of events {Xi1 ∈ A1}, ..., {Xik ∈ Ak}.
All the equivalent conditions for the joint independence of a set of random
variables (see above) apply with obvious modifications also to random vectors.
Exercise 1
Consider two random variables X and Y having marginal distribution functions
FX (x) = 0   if x < 1
         1/2 if 1 ≤ x < 2
         1   if x ≥ 2
FY (y) = 0                                   if y < 0
         1 − (1/2) exp(−y) − (1/2) exp(−2y)  if y ≥ 0
If X and Y are independent, what is their joint distribution function?
Solution
For X and Y to be independent, their joint distribution function must be equal to
the product of their marginal distribution functions:
FXY (x, y) = 0                                     if x < 1 or y < 0
             1/2 − (1/4) exp(−y) − (1/4) exp(−2y)  if 1 ≤ x < 2 and y ≥ 0
             1 − (1/2) exp(−y) − (1/2) exp(−2y)    if x ≥ 2 and y ≥ 0
Exercise 2
Let [X Y] be a discrete random vector with support
Solution
In order to verify whether X and Y are independent, we first need to derive the
marginal probability mass functions of X and Y. The support of X is
RX = {0, 1}
pX (1) = Σ_{y∈RY} pXY (1, y) = pXY (1, 0) + pXY (1, 1) = 1/2
Exercise 3
Let [X Y] be an absolutely continuous random vector with support
Solution
The support of Y is
RY = [2, 3]
When y ∉ [2, 3], the marginal probability density function of Y is 0, while, when
y ∈ [2, 3], the marginal probability density function of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} exp(−x) dx
       = [−exp(−x)]_0^{∞} = −[0 − 1] = 1
Thus,
fY (y) = 1 if y ∈ [2, 3]
         0 otherwise
The support of X is
RX = [0, ∞)
When x ∉ [0, ∞), the marginal probability density function of X is 0, while, when
x ∈ [0, ∞), the marginal probability density function of X is
fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_2^3 exp(−x) dy = exp(−x)
Thus,
fX (x) = exp(−x) if x ∈ [0, ∞)
         0        otherwise
Verifying that
fXY (x, y) = fX (x) fY (y)
is straightforward. When x ∉ [0, ∞) or y ∉ [2, 3], then
Additional topics in
probability theory
Chapter 29
Probabilistic inequalities
This lecture introduces some probabilistic inequalities that are used in the proofs
of several important theorems in probability theory.
E [X] = E [X · 1]
      = E [X (1{X≥c} + 1{X<c})]
      = E [X 1{X≥c}] + E [X 1{X<c}]
Now, note that X 1{X<c} is a positive random variable and that the expected value
of a positive random variable is positive⁴:
E [X 1{X<c}] ≥ 0
Therefore,
E [X] ≥ E [X 1{X≥c}]
1 See p. 136.
2 In other words, X(ω) ≥ 0 for all ω ∈ Ω.
3 See p. 197.
4 See p. 150.
The random variable c · 1{X≥c} is less than or equal to the random variable
X · 1{X≥c} for any ω ∈ Ω:
c · 1{X≥c} ≤ X · 1{X≥c}
because c is always smaller than or equal to X when the indicator 1{X≥c} is not
zero. Since the expected value operator preserves inequalities⁵, we have
E [c 1{X≥c}] ≤ E [X 1{X≥c}]
Furthermore, by using the linearity of the expected value⁶ and the fact that the
expected value of an indicator is equal to the probability of the event it indicates⁷,
we obtain
c P(X ≥ c) ≤ E [X 1{X≥c}]
Combining this with
E [X] ≥ E [X 1{X≥c}]
yields
E [X] ≥ c P(X ≥ c)
Finally, since c is strictly positive, we can divide both sides of the right-hand
inequality by c to obtain Markov's inequality:
P(X ≥ c) ≤ E [X] / c
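The inequality can be illustrated on a hypothetical positive discrete random variable; the sketch below checks the bound for several thresholds:

```python
from fractions import Fraction as F

# A hypothetical positive discrete random variable.
pmf = {1: F(1, 2), 4: F(1, 4), 10: F(1, 4)}
EX = sum(x * p for x, p in pmf.items())   # 1/2 + 1 + 5/2 = 4

for c in (2, 5, 8):
    tail = sum(p for x, p in pmf.items() if x >= c)
    assert tail <= EX / c                 # Markov: P(X >= c) <= E[X]/c
```

Note that the bound is often loose: for c = 2 it gives P(X ≥ 2) ≤ 2, which is trivially true.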
Therefore,
P(|X − E [X]| ≥ k) ≤ Var [X] / k²
E [g(X)] ≥ g(E [X])
Proof. A function g is convex if, for any point x0, the graph of g lies entirely
above its tangent at the point x0:
g(x) ≥ g(x0) + b (x − x0), ∀x
where b is the slope of the tangent. By setting x = X and x0 = E [X], the inequality
becomes
g(X) ≥ g(E [X]) + b (X − E [X])
By taking the expected value of both sides of the inequality, and by using the fact
that the expected value operator preserves inequalities¹⁰, we obtain
E [g(X)] ≥ g(E [X]) + b (E [X] − E [X]) = g(E [X])
Proof. A function g is strictly convex if, for any point x0, the graph of g lies
entirely above its tangent at the point x0 (and strictly so for points different from
x0):
g(x) > g(x0) + b (x − x0), ∀x ≠ x0
where b is the slope of the tangent. By setting x = X and x0 = E [X], the inequality
becomes
g(X) > g(E [X]) + b (X − E [X]), ∀X ≠ E [X]
10 See p. 150.
and, of course, g(X) = g(E [X]) when X = E [X]. By taking the expected value of
both sides of the inequality, and by using the fact that the expected value operator
preserves inequalities, we obtain
E [g(X)] > g(E [X])
where the first inequality is strict because we have assumed that X is not almost
surely¹¹ constant, and, as a consequence, the event
{g(X) = g(E [X])}
does not have probability one.
If g is concave, then −g is convex, so that
E [−g(X)] ≥ −g(E [X])
By multiplying both sides by −1, and by using the linearity of the expected value,
we obtain the result:
E [g(X)] ≤ g(E [X])
If the function g is strictly concave and X is not almost surely constant, then
Exercise 1

Let X be a positive random variable whose expected value is

    E[X] = 10

Find a lower bound to the probability

    P(X < 20)

Solution

First of all, we need to use the formula for the probability of a complement:

    P(X < 20) = 1 − P(X ≥ 20)

By Markov's inequality,

    P(X ≥ 20) ≤ E[X]/20 = 10/20 = 1/2

By multiplying both sides of the inequality by −1, we obtain

    −P(X ≥ 20) ≥ −1/2

By adding 1 to both sides of the inequality, we obtain

    1 − P(X ≥ 20) ≥ 1 − 1/2 = 1/2

Thus, the lower bound is

    P(X < 20) ≥ 1/2
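The bound can also be illustrated numerically. The snippet below is a sketch: the exercise fixes only the mean, so the exponential distribution with mean 10 is an assumption made purely for illustration.

```python
import numpy as np

# Monte Carlo sanity check of the lower bound P(X < 20) >= 1/2 when E[X] = 10.
# The exponential distribution is an illustrative choice of a positive
# random variable with the required mean; the bound holds for any such X.
rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=1_000_000)

p_below = np.mean(x < 20)   # empirical P(X < 20)
print(p_below)              # well above the Markov lower bound 1/2
```

For this particular distribution P(X < 20) = 1 − e⁻² ≈ 0.865, comfortably above the worst-case bound 1/2.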
Exercise 2

Let X be a random variable such that

    E[X] = 0
    P(−3 < X < 2) = 1/2

Find a lower bound to its variance.

Solution

The lower bound can be derived thanks to Chebyshev's inequality:

    Var[X] ≥(A) 2² P(|X − E[X]| ≥ 2)
           =(B) 4 P(|X| ≥ 2)
           =(C) 4 [1 − P(−2 < X < 2)]
           ≥(D) 4 [1 − P(−3 < X < 2)]
           = 4 (1 − 1/2) = 2

where: in step A we have used Chebyshev's inequality with k = 2; in step B we have used the fact that E[X] = 0; in step C we have used the formula for the probability of a complement; in step D we have used the fact that the event {−2 < X < 2} is included in the event {−3 < X < 2}.
Exercise 3

Let X be a strictly positive random variable, such that

    E[X] = 1/2
    Var[X] = 1

What can you infer, using Jensen's inequality, about the expected value E[ln(2X)]?

Solution

The function

    g(x) = ln(2x)

has first derivative

    (d/dx) g(x) = (1/(2x)) · 2 = 1/x

and second derivative

    (d²/dx²) g(x) = −1/x²

The second derivative is strictly negative on the domain of definition of the function. Therefore, the function is strictly concave. Furthermore, X is not almost surely constant because it has strictly positive variance. Hence, by Jensen's inequality, we obtain

    E[ln(2X)] < ln(2E[X]) = ln(1) = 0
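The strict inequality can be illustrated with a simulation. A Gamma variable with shape 0.25 and scale 2 is an assumption made here only because it is a strictly positive variable with exactly E[X] = 1/2 and Var[X] = 1, as required by the exercise:

```python
import numpy as np

# Monte Carlo illustration of the strict Jensen inequality E[ln(2X)] < 0.
# Gamma(shape=0.25, scale=2) has mean 0.25*2 = 1/2 and variance 0.25*4 = 1,
# matching the exercise; any such non-degenerate positive X would do.
rng = np.random.default_rng(0)
x = rng.gamma(shape=0.25, scale=2.0, size=500_000)

m = np.mean(np.log(2 * x))
print(m)   # strictly negative, while ln(2 E[X]) = ln(1) = 0
```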
Chapter 30

Legitimate probability mass functions

Proposition 171 Let X be a discrete random variable. Its probability mass function pX(x) satisfies the following two properties:

1. non-negativity:

       pX(x) ≥ 0,  ∀x ∈ R                                         (30.1)

2. sum over the support equals 1:

       Σ_{x∈RX} pX(x) = 1                                          (30.2)

Proof. Since pX(x) = P(X = x) and probabilities cannot be negative, property (30.1) follows. Furthermore, the probability of a sure thing² must be equal to 1. Since, by the very definition of support, the event {X ∈ RX} is a sure thing, then

    1 = P(X ∈ RX) = Σ_{x∈RX} pX(x)
Proposition 172 Let pX (x) be a function satisfying properties (30.1) and (30.2).
Then, there exists a discrete random variable X whose probability mass function is
pX (x).
Can we use g(x) to build a probability mass function? First of all, we have to check that g(x) is non-negative. This is obviously true, because x² is always non-negative. Then, we have to check that the sum of g(x) over RX exists and is finite and strictly positive:

    S = Σ_{x∈RX} g(x) = g(1) + g(2) + g(3) + g(4) + g(5)
      = 1 + 4 + 9 + 16 + 25 = 55

Therefore, the function pX(x) = g(x)/S for x ∈ RX (and 0 otherwise) is a legitimate probability mass function.
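The normalization step can be sketched in a few lines of Python:

```python
# Turning g(x) = x^2 on R_X = {1, 2, 3, 4, 5} into a pmf by dividing by S = 55.
support = [1, 2, 3, 4, 5]
g = {x: x**2 for x in support}
S = sum(g.values())                 # 55
p = {x: g[x] / S for x in support}  # legitimate pmf

print(S, sum(p.values()))   # 55 and 1.0: properties (30.1)-(30.2) hold
```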
Exercise 1

Consider the following function:

    pX(x) = (1/10) x  if x ∈ {1, 2, 3, 4}
            0         otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ {1, 2, 3, 4} we have

    pX(x) = (1/10) x > 0

while for x ∉ {1, 2, 3, 4} we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = pX(1) + pX(2) + pX(3) + pX(4)
                   = (1/10)·1 + (1/10)·2 + (1/10)·3 + (1/10)·4 = 10/10 = 1
Exercise 2

Consider the function

    pX(x) = (1/14) x²  if x ∈ {1, 2, 3}
            0          otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ {1, 2, 3} we have

    pX(x) = (1/14) x² > 0

while for x ∉ {1, 2, 3} we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = pX(1) + pX(2) + pX(3)
                   = (1/14)·1² + (1/14)·2² + (1/14)·3²
                   = (1/14)·1 + (1/14)·4 + (1/14)·9 = 14/14 = 1
Exercise 3

Consider the function

    pX(x) = (3/4)·4^(1−x)  if x ∈ ℕ
            0              otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ ℕ we have

    pX(x) = (3/4)·4^(1−x) > 0

because 4^(1−x) is strictly positive. For x ∉ ℕ we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = Σ_{x=1}^∞ pX(x) = Σ_{x=1}^∞ (3/4)·4^(1−x)
                   = (3/4) Σ_{x=1}^∞ (1/4)^(x−1)
                   = (3/4) [1 + 1/4 + (1/4)² + (1/4)³ + ...]
                   = (3/4) · 1/(1 − 1/4) = (3/4) · (4/3) = 1
Chapter 31

Legitimate probability density functions
1. non-negativity:

       fX(x) ≥ 0,  ∀x ∈ R                                         (31.1)

2. integral over R equals 1:

       ∫_{−∞}^{∞} fX(x) dx = 1                                     (31.2)

Proof. By the very definition of probability density function¹,

    P(X ∈ [a, b]) = ∫_a^b fX(x) dx

for any interval [a, b]. Probabilities cannot be negative; therefore, P(X ∈ [a, b]) ≥ 0 and

    ∫_a^b fX(x) dx ≥ 0

for any interval [a, b]. But the above integral can be non-negative for all intervals [a, b] only if the integrand function itself is non-negative, i.e. if fX(x) ≥ 0 for all x. This proves property (31.1). Furthermore, the probability of a sure thing² must be equal to 1. Since {X ∈ (−∞, ∞)} is a sure thing, then

    1 = P(X ∈ (−∞, ∞)) = ∫_{−∞}^{∞} fX(x) dx

which proves property (31.2).

¹ See p. 107.
Proposition 175 Let fX (x) be a function satisfying properties (31.1) and (31.2).
Then, there exists an absolutely continuous random variable X whose probability
density function is fX (x).
Proof. Define

    fX(x) = (1/I) g(x)

where

    I = ∫_{−∞}^{∞} g(x) dx

I is strictly positive, thus fX(x) is non-negative and it satisfies property (31.1). It also satisfies property (31.2), because

    ∫_{−∞}^{∞} fX(x) dx = ∫_{−∞}^{∞} (1/I) g(x) dx
                        = (1/I) ∫_{−∞}^{∞} g(x) dx
                        = (1/I) · I = 1

Therefore, any non-negative function g(x) can be used to construct a probability density function if its integral over R exists and is finite and strictly positive.
Consider the function

    g(x) = x²  if x ∈ [0, 1]
           0   otherwise

² See the properties of probability (p. 70).
³ Non-negative means that g(x) ≥ 0 for any x ∈ R.
Can we use g(x) to build a probability density function? First of all, we have to check that g(x) is non-negative. This is obviously true, because x² is always non-negative. Then, we have to check that the integral of g(x) over R exists and is finite and strictly positive:

    I = ∫_{−∞}^{∞} g(x) dx = ∫_0^1 x² dx
      = [x³/3]₀¹ = 1/3 − 0 = 1/3

Therefore, the function

    fX(x) = (1/I) g(x) = 3x²  if x ∈ [0, 1]
            0                 otherwise

is a legitimate probability density function.
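The normalization can be checked by numerical integration; the snippet below is a quick sketch using SciPy's quadrature routine:

```python
from scipy.integrate import quad

# Numerical check that I = 1/3 and that f(x) = g(x)/I = 3 x^2 integrates
# to one over [0, 1], so it is a legitimate pdf.
I, _ = quad(lambda x: x**2, 0.0, 1.0)
total, _ = quad(lambda x: 3 * x**2, 0.0, 1.0)

print(I, total)   # approximately 1/3 and 1.0
```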
Exercise 1

Consider the function

    fX(x) = λ exp(−λx)  if x ∈ [0, ∞)
            0           if x ∈ (−∞, 0)

where λ ∈ (0, ∞). Prove that fX(x) is a legitimate probability density function.

Solution

Since λ > 0 and the exponential function is strictly positive, fX(x) ≥ 0 for any x ∈ R, so the non-negativity property is satisfied. The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_0^∞ λ exp(−λx) dx
                        = [−exp(−λx)]₀^∞
                        = 0 − (−1) = 1
Exercise 2

Consider the function

    fX(x) = 1/(u − l)  if x ∈ [l, u]
            0          if x ∉ [l, u]

where l and u are two constants with l < u. Prove that fX(x) is a legitimate probability density function.

Solution

l < u implies 1/(u − l) > 0, so fX(x) ≥ 0 for any x ∈ R and the non-negativity property is satisfied. The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_l^u 1/(u − l) dx
                        = (1/(u − l)) ∫_l^u dx
                        = (1/(u − l)) [x]_l^u
                        = (1/(u − l)) (u − l) = 1
Exercise 3

Consider the function

    fX(x) = 2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2)  if x ∈ [0, ∞)
            0                                            if x ∉ [0, ∞)

Prove that fX(x) is a legitimate probability density function.

Solution

Remember the definition of the Gamma function:

    Γ(z) = ∫_0^∞ x^(z−1) exp(−x) dx

Γ(z) is obviously strictly positive for any z, since exp(−x) is strictly positive and x^(z−1) is strictly positive on the interval of integration (except at 0, where it is 0). Therefore, fX(x) satisfies the non-negativity property, because the four factors in the product

    2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2)

are all non-negative on the interval [0, ∞).

The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_0^∞ 2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2) dx
                        = 2^(−n/2) (Γ(n/2))^(−1) ∫_0^∞ x^(n/2−1) exp(−x/2) dx
                        =(A) 2^(−n/2) (Γ(n/2))^(−1) ∫_0^∞ (2t)^(n/2−1) exp(−t) 2 dt
                        = 2^(−n/2) (Γ(n/2))^(−1) 2^(n/2) ∫_0^∞ t^(n/2−1) exp(−t) dt
                        = (Γ(n/2))^(−1) ∫_0^∞ t^(n/2−1) exp(−t) dt
                        =(B) (Γ(n/2))^(−1) Γ(n/2)
                        = 1

where: in step A we have performed the change of variable x = 2t; in step B we have used the definition of the Gamma function⁴.

⁴ See p. 55.
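The same check can be carried out numerically for several values of the parameter n; the sketch below uses the standard-library Gamma function and SciPy quadrature:

```python
import math
from scipy.integrate import quad

# Numerical check that the Chi-square density above integrates to one
# for a few illustrative values of the degrees-of-freedom parameter n.
def chi2_pdf(x, n):
    return 2 ** (-n / 2) / math.gamma(n / 2) * x ** (n / 2 - 1) * math.exp(-x / 2)

totals = []
for n in (2, 3, 6, 10):
    total, _ = quad(chi2_pdf, 0, math.inf, args=(n,))
    totals.append(total)

print([round(t, 6) for t in totals])   # 1.0 for every n
```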
Chapter 32

Factorization of joint probability mass functions

This lecture discusses how to factorize the joint probability mass function¹ of two discrete random variables X and Y into two factors:

1. the conditional probability mass function² of X given Y = y;

2. the marginal probability mass function³ of Y.

The factorization is usually accomplished in two steps:

1. marginalize pXY(x, y) by summing it over all possible values of x and obtain the marginal probability mass function pY(y);

2. divide pXY(x, y) by pY(y) and obtain the conditional probability mass function pX|Y=y(x) (of course this step makes sense only when pY(y) > 0).

¹ See p. 116.
² See p. 210.
³ See p. 120.

In some cases, the first step (marginalization) can be difficult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of pXY(x, y) and verifying whether the guess is correct with the help of the following proposition:
Proposition 178 (factorization method) Suppose there are two functions h(y) and g(x, y) such that:

1. for any x and y, the following holds:

       pXY(x, y) = g(x, y) h(y)

2. for any fixed y, g(x, y), considered as a function of x, is a probability mass function.

Then:

    h(y) = pY(y)  and  g(x, y) = pX|Y=y(x)
Proof. By the definition of marginal probability mass function and by property 1:

    pY(y) = Σ_{x∈RX} pXY(x, y)
          = Σ_{x∈RX} g(x, y) h(y)
          =(A) h(y) Σ_{x∈RX} g(x, y)
          =(B) h(y)

where: in step A we have used the fact that h(y) does not depend on x; in step B we have used the fact that, for any fixed y, g(x, y), considered as a function of x, is a probability mass function and the sum⁴ of a probability mass function over its support equals 1. Therefore,

    pXY(x, y) = g(x, y) h(y) = g(x, y) pY(y)

which, in turn, implies

    g(x, y) = pXY(x, y)/pY(y) = pX|Y=y(x)
Thus, whenever we are given a formula for the joint probability mass function pXY(x, y) and we want to find the marginal and the conditional functions, we have to manipulate the formula and express it as the product of:

1. a function of x and y that, considered as a function of x for any fixed y, is a probability mass function (this will be the conditional pmf);

2. a function of y alone (this will be the marginal pmf).
As an example, let X be a 3×1 discrete random vector having a multinomial distribution with parameters p₁, p₂, p₃ (with p₁ + p₂ + p₃ = 1) and n. Its support is

    RX = {x ∈ Z³₊ : x₁ + x₂ + x₃ = n}

where x₁, x₂, x₃ denote the components of the vector x. The joint probability mass function of X is

    pX(x₁, x₂, x₃) = n!/(x₁! x₂! x₃!) p₁^{x₁} p₂^{x₂} p₃^{x₃}  if (x₁, x₂, x₃) ∈ RX
                     0                                          otherwise

Note that:

    n!/(x₁! x₂! x₃!) p₁^{x₁} p₂^{x₂} p₃^{x₃}
      = [(n−x₃)!/(x₁! x₂!)] · [p₁^{x₁} p₂^{x₂}/(1−p₃)^{n−x₃}] · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}
      =(A) [(n−x₃)!/(x₁! x₂!)] · [p₁^{x₁} p₂^{x₂}/(1−p₃)^{x₁+x₂}] · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}
      = [(n−x₃)!/(x₁! x₂!)] (p₁/(1−p₃))^{x₁} (p₂/(1−p₃))^{x₂} · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}

where in step A we have used the fact that, on the support,

    x₁ + x₂ + x₃ = n

so that n − x₃ = x₁ + x₂. Therefore, the joint pmf can be written as pX(x₁, x₂, x₃) = g(x₁, x₂, x₃) h(x₃), where

    g(x₁, x₂, x₃) = (n−x₃)!/(x₁! x₂!) (p₁/(1−p₃))^{x₁} (p₂/(1−p₃))^{x₂}  if (x₁, x₂) ∈ Z²₊ and x₁ + x₂ = n − x₃
                    0                                                      otherwise

and:

    h(x₃) = n!/((n−x₃)! x₃!) p₃^{x₃} (1−p₃)^{n−x₃}  if x₃ ∈ Z₊ and x₃ ≤ n
            0                                         otherwise

For any x₃ ≤ n, g(x₁, x₂, x₃) is the probability mass function of a multinomial distribution with parameters p₁/(1−p₃), p₂/(1−p₃) and n − x₃. Therefore, by the factorization method, h(x₃) is the marginal probability mass function of X₃ (a binomial distribution with parameters n and p₃) and g is the conditional probability mass function of (X₁, X₂) given X₃ = x₃.
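The factorization can be verified by brute force for a small multinomial: summing the joint pmf over (x₁, x₂) must reproduce the binomial marginal of X₃. The parameter values below are illustrative assumptions:

```python
import math
from itertools import product

# Brute-force check: marginalizing the multinomial joint pmf over (x1, x2)
# gives the binomial pmf of X3 with parameters n and p3.
n, p1, p2, p3 = 5, 0.2, 0.3, 0.5

def joint(x1, x2, x3):
    c = math.factorial(n) // (math.factorial(x1) * math.factorial(x2) * math.factorial(x3))
    return c * p1**x1 * p2**x2 * p3**x3

ok = []
for x3 in range(n + 1):
    marginal = sum(joint(x1, x2, x3)
                   for x1, x2 in product(range(n + 1), repeat=2)
                   if x1 + x2 + x3 == n)
    binom = math.comb(n, x3) * p3**x3 * (1 - p3)**(n - x3)
    ok.append(abs(marginal - binom) < 1e-12)

print(all(ok))   # True
```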
Chapter 33

Factorization of joint probability density functions

This lecture discusses how to factorize the joint probability density function¹ of two absolutely continuous random variables (or random vectors) X and Y into two factors:

1. the conditional probability density function² of X given Y = y;

2. the marginal probability density function³ of Y.

The factorization is usually accomplished in two steps:

1. marginalize fXY(x, y) by integrating it over all possible values of x and obtain the marginal probability density function fY(y);

2. divide fXY(x, y) by fY(y) and obtain the conditional probability density function fX|Y=y(x) (of course this step makes sense only when fY(y) > 0).

In some cases, the first step (marginalization) can be difficult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of fXY(x, y) and verifying whether the guess is correct with the help of the following proposition:
Proposition 181 (factorization method) Suppose there are two functions h(y) and g(x, y) such that:

1. for any x and y, the following holds:

       fXY(x, y) = g(x, y) h(y)

2. for any fixed y, g(x, y), considered as a function of x, is a probability density function.

Then:

    h(y) = fY(y)  and  g(x, y) = fX|Y=y(x)

Proof. By the definition of marginal probability density function and by property 1:

    fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx
          = ∫_{−∞}^{∞} g(x, y) h(y) dx
          =(A) h(y) ∫_{−∞}^{∞} g(x, y) dx
          =(B) h(y)

where: in step A we have used the fact that h(y) does not depend on x; in step B we have used the fact that, for any fixed y, g(x, y), considered as a function of x, is a probability density function and the integral of a probability density function over R equals 1 (see p. 251). Therefore
    fXY(x, y) = g(x, y) h(y) = g(x, y) fY(y)

which, in turn, implies

    g(x, y) = fXY(x, y)/fY(y) = fX|Y=y(x)

Thus, whenever we are given a formula for the joint density function fXY(x, y) and we want to find the marginal and the conditional functions, we have to manipulate the formula and express it as the product of:

1. a function of x and y that, considered as a function of x for any fixed y, is a probability density function (this will be the conditional pdf);

2. a function of y alone (this will be the marginal pdf).
As an example, consider a joint probability density function that can be written as

    fXY(x, y) = g(x, y) h(y)

where

    g(x, y) = y exp(−yx)  if x ∈ [0, ∞)
              0           otherwise

and

    h(y) = 1/2  if y ∈ [1, 3]
           0    otherwise

Note that g(x, y) is a probability density function in x for any fixed y (it is the probability density function of an exponential random variable⁴ with parameter y). Therefore, by the factorization method, h(y) = fY(y) is the marginal density of Y (a uniform density on the interval [1, 3]) and g(x, y) = fX|Y=y(x) is the conditional density of X given Y = y.

⁴ See p. 365.
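Property 2 of the factorization method can be verified numerically: for each fixed y, g(x, y) must integrate to one in x. A quick sketch:

```python
import math
from scipy.integrate import quad

# For each fixed y, g(x, y) = y * exp(-y x) is an exponential density in x
# and must integrate to one over [0, inf); h(y) = 1/2 on [1, 3] is then the
# marginal density of Y by the factorization method.
totals = []
for y in (1.0, 2.0, 3.0):
    total, _ = quad(lambda x, y=y: y * math.exp(-y * x), 0, math.inf)
    totals.append(total)

print([round(t, 6) for t in totals])   # 1.0 for every y
```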
Chapter 34

Functions of random variables and their distribution
Let X be a random variable with known distribution. Let another random variable
Y be a function of X:
Y = g (X)
where g : R → R. How do we derive the distribution of Y from the distribution of X?
There is no general answer to this question. However, there are several special
cases in which it is easy to derive the distribution of Y . We discuss these cases
below.
    RY = {y = g(x) : x ∈ RX}

¹ See p. 108.
Proof. Of course, the support RY is determined by g(x) and by all the values X can take. The distribution function of Y can be derived as follows:

1. if y is lower than the lowest value Y can take on, then P(Y ≤ y) = 0, so:

       FY(y) = 0  if y < ȳ, ∀ȳ ∈ RY

2. if y belongs to the support of Y, then:

       FY(y) =(A) P(Y ≤ y)
             =(B) P(g(X) ≤ y)
             =(C) P(g⁻¹(g(X)) ≤ g⁻¹(y))
             = P(X ≤ g⁻¹(y))
             =(D) FX(g⁻¹(y))

3. if y is higher than the highest value Y can take on, then P(Y ≤ y) = 1, so:

       FY(y) = 1  if y > ȳ, ∀ȳ ∈ RY
    RX = [1, 2]

Let

    Y = X²
    RY = {y = g(x) : x ∈ RX}

Proof. This proposition is a trivial consequence of the fact that a strictly increasing function is invertible:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
    RX = {1, 2, 3}

Let

    Y = g(X) = 3 + X²

The support of Y is

    RY = {4, 7, 12}
    RY = {y = g(x) : x ∈ RX}

Proof. Of course, the support RY is determined by g(x) and by all the values X can take. The distribution function of Y can be derived as follows:

1. if y is lower than the lowest value Y can take on, then P(Y ≤ y) = 0, so:

       FY(y) = 0  if y < ȳ, ∀ȳ ∈ RY

2. if y belongs to the support of Y, then:

       FY(y) =(A) P(Y ≤ y)
             = 1 − P(Y > y)
             =(B) 1 − P(g(X) > y)
             =(C) 1 − P(g⁻¹(g(X)) < g⁻¹(y))
             = 1 − P(X < g⁻¹(y))
             = 1 − P(X < g⁻¹(y)) − P(X = g⁻¹(y)) + P(X = g⁻¹(y))
             = 1 − P(X ≤ g⁻¹(y)) + P(X = g⁻¹(y))
             =(D) 1 − FX(g⁻¹(y)) + P(X = g⁻¹(y))

3. if y is higher than the highest value Y can take on, then P(Y ≤ y) = 1, so:

       FY(y) = 1  if y > ȳ, ∀ȳ ∈ RY
Proof. The proof of this proposition is identical to the proof of the proposition for strictly increasing functions. In fact, the only property that matters is that a strictly decreasing function is invertible:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
Let X be a discrete random variable with support

    RX = {1, 2, 3}

and probability mass function pX(x) = (1/14) x² on the support. Let

    Y = g(X) = 1 − 2X

The support of Y is

    RY = {−5, −3, −1}

The function g is strictly decreasing and its inverse is

    g⁻¹(y) = 1/2 − (1/2) y

The probability mass function of Y is

    pY(y) = (1/14) (1/2 − (1/2) y)²  if y ∈ RY
            0                        if y ∉ RY
    RY = {y = g(x) : x ∈ RX}
Example 194 Let X be a uniform random variable⁴ on the interval [0, 1], i.e. an absolutely continuous random variable with support

    RX = [0, 1]

and probability density function

    fX(x) = 1  if x ∈ RX
            0  if x ∉ RX

Let

    Y = g(X) = −ln(X)

The support of Y is

    RY = [0, ∞)

where we can safely ignore the fact that g(0) = ∞, because {X = 0} is a zero-probability event⁵. The function g is strictly decreasing and its inverse is

    g⁻¹(y) = exp(−y)

with derivative

    dg⁻¹(y)/dy = −exp(−y)

The probability density function of Y is

    fY(y) = exp(−y)  if y ∈ RY
            0        if y ∉ RY
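This is the classical inverse-transform construction of an exponential variable, and it is easy to check by simulation:

```python
import numpy as np

# Monte Carlo check of Example 194: if X is uniform on [0, 1], then
# Y = -ln(X) has density exp(-y) on [0, inf), i.e. Y is a standard
# exponential variable with mean 1 and variance 1.
rng = np.random.default_rng(0)
y = -np.log(rng.uniform(size=1_000_000))

print(y.mean(), y.var())   # both close to 1
```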
    RY = {y = g(x) : x ∈ RX}

Proof. The proof of this proposition is identical to the proofs of the propositions for strictly increasing and strictly decreasing functions found above:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
    RY = {y = g(x) : x ∈ RX}

If

    dg⁻¹(y)/dy ≠ 0,  ∀y ∈ RY

then the probability density function of Y is

    fY(y) = fX(g⁻¹(y)) |dg⁻¹(y)/dy|  if y ∈ RY
            0                         if y ∉ RY
Exercise 1

Let X be an absolutely continuous random variable with support

    RX = [0, 2]

and probability density function

    fX(x) = (3/8) x²  if x ∈ RX
            0         otherwise

Let

    Y = g(X) = √(X + 1)

Find the probability density function of Y.

Solution

The support of Y is

    RY = [1, √3]

The function g is strictly increasing and its inverse is

    g⁻¹(y) = y² − 1

with derivative

    dg⁻¹(y)/dy = 2y

The probability density function of Y is

    fY(y) = (3/8) (y² − 1)² · 2y  if y ∈ RY
            0                     if y ∉ RY
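A quick numerical sanity check: the derived density must integrate to one over its support [1, √3]:

```python
import math
from scipy.integrate import quad

# Check that f_Y(y) = (3/8)(y^2 - 1)^2 * 2y integrates to one on [1, sqrt(3)],
# confirming the change-of-variable computation above.
total, _ = quad(lambda y: (3 / 8) * (y**2 - 1) ** 2 * 2 * y, 1.0, math.sqrt(3))

print(round(total, 6))   # 1.0
```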
Exercise 2

Let X be an absolutely continuous random variable with support

    RX = [0, 2]

and probability density function

    fX(x) = 1/2  if x ∈ RX
            0    otherwise

Let

    Y = g(X) = −X²

Find the probability density function of Y.

Solution

The support of Y is

    RY = [−4, 0]

The function g is strictly decreasing and its inverse is

    g⁻¹(y) = (−y)^(1/2)

with derivative

    dg⁻¹(y)/dy = −(1/2) (−y)^(−1/2)

The probability density function of Y is

    fY(y) = (1/2)·(1/2) (−y)^(−1/2) = (1/4) (−y)^(−1/2)  if y ∈ RY
            0                                             if y ∉ RY
Exercise 3

Let X be a discrete random variable with support

    RX = {1, 2, 3, 4}

Let

    Y = g(X) = X − 1

Find the probability mass function of Y.

Solution

The support of Y is

    RY = {0, 1, 2, 3}

The function g is strictly increasing and its inverse is

    g⁻¹(y) = y + 1

Therefore, the probability mass function of Y is

    pY(y) = pX(y + 1)  if y ∈ RY
            0          if y ∉ RY
    RY = {y = g(x) : x ∈ RX}

Proof. If y ∈ RY, then

    pY(y) = P(Y = y) = P(g(X) = y) = P(X = g⁻¹(y)) = pX(g⁻¹(y))

where we have used the fact that g is one-to-one on the support of Y, and hence it possesses an inverse g⁻¹(y). If y ∉ RY, then, trivially, pY(y) = 0.
Example 198 Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂. Let the support of X be

    RX = {[1 1]ᵀ, [2 0]ᵀ}

Let

    Y = g(X) = 2X

The support of Y is

    RY = {[2 2]ᵀ, [4 0]ᵀ}
one-to-one and differentiable on the support of X. Denote by J_{g⁻¹}(y) the Jacobian matrix of g⁻¹(y), i.e.,

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁   ∂x₁/∂y₂   ...  ∂x₁/∂y_K ]
                 [ ∂x₂/∂y₁   ∂x₂/∂y₂   ...  ∂x₂/∂y_K ]
                 [    ...       ...    ...     ...    ]
                 [ ∂x_K/∂y₁  ∂x_K/∂y₂  ...  ∂x_K/∂y_K ]

where yᵢ is the i-th component of y, and xᵢ is the i-th component of x = g⁻¹(y). Then, the support of Y = g(X) is

    RY = {y = g(x) : x ∈ RX}

If

    det J_{g⁻¹}(y) ≠ 0,  ∀y ∈ RY

then the joint probability density function of Y is

    fY(y) = fX(g⁻¹(y)) |det J_{g⁻¹}(y)|  if y ∈ RY
            0                             if y ∉ RY
    Y = a + BX
Let X be a 2×1 absolutely continuous random vector with support

    RX = [1, 2] × [0, ∞)

and define

    Y₁ = 3X₁
    Y₂ = −X₂

The inverse function g⁻¹(y) is defined by

    x₁ = y₁/3
    x₂ = −y₂

The Jacobian matrix of g⁻¹(y) is

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁  ∂x₁/∂y₂ ] = [ 1/3   0 ]
                 [ ∂x₂/∂y₁  ∂x₂/∂y₂ ]   [  0   −1 ]

Its determinant is

    det J_{g⁻¹}(y) = (1/3)·(−1) − 0·0 = −1/3

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = 3x₁, y₂ = −x₂, x₁ ∈ [1, 2], x₂ ∈ [0, ∞)}
       = [3, 6] × (−∞, 0]
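Since the inverse map is linear, its Jacobian is constant and the determinant can be checked directly (the sign convention x₂ = −y₂ is the one implied by the determinant −1/3 and the support (−∞, 0]):

```python
import numpy as np

# Jacobian of the inverse map x1 = y1/3, x2 = -y2; its determinant is -1/3,
# so |det J| = 1/3 enters the change-of-variable formula for densities.
J = np.array([[1 / 3, 0.0],
              [0.0, -1.0]])
d = np.linalg.det(J)

print(d)   # -1/3
```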
If

    g(x) = x₁ + ... + x_K

then the distribution of Y = g(X) can be derived by using the convolution formulae illustrated in the lecture entitled Sums of independent random variables (p. 323).
by using the transformation theorem. If φY(t) is recognized as the joint characteristic function of a known distribution, then such a distribution is the distribution of Y, because two random vectors have the same distribution if and only if they have the same joint characteristic function.
Exercise 1

Let X₁ be a uniform random variable⁷ with support

    RX₁ = [1, 2]

and probability density function

    fX₁(x₁) = 1  if x₁ ∈ RX₁
              0  if x₁ ∉ RX₁

Let X₂ be an absolutely continuous random variable, independent of X₁, with support RX₂ = [0, 2] and probability density function

    fX₂(x₂) = (3/8) x₂²  if x₂ ∈ RX₂
              0          if x₂ ∉ RX₂

Define

    Y₁ = X₁²
    Y₂ = X₁ + X₂

Find the joint probability density function of Y = [Y₁ Y₂]ᵀ.

⁴ See p. 297.
⁵ See p. 134.
⁶ See p. 315.
⁷ See p. 359.

Solution

Since X₁ and X₂ are independent, their joint probability density function is equal to the product of their marginal density functions:

    fX(x₁, x₂) = fX₁(x₁) fX₂(x₂) = (3/8) x₂²  if x₁ ∈ [1, 2] and x₂ ∈ [0, 2]
                 0                             otherwise

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = x₁², y₂ = x₁ + x₂, x₁ ∈ [1, 2], x₂ ∈ [0, 2]}
       = {[y₁ y₂]ᵀ : y₁ ∈ [1, 4], y₂ ∈ [√y₁, √y₁ + 2]}

The function y = g(x) is one-to-one on the support of X and its inverse g⁻¹(y) is defined by

    x₁ = √y₁
    x₂ = y₂ − √y₁

The determinant of the Jacobian matrix of g⁻¹(y) is (1/2) y₁^(−1/2), so that, for y ∈ RY,

    fY(y) = fX(g⁻¹(y)) |det J_{g⁻¹}(y)| = fX(√y₁, y₂ − √y₁) · (1/2) y₁^(−1/2)
          = (3/8) (y₂ − √y₁)² · (1/2) y₁^(−1/2) = (3/16) y₁^(−1/2) (y₂ − √y₁)²

while, for y ∉ RY, the joint probability density function is fY(y) = 0.
Exercise 2

Let X be a 2×1 random vector with support

    RX = [0, ∞) × [0, ∞)

and define

    Y₁ = 2X₁
    Y₂ = X₁ + X₂

Find the support of Y and the determinant of the Jacobian matrix of g⁻¹.

Solution

The inverse function g⁻¹(y) is defined by

    x₁ = y₁/2
    x₂ = y₂ − y₁/2

The Jacobian matrix of g⁻¹(y) is

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁  ∂x₁/∂y₂ ] = [  1/2   0 ]
                 [ ∂x₂/∂y₁  ∂x₂/∂y₂ ]   [ −1/2   1 ]

Its determinant is

    det J_{g⁻¹}(y) = (1/2)·1 − 0·(−1/2) = 1/2

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = 2x₁, y₂ = x₁ + x₂, x₁ ∈ [0, ∞), x₂ ∈ [0, ∞)}
       = {[y₁ y₂]ᵀ : y₁ ∈ [0, ∞), y₂ ∈ [y₁/2, ∞)}
Chapter 36

Moments and cross-moments

This lecture introduces the notions of moment of a random variable and cross-moment of a random vector.

36.1 Moments

36.1.1 Definition of moment

The n-th moment of a random variable X is the expected value of its n-th power:

    μX(n) = E[Xⁿ]

If E[Xⁿ] exists and is finite, then X is said to possess a finite n-th moment and μX(n) is called the n-th moment of X. If E[Xⁿ] is not well-defined, then we say that X does not possess the n-th moment.

The n-th central moment of a random variable X is the expected value of the n-th power of the deviation of X from its expected value:

    μ̄X(n) = E[(X − E[X])ⁿ]

If E[(X − E[X])ⁿ] exists and is finite, then X is said to possess a finite n-th central moment and μ̄X(n) is called the n-th central moment of X.
36.2 Cross-moments

36.2.1 Definition of cross-moment

Let X be a K×1 random vector. A cross-moment of X is the expected value of the product of integer powers of the entries of X:

    E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]

If the above expected value exists and is finite, it is denoted by

    μX(n₁, n₂, ..., n_K) = E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]          (36.2)
Example 205 Let X be a 3×1 discrete random vector and denote its components by X₁, X₂ and X₃. Let the support of X be

    RX = {[1 2 1]ᵀ, [2 1 3]ᵀ, [3 3 2]ᵀ}

and let each of the three points of the support have probability 1/3. The cross-moment

    μX(1, 2, 1) = E[X₁ X₂² X₃]

can be computed as

    μX(1, 2, 1) = E[X₁ X₂² X₃]
                = Σ_{(x₁,x₂,x₃)∈RX} x₁ x₂² x₃ pX(x₁, x₂, x₃)
                = 1·2²·1·pX(1, 2, 1) + 2·1²·3·pX(2, 1, 3) + 3·3²·2·pX(3, 3, 2)
                = 4·(1/3) + 6·(1/3) + 54·(1/3) = 64/3
¹ See p. 117.
² See p. 134.
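The computation in Example 205 is easy to reproduce exactly with rational arithmetic:

```python
from fractions import Fraction

# Direct computation of mu_X(1, 2, 1) = E[X1 * X2^2 * X3] in Example 205,
# where each of the three support points has probability 1/3.
support = [(1, 2, 1), (2, 1, 3), (3, 3, 2)]
moment = sum(Fraction(1, 3) * x1 * x2**2 * x3 for x1, x2, x3 in support)

print(moment)   # 64/3
```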
If the above expected value exists and is finite, the n-th central cross-moment of X is defined as

    μ̄X(n₁, n₂, ..., n_K) = E[ ∏_{k=1}^K (X_k − E[X_k])^{n_k} ]          (36.4)

Example 207 Let X be a 3×1 discrete random vector and denote its components by X₁, X₂ and X₃. Let the support of X be

    RX = {[4 2 4]ᵀ, [2 1 1]ᵀ, [1 3 2]ᵀ}
Chapter 37

Moment generating function of a random variable

37.1 Definition

We start this lecture by giving a definition of mgf.

Let X be a random variable. If the expected value

    E[exp(tX)]

exists and is finite for all real numbers t belonging to a closed interval [−h, h] ⊂ R, with h > 0, then we say that X possesses a moment generating function and the function MX : [−h, h] → R defined by

    MX(t) = E[exp(tX)]

is called the moment generating function (mgf) of X.

The following example shows how the mgf of an exponential random variable is derived.

¹ See p. 285.
² See p. 307.
Example 209 Let X be an exponential random variable³ with parameter λ ∈ (0, ∞), i.e. an absolutely continuous random variable with support

    RX = [0, ∞)

and probability density function

    fX(x) = λ exp(−λx)  if x ∈ RX
            0           if x ∉ RX

The expected value E[exp(tX)] can be computed as follows:

    E[exp(tX)] = ∫_0^∞ exp(tx) λ exp(−λx) dx = λ ∫_0^∞ exp(−(λ − t)x) dx
               =(A) λ/(λ − t)

where: in step A we have assumed that t < λ, which is necessary for the integral to be finite. Therefore, the expected value exists and is finite for t ∈ [−h, h] if h is such that 0 < h < λ, and X possesses an mgf

    MX(t) = λ/(λ − t)
Proposition 210 If a random variable X possesses an mgf MX(t), then, for any n ∈ ℕ, the n-th moment of X, denoted by μX(n), exists and is finite. Furthermore,

    μX(n) = E[Xⁿ] = dⁿMX(t)/dtⁿ |_{t=0}

where dⁿMX(t)/dtⁿ |_{t=0} is the n-th derivative of MX(t) with respect to t, evaluated at the point t = 0.

Proof. Proving the above proposition is quite complicated, because a lot of analytical details must be taken care of (see, e.g., Pfeiffer⁴ - 1978). The intuition, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions we can differentiate through the expected value, as follows:

    dⁿMX(t)/dtⁿ = dⁿ/dtⁿ E[exp(tX)] = E[dⁿ/dtⁿ exp(tX)] = E[Xⁿ exp(tX)]

Evaluating at t = 0, we obtain

    dⁿMX(t)/dtⁿ |_{t=0} = E[Xⁿ exp(0·X)] = E[Xⁿ] = μX(n)

³ See p. 365.
⁴ Pfeiffer, P. E. (1978) Concepts of probability theory, Courier Dover Publications.
Example 211 In Example 209 we have demonstrated that the mgf of an exponential random variable is

    MX(t) = λ/(λ − t)

The expected value of X can be computed by taking the first derivative of the mgf:

    dMX(t)/dt = λ/(λ − t)²

and evaluating it at t = 0:

    E[X] = dMX(t)/dt |_{t=0} = λ/(λ − 0)² = 1/λ

The second moment of X can be computed by taking the second derivative of the mgf:

    d²MX(t)/dt² = 2λ/(λ − t)³

and evaluating it at t = 0:

    E[X²] = d²MX(t)/dt² |_{t=0} = 2λ/(λ − 0)³ = 2/λ²
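The derivatives in Example 211 can be verified symbolically; a short sketch with SymPy:

```python
import sympy as sp

# Symbolic check of Example 211: the first two derivatives of the exponential
# mgf lambda/(lambda - t), evaluated at t = 0, give 1/lambda and 2/lambda^2.
t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)

m1 = sp.diff(M, t).subs(t, 0)
m2 = sp.diff(M, t, 2).subs(t, 0)

print(sp.simplify(m1), sp.simplify(m2))   # 1/lam and 2/lam**2
```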
Proof. For a fully general proof of this proposition see, e.g., Feller⁶ (2008). We just give an informal proof for the special case in which X and Y are discrete random variables taking only finitely many values. The "only if" part is trivial. If X and Y have the same distribution, then

    MX(t) = E[exp(tX)] = E[exp(tY)] = MY(t)

The "if" part is proved as follows. Denote by RX and RY the supports of X and Y, and by pX(x) and pY(y) their probability mass functions⁷. Denote by A the union of the two supports:

    A = RX ∪ RY

and by a₁, ..., aₙ the elements of A. The mgf of X can be written as

    MX(t) = E[exp(tX)]
          =(A) Σ_{x∈RX} exp(tx) pX(x)
          =(B) Σ_{i=1}^n exp(t aᵢ) pX(aᵢ)

where: in step A we have used the definition of expected value; in step B we have used the fact that pX(aᵢ) = 0 when aᵢ ∉ RX. An analogous expression holds for MY(t). If X and Y have the same mgf, then, for any t belonging to a closed neighborhood of zero,

    MX(t) = MY(t)

and

    Σ_{i=1}^n exp(t aᵢ) pX(aᵢ) = Σ_{i=1}^n exp(t aᵢ) pY(aᵢ)

This can be true for any t belonging to a closed neighborhood of zero only if

    pX(aᵢ) − pY(aᵢ) = 0

for every i. It follows that the probability mass functions of X and Y are equal. As a consequence, also their distribution functions are equal.
It must be stressed that this proposition is extremely important and relevant from a practical viewpoint: in many cases where we need to prove that two distributions are equal, it is much easier to prove equality of the mgfs than to prove equality of the distribution functions.
⁶ Feller, W. (2008) An introduction to probability theory and its applications, Volume 2, Wiley.
⁷ See p. 106.
Also note that equality of the distribution functions can be replaced in the
proposition above by equality of the probability mass functions8 if X and Y are
discrete random variables, or by equality of the probability density functions9 if X
and Y are absolutely continuous random variables.
Let X be a random variable possessing an mgf MX(t), and define

    Y = a + bX

where a, b ∈ R are two constants and b ≠ 0. Then, the random variable Y possesses an mgf MY(t) and

    MY(t) = exp(at) MX(bt)
Exercise 1

Let X be a discrete random variable having a Bernoulli distribution¹² with parameter p. Its support is

    RX = {0, 1}

and its probability mass function¹³ is

    pX(x) = p      if x = 1
            1 − p  if x = 0
            0      if x ∉ RX

Derive the mgf of X, if it exists.

Solution

Using the definition of mgf, we get

    MX(t) = E[exp(tX)] = Σ_{x∈RX} exp(tx) pX(x)
          = exp(t·1) pX(1) + exp(t·0) pX(0)
          = exp(t)·p + 1·(1 − p) = 1 − p + p exp(t)

The mgf exists and it is well-defined because the above expected value exists for any t ∈ R.

¹¹ See p. 234.
¹² See p. 335.
¹³ See p. 106.
Exercise 2

Let X be a random variable with mgf

    MX(t) = (1/2)(1 + exp(t))

Derive the variance of X.

Solution

We can use the following formula for computing the variance¹⁴:

    Var[X] = E[X²] − (E[X])²

The expected value of X is computed by taking the first derivative of the mgf:

    dMX(t)/dt = (1/2) exp(t)

and evaluating it at t = 0:

    E[X] = dMX(t)/dt |_{t=0} = (1/2) exp(0) = 1/2

The second moment of X is computed by taking the second derivative of the mgf:

    d²MX(t)/dt² = (1/2) exp(t)

and evaluating it at t = 0:

    E[X²] = d²MX(t)/dt² |_{t=0} = (1/2) exp(0) = 1/2

Therefore,

    Var[X] = E[X²] − (E[X])² = 1/2 − (1/2)²
           = 1/2 − 1/4 = 1/4
Exercise 3

A random variable X is said to have a Chi-square distribution¹⁵ with n degrees of freedom if its mgf is defined for any t < 1/2 and it is equal to

    MX(t) = (1 − 2t)^(−n/2)

Define

    Y = X₁ + X₂

where X₁ and X₂ are two independent random variables having Chi-square distributions with n₁ and n₂ degrees of freedom respectively. Prove that Y has a Chi-square distribution with n₁ + n₂ degrees of freedom.

¹⁴ See p. 156.
¹⁵ See p. 387.

Solution

The mgfs of X₁ and X₂ are

    MX₁(t) = (1 − 2t)^(−n₁/2)
    MX₂(t) = (1 − 2t)^(−n₂/2)

The mgf of a sum of independent random variables is the product of the mgfs of the summands:

    MY(t) = (1 − 2t)^(−n₁/2) (1 − 2t)^(−n₂/2) = (1 − 2t)^(−(n₁+n₂)/2)

which is the mgf of a Chi-square distribution with n₁ + n₂ degrees of freedom. Since the mgf uniquely determines the distribution, Y has a Chi-square distribution with n₁ + n₂ degrees of freedom.
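The identity between the product of the two mgfs and the mgf of the sum can be confirmed symbolically:

```python
import sympy as sp

# Symbolic check that the product of two Chi-square mgfs equals the mgf of
# a Chi-square distribution with n1 + n2 degrees of freedom.
t, n1, n2 = sp.symbols("t n1 n2", positive=True)
M1 = (1 - 2 * t) ** (-n1 / 2)
M2 = (1 - 2 * t) ** (-n2 / 2)
MY = (1 - 2 * t) ** (-(n1 + n2) / 2)

print(sp.simplify(M1 * M2 - MY))   # 0
```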
Chapter 38

Joint moment generating function of a random vector

38.1 Definition

Let us start with a formal definition.

Let X be a K×1 random vector. If the expected value

    E[exp(tᵀX)]

exists and is finite for all K×1 real vectors t belonging to a closed rectangle H such that

    H = [−h₁, h₁] × [−h₂, h₂] × ... × [−h_K, h_K] ⊂ R^K

with hᵢ > 0 for all i = 1, ..., K, then we say that X possesses a joint moment generating function and the function MX : H → R defined by

    MX(t) = E[exp(tᵀX)]

is called the joint moment generating function of X.
Let X be a K×1 standard normal random vector, i.e. an absolutely continuous random vector with joint probability density function³

    fX(x) = (2π)^(−K/2) exp(−(1/2) xᵀx)

As explained in the lecture entitled Multivariate normal distribution (p. 439), the K components of X are K mutually independent⁴ standard normal random variables, because the joint probability density function of X can be written as

    fX(x) = f(x₁) f(x₂) ... f(x_K)

where xᵢ is the i-th entry of x, and f(xᵢ) is the probability density function of a standard normal random variable:

    f(xᵢ) = (2π)^(−1/2) exp(−(1/2) xᵢ²)

The joint mgf of X can be derived as follows:

    MX(t) = E[exp(tᵀX)] = E[exp(t₁X₁ + ... + t_K X_K)]
          =(A) ∏_{i=1}^K E[exp(tᵢXᵢ)]
          =(B) ∏_{i=1}^K MXᵢ(tᵢ)

where: in step A we have used the fact that the entries of X are mutually independent⁵; in step B we have used the definition of mgf of a random variable⁶. Since the mgf of a standard normal random variable is⁷

    MXᵢ(tᵢ) = exp((1/2) tᵢ²)

the joint mgf of X is

    MX(t) = ∏_{i=1}^K MXᵢ(tᵢ) = ∏_{i=1}^K exp((1/2) tᵢ²)
          = exp((1/2) Σ_{i=1}^K tᵢ²) = exp((1/2) tᵀt)

Note that the mgf MXᵢ(tᵢ) of a standard normal random variable is defined for any tᵢ ∈ R. As a consequence, the joint mgf of X is defined for any t ∈ R^K.

³ See p. 117.
⁴ See p. 233.
⁵ See p. 234.
⁶ See p. 289.
⁷ See p. 378.
38.2 Cross-moments and joint mgfs
Let X be a K×1 random vector possessing a joint mgf MX(t). If the cross-moment

    μX(n₁, n₂, ..., n_K) = E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]

exists and is finite, where n₁, n₂, ..., n_K ∈ Z₊ and n = Σ_{k=1}^K n_k, then

    μX(n₁, n₂, ..., n_K) = ∂ⁿ MX(t) / (∂t₁^{n₁} ∂t₂^{n₂} ... ∂t_K^{n_K}) |_{t₁=0, ..., t_K=0}

where the derivative on the right-hand side is an n-th order cross-partial derivative of MX(t) evaluated at the point t₁ = 0, t₂ = 0, ..., t_K = 0.

Proof. We do not provide a rigorous proof of this proposition, but see, e.g., Pfeiffer⁸ (1978) and DasGupta⁹ (2010). The intuition of the proof, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions one can differentiate through the expected value, as follows:

    ∂ⁿ MX(t) / (∂t₁^{n₁} ... ∂t_K^{n_K}) = E[X₁^{n₁} ... X_K^{n_K} exp(tᵀX)]

which, evaluated at t = 0, gives the cross-moment μX(n₁, n₂, ..., n_K).

The following example shows how the above proposition can be applied.
Example 218 Let us continue with the previous example. The joint mgf of a 2×1 standard normal random vector X is

    MX(t) = exp((1/2) tᵀt) = exp((1/2) t₁² + (1/2) t₂²)

The cross-moment μX(1, 1) = E[X₁X₂] can be computed by taking the cross-partial derivative of the joint mgf and evaluating it at zero:

    μX(1, 1) = E[X₁X₂]
             = ∂²/∂t₁∂t₂ exp((1/2) t₁² + (1/2) t₂²) |_{t₁=0, t₂=0}
             = ∂/∂t₁ [t₂ exp((1/2) t₁² + (1/2) t₂²)] |_{t₁=0, t₂=0}
             = t₁ t₂ exp((1/2) t₁² + (1/2) t₂²) |_{t₁=0, t₂=0}
             = 0

⁸ Pfeiffer, P. E. (1978) Concepts of probability theory, Courier Dover Publications.
⁹ DasGupta, A. (2010) Fundamentals of probability: a first course, Springer.
Proof. The reader may refer to Feller¹¹ (2008) for a rigorous proof. The informal proof given here is almost identical to that given for the univariate case¹². We confine our attention to the case in which X and Y are discrete random vectors taking only finitely many values. As far as the left-to-right direction of the implication is concerned, it suffices to note that if X and Y have the same distribution, then

    MX(t) = E[exp(tᵀX)] = E[exp(tᵀY)] = MY(t)

As far as the reverse direction is concerned, denote by A the union of the two supports:

    A = RX ∪ RY

and by a₁, ..., aₙ the elements of A. If X and Y have the same joint mgf, then

    MX(t) = MY(t)

for any t belonging to a closed rectangle where the two mgfs are well-defined, and

    Σ_{i=1}^n exp(tᵀaᵢ) pX(aᵢ) = Σ_{i=1}^n exp(tᵀaᵢ) pY(aᵢ)

This can be true only if

    pX(aᵢ) − pY(aᵢ) = 0

for every i. As a consequence, the joint probability mass functions of X and Y are equal, which implies that also their joint distribution functions are equal.
This proposition is used very often in applications where one needs to demonstrate that two joint distributions are equal. In such applications, proving equality of the joint mgfs is often much easier than proving equality of the joint distribution functions (see also the comments to Proposition 212).
If the K entries X₁, ..., X_K of X are mutually independent, the joint mgf factorizes into the product of the mgfs of the entries:

    MX(t) = E[exp(tᵀX)] = E[exp(t₁X₁) exp(t₂X₂) ... exp(t_K X_K)]
          =(A) ∏_{i=1}^K E[exp(tᵢXᵢ)]
          =(B) ∏_{i=1}^K MXᵢ(tᵢ)

where: in step A we have used the fact that the entries of X are mutually independent; in step B we have used the definition of mgf of a random variable.

Similarly, if X₁, ..., Xₙ are mutually independent K×1 random vectors possessing joint mgfs, the joint mgf of their sum is the product of their joint mgfs:

    M_{X₁+...+Xₙ}(t) = E[exp(tᵀ(X₁ + ... + Xₙ))] = E[exp(tᵀX₁) ... exp(tᵀXₙ)]
          =(A) ∏_{i=1}^n E[exp(tᵀXᵢ)]
          =(B) ∏_{i=1}^n MXᵢ(t)

where: in step A we have used the fact that the vectors Xᵢ are mutually independent; in step B we have used the definition of joint mgf.
Exercise 1

Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂. Let the support of X be

    RX = {[1 1]ᵀ, [2 0]ᵀ, [0 0]ᵀ}

and let each of the three points of the support have probability 1/3. Derive the joint mgf of X.

Solution

By using the definition of joint mgf, we get

    MX(t₁, t₂) = E[exp(t₁X₁ + t₂X₂)]
               = (1/3) exp(t₁ + t₂) + (1/3) exp(2t₁) + (1/3)
Exercise 2

Let

    X = [X₁ X₂]ᵀ

be a 2×1 random vector with joint mgf

    MX₁,X₂(t₁, t₂) = 1/3 + (2/3) exp(t₁ + 2t₂)

Derive the expected value of X₁.

Solution

The mgf of X₁ is

    MX₁(t₁) = MX₁,X₂(t₁, 0) = 1/3 + (2/3) exp(t₁)

The expected value of X₁ is computed by taking the first derivative of the mgf:

    dMX₁(t₁)/dt₁ = (2/3) exp(t₁)

and evaluating it at t₁ = 0:

    E[X₁] = dMX₁(t₁)/dt₁ |_{t₁=0} = (2/3) exp(0) = 2/3
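The same derivative can be taken directly on the joint mgf, which is easy to verify symbolically (E[X₂] = 4/3 follows the same way from the t₂-derivative):

```python
import sympy as sp

# Symbolic check of Exercise 2: differentiating the joint mgf
# M(t1, t2) = 1/3 + (2/3) exp(t1 + 2 t2) at the origin gives E[X1] = 2/3.
t1, t2 = sp.symbols("t1 t2")
M = sp.Rational(1, 3) + sp.Rational(2, 3) * sp.exp(t1 + 2 * t2)

EX1 = sp.diff(M, t1).subs({t1: 0, t2: 0})
EX2 = sp.diff(M, t2).subs({t1: 0, t2: 0})

print(EX1, EX2)   # 2/3 and 4/3
```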
Exercise 3

Let

    X = [X₁ X₂]ᵀ

be a 2×1 random vector with joint mgf MX₁,X₂(t₁, t₂). Derive the covariance between X₁ and X₂.

Solution

We can use the following covariance formula:

    Cov[X₁, X₂] = E[X₁X₂] − E[X₁] E[X₂]

The mgf of X₁ is MX₁(t₁) = MX₁,X₂(t₁, 0), and the mgf of X₂ is MX₂(t₂) = MX₁,X₂(0, t₂); their first derivatives, evaluated at zero, give E[X₁] and E[X₂]. The cross-moment E[X₁X₂] is obtained by taking the second-order cross-partial derivative of the joint mgf,

    ∂²MX₁,X₂(t₁, t₂)/∂t₁∂t₂ = (1/3)[2 exp(t₁ + 2t₂) + 2 exp(2t₁ + t₂)]

and evaluating it at (t₁, t₂) = (0, 0).
Chapter 39

Characteristic function of a random variable
In the lecture entitled Moment generating function (p. 289), we have explained
that the distribution of a random variable can be characterized in terms of its
moment generating function, a real function that enjoys two important properties:
it uniquely determines its associated probability distribution, and its derivatives
at zero are equal to the moments of the random variable. We have also explained
that not all random variables possess a moment generating function.
The characteristic function (cf) enjoys properties that are almost identical to
those enjoyed by the moment generating function, but it has an important advan-
tage: all random variables possess a characteristic function.
39.1 Definition

We start this lecture by giving a definition of characteristic function.
Definition 223 Let X be a random variable. Let i = √(-1) be the imaginary unit. The function φ : R → C defined by

φ_X(t) = E[exp(itX)]

is called the characteristic function of X.

The first thing to be noted is that the characteristic function φ_X(t) exists for any t. This can be proved as follows:

φ_X(t) = E[exp(itX)] = E[cos(tX)] + i E[sin(tX)]

and the last two expected values are well-defined, because the sine and cosine functions are bounded in the interval [-1, 1].
Proposition 224 Let X be a random variable and φ_X(t) its characteristic function. Let n ∈ N. If the n-th moment of X, denoted by μ_X(n), exists and is finite, then φ_X(t) is n times continuously differentiable and

μ_X(n) = E[X^n] = (1/i^n) d^n φ_X(t)/dt^n |_{t=0}

where d^n φ_X(t)/dt^n |_{t=0} is the n-th derivative of φ_X(t) with respect to t, evaluated at the point t = 0.
Proof. The proof of the above proposition is quite complex (see, e.g., Resnick¹ 1999). The intuition, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions one can differentiate through the expected value, as follows:

d^n φ_X(t)/dt^n = d^n/dt^n E[exp(itX)]
= E[d^n/dt^n exp(itX)]
= E[(iX)^n exp(itX)]
= i^n E[X^n exp(itX)]

Evaluating this derivative at t = 0, we obtain

d^n φ_X(t)/dt^n |_{t=0} = i^n E[X^n exp(0·iX)] = i^n E[X^n] = i^n μ_X(n)
In practice, the proposition above is not very useful when one wants to compute a moment of a random variable, because it requires one to know in advance whether the moment exists. A much more useful statement is provided by the next proposition.

Proposition 225 Let X be a random variable and φ_X(t) its characteristic function. If φ_X(t) is n times differentiable at the point t = 0, then

1. if n is even, the k-th moment of X exists and is finite for any k ≤ n;

2. if n is odd, the k-th moment of X exists and is finite for any k < n.

In both cases,

μ_X(k) = E[X^k] = (1/i^k) d^k φ_X(t)/dt^k |_{t=0}

As an illustration, consider an exponential random variable X with parameter λ and probability density function

f_X(x) = { λ exp(-λx) if x ∈ R_X ; 0 if x ∉ R_X }

Its characteristic function is φ_X(t) = λ/(λ - it), whose second derivative is

d²φ_X(t)/dt² = -2λ/(λ - it)³

Evaluating it at t = 0, we obtain

d²φ_X(t)/dt² |_{t=0} = -2λ/λ³ = -2/λ²
Proof. For a formal proof, see, e.g., Resnick⁴ (1999). An informal proof for the special case in which X and Y have a finite support can be provided along the same lines of the proof of Proposition 212, which concerns the moment generating function. This is left as an exercise (just replace exp(tX) and exp(tY) in that proof with exp(itX) and exp(itY)).
This property is analogous to the property of joint moment generating functions
stated in Proposition 212. The same comments we made about that proposition
also apply to this one.
Proposition 228 Let X be a random variable with characteristic function φ_X(t). Define

Y = a + bX
where a, b ∈ R are two constants and b ≠ 0. Then, the characteristic function of Y is

φ_Y(t) = exp(iat) φ_X(bt)
39.4.2 Cf of a sum
The next proposition shows how to derive the characteristic function of a sum of
independent random variables.
Exercise 1

Let X be a discrete random variable having support

R_X = {0, 1, 2}

Solution

By using the definition of characteristic function, we obtain

φ_X(t) = E[exp(itX)] = Σ_{x∈R_X} exp(itx) p_X(x)
= exp(it·0) p_X(0) + exp(it·1) p_X(1) + exp(it·2) p_X(2)
= 1/3 + (1/3) exp(it) + (1/3) exp(2it) = (1/3)[1 + exp(it) + exp(2it)]
Exercise 2
Use the characteristic function found in the previous exercise to derive the variance
of X.
Solution
We can use the following formula for computing the variance:

Var[X] = E[X²] - E[X]²

The expected value of X is computed by taking the first derivative of the characteristic function:

dφ_X(t)/dt = (1/3)[i exp(it) + 2i exp(2it)]

evaluating it at t = 0, and dividing it by i:

E[X] = (1/i) dφ_X(t)/dt |_{t=0} = (1/i)(1/3)[i exp(i·0) + 2i exp(2i·0)] = 1

The second moment is computed similarly, from the second derivative:

E[X²] = (1/i²) d²φ_X(t)/dt² |_{t=0} = (1/i²)(1/3)[i² exp(i·0) + 4i² exp(2i·0)] = 5/3

Therefore,

Var[X] = E[X²] - E[X]² = 5/3 - 1² = 2/3
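As a quick numerical cross-check (a sketch, not part of the original text: the helper name `cf` and the finite-difference step `h` are illustrative choices), the moments obtained above can be recovered by numerically differentiating the characteristic function at t = 0:

```python
import cmath

def cf(t):
    # characteristic function found in Exercise 1: (1/3)[1 + exp(it) + exp(2it)]
    return (1 + cmath.exp(1j * t) + cmath.exp(2j * t)) / 3

h = 1e-4
# E[X] = (1/i) * first derivative at t = 0, via a central difference
ex = ((cf(h) - cf(-h)) / (2 * h) / 1j).real
# E[X^2] = (1/i^2) * second derivative at t = 0, i.e. minus the second derivative
ex2 = -((cf(h) - 2 * cf(0) + cf(-h)) / h ** 2).real
var = ex2 - ex ** 2  # should be close to 2/3
```

The finite differences agree with the exact values E[X] = 1, E[X²] = 5/3, and Var[X] = 2/3 up to discretization error.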
Exercise 3
Read and try to understand how the characteristic functions of the uniform and
exponential distributions are derived in the lectures entitled Uniform distribution
(p. 359) and Exponential distribution (p. 365).
Chapter 40
Characteristic function of a
random vector
This lecture introduces the notion of joint characteristic function (joint cf) of a
random vector, which is a multivariate generalization of the concept of character-
istic function of a random variable. Before reading this lecture, you are advised to
first read the lecture entitled Characteristic function (p. 307).
40.1 Definition
Let us start this lecture with a de…nition.
Definition 230 Let X be a K×1 random vector. Let i = √(-1) be the imaginary unit. The function φ : R^K → C defined by

φ_X(t) = E[exp(it^T X)] = E[exp(i Σ_{j=1}^K t_j X_j)]

is called the joint characteristic function of X.
Cross-moments can be computed from the joint characteristic function, as stated by the following proposition.
Proposition 231 Let X be a random vector and φ_X(t) its joint characteristic function. Let n ∈ N. Define a cross-moment of order n as follows:

μ_X(n_1, n_2, ..., n_K) = E[X_1^{n_1} X_2^{n_2} ... X_K^{n_K}]

where n_1, n_2, ..., n_K ∈ Z_+ and

n = Σ_{k=1}^K n_k

If all cross-moments of order n exist and are finite, then all the n-th order partial derivatives of φ_X(t) exist and

μ_X(n_1, n_2, ..., n_K) = (1/i^n) ∂^n φ_X(t) / (∂t_1^{n_1} ∂t_2^{n_2} ... ∂t_K^{n_K}) |_{t=0}

where the partial derivative on the right-hand side of the equation is evaluated at the point t_1 = 0, t_2 = 0, ..., t_K = 0.
Proposition 232 Let X be a random vector and φ_X(t) its joint characteristic function. If all the n-th order partial derivatives of φ_X(t) exist, then:

1. if n is even, all cross-moments of order k ≤ n exist and are finite;

2. if n is odd, all cross-moments of order k < n exist and are finite.

In both cases,

μ_X(n_1, n_2, ..., n_K) = (1/i^k) ∂^k φ_X(t) / (∂t_1^{n_1} ... ∂t_K^{n_K}) |_{t=0}

where k = n_1 + ... + n_K.
Proof. See Ushakov (1999). An informal proof for the special case in which X and Y have a finite support can be provided along the same lines of the proof of Proposition 219, which concerns the joint moment generating function. This is left as an exercise (just replace exp(t^T X) and exp(t^T Y) in that proof with exp(it^T X) and exp(it^T Y)).

This property is analogous to the property of joint moment generating functions stated in Proposition 219. The same comments we made about that proposition also apply to this one.
4 See p. 118.
φ_X(t_1, ..., t_K) = Π_{j=1}^K φ_{X_j}(t_j)

where: in step A we have used the fact that the entries of X are mutually independent⁵; in step B we have used the definition of characteristic function of a random variable⁶.
Then, the joint characteristic function of Z is the product of the joint characteristic functions of X_1, ..., X_n:

φ_Z(t) = Π_{j=1}^n φ_{X_j}(t)
5 In particular, see the mutual independence via expectations property (p. 234).
6 See p. 307.
where: in step A we have used the fact that the vectors X_j are mutually independent; in step B we have used the definition of joint characteristic function of a random vector given above.
Exercise 1
Let Z_1 and Z_2 be two independent standard normal random variables⁷. Let X be a 2×1 random vector whose components are defined as follows:

X_1 = Z_1²
X_2 = Z_1² + Z_2²
Solution
By using the definition of characteristic function, we get

where: in step A we have used the fact that Z_1 and Z_2 are independent; in step B we have used the definition of characteristic function.
Exercise 2
Use the joint characteristic function found in the previous exercise to derive the
expected value and the covariance matrix of X.
Solution
We need to compute the partial derivatives of the joint characteristic function:

∂φ/∂t_1 = -(1/2) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2} (-2i - 4t_2)

∂φ/∂t_2 = -(1/2) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2} (-4i - 4t_1 - 8t_2)

∂²φ/∂t_1² = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-2i - 4t_2)²

∂²φ/∂t_2² = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-4i - 4t_1 - 8t_2)² + 4 (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2}

∂²φ/∂t_1∂t_2 = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-2i - 4t_2)(-4i - 4t_1 - 8t_2) + 2 (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2}

All partial derivatives up to the second order exist and are well defined. As a consequence, all cross-moments up to the second order exist and are finite, and they can be computed from the above partial derivatives:

E[X_1] = (1/i) ∂φ/∂t_1 |_{t_1=0, t_2=0} = (1/i) i = 1

E[X_2] = (1/i) ∂φ/∂t_2 |_{t_1=0, t_2=0} = (1/i) 2i = 2

E[X_1²] = (1/i²) ∂²φ/∂t_1² |_{t_1=0, t_2=0} = (1/i²) 3i² = 3

E[X_2²] = (1/i²) ∂²φ/∂t_2² |_{t_1=0, t_2=0} = (1/i²) (12i² + 4) = 8

E[X_1X_2] = (1/i²) ∂²φ/∂t_1∂t_2 |_{t_1=0, t_2=0} = (1/i²) (6i² + 2) = 4
Exercise 3
Read and try to understand how the joint characteristic function of the multinomial
distribution is derived in the lecture entitled Multinomial distribution (p. 431).
322 CHAPTER 40. JOINT CF
Chapter 41
Sums of independent
random variables
This lecture discusses how to derive the distribution of the sum of two independent random variables¹. We explain first how to derive the distribution function² of the sum and then how to derive its probability mass function³ (if the summands are discrete) or its probability density function⁴ (if the summands are continuous).
Proposition 237 Let X and Y be two independent random variables and denote by F_X(x) and F_Y(y) their respective distribution functions. Let

Z = X + Y

and denote by F_Z(z) the distribution function of Z. The following holds:

F_Z(z) = E[F_X(z - Y)]

or

F_Z(z) = E[F_Y(z - X)]

Proof. The first formula is derived as follows:

F_Z(z) = P(Z ≤ z)   (A)
= P(X + Y ≤ z)
= P(X ≤ z - Y)
1 See p. 229.
2 See p. 108.
3 See p. 106.
4 See p. 107.
= E[P(X ≤ z - Y | Y = y)]   (B)
= E[F_X(z - Y)]   (C)
Let X be a uniform random variable with support

R_X = [0, 1]

and probability density function

f_X(x) = { 1 if x ∈ R_X ; 0 otherwise }

and let Y be another uniform random variable, independent of X, with support

R_Y = [0, 1]

and probability density function

f_Y(y) = { 1 if y ∈ R_Y ; 0 otherwise }

The distribution function of Z = X + Y is

F_Z(z) = E[F_X(z - Y)] = ∫_{-∞}^{+∞} F_X(z - y) f_Y(y) dy = ∫_0^1 F_X(z - y) dy

Performing the change of variable t = z - y and exchanging the bounds of integration, we obtain

F_Z(z) = ∫_{z-1}^{z} F_X(t) dt

There are four cases to consider:
1. If z ≤ 0, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^z 0 dt = 0

2. If 0 < z ≤ 1, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^0 F_X(t) dt + ∫_0^z F_X(t) dt
= ∫_{z-1}^0 0 dt + ∫_0^z t dt = 0 + [t²/2]_0^z = z²/2

3. If 1 < z ≤ 2, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^1 F_X(t) dt + ∫_1^z F_X(t) dt
= ∫_{z-1}^1 t dt + ∫_1^z 1 dt = [t²/2]_{z-1}^1 + [t]_1^z
= 1/2 - (1/2)(z - 1)² + z - 1
= 1/2 - (1/2)(z² - 2z + 1) + z - 1
= -z²/2 + 2z - 1

4. If z > 2, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^z 1 dt = [t]_{z-1}^z = z - (z - 1) = 1
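The four-branch distribution function derived above can be verified empirically; the following sketch (the sample size, seed, and evaluation points are arbitrary choices) compares it with a Monte Carlo estimate:

```python
import random

def F_Z(z):
    # piecewise CDF of the sum of two independent Uniform[0, 1] variables
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return -z * z / 2 + 2 * z - 1
    return 1.0

random.seed(1)
n = 200_000
emp = {}
for z in (0.5, 1.0, 1.5):
    # empirical frequency of {X + Y <= z} should be close to F_Z(z)
    emp[z] = sum(random.random() + random.random() <= z for _ in range(n)) / n
```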
Proposition 239 Let X and Y be two independent discrete random variables and denote by p_X(x) and p_Y(y) their respective probability mass functions and by R_X and R_Y their supports. Let

Z = X + Y

and denote the probability mass function of Z by p_Z(z). The following holds:

p_Z(z) = Σ_{y∈R_Y} p_X(z - y) p_Y(y)

or

p_Z(z) = Σ_{x∈R_X} p_Y(z - x) p_X(x)

Proof. The first formula is derived as follows:

p_Z(z) = P(Z = z)   (A)
= P(X + Y = z)
= P(X = z - Y)
= E[P(X = z - Y | Y = y)]   (B)
= E[p_X(z - Y)]   (C)
= Σ_{y∈R_Y} p_X(z - y) p_Y(y)   (D)

where: in step A we have used the definition of probability mass function; in step B we have used the law of iterated expectations; in step C we have used the fact that X and Y are independent; in step D we have used the definition of expected value. The second formula is symmetric to the first.
The two summations above are called convolutions (of two probability mass functions).

Let X be a discrete random variable with support

R_X = {0, 1}

and probability mass function

p_X(x) = { 1/2 if x ∈ R_X ; 0 otherwise }

and let Y be another discrete random variable, independent of X, with support

R_Y = {0, 1}

and probability mass function

p_Y(y) = { 1/2 if y ∈ R_Y ; 0 otherwise }

Define

Z = X + Y

Its support is

R_Z = {0, 1, 2}

The probability mass function of Z, evaluated at z = 0, is

p_Z(0) = Σ_{y∈R_Y} p_X(0 - y) p_Y(y) = p_X(0 - 0) p_Y(0) + p_X(0 - 1) p_Y(1)
= (1/2)(1/2) + 0·(1/2) = 1/4

Evaluated at z = 1, it is

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y) = p_X(1 - 0) p_Y(0) + p_X(1 - 1) p_Y(1)
= (1/2)(1/2) + (1/2)(1/2) = 1/2

Evaluated at z = 2, it is

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y) = p_X(2 - 0) p_Y(0) + p_X(2 - 1) p_Y(1)
= 0·(1/2) + (1/2)(1/2) = 1/4

Therefore, the probability mass function of Z is

p_Z(z) = { 1/4 if z = 0 ; 1/2 if z = 1 ; 1/4 if z = 2 ; 0 otherwise }
or

f_Z(z) = ∫_{-∞}^{+∞} f_Y(z - x) f_X(x) dx

The first formula is obtained by differentiating the distribution function of the sum:

f_Z(z) = (d/dz) E[F_X(z - Y)]
= E[(d/dz) F_X(z - Y)]   (A)
= E[f_X(z - Y)]
= ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy   (B)

⁸ See p. 109.
Let X be an exponential random variable with support

R_X = [0, ∞)

and probability density function

f_X(x) = { exp(-x) if x ∈ R_X ; 0 otherwise }

and let Y be another exponential random variable, independent of X, with support

R_Y = [0, ∞)

and probability density function

f_Y(y) = { exp(-y) if y ∈ R_Y ; 0 otherwise }

Define

Z = X + Y

The support of Z is

R_Z = [0, ∞)

When z ∈ R_Z, the probability density function of Z is

f_Z(z) = ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy = ∫_0^{+∞} f_X(z - y) exp(-y) dy
= ∫_0^{+∞} exp(-(z - y)) 1{z - y ≥ 0} exp(-y) dy
= ∫_0^{+∞} exp(-z + y) 1{y ≤ z} exp(-y) dy
= ∫_0^z exp(-z + y) exp(-y) dy = exp(-z) ∫_0^z dy = z exp(-z)

Therefore,

f_Z(z) = { z exp(-z) if z ∈ R_Z ; 0 otherwise }
9 See p. 365.
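The closed form z·exp(-z) can be checked by evaluating the convolution integral numerically (a sketch; the midpoint rule and the grid size are arbitrary choices):

```python
import math

def f_exp(x):
    # standard exponential density (rate parameter 1)
    return math.exp(-x) if x >= 0 else 0.0

def f_z_numeric(z, n=10_000):
    # midpoint-rule approximation of the convolution integral
    # of f_X(z - y) * f_Y(y) dy over [0, z]
    if z <= 0:
        return 0.0
    h = z / n
    return sum(f_exp(z - (k + 0.5) * h) * f_exp((k + 0.5) * h)
               for k in range(n)) * h

for z in (0.5, 1.0, 2.0):
    assert abs(f_z_numeric(z) - z * math.exp(-z)) < 1e-6
```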
Let X_1, ..., X_n be n mutually independent random variables and let

Z = X_1 + ... + X_n

The distribution of Z can be derived recursively, using the results for sums of two random variables given above:

1. first, define

Y_2 = X_1 + X_2

and compute the distribution of Y_2;

2. then, define

Y_3 = Y_2 + X_3

and compute the distribution of Y_3;

3. and so on, until the distribution of Z can be computed from

Z = Y_n = Y_{n-1} + X_n
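The recursion can be sketched in code: repeatedly convolving pairwise pmfs yields the pmf of the full sum (the helper and the fair-coin example are illustrative assumptions, not part of the original text):

```python
from functools import reduce

def convolve_pmf(p_x, p_y):
    # pmf of the sum of two independent discrete variables (dicts value -> prob)
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0.0) + px * py
    return p_z

# Z = X1 + X2 + X3 for three independent fair coin flips:
coin = {0: 0.5, 1: 0.5}
p_z = reduce(convolve_pmf, [coin, coin, coin])
# Binomial(3, 1/2): {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```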
Exercise 1

Let X be a uniform random variable with support

R_X = [0, 1]

and let Y be an exponential random variable with parameter 1, independent of X, with support

R_Y = [0, ∞)

Derive the probability density function of

Z = X + Y

¹⁰ See p. 233.
Solution

The support of Z is

R_Z = [0, ∞)

When z ∈ R_Z, the probability density function of Z is

f_Z(z) = ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy = ∫_0^{+∞} f_X(z - y) exp(-y) dy
= ∫_0^{+∞} 1{0 ≤ z - y ≤ 1} exp(-y) dy = ∫_0^{+∞} 1{z - 1 ≤ y ≤ z} exp(-y) dy
= ∫_{max(0, z-1)}^{z} exp(-y) dy = [-exp(-y)]_{max(0, z-1)}^{z}

For 0 ≤ z ≤ 1 the lower bound is 0 and the integral equals 1 - exp(-z); for z > 1 the lower bound is z - 1 and the integral equals exp(1 - z) - exp(-z). Therefore,

f_Z(z) = { 1 - exp(-z) if 0 ≤ z ≤ 1 ; exp(1 - z) - exp(-z) if z > 1 ; 0 otherwise }
Exercise 2
Let X be a discrete random variable with support

R_X = {0, 1, 2}

and probability mass function

p_X(x) = { 1/3 if x ∈ R_X ; 0 otherwise }

and let Y be another discrete random variable, independent of X, with support

R_Y = {1, 2}

and probability mass function

p_Y(y) = { y/3 if y ∈ R_Y ; 0 otherwise }

Derive the probability mass function of

Z = X + Y

Solution

The support of Z is:

R_Z = {1, 2, 3, 4}

The probability mass function of Z, evaluated at z = 1, is:

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y) = p_X(1 - 1) p_Y(1) + p_X(1 - 2) p_Y(2)
= (1/3)(1/3) + 0·(2/3) = 1/9
Evaluated at z = 2, it is:

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y) = p_X(2 - 1) p_Y(1) + p_X(2 - 2) p_Y(2)
= (1/3)(1/3) + (1/3)(2/3) = 1/3

Evaluated at z = 3, it is:

p_Z(3) = Σ_{y∈R_Y} p_X(3 - y) p_Y(y) = p_X(3 - 1) p_Y(1) + p_X(3 - 2) p_Y(2)
= (1/3)(1/3) + (1/3)(2/3) = 1/3

Evaluated at z = 4, it is:

p_Z(4) = Σ_{y∈R_Y} p_X(4 - y) p_Y(y) = p_X(4 - 1) p_Y(1) + p_X(4 - 2) p_Y(2)
= 0·(1/3) + (1/3)(2/3) = 2/9

Therefore, the probability mass function of Z is:

p_Z(z) = { 1/9 if z = 1 ; 1/3 if z = 2 ; 1/3 if z = 3 ; 2/9 if z = 4 ; 0 otherwise }
Part IV

Probability distributions
Chapter 42
Bernoulli distribution
Suppose you perform an experiment with two possible outcomes: either success or failure. Success happens with probability p, while failure happens with probability 1 - p. A random variable that takes value 1 in case of success and 0 in case of failure is called a Bernoulli random variable (alternatively, it is said to have a Bernoulli distribution).
42.1 Definition
Bernoulli random variables are characterized as follows. Let X be a discrete random variable with support

R_X = {0, 1}

Let p ∈ (0, 1). We say that X has a Bernoulli distribution with parameter p if its probability mass function¹ is

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 if x ∉ R_X }
Note that, by the above definition, any indicator function² is a Bernoulli random variable.

The following is a proof that p_X(x) is a legitimate probability mass function³:

Proof. Non-negativity is obvious. We need to prove that the sum of p_X(x) over its support equals 1. This is proved as follows:

Σ_{x∈R_X} p_X(x) = p_X(1) + p_X(0) = p + (1 - p) = 1
1 See p. 106.
2 See p. 197.
3 See p. 247.
42.2 Expected value

The expected value of a Bernoulli random variable X is

E[X] = p
42.3 Variance
The variance of a Bernoulli random variable X is

Var[X] = p(1 - p)

Proof. It can be derived thanks to the usual formula for computing the variance⁴:

E[X²] = Σ_{x∈R_X} x² p_X(x) = 1²·p_X(1) + 0²·p_X(0) = 1·p + 0·(1 - p) = p

E[X]² = p²

Var[X] = E[X²] - E[X]² = p - p² = p(1 - p)
The distribution function of X can be derived from the definition

F_X(x) = P(X ≤ x)

and the fact that X can take either value 0 or value 1. If x < 0, then P(X ≤ x) = 0, because X cannot take values strictly smaller than 0. If 0 ≤ x < 1, then P(X ≤ x) = 1 - p, because 0 is the only value strictly smaller than 1 that X can take. Finally, if x ≥ 1, then P(X ≤ x) = 1, because all values X can take are smaller than or equal to 1.
Exercise 1
Let X and Y be two independent Bernoulli random variables with parameter p.
Derive the probability mass function of their sum:
Z =X +Y
Solution
The probability mass function of X is

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 otherwise }

The probability mass function of Y is

p_Y(y) = { p if y = 1 ; 1 - p if y = 0 ; 0 otherwise }

The support of Z (the set of values Z can take) is

R_Z = {0, 1, 2}

The formula for the probability mass function of a sum of two independent variables is⁵

p_Z(z) = Σ_{y∈R_Y} p_X(z - y) p_Y(y)

where R_Y is the support of Y. When z = 0, the formula gives:

p_Z(0) = Σ_{y∈R_Y} p_X(-y) p_Y(y)
= p_X(-0) p_Y(0) + p_X(-1) p_Y(1)
= (1 - p)(1 - p) + 0·p = (1 - p)²

When z = 1, the formula gives:

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y)
= p_X(1 - 0) p_Y(0) + p_X(1 - 1) p_Y(1)
= p(1 - p) + (1 - p)p = 2p(1 - p)

When z = 2, the formula gives:

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y)
= p_X(2 - 0) p_Y(0) + p_X(2 - 1) p_Y(1)
= 0·(1 - p) + p·p = p²

Therefore, the probability mass function of Z is

p_Z(z) = { (1 - p)² if z = 0 ; 2p(1 - p) if z = 1 ; p² if z = 2 ; 0 otherwise }
5 See p. 325.
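The pmf derived above is exactly that of a binomial distribution with parameters 2 and p; a quick check for one illustrative value of p (p = 0.3 is an arbitrary choice):

```python
from math import comb

p = 0.3  # arbitrary illustrative parameter
p_z = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

# compare with the binomial pmf C(2, z) p^z (1 - p)^(2 - z)
for z in (0, 1, 2):
    assert abs(p_z[z] - comb(2, z) * p ** z * (1 - p) ** (2 - z)) < 1e-12
assert abs(sum(p_z.values()) - 1.0) < 1e-12
```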
Exercise 2
Let X be a Bernoulli random variable with parameter p = 1=2. Find its tenth
moment.
Solution
The moment generating function of X is

M_X(t) = 1/2 + (1/2) exp(t)

The tenth moment of X is equal to the tenth derivative of its moment generating function⁶, evaluated at t = 0:

μ_X(10) = E[X¹⁰] = d¹⁰M_X(t)/dt¹⁰ |_{t=0}

But

dM_X(t)/dt = (1/2) exp(t)

d²M_X(t)/dt² = (1/2) exp(t)

...

d¹⁰M_X(t)/dt¹⁰ = (1/2) exp(t)

so that:

μ_X(10) = d¹⁰M_X(t)/dt¹⁰ |_{t=0} = (1/2) exp(0) = 1/2
6 See p. 290.
Chapter 43
Binomial distribution
43.1 Definition
The binomial distribution is characterized as follows.
Definition 244 Let X be a discrete random variable. Let n ∈ N and p ∈ (0, 1). Let the support of X be²

R_X = {0, 1, ..., n}

We say that X has a binomial distribution with parameters n and p if its probability mass function³ is

p_X(x) = { C(n, x) p^x (1 - p)^{n-x} if x ∈ R_X ; 0 if x ∉ R_X }

where C(n, x) = n! / (x!(n - x)!) is the binomial coefficient⁴.
Proof. Non-negativity is obvious. We need to prove that the sum of p_X(x) over the support of X equals 1. This is proved as follows:

Σ_{x∈R_X} p_X(x) = Σ_{x=0}^n C(n, x) p^x (1 - p)^{n-x} = [p + (1 - p)]^n = 1^n = 1

where we have used the binomial theorem:

(a + b)^n = Σ_{x=0}^n C(n, x) a^x b^{n-x}
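A numerical illustration of this normalization (the particular n and p are arbitrary illustrative choices):

```python
from math import comb

n, p = 10, 0.3  # arbitrary illustrative parameters
total = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1))
# by the binomial theorem, total equals [p + (1 - p)]^n = 1
```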
Proof. We demonstrate that the two distributions are equivalent by showing that they have the same probability mass function. The probability mass function of a binomial distribution with parameters n and p, with n = 1, is

p_X(x) = { C(1, x) p^x (1 - p)^{1-x} if x ∈ {0, 1} ; 0 if x ∉ {0, 1} }

but

p_X(0) = C(1, 0) p⁰ (1 - p)^{1-0} = (1!/(0!1!)) (1 - p) = 1 - p

and

p_X(1) = C(1, 1) p¹ (1 - p)^{1-1} = (1!/(1!0!)) p = p

Therefore, the probability mass function can be written as

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 otherwise }
Y_n = X_1 + X_2 + ... + X_n

Y_n = Y_{n-1} + X_n

The probability mass function of Y_n is derived by convolution⁷:

p_{Y_n}(y_n) = Σ_{y_{n-1}∈R_{Y_{n-1}}} p_{X_n}(y_n - y_{n-1}) p_{Y_{n-1}}(y_{n-1})

= Σ_{y_{n-1}∈R_{Y_{n-1}}} I((y_n - y_{n-1}) ∈ {0, 1}) p^{y_n - y_{n-1}} (1 - p)^{1 - y_n + y_{n-1}} C(n-1, y_{n-1}) p^{y_{n-1}} (1 - p)^{n-1-y_{n-1}}

= Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) p^{y_n} (1 - p)^{n - y_n}

= p^{y_n} (1 - p)^{n - y_n} Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1})

If 1 ≤ y_n ≤ n - 1, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, y_n - 1) + C(n-1, y_n) = C(n, y_n)

where the last equality is the recursive formula for binomial coefficients⁸. If y_n = 0, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, 0) = 1 = C(n, 0) = C(n, y_n)

⁷ See p. 326.
⁸ See p. 25.

Finally, if y_n = n, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, n-1) = 1 = C(n, n) = C(n, y_n)

and

p_{Y_n}(y_n) = { C(n, y_n) p^{y_n} (1 - p)^{n - y_n} if y_n ∈ R_{Y_n} ; 0 otherwise }

which is the probability mass function of a binomial random variable with parameters n and p. This completes the proof.
43.3 Expected value

The expected value of a binomial random variable X is

E[X] = np

Proof.

E[X] = E[Σ_{i=1}^n Y_i]   (A)
= Σ_{i=1}^n E[Y_i]   (B)
= Σ_{i=1}^n p   (C)
= np

where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y_1, ..., Y_n; in step B we have used the linearity of the expected value; in step C we have used the formula for the expected value of a Bernoulli random variable⁹.
43.4 Variance

The variance of a binomial random variable X is

Var[X] = np(1 - p)
9 See p. 336.
where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y_1, ..., Y_n; in step B we have used the formula for the variance of the sum of jointly independent random variables; in step C we have used the formula for the variance of a Bernoulli random variable¹⁰.
φ_X(t) = E[exp(itX)]   (A)
= E[exp(it(Y_1 + ... + Y_n))]   (B)
= E[exp(itY_1) ... exp(itY_n)]
= E[exp(itY_1)] ... E[exp(itY_n)]   (C)
= φ_{Y_1}(t) ... φ_{Y_n}(t)   (D)
= (1 - p + p exp(it)) ... (1 - p + p exp(it))   (E)
= (1 - p + p exp(it))^n
where ⌊x⌋ is the floor of x, i.e. the largest integer not greater than x.

Proof. For x < 0, F_X(x) = 0, because X cannot be smaller than 0. For x > n, F_X(x) = 1, because X is always smaller than or equal to n. For 0 ≤ x ≤ n:

F_X(x) = P(X ≤ x)   (A)
= Σ_{s=0}^{⌊x⌋} P(X = s)   (B)
= Σ_{s=0}^{⌊x⌋} p_X(s)   (C)
= Σ_{s=0}^{⌊x⌋} C(n, s) p^s (1 - p)^{n-s}

¹² See p. 337.
binocdf(x,n,p)
returns the value of the distribution function at the point x when the parameters
of the distribution are n and p.
Exercise 1
Suppose you independently flip a coin 4 times and the outcome of each toss can be either heads (with probability 1/2) or tails (also with probability 1/2). What is the probability of obtaining exactly 2 tails?

Solution

Denote by X the number of times the outcome is tails (out of the 4 tosses). X has a binomial distribution with parameters n = 4 and p = 1/2. The probability of obtaining exactly 2 tails can be computed from the probability mass function of X as follows:

p_X(2) = C(4, 2) p² (1 - p)^{4-2} = C(4, 2) (1/2)² (1 - 1/2)²
= (4!/(2!2!)) (1/4)(1/4) = ((4·3·2·1)/((2·1)(2·1))) (1/16) = 6/16 = 3/8
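The same computation in code (a sketch using Python's math.comb in place of the by-hand factorials):

```python
from math import comb

# P(exactly 2 tails in 4 fair flips) = C(4, 2) * (1/2)^2 * (1/2)^2
prob = comb(4, 2) * 0.5 ** 2 * 0.5 ** 2
# C(4, 2) = 6, so prob = 6/16 = 0.375 = 3/8
```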
Exercise 2
Suppose you independently throw a dart 10 times. Each time you throw a dart,
the probability of hitting the target is 3=4. What is the probability of hitting the
target less than 5 times (out of the 10 total times you throw a dart)?
Solution
Denote by X the number of times you hit the target. X has a binomial distribution
with parameters n = 10 and p = 3=4. The probability of hitting the target less
than 5 times can be computed from the distribution function of X as follows:
P(X < 5) = P(X ≤ 4) = F_X(4)
= Σ_{s=0}^4 C(n, s) p^s (1 - p)^{n-s}
= Σ_{s=0}^4 C(10, s) (3/4)^s (1/4)^{10-s} ≈ 0.0197

where F_X is the distribution function of X and the value of F_X(4) can be calculated with a computer algorithm, for example with the MATLAB command

binocdf(4,10,3/4)
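An equivalent computation without MATLAB (a sketch; it evaluates the same partial sum of the binomial pmf directly):

```python
from math import comb

n, p = 10, 0.75
# F_X(4) = sum of the binomial pmf over s = 0, ..., 4
cdf_at_4 = sum(comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(5))
# cdf_at_4 is approximately 0.0197, matching binocdf(4,10,3/4)
```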
Chapter 44
Poisson distribution
[Figure: the number of phone calls received, plotted as a function of time (0 to 90 minutes); the counts follow a Poisson distribution, while the waiting times between calls follow an exponential distribution.]

The concept is illustrated by the plot above, where the number of phone calls received is plotted as a function of time. The graph of the function makes an upward jump each time a phone call arrives. The time elapsed between two successive phone calls is equal to the length of each horizontal segment and it has an exponential distribution¹.
1 See p. 365.
44.1 Definition
The Poisson distribution is characterized as follows:
Definition 247 Let X be a discrete random variable. Let its support be the set of non-negative integer numbers:

R_X = Z_+

Let λ ∈ (0, ∞). We say that X has a Poisson distribution with parameter λ if its probability mass function² is

p_X(x) = { exp(-λ) λ^x / x! if x ∈ R_X ; 0 if x ∉ R_X }
Proposition 248 The number of occurrences of an event within a unit of time has a Poisson distribution with parameter λ if and only if the time elapsed between two successive occurrences of the event has an exponential distribution with parameter λ and it is independent of previous occurrences.

Proof. Denote by τ_1, τ_2, ..., τ_x the times elapsed between successive occurrences, and consider the probability

P(τ_1 + ... + τ_x ≤ 1)
2 See p. 106.
3 See p. 10.
Z = τ_1 + ... + τ_x

Since the sum of independent exponential random variables⁴ with common parameter λ is a Gamma random variable⁵ with parameters 2x and x/λ, then Z is a Gamma random variable with parameters 2x and x/λ, i.e. its probability density function is

f_Z(z) = { c z^{x-1} exp(-λz) if z ∈ [0, ∞) ; 0 if z ∉ [0, ∞) }

where

c = λ^x / Γ(x) = λ^x / (x - 1)!

and the last equality stems from the fact that we are considering only integer values of x. We need to integrate the density function to compute the probability that Z is less than 1:

P(τ_1 + ... + τ_x ≤ 1) = P(Z ≤ 1)
= ∫_{-∞}^1 f_Z(z) dz
= ∫_0^1 c z^{x-1} exp(-λz) dz
= c ∫_0^1 z^{x-1} exp(-λz) dz

⁴ See p. 372.
⁵ See p. 397.
Integrating repeatedly by parts, we obtain

∫_0^1 z^{x-1} exp(-λz) dz = -Σ_{i=1}^{x-1} ((x-1)!/((x-i)! λ^i)) exp(-λ) + ((x-1)!/λ^{x-1}) [-(1/λ) exp(-λz)]_0^1

= -Σ_{i=1}^{x-1} ((x-1)!/((x-i)! λ^i)) exp(-λ) - ((x-1)!/λ^x) exp(-λ) + (x-1)!/λ^x

Multiplying by c, we obtain:

c ∫_0^1 z^{x-1} exp(-λz) dz = (λ^x/(x-1)!) ∫_0^1 z^{x-1} exp(-λz) dz

= -Σ_{i=1}^{x-1} (λ^{x-i}/(x-i)!) exp(-λ) - exp(-λ) + 1

= 1 - Σ_{i=1}^{x} (λ^{x-i}/(x-i)!) exp(-λ)

= 1 - Σ_{j=0}^{x-1} (λ^j/j!) exp(-λ)
On the other hand,

P(X ≥ x) = 1 - P(X < x)
= 1 - P(X ≤ x - 1)
= 1 - Σ_{j=0}^{x-1} P(X = j)
= 1 - Σ_{j=0}^{x-1} p_X(j)
= 1 - Σ_{j=0}^{x-1} (λ^j/j!) exp(-λ)
= P(τ_1 + ... + τ_x ≤ 1)
44.3 Expected value

The expected value of a Poisson random variable X is

E[X] = λ

where: in step A we have used the fact that the first term of the sum is zero, because x = 0; in step B we have made a change of variable y = x - 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have defined

p_Y(y) = exp(-λ) λ^y / y!

where p_Y(y) is the probability mass function of a Poisson random variable with parameter λ; in step E we have used the fact that the sum of a probability mass function over its support equals 1.
44.4 Variance

The variance of a Poisson random variable X is:

Var[X] = λ

Proof. It can be derived thanks to the usual formula for computing the variance⁶:

E[X²] = Σ_{x∈R_X} x² p_X(x)
= Σ_{x=0}^∞ x² exp(-λ) λ^x / x!
= 0 + Σ_{x=1}^∞ x² exp(-λ) λ^x / x!   (A)
= Σ_{y=0}^∞ (y+1)² exp(-λ) λ^{y+1} / (y+1)!   (B)
= Σ_{y=0}^∞ (y+1)² exp(-λ) λ^{y+1} / ((y+1) y!)   (C)
= λ Σ_{y=0}^∞ (y+1) exp(-λ) λ^y / y!
= λ Σ_{y=0}^∞ (y+1) p_Y(y)   (D)
= λ { Σ_{y=0}^∞ y p_Y(y) + Σ_{y=0}^∞ p_Y(y) }
= λ { E[Y] + 1 }   (E)
= λ { λ + 1 }   (F)
= λ² + λ

where: in step A we have used the fact that the first term of the sum is zero, because x = 0; in step B we have made a change of variable y = x - 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have defined

p_Y(y) = exp(-λ) λ^y / y!

where p_Y(y) is the probability mass function of a Poisson random variable with parameter λ; in step E we have used the fact that the sum of a probability mass function over its support equals 1; in step F we have used the fact that the expected value of a Poisson random variable with parameter λ is λ. Finally,

E[X]² = λ²

and

Var[X] = E[X²] - E[X]² = λ² + λ - λ² = λ

⁶ Var[X] = E[X²] - E[X]². See p. 156.
where

exp(λ exp(t)) = Σ_{x=0}^∞ (λ exp(t))^x / x!

is the usual Taylor series expansion of the exponential function. Furthermore, since the series converges for any value of t, the moment generating function of a Poisson random variable exists for any t ∈ R.

where:

exp(λ exp(it)) = Σ_{x=0}^∞ (λ exp(it))^x / x!

is the usual Taylor series expansion of the exponential function (note that the series converges for any value of t).
where ⌊x⌋ is the floor of x, i.e. the largest integer not greater than x.

Proof. Using the definition of distribution function:

F_X(x) = P(X ≤ x)   (A)
= Σ_{s=0}^{⌊x⌋} P(X = s)   (B)
= Σ_{s=0}^{⌊x⌋} p_X(s)   (C)
= Σ_{s=0}^{⌊x⌋} exp(-λ) λ^s / s!
= exp(-λ) Σ_{s=0}^{⌊x⌋} λ^s / s!
Exercise 1
The time elapsed between the arrival of a customer at a shop and the arrival
of the next customer has an exponential distribution with expected value equal
to 15 minutes. Furthermore, it is independent of previous arrivals. What is the
probability that more than 6 customers will arrive at the shop during the next
hour?
Solution
If a random variable has an exponential distribution with parameter λ, then its expected value is equal to 1/λ. Here

1/λ = 0.25 hours

so that λ = 4. The number of customers that arrive at the shop during the next hour (denote it by X) is a Poisson random variable with parameter λ = 4. The probability that more than 6 customers arrive at the shop during the next hour is:

P(X > 6) = 1 - P(X ≤ 6) = 1 - F_X(6)
= 1 - Σ_{s=0}^6 exp(-4) 4^s / s! ≈ 0.1107
The value of FX (6) can be calculated with a computer algorithm, for example with
the MATLAB command:
poisscdf(6,4)
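An equivalent computation without MATLAB (a sketch evaluating the Poisson partial sum directly):

```python
import math

lam = 4
# F_X(6) = sum of the Poisson pmf exp(-lam) * lam^s / s! over s = 0, ..., 6
cdf_at_6 = sum(math.exp(-lam) * lam ** s / math.factorial(s) for s in range(7))
p_more_than_6 = 1 - cdf_at_6
# p_more_than_6 is approximately 0.1107, matching 1 - poisscdf(6,4)
```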
Exercise 2
At a call center, the time elapsed between the arrival of a phone call and the arrival
of the next phone call has an exponential distribution with expected value equal
to 15 seconds. Furthermore, it is independent of previous arrivals. What is the
probability that less than 50 phone calls arrive during the next 15 minutes?
Solution
If a random variable has an exponential distribution with parameter λ, then its expected value is equal to 1/λ. Here

1/λ = 1/4 minutes = 1/60 quarters of an hour

where, in the last equality, we have taken 15 minutes as the unit of time. Therefore, λ = 60. If inter-arrival times are independent exponential random variables with parameter λ, then the number of arrivals during a unit of time has a Poisson distribution with parameter λ. Thus, the number of phone calls that will arrive during the next 15 minutes (denote it by X) is a Poisson random variable with parameter λ = 60. The probability that less than 50 phone calls arrive during the next 15 minutes is:

P(X < 50) = P(X ≤ 49) = F_X(49)
The value of FX (49) can be calculated with a computer algorithm, for example
with the MATLAB command:
poisscdf(49,60)
Chapter 45
Uniform distribution
A continuous random variable has a uniform distribution if all the values belonging
to its support have the same probability density.
45.1 Definition
The uniform distribution is characterized as follows:
Definition 249 Let X be an absolutely continuous random variable. Let its support be a closed interval of real numbers:

R_X = [l, u]

where l < u. We say that X has a uniform distribution on the interval [l, u] if its probability density function is

f_X(x) = { 1/(u - l) if x ∈ R_X ; 0 otherwise }
45.2 Expected value

The expected value of a uniform random variable X is

E[X] = (u + l)/2

Proof. It can be derived as follows:

E[X] = ∫_l^u x · (1/(u - l)) dx = (1/(u - l)) [x²/2]_l^u
= (1/(u - l)) (1/2) (u² - l²)
= ((u - l)(u + l)) / (2(u - l)) = (u + l)/2
45.3 Variance

The variance of a uniform random variable X is

Var[X] = (u - l)² / 12

Proof. It can be derived thanks to the usual formula for computing the variance²:

E[X²] = ∫_{-∞}^{+∞} x² f_X(x) dx
= ∫_l^u x² · (1/(u - l)) dx
= (1/(u - l)) ∫_l^u x² dx
= (1/(u - l)) [x³/3]_l^u
= (1/(u - l)) (1/3) (u³ - l³)
= ((u - l)(u² + ul + l²)) / (3(u - l))
= (u² + ul + l²)/3

E[X]² = ((u + l)/2)² = (u² + 2ul + l²)/4

Var[X] = E[X²] - E[X]²
= (u² + ul + l²)/3 - (u² + 2ul + l²)/4
= (4u² + 4ul + 4l² - 3u² - 6ul - 3l²)/12
= ((4 - 3)u² + (4 - 6)ul + (4 - 3)l²)/12
= (u² - 2ul + l²)/12 = (u - l)²/12
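The two steps of this derivation can be cross-checked numerically (a sketch; the interval endpoints and the midpoint-rule grid size are arbitrary choices):

```python
l, u = 2.0, 5.0  # arbitrary illustrative interval
n = 100_000
h = (u - l) / n
# midpoint-rule approximation of E[X^2] = integral of x^2 / (u - l) over [l, u]
ex2 = sum((l + (k + 0.5) * h) ** 2 / (u - l) for k in range(n)) * h
var = ex2 - ((u + l) / 2) ** 2
# theory: E[X^2] = (u^2 + ul + l^2)/3 = 13, Var = (u - l)^2 / 12 = 0.75
```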
Note that the above derivation is valid only when t ≠ 0; when t = 0, the mgf trivially equals 1. When t ≠ 0, the integral above is well-defined and finite for any t ∈ R. Thus, the moment generating function of a uniform random variable exists for any t ∈ R.

The characteristic function is derived as follows:

= (1/((u - l)t)) {sin(tu) - sin(tl) - i cos(tu) + i cos(tl)}
= (1/((u - l)it)) {i sin(tu) - i sin(tl) + cos(tu) - cos(tl)}
= (1/((u - l)it)) {[cos(tu) + i sin(tu)] - [cos(tl) + i sin(tl)]}
= (exp(itu) - exp(itl)) / ((u - l)it)

Note that the above derivation is valid only when t ≠ 0; when t = 0, the characteristic function trivially equals 1.
If l ≤ x ≤ u, then

F_X(x) = P(X ≤ x) = ∫_{-∞}^x f_X(t) dt = ∫_l^x (1/(u - l)) dt
= (1/(u - l)) [t]_l^x = (x - l)/(u - l)

If x > u, then:

F_X(x) = P(X ≤ x) = 1

because X cannot take on values greater than u.
Exercise 1
Let X be a uniform random variable with support

R_X = [5, 10]

Compute the following probability:

P(7 ≤ X ≤ 9)

Solution

We can compute this probability either using the probability density function or the distribution function of X. Using the probability density function:

P(7 ≤ X ≤ 9) = ∫_7^9 f_X(x) dx = ∫_7^9 (1/(10 - 5)) dx
= (1/5) [x]_7^9 = (1/5)(9 - 7) = 2/5

Using the distribution function:

P(7 ≤ X ≤ 9) = F_X(9) - F_X(7) = (9 - 5)/(10 - 5) - (7 - 5)/(10 - 5)
= 4/5 - 2/5 = 2/5
Exercise 2
Suppose the random variable X has a uniform distribution on the interval [-2, 4]. Compute the following probability:

P(X > 2)

Solution

P(X > 2) = 1 - P(X ≤ 2) = 1 - F_X(2)
= 1 - (2 - (-2))/(4 - (-2)) = 1 - 4/6 = 1/3
Exercise 3
Suppose the random variable X has a uniform distribution on the interval [0; 1].
Compute the third moment3 of X, i.e.:
X (3) = E X 3
3 See p. 36.
364 CHAPTER 45. UNIFORM DISTRIBUTION
Solution
We can compute the third moment of X using the transformation theorem4 :
\[
\operatorname{E}\!\left[X^3\right] = \int_{-\infty}^{\infty} x^3 f_X(x)\,dx = \int_0^1 x^3\,dx
= \left[\frac{1}{4}x^4\right]_0^1 = \frac{1}{4}
\]
4 See p. 134.
Chapter 46
Exponential distribution
How much time will elapse before an earthquake occurs in a given region? How
long do we need to wait before a customer enters our shop? How long will it take
before a call center receives the next phone call? How long will a piece of machinery
work without breaking down?
Questions such as these are often answered in probabilistic terms using the
exponential distribution.
All these questions concern the time we need to wait before a certain event
occurs. If this waiting time is unknown, it is often appropriate to think of it as a
random variable having an exponential distribution. Roughly speaking, the time
X we need to wait before an event occurs has an exponential distribution if the
probability that the event occurs during a certain time interval is proportional to
the length of that time interval. More precisely, X has an exponential distribution if the conditional probability
\[ P(t < X \leq t + \Delta t \mid X > t) \]
is approximately proportional to the length $\Delta t$ of the time interval.
46.1 Definition
The exponential distribution is characterized as follows.
Definition 250  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
\[ R_X = [0, \infty) \]
Let $\lambda \in \mathbb{R}_{++}$. We say that X has an exponential distribution with parameter $\lambda$ if its probability density function is
\[
f_X(x) = \begin{cases} \lambda\exp(-\lambda x) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
The proportionality condition discussed above can be written as
\[ P(t < X \leq t + \Delta t \mid X > t) = \lambda\,\Delta t + o(\Delta t) \]
Denote by
\[ F_X(x) = P(X \leq x) \]
the distribution function of X and by $S_X(x) = 1 - F_X(x)$ its survival function. Then,
\[
\frac{P(t < X \leq t + \Delta t)}{P(X > t)}
= \frac{F_X(t + \Delta t) - F_X(t)}{1 - F_X(t)}
= -\frac{S_X(t + \Delta t) - S_X(t)}{S_X(t)}
\]
1 See p. 107.
2 See p. 251.
3 See p. 108.
\[
-\frac{S_X(t + \Delta t) - S_X(t)}{S_X(t)} = \lambda\,\Delta t + o(\Delta t)
\]
where $o(\Delta t)$ is a quantity that becomes negligible relative to $\Delta t$ as $\Delta t$ tends to 0. Dividing both sides by $\Delta t$ and taking limits, we obtain
\[
\lim_{\Delta t \to 0} \frac{S_X(t + \Delta t) - S_X(t)}{\Delta t}\cdot\frac{1}{S_X(t)} = -\lambda
\]
or, by the definition of derivative:
\[
\frac{dS_X(t)}{dt}\cdot\frac{1}{S_X(t)} = -\lambda
\]
This differential equation is easily solved using the chain rule:
\[
\frac{dS_X(t)}{dt}\cdot\frac{1}{S_X(t)} = \frac{d\ln(S_X(t))}{dt} = -\lambda
\]
Taking the integral from 0 to x of both sides
\[
\int_0^x \frac{d\ln(S_X(t))}{dt}\,dt = \int_0^x (-\lambda)\,dt
\]
we obtain
\[
\left[\ln(S_X(t))\right]_0^x = \left[-\lambda t\right]_0^x
\]
or
\[
\ln(S_X(x)) = \ln(S_X(0)) - \lambda x
\]
But X cannot take negative values. So
\[ S_X(0) = 1 - F_X(0) = 1 \]
which implies
\[ \ln(S_X(x)) = -\lambda x \]
Exponentiating both sides, we get
\[ S_X(x) = \exp(-\lambda x) \]
Therefore,
\[ 1 - F_X(x) = \exp(-\lambda x) \]
or
\[ F_X(x) = 1 - \exp(-\lambda x) \]
Since the density function is the first derivative of the distribution function, we obtain
\[ f_X(x) = \frac{dF_X(x)}{dx} = \lambda\exp(-\lambda x) \]
which is the density of an exponential random variable. Therefore, the proportionality condition is satisfied only if X is an exponential random variable.
4 See p. 109.
46.4 Variance
The variance of an exponential random variable X is
\[ \operatorname{Var}[X] = \frac{1}{\lambda^2} \]
5 See p. 51.
6 See p. 156.
46.5 Moment generating function
The moment generating function of an exponential random variable X is
\[ M_X(t) = \frac{\lambda}{\lambda - t} \]
Proof. Using the definition of moment generating function, we obtain
\[
\begin{aligned}
M_X(t) &= \operatorname{E}[\exp(tX)] = \int_{-\infty}^{\infty} \exp(tx) f_X(x)\,dx \\
&= \int_0^{\infty} \exp(tx)\,\lambda\exp(-\lambda x)\,dx
= \lambda\int_0^{\infty} \exp((t-\lambda)x)\,dx \\
&= \lambda\left[\frac{1}{t-\lambda}\exp((t-\lambda)x)\right]_0^{\infty}
= \frac{\lambda}{\lambda - t}
\end{aligned}
\]
Of course, the above integrals converge only if $t - \lambda < 0$, i.e., only if $t < \lambda$. Therefore, the moment generating function of an exponential random variable exists for all $t < \lambda$.
The characteristic function of an exponential random variable X is
\[ \varphi_X(t) = \frac{\lambda}{\lambda - it} \]
Proof. Using the definition of characteristic function, we can write
\[
\begin{aligned}
\varphi_X(t) &= \operatorname{E}[\exp(itX)] = \int_{-\infty}^{\infty} \exp(itx) f_X(x)\,dx
= \int_0^{\infty} \exp(itx)\,\lambda\exp(-\lambda x)\,dx \\
&= \lambda\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx + i\lambda\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx
\end{aligned}
\]
Integrating by parts twice, we find
\[
\begin{aligned}
\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
&= \left[-\frac{1}{\lambda}\cos(tx)\exp(-\lambda x)\right]_0^{\infty}
- \int_0^{\infty} \frac{t}{\lambda}\sin(tx)\exp(-\lambda x)\,dx \\
&= \frac{1}{\lambda} - \frac{t^2}{\lambda^2}\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
\end{aligned}
\]
so that
\[
\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
= \frac{1}{\lambda}\left(1 + \frac{t^2}{\lambda^2}\right)^{-1}
= \frac{\lambda}{t^2 + \lambda^2}
\]
and, similarly,
\[
\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx
= \frac{t}{\lambda}\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
= \frac{t}{t^2 + \lambda^2}
\]
Putting pieces together, we get
\[
\begin{aligned}
\varphi_X(t)
&= \lambda\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx + i\lambda\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx \\
&= \frac{\lambda^2}{t^2 + \lambda^2} + i\frac{\lambda t}{t^2 + \lambda^2}
= \frac{\lambda(\lambda + it)}{(\lambda - it)(\lambda + it)}
= \frac{\lambda}{\lambda - it}
\end{aligned}
\]
\[
F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - \exp(-\lambda x) & \text{if } x \geq 0 \end{cases}
\]
\[ P(X \leq x + y \mid X > x) = P(X \leq y) \]
for any $x \geq 0$.
Proof. This is proved as follows:
\[
\begin{aligned}
P(X \leq x + y \mid X > x)
&= \frac{P(X \leq x + y \text{ and } X > x)}{P(X > x)} \\
&= \frac{P(x < X \leq x + y)}{P(X > x)} \\
&= \frac{F_X(x + y) - F_X(x)}{1 - F_X(x)} \\
&= \frac{1 - \exp(-\lambda(x + y)) - (1 - \exp(-\lambda x))}{\exp(-\lambda x)} \\
&= \frac{\exp(-\lambda x) - \exp(-\lambda(x + y))}{\exp(-\lambda x)} \\
&= 1 - \exp(-\lambda y) = F_X(y) = P(X \leq y)
\end{aligned}
\]
Remember that X is the time we need to wait before a certain event occurs.
The memoryless property states that the probability that the event happens during
a time interval of length y is independent of how much time x has already elapsed
without the event happening.
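The memoryless property is easy to verify numerically. In the sketch below, the rate lambda = 0.5 and the values x = 3, y = 1.5 are arbitrary illustrative choices:

```python
import math

LAM = 0.5  # arbitrary rate parameter for the illustration

def exp_cdf(x):
    # Distribution function of an exponential random variable.
    return 1 - math.exp(-LAM * x) if x >= 0 else 0.0

x, y = 3.0, 1.5
# P(X <= x + y | X > x) computed from the definition of conditional
# probability, versus the unconditional P(X <= y).
conditional = (exp_cdf(x + y) - exp_cdf(x)) / (1 - exp_cdf(x))
unconditional = exp_cdf(y)
```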
variables is just the product of their moment generating functions (see p. 293).
1 0 See p. 399.
46.9 Solved exercises
Exercise 1
The probability that a new customer enters a shop during a given minute is approximately 1%, irrespective of how many customers have entered the shop during
the previous minutes. Assume that the total time we need to wait before a new
customer enters the shop (denote it by X) has an exponential distribution. What
is the probability that no customer enters the shop during the next hour?
Solution
Time is measured in minutes, so the rate parameter is $\lambda = 0.01$ (the probability that a customer arrives during a given minute is approximately $\lambda$). Therefore, the probability that no customer enters the shop during the next hour is
\[
P(X > 60) = 1 - F_X(60) = \exp(-\lambda\cdot 60) = \exp(-0.6) \approx 0.5488
\]
Exercise 2
Let X be an exponential random variable with parameter = ln (3). Compute the
probability
P (2 X 4)
Solution
First of all, we can write the probability as
\[ P(2 \leq X \leq 4) = P(X \leq 4) - P(X < 2) = P(X \leq 4) - P(X \leq 2) \]
where we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero. Now, the probability can be written in terms of the distribution function of X as
\[
P(2 \leq X \leq 4) = F_X(4) - F_X(2)
= \left(1 - \exp(-4\ln 3)\right) - \left(1 - \exp(-2\ln 3)\right)
= 3^{-2} - 3^{-4} = \frac{1}{9} - \frac{1}{81} = \frac{8}{81}
\]
Exercise 3
Suppose the random variable X has an exponential distribution with parameter
= 1. Compute the probability
P (X > 2)
1 1 See p. 109.
Solution
The above probability can be easily computed using the distribution function of X:
\[
P(X > 2) = 1 - P(X \leq 2) = 1 - F_X(2) = \exp(-1\cdot 2) = \exp(-2) \approx 0.1353
\]
Exercise 4
What is the probability that a random variable X is less than its expected value,
if X has an exponential distribution with parameter ?
Solution
The expected value of an exponential random variable with parameter $\lambda$ is
\[ \operatorname{E}[X] = \frac{1}{\lambda} \]
Therefore,
\[
P(X \leq \operatorname{E}[X]) = P\!\left(X \leq \frac{1}{\lambda}\right) = F_X\!\left(\frac{1}{\lambda}\right)
= 1 - \exp\!\left(-\lambda\,\frac{1}{\lambda}\right) = 1 - \exp(-1)
\]
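Exercise 4 shows that the answer does not depend on lambda, which can be confirmed numerically (a minimal sketch):

```python
import math

def prob_below_mean(lam):
    # P(X <= E[X]) = F(1/lam) for an exponential with rate lam.
    return 1 - math.exp(-lam * (1 / lam))

# The result should equal 1 - exp(-1) for every choice of lambda.
values = [prob_below_mean(lam) for lam in (0.1, 1.0, 7.3)]
```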
Chapter 47
Normal distribution
The normal distribution is one of the cornerstones of probability theory and statistics, because of the role it plays in the Central Limit Theorem, because of its analytical tractability and because many real-world phenomena involve random quantities that are approximately normal (e.g. errors in scientific measurement). It is often called Gaussian distribution, in honor of Carl Friedrich Gauss (1777-1855), an eminent German mathematician who gave important contributions towards a better understanding of the normal distribution.
Sometimes it is also referred to as "bell-shaped distribution", because the graph of its probability density function resembles the shape of a bell.
[Figure: plot of the standard normal probability density function, symmetric around zero.]
As you can see from the above plot of the density of a normal distribution, the
density is symmetric around the mean (indicated by the vertical line at zero). As a
consequence, deviations from the mean having the same magnitude, but different
signs, have the same probability. The density is also very concentrated around
the mean and becomes very small by moving from the center to the left or to the
right of the distribution (the so called "tails" of the distribution). This means that
1 See p. 545.
the further a value is from the center of the distribution, the less probable it is to
observe that value.
The remainder of this lecture gives a formal presentation of the main characteristics of the normal distribution, dealing first with the special case in which the distribution has zero mean and unit variance, then with the general case, in which mean and variance can take any value.
47.1.1 Definition
The standard normal distribution is characterized as follows:
Definition 252  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
We say that X has a standard normal distribution if its probability density function is
\[ f_X(x) = (2\pi)^{-1/2}\exp\!\left(-\frac{1}{2}x^2\right) \]
That this density integrates to 1 can be verified as follows:
\[
\begin{aligned}
\int_{-\infty}^{\infty} f_X(x)\,dx
&= (2\pi)^{-1/2}\int_{-\infty}^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&\overset{A}{=} (2\pi)^{-1/2}\,2\int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&\overset{B}{=} (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\!\!\int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\left(1+s^2\right)\right)x\,dx\,ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\left[-\frac{1}{1+s^2}\exp\!\left(-\frac{1}{2}x^2\left(1+s^2\right)\right)\right]_0^{\infty}ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\left(0 + \frac{1}{1+s^2}\right)ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\left[\arctan(s)\right]_0^{\infty}\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\frac{\pi}{2} - 0\right)^{1/2}
= 2^{-1/2}\pi^{-1/2}\cdot 2\cdot\pi^{1/2}2^{-1/2} = 1
\end{aligned}
\]
where: in step A we have used the fact that the integrand is even; in step B we have squared the integral and made a change of variable ($y = xs$).
E [X] = 0
47.1.3 Variance
The variance of a standard normal random variable X is:
Var [X] = 1
Proof. It can be proved with the usual formula for computing the variance:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx
= (2\pi)^{-1/2}\int_{-\infty}^{\infty} x^2\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&= (2\pi)^{-1/2}\left\{\int_{-\infty}^{0} x\cdot x\exp\!\left(-\frac{1}{2}x^2\right)dx
+ \int_0^{\infty} x\cdot x\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&\overset{A}{=} (2\pi)^{-1/2}\left\{\left[-x\exp\!\left(-\frac{1}{2}x^2\right)\right]_{-\infty}^{0}
+ \int_{-\infty}^{0}\exp\!\left(-\frac{1}{2}x^2\right)dx\right. \\
&\qquad\left.+\left[-x\exp\!\left(-\frac{1}{2}x^2\right)\right]_0^{\infty}
+ \int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&= (2\pi)^{-1/2}\left\{(0-0) + (0-0)
+ \int_{-\infty}^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&= \int_{-\infty}^{\infty} f_X(x)\,dx \overset{B}{=} 1
\end{aligned}
\]
where: in step A we have performed an integration by parts; in step B we have used the fact that the integral of a probability density function over its support is equal to 1. Finally,
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = 1 - 0 = 1
\]
The characteristic function of a standard normal random variable X is
\[ \varphi_X(t) = \exp\!\left(-\frac{1}{2}t^2\right) \]
Proof. By the definition of characteristic function,
\[
\varphi_X(t) = \operatorname{E}[\exp(itX)] = \operatorname{E}[\cos(tX)] + i\operatorname{E}[\sin(tX)] = \operatorname{E}[\cos(tX)]
\]
where we have used the fact that $\sin(tx) f_X(x)$ is an odd function of $x$, so that its integral over $\mathbb{R}$ is zero. Now, take the derivative with respect to $t$ of the characteristic function:
\[
\begin{aligned}
\frac{d}{dt}\varphi_X(t)
&= \frac{d}{dt}\operatorname{E}[\exp(itX)]
= \operatorname{E}\!\left[\frac{d}{dt}\exp(itX)\right]
= \operatorname{E}[iX\exp(itX)] \\
&= i\operatorname{E}[X\cos(tX)] - \operatorname{E}[X\sin(tX)] \\
&= i\int_{-\infty}^{\infty} x\cos(tx) f_X(x)\,dx - \int_{-\infty}^{\infty} x\sin(tx) f_X(x)\,dx \\
&\overset{A}{=} -\int_{-\infty}^{\infty} x\sin(tx) f_X(x)\,dx \\
&= -\int_{-\infty}^{\infty} x\sin(tx)\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&= \int_{-\infty}^{\infty} \sin(tx)\frac{d}{dx}\!\left[\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}x^2\right)\right]dx
= \int_{-\infty}^{\infty} \sin(tx)\frac{d}{dx} f_X(x)\,dx \\
&\overset{B}{=} \left[\sin(tx) f_X(x)\right]_{-\infty}^{\infty}
- \int_{-\infty}^{\infty} t\cos(tx) f_X(x)\,dx
= -t\int_{-\infty}^{\infty} \cos(tx) f_X(x)\,dx
\end{aligned}
\]
where: in step A we have used the fact that $x\cos(tx) f_X(x)$ is an odd function of $x$; in step B we have performed an integration by parts. Putting together the previous two results, we obtain:
\[
\frac{d}{dt}\varphi_X(t) = -t\,\varphi_X(t)
\]
The only function that satisfies this ordinary differential equation (subject to the condition $\varphi_X(0) = \operatorname{E}[\exp(i\cdot 0\cdot X)] = 1$) is:
\[
\varphi_X(t) = \exp\!\left(-\frac{1}{2}t^2\right)
\]
There is no simple closed-form expression for the distribution function of a standard normal random variable; its values are usually computed with specialized algorithms, for example with the MATLAB command
normcdf(x)
which returns the value at the point x of the distribution function. Some values of the distribution function are used very frequently and people usually learn them by heart. Moreover, the relation
\[ F_X(-x) = 1 - F_X(x) \]
which is due to the symmetry around 0 of the standard normal density, is often used in calculations.
In the past, when computers were not widely available, people used to look up
the values of FX (x) in normal distribution tables. A normal distribution table
is a table where FX (x) is tabulated for several values of x. For values of x that
are not tabulated, approximations of FX (x) can be computed by interpolating the
two tabulated values that are closest to x. For example, if x is not tabulated, x1
is the greatest tabulated number smaller than x and x2 is the smallest tabulated
number greater than x, the approximation is as follows:
\[
F_X(x) \approx F_X(x_1) + (x - x_1)\,\frac{F_X(x_2) - F_X(x_1)}{x_2 - x_1}
\]
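The interpolation scheme can be sketched in Python. Here the "table" is generated with the error function (Phi(x) = (1 + erf(x / sqrt(2))) / 2); the tabulated points x1 = 1.64 and x2 = 1.65 are illustrative choices.

```python
import math

def std_normal_cdf(x):
    # Exact standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two adjacent "tabulated" points, as in a printed normal table.
x1, x2 = 1.64, 1.65
F1, F2 = std_normal_cdf(x1), std_normal_cdf(x2)

def interpolate(x):
    # Linear interpolation formula from the text.
    return F1 + (x - x1) * (F2 - F1) / (x2 - x1)

approx = interpolate(1.645)
exact = std_normal_cdf(1.645)
```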
47.2.1 Definition
The normal distribution with mean $\mu$ and variance $\sigma^2$ is characterized as follows:
Definition 253  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
Let $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_{++}$. We say that X has a normal distribution with mean $\mu$ and variance $\sigma^2$ if its probability density function is
\[
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)
\]
We often indicate that X has a normal distribution with mean $\mu$ and variance $\sigma^2$ by:
\[ X \sim N\!\left(\mu, \sigma^2\right) \]
Proof. This can be easily proved using the formula for the density of a function of an absolutely continuous variable: with $X = \mu + \sigma Z$, where Z has a standard normal distribution and $g(z) = \mu + \sigma z$,
\[
f_X(x) = f_Z\!\left(g^{-1}(x)\right)\left|\frac{dg^{-1}(x)}{dx}\right|
= \frac{1}{\sigma}\,f_Z\!\left(\frac{x-\mu}{\sigma}\right)
= \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)
\]
\[ \operatorname{E}[X] = \mu \]
\[
\operatorname{E}[X] = \operatorname{E}[\mu + \sigma Z] = \mu + \sigma\operatorname{E}[Z] = \mu + \sigma\cdot 0 = \mu
\]
47.2.4 Variance
The variance of a normal random variable X is:
\[ \operatorname{Var}[X] = \sigma^2 \]
Proof. It can be derived using the formula for the variance of linear transformations on $X = \mu + \sigma Z$ (where Z has a standard normal distribution):
\[
\operatorname{Var}[X] = \operatorname{Var}[\mu + \sigma Z] = \sigma^2\operatorname{Var}[Z] = \sigma^2
\]
The moment generating function of a standard normal random variable Z is
\[ M_Z(t) = \exp\!\left(\frac{1}{2}t^2\right) \]
We can use the formula for the moment generating function of a linear transformation:
\[
M_X(t) = \exp(\mu t)\,M_Z(\sigma t)
= \exp(\mu t)\exp\!\left(\frac{1}{2}\sigma^2 t^2\right)
= \exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)
\]
Similarly, the characteristic function of a standard normal random variable Z is
\[ \varphi_Z(t) = \exp\!\left(-\frac{1}{2}t^2\right) \]
We can use the formula for the characteristic function of a linear transformation:
\[
\varphi_X(t) = \exp(i\mu t)\,\varphi_Z(\sigma t)
= \exp\!\left(i\mu t - \frac{1}{2}\sigma^2 t^2\right)
\]
9 See p. 293.
1 0 See p. 310.
\[
F_X(x) = P(X \leq x) = P(\mu + \sigma Z \leq x)
= P\!\left(Z \leq \frac{x-\mu}{\sigma}\right)
= F_Z\!\left(\frac{x-\mu}{\sigma}\right)
\]
Exercise 1
Let X be a normal random variable with mean $\mu = 3$ and variance $\sigma^2 = 4$. Compute the following probability:
\[ P(-0.92 \leq X \leq 6.92) \]
Solution
First of all, we need to express the above probability in terms of the distribution function of X:
\[
\begin{aligned}
P(-0.92 \leq X \leq 6.92)
&= P(X \leq 6.92) - P(X < -0.92) \\
&\overset{A}{=} P(X \leq 6.92) - P(X \leq -0.92) \\
&= F_X(6.92) - F_X(-0.92)
\end{aligned}
\]
where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero.
Then, we need to express the distribution function of X in terms of the distribution function of a standard normal random variable Z:
\[
F_X(x) = F_Z\!\left(\frac{x-\mu}{\sigma}\right) = F_Z\!\left(\frac{x-3}{2}\right)
\]
Therefore,
\[
P(-0.92 \leq X \leq 6.92) = F_Z(1.96) - F_Z(-1.96) = 0.975 - 0.025 = 0.95
\]
Exercise 2
Let X be a random variable having a normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 16$. Compute the following probability:
P (X > 9)
Solution
We need to use the same technique used in the previous exercise and express the
probability in terms of the distribution function of a standard normal random
variable:
\[
P(X > 9) = 1 - P(X \leq 9) = 1 - F_X(9)
= 1 - F_Z\!\left(\frac{9-1}{\sqrt{16}}\right) = 1 - F_Z(2)
= 1 - 0.9772 = 0.0228
\]
where the value FZ (2) can be found with a computer algorithm, for example with
the MATLAB command
normcdf(2)
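Outside MATLAB, the same value can be obtained with the error function available in the Python standard library (a sketch of Exercise 2):

```python
import math

def normcdf(x):
    # Standard normal distribution function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Exercise 2: X ~ N(1, 16), so P(X > 9) = 1 - Phi((9 - 1) / 4).
p = 1 - normcdf((9 - 1) / math.sqrt(16))
```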
Exercise 3
Suppose the random variable X has a normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 1$. Define the random variable Y as follows:
\[ Y = \exp(2 + 3X) \]
Compute the expected value of Y.
Solution
The moment generating function of X is:
\[
M_X(t) = \operatorname{E}[\exp(tX)] = \exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)
= \exp\!\left(t + \frac{1}{2}t^2\right)
\]
Therefore,
\[
\operatorname{E}[Y] = \operatorname{E}[\exp(2)\exp(3X)] = \exp(2)\,M_X(3)
= \exp(2)\exp\!\left(3 + \frac{9}{2}\right) = \exp\!\left(\frac{19}{2}\right)
\]
Chapter 48
Chi-square distribution
48.1 Definition
Chi-square random variables are characterized as follows.
Definition 256  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
RX = [0; 1)
Let $n \in \mathbb{N}$. We say that X has a Chi-square distribution with n degrees of freedom if its probability density function is
\[
f_X(x) = \begin{cases} c\,x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
where c is a constant:
\[ c = \frac{1}{2^{n/2}\,\Gamma(n/2)} \]
and $\Gamma(\cdot)$ is the Gamma function.
The following notation is often employed to indicate that a random variable X has a Chi-square distribution with n degrees of freedom:
\[ X \sim \chi^2(n) \]
where the symbol $\sim$ means "is distributed as" and $\chi^2(n)$ indicates a Chi-square distribution with n degrees of freedom.
1 See p. 233.
2 See p. 376.
3 See p. 107.
4 See p. 55.
The expected value of a Chi-square random variable X is
\[ \operatorname{E}[X] = n \]
48.3 Variance
The variance of a Chi-square random variable X is
\[ \operatorname{Var}[X] = 2n \]
Proof. It can be derived thanks to the usual formula for computing the variance. Integrating by parts twice:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= c\int_0^{\infty} x^{n/2+1}\exp\!\left(-\frac{1}{2}x\right)dx \\
&\overset{A}{=} c\left\{\left[-2x^{n/2+1}\exp\!\left(-\frac{1}{2}x\right)\right]_0^{\infty}
+ (n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= c\left\{(0-0) + (n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&\overset{B}{=} c\,(n+2)\left\{\left[-2x^{n/2}\exp\!\left(-\frac{1}{2}x\right)\right]_0^{\infty}
+ n\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= c\,(n+2)\left\{(0-0) + n\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= (n+2)\,n\int_0^{\infty} c\,x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx
= (n+2)\,n\int_0^{\infty} f_X(x)\,dx
\overset{C}{=} (n+2)\,n
\end{aligned}
\]
where: in steps A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Finally:
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = (n+2)\,n - n^2 = 2n
\]
The moment generating function of a Chi-square random variable X is
\[ M_X(t) = (1-2t)^{-n/2} \quad \text{for } t < \frac{1}{2} \]
Proof. After the change of variable
\[ y = \left(\frac{1}{2} - t\right)x \]
the integral defining the moment generating function becomes
\[
\begin{aligned}
M_X(t)
&= c\int_0^{\infty}\left(\frac{2}{1-2t}\right)^{n/2-1} y^{n/2-1}\exp(-y)\,\frac{2}{1-2t}\,dy \\
&= c\left(\frac{2}{1-2t}\right)^{n/2}\int_0^{\infty} y^{n/2-1}\exp(-y)\,dy \\
&\overset{B}{=} c\left(\frac{2}{1-2t}\right)^{n/2}\Gamma(n/2) \\
&\overset{C}{=} \frac{1}{2^{n/2}\,\Gamma(n/2)}\left(\frac{2}{1-2t}\right)^{n/2}\Gamma(n/2)
= \frac{1}{2^{n/2}}\,\frac{2^{n/2}}{(1-2t)^{n/2}}
= (1-2t)^{-n/2}
\end{aligned}
\]
where: in step B we have used the definition of the Gamma function; in step C we have used the definition of c.
The characteristic function of a Chi-square random variable X is
\[ \varphi_X(t) = (1-2it)^{-n/2} \]
The derivation proceeds like the one for the moment generating function: in step A we substitute the Taylor series expansion of $\exp(itx)$; in step B we define
\[
f_k(x) = \frac{1}{2^{k+n/2}\,\Gamma(k+n/2)}\,x^{k+n/2-1}\exp\!\left(-\frac{1}{2}x\right)
\]
which is the probability density function of a Chi-square random variable with $2k+n$ degrees of freedom, so that the integrals equal 1. The resulting series
\[
1 + \sum_{k=1}^{\infty}\frac{1}{k!}\,(2it)^k\prod_{j=0}^{k-1}\left(\frac{n}{2}+j\right)
\]
is the Taylor series expansion of $(1-2it)^{-n/2}$, which you can verify by computing the expansion yourself.
The distribution function of a Chi-square random variable X is
\[ F_X(x) = \frac{\gamma(n/2,\,x/2)}{\Gamma(n/2)} \]
where $\gamma(\cdot,\cdot)$ is the lower incomplete Gamma function.
chi2cdf(x,n)
returns the value at the point x of the distribution function of a Chi-square random
variable with n degrees of freedom.
In the past, when computers were not widely available, people used to look up
the values of $F_X(x)$ in Chi-square distribution tables. A Chi-square distribution table is a table where $F_X(x)$ is tabulated for several values of x and n. For
values of x that are not tabulated, approximations of FX (x) can be computed by
interpolation, with the same procedure described for the normal distribution (p.
380).
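A small stand-in for chi2cdf can be written from the lower incomplete Gamma function, using its power series. This is a sketch adequate for moderate x, not a production implementation:

```python
import math

def lower_gamma_reg(s, x, terms=200):
    # Regularized lower incomplete Gamma function gamma(s, x) / Gamma(s),
    # via the series gamma(s, x) = x^s e^-x sum_k x^k / (s (s+1) ... (s+k)).
    total, term = 0.0, 1.0 / s
    for k in range(1, terms):
        total += term
        term *= x / (s + k)
    return total * math.exp(-x + s * math.log(x)) / math.gamma(s)

def chi2cdf(x, n):
    # Distribution function of a Chi-square with n degrees of freedom.
    return lower_gamma_reg(n / 2, x / 2)
```

For n = 2 the Chi-square distribution function reduces to 1 - exp(-x/2), which gives an easy correctness check.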
The sum of two independent Chi-square random variables is itself a Chi-square random variable, with degrees of freedom equal to the sum of their degrees of freedom:
\[
\left.\begin{aligned} &X_1 \sim \chi^2(n_1),\ X_2 \sim \chi^2(n_2) \\ &X_1 \text{ and } X_2 \text{ are independent} \end{aligned}\right\}
\implies X_1 + X_2 \sim \chi^2(n_1 + n_2)
\]
This can be generalized to sums of more than two Chi-square random variables, provided they are mutually independent:
\[
\left.\begin{aligned} &X_i \sim \chi^2(n_i) \text{ for } i = 1,\ldots,k \\ &X_1, X_2, \ldots, X_k \text{ are mutually independent} \end{aligned}\right\}
\implies \sum_{i=1}^{k} X_i \sim \chi^2\!\left(\sum_{i=1}^{k} n_i\right)
\]
Proof. This can be easily proved using moment generating functions. The moment generating function of $X_i$ is
\[ M_{X_i}(t) = (1-2t)^{-n_i/2} \]
Define
\[ X = \sum_{i=1}^{k} X_i \]
Since the variables are mutually independent, the moment generating function of their sum is the product of their moment generating functions:
\[
M_X(t) = \prod_{i=1}^{k} M_{X_i}(t) = \prod_{i=1}^{k}(1-2t)^{-n_i/2} = (1-2t)^{-n/2}
\]
where
\[ n = \sum_{i=1}^{k} n_i \]
which is the moment generating function of a Chi-square random variable with n degrees of freedom.
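The additivity property can be illustrated by simulation: building X1 from 3 squared standard normals and X2 from 5, the sum should have the mean (8) and variance (16) of a Chi-square with 8 degrees of freedom.

```python
import random

random.seed(1)
samples = []
for _ in range(100_000):
    x1 = sum(random.gauss(0, 1) ** 2 for _ in range(3))  # ~ chi2(3)
    x2 = sum(random.gauss(0, 1) ** 2 for _ in range(5))  # ~ chi2(5)
    samples.append(x1 + x2)                               # ~ chi2(8)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```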
A Chi-square random variable with one degree of freedom can be written as
\[ X = Z^2 \]
where Z has a standard normal distribution. Its distribution function is
\[
F_X(x) \overset{A}{=} P(X \leq x) \overset{B}{=} P\!\left(Z^2 \leq x\right)
\overset{C}{=} P\!\left(-x^{1/2} \leq Z \leq x^{1/2}\right)
\overset{D}{=} \int_{-x^{1/2}}^{x^{1/2}} f_Z(z)\,dz
\]
9 See p. 293.
10 See p. 376.
Exercise 1
Let X be a Chi-square random variable with 3 degrees of freedom. Compute the
probability
\[ P(0.35 \leq X \leq 7.81) \]
Solution
First of all, we need to express the above probability in terms of the distribution function of X:
\[
P(0.35 \leq X \leq 7.81) \overset{A}{=} F_X(7.81) - F_X(0.35) \overset{B}{=} 0.95 - 0.05 = 0.90
\]
where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero; in step B the values
\[ F_X(7.81) = 0.95 \qquad F_X(0.35) = 0.05 \]
can be computed with a computer algorithm, for example with the MATLAB commands chi2cdf(7.81,3) and chi2cdf(0.35,3).
Exercise 2
Let $X_1$ and $X_2$ be two independent normal random variables having mean $\mu = 0$ and variance $\sigma^2 = 16$. Compute the probability
\[ P\!\left(X_1^2 + X_2^2 > 8\right) \]
Solution
First of all, the two variables $X_1$ and $X_2$ can be written as
\[ X_1 = 4Z_1 \qquad X_2 = 4Z_2 \]
where $Z_1$ and $Z_2$ are two standard normal random variables. Thus, we can write
\[
P\!\left(X_1^2 + X_2^2 > 8\right) = P\!\left(16Z_1^2 + 16Z_2^2 > 8\right)
= P\!\left(Z_1^2 + Z_2^2 > \frac{1}{2}\right)
\]
but the sum $Y = Z_1^2 + Z_2^2$ has a Chi-square distribution with 2 degrees of freedom. Therefore,
\[
P\!\left(X_1^2 + X_2^2 > 8\right)
= 1 - P\!\left(Z_1^2 + Z_2^2 \leq \frac{1}{2}\right)
= 1 - F_Y\!\left(\frac{1}{2}\right)
\]
Exercise 3
Suppose the random variable X has a Chi-square distribution with 5 degrees of freedom. Define the random variable Y as follows:
\[ Y = \exp(1 - X) \]
Compute the expected value of Y.
Solution
The expected value of Y can be easily calculated using the moment generating function of X:
\[ M_X(t) = \operatorname{E}[\exp(tX)] = (1-2t)^{-5/2} \]
Now, exploiting the linearity of the expected value, we obtain
\[
\operatorname{E}[Y] = \operatorname{E}[\exp(1)\exp(-X)] = \exp(1)\,\operatorname{E}[\exp(-X)]
= \exp(1)\,M_X(-1) = \exp(1)\,(1+2)^{-5/2} = \frac{e}{3^{5/2}}
\]
Chapter 49
Gamma distribution
49.1 Definition
Gamma random variables are characterized as follows:
Definition 257  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
\[ R_X = [0, \infty) \]
Let $n, h \in \mathbb{R}_{++}$. We say that X has a Gamma distribution with parameters n and h if its probability density function is
\[
f_X(x) = \begin{cases} c\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
where c is a constant:
\[ c = \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)} \]
and $\Gamma(\cdot)$ is the Gamma function.
The expected value of a Gamma random variable X is
\[ \operatorname{E}[X] = h \]
49.3 Variance
The variance of a Gamma random variable X is
\[ \operatorname{Var}[X] = 2\,\frac{h^2}{n} \]
Proof. It can be derived thanks to the usual formula for computing the variance:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_0^{\infty} x^2 f_X(x)\,dx
= c\int_0^{\infty} x^{n/2+1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{A}{=} c\left\{\left[-\frac{2h}{n}x^{n/2+1}\exp\!\left(-\frac{n}{2h}x\right)\right]_0^{\infty}
+ \left(\frac{n}{2}+1\right)\frac{2h}{n}\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= c\left\{(0-0) + \frac{h}{n}(n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&\overset{B}{=} c\,\frac{h}{n}(n+2)\left\{\left[-\frac{2h}{n}x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)\right]_0^{\infty}
+ \frac{n}{2}\,\frac{2h}{n}\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= c\,\frac{h}{n}(n+2)\left\{(0-0) + h\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= \frac{h^2}{n}(n+2)\int_0^{\infty} c\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx
= \frac{h^2}{n}(n+2)\int_0^{\infty} f_X(x)\,dx
\overset{C}{=} (n+2)\,\frac{h^2}{n}
\end{aligned}
\]
where: in steps A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Finally:
\[ \operatorname{E}[X]^2 = h^2 \]
and
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2
= (n+2)\,\frac{h^2}{n} - h^2 = (n+2-n)\,\frac{h^2}{n} = 2\,\frac{h^2}{n}
\]
49.4 Moment generating function
The moment generating function of a Gamma random variable X is
\[ M_X(t) = \left(1 - \frac{2h}{n}t\right)^{-n/2} \quad \text{for } t < \frac{n}{2h} \]
Proof. Using the definition of moment generating function:
\[
\begin{aligned}
M_X(t)
&= \int_0^{\infty} \exp(tx)\,\frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&= \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}\left(1-\frac{2h}{n}t\right)x\right)dx \\
&= \frac{(n/h)^{n/2}}{\left(\frac{n}{h}\left(1-\frac{2h}{n}t\right)\right)^{n/2}}
\int_0^{\infty}\frac{\left(\frac{n}{h}\left(1-\frac{2h}{n}t\right)\right)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,
x^{n/2-1}\exp\!\left(-\frac{n}{2h}\left(1-\frac{2h}{n}t\right)x\right)dx \\
&= \left(1-\frac{2h}{n}t\right)^{-n/2}
\end{aligned}
\]
where the last integral equals 1 because it is the integral of the probability density function of a Gamma random variable with parameters n and $h\left(1-\frac{2h}{n}t\right)^{-1}$.
49.5 Characteristic function
The characteristic function of a Gamma random variable X is
\[ \varphi_X(t) = \left(1 - \frac{2h}{n}it\right)^{-n/2} \]
Proof.
\[
\begin{aligned}
\varphi_X(t)
&= \operatorname{E}[\exp(itX)]
= \int_0^{\infty} \exp(itx) f_X(x)\,dx
= c\int_0^{\infty} \exp(itx)\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{A}{=} c\sum_{k=0}^{\infty}\frac{1}{k!}(it)^k\int_0^{\infty} x^{k+n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{B}{=} c\sum_{k=0}^{\infty}\frac{1}{k!}(it)^k\,\Gamma(k+n/2)\left(\frac{2h}{n}\right)^{k+n/2} \\
&\overset{C}{=} \sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{2h}{n}it\right)^k\frac{\Gamma(k+n/2)}{\Gamma(n/2)} \\
&\overset{D}{=} 1 + \sum_{k=1}^{\infty}\frac{1}{k!}\left(\frac{2h}{n}it\right)^k\prod_{j=0}^{k-1}\left(\frac{n}{2}+j\right) \\
&\overset{E}{=} \left(1-\frac{2h}{n}it\right)^{-n/2}
\end{aligned}
\]
where: in step A we have substituted $\exp(itx)$ with its Taylor series expansion and exchanged the sum and the integral; in step B we have recognized each integral as the normalization of the density $f_k$ of a Gamma random variable with parameters $2k+n$ and $h\frac{2k+n}{n}$ (densities integrate to 1); in step C we have used the definition of c; in step D we have used the recursion $\Gamma(k+n/2) = \Gamma(n/2)\prod_{j=0}^{k-1}(n/2+j)$; in step E we have recognized the Taylor series expansion of $\left(1-\frac{2h}{n}it\right)^{-n/2}$, which you can verify by computing the expansion yourself.
The distribution function of a Gamma random variable X is
\[ F_X(x) = \frac{\gamma(n/2,\,nx/2h)}{\Gamma(n/2)} \]
where the function
\[ \gamma(s, x) = \int_0^x t^{s-1}\exp(-t)\,dt \]
is called lower incomplete Gamma function and is usually evaluated using specialized computer algorithms.
Proof. This is proved as follows:
\[
\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(t)\,dt
= \int_0^{x} c\,t^{n/2-1}\exp\!\left(-\frac{n}{2h}t\right)dt \\
&\overset{A}{=} c\int_0^{nx/2h}\left(\frac{2h}{n}\right)^{n/2-1} s^{n/2-1}\exp(-s)\,\frac{2h}{n}\,ds
= c\left(\frac{2h}{n}\right)^{n/2}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds \\
&\overset{B}{=} \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\left(\frac{2h}{n}\right)^{n/2}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds
= \frac{1}{\Gamma(n/2)}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds \\
&= \frac{\gamma(n/2,\,nx/2h)}{\Gamma(n/2)}
\end{aligned}
\]
where: in step A we have performed a change of variable ($s = \frac{n}{2h}t$); in step B we have used the definition of c.
If Z has a Chi-square distribution with n degrees of freedom, then
\[ X = \frac{h}{n}Z \]
has a Gamma distribution with parameters n and h.
Proof. This can be easily proved using the formula for the density of a function of an absolutely continuous variable: with $g(z) = \frac{h}{n}z$,
\[
f_X(x) = f_Z\!\left(g^{-1}(x)\right)\left|\frac{dg^{-1}(x)}{dx}\right|
= \frac{n}{h}\,f_Z\!\left(\frac{n}{h}x\right)
\]
The density function of a Chi-square random variable with n degrees of freedom is
\[
f_Z(z) = \begin{cases} k\,z^{n/2-1}\exp\!\left(-\frac{1}{2}z\right) & \text{if } z \in [0,\infty) \\ 0 & \text{otherwise} \end{cases}
\]
where
\[ k = \frac{1}{2^{n/2}\,\Gamma(n/2)} \]
Therefore:
\[
f_X(x) = \frac{n}{h}\,f_Z\!\left(\frac{n}{h}x\right)
= \begin{cases} k\left(\frac{n}{h}\right)^{n/2} x^{n/2-1}\exp\!\left(-\frac{1}{2}\frac{n}{h}x\right) & \text{if } x \in [0,\infty) \\ 0 & \text{otherwise} \end{cases}
\]
which is the density of a Gamma distribution with parameters n and h.
Thus, the Chi-square distribution is a special case of the Gamma distribution, because, when $h = n$, we have:
\[ X = \frac{h}{n}Z = \frac{n}{n}Z = Z \]
In other words, a Gamma distribution with parameters n and $h = n$ is just a Chi-square distribution with n degrees of freedom.
Moreover, since a Chi-square random variable Z with n degrees of freedom can be written as a sum $W_1^2 + \ldots + W_n^2$ of squares of n independent standard normal random variables, we have
\[
X = \frac{h}{n}Z = \frac{h}{n}\left(W_1^2 + \ldots + W_n^2\right)
= \left(\sqrt{\frac{h}{n}}\,W_1\right)^2 + \ldots + \left(\sqrt{\frac{h}{n}}\,W_n\right)^2
= Y_1^2 + \ldots + Y_n^2
\]
But the variables $Y_i$ are normal random variables with mean 0 and variance $\frac{h}{n}$. Therefore, a Gamma random variable with parameters n and h can be seen as a sum of squares of n independent normal random variables having mean 0 and variance $h/n$.
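This representation is easy to check by simulation. With n = 4 and h = 6 (arbitrary choices), summing 4 squared normals with variance h/n = 1.5 should reproduce the Gamma mean h = 6 and variance 2h^2/n = 18:

```python
import math
import random

random.seed(2)
n, h = 4, 6
sigma = math.sqrt(h / n)  # each normal has variance h/n

samples = [sum(random.gauss(0, sigma) ** 2 for _ in range(n))
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```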
Exercise 1
Let X1 and X2 be two independent Chi-square random variables having 3 and 5
degrees of freedom respectively. Consider the following random variables:
\[ Y_1 = 2X_1 \qquad Y_2 = \frac{1}{3}X_2 \qquad Y_3 = 3X_1 + 3X_2 \]
What distribution do these variables have?
Solution
Being multiples of Chi-square random variables, the variables Y1 , Y2 and Y3 all have
a Gamma distribution. The random variable X1 has n = 3 degrees of freedom and
the random variable Y1 can be written as
\[ Y_1 = \frac{h}{n}X_1 \]
where h = 6. Therefore Y1 has a Gamma distribution with parameters n = 3 and
h = 6. The random variable X2 has n = 5 degrees of freedom and the random
variable Y2 can be written as
\[ Y_2 = \frac{h}{n}X_2 \]
where $h = 5/3$. Therefore $Y_2$ has a Gamma distribution with parameters $n = 5$
and h = 5=3. The random variable X1 + X2 has a Chi-square distribution with
n = 3 + 5 = 8 degrees of freedom, because X1 and X2 are independent9 , and the
random variable Y3 can be written as
\[ Y_3 = \frac{h}{n}\left(X_1 + X_2\right) \]
where h = 24. Therefore Y3 has a Gamma distribution with parameters n = 8 and
h = 24.
Exercise 2
Let X be a random variable having a Gamma distribution with parameters n = 4
and $h = 2$. Define the following random variables:
\[ Y_1 = \frac{1}{2}X \qquad Y_2 = 5X \qquad Y_3 = 2X \]
What distribution do these variables have?
Solution
Multiplying a Gamma random variable with parameters n and h by a strictly positive constant, one still obtains a Gamma random variable, with the same n and with h multiplied by that constant. In particular, the random variable $Y_1$ is a Gamma random variable with parameters $n = 4$ and
\[ h = \frac{1}{2}\cdot 2 = 1 \]
The random variable $Y_2$ is a Gamma random variable with parameters $n = 4$ and
\[ h = 5\cdot 2 = 10 \]
The random variable $Y_3$ is a Gamma random variable with parameters $n = 4$ and
\[ h = 2\cdot 2 = 4 \]
The random variable $Y_3$ is also a Chi-square random variable with 4 degrees of freedom (remember that a Gamma random variable with parameters n and h is also a Chi-square random variable when $n = h$).
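The scaling rule used in this exercise can be checked by simulation, reusing the Gamma parametrization of this chapter (mean h, variance 2h^2/n). The check below multiplies draws from a Gamma with n = 4, h = 2 by 5 and compares the sample mean with the expected h = 10:

```python
import random

random.seed(4)
n, h = 4, 2
# A Gamma(n, h) draw, built as (h/n) times a sum of n squared
# standard normals (the representation derived in this chapter).
def gamma_draw():
    return (h / n) * sum(random.gauss(0, 1) ** 2 for _ in range(n))

scaled = [5 * gamma_draw() for _ in range(100_000)]  # should be Gamma(4, 10)
mean = sum(scaled) / len(scaled)
```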
9 See the lecture entitled Chi-square distribution (p. 387).
Exercise 3
Let X1 , X2 and X3 be mutually independent normal random variables having mean
$\mu = 0$ and variance $\sigma^2 = 3$. Consider the random variable
\[ X = 2\left(X_1^2 + X_2^2 + X_3^2\right) \]
What distribution does X have?
Solution
The random variable X can be written as
\[
X = 2\left(\sqrt{3}Z_1\right)^2 + 2\left(\sqrt{3}Z_2\right)^2 + 2\left(\sqrt{3}Z_3\right)^2
= \frac{18}{3}\left(Z_1^2 + Z_2^2 + Z_3^2\right)
\]
where $Z_1$, $Z_2$ and $Z_3$ are mutually independent standard normal random variables. The sum $Z_1^2 + Z_2^2 + Z_3^2$ has a Chi-square distribution with 3 degrees of freedom. Therefore, X has a Gamma distribution with parameters $n = 3$ and $h = 18$.
Chapter 50
Student’s t distribution
50.1 The standard Student's t distribution
50.1.1 Definition
The standard Student's t distribution is characterized as follows:
Definition 260  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
Let $n \in \mathbb{R}_{++}$. We say that X has a standard Student's t distribution with n degrees of freedom if its probability density function is
\[
f_X(x) = c\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}
\]
where c is a constant:
\[
c = \frac{1}{\sqrt{n}\,B\!\left(\frac{n}{2},\frac{1}{2}\right)}
\]
and $B(\cdot,\cdot)$ is the Beta function.
The density above can be obtained by marginalization, writing
\[
f_X(x) = \int_0^{\infty} f_{X|Z=z}(x)\,f_Z(z)\,dz
\]
where:
1. $f_{X|Z=z}(x)$ is the probability density function of a normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{z}$:
\[
f_{X|Z=z}(x) = \left(2\pi\sigma^2\right)^{-1/2}\exp\!\left(-\frac{1}{2}\frac{x^2}{\sigma^2}\right)
= (2\pi)^{-1/2}\,z^{1/2}\exp\!\left(-\frac{1}{2}z x^2\right)
\]
2. $f_Z(z)$ is the probability density function of a Gamma random variable with parameters n and $h = 1$:
\[
f_Z(z) = c\,z^{n/2-1}\exp\!\left(-\frac{1}{2}n z\right)
\]
Let us start from the integrand function:
\[
f_{X|Z=z}(x)\,f_Z(z)
= (2\pi)^{-1/2}\,c\,z^{(n+1)/2-1}\exp\!\left(-\frac{x^2+n}{2}z\right)
= (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\,f_{Z|X=x}(z)
\]
where
\[
c_2 = \frac{\left(\frac{x^2+n}{2}\right)^{(n+1)/2}}{\Gamma\!\left(\frac{n+1}{2}\right)}
= \frac{\left(x^2+n\right)^{(n+1)/2}}{2^{n/2}\,2^{1/2}\,\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}
\]
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters $n+1$ and $\frac{n+1}{x^2+n}$. Therefore,
\[
\begin{aligned}
\int_0^{\infty} f_{X|Z=z}(x)\,f_Z(z)\,dz
&= \int_0^{\infty} (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\,f_{Z|X=x}(z)\,dz \\
&\overset{A}{=} (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\int_0^{\infty} f_{Z|X=x}(z)\,dz
\overset{B}{=} (2\pi)^{-1/2}\,c\,\frac{1}{c_2} \\
&= (2\pi)^{-1/2}\,\frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,
\frac{2^{n/2}\,2^{1/2}\,\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\left(x^2+n\right)^{(n+1)/2}} \\
&= \pi^{-1/2}\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma(n/2)}\,
n^{n/2}\,n^{-(n+1)/2}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2} \\
&\overset{C}{=} n^{-1/2}\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma(1/2)\,\Gamma(n/2)}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2} \\
&\overset{D}{=} n^{-1/2}\,\frac{1}{B\!\left(\frac{n}{2},\frac{1}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2}
= f_X(x)
\end{aligned}
\]
where: in step A we have used the fact that c and $c_2$ do not depend on z; in step B we have used the fact that the integral of a density function over its support is equal to 1; in step C we have used the fact that $\sqrt{\pi} = \Gamma(1/2)$; in step D we have used the definition of Beta function.
Since X is a zero-mean normal random variable with variance $1/z$, conditional on $Z = z$, we can also think of it as a ratio
\[ X = \frac{Y}{\sqrt{Z}} \]
where Y has a standard normal distribution, Z has a Gamma distribution and Y and Z are independent.
\[ \operatorname{E}[X] = 0 \]
Proof. It follows from the fact that the density function is symmetric around 0:
\[
\begin{aligned}
\operatorname{E}[X] &= \int_{-\infty}^{\infty} x f_X(x)\,dx
= \int_{-\infty}^{0} x f_X(x)\,dx + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{A}{=} -\int_0^{\infty} t f_X(-t)\,dt + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{B}{=} -\int_0^{\infty} x f_X(-x)\,dx + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{C}{=} -\int_0^{\infty} x f_X(x)\,dx + \int_0^{\infty} x f_X(x)\,dx = 0
\end{aligned}
\]
where: in step A we have performed the change of variable $t = -x$ in the first integral; in step B we have renamed the variable of integration; in step C we have used the fact that $f_X(-x) = f_X(x)$.
50.1.4 Variance
The variance of a standard Student's t random variable X is well-defined only for $n > 2$ and it is equal to
\[ \operatorname{Var}[X] = \frac{n}{n-2} \]
Proof. It can be derived thanks to the usual formula for computing the variance and to the integral representation of the Beta function:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx
= \int_{-\infty}^{0} x^2 f_X(x)\,dx + \int_0^{\infty} x^2 f_X(x)\,dx \\
&\overset{A,B,C}{=} 2\int_0^{\infty} x^2 f_X(x)\,dx
= 2c\int_0^{\infty} x^2\left(1+\frac{x^2}{n}\right)^{-(n+1)/2}dx \\
&\overset{D}{=} 2c\int_0^{\infty} nt\,(1+t)^{-(n+1)/2}\,\frac{\sqrt{n}}{2\sqrt{t}}\,dt
= c\,n^{3/2}\int_0^{\infty} t^{3/2-1}(1+t)^{-\left(\frac{3}{2}+\left(\frac{n}{2}-1\right)\right)}dt \\
&\overset{E}{=} c\,n^{3/2}\,B\!\left(\frac{3}{2},\,\frac{n}{2}-1\right)
\overset{F}{=} \frac{1}{\sqrt{n}\,B\!\left(\frac{n}{2},\frac{1}{2}\right)}\,n^{3/2}\,B\!\left(\frac{3}{2},\,\frac{n}{2}-1\right) \\
&\overset{G}{=} n\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}\cdot
\frac{\Gamma\!\left(\frac{3}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}
= n\,\frac{\Gamma\!\left(\frac{3}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}\right)} \\
&\overset{H}{=} n\,\frac{\frac{1}{2}\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{1}{2}\right)\left(\frac{n}{2}-1\right)\Gamma\!\left(\frac{n}{2}-1\right)}
= n\,\frac{1/2}{(n-2)/2} = \frac{n}{n-2}
\end{aligned}
\]
where: in steps A, B and C we have exploited the symmetry of the density, as in the proof that $\operatorname{E}[X] = 0$; in step D we have performed the change of variable $t = x^2/n$; in step E we have used the integral representation of the Beta function; in steps F and G we have used the definitions of c and of the Beta function; in step H we have used the relation
\[ \Gamma(z) = (z-1)\,\Gamma(z-1) \]
Finally:
\[ \operatorname{E}[X]^2 = 0 \]
and:
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = \frac{n}{n-2}
\]
From the above derivation, it should be clear that the variance is well-defined only when $n > 2$. Otherwise, if $n \leq 2$, the above improper integrals do not converge (and the Beta function is not well-defined).
The k-th moment of a standard Student's t random variable X is well-defined only for $k < n$; by the symmetry of the density it is zero when k is odd. It can be computed as
\[
\begin{aligned}
\mu_X(k) &= \operatorname{E}\!\left[X^k\right]
= \int_{-\infty}^{\infty} x^k f_X(x)\,dx
= \int_{-\infty}^{0} x^k f_X(x)\,dx + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{A}{=} \int_0^{\infty} (-t)^k f_X(-t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{B}{=} (-1)^k\int_0^{\infty} t^k f_X(-t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{C}{=} (-1)^k\int_0^{\infty} t^k f_X(t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx
= \left[1 + (-1)^k\right]\int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{D}{=} \left[1 + (-1)^k\right]\frac{c}{2}\,n^{(k+1)/2}\,B\!\left(\frac{k+1}{2},\,\frac{n-k}{2}\right) \\
&= \left[1 + (-1)^k\right]\frac{1}{2}\,n^{k/2}\,
\frac{\Gamma\!\left(\frac{k+1}{2}\right)\Gamma\!\left(\frac{n-k}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}\right)}
\end{aligned}
\]
where: in steps A to C we have exploited the symmetry of the density; in step D we have performed the change of variable $t = x^2/n$ and used the integral representation of the Beta function, as in the derivation of the variance.
7 Sutradhar,
B. C. (1986) On the characteristic function of multivariate Student t-distribution,
Canadian Journal of Statistics, 14, 329-337.
50.2. THE STUDENT’S T DISTRIBUTION IN GENERAL 415
tcdf(x,n)
returns the value of the distribution function at the point x when the degrees of
freedom parameter is equal to n.
50.2 The Student's t distribution in general
50.2.1 Definition
The Student’s t distribution is characterized as follows:
De…nition 262 Let X be an absolutely continuous random variable. Let its sup-
port be the whole set of real numbers:
RX = R
where c is a constant:
1 1
c= p
n B 2 ; 12
n
X= + Z
Proof. This can be easily proved using the formula for the density of a function9
of an absolutely continuous variable:
1
1 dg (x)
fX (x) = fZ g (x)
dx
x 1
= fZ
! 1
2 (n+1)
2
1 (x )
= c 1+ 2
n
\[
\operatorname{E}[X] = \operatorname{E}[\mu + \sigma Z] = \mu + \sigma\operatorname{E}[Z] = \mu + \sigma\cdot 0 = \mu
\]
As we have seen above, $\operatorname{E}[Z]$ is well-defined only for $n > 1$ and, as a consequence, $\operatorname{E}[X]$ is also well-defined only for $n > 1$.
50.2.4 Variance
The variance of a Student's t random variable X is well-defined only for $n > 2$ and it is equal to
\[
\operatorname{Var}[X] = \sigma^2\,\frac{n}{n-2}
\]
Proof. It can be derived using the formula for the variance of linear transformations on $X = \mu + \sigma Z$ (where Z has a standard t distribution):
\[
\operatorname{Var}[X] = \operatorname{Var}[\mu + \sigma Z] = \sigma^2\operatorname{Var}[Z] = \sigma^2\,\frac{n}{n-2}
\]
As we have seen above, $\operatorname{Var}[Z]$ is well-defined only for $n > 2$ and, as a consequence, $\operatorname{Var}[X]$ is also well-defined only for $n > 2$.
The distribution function of X can be expressed in terms of the distribution function of a standard Student's t random variable Z:
\[
F_X(x) = F_Z\!\left(\frac{x-\mu}{\sigma}\right)
\]
X= + Y
2
which is a normal random variable with mean and variance .
Exercise 1
Let $X_1$ be a normal random variable with mean $\mu = 0$ and variance $\sigma^2 = 4$. Let
X2 be a Gamma random variable with parameters n = 10 and h = 3, independent
of X1 . Find the distribution of the ratio
\[ X = \frac{X_1}{\sqrt{X_2}} \]
Solution
We can write:
\[
X = \frac{X_1}{\sqrt{X_2}} = \frac{2}{\sqrt{3}}\,\frac{Y}{\sqrt{Z}}
\]
where $Y = X_1/2$ has a standard normal distribution and $Z = X_2/3$ has a Gamma distribution with parameters $n = 10$ and $h = 1$. Therefore, the ratio
\[ \frac{Y}{\sqrt{Z}} \]
has a standard Student's t distribution with $n = 10$ degrees of freedom, so X has a Student's t distribution with mean $\mu = 0$, scale $\sigma^2 = 4/3$ and 10 degrees of freedom.
1 3 See p. 535.
1 4 See p. 557.
Exercise 2
Let $X_1$ be a normal random variable with mean $\mu = 3$ and variance $\sigma^2 = 1$. Let
X2 be a Gamma random variable with parameters n = 15 and h = 2, independent
of X1 . Find the distribution of the random variable
\[ X = \sqrt{\frac{2}{X_2}}\,(X_1 - 3) \]
X2
Solution
We can write:
\[
X = \sqrt{\frac{2}{X_2}}\,(X_1 - 3) = \frac{Y}{\sqrt{Z}}
\]
where Y = X1 3 has a standard normal distribution and Z = X2 =2 has a Gamma
distribution with parameters n = 15 and h = 1. Therefore, the ratio
Y
p
Z
has a standard Stutent’s t distribution with n = 15 degrees of freedom.
Exercise 3
Let X be a Student's t random variable with mean $\mu = 1$, scale $\sigma^2 = 4$ and $n = 6$ degrees of freedom. Compute:
\[ P(0 \leq X \leq 1) \]
Solution
First of all, we need to write the probability in terms of the distribution function
of X:
\[
\begin{aligned}
P(0 \leq X \leq 1)
&= P(X \leq 1) - P(X < 0) \\
&\overset{A}{=} P(X \leq 1) - P(X \leq 0) \\
&\overset{B}{=} F_X(1) - F_X(0)
\end{aligned}
\]
where: in step A we have used the fact that any speci…c value of X has probability
zero; in step B we have used the de…nition of distribution function. Then, we
express the distribution function FX (x) in terms of the distribution function of a
standard Student’s t random variable Z with n = 6 degrees of freedom:
x 1
FX (x) = FZ
2
so that:
$$\mathrm{P}(0 \le X \le 1) = F_X(1) - F_X(0) = F_Z(0) - F_Z\!\left(-\frac{1}{2}\right) = 0.1826$$
where the difference $F_Z(0) - F_Z(-1/2)$ can be computed with a computer algorithm, for example using the MATLAB command

tcdf(0,6)-tcdf(-1/2,6)
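The same number can be reproduced without any statistical library by numerically integrating the standard Student's t density with 6 degrees of freedom (a sketch; the grid size is an arbitrary choice):

```python
from math import gamma, pi, sqrt

def t_pdf(z, n):
    """Density of a standard Student's t variable with n degrees of freedom."""
    c = gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
    return c * (1 + z * z / n) ** (-(n + 1) / 2)

def t_prob(a, b, n, steps=10_000):
    """P(a <= Z <= b) via the composite Simpson rule (steps must be even)."""
    h = (b - a) / steps
    total = t_pdf(a, n) + t_pdf(b, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, n)
    return total * h / 3

# P(0 <= X <= 1) for X with mean 1, scale sigma^2 = 4 and n = 6
# equals P(-1/2 <= Z <= 0) for the standard t variable Z.
print(round(t_prob(-0.5, 0.0, 6), 4))  # 0.1826
```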
Chapter 51

F distribution
$$X = \frac{Y_1/n_1}{Y_2/n_2}$$

51.1 Definition

The F distribution is characterized as follows:

Definition 264 Let $X$ be an absolutely continuous random variable. Let its support be the set of positive real numbers:
$$R_X = [0, \infty)$$
where $c$ is a constant:
$$c = \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)}$$
where:
$$f_{X|Z=z}(x) = \frac{(n_1 z)^{n_1/2}}{2^{n_1/2}\,\Gamma(n_1/2)}\, x^{n_1/2 - 1} \exp\!\left(-\frac{1}{2}\, n_1 z\, x\right)$$
and
$$f_Z(z) = \frac{n_2^{n_2/2}}{2^{n_2/2}\,\Gamma(n_2/2)}\, z^{n_2/2 - 1} \exp\!\left(-\frac{1}{2}\, n_2 z\right)$$
Let us start from the integrand function:
$$\begin{aligned}
f_{X|Z=z}(x)\, f_Z(z) &= \frac{(n_1 z)^{n_1/2}\, n_2^{n_2/2}}{2^{n_1/2}\,\Gamma(n_1/2)\; 2^{n_2/2}\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, z^{n_2/2-1} \exp\!\left(-\frac{1}{2}(n_1 x + n_2)\, z\right) \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, z^{(n_1+n_2)/2 - 1} \exp\!\left(-\frac{1}{2}(n_1 x + n_2)\, z\right) \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c}\, f_{Z|X=x}(z)
\end{aligned}$$
where
$$c = \frac{(n_1 x + n_2)^{(n_1+n_2)/2}}{2^{(n_1+n_2)/2}\,\Gamma\!\left((n_1+n_2)/2\right)}$$
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters
$$n_1 + n_2 \quad\text{and}\quad \frac{n_1+n_2}{n_1 x + n_2}$$
Therefore:
$$\begin{aligned}
\int_0^\infty f_{X|Z=z}(x)\, f_Z(z)\, dz &= \int_0^\infty \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c}\, f_{Z|X=x}(z)\, dz \\
&\overset{A}{=} \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c} \int_0^\infty f_{Z|X=x}(z)\, dz \\
&\overset{B}{=} \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c} \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}\,\Gamma\!\left((n_1+n_2)/2\right)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, (n_1 x + n_2)^{-(n_1+n_2)/2} \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)}\, x^{n_1/2-1} \left(1 + \frac{n_1}{n_2}\, x\right)^{-(n_1+n_2)/2}
\end{aligned}$$
where: in step A we have used the fact that $c$ does not depend on $z$; in step B we have used the fact that the integral of a density function over its support is equal to 1; in step C we have used the definition of Beta function.
51.5 Variance

The variance of an F random variable $X$ is well-defined only for $n_2 > 4$ and is equal to
$$\mathrm{Var}[X] = \frac{2\, n_2^2\, (n_1 + n_2 - 2)}{n_1\, (n_2 - 2)^2\, (n_2 - 4)}$$
Proof. It can be derived thanks to the usual formula for computing the variance⁵ and to the integral representation of the Beta function:
$$\begin{aligned}
\mathrm{E}\!\left[X^2\right] &= \int_{-\infty}^{\infty} x^2\, f_X(x)\, dx = \int_0^\infty x^2\, c\, x^{n_1/2-1}\left(1+\frac{n_1}{n_2}x\right)^{-(n_1+n_2)/2} dx \\
&= c \int_0^\infty x^{n_1/2+1}\left(1+\frac{n_1}{n_2}x\right)^{-(n_1+n_2)/2} dx \\
&\overset{A}{=} c \int_0^\infty \left(\frac{n_2}{n_1}\,t\right)^{n_1/2+1} (1+t)^{-(n_1+n_2)/2}\, \frac{n_2}{n_1}\, dt \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} \int_0^\infty t^{n_1/2+1}\, (1+t)^{-n_1/2-n_2/2}\, dt \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} \int_0^\infty t^{(n_1/2+2)-1}\, (1+t)^{-(n_1/2+2)-(n_2/2-2)}\, dt \\
&\overset{B}{=} c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)} \left(\frac{n_2}{n_1}\right)^{n_1/2+2} B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&= \left(\frac{n_2}{n_1}\right)^{2} \frac{1}{B(n_1/2,\, n_2/2)}\, B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&\overset{D}{=} \left(\frac{n_2}{n_1}\right)^{2} \frac{\Gamma(n_1/2+n_2/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \cdot \frac{\Gamma(n_1/2+2)\,\Gamma(n_2/2-2)}{\Gamma(n_1/2+2+n_2/2-2)} \\
&= \left(\frac{n_2}{n_1}\right)^{2} \frac{\Gamma(n_1/2+2)\,\Gamma(n_2/2-2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \\
&\overset{E}{=} \left(\frac{n_2}{n_1}\right)^{2} (n_1/2+1)(n_1/2)\, \frac{1}{(n_2/2-1)(n_2/2-2)} \\
&= \frac{n_2^2\,(n_1+2)\, n_1}{n_1^2\,(n_2-2)(n_2-4)} = \frac{n_2^2\,(n_1+2)}{n_1\,(n_2-2)(n_2-4)}
\end{aligned}$$
5 $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$. See p. 156.
where: in step A we have performed a change of variable ($t = \frac{n_1}{n_2}x$); in step B we have used the integral representation of the Beta function; in step C we have used the definition of $c$; in step D we have used the definition of Beta function; in step E we have used the following property of the Gamma function:
$$\Gamma(z) = (z-1)\,\Gamma(z-1)$$
Finally:
$$\mathrm{E}[X]^2 = \left(\frac{n_2}{n_2-2}\right)^2$$
and:
$$\begin{aligned}
\mathrm{Var}[X] &= \mathrm{E}\!\left[X^2\right] - \mathrm{E}[X]^2 \\
&= \frac{n_2^2\,(n_1+2)}{n_1\,(n_2-2)(n_2-4)} - \frac{n_2^2}{(n_2-2)^2} \\
&= \frac{n_2^2\left((n_1+2)(n_2-2) - n_1(n_2-4)\right)}{n_1\,(n_2-2)^2\,(n_2-4)} \\
&= \frac{n_2^2\left(n_1 n_2 - 2n_1 + 2n_2 - 4 - n_1 n_2 + 4 n_1\right)}{n_1\,(n_2-2)^2\,(n_2-4)} \\
&= \frac{n_2^2\,(2n_1 + 2n_2 - 4)}{n_1\,(n_2-2)^2\,(n_2-4)} = \frac{2\, n_2^2\,(n_1+n_2-2)}{n_1\,(n_2-2)^2\,(n_2-4)}
\end{aligned}$$
It is also clear that the second moment $\mathrm{E}[X^2]$ (and hence the variance) is well-defined only when $n_2 > 4$: when $n_2 \le 4$, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).
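As a quick sanity check (a sketch, not part of the text), the closed-form variance above can be compared against $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$ computed directly from the Gamma-function expression for the moments, here for the illustrative choice $n_1 = 6$, $n_2 = 18$:

```python
from math import gamma

def f_moment(k, n1, n2):
    """k-th moment of an F(n1, n2) variable, valid for n2 > 2k."""
    return (n2 / n1) ** k * gamma(n1 / 2 + k) * gamma(n2 / 2 - k) / (
        gamma(n1 / 2) * gamma(n2 / 2)
    )

def f_variance(n1, n2):
    """Closed-form variance, valid for n2 > 4."""
    return 2 * n2 ** 2 * (n1 + n2 - 2) / (n1 * (n2 - 2) ** 2 * (n2 - 4))

n1, n2 = 6, 18
var_from_moments = f_moment(2, n1, n2) - f_moment(1, n1, n2) ** 2
print(abs(var_from_moments - f_variance(n1, n2)) < 1e-12)  # True
```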
$$\begin{aligned}
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+k} \int_0^\infty t^{(n_1/2+k)-1}\, (1+t)^{-(n_1/2+k)-(n_2/2-k)}\, dt \\
&\overset{B}{=} c \left(\frac{n_2}{n_1}\right)^{n_1/2+k} B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)} \left(\frac{n_2}{n_1}\right)^{n_1/2+k} B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&= \left(\frac{n_2}{n_1}\right)^{k} \frac{1}{B(n_1/2,\, n_2/2)}\, B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&\overset{D}{=} \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+n_2/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \cdot \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2+k+n_2/2-k)} \\
&= \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}
\end{aligned}$$
where: in step A we have performed a change of variable ($t = \frac{n_1}{n_2}x$); in step B we have used the integral representation of the Beta function; in step C we have used the definition of $c$; in step D we have used the definition of Beta function. It is also clear that the $k$-th moment is well-defined only when $n_2 > 2k$: when $n_2 \le 2k$, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).
$$\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(t)\, dt = \int_{-\infty}^{x} c\, t^{n_1/2-1}\left(1+\frac{n_1}{n_2}\,t\right)^{-(n_1+n_2)/2} dt \\
&\overset{A}{=} c \int_{-\infty}^{n_1 x/n_2} \left(\frac{n_2}{n_1}\,s\right)^{n_1/2-1} (1+s)^{-n_1/2-n_2/2}\, \frac{n_2}{n_1}\, ds \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds \\
&\overset{B}{=} \frac{(n_1/n_2)^{n_1/2}}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)} \left(\frac{n_2}{n_1}\right)^{n_1/2} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds \\
&= \frac{1}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds
\end{aligned}$$
where: in step A we have performed a change of variable ($s = \frac{n_1}{n_2}t$); in step B we have used the definition of $c$.
Exercise 1

Let $X_1$ be a Gamma random variable with parameters $n_1 = 3$ and $h_1 = 2$. Let $X_2$ be another Gamma random variable, independent of $X_1$, with parameters $n_2 = 5$ and $h_2 = 6$. Find the expected value of the ratio:
$$\frac{X_1}{X_2}$$

Solution

We can write:
$$X_1 = 2 Z_1 \qquad X_2 = 6 Z_2$$
where $Z_1$ and $Z_2$ are two independent Gamma random variables, the parameters of $Z_1$ are $n_1 = 3$ and $h_1 = 1$ and the parameters of $Z_2$ are $n_2 = 5$ and $h_2 = 1$ (see the lecture entitled Gamma distribution - p. 397). Using this fact, the ratio becomes:
$$\frac{X_1}{X_2} = \frac{2}{6}\,\frac{Z_1}{Z_2} = \frac{1}{3}\,\frac{Z_1}{Z_2}$$
where $Z_1/Z_2$ has an F distribution with parameters $n_1 = 3$ and $n_2 = 5$. Therefore:
$$\mathrm{E}\!\left[\frac{X_1}{X_2}\right] = \mathrm{E}\!\left[\frac{1}{3}\,\frac{Z_1}{Z_2}\right] = \frac{1}{3}\,\mathrm{E}\!\left[\frac{Z_1}{Z_2}\right] = \frac{1}{3}\cdot\frac{n_2}{n_2-2} = \frac{1}{3}\cdot\frac{5}{5-2} = \frac{5}{9}$$
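This expectation can be spot-checked by simulation (a sketch; the sample size and seed are arbitrary). Note that in this book's parametrization a Gamma$(n, h)$ density is proportional to $z^{n/2-1} e^{-nz/(2h)}$, i.e. it has shape $n/2$ and scale $2h/n$:

```python
import random

random.seed(0)

# Gamma(n, h) in this book's parametrization: shape n/2, scale 2h/n (mean h).
def gamma_nh(n, h):
    return random.gammavariate(n / 2, 2 * h / n)

N = 100_000
est = sum(gamma_nh(3, 2) / gamma_nh(5, 6) for _ in range(N)) / N
print(round(est, 2))  # should be close to 5/9 ~ 0.56
```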
Exercise 2

Find the third moment of an F random variable with parameters $n_1 = 6$ and $n_2 = 18$.

Solution

We need to use the formula for the $k$-th moment of an F random variable:
$$\mu_X(k) = \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}$$
Setting $k = 3$, $n_1 = 6$ and $n_2 = 18$:
$$\mu_X(3) = 3^3\, \frac{\Gamma(6)\,\Gamma(6)}{\Gamma(3)\,\Gamma(9)} = 27 \cdot \frac{120 \cdot 120}{2 \cdot 40320} = \frac{135}{28} \approx 4.821$$
7 See p. 55.
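The arithmetic can be checked in a couple of lines (a sketch using the moment formula above):

```python
from math import gamma

def f_moment(k, n1, n2):
    """k-th moment of an F(n1, n2) random variable (requires n2 > 2k)."""
    return (n2 / n1) ** k * gamma(n1 / 2 + k) * gamma(n2 / 2 - k) / (
        gamma(n1 / 2) * gamma(n2 / 2)
    )

print(f_moment(3, 6, 18))  # 135/28, i.e. about 4.8214
```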
Chapter 52

Multinomial distribution

52.1.1 Definition

The distribution is characterized as follows.

Definition 266 Let $X$ be a $K \times 1$ discrete random vector. Let the support of $X$ be the set of $K \times 1$ vectors having one entry equal to 1 and all other entries equal to 0:
$$R_X = \left\{ x \in \{0,1\}^K : \sum_{j=1}^{K} x_j = 1 \right\}$$
If you are puzzled by the above definition of the joint pmf, note that when $(x_1, \ldots, x_K) \in R_X$ and $x_i$ is equal to 1, because the $i$-th outcome has been obtained, then all other entries are equal to 0 and
$$\prod_{j=1}^{K} p_j^{x_j} = p_1^{x_1} \cdots p_{i-1}^{x_{i-1}}\, p_i^{x_i}\, p_{i+1}^{x_{i+1}} \cdots p_K^{x_K} = p_1^{0} \cdots p_{i-1}^{0}\, p_i^{1}\, p_{i+1}^{0} \cdots p_K^{0} = p_i$$
$$\mathrm{E}[X_i] = p_i$$
$$\sigma_{ij} = \begin{cases} p_i\,(1-p_i) & \text{if } j = i \\ -p_i\, p_j & \text{if } j \ne i \end{cases} \tag{52.2}$$
If $j = i$, then
$$\sigma_{ii} = \mathrm{E}[X_i X_i] - \mathrm{E}[X_i]\,\mathrm{E}[X_i] = \mathrm{E}[X_i] - \mathrm{E}[X_i]^2 = p_i - p_i^2 = p_i\,(1-p_i)$$
where we have used the fact that $X_i^2 = X_i$, because $X_i$ can take only values 0 and 1. If $j \ne i$, then
$$\sigma_{ij} = \mathrm{E}[X_i X_j] - \mathrm{E}[X_i]\,\mathrm{E}[X_j] = 0 - p_i\, p_j = -p_i\, p_j$$
where we have used the fact that $X_i X_j = 0$, because $X_i$ and $X_j$ cannot be both equal to 1 at the same time.
2 See p. 116.
3 See p. 197.
4 See the lecture entitled Covariance matrix - p. 189.
5 See p. 297.
6 See p. 315.
52.2.1 Definition

Multinomial random vectors are characterized as follows.
$$X = Y_1 + \ldots + Y_n$$
$$[Y_1\ \ldots\ Y_n]$$
7 See p. 29.
$$\mathrm{E}[X] = n p$$
Proof. Using the fact that $X$ can be written as a sum of $n$ multinomials with parameters $p_1, \ldots, p_K$ and 1, we obtain
$$\mathrm{E}[X] = \mathrm{E}[Y_1] + \ldots + \mathrm{E}[Y_n] = n p$$
where the result $\mathrm{E}[Y_j] = p$ has been derived in the previous section (formula 52.1).
$$\mathrm{Var}[X] = n \Sigma$$
$$\sigma_{ij} = \begin{cases} p_i\,(1-p_i) & \text{if } j = i \\ -p_i\, p_j & \text{if } j \ne i \end{cases}$$
$$\mathrm{Var}[X] = \mathrm{Var}[Y_1 + \ldots + Y_n] \overset{A}{=} \mathrm{Var}[Y_1] + \ldots + \mathrm{Var}[Y_n] \overset{B}{=} n\,\mathrm{Var}[Y_1] \overset{C}{=} n \Sigma$$
where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the fact that $Y_1, \ldots, Y_n$ have the same distribution; in step C we have used formula (52.2) for the covariance matrix of $Y_1$.
8 See the lecture entitled Partitions - p. 27.
where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the definition of moment generating function of $Y_l$; in step C we have used formula (52.3) for the moment generating function of $Y_1$.

Proof. The derivation is similar to the derivation of the joint moment generating function:

where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the definition of characteristic function of $Y_l$; in step C we have used formula (52.4) for the characteristic function of $Y_1$.
Exercise 1

A shop selling two items, labeled A and B, needs to construct a probabilistic model of the sales that will be generated by its next 10 customers. Each time a customer arrives, only three outcomes are possible: 1) nothing is sold; 2) one unit of item A is sold; 3) one unit of item B is sold. It has been estimated that the probabilities of these three outcomes are 0.50, 0.25 and 0.25 respectively. Furthermore, the shopping behavior of a customer is independent of the shopping behavior of all other customers. Denote by $X$ a $3 \times 1$ vector whose entries $X_1$, $X_2$ and $X_3$ are equal to the number of times each of the three outcomes occurs. Derive the expected value and the covariance matrix of $X$.

Solution

The vector $X$ has a multinomial distribution with parameters $n = 10$ and
$$p = \begin{bmatrix} \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{4} \end{bmatrix}^{\top}$$
Exercise 2

Given the assumptions made in the previous exercise, suppose that item A costs $1,000 and item B costs $2,000. Derive the expected value and the variance of the total revenue generated by the 10 customers.

Solution

The total revenue $Y$ can be written as a linear transformation of the vector $X$:
$$Y = A X$$
where
$$A = [0 \quad 1{,}000 \quad 2{,}000]$$
By the linearity of the expected value operator, we obtain
$$\mathrm{E}[Y] = \mathrm{E}[A X] = A\,\mathrm{E}[X] = [0 \quad 1{,}000 \quad 2{,}000] \begin{bmatrix} 5 \\ 5/2 \\ 5/2 \end{bmatrix} = 0 \cdot 5 + 1{,}000 \cdot \frac{5}{2} + 2{,}000 \cdot \frac{5}{2} = 7{,}500$$
By using the formula for the covariance matrix of a linear transformation, we obtain $\mathrm{Var}[Y] = A\,\mathrm{Var}[X]\, A^{\top}$.
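Carrying out this computation numerically (a sketch; it uses $\mathrm{Var}[X] = n(\mathrm{diag}(p) - p p^{\top})$, which collects the entries $\sigma_{ij}$ given earlier in the chapter):

```python
n = 10
p = [0.5, 0.25, 0.25]
a = [0, 1000, 2000]

# Covariance matrix of the multinomial vector: n * (diag(p) - p p^T).
cov = [[n * ((p[i] if i == j else 0.0) - p[i] * p[j]) for j in range(3)]
       for i in range(3)]

mean_y = sum(a[i] * n * p[i] for i in range(3))
var_y = sum(a[i] * cov[i][j] * a[j] for i in range(3) for j in range(3))
print(mean_y, var_y)  # 7500.0 6875000.0
```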
Chapter 53

Multivariate normal distribution

53.1.1 Definition

Standard MV-N random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
1 See p. 375.
$$f_X(x) = (2\pi)^{-K/2} \exp\!\left(-\frac{1}{2}\, x^{\top} x\right)$$
where
$$f(x_i) = (2\pi)^{-1/2} \exp\!\left(-\frac{1}{2}\, x_i^2\right)$$
is the probability density function of a standard normal random variable³.
Therefore, the $K$ components of $X$ are $K$ mutually independent⁴ standard normal random variables. A more detailed proof follows.
Proof. As we have seen, the joint probability density function can be written as
$$f_X(x) = \prod_{i=1}^{K} f(x_i)$$
where $f(x_i)$ is the probability density function of a standard normal random variable. But $f(x_i)$ is also the marginal probability density function⁵ of the $i$-th component of $X$:
$$\begin{aligned}
f_{X_i}(x_i) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{j=1}^{K} f(x_j)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= f(x_i) \int_{-\infty}^{\infty} f(x_1)\, dx_1 \ldots \int_{-\infty}^{\infty} f(x_{i-1})\, dx_{i-1} \int_{-\infty}^{\infty} f(x_{i+1})\, dx_{i+1} \ldots \int_{-\infty}^{\infty} f(x_K)\, dx_K \\
&= f(x_i)
\end{aligned}$$
2 See p. 117.
3 See p. 376.
4 See p. 233.
5 See p. 120.
where, in the last step, we have used the fact that all the integrals are equal to
1, because they are integrals of probability density functions over their respective
supports. Therefore, the joint probability density function of X is equal to the
product of its marginals, which implies that the components of X are mutually
independent.
6 See p. 189.
7 See p. 234.
where: in step A we have used the fact that the components of $X$ are mutually independent⁸; in step B we have used the definition of moment generating function⁹. The moment generating function of a standard normal random variable¹⁰ is
$$M_{X_j}(t_j) = \exp\!\left(\frac{1}{2}\, t_j^2\right)$$
which implies that the joint mgf of $X$ is
$$M_X(t) = \prod_{j=1}^{K} M_{X_j}(t_j) = \prod_{j=1}^{K} \exp\!\left(\frac{1}{2}\, t_j^2\right) = \exp\!\left(\frac{1}{2}\sum_{j=1}^{K} t_j^2\right) = \exp\!\left(\frac{1}{2}\, t^{\top} t\right)$$
The mgf $M_{X_j}(t_j)$ of a standard normal random variable is defined for any $t_j \in \mathbb{R}$. As a consequence, the joint mgf of $X$ is defined for any $t \in \mathbb{R}^K$.
$$\varphi_X(t) = \exp\!\left(-\frac{1}{2}\, t^{\top} t\right)$$
8 See Mutual independence via expectations (p. 234).
9 See p. 297.
10 See p. 378.
where: in step A we have used the fact that the components of $X$ are mutually independent; in step B we have used the definition of joint characteristic function¹¹. The characteristic function of a standard normal random variable is¹²
$$\varphi_{X_j}(t_j) = \exp\!\left(-\frac{1}{2}\, t_j^2\right)$$

53.2.1 Definition

MV-N random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
11 See p. 315.
12 See p. 379.
$$f_X(x) = (2\pi)^{-K/2} \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top} V^{-1} (x-\mu)\right)$$
We indicate that $X$ has a multivariate normal distribution with mean $\mu$ and covariance $V$ by
$$X \sim N(\mu, V)$$
The $K$ random variables $X_1, \ldots, X_K$ constituting the vector $X$ are said to be jointly normal.
$$X = \mu + \Sigma Z \tag{53.1}$$
Proof. This is proved using the formula for the joint density of a linear function¹³ of an absolutely continuous random vector:
$$\begin{aligned}
f_X(x) &= \frac{1}{\left|\det(\Sigma)\right|}\, f_Z\!\left(\Sigma^{-1}(x-\mu)\right) \\
&= \frac{1}{\left|\det(\Sigma)\right|}\,(2\pi)^{-K/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma^{-1}\right)^{\top}\Sigma^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det(\Sigma)\,\det(\Sigma^{\top})\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma\Sigma^{\top}\right)^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det\!\left(\Sigma\Sigma^{\top}\right)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma\Sigma^{\top}\right)^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top} V^{-1}(x-\mu)\right)
\end{aligned}$$
The existence of a matrix $\Sigma$ satisfying $V = \Sigma\Sigma^{\top}$ is guaranteed by the fact that $V$ is symmetric and positive definite.
13 See p. 279. Note that $X = g(Z) = \mu + \Sigma Z$ is a linear one-to-one mapping because $\Sigma$ is invertible.
$$\mathrm{E}[X] = \mu$$
$$\mathrm{E}[X] = \mathrm{E}[\mu + \Sigma Z] = \mu + \Sigma\,\mathrm{E}[Z] = \mu + \Sigma \cdot 0 = \mu$$
$$\mathrm{Var}[X] = V$$
where
$$M_Z(t) = \exp\!\left(\frac{1}{2}\, t^{\top} t\right)$$
14 See, in particular, the Addition to constant matrices (p. 148) and Multiplication by constant
$$\varphi_X(t) = \exp\!\left(i\, t^{\top}\mu - \frac{1}{2}\, t^{\top} V t\right)$$
where
$$\varphi_Z(t) = \exp\!\left(-\frac{1}{2}\, t^{\top} t\right)$$
In other words, mutually independent normal random variables are also jointly
normal.
Proof. This can be proved by showing that the product of the probability density
functions of X1 ; : : : ; XK is equal to the joint probability density function of X (this
is left as an exercise).
Exercise 1

Let $X = [X_1\ X_2]^{\top}$ be a multivariate normal random vector with mean
$$\mu = [1\ 2]^{\top}$$
$$Y = X_1 + X_2$$

Solution

The random variable $Y$ can be written as
$$Y = B X$$
where
$$B = [1\ 1]$$
By using the formula for the joint moment generating function of a linear transformation of a random vector²⁰
$$M_Y(t) = M_X\!\left(B^{\top} t\right)$$

Exercise 2

Let $X = [X_1\ X_2]^{\top}$ be a multivariate normal random vector with mean
$$\mu = [2\ 3]^{\top}$$
$$\mathrm{E}\!\left[X_1^2 X_2\right]$$
20 See p. 301.
21 See p. 285.
Solution

The joint mgf of $X$ is
$$\begin{aligned}
M_X(t) &= \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right) \\
&= \exp\!\left(2 t_1 + 3 t_2 + \frac{1}{2}\left(2 t_1^2 + 2 t_2^2 + 2 t_1 t_2\right)\right) \\
&= \exp\!\left(2 t_1 + 3 t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
$$\mathrm{E}\!\left[X_1^2 X_2\right] = \left.\frac{\partial^3 M_X(t_1, t_2)}{\partial t_1^2\, \partial t_2}\right|_{t_1=0,\, t_2=0}$$
$$\frac{\partial M_X(t_1,t_2)}{\partial t_1} = (2 + 2t_1 + t_2) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)$$
$$\begin{aligned}
\frac{\partial^2 M_X(t_1,t_2)}{\partial t_1^2} &= \frac{\partial}{\partial t_1}\frac{\partial M_X(t_1,t_2)}{\partial t_1} \\
&= 2 \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) + (2 + 2t_1 + t_2)^2 \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
$$\begin{aligned}
\frac{\partial^3 M_X(t_1,t_2)}{\partial t_1^2\,\partial t_2} &= \frac{\partial}{\partial t_2}\frac{\partial^2 M_X(t_1,t_2)}{\partial t_1^2} \\
&= 2\,(3 + 2t_2 + t_1) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) \\
&\quad + 2\,(2 + 2t_1 + t_2) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) \\
&\quad + (2 + 2t_1 + t_2)^2\,(3 + 2t_2 + t_1) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
Thus,
$$\mathrm{E}\!\left[X_1^2 X_2\right] = \left.\frac{\partial^3 M_X(t_1,t_2)}{\partial t_1^2\,\partial t_2}\right|_{t_1=0,\, t_2=0} = 2 \cdot 3 \cdot 1 + 2 \cdot 2 \cdot 1 + 2^2 \cdot 3 \cdot 1 = 6 + 4 + 12 = 22$$
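The value 22 can be double-checked by differentiating the mgf numerically with central finite differences (a sketch; the step size is an arbitrary choice):

```python
from math import exp

def mgf(t1, t2):
    # Joint mgf derived above.
    return exp(2 * t1 + 3 * t2 + t1 ** 2 + t2 ** 2 + t1 * t2)

def third_mixed_partial(h=1e-3):
    """Approximates d^3 M / (dt1^2 dt2) at (0, 0) by finite differences."""
    def d2_t1(t2):
        # Central second difference in t1.
        return (mgf(h, t2) - 2 * mgf(0, t2) + mgf(-h, t2)) / h ** 2
    # Central first difference in t2.
    return (d2_t1(h) - d2_t1(-h)) / (2 * h)

print(round(third_mixed_partial()))  # 22
```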
Chapter 54

Multivariate Student's t distribution

This lecture deals with the multivariate (MV) Student's t distribution. We first introduce the special case in which the mean is equal to zero and the scale matrix is equal to the identity matrix. We then deal with the more general case.

54.1.1 Definition

Standard multivariate Student's t random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
where
$$c = (n\pi)^{-K/2}\, \frac{\Gamma(n/2 + K/2)}{\Gamma(n/2)}$$
where:
$$\begin{aligned}
f_{X|Z=z}(x) &= c_1 \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\, x^{\top} V^{-1} x\right) \\
&= c_1 \left(\det\!\left(\frac{1}{z}\, I\right)\right)^{-1/2} \exp\!\left(-\frac{1}{2}\, x^{\top}\left(\frac{1}{z}\, I\right)^{-1} x\right) \\
&= c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right)
\end{aligned}$$
where
$$c_1 = (2\pi)^{-K/2}$$
3 See p. 407.
4 See p. 57.
5 See p. 439.
6 See p. 397.
$$f_Z(z) = c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right)$$
where
$$c_2 = \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}$$
where
$$f_{X|Z=z}(x) = c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right)$$
and
$$f_Z(z) = c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right)$$
We start from the integrand function:
$$\begin{aligned}
f_{X|Z=z}(x)\, f_Z(z) &= c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right) c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right) \\
&= c_1 c_2\, z^{(n+K)/2 - 1} \exp\!\left(-\frac{1}{2}\left(x^{\top} x + n\right) z\right) \\
&= c_1 c_2\, \frac{1}{c_3}\, c_3\, z^{(n+K)/2 - 1} \exp\!\left(-\frac{n+K}{2\,\dfrac{n+K}{x^{\top} x + n}}\, z\right) \\
&= c_1 c_2\, \frac{1}{c_3}\, f_{Z|X=x}(z)
\end{aligned}$$
where
$$c_3 = \frac{\left(x^{\top} x + n\right)^{(n+K)/2}}{2^{(n+K)/2}\,\Gamma\!\left((n+K)/2\right)}$$
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters $n+K$ and $\dfrac{n+K}{x^{\top} x + n}$. Therefore,
$$\begin{aligned}
\int_0^\infty f_{X|Z=z}(x)\, f_Z(z)\, dz &= \int_0^\infty c_1 c_2\, \frac{1}{c_3}\, f_{Z|X=x}(z)\, dz \\
&\overset{A}{=} c_1 c_2\, \frac{1}{c_3} \int_0^\infty f_{Z|X=x}(z)\, dz
\end{aligned}$$
$$\begin{aligned}
&\overset{B}{=} c_1 c_2\, \frac{1}{c_3} \\
&= (2\pi)^{-K/2}\, \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\, \frac{2^{n/2}\, 2^{K/2}\,\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\left(x^{\top} x + n\right)^{(n+K)/2}} \\
&= (2\pi)^{-K/2}\, 2^{K/2}\, \frac{n^{n/2}\,\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)}\, n^{-(n+K)/2} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= \pi^{-K/2}\, n^{-K/2}\, \frac{\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= (n\pi)^{-K/2}\, \frac{\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= f_X(x)
\end{aligned}$$
where: in step A we have used the fact that $c_1$, $c_2$ and $c_3$ do not depend on $z$; in step B we have used the fact that the integral of a probability density function over its support is 1.
Since $X$ has a multivariate normal distribution with mean $0$ and covariance $V = \frac{1}{z}\, I$, conditional on $Z = z$, then we can also think of it as a ratio
$$X = \frac{1}{\sqrt{Z}}\, Y$$
54.1.4 Marginals

The marginal distribution of the $i$-th component of $X$ (denote it by $X_i$) is a standard Student's t distribution with $n$ degrees of freedom. It suffices to note that the marginal probability density function⁷ of $X_i$ can be written as
$$f_{X_i}(x_i) = \int_0^\infty f_{X_i|Z=z}(x_i)\, f_Z(z)\, dz$$
where $f_{X_i|Z=z}(x_i)$ is the marginal density of $X_i\,|\,Z = z$, i.e., the density of a normal random variable⁸ with mean 0 and variance $\frac{1}{z}$:
$$f_{X_i|Z=z}(x_i) = (2\pi)^{-1/2}\, z^{1/2} \exp\!\left(-\frac{1}{2}\, z\, x_i^2\right)$$
7 See p. 120.
8 See p. 375.
$$\begin{aligned}
f_{X_i}(x_i) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[\int_0^\infty f_{X|Z=z}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, f_Z(z)\, dz\right] dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_0^\infty \left[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X|Z=z}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K\right] f_Z(z)\, dz \\
&= \int_0^\infty f_{X_i|Z=z}(x_i)\, f_Z(z)\, dz
\end{aligned}$$
$$\mathrm{E}[X] = 0 \tag{54.1}$$
$$\begin{aligned}
\mathrm{Var}[X] &\overset{A}{=} \mathrm{E}\!\left[X X^{\top}\right] - \mathrm{E}[X]\,\mathrm{E}[X]^{\top} \\
&\overset{B}{=} \mathrm{E}\!\left[X X^{\top}\right] \\
&\overset{C}{=} \mathrm{E}\!\left[\mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right]\right] \\
&\overset{D}{=} \mathrm{E}\!\left[\mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right] - \mathrm{E}[X\,|\,Z = z]\,\mathrm{E}[X\,|\,Z = z]^{\top}\right] \\
&\overset{E}{=} \mathrm{E}\!\left[\mathrm{Var}[X\,|\,Z = z]\right] \\
&= \mathrm{E}\!\left[\frac{1}{z}\, I\right] = \mathrm{E}\!\left[\frac{1}{z}\right] I
\end{aligned}$$
where: in step A we have used the formula for computing the covariance matrix⁹; in step B we have used the fact that $\mathrm{E}[X] = 0$ (eq. 54.1); in step C we have used the Law of Iterated Expectations¹⁰; in step D we have used the fact that
$$\mathrm{E}[X\,|\,Z = z] = 0$$
and in step E the fact that
$$\mathrm{Var}[X\,|\,Z = z] = \mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right] - \mathrm{E}[X\,|\,Z = z]\,\mathrm{E}[X\,|\,Z = z]^{\top}$$
But
$$\begin{aligned}
\mathrm{E}\!\left[\frac{1}{z}\right] &= \int_0^\infty \frac{1}{z}\, f_Z(z)\, dz \\
&= \int_0^\infty \frac{1}{z}\, \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right) dz \\
&= \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)} \int_0^\infty z^{(n-2)/2 - 1} \exp\!\left(-\frac{1}{2}\, n z\right) dz \\
&= \frac{n^{n/2}\, 2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\,\Gamma(n/2)\, n^{(n-2)/2}} \int_0^\infty \frac{n^{(n-2)/2}}{2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}\, z^{(n-2)/2-1} \exp\!\left(-\frac{n-2}{2}\,\frac{n}{n-2}\, z\right) dz \\
&\overset{A}{=} \frac{2^{(n-2)/2}\, n^{n/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\, n^{(n-2)/2}\,\Gamma(n/2)} \int_0^\infty \varphi(z)\, dz \\
&\overset{B}{=} \frac{2^{(n-2)/2}\, n^{n/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\, n^{(n-2)/2}\,\Gamma(n/2)} \\
&\overset{C}{=} n\cdot\frac{1}{2}\cdot\frac{1}{(n-2)/2} \\
&= \frac{n}{n-2}
\end{aligned}$$
9 See p. 190.
10 See p. 225.
where
$$\varphi(z) = \frac{n^{(n-2)/2}}{2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}\, z^{(n-2)/2-1} \exp\!\left(-\frac{n-2}{2}\,\frac{n}{n-2}\, z\right)$$
$$\mathrm{Var}[X] = \mathrm{E}\!\left[\frac{1}{z}\right] I = \frac{n}{n-2}\, I$$
54.2.1 Definition

Multivariate Student's t random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
$$X \sim T(\mu, V, n)$$
$$\mathrm{E}[X] = \mathrm{E}[\mu + \Sigma Z] = \mu + \Sigma\,\mathrm{E}[Z] = \mu + \Sigma \cdot 0 = \mu$$
Exercise 1

Let $X$ be a multivariate normal random vector with mean $\mu = 0$ and covariance matrix $V$. Let $Z_1, \ldots, Z_n$ be $n$ normal random variables having zero mean and variance $\sigma^2$. Suppose that $Z_1, \ldots, Z_n$ are mutually independent, and also independent of $X$. Find the distribution of the random vector $Y$ defined as
$$Y = \frac{\sigma^2}{\sqrt{Z_1^2 + \ldots + Z_n^2}}\, X$$

Solution

We can write
$$\begin{aligned}
Y &= \frac{\sigma^2}{\sqrt{Z_1^2 + \ldots + Z_n^2}}\, X \\
&= \frac{\sigma^2}{\sqrt{n}\,\sqrt{\sigma^2\left(W_1^2 + \ldots + W_n^2\right)/n}}\, X \\
&= \frac{\sigma^2}{\sqrt{n}\,\sqrt{\sigma^2\left(W_1^2 + \ldots + W_n^2\right)/n}}\, \Sigma\, Q
\end{aligned}$$
where $W_i = Z_i/\sigma$ are standard normal variables and $Q$ is a standard multivariate normal vector such that $X = \Sigma Q$, with $V = \Sigma\Sigma^{\top}$.
12 See, in particular, the Addition to constant matrices (p. 148) and Multiplication by constant
Chapter 55

Wishart distribution

55.1 Definition

Wishart random matrices are characterized as follows:
$$R_W = \left\{ w \in \mathbb{R}^{K \times K} : w \text{ is symmetric and positive definite} \right\}$$
Let $n$ be a constant such that $n > K - 1$ and let $H$ be a symmetric and positive definite matrix. We say that $W$ has a Wishart distribution with parameters $n$ and $H$.
1 See p. 397.
2 See p. 387.
3 See p. 119.
4 See p. 439.
The parameter $n$ need not be an integer, but, when $n$ is not an integer, $W$ can no longer be interpreted as a sum of outer products of multivariate normal random vectors.
Proof. The proof of this proposition is quite lengthy and complicated. The interested reader might have a look at Ghosh and Sinha⁷ (2002).
$$\mathrm{E}[W] = H$$
Proof. We do not provide a fully general proof, but we prove this result only for the special case in which $n$ is an integer and $W$ can be written as
$$W = \sum_{i=1}^{n} X_i X_i^{\top}$$
$$\begin{aligned}
\mathrm{E}[W] &= \sum_{i=1}^{n} \mathrm{E}\!\left[X_i X_i^{\top}\right] \\
&\overset{A}{=} \sum_{i=1}^{n} \left(\mathrm{Var}[X_i] + \mathrm{E}[X_i]\,\mathrm{E}\!\left[X_i^{\top}\right]\right) \\
&\overset{B}{=} \sum_{i=1}^{n} \mathrm{Var}[X_i] \\
&= \sum_{i=1}^{n} \frac{1}{n}\, H = n\cdot\frac{1}{n}\, H = H
\end{aligned}$$
where: in step A we have used the fact that the covariance matrix of $X_i$ can be written as⁸
$$\mathrm{Var}[X_i] = \mathrm{E}\!\left[X_i X_i^{\top}\right] - \mathrm{E}[X_i]\,\mathrm{E}\!\left[X_i^{\top}\right]$$
and in step B we have used the fact that $\mathrm{E}[X_i] = 0$.
(see above). To compute this covariance, we first need to compute the following fourth cross-moment:
$$\mathrm{E}\!\left[X_{si}^2\, X_{sj}^2\right]$$
8 See p. 190.
9 Muirhead, R.J. (2005) Aspects of multivariate statistical theory, Wiley.
$$A = X X^{\top}$$
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
$$X X^{\top} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \begin{bmatrix} X_1 & X_2 \end{bmatrix} = \begin{bmatrix} X_1^2 & X_1 X_2 \\ X_2 X_1 & X_2^2 \end{bmatrix}$$
$$A = A^{\top}$$
$$x^{\top} A x > 0$$
$$x^{\top} A x = x^{\top} 0 = 0$$
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$
Chapter 56
Linear combinations of
normals
One property that makes the normal distribution extremely tractable from an ana-
lytical viewpoint is its closure under linear combinations: the linear combination of
two independent random variables having a normal distribution also has a normal
distribution. This lecture presents a multivariate generalization of this elementary
property and then discusses some special cases.
$$Y = A + B X$$
$$\mathrm{E}[Y] = A + B\mu$$
Proof. This is proved using the formula for the joint moment generating function of the linear transformation of a random vector². The joint moment generating function of $X$ is
$$M_X(t) = \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right)$$
Therefore, the joint moment generating function of $Y$ is
$$\begin{aligned}
M_Y(t) &= \exp\!\left(t^{\top} A\right) M_X\!\left(B^{\top} t\right) \\
&= \exp\!\left(t^{\top} A\right) \exp\!\left(t^{\top} B\mu + \frac{1}{2}\, t^{\top} B V B^{\top} t\right) \\
&= \exp\!\left(t^{\top}(A + B\mu) + \frac{1}{2}\, t^{\top} B V B^{\top} t\right)
\end{aligned}$$
$$\mathrm{E}[Y] = \mu_1 + \mu_2$$
and variance
$$\mathrm{Var}[Y] = \sigma_1^2 + \sigma_2^2$$
Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal³: the $2 \times 1$ random vector $X$ defined as
$$X = [X_1\ X_2]^{\top}$$
We can write:
$$Y = X_1 + X_2 = B X$$
where
$$B = [1\ 1]$$
3 See p. 446.
and variance
$$\mathrm{Var}[Y] = B\,\mathrm{Var}[X]\, B^{\top} = \sigma_1^2 + \sigma_2^2$$
2
and variance
K
X
2
Var [Y ] = i
i=1
Proof. This can be obtained, either generalizing the proof of Proposition 281, or
using Proposition 281 recursively (starting from the …rst two components of X,
then adding the third one and so on).
and variance
K
X
Var [Y ] = b2i 2
i
i=1
Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal: the $K \times 1$ random vector $X$ defined as
$$X = [X_1\ \ldots\ X_K]^{\top}$$
We can write:
$$Y = \sum_{i=1}^{K} b_i X_i = B X$$
where
$$B = [b_1\ \ldots\ b_K]$$
Therefore, according to Proposition 280, $Y$ has a (multivariate) normal distribution with mean:
$$\mathrm{E}[Y] = B\,\mathrm{E}[X] = \sum_{i=1}^{K} b_i\, \mu_i$$
and variance:
$$\mathrm{Var}[Y] = B\,\mathrm{Var}[X]\, B^{\top} = \sum_{i=1}^{K} b_i^2\, \sigma_i^2$$
Proposition 284 Let $X$ be a normal random variable with mean $\mu$ and variance $\sigma^2$. Let $a$ and $b$ be two constants (with $b \ne 0$). Then the random variable $Y$ defined by:
$$Y = a + b X$$
has a normal distribution with mean
$$\mathrm{E}[Y] = a + b\mu$$
and variance
$$\mathrm{Var}[Y] = b^2 \sigma^2$$
Proof. This is a consequence of the fact that mutually independent normal random vectors are jointly normal: the $Kn \times 1$ random vector $X$ defined as
$$X = [X_1^{\top}\ \ldots\ X_n^{\top}]^{\top}$$
has a multivariate normal distribution with mean
$$\mathrm{E}[X] = [\mu_1^{\top}\ \ldots\ \mu_n^{\top}]^{\top}$$
Exercise 1

Let
$$X = [X_1\ X_2]^{\top}$$
be a $2 \times 1$ normal random vector with mean
$$\mu = [1\ 3]^{\top}$$
and covariance matrix
$$V = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$$
Find the distribution of the random variable $Z$ defined as
$$Z = X_1 + 2 X_2$$
Solution

We can write
$$Z = B X$$
where
$$B = [1\ 2]$$
Being a linear transformation of a multivariate normal random vector, $Z$ is also multivariate normal. Actually, it is univariate normal, because it is a scalar. Its mean is
$$\mathrm{E}[Z] = B\mu = 1 \cdot 1 + 2 \cdot 3 = 7$$
and its variance is
$$\mathrm{Var}[Z] = B V B^{\top} = [1\ 2] \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = [1\ 2] \begin{bmatrix} 2 \cdot 1 + 1 \cdot 2 \\ 1 \cdot 1 + 3 \cdot 2 \end{bmatrix} = [1\ 2] \begin{bmatrix} 4 \\ 7 \end{bmatrix} = 1 \cdot 4 + 2 \cdot 7 = 18$$
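These two numbers can be reproduced by simulation, generating the correlated pair through a Cholesky factor of $V$ (a sketch; the sample size and seed are arbitrary choices):

```python
import random
from math import sqrt

random.seed(42)

# Cholesky factor L of V = [[2, 1], [1, 3]], so that V = L L^T.
l11 = sqrt(2.0)
l21 = 1.0 / l11
l22 = sqrt(3.0 - l21 ** 2)

N = 200_000
samples = []
for _ in range(N):
    q1, q2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = 1 + l11 * q1             # mean 1
    x2 = 3 + l21 * q1 + l22 * q2  # mean 3, correlated with x1
    samples.append(x1 + 2 * x2)   # Z = X1 + 2 X2

mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / (N - 1)
print(round(mean, 1), round(var, 1))  # close to 7 and 18
```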
Exercise 2

Let $X_1, \ldots, X_n$ be $n$ mutually independent standard normal random variables. Let $b \in (0,1)$ be a constant. Find the distribution of the random variable $Y$ defined as
$$Y = \sum_{i=1}^{n} b^i X_i$$

Solution

Being a linear combination of mutually independent normal random variables, $Y$ has a normal distribution with mean
$$\mathrm{E}[Y] = \sum_{i=1}^{n} b^i\, \mathrm{E}[X_i] = \sum_{i=1}^{n} b^i \cdot 0 = 0$$
and variance
$$\begin{aligned}
\mathrm{Var}[Y] &= \sum_{i=1}^{n} \left(b^i\right)^2 \mathrm{Var}[X_i] = \sum_{i=1}^{n} b^{2i} \cdot 1 \\
&\overset{A}{=} \sum_{i=1}^{n} c^i = c + c^2 + \ldots + c^n \\
&= c\left(1 + c + \ldots + c^{n-1}\right) = c\left(1 + c + \ldots + c^{n-1}\right)\frac{1-c}{1-c} \\
&= \frac{c}{1-c}\left(1 - c^n\right)
\end{aligned}$$
$$= \frac{c - c^{n+1}}{1 - c} = \frac{b^2 - b^{2n+2}}{1 - b^2}$$
where in step A we have defined $c = b^2$.
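The closed form can be checked against the raw sum for a few values of b and n (a sketch):

```python
def var_sum(b, n):
    """Raw variance: sum of b^(2i) for i = 1..n."""
    return sum(b ** (2 * i) for i in range(1, n + 1))

def var_closed(b, n):
    """Closed-form variance (b^2 - b^(2n+2)) / (1 - b^2)."""
    return (b ** 2 - b ** (2 * n + 2)) / (1 - b ** 2)

for b in (0.3, 0.5, 0.9):
    for n in (1, 5, 50):
        assert abs(var_sum(b, n) - var_closed(b, n)) < 1e-12
print("ok")
```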
Chapter 57

Partitioned multivariate normal vectors
$$X = \begin{bmatrix} X_a \\ X_b \end{bmatrix}$$

57.1 Notation

In what follows, we will denote by
$$\mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}$$
and
$$V = \begin{bmatrix} V_a & V_{ab} \\ V_{ab}^{\top} & V_b \end{bmatrix}$$
1 See p. 439.
2 See p. 193.
$$X_a \sim N(\mu_a, V_a) \qquad X_b \sim N(\mu_b, V_b)$$
$$X_a = A X$$
where $A$ is a $K_a \times K$ matrix whose entries are either zero or one. Thus, $X_a$ has a multivariate normal distribution, because it is a linear transformation of the multivariate normal random vector $X$, and multivariate normality is preserved³ by linear transformations. By the same token, also $X_b$ has a multivariate normal distribution, because it can be written as a linear transformation of $X$:
$$X_b = B X$$
Proof. $X_a$ and $X_b$ are independent if and only if their joint moment generating function is equal to the product of their individual moment generating functions⁴. Since $X_a$ is multivariate normal, its joint moment generating function is
$$M_{X_a}(t_a) = \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a\right)$$
$$M_{X_b}(t_b) = \exp\!\left(t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b\right)$$
The joint moment generating function of $X_a$ and $X_b$, which is just the joint moment generating function of $X$, is
$$\begin{aligned}
M_X(t) &= \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right) \\
&= \exp\!\left(\begin{bmatrix} t_a^{\top} & t_b^{\top} \end{bmatrix} \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} + \frac{1}{2}\begin{bmatrix} t_a^{\top} & t_b^{\top} \end{bmatrix} \begin{bmatrix} V_a & V_{ab} \\ V_{ab}^{\top} & V_b \end{bmatrix} \begin{bmatrix} t_a \\ t_b \end{bmatrix}\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + t_b^{\top}\mu_b + \frac{1}{2}\, t_a^{\top} V_a t_a + \frac{1}{2}\, t_b^{\top} V_b t_b + \frac{1}{2}\, t_b^{\top} V_{ab}^{\top} t_a + \frac{1}{2}\, t_a^{\top} V_{ab} t_b\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a + t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b + t_b^{\top} V_{ab}^{\top} t_a\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a\right) \exp\!\left(t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b\right) \exp\!\left(t_b^{\top} V_{ab}^{\top} t_a\right) \\
&= M_{X_a}(t_a)\, M_{X_b}(t_b)\, \exp\!\left(t_b^{\top} V_{ab}^{\top} t_a\right)
\end{aligned}$$
$$Q = X^{\top} A X$$
$$A^{\top} = A^{-1}$$
$$Y = A X$$
Then also $Y$ has a standard multivariate normal distribution, i.e. $Y \sim N(0, I)$.
In other words, the trace is equal to the sum of all the diagonal entries of $A$. The trace of $A$ enjoys the following important property:
$$\mathrm{tr}(A) = \sum_{i=1}^{K} \lambda_i$$
$$A = P D P^{\top}$$
where $Y_j$ is the $j$-th component of $Y$ and $D_{jj}$ is the $j$-th diagonal entry of $D$. Since $A$ is symmetric and idempotent, the diagonal entries of $D$ are either zero or one. Denote by $J$ the set
$$J = \{ j \le K : D_{jj} = 1 \}$$
2 See p. 387.
and by $r$ its cardinality, i.e. the number of diagonal entries of $D$ that are equal to 1. Since $D_{jj} \ne 1$ implies $D_{jj} = 0$, we can write
$$Q = \sum_{j=1}^{K} D_{jj} Y_j^2 = \sum_{j \in J} D_{jj} Y_j^2 = \sum_{j \in J} Y_j^2$$
But the components of a standard normal random vector are mutually independent standard normal random variables. Therefore, $Q$ is the sum of the squares of $r$ independent standard normal random variables. Hence, it has a Chi-square distribution with $r$ degrees of freedom³. Finally, by the properties of idempotent matrices and of the trace of a matrix (see above), $r$ is not only the number of diagonal entries of $D$ that are equal to 1, but also the sum of the eigenvalues of $A$. Since the trace of a matrix is equal to the sum of its eigenvalues, then $r = \mathrm{tr}(A)$.
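A concrete illustration (a sketch, not from the text): the centering matrix $M = I - \frac{1}{n}\iota\iota^{\top}$ used later in this chapter is symmetric and idempotent, and its trace is $n-1$, so $X^{\top} M X$ has a Chi-square distribution with $n-1$ degrees of freedom when $X$ is standard normal:

```python
n = 5
# Centering matrix M = I - (1/n) * ones * ones^T.
m = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

# M is idempotent: M @ M == M (up to rounding).
mm = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
assert all(abs(mm[i][j] - m[i][j]) < 1e-12 for i in range(n) for j in range(n))

trace = sum(m[i][i] for i in range(n))
print(trace)  # equals n - 1, the degrees of freedom of X^T M X
```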
$$T_1 = A X \qquad T_2 = B X$$
Proof. First of all, note that $T_1$ and $T_2$ are linear transformations of the same multivariate normal random vector $X$. Therefore, they are jointly normal (see the lecture entitled Linear combinations of normal random variables - p. 469). Their cross-covariance⁴ is
$$\begin{aligned}
\mathrm{Cov}[T_1, T_2] &= \mathrm{E}\!\left[(T_1 - \mathrm{E}[T_1])(T_2 - \mathrm{E}[T_2])^{\top}\right] \\
&= \mathrm{E}\!\left[(A X - \mathrm{E}[A X])(B X - \mathrm{E}[B X])^{\top}\right] \\
&\overset{A}{=} A\, \mathrm{E}\!\left[(X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\top}\right] B^{\top} \\
&\overset{B}{=} A\, \mathrm{Var}[X]\, B^{\top} \\
&\overset{C}{=} A B^{\top}
\end{aligned}$$
where: in step A we have used the linearity of the expected value; in step B we have used the definition of covariance matrix; in step C we have used the fact that $\mathrm{Var}[X] = I$. But, as we explained in the lecture entitled Partitioned multivariate normal vectors (p. 477), two jointly normal random vectors are independent if and only if their cross-covariance is equal to 0. In our case, the cross-covariance is equal to zero if and only if $A B^{\top} = 0$, which proves the proposition.
3 See p. 395.
4 See p. 193.
The following proposition gives a necessary and sufficient condition for the independence of two quadratic forms in the same standard multivariate normal random vector.
$$Q_1 = X^{\top} A X \qquad Q_2 = X^{\top} B X$$
$$T = A X \qquad Q = X^{\top} B X$$
$$T = A X \qquad Q = (B X)^{\top}(B X)$$
58.4 Examples

We discuss here some quadratic forms that are commonly found in statistics.

where: in step A we have used the fact that $M$ is symmetric; in step B we have used the fact that $M$ is idempotent.
Now define a new random vector
$$Z = \frac{1}{\sigma}\,(X - \mu\,\iota)$$
5 See p. 573.
6 See p. 583.
and note that $Z$ has a standard (mean zero and covariance $I$) multivariate normal distribution (see the lecture entitled Linear combinations of normal random variables - p. 469).
The sample variance can be written as
$$\begin{aligned}
s^2 &= \frac{1}{n-1}\, X^{\top} M X \\
&= \frac{1}{n-1}\,(X - \mu\iota + \mu\iota)^{\top} M (X - \mu\iota + \mu\iota) \\
&= \frac{\sigma^2}{n-1}\left(\frac{X - \mu\iota}{\sigma} + \frac{\mu}{\sigma}\,\iota\right)^{\top} M \left(\frac{X - \mu\iota}{\sigma} + \frac{\mu}{\sigma}\,\iota\right) \\
&= \frac{\sigma^2}{n-1}\left(Z + \frac{\mu}{\sigma}\,\iota\right)^{\top} M \left(Z + \frac{\mu}{\sigma}\,\iota\right) \\
&= \frac{\sigma^2}{n-1}\left(Z^{\top} M Z + \frac{\mu}{\sigma}\, Z^{\top} M \iota + \frac{\mu}{\sigma}\,\iota^{\top} M Z + \frac{\mu^2}{\sigma^2}\,\iota^{\top} M \iota\right)
\end{aligned}$$
The last three terms in the sum are equal to zero, because
$$M \iota = 0$$
So, the quadratic form $Z^{\top} M Z$ has a Chi-square distribution with $n-1$ degrees of freedom. Multiplying a Chi-square random variable with $n-1$ degrees of freedom by $\frac{\sigma^2}{n-1}$ one obtains a Gamma random variable with parameters $n-1$ and $\sigma^2$ (see the lecture entitled Gamma distribution - p. 403 - for more details).
So, summing up, the adjusted sample variance $s^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$.
Furthermore, the adjusted sample variance $s^2$ is independent of the sample mean $\bar{X}_n$, which is proved as follows. The sample mean can be written as
$$\bar{X}_n = \frac{1}{n}\,\iota^{\top} X$$
and the sample variance can be written as
$$s^2 = \frac{1}{n-1}\, X^{\top} M X$$
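The Gamma result for $s^2$ can be spot-checked by simulation (a sketch; the sample size, seed, and the choice $\sigma = 2$ are arbitrary): a Gamma$(n-1, \sigma^2)$ variable in this book's parametrization has mean $\sigma^2$, so the adjusted sample variance should average out near $\sigma^2$.

```python
import random

random.seed(1)

mu, sigma, n = 3.0, 2.0, 5
reps = 20_000

def adjusted_sample_variance():
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

avg = sum(adjusted_sample_variance() for _ in range(reps)) / reps
print(round(avg, 1))  # close to sigma^2 = 4
```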
Asymptotic theory
Chapter 59
Sequences of random
variables
One of the central topics in probability theory and statistics is the study of sequences of random variables, i.e. of sequences¹ $\{X_n\}$ whose generic element $X_n$ is a random variable.
There are several reasons why sequences of random variables are important:
59.1 Terminology
In this lecture, we introduce some terminology related to sequences of random
variables.
1 See p. 31.
2 See p. 141.
{Xn} = {xn}
second group of q successive terms of the sequence, Xn+k+1, ..., Xn+k+q. The second group is located k positions after the first group. Denote the joint distribution function7 of the first group of terms by

Fn+1,...,n+q(x1, ..., xq)

and the joint distribution function of the second group of terms by

Fn+k+1,...,n+k+q(x1, ..., xq)
∃ γ0 ∈ R : Var[Xn] = γ0, ∀n ∈ N

Strict stationarity (see 59.1.6) implies weak stationarity only if the mean E[Xn] and all the covariances Cov[Xn, Xn−j] exist and are finite.
Covariance stationarity does not imply strict stationarity, because the former imposes restrictions only on first and second moments, while the latter imposes restrictions on the whole distribution.
7 See p. 118.
8 Remember that Cov [Xn ; Xn ] = Var [Xn ].
for any two functions f and g. This is just the definition of independence9 between the two random vectors

[Xn+1 ... Xn+q]  (59.2)

and

[Xn+k+1 ... Xn+k+q]  (59.3)

Trivially, condition (59.1) can be rewritten as a statement about these two vectors. If this condition is true asymptotically (i.e. when k → ∞), then we say that the sequence {Xn} is mixing.

Definition 293 We say that a sequence of random variables {Xn} is mixing (or strongly mixing) if and only if condition (59.1) holds asymptotically (as k → ∞), for any n and q.

In other words, a sequence is strongly mixing if and only if the two random vectors (59.2) and (59.3) tend to become more and more independent by increasing k (for any n and q). This is a milder requirement than the requirement of independence: if {Xn} is an independent sequence (see 59.1.3), all its terms are independent from one another; if {Xn} is a mixing sequence, its terms can be dependent, but they become less and less dependent as the distance between their locations in the sequence increases. Of course, an independent sequence is also a mixing sequence, while the converse is not necessarily true.

We say that a subset A ⊆ R^N (the set of real sequences) is a shift invariant set if and only if the shifted sequence {xn}n>1 belongs to A whenever {xn} belongs to A.
Chapter 60

Sequences of random vectors

In this lecture, we generalize the concepts introduced in the lecture entitled Sequences of random variables (p. 491). We no longer consider sequences whose elements are random variables, but sequences {Xn} whose generic element Xn is a K × 1 random vector. The generalization is straightforward, as the terminology and the basic concepts are almost the same as those used for sequences of random variables.
60.1 Terminology
60.1.1 Realization of a sequence
Let {xn} be a sequence of K × 1 real vectors and {Xn} a sequence of K × 1 random vectors. If the real vector xn is a realization1 of the random vector Xn for every n, then we say that the sequence of real vectors {xn} is a realization of the sequence of random vectors {Xn} and we write

{Xn} = {xn}
Fn+1,...,n+q(x1, ..., xq) = Fn+k+1,...,n+k+q(x1, ..., xq)

i.e. the two random vectors

[X'n+1 ... X'n+q]'

and

[X'n+k+1 ... X'n+k+q]'

have the same distribution (for any n, k and q). Requiring strict stationarity is weaker than requiring that a sequence be IID (see the subsection IID sequences above): if {Xn} is an IID sequence, then it is also strictly stationary, while the converse is not necessarily true.
4 See p. 118.
∃ μ ∈ R^K : E[Xn] = μ, ∀n ∈ N   (1)

∀j ≥ 0, ∃ Γj ∈ R^(K×K) : Cov[Xn, Xn−j] = Γj, ∀n > j   (2)

where n and j are, of course, integers. Property (1) means that all the random vectors belonging to the sequence {Xn} have the same mean. Property (2) means that the cross-covariance5 between a term Xn of the sequence and the term that is located j positions before it (Xn−j) is always the same, irrespective of how Xn has been chosen. In other words, Cov[Xn, Xn−j] depends only on j and not on n. Note also that property (2) implies that all the random vectors in the sequence have the same covariance matrix (because Cov[Xn, Xn] = Var[Xn]):

∃ Γ0 ∈ R^(K×K) : Var[Xn] = Γ0, ∀n ∈ N
Definition 296 We say that a sequence of random vectors {Xn} is mixing (or strongly mixing) if and only if the asymptotic-independence condition of the scalar case holds for the vector terms.

Definition 297 A set A ⊆ (R^K)^N is shift invariant if and only if the shifted sequence belongs to A whenever the original sequence does.
Chapter 61

Pointwise convergence
This lecture discusses pointwise convergence. We deal …rst with pointwise con-
vergence of sequences of random variables and then with pointwise convergence of
sequences of random vectors.
Xn ! X pointwise
Example 300 Let Ω = {ω1, ω2} be a sample space with two sample points (ω1 and ω2). Let {Xn} be a sequence of random variables such that a generic term Xn of the sequence is defined by the two values Xn(ω1) and Xn(ω2) reported below.
1 See p. 492.
2 See p. 69.
3 See p. 33.
We need to check the convergence of the sequences {Xn(ω)} for all ω ∈ Ω, i.e. for ω = ω1 and for ω = ω2: (1) the sequence {Xn(ω1)}, whose generic term is

Xn(ω1) = 1/n

is a sequence of real numbers converging to 0; (2) the sequence {Xn(ω2)}, whose generic term is

Xn(ω2) = 1 + 2/n

is a sequence of real numbers converging to 1. Therefore, the sequence of random variables {Xn} converges pointwise to the random variable X, where X is defined as follows:

X(ω) = 0 if ω = ω1; X(ω) = 1 if ω = ω2
where d(Xn(ω), X(ω)) is the distance between a generic term of the sequence Xn(ω) and the limit X(ω). The distance between Xn(ω) and X(ω) is defined to be equal to the Euclidean norm of their difference:

d(Xn(ω), X(ω)) = ||Xn(ω) − X(ω)|| = √((Xn,1(ω) − X1(ω))² + ... + (Xn,K(ω) − XK(ω))²)

where the second subscript is used to indicate the individual components of the vectors Xn(ω) and X(ω). Thus, for a fixed ω, the sequence of real vectors {Xn(ω)} is convergent to a vector X(ω) if

lim(n→∞) d(Xn(ω), X(ω)) = 0

X is called the pointwise limit of the sequence and convergence is indicated by:

Xn → X pointwise

Now, denote by {Xn,i} the sequence of the i-th components of the vectors Xn. It can be proved that the sequence of random vectors {Xn} is pointwise convergent if and only if all the K sequences of random variables {Xn,i} are pointwise convergent.
Exercise 1
Let the sample space be6:

Ω = [0, 1]

and define a sequence of random variables {Xn} by

Xn(ω) = ω/(2n), ∀ω ∈ Ω

Find the pointwise limit of the sequence {Xn}.

Solution
For a fixed sample point ω, the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω/(2n) = 0

Therefore, the sequence {Xn} converges pointwise to the random variable X defined by

X(ω) = 0, ∀ω ∈ Ω
6 In other words, the sample space is the set of all real numbers between 0 and 1.
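A quick numerical sanity check of this limit; the probe points below are arbitrary illustrative choices:

```python
# Pointwise convergence check for X_n(omega) = omega / (2 n):
# for each fixed sample point omega, the real sequence tends to 0.
def X_n(n, omega):
    return omega / (2 * n)

for omega in (0.0, 0.3, 1.0):
    # far along the sequence, the term is within 1e-5 of the limit 0
    assert abs(X_n(10**6, omega)) < 1e-5
```
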
Exercise 2
Suppose the sample space is as in the previous exercise:

Ω = [0, 1]

Define a sequence of random variables {Xn} as follows:

Xn(ω) = (1 + ω/n)^n, ∀ω ∈ Ω

Find the pointwise limit of the sequence {Xn}.

Solution
For a given sample point ω, the sequence of real numbers {Xn(ω)} has limit7:

lim(n→∞) Xn(ω) = lim(n→∞) (1 + ω/n)^n = exp(ω)

Thus, the sequence of random variables {Xn} converges pointwise to the random variable X defined as follows:

X(ω) = exp(ω), ∀ω ∈ Ω
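The classic limit used in the solution can also be checked numerically; this is an illustrative sketch, not part of the exercise:

```python
import math

# (1 + omega/n)^n approaches exp(omega) as n grows
def X_n(n, omega):
    return (1 + omega / n) ** n

for omega in (0.0, 0.5, 1.0):
    assert abs(X_n(10**6, omega) - math.exp(omega)) < 1e-4
```
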
Exercise 3
Suppose the sample space is as in the previous exercises:

Ω = [0, 1]

Define a sequence of random variables {Xn} as follows:

Xn(ω) = ω^n, ∀ω ∈ Ω

Define a random variable X as follows:

X(ω) = 0, ∀ω ∈ Ω

Does the sequence {Xn} converge pointwise to the random variable X?

Solution
For ω ∈ [0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω^n = 0

However, for ω = 1, the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) 1^n = 1

Thus, the sequence of random variables {Xn} does not converge pointwise to the random variable X, but it converges pointwise to the random variable Y defined as follows:

Y(ω) = 0 if ω ∈ [0, 1); Y(ω) = 1 if ω = 1
7 Note that this limit is encountered very frequently and you can …nd a proof of it in most
calculus textbooks.
Chapter 62

Almost sure convergence
This lecture introduces the concept of almost sure convergence. In order to understand this lecture, you should first understand the concepts of almost sure property and almost sure event1 and the concept of pointwise convergence of a sequence of random variables2.
We deal first with almost sure convergence of sequences of random variables and then with almost sure convergence of sequences of random vectors.
In other words, almost sure convergence requires that the sequences {Xn(ω)} converge for all sample points ω ∈ Ω, except, possibly, for a very small set F^c of sample points (F^c must be included in a zero-probability event).
{Xn(ω)} converges to X(ω) almost surely, i.e. if there exists a zero-probability event E such that:

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)} ⊆ E

X is called the almost sure limit of the sequence and convergence is indicated by:

Xn →a.s. X

The following is an example of a sequence that converges almost surely:

Example 304 Suppose the sample space is:

Ω = [0, 1]

It is possible to build a probability measure P on Ω, such that P assigns to each sub-interval of [0, 1] a probability equal to its length4:

if 0 ≤ a ≤ b ≤ 1 and E = [a, b], then P(E) = b − a

Remember that in this probability model all the sample points ω ∈ Ω are assigned zero probability (each sample point, when considered as an event, is a zero-probability event):

∀ω ∈ Ω, P({ω}) = P([ω, ω]) = ω − ω = 0

Now, consider a sequence of random variables {Xn} defined as follows:

Xn(ω) = 1 if ω = 0; Xn(ω) = 1/n if ω ≠ 0

When ω ∈ (0, 1], the sequence of real numbers {Xn(ω)} converges to 0, because:

lim(n→∞) Xn(ω) = lim(n→∞) 1/n = 0

However, when ω = 0, the sequence of real numbers {Xn(ω)} is not convergent to 0, because:

lim(n→∞) Xn(ω) = lim(n→∞) 1 = 1

Define a constant random variable X:

X(ω) = 0, ∀ω ∈ [0, 1]

We have that:

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)} = {0}

But P({0}) = 0 because:

P({0}) = P([0, 0]) = 0 − 0 = 0

which means that the event

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)}

is a zero-probability event. Therefore, the sequence {Xn} converges to X almost surely, but it does not converge pointwise to X because {Xn(ω)} does not converge to X(ω) for all ω ∈ Ω.
4 See the lecture entitled Zero-probability events (p. 79).
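Example 304 is easy to probe numerically; this sketch just evaluates the sequence at a few sample points (the helper function mirrors the definition in the example):

```python
# X_n(omega) = 1 if omega == 0, else 1/n: converges to 0 for every
# omega in (0, 1], and stays at 1 only on the zero-probability point {0}.
def X_n(n, omega):
    return 1.0 if omega == 0 else 1.0 / n

assert abs(X_n(10**6, 0.5)) < 1e-5   # converges at a typical sample point
assert X_n(10**6, 0) == 1.0          # fails to converge to 0 only at omega = 0
```
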
Now, denote by fXn;i g the sequence of the i-th components of the vectors
Xn . It can be proved that the sequence of random vectors fXn g is almost surely
convergent if and only if all the K sequences of random variables fXn;i g are almost
surely convergent.
Exercise 1
Let the sample space be:

Ω = [0, 1]

Sub-intervals of [0, 1] are assigned a probability equal to their length:

if 0 ≤ a ≤ b ≤ 1 and E = [a, b], then P(E) = b − a

Define a sequence of random variables {Xn} as follows:

Xn(ω) = ω^n, ∀ω ∈ Ω
5 See p. 497.
Define a random variable X as follows:

X(ω) = 0, ∀ω ∈ Ω

Does the sequence {Xn} converge almost surely to X?
Solution
For a fixed sample point ω ∈ [0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω^n = 0

Therefore, the sequence of random variables {Xn} does not converge pointwise to X, because

lim(n→∞) Xn(ω) ≠ X(ω)

for ω = 1. However, the set of sample points ω such that {Xn(ω)} does not converge to X(ω) is a zero-probability event:

P({1}) = P([1, 1]) = 1 − 1 = 0

Therefore, the sequence {Xn} converges almost surely to X.
Exercise 2
Let {Xn} and {Yn} be two sequences of random variables defined on a sample space Ω. Let X and Y be two random variables defined on Ω such that:

Xn →a.s. X
Yn →a.s. Y

Prove that

Xn + Yn →a.s. X + Y

Solution
Denote by FX the set of sample points for which {Xn(ω)} converges to X(ω). By assumption, there exists a zero-probability event EX such that

FX^c ⊆ EX

where P(EX) = 0.
Denote by FY the set of sample points for which {Yn(ω)} converges to Y(ω). Again, there exists an event EY such that

FY^c ⊆ EY

where P(EY) = 0.
Now, denote by FXY the set of sample points for which {Xn(ω) + Yn(ω)} converges to X(ω) + Y(ω).
Observe that if ω ∈ FX ∩ FY then {Xn(ω) + Yn(ω)} converges to X(ω) + Y(ω), because the sum of two sequences of real numbers is convergent if the two sequences are convergent. Therefore

FX ∩ FY ⊆ FXY

Taking the complement of both sides, we obtain

FXY^c ⊆ (FX ∩ FY)^c =(A) FX^c ∪ FY^c ⊆(B) EX ∪ EY

where: in step A we have used De Morgan's law; in step B we have used the fact that FX^c ⊆ EX and FY^c ⊆ EY. But P(EX ∪ EY) ≤ P(EX) + P(EY) = 0, so FXY^c is included in a zero-probability event, which is exactly what almost sure convergence of Xn + Yn to X + Y requires.
Exercise 3
Let the sample space be:

Ω = [0, 1]

Sub-intervals of [0, 1] are assigned a probability equal to their length, as in Exercise 1 above.
Define a sequence of random variables {Xn} as follows:

Xn(ω) = 1 if ω ∈ (0, 1 − 1/n]; Xn(ω) = n otherwise

Find the almost sure limit (if it exists) of the sequence {Xn}.

Solution
If ω = 0 or ω = 1, then the sequence of real numbers {Xn(ω)} is not convergent, because Xn(ω) = n diverges.
For ω ∈ (0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = 1

because for any such ω we can find n0 such that ω ∈ (0, 1 − 1/n] for any n ≥ n0 (as a consequence Xn(ω) = 1 for any n ≥ n0).
Thus, the sequence of random variables {Xn} converges almost surely to the random variable X defined as:

X(ω) = 1, ∀ω ∈ Ω

because the set of sample points ω such that {Xn(ω)} does not converge to X(ω) is a zero-probability event:

P({0, 1}) = 0
Chapter 63

Convergence in probability
1 See p. 492.
for any ε > 0. X is called the probability limit of the sequence and convergence is indicated by

Xn →P X

or by

plim(n→∞) Xn = X

Example Let X be a discrete random variable with support

RX = {0, 1}

and probability mass function P(X = 0) = 2/3, P(X = 1) = 1/3, and define

Xn = (1 + 1/n) X

We want to prove that {Xn} converges in probability to X. Take any ε > 0. Note that:

|Xn − X| =(A) |(1 + 1/n) X − X| = |(1/n) X| =(B) (1/n) X

where: in step A we have used the definition of Xn; in step B we have used the fact that X cannot be negative. When X = 0, which happens with probability 2/3, we have that:

|Xn − X| = (1/n) X = 0

and, of course, |Xn − X| ≤ ε. When X = 1, which happens with probability 1/3, we have that

|Xn − X| = (1/n) X = 1/n
2 See p. 106.
and |Xn − X| ≤ ε if and only if 1/n ≤ ε (i.e. n ≥ 1/ε). Therefore:

P(|Xn − X| ≤ ε) = 2/3 if n < 1/ε; 1 if n ≥ 1/ε

and

P(|Xn − X| > ε) = 1 − P(|Xn − X| ≤ ε) = 1/3 if n < 1/ε; 0 if n ≥ 1/ε

Thus, P(|Xn − X| > ε) trivially converges to 0, because it is identically equal to 0 for all n such that n ≥ 1/ε. Since ε was arbitrary, we have obtained the desired result:

lim(n→∞) P(|Xn − X| > ε) = 0

for any ε > 0.
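This argument is easy to mirror in a simulation; a minimal sketch, where the sample size and tolerance are arbitrary choices:

```python
import numpy as np

# X takes value 1 with probability 1/3 and 0 with probability 2/3;
# X_n = (1 + 1/n) X, so |X_n - X| = X / n.
rng = np.random.default_rng(0)
X = (rng.random(100_000) < 1 / 3).astype(float)

def prob_far(n, eps=0.01):
    Xn = (1 + 1 / n) * X
    return np.mean(np.abs(Xn - X) > eps)

assert abs(prob_far(10) - 1 / 3) < 0.01   # for n < 1/eps the probability is 1/3
assert prob_far(1000) == 0.0              # for n >= 1/eps it is exactly 0
```
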
d(Xn, X) = ||Xn − X|| = √((Xn,1 − X1)² + ... + (Xn,K − XK)²)

where the second subscript is used to indicate the individual components of the vectors Xn and X.
The following is a formal definition: the sequence of random vectors {Xn} converges in probability to X if and only if

lim(n→∞) P(d(Xn, X) > ε) = 0

for any ε > 0. X is called the probability limit of the sequence and convergence is indicated by

Xn →P X
3 See p. 497.
4 See p. 34.
or by

plim(n→∞) Xn = X

Now, denote by {Xn,i} the sequence of the i-th components of the vectors Xn. It can be proved that the sequence of random vectors {Xn} is convergent in probability if and only if all the K sequences of random variables {Xn,i} are convergent in probability.
Exercise 1
Let U be a random variable having a uniform distribution5 on the interval [0, 1]. In other words, U is an absolutely continuous random variable with support

RU = [0, 1]

and probability density function

fU(u) = 1 if u ∈ [0, 1]; 0 otherwise

Define a sequence of random variables {Xn} as follows:

X1 = 1{U∈[0,1]}
X2 = 1{U∈[0,1/2]}, X3 = 1{U∈[1/2,1]}
X4 = 1{U∈[0,1/4]}, X5 = 1{U∈[1/4,2/4]}, X6 = 1{U∈[2/4,3/4]}, X7 = 1{U∈[3/4,1]}
X8 = 1{U∈[0,1/8]}, X9 = 1{U∈[1/8,2/8]}, X10 = 1{U∈[2/8,3/8]}, ...
X16 = 1{U∈[0,1/16]}, X17 = 1{U∈[1/16,2/16]}, X18 = 1{U∈[2/16,3/16]}, ...
...

where 1{U∈[a,b]} is the indicator function6 of the event {U ∈ [a, b]}.
Find the probability limit (if it exists) of the sequence {Xn}.
5 See p. 359.
6 See p. 197.
Solution
A generic term Xn of the sequence, being an indicator function, can take only two values. Writing n = m + j, where [j/m, (j + 1)/m] is the interval whose indicator defines Xn, it takes value 1 with probability

P(Xn = 1) = P(U ∈ [j/m, (j + 1)/m]) = 1/m

and value 0 with probability

P(Xn = 0) = 1 − P(Xn = 1) = 1 − 1/m

Since m tends to infinity as n does,

lim(n→∞) P(Xn = 0) = lim(m→∞) (1 − 1/m) = 1

and, for any ε > 0,

lim(n→∞) P(|Xn − 0| > ε) =(A) lim(n→∞) P(Xn > ε) =(B) lim(n→∞) P(Xn = 1) = lim(m→∞) 1/m = 0

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value 1. Therefore, the probability limit of the sequence is 0.
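A simulation sketch of this "sliding indicator" construction (the interval endpoints follow the pattern in the exercise; the sample size is arbitrary):

```python
import numpy as np

# Each term is the indicator of U in [j/m, (j+1)/m]; its empirical mean
# estimates P(X_n = 1) = 1/m, which shrinks in later blocks of the sequence.
rng = np.random.default_rng(0)
u = rng.random(200_000)

def term(m, j):
    return ((u >= j / m) & (u <= (j + 1) / m)).astype(float)

assert abs(term(8, 3).mean() - 1 / 8) < 0.01
assert abs(term(64, 5).mean() - 1 / 64) < 0.01
```
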
Exercise 2
Does the sequence in the previous exercise also converge almost surely7 ?
7 See p. 505.
Solution
We can identify the sample space8 with the support of U:

Ω = RU = [0, 1]

and the sample points ω ∈ Ω with the realizations of U: when the realization is U = u, then ω = u. Almost sure convergence requires that

{ω ∈ Ω : lim(n→∞) Xn(ω) = X(ω)}^c ⊆ E

for some zero-probability event9 E. But for every sample point ω the sequence {Xn(ω)} fails to converge: however far along the sequence we go, the indicator intervals keep sweeping across [0, 1], so Xn(ω) takes the value 1 infinitely often and the value 0 infinitely often. Hence the set of sample points where {Xn(ω)} converges is empty, and, trivially, there does not exist a zero-probability event including the set

{ω ∈ Ω : lim(n→∞) Xn(ω) = X(ω)}^c

Therefore, the sequence does not converge almost surely.
Exercise 3
Let {Xn} be an IID sequence10 of continuous random variables having a uniform distribution with support

RXn = [−1/n, 1/n]

and probability density function

fXn(x) = n/2 if x ∈ [−1/n, 1/n]; 0 otherwise

Find the probability limit (if it exists) of the sequence {Xn}.

Solution
As n tends to infinity, the probability density tends to become concentrated around the point x = 0. Therefore, it seems reasonable to conjecture that the sequence {Xn} converges in probability to the constant random variable

X(ω) = 0, ∀ω ∈ Ω

To rigorously verify this claim we need to use the formal definition of convergence in probability. For any ε > 0,

lim(n→∞) P(|Xn − X| > ε) = lim(n→∞) P(|Xn| > ε) = 0

because |Xn| ≤ 1/n, so that P(|Xn| > ε) = 0 for all n ≥ 1/ε.
8 See p. 69.
9 See p. 79.
1 0 See p. 492.
Chapter 64

Mean-square convergence
Note that d(Xn, X) is well-defined only if the expected value on the right hand side exists. Usually, Xn and X are required to be square integrable3, which ensures that (64.1) is well-defined and finite.
Intuitively, for a fixed sample point4 ω, the squared difference

(Xn(ω) − X(ω))²
i.e.

lim(n→∞) E[(Xn − X)²] = 0   (64.2)

Note that (64.2) is just the usual criterion for convergence5, while Xn →L² X indicates that convergence is in the Lp space6 L², because both {Xn} and X have been required to be square integrable.
The following example illustrates the concept of mean-square convergence:
Therefore

d(X̄n, X) = E[(X̄n − X)²]
= E[(X̄n − μ)²]
= E[(X̄n − E[X̄n])²]
= Var[X̄n]

5 See p. 36.
6 See p. 136.
7 See p. 493.
where ||Xn − X|| is the Euclidean norm of the difference between Xn and X, and the second subscript is used to indicate the individual components of the vectors Xn and X.
8 See, in particular, the property Multiplication by a constant (p. 158).
9 See p. 168.
1 0 See p. 497.
1 1 See p. 34.
i.e.

lim(n→∞) E[||Xn − X||²] = 0   (64.4)
Exercise 1
Let U be a random variable having a uniform distribution12 on the interval [1, 2]. In other words, U is an absolutely continuous random variable with support

RU = [1, 2]

and probability density function

fU(u) = 1 if u ∈ [1, 2]; 0 otherwise

Define a sequence of random variables {Xn} as follows:

Xn = 1{U∈[1, 2−1/n]}

where 1{U∈[1,2−1/n]} is the indicator function13 of the event {U ∈ [1, 2 − 1/n]}.
Find the mean-square limit (if it exists) of the sequence {Xn}.
Solution
When n tends to infinity, the interval [1, 2 − 1/n] becomes similar to the interval [1, 2], because

lim(n→∞) (2 − 1/n) = 2

Therefore, we conjecture that the indicators 1{U∈[1,2−1/n]} converge in mean-square to the indicator 1{U∈[1,2]}. But 1{U∈[1,2]} is always equal to 1, so our conjecture is that the sequence {Xn} converges in mean square to 1. To verify our conjecture, we need to verify that

lim(n→∞) E[(Xn − 1)²] = 0

This is indeed the case, because (Xn − 1)² equals 0 when U ∈ [1, 2 − 1/n] and 1 otherwise, so that

E[(Xn − 1)²] = P(U ∈ (2 − 1/n, 2]) = 2 − (2 − 1/n) = 1/n

which converges to 0.
Exercise 2
Let {Xn} be a sequence of discrete random variables. Let the probability mass function of a generic term of the sequence Xn be

pXn(xn) = 1/n if xn = n; 1 − 1/n if xn = 0; 0 otherwise

Find the mean-square limit (if it exists) of the sequence {Xn}.

Solution
Note that

lim(n→∞) P(Xn = 0) = lim(n→∞) (1 − 1/n) = 1

Therefore, one would expect that the sequence {Xn} converges to the constant random variable X = 0. However, the sequence {Xn} does not converge in mean-square to 0. The distance of a generic term of the sequence from 0 is

E[(Xn − 0)²] = E[Xn²] = n² · (1/n) + 0² · (1 − 1/n) = n

Thus

lim(n→∞) E[(Xn − 0)²] = ∞
Exercise 3
Does the sequence in the previous exercise converge in probability?
Solution
The sequence {Xn} converges in probability to the constant random variable X = 0, because for any ε > 0

lim(n→∞) P(|Xn − 0| > ε) =(A) lim(n→∞) P(Xn > ε) =(B) lim(n→∞) P(Xn = n) = lim(n→∞) 1/n = 0

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value n.
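The two exercises together exhibit a sequence that converges in probability but not in mean square; the exact moments can be checked with a short computation (exact rational arithmetic avoids floating-point noise):

```python
from fractions import Fraction

# P(X_n = n) = 1/n, P(X_n = 0) = 1 - 1/n
def second_moment(n):
    return n**2 * Fraction(1, n) + 0**2 * (1 - Fraction(1, n))  # equals n

def prob_nonzero(n):
    return Fraction(1, n)

n = 10**6
assert second_moment(n) == n          # mean-square distance from 0 blows up
assert float(prob_nonzero(n)) < 1e-5  # while P(X_n != 0) -> 0
```
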
Chapter 65
Convergence in distribution
is indicated by

Xn →d X

Note that convergence in distribution only involves the distribution functions of the random variables belonging to the sequence {Xn} and that these random variables need not be defined on the same sample space4. On the contrary, the modes of convergence we have discussed in previous lectures (pointwise convergence, almost sure convergence, convergence in probability, mean-square convergence) require that all the variables in the sequence be defined on the same sample space.

Example 316 Let {Xn} be a sequence of IID5 random variables all having a uniform distribution6 on the interval [0, 1], i.e. the distribution function of Xn is

FXn(x) = 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1
Define:

Yn = n (1 − max(1≤i≤n) Xi)

The distribution function of Yn is

FYn(y) = P(Yn ≤ y)
= P(n (1 − max(1≤i≤n) Xi) ≤ y)
= P(max(1≤i≤n) Xi ≥ 1 − y/n)
= 1 − P(max(1≤i≤n) Xi < 1 − y/n)
= 1 − P(X1 < 1 − y/n, X2 < 1 − y/n, ..., Xn < 1 − y/n)
=(A) 1 − P(X1 < 1 − y/n) · P(X2 < 1 − y/n) · ... · P(Xn < 1 − y/n)
=(B) 1 − P(X1 ≤ 1 − y/n) · P(X2 ≤ 1 − y/n) · ... · P(Xn ≤ 1 − y/n)
=(C) 1 − FX1(1 − y/n) · FX2(1 − y/n) · ... · FXn(1 − y/n)
=(D) 1 − [FXn(1 − y/n)]^n

where: in step A we have used the fact that the variables Xi are mutually independent; in step B we have used the fact that the variables Xi are absolutely continuous; in step C we have used the definition of distribution function; in step D we have used the fact that the variables Xi have identical distributions. Thus:

FYn(y) = 0 if y < 0; 1 − (1 − y/n)^n if 0 ≤ y < n; 1 if y ≥ n
4 See p. 69.
5 See p. 492.
6 See p. 359.
Since

lim(n→∞) (1 − y/n)^n = exp(−y)

we have

lim(n→∞) FYn(y) = FY(y) = 0 if y < 0; 1 − exp(−y) if y ≥ 0

where FY(y) is the distribution function of an exponential random variable7. Therefore, the sequence {Yn} converges in law to an exponential distribution.
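A Monte Carlo sketch makes Example 316 concrete by comparing the empirical CDF of Yn with the Exponential(1) CDF (the sample sizes and probe points are arbitrary choices):

```python
import numpy as np

# Y_n = n * (1 - max of n i.i.d. Uniform(0,1) draws) is approximately
# Exponential(1) for large n.
rng = np.random.default_rng(0)
n, reps = 200, 50_000
X = rng.random((reps, n))
Y = n * (1 - X.max(axis=1))

for y in (0.5, 1.0, 2.0):
    empirical = (Y <= y).mean()
    assert abs(empirical - (1 - np.exp(-y))) < 0.01
```
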
Suppose that the distribution functions of the terms of a sequence {Xn} converge to a function FX(x),

lim(n→∞) FXn(x) = FX(x)

for all x ∈ R where FX(x) is continuous. How do we check that FX(x) is a proper distribution function, so that we can say that the sequence {Xn} converges in distribution?
FX(x) is a proper distribution function if it satisfies the following four properties: it is increasing; its limit at minus infinity is 0; its limit at plus infinity is 1; and it is right-continuous at any x ∈ R.
7 See p. 365.
8 See p. 118.
Exercise 1
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x ≤ 0;
Fn(x) = (n/(2n + 1)) x + (1/(4n + 2)) x² if 0 < x ≤ 1;
Fn(x) = (n/(2n + 1)) x − (1/(4n + 2)) (x² − 4x + 2) if 1 < x ≤ 2;
Fn(x) = 1 if x > 2

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
If 0 < x ≤ 1, then

lim(n→∞) Fn(x) = lim(n→∞) [(n/(2n + 1)) x + (1/(4n + 2)) x²]
= x lim(n→∞) n/(2n + 1) + x² lim(n→∞) 1/(4n + 2)
= x · (1/2) + x² · 0 = x/2

If 1 < x ≤ 2, then

lim(n→∞) Fn(x) = lim(n→∞) [(n/(2n + 1)) x − (1/(4n + 2)) (x² − 4x + 2)]
= x lim(n→∞) n/(2n + 1) − (x² − 4x + 2) lim(n→∞) 1/(4n + 2)
= x · (1/2) − (x² − 4x + 2) · 0 = x/2

We now need to verify that the function

FX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; x/2 if 0 < x ≤ 2; 1 if x > 2

is a proper distribution function. The function is increasing, continuous, its limit at minus infinity is 0 and its limit at plus infinity is 1, hence it satisfies the four properties that a proper distribution function needs to satisfy. This implies that {Xn} converges in distribution to a random variable X having distribution function FX(x).
Exercise 2
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x < 0; 1 − (1 − x)^n if 0 ≤ x ≤ 1; 1 if x > 1

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
If x = 0, then

lim(n→∞) Fn(x) = lim(n→∞) [1 − (1 − x)^n] = 1 − lim(n→∞) (1 − 0)^n = 1 − 1 = 0

If 0 < x ≤ 1, then

lim(n→∞) Fn(x) = lim(n→∞) [1 − (1 − x)^n] = 1 − lim(n→∞) (1 − x)^n = 1 − 0 = 1

Thus, the distribution functions Fn(x) converge to the function

GX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; 1 if x > 0

GX(x) is not right-continuous at x = 0, so it is not a proper distribution function. However, convergence needs to hold only at the points where the limit is continuous, and the function

FX(x) = 0 if x < 0; 1 if x ≥ 0

is a proper distribution function that coincides with GX(x) everywhere except at x = 0, which is a point of discontinuity of FX(x). Therefore, the sequence {Xn} converges in distribution to a random variable X having distribution function FX(x).
Exercise 3
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x < 0; nx if 0 ≤ x ≤ 1/n; 1 if x > 1/n

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
The distribution functions Fn(x) converge to the function

GX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; 1 if x > 0

This is the same limiting function found in the previous exercise. As a consequence, the sequence {Xn} converges in distribution to a random variable X having distribution function

FX(x) = 0 if x < 0; 1 if x ≥ 0
Chapter 66

Modes of convergence - Relations
Note that this holds for any arbitrarily small c. By the definition of convergence in probability, this means that Xn converges in probability to X (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |Xn − X| ≥ c implies |Xn − X| > ε for any strictly positive ε < c).
Chapter 67

Laws of large numbers

Let {Xn} be a sequence of random variables1. Let X̄n be the sample mean of the first n terms of the sequence:

X̄n = (1/n) Σ(i=1..n) Xi

Chebyshev's Weak Law of Large Numbers requires that the sequence be covariance stationary4 with zero covariances:

∃ μ ∈ R : E[Xn] = μ, ∀n ∈ N

∃ σ² ∈ R+ : Var[Xn] = σ², ∀n ∈ N

Cov[Xn, Xn+k] = 0, ∀n, k ∈ N

Under these conditions, the sample mean converges in probability2 to μ:

plim(n→∞) X̄n = μ
1 See p. 491.
2 See p. 511.
3 See p. 505.
4 In other words, all the random variables in the sequence have the same mean , the same
variance 2 and zero covariance with each other. See p. 493 for a de…nition of covariance
stationary sequence.
Now we can apply Chebyshev's inequality8 to the sample mean X̄n:

P(|X̄n − E[X̄n]| ≥ k) ≤ Var[X̄n]/k²

for any strictly positive real number k. Plugging in the values for the expected value and the variance derived above, we obtain:

P(|X̄n − μ| ≥ k) ≤ σ²/(n k²)

Since

lim(n→∞) σ²/(n k²) = 0

and

P(|X̄n − μ| ≥ k) ≥ 0

then it must also be that:

lim(n→∞) P(|X̄n − μ| ≥ k) = 0

Note that this holds for any arbitrarily small k. By the very definition of convergence in probability, this means that X̄n converges in probability to μ (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |X̄n − μ| ≥ k implies |X̄n − μ| > ε for any strictly positive ε < k).

5 See p. 511.
6 See, in particular, the Multiplication by a constant property (p. 158).
7 See p. 168.
8 See p. 242.
Note that it is customary to state Chebyshev's Weak Law of Large Numbers as a result on the convergence in probability of the sample mean:

plim(n→∞) X̄n = μ

However, the conditions of the above theorem guarantee the mean square convergence9 of the sample mean to μ:

X̄n →m.s. μ
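Both conclusions are easy to illustrate by simulation; this sketch estimates E[(X̄n − μ)²] and checks it against σ²/n (the distribution, sample sizes and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
reps = 20_000

for n in (10, 1000):
    means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    # mean-square distance of the sample mean from mu is Var[X_bar] = sigma^2/n
    mse = np.mean((means - mu) ** 2)
    assert abs(mse - sigma**2 / n) < 0.05
```
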
Chebyshev's Weak Law of Large Numbers for correlated sequences instead requires that the sequence be covariance stationary11:

∃ μ ∈ R : E[Xn] = μ, ∀n ∈ N

∀j ≥ 0, ∃ γj ∈ R : Cov[Xn, Xn−j] = γj, ∀n > j
9 See p. 519.
1 0 See p. 534.
1 1 In other words, all the random variables in the sequence have the same mean , the same
variance 0 and the covariance between a term Xn of the sequence and the term that is located
j positions before it (Xn j ) is always the same ( j ), irrespective of how Xn has been chosen.
Proof. For a full proof see e.g. Karlin and Taylor12 (1975). We give here a proof based on the assumption that covariances are absolutely summable:

Σ(j=0..∞) |γj| < ∞

which is stronger than (67.1). The expected value of the sample mean X̄n is

E[X̄n] = E[(1/n) Σ(i=1..n) Xi] = (1/n) Σ(i=1..n) E[Xi] = (1/n) n μ = μ

Absolute summability of the covariances implies that there exists a finite constant M such that

Var[X̄n] ≤ M/n

Now we can apply Chebyshev's inequality to the sample mean X̄n:

P(|X̄n − E[X̄n]| ≥ k) ≤ Var[X̄n]/k²

for any strictly positive real number k. Plugging in the values for the expected value and the variance derived above, we obtain:

P(|X̄n − μ| ≥ k) ≤ Var[X̄n]/k² ≤ M/(n k²)

Since

lim(n→∞) M/(n k²) = 0

and

P(|X̄n − μ| ≥ k) ≥ 0

then it must also be that:

lim(n→∞) P(|X̄n − μ| ≥ k) = 0

Note that this holds for any arbitrarily small k. By the very definition of convergence in probability, this means that X̄n converges in probability to μ (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |X̄n − μ| ≥ k implies |X̄n − μ| > ε for any strictly positive ε < k).
Also Chebyshev's Weak Law of Large Numbers for correlated sequences has been stated as a result on the convergence in probability of the sample mean:

plim(n→∞) X̄n = μ

However, the conditions of the above theorem also guarantee the mean square convergence of the sample mean to μ:

X̄n →m.s. μ
Proof. In the above proof of Chebyshev's Weak Law of Large Numbers for correlated sequences, we proved that

Var[X̄n] ≤ M/n

for a finite constant M, and that

E[X̄n] = μ

This implies:

E[(X̄n − μ)²] = E[(X̄n − E[X̄n])²] = Var[X̄n] ≤ M/n

Thus, taking limits on both sides, we obtain:

lim(n→∞) E[(X̄n − μ)²] ≤ lim(n→∞) M/n = 0

But

E[(X̄n − μ)²] ≥ 0

so it must be:

lim(n→∞) E[(X̄n − μ)²] = 0
a Strong Law of Large Numbers applies to the sample mean X̄n if and only if a Strong Law of Large Numbers applies to each of the components of the vector X̄n, i.e. if and only if

X̄n,j →a.s. μj, j = 1, ..., K

Exercise 1
Let {εn} be an IID sequence. A generic term of the sequence has mean μ and variance σ². Let {Xn} be a covariance stationary sequence such that a generic term of the sequence satisfies

Xn = ρ Xn−1 + εn

where |ρ| < 1. Denote by X̄n the sample mean of the sequence. Verify whether the sequence {Xn} satisfies the conditions that are required by Chebyshev's Weak Law of Large Numbers. In the affirmative case, find its probability limit.
Solution
By assumption the sequence {Xn} is covariance stationary. So all the terms of the sequence have the same expected value. Taking the expected value of both sides of the equation

Xn = ρ Xn−1 + εn

we obtain:

E[Xn] = E[ρ Xn−1 + εn] = ρ E[Xn−1] + E[εn] = ρ E[Xn] + μ

so that

E[Xn] = μ/(1 − ρ)

By the same token, the variance can be derived from:

Var[Xn] = Var[ρ Xn−1 + εn] =(A) ρ² Var[Xn−1] + Var[εn] = ρ² Var[Xn] + σ²

where: in step A we have used the fact that Xn−1 is independent of εn, because {εn} is IID. Solving for Var[Xn], we obtain

Var[Xn] = σ²/(1 − ρ²)

Now, we need to derive Cov[Xn, Xn+j]. Note that:

Xn+1 = ρ Xn + εn+1
Xn+2 = ρ Xn+1 + εn+2 = ρ² Xn + εn+2 + ρ εn+1
Xn+3 = ρ Xn+2 + εn+3 = ρ³ Xn + εn+3 + ρ εn+2 + ρ² εn+1
...
Xn+j = ρ Xn+j−1 + εn+j = ρ^j Xn + Σ(s=0..j−1) ρ^s εn+j−s

Therefore

Cov[Xn, Xn+j] = Cov[Xn, ρ^j Xn + Σ(s=0..j−1) ρ^s εn+j−s] =(B) ρ^j Cov[Xn, Xn] = ρ^j Var[Xn] = ρ^j σ²/(1 − ρ²)

where: in step B we have used the fact that Xn is independent of the future shocks εn+1, ..., εn+j. Since |ρ| < 1, these covariances tend to zero as j grows, and the conditions of Chebyshev's Weak Law of Large Numbers are satisfied. Therefore, the sample mean converges in probability to the population mean:

plim(n→∞) X̄n = E[Xn] = μ/(1 − ρ)
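The exercise is easy to check by simulation. A minimal sketch with illustrative parameter values (ρ = 0.5 and standard-normal shocks shifted to have mean μ = 1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, mu, n = 0.5, 1.0, 200_000

# AR(1): X_t = rho * X_{t-1} + eps_t, with E[eps_t] = mu and Var[eps_t] = 1
x = np.empty(n)
x[0] = mu / (1 - rho)              # start at the stationary mean
for t in range(1, n):
    x[t] = rho * x[t - 1] + mu + rng.standard_normal()

# the sample mean should be close to mu / (1 - rho) = 2
assert abs(x.mean() - mu / (1 - rho)) < 0.05
```
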
Chapter 68

Central limit theorems
Let {Xn} be a sequence of random variables1. Let X̄n be the sample mean of the first n terms of the sequence:

X̄n = (1/n) Σ(i=1..n) Xi

A Central Limit Theorem (CLT) gives conditions under which

√n (X̄n − μ)/σ →d Z

where Z is a standard normal random variable2, μ and σ are two constants and →d indicates convergence in distribution3.
Why is the expression (X̄n − μ)/σ multiplied by the square root of n? If we do not multiply it by √n, then (X̄n − μ)/σ converges to a constant, provided that the conditions4 of a Law of Large Numbers apply. On the contrary, multiplying it by √n, we obtain a sequence that converges to a proper random variable (i.e. a random variable that is not constant). When the conditions of a Central Limit Theorem apply, this variable has a normal distribution.
In practice, the CLT is used as follows:

1. we observe a sample consisting of n observations X1, X2, ..., Xn;

2. if n is large enough, then a standard normal distribution is a good approximation of the distribution of √n (X̄n − μ)/σ;

3. therefore, we pretend that

√n (X̄n − μ)/σ ∼ N(0, 1)
1 See p. 491.
2 Remember that a standard normal random variable is a normal random variable with zero
mean and unit variance (p. 376).
3 See p. 527.
4 See p. 535.
There are several Central Limit Theorems. We report some examples below. The simplest one (the Lindeberg-Lévy CLT) applies to an IID sequence {Xn} such that

E[Xn] = μ < ∞, ∀n ∈ N

Var[Xn] = σ² < ∞, ∀n ∈ N

where σ² > 0. Then, a Central Limit Theorem applies to the sample mean X̄n:

√n (X̄n − μ)/σ →d Z

where Z is a standard normal random variable and →d denotes convergence in distribution.
Proof. We will just sketch a proof. For a detailed and rigorous proof see, for example, Resnick6 (1999) and Williams7 (1991). First of all, denote by {Zn} the sequence whose generic term is

Zn = √n (X̄n − μ)/σ

and by {Yn} the sequence of standardized terms Yn = (Xn − μ)/σ, so that Zn is the sum of the Yi divided by √n, with

E[Y1] = 0, Var[Y1] = 1

Therefore, denoting characteristic functions by φ:

lim(n→∞) φZn(t) = lim(n→∞) [φY1(t/√n)]^n
= lim(n→∞) [1 − (1/2)(t²/n) + o(t²/n)]^n
= exp(−t²/2) = φZ(t)

where

φZ(t) = exp(−t²/2)

is the characteristic function of a standard normal random variable Z (see the lecture entitled Normal distribution - p. 379). A theorem, called Lévy continuity theorem, which we do not cover in these lectures, states that if a sequence of random variables {Zn} is such that their characteristic functions φZn(t) converge to the characteristic function φZ(t) of a random variable Z, then the sequence {Zn} converges in distribution to Z. Therefore, in our case the sequence {Zn} converges in distribution to a standard normal distribution.
So, roughly speaking, under the stated assumptions, the distribution of the sample mean X̄n can be approximated by a normal distribution with mean μ and variance σ²/n (provided n is large enough).
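This approximation can be checked with a quick Monte Carlo sketch (the exponential distribution, sample sizes and tolerances below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 50_000

# Exponential(1) draws: mean mu = 1 and standard deviation sigma = 1
samples = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

# standardized sample means should look like N(0, 1)
assert abs(z.mean()) < 0.02
assert abs(z.std() - 1.0) < 0.02
```
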
Also note that the conditions for the validity of the Lindeberg-Lévy Central Limit Theorem resemble the conditions for the validity of Kolmogorov's Strong Law of Large Numbers10. The only difference is the additional requirement that

Var[Xn] = σ² < ∞, ∀n ∈ N
A CLT for correlated sequences (reported next) requires, among other conditions, that

E[Xn] = μ < ∞, ∀n ∈ N
1 0 See p. 540.
1 1 See p. 492.
1 2 See p. 494.
68.2. MULTIVARIATE GENERALIZATIONS 549
2
Var [Xn ] = < 1; 8n 2 N
1
X
lim nVar X n = 2 + 2 Cov [X1 ; Xi ] = V < 1
n!1
i=2
where V > 0. Then, a Central Limit Theorem applies to the sample mean X n :
p Xn d
n p !Z
V
d
where Z is a standard normal random variable and ! denotes convergence in
distribution.
Note also that ergodicity is replaced by the stronger condition of mixing. Finally, let us mention that the variance $V$ in the above proposition, which is defined as
$$V = \lim_{n\to\infty} n\,\mathrm{Var}\big[\overline{X}_n\big]$$
is called the long-run variance.
$$\mathrm{E}[X_n] = \mu \in \mathbb{R}^K, \quad \forall n \in \mathbb{N}$$
13 Durrett, R. (2010) Probability: Theory and Examples, Cambridge University Press.
14 White, H. (2001) Asymptotic Theory for Econometricians, Academic Press.
15 See p. 541.
$$\mathrm{Var}[X_n] = \Sigma \in \mathbb{R}^{K \times K}, \quad \forall n \in \mathbb{N}$$
where $\Sigma$ is a positive definite matrix. Let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the vector of sample means. Then:
$$\sqrt{n}\,\Sigma^{-1/2}\big(\overline{X}_n - \mu\big) \xrightarrow{d} Z$$
where $Z$ is a standard multivariate normal random vector¹⁶ and $\xrightarrow{d}$ denotes convergence in distribution.
Proof. For a proof see, for example, Basu¹⁷ (2004), DasGupta¹⁸ (2008) or McCabe and Tremayne¹⁹ (1993).
In a similar manner, the CLT for correlated sequences generalizes to random
vectors (V becomes a matrix, called long-run covariance matrix).
Exercise 1
Let $\{X_n\}$ be a sequence of independent Bernoulli random variables²⁰ with parameter $p = \frac{1}{2}$, i.e. a generic term $X_n$ of the sequence has support
$$R_{X_n} = \{0, 1\}$$
Solution
The sequence $\{X_n\}$ is an IID sequence. The mean of a generic term of the sequence is
$$\mathrm{E}[X_n] = \sum_{x \in R_{X_n}} x\, p_{X_n}(x) = 1 \cdot p_{X_n}(1) + 0 \cdot p_{X_n}(0) = 1\cdot\frac{1}{2} + 0\cdot\left(1 - \frac{1}{2}\right) = \frac{1}{2} < \infty$$
16 See p. 439.
17 Basu, A. K. (2004) Measure Theory and Probability, PHI Learning PVT.
18 DasGupta, A. (2008) Asymptotic Theory of Statistics and Probability, Springer.
19 McCabe, B. and A. Tremayne (1993) Elements of Modern Asymptotic Theory with Statistical Applications.
The variance of a generic term of the sequence can be derived thanks to the usual formula for computing the variance²¹:
$$\mathrm{E}\big[X_n^2\big] = \sum_{x \in R_{X_n}} x^2\, p_{X_n}(x) = 1^2 \cdot p_{X_n}(1) + 0^2 \cdot p_{X_n}(0) = 1\cdot\frac{1}{2} + 0 = \frac{1}{2}$$
$$\mathrm{E}[X_n]^2 = \frac{1}{4}$$
$$\mathrm{Var}[X_n] = \mathrm{E}\big[X_n^2\big] - \mathrm{E}[X_n]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4} < \infty$$
Therefore, the sequence $\{X_n\}$ satisfies the conditions of the Lindeberg-Lévy Central Limit Theorem (IID, finite mean, finite variance). The mean of the first 100 terms of the sequence is:
$$\overline{X}_{100} = \frac{1}{100}\sum_{i=1}^{100} X_i$$
Using the Central Limit Theorem to approximate its distribution, we obtain:
$$\overline{X}_n \sim N\!\left(\mathrm{E}[X_n], \frac{\mathrm{Var}[X_n]}{n}\right)$$
or
$$\overline{X}_{100} \sim N\!\left(\frac{1}{2}, \frac{1}{400}\right)$$
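This approximation can be checked by simulating many samples of 100 Bernoulli(1/2) draws (the number of replications below is an arbitrary choice):

```python
import random
import statistics

random.seed(1)

# Sample means of 100 Bernoulli(1/2) draws, repeated many times.
means = []
for _ in range(20000):
    successes = sum(random.random() < 0.5 for _ in range(100))
    means.append(successes / 100)

m = statistics.mean(means)       # should be close to 1/2
v = statistics.pvariance(means)  # should be close to 1/400 = 0.0025
print(round(m, 3), round(v, 5))
```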
Exercise 2
Let fXn g be a sequence of independent Bernoulli random variables with parameter
p = 21 , as in the previous exercise. Let fYn g be another sequence of random
variables such that
1
Yn = Xn+1 Xn ; 8n
2
Suppose fYn g satis…es the conditions of a Central Limit Theorem for correlated
sequences. Derive an approximate distribution for the mean of the …rst n terms of
the sequence fYn g.
Solution
The sequence $\{X_n\}$ is an IID sequence. The mean of a generic term of the sequence $\{Y_n\}$ is
$$\mathrm{E}[Y_n] = \mathrm{E}\!\left[X_{n+1} - \frac{1}{2}X_n\right] = \mathrm{E}[X_{n+1}] - \frac{1}{2}\mathrm{E}[X_n] = \frac{1}{2} - \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$$
The variance of a generic term of the sequence is
$$\mathrm{Var}[Y_n] = \mathrm{Var}\!\left[X_{n+1} - \frac{1}{2}X_n\right] = \mathrm{Var}[X_{n+1}] + \frac{1}{4}\mathrm{Var}[X_n] - 2\cdot\frac{1}{2}\,\mathrm{Cov}[X_{n+1}, X_n]$$
$$\overset{A}{=} \mathrm{Var}[X_{n+1}] + \frac{1}{4}\mathrm{Var}[X_n] = \frac{1}{4} + \frac{1}{4}\cdot\frac{1}{4} = \frac{5}{16}$$
21 $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$. See p. 156.
where: in step A we have used the fact that $X_n$ and $X_{n+1}$ are independent. The covariance between two successive terms of the sequence is
$$\mathrm{Cov}[Y_{n+1}, Y_n] = \mathrm{Cov}\!\left[X_{n+2} - \frac{1}{2}X_{n+1},\; X_{n+1} - \frac{1}{2}X_n\right]$$
$$\overset{A}{=} \mathrm{Cov}[X_{n+2}, X_{n+1}] - \frac{1}{2}\mathrm{Cov}[X_{n+2}, X_n] - \frac{1}{2}\mathrm{Cov}[X_{n+1}, X_{n+1}] + \frac{1}{4}\mathrm{Cov}[X_{n+1}, X_n]$$
$$\overset{B}{=} -\frac{1}{2}\mathrm{Cov}[X_{n+1}, X_{n+1}] \overset{C}{=} -\frac{1}{2}\mathrm{Var}[X_{n+1}] = -\frac{1}{2}\cdot\frac{1}{4} = -\frac{1}{8}$$
where: in step A we have used the bilinearity of covariance; in step B we have used the fact that the terms of $\{X_n\}$ are independent; in step C we have used the fact that the covariance of a random variable with itself is its variance.
Moreover, for $j \ge 2$, the covariance between terms of the sequence is
$$\mathrm{Cov}[Y_{n+j}, Y_n] = \mathrm{Cov}\!\left[X_{n+j+1} - \frac{1}{2}X_{n+j},\; X_{n+1} - \frac{1}{2}X_n\right]$$
$$\overset{A}{=} \mathrm{Cov}[X_{n+j+1}, X_{n+1}] - \frac{1}{2}\mathrm{Cov}[X_{n+j+1}, X_n] - \frac{1}{2}\mathrm{Cov}[X_{n+j}, X_{n+1}] + \frac{1}{4}\mathrm{Cov}[X_{n+j}, X_n] \overset{B}{=} 0$$
because, for $j \ge 2$, all the pairs of terms involved are independent. The long-run variance is therefore
$$V = \mathrm{Var}[Y_n] + 2\sum_{i=2}^{\infty}\mathrm{Cov}[Y_1, Y_i] = \frac{5}{16} + 2\cdot\left(-\frac{1}{8}\right) = \frac{1}{16}$$
Using the Central Limit Theorem for correlated sequences to approximate its distribution, we obtain
$$\overline{Y}_n \sim N\!\left(\mathrm{E}[Y_n], \frac{V}{n}\right)$$
or
$$\overline{Y}_n \sim N\!\left(\frac{1}{4}, \frac{1}{16n}\right)$$
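The mean 1/4 and long-run variance 1/16 derived above can be verified by simulation (the values of n and the number of replications are arbitrary choices):

```python
import random
import statistics

random.seed(2)

n, reps = 500, 4000
ybar = []
for _ in range(reps):
    # A path of n+1 Bernoulli(1/2) draws, then Y_i = X_{i+1} - 0.5 * X_i.
    x = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n + 1)]
    y = [x[i + 1] - 0.5 * x[i] for i in range(n)]
    ybar.append(sum(y) / n)

m = statistics.mean(ybar)           # close to 1/4
nv = n * statistics.pvariance(ybar) # close to V = 1/16 = 0.0625
print(round(m, 3), round(nv, 3))
```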
Exercise 3
Let $Y$ be a binomial random variable with parameters $n = 100$ and $p = \frac{1}{2}$ (you need to read the lecture entitled Binomial distribution²³ in order to be able to solve this exercise). Using the Central Limit Theorem, show that a normal random variable $X$ with mean $\mu = 50$ and variance $\sigma^2 = 25$ can be used as an approximation of $Y$.
Solution
A binomial random variable $Y$ with parameters $n = 100$ and $p = \frac{1}{2}$ can be written as
$$Y = \sum_{i=1}^{100} X_i$$
where $X_1, \ldots, X_{100}$ are independent Bernoulli random variables with parameter $p = \frac{1}{2}$. In the first exercise, we have shown that the distribution of $\overline{X}_{100}$ can be approximated by a normal distribution:
$$\overline{X}_{100} \sim N\!\left(\frac{1}{2}, \frac{1}{400}\right)$$
Since $Y = 100\,\overline{X}_{100}$, the distribution of $Y$ can be approximated by a normal distribution with mean $100\cdot\frac{1}{2} = 50$ and variance $100^2\cdot\frac{1}{400} = 25$:
$$Y \sim N(50, 25)$$
23 See p. 341.
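The quality of the approximation $Y \sim N(50, 25)$ can be checked against the exact binomial CDF (a continuity correction of 0.5 is applied below; this refinement is not discussed in the text):

```python
import math

def binom_cdf(k, n=100, p=0.5):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def norm_cdf(x, mu=50.0, sigma=5.0):
    """CDF of the approximating N(50, 25) distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Compare the exact binomial CDF with the normal approximation.
for k in (40, 45, 50, 55, 60):
    print(k, round(binom_cdf(k), 4), round(norm_cdf(k + 0.5), 4))
```

With $n = 100$ and $p = 1/2$ the two columns agree to about three decimal places.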
Chapter 69
Convergence of
transformations
where $\xrightarrow{P}$ denotes convergence in probability, $\xrightarrow{a.s.}$ denotes almost sure convergence and $\xrightarrow{d}$ denotes convergence in distribution.
Proposition 332 If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then:
$$X_n + Y_n \xrightarrow{P} X + Y$$
$$X_n Y_n \xrightarrow{P} XY$$
Proof. First of all, note that convergence in probability of $\{X_n\}$ and of $\{Y_n\}$ implies their joint convergence in probability², i.e. their convergence as a vector:
$$[X_n \; Y_n] \xrightarrow{P} [X \; Y]$$
Now, the sum and the product are continuous functions of the operands. Thus, for example:
$$X + Y = g([X \; Y])$$
where $g$ is a continuous function, and, using the continuous mapping theorem:
$$\mathop{\mathrm{plim}}_{n\to\infty}(X_n + Y_n) = \mathop{\mathrm{plim}}_{n\to\infty} g([X_n \; Y_n]) = g\Big(\mathop{\mathrm{plim}}_{n\to\infty}[X_n \; Y_n]\Big) = g([X \; Y]) = X + Y$$
where $\mathrm{plim}$ denotes a limit in probability.
$$1/X_n \to 1/X$$
provided $X$ is almost surely different from 0 (we did not specify the kind of convergence, which can be in probability, almost sure, or in distribution).
Proof. This is a consequence of the continuous mapping theorem and of the fact that
$$g(x) = 1/x$$
is a continuous function for $x \neq 0$.
As a consequence:
Proposition 337 If two sequences of random variables $\{X_n\}$ and $\{Y_n\}$ converge to $X$ and $Y$ respectively, then
$$X_n / Y_n \to X / Y$$
provided $Y$ is almost surely different from 0.
Proof. This is a consequence of the fact that the ratio can be written as a product:
$$X_n / Y_n = X_n\,(1/Y_n)$$
The first operand of the product converges by assumption. The second converges because of the previous proposition. Therefore, their product converges because convergence is preserved under products.
3 A. W. van der Vaart (2000) Asymptotic Statistics, Cambridge University Press.
Exercise 1
Let $\{X_n\}$ be a sequence of $K \times 1$ random vectors such that
$$X_n \xrightarrow{d} X$$
where $X$ is a normal random vector with mean $\mu$ and invertible covariance matrix $V$. Let $\{A_n\}$ be a sequence of $L \times K$ random matrices such that:
$$A_n \xrightarrow{P} A$$
where $A$ is a constant matrix. Find the limit in distribution of the sequence of products $\{A_n X_n\}$.
Solution
By Slutsky's theorem,
$$A_n X_n \xrightarrow{d} Y$$
where
$$Y = AX$$
The random vector $Y$ has a multivariate normal distribution, because it is a linear transformation of a multivariate normal random vector⁵. The expected value of $Y$ is
$$\mathrm{E}[Y] = \mathrm{E}[AX] = A\,\mathrm{E}[X] = A\mu$$
and its covariance matrix is
$$\mathrm{Var}[Y] = \mathrm{Var}[AX] = A\,\mathrm{Var}[X]\,A^{\top} = AVA^{\top}$$
Therefore, the sequence of products $\{A_n X_n\}$ converges in distribution to a multivariate normal random vector with mean $A\mu$ and covariance matrix $AVA^{\top}$.
4 See p. 119.
5 See p. 469.
Exercise 2
Let $\{X_n\}$ be a sequence of $K \times 1$ random vectors such that
$$X_n \xrightarrow{d} X$$
where $X$ is a normal random vector with mean 0 and invertible covariance matrix $V$. Let $\{V_n\}$ be a sequence of $K \times K$ random matrices such that
$$V_n \xrightarrow{P} V$$
Find the limit in distribution of the sequence
$$X_n^{\top} V_n^{-1} X_n$$
Solution
By the continuous mapping theorem,
$$V_n^{-1} \xrightarrow{P} V^{-1}$$
Therefore, by Slutsky's theorem,
$$X_n^{\top} V_n^{-1} X_n \xrightarrow{d} X^{\top} V^{-1} X$$
Writing the covariance matrix as $V = \Sigma\Sigma^{\top}$, we have
$$X^{\top} V^{-1} X = X^{\top}\big(\Sigma\Sigma^{\top}\big)^{-1} X = X^{\top}\big(\Sigma^{\top}\big)^{-1}\Sigma^{-1} X = \big(\Sigma^{-1}X\big)^{\top}\big(\Sigma^{-1}X\big) = Z^{\top}Z$$
where $Z = \Sigma^{-1}X$ is a multivariate normal random vector with mean 0 and covariance matrix
$$\mathrm{Var}[Z] = \Sigma^{-1}\,V\,\big(\Sigma^{-1}\big)^{\top} = \Sigma^{-1}\Sigma\Sigma^{\top}\big(\Sigma^{\top}\big)^{-1} = I$$
i.e. a standard multivariate normal random vector. Being a sum of squares of $K$ independent standard normal random variables, $Z^{\top}Z$ has a Chi-square distribution with $K$ degrees of freedom.
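The limiting Chi-square distribution can be illustrated by simulation in the case $K = 2$ (the matrix $\Sigma$ below is an arbitrary choice of a square root of $V$):

```python
import random
import statistics

random.seed(3)

# Fixed lower-triangular Sigma = [[a, 0], [b, c]] and V = Sigma Sigma^T.
a, b, c = 1.0, 0.5, 1.0
v11, v12, v22 = a * a, a * b, b * b + c * c
det = v11 * v22 - v12 * v12
# Inverse of V, computed explicitly for the 2x2 case.
i11, i12, i22 = v22 / det, -v12 / det, v11 / det

q = []
for _ in range(20000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1, x2 = a * z1, b * z1 + c * z2       # X = Sigma Z, so Var[X] = V
    q.append(i11 * x1 * x1 + 2 * i12 * x1 * x2 + i22 * x2 * x2)

m = statistics.mean(q)       # mean of chi2(2) is 2
v = statistics.pvariance(q)  # variance of chi2(2) is 4
print(round(m, 2), round(v, 1))
```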
Exercise 3
Let everything be as in the previous exercise, except for the fact that now $X$ has mean $\mu$, and let $\{\mu_n\}$ be a sequence of random vectors converging in probability to $\mu$. Find the limit in distribution of the sequence
$$(X_n - \mu_n)^{\top}\, V_n^{-1}\, (X_n - \mu_n)$$
Solution
Define
$$Y_n = X_n - \mu_n$$
By Slutsky's theorem,
$$Y_n \xrightarrow{d} Y$$
where
$$Y = X - \mu$$
is a multivariate normal random vector with mean 0 and variance $V$. Thus, we can use the results of the previous exercise on the sequence
$$Y_n^{\top} V_n^{-1} Y_n$$
6 See p. 481.
Part VII
Fundamentals of statistics
Chapter 70
Statistical inference
Statistical inference is the act of using observed data to infer unknown properties and characteristics of the probability distribution from which the observed data have been generated. The set of data that is used to make inferences is called a sample.
70.1 Samples
In the simplest possible case, we observe the realizations¹ $x_1, \ldots, x_n$ of $n$ independent random variables² $X_1, \ldots, X_n$ having a common distribution function³ $F_X(x)$, and we use the observed realizations to infer some characteristics of $F_X(x)$. With
a slight abuse of language, we sometimes say "n independent realizations of a
random variable X" instead of saying "the realizations of n independent random
variables X1 , . . . , Xn having a common distribution function FX (x)".
Example 338 The lifetime of a certain type of electronic device is a random variable $X$, whose distribution function $F_X(x)$ is unknown. Suppose we independently observe the lifetimes of 10 devices. Denote these realizations by $x_1$, $x_2$, . . . , $x_{10}$. We are interested in the expected value of $X$, which is an unknown characteristic of $F_X(x)$. We infer $\mathrm{E}[X]$ from the data, estimating $\mathrm{E}[X]$ with the sample mean
$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i$$
In this simple example the observed data x1 , x2 , . . . , x10 constitute our sample and
E [X] is the quantity about which we are making a statistical inference.
While in the simplest case X1 , . . . , Xn are independent random variables, more
complicated cases are possible. For example:
1. X1 , . . . , Xn are not independent;
2. X1 , . . . , Xn are random vectors having a common joint distribution function4
FX (x);
1 See p. 105.
2 See p. 229.
3 See p. 108.
4 See p. 118.
$$\xi = [X_1 \;\ldots\; X_n]$$
Example 346 Take the example above and drop the assumption that the n random
variables X1 , . . . , Xn are mutually independent. The statistical model is now:
When each distribution function is associated with only one parameter, the parametric family is said to be identifiable.
$$F \in \Phi_R$$
or an exclusion restriction:
$$F \notin \Phi_R$$
There are several different ways of formalizing such a decision problem. The branch of statistics that analyzes these decision problems is called statistical decision theory.
Chapter 71
Point estimation
In the lecture entitled Statistical inference (p. 563) we have defined statistical inference as the act of using a sample to make statements about the probability distribution that generated the sample. The sample is regarded as the realization of a random vector $\xi$, whose unknown joint distribution function¹, denoted by $F(\cdot)$, is assumed to belong to a set of distribution functions $\Phi$, called the statistical model.
When the model $\Phi$ is put into correspondence with a set $\Theta \subseteq \mathbb{R}^p$ of real vectors, we have a parametric model. $\Theta$ is called the parameter space and its elements are called parameters. Denote by $\theta_0$ the parameter that is associated with the unknown distribution function $F(\cdot)$ and assume that $\theta_0$ is unique. $\theta_0$ is called the true parameter, because it is associated with the distribution that actually generated the sample. This lecture introduces a type of inference about the true parameter called point estimation.
Among the consequences that are usually considered in a parametric decision problem, the most relevant one is the estimation error. The estimation error $e$ is the difference between the estimate $\widehat{\theta}$ and the true parameter $\theta_0$:
$$e = \widehat{\theta} - \theta_0$$
A common loss function is the absolute estimation error
$$L\big(\widehat{\theta}, \theta_0\big) = \big\|\widehat{\theta} - \theta_0\big\|$$
where $\|\cdot\|$ is the Euclidean norm (it coincides with the absolute value when $\theta \in \mathbb{R}$).
$$L\big(\widehat{\theta}(\xi), \theta_0\big)$$
can be thought of as a random variable. Its expected value is called the statistical risk (or, simply, the risk) of the estimator $\widehat{\theta}$, and it is denoted by $R(\widehat{\theta})$:
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[L\big(\widehat{\theta}(\xi), \theta_0\big)\Big]$$
where the expected value is computed with respect to the true distribution function $F(\cdot\,;\theta_0)$. Thus, the risk $R(\widehat{\theta})$ depends both on the true parameter $\theta_0$ and on the distribution function of $\xi$. In practice, these quantities are unknown, so also the risk needs to be estimated. For example, we can compute an estimate $\widetilde{R}(\widehat{\theta})$ of the risk by pretending that the estimate $\widehat{\theta}$ were the true parameter, and computing the estimated risk as:
$$\widetilde{R}\big(\widehat{\theta}\big) = \mathrm{E}\Big[L\big(\widehat{\theta}(\xi), \widehat{\theta}\big)\Big]$$
where the expected value is with respect to the estimated distribution function $F(\cdot\,;\widehat{\theta})$.
Even if the risk is unknown, the notion of risk is often used to derive theoretical properties of estimators. In any case, parameter estimation is always guided, at least ideally, by the principle of risk minimization, i.e. by the search for estimators $\widehat{\theta}$ that minimize the risk $R(\widehat{\theta})$.
Depending on the specific loss function we use, the statistical risk of an estimator can take different names:
1. when the absolute error is used as a loss function, then the risk
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[\big\|\widehat{\theta} - \theta_0\big\|\Big]$$
is called the mean absolute error (MAE);
2. when the squared error is used as a loss function, then the risk
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[\big\|\widehat{\theta} - \theta_0\big\|^2\Big]$$
is called the mean squared error (MSE). The square root of the mean squared error is called the root mean squared error (RMSE).
71.3.1 Unbiasedness
If an estimator produces parameter estimates that are on average correct, then it is said to be unbiased. The following is a formal definition: an estimator $\widehat{\theta}$ is unbiased if and only if
$$\mathrm{E}\big[\widehat{\theta}(\xi)\big] = \theta_0$$
where $\xi$ is the random vector of which the sample is a realization and the expected value is computed with respect to the true distribution function $F(\cdot\,;\theta_0)$.
Also note that if an estimator is unbiased, this implies that the estimation error is on average zero:
$$\mathrm{E}[e] = \mathrm{E}\big[\widehat{\theta} - \theta_0\big] = \mathrm{E}\big[\widehat{\theta}\big] - \theta_0 = \theta_0 - \theta_0 = 0$$
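Unbiasedness can be illustrated numerically with the sample mean as the estimator (the true parameter, dispersion, and sample sizes below are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(4)

theta0 = 3.0   # true mean of the data-generating distribution (illustrative)
estimates = []
for _ in range(10000):
    sample = [random.gauss(theta0, 2.0) for _ in range(20)]
    estimates.append(statistics.mean(sample))   # the estimator theta_hat

# The average estimation error E[e] = E[theta_hat] - theta0 should be near zero.
avg_error = statistics.mean(estimates) - theta0
print(round(avg_error, 3))
```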
71.3.2 Consistency
If an estimator produces parameter estimates that converge to the true value as the sample size increases, then it is said to be consistent. The following is a formal definition: a sequence of estimators $\{\widehat{\theta}_n\}$ is strongly consistent if and only if
$$\widehat{\theta}_n \xrightarrow{a.s.} \theta_0$$
where $\xrightarrow{a.s.}$ indicates almost sure convergence⁴. A sequence of estimators which is not consistent is called inconsistent.
When the sequence of estimators is obtained by applying the same predefined rule to every sample $\xi_n$, we often say, with a slight abuse of language, "consistent estimator" instead of "consistent sequence of estimators". In such cases, what we mean is that the predefined rule produces a consistent sequence of estimators.
71.4 Examples
You can find examples of point estimation in the lectures entitled Point estimation of the mean (p. 573) and Point estimation of the variance (p. 579).
3 See p. 511.
4 See p. 505.
Chapter 72
Point estimation of the mean
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The expected value of the sample mean is
$$\mathrm{E}\big[\overline{X}_n\big] = \mu$$
1 See p. 564.
2 See p. 569.
Therefore, the variance of the estimator tends to zero as the sample size $n$ tends to infinity.
Proof. Note that the sample mean $\overline{X}_n$ is a linear combination of the normal and independent random variables $X_1, \ldots, X_n$ (all the coefficients of the linear combination are equal to $\frac{1}{n}$).
3 See p. 571.
4 See p. 168.
The mean squared error of the estimator is
$$\mathrm{MSE}\big(\overline{X}_n\big) = \mathrm{E}\Big[\big\|\overline{X}_n - \mu\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(\overline{X}_n - \mu\big)^2\Big] \overset{B}{=} \mathrm{Var}\big[\overline{X}_n\big] = \frac{\sigma^2}{n}$$
where: in step A we have used the fact that in one dimension the Euclidean norm is the same as the absolute value; in step B we have used the definition of variance and the fact that $\mathrm{E}\big[\overline{X}_n\big] = \mu$ (see above).
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The difference with respect to the previous example is that now we are no longer assuming that the sample points come from a normal distribution.
$$\sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma} \xrightarrow{d} Z$$
where $Z$ is a standard normal random variable¹¹ and $\xrightarrow{d}$ denotes convergence in distribution. In other words, for large $n$ the distribution of the sample mean $\overline{X}_n$ is approximately normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$.
Exercise 1
Consider an experiment that can have only two outcomes: either success, with probability $p$, or failure, with probability $1-p$. The probability of success is unknown, but we know that
$$p \in \left[\frac{1}{10}, \frac{1}{5}\right]$$
Suppose we can independently repeat the experiment as many times as we wish and use the ratio
$$\frac{\text{Successes obtained}}{\text{Total experiments performed}}$$
as an estimator of $p$. What is the minimum number of experiments needed in order to be sure that the standard deviation of the estimator is less than $1/100$?
Solution
Denote by $\widehat{p}$ the estimator of $p$. It can be written as
$$\widehat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where $X_i$ is a Bernoulli random variable equal to 1 if the $i$-th experiment is a success and 0 otherwise. The variance of the estimator is
$$\mathrm{Var}[\widehat{p}\,] \overset{A}{=} \frac{\mathrm{Var}[X_i]}{n} \overset{B}{=} \frac{p(1-p)}{n} \le \max_{p \in [\frac{1}{10},\frac{1}{5}]}\frac{p(1-p)}{n} = \frac{\frac{1}{5}\left(1-\frac{1}{5}\right)}{n} = \frac{4}{25n}$$
where: in step A we have used the formula for the variance of the sample mean; in step B we have used the formula for the variance of a Bernoulli random variable. (The maximum is attained at $p = \frac{1}{5}$ because $p(1-p)$ is increasing on the interval $[\frac{1}{10},\frac{1}{5}]$.) Thus
$$\mathrm{Var}[\widehat{p}\,] \le \frac{4}{25n}$$
We need to ensure that
$$\mathrm{Std}[\widehat{p}\,] = \sqrt{\mathrm{Var}[\widehat{p}\,]} \le \frac{1}{100}$$
or
$$\mathrm{Var}[\widehat{p}\,] \le \frac{1}{10000}$$
which is certainly verified if
$$\frac{4}{25n} \le \frac{1}{10000}$$
i.e. if
$$n \ge \frac{40000}{25} = 1600$$
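A quick numeric check of the worst-case bound (the helper function name is ours):

```python
import math

def worst_case_std(n):
    """Largest possible Std of p_hat over p in [1/10, 1/5]."""
    p = 0.2   # p(1-p) is increasing on [0, 1/2], so the max is at p = 1/5
    return math.sqrt(p * (1 - p) / n)

print(worst_case_std(1600))   # hits the target 1/100 exactly at n = 1600
print(worst_case_std(1599))   # one experiment fewer is not enough
```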
Exercise 2
Suppose you observe a sample of 100 independent draws from a distribution having unknown mean $\mu$ and known variance $\sigma^2 = 1$. How can you approximate the distribution of their sample mean?
Solution
We can approximate the distribution of the sample mean with its asymptotic distribution. So the distribution of the sample mean can be approximated by a normal distribution with mean $\mu$ and variance
$$\frac{\sigma^2}{n} = \frac{1}{100}$$
Chapter 73
Point estimation of the variance
$$\xi_n = [X_1 \;\ldots\; X_n]$$
Proof. This can be proved using the linearity of the expected value:
$$\mathrm{E}\big[\widehat{\sigma}^2_n\big] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^2\big] \overset{A}{=} \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}[X_i] = \frac{1}{n}\,n\,\sigma^2 = \sigma^2$$
where in step A we have used the definition of variance.
The variance of the estimator can be computed as follows:
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] = \mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] \overset{A}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}\big[(X_i - \mu)^2\big]$$
$$= \frac{1}{n^2}\left(\sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^4\big] - \sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^2\big]^2\right) \overset{B}{=} \frac{1}{n^2}\left(\sum_{i=1}^{n}3\sigma^4 - \sum_{i=1}^{n}\sigma^4\right)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}2\sigma^4 = \frac{1}{n^2}\,n\,2\sigma^4 = \frac{2\sigma^4}{n}$$
where: in step A we have used the formula for the variance of an independent sum⁴; in step B we have used the fact that for a normal distribution
$$\mathrm{E}\big[(X_i - \mu)^4\big] = 3\sigma^4$$
Therefore, the variance of the estimator tends to zero as the sample size $n$ tends to infinity.
3 See p. 571.
where the variables $Z_i$ are independent standard normal random variables⁶ and $W$, being a sum of squares of $n$ independent standard normal random variables, has a Chi-square distribution with $n$ degrees of freedom⁷. Multiplying a Chi-square random variable with $n$ degrees of freedom by $\frac{\sigma^2}{n}$, one obtains⁸ a Gamma random variable with parameters $n$ and $\sigma^2$.
The mean squared error of the estimator is
$$\mathrm{MSE}\big(\widehat{\sigma}^2_n\big) = \mathrm{E}\Big[\big\|\widehat{\sigma}^2_n - \sigma^2\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(\widehat{\sigma}^2_n - \sigma^2\big)^2\Big] \overset{B}{=} \mathrm{Var}\big[\widehat{\sigma}^2_n\big] = \frac{2\sigma^4}{n}$$
where: in step A we have used the fact that the Euclidean norm in one dimension is the same as the absolute value; in step B we have used the definition of variance and (73.1).
4 See p. 168.
5 See p. 397.
6 See p. 375.
7 See p. 395.
8 See p. 402.
9 See p. 571.
can be viewed as the sample mean of a sequence $\{Y_n\}$ where the generic term of the sequence is
$$Y_n = (X_n - \mu)^2$$
The sequence $\{Y_n\}$ satisfies the conditions of Kolmogorov's Strong Law of Large Numbers¹⁰ ($\{Y_n\}$ is an IID sequence with finite mean). Therefore, the sample mean of $Y_n$ converges almost surely to the true mean $\mathrm{E}[Y_n]$:
$$\widehat{\sigma}^2_n \xrightarrow{a.s.} \mathrm{E}[Y_n] = \sigma^2$$
which also implies convergence in probability:
$$\mathop{\mathrm{plim}}_{n\to\infty} \widehat{\sigma}^2_n = \sigma^2$$
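Almost sure convergence along a single simulated path can be sketched as follows (normal data with $\sigma^2 = 9$ and a known mean $\mu = 0$ are arbitrary illustrative choices):

```python
import random

random.seed(5)

mu, sigma2 = 0.0, 9.0
running_sum, estimates = 0.0, {}
for n in range(1, 200001):
    x = random.gauss(mu, 3.0)
    running_sum += (x - mu) ** 2          # accumulate (X_i - mu)^2
    if n in (100, 10000, 200000):
        estimates[n] = running_sum / n    # sigma2_hat at this sample size

print(estimates)   # the running estimates approach sigma2 = 9 along the path
```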
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The expected value of the unadjusted sample variance $S_n^2 = \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2$ can be computed as follows:
$$\mathrm{E}\big[S_n^2\big] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2\right] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \big(\overline{X}_n - \mu\big)^2\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
where we have used the identity $\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \big(\overline{X}_n - \mu\big)^2$, together with the facts that $\mathrm{E}\big[(X_i - \mu)^2\big] = \sigma^2$ and $\mathrm{E}\big[\big(\overline{X}_n - \mu\big)^2\big] = \mathrm{Var}\big[\overline{X}_n\big] = \frac{\sigma^2}{n}$. Therefore, the unadjusted sample variance is a biased estimator of $\sigma^2$.
The variance of the unadjusted sample variance is
$$\mathrm{Var}\big[S_n^2\big] = \frac{n-1}{n}\,\frac{2\sigma^4}{n}$$
Proof. This is proved in subsection 73.2.5.
The variance of the adjusted sample variance is:
$$\mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
Proof. This is also proved in subsection 73.2.5.
Therefore, both the variance of $S_n^2$ and the variance of $s_n^2$ converge to zero as the sample size $n$ tends to infinity. Also note that the unadjusted sample variance $S_n^2$, despite being biased, has a smaller variance than the adjusted sample variance $s_n^2$, which is instead unbiased.
$$M = I - \frac{1}{n}\,\iota\,\iota^{\top}$$
where $I$ is an $n \times n$ identity matrix and $\iota$ is an $n \times 1$ vector of ones. $M$ is symmetric and idempotent. Denote by $X$ the $n \times 1$ random vector whose $i$-th entry is equal to $X_i$.
i.e. $S_n^2$ is a Chi-square random variable divided by its number of degrees of freedom and multiplied by $\frac{(n-1)\sigma^2}{n}$. Thus¹⁶, $S_n^2$ is a Gamma random variable with parameters $n-1$ and $\frac{(n-1)\sigma^2}{n}$. Also, by the properties of Gamma random variables, its expected value is:
$$\mathrm{E}\big[S_n^2\big] = \frac{(n-1)\sigma^2}{n}$$
and its variance is:
$$\mathrm{Var}\big[S_n^2\big] = \frac{2}{n-1}\left(\frac{(n-1)\sigma^2}{n}\right)^2 = \frac{n-1}{n}\,\frac{2\sigma^4}{n}$$
The adjusted sample variance $s_n^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$.
Proof. The proof of this result is similar to the proof for the unadjusted sample variance found above. It can also be found in the lecture entitled Quadratic forms in normal vectors (p. 481). Here, we just note that $s_n^2$, being a Gamma random variable with parameters $n-1$ and $\sigma^2$, has expected value
$$\mathrm{E}\big[s_n^2\big] = \sigma^2$$
and variance
$$\mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
15 See p. 439.
16 See p. 397.
The mean squared error of the unadjusted sample variance is
$$\mathrm{MSE}\big(S_n^2\big) = \mathrm{E}\Big[\big\|S_n^2 - \sigma^2\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(S_n^2 - \sigma^2\big)^2\Big] = \mathrm{Var}\big[S_n^2\big] + \big(\mathrm{E}\big[S_n^2\big] - \sigma^2\big)^2$$
$$= \frac{n-1}{n}\,\frac{2\sigma^4}{n} + \left(\frac{(n-1)\sigma^2}{n} - \sigma^2\right)^2 = \frac{2n-2}{n^2}\,\sigma^4 + \frac{(n-1-n)^2}{n^2}\,\sigma^4 = \frac{2n-1}{n^2}\,\sigma^4$$
where: in step A we have used the fact that the Euclidean norm in one dimension is the same as the absolute value.
The mean squared error of the adjusted sample variance is:
$$\mathrm{MSE}\big(s_n^2\big) = \frac{2\sigma^4}{n-1}$$
Proof. It can be proved as follows:
$$\mathrm{MSE}\big(s_n^2\big) = \mathrm{E}\Big[\big\|s_n^2 - \sigma^2\big\|^2\Big] = \mathrm{E}\Big[\big(s_n^2 - \sigma^2\big)^2\Big] = \mathrm{E}\Big[\big(s_n^2 - \mathrm{E}\big[s_n^2\big]\big)^2\Big] = \mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
where we have used the fact that $s_n^2$ is unbiased, i.e. $\mathrm{E}\big[s_n^2\big] = \sigma^2$.
Therefore, the mean squared error of the unadjusted sample variance is always smaller than the mean squared error of the adjusted sample variance:
$$\mathrm{MSE}\big(S_n^2\big) = \frac{2n-1}{n^2}\,\sigma^4 = \left(\frac{2}{n} - \frac{1}{n^2}\right)\sigma^4 < \frac{2\sigma^4}{n} < \frac{2\sigma^4}{n-1} = \mathrm{MSE}\big(s_n^2\big)$$
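This ranking of mean squared errors can be verified by Monte Carlo ($n = 10$ and $\sigma^2 = 4$ are arbitrary choices):

```python
import random

random.seed(6)

n, sigma2, reps = 10, 4.0, 40000
mse_unadj, mse_adj = 0.0, 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    mse_unadj += (ss / n - sigma2) ** 2        # unadjusted S_n^2
    mse_adj += (ss / (n - 1) - sigma2) ** 2    # adjusted s_n^2
mse_unadj /= reps
mse_adj /= reps

# Theory: MSE(S2) = (2n-1)/n^2 * sigma^4 = 0.19 * 16 = 3.04
#         MSE(s2) = 2/(n-1) * sigma^4   = 16 * 2/9 ≈ 3.56
print(round(mse_unadj, 2), round(mse_adj, 2))
```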
The unadjusted sample variance can be written as
$$S_n^2 \overset{B}{=} W_n - \overline{X}_n^2$$
where $W_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$.
17 See p. 555.
$$\mathop{\mathrm{plim}}_{n\to\infty} S_n^2 = \sigma^2$$
The adjusted sample variance can be written as
$$s_n^2 = Z_n S_n^2$$
where $Z_n = \frac{n}{n-1} \to 1$, so that both $Z_n$ and $S_n^2$ are almost surely convergent. Since the product is a continuous function and almost sure convergence is preserved by continuous transformation, we have:
$$s_n^2 \xrightarrow{a.s.} 1 \cdot \sigma^2 = \sigma^2$$
Exercise 1
You observe three independent draws from a normal distribution having unknown mean $\mu$ and unknown variance $\sigma^2$. Their values are 50, 100 and 150. Use these values to produce an unbiased estimate of the variance of the distribution.
Solution
The sample mean is
$$\overline{X}_n = \frac{50 + 100 + 150}{3} = 100$$
An unbiased estimate of the variance is provided by the adjusted sample variance:
$$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2 = \frac{1}{3-1}\Big[(50-100)^2 + (100-100)^2 + (150-100)^2\Big] = \frac{1}{2}\,[2500 + 0 + 2500] = \frac{5000}{2} = 2500$$
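The same computation in Python (`statistics.variance` computes the adjusted, i.e. $n-1$ denominator, sample variance):

```python
import statistics

sample = [50, 100, 150]
print(statistics.mean(sample))      # sample mean: 100
print(statistics.variance(sample))  # adjusted sample variance: 2500
```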
Exercise 2
A machine (a laser rangefinder) is used to measure the distance between the machine itself and a given object. When measuring the distance to an object located 10 meters away, the measurement errors committed by the machine are normally and independently distributed and are on average equal to zero. The variance of the measurement errors is less than 1 squared centimeter, but its exact value is unknown and needs to be estimated. To estimate it, we repeatedly take the same measurement and we compute the sample variance of the measurement errors (which we are able to compute, because we know the true distance). How many measurements do we need to take to obtain an estimator of variance having a standard deviation less than 0.1 squared centimeters?
Solution
Denote the measurement errors by $X_1$, . . . , $X_n$. The following estimator of variance is used:
$$\widehat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$$
The variance of this estimator is:
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \overset{A}{=} \frac{2\,(\mathrm{Var}[X_i])^2}{n} \overset{B}{\le} \frac{2\,(1\,\mathrm{cm}^2)^2}{n} = \frac{2}{n}\,\mathrm{cm}^4$$
where: in step A we have used the formula for the variance of the sample variance; in step B we have used the upper bound stated in the problem (variance less than 1 squared centimeter). Thus
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \le \frac{2}{n}\,\mathrm{cm}^4$$
We need to ensure that
$$\mathrm{Std}\big[\widehat{\sigma}^2_n\big] = \sqrt{\mathrm{Var}\big[\widehat{\sigma}^2_n\big]} \le \frac{1}{10}\,\mathrm{cm}^2$$
or
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \le \frac{1}{100}\,\mathrm{cm}^4$$
which is certainly verified if
$$\frac{2}{n} \le \frac{1}{100}$$
i.e. if
$$n \ge 200$$
Chapter 74
Set estimation
$$T = T(\xi)$$
$$\theta_0 \in \Theta$$
where $\Theta$ is the parameter space, containing all the parameters that are deemed plausible. The statistician believes the statement to be true, but the statement is not very informative, because $\Theta$ is a very large set. After observing the data, she makes a more informative statement:
$$\theta_0 \in T$$
The coverage probability of a set estimator $T$ is
$$C(T; \theta_0) = P_{\theta_0}\big(\theta_0 \in T(\xi)\big)$$
where the notation $P_{\theta_0}$ is used to indicate the fact that the probability is calculated using the distribution function $F(\cdot\,;\theta_0)$ associated to the true parameter $\theta_0$. It is important to note that in the above definition of coverage probability the random quantity is the interval $T(\xi)$, while the parameter $\theta_0$ is fixed.
In practice, the coverage probability is seldom known, because it depends on the unknown parameter $\theta_0$ (although in some cases it is equal for all the parameters belonging to the parameter space). When the coverage probability is not known, it is customary to compute the confidence coefficient $c(T)$, which is defined as follows:
$$c(T) = \inf_{\theta \in \Theta} C(T; \theta)$$
In other words, the confidence coefficient $c(T)$ is equal to the smallest possible coverage probability. The confidence coefficient is also often called the level of confidence.
then the size of $T$ is its volume. The size of a confidence set is also called the measure of a confidence set (for those who have a grasp of measure theory, the name stems from the fact that the Lebesgue measure is the generalization of volume to multidimensional spaces). If we denote by $\lambda(T)$ the size of a confidence set, then we can also define the expected size of a set estimator $T$:
$$\mathrm{E}_{\theta_0}\big[\lambda(T(\xi))\big]$$
where the notation $\mathrm{E}_{\theta_0}$ is used to indicate the fact that the expected value is calculated using the distribution function $F(\cdot\,;\theta_0)$ associated to the true parameter $\theta_0$. Like the coverage probability, also the expected size of a set estimator depends on the unknown parameter $\theta_0$. Hence, unless it is a constant function of $\theta_0$, one needs to somehow estimate it or to take the infimum over all possible values of the parameter, as we did above for coverage probabilities.
74.5 Examples
You can find examples of set estimation in the lectures entitled Set estimation of the mean (p. 595) and Set estimation of the variance (p. 607).
Chapter 75
Set estimation of the mean

This lecture presents some examples of set estimation problems, focusing on set estimation of the mean, i.e. on using a sample to produce a set estimate of the mean of an unknown distribution.
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
$$C(T_n; \mu) = P(\mu \in T_n) = P(-z \le Z \le z)$$
In the lecture entitled Point estimation of the mean (p. 573), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the sample mean $\overline{X}_n$ has a normal distribution with mean $\mu$ and variance $\sigma^2/n$. Subtracting the mean of a normal random variable from the random variable itself and dividing it by the square root of its variance, one obtains a standard normal random variable. Therefore, the variable $Z$ has a standard normal distribution.
75.1.5 Size
The size⁶ of the interval estimator $T_n$ is
$$\lambda(T_n) = \lambda\!\left(\left[\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z\right]\right) = \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z - \left(\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z\right) = 2\sqrt{\frac{\sigma^2}{n}}\,z$$
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
6 See p. 592.
7 See p. 583.
where $z \in \mathbb{R}_{++}$ is a strictly positive constant and the superscripts $u$ and $a$ indicate whether the estimator is based on the unadjusted or the adjusted sample variance. Now, rewrite $Z_{n-1}$ as
$$Z_{n-1} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{S_n^2/n}} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{\sqrt{\sigma^2/n}}{\sqrt{S_n^2/n}} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{S_n^2/\sigma^2}}$$
$$= \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{\frac{n-1}{n}\,s_n^2/\sigma^2}} = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{s_n^2/\sigma^2}} = \frac{Y}{\sqrt{W}}$$
where we have defined
$$Y = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}, \qquad W = s_n^2/\sigma^2$$
and we have used the fact that the unadjusted sample variance can be expressed as a function of the adjusted sample variance as follows:
$$S_n^2 = \frac{n-1}{n}\,s_n^2$$
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the adjusted sample variance $s_n^2$ has a Gamma distribution¹⁰ with parameters $n-1$ and $\sigma^2$. Therefore, the random variable $W$ has a Gamma distribution with parameters $n-1$ and 1. Moreover, the random variable $Y$ has a standard normal distribution (see the previous section). Hence, $Z_{n-1}$ is the ratio between a standard normal random variable and the square root of a Gamma random variable with parameters $n-1$ and 1. As a consequence, $Z_{n-1}$ has a standard Student's t distribution with $n-1$ degrees of freedom¹¹.
The coverage probability of the interval estimator $T_n^a$ is
$$C(T_n^a; \mu, \sigma^2) = P(\mu \in T_n^a) = P(-z \le Z_{n-1} \le z)$$
Proof. Define
$$Y = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}, \qquad W = s_n^2/\sigma^2$$
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the adjusted sample variance $s_n^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$. Therefore, the random variable $W$ has a Gamma distribution with parameters $n-1$ and 1. Moreover, the random variable $Y$ has a standard normal distribution (see the previous section). Hence, $Z_{n-1}$ is the ratio between a standard normal random variable and the square root of a Gamma random variable with parameters $n-1$ and 1. As a consequence, $Z_{n-1}$ has a standard Student's t distribution with $n-1$ degrees of freedom (see also the previous proof).
Note that the coverage probability of the confidence interval based on the unadjusted sample variance $S_n^2$ is lower than the coverage probability of the confidence interval based on the adjusted sample variance $s_n^2$, because
$$\sqrt{\frac{n-1}{n}}\,z < z$$
and, as a consequence,
$$C(T_n^u; \mu, \sigma^2) = P\!\left(-\sqrt{\frac{n-1}{n}}\,z \le Z_{n-1} \le \sqrt{\frac{n-1}{n}}\,z\right) < P(-z \le Z_{n-1} \le z) = C(T_n^a; \mu, \sigma^2)$$
75.2.5 Size
The size of the confidence interval $T_n^u$ is
$$\lambda(T_n^u) = \lambda\!\left(\left[\overline{X}_n - \sqrt{\frac{S_n^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{S_n^2}{n}}\,z\right]\right) = \overline{X}_n + \sqrt{\frac{S_n^2}{n}}\,z - \left(\overline{X}_n - \sqrt{\frac{S_n^2}{n}}\,z\right) = 2\sqrt{\frac{S_n^2}{n}}\,z$$
Note that the size of the confidence interval based on the unadjusted sample variance $S_n^2$ is smaller than the size of the confidence interval based on the adjusted sample variance $s_n^2$, because
$$S_n^2 < s_n^2$$
and, as a consequence,
$$\lambda(T_n^u) = 2\sqrt{\frac{S_n^2}{n}}\,z < 2\sqrt{\frac{s_n^2}{n}}\,z = \lambda(T_n^a)$$
Thus, the confidence interval based on the unadjusted sample variance has a smaller size and a smaller coverage probability. As we have explained in the lecture entitled Set estimation (p. 591), the choice of set estimators is often inspired by the principle of achieving the highest possible coverage probability for a given size, or the smallest possible size for a given coverage probability. Following this principle, there is no clear ranking between the estimator based on the unadjusted sample variance and the estimator based on the adjusted sample variance, because the former has smaller size, but the latter has higher coverage probability.
The expected size of the interval estimator $T_n^u$ can be derived as follows. Define
$$X = S_n^2$$
which has a Gamma distribution with parameters $n-1$ and $\frac{(n-1)\sigma^2}{n}$ (see above), i.e. a distribution with density function
$$f_X(x) = c\, x^{(n-1)/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)$$
where $c$ is a constant:
$$c = \frac{(n/\sigma^2)^{(n-1)/2}}{2^{(n-1)/2}\,\Gamma\big((n-1)/2\big)}$$
and $\Gamma(\cdot)$ is the Gamma function¹². Therefore:
$$\mathrm{E}[\lambda(T_n^u)] = \mathrm{E}\!\left[2\sqrt{\frac{S_n^2}{n}}\,z\right] = 2z\sqrt{\frac{1}{n}}\,\mathrm{E}\!\left[\sqrt{S_n^2}\right] = 2zn^{-1/2}\,\mathrm{E}\!\left[X^{1/2}\right]$$
$$= 2zn^{-1/2}\int_0^{\infty} x^{1/2}\, c\, x^{(n-1)/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 2zn^{-1/2}\,c\int_0^{\infty} x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx$$
$$= 2zn^{-1/2}\,c\,\frac{1}{c_1}\int_0^{\infty} c_1\, x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 2zn^{-1/2}\,\frac{c}{c_1}$$
where we have defined
$$c_1 = \frac{(n/\sigma^2)^{n/2}}{2^{n/2}\,\Gamma(n/2)}$$
and we have used the fact that
$$\int_0^{\infty} c_1\, x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 1$$
because it is the integral of the density of a Gamma random variable with parameters $n$ and $\sigma^2$ over its support, and probability densities integrate to 1. Thus:
$$\mathrm{E}[\lambda(T_n^u)] = 2zn^{-1/2}\,\frac{c}{c_1} = 2zn^{-1/2}\,\frac{(n/\sigma^2)^{(n-1)/2}}{2^{(n-1)/2}\,\Gamma\big((n-1)/2\big)}\,\frac{2^{n/2}\,\Gamma(n/2)}{(n/\sigma^2)^{n/2}}$$
$$= 2zn^{-1/2}\,2^{1/2}\,(n/\sigma^2)^{-1/2}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)} = 2z\sqrt{\frac{2}{n}}\sqrt{\frac{\sigma^2}{n}}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)} = \left[2\sqrt{\frac{2}{n}}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)}\,z\right]\sqrt{\frac{\sigma^2}{n}}$$
12 See p. 55.
Exercise 1
Suppose you observe a sample of 100 independent draws from a normal distribution having unknown mean $\mu$ and known variance $\sigma^2 = 1$. Denote the 100 draws by $X_1$, . . . , $X_{100}$. Suppose their sample mean $\overline{X}_{100}$ is equal to 1, i.e.:
$$\overline{X}_{100} = \frac{1}{100}\sum_{i=1}^{100} X_i = 1$$
Find a confidence interval for $\mu$, using a set estimator of $\mu$ having 90% coverage probability.
Solution
For a given sample size $n$, the interval estimator
$$T_n = \left[\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z\right]$$
Exercise 2
Suppose you observe a sample of 100 independent draws from a normal distribution
having unknown mean and unknown variance 2 . Denote the 100 draws by X1 ,
. . . , X100 . Suppose their sample mean X 100 is equal to 1, i.e.:
100
1 X
X 100 = Xi = 1
100 i=1
75.3. SOLVED EXERCISES 605
Find a con…dence interval for , using a set estimator of having 99% coverage
probability.
Solution
For a given sample size n, the interval estimator
" r r #
a s2n s2n
Tn = X n z; X n + z
n n
C Tna ; ; 2
= P ( 2 Tna ) = P ( z Zn 1 z)
P( z Zn 1 z) = 99%
But

P(−z ≤ Z_{n−1} ≤ z) = P(Z_{n−1} ≤ z) − P(Z_{n−1} < −z)
 = 1 − P(Z_{n−1} > z) − P(Z_{n−1} < −z)
 = 1 − 2 P(Z_{n−1} > z)

where the last equality stems from the fact that the standard Student's t distribution is symmetric around zero. Therefore z must be such that

1 − 2 P(Z_{n−1} > z) = 0.99

or

P(Z_{n−1} > z) = 0.005
Using a computer program to find the value of z (for example, with the MATLAB
command tinv(0.995,99)), we obtain

z = 2.6264
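The same value can be obtained outside MATLAB; for example, a sketch with SciPy (`scipy.stats.t` is the Student's t distribution):

```python
from scipy.stats import t

# analogue of the MATLAB call tinv(0.995, 99):
# the 99.5th percentile of a Student's t with 99 degrees of freedom
z = t.ppf(0.995, df=99)
```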
Chapter 76

Set estimation of the variance

This lecture presents some examples of set estimation1 problems, focusing on set
estimation of the variance, i.e. on using a sample to produce a set estimate of
the variance σ² of an unknown distribution.

ξ_n = [X_1 … X_n]

1 See p. 591.
2 See p. 564.
3 See p. 591.
T_n = [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ]

P(σ² ∈ T_n) = P( n σ̂²_n / z_2 ≤ σ² ≤ n σ̂²_n / z_1 )
 = P( { n σ̂²_n / z_2 ≤ σ² } ∩ { σ² ≤ n σ̂²_n / z_1 } )
 = P( { n σ̂²_n / σ² ≤ z_2 } ∩ { z_1 ≤ n σ̂²_n / σ² } )
 = P( z_1 ≤ n σ̂²_n / σ² ≤ z_2 )
 = P( z_1 ≤ Z ≤ z_2 )

where Z = n σ̂²_n / σ².
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample ξ_n made above, the estimator σ̂²_n has a Gamma distribution6 with parameters n and σ². Multiplying a Gamma random variable with parameters n and σ² by n/σ² one obtains a Chi-square random variable with n degrees of freedom. Therefore, the variable Z has a Chi-square distribution with n degrees of freedom.
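This distributional fact can be checked by simulation; a sketch assuming NumPy is available (the sample size, variance and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 200_000

# draws with known mean 0; each row is one sample of size n
x = rng.normal(loc=0.0, scale=sigma2 ** 0.5, size=(reps, n))
var_hat = np.mean(x ** 2, axis=1)   # variance estimator using the known mean
z = n * var_hat / sigma2            # should be Chi-square with n degrees of freedom

# a Chi-square with n degrees of freedom has mean n and variance 2n
```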
76.1.5 Size
The size8 of the interval estimator T_n is

λ(T_n) = λ( [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ] )
 = n σ̂²_n / z_1 − n σ̂²_n / z_2
 = n (1/z_1 − 1/z_2) σ̂²_n

and its expected value is

E[λ(T_n)] = E[ n (1/z_1 − 1/z_2) σ̂²_n ]
 = n (1/z_1 − 1/z_2) E[σ̂²_n]
 = n (1/z_1 − 1/z_2) σ²

where we have used the fact that σ̂²_n is an unbiased estimator of σ²
(i.e. E[σ̂²_n] = σ², see p. 580).
ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]

8 See p. 592.
T_n = [ n S_n² / z_2 , n S_n² / z_1 ] = [ (n−1) s_n² / z_2 , (n−1) s_n² / z_1 ]

P(σ² ∈ T_n) = P( n S_n² / z_2 ≤ σ² ≤ n S_n² / z_1 )
 = P( { n S_n² / z_2 ≤ σ² } ∩ { σ² ≤ n S_n² / z_1 } )
 = P( { n S_n² / σ² ≤ z_2 } ∩ { z_1 ≤ n S_n² / σ² } )
 = P( z_1 ≤ n S_n² / σ² ≤ z_2 )
 = P( z_1 ≤ Z_{n−1} ≤ z_2 )

where Z_{n−1} = n S_n² / σ².
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample ξ_n made above, the unadjusted sample variance S_n² has a Gamma distribution with parameters n − 1 and ((n−1)/n) σ². Multiplying a Gamma random variable with parameters n − 1 and ((n−1)/n) σ² by n/σ² one obtains a Chi-square random variable with n − 1 degrees of freedom, so the variable Z_{n−1} has a Chi-square distribution with n − 1 degrees of freedom.

9 See p. 583 for a definition and a discussion of adjusted and unadjusted sample variance.
76.2.5 Size
The size of the confidence interval T_n is

λ(T_n) = λ( [ n S_n² / z_2 , n S_n² / z_1 ] )
 = n (1/z_1 − 1/z_2) S_n²

and its expected value is

E[λ(T_n)] = E[ n (1/z_1 − 1/z_2) S_n² ]
 = n (1/z_1 − 1/z_2) E[S_n²]
 = n (1/z_1 − 1/z_2) ((n−1)/n) σ²
 = (n−1) (1/z_1 − 1/z_2) σ²

where in the penultimate step we have used the fact (proved in the lecture entitled
Point estimation of the variance - p. 579) that

E[S_n²] = ((n−1)/n) σ²
76.3 Solved exercises

Exercise 1
Suppose you observe a sample of 100 independent draws from a normal distribution
having known mean μ = 0 and unknown variance σ². Denote the 100 draws by
X_1, …, X_100. Suppose that

σ̂²_100 = (1/100) Σ_{i=1}^{100} X_i² = 1
Find a confidence interval for σ², using a set estimator of σ² having 90% coverage
probability.
Hint: a Chi-square random variable Z with 100 degrees of freedom has a distribution function F_Z(z) such that

F_Z(77.9295) = 0.05
F_Z(124.3421) = 0.95
Solution
For a given sample size n, the interval estimator

T_n = [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ]

has coverage probability

C(T_n; σ²) = P(σ² ∈ T_n) = P(z_1 ≤ Z ≤ z_2)

where Z is a Chi-square random variable with n degrees of freedom and z_1, z_2 ∈
R_{++} are strictly positive constants. Thus, if we set

z_1 = 77.9295
z_2 = 124.3421

then

P(z_1 ≤ Z ≤ z_2) = P(Z ≤ z_2) − P(Z < z_1)
 (A) = P(Z ≤ z_2) − P(Z ≤ z_1)
 (B) = F_Z(z_2) − F_Z(z_1)
 = F_Z(124.3421) − F_Z(77.9295)
 = 0.95 − 0.05 = 0.9

which is equal to our desired coverage probability (in step A we have used the
fact that any specific realization of an absolutely continuous random variable has
zero probability; in step B we have used the definition of distribution function).
Thus, the confidence interval for σ² is

T_100 = [ 100 σ̂²_100 / z_2 , 100 σ̂²_100 / z_1 ]
 = [ 100/124.3421 , 100/77.9295 ]
 = [0.8042, 1.2832]
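The critical values in the hint and the resulting interval can be reproduced with SciPy (a sketch; `scipy.stats.chi2` is the Chi-square distribution):

```python
from scipy.stats import chi2

n = 100
var_hat = 1.0                 # variance estimate computed with the known mean
z1 = chi2.ppf(0.05, df=n)     # lower critical value
z2 = chi2.ppf(0.95, df=n)     # upper critical value
ci = (n * var_hat / z2, n * var_hat / z1)
```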
Exercise 2
Suppose you observe a sample of 100 independent draws from a normal distribution
having unknown mean μ and unknown variance σ². Denote the 100 draws by X_1,
…, X_100. Suppose that their adjusted sample variance s²_100 is equal to 5, i.e.:

s²_100 = (1/99) Σ_{i=1}^{100} (X_i − X̄_100)² = 5
Find a confidence interval for σ², using a set estimator of σ² having 99% coverage
probability.
Hint: a Chi-square random variable Z with 99 degrees of freedom has a distribution function F_Z(z) such that

F_Z(66.5101) = 0.005
F_Z(138.9868) = 0.995
Solution
For a given sample size n, the interval estimator

T_n = [ (n−1) s_n² / z_2 , (n−1) s_n² / z_1 ]

has coverage probability

C(T_n; μ, σ²) = P(σ² ∈ T_n) = P(z_1 ≤ Z ≤ z_2)

where Z is a Chi-square random variable with n − 1 degrees of freedom and z_1, z_2 ∈
R_{++} are strictly positive constants. Thus, if we set

z_1 = 66.5101
z_2 = 138.9868

then

P(z_1 ≤ Z ≤ z_2) = P(Z ≤ z_2) − P(Z < z_1)
 (A) = P(Z ≤ z_2) − P(Z ≤ z_1)
 (B) = F_Z(z_2) − F_Z(z_1)
 = F_Z(138.9868) − F_Z(66.5101)
 = 0.995 − 0.005 = 0.99

which is equal to our desired coverage probability (in step A we have used the
fact that any specific realization of an absolutely continuous random variable has
zero probability; in step B we have used the definition of distribution function).
Thus, the confidence interval for σ² is

T_100 = [ 99 s²_100 / z_2 , 99 s²_100 / z_1 ]
 = [ (99/138.9868) 5 , (99/66.5101) 5 ]
 = [3.5615, 7.4425]
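Again, the hint's critical values and the interval can be reproduced with SciPy (a sketch):

```python
from scipy.stats import chi2

n = 100
s2 = 5.0                          # adjusted sample variance
z1 = chi2.ppf(0.005, df=n - 1)    # lower critical value
z2 = chi2.ppf(0.995, df=n - 1)    # upper critical value
ci = ((n - 1) * s2 / z2, (n - 1) * s2 / z1)
```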
Chapter 77
Hypothesis testing
Roughly speaking, we start from a large set of distributions Φ that might possibly have generated the sample and we would like to restrict our attention to a smaller set Φ_R ⊂ Φ. In a test of hypothesis, we use the sample to decide whether or not to restrict our attention to the smaller set Φ_R.
If we have a parametric model, we can also carry out parametric tests of hy-
pothesis.
Remember that in a parametric model the set of distribution functions Φ is
put into correspondence with a set Θ ⊆ R^p of p-dimensional real vectors called
the parameter space. The elements of Θ are called parameters. Denote by θ_0 the
parameter that is associated with the unknown distribution function F(x) and
assume that θ_0 is unique. θ_0 is called the true parameter, because it is associated
to the distribution that actually generated the sample.

In parametric hypothesis testing we have a restriction Θ_R on the parameter
space Θ and we choose one of the following two statements about the restriction:

H_0 : θ_0 ∈ Θ_R, called the null hypothesis, or
H_1 : θ_0 ∉ Θ_R, called the alternative hypothesis.
For some authors, "rejecting the null hypothesis H0 " and "accepting the alter-
native hypothesis H1 " are synonyms. For other authors, however, "rejecting the
null hypothesis H0 " does not necessarily imply "accepting the alternative hypoth-
esis H1 ". Although this is mostly a matter of language, it is possible to envision
situations in which, after rejecting H0 , a second test of hypothesis is performed
whereby H1 becomes the new null hypothesis and it is rejected (this may happen
for example if the model is mis-specified2 ). In these situations, if "rejecting the
null hypothesis H0" and "accepting the alternative hypothesis H1" are treated as
synonyms, then some confusion arises, because the first test leads to "accept H1"
and the second test leads to "reject H1".
Also note that some statisticians sometimes take into consideration as an alternative hypothesis a set smaller than the complement Θ_R^c. In these cases, the null hypothesis and the alternative hypothesis do not cover all the possibilities contemplated by the parameter space Θ.
A subset C of the support of the sample ξ is called the critical region (or rejection region) of the test: it is the set of all values of ξ for which the null hypothesis is rejected. The test is usually performed through a test statistic

S = s(ξ)

A critical region for S is a subset C_S ⊆ R of the set of real numbers (so that
C_S ∪ C_S^c = R), and the test is performed based on the test statistic, as follows:

s(ξ) ∈ C_S ⟹ ξ ∈ C ⟹ H_0 is rejected
s(ξ) ∉ C_S ⟹ ξ ∉ C ⟹ H_0 is not rejected

When the complement of the critical region is an interval,

C_S^c = [s_l, s_u]

the lower bound of the interval s_l and the upper bound s_u are called critical
values of the test.
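The mechanics above can be sketched in code; the function name below is a hypothetical illustration, not notation from this lecture:

```python
def reject_h0(s, s_l, s_u):
    """Decide a test given the value s of the test statistic and the
    critical values s_l, s_u: H0 is rejected when s falls in the
    critical region C_S, i.e. outside the interval [s_l, s_u]."""
    return not (s_l <= s <= s_u)
```

For instance, with critical values −2 and 2, a statistic of 3 leads to rejection while a statistic of 1 does not.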
The power function of the test is

π(θ) = P_θ(ξ ∈ C)

where the notation P_θ is used to indicate the fact that the probability is calculated
using the distribution function F(x; θ) associated to the parameter θ.
α = sup_{θ ∈ Θ_R} π(θ)

and it is called the size of the test. The size of the test is also called by some
authors the level of significance of the test. However, according to other
authors, who assign a slightly different meaning to the term, the level of significance
of a test is an upper bound on the size of the test, i.e. a constant α that, to the
statistician's knowledge, satisfies:

sup_{θ ∈ Θ_R} π(θ) ≤ α
77.9 Examples
You can find examples of hypothesis testing in the lectures entitled Hypothesis tests
about the mean (p. 619) and Hypothesis tests about the variance (p. 629).
Chapter 78

Hypothesis tests about the mean

ξ_n = [X_1 … X_n]

H_0 : μ = μ_0

1 See p. 615.
2 See p. 616.
Z_n = (X̄_n − μ_0) / √(σ²/n)

This test statistic is often called z-statistic or normal z-statistic, and a test of
hypothesis based on this statistic is called z-test or normal z-test.
C_{Z_n} = (−∞, −z) ∪ (z, +∞)

The power function of the test is

π(μ) = P_μ( Z_n ∉ [−z, z] )
 = 1 − P_μ( Z_n ∈ [−z, z] )
 = 1 − P_μ( −z ≤ (X̄_n − μ_0)/√(σ²/n) ≤ z )
 = 1 − P_μ( −z + μ_0/√(σ²/n) ≤ X̄_n/√(σ²/n) ≤ z + μ_0/√(σ²/n) )
 = 1 − P_μ( −z + (μ_0 − μ)/√(σ²/n) ≤ (X̄_n − μ)/√(σ²/n) ≤ z + (μ_0 − μ)/√(σ²/n) )
 = 1 − P( −z + (μ_0 − μ)/√(σ²/n) ≤ Z ≤ z + (μ_0 − μ)/√(σ²/n) )

3 See p. 616.
4 See p. 617.
5 See p. 616.
6 See p. 617.
As demonstrated in the lecture entitled Point estimation of the mean (p. 573),
the sample mean X̄_n has a normal distribution with mean μ and variance σ²/n,
given the assumptions on the sample ξ_n we made above. By subtracting the mean
of a normal random variable from the random variable itself, and dividing it by
the square root of its variance, one obtains a standard normal random variable.
Therefore, the variable Z has a standard normal distribution.
π(μ_0) = P_{μ_0}( Z_n ∉ [−z, z] ) = 1 − P(−z ≤ Z ≤ z)
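The power function derived above can be evaluated numerically; a sketch assuming SciPy, with illustrative parameter values that are not from the lecture (μ_0 = 0, σ² = 1, n = 100, z = 1.96):

```python
from scipy.stats import norm

def power(mu, mu0=0.0, sigma2=1.0, n=100, z=1.96):
    """Power of the z-test of H0: mu = mu0 when the true mean is mu:
    1 - P(-z + shift <= Z <= z + shift), shift = (mu0 - mu)/sqrt(sigma2/n)."""
    shift = (mu0 - mu) / (sigma2 / n) ** 0.5
    return 1 - (norm.cdf(z + shift) - norm.cdf(-z + shift))
```

Evaluated at μ = μ_0 the power equals the size 1 − P(−z ≤ Z ≤ z), and it increases as the true mean moves away from μ_0.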
78.2 Normal IID samples - unknown variance

ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]

7 See p. 617.

H_0 : μ = μ_0
Z_n^u = (X̄_n − μ_0) / √(S_n²/n)

Z_n^a = (X̄_n − μ_0) / √(s_n²/n)
where the superscripts u and a indicate whether the test statistic is based on the
unadjusted or the adjusted sample variance. These two test statistics are often
called t-statistics or Student’s t-statistics, and tests of hypothesis based on
these statistics are called t-tests or Student’s t-tests.
C_{Z_n^i} = (−∞, −z) ∪ (z, +∞) for i ∈ {u, a}
The power function of the test based on the unadjusted sample variance is

π^u(μ) = P_μ( Z_n^u ∉ [−z, z] ) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal
to μ, and W_{n−1} is a non-central standard Student's t distribution9 with n − 1
degrees of freedom and non-centrality parameter equal to

(μ − μ_0) / √(σ²/n)

Given the assumptions on the sample ξ_n we made above, the sample mean X̄_n
has a normal distribution with10 mean μ and variance σ²/n, so that the random
variable

(X̄_n − μ) / √(σ²/n)

has a standard normal distribution.
The power function of the test based on the adjusted sample variance is

π^a(μ) = P_μ( Z_n^a ∉ [−z, z] ) = 1 − P( −z ≤ W_{n−1} ≤ z )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal to
μ, and W_{n−1} is a non-central standard Student's t distribution with n − 1 degrees
of freedom and non-centrality parameter equal to

(μ − μ_0) / √(σ²/n)

In fact,

π^a(μ) = 1 − P( −z ≤ [ (X̄_n − μ)/√(σ²/n) + (μ − μ_0)/√(σ²/n) ] / √(s_n²/σ²) ≤ z )
 = 1 − P( −z ≤ W_{n−1} ≤ z )

Given the assumptions on the sample ξ_n we made above, the sample mean X̄_n has
a normal distribution with mean μ and variance σ²/n, so that the random variable
(X̄_n − μ)/√(σ²/n) has a standard normal distribution.
Note that, for a fixed z, the test based on the unadjusted sample variance is
more powerful than the test based on the adjusted sample variance:

π^u(μ) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )
 > 1 − P( −z ≤ W_{n−1} ≤ z ) = π^a(μ)

because

√((n−1)/n) < 1

and, as a consequence,

P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z ) < P( −z ≤ W_{n−1} ≤ z )
The size of the test based on the unadjusted sample variance is equal to

π^u(μ_0) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )

where W_{n−1} is a standard Student's t distribution with n − 1 degrees of freedom,
because its non-centrality parameter is equal to

(μ_0 − μ_0) / √(σ²/n) = 0

The size of the test based on the adjusted sample variance is equal to

π^a(μ_0) = 1 − P( −z ≤ W_{n−1} ≤ z )

where W_{n−1} is a standard Student's t distribution with n − 1 degrees of freedom.
Proof. When evaluated at the point μ = μ_0, the power function is equal to the
size of the test, that is, the probability of committing a Type I error. The power
function evaluated at μ_0 is

π^a(μ_0) = 1 − P( −z ≤ W_{n−1} ≤ z )

where W_{n−1} is a non-central standard Student's t distribution with n − 1 degrees
of freedom and non-centrality parameter equal to

(μ_0 − μ_0) / √(σ²/n) = 0

and a non-central Student's t distribution with zero non-centrality parameter is a
standard Student's t distribution.
78.3 Solved exercises

Exercise 1
Denote by Fn (x; k) the distribution function of a non-central standard Student’s
t distribution with n degrees of freedom and non-centrality parameter equal to k.
Suppose a statistician observes 100 independent realizations of a normal random
variable. The mean and the variance of the random variable, which the statistician
does not know, are equal to 1 and 4 respectively. What is the probability, expressed
in terms of Fn (x; k), that the statistician will reject the null hypothesis that the
mean is equal to zero if she runs a t-test based on the 100 observed realizations,
setting z = 2 as the critical value, and using the adjusted sample variance to
compute the t-statistic?
Solution
The probability of rejecting the null hypothesis μ_0 = 0 is obtained by evaluating
the power function of the test at μ = 1:

π^a(μ) = π^a(1) = P_μ( Z_n^a ∉ [−z, z] ) = 1 − P( −2 ≤ W_99 ≤ 2 )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal to
μ = 1, and W_99 is a non-central standard Student's t distribution with 99 degrees
of freedom and non-centrality parameter

k = (μ − μ_0) / √(σ²/n) = (1 − 0) / √(4/100) = 10/2 = 5

Therefore, expressed in terms of F_n(x; k), the probability of rejection is

1 − F_99(2; 5) + F_99(−2; 5)
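The rejection probability can then be evaluated with SciPy's non-central t distribution (`scipy.stats.nct`); a sketch:

```python
from scipy.stats import nct

n, z = 100, 2.0
mu, mu0, sigma2 = 1.0, 0.0, 4.0
k = (mu - mu0) / (sigma2 / n) ** 0.5          # non-centrality parameter
# 1 - P(-z <= W <= z) for a non-central t with n-1 d.o.f. and non-centrality k
p_reject = 1 - (nct.cdf(z, df=n - 1, nc=k) - nct.cdf(-z, df=n - 1, nc=k))
```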
Exercise 2
Denote by F_n(x) the distribution function of a standard Student's t distribution
with n degrees of freedom, and by F_n^{-1}(p) its inverse. Suppose that a statistician
observes 100 independent realizations of a normal random variable, and she performs a t-test of the null hypothesis that the mean of the variable is equal to zero,
based on the 100 observed realizations, and using the unadjusted sample variance
to compute the t-statistic. What critical value should she use in order to incur in
a Type I error with 10% probability? Express it in terms of F_n^{-1}(p).
Solution
A Type I error is committed when the null hypothesis is true, but it is rejected.
The probability of rejecting the null hypothesis μ_0 = 0 is

π^u(μ_0) = π^u(0) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )
 = 1 − P( −√(99/100) z ≤ W_99 ≤ √(99/100) z )

where z is the critical value, and W_99 is a standard Student's t distribution with
99 degrees of freedom. This probability can be expressed as

1 − P( −√(99/100) z ≤ W_99 ≤ √(99/100) z )
 = 1 − [ F_99( √(99/100) z ) − F_99( −√(99/100) z ) ]
 = 1 − F_99( √(99/100) z ) + F_99( −√(99/100) z )
 (A) = 1 − F_99( √(99/100) z ) + 1 − F_99( √(99/100) z )
 = 2 − 2 F_99( √(99/100) z )

where: in step A we have used the fact that the density of a standard Student's
t distribution is symmetric around zero. Thus, we need to set z in such a way that

2 − 2 F_99( √(99/100) z ) = 1/10
This is accomplished by

z = √(100/99) F_99^{-1}(19/20)
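Numerically, with SciPy (a sketch), the critical value and the implied size can be checked as follows:

```python
from scipy.stats import t

n = 100
# z = sqrt(n/(n-1)) * F_{99}^{-1}(19/20), as derived above
z = (n / (n - 1)) ** 0.5 * t.ppf(0.95, df=n - 1)
size = 2 - 2 * t.cdf(z * ((n - 1) / n) ** 0.5, df=n - 1)
```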
Chapter 79

Hypothesis tests about the variance

ξ_n = [X_1 … X_n]

1 See p. 615.
2 See p. 616.
σ̂²_n = (1/n) Σ_{i=1}^n (X_i − μ)²

and the test statistic is

χ²_n = n σ̂²_n / σ_0²

This test statistic is often called Chi-square statistic (also written as χ²-statistic)
and a test of hypothesis based on this statistic is called Chi-square test (also
written as χ²-test).
C_{χ²_n} = [0, z_1) ∪ (z_2, +∞)
π(σ²) = 1 − P( z_1 ≤ n σ̂²_n / σ_0² ≤ z_2 )
 = 1 − P( (σ_0²/σ²) z_1 ≤ n σ̂²_n / σ² ≤ (σ_0²/σ²) z_2 )

As demonstrated in the lecture entitled Point estimation of the variance (p. 579),
the estimator σ̂²_n has a Gamma distribution8 with parameters n and σ², given the
assumptions on the sample ξ_n we made above. Multiplying a Gamma random
variable with parameters n and σ² by n/σ² one obtains a Chi-square random
variable with n degrees of freedom. Therefore, the variable n σ̂²_n / σ² has a Chi-square
distribution with n degrees of freedom.
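The statistic itself is straightforward to compute; a sketch assuming NumPy (the helper name is a hypothetical illustration):

```python
import numpy as np

def chi_square_statistic(x, mu, sigma2_0):
    """Chi-square statistic n * sigma_hat^2 / sigma2_0, where sigma_hat^2 is
    computed with the known mean mu; under H0 it has a Chi-square
    distribution with n degrees of freedom."""
    x = np.asarray(x, dtype=float)
    var_hat = np.mean((x - mu) ** 2)
    return x.size * var_hat / sigma2_0
```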
ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]
The test statistic is

χ²_n = n S_n² / σ_0²

This test statistic is often called Chi-square statistic (also written as χ²-statistic)
and a test of hypothesis based on this statistic is called Chi-square test (also
written as χ²-test).
C_{χ²_n} = [0, z_1) ∪ (z_2, +∞)
where the notation P_{σ²} is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true variance is
equal to σ², and the variable n S_n² / σ² has a Chi-square distribution with n − 1
degrees of freedom.

10 See p. 583.
π(σ²) = 1 − P( (σ_0²/σ²) z_1 ≤ n S_n² / σ² ≤ (σ_0²/σ²) z_2 )

Given the assumptions on the sample ξ_n we made above, the unadjusted sample
variance S_n² has a Gamma distribution with parameters11 n − 1 and ((n−1)/n) σ², so that
the random variable

( (n−1) / ( ((n−1)/n) σ² ) ) S_n² = (n/σ²) S_n²

has a Chi-square distribution with n − 1 degrees of freedom.
79.3 Solved exercises

Exercise 1
Denote by F_n(x) the distribution function of a Chi-square random variable with n
degrees of freedom. Suppose you observe 40 independent realizations of a normal
random variable. What is the probability, expressed in terms of F_n(x), that you
will commit a Type I error if you run a Chi-square test of the null hypothesis that
the variance is equal to 1, based on the 40 observed realizations, and choosing
z_1 = 0.8 and z_2 = 1.2 as the critical values?
Solution
The probability of committing a Type I error is equal to the size of the test:

π(σ_0²) = π(1) = 1 − P( z_1 ≤ χ²_40 ≤ z_2 )

Thus

π(σ_0²) = π(1) = 1 − P( z_1 ≤ χ²_40 ≤ z_2 ) = 1 − F_39(1.2) + F_39(0.8)

11 See p. 579.
Exercise 2
Make the same assumptions of the previous exercise and denote by F_n^{-1}(p) the
inverse of F_n(x). Change the critical value z_1 in such a way that the size of the
test becomes exactly equal to 5%.

Solution
Replace 0.8 with z_1 in the formula for the size of the test:

π(σ_0²) = 1 − F_39(1.2) + F_39(z_1)

You need to set z_1 in such a way that π(σ_0²) = 0.05. In other words, you need to
solve

0.05 = 1 − F_39(1.2) + F_39(z_1)

which is equivalent to