Marco Taboga
Contents
I Mathematical tools 1
1 Set theory 3
1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Set membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Set inclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 De Morgan’s Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Permutations 9
2.1 Permutations without repetition . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Definition of permutation without repetition . . . . . . . . . 9
2.1.2 Number of permutations without repetition . . . . . . . . . . 10
2.2 Permutations with repetition . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Definition of permutation with repetition . . . . . . . . . . 11
2.2.2 Number of permutations with repetition . . . . . . . . . . . . 11
2.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 k-permutations 15
3.1 k-permutations without repetition . . . . . . . . . . . . . . . . . . . 15
3.1.1 Definition of k-permutation without repetition . . . . . . . . 15
3.1.2 Number of k-permutations without repetition . . . . . . . . . 16
3.2 k-permutations with repetition . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Definition of k-permutation with repetition . . . . . . . . . 17
3.2.2 Number of k-permutations with repetition . . . . . . . . . . . 18
3.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Combinations 21
4.1 Combinations without repetition . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Definition of combination without repetition . . . . . . . . . 21
4.1.2 Number of combinations without repetition . . . . . . . . . . 22
4.2 Combinations with repetition . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Definition of combination with repetition . . . . . . . . . . 23
4.2.2 Number of combinations with repetition . . . . . . . . . . . . 23
4.3 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3.1 Binomial coefficients and binomial expansions . . . . . . . . 25
4.3.2 Recursive formula for binomial coefficients . . . . . . . . . 25
9 Special functions 55
9.1 Gamma function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.1.2 Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.1.3 Relation to the factorial function . . . . . . . . . . . . . . . . 56
9.1.4 Values of the Gamma function . . . . . . . . . . . . . . . . . 57
II Fundamentals of probability 67
10 Probability 69
10.1 Sample space, sample points and events . . . . . . . . . . . . . . . . 69
10.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
10.3 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.3.1 Probability of the empty set . . . . . . . . . . . . . . . . . . . 71
10.3.2 Additivity and sigma-additivity . . . . . . . . . . . . . . . . . 72
10.3.3 Probability of the complement . . . . . . . . . . . . . . . . . 72
10.3.4 Probability of a union . . . . . . . . . . . . . . . . . . . . . . 73
10.3.5 Monotonicity of probability . . . . . . . . . . . . . . . . . . . 73
10.4 Interpretations of probability . . . . . . . . . . . . . . . . . . . . . . 74
10.4.1 Classical interpretation of probability . . . . . . . . . . . . . 74
10.4.2 Frequentist interpretation of probability . . . . . . . . . . . . 74
10.4.3 Subjectivist interpretation of probability . . . . . . . . . . . . 74
10.5 More rigorous definitions . . . . . . . . . . . . . . . . . . . . . . . 75
10.5.1 A more rigorous definition of event . . . . . . . . . . . . . . . 75
10.5.2 A more rigorous definition of probability . . . . . . . . . . . . 76
10.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
11 Zero-probability events 79
11.1 Definition and discussion . . . . . . . . . . . . . . . . . . . . . . . 79
11.2 Almost sure and almost surely . . . . . . . . . . . . . . . . . . . . . 80
11.3 Almost sure events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
12 Conditional probability 85
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
12.2 The case of equally likely sample points . . . . . . . . . . . . . . . . 85
12.3 A more general approach . . . . . . . . . . . . . . . . . . . . . . . . 87
12.4 Tackling division by zero . . . . . . . . . . . . . . . . . . . . . . . . . 90
12.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
12.5.1 The law of total probability . . . . . . . . . . . . . . . . . . . 90
12.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
13 Bayes' rule 95
13.1 Statement of Bayes' rule . . . . . . . . . . . . . . . . . . . . . . . 95
13.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
13.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
14 Independent events 99
14.1 Definition of independent event . . . . . . . . . . . . . . . . . . . 99
14.2 Mutually independent events . . . . . . . . . . . . . . . . . . . . . . 100
14.3 Zero-probability events and independence . . . . . . . . . . . . . . . 101
14.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
20 Variance 155
20.1 Definition of variance . . . . . . . . . . . . . . . . . . . . . . . . 155
20.2 Interpretation of variance . . . . . . . . . . . . . . . . . . . . . . . . 155
20.3 Computation of variance . . . . . . . . . . . . . . . . . . . . . . . . . 155
20.4 Variance formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.6 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
20.6.1 Variance and standard deviation . . . . . . . . . . . . . . . . 157
20.6.2 Addition to a constant . . . . . . . . . . . . . . . . . . . . . . 157
20.6.3 Multiplication by a constant . . . . . . . . . . . . . . . . . . 158
20.6.4 Linear transformations . . . . . . . . . . . . . . . . . . . . . . 158
20.6.5 Square integrability . . . . . . . . . . . . . . . . . . . . . . . 159
20.7 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
21 Covariance 163
21.1 Definition of covariance . . . . . . . . . . . . . . . . . . . . . . . 163
21.2 Interpretation of covariance . . . . . . . . . . . . . . . . . . . . . . . 163
21.3 Covariance formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.1 Covariance of a random variable with itself . . . . . . . . . . 166
21.5.2 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.3 Bilinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.5.4 Variance of the sum of two random variables . . . . . . . . . 167
21.5.5 Variance of the sum of n random variables . . . . . . . . . . . 168
21.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
51 F distribution 421
51.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
51.2 Relation to the Gamma distribution . . . . . . . . . . . . . . . . . . 422
51.3 Relation to the Chi-square distribution . . . . . . . . . . . . . . . . . 424
51.4 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
51.5 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
51.6 Higher moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
51.7 Moment generating function . . . . . . . . . . . . . . . . . . . . . . . 428
51.8 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . 428
51.9 Distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
51.10 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Dedication
This book is dedicated to Emanuela and Anna.
Part I
Mathematical tools
Chapter 1
Set theory
1.1 Sets
A set is a collection of objects. Sets are usually denoted by a letter and the objects
(or elements) belonging to a set are usually listed within curly brackets.
Example 1 Denote by the letter S the set of the natural numbers less than or
equal to 5. Then, we can write
S = {1, 2, 3, 4, 5}
Example 2 Denote by the letter A the set of the first five letters of the alphabet.
Then, we can write

A = {a, b, c, d, e}
Note that a set is an unordered collection of objects, i.e., the order in which
the elements of a set are listed does not matter.
A set can also be defined by stating a property that its elements satisfy:
instead of writing

S = {1, 2, 3, 4, 5}

we can write

S = {n ∈ ℕ : n ≤ 5}

which reads as follows: "S is the set of all natural numbers n such that n is less
than or equal to 5", where the colon symbol (:) means "such that" and precedes a
list of conditions that the elements of the set need to satisfy.
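As an aside for readers who like to experiment, the two ways of describing S above mirror listing a set versus a set comprehension in Python; the snippet below is a small illustration added to this text, not part of the original lecture:

```python
# Two equivalent ways to build S = {n in N : n <= 5}.
S_listed = {1, 2, 3, 4, 5}                        # listing the elements
S_builder = {n for n in range(1, 100) if n <= 5}  # the "such that" condition
assert S_listed == S_builder

# Order does not matter: sets are unordered collections.
assert S_listed == {5, 4, 3, 2, 1}
```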
1.2 Set membership

If an object a belongs to a set A, we write

a ∈ A

which reads "a belongs to A" or "a is a member of A". If a does not belong to A,
we write

a ∉ A

which reads "a does not belong to A" or "a is not a member of A".
For example, if A = {2, 4, 6, 8, 10}, then 2 ∈ A, while 3 ∉ A.
1.3 Set inclusion

If every element of a set A also belongs to a set B, we say that A is included
in B and write

A ⊆ B

We also say that A is a subset of B. Equivalently, we write

B ⊇ A

which reads "B includes A".
When A ⊆ B but A is not the same as B, i.e., there are elements of B that do
not belong to A, then we write

A ⊂ B

which reads "A is strictly included in B", or

B ⊃ A

We also say that A is a proper subset of B.
Example 7 Given the sets

A = {2, 3}
B = {1, 2, 3, 4}
C = {2, 3}

we have that

A ⊂ B
A ⊆ C

but we cannot write

A ⊂ C
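Python's set comparison operators follow exactly these conventions, which makes Example 7 easy to check numerically (an illustrative aside, not part of the original lecture):

```python
A = {2, 3}
B = {1, 2, 3, 4}
C = {2, 3}

assert A <= B        # A is included in B
assert A < B         # strict (proper) inclusion: B has elements not in A
assert A <= C        # A is included in C ...
assert not (A < C)   # ... but not strictly, because A and C coincide
```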
1.4 Union
The union of two sets A and B is the set of all elements that belong to at least one
of them, and it is denoted by
A ∪ B
Example 8 Define two sets A and B as follows:

A = {a, b, c, d}
B = {c, d, e, f}

Their union is

A ∪ B = {a, b, c, d, e, f}
If A₁, A₂, …, Aₙ are n sets, their union is the set of all elements that belong
to at least one of them, and it is denoted by

⋃ᵢ₌₁ⁿ Aᵢ = A₁ ∪ A₂ ∪ … ∪ Aₙ
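Both the two-set union of Example 8 and the n-set union above can be reproduced with Python sets (a small sketch added here for illustration only):

```python
A = {'a', 'b', 'c', 'd'}
B = {'c', 'd', 'e', 'f'}
assert A | B == {'a', 'b', 'c', 'd', 'e', 'f'}   # the union in Example 8

# Union of n sets: all elements belonging to at least one of them.
sets = [{1, 2}, {2, 3}, {4}]
assert set().union(*sets) == {1, 2, 3, 4}
```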
1.5 Intersection
The intersection of two sets A and B is the set of all elements that belong to both
of them, and it is denoted by
A ∩ B
For example, define two sets A and B as follows:

A = {a, b, c, d}
B = {c, d, e, f}

Their intersection is

A ∩ B = {c, d}
Similarly, define three sets A₁, A₂ and A₃ as follows:

A₁ = {a, b, c, d}
A₂ = {c, d, e, f}
A₃ = {c, f, g}

Their intersection is

⋂ᵢ₌₁³ Aᵢ = A₁ ∩ A₂ ∩ A₃ = {c}
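The two intersections above can be verified directly with Python's `&` operator (an illustrative aside, not part of the original text):

```python
A1 = {'a', 'b', 'c', 'd'}
A2 = {'c', 'd', 'e', 'f'}
A3 = {'c', 'f', 'g'}

assert A1 & A2 == {'c', 'd'}   # elements belonging to both sets
assert A1 & A2 & A3 == {'c'}   # elements belonging to all three sets
```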
1.6 Complement

Suppose that our attention is confined to sets that are all included in a larger
set Ω, called the universal set. Let A be one of these sets. The complement of A
is the set of all elements of Ω that do not belong to A, and it is indicated by

Aᶜ

For example, define the universal set

Ω = {a, b, c, d, e, f, g, h}

and the two sets

A = {b, c, d}
B = {c, d, e}

Their complements are

Aᶜ = {a, e, f, g, h}
Bᶜ = {a, b, f, g, h}
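Relative to a universal set, complementation is simply set difference, so the example above can be checked as follows (an illustrative aside; the variable names are ours):

```python
omega = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'}  # the universal set
A = {'b', 'c', 'd'}
B = {'c', 'd', 'e'}

A_complement = omega - A   # elements of omega not belonging to A
B_complement = omega - B

assert A_complement == {'a', 'e', 'f', 'g', 'h'}
assert B_complement == {'a', 'b', 'f', 'g', 'h'}
```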
1.8 Solved exercises

Exercise 1

Define the following sets:

A₁ = {a, b, c}
A₂ = {b, c, d, e, f}
A₃ = {b, f}
A₄ = {a, b, d}

List all the elements belonging to the set

A = A₂ ∪ A₃ ∪ A₄
Solution
The union can be written as

A = A₂ ∪ A₃ ∪ A₄

The union of the three sets A₂, A₃ and A₄ is the set of all elements that belong to
at least one of them:

A = A₂ ∪ A₃ ∪ A₄ = {a, b, c, d, e, f}
Exercise 2
Given the sets defined in the previous exercise, list all the elements belonging to
the set

A = ⋂ᵢ₌₁⁴ Aᵢ
Solution
The intersection can be written as

A = A₁ ∩ A₂ ∩ A₃ ∩ A₄

The intersection of the four sets A₁, A₂, A₃ and A₄ is the set of elements that are
members of all the four sets:

A = A₁ ∩ A₂ ∩ A₃ ∩ A₄ = {b}
Exercise 3
Suppose that A and B are two subsets of a universal set Ω, and that

Aᶜ = {a, b, c}
Bᶜ = {b, c, d}

List all the elements belonging to the set (A ∪ B)ᶜ.
Solution
Using De Morgan's laws, we obtain

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ = {a, b, c} ∩ {b, c, d} = {b, c}
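Exercise 3 can also be replayed numerically: build any sets with the given complements and check De Morgan's identity with set operations (a sketch added for illustration; the concrete universal set is our assumption):

```python
omega = set('abcdefgh')          # an assumed universal set containing a, b, c, d
A = omega - {'a', 'b', 'c'}      # chosen so that the complement of A is {a, b, c}
B = omega - {'b', 'c', 'd'}      # chosen so that the complement of B is {b, c, d}

lhs = omega - (A | B)            # (A ∪ B)^c
rhs = (omega - A) & (omega - B)  # A^c ∩ B^c

assert lhs == rhs == {'b', 'c'}  # De Morgan's law, and the answer to Exercise 3
```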
Chapter 2
Permutations
This lecture introduces permutations, one of the most important concepts in
combinatorial analysis.
We first deal with permutations without repetition, also called simple
permutations, and then with permutations with repetition.
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. There were n objects, but one
has already been assigned to a slot. So, we are left with n − 1 objects that
can be assigned to the second slot. Thus, there are

n − 1 possible ways to fill the second slot

and

n · (n − 1) possible ways to fill the first two slots

3. Then, we assign an object to the third slot. There were n objects, but two
have already been assigned to a slot. So, we are left with n − 2 objects that
can be assigned to the third slot. Thus, there are

n − 2 possible ways to fill the third slot

and

n · (n − 1) · (n − 2) possible ways to fill the first three slots

4. And so on, until only one object and one free slot remain.

5. Finally, when only one free slot remains, we assign the remaining object to
it. There is only one way to do this. Thus, there is

1 possible way to fill the last slot

and

Pₙ = n · (n − 1) · (n − 2) · … · 2 · 1 possible ways to fill the n available slots

Using the factorial notation, we can write

Pₙ = n!

where, by convention, 0! = 1. For example, the number of permutations of 5
objects is

P₅ = 5! = 5 · 4 · 3 · 2 · 1 = 120
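The count P₅ = 120 is small enough to verify by brute-force enumeration; the following Python snippet (added for illustration) lists every simple permutation and compares the total with 5!:

```python
from itertools import permutations
from math import factorial

objects = [1, 2, 3, 4, 5]
all_orderings = list(permutations(objects))  # every permutation without repetition
assert len(all_orderings) == factorial(5) == 120  # P_5 = 5! = 120
```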
Thus, the difference between simple permutations and permutations with
repetition is that objects can be selected only once in the former, while they can
be selected more than once in the latter.
The following subsections give a slightly more formal definition of permutation
with repetition and deal with the problem of counting the number of possible
permutations with repetition.
Example 15 Consider two objects, a₁ and a₂. There are two slots to fill, s₁ and
s₂. There are four possible permutations with repetition of the two objects, that is,
four possible ways to assign an object to each slot, being allowed to assign the same
object to more than one slot:

Slots           s₁  s₂
Permutation 1   a₁  a₁
Permutation 2   a₁  a₂
Permutation 3   a₂  a₁
Permutation 4   a₂  a₂
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. Even if one object has been
assigned to a slot in the previous step, we can still choose among n objects,
because we are allowed to choose an object more than once. So, there are n
objects that can be assigned to the second slot and

n · n possible ways to fill the first two slots

3. Then, we assign an object to the third slot. Even if two objects have been
assigned to a slot in the previous two steps, we can still choose among n
objects, because we are allowed to choose an object more than once. So,
there are n objects that can be assigned to the third slot and

n · n · n possible ways to fill the first three slots

4. And so on, until we are left with only one free slot (the n-th).

5. When only one free slot remains, we assign one of the n objects to it. Thus,
there are

n possible ways to fill the last slot

and

n · n · … · n (n times) possible ways to fill the n available slots

Therefore, the number of permutations with repetition of n objects is

P′ₙ = nⁿ

For example, the number of permutations with repetition of 3 objects is

P′₃ = 3³ = 27
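Assigning any of n objects to each of n slots is exactly what `itertools.product` enumerates, so P′₃ = 27 can be checked directly (an illustrative sketch added to the text):

```python
from itertools import product

objects = ['a1', 'a2', 'a3']
# Each of the 3 slots can hold any of the 3 objects (repetition allowed).
arrangements = list(product(objects, repeat=3))
assert len(arrangements) == 3 ** 3 == 27  # P'_3 = 3^3 = 27
```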
Exercise 1
There are 5 seats around a table and 5 people to be seated at the table. In how
many different ways can they seat themselves?
Solution
Sitting 5 people at the table is a sequential problem. We need to assign a person
to the first chair. There are 5 possible ways to do this. Then we need to assign
a person to the second chair. There are 4 possible ways to do this, because one
person has already been assigned. And so on, until there remain one free chair and
one person to be seated. Therefore, the number of ways to seat the 5 people at the
table is equal to the number of permutations of 5 objects (without repetition). If
we denote it by P₅, then

P₅ = 5! = 5 · 4 · 3 · 2 · 1 = 120
Exercise 2
Bob, John, Luke and Tim play a tennis tournament. The rules of the tournament
are such that at the end of the tournament a ranking will be made and there will
be no ties. How many different rankings can there be?
Solution
Ranking 4 people is a sequential problem. We need to assign a person to the first
place. There are 4 possible ways to do this. Then we need to assign a person to the
second place. There are 3 possible ways to do this, because one person has already
been assigned. And so on, until there remains one person to be assigned. Therefore,
the number of ways to rank the 4 people participating in the tournament is equal
to the number of permutations of 4 objects (without repetition). If we denote it
by P₄, then

P₄ = 4! = 4 · 3 · 2 · 1 = 24
Exercise 3
A byte is a number consisting of 8 digits that can be equal either to 0 or to 1. How
many different bytes are there?
Solution
To answer this question we need to follow a line of reasoning similar to the one we
followed when we derived the number of permutations with repetition. There are
2 possible ways to choose the first digit and 2 possible ways to choose the second
digit. So, there are 4 possible ways to choose the first two digits. There are 2
possible ways to choose the third digit and 4 possible ways to choose the first
two. Thus, there are 8 possible ways to choose the first three digits. And so on,
until we have chosen all digits. Therefore, the number of ways to choose the 8
digits is equal to

2 · 2 · … · 2 (8 times) = 2⁸ = 256
Chapter 3
k-permutations
1. the order of selection matters (the same k objects selected in different orders
are regarded as different k-permutations);

2. each object can be selected only once.
Example 17 Consider three objects, a₁, a₂ and a₃. There are two slots, s₁ and
s₂, to which we can assign two of the three objects. There are six possible 2-
permutations of the three objects, that is, six possible ways to choose two objects
and fill the two slots with the two objects:
1 See p. 9.
Slots s1 s2
2-permutation 1 a1 a2
2-permutation 2 a1 a3
2-permutation 3 a2 a1
2-permutation 4 a2 a3
2-permutation 5 a3 a1
2-permutation 6 a3 a2
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. There were n objects, but one
has already been assigned to a slot. So, we are left with n − 1 objects that
can be assigned to the second slot. Thus, there are

n − 1 possible ways to fill the second slot

and

n · (n − 1) possible ways to fill the first two slots

3. Then, we assign an object to the third slot. There were n objects, but two
have already been assigned to a slot. So, we are left with n − 2 objects that
can be assigned to the third slot. Thus, there are

n − 2 possible ways to fill the third slot

and

n · (n − 1) · (n − 2) possible ways to fill the first three slots

4. And so on, until we are left with n − k + 1 objects and only one free slot (the
k-th).

5. Finally, when only one free slot remains, we assign one of the remaining
n − k + 1 objects to it. Thus, there are

n − k + 1 possible ways to fill the last slot

and

Pₙ,ₖ = n · (n − 1) · (n − 2) · … · (n − k + 1)

possible ways to fill the k available slots. Multiplying and dividing by
(n − k) · (n − k − 1) · … · 2 · 1, we obtain

Pₙ,ₖ = [n · (n − 1) · (n − 2) · … · (n − k + 1) · (n − k) · (n − k − 1) · … · 2 · 1] / [(n − k) · (n − k − 1) · … · 2 · 1]

so that, using the factorial notation,

Pₙ,ₖ = n!/(n − k)!
The number Pₙ,ₖ is sometimes denoted by nₖ and called the falling factorial of n.
For example,

P₅,₃ = 5!/2! = (5 · 4 · 3 · 2 · 1)/(2 · 1) = 5 · 4 · 3 = 60
1. the order of selection matters (the same k objects selected in different orders
are regarded as different k-permutations);

2. each object can be selected more than once.
Example 19 Consider three objects a₁, a₂ and a₃ and two slots, s₁ and s₂. There
are nine possible 2-permutations with repetition of the three objects, that is, nine
possible ways to choose two objects and fill the two slots with the two objects, being
allowed to pick the same object more than once:
Slots s1 s2
2-permutation 1 a1 a1
2-permutation 2 a1 a2
2-permutation 3 a1 a3
2-permutation 4 a2 a1
2-permutation 5 a2 a2
2-permutation 6 a2 a3
2-permutation 7 a3 a1
2-permutation 8 a3 a2
2-permutation 9 a3 a3
1. First, we assign an object to the first slot. There are n objects that can be
assigned to the first slot, so there are

n possible ways to fill the first slot

2. Then, we assign an object to the second slot. Even if one object has been
assigned to a slot in the previous step, we can still choose among n objects,
because we are allowed to choose an object more than once. So, there are n
objects that can be assigned to the second slot and

n · n possible ways to fill the first two slots

3. Then, we assign an object to the third slot. Even if two objects have been
assigned to a slot in the previous two steps, we can still choose among n
objects, because we are allowed to choose an object more than once. So,
there are n objects that can be assigned to the third slot and

n · n · n possible ways to fill the first three slots

4. And so on, until we are left with only one free slot (the k-th).

5. When only one free slot remains, we assign one of the n objects to it. Thus,
there are

n possible ways to fill the last slot

and

n · n · … · n (k times) possible ways to fill the k available slots

Therefore, the number of k-permutations with repetition of n objects is

P′ₙ,ₖ = nᵏ
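The count nᵏ matches what direct enumeration of the slots produces; for the three objects and two slots of Example 19 (an illustrative aside):

```python
from itertools import product

n, k = 3, 2
# Ordered assignments with repetition allowed: one of n objects per slot.
slots = list(product(range(n), repeat=k))
assert len(slots) == n ** k == 9  # matches Example 19
```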
Exercise 1
There is a basket of fruit containing an apple, a banana and an orange and there
are five girls who want to eat one fruit. How many ways are there to give three of
the five girls one fruit each and leave two of them without a fruit to eat?
Solution
Giving the three fruits to three of the five girls is a sequential problem. We first
give the apple to one of the girls. There are 5 possible ways to do this. Then we
give the banana to one of the remaining girls. There are 4 possible ways to do this,
because one girl has already been given a fruit. Finally, we give the orange to one
of the remaining girls. There are 3 possible ways to do this, because two girls have
already been given a fruit. Summing up, the number of ways to assign the three
fruits is equal to the number of 3-permutations of 5 objects (without repetition).
If we denote it by P₅,₃, then

P₅,₃ = 5!/(5 − 3)! = (5 · 4 · 3 · 2 · 1)/(2 · 1) = 5 · 4 · 3 = 60
Exercise 2
A hexadecimal number is a number whose digits can take sixteen different values:
either one of the ten numbers from 0 to 9, or one of the six letters from A to F. How
many different 8-digit hexadecimal numbers are there, if a hexadecimal number
is allowed to begin with any number of zeros?

Solution

Choosing the 8 digits of the hexadecimal number is a sequential problem. There
are 16 possible ways to choose the first digit and 16 possible ways to choose the
second digit. So, there are 16 · 16 possible ways to choose the first two digits.
There are 16 possible ways to choose the third digit and 16 · 16 possible ways to
choose the first two. Thus, there are 16 · 16 · 16 possible ways to choose the first
three digits. And so on, until we have chosen all digits. Therefore, the number of
ways to choose the 8 digits is equal to the number of 8-permutations with repetition
of 16 objects:

P′₁₆,₈ = 16⁸
Exercise 3
An urn contains ten balls, each representing one of the ten numbers from 0 to 9.
Three balls are drawn at random from the urn and the corresponding numbers are
written down to form a 3-digit number, writing down the digits from left to right
in the order in which they have been extracted. When a ball is drawn from the
urn it is set aside, so that it cannot be extracted again. If one were to write down
all the 3-digit numbers that could possibly be formed, how many would they be?
Solution
The 3 balls are drawn sequentially. At the first draw there are 10 balls, hence 10
possible values for the first digit of our 3-digit number. At the second draw there
are 9 balls left, hence 9 possible values for the second digit of our 3-digit number.
At the third and last draw there are 8 balls left, hence 8 possible values for the third
digit of our 3-digit number. Summing up, the number of possible 3-digit numbers
is equal to the number of 3-permutations of 10 objects (without repetition). If we
denote it by P₁₀,₃, then

P₁₀,₃ = 10!/(10 − 3)! = (10 · 9 · … · 2 · 1)/(7 · 6 · … · 2 · 1) = 10 · 9 · 8 = 720
Chapter 4
Combinations
This lecture introduces combinations, one of the most important concepts in
combinatorial analysis. Before reading this lecture, you should be familiar with the
concept of permutation¹.
We first deal with combinations without repetition and then with combinations
with repetition.
Other combinations are not possible, because, for example, {a₂, a₁} is the same as
{a₁, a₂}.
Cₙ,ₖ = Pₙ,ₖ/Pₖ = n!/((n − k)! k!)

The number of possible combinations is often denoted by

Cₙ,ₖ = (ⁿₖ)

and (ⁿₖ) is called a binomial coefficient.
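The formula Cₙ,ₖ = n!/((n − k)! k!) is implemented in Python's standard library as `math.comb`, and counting the unordered selections enumerated by `itertools.combinations` gives the same number (an illustrative aside):

```python
from itertools import combinations
from math import comb, factorial

n, k = 5, 3
# C_{n,k} = n! / ((n-k)! k!)
assert comb(n, k) == factorial(n) // (factorial(n - k) * factorial(k)) == 10
# Enumerating unordered selections without repetition gives the same count.
assert len(list(combinations(range(n), k))) == 10
```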
1. the order of selection does not matter (the same objects selected in different
orders are regarded as the same combination);

2. each object can be selected more than once.

² See the lectures entitled Permutations (p. 9) and k-permutations (p. 15).
³ Pₖ is the number of all possible ways to order the k objects, i.e., the number of
permutations of k objects.
Thus, the difference between simple combinations and combinations with
repetition is that objects can be selected only once in the former, while they can
be selected more than once in the latter.
The following subsections give a slightly more formal definition of combination
with repetition and deal with the problem of counting the number of possible
combinations with repetition.
{a, b, c, a}

is a valid multiset, but not a valid set, because the letter a appears more than once.
Like sets, multisets are unordered collections of objects, i.e., the order in which the
elements of a multiset are listed does not matter.
Let a₁, a₂, …, aₙ be n objects. A combination with repetition of k objects
from the n objects is one of the possible ways to form a multiset containing k
objects taken from the set {a₁, a₂, …, aₙ}.
Example 23 Consider three objects, a₁, a₂ and a₃. There are six possible
combinations with repetition of two objects from a₁, a₂ and a₃, that is, six possible
ways to choose two objects from this set of three, allowing for repetitions:

Combination 1   a₁ and a₂
Combination 2   a₁ and a₃
Combination 3   a₂ and a₃
Combination 4   a₁ and a₁
Combination 5   a₂ and a₂
Combination 6   a₃ and a₃

Other combinations are not possible, because, for example, {a₂, a₁} is the same as
{a₁, a₂}.
Example 24 We need to order two scoops of ice cream, choosing among four
flavours: chocolate, pistachio, strawberry and vanilla. It is possible to order two
scoops of the same flavour. How many different combinations can we order? The
4 See the lecture entitled Set theory (p. 3).
number of different combinations we can order is equal to the number of possible
combinations with repetition of 2 objects from 4. Let us represent an order as a
string of crosses (×) and vertical bars (|), where a vertical bar delimits two adjacent
flavours and a cross denotes a scoop of a given flavour. For example,

×|||×   1 chocolate, 1 vanilla
||×|×   1 strawberry, 1 vanilla
××|||   2 chocolate
||××|   2 strawberry

where the first vertical bar (the leftmost one) delimits chocolate and pistachio, the
second one delimits pistachio and strawberry and the third one delimits strawberry
and vanilla. Each string contains three vertical bars, one less than the number of
flavours, and two crosses, one for each scoop. Therefore, each string contains a
total of five symbols. Making an order is equivalent to choosing which two of the
five symbols will be a cross (the remaining will be vertical bars). So, to make an
order, we need to choose 2 objects from 5. The number of possible ways to choose
2 objects from 5 is equal to the number of possible combinations without repetition⁵
of 2 objects from 5. Therefore, there are

(⁵₂) = 5!/((5 − 2)! 2!) = 10

different orders we can make.
In general, choosing k objects from n with repetition is equivalent to writing
a string with n + k − 1 symbols, of which n − 1 are vertical bars (|) and k are
crosses (×). In turn, this is equivalent to choosing the k positions in the string
(among the available n + k − 1) that will contain a cross (the remaining ones will
contain vertical bars). But choosing k positions from n + k − 1 is like choosing a
combination without repetition of k objects from n + k − 1. Therefore, the number
of possible combinations with repetition is

C′ₙ,ₖ = Cₙ₊ₖ₋₁,ₖ = (ⁿ⁺ᵏ⁻¹ₖ) = (n + k − 1)!/((n + k − 1 − k)! k!) = (n + k − 1)!/((n − 1)! k!)

The number of possible combinations with repetition is often denoted by

C′ₙ,ₖ = ((ⁿₖ))

and ((ⁿₖ)) is called a multiset coefficient.
Example 25 The number of possible combinations with repetition of 3 objects from
5 is

C′₅,₃ = (5 + 3 − 1)!/((5 − 1)! 3!) = 7!/(4! 3!)
     = (7 · 6 · 5 · 4 · 3 · 2 · 1)/((4 · 3 · 2 · 1) · (3 · 2 · 1))
     = (7 · 6 · 5)/(3 · 2 · 1) = 7 · 5 = 35
5 See p. 21.
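The stars-and-bars identity C′ₙ,ₖ = C(n + k − 1, k) used in Example 25 can be cross-checked against direct enumeration with `itertools.combinations_with_replacement` (an illustrative aside added to the text):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 5, 3  # as in Example 25
# Stars and bars: combinations with repetition = C(n + k - 1, k).
assert comb(n + k - 1, k) == 35
# Enumerating multisets of size k drawn from n objects gives the same count.
assert len(list(combinations_with_replacement(range(n), k))) == 35
```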
Exercise 1
3 cards are drawn from a standard deck of 52 cards. How many different 3-card
hands can possibly be drawn?
Solution
First of all, the order in which the 3 cards are drawn does not matter, that is,
the same cards drawn in different orders are regarded as the same 3-card hand.
Furthermore, each card can be drawn only once. Therefore the number of different
3-card hands that can possibly be drawn is equal to the number of possible
combinations without repetition of 3 objects from 52. If we denote it by C₅₂,₃,
then

C₅₂,₃ = 52!/((52 − 3)! 3!) = 52!/(49! 3!) = (52 · 51 · 50)/3! = (52 · 51 · 50)/(3 · 2 · 1) = 22100
Exercise 2
John has got one dollar, with which he can buy green, red and yellow candies.
Each candy costs 50 cents. John will spend all the money he has on candies. How
many different combinations of green, red and yellow candies can he buy?
Solution
First of all, the order in which the 3 different colors are chosen does not matter.
Furthermore, each color can be chosen more than once. Therefore, the number of
different combinations of colored candies John can choose is equal to the number
of possible combinations with repetition of 2 objects from 3. If we denote it by
C′₃,₂, then

C′₃,₂ = ((³₂)) = (³⁺²⁻¹₂) = (⁴₂) = 4!/((4 − 2)! 2!) = 4!/(2! 2!) = (4 · 3)/2! = (4 · 3)/(2 · 1) = 6
Exercise 3
The board of directors of a corporation comprises 10 members. An executive board,
formed by 4 directors, needs to be elected. How many possible ways are there to
form the executive board?
Solution
First of all, the order in which the 4 directors are selected does not matter.
Furthermore, each director can be elected to the executive board only once.
Therefore, the number of different ways to form the executive board is equal to the
number of possible combinations without repetition of 4 objects from 10. If we
denote it by C₁₀,₄, then

C₁₀,₄ = 10!/((10 − 4)! 4!) = 10!/(6! 4!) = (10 · 9 · 8 · 7)/4! = (10 · 9 · 8 · 7)/(4 · 3 · 2 · 1) = 210
Chapter 5

Partitions into groups

This lecture introduces partitions into groups. Before reading this lecture, you
should read the lectures entitled Permutations (p. 9) and Combinations (p. 21).
A partition of n objects into k groups is one of the possible ways of subdividing
the n objects into k groups (k ≤ n). The rules are:

1. the order in which objects are assigned to a group does not matter;

2. each object can be assigned to only one group.

The following subsections give a slightly more formal definition of partition into
groups and deal with the problem of counting the number of possible partitions
into groups.
1. First, we assign n₁ objects to the first group. The number of possible ways
to choose n₁ of the n objects is equal to the number of combinations of n₁
elements from n. So there are

C(n, n₁) = n!/(n₁! (n − n₁)!)

possible ways to form the first group.
2. Then, we assign n₂ objects to the second group. There were n objects, but
n₁ have already been assigned to the first group. So, there are n − n₁ objects
left, that can be assigned to the second group. The number of possible
ways to choose n₂ of the remaining n − n₁ objects is equal to the number of
combinations of n₂ elements from n − n₁. So there are

C(n − n₁, n₂) = (n − n₁)!/(n₂! (n − n₁ − n₂)!)

possible ways to form the second group and

C(n, n₁) · C(n − n₁, n₂) = [n!/(n₁! (n − n₁)!)] · [(n − n₁)!/(n₂! (n − n₁ − n₂)!)]
                         = n!/(n₁! n₂! (n − n₁ − n₂)!)

possible ways to form the first two groups.
3. Then, we assign n₃ objects to the third group. There were n objects, but
n₁ + n₂ have already been assigned to the first two groups. So, there are
n − n₁ − n₂ objects left, that can be assigned to the third group. The number
of possible ways to choose n₃ of the remaining n − n₁ − n₂ objects is equal
to the number of combinations of n₃ elements from n − n₁ − n₂. So there are

C(n − n₁ − n₂, n₃) = (n − n₁ − n₂)!/(n₃! (n − n₁ − n₂ − n₃)!)

possible ways to form the third group and

C(n, n₁) · C(n − n₁, n₂) · C(n − n₁ − n₂, n₃)
  = [n!/(n₁! n₂! (n − n₁ − n₂)!)] · [(n − n₁ − n₂)!/(n₃! (n − n₁ − n₂ − n₃)!)]
  = n!/(n₁! n₂! n₃! (n − n₁ − n₂ − n₃)!)

possible ways to form the first three groups.
1 See the lecture entitled Combinations (p. 21).
5.3. MORE DETAILS 29
4. And so on, until we are left with n_k objects and the last group. There is only
one way to form the last group, which can also be written as

\binom{n - n_1 - n_2 - \ldots - n_{k-1}}{n_k} = \frac{(n - n_1 - n_2 - \ldots - n_{k-1})!}{n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}

As a consequence, there are

\binom{n}{n_1} \binom{n - n_1}{n_2} \binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - n_2 - \ldots - n_{k-1}}{n_k}
= \frac{n!}{n_1! \, n_2! \ldots n_{k-1}! \, (n - n_1 - n_2 - \ldots - n_{k-1})!} \cdot \frac{(n - n_1 - n_2 - \ldots - n_{k-1})!}{n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}
= \frac{n!}{n_1! \, n_2! \ldots n_k! \, (n - n_1 - n_2 - \ldots - n_k)!}
= \frac{n!}{n_1! \, n_2! \ldots n_k! \, 0!}
= \frac{n!}{n_1! \, n_2! \ldots n_k!}

possible ways to form all the groups.
Therefore, by the above sequential argument, the total number of possible
partitions into the k groups is

P_{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1! \, n_2! \ldots n_k!}

The number P_{n_1, n_2, \ldots, n_k} is often indicated as follows:

P_{n_1, n_2, \ldots, n_k} = \binom{n}{n_1, n_2, \ldots, n_k}

and \binom{n}{n_1, n_2, \ldots, n_k} is called a multinomial coefficient.
Sometimes the following notation is also used:

P_{n_1, n_2, \ldots, n_k} = (n_1, n_2, \ldots, n_k)!
Example 27 The number of possible partitions of 4 objects into 2 groups of 2
objects is

P_{2,2} = \binom{4}{2, 2} = \frac{4!}{2! \, 2!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = 6
Exercise 1

John has a basket of fruit containing one apple, one banana, one orange and one
kiwi. He wants to give one fruit to each of his two little sisters and two fruits to
his big brother. In how many different ways can he do this?

Solution

John needs to decide how to partition 4 objects into 3 groups. The first two groups
will contain one object each and the third one will contain two objects. The total
number of partitions is

P_{1,1,2} = \binom{4}{1, 1, 2} = \frac{4!}{1! \, 1! \, 2!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{1 \cdot 1 \cdot 2 \cdot 1} = \frac{24}{2} = 12
Exercise 2

Ten friends want to play basketball. They need to divide into two teams of five
players. In how many different ways can they do this?

Solution

They need to decide how to partition 10 objects into 2 groups. Each group will
contain 5 objects. The total number of partitions is

P_{5,5} = \binom{10}{5, 5} = \frac{10!}{5! \, 5!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = \frac{9 \cdot 8 \cdot 7 \cdot 6}{4 \cdot 3} = 9 \cdot 2 \cdot 7 \cdot 2 = 252
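The multinomial-coefficient formula derived above is easy to turn into a small helper; the sketch below (not part of the original text) checks it against Example 27 and the two exercises:

```python
from math import factorial

def multinomial(*groups):
    """Number of partitions of sum(groups) objects into groups of the given sizes:
    n! / (n_1! * n_2! * ... * n_k!)."""
    count = factorial(sum(groups))
    for size in groups:
        count //= factorial(size)  # each intermediate quotient is an integer
    return count

print(multinomial(2, 2))     # 6   (Example 27)
print(multinomial(1, 1, 2))  # 12  (Exercise 1)
print(multinomial(5, 5))     # 252 (Exercise 2)
```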
Chapter 6

Sequences and limits

Thus, a is a limit of {a_n} if, by dropping a sufficiently high number of initial
terms of {a_n}, we can make the remaining terms of {a_n} as close to a as we like.
Intuitively, a is a limit of {a_n} if a_n becomes closer and closer to a as n goes
to infinity.
d(a_n, a) = |a_n - a|

Using the concept of distance, the above informal definition can be made rigorous.
For those unfamiliar with the quantifiers \forall (for all) and \exists (there exists), the
notation

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d(a_n, a) < \varepsilon, \; \forall n > n_0

reads as follows: "For any arbitrarily small number \varepsilon, there exists a natural number
n_0 such that the distance between a_n and a is less than \varepsilon for all the terms a_n with
n > n_0", which can also be restated as "For any arbitrarily small number \varepsilon, you
can find a subsequence {a_n}_{n > n_0} such that the distance between any term of the
subsequence and a is less than \varepsilon", or as "By dropping a sufficiently high number
of initial terms of {a_n}, you can make the remaining terms as close to a as you
wish".
It is possible to prove that a convergent sequence has a unique limit, that is, if
{a_n} has a limit a, then a is the unique limit of {a_n}.

where the last equality holds because all the terms of the sequence are positive and
hence equal to their absolute values. Therefore, we need to find an n_0 \in \mathbb{N} such
that all the terms of the subsequence {a_n}_{n > n_0} satisfy

d(a_n, 0) = a_n < \varepsilon     (6.2)

Since

a_n < a_{n_0}, \; \forall n > n_0

condition (6.2) is satisfied if a_{n_0} < \varepsilon, which is equivalent to 1/n_0 < \varepsilon. As a
consequence, it suffices to pick any n_0 such that n_0 > 1/\varepsilon to satisfy condition (6.1).
Summing up, we have just shown that, for any \varepsilon, we are able to find n_0 \in \mathbb{N} such
that all terms of the subsequence {a_n}_{n > n_0} have distance from zero less than \varepsilon,
which implies that 0 is the limit of the sequence {a_n}.
The definition is the same as the one given in Definition 31, except for the fact that
now both a and the terms of the sequence {a_n} belong to a generic set of objects A.

1. non-negativity: d(a, a') \geq 0;

2. identity of indiscernibles: d(a, a') = 0 if and only if a = a';

3. symmetry: d(a, a') = d(a', a);

4. triangle inequality: d(a, a') + d(a', a'') \geq d(a, a'').

All four properties are very intuitive: property 1) says that the distance between
two points cannot be a negative number; property 2) says that the distance between
two points is zero if and only if the two points coincide; property 3) says that the
distance from a to a' is the same as the distance from a' to a; property 4) says that
the distance you cover when you go from a to a'' directly is less than or equal to
the distance you cover when you go from a to a'' passing through a third point a'
(in other words, if a' is not on the way from a to a'', you are increasing the distance
covered).

which coincides with the definition of distance between real numbers already given
above.
Whenever we are faced with a sequence of objects and we want to assess whether
it is convergent, we need to define a distance function on the set of objects to which
the terms of the sequence belong, and verify that the proposed distance function
satisfies all the properties of a proper distance function (a metric). For example, in
probability theory and statistics we often deal with sequences of random variables.
To assess whether these sequences are convergent, we need to define a metric to
measure the distance between two random variables. As we will see in the lecture
entitled Sequences of random variables (see p. 491), there are several ways of
defining the concept of distance between two random variables. All these ways are
legitimate and are useful in different situations.
If a is a limit of the sequence {a_n}, we say that the sequence {a_n} is a convergent
sequence and that it converges to a. We indicate the fact that a is a limit of
{a_n} by

a = \lim_{n \to \infty} a_n

Also in this case, it is possible to prove that a convergent sequence has a unique
limit.
Proof. The proof is by contradiction. Suppose that a and a' are two limits of
a sequence {a_n} and a \neq a'. By combining properties 1) and 2) of a metric (see
above), we obtain

d(a, a') > 0

i.e., d(a, a') = d, where d is a strictly positive constant. Pick any term a_n of the
sequence. By property 4) of a metric (the triangle inequality), we have

d(a', a_n) \geq d(a, a') - d(a, a_n) = d - d(a, a_n)

Now, take any \varepsilon < d. Since a is a limit of the sequence, we can find n_0 such that
d(a, a_n) < \varepsilon, \forall n > n_0, which means that

d - d(a, a_n) > d - \varepsilon, \; \forall n > n_0

and

d(a', a_n) \geq d - \varepsilon > 0, \; \forall n > n_0

Therefore, d(a', a_n) cannot be made smaller than d - \varepsilon and, as a consequence, a'
cannot be a limit of the sequence.
Convergence criterion

In practice, it is usually difficult to assess the convergence of a sequence using
Definition 37. Instead, convergence can be assessed using the following criterion:
{a_n} converges to a if and only if

\lim_{n \to \infty} d(a_n, a) = 0

Proof. This is easily proved by defining a sequence of real numbers {d_n} whose
generic term is

d_n = d(a_n, a)

and noting that the definition of convergence of {a_n} to a, which is

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d(a_n, a) < \varepsilon, \; \forall n > n_0

can be written as

\forall \varepsilon > 0, \; \exists n_0 \in \mathbb{N} : d_n < \varepsilon, \; \forall n > n_0

which is the definition of convergence of {d_n} to 0. In summary, to assess the
convergence of a sequence {a_n} to a, one needs to:

1. find a metric d(a_n, a) to measure the distance between the terms of the
sequence a_n and the candidate limit a;

2. define a new sequence {d_n}, where d_n = d(a_n, a);

3. study the convergence of the sequence {d_n}, which is a simple problem,
because {d_n} is a sequence of real numbers.
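The three steps above can be sketched numerically. The snippet below (an illustration, not part of the original text) applies the criterion to the sequence a_n = 1/n with candidate limit a = 0:

```python
# Step 1: choose a metric on the real line
def d(a, b):
    return abs(a - b)

# Step 2: build the sequence of distances d_n = d(a_n, a) for a_n = 1/n, a = 0
a = 0
d_n = [d(1 / n, a) for n in range(1, 10001)]

# Step 3: the distances decrease monotonically toward zero
print(d_n[0], d_n[-1])  # 1.0 0.0001
```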
Chapter 7

Review of differentiation rules

This lecture contains a summary of differentiation rules, i.e. of rules for computing
the derivative of a function. This review is neither detailed nor rigorous and it
is not meant to be a substitute for a proper lecture on differentiation. Its only
purpose is to serve as a quick review of differentiation rules.
In what follows, f(x) will denote a function of one variable and \frac{d}{dx} f(x) will
denote its first derivative.

If f(x) = c, where c is a constant, then

\frac{d}{dx} f(x) = 0

If f(x) = x^n, where n is a constant exponent, then

\frac{d}{dx} f(x) = n x^{n-1}

Example 40 Define

f(x) = x^5

The derivative of f(x) is

\frac{d}{dx} f(x) = 5 x^{5-1} = 5 x^4
Example 41 Define

f(x) = \sqrt[3]{x^4}

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} \sqrt[3]{x^4} = \frac{d}{dx} x^{4/3} = \frac{4}{3} x^{4/3 - 1} = \frac{4}{3} x^{1/3}
1. Multiplication by a constant:

\frac{d}{dx} (c_1 f_1(x)) = c_1 \frac{d}{dx} f_1(x)

2. Addition:

\frac{d}{dx} (f_1(x) + f_2(x)) = \frac{d}{dx} f_1(x) + \frac{d}{dx} f_2(x)
Example 44 Define

f(x) = 2 + \exp(x)

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} (2 + \exp(x)) = \frac{d}{dx} (2) + \frac{d}{dx} (\exp(x))

The first summand is

\frac{d}{dx} (2) = 0

because the derivative of a constant is 0. The second summand is

\frac{d}{dx} (\exp(x)) = \exp(x)

by the rule for differentiating exponentials. Therefore

\frac{d}{dx} f(x) = \frac{d}{dx} (2) + \frac{d}{dx} (\exp(x)) = 0 + \exp(x) = \exp(x)
The derivative of a product of two functions is

\frac{d}{dx} (f_1(x) f_2(x)) = \left( \frac{d}{dx} f_1(x) \right) f_2(x) + f_1(x) \left( \frac{d}{dx} f_2(x) \right)

Example 45 Define

f(x) = x \ln(x)

The derivative of f(x) is

\frac{d}{dx} f(x) = \frac{d}{dx} (x \ln(x)) = \left( \frac{d}{dx} (x) \right) \ln(x) + x \left( \frac{d}{dx} (\ln(x)) \right) = 1 \cdot \ln(x) + x \cdot \frac{1}{x} = \ln(x) + 1
What does the chain rule mean in practice? It means that first you need to
compute the derivative of g(y):

\frac{d}{dy} g(y)

Then, you substitute y with h(x):

\left. \frac{d}{dy} g(y) \right|_{y = h(x)}

Finally, you multiply the result by the derivative of the inner function:

\frac{d}{dx} h(x)
Example 46 Define

f(x) = \ln(x^2)

The function f(x) is a composite function:

f(x) = g(h(x))

where

g(y) = \ln(y)

and

h(x) = x^2

The derivative of h(x) is

\frac{d}{dx} h(x) = \frac{d}{dx} x^2 = 2x

The derivative of g(y) is

\frac{d}{dy} g(y) = \frac{d}{dy} (\ln(y)) = \frac{1}{y}

Substituting y with h(x), we obtain

\left. \frac{d}{dy} g(y) \right|_{y = h(x)} = \frac{1}{h(x)} = \frac{1}{x^2}

Therefore

\frac{d}{dx} (g(h(x))) = \left( \left. \frac{d}{dy} g(y) \right|_{y = h(x)} \right) \frac{d}{dx} h(x) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}
\left. \frac{d}{dy} g(y) \right|_{y = h(x)} = -\sin(h(x)) = -\sin(x^2)

Therefore

\frac{d}{dx} (g(h(x))) = \left( \left. \frac{d}{dy} g(y) \right|_{y = h(x)} \right) \frac{d}{dx} h(x) = -\sin(x^2) \cdot 2x
If a function f(x) is invertible and differentiable, then its inverse x = f^{-1}(y) has
derivative

\frac{d}{dy} f^{-1}(y) = \left( \left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} \right)^{-1}

Example 48 Define

f(x) = \exp(3x)

Its inverse is

f^{-1}(y) = \frac{1}{3} \ln(y)

The derivative of f(x) is

\frac{d}{dx} f(x) = 3 \exp(3x)

As a consequence

\left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} = 3 \exp(3x) \Big|_{x = \frac{1}{3} \ln(y)} = 3 \exp\left( 3 \cdot \frac{1}{3} \ln(y) \right) = 3y

and

\frac{d}{dy} f^{-1}(y) = \left( \left. \frac{d}{dx} f(x) \right|_{x = f^{-1}(y)} \right)^{-1} = (3y)^{-1} = \frac{1}{3y}
Chapter 8

Review of integration rules

This lecture contains a summary of integration rules, i.e. of rules for computing
definite and indefinite integrals of a function. This review is neither detailed nor
rigorous and it is not meant to be a substitute for a proper lecture on integration.
Its only purpose is to serve as a quick review of integration rules.
In what follows, f(x) will denote a function of one variable and \frac{d}{dx} f(x) will
denote its first derivative. A function F(x) is an indefinite integral of f(x) if and
only if

\frac{d}{dx} F(x) = f(x)

An indefinite integral F(x) is denoted by

F(x) = \int f(x) \, dx

Example 49 Let

f(x) = x^3

The function

F(x) = \frac{1}{4} x^4

is an indefinite integral of f(x) because

\frac{d}{dx} F(x) = \frac{d}{dx} \left( \frac{1}{4} x^4 \right) = \frac{1}{4} \frac{d}{dx} x^4 = \frac{1}{4} \cdot 4 x^3 = x^3
Also the function

G(x) = \frac{1}{2} + \frac{1}{4} x^4

is an indefinite integral of f(x), because

\frac{d}{dx} G(x) = \frac{d}{dx} \left( \frac{1}{2} + \frac{1}{4} x^4 \right) = \frac{d}{dx} \left( \frac{1}{2} \right) + \frac{1}{4} \frac{d}{dx} x^4 = 0 + \frac{1}{4} \cdot 4 x^3 = x^3
Note that if a function F(x) is an indefinite integral of f(x), then also the
function

G(x) = F(x) + c

is an indefinite integral of f(x) for any constant c \in \mathbb{R}, because

\frac{d}{dx} G(x) = \frac{d}{dx} (F(x) + c) = \frac{d}{dx} (F(x)) + \frac{d}{dx} (c) = f(x) + 0 = f(x)

This is also the reason why the adjective indefinite is used: indefinite
integrals are defined only up to an additive constant.
The following subsections contain some rules for computing the indefinite
integrals of functions that are frequently encountered in probability theory and
statistics. In all these subsections, c will denote a constant and the integration rules
will be reported without proof. Proofs are trivial and can be easily performed
by the reader: it suffices to compute the first derivative of F(x) and verify that it
equals f(x).
If f(x) = a, where a is a constant, then

F(x) = ax + c

If f(x) = x^n, then

F(x) = \frac{1}{n+1} x^{n+1} + c

when n \neq -1. When n = -1, i.e. when

f(x) = \frac{1}{x}

the integral is

F(x) = \ln(x) + c
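As the text notes, each rule can be verified by differentiating F(x) and recovering f(x). The snippet below (a sanity check, not part of the original text) does this numerically for the power rule, including the special case n = -1:

```python
import math

def F(x, n):
    """An antiderivative of f(x) = x^n (constant of integration omitted)."""
    if n == -1:
        return math.log(x)
    return x ** (n + 1) / (n + 1)

# check that d/dx F(x) = x^n by central finite differences
h = 1e-6
for n in (-1, 0, 1, 3):
    for x in (0.5, 2.0):
        numeric = (F(x + h, n) - F(x - h, n)) / (2 * h)
        assert abs(numeric - x ** n) < 1e-5
```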
In other words, the integral of a linear combination is equal to the linear
combination of the integrals. This property is called "linearity of the integral".
Two special cases of this rule are:

1. Multiplication by a constant:

\int c_1 f_1(x) \, dx = c_1 \int f_1(x) \, dx

2. Addition:

\int (f_1(x) + f_2(x)) \, dx = \int f_1(x) \, dx + \int f_2(x) \, dx

1 Remember that \log_b(x) = \ln(x) / \ln(b).
2 Remember that b^x = \exp(x \ln(b)).
f(x) is called the integrand function and a and b are called the lower bound of
integration and the upper bound of integration.
The following subsections contain some properties of definite integrals, which
are also often utilized to actually compute definite integrals. If

F(x) = \int_a^x f(t) \, dt

then

\frac{d}{dx} F(x) = f(x)

In other words, if you differentiate a definite integral with respect to its upper
bound of integration, then you obtain the integrand function.

Example 50 Define

F(x) = \int_a^x \exp(2t) \, dt

Then:

\frac{d}{dx} F(x) = \exp(2x)
2. Addition:

\int_a^b (f_1(x) + f_2(x)) \, dx = \int_a^b f_1(x) \, dx + \int_a^b f_2(x) \, dx
1. Define the new variable

t = g(x)

Differentiating it, obtain

dt = \frac{d}{dx} g(x) \, dx

2. Recompute the bounds of integration:

x = a \Rightarrow t = g(a)
x = b \Rightarrow t = g(b)

3. Substitute g(x) and \frac{d}{dx} g(x) \, dx in the integral:

\int_a^b f(g(x)) \frac{d}{dx} g(x) \, dx = \int_{g(a)}^{g(b)} f(t) \, dt
We define the new variable

t = \ln(x)

and obtain

dt = \frac{d}{dx} \ln(x) \, dx = \frac{1}{x} \, dx

The new bounds of integration are

x = 1 \Rightarrow t = \ln(1) = 0
x = 2 \Rightarrow t = \ln(2)
g (x) = 1
where both the lower bound of integration a and the upper bound of integration b
may depend on y, under appropriate technical conditions (not discussed here) the
first derivative of the function I(y) with respect to y can be computed as follows:

\frac{d}{dy} I(y) = \left( \frac{d}{dy} b(y) \right) f(b(y), y) - \left( \frac{d}{dy} a(y) \right) f(a(y), y) + \int_{a(y)}^{b(y)} \frac{\partial}{\partial y} f(x, y) \, dx

where \frac{\partial}{\partial y} f(x, y) is the first partial derivative of f(x, y) with respect to y.
is

\frac{d}{dy} I(y) = \left( \frac{d}{dy} (y^2 + 1) \right) \exp((y^2 + 1) y) - \left( \frac{d}{dy} y^2 \right) \exp(y^2 \cdot y) + \int_{y^2}^{y^2 + 1} \frac{\partial}{\partial y} (\exp(xy)) \, dx
= 2y \exp(y^3 + y) - 2y \exp(y^3) + \int_{y^2}^{y^2 + 1} x \exp(xy) \, dx
Exercise 1

Compute the following integral:

\int_0^\infty \cos(x) \exp(-x) \, dx

Solution

Performing two integrations by parts, we obtain:

\int_0^\infty \cos(x) \exp(-x) \, dx
= [\sin(x) \exp(-x)]_0^\infty - \int_0^\infty \sin(x) (-\exp(-x)) \, dx
= 0 - 0 + \int_0^\infty \sin(x) \exp(-x) \, dx
= [-\cos(x) \exp(-x)]_0^\infty - \int_0^\infty (-\cos(x)) (-\exp(-x)) \, dx
= 0 - (-1) - \int_0^\infty \cos(x) \exp(-x) \, dx
= 1 - \int_0^\infty \cos(x) \exp(-x) \, dx

Therefore

2 \int_0^\infty \cos(x) \exp(-x) \, dx = 1

or

\int_0^\infty \cos(x) \exp(-x) \, dx = \frac{1}{2}
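The value 1/2 can be confirmed numerically; the sketch below (not part of the original text) truncates the improper integral at x = 40, where the exp(-x) factor makes the tail negligible:

```python
def riemann(f, a, b, n=200000):
    """Midpoint Riemann-sum approximation of the integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w

import math

# integral_0^inf cos(x) exp(-x) dx, truncated at 40 (tail is of order exp(-40))
value = riemann(lambda x: math.cos(x) * math.exp(-x), 0.0, 40.0)
assert abs(value - 0.5) < 1e-6
```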
Exercise 2

Use the Leibniz integral rule to compute the derivative with respect to y of the
following integral:

I(y) = \int_0^{y^2} \exp(-xy) \, dx

Solution

The Leibniz integral rule is

\frac{d}{dy} \int_{a(y)}^{b(y)} f(x, y) \, dx = \left( \frac{d}{dy} b(y) \right) f(b(y), y) - \left( \frac{d}{dy} a(y) \right) f(a(y), y) + \int_{a(y)}^{b(y)} \frac{\partial}{\partial y} f(x, y) \, dx

Applying it with a(y) = 0 and b(y) = y^2, we obtain

\frac{d}{dy} I(y) = 2y \exp(-y^3) + \int_0^{y^2} (-x) \exp(-xy) \, dx
Exercise 3

Compute the following integral:

\int_0^1 x (1 + x^2)^{-2} \, dx

Solution

This integral can be solved using the change of variable technique. Setting t = x^2,
so that dt = 2x \, dx, we obtain

\int_0^1 x (1 + x^2)^{-2} \, dx = \int_0^1 \frac{1}{2} (1 + t)^{-2} \, dt = \left[ -\frac{1}{2} (1 + t)^{-1} \right]_0^1 = -\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} = \frac{1}{4}
Chapter 9

Special functions

This chapter briefly introduces some special functions that are frequently used in
probability and statistics. Among them is the Gamma function, which generalizes
the factorial

n! = 1 \cdot 2 \cdot \ldots \cdot (n-1) \cdot n

and its recursion

n! = (n-1)! \cdot n

to non-integer arguments, via the recursive property

\Gamma(z) = (z-1) \Gamma(z-1)
9.1.1 Definition

The following is a possible definition of the Gamma function.

While the domain of definition of the Gamma function can be extended beyond
the set R_{++} of strictly positive real numbers, for example, to complex numbers,
the somewhat restrictive definition given above is more than sufficient to address
all the problems involving the Gamma function that are found in these lectures.

1 See p. 10.
9.1.2 Recursion

The next proposition states a recursive property that is used to derive several other
properties of the Gamma function:

\Gamma(z) = (z-1) \Gamma(z-1)     (9.1)

By iterating the recursion, one obtains the relation to the factorial function:

\Gamma(n) = (n-1)!

for n \in \mathbb{N}, because

\Gamma(1) = 1 = 0!
\Gamma(2) = (2-1) \Gamma(2-1) = \Gamma(1) \cdot 1 = 1 = 1!
\Gamma(3) = (3-1) \Gamma(3-1) = \Gamma(2) \cdot 2 = 1 \cdot 2 = 2!
\Gamma(4) = (4-1) \Gamma(4-1) = \Gamma(3) \cdot 3 = 1 \cdot 2 \cdot 3 = 3!
\vdots
\Gamma(n) = (n-1) \Gamma(n-1) = 1 \cdot 2 \cdot 3 \cdot \ldots \cdot (n-1) = (n-1)!

2 See p. 51.
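Both the recursion and the relation to the factorial can be checked with Python's standard-library `math.gamma`; the snippet below (not part of the original text) is a quick sanity check:

```python
import math

# Gamma reproduces the factorial: Gamma(n) = (n - 1)!
for n in range(1, 10):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# and satisfies the recursion Gamma(z) = (z - 1) Gamma(z - 1) for non-integer z too
z = 4.3
assert math.isclose(math.gamma(z), (z - 1) * math.gamma(z - 1))
```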
\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}

Proof. Using the definition of the Gamma function and performing the change of
variable x = t^2, we obtain

\Gamma\left( \frac{1}{2} \right) = \int_0^\infty x^{1/2 - 1} \exp(-x) \, dx = \int_0^\infty x^{-1/2} \exp(-x) \, dx = 2 \int_0^\infty \exp(-t^2) \, dt

The last integral can be computed by writing it as the square root of a double
integral:

2 \int_0^\infty \exp(-t^2) \, dt
= 2 \left( \int_0^\infty \exp(-t^2) \, dt \int_0^\infty \exp(-t^2) \, dt \right)^{1/2}
= 2 \left( \int_0^\infty \exp(-t^2) \, dt \int_0^\infty \exp(-s^2) \, ds \right)^{1/2}
= 2 \left( \int_0^\infty \int_0^\infty \exp(-t^2 - s^2) \, dt \, ds \right)^{1/2}

Performing the change of variable t = us, this becomes

2 \left( \int_0^\infty \int_0^\infty \exp(-s^2 - u^2 s^2) \, s \, du \, ds \right)^{1/2}
= 2 \left( \int_0^\infty \int_0^\infty \exp(-(1 + u^2) s^2) \, s \, ds \, du \right)^{1/2}
= 2 \left( \int_0^\infty \left[ -\frac{1}{2(1 + u^2)} \exp(-(1 + u^2) s^2) \right]_0^\infty du \right)^{1/2}
= 2 \left( \int_0^\infty \left( 0 + \frac{1}{2(1 + u^2)} \right) du \right)^{1/2}
= 2^{1/2} \left( \int_0^\infty \frac{1}{1 + u^2} \, du \right)^{1/2}
= 2^{1/2} \left( [\arctan(u)]_0^\infty \right)^{1/2}
= 2^{1/2} \left( \arctan(\infty) - \arctan(0) \right)^{1/2}
= 2^{1/2} \left( \frac{\pi}{2} - 0 \right)^{1/2} = \pi^{1/2}
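The value Gamma(1/2) = sqrt(pi) can be checked both against the library implementation and against the integral representation used in the proof; the sketch below (not part of the original text) truncates the improper integral at t = 10, where the tail is of order exp(-100):

```python
import math

assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# the same value via 2 * integral_0^inf exp(-t^2) dt, midpoint rule, truncated at 10
n, top = 400000, 10.0
w = top / n
integral = 2 * sum(math.exp(-((i + 0.5) * w) ** 2) for i in range(n)) * w
assert abs(integral - math.sqrt(math.pi)) < 1e-6
```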
\Gamma\left( n + \frac{1}{2} \right) = \sqrt{\pi} \prod_{j=0}^{n-1} \left( j + \frac{1}{2} \right)

for n \in \mathbb{N}. Proof. Repeatedly applying the recursion \Gamma(z) = (z-1) \Gamma(z-1), we
obtain

\Gamma\left( n + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \Gamma\left( n - 1 + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \Gamma\left( n - 2 + \frac{1}{2} \right)
\vdots
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \cdots \left( n - n + \frac{1}{2} \right) \Gamma\left( n - n + \frac{1}{2} \right)
= \left( n - 1 + \frac{1}{2} \right) \left( n - 2 + \frac{1}{2} \right) \cdots \frac{1}{2} \, \Gamma\left( \frac{1}{2} \right)
= \sqrt{\pi} \prod_{j=0}^{n-1} \left( j + \frac{1}{2} \right)
There are also other special cases in which the value of the Gamma function can
be derived analytically, but it is not possible to express \Gamma(z) in terms of elementary
functions for every z. As a consequence, one often needs to resort to numerical
algorithms to compute \Gamma(z). For example, the Matlab command

gamma(z)

returns the value of the Gamma function at z.
When the upper bound of integration in the definition of the Gamma function
is replaced by a finite number y, the function \gamma(z, y) thus obtained is called the
lower incomplete Gamma function.

3 Abramowitz, M. and I. A. Stegun (1965) Handbook of mathematical functions: with
formulas, graphs, and mathematical tables, Courier Dover Publications.
9.2.1 Definition

The following is a possible definition of the Beta function.

B(x, y) = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}

While the domain of definition of the Beta function can be extended beyond
the set R_{++}^2 of couples of strictly positive real numbers, for example, to couples
of complex numbers, the somewhat restrictive definition given above is more than
sufficient to address all the problems involving the Beta function that are found in
these lectures.
Proof. Given the definition of the Beta function as a ratio of Gamma functions,
the equality holds if and only if

\int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}

or

\Gamma(x + y) \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt = \Gamma(x) \Gamma(y)

The latter equality can be proved as follows:

\Gamma(x) \Gamma(y)
= \int_0^\infty u^{x-1} \exp(-u) \, du \int_0^\infty v^{y-1} \exp(-v) \, dv
= \int_0^\infty \int_0^\infty v^{y-1} \exp(-v) \, u^{x-1} \exp(-u) \, du \, dv

Performing the change of variable u = vt in the inner integral, this becomes

\int_0^\infty \int_0^\infty v^{y-1} \exp(-v) (vt)^{x-1} \exp(-vt) \, v \, dt \, dv
= \int_0^\infty \int_0^\infty v^{y-1} \exp(-v) \, v^x t^{x-1} \exp(-vt) \, dt \, dv
= \int_0^\infty \int_0^\infty v^{x+y-1} \exp(-v) \, t^{x-1} \exp(-vt) \, dt \, dv
= \int_0^\infty \int_0^\infty v^{x+y-1} t^{x-1} \exp(-(1 + t) v) \, dt \, dv
= \int_0^\infty t^{x-1} \int_0^\infty v^{x+y-1} \exp(-(1 + t) v) \, dv \, dt

Performing the change of variable v = s / (1 + t) in the inner integral, this becomes

\int_0^\infty t^{x-1} \int_0^\infty \left( \frac{s}{1 + t} \right)^{x+y-1} \exp(-s) \frac{1}{1 + t} \, ds \, dt
= \int_0^\infty t^{x-1} (1 + t)^{-x-y} \int_0^\infty s^{x+y-1} \exp(-s) \, ds \, dt
= \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, \Gamma(x + y) \, dt
= \Gamma(x + y) \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt
Another integral representation can be obtained with the change of variable

s = \frac{t}{1 + t} = 1 - \frac{1}{1 + t}

Before performing it, note that

\lim_{t \to \infty} \frac{t}{1 + t} = 1

and that

t = \frac{1}{1 - s} - 1 = \frac{s}{1 - s}

Furthermore, by differentiating the previous expression, we obtain

dt = \left( \frac{1}{1 - s} \right)^2 ds

We are now ready to perform the change of variable:

B(x, y) = \int_0^\infty t^{x-1} (1 + t)^{-x-y} \, dt
= \int_0^1 \left( \frac{s}{1 - s} \right)^{x-1} \left( 1 + \frac{s}{1 - s} \right)^{-x-y} \left( \frac{1}{1 - s} \right)^2 ds
= \int_0^1 \left( \frac{s}{1 - s} \right)^{x-1} \left( \frac{1}{1 - s} \right)^{-x-y} \left( \frac{1}{1 - s} \right)^2 ds
= \int_0^1 s^{x-1} \left( \frac{1}{1 - s} \right)^{x-1-x-y+2} ds
= \int_0^1 s^{x-1} \left( \frac{1}{1 - s} \right)^{1-y} ds
= \int_0^1 s^{x-1} (1 - s)^{y-1} \, ds
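The representation on [0, 1] can be checked against the definition of the Beta function in terms of Gamma functions; the sketch below (not part of the original text) compares the two for one choice of arguments:

```python
import math

def beta(x, y):
    """Beta function via its definition as a ratio of Gamma functions."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def riemann(f, a, b, n=200000):
    """Midpoint Riemann-sum approximation of the integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w

x, y = 2.5, 3.0
integral = riemann(lambda s: s ** (x - 1) * (1 - s) ** (y - 1), 0.0, 1.0)
assert abs(integral - beta(x, y)) < 1e-6
```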
Note that the two representations (9.2) and (9.3) involve improper integrals
that converge if x > 0 and y > 0. This might help you to see why the arguments
of the Beta function are required to be strictly positive in Definition 61.
Exercise 1

Compute the ratio

\frac{\Gamma(16/3)}{\Gamma(10/3)}

Solution

We need to repeatedly apply the recursive formula

\Gamma(z) = (z-1) \Gamma(z-1)

to the numerator of the ratio, as follows:

\frac{\Gamma(16/3)}{\Gamma(10/3)} = \frac{(16/3 - 1) \Gamma(16/3 - 1)}{\Gamma(10/3)} = \frac{(13/3) \Gamma(13/3)}{\Gamma(10/3)}
= \frac{(13/3)(13/3 - 1) \Gamma(13/3 - 1)}{\Gamma(10/3)} = \frac{(13/3)(10/3) \Gamma(10/3)}{\Gamma(10/3)} = \frac{130}{9}
Exercise 2

Compute

\Gamma(5)

Solution

We need to use the relation of the Gamma function to the factorial function:

\Gamma(n) = (n-1)!

which gives

\Gamma(5) = (5-1)! = 4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24
Exercise 3

Express the integral

\int_0^\infty x^{9/2} \exp\left( -\frac{1}{2} x \right) dx

in terms of the Gamma function.

Solution

This is accomplished with the change of variable x = 2t:

\int_0^\infty x^{9/2} \exp\left( -\frac{1}{2} x \right) dx
= \int_0^\infty (2t)^{9/2} \exp(-t) \, 2 \, dt
= 2^{11/2} \int_0^\infty t^{11/2 - 1} \exp(-t) \, dt
= 2^{11/2} \, \Gamma(11/2)
Exercise 4

Compute the product

\Gamma\left( \frac{5}{2} \right) B\left( \frac{3}{2}, 1 \right)

where \Gamma() is the Gamma function and B() is the Beta function.

Solution

We need to write the Beta function in terms of Gamma functions:

\Gamma\left( \frac{5}{2} \right) B\left( \frac{3}{2}, 1 \right)
= \Gamma\left( \frac{5}{2} \right) \frac{\Gamma(3/2) \Gamma(1)}{\Gamma(3/2 + 1)}
= \Gamma\left( \frac{5}{2} \right) \frac{\Gamma(3/2) \Gamma(1)}{\Gamma(5/2)}
= \Gamma\left( \frac{3}{2} \right) \Gamma(1)
\overset{A}{=} \Gamma\left( \frac{3}{2} \right)
\overset{B}{=} \left( \frac{3}{2} - 1 \right) \Gamma\left( \frac{3}{2} - 1 \right)
= \frac{1}{2} \Gamma\left( \frac{1}{2} \right)
\overset{C}{=} \frac{1}{2} \sqrt{\pi}

where: in step A we have used the fact that \Gamma(1) = 1; in step B we have used
the recursive formula for the Gamma function; in step C we have used the fact
that

\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}
Exercise 5

Compute the ratio

\frac{B(7/2, 9/2)}{B(5/2, 11/2)}

Solution

This is achieved by rewriting the numerator of the ratio in terms of Gamma
functions:

\frac{B(7/2, 9/2)}{B(5/2, 11/2)}
= \frac{1}{B(5/2, 11/2)} \cdot \frac{\Gamma(7/2) \Gamma(9/2)}{\Gamma(7/2 + 9/2)}
\overset{A}{=} \frac{1}{B(5/2, 11/2)} \cdot \frac{(7/2 - 1) \Gamma(7/2 - 1) \Gamma(9/2)}{\Gamma(7/2 + 9/2)}
\overset{B}{=} \frac{1}{B(5/2, 11/2)} \cdot \frac{(5/2) \Gamma(5/2) \cdot \frac{2}{9} \Gamma(11/2)}{\Gamma(5/2 + 11/2)}
= \frac{5}{2} \cdot \frac{2}{9} \cdot \frac{1}{B(5/2, 11/2)} \cdot \frac{\Gamma(5/2) \Gamma(11/2)}{\Gamma(5/2 + 11/2)}
\overset{C}{=} \frac{5}{9} \cdot \frac{1}{B(5/2, 11/2)} \cdot B\left( \frac{5}{2}, \frac{11}{2} \right) = \frac{5}{9}

where: in steps A and B we have used the recursive formula for the Gamma
function (in step B, \Gamma(11/2) = (9/2) \Gamma(9/2) implies \Gamma(9/2) = \frac{2}{9} \Gamma(11/2)); in
step C we have used the definition of the Beta function.
Exercise 6

Compute the integral

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx

Solution

We first express the integral in terms of the Beta function, using the change of
variable x = t/2:

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx
\overset{A}{=} \int_0^\infty \left( \frac{1}{2} t \right)^{3/2} (1 + t)^{-5} \frac{1}{2} \, dt
= \left( \frac{1}{2} \right)^{5/2} \int_0^\infty t^{3/2} (1 + t)^{-5} \, dt
= \left( \frac{1}{2} \right)^{5/2} \int_0^\infty t^{5/2 - 1} (1 + t)^{-5/2 - 5/2} \, dt
\overset{B}{=} \left( \frac{1}{2} \right)^{5/2} B\left( \frac{5}{2}, \frac{5}{2} \right)

The Beta function can be written in terms of Gamma functions:

B\left( \frac{5}{2}, \frac{5}{2} \right) = \frac{\Gamma(5/2) \Gamma(5/2)}{\Gamma(5)} = \frac{\left( \frac{3}{2} \cdot \frac{1}{2} \Gamma(1/2) \right)^2}{4!} = \frac{\frac{9}{16} \pi}{24} = \frac{9\pi}{384}

where we have used the recursive formula for the Gamma function and the fact
that

\Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}

Substituting the above number into the previous expression for the integral, we
obtain

\int_0^\infty x^{3/2} (1 + 2x)^{-5} \, dx = \left( \frac{1}{2} \right)^{5/2} B\left( \frac{5}{2}, \frac{5}{2} \right) = \frac{1}{8} \sqrt{2} \cdot \frac{9\pi}{384} = \frac{9 \sqrt{2} \pi}{3072} = \frac{3 \sqrt{2} \pi}{1024}
If you wish, you can check the above result using the MATLAB commands

syms x
f = (x^(3/2))*((1 + 2*x)^(-5))
int(f, 0, Inf)
Part II
Fundamentals of probability
Chapter 10
Probability
Probability is used to quantify the likelihood of things that can happen, when it
is not yet known whether they will happen. Sometimes probability is also used to
quantify the likelihood of things that could have happened in the past, when it is
not yet known whether they actually happened.
Since we usually speak of the "probability of an event", the next section introduces
a formal definition of the concept of event. We then discuss the properties
that probability needs to satisfy. Finally, we discuss some possible interpretations
of the concept of probability.
Example 64 Suppose that we toss a die. Six numbers, from 1 to 6, can appear
face up, but we do not yet know which one of them will appear. The sample space
is

\Omega = \{1, 2, 3, 4, 5, 6\}
1 In this lecture we are going to use the Greek letter \Omega (Omega), which is often used in
probability theory. \Omega is upper case, while \omega is lower case.
2 P. 75.
Each of the six numbers is a sample point. The outcomes are mutually exclusive,
because only one number at a time can appear face up. The outcomes are also
exhaustive, because at least one of the six numbers in \Omega will appear face up after
we toss the die. Define

E = \{1, 3, 5\}

E is an event (a subset of \Omega). In words, the event E can be described as "an odd
number appears face up". Now, define

F = \{6\}
10.2 Probability

The probability of an event is a real number, attached to the event, that tells us
how likely that event is. Suppose E is an event. We denote the probability of E
by P(E).
Probability needs to satisfy the following properties:

1. Range. 0 \leq P(E) \leq 1 for any event E.

2. Sure thing. P(\Omega) = 1.

3. Sigma-additivity. If E_1, E_2, \ldots is a sequence of disjoint events, then

P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n)
Example 65 Suppose that we flip a coin. The possible outcomes are either tail
(T) or head (H), i.e.,

\Omega = \{T, H\}

3 See p. 31.
There are a total of four subsets of \Omega (events): \Omega itself, the empty set \emptyset, the event
\{T\} and the event \{H\}. The following assignment of probabilities satisfies the
properties enumerated above:

P(\Omega) = 1, \quad P(\emptyset) = 0, \quad P(\{T\}) = \frac{1}{2}, \quad P(\{H\}) = \frac{1}{2}

All these probabilities are between 0 and 1, so the range property is satisfied.
P(\Omega) = 1, so the sure thing property is satisfied. Also sigma-additivity is satisfied,
because

P(\{T\} \cup \{H\}) = P(\Omega) = 1 = \frac{1}{2} + \frac{1}{2} = P(\{T\}) + P(\{H\})
P(\Omega \cup \emptyset) = P(\Omega) = 1 = 1 + 0 = P(\Omega) + P(\emptyset)
P(\{T\} \cup \emptyset) = P(\{T\}) = \frac{1}{2} = \frac{1}{2} + 0 = P(\{T\}) + P(\emptyset)
P(\{H\} \cup \emptyset) = P(\{H\}) = \frac{1}{2} = \frac{1}{2} + 0 = P(\{H\}) + P(\emptyset)

and the four couples (\{T\}, \{H\}), (\Omega, \emptyset), (\{T\}, \emptyset), (\{H\}, \emptyset) are the only four
possible couples of disjoint sets.
Before ending this section, two remarks are in order. First, we have not discussed
the interpretations of probability, but below you can find a brief discussion
of the interpretations of probability. Second, we have been somewhat sloppy in
defining events and probability, but you can find a more rigorous definition of
probability below.
P(\emptyset) = 0

Proof. Define a sequence of events E_1 = \Omega and E_n = \emptyset for n \geq 2, so that the
events in the sequence are disjoint and their union is \Omega. By the sure thing property
and sigma-additivity,

1 = P(\Omega) = P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n) = P(\Omega) + \sum_{n=2}^{\infty} P(\emptyset)

that is,

P(\Omega) + \sum_{n=2}^{\infty} P(\emptyset) = 1

Since P(\Omega) = 1, this implies \sum_{n=2}^{\infty} P(\emptyset) = 0, which can hold only if P(\emptyset) = 0.
Proof. Let E and F be two disjoint events. Define a sequence of events E_1 = E,
E_2 = F and E_n = \emptyset for n \geq 3. By sigma-additivity,

P(E \cup F) = P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n) = P(E) + P(F) + \sum_{n=3}^{\infty} P(\emptyset) = P(E) + P(F)

since P(\emptyset) = 0.
P(E^c) = 1 - P(E)

In words, the probability that an event does not occur (P(E^c)) is equal to one
minus the probability that it occurs.
Proof. Note that

\Omega = E \cup E^c     (10.2)

and that E and E^c are disjoint sets. Then, using the sure thing property and finite
additivity, we obtain

1 = P(\Omega) = P(E \cup E^c) = P(E) + P(E^c)
P(E \cup F) = P(E) + P(F) - P(E \cap F)

Proof. First note that

P(E) = P(E \cap \Omega) = P(E \cap (F \cup F^c)) = P((E \cap F) \cup (E \cap F^c)) = P(E \cap F) + P(E \cap F^c)

and

P(F) = P(F \cap \Omega) = P(F \cap (E \cup E^c)) = P((F \cap E) \cup (F \cap E^c)) = P(F \cap E) + P(F \cap E^c)

so that

P(E \cap F^c) = P(E) - P(E \cap F)
P(F \cap E^c) = P(F) - P(E \cap F)

Furthermore,

E \cup F = (E \cap F) \cup (E \cap F^c) \cup (F \cap E^c)

and the three events on the right hand side are disjoint. Thus,

P(E \cup F) = P((E \cap F) \cup (E \cap F^c) \cup (F \cap E^c))
= P(E \cap F) + P(E \cap F^c) + P(F \cap E^c)
= P(E \cap F) + P(E) - P(E \cap F) + P(F) - P(E \cap F)
= P(E) + P(F) - P(E \cap F)
If E \subseteq F, then

P(E) \leq P(F)

Proof. Since E \subseteq F, we have F = (F \cap E) \cup (F \cap E^c), where the two events
on the right hand side are disjoint and F \cap E = E. By finite additivity,

P(F) = P((F \cap E) \cup (F \cap E^c)) = P(F \cap E) + P(F \cap E^c) = P(E) + P(F \cap E^c)

Since P(F \cap E^c) \geq 0, this implies

P(F) = P(E) + P(F \cap E^c) \geq P(E)
1. Whole set. \Omega \in F.

2. Closure under complementation. If E \in F, then also E^c \in F (E^c, the
complement of E with respect to \Omega, is the set of all elements of \Omega that do
not belong to E).

3. Closure under countable unions. If E_1, E_2, \ldots, E_n, \ldots is a sequence of
subsets of \Omega belonging to F, then

\bigcup_{n=1}^{\infty} E_n \in F
If E \in F and F \in F, then (E \cup F) \in F

It means that if "one of the things in E will happen" and "one of the things in F
will happen" are considered two events, then also "one of the things in E or one
of the things in F will happen" must be considered an event. This simply means
that if you are able to separately assess the possibility of two events E and F
happening, then, of course, you must be able to assess the possibility of one or the
other happening. Property 3) simply extends this intuitive property to countable
collections of events: the extension is needed for mathematical reasons, to derive
certain continuity properties of probability measures.
1. Sure thing. P(\Omega) = 1.

2. Sigma-additivity. If E_1, E_2, \ldots is a sequence of disjoint events belonging
to F, then P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n).

Nothing new has been added to the definition given above. This definition
just clarifies that a probability measure is a function defined on a sigma-algebra of
events. Hence, it is not possible to properly speak of probability for subsets of \Omega
that do not belong to the sigma-algebra.
A triple (\Omega, F, P) is called a probability space and the sets belonging to the
sigma-algebra F are called measurable sets.
Exercise 1

A ball is drawn at random from an urn containing colored balls. The balls can be
either red or blue (no other colors are possible). The probability of drawing a blue
ball is 1/3. What is the probability of drawing a red ball?

Solution

The sample space \Omega can be represented as the union of two disjoint events E and
F:

\Omega = E \cup F

where the event E can be described as "a red ball is drawn" and the event F can
be described as "a blue ball is drawn". Note that E is the complement of F:

E = F^c

Therefore, using the formula for the probability of a complement,

P(E) = 1 - P(F) = 1 - \frac{1}{3} = \frac{2}{3}

5 See p. 32.
Exercise 2

Consider a sample space \Omega comprising three possible outcomes:

\Omega = \{\omega_1, \omega_2, \omega_3\}

Solution

There are two events whose probability is 3/4.
The first one is

E = \{\omega_1, \omega_3\}

By using the formula for the probability of a union of disjoint events, we get
Exercise 3

Consider a sample space \Omega comprising four possible outcomes:

\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}

and the events

E = \{\omega_1\}
F = \{\omega_1, \omega_2\}
G = \{\omega_1, \omega_2, \omega_3\}
H = \{\omega_2, \omega_4\}

Find P(H).

Solution

First note that, by additivity,

P(H) = P(\{\omega_2\} \cup \{\omega_4\}) = P(\{\omega_2\}) + P(\{\omega_4\})

Therefore, in order to compute P(H), we need to compute P(\{\omega_2\}) and P(\{\omega_4\}).
P(\{\omega_2\}) is found using additivity on F:

\frac{5}{10} = P(F) = P(\{\omega_1\} \cup \{\omega_2\}) = P(\{\omega_1\}) + P(\{\omega_2\}) = P(E) + P(\{\omega_2\}) = \frac{1}{10} + P(\{\omega_2\})

so that

P(\{\omega_2\}) = \frac{5}{10} - \frac{1}{10} = \frac{4}{10}

P(\{\omega_4\}) is found using the fact that one minus the probability of an event is
equal to the probability of its complement and the fact that \{\omega_4\} = G^c:

P(\{\omega_4\}) = P(G^c) = 1 - P(G) = 1 - \frac{7}{10} = \frac{3}{10}

As a consequence,

P(H) = P(\{\omega_2\}) + P(\{\omega_4\}) = \frac{4}{10} + \frac{3}{10} = \frac{7}{10}
Chapter 11

Zero-probability events

The notion of a zero-probability event plays a special role in probability theory and
statistics, because it underpins the important concepts of almost sure property and
almost sure event. In this lecture, we define zero-probability events and discuss
some counterintuitive aspects of their apparently simple definition, in particular
the fact that a zero-probability event is not an event that never happens: there are
common probabilistic settings where zero-probability events do happen all the time!
After discussing this matter, we introduce the concepts of almost sure property and
almost sure event.
E is a zero-probability event if and only if

P(E) = 0

Despite the simplicity of this definition, there are some features of zero-probability
events that might seem paradoxical. We illustrate these features with
the following example. Consider a sample space equal to the unit interval:

\Omega = [0, 1]

It is possible to assign probabilities in such a way that each sub-interval has
probability equal to its length.
Stated differently, every possible outcome is a zero-probability event. This might
seem counterintuitive. In everyday language, a zero-probability event is an event
that never happens. However, this example illustrates that a zero-probability event
can indeed happen. Since the sample space provides an exhaustive description of
the possible outcomes, one and only one of the sample points \omega \in \Omega will be the
realized outcome. But we have just demonstrated that all the sample points are
zero-probability events; as a consequence, the realized outcome can only be a
zero-probability event. Another apparently paradoxical aspect of this probability
model is that the sample space \Omega can be obtained as the union of disjoint
zero-probability events:

\Omega = \bigcup_{\omega \in \Omega} \{\omega\}

where each \omega \in \Omega is a zero-probability event and all events in the union are disjoint.
If we forgot that the additivity property of probability applies only to countable
collections of subsets, we would mistakenly deduce that

P(\Omega) = P\left( \bigcup_{\omega \in \Omega} \{\omega\} \right) = \sum_{\omega \in \Omega} P(\{\omega\}) = 0

The main lesson to be taken from this example is that a zero-probability event
is not an event that never happens: in some probability models, where the sample
space is not countable, zero-probability events do happen all the time!
Definition 68 Let a property be given that a sample point \omega \in \Omega can either satisfy
or not satisfy. Let F be the set of all sample points that satisfy the property:

F = \{\omega \in \Omega : \omega \text{ satisfies the property}\}

2 Williams, D. (1991) Probability with martingales, Cambridge University Press.
3 See p. 69.
4 See p. 69.
5 See p. 70.

Denote its complement, that is, the set of all points not satisfying the property, by
F^c. The property is said to be almost sure if there exists a zero-probability event
E such that F^c \subseteq E.

By monotonicity,

F^c \subseteq E \Rightarrow P(F^c) \leq P(E)

which in turn implies P(F^c) = 0. Finally, recalling the formula for the probability
of a complement, we obtain

P(F) = 1 - P(F^c) = 1 - 0 = 1
Example 69 Consider the sample space \Omega = [0, 1] and the assignment of
probabilities introduced in the previous example. Define the event

E = \{\omega \in \Omega : \omega \text{ is a rational number}\}

Since the set of rational numbers in [0, 1] is countable, its elements can be arranged
into a sequence:

E = \{\omega_1, \ldots, \omega_n, \ldots\}

6 In other words, the set F^c of all points that do not satisfy the property is included in a
zero-probability event.
7 See the lecture entitled Probability (p. 69).
8 See p. 73.
9 See p. 72.
10 See p. 32.

P(F) = P(E^c) = 1 - P(E) = 1 - 0 = 1
Exercise 1
2
Let E and F be two events. Let E c be a zero-probability event and P (F ) = 3.
Compute P (E [ F ).
Solution
E^c is a zero-probability event, which means that
P(E^c) = 0
By the formula for the probability of a complement,
P(E) = 1 − P(E^c) = 1 − 0 = 1
Since E ⊆ (E ∪ F), by monotonicity we obtain
P(E ∪ F) ≥ P(E) = 1
and, since no probability can exceed 1,
P(E ∪ F) = 1
Exercise 2
Let E and F be two events. Let E^c be a zero-probability event and P(F) = 1/2.
Compute P(E ∩ F).
1 1 See p. 72.
Solution
E^c is a zero-probability event, which means that
P(E^c) = 0
By the formula for the probability of a complement,
P(E) = 1 − P(E^c) = 1 − 0 = 1
By the formula for the probability of a union,
P(E ∩ F) = P(E) + P(F) − P(E ∪ F) = 1 + 1/2 − P(E ∪ F) = 3/2 − P(E ∪ F)
Since E ⊆ (E ∪ F), by monotonicity, we obtain
P(E ∪ F) ≥ P(E) = 1
so that P(E ∪ F) = 1 and
P(E ∩ F) = 3/2 − 1 = 1/2
Chapter 12
Conditional probability
12.1 Introduction
Let Ω be a sample space and let P(E) denote the probability assigned to the events
E ⊆ Ω. Suppose that, after assigning the probabilities P(E) to the events in Ω, we
receive new information about the things that will happen (the possible outcomes).
In particular, suppose that we are told that the realized outcome will belong to a
set I ⊆ Ω. How should we revise the probabilities assigned to the events in Ω, to
properly take the new information into account?
Denote by P(E|I) the revised probability assigned to an event E ⊆ Ω after
learning that the realized outcome will be an element of I. P(E|I) is called the
conditional probability of E given I.
Despite being an intuitive concept, conditional probability is quite difficult to
define in a rigorous way. We take a gradual approach in this lecture. We first
discuss conditional probability for the very special case in which all the sample
points are equally likely. We then give a more general definition. Finally, we refer
the reader to other lectures where conditional probability is defined in even more
abstract ways.
Suppose that the sample space comprises n sample points:
Ω = {ω₁, …, ωₙ}
Suppose also that each sample point is assigned the same probability:
P({ω₁}) = … = P({ωₙ}) = 1/n
P(E) = card(E) / card(Ω)
where card denotes the cardinality of a set, i.e. the number of its elements. In
other words, the probability of an event E is obtained in two steps:
1. counting the number of "cases that are favorable to the event E", i.e. the
number of elements ! i belonging to E;
2. dividing the number thus obtained by the number of "all possible cases", i.e.
the number of elements ! i belonging to .
For example, if E = {ω₁, ω₂}, then
P(E) = card({ω₁, ω₂}) / card(Ω) = 2/n
When we learn that the realized outcome will belong to a set I ⊆ Ω, we still
apply the rule
probability of an event = (number of cases that are favorable to the event) / (number of all possible cases)
However, the number of all possible cases is now equal to the number of elements
of I, because only the outcomes belonging to I are still possible. Furthermore, the
number of favorable cases is now equal to the number of elements of E ∩ I, because
the outcomes in E ∩ I^c are no longer possible. As a consequence:
P(E|I) = card(E ∩ I) / card(I)
Dividing numerator and denominator by card(Ω), we obtain
P(E|I) = [card(E ∩ I) / card(Ω)] / [card(I) / card(Ω)] = P(E ∩ I) / P(I)
Therefore, when all sample points are equally likely, conditional probabilities
are computed as
P(E|I) = P(E ∩ I) / P(I)
Example 70 Suppose that we toss a die. Six numbers (from 1 to 6) can appear
face up, but we do not yet know which one of them will appear. The sample space
is
Ω = {1, 2, 3, 4, 5, 6}
Each of the six numbers is a sample point and is assigned probability 1/6. Define
the event E as follows:
E = {1, 3, 5}
where the event E could be described as "an odd number appears face up". Define
the event I as follows:
I = {4, 5, 6}
where the event I could be described as "a number greater than 3 appears face up".
The probability of I is
P(I) = P({4}) + P({5}) + P({6}) = 1/6 + 1/6 + 1/6 = 1/2
Suppose we are told that the realized outcome will belong to I. How do we have
to revise our assessment of the probability of the event E, according to the rules
of conditional probability? First of all, we need to compute the probability of the
event E ∩ I:
P(E ∩ I) = P({5}) = 1/6
Then, the conditional probability of E given I is
P(E|I) = P(E ∩ I) / P(I) = (1/6) / (1/2) = 2/6 = 1/3
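Since every quantity in this example is a ratio of cardinalities, the computation is easy to verify with a few lines of code. The sketch below (the helper function is ours, not part of the text) recomputes Example 70 by counting favorable cases with exact fractions.

```python
from fractions import Fraction

def cond_prob(E, I, omega):
    """P(E|I) = card(E ∩ I) / card(I), valid when all points of omega are equally likely."""
    E, I = set(E) & set(omega), set(I) & set(omega)
    return Fraction(len(E & I), len(I))

omega = {1, 2, 3, 4, 5, 6}   # the die roll of Example 70
E = {1, 3, 5}                # "an odd number appears face up"
I = {4, 5, 6}                # "a number greater than 3 appears face up"

print(cond_prob(E, I, omega))   # 1/3, matching the result above
```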
In the next section, we will show that the conditional probability formula
P(E|I) = P(E ∩ I) / P(I)
is valid also for more general cases (i.e. when the sample points are not all equally
likely). However, this formula already allows us to understand why defining con-
ditional probability is a challenging task. In the conditional probability formula,
a division by P(I) is performed. This division is impossible when I is a zero-
probability event¹. If we want to be able to define P(E|I) also when P(I) = 0,
then we need to give a more complicated definition of conditional probability. We
will return to this point later.
2. Sure thing: P(I|I) = 1.
Moreover, since (E ∩ I) ⊆ I, by monotonicity,
P(E ∩ I) ≤ P(I)
Hence:
P(E ∩ I) / P(I) ≤ 1
Furthermore, since P(E ∩ I) ≥ 0 and P(I) ≥ 0, also
P(E ∩ I) / P(I) ≥ 0
4 See p. 73.
But
P(⋃_{n=1}^∞ Eₙ | I) = P((⋃_{n=1}^∞ Eₙ) ∩ I) / P(I)
= P(⋃_{n=1}^∞ (Eₙ ∩ I)) / P(I)
= (∑_{n=1}^∞ P(Eₙ ∩ I)) / P(I)
= ∑_{n=1}^∞ P(Eₙ ∩ I) / P(I)
= ∑_{n=1}^∞ P(Eₙ | I)
P(E | I) = P(E ∩ I | I)
because
P(E | I) = P((E ∩ I) ∪ (E ∩ I^c) | I)
= P(E ∩ I | I) + P(E ∩ I^c | I)
= P(E ∩ I | I)
where the last equality follows from the fact that (E ∩ I^c) ∩ I = ∅, so that
P(E ∩ I^c | I) = 0.
Suppose that I₁, …, Iₙ is a partition of Ω.
The law of total probability states that, for any event E, the following holds:
P(E) = P(E ∩ Ω)
= P(E ∩ (I₁ ∪ … ∪ Iₙ))
= P((E ∩ I₁) ∪ … ∪ (E ∩ Iₙ))
= P(E ∩ I₁) + … + P(E ∩ Iₙ)    (A)
= P(E|I₁) P(I₁) + … + P(E|Iₙ) P(Iₙ)    (B)
where step (A) follows from the additivity of probability (the events E ∩ Iⱼ are
disjoint because the sets Iⱼ are disjoint), and step (B) follows from the conditional
probability formula.
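The two steps of the derivation translate directly into code. The following sketch uses a made-up three-event partition (the labels and numbers are ours, purely illustrative) to compute P(E) from the conditional probabilities P(E|I_j) and the probabilities P(I_j):

```python
from fractions import Fraction as Fr

def total_probability(cond, part):
    """Law of total probability: P(E) = sum_j P(E|I_j) * P(I_j)."""
    assert sum(part.values()) == 1       # I_1, ..., I_n must partition the sample space
    return sum(cond[j] * part[j] for j in part)

part = {"I1": Fr(1, 2), "I2": Fr(1, 3), "I3": Fr(1, 6)}   # P(I_j)
cond = {"I1": Fr(1, 4), "I2": Fr(1, 2), "I3": Fr(1, 1)}   # P(E|I_j)
print(total_probability(cond, part))   # 1/8 + 1/6 + 1/6 = 11/24
```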
Exercise 1
Consider a sample space Ω comprising three possible outcomes ω₁, ω₂, ω₃:
Ω = {ω₁, ω₂, ω₃}
Suppose the three possible outcomes are assigned the following probabilities:
P(ω₁) = 1/5, P(ω₂) = 2/5, P(ω₃) = 2/5
Define the events
E = {ω₁, ω₂}
F = {ω₁, ω₃}
Compute P(F|E^c), the conditional probability of F given E^c.
Solution
We need to use the conditional probability formula
P(F|E^c) = P(F ∩ E^c) / P(E^c)
The numerator is
P(F ∩ E^c) = P({ω₁, ω₃} ∩ {ω₃}) = P({ω₃}) = 2/5
and the denominator is
P(E^c) = P({ω₃}) = 2/5
As a consequence:
P(F|E^c) = P(F ∩ E^c) / P(E^c) = (2/5) / (2/5) = 1
Exercise 2
Consider a sample space Ω comprising four possible outcomes ω₁, ω₂, ω₃, ω₄:
Ω = {ω₁, ω₂, ω₃, ω₄}
Suppose the four possible outcomes are assigned the following probabilities:
P(ω₁) = 1/10, P(ω₂) = 4/10, P(ω₃) = 3/10, P(ω₄) = 2/10
Define two events
E = {ω₁, ω₂}
F = {ω₂, ω₃}
Compute P(E|F), the conditional probability of E given F.
Solution
We need to use the formula
P(E|F) = P(E ∩ F) / P(F)
But
P(E ∩ F) = P({ω₂}) = 4/10
while, using additivity:
P(F) = P({ω₂, ω₃}) = P({ω₂}) + P({ω₃}) = 4/10 + 3/10 = 7/10
Therefore:
P(E|F) = P(E ∩ F) / P(F) = (4/10) / (7/10) = 4/7
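The same computation can be carried out mechanically from the probabilities of the sample points. A minimal sketch (the dictionary keys are our own labels for ω₁, …, ω₄):

```python
from fractions import Fraction as Fr

p = {"w1": Fr(1, 10), "w2": Fr(4, 10), "w3": Fr(3, 10), "w4": Fr(2, 10)}

def prob(event):
    """P(A) = sum of the probabilities of the sample points in A."""
    return sum(p[w] for w in event)

E = {"w1", "w2"}
F = {"w2", "w3"}
p_E_given_F = prob(E & F) / prob(F)
print(p_E_given_F)   # 4/7
```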
Exercise 3
The Census Bureau has estimated the following survival probabilities for men:
the probability of living at least 70 years is 80%, and the probability of living at
least 80 years is 50%. What is the conditional probability that a man lives at least
80 years given that he has just celebrated his 70th birthday?
Solution
Given a hypothetical sample space Ω, define the two events
E = {he lives at least 70 years}
F = {he lives at least 80 years}
We need to compute
P(F|E) = P(F ∩ E) / P(E)
Since living at least 80 years implies living at least 70 years, F ⊆ E, so that
F ∩ E = F
Therefore:
P(F ∩ E) = P(F) = 50% = 1/2
Thus:
P(F|E) = P(F ∩ E) / P(E) = (1/2) / (4/5) = 5/8
Chapter 13
Bayes' rule
This lecture introduces Bayes' rule. Before reading this lecture, make sure you are
familiar with the concept of conditional probability (p. 85).
Bayes' rule states that
P(A|B) = P(B|A) P(A) / P(B)
It is a simple consequence of the conditional probability formula:
P(A|B) = P(A ∩ B) / P(B)
and
P(B|A) = P(A ∩ B) / P(A)
Re-arrange the second formula:
P(A ∩ B) = P(B|A) P(A)
and substitute it into the first:
P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)
The following example shows how Bayes’ rule can be applied in a practical
situation.
Example 72 An HIV test gives a positive result with probability 98% when the
patient is indeed affected by HIV, while it gives a negative result with 99% probability
when the patient is not affected by HIV. If a patient is drawn at random from a
population in which 0.1% of individuals are affected by HIV and he is found positive,
what is the probability that he is indeed affected by HIV? In probabilistic terms,
what we know about this problem can be formalized as follows:
P(positive | HIV) = 0.98
P(positive | no HIV) = 1 − 0.99 = 0.01
P(HIV) = 0.001
P(no HIV) = 1 − 0.001 = 0.999
The unconditional probability of being found positive can be derived using the law
of total probability¹:
P(positive) = P(positive | HIV) P(HIV) + P(positive | no HIV) P(no HIV)
= 0.98 × 0.001 + 0.01 × 0.999 = 0.01097
By Bayes' rule,
P(HIV | positive) = P(positive | HIV) P(HIV) / P(positive) = 0.00098 / 0.01097 ≈ 0.0893
Therefore, even if the test is conditionally very accurate, the unconditional proba-
bility of being affected by HIV when found positive is less than 10 per cent!
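The arithmetic of Example 72 is worth automating, because the result surprises many readers. The sketch below (the function and argument names are ours, not the book's) applies Bayes' rule, with the marginal computed by the law of total probability:

```python
def posterior(sensitivity, false_positive_rate, prior):
    """P(HIV | positive) via Bayes' rule and the law of total probability."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(sensitivity=0.98, false_positive_rate=0.01, prior=0.001)
print(round(p, 4))   # 0.0893: less than 10 per cent
```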
13.2 Terminology
The quantities involved in Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)
are given special names: P(A) is called the prior probability, P(A|B) the posterior
probability, P(B|A) the conditional probability (or likelihood), and P(B) the
marginal probability.
Exercise 1
There are two urns containing colored balls. The first urn contains 50 red balls and
50 blue balls. The second urn contains 30 red balls and 70 blue balls. One of the
two urns is randomly chosen (both urns have probability 50% of being chosen) and
then a ball is drawn at random from one of the two urns. If a red ball is drawn,
what is the probability that it comes from the first urn?
Solution
In probabilistic terms, what we know about this problem can be formalized as
follows:
P(red | urn 1) = 1/2
P(red | urn 2) = 3/10
P(urn 1) = 1/2
P(urn 2) = 1/2
The unconditional probability of drawing a red ball can be derived using the law
of total probability:
P(red) = P(red | urn 1) P(urn 1) + P(red | urn 2) P(urn 2)
= (1/2)(1/2) + (3/10)(1/2) = 2/5
By Bayes' rule, the probability that the ball comes from the first urn is
P(urn 1 | red) = P(red | urn 1) P(urn 1) / P(red) = (1/4) / (2/5) = 5/8
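A short script (exact rational arithmetic, with our own variable names) confirms both steps of the solution: the marginal probability of drawing a red ball and the posterior probability of the first urn.

```python
from fractions import Fraction as Fr

prior = {"urn1": Fr(1, 2), "urn2": Fr(1, 2)}    # each urn is chosen with probability 1/2
p_red = {"urn1": Fr(1, 2), "urn2": Fr(3, 10)}   # P(red | urn)

marginal = sum(p_red[u] * prior[u] for u in prior)      # law of total probability
post_urn1 = p_red["urn1"] * prior["urn1"] / marginal    # Bayes' rule
print(marginal, post_urn1)   # 2/5 5/8
```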
Exercise 2
An economics consulting firm has created a model to predict recessions. The model
predicts a recession with probability 80% when a recession is indeed coming and
with probability 10% when no recession is coming. The unconditional probability
of falling into a recession is 20%. If the model predicts a recession, what is the
probability that a recession will indeed come?
Solution
What we know about this problem can be formalized as follows:
P(rec. pred. | rec. coming) = 8/10
P(rec. pred. | rec. not coming) = 1/10
P(rec. coming) = 2/10
P(rec. not coming) = 1 − P(rec. coming) = 1 − 2/10 = 8/10
The unconditional probability of predicting a recession can be derived using the
law of total probability:
P(rec. pred.) = P(rec. pred. | rec. coming) P(rec. coming)
+ P(rec. pred. | rec. not coming) P(rec. not coming)
= (8/10)(2/10) + (1/10)(8/10) = 24/100
By Bayes' rule,
P(rec. coming | rec. pred.) = P(rec. pred. | rec. coming) P(rec. coming) / P(rec. pred.)
= (16/100) / (24/100) = 2/3
Exercise 3
Alice has two coins in her pocket, a fair coin (head on one side and tail on the other
side) and a two-headed coin. She picks one at random from her pocket, tosses it
and obtains head. What is the probability that she flipped the fair coin?
Solution
What we know about this problem can be formalized as follows:
P(head | fair coin) = 1/2
P(head | unfair coin) = 1
P(fair coin) = 1/2
P(unfair coin) = 1/2
The unconditional probability of obtaining head can be derived using the law of
total probability:
P(head) = P(head | fair coin) P(fair coin) + P(head | unfair coin) P(unfair coin)
= (1/2)(1/2) + 1 × (1/2) = 3/4
By Bayes' rule, the probability that she flipped the fair coin is
P(fair coin | head) = P(head | fair coin) P(fair coin) / P(head) = (1/4) / (3/4) = 1/3
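Another way to check the answer is to enumerate the elementary outcomes (coin picked, face obtained) together with their probabilities; conditioning then amounts to renormalizing over the outcomes whose face is head. A minimal sketch:

```python
from fractions import Fraction as Fr

# elementary outcomes: (coin, face) -> probability
outcomes = {
    ("fair", "head"):   Fr(1, 2) * Fr(1, 2),
    ("fair", "tail"):   Fr(1, 2) * Fr(1, 2),
    ("unfair", "head"): Fr(1, 2) * 1,        # the two-headed coin always shows head
}

p_head = sum(p for (coin, face), p in outcomes.items() if face == "head")
p_fair_given_head = outcomes[("fair", "head")] / p_head
print(p_head, p_fair_given_head)   # 3/4 1/3
```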
Chapter 14
Independent events
This lecture introduces the notion of independent event. Before reading this lecture,
make sure you are familiar with the concept of conditional probability¹.
P(E ∩ F) = P(E) P(F)
This definition implies properties (14.1) and (14.2) above: if E and F are
independent, and (say) P(E) > 0, then
P(F|E) = P(E ∩ F) / P(E) = P(E) P(F) / P(E) = P(F)
Ω = {B₁, B₂, B₃, B₄}
1 See p. 85.
Each of the four balls has the same probability of being drawn, equal to 1/4, i.e.,
P({B₁}) = P({B₂}) = P({B₃}) = P({B₄}) = 1/4
Define the events E and F as follows:
E = fB1 ; B2 g
F = fB2 ; B3 g
E = fB1 ; B2 g
F = fB2 ; B3 g
G = fB2 ; B4 g
Exercise 1
Suppose that we toss a die. Six numbers (from 1 to 6) can appear face up, but we
do not yet know which one of them will appear. The sample space is
Ω = {1, 2, 3, 4, 5, 6}
Each of the six numbers is a sample point and is assigned probability 1/6. Define
the events E and F as follows:
E = {1, 3, 4}
F = {3, 4, 5, 6}
Prove that E and F are independent.
2 See p. 79.
3 See p. 73.
Solution
The probability of E is
P(E) = P({1, 3, 4}) = 3/6 = 1/2
The probability of F is
P(F) = P({3, 4, 5, 6}) = 4/6 = 2/3
Their intersection is E ∩ F = {3, 4}, so that
P(E ∩ F) = 2/6 = 1/3 = (1/2) × (2/3) = P(E) P(F)
Hence, E and F are independent.
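With equally likely sample points, independence can be verified mechanically by counting. The helper below (our own sketch) checks the defining identity P(E ∩ F) = P(E) P(F):

```python
from fractions import Fraction

def independent(E, F, omega):
    """True iff P(E ∩ F) == P(E) * P(F) under equally likely sample points."""
    p = lambda A: Fraction(len(set(A) & set(omega)), len(omega))
    return p(set(E) & set(F)) == p(E) * p(F)

omega = {1, 2, 3, 4, 5, 6}
print(independent({1, 3, 4}, {3, 4, 5, 6}, omega))   # True  (this exercise)
print(independent({1, 3, 5}, {4, 5, 6}, omega))      # False (the events of Example 70)
```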
Exercise 2
A firm undertakes two projects, A and B. The probabilities of having a successful
outcome are 3/4 for project A and 1/2 for project B. The probability that both
projects will have a successful outcome is 7/16. Are the outcomes of the two projects
independent?
Solution
Denote by E the event "project A is successful", by F the event "project B is
successful" and by G the event "both projects are successful". The event G can
be expressed as
G=E\F
If E and F are independent, it must be that
P(G) = P(E ∩ F) = P(E) P(F) = (3/4) × (1/2) = 3/8 ≠ 7/16
Therefore, the outcomes of the two projects are not independent.
Exercise 3
A firm undertakes two projects, A and B. The probabilities of having a successful
outcome are 2/3 for project A and 4/5 for project B. What is the probability that
neither of the two projects will have a successful outcome if their outcomes are
independent?
Solution
Denote by E the event "project A is successful", by F the event "project B is
successful" and by G the event "neither of the two projects is successful". The
event G can be expressed as
G = E^c ∩ F^c
where E^c and F^c are the complements of E and F. Thus, the probability that
neither of the two projects will have a successful outcome is
P(G) = P(E^c ∩ F^c)
= P((E ∪ F)^c)    (A)
= 1 − P(E ∪ F)    (B)
= 1 − (P(E) + P(F) − P(E ∩ F))    (C)
= 1 − P(E) − P(F) + P(E ∩ F)
= 1 − P(E) − P(F) + P(E) P(F)    (D)
= 1 − 2/3 − 4/5 + (2/3)(4/5)
= (15 − 10 − 12 + 8)/15 = 1/15
where: in step (A) we have used De Morgan's law⁴; in step (B) we have used the
formula for the probability of a complement⁵; in step (C) we have used the formula
for the probability of a union⁶; in step (D) we have used the fact that E and F are
independent.
4 See p. 7.
5 See p. 72.
6 See p. 73.
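The chain of steps (A)-(D) can be reproduced with exact fractions, and the independence of the complements gives a one-line cross-check. A minimal sketch:

```python
from fractions import Fraction as Fr

pA, pB = Fr(2, 3), Fr(4, 5)

p_union = pA + pB - pA * pB   # step (D): P(A ∩ B) = P(A) P(B) by independence
p_neither = 1 - p_union       # steps (A)-(B): De Morgan plus the complement rule
print(p_neither)              # 1/15

# cross-check: the complements of independent events are independent too
assert p_neither == (1 - pA) * (1 - pB)
```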
Chapter 15
Random variables
This lecture introduces the concept of random variable. Before reading this lec-
ture, make sure you are familiar with the concepts of sample space, sample point,
event, possible outcome, realized outcome, and probability (see the lecture entitled
Probability - p. 69).
Example 79 Suppose that we flip a coin. The possible outcomes are either tail
(T) or head (H), i.e.,
Ω = {T, H}
The two outcomes are assigned equal probabilities:
P({T}) = P({H}) = 1/2
If tail (T) is the outcome, we win one dollar, if head (H) is the outcome we lose
one dollar. The amount X we win (or lose) is a random variable, defined as
X(ω) = 1 if ω = T
X(ω) = −1 if ω = H
Most of the time, statisticians deal with two special kinds of random variables:
discrete random variables and absolutely continuous random variables.
p_X(x) = P(X = x) if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
2 See the lecture entitled Sequences and Limits (p. 32) for a de…nition of countable set.
Example 81 Let X be a discrete random variable that can take only two values:
1 with probability q and 0 with probability 1 − q, where 0 ≤ q ≤ 1. Its support is
R_X = {0, 1}
The properties of probability mass functions are discussed in more detail in the
lecture entitled Legitimate probability mass functions (p. 247). We anticipate here
that probability mass functions are characterized by two fundamental properties:
1. non-negativity: p_X(x) ≥ 0 for any x ∈ ℝ;
2. sum over the support equals 1: ∑_{x∈R_X} p_X(x) = 1.
It turns out not only that any probability mass function must satisfy these two
properties, but also that any function satisfying these two properties is a legitimate
probability mass function. You can find a detailed discussion of this fact in the
aforementioned lecture.
f_X(x) = 1 if x ∈ [0, 1]
f_X(x) = 0 otherwise
The probability that the realization of X belongs, for example, to the interval
[1/4, 3/4] is
P(X ∈ [1/4, 3/4]) = ∫_{1/4}^{3/4} f_X(x) dx = ∫_{1/4}^{3/4} dx = [x]_{1/4}^{3/4} = 3/4 − 1/4 = 1/2
The properties of probability density functions are discussed in more detail in
the lecture entitled Legitimate probability density functions (p. 251). We anticipate
here that probability density functions are characterized by two fundamental
properties:
1. non-negativity: f_X(x) ≥ 0 for any x ∈ ℝ;
2. integral over ℝ equals 1: ∫₋∞^∞ f_X(x) dx = 1.
It turns out not only that any probability density function must satisfy these
two properties, but also that any function satisfying these two properties is a
legitimate probability density function. You can find a detailed discussion of this
fact in the aforementioned lecture.
Hence, by taking the derivative with respect to x of both sides of the above equa-
tion, we obtain
dF_X(x)/dx = f_X(x)
{ω ∈ Ω : X(ω) ∈ B} ∈ F
for any B ∈ B(ℝ), and the probability on the right hand side is well-defined because
the set
{ω ∈ Ω : X(ω) ∈ B}
is measurable by the very definition of random variable.
Exercise 1
Let X be a discrete random variable. Let its support RX be
R_X = {0, 1, 2, 3, 4}
and let its probability mass function p_X(x) be
p_X(x) = 1/5 if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
Compute
P(1 ≤ X < 4)
Solution
By using the additivity of probability, we have
P(1 ≤ X < 4) = p_X(1) + p_X(2) + p_X(3) = 1/5 + 1/5 + 1/5 = 3/5
Exercise 2
Let X be a discrete random variable. Let its support R_X be the set of the first 20
natural numbers:
R_X = {1, 2, …, 19, 20}
Let its probability mass function p_X(x) be
p_X(x) = x/210 if x ∈ R_X
p_X(x) = 0 if x ∉ R_X
Solution
By the additivity of probability, we have
Exercise 3
Let X be a discrete random variable. Let its support R_X be
R_X = {0, 1, 2, 3}
and let its probability mass function be the binomial pmf
p_X(x) = C(3, x) (1/4)^x (3/4)^(3−x) if x ∈ R_X, and p_X(x) = 0 otherwise.
Compute P(X < 3).
Solution
First note that, by additivity, we have
P(X < 3) = P({X = 0} ∪ {X = 1} ∪ {X = 2})
= P({X = 0}) + P({X = 1}) + P({X = 2})
= p_X(0) + p_X(1) + p_X(2)
Therefore, in order to compute P(X < 3), we need to evaluate the probability
mass function at the three points x = 0, x = 1 and x = 2:
p_X(0) = C(3, 0) (1/4)^0 (3/4)^3 = (3!/(0!3!)) × (27/64) = 27/64
p_X(1) = C(3, 1) (1/4)^1 (3/4)^2 = (3!/(1!2!)) × (1/4) × (9/16) = 3 × 9/64 = 27/64
p_X(2) = C(3, 2) (1/4)^2 (3/4)^1 = (3!/(2!1!)) × (1/16) × (3/4) = 9/64
Finally,
P(X < 3) = p_X(0) + p_X(1) + p_X(2) = 27/64 + 27/64 + 9/64 = 63/64
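The binomial probabilities above are easy to reproduce exactly. The sketch below uses `math.comb` for the binomial coefficient and rational arithmetic throughout:

```python
from fractions import Fraction as Fr
from math import comb

def binom_pmf(x, n, p):
    """p_X(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p = Fr(1, 4)
values = [binom_pmf(x, 3, p) for x in (0, 1, 2)]
print(values)        # [Fraction(27, 64), Fraction(27, 64), Fraction(9, 64)]
print(sum(values))   # 63/64
```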
Exercise 4
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, 1]
Let its probability density function f_X(x) be
f_X(x) = 1 if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
Compute
P(X ≥ 1/2)
4 See p. 22.
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(X ≥ 1/2) = P(X ∈ [1/2, 1]) = ∫_{1/2}^{1} f_X(x) dx = ∫_{1/2}^{1} dx = [x]_{1/2}^{1} = 1 − 1/2 = 1/2
Exercise 5
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, 1]
Let its probability density function f_X(x) be
f_X(x) = 2x if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
Compute
P(1/4 ≤ X ≤ 1/2)
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(1/4 ≤ X ≤ 1/2) = P(X ∈ [1/4, 1/2]) = ∫_{1/4}^{1/2} f_X(x) dx
= ∫_{1/4}^{1/2} 2x dx = [x²]_{1/4}^{1/2} = 1/4 − 1/16 = 3/16
Exercise 6
Let X be an absolutely continuous random variable. Let its support RX be
R_X = [0, ∞)
Let its probability density function f_X(x) be
f_X(x) = λ exp(−λx) if x ∈ R_X
f_X(x) = 0 if x ∉ R_X
where λ > 0.
Compute
P(X ≥ 1)
Solution
The probability that an absolutely continuous random variable takes a value in a
given interval is equal to the integral of the probability density function over that
interval:
P(X ≥ 1) = P(X ∈ [1, ∞)) = ∫₁^∞ f_X(x) dx = ∫₁^∞ λ exp(−λx) dx
= [−exp(−λx)]₁^∞ = 0 − (−exp(−λ)) = exp(−λ)
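The closed-form answer exp(−λ) can be checked against a numerical integration of the density. The sketch below uses a midpoint Riemann sum with a large truncation point (λ = 2 is an arbitrary choice of ours):

```python
import math

def tail_numeric(lam, a=1.0, upper=50.0, n=200_000):
    """Midpoint-rule approximation of the integral of lam*exp(-lam*x) over [a, upper]."""
    h = (upper - a) / n
    return sum(lam * math.exp(-lam * (a + (i + 0.5) * h)) * h for i in range(n))

lam = 2.0
print(abs(tail_numeric(lam) - math.exp(-lam)) < 1e-6)   # True: the integral matches exp(-λ)
```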
Chapter 16
Random vectors
Example 86 Two coins are tossed. The possible outcomes of each toss can be
either tail (T) or head (H). The sample space is
Ω = {TT, TH, HT, HH}
and each of the four possible outcomes is assigned the same probability:
P({TT}) = P({TH}) = P({HT}) = P({HH}) = 1/4
If tail (T) is the outcome, we win one dollar, if head (H) is the outcome we lose
one dollar. A 2-dimensional random vector X indicates the amount we win (or
lose) on each toss:
X(ω) = [1 1]ᵀ if ω = TT
X(ω) = [1 −1]ᵀ if ω = TH
X(ω) = [−1 1]ᵀ if ω = HT
X(ω) = [−1 −1]ᵀ if ω = HH
For example, the probability of winning one dollar on both tosses is
P(X = [1 1]ᵀ) = P({ω ∈ Ω : X(ω) = [1 1]ᵀ}) = P({TT}) = 1/4
The probability of losing one dollar on the second toss is
P(X₂ = −1) = P({TH, HH}) = P({TH}) + P({HH}) = 1/4 + 1/4 = 1/2
The following notations are used interchangeably to indicate the joint proba-
bility mass function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
The following notations are used interchangeably to indicate the joint proba-
bility density function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
R_X = [0, 1] × [0, 1]
The probability that the realization of X falls in the rectangle [0, 1/2] × [0, 1/2] is
P(X ∈ [0, 1/2] × [0, 1/2]) = ∫₀^{1/2} ∫₀^{1/2} f_X(x₁, x₂) dx₂ dx₁
= ∫₀^{1/2} ∫₀^{1/2} dx₂ dx₁ = ∫₀^{1/2} [x₂]₀^{1/2} dx₁
= ∫₀^{1/2} (1/2) dx₁ = (1/2) [x₁]₀^{1/2}
= (1/2) × (1/2) = 1/4
F_X(x) = P(X₁ ≤ x₁, …, X_K ≤ x_K), ∀x ∈ ℝᴷ
The following notations are used interchangeably to indicate the joint distrib-
ution function:
In the second and third notation the K components of the random vector X are
explicitly indicated.
Sometimes, we talk about the joint distribution of a random vector, without
specifying whether we are referring to the joint distribution function, or to the joint
probability mass function (in the case of discrete random vectors), or to the joint
probability density function (in the case of absolutely continuous random vectors).
This ambiguity is legitimate, since:
In the remainder of this lecture, we use the term joint distribution when we
are making statements that apply both to the distribution function and to the
probability mass (or density) function of a random vector.
A = [A₁₁ A₁₂; A₂₁ A₂₂]
When vec (A) is a discrete random vector, then we say that A is a discrete
random matrix and the joint probability mass function of A is just the joint prob-
ability mass function of vec (A). By the same token, when vec (A) is an absolutely
continuous random vector, then we say that A is an absolutely continuous random
matrix and the joint probability density function of A is just the joint probability
density function of vec (A).
In other words, the joint probability mass function of X₋ᵢ is obtained by summing
the joint probability mass function of X over all values xᵢ that belong to the support
of Xᵢ, where
X₋ᵢ = [X₁ … Xᵢ₋₁ Xᵢ₊₁ … X_K]
∂ᴷF_X(x) / (∂x₁ … ∂x_K) = f_X(x)
{ω ∈ Ω : X(ω) ∈ B} ∈ F
Exercise 1
Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂.
Let the support of X be the set of all 2×1 vectors such that their entries belong
to the set of the first three natural numbers, i.e.,
R_X = {x = [x₁ x₂]ᵀ : x₁ ∈ N₃ and x₂ ∈ N₃}
where
N₃ = {1, 2, 3}
Let the joint probability mass function of X be
p_X(x₁, x₂) = (1/36) x₁ x₂ if [x₁ x₂]ᵀ ∈ R_X
p_X(x₁, x₂) = 0 if [x₁ x₂]ᵀ ∉ R_X
Compute P(X₁ = 2 and X₂ = 3).
Solution
Trivially, we need to evaluate the joint probability mass function at the point (2, 3),
i.e.,
P(X₁ = 2 and X₂ = 3) = p_X(2, 3) = (1/36) × 2 × 3 = 6/36 = 1/6
Exercise 2
Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂.
Let the support of X be the set of all 2×1 vectors such that their entries belong
to the set of the first three natural numbers, i.e.,
R_X = {x = [x₁ x₂]ᵀ : x₁ ∈ N₃ and x₂ ∈ N₃}
where
N₃ = {1, 2, 3}
Let the joint probability mass function of X be
p_X(x₁, x₂) = (1/36)(x₁ + x₂) if [x₁ x₂]ᵀ ∈ R_X
p_X(x₁, x₂) = 0 if [x₁ x₂]ᵀ ∉ R_X
Compute P(X₁ + X₂ = 3).
Solution
There are only two possible cases that give rise to the occurrence X₁ + X₂ = 3.
These cases are
X = [1 2]ᵀ
and
X = [2 1]ᵀ
Since these two cases are disjoint events, we can use the additivity property of
probability⁷:
P(X₁ + X₂ = 3) = P({X = [1 2]ᵀ} ∪ {X = [2 1]ᵀ})
= P({X = [1 2]ᵀ}) + P({X = [2 1]ᵀ})
= p_X(1, 2) + p_X(2, 1)
= (1/36)(1 + 2) + (1/36)(2 + 1) = 6/36 = 1/6
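Exhaustive enumeration of the (small) support both validates that the joint pmf sums to one and reproduces the answer. A minimal sketch:

```python
from fractions import Fraction as Fr

def p_joint(x1, x2):
    """Joint pmf of this exercise: (x1 + x2)/36 on {1,2,3} x {1,2,3}."""
    return Fr(x1 + x2, 36) if x1 in (1, 2, 3) and x2 in (1, 2, 3) else Fr(0)

support = [(x1, x2) for x1 in (1, 2, 3) for x2 in (1, 2, 3)]
assert sum(p_joint(*pt) for pt in support) == 1   # a legitimate joint pmf

p_sum_3 = sum(p_joint(x1, x2) for (x1, x2) in support if x1 + x2 == 3)
print(p_sum_3)   # 1/6
```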
Exercise 3
Let X be a 2×1 discrete random vector and denote its components by X₁ and
X₂. Let the support of X be
R_X = {[1 1]ᵀ, [2 0]ᵀ, [0 0]ᵀ}
and let each point of the support be assigned probability 1/3. Derive the marginal
probability mass functions of X₁ and X₂.
Solution
The support of X₁ is
R_X₁ = {0, 1, 2}
We need to compute the probability of each element of the support of X₁:
p_X₁(0) = ∑_{(x₁,x₂)∈R_X : x₁=0} p_X(x₁, x₂) = p_X(0, 0) = 1/3
p_X₁(1) = ∑_{(x₁,x₂)∈R_X : x₁=1} p_X(x₁, x₂) = p_X(1, 1) = 1/3
p_X₁(2) = ∑_{(x₁,x₂)∈R_X : x₁=2} p_X(x₁, x₂) = p_X(2, 0) = 1/3
The support of X₂ is
R_X₂ = {0, 1}
and, proceeding in the same way,
p_X₂(0) = p_X(2, 0) + p_X(0, 0) = 1/3 + 1/3 = 2/3
p_X₂(1) = p_X(1, 1) = 1/3
7 See p. 72.
Exercise 4
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = [0, 2] × [0, 3]
i.e. the set of all 2×1 vectors such that the first component belongs to the
interval [0, 2] and the second component belongs to the interval [0, 3]. Let the joint
probability density function of X be
f_X(x) = 1/6 if x ∈ R_X
f_X(x) = 0 otherwise
Compute P(1 ≤ X₁ ≤ 3, −1 ≤ X₂ ≤ 1).
Solution
By the very definition of joint probability density function:
P(1 ≤ X₁ ≤ 3, −1 ≤ X₂ ≤ 1) = ∫₁³ ∫₋₁¹ f_X(x₁, x₂) dx₂ dx₁
Since the density vanishes outside the support, the integral reduces to
∫₁² ∫₀¹ (1/6) dx₂ dx₁ = (1/6) ∫₁² [x₂]₀¹ dx₁
= (1/6) ∫₁² 1 dx₁ = (1/6) [x₁]₁² = 1/6
Exercise 5
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = [0, ∞) × [0, 2]
i.e. the set of all 2×1 vectors such that the first component belongs to the
interval [0, ∞) and the second component belongs to the interval [0, 2]. Let the
joint probability density function of X be
f_X(x) = f_X(x₁, x₂) = exp(−2x₁) if x ∈ R_X
f_X(x) = 0 otherwise
Compute P(X₁ + X₂ ≤ 3).
Solution
First of all note that X₁ + X₂ ≤ 3 if and only if X₂ ≤ 3 − X₁. Using the definition
of joint probability density function, we obtain
P(X₁ + X₂ ≤ 3) = ∫₋∞^∞ ∫₋∞^{3−x₁} f_X(x₁, x₂) dx₂ dx₁
Since the density vanishes outside the support, x₂ ranges over the whole interval
[0, 2] when 0 ≤ x₁ ≤ 1, and only over [0, 3 − x₁] when 1 ≤ x₁ ≤ 3. Therefore,
P(X₁ + X₂ ≤ 3)
= ∫₀¹ ∫₀² exp(−2x₁) dx₂ dx₁ + ∫₁³ ∫₀^{3−x₁} exp(−2x₁) dx₂ dx₁
= ∫₀¹ 2 exp(−2x₁) dx₁ + ∫₁³ (3 − x₁) exp(−2x₁) dx₁
= [−exp(−2x₁)]₀¹ + ∫₁³ 3 exp(−2x₁) dx₁ − ∫₁³ x₁ exp(−2x₁) dx₁
= 1 − exp(−2) + (3/2)(exp(−2) − exp(−6)) − [−(x₁/2) exp(−2x₁) − (1/4) exp(−2x₁)]₁³
= 1 − exp(−2) + (3/2) exp(−2) − (3/2) exp(−6) − ((3/4) exp(−2) − (7/4) exp(−6))
= 1 − (1/4) exp(−2) + (1/4) exp(−6)
Exercise 6
Let X be a 2×1 absolutely continuous random vector and denote its components
by X₁ and X₂. Let the support of X be
R_X = ℝ²₊
i.e., the set of all 2-dimensional vectors with positive entries. Let its joint proba-
bility density function be
f_X(x) = f_X(x₁, x₂) = exp(−x₁ − x₂) if x₁ ≥ 0 and x₂ ≥ 0
f_X(x) = 0 otherwise
Derive the marginal probability density functions of X₁ and X₂. The support of
X₁ is
R_X₁ = ℝ₊
We can find the marginal density of X₁ by integrating the joint density with respect
to x₂:
f_X₁(x) = ∫₋∞^∞ f_X(x, x₂) dx₂
When x < 0, then f_X(x, x₂) = 0 and the above integral is trivially equal to 0.
Thus, when x < 0, then f_X₁(x) = 0.
When x > 0, then
f_X₁(x) = ∫₋∞^∞ f_X(x, x₂) dx₂ = ∫₋∞⁰ f_X(x, x₂) dx₂ + ∫₀^∞ f_X(x, x₂) dx₂
but the first of the two integrals is zero since f_X(x, x₂) = 0 when x₂ < 0; as a
consequence,
f_X₁(x) = ∫₀^∞ f_X(x, x₂) dx₂ = ∫₀^∞ exp(−x − x₂) dx₂
= exp(−x) ∫₀^∞ exp(−x₂) dx₂ = exp(−x) [−exp(−x₂)]₀^∞
= exp(−x) (0 − (−1)) = exp(−x)
Summing up:
f_X₁(x) = exp(−x) if x ≥ 0, and f_X₁(x) = 0 otherwise
By symmetry,
f_X₂(x) = exp(−x) if x ≥ 0, and f_X₂(x) = 0 otherwise
Chapter 17
Expected value
The concept of expected value of a random variable¹ is one of the most important
concepts in probability theory. It was first devised in the 17th century to analyze
gambling games and answer questions such as: how much do I gain - or lose - on
average, if I repeatedly play a given gambling game? how much can I expect to
gain - or lose - by performing a certain bet? If the possible outcomes of the game
(or the bet) and their associated probabilities are described by a random variable,
then these questions can be answered by computing its expected value, which is
equal to a weighted average of the outcomes, in which each outcome is weighted by
its probability. For example, if you play a game where you gain 2$ with probability
1/2 and you lose 1$ with probability 1/2, then the expected value of the game is
half a dollar:
2$ × (1/2) + (−1$) × (1/2) = 1/2$
What does this mean? Roughly speaking, it means that if you play this game
many times and the number of times each of the two possible outcomes occurs is
proportional to its probability, then, on average you gain 1/2$ each time you play
the game. For instance, if you play the game 100 times, win 50 times and lose the
remaining 50, then your average winning is equal to the expected value:
(2$ × 50 − 1$ × 50) / 100 = 1/2$
provided that
∑_{x∈R_X} |x| p_X(x) < ∞
The symbol ∑_{x∈R_X} indicates summation over all the elements of the support
R_X. So, for example, if
R_X = {1, 2, 3}
then:
∑_{x∈R_X} x p_X(x) = 1 × p_X(1) + 2 × p_X(2) + 3 × p_X(3)
is well-defined also when the support R_X contains infinitely many elements. When
summing infinitely many terms, the order in which you sum them can change the
result of the sum. However, if the terms are absolutely summable, then the order
in which you sum becomes irrelevant. In the above definition of expected value,
the order of the sum
∑_{x∈R_X} x p_X(x)
therefore does not matter, precisely because absolute summability is required.
R_X = {0, 1}
provided that
∫₋∞^∞ |x| f_X(x) dx < ∞
This integral can be thought of as the limiting case of the sum (17.1) found in
the discrete case. Here p_X(x) is replaced by f_X(x) dx (the infinitesimal probability
of x) and the integral sign ∫₋∞^∞ replaces the summation sign ∑_{x∈R_X}.
The requirement that
∫₋∞^∞ |x| f_X(x) dx < ∞
is called absolute integrability. The improper integral defining E[X] is a limit of
integrals over bounded intervals, and it is well-defined only if both limits are finite.
Absolute integrability guarantees that the latter condition is met and that the
expected value is well-defined.
When the absolute integrability condition is not satisfied, we say that the ex-
pected value of X is not well-defined or that it does not exist.
R_X = [0, ∞)
f_X(x) = exp(−x) if x ∈ [0, ∞)
f_X(x) = 0 otherwise
where the integral is a Riemann-Stieltjes integral and the expected value exists and
is well-defined only as long as the integral is well-defined.
Also this integral is the limiting case of formula (17.1) for the expected value
of a discrete random variable. Here dF_X(x) replaces p_X(x) (the probability of x)
and the integral sign ∫₋∞^∞ replaces the summation sign ∑_{x∈R_X}.
2 See p. 108.
The following section contains a brief and informal introduction to the Riemann-
Stieltjes integral and an explanation of the above formula. Less technically oriented
readers can safely skip it: when they encounter a Riemann-Stieltjes integral, they
can just think of it as a formal notation which allows a unified treatment of discrete
and absolutely continuous random variables and can be treated as a sum in one
case and as an ordinary Riemann integral in the other.
17.4.1 Intuition
As we have already seen above, the expected value of a discrete random variable
is straightforward to compute: the expected value of a discrete variable X is the
weighted average of the values that X can take on (the elements of the support
RX ), where each possible value x is weighted by its respective probability pX (x):
E[X] = ∑_{x∈R_X} x p_X(x)
When X is not discrete the above summation does not make any sense. However,
there is a workaround that allows us to extend the formula to random variables
that are not discrete. The workaround entails approximating X with discrete
variables that can take on only finitely many values.
Let x₀, x₁, …, xₙ be n + 1 real numbers (n ∈ ℕ) such that x₀ < x₁ < … < xₙ,
and define a discrete variable Xₙ that takes the value xᵢ whenever X belongs to
the interval (xᵢ₋₁, xᵢ].
As the number n of points increases and the points become closer and closer (the
maximum distance between two successive points tends to zero), Xn becomes a
very good approximation of X, until, in the limit, it is indistinguishable from X.
The expected value of Xₙ is easy to compute:
E[Xₙ] = ∑ᵢ₌₁ⁿ xᵢ P(Xₙ = xᵢ)
= ∑ᵢ₌₁ⁿ xᵢ P(X ∈ (xᵢ₋₁, xᵢ])
= ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
The expected value of X is then defined as the limit of E[Xₙ] when n tends to
infinity (i.e. when the approximation becomes better and better):
E[X] = lim_{n→∞} E[Xₙ] = lim_{n→∞} ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
When the latter limit exists and is well-defined, it is called the Riemann-Stieltjes
integral of x with respect to F_X(x) and it is indicated as follows:
∫₋∞^∞ x dF_X(x) = lim_{n→∞} ∑ᵢ₌₁ⁿ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)]
Roughly speaking, the integral notation ∫₋∞^∞ can be thought of as a shorthand
for lim_{n→∞} ∑ᵢ₌₁ⁿ and the differential notation dF_X(x) can be thought of as a
shorthand for [F_X(xᵢ) − F_X(xᵢ₋₁)].
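The limiting construction above can be watched converging numerically. The sketch below approximates E[X] for a distribution with known mean (the unit-rate exponential, our own choice of example) by the finite sum ∑ᵢ xᵢ [F_X(xᵢ) − F_X(xᵢ₋₁)] on increasingly fine grids:

```python
import math

def stieltjes_mean(cdf, lo, hi, n):
    """Approximate E[X] by sum_i x_i * (F(x_i) - F(x_{i-1})) on an n-cell grid."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return sum(xs[i] * (cdf(xs[i]) - cdf(xs[i - 1])) for i in range(1, n + 1))

cdf = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0   # exponential(1): E[X] = 1

for n in (10, 100, 10_000):
    print(n, stieltjes_mean(cdf, 0.0, 40.0, n))   # approaches 1 as the grid refines
```

Because each cell reports its right endpoint, the approximation overestimates the mean and decreases toward it as the maximum cell width shrinks, exactly as the limit definition suggests.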
∫_{c₁}^{c₂} g(x) f_X(x) dx + g(c₂) [F_X(c₂) − lim_{x→c₂, x<c₂} F_X(x)]
+ …
+ ∫_{cₙ₋₁}^{cₙ} g(x) f_X(x) dx + g(cₙ) [F_X(cₙ) − lim_{x→cₙ, x<cₙ} F_X(x)]
+ ∫_{cₙ}^{b} g(x) f_X(x) dx
$$R_X = [0, 1]$$
provided $\int X \, dP$ (the Lebesgue integral of X with respect to P) exists and is well-defined.
3 See p. 69.
It is possible (albeit non-trivial) to prove that the above two formulae also hold
when X is a K-dimensional random vector, $g : \mathbb{R}^K \to \mathbb{R}$ is a real function of K
variables, and $Y = g(X)$. When X is a discrete random vector and $p_X(x)$ is its
joint probability mass function, then
$$E[Y] = \sum_{x \in R_X} g(x) \, p_X(x)$$
When X is an absolutely continuous random vector and $f_X(x)$ is its joint proba-
bility density function, then
$$E[Y] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_K) f_X(x_1, \ldots, x_K) \, dx_1 \cdots dx_K$$
Consider the linear transformation
$$Y = a + bX$$
Then
$$E[Y] = a + bE[X]$$
When X is discrete:
$$\begin{aligned}
E[Y] &= \sum_{x \in R_X} (a + bx)\, p_X(x) \\
&= \sum_{x \in R_X} a\, p_X(x) + \sum_{x \in R_X} bx\, p_X(x) \\
&= a \sum_{x \in R_X} p_X(x) + b \sum_{x \in R_X} x\, p_X(x) \\
&= a + b \sum_{x \in R_X} x\, p_X(x) = a + bE[X]
\end{aligned}$$
When X is absolutely continuous:
$$\begin{aligned}
E[Y] &= \int_{-\infty}^{\infty} (a + bx) f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} a f_X(x)\, dx + \int_{-\infty}^{\infty} bx f_X(x)\, dx \\
&= a \int_{-\infty}^{\infty} f_X(x)\, dx + b \int_{-\infty}^{\infty} x f_X(x)\, dx \\
&= a + b \int_{-\infty}^{\infty} x f_X(x)\, dx = a + bE[X]
\end{aligned}$$
A stronger linearity property holds, which involves two or more random vari-
ables. The property can be proved only by using the Lebesgue integral⁴. The property
is as follows: let X₁ and X₂ be two random variables and let $c_1 \in \mathbb{R}$ and $c_2 \in \mathbb{R}$
be two constants; then
$$E[c_1 X_1 + c_2 X_2] = c_1 E[X_1] + c_2 E[X_2]$$
17.6.5 Integrability
Denote the absolute value of a random variable X by $|X|$. If $E[|X|]$ exists and
is finite, we say that X is an integrable random variable, or just that X is
integrable.
17.6.6 Lp spaces
Let $1 \le p < \infty$. The space of all random variables X such that $E[|X|^p]$ exists
and is finite is denoted by $L^p$ or $L^p(\Omega, \mathcal{F}, P)$, where the triple $(\Omega, \mathcal{F}, P)$ makes the
dependence on the underlying probability space⁵ explicit. If X belongs to $L^p$, we
write $X \in L^p(\Omega, \mathcal{F}, P)$. Hence, if X is integrable, we write $X \in L^1(\Omega, \mathcal{F}, P)$.
17.7 Solved exercises

Exercise 1
Let X be a discrete random variable. Let its support be
$$R_X = \{0, 1, 2, 3, 4\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} 1/5 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
5 See p. 76.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$\begin{aligned}
E[X] &= \sum_{x \in R_X} x\, p_X(x) = 0 \cdot p_X(0) + 1 \cdot p_X(1) + 2 \cdot p_X(2) + 3 \cdot p_X(3) + 4 \cdot p_X(4) \\
&= 0 \cdot \tfrac{1}{5} + 1 \cdot \tfrac{1}{5} + 2 \cdot \tfrac{1}{5} + 3 \cdot \tfrac{1}{5} + 4 \cdot \tfrac{1}{5} = \frac{1+2+3+4}{5} = \frac{10}{5} = 2
\end{aligned}$$
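A minimal numerical check of this computation (an illustrative sketch; the helper name `expected_value` is a hypothetical choice, not from the text):

```python
def expected_value(pmf):
    """E[X] = sum over the support of x * p_X(x), for a pmf given as a dict."""
    return sum(x * p for x, p in pmf.items())

# uniform pmf on {0, 1, 2, 3, 4}, as in the exercise
pmf = {x: 1 / 5 for x in range(5)}
ev = expected_value(pmf)   # (0 + 1 + 2 + 3 + 4) / 5 = 2
```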
Exercise 2
Let X be a discrete random variable. Let its support be
$$R_X = \{1, 2, 3\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} x/6 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{2}{6} + 3 \cdot \tfrac{3}{6} = \frac{1 + 4 + 9}{6} = \frac{14}{6} = \frac{7}{3}$$
Exercise 3
Let X be a discrete random variable. Let its support be
$$R_X = \{2, 4\}$$
Let its probability mass function $p_X(x)$ be
$$p_X(x) = \begin{cases} x^2/20 & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$
Compute the expected value of X.
Solution
Since X is discrete, its expected value is computed as a sum over the support of X:
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 2 \cdot \tfrac{4}{20} + 4 \cdot \tfrac{16}{20} = \frac{8}{20} + \frac{64}{20} = \frac{72}{20} = \frac{18}{5}$$
Exercise 4
Let X be an absolutely continuous random variable with uniform distribution on
the interval [1, 3].
Its support is
$$R_X = [1, 3]$$
Its probability density function is
$$f_X(x) = \begin{cases} 1/2 & \text{if } x \in [1, 3] \\ 0 & \text{otherwise} \end{cases}$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_{-\infty}^{1} x f_X(x)\, dx + \int_{1}^{3} x f_X(x)\, dx + \int_{3}^{\infty} x f_X(x)\, dx \\
&= \int_{-\infty}^{1} x \cdot 0\, dx + \int_{1}^{3} x \cdot \tfrac{1}{2}\, dx + \int_{3}^{\infty} x \cdot 0\, dx \\
&= 0 + \left[\tfrac{1}{4} x^2\right]_1^3 + 0 = \frac{9}{4} - \frac{1}{4} = \frac{8}{4} = 2
\end{aligned}$$
Note that the trick is to: 1) subdivide the interval of integration to isolate
the sub-intervals where the density is zero; 2) split up the integral among the
sub-intervals thus identified.
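The same integral can be approximated by a midpoint Riemann sum, confirming the value 2; the helper name `riemann_expectation` and the grid size are illustrative choices:

```python
def riemann_expectation(f, a, b, n):
    """Midpoint Riemann sum of the integral of x * f(x) over [a, b]."""
    h = (b - a) / n
    return sum((a + (i + 0.5) * h) * f(a + (i + 0.5) * h) * h for i in range(n))

# uniform density on [1, 3]; outside [1, 3] the density is zero, so it is
# enough to integrate over [1, 3], exactly as in the exercise
fX = lambda x: 0.5 if 1 <= x <= 3 else 0.0
ev = riemann_expectation(fX, 1.0, 3.0, 10_000)
```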
Exercise 5
Let X be an absolutely continuous random variable. Its support is
$$R_X = \mathbb{R}$$
Its probability density function is
$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\, dx = (2\pi)^{-1/2} \int_{-\infty}^{\infty} x \exp\left(-\tfrac{1}{2}x^2\right) dx \\
&= (2\pi)^{-1/2} \int_{-\infty}^{0} x \exp\left(-\tfrac{1}{2}x^2\right) dx + (2\pi)^{-1/2} \int_{0}^{\infty} x \exp\left(-\tfrac{1}{2}x^2\right) dx \\
&= (2\pi)^{-1/2} \left[-\exp\left(-\tfrac{1}{2}x^2\right)\right]_{-\infty}^{0} + (2\pi)^{-1/2} \left[-\exp\left(-\tfrac{1}{2}x^2\right)\right]_{0}^{\infty} \\
&= (2\pi)^{-1/2} [-1 + 0] + (2\pi)^{-1/2} [0 + 1] = -(2\pi)^{-1/2} + (2\pi)^{-1/2} = 0
\end{aligned}$$
Exercise 6
Let X be an absolutely continuous random variable. Its support is
$$R_X = [0, 1]$$
Its probability density function is
$$f_X(x) = 3x^2$$
Compute the expected value of X.
Solution
Since X is absolutely continuous, its expected value can be computed as an integral:
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_{0}^{1} x \cdot 3x^2\, dx = 3 \int_{0}^{1} x^3\, dx = 3 \left[\tfrac{1}{4} x^4\right]_0^1 = 3\left(\tfrac{1}{4} - 0\right) = \frac{3}{4}$$
Exercise 7
Let $F_X(x)$ be defined as follows:
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - \exp(-\lambda x) & \text{if } x \ge 0 \end{cases}$$
where $\lambda > 0$.
Compute the following integral:
$$\int_{1}^{2} x\, dF_X(x)$$
Solution
$F_X(x)$ is continuously differentiable on the interval [1, 2]. Its derivative $f_X(x)$ is
$$f_X(x) = \lambda \exp(-\lambda x)$$
The integral can therefore be computed by integration by parts:
$$\begin{aligned}
\int_{1}^{2} x\, dF_X(x) &= \int_{1}^{2} x\, \lambda \exp(-\lambda x)\, dx = \left[-x \exp(-\lambda x)\right]_1^2 + \int_{1}^{2} \exp(-\lambda x)\, dx \\
&= \frac{1}{\lambda}\left\{\lambda\left(-2\exp(-2\lambda) + \exp(-\lambda)\right) + \left[-\exp(-\lambda t)\right]_{t=1}^{t=2}\right\} \\
&= \frac{1}{\lambda}\left\{-2\lambda \exp(-2\lambda) + \lambda \exp(-\lambda) - \exp(-2\lambda) + \exp(-\lambda)\right\} \\
&= \frac{1}{\lambda}\left\{(-2\lambda - 1)\exp(-2\lambda) + (\lambda + 1)\exp(-\lambda)\right\}
\end{aligned}$$
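Assuming a concrete value of the rate parameter (λ = 1, an arbitrary choice for illustration only), the closed-form result can be checked against a direct midpoint Riemann sum of $\int_1^2 x\,\lambda e^{-\lambda x}\,dx$:

```python
import math

lam = 1.0  # illustrative value of the rate parameter (lambda in the text)

# closed form: (1/lam) * [(-2*lam - 1) e^{-2 lam} + (lam + 1) e^{-lam}]
closed_form = ((-2 * lam - 1) * math.exp(-2 * lam)
               + (lam + 1) * math.exp(-lam)) / lam

# midpoint Riemann sum of the integral of x * lam * exp(-lam * x) over [1, 2]
n = 10_000
h = 1.0 / n
approx = sum((1 + (i + 0.5) * h) * lam * math.exp(-lam * (1 + (i + 0.5) * h)) * h
             for i in range(n))
```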
Chapter 18

Expected value and the Lebesgue integral

18.1 Intuition
Let us recall the informal definition of expected value we have given in the lecture
entitled Expected value (p. 127):
Definition 102 The expected value of a random variable X is the weighted av-
erage of the values that X can take on, where each possible value is weighted by its
respective probability.
When X is discrete and can take on only finitely many values, it is straightfor-
ward to compute the expected value of X by just applying the above definition.
Denote by $x_1, \ldots, x_n$ the n values that X can take on (the n elements of its sup-
port). Let Ω be the sample space on which X is defined. Also define the following
events:
$$E_1 = \{\omega \in \Omega : X(\omega) = x_1\}, \quad \ldots, \quad E_n = \{\omega \in \Omega : X(\omega) = x_n\}$$
Then
$$E[X] = \sum_{i=1}^{n} x_i P(E_i)$$
i.e. the expected value of X is the weighted average of the values that X can take
on, where each possible value $x_i$ is weighted by its respective probability $P(E_i)$.
Note that this is a way of expressing the expected value that uses neither $F_X(x)$,
the distribution function¹ of X, nor its probability mass function² $p_X(x)$. Instead,
the above way of expressing the expected value uses only the probabilities P(E)
defined on the events $E \subseteq \Omega$. In many applications, it turns out that this is a very
convenient way of expressing (and calculating) the expected value: for example,
when the distribution function $F_X(x)$ is not directly known and is difficult to
derive, it is sometimes easier to directly compute the probabilities P(E) defined
on the events $E \subseteq \Omega$. Below, this will be illustrated with an example.
When X is discrete but can take on infinitely many values, in a similar fashion
we can write
$$E[X] = \sum_{i=1}^{\infty} x_i P(E_i) \tag{18.1}$$
In this case, however, there is a possibility that E[X] is not well-defined: this
happens when the infinite series above does not converge, i.e. when the limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} x_i P(E_i)$$
does not exist. In the next section we will show how to take care of this possibility.
In the case in which X is not discrete (its support has the power of the con-
tinuum), things are much more complicated. In this case, the summation in (18.1)
does not make any sense, because the support of X cannot be arranged into a se-
quence³ and so there is no sequence over which we can sum. Thus, we have to find
a workaround. The workaround is similar to the one we have discussed in the pre-
sentation of the Stieltjes integral⁴: first, we build a simpler random variable Y that
is a good approximation of X and whose expected value can easily be computed;
then, we make the approximation better and better; finally, we define the expected
value of X to be equal to the expected value of Y when the approximation tends
to become perfect.
How does the approximation work, intuitively? We illustrate it in three steps:
1. in the first step, we partition the sample space Ω into n events $E_1, \ldots, E_n$,
such that $E_i \cap E_j = \emptyset$ for $i \ne j$ and
$$\Omega = \bigcup_{i=1}^{n} E_i$$
2. in the second step we find, for each event $E_i$, the smallest value that X can
take on when the event $E_i$ happens:
$$y_i = \inf \{X(\omega) : \omega \in E_i\}$$
In this way, we have built a random variable Y such that $Y(\omega) \le X(\omega)$ for
any ω. The finer the partition $E_1, \ldots, E_n$ is, the better the approximation is:
intuitively, when the sets $E_i$ become smaller, then $y_i$ becomes on average closer to
the values that X can take on when $E_i$ happens.
The expected value of Y is, of course, easy to compute:
$$E[Y] = \sum_{i=1}^{n} y_i P(E_i)$$
and the integral is called the Lebesgue integral of X with respect to the probability
measure P. The notation dP (or dω) indicates that the sets $E_i$ become very small by
improving the approximation (making the partition $E_1, \ldots, E_n$ finer); the integral
notation $\int$ can be thought of as a shorthand for $\lim_{Y \to X} \sum_{i=1}^{n}$; X appears in place
of Y in the integral, because the two tend to coincide when the approximation
becomes better and better.
The next example shows an important application of the linearity of the Lebesgue
integral. The example also shows how the Lebesgue integral can, in certain sit-
uations, be much simpler to use than the Stieltjes integral when computing the
expected value of a random variable.
Example 104 Let X₁ and X₂ be two random variables defined on a sample space
Ω. We want to define (and compute) the expected value of the sum X₁ + X₂. Define
a new random variable
$$Z = X_1 + X_2$$
Using the Stieltjes integral⁵, the expected value is defined as follows:
$$E[Z] = \int_{-\infty}^{\infty} z\, dF_Z(z)$$
5 See p. 130.
where $F_Z(z)$ is the distribution function of Z. Hence, to compute the above integral,
we first need to know the distribution function of Z (which might be extremely
difficult to derive). Using the Lebesgue integral, the expected value is defined as
follows:
$$E[Z] = \int_{\Omega} Z\, dP$$
By the linearity of the Lebesgue integral, this equals
$$\int_{\Omega} X_1\, dP + \int_{\Omega} X_2\, dP = E[X_1] + E[X_2]$$
Thus, to compute the expected value of Z, we do not need to know the distribution
function of Z, but only the expected values of X₁ and X₂.
This is just an example of how linearity of the Lebesgue integral translates into
linearity of the expected value.

18.3 A more rigorous definition

We now define the Lebesgue integral more rigorously, starting from simple random
variables. A simple random variable Y can be written as
$$Y(\omega) = \begin{cases} y_1 & \text{when } \omega \in E_1 \\ \vdots & \\ y_n & \text{when } \omega \in E_n \end{cases}$$
with $y_i \ge 0$ for all i.
Note that a simple random variable is also a discrete random variable. Hence,
the expected value of a simple random variable is easy to compute, because it is
just the weighted sum of the elements of its support.
The Lebesgue integral of a simple random variable Y is defined to be equal to
its expected value:
$$\int_{\Omega} Y\, dP = \sum_{i=1}^{n} y_i P(E_i)$$
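The defining formula can be sketched in a few lines of code; the partition probabilities and values below are arbitrary illustrative choices:

```python
def lebesgue_integral_simple(values, probs):
    """Lebesgue integral of a simple random variable:
    integral of Y dP = sum_i y_i * P(E_i) over a partition E_1, ..., E_n."""
    # the events must partition the sample space, so the probabilities sum to 1
    assert abs(sum(probs) - 1.0) < 1e-12
    return sum(y * p for y, p in zip(values, probs))

# Y takes the value 0 with prob. 1/2, 1 with prob. 1/4, 4 with prob. 1/4
integral = lebesgue_integral_simple([0.0, 1.0, 4.0], [0.5, 0.25, 0.25])
# 0*0.5 + 1*0.25 + 4*0.25 = 1.25
```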
$$X = X^+ - X^-$$
Finally, the Lebesgue integral of X is defined as the difference between the integrals
of its positive and negative parts:
$$\int_{\Omega} X\, dP = \int_{\Omega} X^+\, dP - \int_{\Omega} X^-\, dP$$
provided the difference makes sense; in case both $\int X^+\, dP$ and $\int X^-\, dP$ are equal to
infinity, the difference is not well-defined and we say that X is not integrable.
Chapter 19

Properties of the expected value

This lecture discusses some fundamental properties of the expected value operator.
Although most of these properties can be understood and proved using the material
presented in previous lectures, some properties are gathered here for convenience,
but can be proved and understood only after reading the material presented in
subsequent lectures.

19.1 Linearity of the expected value

Multiplication by a constant: if a is a constant, then
$$E[aX] = aE[X]$$
This property has already been discussed in the lecture entitled Expected value (p.
134).
For example, let X be a random variable with $E[X] = 3$ and let $Y = 2X$. Then:
$$E[Y] = E[2X] = 2E[X] = 2 \cdot 3 = 6$$
Sums: the expected value of a sum of random variables is the sum of their
expected values. This property too has already been discussed in the lecture entitled
Expected value (p. 134).
Example 108 Let X and Y be two random variables with expected values
$$E[X] = 2, \qquad E[Y] = 5$$
and let $Z = X + Y$. Then
$$E[Z] = E[X + Y] = E[X] + E[Y] = 2 + 5 = 7$$
Linear combinations: for constants $a_1, \ldots, a_K$ and random variables $X_1, \ldots, X_K$,
$$E[a_1 X_1 + \ldots + a_K X_K] = a_1 E[X_1] + \ldots + a_K E[X_K]$$
This can be trivially obtained by combining the two properties above (scalar multi-
plication and sum). Considering $a_1, a_2, \ldots, a_K$ as the K entries of a $1 \times K$ vector
a and $X_1, X_2, \ldots, X_K$ as the K entries of a $K \times 1$ random vector X, the above
property can be written as
$$E[aX] = aE[X]$$
which is a multivariate generalization of the scalar multiplication property above.
Example 109 Let X and Y be two random variables with expected values
$$E[X] = 1, \qquad E[Y] = 4$$
and let $Z = X + 3Y$. Then
$$E[Z] = E[X + 3Y] = E[X] + 3E[Y] = 1 + 3 \cdot 4 = 1 + 12 = 13$$
Addition of a constant matrix: if A is a matrix of constants and Σ is a random
matrix, then
$$E[A + \Sigma] = A + E[\Sigma]$$
This is easily proved by applying the linearity properties above to each entry of
the random matrix $A + \Sigma$.
1 See p. 119.
Example 110 Let X be a $2 \times 1$ random vector such that its two entries X₁ and
X₂ have expected values
$$E[X_1] = 0, \qquad E[X_2] = 2$$
Let A be the constant vector
$$A = \begin{bmatrix} 1 \\ 7 \end{bmatrix}$$
and let $Y = A + X$. Then
$$E[Y] = E[A + X] = A + E[X] = \begin{bmatrix} 1 \\ 7 \end{bmatrix} + \begin{bmatrix} E[X_1] \\ E[X_2] \end{bmatrix} = \begin{bmatrix} 1 \\ 7 \end{bmatrix} + \begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ 9 \end{bmatrix}$$
Multiplication by constant matrices: if B and C are matrices of constants and Σ
is a random matrix, then
$$E[B\Sigma] = B\,E[\Sigma], \qquad E[\Sigma C] = E[\Sigma]\,C$$
and, combining the two,
$$E[B \Sigma C] = E[B(\Sigma C)] = B\,E[\Sigma C] = B\,E[\Sigma]\,C$$
As an example, let X be a $1 \times 2$ random vector with
$$E[X_1] = E[X_2] = 3$$
where X₁ and X₂ are the two components of X. Let A be the following $2 \times 2$ matrix
of constants:
$$A = \begin{bmatrix} 2 & 0 \\ 3 & 1 \end{bmatrix}$$
Let $Y = XA$. Then
$$E[Y] = E[XA] = E[X]\,A = \begin{bmatrix} 3 & 3 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 3 \cdot 2 + 3 \cdot 3 & 3 \cdot 0 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 15 & 3 \end{bmatrix}$$
Expectation of a non-negative random variable: if
$$X(\omega) \ge 0, \quad \forall \omega \in \Omega$$
then
$$E[X] \ge 0$$
Proof. Intuitively, this is obvious: the expected value of X is a weighted average
of the values that X can take on; X can take on only non-negative values; therefore,
its expected value must also be non-negative. Formally, the expected value is the
Lebesgue integral³ of X; X can be approximated to any degree of accuracy by non-
negative simple random variables whose Lebesgue integral is obviously non-negative;
therefore, the Lebesgue integral of X must also be non-negative.
Preservation of almost-sure inequalities: if $X \le Y$ almost surely, i.e. if the event
$E = \{\omega \in \Omega : Y(\omega) < X(\omega)\}$ has zero probability, then
$$E[X] \le E[Y]$$
Proof. Denote by $1_E$ the indicator of the event E. Then
$$E[Y - X] = E[(Y - X) \cdot 1] = E[(Y - X) 1_E] + E[(Y - X) 1_{E^c}] = E[(Y - X) 1_{E^c}] \tag{19.1}$$
because
$$E[(Y - X) 1_E] = 0$$
by the properties of indicators of zero-probability events. If $\omega \in E^c$, then
$$(Y - X) \ge 0 \quad \text{and} \quad (Y - X) 1_{E^c} \ge 0$$
On the contrary, if $\omega \in E$, then
$$1_{E^c} = 0 \quad \text{and} \quad (Y - X) 1_{E^c} = 0$$
Therefore
$$(Y - X) 1_{E^c} \ge 0, \quad \forall \omega \in \Omega$$
which means that $(Y - X) 1_{E^c}$ is a non-negative random variable. Thus
$$E[(Y - X) 1_{E^c}] \ge 0 \tag{19.2}$$
and hence, by (19.1),
$$E[Y - X] \ge 0$$
By linearity,
$$E[Y - X] = E[Y] - E[X] \ge 0$$
Therefore
$$E[X] \le E[Y]$$
19.3 Solved exercises

Exercise 1
Let X and Y be two random variables, having expected values
$$E[X] = \sqrt{2}, \qquad E[Y] = 1$$
Solution
Exercise 2
Let X be a $2 \times 1$ random vector such that its two entries X₁ and X₂ have expected
values
$$E[X_1] = 2, \qquad E[X_2] = 3$$
Let A be the constant matrix
$$A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$$
and let $Y = AX$. Compute E[Y].
Solution
The linearity property of the expected value also applies to the multiplication of a
constant matrix and a random vector:
$$E[Y] = E[AX] = A\,E[X] = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \cdot 2 + 2 \cdot 3 \\ 0 \cdot 2 + 1 \cdot 3 \end{bmatrix} = \begin{bmatrix} 8 \\ 3 \end{bmatrix}$$
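A quick numerical sketch of the matrix computation above, using plain lists instead of a linear-algebra library:

```python
def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2], [0, 1]]
EX = [2, 3]            # E[X1], E[X2]
EY = mat_vec(A, EX)    # [1*2 + 2*3, 0*2 + 1*3] = [8, 3]
```

Because expectation is linear, multiplying the vector of expected values by A gives the expected value of the transformed vector, exactly as in the exercise.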
Exercise 3
Let Σ be a $2 \times 2$ matrix with random entries, such that all its entries have expected
value equal to 1. Let A be the following $1 \times 2$ constant vector:
$$A = \begin{bmatrix} 2 & 3 \end{bmatrix}$$
and let $Y = A\Sigma$. Compute E[Y].
Solution
The linearity property of the expected value also applies to the multiplication of a
constant vector and a matrix with random entries:
$$E[Y] = E[A\Sigma] = A\,E[\Sigma] = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 \cdot 1 + 3 \cdot 1 & 2 \cdot 1 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 5 & 5 \end{bmatrix}$$
Chapter 20
Variance
This lecture introduces the concept of variance. Before reading this lecture, make
sure you are familiar with the concept of random variable (see the lecture entitled
Random variables - p. 105) and with the concept of expected value (see the lecture
entitled Expected value - p. 127).
$$Y = X - E[X]$$
3. take the square of Y, so that positive and negative deviations from the mean
having the same magnitude yield the same measure of distance from E[X];
4. finally, compute the expected value of the squared deviation Y² to know how
much on average X deviates from E[X]:
$$\mathrm{Var}[X] = E\left[Y^2\right] = E\left[(X - E[X])^2\right]$$
Proof. The variance formula is derived as follows. First, expand the square:
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E\left[X^2 + E[X]^2 - 2E[X]X\right]$$
Then, by linearity of the expected value (E[X] is a constant),
$$\mathrm{Var}[X] = E[X^2] + E[X]^2 - 2E[X]E[X] = E[X^2] - E[X]^2$$
The above variance formula also makes clear that variance exists and is well-
defined only as long as E[X] and E[X²] exist and are well-defined.
20.5 Example
The following example shows how to compute the variance of a discrete random
variable² using both the definition and the variance formula above.
Let X be a random variable with support
$$R_X = \{0, 1\}$$
and probability mass function $p_X(1) = q$, $p_X(0) = 1 - q$, where $0 \le q \le 1$. Its
expected value is $E[X] = 1 \cdot q + 0 \cdot (1 - q) = q$, and the expected value of its
square is
$$E[X^2] = 1^2 \cdot p_X(1) + 0^2 \cdot p_X(0) = 1 \cdot q + 0 \cdot (1 - q) = q$$
Its variance is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = q - q^2 = q(1 - q)$$
Alternatively, we can compute the variance of X using the definition. Define a new
random variable, the squared deviation of X from E[X], as
$$Z = (X - E[X])^2$$
The support of Z is
$$R_Z = \left\{(1 - q)^2, \; q^2\right\}$$
and its probability mass function is
$$p_Z(z) = \begin{cases} q & \text{if } z = (1 - q)^2 \\ 1 - q & \text{if } z = q^2 \\ 0 & \text{otherwise} \end{cases}$$
The exercises at the end of this lecture provide more examples of how variance
can be computed.
In deriving the properties $\mathrm{Var}[a + X] = \mathrm{Var}[X]$ and $\mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$, one
uses the fact that, by linearity of the expected value,
$$E[a + X] = a + E[X] \qquad \text{and} \qquad E[bX] = bE[X]$$
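The variance formula from the example above can be checked numerically; the value of q below is an arbitrary illustrative choice:

```python
def variance(pmf):
    """Var[X] = E[X^2] - E[X]^2, for a pmf given as a dict value -> probability."""
    ev = sum(x * p for x, p in pmf.items())
    ev2 = sum(x * x * p for x, p in pmf.items())
    return ev2 - ev ** 2

q = 0.3                                # illustrative success probability
var = variance({1: q, 0: 1 - q})       # should equal q * (1 - q) = 0.21
```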
20.7 Solved exercises

Exercise 1
Let X be a discrete random variable with support
$$R_X = \{0, 1, 2, 3\}$$
and probability mass function
$$p_X(x) = \begin{cases} 1/4 & \text{if } x \in R_X \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{4} + 2 \cdot \tfrac{1}{4} + 3 \cdot \tfrac{1}{4} = \frac{6}{4} = \frac{3}{2}$$
The expected value of X² is
$$E[X^2] = \sum_{x \in R_X} x^2 p_X(x) = 0 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{1}{4} + 9 \cdot \tfrac{1}{4} = \frac{14}{4} = \frac{7}{2}$$
Therefore
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{7}{2} - \frac{9}{4} = \frac{5}{4}$$
Exercise 2
Let X be a discrete random variable with support
$$R_X = \{1, 2, 3, 4\}$$
and probability mass function
$$p_X(x) = \begin{cases} x^2/30 & \text{if } x \in R_X \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \sum_{x \in R_X} x\, p_X(x) = 1 \cdot \tfrac{1}{30} + 2 \cdot \tfrac{4}{30} + 3 \cdot \tfrac{9}{30} + 4 \cdot \tfrac{16}{30} = \frac{1 + 8 + 27 + 64}{30} = \frac{100}{30} = \frac{10}{3}$$
The expected value of X² is
$$E[X^2] = \sum_{x \in R_X} x^2 p_X(x) = 1 \cdot \tfrac{1}{30} + 4 \cdot \tfrac{4}{30} + 9 \cdot \tfrac{9}{30} + 16 \cdot \tfrac{16}{30} = \frac{354}{30} = \frac{59}{5}$$
Therefore
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{59}{5} - \frac{100}{9} = \frac{531 - 500}{45} = \frac{31}{45}$$
Exercise 3
Read and try to understand how the variance of a Poisson random variable is
derived in the lecture entitled Poisson distribution (p. 349).
Exercise 4
Let X be an absolutely continuous random variable⁴ with support
$$R_X = [0, 1]$$
and probability density function
$$f_X(x) = \begin{cases} 1 & \text{if } x \in [0, 1] \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
4 See p. 107.
Solution
The expected value of X is
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^1 x\, dx = \left[\tfrac{1}{2} x^2\right]_0^1 = \frac{1}{2}$$
The expected value of X² is
$$E[X^2] = \int_0^1 x^2\, dx = \left[\tfrac{1}{3} x^3\right]_0^1 = \frac{1}{3}$$
The variance of X is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{4 - 3}{12} = \frac{1}{12}$$
Exercise 5
Let X be an absolutely continuous random variable with support
$$R_X = [0, 1]$$
and probability density function
$$f_X(x) = \begin{cases} 3x^2 & \text{if } x \in [0, 1] \\ 0 & \text{otherwise} \end{cases}$$
Compute the variance of X.
Solution
The expected value of X is
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^1 x \cdot 3x^2\, dx = \int_0^1 3x^3\, dx = \left[\tfrac{3}{4} x^4\right]_0^1 = \frac{3}{4}$$
The expected value of X² is
$$E[X^2] = \int_0^1 x^2 \cdot 3x^2\, dx = \left[\tfrac{3}{5} x^5\right]_0^1 = \frac{3}{5}$$
The variance of X is
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{3}{5} - \left(\frac{3}{4}\right)^2 = \frac{3}{5} - \frac{9}{16} = \frac{48 - 45}{80} = \frac{3}{80}$$
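The value 3/80 can be checked by approximating both moments with midpoint Riemann sums (an illustrative sketch; the helper name `moment` and the grid size are hypothetical choices):

```python
def moment(f, k, a, b, n=50_000):
    """Midpoint Riemann sum of the integral of x^k * f(x) over [a, b]."""
    h = (b - a) / n
    return sum(((a + (i + 0.5) * h) ** k) * f(a + (i + 0.5) * h) * h
               for i in range(n))

fX = lambda x: 3 * x * x          # density of the exercise on [0, 1]
m1 = moment(fX, 1, 0.0, 1.0)      # E[X]   ~ 3/4
m2 = moment(fX, 2, 0.0, 1.0)      # E[X^2] ~ 3/5
var = m2 - m1 ** 2                # ~ 3/80 = 0.0375
```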
Exercise 6
Read and try to understand how the variance of a Chi-square random variable is
derived in the lecture entitled Chi-square distribution (p. 387).
Chapter 21
Covariance
This lecture introduces the concept of covariance. Before reading this lecture, make
sure you are familiar with the concepts of random variable (p. 105), expected value
(p. 127) and variance (p. 155).
Define the deviations of X and Y from their respective means:
$$\bar{X} = X - E[X], \qquad \bar{Y} = Y - E[Y]$$
When the product $\bar{X}\bar{Y}$ is positive, the two deviations have the same sign; when
it is negative, they have opposite signs. The covariance between X and Y is defined
as the expected value of this product:
$$\mathrm{Cov}[X, Y] = E\left[\bar{X}\bar{Y}\right]$$
Cov[X, Y] > 0 implies that X tends to be high when Y is high and low when
Y is low; Cov[X, Y] < 0 implies that X tends to be high when Y is low and vice
versa. When Cov[X, Y] = 0, X and Y do not display either of these two tendencies.
Proposition 118 The covariance between two random variables X and Y can be
expressed as
$$\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y]$$
21.4 Example
The following example shows how to compute the covariance between two discrete
random variables.
1 See p. 134.
Example 119 Let X be a $2 \times 1$ discrete random vector² and denote its components
by X₁ and X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 1/3 & \text{if } x = (1, 1) \\ 1/3 & \text{if } x = (2, 0) \\ 1/3 & \text{if } x = (0, 0) \\ 0 & \text{otherwise} \end{cases}$$
The support of X₁ is
$$R_{X_1} = \{0, 1, 2\}$$
and its marginal probability mass function³ is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x \in \{0, 1, 2\} \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₁ is
$$E[X_1] = \sum_{x \in R_{X_1}} x\, p_{X_1}(x) = \tfrac{1}{3} \cdot 0 + \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 2 = 1$$
The support of X₂ is
$$R_{X_2} = \{0, 1\}$$
and its marginal probability mass function is
$$p_{X_2}(x) = \sum_{\{(x_1, x_2) \in R_X : x_2 = x\}} p_X(x_1, x_2) = \begin{cases} 2/3 & \text{if } x = 0 \\ 1/3 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₂ is
$$E[X_2] = \sum_{x \in R_{X_2}} x\, p_{X_2}(x) = \tfrac{2}{3} \cdot 0 + \tfrac{1}{3} \cdot 1 = \frac{1}{3}$$
Proposition 120 Let Cov[X, X] be the covariance of a random variable with it-
self. Then
$$\mathrm{Cov}[X, X] = \mathrm{Var}[X]$$
Proof. By the definition of covariance,
$$\mathrm{Cov}[X, X] = E[(X - E[X])(X - E[X])] = E\left[(X - E[X])^2\right] = \mathrm{Var}[X]$$
where in the last step we have used the very definition of variance.
21.5.2 Symmetry
The covariance operator is symmetric.
Proposition 121 Let Cov[X, Y] be the covariance between two random variables
X and Y. Then
$$\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$$
21.5.3 Bilinearity
The covariance operator is linear in both of its arguments: if $a_1$ and $a_2$ are two
constants and X₁, X₂ and Y are three random variables, then
$$\mathrm{Cov}[a_1 X_1 + a_2 X_2, Y] = a_1 \mathrm{Cov}[X_1, Y] + a_2 \mathrm{Cov}[X_2, Y]$$
and
$$\mathrm{Cov}[Y, a_1 X_1 + a_2 X_2] = a_1 \mathrm{Cov}[Y, X_1] + a_2 \mathrm{Cov}[Y, X_2]$$
Proof. That the first argument is linear is proved by using the linearity of the
expected value:
$$\begin{aligned}
\mathrm{Cov}[a_1 X_1 + a_2 X_2, Y] &= E[(a_1 X_1 + a_2 X_2 - E[a_1 X_1 + a_2 X_2])(Y - E[Y])] \\
&= E[(a_1 X_1 - E[a_1 X_1])(Y - E[Y]) + (a_2 X_2 - E[a_2 X_2])(Y - E[Y])] \\
&= a_1 E[(X_1 - E[X_1])(Y - E[Y])] + a_2 E[(X_2 - E[X_2])(Y - E[Y])] \\
&= a_1 \mathrm{Cov}[X_1, Y] + a_2 \mathrm{Cov}[X_2, Y]
\end{aligned}$$
Proposition 123 Let X₁ and X₂ be two random variables. If the variance of their
sum exists and is well-defined, then
$$\begin{aligned}
\mathrm{Var}[X_1 + X_2] &= E\left[(X_1 + X_2 - E[X_1 + X_2])^2\right] \\
&= E\left[((X_1 - E[X_1]) + (X_2 - E[X_2]))^2\right] \\
&= E\left[(X_1 - E[X_1])^2 + (X_2 - E[X_2])^2 + 2(X_1 - E[X_1])(X_2 - E[X_2])\right] \\
&= E\left[(X_1 - E[X_1])^2\right] + E\left[(X_2 - E[X_2])^2\right] + 2E[(X_1 - E[X_1])(X_2 - E[X_2])] \\
&= \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + 2\,\mathrm{Cov}[X_1, X_2]
\end{aligned}$$
Thus, to compute the variance of the sum of two random variables we need to
know their covariance.
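The identity $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + 2\,\mathrm{Cov}[X_1, X_2]$ can be verified by direct enumeration over a small joint pmf; the joint distribution below is an arbitrary illustrative choice, not taken from the text:

```python
# joint pmf of (X1, X2) on a small support: (value pair) -> probability
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def E(g):
    """Expected value of g(X1, X2) under the joint pmf."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

ex1, ex2 = E(lambda a, b: a), E(lambda a, b: b)
var1 = E(lambda a, b: (a - ex1) ** 2)
var2 = E(lambda a, b: (b - ex2) ** 2)
cov = E(lambda a, b: (a - ex1) * (b - ex2))
var_sum = E(lambda a, b: (a + b - ex1 - ex2) ** 2)
# var_sum equals var1 + var2 + 2 * cov, as the proposition states
```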
More generally, for n random variables the following formula holds:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] \tag{21.1}$$
Proof. The formula is proved by using the bilinearity of the covariance operator
(see 21.5.3):
$$\begin{aligned}
\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] &= \mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i\right] = \mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j\right] \\
&= \sum_{i=1}^{n} \mathrm{Cov}\left[X_i, \sum_{j=1}^{n} X_j\right] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[X_i, X_j] \\
&= \sum_{i=1}^{n} \mathrm{Cov}[X_i, X_i] + \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{Cov}[X_i, X_j] \\
&= \sum_{i=1}^{n} \mathrm{Var}[X_i] + \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j] + \sum_{j=2}^{n} \sum_{i=1}^{j-1} \mathrm{Cov}[X_j, X_i] \\
&= \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{Cov}[X_i, X_j]
\end{aligned}$$
where the last step uses the symmetry of covariance, $\mathrm{Cov}[X_j, X_i] = \mathrm{Cov}[X_i, X_j]$.
Formula (21.1) implies that when all the random variables in the sum have
zero covariance with each other, then the variance of the sum is just the sum of
the variances:
$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i]$$
This is true, for example, when the random variables in the sum are mutually
independent⁵, because independence implies zero covariance⁶.
5 See p. 233.
6 See p. 234.
21.6 Solved exercises

Exercise 1
Let X be a $2 \times 1$ discrete random vector and denote its components by X₁ and
X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 2/3 & \text{if } x = (1, 3) \\ 1/3 & \text{if } x = (2, 1) \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X₁ and X₂.
Solution
The support of X₁ is
$$R_{X_1} = \{1, 2\}$$
and its marginal probability mass function⁷ is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 2/3 & \text{if } x = 1 \\ 1/3 & \text{if } x = 2 \\ 0 & \text{otherwise} \end{cases}$$
The support of X₂ is
$$R_{X_2} = \{1, 3\}$$
and its marginal probability mass function is
$$p_{X_2}(x) = \sum_{\{(x_1, x_2) \in R_X : x_2 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x = 1 \\ 2/3 & \text{if } x = 3 \\ 0 & \text{otherwise} \end{cases}$$
7 See p. 120.
Exercise 2
Let X be a $2 \times 1$ discrete random vector and denote its entries by X₁ and X₂. Let
the support of X be
$$R_X = \left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 4 \\ 4 \end{bmatrix} \right\}$$
and its joint probability mass function be
$$p_X(x) = \begin{cases} 1/3 & \text{if } x = (1, 2) \\ 1/3 & \text{if } x = (2, 1) \\ 1/3 & \text{if } x = (4, 4) \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X₁ and X₂.
Solution
The support of X₁ is
$$R_{X_1} = \{1, 2, 4\}$$
and its marginal probability mass function is
$$p_{X_1}(x) = \sum_{\{(x_1, x_2) \in R_X : x_1 = x\}} p_X(x_1, x_2) = \begin{cases} 1/3 & \text{if } x \in \{1, 2, 4\} \\ 0 & \text{otherwise} \end{cases}$$
The mean of X₁ is
$$E[X_1] = \sum_{x \in R_{X_1}} x\, p_{X_1}(x) = \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 2 + \tfrac{1}{3} \cdot 4 = \frac{7}{3}$$
The support of X₂ is
$$R_{X_2} = \{1, 2, 4\}$$
and, by the same computation, $E[X_2] = 7/3$.
The expected value of the product X₁X₂ can be derived thanks to the transforma-
tion theorem⁸:
$$E[X_1 X_2] = \sum_{x \in R_X} x_1 x_2\, p_X(x_1, x_2) = (1 \cdot 2)\, p_X(1, 2) + (2 \cdot 1)\, p_X(2, 1) + (4 \cdot 4)\, p_X(4, 4) = 2 \cdot \tfrac{1}{3} + 2 \cdot \tfrac{1}{3} + 16 \cdot \tfrac{1}{3} = \frac{20}{3}$$
By putting the pieces together, we obtain the covariance between X₁ and X₂:
$$\mathrm{Cov}[X_1, X_2] = E[X_1 X_2] - E[X_1] E[X_2] = \frac{20}{3} - \frac{7}{3} \cdot \frac{7}{3} = \frac{60 - 49}{9} = \frac{11}{9}$$
8 See p. 134.
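The covariance just computed can be checked by enumerating the joint pmf directly (a short sketch using the distribution of this exercise):

```python
# joint pmf of (X1, X2) from the exercise: each point has probability 1/3
joint = {(1, 2): 1 / 3, (2, 1): 1 / 3, (4, 4): 1 / 3}

def E(g):
    """Expected value of g(X1, X2) under the joint pmf."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

cov = E(lambda a, b: a * b) - E(lambda a, b: a) * E(lambda a, b: b)
# E[X1 X2] = 20/3, E[X1] = E[X2] = 7/3, so cov = 20/3 - 49/9 = 11/9
```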
Exercise 3
Let X and Y be two random variables such that
$$\mathrm{Var}[X] = 2, \qquad \mathrm{Cov}[X, Y] = 1$$
Compute the covariance
$$\mathrm{Cov}[5X, 2X + 3Y]$$
Solution
By exploiting the bilinearity of the covariance operator, we obtain
$$\mathrm{Cov}[5X, 2X + 3Y] = 5\,\mathrm{Cov}[X, 2X + 3Y] = 10\,\mathrm{Cov}[X, X] + 15\,\mathrm{Cov}[X, Y] = 10\,\mathrm{Var}[X] + 15\,\mathrm{Cov}[X, Y] = 10 \cdot 2 + 15 \cdot 1 = 35$$
Exercise 4
Let [X Y] be an absolutely continuous random vector⁹ with support
$$R_{XY} = \{(x, y) : 0 \le x \le y \le 2\}$$
In other words, the support $R_{XY}$ is the set of all couples (x, y) such that $0 \le y \le 2$
and $0 \le x \le y$. Let the joint probability density function of [X Y] be
$$f_{XY}(x, y) = \begin{cases} \frac{3}{8} y & \text{if } (x, y) \in R_{XY} \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X and Y.
9 See p. 117.
Solution
The support of X is
$$R_X = [0, 2]$$
Thus, when $x \notin [0, 2]$, the marginal probability density function¹⁰ of X is 0, while,
when $x \in [0, 2]$, the marginal probability density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy = \int_x^2 \frac{3}{8} y\, dy = \left[\frac{3}{16} y^2\right]_x^2 = \frac{3}{4} - \frac{3}{16} x^2$$
Using this density one finds $E[X] = 3/4$; a similar computation for the marginal
density of Y gives $E[Y] = 3/2$.
The expected value of the product XY can be computed by using the transforma-
tion theorem:
$$\begin{aligned}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\, dy\, dx = \int_0^2 \left(\int_0^y xy \cdot \frac{3}{8} y\, dx\right) dy \\
&= \int_0^2 \frac{3}{8} y^2 \left(\int_0^y x\, dx\right) dy = \int_0^2 \frac{3}{8} y^2 \left[\frac{1}{2} x^2\right]_0^y dy \\
&= \int_0^2 \frac{3}{8} y^2 \cdot \frac{1}{2} y^2\, dy = \int_0^2 \frac{3}{16} y^4\, dy = \left[\frac{3}{80} y^5\right]_0^2 = \frac{3}{16} \cdot \frac{32}{5} = \frac{6}{5}
\end{aligned}$$
Hence, by using the covariance formula, the covariance between X and Y can be
computed as
$$\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y] = \frac{6}{5} - \frac{3}{4} \cdot \frac{3}{2} = \frac{6}{5} - \frac{9}{8} = \frac{48 - 45}{40} = \frac{3}{40}$$
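The value 3/40 can be checked with a double midpoint Riemann sum over the triangular support (the grid sizes are illustrative choices):

```python
def E(g, n=500):
    """Midpoint double Riemann sum of E[g(X, Y)] under the density
    f(x, y) = 3y/8 on the triangle 0 <= x <= y <= 2."""
    hy = 2.0 / n
    total = 0.0
    for j in range(n):
        y = (j + 0.5) * hy
        hx = y / n                 # inner grid adapts to the triangular support
        for i in range(n):
            x = (i + 0.5) * hx
            total += g(x, y) * (3.0 * y / 8.0) * hx * hy
    return total

cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
# approximately 6/5 - (3/4)*(3/2) = 3/40
```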
Exercise 5
Let [X Y] be an absolutely continuous random vector with support
$$R_{XY} = [0, \infty) \times [1, 4]$$
and let its joint probability density function be
$$f_{XY}(x, y) = \begin{cases} \frac{1}{3} y \exp(-xy) & \text{if } x \in [0, \infty) \text{ and } y \in [1, 4] \\ 0 & \text{otherwise} \end{cases}$$
Compute the covariance between X and Y.
Solution
The support of Y is
$$R_Y = [1, 4]$$
When $y \notin [1, 4]$, the marginal probability density function of Y is 0, while, when
$y \in [1, 4]$, the marginal probability density function of Y is
$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \int_0^{\infty} \frac{1}{3} y \exp(-xy)\, dx = \frac{1}{3} \left[-\exp(-xy)\right]_0^{\infty} = \frac{1}{3} [0 - (-1)] = \frac{1}{3}$$
Thus, the marginal probability density function of Y is
$$f_Y(y) = \begin{cases} 1/3 & \text{if } y \in [1, 4] \\ 0 & \text{otherwise} \end{cases}$$
The expected value of Y is
$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\, dy = \int_1^4 \frac{1}{3} y\, dy = \left[\frac{1}{6} y^2\right]_1^4 = \frac{1}{6} \cdot 16 - \frac{1}{6} = \frac{15}{6} = \frac{5}{2}$$
The support of X is
$$R_X = [0, \infty)$$
When $x \notin [0, \infty)$, the marginal probability density function of X is 0, while, when
$x \in [0, \infty)$, the marginal probability density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy = \int_1^4 \frac{1}{3} y \exp(-xy)\, dy$$
We do not explicitly compute the integral, but we write the marginal probability
density function of X as follows:
$$f_X(x) = \begin{cases} \int_1^4 \frac{1}{3} y \exp(-xy)\, dy & \text{if } x \in [0, \infty) \\ 0 & \text{otherwise} \end{cases}$$
The expected value of the product XY is
$$\begin{aligned}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\, dy\, dx = \int_0^{\infty} \left(\int_1^4 xy \cdot \frac{1}{3} y \exp(-xy)\, dy\right) dx \\
&\overset{A}{=} \frac{1}{3} \int_1^4 y \left(\int_0^{\infty} xy \exp(-xy)\, dx\right) dy \\
&\overset{B}{=} \frac{1}{3} \int_1^4 y \cdot \frac{1}{y} \left(\int_0^{\infty} t \exp(-t)\, dt\right) dy \\
&\overset{C}{=} \frac{1}{3} \int_1^4 \left(\left[-t \exp(-t)\right]_0^{\infty} + \int_0^{\infty} \exp(-t)\, dt\right) dy \\
&= \frac{1}{3} \int_1^4 \left(0 + \left[-\exp(-t)\right]_0^{\infty}\right) dy = \frac{1}{3} \int_1^4 1\, dy = \frac{1}{3} \cdot 3 = 1
\end{aligned}$$
where in step A we have exchanged the order of integration, in step B we have
performed the change of variable $t = xy$, and in step C we have integrated by parts.
Exercise 6
Let X and Y be two random variables such that
$$\mathrm{Var}[X] = 4, \qquad \mathrm{Cov}[X, Y] = 2$$
Compute the covariance
$$\mathrm{Cov}[3X, X + 3Y]$$
Solution
The bilinearity of the covariance operator implies that
$$\mathrm{Cov}[3X, X + 3Y] = 3\,\mathrm{Cov}[X, X] + 9\,\mathrm{Cov}[X, Y] = 3\,\mathrm{Var}[X] + 9\,\mathrm{Cov}[X, Y] = 3 \cdot 4 + 9 \cdot 2 = 30$$
Chapter 22

Linear correlation

This lecture introduces the linear correlation coefficient. Before reading this lec-
ture, make sure you are familiar with the concept of covariance (p. 163).

The linear correlation coefficient between two random variables X and Y is
$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\mathrm{stdev}[X]\, \mathrm{stdev}[Y]}$$
where Cov[X, Y] is the covariance between X and Y and stdev[X] and stdev[Y]
are the standard deviations¹ of X and Y.
Of course, the linear correlation coefficient is well-defined only as long as the
three quantities Cov[X, Y], stdev[X] and stdev[Y] exist and are well-defined.
Moreover, while the ratio is well-defined only if stdev[X] and stdev[Y] are
strictly greater than zero, it is often assumed that Corr[X, Y] = 0 when one of
the two standard deviations is zero. This is equivalent to assuming that 0/0 = 0,
because Cov[X, Y] = 0 when one of the two standard deviations is zero.
22.2 Interpretation
Linear correlation is a measure of dependence, or association, between two random
variables. Its interpretation is similar to the interpretation of covariance².
The correlation between X and Y provides a measure of the degree to which
X and Y tend to "move together": Corr[X, Y] > 0 indicates that deviations of X
and Y from their respective means tend to have the same sign; Corr[X, Y] < 0
indicates that deviations of X and Y from their respective means tend to have
opposite signs; when Corr[X, Y] = 0, X and Y do not display either of these two
tendencies.
1 See p. 157.
2 See the lecture entitled Covariance (p. 163) for a detailed explanation.
22.3 Terminology
The following terminology is often used:
1. If Corr[X, Y] > 0 then X and Y are said to be positively linearly corre-
lated (or simply positively correlated).
2. If Corr[X, Y] < 0 then X and Y are said to be negatively linearly corre-
lated (or simply negatively correlated).
3. If Corr[X, Y] ≠ 0 then X and Y are said to be linearly correlated (or
simply correlated).
4. If Corr[X, Y] = 0 then X and Y are said to be uncorrelated. Also note
that Cov[X, Y] = 0 ⟹ Corr[X, Y] = 0; therefore, two random variables X
and Y are uncorrelated whenever Cov[X, Y] = 0.
22.4 Example
The following example shows how to compute the coefficient of linear correlation
between two discrete random variables.
Example 125 Let X be a $2 \times 1$ random vector and denote its components by X₁
and X₂. Let the support of X be
$$R_X = \left\{ \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\}$$
with each of the three points having probability 1/3. The marginal probability mass
function of X₁ assigns probability 2/3 to $x = -1$ and 1/3 to $x = 1$, so that
$$E[X_1] = -\frac{2}{3} + \frac{1}{3} = -\frac{1}{3}, \qquad E[X_1^2] = 1$$
The variance of X₁ is
$$\mathrm{Var}[X_1] = E[X_1^2] - E[X_1]^2 = 1 - \left(\frac{1}{3}\right)^2 = \frac{8}{9}$$
The standard deviation of X₁ is
$$\mathrm{stdev}[X_1] = \sqrt{\mathrm{Var}[X_1]} = \sqrt{\frac{8}{9}}$$
The support of X₂ is
$$R_{X_2} = \{-1, 1\}$$
and its probability mass function is
$$p_{X_2}(x) = \begin{cases} 1/3 & \text{if } x = -1 \\ 2/3 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
The expected value of X₂ is
$$E[X_2] = \sum_{x \in R_{X_2}} x\, p_{X_2}(x) = \frac{1}{3} \cdot (-1) + \frac{2}{3} \cdot 1 = \frac{1}{3}$$
The variance of X₂ is
$$\mathrm{Var}[X_2] = E[X_2^2] - E[X_2]^2 = 1 - \left(\frac{1}{3}\right)^2 = \frac{8}{9}$$
The standard deviation of X₂ is
$$\mathrm{stdev}[X_2] = \sqrt{\mathrm{Var}[X_2]} = \sqrt{\frac{8}{9}}$$
By using the transformation theorem⁴, we can compute the expected value of the
product X₁X₂:
$$E[X_1 X_2] = \sum_{x \in R_X} x_1 x_2\, p_X(x_1, x_2) = (-1 \cdot 1) \cdot \frac{1}{3} + ((-1) \cdot (-1)) \cdot \frac{1}{3} + (1 \cdot 1) \cdot \frac{1}{3} = \frac{1}{3}$$
Hence, the covariance between X₁ and X₂ is
$$\mathrm{Cov}[X_1, X_2] = E[X_1 X_2] - E[X_1] E[X_2] = \frac{1}{3} - \left(-\frac{1}{3}\right) \cdot \frac{1}{3} = \frac{1}{3} + \frac{1}{9} = \frac{4}{9}$$
4 See p. 134.
Proposition 126 If the correlation coefficient of a random variable with itself
exists and is well-defined, then
$$\mathrm{Corr}[X, X] = 1$$
22.5.2 Symmetry
The linear correlation coefficient is symmetric.
Proposition 127 If the correlation coefficient between two random variables exists
and is well-defined, then it satisfies
$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\mathrm{stdev}[X]\, \mathrm{stdev}[Y]} = \frac{\mathrm{Cov}[Y, X]}{\mathrm{stdev}[Y]\, \mathrm{stdev}[X]} = \mathrm{Corr}[Y, X]$$
where we have used the fact that covariance is symmetric⁶:
$$\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$$
Exercise 1
Let X be a 2 1 discrete random vector and denote its components by X1 and
X2 . Let the support of X be
1 2
RX = ;
5 1
and its joint probability mass function be
8
>
>
< 4=5 if x = 1 5
>
pX (x) = 1=5 if x = 2 1
>
: 0 otherwise
Compute the coe¢ cient of linear correlation between X1 and X2 .
Solution
The support of X1 is
RX1 = f1; 2g
and its marginal probability mass function7 is
8
X < 4=5 if x = 1
pX1 (x) = pX (x1 ; x2 ) = 1=5 if x = 2
:
f(x1 ;x2 )2RX :x1 =xg 0 otherwise
The expected value of X1 is
X 4 1 6
E [X1 ] = xpX1 (x) = 1 +2 =
5 5 5
x2RX1
6 See p. 166.
7 See p. 120.
182 CHAPTER 22. LINEAR CORRELATION
The variance of X1 is
2
2 8 6 40 36 4
Var [X1 ] = E X12 E [X1 ] = = =
5 5 25 25
The standard deviation of X1 is
r
4 2
stdev [X1 ] = =
25 5
The support of X2 is
RX2 = f1; 5g
and its marginal probability mass function is
8
X < 1=5 if x = 1
pX2 (x) = pX (x1 ; x2 ) = 4=5 if x = 5
:
f(x1 ;x2 )2RX :x2 =xg 0 otherwise
The expected value of X2 is
X 1 4 21
E [X2 ] = xpX2 (x) = 1 +5 =
5 5 5
x2RX2
The variance of X2 is
2
2 101 21 505 441 64
Var [X2 ] = E X22 E [X2 ] = = =
5 5 25 25
The standard deviation of X1 is
r
64 8
stdev [X2 ] = =
25 5
By using the transformation theorem, we can compute the expected value of X1 X2 :
X
E [X1 X2 ] = x1 x2 pX (x1 ; x2 ) = (1 5) pX (1; 5) + (2 1) pX (2; 1)
x2RX
4 1 22
= 5 +2 =
5 5 5
Hence, the covariance between X1 and X2 is
Cov [X1; X2] = E [X1 X2] − E [X1] E [X2] = 22/5 − (6/5) · (21/5)
             = 110/25 − 126/25 = −16/25
and the coefficient of linear correlation between X1 and X2 is
Corr [X1; X2] = Cov [X1; X2] / (stdev [X1] stdev [X2]) = (−16/25) / ((2/5) · (8/5)) = −1
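The whole computation can be checked numerically. The following is a minimal Python sketch (not part of the original text) that redoes the exercise with exact rational arithmetic:

```python
from fractions import Fraction as F
import math

# Joint pmf of the vector [X1 X2]' from Exercise 1:
# P(X = [1 5]') = 4/5, P(X = [2 1]') = 1/5.
pmf = {(1, 5): F(4, 5), (2, 1): F(1, 5)}

E1  = sum(p * x1 for (x1, x2), p in pmf.items())           # E[X1]    = 6/5
E2  = sum(p * x2 for (x1, x2), p in pmf.items())           # E[X2]    = 21/5
E12 = sum(p * x1 * x2 for (x1, x2), p in pmf.items())      # E[X1 X2] = 22/5

var1 = sum(p * x1**2 for (x1, x2), p in pmf.items()) - E1**2   # 4/25
var2 = sum(p * x2**2 for (x1, x2), p in pmf.items()) - E2**2   # 64/25
cov  = E12 - E1 * E2                                           # -16/25

corr = float(cov) / math.sqrt(float(var1 * var2))
assert cov == F(-16, 25)
assert abs(corr + 1.0) < 1e-9   # correlation is -1
```

A correlation of −1 is consistent with the support: the two points lie on a single straight line with negative slope.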
22.6. SOLVED EXERCISES 183
Exercise 2
Let X be a 2×1 discrete random vector and denote its entries by X1 and X2. Let
the support of X be
RX = { [1 2]ᵀ, [2 1]ᵀ, [3 3]ᵀ }
and its joint probability mass function be
pX (x) = 1/3 if x = [1 2]ᵀ
         1/3 if x = [2 1]ᵀ
         1/3 if x = [3 3]ᵀ
         0   otherwise
Solution
The support of X1 is
RX1 = {1, 2, 3}
and its marginal probability mass function is
pX1 (x) = Σ_{(x1,x2)∈RX : x1=x} pX (x1, x2) = 1/3 if x = 1
                                              1/3 if x = 2
                                              1/3 if x = 3
                                              0   otherwise
The mean of X1 is
E [X1] = Σ_{x∈RX1} x pX1 (x) = 1 · (1/3) + 2 · (1/3) + 3 · (1/3) = 6/3 = 2
The variance of X1 is
Var [X1] = E [X1²] − (E [X1])² = 14/3 − 2² = (14 − 12)/3 = 2/3
The standard deviation of X1 is
stdev [X1] = sqrt(2/3)
The support of X2 is
RX2 = {1, 2, 3}
Exercise 3
Let [X Y] be an absolutely continuous random vector with support
RXY = [0, ∞) × [1, 2]
and let its joint probability density function⁸ be
fXY (x, y) = 2y exp(−2xy) if x ∈ [0, ∞) and y ∈ [1, 2]
             0             otherwise
Compute the covariance between X and Y.
8 See p. 117.
Solution
The support of Y is
RY = [1, 2]
When y ∉ RY, the marginal probability density function⁹ of Y is 0, while, when y ∈
RY, the marginal probability density function of Y can be obtained by integrating
x out of the joint probability density as follows:
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} 2y exp(−2xy) dx
       = [−exp(−2xy)]_0^{∞} = −[0 − 1] = 1
Thus,
fY (y) = 1 if y ∈ [1, 2]
         0 otherwise
The variance of Y is
Var [Y] = E [Y²] − (E [Y])² = 7/3 − (3/2)² = 28/12 − 27/12 = 1/12
We do not explicitly compute the integral, but we write the marginal probability
density function of X as follows:
fX (x) = ∫_1^2 2y exp(−2xy) dy if x ∈ [0, ∞)
         0                      otherwise
9 See p. 120.
The variance of X is
Var [X] = E [X²] − (E [X])² = 1/4 − ((1/2) ln(2))² = (1/4) [1 − (ln(2))²]
and the coefficient of linear correlation between X and Y is
Corr [X; Y] = (2 − 3 ln(2)) / ( sqrt(1 − (ln(2))²) · sqrt(1/3) )
Chapter 23
Covariance matrix
This lecture introduces the covariance matrix of a random vector, which is a
multivariate generalization of the concept of variance of a random variable. Before
reading this lecture, make sure you are familiar with the concepts of variance (p.
155) and covariance (p. 163).
23.1 Definition
Let X be a K×1 random vector. The covariance matrix of X, or variance-
covariance matrix of X, denoted by Var [X], is defined as follows:
Var [X] = E [(X − E [X]) (X − E [X])ᵀ]
Var [X] = E [(X − E [X]) (X − E [X])ᵀ]
      (A) = E [X Xᵀ − 2 X E [X]ᵀ + E [X] E [X]ᵀ]
      (B) = E [X Xᵀ] − 2 E [X] E [X]ᵀ + E [X] E [X]ᵀ
          = E [X Xᵀ] − E [X] E [X]ᵀ
where: in step A we have used the fact that a scalar is equal to its transpose; in
step B we have used the linearity of the expected value².
This formula also makes clear that the covariance matrix exists and is well-
defined only as long as the vector of expected values E [X] and the matrix of
second cross-moments³ E [X Xᵀ] exist and are well-defined.
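The equivalence between the definition and the shortcut formula can be illustrated numerically. The sketch below (a hypothetical 2×1 discrete vector, not taken from this section) computes the covariance matrix both ways with exact rationals and checks that they agree:

```python
from fractions import Fraction as F

# A hypothetical discrete 2x1 random vector, used only to check the
# identity Var[X] = E[XX'] - E[X]E[X]'.
pmf = {(1, 5): F(4, 5), (2, 1): F(1, 5)}
K = 2

def outer(u, v):
    """Outer product u v' as a K x K list of lists."""
    return [[ui * vj for vj in v] for ui in u]

def weighted_sum(mats, weights):
    """Entrywise sum of w_i * M_i over the support."""
    return [[sum(w * m[i][j] for m, w in zip(mats, weights))
             for j in range(K)] for i in range(K)]

mu = [sum(p * x[i] for x, p in pmf.items()) for i in range(K)]
ws = list(pmf.values())

# Definition: E[(X - E[X])(X - E[X])']
V_def = weighted_sum(
    [outer([xi - mi for xi, mi in zip(x, mu)],
           [xi - mi for xi, mi in zip(x, mu)]) for x in pmf], ws)

# Shortcut: E[XX'] - E[X]E[X]'
Exx = weighted_sum([outer(x, x) for x in pmf], ws)
V_alt = [[Exx[i][j] - mu[i] * mu[j] for j in range(K)] for i in range(K)]

assert V_def == V_alt  # the two formulas agree exactly
```

The diagonal entries of the resulting matrix are the variances of the components; the off-diagonal entries are their covariance.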
Proof. This is easily proved using the fact that⁵ E [bX] = b E [X]:
Var [bX] = E [(bX − E [bX]) (bX − E [bX])ᵀ]
         = E [(bX − b E [X]) (bX − b E [X])ᵀ]
         = E [b (X − E [X]) (X − E [X])ᵀ bᵀ]
         = b E [(X − E [X]) (X − E [X])ᵀ] bᵀ
         = b Var [X] bᵀ
23.4.4 Symmetry
The covariance matrix is a symmetric matrix, i.e., it is equal to its transpose.
where the last inequality follows from the fact that variance is always positive.
(C) = E [a (X − E [X]) (X − E [X])ᵀ bᵀ]
(D) = a E [(X − E [X]) (X − E [X])ᵀ] bᵀ
23.4.7 Cross-covariance
The term covariance matrix is sometimes also used to refer to the matrix of
covariances between the elements of two vectors. Let X be a K×1 random vector
and Y be an L×1 random vector. The covariance matrix between X and Y, or
cross-covariance between X and Y, denoted by Cov [X; Y], is defined as follows:
Cov [X; Y] = E [(X − E [X]) (Y − E [Y])ᵀ]
where the (i, j)-th entry of the matrix is equal to the covariance between Xi and
Yj:
Vij = E [(Xi − E [Xi]) (Yj − E [Yj])] = Cov [Xi; Yj]
Note that Cov [X; Y] is not the same as Cov [Y; X]. In fact, Cov [Y; X] is an
L×K matrix equal to the transpose of Cov [X; Y]:
Cov [Y; X] = E [(Y − E [Y]) (X − E [X])ᵀ]
           = E [ ((X − E [X]) (Y − E [Y])ᵀ)ᵀ ]
           = (Cov [X; Y])ᵀ
Exercise 1
Let X be a 2×1 random vector and denote its components by X1 and X2. The
covariance matrix of X is
Var [X] = [ 4 1
            1 2 ]
Compute the variance of the random variable Y defined as
Y = 3 X1 + 4 X2
Solution
Using matrix notation, Y can be written as
Y = [3 4] [ X1
            X2 ] = bX
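Completing the exercise with the formula Var [bX] = b Var [X] bᵀ established above, a minimal sketch of the computation (assuming the stated covariance matrix):

```python
# Var[Y] = b Var[X] b' for b = [3 4] and the covariance matrix of Exercise 1.
Sigma = [[4, 1], [1, 2]]
b = [3, 4]

# first b * Sigma (a row vector), then its dot product with b'
bSigma = [sum(b[i] * Sigma[i][j] for i in range(2)) for j in range(2)]  # [16, 11]
var_Y = sum(bSigma[j] * b[j] for j in range(2))

print(var_Y)  # 92
```

So Var [Y] = 3²·4 + 2·(3·4)·1 + 4²·2 = 36 + 24 + 32 = 92, the same number obtained by expanding the bilinearity of covariance directly.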
Exercise 2
Let X be a 3×1 random vector and denote its components by X1, X2 and X3.
The covariance matrix of X is
Var [X] = [ 3 1 0
            1 2 0
            0 0 1 ]
Compute Cov [X1 + 2X3; 3X2].
Solution
Using the bilinearity of the covariance operator⁶, we obtain the result. The same
result can be obtained using the formula for the covariance between two
linear transformations. Let us define
a = [1 0 2]
b = [0 3 0]
Then, we have
Cov [X1 + 2X3; 3X2] = Cov [aX; bX] = a Var [X] bᵀ
= [1 0 2] [ 3 1 0 ] [ 0 ]
          [ 1 2 0 ] [ 3 ]
          [ 0 0 1 ] [ 0 ]
= [1 0 2] [ 3·0 + 1·3 + 0·0 ]
          [ 1·0 + 2·3 + 0·0 ]
          [ 0·0 + 0·3 + 1·0 ]
= [1 0 2] [ 3 ]
          [ 6 ]  = 1·3 + 0·6 + 2·0 = 3
          [ 0 ]
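The same matrix product can be sketched in a few lines of Python, which reproduces the intermediate column vector [3 6 0]ᵀ and the final value:

```python
# Cov[aX, bX] = a Var[X] b' for the matrix of Exercise 2.
Sigma = [[3, 1, 0], [1, 2, 0], [0, 0, 1]]
a = [1, 0, 2]
b = [0, 3, 0]

# Sigma * b' first (a 3x1 column), then a * (Sigma b')
Sigma_bT = [sum(Sigma[i][j] * b[j] for j in range(3)) for i in range(3)]  # [3, 6, 0]
cov_ab = sum(a[i] * Sigma_bT[i] for i in range(3))

print(cov_ab)  # 3
```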
Exercise 3
Let X be a K×1 random vector whose covariance matrix is equal to the identity
matrix:
Var [X] = I
Define a new random vector Y as follows:
Y = AX
where A is a K×K matrix such that
A Aᵀ = I
Solution
By using the formula for the covariance matrix of a linear transformation, we obtain
Var [Y] = A Var [X] Aᵀ = A I Aᵀ = A Aᵀ = I
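As a numerical check, any orthogonal matrix A satisfies A Aᵀ = I. The sketch below uses a hypothetical rotation matrix (an assumption for illustration, not part of the exercise) and verifies that the transformed covariance matrix is again the identity:

```python
import math

# A hypothetical orthogonal matrix A (a 2D rotation), so that A A' = I.
t = math.pi / 6
A = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
I = [[1.0, 0.0], [0.0, 1.0]]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(r) for r in zip(*M)]

# Var[AX] = A Var[X] A' = A I A' = A A' = I
var_Y = matmul(matmul(A, I), transpose(A))
assert all(abs(var_Y[i][j] - I[i][j]) < 1e-12 for i in range(2) for j in range(2))
```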
Chapter 24
Indicator function
This lecture introduces the concept of indicator function. Before reading this
lecture, make sure you are familiar with the concepts of random variable (p. 105)
and expected value (p. 127).
24.1 Definition
Let Ω be a sample space¹, let E be an event and denote by P(E) the probabil-
ity assigned to the event E. The indicator function of the event E (or indicator
random variable of the event E), denoted by 1E, is a random variable defined
as follows:
1E (ω) = 1 if ω ∈ E
         0 if ω ∉ E
In other words, the indicator function of the event E is a random variable that
takes value 1 when the event E happens and value 0 when the event E does not
happen.
Example 136 We toss a die and one of the six numbers from 1 to 6 can appear
face up. The sample space is:
Ω = {1, 2, 3, 4, 5, 6}
Define the event
E = {1, 3, 5}
i.e. E is the event "An odd number appears face up". A random variable that takes
value 1 when an odd number appears face up and value 0 otherwise is an indicator
of the event E.
From the above definition, it can easily be seen that 1E is a discrete random
variable² with support R1E = {0, 1} and probability mass function:
p1E (x) = P(E)               if x = 1
          P(Eᶜ) = 1 − P(E)   if x = 0
          0                  otherwise
Indicator functions are heavily used in probability theory to simplify notation
and to prove theorems.
1 See p. 69.
2 See p. 106.
24.2.1 Powers
The n-th power of 1E is equal to 1E:
(1E (ω))ⁿ = 1E (ω), ∀n, ω
because 1E can be either 0 or 1 and
0ⁿ = 0
1ⁿ = 1
24.2.3 Variance
The variance of 1E is equal to P(E) (1 − P(E)). Using the powers property above
and the formula for computing the variance³:
Var [1E] = E [(1E)²] − (E [1E])²
         = E [1E] − (E [1E])²
         = P(E) − P(E)²
         = P(E) (1 − P(E))
24.2.4 Intersections
If E and F are two events, then:
1E∩F = 1E · 1F
In fact:
1. if ω ∈ E ∩ F, then
1E∩F (ω) = 1
and
ω ∈ E, ω ∈ F
⟹ 1E (ω) = 1, 1F (ω) = 1
⟹ 1E (ω) 1F (ω) = 1
3 See p. 156.
2. if ω ∉ E ∩ F, then
1E∩F (ω) = 0
and
either ω ∉ E or ω ∉ F or both
⟹ either 1E (ω) = 0 or 1F (ω) = 0 or both
⟹ 1E (ω) 1F (ω) = 0
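The powers, variance and intersection properties can all be verified by enumerating a finite sample space. The sketch below uses the die example from this chapter together with a second, hypothetical event F:

```python
from fractions import Fraction as Fr

# Die example from this chapter; F is a second, hypothetical event.
omega = {1, 2, 3, 4, 5, 6}
E, F = {1, 3, 5}, {1, 2}

one = lambda A: {w: int(w in A) for w in omega}   # indicator as a map w -> 0/1
one_E, one_F, one_EF = one(E), one(F), one(E & F)

# intersections: 1_{E n F}(w) = 1_E(w) * 1_F(w) for every sample point
assert all(one_EF[w] == one_E[w] * one_F[w] for w in omega)

# powers: 1_E^n = 1_E
assert all(one_E[w] ** 7 == one_E[w] for w in omega)

# variance: Var[1_E] = P(E)(1 - P(E))
pE = Fr(len(E), 6)                                 # P(E) = 1/2
var = sum(Fr(1, 6) * (one_E[w] - pE) ** 2 for w in omega)
assert var == pE * (1 - pE) == Fr(1, 4)
```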
If E is a zero-probability event, then
E [X 1E] = 0
While a rigorous proof of this fact is beyond the scope of this introductory expo-
sition, this property should be intuitive. The random variable X 1E is equal
to zero for all sample points ω except possibly for the points ω ∈ E. The expected
value is a weighted average of the values X 1E can take on, where each value is
weighted by its respective probability. The non-zero values X 1E can take on are
weighted by zero probabilities, so E [X 1E] must be zero.
Exercise 1
Consider a random variable X and another random variable Y defined as a function
of X:
Y = 2 if X < 2
    X if X ≥ 2
Express Y using the indicator functions of the events {X < 2} and {X ≥ 2}.
Solution
Denote by 1{X<2} the indicator of the event {X < 2} and denote by 1{X≥2} the
indicator of the event {X ≥ 2}. We can write Y as:
Y = 2 · 1{X<2} + X · 1{X≥2}
Exercise 2
Let X be a positive random variable, i.e. a random variable that can take on only
positive values. Let c be a constant. Prove that
E [X] ≥ E [X 1{X≥c}]
Solution
First note that the sum of the indicators 1{X≥c} and 1{X<c} is always equal to 1:
1{X≥c} + 1{X<c} = 1
Therefore,
E [X] = E [X · 1]
      = E [X (1{X≥c} + 1{X<c})]
      = E [X 1{X≥c}] + E [X 1{X<c}]
Now, note that X 1{X<c} is a positive random variable and that the expected value
of a positive random variable is positive⁶:
E [X 1{X<c}] ≥ 0
Thus:
E [X] = E [X 1{X≥c}] + E [X 1{X<c}] ≥ E [X 1{X≥c}]
Exercise 3
Let E be an event and denote its indicator function by 1E. Let Eᶜ be the
complement of E and denote its indicator function by 1Eᶜ. Can you express 1Eᶜ as
a function of 1E?
Solution
The sum of the two indicators is always equal to 1:
1E + 1Eᶜ = 1
Therefore:
1Eᶜ = 1 − 1E
6 See p. 150.
Chapter 25
Conditional probability as a random variable
In the lecture entitled Conditional probability (p. 85) we have stated a number of
properties that conditional probabilities should satisfy to be rational in some sense.
We have proved that, whenever P(G) > 0, these properties are satisfied if and only
if
P(E | G) = P(E ∩ G) / P(G)
but we have not been able to derive a formula for probabilities conditional on
zero-probability events¹, i.e. we have not been able to find a way to compute
P(E | G) when P(G) = 0.
Thus, we have concluded that the above elementary formula cannot be taken
as a general definition of conditional probability, because it does not cover zero-
probability events.
In this lecture we discuss a completely general definition of conditional proba-
bility, which covers also the case in which P(G) = 0.
The plan of the lecture is as follows.
2. if G, F ∈ G, then either G = F or G ∩ F = ∅;
3. Ω = ∪_{G∈G} G.
Example 138 Suppose that we toss a die. Six numbers (from 1 to 6) can appear
face up, but we do not yet know which one of them will appear. The sample space
is
Ω = {1, 2, 3, 4, 5, 6}
Let any subset of Ω be considered an event. Define the two events:
G1 = {1, 2, 3}
G2 = {4, 5, 6}
F1 = {1, 2}
F2 = {3, 4, 5}
F3 = {5, 6}
F2 ∩ F3 = {5} ≠ ∅ and F2 ≠ F3
2 See p. 69.
25.2 Probabilities conditional on a partition
E [P(E | G)] = (2/3) · p_{P(E|G)}(2/3) + (1/3) · p_{P(E|G)}(1/3)
             = (2/3) · (1/2) + (1/3) · (1/2) = 1/2 = P(E)
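A computation of this kind can be checked on a concrete finite example. The sketch below uses a hypothetical die setup (events chosen so that P(E | G1) = 2/3 and P(E | G2) = 1/3, matching the numbers above) and verifies that the expected value of P(E | G) equals P(E):

```python
from fractions import Fraction as F

# Hypothetical die example: G1 = {1,2,3}, G2 = {4,5,6}, E = {1,2,4},
# each outcome having probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
G1, G2, E = {1, 2, 3}, {4, 5, 6}, {1, 2, 4}

P = lambda A: F(len(A & omega), 6)
P_E_given = lambda G: P(E & G) / P(G)

assert P_E_given(G1) == F(2, 3) and P_E_given(G2) == F(1, 3)

# E[P(E|G)] = sum over the cells of the partition of P(E|Gi) * P(Gi) = P(E)
expected = P_E_given(G1) * P(G1) + P_E_given(G2) * P(G2)
assert expected == P(E) == F(1, 2)
```

This is the law of total probability written as an expected value of the random variable P(E | G).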
The above property can be generalized as follows. Let
G = {G1, G2, ..., Gn}
be a finite partition of events of Ω such that P(Gi) > 0 for every i. Let H be any
event obtained as a union of events Gi ∈ G. Let 1H be the indicator function⁴ of
H. Let P(E | G) be defined as in (25.1). Then:
E [1H P(E | G)] = P(E ∩ H)   (25.3)
4 See p. 197.
Proof. Without loss of generality, we can assume that H is obtained as the union
of the first k (k ≤ n) sets of the partition G (we can always re-arrange the sets Gi
by changing their indices):
H = ∪_{i=1}^{k} Gi
E [1H Y] = P(E ∩ H)
E [1H Y1] = P(E ∩ H)
E [1H Y2] = P(E ∩ H)
E [1H P(E | G)] = P(E ∩ H)
P(E | I) is I-measurable
E [1H P(E | I)] = P(E ∩ H) for any H ∈ I
It can be shown that this definition is completely equivalent to our definition above,
provided I is the smallest σ-algebra containing all the events H ∈ F obtainable as
unions of events G ∈ G (where G is a partition of events of Ω).
5 In other words, there exists a zero-probability event F such that:
{ω ∈ Ω : Y1(ω) ≠ Y2(ω)} ⊆ F
See the lecture entitled Zero-probability events (p. 79) for a definition of almost sure events and
zero-probability events.
6 See p. 75.
7 See p. 76.
Chapter 26
Conditional probability distributions
pX|Y=y (x) = P(X = x | Y = y)
How do we derive the conditional probability mass function from the joint prob-
ability mass function⁵ pXY (x, y)? The following proposition provides an answer
to this question:
Proposition 145 Let [X Y] be a discrete random vector. Let pXY (x, y) be its
joint probability mass function and let pY (y) be the marginal probability mass
function⁶ of Y. The conditional probability mass function of X given Y = y is
pX|Y=y (x) = pXY (x, y) / pY (y)
provided pY (y) > 0.
Proof. This is just the usual formula for computing conditional probabilities
(conditional probability equals joint probability divided by marginal probability):
pX|Y=y (x) = P(X = x | Y = y)
           = P(X = x and Y = y) / P(Y = y)
           = pXY (x, y) / pY (y)
Note that the above proposition assumes knowledge of the marginal probability
mass function pY (y), which can be derived from the joint probability mass function
pXY (x, y) by marginalization⁷.
Ω = [0, 1]
i.e. the sample space is the set of all real numbers between 0 and 1. It is possible
to build a probability measure P on Ω, such that P assigns to each sub-interval of
[0, 1] a probability equal to its length, i.e.:
P([a, b]) = b − a for 0 ≤ a ≤ b ≤ 1
This is the same sample space discussed in the lecture on zero-probability events⁸.
Define a random variable X as follows:
X (ω) = 1 if ω = 0
        0 otherwise
Both X and Y are discrete random variables and, considered together, they con-
stitute a discrete random vector [X Y]. Suppose we want to compute the condi-
tional probability mass function of X conditional on Y = 1. It is easy to see that
pY (1) = 0. As a consequence, we cannot use the formula:
pX|Y=1 (x) = pXY (x, 1) / pY (1)
because division by zero is not possible. It turns out that also the technique of
implicitly deriving a conditional probability as a realization of a random variable
satisfying the definition of a conditional probability with respect to a partition (see
the lecture entitled Conditional probability as a random variable - p. 201) does not
allow us to unambiguously derive pX|Y=1 (x). In this case, the partition of interest
is G = {G1, G2}, where:
G1 = {ω ∈ Ω : Y(ω) = 1} = {0, 1}
G2 = {ω ∈ Ω : Y(ω) = 0} = (0, 1)
and pX|Y=1 (x) can be viewed as the realization of the conditional probability
P(X = x | G)(ω)
which implies:
pX|Y=1 (x) · 0 = 0
pX|Y=0 (x) = pXY (x, 0)
The second equation does not help to determine pX|Y=1 (x). So, from the first
equation it is evident that pX|Y=1 (x) is undetermined (any number, when multiplied
by zero, gives zero). One can show that also the requirement that
P(X = x | G)
What does it mean that (26.1) is undetermined? It means that any choice of (26.1)
is legitimate, provided the requirement
0 ≤ pX|Y=1 (x) ≤ 1
is satisfied. Is this really a paradox? No, because conditional probability with respect
to a partition is defined up to almost sure equality, G1 is a zero-probability event,
so the value that P(E | G) takes on G1 does not matter (roughly speaking, we do not
really need to care about zero-probability events, provided there is only a countable
number of them).
9 E [1H P(E | G)] = P(E ∩ H) - see p. 204.
10 See p. 207.
26.2 Conditional probability density function
How do we derive the conditional probability density function from the joint
probability density function¹² fXY (x, y)?
Deriving the conditional distribution of X given Y = y is far from obvious:
whatever value of y we choose, we are conditioning on a zero-probability event
(P(Y = y) = 0 - see p. 109 for an explanation); therefore, the standard formula
(conditional probability equals joint probability divided by marginal probability)
cannot be used. However, it turns out that the definition of conditional probability
with respect to a partition¹³ can be fruitfully applied in this case to derive the
conditional probability density function of X given Y = y:
E [1H P(E | G)] = P(E ∩ H)
for any H and E. Thanks to some basic results in measure theory, we can confine
our attention to the events H and E that can be written as follows:
H = {ω ∈ Ω : Y ∈ [y1, y2]}, [y1, y2] ⊆ RY
E = {ω ∈ Ω : X ∈ [x1, x2]}, [x1, x2] ⊆ RX
11 See p. 107.
12 See p. 117.
13 See p. 206.
For these events, it is immediate to verify that the fundamental property of condi-
tional probability holds. First, by the very definition of a conditional probability
density function:
P(E | G) = ∫_{x1}^{x2} fX|Y=y (x) dx
fY (y) = 1/4 if y ∈ [1, 5]
         0   otherwise
14 See p. 134.
When evaluated at y = 1, it is
fY (1) = 1/4
The support of X is
RX = [0, ∞)
FX|Y=y (x) = P(X ≤ x | Y = y), ∀x ∈ R
Gy = {ω ∈ Ω : Y = y}
G = {Gy : y ∈ RY}
Exercise 1
Let [X Y] be a discrete random vector with support:
RXY = {[1 0], [2 0], [1 1], [1 2]}
and joint probability mass function:
pXY (x, y) = 1/4 if x = 1 and y = 0
             1/4 if x = 2 and y = 0
             1/4 if x = 1 and y = 1
             1/4 if x = 1 and y = 2
             0   otherwise
Compute the conditional probability mass function of X given Y = 0.
Solution
The marginal probability mass function of Y evaluated at y = 0 is
pY (0) = Σ_{(x,y)∈RXY : y=0} pXY (x, y)
       = pXY (1, 0) + pXY (2, 0) = 1/4 + 1/4 = 1/2
The support of X is:
RX = {1, 2}
Thus, the conditional probability mass function of X given Y = 0 is
pX|Y=0 (x) = pXY (1, 0) / pY (0) = (1/4)/(1/2) = 1/2 if x = 1
             pXY (2, 0) / pY (0) = (1/4)/(1/2) = 1/2 if x = 2
             pXY (x, 0) / pY (0) = 0/(1/2) = 0       if x ∉ RX
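The marginalization and conditioning steps of this solution can be sketched in a few lines of Python with exact rationals:

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1.
pXY = {(1, 0): F(1, 4), (2, 0): F(1, 4), (1, 1): F(1, 4), (1, 2): F(1, 4)}

# marginal of Y at y = 0, then the conditional pmf of X given Y = 0
pY0 = sum(p for (x, y), p in pXY.items() if y == 0)
pX_given_Y0 = {x: pXY.get((x, 0), F(0)) / pY0 for x in (1, 2)}

assert pY0 == F(1, 2)
assert pX_given_Y0 == {1: F(1, 2), 2: F(1, 2)}
```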
Exercise 2
Let [X Y] be an absolutely continuous random vector with support:
Solution
The support of Y is:
RY = [1, 2]
When y ∉ [1, 2], the marginal probability density function of Y is fY (y) = 0; when
y ∈ [1, 2], the marginal probability density function of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} (2/3) y² exp(−xy) dx
       = [−(2/3) y exp(−xy)]_0^{∞} = (2/3) y
The support of X is
RX = [0, ∞)
Thus, the conditional probability density function of X given Y = 2 is
fX|Y=2 (x) = fXY (x, 2) / fY (2)
           = (2/3) · 2² · exp(−2x) / (4/3) = 2 exp(−2x) if x ∈ [0, ∞)
             0                                            otherwise
Exercise 3
Let X be an absolutely continuous random variable with support
RX = [0, 1]
and probability density function
fX (x) = 1 if x ∈ [0, 1]
         0 otherwise
Let the support of Y be
RY = [0, 1]
and the conditional probability density function of Y given X = x be
fY|X=x (y) = (1 + 2xy) / (1 + x) if y ∈ [0, 1]
             0                    otherwise
Solution
The support of the vector [X Y] is RXY = [0, 1] × [0, 1], and for y ∈ [0, 1] the
marginal probability density function of Y is
fY (y) = ∫_0^1 (1 + 2xy) / (1 + x) dx
    (A) = ∫_0^1 (1 + x − x + 2yx) / (1 + x) dx
        = ∫_0^1 [1 + (2y − 1) x / (1 + x)] dx
    (B) = ∫_0^1 dx + (2y − 1) ∫_0^1 x / (1 + x) dx
    (C) = [x]_0^1 + (2y − 1) ∫_0^1 (1 + x − 1) / (1 + x) dx
        = 1 + (2y − 1) ∫_0^1 [1 − 1 / (1 + x)] dx
    (D) = 1 + (2y − 1) ( ∫_0^1 dx − ∫_0^1 1 / (1 + x) dx )
        = 1 + (2y − 1) ( [x]_0^1 − [ln(1 + x)]_0^1 )
        = 1 + (2y − 1) [1 − ln(2)]
        = 1 + 2 [1 − ln(2)] y − 1 + ln(2)
        = ln(2) + 2 [1 − ln(2)] y
where: in step A we have added and subtracted x from the numerator; in step B
we have used the linearity of the integral; in step C we have added and subtracted
1 from the numerator; in step D we have used the linearity of the integral. Thus,
the marginal probability density function of Y is
Chapter 27
Conditional expectation
27.1 Definition
The following informal definition is very similar to the definition of expected value
we have given in the lecture entitled Expected value (p. 127).
Definition 152 (informal) Let X and Y be two random variables. The condi-
tional expectation of X given Y = y is the weighted average of the values that
X can take on, where each possible value is weighted by its respective conditional
probability (conditional on the information that Y = y).
E [X | Y = y]
Definition 152: the weights of the average are given by the conditional probability
mass function³ of X.
Definition 153 Let X and Y be two discrete random variables. Let RX be the
support of X and let pX|Y=y (x) be the conditional probability mass function of X
given Y = y. The conditional expectation of X given Y = y is
E [X | Y = y] = Σ_{x∈RX} x pX|Y=y (x)
provided that
Σ_{x∈RX} |x| pX|Y=y (x) < ∞
If you do not understand the symbol Σ_{x∈RX} and the finiteness condition above
(absolute summability), go back to the lecture entitled Expected value (p. 127),
where they are explained.
The support of X is
RX = {0, 1, 2}
Thus, the conditional probability mass function of X given Y = 0 is
pX|Y=0 (x) = pXY (0, 0) / pY (0) = (1/3)/(2/3) = 1/2 if x = 0
             pXY (1, 0) / pY (0) = 0/(2/3) = 0       if x = 1
             pXY (2, 0) / pY (0) = (1/3)/(2/3) = 1/2 if x = 2
             0                                        if x ∉ RX
Definition 155 Let X and Y be two absolutely continuous random variables. Let
RX be the support of X and let fX|Y=y (x) be the conditional probability density
function⁵ of X given Y = y. The conditional expectation of X given Y = y is
E [X | Y = y] = ∫_{−∞}^{∞} x fX|Y=y (x) dx
provided that
∫_{−∞}^{∞} |x| fX|Y=y (x) dx < ∞
If you do not understand why an integration is required and why the finiteness
condition above (absolute integrability) is imposed, you can find an explanation in
the lecture entitled Expected value (p. 127).
fY (y) = 1/2 if y ∈ [2, 4]
         0   otherwise
When evaluated at y = 2, it is
fY (2) = 1/2
4 See p. 117.
5 See p. 213.
The support of X is
RX = [0, ∞)
Thus, the conditional probability density function of X given Y = 2 is
fX|Y=2 (x) = fXY (x, 2) / fY (2) = 2 exp(−2x) if x ∈ [0, ∞)
             0                                  otherwise
The conditional expectation of X given Y = 2 is
E [X | Y = 2] = ∫_{−∞}^{∞} x fX|Y=2 (x) dx
              = ∫_0^{∞} 2x exp(−2x) dx
          (A) = (1/2) ∫_0^{∞} t exp(−t) dt
          (B) = (1/2) { [−t exp(−t)]_0^{∞} + ∫_0^{∞} exp(−t) dt }
              = (1/2) { 0 − 0 + [−exp(−t)]_0^{∞} }
              = (1/2) { 0 + 1 } = 1/2
where: in step A we have performed a change of variable (t = 2x); in step B we
have performed an integration by parts.
where the integral is a Riemann-Stieltjes integral and the expected value exists and
is well-defined only as long as the integral is well-defined.
The above formula follows the same logic as the formula for the expected value:
E [X] = ∫_{−∞}^{∞} x dFX (x)
with the only difference that the unconditional distribution function FX (x) has
now been replaced with the conditional distribution function FX|Y=y (x). The
reader who feels unfamiliar with this formula can go back to the lecture entitled
Expected value (p. 127) and read an intuitive introduction to the Riemann-Stieltjes
integral and its use in probability theory.
6 See p. 215.
E [E [X | Y]]
  (A) = Σ_{y∈RY} E [X | Y = y] pY (y)
  (B) = Σ_{y∈RY} Σ_{x∈RX} x pX|Y=y (x) pY (y)
  (C) = Σ_{y∈RY} Σ_{x∈RX} x pXY (x, y)
      = Σ_{x∈RX} x Σ_{y∈RY} pXY (x, y)
  (D) = Σ_{x∈RX} x pX (x)
  (E) = E [X]
E [E [X | Y]]
7 See p. 120.
  (A) = ∫_{−∞}^{∞} E [X | Y = y] fY (y) dy
  (B) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x fX|Y=y (x) dx ) fY (y) dy
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX|Y=y (x) fY (y) dx dy
  (C) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fXY (x, y) dx dy
      = ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} fXY (x, y) dy ) dx
  (D) = ∫_{−∞}^{∞} x fX (x) dx
  (E) = E [X]
Exercise 1
Let [X Y] be a random vector with support
RXY = {[2 2], [2 0], [1 2], [0 2]}
and joint probability mass function
pXY (x, y) = 1/4 if x = 2 and y = 2
             1/4 if x = 2 and y = 0
             1/4 if x = 1 and y = 2
             1/4 if x = 0 and y = 2
             0   otherwise
What is the conditional expectation of X given Y = 2?
Solution
Let us compute the conditional probability mass function of X given Y = 2. The
marginal probability mass function of Y evaluated at y = 2 is
pY (2) = Σ_{(x,y)∈RXY : y=2} pXY (x, y) = pXY (2, 2) + pXY (1, 2) + pXY (0, 2) = 3/4
8 See p. 120.
The support of X is
RX = {0, 1, 2}
Thus, the conditional probability mass function of X given Y = 2 is
pX|Y=2 (x) = pXY (0, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 0
             pXY (1, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 1
             pXY (2, 2) / pY (2) = (1/4)/(3/4) = 1/3 if x = 2
             0                                        if x ∉ RX
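From the conditional pmf just derived, the conditional expectation follows directly; a minimal Python sketch of the computation:

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1.
pXY = {(2, 2): F(1, 4), (2, 0): F(1, 4), (1, 2): F(1, 4), (0, 2): F(1, 4)}

pY2 = sum(p for (x, y), p in pXY.items() if y == 2)                     # 3/4
E_X_given_Y2 = sum(x * p for (x, y), p in pXY.items() if y == 2) / pY2

assert pY2 == F(3, 4)
assert E_X_given_Y2 == 1   # (0 + 1 + 2) / 3 = 1
```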
Exercise 2
Let X and Y be two random variables. Remember that the variance of X can be
computed as
Var [X] = E [X²] − (E [X])²   (27.1)
In a similar manner, the conditional variance of X, given Y = y, can be defined as
Var [X | Y = y] = E [X² | Y = y] − (E [X | Y = y])²   (27.2)
Prove that
Var [X] = E [Var [X | Y = y]] + Var [E [X | Y = y]]
Solution
This is proved as follows:
Var [X]
    = E [X²] − (E [X])²
(A) = E [E [X² | Y = y]] − (E [E [X | Y = y]])²
(B) = E [ Var [X | Y = y] + (E [X | Y = y])² ] − (E [E [X | Y = y]])²
(C) = E [Var [X | Y = y]] + E [ (E [X | Y = y])² ] − (E [E [X | Y = y]])²
(D) = E [Var [X | Y = y]] + Var [E [X | Y = y]]
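The decomposition can be checked numerically on the joint pmf of Exercise 1 above (a hypothetical check, using exact rationals):

```python
from fractions import Fraction as F

# Joint pmf of Exercise 1, reused to verify the variance decomposition.
pXY = {(2, 2): F(1, 4), (2, 0): F(1, 4), (1, 2): F(1, 4), (0, 2): F(1, 4)}
ys = {y for _, y in pXY}

pY = {y: sum(p for (x, yy), p in pXY.items() if yy == y) for y in ys}
E_cond = {y: sum(x * p for (x, yy), p in pXY.items() if yy == y) / pY[y]
          for y in ys}
E2_cond = {y: sum(x * x * p for (x, yy), p in pXY.items() if yy == y) / pY[y]
           for y in ys}
Var_cond = {y: E2_cond[y] - E_cond[y] ** 2 for y in ys}

EX = sum(x * p for (x, _), p in pXY.items())
VarX = sum(x * x * p for (x, _), p in pXY.items()) - EX ** 2

E_of_var = sum(pY[y] * Var_cond[y] for y in ys)             # E[Var[X|Y]]
var_of_E = sum(pY[y] * E_cond[y] ** 2 for y in ys) - EX ** 2  # Var[E[X|Y]]

assert VarX == E_of_var + var_of_E == F(11, 16)
```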
Chapter 28
Independent random variables
Two random variables are independent if they convey no information about each
other and, as a consequence, receiving information about one of the two does not
change our assessment of the probability distribution of the other.
This lecture provides a formal definition of independence and discusses how to
verify whether two or more random variables are independent.
28.1 Definition
Recall (see the lecture entitled Independent events - p. 99) that two events A and
B are independent if and only if
P(A ∩ B) = P(A) P(B)
In other words, two random variables are independent if and only if the events
related to those random variables are independent events.
The independence between two random variables is also called statistical inde-
pendence.
Proposition 159 Two random variables X and Y are independent if and only if
FXY (x, y) = FX (x) FY (y), ∀x, y ∈ R
where FXY (x, y) is their joint distribution function¹ and FX (x) and FY (y) are
their marginal distribution functions².
Proof. Using some facts from measure theory (not proved here), it is possible to
demonstrate that, when checking for the condition
P({X ∈ A} ∩ {Y ∈ B}) = P({X ∈ A}) P({Y ∈ B})
it is sufficient to confine attention to sets A and B taking the form
A = (−∞, x]
B = (−∞, y]
Thus, two random variables are independent if and only if
P({X ∈ (−∞, x]} ∩ {Y ∈ (−∞, y]}) = P({X ∈ (−∞, x]}) P({Y ∈ (−∞, y]})
for any x, y ∈ R. Using the definitions of joint and marginal distribution function,
this condition can be written as
FXY (x, y) = FX (x) FY (y), ∀x, y ∈ R
Example 160 Let X and Y be two random variables with marginal distribution
functions
FX (x) = 0            if x < 0
         1 − exp(−x)  if x ≥ 0
FY (y) = 0            if y < 0
         1 − exp(−y)  if y ≥ 0
and joint distribution function
FXY (x, y) = 0                                    if x < 0 or y < 0
             1 − exp(−x) − exp(−y) + exp(−x − y)  if x ≥ 0 and y ≥ 0
X and Y are independent if and only if
FXY (x, y) = FX (x) FY (y)
which is straightforward to verify. When x < 0 or y < 0, then
FX (x) FY (y) = 0 = FXY (x, y)
When x ≥ 0 and y ≥ 0, then:
FX (x) FY (y) = [1 − exp(−x)] [1 − exp(−y)]
              = 1 − exp(−x) − exp(−y) + exp(−x) exp(−y)
              = 1 − exp(−x) − exp(−y) + exp(−x − y)
              = FXY (x, y)
1 See p. 118.
2 See p. 119.
where pXY (x, y) is their joint probability mass function³ and pX (x) and pY (y)
are their marginal probability mass functions⁴.
In order to verify whether X and Y are independent, we first need to derive the
marginal probability mass functions of X and Y. The support of X is
RX = {0, 1, 2}
which is obviously different from pXY (x, y). Therefore, X and Y are not indepen-
dent.
where fXY (x, y) is their joint probability density function⁵ and fX (x) and fY (y)
are their marginal probability density functions⁶.
fXY (x, y) = 0 if x ∉ [0, 1] or y ∉ [0, 1]
             1 if x ∈ [0, 1] and y ∈ [0, 1]
5 See p. 117.
6 See p. 119.
and
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx
       = ∫_{−∞}^{0} fXY (x, y) dx + ∫_0^1 fXY (x, y) dx + ∫_1^{∞} fXY (x, y) dx
       = 0 + ∫_0^1 fXY (x, y) dx + 0
       = 0 if y ∉ [0, 1]
         1 if y ∈ [0, 1]
Verifying that
fXY (x, y) = fX (x) fY (y)
is straightforward. When x ∉ [0, 1] or y ∉ [0, 1], then
for any sub-collection of k random variables Xi1, ..., Xik (where k ≤ n) and for
any collection of events {Xi1 ∈ A1}, ..., {Xik ∈ Ak}, where A1, ..., Ak ⊆ R.
for any n functions g1, ..., gn such that the above expected values exist and are
well-defined.
Cov [X1; X2] = 0
(see the Mutual independence via expectations property above). When g1 and g2
are identity functions (g1 (X1) = X1 and g2 (X2) = X2), then:
The converse is not true: two random variables that have zero covariance are
not necessarily independent.
for any sub-collection of k random vectors Xi1, ..., Xik (where k ≤ n) and for
any collection of events {Xi1 ∈ A1}, ..., {Xik ∈ Ak}.
All the equivalent conditions for the joint independence of a set of random
variables (see above) apply with obvious modifications also to random vectors.
Exercise 1
Consider two random variables X and Y having marginal distribution functions
FX (x) = 0   if x < 1
         1/2 if 1 ≤ x < 2
         1   if x ≥ 2
FY (y) = 0                                   if y < 0
         1 − (1/2) exp(−y) − (1/2) exp(−2y)  if y ≥ 0
If X and Y are independent, what is their joint distribution function?
Solution
For X and Y to be independent, their joint distribution function must be equal to
the product of their marginal distribution functions:
FXY (x, y) = 0                                     if x < 1 or y < 0
             1/2 − (1/4) exp(−y) − (1/4) exp(−2y)  if 1 ≤ x < 2 and y ≥ 0
             1 − (1/2) exp(−y) − (1/2) exp(−2y)    if x ≥ 2 and y ≥ 0
Exercise 2
Let [X Y] be a discrete random vector with support
Solution
In order to verify whether X and Y are independent, we first need to derive the
marginal probability mass functions of X and Y. The support of X is
RX = {0, 1}
pX (1) = Σ_{y∈RY} pXY (1, y) = pXY (1, 0) + pXY (1, 1) = 1/2
Exercise 3
Let [X Y] be an absolutely continuous random vector with support
Solution
The support of Y is
RY = [2, 3]
When y ∉ [2, 3], the marginal probability density function of Y is 0, while, when
y ∈ [2, 3], the marginal probability density function of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{∞} exp(−x) dx
       = [−exp(−x)]_0^{∞} = −[0 − 1] = 1
Thus,
fY (y) = 1 if y ∈ [2, 3]
         0 otherwise
The support of X is
RX = [0, ∞)
When x ∉ [0, ∞), the marginal probability density function of X is 0, while, when
x ∈ [0, ∞), the marginal probability density function of X is
fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_2^3 exp(−x) dy = exp(−x)
Thus,
fX (x) = exp(−x) if x ∈ [0, ∞)
         0        otherwise
Verifying that
fXY (x, y) = fX (x) fY (y)
is straightforward. When x ∉ [0, ∞) or y ∉ [2, 3], then
Additional topics in
probability theory
Chapter 29
Probabilistic inequalities
This lecture introduces some probabilistic inequalities that are used in the proofs
of several important theorems in probability theory.
E [X] = E [X · 1]
      = E [X (1{X≥c} + 1{X<c})]
      = E [X 1{X≥c}] + E [X 1{X<c}]
Now, note that X 1{X<c} is a positive random variable and that the expected value
of a positive random variable is positive⁴:
E [X 1{X<c}] ≥ 0
Therefore,
E [X] ≥ E [X 1{X≥c}]
1 See p. 136.
2 In other words, X(ω) ≥ 0 for all ω ∈ Ω.
3 See p. 197.
4 See p. 150.
The random variable c · 1{X≥c} is less than or equal to the random variable
X · 1{X≥c} for any ω ∈ Ω:
c · 1{X≥c} ≤ X · 1{X≥c}
because c is always smaller than or equal to X when the indicator 1{X≥c} is not
zero. Since the expected value operator preserves inequalities⁵, we have
E [c 1{X≥c}] ≤ E [X 1{X≥c}]
Furthermore, by using the linearity of the expected value⁶ and the fact that the
expected value of an indicator is equal to the probability of the event it indicates⁷,
we obtain
c P(X ≥ c) ≤ E [X 1{X≥c}]
Combining this with
E [X] ≥ E [X 1{X≥c}]
yields
E [X] ≥ c P(X ≥ c)
Finally, since c is strictly positive, we can divide both sides of the right-hand
inequality by c to obtain Markov's inequality:
P(X ≥ c) ≤ E [X] / c
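The inequality can be illustrated on a hypothetical positive discrete random variable; the sketch below checks the bound for several thresholds:

```python
from fractions import Fraction as F

# A hypothetical positive discrete random variable.
pmf = {1: F(1, 2), 4: F(1, 4), 10: F(1, 4)}
EX = sum(x * p for x, p in pmf.items())   # 1/2 + 1 + 5/2 = 4

for c in (2, 5, 8):
    tail = sum(p for x, p in pmf.items() if x >= c)
    assert tail <= EX / c                 # Markov: P(X >= c) <= E[X]/c
```

Note that the bound is often loose: for c = 2 it gives P(X ≥ 2) ≤ 2, which is trivially true.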
Therefore,
P(|X − E [X]| ≥ k) ≤ Var [X] / k²
E [g(X)] ≥ g(E [X])
Proof. A function g is convex if, for any point x0, the graph of g lies entirely
above its tangent at the point x0:
g(x) ≥ g(x0) + b (x − x0), ∀x
where b is the slope of the tangent. By setting x = X and x0 = E [X], the inequality
becomes
g(X) ≥ g(E [X]) + b (X − E [X])
By taking the expected value of both sides of the inequality, and by using the fact
that the expected value operator preserves inequalities¹⁰, we obtain
E [g(X)] ≥ g(E [X]) + b (E [X] − E [X]) = g(E [X])
Proof. A function g is strictly convex if, for any point x0, the graph of g lies
entirely above its tangent at the point x0 (and strictly so for points different from
x0):
g(x) > g(x0) + b (x − x0), ∀x ≠ x0
where b is the slope of the tangent. By setting x = X and x0 = E [X], the inequality
becomes
g(X) > g(E [X]) + b (X − E [X]), ∀X ≠ E [X]
10 See p. 150.
and, of course, g(X) = g(E [X]) when X = E [X]. By taking the expected value of
both sides of the inequality, and by using the fact that the expected value operator
preserves inequalities, we obtain
E [g(X)] > g(E [X])
where the first inequality is strict because we have assumed that X is not almost
surely¹¹ constant, and, as a consequence, the event
{g(X) = g(E [X])}
does not have probability one.
If g is concave, then −g is convex, so that
E [−g(X)] ≥ −g(E [X])
By multiplying both sides by −1, and by using the linearity of the expected value,
we obtain the result:
E [g(X)] ≤ g(E [X])
If the function g is strictly concave and X is not almost surely constant, then
Exercise 1

Let X be a positive random variable whose expected value is

    E[X] = 10

Find a lower bound to the probability

    P(X < 20)

Solution

First of all, we need to use the formula for the probability of a complement:

    P(X < 20) = 1 − P(X ≥ 20)

By Markov's inequality,

    P(X ≥ 20) ≤ E[X]/20 = 10/20 = 1/2

By multiplying both sides of the inequality by −1, we obtain

    −P(X ≥ 20) ≥ −1/2

By adding 1 to both sides of the inequality, we obtain

    1 − P(X ≥ 20) ≥ 1 − 1/2 = 1/2

Thus, the lower bound is

    P(X < 20) ≥ 1/2
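The bound can also be illustrated numerically. The snippet below is a sketch: the exercise fixes only the mean, so the exponential distribution with mean 10 is an assumption made purely for illustration.

```python
import numpy as np

# Monte Carlo sanity check of the lower bound P(X < 20) >= 1/2 when E[X] = 10.
# The exponential distribution is an illustrative choice of a positive
# random variable with the required mean; the bound holds for any such X.
rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=1_000_000)

p_below = np.mean(x < 20)   # empirical P(X < 20)
print(p_below)              # well above the Markov lower bound 1/2
```

For this particular distribution P(X < 20) = 1 − e⁻² ≈ 0.865, comfortably above the worst-case bound 1/2.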
Exercise 2

Let X be a random variable such that

    E[X] = 0
    P(−3 < X < 2) = 1/2

Find a lower bound to its variance.

Solution

The lower bound can be derived thanks to Chebyshev's inequality:

    Var[X] ≥(A) 2² P(|X − E[X]| ≥ 2)
           =(B) 4 P(|X| ≥ 2)
           =(C) 4 [1 − P(−2 < X < 2)]
           ≥(D) 4 [1 − P(−3 < X < 2)]
           = 4 (1 − 1/2) = 2

where: in step A we have used Chebyshev's inequality with k = 2; in step B we have used the fact that E[X] = 0; in step C we have used the formula for the probability of a complement; in step D we have used the fact that the event {−2 < X < 2} is included in the event {−3 < X < 2}.
Exercise 3

Let X be a strictly positive random variable, such that

    E[X] = 1/2
    Var[X] = 1

What can you infer, using Jensen's inequality, about the expected value E[ln(2X)]?

Solution

The function

    g(x) = ln(2x)

has first derivative

    (d/dx) g(x) = (1/(2x)) · 2 = 1/x

and second derivative

    (d²/dx²) g(x) = −1/x²

The second derivative is strictly negative on the domain of definition of the function. Therefore, the function is strictly concave. Furthermore, X is not almost surely constant because it has strictly positive variance. Hence, by Jensen's inequality, we obtain

    E[ln(2X)] < ln(2E[X]) = ln(1) = 0
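The strict inequality can be illustrated with a simulation. A Gamma variable with shape 0.25 and scale 2 is an assumption made here only because it is a strictly positive variable with exactly E[X] = 1/2 and Var[X] = 1, as required by the exercise:

```python
import numpy as np

# Monte Carlo illustration of the strict Jensen inequality E[ln(2X)] < 0.
# Gamma(shape=0.25, scale=2) has mean 0.25*2 = 1/2 and variance 0.25*4 = 1,
# matching the exercise; any such non-degenerate positive X would do.
rng = np.random.default_rng(0)
x = rng.gamma(shape=0.25, scale=2.0, size=500_000)

m = np.mean(np.log(2 * x))
print(m)   # strictly negative, while ln(2 E[X]) = ln(1) = 0
```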
Chapter 30

Legitimate probability mass functions

Proposition 171 Let X be a discrete random variable. Its probability mass function pX(x) satisfies the following two properties:

1. non-negativity:

       pX(x) ≥ 0,  ∀x ∈ R                                         (30.1)

2. sum over the support equals 1:

       Σ_{x∈RX} pX(x) = 1                                          (30.2)

Proof. Since pX(x) = P(X = x) and probabilities cannot be negative, property (30.1) follows. Furthermore, the probability of a sure thing² must be equal to 1. Since, by the very definition of support, the event {X ∈ RX} is a sure thing, then

    1 = P(X ∈ RX) = Σ_{x∈RX} pX(x)
Proposition 172 Let pX (x) be a function satisfying properties (30.1) and (30.2).
Then, there exists a discrete random variable X whose probability mass function is
pX (x).
Can we use g(x) to build a probability mass function? First of all, we have to check that g(x) is non-negative. This is obviously true, because x² is always non-negative. Then, we have to check that the sum of g(x) over RX exists and is finite and strictly positive:

    S = Σ_{x∈RX} g(x) = g(1) + g(2) + g(3) + g(4) + g(5)
      = 1 + 4 + 9 + 16 + 25 = 55

Therefore, the function pX(x) = g(x)/S for x ∈ RX (and 0 otherwise) is a legitimate probability mass function.
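The normalization step can be sketched in a few lines of Python:

```python
# Turning g(x) = x^2 on R_X = {1, 2, 3, 4, 5} into a pmf by dividing by S = 55.
support = [1, 2, 3, 4, 5]
g = {x: x**2 for x in support}
S = sum(g.values())                 # 55
p = {x: g[x] / S for x in support}  # legitimate pmf

print(S, sum(p.values()))   # 55 and 1.0: properties (30.1)-(30.2) hold
```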
Exercise 1

Consider the following function:

    pX(x) = (1/10) x  if x ∈ {1, 2, 3, 4}
            0         otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ {1, 2, 3, 4} we have

    pX(x) = (1/10) x > 0

while for x ∉ {1, 2, 3, 4} we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = pX(1) + pX(2) + pX(3) + pX(4)
                   = (1/10)·1 + (1/10)·2 + (1/10)·3 + (1/10)·4 = 10/10 = 1
Exercise 2

Consider the function

    pX(x) = (1/14) x²  if x ∈ {1, 2, 3}
            0          otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ {1, 2, 3} we have

    pX(x) = (1/14) x² > 0

while for x ∉ {1, 2, 3} we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = pX(1) + pX(2) + pX(3)
                   = (1/14)·1² + (1/14)·2² + (1/14)·3²
                   = (1/14)·1 + (1/14)·4 + (1/14)·9 = 14/14 = 1
Exercise 3

Consider the function

    pX(x) = (3/4)·4^(1−x)  if x ∈ ℕ
            0              otherwise

Prove that pX(x) is a legitimate probability mass function.

Solution

For x ∈ ℕ we have

    pX(x) = (3/4)·4^(1−x) > 0

because 4^(1−x) is strictly positive. For x ∉ ℕ we have

    pX(x) = 0

Therefore, pX(x) ≥ 0 for any x ∈ R, which implies that property (30.1) is satisfied. Also property (30.2) is satisfied, because

    Σ_{x∈RX} pX(x) = Σ_{x=1}^∞ pX(x) = Σ_{x=1}^∞ (3/4)·4^(1−x)
                   = (3/4) Σ_{x=1}^∞ (1/4)^(x−1)
                   = (3/4) [1 + 1/4 + (1/4)² + (1/4)³ + ...]
                   = (3/4) · 1/(1 − 1/4) = (3/4) · (4/3) = 1
Chapter 31

Legitimate probability density functions
1. non-negativity:

       fX(x) ≥ 0,  ∀x ∈ R                                         (31.1)

2. integral over R equals 1:

       ∫_{−∞}^{∞} fX(x) dx = 1                                     (31.2)

Proof. By the very definition of probability density function¹,

    P(X ∈ [a, b]) = ∫_a^b fX(x) dx

for any interval [a, b]. Probabilities cannot be negative; therefore, P(X ∈ [a, b]) ≥ 0 and

    ∫_a^b fX(x) dx ≥ 0

for any interval [a, b]. But the above integral can be non-negative for all intervals [a, b] only if the integrand function itself is non-negative, i.e. if fX(x) ≥ 0 for all x. This proves property (31.1). Furthermore, the probability of a sure thing² must be equal to 1. Since {X ∈ (−∞, ∞)} is a sure thing, then

    1 = P(X ∈ (−∞, ∞)) = ∫_{−∞}^{∞} fX(x) dx

which proves property (31.2).

¹ See p. 107.
Proposition 175 Let fX (x) be a function satisfying properties (31.1) and (31.2).
Then, there exists an absolutely continuous random variable X whose probability
density function is fX (x).
Proof. Define

    fX(x) = (1/I) g(x)

where

    I = ∫_{−∞}^{∞} g(x) dx

I is strictly positive, thus fX(x) is non-negative and it satisfies property (31.1). It also satisfies property (31.2), because

    ∫_{−∞}^{∞} fX(x) dx = ∫_{−∞}^{∞} (1/I) g(x) dx
                        = (1/I) ∫_{−∞}^{∞} g(x) dx
                        = (1/I) · I = 1

Therefore, any non-negative function g(x) can be used to construct a probability density function if its integral over R exists and is finite and strictly positive.
Consider the function

    g(x) = x²  if x ∈ [0, 1]
           0   otherwise

² See the properties of probability (p. 70).
³ Non-negative means that g(x) ≥ 0 for any x ∈ R.
Can we use g(x) to build a probability density function? First of all, we have to check that g(x) is non-negative. This is obviously true, because x² is always non-negative. Then, we have to check that the integral of g(x) over R exists and is finite and strictly positive:

    I = ∫_{−∞}^{∞} g(x) dx = ∫_0^1 x² dx
      = [x³/3]₀¹ = 1/3 − 0 = 1/3

Therefore, the function

    fX(x) = (1/I) g(x) = 3x²  if x ∈ [0, 1]
            0                 otherwise

is a legitimate probability density function.
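The normalization can be checked by numerical integration; the snippet below is a quick sketch using SciPy's quadrature routine:

```python
from scipy.integrate import quad

# Numerical check that I = 1/3 and that f(x) = g(x)/I = 3 x^2 integrates
# to one over [0, 1], so it is a legitimate pdf.
I, _ = quad(lambda x: x**2, 0.0, 1.0)
total, _ = quad(lambda x: 3 * x**2, 0.0, 1.0)

print(I, total)   # approximately 1/3 and 1.0
```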
Exercise 1

Consider the function

    fX(x) = λ exp(−λx)  if x ∈ [0, ∞)
            0           if x ∈ (−∞, 0)

where λ ∈ (0, ∞). Prove that fX(x) is a legitimate probability density function.

Solution

Since λ > 0 and the exponential function is strictly positive, fX(x) ≥ 0 for any x ∈ R, so the non-negativity property is satisfied. The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_0^∞ λ exp(−λx) dx
                        = [−exp(−λx)]₀^∞
                        = 0 − (−1) = 1
Exercise 2

Consider the function

    fX(x) = 1/(u − l)  if x ∈ [l, u]
            0          if x ∉ [l, u]

where l and u are two constants with l < u. Prove that fX(x) is a legitimate probability density function.

Solution

l < u implies 1/(u − l) > 0, so fX(x) ≥ 0 for any x ∈ R and the non-negativity property is satisfied. The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_l^u 1/(u − l) dx
                        = (1/(u − l)) ∫_l^u dx
                        = (1/(u − l)) [x]_l^u
                        = (1/(u − l)) (u − l) = 1
Exercise 3

Consider the function

    fX(x) = 2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2)  if x ∈ [0, ∞)
            0                                            if x ∉ [0, ∞)

Prove that fX(x) is a legitimate probability density function.

Solution

Remember the definition of the Gamma function:

    Γ(z) = ∫_0^∞ x^(z−1) exp(−x) dx

Γ(z) is obviously strictly positive for any z, since exp(−x) is strictly positive and x^(z−1) is strictly positive on the interval of integration (except at 0, where it is 0). Therefore, fX(x) satisfies the non-negativity property, because the four factors in the product

    2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2)

are all non-negative on the interval [0, ∞).

The integral property is also satisfied, because

    ∫_{−∞}^{∞} fX(x) dx = ∫_0^∞ 2^(−n/2) (Γ(n/2))^(−1) x^(n/2−1) exp(−x/2) dx
                        = 2^(−n/2) (Γ(n/2))^(−1) ∫_0^∞ x^(n/2−1) exp(−x/2) dx
                        =(A) 2^(−n/2) (Γ(n/2))^(−1) ∫_0^∞ (2t)^(n/2−1) exp(−t) 2 dt
                        = 2^(−n/2) (Γ(n/2))^(−1) 2^(n/2) ∫_0^∞ t^(n/2−1) exp(−t) dt
                        = (Γ(n/2))^(−1) ∫_0^∞ t^(n/2−1) exp(−t) dt
                        =(B) (Γ(n/2))^(−1) Γ(n/2)
                        = 1

where: in step A we have performed the change of variable x = 2t; in step B we have used the definition of the Gamma function⁴.

⁴ See p. 55.
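The same check can be carried out numerically for several values of the parameter n; the sketch below uses the standard-library Gamma function and SciPy quadrature:

```python
import math
from scipy.integrate import quad

# Numerical check that the Chi-square density above integrates to one
# for a few illustrative values of the degrees-of-freedom parameter n.
def chi2_pdf(x, n):
    return 2 ** (-n / 2) / math.gamma(n / 2) * x ** (n / 2 - 1) * math.exp(-x / 2)

totals = []
for n in (2, 3, 6, 10):
    total, _ = quad(chi2_pdf, 0, math.inf, args=(n,))
    totals.append(total)

print([round(t, 6) for t in totals])   # 1.0 for every n
```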
Chapter 32

Factorization of joint probability mass functions

This lecture discusses how to factorize the joint probability mass function¹ of two discrete random variables X and Y into two factors:

1. the conditional probability mass function² of X given Y = y;

2. the marginal probability mass function³ of Y.

The factorization is usually accomplished in two steps:

1. marginalize pXY(x, y) by summing it over all possible values of x and obtain the marginal probability mass function pY(y);

2. divide pXY(x, y) by pY(y) and obtain the conditional probability mass function pX|Y=y(x) (of course this step makes sense only when pY(y) > 0).

¹ See p. 116.
² See p. 210.
³ See p. 120.

In some cases, the first step (marginalization) can be difficult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of pXY(x, y) and verifying whether the guess is correct with the help of the following proposition:
Proposition 178 (factorization method) Suppose there are two functions h(y) and g(x, y) such that:

1. for any x and y, the following holds:

       pXY(x, y) = g(x, y) h(y)

2. for any fixed y, g(x, y), considered as a function of x, is a probability mass function.

Then:

    h(y) = pY(y)  and  g(x, y) = pX|Y=y(x)
Proof. By the definition of marginal probability mass function and by property 1:

    pY(y) = Σ_{x∈RX} pXY(x, y)
          = Σ_{x∈RX} g(x, y) h(y)
          =(A) h(y) Σ_{x∈RX} g(x, y)
          =(B) h(y)

where: in step A we have used the fact that h(y) does not depend on x; in step B we have used the fact that, for any fixed y, g(x, y), considered as a function of x, is a probability mass function and the sum⁴ of a probability mass function over its support equals 1. Therefore,

    pXY(x, y) = g(x, y) h(y) = g(x, y) pY(y)

which, in turn, implies

    g(x, y) = pXY(x, y)/pY(y) = pX|Y=y(x)
Thus, whenever we are given a formula for the joint probability mass function pXY(x, y) and we want to find the marginal and the conditional functions, we have to manipulate the formula and express it as the product of:

1. a function of x and y that, considered as a function of x for any fixed y, is a probability mass function (this will be the conditional pmf);

2. a function of y alone (this will be the marginal pmf).
As an example, let X be a 3×1 discrete random vector having a multinomial distribution with parameters p₁, p₂, p₃ (with p₁ + p₂ + p₃ = 1) and n. Its support is

    RX = {x ∈ Z³₊ : x₁ + x₂ + x₃ = n}

where x₁, x₂, x₃ denote the components of the vector x. The joint probability mass function of X is

    pX(x₁, x₂, x₃) = n!/(x₁! x₂! x₃!) p₁^{x₁} p₂^{x₂} p₃^{x₃}  if (x₁, x₂, x₃) ∈ RX
                     0                                          otherwise

Note that:

    n!/(x₁! x₂! x₃!) p₁^{x₁} p₂^{x₂} p₃^{x₃}
      = [(n−x₃)!/(x₁! x₂!)] · [p₁^{x₁} p₂^{x₂}/(1−p₃)^{n−x₃}] · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}
      =(A) [(n−x₃)!/(x₁! x₂!)] · [p₁^{x₁} p₂^{x₂}/(1−p₃)^{x₁+x₂}] · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}
      = [(n−x₃)!/(x₁! x₂!)] (p₁/(1−p₃))^{x₁} (p₂/(1−p₃))^{x₂} · [n!/((n−x₃)! x₃!)] p₃^{x₃} (1−p₃)^{n−x₃}

where in step A we have used the fact that, on the support,

    x₁ + x₂ + x₃ = n

so that n − x₃ = x₁ + x₂. Therefore, the joint pmf can be written as pX(x₁, x₂, x₃) = g(x₁, x₂, x₃) h(x₃), where

    g(x₁, x₂, x₃) = (n−x₃)!/(x₁! x₂!) (p₁/(1−p₃))^{x₁} (p₂/(1−p₃))^{x₂}  if (x₁, x₂) ∈ Z²₊ and x₁ + x₂ = n − x₃
                    0                                                      otherwise

and:

    h(x₃) = n!/((n−x₃)! x₃!) p₃^{x₃} (1−p₃)^{n−x₃}  if x₃ ∈ Z₊ and x₃ ≤ n
            0                                         otherwise

For any x₃ ≤ n, g(x₁, x₂, x₃) is the probability mass function of a multinomial distribution with parameters p₁/(1−p₃), p₂/(1−p₃) and n − x₃. Therefore, by the factorization method, h(x₃) is the marginal probability mass function of X₃ (a binomial distribution with parameters n and p₃) and g is the conditional probability mass function of (X₁, X₂) given X₃ = x₃.
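The factorization can be verified by brute force for a small multinomial: summing the joint pmf over (x₁, x₂) must reproduce the binomial marginal of X₃. The parameter values below are illustrative assumptions:

```python
import math
from itertools import product

# Brute-force check: marginalizing the multinomial joint pmf over (x1, x2)
# gives the binomial pmf of X3 with parameters n and p3.
n, p1, p2, p3 = 5, 0.2, 0.3, 0.5

def joint(x1, x2, x3):
    c = math.factorial(n) // (math.factorial(x1) * math.factorial(x2) * math.factorial(x3))
    return c * p1**x1 * p2**x2 * p3**x3

ok = []
for x3 in range(n + 1):
    marginal = sum(joint(x1, x2, x3)
                   for x1, x2 in product(range(n + 1), repeat=2)
                   if x1 + x2 + x3 == n)
    binom = math.comb(n, x3) * p3**x3 * (1 - p3)**(n - x3)
    ok.append(abs(marginal - binom) < 1e-12)

print(all(ok))   # True
```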
Chapter 33

Factorization of joint probability density functions

This lecture discusses how to factorize the joint probability density function¹ of two absolutely continuous random variables (or random vectors) X and Y into two factors:

1. the conditional probability density function² of X given Y = y;

2. the marginal probability density function³ of Y.

The factorization is usually accomplished in two steps:

1. marginalize fXY(x, y) by integrating it over all possible values of x and obtain the marginal probability density function fY(y);

2. divide fXY(x, y) by fY(y) and obtain the conditional probability density function fX|Y=y(x) (of course this step makes sense only when fY(y) > 0).

In some cases, the first step (marginalization) can be difficult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of fXY(x, y) and verifying whether the guess is correct with the help of the following proposition:
Proposition 181 (factorization method) Suppose there are two functions h(y) and g(x, y) such that:

1. for any x and y, the following holds:

       fXY(x, y) = g(x, y) h(y)

2. for any fixed y, g(x, y), considered as a function of x, is a probability density function.

Then:

    h(y) = fY(y)  and  g(x, y) = fX|Y=y(x)

Proof. By the definition of marginal probability density function and by property 1:

    fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx
          = ∫_{−∞}^{∞} g(x, y) h(y) dx
          =(A) h(y) ∫_{−∞}^{∞} g(x, y) dx
          =(B) h(y)

where: in step A we have used the fact that h(y) does not depend on x; in step B we have used the fact that, for any fixed y, g(x, y), considered as a function of x, is a probability density function and the integral of a probability density function over R equals 1 (see p. 251). Therefore
    fXY(x, y) = g(x, y) h(y) = g(x, y) fY(y)

which, in turn, implies

    g(x, y) = fXY(x, y)/fY(y) = fX|Y=y(x)

Thus, whenever we are given a formula for the joint density function fXY(x, y) and we want to find the marginal and the conditional functions, we have to manipulate the formula and express it as the product of:

1. a function of x and y that, considered as a function of x for any fixed y, is a probability density function (this will be the conditional pdf);

2. a function of y alone (this will be the marginal pdf).
As an example, consider a joint probability density function that can be written as

    fXY(x, y) = g(x, y) h(y)

where

    g(x, y) = y exp(−yx)  if x ∈ [0, ∞)
              0           otherwise

and

    h(y) = 1/2  if y ∈ [1, 3]
           0    otherwise

Note that g(x, y) is a probability density function in x for any fixed y (it is the probability density function of an exponential random variable⁴ with parameter y). Therefore, by the factorization method, h(y) = fY(y) is the marginal density of Y (a uniform density on the interval [1, 3]) and g(x, y) = fX|Y=y(x) is the conditional density of X given Y = y.

⁴ See p. 365.
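Property 2 of the factorization method can be verified numerically: for each fixed y, g(x, y) must integrate to one in x. A quick sketch:

```python
import math
from scipy.integrate import quad

# For each fixed y, g(x, y) = y * exp(-y x) is an exponential density in x
# and must integrate to one over [0, inf); h(y) = 1/2 on [1, 3] is then the
# marginal density of Y by the factorization method.
totals = []
for y in (1.0, 2.0, 3.0):
    total, _ = quad(lambda x, y=y: y * math.exp(-y * x), 0, math.inf)
    totals.append(total)

print([round(t, 6) for t in totals])   # 1.0 for every y
```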
Chapter 34

Functions of random variables and their distribution
Let X be a random variable with known distribution. Let another random variable
Y be a function of X:
Y = g (X)
where g : R → R. How do we derive the distribution of Y from the distribution of X?
There is no general answer to this question. However, there are several special
cases in which it is easy to derive the distribution of Y . We discuss these cases
below.
    RY = {y = g(x) : x ∈ RX}

¹ See p. 108.
Proof. Of course, the support RY is determined by g(x) and by all the values X can take. The distribution function of Y can be derived as follows:

1. if y is lower than the lowest value Y can take on, then P(Y ≤ y) = 0, so:

       FY(y) = 0  if y < ȳ, ∀ȳ ∈ RY

2. if y belongs to the support of Y, then:

       FY(y) =(A) P(Y ≤ y)
             =(B) P(g(X) ≤ y)
             =(C) P(g⁻¹(g(X)) ≤ g⁻¹(y))
             = P(X ≤ g⁻¹(y))
             =(D) FX(g⁻¹(y))

3. if y is higher than the highest value Y can take on, then P(Y ≤ y) = 1, so:

       FY(y) = 1  if y > ȳ, ∀ȳ ∈ RY
    RX = [1, 2]

Let

    Y = X²
    RY = {y = g(x) : x ∈ RX}

Proof. This proposition is a trivial consequence of the fact that a strictly increasing function is invertible:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
    RX = {1, 2, 3}

Let

    Y = g(X) = 3 + X²

The support of Y is

    RY = {4, 7, 12}
    RY = {y = g(x) : x ∈ RX}

Proof. Of course, the support RY is determined by g(x) and by all the values X can take. The distribution function of Y can be derived as follows:

1. if y is lower than the lowest value Y can take on, then P(Y ≤ y) = 0, so:

       FY(y) = 0  if y < ȳ, ∀ȳ ∈ RY

2. if y belongs to the support of Y, then:

       FY(y) =(A) P(Y ≤ y)
             = 1 − P(Y > y)
             =(B) 1 − P(g(X) > y)
             =(C) 1 − P(g⁻¹(g(X)) < g⁻¹(y))
             = 1 − P(X < g⁻¹(y))
             = 1 − P(X < g⁻¹(y)) − P(X = g⁻¹(y)) + P(X = g⁻¹(y))
             = 1 − P(X ≤ g⁻¹(y)) + P(X = g⁻¹(y))
             =(D) 1 − FX(g⁻¹(y)) + P(X = g⁻¹(y))

3. if y is higher than the highest value Y can take on, then P(Y ≤ y) = 1, so:

       FY(y) = 1  if y > ȳ, ∀ȳ ∈ RY
Proof. The proof of this proposition is identical to the proof of the proposition for strictly increasing functions. In fact, the only property that matters is that a strictly decreasing function is invertible:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
Let X be a discrete random variable with support

    RX = {1, 2, 3}

and probability mass function pX(x) = (1/14) x² on the support. Let

    Y = g(X) = 1 − 2X

The support of Y is

    RY = {−5, −3, −1}

The function g is strictly decreasing and its inverse is

    g⁻¹(y) = 1/2 − (1/2) y

The probability mass function of Y is

    pY(y) = (1/14) (1/2 − (1/2) y)²  if y ∈ RY
            0                        if y ∉ RY
    RY = {y = g(x) : x ∈ RX}
Example 194 Let X be a uniform random variable⁴ on the interval [0, 1], i.e. an absolutely continuous random variable with support

    RX = [0, 1]

and probability density function

    fX(x) = 1  if x ∈ RX
            0  if x ∉ RX

Let

    Y = g(X) = −ln(X)

The support of Y is

    RY = [0, ∞)

where we can safely ignore the fact that g(0) = ∞, because {X = 0} is a zero-probability event⁵. The function g is strictly decreasing and its inverse is

    g⁻¹(y) = exp(−y)

with derivative

    dg⁻¹(y)/dy = −exp(−y)

The probability density function of Y is

    fY(y) = exp(−y)  if y ∈ RY
            0        if y ∉ RY
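This is the classical inverse-transform construction of an exponential variable, and it is easy to check by simulation:

```python
import numpy as np

# Monte Carlo check of Example 194: if X is uniform on [0, 1], then
# Y = -ln(X) has density exp(-y) on [0, inf), i.e. Y is a standard
# exponential variable with mean 1 and variance 1.
rng = np.random.default_rng(0)
y = -np.log(rng.uniform(size=1_000_000))

print(y.mean(), y.var())   # both close to 1
```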
    RY = {y = g(x) : x ∈ RX}

Proof. The proof of this proposition is identical to the proofs of the propositions for strictly increasing and strictly decreasing functions found above:

    pY(y) = P(Y = y)
          = P(g(X) = y)
          = P(X = g⁻¹(y))
          = pX(g⁻¹(y))
    RY = {y = g(x) : x ∈ RX}

If

    dg⁻¹(y)/dy ≠ 0,  ∀y ∈ RY

then the probability density function of Y is

    fY(y) = fX(g⁻¹(y)) |dg⁻¹(y)/dy|  if y ∈ RY
            0                         if y ∉ RY
Exercise 1

Let X be an absolutely continuous random variable with support

    RX = [0, 2]

and probability density function

    fX(x) = (3/8) x²  if x ∈ RX
            0         otherwise

Let

    Y = g(X) = √(X + 1)

Find the probability density function of Y.

Solution

The support of Y is

    RY = [1, √3]

The function g is strictly increasing and its inverse is

    g⁻¹(y) = y² − 1

with derivative

    dg⁻¹(y)/dy = 2y

The probability density function of Y is

    fY(y) = (3/8) (y² − 1)² · 2y  if y ∈ RY
            0                     if y ∉ RY
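A quick numerical sanity check: the derived density must integrate to one over its support [1, √3]:

```python
import math
from scipy.integrate import quad

# Check that f_Y(y) = (3/8)(y^2 - 1)^2 * 2y integrates to one on [1, sqrt(3)],
# confirming the change-of-variable computation above.
total, _ = quad(lambda y: (3 / 8) * (y**2 - 1) ** 2 * 2 * y, 1.0, math.sqrt(3))

print(round(total, 6))   # 1.0
```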
Exercise 2

Let X be an absolutely continuous random variable with support

    RX = [0, 2]

and probability density function

    fX(x) = 1/2  if x ∈ RX
            0    otherwise

Let

    Y = g(X) = −X²

Find the probability density function of Y.

Solution

The support of Y is

    RY = [−4, 0]

The function g is strictly decreasing and its inverse is

    g⁻¹(y) = (−y)^(1/2)

with derivative

    dg⁻¹(y)/dy = −(1/2) (−y)^(−1/2)

The probability density function of Y is

    fY(y) = (1/2)·(1/2) (−y)^(−1/2) = (1/4) (−y)^(−1/2)  if y ∈ RY
            0                                             if y ∉ RY
Exercise 3

Let X be a discrete random variable with support

    RX = {1, 2, 3, 4}

Let

    Y = g(X) = X − 1

Find the probability mass function of Y.

Solution

The support of Y is

    RY = {0, 1, 2, 3}

The function g is strictly increasing and its inverse is

    g⁻¹(y) = y + 1

Therefore, the probability mass function of Y is

    pY(y) = pX(y + 1)  if y ∈ RY
            0          if y ∉ RY
    RY = {y = g(x) : x ∈ RX}

Proof. If y ∈ RY, then

    pY(y) = P(Y = y) = P(g(X) = y) = P(X = g⁻¹(y)) = pX(g⁻¹(y))

where we have used the fact that g is one-to-one on the support of Y, and hence it possesses an inverse g⁻¹(y). If y ∉ RY, then, trivially, pY(y) = 0.
Example 198 Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂. Let the support of X be

    RX = {[1 1]ᵀ, [2 0]ᵀ}

Let

    Y = g(X) = 2X

The support of Y is

    RY = {[2 2]ᵀ, [4 0]ᵀ}
one-to-one and differentiable on the support of X. Denote by J_{g⁻¹}(y) the Jacobian matrix of g⁻¹(y), i.e.,

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁   ∂x₁/∂y₂   ...  ∂x₁/∂y_K ]
                 [ ∂x₂/∂y₁   ∂x₂/∂y₂   ...  ∂x₂/∂y_K ]
                 [    ...       ...    ...     ...    ]
                 [ ∂x_K/∂y₁  ∂x_K/∂y₂  ...  ∂x_K/∂y_K ]

where yᵢ is the i-th component of y, and xᵢ is the i-th component of x = g⁻¹(y). Then, the support of Y = g(X) is

    RY = {y = g(x) : x ∈ RX}

If

    det J_{g⁻¹}(y) ≠ 0,  ∀y ∈ RY

then the joint probability density function of Y is

    fY(y) = fX(g⁻¹(y)) |det J_{g⁻¹}(y)|  if y ∈ RY
            0                             if y ∉ RY
    Y = a + BX
Let X be a 2×1 absolutely continuous random vector with support

    RX = [1, 2] × [0, ∞)

and define

    Y₁ = 3X₁
    Y₂ = −X₂

The inverse function g⁻¹(y) is defined by

    x₁ = y₁/3
    x₂ = −y₂

The Jacobian matrix of g⁻¹(y) is

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁  ∂x₁/∂y₂ ] = [ 1/3   0 ]
                 [ ∂x₂/∂y₁  ∂x₂/∂y₂ ]   [  0   −1 ]

Its determinant is

    det J_{g⁻¹}(y) = (1/3)·(−1) − 0·0 = −1/3

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = 3x₁, y₂ = −x₂, x₁ ∈ [1, 2], x₂ ∈ [0, ∞)}
       = [3, 6] × (−∞, 0]
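Since the inverse map is linear, its Jacobian is constant and the determinant can be checked directly (the sign convention x₂ = −y₂ is the one implied by the determinant −1/3 and the support (−∞, 0]):

```python
import numpy as np

# Jacobian of the inverse map x1 = y1/3, x2 = -y2; its determinant is -1/3,
# so |det J| = 1/3 enters the change-of-variable formula for densities.
J = np.array([[1 / 3, 0.0],
              [0.0, -1.0]])
d = np.linalg.det(J)

print(d)   # -1/3
```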
If

    g(x) = x₁ + ... + x_K

then the distribution of Y = g(X) can be derived by using the convolution formulae illustrated in the lecture entitled Sums of independent random variables (p. 323).
by using the transformation theorem. If φY(t) is recognized as the joint characteristic function of a known distribution, then such a distribution is the distribution of Y, because two random vectors have the same distribution if and only if they have the same joint characteristic function.
Exercise 1

Let X₁ be a uniform random variable⁷ with support

    RX₁ = [1, 2]

and probability density function

    fX₁(x₁) = 1  if x₁ ∈ RX₁
              0  if x₁ ∉ RX₁

Let X₂ be an absolutely continuous random variable, independent of X₁, with support RX₂ = [0, 2] and probability density function

    fX₂(x₂) = (3/8) x₂²  if x₂ ∈ RX₂
              0          if x₂ ∉ RX₂

Define

    Y₁ = X₁²
    Y₂ = X₁ + X₂

Find the joint probability density function of Y = [Y₁ Y₂]ᵀ.

⁴ See p. 297.
⁵ See p. 134.
⁶ See p. 315.
⁷ See p. 359.

Solution

Since X₁ and X₂ are independent, their joint probability density function is equal to the product of their marginal density functions:

    fX(x₁, x₂) = fX₁(x₁) fX₂(x₂) = (3/8) x₂²  if x₁ ∈ [1, 2] and x₂ ∈ [0, 2]
                 0                             otherwise

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = x₁², y₂ = x₁ + x₂, x₁ ∈ [1, 2], x₂ ∈ [0, 2]}
       = {[y₁ y₂]ᵀ : y₁ ∈ [1, 4], y₂ ∈ [√y₁, √y₁ + 2]}

The function y = g(x) is one-to-one on the support of X and its inverse g⁻¹(y) is defined by

    x₁ = √y₁
    x₂ = y₂ − √y₁

The determinant of the Jacobian matrix of g⁻¹(y) is (1/2) y₁^(−1/2), so that, for y ∈ RY,

    fY(y) = fX(g⁻¹(y)) |det J_{g⁻¹}(y)| = fX(√y₁, y₂ − √y₁) · (1/2) y₁^(−1/2)
          = (3/8) (y₂ − √y₁)² · (1/2) y₁^(−1/2) = (3/16) y₁^(−1/2) (y₂ − √y₁)²

while, for y ∉ RY, the joint probability density function is fY(y) = 0.
Exercise 2

Let X be a 2×1 random vector with support

    RX = [0, ∞) × [0, ∞)

and define

    Y₁ = 2X₁
    Y₂ = X₁ + X₂

Find the support of Y and the determinant of the Jacobian matrix of g⁻¹.

Solution

The inverse function g⁻¹(y) is defined by

    x₁ = y₁/2
    x₂ = y₂ − y₁/2

The Jacobian matrix of g⁻¹(y) is

    J_{g⁻¹}(y) = [ ∂x₁/∂y₁  ∂x₁/∂y₂ ] = [  1/2   0 ]
                 [ ∂x₂/∂y₁  ∂x₂/∂y₂ ]   [ −1/2   1 ]

Its determinant is

    det J_{g⁻¹}(y) = (1/2)·1 − 0·(−1/2) = 1/2

The support of Y is

    RY = {[y₁ y₂]ᵀ : y₁ = 2x₁, y₂ = x₁ + x₂, x₁ ∈ [0, ∞), x₂ ∈ [0, ∞)}
       = {[y₁ y₂]ᵀ : y₁ ∈ [0, ∞), y₂ ∈ [y₁/2, ∞)}
Chapter 36

Moments and cross-moments

This lecture introduces the notions of moment of a random variable and cross-moment of a random vector.

36.1 Moments

36.1.1 Definition of moment

The n-th moment of a random variable X is the expected value of its n-th power:

    μX(n) = E[Xⁿ]

If E[Xⁿ] exists and is finite, then X is said to possess a finite n-th moment and μX(n) is called the n-th moment of X. If E[Xⁿ] is not well-defined, then we say that X does not possess the n-th moment.

The n-th central moment of a random variable X is the expected value of the n-th power of the deviation of X from its expected value:

    μ̄X(n) = E[(X − E[X])ⁿ]

If E[(X − E[X])ⁿ] exists and is finite, then X is said to possess a finite n-th central moment and μ̄X(n) is called the n-th central moment of X.
36.2 Cross-moments

36.2.1 Definition of cross-moment

Let X be a K×1 random vector. A cross-moment of X is the expected value of the product of integer powers of the entries of X:

    E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]

If the above expected value exists and is finite, it is denoted by

    μX(n₁, n₂, ..., n_K) = E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]          (36.2)
Example 205 Let X be a 3×1 discrete random vector and denote its components by X₁, X₂ and X₃. Let the support of X be

    RX = {[1 2 1]ᵀ, [2 1 3]ᵀ, [3 3 2]ᵀ}

and let each of the three points of the support have probability 1/3. The cross-moment

    μX(1, 2, 1) = E[X₁ X₂² X₃]

can be computed as

    μX(1, 2, 1) = E[X₁ X₂² X₃]
                = Σ_{(x₁,x₂,x₃)∈RX} x₁ x₂² x₃ pX(x₁, x₂, x₃)
                = 1·2²·1·pX(1, 2, 1) + 2·1²·3·pX(2, 1, 3) + 3·3²·2·pX(3, 3, 2)
                = 4·(1/3) + 6·(1/3) + 54·(1/3) = 64/3
¹ See p. 117.
² See p. 134.
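The computation in Example 205 is easy to reproduce exactly with rational arithmetic:

```python
from fractions import Fraction

# Direct computation of mu_X(1, 2, 1) = E[X1 * X2^2 * X3] in Example 205,
# where each of the three support points has probability 1/3.
support = [(1, 2, 1), (2, 1, 3), (3, 3, 2)]
moment = sum(Fraction(1, 3) * x1 * x2**2 * x3 for x1, x2, x3 in support)

print(moment)   # 64/3
```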
If the above expected value exists and is finite, the n-th central cross-moment of X is defined as

    μ̄X(n₁, n₂, ..., n_K) = E[ ∏_{k=1}^K (X_k − E[X_k])^{n_k} ]          (36.4)

Example 207 Let X be a 3×1 discrete random vector and denote its components by X₁, X₂ and X₃. Let the support of X be

    RX = {[4 2 4]ᵀ, [2 1 1]ᵀ, [1 3 2]ᵀ}
Chapter 37

Moment generating function of a random variable

37.1 Definition

We start this lecture by giving a definition of mgf.

Let X be a random variable. If the expected value

    E[exp(tX)]

exists and is finite for all real numbers t belonging to a closed interval [−h, h] ⊂ R, with h > 0, then we say that X possesses a moment generating function and the function MX : [−h, h] → R defined by

    MX(t) = E[exp(tX)]

is called the moment generating function (mgf) of X.

The following example shows how the mgf of an exponential random variable is derived.

¹ See p. 285.
² See p. 307.
Example 209 Let X be an exponential random variable³ with parameter λ ∈ (0, ∞), i.e. an absolutely continuous random variable with support

    RX = [0, ∞)

and probability density function

    fX(x) = λ exp(−λx)  if x ∈ RX
            0           if x ∉ RX

The expected value E[exp(tX)] can be computed as follows:

    E[exp(tX)] = ∫_0^∞ exp(tx) λ exp(−λx) dx = λ ∫_0^∞ exp(−(λ − t)x) dx
               =(A) λ/(λ − t)

where: in step A we have assumed that t < λ, which is necessary for the integral to be finite. Therefore, the expected value exists and is finite for t ∈ [−h, h] if h is such that 0 < h < λ, and X possesses an mgf

    MX(t) = λ/(λ − t)
Proposition 210 If a random variable X possesses an mgf MX(t), then, for any n ∈ ℕ, the n-th moment of X, denoted by μX(n), exists and is finite. Furthermore,

    μX(n) = E[Xⁿ] = dⁿMX(t)/dtⁿ |_{t=0}

where dⁿMX(t)/dtⁿ |_{t=0} is the n-th derivative of MX(t) with respect to t, evaluated at the point t = 0.

Proof. Proving the above proposition is quite complicated, because a lot of analytical details must be taken care of (see, e.g., Pfeiffer⁴ - 1978). The intuition, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions we can differentiate through the expected value, as follows:

    dⁿMX(t)/dtⁿ = dⁿ/dtⁿ E[exp(tX)] = E[dⁿ/dtⁿ exp(tX)] = E[Xⁿ exp(tX)]

Evaluating at t = 0, we obtain

    dⁿMX(t)/dtⁿ |_{t=0} = E[Xⁿ exp(0·X)] = E[Xⁿ] = μX(n)

³ See p. 365.
⁴ Pfeiffer, P. E. (1978) Concepts of probability theory, Courier Dover Publications.
Example 211 In Example 209 we have demonstrated that the mgf of an exponential random variable is

    MX(t) = λ/(λ − t)

The expected value of X can be computed by taking the first derivative of the mgf:

    dMX(t)/dt = λ/(λ − t)²

and evaluating it at t = 0:

    E[X] = dMX(t)/dt |_{t=0} = λ/(λ − 0)² = 1/λ

The second moment of X can be computed by taking the second derivative of the mgf:

    d²MX(t)/dt² = 2λ/(λ − t)³

and evaluating it at t = 0:

    E[X²] = d²MX(t)/dt² |_{t=0} = 2λ/(λ − 0)³ = 2/λ²
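The derivatives in Example 211 can be verified symbolically; a short sketch with SymPy:

```python
import sympy as sp

# Symbolic check of Example 211: the first two derivatives of the exponential
# mgf lambda/(lambda - t), evaluated at t = 0, give 1/lambda and 2/lambda^2.
t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)

m1 = sp.diff(M, t).subs(t, 0)
m2 = sp.diff(M, t, 2).subs(t, 0)

print(sp.simplify(m1), sp.simplify(m2))   # 1/lam and 2/lam**2
```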
Proof. For a fully general proof of this proposition see, e.g., Feller⁶ (2008). We just give an informal proof for the special case in which X and Y are discrete random variables taking only finitely many values. The "only if" part is trivial. If X and Y have the same distribution, then

    MX(t) = E[exp(tX)] = E[exp(tY)] = MY(t)

The "if" part is proved as follows. Denote by RX and RY the supports of X and Y, and by pX(x) and pY(y) their probability mass functions⁷. Denote by A the union of the two supports:

    A = RX ∪ RY

and by a₁, ..., aₙ the elements of A. The mgf of X can be written as

    MX(t) = E[exp(tX)]
          =(A) Σ_{x∈RX} exp(tx) pX(x)
          =(B) Σ_{i=1}^n exp(t aᵢ) pX(aᵢ)

where: in step A we have used the definition of expected value; in step B we have used the fact that pX(aᵢ) = 0 when aᵢ ∉ RX. An analogous expression holds for MY(t). If X and Y have the same mgf, then, for any t belonging to a closed neighborhood of zero,

    MX(t) = MY(t)

and

    Σ_{i=1}^n exp(t aᵢ) pX(aᵢ) = Σ_{i=1}^n exp(t aᵢ) pY(aᵢ)

This can be true for any t belonging to a closed neighborhood of zero only if

    pX(aᵢ) − pY(aᵢ) = 0

for every i. It follows that the probability mass functions of X and Y are equal. As a consequence, also their distribution functions are equal.
It must be stressed that this proposition is extremely important and relevant from a practical viewpoint: in many cases where we need to prove that two distributions are equal, it is much easier to prove equality of the mgfs than to prove equality of the distribution functions.
⁶ Feller, W. (2008) An introduction to probability theory and its applications, Volume 2, Wiley.
⁷ See p. 106.
Also note that equality of the distribution functions can be replaced in the
proposition above by equality of the probability mass functions8 if X and Y are
discrete random variables, or by equality of the probability density functions9 if X
and Y are absolutely continuous random variables.
Let X be a random variable possessing an mgf MX(t), and define

    Y = a + bX

where a, b ∈ R are two constants and b ≠ 0. Then, the random variable Y possesses an mgf MY(t) and

    MY(t) = exp(at) MX(bt)
Exercise 1

Let X be a discrete random variable having a Bernoulli distribution¹² with parameter p. Its support is

    RX = {0, 1}

and its probability mass function¹³ is

    pX(x) = p      if x = 1
            1 − p  if x = 0
            0      if x ∉ RX

Derive the mgf of X, if it exists.

Solution

Using the definition of mgf, we get

    MX(t) = E[exp(tX)] = Σ_{x∈RX} exp(tx) pX(x)
          = exp(t·1) pX(1) + exp(t·0) pX(0)
          = exp(t)·p + 1·(1 − p) = 1 − p + p exp(t)

The mgf exists and it is well-defined because the above expected value exists for any t ∈ R.

¹¹ See p. 234.
¹² See p. 335.
¹³ See p. 106.
Exercise 2

Let X be a random variable with mgf

    MX(t) = (1/2)(1 + exp(t))

Derive the variance of X.

Solution

We can use the following formula for computing the variance¹⁴:

    Var[X] = E[X²] − (E[X])²

The expected value of X is computed by taking the first derivative of the mgf:

    dMX(t)/dt = (1/2) exp(t)

and evaluating it at t = 0:

    E[X] = dMX(t)/dt |_{t=0} = (1/2) exp(0) = 1/2

The second moment of X is computed by taking the second derivative of the mgf:

    d²MX(t)/dt² = (1/2) exp(t)

and evaluating it at t = 0:

    E[X²] = d²MX(t)/dt² |_{t=0} = (1/2) exp(0) = 1/2

Therefore,

    Var[X] = E[X²] − (E[X])² = 1/2 − (1/2)²
           = 1/2 − 1/4 = 1/4
Exercise 3

A random variable X is said to have a Chi-square distribution¹⁵ with n degrees of freedom if its mgf is defined for any t < 1/2 and it is equal to

    MX(t) = (1 − 2t)^(−n/2)

Define

    Y = X₁ + X₂

where X₁ and X₂ are two independent random variables having Chi-square distributions with n₁ and n₂ degrees of freedom respectively. Prove that Y has a Chi-square distribution with n₁ + n₂ degrees of freedom.

¹⁴ See p. 156.
¹⁵ See p. 387.

Solution

The mgfs of X₁ and X₂ are

    MX₁(t) = (1 − 2t)^(−n₁/2)
    MX₂(t) = (1 − 2t)^(−n₂/2)

The mgf of a sum of independent random variables is the product of the mgfs of the summands:

    MY(t) = (1 − 2t)^(−n₁/2) (1 − 2t)^(−n₂/2) = (1 − 2t)^(−(n₁+n₂)/2)

which is the mgf of a Chi-square distribution with n₁ + n₂ degrees of freedom. Since the mgf uniquely determines the distribution, Y has a Chi-square distribution with n₁ + n₂ degrees of freedom.
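The identity between the product of the two mgfs and the mgf of the sum can be confirmed symbolically:

```python
import sympy as sp

# Symbolic check that the product of two Chi-square mgfs equals the mgf of
# a Chi-square distribution with n1 + n2 degrees of freedom.
t, n1, n2 = sp.symbols("t n1 n2", positive=True)
M1 = (1 - 2 * t) ** (-n1 / 2)
M2 = (1 - 2 * t) ** (-n2 / 2)
MY = (1 - 2 * t) ** (-(n1 + n2) / 2)

print(sp.simplify(M1 * M2 - MY))   # 0
```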
Chapter 38

Joint moment generating function of a random vector

38.1 Definition

Let us start with a formal definition.

Let X be a K×1 random vector. If the expected value

    E[exp(tᵀX)]

exists and is finite for all K×1 real vectors t belonging to a closed rectangle H such that

    H = [−h₁, h₁] × [−h₂, h₂] × ... × [−h_K, h_K] ⊂ R^K

with hᵢ > 0 for all i = 1, ..., K, then we say that X possesses a joint moment generating function and the function MX : H → R defined by

    MX(t) = E[exp(tᵀX)]

is called the joint moment generating function of X.
Let X be a K×1 standard normal random vector, i.e. an absolutely continuous random vector with joint probability density function³

    fX(x) = (2π)^(−K/2) exp(−(1/2) xᵀx)

As explained in the lecture entitled Multivariate normal distribution (p. 439), the K components of X are K mutually independent⁴ standard normal random variables, because the joint probability density function of X can be written as

    fX(x) = f(x₁) f(x₂) ... f(x_K)

where xᵢ is the i-th entry of x, and f(xᵢ) is the probability density function of a standard normal random variable:

    f(xᵢ) = (2π)^(−1/2) exp(−(1/2) xᵢ²)

The joint mgf of X can be derived as follows:

    MX(t) = E[exp(tᵀX)] = E[exp(t₁X₁ + ... + t_K X_K)]
          =(A) ∏_{i=1}^K E[exp(tᵢXᵢ)]
          =(B) ∏_{i=1}^K MXᵢ(tᵢ)

where: in step A we have used the fact that the entries of X are mutually independent⁵; in step B we have used the definition of mgf of a random variable⁶. Since the mgf of a standard normal random variable is⁷

    MXᵢ(tᵢ) = exp((1/2) tᵢ²)

the joint mgf of X is

    MX(t) = ∏_{i=1}^K MXᵢ(tᵢ) = ∏_{i=1}^K exp((1/2) tᵢ²)
          = exp((1/2) Σ_{i=1}^K tᵢ²) = exp((1/2) tᵀt)

Note that the mgf MXᵢ(tᵢ) of a standard normal random variable is defined for any tᵢ ∈ R. As a consequence, the joint mgf of X is defined for any t ∈ R^K.

³ See p. 117.
⁴ See p. 233.
⁵ See p. 234.
⁶ See p. 289.
⁷ See p. 378.
38.2 Cross-moments and joint mgfs
Let X be a K×1 random vector possessing a joint mgf MX(t). If the cross-moment

    μX(n₁, n₂, ..., n_K) = E[X₁^{n₁} X₂^{n₂} ... X_K^{n_K}]

exists and is finite, where n₁, n₂, ..., n_K ∈ Z₊ and n = Σ_{k=1}^K n_k, then

    μX(n₁, n₂, ..., n_K) = ∂ⁿ MX(t) / (∂t₁^{n₁} ∂t₂^{n₂} ... ∂t_K^{n_K}) |_{t₁=0, ..., t_K=0}

where the derivative on the right-hand side is an n-th order cross-partial derivative of MX(t) evaluated at the point t₁ = 0, t₂ = 0, ..., t_K = 0.

Proof. We do not provide a rigorous proof of this proposition, but see, e.g., Pfeiffer⁸ (1978) and DasGupta⁹ (2010). The intuition of the proof, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions one can differentiate through the expected value, as follows:

    ∂ⁿ MX(t) / (∂t₁^{n₁} ... ∂t_K^{n_K}) = E[X₁^{n₁} ... X_K^{n_K} exp(tᵀX)]

which, evaluated at t = 0, gives the cross-moment μX(n₁, n₂, ..., n_K).

The following example shows how the above proposition can be applied.
Example 218 Let us continue with the previous example. The joint mgf of a 2×1 standard normal random vector X is

    MX(t) = exp((1/2) tᵀt) = exp((1/2) t₁² + (1/2) t₂²)

The cross-moment μX(1, 1) = E[X₁X₂] can be computed by taking the cross-partial derivative of the joint mgf and evaluating it at zero:

    μX(1, 1) = E[X₁X₂]
             = ∂²/∂t₁∂t₂ exp((1/2) t₁² + (1/2) t₂²) |_{t₁=0, t₂=0}
             = ∂/∂t₁ [t₂ exp((1/2) t₁² + (1/2) t₂²)] |_{t₁=0, t₂=0}
             = t₁ t₂ exp((1/2) t₁² + (1/2) t₂²) |_{t₁=0, t₂=0}
             = 0

⁸ Pfeiffer, P. E. (1978) Concepts of probability theory, Courier Dover Publications.
⁹ DasGupta, A. (2010) Fundamentals of probability: a first course, Springer.
Proof. The reader may refer to Feller¹¹ (2008) for a rigorous proof. The informal proof given here is almost identical to that given for the univariate case¹². We confine our attention to the case in which X and Y are discrete random vectors taking only finitely many values. As far as the left-to-right direction of the implication is concerned, it suffices to note that if X and Y have the same distribution, then

    MX(t) = E[exp(tᵀX)] = E[exp(tᵀY)] = MY(t)

As far as the reverse direction is concerned, denote by A the union of the two supports:

    A = RX ∪ RY

and by a₁, ..., aₙ the elements of A. If X and Y have the same joint mgf, then

    MX(t) = MY(t)

for any t belonging to a closed rectangle where the two mgfs are well-defined, and

    Σ_{i=1}^n exp(tᵀaᵢ) pX(aᵢ) = Σ_{i=1}^n exp(tᵀaᵢ) pY(aᵢ)

This can be true only if

    pX(aᵢ) − pY(aᵢ) = 0

for every i. As a consequence, the joint probability mass functions of X and Y are equal, which implies that also their joint distribution functions are equal.
This proposition is used very often in applications where one needs to demonstrate that two joint distributions are equal. In such applications, proving equality of the joint mgfs is often much easier than proving equality of the joint distribution functions (see also the comments to Proposition 212).
If the K entries X₁, ..., X_K of X are mutually independent, the joint mgf factorizes into the product of the mgfs of the entries:

    MX(t) = E[exp(tᵀX)] = E[exp(t₁X₁) exp(t₂X₂) ... exp(t_K X_K)]
          =(A) ∏_{i=1}^K E[exp(tᵢXᵢ)]
          =(B) ∏_{i=1}^K MXᵢ(tᵢ)

where: in step A we have used the fact that the entries of X are mutually independent; in step B we have used the definition of mgf of a random variable.

Similarly, if X₁, ..., Xₙ are mutually independent K×1 random vectors possessing joint mgfs, the joint mgf of their sum is the product of their joint mgfs:

    M_{X₁+...+Xₙ}(t) = E[exp(tᵀ(X₁ + ... + Xₙ))] = E[exp(tᵀX₁) ... exp(tᵀXₙ)]
          =(A) ∏_{i=1}^n E[exp(tᵀXᵢ)]
          =(B) ∏_{i=1}^n MXᵢ(t)

where: in step A we have used the fact that the vectors Xᵢ are mutually independent; in step B we have used the definition of joint mgf.
Exercise 1

Let X be a 2×1 discrete random vector and denote its components by X₁ and X₂. Let the support of X be

    RX = {[1 1]ᵀ, [2 0]ᵀ, [0 0]ᵀ}

and let each of the three points of the support have probability 1/3. Derive the joint mgf of X.

Solution

By using the definition of joint mgf, we get

    MX(t₁, t₂) = E[exp(t₁X₁ + t₂X₂)]
               = (1/3) exp(t₁ + t₂) + (1/3) exp(2t₁) + (1/3)
Exercise 2

Let

    X = [X₁ X₂]ᵀ

be a 2×1 random vector with joint mgf

    MX₁,X₂(t₁, t₂) = 1/3 + (2/3) exp(t₁ + 2t₂)

Derive the expected value of X₁.

Solution

The mgf of X₁ is

    MX₁(t₁) = MX₁,X₂(t₁, 0) = 1/3 + (2/3) exp(t₁)

The expected value of X₁ is computed by taking the first derivative of the mgf:

    dMX₁(t₁)/dt₁ = (2/3) exp(t₁)

and evaluating it at t₁ = 0:

    E[X₁] = dMX₁(t₁)/dt₁ |_{t₁=0} = (2/3) exp(0) = 2/3
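The same derivative can be taken directly on the joint mgf, which is easy to verify symbolically (E[X₂] = 4/3 follows the same way from the t₂-derivative):

```python
import sympy as sp

# Symbolic check of Exercise 2: differentiating the joint mgf
# M(t1, t2) = 1/3 + (2/3) exp(t1 + 2 t2) at the origin gives E[X1] = 2/3.
t1, t2 = sp.symbols("t1 t2")
M = sp.Rational(1, 3) + sp.Rational(2, 3) * sp.exp(t1 + 2 * t2)

EX1 = sp.diff(M, t1).subs({t1: 0, t2: 0})
EX2 = sp.diff(M, t2).subs({t1: 0, t2: 0})

print(EX1, EX2)   # 2/3 and 4/3
```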
Exercise 3

Let

    X = [X₁ X₂]ᵀ

be a 2×1 random vector with joint mgf MX₁,X₂(t₁, t₂). Derive the covariance between X₁ and X₂.

Solution

We can use the following covariance formula:

    Cov[X₁, X₂] = E[X₁X₂] − E[X₁] E[X₂]

The mgf of X₁ is MX₁(t₁) = MX₁,X₂(t₁, 0), and the mgf of X₂ is MX₂(t₂) = MX₁,X₂(0, t₂); their first derivatives, evaluated at zero, give E[X₁] and E[X₂]. The cross-moment E[X₁X₂] is obtained by taking the second-order cross-partial derivative of the joint mgf,

    ∂²MX₁,X₂(t₁, t₂)/∂t₁∂t₂ = (1/3)[2 exp(t₁ + 2t₂) + 2 exp(2t₁ + t₂)]

and evaluating it at (t₁, t₂) = (0, 0).
Chapter 39

Characteristic function of a random variable
In the lecture entitled Moment generating function (p. 289), we have explained
that the distribution of a random variable can be characterized in terms of its
moment generating function, a real function that enjoys two important properties:
it uniquely determines its associated probability distribution, and its derivatives
at zero are equal to the moments of the random variable. We have also explained
that not all random variables possess a moment generating function.
The characteristic function (cf) enjoys properties that are almost identical to
those enjoyed by the moment generating function, but it has an important advan-
tage: all random variables possess a characteristic function.
39.1 Definition

We start this lecture by giving a definition of characteristic function.
Definition 223 Let X be a random variable. Let i = √(-1) be the imaginary unit. The function φ : R → C defined by

φ_X(t) = E[exp(itX)]

is called the characteristic function of X.

The first thing to be noted is that the characteristic function φ_X(t) exists for any t. This can be proved as follows:

φ_X(t) = E[exp(itX)] = E[cos(tX)] + i E[sin(tX)]

and the last two expected values are well-defined, because the sine and cosine functions are bounded in the interval [-1, 1].
Proposition 224 Let X be a random variable and φ_X(t) its characteristic function. Let n ∈ N. If the n-th moment of X, denoted by μ_X(n), exists and is finite, then φ_X(t) is n times continuously differentiable and

μ_X(n) = E[X^n] = (1/i^n) d^n φ_X(t)/dt^n |_{t=0}

where d^n φ_X(t)/dt^n |_{t=0} is the n-th derivative of φ_X(t) with respect to t, evaluated at the point t = 0.
Proof. The proof of the above proposition is quite complex (see, e.g., Resnick¹ 1999). The intuition, however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions one can differentiate through the expected value, as follows:

d^n φ_X(t)/dt^n = d^n/dt^n E[exp(itX)]
= E[d^n/dt^n exp(itX)]
= E[(iX)^n exp(itX)]
= i^n E[X^n exp(itX)]

Evaluating this derivative at t = 0, we obtain

d^n φ_X(t)/dt^n |_{t=0} = i^n E[X^n exp(0·iX)] = i^n E[X^n] = i^n μ_X(n)
In practice, the proposition above is not very useful when one wants to compute a moment of a random variable, because it requires one to know in advance whether the moment exists. A much more useful statement is provided by the next proposition.

Proposition 225 Let X be a random variable and φ_X(t) its characteristic function. If φ_X(t) is n times differentiable at the point t = 0, then

1. if n is even, the k-th moment of X exists and is finite for any k ≤ n;

2. if n is odd, the k-th moment of X exists and is finite for any k < n.

In both cases,

μ_X(k) = E[X^k] = (1/i^k) d^k φ_X(t)/dt^k |_{t=0}

As an illustration, consider an exponential random variable X with parameter λ and probability density function

f_X(x) = { λ exp(-λx) if x ∈ R_X ; 0 if x ∉ R_X }

Its characteristic function is φ_X(t) = λ/(λ - it), whose second derivative is

d²φ_X(t)/dt² = -2λ/(λ - it)³

Evaluating it at t = 0, we obtain

d²φ_X(t)/dt² |_{t=0} = -2λ/λ³ = -2/λ²
Proof. For a formal proof, see, e.g., Resnick⁴ (1999). An informal proof for the special case in which X and Y have a finite support can be provided along the same lines of the proof of Proposition 212, which concerns the moment generating function. This is left as an exercise (just replace exp(tX) and exp(tY) in that proof with exp(itX) and exp(itY)).
This property is analogous to the property of joint moment generating functions
stated in Proposition 212. The same comments we made about that proposition
also apply to this one.
Proposition 228 Let X be a random variable with characteristic function φ_X(t). Define

Y = a + bX
where a, b ∈ R are two constants and b ≠ 0. Then, the characteristic function of Y is

φ_Y(t) = exp(iat) φ_X(bt)
39.4.2 Cf of a sum
The next proposition shows how to derive the characteristic function of a sum of
independent random variables.
Exercise 1

Let X be a discrete random variable having support

R_X = {0, 1, 2}

Solution

By using the definition of characteristic function, we obtain

φ_X(t) = E[exp(itX)] = Σ_{x∈R_X} exp(itx) p_X(x)
= exp(it·0) p_X(0) + exp(it·1) p_X(1) + exp(it·2) p_X(2)
= 1/3 + (1/3) exp(it) + (1/3) exp(2it) = (1/3)[1 + exp(it) + exp(2it)]
Exercise 2
Use the characteristic function found in the previous exercise to derive the variance
of X.
Solution
We can use the following formula for computing the variance:

Var[X] = E[X²] - E[X]²

The expected value of X is computed by taking the first derivative of the characteristic function:

dφ_X(t)/dt = (1/3)[i exp(it) + 2i exp(2it)]

evaluating it at t = 0, and dividing it by i:

E[X] = (1/i) dφ_X(t)/dt |_{t=0} = (1/i)(1/3)[i exp(i·0) + 2i exp(2i·0)] = 1

The second moment is computed similarly, from the second derivative:

E[X²] = (1/i²) d²φ_X(t)/dt² |_{t=0} = (1/i²)(1/3)[i² exp(i·0) + 4i² exp(2i·0)] = 5/3

Therefore,

Var[X] = E[X²] - E[X]² = 5/3 - 1² = 2/3
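As a quick numerical cross-check (a sketch, not part of the original text: the helper name `cf` and the finite-difference step `h` are illustrative choices), the moments obtained above can be recovered by numerically differentiating the characteristic function at t = 0:

```python
import cmath

def cf(t):
    # characteristic function found in Exercise 1: (1/3)[1 + exp(it) + exp(2it)]
    return (1 + cmath.exp(1j * t) + cmath.exp(2j * t)) / 3

h = 1e-4
# E[X] = (1/i) * first derivative at t = 0, via a central difference
ex = ((cf(h) - cf(-h)) / (2 * h) / 1j).real
# E[X^2] = (1/i^2) * second derivative at t = 0, i.e. minus the second derivative
ex2 = -((cf(h) - 2 * cf(0) + cf(-h)) / h ** 2).real
var = ex2 - ex ** 2  # should be close to 2/3
```

The finite differences agree with the exact values E[X] = 1, E[X²] = 5/3, and Var[X] = 2/3 up to discretization error.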
Exercise 3
Read and try to understand how the characteristic functions of the uniform and
exponential distributions are derived in the lectures entitled Uniform distribution
(p. 359) and Exponential distribution (p. 365).
Chapter 40
Characteristic function of a
random vector
This lecture introduces the notion of joint characteristic function (joint cf) of a
random vector, which is a multivariate generalization of the concept of character-
istic function of a random variable. Before reading this lecture, you are advised to
first read the lecture entitled Characteristic function (p. 307).
40.1 Definition
Let us start this lecture with a de…nition.
Definition 230 Let X be a K×1 random vector. Let i = √(-1) be the imaginary unit. The function φ : R^K → C defined by

φ_X(t) = E[exp(it^T X)] = E[exp(i Σ_{j=1}^K t_j X_j)]

is called the joint characteristic function of X.
Cross-moments can be computed from the joint characteristic function, as stated by the following proposition.
Proposition 231 Let X be a random vector and φ_X(t) its joint characteristic function. Let n ∈ N. Define a cross-moment of order n as follows:

μ_X(n_1, n_2, ..., n_K) = E[X_1^{n_1} X_2^{n_2} ... X_K^{n_K}]

where n_1, n_2, ..., n_K ∈ Z_+ and

n = Σ_{k=1}^K n_k

If all cross-moments of order n exist and are finite, then all the n-th order partial derivatives of φ_X(t) exist and

μ_X(n_1, n_2, ..., n_K) = (1/i^n) ∂^n φ_X(t) / (∂t_1^{n_1} ∂t_2^{n_2} ... ∂t_K^{n_K}) |_{t=0}

where the partial derivative on the right-hand side of the equation is evaluated at the point t_1 = 0, t_2 = 0, ..., t_K = 0.
Proposition 232 Let X be a random vector and φ_X(t) its joint characteristic function. If all the n-th order partial derivatives of φ_X(t) exist, then:

1. if n is even, all cross-moments of order k ≤ n exist and are finite;

2. if n is odd, all cross-moments of order k < n exist and are finite.

In both cases,

μ_X(n_1, n_2, ..., n_K) = (1/i^k) ∂^k φ_X(t) / (∂t_1^{n_1} ... ∂t_K^{n_K}) |_{t=0}

where k = n_1 + ... + n_K.
Proof. See Ushakov (1999). An informal proof for the special case in which X and Y have a finite support can be provided along the same lines of the proof of Proposition 219, which concerns the joint moment generating function. This is left as an exercise (just replace exp(t^T X) and exp(t^T Y) in that proof with exp(it^T X) and exp(it^T Y)).

This property is analogous to the property of joint moment generating functions stated in Proposition 219. The same comments we made about that proposition also apply to this one.
4 See p. 118.
φ_X(t_1, ..., t_K) = Π_{j=1}^K φ_{X_j}(t_j)

where: in step A we have used the fact that the entries of X are mutually independent⁵; in step B we have used the definition of characteristic function of a random variable⁶.
Then, the joint characteristic function of Z is the product of the joint characteristic functions of X_1, ..., X_n:

φ_Z(t) = Π_{j=1}^n φ_{X_j}(t)
5 In particular, see the mutual independence via expectations property (p. 234).
6 See p. 307.
where: in step A we have used the fact that the vectors X_j are mutually independent; in step B we have used the definition of joint characteristic function of a random vector given above.
Exercise 1
Let Z_1 and Z_2 be two independent standard normal random variables⁷. Let X be a 2×1 random vector whose components are defined as follows:

X_1 = Z_1²
X_2 = Z_1² + Z_2²
Solution
By using the definition of characteristic function, we get

where: in step A we have used the fact that Z_1 and Z_2 are independent; in step B we have used the definition of characteristic function.
Exercise 2
Use the joint characteristic function found in the previous exercise to derive the
expected value and the covariance matrix of X.
Solution
We need to compute the partial derivatives of the joint characteristic function:

∂φ/∂t_1 = -(1/2) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2} (-2i - 4t_2)

∂φ/∂t_2 = -(1/2) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2} (-4i - 4t_1 - 8t_2)

∂²φ/∂t_1² = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-2i - 4t_2)²

∂²φ/∂t_2² = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-4i - 4t_1 - 8t_2)² + 4 (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2}

∂²φ/∂t_1∂t_2 = (3/4) (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-5/2} (-2i - 4t_2)(-4i - 4t_1 - 8t_2) + 2 (1 - 2it_1 - 4it_2 - 4t_1t_2 - 4t_2²)^{-3/2}

All partial derivatives up to the second order exist and are well defined. As a consequence, all cross-moments up to the second order exist and are finite, and they can be computed from the above partial derivatives:

E[X_1] = (1/i) ∂φ/∂t_1 |_{t_1=0, t_2=0} = (1/i) i = 1

E[X_2] = (1/i) ∂φ/∂t_2 |_{t_1=0, t_2=0} = (1/i) 2i = 2

E[X_1²] = (1/i²) ∂²φ/∂t_1² |_{t_1=0, t_2=0} = (1/i²) 3i² = 3

E[X_2²] = (1/i²) ∂²φ/∂t_2² |_{t_1=0, t_2=0} = (1/i²) (12i² + 4) = 8

E[X_1X_2] = (1/i²) ∂²φ/∂t_1∂t_2 |_{t_1=0, t_2=0} = (1/i²) (6i² + 2) = 4
Exercise 3
Read and try to understand how the joint characteristic function of the multinomial
distribution is derived in the lecture entitled Multinomial distribution (p. 431).
322 CHAPTER 40. JOINT CF
Chapter 41
Sums of independent
random variables
This lecture discusses how to derive the distribution of the sum of two independent random variables¹. We explain first how to derive the distribution function² of the sum and then how to derive its probability mass function³ (if the summands are discrete) or its probability density function⁴ (if the summands are continuous).
Proposition 237 Let X and Y be two independent random variables and denote by F_X(x) and F_Y(y) their respective distribution functions. Let

Z = X + Y

and denote by F_Z(z) the distribution function of Z. The following holds:

F_Z(z) = E[F_X(z - Y)]

or

F_Z(z) = E[F_Y(z - X)]

Proof. The first formula is derived as follows:

F_Z(z) = P(Z ≤ z)   (A)
= P(X + Y ≤ z)
= P(X ≤ z - Y)
1 See p. 229.
2 See p. 108.
3 See p. 106.
4 See p. 107.
= E[P(X ≤ z - Y | Y = y)]   (B)
= E[F_X(z - Y)]   (C)
Let X be a uniform random variable with support

R_X = [0, 1]

and probability density function

f_X(x) = { 1 if x ∈ R_X ; 0 otherwise }

and let Y be another uniform random variable, independent of X, with support

R_Y = [0, 1]

and probability density function

f_Y(y) = { 1 if y ∈ R_Y ; 0 otherwise }

The distribution function of Z = X + Y is

F_Z(z) = E[F_X(z - Y)] = ∫_{-∞}^{+∞} F_X(z - y) f_Y(y) dy = ∫_0^1 F_X(z - y) dy

Performing the change of variable t = z - y and exchanging the bounds of integration, we obtain

F_Z(z) = ∫_{z-1}^{z} F_X(t) dt

There are four cases to consider:
1. If z ≤ 0, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^z 0 dt = 0

2. If 0 < z ≤ 1, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^0 F_X(t) dt + ∫_0^z F_X(t) dt
= ∫_{z-1}^0 0 dt + ∫_0^z t dt = 0 + [t²/2]_0^z = z²/2

3. If 1 < z ≤ 2, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^1 F_X(t) dt + ∫_1^z F_X(t) dt
= ∫_{z-1}^1 t dt + ∫_1^z 1 dt = [t²/2]_{z-1}^1 + [t]_1^z
= 1/2 - (1/2)(z - 1)² + z - 1
= 1/2 - (1/2)(z² - 2z + 1) + z - 1
= -z²/2 + 2z - 1

4. If z > 2, then

F_Z(z) = ∫_{z-1}^z F_X(t) dt = ∫_{z-1}^z 1 dt = [t]_{z-1}^z = z - (z - 1) = 1
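The four-branch distribution function derived above can be verified empirically; the following sketch (the sample size, seed, and evaluation points are arbitrary choices) compares it with a Monte Carlo estimate:

```python
import random

def F_Z(z):
    # piecewise CDF of the sum of two independent Uniform[0, 1] variables
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return -z * z / 2 + 2 * z - 1
    return 1.0

random.seed(1)
n = 200_000
emp = {}
for z in (0.5, 1.0, 1.5):
    # empirical frequency of {X + Y <= z} should be close to F_Z(z)
    emp[z] = sum(random.random() + random.random() <= z for _ in range(n)) / n
```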
Proposition 239 Let X and Y be two independent discrete random variables and denote by p_X(x) and p_Y(y) their respective probability mass functions and by R_X and R_Y their supports. Let

Z = X + Y

and denote the probability mass function of Z by p_Z(z). The following holds:

p_Z(z) = Σ_{y∈R_Y} p_X(z - y) p_Y(y)

or

p_Z(z) = Σ_{x∈R_X} p_Y(z - x) p_X(x)

Proof. The first formula is derived as follows:

p_Z(z) = P(Z = z)   (A)
= P(X + Y = z)
= P(X = z - Y)
= E[P(X = z - Y | Y = y)]   (B)
= E[p_X(z - Y)]   (C)
= Σ_{y∈R_Y} p_X(z - y) p_Y(y)   (D)

where: in step A we have used the definition of probability mass function; in step B we have used the law of iterated expectations; in step C we have used the fact that X and Y are independent; in step D we have used the definition of expected value. The second formula is symmetric to the first.
The two summations above are called convolutions (of two probability mass functions).

Let X be a discrete random variable with support

R_X = {0, 1}

and probability mass function

p_X(x) = { 1/2 if x ∈ R_X ; 0 otherwise }

and let Y be another discrete random variable, independent of X, with support

R_Y = {0, 1}

and probability mass function

p_Y(y) = { 1/2 if y ∈ R_Y ; 0 otherwise }

Define

Z = X + Y

Its support is

R_Z = {0, 1, 2}

The probability mass function of Z, evaluated at z = 0, is

p_Z(0) = Σ_{y∈R_Y} p_X(0 - y) p_Y(y) = p_X(0 - 0) p_Y(0) + p_X(0 - 1) p_Y(1)
= (1/2)(1/2) + 0·(1/2) = 1/4

Evaluated at z = 1, it is

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y) = p_X(1 - 0) p_Y(0) + p_X(1 - 1) p_Y(1)
= (1/2)(1/2) + (1/2)(1/2) = 1/2

Evaluated at z = 2, it is

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y) = p_X(2 - 0) p_Y(0) + p_X(2 - 1) p_Y(1)
= 0·(1/2) + (1/2)(1/2) = 1/4

Therefore, the probability mass function of Z is

p_Z(z) = { 1/4 if z = 0 ; 1/2 if z = 1 ; 1/4 if z = 2 ; 0 otherwise }
or

f_Z(z) = ∫_{-∞}^{+∞} f_Y(z - x) f_X(x) dx

The first formula is obtained by differentiating the distribution function of the sum:

f_Z(z) = (d/dz) E[F_X(z - Y)]
= E[(d/dz) F_X(z - Y)]   (A)
= E[f_X(z - Y)]
= ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy   (B)

⁸ See p. 109.
Let X be an exponential random variable with support

R_X = [0, ∞)

and probability density function

f_X(x) = { exp(-x) if x ∈ R_X ; 0 otherwise }

and let Y be another exponential random variable, independent of X, with support

R_Y = [0, ∞)

and probability density function

f_Y(y) = { exp(-y) if y ∈ R_Y ; 0 otherwise }

Define

Z = X + Y

The support of Z is

R_Z = [0, ∞)

When z ∈ R_Z, the probability density function of Z is

f_Z(z) = ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy = ∫_0^{+∞} f_X(z - y) exp(-y) dy
= ∫_0^{+∞} exp(-(z - y)) 1{z - y ≥ 0} exp(-y) dy
= ∫_0^{+∞} exp(-z + y) 1{y ≤ z} exp(-y) dy
= ∫_0^z exp(-z + y) exp(-y) dy = exp(-z) ∫_0^z dy = z exp(-z)

Therefore,

f_Z(z) = { z exp(-z) if z ∈ R_Z ; 0 otherwise }
9 See p. 365.
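The closed form z·exp(-z) can be checked by evaluating the convolution integral numerically (a sketch; the midpoint rule and the grid size are arbitrary choices):

```python
import math

def f_exp(x):
    # standard exponential density (rate parameter 1)
    return math.exp(-x) if x >= 0 else 0.0

def f_z_numeric(z, n=10_000):
    # midpoint-rule approximation of the convolution integral
    # of f_X(z - y) * f_Y(y) dy over [0, z]
    if z <= 0:
        return 0.0
    h = z / n
    return sum(f_exp(z - (k + 0.5) * h) * f_exp((k + 0.5) * h)
               for k in range(n)) * h

for z in (0.5, 1.0, 2.0):
    assert abs(f_z_numeric(z) - z * math.exp(-z)) < 1e-6
```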
Let X_1, ..., X_n be n mutually independent random variables and let

Z = X_1 + ... + X_n

The distribution of Z can be derived recursively, using the results for sums of two random variables given above:

1. first, define

Y_2 = X_1 + X_2

and compute the distribution of Y_2;

2. then, define

Y_3 = Y_2 + X_3

and compute the distribution of Y_3;

3. and so on, until the distribution of Z can be computed from

Z = Y_n = Y_{n-1} + X_n
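The recursion can be sketched in code: repeatedly convolving pairwise pmfs yields the pmf of the full sum (the helper and the fair-coin example are illustrative assumptions, not part of the original text):

```python
from functools import reduce

def convolve_pmf(p_x, p_y):
    # pmf of the sum of two independent discrete variables (dicts value -> prob)
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0.0) + px * py
    return p_z

# Z = X1 + X2 + X3 for three independent fair coin flips:
coin = {0: 0.5, 1: 0.5}
p_z = reduce(convolve_pmf, [coin, coin, coin])
# Binomial(3, 1/2): {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```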
Exercise 1

Let X be a uniform random variable with support

R_X = [0, 1]

and let Y be an exponential random variable with parameter 1, independent of X, with support

R_Y = [0, ∞)

Derive the probability density function of

Z = X + Y

¹⁰ See p. 233.
Solution

The support of Z is

R_Z = [0, ∞)

When z ∈ R_Z, the probability density function of Z is

f_Z(z) = ∫_{-∞}^{+∞} f_X(z - y) f_Y(y) dy = ∫_0^{+∞} f_X(z - y) exp(-y) dy
= ∫_0^{+∞} 1{0 ≤ z - y ≤ 1} exp(-y) dy = ∫_0^{+∞} 1{z - 1 ≤ y ≤ z} exp(-y) dy
= ∫_{max(0, z-1)}^{z} exp(-y) dy = [-exp(-y)]_{max(0, z-1)}^{z}

For 0 ≤ z ≤ 1 the lower bound is 0 and the integral equals 1 - exp(-z); for z > 1 the lower bound is z - 1 and the integral equals exp(1 - z) - exp(-z). Therefore,

f_Z(z) = { 1 - exp(-z) if 0 ≤ z ≤ 1 ; exp(1 - z) - exp(-z) if z > 1 ; 0 otherwise }
Exercise 2
Let X be a discrete random variable with support

R_X = {0, 1, 2}

and probability mass function

p_X(x) = { 1/3 if x ∈ R_X ; 0 otherwise }

and let Y be another discrete random variable, independent of X, with support

R_Y = {1, 2}

and probability mass function

p_Y(y) = { y/3 if y ∈ R_Y ; 0 otherwise }

Derive the probability mass function of

Z = X + Y

Solution

The support of Z is:

R_Z = {1, 2, 3, 4}

The probability mass function of Z, evaluated at z = 1, is:

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y) = p_X(1 - 1) p_Y(1) + p_X(1 - 2) p_Y(2)
= (1/3)(1/3) + 0·(2/3) = 1/9
Evaluated at z = 2, it is:

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y) = p_X(2 - 1) p_Y(1) + p_X(2 - 2) p_Y(2)
= (1/3)(1/3) + (1/3)(2/3) = 1/3

Evaluated at z = 3, it is:

p_Z(3) = Σ_{y∈R_Y} p_X(3 - y) p_Y(y) = p_X(3 - 1) p_Y(1) + p_X(3 - 2) p_Y(2)
= (1/3)(1/3) + (1/3)(2/3) = 1/3

Evaluated at z = 4, it is:

p_Z(4) = Σ_{y∈R_Y} p_X(4 - y) p_Y(y) = p_X(4 - 1) p_Y(1) + p_X(4 - 2) p_Y(2)
= 0·(1/3) + (1/3)(2/3) = 2/9

Therefore, the probability mass function of Z is:

p_Z(z) = { 1/9 if z = 1 ; 1/3 if z = 2 ; 1/3 if z = 3 ; 2/9 if z = 4 ; 0 otherwise }
Part IV

Probability distributions
Chapter 42
Bernoulli distribution
Suppose you perform an experiment with two possible outcomes: either success or failure. Success happens with probability p, while failure happens with probability 1 - p. A random variable that takes value 1 in case of success and 0 in case of failure is called a Bernoulli random variable (alternatively, it is said to have a Bernoulli distribution).
42.1 Definition
Bernoulli random variables are characterized as follows. Let X be a discrete random variable with support

R_X = {0, 1}

Let p ∈ (0, 1). We say that X has a Bernoulli distribution with parameter p if its probability mass function¹ is

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 if x ∉ R_X }
Note that, by the above definition, any indicator function² is a Bernoulli random variable.

The following is a proof that p_X(x) is a legitimate probability mass function³:

Proof. Non-negativity is obvious. We need to prove that the sum of p_X(x) over its support equals 1. This is proved as follows:

Σ_{x∈R_X} p_X(x) = p_X(1) + p_X(0) = p + (1 - p) = 1
1 See p. 106.
2 See p. 197.
3 See p. 247.
42.2 Expected value

The expected value of a Bernoulli random variable X is

E[X] = p
42.3 Variance
The variance of a Bernoulli random variable X is

Var[X] = p(1 - p)

Proof. It can be derived thanks to the usual formula for computing the variance⁴:

E[X²] = Σ_{x∈R_X} x² p_X(x) = 1²·p_X(1) + 0²·p_X(0) = 1·p + 0·(1 - p) = p

E[X]² = p²

Var[X] = E[X²] - E[X]² = p - p² = p(1 - p)
The distribution function of X can be derived from the definition

F_X(x) = P(X ≤ x)

and the fact that X can take either value 0 or value 1. If x < 0, then P(X ≤ x) = 0, because X cannot take values strictly smaller than 0. If 0 ≤ x < 1, then P(X ≤ x) = 1 - p, because 0 is the only value strictly smaller than 1 that X can take. Finally, if x ≥ 1, then P(X ≤ x) = 1, because all values X can take are smaller than or equal to 1.
Exercise 1
Let X and Y be two independent Bernoulli random variables with parameter p.
Derive the probability mass function of their sum:
Z =X +Y
Solution
The probability mass function of X is

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 otherwise }

The probability mass function of Y is

p_Y(y) = { p if y = 1 ; 1 - p if y = 0 ; 0 otherwise }

The support of Z (the set of values Z can take) is

R_Z = {0, 1, 2}

The formula for the probability mass function of a sum of two independent variables is⁵

p_Z(z) = Σ_{y∈R_Y} p_X(z - y) p_Y(y)

where R_Y is the support of Y. When z = 0, the formula gives:

p_Z(0) = Σ_{y∈R_Y} p_X(-y) p_Y(y)
= p_X(-0) p_Y(0) + p_X(-1) p_Y(1)
= (1 - p)(1 - p) + 0·p = (1 - p)²

When z = 1, the formula gives:

p_Z(1) = Σ_{y∈R_Y} p_X(1 - y) p_Y(y)
= p_X(1 - 0) p_Y(0) + p_X(1 - 1) p_Y(1)
= p(1 - p) + (1 - p)p = 2p(1 - p)

When z = 2, the formula gives:

p_Z(2) = Σ_{y∈R_Y} p_X(2 - y) p_Y(y)
= p_X(2 - 0) p_Y(0) + p_X(2 - 1) p_Y(1)
= 0·(1 - p) + p·p = p²

Therefore, the probability mass function of Z is

p_Z(z) = { (1 - p)² if z = 0 ; 2p(1 - p) if z = 1 ; p² if z = 2 ; 0 otherwise }
5 See p. 325.
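The pmf derived above is exactly that of a binomial distribution with parameters 2 and p; a quick check for one illustrative value of p (p = 0.3 is an arbitrary choice):

```python
from math import comb

p = 0.3  # arbitrary illustrative parameter
p_z = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

# compare with the binomial pmf C(2, z) p^z (1 - p)^(2 - z)
for z in (0, 1, 2):
    assert abs(p_z[z] - comb(2, z) * p ** z * (1 - p) ** (2 - z)) < 1e-12
assert abs(sum(p_z.values()) - 1.0) < 1e-12
```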
Exercise 2
Let X be a Bernoulli random variable with parameter p = 1=2. Find its tenth
moment.
Solution
The moment generating function of X is

M_X(t) = 1/2 + (1/2) exp(t)

The tenth moment of X is equal to the tenth derivative of its moment generating function⁶, evaluated at t = 0:

μ_X(10) = E[X¹⁰] = d¹⁰M_X(t)/dt¹⁰ |_{t=0}

But

dM_X(t)/dt = (1/2) exp(t)

d²M_X(t)/dt² = (1/2) exp(t)

...

d¹⁰M_X(t)/dt¹⁰ = (1/2) exp(t)

so that:

μ_X(10) = d¹⁰M_X(t)/dt¹⁰ |_{t=0} = (1/2) exp(0) = 1/2
6 See p. 290.
Chapter 43
Binomial distribution
43.1 Definition
The binomial distribution is characterized as follows.
Definition 244 Let X be a discrete random variable. Let n ∈ N and p ∈ (0, 1). Let the support of X be²

R_X = {0, 1, ..., n}

We say that X has a binomial distribution with parameters n and p if its probability mass function³ is

p_X(x) = { C(n, x) p^x (1 - p)^{n-x} if x ∈ R_X ; 0 if x ∉ R_X }

where C(n, x) = n! / (x!(n - x)!) is the binomial coefficient⁴.
Proof. Non-negativity is obvious. We need to prove that the sum of p_X(x) over the support of X equals 1. This is proved as follows:

Σ_{x∈R_X} p_X(x) = Σ_{x=0}^n C(n, x) p^x (1 - p)^{n-x} = [p + (1 - p)]^n = 1^n = 1

where we have used the binomial theorem:

(a + b)^n = Σ_{x=0}^n C(n, x) a^x b^{n-x}
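A numerical illustration of this normalization (the particular n and p are arbitrary illustrative choices):

```python
from math import comb

n, p = 10, 0.3  # arbitrary illustrative parameters
total = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1))
# by the binomial theorem, total equals [p + (1 - p)]^n = 1
```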
Proof. We demonstrate that the two distributions are equivalent by showing that they have the same probability mass function. The probability mass function of a binomial distribution with parameters n and p, with n = 1, is

p_X(x) = { C(1, x) p^x (1 - p)^{1-x} if x ∈ {0, 1} ; 0 if x ∉ {0, 1} }

but

p_X(0) = C(1, 0) p⁰ (1 - p)^{1-0} = (1!/(0!1!)) (1 - p) = 1 - p

and

p_X(1) = C(1, 1) p¹ (1 - p)^{1-1} = (1!/(1!0!)) p = p

Therefore, the probability mass function can be written as

p_X(x) = { p if x = 1 ; 1 - p if x = 0 ; 0 otherwise }
Y_n = X_1 + X_2 + ... + X_n

Y_n = Y_{n-1} + X_n

The probability mass function of Y_n is derived by convolution⁷:

p_{Y_n}(y_n) = Σ_{y_{n-1}∈R_{Y_{n-1}}} p_{X_n}(y_n - y_{n-1}) p_{Y_{n-1}}(y_{n-1})

= Σ_{y_{n-1}∈R_{Y_{n-1}}} I((y_n - y_{n-1}) ∈ {0, 1}) p^{y_n - y_{n-1}} (1 - p)^{1 - y_n + y_{n-1}} C(n-1, y_{n-1}) p^{y_{n-1}} (1 - p)^{n-1-y_{n-1}}

= Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) p^{y_n} (1 - p)^{n - y_n}

= p^{y_n} (1 - p)^{n - y_n} Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1})

If 1 ≤ y_n ≤ n - 1, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, y_n - 1) + C(n-1, y_n) = C(n, y_n)

where the last equality is the recursive formula for binomial coefficients⁸. If y_n = 0, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, 0) = 1 = C(n, 0) = C(n, y_n)

⁷ See p. 326.
⁸ See p. 25.

Finally, if y_n = n, then

Σ_{y_{n-1}∈R_{Y_{n-1}}} I(y_n ∈ {y_{n-1}, y_{n-1} + 1}) C(n-1, y_{n-1}) = C(n-1, n-1) = 1 = C(n, n) = C(n, y_n)

and

p_{Y_n}(y_n) = { C(n, y_n) p^{y_n} (1 - p)^{n - y_n} if y_n ∈ R_{Y_n} ; 0 otherwise }

which is the probability mass function of a binomial random variable with parameters n and p. This completes the proof.
43.3 Expected value

The expected value of a binomial random variable X is

E[X] = np

Proof.

E[X] = E[Σ_{i=1}^n Y_i]   (A)
= Σ_{i=1}^n E[Y_i]   (B)
= Σ_{i=1}^n p   (C)
= np

where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y_1, ..., Y_n; in step B we have used the linearity of the expected value; in step C we have used the formula for the expected value of a Bernoulli random variable⁹.
43.4 Variance

The variance of a binomial random variable X is

Var[X] = np(1 - p)
9 See p. 336.
where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y_1, ..., Y_n; in step B we have used the formula for the variance of the sum of jointly independent random variables; in step C we have used the formula for the variance of a Bernoulli random variable¹⁰.
φ_X(t) = E[exp(itX)]   (A)
= E[exp(it(Y_1 + ... + Y_n))]   (B)
= E[exp(itY_1) ... exp(itY_n)]
= E[exp(itY_1)] ... E[exp(itY_n)]   (C)
= φ_{Y_1}(t) ... φ_{Y_n}(t)   (D)
= (1 - p + p exp(it)) ... (1 - p + p exp(it))   (E)
= (1 - p + p exp(it))^n
where ⌊x⌋ is the floor of x, i.e. the largest integer not greater than x.

Proof. For x < 0, F_X(x) = 0, because X cannot be smaller than 0. For x > n, F_X(x) = 1, because X is always smaller than or equal to n. For 0 ≤ x ≤ n:

F_X(x) = P(X ≤ x)   (A)
= Σ_{s=0}^{⌊x⌋} P(X = s)   (B)
= Σ_{s=0}^{⌊x⌋} p_X(s)   (C)
= Σ_{s=0}^{⌊x⌋} C(n, s) p^s (1 - p)^{n-s}

¹² See p. 337.
binocdf(x,n,p)
returns the value of the distribution function at the point x when the parameters
of the distribution are n and p.
Exercise 1
Suppose you independently flip a coin 4 times and the outcome of each toss can be either heads (with probability 1/2) or tails (also with probability 1/2). What is the probability of obtaining exactly 2 tails?

Solution

Denote by X the number of times the outcome is tails (out of the 4 tosses). X has a binomial distribution with parameters n = 4 and p = 1/2. The probability of obtaining exactly 2 tails can be computed from the probability mass function of X as follows:

p_X(2) = C(4, 2) p² (1 - p)^{4-2} = C(4, 2) (1/2)² (1 - 1/2)²
= (4!/(2!2!)) (1/4)(1/4) = ((4·3·2·1)/((2·1)(2·1))) (1/16) = 6/16 = 3/8
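The same computation in code (a sketch using Python's math.comb in place of the by-hand factorials):

```python
from math import comb

# P(exactly 2 tails in 4 fair flips) = C(4, 2) * (1/2)^2 * (1/2)^2
prob = comb(4, 2) * 0.5 ** 2 * 0.5 ** 2
# C(4, 2) = 6, so prob = 6/16 = 0.375 = 3/8
```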
Exercise 2
Suppose you independently throw a dart 10 times. Each time you throw a dart,
the probability of hitting the target is 3=4. What is the probability of hitting the
target less than 5 times (out of the 10 total times you throw a dart)?
Solution
Denote by X the number of times you hit the target. X has a binomial distribution
with parameters n = 10 and p = 3=4. The probability of hitting the target less
than 5 times can be computed from the distribution function of X as follows:
P(X < 5) = P(X ≤ 4) = F_X(4)
= Σ_{s=0}^4 C(n, s) p^s (1 - p)^{n-s}
= Σ_{s=0}^4 C(10, s) (3/4)^s (1/4)^{10-s} ≈ 0.0197

where F_X is the distribution function of X and the value of F_X(4) can be calculated with a computer algorithm, for example with the MATLAB command

binocdf(4,10,3/4)
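An equivalent computation without MATLAB (a sketch; it evaluates the same partial sum of the binomial pmf directly):

```python
from math import comb

n, p = 10, 0.75
# F_X(4) = sum of the binomial pmf over s = 0, ..., 4
cdf_at_4 = sum(comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(5))
# cdf_at_4 is approximately 0.0197, matching binocdf(4,10,3/4)
```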
Chapter 44
Poisson distribution
[Figure: the number of phone calls received, plotted as a function of time (0 to 90 minutes); the counts follow a Poisson distribution, while the waiting times between calls follow an exponential distribution.]

The concept is illustrated by the plot above, where the number of phone calls received is plotted as a function of time. The graph of the function makes an upward jump each time a phone call arrives. The time elapsed between two successive phone calls is equal to the length of each horizontal segment and it has an exponential distribution¹.
1 See p. 365.
44.1 Definition
The Poisson distribution is characterized as follows:
Definition 247 Let X be a discrete random variable. Let its support be the set of non-negative integer numbers:

R_X = Z_+

Let λ ∈ (0, ∞). We say that X has a Poisson distribution with parameter λ if its probability mass function² is

p_X(x) = { exp(-λ) λ^x / x! if x ∈ R_X ; 0 if x ∉ R_X }
Proposition 248 The number of occurrences of an event within a unit of time has a Poisson distribution with parameter λ if and only if the time elapsed between two successive occurrences of the event has an exponential distribution with parameter λ and it is independent of previous occurrences.

Proof. Denote by τ_1, τ_2, ..., τ_x the times elapsed between successive occurrences, and consider the probability

P(τ_1 + ... + τ_x ≤ 1)
2 See p. 106.
3 See p. 10.
Z = τ_1 + ... + τ_x

Since the sum of independent exponential random variables⁴ with common parameter λ is a Gamma random variable⁵ with parameters 2x and x/λ, then Z is a Gamma random variable with parameters 2x and x/λ, i.e. its probability density function is

f_Z(z) = { c z^{x-1} exp(-λz) if z ∈ [0, ∞) ; 0 if z ∉ [0, ∞) }

where

c = λ^x / Γ(x) = λ^x / (x - 1)!

and the last equality stems from the fact that we are considering only integer values of x. We need to integrate the density function to compute the probability that Z is less than 1:

P(τ_1 + ... + τ_x ≤ 1) = P(Z ≤ 1)
= ∫_{-∞}^1 f_Z(z) dz
= ∫_0^1 c z^{x-1} exp(-λz) dz
= c ∫_0^1 z^{x-1} exp(-λz) dz

⁴ See p. 372.
⁵ See p. 397.
Integrating repeatedly by parts, we obtain

∫_0^1 z^{x-1} exp(-λz) dz = -Σ_{i=1}^{x-1} ((x-1)!/((x-i)! λ^i)) exp(-λ) + ((x-1)!/λ^{x-1}) [-(1/λ) exp(-λz)]_0^1

= -Σ_{i=1}^{x-1} ((x-1)!/((x-i)! λ^i)) exp(-λ) - ((x-1)!/λ^x) exp(-λ) + (x-1)!/λ^x

Multiplying by c, we obtain:

c ∫_0^1 z^{x-1} exp(-λz) dz = (λ^x/(x-1)!) ∫_0^1 z^{x-1} exp(-λz) dz

= -Σ_{i=1}^{x-1} (λ^{x-i}/(x-i)!) exp(-λ) - exp(-λ) + 1

= 1 - Σ_{i=1}^{x} (λ^{x-i}/(x-i)!) exp(-λ)

= 1 - Σ_{j=0}^{x-1} (λ^j/j!) exp(-λ)
On the other hand,

P(X ≥ x) = 1 - P(X < x)
= 1 - P(X ≤ x - 1)
= 1 - Σ_{j=0}^{x-1} P(X = j)
= 1 - Σ_{j=0}^{x-1} p_X(j)
= 1 - Σ_{j=0}^{x-1} (λ^j/j!) exp(-λ)
= P(τ_1 + ... + τ_x ≤ 1)
44.3 Expected value

The expected value of a Poisson random variable X is

E[X] = λ

where: in step A we have used the fact that the first term of the sum is zero, because x = 0; in step B we have made a change of variable y = x - 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have defined

p_Y(y) = exp(-λ) λ^y / y!

where p_Y(y) is the probability mass function of a Poisson random variable with parameter λ; in step E we have used the fact that the sum of a probability mass function over its support equals 1.
44.4 Variance

The variance of a Poisson random variable X is:

Var[X] = λ

Proof. It can be derived thanks to the usual formula for computing the variance⁶:

E[X²] = Σ_{x∈R_X} x² p_X(x)
= Σ_{x=0}^∞ x² exp(-λ) λ^x / x!
= 0 + Σ_{x=1}^∞ x² exp(-λ) λ^x / x!   (A)
= Σ_{y=0}^∞ (y+1)² exp(-λ) λ^{y+1} / (y+1)!   (B)
= Σ_{y=0}^∞ (y+1)² exp(-λ) λ^{y+1} / ((y+1) y!)   (C)
= λ Σ_{y=0}^∞ (y+1) exp(-λ) λ^y / y!
= λ Σ_{y=0}^∞ (y+1) p_Y(y)   (D)
= λ { Σ_{y=0}^∞ y p_Y(y) + Σ_{y=0}^∞ p_Y(y) }
= λ { E[Y] + 1 }   (E)
= λ { λ + 1 }   (F)
= λ² + λ

where: in step A we have used the fact that the first term of the sum is zero, because x = 0; in step B we have made a change of variable y = x - 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have defined

p_Y(y) = exp(-λ) λ^y / y!

where p_Y(y) is the probability mass function of a Poisson random variable with parameter λ; in step E we have used the fact that the sum of a probability mass function over its support equals 1; in step F we have used the fact that the expected value of a Poisson random variable with parameter λ is λ. Finally,

E[X]² = λ²

and

Var[X] = E[X²] - E[X]² = λ² + λ - λ² = λ

⁶ Var[X] = E[X²] - E[X]². See p. 156.
where

exp(λ exp(t)) = Σ_{x=0}^∞ (λ exp(t))^x / x!

is the usual Taylor series expansion of the exponential function. Furthermore, since the series converges for any value of t, the moment generating function of a Poisson random variable exists for any t ∈ R.

where:

exp(λ exp(it)) = Σ_{x=0}^∞ (λ exp(it))^x / x!

is the usual Taylor series expansion of the exponential function (note that the series converges for any value of t).
where ⌊x⌋ is the floor of x, i.e. the largest integer not greater than x.

Proof. Using the definition of distribution function:

F_X(x) = P(X ≤ x)   (A)
= Σ_{s=0}^{⌊x⌋} P(X = s)   (B)
= Σ_{s=0}^{⌊x⌋} p_X(s)   (C)
= Σ_{s=0}^{⌊x⌋} exp(-λ) λ^s / s!
= exp(-λ) Σ_{s=0}^{⌊x⌋} λ^s / s!
Exercise 1
The time elapsed between the arrival of a customer at a shop and the arrival
of the next customer has an exponential distribution with expected value equal
to 15 minutes. Furthermore, it is independent of previous arrivals. What is the
probability that more than 6 customers will arrive at the shop during the next
hour?
Solution
If a random variable has an exponential distribution with parameter λ, then its expected value is equal to 1/λ. Here

1/λ = 0.25 hours

so that λ = 4. The number of customers that arrive at the shop during the next hour (denote it by X) is a Poisson random variable with parameter λ = 4. The probability that more than 6 customers arrive at the shop during the next hour is:

P(X > 6) = 1 - P(X ≤ 6) = 1 - F_X(6)
= 1 - Σ_{s=0}^6 exp(-4) 4^s / s! ≈ 0.1107
The value of FX (6) can be calculated with a computer algorithm, for example with
the MATLAB command:
poisscdf(6,4)
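An equivalent computation without MATLAB (a sketch evaluating the Poisson partial sum directly):

```python
import math

lam = 4
# F_X(6) = sum of the Poisson pmf exp(-lam) * lam^s / s! over s = 0, ..., 6
cdf_at_6 = sum(math.exp(-lam) * lam ** s / math.factorial(s) for s in range(7))
p_more_than_6 = 1 - cdf_at_6
# p_more_than_6 is approximately 0.1107, matching 1 - poisscdf(6,4)
```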
Exercise 2
At a call center, the time elapsed between the arrival of a phone call and the arrival
of the next phone call has an exponential distribution with expected value equal
to 15 seconds. Furthermore, it is independent of previous arrivals. What is the
probability that less than 50 phone calls arrive during the next 15 minutes?
Solution
If a random variable has an exponential distribution with parameter λ, then its expected value is equal to 1/λ. Here

1/λ = 1/4 minutes = 1/60 quarters of an hour

where, in the last equality, we have taken 15 minutes as the unit of time. Therefore, λ = 60. If inter-arrival times are independent exponential random variables with parameter λ, then the number of arrivals during a unit of time has a Poisson distribution with parameter λ. Thus, the number of phone calls that will arrive during the next 15 minutes (denote it by X) is a Poisson random variable with parameter λ = 60. The probability that less than 50 phone calls arrive during the next 15 minutes is:

P(X < 50) = P(X ≤ 49) = F_X(49)
The value of FX (49) can be calculated with a computer algorithm, for example
with the MATLAB command:
poisscdf(49,60)
Chapter 45
Uniform distribution
A continuous random variable has a uniform distribution if all the values belonging
to its support have the same probability density.
45.1 Definition
The uniform distribution is characterized as follows:
Definition 249 Let X be an absolutely continuous random variable. Let its support be a closed interval of real numbers:

R_X = [l, u]

where l < u. We say that X has a uniform distribution on the interval [l, u] if its probability density function is

f_X(x) = { 1/(u - l) if x ∈ R_X ; 0 otherwise }
45.2 Expected value

The expected value of a uniform random variable X is

E[X] = (u + l)/2

Proof. It can be derived as follows:

E[X] = ∫_l^u x · (1/(u - l)) dx = (1/(u - l)) [x²/2]_l^u
= (1/(u - l)) (1/2) (u² - l²)
= ((u - l)(u + l)) / (2(u - l)) = (u + l)/2
45.3 Variance

The variance of a uniform random variable X is

Var[X] = (u - l)² / 12

Proof. It can be derived thanks to the usual formula for computing the variance²:

E[X²] = ∫_{-∞}^{+∞} x² f_X(x) dx
= ∫_l^u x² · (1/(u - l)) dx
= (1/(u - l)) ∫_l^u x² dx
= (1/(u - l)) [x³/3]_l^u
= (1/(u - l)) (1/3) (u³ - l³)
= ((u - l)(u² + ul + l²)) / (3(u - l))
= (u² + ul + l²)/3

E[X]² = ((u + l)/2)² = (u² + 2ul + l²)/4

Var[X] = E[X²] - E[X]²
= (u² + ul + l²)/3 - (u² + 2ul + l²)/4
= (4u² + 4ul + 4l² - 3u² - 6ul - 3l²)/12
= ((4 - 3)u² + (4 - 6)ul + (4 - 3)l²)/12
= (u² - 2ul + l²)/12 = (u - l)²/12
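The two steps of this derivation can be cross-checked numerically (a sketch; the interval endpoints and the midpoint-rule grid size are arbitrary choices):

```python
l, u = 2.0, 5.0  # arbitrary illustrative interval
n = 100_000
h = (u - l) / n
# midpoint-rule approximation of E[X^2] = integral of x^2 / (u - l) over [l, u]
ex2 = sum((l + (k + 0.5) * h) ** 2 / (u - l) for k in range(n)) * h
var = ex2 - ((u + l) / 2) ** 2
# theory: E[X^2] = (u^2 + ul + l^2)/3 = 13, Var = (u - l)^2 / 12 = 0.75
```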
Note that the above derivation is valid only when t ≠ 0; when t = 0, the mgf trivially equals 1. When t ≠ 0, the integral above is well-defined and finite for any t ∈ R. Thus, the moment generating function of a uniform random variable exists for any t ∈ R.

The characteristic function is derived as follows:

= (1/((u - l)t)) {sin(tu) - sin(tl) - i cos(tu) + i cos(tl)}
= (1/((u - l)it)) {i sin(tu) - i sin(tl) + cos(tu) - cos(tl)}
= (1/((u - l)it)) {[cos(tu) + i sin(tu)] - [cos(tl) + i sin(tl)]}
= (exp(itu) - exp(itl)) / ((u - l)it)

Note that the above derivation is valid only when t ≠ 0; when t = 0, the characteristic function trivially equals 1.
If l ≤ x ≤ u, then

F_X(x) = P(X ≤ x) = ∫_{-∞}^x f_X(t) dt = ∫_l^x (1/(u - l)) dt
= (1/(u - l)) [t]_l^x = (x - l)/(u - l)

If x > u, then:

F_X(x) = P(X ≤ x) = 1

because X cannot take on values greater than u.
Exercise 1
Let X be a uniform random variable with support

R_X = [5, 10]

Compute the following probability:

P(7 ≤ X ≤ 9)

Solution

We can compute this probability either using the probability density function or the distribution function of X. Using the probability density function:

P(7 ≤ X ≤ 9) = ∫_7^9 f_X(x) dx = ∫_7^9 (1/(10 - 5)) dx
= (1/5) [x]_7^9 = (1/5)(9 - 7) = 2/5

Using the distribution function:

P(7 ≤ X ≤ 9) = F_X(9) - F_X(7) = (9 - 5)/(10 - 5) - (7 - 5)/(10 - 5)
= 4/5 - 2/5 = 2/5
Exercise 2
Suppose the random variable X has a uniform distribution on the interval [-2, 4]. Compute the following probability:

P(X > 2)

Solution

P(X > 2) = 1 - P(X ≤ 2) = 1 - F_X(2)
= 1 - (2 - (-2))/(4 - (-2)) = 1 - 4/6 = 1/3
Exercise 3
Suppose the random variable X has a uniform distribution on the interval [0; 1].
Compute the third moment3 of X, i.e.:
X (3) = E X 3
3 See p. 36.
364 CHAPTER 45. UNIFORM DISTRIBUTION
Solution
We can compute the third moment of X using the transformation theorem4 :
\[
\operatorname{E}\!\left[X^3\right] = \int_{-\infty}^{\infty} x^3 f_X(x)\,dx = \int_0^1 x^3\,dx
= \left[\frac{1}{4}x^4\right]_0^1 = \frac{1}{4}
\]
4 See p. 134.
Chapter 46
Exponential distribution
How much time will elapse before an earthquake occurs in a given region? How
long do we need to wait before a customer enters our shop? How long will it take
before a call center receives the next phone call? How long will a piece of machinery
work without breaking down?
Questions such as these are often answered in probabilistic terms using the
exponential distribution.
All these questions concern the time we need to wait before a certain event
occurs. If this waiting time is unknown, it is often appropriate to think of it as a
random variable having an exponential distribution. Roughly speaking, the time
X we need to wait before an event occurs has an exponential distribution if the
probability that the event occurs during a certain time interval is proportional to
the length of that time interval. More precisely, X has an exponential distribution if the conditional probability
\[ P(t < X \leq t + \Delta t \mid X > t) \]
is approximately proportional to the length $\Delta t$ of the time interval.
46.1 Definition
The exponential distribution is characterized as follows.
Definition 250  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
\[ R_X = [0, \infty) \]
Let $\lambda \in \mathbb{R}_{++}$. We say that X has an exponential distribution with parameter $\lambda$ if its probability density function is
\[
f_X(x) = \begin{cases} \lambda\exp(-\lambda x) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
The proportionality condition discussed above can be written as
\[ P(t < X \leq t + \Delta t \mid X > t) = \lambda\,\Delta t + o(\Delta t) \]
Denote by
\[ F_X(x) = P(X \leq x) \]
the distribution function of X and by $S_X(x) = 1 - F_X(x)$ its survival function. Then,
\[
\frac{P(t < X \leq t + \Delta t)}{P(X > t)}
= \frac{F_X(t + \Delta t) - F_X(t)}{1 - F_X(t)}
= -\frac{S_X(t + \Delta t) - S_X(t)}{S_X(t)}
\]
1 See p. 107.
2 See p. 251.
3 See p. 108.
\[
-\frac{S_X(t + \Delta t) - S_X(t)}{S_X(t)} = \lambda\,\Delta t + o(\Delta t)
\]
where $o(\Delta t)$ is a quantity that becomes negligible relative to $\Delta t$ as $\Delta t$ tends to 0. Dividing both sides by $\Delta t$ and taking limits, we obtain
\[
\lim_{\Delta t \to 0} \frac{S_X(t + \Delta t) - S_X(t)}{\Delta t}\cdot\frac{1}{S_X(t)} = -\lambda
\]
or, by the definition of derivative:
\[
\frac{dS_X(t)}{dt}\cdot\frac{1}{S_X(t)} = -\lambda
\]
This differential equation is easily solved using the chain rule:
\[
\frac{dS_X(t)}{dt}\cdot\frac{1}{S_X(t)} = \frac{d\ln(S_X(t))}{dt} = -\lambda
\]
Taking the integral from 0 to x of both sides
\[
\int_0^x \frac{d\ln(S_X(t))}{dt}\,dt = \int_0^x (-\lambda)\,dt
\]
we obtain
\[
\left[\ln(S_X(t))\right]_0^x = \left[-\lambda t\right]_0^x
\]
or
\[
\ln(S_X(x)) = \ln(S_X(0)) - \lambda x
\]
But X cannot take negative values. So
\[ S_X(0) = 1 - F_X(0) = 1 \]
which implies
\[ \ln(S_X(x)) = -\lambda x \]
Exponentiating both sides, we get
\[ S_X(x) = \exp(-\lambda x) \]
Therefore,
\[ 1 - F_X(x) = \exp(-\lambda x) \]
or
\[ F_X(x) = 1 - \exp(-\lambda x) \]
Since the density function is the first derivative of the distribution function, we obtain
\[ f_X(x) = \frac{dF_X(x)}{dx} = \lambda\exp(-\lambda x) \]
which is the density of an exponential random variable. Therefore, the proportionality condition is satisfied only if X is an exponential random variable.
4 See p. 109.
46.4 Variance
The variance of an exponential random variable X is
\[ \operatorname{Var}[X] = \frac{1}{\lambda^2} \]
5 See p. 51.
6 See p. 156.
46.5 Moment generating function
The moment generating function of an exponential random variable X is
\[ M_X(t) = \frac{\lambda}{\lambda - t} \]
Proof. Using the definition of moment generating function, we obtain
\[
\begin{aligned}
M_X(t) &= \operatorname{E}[\exp(tX)] = \int_{-\infty}^{\infty} \exp(tx) f_X(x)\,dx \\
&= \int_0^{\infty} \exp(tx)\,\lambda\exp(-\lambda x)\,dx
= \lambda\int_0^{\infty} \exp((t-\lambda)x)\,dx \\
&= \lambda\left[\frac{1}{t-\lambda}\exp((t-\lambda)x)\right]_0^{\infty}
= \frac{\lambda}{\lambda - t}
\end{aligned}
\]
Of course, the above integrals converge only if $t - \lambda < 0$, i.e., only if $t < \lambda$. Therefore, the moment generating function of an exponential random variable exists for all $t < \lambda$.
The characteristic function of an exponential random variable X is
\[ \varphi_X(t) = \frac{\lambda}{\lambda - it} \]
Proof. Using the definition of characteristic function, we can write
\[
\begin{aligned}
\varphi_X(t) &= \operatorname{E}[\exp(itX)] = \int_{-\infty}^{\infty} \exp(itx) f_X(x)\,dx
= \int_0^{\infty} \exp(itx)\,\lambda\exp(-\lambda x)\,dx \\
&= \lambda\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx + i\lambda\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx
\end{aligned}
\]
Integrating by parts twice, we find
\[
\begin{aligned}
\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
&= \left[-\frac{1}{\lambda}\cos(tx)\exp(-\lambda x)\right]_0^{\infty}
- \int_0^{\infty} \frac{t}{\lambda}\sin(tx)\exp(-\lambda x)\,dx \\
&= \frac{1}{\lambda} - \frac{t^2}{\lambda^2}\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
\end{aligned}
\]
so that
\[
\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
= \frac{1}{\lambda}\left(1 + \frac{t^2}{\lambda^2}\right)^{-1}
= \frac{\lambda}{t^2 + \lambda^2}
\]
and, similarly,
\[
\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx
= \frac{t}{\lambda}\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx
= \frac{t}{t^2 + \lambda^2}
\]
Putting pieces together, we get
\[
\begin{aligned}
\varphi_X(t)
&= \lambda\int_0^{\infty} \cos(tx)\exp(-\lambda x)\,dx + i\lambda\int_0^{\infty} \sin(tx)\exp(-\lambda x)\,dx \\
&= \frac{\lambda^2}{t^2 + \lambda^2} + i\frac{\lambda t}{t^2 + \lambda^2}
= \frac{\lambda(\lambda + it)}{(\lambda - it)(\lambda + it)}
= \frac{\lambda}{\lambda - it}
\end{aligned}
\]
\[
F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - \exp(-\lambda x) & \text{if } x \geq 0 \end{cases}
\]
\[ P(X \leq x + y \mid X > x) = P(X \leq y) \]
for any $x \geq 0$.
Proof. This is proved as follows:
\[
\begin{aligned}
P(X \leq x + y \mid X > x)
&= \frac{P(X \leq x + y \text{ and } X > x)}{P(X > x)} \\
&= \frac{P(x < X \leq x + y)}{P(X > x)} \\
&= \frac{F_X(x + y) - F_X(x)}{1 - F_X(x)} \\
&= \frac{1 - \exp(-\lambda(x + y)) - (1 - \exp(-\lambda x))}{\exp(-\lambda x)} \\
&= \frac{\exp(-\lambda x) - \exp(-\lambda(x + y))}{\exp(-\lambda x)} \\
&= 1 - \exp(-\lambda y) = F_X(y) = P(X \leq y)
\end{aligned}
\]
Remember that X is the time we need to wait before a certain event occurs.
The memoryless property states that the probability that the event happens during
a time interval of length y is independent of how much time x has already elapsed
without the event happening.
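The memoryless property is easy to verify numerically. In the sketch below, the rate lambda = 0.5 and the values x = 3, y = 1.5 are arbitrary illustrative choices:

```python
import math

LAM = 0.5  # arbitrary rate parameter for the illustration

def exp_cdf(x):
    # Distribution function of an exponential random variable.
    return 1 - math.exp(-LAM * x) if x >= 0 else 0.0

x, y = 3.0, 1.5
# P(X <= x + y | X > x) computed from the definition of conditional
# probability, versus the unconditional P(X <= y).
conditional = (exp_cdf(x + y) - exp_cdf(x)) / (1 - exp_cdf(x))
unconditional = exp_cdf(y)
```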
variables is just the product of their moment generating functions (see p. 293).
1 0 See p. 399.
46.9 Solved exercises
Exercise 1
The probability that a new customer enters a shop during a given minute is approximately 1%, irrespective of how many customers have entered the shop during
the previous minutes. Assume that the total time we need to wait before a new
customer enters the shop (denote it by X) has an exponential distribution. What
is the probability that no customer enters the shop during the next hour?
Solution
Time is measured in minutes, so the rate parameter is $\lambda = 0.01$ (the probability that a customer arrives during a given minute is approximately $\lambda$). Therefore, the probability that no customer enters the shop during the next hour is
\[
P(X > 60) = 1 - F_X(60) = \exp(-\lambda\cdot 60) = \exp(-0.6) \approx 0.5488
\]
Exercise 2
Let X be an exponential random variable with parameter = ln (3). Compute the
probability
P (2 X 4)
Solution
First of all, we can write the probability as
\[ P(2 \leq X \leq 4) = P(X \leq 4) - P(X < 2) = P(X \leq 4) - P(X \leq 2) \]
where we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero. Now, the probability can be written in terms of the distribution function of X as
\[
P(2 \leq X \leq 4) = F_X(4) - F_X(2)
= \left(1 - \exp(-4\ln 3)\right) - \left(1 - \exp(-2\ln 3)\right)
= 3^{-2} - 3^{-4} = \frac{1}{9} - \frac{1}{81} = \frac{8}{81}
\]
Exercise 3
Suppose the random variable X has an exponential distribution with parameter
= 1. Compute the probability
P (X > 2)
1 1 See p. 109.
Solution
The above probability can be easily computed using the distribution function of X:
\[
P(X > 2) = 1 - P(X \leq 2) = 1 - F_X(2) = \exp(-1\cdot 2) = \exp(-2) \approx 0.1353
\]
Exercise 4
What is the probability that a random variable X is less than its expected value,
if X has an exponential distribution with parameter ?
Solution
The expected value of an exponential random variable with parameter $\lambda$ is
\[ \operatorname{E}[X] = \frac{1}{\lambda} \]
Therefore,
\[
P(X \leq \operatorname{E}[X]) = P\!\left(X \leq \frac{1}{\lambda}\right) = F_X\!\left(\frac{1}{\lambda}\right)
= 1 - \exp\!\left(-\lambda\,\frac{1}{\lambda}\right) = 1 - \exp(-1)
\]
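Exercise 4 shows that the answer does not depend on lambda, which can be confirmed numerically (a minimal sketch):

```python
import math

def prob_below_mean(lam):
    # P(X <= E[X]) = F(1/lam) for an exponential with rate lam.
    return 1 - math.exp(-lam * (1 / lam))

# The result should equal 1 - exp(-1) for every choice of lambda.
values = [prob_below_mean(lam) for lam in (0.1, 1.0, 7.3)]
```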
Chapter 47
Normal distribution
The normal distribution is one of the cornerstones of probability theory and statistics, because of the role it plays in the Central Limit Theorem, because of its analytical tractability and because many real-world phenomena involve random quantities that are approximately normal (e.g. errors in scientific measurement). It is often called Gaussian distribution, in honor of Carl Friedrich Gauss (1777-1855), an eminent German mathematician who gave important contributions towards a better understanding of the normal distribution.
Sometimes it is also referred to as "bell-shaped distribution", because the graph of its probability density function resembles the shape of a bell.
[Figure: plot of the standard normal probability density function, symmetric around zero.]
As you can see from the above plot of the density of a normal distribution, the
density is symmetric around the mean (indicated by the vertical line at zero). As a
consequence, deviations from the mean having the same magnitude, but different
signs, have the same probability. The density is also very concentrated around
the mean and becomes very small by moving from the center to the left or to the
right of the distribution (the so called "tails" of the distribution). This means that
1 See p. 545.
the further a value is from the center of the distribution, the less probable it is to
observe that value.
The remainder of this lecture gives a formal presentation of the main characteristics of the normal distribution, dealing first with the special case in which the distribution has zero mean and unit variance, then with the general case, in which mean and variance can take any value.
47.1.1 Definition
The standard normal distribution is characterized as follows:
Definition 252  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
We say that X has a standard normal distribution if its probability density function is
\[ f_X(x) = (2\pi)^{-1/2}\exp\!\left(-\frac{1}{2}x^2\right) \]
That this density integrates to 1 can be verified as follows:
\[
\begin{aligned}
\int_{-\infty}^{\infty} f_X(x)\,dx
&= (2\pi)^{-1/2}\int_{-\infty}^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&\overset{A}{=} (2\pi)^{-1/2}\,2\int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&\overset{B}{=} (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\!\!\int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\left(1+s^2\right)\right)x\,dx\,ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\left[-\frac{1}{1+s^2}\exp\!\left(-\frac{1}{2}x^2\left(1+s^2\right)\right)\right]_0^{\infty}ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\int_0^{\infty}\left(0 + \frac{1}{1+s^2}\right)ds\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\left[\arctan(s)\right]_0^{\infty}\right)^{1/2} \\
&= (2\pi)^{-1/2}\,2\left(\frac{\pi}{2} - 0\right)^{1/2}
= 2^{-1/2}\pi^{-1/2}\cdot 2\cdot\pi^{1/2}2^{-1/2} = 1
\end{aligned}
\]
where: in step A we have used the fact that the integrand is even; in step B we have squared the integral and made a change of variable ($y = xs$).
E [X] = 0
47.1.3 Variance
The variance of a standard normal random variable X is:
Var [X] = 1
Proof. It can be proved with the usual formula for computing the variance:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx
= (2\pi)^{-1/2}\int_{-\infty}^{\infty} x^2\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&= (2\pi)^{-1/2}\left\{\int_{-\infty}^{0} x\cdot x\exp\!\left(-\frac{1}{2}x^2\right)dx
+ \int_0^{\infty} x\cdot x\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&\overset{A}{=} (2\pi)^{-1/2}\left\{\left[-x\exp\!\left(-\frac{1}{2}x^2\right)\right]_{-\infty}^{0}
+ \int_{-\infty}^{0}\exp\!\left(-\frac{1}{2}x^2\right)dx\right. \\
&\qquad\left.+\left[-x\exp\!\left(-\frac{1}{2}x^2\right)\right]_0^{\infty}
+ \int_0^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&= (2\pi)^{-1/2}\left\{(0-0) + (0-0)
+ \int_{-\infty}^{\infty}\exp\!\left(-\frac{1}{2}x^2\right)dx\right\} \\
&= \int_{-\infty}^{\infty} f_X(x)\,dx \overset{B}{=} 1
\end{aligned}
\]
where: in step A we have performed an integration by parts; in step B we have used the fact that the integral of a probability density function over its support is equal to 1. Finally,
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = 1 - 0 = 1
\]
The characteristic function of a standard normal random variable X is
\[ \varphi_X(t) = \exp\!\left(-\frac{1}{2}t^2\right) \]
Proof. By the definition of characteristic function,
\[
\varphi_X(t) = \operatorname{E}[\exp(itX)] = \operatorname{E}[\cos(tX)] + i\operatorname{E}[\sin(tX)] = \operatorname{E}[\cos(tX)]
\]
where we have used the fact that $\sin(tx) f_X(x)$ is an odd function of $x$, so that its integral over $\mathbb{R}$ is zero. Now, take the derivative with respect to $t$ of the characteristic function:
\[
\begin{aligned}
\frac{d}{dt}\varphi_X(t)
&= \frac{d}{dt}\operatorname{E}[\exp(itX)]
= \operatorname{E}\!\left[\frac{d}{dt}\exp(itX)\right]
= \operatorname{E}[iX\exp(itX)] \\
&= i\operatorname{E}[X\cos(tX)] - \operatorname{E}[X\sin(tX)] \\
&= i\int_{-\infty}^{\infty} x\cos(tx) f_X(x)\,dx - \int_{-\infty}^{\infty} x\sin(tx) f_X(x)\,dx \\
&\overset{A}{=} -\int_{-\infty}^{\infty} x\sin(tx) f_X(x)\,dx \\
&= -\int_{-\infty}^{\infty} x\sin(tx)\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}x^2\right)dx \\
&= \int_{-\infty}^{\infty} \sin(tx)\frac{d}{dx}\!\left[\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}x^2\right)\right]dx
= \int_{-\infty}^{\infty} \sin(tx)\frac{d}{dx} f_X(x)\,dx \\
&\overset{B}{=} \left[\sin(tx) f_X(x)\right]_{-\infty}^{\infty}
- \int_{-\infty}^{\infty} t\cos(tx) f_X(x)\,dx
= -t\int_{-\infty}^{\infty} \cos(tx) f_X(x)\,dx
\end{aligned}
\]
where: in step A we have used the fact that $x\cos(tx) f_X(x)$ is an odd function of $x$; in step B we have performed an integration by parts. Putting together the previous two results, we obtain:
\[
\frac{d}{dt}\varphi_X(t) = -t\,\varphi_X(t)
\]
The only function that satisfies this ordinary differential equation (subject to the condition $\varphi_X(0) = \operatorname{E}[\exp(i\cdot 0\cdot X)] = 1$) is:
\[
\varphi_X(t) = \exp\!\left(-\frac{1}{2}t^2\right)
\]
There is no simple closed-form expression for the distribution function of a standard normal random variable; its values are usually computed with specialized algorithms, for example with the MATLAB command
normcdf(x)
which returns the value at the point x of the distribution function. Some values of the distribution function are used very frequently and people usually learn them by heart. Moreover, the relation
\[ F_X(-x) = 1 - F_X(x) \]
which is due to the symmetry around 0 of the standard normal density, is often used in calculations.
In the past, when computers were not widely available, people used to look up
the values of FX (x) in normal distribution tables. A normal distribution table
is a table where FX (x) is tabulated for several values of x. For values of x that
are not tabulated, approximations of FX (x) can be computed by interpolating the
two tabulated values that are closest to x. For example, if x is not tabulated, x1
is the greatest tabulated number smaller than x and x2 is the smallest tabulated
number greater than x, the approximation is as follows:
\[
F_X(x) \approx F_X(x_1) + (x - x_1)\,\frac{F_X(x_2) - F_X(x_1)}{x_2 - x_1}
\]
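The interpolation scheme can be sketched in Python. Here the "table" is generated with the error function (Phi(x) = (1 + erf(x / sqrt(2))) / 2); the tabulated points x1 = 1.64 and x2 = 1.65 are illustrative choices.

```python
import math

def std_normal_cdf(x):
    # Exact standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two adjacent "tabulated" points, as in a printed normal table.
x1, x2 = 1.64, 1.65
F1, F2 = std_normal_cdf(x1), std_normal_cdf(x2)

def interpolate(x):
    # Linear interpolation formula from the text.
    return F1 + (x - x1) * (F2 - F1) / (x2 - x1)

approx = interpolate(1.645)
exact = std_normal_cdf(1.645)
```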
47.2.1 Definition
The normal distribution with mean $\mu$ and variance $\sigma^2$ is characterized as follows:
Definition 253  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
Let $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_{++}$. We say that X has a normal distribution with mean $\mu$ and variance $\sigma^2$ if its probability density function is
\[
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)
\]
We often indicate that X has a normal distribution with mean $\mu$ and variance $\sigma^2$ by:
\[ X \sim N\!\left(\mu, \sigma^2\right) \]
Proof. This can be easily proved using the formula for the density of a function of an absolutely continuous variable: with $X = \mu + \sigma Z$, where Z has a standard normal distribution and $g(z) = \mu + \sigma z$,
\[
f_X(x) = f_Z\!\left(g^{-1}(x)\right)\left|\frac{dg^{-1}(x)}{dx}\right|
= \frac{1}{\sigma}\,f_Z\!\left(\frac{x-\mu}{\sigma}\right)
= \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)
\]
\[ \operatorname{E}[X] = \mu \]
\[
\operatorname{E}[X] = \operatorname{E}[\mu + \sigma Z] = \mu + \sigma\operatorname{E}[Z] = \mu + \sigma\cdot 0 = \mu
\]
47.2.4 Variance
The variance of a normal random variable X is:
\[ \operatorname{Var}[X] = \sigma^2 \]
Proof. It can be derived using the formula for the variance of linear transformations on $X = \mu + \sigma Z$ (where Z has a standard normal distribution):
\[
\operatorname{Var}[X] = \operatorname{Var}[\mu + \sigma Z] = \sigma^2\operatorname{Var}[Z] = \sigma^2
\]
The moment generating function of a standard normal random variable Z is
\[ M_Z(t) = \exp\!\left(\frac{1}{2}t^2\right) \]
We can use the formula for the moment generating function of a linear transformation:
\[
M_X(t) = \exp(\mu t)\,M_Z(\sigma t)
= \exp(\mu t)\exp\!\left(\frac{1}{2}\sigma^2 t^2\right)
= \exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)
\]
Similarly, the characteristic function of a standard normal random variable Z is
\[ \varphi_Z(t) = \exp\!\left(-\frac{1}{2}t^2\right) \]
We can use the formula for the characteristic function of a linear transformation:
\[
\varphi_X(t) = \exp(i\mu t)\,\varphi_Z(\sigma t)
= \exp\!\left(i\mu t - \frac{1}{2}\sigma^2 t^2\right)
\]
9 See p. 293.
1 0 See p. 310.
\[
F_X(x) = P(X \leq x) = P(\mu + \sigma Z \leq x)
= P\!\left(Z \leq \frac{x-\mu}{\sigma}\right)
= F_Z\!\left(\frac{x-\mu}{\sigma}\right)
\]
Exercise 1
Let X be a normal random variable with mean $\mu = 3$ and variance $\sigma^2 = 4$. Compute the following probability:
\[ P(-0.92 \leq X \leq 6.92) \]
Solution
First of all, we need to express the above probability in terms of the distribution function of X:
\[
\begin{aligned}
P(-0.92 \leq X \leq 6.92)
&= P(X \leq 6.92) - P(X < -0.92) \\
&\overset{A}{=} P(X \leq 6.92) - P(X \leq -0.92) \\
&= F_X(6.92) - F_X(-0.92)
\end{aligned}
\]
where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero.
Then, we need to express the distribution function of X in terms of the distribution function of a standard normal random variable Z:
\[
F_X(x) = F_Z\!\left(\frac{x-\mu}{\sigma}\right) = F_Z\!\left(\frac{x-3}{2}\right)
\]
Therefore,
\[
P(-0.92 \leq X \leq 6.92) = F_Z(1.96) - F_Z(-1.96) = 0.975 - 0.025 = 0.95
\]
Exercise 2
Let X be a random variable having a normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 16$. Compute the following probability:
P (X > 9)
Solution
We need to use the same technique used in the previous exercise and express the
probability in terms of the distribution function of a standard normal random
variable:
\[
P(X > 9) = 1 - P(X \leq 9) = 1 - F_X(9)
= 1 - F_Z\!\left(\frac{9-1}{\sqrt{16}}\right) = 1 - F_Z(2)
= 1 - 0.9772 = 0.0228
\]
where the value FZ (2) can be found with a computer algorithm, for example with
the MATLAB command
normcdf(2)
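Outside MATLAB, the same value can be obtained with the error function available in the Python standard library (a sketch of Exercise 2):

```python
import math

def normcdf(x):
    # Standard normal distribution function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Exercise 2: X ~ N(1, 16), so P(X > 9) = 1 - Phi((9 - 1) / 4).
p = 1 - normcdf((9 - 1) / math.sqrt(16))
```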
Exercise 3
Suppose the random variable X has a normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 1$. Define the random variable Y as follows:
\[ Y = \exp(2 + 3X) \]
Compute the expected value of Y.
Solution
The moment generating function of X is:
\[
M_X(t) = \operatorname{E}[\exp(tX)] = \exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)
= \exp\!\left(t + \frac{1}{2}t^2\right)
\]
Therefore,
\[
\operatorname{E}[Y] = \operatorname{E}[\exp(2)\exp(3X)] = \exp(2)\,M_X(3)
= \exp(2)\exp\!\left(3 + \frac{9}{2}\right) = \exp\!\left(\frac{19}{2}\right)
\]
Chapter 48
Chi-square distribution
48.1 Definition
Chi-square random variables are characterized as follows.
Definition 256  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
RX = [0; 1)
Let $n \in \mathbb{N}$. We say that X has a Chi-square distribution with n degrees of freedom if its probability density function is
\[
f_X(x) = \begin{cases} c\,x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
where c is a constant:
\[ c = \frac{1}{2^{n/2}\,\Gamma(n/2)} \]
and $\Gamma(\cdot)$ is the Gamma function.
The following notation is often employed to indicate that a random variable X has a Chi-square distribution with n degrees of freedom:
\[ X \sim \chi^2(n) \]
where the symbol $\sim$ means "is distributed as" and $\chi^2(n)$ indicates a Chi-square distribution with n degrees of freedom.
1 See p. 233.
2 See p. 376.
3 See p. 107.
4 See p. 55.
The expected value of a Chi-square random variable X is
\[ \operatorname{E}[X] = n \]
48.3 Variance
The variance of a Chi-square random variable X is
\[ \operatorname{Var}[X] = 2n \]
Proof. It can be derived thanks to the usual formula for computing the variance. Integrating by parts twice:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= c\int_0^{\infty} x^{n/2+1}\exp\!\left(-\frac{1}{2}x\right)dx \\
&\overset{A}{=} c\left\{\left[-2x^{n/2+1}\exp\!\left(-\frac{1}{2}x\right)\right]_0^{\infty}
+ (n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= c\left\{(0-0) + (n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&\overset{B}{=} c\,(n+2)\left\{\left[-2x^{n/2}\exp\!\left(-\frac{1}{2}x\right)\right]_0^{\infty}
+ n\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= c\,(n+2)\left\{(0-0) + n\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx\right\} \\
&= (n+2)\,n\int_0^{\infty} c\,x^{n/2-1}\exp\!\left(-\frac{1}{2}x\right)dx
= (n+2)\,n\int_0^{\infty} f_X(x)\,dx
\overset{C}{=} (n+2)\,n
\end{aligned}
\]
where: in steps A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Finally:
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = (n+2)\,n - n^2 = 2n
\]
The moment generating function of a Chi-square random variable X is
\[ M_X(t) = (1-2t)^{-n/2} \quad \text{for } t < \frac{1}{2} \]
Proof. After the change of variable
\[ y = \left(\frac{1}{2} - t\right)x \]
the integral defining the moment generating function becomes
\[
\begin{aligned}
M_X(t)
&= c\int_0^{\infty}\left(\frac{2}{1-2t}\right)^{n/2-1} y^{n/2-1}\exp(-y)\,\frac{2}{1-2t}\,dy \\
&= c\left(\frac{2}{1-2t}\right)^{n/2}\int_0^{\infty} y^{n/2-1}\exp(-y)\,dy \\
&\overset{B}{=} c\left(\frac{2}{1-2t}\right)^{n/2}\Gamma(n/2) \\
&\overset{C}{=} \frac{1}{2^{n/2}\,\Gamma(n/2)}\left(\frac{2}{1-2t}\right)^{n/2}\Gamma(n/2)
= \frac{1}{2^{n/2}}\,\frac{2^{n/2}}{(1-2t)^{n/2}}
= (1-2t)^{-n/2}
\end{aligned}
\]
where: in step B we have used the definition of the Gamma function; in step C we have used the definition of c.
The characteristic function of a Chi-square random variable X is
\[ \varphi_X(t) = (1-2it)^{-n/2} \]
The derivation proceeds like the one for the moment generating function: in step A we substitute the Taylor series expansion of $\exp(itx)$; in step B we define
\[
f_k(x) = \frac{1}{2^{k+n/2}\,\Gamma(k+n/2)}\,x^{k+n/2-1}\exp\!\left(-\frac{1}{2}x\right)
\]
which is the probability density function of a Chi-square random variable with $2k+n$ degrees of freedom, so that the integrals equal 1. The resulting series
\[
1 + \sum_{k=1}^{\infty}\frac{1}{k!}\,(2it)^k\prod_{j=0}^{k-1}\left(\frac{n}{2}+j\right)
\]
is the Taylor series expansion of $(1-2it)^{-n/2}$, which you can verify by computing the expansion yourself.
The distribution function of a Chi-square random variable X is
\[ F_X(x) = \frac{\gamma(n/2,\,x/2)}{\Gamma(n/2)} \]
where $\gamma(\cdot,\cdot)$ is the lower incomplete Gamma function.
chi2cdf(x,n)
returns the value at the point x of the distribution function of a Chi-square random
variable with n degrees of freedom.
In the past, when computers were not widely available, people used to look up
the values of $F_X(x)$ in Chi-square distribution tables. A Chi-square distribution table is a table where $F_X(x)$ is tabulated for several values of x and n. For
values of x that are not tabulated, approximations of FX (x) can be computed by
interpolation, with the same procedure described for the normal distribution (p.
380).
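A small stand-in for chi2cdf can be written from the lower incomplete Gamma function, using its power series. This is a sketch adequate for moderate x, not a production implementation:

```python
import math

def lower_gamma_reg(s, x, terms=200):
    # Regularized lower incomplete Gamma function gamma(s, x) / Gamma(s),
    # via the series gamma(s, x) = x^s e^-x sum_k x^k / (s (s+1) ... (s+k)).
    total, term = 0.0, 1.0 / s
    for k in range(1, terms):
        total += term
        term *= x / (s + k)
    return total * math.exp(-x + s * math.log(x)) / math.gamma(s)

def chi2cdf(x, n):
    # Distribution function of a Chi-square with n degrees of freedom.
    return lower_gamma_reg(n / 2, x / 2)
```

For n = 2 the Chi-square distribution function reduces to 1 - exp(-x/2), which gives an easy correctness check.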
The sum of two independent Chi-square random variables is itself a Chi-square random variable, with degrees of freedom equal to the sum of their degrees of freedom:
\[
\left.\begin{aligned} &X_1 \sim \chi^2(n_1),\ X_2 \sim \chi^2(n_2) \\ &X_1 \text{ and } X_2 \text{ are independent} \end{aligned}\right\}
\implies X_1 + X_2 \sim \chi^2(n_1 + n_2)
\]
This can be generalized to sums of more than two Chi-square random variables, provided they are mutually independent:
\[
\left.\begin{aligned} &X_i \sim \chi^2(n_i) \text{ for } i = 1,\ldots,k \\ &X_1, X_2, \ldots, X_k \text{ are mutually independent} \end{aligned}\right\}
\implies \sum_{i=1}^{k} X_i \sim \chi^2\!\left(\sum_{i=1}^{k} n_i\right)
\]
Proof. This can be easily proved using moment generating functions. The moment generating function of $X_i$ is
\[ M_{X_i}(t) = (1-2t)^{-n_i/2} \]
Define
\[ X = \sum_{i=1}^{k} X_i \]
Since the variables are mutually independent, the moment generating function of their sum is the product of their moment generating functions:
\[
M_X(t) = \prod_{i=1}^{k} M_{X_i}(t) = \prod_{i=1}^{k}(1-2t)^{-n_i/2} = (1-2t)^{-n/2}
\]
where
\[ n = \sum_{i=1}^{k} n_i \]
which is the moment generating function of a Chi-square random variable with n degrees of freedom.
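The additivity property can be illustrated by simulation: building X1 from 3 squared standard normals and X2 from 5, the sum should have the mean (8) and variance (16) of a Chi-square with 8 degrees of freedom.

```python
import random

random.seed(1)
samples = []
for _ in range(100_000):
    x1 = sum(random.gauss(0, 1) ** 2 for _ in range(3))  # ~ chi2(3)
    x2 = sum(random.gauss(0, 1) ** 2 for _ in range(5))  # ~ chi2(5)
    samples.append(x1 + x2)                               # ~ chi2(8)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```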
A Chi-square random variable with one degree of freedom can be written as
\[ X = Z^2 \]
where Z has a standard normal distribution. Its distribution function is
\[
F_X(x) \overset{A}{=} P(X \leq x) \overset{B}{=} P\!\left(Z^2 \leq x\right)
\overset{C}{=} P\!\left(-x^{1/2} \leq Z \leq x^{1/2}\right)
\overset{D}{=} \int_{-x^{1/2}}^{x^{1/2}} f_Z(z)\,dz
\]
9 See p. 293.
10 See p. 376.
Exercise 1
Let X be a Chi-square random variable with 3 degrees of freedom. Compute the
probability
\[ P(0.35 \leq X \leq 7.81) \]
Solution
First of all, we need to express the above probability in terms of the distribution function of X:
\[
P(0.35 \leq X \leq 7.81) \overset{A}{=} F_X(7.81) - F_X(0.35) \overset{B}{=} 0.95 - 0.05 = 0.90
\]
where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any specific value is equal to zero; in step B the values
\[ F_X(7.81) = 0.95 \qquad F_X(0.35) = 0.05 \]
can be computed with a computer algorithm, for example with the MATLAB commands chi2cdf(7.81,3) and chi2cdf(0.35,3).
Exercise 2
Let $X_1$ and $X_2$ be two independent normal random variables having mean $\mu = 0$ and variance $\sigma^2 = 16$. Compute the probability
\[ P\!\left(X_1^2 + X_2^2 > 8\right) \]
Solution
First of all, the two variables $X_1$ and $X_2$ can be written as
\[ X_1 = 4Z_1 \qquad X_2 = 4Z_2 \]
where $Z_1$ and $Z_2$ are two standard normal random variables. Thus, we can write
\[
P\!\left(X_1^2 + X_2^2 > 8\right) = P\!\left(16Z_1^2 + 16Z_2^2 > 8\right)
= P\!\left(Z_1^2 + Z_2^2 > \frac{1}{2}\right)
\]
but the sum $Y = Z_1^2 + Z_2^2$ has a Chi-square distribution with 2 degrees of freedom. Therefore,
\[
P\!\left(X_1^2 + X_2^2 > 8\right)
= 1 - P\!\left(Z_1^2 + Z_2^2 \leq \frac{1}{2}\right)
= 1 - F_Y\!\left(\frac{1}{2}\right)
\]
Exercise 3
Suppose the random variable X has a Chi-square distribution with 5 degrees of freedom. Define the random variable Y as follows:
\[ Y = \exp(1 - X) \]
Compute the expected value of Y.
Solution
The expected value of Y can be easily calculated using the moment generating function of X:
\[ M_X(t) = \operatorname{E}[\exp(tX)] = (1-2t)^{-5/2} \]
Now, exploiting the linearity of the expected value, we obtain
\[
\operatorname{E}[Y] = \operatorname{E}[\exp(1)\exp(-X)] = \exp(1)\,\operatorname{E}[\exp(-X)]
= \exp(1)\,M_X(-1) = \exp(1)\,(1+2)^{-5/2} = \frac{e}{3^{5/2}}
\]
Chapter 49
Gamma distribution
49.1 Definition
Gamma random variables are characterized as follows:
Definition 257  Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers:
\[ R_X = [0, \infty) \]
Let $n, h \in \mathbb{R}_{++}$. We say that X has a Gamma distribution with parameters n and h if its probability density function is
\[
f_X(x) = \begin{cases} c\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}
\]
where c is a constant:
\[ c = \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)} \]
and $\Gamma(\cdot)$ is the Gamma function.
The expected value of a Gamma random variable X is
\[ \operatorname{E}[X] = h \]
49.3 Variance
The variance of a Gamma random variable X is
\[ \operatorname{Var}[X] = 2\,\frac{h^2}{n} \]
Proof. It can be derived thanks to the usual formula for computing the variance:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_0^{\infty} x^2 f_X(x)\,dx
= c\int_0^{\infty} x^{n/2+1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{A}{=} c\left\{\left[-\frac{2h}{n}x^{n/2+1}\exp\!\left(-\frac{n}{2h}x\right)\right]_0^{\infty}
+ \left(\frac{n}{2}+1\right)\frac{2h}{n}\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= c\left\{(0-0) + \frac{h}{n}(n+2)\int_0^{\infty} x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&\overset{B}{=} c\,\frac{h}{n}(n+2)\left\{\left[-\frac{2h}{n}x^{n/2}\exp\!\left(-\frac{n}{2h}x\right)\right]_0^{\infty}
+ \frac{n}{2}\,\frac{2h}{n}\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= c\,\frac{h}{n}(n+2)\left\{(0-0) + h\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx\right\} \\
&= \frac{h^2}{n}(n+2)\int_0^{\infty} c\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx
= \frac{h^2}{n}(n+2)\int_0^{\infty} f_X(x)\,dx
\overset{C}{=} (n+2)\,\frac{h^2}{n}
\end{aligned}
\]
where: in steps A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Finally:
\[ \operatorname{E}[X]^2 = h^2 \]
and
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2
= (n+2)\,\frac{h^2}{n} - h^2 = (n+2-n)\,\frac{h^2}{n} = 2\,\frac{h^2}{n}
\]
49.4 Moment generating function
The moment generating function of a Gamma random variable X is
\[ M_X(t) = \left(1 - \frac{2h}{n}t\right)^{-n/2} \quad \text{for } t < \frac{n}{2h} \]
Proof. Using the definition of moment generating function:
\[
\begin{aligned}
M_X(t)
&= \int_0^{\infty} \exp(tx)\,\frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&= \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\int_0^{\infty} x^{n/2-1}\exp\!\left(-\frac{n}{2h}\left(1-\frac{2h}{n}t\right)x\right)dx \\
&= \frac{(n/h)^{n/2}}{\left(\frac{n}{h}\left(1-\frac{2h}{n}t\right)\right)^{n/2}}
\int_0^{\infty}\frac{\left(\frac{n}{h}\left(1-\frac{2h}{n}t\right)\right)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,
x^{n/2-1}\exp\!\left(-\frac{n}{2h}\left(1-\frac{2h}{n}t\right)x\right)dx \\
&= \left(1-\frac{2h}{n}t\right)^{-n/2}
\end{aligned}
\]
where the last integral equals 1 because it is the integral of the probability density function of a Gamma random variable with parameters n and $h\left(1-\frac{2h}{n}t\right)^{-1}$.
49.5 Characteristic function
The characteristic function of a Gamma random variable X is
\[ \varphi_X(t) = \left(1 - \frac{2h}{n}it\right)^{-n/2} \]
Proof.
\[
\begin{aligned}
\varphi_X(t)
&= \operatorname{E}[\exp(itX)]
= \int_0^{\infty} \exp(itx) f_X(x)\,dx
= c\int_0^{\infty} \exp(itx)\,x^{n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{A}{=} c\sum_{k=0}^{\infty}\frac{1}{k!}(it)^k\int_0^{\infty} x^{k+n/2-1}\exp\!\left(-\frac{n}{2h}x\right)dx \\
&\overset{B}{=} c\sum_{k=0}^{\infty}\frac{1}{k!}(it)^k\,\Gamma(k+n/2)\left(\frac{2h}{n}\right)^{k+n/2} \\
&\overset{C}{=} \sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{2h}{n}it\right)^k\frac{\Gamma(k+n/2)}{\Gamma(n/2)} \\
&\overset{D}{=} 1 + \sum_{k=1}^{\infty}\frac{1}{k!}\left(\frac{2h}{n}it\right)^k\prod_{j=0}^{k-1}\left(\frac{n}{2}+j\right) \\
&\overset{E}{=} \left(1-\frac{2h}{n}it\right)^{-n/2}
\end{aligned}
\]
where: in step A we have substituted $\exp(itx)$ with its Taylor series expansion and exchanged the sum and the integral; in step B we have recognized each integral as the normalization of the density $f_k$ of a Gamma random variable with parameters $2k+n$ and $h\frac{2k+n}{n}$ (densities integrate to 1); in step C we have used the definition of c; in step D we have used the recursion $\Gamma(k+n/2) = \Gamma(n/2)\prod_{j=0}^{k-1}(n/2+j)$; in step E we have recognized the Taylor series expansion of $\left(1-\frac{2h}{n}it\right)^{-n/2}$, which you can verify by computing the expansion yourself.
The distribution function of a Gamma random variable X is
\[ F_X(x) = \frac{\gamma(n/2,\,nx/2h)}{\Gamma(n/2)} \]
where the function
\[ \gamma(s, x) = \int_0^x t^{s-1}\exp(-t)\,dt \]
is called lower incomplete Gamma function and is usually evaluated using specialized computer algorithms.
Proof. This is proved as follows:
\[
\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(t)\,dt
= \int_0^{x} c\,t^{n/2-1}\exp\!\left(-\frac{n}{2h}t\right)dt \\
&\overset{A}{=} c\int_0^{nx/2h}\left(\frac{2h}{n}\right)^{n/2-1} s^{n/2-1}\exp(-s)\,\frac{2h}{n}\,ds
= c\left(\frac{2h}{n}\right)^{n/2}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds \\
&\overset{B}{=} \frac{(n/h)^{n/2}}{2^{n/2}\,\Gamma(n/2)}\left(\frac{2h}{n}\right)^{n/2}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds
= \frac{1}{\Gamma(n/2)}\int_0^{nx/2h} s^{n/2-1}\exp(-s)\,ds \\
&= \frac{\gamma(n/2,\,nx/2h)}{\Gamma(n/2)}
\end{aligned}
\]
where: in step A we have performed a change of variable ($s = \frac{n}{2h}t$); in step B we have used the definition of c.
If Z has a Chi-square distribution with n degrees of freedom, then
\[ X = \frac{h}{n}Z \]
has a Gamma distribution with parameters n and h.
Proof. This can be easily proved using the formula for the density of a function of an absolutely continuous variable: with $g(z) = \frac{h}{n}z$,
\[
f_X(x) = f_Z\!\left(g^{-1}(x)\right)\left|\frac{dg^{-1}(x)}{dx}\right|
= \frac{n}{h}\,f_Z\!\left(\frac{n}{h}x\right)
\]
The density function of a Chi-square random variable with n degrees of freedom is
\[
f_Z(z) = \begin{cases} k\,z^{n/2-1}\exp\!\left(-\frac{1}{2}z\right) & \text{if } z \in [0,\infty) \\ 0 & \text{otherwise} \end{cases}
\]
where
\[ k = \frac{1}{2^{n/2}\,\Gamma(n/2)} \]
Therefore:
\[
f_X(x) = \frac{n}{h}\,f_Z\!\left(\frac{n}{h}x\right)
= \begin{cases} k\left(\frac{n}{h}\right)^{n/2} x^{n/2-1}\exp\!\left(-\frac{1}{2}\frac{n}{h}x\right) & \text{if } x \in [0,\infty) \\ 0 & \text{otherwise} \end{cases}
\]
which is the density of a Gamma distribution with parameters n and h.
Thus, the Chi-square distribution is a special case of the Gamma distribution, because, when $h = n$, we have:
\[ X = \frac{h}{n}Z = \frac{n}{n}Z = Z \]
In other words, a Gamma distribution with parameters n and $h = n$ is just a Chi-square distribution with n degrees of freedom.
Moreover, since a Chi-square random variable Z with n degrees of freedom can be written as a sum $W_1^2 + \ldots + W_n^2$ of squares of n independent standard normal random variables, we have
\[
X = \frac{h}{n}Z = \frac{h}{n}\left(W_1^2 + \ldots + W_n^2\right)
= \left(\sqrt{\frac{h}{n}}\,W_1\right)^2 + \ldots + \left(\sqrt{\frac{h}{n}}\,W_n\right)^2
= Y_1^2 + \ldots + Y_n^2
\]
But the variables $Y_i$ are normal random variables with mean 0 and variance $\frac{h}{n}$. Therefore, a Gamma random variable with parameters n and h can be seen as a sum of squares of n independent normal random variables having mean 0 and variance $h/n$.
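This representation is easy to check by simulation. With n = 4 and h = 6 (arbitrary choices), summing 4 squared normals with variance h/n = 1.5 should reproduce the Gamma mean h = 6 and variance 2h^2/n = 18:

```python
import math
import random

random.seed(2)
n, h = 4, 6
sigma = math.sqrt(h / n)  # each normal has variance h/n

samples = [sum(random.gauss(0, sigma) ** 2 for _ in range(n))
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```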
Exercise 1
Let X1 and X2 be two independent Chi-square random variables having 3 and 5
degrees of freedom respectively. Consider the following random variables:
\[ Y_1 = 2X_1 \qquad Y_2 = \frac{1}{3}X_2 \qquad Y_3 = 3X_1 + 3X_2 \]
What distribution do these variables have?
Solution
Being multiples of Chi-square random variables, the variables Y1 , Y2 and Y3 all have
a Gamma distribution. The random variable X1 has n = 3 degrees of freedom and
the random variable Y1 can be written as
\[ Y_1 = \frac{h}{n}X_1 \]
where h = 6. Therefore Y1 has a Gamma distribution with parameters n = 3 and
h = 6. The random variable X2 has n = 5 degrees of freedom and the random
variable Y2 can be written as
\[ Y_2 = \frac{h}{n}X_2 \]
where $h = 5/3$. Therefore $Y_2$ has a Gamma distribution with parameters $n = 5$
and h = 5=3. The random variable X1 + X2 has a Chi-square distribution with
n = 3 + 5 = 8 degrees of freedom, because X1 and X2 are independent9 , and the
random variable Y3 can be written as
\[ Y_3 = \frac{h}{n}\left(X_1 + X_2\right) \]
where h = 24. Therefore Y3 has a Gamma distribution with parameters n = 8 and
h = 24.
Exercise 2
Let X be a random variable having a Gamma distribution with parameters n = 4
and $h = 2$. Define the following random variables:
\[ Y_1 = \frac{1}{2}X \qquad Y_2 = 5X \qquad Y_3 = 2X \]
What distribution do these variables have?
Solution
Multiplying a Gamma random variable with parameters n and h by a strictly positive constant, one still obtains a Gamma random variable, with the same n and with h multiplied by that constant. In particular, the random variable $Y_1$ is a Gamma random variable with parameters $n = 4$ and
\[ h = \frac{1}{2}\cdot 2 = 1 \]
The random variable $Y_2$ is a Gamma random variable with parameters $n = 4$ and
\[ h = 5\cdot 2 = 10 \]
The random variable $Y_3$ is a Gamma random variable with parameters $n = 4$ and
\[ h = 2\cdot 2 = 4 \]
The random variable $Y_3$ is also a Chi-square random variable with 4 degrees of freedom (remember that a Gamma random variable with parameters n and h is also a Chi-square random variable when $n = h$).
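The scaling rule used in this exercise can be checked by simulation, reusing the Gamma parametrization of this chapter (mean h, variance 2h^2/n). The check below multiplies draws from a Gamma with n = 4, h = 2 by 5 and compares the sample mean with the expected h = 10:

```python
import random

random.seed(4)
n, h = 4, 2
# A Gamma(n, h) draw, built as (h/n) times a sum of n squared
# standard normals (the representation derived in this chapter).
def gamma_draw():
    return (h / n) * sum(random.gauss(0, 1) ** 2 for _ in range(n))

scaled = [5 * gamma_draw() for _ in range(100_000)]  # should be Gamma(4, 10)
mean = sum(scaled) / len(scaled)
```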
9 See the lecture entitled Chi-square distribution (p. 387).
Exercise 3
Let X1 , X2 and X3 be mutually independent normal random variables having mean
$\mu = 0$ and variance $\sigma^2 = 3$. Consider the random variable
\[ X = 2\left(X_1^2 + X_2^2 + X_3^2\right) \]
What distribution does X have?
Solution
The random variable X can be written as
\[
X = 2\left(\sqrt{3}Z_1\right)^2 + 2\left(\sqrt{3}Z_2\right)^2 + 2\left(\sqrt{3}Z_3\right)^2
= \frac{18}{3}\left(Z_1^2 + Z_2^2 + Z_3^2\right)
\]
where $Z_1$, $Z_2$ and $Z_3$ are mutually independent standard normal random variables. The sum $Z_1^2 + Z_2^2 + Z_3^2$ has a Chi-square distribution with 3 degrees of freedom. Therefore, X has a Gamma distribution with parameters $n = 3$ and $h = 18$.
Chapter 50
Student’s t distribution
50.1 The standard Student's t distribution
50.1.1 Definition
The standard Student's t distribution is characterized as follows:
Definition 260  Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers:
\[ R_X = \mathbb{R} \]
Let $n \in \mathbb{R}_{++}$. We say that X has a standard Student's t distribution with n degrees of freedom if its probability density function is
\[
f_X(x) = c\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}
\]
where c is a constant:
\[
c = \frac{1}{\sqrt{n}\,B\!\left(\frac{n}{2},\frac{1}{2}\right)}
\]
and $B(\cdot,\cdot)$ is the Beta function.
The density above can be obtained by marginalization, writing
\[
f_X(x) = \int_0^{\infty} f_{X|Z=z}(x)\,f_Z(z)\,dz
\]
where:
1. $f_{X|Z=z}(x)$ is the probability density function of a normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{z}$:
\[
f_{X|Z=z}(x) = \left(2\pi\sigma^2\right)^{-1/2}\exp\!\left(-\frac{1}{2}\frac{x^2}{\sigma^2}\right)
= (2\pi)^{-1/2}\,z^{1/2}\exp\!\left(-\frac{1}{2}z x^2\right)
\]
2. $f_Z(z)$ is the probability density function of a Gamma random variable with parameters n and $h = 1$:
\[
f_Z(z) = c\,z^{n/2-1}\exp\!\left(-\frac{1}{2}n z\right)
\]
Let us start from the integrand function:
\[
f_{X|Z=z}(x)\,f_Z(z)
= (2\pi)^{-1/2}\,c\,z^{(n+1)/2-1}\exp\!\left(-\frac{x^2+n}{2}z\right)
= (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\,f_{Z|X=x}(z)
\]
where
\[
c_2 = \frac{\left(\frac{x^2+n}{2}\right)^{(n+1)/2}}{\Gamma\!\left(\frac{n+1}{2}\right)}
= \frac{\left(x^2+n\right)^{(n+1)/2}}{2^{n/2}\,2^{1/2}\,\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}
\]
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters $n+1$ and $\frac{n+1}{x^2+n}$. Therefore,
\[
\begin{aligned}
\int_0^{\infty} f_{X|Z=z}(x)\,f_Z(z)\,dz
&= \int_0^{\infty} (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\,f_{Z|X=x}(z)\,dz \\
&\overset{A}{=} (2\pi)^{-1/2}\,c\,\frac{1}{c_2}\int_0^{\infty} f_{Z|X=x}(z)\,dz
\overset{B}{=} (2\pi)^{-1/2}\,c\,\frac{1}{c_2} \\
&= (2\pi)^{-1/2}\,\frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\,
\frac{2^{n/2}\,2^{1/2}\,\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\left(x^2+n\right)^{(n+1)/2}} \\
&= \pi^{-1/2}\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma(n/2)}\,
n^{n/2}\,n^{-(n+1)/2}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2} \\
&\overset{C}{=} n^{-1/2}\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma(1/2)\,\Gamma(n/2)}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2} \\
&\overset{D}{=} n^{-1/2}\,\frac{1}{B\!\left(\frac{n}{2},\frac{1}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-(n+1)/2}
= f_X(x)
\end{aligned}
\]
where: in step A we have used the fact that c and $c_2$ do not depend on z; in step B we have used the fact that the integral of a density function over its support is equal to 1; in step C we have used the fact that $\sqrt{\pi} = \Gamma(1/2)$; in step D we have used the definition of Beta function.
Since X is a zero-mean normal random variable with variance $1/z$, conditional on $Z = z$, we can also think of it as a ratio
\[ X = \frac{Y}{\sqrt{Z}} \]
where Y has a standard normal distribution, Z has a Gamma distribution and Y and Z are independent.
\[ \operatorname{E}[X] = 0 \]
Proof. It follows from the fact that the density function is symmetric around 0:
\[
\begin{aligned}
\operatorname{E}[X] &= \int_{-\infty}^{\infty} x f_X(x)\,dx
= \int_{-\infty}^{0} x f_X(x)\,dx + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{A}{=} -\int_0^{\infty} t f_X(-t)\,dt + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{B}{=} -\int_0^{\infty} x f_X(-x)\,dx + \int_0^{\infty} x f_X(x)\,dx \\
&\overset{C}{=} -\int_0^{\infty} x f_X(x)\,dx + \int_0^{\infty} x f_X(x)\,dx = 0
\end{aligned}
\]
where: in step A we have performed the change of variable $t = -x$ in the first integral; in step B we have renamed the variable of integration; in step C we have used the fact that $f_X(-x) = f_X(x)$.
50.1.4 Variance
The variance of a standard Student's t random variable X is well-defined only for $n > 2$ and it is equal to
\[ \operatorname{Var}[X] = \frac{n}{n-2} \]
Proof. It can be derived thanks to the usual formula for computing the variance and to the integral representation of the Beta function:
\[
\begin{aligned}
\operatorname{E}\!\left[X^2\right]
&= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx
= \int_{-\infty}^{0} x^2 f_X(x)\,dx + \int_0^{\infty} x^2 f_X(x)\,dx \\
&\overset{A,B,C}{=} 2\int_0^{\infty} x^2 f_X(x)\,dx
= 2c\int_0^{\infty} x^2\left(1+\frac{x^2}{n}\right)^{-(n+1)/2}dx \\
&\overset{D}{=} 2c\int_0^{\infty} nt\,(1+t)^{-(n+1)/2}\,\frac{\sqrt{n}}{2\sqrt{t}}\,dt
= c\,n^{3/2}\int_0^{\infty} t^{3/2-1}(1+t)^{-\left(\frac{3}{2}+\left(\frac{n}{2}-1\right)\right)}dt \\
&\overset{E}{=} c\,n^{3/2}\,B\!\left(\frac{3}{2},\,\frac{n}{2}-1\right)
\overset{F}{=} \frac{1}{\sqrt{n}\,B\!\left(\frac{n}{2},\frac{1}{2}\right)}\,n^{3/2}\,B\!\left(\frac{3}{2},\,\frac{n}{2}-1\right) \\
&\overset{G}{=} n\,\frac{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}\cdot
\frac{\Gamma\!\left(\frac{3}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{n}{2}+\frac{1}{2}\right)}
= n\,\frac{\Gamma\!\left(\frac{3}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}\right)} \\
&\overset{H}{=} n\,\frac{\frac{1}{2}\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}-1\right)}{\Gamma\!\left(\frac{1}{2}\right)\left(\frac{n}{2}-1\right)\Gamma\!\left(\frac{n}{2}-1\right)}
= n\,\frac{1/2}{(n-2)/2} = \frac{n}{n-2}
\end{aligned}
\]
where: in steps A, B and C we have exploited the symmetry of the density, as in the proof that $\operatorname{E}[X] = 0$; in step D we have performed the change of variable $t = x^2/n$; in step E we have used the integral representation of the Beta function; in steps F and G we have used the definitions of c and of the Beta function; in step H we have used the relation
\[ \Gamma(z) = (z-1)\,\Gamma(z-1) \]
Finally:
\[ \operatorname{E}[X]^2 = 0 \]
and:
\[
\operatorname{Var}[X] = \operatorname{E}\!\left[X^2\right] - \operatorname{E}[X]^2 = \frac{n}{n-2}
\]
From the above derivation, it should be clear that the variance is well-defined only when $n > 2$. Otherwise, if $n \leq 2$, the above improper integrals do not converge (and the Beta function is not well-defined).
The k-th moment of a standard Student's t random variable X is well-defined only for $k < n$; by the symmetry of the density it is zero when k is odd. It can be computed as
\[
\begin{aligned}
\mu_X(k) &= \operatorname{E}\!\left[X^k\right]
= \int_{-\infty}^{\infty} x^k f_X(x)\,dx
= \int_{-\infty}^{0} x^k f_X(x)\,dx + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{A}{=} \int_0^{\infty} (-t)^k f_X(-t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{B}{=} (-1)^k\int_0^{\infty} t^k f_X(-t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{C}{=} (-1)^k\int_0^{\infty} t^k f_X(t)\,dt + \int_0^{\infty} x^k f_X(x)\,dx
= \left[1 + (-1)^k\right]\int_0^{\infty} x^k f_X(x)\,dx \\
&\overset{D}{=} \left[1 + (-1)^k\right]\frac{c}{2}\,n^{(k+1)/2}\,B\!\left(\frac{k+1}{2},\,\frac{n-k}{2}\right) \\
&= \left[1 + (-1)^k\right]\frac{1}{2}\,n^{k/2}\,
\frac{\Gamma\!\left(\frac{k+1}{2}\right)\Gamma\!\left(\frac{n-k}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n}{2}\right)}
\end{aligned}
\]
where: in steps A to C we have exploited the symmetry of the density; in step D we have performed the change of variable $t = x^2/n$ and used the integral representation of the Beta function, as in the derivation of the variance.
7 Sutradhar,
B. C. (1986) On the characteristic function of multivariate Student t-distribution,
Canadian Journal of Statistics, 14, 329-337.
50.2. THE STUDENT’S T DISTRIBUTION IN GENERAL 415
tcdf(x,n)
returns the value of the distribution function at the point x when the degrees of
freedom parameter is equal to n.
50.2 The Student's t distribution in general
50.2.1 Definition
The Student’s t distribution is characterized as follows:
De…nition 262 Let X be an absolutely continuous random variable. Let its sup-
port be the whole set of real numbers:
RX = R
where c is a constant:
1 1
c= p
n B 2 ; 12
n
X= + Z
Proof. This can be easily proved using the formula for the density of a function9
of an absolutely continuous variable:
1
1 dg (x)
fX (x) = fZ g (x)
dx
x 1
= fZ
! 1
2 (n+1)
2
1 (x )
= c 1+ 2
n
\[
\operatorname{E}[X] = \operatorname{E}[\mu + \sigma Z] = \mu + \sigma\operatorname{E}[Z] = \mu + \sigma\cdot 0 = \mu
\]
As we have seen above, $\operatorname{E}[Z]$ is well-defined only for $n > 1$ and, as a consequence, $\operatorname{E}[X]$ is also well-defined only for $n > 1$.
50.2.4 Variance
The variance of a Student's t random variable X is well-defined only for $n > 2$ and it is equal to
\[
\operatorname{Var}[X] = \sigma^2\,\frac{n}{n-2}
\]
Proof. It can be derived using the formula for the variance of linear transformations on $X = \mu + \sigma Z$ (where Z has a standard t distribution):
\[
\operatorname{Var}[X] = \operatorname{Var}[\mu + \sigma Z] = \sigma^2\operatorname{Var}[Z] = \sigma^2\,\frac{n}{n-2}
\]
As we have seen above, $\operatorname{Var}[Z]$ is well-defined only for $n > 2$ and, as a consequence, $\operatorname{Var}[X]$ is also well-defined only for $n > 2$.
The distribution function of X can be expressed in terms of the distribution function of a standard Student's t random variable Z:
\[
F_X(x) = F_Z\!\left(\frac{x-\mu}{\sigma}\right)
\]
X= + Y
2
which is a normal random variable with mean and variance .
Exercise 1
Let $X_1$ be a normal random variable with mean $\mu = 0$ and variance $\sigma^2 = 4$. Let
X2 be a Gamma random variable with parameters n = 10 and h = 3, independent
of X1 . Find the distribution of the ratio
\[ X = \frac{X_1}{\sqrt{X_2}} \]
Solution
We can write:
\[
X = \frac{X_1}{\sqrt{X_2}} = \frac{2}{\sqrt{3}}\,\frac{Y}{\sqrt{Z}}
\]
where $Y = X_1/2$ has a standard normal distribution and $Z = X_2/3$ has a Gamma distribution with parameters $n = 10$ and $h = 1$. Therefore, the ratio
\[ \frac{Y}{\sqrt{Z}} \]
has a standard Student's t distribution with $n = 10$ degrees of freedom, so X has a Student's t distribution with mean $\mu = 0$, scale $\sigma^2 = 4/3$ and 10 degrees of freedom.
1 3 See p. 535.
1 4 See p. 557.
Exercise 2
Let $X_1$ be a normal random variable with mean $\mu = 3$ and variance $\sigma^2 = 1$. Let
X2 be a Gamma random variable with parameters n = 15 and h = 2, independent
of X1 . Find the distribution of the random variable
\[ X = \sqrt{\frac{2}{X_2}}\,(X_1 - 3) \]
X2
Solution
We can write:
\[
X = \sqrt{\frac{2}{X_2}}\,(X_1 - 3) = \frac{Y}{\sqrt{Z}}
\]
where Y = X1 3 has a standard normal distribution and Z = X2 =2 has a Gamma
distribution with parameters n = 15 and h = 1. Therefore, the ratio
Y
p
Z
has a standard Stutent’s t distribution with n = 15 degrees of freedom.
Exercise 3
Let X be a Student's t random variable with mean $\mu = 1$, scale $\sigma^2 = 4$ and $n = 6$ degrees of freedom. Compute:
\[ P(0 \leq X \leq 1) \]
Solution
First of all, we need to write the probability in terms of the distribution function
of X:
\[
\begin{aligned}
P(0 \leq X \leq 1)
&= P(X \leq 1) - P(X < 0) \\
&\overset{A}{=} P(X \leq 1) - P(X \leq 0) \\
&\overset{B}{=} F_X(1) - F_X(0)
\end{aligned}
\]
where: in step A we have used the fact that any speci…c value of X has probability
zero; in step B we have used the de…nition of distribution function. Then, we
express the distribution function FX (x) in terms of the distribution function of a
standard Student’s t random variable Z with n = 6 degrees of freedom:
x 1
FX (x) = FZ
2
so that:
$$\mathrm{P}(0 \le X \le 1) = F_X(1) - F_X(0) = F_Z(0) - F_Z\!\left(-\frac{1}{2}\right) = 0.1826$$
where the difference $F_Z(0) - F_Z(-1/2)$ can be computed with a computer algorithm, for example using the MATLAB command

tcdf(0,6)-tcdf(-1/2,6)
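The same number can be reproduced without any statistical library by numerically integrating the standard Student's t density with 6 degrees of freedom (a sketch; the grid size is an arbitrary choice):

```python
from math import gamma, pi, sqrt

def t_pdf(z, n):
    """Density of a standard Student's t variable with n degrees of freedom."""
    c = gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
    return c * (1 + z * z / n) ** (-(n + 1) / 2)

def t_prob(a, b, n, steps=10_000):
    """P(a <= Z <= b) via the composite Simpson rule (steps must be even)."""
    h = (b - a) / steps
    total = t_pdf(a, n) + t_pdf(b, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, n)
    return total * h / 3

# P(0 <= X <= 1) for X with mean 1, scale sigma^2 = 4 and n = 6
# equals P(-1/2 <= Z <= 0) for the standard t variable Z.
print(round(t_prob(-0.5, 0.0, 6), 4))  # 0.1826
```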
Chapter 51

F distribution
$$X = \frac{Y_1/n_1}{Y_2/n_2}$$

51.1 Definition

The F distribution is characterized as follows:

Definition 264 Let $X$ be an absolutely continuous random variable. Let its support be the set of positive real numbers:
$$R_X = [0, \infty)$$
where $c$ is a constant:
$$c = \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)}$$
where:
$$f_{X|Z=z}(x) = \frac{(n_1 z)^{n_1/2}}{2^{n_1/2}\,\Gamma(n_1/2)}\, x^{n_1/2 - 1} \exp\!\left(-\frac{1}{2}\, n_1 z\, x\right)$$
and
$$f_Z(z) = \frac{n_2^{n_2/2}}{2^{n_2/2}\,\Gamma(n_2/2)}\, z^{n_2/2 - 1} \exp\!\left(-\frac{1}{2}\, n_2 z\right)$$
Let us start from the integrand function:
$$\begin{aligned}
f_{X|Z=z}(x)\, f_Z(z) &= \frac{(n_1 z)^{n_1/2}\, n_2^{n_2/2}}{2^{n_1/2}\,\Gamma(n_1/2)\; 2^{n_2/2}\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, z^{n_2/2-1} \exp\!\left(-\frac{1}{2}(n_1 x + n_2)\, z\right) \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, z^{(n_1+n_2)/2 - 1} \exp\!\left(-\frac{1}{2}(n_1 x + n_2)\, z\right) \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c}\, f_{Z|X=x}(z)
\end{aligned}$$
where
$$c = \frac{(n_1 x + n_2)^{(n_1+n_2)/2}}{2^{(n_1+n_2)/2}\,\Gamma\!\left((n_1+n_2)/2\right)}$$
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters
$$n_1 + n_2 \quad\text{and}\quad \frac{n_1+n_2}{n_1 x + n_2}$$
Therefore:
$$\begin{aligned}
\int_0^\infty f_{X|Z=z}(x)\, f_Z(z)\, dz &= \int_0^\infty \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c}\, f_{Z|X=x}(z)\, dz \\
&\overset{A}{=} \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c} \int_0^\infty f_{Z|X=x}(z)\, dz \\
&\overset{B}{=} \frac{n_1^{n_1/2}\, n_2^{n_2/2}}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, \frac{1}{c} \\
&= \frac{n_1^{n_1/2}\, n_2^{n_2/2}\,\Gamma\!\left((n_1+n_2)/2\right)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}\, x^{n_1/2-1}\, (n_1 x + n_2)^{-(n_1+n_2)/2} \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)}\, x^{n_1/2-1} \left(1 + \frac{n_1}{n_2}\, x\right)^{-(n_1+n_2)/2}
\end{aligned}$$
where: in step A we have used the fact that $c$ does not depend on $z$; in step B we have used the fact that the integral of a density function over its support is equal to 1; in step C we have used the definition of Beta function.
51.5 Variance

The variance of an F random variable $X$ is well-defined only for $n_2 > 4$ and is equal to
$$\mathrm{Var}[X] = \frac{2\, n_2^2\, (n_1 + n_2 - 2)}{n_1\, (n_2 - 2)^2\, (n_2 - 4)}$$
Proof. It can be derived thanks to the usual formula for computing the variance⁵ and to the integral representation of the Beta function:
$$\begin{aligned}
\mathrm{E}\!\left[X^2\right] &= \int_{-\infty}^{\infty} x^2\, f_X(x)\, dx = \int_0^\infty x^2\, c\, x^{n_1/2-1}\left(1+\frac{n_1}{n_2}x\right)^{-(n_1+n_2)/2} dx \\
&= c \int_0^\infty x^{n_1/2+1}\left(1+\frac{n_1}{n_2}x\right)^{-(n_1+n_2)/2} dx \\
&\overset{A}{=} c \int_0^\infty \left(\frac{n_2}{n_1}\,t\right)^{n_1/2+1} (1+t)^{-(n_1+n_2)/2}\, \frac{n_2}{n_1}\, dt \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} \int_0^\infty t^{n_1/2+1}\, (1+t)^{-n_1/2-n_2/2}\, dt \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} \int_0^\infty t^{(n_1/2+2)-1}\, (1+t)^{-(n_1/2+2)-(n_2/2-2)}\, dt \\
&\overset{B}{=} c \left(\frac{n_2}{n_1}\right)^{n_1/2+2} B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)} \left(\frac{n_2}{n_1}\right)^{n_1/2+2} B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&= \left(\frac{n_2}{n_1}\right)^{2} \frac{1}{B(n_1/2,\, n_2/2)}\, B\!\left(\frac{n_1}{2}+2,\, \frac{n_2}{2}-2\right) \\
&\overset{D}{=} \left(\frac{n_2}{n_1}\right)^{2} \frac{\Gamma(n_1/2+n_2/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \cdot \frac{\Gamma(n_1/2+2)\,\Gamma(n_2/2-2)}{\Gamma(n_1/2+2+n_2/2-2)} \\
&= \left(\frac{n_2}{n_1}\right)^{2} \frac{\Gamma(n_1/2+2)\,\Gamma(n_2/2-2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \\
&\overset{E}{=} \left(\frac{n_2}{n_1}\right)^{2} (n_1/2+1)(n_1/2)\, \frac{1}{(n_2/2-1)(n_2/2-2)} \\
&= \frac{n_2^2\,(n_1+2)\, n_1}{n_1^2\,(n_2-2)(n_2-4)} = \frac{n_2^2\,(n_1+2)}{n_1\,(n_2-2)(n_2-4)}
\end{aligned}$$
5 $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$. See p. 156.
where: in step A we have performed a change of variable ($t = \frac{n_1}{n_2}x$); in step B we have used the integral representation of the Beta function; in step C we have used the definition of $c$; in step D we have used the definition of Beta function; in step E we have used the following property of the Gamma function:
$$\Gamma(z) = (z-1)\,\Gamma(z-1)$$
Finally:
$$\mathrm{E}[X]^2 = \left(\frac{n_2}{n_2-2}\right)^2$$
and:
$$\begin{aligned}
\mathrm{Var}[X] &= \mathrm{E}\!\left[X^2\right] - \mathrm{E}[X]^2 \\
&= \frac{n_2^2\,(n_1+2)}{n_1\,(n_2-2)(n_2-4)} - \frac{n_2^2}{(n_2-2)^2} \\
&= \frac{n_2^2\left((n_1+2)(n_2-2) - n_1(n_2-4)\right)}{n_1\,(n_2-2)^2\,(n_2-4)} \\
&= \frac{n_2^2\left(n_1 n_2 - 2n_1 + 2n_2 - 4 - n_1 n_2 + 4 n_1\right)}{n_1\,(n_2-2)^2\,(n_2-4)} \\
&= \frac{n_2^2\,(2n_1 + 2n_2 - 4)}{n_1\,(n_2-2)^2\,(n_2-4)} = \frac{2\, n_2^2\,(n_1+n_2-2)}{n_1\,(n_2-2)^2\,(n_2-4)}
\end{aligned}$$
It is also clear that the second moment $\mathrm{E}[X^2]$ (and hence the variance) is well-defined only when $n_2 > 4$: when $n_2 \le 4$, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).
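As a quick sanity check (a sketch, not part of the text), the closed-form variance above can be compared against $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$ computed directly from the Gamma-function expression for the moments, here for the illustrative choice $n_1 = 6$, $n_2 = 18$:

```python
from math import gamma

def f_moment(k, n1, n2):
    """k-th moment of an F(n1, n2) variable, valid for n2 > 2k."""
    return (n2 / n1) ** k * gamma(n1 / 2 + k) * gamma(n2 / 2 - k) / (
        gamma(n1 / 2) * gamma(n2 / 2)
    )

def f_variance(n1, n2):
    """Closed-form variance, valid for n2 > 4."""
    return 2 * n2 ** 2 * (n1 + n2 - 2) / (n1 * (n2 - 2) ** 2 * (n2 - 4))

n1, n2 = 6, 18
var_from_moments = f_moment(2, n1, n2) - f_moment(1, n1, n2) ** 2
print(abs(var_from_moments - f_variance(n1, n2)) < 1e-12)  # True
```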
$$\begin{aligned}
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2+k} \int_0^\infty t^{(n_1/2+k)-1}\, (1+t)^{-(n_1/2+k)-(n_2/2-k)}\, dt \\
&\overset{B}{=} c \left(\frac{n_2}{n_1}\right)^{n_1/2+k} B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&\overset{C}{=} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{1}{B(n_1/2,\, n_2/2)} \left(\frac{n_2}{n_1}\right)^{n_1/2+k} B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&= \left(\frac{n_2}{n_1}\right)^{k} \frac{1}{B(n_1/2,\, n_2/2)}\, B\!\left(\frac{n_1}{2}+k,\, \frac{n_2}{2}-k\right) \\
&\overset{D}{=} \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+n_2/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \cdot \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2+k+n_2/2-k)} \\
&= \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}
\end{aligned}$$
where: in step A we have performed a change of variable ($t = \frac{n_1}{n_2}x$); in step B we have used the integral representation of the Beta function; in step C we have used the definition of $c$; in step D we have used the definition of Beta function. It is also clear that the $k$-th moment is well-defined only when $n_2 > 2k$: when $n_2 \le 2k$, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).
$$\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(t)\, dt = \int_{-\infty}^{x} c\, t^{n_1/2-1}\left(1+\frac{n_1}{n_2}\,t\right)^{-(n_1+n_2)/2} dt \\
&\overset{A}{=} c \int_{-\infty}^{n_1 x/n_2} \left(\frac{n_2}{n_1}\,s\right)^{n_1/2-1} (1+s)^{-n_1/2-n_2/2}\, \frac{n_2}{n_1}\, ds \\
&= c \left(\frac{n_2}{n_1}\right)^{n_1/2} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds \\
&\overset{B}{=} \frac{(n_1/n_2)^{n_1/2}}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)} \left(\frac{n_2}{n_1}\right)^{n_1/2} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds \\
&= \frac{1}{B\!\left(\frac{n_1}{2}, \frac{n_2}{2}\right)} \int_{-\infty}^{n_1 x/n_2} s^{n_1/2-1}\, (1+s)^{-n_1/2-n_2/2}\, ds
\end{aligned}$$
where: in step A we have performed a change of variable ($s = \frac{n_1}{n_2}t$); in step B we have used the definition of $c$.
Exercise 1

Let $X_1$ be a Gamma random variable with parameters $n_1 = 3$ and $h_1 = 2$. Let $X_2$ be another Gamma random variable, independent of $X_1$, with parameters $n_2 = 5$ and $h_2 = 6$. Find the expected value of the ratio:
$$\frac{X_1}{X_2}$$

Solution

We can write:
$$X_1 = 2 Z_1 \qquad X_2 = 6 Z_2$$
where $Z_1$ and $Z_2$ are two independent Gamma random variables, the parameters of $Z_1$ are $n_1 = 3$ and $h_1 = 1$ and the parameters of $Z_2$ are $n_2 = 5$ and $h_2 = 1$ (see the lecture entitled Gamma distribution - p. 397). Using this fact, the ratio becomes:
$$\frac{X_1}{X_2} = \frac{2}{6}\,\frac{Z_1}{Z_2} = \frac{1}{3}\,\frac{Z_1}{Z_2}$$
where $Z_1/Z_2$ has an F distribution with parameters $n_1 = 3$ and $n_2 = 5$. Therefore:
$$\mathrm{E}\!\left[\frac{X_1}{X_2}\right] = \mathrm{E}\!\left[\frac{1}{3}\,\frac{Z_1}{Z_2}\right] = \frac{1}{3}\,\mathrm{E}\!\left[\frac{Z_1}{Z_2}\right] = \frac{1}{3}\cdot\frac{n_2}{n_2-2} = \frac{1}{3}\cdot\frac{5}{5-2} = \frac{5}{9}$$
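This expectation can be spot-checked by simulation (a sketch; the sample size and seed are arbitrary). Note that in this book's parametrization a Gamma$(n, h)$ density is proportional to $z^{n/2-1} e^{-nz/(2h)}$, i.e. it has shape $n/2$ and scale $2h/n$:

```python
import random

random.seed(0)

# Gamma(n, h) in this book's parametrization: shape n/2, scale 2h/n (mean h).
def gamma_nh(n, h):
    return random.gammavariate(n / 2, 2 * h / n)

N = 100_000
est = sum(gamma_nh(3, 2) / gamma_nh(5, 6) for _ in range(N)) / N
print(round(est, 2))  # should be close to 5/9 ~ 0.56
```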
Exercise 2

Find the third moment of an F random variable with parameters $n_1 = 6$ and $n_2 = 18$.

Solution

We need to use the formula for the $k$-th moment of an F random variable:
$$\mu_X(k) = \left(\frac{n_2}{n_1}\right)^{k} \frac{\Gamma(n_1/2+k)\,\Gamma(n_2/2-k)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}$$
Setting $k = 3$, $n_1 = 6$ and $n_2 = 18$:
$$\mu_X(3) = 3^3\, \frac{\Gamma(6)\,\Gamma(6)}{\Gamma(3)\,\Gamma(9)} = 27 \cdot \frac{120 \cdot 120}{2 \cdot 40320} = \frac{135}{28} \approx 4.821$$
7 See p. 55.
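The arithmetic can be checked in a couple of lines (a sketch using the moment formula above):

```python
from math import gamma

def f_moment(k, n1, n2):
    """k-th moment of an F(n1, n2) random variable (requires n2 > 2k)."""
    return (n2 / n1) ** k * gamma(n1 / 2 + k) * gamma(n2 / 2 - k) / (
        gamma(n1 / 2) * gamma(n2 / 2)
    )

print(f_moment(3, 6, 18))  # 135/28, i.e. about 4.8214
```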
Chapter 52

Multinomial distribution

52.1.1 Definition

The distribution is characterized as follows.

Definition 266 Let $X$ be a $K \times 1$ discrete random vector. Let the support of $X$ be the set of $K \times 1$ vectors having one entry equal to 1 and all other entries equal to 0:
$$R_X = \left\{ x \in \{0,1\}^K : \sum_{j=1}^{K} x_j = 1 \right\}$$
If you are puzzled by the above definition of the joint pmf, note that when $(x_1, \ldots, x_K) \in R_X$ and $x_i$ is equal to 1, because the $i$-th outcome has been obtained, then all other entries are equal to 0 and
$$\prod_{j=1}^{K} p_j^{x_j} = p_1^{x_1} \cdots p_{i-1}^{x_{i-1}}\, p_i^{x_i}\, p_{i+1}^{x_{i+1}} \cdots p_K^{x_K} = p_1^{0} \cdots p_{i-1}^{0}\, p_i^{1}\, p_{i+1}^{0} \cdots p_K^{0} = p_i$$
$$\mathrm{E}[X_i] = p_i$$
$$\sigma_{ij} = \begin{cases} p_i\,(1-p_i) & \text{if } j = i \\ -p_i\, p_j & \text{if } j \ne i \end{cases} \tag{52.2}$$
If $j = i$, then
$$\sigma_{ii} = \mathrm{E}[X_i X_i] - \mathrm{E}[X_i]\,\mathrm{E}[X_i] = \mathrm{E}[X_i] - \mathrm{E}[X_i]^2 = p_i - p_i^2 = p_i\,(1-p_i)$$
where we have used the fact that $X_i^2 = X_i$, because $X_i$ can take only values 0 and 1. If $j \ne i$, then
$$\sigma_{ij} = \mathrm{E}[X_i X_j] - \mathrm{E}[X_i]\,\mathrm{E}[X_j] = 0 - p_i\, p_j = -p_i\, p_j$$
where we have used the fact that $X_i X_j = 0$, because $X_i$ and $X_j$ cannot be both equal to 1 at the same time.
2 See p. 116.
3 See p. 197.
4 See the lecture entitled Covariance matrix - p. 189.
5 See p. 297.
6 See p. 315.
52.2.1 Definition

Multinomial random vectors are characterized as follows.
$$X = Y_1 + \ldots + Y_n$$
$$[Y_1\ \ldots\ Y_n]$$
7 See p. 29.
$$\mathrm{E}[X] = n p$$
Proof. Using the fact that $X$ can be written as a sum of $n$ multinomials with parameters $p_1, \ldots, p_K$ and 1, we obtain
$$\mathrm{E}[X] = \mathrm{E}[Y_1] + \ldots + \mathrm{E}[Y_n] = n p$$
where the result $\mathrm{E}[Y_j] = p$ has been derived in the previous section (formula 52.1).
$$\mathrm{Var}[X] = n \Sigma$$
$$\sigma_{ij} = \begin{cases} p_i\,(1-p_i) & \text{if } j = i \\ -p_i\, p_j & \text{if } j \ne i \end{cases}$$
$$\mathrm{Var}[X] = \mathrm{Var}[Y_1 + \ldots + Y_n] \overset{A}{=} \mathrm{Var}[Y_1] + \ldots + \mathrm{Var}[Y_n] \overset{B}{=} n\,\mathrm{Var}[Y_1] \overset{C}{=} n \Sigma$$
where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the fact that $Y_1, \ldots, Y_n$ have the same distribution; in step C we have used formula (52.2) for the covariance matrix of $Y_1$.
8 See the lecture entitled Partitions - p. 27.
where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the definition of moment generating function of $Y_l$; in step C we have used formula (52.3) for the moment generating function of $Y_1$.

Proof. The derivation is similar to the derivation of the joint moment generating function:

where: in step A we have used the fact that $Y_1, \ldots, Y_n$ are mutually independent; in step B we have used the definition of characteristic function of $Y_l$; in step C we have used formula (52.4) for the characteristic function of $Y_1$.
Exercise 1

A shop selling two items, labeled A and B, needs to construct a probabilistic model of the sales that will be generated by its next 10 customers. Each time a customer arrives, only three outcomes are possible: 1) nothing is sold; 2) one unit of item A is sold; 3) one unit of item B is sold. It has been estimated that the probabilities of these three outcomes are 0.50, 0.25 and 0.25 respectively. Furthermore, the shopping behavior of a customer is independent of the shopping behavior of all other customers. Denote by $X$ a $3 \times 1$ vector whose entries $X_1$, $X_2$ and $X_3$ are equal to the number of times each of the three outcomes occurs. Derive the expected value and the covariance matrix of $X$.

Solution

The vector $X$ has a multinomial distribution with parameters $n = 10$ and
$$p = \begin{bmatrix} \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{4} \end{bmatrix}^{\top}$$
Exercise 2

Given the assumptions made in the previous exercise, suppose that item A costs $1,000 and item B costs $2,000. Derive the expected value and the variance of the total revenue generated by the 10 customers.

Solution

The total revenue $Y$ can be written as a linear transformation of the vector $X$:
$$Y = A X$$
where
$$A = [0 \quad 1{,}000 \quad 2{,}000]$$
By the linearity of the expected value operator, we obtain
$$\mathrm{E}[Y] = \mathrm{E}[A X] = A\,\mathrm{E}[X] = [0 \quad 1{,}000 \quad 2{,}000] \begin{bmatrix} 5 \\ 5/2 \\ 5/2 \end{bmatrix} = 0 \cdot 5 + 1{,}000 \cdot \frac{5}{2} + 2{,}000 \cdot \frac{5}{2} = 7{,}500$$
By using the formula for the covariance matrix of a linear transformation, we obtain $\mathrm{Var}[Y] = A\,\mathrm{Var}[X]\, A^{\top}$.
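Carrying out this computation numerically (a sketch; it uses $\mathrm{Var}[X] = n(\mathrm{diag}(p) - p p^{\top})$, which collects the entries $\sigma_{ij}$ given earlier in the chapter):

```python
n = 10
p = [0.5, 0.25, 0.25]
a = [0, 1000, 2000]

# Covariance matrix of the multinomial vector: n * (diag(p) - p p^T).
cov = [[n * ((p[i] if i == j else 0.0) - p[i] * p[j]) for j in range(3)]
       for i in range(3)]

mean_y = sum(a[i] * n * p[i] for i in range(3))
var_y = sum(a[i] * cov[i][j] * a[j] for i in range(3) for j in range(3))
print(mean_y, var_y)  # 7500.0 6875000.0
```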
Chapter 53

Multivariate normal distribution

53.1.1 Definition

Standard MV-N random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
1 See p. 375.
$$f_X(x) = (2\pi)^{-K/2} \exp\!\left(-\frac{1}{2}\, x^{\top} x\right)$$
where
$$f(x_i) = (2\pi)^{-1/2} \exp\!\left(-\frac{1}{2}\, x_i^2\right)$$
is the probability density function of a standard normal random variable³.
Therefore, the $K$ components of $X$ are $K$ mutually independent⁴ standard normal random variables. A more detailed proof follows.
Proof. As we have seen, the joint probability density function can be written as
$$f_X(x) = \prod_{i=1}^{K} f(x_i)$$
where $f(x_i)$ is the probability density function of a standard normal random variable. But $f(x_i)$ is also the marginal probability density function⁵ of the $i$-th component of $X$:
$$\begin{aligned}
f_{X_i}(x_i) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{j=1}^{K} f(x_j)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= f(x_i) \int_{-\infty}^{\infty} f(x_1)\, dx_1 \ldots \int_{-\infty}^{\infty} f(x_{i-1})\, dx_{i-1} \int_{-\infty}^{\infty} f(x_{i+1})\, dx_{i+1} \ldots \int_{-\infty}^{\infty} f(x_K)\, dx_K \\
&= f(x_i)
\end{aligned}$$
2 See p. 117.
3 See p. 376.
4 See p. 233.
5 See p. 120.
where, in the last step, we have used the fact that all the integrals are equal to
1, because they are integrals of probability density functions over their respective
supports. Therefore, the joint probability density function of X is equal to the
product of its marginals, which implies that the components of X are mutually
independent.
6 See p. 189.
7 See p. 234.
where: in step A we have used the fact that the components of $X$ are mutually independent⁸; in step B we have used the definition of moment generating function⁹. The moment generating function of a standard normal random variable¹⁰ is
$$M_{X_j}(t_j) = \exp\!\left(\frac{1}{2}\, t_j^2\right)$$
which implies that the joint mgf of $X$ is
$$M_X(t) = \prod_{j=1}^{K} M_{X_j}(t_j) = \prod_{j=1}^{K} \exp\!\left(\frac{1}{2}\, t_j^2\right) = \exp\!\left(\frac{1}{2}\sum_{j=1}^{K} t_j^2\right) = \exp\!\left(\frac{1}{2}\, t^{\top} t\right)$$
The mgf $M_{X_j}(t_j)$ of a standard normal random variable is defined for any $t_j \in \mathbb{R}$. As a consequence, the joint mgf of $X$ is defined for any $t \in \mathbb{R}^K$.
$$\varphi_X(t) = \exp\!\left(-\frac{1}{2}\, t^{\top} t\right)$$
8 See Mutual independence via expectations (p. 234).
9 See p. 297.
10 See p. 378.
where: in step A we have used the fact that the components of $X$ are mutually independent; in step B we have used the definition of joint characteristic function¹¹. The characteristic function of a standard normal random variable is¹²
$$\varphi_{X_j}(t_j) = \exp\!\left(-\frac{1}{2}\, t_j^2\right)$$

53.2.1 Definition

MV-N random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
11 See p. 315.
12 See p. 379.
$$f_X(x) = (2\pi)^{-K/2} \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top} V^{-1} (x-\mu)\right)$$
We indicate that $X$ has a multivariate normal distribution with mean $\mu$ and covariance $V$ by
$$X \sim N(\mu, V)$$
The $K$ random variables $X_1, \ldots, X_K$ constituting the vector $X$ are said to be jointly normal.
$$X = \mu + \Sigma Z \tag{53.1}$$
Proof. This is proved using the formula for the joint density of a linear function¹³ of an absolutely continuous random vector:
$$\begin{aligned}
f_X(x) &= \frac{1}{\left|\det(\Sigma)\right|}\, f_Z\!\left(\Sigma^{-1}(x-\mu)\right) \\
&= \frac{1}{\left|\det(\Sigma)\right|}\,(2\pi)^{-K/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma^{-1}\right)^{\top}\Sigma^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det(\Sigma)\,\det(\Sigma^{\top})\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma\Sigma^{\top}\right)^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det\!\left(\Sigma\Sigma^{\top}\right)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top}\left(\Sigma\Sigma^{\top}\right)^{-1}(x-\mu)\right) \\
&= (2\pi)^{-K/2} \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\,(x-\mu)^{\top} V^{-1}(x-\mu)\right)
\end{aligned}$$
The existence of a matrix $\Sigma$ satisfying $V = \Sigma\Sigma^{\top}$ is guaranteed by the fact that $V$ is symmetric and positive definite.
13 See p. 279. Note that $X = g(Z) = \mu + \Sigma Z$ is a linear one-to-one mapping because $\Sigma$ is invertible.
$$\mathrm{E}[X] = \mu$$
$$\mathrm{E}[X] = \mathrm{E}[\mu + \Sigma Z] = \mu + \Sigma\,\mathrm{E}[Z] = \mu + \Sigma \cdot 0 = \mu$$
$$\mathrm{Var}[X] = V$$
where
$$M_Z(t) = \exp\!\left(\frac{1}{2}\, t^{\top} t\right)$$
14 See, in particular, the Addition to constant matrices (p. 148) and Multiplication by constant
$$\varphi_X(t) = \exp\!\left(i\, t^{\top}\mu - \frac{1}{2}\, t^{\top} V t\right)$$
where
$$\varphi_Z(t) = \exp\!\left(-\frac{1}{2}\, t^{\top} t\right)$$
In other words, mutually independent normal random variables are also jointly
normal.
Proof. This can be proved by showing that the product of the probability density
functions of X1 ; : : : ; XK is equal to the joint probability density function of X (this
is left as an exercise).
Exercise 1

Let $X = [X_1\ X_2]^{\top}$ be a multivariate normal random vector with mean
$$\mu = [1\ 2]^{\top}$$
$$Y = X_1 + X_2$$

Solution

The random variable $Y$ can be written as
$$Y = B X$$
where
$$B = [1\ 1]$$
By using the formula for the joint moment generating function of a linear transformation of a random vector²⁰
$$M_Y(t) = M_X\!\left(B^{\top} t\right)$$

Exercise 2

Let $X = [X_1\ X_2]^{\top}$ be a multivariate normal random vector with mean
$$\mu = [2\ 3]^{\top}$$
$$\mathrm{E}\!\left[X_1^2 X_2\right]$$
20 See p. 301.
21 See p. 285.
Solution

The joint mgf of $X$ is
$$\begin{aligned}
M_X(t) &= \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right) \\
&= \exp\!\left(2 t_1 + 3 t_2 + \frac{1}{2}\left(2 t_1^2 + 2 t_2^2 + 2 t_1 t_2\right)\right) \\
&= \exp\!\left(2 t_1 + 3 t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
$$\mathrm{E}\!\left[X_1^2 X_2\right] = \left.\frac{\partial^3 M_X(t_1, t_2)}{\partial t_1^2\, \partial t_2}\right|_{t_1=0,\, t_2=0}$$
$$\frac{\partial M_X(t_1,t_2)}{\partial t_1} = (2 + 2t_1 + t_2) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)$$
$$\begin{aligned}
\frac{\partial^2 M_X(t_1,t_2)}{\partial t_1^2} &= \frac{\partial}{\partial t_1}\frac{\partial M_X(t_1,t_2)}{\partial t_1} \\
&= 2 \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) + (2 + 2t_1 + t_2)^2 \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
$$\begin{aligned}
\frac{\partial^3 M_X(t_1,t_2)}{\partial t_1^2\,\partial t_2} &= \frac{\partial}{\partial t_2}\frac{\partial^2 M_X(t_1,t_2)}{\partial t_1^2} \\
&= 2\,(3 + 2t_2 + t_1) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) \\
&\quad + 2\,(2 + 2t_1 + t_2) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right) \\
&\quad + (2 + 2t_1 + t_2)^2\,(3 + 2t_2 + t_1) \exp\!\left(2t_1 + 3t_2 + t_1^2 + t_2^2 + t_1 t_2\right)
\end{aligned}$$
Thus,
$$\mathrm{E}\!\left[X_1^2 X_2\right] = \left.\frac{\partial^3 M_X(t_1,t_2)}{\partial t_1^2\,\partial t_2}\right|_{t_1=0,\, t_2=0} = 2 \cdot 3 \cdot 1 + 2 \cdot 2 \cdot 1 + 2^2 \cdot 3 \cdot 1 = 6 + 4 + 12 = 22$$
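The value 22 can be double-checked by differentiating the mgf numerically with central finite differences (a sketch; the step size is an arbitrary choice):

```python
from math import exp

def mgf(t1, t2):
    # Joint mgf derived above.
    return exp(2 * t1 + 3 * t2 + t1 ** 2 + t2 ** 2 + t1 * t2)

def third_mixed_partial(h=1e-3):
    """Approximates d^3 M / (dt1^2 dt2) at (0, 0) by finite differences."""
    def d2_t1(t2):
        # Central second difference in t1.
        return (mgf(h, t2) - 2 * mgf(0, t2) + mgf(-h, t2)) / h ** 2
    # Central first difference in t2.
    return (d2_t1(h) - d2_t1(-h)) / (2 * h)

print(round(third_mixed_partial()))  # 22
```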
Chapter 54

Multivariate Student's t distribution

This lecture deals with the multivariate (MV) Student's t distribution. We first introduce the special case in which the mean is equal to zero and the scale matrix is equal to the identity matrix. We then deal with the more general case.

54.1.1 Definition

Standard multivariate Student's t random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
where
$$c = (n\pi)^{-K/2}\, \frac{\Gamma(n/2 + K/2)}{\Gamma(n/2)}$$
where:
$$\begin{aligned}
f_{X|Z=z}(x) &= c_1 \left|\det(V)\right|^{-1/2} \exp\!\left(-\frac{1}{2}\, x^{\top} V^{-1} x\right) \\
&= c_1 \left(\det\!\left(\frac{1}{z}\, I\right)\right)^{-1/2} \exp\!\left(-\frac{1}{2}\, x^{\top}\left(\frac{1}{z}\, I\right)^{-1} x\right) \\
&= c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right)
\end{aligned}$$
where
$$c_1 = (2\pi)^{-K/2}$$
3 See p. 407.
4 See p. 57.
5 See p. 439.
6 See p. 397.
$$f_Z(z) = c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right)$$
where
$$c_2 = \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}$$
where
$$f_{X|Z=z}(x) = c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right)$$
and
$$f_Z(z) = c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right)$$
We start from the integrand function:
$$\begin{aligned}
f_{X|Z=z}(x)\, f_Z(z) &= c_1\, z^{K/2} \exp\!\left(-\frac{1}{2}\, z\, x^{\top} x\right) c_2\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right) \\
&= c_1 c_2\, z^{(n+K)/2 - 1} \exp\!\left(-\frac{1}{2}\left(x^{\top} x + n\right) z\right) \\
&= c_1 c_2\, \frac{1}{c_3}\, c_3\, z^{(n+K)/2 - 1} \exp\!\left(-\frac{n+K}{2\,\dfrac{n+K}{x^{\top} x + n}}\, z\right) \\
&= c_1 c_2\, \frac{1}{c_3}\, f_{Z|X=x}(z)
\end{aligned}$$
where
$$c_3 = \frac{\left(x^{\top} x + n\right)^{(n+K)/2}}{2^{(n+K)/2}\,\Gamma\!\left((n+K)/2\right)}$$
and $f_{Z|X=x}(z)$ is the probability density function of a random variable having a Gamma distribution with parameters $n+K$ and $\dfrac{n+K}{x^{\top} x + n}$. Therefore,
$$\begin{aligned}
\int_0^\infty f_{X|Z=z}(x)\, f_Z(z)\, dz &= \int_0^\infty c_1 c_2\, \frac{1}{c_3}\, f_{Z|X=x}(z)\, dz \\
&\overset{A}{=} c_1 c_2\, \frac{1}{c_3} \int_0^\infty f_{Z|X=x}(z)\, dz
\end{aligned}$$
$$\begin{aligned}
&\overset{B}{=} c_1 c_2\, \frac{1}{c_3} \\
&= (2\pi)^{-K/2}\, \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\, \frac{2^{n/2}\, 2^{K/2}\,\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\left(x^{\top} x + n\right)^{(n+K)/2}} \\
&= (2\pi)^{-K/2}\, 2^{K/2}\, \frac{n^{n/2}\,\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)}\, n^{-(n+K)/2} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= \pi^{-K/2}\, n^{-K/2}\, \frac{\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= (n\pi)^{-K/2}\, \frac{\Gamma\!\left(\frac{n}{2}+\frac{K}{2}\right)}{\Gamma(n/2)} \left(1 + \frac{1}{n}\, x^{\top} x\right)^{-(n+K)/2} \\
&= f_X(x)
\end{aligned}$$
where: in step A we have used the fact that $c_1$, $c_2$ and $c_3$ do not depend on $z$; in step B we have used the fact that the integral of a probability density function over its support is 1.
Since $X$ has a multivariate normal distribution with mean $0$ and covariance $V = \frac{1}{z}\, I$, conditional on $Z = z$, then we can also think of it as a ratio
$$X = \frac{1}{\sqrt{Z}}\, Y$$
54.1.4 Marginals

The marginal distribution of the $i$-th component of $X$ (denote it by $X_i$) is a standard Student's t distribution with $n$ degrees of freedom. It suffices to note that the marginal probability density function⁷ of $X_i$ can be written as
$$f_{X_i}(x_i) = \int_0^\infty f_{X_i|Z=z}(x_i)\, f_Z(z)\, dz$$
where $f_{X_i|Z=z}(x_i)$ is the marginal density of $X_i\,|\,Z = z$, i.e., the density of a normal random variable⁸ with mean 0 and variance $\frac{1}{z}$:
$$f_{X_i|Z=z}(x_i) = (2\pi)^{-1/2}\, z^{1/2} \exp\!\left(-\frac{1}{2}\, z\, x_i^2\right)$$
7 See p. 120.
8 See p. 375.
$$\begin{aligned}
f_{X_i}(x_i) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[\int_0^\infty f_{X|Z=z}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, f_Z(z)\, dz\right] dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K \\
&= \int_0^\infty \left[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X|Z=z}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_K)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_K\right] f_Z(z)\, dz \\
&= \int_0^\infty f_{X_i|Z=z}(x_i)\, f_Z(z)\, dz
\end{aligned}$$
$$\mathrm{E}[X] = 0 \tag{54.1}$$
$$\begin{aligned}
\mathrm{Var}[X] &\overset{A}{=} \mathrm{E}\!\left[X X^{\top}\right] - \mathrm{E}[X]\,\mathrm{E}[X]^{\top} \\
&\overset{B}{=} \mathrm{E}\!\left[X X^{\top}\right] \\
&\overset{C}{=} \mathrm{E}\!\left[\mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right]\right] \\
&\overset{D}{=} \mathrm{E}\!\left[\mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right] - \mathrm{E}[X\,|\,Z = z]\,\mathrm{E}[X\,|\,Z = z]^{\top}\right] \\
&\overset{E}{=} \mathrm{E}\!\left[\mathrm{Var}[X\,|\,Z = z]\right] \\
&= \mathrm{E}\!\left[\frac{1}{z}\, I\right] = \mathrm{E}\!\left[\frac{1}{z}\right] I
\end{aligned}$$
where: in step A we have used the formula for computing the covariance matrix⁹; in step B we have used the fact that $\mathrm{E}[X] = 0$ (eq. 54.1); in step C we have used the Law of Iterated Expectations¹⁰; in step D we have used the fact that
$$\mathrm{E}[X\,|\,Z = z] = 0$$
and in step E the fact that
$$\mathrm{Var}[X\,|\,Z = z] = \mathrm{E}\!\left[X X^{\top}\,|\,Z = z\right] - \mathrm{E}[X\,|\,Z = z]\,\mathrm{E}[X\,|\,Z = z]^{\top}$$
But
$$\begin{aligned}
\mathrm{E}\!\left[\frac{1}{z}\right] &= \int_0^\infty \frac{1}{z}\, f_Z(z)\, dz \\
&= \int_0^\infty \frac{1}{z}\, \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\, z^{n/2-1} \exp\!\left(-\frac{1}{2}\, n z\right) dz \\
&= \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)} \int_0^\infty z^{(n-2)/2 - 1} \exp\!\left(-\frac{1}{2}\, n z\right) dz \\
&= \frac{n^{n/2}\, 2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\,\Gamma(n/2)\, n^{(n-2)/2}} \int_0^\infty \frac{n^{(n-2)/2}}{2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}\, z^{(n-2)/2-1} \exp\!\left(-\frac{n-2}{2}\,\frac{n}{n-2}\, z\right) dz \\
&\overset{A}{=} \frac{2^{(n-2)/2}\, n^{n/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\, n^{(n-2)/2}\,\Gamma(n/2)} \int_0^\infty \varphi(z)\, dz \\
&\overset{B}{=} \frac{2^{(n-2)/2}\, n^{n/2}\,\Gamma\!\left((n-2)/2\right)}{2^{n/2}\, n^{(n-2)/2}\,\Gamma(n/2)} \\
&\overset{C}{=} n\cdot\frac{1}{2}\cdot\frac{1}{(n-2)/2} \\
&= \frac{n}{n-2}
\end{aligned}$$
9 See p. 190.
10 See p. 225.
where
$$\varphi(z) = \frac{n^{(n-2)/2}}{2^{(n-2)/2}\,\Gamma\!\left((n-2)/2\right)}\, z^{(n-2)/2-1} \exp\!\left(-\frac{n-2}{2}\,\frac{n}{n-2}\, z\right)$$
$$\mathrm{Var}[X] = \mathrm{E}\!\left[\frac{1}{z}\right] I = \frac{n}{n-2}\, I$$
54.2.1 Definition

Multivariate Student's t random vectors are characterized as follows.
$$R_X = \mathbb{R}^K$$
$$X \sim T(\mu, V, n)$$
$$\mathrm{E}[X] = \mathrm{E}[\mu + \Sigma Z] = \mu + \Sigma\,\mathrm{E}[Z] = \mu + \Sigma \cdot 0 = \mu$$
Exercise 1

Let $X$ be a multivariate normal random vector with mean $\mu = 0$ and covariance matrix $V$. Let $Z_1, \ldots, Z_n$ be $n$ normal random variables having zero mean and variance $\sigma^2$. Suppose that $Z_1, \ldots, Z_n$ are mutually independent, and also independent of $X$. Find the distribution of the random vector $Y$ defined as
$$Y = \frac{\sigma^2}{\sqrt{Z_1^2 + \ldots + Z_n^2}}\, X$$

Solution

We can write
$$\begin{aligned}
Y &= \frac{\sigma^2}{\sqrt{Z_1^2 + \ldots + Z_n^2}}\, X \\
&= \frac{\sigma^2}{\sqrt{n}\,\sqrt{\sigma^2\left(W_1^2 + \ldots + W_n^2\right)/n}}\, X \\
&= \frac{\sigma^2}{\sqrt{n}\,\sqrt{\sigma^2\left(W_1^2 + \ldots + W_n^2\right)/n}}\, \Sigma\, Q
\end{aligned}$$
where $W_i = Z_i/\sigma$ are standard normal variables and $Q$ is a standard multivariate normal vector such that $X = \Sigma Q$, with $V = \Sigma\Sigma^{\top}$.
12 See, in particular, the Addition to constant matrices (p. 148) and Multiplication by constant
Chapter 55

Wishart distribution

55.1 Definition

Wishart random matrices are characterized as follows:
$$R_W = \left\{ w \in \mathbb{R}^{K \times K} : w \text{ is symmetric and positive definite} \right\}$$
Let $n$ be a constant such that $n > K - 1$ and let $H$ be a symmetric and positive definite matrix. We say that $W$ has a Wishart distribution with parameters $n$ and $H$.
1 See p. 397.
2 See p. 387.
3 See p. 119.
4 See p. 439.
The parameter $n$ need not be an integer, but, when $n$ is not an integer, $W$ can no longer be interpreted as a sum of outer products of multivariate normal random vectors.
Proof. The proof of this proposition is quite lengthy and complicated. The interested reader might have a look at Ghosh and Sinha⁷ (2002).
$$\mathrm{E}[W] = H$$
Proof. We do not provide a fully general proof, but we prove this result only for the special case in which $n$ is an integer and $W$ can be written as
$$W = \sum_{i=1}^{n} X_i X_i^{\top}$$
$$\begin{aligned}
\mathrm{E}[W] &= \sum_{i=1}^{n} \mathrm{E}\!\left[X_i X_i^{\top}\right] \\
&\overset{A}{=} \sum_{i=1}^{n} \left(\mathrm{Var}[X_i] + \mathrm{E}[X_i]\,\mathrm{E}\!\left[X_i^{\top}\right]\right) \\
&\overset{B}{=} \sum_{i=1}^{n} \mathrm{Var}[X_i] \\
&= \sum_{i=1}^{n} \frac{1}{n}\, H = n\cdot\frac{1}{n}\, H = H
\end{aligned}$$
where: in step A we have used the fact that the covariance matrix of $X_i$ can be written as⁸
$$\mathrm{Var}[X_i] = \mathrm{E}\!\left[X_i X_i^{\top}\right] - \mathrm{E}[X_i]\,\mathrm{E}\!\left[X_i^{\top}\right]$$
and in step B we have used the fact that $\mathrm{E}[X_i] = 0$.
(see above). To compute this covariance, we first need to compute the following fourth cross-moment:
$$\mathrm{E}\!\left[X_{si}^2\, X_{sj}^2\right]$$
8 See p. 190.
9 Muirhead, R.J. (2005) Aspects of multivariate statistical theory, Wiley.
$$A = X X^{\top}$$
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
$$X X^{\top} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \begin{bmatrix} X_1 & X_2 \end{bmatrix} = \begin{bmatrix} X_1^2 & X_1 X_2 \\ X_2 X_1 & X_2^2 \end{bmatrix}$$
$$A = A^{\top}$$
$$x^{\top} A x > 0$$
$$x^{\top} A x = x^{\top} 0 = 0$$
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$
Chapter 56
Linear combinations of
normals
One property that makes the normal distribution extremely tractable from an ana-
lytical viewpoint is its closure under linear combinations: the linear combination of
two independent random variables having a normal distribution also has a normal
distribution. This lecture presents a multivariate generalization of this elementary
property and then discusses some special cases.
$$Y = A + B X$$
$$\mathrm{E}[Y] = A + B\mu$$
Proof. This is proved using the formula for the joint moment generating function of the linear transformation of a random vector². The joint moment generating function of $X$ is
$$M_X(t) = \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right)$$
Therefore, the joint moment generating function of $Y$ is
$$\begin{aligned}
M_Y(t) &= \exp\!\left(t^{\top} A\right) M_X\!\left(B^{\top} t\right) \\
&= \exp\!\left(t^{\top} A\right) \exp\!\left(t^{\top} B\mu + \frac{1}{2}\, t^{\top} B V B^{\top} t\right) \\
&= \exp\!\left(t^{\top}(A + B\mu) + \frac{1}{2}\, t^{\top} B V B^{\top} t\right)
\end{aligned}$$
$$\mathrm{E}[Y] = \mu_1 + \mu_2$$
and variance
$$\mathrm{Var}[Y] = \sigma_1^2 + \sigma_2^2$$
Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal³: the $2 \times 1$ random vector $X$ defined as
$$X = [X_1\ X_2]^{\top}$$
We can write:
$$Y = X_1 + X_2 = B X$$
where
$$B = [1\ 1]$$
3 See p. 446.
and variance
$$\mathrm{Var}[Y] = B\,\mathrm{Var}[X]\, B^{\top} = \sigma_1^2 + \sigma_2^2$$
2
and variance
K
X
2
Var [Y ] = i
i=1
Proof. This can be obtained, either generalizing the proof of Proposition 281, or
using Proposition 281 recursively (starting from the …rst two components of X,
then adding the third one and so on).
and variance
K
X
Var [Y ] = b2i 2
i
i=1
Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal: the $K \times 1$ random vector $X$ defined as
$$X = [X_1\ \ldots\ X_K]^{\top}$$
We can write:
$$Y = \sum_{i=1}^{K} b_i X_i = B X$$
where
$$B = [b_1\ \ldots\ b_K]$$
Therefore, according to Proposition 280, $Y$ has a (multivariate) normal distribution with mean:
$$\mathrm{E}[Y] = B\,\mathrm{E}[X] = \sum_{i=1}^{K} b_i\, \mu_i$$
and variance:
$$\mathrm{Var}[Y] = B\,\mathrm{Var}[X]\, B^{\top} = \sum_{i=1}^{K} b_i^2\, \sigma_i^2$$
Proposition 284 Let $X$ be a normal random variable with mean $\mu$ and variance $\sigma^2$. Let $a$ and $b$ be two constants (with $b \ne 0$). Then the random variable $Y$ defined by:
$$Y = a + b X$$
has a normal distribution with mean
$$\mathrm{E}[Y] = a + b\mu$$
and variance
$$\mathrm{Var}[Y] = b^2 \sigma^2$$
Proof. This is a consequence of the fact that mutually independent normal random vectors are jointly normal: the $Kn \times 1$ random vector $X$ defined as
$$X = [X_1^{\top}\ \ldots\ X_n^{\top}]^{\top}$$
has a multivariate normal distribution with mean
$$\mathrm{E}[X] = [\mu_1^{\top}\ \ldots\ \mu_n^{\top}]^{\top}$$
Exercise 1

Let
$$X = [X_1\ X_2]^{\top}$$
be a $2 \times 1$ normal random vector with mean
$$\mu = [1\ 3]^{\top}$$
and covariance matrix
$$V = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$$
Find the distribution of the random variable $Z$ defined as
$$Z = X_1 + 2 X_2$$
Solution

We can write
$$Z = B X$$
where
$$B = [1\ 2]$$
Being a linear transformation of a multivariate normal random vector, $Z$ is also multivariate normal. Actually, it is univariate normal, because it is a scalar. Its mean is
$$\mathrm{E}[Z] = B\mu = 1 \cdot 1 + 2 \cdot 3 = 7$$
and its variance is
$$\mathrm{Var}[Z] = B V B^{\top} = [1\ 2] \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = [1\ 2] \begin{bmatrix} 2 \cdot 1 + 1 \cdot 2 \\ 1 \cdot 1 + 3 \cdot 2 \end{bmatrix} = [1\ 2] \begin{bmatrix} 4 \\ 7 \end{bmatrix} = 1 \cdot 4 + 2 \cdot 7 = 18$$
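These two numbers can be reproduced by simulation, generating the correlated pair through a Cholesky factor of $V$ (a sketch; the sample size and seed are arbitrary choices):

```python
import random
from math import sqrt

random.seed(42)

# Cholesky factor L of V = [[2, 1], [1, 3]], so that V = L L^T.
l11 = sqrt(2.0)
l21 = 1.0 / l11
l22 = sqrt(3.0 - l21 ** 2)

N = 200_000
samples = []
for _ in range(N):
    q1, q2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = 1 + l11 * q1             # mean 1
    x2 = 3 + l21 * q1 + l22 * q2  # mean 3, correlated with x1
    samples.append(x1 + 2 * x2)   # Z = X1 + 2 X2

mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / (N - 1)
print(round(mean, 1), round(var, 1))  # close to 7 and 18
```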
Exercise 2

Let $X_1, \ldots, X_n$ be $n$ mutually independent standard normal random variables. Let $b \in (0,1)$ be a constant. Find the distribution of the random variable $Y$ defined as
$$Y = \sum_{i=1}^{n} b^i X_i$$

Solution

Being a linear combination of mutually independent normal random variables, $Y$ has a normal distribution with mean
$$\mathrm{E}[Y] = \sum_{i=1}^{n} b^i\, \mathrm{E}[X_i] = \sum_{i=1}^{n} b^i \cdot 0 = 0$$
and variance
$$\begin{aligned}
\mathrm{Var}[Y] &= \sum_{i=1}^{n} \left(b^i\right)^2 \mathrm{Var}[X_i] = \sum_{i=1}^{n} b^{2i} \cdot 1 \\
&\overset{A}{=} \sum_{i=1}^{n} c^i = c + c^2 + \ldots + c^n \\
&= c\left(1 + c + \ldots + c^{n-1}\right) = c\left(1 + c + \ldots + c^{n-1}\right)\frac{1-c}{1-c} \\
&= \frac{c}{1-c}\left(1 - c^n\right)
\end{aligned}$$
$$= \frac{c - c^{n+1}}{1 - c} = \frac{b^2 - b^{2n+2}}{1 - b^2}$$
where in step A we have defined $c = b^2$.
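The closed form can be checked against the raw sum for a few values of b and n (a sketch):

```python
def var_sum(b, n):
    """Raw variance: sum of b^(2i) for i = 1..n."""
    return sum(b ** (2 * i) for i in range(1, n + 1))

def var_closed(b, n):
    """Closed-form variance (b^2 - b^(2n+2)) / (1 - b^2)."""
    return (b ** 2 - b ** (2 * n + 2)) / (1 - b ** 2)

for b in (0.3, 0.5, 0.9):
    for n in (1, 5, 50):
        assert abs(var_sum(b, n) - var_closed(b, n)) < 1e-12
print("ok")
```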
Chapter 57

Partitioned multivariate normal vectors
$$X = \begin{bmatrix} X_a \\ X_b \end{bmatrix}$$

57.1 Notation

In what follows, we will denote by
$$\mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}$$
and
$$V = \begin{bmatrix} V_a & V_{ab} \\ V_{ab}^{\top} & V_b \end{bmatrix}$$
1 See p. 439.
2 See p. 193.
$$X_a \sim N(\mu_a, V_a) \qquad X_b \sim N(\mu_b, V_b)$$
$$X_a = A X$$
where $A$ is a $K_a \times K$ matrix whose entries are either zero or one. Thus, $X_a$ has a multivariate normal distribution, because it is a linear transformation of the multivariate normal random vector $X$, and multivariate normality is preserved³ by linear transformations. By the same token, also $X_b$ has a multivariate normal distribution, because it can be written as a linear transformation of $X$:
$$X_b = B X$$
Proof. $X_a$ and $X_b$ are independent if and only if their joint moment generating function is equal to the product of their individual moment generating functions⁴. Since $X_a$ is multivariate normal, its joint moment generating function is
$$M_{X_a}(t_a) = \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a\right)$$
$$M_{X_b}(t_b) = \exp\!\left(t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b\right)$$
The joint moment generating function of $X_a$ and $X_b$, which is just the joint moment generating function of $X$, is
$$\begin{aligned}
M_X(t) &= \exp\!\left(t^{\top}\mu + \frac{1}{2}\, t^{\top} V t\right) \\
&= \exp\!\left(\begin{bmatrix} t_a^{\top} & t_b^{\top} \end{bmatrix} \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} + \frac{1}{2}\begin{bmatrix} t_a^{\top} & t_b^{\top} \end{bmatrix} \begin{bmatrix} V_a & V_{ab} \\ V_{ab}^{\top} & V_b \end{bmatrix} \begin{bmatrix} t_a \\ t_b \end{bmatrix}\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + t_b^{\top}\mu_b + \frac{1}{2}\, t_a^{\top} V_a t_a + \frac{1}{2}\, t_b^{\top} V_b t_b + \frac{1}{2}\, t_b^{\top} V_{ab}^{\top} t_a + \frac{1}{2}\, t_a^{\top} V_{ab} t_b\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a + t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b + t_b^{\top} V_{ab}^{\top} t_a\right) \\
&= \exp\!\left(t_a^{\top}\mu_a + \frac{1}{2}\, t_a^{\top} V_a t_a\right) \exp\!\left(t_b^{\top}\mu_b + \frac{1}{2}\, t_b^{\top} V_b t_b\right) \exp\!\left(t_b^{\top} V_{ab}^{\top} t_a\right) \\
&= M_{X_a}(t_a)\, M_{X_b}(t_b)\, \exp\!\left(t_b^{\top} V_{ab}^{\top} t_a\right)
\end{aligned}$$
$$Q = X^{\top} A X$$
$$A^{\top} = A^{-1}$$
$$Y = A X$$
Then also $Y$ has a standard multivariate normal distribution, i.e. $Y \sim N(0, I)$.
In other words, the trace is equal to the sum of all the diagonal entries of $A$. The trace of $A$ enjoys the following important property:
$$\mathrm{tr}(A) = \sum_{i=1}^{K} \lambda_i$$
$$A = P D P^{\top}$$
where $Y_j$ is the $j$-th component of $Y$ and $D_{jj}$ is the $j$-th diagonal entry of $D$. Since $A$ is symmetric and idempotent, the diagonal entries of $D$ are either zero or one. Denote by $J$ the set
$$J = \{ j \le K : D_{jj} = 1 \}$$
2 See p. 387.
and by $r$ its cardinality, i.e. the number of diagonal entries of $D$ that are equal to 1. Since $D_{jj} \ne 1$ implies $D_{jj} = 0$, we can write
$$Q = \sum_{j=1}^{K} D_{jj} Y_j^2 = \sum_{j \in J} D_{jj} Y_j^2 = \sum_{j \in J} Y_j^2$$
But the components of a standard normal random vector are mutually independent standard normal random variables. Therefore, $Q$ is the sum of the squares of $r$ independent standard normal random variables. Hence, it has a Chi-square distribution with $r$ degrees of freedom³. Finally, by the properties of idempotent matrices and of the trace of a matrix (see above), $r$ is not only the number of diagonal entries of $D$ that are equal to 1, but also the sum of the eigenvalues of $A$. Since the trace of a matrix is equal to the sum of its eigenvalues, then $r = \mathrm{tr}(A)$.
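A concrete illustration (a sketch, not from the text): the centering matrix $M = I - \frac{1}{n}\iota\iota^{\top}$ used later in this chapter is symmetric and idempotent, and its trace is $n-1$, so $X^{\top} M X$ has a Chi-square distribution with $n-1$ degrees of freedom when $X$ is standard normal:

```python
n = 5
# Centering matrix M = I - (1/n) * ones * ones^T.
m = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

# M is idempotent: M @ M == M (up to rounding).
mm = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
assert all(abs(mm[i][j] - m[i][j]) < 1e-12 for i in range(n) for j in range(n))

trace = sum(m[i][i] for i in range(n))
print(trace)  # equals n - 1, the degrees of freedom of X^T M X
```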
$$T_1 = A X \qquad T_2 = B X$$
Proof. First of all, note that $T_1$ and $T_2$ are linear transformations of the same multivariate normal random vector $X$. Therefore, they are jointly normal (see the lecture entitled Linear combinations of normal random variables - p. 469). Their cross-covariance⁴ is
$$\begin{aligned}
\mathrm{Cov}[T_1, T_2] &= \mathrm{E}\!\left[(T_1 - \mathrm{E}[T_1])(T_2 - \mathrm{E}[T_2])^{\top}\right] \\
&= \mathrm{E}\!\left[(A X - \mathrm{E}[A X])(B X - \mathrm{E}[B X])^{\top}\right] \\
&\overset{A}{=} A\, \mathrm{E}\!\left[(X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\top}\right] B^{\top} \\
&\overset{B}{=} A\, \mathrm{Var}[X]\, B^{\top} \\
&\overset{C}{=} A B^{\top}
\end{aligned}$$
where: in step A we have used the linearity of the expected value; in step B we have used the definition of covariance matrix; in step C we have used the fact that $\mathrm{Var}[X] = I$. But, as we explained in the lecture entitled Partitioned multivariate normal vectors (p. 477), two jointly normal random vectors are independent if and only if their cross-covariance is equal to 0. In our case, the cross-covariance is equal to zero if and only if $A B^{\top} = 0$, which proves the proposition.
3 See p. 395.
4 See p. 193.
The following proposition gives a necessary and sufficient condition for the independence of two quadratic forms in the same standard multivariate normal random vector.
$$Q_1 = X^{\top} A X \qquad Q_2 = X^{\top} B X$$
$$T = A X \qquad Q = X^{\top} B X$$
$$T = A X \qquad Q = (B X)^{\top}(B X)$$
58.4 Examples

We discuss here some quadratic forms that are commonly found in statistics.

where: in step A we have used the fact that $M$ is symmetric; in step B we have used the fact that $M$ is idempotent.
Now define a new random vector
$$Z = \frac{1}{\sigma}\,(X - \mu\,\iota)$$
5 See p. 573.
6 See p. 583.
and note that $Z$ has a standard (mean zero and covariance $I$) multivariate normal distribution (see the lecture entitled Linear combinations of normal random variables - p. 469).
The sample variance can be written as
$$\begin{aligned}
s^2 &= \frac{1}{n-1}\, X^{\top} M X \\
&= \frac{1}{n-1}\,(X - \mu\iota + \mu\iota)^{\top} M (X - \mu\iota + \mu\iota) \\
&= \frac{\sigma^2}{n-1}\left(\frac{X - \mu\iota}{\sigma} + \frac{\mu}{\sigma}\,\iota\right)^{\top} M \left(\frac{X - \mu\iota}{\sigma} + \frac{\mu}{\sigma}\,\iota\right) \\
&= \frac{\sigma^2}{n-1}\left(Z + \frac{\mu}{\sigma}\,\iota\right)^{\top} M \left(Z + \frac{\mu}{\sigma}\,\iota\right) \\
&= \frac{\sigma^2}{n-1}\left(Z^{\top} M Z + \frac{\mu}{\sigma}\, Z^{\top} M \iota + \frac{\mu}{\sigma}\,\iota^{\top} M Z + \frac{\mu^2}{\sigma^2}\,\iota^{\top} M \iota\right)
\end{aligned}$$
The last three terms in the sum are equal to zero, because
$$M \iota = 0$$
So, the quadratic form $Z^{\top} M Z$ has a Chi-square distribution with $n-1$ degrees of freedom. Multiplying a Chi-square random variable with $n-1$ degrees of freedom by $\frac{\sigma^2}{n-1}$ one obtains a Gamma random variable with parameters $n-1$ and $\sigma^2$ (see the lecture entitled Gamma distribution - p. 403 - for more details).
So, summing up, the adjusted sample variance $s^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$.
Furthermore, the adjusted sample variance $s^2$ is independent of the sample mean $\bar{X}_n$, which is proved as follows. The sample mean can be written as
$$\bar{X}_n = \frac{1}{n}\,\iota^{\top} X$$
and the sample variance can be written as
$$s^2 = \frac{1}{n-1}\, X^{\top} M X$$
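The Gamma result for $s^2$ can be spot-checked by simulation (a sketch; the sample size, seed, and the choice $\sigma = 2$ are arbitrary): a Gamma$(n-1, \sigma^2)$ variable in this book's parametrization has mean $\sigma^2$, so the adjusted sample variance should average out near $\sigma^2$.

```python
import random

random.seed(1)

mu, sigma, n = 3.0, 2.0, 5
reps = 20_000

def adjusted_sample_variance():
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

avg = sum(adjusted_sample_variance() for _ in range(reps)) / reps
print(round(avg, 1))  # close to sigma^2 = 4
```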
Asymptotic theory
Chapter 59
Sequences of random
variables
One of the central topics in probability theory and statistics is the study of sequences of random variables, i.e. of sequences¹ $\{X_n\}$ whose generic element $X_n$ is a random variable.
There are several reasons why sequences of random variables are important:
59.1 Terminology
In this lecture, we introduce some terminology related to sequences of random
variables.
1 See p. 31.
2 See p. 141.
{Xn} = {xn}
second group of q successive terms of the sequence, Xn+k+1, ..., Xn+k+q. The second group is located k positions after the first group. Denote the joint distribution function7 of the first group of terms by

Fn+1,...,n+q(x1, ..., xq)

and the joint distribution function of the second group of terms by

Fn+k+1,...,n+k+q(x1, ..., xq)
∃ γ0 ∈ R : Var[Xn] = γ0, ∀n ∈ N

Strict stationarity (see 59.1.6) implies weak stationarity only if the mean E[Xn] and all the covariances Cov[Xn, Xn−j] exist and are finite.
Covariance stationarity does not imply strict stationarity, because the former imposes restrictions only on first and second moments, while the latter imposes restrictions on the whole distribution.
7 See p. 118.
8 Remember that Cov [Xn ; Xn ] = Var [Xn ].
for any two functions f and g. This is just the definition of independence9 between the two random vectors

[Xn+1 ... Xn+q]  (59.2)

and

[Xn+k+1 ... Xn+k+q]  (59.3)

Trivially, condition (59.1) can be rewritten as a statement about these two vectors. If this condition is true asymptotically (i.e. when k → ∞), then we say that the sequence {Xn} is mixing.

Definition 293 We say that a sequence of random variables {Xn} is mixing (or strongly mixing) if and only if condition (59.1) holds asymptotically (as k → ∞), for any n and q.

In other words, a sequence is strongly mixing if and only if the two random vectors (59.2) and (59.3) tend to become more and more independent by increasing k (for any n and q). This is a milder requirement than the requirement of independence: if {Xn} is an independent sequence (see 59.1.3), all its terms are independent from one another; if {Xn} is a mixing sequence, its terms can be dependent, but they become less and less dependent as the distance between their locations in the sequence increases. Of course, an independent sequence is also a mixing sequence, while the converse is not necessarily true.

We say that a subset A ⊆ R^N (the set of real sequences) is a shift invariant set if and only if the shifted sequence {xn}n>1 belongs to A whenever {xn} belongs to A.
Chapter 60

Sequences of random vectors

In this lecture, we generalize the concepts introduced in the lecture entitled Sequences of random variables (p. 491). We no longer consider sequences whose elements are random variables, but sequences {Xn} whose generic element Xn is a K × 1 random vector. The generalization is straightforward, as the terminology and the basic concepts are almost the same as those used for sequences of random variables.
60.1 Terminology
60.1.1 Realization of a sequence
Let {xn} be a sequence of K × 1 real vectors and {Xn} a sequence of K × 1 random vectors. If the real vector xn is a realization1 of the random vector Xn for every n, then we say that the sequence of real vectors {xn} is a realization of the sequence of random vectors {Xn} and we write

{Xn} = {xn}
Fn+1,...,n+q(x1, ..., xq) = Fn+k+1,...,n+k+q(x1, ..., xq)

i.e. the two random vectors

[X'n+1 ... X'n+q]'

and

[X'n+k+1 ... X'n+k+q]'

have the same distribution (for any n, k and q). Requiring strict stationarity is weaker than requiring that a sequence be IID (see the subsection IID sequences above): if {Xn} is an IID sequence, then it is also strictly stationary, while the converse is not necessarily true.
4 See p. 118.
∃ μ ∈ R^K : E[Xn] = μ, ∀n ∈ N   (1)

∀j ≥ 0, ∃ Γj ∈ R^(K×K) : Cov[Xn, Xn−j] = Γj, ∀n > j   (2)

where n and j are, of course, integers. Property (1) means that all the random vectors belonging to the sequence {Xn} have the same mean. Property (2) means that the cross-covariance5 between a term Xn of the sequence and the term that is located j positions before it (Xn−j) is always the same, irrespective of how Xn has been chosen. In other words, Cov[Xn, Xn−j] depends only on j and not on n. Note also that property (2) implies that all the random vectors in the sequence have the same covariance matrix (because Cov[Xn, Xn] = Var[Xn]):

∃ Γ0 ∈ R^(K×K) : Var[Xn] = Γ0, ∀n ∈ N
Definition 296 We say that a sequence of random vectors {Xn} is mixing (or strongly mixing) if and only if the asymptotic-independence condition of the scalar case holds for the vector terms.

Definition 297 A set A ⊆ (R^K)^N is shift invariant if and only if the shifted sequence belongs to A whenever the original sequence does.
Chapter 61

Pointwise convergence
This lecture discusses pointwise convergence. We deal …rst with pointwise con-
vergence of sequences of random variables and then with pointwise convergence of
sequences of random vectors.
Xn ! X pointwise
Example 300 Let Ω = {ω1, ω2} be a sample space with two sample points (ω1 and ω2). Let {Xn} be a sequence of random variables such that a generic term Xn of the sequence is defined by the two values Xn(ω1) and Xn(ω2) reported below.
1 See p. 492.
2 See p. 69.
3 See p. 33.
We need to check the convergence of the sequences {Xn(ω)} for all ω ∈ Ω, i.e. for ω = ω1 and for ω = ω2: (1) the sequence {Xn(ω1)}, whose generic term is

Xn(ω1) = 1/n

is a sequence of real numbers converging to 0; (2) the sequence {Xn(ω2)}, whose generic term is

Xn(ω2) = 1 + 2/n

is a sequence of real numbers converging to 1. Therefore, the sequence of random variables {Xn} converges pointwise to the random variable X, where X is defined as follows:

X(ω) = 0 if ω = ω1; X(ω) = 1 if ω = ω2
where d(Xn(ω), X(ω)) is the distance between a generic term of the sequence Xn(ω) and the limit X(ω). The distance between Xn(ω) and X(ω) is defined to be equal to the Euclidean norm of their difference:

d(Xn(ω), X(ω)) = ||Xn(ω) − X(ω)|| = √((Xn,1(ω) − X1(ω))² + ... + (Xn,K(ω) − XK(ω))²)

where the second subscript is used to indicate the individual components of the vectors Xn(ω) and X(ω). Thus, for a fixed ω, the sequence of real vectors {Xn(ω)} is convergent to a vector X(ω) if

lim(n→∞) d(Xn(ω), X(ω)) = 0

X is called the pointwise limit of the sequence and convergence is indicated by:

Xn → X pointwise

Now, denote by {Xn,i} the sequence of the i-th components of the vectors Xn. It can be proved that the sequence of random vectors {Xn} is pointwise convergent if and only if all the K sequences of random variables {Xn,i} are pointwise convergent.
Exercise 1
Let the sample space be6:

Ω = [0, 1]

and define a sequence of random variables {Xn} by

Xn(ω) = ω/(2n), ∀ω ∈ Ω

Find the pointwise limit of the sequence {Xn}.

Solution
For a fixed sample point ω, the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω/(2n) = 0

Therefore, the sequence {Xn} converges pointwise to the random variable X defined by

X(ω) = 0, ∀ω ∈ Ω
6 In other words, the sample space is the set of all real numbers between 0 and 1.
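A quick numerical sanity check of this limit; the probe points below are arbitrary illustrative choices:

```python
# Pointwise convergence check for X_n(omega) = omega / (2 n):
# for each fixed sample point omega, the real sequence tends to 0.
def X_n(n, omega):
    return omega / (2 * n)

for omega in (0.0, 0.3, 1.0):
    # far along the sequence, the term is within 1e-5 of the limit 0
    assert abs(X_n(10**6, omega)) < 1e-5
```
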
Exercise 2
Suppose the sample space is as in the previous exercise:

Ω = [0, 1]

Define a sequence of random variables {Xn} as follows:

Xn(ω) = (1 + ω/n)^n, ∀ω ∈ Ω

Find the pointwise limit of the sequence {Xn}.

Solution
For a given sample point ω, the sequence of real numbers {Xn(ω)} has limit7:

lim(n→∞) Xn(ω) = lim(n→∞) (1 + ω/n)^n = exp(ω)

Thus, the sequence of random variables {Xn} converges pointwise to the random variable X defined as follows:

X(ω) = exp(ω), ∀ω ∈ Ω
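The classic limit used in the solution can also be checked numerically; this is an illustrative sketch, not part of the exercise:

```python
import math

# (1 + omega/n)^n approaches exp(omega) as n grows
def X_n(n, omega):
    return (1 + omega / n) ** n

for omega in (0.0, 0.5, 1.0):
    assert abs(X_n(10**6, omega) - math.exp(omega)) < 1e-4
```
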
Exercise 3
Suppose the sample space is as in the previous exercises:

Ω = [0, 1]

Define a sequence of random variables {Xn} as follows:

Xn(ω) = ω^n, ∀ω ∈ Ω

Define a random variable X as follows:

X(ω) = 0, ∀ω ∈ Ω

Does the sequence {Xn} converge pointwise to the random variable X?

Solution
For ω ∈ [0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω^n = 0

However, for ω = 1, the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) 1^n = 1

Thus, the sequence of random variables {Xn} does not converge pointwise to the random variable X, but it converges pointwise to the random variable Y defined as follows:

Y(ω) = 0 if ω ∈ [0, 1); Y(ω) = 1 if ω = 1
7 Note that this limit is encountered very frequently and you can …nd a proof of it in most
calculus textbooks.
Chapter 62

Almost sure convergence
This lecture introduces the concept of almost sure convergence. In order to understand this lecture, you should first understand the concepts of almost sure property and almost sure event1 and the concept of pointwise convergence of a sequence of random variables2.
We deal first with almost sure convergence of sequences of random variables and then with almost sure convergence of sequences of random vectors.
In other words, almost sure convergence requires that the sequences {Xn(ω)} converge for all sample points ω ∈ Ω, except, possibly, for a very small set F^c of sample points (F^c must be included in a zero-probability event).
{Xn(ω)} converges to X(ω) almost surely, i.e. if there exists a zero-probability event E such that:

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)} ⊆ E

X is called the almost sure limit of the sequence and convergence is indicated by:

Xn →a.s. X

The following is an example of a sequence that converges almost surely:

Example 304 Suppose the sample space is:

Ω = [0, 1]

It is possible to build a probability measure P on Ω, such that P assigns to each sub-interval of [0, 1] a probability equal to its length4:

if 0 ≤ a ≤ b ≤ 1 and E = [a, b], then P(E) = b − a

Remember that in this probability model all the sample points ω ∈ Ω are assigned zero probability (each sample point, when considered as an event, is a zero-probability event):

∀ω ∈ Ω, P({ω}) = P([ω, ω]) = ω − ω = 0

Now, consider a sequence of random variables {Xn} defined as follows:

Xn(ω) = 1 if ω = 0; Xn(ω) = 1/n if ω ≠ 0

When ω ∈ (0, 1], the sequence of real numbers {Xn(ω)} converges to 0, because:

lim(n→∞) Xn(ω) = lim(n→∞) 1/n = 0

However, when ω = 0, the sequence of real numbers {Xn(ω)} is not convergent to 0, because:

lim(n→∞) Xn(ω) = lim(n→∞) 1 = 1

Define a constant random variable X:

X(ω) = 0, ∀ω ∈ [0, 1]

We have that:

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)} = {0}

But P({0}) = 0 because:

P({0}) = P([0, 0]) = 0 − 0 = 0

which means that the event

{ω ∈ Ω : {Xn(ω)} does not converge to X(ω)}

is a zero-probability event. Therefore, the sequence {Xn} converges to X almost surely, but it does not converge pointwise to X because {Xn(ω)} does not converge to X(ω) for all ω ∈ Ω.
4 See the lecture entitled Zero-probability events (p. 79).
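Example 304 is easy to probe numerically; this sketch just evaluates the sequence at a few sample points (the helper function mirrors the definition in the example):

```python
# X_n(omega) = 1 if omega == 0, else 1/n: converges to 0 for every
# omega in (0, 1], and stays at 1 only on the zero-probability point {0}.
def X_n(n, omega):
    return 1.0 if omega == 0 else 1.0 / n

assert abs(X_n(10**6, 0.5)) < 1e-5   # converges at a typical sample point
assert X_n(10**6, 0) == 1.0          # fails to converge to 0 only at omega = 0
```
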
Now, denote by fXn;i g the sequence of the i-th components of the vectors
Xn . It can be proved that the sequence of random vectors fXn g is almost surely
convergent if and only if all the K sequences of random variables fXn;i g are almost
surely convergent.
Exercise 1
Let the sample space be:

Ω = [0, 1]

Sub-intervals of [0, 1] are assigned a probability equal to their length:

if 0 ≤ a ≤ b ≤ 1 and E = [a, b], then P(E) = b − a

Define a sequence of random variables {Xn} as follows:

Xn(ω) = ω^n, ∀ω ∈ Ω
5 See p. 497.
Define a random variable X as follows:

X(ω) = 0, ∀ω ∈ Ω

Does the sequence {Xn} converge almost surely to X?
Solution
For a fixed sample point ω ∈ [0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = lim(n→∞) ω^n = 0

Therefore, the sequence of random variables {Xn} does not converge pointwise to X, because

lim(n→∞) Xn(ω) ≠ X(ω)

for ω = 1. However, the set of sample points ω such that {Xn(ω)} does not converge to X(ω) is a zero-probability event:

P({1}) = P([1, 1]) = 1 − 1 = 0

Therefore, the sequence {Xn} converges almost surely to X.
Exercise 2
Let {Xn} and {Yn} be two sequences of random variables defined on a sample space Ω. Let X and Y be two random variables defined on Ω such that:

Xn →a.s. X
Yn →a.s. Y

Prove that

Xn + Yn →a.s. X + Y

Solution
Denote by FX the set of sample points for which {Xn(ω)} converges to X(ω). By assumption, there exists a zero-probability event EX such that

FX^c ⊆ EX

where P(EX) = 0.
Denote by FY the set of sample points for which {Yn(ω)} converges to Y(ω). Again, there exists an event EY such that

FY^c ⊆ EY

where P(EY) = 0.
Now, denote by FXY the set of sample points for which {Xn(ω) + Yn(ω)} converges to X(ω) + Y(ω).
Observe that if ω ∈ FX ∩ FY then {Xn(ω) + Yn(ω)} converges to X(ω) + Y(ω), because the sum of two sequences of real numbers is convergent if the two sequences are convergent. Therefore

FX ∩ FY ⊆ FXY

Taking the complement of both sides, we obtain

FXY^c ⊆ (FX ∩ FY)^c =(A) FX^c ∪ FY^c ⊆(B) EX ∪ EY

where: in step A we have used De Morgan's law; in step B we have used the fact that FX^c ⊆ EX and FY^c ⊆ EY. But P(EX ∪ EY) ≤ P(EX) + P(EY) = 0, so FXY^c is included in a zero-probability event, which is exactly what almost sure convergence of Xn + Yn to X + Y requires.
Exercise 3
Let the sample space be:

Ω = [0, 1]

Sub-intervals of [0, 1] are assigned a probability equal to their length, as in Exercise 1 above.
Define a sequence of random variables {Xn} as follows:

Xn(ω) = 1 if ω ∈ (0, 1 − 1/n]; Xn(ω) = n otherwise

Find the almost sure limit (if it exists) of the sequence {Xn}.

Solution
If ω = 0 or ω = 1, then the sequence of real numbers {Xn(ω)} is not convergent, because Xn(ω) = n diverges.
For ω ∈ (0, 1), the sequence of real numbers {Xn(ω)} has limit:

lim(n→∞) Xn(ω) = 1

because for any such ω we can find n0 such that ω ∈ (0, 1 − 1/n] for any n ≥ n0 (as a consequence Xn(ω) = 1 for any n ≥ n0).
Thus, the sequence of random variables {Xn} converges almost surely to the random variable X defined as:

X(ω) = 1, ∀ω ∈ Ω

because the set of sample points ω such that {Xn(ω)} does not converge to X(ω) is a zero-probability event:

P({0, 1}) = 0
Chapter 63

Convergence in probability
1 See p. 492.
for any ε > 0. X is called the probability limit of the sequence and convergence is indicated by

Xn →P X

or by

plim(n→∞) Xn = X

Example Let X be a discrete random variable with support

RX = {0, 1}

and probability mass function P(X = 0) = 2/3, P(X = 1) = 1/3, and define

Xn = (1 + 1/n) X

We want to prove that {Xn} converges in probability to X. Take any ε > 0. Note that:

|Xn − X| =(A) |(1 + 1/n) X − X| = |(1/n) X| =(B) (1/n) X

where: in step A we have used the definition of Xn; in step B we have used the fact that X cannot be negative. When X = 0, which happens with probability 2/3, we have that:

|Xn − X| = (1/n) X = 0

and, of course, |Xn − X| ≤ ε. When X = 1, which happens with probability 1/3, we have that

|Xn − X| = (1/n) X = 1/n
2 See p. 106.
and |Xn − X| ≤ ε if and only if 1/n ≤ ε (i.e. n ≥ 1/ε). Therefore:

P(|Xn − X| ≤ ε) = 2/3 if n < 1/ε; 1 if n ≥ 1/ε

and

P(|Xn − X| > ε) = 1 − P(|Xn − X| ≤ ε) = 1/3 if n < 1/ε; 0 if n ≥ 1/ε

Thus, P(|Xn − X| > ε) trivially converges to 0, because it is identically equal to 0 for all n such that n ≥ 1/ε. Since ε was arbitrary, we have obtained the desired result:

lim(n→∞) P(|Xn − X| > ε) = 0

for any ε > 0.
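This argument is easy to mirror in a simulation; a minimal sketch, where the sample size and tolerance are arbitrary choices:

```python
import numpy as np

# X takes value 1 with probability 1/3 and 0 with probability 2/3;
# X_n = (1 + 1/n) X, so |X_n - X| = X / n.
rng = np.random.default_rng(0)
X = (rng.random(100_000) < 1 / 3).astype(float)

def prob_far(n, eps=0.01):
    Xn = (1 + 1 / n) * X
    return np.mean(np.abs(Xn - X) > eps)

assert abs(prob_far(10) - 1 / 3) < 0.01   # for n < 1/eps the probability is 1/3
assert prob_far(1000) == 0.0              # for n >= 1/eps it is exactly 0
```
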
d(Xn, X) = ||Xn − X|| = √((Xn,1 − X1)² + ... + (Xn,K − XK)²)

where the second subscript is used to indicate the individual components of the vectors Xn and X.
The following is a formal definition: the sequence of random vectors {Xn} converges in probability to X if and only if

lim(n→∞) P(d(Xn, X) > ε) = 0

for any ε > 0. X is called the probability limit of the sequence and convergence is indicated by

Xn →P X
3 See p. 497.
4 See p. 34.
or by

plim(n→∞) Xn = X

Now, denote by {Xn,i} the sequence of the i-th components of the vectors Xn. It can be proved that the sequence of random vectors {Xn} is convergent in probability if and only if all the K sequences of random variables {Xn,i} are convergent in probability.
Exercise 1
Let U be a random variable having a uniform distribution5 on the interval [0, 1]. In other words, U is an absolutely continuous random variable with support

RU = [0, 1]

and probability density function

fU(u) = 1 if u ∈ [0, 1]; 0 otherwise

Define a sequence of random variables {Xn} as follows:

X1 = 1{U∈[0,1]}
X2 = 1{U∈[0,1/2]}, X3 = 1{U∈[1/2,1]}
X4 = 1{U∈[0,1/4]}, X5 = 1{U∈[1/4,2/4]}, X6 = 1{U∈[2/4,3/4]}, X7 = 1{U∈[3/4,1]}
X8 = 1{U∈[0,1/8]}, X9 = 1{U∈[1/8,2/8]}, X10 = 1{U∈[2/8,3/8]}, ...
X16 = 1{U∈[0,1/16]}, X17 = 1{U∈[1/16,2/16]}, X18 = 1{U∈[2/16,3/16]}, ...
...

where 1{U∈[a,b]} is the indicator function6 of the event {U ∈ [a, b]}.
Find the probability limit (if it exists) of the sequence {Xn}.
5 See p. 359.
6 See p. 197.
Solution
A generic term Xn of the sequence, being an indicator function, can take only two values. Writing n = m + j, where [j/m, (j + 1)/m] is the interval whose indicator defines Xn, it takes value 1 with probability

P(Xn = 1) = P(U ∈ [j/m, (j + 1)/m]) = 1/m

and value 0 with probability

P(Xn = 0) = 1 − P(Xn = 1) = 1 − 1/m

Since m tends to infinity as n does,

lim(n→∞) P(Xn = 0) = lim(m→∞) (1 − 1/m) = 1

and, for any ε > 0,

lim(n→∞) P(|Xn − 0| > ε) =(A) lim(n→∞) P(Xn > ε) =(B) lim(n→∞) P(Xn = 1) = lim(m→∞) 1/m = 0

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value 1. Therefore, the probability limit of the sequence is 0.
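A simulation sketch of this "sliding indicator" construction (the interval endpoints follow the pattern in the exercise; the sample size is arbitrary):

```python
import numpy as np

# Each term is the indicator of U in [j/m, (j+1)/m]; its empirical mean
# estimates P(X_n = 1) = 1/m, which shrinks in later blocks of the sequence.
rng = np.random.default_rng(0)
u = rng.random(200_000)

def term(m, j):
    return ((u >= j / m) & (u <= (j + 1) / m)).astype(float)

assert abs(term(8, 3).mean() - 1 / 8) < 0.01
assert abs(term(64, 5).mean() - 1 / 64) < 0.01
```
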
Exercise 2
Does the sequence in the previous exercise also converge almost surely7 ?
7 See p. 505.
Solution
We can identify the sample space8 with the support of U:

Ω = RU = [0, 1]

and the sample points ω ∈ Ω with the realizations of U: when the realization is U = u, then ω = u. Almost sure convergence requires that

{ω ∈ Ω : lim(n→∞) Xn(ω) = X(ω)}^c ⊆ E

for some zero-probability event9 E. But for every sample point ω the sequence {Xn(ω)} fails to converge: however far along the sequence we go, the indicator intervals keep sweeping across [0, 1], so Xn(ω) takes the value 1 infinitely often and the value 0 infinitely often. Hence the set of sample points where {Xn(ω)} converges is empty, and, trivially, there does not exist a zero-probability event including the set

{ω ∈ Ω : lim(n→∞) Xn(ω) = X(ω)}^c

Therefore, the sequence does not converge almost surely.
Exercise 3
Let {Xn} be an IID sequence10 of continuous random variables having a uniform distribution with support

RXn = [−1/n, 1/n]

and probability density function

fXn(x) = n/2 if x ∈ [−1/n, 1/n]; 0 otherwise

Find the probability limit (if it exists) of the sequence {Xn}.

Solution
As n tends to infinity, the probability density tends to become concentrated around the point x = 0. Therefore, it seems reasonable to conjecture that the sequence {Xn} converges in probability to the constant random variable

X(ω) = 0, ∀ω ∈ Ω

To rigorously verify this claim we need to use the formal definition of convergence in probability. For any ε > 0,

lim(n→∞) P(|Xn − X| > ε) = lim(n→∞) P(|Xn| > ε) = 0

because |Xn| ≤ 1/n, so that P(|Xn| > ε) = 0 for all n ≥ 1/ε.
8 See p. 69.
9 See p. 79.
1 0 See p. 492.
Chapter 64

Mean-square convergence
Note that d(Xn, X) is well-defined only if the expected value on the right hand side exists. Usually, Xn and X are required to be square integrable3, which ensures that (64.1) is well-defined and finite.
Intuitively, for a fixed sample point4 ω, the squared difference

(Xn(ω) − X(ω))²
i.e.

lim(n→∞) E[(Xn − X)²] = 0   (64.2)

Note that (64.2) is just the usual criterion for convergence5, while Xn →L² X indicates that convergence is in the Lp space6 L², because both {Xn} and X have been required to be square integrable.
The following example illustrates the concept of mean-square convergence:
Therefore

d(X̄n, X) = E[(X̄n − X)²]
= E[(X̄n − μ)²]
= E[(X̄n − E[X̄n])²]
= Var[X̄n]

5 See p. 36.
6 See p. 136.
7 See p. 493.
where ||Xn − X|| is the Euclidean norm of the difference between Xn and X, and the second subscript is used to indicate the individual components of the vectors Xn and X.
8 See, in particular, the property Multiplication by a constant (p. 158).
9 See p. 168.
1 0 See p. 497.
1 1 See p. 34.
i.e.

lim(n→∞) E[||Xn − X||²] = 0   (64.4)
Exercise 1
Let U be a random variable having a uniform distribution12 on the interval [1, 2]. In other words, U is an absolutely continuous random variable with support

RU = [1, 2]

and probability density function

fU(u) = 1 if u ∈ [1, 2]; 0 otherwise

Define a sequence of random variables {Xn} as follows:

Xn = 1{U∈[1, 2−1/n]}

where 1{U∈[1,2−1/n]} is the indicator function13 of the event {U ∈ [1, 2 − 1/n]}.
Find the mean-square limit (if it exists) of the sequence {Xn}.
Solution
When n tends to infinity, the interval [1, 2 − 1/n] becomes similar to the interval [1, 2], because

lim(n→∞) (2 − 1/n) = 2

Therefore, we conjecture that the indicators 1{U∈[1,2−1/n]} converge in mean-square to the indicator 1{U∈[1,2]}. But 1{U∈[1,2]} is always equal to 1, so our conjecture is that the sequence {Xn} converges in mean square to 1. To verify our conjecture, we need to verify that

lim(n→∞) E[(Xn − 1)²] = 0

This is indeed the case, because (Xn − 1)² equals 0 when U ∈ [1, 2 − 1/n] and 1 otherwise, so that

E[(Xn − 1)²] = P(U ∈ (2 − 1/n, 2]) = 2 − (2 − 1/n) = 1/n

which converges to 0.
Exercise 2
Let {Xn} be a sequence of discrete random variables. Let the probability mass function of a generic term of the sequence Xn be

pXn(xn) = 1/n if xn = n; 1 − 1/n if xn = 0; 0 otherwise

Find the mean-square limit (if it exists) of the sequence {Xn}.

Solution
Note that

lim(n→∞) P(Xn = 0) = lim(n→∞) (1 − 1/n) = 1

Therefore, one would expect that the sequence {Xn} converges to the constant random variable X = 0. However, the sequence {Xn} does not converge in mean-square to 0. The distance of a generic term of the sequence from 0 is

E[(Xn − 0)²] = E[Xn²] = n² · (1/n) + 0² · (1 − 1/n) = n

Thus

lim(n→∞) E[(Xn − 0)²] = ∞
Exercise 3
Does the sequence in the previous exercise converge in probability?
Solution
The sequence {Xn} converges in probability to the constant random variable X = 0, because for any ε > 0

lim(n→∞) P(|Xn − 0| > ε) =(A) lim(n→∞) P(Xn > ε) =(B) lim(n→∞) P(Xn = n) = lim(n→∞) 1/n = 0

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value n.
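The two exercises together exhibit a sequence that converges in probability but not in mean square; the exact moments can be checked with a short computation (exact rational arithmetic avoids floating-point noise):

```python
from fractions import Fraction

# P(X_n = n) = 1/n, P(X_n = 0) = 1 - 1/n
def second_moment(n):
    return n**2 * Fraction(1, n) + 0**2 * (1 - Fraction(1, n))  # equals n

def prob_nonzero(n):
    return Fraction(1, n)

n = 10**6
assert second_moment(n) == n          # mean-square distance from 0 blows up
assert float(prob_nonzero(n)) < 1e-5  # while P(X_n != 0) -> 0
```
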
Chapter 65
Convergence in distribution
is indicated by

Xn →d X

Note that convergence in distribution only involves the distribution functions of the random variables belonging to the sequence {Xn} and that these random variables need not be defined on the same sample space4. On the contrary, the modes of convergence we have discussed in previous lectures (pointwise convergence, almost sure convergence, convergence in probability, mean-square convergence) require that all the variables in the sequence be defined on the same sample space.

Example 316 Let {Xn} be a sequence of IID5 random variables all having a uniform distribution6 on the interval [0, 1], i.e. the distribution function of Xn is

FXn(x) = 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1
Define:

Yn = n (1 − max(1≤i≤n) Xi)

The distribution function of Yn is

FYn(y) = P(Yn ≤ y)
= P(n (1 − max(1≤i≤n) Xi) ≤ y)
= P(max(1≤i≤n) Xi ≥ 1 − y/n)
= 1 − P(max(1≤i≤n) Xi < 1 − y/n)
= 1 − P(X1 < 1 − y/n, X2 < 1 − y/n, ..., Xn < 1 − y/n)
=(A) 1 − P(X1 < 1 − y/n) · P(X2 < 1 − y/n) · ... · P(Xn < 1 − y/n)
=(B) 1 − P(X1 ≤ 1 − y/n) · P(X2 ≤ 1 − y/n) · ... · P(Xn ≤ 1 − y/n)
=(C) 1 − FX1(1 − y/n) · FX2(1 − y/n) · ... · FXn(1 − y/n)
=(D) 1 − [FXn(1 − y/n)]^n

where: in step A we have used the fact that the variables Xi are mutually independent; in step B we have used the fact that the variables Xi are absolutely continuous; in step C we have used the definition of distribution function; in step D we have used the fact that the variables Xi have identical distributions. Thus:

FYn(y) = 0 if y < 0; 1 − (1 − y/n)^n if 0 ≤ y < n; 1 if y ≥ n
4 See p. 69.
5 See p. 492.
6 See p. 359.
Since

lim(n→∞) (1 − y/n)^n = exp(−y)

we have

lim(n→∞) FYn(y) = FY(y) = 0 if y < 0; 1 − exp(−y) if y ≥ 0

where FY(y) is the distribution function of an exponential random variable7. Therefore, the sequence {Yn} converges in law to an exponential distribution.
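A Monte Carlo sketch makes Example 316 concrete by comparing the empirical CDF of Yn with the Exponential(1) CDF (the sample sizes and probe points are arbitrary choices):

```python
import numpy as np

# Y_n = n * (1 - max of n i.i.d. Uniform(0,1) draws) is approximately
# Exponential(1) for large n.
rng = np.random.default_rng(0)
n, reps = 200, 50_000
X = rng.random((reps, n))
Y = n * (1 - X.max(axis=1))

for y in (0.5, 1.0, 2.0):
    empirical = (Y <= y).mean()
    assert abs(empirical - (1 - np.exp(-y))) < 0.01
```
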
Suppose that the distribution functions of the terms of a sequence {Xn} converge to a function FX(x),

lim(n→∞) FXn(x) = FX(x)

for all x ∈ R where FX(x) is continuous. How do we check that FX(x) is a proper distribution function, so that we can say that the sequence {Xn} converges in distribution?
FX(x) is a proper distribution function if it satisfies the following four properties: it is increasing; its limit at minus infinity is 0; its limit at plus infinity is 1; and it is right-continuous at any x ∈ R.
7 See p. 365.
8 See p. 118.
Exercise 1
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x ≤ 0;
Fn(x) = (n/(2n + 1)) x + (1/(4n + 2)) x² if 0 < x ≤ 1;
Fn(x) = (n/(2n + 1)) x − (1/(4n + 2)) (x² − 4x + 2) if 1 < x ≤ 2;
Fn(x) = 1 if x > 2

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
If 0 < x ≤ 1, then

lim(n→∞) Fn(x) = lim(n→∞) [(n/(2n + 1)) x + (1/(4n + 2)) x²]
= x lim(n→∞) n/(2n + 1) + x² lim(n→∞) 1/(4n + 2)
= x · (1/2) + x² · 0 = x/2

If 1 < x ≤ 2, then

lim(n→∞) Fn(x) = lim(n→∞) [(n/(2n + 1)) x − (1/(4n + 2)) (x² − 4x + 2)]
= x lim(n→∞) n/(2n + 1) − (x² − 4x + 2) lim(n→∞) 1/(4n + 2)
= x · (1/2) − (x² − 4x + 2) · 0 = x/2

We now need to verify that the function

FX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; x/2 if 0 < x ≤ 2; 1 if x > 2

is a proper distribution function. The function is increasing, continuous, its limit at minus infinity is 0 and its limit at plus infinity is 1, hence it satisfies the four properties that a proper distribution function needs to satisfy. This implies that {Xn} converges in distribution to a random variable X having distribution function FX(x).
Exercise 2
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x < 0; 1 − (1 − x)^n if 0 ≤ x ≤ 1; 1 if x > 1

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
If x = 0, then

lim(n→∞) Fn(x) = lim(n→∞) [1 − (1 − x)^n] = 1 − lim(n→∞) (1 − 0)^n = 1 − 1 = 0

If 0 < x ≤ 1, then

lim(n→∞) Fn(x) = lim(n→∞) [1 − (1 − x)^n] = 1 − lim(n→∞) (1 − x)^n = 1 − 0 = 1

Thus, the distribution functions Fn(x) converge to the function

GX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; 1 if x > 0

GX(x) is not right-continuous at x = 0, so it is not a proper distribution function. However, convergence needs to hold only at the points where the limit is continuous, and the function

FX(x) = 0 if x < 0; 1 if x ≥ 0

is a proper distribution function that coincides with GX(x) everywhere except at x = 0, which is a point of discontinuity of FX(x). Therefore, the sequence {Xn} converges in distribution to a random variable X having distribution function FX(x).
Exercise 3
Let {Xn} be a sequence of random variables having distribution functions

Fn(x) = 0 if x < 0; nx if 0 ≤ x ≤ 1/n; 1 if x > 1/n

Find the limit in distribution (if it exists) of the sequence {Xn}.

Solution
The distribution functions Fn(x) converge to the function

GX(x) = lim(n→∞) Fn(x) = 0 if x ≤ 0; 1 if x > 0

This is the same limiting function found in the previous exercise. As a consequence, the sequence {Xn} converges in distribution to a random variable X having distribution function

FX(x) = 0 if x < 0; 1 if x ≥ 0
Chapter 66

Modes of convergence - Relations
Note that this holds for any arbitrarily small c. By the definition of convergence in probability, this means that Xn converges in probability to X (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |Xn − X| ≥ c implies |Xn − X| > ε for any strictly positive ε < c).
Chapter 67

Laws of large numbers

Let {Xn} be a sequence of random variables1. Let X̄n be the sample mean of the first n terms of the sequence:

X̄n = (1/n) Σ(i=1..n) Xi

Chebyshev's Weak Law of Large Numbers requires that the sequence be covariance stationary4 with zero covariances:

∃ μ ∈ R : E[Xn] = μ, ∀n ∈ N

∃ σ² ∈ R+ : Var[Xn] = σ², ∀n ∈ N

Cov[Xn, Xn+k] = 0, ∀n, k ∈ N

Under these conditions, the sample mean converges in probability2 to μ:

plim(n→∞) X̄n = μ
1 See p. 491.
2 See p. 511.
3 See p. 505.
4 In other words, all the random variables in the sequence have the same mean , the same
variance 2 and zero covariance with each other. See p. 493 for a de…nition of covariance
stationary sequence.
Now we can apply Chebyshev's inequality8 to the sample mean X̄n:

P(|X̄n − E[X̄n]| ≥ k) ≤ Var[X̄n]/k²

for any strictly positive real number k. Plugging in the values for the expected value and the variance derived above, we obtain:

P(|X̄n − μ| ≥ k) ≤ σ²/(n k²)

Since

lim(n→∞) σ²/(n k²) = 0

and

P(|X̄n − μ| ≥ k) ≥ 0

then it must also be that:

lim(n→∞) P(|X̄n − μ| ≥ k) = 0

Note that this holds for any arbitrarily small k. By the very definition of convergence in probability, this means that X̄n converges in probability to μ (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |X̄n − μ| ≥ k implies |X̄n − μ| > ε for any strictly positive ε < k).

5 See p. 511.
6 See, in particular, the Multiplication by a constant property (p. 158).
7 See p. 168.
8 See p. 242.
Note that it is customary to state Chebyshev's Weak Law of Large Numbers as a result on the convergence in probability of the sample mean:

plim(n→∞) X̄n = μ

However, the conditions of the above theorem guarantee the mean square convergence9 of the sample mean to μ:

X̄n →m.s. μ
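Both conclusions are easy to illustrate by simulation; this sketch estimates E[(X̄n − μ)²] and checks it against σ²/n (the distribution, sample sizes and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
reps = 20_000

for n in (10, 1000):
    means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    # mean-square distance of the sample mean from mu is Var[X_bar] = sigma^2/n
    mse = np.mean((means - mu) ** 2)
    assert abs(mse - sigma**2 / n) < 0.05
```
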
Chebyshev's Weak Law of Large Numbers for correlated sequences instead requires that the sequence be covariance stationary11:

∃ μ ∈ R : E[Xn] = μ, ∀n ∈ N

∀j ≥ 0, ∃ γj ∈ R : Cov[Xn, Xn−j] = γj, ∀n > j
9 See p. 519.
1 0 See p. 534.
1 1 In other words, all the random variables in the sequence have the same mean , the same
variance 0 and the covariance between a term Xn of the sequence and the term that is located
j positions before it (Xn j ) is always the same ( j ), irrespective of how Xn has been chosen.
Proof. For a full proof see e.g. Karlin and Taylor12 (1975). We give here a proof based on the assumption that covariances are absolutely summable:

Σ(j=0..∞) |γj| < ∞

which is stronger than (67.1). The expected value of the sample mean X̄n is

E[X̄n] = E[(1/n) Σ(i=1..n) Xi] = (1/n) Σ(i=1..n) E[Xi] = (1/n) n μ = μ

Absolute summability of the covariances implies that there exists a finite constant M such that

Var[X̄n] ≤ M/n

Now we can apply Chebyshev's inequality to the sample mean X̄n:

P(|X̄n − E[X̄n]| ≥ k) ≤ Var[X̄n]/k²

for any strictly positive real number k. Plugging in the values for the expected value and the variance derived above, we obtain:

P(|X̄n − μ| ≥ k) ≤ Var[X̄n]/k² ≤ M/(n k²)

Since

lim(n→∞) M/(n k²) = 0

and

P(|X̄n − μ| ≥ k) ≥ 0

then it must also be that:

lim(n→∞) P(|X̄n − μ| ≥ k) = 0

Note that this holds for any arbitrarily small k. By the very definition of convergence in probability, this means that X̄n converges in probability to μ (if you are wondering about strict and weak inequalities here and in the definition of convergence in probability, note that |X̄n − μ| ≥ k implies |X̄n − μ| > ε for any strictly positive ε < k).
Also Chebyshev's Weak Law of Large Numbers for correlated sequences has been stated as a result on the convergence in probability of the sample mean:

plim(n→∞) X̄n = μ

However, the conditions of the above theorem also guarantee the mean square convergence of the sample mean to μ:

X̄n →m.s. μ
Proof. In the above proof of Chebyshev's Weak Law of Large Numbers for correlated sequences, we proved that

Var[X̄n] ≤ M/n

for a finite constant M, and that

E[X̄n] = μ

This implies:

E[(X̄n − μ)²] = E[(X̄n − E[X̄n])²] = Var[X̄n] ≤ M/n

Thus, taking limits on both sides, we obtain:

lim(n→∞) E[(X̄n − μ)²] ≤ lim(n→∞) M/n = 0

But

E[(X̄n − μ)²] ≥ 0

so it must be:

lim(n→∞) E[(X̄n − μ)²] = 0
a Strong Law of Large Numbers applies to the sample mean X̄n if and only if a Strong Law of Large Numbers applies to each of the components of the vector X̄n, i.e. if and only if

X̄n,j →a.s. μj, j = 1, ..., K

Exercise 1
Let {εn} be an IID sequence. A generic term of the sequence has mean μ and variance σ². Let {Xn} be a covariance stationary sequence such that a generic term of the sequence satisfies

Xn = ρ Xn−1 + εn

where |ρ| < 1. Denote by X̄n the sample mean of the sequence. Verify whether the sequence {Xn} satisfies the conditions that are required by Chebyshev's Weak Law of Large Numbers. In the affirmative case, find its probability limit.
Solution
By assumption the sequence {Xn} is covariance stationary. So all the terms of the sequence have the same expected value. Taking the expected value of both sides of the equation

Xn = ρ Xn−1 + εn

we obtain:

E[Xn] = E[ρ Xn−1 + εn] = ρ E[Xn−1] + E[εn] = ρ E[Xn] + μ

so that

E[Xn] = μ/(1 − ρ)

By the same token, the variance can be derived from:

Var[Xn] = Var[ρ Xn−1 + εn] =(A) ρ² Var[Xn−1] + Var[εn] = ρ² Var[Xn] + σ²

where: in step A we have used the fact that Xn−1 is independent of εn, because {εn} is IID. Solving for Var[Xn], we obtain

Var[Xn] = σ²/(1 − ρ²)

Now, we need to derive Cov[Xn, Xn+j]. Note that:

Xn+1 = ρ Xn + εn+1
Xn+2 = ρ Xn+1 + εn+2 = ρ² Xn + εn+2 + ρ εn+1
Xn+3 = ρ Xn+2 + εn+3 = ρ³ Xn + εn+3 + ρ εn+2 + ρ² εn+1
...
Xn+j = ρ Xn+j−1 + εn+j = ρ^j Xn + Σ(s=0..j−1) ρ^s εn+j−s

Therefore

Cov[Xn, Xn+j] = Cov[Xn, ρ^j Xn + Σ(s=0..j−1) ρ^s εn+j−s] =(B) ρ^j Cov[Xn, Xn] = ρ^j Var[Xn] = ρ^j σ²/(1 − ρ²)

where: in step B we have used the fact that Xn is independent of the future shocks εn+1, ..., εn+j. Since |ρ| < 1, these covariances tend to zero as j grows, and the conditions of Chebyshev's Weak Law of Large Numbers are satisfied. Therefore, the sample mean converges in probability to the population mean:

plim(n→∞) X̄n = E[Xn] = μ/(1 − ρ)
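The exercise is easy to check by simulation. A minimal sketch with illustrative parameter values (ρ = 0.5 and standard-normal shocks shifted to have mean μ = 1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, mu, n = 0.5, 1.0, 200_000

# AR(1): X_t = rho * X_{t-1} + eps_t, with E[eps_t] = mu and Var[eps_t] = 1
x = np.empty(n)
x[0] = mu / (1 - rho)              # start at the stationary mean
for t in range(1, n):
    x[t] = rho * x[t - 1] + mu + rng.standard_normal()

# the sample mean should be close to mu / (1 - rho) = 2
assert abs(x.mean() - mu / (1 - rho)) < 0.05
```
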
Chapter 68

Central limit theorems
Let {Xn} be a sequence of random variables1. Let X̄n be the sample mean of the first n terms of the sequence:

X̄n = (1/n) Σ(i=1..n) Xi

A Central Limit Theorem (CLT) gives conditions under which

√n (X̄n − μ)/σ →d Z

where Z is a standard normal random variable2, μ and σ are two constants and →d indicates convergence in distribution3.
Why is the expression (X̄n − μ)/σ multiplied by the square root of n? If we do not multiply it by √n, then (X̄n − μ)/σ converges to a constant, provided that the conditions4 of a Law of Large Numbers apply. On the contrary, multiplying it by √n, we obtain a sequence that converges to a proper random variable (i.e. a random variable that is not constant). When the conditions of a Central Limit Theorem apply, this variable has a normal distribution.
In practice, the CLT is used as follows:

1. we observe a sample consisting of n observations X1, X2, ..., Xn;

2. if n is large enough, then a standard normal distribution is a good approximation of the distribution of √n (X̄n − μ)/σ;

3. therefore, we pretend that

√n (X̄n − μ)/σ ∼ N(0, 1)
1 See p. 491.
2 Remember that a standard normal random variable is a normal random variable with zero
mean and unit variance (p. 376).
3 See p. 527.
4 See p. 535.
There are several Central Limit Theorems. We report some examples below. The simplest one (the Lindeberg-Lévy CLT) applies to an IID sequence {Xn} such that

E[Xn] = μ < ∞, ∀n ∈ N

Var[Xn] = σ² < ∞, ∀n ∈ N

where σ² > 0. Then, a Central Limit Theorem applies to the sample mean X̄n:

√n (X̄n − μ)/σ →d Z

where Z is a standard normal random variable and →d denotes convergence in distribution.
Proof. We will just sketch a proof. For a detailed and rigorous proof see, for example, Resnick6 (1999) and Williams7 (1991). First of all, denote by {Zn} the sequence whose generic term is

Zn = √n (X̄n − μ)/σ

and by {Yn} the sequence of standardized terms Yn = (Xn − μ)/σ, so that Zn is the sum of the Yi divided by √n, with

E[Y1] = 0, Var[Y1] = 1

Therefore, denoting characteristic functions by φ:

lim(n→∞) φZn(t) = lim(n→∞) [φY1(t/√n)]^n
= lim(n→∞) [1 − (1/2)(t²/n) + o(t²/n)]^n
= exp(−t²/2) = φZ(t)

where

φZ(t) = exp(−t²/2)

is the characteristic function of a standard normal random variable Z (see the lecture entitled Normal distribution - p. 379). A theorem, called Lévy continuity theorem, which we do not cover in these lectures, states that if a sequence of random variables {Zn} is such that their characteristic functions φZn(t) converge to the characteristic function φZ(t) of a random variable Z, then the sequence {Zn} converges in distribution to Z. Therefore, in our case the sequence {Zn} converges in distribution to a standard normal distribution.
So, roughly speaking, under the stated assumptions, the distribution of the sample mean X̄n can be approximated by a normal distribution with mean μ and variance σ²/n (provided n is large enough).
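This approximation can be checked with a quick Monte Carlo sketch (the exponential distribution, sample sizes and tolerances below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 50_000

# Exponential(1) draws: mean mu = 1 and standard deviation sigma = 1
samples = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

# standardized sample means should look like N(0, 1)
assert abs(z.mean()) < 0.02
assert abs(z.std() - 1.0) < 0.02
```
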
Also note that the conditions for the validity of the Lindeberg-Lévy Central Limit Theorem resemble the conditions for the validity of Kolmogorov's Strong Law of Large Numbers10. The only difference is the additional requirement that

Var[Xn] = σ² < ∞, ∀n ∈ N
A CLT for correlated sequences (reported next) requires, among other conditions, that

E[Xn] = μ < ∞, ∀n ∈ N
1 0 See p. 540.
1 1 See p. 492.
1 2 See p. 494.
68.2. MULTIVARIATE GENERALIZATIONS 549
2
Var [Xn ] = < 1; 8n 2 N
1
X
lim nVar X n = 2 + 2 Cov [X1 ; Xi ] = V < 1
n!1
i=2
where V > 0. Then, a Central Limit Theorem applies to the sample mean X n :
p Xn d
n p !Z
V
d
where Z is a standard normal random variable and ! denotes convergence in
distribution.
Note also that ergodicity is replaced by the stronger condition of mixing. Finally, let us mention that the variance $V$ in the above proposition, which is defined as
$$V = \lim_{n\to\infty} n\,\mathrm{Var}\big[\overline{X}_n\big]$$
is called the long-run variance.
$$\mathrm{E}[X_n] = \mu \in \mathbb{R}^K, \quad \forall n \in \mathbb{N}$$
13 Durrett, R. (2010) Probability: Theory and Examples, Cambridge University Press.
14 White, H. (2001) Asymptotic Theory for Econometricians, Academic Press.
15 See p. 541.
$$\mathrm{Var}[X_n] = \Sigma \in \mathbb{R}^{K \times K}, \quad \forall n \in \mathbb{N}$$
where $\Sigma$ is a positive definite matrix. Let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the vector of sample means. Then:
$$\sqrt{n}\,\Sigma^{-1/2}\big(\overline{X}_n - \mu\big) \xrightarrow{d} Z$$
where $Z$ is a standard multivariate normal random vector¹⁶ and $\xrightarrow{d}$ denotes convergence in distribution.
Proof. For a proof see, for example, Basu¹⁷ (2004), DasGupta¹⁸ (2008) or McCabe and Tremayne¹⁹ (1993).
In a similar manner, the CLT for correlated sequences generalizes to random
vectors (V becomes a matrix, called long-run covariance matrix).
Exercise 1
Let $\{X_n\}$ be a sequence of independent Bernoulli random variables²⁰ with parameter $p = \frac{1}{2}$, i.e. a generic term $X_n$ of the sequence has support
$$R_{X_n} = \{0, 1\}$$
Solution
The sequence $\{X_n\}$ is an IID sequence. The mean of a generic term of the sequence is
$$\mathrm{E}[X_n] = \sum_{x \in R_{X_n}} x\, p_{X_n}(x) = 1 \cdot p_{X_n}(1) + 0 \cdot p_{X_n}(0) = 1\cdot\frac{1}{2} + 0\cdot\left(1 - \frac{1}{2}\right) = \frac{1}{2} < \infty$$
16 See p. 439.
17 Basu, A. K. (2004) Measure Theory and Probability, PHI Learning PVT.
18 DasGupta, A. (2008) Asymptotic Theory of Statistics and Probability, Springer.
19 McCabe, B. and A. Tremayne (1993) Elements of Modern Asymptotic Theory with Statistical Applications.
The variance of a generic term of the sequence can be derived thanks to the usual formula for computing the variance²¹:
$$\mathrm{E}\big[X_n^2\big] = \sum_{x \in R_{X_n}} x^2\, p_{X_n}(x) = 1^2 \cdot p_{X_n}(1) + 0^2 \cdot p_{X_n}(0) = 1\cdot\frac{1}{2} + 0 = \frac{1}{2}$$
$$\mathrm{E}[X_n]^2 = \frac{1}{4}$$
$$\mathrm{Var}[X_n] = \mathrm{E}\big[X_n^2\big] - \mathrm{E}[X_n]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4} < \infty$$
Therefore, the sequence $\{X_n\}$ satisfies the conditions of the Lindeberg-Lévy Central Limit Theorem (IID, finite mean, finite variance). The mean of the first 100 terms of the sequence is:
$$\overline{X}_{100} = \frac{1}{100}\sum_{i=1}^{100} X_i$$
Using the Central Limit Theorem to approximate its distribution, we obtain:
$$\overline{X}_n \sim N\!\left(\mathrm{E}[X_n], \frac{\mathrm{Var}[X_n]}{n}\right)$$
or
$$\overline{X}_{100} \sim N\!\left(\frac{1}{2}, \frac{1}{400}\right)$$
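This approximation can be checked by simulating many samples of 100 Bernoulli(1/2) draws (the number of replications below is an arbitrary choice):

```python
import random
import statistics

random.seed(1)

# Sample means of 100 Bernoulli(1/2) draws, repeated many times.
means = []
for _ in range(20000):
    successes = sum(random.random() < 0.5 for _ in range(100))
    means.append(successes / 100)

m = statistics.mean(means)       # should be close to 1/2
v = statistics.pvariance(means)  # should be close to 1/400 = 0.0025
print(round(m, 3), round(v, 5))
```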
Exercise 2
Let fXn g be a sequence of independent Bernoulli random variables with parameter
p = 21 , as in the previous exercise. Let fYn g be another sequence of random
variables such that
1
Yn = Xn+1 Xn ; 8n
2
Suppose fYn g satis…es the conditions of a Central Limit Theorem for correlated
sequences. Derive an approximate distribution for the mean of the …rst n terms of
the sequence fYn g.
Solution
The sequence $\{X_n\}$ is an IID sequence. The mean of a generic term of the sequence $\{Y_n\}$ is
$$\mathrm{E}[Y_n] = \mathrm{E}\!\left[X_{n+1} - \frac{1}{2}X_n\right] = \mathrm{E}[X_{n+1}] - \frac{1}{2}\mathrm{E}[X_n] = \frac{1}{2} - \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$$
The variance of a generic term of the sequence is
$$\mathrm{Var}[Y_n] = \mathrm{Var}\!\left[X_{n+1} - \frac{1}{2}X_n\right] = \mathrm{Var}[X_{n+1}] + \frac{1}{4}\mathrm{Var}[X_n] - 2\cdot\frac{1}{2}\,\mathrm{Cov}[X_{n+1}, X_n]$$
$$\overset{A}{=} \mathrm{Var}[X_{n+1}] + \frac{1}{4}\mathrm{Var}[X_n] = \frac{1}{4} + \frac{1}{4}\cdot\frac{1}{4} = \frac{5}{16}$$
21 $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$. See p. 156.
where: in step A we have used the fact that $X_n$ and $X_{n+1}$ are independent. The covariance between two successive terms of the sequence is
$$\mathrm{Cov}[Y_{n+1}, Y_n] = \mathrm{Cov}\!\left[X_{n+2} - \frac{1}{2}X_{n+1},\; X_{n+1} - \frac{1}{2}X_n\right]$$
$$\overset{A}{=} \mathrm{Cov}[X_{n+2}, X_{n+1}] - \frac{1}{2}\mathrm{Cov}[X_{n+2}, X_n] - \frac{1}{2}\mathrm{Cov}[X_{n+1}, X_{n+1}] + \frac{1}{4}\mathrm{Cov}[X_{n+1}, X_n]$$
$$\overset{B}{=} -\frac{1}{2}\mathrm{Cov}[X_{n+1}, X_{n+1}] \overset{C}{=} -\frac{1}{2}\mathrm{Var}[X_{n+1}] = -\frac{1}{2}\cdot\frac{1}{4} = -\frac{1}{8}$$
where: in step A we have used the bilinearity of covariance; in step B we have used the fact that the terms of $\{X_n\}$ are independent; in step C we have used the fact that the covariance of a random variable with itself is its variance.
Moreover, for $j \ge 2$, the covariance between terms of the sequence is
$$\mathrm{Cov}[Y_{n+j}, Y_n] = \mathrm{Cov}\!\left[X_{n+j+1} - \frac{1}{2}X_{n+j},\; X_{n+1} - \frac{1}{2}X_n\right]$$
$$\overset{A}{=} \mathrm{Cov}[X_{n+j+1}, X_{n+1}] - \frac{1}{2}\mathrm{Cov}[X_{n+j+1}, X_n] - \frac{1}{2}\mathrm{Cov}[X_{n+j}, X_{n+1}] + \frac{1}{4}\mathrm{Cov}[X_{n+j}, X_n] \overset{B}{=} 0$$
because, for $j \ge 2$, all the pairs of terms involved are independent. The long-run variance is therefore
$$V = \mathrm{Var}[Y_n] + 2\sum_{i=2}^{\infty}\mathrm{Cov}[Y_1, Y_i] = \frac{5}{16} + 2\cdot\left(-\frac{1}{8}\right) = \frac{1}{16}$$
Using the Central Limit Theorem for correlated sequences to approximate its distribution, we obtain
$$\overline{Y}_n \sim N\!\left(\mathrm{E}[Y_n], \frac{V}{n}\right)$$
or
$$\overline{Y}_n \sim N\!\left(\frac{1}{4}, \frac{1}{16n}\right)$$
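The mean 1/4 and long-run variance 1/16 derived above can be verified by simulation (the values of n and the number of replications are arbitrary choices):

```python
import random
import statistics

random.seed(2)

n, reps = 500, 4000
ybar = []
for _ in range(reps):
    # A path of n+1 Bernoulli(1/2) draws, then Y_i = X_{i+1} - 0.5 * X_i.
    x = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n + 1)]
    y = [x[i + 1] - 0.5 * x[i] for i in range(n)]
    ybar.append(sum(y) / n)

m = statistics.mean(ybar)           # close to 1/4
nv = n * statistics.pvariance(ybar) # close to V = 1/16 = 0.0625
print(round(m, 3), round(nv, 3))
```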
Exercise 3
Let $Y$ be a binomial random variable with parameters $n = 100$ and $p = \frac{1}{2}$ (you need to read the lecture entitled Binomial distribution²³ in order to be able to solve this exercise). Using the Central Limit Theorem, show that a normal random variable $X$ with mean $\mu = 50$ and variance $\sigma^2 = 25$ can be used as an approximation of $Y$.
Solution
A binomial random variable $Y$ with parameters $n = 100$ and $p = \frac{1}{2}$ can be written as
$$Y = \sum_{i=1}^{100} X_i$$
where $X_1, \ldots, X_{100}$ are independent Bernoulli random variables with parameter $p = \frac{1}{2}$. In the first exercise, we have shown that the distribution of $\overline{X}_{100}$ can be approximated by a normal distribution:
$$\overline{X}_{100} \sim N\!\left(\frac{1}{2}, \frac{1}{400}\right)$$
Since $Y = 100\,\overline{X}_{100}$, the distribution of $Y$ can be approximated by a normal distribution with mean $100\cdot\frac{1}{2} = 50$ and variance $100^2\cdot\frac{1}{400} = 25$:
$$Y \sim N(50, 25)$$
23 See p. 341.
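The quality of the approximation $Y \sim N(50, 25)$ can be checked against the exact binomial CDF (a continuity correction of 0.5 is applied below; this refinement is not discussed in the text):

```python
import math

def binom_cdf(k, n=100, p=0.5):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def norm_cdf(x, mu=50.0, sigma=5.0):
    """CDF of the approximating N(50, 25) distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Compare the exact binomial CDF with the normal approximation.
for k in (40, 45, 50, 55, 60):
    print(k, round(binom_cdf(k), 4), round(norm_cdf(k + 0.5), 4))
```

With $n = 100$ and $p = 1/2$ the two columns agree to about three decimal places.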
Chapter 69
Convergence of
transformations
where $\xrightarrow{P}$ denotes convergence in probability, $\xrightarrow{a.s.}$ denotes almost sure convergence and $\xrightarrow{d}$ denotes convergence in distribution.
Proposition 332 If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then:
$$X_n + Y_n \xrightarrow{P} X + Y$$
$$X_n Y_n \xrightarrow{P} XY$$
Proof. First of all, note that convergence in probability of $\{X_n\}$ and of $\{Y_n\}$ implies their joint convergence in probability², i.e. their convergence as a vector:
$$[X_n \; Y_n] \xrightarrow{P} [X \; Y]$$
Now, the sum and the product are continuous functions of the operands. Thus, for example:
$$X + Y = g([X \; Y])$$
where $g$ is a continuous function, and, using the continuous mapping theorem:
$$\mathop{\mathrm{plim}}_{n\to\infty}(X_n + Y_n) = \mathop{\mathrm{plim}}_{n\to\infty} g([X_n \; Y_n]) = g\Big(\mathop{\mathrm{plim}}_{n\to\infty}[X_n \; Y_n]\Big) = g([X \; Y]) = X + Y$$
where $\mathrm{plim}$ denotes a limit in probability.
$$1/X_n \to 1/X$$
provided $X$ is almost surely different from 0 (we did not specify the kind of convergence, which can be in probability, almost sure, or in distribution).
Proof. This is a consequence of the continuous mapping theorem and of the fact that
$$g(x) = 1/x$$
is a continuous function for $x \neq 0$.
As a consequence:
Proposition 337 If two sequences of random variables $\{X_n\}$ and $\{Y_n\}$ converge to $X$ and $Y$ respectively, then
$$X_n / Y_n \to X / Y$$
provided $Y$ is almost surely different from 0.
Proof. This is a consequence of the fact that the ratio can be written as a product:
$$X_n / Y_n = X_n\,(1/Y_n)$$
The first operand of the product converges by assumption. The second converges because of the previous proposition. Therefore, their product converges because convergence is preserved under products.
3 A. W. van der Vaart (2000) Asymptotic Statistics, Cambridge University Press.
Exercise 1
Let $\{X_n\}$ be a sequence of $K \times 1$ random vectors such that
$$X_n \xrightarrow{d} X$$
where $X$ is a normal random vector with mean $\mu$ and invertible covariance matrix $V$. Let $\{A_n\}$ be a sequence of $L \times K$ random matrices such that:
$$A_n \xrightarrow{P} A$$
where $A$ is a constant matrix. Find the limit in distribution of the sequence of products $\{A_n X_n\}$.
Solution
By Slutsky's theorem,
$$A_n X_n \xrightarrow{d} Y$$
where
$$Y = AX$$
The random vector $Y$ has a multivariate normal distribution, because it is a linear transformation of a multivariate normal random vector⁵. The expected value of $Y$ is
$$\mathrm{E}[Y] = \mathrm{E}[AX] = A\,\mathrm{E}[X] = A\mu$$
and its covariance matrix is
$$\mathrm{Var}[Y] = \mathrm{Var}[AX] = A\,\mathrm{Var}[X]\,A^{\top} = AVA^{\top}$$
Therefore, the sequence of products $\{A_n X_n\}$ converges in distribution to a multivariate normal random vector with mean $A\mu$ and covariance matrix $AVA^{\top}$.
4 See p. 119.
5 See p. 469.
Exercise 2
Let $\{X_n\}$ be a sequence of $K \times 1$ random vectors such that
$$X_n \xrightarrow{d} X$$
where $X$ is a normal random vector with mean 0 and invertible covariance matrix $V$. Let $\{V_n\}$ be a sequence of $K \times K$ random matrices such that
$$V_n \xrightarrow{P} V$$
Find the limit in distribution of the sequence
$$X_n^{\top} V_n^{-1} X_n$$
Solution
By the continuous mapping theorem,
$$V_n^{-1} \xrightarrow{P} V^{-1}$$
Therefore, by Slutsky's theorem,
$$X_n^{\top} V_n^{-1} X_n \xrightarrow{d} X^{\top} V^{-1} X$$
Writing the covariance matrix as $V = \Sigma\Sigma^{\top}$, we have
$$X^{\top} V^{-1} X = X^{\top}\big(\Sigma\Sigma^{\top}\big)^{-1} X = X^{\top}\big(\Sigma^{\top}\big)^{-1}\Sigma^{-1} X = \big(\Sigma^{-1}X\big)^{\top}\big(\Sigma^{-1}X\big) = Z^{\top}Z$$
where $Z = \Sigma^{-1}X$ is a multivariate normal random vector with mean 0 and covariance matrix
$$\mathrm{Var}[Z] = \Sigma^{-1}\,V\,\big(\Sigma^{-1}\big)^{\top} = \Sigma^{-1}\Sigma\Sigma^{\top}\big(\Sigma^{\top}\big)^{-1} = I$$
i.e. a standard multivariate normal random vector. Being a sum of squares of $K$ independent standard normal random variables, $Z^{\top}Z$ has a Chi-square distribution with $K$ degrees of freedom.
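The limiting Chi-square distribution can be illustrated by simulation in the case $K = 2$ (the matrix $\Sigma$ below is an arbitrary choice of a square root of $V$):

```python
import random
import statistics

random.seed(3)

# Fixed lower-triangular Sigma = [[a, 0], [b, c]] and V = Sigma Sigma^T.
a, b, c = 1.0, 0.5, 1.0
v11, v12, v22 = a * a, a * b, b * b + c * c
det = v11 * v22 - v12 * v12
# Inverse of V, computed explicitly for the 2x2 case.
i11, i12, i22 = v22 / det, -v12 / det, v11 / det

q = []
for _ in range(20000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1, x2 = a * z1, b * z1 + c * z2       # X = Sigma Z, so Var[X] = V
    q.append(i11 * x1 * x1 + 2 * i12 * x1 * x2 + i22 * x2 * x2)

m = statistics.mean(q)       # mean of chi2(2) is 2
v = statistics.pvariance(q)  # variance of chi2(2) is 4
print(round(m, 2), round(v, 1))
```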
Exercise 3
Let everything be as in the previous exercise, except for the fact that now $X$ has mean $\mu$, and let $\{\mu_n\}$ be a sequence of random vectors converging in probability to $\mu$. Find the limit in distribution of the sequence
$$(X_n - \mu_n)^{\top}\, V_n^{-1}\, (X_n - \mu_n)$$
Solution
Define
$$Y_n = X_n - \mu_n$$
By Slutsky's theorem,
$$Y_n \xrightarrow{d} Y$$
where
$$Y = X - \mu$$
is a multivariate normal random vector with mean 0 and variance $V$. Thus, we can use the results of the previous exercise on the sequence
$$Y_n^{\top} V_n^{-1} Y_n$$
6 See p. 481.
Part VII
Fundamentals of statistics
Chapter 70
Statistical inference
Statistical inference is the act of using observed data to infer unknown properties and characteristics of the probability distribution from which the observed data have been generated. The set of data that is used to make inferences is called a sample.
70.1 Samples
In the simplest possible case, we observe the realizations¹ $x_1, \ldots, x_n$ of $n$ independent random variables² $X_1, \ldots, X_n$ having a common distribution function³ $F_X(x)$, and we use the observed realizations to infer some characteristics of $F_X(x)$. With
a slight abuse of language, we sometimes say "n independent realizations of a
random variable X" instead of saying "the realizations of n independent random
variables X1 , . . . , Xn having a common distribution function FX (x)".
Example 338 The lifetime of a certain type of electronic device is a random variable $X$, whose distribution function $F_X(x)$ is unknown. Suppose we independently observe the lifetimes of 10 devices. Denote these realizations by $x_1$, $x_2$, . . . , $x_{10}$. We are interested in the expected value of $X$, which is an unknown characteristic of $F_X(x)$. We infer $\mathrm{E}[X]$ from the data, estimating $\mathrm{E}[X]$ with the sample mean
$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i$$
In this simple example the observed data x1 , x2 , . . . , x10 constitute our sample and
E [X] is the quantity about which we are making a statistical inference.
While in the simplest case X1 , . . . , Xn are independent random variables, more
complicated cases are possible. For example:
1. X1 , . . . , Xn are not independent;
2. X1 , . . . , Xn are random vectors having a common joint distribution function4
FX (x);
1 See p. 105.
2 See p. 229.
3 See p. 108.
4 See p. 118.
$$\xi = [X_1 \;\ldots\; X_n]$$
Example 346 Take the example above and drop the assumption that the n random
variables X1 , . . . , Xn are mutually independent. The statistical model is now:
When each distribution function is associated with only one parameter, the parametric family is said to be identifiable.
$$F \in \Phi_R$$
or an exclusion restriction:
$$F \notin \Phi_R$$
There are several different ways of formalizing such a decision problem. The branch of statistics that analyzes these decision problems is called statistical decision theory.
Chapter 71
Point estimation
In the lecture entitled Statistical inference (p. 563) we have defined statistical inference as the act of using a sample to make statements about the probability distribution that generated the sample. The sample is regarded as the realization of a random vector $\xi$, whose unknown joint distribution function¹, denoted by $F(\cdot)$, is assumed to belong to a set of distribution functions $\Phi$, called the statistical model.
When the model $\Phi$ is put into correspondence with a set $\Theta \subseteq \mathbb{R}^p$ of real vectors, we have a parametric model. $\Theta$ is called the parameter space and its elements are called parameters. Denote by $\theta_0$ the parameter that is associated with the unknown distribution function $F(\cdot)$ and assume that $\theta_0$ is unique. $\theta_0$ is called the true parameter, because it is associated with the distribution that actually generated the sample. This lecture introduces a type of inference about the true parameter called point estimation.
Among the consequences that are usually considered in a parametric decision problem, the most relevant one is the estimation error. The estimation error $e$ is the difference between the estimate $\widehat{\theta}$ and the true parameter $\theta_0$:
$$e = \widehat{\theta} - \theta_0$$
A common loss function is the absolute estimation error
$$L\big(\widehat{\theta}, \theta_0\big) = \big\|\widehat{\theta} - \theta_0\big\|$$
where $\|\cdot\|$ is the Euclidean norm (it coincides with the absolute value when $\theta \in \mathbb{R}$).
$$L\big(\widehat{\theta}(\xi), \theta_0\big)$$
can be thought of as a random variable. Its expected value is called the statistical risk (or, simply, the risk) of the estimator $\widehat{\theta}$, and it is denoted by $R(\widehat{\theta})$:
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[L\big(\widehat{\theta}(\xi), \theta_0\big)\Big]$$
where the expected value is computed with respect to the true distribution function $F(\cdot\,;\theta_0)$. Thus, the risk $R(\widehat{\theta})$ depends both on the true parameter $\theta_0$ and on the distribution function of $\xi$. In practice, these quantities are unknown, so also the risk needs to be estimated. For example, we can compute an estimate $\widetilde{R}(\widehat{\theta})$ of the risk by pretending that the estimate $\widehat{\theta}$ were the true parameter, and computing the estimated risk as:
$$\widetilde{R}\big(\widehat{\theta}\big) = \mathrm{E}\Big[L\big(\widehat{\theta}(\xi), \widehat{\theta}\big)\Big]$$
where the expected value is with respect to the estimated distribution function $F(\cdot\,;\widehat{\theta})$.
Even if the risk is unknown, the notion of risk is often used to derive theoretical properties of estimators. In any case, parameter estimation is always guided, at least ideally, by the principle of risk minimization, i.e. by the search for estimators $\widehat{\theta}$ that minimize the risk $R(\widehat{\theta})$.
Depending on the specific loss function we use, the statistical risk of an estimator can take different names:
1. when the absolute error is used as a loss function, then the risk
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[\big\|\widehat{\theta} - \theta_0\big\|\Big]$$
is called the mean absolute error (MAE);
2. when the squared error is used as a loss function, then the risk
$$R\big(\widehat{\theta}\big) = \mathrm{E}\Big[\big\|\widehat{\theta} - \theta_0\big\|^2\Big]$$
is called the mean squared error (MSE). The square root of the mean squared error is called the root mean squared error (RMSE).
71.3.1 Unbiasedness
If an estimator produces parameter estimates that are on average correct, then it is said to be unbiased. The following is a formal definition: an estimator $\widehat{\theta}$ is unbiased if and only if
$$\mathrm{E}\big[\widehat{\theta}(\xi)\big] = \theta_0$$
where $\xi$ is the random vector of which the sample is a realization and the expected value is computed with respect to the true distribution function $F(\cdot\,;\theta_0)$.
Also note that if an estimator is unbiased, this implies that the estimation error is on average zero:
$$\mathrm{E}[e] = \mathrm{E}\big[\widehat{\theta} - \theta_0\big] = \mathrm{E}\big[\widehat{\theta}\big] - \theta_0 = \theta_0 - \theta_0 = 0$$
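Unbiasedness can be illustrated numerically with the sample mean as the estimator (the true parameter, dispersion, and sample sizes below are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(4)

theta0 = 3.0   # true mean of the data-generating distribution (illustrative)
estimates = []
for _ in range(10000):
    sample = [random.gauss(theta0, 2.0) for _ in range(20)]
    estimates.append(statistics.mean(sample))   # the estimator theta_hat

# The average estimation error E[e] = E[theta_hat] - theta0 should be near zero.
avg_error = statistics.mean(estimates) - theta0
print(round(avg_error, 3))
```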
71.3.2 Consistency
If an estimator produces parameter estimates that converge to the true value as the sample size increases, then it is said to be consistent. The following is a formal definition: a sequence of estimators $\{\widehat{\theta}_n\}$ is strongly consistent if and only if
$$\widehat{\theta}_n \xrightarrow{a.s.} \theta_0$$
where $\xrightarrow{a.s.}$ indicates almost sure convergence⁴. A sequence of estimators which is not consistent is called inconsistent.
When the sequence of estimators is obtained by applying the same predefined rule to every sample $\xi_n$, we often say, with a slight abuse of language, "consistent estimator" instead of "consistent sequence of estimators". In such cases, what we mean is that the predefined rule produces a consistent sequence of estimators.
71.4 Examples
You can find examples of point estimation in the lectures entitled Point estimation of the mean (p. 573) and Point estimation of the variance (p. 579).
3 See p. 511.
4 See p. 505.
Chapter 72
Point estimation of the mean
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The expected value of the sample mean is
$$\mathrm{E}\big[\overline{X}_n\big] = \mu$$
1 See p. 564.
2 See p. 569.
Therefore, the variance of the estimator tends to zero as the sample size $n$ tends to infinity.
Proof. Note that the sample mean $\overline{X}_n$ is a linear combination of the normal and independent random variables $X_1, \ldots, X_n$ (all the coefficients of the linear combination are equal to $\frac{1}{n}$).
3 See p. 571.
4 See p. 168.
The mean squared error of the estimator is
$$\mathrm{MSE}\big(\overline{X}_n\big) = \mathrm{E}\Big[\big\|\overline{X}_n - \mu\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(\overline{X}_n - \mu\big)^2\Big] \overset{B}{=} \mathrm{Var}\big[\overline{X}_n\big] = \frac{\sigma^2}{n}$$
where: in step A we have used the fact that in one dimension the Euclidean norm is the same as the absolute value; in step B we have used the definition of variance and the fact that $\mathrm{E}\big[\overline{X}_n\big] = \mu$ (see above).
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The difference with respect to the previous example is that now we are no longer assuming that the sample points come from a normal distribution.
$$\sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma} \xrightarrow{d} Z$$
where $Z$ is a standard normal random variable¹¹ and $\xrightarrow{d}$ denotes convergence in distribution. In other words, for large $n$ the distribution of the sample mean $\overline{X}_n$ is approximately normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$.
Exercise 1
Consider an experiment that can have only two outcomes: either success, with probability $p$, or failure, with probability $1-p$. The probability of success is unknown, but we know that
$$p \in \left[\frac{1}{10}, \frac{1}{5}\right]$$
Suppose we can independently repeat the experiment as many times as we wish and use the ratio
$$\frac{\text{Successes obtained}}{\text{Total experiments performed}}$$
as an estimator of $p$. What is the minimum number of experiments needed in order to be sure that the standard deviation of the estimator is less than $1/100$?
Solution
Denote by $\widehat{p}$ the estimator of $p$. It can be written as
$$\widehat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where $X_i$ is a Bernoulli random variable equal to 1 if the $i$-th experiment is a success and 0 otherwise. The variance of the estimator is
$$\mathrm{Var}[\widehat{p}\,] \overset{A}{=} \frac{\mathrm{Var}[X_i]}{n} \overset{B}{=} \frac{p(1-p)}{n} \le \max_{p \in [\frac{1}{10},\frac{1}{5}]}\frac{p(1-p)}{n} = \frac{\frac{1}{5}\left(1-\frac{1}{5}\right)}{n} = \frac{4}{25n}$$
where: in step A we have used the formula for the variance of the sample mean; in step B we have used the formula for the variance of a Bernoulli random variable. (The maximum is attained at $p = \frac{1}{5}$ because $p(1-p)$ is increasing on the interval $[\frac{1}{10},\frac{1}{5}]$.) Thus
$$\mathrm{Var}[\widehat{p}\,] \le \frac{4}{25n}$$
We need to ensure that
$$\mathrm{Std}[\widehat{p}\,] = \sqrt{\mathrm{Var}[\widehat{p}\,]} \le \frac{1}{100}$$
or
$$\mathrm{Var}[\widehat{p}\,] \le \frac{1}{10000}$$
which is certainly verified if
$$\frac{4}{25n} \le \frac{1}{10000}$$
i.e. if
$$n \ge \frac{40000}{25} = 1600$$
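A quick numeric check of the worst-case bound (the helper function name is ours):

```python
import math

def worst_case_std(n):
    """Largest possible Std of p_hat over p in [1/10, 1/5]."""
    p = 0.2   # p(1-p) is increasing on [0, 1/2], so the max is at p = 1/5
    return math.sqrt(p * (1 - p) / n)

print(worst_case_std(1600))   # hits the target 1/100 exactly at n = 1600
print(worst_case_std(1599))   # one experiment fewer is not enough
```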
Exercise 2
Suppose you observe a sample of 100 independent draws from a distribution having unknown mean $\mu$ and known variance $\sigma^2 = 1$. How can you approximate the distribution of their sample mean?
Solution
We can approximate the distribution of the sample mean with its asymptotic distribution. So the distribution of the sample mean can be approximated by a normal distribution with mean $\mu$ and variance
$$\frac{\sigma^2}{n} = \frac{1}{100}$$
Chapter 73
Point estimation of the variance
$$\xi_n = [X_1 \;\ldots\; X_n]$$
Proof. This can be proved using the linearity of the expected value:
$$\mathrm{E}\big[\widehat{\sigma}^2_n\big] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^2\big] \overset{A}{=} \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}[X_i] = \frac{1}{n}\,n\,\sigma^2 = \sigma^2$$
where in step A we have used the definition of variance.
The variance of the estimator can be computed as follows:
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] = \mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] \overset{A}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}\big[(X_i - \mu)^2\big]$$
$$= \frac{1}{n^2}\left(\sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^4\big] - \sum_{i=1}^{n}\mathrm{E}\big[(X_i - \mu)^2\big]^2\right) \overset{B}{=} \frac{1}{n^2}\left(\sum_{i=1}^{n}3\sigma^4 - \sum_{i=1}^{n}\sigma^4\right)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}2\sigma^4 = \frac{1}{n^2}\,n\,2\sigma^4 = \frac{2\sigma^4}{n}$$
where: in step A we have used the formula for the variance of an independent sum⁴; in step B we have used the fact that for a normal distribution
$$\mathrm{E}\big[(X_i - \mu)^4\big] = 3\sigma^4$$
Therefore, the variance of the estimator tends to zero as the sample size $n$ tends to infinity.
3 See p. 571.
where the variables $Z_i$ are independent standard normal random variables⁶ and $W$, being a sum of squares of $n$ independent standard normal random variables, has a Chi-square distribution with $n$ degrees of freedom⁷. Multiplying a Chi-square random variable with $n$ degrees of freedom by $\frac{\sigma^2}{n}$, one obtains⁸ a Gamma random variable with parameters $n$ and $\sigma^2$.
The mean squared error of the estimator is
$$\mathrm{MSE}\big(\widehat{\sigma}^2_n\big) = \mathrm{E}\Big[\big\|\widehat{\sigma}^2_n - \sigma^2\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(\widehat{\sigma}^2_n - \sigma^2\big)^2\Big] \overset{B}{=} \mathrm{Var}\big[\widehat{\sigma}^2_n\big] = \frac{2\sigma^4}{n}$$
where: in step A we have used the fact that the Euclidean norm in one dimension is the same as the absolute value; in step B we have used the definition of variance and (73.1).
4 See p. 168.
5 See p. 397.
6 See p. 375.
7 See p. 395.
8 See p. 402.
9 See p. 571.
can be viewed as the sample mean of a sequence $\{Y_n\}$ where the generic term of the sequence is
$$Y_n = (X_n - \mu)^2$$
The sequence $\{Y_n\}$ satisfies the conditions of Kolmogorov's Strong Law of Large Numbers¹⁰ ($\{Y_n\}$ is an IID sequence with finite mean). Therefore, the sample mean of $Y_n$ converges almost surely to the true mean $\mathrm{E}[Y_n]$:
$$\widehat{\sigma}^2_n \xrightarrow{a.s.} \mathrm{E}[Y_n] = \sigma^2$$
which also implies convergence in probability:
$$\mathop{\mathrm{plim}}_{n\to\infty} \widehat{\sigma}^2_n = \sigma^2$$
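Almost sure convergence along a single simulated path can be sketched as follows (normal data with $\sigma^2 = 9$ and a known mean $\mu = 0$ are arbitrary illustrative choices):

```python
import random

random.seed(5)

mu, sigma2 = 0.0, 9.0
running_sum, estimates = 0.0, {}
for n in range(1, 200001):
    x = random.gauss(mu, 3.0)
    running_sum += (x - mu) ** 2          # accumulate (X_i - mu)^2
    if n in (100, 10000, 200000):
        estimates[n] = running_sum / n    # sigma2_hat at this sample size

print(estimates)   # the running estimates approach sigma2 = 9 along the path
```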
$$\xi_n = [X_1 \;\ldots\; X_n]$$
The expected value of the unadjusted sample variance $S_n^2 = \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2$ can be computed as follows:
$$\mathrm{E}\big[S_n^2\big] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2\right] = \mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \big(\overline{X}_n - \mu\big)^2\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
where we have used the identity $\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \big(\overline{X}_n - \mu\big)^2$, together with the facts that $\mathrm{E}\big[(X_i - \mu)^2\big] = \sigma^2$ and $\mathrm{E}\big[\big(\overline{X}_n - \mu\big)^2\big] = \mathrm{Var}\big[\overline{X}_n\big] = \frac{\sigma^2}{n}$. Therefore, the unadjusted sample variance is a biased estimator of $\sigma^2$.
The variance of the unadjusted sample variance is
$$\mathrm{Var}\big[S_n^2\big] = \frac{n-1}{n}\,\frac{2\sigma^4}{n}$$
Proof. This is proved in subsection 73.2.5.
The variance of the adjusted sample variance is:
$$\mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
Proof. This is also proved in subsection 73.2.5.
Therefore, both the variance of $S_n^2$ and the variance of $s_n^2$ converge to zero as the sample size $n$ tends to infinity. Also note that the unadjusted sample variance $S_n^2$, despite being biased, has a smaller variance than the adjusted sample variance $s_n^2$, which is instead unbiased.
$$M = I - \frac{1}{n}\,\iota\,\iota^{\top}$$
where $I$ is an $n \times n$ identity matrix and $\iota$ is an $n \times 1$ vector of ones. $M$ is symmetric and idempotent. Denote by $X$ the $n \times 1$ random vector whose $i$-th entry is equal to $X_i$.
i.e. $S_n^2$ is a Chi-square random variable divided by its number of degrees of freedom and multiplied by $\frac{(n-1)\sigma^2}{n}$. Thus¹⁶, $S_n^2$ is a Gamma random variable with parameters $n-1$ and $\frac{(n-1)\sigma^2}{n}$. Also, by the properties of Gamma random variables, its expected value is:
$$\mathrm{E}\big[S_n^2\big] = \frac{(n-1)\sigma^2}{n}$$
and its variance is:
$$\mathrm{Var}\big[S_n^2\big] = \frac{2}{n-1}\left(\frac{(n-1)\sigma^2}{n}\right)^2 = \frac{n-1}{n}\,\frac{2\sigma^4}{n}$$
The adjusted sample variance $s_n^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$.
Proof. The proof of this result is similar to the proof for the unadjusted sample variance found above. It can also be found in the lecture entitled Quadratic forms in normal vectors (p. 481). Here, we just note that $s_n^2$, being a Gamma random variable with parameters $n-1$ and $\sigma^2$, has expected value
$$\mathrm{E}\big[s_n^2\big] = \sigma^2$$
and variance
$$\mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
15 See p. 439.
16 See p. 397.
The mean squared error of the unadjusted sample variance is
$$\mathrm{MSE}\big(S_n^2\big) = \mathrm{E}\Big[\big\|S_n^2 - \sigma^2\big\|^2\Big] \overset{A}{=} \mathrm{E}\Big[\big(S_n^2 - \sigma^2\big)^2\Big] = \mathrm{Var}\big[S_n^2\big] + \big(\mathrm{E}\big[S_n^2\big] - \sigma^2\big)^2$$
$$= \frac{n-1}{n}\,\frac{2\sigma^4}{n} + \left(\frac{(n-1)\sigma^2}{n} - \sigma^2\right)^2 = \frac{2n-2}{n^2}\,\sigma^4 + \frac{(n-1-n)^2}{n^2}\,\sigma^4 = \frac{2n-1}{n^2}\,\sigma^4$$
where: in step A we have used the fact that the Euclidean norm in one dimension is the same as the absolute value.
The mean squared error of the adjusted sample variance is:
$$\mathrm{MSE}\big(s_n^2\big) = \frac{2\sigma^4}{n-1}$$
Proof. It can be proved as follows:
$$\mathrm{MSE}\big(s_n^2\big) = \mathrm{E}\Big[\big\|s_n^2 - \sigma^2\big\|^2\Big] = \mathrm{E}\Big[\big(s_n^2 - \sigma^2\big)^2\Big] = \mathrm{E}\Big[\big(s_n^2 - \mathrm{E}\big[s_n^2\big]\big)^2\Big] = \mathrm{Var}\big[s_n^2\big] = \frac{2\sigma^4}{n-1}$$
where we have used the fact that $s_n^2$ is unbiased, i.e. $\mathrm{E}\big[s_n^2\big] = \sigma^2$.
Therefore, the mean squared error of the unadjusted sample variance is always smaller than the mean squared error of the adjusted sample variance:
$$\mathrm{MSE}\big(S_n^2\big) = \frac{2n-1}{n^2}\,\sigma^4 = \left(\frac{2}{n} - \frac{1}{n^2}\right)\sigma^4 < \frac{2\sigma^4}{n} < \frac{2\sigma^4}{n-1} = \mathrm{MSE}\big(s_n^2\big)$$
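This ranking of mean squared errors can be verified by Monte Carlo ($n = 10$ and $\sigma^2 = 4$ are arbitrary choices):

```python
import random

random.seed(6)

n, sigma2, reps = 10, 4.0, 40000
mse_unadj, mse_adj = 0.0, 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    mse_unadj += (ss / n - sigma2) ** 2        # unadjusted S_n^2
    mse_adj += (ss / (n - 1) - sigma2) ** 2    # adjusted s_n^2
mse_unadj /= reps
mse_adj /= reps

# Theory: MSE(S2) = (2n-1)/n^2 * sigma^4 = 0.19 * 16 = 3.04
#         MSE(s2) = 2/(n-1) * sigma^4   = 16 * 2/9 ≈ 3.56
print(round(mse_unadj, 2), round(mse_adj, 2))
```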
The unadjusted sample variance can be written as
$$S_n^2 \overset{B}{=} W_n - \overline{X}_n^2$$
where $W_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$.
17 See p. 555.
$$\mathop{\mathrm{plim}}_{n\to\infty} S_n^2 = \sigma^2$$
The adjusted sample variance can be written as
$$s_n^2 = Z_n S_n^2$$
where $Z_n = \frac{n}{n-1} \to 1$, so that both $Z_n$ and $S_n^2$ are almost surely convergent. Since the product is a continuous function and almost sure convergence is preserved by continuous transformation, we have:
$$s_n^2 \xrightarrow{a.s.} 1 \cdot \sigma^2 = \sigma^2$$
Exercise 1
You observe three independent draws from a normal distribution having unknown mean $\mu$ and unknown variance $\sigma^2$. Their values are 50, 100 and 150. Use these values to produce an unbiased estimate of the variance of the distribution.
Solution
The sample mean is
$$\overline{X}_n = \frac{50 + 100 + 150}{3} = 100$$
An unbiased estimate of the variance is provided by the adjusted sample variance:
$$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \overline{X}_n\big)^2 = \frac{1}{3-1}\Big[(50-100)^2 + (100-100)^2 + (150-100)^2\Big] = \frac{1}{2}\,[2500 + 0 + 2500] = \frac{5000}{2} = 2500$$
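The same computation in Python (`statistics.variance` computes the adjusted, i.e. $n-1$ denominator, sample variance):

```python
import statistics

sample = [50, 100, 150]
print(statistics.mean(sample))      # sample mean: 100
print(statistics.variance(sample))  # adjusted sample variance: 2500
```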
Exercise 2
A machine (a laser rangefinder) is used to measure the distance between the machine itself and a given object. When measuring the distance to an object located 10 meters away, the measurement errors committed by the machine are normally and independently distributed and are on average equal to zero. The variance of the measurement errors is less than 1 squared centimeter, but its exact value is unknown and needs to be estimated. To estimate it, we repeatedly take the same measurement and we compute the sample variance of the measurement errors (which we are able to compute, because we know the true distance). How many measurements do we need to take to obtain an estimator of variance having a standard deviation less than 0.1 squared centimeters?
Solution
Denote the measurement errors by $X_1$, . . . , $X_n$. The following estimator of variance is used:
$$\widehat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$$
The variance of this estimator is:
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \overset{A}{=} \frac{2\,(\mathrm{Var}[X_i])^2}{n} \overset{B}{\le} \frac{2\,(1\,\mathrm{cm}^2)^2}{n} = \frac{2}{n}\,\mathrm{cm}^4$$
where: in step A we have used the formula for the variance of the sample variance; in step B we have used the upper bound stated in the problem (variance less than 1 squared centimeter). Thus
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \le \frac{2}{n}\,\mathrm{cm}^4$$
We need to ensure that
$$\mathrm{Std}\big[\widehat{\sigma}^2_n\big] = \sqrt{\mathrm{Var}\big[\widehat{\sigma}^2_n\big]} \le \frac{1}{10}\,\mathrm{cm}^2$$
or
$$\mathrm{Var}\big[\widehat{\sigma}^2_n\big] \le \frac{1}{100}\,\mathrm{cm}^4$$
which is certainly verified if
$$\frac{2}{n} \le \frac{1}{100}$$
i.e. if
$$n \ge 200$$
Chapter 74
Set estimation
$$T = T(\xi)$$
$$\theta_0 \in \Theta$$
where $\Theta$ is the parameter space, containing all the parameters that are deemed plausible. The statistician believes the statement to be true, but the statement is not very informative, because $\Theta$ is a very large set. After observing the data, she makes a more informative statement:
$$\theta_0 \in T$$
The coverage probability of a set estimator $T$ is
$$C(T; \theta_0) = P_{\theta_0}\big(\theta_0 \in T(\xi)\big)$$
where the notation $P_{\theta_0}$ is used to indicate the fact that the probability is calculated using the distribution function $F(\cdot\,;\theta_0)$ associated to the true parameter $\theta_0$. It is important to note that in the above definition of coverage probability the random quantity is the interval $T(\xi)$, while the parameter $\theta_0$ is fixed.
In practice, the coverage probability is seldom known, because it depends on the unknown parameter $\theta_0$ (although in some cases it is equal for all the parameters belonging to the parameter space). When the coverage probability is not known, it is customary to compute the confidence coefficient $c(T)$, which is defined as follows:
$$c(T) = \inf_{\theta \in \Theta} C(T; \theta)$$
In other words, the confidence coefficient $c(T)$ is equal to the smallest possible coverage probability. The confidence coefficient is also often called the level of confidence.
then the size of $T$ is its volume. The size of a confidence set is also called the measure of a confidence set (for those who have a grasp of measure theory, the name stems from the fact that the Lebesgue measure is the generalization of volume to multidimensional spaces). If we denote by $\lambda(T)$ the size of a confidence set, then we can also define the expected size of a set estimator $T$:
$$\mathrm{E}_{\theta_0}\big[\lambda(T(\xi))\big]$$
where the notation $\mathrm{E}_{\theta_0}$ is used to indicate the fact that the expected value is calculated using the distribution function $F(\cdot\,;\theta_0)$ associated to the true parameter $\theta_0$. Like the coverage probability, also the expected size of a set estimator depends on the unknown parameter $\theta_0$. Hence, unless it is a constant function of $\theta_0$, one needs to somehow estimate it or to take the infimum over all possible values of the parameter, as we did above for coverage probabilities.
74.5 Examples
You can find examples of set estimation in the lectures entitled Set estimation of the mean (p. 595) and Set estimation of the variance (p. 607).
Chapter 75
Set estimation of the mean

This lecture presents some examples of set estimation problems, focusing on set estimation of the mean, i.e. on using a sample to produce a set estimate of the mean of an unknown distribution.
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
$$C(T_n; \mu) = P(\mu \in T_n) = P(-z \le Z \le z)$$
In the lecture entitled Point estimation of the mean (p. 573), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the sample mean $\overline{X}_n$ has a normal distribution with mean $\mu$ and variance $\sigma^2/n$. Subtracting the mean of a normal random variable from the random variable itself and dividing it by the square root of its variance, one obtains a standard normal random variable. Therefore, the variable $Z$ has a standard normal distribution.
75.1.5 Size
The size⁶ of the interval estimator $T_n$ is
$$\lambda(T_n) = \lambda\!\left(\left[\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z\right]\right) = \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z - \left(\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z\right) = 2\sqrt{\frac{\sigma^2}{n}}\,z$$
$$\xi_n = [x_1 \;\ldots\; x_n]$$
which is a realization of the random vector
$$\xi_n = [X_1 \;\ldots\; X_n]$$
6 See p. 592.
7 See p. 583.
where $z \in \mathbb{R}_{++}$ is a strictly positive constant and the superscripts $u$ and $a$ indicate whether the estimator is based on the unadjusted or the adjusted sample variance. Now, rewrite $Z_{n-1}$ as
$$Z_{n-1} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{S_n^2/n}} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{\sqrt{\sigma^2/n}}{\sqrt{S_n^2/n}} = \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{S_n^2/\sigma^2}}$$
$$= \sqrt{\frac{n-1}{n}}\,\frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{\frac{n-1}{n}\,s_n^2/\sigma^2}} = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}\,\frac{1}{\sqrt{s_n^2/\sigma^2}} = \frac{Y}{\sqrt{W}}$$
where we have defined
$$Y = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}, \qquad W = s_n^2/\sigma^2$$
and we have used the fact that the unadjusted sample variance can be expressed as a function of the adjusted sample variance as follows:
$$S_n^2 = \frac{n-1}{n}\,s_n^2$$
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the adjusted sample variance $s_n^2$ has a Gamma distribution¹⁰ with parameters $n-1$ and $\sigma^2$. Therefore, the random variable $W$ has a Gamma distribution with parameters $n-1$ and 1. Moreover, the random variable $Y$ has a standard normal distribution (see the previous section). Hence, $Z_{n-1}$ is the ratio between a standard normal random variable and the square root of a Gamma random variable with parameters $n-1$ and 1. As a consequence, $Z_{n-1}$ has a standard Student's t distribution with $n-1$ degrees of freedom¹¹.
The coverage probability of the interval estimator $T_n^a$ is
$$C(T_n^a; \mu, \sigma^2) = P(\mu \in T_n^a) = P(-z \le Z_{n-1} \le z)$$
Proof. Define
$$Y = \frac{\overline{X}_n - \mu}{\sqrt{\sigma^2/n}}, \qquad W = s_n^2/\sigma^2$$
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample $\xi_n$ made above, the adjusted sample variance $s_n^2$ has a Gamma distribution with parameters $n-1$ and $\sigma^2$. Therefore, the random variable $W$ has a Gamma distribution with parameters $n-1$ and 1. Moreover, the random variable $Y$ has a standard normal distribution (see the previous section). Hence, $Z_{n-1}$ is the ratio between a standard normal random variable and the square root of a Gamma random variable with parameters $n-1$ and 1. As a consequence, $Z_{n-1}$ has a standard Student's t distribution with $n-1$ degrees of freedom (see also the previous proof).
Note that the coverage probability of the confidence interval based on the unadjusted sample variance $S_n^2$ is lower than the coverage probability of the confidence interval based on the adjusted sample variance $s_n^2$, because
$$\sqrt{\frac{n-1}{n}}\,z < z$$
and, as a consequence,
$$C(T_n^u; \mu, \sigma^2) = P\!\left(-\sqrt{\frac{n-1}{n}}\,z \le Z_{n-1} \le \sqrt{\frac{n-1}{n}}\,z\right) < P(-z \le Z_{n-1} \le z) = C(T_n^a; \mu, \sigma^2)$$
75.2.5 Size
The size of the confidence interval $T_n^u$ is
$$\lambda(T_n^u) = \lambda\!\left(\left[\overline{X}_n - \sqrt{\frac{S_n^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{S_n^2}{n}}\,z\right]\right) = \overline{X}_n + \sqrt{\frac{S_n^2}{n}}\,z - \left(\overline{X}_n - \sqrt{\frac{S_n^2}{n}}\,z\right) = 2\sqrt{\frac{S_n^2}{n}}\,z$$
Note that the size of the confidence interval based on the unadjusted sample variance $S_n^2$ is smaller than the size of the confidence interval based on the adjusted sample variance $s_n^2$, because
$$S_n^2 < s_n^2$$
and, as a consequence,
$$\lambda(T_n^u) = 2\sqrt{\frac{S_n^2}{n}}\,z < 2\sqrt{\frac{s_n^2}{n}}\,z = \lambda(T_n^a)$$
Thus, the confidence interval based on the unadjusted sample variance has a smaller size and a smaller coverage probability. As we have explained in the lecture entitled Set estimation (p. 591), the choice of set estimators is often inspired by the principle of achieving the highest possible coverage probability for a given size, or the smallest possible size for a given coverage probability. Following this principle, there is no clear ranking between the estimator based on the unadjusted sample variance and the estimator based on the adjusted sample variance, because the former has smaller size, but the latter has higher coverage probability.
The expected size of the interval estimator $T_n^u$ can be derived as follows. Define
$$X = S_n^2$$
which has a Gamma distribution with parameters $n-1$ and $\frac{(n-1)\sigma^2}{n}$ (see above), i.e. a distribution with density function
$$f_X(x) = c\, x^{(n-1)/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)$$
where $c$ is a constant:
$$c = \frac{(n/\sigma^2)^{(n-1)/2}}{2^{(n-1)/2}\,\Gamma\big((n-1)/2\big)}$$
and $\Gamma(\cdot)$ is the Gamma function¹². Therefore:
$$\mathrm{E}[\lambda(T_n^u)] = \mathrm{E}\!\left[2\sqrt{\frac{S_n^2}{n}}\,z\right] = 2z\sqrt{\frac{1}{n}}\,\mathrm{E}\!\left[\sqrt{S_n^2}\right] = 2zn^{-1/2}\,\mathrm{E}\!\left[X^{1/2}\right]$$
$$= 2zn^{-1/2}\int_0^{\infty} x^{1/2}\, c\, x^{(n-1)/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 2zn^{-1/2}\,c\int_0^{\infty} x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx$$
$$= 2zn^{-1/2}\,c\,\frac{1}{c_1}\int_0^{\infty} c_1\, x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 2zn^{-1/2}\,\frac{c}{c_1}$$
where we have defined
$$c_1 = \frac{(n/\sigma^2)^{n/2}}{2^{n/2}\,\Gamma(n/2)}$$
and we have used the fact that
$$\int_0^{\infty} c_1\, x^{n/2 - 1}\exp\!\left(-\frac{n}{2\sigma^2}\,x\right)dx = 1$$
because it is the integral of the density of a Gamma random variable with parameters $n$ and $\sigma^2$ over its support, and probability densities integrate to 1. Thus:
$$\mathrm{E}[\lambda(T_n^u)] = 2zn^{-1/2}\,\frac{c}{c_1} = 2zn^{-1/2}\,\frac{(n/\sigma^2)^{(n-1)/2}}{2^{(n-1)/2}\,\Gamma\big((n-1)/2\big)}\,\frac{2^{n/2}\,\Gamma(n/2)}{(n/\sigma^2)^{n/2}}$$
$$= 2zn^{-1/2}\,2^{1/2}\,(n/\sigma^2)^{-1/2}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)} = 2z\sqrt{\frac{2}{n}}\sqrt{\frac{\sigma^2}{n}}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)} = \left[2\sqrt{\frac{2}{n}}\,\frac{\Gamma(n/2)}{\Gamma\big((n-1)/2\big)}\,z\right]\sqrt{\frac{\sigma^2}{n}}$$
12 See p. 55.
Exercise 1
Suppose you observe a sample of 100 independent draws from a normal distribution having unknown mean $\mu$ and known variance $\sigma^2 = 1$. Denote the 100 draws by $X_1$, . . . , $X_{100}$. Suppose their sample mean $\overline{X}_{100}$ is equal to 1, i.e.:
$$\overline{X}_{100} = \frac{1}{100}\sum_{i=1}^{100} X_i = 1$$
Find a confidence interval for $\mu$, using a set estimator of $\mu$ having 90% coverage probability.
Solution
For a given sample size $n$, the interval estimator
$$T_n = \left[\overline{X}_n - \sqrt{\frac{\sigma^2}{n}}\,z,\; \overline{X}_n + \sqrt{\frac{\sigma^2}{n}}\,z\right]$$
Exercise 2
Suppose you observe a sample of 100 independent draws from a normal distribution
having unknown mean and unknown variance 2 . Denote the 100 draws by X1 ,
. . . , X100 . Suppose their sample mean X 100 is equal to 1, i.e.:
100
1 X
X 100 = Xi = 1
100 i=1
75.3. SOLVED EXERCISES 605
Find a con…dence interval for , using a set estimator of having 99% coverage
probability.
Solution
For a given sample size n, the interval estimator
" r r #
a s2n s2n
Tn = X n z; X n + z
n n
C Tna ; ; 2
= P ( 2 Tna ) = P ( z Zn 1 z)
P( z Zn 1 z) = 99%
But

P(−z ≤ Z_{n−1} ≤ z) = P(Z_{n−1} ≤ z) − P(Z_{n−1} < −z)
 = 1 − P(Z_{n−1} > z) − P(Z_{n−1} < −z)
 = 1 − 2 P(Z_{n−1} > z)

where the last equality stems from the fact that the standard Student's t distribution is symmetric around zero. Therefore z must be such that

1 − 2 P(Z_{n−1} > z) = 0.99

or

P(Z_{n−1} > z) = 0.005
Using a computer program to find the value of z (for example, with the MATLAB
command tinv(0.995,99)), we obtain

z = 2.6264
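The same value can be obtained outside MATLAB; for example, a sketch with SciPy (`scipy.stats.t` is the Student's t distribution):

```python
from scipy.stats import t

# analogue of the MATLAB call tinv(0.995, 99):
# the 99.5th percentile of a Student's t with 99 degrees of freedom
z = t.ppf(0.995, df=99)
```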
Chapter 76

Set estimation of the variance

This lecture presents some examples of set estimation1 problems, focusing on set
estimation of the variance, i.e. on using a sample to produce a set estimate of
the variance σ² of an unknown distribution.

ξ_n = [X_1 … X_n]

1 See p. 591.
2 See p. 564.
3 See p. 591.
T_n = [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ]

P(σ² ∈ T_n) = P( n σ̂²_n / z_2 ≤ σ² ≤ n σ̂²_n / z_1 )
 = P( { n σ̂²_n / z_2 ≤ σ² } ∩ { σ² ≤ n σ̂²_n / z_1 } )
 = P( { n σ̂²_n / σ² ≤ z_2 } ∩ { z_1 ≤ n σ̂²_n / σ² } )
 = P( z_1 ≤ n σ̂²_n / σ² ≤ z_2 )
 = P( z_1 ≤ Z ≤ z_2 )

where Z = n σ̂²_n / σ².
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample ξ_n made above, the estimator σ̂²_n has a Gamma distribution6 with parameters n and σ². Multiplying a Gamma random variable with parameters n and σ² by n/σ² one obtains a Chi-square random variable with n degrees of freedom. Therefore, the variable Z has a Chi-square distribution with n degrees of freedom.
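This distributional fact can be checked by simulation; a sketch assuming NumPy is available (the sample size, variance and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 200_000

# draws with known mean 0; each row is one sample of size n
x = rng.normal(loc=0.0, scale=sigma2 ** 0.5, size=(reps, n))
var_hat = np.mean(x ** 2, axis=1)   # variance estimator using the known mean
z = n * var_hat / sigma2            # should be Chi-square with n degrees of freedom

# a Chi-square with n degrees of freedom has mean n and variance 2n
```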
76.1.5 Size
The size8 of the interval estimator T_n is

λ(T_n) = λ( [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ] )
 = n σ̂²_n / z_1 − n σ̂²_n / z_2
 = n (1/z_1 − 1/z_2) σ̂²_n

and its expected value is

E[λ(T_n)] = E[ n (1/z_1 − 1/z_2) σ̂²_n ]
 = n (1/z_1 − 1/z_2) E[σ̂²_n]
 = n (1/z_1 − 1/z_2) σ²

where we have used the fact that σ̂²_n is an unbiased estimator of σ²
(i.e. E[σ̂²_n] = σ², see p. 580).
ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]

8 See p. 592.
T_n = [ n S_n² / z_2 , n S_n² / z_1 ] = [ (n−1) s_n² / z_2 , (n−1) s_n² / z_1 ]

P(σ² ∈ T_n) = P( n S_n² / z_2 ≤ σ² ≤ n S_n² / z_1 )
 = P( { n S_n² / z_2 ≤ σ² } ∩ { σ² ≤ n S_n² / z_1 } )
 = P( { n S_n² / σ² ≤ z_2 } ∩ { z_1 ≤ n S_n² / σ² } )
 = P( z_1 ≤ n S_n² / σ² ≤ z_2 )
 = P( z_1 ≤ Z_{n−1} ≤ z_2 )

where Z_{n−1} = n S_n² / σ².
In the lecture entitled Point estimation of the variance (p. 579), we have demonstrated that, given the assumptions on the sample ξ_n made above, the unadjusted sample variance S_n² has a Gamma distribution with parameters n − 1 and ((n−1)/n) σ². Multiplying a Gamma random variable with parameters n − 1 and ((n−1)/n) σ² by n/σ² one obtains a Chi-square random variable with n − 1 degrees of freedom, so the variable Z_{n−1} has a Chi-square distribution with n − 1 degrees of freedom.

9 See p. 583 for a definition and a discussion of adjusted and unadjusted sample variance.
76.2.5 Size
The size of the confidence interval T_n is

λ(T_n) = λ( [ n S_n² / z_2 , n S_n² / z_1 ] )
 = n (1/z_1 − 1/z_2) S_n²

and its expected value is

E[λ(T_n)] = E[ n (1/z_1 − 1/z_2) S_n² ]
 = n (1/z_1 − 1/z_2) E[S_n²]
 = n (1/z_1 − 1/z_2) ((n−1)/n) σ²
 = (n−1) (1/z_1 − 1/z_2) σ²

where in the penultimate step we have used the fact (proved in the lecture entitled
Point estimation of the variance - p. 579) that

E[S_n²] = ((n−1)/n) σ²
76.3 Solved exercises

Exercise 1
Suppose you observe a sample of 100 independent draws from a normal distribution
having known mean μ = 0 and unknown variance σ². Denote the 100 draws by
X_1, …, X_100. Suppose that

σ̂²_100 = (1/100) Σ_{i=1}^{100} X_i² = 1
Find a confidence interval for σ², using a set estimator of σ² having 90% coverage
probability.
Hint: a Chi-square random variable Z with 100 degrees of freedom has a distribution function F_Z(z) such that

F_Z(77.9295) = 0.05
F_Z(124.3421) = 0.95
Solution
For a given sample size n, the interval estimator

T_n = [ n σ̂²_n / z_2 , n σ̂²_n / z_1 ]

has coverage probability

C(T_n; σ²) = P(σ² ∈ T_n) = P(z_1 ≤ Z ≤ z_2)

where Z is a Chi-square random variable with n degrees of freedom and z_1, z_2 ∈
R_{++} are strictly positive constants. Thus, if we set

z_1 = 77.9295
z_2 = 124.3421

then

P(z_1 ≤ Z ≤ z_2) = P(Z ≤ z_2) − P(Z < z_1)
 (A) = P(Z ≤ z_2) − P(Z ≤ z_1)
 (B) = F_Z(z_2) − F_Z(z_1)
 = F_Z(124.3421) − F_Z(77.9295)
 = 0.95 − 0.05 = 0.9

which is equal to our desired coverage probability (in step A we have used the
fact that any specific realization of an absolutely continuous random variable has
zero probability; in step B we have used the definition of distribution function).
Thus, the confidence interval for σ² is

T_100 = [ 100 σ̂²_100 / z_2 , 100 σ̂²_100 / z_1 ]
 = [ 100/124.3421 , 100/77.9295 ]
 = [0.8042, 1.2832]
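The critical values in the hint and the resulting interval can be reproduced with SciPy (a sketch; `scipy.stats.chi2` is the Chi-square distribution):

```python
from scipy.stats import chi2

n = 100
var_hat = 1.0                 # variance estimate computed with the known mean
z1 = chi2.ppf(0.05, df=n)     # lower critical value
z2 = chi2.ppf(0.95, df=n)     # upper critical value
ci = (n * var_hat / z2, n * var_hat / z1)
```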
Exercise 2
Suppose you observe a sample of 100 independent draws from a normal distribution
having unknown mean μ and unknown variance σ². Denote the 100 draws by X_1,
…, X_100. Suppose that their adjusted sample variance s²_100 is equal to 5, i.e.:

s²_100 = (1/99) Σ_{i=1}^{100} (X_i − X̄_100)² = 5
Find a confidence interval for σ², using a set estimator of σ² having 99% coverage
probability.
Hint: a Chi-square random variable Z with 99 degrees of freedom has a distribution function F_Z(z) such that

F_Z(66.5101) = 0.005
F_Z(138.9868) = 0.995
Solution
For a given sample size n, the interval estimator

T_n = [ (n−1) s_n² / z_2 , (n−1) s_n² / z_1 ]

has coverage probability

C(T_n; μ, σ²) = P(σ² ∈ T_n) = P(z_1 ≤ Z ≤ z_2)

where Z is a Chi-square random variable with n − 1 degrees of freedom and z_1, z_2 ∈
R_{++} are strictly positive constants. Thus, if we set

z_1 = 66.5101
z_2 = 138.9868

then

P(z_1 ≤ Z ≤ z_2) = P(Z ≤ z_2) − P(Z < z_1)
 (A) = P(Z ≤ z_2) − P(Z ≤ z_1)
 (B) = F_Z(z_2) − F_Z(z_1)
 = F_Z(138.9868) − F_Z(66.5101)
 = 0.995 − 0.005 = 0.99

which is equal to our desired coverage probability (in step A we have used the
fact that any specific realization of an absolutely continuous random variable has
zero probability; in step B we have used the definition of distribution function).
Thus, the confidence interval for σ² is

T_100 = [ 99 s²_100 / z_2 , 99 s²_100 / z_1 ]
 = [ (99/138.9868) 5 , (99/66.5101) 5 ]
 = [3.5615, 7.4425]
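Again, the hint's critical values and the interval can be reproduced with SciPy (a sketch):

```python
from scipy.stats import chi2

n = 100
s2 = 5.0                          # adjusted sample variance
z1 = chi2.ppf(0.005, df=n - 1)    # lower critical value
z2 = chi2.ppf(0.995, df=n - 1)    # upper critical value
ci = ((n - 1) * s2 / z2, (n - 1) * s2 / z1)
```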
Chapter 77
Hypothesis testing
Roughly speaking, we start from a large set of distributions Φ that might possibly have generated the sample and we would like to restrict our attention to a smaller set Φ_R ⊂ Φ. In a test of hypothesis, we use the sample to decide whether or not to restrict our attention to the smaller set Φ_R.
If we have a parametric model, we can also carry out parametric tests of hy-
pothesis.
Remember that in a parametric model the set of distribution functions Φ is
put into correspondence with a set Θ ⊆ R^p of p-dimensional real vectors called
the parameter space. The elements of Θ are called parameters. Denote by θ_0 the
parameter that is associated with the unknown distribution function F(x) and
assume that θ_0 is unique. θ_0 is called the true parameter, because it is associated
to the distribution that actually generated the sample.

In parametric hypothesis testing we have a restriction Θ_R on the parameter
space Θ and we choose one of the following two statements about the restriction:

H_0 : θ_0 ∈ Θ_R, called the null hypothesis, or
H_1 : θ_0 ∉ Θ_R, called the alternative hypothesis.
For some authors, "rejecting the null hypothesis H0 " and "accepting the alter-
native hypothesis H1 " are synonyms. For other authors, however, "rejecting the
null hypothesis H0 " does not necessarily imply "accepting the alternative hypoth-
esis H1 ". Although this is mostly a matter of language, it is possible to envision
situations in which, after rejecting H0 , a second test of hypothesis is performed
whereby H1 becomes the new null hypothesis and it is rejected (this may happen
for example if the model is mis-specified2 ). In these situations, if "rejecting the
null hypothesis H0" and "accepting the alternative hypothesis H1" are treated as
synonyms, then some confusion arises, because the first test leads to "accept H1"
and the second test leads to "reject H1".
Also note that some statisticians sometimes take into consideration as an alternative hypothesis a set smaller than the complement Θ_R^c. In these cases, the null hypothesis and the alternative hypothesis do not cover all the possibilities contemplated by the parameter space Θ.
A subset C of the support of the sample ξ is called the critical region (or rejection region) of the test: it is the set of all values of ξ for which the null hypothesis is rejected. The test is usually performed through a test statistic

S = s(ξ)

A critical region for S is a subset C_S ⊆ R of the set of real numbers (so that
C_S ∪ C_S^c = R), and the test is performed based on the test statistic, as follows:

s(ξ) ∈ C_S ⟹ ξ ∈ C ⟹ H_0 is rejected
s(ξ) ∉ C_S ⟹ ξ ∉ C ⟹ H_0 is not rejected

When the complement of the critical region is an interval,

C_S^c = [s_l, s_u]

the lower bound of the interval s_l and the upper bound s_u are called critical
values of the test.
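The mechanics above can be sketched in code; the function name below is a hypothetical illustration, not notation from this lecture:

```python
def reject_h0(s, s_l, s_u):
    """Decide a test given the value s of the test statistic and the
    critical values s_l, s_u: H0 is rejected when s falls in the
    critical region C_S, i.e. outside the interval [s_l, s_u]."""
    return not (s_l <= s <= s_u)
```

For instance, with critical values −2 and 2, a statistic of 3 leads to rejection while a statistic of 1 does not.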
The power function of the test is

π(θ) = P_θ(ξ ∈ C)

where the notation P_θ is used to indicate the fact that the probability is calculated
using the distribution function F(x; θ) associated to the parameter θ.
α = sup_{θ ∈ Θ_R} π(θ)

and it is called the size of the test. The size of the test is also called by some
authors the level of significance of the test. However, according to other
authors, who assign a slightly different meaning to the term, the level of significance
of a test is an upper bound on the size of the test, i.e. a constant α that, to the
statistician's knowledge, satisfies:

sup_{θ ∈ Θ_R} π(θ) ≤ α
77.9 Examples
You can find examples of hypothesis testing in the lectures entitled Hypothesis tests
about the mean (p. 619) and Hypothesis tests about the variance (p. 629).
Chapter 78

Hypothesis tests about the mean

ξ_n = [X_1 … X_n]

H_0 : μ = μ_0

1 See p. 615.
2 See p. 616.
Z_n = (X̄_n − μ_0) / √(σ²/n)

This test statistic is often called z-statistic or normal z-statistic, and a test of
hypothesis based on this statistic is called z-test or normal z-test.
C_{Z_n} = (−∞, −z) ∪ (z, +∞)

The power function of the test is

π(μ) = P_μ( Z_n ∉ [−z, z] )
 = 1 − P_μ( Z_n ∈ [−z, z] )
 = 1 − P_μ( −z ≤ (X̄_n − μ_0)/√(σ²/n) ≤ z )
 = 1 − P_μ( −z + μ_0/√(σ²/n) ≤ X̄_n/√(σ²/n) ≤ z + μ_0/√(σ²/n) )
 = 1 − P_μ( −z + (μ_0 − μ)/√(σ²/n) ≤ (X̄_n − μ)/√(σ²/n) ≤ z + (μ_0 − μ)/√(σ²/n) )
 = 1 − P( −z + (μ_0 − μ)/√(σ²/n) ≤ Z ≤ z + (μ_0 − μ)/√(σ²/n) )

3 See p. 616.
4 See p. 617.
5 See p. 616.
6 See p. 617.
As demonstrated in the lecture entitled Point estimation of the mean (p. 573),
the sample mean X̄_n has a normal distribution with mean μ and variance σ²/n,
given the assumptions on the sample ξ_n we made above. By subtracting the mean
of a normal random variable from the random variable itself, and dividing it by
the square root of its variance, one obtains a standard normal random variable.
Therefore, the variable Z has a standard normal distribution.
π(μ_0) = P_{μ_0}( Z_n ∉ [−z, z] ) = 1 − P(−z ≤ Z ≤ z)
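The power function derived above can be evaluated numerically; a sketch assuming SciPy, with illustrative parameter values that are not from the lecture (μ_0 = 0, σ² = 1, n = 100, z = 1.96):

```python
from scipy.stats import norm

def power(mu, mu0=0.0, sigma2=1.0, n=100, z=1.96):
    """Power of the z-test of H0: mu = mu0 when the true mean is mu:
    1 - P(-z + shift <= Z <= z + shift), shift = (mu0 - mu)/sqrt(sigma2/n)."""
    shift = (mu0 - mu) / (sigma2 / n) ** 0.5
    return 1 - (norm.cdf(z + shift) - norm.cdf(-z + shift))
```

Evaluated at μ = μ_0 the power equals the size 1 − P(−z ≤ Z ≤ z), and it increases as the true mean moves away from μ_0.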
78.2 Normal IID samples - unknown variance

ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]

7 See p. 617.

H_0 : μ = μ_0
Z_n^u = (X̄_n − μ_0) / √(S_n²/n)

Z_n^a = (X̄_n − μ_0) / √(s_n²/n)
where the superscripts u and a indicate whether the test statistic is based on the
unadjusted or the adjusted sample variance. These two test statistics are often
called t-statistics or Student’s t-statistics, and tests of hypothesis based on
these statistics are called t-tests or Student’s t-tests.
C_{Z_n^i} = (−∞, −z) ∪ (z, +∞) for i ∈ {u, a}
The power function of the test based on the unadjusted sample variance is

π^u(μ) = P_μ( Z_n^u ∉ [−z, z] ) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal
to μ, and W_{n−1} is a non-central standard Student's t distribution9 with n − 1
degrees of freedom and non-centrality parameter equal to

(μ − μ_0) / √(σ²/n)

Given the assumptions on the sample ξ_n we made above, the sample mean X̄_n
has a normal distribution with10 mean μ and variance σ²/n, so that the random
variable

(X̄_n − μ) / √(σ²/n)

has a standard normal distribution.
The power function of the test based on the adjusted sample variance is

π^a(μ) = P_μ( Z_n^a ∉ [−z, z] ) = 1 − P( −z ≤ W_{n−1} ≤ z )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal to
μ, and W_{n−1} is a non-central standard Student's t distribution with n − 1 degrees
of freedom and non-centrality parameter equal to

(μ − μ_0) / √(σ²/n)

In fact,

π^a(μ) = 1 − P( −z ≤ [ (X̄_n − μ)/√(σ²/n) + (μ − μ_0)/√(σ²/n) ] / √(s_n²/σ²) ≤ z )
 = 1 − P( −z ≤ W_{n−1} ≤ z )

Given the assumptions on the sample ξ_n we made above, the sample mean X̄_n has
a normal distribution with mean μ and variance σ²/n, so that the random variable
(X̄_n − μ)/√(σ²/n) has a standard normal distribution.
Note that, for a fixed z, the test based on the unadjusted sample variance is
more powerful than the test based on the adjusted sample variance:

π^u(μ) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )
 > 1 − P( −z ≤ W_{n−1} ≤ z ) = π^a(μ)

because

√((n−1)/n) < 1

and, as a consequence,

P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z ) < P( −z ≤ W_{n−1} ≤ z )
The size of the test based on the unadjusted sample variance is equal to

π^u(μ_0) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )

where W_{n−1} is a standard Student's t distribution with n − 1 degrees of freedom,
because its non-centrality parameter is equal to

(μ_0 − μ_0) / √(σ²/n) = 0

The size of the test based on the adjusted sample variance is equal to

π^a(μ_0) = 1 − P( −z ≤ W_{n−1} ≤ z )

where W_{n−1} is a standard Student's t distribution with n − 1 degrees of freedom.
Proof. When evaluated at the point μ = μ_0, the power function is equal to the
size of the test, that is, the probability of committing a Type I error. The power
function evaluated at μ_0 is

π^a(μ_0) = 1 − P( −z ≤ W_{n−1} ≤ z )

where W_{n−1} is a non-central standard Student's t distribution with n − 1 degrees
of freedom and non-centrality parameter equal to

(μ_0 − μ_0) / √(σ²/n) = 0

and a non-central Student's t distribution with zero non-centrality parameter is a
standard Student's t distribution.
78.3 Solved exercises

Exercise 1
Denote by Fn (x; k) the distribution function of a non-central standard Student’s
t distribution with n degrees of freedom and non-centrality parameter equal to k.
Suppose a statistician observes 100 independent realizations of a normal random
variable. The mean and the variance of the random variable, which the statistician
does not know, are equal to 1 and 4 respectively. What is the probability, expressed
in terms of Fn (x; k), that the statistician will reject the null hypothesis that the
mean is equal to zero if she runs a t-test based on the 100 observed realizations,
setting z = 2 as the critical value, and using the adjusted sample variance to
compute the t-statistic?
Solution
The probability of rejecting the null hypothesis μ_0 = 0 is obtained by evaluating
the power function of the test at μ = 1:

π^a(μ) = π^a(1) = P_μ( Z_n^a ∉ [−z, z] ) = 1 − P( −2 ≤ W_99 ≤ 2 )

where the notation P_μ is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true mean is equal to
μ = 1, and W_99 is a non-central standard Student's t distribution with 99 degrees
of freedom and non-centrality parameter

k = (μ − μ_0) / √(σ²/n) = (1 − 0) / √(4/100) = 10/2 = 5

Therefore, expressed in terms of F_n(x; k), the probability of rejection is

1 − F_99(2; 5) + F_99(−2; 5)
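The rejection probability can then be evaluated with SciPy's non-central t distribution (`scipy.stats.nct`); a sketch:

```python
from scipy.stats import nct

n, z = 100, 2.0
mu, mu0, sigma2 = 1.0, 0.0, 4.0
k = (mu - mu0) / (sigma2 / n) ** 0.5          # non-centrality parameter
# 1 - P(-z <= W <= z) for a non-central t with n-1 d.o.f. and non-centrality k
p_reject = 1 - (nct.cdf(z, df=n - 1, nc=k) - nct.cdf(-z, df=n - 1, nc=k))
```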
Exercise 2
Denote by F_n(x) the distribution function of a standard Student's t distribution
with n degrees of freedom, and by F_n^{-1}(p) its inverse. Suppose that a statistician
observes 100 independent realizations of a normal random variable, and she performs a t-test of the null hypothesis that the mean of the variable is equal to zero,
based on the 100 observed realizations, and using the unadjusted sample variance
to compute the t-statistic. What critical value should she use in order to incur in
a Type I error with 10% probability? Express it in terms of F_n^{-1}(p).
Solution
A Type I error is committed when the null hypothesis is true, but it is rejected.
The probability of rejecting the null hypothesis μ_0 = 0 is

π^u(μ_0) = π^u(0) = 1 − P( −√((n−1)/n) z ≤ W_{n−1} ≤ √((n−1)/n) z )
 = 1 − P( −√(99/100) z ≤ W_99 ≤ √(99/100) z )

where z is the critical value, and W_99 is a standard Student's t distribution with
99 degrees of freedom. This probability can be expressed as

1 − P( −√(99/100) z ≤ W_99 ≤ √(99/100) z )
 = 1 − [ F_99( √(99/100) z ) − F_99( −√(99/100) z ) ]
 = 1 − F_99( √(99/100) z ) + F_99( −√(99/100) z )
 (A) = 1 − F_99( √(99/100) z ) + 1 − F_99( √(99/100) z )
 = 2 − 2 F_99( √(99/100) z )

where: in step A we have used the fact that the density of a standard Student's
t distribution is symmetric around zero. Thus, we need to set z in such a way that

2 − 2 F_99( √(99/100) z ) = 1/10
This is accomplished by

z = √(100/99) F_99^{-1}(19/20)
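Numerically, with SciPy (a sketch), the critical value and the implied size can be checked as follows:

```python
from scipy.stats import t

n = 100
# z = sqrt(n/(n-1)) * F_{99}^{-1}(19/20), as derived above
z = (n / (n - 1)) ** 0.5 * t.ppf(0.95, df=n - 1)
size = 2 - 2 * t.cdf(z * ((n - 1) / n) ** 0.5, df=n - 1)
```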
Chapter 79

Hypothesis tests about the variance

ξ_n = [X_1 … X_n]

1 See p. 615.
2 See p. 616.
σ̂²_n = (1/n) Σ_{i=1}^n (X_i − μ)²

and the test statistic is

χ²_n = n σ̂²_n / σ_0²

This test statistic is often called Chi-square statistic (also written as χ²-statistic)
and a test of hypothesis based on this statistic is called Chi-square test (also
written as χ²-test).
C_{χ²_n} = [0, z_1) ∪ (z_2, +∞)
π(σ²) = 1 − P( z_1 ≤ n σ̂²_n / σ_0² ≤ z_2 )
 = 1 − P( (σ_0²/σ²) z_1 ≤ n σ̂²_n / σ² ≤ (σ_0²/σ²) z_2 )

As demonstrated in the lecture entitled Point estimation of the variance (p. 579),
the estimator σ̂²_n has a Gamma distribution8 with parameters n and σ², given the
assumptions on the sample ξ_n we made above. Multiplying a Gamma random
variable with parameters n and σ² by n/σ² one obtains a Chi-square random
variable with n degrees of freedom. Therefore, the variable n σ̂²_n / σ² has a Chi-square
distribution with n degrees of freedom.
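The statistic itself is straightforward to compute; a sketch assuming NumPy (the helper name is a hypothetical illustration):

```python
import numpy as np

def chi_square_statistic(x, mu, sigma2_0):
    """Chi-square statistic n * sigma_hat^2 / sigma2_0, where sigma_hat^2 is
    computed with the known mean mu; under H0 it has a Chi-square
    distribution with n degrees of freedom."""
    x = np.asarray(x, dtype=float)
    var_hat = np.mean((x - mu) ** 2)
    return x.size * var_hat / sigma2_0
```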
ξ_n = [x_1 … x_n]

which is a realization of the random vector

ξ_n = [X_1 … X_n]
The test statistic is

χ²_n = n S_n² / σ_0²

This test statistic is often called Chi-square statistic (also written as χ²-statistic)
and a test of hypothesis based on this statistic is called Chi-square test (also
written as χ²-test).
C_{χ²_n} = [0, z_1) ∪ (z_2, +∞)
where the notation P_{σ²} is used to indicate the fact that the probability of rejecting
the null hypothesis is computed under the hypothesis that the true variance is
equal to σ², and the variable n S_n² / σ² has a Chi-square distribution with n − 1
degrees of freedom.

10 See p. 583.
π(σ²) = 1 − P( (σ_0²/σ²) z_1 ≤ n S_n² / σ² ≤ (σ_0²/σ²) z_2 )

Given the assumptions on the sample ξ_n we made above, the unadjusted sample
variance S_n² has a Gamma distribution with parameters11 n − 1 and ((n−1)/n) σ², so that
the random variable

( (n−1) / ( ((n−1)/n) σ² ) ) S_n² = (n/σ²) S_n²

has a Chi-square distribution with n − 1 degrees of freedom.
79.3 Solved exercises

Exercise 1
Denote by F_n(x) the distribution function of a Chi-square random variable with n
degrees of freedom. Suppose you observe 40 independent realizations of a normal
random variable. What is the probability, expressed in terms of F_n(x), that you
will commit a Type I error if you run a Chi-square test of the null hypothesis that
the variance is equal to 1, based on the 40 observed realizations, and choosing
z_1 = 0.8 and z_2 = 1.2 as the critical values?
Solution
The probability of committing a Type I error is equal to the size of the test:

π(σ_0²) = π(1) = 1 − P( z_1 ≤ χ²_40 ≤ z_2 )

Thus

π(σ_0²) = π(1) = 1 − P( z_1 ≤ χ²_40 ≤ z_2 ) = 1 − F_39(1.2) + F_39(0.8)

11 See p. 579.
Exercise 2
Make the same assumptions of the previous exercise and denote by F_n^{-1}(p) the
inverse of F_n(x). Change the critical value z_1 in such a way that the size of the
test becomes exactly equal to 5%.

Solution
Replace 0.8 with z_1 in the formula for the size of the test:

π(σ_0²) = 1 − F_39(1.2) + F_39(z_1)

You need to set z_1 in such a way that π(σ_0²) = 0.05. In other words, you need to
solve

0.05 = 1 − F_39(1.2) + F_39(z_1)

which is equivalent to