1 ProbTools
measure-theoretic probability
Bent Nielsen
University of Oxford
Michaelmas 2024
Part 1
Introduction
Some textbooks
Some examples of textbooks on measure-theoretic probability:
Resnick (2019) and Shiryaev (1996): Many examples, focus
on probability, good for self-study.
Kolmogorov and Fomin (1970) and Dudley (2018):
Comprehensive books on real analysis and probability.
The books Lehmann and Casella (1998) and Lehmann and
Romano (2005) also contain crash-course style material like
the one in these slides with particular emphasis on
applications in statistics and econometrics.
Liese and Miescke (2008): The appendix has a collection of
results. We shall frequently use the notation and presentation
style from this book.
The recent book Axler (2021) has a lot of intuition and many
examples of why the theory is the way it is [why σ-algebras,
why Lebesgue integration and not Riemann?]. It may be
interesting for self-study. Available for free on Axler’s website.
Why the measure-theoretic approach?
Introduce
1 σ-algebras.
2 Measures and their properties.
3 Measurable functions.
4 The Lebesgue integral.
Linearity, monotonicity, dealing with null sets.
The Monotone and Dominated Convergence Theorems.
Markov, Hölder, Jensen, Minkowski inequalities.
Induced measures and the substitution rule.
Tonelli’s Theorem.
5 Densities and conditioning.
The primary examples will be probability spaces and random
variables.
σ-algebras
Sets.
Definition of σ-algebras.
Generated σ-algebras.
Borel sets on R.
Sets¹
Basic definitions.
A = {a, b, c, . . . } is a set with elements a, b, c, . . . .
A ⊂ B, that is A is a subset of B, if every element in A
belongs to B.
A = B ⇔ A ⊂ B and B ⊂ A.
∅ is the empty set, which contains no elements.
Operations on sets.
A ∪ B is the union of A and B, consisting of all elements that are in A or in B.
A ∩ B is the intersection of A and B, consisting of all elements that are in both A and B.
∪α Aα and ∩α Aα are unions and intersections. The index sets
can be finite or infinite in the countable or uncountable sense.
A, B are disjoint if A ∩ B = ∅.
Consider a subset A of a basic set R. Then, the complement
A^c is the set of elements of R that are not in A.
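As a small illustration (not part of the original slides), the set operations above can be tried out with Python's built-in `set` type. The basic set `R` and the sets `A`, `B` are arbitrary finite examples chosen for this sketch. The last line checks De Morgan's law, a standard identity added here for illustration.

```python
# Finite example sets; R plays the role of the basic set.
R = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

union = A | B            # elements in A or in B
intersection = A & B     # elements in both A and B
complement_A = R - A     # A^c relative to the basic set R
disjoint = A.isdisjoint(complement_A)  # A and A^c share no element

# De Morgan's law: (A ∪ B)^c = A^c ∩ B^c
de_morgan_holds = (R - (A | B)) == ((R - A) & (R - B))
```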
¹ Kolmogorov and Fomin (1970, pp. 1–4)
σ-algebra
Examples of σ-algebras
The systems
Some further stability properties of σ-algebras
is a σ-algebra in X .
Generated σ-algebras
Generated σ-algebra and Borel σ-algebra
Let a ≤ b.
Examples of Borel sets:
Half-open intervals: (a, b] = (−∞, b]\(−∞, a].
Open half-lines: (−∞, b) = ⋃_{n∈N} (−∞, b − 1/n].
Closed intervals: [a, b] = (−∞, b]\(−∞, a).
Open intervals: (a, b) = (−∞, b)\(−∞, a].
The above classes of sets could also be used to generate the Borel σ-algebra.
Further examples of Borel sets:
Points: {a} = [a, a].
Measures and their properties
Measures.
Examples.
Properties.
Measures
Definition 6
Let (X , A) be a measurable space. A measure µ on (X , A) is a
mapping µ : A → [0, ∞] satisfying the following two conditions.
1 µ(∅) = 0.
2 µ is countably additive, that is for every sequence (A_n)_{n∈N} of
disjoint subsets in A it holds that
\[ \mu\Big(\bigcup_{n\in\mathbb{N}} A_n\Big) = \sum_{n\in\mathbb{N}} \mu(A_n). \tag{1} \]
Note that the left-hand side in (1) is well-defined
since \(\bigcup_{n\in\mathbb{N}} A_n \in \mathcal{A}\) (by σ3) and the right-hand side is
well-defined as it is a sum of non-negative terms.
Some terminology
Examples of measures
Counting measure
That is, τ (A) < ∞ if and only if A has finitely many elements.
Fundamental properties of measures
Theorem 7
Let (X , A, µ) be a measure space.
1 µ is finitely additive, that is if A_1, ..., A_N is a finite collection of
disjoint sets in A, then \(\mu\big(\bigcup_{n=1}^{N} A_n\big) = \sum_{n=1}^{N} \mu(A_n)\).
2 If A, B ∈ A and A ⊆ B, then µ(A) ≤ µ(B).
3 If A, B ∈ A, A ⊆ B and µ(A) < ∞, then µ(B \ A) = µ(B) − µ(A).
4 For any sequence of sets (A_n)_{n∈N} in A it holds that
\[ \mu\Big(\bigcup_{n\in\mathbb{N}} A_n\Big) \le \sum_{n\in\mathbb{N}} \mu(A_n). \]
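The four properties in Theorem 7 can be checked by hand on a small discrete example. The sketch below is only an illustration: the point masses in `weights` and the sets `A`, `B`, `C`, `D` are hypothetical choices, and `mu` is the measure that sums the point masses of a set.

```python
from fractions import Fraction

# Hypothetical point masses on X = {0, 1, 2, 3}; Fraction keeps the arithmetic exact.
weights = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

def mu(A):
    """Measure of a subset A of X: the sum of the point masses in A."""
    return sum((weights[x] for x in A), Fraction(0))

A, B = {0, 1}, {0, 1, 2}     # A is a subset of B
C, D = {1, 2}, {2, 3}        # overlapping sets, used for subadditivity

finite_additivity = mu({0} | {1, 2}) == mu({0}) + mu({1, 2})   # part 1 (disjoint sets)
monotonicity = mu(A) <= mu(B)                                  # part 2
difference = mu(B - A) == mu(B) - mu(A)                        # part 3
subadditivity = mu(C | D) <= mu(C) + mu(D)                     # part 4
```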
Measurable functions
Functions²
Functions.
Let X and Y be arbitrary sets. A rule associating a unique
element y = f (x) ∈ Y with each element x ∈ X is said to be
a function f on X .
X is the domain of f .
Y = {f (x) for x ∈ X } is the range of f .
If X ⊄ R then f is sometimes said to be a mapping. Then f
maps X into Y .
Images and preimages.
If x ∈ X then y = f (x) is the image of x.
Every x ∈ X with y ∈ Y as its image is called a preimage of y .
The set of x ∈ X whose images belong to a set B ⊂ Y is the
preimage of B, denoted f⁻¹(B).
If no y ∈ B has a preimage, then f⁻¹(B) = ∅.
Example: Define the function f : R → R via f(x) = x². Then
f⁻¹([1, 4]) = {x ∈ R : x² ∈ [1, 4]} = [−2, −1] ∪ [1, 2].
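The preimage in this example can be sanity-checked numerically. The sketch below compares membership in the preimage of [1, 4] under x ↦ x² with membership in [−2, −1] ∪ [1, 2] on a finite grid. The grid and the helper names are choices made for this illustration, and a grid check is of course not a proof.

```python
def f(x):
    return x ** 2

def in_preimage(x, lo=1.0, hi=4.0):
    """True if x lies in the preimage of [lo, hi], i.e. f(x) is in [lo, hi]."""
    return lo <= f(x) <= hi

def in_claimed_set(x):
    """True if x lies in [-2, -1] ∪ [1, 2]."""
    return -2 <= x <= -1 or 1 <= x <= 2

# Grid on [-3, 3] with step 1/8 (exactly representable in floating point).
grid = [k / 8 for k in range(-24, 25)]
agree = all(in_preimage(x) == in_claimed_set(x) for x in grid)
```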
² Kolmogorov and Fomin (1970, pp. 4–6, 44, 87)
Some results for images and preimages.
f⁻¹(A ∩ B) = f⁻¹(A) ∩ f⁻¹(B).
f⁻¹(A ∪ B) = f⁻¹(A) ∪ f⁻¹(B).
f(A ∪ B) = f(A) ∪ f(B).
In general, f(A ∩ B) ≠ f(A) ∩ f(B).
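A finite example makes the asymmetry between images and preimages concrete. In the sketch below, the function x ↦ x², the domain `X` and the sets `A`, `B`, `U`, `V` are hypothetical choices for illustration. The function is not injective, which is what allows the image of an intersection to be strictly smaller than the intersection of the images.

```python
def f(x):
    return x * x  # not injective: f(-1) == f(1)

def image(S):
    return {f(x) for x in S}

def preimage(S, domain):
    return {x for x in domain if f(x) in S}

X = {-2, -1, 0, 1, 2}            # finite domain for illustration
A, B = {-2, -1, 0}, {0, 1, 2}    # subsets of the domain
U, V = {0, 1}, {1, 4}            # subsets of the codomain

pre_cap = preimage(U & V, X) == preimage(U, X) & preimage(V, X)
pre_cup = preimage(U | V, X) == preimage(U, X) | preimage(V, X)
img_cup = image(A | B) == image(A) | image(B)
# Intersections can fail: f(A ∩ B) = {0}, but f(A) ∩ f(B) = {0, 1, 4}.
img_cap_differs = image(A & B) != image(A) & image(B)
```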
Continuity.
Let the real function f map X ⊂ R into Y ⊂ R.
f is continuous at the point x0 if, ∀ϵ > 0, ∃δ > 0 such that
|f (x) − f (x0 )| < ϵ whenever |x − x0 | < δ.
f is continuous on X if it is continuous at all x0 ∈ X .
Theorem: f is continuous on X if and only if the preimage
f⁻¹(B) of any open set B ⊂ Y is open (in X).
These considerations generalize to metric spaces and to
topological spaces.
Measurable functions
Probability spaces and random variables
Let us put what we have introduced so far into a probabilistic
context.
Definition 9 (Probability space)
A probability space is a triple (Ω, F, P), where Ω is a non-empty
set, F is a σ-algebra in Ω and P is a probability measure on F.
The sets in F are called events and for A ∈ F we
interpret P(A) ∈ [0, 1] as the probability of the event A occurring.
Continuity implies measurability
Non-continuous measurable functions
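Let A ∈ A. Clearly, for any B ⊆ R
\[ 1_A^{-1}(B) = \begin{cases} X & \text{if } 0, 1 \in B, \\ A & \text{if } 1 \in B \text{ and } 0 \in B^c, \\ A^c & \text{if } 0 \in B \text{ and } 1 \in B^c, \\ \emptyset & \text{if } 0, 1 \in B^c. \end{cases} \]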
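The four cases for the preimage of a set under an indicator function can be verified on a finite example. The ground set `X`, the set `A` and the test sets passed to `preimage` are hypothetical choices for this sketch. The value 7 is just an arbitrary point different from 0 and 1.

```python
X = set(range(6))        # hypothetical ground set
A = {1, 3, 5}            # the set whose indicator function is studied

def indicator(x):
    return 1 if x in A else 0

def preimage(B):
    """Preimage of B under the indicator function of A."""
    return {x for x in X if indicator(x) in B}

cases = {
    "0 and 1 in B": preimage({0, 1, 7}) == X,
    "only 1 in B": preimage({1, 7}) == A,
    "only 0 in B": preimage({0, 7}) == X - A,
    "neither in B": preimage({7}) == set(),
}
```

Each of the four possible preimages (X, A, the complement of A, and the empty set) belongs to any σ-algebra containing A, which is the reason the indicator of a measurable set is a measurable function.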
Stability properties of measurable functions
Let M(A) = {f : X → R : f is A-B(R)-measurable}.
In the exercises you will show:
Theorem 12 (Stability properties of measurable functions)
1 If f1, . . . , fd : X → R are elements of M(A) and φ : R^d → R
is B(R^d)-B(R)-measurable, then
φ(f1, . . . , fd) : X → R
cf, f + g, f · g, f ∧ g, f ∨ g
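Let \(\overline{\mathbb{R}} = \{-\infty\} \cup \mathbb{R} \cup \{\infty\}\) be the extended real line. Define
\[ \overline{\mathcal{M}}(\mathcal{A}) = \{ f : X \to \overline{\mathbb{R}} : f \text{ is } \mathcal{A}\text{-}\mathcal{B}(\overline{\mathbb{R}})\text{-measurable} \}, \]
\[ \overline{\mathcal{M}}(\mathcal{A})^+ = \{ f \in \overline{\mathcal{M}}(\mathcal{A}) : f(x) \ge 0 \text{ for all } x \in X \}. \]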
Lebesgue integrals
Definitions.
Linearity.
Monotone Convergence Theorem.
Dominated Convergence Theorem.
Inequalities.
Induced measure and substitution.
Product measures and double integrals.
The Lebesgue integral
We shall now introduce the Lebesgue integral and study its main
properties. It has several advantages over the Riemann integral.
1 It is defined for a broader class of functions.
2 It is much more stable under pointwise limits of sequences of
functions. That is, pointwise limits and integration can often
be interchanged. For the Riemann integral we typically need
uniform convergence.
3 The Lebesgue integral is easily defined for functions on an
arbitrary measure space (X , A, µ). The Riemann integral is
defined for functions on R. This is important in probability
theory where the random variables are defined on a probability
space (Ω, F, P) and expected values are Lebesgue integrals
\[ E X = \int X(\omega)\, P(d\omega). \]
The Lebesgue integral
Given a measure space (X, A, µ), a_i ∈ R
and A_i ∈ A, i = 1, . . . , n, we call s : X → R defined via
\[ s(x) = \sum_{i=1}^{n} a_i 1_{A_i}(x) \]
a simple function.
Denote by SM(A) and SM(A)+ , respectively, the set of
simple and non-negative simple functions.
For s ∈ SM(A)+ one defines
\[ \int s\, d\mu = \sum_{i=1}^{n} a_i \mu(A_i) \in [0, \infty], \]
f = f⁺ − f⁻ and |f| = f⁺ + f⁻.
³ Theorem 12 is for elements of \(\mathcal{M}(\mathcal{A})\) rather than \(\overline{\mathcal{M}}(\mathcal{A})\).
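The integral of a simple function and the decomposition of f into its positive and negative parts can be sketched on a finite measure space. Everything concrete below is an assumption made for illustration: the point masses in `weights`, the two simple functions, and the convention (standard, but not visible in the surviving text) that the integral of f is the integral of f⁺ minus the integral of f⁻ whenever that difference is defined.

```python
# Hypothetical finite measure space: X = {0, ..., 5} with mu({x}) = weights[x].
weights = [0.5, 1.0, 0.25, 2.0, 0.0, 1.5]

def mu(A):
    return sum(weights[x] for x in A)

# Simple functions s = sum_i a_i * 1_{A_i}, stored as (a_i, A_i) pairs.
s_pos = [(2.0, {0, 1}), (3.0, {3})]     # plays the role of f^+
s_neg = [(1.0, {2, 5})]                 # plays the role of f^- (disjoint support)

def integral(simple):
    """Integral of a non-negative simple function: sum_i a_i * mu(A_i)."""
    return sum(a * mu(A) for a, A in simple)

int_pos = integral(s_pos)     # 2 * (0.5 + 1.0) + 3 * 2.0
int_neg = integral(s_neg)     # 1 * (0.25 + 1.5)
int_f = int_pos - int_neg     # integral of f = f^+ - f^-
int_abs = int_pos + int_neg   # integral of |f| = f^+ + f^-
```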
Definition 15 (L(µ) and L1 (µ))
For a measure space (X , A, µ) we define
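\[ L(\mu) := \Big\{ f \in \overline{\mathcal{M}}(\mathcal{A}) : \int f^+ d\mu \wedge \int f^- d\mu < \infty \Big\} \]
\[ L^1(\mu) := \Big\{ f \in \overline{\mathcal{M}}(\mathcal{A}) : \int f^+ d\mu \vee \int f^- d\mu < \infty \Big\} \]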
µ-a.e. and “almost surely”
It turns out that the µ-integral “does not care about null sets”.
If f, g ∈ L(µ) and µ(f ≠ g) = 0, then
\[ \int f\, d\mu = \int g\, d\mu. \]
Definition 17
A subset N of X is called a µ-null set if there exists an A ∈ A such
that
N ⊆ A and µ(A) = 0.
Consider (X, A, µ). We say that a property holds for µ-almost
all x ∈ X if the property holds for all x ∈ X\N, where either
N is a µ-null set [common in mathematics], or
N ∈ A and µ(N) = 0 [common in probability].
We also say the property holds µ-almost everywhere (a.e.).
In probability, we say almost surely (a.s.) or with probability
one.
Examples:
If µ({x ∈ X : f(x) ≠ g(x)}) = 0 we say that f = g µ-almost
everywhere, or for µ-almost every x or µ-a.e.
If µ({x ∈ X : lim_{n→∞} f_n(x) does not exist}) = 0, we say that f_n
converges for µ-almost every x.
Linearity over L1 (µ)
Theorem 18 (Linearity and other properties of the integral)
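4 If f ≥ 0 µ-a.e. then
\[ \int f\, d\mu = 0 \iff f = 0 \ \ \mu\text{-a.e.} \]
and
\[ \int f\, d\mu < \infty \implies f < \infty \ \ \mu\text{-a.e.} \]
5 \(\big| \int f\, d\mu \big| \le \int |f|\, d\mu\).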
Interchanging limits and integration
Monotone Convergence Theorem
Prior to illustrating the Monotone Convergence Theorem, let
us note that in case (Ω, F, P) is a probability space upon
which a random variable X with values in (R, B(R)) is
defined, then we write
\[ E X := \int_{\Omega} X(\omega)\, P(d\omega) = \int_{\Omega} X\, dP, \qquad X \in L(P). \]
Example: AR(1)
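Consider first \(\sum_{i=0}^{\infty} |\alpha|^i |\varepsilon_{t-i}|\).
Clearly, \(\lim_{N\to\infty} \sum_{i=0}^{N} |\alpha|^i |\varepsilon_{t-i}|\) exists in [0, ∞].
Since \(\sum_{i=0}^{N} |\alpha|^i |\varepsilon_{t-i}|\) is \(\mathcal{F}\text{-}\mathcal{B}(\mathbb{R})\)-measurable (by Theorem 12)
and using that limits preserve measurability (by Theorem 13),
we conclude that \(\lim_{N\to\infty} \sum_{i=0}^{N} |\alpha|^i |\varepsilon_{t-i}|\)
is \(\mathcal{F}\text{-}\mathcal{B}(\overline{\mathbb{R}})\)-measurable.
Since \(\sum_{i=0}^{N} |\alpha|^i |\varepsilon_{t-i}| \uparrow \sum_{i=0}^{\infty} |\alpha|^i |\varepsilon_{t-i}|\), the Monotone
Convergence Theorem (and linearity of the integral) yields
that
\[ E \sum_{i=0}^{\infty} |\alpha|^i |\varepsilon_{t-i}| = \lim_{N\to\infty} \sum_{i=0}^{N} |\alpha|^i E|\varepsilon_{t-i}| \le C/(1 - |\alpha|) < \infty. \]
Thus, \(\sum_{i=0}^{\infty} |\alpha|^i |\varepsilon_{t-i}| < \infty\) P-a.s. [cf. Theorem 18, part 4]
Since \(\sum_{i=0}^{\infty} \alpha^i \varepsilon_{t-i}\) converges absolutely P-a.s., it also
converges P-a.s.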
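The convergence argument can be illustrated by simulating one realisation of the partial sums. This is a sketch under stated assumptions, not part of the original example: the value of `alpha`, the truncation point `N`, and the choice of innovations uniform on [−1, 1] are all hypothetical. With these innovations |ε| ≤ 1, so the bound holds with C = 1 and even holds along every path, which is stronger than the bound on the expectation used above.

```python
import random

random.seed(0)
alpha = 0.8                      # assumed to satisfy |alpha| < 1
N = 200
# Hypothetical innovations, uniform on [-1, 1]; eps[i] stands for eps_{t-i}.
eps = [random.uniform(-1.0, 1.0) for _ in range(N + 1)]

abs_partial, signed_partial = [], []
abs_sum = signed_sum = 0.0
for i in range(N + 1):
    abs_sum += abs(alpha) ** i * abs(eps[i])     # partial sum of |alpha|^i |eps_{t-i}|
    signed_sum += alpha ** i * eps[i]            # partial sum of alpha^i eps_{t-i}
    abs_partial.append(abs_sum)
    signed_partial.append(signed_sum)

bound = 1.0 / (1.0 - abs(alpha))   # C / (1 - |alpha|) with C = 1
```

The list `abs_partial` is non-decreasing and stays below `bound`, mirroring the monotone limit in the argument, while `signed_partial` settles down because its increments are dominated by those of `abs_partial`.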
Lebesgue’s Dominated Convergence Theorem
Example
Consider the measure space (R, B(R), λ1 ) and let f ∈ L1 (λ1 ). We
show that
\[ \lim_{n\to\infty} \frac{1}{2n} \int_{[-n,n]} f(x)\, \lambda_1(dx) = \lim_{n\to\infty} \int \frac{1}{2n} f(x) 1_{[-n,n]}(x)\, \lambda_1(dx) = 0. \tag{2} \]
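For one concrete integrable function the limit in (2) can be checked numerically. The choice f(x) = e^{−|x|} is an assumption made for this sketch. For it, the averaged integral has the closed form (1 − e^{−n})/n, which the midpoint rule below reproduces.

```python
import math

def f(x):
    return math.exp(-abs(x))     # an integrable function on R (our choice)

def scaled_integral(n, steps=20000):
    """Midpoint-rule approximation of (1/(2n)) * integral of f over [-n, n]."""
    h = 2.0 * n / steps
    total = sum(f(-n + (k + 0.5) * h) for k in range(steps))
    return total * h / (2.0 * n)

ns = (1, 10, 100)
values = [scaled_integral(n) for n in ns]
exact = [(1.0 - math.exp(-n)) / n for n in ns]
```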
By the mean value theorem we get
Relationship to Riemann integral
A useful consequence
and so
\[ \int_0^{\infty} x e^{-x}\, \lambda_1(dx) = \lim_{n\to\infty} \int_0^{n} x e^{-x}\, dx = \lim_{n\to\infty} \big( 1 - (n+1) e^{-n} \big) = 1. \]
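The closed form of the truncated Riemann integrals and their convergence to 1 can be confirmed numerically. The sketch below is an illustration only. The helper names and the midpoint rule used as an independent check are choices made here.

```python
import math

def truncated_integral(n):
    """Closed form of the Riemann integral of x * exp(-x) over [0, n]."""
    return 1.0 - (n + 1) * math.exp(-n)

def midpoint_rule(n, steps=100000):
    """Independent numerical check of the same integral by the midpoint rule."""
    h = n / steps
    return h * sum((k + 0.5) * h * math.exp(-(k + 0.5) * h) for k in range(steps))

closed = [truncated_integral(n) for n in (1, 5, 20, 50)]
numeric = midpoint_rule(20)
```

The values in `closed` increase towards 1 as n grows, as expected for a non-negative integrand.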
Improper integrals
Regularity conditions: Example
Jensen’s inequality
−g (E X ) ≤ E[−g (X )] ⇐⇒ E g (X ) ≤ g (E X ).
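The equivalence above reads naturally when g is concave, so that −g is convex. A small numerical illustration follows, with a hypothetical discrete random variable and g taken to be the square root. Both are assumptions of this sketch.

```python
import math

# Hypothetical discrete random variable: values and their probabilities.
values = [1.0, 4.0, 9.0, 16.0]
probs = [0.1, 0.2, 0.3, 0.4]

def expect(h):
    """E h(X) for the discrete distribution above."""
    return sum(p * h(x) for x, p in zip(values, probs))

mean = expect(lambda x: x)
# Concave g = sqrt: E g(X) <= g(E X).
concave_ok = expect(math.sqrt) <= math.sqrt(mean)
# Convex -g: -g(E X) <= E[-g(X)], the equivalent form of the same inequality.
convex_ok = -math.sqrt(mean) <= expect(lambda x: -math.sqrt(x))
```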
⁴ Here, for any \(x \in \mathbb{R}^m\), \(\|x\| = \sqrt{\sum_{i=1}^{m} x_i^2}\) denotes the Euclidean norm.
Induced measure and substitution rule
Let (X , A) and (Y, B) be measurable spaces and T : X → Y
be A-B-measurable.
If µ is a measure on A, then µ ∘ T⁻¹, defined by (µ ∘ T⁻¹)(B) = µ(T⁻¹(B)) for B ∈ B, is a measure on B, called the induced measure.
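For a measure with finitely many point masses, the induced measure µ ∘ T⁻¹ assigns to a set B the µ-mass of the preimage T⁻¹(B). The sketch below uses a hypothetical uniform measure on {1, ..., 6} and the map T(x) = x mod 3. Both are illustrative assumptions. It computes the induced measure in two ways that should agree.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: uniform measure on X = {1, ..., 6} and T(x) = x mod 3.
mu = {x: Fraction(1, 6) for x in range(1, 7)}

def T(x):
    return x % 3

# Point masses of the induced measure: (mu ∘ T^{-1})({y}) = mu({x : T(x) = y}).
induced = defaultdict(Fraction)
for x, mass in mu.items():
    induced[T(x)] += mass

def induced_measure(B):
    """(mu ∘ T^{-1})(B) computed from the point masses of the induced measure."""
    return sum((induced[y] for y in B if y in induced), Fraction(0))

def via_preimage(B):
    """mu(T^{-1}(B)) computed directly from the preimage."""
    return sum((mu[x] for x in mu if T(x) in B), Fraction(0))
```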
Product measures and Tonelli’s Theorem
P_{X,Y}(C) = P ∘ (X, Y)⁻¹(C), C ∈ B(R²)
P_{X,Y} = P_X ⊗ P_Y,
where P_X = P ∘ X⁻¹ and P_Y = P ∘ Y⁻¹.
Integration with respect to product measures
Absolute continuity and domination
Densities and conditioning
Radon-Nikodym
Theorem 28 (Radon-Nikodym)
Example: Normal distribution
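\[ f_{\eta,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\eta)^2/(2\sigma^2)}, \qquad x \in \mathbb{R}, \]
with respect to the Lebesgue measure λ1.
That is,
\[ N(\eta, \sigma^2)(A) = \int_A f_{\eta,\sigma^2}(x)\, \lambda_1(dx), \qquad A \in \mathcal{B}(\mathbb{R}). \]
\[ f_{\lambda}(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x \in \mathbb{N}_0, \]
with respect to the counting measure τ.
That is,
\[ \mathrm{Poi}(\lambda)(A) = \int_A f_{\lambda}(x)\, \tau(dx) = \sum_{x \in A} \frac{e^{-\lambda} \lambda^x}{x!}, \qquad A \in \mathcal{P}(\mathbb{N}_0). \]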
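Both displays say that the probability of a set is obtained by integrating the density over that set: a sum for the counting measure, an ordinary integral for the Lebesgue measure. A minimal sketch follows. The function names are hypothetical, and the normal case is restricted to intervals, where the integral has a closed form in terms of the error function.

```python
import math

def poisson_measure(A, lam):
    """Poi(lam)(A): sum of the density e^{-lam} lam^x / x! over the points x in A."""
    return sum(math.exp(-lam) * lam ** x / math.factorial(x) for x in A)

def normal_measure_interval(a, b, eta, sigma2):
    """N(eta, sigma2) of the interval (a, b], via the error function."""
    s = math.sqrt(2.0 * sigma2)
    return 0.5 * (math.erf((b - eta) / s) - math.erf((a - eta) / s))

p_small = poisson_measure({0, 1, 2}, lam=2.0)                # equals 5 * e^{-2}
p_total = poisson_measure(range(100), lam=2.0)               # close to 1
p_normal = normal_measure_interval(-1.96, 1.96, 0.0, 1.0)    # close to 0.95
```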
Conditional expectations
\[ \sigma(T) = \{ T^{-1}(B) : B \in \mathcal{T} \}. \]
Let us characterize the function φ. By the definition of a
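conditional expectation it satisfies⁵
\[ \int_{T^{-1}(B)} \varphi(T(\omega))\, P(d\omega) = \int_{T^{-1}(B)} X(\omega)\, P(d\omega) \quad \text{for all } B \in \mathcal{T}, \]
where P_T = P ∘ T⁻¹.
A \(\mathcal{T}\text{-}\mathcal{B}(\mathbb{R})\)-measurable function φ satisfying the above two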
displays is also called a conditional expectation of X given T = t.
One often uses the notation E(X |T = t) := φ(t) for any such
function φ.
⁵ Observe that a typical element of σ(T) is of the form T⁻¹(B) for \(B \in \mathcal{T}\).
Stochastic kernels and conditional distributions
Stochastic kernels
Conditional distribution
Definition 31 (Conditional distribution)
Let X and Y be random variables on the probability space
(Ω, F, P) with values in the measurable spaces (X , A) and (Y, B),
respectively⁶. The kernel K : B × X → [0, 1] is called a regular
conditional distribution of Y given X if
\[ P(X \in A, Y \in B) = \int_A K(B, x)\, P_X(dx) \quad \text{for all } A \in \mathcal{A},\ B \in \mathcal{B}. \]
Observe that
\[ P(Y \in B) = P(X \in \mathcal{X}, Y \in B) = \int_{\mathcal{X}} K(B, x)\, P_X(dx). \]
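On a finite state space the defining identity of a conditional distribution reduces to finite sums and can be checked exactly. The joint distribution below is a hypothetical example, and the kernel `K` is built by the elementary formula P(Y ∈ B | X = x), which is available here because every value of X has positive probability.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (X, Y) on {0, 1} x {0, 1, 2}.
joint = {(0, 0): F(1, 8), (0, 1): F(1, 8), (0, 2): F(1, 4),
         (1, 0): F(1, 4), (1, 1): F(1, 8), (1, 2): F(1, 8)}

def P_X(x):
    """Marginal distribution of X at the point x."""
    return sum(p for (u, _), p in joint.items() if u == x)

def K(B, x):
    """Kernel K(B, x) = P(Y in B | X = x); all P_X(x) > 0 in this example."""
    return sum(p for (u, y), p in joint.items() if u == x and y in B) / P_X(x)

def lhs(A, B):
    """P(X in A, Y in B) read off the joint distribution."""
    return sum(p for (x, y), p in joint.items() if x in A and y in B)

def rhs(A, B):
    """Integral of K(B, x) over A with respect to P_X (a finite sum here)."""
    return sum(K(B, x) * P_X(x) for x in A)
```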
Finding a conditional distribution via densities
PX (A) = PX ,Y (A × Y) = 0,
Define
\[ f_X(x) := \frac{dP_X}{d\mu}(x) = \int f_{X,Y}(x, y)\, \nu(dy) \]
and
\[ f_Y(y) := \frac{dP_Y}{d\nu}(y) = \int f_{X,Y}(x, y)\, \mu(dx), \]
which are called the marginal densities.
You will show the two equalities (that are not definitions) in
the exercises.
Definition 32 (Conditional distribution via densities)
The function
\[ f_{Y|X}(y|x) = \begin{cases} \dfrac{f_{X,Y}(x, y)}{f_X(x)} & \text{if } f_X(x) > 0, \\ f_Y(y) & \text{if } f_X(x) = 0 \end{cases} \]
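The two-case definition can be sketched for densities with respect to counting measures, where the integrals defining the marginals become sums. The joint density `f_xy` below is a hypothetical example in which the point x = 2 has marginal mass zero, so that the second case of the definition is actually used.

```python
def conditional_density(f_xy, xs, ys):
    """Return f_{Y|X} as a dict {(y, x): value} for a joint density on xs x ys."""
    f_x = {x: sum(f_xy[(x, y)] for y in ys) for x in xs}   # marginal density of X
    f_y = {y: sum(f_xy[(x, y)] for x in xs) for y in ys}   # marginal density of Y
    cond = {}
    for x in xs:
        for y in ys:
            if f_x[x] > 0:
                cond[(y, x)] = f_xy[(x, y)] / f_x[x]
            else:
                cond[(y, x)] = f_y[y]      # second case, where f_X(x) = 0
    return cond

xs, ys = [0, 1, 2], [0, 1]
# Hypothetical joint density w.r.t. counting measure; x = 2 has marginal mass 0.
f_xy = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.25, (1, 1): 0.25,
        (2, 0): 0.0, (2, 1): 0.0}
cond = conditional_density(f_xy, xs, ys)
```

For every x, including the one with f_X(x) = 0, the values `cond[(y, x)]` sum to one over y, so each x yields a probability density in y.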
Summarizing advantages of the measure theoretic approach
Dirichlet’s function
⁷ Assume without loss of generality that x_1 < . . . < x_n such that there is a
strictly positive distance r between the x_i. Hence, in any partition of [0, 1]
consisting of intervals of length at most r/2, at most n elements contain an x_i.
Taking the infimum over such partitions one sees that the upper Riemann
integral is 0 [just like the lower Riemann integral clearly is].
But since D is not Riemann-integrable we see that pointwise
limits do not preserve Riemann-integrability and it does not
make sense to write
\[ \lim_{n\to\infty} \int_0^1 f_n(x)\, dx = \int_0^1 D(x)\, dx. \]
⁸ Alternatively, we can use the Dominated Convergence Theorem as stated
in Theorem 20.
References
Axler, S. (2021): Measure, integration & real analysis, Springer.
Bartle, R. G. and D. R. Sherbert (2011): Introduction to
Real Analysis, Wiley, 4th ed.
Dudley, R. M. (2018): Real analysis and probability, CRC Press.
Hoffmann-Jørgensen, J. (1994): Probability with a view
toward Statistics, vol. 1, Chapman & Hall.
Kolmogorov, A. N. and S. V. Fomin (1970): Introductory
Real Analysis, Dover.
Lehmann, E. and G. Casella (1998): Theory of Point
Estimation, Springer.
Lehmann, E. and J. Romano (2005): Testing Statistical
Hypotheses, Springer.
Liese, F. and K.-J. Miescke (2008): Statistical Decision
Theory, Springer.
Resnick, S. (2019): A probability path, Springer.
Shiryaev, A. N. (1996): Probability, Springer, 2nd ed.