
MARKOV CHAINS AND THE ERGODIC THEOREM

CHAD CASAROTTO

Date: August 17, 2007.

Abstract. This paper explores the basics of discrete-time Markov chains and uses them to prove the Ergodic Theorem. Definitions and basic theorems will allow us to prove the Ergodic Theorem without any prior knowledge of Markov chains, although some familiarity with Markov chains will give the reader better insight into the intuitions behind the provided theorems. Even for those already familiar with Markov chains, the provided definitions are important for fixing the notation used throughout this paper.

Contents
1. Basic Definitions and Properties of Markov Chains
2. Stopping Times and the Strong Markov Property
3. Recurrence and Transience
4. Communication Classes and Recurrence
5. The Strong Law of Large Numbers and the Ergodic Theorem
References

1. Basic Definitions and Properties of Markov Chains


Markov chains often describe the movements of a system between various states.
In this paper, we will discuss discrete-time Markov chains, meaning that at each
step our system can either stay in the state it is in or change to another state. We
denote the random variable Xn as a sort of marker of what state our system is
in at step n. Xn can take the value of any i I, where each i is a state in the
state-space, I. States are usually just denoted as numbers and our state-space as a
countable set.
We will call = (Pi1 , i2 , . . .) = (i | i I) the probability distribution on Xn if:
i = P (Xn =P i) and iI i = 1. Also, a matrix P = {pij }, where i, j I, is called
stochastic if jI ij = 1, i I, i.e. every row of the matrix is a distribution.
Now we can define a Markov chain explicitly.
Definition 1.1. $(X_0, X_1, \ldots) = (X_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix $P$, shortened to Markov$(\lambda, P)$, if:
(i) $\lambda$ is the probability distribution on $X_0$;
(ii) given that $X_n = i$, $(p_{ij} \mid j \in I)$ is the probability distribution on $X_{n+1}$ and is independent of $X_k$, $0 \le k < n$, i.e. $P(X_{n+1} = j \mid X_n = i) = p_{ij}$.
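To make the definition concrete, here is a minimal simulation sketch in Python (not part of the original paper; the two-state chain and all parameter values are hypothetical): $X_0$ is drawn from $\lambda$, and each subsequent state is drawn from the row of $P$ indexed by the current state.

```python
import random

def simulate_markov(lam, P, n_steps, seed=0):
    """Draw X_0, ..., X_{n_steps} of a Markov(lam, P) chain on states 0..len(lam)-1."""
    rng = random.Random(seed)
    # X_0 is distributed according to the initial distribution lam.
    x = rng.choices(range(len(lam)), weights=lam)[0]
    path = [x]
    for _ in range(n_steps):
        # Given X_n = x, X_{n+1} is drawn from row x of P,
        # independently of the earlier states (the Markov property).
        x = rng.choices(range(len(P[x])), weights=P[x])[0]
        path.append(x)
    return path

# Hypothetical two-state example, I = {0, 1}; each row of P sums to 1.
lam = [1.0, 0.0]
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(simulate_markov(lam, P, 10))
```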


Theorem 1.2. $(X_n)_{0 \le n \le N}$ is Markov$(\lambda, P)$ if and only if
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_N = i_N) = \lambda_{i_0} p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{N-1} i_N}. \tag{1.3}$$

Proof. First, suppose $(X_n)_{0 \le n \le N}$ is Markov$(\lambda, P)$; thus
$$\begin{aligned}
P(X_0 = i_0, X_1 = i_1, \ldots, X_N = i_N)
&= P(X_0 = i_0) P(X_1 = i_1 \mid X_0 = i_0) \cdots P(X_N = i_N \mid X_0 = i_0, \ldots, X_{N-1} = i_{N-1}) \\
&= P(X_0 = i_0) P(X_1 = i_1 \mid X_0 = i_0) \cdots P(X_N = i_N \mid X_{N-1} = i_{N-1}) \\
&= \lambda_{i_0} p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{N-1} i_N}.
\end{aligned}$$
Now assume that (1.3) holds for $N$; summing both sides over $i_N \in I$ gives
$$\sum_{i_N \in I} P(X_0 = i_0, \ldots, X_N = i_N) = \sum_{i_N \in I} \lambda_{i_0} p_{i_0 i_1} \cdots p_{i_{N-1} i_N},$$
that is,
$$P(X_0 = i_0, \ldots, X_{N-1} = i_{N-1}) = \lambda_{i_0} p_{i_0 i_1} \cdots p_{i_{N-2} i_{N-1}}.$$
And now by induction, (1.3) holds for all $0 \le n \le N$. From the formula for conditional probability, namely that $P(A \mid B) = P(A \cap B)/P(B)$, we can show that
$$P(X_{N+1} = i_{N+1} \mid X_0 = i_0, \ldots, X_N = i_N) = \frac{P(X_0 = i_0, \ldots, X_{N+1} = i_{N+1})}{P(X_0 = i_0, \ldots, X_N = i_N)} = \frac{\lambda_{i_0} p_{i_0 i_1} \cdots p_{i_{N-1} i_N} p_{i_N i_{N+1}}}{\lambda_{i_0} p_{i_0 i_1} \cdots p_{i_{N-1} i_N}} = p_{i_N i_{N+1}}.$$
Thus, by definition, $(X_n)_{0 \le n \le N}$ is Markov$(\lambda, P)$. □
The next theorem emphasizes the memorylessness of Markov chains. In the formulation of this theorem, we use the idea of the unit mass at $i$, denoted $\delta_i = (\delta_{ij} \mid j \in I)$, where
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
Theorem 1.4. Let $(X_n)_{n \ge 0}$ be Markov$(\lambda, P)$. Then, given that $X_m = i$, $(X_l)_{l \ge m}$ is Markov$(\delta_i, P)$ and is independent of $X_k$, $0 \le k < m$.

Proof. Let the event $A = \{X_m = i_m, \ldots, X_n = i_n\}$ and let the event $B$ be any event determined by $X_0, \ldots, X_m$. To prove the theorem, we must show that
$$P(A \cap B \mid X_m = i) = \delta_{i i_m} p_{i_m i_{m+1}} \cdots p_{i_{n-1} i_n} P(B \mid X_m = i);$$
the result then follows from Theorem 1.2. First, let us consider any elementary event
$$B_k = \{X_0 = i_0, \ldots, X_m = i_m\}.$$
Here we show that
$$P(A \cap B_k \mid X_m = i) = \frac{\delta_{i i_m} p_{i_m i_{m+1}} \cdots p_{i_{n-1} i_n} P(B_k)}{P(X_m = i)},$$
which follows from Theorem 1.2 and the definition of conditional probability. Any event $B$ determined by $X_0, \ldots, X_m$ can be written as a disjoint union of elementary events, $B = \bigcup_k B_k$. Thus, we can prove our identity above by summing over the different $B_k$ for any given event. □

An additional idea that will be important later is conditioning on the initial state, $X_0$. We will let $P(A \mid X_0 = i) = P_i(A)$. Similarly, we will let $E(A \mid X_0 = i) = E_i(A)$.

2. Stopping Times and the Strong Markov Property


We start this section with the definition of a stopping time.

Definition 2.1. A random variable $T$ is called a stopping time if the event $\{T = n\}$ depends only on $X_0, \ldots, X_n$ for $n = 0, 1, 2, \ldots$.

An example of a stopping time is the first passage time
$$T_i = \inf\{n \ge 1 \mid X_n = i\},$$
where we define $\inf \emptyset = \infty$. This is a stopping time since $\{T_i = n\} = \{X_k \ne i \text{ for } 0 < k < n,\ X_n = i\}$. Now we will define an expansion of this idea that we will use later.
Definition 2.2. The $r$th passage time $T_i^{(r)}$ to state $i$ is defined recursively using the first passage time:
$$T_i^{(0)} = 0, \qquad T_i^{(1)} = T_i,$$
and, for $r = 1, 2, \ldots$,
$$T_i^{(r+1)} = \inf\{n \ge T_i^{(r)} + 1 \mid X_n = i\}.$$

This leads to the natural definition of the length of the $r$th excursion to $i$:
$$S_i^{(r)} = \begin{cases} T_i^{(r)} - T_i^{(r-1)} & \text{if } T_i^{(r-1)} < \infty, \\ 0 & \text{otherwise.} \end{cases}$$
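As an illustration (a sketch, not from the paper), the passage times and excursion lengths can be read off a finite sample path directly from these definitions; passage times that do not occur within the path simply do not appear, standing in for the value $\infty$.

```python
def passage_times(path, i):
    """Observed passage times T_i^(0) = 0, T_i^(1), T_i^(2), ... in a finite path.

    T_i^(r+1) is the first step n >= T_i^(r) + 1 with path[n] == i, so the
    successive visits to i at steps n >= 1 are exactly T_i^(1), T_i^(2), ...
    """
    return [0] + [n for n, state in enumerate(path) if n >= 1 and state == i]

def excursion_lengths(times):
    """Lengths S_i^(r) = T_i^(r) - T_i^(r-1) of the observed excursions to i."""
    return [t - s for s, t in zip(times, times[1:])]

path = [0, 1, 0, 0, 2, 0]           # a made-up path over states {0, 1, 2}
print(passage_times(path, 0))       # [0, 2, 3, 5]
print(excursion_lengths(passage_times(path, 0)))   # [2, 1, 2]
```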

The following theorem shows how the Markov property holds at stopping times.

Theorem 2.3. Let $T$ be a stopping time of $(X_n)_{n \ge 0}$, which is Markov$(\lambda, P)$. Then, given $T < \infty$ and $X_T = i$, $(X_l)_{l \ge T}$ is Markov$(\delta_i, P)$ and independent of $X_k$, $0 \le k < T$.

Proof. First, we already have that $(X_l)_{l \ge T}$ is Markov$(\delta_i, P)$ by Theorem 1.4, so we just need to show the independence condition. Let the event $A = \{X_T = i_0, \ldots, X_{T+n} = i_n\}$ and let the event $B$ be any event determined by $X_0, \ldots, X_T$. It is important to notice that the event $B \cap \{T = m\}$ is determined by $X_0, \ldots, X_m$. We get that
$$P(A \cap B \cap \{T = m\} \cap \{X_T = i\}) = P_i(X_0 = i_0, \ldots, X_n = i_n)\, P(B \cap \{T = m\} \cap \{X_T = i\}).$$
If we now sum over $m = 0, 1, 2, \ldots$ and divide each side by $P(T < \infty, X_T = i)$, using the definition of conditional probability, we obtain
$$P(A \cap B \mid T < \infty, X_T = i) = P_i(X_0 = i_0, \ldots, X_n = i_n)\, P(B \mid T < \infty, X_T = i),$$
which gives us the independence we desired. □



3. Recurrence and Transience


Definition 3.1. Let $(X_n)_{n \ge 0}$ be Markov with transition matrix $P$. We say that a state $i$ is recurrent if
$$P_i(X_n = i \text{ for infinitely many } n) = 1,$$
and we say that a state $i$ is transient if
$$P_i(X_n = i \text{ for infinitely many } n) = 0.$$
The following results allow us to show that any state is necessarily either recur-
rent or transient.
Lemma 3.2. For $r = 2, 3, \ldots$, given that $T_i^{(r-1)} < \infty$, $S_i^{(r)}$ is independent of $X_k$, $0 \le k \le T_i^{(r-1)}$, and
$$P(S_i^{(r)} = n \mid T_i^{(r-1)} < \infty) = P_i(T_i = n).$$

Proof. We can directly apply Theorem 2.3 with $T = T_i^{(r-1)}$ as the stopping time, since it is assured that $X_T = i$ when $T < \infty$. So, given that $T_i^{(r-1)} < \infty$, $(X_l)_{l \ge T}$ is Markov$(\delta_i, P)$ and independent of $X_k$, $0 \le k < T$, which is the independence we wanted. Moreover, we know
$$S_i^{(r)} = \inf\{n \ge 1 \mid X_{T+n} = i\},$$
so $S_i^{(r)}$ is the first passage time of $(X_{T+n})_{n \ge 0}$ to state $i$, giving us our desired equality. □
Definition 3.3. The idea of the number of visits to $i$, $V_i$, is intuitive and can be easily defined using the indicator function:
$$V_i = \sum_{n=0}^{\infty} 1_{\{X_n = i\}}.$$

A nice property of $V_i$ is that
$$E_i(V_i) = E_i\left(\sum_{n=0}^{\infty} 1_{\{X_n = i\}}\right) = \sum_{n=0}^{\infty} E_i(1_{\{X_n = i\}}) = \sum_{n=0}^{\infty} P_i(X_n = i).$$

Definition 3.4. Another intuitive and useful term is the return probability to $i$, defined as
$$f_i = P_i(T_i < \infty).$$

Lemma 3.5. $P_i(V_i > r) = (f_i)^r$ for $r = 0, 1, 2, \ldots$.

Proof. First, we know that our claim is necessarily true when $r = 0$. Thus, we can use induction and the fact that if $X_0 = i$ then $\{V_i > r\} = \{T_i^{(r)} < \infty\}$ to conclude that
$$\begin{aligned}
P_i(V_i > r+1) &= P_i(T_i^{(r+1)} < \infty) \\
&= P_i(T_i^{(r)} < \infty \text{ and } S_i^{(r+1)} < \infty) \\
&= P_i(S_i^{(r+1)} < \infty \mid T_i^{(r)} < \infty)\, P_i(T_i^{(r)} < \infty) \\
&= f_i (f_i)^r = (f_i)^{r+1},
\end{aligned}$$
using Lemma 3.2, so our claim is true for all $r$. □

Theorem 3.6. The following two cases hold and show that any state is either recurrent or transient:
(1) if $P_i(T_i < \infty) = 1$, then $i$ is recurrent and $\sum_{n=0}^{\infty} P_i(X_n = i) = \infty$;
(2) if $P_i(T_i < \infty) < 1$, then $i$ is transient and $\sum_{n=0}^{\infty} P_i(X_n = i) < \infty$.

Proof. If $P_i(T_i < \infty) = f_i = 1$, then by Lemma 3.5
$$P_i(V_i = \infty) = \lim_{r \to \infty} P_i(V_i > r) = \lim_{r \to \infty} 1^r = 1,$$
so $i$ is recurrent and
$$\sum_{n=0}^{\infty} P_i(X_n = i) = E_i(V_i) = \infty.$$
In the other case, $f_i = P_i(T_i < \infty) < 1$; then, using our fact about $V_i$,
$$\begin{aligned}
\sum_{n=0}^{\infty} P_i(X_n = i) = E_i(V_i) &= \sum_{n=1}^{\infty} n P_i(V_i = n) = \sum_{n=1}^{\infty} \sum_{r=0}^{n-1} P_i(V_i = n) \\
&= \sum_{r=0}^{\infty} \sum_{n=r+1}^{\infty} P_i(V_i = n) = \sum_{r=0}^{\infty} P_i(V_i > r) = \sum_{r=0}^{\infty} (f_i)^r = \frac{1}{1 - f_i} < \infty,
\end{aligned}$$
so $P_i(V_i = \infty) = 0$ and $i$ is transient. □
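To see Theorem 3.6 numerically, the following Monte Carlo sketch (not from the paper; the chain and all parameters are hypothetical) estimates $f_i$ and $E_i(V_i)$ for a transient state and compares the latter with the geometric-series value $1/(1 - f_i)$. Here state 1 is absorbing, so state 0 is transient with $f_0 = 0.5$.

```python
import random

def estimate_fi_and_EVi(P, i, trials=20_000, horizon=200, seed=0):
    """Estimate f_i = P_i(T_i < oo) and E_i(V_i) by simulation from X_0 = i.

    Paths are truncated at `horizon` steps, so returns later than that are
    treated as T_i = infinity; this is an approximation.
    """
    rng = random.Random(seed)
    returned_count = 0
    total_visits = 0
    for _ in range(trials):
        x, visits, returned = i, 1, False   # V_i counts the visit at step 0
        for _ in range(horizon):
            x = rng.choices(range(len(P)), weights=P[x])[0]
            if x == i:
                visits += 1
                returned = True
        returned_count += returned
        total_visits += visits
    return returned_count / trials, total_visits / trials

# Hypothetical chain: state 1 is absorbing, so state 0 is transient.
P = [[0.5, 0.5],
     [0.0, 1.0]]
f0, EV0 = estimate_fi_and_EVi(P, 0)
print(f0, EV0, 1 / (1 - f0))        # f0 ~ 0.5 and E_0(V_0) ~ 1/(1 - f0) = 2
```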

4. Communication Classes and Recurrence


Definition 4.1. State $i$ can send to state $j$, and we write $i \to j$, if
$$P_i(X_n = j \text{ for some } n \ge 0) > 0.$$
Also, $i$ communicates with $j$, and we write $i \leftrightarrow j$, if both $i \to j$ and $j \to i$.

Theorem 4.2. For distinct states $i, j \in I$, $i \to j$ if and only if $p_{i i_1} p_{i_1 i_2} \cdots p_{i_{n-1} j} > 0$ for some states $i_1, i_2, \ldots, i_{n-1}$. Also, $\leftrightarrow$ is an equivalence relation on $I$.

Proof. ($\Rightarrow$)
$$0 < P_i(X_n = j \text{ for some } n \ge 0) \le \sum_{n=0}^{\infty} P_i(X_n = j) = \sum_{n=0}^{\infty} \sum_{i_1, \ldots, i_{n-1}} p_{i i_1} p_{i_1 i_2} \cdots p_{i_{n-1} j}.$$
Thus, $p_{i i_1} p_{i_1 i_2} \cdots p_{i_{n-1} j} > 0$ for some states $i_1, i_2, \ldots, i_{n-1}$.
($\Leftarrow$) Take some $i_1, i_2, \ldots, i_{n-1}$ such that
$$0 < p_{i i_1} p_{i_1 i_2} \cdots p_{i_{n-1} j} \le P_i(X_n = j) \le P_i(X_n = j \text{ for some } n \ge 0).$$
Now it is clear from the proven inequality that $i \to j$ and $j \to k$ imply $i \to k$. Also, it is true that $i \to i$ for any state $i$ and that $i \leftrightarrow j$ if and only if $j \leftrightarrow i$. Thus, $\leftrightarrow$ is an equivalence relation on $I$. □

Definition 4.3. We say that $\leftrightarrow$ partitions $I$ into communication classes. Also, a Markov chain or transition matrix $P$ for which $I$ is a single communication class is called irreducible.
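Computationally, $i \to j$ is just reachability in the directed graph with an edge from $i$ to $j$ whenever $p_{ij} > 0$, so the communication classes are the strongly connected components of that graph. The sketch below (not part of the paper; the reducible three-state matrix is a hypothetical example) finds them by intersecting forward reachability sets.

```python
def communication_classes(P):
    """Partition states 0..len(P)-1 into communication classes.

    There is an edge i -> j iff p_ij > 0, and i <-> j iff each state
    reaches the other, so the classes are the strongly connected components.
    """
    n = len(P)

    def reachable(src):
        # Depth-first search along edges of positive probability.
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for v in range(n):
                if P[u][v] > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    reach = [reachable(i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            # i <-> j iff i reaches j and j reaches i.
            cls = {j for j in reach[i] if i in reach[j]}
            classes.append(sorted(cls))
            assigned |= cls
    return classes

# Hypothetical reducible example: the classes are {0, 1} and {2}.
P = [[0.5, 0.5, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 0.0, 1.0]]
print(communication_classes(P))     # [[0, 1], [2]]
```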
Theorem 4.4. Let $C$ be a communication class. Either all states in $C$ are recurrent or all are transient.

Proof. Take any distinct pair of states $i, j \in C$ and suppose that $i$ is transient. Then there exist $n, m \ge 0$ such that $P_i(X_n = j) > 0$ and $P_j(X_m = i) > 0$, and for all $r \ge 0$
$$P_i(X_{n+r+m} = i) \ge P_i(X_n = j)\, P_j(X_r = j)\, P_j(X_m = i).$$
This implies that
$$\sum_{r=0}^{\infty} P_j(X_r = j) \le \frac{1}{P_i(X_n = j)\, P_j(X_m = i)} \sum_{r=0}^{\infty} P_i(X_{n+r+m} = i) < \infty$$
by Theorem 3.6. So any arbitrary $j$ is transient, again by Theorem 3.6, so the whole of $C$ is transient. The only way for this not to be true is if all states in $C$ are recurrent. □
This theorem shows us that recurrence and transience are class properties, and we will refer to them as such in what follows.
Theorem 4.5. Suppose $P$ is irreducible and recurrent. Then for all $i \in I$ we have $P(T_i < \infty) = 1$.

Proof. By Theorem 2.3 we have
$$P(T_i < \infty) = \sum_{j \in I} P_j(T_i < \infty)\, P(X_0 = j),$$
so we only need to show $P_j(T_i < \infty) = 1$ for all $j \in I$. By the irreducibility of $P$, we can pick an $m$ such that $P_i(X_m = j) > 0$. From Theorem 3.6, we have
$$\begin{aligned}
1 &= P_i(X_n = i \text{ for infinitely many } n) \\
&= P_i(X_n = i \text{ for some } n \ge m+1) \\
&= \sum_{k \in I} P_i(X_n = i \text{ for some } n \ge m+1 \mid X_m = k)\, P_i(X_m = k) \\
&= \sum_{k \in I} P_k(T_i < \infty)\, P_i(X_m = k),
\end{aligned}$$
using Theorem 2.3 again. Since $\sum_{k \in I} P_i(X_m = k) = 1$, every $k$ with $P_i(X_m = k) > 0$ must satisfy $P_k(T_i < \infty) = 1$; in particular, since $P_i(X_m = j) > 0$, we have $P_j(T_i < \infty) = 1$. □

5. The Strong Law of Large Numbers and the Ergodic Theorem


The Strong Law of Large Numbers will be presented here without proof; an elementary proof, needing much more background, can be found in N. Etemadi's 1981 paper in the Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete.
Theorem 5.1 (Strong Law of Large Numbers). Let $Y_1, Y_2, \ldots$ be a sequence of independent, identically distributed, non-negative random variables with $E(Y_k) = \mu$. Then
$$P\left(\frac{Y_1 + \cdots + Y_n}{n} \to \mu \text{ as } n \to \infty\right) = 1.$$
Definition 5.2. We will expand on the idea of $V_i$ by defining
$$V_i(n) = \sum_{k=0}^{n-1} 1_{\{X_k = i\}},$$
representing the number of visits to $i$ before step $n$.

The Ergodic Theorem examines the long-run value of the ratio $V_i(n)/n$, which is the proportion of time spent in state $i$ before step $n$.
Theorem 5.3 (Ergodic Theorem). Let $P$ be irreducible and let $\lambda$ be any distribution. If $(X_n)_{n \ge 0}$ is Markov$(\lambda, P)$, then
$$P\left(\frac{V_i(n)}{n} \to \frac{1}{m_i} \text{ as } n \to \infty\right) = 1,$$
where $m_i = E_i(T_i)$, i.e. the expected return time to state $i$.
Proof. First, we will consider the case where $P$ is transient. We then know that, with probability 1, $V_i < \infty$, so
$$\frac{V_i(n)}{n} \le \frac{V_i}{n} \to 0 = \frac{1}{m_i},$$
since a transient state is never certain to be revisited and hence $m_i = \infty$.
Now, we will consider the case where $P$ is recurrent, and we will fix a state $i$. Let $T = T_i$; then $P(T < \infty) = 1$ by Theorem 4.5, and $(X_l)_{l \ge T}$ is Markov$(\delta_i, P)$ and independent of $X_k$, $0 \le k < T$, by Theorem 2.3. Since the long-run proportion of time spent in state $i$ is the same for $(X_l)_{l \ge T}$ as for $(X_n)_{n \ge 0}$, we need only consider the case where $\lambda = \delta_i$.
By Lemma 3.2, we know that the non-negative random variables $S_i^{(1)}, S_i^{(2)}, \ldots$ are independent and identically distributed with $E_i(S_i^{(r)}) = E_i(T_i) = m_i$. We know
$$T_i^{(V_i(n)-1)} = S_i^{(1)} + \cdots + S_i^{(V_i(n)-1)} \le n - 1,$$
where the left-hand side of the inequality is the step of the last visit to $i$ before step $n$. In addition,
$$T_i^{(V_i(n))} = S_i^{(1)} + \cdots + S_i^{(V_i(n))} \ge n,$$
where the left-hand side of the inequality is the step of the first visit to $i$ after step $n - 1$. So we can squeeze the value of $n/V_i(n)$ using the following inequality:
$$\frac{S_i^{(1)} + \cdots + S_i^{(V_i(n)-1)}}{V_i(n)} \le \frac{n}{V_i(n)} \le \frac{S_i^{(1)} + \cdots + S_i^{(V_i(n))}}{V_i(n)}. \tag{5.4}$$
Now we can use the Strong Law of Large Numbers to get
$$P\left(\frac{S_i^{(1)} + \cdots + S_i^{(n)}}{n} \to m_i \text{ as } n \to \infty\right) = 1,$$
and since $P$ is recurrent,
$$P(V_i(n) \to \infty \text{ as } n \to \infty) = 1.$$
Thus, if we let $n \to \infty$ in (5.4), we get
$$P\left(\frac{n}{V_i(n)} \to m_i \text{ as } n \to \infty\right) = 1,$$
which implies that
$$P\left(\frac{V_i(n)}{n} \to \frac{1}{m_i} \text{ as } n \to \infty\right) = 1. \qquad \square$$
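As a numerical sanity check (a simulation sketch under assumed parameters, not part of the paper), one can compare $V_i(n)/n$ along a long simulated path with $1/m_i$. For an irreducible two-state chain with $p_{01} = a$ and $p_{10} = b$, a short computation gives $m_0 = E_0(T_0) = 1 + a/b$: the chain leaves state 0 with probability $a$, and an excursion into state 1 lasts a geometric number of steps with mean $1/b$. So $V_0(n)/n$ should approach $b/(a+b)$.

```python
import random

def visit_fraction(P, i, n, seed=0):
    """Return V_i(n)/n, the proportion of steps 0..n-1 spent in state i,
    for a path started at X_0 = i."""
    rng = random.Random(seed)
    x, visits = i, 0
    for _ in range(n):
        visits += (x == i)
        x = rng.choices(range(len(P)), weights=P[x])[0]
    return visits / n

# Hypothetical two-state chain with p_01 = a and p_10 = b.
a, b = 0.2, 0.3
P = [[1 - a, a],
     [b, 1 - b]]
m0 = 1 + a / b                      # expected return time to state 0
print(visit_fraction(P, 0, 100_000), 1 / m0)   # both should be near 0.6
```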


References
[1] Norris, James R. Markov Chains. Cambridge University Press, 1998.
