Stochastic Processes SM
1.1 Introduction
1. 0 ≤ P(E) ≤ 1
2. P(S) = 1
3. For mutually exclusive events E1, E2, . . ., P(∪_i Ei) = Σ_i P(Ei)
Two useful consequences:
1 = P(S) = P(E ∪ E^c) = P(E) + P(E^c)
P(E ∪ F) = P(E) + P(F) − P(EF)
1.5 Independent Events
For an infinite sequence of independent experiments, P{at least one success} = 1 − Π_{i≥1} P(Ei^c), where Ei, i ≥ 1, denotes the event that the ith experiment was a success.
2 Random Variables
I = { 1, if the lifetime of the battery is 2 or more years;  0, otherwise }
F (b) = P {X ≤ b}
1. F(b) is a nondecreasing function of b
2. lim_{b→∞} F(b) = 1
3. lim_{b→−∞} F(b) = 0
p(a) = P{X = a}
p(a) is positive for at most a countable number of values of a; if X must assume one of the values x1 , x2 , . . ., then
p(xi ) > 0, i = 1, 2, . . .
p(x) = 0, all other values of x
Σ_{i=1}^{∞} p(xi) = 1
F(a) = Σ_{xi ≤ a} p(xi)
p(0) = P{X = 0} = 1 − p
p(1) = P{X = 1} = p
2.2.3 The Geometric Random Variable
p(n) = P{X = n} = (1 − p)^{n−1} p,  for n = 1, 2, . . .
Σ_{n=1}^{∞} p(n) = p Σ_{n=1}^{∞} (1 − p)^{n−1} = 1
2.2.4 The Poisson Random Variable
p(i) = P{X = i} = e^{−λ} λ^i / i!,  for i = 0, 1, . . .
This is a valid pmf since
Σ_{i=0}^{∞} p(i) = e^{−λ} Σ_{i=0}^{∞} λ^i/i! = e^{−λ} e^{λ} = 1
(power series for e^{λ})
A Poisson random variable may approximate a binomial random variable when n is large and p is small. Suppose X is a binomial random variable with parameters (n, p) and let λ = np. Then
P{X = i} = n!/((n − i)! i!) p^i (1 − p)^{n−i} = n!/((n − i)! i!) (λ/n)^i (1 − λ/n)^{n−i} = [n(n − 1)···(n − i + 1)/n^i] (λ^i/i!) (1 − λ/n)^n / (1 − λ/n)^i
• (1 − λ/n)^n = ((1 + (−λ)/n)^{(−n/λ)})^{(−λ)} ≈ e^{−λ}
• n(n − 1)···(n − i + 1)/n^i ≈ 1
• (1 − λ/n)^i ≈ 1
So, P{X = i} ≈ e^{−λ} λ^i/i!
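A quick numerical check of the approximation (my own Python sketch, not from the book; n and p below are made-up example values):

```python
# Compare the binomial pmf with its Poisson approximation for large n, small p.
from math import comb, exp, factorial

n, p = 100, 0.03          # assumed example values; lam = n*p = 3
lam = n * p

for i in range(7):
    binom = comb(n, i) * p**i * (1 - p)**(n - i)
    pois = exp(-lam) * lam**i / factorial(i)
    print(f"i={i}: binomial={binom:.5f}  poisson={pois:.5f}")
```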
Suppose an experiment has r possible outcomes, and the ith outcome has probability pi. If n of these experiments are performed and the outcome of each experiment does not affect any of the other experiments, the probability that the ith outcome appears xi times, for i = 1, . . . , r, is
n!/(x1! x2! ··· xr!) p1^{x1} p2^{x2} ··· pr^{xr}
where Σ_{i=1}^{r} xi = n. The multinomial distribution is a generalization of the binomial distribution.
2.3 Continuous Random Variables
X is a continuous random variable if there exists a nonnegative function f(x), defined for all x ∈ R, such that for any set B of real numbers,
P{X ∈ B} = ∫_B f(x) dx
d/da F(a) = f(a)
P{a − ε/2 ≤ X ≤ a + ε/2} = ∫_{a−ε/2}^{a+ε/2} f(x) dx ≈ ε f(a)
(f (a) is a measure of how likely it is that the random variable will be near a)
2.3.1 The Uniform Random Variable
X is uniform on (0, 1) if
f(x) = { 1, 0 < x < 1;  0, otherwise }
This is a density function because f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = ∫_0^1 dx = 1.
X takes values x ∈ (0, 1), and for 0 ≤ a < b ≤ 1,
P{a ≤ X ≤ b} = ∫_a^b f(x) dx = b − a
cdf of a uniform random variable on (α, β):
F(a) = ∫_{−∞}^{a} f(x) dx = { 0, a ≤ α;  (a − α)/(β − α), α < a < β;  1, a ≥ β }
2.3.2 Exponential Random Variables
pdf: f(x) = λe^{−λx} for x ≥ 0 (and 0 for x < 0)
cdf:
F(a) = ∫_0^a λe^{−λx} dx = 1 − e^{−λa},  a ≥ 0
2.3.3 Gamma Random Variables
a gamma random variable with shape parameter α and rate parameter λ has pdf
f(x) = { λe^{−λx} (λx)^{α−1}/Γ(α), x ≥ 0;  0, x < 0 }
where the gamma function is defined by
Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx
For a positive integer n,
Γ(n) = (n − 1)!
2.3.4 Normal Random Variables
X ∼ N(µ, σ²) has pdf f(x) = (1/(σ√(2π))) e^{−(x−µ)²/2σ²}. A linear transformation Y = αX + β is again normal. Suppose α > 0:
F_Y(a) = P{Y ≤ a}
= P{αX + β ≤ a}
= P{X ≤ (a − β)/α}
= F_X((a − β)/α)
= ∫_{−∞}^{(a−β)/α} (1/(σ√(2π))) e^{−(x−µ)²/2σ²} dx
= ∫_{−∞}^{a} (1/(ασ√(2π))) exp(−(v − (αµ + β))²/(2α²σ²)) dv
= ∫_{−∞}^{a} f_Y(v) dv
(change of variables v = αx + β)
Similarly if α < 0. So Y = αX + β ∼ N(αµ + β, α²σ²).
standard normal distribution: N (0, 1)
standardizing: if X ∼ N (µ, σ 2 ), then Y = (X − µ)/σ ∼ N (0, 1)
2.4.1 The Discrete Case
expected value of X:
E[X] = Σ_{x:p(x)>0} x p(x)
Expectation of a binomial random variable X with parameters n and p:
E[X] = Σ_{i=0}^{n} i C(n, i) p^i (1 − p)^{n−i}
= Σ_{i=1}^{n} n!/((n − i)!(i − 1)!) p^i (1 − p)^{n−i}  (note: when i = 0, the whole addend is 0)
= np Σ_{i=1}^{n} (n − 1)!/((n − i)!(i − 1)!) p^{i−1} (1 − p)^{n−i}
= np Σ_{k=0}^{n−1} C(n − 1, k) p^k (1 − p)^{n−1−k}
= np (p + (1 − p))^{n−1}
= np
where k = i − 1.
Expectation of a geometric random variable with parameter p:
E[X] = Σ_{n=1}^{∞} n p (1 − p)^{n−1} = p Σ_{n=1}^{∞} n q^{n−1}
where q = 1 − p,
= p Σ_{n=1}^{∞} d/dq (q^n) = p d/dq (Σ_{n=1}^{∞} q^n) = p d/dq (q/(1 − q)) = p/(1 − q)² = 1/p
2.4.2 The Continuous Case
E[X] = ∫_{−∞}^{∞} x f(x) dx
Expectation of an exponential random variable with parameter λ:
E[X] = ∫_0^∞ x λe^{−λx} dx
integration by parts with dv = λe^{−λx} dx, u = x:
= [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 − [e^{−λx}/λ]_0^∞ = 1/λ
Expectation of a normal random variable X ∼ N(µ, σ²):
E[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−µ)²/2σ²} dx
writing x as (x − µ) + µ,
E[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} (x − µ) e^{−(x−µ)²/2σ²} dx + (µ/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x−µ)²/2σ²} dx
let y = x − µ:
E[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} y e^{−y²/2σ²} dy + µ ∫_{−∞}^{∞} f(x) dx
by symmetry, the first integral is 0, so
E[X] = µ ∫_{−∞}^{∞} f(x) dx = µ
Expectation of a gamma random variable with parameters α and λ: E[X] = α/λ
If X is a discrete random variable with pmf p(x), then for any real-valued function g,
E[g(X)] = Σ_{x:p(x)>0} g(x) p(x)
Let Y = g(X) and let g be invertible. Then P{Y = y} = P{X = g⁻¹(y)} = p(g⁻¹(y)). Letting x = g⁻¹(y),
E[Y] = Σ_{y:P{Y=y}>0} y P{Y = y} = Σ_{x:p(x)>0} g(x) p(x)
If X is a continuous random variable with pdf f(x), then for any real-valued function g,
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Assume g is an invertible, monotonically increasing function, and let Y = g(X). Then the cdf of g(X) is F_Y(y) = P{g(X) ≤ y} = P{X ≤ g⁻¹(y)} = F_X(g⁻¹(y)), and differentiating gives the result. In particular, taking g(x) = ax + b,
E[aX + b] = aE[X] + b
(see §2.6)
variance of random variable X:
Var(X) = E[(X − E[X])2 ]
Variance of the normal random variable X ∼ N(µ, σ²): substituting y = (x − µ)/σ and integrating by parts,
Var(X) = E[(X − µ)²] = (1/(σ√(2π))) ∫ (x − µ)² e^{−(x−µ)²/2σ²} dx = (σ²/√(2π)) ∫ y² e^{−y²/2} dy = (σ²/√(2π)) ∫ e^{−y²/2} dy = σ²
A useful identity:
Var(X) = E[(X − E[X])²] = E[X² − 2µX + µ²] = ∫_{−∞}^{∞} (x² − 2µx + µ²) f(x) dx = E[X²] − 2µ² + µ² = E[X²] − µ² = E[X²] − (E[X])²
2.5 Jointly Distributed Random Variables
joint cdf: F(a, b) = P{X ≤ a, Y ≤ b}
joint pmf: p(x, y) = P{X = x, Y = y}
pmf of X:
p_X(x) = P{X = x} = Σ_{y:p(x,y)>0} p(x, y)
pmf of Y:
p_Y(y) = Σ_{x:p(x,y)>0} p(x, y)
X and Y are jointly continuous if ∃ a nonnegative function f(x, y), defined for all x, y ∈ R, s.t. for all sets A and B of real numbers,
P{X ∈ A, Y ∈ B} = ∫_B ∫_A f(x, y) dx dy
f(x, y) is the joint probability density function of X and Y.
finding the pdfs of X and Y:
P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)} = ∫_A ∫_{−∞}^{∞} f(x, y) dy dx = ∫_A f_X(x) dx
where
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
similarly,
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
∂²/(∂a ∂b) F(a, b) = f(a, b)
If X and Y are random variables, and g is a function of 2 variables, then
E[g(X, Y)] = Σ_y Σ_x g(x, y) p(x, y)
or
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy
Example:
E[X + Y] = ∫∫ (x + y) f(x, y) dx dy = ∫∫ x f(x, y) dx dy + ∫∫ y f(x, y) dx dy = E[X] + E[Y]
Generalized:
E[Σ_i ai Xi] = Σ_i ai E[Xi]
Verifying the expectation of a binomial random variable X with parameters n and p:
X = Σ_{i=1}^{n} Xi
where
Xi = { 1, ith trial is a success;  0, ith trial is a failure }
E[X] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} p = np
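A simulation check of this indicator decomposition (my own sketch, not from the book; parameter values are made up):

```python
# X = sum of n Bernoulli(p) indicators; the sample mean should approach np.
import random

n, p, trials = 20, 0.3, 100_000   # assumed values
total = 0
for _ in range(trials):
    total += sum(1 for _ in range(n) if random.random() < p)
print(total / trials, "vs np =", n * p)
```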
X and Y are independent if, for all a and b,
P{X ≤ a, Y ≤ b} = P{X ≤ a}P{Y ≤ b}
If X and Y are independent, then for any functions g and h,
E[g(X)h(Y)] = E[g(X)]E[h(Y)]
Proof (continuous case):
E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_X(x) f_Y(y) dx dy = (∫_{−∞}^{∞} h(y) f_Y(y) dy)(∫_{−∞}^{∞} g(x) f_X(x) dx) = E[h(Y)]E[g(X)]
covariance of X and Y:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
If X and Y are indicator (0-1) random variables, then
Cov(X, Y) = E[XY] − E[X]E[Y] = P{X = 1, Y = 1} − P{X = 1}P{Y = 1}
which shows that Cov(X, Y) is positive if the outcome X = 1 makes it more likely that Y = 1 (as well as the reverse).
Positive covariance indicates that Y tends to increase with X, while negative covariance indicates that Y tends to decrease as X increases.
See excellent example on page 51
Properties of Covariance
1. Cov(X, X) = Var(X)
2. Cov(X, Y) = Cov(Y, X)
3. Cov(cX, Y) = c Cov(X, Y)
4. Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)
Proof of (4):
Cov(X, Y + Z) = E[X(Y + Z)] − E[X]E[Y + Z] = E[XY] − E[X]E[Y] + E[XZ] − E[X]E[Z] = Cov(X, Y) + Cov(X, Z)
(4) generalizes to
Cov(Σ_{i=1}^{n} Xi, Σ_{j=1}^{m} Yj) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(Xi, Yj)
In particular,
Var(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} Σ_{j=1}^{n} Cov(Xi, Xj)
= Σ_{i=1}^{n} Cov(Xi, Xi) + Σ_{i=1}^{n} Σ_{j≠i} Cov(Xi, Xj)
= Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i=1}^{n} Σ_{j<i} Cov(Xi, Xj)
If X1, . . . , Xn are independent and identically distributed (i.i.d.) with mean µ and variance σ², then the sample mean is
X̄ = Σ_{i=1}^{n} Xi / n
• E[X̄] = µ because
E[X̄] = (1/n) Σ_{i=1}^{n} E[Xi] = µ
• Var(X̄) = σ²/n because
Var(X̄) = (1/n)² Var(Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} Var(Xi) = σ²/n
• Cov(X̄, Xi − X̄) = 0, i = 1, . . . , n because
Cov(X̄, Xi − X̄) = Cov(X̄, Xi) − Var(X̄) = (1/n) Cov(Xi, Xi) + (1/n) Cov(Σ_{j≠i} Xj, Xi) − σ²/n = σ²/n + 0 − σ²/n = 0
due to the fact that Xi and Σ_{j≠i} Xj are independent and have covariance 0.
Variance of a binomial random variable X with parameters n and p: write X = Σ_{i=1}^{n} Xi, where
Xi = { 1, ith trial is a success;  0, ith trial is a failure }
Since the trials are independent,
Var(X) = Σ_{i=1}^{n} Var(Xi)
and Var(Xi) = E[Xi²] − (E[Xi])² = p − p², so
Var(X) = np(1 − p)
" n
# " n # n
X Xi 1 X X
E = E Xi = E[Xi ] = np
i=1
n n i=1 i=1
n
! n
!
X Xi 1 X
Var = 2 Var Xi
i=1
n n i=1
n
1 X XX
= 2 Var(Xi ) + 2 Cov(Xi , Xj )
n i=1 i<j
15
= P{Xi = 1}P{Xj = 1|Xi = 1} − p2
Np Np − 1
= − p2
N N −1
So,
n
! n
!
X Xi 1 X
Var = 2 Var Xi
i=1
n n i=1
n
1 X XX
= 2 Var(Xi ) + 2 Cov(Xi , Xj )
n i=1 i<j
1 n Np Np − 1
= np(1 − p) + 2 − p2
n2 2 N N −1
1 p(p − 1)
= 2 np(1 − p) + n(n − 1)
n N −1
p(1 − p) (n − 1)p(1 − p)
= −
n n(N − 1)
The variance of the estimator increases as N increases; as N → ∞, the variance approaches p(1 − p)/n. This makes sense because for N large, the Xi are approximately independent, so Σ_{i=1}^{n} Xi approximately follows a binomial distribution with parameters n and p.
Think of Σ_{i=1}^{n} Xi as the number of white balls obtained when n balls are randomly selected from a population consisting of Np white and N − Np black balls; this random variable is hypergeometric and has pmf
P{Σ_{i=1}^{n} Xi = k} = C(Np, k) C(N − Np, n − k) / C(N, n)
Let X and Y be continuous and independent, and let pdf of X and Y be f and g respectively; let FX+Y (a) be the cdf of
X + Y . Then
F_{X+Y}(a) = P{X + Y ≤ a} = ∫∫_{x+y≤a} f(x) g(y) dx dy = ∫_{−∞}^{∞} F_X(a − y) g(y) dy
Differentiating gives the convolution formula for the density:
f_{X+Y}(a) = ∫_{−∞}^{∞} f(a − y) g(y) dy
Example: let X and Y be independent U(0, 1), so that
f(a) = g(a) = { 1, 0 < a < 1;  0, otherwise }
since P{X ≤ a} = ∫_{−∞}^{a} f(x) dx = ∫_0^a dx = a. Then
f_{X+Y}(a) = ∫_0^1 f(a − y) dy = { a, 0 ≤ a ≤ 1;  2 − a, 1 < a < 2;  0, otherwise }
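A simulation check of the triangular density (my own sketch, not from the book):

```python
# Histogram of X+Y for independent U(0,1) X, Y vs the triangular density.
import random

samples = [random.random() + random.random() for _ in range(200_000)]
width = 0.25
for k in range(8):                      # bins covering [0, 2)
    lo = k * width
    freq = sum(lo <= s < lo + width for s in samples) / len(samples) / width
    mid = lo + width / 2
    exact = mid if mid <= 1 else 2 - mid
    print(f"[{lo:.2f},{lo+width:.2f}): empirical={freq:.3f} exact={exact:.3f}")
```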
Sum of independent Poisson random variables: let X1 ∼ Pois(λ1) and X2 ∼ Pois(λ2) be independent. Then
P{X1 + X2 = n} = Σ_{k=0}^{n} P{X1 = k, X2 = n − k}
= Σ_{k=0}^{n} P{X1 = k} P{X2 = n − k}
= Σ_{k=0}^{n} e^{−λ1} (λ1^k/k!) e^{−λ2} (λ2^{n−k}/(n − k)!)
= e^{−(λ1+λ2)} Σ_{k=0}^{n} λ1^k λ2^{n−k}/(k!(n − k)!)
= (e^{−(λ1+λ2)}/n!) Σ_{k=0}^{n} (n!/(k!(n − k)!)) λ1^k λ2^{n−k}
= (e^{−(λ1+λ2)}/n!) (λ1 + λ2)^n
X1 + X2 follows a Poisson distribution with mean λ1 + λ2
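A direct numerical check of this convolution identity (my own sketch; rates are made up):

```python
# Convolve Poisson(lam1) and Poisson(lam2) pmfs; compare with Poisson(lam1+lam2).
from math import exp, factorial

def pois(lam, i):
    return exp(-lam) * lam**i / factorial(i)

lam1, lam2 = 1.5, 2.5                     # assumed rates
for n in range(6):
    conv = sum(pois(lam1, k) * pois(lam2, n - k) for k in range(n + 1))
    print(f"n={n}: conv={conv:.6f}  direct={pois(lam1 + lam2, n):.6f}")
```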
n random variables X1, X2, . . . , Xn are independent if, for all values a1, a2, . . . , an,
P{X1 ≤ a1, . . . , Xn ≤ an} = Π_{i=1}^{n} P{Xi ≤ ai}
X1 and X2 are jointly continuous random variables with joint pdf f (x1 , x2 )
suppose Y1 = g1 (X1 , X2 ) and Y2 = g2 (X1 , X2 ) for some functions g1 and g2 satisfying the following conditions:
1. y1 = g1 (x1 , x2 ) and y2 = g2 (x1 , x2 ) can be uniquely solved for x1 and x2 in terms of y1 and y2 with solutions given by,
say, x1 = h1 (y1 , y2 ) and x2 = h2 (y1 , y2 )
2. g1 and g2 have continuous partial derivatives at all points (x1, x2) and are such that
J(x1, x2) = det [ ∂g1/∂x1  ∂g1/∂x2 ;  ∂g2/∂x1  ∂g2/∂x2 ] ≡ (∂g1/∂x1)(∂g2/∂x2) − (∂g1/∂x2)(∂g2/∂x1) ≠ 0
Under these conditions,
f_{Y1,Y2}(y1, y2) = f_{X1,X2}(x1, x2) |J(x1, x2)|^{−1}
where x1 = h1(y1, y2) and x2 = h2(y1, y2). This comes from differentiating both sides of the following equation w.r.t. y1 and y2:
P{Y1 ≤ y1, Y2 ≤ y2} = ∫∫_{(x1,x2): g1(x1,x2)≤y1; g2(x1,x2)≤y2} f_{X1,X2}(x1, x2) dx1 dx2
2.6 Moment Generating Functions
(see §2.4.3) The moment generating function φ(t) of a random variable X is defined for all values t by
φ(t) = E[e^{tX}] = { Σ_x e^{tx} p(x), X is discrete;  ∫_{−∞}^{∞} e^{tx} f(x) dx, X is continuous }
so φ′(0) = E[X];
in general, the nth derivative of φ(t) evaluated at t = 0 equals E[X^n] for n ≥ 1.
Example: Binomial Distribution with parameters n and p
φ(t) = E[e^{tX}] = Σ_{k=0}^{n} e^{tk} C(n, k) p^k (1 − p)^{n−k} = Σ_{k=0}^{n} C(n, k) (pe^t)^k (1 − p)^{n−k} = (pe^t + 1 − p)^n
Hence,
E[X] = φ′(0) = n(pe^t + 1 − p)^{n−1} p e^t |_{t=0} = np
Example: Poisson Distribution with mean λ
φ(t) = E[e^{tX}] = Σ_{n=0}^{∞} e^{tn} e^{−λ} λ^n/n! = e^{−λ} Σ_{n=0}^{∞} (λe^t)^n/n! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}
E[X] = φ′(0) = λ
E[X²] = φ″(0) = λ² + λ
Var(X) = E[X²] − (E[X])² = λ
See book for more examples as well as a table of moment generating functions.
If X and Y are independent, then φ_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] = φ_X(t)φ_Y(t).
There is a one-to-one correspondence between the moment generating function and the distribution function of a random variable.
See book for Poisson paradigm, Laplace transform, multivariate normal distribution
2.6.1 Joint Distribution of Sample Mean and Sample Variance from a Normal Population
If X1, . . . , Xn is a sample from a normal population N(µ, σ²), the sample mean X̄ and the sample variance S² are independent, with X̄ ∼ N(µ, σ²/n) and (n − 1)S²/σ² chi-squared with n − 1 degrees of freedom, where X̄ = (1/n) Σ_{i=1}^{n} Xi and S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1).
Note that
Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} (Xi − µ + µ − X̄)²
= Σ_{i=1}^{n} (Xi − µ)² + n(µ − X̄)² + 2(µ − X̄) Σ_{i=1}^{n} (Xi − µ)
= Σ_{i=1}^{n} (Xi − µ)² + n(µ − X̄)² + 2(µ − X̄)(nX̄ − nµ)
= Σ_{i=1}^{n} (Xi − µ)² + n(µ − X̄)² − 2n(µ − X̄)²
= Σ_{i=1}^{n} (Xi − µ)² − n(µ − X̄)²
So,
E[(n − 1)S²] = Σ_{i=1}^{n} E[(Xi − µ)²] − n E[(X̄ − µ)²] = nσ² − n Var(X̄) = nσ² − n(σ²/n) = (n − 1)σ²
So E[S²] = σ²
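A simulation check that S² is unbiased (my own sketch; statistics.variance divides by n − 1):

```python
# Estimate E[S^2] for a N(mu, sigma^2) sample and compare with sigma^2.
import random
import statistics

mu, sigma, n, reps = 5.0, 2.0, 10, 50_000      # assumed values
est = sum(statistics.variance([random.gauss(mu, sigma) for _ in range(n)])
          for _ in range(reps)) / reps
print(est, "vs sigma^2 =", sigma**2)
```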
If Z1, . . . , Zn are independent standard normal random variables, then Σ_{i=1}^{n} Zi² is a chi-squared random variable with n degrees of freedom; see the book for more details.
Markov’s Inequality: If X is a random variable that takes only nonnegative values, then for any a > 0,
P{X ≥ a} ≤ E[X]/a
Proof (continuous case):
E[X] = ∫_0^∞ x f(x) dx ≥ ∫_a^∞ x f(x) dx ≥ ∫_a^∞ a f(x) dx = a P{X ≥ a}
Chebyshev’s Inequality: If X is a random variable with mean µ and variance σ², then for any k > 0,
P{|X − µ| ≥ k} ≤ σ²/k²
Proof: apply Markov’s inequality to the nonnegative random variable (X − µ)²:
P{(X − µ)² ≥ k²} ≤ E[(X − µ)²]/k²
Because (X − µ)² ≥ k² ⇔ |X − µ| ≥ k,
P{|X − µ| ≥ k} ≤ E[(X − µ)²]/k² = σ²/k²
These inequalities allow us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known.
Strong Law of Large Numbers
Let X1, X2, . . . be a sequence of independent random variables having a common distribution, with E[Xi] = µ. Then, with probability 1,
(X1 + X2 + ··· + Xn)/n → µ as n → ∞
Central Limit Theorem
Let X1, X2, . . . be a sequence of i.i.d. random variables, each with mean µ and variance σ². Then the distribution of (X1 + X2 + ··· + Xn − nµ)/(σ√n) tends to the standard normal as n → ∞:
P{(X1 + X2 + ··· + Xn − nµ)/(σ√n) ≤ a} → (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx
This holds for any distribution of the Xi's.
see text for proof of CLT
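A quick CLT simulation (my own sketch; uses U(0,1) summands, for which µ = 1/2 and σ² = 1/12):

```python
# Standardized sums of i.i.d. U(0,1) variables vs the standard normal cdf.
import math
import random

def std_normal_cdf(a):
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

n, reps = 30, 20_000                         # assumed values
mu, sigma = 0.5, math.sqrt(1 / 12)
sums = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
        for _ in range(reps)]
for a in (-1.0, 0.0, 1.0):
    emp = sum(s <= a for s in sums) / reps
    print(f"a={a}: empirical={emp:.3f}  Phi(a)={std_normal_cdf(a):.3f}")
```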
A stochastic process {X(t), t ∈ T } is a collection of random variables X(t) for each t ∈ T . Refer to X(t) as the state of the
process at time t, and T as the index set of the process. The process is discrete-time if T is countable, and continuous-time if
T is an interval of the real line. The state space of a stochastic process is the set of all possible values that X(t) can assume.
In a sequence of independent success-failure trials (success with probability p), the number of successes that appear before the rth failure follows the negative binomial distribution:
X ∼ NB(r, p)
P{X = k} = C(k + r − 1, k) (1 − p)^r p^k
E[X] = pr/(1 − p)
Var(X) = pr/(1 − p)²
For N binomial with parameters (n, p), n large and p small, setting λ = np,
P{N = i} ≈ e^{−λ} λ^i/i!
with E[N] = np = λ. From §2.4.1,
E[N] = Σ_{i=0}^{∞} i P{N = i} = Σ_{i=1}^{∞} i e^{−λ} λ^i/i! = λ Σ_{j=0}^{∞} e^{−λ} λ^j/j! = λ = np
For a Poisson process with rate λ,
P{Nt = k} = e^{−λt} (λt)^k/k!
for k ≥ 0, so E[Nt] = λt grows linearly with time. On average, the arrivals are spread evenly over the time interval.
Let Tk be the time of the kth arrival.
For some fixed time t, the probability that the first arrival comes after t is equal to the probability that there are no arrivals in the time interval [0, t] (the case k = 0):
P{T1 > t} = P{Nt = 0} = e^{−λt}
F_{T1}(t) = P{T1 ≤ t} = 1 − P{T1 > t} = 1 − e^{−λt}
for t ≥ 0,
f_{T1}(t) = d/dt F_{T1}(t) = d/dt (1 − e^{−λt}) = λe^{−λt}
T1 is an exponential random variable.
For some fixed k (let t be variable), examine the time of the kth arrival, Tk:
F_{Tk}(t) = P{Tk ≤ t} = P{Nt ≥ k} = 1 − P{Nt ≤ k − 1} = 1 − Σ_{j=0}^{k−1} e^{−λt} (λt)^j/j!
f_{Tk}(t) = −Σ_{j=0}^{k−1} d/dt [e^{−λt} (λt)^j/j!] = −Σ_{j=0}^{k−1} e^{−λt} [λ^j t^{j−1}/(j − 1)! − λ^{j+1} t^j/j!] = −e^{−λt} (0 − λ^k t^{k−1}/(k − 1)!) = λe^{−λt} (λt)^{k−1}/(k − 1)!
because of telescoping. (Be careful with the case j = 0; it is not notated well above.)
E[Tk] = ∫_0^∞ P{Tk > t} dt = ∫_0^∞ P{Nt ≤ k − 1} dt
= ∫_0^∞ Σ_{j=0}^{k−1} e^{−λt} (λt)^j/j! dt = Σ_{j=0}^{k−1} (1/λ) ∫_0^∞ λe^{−λt} (λt)^j/j! dt = Σ_{j=0}^{k−1} 1/λ = k/λ
(each integrand λe^{−λt}(λt)^j/j! is a gamma density, so the inner integrals are 1)
Var[Tk] = E[Tk²] − (E[Tk])² = (k + 1)k/λ² − k²/λ² = k/λ²
Theorem: The interval lengths T1 , T2 − T1 , T3 − T2 , . . . are i.i.d exponential variables with parameter λ.
E[Tk] = E[T1 + (T2 − T1) + (T3 − T2) + ··· + (Tk − T_{k−1})] = k/λ
Var[Tk] = Var[T1 + (T2 − T1) + (T3 − T2) + ··· + (Tk − T_{k−1})] = k/λ²
Gamma density function with shape index k and rate parameter λ:
λe^{−λt} (λt)^{k−1}/(k − 1)!
Gamma density function with shape index α and rate parameter λ:
λe^{−λt} (λt)^{α−1}/Γ(α)
where Γ(α) is the normalizing constant that makes it a pdf, and
Γ(α) ≡ ∫_0^∞ λe^{−λt} (λt)^{α−1} dt = ∫_0^∞ e^{−x} x^{α−1} dx
• Γ(1) = 1
• Γ(α) = (α − 1)Γ(α − 1)
• Γ(k) = (k − 1)! if k is an integer
• Γ(1/2) = √π
Consider X ∼ gamma(α, 1). Since ∫_0^∞ e^{−x} x^{α−1}/Γ(α) dx = 1, integration by parts gives (for α > 1)
Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx
= [−e^{−x} x^{α−1}]_0^∞ + ∫_0^∞ e^{−x} (α − 1) x^{α−2} dx
= 0 + (α − 1)Γ(α − 1)
Γ(α) = (α − 1)Γ(α − 1)
Let X ∼ N(0, 1) and Y = X². Then
P{Y ≤ t} = P{X² ≤ t} = P{−√t ≤ X ≤ √t} = ∫_{−√t}^{√t} (1/√(2π)) e^{−x²/2} dx = 2 ∫_0^{√t} (1/√(2π)) e^{−x²/2} dx
find pdf:
d/dt P{Y ≤ t} = 2 (1/√(2π)) e^{−t/2} (1/2) t^{−1/2} = e^{−t/2} t^{−1/2}/√(2π)
Y ∼ gamma(1/2, 1/2)
Γ(1/2) = √π
If X1, . . . , Xn are each ∼ N(0, 1) and independent, then Σ_{i=1}^{n} Xi² ∼ gamma(n/2, 1/2), which is the chi-square distribution with n degrees of freedom.
Alternative way to compute the expected value of a nonnegative random variable X ≥ 0:
E[X] = ∫_0^∞ x f(x) dx = ∫_0^∞ f(x) ∫_0^x dy dx = ∫_0^∞ ∫_y^∞ f(x) dx dy = ∫_0^∞ P{X > y} dy = ∫_0^∞ (1 − F(y)) dy
Beta distribution
P{U ∈ du} = [Γ(α + β)/(Γ(α)Γ(β))] u^{α−1} (1 − u)^{β−1} du,  for 0 < u < 1
3.1 Introduction
if P(F) > 0,
P(E|F) ≡ P(E ∩ F)/P(F)
conditional pmf of X given Y = y:
p_{X|Y}(x|y) = P{X = x|Y = y} = P{X = x, Y = y}/P{Y = y} = p(x, y)/p_Y(y)
Example: let X1 ∼ Binomial(n1, p) and X2 ∼ Binomial(n2, p) be independent, and write q = 1 − p. Then
P{X1 = k|X1 + X2 = m} = P{X1 = k, X2 = m − k}/P{X1 + X2 = m} = [C(n1, k) p^k q^{n1−k} C(n2, m − k) p^{m−k} q^{n2−m+k}] / [C(n1 + n2, m) p^m q^{n1+n2−m}] = C(n1, k) C(n2, m − k)/C(n1 + n2, m)
where we use the fact that X1 + X2 ∼ Binomial(n1 + n2, p)
This is the hypergeometric distribution (see §2.5.3), the distribution of the number of blue balls that are
chosen when a sample of m balls is randomly chosen from n1 blue and n2 red balls.
If X and Y have joint pdf f(x, y), then the conditional pdf of X given that Y = y is defined, for all y s.t. f_Y(y) > 0, by
f_{X|Y}(x|y) = f(x, y)/f_Y(y)
The conditional expectation of X given that Y = y is defined, for all y s.t. f_Y(y) > 0, by
E[X|Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx
Let E[X|Y] be a function of the random variable Y: when Y = y, E[X|Y] = E[X|Y = y]. E[X|Y] is itself a random variable.
For all random variables X and Y,
E[X] = E[E[X|Y]]
If Y is discrete,
E[X] = Σ_y E[X|Y = y] P{Y = y}
Proof:
Σ_y E[X|Y = y] P{Y = y} = Σ_y Σ_x x P{X = x|Y = y} P{Y = y}
= Σ_y Σ_x x P{X = x, Y = y} = Σ_x x Σ_y P{X = x, Y = y} = Σ_x x P{X = x} = E[X]
Example Expectation of the sum of a random number of random variables
N denotes the number of accidents per week; Xi denotes the number of injuries in the ith accident, with the Xi i.i.d. and independent of N. Let E[N] = 4 and E[Xi] = 2 for all i. What is the expected number of injuries?
E[Σ_{i=1}^{N} Xi] = E[E[Σ_{i=1}^{N} Xi | N]]
and E[Σ_{i=1}^{N} Xi | N = n] = E[Σ_{i=1}^{n} Xi] = n E[X], which uses the independence of the Xi and N; here E[X] denotes the common mean E[Xi]. Thus,
E[Σ_{i=1}^{N} Xi] = E[N E[X]] = E[N] E[X] = 4 · 2 = 8
Σ_{i=1}^{N} Xi is a compound random variable: the sum of a random number N of i.i.d. random variables that are also independent of N.
Example Mean of a geometric distribution A trial has probability p of success. N represents the number of trials until the first success. Let Y be an indicator random variable that equals 1 if the first trial is a success, and 0 otherwise. Then
E[N] = E[E[N|Y]] = p E[N|Y = 1] + (1 − p) E[N|Y = 0] = p · 1 + (1 − p)(1 + E[N])
so E[N] = 1/p.
Let Mn denote the expected number of comparisons needed by quick sort to sort a set of n distinct values.
Condition on the rank of the initial value selected:
Mn = Σ_{j=1}^{n} (1/n) E[number of comparisons | initial value selected is the jth smallest]
If the initial value selected is the jth smallest, you get two sets of sizes j − 1 and n − j, and you need n − 1 comparisons with the initial value. So
Mn = Σ_{j=1}^{n} (1/n)(n − 1 + M_{j−1} + M_{n−j}) = n − 1 + (2/n) Σ_{k=1}^{n−1} Mk
Multiply by n and replace n by n + 1:
(n + 1)M_{n+1} = (n + 1)n + 2 Σ_{k=1}^{n} Mk
Subtracting nMn = n(n − 1) + 2 Σ_{k=1}^{n−1} Mk gives
(n + 1)M_{n+1} = (n + 2)Mn + 2n
Iterate:
M_{n+1}/(n + 2) = 2n/((n + 1)(n + 2)) + Mn/(n + 1)
= 2n/((n + 1)(n + 2)) + 2(n − 1)/(n(n + 1)) + M_{n−1}/n
= ···
= 2 Σ_{k=0}^{n−1} (n − k)/((n + 1 − k)(n + 2 − k))
since M1 = 0. So, letting i = n − k,
M_{n+1} = 2(n + 2) Σ_{k=0}^{n−1} (n − k)/((n + 1 − k)(n + 2 − k)) = 2(n + 2) Σ_{i=1}^{n} i/((i + 1)(i + 2))
for n ≥ 1. Using the partial fractions i/((i + 1)(i + 2)) = 2/(i + 2) − 1/(i + 1),
M_{n+1} = 2(n + 2) [ Σ_{i=1}^{n} 2/(i + 2) − Σ_{i=1}^{n} 1/(i + 1) ]
Independent trials with success probability p. Let N be the trial number of the first success. Let Y = 1 if the first trial is a success, Y = 0 otherwise.
Find E[N²]:
E[N²] = E[E[N²|Y]]
E[N²|Y = 1] = 1
E[N²|Y = 0] = E[(1 + N)²]
E[N²] = p + (1 − p)E[(1 + N)²] = 1 + (1 − p)E[2N + N²]
Since we showed that E[N] = 1/p earlier,
E[N²] = 1 + 2(1 − p)/p + (1 − p)E[N²]
E[N²] = (2 − p)/p²
Thus,
Var(N) = E[N²] − (E[N])² = (2 − p)/p² − (1/p)² = (1 − p)/p²
By the definition of variance,
Var(X|Y = y) = E[(X − E[X|Y = y])² | Y = y]
If we let Var(X|Y) be the function of Y that takes the value Var(X|Y = y) when Y = y, then we have this:
Proposition The Conditional Variance Formula
Var(X) = E[Var(X|Y)] + Var(E[X|Y])
Proof
E[Var(X|Y)] = E[E[X²|Y] − (E[X|Y])²] = E[E[X²|Y]] − E[(E[X|Y])²] = E[X²] − E[(E[X|Y])²]
and
Var(E[X|Y]) = E[(E[X|Y])²] − (E[E[X|Y]])² = E[(E[X|Y])²] − (E[X])²
so,
E[Var(X|Y)] + Var(E[X|Y]) = E[X²] − (E[X])²
Let X1, . . . be i.i.d. random variables with distribution F having mean µ and variance σ², and let them be independent of the nonnegative integer valued random variable N. S = Σ_{i=1}^{N} Xi is a compound random variable.
Var(S|N = n) = Var(Σ_{i=1}^{N} Xi | N = n) = Var(Σ_{i=1}^{n} Xi | N = n) = Var(Σ_{i=1}^{n} Xi) = nσ²
Var(S|N) = Nσ²
and
E[S|N] = Nµ
By the conditional variance formula, Var(S) = E[Nσ²] + Var(Nµ) = σ²E[N] + µ²Var(N).
In the case that N is Poisson, S is a compound Poisson random variable. Since the variance of a Poisson is equal to its mean, if N is Poisson with E[N] = λ, then Var(S) = λσ² + λµ² = λE[X²].
3.5 Computing Probabilities by Conditioning
Let X be an indicator random variable that takes value 1 if E occurs and 0 otherwise. Then
E[X] = P(E)
E[X|Y = y] = P(E|Y = y)
So if Y is discrete,
P(E) = E[X] = Σ_y E[X|Y = y] P{Y = y} = Σ_y P(E|Y = y) P{Y = y}
and if Y is continuous,
P(E) = E[X] = ∫_{−∞}^{∞} E[X|Y = y] f_Y(y) dy = ∫_{−∞}^{∞} P(E|Y = y) f_Y(y) dy
from §3.3
Example Suppose X and Y are independent with pdfs fX and fY . Compute P{X < Y }.
P{X < Y} = ∫_{−∞}^{∞} P{X < Y|Y = y} f_Y(y) dy
= ∫_{−∞}^{∞} P{X < y|Y = y} f_Y(y) dy
= ∫_{−∞}^{∞} P{X < y} f_Y(y) dy
= ∫_{−∞}^{∞} F_X(y) f_Y(y) dy
= ∫_{−∞}^{∞} (∫_{−∞}^{y} f_X(x) dx) f_Y(y) dy
Let X1 , . . . Xn be independent Bernoulli random variables with Xi having parameter pi . Specifically, P{Xi =
1} = pi and P{Xi = 0} = qi = 1 − pi .
Let
Pk (j) = P{X1 + · · · + Xk = j}
and note that
Pk(0) = Π_{i=1}^{k} qi
Conditioning on Xk gives the recursion
Pk(j) = P{X1 + ··· + X_{k−1} = j − 1} pk + P{X1 + ··· + X_{k−1} = j} qk = P_{k−1}(j − 1) pk + P_{k−1}(j) qk
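This recursion is exactly a dynamic program; a minimal sketch (my own, not from the book; the pi values are made up):

```python
# pmf of a sum of independent, non-identical Bernoullis via
# P_k(j) = P_{k-1}(j-1) p_k + P_{k-1}(j) q_k.
def poisson_binomial_pmf(ps):
    probs = [1.0]                          # P_0(0) = 1
    for p in ps:
        q = 1 - p
        new = [probs[0] * q]
        for j in range(1, len(probs)):
            new.append(probs[j - 1] * p + probs[j] * q)
        new.append(probs[-1] * p)
        probs = new
    return probs

pmf = poisson_binomial_pmf([0.1, 0.5, 0.9])   # assumed example p_i
print(pmf, sum(pmf))                           # pmf sums to 1
```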
Example Let S = X1 + ··· + XN, where the Xi are i.i.d. with mean µ and variance σ², independent of N, and E[N] = a, Var(N) = b². Condition on N:
E[S] = Σ_{n=0}^{∞} E[S|N = n] P{N = n}
= Σ_{n=0}^{∞} E[X1 + ··· + Xn] P{N = n}
= Σ_{n=0}^{∞} nµ P{N = n} = µE[N] = aµ
Compute the variance:
E[S²] = Σ_{n=0}^{∞} E[S²|N = n] P{N = n}
= Σ_{n=0}^{∞} E[(X1 + ··· + Xn)²] P{N = n}
= Σ_{n=0}^{∞} [Var(X1 + ··· + Xn) + (E[X1 + ··· + Xn])²] P{N = n}
= Σ_{n=0}^{∞} (nσ² + n²µ²) P{N = n}
= σ²E[N] + µ²E[N²]
= aσ² + µ²(b² + a²)
so Var(S) = E[S²] − (E[S])² = aσ² + µ²b².
P{L = i, M = j} = P{L = i, M = j, N = i + j}
Example If the sun has risen for n consecutive days, what is the probability that it will rise tomorrow, if we know that its rising or not is a Bernoulli random variable?
Sn: number of successes in the first n trials
X ∼ U(0, 1): probability of success in one trial
P{X ∈ dp} = dp, p ∈ [0, 1]
P{Sn = k|X = p} = C(n, k) p^k (1 − p)^{n−k}
P{Sn+1 = n + 1|Sn = n} = P{Sn+1 = n + 1}/P{Sn = n} = (∫_0^1 p^{n+1} dp)/(∫_0^1 p^n dp) = (n + 1)/(n + 2) ≈ 1
for n large.
What is P{X ∈ dp|Sn = k}? It is proportional to p^k (1 − p)^{n−k} dp, and the normalizing constant is
∫_0^1 u^k (1 − u)^{n−k} du = [Γ(k + 1)Γ(n − k + 1)/Γ(n + 2)] ∫_0^1 [Γ(n + 2)/(Γ(k + 1)Γ(n − k + 1))] u^k (1 − u)^{n−k} du
= Γ(k + 1)Γ(n − k + 1)/Γ(n + 2)
= k!(n − k)!/(n + 1)!
(the inserted integrand is a beta density and integrates to 1), so the conditional distribution of X is beta(k + 1, n − k + 1).
Theorem
Let X be a random variable and φ a positive bounded deterministic function.
(X has pdf f(x)) ⇔ (E[φ(X)] = ∫ φ(x) f(x) dx for every such φ)
General strategy for finding pdfs of functions of random variables: for Y = f(X), compute
E[φ(f(X))] = ∫_a^b φ(f(x)) P{X ∈ dx} = ··· = ∫_c^d φ(y) g(y) dy
and read off g as the pdf of Y.
Suppose X ∼ expon(λ). What is the pdf of λX? As sanity checks, E[λX] = λE[X] = 1 and Var(λX) = λ²Var(X) = 1.
Let φ be an arbitrary positive, bounded, deterministic function. Letting y = λx,
E[φ(λX)] = ∫_0^∞ φ(λx) λe^{−λx} dx
= ∫_0^∞ φ(y) e^{−y} dy
= E[φ(Y)]
So the pdf of Y = λX is e^{−y}.
Conclusion: (X ∼ expon(λ)) ⇔ (λX ∼ expon(1))
Example Let X ∼ gamma(α, λ) and Y = λX. Letting y = λx,
E[φ(Y)] = E[φ(λX)] = ∫_0^∞ φ(λx) f(x) dx = ∫_0^∞ φ(y) [λe^{−y} y^{α−1}/Γ(α)] (1/λ) dy = ∫_0^∞ φ(y) e^{−y} y^{α−1}/Γ(α) dy
So Y = λX ∼ gamma(α, 1).
Conclusion (concept of scale): (X ∼ gamma(α, λ)) ⇔ (λX ∼ gamma(α, 1))
Example: Convolution Let X and Y be independent with pdfs f and g, and let Z = X + Y. For positive bounded φ,
E[φ(Z)] = ∫∫ φ(x + y) f(x) g(y) dx dy = ∫_{−∞}^{∞} φ(z) h(z) dz
We call h(z) = ∫_{−∞}^{∞} f(x) g(z − x) dx the convolution of f and g, written f ∗ g.
Example
Let X ∼ gamma(α, λ)
Let Y ∼ gamma(β, λ)
Let X and Y be independent
Show that Z = X + Y and U = X/(X + Y) are independent.
It suffices to show this for λ = 1; we can scale to any other value of λ.
Let φ be an arbitrary, positive, bounded function of two variables, and let φ̂(x, y) = φ(x/(x + y), x + y).
(Goal: arrive at an expression of the form ∫∫ φ(u, z) h(u, z) du dz.)
Letting y = z − x, and then x = zu,
E[φ(U, Z)] = E[φ̂(X, Y)] = ∫_0^∞ ∫_0^∞ φ(x/(x + y), x + y) [e^{−x} x^{α−1}/Γ(α)] [e^{−y} y^{β−1}/Γ(β)] dy dx
= ∫_0^∞ ∫_x^∞ [e^{−x} x^{α−1}/Γ(α)] [e^{−(z−x)} (z − x)^{β−1}/Γ(β)] φ(x/z, z) dz dx
= ∫_0^∞ ∫_0^z [e^{−x} x^{α−1}/Γ(α)] [e^{−(z−x)} (z − x)^{β−1}/Γ(β)] φ(x/z, z) dx dz
= ∫_0^∞ ∫_0^1 [e^{−z} (zu)^{α−1} (z(1 − u))^{β−1}/(Γ(α)Γ(β))] φ(u, z) (z du) dz
= ∫_0^∞ ∫_0^1 [e^{−z} z^{α+β−1}/Γ(α + β)] [Γ(α + β) u^{α−1} (1 − u)^{β−1}/(Γ(α)Γ(β))] φ(u, z) du dz
so Z = X + Y ∼ gamma(α + β, 1) and U = X/(X + Y) ∼ beta(α, β), and they are independent.
Method 2:
We know that X + Y and X/(X + Y) are independent, so
E[X] = E[(X + Y) · X/(X + Y)] = E[X + Y] E[X/(X + Y)]
E[X/(X + Y)] = E[X]/E[X + Y] = (α/λ)/((α + β)/λ) = α/(α + β)
Special case: α = β = 1/2
X/(X + Y) ∼ Beta(1/2, 1/2)
Because Γ(1/2) = √π, the pdf is
[Γ(1)/(Γ(1/2)Γ(1/2))] u^{−1/2} (1 − u)^{−1/2} = 1/(π√(u(1 − u)))
4 Markov Chains
4.1 Introduction
Let {Xn}_{n∈{0,1,2,...}} be a stochastic process with state space D, which can be finite or countable, and transition matrix P with entries pij = P{Xn+1 = j|Xn = i} for i, j ∈ D.
X is a Markov chain if
P{Xn+1 = j|X0, . . . , Xn} = P{Xn+1 = j|Xn}
4.2 Chapman-Kolmogorov Equations
Let us denote Pi {F } = P{F |X0 = i} and Ei [F ] = E[F |X0 = i] for any event F .
Ex. Pb {X1 = c, X2 = a, X3 = b} = pbc pca pab (step through each Xi ).
Pi{X2 = k} = Σ_{j∈D} Pi{X1 = j, X2 = k} = Σ_{j∈D} pij pjk = p^{(2)}_{ik}
which is the (i, k) entry of P². Reasoning: we know X0 = i and X2 = k; we add up the probabilities of all possible values of X1, which turns out to be the matrix product of the ith row and kth column of P, i.e. the (i, k) entry of P², so P² = [p^{(2)}_{ij}].
In general (the Chapman-Kolmogorov equations),
Pi{Xm+n = k} = Σ_{j∈D} Pi{Xm = j, Xm+n = k} = Σ_j p^{(m)}_{ij} p^{(n)}_{jk} = p^{(m+n)}_{ik}
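A tiny numerical illustration of Chapman-Kolmogorov (my own sketch; the chain is made up):

```python
# Check P_i{X_2 = k} = (P^2)_{ik} for a small transition matrix, pure Python.
def matmul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

P = [[0.7, 0.3],
     [0.4, 0.6]]                     # assumed example chain
P2 = matmul(P, P)
i, k = 0, 1
direct = sum(P[i][j] * P[j][k] for j in range(2))
print(P2[i][k], "==", direct)
```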
Define “success at trial m” to mean that j is visited at least once between trials m − 1 and m, where a trial is the mth time the particle is in state i. The probability of such a success is strictly positive because i → j (Murphy’s law?). By the Strong Markov Property, these trials are independent Bernoulli trials. There are infinitely many trials because i is recurrent. Then the total number of successes is also infinite with probability 1. Since the number of visits to j is at least as large as the number of successes (there can be multiple visits to j per success), P{N(j) = ∞} = 1. Further, if i → j, then j → i since i is recurrent (having reached state j from state i, the chain must eventually get back to i). In other words, ∃n : Pj{Xn = i} > 0.
Corollary If i → j and j → i (i.e., i and j are in the same class) and i is transient, then j is transient.
Proof: Let i be transient and let i and j be in the same class. For sake of contradiction, let j be recurrent. There
is a path from j to i, so by the theorem, i is recurrent, which is a contradiction.
Note that the condition for the corollary is stronger than for the theorem.
Theorem If D is finite and X is irreducible, then all states are recurrent (by the previous theorem). Having all states be transient is impossible: each state would be visited only a finite number of times, so there would eventually be a time at which the particle is in no state, which is a contradiction.
However, if D is infinite, it is possible that all states are transient. Since there are infinitely many states, each state can be visited only a finite number of times (transience) while time still continues on infinitely. See the example below:
Example
Then
• p > 1/2 ⇒ all states are transient
• p < 1/2 ⇒ all states are recurrent (because state 0 is recurrent)
• p = q ⇒ all states are recurrent (proof omitted)
Theorem More generally, if X is irreducible, either all states are recurrent or all states are transient. Thus, recurrence and transience are class properties.
Periodicity is also a class property; more specifically, states in the same class have the same period.
Let π be a row vector whose entries are
πi = P{X0 = i}
Thus π is the distribution of X0 .
Then
P{Xn = j} = Σ_{i∈D} P{X0 = i} P{Xn = j|X0 = i} = Σ_{i∈D} πi p^{(n)}_{ij}
which is the jth entry in the row vector πP^n.
Suppose π = πP. This implies π = πP = πP² = ···, so π = πP^n ∀n. Then
P{Xn = j} = Σ_{i∈D} πi p^{(n)}_{ij} = πj
Such a π is called a stationary distribution for the process X = {Xn }n∈{0,1,...} . A stationary distribution exists if there
is at least one recurrent state. Consequently, if a state space is finite, a stationary distribution exists because at least one
state must be recurrent.
If D is recurrent and irreducible (for finite D, irreducibility suffices), then there is a unique π such that π = πP and Σ_{j∈D} πj = 1. This is because the solution space of π = πP has dimension 1; that is, if π satisfies the equation, then so does cπ for any constant c. However, for π to represent a distribution, the sum of its entries must be 1, so normalizing any solution to π = πP gives the unique row vector.
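A minimal way to compute π numerically (my own sketch; iterate π ← πP and normalize, which converges for an aperiodic irreducible finite chain):

```python
# Stationary distribution by power iteration on pi <- pi P.
def stationary(P, iters=10_000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    s = sum(pi)
    return [x / s for x in pi]

P = [[0.9, 0.1],
     [0.5, 0.5]]                     # assumed example chain
print(stationary(P))                  # satisfies pi = pi P; here [5/6, 1/6]
```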
Theorem Suppose that D is recurrent and irreducible and that all states are aperiodic (for finite D, irreducibility and
aperiodicity suffice). Then
πj = lim_{n→∞} Pi{Xn = j} = lim_{n→∞} p^{(n)}_{ij}
exists for each j, the limit does not depend on the initial state i, and the row vector π of probabilities πj is the unique stationary distribution. Note that each row of lim_{n→∞} P^n is π.
Note: by definition, the limiting probability Pi {Xn = j} of a transient state j is 0. If a Markov chain has transient states,
remove all rows and columns of the transition matrix that correspond to transient states, and find the stationary distribution
of the smaller matrix that results (that is, just consider the recurrent states).
4.5 Rewards
Given f : D → R, we represent f as a column vector with entries f (i) for each state i. P n f is a column vector, and let
P n f (i) be the ith entry of P n f . We can think of f (j) as the reward for being in state j. Then the reward at time n is f (Xn ).
Ei[f(Xn)] = Σ_{j∈D} f(j) Pi{Xn = j} = Σ_{j∈D} f(j) p^{(n)}_{ij} = P^n f(i)
Suppose the reward at time n is f(Xn), discounted by the factor α^n for α < 1. This gives smaller rewards for larger n.
Total discounted reward: Σ_{n=0}^{∞} α^n f(Xn)
Ei[Σ_{n=0}^{∞} α^n f(Xn)] = Σ_{n=0}^{∞} α^n Ei[f(Xn)] = Σ_{n=0}^{∞} α^n P^n f(i)
Letting g(i) = Ei[Σ_{n=0}^{∞} α^n f(Xn)], with g the column vector with entries g(i),
g = f + αPf + α²P²f + ··· = f + αP g
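Since g = f + αPg is linear, g can be computed by fixed-point iteration (my own sketch; the chain, reward, and α are made up; converges because α < 1):

```python
# Expected total discounted reward: iterate g <- f + alpha * P g.
alpha = 0.9
P = [[0.9, 0.1],
     [0.5, 0.5]]                     # assumed example chain
f = [1.0, 0.0]                       # assumed reward: 1 dollar in state 0
g = [0.0, 0.0]
for _ in range(1_000):
    g = [f[i] + alpha * sum(P[i][j] * g[j] for j in range(2)) for i in range(2)]
print(g)                              # discounted reward by starting state
```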
4.6 Time Averages
Fix a state j. Let Nm(j) be the number of visits to state j during [0, m]. This can be interpreted via the reward function f(x) = 1 if x = j and f(x) = 0 otherwise:
Ei[Nm(j)] = Σ_{n=0}^{m} p^{(n)}_{ij}
Theorem If X is irreducible recurrent and aperiodic, then lim_{n→∞} p^{(n)}_{ij} = πj, where πj is part of the stationary distribution. Then from the equation above,
lim_{m→∞} (1/m) Ei[Nm(j)] = lim_{m→∞} (1/m) Σ_{n=0}^{m} p^{(n)}_{ij} = lim_{n→∞} p^{(n)}_{ij} = πj
(See the appendix for a proof of the middle equality: a Cesàro average converges to the limit of the sequence.) In other words, the limiting probability πj is also the long-term average of the expected number of visits to j per unit time.
Theorem Suppose X is irreducible recurrent (but not necessarily aperiodic). Let π be the stationary distribution. Then, with probability one,
lim_{m→∞} (1/m) Nm(j) = πj
The limiting probability πj is also the long-time average of the random number of visits to j per unit time. This is basically a strong law of large numbers.
Proof:
Use the strong law of large numbers: if L1, L2, . . . are i.i.d. with mean µ, then with probability one,
lim_{n→∞} (1/n)(L1 + ··· + Ln) = µ
Let Li be the length between successive visits to the fixed state j (i.e., Li is the length between the ith and (i + 1)st visit to j).
Since the past and future become independent at each time of visit to j, the lengths L0, L1, . . . are independent, and L1, L2, . . . are i.i.d. and thus have the same mean µ. By the strong law,
lim_{n→∞} (1/(n + 1))(L0 + L1 + ··· + Ln) = µ
with probability one. In other words, the visits to j occur once every µ time units on the average in the long run. Thus the number of visits to j per unit time is 1/µ in the long run:
lim_{m→∞} (1/m) Nm(j) = 1/µ
with probability one. To show that 1/µ = πj, take the expectations of both sides and use the result from the previous theorem.
Corollary
lim_{m→∞} (1/m) Σ_{n=0}^{m} f(Xn) = Σ_{j∈D} πj f(j) = πf
Proof:
Σ_{n=0}^{m} f(Xn) is the reward accumulated until time m if we receive f(j) dollars each time we are in state j, for all j. So,
Σ_{n=0}^{m} f(Xn) = Σ_{j∈D} Nm(j) f(j)
and dividing by m and letting m → ∞ gives the result.
The LHS is an average over time, whereas the RHS is an average of f over the state space D. The equality of these two
averages is called the ergodic principle.
4.7 Transitions
Let Nm (i, j) be the number of transitions from i to j during [0, m]. Every time the chain is in state i, there is the probability
pij that the next jump is to state j. At each visit to i, interpret as a Bernoulli trial where “success” means “jumping to state
j.” Then long-run number of successes per trial is
lim_{m→∞} Nm(i, j)/Nm(i) = pij
Theorem Combining this with the results from the previous section,
lim_{m→∞} (1/m) Nm(i, j) = lim_{m→∞} (1/m) Nm(i) pij = πi pij
This is useful if we have a reward f(i, j) that depends on both the present and the preceding state (jump from i to j).
Cumulative reward during [0, m]:
Σ_{n=1}^{m} f(X_{n−1}, Xn) = Σ_i Σ_j Nm(i, j) f(i, j)
Let a particle move from state to state in D according to a Markov chain. It receives a reward of f (j) each time it is in state
j. Each dollar of time n is worth αn dollars today, at time 0. The particle “dies” (stops receiving rewards) as soon as it
enters A, where A is some fixed subset of D. The expected value of the present worth of all rewards we are to receive, given
the initial state is i, is
g(i) = Ei[Σ_{n=0}^{T−1} α^n f(Xn)]
where T = T_A is the time of the first visit to the set A, i.e.,
T = min{n ≥ 0 : Xn ∈ A}
Proof:
Suppose i ∈ B, where B = D \ A. Then T ≥ 1, we receive a reward of f(i) dollars at time 0, and the discounted expected value of all rewards to be received from time 1 on is equal to g(X1) dollars in time-1 dollars. Since g(j) = 0 for j ∈ A, the expected value of g(X1) dollars of time 1 is equal to Σ_{j∈B} α pij g(j) at time 0.
5.1 Introduction
5.2.1 Definition
Example Expected discounted return is equal to reward earned up to an exponentially distributed random time (see book)
Since e−λ(s+t) = e−λs e−λt , exponentially distributed random variables are memoryless.
See books for example problems on memorylessness of exponential
Definition: a+ = a if a > 0, and a+ = 0 if a ≤ 0.
Claim: The only right continuous function g that satisfies g(s + t) = g(s)g(t) is g(x) = e−λx
Proof:
g(1) = (g(1/n))^n, so g(1/n) = (g(1))^{1/n}
g(m/n) = (g(1/n))^m = (g(1))^{m/n}
By the right continuity of g, we then have g(x) = (g(1))^x
g(1) = (g(1/2))² ≥ 0
g(x) = e^{−(−log g(1)) x}
Definition: the failure/hazard rate function is r(t) = f(t)/(1 − F(t))
Suppose lifetime X has survived for t; what is the probability it does not survive for an additional time dt?
P{X ∈ (t, t + dt)|X > t} = P{X ∈ (t, t + dt), X > t}/P{X > t} = P{X ∈ (t, t + dt)}/P{X > t} ≈ f(t) dt/(1 − F(t)) = r(t) dt
If X ∼ expon(λ), then r(t) = λe^{−λt}/e^{−λt} = λ
r(t) uniquely determines the distribution F:
r(t) = (d/dt F(t))/(1 − F(t))
and integrating gives F(t) = 1 − exp(−∫_0^t r(s) ds).
Claim: exponential random variables are the only ones that are memoryless
Proof: We showed above that memoryless is equivalent to having a constant failure rate function, and that exponential
random variables have a constant failure rate function. If a failure rate function is constant, then by the equation above,
1 − F (t) = e−ct , which shows that it must be exponential.
See book for example on hyperexponential random variable.
Then, by induction, the sum of n i.i.d. expon(λ) random variables has a gamma(n, λ) density:
f_{X1+···+Xn}(t) = ∫_0^t f_{Xn}(t − s) f_{X1+···+X_{n−1}}(s) ds
= ∫_0^t λe^{−λ(t−s)} λe^{−λs} (λs)^{n−2}/(n − 2)! ds
= λe^{−λt} (λt)^{n−1}/(n − 1)!
Let X1 ∼ expon(λ1) and X2 ∼ expon(λ2) be independent.
P{X1 < X2} = ∫_0^∞ P{X1 < X2|X1 = x} λ1 e^{−λ1x} dx
= ∫_0^∞ P{x < X2} λ1 e^{−λ1x} dx
= ∫_0^∞ e^{−λ2x} λ1 e^{−λ1x} dx
= ∫_0^∞ λ1 e^{−(λ1+λ2)x} dx
= λ1/(λ1 + λ2)
Also, if X1, . . . , Xn are independent with Xi ∼ expon(µi), then P{min(X1, . . . , Xn) > t} = Π_i P{Xi > t} = e^{−(Σ µi)t}; thus min(X1, . . . , Xn) ∼ expon(Σ_{i=1}^{n} µi).
Example Greedy Algorithm (see book)
Definition A stochastic process {N(t), t ≥ 0} is a counting process if N(t) represents the total number of events that occur by time t. Then these must hold:
• N(t) ≥ 0
• N(t) is integer valued
• if s < t, then N(s) ≤ N(t), and N(t) − N(s) is the number of events in (s, t]
Definition A counting process has independent increments if the numbers of events that occur in disjoint time intervals are independent.
Definition A counting process has stationary increments if the distribution of the number of events that occur in any interval of time depends only on the length of the interval.
5.3.2 Definition of Poisson Process
• N(0) = 0
• independent increments
• N(t + s) − N(s) ∼ Pois(λt), which implies stationary increments and that E[N(t)] = λt
Definition The function f is said to be o(h) if lim_{h→0} f(h)/h = 0
Definition Counting process {N(t), t ≥ 0} is a Poisson process with rate λ > 0 if
• N(0) = 0
• stationary and independent increments
• P{N(h) = 1} = λh + o(h)
• P{N(h) ≥ 2} = o(h)
Let T1 be the time of the first event. Let Tn be the time between the (n − 1)st and the nth event for n > 1. We call {Tn} the sequence of interarrival times.
P{T1 > t} = P{N(t) = 0} = e^{−λt}, so T1 ∼ expon(λ)
P{T2 > t} = E[P{T2 > t|T1}], and P{T2 > t|T1 = s} = P{0 events in (s, s + t]|T1 = s} = P{0 events in (s, s + t]} = e^{−λt}
so T2 ∼ expon(λ), and T2 is independent of T1.
Proposition: the Tn, n = 1, 2, . . ., are i.i.d. expon(λ).
We call Sn = Σ_{i=1}^{n} Ti the waiting time until the nth event. By this proposition and the results from §5.2.3 and §2.2, Sn ∼ gamma(n, λ), that is,
f_{Sn}(t) = λe^{−λt} (λt)^{n−1}/(n − 1)!
Alternate method:
N(t) ≥ n ⇔ Sn ≤ t
F_{Sn}(t) = P{Sn ≤ t} = P{N(t) ≥ n} = Σ_{j=n}^{∞} e^{−λt} (λt)^j/j!
differentiate:
f_{Sn}(t) = −Σ_{j=n}^{∞} λe^{−λt} (λt)^j/j! + Σ_{j=n}^{∞} λe^{−λt} (λt)^{j−1}/(j − 1)!
= λe^{−λt} (λt)^{n−1}/(n − 1)! + Σ_{j=n+1}^{∞} λe^{−λt} (λt)^{j−1}/(j − 1)! − Σ_{j=n}^{∞} λe^{−λt} (λt)^j/j!
= λe^{−λt} (λt)^{n−1}/(n − 1)!
Suppose each event in Poisson process {N (t), t ≥ 0} with rate λ is classified as either type I (with probability p) or type
II (with probability 1 − p). Let N1 (t) and N2 (t) denote the number of type I and type II events occurring in [0, t]. Then
N (t) = N1 (t) + N2 (t).
Proposition {N1 (t), t ≥ 0} and {N2 (t), t ≥ 0} are independent Poisson processes having respective rates λp and λ(1 − p).
Proof:
Verify that {N1(t), t ≥ 0} satisfies the [second] definition of a Poisson process:
P{N1(h) = 1} = P{N1(h) = 1|N(h) = 1}P{N(h) = 1} + P{N1(h) = 1|N(h) ≥ 2}P{N(h) ≥ 2}
= p(λh + o(h)) + o(h)
= λph + o(h)
So {N1(t), t ≥ 0} is a Poisson process with rate λp. Similarly, {N2(t), t ≥ 0} is a Poisson process with rate λ(1 − p).
See §3.8 (or example 3.23 in the book) for why they are independent.
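A simulation of this thinning property (my own sketch; λ, p, t are made up):

```python
# Simulate a rate-lam Poisson process on [0, t] via exponential gaps,
# classify each event as type I with probability p, and check the means.
import random

lam, p, t, reps = 3.0, 0.4, 10.0, 20_000      # assumed values
n1_total = n2_total = 0
for _ in range(reps):
    s = random.expovariate(lam)
    while s < t:
        if random.random() < p:
            n1_total += 1
        else:
            n2_total += 1
        s += random.expovariate(lam)
print(n1_total / reps, "vs lam*p*t =", lam * p * t)
print(n2_total / reps, "vs lam*(1-p)*t =", lam * (1 - p) * t)
```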
5.3.5 Conditional Distribution of the Arrival Times
Given N(t) = 1, P{T1 ≤ s|N(t) = 1} = s/t for 0 ≤ s ≤ t, so conditionally T1 ∼ U(0, t)
see book for order statistics
Theorem (Dart Theorem) Given N (t) = n, the n arrival times S1 , . . . Sn have the same distribution as the order statistics
corresponding to n independent random variables uniformly distributed on the interval (0, t). Phrased differently, this theorem
states that under the condition that N (t) = n, the times S1 , . . . , Sn at which events occur, considered as unordered random
variables, are distributed independently and uniformly in the interval (0, t).
Proposition If Ni(t), i ∈ {1, . . . , k}, represents the number of type i events occurring by time t, then the Ni(t), i ∈ {1, . . . , k}, are independent Poisson random variables having means E[Ni(t)] = λ ∫_0^t Pi(s) ds, where Pi(s) is the probability that an event occurring at time s will be classified as type i, independently of anything that previously occurred.
Each discovered bug is responsible for a certain number of errors. Denote by Mj(t) the number of bugs that caused exactly j errors by time t. (So M1(t) is the number of bugs that caused exactly one error, M2(t) the number of bugs that caused exactly two errors, etc.) Then Σ_j j·Mj(t) is the total number of errors found by time t.
Let Ii(t) = { 1, if bug i causes exactly 1 error by time t;  0, otherwise }
Then M1(t) = Σ_i Ii(t)
Let φi(t) = 1 if bug i has caused no error by time t (bug i causes errors according to a Poisson process with rate λi), and let Λ(t) = Σ_i λi φi(t) be the error rate of the remaining bugs. Then
E[Ii(t)] = λi t e^{−λi t}
E[M1(t)] = Σ_i λi t e^{−λi t}
and E[Λ(t)] = Σ_i λi e^{−λi t}, so we have E[Λ(t) − M1(t)/t] = 0
Var(Λ(t)) = Σ_i λi² Var(φi(t)) = Σ_i λi² e^{−λi t}(1 − e^{−λi t})
Var(M1(t)) = Σ_i Var(Ii(t)) = Σ_i λi t e^{−λi t}(1 − λi t e^{−λi t})
Cov(Λ(t), M1(t)) = Cov(Σ_i λi φi(t), Σ_j Ij(t))
= Σ_i Σ_j Cov(λi φi(t), Ij(t))
= Σ_i λi Cov(φi(t), Ii(t))
= −Σ_i λi e^{−λi t} λi t e^{−λi t}
where the last two equalities follow because (i ≠ j) ⇒ (φi(t) and Ij(t) are independent), and φi(t)Ii(t) = 0.
Then
Var(Λ(t) − M1(t)/t) = E[(Λ(t) − M1(t)/t)²] = Σ_i λi² e^{−λi t} + (1/t) Σ_i λi e^{−λi t} = E[M1(t) + 2M2(t)]/t²
where we use E[M2(t)] = (1/2) Σ_i (λi t)² e^{−λi t}
Definition A stochastic process {X(t), t ≥ 0} is said to be a compound Poisson process if it can be represented as
X(t) = Σ_{i=1}^{N(t)} Yi,  t ≥ 0
where {N(t)} is a Poisson process and {Yi, i ≥ 1} is a family of i.i.d. random variables, independent of {N(t)}.
Definition Let {N(t)} be a counting process whose probabilities are defined as follows: there is a positive random variable L such that, conditional on L = λ, the counting process is a Poisson process with rate λ. This counting process is called a conditional or mixed Poisson process. Such a process has stationary increments, but generally does not have independent increments.
5.5 Extra stuff
We want to show, for an exponential X and nonnegative random variables Y and Z, independent of each other and of X, that
P{X > Y + Z|X > Y} = ∫_0^∞ ∫_0^∞ P{X > y + z|X > y, Y = y, Z = z} P{Y ∈ dy} P{Z ∈ dz}
= ∫_0^∞ ∫_0^∞ P{X > y + z|X > y} P{Y ∈ dy} P{Z ∈ dz}
= ∫_0^∞ ∫_0^∞ P{X > z} P{Y ∈ dy} P{Z ∈ dz}   (memorylessness of X)
= P{X > Z}
Alternate way to view a random (geometric) sum of i.i.d. exponential random variables:
Think of X as the time of the first arrival in a Poisson process with rate λ.
Think of Y as the time of the first arrival in an independent Poisson process with rate µ.
P{X < Y} = P{first arrival in the superposition process is from the first process} = λ/(λ + µ)
Decomposition of a Poisson process
Each arrival in a Poisson process N with rate λ has probability pk of being of “type” k, with Σ_{k=1}^{n} pk = 1.
We can decompose N into independent Poisson processes consisting of only the arrivals of type k; call the kth one N^{(k)}, with rate pk λ.
For a nonhomogeneous Poisson process with rate function λ_u, the number of arrivals N_A in a time set A satisfies
P{N_A = k} = e^{−µ(A)} (µ(A))^k/k!
where µ(A) = ∫_A λ_u du
Intuition: plot λ_t over t; for an infinitesimally small time interval the function is approximately constant, and the mean of the Poisson distribution of arrivals in this interval is the area under the curve.
6.1 Introduction
The Poisson process (where N(t) is the state of the process) is a continuous-time Markov chain with states {0, 1, 2, . . .} that always proceeds from state n to n + 1, where n ≥ 0.
A process is a pure birth process if the state of the system always increases by one in any transition.
A process is a birth and death model if the state of the system is n + 1 or n − 1 after a transition from state n.
Definition A continuous-time stochastic process {X(t), t ≥ 0} whose state space is {0, 1, 2, . . .} is a continuous-time Markov chain if for all s, t ≥ 0 and nonnegative integers i, j, x(u), 0 ≤ u < s,
P{X(t + s) = j|X(s) = i, X(u) = x(u), 0 ≤ u < s} = P{X(t + s) = j|X(s) = i}
or in other words, it has the Markovian property that the conditional distribution of the future X(t + s) given the present X(s) and the past X(u), 0 ≤ u < s, depends only on the present and is independent of the past.
If in addition,
P{X(t + s) = j|X(s) = i}
is independent of s, then the continuous-time Markov chain is said to have stationary or homogeneous transition probabilities.
All Markov chains considered here will be assumed to have stationary transition probabilities.
If we let Ti denote the amount of time that the process stays in state i before making a transition into a different state, then Ti is memoryless and hence exponentially distributed.
Definition An alternate way to define a continuous-time Markov chain: a stochastic process with the properties that each time it enters state i, the amount of time it spends in the state before transitioning into a different state is exponentially distributed with mean 1/vi, and when the process leaves state i, it enters state j with some probability Pij satisfying, for all i, Pii = 0 and Σ_j Pij = 1.
In other words, a continuous-time Markov chain is a stochastic process that moves from state to state in accordance with a
discrete-time Markov chain, and the amount of time spent in each state before proceeding to the next state is exponentially
distributed. Additionally, the amount of time the process spends in state i and in the next state visited must be independent
(Markov property).
6.7 Uniformization
9 Reliability Theory
9.1 Introduction
system of n components; each component is either functioning or failed; indicator variable xi for the ith component:
xi = { 1, ith component functioning;  0, ith component failed }
state vector:
x = (x1, . . . , xn)
series structure: functions iff all components function:
φ(x) = min(x1, . . . , xn) = Π_{i=1}^{n} xi
Ex: a four-component structure that functions iff 1 and 2 both function and at least one of 3 and 4 functions:
φ(x) = x1 x2 max(x3, x4)
φ(x) is an increasing function of x; replacing a failed component by a functioning one will never lead to a deterioration of
the system
i.e., xi ≤ yi , ∀i ∈ {1, . . . , n} ⇒ φ(x) ≤ φ(y)
a system is thus monotone
let A1, . . . , As denote the minimal path sets. Define αj(x) as the indicator function of the jth minimal path set:
αj(x) = { 1, all components of Aj are functioning;  0, otherwise } = Π_{i∈Aj} xi
A system functions iff all components of at least one minimal path set are functioning:
φ(x) = { 1, αj(x) = 1 for some j;  0, αj(x) = 0 for all j } = max_j αj(x) = max_j Π_{i∈Aj} xi
let C1, . . . , Ck denote the minimal cut sets. Define βj(x) as the indicator function of the jth minimal cut set:
βj(x) = { 1, at least one component of Cj is functioning;  0, all components in Cj are not functioning } = max_{i∈Cj} xi
A system fails iff all components of at least one minimal cut set are not functioning:
φ(x) = Π_{j=1}^{k} βj(x) = Π_{j=1}^{k} max_{i∈Cj} xi
Now suppose the components function independently, with
P{Xi = 1} = pi = 1 − P{Xi = 0}
where X = (X1, . . . , Xn).
Define the reliability function r(P) = P{φ(X) = 1}, where P = (p1, . . . , pn).
reliability of a series system:
r(P) = P{φ(X) = 1} = P{Xi = 1, ∀i ∈ {1, . . . , n}} = Π_{i=1}^{n} pi
Theorem If r(P ) is the reliability function of a system of independent components, then r(P ) is an increasing function of
P.
Proof:
Condition on the ith component:
r(P) = E[φ(X)] = pi E[φ(1i, X)] + (1 − pi) E[φ(0i, X)]
where (1i, X) = (X1, . . . , X_{i−1}, 1, X_{i+1}, . . . , Xn) and (0i, X) = (X1, . . . , X_{i−1}, 0, X_{i+1}, . . . , Xn), so
∂r(P)/∂pi = E[φ(1i, X) − φ(0i, X)]
but since φ is an increasing function, E[φ(1i, X) − φ(0i, X)] ≥ 0, so r(P) increases in pi for all i.
For two independent systems with component probabilities P and P′,
P{at least one system functions} = 1 − P{neither system functions} = 1 − (1 − r(P))(1 − r(P′))
Theorem r[1 − (1 − P)(1 − P′)] ≥ 1 − [1 − r(P)][1 − r(P′)], since 1 − (1 − pi)(1 − p′i) is the probability that the ith component functions; or equivalently,
E[φ(max(X, X′))] ≥ E[max(φ(X), φ(X′))]
Proof:
Because φ is monotonically increasing, φ(max(X, X′)) is greater than or equal to both φ(X) and φ(X′), so φ(max(X, X′)) ≥ max(φ(X), φ(X′)). So,
r[1 − (1 − P)(1 − P′)] = E[φ(max(X, X′))] ≥ E[max(φ(X), φ(X′))] = P{max(φ(X), φ(X′)) = 1} = 1 − [1 − r(P)][1 − r(P′)]
9.4 Bounds on the Reliability Function
For a distribution function G, define Ḡ(a) ≡ 1 − G(a), the probability that the random variable is greater than a.
The ith component functions for a random length of time (with distribution Fi), then fails.
Let F denote the distribution of the system lifetime; then
F̄(t) = P{system life > t} = P{system is functioning at time t} = r(P1(t), . . . , Pn(t))
where
Pi(t) = P{component i is functioning at t} = P{lifetime of i > t} = F̄i(t)
So,
F̄(t) = r(F̄1(t), . . . , F̄n(t))
system life in a series system: r(P) = Π_{i=1}^{n} pi, so
F̄(t) = Π_{i=1}^{n} F̄i(t)
system life in a parallel system: r(P) = 1 − Π_{i=1}^{n} (1 − pi), so
F̄(t) = 1 − Π_{i=1}^{n} Fi(t)
in §5.2.2, if G is the distribution of the lifetime of an item, then λ(t) represents the probability intensity that a t-year-old
item will fail
G is an increasing failure rate (IFR) distribution if λ(t) is an increasing function of t
G is a decreasing failure rate (DFR) distribution if λ(t) is a decreasing function of t
A random variable has a Weibull distribution if its distribution is, for some λ > 0, α > 0,
G(t) = 1 − e^{−(λt)^α}
If we define r(Ḡ(t)) ≡ r(Ḡ(t), . . . , Ḡ(t)) (i.i.d. component lifetimes), then
λ_F(t) = (d/dt F(t))/F̄(t) = (d/dt [1 − r(Ḡ(t))])/r(Ḡ(t)) = G′(t) r′(Ḡ(t))/r(Ḡ(t)) = (G′(t)/Ḡ(t)) · (Ḡ(t) r′(Ḡ(t))/r(Ḡ(t))) = λ_G(t) [p r′(p)/r(p)]_{p=Ḡ(t)}
Since G(t) is a decreasing function of t, if each component of a coherent system has the same IFR lifetime distribution, then
the distribution of system lifetime will be IFR if pr0 (p)/r(p) is a decreasing function of p.
See book for example on IFR k-out-of-n system and a non-IFR parallel system
See p 606 for discussion on mixtures
If a distribution F(t) has density f(t) = F′(t), then
λ(t) = f(t)/(1 − F(t))
∫_0^t λ(s) ds = ∫_0^t f(s)/(1 − F(s)) ds = −log F̄(t)
So,
F̄(t) = e^{−Λ(t)}
where Λ(t) = ∫_0^t λ(s) ds is the [cumulative] hazard function of the distribution F.
A distribution F has an increasing failure rate on the average (IFRA) if
Λ(t)/t = (∫_0^t λ(s) ds)/t
increases in t for t ≥ 0: the average failure rate up to time t increases as t increases.
F is IFR ⇒ F is IFRA, but not necessarily the converse:
F is IFRA
⇔ Λ(s)/s ≤ Λ(t)/t whenever 0 ≤ s ≤ t
⇔ Λ(αt)/(αt) ≤ Λ(t)/t for 0 ≤ α ≤ 1, ∀t ≥ 0
⇔ −log F̄(αt) ≤ −α log F̄(t)
⇔ log F̄(αt) ≥ α log F̄(t)
⇔ F̄(αt) ≥ F̄(t)^α, for 0 ≤ α ≤ 1, ∀t ≥ 0, because log is a monotone function
For a vector P = (p1, . . . , pn), define P^α = (p1^α, . . . , pn^α) for 0 ≤ α ≤ 1.
Proposition For a monotone system of independent components, r(P^α) ≥ [r(P)]^α for 0 ≤ α ≤ 1.
Proof
If n = 1, then r(p) ≡ 0, r(p) ≡ 1, or r(p) ≡ p. In all three cases the inequality is satisfied.
Assume the proposition holds for n − 1 components. Consider a system of n components with structure function φ. Condition on whether or not the nth component is functioning:
r(P^α) = pn^α r(1n, P^α) + (1 − pn^α) r(0n, P^α)
Consider a system of components 1 through n − 1 having structure function φ1(x) = φ(1n, x) (watch the subscripts), so the reliability function is r1(P) = r(1n, P); from the inductive assumption, r(1n, P^α) ≥ [r(1n, P)]^α.
Similarly, consider a system of components 1 through n − 1 having structure function φ0(x) = φ(0n, x); then r(0n, P^α) ≥ [r(0n, P)]^α.
So,
r(P^α) ≥ pn^α [r(1n, P)]^α + (1 − pn^α)[r(0n, P)]^α ≥ [pn r(1n, P) + (1 − pn) r(0n, P)]^α = [r(P)]^α
(see the lemma).
Lemma If 0 ≤ α ≤ 1 and 0 ≤ λ ≤ 1, then
λ^α x^α + (1 − λ^α) y^α ≥ (λx + (1 − λ)y)^α
for 0 ≤ y ≤ x.
Theorem For a monotone system of independent components, if each component has an IFRA lifetime distribution, then
the distribution of system lifetime is itself IFRA.
Proof: see the book.
Thus, for i.i.d. component lifetimes with common survival function F̄,
E[system life] = ∫_0^∞ P{system life > t} dt = ∫_0^∞ r(F̄(t)) dt
Consider a k-out-of-n system of i.i.d. exponential components. If θ is the mean lifetime of each component, then
F̄i(t) = ∫_t^∞ (1/θ) e^{−x/θ} dx = e^{−t/θ}
we have
E[system life] = ∫_0^∞ Σ_{i=k}^{n} C(n, i) (e^{−t/θ})^i (1 − e^{−t/θ})^{n−i} dt
= θ Σ_{i=k}^{n} (n!/((n − i)! i!)) ((i − 1)!(n − i)!/n!) = θ Σ_{i=k}^{n} 1/i
Another approach
Lifetime of a k-out-of-n system can be written as T1 + · · · + Tn−k+1 where Ti represents time between the (i − 1)st
and ith failure. T1 + · · · + Tn−k+1 represents the time when the (n − k + 1)st component fails which is the moment
that the number of functioning components is less than k. When all n components are functioning, the rate at
which failures occur is n/θ; i.e., T1 is exponentially distributed with mean θ/n. Therefore, Ti represents the time
until the next failure when there are n − (i − 1) functioning components; Ti is exponentially distributed with mean
θ/(n − i + 1). So,
E[T1 + ··· + T_{n−k+1}] = θ (1/n + ··· + 1/k)
(Covers §9.5)
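A simulation check of the k-out-of-n lifetime formula (my own sketch; n, k, θ are made up):

```python
# A k-out-of-n system of i.i.d. exponential(mean theta) components dies at
# the (n-k+1)st component failure, i.e. the (n-k+1)st order statistic.
import random

n, k, theta, reps = 5, 3, 2.0, 50_000        # assumed values
total = 0.0
for _ in range(reps):
    lifetimes = sorted(random.expovariate(1 / theta) for _ in range(n))
    total += lifetimes[n - k]                # (n-k+1)st smallest lifetime
print(total / reps, "vs theta*sum(1/i, i=k..n) =",
      theta * sum(1 / i for i in range(k, n + 1)))
```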
Let L represent a lifetime. P{L > t} is such that P{L > 0} = 1 and lim_{t→∞} P{L > t} = 0.
We can express it as
P{L > t} = e^{−H(t)}
where the cumulative hazard function H(t) = −log P{L > t} is nondecreasing in t, has H(0) = 0, and lim_{t→∞} H(t) = ∞.
Thus, if X follows an exponential distribution with parameter 1, P{X > u} = e^{−u}, so P{L > t} = P{X > H(t)}.
Let H(t) = ∫_0^t h(s) ds and h(t) = d/dt H(t).
Exponential distribution:
P{L > t} = e−λt , t ≥ 0
H(t) = λt, t ≥ 0
h(t) = λ
Interpretation: for a lifetime that follows an exponential distribution, since the hazard function is constant, its
probability of dying at any given moment is independent of how long it has lived so far.
Weibull Distribution:
α
P{L > t} = e−(λt) , t ≥ 0
H(t) = (λt)α , t ≥ 0
h(t) = αλ(λt)α−1
Let F(t) = P{L ≤ t}, f(t) = d/dt F(t), and F̄(t) = 1 − F(t).
Then
h(t) = d/dt (−log P{L > t}) = (−d/dt P{L > t})/P{L > t} = f(t)/F̄(t)
For small ∆t,
P{L ≤ t + ∆t|L > t} = 1 − exp(−(H(t + ∆t) − H(t))) ≈ 1 − exp(−h(t)∆t) ≈ h(t)∆t
lim_{u→0} (1/u) P{L ≤ t + u|L > t} = h(t)
Note: if a lifetime has an increasing failure rate (IFR), the hazard function h(t) must be increasing in t.
Example: two ways to compute lifetime of parallel structure
Three components in parallel with lifetimes that follow exponential distributions of parameters λ1 , λ2 , λ3 .
P{L ≤ t} = P{L1 ≤ t}P{L2 ≤ t}P{L3 ≤ t} = (1 − e−λ1 t )(1 − e−λ2 t )(1 − e−λ3 t )
E[L] = ∫_0^∞ P{L > t} dt
= ∫_0^∞ [1 − (1 − e^{−λ1t})(1 − e^{−λ2t})(1 − e^{−λ3t})] dt
= ∫_0^∞ [e^{−λ1t} + e^{−λ2t} + e^{−λ3t} − e^{−(λ1+λ2)t} − e^{−(λ1+λ3)t} − e^{−(λ2+λ3)t} + e^{−(λ1+λ2+λ3)t}] dt
= 1/λ1 + 1/λ2 + 1/λ3 − 1/(λ1 + λ2) − 1/(λ1 + λ3) − 1/(λ2 + λ3) + 1/(λ1 + λ2 + λ3)
(using the tail-probability formula E[L] = ∫_0^∞ P{L > t} dt for the expected value)
Suppose all have the same distribution. Then what?
Method 1:
Use the above formula to get 3/λ − 3/(2λ) + 1/(3λ) = 11/(6λ)
Method 2:
Three time intervals:
L1 : duration for which 3 components are working; ∼ expon(3λ)
L2 : duration for which 2 components are working: ∼ expon(2λ)
L3 : duration for which 1 component is working: ∼ expon(λ)
E[L] = E[L1 + L2 + L3 ] = 1/(3λ) + 1/(2λ) + 1/(λ) = 11/(6λ)
De Moivre showed that the binomial distribution with n → ∞ approximates the Gaussian (see the de Moivre-Laplace theorem).
Bachelier: Gsn(0, 1/(2π)): P{X ∈ dx} = e^{−πx²} dx
Moment generating function for X ∼ Gsn(0, 1):
E[e^{rX}] = ∫_{−∞}^{∞} e^{rx} e^{−x²/2}/√(2π) dx
= ∫_{−∞}^{∞} e^{−(x² − 2rx)/2} (1/√(2π)) dx
= ∫_{−∞}^{∞} e^{−(x² − 2rx + r²)/2} e^{r²/2} (1/√(2π)) dx
= e^{r²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x−r)²/2} dx
= e^{r²/2}
Differentiating the moment generating function φ(r) = e^{r²/2} and setting r = 0 gives the moments:
(d^n/dr^n) e^{r²/2} |_{r=0} = E[X^n]
From this we have E[Z²] = 1, E[Z⁴] = 3, E[Z⁶] = 15, as well as E[Z¹] = E[Z³] = E[Z⁵] = ··· = 0, which can be seen from the symmetry of the pdf of Z. In general,
E[Z^n] = { 0, n odd;  n!/(2^{n/2} (n/2)!), n even }
Let X ∼ Gsn(µ, σ²).
P{(X − µ)/σ ≤ y} = P{X ≤ µ + σy} = ∫_{−∞}^{µ+σy} (1/√(2πσ²)) e^{−(x−µ)²/2σ²} dx
P{(X − µ)/σ ∈ dy} = [d/dy ∫_{−∞}^{µ+σy} (1/√(2πσ²)) e^{−(x−µ)²/2σ²} dx] dy
= (1/√(2πσ²)) e^{−((µ+σy)−µ)²/2σ²} σ dy
= (e^{−y²/2}/√(2π)) dy
where we use d/dy (g(h(y))) = g′(h(y)) h′(y) with g(z) = ∫_{−∞}^{z} (1/√(2πσ²)) e^{−(x−µ)²/2σ²} dx and h(y) = µ + σy.
So
X ∼ Gsn(µ, σ²) ⇔ X = µ + σZ
where Z ∼ Gsn(0, 1)
The generating function for X is E[e^{rX}] = E[e^{r(µ+σZ)}] = e^{rµ} e^{(rσ)²/2} = e^{rµ + r²σ²/2}, and since the moment generating function determines the distribution uniquely, any distribution with this generating function must be Gsn(µ, σ²):
X ∼ Gsn(µ, σ²) ⇔ E[e^{rX}] = e^{rµ + r²σ²/2}
Linear combinations: let X ∼ Gsn(µ, σ²) and Y ∼ Gsn(ν, τ²) be independent. Then
αX + βY ∼ Gsn(αµ + βν, α²σ² + β²τ²)
This can be shown by using the moment generating function and the independence of X and Y:
E[e^{rαX}] = e^{αµr + r²α²σ²/2};  E[e^{rβY}] = e^{βνr + r²β²τ²/2}
E[e^{r(αX+βY)}] = E[e^{rαX}] · E[e^{rβY}] = e^{(αµ+βν)r + r²(α²σ²+β²τ²)/2} ⇔ αX + βY ∼ Gsn(αµ + βν, α²σ² + β²τ²)
Gaussian-gamma connection:
Z ∼ Gsn(0, 1) ⇒ Z² ∼ gamma(1/2, 1/2)
Proof:
P{Z² ≤ t} = P{−√t ≤ Z ≤ √t} = ∫_{−√t}^{√t} (1/√(2π)) e^{−x²/2} dx = 2 ∫_0^{√t} (1/√(2π)) e^{−x²/2} dx
d/dt P{Z² ≤ t} = (1/√(2π)) e^{−t/2} t^{−1/2} = λe^{−λt} (λt)^{α−1}/Γ(α)
where λ = α = 1/2, which also shows that Γ(1/2) = √π. Also see §2.10.
In §3.8, we showed that if X ∼ gamma(α, λ) and Y ∼ gamma(β, λ) are independent, then X + Y ∼ gamma(α + β, λ). Then, if Z1, . . . , Zn are i.i.d. with distribution Gsn(0, 1),
Z1² + ··· + Zn² ∼ gamma(n/2, 1/2)
If X and Y are i.i.d. with distribution Gsn(0, 1), then X² + Y² ∼ gamma(1, 1/2) ≡ expon(1/2).
Let X and Y be the x- and y-coordinates of a random point in R²; (X, Y) is the landing point of a dart aimed at the origin.
Let R = √(X² + Y²) be the distance from the origin.
Let A be the angle made by the vector ⟨X, Y⟩, measured CCW from the positive x-axis.
Then
• R² ∼ expon(1/2)
• A ∼ U(0, 2π]
• R and A are independent
This is used in Monte Carlo studies for generating Gaussian variables: given R² ∼ expon(1/2) and A ∼ U(0, 2π], then R cos A and R sin A are independent Gsn(0, 1) variables.
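A minimal sketch of this generator (my own, essentially the Box-Muller method; random.expovariate takes the rate, so rate 1/2 gives R² with mean 2):

```python
# Generate standard Gaussians from R^2 ~ expon(1/2) and A ~ U(0, 2*pi].
import math
import random

def gaussian_pair():
    r = math.sqrt(random.expovariate(0.5))   # R^2 exponential with rate 1/2
    a = random.uniform(0.0, 2 * math.pi)
    return r * math.cos(a), r * math.sin(a)

xs = [gaussian_pair()[0] for _ in range(100_000)]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean**2
print(mean, var)                              # ~0 and ~1
```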
We can derive the pdf of the arcsine distribution (beta(1/2, 1/2)) with this interpretation. X² and Y² have distribution gamma(1/2, 1/2), and using the fact that (sin A)² ≤ u iff A ∈ [0, arcsin √u] ∪ [π − arcsin √u, π] ∪ [π, π + arcsin √u] ∪ [2π − arcsin √u, 2π), we find:
P{X²/(X² + Y²) ≤ u} = P{(sin A)² ≤ u} = 4 arcsin √u/(2π) = (2/π) arcsin √u
P{X²/(X² + Y²) ∈ du} = du/(π√(u(1 − u)))
Cauchy distribution
If X and Y are independent Gsn(0, 1) variables, the distribution of T = X/Y is called the standard Cauchy distribution, which has pdf
p(z) = 1/((1 + z²)π)
for z ∈ R.
Note that T and 1/T have the same distribution.
1/T = Y /X = tan A where A ∼ U(0, 2π]
One way to derive the standard Cauchy density using this:
E[f(T)] = E[f(tan A)] = ∫_0^{2π} f(tan a) (1/(2π)) da
= ∫_{(0, π/2)∪(3π/2, 2π)} f(tan a) (1/(2π)) da + ∫_{π/2}^{3π/2} f(tan a) (1/(2π)) da   (resolve the non-uniqueness of arctan)
= ∫_{−∞}^{∞} f(x) (1/(2π)) (1/(1 + x²)) dx + ∫_{−∞}^{∞} f(x) (1/(2π)) (1/(1 + x²)) dx   (tan a = x; a = arctan x; da = dx/(1 + x²))
= ∫_{−∞}^{∞} f(x) (1/(π(1 + x²))) dx
So we get the pdf of the Cauchy distribution.
T has no expected value:
∫_{−∞}^{0} z p(z) dz = −∞
∫_{0}^{∞} z p(z) dz = ∞
10.1.2 Gaussian vectors
X = (X1, . . . , Xn) is a Gaussian vector if every linear combination
α · X = α1X1 + ··· + αnXn
is Gaussian; equivalently, if
Xi = µi + Σ_{j=1}^{n} aij Zj,  ∀i ∈ {1, . . . , n}
where Z1, . . . , Zn are independent Gsn(0, 1) variables, µ1, . . . , µn ∈ R, and aij ∈ R, ∀i, j ∈ {1, . . . , n}.
Example
(X1; X2) = (1.2; −0.8) + [0.3 0 0.5; 0.8 0.1 2.0](Z1; Z2; Z3)
i.e., X1 = 1.2 + 0.3Z1 + 0.5Z3 and X2 = −0.8 + 0.8Z1 + 0.1Z2 + 2.0Z3.
Let r ∈ R^n.
E[r · X] = E[r1X1 + ··· + rnXn] = r1µ1 + ··· + rnµn = r · µ⃗
Var(r · X) = E[(r · X − r · µ⃗)²]
= E[(r1(X1 − µ1) + ··· + rn(Xn − µn))²]
= Σ_{i=1}^{n} Σ_{j=1}^{n} ri rj E[(Xi − µi)(Xj − µj)]
= Σ_{i=1}^{n} Σ_{j=1}^{n} ri σij rj
= r · (Σr)
= rᵀ Σ r
where σij = Cov(Xi , Xj ), and Σ = [σij ]1≤i,j≤n is the covariance matrix.
E[e^{r·X}] = exp(r · µ + (1/2) rᵀΣr)
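A sketch (my own construction, not from the notes) building a Gaussian vector X = µ + AZ from the example above and checking that its covariance matrix is Σ = AAᵀ; seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.2, -0.8])
A = np.array([[0.3, 0.0, 0.5],
              [0.8, 0.1, 2.0]])   # the 2x3 matrix from the example

Z = rng.normal(size=(3, 1_000_000))   # independent Gsn(0, 1) coordinates
X = mu[:, None] + A @ Z               # each column is one Gaussian vector

print(X.mean(axis=1))   # ~ mu
print(np.cov(X))        # ~ Sigma = A A^T
print(A @ A.T)
```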
Theorem If X = (X1, . . ., Xn) is Gaussian, then Xi and Xj are independent if and only if Cov(Xi, Xj) = 0. Note that X must be Gaussian; otherwise, it is possible for Xi and Xj to each be Gaussian with covariance 0 yet still be dependent. More generally, if X is Gaussian, then X1, . . ., Xn are independent if and only if the covariance matrix Σ is diagonal, that is, σij = 0 for all i ≠ j.
Definition Let X = {Xt }t≥0 be a stochastic process with state space R (that is, Xt ∈ R). Then X is a Gaussian process
if X = (Xt1 , . . . , Xtn ) is Gaussian for all choices of n ≥ 2 and t1 , . . . , tn .
If X is a Gaussian process, the distribution of X is specified for all n and t1, . . ., tn once we specify the mean function m(t) = E[Xt] and the covariance function v(s, t) = Cov(Xs, Xt). m is arbitrary, but v must be symmetric (i.e., v(s, t) = v(t, s)) and positive-definite (i.e., rᵀΣr > 0; see the computation of Var(r · X) above).
Let 0 < r < 1 and c > 0 be fixed. Let Z1, Z2, . . . be i.i.d. Gsn(0, 1), and let Y0 ∼ Gsn(a0, b0²) be independent of the Zn. Define the process Y = {Yn}n∈{0,1,...} recursively by
Yn+1 = rYn + cZn+1
Or in words, “the current value is equal to a proportion of the previous value plus an additive randomness.”
Note that cZn ∼ Gsn(0, c²).
The vector (Y0 , Y1 , . . . , Yn ) is Gaussian for any n because each Yi is a linear combination of independent Gaussian random
variables. Therefore, Y = {Yn }n∈{0,1,...} is a Gaussian process.
an = E[Yn] = r^n a0

vn,n = Var(Yn) = r^{2n} b0² + c²(r^{2n−2} + · · · + 1) = r^{2n} b0² + c² (1 − r^{2n})/(1 − r²)

lim_{n→∞} an = 0
lim_{n→∞} vn,n = c²/(1 − r²)
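A simulation sketch (mine; seed and parameters are assumptions) of this recursion, checking the limiting mean and variance:

```python
import numpy as np

rng = np.random.default_rng(3)
r, c = 0.9, 1.0
a0, b0 = 5.0, 2.0
n_steps, n_paths = 200, 100_000

Y = rng.normal(a0, b0, size=n_paths)          # Y_0 ~ Gsn(a0, b0^2)
for _ in range(n_steps):
    Y = r * Y + c * rng.normal(size=n_paths)  # Y_{n+1} = r Y_n + c Z_{n+1}

print(Y.mean())                     # ~ r^n a0, essentially 0 here
print(Y.var(), c**2 / (1 - r**2))   # ~ limiting variance c^2/(1 - r^2)
```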
10.2 Brownian Motion
Definition A process X = {Xt}t≥0 is a Brownian motion if t ↦ Xt is continuous and X has
• independent increments
• stationary increments
By stationarity, Xs+t − Xs has the same distribution as Xt − X0, and we concentrate on the latter here. Take 0 = t0 < t1 < · · · < tn = t such that ti − ti−1 = t/n. Then

Xt − X0 = (Xt1 − Xt0) + (Xt2 − Xt1) + · · · + (Xtn − Xtn−1)

expresses Xt − X0 as the sum of n i.i.d. random variables. Moreover, since X is continuous, the terms on the right side are small when n is large. By the central limit theorem, the distribution of the right side is approximately Gaussian. Since n can be as large as desired, Xt − X0 must have a Gaussian distribution, say, with mean a(t) and variance b(t). To determine their forms, write

Xs+t − X0 = (Xs − X0) + (Xs+t − Xs)

and note that the two terms on the right side are independent with means a(s) and a(t), and variances b(s) and b(t). Thus,

a(s + t) = a(s) + a(t),  b(s + t) = b(s) + b(t).

It follows that a(t) = µt and b(t) = σ²t for some constant µ (the drift) and some positive constant σ² (σ is the volatility).
Definition A process W = {Wt}t≥0 is a Wiener process if it is a Brownian motion with W0 = 0, E[Wt] = 0, Var(Wt) = t.
Then Wt ∼ Gsn(0, t) and Wt − Ws ∼ Gsn(0, t − s) for s ≤ t.
Relationship between Brownian motion and the Wiener process
Let X be a Brownian motion with Xs+t − Xs having mean µt and variance σ²t. Then from Xt − X0 ∼ Gsn(µt, σ²t) and Wt ∼ Gsn(0, t),

((Xt − X0) − µt)/σ = Wt

Xt = X0 + µt + σWt
10.2.2 Poisson approximation
Take independent Poisson processes L = {Lt} and M = {Mt}, each with rate

λ = σ²/(2ε²),

and set Xtε = εLt − εMt. Then

E[e^{rXtε}] = E[e^{r(εLt − εMt)}] = E[e^{rεLt}] E[e^{−rεMt}]
 = exp{−λt(1 − e^{rε})} exp{−λt(1 − e^{−rε})}
 = exp{λt(e^{rε} + e^{−rε} − 2)}
 = exp{(σ²t/2) · (e^{rε} + e^{−rε} − 2)/ε²}

using the fact that E[e^{rεLt}] = Σ_{k=0}^∞ e^{rεk} e^{−λt} (λt)^k/k! = e^{−λt} e^{λt e^{rε}} = exp{−λt(1 − e^{rε})}, and similarly, that E[e^{−rεMt}] = exp{−λt(1 − e^{−rε})}.
Using l'Hôpital's rule twice,

lim_{ε→0} E[e^{rXtε}] = exp{(1/2)σ²tr²} = E[e^{rY}] where Y ∼ Gsn(0, σ²t)

In the last step, we recognized that if Y ∼ Gsn(0, σ²t), then Y/√(σ²t) = Z ∼ Gsn(0, 1) and that, as we showed previously, E[e^{rZ}] = e^{r²/2}, so we have E[e^{rY}] = E[e^{(r√(σ²t))Z}] = e^{r²σ²t/2}.
So we have

lim_{ε→0} P{Xtε ≤ x} = ∫_{−∞}^x (1/√(2πσ²t)) exp{−y²/(2σ²t)} dy
If we let Xt = limε→0 Xtε , we have the following properties for the process {Xt }t≥0 :
• t ↦ Xt is continuous (unlike a counting process, which is integer-valued and nondecreasing)
• independent increments
• stationary increments
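A simulation sketch (mine, not from the notes) of the approximation above: for small ε, the ±ε Poisson walk Xtε = ε(Lt − Mt) is close to Gsn(0, σ²t). Seed and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, t, eps = 1.5, 2.0, 0.01
lam = sigma**2 / (2 * eps**2)       # rate of each Poisson process

n = 100_000
L = rng.poisson(lam * t, size=n)    # L_t
M = rng.poisson(lam * t, size=n)    # M_t, independent of L_t
X = eps * (L - M)                   # X_t^eps

print(X.mean(), X.var(), sigma**2 * t)   # mean ~ 0, variance ~ sigma^2 t
```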
Let X = {Xt } be a Brownian motion with X0 = 0. Fix n ≥ 1 and 0 ≤ t1 < t2 < . . . < tn . Then Xt1 , Xt2 − Xt1 ,. . .,
Xtn − Xtn−1 are independent and Gaussian. Since the vector (Xt1 , . . . , Xtn ) is obtained from a linear transformation of those
increments, (Xt1 , . . . , Xtn ) is n-dimensional Gaussian.
Theorem Let X be a Brownian motion with X0 = 0, drift µ, volatility σ. Then X is a Gaussian process with mean function
m(t) = µt and covariance function v(s, t) = σ 2 (s ∧ t). Conversely, if X is a Gaussian process with these mean and covariance
functions, and if X is continuous, then X is a Brownian motion with X0 = 0, drift µ, and volatility σ.
Corollary A process W = {Wt }t≥0 is a Wiener process if and only if it is continuous and Gaussian with E[Wt ] = 0 and
Cov(Ws , Wt ) = s ∧ t.
• symmetry: Let W̄t = −Wt for t ≥ 0; then W̄ is a Wiener process
• time inversion: Let W̃t = tW1/t for t > 0 (note strict inequality) and W̃0 = 0; then W̃ is a Wiener process
Proof
W̃ is Gaussian with E[W̃t] = 0 and, for s ≤ t,

Cov(W̃s, W̃t) = st Cov(W1/s, W1/t) = st (1/s ∧ 1/t) = st (1/t) = s = s ∧ t,

which is true for Wiener processes. However, we need to check the continuity of W̃ at t = 0:

lim_{t→0} W̃t = lim_{t→0} tW1/t = lim_{u→∞} (1/u) Wu = 0 = W̃0

where we used u = 1/t.
The intuitive idea behind this is the strong law of large numbers: pick δ > 0 and set u = nδ; then Wu/u → 0 as n → ∞.

For s < t,

P{W̃s ≤ x | W̃t = z} = P{sW1/s ≤ x | tW1/t = z}
 = P{W1/s ≤ x/s | W1/t = z/t}
 = P{W1/s − W1/t ≤ x/s − z/t}

since the increment W1/s − W1/t (note 1/s > 1/t) is independent of W1/t. Time inversion allowed us to change a condition on the future to a condition on the past.
Brownian bridge
Let Xt = Wt − tW1 for t ∈ [0, 1]. It is called a “bridge” because it is tied down for t = 0 and t = 1, that is,
X0 = X1 = 0. It is used to model things, such as bonds, whose values change over time, but whose values are
known at the end of time (t = 1).
E[Xt] = 0

Var(Xt) = Var(Wt − tW1) = Var(Wt) − 2t Cov(Wt, W1) + t² Var(W1) = t − 2t·t + t² = t(1 − t)

The variance makes sense: it is 0 at t = 0 and t = 1, and reaches its maximum of 1/4 at t = 1/2.
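A quick sketch (mine; seed, grid, and path counts are assumptions) simulating the bridge Xt = Wt − tW1 on a grid and checking Var(Xt) = t(1 − t):

```python
import numpy as np

rng = np.random.default_rng(5)
n_steps, n_paths = 1000, 20_000
dt = 1.0 / n_steps

# Wiener paths on [0, 1]: cumulative sums of Gsn(0, dt) increments.
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
t = np.linspace(dt, 1.0, n_steps)

X = W - t * W[:, -1:]                      # bridge: X_t = W_t - t W_1
k = n_steps // 2                           # look near t = 1/2
print(X[:, k].var(), t[k] * (1 - t[k]))    # both ~ 1/4
```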
10.2.4 Hitting Times for Wiener Processes
Define Ta := min{t ≥ 0 : Wt = a}, which is the first time the Wiener process has displacement a > 0.
Let us denote Fs = {Wr : r ≤ s} as the “past” from time s.
Markov property of W Fix s ≥ 0 and define Ŵt = Ws+t − Ws for t ≥ 0.
Ŵ = {Ŵt} is a Wiener process because disjoint increments are independent and because Ŵt1 − Ŵt0 ∼ Gsn(0, t1 − t0).
Conceptually, this is saying that if we have a Wiener process, freeze it at time s, reset the origins of time and space to the current time/position, and then continue, the result is still a Wiener process with respect to the new origin.
Distribution of Ta
P{Wt > 0} = 1/2 because Wt ∼ Gsn(0, t), and likewise

P{Ŵt > 0} = 1/2

for the process restarted at time Ta (when the particle sits at level a), so

P{Wt > a | Ta < t} = 1/2

Since {Wt > a} ⊂ {Ta < t}, we get P{Wt > a} = P{Wt > a | Ta < t} P{Ta < t} = (1/2) P{Ta < t}, and therefore

P{Ta < t} = 2P{Wt > a} = 2P{√t Z > a} = 2P{Z > a/√t} = 2 ∫_{a/√t}^∞ (1/√(2π)) e^{−z²/2} dz = P{Ta ≤ t}
Remarks
P{Ta < ∞} = lim_{t→∞} P{Ta < t} = 2 ∫_0^∞ (1/√(2π)) e^{−z²/2} dz = 2 · (1/2) = 1
This states that for any a < ∞, the particle will hit a with probability 1.
pdf of Ta:

(d/dt) P{Ta ≤ t} = a e^{−a²/2t} / √(2πt³)

for t ≥ 0.
expected value:

E[Ta] = ∫_0^∞ t · (a e^{−a²/2t} / √(2πt³)) dt = ∞

Although the particle will hit every level a with probability 1, the expected time it takes to do so is ∞, no matter how small a is.
Also,

P{Ta < t} = 2P{Wt > a} = 2P{Z > a/√t} = P{|Z| > a/√t} = P{Z² > a²/t} = P{a²/Z² < t},

so Ta has the same distribution as a²/Z².
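A sketch (mine) checking Ta ≈ a²/Z² in distribution against discretized Wiener paths; the step size, horizon, and seed are assumptions, and medians are compared because the means are infinite:

```python
import numpy as np

rng = np.random.default_rng(6)
a, dt, t_max = 1.0, 1e-3, 20.0
n_steps = int(t_max / dt)
n_paths = 2000

# First passage times of simulated paths; inf if level a is not reached by t_max.
hits = np.full(n_paths, np.inf)
for i in range(n_paths):
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=n_steps))
    crossed = W >= a
    if crossed.any():
        hits[i] = (np.argmax(crossed) + 1) * dt

# Censoring at t_max does not disturb the median, since P{Ta <= t_max} > 1/2.
Z = rng.normal(size=100_000)
print(np.median(hits), np.median(a**2 / Z**2))
```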
Maximum process
Define Mt = max_{s≤t} Ws for t ≥ 0, the highest level reached during [0, t] by the Wiener particle. M0 = 0, t ↦ Mt is continuous and nondecreasing, and lim_{t→∞} Mt = +∞, since the particle will hit any level a. Between any two points, there are infinitely many "flats."
Similarly, if we define mt = min_{s≤t} Ws, the fact that lim_{t→∞} mt = −∞ shows that the set {t ≥ 0 : Wt = 0} is infinite, as the particle will cross 0 infinitely many times.
To derive the distribution of Mt: for a > 0, {Mt ≥ a} = {Ta ≤ t}, so

P{Mt ≥ a} = P{Ta ≤ t} = 2P{Wt > a} = P{|Wt| > a}

because E[Wt] = 0 (Wt is symmetric about 0). Hence Mt has the same distribution as |Wt|.
E[Mt] = E[|Wt|] = ∫_{−∞}^∞ |x| (1/√(2πt)) e^{−x²/2t} dx
 = 2 ∫_0^∞ x (1/√(2πt)) e^{−x²/2t} dx
 = (2t/√(2πt)) ∫_0^∞ (x/t) e^{−x²/2t} dx
 = √(2t/π) [−e^{−x²/2t}]_0^∞
 = √(2t/π)
Var(Mt) = E[Mt²] − E[Mt]²
 = E[|Wt|²] − 2t/π
 = E[Wt²] − 2t/π
 = t − 2t/π
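A check (mine; seed, grid, and horizon are assumptions) of E[Mt] = √(2t/π) and Var(Mt) = t − 2t/π by simulating running maxima:

```python
import numpy as np

rng = np.random.default_rng(7)
t, n_steps, n_paths = 4.0, 1000, 10_000
dt = t / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
# Discrete-time max slightly underestimates the true running maximum.
M = np.maximum(W.max(axis=1), 0.0)        # M_t = max_{s<=t} W_s, with M_0 = 0

print(M.mean(), np.sqrt(2 * t / np.pi))   # ~ sqrt(2t/pi)
print(M.var(), t - 2 * t / np.pi)         # ~ t - 2t/pi
```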
Hitting 0 and the Arcsine Law of Brownian Motion
Define Rt = (min{u ≥ t : Wu = 0}) − t, the time from t until the next time W touches 0. By resetting the origin of time/space at time t,

P{Rt > u | Wt = x} = P{T−x > u} = P{x²/Z² > u},

or in other words, the probability that it takes u time for W to hit 0, given that it has position x at time t, is the same as the probability that it takes another Wiener particle u time to reach position −x. So, Rt ≈ Wt²/Z² where Z and Wt are independent.
Since Wt ≈ √t Y where Y ∼ Gsn(0, 1), then

Rt ≈ tY²/Z²

Note that Y/Z ∼ Cauchy.
Define Dt = Rt + t = inf{u > t : Wu = 0} as the first time that W hits 0 after t. So

Dt ≈ t + tY²/Z² = t (Y² + Z²)/Z²

Note that Z²/(Y² + Z²) ∼ beta(1/2, 1/2), i.e., P{Z²/(Y² + Z²) ∈ du} = du/(π√(u(1 − u)))
Define Gt = max{u ≤ t : Wu = 0}, the last time W touches 0 before t. For s < t, the event {Gt < s} is the same as {Ds > t}, so

P{Gt < s} = P{Ds > t} = P{s (Y² + Z²)/Z² > t}
 = P{Z²/(Y² + Z²) < s/t}
 = ∫_0^{s/t} du/(π√(u(1 − u)))
 = ∫_0^{arcsin√(s/t)} (2/π) dx   (u = sin²x; du = 2 sin x cos x dx)
 = (2/π) arcsin√(s/t)
Note that the event {Gt < s} can be interpreted as the event that W does not hit 0 in the time interval (s, t). Notice also that

Gt ≈ t Z²/(Y² + Z²),

so Gt/t has the arcsine (beta(1/2, 1/2)) distribution.
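A sketch (mine; the sign-change proxy for zeros, grid, and seed are all assumptions) comparing the simulated last zero before t with the arcsine representation Gt ≈ tZ²/(Y² + Z²):

```python
import numpy as np

rng = np.random.default_rng(8)
t, n_steps, n_paths = 1.0, 2000, 5000
dt = t / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)

# Last sign change of the discretized path as a proxy for the last zero G_t.
sign_change = np.signbit(W[:, :-1]) != np.signbit(W[:, 1:])
last = sign_change.shape[1] - 1 - np.argmax(sign_change[:, ::-1], axis=1)
G = np.where(sign_change.any(axis=1), (last + 1) * dt, 0.0)

Y, Z = rng.normal(size=100_000), rng.normal(size=100_000)
print(G.mean(), t * np.mean(Z**2 / (Y**2 + Z**2)))   # both ~ t/2
```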
10.2.5 Geometric Brownian Motion
Let W be a Wiener process and define

Xt = X0 e^{µt+σWt}
for t ≥ 0. Then X = {Xt }t≥0 is a geometric Brownian motion with drift µ and volatility σ. Letting Y = log X, we have
Yt = Y0 + µt + σWt
(Yt − (Y0 + µt))/σ = Wt ∼ Gsn(0, t)
where Y is a Brownian motion with drift µ and volatility σ.
Treat X0 as fixed at some value x0. We have that Yt = log Xt ∼ Gsn(log x0 + µt, σ²t). We say that Xt follows the log-normal distribution. It follows from E[e^{rWt}] = e^{r²t/2} that (for fixed t),

E[Xt] = E[x0 e^{µt+σWt}] = x0 e^{µt} E[e^{σWt}] = x0 e^{µt+σ²t/2}.
If X0 = x0 > 0, then Xt > 0 for all t ≥ 0.
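A sketch (mine; parameter values and seed are assumptions) simulating geometric Brownian motion at a fixed time and checking E[Xt] = x0 e^{µt+σ²t/2}:

```python
import numpy as np

rng = np.random.default_rng(9)
x0, mu, sigma, t = 100.0, 0.05, 0.2, 1.0

W = rng.normal(0.0, np.sqrt(t), size=1_000_000)   # W_t ~ Gsn(0, t)
X = x0 * np.exp(mu * t + sigma * W)               # X_t = x0 e^{mu t + sigma W_t}

print(X.mean(), x0 * np.exp(mu * t + sigma**2 * t / 2))
print(np.log(X).mean(), np.log(x0) + mu * t)      # log X_t ~ Gsn(log x0 + mu t, sigma^2 t)
```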
Modeling stock prices
Remember:

X ∼ Y (equality in distribution) ⇔ E[f(X)] = E[f(Y)] for all bounded f
11 Branching Processes
Start with a single progenitor (root node)
Assume the number of children a node has is independent of everything else
Let pk be the probability a man has k children, k ∈ {0, 1, . . .}
Assume p0 > 0 (otherwise the tree will never be extinct)
Assume p0 + p1 ≠ 1 (otherwise the time to extinction is a geometric random variable: P{extinct at step n} = p1^{n−1} p0, and the probability of extinction is 1)
Let Xn be the number of nodes in the nth generation
Let the expected number of sons a man has be µ = Σ_{n=0}^∞ n pn

E[Xn+1 | Xn = k] = kµ

E[Xn+1] = Σ_{k=0}^∞ E[Xn+1 | Xn = k] P{Xn = k} = Σ_{k=0}^∞ (kµ) P{Xn = k} = µE[Xn]
E[X0] = 1; E[X1] = µ; E[X2] = µ²; . . .; E[Xn] = µ^n
(µ < 1) ⇒ (limn→∞ E[Xn ] = 0) ⇒ (limn→∞ Xn = 0)
(µ > 1) ⇒ (limn→∞ E[Xn ] = ∞)
Definition Let η = P{limn→∞ Xn = 0} = P{eventual extinction}
Theorem If we have the generating function g(z) = Σ_{k=0}^∞ pk z^k for z ∈ [0, 1], then η is the smallest solution in [0, 1] to g(z) = z.
Theorem: µ ≤ 1 if and only if η = 1; µ > 1 if and only if 0 < η < 1 and η is the unique solution to z = g(z), z ∈ (0, 1).
Proof:
Suppose that the root node has k children. Then the probability of extinction is the probability that each of the root's children's trees will eventually be extinct. We can view each of these children (1st generation) as root nodes of their own respective trees, so the probability that each of the children's trees will eventually be extinct is also η. The probability that all k lines are extinct is thus η^k. Letting N be the number of children of the root, we have

η = Σ_{k=0}^∞ P{extinction | N = k} P{N = k} = Σ_{k=0}^∞ pk η^k
Note the following properties of g(z) = Σ_{k=0}^∞ pk z^k = p0 + p1 z + p2 z² + · · ·, given the assumptions we made earlier:
g(0) = p0 > 0
g(1) = Σ_{k=0}^∞ pk = 1
g(z) increases in z
g′(z) = Σ_{k=0}^∞ k pk z^{k−1} = p1 + 2p2 z + 3p3 z² + · · · increases in z
g′(1) = Σ_{k=0}^∞ k pk = µ
We are concerned with when z = g(z) (graph both sides and find the intersections). We have two cases (two figures omitted: in the first, z = 1 is the only intersection in [0, 1]; in the second, there is another intersection in (0, 1)).

Consider the graph of z − g(z). In the second case, this difference reaches a maximum between the two intersections. If we define z0 such that this maximum occurs at z = z0, then we have (d/dz)(z − g(z))|_{z=z0} = 0, which is true iff g′(z0) = 1. Since g′(z) is increasing in z, we can conclude that in the second case, µ = g′(1) > g′(z0) = 1. The first case is the case that µ ≤ 1.
Let ηn = P{Xn = 0}.
Then, ηn = P{Xn = 0} ≤ P{Xn+1 = 0} = ηn+1 , ∀n, so η0 ≤ η1 ≤ · · ·
η0 = P{X0 = 0} = 0
η1 = P{X1 = 0} = p0
We use an argument similar to the one we used earlier. If the 1st generation has k children, we can view the (n + 1)st generation as the nth generations of each of the k children's trees. Thus,

ηn+1 = Σ_{k=0}^∞ pk (ηn)^k = g(ηn)
We can visually represent this recursive process as a cobweb diagram on the graph of g (figure omitted).
Example
Suppose the number of children is Poisson, pk = e^{−µ} µ^k/k!.

Then g(z) = Σ_{k=0}^∞ e^{−µ} (µ^k/k!) z^k = e^{−µ} Σ_{k=0}^∞ (µz)^k/k! = e^{−µ} e^{µz} = e^{−µ(1−z)}
Solve z = e^{−µ(1−z)}
Or find lim_{n→∞} ηn:
η1 = p0 = e^{−µ}
ηn+1 = e^{−µ(1−ηn)}
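A sketch (mine; the function name and tolerances are my own) of this fixed-point iteration ηn+1 = g(ηn), starting from η0 = 0 as above:

```python
import numpy as np

def extinction_probability(mu, tol=1e-12, max_iter=10_000):
    """Iterate eta_{n+1} = g(eta_n) = exp(-mu (1 - eta_n)) from eta_0 = 0."""
    eta = 0.0
    for _ in range(max_iter):
        new = np.exp(-mu * (1.0 - eta))
        if abs(new - eta) < tol:
            break
        eta = new
    return eta

for mu in (0.8, 1.0, 1.5, 2.0):
    # mu <= 1 gives eta ~ 1 (convergence is slow at mu = 1); mu > 1 gives eta < 1.
    print(mu, extinction_probability(mu))
```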
Let there be a series of i.i.d. Bernoulli trials with probability of success p, and let Xn be the indicator variable for the nth trial. Let the number of successes in n trials be S = Σ_{i=1}^n Xi.
What is the generating function for S?
g(z) = Σ_{k=0}^∞ z^k P{S = k} = E[z^S] = E[z^{X1} z^{X2} · · · z^{Xn}] = E[z^{X1}] E[z^{X2}] · · · E[z^{Xn}]

Each factor is E[z^{Xi}] = pz + q, where q = 1 − p. So, continuing,

g(z) = (pz + q)^n

Expanding,

Σ_{k=0}^∞ z^k P{S = k} = Σ_{k=0}^n (n choose k) (pz)^k q^{n−k}

and matching the coefficients of z^k,

P{S = k} = (n choose k) p^k q^{n−k}
Similarly, for X ∼ Pois(µ),

E[z^X] = Σ_{k=0}^∞ z^k P{X = k} = e^{−µ} Σ_{k=0}^∞ (µ^k/k!) z^k = e^{−µ} e^{µz} = e^{−µ(1−z)}

If Y ∼ Pois(ν) is independent of X, then E[z^{X+Y}] = E[z^X] E[z^Y] = e^{−(µ+ν)(1−z)}, which is the generating function of Pois(µ + ν), so

P{X + Y = k} = e^{−(µ+ν)} (µ + ν)^k/k!

X + Y ∼ Pois(µ + ν)
12 Gambler’s Ruin
12.1 Introduction
You have $28. At each stage of the game, a coin is flipped; if heads, you get a dollar, if tails, you lose a dollar. The game
ends when you have either $0 or $100.
Let Xn be your capital after the nth step. The state space is D = {0, 1, . . . , 100}. (Note: imagine an opponent that has
$100 − Xn .)
P{Xn+1 = i + 1 | Xn = i} = P{Xn+1 = i − 1 | Xn = i} = 1/2,  1 ≤ i ≤ 99
P{Xn+1 = i | Xn = i} = 1,  i ∈ {0, 100}
Note that this is a Markov chain, and that we will use the technique of conditioning on the first step throughout this section.
Question 1 Will the game ever end?
Let
f (i) := P{1 ≤ Xn ≤ 99, ∀ n ≥ 0|X0 = i}, 0 ≤ i ≤ 100
i.e., the probability that the game continues forever given that X0 = i.
f(i) = 0 for i ∈ {0, 100};  f(i) = (1/2)f(i + 1) + (1/2)f(i − 1) for 1 ≤ i ≤ 99
Note that f(i) is the average of f(i − 1) and f(i + 1). This means that the three points (i − 1, f(i − 1)), (i, f(i)), (i + 1, f(i + 1)) are collinear, and further, that all points {(i, f(i)) : i ∈ D} are collinear. Since f(0) = f(100) = 0, that means

f(i) = 0,  0 ≤ i ≤ 100
Since the game will end, there exists a finite (random) time T such that XT ∈ {0, 100}. Then the probability that you will end the game with $100, given that X0 = i, is

r(i) := P{XT = 100 | X0 = i}, with r(0) = 0, r(100) = 1, and r(i) = (1/2)r(i + 1) + (1/2)r(i − 1) for 1 ≤ i ≤ 99.

As before, this equation shows that the graph of r(i) consists of collinear points. Since r(0) = 0 and r(100) = 1, we have

r(i) = i/100,  0 ≤ i ≤ 100
Note that because r(i) is increasing, the game favors those who are initially rich, so it is not "socially fair." However, it is fair in the sense that the coin flips are fair and your expected final capital

E[XT | X0 = i] = r(i) · 100 = (i/100) · 100 = i

is your initial capital.
Let µi := E[T | X0 = i] be the expected duration of the game, so that µ0 = µ100 = 0 and µi = 1 + (1/2)µi+1 + (1/2)µi−1 for 1 ≤ i ≤ 99. For 1 ≤ i ≤ 99, you can write µi as (1/2)µi + (1/2)µi on the LHS and move things around to get

µi+1 − µi = (µi − µi−1) − 2.

Then, letting a = µ1 − µ0, we have

µ1 = µ1 − µ0 = a
µ2 − µ1 = a − 2
µ3 − µ2 = a − 4
...
µi − µi−1 = a − 2(i − 1)
...
µ100 − µ99 = a − 2 · 99

Summing the first i differences gives µi = ia − i(i − 1); µ100 = 0 forces a = 99, so µi = i(100 − i).
We can show that the game will end in finite time, as before.
Suppose now that a (possibly different) coin is used at each state: at state i you win a dollar with probability pi and lose one with probability qi = 1 − pi. As before, let

f(i) := P{1 ≤ Xn ≤ 99, ∀ n ≥ 0 | X0 = i},  0 ≤ i ≤ 100,

i.e., the probability that the game continues forever given that X0 = i. Then

f(i) = 0 for i ∈ {0, 100};  f(i) = pi f(i + 1) + qi f(i − 1) for 1 ≤ i ≤ 99
As before, r(i) is monotonic. Suppose we want to exact justice so that you help the poor, but when the rich become poor, they are helped the same amount. For example, you want r(72) = 1 − r(28), because when you have $72, your opponent has $28 and should have the same amount of help as you did when you had $28.
Then the graph of r(i), which still hits (0, 0) and (100, 1) and is still monotone, appears to flatten out when i is near 50.
Now generalize. Let c be the total capital instead of $100, and let there be one coin for all states, with probability p of winning and q = 1 − p ≠ p of losing a dollar. The duration of the game T is now the first time you have $0 or $c and, as before, is finite. Let
ri := P{XT = c | X0 = i},  0 ≤ i ≤ c

r0 = 0;  ri = p ri+1 + q ri−1 for 1 ≤ i ≤ c − 1;  rc = 1
For 1 ≤ i ≤ c − 1, replacing ri with p ri + q ri on the LHS and moving things around, we have

ri+1 − ri = (q/p)(ri − ri−1).

Letting r = q/p for convenience and a = r1 − r0, we have

r1 = r1 − r0 = a
r2 − r1 = ar
r3 − r2 = ar²
...
ri − ri−1 = ar^{i−1}
...
rc − rc−1 = ar^{c−1}

Summing the first i differences gives ri = a(1 + r + · · · + r^{i−1}) = a(1 − r^i)/(1 − r), and rc = 1 then gives a = (1 − r)/(1 − r^c), so

ri = (1 − r^i)/(1 − r^c),  0 ≤ i ≤ c.
For the expected duration µi := E[T | X0 = i], we have again

µi = 0 for i ∈ {0, c};  µi = 1 + pµi+1 + qµi−1 for 1 ≤ i ≤ c − 1
Using the same technique, and letting r = q/p and s = 1/p, we get

µi+1 − µi = r(µi − µi−1) − s.
Then

µ1 = µ1 − µ0 = a
µ2 − µ1 = ra − s
µ3 − µ2 = r²a − rs − s
...
µi − µi−1 = r^{i−1}a − r^{i−2}s − r^{i−3}s − · · · − rs − s
...
µc − µc−1 = r^{c−1}a − r^{c−2}s − r^{c−3}s − · · · − rs − s
Then 0 = µc gives us a = sc/(1 − r^c) − s/(1 − r), and thus

µi = (1/(p − q)) (c (1 − r^i)/(1 − r^c) − i),  0 ≤ i ≤ c

where we replaced s = 1/p but not r = q/p. Recall that for all this, p ≠ q. Since we found earlier that ri = (1 − r^i)/(1 − r^c), we have

µi = (1/(p − q)) (c ri − i)
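A sketch (mine; function names, seed, and the p = 0.48 example are assumptions) evaluating these formulas and checking them against a direct Monte-Carlo game:

```python
import numpy as np

def ruin_formulas(p, c, i):
    """Win probability r_i and expected duration mu_i for p != 1/2."""
    q = 1.0 - p
    r = q / p
    ri = (1 - r**i) / (1 - r**c)
    mui = (c * ri - i) / (p - q)
    return ri, mui

def simulate(p, c, i, n_games=2000, rng=np.random.default_rng(10)):
    wins, total_steps = 0, 0
    for _ in range(n_games):
        x, steps = i, 0
        while 0 < x < c:
            x += 1 if rng.random() < p else -1
            steps += 1
        wins += (x == c)
        total_steps += steps
    return wins / n_games, total_steps / n_games

print(ruin_formulas(0.48, 100, 28))   # exact (r_i, mu_i)
print(simulate(0.48, 100, 28))        # Monte-Carlo estimates of the same
```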
13 Appendix
13.1 dx notation
P{a ≤ X ≤ b} = ∫_a^b f(x) dx
On the left side, dx represents an interval, but on the right side, it represents the length of the interval. In higher-level
probability, we use λ(dx) to represent the length of the interval, in order to avoid confusion.
Also,

P{a ≤ X ≤ b} = ∫_a^b P{X ∈ dx}

P{X ∈ dx, Y ∈ dy} ≡ f(x, y) dx dy
Suppose Y = 2X and we know the pdf of X: P{X ∈ dx} = f(x) dx. What is the pdf of Y?

E[Y] = E[2X] = ∫_{−∞}^∞ 2x f(x) dx = ∫_{−∞}^∞ y f(y/2) (dy/2)   (y = 2x)

So P{Y ∈ dy} = (1/2) f(y/2) dy. Don't forget to change the limits of integration!
In general,

P{X ∈ (dx − a)/b} = f((x − a)/b) (dx/b)
Let X ∼ expon(1), let 0 = a0 < a1 < a2 < · · · s.t. ∀x ∈ (0, ∞), ∃i : ai < x < ai+1 .
Let Y = yi if ai < X ≤ ai+1 ; assume ai < yi ≤ ai+1
Choose {yi } to minimize E[|X − Y |].
Let there be indicator functions

Ii = 1 if ai < X ≤ ai+1;  Ii = 0 otherwise
Then

E[|X − Y|] = ∫_0^∞ E[|x − Y| | X = x] P{X ∈ dx}
 = ∫_0^∞ Σ_{i=0}^∞ |x − yi| Ii P{X ∈ dx}
 = Σ_{i=0}^∞ ∫_{ai}^{ai+1} |x − yi| P{X ∈ dx}

Minimize E[|X − Y|] by minimizing each ∫_{ai}^{ai+1} |x − yi| P{X ∈ dx}.
(Remove the absolute value signs.)

∂/∂yi ∫_{ai}^{ai+1} |x − yi| P{X ∈ dx}
 = ∂/∂yi ∫_{ai}^{yi} (yi − x) P{X ∈ dx} + ∂/∂yi ∫_{yi}^{ai+1} (x − yi) P{X ∈ dx}
 = ∫_{ai}^{yi} P{X ∈ dx} − ∫_{yi}^{ai+1} P{X ∈ dx}

(the Leibniz boundary terms vanish because the integrands (yi − x) and (x − yi) are 0 at x = yi). Setting this derivative to 0 gives ∫_{ai}^{yi} P{X ∈ dx} = ∫_{yi}^{ai+1} P{X ∈ dx}; that is, the optimal yi is the median of X restricted to (ai, ai+1], etc.
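A sketch (mine) of this optimality condition for X ∼ expon(1): on each cell (ai, ai+1], the conditional median beats any nudged value. The cell boundaries, seed, and perturbation are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.exponential(1.0, size=1_000_000)
edges = np.array([0.0, 0.5, 1.0, 2.0, 4.0, np.inf])   # hypothetical a_i

for lo, hi in zip(edges[:-1], edges[1:]):
    cell = X[(X > lo) & (X <= hi)]
    y_med = np.median(cell)                  # conditional median: the optimum
    for y in (y_med, y_med + 0.1):
        # mean |X - y| on this cell is smallest at the median
        print(lo, hi, round(y, 4), np.abs(cell - y).mean())
```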
13.3 Average
If an → L, then

lim_{m→∞} (1/m) Σ_{n=0}^m an = L
Proof
Given ε > 0, ∃N1 > 0 such that n ≥ N1 implies |an − L| < ε/2. Then, for m ≥ N1,

(1/m) |(aN1+1 − L) + (aN1+2 − L) + · · · + (am − L)|
 ≤ (1/m) (|aN1+1 − L| + · · · + |am − L|)
 < (1/m)(m − N1)(ε/2)
 ≤ (1/m) m (ε/2)
 = ε/2

The finitely many remaining terms contribute (1/m)|Σ_{n=0}^{N1} (an − L)|, a fixed quantity divided by m, which is also < ε/2 once m is large enough; combining the two bounds shows that the averages are eventually within ε of L.