Expected Values
Alberto Suárez
E[g(X)] = \int dx \; \mathrm{pdf}(x)\, g(x)
The quantity of interest is E[g(X)]. For a discrete state space S = \{s_1, \ldots, s_{|S|}\}, represent the state at time t by indicator variables:

X(t) = \{X_i(t)\}_{i=1}^{|S|}; \qquad X_i(t) = \begin{cases} 1 & \text{if } X(t) = s_i, \; s_i \in S \\ 0 & \text{otherwise} \end{cases}
Markov property: P(X(t+1) \mid X(t), X(t-1), \ldots, X(0)) = P(X(t+1) \mid X(t))
E[g(X)] = \sum_{i=1}^{|S|} P(X(t) = s_i)\, g(s_i) \;\approx\; \frac{1}{T - T_{\mathrm{transient}}} \sum_{t = T_{\mathrm{transient}}+1}^{T} g(X(t))

The distribution of the chain at time t is P(t): \; P_i(t) = P(X(t) = s_i); \qquad \sum_{i=1}^{|S|} P_i(t) = 1
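As a concrete illustration of this estimator (not from the original slides), here is a minimal Python sketch: it simulates a small Markov chain with an assumed transition matrix W and function g, discards a transient, and compares the time average with the exact expectation under the stationary distribution. All numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state chain; W[j, i] = P(X(t+1) = s_j | X(t) = s_i), columns sum to 1.
W = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])
g = np.array([1.0, 2.0, 5.0])            # assumed values g(s_i)
T, T_transient = 20000, 1000             # assumed chain length and burn-in

x = 0                                    # arbitrary initial state
values = []
for t in range(1, T + 1):
    x = rng.choice(3, p=W[:, x])         # one Markov transition
    if t > T_transient:
        values.append(g[x])

print("time-average estimate of E[g(X)]:", np.mean(values))

# Exact value: stationary distribution = eigenvector of W with eigenvalue 1.
vals, vecs = np.linalg.eig(W)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()
print("exact E[g(X)] =", pi @ g)
```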
Irreducible: \forall\, s_i, s_j \in S \;\; \exists\, n \geq 1 \;/\; [W^n]_{ji} > 0

Aperiodic: \forall\, s_i \in S: \; D_i \equiv \{\, n \geq 1 \;/\; [W^n]_{ii} > 0 \,\}; \quad \mathrm{g.c.d.}(D_i) = 1

If \exists\, m \geq 1 \;/\; [W^m]_{ij} > 0 and \exists\, n \geq 1 \;/\; [W^n]_{ji} > 0, then

[W^{m+n}]_{ii} = \sum_k [W^m]_{ik}\,[W^n]_{ki} \;\geq\; [W^m]_{ij}\,[W^n]_{ji} > 0

[W^{m+n+1}]_{ii} = \sum_{k,l} [W^m]_{ik}\, W_{kl}\, [W^n]_{li} \;\geq\; [W^m]_{ij}\, W_{jj}\, [W^n]_{ji} > 0

so that (provided W_{jj} > 0) both m+n and m+n+1 belong to D_i, and \mathrm{g.c.d.}(D_i) = 1.
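The two conditions can be checked numerically for a small chain. The sketch below (an illustrative aid, not part of the slides) computes powers of an assumed column-stochastic W, tests whether every entry becomes positive for some power (irreducibility), and computes the g.c.d. of the return times of each state (aperiodicity).

```python
import numpy as np
from math import gcd
from functools import reduce

def is_irreducible_and_aperiodic(W, n_max=None):
    """Check the two conditions numerically for a column-stochastic W.

    Irreducible: for every pair (i, j) some power n has [W^n]_{ji} > 0.
    Aperiodic:   for every i, g.c.d.{ n : [W^n]_{ii} > 0 } = 1.
    """
    S = W.shape[0]
    n_max = n_max or 2 * S * S          # enough powers for a small finite chain
    powers = [np.linalg.matrix_power(W, n) for n in range(1, n_max + 1)]

    reachable = np.zeros_like(W, dtype=bool)
    for Wn in powers:
        reachable |= Wn > 1e-12
    irreducible = reachable.all()

    aperiodic = True
    for i in range(S):
        D_i = [n + 1 for n, Wn in enumerate(powers) if Wn[i, i] > 1e-12]
        aperiodic &= bool(D_i) and reduce(gcd, D_i) == 1
    return irreducible, aperiodic

# Illustrative example: a two-state chain of period 2 is irreducible but not aperiodic.
W_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
print(is_irreducible_and_aperiodic(W_periodic))   # (True, False)
```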
W\pi = \pi: the stationary distribution \pi is an eigenvector of W, with eigenvalue 1.
An arbitrary initial distribution converges to the stationary distribution
provided that all other eigenvalues of W are smaller than 1 in absolute value.
W\pi = \pi; \qquad W v^{(n)} = \lambda_n v^{(n)}, \quad n = 2, 3, \ldots, |S|; \qquad 1 > |\lambda_2| \geq |\lambda_3| \geq \cdots \geq |\lambda_{|S|}|;

P(t) = W^t P(0)
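A short numerical check of these statements, with an assumed 3-state transition matrix (an illustrative sketch, not part of the slides): the stationary distribution is extracted as the eigenvector with eigenvalue 1, and iterating P(t) = W P(t-1) from an arbitrary P(0) approaches it at a rate governed by the second-largest eigenvalue.

```python
import numpy as np

# Illustrative column-stochastic transition matrix (W[j, i] = P(j | i)).
W = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.8, 0.3],
              [0.1, 0.1, 0.5]])

# Stationary distribution: eigenvector of W with eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(W)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

# Convergence of an arbitrary initial distribution: P(t) = W^t P(0).
P = np.array([1.0, 0.0, 0.0])
for t in range(50):
    P = W @ P

print("pi      =", np.round(pi, 4))
print("W^50 P0 =", np.round(P, 4))
print("second-largest |eigenvalue| =", np.sort(np.abs(vals))[-2])
```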
Detailed balance
Theorem (Feller, 1950). Let W be the transition matrix of a finite Markov chain which is irreducible and aperiodic. The equation W\pi = \pi (global balance) has a unique solution \pi, which is the stationary distribution.
A sufficient condition for convergence to the stationary distribution
for a homogeneous Markov Chain is that its transition matrix W
satisfies local detailed balance (reversibility)
W_{ji}\,\pi_i = W_{ij}\,\pi_j, \qquad i, j = 1, \ldots, |S|

Summing over i shows that detailed balance implies global balance:

W_{ji}\,\pi_i = W_{ij}\,\pi_j \;\Rightarrow\; \sum_i W_{ji}\,\pi_i = \pi_j \sum_i W_{ij} = \pi_j

Expanding the initial distribution in the eigenvectors of W,

P(0) = \pi + \sum_{n=2}^{|S|} \alpha_n(0)\, v^{(n)}; \qquad P(t) = W^t P(0) = \pi + \sum_{n=2}^{|S|} \alpha_n(0)\, \lambda_n^t\, v^{(n)}; \qquad \lim_{t \to \infty} P(t) = \pi
Metropolis--Hastings algorithm
\alpha(X_t, Y) = \min\left\{ 1, \; \frac{\pi(Y)\, q(X_t \mid Y)}{\pi(X_t)\, q(Y \mid X_t)} \right\}
Metropolis--Hastings algorithm (convergence)
Pseudocode:
  Initialize X_0; t := 0;
  Repeat
    Generate a proposal Y ~ q(. | X_t) and u ~ U(0, 1);
    If u < \alpha(X_t, Y) Then X_{t+1} := Y Else X_{t+1} := X_t;
    t := t + 1;
  Until t = T;
End

Transition kernel (to be used in the Chapman-Kolmogorov eqn.):

P(X_{t+1} \mid X_t) = q(X_{t+1} \mid X_t)\, \alpha(X_t, X_{t+1}) + \delta(X_{t+1} - X_t) \left[ 1 - \int dy\; q(y \mid X_t)\, \alpha(X_t, y) \right]
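A minimal runnable version of the pseudocode above, in Python. The target density, the autoregressive proposal q(y|x) = N(y; a x, 1), and the chain length are illustrative assumptions, not part of the slides; the acceptance step uses the full ratio with both \pi and q.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.9                                   # autoregressive proposal coefficient (assumed)

def log_pi(x):
    # Illustrative unnormalized target: standard normal.
    return -0.5 * x**2

def log_q(y, x):
    # Assumed asymmetric proposal: Y ~ N(a * x, 1), so q(y|x) != q(x|y).
    return -0.5 * (y - a * x)**2

def metropolis_hastings(T=20000, x0=0.0):
    x, chain, accepted = x0, np.empty(T), 0
    for t in range(T):
        y = a * x + rng.normal()                     # draw proposal Y ~ q(.|X_t)
        # log alpha = log[ pi(Y) q(X_t|Y) / ( pi(X_t) q(Y|X_t) ) ]
        log_alpha = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        if np.log(rng.uniform()) < log_alpha:        # accept with probability alpha
            x, accepted = y, accepted + 1
        chain[t] = x                                 # on rejection X_{t+1} = X_t
    return chain, accepted / T

chain, rate = metropolis_hastings()
print("acceptance rate:", round(rate, 2))
print("sample mean, variance:", chain[2000:].mean(), chain[2000:].var())
```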
Metropolis (symmetric proposal): \quad q(y \mid x) = q(x \mid y) \;\Rightarrow\; \alpha(x, y) = \min\left\{ 1, \; \frac{\pi(y)}{\pi(x)} \right\}
Random-walk Metropolis: \quad q(y \mid x) = q(x \mid y) = q(|x - y|)
If steps |Y-Xt| generated by q(|Y-Xt|) are either too small or too
large, the chain may have poor mixing properties.
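The following illustrative experiment (not from the slides) makes the step-size trade-off concrete: random-walk Metropolis on a standard normal target with three assumed proposal widths, reporting the acceptance rate and the lag-1 autocorrelation as a crude mixing diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_metropolis(step, T=20000):
    """Random-walk Metropolis on a standard normal target."""
    x, chain, accepted = 0.0, np.empty(T), 0
    for t in range(T):
        y = x + step * rng.normal()
        if np.log(rng.uniform()) < 0.5 * (x**2 - y**2):   # min{1, pi(y)/pi(x)}
            x, accepted = y, accepted + 1
        chain[t] = x
    return chain, accepted / T

for step in (0.05, 2.5, 50.0):            # too small / reasonable / too large
    chain, rate = rw_metropolis(step)
    c = chain[1000:]                      # discard burn-in
    rho1 = np.corrcoef(c[:-1], c[1:])[0, 1]
    print(f"step={step:5.2f}  acceptance={rate:.2f}  lag-1 autocorr={rho1:.2f}")
```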
Independence Sampler:
q(y \mid x) = q(y); \qquad \alpha(x, y) = \min\left\{ 1, \; \frac{w(y)}{w(x)} \right\}; \qquad w(x) = \frac{\pi(x)}{q(x)}
A convenient choice is a proposal q(y) centered at x_0 = mode of [\pi(x)], with a spread matched to the curvature -\partial^2 \log \pi(x) / \partial x\, \partial x^T at the mode.
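A hedged Python sketch of an independence sampler, with an assumed Gaussian target and a Gaussian proposal centered near the target's mode and made somewhat wider than the target; the acceptance ratio is computed through the importance weights w(x) = \pi(x)/q(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    # Illustrative unnormalized target: normal centered at 3.
    return -0.5 * (x - 3.0)**2

def log_q(x, mu=3.0, s=1.5):
    # Independence proposal q(y): Gaussian centered at the mode of pi,
    # slightly wider than the target (heavier-than-target tails help).
    return -0.5 * ((x - mu) / s)**2 - np.log(s)

def independence_sampler(T=20000, x0=0.0):
    log_w = lambda z: log_pi(z) - log_q(z)          # log importance weight w(x)
    x, chain = x0, np.empty(T)
    for t in range(T):
        y = 3.0 + 1.5 * rng.normal()                # Y ~ q(.), independent of X_t
        if np.log(rng.uniform()) < log_w(y) - log_w(x):   # min{1, w(y)/w(x)}
            x = y
        chain[t] = x
    return chain

chain = independence_sampler()
print("sample mean ~ 3:", chain[1000:].mean())
```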
Metropolis--Hastings algorithm (variants)
Variations:
Update blocks of components, or single components in turn (see the sketch below).
Random updating order: if one component is modified, then update with larger probability the components that are highly correlated with it.
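As an illustration of component-wise updating (the target, step size, and block size of one are assumptions, not from the slides), the sketch below applies a random-walk Metropolis update to one coordinate at a time of a correlated 2-D Gaussian target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated 2-D Gaussian target (unnormalized log-density).
prec = np.linalg.inv(np.array([[1.0, 0.8],
                               [0.8, 1.0]]))

def log_pi(x):
    return -0.5 * x @ prec @ x

def componentwise_metropolis(T=20000, step=1.0):
    x = np.zeros(2)
    chain = np.empty((T, 2))
    for t in range(T):
        for i in range(2):                      # update one component at a time
            y = x.copy()
            y[i] += step * rng.normal()
            if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
                x = y
        chain[t] = x
    return chain

chain = componentwise_metropolis()
print("empirical covariance:\n", np.cov(chain[2000:].T).round(2))
```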
Metropolis--Hastings algorithm (implementation)
Length of burn-in
Stopping time:
Estimate the variance of the expected value that is being calculated
Variance estimates are easiest if multiple chains are run.
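One simple way to obtain such a variance estimate from multiple chains (an illustrative choice; the slides do not prescribe a particular estimator) is to compare the per-chain means of the quantity being averaged:

```python
import numpy as np

def run_chain(seed, T=5000, burn_in=500):
    """Random-walk Metropolis chain on a standard normal target (illustrative)."""
    r = np.random.default_rng(seed)
    x, out = 0.0, np.empty(T)
    for t in range(T):
        y = x + r.normal()
        if np.log(r.uniform()) < 0.5 * (x**2 - y**2):
            x = y
        out[t] = x
    return out[burn_in:]

chains = [run_chain(seed) for seed in range(8)]          # 8 independent chains
means = np.array([c.mean() for c in chains])

# The spread of the per-chain means estimates the Monte Carlo error of the
# expected value being computed (here E[X] = 0 for the illustrative target).
print("estimate of E[X]      :", means.mean())
print("std. error (8 chains) :", means.std(ddof=1) / np.sqrt(len(chains)))
```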
Simulated annealing
Global optimization in many dimensions.
Physical annealing: Minimize free energy
Heat up a solid until it melts.
Cool down slowly until a crystal is formed.
Goal: sample from the uniform distribution over the set S^* of global minima of the cost function,

\pi(s_i) = \frac{1}{|S^*|} \sum_{l} \delta(s_i, s_l^*)
Pseudocode:
Begin
  Initialize(i, T_0, L_0); k := 0;
  Repeat
    For l := 1 to L_k Do
      Generate j, a neighbor of i, and u ~ U(0, 1);
      If exp( (E_i - E_j) / T_k ) > u Then i := j;
    End;
    k := k + 1;
    calculateLength(Lk); calculateTemperature(Tk);
  Until stopCriterion;
End;
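A runnable Python version of the pseudocode above for an illustrative one-dimensional discrete problem; the energy function, neighborhood, initial temperature, chain lengths, and geometric cooling factor are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative discrete problem: minimize a rugged 1-D energy over 100 integer states.
states = np.arange(100)
E = 0.05 * (states - 70)**2 + 5.0 * np.sin(states / 3.0)

def neighbors(i):
    # Assumed neighborhood: the adjacent states (clipped at the ends).
    return [max(i - 1, 0), min(i + 1, len(states) - 1)]

def simulated_annealing(T0=10.0, alpha=0.95, L=200, n_stages=200):
    i = int(rng.integers(len(states)))
    T = T0
    for k in range(n_stages):
        for _ in range(L):                          # L_k trials at temperature T_k
            j = int(rng.choice(neighbors(i)))       # generate a neighboring state
            # Accept downhill moves always, uphill moves with Boltzmann probability.
            if E[j] <= E[i] or np.exp((E[i] - E[j]) / T) > rng.uniform():
                i = j
        T *= alpha                                  # geometric cooling schedule
    return i

best = simulated_annealing()
print("found state:", best, " energy:", round(E[best], 3),
      " global minimum at:", int(np.argmin(E)))
```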
Transition matrix at temperature T:

W_{ji}(T) = \begin{cases} G_{ji}\, A_{ji}(T), & j \neq i \\ 1 - \sum_{l \neq i} G_{li}\, A_{li}(T), & j = i \end{cases}

Generation (proposal) probabilities, uniform over neighborhoods:

G_{ji} = \frac{1}{\Theta_i}\, \chi_{ji}; \qquad \Theta_i = \sum_j \chi_{ji}; \qquad \chi_{ji} = \begin{cases} 1 & \text{if } s_j \text{ is in the neighborhood of } s_i \\ 0 & \text{otherwise.} \end{cases}

Acceptance probabilities and Boltzmann distribution:

A_{ji}(T) = \exp\left( -\frac{(E_j - E_i)^+}{T} \right); \qquad q_i(T) \propto \exp\left( -\frac{E_i}{T} \right)

Detailed balance at fixed T:

A_{ji}(T)\, q_i(T) = A_{ij}(T)\, q_j(T) \;\Rightarrow\; W_{ji}(T)\, q_i(T) = W_{ij}(T)\, q_j(T)
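These relations can be verified numerically. The sketch below (illustrative, not from the slides) builds W(T) for a small ring of states, where every state has the same number of neighbors so that G_{ji} = G_{ij}, and checks detailed balance against the Boltzmann distribution q(T) as well as stationarity.

```python
import numpy as np

# Illustrative: 5 states on a ring (each state has exactly 2 neighbors, so G is symmetric).
E = np.array([0.0, 1.5, 0.7, 2.0, 0.3])
n = len(E)
chi = np.zeros((n, n))
for i in range(n):
    chi[(i - 1) % n, i] = chi[(i + 1) % n, i] = 1.0   # ring neighborhood
G = chi / chi.sum(axis=0)                              # G_{ji} = chi_{ji} / Theta_i

def W_of_T(T):
    A = np.exp(-np.maximum(E[:, None] - E[None, :], 0.0) / T)   # A_{ji}(T)
    W = G * A
    np.fill_diagonal(W, 0.0)
    np.fill_diagonal(W, 1.0 - W.sum(axis=0))                    # W_{ii} from column sums
    return W

T = 0.8
W = W_of_T(T)
q = np.exp(-E / T); q /= q.sum()                                # Boltzmann distribution
# Detailed balance: W_{ji} q_i == W_{ij} q_j for all i, j.
M = W * q[None, :]
print("detailed balance holds:", np.allclose(M, M.T))
print("q is stationary       :", np.allclose(W @ q, q))
```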
With a logarithmic annealing schedule

T_k \geq \frac{\Gamma\,(L + 1)}{\log(k + 2)}

(\Gamma a problem-dependent constant), the distribution of the chain converges to the uniform distribution over the set S^* of global minima:

\pi(s_i) = \frac{1}{|S^*|} \sum_{l} \delta(s_i, s_l^*)
Irreducibility: consider s_i, s_j \in S. There exist p \geq 1 and a path s_{l_0}, s_{l_1}, \ldots, s_{l_p} \in S with l_0 = i, l_p = j, in which consecutive states are neighbors. Then

[W^p(T)]_{ji} = \sum_{k_1, k_2, \ldots, k_{p-1}} W_{j k_{p-1}}(T)\, W_{k_{p-1} k_{p-2}}(T) \cdots W_{k_2 k_1}(T)\, W_{k_1 i}(T) \;\geq\; W_{j l_{p-1}}(T) \cdots W_{l_1 i}(T) > 0

Aperiodicity: let s_j be a neighbor of s_i with E_j > E_i (assuming s_i is not a maximum of E over its neighborhood), so that A_{ji}(T) < 1 while A_{li}(T) \leq 1 for l \neq j. Then

W_{ii}(T) = 1 - \sum_{k \neq i} G_{ki}\, A_{ki}(T) = 1 - \sum_{k \neq i, j} G_{ki}\, A_{ki}(T) - G_{ji}\, A_{ji}(T) > 1 - \sum_{k \neq i, j} G_{ki} - G_{ji} = 1 - \sum_{k \neq i} G_{ki} = G_{ii} \geq 0

so W_{ii}(T) > 0.
Annealing schedules
Kirkpatrick, Gelatt & Vecchi (1982,1983)
Choose T0 large enough so that most transitions are accepted.
Start with a small value of T0
k := 0; choose a factor c > 1;
Repeat
  T_0 := c \cdot T_0;
Until the acceptance ratio is sufficiently close to 1.
Cooling: T_{k+1} = \alpha\, T_k, \quad \alpha \in [0.8, 0.99].
L_k is sufficiently large so that for each value of T_k quasi-equilibrium obtains (keep L_k < L_max so that long chains are avoided at low T).
Stop criterion: Value of cost function does not change in a specified
number of epochs.
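A small sketch of this schedule under the stated rules (the growth factor, acceptance threshold, and test energy function are assumptions): the initial temperature is raised until the acceptance ratio is close to 1, and then cooled geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)

E = 0.05 * (np.arange(100) - 70)**2 + 5.0 * np.sin(np.arange(100) / 3.0)

def acceptance_ratio(T, trials=2000):
    """Fraction of random neighbor moves accepted at temperature T."""
    i = rng.integers(len(E), size=trials)
    j = np.clip(i + rng.choice([-1, 1], size=trials), 0, len(E) - 1)
    uphill = (E[j] - E[i]).clip(min=0)
    return np.mean((E[j] <= E[i]) | (np.exp(-uphill / T) > rng.uniform(size=trials)))

# Initial temperature: increase T0 until almost every move is accepted.
T0, c = 0.1, 1.5                      # small start, growth factor c > 1 (assumed values)
while acceptance_ratio(T0) < 0.95:
    T0 *= c

# Geometric cooling schedule T_{k+1} = alpha * T_k.
alpha, schedule = 0.9, [T0]
for k in range(30):
    schedule.append(alpha * schedule[-1])

print("chosen T0:", round(T0, 2))
print("first temperatures:", np.round(schedule[:5], 2))
```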
Genetic algorithms: important points
For sufficiently large numbers of individuals the algorithm improves the
average fitness of the population.
Convergence is not guaranteed (not even in principle).
The hardest part is the coding scheme.
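For concreteness, a minimal genetic-algorithm sketch in Python on the classic one-max problem (maximize the number of ones in a bit string); the population size, fitness-proportional selection, single-point crossover, and mutation rate are illustrative assumptions, loosely following the standard scheme described in Goldberg's book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal genetic algorithm on bit strings; fitness = number of ones ("one-max").
n_bits, pop_size, generations = 30, 60, 40
pop = rng.integers(0, 2, size=(pop_size, n_bits))

def fitness(pop):
    return pop.sum(axis=1)

for g in range(generations):
    f = fitness(pop)
    # Fitness-proportional (roulette-wheel) selection of parents.
    p = f / f.sum()
    parents = pop[rng.choice(pop_size, size=pop_size, p=p)]
    # Single-point crossover between consecutive parents.
    children = parents.copy()
    for a in range(0, pop_size - 1, 2):
        cut = rng.integers(1, n_bits)
        children[a, cut:], children[a + 1, cut:] = parents[a + 1, cut:], parents[a, cut:]
    # Bit-flip mutation with a small assumed rate.
    mask = rng.random(children.shape) < 0.01
    pop = np.where(mask, 1 - children, children)

print("average fitness after evolution:", fitness(pop).mean(), "of", n_bits)
```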
Bibliography
Markov Chain Monte Carlo in Practice
W. R. Gilks, S. Richardson and D. J. Spiegelhalter
Chapman & Hall, London, 1996.

Simulated Annealing and Boltzmann Machines
E. Aarts and J. Korst
Wiley-Interscience, New York, 1990.

Genetic Algorithms in Search, Optimization, and Machine Learning
David E. Goldberg
Addison-Wesley, Reading, Mass., 1989.