PBM Notes
Chapter 1
Week 1
1.1 Motivation
This module is about probabilistic (statistical) models for data and how to fit these models to data (solving the inverse problem). Take, e.g., the problem of curve fitting using a function $f$.

A possible function might look like this; another one, the red curve, fits better but seems to be highly complex.
We can introduce a likelihood function of the parameter $\theta$:
\[ p(\mathrm{Data}|\theta) = \prod_{i=1}^{n} p(y_i|\theta). \tag{1.2} \]
1.2.3 1-dimensional Gaussian density

The density of a one-dimensional Gaussian random variable $X$ is given by
\[ p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{1.4} \]
We write $X \sim \mathcal{N}(\mu, \sigma^2)$. It is easy to see that the parameter $\mu = E(X)$ is the mean. It is a bit harder to show that $\sigma^2 = E(X-\mu)^2$ is the variance.

$\mu = E[X]$ is the mean vector (show this!) and $\Sigma$ is the $d \times d$ positive definite covariance matrix. One can show that
[Figure: scatter plot of samples from a 2-dimensional Gaussian with covariance matrix $\Sigma = \begin{pmatrix} 16.6 & 6.8 \\ 6.8 & 6.4 \end{pmatrix}$.]

\[ E[X] = \mu + A\, E[Z] = \mu \]
\[ E[(X - \mu)(X - \mu)^\top] = E[A Z Z^\top A^\top] = A\, E[Z Z^\top]\, A^\top = A A^\top = \Sigma \]
• Central limit theorem: For i.i.d. $X_i$ with finite variance, the normalised sum $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - m)$ becomes asymptotically Gaussian distributed.
1.2.7 Computing conditional expectations for jointly Gaussian variables

This is something we have to do quite often. We may assume that we observe some variables (data) and want to predict others. This is based on the conditional density. If all variables are jointly Gaussian, the conditional density is also Gaussian.
Let's split $x = (v, z)^\top$ into two groups of variables. To compute $p(v|z)$ we first write the joint density
\[ p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right] \propto \tag{1.7} \]
\[ \exp\left[-\frac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu\right] \tag{1.8} \]
in the following unnormalised form
\[ p(v, z) \propto \exp\left[-\frac{1}{2} (v\ z)^\top \Omega\, (v\ z) + (v\ z)^\top \xi\right] \tag{1.9} \]
with the information matrix $\Omega = \Sigma^{-1} = \begin{pmatrix} \Omega_{vv} & \Omega_{vz} \\ \Omega_{zv} & \Omega_{zz} \end{pmatrix}$ and $\xi = (\xi_v\ \xi_z)^\top$. We also see that $\xi = \Omega\mu$. We have ignored all the terms that are
independent of x. To obtain the conditional density we write it in the form
\[ p(v|z) = \frac{p(v, z)}{p(z)} \propto \exp\left[-\frac{1}{2} v^\top \Omega_{vv} v + v^\top (\xi_v - \Omega_{vz} z)\right] \tag{1.10} \]
where we have collected all the terms that depend on the random variable V .
We know that the conditional density is a Gaussian and we can read off the
mean vector and the covariance from this unnormalised density. I know two
ways for doing this:
1. Completing the square: We look at the quadratic form in the exponent of (1.10). We then complete the square
\[ -\frac{1}{2} v^\top \Omega_{vv} v + v^\top a = -\frac{1}{2} v^\top \Omega_{vv} v + \frac{1}{2} v^\top a + \frac{1}{2} a^\top v \]
\[ = -\frac{1}{2}\left(v - \Omega_{vv}^{-1} a\right)^\top \Omega_{vv} \left(v - \Omega_{vv}^{-1} a\right) + \frac{1}{2} a^\top \Omega_{vv}^{-1} a, \]
where $a = \xi_v - \Omega_{vz} z$.
2. Finding the maximum of the exponent. For Gaussian densities, the maximiser of the probability density equals the mean. Hence, we can take the gradient
\[ \nabla_v \left[-\frac{1}{2} v^\top \Omega_{vv} v + v^\top a\right] = -\Omega_{vv} v + a \]
and set it equal to zero.
For both methods, we get for the mean and covariance
\[ E[v|z] = \Omega_{vv}^{-1}\left(\xi_v - \Omega_{vz} z\right), \qquad \mathrm{Cov}[v|z] = \Omega_{vv}^{-1}. \]
Remember that the conditional expectation is the best prediction (in the mean square sense) of the random vector $V$ given the 'data' $z$. It is interesting that for Gaussian variables, the covariance (the uncertainty of the prediction) is actually independent of the data!
For two probability mass functions $P(x)$ and $Q(x)$ (discrete random variables) we define
\[ \mathrm{KL}(P\|Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}. \tag{1.14} \]
1.3.1 The biased coin (Bernoulli model)

Consider a data sequence $D = (x_1, x_2, \ldots, x_n)$ of bits $x_i \in \{0,1\}$ which we believe are generated independently at random with the same probability. Let $\theta$ be the unknown probability of $x_i = 1$. Hence, we have
\[ p(x|\theta) = \theta^x (1-\theta)^{1-x}. \]
The probability of the entire sequence $D$ under this model (we use independence) is
\[ P(D|\theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}. \]
For this parameter, the observed data have the highest probability under the model. Equivalently, we maximise the log-likelihood
\[ \ln P(D|\theta) = \sum_{i=1}^n \left( x_i \ln\theta + (1-x_i)\ln(1-\theta) \right) = n_1 \ln\theta + (n - n_1)\ln(1-\theta), \]
\[ \frac{d \ln P(D|\theta)}{d\theta} = \frac{n_1}{\theta} - \frac{n - n_1}{1-\theta} = 0 \quad\longrightarrow\quad \hat\theta = \frac{n_1}{n}. \tag{1.18} \]
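A minimal numerical check of this estimator (Python sketch; the simulated coin data are an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=1000)   # simulated coin flips

# ML estimate: fraction of ones, theta_hat = n1 / n
theta_hat = x.mean()
print(theta_hat)  # close to 0.7 for large n
```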
be 0). But it is reasonable to use the joint density $p(D|\mu, \sigma^2)$ instead! Using independence again, a short calculation yields
\[ \ln p(D|\mu, \sigma^2) = -\frac{1}{2} \sum_{i=1}^N \left[ \frac{(x_i - \mu)^2}{\sigma^2} + \ln(2\pi\sigma^2) \right]. \tag{1.19} \]
1.3.4 Generalised linear models

The linear regression model can be generalised to the fitting with non-linear functions
\[ y_i = f_w(x_i) + \nu_i \tag{1.23} \]
for $i = 1, \ldots, n$, with unknown parameter $w$ and $\nu_i$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$. The generalised linear model assumes
\[ f_w(x) = \sum_{j=0}^K w_j \phi_j(x) \tag{1.24} \]
where the $\phi_j(x)$ denote a fixed set of functions, e.g. $\phi_j(x) = x^j$ (polynomial regression), which typically leads to a nonlinear function in $x$, but which is linear in the parameters $w_j$. The likelihood is
\[ p(D|w) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\sum_{i=1}^n \frac{(y_i - f_w(x_i))^2}{2\sigma^2} \right] p_X(x_1, \ldots, x_n). \tag{1.25} \]
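Maximising this likelihood over $w$ is least squares in the basis $\phi_j$; a small sketch (Python, with simulated data — the toy target, degree and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, sigma = 50, 3, 0.1
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + sigma * rng.normal(size=n)   # toy target

# design matrix Phi[i, j] = phi_j(x_i) = x_i**j
Phi = np.vander(x, K + 1, increasing=True)

# ML weights = least-squares solution of Phi w ~ y
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)
```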
by the law of large numbers
\[ -\frac{1}{n} \ln p(D|\theta) = -\frac{1}{n} \sum_i \ln p(x_i|\theta) \simeq -E_{p^*}[\ln p(X|\theta)] \]
\[ = -E_{p^*}\left[\ln \frac{p(X|\theta)}{p^*(X)}\right] - E_{p^*}[\ln p^*(X)] = \mathrm{KL}(p^*\|p(\cdot|\theta)) - E_{p^*}[\ln p^*(X)]. \]
Hence, for large $n$ one might expect that by minimising the negative log-likelihood, we find the parameter $\theta$ which makes $p(x|\theta)$ closest (measured by the KL divergence) to the true density $p^*$. And if $p^*(x) = p(x|\theta^*)$, we will find the true parameter.

Illustration: ML estimation of the variance of a Gaussian (shown are histograms over 10,000 repetitions of the estimation) for $n = 5, 10, 100$ with true parameter $\sigma^2 = 1$.

We see that the distributions of the estimators become more and more concentrated around the true value. But we also see that for finite $n$, the estimator is biased.
1.4 Appendix I: Some probability essentials
Here is a collection of topics from basic probability which we assume to be
known by participants of this module.
for a set $S \subseteq \mathbb{R}^2$.¹

• Marginal densities are obtained e.g. as $p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$.

¹ Note: When it is clear which random variables are involved, I often write simply $p(x)$ instead of $p_X(x)$.
Transformations of random variables and their densities:

• Let $Y = T(X)$ be an invertible transformation and let the density of $X$ be $p(x)$. We are interested in the density $q(y)$ of the random variable $Y$. Using the change of variables for integrals, one gets
\[ q(y) = p(x(y)) \left|\frac{dx}{dy}\right| = p(x(y)) \frac{1}{\left|\frac{dy}{dx}\right|}. \]
• Conditional probabilities:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, and similarly for conditional distributions, $P(x|y) = \frac{P(x,y)}{P(y)}$, and conditional densities, $p(x|y) = \frac{p(x,y)}{p(y)}$.

Bayes rule!
\[ P(x|y) = \frac{P(y|x)P(x)}{P(y)} = \frac{P(y|x)P(x)}{\sum_{x'} P(y|x')P(x')}. \]
• Expectations:
The expectation of $X$ is defined as $E(X) = \sum_x P(x)\, x$ (discrete case) or $E(X) = \int p(x)\, x\, dx$ (continuous case). For a function $g$ of the random variable $X$, one can show that $E(g(X)) = \sum_x P(x)\, g(x)$ (discrete) or $E(g(X)) = \int p(x)\, g(x)\, dx$ (continuous).

Mean: $\mu = E[X]$.
Variance: $\mathrm{Var}(X) = E((X - \mu)^2) = E(X^2) - (E(X))^2$.
Linearity: $E(aX + bY) = aE(X) + bE(Y)$.

• Conditional expectation $E(Y|X=x)$ or $E(Y|x)$:
$E(g(Y)|X=x) = \sum_y g(y)\, P(y|x)$ (discrete case) and $E(g(Y)|X=x) = \int g(y)\, p(y|x)\, dy$ (continuous case).
For independent random variables we have:

1. $E(X_1 \cdot X_2 \cdots X_N) = \prod_{i=1}^N E(X_i)$.

2. $\mathrm{Var}\left(\sum_{i=1}^N X_i\right) = \sum_{i=1}^N \mathrm{Var}(X_i)$.

3. Law of large numbers:
Let $X_1, X_2, \ldots, X_N$ be i.i.d. with finite variance $\sigma^2$ and $S_N = \frac{1}{N}\sum_{i=1}^N X_i$; then one can show that
\[ \lim_{N\to\infty} P(|S_N - E(X)| > \varepsilon) = 0. \]
Hence, when $N$ is large, with high probability we have $\frac{1}{N}\sum_{i=1}^N X_i \approx E(X)$.
Thus $\exp\left[-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right] = \prod_{i=1}^d e^{-\frac{y_i^2}{2\lambda_i}}$.
• The determinant equals the product of the eigenvalues: $|\Sigma| = \prod_{i=1}^d \lambda_i$.

• Putting things together, we find that the transformed random variables defined by the coordinates $Y = U^\top (X - \mu)$ are independent and have the Gaussian densities
\[ p(y) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{y_i^2}{2\lambda_i}}. \tag{1.30} \]
[Figure: samples from the 2-dimensional Gaussian with the principal axes indicated.]

The covariance matrix is $\Sigma = \begin{pmatrix} 16.6 & 6.8 \\ 6.8 & 6.4 \end{pmatrix}$. The eigenvalues are $\lambda_1 = 20$ and $\lambda_2 = 3$, with eigenvectors $u_1 = \frac{1}{\sqrt{5}}(2, 1)^\top$ and $u_2 = \frac{1}{\sqrt{5}}(1, -2)^\top$.
1.6 Appendix: Inequalities

• Cauchy–Schwarz:
\[ \{E(XY)\}^2 \le E(X^2)\, E(Y^2). \]
where $\xi$ lies between $X$ and $y$. Finally, take expectations on both sides. Note that an expectation does not change the direction of the inequality (easy to see for discrete random variables).
Chapter 2

Week 2

\[ g(x) = w^\top x + w_0 \]
where we are not interested in the probability of the inputs. The optimal parameters could be found by maximising the likelihood using a gradient method.
2.2 ML for a Markov chain

To show that we can apply ML to dependent data, we look at an example of a 2-state Markov chain $x_i \in \{0,1\}$ with unknown transition matrix
\[ \theta = \begin{pmatrix} P_{11} & P_{10} \\ P_{01} & P_{00} \end{pmatrix}. \]
We assume we observe e.g. the data sequence $D = 1101011001011110111010110010100$ and try to estimate $P_{11}, P_{10}, P_{01}, P_{00}$ using the ML method. The likelihood equals the probability of $D$ for a given parameter (transition matrix)
\[ P(D|\theta) = \prod_{i=1}^{30} P(x_{i+1}|x_i, \theta)\; P(x_1) = P_{11}^8\, P_{10}^{10}\, P_{01}^9\, P_{00}^3\; P(x_1). \tag{2.2} \]
We also have two constraints on the transition probabilities, $P_{11} + P_{10} = 1$ and $P_{01} + P_{00} = 1$. We can take care of those by introducing Lagrange multipliers and obtain the Lagrange function
\[ L(\theta, \lambda_1, \lambda_0) = -\ln P(D|\theta) + \lambda_1(P_{11} + P_{10} - 1) + \lambda_0(P_{01} + P_{00} - 1) \tag{2.3, 2.4} \]
\[ = -\ln P(x_1) - 8\ln P_{11} - 10\ln P_{10} - 9\ln P_{01} - 3\ln P_{00} + \lambda_1(P_{11} + P_{10} - 1) + \lambda_0(P_{01} + P_{00} - 1). \tag{2.5, 2.6} \]
Differentiating with respect to the $P_{ij}$ yields $P_{11} = \frac{8}{\lambda_1}$ and $P_{10} = \frac{10}{\lambda_1}$. The constraint gives $1 = \frac{8+10}{\lambda_1}$, i.e. $\lambda_1 = 18$. Hence $P_{11} = 0.44$ and $P_{10} = 0.56$. Similarly, we obtain $P_{01} = 9/12 = 0.75$ and $P_{00} = 0.25$.
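The same estimates follow from simply counting transitions; a quick sketch (Python, using the data sequence from the text):

```python
import numpy as np

D = "1101011001011110111010110010100"
x = np.array([int(c) for c in D])

# count transitions n[a, b] = #(x_i = a followed by x_{i+1} = b)
counts = np.zeros((2, 2))
for a, b in zip(x[:-1], x[1:]):
    counts[a, b] += 1

# ML transition probabilities: normalise each row of counts
P = counts / counts.sum(axis=1, keepdims=True)
print(P)  # P[1,1] ~ 0.44, P[1,0] ~ 0.56, P[0,1] = 0.75, P[0,0] = 0.25
```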
2.3.2 Gaussian densities as an exponential family

Gaussian densities can also be cast into the exponential family framework. Since we have two parameters, we will obtain 2-dimensional sufficient statistics. We rewrite the Gaussian density as
\[ p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right] \tag{2.10} \]
\[ = \exp\left[\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2\right] \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\mu^2}{2\sigma^2}\right] \tag{2.11} \]
\[ \equiv f(x) \exp[\psi(\theta) \cdot \phi(x) + g(\theta)]. \tag{2.12} \]
Obviously, this is in the correct form if we define $\psi(\theta) = (\mu/\sigma^2,\ 1/2\sigma^2)$ and $\phi(x) = (x, -x^2)$. Finally, we have $f(x) = 1$ and
\[ e^{g(\theta)} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\mu^2}{2\sigma^2}\right]. \tag{2.13} \]
One can also show that the following models belong to the exponential family
class:
2.3.5 Mathematical properties of exponential families

We begin with the normalisation. This helps us to express the function $g(\theta)$ as an integral:
\[ 1 = \int p(x|\theta)\, dx = e^{g(\theta)} \int f(x)\, e^{\psi \cdot \phi(x)}\, dx \quad\rightarrow\quad \int f(x)\, e^{\psi \cdot \phi(x)}\, dx = e^{-g(\theta)}. \tag{2.16} \]
We can now express the expectation of the sufficient statistics as another integral.

This shows that maximum likelihood estimation leads to simple moment matching: for the ML parameter, the expected sufficient statistics equal the data average of the sufficient statistics.
Let $p(x|\theta)$ be a parametric family. A statistic $T(D)$ of the sample $D = \{x_1, x_2, \ldots, x_n\}$ is called sufficient if the conditional probability
[Figure: upper panel, neuron ID vs. time (ms); lower panel, spike probability $p_{\mathrm{spike}}$.]
The figure illustrates (lower panel) the binarisation of continuous-time spike trains. The activity of each neuron is represented by two states (active, inactive), $x_i = \pm 1$. The upper panel shows the simultaneous time series of states for a group of neurons. There will be repeated experiments from which data averages of $x_i$ and of $x_i x_j$ (pairwise correlations) are estimated.
A generalisation to variables xi with more than 2 states (Potts models) has
been used to predict interactions between amino acids in proteins.
ML estimation of the parameters $\theta_{ij}$ and $\theta_i$ by gradient descent requires the computation of the model averages $E[x_i|\theta] = \frac{\partial \ln Z}{\partial \theta_i}$ and similarly for $E[x_i x_j|\theta]$. By the moment matching result for maximum likelihood, these have to be matched with the corresponding data averages. Unfortunately, the computation of the model expectations using e.g. the normalising 'partition sum'
\[ Z(\theta) = \sum_{\{x_i = \pm 1\}^N} \exp\left[\sum_{(i,j)} \theta_{ij} x_i x_j + \sum_i \theta_i x_i\right] \tag{2.28} \]
if $\theta = \theta^*$ equals the true parameter. Proof:
\[ \nabla_\theta E_{p^*}[\ln p(X|\theta)] = \int p(x|\theta^*)\, \nabla_\theta [\ln p(x|\theta)]\, dx \tag{2.30} \]
\[ = \int p(x|\theta^*) \frac{\nabla_\theta\, p(x|\theta)}{p(x|\theta)}\, dx. \tag{2.31} \]
For $\theta = \theta^*$, we get
\[ = \int \nabla_\theta\, p(x|\theta)\big|_{\theta=\theta^*}\, dx = \nabla_\theta \int p(x|\theta)\, dx = \nabla_\theta 1 = 0. \]
2.4.3 Pseudo-likelihood

We will now apply a similar idea to another quantity which resembles the expected logarithm of the data but which is often simpler to work with. We will first show that
\[ \nabla_\theta E_{p^*}[\ln P(x_i|x_{-i}, \theta)] = 0 \quad\text{for } \theta = \theta^*. \tag{2.32} \]
Proof:
\[ E_{p^*}\left[\nabla_\theta \ln P(x_i|x_{-i}, \theta)\big|_{\theta=\theta^*}\right] = \sum_x P(x|\theta^*) \frac{\nabla_\theta P(x_i|x_{-i}, \theta)\big|_{\theta=\theta^*}}{P(x_i|x_{-i}, \theta^*)} \tag{2.33} \]
\[ = \sum_{x_{-i}} P(x_{-i}|\theta^*) \sum_{x_i} P(x_i|x_{-i}, \theta^*)\, \frac{\nabla_\theta P(x_i|x_{-i}, \theta)\big|_{\theta=\theta^*}}{P(x_i|x_{-i}, \theta^*)} \tag{2.34} \]
\[ = \sum_{x_{-i}} P(x_{-i}|\theta^*) \sum_{x_i} \nabla_\theta P(x_i|x_{-i}, \theta)\big|_{\theta=\theta^*} = 0. \tag{2.35} \]
In the last step, we have used that $\sum_{x_i} P(x_i|x_{-i}, \theta) = 1$. We then sum over $i$ to get the exact equation
\[ \sum_{i=1}^N \nabla_\theta E_{p^*}[\ln P(x_i|x_{-i}, \theta)] = 0 \quad\text{for } \theta = \theta^*. \tag{2.36} \]
Note the different notation used for data samples.

Why is this method simpler compared to ML in the case of the Ising model? The conditional distribution can be derived from the joint distribution
\[ P(x|\theta) = \frac{1}{Z(\theta)} \exp\left[\sum_{(i,j)} \theta_{ij} x_i x_j + \sum_i \theta_i x_i\right] \tag{2.38} \]
as
\[ P(x_i|x_{-i}, \theta) = \frac{P(x|\theta)}{P(x_{-i}|\theta)} \propto \exp\left[x_i \left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]. \tag{2.39} \]
Note that the intractable $Z(\theta)$ drops out of the result. We can get a properly normalised result
\[ P(x_i|x_{-i}, \theta) = \frac{\exp\left[x_i\left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]}{\exp\left[\theta_i + \sum_j \theta_{ij} x_j\right] + \exp\left[-\left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]}. \]
Hence, the gradient is computable! The only thing that needs to be done is a numerical approach for solving the resulting optimisation. But that is much simpler compared to the intractable summations needed for the ML approach.
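For illustration, a sketch of the negative log-pseudo-likelihood of a small Ising model (Python; the couplings and data are simulated placeholders):

```python
import numpy as np

def neg_log_pseudolikelihood(theta_pair, theta_bias, X):
    """X: array of shape (n_samples, N) with entries +/-1."""
    nll = 0.0
    for x in X:
        # local field h_i = theta_i + sum_j theta_ij x_j for every i
        h = theta_bias + theta_pair @ x
        # log P(x_i | x_-i) = x_i h_i - log(e^{h_i} + e^{-h_i})
        nll -= np.sum(x * h - np.logaddexp(h, -h))
    return nll

rng = np.random.default_rng(2)
N = 5
theta_pair = 0.1 * rng.normal(size=(N, N)); theta_pair += theta_pair.T
np.fill_diagonal(theta_pair, 0.0)
theta_bias = 0.1 * rng.normal(size=N)
X = rng.choice([-1, 1], size=(20, N))          # placeholder data
print(neg_log_pseudolikelihood(theta_pair, theta_bias, X))
```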
Chapter 3

Week 3

• Estimator $\hat\theta(D)$

• Variance of the estimator:
\[ \mathrm{Var}(\hat\theta) = E\left[\left(\hat\theta(D) - E[\hat\theta(D)]\right)^2\right] \]

• The expectation is w.r.t. $p(D|\theta) = \prod_{i=1}^n p(x_i|\theta)$.

• Cramér–Rao bound:
\[ \mathrm{Var}[\hat\theta] \ge \frac{\left(\partial_\theta E(\hat\theta)\right)^2}{n J(\theta)} \]
• For unbiased estimators, $\partial_\theta E(\hat\theta) = 1$, and we get a bound on the mean squared error
\[ \mathrm{MSE}(\hat\theta) \ge \frac{1}{n J(\theta)}. \]

• The score for independent data $D$ (use additivity of variances for independent random variables):
\[ V_n \doteq \frac{d \ln p(D|\theta)}{d\theta} = \frac{d \sum_{i=1}^n \ln p(x_i|\theta)}{d\theta}, \qquad \mathrm{Var}[V_n] = n J(\theta). \]

• By the Cauchy–Schwarz inequality,
\[ \left(E\left[(V_n - E[V_n])(\hat\theta - E[\hat\theta])\right]\right)^2 \le \mathrm{Var}[V_n]\, \mathrm{Var}[\hat\theta] = n J(\theta)\, \mathrm{Var}[\hat\theta]. \]
• The left hand side of the equation is
\[ E\left[(V_n - E[V_n])(\hat\theta - E[\hat\theta])\right] = E\left[V_n \hat\theta\right] = \int p(D|\theta)\, \frac{1}{p(D|\theta)} \frac{dp(D|\theta)}{d\theta}\, \hat\theta(D)\, dx_1 \ldots dx_n \]
\[ = \int \frac{dp(D|\theta)}{d\theta}\, \hat\theta(D)\, dx_1 \ldots dx_n = \frac{d}{d\theta} E[\hat\theta(D)] \qquad \text{(estimator independent of } \theta\text{)} \]

\[ p(x|\theta) = \theta^x (1 - \theta)^{1-x}. \]
3.1.7 Examples for Fisher information

• 1-d Gaussian with parameters $\theta = (\mu, \sigma^2)$. The Fisher information is found to be
\[ J(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}. \]

• Cauchy density: $p(x|\theta) = \frac{1}{\pi(1 + (x - \theta)^2)}$ has $J(\theta) = \frac{1}{2}$.

• Fisher information for exponential families with natural parameters: We use the representation
\[ p(x|\psi) = f(x) \exp[\psi \cdot \phi(x) + \tilde g(\psi)]. \]
Then the Fisher information is found to be
\[ J_\psi(\psi) = -\partial\partial\, E[\ln p(x|\psi)] = -\partial\partial\, \tilde g(\psi) = \partial\partial \ln \int f(x)\, e^{\psi \cdot \phi(x)}\, dx = \mathrm{COV}[\phi(x)]. \]
The last equality follows from the exponential family representation by direct calculation of the second derivatives.
3.1.9 Proof of $-\int dx\, p(x|\theta)\, \partial_i \partial_j \ln p(x|\theta) = \int p(x|\theta)\, \partial_i \ln p(x|\theta)\, \partial_j \ln p(x|\theta)\, dx$ for the scalar case

$(')$ denotes $\frac{d}{d\theta}$:
\[ E\left[(\ln p(x|\theta))''\right] = \int p(x|\theta) \left(\frac{p'(x|\theta)}{p(x|\theta)}\right)' dx \]
\[ = \int p(x|\theta)\, \frac{p''(x|\theta)}{p(x|\theta)}\, dx - \int p(x|\theta) \left(\frac{p'(x|\theta)}{p(x|\theta)}\right)^2 dx \]
\[ = -\int p(x|\theta) \left(\frac{p'(x|\theta)}{p(x|\theta)}\right)^2 dx = -\int p(x|\theta) \left((\ln p(x|\theta))'\right)^2 dx, \]
where we used $\int p''(x|\theta)\, dx = \left(\int p(x|\theta)\, dx\right)'' = 0$.
\[ 0 = \partial_\theta \ln p(D|\theta)\big|_{\theta = \hat\theta_{ML}} \approx \partial_\theta \ln p(D|\theta) + (\hat\theta - \theta)\, \partial_\theta^2 \ln p(D|\theta) \]

• Solve for
\[ (\hat\theta - \theta) \approx -\frac{\partial_\theta \ln p(D|\theta)}{\partial_\theta^2 \ln p(D|\theta)} \approx \frac{V_n}{n J(\theta)}, \]
\[ E(\hat\theta - \theta)^2 \approx \frac{E(V_n^2)}{n^2 J^2(\theta)} = \frac{n J(\theta)}{n^2 J^2(\theta)} = \frac{1}{n J(\theta)}. \]

• where
\[ \|\theta_1 - \theta_2\|^2_{\mathrm{Fisher}} \doteq (\theta_1 - \theta_2)^\top J(\theta)\, (\theta_1 - \theta_2) \]
3.2.1 Information Geometry

S. Amari developed a differential geometric approach to estimation.

• Define a non-Euclidean metric in parameter space by
Chapter 4

Week 3

One can use the Fisher metric to define efficient online algorithms. Before doing so, we will give the Fisher
until convergence. $\eta$ is a learning rate. This requires the storage of the whole batch of $n$ data. On the other hand, if we want to perform online learning for the case of streaming data, we base the new estimate on the likelihood for the new data point $\ln p(x_{n+1}|\theta)$ and the old estimate $\hat\theta(n)$. A common possibility is to apply stochastic gradient descent, i.e.

The differential operator $J^{-1}(\theta(n)) \nabla_\theta$ is termed the natural gradient. For the choice $\gamma_n = \frac{1}{n}$, one can show that the online algorithm yields asymptotically efficient estimation.
One can motivate the update by the following idea: On the one hand, one would like to make the data log-likelihood large. But one should not rely entirely on the new data; one should also take the old estimate $\theta(n)$ into account, by not moving too far away from it. If distances are measured by the Fisher metric, we should minimise (set $\Delta\theta \doteq \theta' - \theta$)
\[ -\ln p(x|\theta') + \frac{\lambda}{2}\|\Delta\theta\|^2_{\mathrm{Fisher}} \approx \]
\[ -\ln p(x|\theta) - \nabla \ln p(x|\theta)\, \Delta\theta + \frac{\lambda}{2}\|\Delta\theta\|^2_{\mathrm{Fisher}} \qquad \text{(1st order Taylor)} \]
\[ = -\ln p(x|\theta) - \nabla \ln p(x|\theta)\, \Delta\theta + \frac{\lambda}{2}\Delta\theta^\top J(\theta)\, \Delta\theta \]
with respect to $\theta'$. $\lambda$ is a parameter that controls how strongly the old parameter contributes. Minimisation w.r.t. $\Delta\theta$ yields the natural gradient of the log-likelihood:
\[ \nabla \ln p(x|\theta) - \lambda J(\theta)\, \Delta\theta = 0 \]
\[ \Delta\theta = \frac{1}{\lambda} J^{-1}(\theta)\, \nabla \ln p(x|\theta). \]
For the Cauchy density, we get
\[ \theta_{n+1} = \theta_n + \frac{4 (x_{n+1} - \theta_n)}{n \left(1 + (x_{n+1} - \theta_n)^2\right)}. \]
The left figure shows the prediction $\theta_n$ for a single run of the algorithm, when the true parameter is $\theta = 1$. The right figure shows the average squared estimation error (obtained from 10,000 runs) vs. $1/n$. For large $n$, we get a $1/n$ decay.
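A compact simulation of this online rule (Python sketch; the true parameter, seed and run length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, n_steps = 1.0, 5000
x = theta_true + rng.standard_cauchy(n_steps)  # Cauchy data, location 1

theta = 0.0
for n, xn in enumerate(x, start=1):
    # natural gradient step with gamma_n = 1/n and J(theta) = 1/2
    theta += 4 * (xn - theta) / (n * (1 + (xn - theta) ** 2))
print(theta)  # approaches 1 as n grows
```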
4.2.1 Poisson processes

This models a set of discrete events $D \doteq (z_1, \ldots, z_N)$ which occur e.g. in a compact domain $S$, as shown in the figure.

[Figure: events in a compact domain $S$ with intensity function $\Lambda(\cdot)$.]
for the intensity function $\Lambda(\cdot)$. To proceed, one could choose a parametrisation for the function $\Lambda(\cdot)$ and estimate its parameters using ML. But what does this likelihood mean?

This means that we can express expectations w.r.t. $P$ through expectations w.r.t. the reference measure $R$. $p(X)$ reweighs the different contributions to the integral. Hence, if the reference measure does not contain parameters that we wish to estimate, it makes sense to use $p(x)$ as a likelihood for the parameters.
4.3 Latent Variable Models
• Exponential families allow for simple analytic parameter estimation by
Maximum Likelihood.
• More complex models explain data by hidden (unobserved) variables, the
so called latent variables.
• However, Maximum Likelihood (ML) estimation for this class usually requires numerical optimisation. We will discuss an iterative (EM) algorithm which helps to simplify ML.
We have one hidden variable $c_i$ for each data point, telling us from which Gaussian component the observed point $y_i$ was generated. We also need, as usual, a mean and variance parameter for each component. The additional parameter vector $w(c)$ gives the probability of a component $c$. Hence, $\theta = \{\mu_c, \sigma_c, w(c)\}_{c=1}^K$.

The likelihood is given by
\[ p(Y|\theta) = \prod_{i=1}^n p(y_i|\theta) = \prod_{i=1}^n \left\{ \sum_{c_i=1}^K w(c_i)\, \frac{1}{\sqrt{2\pi\sigma^2_{c_i}}} \exp\left[-\frac{(y_i - \mu_{c_i})^2}{2\sigma^2_{c_i}}\right] \right\}. \]
3. (M-Step) Maximise

i.e. the likelihood is not decreasing! We are not guaranteed to find the global maximum in this way, but often may converge to a local one. This can be improved by starting with different random initialisations.
4.3.4 Example: Mixture of Gaussians

For the MoG model we have to perform the following steps:

• (E-Step): Compute
\[ L(\theta, \theta_t) \equiv \sum_c p(c|y, \theta_t) \ln \left\{\prod_{i=1}^n p(y_i, c_i|\theta)\right\} \]
with
\[ p(c|y, \theta_t) = \prod_{i=1}^n p(c_i|y_i, \theta_t) = \prod_{i=1}^n \frac{p(y_i|c_i, \theta_t)\, p(c_i|\theta_t)}{p(y_i|\theta_t)} \]
and
4.3.5 Details

• E-Step: Compute
\[ L(\theta, \theta_t) \equiv \sum_c p(c|y, \theta_t) \ln \left\{\prod_i p(y_i, c_i|\theta)\right\} = \sum_{i=1}^n \sum_c p(c|y_i, \theta_t) \ln p(y_i, c|\theta). \]
Note that the sums over $p(c_j|y_j, \theta_t)$ for $j \ne i$ each equal 1.
\[ \ln p(y_i, c|\theta) = -\frac{1}{2}\ln(2\pi\sigma_c^2) - \frac{(y_i - \mu_c)^2}{2\sigma_c^2} + \ln w(c) \]
This has a similar form as the ML estimate for a single Gaussian. The only difference is that each $y_i$ has a weight $p(c|y_i, \theta_t)$, the so-called responsibility of $c$ for generating data point $y_i$.
• The variation with respect to $\sigma_c^2$ yields

which can be differentiated independently with respect to the $w(c)$. This yields the equations
\[ \frac{1}{w(c)} \sum_{i=1}^n p(c|y_i, \theta_t) - \lambda = 0 \quad\rightarrow\quad w(c) = \frac{1}{\lambda} \sum_{i=1}^n p(c|y_i, \theta_t). \]
Here we have used the fact that $p(c|y_i, \theta_t)$ is a probability. Thus, we get $\lambda = n$.
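A bare-bones EM loop for a 1-d mixture of Gaussians implementing these E- and M-step updates (Python sketch; the initialisation and toy data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 250)])
K, n = 2, len(y)

mu, sig2, w = np.array([-1.0, 0.5]), np.ones(K), np.full(K, 1 / K)
for _ in range(100):
    # E-step: responsibilities r[i, c] = p(c | y_i, theta_t)
    logp = (-0.5 * np.log(2 * np.pi * sig2)
            - (y[:, None] - mu) ** 2 / (2 * sig2) + np.log(w))
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted means, variances and mixing weights
    Nc = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / Nc
    sig2 = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nc
    w = Nc / n
print(mu, sig2, w)
```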
Chapter 5

Week 4

We will next give a proof that the EM algorithm never decreases the likelihood. Then we discuss a continuous mixture model, and finally we introduce the Bayesian approach.
5.1 Analysis of EM

We will begin with the KL divergence. For any $q(x)$,
\[ D(q\|p(\cdot|y, \theta)) = \sum_x q(x) \ln \frac{q(x)\, p(y|\theta)}{p(y, x|\theta)} \ge 0. \]
Rearranging the inequality, we get
\[ -\ln p(y|\theta) \le F(q, \theta) \equiv \sum_x q(x) \ln \frac{q(x)}{p(y, x|\theta)}. \]
We only have equality when $q(x) = p(x|y, \theta)$! From this we can show that
\[ -\ln p(y|\theta) \le F(q_t, \theta), \]
\[ -\ln p(y|\theta_t) = F(q_t, \theta_t). \]
In the next step we relate $F$ and $L$. Let $q_t(x) \doteq p(x|y, \theta_t)$; then
\[ L(\theta, \theta_t) \equiv \sum_x p(x|y, \theta_t) \ln p(y, x|\theta) = -F(q_t, \theta) + \sum_x q_t(x) \ln q_t(x). \]
5.2 Pólya–Gamma mixture representation of logistic regression

We begin with the following continuous mixture representation (Polson et al. 2013) of $1/\cosh$ as a Laplace transform:
\[ \frac{1}{\cosh(\frac{x}{2})} = \int_0^\infty e^{-\frac{1}{2}\omega x^2}\, p_{PG}(\omega)\, d\omega. \]
One can show that $p_{PG}(\omega)$ is indeed a proper density because of the infinite product representation
\[ \frac{1}{\cosh\left(\sqrt{\frac{t}{2}}\right)} = \prod_{k=1}^\infty \left(1 + \frac{t}{2\pi^2 (k - 1/2)^2}\right)^{-1}. \]
By using the mixture representation, we get the augmented likelihood
\[ p(y, \{\omega_i\}_{i=1}^n|w) = \frac{1}{2^n} \prod_{i=1}^n p_{PG}(\omega_i)\; e^{\frac{y_i w^\top x_i}{2} - \omega_i \frac{(w^\top x_i)^2}{2}}. \]
In this form, the weights appear simply in quadratic form in the exponent! We can use this form to solve the ML estimation of $w$ by an EM algorithm.
5.2.3 EM

For the E-Step, we need the conditional density of the auxiliary variables $\{\omega_i\}_{i=1}^n$:
\[ p(\{\omega_i\}_{i=1}^n|y, w) \propto \prod_{i=1}^n p_{PG}(\omega_i)\, e^{-\frac{(w^\top x_i)^2}{2}\omega_i} \]
This can easily be optimised w.r.t. $w$. All we need is an explicit result for
\[ E[\omega_i|y_i, w_t] = \frac{\int_0^\infty p_{PG}(\omega)\, e^{-\frac{(w_t^\top x_i)^2}{2}\omega}\, \omega\, d\omega}{\int_0^\infty p_{PG}(\omega)\, e^{-\frac{(w_t^\top x_i)^2}{2}\omega}\, d\omega}. \]
In fact, this can be obtained from the Laplace transform. The integral is of the type
\[ \frac{\int_0^\infty p_{PG}(\omega)\, \omega\, e^{-z\omega}\, d\omega}{\int_0^\infty p_{PG}(\omega)\, e^{-z\omega}\, d\omega} = -\frac{d}{dz} \ln \int_0^\infty p_{PG}(\omega)\, e^{-\omega z}\, d\omega = \frac{d}{dz} \ln \cosh\left(\sqrt{\frac{z}{2}}\right) = \tanh\left(\sqrt{\frac{z}{2}}\right) \frac{1}{2\sqrt{2z}}. \]
The last line follows from the Laplace transform of $p_{PG}$:
\[ \frac{1}{\cosh\left(\sqrt{\frac{z}{2}}\right)} = \int_0^\infty e^{-\omega z}\, p_{PG}(\omega)\, d\omega. \]
This result can be used to solve logistic regression using an EM–algorithm. The
method can also be extended to a Bayesian version of logistic regression.
5.3 The Bayesian approach to statistics

In the Bayesian approach, all prior knowledge (or lack of it) about unknown parameters should be described by a probability density $p(\theta)$. The information from the data is described by the likelihood $P(D|\theta)$. Using Bayes rule, we compute the posterior distribution, which gives our belief about $\theta$ after seeing the data:
\[ p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)} \]

[Figure: posterior density of $\theta$ for a Bernoulli model for different data sets of size $n = 3, 10, 50, 100$. The true value under which the data were generated was $\theta = 0.7$. The prior was flat, $p(\theta) = 1$ for $0 \le \theta \le 1$.]

For a flat prior $p(\theta) = \mathrm{const}$ this agrees with the ML estimate.

• Another point estimate is the posterior mean
\[ \hat\theta_m = E[\theta|D] = \int \theta\, p(\theta|D)\, d\theta \]
• The posterior mean $\hat\theta_m$ minimises the loss function
\[ L_2(\hat\theta) = \int \left(\hat\theta - \theta\right)^2 p(\theta|D)\, d\theta \]
We will see later that for many parametric models and large $n$, the posterior variance $\to 0$ and $\hat\theta_m \approx \hat\theta_{MAP} \approx \hat\theta_{ML}$.

The Bayes optimal prediction for the unknown distribution is the predictive distribution
\[ p(x|D) = \int p(x, \theta|D)\, d\theta = \int p(x|\theta, D)\, p(\theta|D)\, d\theta = \int p(x|\theta)\, p(\theta|D)\, d\theta \]
$\mu_0$ and $\sigma_0^2$ are hyperparameters, reflecting the prior beliefs about the location of the unknown $\mu$.

Given data $D = (x_1, \ldots, x_n)$, the posterior density for $\mu$ is
\[ p(\mu|D) = \frac{p(D|\mu)\, p(\mu)}{p(D)} = \frac{p(\mu)}{p(D)} \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \propto \]
\[ \exp\left[-\frac{\mu^2}{2}\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)\right] \times \exp\left[\mu \left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{i=1}^n x_i\right)\right] \]

This can be rewritten explicitly as a Gaussian density
\[ \frac{1}{\sqrt{2\pi\sigma_n^2}}\, e^{-\frac{(\mu - \mu_n)^2}{2\sigma_n^2}} \]
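Completing the square gives $\sigma_n^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}$ and $\mu_n = \sigma_n^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_i x_i\right)$. A small numerical check (Python sketch, simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma = 2.0, 1.0
mu0, sigma0 = 0.0, 10.0            # broad prior on mu
x = rng.normal(mu_true, sigma, size=50)

# posterior precision = prior precision + n * likelihood precision
prec_n = 1 / sigma0**2 + len(x) / sigma**2
sigma_n2 = 1 / prec_n
mu_n = sigma_n2 * (mu0 / sigma0**2 + x.sum() / sigma**2)
print(mu_n, sigma_n2)              # mean near 2, small variance
```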
Chapter 6

Week 5

where $\tau$ and $n_0$ are hyperparameters. For these priors, the posterior will be of the same form:
\[ p(\theta|D, \tau, n_0) \propto \exp\left[\psi(\theta) \cdot \sum_{i=1}^n \phi(x_i) + n g(\theta)\right] \times \exp\left[\psi(\theta) \cdot \tau + n_0 g(\theta)\right] \]
\[ = \exp\left[\psi(\theta) \cdot \left(\sum_{i=1}^n \phi(x_i) + \tau\right) + (n + n_0)\, g(\theta)\right]. \]
We simply replace $n_0 \to n_0 + n$ and $\tau \to \sum_{i=1}^n \phi(x_i) + \tau$ to obtain the posterior from the prior.

Let us look at some examples:
Let us look at some examples:
This is of the form of a beta density, which is usually denoted as $\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$.

If we assume that all models have the same prior probability, we choose the model with the largest evidence $\int P(D|\theta, \mathcal{M})\, p(\theta|\mathcal{M})\, d\theta$. The evidence is also frequently used to optimise hyperparameters.
The likelihood is
\[ p(D|w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\sum_{i=1}^N \frac{(y_i - f_w(x_i))^2}{2\sigma^2}\right]. \]
We also specify a Gaussian prior distribution on the weights,
\[ p(w) = \frac{1}{(2\pi\sigma_0^2)^{(K+1)/2}} \exp\left[-\frac{\sum_{j=0}^K w_j^2}{2\sigma_0^2}\right], \]
and obtain the posterior
\[ p(w|D) = \frac{p(D|w)\, p(w)}{p(D)}. \]
instead. Hence, we need to compute the evidence. One way of doing this would be to explicitly perform a Gaussian integral. But it is also possible to think probabilistically and compute the joint density $p(D|K)$ of observations from the generative Bayesian model. The data $y \equiv D$ are
\[ y = Xw + \xi \qquad \left(y_i = \sum_k w_k X_{ik} + \xi_i\right) \]
where
\[ \xi \sim \mathcal{N}(0, \sigma^2 I_n), \qquad w \sim \mathcal{N}(0, \sigma_0^2 I_{K+1}) \]
are two independent Gaussian random vectors. A linear combination of the two is also Gaussian. Hence $y$ is Gaussian and $p(y|K)$ is a multivariate Gaussian density. We have to find its mean and covariance. Obviously $E[y] = 0$ and
\[ \Sigma \doteq \mathrm{COV}[y] = E[y y^\top] = X E[w w^\top] X^\top + \sigma^2 I_n = \sigma_0^2 X X^\top + \sigma^2 I_n. \]
Hence
\[ y \sim \mathcal{N}(0, \Sigma), \qquad \Sigma = \sigma_0^2 X X^\top + \sigma^2 I_n. \]
The next figure shows the log-evidence as a function of $K$, showing that the correct polynomial order $K = 4$ gives the most likely model.
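Evaluating this log-evidence across polynomial orders takes a few lines; a sketch (Python, with simulated degree-4 data — the coefficients and noise levels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, sigma0 = 40, 0.2, 1.0
x = rng.uniform(-1, 1, n)
w_true = np.array([0.5, -1.0, 0.3, 2.0, -1.5])        # degree 4
y = np.vander(x, 5, increasing=True) @ w_true + sigma * rng.normal(size=n)

def log_evidence(K):
    X = np.vander(x, K + 1, increasing=True)
    # y ~ N(0, sigma0^2 X X^T + sigma^2 I) under the generative model
    S = sigma0**2 * X @ X.T + sigma**2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))

for K in range(8):
    print(K, log_evidence(K))      # typically peaks near K = 4
```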
If we repeat the experiment with the 'wrong' prior $\sigma_0 = 2$, which assumes typically bigger coefficients, the plot of the log-evidence gives the constant polynomial $K = 0$ as the most likely function.
Chapter 7

Week 6

We will next discuss the large-$n$ behaviour of the posterior density and derive approximations for posterior integrals. Finally, we will introduce Monte Carlo sampling methods for a different type of computation of such integrals.

[Figure: posterior densities of $\theta$ concentrating with growing $n$.]

We can see that for large $n$, the posterior has a Gaussian shape and is concentrated around the true value. To get more insight, we perform a Taylor expansion of the log-likelihood around the ML estimator $\hat\theta$ (for a one-dimensional
problem, for simplicity):
\[ \ln p(D|\theta) = \sum_{i=1}^n \ln p(x_i|\theta) = C + n \frac{c_2}{2}\left(\theta - \hat\theta\right)^2 + n \frac{c_3}{3!}\left(\theta - \hat\theta\right)^3 + \ldots \]
with the constant $C = \sum_{i=1}^n \ln p(x_i|\hat\theta)$ and
\[ c_k = \frac{1}{n}\sum_{i=1}^n \partial_\theta^k \ln p(x_i|\theta)\big|_{\hat\theta} \approx E_x\left[\partial_\theta^k \ln p(x|\theta)\big|_{\hat\theta}\right] = O(1). \]
Note that $c_1 = 0$ because the first derivative vanishes at the ML value! In the last step, we have approximated empirical averages over independent $x_i$ by the expectation. Assuming concentration around $\theta \approx \hat\theta$, we identify the dominating terms
\[ p(\theta|D) \propto \exp\left[-n\frac{|c_2|}{2}\left(\theta - \hat\theta\right)^2\right] \left(1 + n\frac{c_3}{3!}\left(\theta - \hat\theta\right)^3 + \ldots\right), \]
which is a Gaussian with small corrections. The correction was obtained by a further Taylor expansion of the exponential. With high probability with respect to the Gaussian, we have $|\theta - \hat\theta| \sim \frac{1}{\sqrt{n}}$. Hence the correction term is typically of order $n\left(\theta - \hat\theta\right)^3 \sim \frac{1}{n^{1/2}}$.
with $A = \nabla^2 h(\hat x)$. We can now perform the Gaussian integral by first shifting the integration $x - \hat x \to x$ and using
\[ \int \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}x^\top \Sigma^{-1} x}\, dx = 1. \]
Hence
\[ \int e^{-h(x)}\, dx \approx e^{-h(\hat x)} \int \exp\left[-\frac{1}{2}(x - \hat x)^\top A (x - \hat x)\right] dx = e^{-h(\hat x)}\, \frac{(2\pi)^{d/2}}{|A|^{1/2}}. \]

\[ \Gamma(n) = (n-1)!. \]

We have $g(y) = -e^y + ty$ and $g'(y) = -e^y + t$. Thus the maximiser is $\hat y = \ln t$. We need the 2nd derivative at the maximum: $g''(y) = -e^y$ and $g''(\hat y) = -t$.
The Laplace approximation yields
\[ \Gamma(t) \approx e^{-t + t\ln t} \int_{-\infty}^\infty e^{-\frac{t}{2}(y - \hat y)^2}\, dy = \sqrt{2\pi}\; t^{t - 1/2}\, e^{-t}. \]
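This is Stirling's formula; comparing it against the exact Gamma function is immediate (Python sketch):

```python
import math

def stirling(t):
    # Laplace approximation: Gamma(t) ~ sqrt(2*pi) * t**(t - 0.5) * exp(-t)
    return math.sqrt(2 * math.pi) * t ** (t - 0.5) * math.exp(-t)

for t in [2.0, 5.0, 10.0, 50.0]:
    print(t, math.gamma(t), stirling(t))  # relative error shrinks with t
```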
where $A = -\nabla^2 \ln p(\hat\theta|D)$ and $\hat\theta$ is the MAP estimator. This approximation only requires the MAP and the local curvature at the MAP value.

One can approximate this result further to get a fairly crude approximation for the evidence known as the Bayes Information Criterion (BIC) for Bayesian model selection. We ignore all the terms that do not scale with $n$ or $d$ and assume $\hat\theta \approx \theta_{ML}$. For large $n$, all matrix elements of $A$ have an asymptotic scaling $\propto n$ (they are computed from sums over the $x_i$). Hence, the determinant scales like $|A| = O(n^d)$. Thus we get
\[ -\ln p(D) \approx -\ln p(D|\theta_{ML}) + \frac{d}{2}\ln n. \]
The first term is the negative log-likelihood at the optimum, which typically decreases with increasing model complexity. The second term increases with the complexity (dimensionality $d$) of the model. For given data $D$, there should be an optimal model dimension $d$ which minimises the right hand side.

with
which is OK for regression. For classification, one would add a further sigmoid giving the probability for output $y = 1$.

We can view the neural network as a probabilistic model for outputs $y$:
\[ p(y|x, w) \propto \exp\left[-\frac{\beta}{2}\left(y - f_w(x)\right)^2\right] \qquad \text{(Regression)} \]
\[ p(y|x, w) = \left(\frac{1}{1 + e^{-f_w(x)}}\right)^y \left(\frac{1}{1 + e^{f_w(x)}}\right)^{1-y} \qquad \text{(Classification)} \]
where the weights are the parameters. This can be made into a Bayesian model by adding a prior on the weights. We will use so-called ARD (automatic relevance determination) priors for the input-to-hidden weights, which are (factorising) Gaussian densities. If we define $w_{ik}$ to be the weight connecting input $k$ to hidden unit $i$, we have
\[ p(w_{ik}) \propto \exp\left[-\frac{1}{2}\alpha_k w_{ik}^2\right] \]
The hyperparameter $\alpha_k$ (which is shared by all weights connecting input feature $x_k$) determines the influence of input $x_k$ on the output. For large $\alpha_k$, the weight prior shuts off the weights $w_{ik}$ for all $i$ and there is no relevance of feature $x_k$. For small $\alpha_k$, the Gaussian is broad, giving strong influence. We can use the Laplace approximation for the evidence to perform hyperparameter ($\alpha_k$) optimisation and model selection (number of hidden units).
We will study the performance of this method on the artificial Friedman data set. This is generated as
\[ y(x) = 0.1\, e^{4 x_1} + \frac{4}{1 + e^{-20(x_2 - \frac{1}{2})}} + 3 x_3 + 2 x_4 + x_5 + 0 \cdot \sum_{i=6}^{10} x_i + \nu \]
where $\nu$ denotes added noise. The function shows different relevances of the inputs $x_i$: $x_{1,2}$ appear inside highly nonlinear functions and strongly influence the output; $x_{3,4,5}$ appear linearly and are thus somewhat less relevant. The remaining inputs don't have any relevance for the output.
The following plots show the result of the Bayesian neural network learning for this toy problem. The first figure shows the test error for Bayes learning (blue) as a function of the number of hidden units, using the MAP as an output, and compares with a vanilla backprop algorithm (red) which ignores the prior. We can see that a network of 3 hidden units appears to be the optimal representation for both methods. But we also see that the Bayes predictions are more robust when we don't use the optimal setting.

The second plot shows the optimised hyperparameters $\alpha_k$ for the different inputs $x_k$, showing a clear relevance (small $\alpha_k$) for the first two and less relevance for the next three inputs. Finally, the remaining inputs have very large $\alpha$ and are found to be irrelevant. The corresponding weights are essentially set to $w_{ik} = 0$ by the prior.
The final plot shows the log-evidence as a function of the number of hidden units. We find the optimum for three hidden units, which also coincides with the best test error.
• Advantages: The intractable integrations are replaced by optimisation (finding the MAP). The Hessian, which is required for the approximate covariance of the posterior, can also be helpful for a Newton–Raphson optimisation algorithm.

• Disadvantages: It is only a local approximation; it takes into account only the MAP and the curvature of the posterior at the MAP. It ignores other posterior modes. It also can't be used for parameters which are discrete variables.
We have shown that this changes under transformations $\tau = f(\theta)$ (in 1-d) as $\tilde J(\tau) = \left(\frac{d\theta}{d\tau}\right)^2 J(\theta)$.

Jeffreys' prior (assume the parameter space $\Theta$ is compact) is defined as
\[ p_{\mathrm{Jeff}}(\theta) \propto \sqrt{J(\theta)}. \]
Hence, for both parametrisations, the prior has the same form. For a Bernoulli model, we have
\[ p_{\mathrm{Jeff}}(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}}. \]
One can show that Jeffreys' prior fulfils certain minimax properties asymptotically. This means that the frequentist value of a certain risk function of the Bayes prediction becomes independent of the true model; one therefore gets optimal predictions for the worst true model in the family.

For non-compact parameter spaces, Jeffreys' prior is not normalisable. The use of such improper priors (even if the posterior is normalisable) can be dangerous.
The error will be $O(N^{-1/2})$ and decreases to zero when the number of samples grows large. A nice property of such Monte Carlo methods is the fact that marginalisation of the components of a parameter vector is trivial: If $\theta = (\theta^{(1)}, \ldots, \theta^{(d)})$ and we are interested in $\theta^{(k)}$ only, then
\[ E\left[g(\theta^{(k)})|D\right] \approx \frac{1}{N}\sum_{i=1}^N g\left(\theta_i^{(k)}\right) \]
where $\theta_i \sim p(\theta|D)$. This means we just keep the components of the samples that we need. There is no need to perform analytical integrals over joint densities.
However, we need methods for sampling from arbitrary probability distributions. Posteriors might depend on the data in a complicated way (except for exponential families with conjugate priors, but we can usually deal with those without MC). Another problem comes from the (often) unknown normalisation of the posterior: usually we just have the unnormalised version $p(\theta|D) \propto p(D|\theta)\, p(\theta)$. So a good MC method should not require the normalisation.
An application to 1-d cases is straightforward: Let $Y \sim U(0,1)$ have uniform density, i.e. $p_Y(y) = 1$. Choose
\[ T(x) = \Pr(X \le x) = \int_{-\infty}^x p_X(v)\, dv \quad\rightarrow\quad \frac{dT(x)}{dx} = p_X(x), \]
\[ Y \sim U(0,1), \qquad X = T^{-1}(Y) \sim p_X. \]
For the exponential density this gives
\[ Y \sim U(0,1), \qquad X = T^{-1}(Y) = -\ln(1 - Y), \qquad \text{alternative: } X = -\ln Y. \]
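A quick check of the inverse transform method for the unit-rate exponential (Python sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.uniform(size=100_000)
x = -np.log(1 - y)          # X = T^{-1}(Y) for the Exp(1) density

# the sample mean and variance of Exp(1) should both be close to 1
print(x.mean(), x.var())
```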
Generate $(X, Y) \sim U(A)$ (uniform on $A$); accept if $(X, Y) \in B$ and keep $X$; else reject $(X, Y)$ and start again.
\[ X \sim q(\cdot), \qquad Y|x \sim U(0,\, C q(x)) \]
This means $Y = U C q(x)$ with $U \sim U(0,1)$. Proof: We compute the joint density from the marginal and the conditional as
\[ \rho(x, y) = \rho(y|x)\, \rho(x) = \frac{1}{C q(x)}\, q(x) = \frac{1}{C}, \]
\[ \Pr(\text{accept}) = \frac{\mathrm{Area}(B)}{\mathrm{Area}(A)} = \frac{1}{C}. \]
7.4.4 Summary: Rejection method

• Problem: We need random samples from a target density $p(\cdot)$. We can draw random variables from a density $q(\cdot)$ (proposal density).

• Assume $\frac{p(x)}{q(x)} \le C$.

[Figures: target density $p$, proposal envelope $C q(x)$, and acceptance region.]
This can be turned into a sampler for Gaussian random variables by multi-
plying the positive variables with a random independent sign.
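A sketch of this construction (Python): sample the half-Gaussian by rejection from an $\mathrm{Exp}(1)$ proposal and attach a random sign; the bound $C = \sqrt{2e/\pi}$ on $p(x)/q(x)$ is attained at $x = 1$:

```python
import numpy as np

rng = np.random.default_rng(8)
C = np.sqrt(2 * np.e / np.pi)     # bound on p(x)/q(x), attained at x = 1

def sample_gaussian(n):
    out = []
    while len(out) < n:
        x = rng.exponential()                 # proposal q(x) = e^{-x}
        p_over_q = np.sqrt(2 / np.pi) * np.exp(x - 0.5 * x * x)
        if rng.uniform() <= p_over_q / C:     # accept with prob p/(C q)
            out.append(x if rng.uniform() < 0.5 else -x)  # random sign
    return np.array(out)

z = sample_gaussian(100_000)
print(z.mean(), z.var())          # close to 0 and 1
```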
Chapter 8

Week 7

We will next discuss Markov chain Monte Carlo methods. We give up on independent samples and instead generate samples from a Markov chain which are asymptotically distributed according to the target distribution.
where the transition density $T$ fulfils
\[ \Pr(X_{t+1} \in A | X_t = y) = \int_A T(x|y)\, dx. \]
From this we get the marginal distribution by integrating out the other variables:
\[ p_t(x) = \int p(x_t, x_{t-1}, \ldots, x_0)\, dx_{t-1} \ldots dx_0. \]
This is true because the integral over $y$ on the right hand side equals 1. We will construct $T(y|x)$ such that
\[ T(x|y)\, p(y) = T(y|x)\, p(x) \]
(detailed balance). If this condition is fulfilled for all $x$ and $y$, then the integral over $y$ yields stationarity. The Markov chain is called reversible.
8.2 Metropolis–Hastings method

This method defines a large class of MC algorithms which fulfil detailed balance. The user has to define a proposal distribution $q(x'|x)$. The Markov chain is generated as follows:

• Given a state $x = x_t$ of the Markov chain, generate a new state $x'$ with probability distribution $q(x'|x)$.
\[ A(x'; x) = \min\left\{1,\ \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)}\right\} \]

• Accept the new state, $x_{t+1} = x'$, with probability $A(x'; x)$. This is done by generating a uniformly distributed random variable $u \sim U(0,1)$; we accept $x'$ if $u \le A(x'; x)$. Reject the new state, i.e. keep the old state $x_{t+1} = x$, with probability $1 - A(x'; x)$.

where the term with the Dirac distribution takes into account that the chain stays in its old state when the proposal is rejected. The term $\alpha(x)$ can be obtained from the normalisation $1 = \int T(x'|x)\, dx'$. This leads to $\alpha(x) = \int A(x'; x)\, q(x'|x)\, dx'$.
We will show detailed balance using its definition. We will concentrate on the first part of the transition distribution and write
\[ A(x'; x)\, q(x'|x)\, p(x) = q(x'|x) \min\left\{1,\ \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)}\right\} p(x) = \min\left(q(x'|x)\, p(x),\ q(x|x')\, p(x')\right) \]
\[ = p(x')\, q(x|x') \min\left\{\frac{p(x)\, q(x'|x)}{p(x')\, q(x|x')},\ 1\right\} = A(x; x')\, q(x|x')\, p(x'). \]
Here we have used the fact that we can multiply both sides of the min operation with non-negative numbers. Since we also have $(1 - \alpha(x))\,\delta(x' - x) = (1 - \alpha(x'))\,\delta(x - x')$, detailed balance is proved.
An interesting property of the MH method is that only ratios of probabilities (densities) $\frac{p(x')}{p(x)}$ are required. Hence, we can work with un-normalised probabilities. This is very useful for Bayesian approaches, where the normalisation term is given by the evidence, which is often hard to compute.
8.2.2 Random walk sampler

The simplest idea is to work with a proposal that completely ignores the target. For continuous state spaces one may choose a move
\[ x' = x + \sqrt{\rho}\, z \]
where $z \sim \mathcal{N}(0, I)$. This proposal defines a random walk in state space. It is a symmetric proposal with $q(x'|x) = q(x|x')$. The acceptance probability is then simply
\[ A(x'; x) = \min\left\{\frac{p(x')}{p(x)},\ 1\right\}. \]
For symmetric proposals, one speaks of a Metropolis sampler.

The choice of $\rho$ is important for the performance of the algorithm. For large $\rho$, acceptance will be highly unlikely and the sampler gets stuck for a long time. Small $\rho$ will lead to high acceptance rates but to slow diffusion: the relevant states are visited only slowly. This is illustrated in the two figures, which show the random walk sampler applied to a two-dimensional Gaussian density. On the left we have $\rho = 1$ and on the right $\rho = 0.1$ (1000 samples).
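A minimal random walk Metropolis sampler for a 2-d Gaussian target (Python sketch; the target covariance and step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def log_p(x):  # unnormalised log target: -0.5 x^T Sigma^{-1} x
    return -0.5 * x @ Sigma_inv @ x

rho, n_samples = 1.0, 1000
x = np.zeros(2)
samples = []
for _ in range(n_samples):
    x_prop = x + np.sqrt(rho) * rng.normal(size=2)   # symmetric proposal
    # accept with probability min(1, p(x')/p(x))
    if np.log(rng.uniform()) <= log_p(x_prop) - log_p(x):
        x = x_prop
    samples.append(x)
print(np.cov(np.array(samples).T))    # roughly the target covariance
```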
This approach may remind us of the rejection method, but now the samples are
dependent. One may argue that the method could be useful if the proposal q is
similar to p. One may then achieve good acceptance rates.
However, one should be careful when using this method. The problems are illustrated in the following simple example.

8.2.4 Example

A class of target densities is defined by the exponential densities $p(x) = \lambda e^{-\lambda x}$, $x \ge 0$. We will use $q(x) = e^{-x}$, $x \ge 0$, as the proposal. The density ratio in the acceptance probability equals $\frac{p(x)}{q(x)} = \lambda e^{-(\lambda - 1)x}$. Obviously, for $\lambda < 1$, this ratio becomes unbounded!

For targets with $\lambda < 1$, the 'tail events' ($x$ large) are rarely proposed. But if such samples end up in the tails, the MH sampler stays there for a long time!

This behaviour is illustrated in the following three figures, where histograms of 10,000 MCMC steps are shown and compared to the exact density. The first case is obtained with $\lambda = 2$:
For λ = 0.1 one can see points in the tail, where the sampler was stuck for
a long time.
8.2.6 Gibbs as Metropolis–Hastings

We can understand Gibbs sampling as a special case of MH. The Gibbs proposal at component $i$ is given by
\[ q(x'|x) = p(x'_i|x_{-i})\, \delta(x'_{-i} - x_{-i}), \]
where the Dirac distribution takes care of the fact that the components $x_{-i}$ are not changed. The MH acceptance probability equals
\[ A(x'; x) = \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)} = \frac{p(x')\, p(x_i|x'_{-i})\, \delta(x_{-i} - x'_{-i})}{p(x)\, p(x'_i|x_{-i})\, \delta(x'_{-i} - x_{-i})} = \frac{p(x'_i|x_{-i})\, p(x_{-i})\, p(x_i|x_{-i})}{p(x_i|x_{-i})\, p(x_{-i})\, p(x'_i|x_{-i})} = 1. \]
Hence, the proposal is always accepted!
– $K$ has a discrete prior distribution $P(K)$.
\[ \prod_{i=1}^K e^{-\lambda_1} \frac{\lambda_1^{x_i}}{x_i!} \prod_{i=K+1}^n e^{-\lambda_2} \frac{\lambda_2^{x_i}}{x_i!}\; \frac{\eta_1^{a_1}}{\Gamma(a_1)} \lambda_1^{a_1 - 1} e^{-\eta_1 \lambda_1}\; \frac{\eta_2^{a_2}}{\Gamma(a_2)} \lambda_2^{a_2 - 1} e^{-\eta_2 \lambda_2} \]
We will show that the conditional distributions for the Gibbs sampler are
\[ \lambda_2 | \lambda_1, \eta_{1,2}, K, D \sim \mathrm{Gamma}\left(a_2 + \sum_{i=K+1}^n x_i,\ n - K + \eta_2\right) \]
\[ \eta_1 | \lambda_{1,2}, \eta_2, K, D \sim \mathrm{Gamma}(a_1 + b_1,\ \lambda_1 + c_1) \]
\[ K | \lambda_{1,2}, \eta_{1,2}, D \sim \mathrm{const} \times p(K)\, e^{-K(\lambda_1 - \lambda_2)} \left(\frac{\lambda_1}{\lambda_2}\right)^{\sum_{i=1}^K x_i} \]

Details

The main idea is to collect the terms in the joint distribution which depend on the variable to update, and normalise later: $p(A|B) = \frac{P(A,B)}{P(B)} \propto P(A,B)$.
• Starting from the joint distribution
\[ \prod_{i=1}^K e^{-\lambda_1} \frac{\lambda_1^{x_i}}{x_i!} \prod_{i=K+1}^n e^{-\lambda_2} \frac{\lambda_2^{x_i}}{x_i!}\; \frac{\eta_1^{a_1}}{\Gamma(a_1)} \lambda_1^{a_1 - 1} e^{-\eta_1 \lambda_1}\; \frac{\eta_2^{a_2}}{\Gamma(a_2)} \lambda_2^{a_2 - 1} e^{-\eta_2 \lambda_2} \]
we get
\[ p(\eta_1|\lambda_{1,2}, \eta_2, K, D) \propto \frac{\eta_1^{a_1}}{\Gamma(a_1)}\, e^{-\eta_1 \lambda_1} \times \eta_1^{b_1 - 1} e^{-c_1 \eta_1} \propto \eta_1^{b_1 + a_1 - 1}\, e^{-\eta_1(\lambda_1 + c_1)}. \]
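A compact Gibbs sweep for this model (Python sketch; the hyperparameters $a_{1,2}, b_{1,2}, c_{1,2}$, a flat prior $p(K)$ and the simulated counts are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, K_true = 100, 40
x = np.concatenate([rng.poisson(4.0, K_true), rng.poisson(8.0, n - K_true)])

a1 = a2 = 2.0; b1 = b2 = 1.0; c1 = c2 = 1.0     # assumed hyperparameters
lam1, lam2, eta1, eta2, K = 1.0, 1.0, 1.0, 1.0, n // 2
cumsum = np.concatenate([[0.0], np.cumsum(x)])   # cumsum[K] = sum_{i<=K} x_i

for sweep in range(2000):
    lam1 = rng.gamma(a1 + cumsum[K], 1 / (K + eta1))
    lam2 = rng.gamma(a2 + cumsum[n] - cumsum[K], 1 / (n - K + eta2))
    eta1 = rng.gamma(a1 + b1, 1 / (lam1 + c1))
    eta2 = rng.gamma(a2 + b2, 1 / (lam2 + c2))
    # discrete conditional for K (flat prior p(K)), computed in log space
    Ks = np.arange(1, n)
    logw = -Ks * (lam1 - lam2) + cumsum[Ks] * np.log(lam1 / lam2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    K = rng.choice(Ks, p=w)
print(K, lam1, lam2)   # K near 40, rates near 4 and 8
```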
Simulations

In the following plots we show histograms of the marginal posteriors obtained by Gibbs sampling. The first panel shows the observed data. The vertical black lines in the posterior distributions are the exact values from which the data were generated.
[Figures: number of disasters per year; histograms of the marginal posteriors $P(K)$, $P(\eta_1)$, $P(\eta_2)$, $P(\lambda_1)$, $P(\lambda_2)$.]
The second series of plots is generated for a case where the exact rates $\lambda_{1,2}$ are more similar to each other and inference becomes harder.
[Figures: number of disasters per year; histograms of the marginal posteriors $P(K)$, $P(\eta_1)$, $P(\eta_2)$, $P(\lambda_1)$, $P(\lambda_2)$ for the harder case.]
than making a parametric ansatz for such functions (with corresponding priors for their parameters), we consider a non-parametric approach where the prior is directly defined over a space of functions.

We will start with the well-known model
\[ y_i = f_\theta(x_i) + \nu_i \]
or Fourier series
\[ f_\theta(x) = \sum_{l=1}^K \left\{\theta_l \sin(2\pi l x) + \theta'_l \cos(2\pi l x)\right\}. \]
Using a Gaussian prior $p(\theta)$ and Gaussian noise, the posterior $p(\theta|\mathrm{Data})$ is also Gaussian.

For this class of models we have to specify (or estimate) $K$, the number of basis functions. It would be interesting to take the limit $K \to \infty$, allowing a Bayesian model to assume an unbounded complexity for modelling functions. But in this case, we would have infinitely many parameters $\theta_l$, which may not be easy to handle.

The Gaussian process approach to such nonparametric models is to assume a prior over functions which we write as $f(\cdot) \sim \mathcal{GP}(0, K)$. The figure illustrates what we are looking for. The left panel shows random functions generated
[Figure: two panels of GP samples vs. $x$.]

from the prior. The blue shading gives the prior uncertainty on the marginal variance of functions for each input $x$. The second panel illustrates the posterior over functions after observing four data points. The random functions generated from the posterior distribution are close to the data but show large variability for input points which are further away from observations. The shading measures the marginal posterior variance at each input $x$.
Proof: We use the Fourier integral $K(x) = \int_{-\infty}^\infty e^{i\omega x} \hat K(\omega)\, d\omega$. Thus
\[ \sum_{kl} a_k K(x_k - x_l)\, a_l = \int \hat K(\omega) \sum_{kl} a_k a_l\, e^{i\omega(x_k - x_l)}\, d\omega \]
\[ = \int \hat K(\omega) \left(\sum_k a_k e^{i\omega x_k}\right) \left(\sum_l a_l e^{-i\omega x_l}\right) d\omega = \int \hat K(\omega) \left|\sum_k a_k e^{i\omega x_k}\right|^2 d\omega \ge 0. \]
• Matérn kernels allow for an interpolation between RBF and OU. Here we
can control the smoothness of the random functions.
• Polynomial kernels: $K(x, x') = (1 + x \cdot x')^k$. These have sample paths which are themselves polynomials in $x$. In this way, we can recover a parametric model.
• To obtain new kernels, we can combine existing kernels: sums and products of kernels are also kernels.
[Figure: GP sample path.]
The next plots are samples from GPs with RBF kernels having two different length-scales: $K(x, x') = e^{-3(x-x')^2}$ and $K(x, x') = e^{-10(x-x')^2}$.

[Figure: two panels of GP samples with the two length-scales.]
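Drawing such prior samples needs only a Cholesky factor of the kernel matrix (Python sketch; the jitter term is a standard numerical stabilisation, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(0, 10, 200)

def rbf(x1, x2, gamma):
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

for gamma in (3.0, 10.0):
    K = rbf(x, x, gamma) + 1e-8 * np.eye(len(x))   # jitter for stability
    f = np.linalg.cholesky(K) @ rng.normal(size=len(x))  # one GP sample
    print(gamma, f[:3])
```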
Chapter 9
Week 8
In this section, we will continue with GPs and see how we can compute posterior
predictions. We will show that a closed form solution is possible for regression
with Gaussian noise. Hyperparameters can be learnt using the evidence.
We are interested in the posterior density
\[ p(v|y) = \int p(v|z)\, p(z|y)\, dz. \]
• This yields
\[ E[v|z] = -(\Omega_{vv})^{-1} \Omega_{vz}\, z, \qquad \mathrm{VAR}[v|z] = (\Omega_{vv})^{-1}. \]
Finally, to get an explicit result for $\Omega$, we have to compute the inverse of a block matrix.
• The general rule is
\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M & -M B D^{-1} \\ -D^{-1} C M & D^{-1} + D^{-1} C M B D^{-1} \end{pmatrix} \quad\text{with}\quad M = \left(A - B D^{-1} C\right)^{-1}. \]

• Thus finally
\[ E[v|z] = k_x^\top K^{-1} z, \qquad \mathrm{VAR}[v|z] = K(x, x) - k_x^\top K^{-1} k_x, \]
\[ p(z|y) = \mathcal{N}(z|\mu, S) \quad\text{with}\quad S = \left(K^{-1} + \frac{1}{\sigma^2} I\right)^{-1} \text{ and } \mu = \frac{1}{\sigma^2} S y. \]
We can use these results to get explicit analytical predictions:
• The posterior mean prediction is obtained by using our results for $p(v|z)$:
\[ E[v|y] = \int E[v|z]\, p(z|y)\, dz = k_x^\top K^{-1} \int z\, p(z|y)\, dz = k_x^\top K^{-1} E[z|y] \]
\[ = \frac{1}{\sigma^2} k_x^\top K^{-1} S y = \frac{1}{\sigma^2} k_x^\top K^{-1} \left(K^{-1} + \frac{1}{\sigma^2} I\right)^{-1} y = k_x^\top \left(K + \sigma^2 I\right)^{-1} y. \]

• This prediction is linear in the data $y$. It can also be written in the form
\[ \hat f(x) = \sum_{i=1}^n \alpha_i K(x, x_i), \]
which is similar to predictions with other non-Bayesian kernel machines (e.g. SVM).

• A further calculation shows that the uncertainty at a test input $x$ is obtained as $\mathrm{VAR}[v|y] = K(x, x) - k_x^\top (K + \sigma^2 I)^{-1} k_x$, which is independent of $y$.
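These two formulas are all that is needed for GP regression with Gaussian noise; a compact sketch (Python, toy data):

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.uniform(0, 10, 15)                  # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=15)   # noisy observations
sigma2 = 0.01

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K = rbf(X, X)
Xs = np.linspace(0, 10, 100)                # test inputs
ks = rbf(X, Xs)                             # k_x for each test point

alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)
mean = ks.T @ alpha                         # k_x^T (K + s2 I)^{-1} y
var = 1.0 - np.sum(ks * np.linalg.solve(K + sigma2 * np.eye(len(X)), ks),
                   axis=0)                  # K(x,x) = 1 for this RBF kernel
print(mean[:3], var[:3])                    # posterior mean and variance
```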
Mean and uncertainty (shaded region) for different numbers of data points (blue) are shown in the sequence of plots for a toy problem, where the exact function is shown in blue and the prediction in red.

[Plots for 1, 2, 3, 10, 15, and 30 observations.]
\[ p(y) = \int dz\; p(z)\, p(y|z) = \frac{1}{(2\pi)^{n/2} |\det(K + \sigma^2 I)|^{1/2}} \exp\left[-\frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y\right] \]
since
\[ y = z + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2 I), \qquad z \sim \mathcal{N}(0, K) \]
are jointly Gaussian and independent. Hence, the density of the observations is Gaussian,
\[ y \sim \mathcal{N}(0, \Sigma). \]
Thus $E[y] = 0$, and by independence we can add the two covariances to get
\[ \Sigma \doteq \mathrm{COV}[y] = E[y y^\top] = E[z z^\top] + \sigma^2 I = K + \sigma^2 I. \]
The following plots illustrate the maximisation of the evidence to get good values of the hyperparameters. We show the true function (red) and prediction (black) together with the data (blue dots) and the uncertainty (dashed). The first plot is obtained with non-optimal parameters; the predictions are too wiggly. The second plot shows the log-evidence as a function of the length scale $L$ of the RBF kernel.
The final plot displays predictions with optimised parameters.
9.3.2 GP Application: Inference for linear ordinary differential equations

We can apply GP inference to problems where the data are assumed to be generated from linear operations on latent functions. This is possible because linear operations on GPs lead to GPs!

Take e.g. a dynamical data model given by the ODE
\[ \frac{dx(t)}{dt} = -\lambda x(t) + f(t), \qquad y_i = x(t_i) + \nu_i, \quad i = 1, \ldots, n. \]
[Plots: inference with 10 observations and with 50 observations.]
generated from the simulator to approximate $f(x)$ with GP regression. This GP approximation (emulator) of $f(x)$ can be evaluated much faster than by running the simulator.

This could be applied e.g. to:

• Sensitivity analysis: How do the outputs change under small changes of the input?

• Uncertainty analysis: What is the uncertainty of the outputs based on the uncertainty in the inputs, modelled by a distribution $p(x)$?

Earlier work can be found at http://www.tonyohagan.co.uk/academic/GEM/index.html and the MUCM (Managing Uncertainty in Complex Models) page http://www.mucm.ac.uk.
with respect to $x$. This will take both the mean and the uncertainty of $g(x)$ into account. Note that the minimisation does not need any new evaluation of the true function $f$.

The observation model:
\[ \phi(x) \sim \mathcal{GP}(0, K), \qquad y_i = -\nabla\phi(x_i) + \epsilon_i. \]
9.5 Inference for Gaussian processes: Non-Gaussian observation models

The posterior density of the unknown function $f(x)$ at an input $x$ is
\[ p(f(x)|y) = \int p(f(x)|z_1, \ldots, z_n)\, p(z_1, \ldots, z_n|y)\, dz_1 \ldots dz_n. \]
For a Gaussian noise model $p(y_i|z_i)$, the integrals can be performed analytically. But a GP may also appear as a latent function in more complicated models. Take e.g.

• $y_i = f(x_i)$ + non-Gaussian noise
• Binary classification yi ∈ {0, 1} with p(y = 1|f (x)) = sigmoid[f (x)].
• ...
\[ p(z_1, \ldots, z_N|\mathrm{data}) = \frac{p(\mathrm{data}|z_1, \ldots, z_N)\, p(z_1, \ldots, z_N)}{p(\mathrm{data})} \]
\[ p(z_i|\mathrm{data}) = \int \frac{p(\mathrm{data}|z_1, \ldots, z_N)\, p(z_1, \ldots, z_N)}{p(\mathrm{data})}\, dz_1 \ldots dz_{i-1}\, dz_{i+1} \ldots dz_N \]
and
\[ p(\mathrm{data}) = \int dz_1 \ldots dz_N\; p(\mathrm{data}|z_1, \ldots, z_N)\, p(z_1, \ldots, z_N). \]
• or even simpler
\[ p(z_1, \ldots, z_N|\mathrm{data}) = \prod_i \Psi_i(z_i) \exp\left[\sum_{i<j} A_{ij} z_i z_j\right] \]
The minimum is
\[ \min_q E_q\left[\ln \frac{q(z)}{p(z, y)}\right] = -\ln p(y). \]
Proof:
\[ E_q\left[\ln \frac{q(z)}{p(z, y)}\right] = E_q\left[\ln \frac{q(z)}{p(z|y)\, p(y)}\right] = \int q(z) \ln \frac{q(z)}{p(z|y)}\, dz - \ln p(y) = D(q\|p(\cdot|y)) - \ln p(y). \]
9.8 The variational approximation

We approximate the 'intractable' posterior $p(z|y)$ by a 'close' distribution $q^*(z)$ (for fixed $y$), where $q^*$ is from a 'nice' family $\mathcal{F}$ of distributions (which allow us e.g. to perform efficient marginalisations). We will measure 'closeness' by the KL divergence and solve the variational problem.

9.8.1 Lingo

Older papers call the quantity
\[ E_q\left[\ln \frac{q(z)}{p(z, y)}\right] \]
the 'variational free energy'. It has to be minimised, and its original definition goes back to statistical physics. It upper bounds $-\ln p(y)$.

More recent papers call the variational objective
\[ E_q\left[\ln \frac{p(z, y)}{q(z)}\right] = -E_q\left[\ln \frac{q(z)}{p(z, y)}\right] \]
the ELBO (evidence lower bound). It has to be maximised and lower bounds $\ln p(y)$.
• The solution is
\[ q_1^{\mathrm{opt}}(z_1) \propto \exp\left\{E_{q_2^{\mathrm{opt}}}[\ln p(z_1, z_2, y)]\right\} \]
\[ q_2^{\mathrm{opt}}(z_2) \propto \exp\left\{E_{q_1^{\mathrm{opt}}}[\ln p(z_1, z_2, y)]\right\} \]
This is minimised by
\[ q_1(z_1) \propto \exp\left[\int q_2(z_2) \ln p(z_1, z_2|y)\, dz_2\right]. \]
This result shows that one can find a local optimum by performing a 'coordinate-wise' optimisation where $q_2$ is fixed and $q_1$ is optimised. Then the same is repeated with $q_1$ and $q_2$ exchanged.
The optimal solution is
\[ q_i(z) \propto \Psi_i(z) \exp\left[z \sum_{j \ne i} A_{ij} m_j\right] \]
with
\[ m_j = E_q[Z_j] = \frac{\int \Psi_j(z) \exp\left[z \sum_{k \ne j} A_{jk} m_k\right] z\, dz}{\int \Psi_j(z) \exp\left[z \sum_{k \ne j} A_{jk} m_k\right] dz} \]

with a fixed set $\{\Phi_1(x), \ldots, \Phi_K(x)\}$ of $K$ basis functions. The prior distribution on the weights is given by
\[ p(w|\alpha) = \left(\frac{\alpha}{2\pi}\right)^{K/2} \exp\left[-\frac{\alpha}{2}\sum_{j=1}^K w_j^2\right]. \]
\[ p(y, w, \alpha) = p(y|w)\, p(w|\alpha)\, p(\alpha) \]
We aim at a factorising approximation to the posterior:
\[ q(w) \propto \exp\left\{E_\alpha[\ln p(y, w, \alpha)]\right\} \propto p(y|w) \exp\left\{E_\alpha[\ln p(w|\alpha)]\right\} \propto p(y|w) \exp\left[-\frac{E_\alpha[\alpha]}{2}\sum_{j=1}^K w_j^2\right], \]
\[ q(\alpha) \propto \exp\left\{E_w[\ln p(y, w, \alpha)]\right\} \propto p(\alpha) \exp\left\{E_w[\ln p(w|\alpha)]\right\} \propto p(\alpha) \exp\left[-\frac{\alpha}{2}\sum_{j=1}^K E_w[w_j^2]\right]. \]
Chapter 10
Week 9
the mode. The Gaussian variational approximation is some kind of Laplace approximation 'on average'.

The following plot illustrates the difference between the two approximations for a one-dimensional density. Approximating the green density by the Laplace method yields the green Gaussian. The variational method gives the blue Gaussian instead. While Laplace is entirely local, the variational Gaussian is able to incorporate more of the probability mass of the true density.

which play a role as posteriors for GP models. Our result on the equation for the optimal covariance states that
\[ \Sigma^{-1} = K^{-1} + \mathrm{diag}\left(E_q\left[\frac{\partial^2 V_n(y_n, z_n)}{\partial z_n^2}\right]\right). \]
This means that only the $d$ diagonal elements of the inverse covariance are unknown quantities.

The following plot shows an application of the Gaussian variational approximation to GP inference of an unknown function using observations which are corrupted by Cauchy noise. This leads to larger outliers compared to Gaussian noise. The first plot shows GP inference using a Gaussian likelihood, which ignores the knowledge that the noise is Cauchy. The resulting mean prediction interprets the strong fluctuations of the observations as resulting from the true function. The inferred curve is more wiggly compared to the truth.

[Figure: GP inference with a Gaussian likelihood under Cauchy noise.]
[Figure: GP inference using the Cauchy noise model.]
where $p(s, B)$ is the joint prior. We would like to approximate the posterior $p(s, B|y)$ using a sparse likelihood by setting

We can find the best likelihood $\hat L(s)$ (in the variational sense) by minimising
\[ E_q\left[\ln \frac{p(s, B)\, \hat L(s)}{p(s, B)\, p(y|s, B)}\right]. \]
The optimal likelihood is given by
\[ \hat L(s) \propto \exp\left[\int \ln[p(y|B, s)]\, p(B|s)\, dB\right]. \]
Proof:
\[ E_q\left[\ln \frac{p(s, B)\, \hat L(s)}{p(s, B)\, p(y|s, B)}\right] = E_q\left[\ln \frac{\hat L(s)}{p(y|s, B)}\right] = E_q\left[\ln \hat L(s)\right] - E_q[\ln p(y|s, B)]. \]
Hence
\[ q(B|s) \propto p(s, B), \qquad\text{i.e.}\qquad q(B|s) \propto p(B|s). \]
\[ z_s = \{f(x)\}_{x \in \text{inducing points}}. \]
This result can be understood from the fact that

where for a Gaussian distribution, the conditional covariance does not depend on $z_s$. An explicit calculation shows that

By using the ELBO, one can optimise the locations of the 'inducing points'.

The posterior over all latent variables (assuming a GP prior $p(z) = \mathcal{GP}(0, K)$ with kernel $K$) is
\[ p(z, \{\omega_n\}_{n=1}^N|y) \propto p(z) \prod_{n=1}^N p_{PG}(\omega_n)\, e^{\frac{y_n z_n}{2} - \omega_n \frac{z_n^2}{2}} \]
We can treat this augmented model by variational inference using a structured mean field approximation
\[ p(z, \{\omega_n\}_{n=1}^N|y) \approx q(z, \{\omega_n\}_{n=1}^N) \doteq q_1(\{\omega_n\}_{n=1}^N)\; q_2(z). \]
\[ E_1[\omega_n] = \frac{\int_0^\infty p_{PG}(\omega)\, \omega\, e^{-\frac{E_2[z_n^2]}{2}\omega}\, d\omega}{\int_0^\infty p_{PG}(\omega)\, e^{-\frac{E_2[z_n^2]}{2}\omega}\, d\omega} = -\frac{d}{dt}\ln \int_0^\infty p_{PG}(\omega)\, e^{-\omega t}\, d\omega\,\bigg|_{t = E_2[z_n^2]/2} = \frac{d}{dt}\ln\cosh\left(\sqrt{\frac{t}{2}}\right)\bigg|_{t = E_2[z_n^2]/2} \]
The last line follows from the basic definition (set $t = x^2/2$):
\[ \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{\frac{x}{2}}}{2\cosh(\frac{x}{2})} = \frac{1}{2} e^{\frac{x}{2}} \int_0^\infty e^{-\frac{x^2}{2}\omega}\, p_{PG}(\omega)\, d\omega. \]
Chapter 11

Week 10

This is represented as a sample average! The second method is the well known
11.1.1 Reparametrisation trick

We will give the main idea for a simple one-dimensional case, where the variational density is assumed to be Gaussian. We represent the random variable $z$ as a transformation of a random variable $u$ with a fixed distribution. Hence
\[ E_q\left[\ln \frac{q_\phi(z)}{p(z, y)}\right] = E_u\left[\ln \frac{q_\phi(\mu + \sigma u)}{p(\mu + \sigma u, y)}\right] \]
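Gradients w.r.t. $\phi = (\mu, \sigma)$ can then be taken through the samples; a minimal sketch (Python) fitting a Gaussian $q$ to a toy target by stochastic gradient descent — the target, step size and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(13)

def dlogp(z):            # score of an unnormalised target, here N(2, 0.5^2)
    return -(z - 2.0) / 0.25

mu, log_sigma, lr = 0.0, 0.0, 0.05
for step in range(2000):
    sigma = np.exp(log_sigma)
    u = rng.normal(size=64)              # fixed base distribution
    z = mu + sigma * u                   # reparametrised samples
    # gradients of E_u[ln q(z) - ln p(z)] w.r.t. (mu, log_sigma);
    # the entropy term contributes -1 to the log_sigma gradient
    g_mu = -dlogp(z).mean()
    g_ls = -1.0 - (u * dlogp(z)).mean() * sigma
    mu -= lr * g_mu
    log_sigma -= lr * g_ls
print(mu, np.exp(log_sigma))   # approaches the target's mean 2 and std 0.5
```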
since $x^\alpha$ is concave for $0 < \alpha \le 1$, with equality for $p \equiv q$. We also get a bound on the evidence:
\[ F_\alpha \ge -(p(y))^{1-\alpha}. \]
We can also recover the KL divergence, because in the limit $\alpha \to 1$ we get
\[ F_\alpha \doteq -E_q\left[\left(\frac{p(z, y)}{q_\phi(z)}\right)^{1-\alpha}\right] = -E_q\left[\exp\left((1-\alpha)\ln \frac{p(z, y)}{q_\phi(z)}\right)\right] \approx -1 - (1-\alpha)\, E_q\left[\ln \frac{p(z, y)}{q_\phi(z)}\right]. \]
which is minimised by the true marginal $q_i = p_i$. On the other hand, for exponential families we get
\[ -\int dz\; p(z) \ln q(z) = \mathrm{const} - \psi(\theta) \cdot E_p[\phi(z)] - \ln g(\theta). \]
Since $\nabla_\psi \ln g(\theta(\psi)) = -E_q[\phi(z)]$, taking the gradient w.r.t. $\psi$ yields moment matching for the optimal $\psi$:
\[ E_q[\phi(z)] = E_p[\phi(z)]. \]
11.3.1 Bayes Online (Assumed Density Filtering)

We can still try this procedure by a further approximation using an online algorithm. Let us consider the exact update of the posterior when new data $y_{t+1}$ arrive:
\[ p(z|D_{t+1}) = \frac{p(y_{t+1}|z)\, p(z|D_t)}{\int dz\; p(y_{t+1}|z)\, p(z|D_t)}. \]
We replace the exact $p(z|D_t)$ by a parametric approximation $q(z|\mathrm{par}(t))$ using the following steps:

1. Update:
\[ q(z|y_{t+1}, \mathrm{par}(t)) = \frac{p(y_{t+1}|z)\, q(z|\mathrm{par}(t))}{\int dz\; p(y_{t+1}|z)\, q(z|\mathrm{par}(t))}. \]

2. Project: Minimise
\[ D\left(q(\cdot|y_{t+1}, \mathrm{par}(t))\,\|\, q(\cdot|\mathrm{par})\right). \]
with
\[ B_{ij} = \int dy\; p^*(y)\, \partial_i \ln p(y|z^*)\, \partial_j \ln p(y|z^*) \]
\[ A_{ij} = -\int dy\; p^*(y)\, \partial_i \partial_j \ln p(y|z^*). \]
This is the same error rate as for batch algorithms (maximum likelihood or Bayes): hence, we get asymptotic efficiency! One can also show that the algorithm is asymptotically equivalent to natural gradient online learning.

The following plot shows test errors (probability of misclassification) for a toy probit model, with spherical Gaussian inputs ($d = 50$) and a realizable target ($\alpha \doteq \frac{\#\mathrm{data}}{d}$). The dashed line is an analytical result for the quality of a Bayes optimal batch algorithm.
\[ \frac{dz_i(t)}{dt} \doteq \phi_t(z_i(t)) \]
We must define a mapping $\phi_t(\cdot)$ such that for $t \to \infty$, the density of the particles $z_i(t) \sim q_\infty(\cdot)$ is close to the posterior $p(z|y)$.
The basic idea for obtaining the mapping is to construct $\phi_t(\cdot)$ in such a way that the change (decrease) of the KL divergence $\frac{dD(q_t\|p(\cdot|y))}{dt}$ is large. To work out the details, we need to know:

• How does $q_t$ change over time?

• What is $\frac{dD(q_t\|p(\cdot|y))}{dt}$?

• We need to specify a family $\mathcal{G}$ of mapping functions $\phi_t(\cdot)$ and be able to maximise the decrease of the KL!

• We need to express all expectations by sample means.

• Finally, in practice we use discrete time steps
\[ z_i(t+1) = z_i(t) + \phi_t(z_i(t)). \]

In the last step, we have performed an integration by parts. On the other hand, we can express the same quantity by the change of the density:
\[ \frac{d}{dt}E[g(Z(t))] = \frac{d}{dt}\int q_t(z)\, g(z)\, dz = \int \frac{dq_t(z)}{dt}\, g(z)\, dz. \]
Since both expressions hold for arbitrary functions $g$ (assuming some smoothness), we conclude that
\[ \frac{dq_t(z)}{dt} = -\nabla \cdot (q_t(z)\, \phi_t(z)). \]
11.4.2 Change of KL

We will next address $\frac{dD(q_t\|p(\cdot|y))}{dt}$. A direct calculation yields
\[ \frac{d}{dt}\int q_t(z) \ln q_t(z)\, dz - \frac{d}{dt}\int q_t(z) \ln p(z|y)\, dz = \]
\[ -\int \nabla\cdot(q_t(z)\phi_t(z)) \ln q_t(z)\, dz + \int \nabla\cdot(q_t(z)\phi_t(z)) \ln p(z|y)\, dz = \]
\[ \int q_t(z)\, \phi_t(z)\cdot\nabla \ln q_t(z)\, dz - \int q_t(z)\, \phi_t(z)\cdot\nabla \ln p(z|y)\, dz = \]
\[ \int \nabla q_t(z)\cdot\phi_t(z)\, dz - \int q_t(z)\, \phi_t(z)\cdot\nabla \ln p(z|y)\, dz = \ldots \]
To obtain the third line and the last line, we have performed an integration by parts. The operator inside the bracket is known as Stein's operator.

For specific families of functions $\mathcal{F}$, this can be solved in closed form: the Stein variational gradient descent (SVGD) algorithm chooses functions $\phi$ in the Reproducing Kernel Hilbert Space (RKHS) given by some p.d. kernel $K(z, z')$.
where log means the binary logarithm. This gives a measure of the uncertainty about X and also a measure of the information contained in observing X.
This is illustrated by the two figures. The left one shows a discrete distribution with zero probability for all values except a single value which has P = 1: there is no surprise in a realisation of X and the entropy is 0. The right one shows a uniform distribution: all values are equally probable and we have maximal surprise and entropy in observing a realisation of X.
which is NOT the entropy of the conditional distribution (but its expectation).
The relative entropy (KL divergence) is given by
$$D(p\|q) \doteq \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0,$$
The mutual information can also be expressed as
$$I(X, Y) \doteq D\big(p(x,y)\,\|\,p(x)p(y)\big) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$
$$= \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} = \sum_{x,y} p(x,y) \log \frac{p(y|x)}{p(y)} = H(X) - H(X|Y) = H(Y) - H(Y|X).$$
Hence, conditioning reduces entropy: we have
$$I(X, Y) = H(X) - H(X|Y) \ge 0,$$
and thus
$$H(X|Y) \le H(X),$$
with equality if and only if X and Y are independent.
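These identities are easy to check numerically; a short sketch with an arbitrary 2x2 joint distribution (the numbers are made up for illustration):

import numpy as np

p = np.array([[0.3, 0.2],        # hypothetical joint distribution p(x, y)
              [0.1, 0.4]])

def H(dist):
    d = dist[dist > 0]
    return -(d * np.log2(d)).sum()

px, py = p.sum(axis=1), p.sum(axis=0)        # marginals p(x), p(y)
H_X_given_Y = H(p.ravel()) - H(py)           # H(X|Y) = H(X,Y) - H(Y)
I = H(px) - H_X_given_Y                      # I(X,Y) = H(X) - H(X|Y) >= 0
print(f"H(X) = {H(px):.3f}, H(X|Y) = {H_X_given_Y:.3f}, I(X,Y) = {I:.3f}")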
We have $p(x_{t+1}|x_t) = q(x_{t+1}|x_t)$ (both chains evolve under the same transition probability). Thus
$$D\big(p(x_{t+1})\,\|\,q(x_{t+1})\big) + \sum_{x_{t+1}} p(x_{t+1}) \sum_{x_t} p(x_t|x_{t+1}) \log \frac{p(x_t|x_{t+1})}{q(x_t|x_{t+1})} \ge D\big(p(x_{t+1})\,\|\,q(x_{t+1})\big),$$
since the second term is itself a (conditional) KL divergence and hence nonnegative. By the chain rule the left-hand side equals $D(p(x_t)\,\|\,q(x_t))$, so the KL divergence between the state distributions of the two chains is non-increasing in $t$.
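A quick numerical illustration of this monotonicity (the transition matrix and the two initial distributions are arbitrary choices):

import numpy as np

Q = np.array([[0.9, 0.1],            # assumed common transition matrix
              [0.2, 0.8]])
p = np.array([1.0, 0.0])             # two different initial distributions
q = np.array([0.2, 0.8])

def KL(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

for t in range(5):
    print(f"t = {t}: D(p_t || q_t) = {KL(p, q):.4f}")   # strictly decreasing here
    p, q = p @ Q, q @ Q              # both chains evolve under the same kernel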
A → 10
B → 00
C → 11
D → 110
Decoding the stream 11000 bit by bit gives:
1 → A?, C?, D?
11 → C?, D?
110 → CB?, D?
1100 → CB?, DB?
11000 → DB
We cannot decide on the value of the first symbol until we arrive at the end of the sequence, because C = 11 is a prefix of D = 110.
Hence, we define instantaneous (prefix) codes: no codeword is a prefix of any other codeword. Prefix codes can be represented by a code tree: no codeword is an ancestor of another codeword. This can be seen in the figure, where codewords appear only at the leaves.
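A tiny check of the prefix property for the code above (the neighbour comparison after sorting suffices, because a codeword that is a prefix of another is also a prefix of its immediate successor in lexicographic order):

def is_prefix_free(codewords):
    # sort, then test each codeword only against its lexicographic successor
    cws = sorted(codewords)
    return not any(b.startswith(a) for a, b in zip(cws, cws[1:]))

print(is_prefix_free(["10", "00", "11", "110"]))   # False: 11 is a prefix of 110
print(is_prefix_free(["10", "00", "11", "010"]))   # True: a valid prefix code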
11.6.1 Kraft inequality
One would like to assign code lengths to source symbols which are as small as possible. But there are limits: for any prefix code with $m$ codewords, the code lengths $l_1, l_2, \ldots, l_m$ satisfy (for a binary alphabet) Kraft's inequality
$$\sum_i 2^{-l_i} \le 1.$$
To prove this, we note that in a code tree each codeword eliminates its descendants as codewords. Let $l_{max}$ be the length of the longest codeword. Each node in the tree at level $l_{max}$ can be a codeword, a descendant of a codeword, or neither. Now consider a codeword at level $l_i$:
1. It has $2^{l_{max} - l_i}$ descendants at level $l_{max}$.
There is a converse to Kraft's inequality: for any set of integers $l_1, l_2, \ldots, l_m$ fulfilling Kraft's inequality, one can construct a prefix code.
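A sketch of the standard canonical construction for binary codewords: sort the lengths and assign, at each length, the next unused node of the code tree:

def build_prefix_code(lengths):
    """Construct binary codewords with the given lengths (canonical code).
    Works whenever the lengths satisfy Kraft's inequality."""
    assert sum(2.0 ** -l for l in lengths) <= 1, "Kraft's inequality violated"
    code, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)           # descend to level l in the tree
        code.append(format(next_val, f"0{l}b"))
        next_val += 1                         # block off this subtree
        prev_len = l
    return code

print(build_prefix_code([2, 2, 2, 3]))        # -> ['00', '01', '10', '110']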
One might assume that one could do better than Kraft by relaxing the prefix assumption. But you can't beat the system: any uniquely decodable code fulfils Kraft! Proof: Consider a code which concatenates $k$ codewords. The length for encoding the source symbols $x \doteq x_1, \ldots, x_k$ is then
$$l(x) = \sum_{i=1}^{k} l(x_i).$$
Let us look at
$$\left(\sum_{x} D^{-l(x)}\right)^{k} = \sum_{x_1, \ldots, x_k} D^{-l(x_1)} \cdots D^{-l(x_k)} = \sum_{m=1}^{k\, l_{max}} a(m)\, D^{-m} \le \sum_{m=1}^{k\, l_{max}} D^{m} D^{-m} = k\, l_{max},$$
where $a(m)$ denotes the number of concatenated sequences with total code length $m$, and unique decodability implies $a(m) \le D^m$. Taking the $k$-th root and letting $k \to \infty$ yields Kraft's inequality, since $(k\, l_{max})^{1/k} \to 1$.
probabilities $x \sim p(x)$. The expected code length is given by
$$L \doteq E[l(x)] = \sum_x p(x)\, l(x),$$
and it satisfies
$$L \ge H_D(X),$$
where $H_D(X) \doteq -\sum_x p(x) \log_D p(x)$. Hence, the entropy is the minimum expected code length that can be achieved. Proof:
$$L - H_D(X) = \sum_x p(x)\left(l(x) + \log_D p(x)\right) = \sum_x p(x)\left(-\log_D D^{-l(x)} + \log_D p(x)\right).$$
We define new probabilities by $r(x) \doteq \frac{D^{-l(x)}}{\sum_x D^{-l(x)}}$. Hence
$$L - H_D(X) = \sum_x p(x) \log_D \frac{p(x)}{r(x)} - \log_D\left(\sum_x D^{-l(x)}\right) \ge 0,$$
where we have used the fact that the KL divergence is $\ge 0$ and Kraft's inequality to bound the last term.
Thus, the KL divergence measures the extra expected code length needed for compression when the true probabilities of the source are not known.
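A concrete check of this interpretation, using idealised code lengths $l(x) = -\log_2 q(x)$ for a dyadic toy source (the numbers are chosen for illustration):

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source probabilities
q = np.array([0.25, 0.25, 0.25, 0.25])    # wrong model used to build the code

H = -(p * np.log2(p)).sum()               # optimal expected length: 1.75 bits
L = -(p * np.log2(q)).sum()               # expected length with q's code: 2 bits
KL = (p * np.log2(p / q)).sum()           # extra cost: 0.25 bits
print(H, L, KL, np.isclose(L, H + KL))    # L = H + KL(p||q)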
Chapter 12
Week 11
In this chapter, we will briefly discuss information theory and gambling, differential entropy and MaxEnt estimation and, finally, the minimum description length method. The latter is a compression approach to model selection and can serve as another justification of Bayesian methods.
which is maximal when the negative KL divergence is zero.
and can be negative! To understand its relation to the discrete entropy $H(X)$, we consider $X^\Delta$, a quantised version of $X$ (using small bins, see the figure), where for small $\Delta$ we have $P(X^\Delta) \approx f(X^\Delta)\Delta$. Hence, the discrete entropy equals
$$H(X^\Delta) = -\sum_{X^\Delta} P(X^\Delta) \log P(X^\Delta) \approx h(X) - \log \Delta.$$
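This relation is easy to verify numerically; a sketch for a standard Gaussian, whose differential entropy $h(X) = \frac{1}{2}\log_2(2\pi e)$ bits is known in closed form (the grid range is an ad-hoc truncation):

import numpy as np

h = 0.5 * np.log2(2 * np.pi * np.e)          # differential entropy of N(0,1), bits
for Delta in [0.5, 0.1, 0.01]:
    grid = np.arange(-10, 10, Delta)          # bin centres
    P = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi) * Delta   # P(X^Delta) ~ f*Delta
    P /= P.sum()                              # correct the tiny truncation error
    H = -(P * np.log2(P)).sum()
    print(f"Delta = {Delta}: H = {H:.4f} vs h - log2(Delta) = {h - np.log2(Delta):.4f}")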
We will have similar definitions for the relative entropy, mutual information, etc., e.g.
$$D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx.$$
Of course, usually this is not enough to specify the density f. How should we model the density f(x) while making the least additional assumptions?
The MaxEnt principle (Jaynes) suggests looking for the density with the largest entropy, i.e. to maximise
$$h(X) = -\int f(x) \ln f(x)\, dx \quad \text{such that} \quad \int f(x)\, r_j(x)\, dx = \alpha_j \quad \text{for } j = 1, \ldots, k.$$
This can be solved by introducing Lagrange multipliers for the constraints:
$$L(f) = -\int f(x) \ln f(x)\, dx + \sum_{j=1}^{k} \lambda_j \int f(x)\, r_j(x)\, dx + \lambda_0 \int f(x)\, dx.$$
From the second to the third line, we used the fact that f and g have the same expectations for the functions $r_j$. The result shows that the entropy of g is never larger than that of f.
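For reference, setting the functional derivative of $L$ with respect to $f(x)$ to zero (a step the notes leave implicit) gives the exponential-family form of the maximiser:
$$\frac{\delta L}{\delta f(x)} = -\ln f(x) - 1 + \sum_{j=1}^{k} \lambda_j r_j(x) + \lambda_0 = 0
\;\Longrightarrow\;
f^*(x) = \exp\Big(\lambda_0 - 1 + \sum_{j=1}^{k} \lambda_j r_j(x)\Big),$$
with the multipliers chosen so that $f^*$ is normalised and satisfies the $k$ constraints; the examples below are special cases of this form.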
12.0.7 Examples:
• MaxEnt applied to $E[X] = \mu$, $X \ge 0$ yields the exponential density:
$$f^*(x) = \frac{1}{\mu}\, e^{-x/\mu}.$$
• A Gaussian is obtained for constraints on the mean and variance: $E[X] = \alpha_1$ and $E[X^2] = \alpha_2$.
with $C(k) = E[X_i X_{i+k}]$. Unfortunately, the estimation of $C(k)$ from random samples becomes poor when $k$ is large! The figures (taken from Josef Honerkamp's book Stochastic Dynamical Systems) show that it does not help to just increase the length $T$ of the observations. The top figure shows a time series generated from an AR process, the middle figure gives the estimate of $C(\omega)$ using $T = 1024$, and the bottom figure the spectrum estimate for $T = 4096$; the estimated spectrum does not converge. The reason is that, although we get more data for estimating correlations with smaller time lags, the number of badly estimated correlations also increases. A possible solution is to smooth the estimate over a small window.
12.0.9 Spectrum estimation using MaxEnt
MaxEnt proposes a different solution. We use a set of correlations $\alpha_k = E[X_i X_{i+k}]$ of a stationary stochastic process for $k = 0, \ldots, p$ (where we restrict ourselves to $p$ small enough so that these are well estimated) as constraints for a MaxEnt problem. Since we specify second moments, the optimal solution is a Gaussian (process). One can show that it can be represented by an AR process of the form
$$X_i = \sum_{j=1}^{p} a_j X_{i-j} + Z_i.$$
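A numerical sketch of this idea (the AR order, the toy AR(2) data generator and all names are assumptions; the coefficients $a_j$ follow from the Yule-Walker equations $\sum_j a_j C(k-j) = C(k)$ for $k = 1, \ldots, p$):

import numpy as np

rng = np.random.default_rng(0)
T, p = 4096, 4
x = np.zeros(T)
for i in range(2, T):                        # toy AR(2) process for illustration
    x[i] = 1.2 * x[i-1] - 0.8 * x[i-2] + rng.normal()

# estimate only the low-lag correlations C(0), ..., C(p)
C = np.array([np.dot(x[:T-k], x[k:]) / T for k in range(p + 1)])

# Yule-Walker: solve sum_j a_j C(k-j) = C(k) for k = 1..p
R = np.array([[C[abs(k - j)] for j in range(1, p + 1)] for k in range(1, p + 1)])
a = np.linalg.solve(R, C[1:])
s2 = C[0] - a @ C[1:]                        # innovation variance of Z_i

# MaxEnt (AR) spectrum: S(w) = s2 / |1 - sum_j a_j exp(-i w j)|^2
w = np.linspace(0, np.pi, 512)
S = s2 / np.abs(1 - np.exp(-1j * np.outer(w, np.arange(1, p + 1))) @ a) ** 2
print("estimated AR coefficients:", np.round(a, 3))   # close to [1.2, -0.8, 0, 0]
print("spectral peak at omega =", w[np.argmax(S)])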
12.1 Information theory and Model selection
This is mainly based on Jorma Rissanen's work. The idea is that for encoding data we need a statistical model of the data. Good statistical models allow for a good compression of the data! The goal is to use the MDL, the minimum description length of the data for a given model, as a yardstick for model selection.
Let us use a two-stage coding for compressing the observed data.
12.1.1 Stochastic complexity
So far we have not obtained the full Bayesian evidence as a result of compression, only an approximation. We can do better: for encoding we just need a probability distribution over data sets $D$, not necessarily a two-stage code. We can use the following probability over data sequences $D$:
$$P(D) = \sum_{\theta^\Delta} P(D|\theta^\Delta)\, P(\theta^\Delta).$$
$P(D)$ yields a better code: the new code length based on $P(D)$ is smaller than or equal to the old one, which is based on the two-stage approach:
$$-\log P(D) = -\log \sum_{\theta^\Delta} 2^{-CL(\theta^\Delta)} \le -\log \max_{\theta^\Delta} 2^{-CL(\theta^\Delta)} = CL(\theta^\Delta_{opt}).$$
The relation to the Bayesian evidence becomes evident when we take the limit $\Delta \to 0$:
$$P(D) = \sum_{\theta^\Delta} P(D|\theta^\Delta)\, P(\theta^\Delta) \approx \sum_{\theta^\Delta} P(D|\theta^\Delta)\, p(\theta^\Delta)\, \Delta \approx \int P(D|\theta)\, p(\theta)\, d\theta,$$
where in the last step we have approximated the sum by an integral. This equals the Bayesian evidence; in the coding context it is also known as the stochastic complexity.
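A small numerical check for a Bernoulli model with a discretised uniform prior (the grid spacing, the data and the prior are illustrative assumptions): the mixture code length is never larger than the two-stage one, and for small $\Delta$ it approximates the negative log evidence.

import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
n1, n0 = data.sum(), data.size - data.sum()

delta = 0.01
theta = np.arange(delta / 2, 1.0, delta)         # grid theta^Delta
prior = np.full(theta.size, 1.0 / theta.size)    # P(theta^Delta), uniform

lik = theta**n1 * (1 - theta)**n0                # P(D|theta^Delta)

CL = -np.log2(lik) - np.log2(prior)              # two-stage: code theta, then D
two_stage = CL.min()
mixture = -np.log2((lik * prior).sum())          # -log2 P(D): stochastic complexity

print(f"two-stage: {two_stage:.2f} bits, mixture: {mixture:.2f} bits")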