PBM Notes


Probabilistic and Bayesian Modelling in Machine Learning and Artificial Intelligence

Manfred Opper & Théo Galy–Fajou

July 20, 2020


0.1 Background reading
• Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer,
2006.
• Information Theory, Inference, and Learning Algorithms, David J C MacKay,
Cambridge University Press, 2003.
• Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, The MIT Press, 2012.
• Computer Age Statistical Inference, Bradley Efron and Trevor Hastie,
Cambridge University Press, 2016.

Chapter 1

Week 1

1.1 Motivation
This module is about probabilistic (statistical) models for data and how to fit these models to data (solving the inverse problem). Take, e.g., the problem of curve fitting using a function f.

A possible function might look like this; another one, the red curve, fits better but seems highly complex.

Which one makes more sense?

1.1.1 A statistical model for curve fitting


A possible solution is the assumption of a generative model:
$$y_i = f_\theta(x_i) + \nu_i \qquad (1.1)$$
We assume that there is a
• 'true' function $f_\theta(x_i)$ with unknown parameters $\theta$;
• observations are not perfect: independent Gaussian noise $\nu_i$ of variance $\sigma^2$ is added.

We can introduce a likelihood function of the parameter $\theta$:
$$p(\mathrm{Data}|\theta) = \prod_{i=1}^{n} p(y_i|\theta). \qquad (1.2)$$

It tells us how likely a parameter is under the model (it is not yet a probability distribution over the parameter). If we have prior knowledge about parameters, we can attempt a Bayesian solution: assuming a prior distribution p(θ), we obtain a posterior probability distribution over parameters, which also yields measures of uncertainty.

1.2 Some reminders on basic probability


1.2.1 A little calculation on expectations
Let's make sure that we understand how to deal with expectations. As an example, let us prove that $E[(X - E(X))^2] = E[X^2] - (E(X))^2$. We go through this step by step:
$$
\begin{aligned}
E[(X - E[X])^2] &= E\left[X^2 - 2XE[X] + (E[X])^2\right] \\
&= E[X^2] - E[2XE[X]] + E[(E[X])^2] && \text{linearity} \\
&= E[X^2] - 2E[X]E[X] + E[(E[X])^2] && \text{take nonrandom factor } 2E[X] \text{ out} \\
&= E[X^2] - 2E[X]E[X] + (E[X])^2 && \text{last term is nonrandom} \\
&= E[X^2] - (E[X])^2
\end{aligned}
$$

1.2.2 Conditional expectation and optimal prediction


If X, Y are dependent with density p(x, y), we can make a prediction about Y when we observe X = x. What is the best predictor $\hat{Y} = \phi(x)$ of Y?
We will show that $\hat{Y} = E[Y|X = x]$ minimises the mean squared error of predicting Y by a function $\phi(X)$:
$$E[Y|X = x] = \arg\min_{\phi} E\left[(Y - \phi(X))^2 \mid X = x\right] \qquad (1.3)$$
Proof: Let us minimise the mean squared error by differentiation:
$$\frac{d}{d\phi(x)} E\left[(Y - \phi(x))^2 \mid X = x\right] = \frac{d}{d\phi(x)} \int (y - \phi(x))^2\, p(y|x)\, dy =$$
$$2\int (\phi(x) - y)\, p(y|x)\, dy = 2\phi(x) \int p(y|x)\, dy - 2\int y\, p(y|x)\, dy =$$
$$2\left\{\phi(x) - E[Y|X = x]\right\} = 0$$
We have used the normalisation $\int p(y|x)\, dy = 1$.
To get a concrete example of probability densities, we will look at univariate
and multivariate Gaussian densities.

1.2.3 1-dimensional Gaussian density
The density of a one-dimensional Gaussian random variable X is given by
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1.4)$$
We write $X \sim \mathcal{N}(\mu, \sigma^2)$. It is easy to see that the parameter $\mu = E(X)$ is the mean. It is a bit harder to show that $\sigma^2 = E[(X - \mu)^2]$ is the variance.

1.2.4 The d-dimensional Gaussian distribution

Let $X = (X_1, \ldots, X_d)^T$ and $\mu = (\mu_1, \ldots, \mu_d)^T$.
The multivariate Gaussian density is the joint density of the components of the random vector $X \sim \mathcal{N}(\mu, \Sigma)$, given by
$$p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right] \qquad (1.5)$$
$\mu = E[X]$ is the mean vector (show this!) and $\Sigma$ is the $d \times d$ positive definite covariance matrix. One can show that
$$\Sigma = E\left[(X - \mu)(X - \mu)^T\right] \qquad (1.6)$$
The components are
$$\Sigma_{ij} = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]$$
Note: The notation $p(x|\mu, \sigma^2)$ suggests that we condition on the parameters $\mu, \sigma^2$. This would mean we consider $\mu, \sigma^2$ as random variables. This is indeed the Bayesian perspective, which we will discuss in later chapters. For the moment, think of it as just a notation.
Example: Lines of constant density and random samples for a two-dimensional Gaussian with mean $\mu = (7, 7)^T$ and covariance matrix $\Sigma = \begin{pmatrix} 16.6 & 6.8 \\ 6.8 & 6.4 \end{pmatrix}$. (Figure: contour lines of the density with scattered samples.)

1.2.5 Properties of Gaussian random variables

• One can show: Linear combinations of jointly Gaussian random variables are Gaussian. Marginal & conditional densities of jointly Gaussian random variables are also Gaussian.
• We can generate Gaussian distributed random vectors X with mean $\mu$ and covariance matrix $\Sigma$ from vectors Z with independent standard normal components, $E(Z_i Z_j) = \delta_{ij}$, by performing the Cholesky decomposition $\Sigma = AA^\top$ and then setting $X = \mu + AZ$ (a code sketch follows after this list).
Proof: Linear transformations preserve joint Gaussianity; we just have to check mean and covariance:
$$E[X] = \mu + AE[Z] = \mu$$
$$E[(X - \mu)(X - \mu)^\top] = E[AZZ^\top A^\top] = AE[ZZ^\top]A^\top = AA^\top = \Sigma$$
• Central limit theorem: For i.i.d. $X_i$ with finite variance, the normalised sum $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (X_i - m)$ becomes asymptotically Gaussian distributed.
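As a quick illustration of the Cholesky construction above, here is a minimal NumPy sketch; the mean and covariance are taken from the running two-dimensional example, and the sample size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([7.0, 7.0])
Sigma = np.array([[16.6, 6.8],
                  [6.8, 6.4]])

# Cholesky factor A with Sigma = A A^T
A = np.linalg.cholesky(Sigma)

# Z has independent standard normal components; X = mu + A Z
Z = rng.standard_normal((10000, 2))
X = mu + Z @ A.T

print(X.mean(axis=0))           # approximately mu
print(np.cov(X, rowvar=False))  # approximately Sigma
```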

1.2.6 Computing marginals for joint Gaussian densities

Marginals are easy! Just take the entries of $\mu$ and $\Sigma$ corresponding to the variables you are interested in.
Example: $p(x) = \mathcal{N}(\mu, \Sigma)$ with $\mu = (7, 7)^T$ and $\Sigma = \begin{pmatrix} 16.6 & 6.8 \\ 6.8 & 6.4 \end{pmatrix}$. We get $p(x_1) = \mathcal{N}(7, 16.6)$ and $p(x_2) = \mathcal{N}(7, 6.4)$.

1.2.7 Computing conditional expectations for jointly Gaussian variables

This is something we have to do quite often. We may assume that we observe some variables (data) and want to predict others. This is based on the conditional density. If all variables are jointly Gaussian, the conditional density is also Gaussian.
Let's split $x = (v, z)^\top$ into two groups of variables. To compute $p(v|z)$ we first write the joint density
$$p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right] \propto \qquad (1.7)$$
$$\exp\left[-\frac{1}{2} x^T \Sigma^{-1} x + x^\top \Sigma^{-1} \mu\right] \qquad (1.8)$$
in the following unnormalised form
$$p(v, z) \propto \exp\left[-\frac{1}{2} (v\; z)^\top \Omega\, (v\; z) + (v\; z)^\top \xi\right] \qquad (1.9)$$
with the information matrix $\Omega = \Sigma^{-1}$, partitioned as $\Omega = \begin{pmatrix} \Omega_{vv} & \Omega_{vz} \\ \Omega_{zv} & \Omega_{zz} \end{pmatrix}$, and $\xi = (\xi_v\; \xi_z)^\top$. We also see that $\xi = \Omega\mu$. We have ignored all the terms that are independent of x. To obtain the conditional density we write it in the form
$$p(v|z) = \frac{p(v, z)}{p(z)} \propto \exp\left[-\frac{1}{2} v^\top \Omega_{vv} v + v^\top (\xi_v - \Omega_{vz} z)\right] \qquad (1.10)$$
where we have collected all the terms that depend on the random variable V. We know that the conditional density is a Gaussian, and we can read off the mean vector and the covariance from this unnormalised density. I know two ways of doing this:
1. Completing the square: We look at the quadratic form in the exponent of (1.10), set $a \doteq \xi_v - \Omega_{vz} z$, and complete the square:
$$-\frac{1}{2} v^\top \Omega_{vv} v + v^\top a = -\frac{1}{2} v^\top \Omega_{vv} v + \frac{1}{2} v^\top a + \frac{1}{2} a^\top v = -\frac{1}{2} \left(v - \Omega_{vv}^{-1} a\right)^\top \Omega_{vv} \left(v - \Omega_{vv}^{-1} a\right) + \frac{1}{2} a^\top \Omega_{vv}^{-1} a$$
2. Finding the maximum of the exponent: For Gaussian densities, the maximiser of the probability density equals the mean. Hence, we can take the gradient
$$\nabla_v \left[-\frac{1}{2} v^\top \Omega_{vv} v + v^\top a\right] = -\Omega_{vv} v + a$$
and set it equal to zero.

For both methods, we get for the mean and covariance
$$E[V|z] = \Omega_{vv}^{-1} (\xi_v - \Omega_{vz} z) \qquad (1.11)$$
$$\mathrm{Cov}(V|z) = \Omega_{vv}^{-1}. \qquad (1.12)$$
Remember that the conditional expectation is the best prediction (in the mean square sense) of the random vector V given the 'data' z. It is interesting that for Gaussian variables, the covariance (the uncertainty of the prediction) is actually independent of the data!
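Equations (1.11) and (1.12) translate directly into a few lines of linear algebra. A minimal sketch, where the index sets and example numbers are assumptions for illustration:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, v_idx, z_idx, z):
    # Information form: Omega = Sigma^{-1}, xi = Omega mu
    Omega = np.linalg.inv(Sigma)
    xi = Omega @ mu
    Ovv = Omega[np.ix_(v_idx, v_idx)]
    Ovz = Omega[np.ix_(v_idx, z_idx)]
    cov = np.linalg.inv(Ovv)             # eq. (1.12)
    mean = cov @ (xi[v_idx] - Ovz @ z)   # eq. (1.11)
    return mean, cov

mu = np.array([7.0, 7.0])
Sigma = np.array([[16.6, 6.8], [6.8, 6.4]])
mean, cov = gaussian_conditional(mu, Sigma, [0], [1], np.array([10.0]))
print(mean, cov)  # the conditional mean depends on z, the covariance does not
```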

1.2.8 The Kullback–Leibler (KL) divergence

Often we have to compare probability densities p(x) and q(x). The KL divergence is a non-symmetric dissimilarity measure defined (for densities) as
$$\mathrm{KL}(p\|q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx \qquad (1.13)$$
For two probability mass functions P(x), Q(x) (discrete random variables) we define
$$\mathrm{KL}(P\|Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} \qquad (1.14)$$
We will next show that $\mathrm{KL}(p\|q) \geq 0$, with equality if and only if p = q.

Proof: We express the KL as an expectation and apply Jensen's inequality:
$$
\begin{aligned}
\mathrm{KL}(p\|q) &= E_p\left[\ln \frac{p(X)}{q(X)}\right] = -E_p\left[\ln \frac{q(X)}{p(X)}\right] \\
&\geq -\ln E_p\left[\frac{q(X)}{p(X)}\right] && \text{Jensen, } -\ln(\cdot) \text{ is convex} \\
&= -\ln \int p(x)\, \frac{q(x)}{p(x)}\, dx = -\ln \int q(x)\, dx = -\ln 1 = 0.
\end{aligned}
$$
The penultimate equality follows because q is a normalised density. Since $-\ln(\cdot)$ is strictly convex, equality is only possible if $\frac{q}{p} = \mathrm{const}$. This can only happen if p = q (both densities are normalised). Another property is that KL is invariant against invertible smooth transformations of the random variables.

1.3 Estimating model parameters by Maximum Likelihood (ML)

We will now start with an introduction to ML estimation for simple models, where this can be done analytically. Later, we will see why ML might be a good method.

1.3.1 The biased coin (Bernoulli model)

Consider a data sequence $D = (x_1, x_2, \ldots, x_n)$ of bits $x_i \in \{0, 1\}$ which we believe are generated independently at random with the same probability. Let $\theta$ be the unknown probability of $x_i = 1$. Hence, we have
$$P(x_i = 1|\theta) = \theta \qquad P(x_i = 0|\theta) = 1 - \theta \qquad (1.15)$$
This can be compressed into a single equation:
$$P(x_i|\theta) = \theta^{x_i} (1 - \theta)^{1 - x_i} \qquad (1.16)$$
The probability of the entire sequence D under this model (we use independence) is
$$P(D|\theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}$$
For a fixed data sequence D, we call $P(D|\theta)$, as a function of $\theta$, the likelihood of $\theta$.
To estimate the unknown parameter $\theta$ of the model (from which the data was generated) we use the method of Maximum Likelihood, i.e. choosing
$$\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta} P(D|\theta) \qquad (1.17)$$
For this parameter, the observed data have the highest probability under the model. Equivalently, we maximise the log-likelihood
$$\ln P(D|\theta) = \sum_{i=1}^{n} \left(x_i \ln \theta + (1 - x_i) \ln(1 - \theta)\right) = n_1 \ln \theta + (n - n_1) \ln(1 - \theta)$$
where $n_1$ is the number of i for which $x_i = 1$. Differentiating with respect to $\theta$, setting the derivative to 0 and solving for $\theta$ gives
$$\frac{d \ln P(D|\theta)}{d\theta} = \frac{n_1}{\theta} - \frac{n - n_1}{1 - \theta} = 0 \quad \longrightarrow \quad \hat{\theta} = \frac{n_1}{n}. \qquad (1.18)$$

1.3.2 The Gaussian density

Let us assume we have a sequence of data points $D = (x_1, x_2, \ldots, x_n)$ which are generated from a Gaussian density
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
We use the maximum likelihood method to estimate the unknown parameters $\mu$ and $\sigma^2$ from the data. Since the random variables are continuous, it doesn't make sense to calculate the joint probability of the data sequence D (this would be 0). But it is reasonable to use the joint density $p(D|\mu, \sigma^2)$ instead! Using independence again, a short calculation yields
$$\ln p(D|\mu, \sigma^2) = -\frac{1}{2} \sum_{i=1}^{n} \left(\frac{(x_i - \mu)^2}{\sigma^2} + \ln(2\pi\sigma^2)\right) \qquad (1.19)$$
Maximising $\ln p(D|\mu, \sigma^2)$ with respect to $\mu$ and $\sigma^2$ leads to the simultaneous equations
$$\frac{\partial \ln p(D|\mu, \sigma^2)}{\partial \mu} = 0 \qquad \frac{\partial \ln p(D|\mu, \sigma^2)}{\partial \sigma^2} = 0$$
which we have to solve (do this!) for $\mu$ and $\sigma^2$. The Maximum Likelihood estimates are
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

1.3.3 Linear Regression

The Gaussian model can easily be generalised to a model for linear regression. Here is the idea: We observe a set of input–output data pairs $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with x = input, y = target values. We try to fit a linear function $y = f(x) = w_0 + w_1 x$ to the data. We represent this problem as a probabilistic model and assume that n observations are generated for fixed $x_i$ as
$$y_i = w_0 + w_1 x_i + \nu_i$$
for $i = 1, \ldots, n$, where the $\nu_i$ are independent Gaussian noise variables $\nu \sim \mathcal{N}(0, \sigma^2)$. Hence, the parameters of the model are given by the vector $\theta = (w_0, w_1, \sigma^2)$. Finally, we may assume that the $x_i$ are generated from an unknown density $p_X$. We are not interested in estimating parameters of $p_X$! Hence we can write
$$p(D|\theta) = p_X(x_1, \ldots, x_n) \prod_{i=1}^{n} p(y_i|x_i, \theta) \qquad (1.20)$$
For the Gaussian noise we can write
$$p(y|x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - w_0 - w_1 x)^2}{2\sigma^2}} \qquad (1.21)$$
Hence, the negative log-likelihood is
$$-\ln p(D|w, \sigma^2) = \mathrm{const} + \frac{n}{2} \ln \sigma^2 + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 \qquad (1.22)$$
where the constant contains $p_X$. ML estimation of $w_0$ and $w_1$ leads to minimisation of (1.22). It is easy to see that this is equivalent to fitting a linear function through the data by the Least Squares method.
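Since ML here reduces to least squares, the estimates have a closed form. A minimal sketch; the synthetic data and the true parameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, size=50)  # true w0 = 0.5, w1 = 2.0

# Least-squares / ML estimates of w0 and w1
w1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0_hat = y.mean() - w1_hat * x.mean()

# ML estimate of the noise variance: the mean squared residual
sigma2_hat = np.mean((y - w0_hat - w1_hat * x) ** 2)
print(w0_hat, w1_hat, sigma2_hat)
```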

1.3.4 Generalised linear models

The linear regression model can be generalised to fitting with non-linear functions
$$y_i = f_w(x_i) + \nu_i \qquad (1.23)$$
for $i = 1, \ldots, n$, with unknown parameters w and $\nu_i$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$. The generalised linear model assumes
$$f_w(x) = \sum_{j=0}^{K} w_j \phi_j(x) \qquad (1.24)$$
where the $\phi_j(x)$ denote a fixed set of functions, e.g. $\phi_j(x) = x^j$ (polynomial regression), which typically leads to a nonlinear function in x, but which is linear in the parameters $w_j$. The likelihood is
$$p(D|w) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\sum_{i=1}^{n} \frac{(y_i - f_w(x_i))^2}{2\sigma^2}\right] p_X(x_1, \ldots, x_n) \qquad (1.25)$$
Linearity in the parameters w allows for an analytical solution to the ML estimation in terms of matrix inversion. We will come back to this model when we discuss Bayesian inference.
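The analytical solution mentioned above is the usual least-squares solve on the design matrix $\Phi_{ij} = \phi_j(x_i)$. A minimal sketch with a polynomial basis; the degree and the synthetic data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = x**3 - x + rng.normal(0, 0.05, size=x.size)

K = 3
Phi = np.vander(x, K + 1, increasing=True)  # columns phi_j(x) = x^j, j = 0..K

# ML estimate: minimise sum_i (y_i - f_w(x_i))^2 over w
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)  # approximately (0, -1, 0, 1)
```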

1.3.5 Properties of Estimators

• Estimates of parameters $\hat{\theta} = \hat{\theta}(D)$ depend on data sets D and can thus be considered random variables themselves (with respect to the random drawing of the data).
• One defines the bias of an estimator as $E_D(\hat{\theta}) - \theta$ and its variance as $E_D\left[\left(\hat{\theta} - E_D(\hat{\theta})\right)^2\right]$, where the expectations $E_D$ are over datasets of fixed size n which are drawn at random from a distribution with true parameter $\theta$.
• "Good" estimators should be asymptotically consistent, i.e. the estimates should converge to the true parameters as $n \to \infty$. This means that bias and variance must go to 0 as $n \to \infty$.
• ML estimators are found to be consistent under fairly general conditions. Check books on mathematical statistics for more. We will give some intuitive explanation.

1.3.6 Intuition about consistency of ML estimators

Suppose that data points $D = (x_1, \ldots, x_n)$ are realisations of i.i.d. random variables from a true distribution $p^*(x)$. With high probability as $n \to \infty$ we get, by the law of large numbers,
$$-\frac{1}{n} \ln p(D|\theta) = -\frac{1}{n} \sum_i \ln p(x_i|\theta) \simeq -E_{p^*}[\ln p(X|\theta)] = -E_{p^*}\left[\ln \frac{p(X|\theta)}{p^*(X)}\right] - E_{p^*}[\ln p^*(X)] = \mathrm{KL}(p^* \| p(\cdot|\theta)) - E_{p^*}[\ln p^*(X)]$$
Hence, for large n one might expect that by minimising the negative log-likelihood, we find the parameter $\theta$ which makes $p(x|\theta)$ closest (measured by KL divergence) to the true density $p^*$. And if $p^*(x) = p(x|\theta^*)$ we will find the true parameter.
Illustration: ML estimation of the variance of a Gaussian (shown are histograms over 10,000 repetitions of the estimation) for n = 5, 10, 100 with true parameter $\sigma^2 = 1$.
We see that the distributions of the estimators become more and more concentrated around the true value. But we also see that for finite n, the estimator is biased.
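A small simulation in the spirit of this illustration; it repeats the ML variance estimate many times and reports its mean, which exposes the finite-n bias $E[\hat{\sigma}^2] = (1 - 1/n)\sigma^2$ (the repetition count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)

for n in (5, 10, 100):
    X = rng.normal(0.0, 1.0, size=(10000, n))   # 10,000 datasets of size n
    mu_hat = X.mean(axis=1, keepdims=True)
    var_hat = ((X - mu_hat) ** 2).mean(axis=1)  # ML estimator of sigma^2
    # the estimator concentrates around 1, but its mean is (1 - 1/n)
    print(n, var_hat.mean())
```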

1.4 Appendix I: Some probability essentials
Here is a collection of topics from basic probability which we assume to be
known by participants of this module.

• Sample Space Ω: Space of possible outcomes ω of a random experiment.


• Events: (measurable) subsets of Ω.
• Probabilities: Number P (A) assigned to events A.
We have 0 ≤ P (A) ≤ 1, P (Ø) = 0 and P (Ω) = 1.

• Addition Rule: If $A \cap B = \emptyset$ then $P(A \cup B) = P(A) + P(B)$ (extends to countable sequences of disjoint events).
• Random Variables are functions of outcomes, $X(\omega)$.
• For discrete rvs we define the probability mass function $P_X(x) = \Pr(X = x)$. Often we speak (sloppily) about the distribution of X.
• Joint distribution of two random variables: $P_{X,Y}(x, y) = \Pr(X = x, Y = y)$.
• Marginal distributions: $P_X(x) = \sum_y P_{X,Y}(x, y)$ and $P_Y(y) = \sum_x P_{X,Y}(x, y)$.
• For continuous random variables we define a probability density $p_X(x)$ by $\int_a^b p_X(x)\, dx = P(a < X < b)$.

• A joint density can be defined for two (and more) variables:
$$\int\!\!\int_S p_{X,Y}(x, y)\, dx\, dy = P((X, Y) \in S)$$
for a set $S \subseteq \mathbb{R}^2$.¹
• Marginal densities are obtained e.g. as $p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$.

¹ Note: When it is clear which random variables are involved, I often write simply p(x) instead of $p_X(x)$.
Transformations of random variables and their densities:
• Let $Y = T(X)$ be an invertible transformation and let the density of X be p(x). We are interested in the density q(y) of the random variable Y. Using change of variables for integrals, one gets
$$q(y) = p(x(y)) \left|\frac{dx}{dy}\right| = p(x(y)) \frac{1}{\left|\frac{dy}{dx}\right|}$$
• Conditional Probabilities:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, and similarly for conditional distributions, $P(x|y) = \frac{P(x, y)}{P(y)}$, and conditional densities, $p(x|y) = \frac{p(x, y)}{p(y)}$.
Bayes Rule!!!
$$P(x|y) = \frac{P(y|x)P(x)}{P(y)} = \frac{P(y|x)P(x)}{\sum_{x'} P(y|x')P(x')}.$$

• Expectations:
The expectation of X is defined as $E(X) = \sum_x P(x)\, x$ (discrete case) or $E(X) = \int p(x)\, x\, dx$ (continuous case). For a function g of the rv X, one can show that $E(g(X)) = \sum_x P(x)\, g(x)$ (discrete) or $E(g(X)) = \int p(x)\, g(x)\, dx$ (continuous).
Mean: $\mu = E[X]$.
Variance: $\mathrm{Var}(X) = E((X - \mu)^2) = E(X^2) - (E(X))^2$.
Linearity: $E(aX + bY) = aE(X) + bE(Y)$.
• Conditional Expectation, $E(Y|X = x)$ or $E(Y|x)$:
$E(g(Y)|X = x) = \sum_y g(y)\, P(y|x)$ (discrete case) and $E(g(Y)|X = x) = \int g(y)\, p(y|x)\, dy$ (continuous case).

• Independence; multiplication rule:
A family of events $A_1, A_2, \ldots$ is called independent if for any subset $\{A_{i_1}, A_{i_2}, \ldots, A_{i_k}\}$: $P(A_{i_1} \cap \ldots \cap A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k})$.
A family of random variables $X_1, X_2, \ldots$ is called independent if for any subset $\{X_{i_1}, X_{i_2}, \ldots, X_{i_k}\}$: $P(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) = \prod_{j=1}^{k} P(x_{i_j})$ (with an analogous definition for densities). Hence, if X and Y are independent then $P(x|y) = \frac{P(x, y)}{P(y)} = P(x)$.
• Some properties of independent random variables $X_1, X_2, \ldots, X_N$:
1. $E(X_1 \cdot X_2 \cdots X_N) = \prod_{i=1}^{N} E(X_i)$.
2. $\mathrm{Var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N} \mathrm{Var}(X_i)$.
3. Law of large numbers: Let $X_1, X_2, \ldots, X_N$ be i.i.d. with finite variance $\sigma^2$ and $S_N = \frac{1}{N} \sum_{i=1}^{N} X_i$. Then one can show that $\lim_{N \to \infty} P(|S_N - E(X)| > \varepsilon) = 0$. Hence, when N is large, with high probability we have $\frac{1}{N} \sum_{i=1}^{N} X_i \approx E(X)$.

1.5 Appendix II: Understanding eq. (1.5)

To understand the properties of this density, we will simplify the quadratic form in the exponent. We consider the eigenvalue problem of the covariance matrix
$$\Sigma u_i = \lambda_i u_i \quad \text{for } i = 1, \ldots, d \qquad (1.26)$$
with eigenvectors $u_i$ and eigenvalues $\lambda_i$. $\Sigma$ is a real symmetric matrix with orthonormal eigenvectors, $u_i \cdot u_j = u_i^T u_j = \delta_{ij}$. With the $d \times d$ orthogonal matrix formed by the d column eigenvectors,
$$U = (u_1\; u_2 \cdots u_d), \qquad (1.27)$$
we have $U^T U = I$. This means $U^{-1} = U^T$.
Using (1.27) and the definition of the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$, we can rewrite the d eigenvalue equations (1.26) as a single matrix equation $\Sigma U = U\Lambda$. Hence, multiplying by $U^{-1}$ from the right, we get
$$\Sigma = U\Lambda U^{-1} = U\Lambda U^T \qquad (1.28)$$
Taking the inverse of (1.28) we get
$$\Sigma^{-1} = U\Lambda^{-1} U^{-1} = U\Lambda^{-1} U^T \qquad (1.29)$$
We use the orthogonal matrix U to define the transformation $y = U^T(x - \mu)$, or $x = \mu + Uy$.
This transformation preserves inner products, i.e. we have for two vectors $y_1$ and $y_2$ that $y_1^T y_2 = (x_1 - \mu)^T (x_2 - \mu)$. It can be understood as a transformation to a new coordinate system given by a combination of a shift and a rotation.
• In the transformed variables, the quadratic form reads
$$(x - \mu)^T \Sigma^{-1} (x - \mu) = y^T U^\top U \Lambda^{-1} U^{-1} U y = y^T \Lambda^{-1} y = \sum_{i=1}^{d} \frac{y_i^2}{\lambda_i}$$
Thus $\exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right] = \prod_{i=1}^{d} e^{-\frac{y_i^2}{2\lambda_i}}$.
• The determinant equals the product of the eigenvalues: $|\Sigma| = \prod_{i=1}^{d} \lambda_i$.
• Putting things together, we find that the transformed random variables defined by the y coordinates, $Y = U^T(X - \mu)$, are independent and have the Gaussian densities
$$p(y) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\lambda_i}}\, e^{-\frac{y_i^2}{2\lambda_i}} \qquad (1.30)$$
• In transforming the density, we also needed $\left|\frac{\partial y}{\partial x}\right| = |U^T| = 1$.

We also see that:
• The variances $\lambda_i > 0$. Hence, the matrix $\Sigma$ must be positive definite.
• Surfaces of constant probability density for the Gaussian density p(x), eq. (1.5), are ellipsoids.
• We next show that $\Sigma = E[(X - \mu)(X - \mu)^T]$:
$$E[(X - \mu)(X - \mu)^T] = E[UYY^\top U^\top] = U\, E[YY^\top]\, U^\top = U\Lambda U^\top = \Sigma$$
(take nonrandom matrices out of the expectation). In the last step, we have used a) $E[Y_i] = 0$, b) the independence of the $Y_i$, i.e. $E[Y_i Y_j] = 0$ for $i \neq j$, and c) $E[Y_i^2] = \lambda_i$ from (1.30). Thus $E[YY^\top] = \Lambda$.
Back to the example: The covariance matrix is $\Sigma = \begin{pmatrix} 16.6 & 6.8 \\ 6.8 & 6.4 \end{pmatrix}$. The eigenvalues are $\lambda_1 = 20$ and $\lambda_2 = 3$, with eigenvectors $u_1 = \frac{1}{\sqrt{5}}(2, 1)^T$ and $u_2 = \frac{1}{\sqrt{5}}(1, -2)^T$. (Figure: the contour plot again, with the principal axes along $u_1$ and $u_2$.)
1.6 Appendix: Inequalities

• Cauchy–Schwarz:
$$\{E(XY)\}^2 \leq E(X^2)\, E(Y^2).$$
Equality holds if and only if $P(sX = tY) = 1$ for some nonrandom s and t.
• Jensen's inequality: Let $f(\cdot)$ be a convex function, i.e. for all $x_{1,2}$ and $0 \leq \alpha \leq 1$
$$\alpha f(x_1) + (1 - \alpha) f(x_2) \geq f(\alpha x_1 + (1 - \alpha) x_2) \qquad (1.31)$$
(if f is differentiable, then $f''(x) \geq 0$ for all x). Then
$$E[f(X)] \geq f(E[X]) \qquad (1.32)$$
If f is strictly convex (strict inequality in the definition of convexity), equality in Jensen holds if and only if $X = E(X) = \mathrm{const}$ almost surely.

1.6.1 Proof of Jensen

For simplicity, assume that f is differentiable. Let y be nonrandom. Use the Taylor expansion (with remainder) of f around y, evaluated at X:
$$f(X) = f(y) + (X - y) f'(y) + \frac{1}{2}(X - y)^2 f''(\xi) \geq f(y) + (X - y) f'(y) \qquad \text{(we use } f''(\xi) \geq 0\text{)}$$
where $\xi$ lies between X and y. Finally, take expectations on both sides; note that taking expectations does not change the direction of the inequality (easy to see for discrete random variables):
$$E[f(X)] \geq f(y) + (E[X] - y) f'(y)$$
The result follows by setting $y = E[X]$.
Chapter 2

Week 2

We will discuss parametric families of probability distributions, the exponential families, for which maximum likelihood estimation has analytic solutions. For Markov random field models such as Boltzmann machines, maximum likelihood estimation becomes intractable when the dimensionality of states is large. We will look at the pseudo-likelihood method as an alternative approach. But before this we will look briefly at ML for two other models.

2.1 ML for Logistic Regression

We have discussed linear regression as an application of ML. We can also do a similar thing for a classification problem. The model is known as Logistic Regression in statistics. Here we assume data coming in pairs $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where the $x_i$ are the inputs and $y_i \in \{0, 1\}$ binary class labels. We use the model
$$P(y = 1|x, \theta) = \frac{1}{1 + e^{-g(x)}}$$
with the function
$$g(x) = w^T x + w_0$$
The parameters are $\theta = (w, w_0)$. The function g could also be replaced by something more complex, like a multilayer neural network. The likelihood is
$$P(D|\theta) = \prod_{i=1}^{n} P(y_i = 1|x_i, \theta)^{y_i} \left(1 - P(y_i = 1|x_i, \theta)\right)^{1 - y_i} \times p(x_1, \ldots, x_n) \qquad (2.1)$$
where we are not interested in the probability of the inputs. The optimal parameters could be found by maximising the likelihood using a gradient method.
2.2 ML for a Markov chain

To show that we can apply ML to dependent data, we look at an example of a 2-state Markov chain $x_i \in \{0, 1\}$ with unknown transition matrix
$$\theta = \begin{pmatrix} P_{11} & P_{10} \\ P_{01} & P_{00} \end{pmatrix}$$
We assume we observe e.g. the data sequence D = 1101011001011110111010110010100 and try to estimate $P_{11}, P_{10}, P_{01}, P_{00}$ using the ML method. The likelihood equals the probability of D for a given parameter (transition matrix):
$$P(D|\theta) = \prod_{i=1}^{30} P(x_{i+1}|x_i, \theta)\; P(x_1) = P_{11}^{8}\, P_{10}^{10}\, P_{01}^{9}\, P_{00}^{3}\; P(x_1). \qquad (2.2)$$
We also have two constraints on the transition probabilities, $P_{11} + P_{10} = 1$ and $P_{01} + P_{00} = 1$. We can take care of those by introducing Lagrange multipliers and obtain the Lagrange function
$$
\begin{aligned}
L(\theta, \lambda_1, \lambda_0) &= -\ln P(D|\theta) + \lambda_1 (P_{11} + P_{10} - 1) + \lambda_0 (P_{01} + P_{00} - 1) \\
&= -\ln P(x_1) - 8 \ln P_{11} - 10 \ln P_{10} - 9 \ln P_{01} - 3 \ln P_{00} \\
&\quad + \lambda_1 (P_{11} + P_{10} - 1) + \lambda_0 (P_{01} + P_{00} - 1) \qquad (2.3\text{–}2.6)
\end{aligned}
$$
Differentiating with respect to the $P_{ij}$ yields $P_{11} = \frac{8}{\lambda_1}$ and $P_{10} = \frac{10}{\lambda_1}$. The constraint gives $1 = \frac{8 + 10}{\lambda_1}$, i.e. $\lambda_1 = 18$. Hence $P_{11} = 0.44$ and $P_{10} = 0.56$. Similarly, we obtain $P_{01} = 9/12 = 0.75$ and $P_{00} = 0.25$.
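The same estimates fall out of simple transition counts: $\hat{P}_{ab}$ is the number of $a \to b$ transitions divided by the number of transitions leaving a. A minimal sketch using the sequence above:

```python
import numpy as np

D = "1101011001011110111010110010100"
counts = np.zeros((2, 2))  # counts[a, b] = number of transitions a -> b
for a, b in zip(D[:-1], D[1:]):
    counts[int(a), int(b)] += 1

# ML estimate: normalise each row (transitions leaving state a)
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(P_hat)  # P_hat[1, 1] = 8/18, P_hat[1, 0] = 10/18, P_hat[0, 1] = 9/12
```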

2.3 Exponential families

Regular¹ exponential families are defined (in their canonical representation) by
$$p(x|\theta) = f(x) \exp[\psi(\theta) \cdot \phi(x) + g(\theta)]. \qquad (2.7)$$
$\psi$ is called the natural parameter and $\phi(x)$ the sufficient statistics.

2.3.1 Bernoulli models as exponential family

To show that Bernoulli distributions form an exponential family, we rewrite the probability as
$$P(x|\theta) = \theta^x (1 - \theta)^{1 - x} = \exp[x \ln \theta - x \ln(1 - \theta) + \ln(1 - \theta)] \qquad (2.8)$$
$$= \exp\left[x \ln \frac{\theta}{1 - \theta} + \ln(1 - \theta)\right] \equiv f(x) \exp[\psi(\theta)\phi(x) + g(\theta)]. \qquad (2.9)$$
By comparison, we find that $\phi(x) = x$, $\psi(\theta) = \ln \frac{\theta}{1 - \theta}$ and $g(\theta) = \ln(1 - \theta)$.

¹ This means that the range of the data x is independent of the parameter $\theta$.
2.3.2 Gaussian densities as an exponential family

Gaussian densities can also be cast into the exponential family framework. Since we have two parameters, we will obtain 2-dimensional sufficient statistics. We rewrite the Gaussian density as
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right] \qquad (2.10)$$
$$= \exp\left[\frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2\right] \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\mu^2}{2\sigma^2}\right] \qquad (2.11)$$
$$\equiv f(x) \exp[\psi(\theta) \cdot \phi(x) + g(\theta)] \qquad (2.12)$$
Obviously, this is in the correct form if we define $\psi(\theta) = (\mu/\sigma^2, 1/2\sigma^2)$ and $\phi(x) = (x, -x^2)$. Finally, we have $f(x) = 1$ and
$$e^{g(\theta)} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\mu^2}{2\sigma^2}\right]. \qquad (2.13)$$
One can also show that the following models belong to the exponential family class:

2.3.3 Poisson distributions

$$p(n|\theta) = e^{-\theta}\, \frac{\theta^n}{n!} \qquad (2.14)$$
for $n = 0, 1, 2, \ldots$. The plot shows the distribution for $\theta = 1$.

2.3.4 Multinomial family

Let $n = (n_1, \ldots, n_K)$, with $n_j \in \mathbb{N}$ and $\sum_j n_j = n$. We define the multinomial family as
$$P(n|\theta) = \frac{n!}{\prod_{j=1}^{K} n_j!} \prod_{j=1}^{K} \theta_j^{n_j} \qquad (2.15)$$
where $\sum_{j=1}^{K} \theta_j = 1$. This is useful for histogram data (counts, e.g. in a bag-of-words model).
2.3.5 Mathematical properties of exponential families

We begin with the normalisation. This helps us to express the function $g(\theta)$ as an integral:
$$1 = \int p(x|\theta)\, dx = e^{g(\theta)} \int f(x)\, e^{\psi \cdot \phi(x)}\, dx \;\rightarrow\; \int f(x)\, e^{\psi \cdot \phi(x)}\, dx = e^{-g(\theta)} \qquad (2.16)$$
We can now express the expectation of the sufficient statistics as another integral:
$$E[\phi(X)|\theta] = \frac{e^{g(\theta)} \int \phi(x) f(x)\, e^{\psi \cdot \phi(x)}\, dx}{e^{g(\theta)} \int f(x)\, e^{\psi \cdot \phi(x)}\, dx} = \frac{\int \phi(x) f(x)\, e^{\psi \cdot \phi(x)}\, dx}{\int f(x)\, e^{\psi \cdot \phi(x)}\, dx} \qquad (2.17)$$
If we assume that $\theta$ can be expressed as a function of $\psi$, we can use (2.16) to get
$$-\nabla_\psi g(\theta(\psi)) = \nabla_\psi \ln \int f(x)\, e^{\psi \cdot \phi(x)}\, dx \qquad (2.18)$$
$$= \frac{\int \nabla_\psi f(x)\, e^{\psi \cdot \phi(x)}\, dx}{\int f(x)\, e^{\psi \cdot \phi(x)}\, dx} = \frac{\int \phi(x) f(x)\, e^{\psi \cdot \phi(x)}\, dx}{\int f(x)\, e^{\psi \cdot \phi(x)}\, dx} = E[\phi(X)|\theta]. \qquad (2.19)$$

2.3.6 Maximum likelihood for exponential families

We will see that maximum likelihood estimation has a simple form for exponential families. As usual, we assume that the data $D = (x_1, \ldots, x_n)$ are i.i.d. from an exponential family distribution. The likelihood is given by
$$p(D|\theta) = \prod_{i=1}^{n} p(x_i|\theta) = \exp\left[\psi \cdot \sum_{i=1}^{n} \phi(x_i) + n\, g(\theta)\right] \prod_{i=1}^{n} f(x_i) \qquad (2.20)$$
To maximise the log-likelihood we assume that the parameters are expressed by the natural parameter $\psi$. Hence, we set the gradient to zero and obtain
$$\nabla_\psi \ln p(D|\theta(\psi)) = \sum_{i=1}^{n} \phi(x_i) + n \nabla_\psi g(\theta(\psi)) = 0 \qquad (2.21)$$
$$\rightarrow \quad \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) = -\nabla_\psi g(\theta(\psi)) = E[\phi(X)|\theta]. \qquad (2.22)$$
This shows that maximum likelihood estimation leads to simple moment matching: For the ML parameter, the expected sufficient statistics equal the data average of the sufficient statistics.

2.3.7 More on sufficient statistics

We will discuss the meaning of sufficient statistics a bit more. We will see that they contain, in some sense, all the information about the parameters.
Let $p(x|\theta)$ be a parametric family. A statistic T(D) of the sample $D = \{x_1, x_2, \ldots, x_n\}$ is called sufficient if the conditional probability
$$P(D|T(D) = t, \theta) \qquad (2.23)$$
is independent of $\theta$. In this sense T(D) incorporates all relevant information about the parameter $\theta$! We will show that for exponential families, $T(D) = \sum_{i=1}^{n} \phi(x_i)$ is a sufficient statistic.
Proof: We will restrict ourselves to discrete random variables for simplicity. Setting $f(D) = \prod_i f(x_i)$, we get
$$P(D|T(D) = t, \theta) = \frac{P(D, T(D) = t|\theta)}{P(T(D) = t|\theta)} \qquad (2.24)$$
$$= \frac{f(D) \exp[\psi(\theta) \cdot t + n\, g(\theta)]}{\sum_{D': T(D') = t} f(D') \exp[\psi(\theta) \cdot t + n\, g(\theta)]} \qquad (2.25)$$
$$= \frac{f(D)}{\sum_{D': T(D') = t} f(D')} \qquad (2.26)$$
which is indeed independent of $\theta$.

2.4 Pseudo-likelihood approach

If the dimensionality of the model is too large, ML for exponential families can become intractable. This happens e.g. for the

2.4.1 Ising model

This was introduced in physics as a model for magnetic materials but has also found applications in machine learning. It defines a probability over vectors of binary data: Let $x_i = \pm 1$ for $i = 1, \ldots, N$. The joint distribution of the variables $x = (x_1, \ldots, x_N)$ is defined as the Markov random field
$$p_{\mathrm{Ising}}(x|\theta) = \frac{1}{Z(\theta)} \exp\left[\sum_{(i,j)} \theta_{ij} x_i x_j + \sum_i \theta_i x_i\right]. \qquad (2.27)$$
This is of the form of an exponential family with sufficient statistics $x_i$ and $x_i x_j$. In AI (where one transforms to a representation with variables 0, 1) this is known as a Boltzmann machine. In recent years, the model has been used to predict effective couplings between neurons from spike data.
(Figure: upper panel, binary state time series for a group of neurons, neuron ID over time; lower panel, spike probability $p_{\mathrm{spike}}$ as a function of time in ms.)

The figure illustrates (lower panel) the binarisation of continuous-time spike trains. The activity of each neuron is represented by two states (active, inactive), $x_i = \pm 1$. The upper panel shows the simultaneous time series of states for a group of neurons. There will be repeated experiments from which data averages of $x_i$ and of $x_i x_j$ (pairwise correlations) are estimated.
A generalisation to variables $x_i$ with more than 2 states (Potts models) has been used to predict interactions between amino acids in proteins.
ML estimation of the parameters $\theta_{ij}$ and $\theta_i$ by gradient descent requires the computation of the model averages $E[x_i|\theta] = \frac{\partial \ln Z}{\partial \theta_i}$, and similarly for $E[x_i x_j|\theta]$. By the moment matching result for maximum likelihood, these have to be matched with the corresponding data averages. Unfortunately, the computation of the model expectations using e.g. the normalising 'partition sum'
$$Z(\theta) = \sum_{\{x_i = \pm 1\}_{i=1}^{N}} \exp\left[\sum_{(i,j)} \theta_{ij} x_i x_j + \sum_i \theta_i x_i\right] \qquad (2.28)$$
needs a summation over $2^N$ states, which is impossible if we have e.g. N = 40.

2.4.2 Another look at ML

To motivate the pseudo-likelihood approach, let us look again at the maximum likelihood approach. Let $p^*(x) \doteq p(x|\theta^*)$ be the true data generating distribution. We will show that
$$\nabla_\theta E_{p^*}[\ln p(X|\theta)] = 0 \qquad (2.29)$$
if $\theta = \theta^*$ equals the true parameter. Proof:
$$\nabla_\theta E_{p^*}[\ln p(X|\theta)] = \int p(x|\theta^*)\, \nabla_\theta [\ln p(x|\theta)]\, dx = \int p(x|\theta^*)\, \frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}\, dx \qquad (2.30\text{–}2.31)$$
For $\theta = \theta^*$, we get
$$\int \nabla_\theta p(x|\theta)\big|_{\theta = \theta^*}\, dx = \nabla_\theta \int p(x|\theta)\, dx = \nabla_\theta 1 = 0$$
If we approximate the true expectation by an average over data,
$$\nabla_\theta E_{p^*}[\ln p(X|\theta)] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ln p(x_i|\theta) = 0,$$
we obtain the ML estimator.

2.4.3 Pseudo-likelihood

We will now apply a similar idea to another quantity, which resembles the expected log-likelihood of the data but is often simpler to work with. We will first show that
$$\nabla_\theta E_{p^*}[\ln P(x_i|x_{-i}, \theta)] = 0 \quad \text{for } \theta = \theta^* \qquad (2.32)$$
Proof:
$$E_{p^*}\left[\nabla_\theta \ln P(x_i|x_{-i}, \theta)\big|_{\theta = \theta^*}\right] = \sum_x P(x|\theta^*)\, \frac{\nabla_\theta P(x_i|x_{-i}, \theta)\big|_{\theta = \theta^*}}{P(x_i|x_{-i}, \theta^*)} \qquad (2.33)$$
$$= \sum_{x_{-i}} P(x_{-i}|\theta^*) \sum_{x_i} P(x_i|x_{-i}, \theta^*)\, \frac{\nabla_\theta P(x_i|x_{-i}, \theta)\big|_{\theta = \theta^*}}{P(x_i|x_{-i}, \theta^*)} \qquad (2.34)$$
$$= \sum_{x_{-i}} P(x_{-i}|\theta^*)\, \nabla_\theta \sum_{x_i} P(x_i|x_{-i}, \theta)\Big|_{\theta = \theta^*} = 0 \qquad (2.35)$$
In the last step, we have used that $\sum_{x_i} P(x_i|x_{-i}, \theta) = 1$.
We then sum over i to get the exact equation
$$\sum_{i=1}^{N} \nabla_\theta E_{p^*}[\ln P(x_i|x_{-i}, \theta)] = 0 \quad \text{for } \theta = \theta^*. \qquad (2.36)$$
If we approximate the expectation in the previous equation by a data average, we obtain the pseudo-likelihood approach. Given data $D = (x^{(1)}, \ldots, x^{(M)})$, we define the pseudo-log-likelihood
$$\sum_{k=1}^{M} \sum_{i=1}^{N} \ln P(x_i^{(k)}|x_{-i}^{(k)}, \theta). \qquad (2.37)$$
Note the different notation used for data samples.
Why is this method simpler compared to ML in the case of the Ising model? The conditional distribution can be derived from the joint distribution
$$P(x|\theta) = \frac{1}{Z(\theta)} \exp\left[\sum_{(i,j)} \theta_{ij} x_i x_j + \sum_i \theta_i x_i\right] \qquad (2.38)$$
as
$$P(x_i|x_{-i}, \theta) = \frac{P(x|\theta)}{P(x_{-i}|\theta)} \propto \exp\left[x_i \left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]. \qquad (2.39)$$
Note that the intractable $Z(\theta)$ drops out of the result. We can get a properly normalised result:
$$P(x_i|x_{-i}, \theta) = \frac{\exp\left[x_i \left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]}{\exp\left[\theta_i + \sum_j \theta_{ij} x_j\right] + \exp\left[-\left(\theta_i + \sum_j \theta_{ij} x_j\right)\right]}$$
Hence, the gradient is computable! The only thing that needs to be done is a numerical approach for solving the resulting optimisation. But that is much simpler compared to the intractable summations needed for the ML approach.
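A minimal sketch of the pseudo-log-likelihood (2.37) for the Ising model with ±1 spins. The symmetric coupling matrix, the fields and the random data are assumptions for illustration; a real fit would hand this objective (and its gradient) to a numerical optimiser:

```python
import numpy as np

def pseudo_log_likelihood(J, h, X):
    # X: (M, N) array of +/-1 spins; J: symmetric (N, N) couplings with
    # zero diagonal; h: (N,) fields. Uses P(x_i | x_{-i}) = sigmoid(2 x_i h_i)
    # with local field h_i = theta_i + sum_j theta_ij x_j.
    H = X @ J + h
    # ln sigmoid(2 x_i h_i) = -ln(1 + exp(-2 x_i h_i)), computed stably
    return -np.logaddexp(0.0, -2.0 * X * H).sum()

rng = np.random.default_rng(4)
N, M = 5, 100
X = rng.choice([-1.0, 1.0], size=(M, N))
J = rng.normal(0, 0.1, (N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
h = rng.normal(0, 0.1, N)
print(pseudo_log_likelihood(J, h, X))
```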

Chapter 3

Week 3

3.1 Quality of estimators

• Data i.i.d. $D = (x_1, \ldots, x_n)$ from $p(x|\theta)$.
• Estimator $\hat{\theta}(D)$.
• Variance of the estimator:
$$\mathrm{Var}(\hat{\theta}) = E\left[\left(\hat{\theta}(D) - E[\hat{\theta}(D)]\right)^2\right]$$
• The expectation is w.r.t. $p(D|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$.
• For unbiased estimators, $E[\hat{\theta}(D)] = \theta$. Then the variance equals the mean squared error of the estimator:
$$\mathrm{MSE} = E\left[\left(\hat{\theta}(D) - \theta\right)^2\right] \qquad (3.1)$$

3.1.1 Efficiency & Rao–Cramér inequality

• Limits the speed at which the estimate $\hat{\theta}$ approaches the true parameter $\theta$ on average. For a single (scalar) parameter:
$$\mathrm{VAR}[\hat{\theta}] \geq \frac{(\partial_\theta E(\hat{\theta}))^2}{n J(\theta)}$$
• with the Fisher information
$$J(\theta) \doteq E_\theta\left[\left(\frac{d \ln p(x|\theta)}{d\theta}\right)^2\right]$$
• For unbiased estimators, $\partial_\theta E(\hat{\theta}) = 1$, and we get a bound on the mean squared error:
$$\mathrm{MSE}(\hat{\theta}) \geq \frac{1}{n J(\theta)}$$

3.1.2 Useful definitions for the proof

• The score function is defined as
$$V = \frac{d \ln p(x|\theta)}{d\theta}$$
• The expectation of the score:
$$E[V] \doteq E\left[\frac{1}{p(x|\theta)} \frac{dp(x|\theta)}{d\theta}\right] = \int p(x|\theta)\, \frac{1}{p(x|\theta)} \frac{dp(x|\theta)}{d\theta}\, dx = \int \frac{dp(x|\theta)}{d\theta}\, dx = \frac{d}{d\theta} \int p(x|\theta)\, dx = \frac{d}{d\theta} 1 = 0$$
• Variance of the score:
$$\mathrm{VAR}[V] = E[V^2] = E\left[\left(\frac{d \ln p(x|\theta)}{d\theta}\right)^2\right] = J(\theta)$$
• The score for independent data D (use additivity of variances for independent rvs):
$$V_n \doteq \frac{d \ln p(D|\theta)}{d\theta} = \frac{d \sum_{i=1}^{n} \ln p(x_i|\theta)}{d\theta}, \qquad \mathrm{VAR}[V_n] = n J(\theta)$$

3.1.3 Reminder: Cauchy–Schwarz inequality

We also need the following bound:
$$\{E(XY)\}^2 \leq E(X^2)\, E(Y^2).$$
We have equality if and only if $P(sX = tY) = 1$ for some nonrandom s and t.

3.1.4 Proof of the RC inequality

• We apply Cauchy–Schwarz to $X \equiv V_n$ and $Y \equiv \hat{\theta} - E[\hat{\theta}]$:
$$\left\{E\left[V_n (\hat{\theta} - E[\hat{\theta}])\right]\right\}^2 \leq E[V_n^2]\, E[(\hat{\theta} - E[\hat{\theta}])^2] = n J(\theta)\, \mathrm{VAR}[\hat{\theta}]$$
• The left-hand side of the equation is
$$E\left[(V_n - E[V_n])(\hat{\theta} - E[\hat{\theta}])\right] = E[V_n \hat{\theta}] = \int p(D|\theta)\, \frac{1}{p(D|\theta)} \frac{dp(D|\theta)}{d\theta}\, \hat{\theta}(D)\, dx_1 \ldots dx_n = \int \frac{dp(D|\theta)}{d\theta}\, \hat{\theta}(D)\, dx_1 \ldots dx_n = \frac{d}{d\theta} E[\hat{\theta}(D)] \quad \text{(the estimator is independent of } \theta\text{)}$$
Hence, by solving for $\mathrm{VAR}[\hat{\theta}]$, we obtain the CR inequality.

3.1.5 d-dimensional parameters

• For a d-dimensional vector of parameters (specialising to unbiased estimators $E(\hat{\theta}) = \theta$ for simplicity), one can show the multivariate CR inequality
$$\mathrm{COV}(\hat{\theta}) \succeq \frac{1}{n}\, J^{-1}(\theta)$$
• with the Fisher information matrix
$$J_{ij}(\theta) = \int p(x|\theta)\, \partial_i \ln p(x|\theta)\, \partial_j \ln p(x|\theta)\, dx.$$
The notation $A \succeq B$ means that the matrix $A - B$ is positive semidefinite, i.e. $z^\top (A - B)\, z \geq 0$ for any real vector $(z_1, \ldots, z_d)$. From this one can obtain e.g. inequalities for the variance of weighted averages of individual components of the estimator.
• Estimators achieving equality are called efficient.

3.1.6 Example: ML for the Bernoulli model

This model is given by
$$p(x|\theta) = \theta^x (1 - \theta)^{1 - x}.$$
The Fisher information is
$$J(\theta) = \frac{1}{\theta(1 - \theta)}$$
In the plot we show $E(\hat{\theta} - \theta)^2$ versus $\frac{1}{J(\theta)\, n}$.
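A quick simulation in the spirit of that plot, comparing the empirical mean squared error of $\hat{\theta} = n_1/n$ with $\frac{1}{nJ(\theta)} = \frac{\theta(1-\theta)}{n}$; for this model the ML estimator is unbiased and efficient, so the two agree (the parameter values and repetition count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n = 0.7, 50

X = rng.random((100000, n)) < theta   # 100,000 Bernoulli datasets
theta_hat = X.mean(axis=1)            # ML estimator n1 / n

mse = np.mean((theta_hat - theta) ** 2)
cr_bound = theta * (1 - theta) / n    # 1 / (n J(theta))
print(mse, cr_bound)                  # should be close
```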

3.1.7 Examples for Fisher information

• 1-d Gaussian with parameters $\theta = (\mu, \sigma^2)$: The Fisher information is found to be $J(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$.
• Cauchy density: $p(x|\theta) = \frac{1}{\pi(1 + (x - \theta)^2)}$ has $J(\theta) = \frac{1}{2}$.
• Fisher information for exponential families with natural parameters: We use the representation
$$p(x|\psi) = f(x) \exp[\psi \cdot \phi(x) + \tilde{g}(\psi)].$$
Then the Fisher information is found to be
$$J_\psi(\psi) = -\partial\partial\, E[\ln p(x|\psi)] = -\partial\partial\, \tilde{g}(\psi) = \partial\partial \ln \left[\int f(x)\, e^{\psi \cdot \phi(x)}\, dx\right] = \mathrm{COV}[\phi(x)].$$
The last equality follows from the exponential family representation by direct calculation of the second derivatives.

3.1.8 ML estimation and efficiency

Under weak assumptions, ML estimators are asymptotically efficient.
• One can show that (under some technical conditions)
$$\hat{\theta}_{ML} \sim \mathcal{N}\left(\theta, \frac{1}{n}\, J^{-1}(\theta)\right)$$
for $n \to \infty$.
• This result can be used for error bars, applying
$$J_{ij}(\theta) = -\int p(x|\theta)\, \partial_i \partial_j \ln p(x|\theta)\, dx \approx -\frac{1}{n} \sum_{i=1}^{n} \partial_i \partial_j \ln p(x_i|\hat{\theta}_{ML})$$
Note the new result for the Fisher information in terms of second derivatives!

3.1.9 Proof of the second-derivative identity

We show $-\int p(x|\theta)\, \partial_i \partial_j \ln p(x|\theta)\, dx = \int p(x|\theta)\, \partial_i \ln p(x|\theta)\, \partial_j \ln p(x|\theta)\, dx$ for the scalar case, where $(\cdot)'$ denotes $\frac{d}{d\theta}$:
$$
\begin{aligned}
E\left[(\ln p(x|\theta))''\right] &= \int p(x|\theta) \left(\frac{p'(x|\theta)}{p(x|\theta)}\right)'\, dx \\
&= \int p(x|\theta)\, \frac{p''(x|\theta)}{p(x|\theta)}\, dx - \int p(x|\theta) \left(\frac{p'(x|\theta)}{p(x|\theta)}\right)^2\, dx \\
&= -\int p(x|\theta)\, \left[(\ln p(x|\theta))'\right]^2\, dx
\end{aligned}
$$
since $\int p''(x|\theta)\, dx = \frac{d^2}{d\theta^2} \int p(x|\theta)\, dx = 0$.

3.1.10 Heuristic derivation of the asymptotic Gaussian for ML

• Taylor expansion of the log-likelihood at the true (scalar) parameter $\theta$:
$$0 = \partial_\theta \ln p(D|\theta = \hat{\theta}_{ML}) \approx \partial_\theta \ln p(D|\theta) + (\hat{\theta} - \theta)\, \partial_\theta^2 \ln p(D|\theta)$$
• Solve for
$$(\hat{\theta} - \theta) \approx -\frac{\partial_\theta \ln p(D|\theta)}{\partial_\theta^2 \ln p(D|\theta)} \approx \frac{V_n}{n J(\theta)}$$
• Mean squared error:
$$E(\hat{\theta} - \theta)^2 \approx \frac{E(V_n^2)}{n^2 J^2(\theta)} = \frac{n J(\theta)}{n^2 J^2(\theta)} = \frac{1}{n J(\theta)}$$

3.2 Fisher Metric

• For $\hat{\theta}$ close to $\theta$:
$$p(\hat{\theta}(D)|\theta) \propto \exp\left[-\frac{n}{2}\, \|\hat{\theta}(D) - \theta\|^2_{\mathrm{Fisher}}\right]$$
• where
$$\|\theta_1 - \theta_2\|^2_{\mathrm{Fisher}} \doteq (\theta_1 - \theta_2)^\top J(\theta)\, (\theta_1 - \theta_2)$$
• Can this be viewed as a natural distance between two distributions expressed by the parameters?

3.2.1 Information Geometry

S. Amari developed a differential geometric approach to estimation.
• Define a non-Euclidean metric in parameter space by
$$\|\Delta\theta\|^2_{\mathrm{Fisher}} = \Delta\theta^T J(\theta)\, \Delta\theta$$
for small $\Delta\theta = \theta' - \theta$.
• It reflects how well two close distributions can be distinguished by estimation using random data.

3.2.2 Properties of the Fisher metric

• Expand the KL divergence in powers of $\Delta\theta \doteq \theta' - \theta$:
$$\int p(x|\theta) \ln \frac{p(x|\theta)}{p(x|\theta')}\, dx = \frac{1}{2} \Delta\theta^\top J(\theta)\, \Delta\theta + O((\Delta\theta)^3)$$
• Transformations of parameters: $\theta = f(\tau)$; hence $\Delta\theta = f'(\tau)\Delta\tau$ and
$$\tilde{J}(\tau) = E[(\partial_\tau \ln p(x|f(\tau)))^2] = E[(\partial_\theta \ln p(x|\theta))^2]\, (f'(\tau))^2 = J(\theta)(f'(\tau))^2$$
$$\|\Delta\theta\|^2_{\mathrm{Fisher}} = (\Delta\theta)^2 J(\theta) = (f'(\tau))^2 J(\theta) (\Delta\tau)^2 = (\Delta\tau)^2 \tilde{J}(\tau)$$
Hence, the distance between close probability distributions from a parametric family is invariant against transformations of parameters.
For a geometric picture in the case of 1-d Gaussians, see e.g. Sueli I. R. Costa, Sandra Augusta Santos, João Eloir Strapasson (2005), DOI: 10.1109/ITW.2005.1531851.

Chapter 4

Week 3

One can use the Fisher metric to define efficient online algorithms. Before doing so, we will briefly recall gradient-based ML estimation.

4.1 Application: Online Learning

We often perform ML estimation by a gradient descent algorithm. We iterate
$$\theta' = \theta + \eta \sum_{k=1}^{n} \nabla_\theta \ln p(x_k|\theta)$$
until convergence; $\eta$ is a learning rate. This requires the storage of the whole batch of n data points. On the other hand, if we want to perform online learning in the case of streaming data, we base the new estimate on the likelihood of the new data point, $\ln p(x_{n+1}|\theta)$, and the old estimate $\hat{\theta}(n)$. A common possibility is to apply stochastic gradient descent, i.e.
$$\theta(n + 1) = \theta(n) + \eta(n)\, \nabla_\theta \ln p(x_{n+1}|\theta(n))$$
where the learning rate $\eta(n)$ has to be decreased to ensure convergence. A typical schedule is $\eta(n) \propto 1/n$ as $n \to \infty$.

4.1.1 Natural gradient learning

Following S. Amari, the scalar learning rate $\eta(n)$ is replaced by a matrix which also depends on n. The update is given by
$$\theta(n + 1) = \theta(n) + \gamma_n\, J^{-1}(\theta(n))\, \nabla_\theta \ln p(x_{n+1}|\theta(n)).$$
The differential operator $J^{-1}(\theta(n))\nabla_\theta$ is termed the natural gradient. For the choice $\gamma_n = \frac{1}{n}$, one can show that the online algorithm yields asymptotically efficient estimation.

One can motivate the update by the following idea: On the one hand, one would like to make the negative data log-likelihood small. But one should not rely entirely on the new data; one should also take the old estimate $\theta(n)$ into account by not moving too far away from it. If distances are measured by the Fisher metric, we should minimise (setting $\Delta\theta \doteq \theta' - \theta$)
$$
\begin{aligned}
&-\ln p(x|\theta') + \frac{\lambda}{2} \|\Delta\theta\|^2_{\mathrm{Fisher}} \approx \\
&-\ln p(x|\theta) - \nabla \ln p(x|\theta)\, \Delta\theta + \frac{\lambda}{2} \|\Delta\theta\|^2_{\mathrm{Fisher}} && \text{(1st order Taylor)} \\
&= -\ln p(x|\theta) - \nabla \ln p(x|\theta)\, \Delta\theta + \frac{\lambda}{2} \Delta\theta^\top J(\theta)\, \Delta\theta
\end{aligned}
$$
with respect to $\theta'$. $\lambda$ is a parameter that controls how strongly the old parameter contributes. Minimisation w.r.t. $\Delta\theta$ yields the natural gradient of the log-likelihood:
$$\nabla \ln p(x|\theta) - \lambda J(\theta)\, \Delta\theta = 0 \quad \Rightarrow \quad \Delta\theta = \frac{1}{\lambda}\, J^{-1}(\theta)\, \nabla \ln p(x|\theta).$$
For the Cauchy density, we get
$$\theta_{n+1} = \theta_n + \frac{4(x_{n+1} - \theta_n)}{n\left(1 + (x_{n+1} - \theta_n)^2\right)}$$
The left figure shows the prediction $\theta_n$ for a single run of the algorithm when the true parameter is $\theta = 1$. The right figure shows the average squared estimation error (obtained from 10,000 runs) vs 1/n. For large n, we get a 1/n decay.
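A minimal simulation of this Cauchy example; the true parameter follows the text, while the run lengths and the initial value are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true = 1.0

def run(n_steps):
    theta = 0.0
    for n in range(1, n_steps + 1):
        x = theta_true + rng.standard_cauchy()  # stream one Cauchy sample
        r = x - theta
        theta += 4 * r / (n * (1 + r**2))       # natural gradient update
    return theta

# average squared error over repeated runs; expect roughly 1/(n J) = 2/n
errs = np.array([(run(1000) - theta_true) ** 2 for _ in range(200)])
print(errs.mean(), 2 / 1000)
```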

4.2 Detour: ∞-dimensional models

For ML estimation, it might be a bit confusing that one should use probabilities for likelihoods when we have discrete random variables but densities for continuous random variables. Why does this make sense? One might even have to deal with models for which the data are ∞-dimensional random variables. How should one define likelihoods in such a case? Let us take as an example:
4.2.1 Poisson processes

This models a set of discrete events $D = (z_1, \ldots, z_N)$ which occur e.g. in a compact domain S, as shown in the figure.

The inhomogeneous Poisson process is determined by a rate (intensity) function $\Lambda(z)$, which forms the 'parameter' of the model. (Figure: an intensity function $\Lambda(x)$ over the domain.)

The model is defined by the following properties:
• $\int_A \Lambda(z)\, dz = E[N(A)]$ (the expected number of points in a region A).
• The random variables $N(A_i)$ for $i = 1, \ldots, m$ are mutually independent if the regions $A_i$ are disjoint.
• N(A) is Poisson distributed with mean E[N(A)].

4.2.2 Poisson Likelihood

We assume a data set $D = (z_1, \ldots, z_n)$ with $z_i \in S \subset \mathbb{R}^d$, generated from a Poisson process with intensity $\Lambda(\cdot)$. In the literature one finds the Poisson likelihood
$$p(D|\Lambda) = \prod_{i=1}^{n} \Lambda(z_i) \times \exp\left[-\int_S \Lambda(z)\, dz\right] \qquad (4.1)$$
for the intensity function $\Lambda(\cdot)$. To proceed, one could choose a parametrisation for the function $\Lambda(\cdot)$ and estimate its parameters using ML. But what does this likelihood mean?

4.2.3 Radon–Nikodym derivative

A stochastic process X with probability measure P(X) often has no density with respect to Lebesgue measure. But it is possible to define densities p(X) (Radon–Nikodym derivatives) with respect to another reference measure R(X). One can write
$$p(X) = \frac{dP}{dR}(X),$$
if P is absolutely continuous with respect to R (if R(A) = 0 then P(A) = 0). The meaning of this expression becomes clear when we compute expectations of functions with respect to P:
$$E_P[f(X)] = \int f(x)\, dP(x) = \int f(x)\, p(x)\, dR(x) = E_R[f(X)\, p(X)].$$
This means that we can express expectations w.r.t. P through expectations w.r.t. the reference measure R; p(X) reweighs the different contributions to the integral. Hence, if the reference measure does not contain parameters that we wish to estimate, it makes sense to use p(x) as a likelihood for the parameters.

4.2.4 Poisson process density

For the Poisson case, the random variables X are sets $\Pi = (z_1, \ldots, z_n)$. These are infinite-dimensional objects, because the number of points is random and unbounded. One can show that the RN derivative of two Poisson processes is given by
$$p_\Lambda(\Pi) = \frac{dP_\Lambda}{dP_{\Lambda_0}}(\Pi) = \exp\left[-\int_S (\Lambda(z) - \Lambda_0(z))\, dz\right] \prod_{z_i \in \Pi} \frac{\Lambda(z_i)}{\Lambda_0(z_i)}$$
where $P_{\Lambda_0}$ is the reference measure with intensity $\Lambda_0$. If we want to estimate $\Lambda$, the parts with $\Lambda_0$ (which we assume does not contain parameters) can be factored out of the density, and we are essentially left with (4.1).
It is also interesting to express the Kullback–Leibler divergence in terms of the RN derivative. One can show that the general definition is
$$D_{KL}(Q\|P) = E_Q\left[\log \frac{dQ}{dP}(X)\right] = \int \log \frac{q(x)}{p(x)}\, dQ(x)$$
and the KL divergence does not depend on the reference measure R(X). Hence, with this definition, we can repeat the argument that ML asymptotically minimises a KL divergence.
We will next discuss an important technique for generating complex models from simple ones by mixing them together.

4.3 Latent Variable Models

• Exponential families allow for simple analytic parameter estimation by Maximum Likelihood.
• More complex models explain data by hidden (unobserved) variables, the so-called latent variables.
• However, Maximum Likelihood (ML) estimation for this class usually requires numerical optimisation. We will discuss an iterative (EM) algorithm which helps to simplify ML.

4.3.1 Latent variable models: Definition

• Y = observed variables.
• The observed data are explained through X = latent, unobserved variables.
• $\theta = (\theta_y, \theta_x)$ are sets of parameters.
• The likelihood is represented as a mixture:
$$p(Y|\theta) = \sum_X p(Y, X|\theta) = \sum_X p(Y|X, \theta_y)\, p(X|\theta_x)$$
• If the x's were known, ML would often be easy!

4.3.2 Mixtures of Gaussians

A typical example is a model for multimodal densities given by a mixture of K Gaussian densities:
$$p(y|\{\mu_c, \sigma_c, w(c)\}_{c=1}^{K}) = \sum_c w(c)\, \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left[-\frac{(y - \mu_c)^2}{2\sigma_c^2}\right] \equiv \sum_c p(c|\theta)\, p(y|c, \theta)$$
We have one hidden variable $c_i$ for each data point, telling us from which Gaussian component the observed point $y_i$ was generated. We also need, as usual, a mean and variance parameter for each component. The additional parameter vector w(c) gives the probability of a component c. Hence, $\theta = \{\mu_c, \sigma_c, w(c)\}_{c=1}^{K}$.
The likelihood is given by
$$p(Y|\theta) = \prod_{i=1}^{n} p(y_i|\theta) = \prod_{i=1}^{n} \left\{\sum_{c_i=1}^{K} w(c_i)\, \frac{1}{\sqrt{2\pi\sigma_{c_i}^2}} \exp\left[-\frac{(y_i - \mu_{c_i})^2}{2\sigma_{c_i}^2}\right]\right\}.$$
Unfortunately, ML estimation by setting $\nabla_\theta \ln p(D|\theta) = 0$ does not lead to simple equations, because the different components are coupled. One may simply use a numerical optimiser to solve the problem. An alternative is the iterative approach of

4.3.3 The Expectation–Maximisation (EM) Algorithm

which splits ML estimation into a sequence of simpler optimisation problems, for which the different components become uncoupled. It is defined by:
1. Start with an arbitrary $\theta_0$.
Iterate:
2. (E-Step): Compute the expectation
$$L(\theta, \theta_t) \equiv \sum_X p(X|Y, \theta_t) \ln p(Y, X|\theta)$$
with the posterior probability (given the observations) of the latent variables
$$p(X|Y, \theta_t) = \frac{p(Y|X, \theta_t)\, p(X|\theta_t)}{p(Y|\theta_t)}$$
3. (M-Step): Maximise
$$\theta_{t+1} = \operatorname*{argmax}_{\theta} L(\theta, \theta_t)$$
We will later show that
$$\ln p(Y|\theta_{t+1}) \geq \ln p(Y|\theta_t),$$
i.e. the likelihood is not decreasing! We are not guaranteed to find the global maximum in this way but often may converge to a local one. This can be improved by starting from different random initialisations.

4.3.4 Example: Mixture of Gaussians

For the MoG model we have to perform the following steps:
• (E-Step): Compute
$$L(\theta, \theta_t) \equiv \sum_c p(c|y, \theta_t) \ln \left\{\prod_{i=1}^{n} p(y_i, c_i|\theta)\right\}$$
with
$$p(c|y, \theta_t) = \prod_{i=1}^{n} p(c_i|y_i, \theta_t) = \prod_{i=1}^{n} \frac{p(y_i|c_i, \theta_t)\, p(c_i|\theta_t)}{p(y_i|\theta_t)}$$
and
$$p(y_i, c_i|\theta) = p(y_i|c_i, \theta)\, p(c_i|\theta)$$
• (M-Step): Update $\theta_{t+1} = \operatorname*{argmax}_{\theta} L(\theta, \theta_t)$.

4.3.5 Details

• E-Step: Compute
$$L(\theta, \theta_t) \equiv \sum_c p(c|y, \theta_t) \ln \left\{\prod_i p(y_i, c_i|\theta)\right\} = \sum_{i=1}^{n} \sum_c p(c|y_i, \theta_t) \ln p(y_i, c|\theta)$$
Note: the sums over the $p(c_j|y_j, \theta_t)$ for $j \neq i$ each equal 1.
• The log complete likelihood is
$$\ln p(y_i, c|\theta) = -\frac{1}{2} \ln(2\pi\sigma_c^2) - \frac{(y_i - \mu_c)^2}{2\sigma_c^2} + \ln w(c)$$

4.3.6 Explicit Formulas

• Variation with respect to $\mu_c$ gives an explicit result:
$$\sum_i (y_i - \mu_c)\, p(c|y_i, \theta_t) = 0 \quad \rightarrow \quad \mu_{c, t+1} = \frac{\sum_i y_i\, p(c|y_i, \theta_t)}{\sum_i p(c|y_i, \theta_t)}$$
This has a similar form to the ML estimate for a single Gaussian. The only difference is that each $y_i$ carries a weight $p(c|y_i, \theta_t)$, the so-called responsibility of c for generating data point $y_i$.
• The variation with respect to $\sigma_c^2$ yields
$$\sigma^2_{c, t+1} = \frac{\sum_i (y_i - \mu_{c, t+1})^2\, p(c|y_i, \theta_t)}{\sum_i p(c|y_i, \theta_t)}$$
Again we get a weighted average over data points.
• Variation with respect to w(c):
$$w_{t+1}(c) \equiv p(c|\theta_{t+1}) = \frac{1}{n} \sum_i p(c|y_i, \theta_t)$$
This is also an intuitive result: The weight of a component c equals its data-averaged responsibility. (A code sketch of these updates follows after the next subsection.)

4.3.7 Details about w(c)

We have to maximise
$$\sum_{i=1}^{n} \sum_c p(c|y_i, \theta_t) \ln w(c)$$
The w(c) fulfil $\sum_c w(c) = 1$. We can deal with this constraint by using a Lagrange multiplier $\lambda$. We obtain the Lagrange function
$$\sum_{i=1}^{n} \sum_c p(c|y_i, \theta_t) \ln w(c) - \lambda \left(\sum_c w(c) - 1\right)$$
which can be differentiated independently with respect to the w(c). This yields the equations
$$\frac{1}{w(c)} \sum_{i=1}^{n} p(c|y_i, \theta_t) - \lambda = 0 \quad \rightarrow \quad w(c) = \frac{1}{\lambda} \sum_{i=1}^{n} p(c|y_i, \theta_t).$$
Finally, to determine $\lambda$, we invoke the normalisation
$$1 = \sum_c w(c) = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_c p(c|y_i, \theta_t) = \frac{n}{\lambda}$$
Here we have used the fact that $p(c|y_i, \theta_t)$ is a probability. Thus, we get $\lambda = n$.
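The E-step responsibilities and the M-step formulas of Section 4.3.6 fit in a few lines. A minimal sketch for a one-dimensional mixture; the synthetic data, the number of components, the initialisation and the iteration count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])
n, K = y.size, 2

mu = np.array([-1.0, 1.0]); var = np.ones(K); w = np.full(K, 1 / K)

for _ in range(100):
    # E-step: responsibilities r[i, c] = p(c | y_i, theta_t)
    logp = -0.5 * (np.log(2 * np.pi * var) + (y[:, None] - mu) ** 2 / var) \
           + np.log(w)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means, variances and mixing weights
    Nc = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / Nc
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nc
    w = Nc / n

print(mu, var, w)  # roughly (-2, 2), (0.25, 1), (0.3, 0.7)
```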

Chapter 5

Week 4

We will next give a proof that with the EM algorithm the likelihood is never decreased. After this, we discuss a continuous mixture model, and then we introduce the Bayesian approach.

5.1 Analysis of EM

We will begin with the KL divergence. For any q(x):
$$D(q \| p(\cdot|y, \theta)) = \sum_x q(x) \ln \frac{q(x)\, p(y|\theta)}{p(y, x|\theta)} \geq 0$$
Rearranging the inequality, we get
$$-\ln p(y|\theta) \leq F(q, \theta) \equiv \sum_x q(x) \ln \frac{q(x)}{p(y, x|\theta)}.$$
We only have equality when $q(x) = p(x|y, \theta)$! From this we can show that
$$-\ln p(y|\theta) \leq F(q_t, \theta)$$
$$-\ln p(y|\theta_t) = F(q_t, \theta_t).$$
In the next step we relate F and L. Let $q_t(x) \doteq p(x|y, \theta_t)$; then
$$L(\theta, \theta_t) \equiv \sum_x p(x|y, \theta_t) \ln p(y, x|\theta) = -F(q_t, \theta) + \sum_x q_t(x) \ln q_t(x)$$
Thus maximising $L(\theta, \theta_t)$ w.r.t. $\theta$ is equivalent to minimising $F(q_t, \theta)$. Hence,
$$\ln p(y|\theta) - \ln p(y|\theta_t) \geq -F(q_t, \theta) - (-F(q_t, \theta_t))$$
$$\ln p(y|\theta_{t+1}) - \ln p(y|\theta_t) \geq 0,$$
where the last inequality follows because we maximised $-F$ with respect to $\theta$ in the M-step. This shows that the log-likelihood is not decreasing.

5.2 Pólya–Gamma mixture representation of logistic regression

We begin with the following continuous mixture representation (Polson et al. 2013) of $1/\cosh$ via a Laplace transform:
$$\frac{1}{\cosh(\frac{x}{2})} = \int_0^\infty e^{-\frac{1}{2}\omega x^2}\, p_{PG}(\omega)\, d\omega.$$
One can show that $p_{PG}(\omega)$ is indeed a proper density because of the infinite product representation
$$\frac{1}{\cosh\left(\sqrt{\frac{t}{2}}\right)} = \prod_{k=1}^{\infty} \left(1 + \frac{t}{2\pi^2 (k - 1/2)^2}\right)^{-1}.$$
Each factor is the Laplace transform of an exponential random variable, and products of Laplace transforms are Laplace transforms of sums of independent random variables. Hence, the random variable $\omega$ is an infinite sum of exponentials with decreasing means. For what follows, we only need the defining Laplace transform.

5.2.1 Pólya–Gamma representation of the sigmoid

From the 1/cosh result we easily find a corresponding representation of the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x/2}}{2\cosh(\frac{x}{2})} = \frac{1}{2}\, e^{x/2} \int_0^\infty e^{-\frac{x^2}{2}\omega}\, p_{PG}(\omega)\, d\omega.$$

5.2.2 Logistic regression

From this we can get a mixture form of logistic regression. Our generative model is: data given by $y = (y_1, \ldots, y_n)$, $y_i = \pm 1$, and inputs $(x_1, \ldots, x_n)$. The likelihood for a classifier with weight vector w is
$$p(y|w) = \prod_{i=1}^{n} \sigma(y_i\, w^\top x_i).$$
By using the mixture representation, we get the augmented likelihood
$$p(y, \{\omega_i\}_{i=1}^{n}|w) = \frac{1}{2^n} \prod_{i=1}^{n} p_{PG}(\omega_i)\; e^{\frac{y_i w^\top x_i}{2} - \frac{(w^\top x_i)^2}{2}\omega_i}.$$
In this form, the weights appear simply in quadratic form in the exponent! We can use this form to solve the ML estimation of w by an EM algorithm.

5.2.3 EM

For the E-step, we need the conditional density of the auxiliary variables $\{\omega_i\}_{i=1}^{n}$:
$$p(\{\omega_i\}_{i=1}^{n}|y, w) \propto \prod_{i=1}^{n} p_{PG}(\omega_i)\; e^{-\frac{(w^\top x_i)^2}{2}\omega_i}$$
The complete log-likelihood is
$$\ln p(y, \{\omega_i\}_{i=1}^{n}|w) = \mathrm{const} + \sum_{i=1}^{n} \frac{y_i\, w^\top x_i}{2} - \sum_{i=1}^{n} \frac{(w^\top x_i)^2}{2}\omega_i.$$
The expected complete log-likelihood is
$$L(w, w_t) = \mathrm{const} + \sum_{i=1}^{n} \frac{y_i\, w^\top x_i}{2} - \sum_{i=1}^{n} \frac{(w^\top x_i)^2}{2}\, E[\omega_i|y_i, w_t]$$
This can be easily optimised w.r.t. w. All we need is an explicit result for
$$E[\omega_i|y_i, w_t] = \frac{\int_0^\infty p_{PG}(\omega)\; e^{-\frac{(w_t^\top x_i)^2}{2}\omega}\, \omega\, d\omega}{\int_0^\infty p_{PG}(\omega)\; e^{-\frac{(w_t^\top x_i)^2}{2}\omega}\, d\omega}$$
In fact, this can be obtained from the Laplace transform. The integral is of the type
$$\frac{\int_0^\infty p_{PG}(\omega)\, \omega\, e^{-z\omega}\, d\omega}{\int_0^\infty p_{PG}(\omega)\, e^{-z\omega}\, d\omega} = -\frac{d}{dz} \ln \int_0^\infty p_{PG}(\omega)\, e^{-\omega z}\, d\omega = \frac{d}{dz} \ln \cosh\left(\sqrt{\frac{z}{2}}\right) = \tanh\left(\sqrt{\frac{z}{2}}\right) \frac{1}{2\sqrt{2z}}$$
The last step follows from the Laplace transform of $p_{PG}$:
$$\frac{1}{\cosh(\frac{x}{2})} = \int_0^\infty e^{-\frac{1}{2}\omega x^2}\, p_{PG}(\omega)\, d\omega$$
This result can be used to solve logistic regression using an EM algorithm. The method can also be extended to a Bayesian version of logistic regression.
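Putting the pieces together: since $L(w, w_t)$ is quadratic in w, the M-step is a linear solve. A minimal sketch of the resulting EM loop; the synthetic data are an assumption for illustration, and the expectation uses the tanh formula above, which for $c = w_t^\top x_i$ reduces to $\tanh(c/2)/(2c)$ with limit $1/4$ at $c = 0$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ w_true)), 1.0, -1.0)

def E_omega(c):
    # E[omega_i | y_i, w_t] = tanh(c/2) / (2c); the function is even in c
    # and approaches 1/4 as c -> 0, so clip small |c| for stability
    c = np.where(np.abs(c) < 1e-6, 1e-6, np.abs(c))
    return np.tanh(c / 2) / (2 * c)

w = np.zeros(d)
for _ in range(50):
    Om = E_omega(X @ w)              # E-step: expected omega_i
    A = X.T @ (Om[:, None] * X)      # M-step: solve A w = X^T y / 2
    w = np.linalg.solve(A, X.T @ y / 2)

print(w)  # approaches the ML weights of the logistic model
```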

5.3 The Bayesian approach to statistics

In the Bayesian approach, all prior knowledge (or lack of it) about unknown parameters is described by a probability density p(θ). The information from the data is described by the likelihood p(D|θ). Using Bayes rule, we compute the posterior distribution, which gives our belief about θ after seeing the data:
$$p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)}$$
with the evidence
$$p(D) = \int p(D|\theta)\, p(\theta)\, d\theta.$$
(Figure: posterior density of θ for a Bernoulli model for data sets of size n = 3, 10, 50, 100. The true value under which the data were generated was θ = 0.7; the prior was flat, p(θ) = 1 for 0 ≤ θ ≤ 1.)

5.3.1 Point estimators

The output of Bayesian inference is a probability distribution over parameters. There are several ways to make a point prediction:
• The maximum a posteriori (MAP) value
$$\hat{\theta}_{MAP} = \arg\max_\theta \ln p(\theta|D)$$
For a flat prior p(θ) = const this agrees with the ML estimate.
• Another point estimate is the posterior mean
$$\hat{\theta}_m = E[\theta|D] = \int \theta\, p(\theta|D)\, d\theta$$
• The posterior mean $\hat{\theta}_m$ minimises the loss function
$$L_2(\hat{\theta}) = \int \left(\hat{\theta} - \theta\right)^2 p(\theta|D)\, d\theta$$
Other point estimators are obtained by changing the loss function.

We will see later that for many parametric models and large n, the posterior variance → 0 and $\hat{\theta}_m \approx \hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.
The Bayes optimal prediction for the unknown distribution is the predictive distribution
$$p(x|D) = \int p(x, \theta|D)\, d\theta = \int p(x|\theta, D)\, p(\theta|D)\, d\theta = \int p(x|\theta)\, p(\theta|D)\, d\theta,$$
which typically will not be in the family p(x|θ).

5.3.2 Properties of Bayes procedures


• Implementation of prior knowledge.
• Regularisation if only small amount of data is available.
• Bayes yields a simple approach to model selection and error bars.

• The Bayes approach is conceptually simple but often computationally hard.
• It can be sensitive to 'wrong' priors, but one can learn priors too!

5.3.3 Bayes for 1-d Gaussian densities

We show the posterior for a simple case where $\sigma^2$ is known but $\mu$ is unknown. We use a (conjugate) prior
$$p(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}}$$
$\mu_0$ and $\sigma_0^2$ are hyperparameters, reflecting the prior beliefs about the location of the unknown $\mu$.
Given data $D = (x_1, \ldots, x_n)$, the posterior density for $\mu$ is
$$p(\mu|D) = \frac{p(D|\mu)\, p(\mu)}{p(D)} = \frac{p(\mu)}{p(D)} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \propto \exp\left[-\frac{1}{2}\mu^2 \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)\right] \times \exp\left[\mu \left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i\right)\right]$$
This can be rewritten explicitly as a Gaussian density
$$\frac{1}{\sqrt{2\pi\sigma_n^2}}\, e^{-\frac{(\mu - \mu_n)^2}{2\sigma_n^2}}$$
with posterior mean and variance
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \bar{x} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2},$$
where $\bar{x} = \sum_{i=1}^{n} x_i / n$ is the sample mean. We see that for large n we have $\sigma_n^2 \approx \sigma^2/n$. This means the uncertainty decreases, and also $\mu_n \approx \bar{x}$, which is the ML estimator.
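These two update formulas in code, as a minimal sketch; the prior hyperparameters and the data are assumptions for illustration:

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma02):
    # Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known
    # variance sigma2, given the prior N(mu0, sigma02)
    n = len(x)
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)
    mu_n = (n * sigma02 * np.mean(x) + sigma2 * mu0) / (n * sigma02 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(9)
x = rng.normal(1.5, 1.0, size=50)
print(gaussian_mean_posterior(x, sigma2=1.0, mu0=0.0, sigma02=4.0))
```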

Chapter 6

Week 5

6.1 Conjugate priors

We will next look at so-called conjugate priors, which allow for simple Bayesian updates in the case of exponential family models. These priors are defined in terms of the standard parametrisation of exponential families as
$$p(\theta|\tau, n_0) \propto \exp[\psi(\theta) \cdot \tau + n_0\, g(\theta)]$$
where $\tau$ and $n_0$ are hyperparameters. For these priors, the posterior will be of the same form:
$$p(\theta|D, \tau, n_0) \propto \exp\left[\psi(\theta) \cdot \sum_{i=1}^{n} \phi(x_i) + n\, g(\theta)\right] \times \exp[\psi(\theta) \cdot \tau + n_0\, g(\theta)] = \exp\left[\psi(\theta) \cdot \left(\sum_{i=1}^{n} \phi(x_i) + \tau\right) + (n + n_0)\, g(\theta)\right].$$
We simply replace $n_0 \to n_0 + n$ and $\tau \to \sum_{i=1}^{n} \phi(x_i) + \tau$ to obtain the posterior from the prior.
Let us look at some examples:

6.1.1 Bernoulli model

For the Bernoulli model, we have
$$P(x|\theta) = \theta^x (1 - \theta)^{1 - x} \propto \exp[\psi(\theta)\phi(x) + g(\theta)]$$
so that $\phi(x) = x$, $\psi(\theta) = \ln \frac{\theta}{1-\theta}$ and $g(\theta) = \ln(1 - \theta)$. Hence, the conjugate prior is
$$p(\theta|\tau, n_0) \propto \exp[\psi(\theta)\tau + n_0\, g(\theta)] = \exp\left[\tau \ln \frac{\theta}{1-\theta} + n_0 \ln(1 - \theta)\right] = \theta^\tau (1 - \theta)^{n_0 - \tau}.$$
This is of the form of a beta density, which is usually denoted as $\propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$.

6.1.2 Gaussian densities

Gaussian densities are expressed as an exponential family
$$p(x|\mu, \lambda) = \sqrt{\frac{\lambda}{2\pi}}\; e^{-\frac{\lambda}{2}(x - \mu)^2} = \exp[\psi(\theta) \cdot \phi(x) + g(\theta)]$$
with $\psi(\theta) = (\lambda\mu, -\frac{\lambda}{2})$, $\phi(x) = (x, x^2)$, and $e^{g(\theta)} \propto \sqrt{\lambda} \exp\left[-\lambda \frac{\mu^2}{2}\right]$. Hence, the conjugate prior is
$$p(\theta|\tau, n_0) \propto \exp[\psi(\theta) \cdot \tau + n_0\, g(\theta)] = \exp\left[-\frac{\lambda \tau_2}{2} + \tau_1 \lambda\mu\right] \left(\sqrt{\lambda}\; e^{-\lambda\frac{\mu^2}{2}}\right)^{n_0}.$$
This is of the form of a Normal–Gamma prior, $p(\mu, \lambda|\alpha, \beta, \gamma) = \sqrt{\frac{\lambda(2\alpha - 1)}{2\pi}}\; e^{-\frac{(\mu - \gamma)^2 \lambda(2\alpha - 1)}{2}}\, \lambda^{\alpha - 1} e^{-\beta\lambda}$, after an appropriate transformation of the hyperparameters.

6.2 Bayes Model selection

The Bayes approach offers a conceptually simple way of model selection. Suppose we are given a variety of models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ with different priors on parameters, $p(\theta_1|\mathcal{M}_1), p(\theta_2|\mathcal{M}_2)$, and likelihoods $P(D|\theta, \mathcal{M}_i)$.
A prior probability over models $P(\mathcal{M})$ leads to the posterior probability of a model:
$$P(\mathcal{M}|D) = \frac{P(D|\mathcal{M})\, P(\mathcal{M})}{P(D)} = \frac{P(\mathcal{M}) \int P(D|\theta, \mathcal{M})\, p(\theta|\mathcal{M})\, d\theta}{P(D)}.$$
If we assume that all models have the same prior probability, we choose the model with the largest evidence $\int P(D|\theta, \mathcal{M})\, p(\theta|\mathcal{M})\, d\theta$. The evidence is also frequently used to optimise hyperparameters.

6.2.1 Example: Bayesian polynomial regression

We assume a generalised linear model where data are independently generated as $y_i = f_w(x_i) + \xi_i$ for $i = 1, \ldots, n$, with $f_w(\cdot)$ unknown and $\xi_i$ i.i.d. $\sim \mathcal{N}(0, \sigma^2)$. To model the function $f_w(\cdot)$ we choose the following polynomial class of models (allowing for different orders K):
$$f_w(x) = \sum_{j=0}^{K} w_j x^j.$$
The likelihood is
$$p(D|w) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\sum_{i=1}^{n} \frac{(y_i - f_w(x_i))^2}{2\sigma^2}\right].$$
We also specify a Gaussian prior distribution on the weights,
$$p(w) = \frac{1}{(2\pi\sigma_0^2)^{(K+1)/2}} \exp\left[-\frac{\sum_{j=0}^{K} w_j^2}{2\sigma_0^2}\right].$$
The posterior density of the parameters w is given by
$$p(w|D) = \frac{p(D|w)\, p(w)}{p(D)},$$
which is also a multivariate Gaussian.

6.2.2 Details of calculation

− ln p(D|w) − ln p(w) = const + Σ_{i=1}^n (y_i − Σ_{j=0}^K w_j x_i^j)^2/(2σ^2) + Σ_{j=0}^K w_j^2/(2σ_0^2)
 = (1/(2σ^2)) Σ_{jk} w_j w_k Σ_{i=1}^n x_i^j x_i^k + Σ_j w_j^2/(2σ_0^2) − (1/σ^2) Σ_j w_j Σ_{i=1}^n y_i x_i^j
 = (1/2) w^T ((1/σ_0^2) I + (1/σ^2) X^T X) w − (1/σ^2) w^T X^T y + const,

where we have defined X_{ik} = x_i^k.


From this, we can read off the posterior mean

E[w|D] = ((σ^2/σ_0^2) I + X^T X)^{−1} X^T y

and the posterior covariance

COV[w|D] = σ^2 ((σ^2/σ_0^2) I + X^T X)^{−1}.
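A minimal numpy sketch of these two formulas (function name and interface are our own):

```python
import numpy as np

def poly_posterior(x, y, K, sigma2, sigma02):
    """Posterior mean and covariance of the weights for polynomial order K."""
    X = np.vander(x, K + 1, increasing=True)           # X[i, k] = x_i^k
    A = (sigma2 / sigma02) * np.eye(K + 1) + X.T @ X   # (sigma^2/sigma_0^2) I + X^T X
    mean = np.linalg.solve(A, X.T @ y)
    cov = sigma2 * np.linalg.inv(A)
    return mean, cov
```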

6.2.3 Model selection


If we simply used the likelihood (without prior) to select the model, we would
choose the most complex model: for large K it is possible to fit the noise in the
data, which yields the largest likelihood.
Bayesian model selection could use the evidence as a function of the model,

p(D|K) = ∫ p(D|w, K) p(w|K) dw,

instead. Hence, we need to compute the evidence. One way of doing this would
be to explicitly perform a Gaussian integral. But it is also possible to think
probabilistically and compute the joint density p(y|K) of the observations under the
generative Bayesian model. The data y ≡ D are

y = Xw + ξ   (y_i = Σ_k w_k X_{ik} + ξ_i),

where

ξ ∼ N(0, σ^2 I_n),  w ∼ N(0, σ_0^2 I_{K+1})

are two independent Gaussian random vectors. A linear combination of the two
is also Gaussian. Hence y is Gaussian and p(y|K) is a multivariate Gaussian
density. We have to find its mean and covariance. Obviously E[y] = 0 and

Σ = COV[y] = E[yy^T] = X E[ww^T] X^T + σ^2 I_n = σ_0^2 X X^T + σ^2 I_n.

Hence

y ∼ N(0, Σ),

and we obtain the explicit formula

ln p(y|K) = ln p(y) = −(n/2) ln 2π − (1/2) ln|Σ| − (1/2) y^T Σ^{−1} y,

where Σ = σ_0^2 X X^T + σ^2 I_n.
We will illustrate this result on an experiment where n = 21 data points y_i are
generated from equally spaced inputs x_i in the interval [−1, 1] using the true
function f(x) = x^4 − x^2 with added noise of variance σ^2 = 0.01. We use a
prior distribution with variance σ_0^2 = 1 for inference; a sketch of the computation
is given below. The figure shows the noisy observations, where the points are connected by
lines.
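A minimal sketch of this log-evidence computation, using the experimental settings quoted above (the random seed and the range of K are our own choices):

```python
import numpy as np

def log_evidence(x, y, K, sigma2=0.01, sigma02=1.0):
    """ln p(y|K) for the Bayesian polynomial model (y is zero-mean Gaussian)."""
    n = len(y)
    X = np.vander(x, K + 1, increasing=True)
    S = sigma02 * X @ X.T + sigma2 * np.eye(n)        # Sigma = sigma_0^2 X X^T + sigma^2 I
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 21)
y = x**4 - x**2 + 0.1 * rng.normal(size=21)           # true f plus noise (sigma^2 = 0.01)
print(max(range(8), key=lambda K: log_evidence(x, y, K)))   # typically selects K = 4
```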

The next figure shows the log-evidence as a function of K, showing that the
correct polynomial order K = 4 gives the most likely model.

We can also show a reconstruction of the function (in blue) using the posterior
mean prediction f̂(x) = ∫ f_w(x) p(w|D) dw, and compare it with the exact
function (green) and the data (red stars).

If we repeat the experiment with the 'wrong' prior σ_0 = 2, which assumes
typically bigger coefficients, the plot of the log-evidence identifies the constant
polynomial K = 0 as the most likely function.

The reconstruction of the constant using the posterior mean:
Chapter 7

Week 6

We will next discuss the large-n behaviour of the posterior density and derive
approximations for posterior integrals. Finally, we will introduce Monte Carlo
sampling methods for a different type of computation of such integrals.

7.1 Asymptotics of posterior


We will argue that for large n, the posterior is concentrated around the MAP (≈ ML)
value θ̂ and (for continuous θ) can be approximated by a Gaussian density.
This results from the behaviour of the likelihood for large n. We illustrate this
with the posterior density of θ for a Bernoulli model for data sets of different sizes
n = 3, 10, 50, 100. The true parameter value was θ = 0.7.

We can see that for large n, the posterior has a Gaussian shape and is concentrated
around the true value. To get more insight, we perform a Taylor expansion
of the log–likelihood around the ML estimator θ̂ (for a one–dimensional
problem, for simplicity):
ln p(D|θ) = Σ_{i=1}^n ln p(x_i|θ) = C + n (c_2/2)(θ − θ̂)^2 + n (c_3/3!)(θ − θ̂)^3 + ...

with the constant C = Σ_{i=1}^n ln p(x_i|θ̂) and

c_k = (1/n) Σ_{i=1}^n ∂_θ^k ln p(x_i|θ)|_{θ̂} ≈ E_x[∂_θ^k ln p(x|θ)|_{θ̂}] = O(1).

Note that c_1 = 0 because the first derivative vanishes at the ML value! In the
last step, we have approximated empirical averages over independent x_i by the
expectation. Assuming concentration around θ ≈ θ̂, we identify the dominating
terms:

p(θ|D) ∝ exp(−(n|c_2|/2)(θ − θ̂)^2) [1 + n (c_3/3!)(θ − θ̂)^3 + ...],

which is a Gaussian with small corrections. The correction was obtained by a
further Taylor expansion of the exponential.
With high probability with respect to the Gaussian, we have |θ − θ̂| ∼ 1/√n.
Hence the correction term is typically of order n(θ − θ̂)^3 ∼ n^{−1/2}.

7.1.1 Bayes asymptotics


This idea can be extended to the multivariate parameter case, and one obtains
a Gaussian

p(θ|D) ≈ N(θ̂, I^{−1}(θ̂))

for n → ∞, where θ̂ is the ML estimator and I_{ij}(θ) = −∂_i ∂_j Σ_{k=1}^n ln p(x_k|θ).
This Bayesian uncertainty agrees with the frequentist uncertainty of ML estimation
for large n, if we set θ̂ ≈ θ_true and approximate I(θ̂) ≈ nJ(θ) (n times the
per-sample Fisher information).

7.2 Laplace approximation


Laplace's method is a technique for approximating integrals of the form

∫ e^{−h(x)} dx,

assuming that e^{−h(x)} can be approximated by a Gaussian-shaped function.
We perform a 2nd-order Taylor expansion of the exponent at its maximum
to obtain such a Gaussian function. Again, the first-order term vanishes at the
maximum:

h(x) ≈ h(x̂) + (1/2)(x − x̂)^T A (x − x̂)

with A = ∇^2 h(x̂). We can now perform the Gaussian integral by first shifting
the integration x − x̂ → x and using

(2π)^{−d/2} |Σ|^{−1/2} ∫ e^{−(1/2) x^T Σ^{−1} x} dx = 1.

Hence

∫ e^{−h(x)} dx ≈ e^{−h(x̂)} ∫ exp[−(1/2)(x − x̂)^T A (x − x̂)] dx = e^{−h(x̂)} (2π)^{d/2} |A|^{−1/2}.

As an illustration of Laplace's method, we compute an approximation to Euler's
Gamma function,

Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx  for t > 0.

This is an interesting function; we can show, e.g. using integration by parts, the
recursion

Γ(t + 1) = ∫_0^∞ x^t e^{−x} dx = t Γ(t),

and that Γ(1) = 1. Hence for integer t = n, we have the factorial

Γ(n) = (n − 1)!.

By performing an approximation to the integral, we obtain a simple asymptotic
formula for factorials, valid for large n.
To apply Laplace's method we transform the integration variable x via
y = ln x, so that the region of integration is no longer constrained but is
the entire real axis. (If we worked with the original variable, we would obtain
a slightly worse approximation.) This gives x = e^y and dx/dy = e^y, and the
integral becomes

∫_0^∞ x^{t−1} e^{−x} dx = ∫_{−∞}^∞ e^{(t−1)y} e^{−e^y} e^y dy ≡ ∫_{−∞}^∞ e^{g(y)} dy.

We have g(y) = −e^y + ty and g′(y) = −e^y + t. Thus the maximiser is ŷ = ln t.
We need the 2nd derivative at the maximum: g″(y) = −e^y and g″(ŷ) = −t.
The Laplace approximation yields

Γ(t) ≈ e^{−t + t ln t} ∫_{−∞}^∞ e^{−(t/2)(y−ŷ)^2} dy = √(2π) t^{t−1/2} e^{−t}.

Specialised to integer arguments, this is known as Stirling's formula for factorials.
The quality of the approximation (the relative error) is shown in a figure we took from
https://de.wikipedia.org/wiki/Stirlingformel; a quick numerical check is sketched below.
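A minimal check of the approximation against the exact Gamma function (the choice of test points is ours):

```python
import math

def stirling(t):
    """Laplace approximation Gamma(t) ~ sqrt(2 pi) t**(t - 1/2) exp(-t)."""
    return math.sqrt(2 * math.pi) * t ** (t - 0.5) * math.exp(-t)

for n in (2, 5, 10, 50):
    exact = math.gamma(n)                    # Gamma(n) = (n-1)!
    print(n, (exact - stirling(n)) / exact)  # relative error decays roughly like 1/(12 n)
```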

7.2.1 Approximating the evidence


We will now use the Laplace method to approximate the evidence integral at
the MAP. This is motivated by the fact that the posterior is (for large n) Gaussian
shaped. Hence, we expand the un-normalised posterior around its maximum
(the MAP) and get

− ln p(D) = − ln ∫ p(D|θ)p(θ) dθ = − ln ∫ exp[ln p(D|θ) + ln p(θ)] dθ
 ≈ − ln p(D|θ̂) − ln p(θ̂) − (d/2) ln(2π) + (1/2) ln|A|,

where A = −∇^2 ln p(θ̂|D) and θ̂ is the MAP estimator. This approximation
only requires the MAP and the local curvature at the MAP value.
One can further approximate this result to get a fairly crude approximation
to the evidence known as the Bayes Information Criterion (BIC)
for Bayes model selection. We ignore all the terms that do not scale with n
or d and assume θ̂ ≈ θ_ML. Finally, for large n all matrix elements of A scale
asymptotically ∝ n (they are computed from sums over the x_i). Hence, the
determinant scales like |A| = O(n^d). Thus we get

− ln p(D) ≈ − ln p(D|θ_ML) + (d/2) ln n.

The first term is the negative log-likelihood at the optimum, which typically
decreases with increasing model complexity. The second term increases with the
complexity (dimensionality d) of the model. For given data D, there should be
an optimal model dimension d which minimises the right-hand side.

7.2.2 Posterior expectations


We can use Laplace's method to compute other Bayesian expectations besides
the evidence. We represent the expectation as a ratio of two integrals,

E[g(θ)|D] = ∫ e^{−h*(θ)} dθ / ∫ e^{−h(θ)} dθ,

with

−h*(θ) = ln p(θ) + ln p(D|θ) + ln g(θ)
−h(θ) = ln p(θ) + ln p(D|θ).

We apply Laplace to both numerator and denominator. Let θ̂* and θ̂ be the
minimisers of h* and h, respectively. Then we get

E[g(θ)|D] ≈ √( |∇^2 h(θ̂)| / |∇^2 h*(θ̂*)| ) exp[−h*(θ̂*) + h(θ̂)].

7.2.3 2 - layer Bayesian neural networks


Laplace's method can be applied to a two-layer Bayesian neural network in
order to optimise hyperparameters and the number of hidden units using the
evidence. The neural network has the input–output relation

f_w(x) = Σ_j W_j σ(w_j^T x),

where e.g. σ(z) = tanh(z), which is suitable for regression. For classification, one
would add a further sigmoid giving the probability of output y = 1.
We can view the neural network as a probabilistic model for outputs y:

p(y|x, w) ∝ exp(−(β/2)(y − f_w(x))^2)   (regression)
p(y|x, w) = (1/(1 + e^{−f_w(x)}))^y (1/(1 + e^{f_w(x)}))^{1−y}   (classification),

where the weights are the parameters. This can be made into a Bayesian model
by adding a prior on the weights. We will use so–called ARD (automatic relevance
determination) priors for the input-to-hidden weights, which are (factorising)
Gaussian densities. If we define w_{ik} to be the weight connecting input k to
hidden unit i, we have

p(w_{ik}) ∝ exp(−(1/2) α_k w_{ik}^2).

The hyperparameter α_k (which is shared by all weights connecting input feature
x_k) determines the influence of input x_k on the output. For large α_k, the
prior shuts off the weights w_{ik} for all i and feature x_k has no relevance.
For small α_k, the Gaussian is broad, giving strong influence. We can use
the Laplace approximation for the evidence to perform hyperparameter (α_k)
optimisation and model selection (number of hidden units).

We will study the performance of this method on the artificial Friedman
data set. This is generated as

y(x) = 0.1 e^{4x_1} + 4/(1 + e^{−20(x_2 − 1/2)}) + 3x_3 + 2x_4 + x_5 + 0 · Σ_{i=6}^{10} x_i + ν,

where ν denotes added noise. The function gives the inputs x_i different relevance:
x_1 and x_2 appear inside highly nonlinear functions and strongly influence
the output; x_3, x_4, x_5 appear linearly and are thus somewhat less relevant;
the remaining inputs have no relevance for the output. A sketch of the data
generation is given below.
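A minimal generator for this toy data set (the input distribution and noise level are our own illustrative choices; the notes do not specify them):

```python
import numpy as np

def friedman(n, noise_std=1.0, rng=None):
    """Generate the toy data set above; uniform inputs on [0, 1] and the
    noise level are assumptions, not taken from the notes."""
    rng = rng or np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=(n, 10))      # 10 inputs, only the first 5 relevant
    y = (0.1 * np.exp(4.0 * x[:, 0])
         + 4.0 / (1.0 + np.exp(-20.0 * (x[:, 1] - 0.5)))
         + 3.0 * x[:, 2] + 2.0 * x[:, 3] + x[:, 4]
         + noise_std * rng.normal(size=n))
    return x, y
```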
The following plots show the results of Bayesian neural network learning
for this toy problem. The first figure shows the test error for Bayes learning
(blue) as a function of the number of hidden units, using the MAP as output, and
compares with a vanilla backprop algorithm (red) which ignores the prior. We
can see that a network with 3 hidden units appears to be the optimal representation
for both methods. But we also see that the Bayes predictions are more robust
when we don't use the optimal setting.

The second plot shows the optimised hyperparameters α_k for the different
inputs x_k, showing a clear relevance (small α_k) for the first two inputs and less
relevance for the next three. Finally, the remaining inputs have very large α_k and
are found to be irrelevant; the corresponding weights are essentially set to w_{ik} = 0
by the prior.

The final plot shows the log-evidence as a function of the number of hidden
units. We find the optimum at three hidden units, which also coincides with
the best test error.

7.2.4 Summary: Laplace approximation


• The method approximates posterior by a Gaussian (2nd order Taylor ex-
pansion of log–posterior around MAP value).

• It becomes asymptotically exact for large number n of data for finite


dimensional models with continuous parameters (under technical condi-
tions).

• Advantages: The intractable integrations are replaced by optimisation
(finding the MAP). The Hessian, which is required for the approximate
covariance of the posterior, can also be helpful for a Newton–Raphson
optimisation algorithm.
• Disadvantages: It is only a local approximation; it takes into account only the
MAP and the curvature of the posterior at the MAP, and it ignores other
posterior modes. It also can't be used for parameters which are discrete
variables.

7.3 Jeffrey’s prior


We will add a brief introduction to Jeffrey's prior. This prior is considered to
be uninformative. This should not be confused with a flat prior p(θ) = const,
because if a prior is constant for one parametrisation of a model, this will
change under transformations of the parameters when we consider the
transformations of probability densities. We would rather like to have priors
which are 'uniform' in the space of the probability densities p(x|θ) that we
consider.
To define such a prior, one uses the Fisher information, defined as

J_{ij}(θ) = ∫ p(x|θ) ∂_i ln p(x|θ) ∂_j ln p(x|θ) dx = −∫ p(x|θ) ∂_i ∂_j ln p(x|θ) dx.

We have shown that under a reparametrisation θ = f(τ) (in one dimension) the
Fisher information transforms as J̃(τ) = (f′(τ))^2 J(f(τ)).
Jeffrey's prior (assume the parameter space Θ is compact) is defined as

p_Jeff(θ) ∝ √J(θ).

If we look at prior expectations of a function g, we see that this can be written
using transformations of parameters as

E[g(θ)] = C ∫_Θ g(θ) √J(θ) dθ = C ∫_T g(f(τ)) √(J(f(τ))) f′(τ) dτ = C ∫_T g(f(τ)) √(J̃(τ)) dτ.

Hence, for both parametrisations, the prior has the same form.
For a Bernoulli model, we have

p_Jeff(θ) ∝ 1/√(θ(1 − θ)).

The multivariate generalisation of Jeffrey's prior is

p_Jeff(θ) ∝ √|J(θ)|.

One can show that Jeffrey's prior fulfils certain minimax properties asymptotically.
This means that the frequentist value of a certain risk function of the
Bayes prediction becomes independent of the true model, so one gets optimal
predictions for the worst true model in the family.
For non-compact parameter spaces, Jeffrey's prior is not normalisable. The
use of such improper priors (even if the posterior is normalisable) can be
dangerous.

7.4 Monte Carlo methods for Bayesian inference


This is another method for approximating posterior expectations. Rather than
accepting a biased approximation (e.g. with Laplace), we use a sampling method
which draws parameters at random from the posterior and yields unbiased
estimates of expectations:

E[g(θ)|D] = ∫ p(θ|D) g(θ) dθ ≈ (1/N) Σ_{i=1}^N g(θ_i),  θ_i ∼ p(θ|D).

The error is O(N^{−1/2}) and decreases to zero as the number of samples grows.
A nice property of such Monte Carlo methods is that marginalisation over
components of a parameter vector is trivial: if θ = (θ^{(1)}, ..., θ^{(d)}) and we
are interested in θ^{(k)} only, then

E[g(θ^{(k)})|D] ≈ (1/N) Σ_{i=1}^N g(θ_i^{(k)}),

where θ_i ∼ p(θ|D). This means we just keep the components of the samples that
we need; there is no need to perform analytical integrals over joint densities.
However, we need methods for sampling from arbitrary probability distributions.
Posteriors may depend on the data in a complicated way (except for exponential
families with conjugate priors, but those can usually be handled without MC).
Another problem comes from the (often) unknown normalisation of the
posterior: usually we just have the unnormalised version p(θ|D) ∝ p(D|θ)p(θ).
So a good MC method should not require the normalisation.

7.4.1 The transformation method


We start with a basic technique that usually needs some analytical calculations.
Suppose we know how to sample a random variable Y from a simple distribution
p_Y, e.g. the uniform one, Y ∼ U(0, 1). Can we find a (deterministic)
transformation such that X = T^{−1}(Y) has the desired target distribution p_X?
We will specialise to invertible smooth transformations and write Y = T(X).
The density of X is then

p_X(x) = p_Y(T(x)) |∂T(x)/∂x|.

An application to 1-d cases is straightforward: Let Y ∼ U(0, 1) have uniform
density, i.e. p_Y(y) = 1. Choose

T(x) = Pr(X ≤ x) = ∫_{−∞}^x p_X(v) dv  →  dT(x)/dx = p_X(x).

Hence, to generate samples from the density p_X we sample

Y ∼ U(0, 1),  X = T^{−1}(Y) ∼ p_X.

This can be illustrated for the exponential density

p(x) = e^{−x},  x ≥ 0,

for which we have

T(x) = ∫_0^x e^{−v} dv = 1 − e^{−x}.

To generate an exponentially distributed random variable X, we use

Y ∼ U(0, 1),  X = T^{−1}(Y) = −ln(1 − Y),  or alternatively X = −ln Y.

Note: Y and 1 − Y have the same uniform density.
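A two-line numpy sketch of this inverse-transform sampler (sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)
x = -np.log(u)                 # X = -ln(Y) ~ Exp(1), using that Y and 1 - Y are both U(0,1)
print(x.mean(), x.var())       # both close to 1 for the unit exponential
```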

7.4.2 Comments on transformation method


We need to compute and invert nonlinear transformations and integrals! This
can be done for a variety of known univariate densities, but for most multivariate
densities it seems intractable. However, in recent years some new ideas for doing
this numerically have come up in machine learning, based on iterative, sample-based
approximations to transformations which are represented e.g. by neural networks.
We will come back to some of these ideas later.

7.4.3 Rejection method: General idea


Another classical technique is the rejection method. Again, the goal is to draw
random samples from the target p(x). Assume we know how to draw samples
from another distribution q(x), where p(x) ≤ C q(x) for a constant C ≥ 1.
Consider the regions under the two curves, A = {(x, y) : 0 ≤ y ≤ C q(x)} and
B = {(x, y) : 0 ≤ y ≤ p(x)} ⊆ A. The algorithm is:

Generate (X, Y) ∼ U(A)  (uniform on A).
If (X, Y) ∈ B: accept, keep X.
Else: reject (X, Y) and start again.

We will now show that this construction returns samples X ∼ p(x).
By construction, accepted vectors (X, Y) ∼ ρ(x, y) = U(B) are uniform in B.
Since Area(B) = ∫ p(x) dx = 1, the density is

ρ(x, y) = 1 for (x, y) ∈ B,  ρ(x, y) = 0 for (x, y) ∉ B.

Our claim: the marginal density of X is p(x). The proof is simple:

ρ_X(x) = ∫_0^{p(x)} ρ(x, y) dy = ∫_0^{p(x)} 1 dy = p(x).

We have thus reduced the sampling problem to creating uniform random samples
in A. We claim that the following method does the job:

X ∼ q(·),  Y|x ∼ U(0, C q(x)),

i.e. Y = U · C q(x) with U ∼ U(0, 1). Proof: We compute the joint density from
the marginal and the conditional as

ρ(x, y) = ρ(y|x) ρ(x) = (1/(C q(x))) q(x) = 1/C,

which is constant on A. The efficiency of the method is given by the probability
of acceptance,

Pr(accept) = Area(B)/Area(A) = 1/C.

7.4.4 Summary: Rejection method
• Problem: We need random samples from the target density p(·). We can
draw random variables from the density q(·) (proposal density).

• Assume p(x)/q(x) ≤ C.

• Algorithm: Generate two independent random variables X ∼ q(·) and
U ∼ U(0, 1). If U ≤ p(X)/(C q(X)), accept X. Otherwise reject and start
again.

7.4.5 Example: Positive Gaussian from Exponential


The target has density p(x) = √(2/π) exp(−x^2/2) for 0 ≤ x < ∞. As proposal we
choose q(x) = e^{−x} (exponential). A good candidate is C = √(2e/π), for which
p(x)/(C q(x)) = exp(−(x − 1)^2/2). The plot shows the densities together with
uniform samples in the regions.

This can be turned into a sampler for Gaussian random variables by multiplying
the positive samples by a random independent sign, as in the sketch below.
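A minimal implementation of this rejection sampler (sample size and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_gaussian(n):
    """Rejection sampler for p(x) = sqrt(2/pi) exp(-x^2/2), x >= 0,
    with Exp(1) proposal and C = sqrt(2e/pi), so p/(Cq) = exp(-(x-1)^2/2)."""
    out = []
    while len(out) < n:
        x = rng.exponential()                    # X ~ q
        if rng.uniform() <= np.exp(-0.5 * (x - 1.0) ** 2):
            out.append(x)                        # accept with probability p(x)/(C q(x))
    return np.array(out)

z = positive_gaussian(10_000) * rng.choice([-1.0, 1.0], size=10_000)  # random sign -> N(0, 1)
print(z.mean(), z.std())
```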

Chapter 8

Week 7

We will next discuss Markov chain Monte Carlo methods. We give up on independent
samples and instead generate samples from a Markov chain whose distribution
asymptotically approaches the target distribution.

8.1 Markov Chain Monte Carlo (MCMC)


• Goal: Generate (usually dependent!) samples from a given distribution
p(x).
• General idea: Construct a Markov chain with a transition probability
T(y|x) that has p(x) as its stationary distribution.
• Assume there is only a single stationary distribution and that any initial
distribution converges to it. Asymptotically (that is, if we wait long enough),
the distribution of the samples X_t becomes arbitrarily close to p(x).
• We obtain (1/N) Σ_{t=1}^N h(X_t) → E[h(X)] (ergodicity), but the result for the
variance is different from the case of independent samples.

8.1.1 Discrete time series


We can always rewrite the joint distribution of X_t, X_{t−1}, ..., X_0 in terms of
conditionals,

p(x_t, x_{t−1}, ..., x_0) = p(x_0) ∏_{i=1}^t p(x_i|x_{i−1}, ..., x_0).

If we make the Markov assumption (and assume that the chain is homogeneous),
we get

p(x_{i+1}|x_i, ..., x_0) = T(x_{i+1}|x_i),

where the transition density T fulfils

Pr(X_{t+1} ∈ A | X_t = y) = ∫_A T(x|y) dx,

and similarly for discrete random variables.

8.1.2 Stationary distributions


We will again rewrite the joint distribution of the Markov chain as

p(x_t, x_{t−1}, ..., x_0) = p(x_0) ∏_{i=1}^t T(x_i|x_{i−1}).

From this we get the marginal distribution by integrating out the other variables:

p_t(x) = ∫ p(x_t = x, x_{t−1}, ..., x_0) dx_{t−1} ... dx_0.

Hence, we get the recursion

p_{t+1}(x) = ∫ T(x|y) p_t(y) dy.

The stationary distribution fulfils the condition

p(x) = ∫ T(x|y) p(y) dy.

The idea of MCMC is to construct transition probabilities which leave a given


target distribution invariant.

8.1.3 Detailed balance and reversible Markov chains


Detailed balance is a sufficient condition for obtaining a transition probability
which makes a given distribution stationary. To derive it, we rewrite the
equation for the stationary distribution as

0 = ∫ T(x|y) p(y) dy − ∫ p(x) T(y|x) dy.

This is true because the integral over y in the second term equals 1. We will
construct T(y|x) such that

T(x|y) p(y) = T(y|x) p(x)   (detailed balance).

If this condition is fulfilled for all x and y, then integrating over y yields
stationarity. The Markov chain is then called reversible.

8.2 Metropolis - Hastings method
This method defines a large class of MC algorithms which fulfil detailed balance.
The user has to define a proposal distribution q(x′|x). The Markov chain is
generated as follows:
• Given a state x = x_t of the Markov chain, generate a new state x′ with
probability distribution q(x′|x).

• Define the acceptance probability

A(x′; x) = min(1, p(x′) q(x|x′) / (p(x) q(x′|x))).

• Accept the new state, x_{t+1} = x′, with probability A(x′; x). This is done
by generating a uniformly distributed random variable u ∼ U(0, 1); we
accept x′ if u ≤ A(x′; x). Otherwise reject the new state, i.e. keep the old
state x_{t+1} = x, with probability 1 − A(x′; x).

8.2.1 Proof of detailed balance for Metropolis Hastings


The MH method defines a Markov chain with transition probability

T(x′|x) = A(x′; x) q(x′|x) + (1 − α(x)) δ(x′ − x),

where the term with the Dirac distribution takes into account that the chain
stays in its old state when the proposal is rejected. The term α(x) can be
obtained from the normalisation 1 = ∫ T(x′|x) dx′, which leads to
α(x) = ∫ A(x′; x) q(x′|x) dx′.

We will show detailed balance using its definition. We first concentrate
on the first part of the transition distribution and write

A(x′; x) q(x′|x) p(x) = q(x′|x) min(1, p(x′) q(x|x′)/(p(x) q(x′|x))) p(x)
 = min(q(x′|x) p(x), q(x|x′) p(x′))
 = p(x′) q(x|x′) min(p(x) q(x′|x)/(p(x′) q(x|x′)), 1) = A(x; x′) q(x|x′) p(x′).

Here we have used the fact that we can multiply both sides of the min operation
with non-negative numbers. Since we also have (1 − α(x))δ(x′ − x) =
(1 − α(x′))δ(x − x′), detailed balance is proved.
An interesting property of the MH method is that only ratios p(x′)/p(x) of
probabilities (densities) are required. Hence, we can work with un-normalised
probabilities. This is very useful for Bayesian approaches, where the normalisation
term is given by the evidence, which is often hard to compute.

8.2.2 Random walk sampler
The simplest idea is to work with a proposal that completely ignores the target.
For continuous state spaces one may choose a move

x′ = x + ρ z,

where z ∼ N(0, I). This proposal defines a random walk in state space and is
symmetric, q(x′|x) = q(x|x′). The acceptance probability is then simply

A(x′; x) = min(p(x′)/p(x), 1).

For symmetric proposals, one speaks of a Metropolis sampler.
The choice of ρ is important for the performance of the algorithm. For large
ρ, acceptance becomes highly unlikely and the sampler gets stuck for long times.
Small ρ leads to high acceptance rates but to slow diffusion: the relevant
states are visited only slowly. This is illustrated in the two figures, which show
the random walk sampler applied to a two-dimensional Gaussian density; a minimal
implementation is sketched below. On the left we have ρ = 1 and on the right
ρ = 0.1 (1000 samples).
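A minimal random-walk Metropolis sampler for this 2-d Gaussian experiment (the target, step size and burn-in are illustrative choices):

```python
import numpy as np

def rw_metropolis(log_p, x0, rho, n_steps, rng):
    """Random walk Metropolis: propose x' = x + rho*z and accept with
    probability min(1, p(x')/p(x)); log_p may be unnormalised."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        prop = x + rho * rng.normal(size=x.size)
        if np.log(rng.uniform()) <= log_p(prop) - log_p(x):
            x = prop                          # accept; otherwise keep the old state
        samples[t] = x
    return samples

rng = np.random.default_rng(0)
samples = rw_metropolis(lambda x: -0.5 * x @ x, [3.0, 3.0], rho=1.0,
                        n_steps=1000, rng=rng)   # 2-d standard Gaussian target
print(samples[200:].mean(axis=0))                # discard burn-in
```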

8.2.3 Independence sampler


This sampler is at the other end of the spectrum. It completely ignores the
present state and generates proposals q(x′|x) = q(x′), independent of x, in the
Metropolis–Hastings method. Thus, the acceptance probability is given by

A(x′; x) = min(p(x′) q(x)/(p(x) q(x′)), 1) = min((p(x′)/q(x′)) / (p(x)/q(x)), 1).

This approach may remind us of the rejection method, but now the samples are
dependent. One may argue that the method could be useful if the proposal q is
similar to p. One may then achieve good acceptance rates.

However, one should be careful using this method. The problems are illus-
trated in the following simple example.

8.2.4 Example
A class of target densities is defined by exponential densities p(x) = λe−λx ,
x ≥ 0. We will use q(x) = e−x , x ≥ 0 as the proposal.
The density ratio in the acceptance probability equals p(x)/q(x) = λ e^{−(λ−1)x}.
Obviously, for λ < 1 this ratio becomes unbounded!
For targets with λ < 1, 'tail events' (large x) are rarely proposed. But once
such a sample ends up in the tails, the MH sampler stays there for a long time!
This behaviour is illustrated in the following three figures, where histograms
of 10000 MCMC steps are shown and compared to the exact density.
The first case is obtained with λ = 2:

For λ = 0.5, we see a deviation from the exact density:

For λ = 0.1 one can see points in the tail, where the sampler was stuck for
a long time.

Hence, if an independence sampler is used, the proposal density should have
fatter tails than the target!

8.2.5 Gibbs sampling


This is a classical MH technique which has been used in Bayesian modelling
for a long time. It has now been somewhat bypassed by Hamiltonian (hybrid)
Monte Carlo. The method reduces sampling of vectors to sequential sampling of
their components.
• Assume that the random variables are vectors X = (X_1, ..., X_d). We use the
notation p(x_i|x_{−i}), where x_{−i} = (x_1, ..., x_{i−1}, x_{i+1}, ..., x_d).
• At step τ + 1 (note the different notation for the discrete time) one cycles
sequentially through the components of x and samples from the 1-dimensional
conditional distributions

x_1^{τ+1} ∼ p(x_1 | x_2^τ, x_3^τ, ..., x_d^τ)
x_2^{τ+1} ∼ p(x_2 | x_1^{τ+1}, x_3^τ, ..., x_d^τ)
...
x_j^{τ+1} ∼ p(x_j | x_1^{τ+1}, ..., x_{j−1}^{τ+1}, x_{j+1}^τ, ..., x_d^τ)
...
x_d^{τ+1} ∼ p(x_d | x_1^{τ+1}, ..., x_{d−1}^{τ+1})

• As an alternative, one can use a random sequential update.

Gibbs sampling is illustrated on a 2–dim Gaussian:

8.2.6 Gibbs as Metropolis Hastings
We can understand Gibbs sampling as a special case of MH. The Gibbs proposal
at component i is given by

q_i(x′|x) = p(x′_i|x_{−i}) δ(x′_{−i} − x_{−i}),

where the Dirac distribution takes care of the fact that the components x_{−i} are
not changed.
The MH acceptance probability equals

A(x′; x) = p(x′) q(x|x′)/(p(x) q(x′|x)) = p(x′) p(x_i|x′_{−i}) δ(x_{−i} − x′_{−i}) / (p(x) p(x′_i|x_{−i}) δ(x′_{−i} − x_{−i}))
 = p(x′_i|x_{−i}) p(x_{−i}) p(x_i|x_{−i}) / (p(x_i|x_{−i}) p(x_{−i}) p(x′_i|x_{−i})) = 1.

Hence, the proposal is always accepted!

8.2.7 Example: Hierarchical Bayesian for change points


We discuss a generative model for the number X_i of certain events in the years
i ∈ {1, 2, ..., n}. We want to model the fact that at some unknown year K, the
distribution of the X_i suddenly changes. The model is defined as follows:
• Events occur in each year i ∈ {1, 2, ..., n}. The number of events in year
i is Poisson distributed, i.e. p(x_i|λ) = e^{−λ} λ^{x_i}/x_i!. The rate of
events changes suddenly from λ_1 to λ_2 at the unknown change point
year K ∈ {1, 2, ..., n}.
To estimate K, we define the hierarchical Bayesian model:
– Given the rates, the data are independent, x_i ∼ e^{−λ} λ^x/x!.
– Given K, the rates λ_{1,2} are independent with
λ_{1,2} ∼ Gamma(a_{1,2}, η_{1,2}) densities. The a_{1,2} are known.
– The η_{1,2} are independent hyperparameters, η_{1,2} ∼ Gamma(b_{1,2}, c_{1,2}), with
known b_{1,2} and c_{1,2}.

– K has a discrete prior distribution P (K).

• Problem: Given a set of observations D = (x1 , . . . , xn ) over n years, draw


samples from the posterior distribution p(K, η1,2 , λ1,2 |D). We will use
a Gibbs sampler.

A comment on the Gamma density


• The Gamma density is given by

p(x|α, β) = (β^α/Γ(α)) x^{α−1} e^{−βx}

with E[X] = α/β and Var[X] = α/β^2.

• Note that (E[X])^2/Var[X] = α. Hence, knowing the parameter α means knowing
the relative uncertainty about X.

Joint and conditional distributions


To find the conditional distributions for each of the five unobserved variables, we
first write down the joint distribution of all variables (including the x_i). Using
the causal structure of the model, this is given by

p(D, λ_{1,2}, η_{1,2}, K) = p(x|λ_{1,2}, K) p(λ_{1,2}|η_{1,2}) p(η_{1,2}) p(K)
 = ∏_{i=1}^K e^{−λ_1} (λ_1^{x_i}/x_i!) ∏_{i=K+1}^n e^{−λ_2} (λ_2^{x_i}/x_i!)
   × (η_1^{a_1}/Γ(a_1)) λ_1^{a_1−1} e^{−η_1 λ_1} (η_2^{a_2}/Γ(a_2)) λ_2^{a_2−1} e^{−η_2 λ_2}
   × (c_1^{b_1}/Γ(b_1)) η_1^{b_1−1} e^{−c_1 η_1} (c_2^{b_2}/Γ(b_2)) η_2^{b_2−1} e^{−c_2 η_2} × P(K).

We will show that the conditional distributions for the Gibbs sampler are

λ_2 | λ_1, η_{1,2}, K, D ∼ Gamma(a_2 + Σ_{i=K+1}^n x_i, n − K + η_2)
η_1 | λ_{1,2}, η_2, K, D ∼ Gamma(a_1 + b_1, λ_1 + c_1)
P(K | λ_{1,2}, η_{1,2}, D) ∝ P(K) e^{−K(λ_1−λ_2)} (λ_1/λ_2)^{Σ_{i=1}^K x_i},

with analogous results for λ_1 and η_2.

Details
The main idea is to collect the terms in the joint distribution which depend on the
variable to be updated, and normalise later: p(A|B) = P(A, B)/P(B) ∝ P(A, B).

• Starting from the joint distribution

∏_{i=1}^K e^{−λ_1} (λ_1^{x_i}/x_i!) ∏_{i=K+1}^n e^{−λ_2} (λ_2^{x_i}/x_i!) (η_1^{a_1}/Γ(a_1)) λ_1^{a_1−1} e^{−η_1 λ_1} (η_2^{a_2}/Γ(a_2)) λ_2^{a_2−1} e^{−η_2 λ_2} (c_1^{b_1}/Γ(b_1)) η_1^{b_1−1} e^{−c_1 η_1} (c_2^{b_2}/Γ(b_2)) η_2^{b_2−1} e^{−c_2 η_2} P(K),

• we get the conditional of λ_2:

p(λ_2|λ_1, η_{1,2}, K, D) ∝ ∏_{i=K+1}^n e^{−λ_2} (λ_2^{x_i}/x_i!) · λ_2^{a_2−1} e^{−η_2 λ_2} ∝ λ_2^{a_2−1+Σ_{i=K+1}^n x_i} e^{−λ_2 (n−K+η_2)}.

• Similarly, the conditional of η_1:

p(η_1|λ_{1,2}, η_2, K, D) ∝ (η_1^{a_1}/Γ(a_1)) e^{−η_1 λ_1} × η_1^{b_1−1} e^{−c_1 η_1} ∝ η_1^{b_1+a_1−1} e^{−η_1 (λ_1+c_1)}.

• And finally, the conditional of K is

P(K|λ_{1,2}, η_{1,2}, D) ∝ ∏_{i=1}^K e^{−λ_1} (λ_1^{x_i}/x_i!) ∏_{i=K+1}^n e^{−λ_2} (λ_2^{x_i}/x_i!) P(K)
 ∝ e^{−λ_1 K} e^{−λ_2 (n−K)} λ_1^{Σ_{i=1}^K x_i} λ_2^{Σ_{i=K+1}^n x_i} P(K)
 ∝ e^{−(λ_1−λ_2)K} (λ_1/λ_2)^{Σ_{i=1}^K x_i} P(K).
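A minimal Gibbs sampler implementing these conditionals (the hyperparameter values and the uniform prior on K are our own illustrative assumptions):

```python
import numpy as np

def gibbs_changepoint(x, a=(2.0, 2.0), b=(1.0, 1.0), c=(1.0, 1.0),
                      n_iter=5000, rng=None):
    """Gibbs sampler for the change-point model above, assuming a uniform
    prior P(K); all hyperparameter values are illustrative choices."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x)
    n = len(x)
    K, lam, eta = n // 2, np.ones(2), np.ones(2)
    Ks = np.empty(n_iter, dtype=int)
    for it in range(n_iter):
        # Gamma conditionals for the rates and hyperparameters
        # (the lambda_1 and eta_2 updates mirror lambda_2 and eta_1 above)
        lam[0] = rng.gamma(a[0] + x[:K].sum(), 1.0 / (K + eta[0]))
        lam[1] = rng.gamma(a[1] + x[K:].sum(), 1.0 / (n - K + eta[1]))
        eta[0] = rng.gamma(a[0] + b[0], 1.0 / (lam[0] + c[0]))
        eta[1] = rng.gamma(a[1] + b[1], 1.0 / (lam[1] + c[1]))
        # discrete conditional for K: log P(K | ...) up to a constant
        ks = np.arange(1, n)
        logp = -(lam[0] - lam[1]) * ks + np.log(lam[0] / lam[1]) * np.cumsum(x)[:-1]
        p = np.exp(logp - logp.max())
        K = int(rng.choice(ks, p=p / p.sum()))
        Ks[it] = K
    return Ks
```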

Simulations
In the following plots we show histograms of the marginal posteriors obtained
by Gibbs sampling. The first panel shows the observed data. The vertical
black lines in the posterior distributions are the exact values from which the
data were generated.

[Figure: observed yearly event counts ("number of disasters" vs. year), and histograms of the marginal posteriors P(K), P(η1), P(η2), P(λ1), P(λ2).]

The second series of plots is generated for a case where the exact rates λ_1, λ_2
are more similar to each other and inference becomes harder.

[Figure: second data set and posterior histograms P(K), P(η1), P(η2), P(λ1), P(λ2) for the case of more similar rates.]

8.3 Gaussian process models


We will reconsider Bayesian models for supervised learning (regression, classification),
where we need to infer a function that links inputs to outputs. Rather
than making a parametric ansatz for such functions (with corresponding priors
for their parameters), we consider a non-parametric approach where the prior is
directly defined over a space of functions.
We start with the well-known

8.3.1 Statistical model for curve fitting


This is defined by the generative model

y_i = f_θ(x_i) + ν_i,

where the ν_i are independent Gaussian noise variables of variance σ^2 and θ is a set
of unknown parameters characterising the function f_θ(·).
We can introduce a likelihood

p(Data|θ) = ∏_{i=1}^n p(y_i|θ)

and a prior distribution p(θ) to treat this model in a Bayesian way.

8.3.2 Reminder: Generalised linear models


This is a case that we have already considered before:

f_θ(x) = Σ_{l=1}^K θ_l φ_l(x).

The model is linear in the parameters θ_l. Examples would be power series

f_θ(x) = Σ_{l=1}^K θ_l x^l

or Fourier series

f_θ(x) = Σ_{l=1}^K {θ_l sin(2πlx) + θ′_l cos(2πlx)}.

Using a Gaussian prior p(θ) and Gaussian noise, the posterior p(θ|Data) is also
Gaussian.
For this class of models we have to specify (or estimate) K, the number
of basis functions. It would be interesting to take the limit K → ∞, allowing
the Bayesian model to assume unbounded complexity for modelling functions.
But in this case we would have infinitely many parameters θ_l, which
may not be easy to handle.
The Gaussian process approach to such nonparametric models is to assume a
prior over functions, which we write as f(·) ∼ GP(0, K). The figure illustrates
what we are looking for. The left panel shows random functions generated

from the prior. The blue shading gives the prior uncertainty on the marginal
variance of the functions for each input x. The second panel illustrates the posterior
over functions after observing four data points. The random functions generated
from the posterior distribution are close to the data but show large variability for
input points which are further away from the observations. The shading measures
the marginal posterior variance at each input x.

8.4 Gaussian Process (GP) priors over functions


A Gaussian (prior) distribution over functions is formalised as follows:

• A Gaussian process is a family of random variables f(x), x ∈ T, where each
finite collection {f(x_1), f(x_2), ..., f(x_n)} has a joint Gaussian distribution.
• The process is characterised by its mean and covariance. Often the mean
function is set to m(x) = E[f(x)] = 0 for all x. Then the process is
characterised by the covariance kernel

K(x, x′) = E[f(x) f(x′)].

• Kernel functions encode prior beliefs (or knowledge) about smoothness or


’wiggliness’ of functions f (x).

8.4.1 Stationary kernels


Stationary kernels K(x − x′) can be constructed from their Fourier transform
K(x) = ∫_{−∞}^∞ e^{iωx} K̂(ω) dω with non-negative K̂(ω) ≥ 0 (this is part of Bochner's
theorem).
To see that this gives a valid construction of a positive definite kernel, we
need to show that for all sets of inputs (x_1, ..., x_m), the resulting kernel matrix
is positive semi-definite (to define a joint Gaussian density). Hence we need to
show that for any (a_1, ..., a_m) and (x_1, ..., x_m)

Σ_{kl} a_k K(x_k − x_l) a_l ≥ 0.

Proof: We use the Fourier integral K(x) = ∫_{−∞}^∞ e^{iωx} K̂(ω) dω. Thus

Σ_{kl} a_k K(x_k − x_l) a_l = ∫ K̂(ω) Σ_{kl} a_k a_l e^{iω(x_k−x_l)} dω
 = ∫ K̂(ω) (Σ_k a_k e^{iωx_k})(Σ_l a_l e^{−iωx_l}) dω = ∫ K̂(ω) |Σ_k a_k e^{iωx_k}|^2 dω ≥ 0.

8.4.2 More on kernels


• One can derive one of the most popular kernels in machine learning, the
radial basis function (RBF) kernel, from this construction. For this, we
have the Fourier transform K̂(ω) ∝ e^{−ω^2/(2λ)}.
• A different case is the Fourier transform K̂(ω) ∝ 1/(ω^2 + λ^2) (Ornstein–Uhlenbeck),
for which we obtain an exponential covariance.
beck) for which we obtain an exponential covariance.

• Matérn kernels allow for an interpolation between RBF and OU. Here we
can control the smoothness of the random functions.
• Polynomial kernels: K(x, x0 ) = (1 + x · x0 )k . These have sample paths
which are themselves polynomials in x. In such a way, we can recover a
parametric model.

• To obtain new kernels, we can combine existing kernels: Sums and prod-
ucts of kernels are also kernels.

8.4.3 Samples from the GP prior


We illustrate the different behaviour of random functions for RBF and OU kernels.
Samples from a GP with K(x, x′) = e^{−|x−x′|} (Ornstein–Uhlenbeck process)
are shown first. One finds continuous but nowhere differentiable functions.
The next plots are samples from GPs with RBF kernels having two different
length–scales: K(x, x′) = e^{−3(x−x′)^2} and K(x, x′) = e^{−10(x−x′)^2}.

Functions are infinitely often differentiable. For the smaller length–scale, we
get more wiggles. A sampling sketch is given below.
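A minimal way to draw such samples from a GP prior (grid, kernel parameter and jitter are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)

def rbf(a, b, gamma):
    """RBF kernel K(x, x') = exp(-gamma (x - x')^2)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

K = rbf(x, x, gamma=3.0) + 1e-8 * np.eye(x.size)   # jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ rng.normal(size=(x.size, 5))         # five draws f ~ N(0, K), one per column
```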

8.4.4 GPs in higher dimensions


Kernels can be constructed for d dimensional inputs
x = (x(1), x(2), . . . , x(d)) where x(i) is the i-th coordinate of x. A popular
choice is the radial basis function (RBF) kernel which is obtained by multiplying
one–dimensional RBF kernels
K(x, x′) = ∏_{k=1}^d e^{−(x(k)−x′(k))^2/(2 l_k^2)},

allowing for different hyperparameters (lengthscales) lk .

Chapter 9

Week 8

In this section, we will continue with GPs and see how we can compute posterior
predictions. We will show that a closed form solution is possible for regression
with Gaussian noise. Hyperparameters can be learnt using the evidence.

9.1 How to make predictions with GPs


Let us assume that we have observations y = (y1 , . . . , yn ), and latent function
values z = (f (x1 ), . . . , f (xn )) and v = f (x), where the xi are training inputs
(for which we have noisy observations), and x is a new test input where we want
to make a prediction (see the figure).

We are interested in the posterior density
p(v|y) = ∫ p(v|z) p(z|y) dz.

In general, the posterior of the unobserved function values is obtained by
multiplying likelihood terms with the prior. The latter is derived from the GP prior
as a joint Gaussian distribution:

p(z|y) ∝ ∏_{i=1}^n p(y_i|z_i) exp(−(1/2) z^T K^{−1} z),

with the kernel matrix K_{ij} = K(x_i, x_j).

9.1.1 Start with p(v|z)


The joint density of z and v is a Gaussian N(0, K^+), where K^+_{ij} = K(x_i, x_j)
for i, j = 0, 1, ..., n and x_0 ≡ x is the test input. We can write this matrix as a
block matrix

K^+ = [ K(x, x)  k_x^T ; k_x  K ],

where k_x = (K(x, x_1), ..., K(x, x_n))^T. We need the conditional density
(which is 1-dimensional),

p(v|z) = N(v|m, s).

9.1.2 Reminder: Conditioning and inverses of a block ma-


trix
• We have the joint density, which is of the form

p(v, z) ∝ exp(−(1/2) (v, z)^T Ω (v, z))

with the information matrix Ω = (K^+)^{−1} = [ Ω_vv  Ω_vz ; Ω_zv  Ω_zz ].

• The conditional density is

p(v|z) ∝ exp(−(1/2) Ω_vv v^2 − v Ω_vz z).

• This yields

E[v|z] = −(Ω_vv)^{−1} Ω_vz z
VAR[v|z] = (Ω_vv)^{−1}.

Finally, to get an explicit result for Ω, we have to compute the inverse of
a block matrix.

• The general rule is

[ A  B ; C  D ]^{−1} = [ M  −M B D^{−1} ; −D^{−1} C M  D^{−1} + D^{−1} C M B D^{−1} ]

with M = (A − B D^{−1} C)^{−1}.

• We apply this result to A = K(x, x), B = k_x^T, C = k_x and D = K.

• Hence 1/M = K(x, x) − k_x^T K^{−1} k_x and Ω_vz = −M k_x^T K^{−1}.

• Thus, finally,

E[v|z] = k_x^T K^{−1} z
VAR[v|z] = K(x, x) − k_x^T K^{−1} k_x.

9.2 Regression with Gaussian noise


For the Gaussian noise model, y_i = f(x_i) + ν_i with ν_i ∼ N(0, σ^2), we get

p(z|y) ∝ ∏_{i=1}^n p(y_i|z_i) exp(−(1/2) z^T K^{−1} z)
 ∝ e^{−(1/(2σ^2)) Σ_{i=1}^n (y_i−z_i)^2} exp(−(1/2) z^T K^{−1} z)
 ∝ exp(−(1/2) z^T (K^{−1} + σ^{−2} I) z + σ^{−2} z^T y).

Hence, the posterior over the latent function values at the observed data is Gaussian,

p(z|y) = N(z|µ, S),

with S = (K^{−1} + σ^{−2} I)^{−1} and µ = σ^{−2} S y. We can use these results to get
explicit analytical predictions:

• The posterior mean prediction is obtained by using our results for p(v|z):

E[v|y] = ∫ E[v|z] p(z|y) dz = k_x^T K^{−1} ∫ z p(z|y) dz = k_x^T K^{−1} E[z|y]
 = σ^{−2} k_x^T K^{−1} S y = σ^{−2} k_x^T K^{−1} (K^{−1} + σ^{−2} I)^{−1} y = k_x^T (K + σ^2 I)^{−1} y.

• This prediction is linear in the data y. It can also be written in the form

f̂(x) = Σ_{i=1}^n α_i K(x, x_i),  with α = (K + σ^2 I)^{−1} y,

which is similar to predictions with other non-Bayesian kernel machines
(e.g. SVM).

• A further calculation shows that the uncertainty at a test input x is obtained
as VAR[v|y] = K(x, x) − k_x^T (K + σ^2 I)^{−1} k_x, which is independent
of y. A minimal implementation is sketched below.
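A sketch of these two prediction formulas (kernel choice and interface are our own assumptions):

```python
import numpy as np

def rbf(a, b, gamma=3.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def gp_predict(x_train, y_train, x_test, sigma2, kernel=rbf):
    """Posterior mean k_x^T (K + sigma^2 I)^{-1} y and variance
    K(x, x) - k_x^T (K + sigma^2 I)^{-1} k_x at each test input."""
    C = kernel(x_train, x_train) + sigma2 * np.eye(x_train.size)
    Ks = kernel(x_test, x_train)                    # rows are k_x^T for each test input
    mean = Ks @ np.linalg.solve(C, y_train)
    var = kernel(x_test, x_test).diagonal() \
          - np.einsum('ij,ji->i', Ks, np.linalg.solve(C, Ks.T))
    return mean, var
```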

Mean and uncertainty (shaded region) for different numbers of data (blue points)
are shown in the sequence of plots for a toy problem, where the exact function
is shown in blue and the prediction in red.

1 observation

2 observations

3 observations

10 observations

15 observations

30 observations

9.3 Model selection using the evidence


Sensible values for the kernel hyperparameters and the noise σ^2 can be obtained by
numerically maximising the evidence (Maximum Likelihood II).
For GP regression we have an explicit analytical expression:

p(y) = ∫ dz p(z) p(y|z)
 = (1/((2π)^{n/2} |det(K + σ^2 I)|^{1/2})) exp(−(1/2) y^T (K + σ^2 I)^{−1} y).
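A minimal sketch of this log-evidence for an RBF kernel (the kernel form and parameter names are illustrative); scanning it over a grid of length scales and picking the maximiser reproduces the procedure illustrated in the plots below:

```python
import numpy as np

def gp_log_evidence(x, y, lengthscale, sigma2):
    """ln p(y) = -(n/2) ln 2pi - (1/2) ln|K + sigma^2 I|
    - (1/2) y^T (K + sigma^2 I)^{-1} y, for an RBF kernel."""
    n = len(y)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale ** 2)
    C = K + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))
```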

9.3.1 Derivation of the evidence


There are at least two ways of deriving this result for the evidence. Method I:
brute-force calculation of Gaussian integrals. Method II: think probabilistically
and consider the generation of data y as a two-stage process: we generate latent
function values from the GP and add independent Gaussian noise. Thus

y = z + ξ,

where the two variable sets

ξ ∼ N(0, σ^2 I)
z ∼ N(0, K)

are jointly Gaussian and independent. Hence, the density of observations is
Gaussian,

y ∼ N(0, Σ).

Thus E[y] = 0, and by independence we can add the two covariances to get

Σ = COV[y] = E[yy^T] = E[zz^T] + σ^2 I = K + σ^2 I.

The following plots illustrate the maximisation of the evidence to get good
values of hyper parameters. We show true function (red) and prediction (black)
together with the data (blue dots) and the uncertainty (dashed). The first plot
is obtained with non–optimal parameters. Predictions are too wiggly.

The second plot shows the log-evidence as a function of the length scale L of
the RBF kernel.

The final plot displays predictions with optimised parameters.

9.3.2 GP Application: Inference for linear ordinary dif-
ferential equations
We can apply GP inference to problems where the data are assumed to be generated
from linear operations on latent functions. This is possible because linear
operations on GPs lead to GPs!
Take e.g. a dynamical data model given by the ODE

dx(t)/dt = −λ x(t) + f(t)
y_i = x(t_i) + ν_i,  i = 1, ..., n,

where f (t) is an unknown function to be estimated. We can treat this problem


by using a GP prior for f (·). When f (·) ∼ GP, then both f (·) and the derivative
function f 0 (·) are jointly GPs.
To obtain posterior predictions, we need the kernel of the joint prior distribution
of these functions. This new kernel can be derived from the kernel
K for f. E.g., we need the covariance

E[f(x_1) f′(x_2)] = lim_{ε→0} (1/ε) E[f(x_1){f(x_2 + ε) − f(x_2)}]
 = lim_{ε→0} (1/ε) {K(x_1, x_2 + ε) − K(x_1, x_2)} = ∂K(x_1, x)/∂x |_{x=x_2}.

Similar results are obtained for E[f′(x_1) f′(x_2)], etc.
The following plots illustrate inference for an ODE using different numbers
of observations. We show inference for x(t) and the unobserved driving force
f (t).

With 10 observations ...

With 10 observations ...

With 50 observations ...

With 50 observations ...

9.3.3 GP Application: Emulators


This addresses the problem of emulating complex simulation software packages.
Such packages can be viewed as evaluating functions y = f (x) of input variables
using very lengthy computations.
To speed up predictions, one uses a Gaussian process f(x) ∼ GP(m(x), K) as
prior knowledge (where the mean function m(x) could be a simpler parametric
function). One then uses a smaller set of data (x_i, y_i) generated from the
simulator to approximate f(x) with GP regression. This GP approximation
(emulator) to f(x) can be evaluated much faster than by running the simulator.
This could be applied e.g. to:
• Sensitivity analysis: How do the outputs change under small changes
of the input?
• Uncertainty analysis: What is the uncertainty of the outputs based on
uncertainty in the inputs, modelled by a distribution p(x)?
Earlier work can be found in:
http://www.tonyohagan.co.uk/academic/GEM/index.html and the MUCM
(MANAGING UNCERTAINTY IN COMPLEX MODELS) page http://www.mucm.ac.uk

9.4 GP application: Gaussian Processes for Global


Optimisation
A similar idea can be applied to global optimisation when function evaluations
are costly. One can use previous function evaluations to approximate the unknown
function f(x) by a GP g(x).
To find a new candidate point x_{n+1} which is (hopefully) closer to the minimiser,
one minimises the posterior expectation of a surrogate loss function, e.g.

loss = min{g(x), f(x_n)},

with respect to x. This takes both the mean and the uncertainty of g(x) into
account. Note that the minimisation does not need any new evaluation of the
true function f.

9.4.1 GP application: Modeling and interpolation of the


ambient magnetic field
Pictures are taken from A. Solin et al arXiv:1509.04634v1
The main idea behind this project is the observation that the earth's magnetic
field is distorted by buildings. Hence, if we create maps of the magnetic
fields in buildings, these can be used (together with the information from other
sensors) for localisation of agents inside the buildings. Using GP priors for the
fields, we can obtain field predictions by regression from sparse and noisy field
measurements.
Magnetic fields are vector fields H(x) which fulfil ∇ × H(x) = 0 outside
of electric currents. Thus, one obtains a simpler and more robust representation
via a scalar potential φ(x) with H(x) = −∇φ(x), introducing a GP prior for φ(x).
The observation model is

φ(x) ∼ GP(0, K)
y_i = −∇φ(x_i) + ε_i.

9.5 Inference for Gaussian processes: Non–Gaussian
observation models
The posterior density of the unknown function f(x) at an input x is

p(f(x)|y) = ∫ p(f(x)|z_1, ..., z_n) p(z_1, ..., z_n|y) dz_1 ... dz_n.

For a Gaussian noise model p(y_i|z_i) the integrals can be performed analytically.
But a GP appears as a latent function in more complicated models, e.g.

• y_i = f(x_i) + non-Gaussian noise
• binary classification, y_i ∈ {0, 1}, with p(y = 1|f(x)) = sigmoid[f(x)]
• ...

Hence, approximations are necessary: Laplace, Monte Carlo, variational, ...

9.6 Another computational tool: Variational ap-


proximation
Here is a short summary of the general setting of the problem:
• We assume observations y ≡ (y_1, ..., y_K) ('data').
• These are explained by latent, unobserved variables z ≡ (z_1, ..., z_N) (e.g.
parameters in a Bayesian model).
• We have a likelihood p(y|z) (forward model) and
• a prior distribution p(z).
Our goal is to solve the inverse problem. Using Bayes' rule we can make
predictions about all hidden variables:

p(z_1, ..., z_N|data) = p(data|z_1, ..., z_N) p(z_1, ..., z_N) / p(data).

But what we often really need are marginal distributions, e.g.

p(z_i|data) = ∫ [p(data|z_1, ..., z_N) p(z_1, ..., z_N)/p(data)] dz_1 ... dz_{i−1} dz_{i+1} ... dz_N,

and

p(data) = ∫ dz_1 ... dz_N p(data|z_1, ..., z_N) p(z_1, ..., z_N).

9.6.1 Simplest type of dependencies


The simplest nontrivial type of models has
• Pairwise interactions:

p(z_1, ..., z_N|data) = ∏_i Ψ_i(z_i) ∏_{i<j} Ψ_{ij}(z_i, z_j)

• or, even simpler,

p(z_1, ..., z_N|data) = ∏_i Ψ_i(z_i) exp(Σ_{i<j} A_{ij} z_i z_j).

• We can see that GP models with (conditionally) i.i.d. observations belong
to this class:

∏_{i=1}^n p(y_i|z_i) exp(−(1/2) z^T K^{−1} z) = ∏_{i=1}^n p(y_i|z_i) exp(−(1/2) Σ_{i,j} z_i (K^{−1})_{ij} z_j).

• Our goal is to obtain approximations to these posteriors for which marginal-


isations become efficiently tractable.

9.6.2 Reminder: The KL divergence


For two distributions q(z) and p(z), the Kullback–Leibler divergence is
defined as

D(q‖p) = E_q[ln(q(z)/p(z))] = ∫ q(z) ln(q(z)/p(z)) dz ≥ 0.

We have equality D(q‖p) = 0 if and only if p = q almost everywhere.

9.7 Inference by optimisation


These properties of the KL divergence allow us to obtain the posterior from the
following variational problem:

p(z|y) = arg min_q E_q[ln(q(z)/p(z, y))].

The minimum is

min_q E_q[ln(q(z)/p(z, y))] = − ln p(y).

Proof:

E_q[ln(q(z)/p(z, y))] = E_q[ln(q(z)/(p(z|y) p(y)))]
 = ∫ q(z) ln(q(z)/p(z|y)) dz − ln p(y)
 = D(q‖p(·|y)) − ln p(y).

9.8 The variational approximation
We approximate the 'intractable' posterior p(z|y) by a 'close' distribution q*(z)
(for fixed y), where q* is from a 'nice' family F of distributions (which allow us
e.g. to perform efficient marginalisations). We measure 'closeness' by the
KL divergence and solve the variational problem

q*(z) = arg min_{q∈F} D(q‖p(·|y)) = arg min_{q∈F} E_q[ln(q(z)/p(z, y))].

Hence, there is no need to know p(y). But we get an approximation to p(y) for free:

− ln p(y) ≤ min_{q∈F} E_q[ln(q(z)/p(z, y))].

9.8.1 Lingo
Older papers call the quantity

E_q[ln(q(z)/p(z, y))]

the 'variational free energy'. It has to be minimised; its original definition
goes back to statistical physics. It upper bounds − ln p(y).
More recent papers call the variational objective

E_q[ln(p(z, y)/q(z))] = −E_q[ln(q(z)/p(z, y))]

the ELBO (evidence lower bound). It has to be maximised and lower bounds
ln p(y).

9.9 The Mean Field Method


We will explain this for 2 variables. Generalisation to more variables is straight-
forward.
• We approximate

p(z_1, z_2|y)

by distributions from the simpler family F of factorising distributions,

q(z_1, z_2) = q_1(z_1) q_2(z_2).

• We try to find the best q by solving

q^{opt}(z_1, z_2) = arg min_{q∈F} E_q[ln(q_1(z_1) q_2(z_2)/p(z_1, z_2, y))].

• The solution is

q_1^{opt}(z_1) ∝ exp{E_{q_2^{opt}}[ln p(z_1, z_2, y)]}
q_2^{opt}(z_2) ∝ exp{E_{q_1^{opt}}[ln p(z_1, z_2, y)]}.

Proof: Assume q_2 is given and fixed for the moment.

E_q[ln(q_1(z_1) q_2(z_2)/p(z_1, z_2, y))] = E_q[ln(q_1(z_1)/p(z_1, z_2, y))] + E_{q_2}[ln q_2]
 = ∫ q_1(z_1) [ln q_1(z_1) − ∫ q_2(z_2) ln p(z_1, z_2|y) dz_2] dz_1 + const
 = ∫ q_1(z_1) ln [ q_1(z_1) / exp(∫ q_2(z_2) ln p(z_1, z_2|y) dz_2) ] dz_1 + const.

This is minimised by

q_1(z_1) ∝ exp(∫ q_2(z_2) ln p(z_1, z_2|y) dz_2).

This result shows that one can find a local optimum by performing a ’coordinate–
wise’ optimisation where q2 is fixed and q1 is optimised. Then the same is
repeated with q1 and q2 exchanged.

9.9.1 The Mean Field Method: Many variables


To approximate p(z|y) by the best factorising distribution q(z) = ∏_{i=1}^N q_i(z_i),
we obtain the optimal variational solution as

q_i^{opt}(z_i) = (1/Z_i) exp(E_{\i}[ln p(z, y)]),

where E_{\i}[...] denotes the average over all components of z except z_i. This requires
only one-dimensional integrals. Again, this leads to a simple coordinate-wise
optimisation.

9.9.2 Work this out for ...


a model with pairwise interactions:

p(z_1, ..., z_N|y) = ∏_i Ψ_i(z_i) exp(Σ_{i<j} A_{ij} z_i z_j).

The optimal solution is

q_i(z) ∝ Ψ_i(z) exp(z Σ_{j≠i} A_{ij} m_j)

with

m_j = E_q[Z_j] = ∫ Ψ_j(z) exp(z Σ_{k≠j} A_{jk} m_k) z dz / ∫ Ψ_j(z) exp(z Σ_{k≠j} A_{jk} m_k) dz.

9.9.3 Simple GP classifier


Let us assume noise-free, binary class labels y_i = ±1 with likelihood

p(y_i|z_i) = I_{y_i z_i > 0}.

For this, we have

Ψ_i(z) = p(y_i|z) exp(−(1/2) (K^{−1})_{ii} z^2)
A_{ij} = −(K^{−1})_{ij}  (for i ≠ j).

9.9.4 Example: Hyperparameter estimation for generalised


linear model
For this model, we have the likelihood

p(y|w) = (β/2π)^{N/2} exp(−(β/2) Σ_{i=1}^N (y_i − Σ_{j=1}^K w_j Φ_j(x_i))^2)

with a fixed set {Φ_1(x), ..., Φ_K(x)} of K basis functions. The prior distribution
on the weights is given by

p(w|α) = (α/2π)^{K/2} exp(−(α/2) Σ_{j=1}^K w_j^2).

Finally, we assume a (Gamma) hyper-prior for α:

p(α) ∝ α^{a_0−1} e^{−b_0 α}.

The joint distribution of all variables is given by

p(y, w, α) = p(y|w)p(w|α)p(α)

We aim at a factorising approximation to the posterior,

p(w, α|y) ≈ q(α) q(w).

The optimal approximating density of w in the factorising approximation is

q(w) ∝ exp{E_α[ln p(y, w, α)]} ∝ p(y|w) exp{E_α[ln p(w|α)]} ∝ p(y|w) exp(−(E_α[α]/2) Σ_{j=1}^K w_j^2).

The optimal approximating density for α is again a Gamma density:

q(α) ∝ exp{E_w[ln p(y, w, α)]} ∝ p(α) exp{E_w[ln p(w|α)]} ∝ p(α) α^{K/2} exp(−(α/2) Σ_{j=1}^K E_w[w_j^2]).

A sketch of the resulting coordinate ascent is given below.
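A minimal coordinate-ascent sketch of these two updates (interface, iteration count and the α^{K/2} bookkeeping folded into the Gamma parameters are our own working assumptions):

```python
import numpy as np

def cavi_glm(Phi, y, beta, a0, b0, n_iter=50):
    """Coordinate ascent between q(w) (Gaussian) and q(alpha) (Gamma) for the
    model above; Phi is the N x K design matrix Phi[i, j] = Phi_j(x_i)."""
    N, K = Phi.shape
    E_alpha = a0 / b0                                   # prior mean of alpha
    for _ in range(n_iter):
        # q(w) = N(m, S) with precision E[alpha] I + beta Phi^T Phi
        S = np.linalg.inv(E_alpha * np.eye(K) + beta * Phi.T @ Phi)
        m = beta * S @ Phi.T @ y
        # q(alpha) = Gamma(a0 + K/2, b0 + E_w[sum_j w_j^2]/2)
        E_w2 = m @ m + np.trace(S)
        E_alpha = (a0 + K / 2) / (b0 + E_w2 / 2)
    return m, S, E_alpha
```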
Chapter 10

Week 9

We will discuss Gaussian approximating densities in the variational approximation
and sparse Gaussian approximations for GPs. Finally, we introduce a variational
approach to binary classification with GPs which involves the introduction
of extra latent variables.

10.1 The Gaussian variational approximation


The mean field approach neglects dependencies between random variables. A
possible way to include such dependencies is to use multivariate Gaussian densities
as approximations to the posterior. Hence, we define

q(z) = (2π)^{−N/2} |Σ|^{−1/2} exp[−(1/2)(z − µ)^T Σ^{−1} (z − µ)]

as the variational distribution. The variational free energy to be minimised is

F[q] = E_q[ln(q(z)/p(y, z))] = ∫ q(z) ln q(z) dz − E_q[log p(y, z)]
 = −(N/2) log 2π − (1/2) log|Σ| − N/2 − E_q[log p(y, z)].

Taking derivatives w.r.t. the variational parameters µ and Σ yields the implicit
equations

0 = E_q[∇_z log p(y, z)]
(Σ^{−1})_{ij} = −E_q[∂^2 log p(y, z)/∂z_i ∂z_j].
These equations show some resemblance to the Laplace approximation. In fact,
Laplace is recovered, if we neglect the expectations on the right hand side of
the equations. In this case, the mean of the approximate Gaussian would be
at the mode of log p(y, z) and the inverse covariance would be the curvature at

the mode. The Gaussian variational approximation is some kind of Laplace ’on
average’.
The following plot illustrates the difference between the two approximations
for a one–dimensional density. Approximating the green density by the Laplace
method yields the green Gaussian. The variational method gives the blue Gaus-
sian instead. While Laplace is entirely local, the variational Gaussian is able to
incorporate more of the probability mass of the true density.

There are also limitations to this Gaussian approximation. It cannot be
applied to cases where the KL divergence between the true distribution and a
Gaussian does not exist (or is infinite). This rules out all random variables which
are constrained to subsets of R^d, e.g. positive random variables.

10.1.1 Gaussian Processes with factorising likelihood


A second problem with the Gaussian approach is the fact that the number of
parameters, µ and Σ to be optimised can in general be quite large (O(d2 )) for
a d dimensional problem.
This problem can be simplified for specific structures of the problem.
Consider e.g. densities of the form

p(z, y) = (1/Z_0) exp(−Σ_{n=1}^d V_n(y_n, z_n) − (1/2) z^T K^{−1} z),

which play a role as posteriors for GP models. Our result on the equation for
the optimal covariance states that

Σ^{−1} = K^{−1} + diag(E_q[∂^2 V_n(y_n, z_n)/∂z_n^2]).

This means that only the d diagonal elements of the inverse covariance are
unknown quantities.
The following plot shows an application of the Gaussian variational approximation
to GP inference of an unknown function from observations corrupted by
Cauchy noise, which produces larger outliers than Gaussian noise. The first plot
shows GP inference using a Gaussian likelihood, which ignores the knowledge
that the noise is Cauchy. The resulting mean prediction interprets the strong
fluctuations of the observations as coming from the true function; the inferred
curve is more wiggly than the truth.


The second plot is obtained by applying the Gaussian variational approxi-


mation to GP inference with the Cauchy likelihood. The inference is clearly
improved!


10.1.2 Variational Sparsity for GPs


GP inference requires the inversion of large matrices. This is bad for 'big data'
applications! A trivial 'sparse' solution would be to throw away data points:
the 'smaller' likelihood would lead to small matrices to be inverted, but one
would lose all the information from the other data. The following gives a
method to obtain an effective likelihood which depends only on a small set of
latent variables but keeps information from all data points.
We assume a split of the latent variables z = (s, B) into a small (sparse) set s and
the rest, the BIG set B. The exact joint probability is

p(s, B, y) = p(s, B) p(y|s, B),

where p(s, B) is the joint prior. We would like to approximate the posterior p(s, B|y)
using a sparse likelihood by setting

q(s, B) = p(s, B) L̂(s).

We can find the best likelihood L̂(s) (in the variational sense) by minimising

E_q[ln( p(s, B) L̂(s) / (p(s, B) p(y|s, B)) )].

The optimal likelihood is given by
Z 
L̂(s) ∝ exp ln[p(y|B, s)]p(B|s)dB

Proof:

E_q\Big[ \ln \frac{p(s, B)\hat{L}(s)}{p(s, B)p(y|s, B)} \Big]
= E_q\Big[ \ln \frac{\hat{L}(s)}{p(y|s, B)} \Big]
= E_q\big[ \ln \hat{L}(s) \big] - E_q\big[ \ln p(y|s, B) \big].

We work on the second term:

E_q[\ln p(y|s, B)] = \int q(s) \int q(B|s) \ln[p(y|s, B)]\, dB\, ds
= \int q(s) \ln \exp\Big( \int q(B|s) \ln[p(y|s, B)]\, dB \Big)\, ds.

The total result yields

\int q(s) \ln \frac{\hat{L}(s)}{\exp\big( \int q(B|s) \ln[p(y|s, B)]\, dB \big)}\, ds.

But what is q(B|s)? Using the definition of q, we get

q(s, B) = p(s, B)\hat{L}(s).

Hence, as a function of B for fixed s,

q(B|s) \propto p(s, B)

and thus finally

q(B|s) = p(B|s).

Obviously the minimisation of the integral over s above with respect to \hat{L}
yields the desired result.

10.1.3 Sparse approximation for Gaussian likelihoods

Let us consider likelihoods of the form

\log p(y|z) \propto a^T z - \frac{1}{2} z^T A z,

which would e.g. correspond to Gaussian noise models. We can show that \hat{L}(s) is
obtained from L by simply replacing z \to E[z|z_s], where in the GP context

z_s = \{f(x)\}_{x \in \text{inducing points}}.

This result can be understood from the fact that

E\big[ zz^T |z_s \big] = E[z|z_s]\, E[z^T |z_s] + \mathrm{COV}[z|z_s],

where for a Gaussian distribution the conditional covariance does not depend
on z_s. An explicit calculation shows that

E_0[z|z_s] = K_{Bs} K_{ss}^{-1} z_s.

By using the ELBO, one can optimise the location of ’inducing points’.
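As an illustration, here is a minimal numpy sketch of the key quantity
E_0[z|z_s] = K_{Bs} K_{ss}^{-1} z_s for an RBF kernel. The inputs, inducing locations,
lengthscale and jitter below are made-up choices for illustration, not part of
the lecture notes.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X_B = rng.uniform(-5, 5, size=(200, 1))   # inputs of the BIG set B
X_s = np.linspace(-5, 5, 10)[:, None]     # inducing inputs (sparse set s)

K_ss = rbf_kernel(X_s, X_s) + 1e-8 * np.eye(10)   # jitter for stability
K_Bs = rbf_kernel(X_B, X_s)

z_s = rng.standard_normal(10)             # some value of the sparse latents
cond_mean = K_Bs @ np.linalg.solve(K_ss, z_s)     # E_0[z_B | z_s]

# For a Gaussian log-likelihood a^T z - 0.5 z^T A z, the sparse effective
# likelihood L_hat(z_s) follows by substituting cond_mean for z_B
# (plus a z_s-independent term from the conditional covariance).

Note that only the small matrix K_ss (here 10 x 10) has to be inverted.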

10.1.4 A different variational approach to GP classification: Pólya–Gamma variables

Let y = (y_1, . . . , y_N), with y_i = ±1 class labels, and z_n = f(x_n). The likelihood for a
classifier using sigmoid functions is

p(y|z) = \prod_{n=1}^{N} \sigma(y_n z_n).

Our starting point for a new variational approximation is the representation
of the sigmoid function using extra latent variables:

\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x/2}}{2\cosh(x/2)} = \frac{1}{2}\, e^{x/2} \int_0^\infty e^{-\frac{x^2}{2}\omega}\, p_{PG}(\omega)\, d\omega.

The augmented likelihood is

p(y, \{\omega_n\}_{n=1}^{N} | z) = \frac{1}{2^N} \prod_{n=1}^{N} p_{PG}(\omega_n)\, e^{\frac{y_n z_n}{2} - \frac{z_n^2}{2}\omega_n}.

The posterior over all latent variables (assuming a GP prior p(z) = GP(0, K) with
kernel K) is

p(z, \{\omega_n\}_{n=1}^{N} | y) \propto p(z) \prod_{n=1}^{N} p_{PG}(\omega_n)\, e^{\frac{y_n z_n}{2} - \frac{z_n^2}{2}\omega_n}.

We can treat this augmented model by variational inference using a structured
mean field approximation

p(z, \{\omega_n\}_{n=1}^{N} | y) \approx q(z, \{\omega_n\}_{n=1}^{N}) \doteq q_1(\{\omega_n\}_{n=1}^{N})\, q_2(z).

Using the joint density

p(z, \{\omega_n\}_{n=1}^{N}, y) \propto p(z) \prod_{n=1}^{N} p_{PG}(\omega_n)\, e^{\frac{y_n z_n}{2} - \frac{z_n^2}{2}\omega_n},

we can perform a free form minimisation of the KL divergence to obtain the
Gaussian

q_2(z) \propto \exp\Big( E_\omega\big[ \ln p(z, \{\omega_n\}_{n=1}^{N}, y) \big] \Big) \propto p(z) \prod_{n=1}^{N} e^{\frac{y_n z_n}{2} - \frac{z_n^2}{2} E_1[\omega_n]}

and the factorising shifted Pólya–Gamma density

q_1(\{\omega_n\}_{n=1}^{N}) \propto \exp\Big( E_z\big[ \ln p(z, \{\omega_n\}_{n=1}^{N}, y) \big] \Big) \propto \prod_{n=1}^{N} p_{PG}(\omega_n)\, e^{-\frac{E_2[z_n^2]}{2}\omega_n}.

The optimal distributions can be found by iteration between q_1 and q_2. The
only unknown quantity is the mean of the shifted Pólya–Gamma density. This
is easily obtained from

E_1[\omega_n] = \frac{\int_0^\infty p_{PG}(\omega)\, \omega\, e^{-\frac{E_2[z_n^2]}{2}\omega}\, d\omega}{\int_0^\infty p_{PG}(\omega)\, e^{-\frac{E_2[z_n^2]}{2}\omega}\, d\omega}
= -\frac{d}{dt}\Big|_{t = E_2[z_n^2]/2} \ln \int_0^\infty p_{PG}(\omega)\, e^{-\omega t}\, d\omega
= \frac{d}{dt}\Big|_{t = E_2[z_n^2]/2} \ln \cosh\big( \sqrt{t/2} \big).

The last line follows from the basic definition (set t = x^2/2)

\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x/2}}{2\cosh(x/2)} = \frac{1}{2}\, e^{x/2} \int_0^\infty e^{-\frac{x^2}{2}\omega}\, p_{PG}(\omega)\, d\omega,

which gives \int_0^\infty p_{PG}(\omega)\, e^{-\omega t}\, d\omega = 1/\cosh(\sqrt{t/2}).

This approximation can be used (together with mini–batch sampling and a
further sparse GP approximation) to deal with fairly large data sets.
Our method (X–GPC) shows faster convergence in training compared to
another variational Gaussian approximation which is not based on extra latent
variables. A sketch of the basic iteration is given below.
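The following is a minimal sketch of the resulting coordinate-ascent iteration
(not the full X–GPC algorithm: no sparsity, no mini-batches); the kernel, data
and initialisation are invented for illustration. It uses the closed form
E_1[\omega_n] = \tanh(c_n/2)/(2c_n) with c_n = \sqrt{E_2[z_n^2]}, which follows from the
derivative formula above.

import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
N = 50
X = np.sort(rng.uniform(-3, 3, (N, 1)), axis=0)
y = np.sign(np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(N))  # labels +-1

K = rbf_kernel(X) + 1e-6 * np.eye(N)
K_inv = np.linalg.inv(K)

theta = np.full(N, 0.25)          # E1[omega_n]; 1/4 is the value for c = 0
for _ in range(50):
    # q2(z): Gaussian with precision K^{-1} + diag(theta), linear term y/2.
    Sigma = np.linalg.inv(K_inv + np.diag(theta))
    mu = Sigma @ (y / 2)
    # q1: shifted PG densities; only their means enter the next q2 update:
    # E1[omega_n] = tanh(c_n/2) / (2 c_n) with c_n^2 = E2[z_n^2].
    c = np.sqrt(mu**2 + np.diag(Sigma))
    theta = np.tanh(c / 2) / (2 * c)

print("training accuracy:", np.mean(np.sign(mu) == y))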

Chapter 11

Week 10

11.1 Black box approaches

(see e.g. Ranganath et al, 2014). The goal is to enable variational inference
for many models without needing analytical calculations. The main idea is to
combine two things. (a) Perform the expectations in the variational method
by MC sampling from a parametric variational distribution q_φ, where φ is a set
of parameters. (b) Use gradient descent,

φ_{t+1} = φ_t − γ∇_φ F(φ_t),

for minimising the variational objective

F(φ) = E_q\Big[ \ln \frac{q_φ(z)}{p(z, y)} \Big],

where the gradient is computed using automatic differentiation. The question
is: how do we get the gradient w.r.t. φ from MC samples? In order to apply the
theory of stochastic gradient descent, we need an unbiased estimator of the
gradient! Here is one approach:
∇_φ E_q\Big[ \ln \frac{q_φ(z)}{p(z, y)} \Big]
= \int ∇_φ q_φ(z) \ln q_φ(z)\, dz − \int ∇_φ q_φ(z) \ln p(z, y)\, dz + \int ∇_φ q_φ(z)\, dz,

where the last term comes from differentiating \ln q_φ inside the first
expectation and vanishes, since \int ∇_φ q_φ(z)\, dz = ∇_φ \int q_φ(z)\, dz = 0.
Using ∇_φ q_φ = q_φ ∇_φ \ln q_φ, we get

= \int q_φ(z) (∇_φ \ln q_φ(z)) \ln q_φ(z)\, dz − \int q_φ(z) (∇_φ \ln q_φ(z)) \ln p(z, y)\, dz

= E_q[(∇_φ \ln q_φ(z)) \ln q_φ(z)] − E_q[(∇_φ \ln q_φ(z)) \ln p(z, y)].

This is represented as a sample average over draws from q_φ (the score–function
or REINFORCE estimator; a sketch is given below). The second method is the
well-known reparametrisation trick.
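A minimal sketch of this score–function estimator for a one-dimensional toy
problem; the target \ln p(z, y) = −(z − 2)^2/2 and the Gaussian family are
invented for illustration.

import numpy as np

def log_p(z):                    # invented toy target: ln p(z, y) + const
    return -0.5 * (z - 2.0) ** 2

def score_grad(m, log_s, n=5000, rng=np.random.default_rng(0)):
    s = np.exp(log_s)
    z = m + s * rng.standard_normal(n)   # samples from q_phi = N(m, s^2)
    dlogq_dm = (z - m) / s**2            # grad of ln q_phi w.r.t. m
    dlogq_dls = (z - m) ** 2 / s**2 - 1  # ... w.r.t. log s
    f = -0.5 * ((z - m) / s) ** 2 - np.log(s) - log_p(z)  # ln q - ln p
    # unbiased but high-variance: E_q[(grad ln q)(ln q - ln p)]
    return np.mean(dlogq_dm * f), np.mean(dlogq_dls * f)

print(score_grad(0.0, 0.0))      # approx (-2, 0); the optimum is m=2, s=1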

11.1.1 Reparametrisation trick

We will give the main idea for a simple one dimensional case, where the varia-
tional density is assumed to be Gaussian. We represent the random variable z
as a transformation of a random variable u with a fixed distribution:

z = µ + σu where u ∼ N(0, 1).

Hence

E_q\Big[ \ln \frac{q_φ(z)}{p(z, y)} \Big] = E_u\Big[ \ln \frac{q_φ(µ + σu)}{p(µ + σu, y)} \Big].

The gradient is given by

∇_φ E_q\Big[ \ln \frac{q_φ(z)}{p(z, y)} \Big] = ∇_φ E_u\Big[ \ln \frac{q_φ(µ + σu)}{p(µ + σu, y)} \Big] = E_u\Big[ ∇_φ \ln \frac{q_φ(µ + σu)}{p(µ + σu, y)} \Big].

Again, the gradient is represented as an expectation and the MC estimate is
unbiased.
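A matching sketch of the reparametrised gradient for the same invented toy
target as above; here the gradient of the integrand is taken analytically,
whereas in practice automatic differentiation would do this.

import numpy as np

def reparam_grad(m, log_s, n=5000, rng=np.random.default_rng(0)):
    s = np.exp(log_s)
    u = rng.standard_normal(n)
    z = m + s * u                         # z = mu + sigma * u
    dlogp_dz = -(z - 2.0)                 # for ln p(z, y) = -(z-2)^2/2
    # ln q_phi(m + s u) = -u^2/2 - log s + const: gradient 0 w.r.t. m
    # and -1 w.r.t. log s (the entropy term).
    grad_m = np.mean(-dlogp_dz)
    grad_log_s = np.mean(-1.0 - dlogp_dz * s * u)
    return grad_m, grad_log_s

print(reparam_grad(0.0, 0.0))   # approx (-2, 0), with much lower variance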

11.2 Why use Kullback–Leibler ?

If we rely on variational inference with MC samples rather than analytical inte-
grals, we might use other cost functions besides the KL divergence. A possible
alternative divergence measure for probability distributions is the following:
consider for 0 < α ≤ 1

F_α \doteq −E_q\Big[ \Big( \frac{p(z, y)}{q_φ(z)} \Big)^{1−α} \Big].

We note that E_q\big[ (p(z, y)/q_φ(z))^{1−α} \big] = E_q\big[ (p(z|y)/q_φ(z))^{1−α} \big] (p(y))^{1−α}. For two
distributions p and q we can show by an application of Jensen's inequality that

E_q\Big[ \Big( \frac{p(z)}{q(z)} \Big)^{1−α} \Big] = \int p(z) \Big( \frac{q(z)}{p(z)} \Big)^{α} dz ≤ \Big( \int p(z)\, \frac{q(z)}{p(z)}\, dz \Big)^{α} = 1^α = 1,

since x^α is concave for 0 < α ≤ 1, with equality for p ≡ q. We also get a bound
on the evidence:

F_α ≥ −(p(y))^{1−α}.

We can also recover the KL–divergence, because in the limit α → 1, we get

F_α = −E_q\Big[ \Big( \frac{p(z, y)}{q_φ(z)} \Big)^{1−α} \Big] = −E_q\Big[ \exp\Big( (1 − α) \ln \frac{p(z, y)}{q_φ(z)} \Big) \Big] ≈ −1 − (1 − α)\, E_q\Big[ \ln \frac{p(z, y)}{q_φ(z)} \Big],

using e^x ≈ 1 + x for x → 0. Hence the variational approximation with the new
family of divergences includes the original variational method. Of special inter-
est is also the limit α → 0. In this case, the bias in estimating p(y) disappears
and we are essentially performing importance sampling! But there is a generic
problem: for high dimensional problems, the ratio p(z, y)/q_φ(z) typically has
huge fluctuations, which makes the MC estimates for α < 1 very noisy. A small
numerical illustration is sketched below.
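A small numerical illustration of this effect, for an assumed setup in which
q = N(0, I) and a slightly shifted Gaussian plays the role of p: the mean of
the importance weights stays near 1, but their spread grows quickly with the
dimension.

import numpy as np

rng = np.random.default_rng(0)
alpha = 0.0                        # the importance-sampling limit
for d in [1, 10, 100]:
    z = rng.standard_normal((100_000, d))   # samples from q = N(0, I)
    # ln p(z)/q(z) for p = N(0.2, I), q = N(0, I):
    log_ratio = (0.2 * z - 0.5 * 0.2**2).sum(axis=1)
    w = np.exp((1 - alpha) * log_ratio)
    print(d, w.mean(), w.std())    # mean near 1, std grows with d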

11.3 Minimising the other KL ?

Another possibility for variational inference seems to be to minimise the
other KL,

D(p\|q) = \int dz\, p(z) \ln \frac{p(z)}{q(z)} = \text{const} − \int dz\, p(z) \ln q(z),

where the variational distribution is now on the right rather than on the left.
Example: if we consider a factorising approximation q(z) = \prod_i q_i(z_i), we have to
minimise

−\sum_i \int dz_i\, p_i(z_i) \ln q_i(z_i),

which is minimised by the true marginals, q_i = p_i. On the other hand, for expo-
nential families

q(z|θ) = f(z) \exp[ψ(θ) · φ(z) + g(θ)]

we get

−\int dz\, p(z) \ln q(z) = \text{const} − ψ(θ) · E_p[φ(z)] − g(θ).

Since ∇_ψ g(θ(ψ)) = −E_q[φ(z)], taking the gradient with respect to ψ yields
moment matching for the optimal ψ:

E_q[φ(z)] = E_p[φ(z)].

In practice we can’t perform the necessary calculations, because these require


expectations over the intractable distribution p.

11.3.1 Bayes Online (Assumed Density Filtering)

We can still try this procedure by a further approximation using an on–line
algorithm. Let us consider the exact update of the posterior, when a new data
point y_{t+1} arrives:

p(z|D_{t+1}) = \frac{p(y_{t+1}|z)\, p(z|D_t)}{\int dz\, p(y_{t+1}|z)\, p(z|D_t)}.

We replace the exact p(z|D_t) by a parametric approximation q(z|par(t)) using
the following steps:

1. Update:
q(z|y_{t+1}, par(t)) = \frac{p(y_{t+1}|z)\, q(z|par(t))}{\int dz\, p(y_{t+1}|z)\, q(z|par(t))}.

2. Project: Minimise
D\big( q(·|y_{t+1}, par(t)) \,\|\, q(·|par) \big).

3. For exponential families q(z|par) ∝ exp[ψ · φ(z)], we have par = ψ. The
projection leads to matching of the moments E[φ(z)] of the distributions
q(z|par) and q(z|y_{t+1}, par(t)).

11.3.2 Example: Gaussian Approximation

For Gaussian approximations, we define the parameters as par = (mean, covariance) =
(ẑ, C). One can show that the matching of moments results in the explicit up-
date

ẑ_i(t + 1) = ẑ_i(t) + \sum_j C_{ij}(t)\, ∂_j \ln E_u[p(y_{t+1}|ẑ(t) + u)]

and

C_{ij}(t + 1) = C_{ij}(t) + \sum_{kl} C_{ik}(t) C_{lj}(t)\, ∂_k ∂_l \ln E_u[p(y_{t+1}|ẑ(t) + u)],

with ∂_j \doteq \frac{∂}{∂ẑ_j}. Here \int dz\, p(y_{t+1}|z)\, q(z|par(t)) was written as
E_u[p(y_{t+1}|ẑ(t) + u)], where u is a zero mean Gaussian random vector with
covariance C(t). A one–dimensional sketch is given below.
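A one-dimensional sketch with a Gaussian likelihood (an assumed toy setup),
where E_u[p(y|ẑ + u)] = N(y; ẑ, σ^2 + C) is available in closed form and the
update reduces to the familiar Kalman-filter recursion.

import numpy as np

def adf_step(zhat, C, y, sig2):
    # For p(y|z) = N(y; z, sig2): E_u[p(y|zhat+u)] = N(y; zhat, sig2 + C).
    s = sig2 + C
    dlogE = (y - zhat) / s     # first derivative of ln E_u[...] w.r.t. zhat
    d2logE = -1.0 / s          # second derivative
    return zhat + C * dlogE, C + C * d2logE * C

rng = np.random.default_rng(0)
z_true, sig2 = 1.5, 0.5
zhat, C = 0.0, 10.0            # prior mean and variance
for y in z_true + np.sqrt(sig2) * rng.standard_normal(100):
    zhat, C = adf_step(zhat, C, y, sig2)
print(zhat, C)                 # mean near 1.5, variance shrinking like 1/t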

11.3.3 Asymptotic Error

Assuming convergence, we can get an asymptotic result for the quality of the
method, by assuming that data are generated from a true density p^*(y). For
the deviation ε(t) = ẑ(t) − z^* from the true parameter z^*, one obtains

E_D[ε_i(t)\, ε_j(t)] = \frac{1}{t} (A^{-1} B A^{-1})_{ij}, \quad t → ∞,

with

B_{ij} = \int dy\, p^*(y)\, ∂_i \ln p(y|z^*)\, ∂_j \ln p(y|z^*),

A_{ij} = −\int dy\, p^*(y)\, ∂_i ∂_j \ln p(y|z^*).

This is the same error rate as for batch algorithms (Max. Likelihood or Bayes):
hence, we get asymptotic efficiency! One can also show that the algorithm is
asymptotically equivalent to natural gradient online learning.
The following plot shows test errors (probability of misclassification) for a
toy probit model, with spherical Gaussian inputs (d = 50) and a realizable
target, as a function of α \doteq \#data/d. The dashed line is an analytical result
for the quality of a Bayes optimal batch algorithm.

11.3.4 Further properties:

• The Bayes online method can be applied to GP models (e.g. the informative
vector machine).
• It can be turned into a batch algorithm by using the data many times (Expec-
tation Propagation).

11.4 Approximate inference based on particle flow

The main idea is to generate samples (particles) z_i ∼ p_0(·) i.i.d. from a simple
distribution p_0, e.g. the prior. One then applies a particle flow (in continuous
time)

\frac{dz_i(t)}{dt} \doteq φ_t(z_i(t)).

We must define a mapping φ_t(·) such that for t → ∞, the density of the particles
z_i(t) ∼ q_∞(·) is close to the posterior p(z|y).

The basic idea to obtain the mapping is to construct φ_t(·) in such a way
that the change (decrease) of the KL divergence, dD(q_t \| p(·|y))/dt, is large. To
work out the details, we need to know:
• How does q_t change over time ?
• What is dD(q_t \| p(·|y))/dt ?
• We need to specify a family G of mapping functions φ_t(·) and be able to
maximise the decrease of KL !
• We need to express all expectations by sample means.
• Finally, in practice we use discrete time steps
z_i(t + 1) = z_i(t) + φ_t(z_i(t)).

11.4.1 Change of density under deterministic flow

To answer the first question, we consider a system of ordinary differential equa-
tions dZ(t)/dt = φ_t(Z(t)) for Z(t) ∈ R^d and φ_t(z) ∈ R^d. Given the initial density
of the random vector Z(0) ∼ q_0(z), we try to figure out what the density q_t(z)
of Z(t) is.
To solve this problem, we consider the change of expectations for arbitrary
smooth functions g at time t:

\frac{d}{dt} E[g(Z(t))] = E\Big[ ∇g(Z(t)) · \frac{dZ(t)}{dt} \Big] = E[∇g(Z(t)) · φ_t(Z(t))]
= \int q_t(z)\, ∇g(z) · φ_t(z)\, dz = −\int g(z)\, ∇ · (q_t(z) φ_t(z))\, dz.

In the last step, we have performed an integration by parts. On the other hand,
we can express the same quantity by the change of the density:

\frac{d}{dt} E[g(Z(t))] = \frac{d}{dt} \int q_t(z) g(z)\, dz = \int \frac{dq_t(z)}{dt} g(z)\, dz.

Since both expressions hold for arbitrary functions g (assuming some smooth-
ness), we conclude that

\frac{dq_t(z)}{dt} = −∇ · (q_t(z) φ_t(z)).

11.4.2 Change of KL

We will next address dD(q_t \| p(·|y))/dt. A direct calculation yields

\frac{d}{dt} \int q_t(z) \ln q_t(z)\, dz − \frac{d}{dt} \int q_t(z) \ln p(z|y)\, dz
= −\int ∇ · (q_t(z) φ_t(z)) \ln q_t(z)\, dz + \int ∇ · (q_t(z) φ_t(z)) \ln p(z|y)\, dz
= \int q_t(z) φ_t(z) · ∇ \ln q_t(z)\, dz − \int q_t(z) φ_t(z) · ∇ \ln p(z|y)\, dz
= \int ∇q_t(z) · φ_t(z)\, dz − \int q_t(z) φ_t(z) · ∇ \ln p(z|y)\, dz
= −E_q[∇ · φ_t(z) + φ_t(z) · ∇ \ln p(z|y)].

To obtain the third line and the last line we have performed an integration by
parts. The operator inside the bracket is known as Stein's operator.

11.4.3 Steepest descent direction

To find the steepest descent direction in φ space, we need to solve an optimisa-
tion problem of the form

\max_{φ ∈ F} \big\{ E_q[∇ · φ_t(z) + φ_t(z) · ∇ \ln p(z|y)] \text{ such that } \|φ\| ≤ 1 \big\}.

For specific families of functions F this can be solved in closed form: the
Stein variational gradient descent (SVGD) algorithm chooses functions φ in the
Reproducing Kernel Hilbert Space (RKHS) given by some positive definite
kernel K(z, z').
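A minimal sketch of one standard form of the resulting SVGD update (cf. Liu &
Wang, 2016), φ(z) ∝ \sum_j [k(z_j, z) ∇ \ln p(z_j|y) + ∇_{z_j} k(z_j, z)]; the kernel
bandwidth, step size and the 1-D Gaussian target below are made-up choices.

import numpy as np

def svgd_step(Z, grad_logp, h=0.5, eps=0.05):
    # Z: (n, d) particles; grad_logp(Z): (n, d) gradients of ln p(z|y).
    diff = Z[:, None, :] - Z[None, :, :]           # z_i - z_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h**2))  # RBF kernel k(z_j, z_i)
    grad_K = (diff / h**2) * K[:, :, None]         # grad_{z_j} k(z_j, z_i)
    phi = (K @ grad_logp(Z) + grad_K.sum(axis=1)) / Z.shape[0]
    return Z + eps * phi

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 1)) - 3.0            # particles from p_0
for _ in range(500):
    Z = svgd_step(Z, lambda Z: -(Z - 2.0))         # target: N(2, 1)
print(Z.mean(), Z.std())                           # approximately 2 and 1

The attractive first term drives the particles towards high posterior density,
while the repulsive kernel-gradient term keeps them spread out.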

11.5 Basics of Information Theory

Literature: Elements of Information Theory by T. M. Cover and J. A. Thomas,
Wiley (1991).
Information Theory addresses the following problems:

• What is the ultimate compressibility of data ?

• What is the ultimate transmission rate of communication ?
• Lots of applications & links to other fields (statistics, game theory etc.)

11.5.1 Basic definitions

The entropy for a discrete random variable X is defined as

H(X) \doteq −\sum_x p(x) \log p(x) = −E_p[\log p(x)] ≥ 0,
where log means the binary logarithm. This gives a measure of uncertainty about
X and also a measure of the information gained by observing X.
This is illustrated by the two figures. The left is a discrete distribution with
zero probability for all values except for a single value which has P = 1. There is
no surprise in a realisation of X and the entropy is 0. The right shows a uniform
distribution. All values are equally probable and we have maximal surprise and
entropy in observing a realisation of X. A small numerical check is given below.
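A two-line numerical check of these two extreme cases:

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # convention: 0 log 0 = 0
    return -(p * np.log2(p)).sum() + 0.0

print(H([1, 0, 0, 0]))   # point mass: 0 bits, no surprise
print(H([0.25] * 4))     # uniform over 4 values: 2 bits, maximal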

11.5.2 Joint entropy

This is a simple generalisation: H(X, Y) \doteq −\sum_{x,y} p(x, y) \log p(x, y). It is
additive, H(X, Y) = H(X) + H(Y), if X, Y are independent. Proof:

H(X, Y) = −\sum_{x,y} p(x) p(y) \log(p(x) p(y))
= −\sum_{x,y} p(x) p(y) \log p(x) − \sum_{x,y} p(x) p(y) \log p(y)
= −\sum_x p(x) \log p(x) − \sum_y p(y) \log p(y).

11.5.3 Conditional entropy and relative entropy

The conditional entropy is

H(Y|X) \doteq −\sum_{x,y} p(x, y) \log p(y|x),

which is NOT the entropy of the conditional distribution (but its expectation).
The relative entropy (KL divergence) is given by

D(p\|q) \doteq \sum_x p(x) \log \frac{p(x)}{q(x)} ≥ 0,

and the Mutual Information is

I(X, Y) \doteq D(p(x, y)\|p(x)p(y)) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}.

The Mutual Information can also be expressed as

I(X, Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}
= \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)} = \sum_{x,y} p(x, y) \log \frac{p(y|x)}{p(y)}
= H(X) − H(X|Y) = H(Y) − H(Y|X).

We have

I(X, Y) = H(X) − H(X|Y) ≥ 0,

and thus

H(X|Y) ≤ H(X),

with equality if and only if X and Y are independent. Hence, conditioning
reduces entropy.

11.5.4 Properties of entropy

Entropy is maximal for the uniform distribution u(x) = 1/N. Proof:

D(p\|u) = \sum_x p(x) \log \frac{p(x)}{u(x)} = \sum_x p(x) \log p(x) + \log N = −H(X) + \log N ≥ 0.

Hence

H(X) ≤ \log N = −\sum_x u(x) \log u(x).

We consider a discrete time/state Markov chain with transition probabilities
p(x_{t+1}|x_t). Assume two chains with different marginal probabilities p(x_t) and
q(x_t) for t = 1, 2, . . ., caused by starting with different initial conditions p(x_0)
and q(x_0) but having the same transition probabilities. Then we can show that

D(p(x_{t+1})\|q(x_{t+1})) ≤ D(p(x_t)\|q(x_t)).

Proof: Use

p(x_{t+1}, x_t) = p(x_{t+1}|x_t)\, p(x_t) = p(x_t|x_{t+1})\, p(x_{t+1}).

Thus

D(p(x_{t+1}, x_t)\|q(x_{t+1}, x_t)) = \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_{t+1}, x_t)}{q(x_{t+1}, x_t)}
= \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_{t+1}|x_t)}{q(x_{t+1}|x_t)} + \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_t)}{q(x_t)}
= \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_t|x_{t+1})}{q(x_t|x_{t+1})} + \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_{t+1})}{q(x_{t+1})}.

We have p(x_{t+1}|x_t) = q(x_{t+1}|x_t) (both chains evolve under the same transition
probability). Thus

D(p(x_t)\|q(x_t)) = D(p(x_{t+1})\|q(x_{t+1})) + \sum_{x_t, x_{t+1}} p(x_{t+1}, x_t) \log \frac{p(x_t|x_{t+1})}{q(x_t|x_{t+1})}
= D(p(x_{t+1})\|q(x_{t+1})) + \sum_{x_{t+1}} p(x_{t+1}) \sum_{x_t} p(x_t|x_{t+1}) \log \frac{p(x_t|x_{t+1})}{q(x_t|x_{t+1})}
≥ D(p(x_{t+1})\|q(x_{t+1})),

since the inner sum is a KL divergence and hence nonnegative. We can apply
this with q equal to the stationary distribution to show convergence of p(x_t) to
the stationary distribution. A small numerical illustration is sketched below.
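A small sketch verifying the monotone decrease for an invented two-state
chain:

import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

T = np.array([[0.9, 0.1],    # transition matrix, rows = current state
              [0.3, 0.7]])
p = np.array([1.0, 0.0])     # two different initial conditions
q = np.array([0.2, 0.8])
for t in range(10):
    print(t, kl(p, q))       # monotonically decreasing
    p, q = p @ T, q @ T      # both evolve under the same kernel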

11.6 Source coding (data compression)

We will next relate information theory and data compression. We consider
source codes: source symbols are mapped to code words (strings from e.g. a
binary alphabet).
Example:

A → 10
B → 00
C → 11
D → 110

If we try to decode 11000 → DB sequentially:

1 → A?, C?, D?
11 → C?, D?
110 → CB?, D?
1100 → CB?, DB?
11000 → DB

we cannot decide on the value of the first symbol, unless we arrive at the end of
the sequence, because C = 11 is a prefix of D = 110.
Hence, we define instantaneous (prefix) codes: no codeword is a prefix of
any other codeword. Prefix codes can be represented by a code tree: no code-
word (CW) is an ancestor of another CW. This can be seen in the figure, where
CWs appear only at the leaves.

11.6.1 Kraft inequality

One would like to assign code lengths to source symbols which are as small as
possible. But there are limits: for any prefix code with m codewords, the code
lengths l_1, l_2, . . . , l_m satisfy (for a binary alphabet) Kraft's inequality

\sum_i 2^{−l_i} ≤ 1.

This can be extended to codes with alphabet size D:

\sum_i D^{−l_i} ≤ 1.

To prove this, we note that in a code tree, each CW eliminates its descendants
as CWs. Let l_max be the length of the longest CW. Each node in a tree at level
l_max can be a CW, a descendant of a CW, or neither. Now consider a CW at
level l_i:

1. It has 2^{l_max − l_i} descendants at level l_max.

2. The sets of descendants of different CWs are disjoint !

3. The total number of nodes in all descendant sets is ≤ 2^{l_max}.

Thus

\sum_i 2^{l_max − l_i} ≤ 2^{l_max}.

Dividing by 2^{l_max} yields Kraft's inequality. In the figure, we show an example
of CWs and their descendants for l_max = 3.

There is a converse to Kraft's inequality: for any set of integers l_1, l_2, . . . , l_m
fulfilling Kraft's inequality, one can construct a prefix code.
One might assume that one could do better than Kraft by relaxing the prefix
assumption. But you can't beat the system: any uniquely decodable code
fulfils Kraft ! Proof: Consider the code which concatenates k CWs. The length
for encoding the source symbols x \doteq x_1, . . . , x_k is then

l(x) = \sum_{i=1}^{k} l(x_i).

Let us look at

\Big( \sum_x D^{−l(x)} \Big)^k = \sum_{x_1, . . . , x_k} D^{−l(x_1)} \cdots D^{−l(x_k)}
= \sum_{m=1}^{k\, l_max} a(m)\, D^{−m} ≤ \sum_{m=1}^{k\, l_max} D^m D^{−m} = k\, l_max,

where a(m) is the number of source sequences x which are mapped to CWs of
length m, and a(m) ≤ D^m (there cannot be more CWs of length m than strings
of length m, otherwise we could not uniquely decode them). Hence

\sum_x D^{−l(x)} ≤ (k\, l_max)^{1/k}.

If we take the limit k → ∞, we obtain Kraft's inequality. A small numerical
check and the converse construction are sketched below.
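A quick check of Kraft's inequality for the example code above, together with
a sketch of the standard construction behind the converse (assign codewords
as binary expansions of the running Kraft sum of the sorted lengths):

# Kraft sum for the example code A -> 10, B -> 00, C -> 11, D -> 110:
lengths = [2, 2, 2, 3]
print(sum(2 ** -l for l in lengths))    # 0.875 <= 1

def prefix_code(lengths):
    # Converse construction: sort the lengths; the i-th codeword is the
    # l_i-bit binary expansion of the running Kraft sum so far.
    codewords, acc = [], 0.0
    for l in sorted(lengths):
        codewords.append(format(int(acc * 2**l), f"0{l}b"))
        acc += 2.0 ** -l
    return codewords

print(prefix_code([2, 2, 2, 3]))        # ['00', '01', '10', '110']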

11.6.2 Expected code length

We will next combine compression with statistical assumptions. We assume
that source symbols x are generated at random (i.i.d.) from the source with
probabilities x ∼ p(x). The expected code length is given by

L \doteq E[l(x)] = \sum_x p(x) l(x).

We will show that

L ≥ H_D(X),

where H_D(X) \doteq −\sum_x p(x) \log_D p(x). Hence, the entropy is the minimum ex-
pected code–length that can be achieved. Proof:

L − H_D(X) = \sum_x p(x) \big( l(x) + \log_D p(x) \big) = \sum_x p(x) \big( −\log_D D^{−l(x)} + \log_D p(x) \big).

We define new probabilities r(x) \doteq \frac{D^{−l(x)}}{\sum_x D^{−l(x)}}. Hence

L − H_D(X) = \sum_x p(x) \log_D \frac{p(x)}{r(x)} − \log_D \Big( \sum_x D^{−l(x)} \Big) ≥ 0,

where we have used the fact that the KL divergence is ≥ 0 and Kraft's inequality
to bound the last term.

11.6.3 Further properties

• One can construct a code for which H(X) ≤ L ≤ H(X) + 1.
• By using block codes for x = x_1, . . . , x_n one can achieve
\frac{1}{n} E[l(x)] → H(X).

• We will view l(x) = −\log_D p(x) as the ideal codelength.

• Compression to the entropy (–rate) can be extended to non i.i.d. sources.
• If p(x) is not known, and one uses another distribution q(x) for encoding,
i.e. l(x) = −\log_D q(x), then

L = \sum_x p(x) l(x) = L_{ideal} + \sum_x p(x) \log_D \frac{p(x)}{q(x)}.

Thus, the KL divergence measures the extra expected code length needed
for compression, when the true probabilities of the source are not known.
A numerical check with Shannon code lengths is sketched after this list.
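A numerical check using Shannon code lengths l(x) = ceil(−log2 q(x)) (which
always satisfy Kraft); the distributions are invented for illustration:

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
H = -(p * np.log2(p)).sum()

def expected_length(p, q):
    # Shannon code lengths l(x) = ceil(-log2 q(x)) always satisfy Kraft.
    return (p * np.ceil(-np.log2(q))).sum()

print(H, expected_length(p, p))          # H <= L <= H + 1 (here L = H)
q = np.array([0.25, 0.25, 0.25, 0.25])   # encoding with the wrong model
kl = (p * np.log2(p / q)).sum()
print(expected_length(p, q), H + kl)     # excess length is the KL term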

Chapter 12

Week 11

In this chapter, we will briefly discuss information theory and gambling, differ-
ential entropy and MaxEnt estimation and finally, the minimum description
length method. The latter is a compression approach to model selection and can
serve as another justification of Bayesian methods.

12.0.4 Information Theory and Gambling

We assume horse races which occur at times t = 1, 2, 3 . . . , n. Let us make the
following definitions:
1. There are m horses in a race.
2. X_t denotes the index of the winning horse in race t (assumed to be ran-
dom).
3. p(x) = Pr(horse x wins), independent of the race.
4. b(x) = fraction of money invested in horse x.
5. o(x) = payoff, 'o(x) for 1', if horse x wins (this is fixed).
We are interested in the wealth of the gambler after n (n is large) races:

W_n = \prod_{t=1}^{n} b(X_t)\, o(X_t) \propto 2^{\sum_{t=1}^{n} \log_2 b(X_t)} \approx 2^{n E[\log_2 b(X)]} \text{ for } n \text{ large},

where in the first step, we have kept only the terms that the gambler can control.
In the last step, we approximated the sum over independent random terms in
the exponent by its expectation, invoking the law of large numbers.
The gambler's goal is to maximise the (log–) wealth. This is achieved by
proportional betting, i.e. by setting b(x) = p(x). This is optimal because

E[\log_2 b(X)] = \sum_x p(x) \log_2 b(x) = \sum_x p(x) \log_2 \frac{b(x)}{p(x)} − H(X),

which is maximal when the negative KL divergence is zero. A small simulation
is sketched below.
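A small simulation of proportional betting for an invented race with three
horses; any other betting fraction b achieves a smaller growth rate, by exactly
D(p||b):

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # win probabilities of 3 horses
o = np.array([2.0, 4.0, 5.0])   # fixed 'o(x) for 1' payoffs

def growth_rate(b, n=200_000):
    x = rng.choice(3, size=n, p=p)        # winning horse in each race
    return np.mean(np.log2(b[x] * o[x]))  # (1/n) log2 W_n

print(growth_rate(p))                          # proportional betting b = p
print(growth_rate(np.array([0.8, 0.1, 0.1])))  # any other b does worse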

12.0.5 Differential entropy

One can define entropy for probability densities f(x). This is called differential
entropy,

h(X) = −\int f(x) \log f(x)\, dx,

and it can be negative ! To understand its relation to the discrete entropy H(X),
we consider X^∆, a quantised version (using small bins, see the figure) of X,
where for small ∆ we have P(X^∆) ≈ f(X^∆)∆. Hence, the discrete entropy
equals

H(X^∆) = −\sum_{X^∆} P(X^∆) \log P(X^∆) ≈ h(X) − \log ∆

as ∆ → 0, where we have approximated the sum by an integral. The discrete
entropy becomes infinite for ∆ → 0.
We will have similar definitions for relative entropy, mutual information etc.,
e.g.

D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx.

In this case, the corresponding quantised version converges to the continuous
result (no divergence) for ∆ → 0, because ∆ drops out in the ratio of
probabilities. A quick numerical check of the relation H(X^∆) ≈ h(X) − \log ∆
is sketched below.
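A quick numerical check of H(X^∆) ≈ h(X) − log ∆ for a standard Gaussian,
whose differential entropy is h = (1/2) log2(2πe):

import numpy as np

h = 0.5 * np.log2(2 * np.pi * np.e)       # differential entropy of N(0,1)
for Delta in [0.5, 0.1, 0.01]:
    centres = np.arange(-10, 10, Delta) + Delta / 2
    P = np.exp(-centres**2 / 2) / np.sqrt(2 * np.pi) * Delta  # P(X^Delta)
    P /= P.sum()
    H = -(P * np.log2(P)).sum()
    print(Delta, H, h - np.log2(Delta))   # the two agree for small Delta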

12.0.6 Principle of maximum entropy

Suppose we know several expectations of a random variable X, e.g.

α_j \doteq E[r_j(X)] = \int f(x) r_j(x)\, dx, \quad j = 1, . . . , k.

Of course, usually this is not enough to specify the density f. How should we
model the density f(x) while making the least additional assumptions ?

The MaxEnt principle (Jaynes) suggests to look for the density with the
largest entropy, i.e. to maximise

h(X) = −\int f(x) \ln f(x)\, dx \quad \text{such that} \quad \int f(x) r_j(x)\, dx = α_j

for j = 1, . . . , k.
This can be solved by introducing Lagrange multipliers for the constraints:

L(f) = −\int f(x) \ln f(x)\, dx + \sum_{j=1}^{k} λ_j \int f(x) r_j(x)\, dx + λ_0 \int f(x)\, dx.

Then by setting the functional derivative

\frac{δL(f)}{δf(x)} = −\ln f(x) − 1 + \sum_{j=1}^{k} λ_j r_j(x) + λ_0 = 0,

we obtain the optimal solution

f^*(x) = e^{\sum_{j=1}^{k} λ_j r_j(x) + λ_0 − 1} = \frac{e^{\sum_{j=1}^{k} λ_j r_j(x)}}{\int e^{\sum_{j=1}^{k} λ_j r_j(x)}\, dx},
where the λ_j need to be adjusted such that the constraints are fulfilled. We
can show independently that the solution is indeed optimal: let g(x) be any
density which fulfils the constraints. Then

−\int g(x) \ln g(x)\, dx = −\int g(x) \ln \frac{g(x)}{f^*(x)}\, dx − \int g(x) \ln f^*(x)\, dx
= −D(g\|f^*) − \int g(x) \Big( λ_0 − 1 + \sum_{j=1}^{k} λ_j r_j(x) \Big)\, dx
= −D(g\|f^*) − \int f^*(x) \Big( λ_0 − 1 + \sum_{j=1}^{k} λ_j r_j(x) \Big)\, dx
= −D(g\|f^*) − \int f^*(x) \ln f^*(x)\, dx
≤ −\int f^*(x) \ln f^*(x)\, dx.

From the second to the third line, we used the fact that f^* and g have the same
expectations of the functions r_j (and both integrate to one). The result shows
that the entropy of g is never larger than that of f^*.

12.0.7 Examples:

• MaxEnt applied to E[X] = µ, X ≥ 0 yields the exponential density:

f^*(x) = \frac{1}{µ} e^{−x/µ}.

• A Gaussian is obtained for constraints on mean and variance, E[X] = α_1
and E[X^2] = α_2:

f^*(x) = N(α_1, α_2 − α_1^2).

• Note that we obtain exponential families if we generalise the maximisation
of h[X] to that of −D(f\|f_0), where f_0 is a reference density.
• One can show that MaxEnt shows maximal robustness to uncertainty in the
α_j. In practice, this uncertainty is a result of usually having only estimates
of the expectations based on random samples.

A small numerical MaxEnt example is sketched below.
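A minimal numerical MaxEnt sketch on a finite grid (an assumed discrete
stand-in for the exponential-density example): a single mean constraint gives
f(x) ∝ e^{λx}, and λ is found by root finding.

import numpy as np
from scipy.optimize import brentq

# MaxEnt on the grid x = 0..50 under the single constraint E[X] = 5:
# the solution has the form f(x) proportional to exp(lambda*x), a
# discretised exponential; lambda is tuned so the constraint holds.
x = np.arange(51)

def mean_given(lam):
    w = np.exp(lam * x)
    return (x * w).sum() / w.sum()

lam = brentq(lambda l: mean_given(l) - 5.0, -5.0, 5.0)
f = np.exp(lam * x)
f /= f.sum()
print(lam, (x * f).sum())   # the mean constraint is matched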

12.0.8 Problems of spectrum estimation

In many problems of time series estimation, one assumes a sequence of random
observations . . . , X_0, X_1, X_2, . . . X_T generated from a stationary time series. A
goal might be to estimate the frequency spectrum

C(ω) = \sum_{k=−∞}^{∞} e^{iωk} C(k)

with C(k) = E[X_i X_{i+k}]. Unfortunately, estimation of C(k) from random sam-
ples becomes bad when k is large ! The figures (taken from Josef Honerkamp's
book: Stochastic dynamical systems) show that it does not help to just increase
the length T of the observations. The top figure shows a time series gener-
ated from an AR process. The middle figure gives the estimate of C(ω) using
T = 1024 and the bottom figure the spectrum estimate for T = 4096:
the estimated spectrum does not converge. The reason is that, although we get
more data for estimating correlations with smaller time lags, the number
of badly estimated correlations also increases. A possible solution is to smooth
the estimate over a small window.

12.0.9 Spectrum estimation using MaxEnt

MaxEnt proposes a different solution. We use a set of correlations α_k =
E[X_i X_{i+k}] of a stationary stochastic process for k = 0, . . . , p (where we restrict
ourselves to p small enough so that these are well estimated) as constraints for
a MaxEnt problem. Since we specify second moments, the optimal solution is a
Gaussian (–process). One can show that it can be represented by an AR process
of the form

X_i = \sum_{j=1}^{p} a_j X_{i−j} + Z_i,

where Z_i ∼ N(0, σ^2). We have to adjust the a_j and σ^2 so that the correlations
agree with the measured ones. Using the process, we can then compute the
corresponding spectrum. A sketch of this recipe is given below.
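A sketch of this recipe (Yule–Walker estimation of the AR coefficients from
the first p + 1 correlations, then the AR/MaxEnt spectrum
σ^2 / |1 − \sum_j a_j e^{−iωj}|^2); the AR(2) ground truth is invented for
illustration.

import numpy as np
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(0)
# Simulate an AR(2) process as (invented) ground truth.
T = 4096
x = np.zeros(T)
for t in range(2, T):
    x[t] = 1.2 * x[t-1] - 0.8 * x[t-2] + rng.standard_normal()

# Estimate correlations only for the small lags k = 0..p.
p = 2
C = np.array([np.mean(x[:T-k] * x[k:]) for k in range(p + 1)])

# Yule-Walker equations: Toeplitz(C[0..p-1]) a = C[1..p].
a = solve_toeplitz(C[:p], C[1:p+1])
sigma2 = C[0] - a @ C[1:p+1]

# MaxEnt (AR) spectrum: sigma^2 / |1 - sum_j a_j e^{-i omega j}|^2.
omega = np.linspace(0, np.pi, 200)
A = 1 - sum(a[j] * np.exp(-1j * omega * (j + 1)) for j in range(p))
S = sigma2 / np.abs(A) ** 2
print(a, sigma2)   # close to [1.2, -0.8] and 1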

12.1 Information theory and Model selection

This is mainly based on Jorma Rissanen's work. The idea is that for encoding
data we need a statistical model for the data. Good statistical models allow for
a good compression of the data ! The goal could be to find the MDL = minimum
description length of the data for a given model as a yardstick for model selection.
Let us use a 2–stage coding for compressing the observed data:

1. Encode the data D = (x_1, . . . , x_n) given a parametric model f(x|θ).

2. Encode the parameter value θ^∆ (assume d = 1) used for the model with
the help of a discretised prior distribution P(θ^∆) ≈ p(θ^∆)∆.

Note that the use of a prior is necessary to define the encoding of the parameter.
The total codelength of the two stage code is

CL = CL(\text{data}|θ^∆) + CL(θ^∆) \simeq −\sum_{i=1}^{n} \log f(x_i|θ^∆) − \log\big( p(θ^∆)∆ \big),

where θ^∆ is a quantised parameter. This code length can be minimised with
respect to the quantised parameter θ^∆ and the quantisation ∆, and we get

MDL = \min_{θ^∆, ∆} \Big\{ −\sum_{i=1}^{n} \log f(x_i|θ^∆) − \log\big( p(θ^∆)∆ \big) \Big\}.

For small enough ∆, the optimal θ^∆ ≈ θ_{ML} and

−\sum_{i=1}^{n} \log f(x_i|θ^∆) ≈ −\log f(D|θ_{ML}) − \frac{1}{2} ∂_θ^2 \log f(D|θ_{ML})\, (θ^∆ − θ_{ML})^2
≤ −\log f(D|θ_{ML}) − \frac{1}{2} ∂_θ^2 \log f(D|θ_{ML})\, ∆^2.

We can then optimise the quantisation to get

−∂_θ^2 \log f(D|θ_{ML})\, ∆ = \frac{1}{∆}, \quad \text{hence} \quad ∆_{opt} ≈ \frac{1}{\sqrt{−∂_θ^2 \log f(D|θ_{ML})}} ≈ \frac{1}{\sqrt{n J(θ_{ML})}}.

The total result equals

MDL ≈ −\log f(D|θ_{ML}) − \log p(θ_{ML}) + \frac{1}{2} \log(n J(θ_{ML})),

which is similar to the Laplace approximation for the evidence and can be used
to justify the BIC criterion. A small numerical sketch is given below.
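A small numerical sketch of the resulting two-part score for two invented
models of Gaussian data (the prior value p(θ_ML) = 0.1 is an arbitrary
hypothetical choice for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = 0.3 + rng.standard_normal(100)            # data with true mean 0.3

# Model 1: N(0, 1), no free parameter: code length is just -log2 f(D).
cl1 = -norm.logpdf(D, 0, 1).sum() / np.log(2)

# Model 2: N(theta, 1), one fitted parameter; MDL adds the parameter cost
# -log2 p(theta_ML) + 0.5 log2(n J).  (J = 1 is the Fisher information
# of N(theta, 1); p(theta_ML) = 0.1 is a made-up prior value.)
theta_ml, n, J = D.mean(), len(D), 1.0
cl2 = (-norm.logpdf(D, theta_ml, 1).sum()
       - np.log(0.1) + 0.5 * np.log(n * J)) / np.log(2)
print(cl1, cl2)                               # smaller code length wins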

12.1.1 Stochastic complexity

So far we have not yet obtained the full Bayesian evidence as a result of com-
pression, only an approximation. We can do better. For encoding we just need
a probability distribution over data sets D, not necessarily a two stage code.
We can use the following probability over data sequences D:

P(D) = \sum_{θ^∆} P(D|θ^∆) P(θ^∆),

which is properly normalised. (Note, there might be an extra quantisation of
x necessary, when x is continuous, but this discretisation is the same for all
models.) We can write

P(D) = \sum_{θ^∆} 2^{\log P(D|θ^∆) + \log P(θ^∆)} = \sum_{θ^∆} 2^{−CL(θ^∆)}.

P(D) yields a better code: the new code length for using P(D) is actually
smaller or equal to the old one which is based on the two stage approach:

−\log P(D) = −\log \sum_{θ^∆} 2^{−CL(θ^∆)} ≤ −\log \max_{θ^∆} 2^{−CL(θ^∆)} = CL(θ^∆_{opt}).

The relation to the Bayesian evidence becomes evident when we take the limit
∆ → 0:

P(D) = \sum_{θ^∆} P(D|θ^∆) P(θ^∆) ≈ \sum_{θ^∆} P(D|θ^∆) p(θ^∆) ∆ ≈ \int P(D|θ) p(θ)\, dθ,

where in the last step we have approximated the sum by an integral. This equals
the Bayesian evidence. In the coding context it is also known as the stochastic
complexity.
