
The EM Algorithm

Ajit Singh

November 20, 2005

1 Introduction
Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable
variables X and unknown (latent) variables Z, we want to estimate the parameters θ of a model.

Example 1.1 (Binomial Mixture Model). You have two coins with unknown probabilities of
heads, denoted p and q respectively. The first coin is chosen with probability π and the second
coin is chosen with probability 1 − π. The chosen coin is flipped once and the result is recorded.
x = {1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1} (Heads = 1, Tails = 0). Let Zi ∈ {0, 1} denote which coin was
used on each toss.

In example 1.1 we added latent variables Zi for reasons that will become apparent. The parameters
we want to estimate are θ = (p, q, π). Two criteria for point estimation are maximum likelihood
and maximum a posteriori:

θ̂_ML  = arg max_θ log p(x|θ)

θ̂_MAP = arg max_θ log p(x, θ)
       = arg max_θ [log p(x|θ) + log p(θ)]

Our presentation will focus on the maximum likelihood case (ML-EM); the maximum a posteriori
case (MAP-EM) is very similar.¹
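
As an illustration of the generative process in example 1.1, the short sketch below samples flips from the two-coin model. It is not part of the original note; the particular values of π, p, q, and the random seed are made up for illustration.

import random

# A sketch of the generative process in example 1.1 (illustrative values only).
def sample_flips(n, pi=0.5, p=0.8, q=0.3, seed=0):
    rng = random.Random(seed)
    flips, coins = [], []
    for _ in range(n):
        z = 1 if rng.random() < pi else 0    # pick coin 1 with probability pi, else coin 2
        bias = p if z == 1 else q            # heads probability of the chosen coin
        x = 1 if rng.random() < bias else 0  # flip it once; Heads = 1, Tails = 0
        flips.append(x)
        coins.append(z)                      # z is latent: in practice only x is observed
    return flips, coins

x, z = sample_flips(12)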

2 Notation
X               Observed variables
Z               Latent (unobserved) variables
θ(t)            The estimate of the parameters at iteration t
ℓ(θ)            The marginal log-likelihood, log p(x|θ)
log p(x, z|θ)   The complete log-likelihood, i.e., when we know the value of Z
q(z|x, θ)       Averaging distribution, a free distribution that EM gets to vary
Q(θ|θ(t))       The expected complete log-likelihood, Σ_z q(z|x, θ) log p(x, z|θ)
H(q)            Entropy of the distribution q(z|x, θ)

¹ In MAP-EM the M-step is a MAP estimate, instead of an ML estimate.

3 Derivation
We could directly maximize ℓ(θ) = log Σ_z p(x, z|θ) using a gradient method (e.g., gradient
ascent, conjugate gradient, quasi-Newton), but sometimes the gradient is hard to compute, hard to
implement, or we do not want to bother adding a black-box optimization routine.

Consider the following inequality:

ℓ(θ) = log p(x|θ) = log Σ_z p(x, z|θ)                                       (1)
     = log Σ_z q(z|x, θ) [p(x, z|θ) / q(z|x, θ)]                            (2)
     ≥ Σ_z q(z|x, θ) log [p(x, z|θ) / q(z|x, θ)] ≡ F(q, θ)                  (3)

where q(z|x, θ) is an arbitrary density over Z. This inequality is foundational to what are called
“variational methods” in the machine learning literature.² Instead of maximizing ℓ(θ) directly, EM
maximizes the lower bound F(q, θ) via coordinate ascent:

E-step:  q(t+1) = arg max_q F(q, θ(t))                                      (4)

M-step:  θ(t+1) = arg max_θ F(q(t+1), θ)                                    (5)

Starting with some initial value of the parameters θ(0), one cycles between the E and M-steps
until θ(t) converges to a local maximum. Computing equation 4 directly involves fixing θ = θ(t) and
optimizing over the space of distributions, which looks painful. However, it is possible to show
that q(t+1) = p(z|x, θ(t)). We can stop worrying about q as a variable over the space of distributions,
since we know the optimal q is a distribution that depends on θ(t). To compute equation 5 we fix
q and note that

ℓ(θ) ≥ F(q, θ)                                                              (6)
     = Σ_z q(z|x, θ) log [p(x, z|θ) / q(z|x, θ)]                            (7)
     = Σ_z q(z|x, θ) log p(x, z|θ) − Σ_z q(z|x, θ) log q(z|x, θ)            (8)
     = Q(θ|θ(t)) + H(q)                                                     (9)

so maximizing F(q, θ) is equivalent to maximizing the expected complete log-likelihood. Obscuring
these details, which explain what EM is doing, we can re-express equations 4 and 5 as

E-step:  Compute Q(θ|θ(t)) = E_{p(z|x,θ(t))}[log p(x, z|θ)]                 (10)

M-step:  θ(t+1) = arg max_θ E_{p(z|x,θ(t))}[log p(x, z|θ)]                  (11)
² If you feel compelled to tart it up, you can call equation 3 Gibbs’ inequality and F(q, θ) the negative variational
free energy.
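
The coordinate-ascent view in equations 4–11 translates directly into a small driver loop. The sketch below is not from the original note; e_step and m_step are hypothetical callables standing in for the model-specific computations, and the convergence test on θ is one reasonable choice among many.

# A minimal sketch of the generic EM loop (not part of the original note).
# `e_step(theta, x)` and `m_step(stats, x)` are hypothetical user-supplied callables:
# the E-step returns whatever expectations under p(z|x, theta) the M-step needs,
# and the M-step returns the parameters maximizing the expected complete log-likelihood.
def em(x, theta0, e_step, m_step, max_iters=100, tol=1e-8):
    theta = theta0                        # theta0 is a tuple of parameter values
    for _ in range(max_iters):
        stats = e_step(theta, x)          # E-step: fix theta, pick the optimal q
        new_theta = m_step(stats, x)      # M-step: fix q, maximize over theta
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break                         # parameters have stopped changing
    return theta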

3.1 Limitations of EM
EM is useful for several reasons: conceptual simplicity, ease of implementation, and the fact that
each iteration improves ℓ(θ). The rate of convergence on the first few steps is typically quite good,
but it can become excruciatingly slow as you approach a local optimum. Generally, EM works best
when the fraction of missing information is small³ and the dimensionality of the data is not too
large. EM can require many iterations, and higher dimensionality can dramatically slow down the
E-step.

³ The statement “fraction of missing information is small” can be quantified using Fisher information.

4 Using the EM algorithm


Applying EM to example 1.1 we start by writing down the expected complete log-likelihood

Q(θ|θ(t)) = E[ log Π_{i=1}^n [π p^(x_i) (1 − p)^(1−x_i)]^(z_i) [(1 − π) q^(x_i) (1 − q)^(1−x_i)]^(1−z_i) ]

          = Σ_{i=1}^n E[z_i|x_i, θ(t)] [log π + x_i log p + (1 − x_i) log(1 − p)]
                      + (1 − E[z_i|x_i, θ(t)]) [log(1 − π) + x_i log q + (1 − x_i) log(1 − q)]

Next we compute E[z_i | x_i, θ(t)]:

µ_i(t) = E[z_i | x_i, θ(t)] = p(z_i = 1 | x_i, θ(t))

       = p(x_i | z_i = 1, θ(t)) p(z_i = 1 | θ(t)) / p(x_i | θ(t))

       = π(t) [p(t)]^(x_i) [1 − p(t)]^(1−x_i) / ( π(t) [p(t)]^(x_i) [1 − p(t)]^(1−x_i) + (1 − π(t)) [q(t)]^(x_i) [1 − q(t)]^(1−x_i) )

Maximizing Q(θ|θ(t)) w.r.t. θ yields the update equations

∂Q(θ|θ(t))/∂π = 0  ⟹  π(t+1) = (1/n) Σ_i µ_i(t)

∂Q(θ|θ(t))/∂p = 0  ⟹  p(t+1) = Σ_i µ_i(t) x_i / Σ_i µ_i(t)

∂Q(θ|θ(t))/∂q = 0  ⟹  q(t+1) = Σ_i (1 − µ_i(t)) x_i / Σ_i (1 − µ_i(t))
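
To make the updates concrete, here is a short sketch (not from the original note) that runs these E and M-steps on the twelve flips from example 1.1; the starting values for π, p, and q and the iteration count are arbitrary choices for illustration.

# EM for the binomial mixture of example 1.1 (a sketch, not the author's code).
x = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]   # observed flips, Heads = 1, Tails = 0
pi, p, q = 0.6, 0.7, 0.4                   # arbitrary initial guesses for (pi, p, q)

for _ in range(200):                       # cycle E and M-steps
    # E-step: mu_i = p(z_i = 1 | x_i, theta^(t)) for each flip
    mu = [pi * p**xi * (1 - p)**(1 - xi)
          / (pi * p**xi * (1 - p)**(1 - xi) + (1 - pi) * q**xi * (1 - q)**(1 - xi))
          for xi in x]
    # M-step: the closed-form update equations derived above
    pi = sum(mu) / len(x)
    p = sum(m * xi for m, xi in zip(mu, x)) / sum(mu)
    q = sum((1 - m) * xi for m, xi in zip(mu, x)) / sum(1 - m for m in mu)

print(pi, p, q)                            # estimates of (pi, p, q) after 200 iterations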

4.1 Constrained Optimization


Sometimes the M-step is a constrained maximization, which means that there are constraints on
valid solutions that are not encoded in the function itself. An example of a constrained optimization is to
maximize

H(p_1, p_2, . . . , p_n) = − Σ_{i=1}^n p_i log_2 p_i                        (12)

such that  Σ_{i=1}^n p_i = 1                                                (13)

Such problems can be solved using the method of Lagrange multipliers. To maximize a function
f(p_1, . . . , p_n) over points p = (p_1, . . . , p_n) in an open subset of R^n subject to the constraint g(p) = 0,
it suffices to find the stationary points of the unconstrained function

Λ(p, λ) = f(p) − λ g(p)


To solve equation 12 we encode the constraint as g(p) = Σ_i p_i − 1 and maximize

Λ(p, λ) = − Σ_{i=1}^n p_i log_2 p_i − λ ( Σ_{i=1}^n p_i − 1 )

in the usual unconstrained manner, by solving the system of equations

∂Λ(p, λ)/∂p_i = 0,        ∂Λ(p, λ)/∂λ = 0

which leads to the solution p_i = 1/n.
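
For completeness, the intermediate step (not spelled out in the original note): setting the partial derivatives to zero gives

∂Λ/∂p_i = − log_2 p_i − 1/ln 2 − λ = 0   ⟹   p_i = 2^(−λ − 1/ln 2)   for every i,

so every p_i takes the same value, and the constraint ∂Λ/∂λ = −(Σ_i p_i − 1) = 0 then forces p_i = 1/n.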

Acknowledgements: The idea of EM as coordinate ascent was first presented in “A View of
the EM Algorithm that Justifies Incremental, Sparse, and other Variants” by R. M. Neal and
G. E. Hinton. This presentation is also indebted to an unpublished manuscript by M. I. Jordan and
C. Bishop.
