The EM Algorithm

Ajit Singh

November 20, 2005
1 Introduction
Expectation-Maximization (EM) is a technique for point estimation. Given a set of observable
variables X and unknown (latent) variables Z, we want to estimate parameters θ in a model.
Example 1.1 (Binomial Mixture Model). You have two coins with unknown probabilities of
heads, denoted p and q respectively. The first coin is chosen with probability π and the second
coin is chosen with probability 1 − π. The chosen coin is flipped once and the result is recorded:
x = {1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1} (heads = 1, tails = 0). Let Zi ∈ {0, 1} denote which coin was
used on toss i.
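As a concrete illustration of this generative process, here is a minimal Python sketch that simulates tosses from the two-coin mixture; the function name sample_tosses, the seed, and the parameter values in the call are assumptions made only for the example, not part of the model above.

import random

def sample_tosses(n, pi, p, q, seed=0):
    """Simulate n tosses from the two-coin mixture of Example 1.1."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        z = 1 if rng.random() < pi else 0      # latent: 1 = coin with heads prob. p
        heads_prob = p if z == 1 else q
        xs.append(1 if rng.random() < heads_prob else 0)
        zs.append(z)
    return xs, zs

# Only xs would be observed; zs is returned here just to expose the latent structure.
xs, zs = sample_tosses(12, pi=0.6, p=0.8, q=0.3)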
In Example 1.1 we added latent variables Zi for reasons that will become apparent. The parameters
we want to estimate are θ = (p, q, π). Two criteria for point estimation are maximum likelihood,
which chooses θ to maximize log p(x|θ), and maximum a posteriori, which chooses θ to maximize
log p(x|θ) + log p(θ).
Our presentation will focus on the maximum likelihood case (ML-EM); the maximum a posteriori
case (MAP-EM) is very similar.¹
2 Notation
X              Observed variables.
Z              Latent (unobserved) variables.
θ(t)           The estimate of the parameters at iteration t.
ℓ(θ)           The marginal log-likelihood, log p(x|θ).
log p(x, z|θ)  The complete log-likelihood, i.e., the log-likelihood when we know the value of Z.
q(z|x, θ)      Averaging distribution, a free distribution that EM gets to vary.
Q(θ|θ(t))      The expected complete log-likelihood, Σz q(z|x, θ) log p(x, z|θ).
H(q)           Entropy of the distribution q(z|x, θ).
¹ In MAP-EM the M-step is a MAP estimate, instead of an ML estimate.
3 Derivation
We could directly maximize ℓ(θ) = log Σz p(x, z|θ) using a gradient method (e.g., gradient ascent,
conjugate gradient, quasi-Newton), but sometimes the gradient is hard to compute, hard to
implement, or we do not want to bother adding in a black-box optimization routine.
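For concreteness, here is a small Python sketch of this direct approach for Example 1.1, using scipy.optimize.minimize (L-BFGS-B, a quasi-Newton method) to maximize ℓ(θ); the starting point, the bound eps, and the data array are assumptions made for the illustration.

import numpy as np
from scipy.optimize import minimize

# Observed tosses from Example 1.1.
x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])

def neg_marginal_loglik(theta, x):
    """Negative marginal log-likelihood -l(theta) for theta = (p, q, pi)."""
    p, q, pi = theta
    # Per-toss mixture likelihood: pi p^x (1-p)^(1-x) + (1-pi) q^x (1-q)^(1-x).
    lik = pi * p**x * (1 - p)**(1 - x) + (1 - pi) * q**x * (1 - q)**(1 - x)
    return -np.sum(np.log(lik))

# Quasi-Newton maximization of l(theta), with box constraints keeping
# each parameter strictly inside (0, 1).
eps = 1e-6
result = minimize(neg_marginal_loglik, x0=[0.6, 0.4, 0.5], args=(x,),
                  method="L-BFGS-B", bounds=[(eps, 1 - eps)] * 3)
p_hat, q_hat, pi_hat = result.x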
EM takes a different route, built on the lower bound

ℓ(θ) = log Σz q(z|x, θ) [p(x, z|θ) / q(z|x, θ)]  ≥  Σz q(z|x, θ) log [p(x, z|θ) / q(z|x, θ)]  =  F(q, θ),

which follows from Jensen's inequality (log is concave), and where q(z|x, θ) is an arbitrary density
over Z. This inequality is foundational to what are called "variational methods" in the machine
learning literature. Instead of maximizing ℓ(θ) directly, EM maximizes the lower-bound F(q, θ)
via coordinate ascent:

q(t+1) = arg maxq F(q, θ(t))        (4)

θ(t+1) = arg maxθ F(q(t+1), θ)        (5)
Starting with some initial value of the parameters θ(0), one cycles between the E and M-steps
until θ(t) converges to a local maximum. Computing equation 4 directly involves fixing θ = θ(t)
and optimizing over the space of distributions, which looks painful. However, it is possible to show
that q(t+1) = p(z|x, θ(t)). We can stop worrying about q as a variable over the space of distributions,
since we know the optimal q is a distribution that depends on θ(t). To compute equation 5 we fix
q and note that

F(q(t+1), θ) = Σz p(z|x, θ(t)) log p(x, z|θ) + H(q(t+1)) = Q(θ|θ(t)) + H(q(t+1)).

Since the entropy term does not depend on θ, maximizing F(q(t+1), θ) over θ is equivalent to
maximizing the expected complete log-likelihood Q(θ|θ(t)); this maximization is the M-step.
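The following Python check, a small illustration with assumed parameter values rather than part of the derivation, confirms both facts numerically for Example 1.1: F(q, θ) ≤ ℓ(θ) for an arbitrary per-toss q, with equality when q is the posterior p(z|x, θ).

import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])
p, q, pi = 0.7, 0.4, 0.6          # assumed parameter values theta = (p, q, pi)

def joint(xi, z):
    """p(xi, z | theta): z = 1 is the coin with heads prob. p, z = 0 the coin with heads prob. q."""
    if z == 1:
        return pi * p**xi * (1 - p)**(1 - xi)
    return (1 - pi) * q**xi * (1 - q)**(1 - xi)

def marginal_loglik():
    """l(theta) = sum_i log sum_z p(xi, z | theta)."""
    return sum(np.log(joint(xi, 0) + joint(xi, 1)) for xi in x)

def lower_bound(q1):
    """F(q, theta), where q1[i] = q(Zi = 1 | xi) for each toss."""
    total = 0.0
    for xi, qi in zip(x, q1):
        for z, qz in ((1, qi), (0, 1 - qi)):
            total += qz * (np.log(joint(xi, z)) - np.log(qz))
    return total

arbitrary_q = np.full(len(x), 0.5)
posterior_q = np.array([joint(xi, 1) / (joint(xi, 0) + joint(xi, 1)) for xi in x])

assert lower_bound(arbitrary_q) <= marginal_loglik() + 1e-9      # bound holds for any q
assert abs(lower_bound(posterior_q) - marginal_loglik()) < 1e-9  # tight at the posterior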
3.1 Limitations of EM
EM is useful for several reasons: conceptual simplicity, ease of implementation, and the fact that
each iteration improves ℓ(θ). The rate of convergence on the first few steps is typically quite good,
but can become excruciatingly slow as you approach a local optimum. Generally, EM works best
when the fraction of missing information is small and the dimensionality of the data is not too
large. EM can require many iterations, and higher dimensionality can dramatically slow down the
E-step.
Returning to Example 1.1, the E-step sets q(t+1) = p(z|x, θ(t)), which here amounts to computing,
for each toss i, the posterior probability that the first coin (heads probability p) was used:

μi(t) = p(Zi = 1 | xi, θ(t)) = π(t) (p(t))^xi (1 − p(t))^(1−xi) / [ π(t) (p(t))^xi (1 − p(t))^(1−xi) + (1 − π(t)) (q(t))^xi (1 − q(t))^(1−xi) ]

Setting the partial derivatives of Q(θ|θ(t)) to zero then gives the M-step updates (n is the number
of tosses):

∂Q(θ|θ(t))/∂π = 0  =⇒  π(t+1) = (1/n) Σi μi(t)

∂Q(θ|θ(t))/∂p = 0  =⇒  p(t+1) = Σi μi(t) xi / Σi μi(t)

∂Q(θ|θ(t))/∂q = 0  =⇒  q(t+1) = Σi (1 − μi(t)) xi / Σi (1 − μi(t))
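The following Python sketch runs these E and M updates on the data of Example 1.1; the starting values and the stopping tolerance are assumptions made for the illustration, and different starts can converge to different local maxima.

import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])
n = len(x)

p, q, pi = 0.6, 0.4, 0.5   # assumed starting values theta(0); p != q to break symmetry

for t in range(200):
    # E-step: mu_i = p(Zi = 1 | xi, theta(t)), the responsibility of the first coin.
    coin1 = pi * p**x * (1 - p)**(1 - x)
    coin2 = (1 - pi) * q**x * (1 - q)**(1 - x)
    mu = coin1 / (coin1 + coin2)

    # M-step: the three updates obtained from setting dQ/dpi, dQ/dp, dQ/dq to zero.
    pi_new = mu.sum() / n
    p_new = (mu * x).sum() / mu.sum()
    q_new = ((1 - mu) * x).sum() / (1 - mu).sum()

    converged = max(abs(pi_new - pi), abs(p_new - p), abs(q_new - q)) < 1e-8
    p, q, pi = p_new, q_new, pi_new
    if converged:
        break

print(f"p = {p:.3f}, q = {q:.3f}, pi = {pi:.3f}")

A useful sanity check while iterating is that the marginal log-likelihood ℓ(θ(t)) never decreases from one iteration to the next.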
M-steps frequently involve constrained maximization (for example, mixture weights that must sum
to one). As an example of such a problem, consider finding the distribution (p1, p2, . . . , pn) with
the largest entropy:

maximize     H(p1, p2, . . . , pn) = − Σ(i=1 to n) pi log2 pi        (12)

such that    Σ(i=1 to n) pi = 1        (13)
Such problems can be solved using the method of Lagrange multipliers. To maximize a function
f(p1, . . . , pn) over points p = (p1, . . . , pn) in an open subset of Rn subject to the constraint
g(p) = 0, it suffices to find stationary points of the unconstrained function

Λ(p, λ) = f(p) + λ g(p),

i.e., to solve

∂Λ(p, λ)/∂pi = 0 for each i,        ∂Λ(p, λ)/∂λ = 0,

where the last equation simply recovers the constraint g(p) = 0.
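Applied to the entropy problem in equations (12) and (13), this works out as follows (a worked step included here for concreteness):

Λ(p, λ) = − Σ(i=1 to n) pi log2 pi + λ (Σ(i=1 to n) pi − 1)

∂Λ/∂pi = − log2 pi − 1/ln 2 + λ = 0  =⇒  pi = 2^(λ − 1/ln 2) for every i,

so all the pi are equal, and the constraint Σi pi = 1 forces pi = 1/n: the uniform distribution maximizes the entropy.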