
EM Algorithm

Shu-Ching Chang and Hyung Jin Kim


December 9, 2007

1 Introduction
It is very important to understand the data structure before doing the data analysis. However, the data subject to analysis often contain many missing values or incomplete information. For example, survival time data often contain missing or incomplete values because of death or job transfer; such data are called censored data. Since these observations carry incomplete but useful information, ignoring them in the analysis risks biased results. The EM algorithm can deal with missing data and unobserved variables, so it is useful in a wide variety of incomplete-data problems.

In this project, we investigate the EM algorithm for estimating parameters in missing data and mixture density problems. Reference [4] shows that combining the EM algorithm with the bootstrap improves satellite image fusion. Since the bootstrap uses resampling ideas that can be applied to check the adequacy of standard measures of uncertainty and to give quick approximate solutions, we are also interested in comparing the parameter estimates obtained from the EM algorithm with those from the combination of EM and bootstrap. Therefore, after using the EM algorithm to estimate the unknown parameters from the data, we use these estimates in a parametric bootstrap to estimate the same parameters again and compare the results with those from the EM algorithm alone. We can then observe whether or not the two methods give compatible results.

The paper is organized as follows. In Section 2, we introduce the EM algorithm, including the Expectation (E) step and the Maximization (M) step. In Section 3, we apply the EM method to a real data analysis and then use the estimates obtained from the EM algorithm in a parametric bootstrap. Finally, Section 4 contains the summary and discussion of the results we obtained.

2 EM Algorithm
The EM algorithm has two main applications. The first case occurs when the data have missing values due to limitations or problems with the observation process. The second case occurs when the likelihood function can be obtained and simplified by assuming that there are additional but unobserved (missing) parameters.

Suppose the data X, generated by some assumed distribution, contain missing values or unobserved quantities; we call X the incomplete data. We assume that the complete data Z = (X, Y) exist, with Y being the missing data, and that a joint density function also exists:

p(z|θ) = p(x, y|θ) = p(y|x, θ) p(x|θ)

where θ is the set of unknown parameters of the distribution, including any missing parameters.

With this density function, we define the complete-data likelihood as

L(θ|Z) = L(θ|X, Y) = p(X, Y|θ).

The original likelihood L(θ|X) is called the incomplete-data likelihood function. Since the missing data Y are unknown but follow an assumed distribution, we can regard L(θ|X, Y) as a function of the random variable Y with X and θ held fixed:

L(θ|X, Y) = f_{X,θ}(Y).

At the E step, the EM algorithm computes the expected value of the complete-data log-likelihood with respect to the missing data Y, given the observed data X and the current parameter estimates; at the M step, it maximizes this expectation over θ. Repeating the E and M steps increases the log-likelihood at each iteration, and the algorithm is guaranteed to converge to a local maximum of the likelihood function.
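As a rough sketch of this iteration (our own generic skeleton, not code from the paper), the loop below alternates the two steps until the parameter vector stabilizes; e.step and m.step are placeholders for problem-specific functions such as those derived for the mixture example in Section 3, and the stopping rule is one common choice rather than part of the algorithm's definition.

    # Generic EM iteration: alternate E and M steps until the parameters
    # change by less than a small tolerance (one common stopping rule).
    em.iterate <- function(theta, e.step, m.step, tol = 1e-4, max.iter = 1000) {
      for (k in 1:max.iter) {
        expected  <- e.step(theta)       # E step: expected complete-data quantities
        theta.new <- m.step(expected)    # M step: maximize the expectation
        if (max(abs(theta.new - theta)) < tol) break
        theta <- theta.new
      }
      theta.new
    }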

2.1 Expectation (E) Step


First, we define the expectation of the complete-data log-likelihood function as:

$$Q(\theta, \theta^{(i-1)}) = E\left[\log p(X, Y \mid \theta) \mid X, \theta^{(i-1)}\right] \qquad (1)$$


where θ^(i-1) is the set of current parameter estimates used to evaluate the expectation, and θ is the new parameter value adjusted to increase Q. Here, X and θ^(i-1) are known constants and θ is the variable to be optimized. Since Y is a missing random variable with assumed distribution f(y|X, θ^(i-1)), the expectation in equation (1) can be written as

$$E\left[\log p(X, Y \mid \theta) \mid X, \theta^{(i-1)}\right] = \int_{y \in \Omega} \log p(X, y \mid \theta)\, f(y \mid X, \theta^{(i-1)})\, dy \qquad (2)$$

where Ω is the space of values that y can take, and f(y|X, θ^(i-1)) is the marginal distribution of the missing data Y given the observed data and the current parameter estimates.

2.2 Maximization (M) Step


At the M step, we maximize the expectation we obtain in the E step. That is
to find:
$$\theta^{(i)} = \underset{\theta}{\operatorname{arg\,max}}\; Q(\theta, \theta^{(i-1)})$$

Maximizing equation (1) can be easy or hard depending on the form of p(X, y|θ). For instance, if p(X, y|θ) is a simple normal density with θ = (µ, σ²), we set the derivative of log L(θ|X, Y) equal to zero and solve directly for µ and σ², as in the sketch below. More complicated models require more elaborate techniques.
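As a minimal R illustration of this easy case (the data below are simulated; the true values 5 and 2 are our own choices, not from the paper), the maximum-likelihood solutions are simply the sample mean and the average squared deviation:

    # Simple normal model with theta = (mu, sigma^2) and complete data:
    # setting the log-likelihood derivatives to zero gives closed forms.
    set.seed(1)
    x <- rnorm(100, mean = 5, sd = 2)      # hypothetical complete data
    mu.hat     <- mean(x)                  # solves d/d mu      log L = 0
    sigma2.hat <- mean((x - mu.hat)^2)     # solves d/d sigma^2 log L = 0 (divides by n, not n - 1)
    c(mu.hat, sigma2.hat)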

3 Example
The data used for the example is the data set faithful included in R. It contains the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. From these data we use the waiting times. The histogram of the waiting times resembles a mixture of two Gaussian distributions, corresponding to short and long waiting times. We treat the indicator of which mixture component each observation belongs to as missing data, and the EM algorithm estimates the proportion of observations belonging to each normal distribution along with the other unknown parameters, the means and variances. A short sketch for reproducing the histogram follows.
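The histogram in Figure 1 can be reproduced directly from the built-in data set; this is a minimal sketch, and the number of bins is our own choice:

    data(faithful)                      # built-in Old Faithful data in R
    hist(faithful$waiting, breaks = 20,
         xlab = "Waiting", main = "Histogram of Waiting")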

The density for the mixture of two Gaussian populations is

$$f_W(w \mid \theta) = p \cdot \frac{1}{\sigma_1}\,\varphi\!\left(\frac{w-\mu_1}{\sigma_1}\right) + (1-p) \cdot \frac{1}{\sigma_2}\,\varphi\!\left(\frac{w-\mu_2}{\sigma_2}\right)$$

where φ denotes the standard normal density. Our unknown parameters are p, µ1, µ2, σ1², σ2², where µ1 and σ1² are the mean and variance of the normal distribution with the shorter waiting times and µ2 and σ2² are the mean and variance of the distribution with the longer waiting times. The parameter p is the proportion of observations coming from the normal distribution with the shorter waiting times. We let θ = (p, µ1, µ2, σ1², σ2²).
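In R, this mixture density can be written as a short function; the parameter values in the plotting call below are rough guesses for illustration, not the estimates obtained later:

    # Two-component Gaussian mixture density f_W(w | theta).
    # Note that (1/sigma) * phi((w - mu)/sigma) is exactly dnorm(w, mu, sigma).
    dmix <- function(w, p, mu1, mu2, sigma1, sigma2) {
      p * dnorm(w, mu1, sigma1) + (1 - p) * dnorm(w, mu2, sigma2)
    }
    curve(dmix(x, p = 0.4, mu1 = 55, mu2 = 80, sigma1 = 6, sigma2 = 6),
          from = 40, to = 100, xlab = "Waiting", ylab = "Density")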

The indicator variable treated as missing data is

$$Y_i = \begin{cases} 1 & \text{if } W_i \text{ belongs to the distribution of shorter waiting times} \\ 0 & \text{if } W_i \text{ belongs to the distribution of longer waiting times} \end{cases}$$

where Yi is Bernoulli distributed with parameter p.

Therefore, the likelihood for the complete data is given by

$$L_n(\theta \mid W, Y) = \prod_{i=1}^{n} p^{Y_i}(1-p)^{1-Y_i} \left[\frac{1}{\sigma_1}\,\varphi\!\left(\frac{W_i-\mu_1}{\sigma_1}\right)\right]^{Y_i} \left[\frac{1}{\sigma_2}\,\varphi\!\left(\frac{W_i-\mu_2}{\sigma_2}\right)\right]^{1-Y_i}$$

The corresponding log-likelihood function for the data faithful becomes

$$\begin{aligned} \ell_n(\theta \mid W, Y) ={}& \sum_{i=1}^{n} Y_i \log(p) + \sum_{i=1}^{n} (1-Y_i)\log(1-p) \\ &- \frac{1}{2}\sum_{i=1}^{n} Y_i \log(2\pi\sigma_1^2) - \frac{1}{2\sigma_1^2}\sum_{i=1}^{n} Y_i (W_i-\mu_1)^2 \\ &- \frac{1}{2}\sum_{i=1}^{n} (1-Y_i)\log(2\pi\sigma_2^2) - \frac{1}{2\sigma_2^2}\sum_{i=1}^{n} (1-Y_i)(W_i-\mu_2)^2 \end{aligned}$$

We now apply the EM algorithm and find the expectation of Yi. The conditional distribution of Yi given Wi is

$$Y_i \mid W_i, \theta^{(k)} \sim \mathrm{Bin}\left(1, p_i^{(k)}\right)$$

with

$$p_i^{(k)} = \frac{p^{(k)}\,\dfrac{1}{\sigma_1^{(k)}}\,\varphi\!\left(\dfrac{W_i-\mu_1^{(k)}}{\sigma_1^{(k)}}\right)}{p^{(k)}\,\dfrac{1}{\sigma_1^{(k)}}\,\varphi\!\left(\dfrac{W_i-\mu_1^{(k)}}{\sigma_1^{(k)}}\right) + \left(1-p^{(k)}\right)\dfrac{1}{\sigma_2^{(k)}}\,\varphi\!\left(\dfrac{W_i-\mu_2^{(k)}}{\sigma_2^{(k)}}\right)}$$

where the superscript (k) denotes the parameter estimates at the kth step and θ^(0) holds the initial values. Thus, by the properties of the binomial distribution, the conditional mean is

$$E(Y_i \mid W_i, \theta^{(k)}) = p_i^{(k)}.$$
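This conditional mean can be computed for all observations at once in R; the sketch below (with our own function and argument names) mirrors the quantity Ep in the Appendix code:

    # E step: responsibility p_i^(k) of the shorter-waiting-time component
    # under the current parameters (p, mu1, mu2, sigma1^2, sigma2^2).
    e.step <- function(W, p, mu1, mu2, sigma1sq, sigma2sq) {
      num <- p * dnorm(W, mu1, sqrt(sigma1sq))
      den <- num + (1 - p) * dnorm(W, mu2, sqrt(sigma2sq))
      num / den
    }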
By substituting p_i^(k) for Yi, we obtain the expectation function

$$\begin{aligned} Q(\theta \mid \theta^{(k)}) ={}& \sum_{i=1}^{n} p_i^{(k)}\log(p) + \sum_{i=1}^{n}\left(1-p_i^{(k)}\right)\log(1-p) \\ &- \frac{1}{2}\sum_{i=1}^{n} p_i^{(k)}\log(2\pi\sigma_1^2) - \frac{1}{2\sigma_1^2}\sum_{i=1}^{n} p_i^{(k)}(W_i-\mu_1)^2 \\ &- \frac{1}{2}\sum_{i=1}^{n}\left(1-p_i^{(k)}\right)\log(2\pi\sigma_2^2) - \frac{1}{2\sigma_2^2}\sum_{i=1}^{n}\left(1-p_i^{(k)}\right)(W_i-\mu_2)^2 \end{aligned}$$

In the maximization step, setting the first derivative of Q(θ|θ^(k)) with respect to each parameter equal to zero results in the following update equations (an R sketch of these updates follows the equations).
$$p^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n} p_i^{(k)}$$

$$\mu_1^{(k+1)} = \frac{\sum_{i=1}^{n} p_i^{(k)} W_i}{\sum_{i=1}^{n} p_i^{(k)}}, \qquad \mu_2^{(k+1)} = \frac{\sum_{i=1}^{n}\left(1-p_i^{(k)}\right) W_i}{\sum_{i=1}^{n}\left(1-p_i^{(k)}\right)}$$

$$\left(\sigma_1^{(k+1)}\right)^2 = \frac{\sum_{i=1}^{n} p_i^{(k)}\left(W_i-\mu_1^{(k+1)}\right)^2}{\sum_{i=1}^{n} p_i^{(k)}}, \qquad \left(\sigma_2^{(k+1)}\right)^2 = \frac{\sum_{i=1}^{n}\left(1-p_i^{(k)}\right)\left(W_i-\mu_2^{(k+1)}\right)^2}{\sum_{i=1}^{n}\left(1-p_i^{(k)}\right)}$$
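A direct transcription of these updates into R could look as follows (a sketch with our own names; pk denotes the E-step responsibilities p_i^(k)):

    # M step: closed-form updates from setting the derivatives of Q to zero.
    m.step <- function(W, pk) {
      p        <- mean(pk)
      mu1      <- sum(pk * W) / sum(pk)
      mu2      <- sum((1 - pk) * W) / sum(1 - pk)
      sigma1sq <- sum(pk * (W - mu1)^2) / sum(pk)
      sigma2sq <- sum((1 - pk) * (W - mu2)^2) / sum(1 - pk)
      c(p, mu1, mu2, sigma1sq, sigma2sq)
    }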

For the initial value of θ, we choose the following values based on Figure 1:

$$p^{(0)} = 0.4, \quad \mu_1^{(0)} = 40, \quad \mu_2^{(0)} = 90, \quad \sigma_1^{(0)} = 4, \quad \sigma_2^{(0)} = 4$$

[Figure 1: Histogram of Waiting. x-axis: Waiting (40 to 100); y-axis: Frequency.]

After running the EM algorithm, our estimates of the unknown parameters are given in Table 1. In addition, Table 2 shows the results from the combination of the EM algorithm and the bootstrap.

p       µ1      µ2      σ1²     σ2²
0.35    54.22   79.91   29.86   35.98

Table 1: Estimates Using EM Algorithm

µ1      µ2      σ1²     σ2²
54.26   79.88   29.74   36.03

Table 2: Estimates Using the Combination of EM Algorithm and Bootstrap

4 Discussion
In our study, we review some of the literature on the EM algorithm, which has shown strong performance in practice for dealing with missing data and mixture density problems. In Section 3, we apply the EM algorithm to the “faithful” data set, which has the characteristics of missing values (the unobserved component indicators) and mixture densities. Since the bootstrap is a resampling method that can improve estimator properties, notably in small samples, we also want to know whether the bootstrap combined with the EM algorithm can improve on the accuracy of the EM algorithm alone. We therefore use the estimates from the EM algorithm in a parametric bootstrap and compare the results from the two approaches.

From Figure 1, we can clearly see that the waiting times of the faithful data set follow a mixture of two normal distributions. We apply the EM algorithm to this data set in order to estimate the unknown means, variances, and mixing proportion. From Table 1, about 35% of the observations come from the normal distribution with the shorter waiting times. The means of the normal distributions with the shorter and longer waiting times are about 54.22 and 79.91, respectively, which are very close to what the histogram of the waiting times in Figure 1 suggests. In addition, the variances we obtain from the EM algorithm seem appropriate. We may reasonably conclude that the parameter estimates obtained from the EM algorithm for this mixture density data set closely reflect the characteristics of the real data. Therefore, the EM algorithm appears to be a good procedure for uncovering the characteristics of a data set, especially one with a mixture density. Table 2 shows the results from the combination of the EM algorithm with the bootstrap, which also yields parameter estimates close to those of the original data set.

Many interesting problems may be worth future study, such as devising a criterion to decide which procedure gives more precise parameter estimates for the original data set. In addition, since it is important to choose appropriate initial values for the EM algorithm, a procedure for choosing initial values is also needed. Finally, we might try to find a more suitable way of combining the EM algorithm and the bootstrap.

5 Appendix
> data(faithful)
> attach(faithful)
>
> ## EM Algorithm
>
> W = waiting
>
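> ## initial values: s = (p, mu1, mu2, sigma1^2, sigma2^2)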
> s = c(0.5, 40, 90, 16, 16)
>
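> ## one EM update: Ep holds the E-step responsibilities of component 1;
> ## the remaining lines are the M-step updates of s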
> em = function(W,s) {
+ Ep = s[1]*dnorm(W, s[2], sqrt(s[4]))/(s[1]*dnorm(W, s[2], sqrt(s[4])) +
+ (1-s[1])*dnorm(W, s[3], sqrt(s[5])))
+ s[1] = mean(Ep)
+ s[2] = sum(Ep*W) / sum(Ep)
+ s[3] = sum((1-Ep)*W) / sum(1-Ep)
+ s[4] = sum(Ep*(W-s[2])^2) / sum(Ep)
+ s[5] = sum((1-Ep)*(W-s[3])^2) / sum(1-Ep)
+ s
+ }
>
> iter = function(W, s) {
+   # repeat EM updates until no parameter changes by more than 0.0001
+   s1 = em(W, s)
+   if (max(abs(s - s1)) > 0.0001) {
+     iter(W, s1)
+   } else {
+     s1
+   }
+ }
>
> iter(W,s)
[1] 0.3507784 54.2179838 79.9088649 29.8611799 35.9824271
>
> p = iter(W, s)
>
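> ## EM estimates: p1 = p, p2 = mu1, p3 = mu2, p4 = sigma1^2, p5 = sigma2^2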
> p1<-p[1]
> p2<-p[2]
> p3<-p[3]
> p4<-p[4]
> p5<-p[5]
>
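> ## parametric bootstrap from the fitted mixture: simulate each component
> ## and average the sample means and variances over B replications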
> Boot<-function(B){
+ r<-0
+ k<-0
+ bootmean1 <-rep(0, B)
+ bootvar1<-rep(0, B)
+ bootmean2<-rep(0, B)
+ bootvar2<-rep(0, B)
+ for(i in 1:B){
+ p<-runif(1, 0, 1)
+ if(p<p1){
+ boot1<-rnorm(p1*272, p2, sqrt(p4))
+ bootmean1[i]<-mean(boot1)
+ bootvar1[i]<-var(boot1)
+ r<-r+1
+ }
+ else{
+ boot2<-rnorm((1-p1)*272, p3, sqrt(p5))
+ bootmean2[i]<-mean(boot2)
+ bootvar2[i]<-var(boot2)
+ k<-k+1
+ }
+ }
+ meanbootm1<-sum(bootmean1)/r
+ meanbootvar1<-sum(bootvar1)/r
+ meanbootm2<-sum(bootmean2)/k
+ meanbootvar2<-sum(bootvar2)/k
+ list(meanbootm1= meanbootm1, meanbootvar1= meanbootvar1,
+ meanbootm2= meanbootm2, meanbootvar2= meanbootvar2 )
+ }
>
> Boot(1000)
$meanbootm1
[1] 54.26174

$meanbootvar1
[1] 29.74411

$meanbootm2
[1] 79.88433

$meanbootvar2
[1] 36.02818

References

[1] Jeff A. Bilmes (April 1998). International Computer Science Institute, Computer Science Division, Department of Electrical Engineering and Computer Science.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1-38.

[3] Richard A. Redner and Homer F. Walker (April 1984). Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26(2), 195-239.

[4] Tijani Delleji, Mourad Zribi, and Ahmed Ben Hamida (2007). On the EM Algorithm and Bootstrap Approach Combination for Improving Satellite Image Fusion. International Journal of Signal Processing, 4(1). ISSN 1304-4478.

[5] Geoffrey J. McLachlan and Thriyambakam Krishnan (1997). The EM Algorithm and Extensions. John Wiley & Sons, Inc.

[6] Michiko Watanabe and Kazunori Yamaguchi (1991). The EM Algorithm and Related Statistical Models. Statistics: A Dekker Series of Textbooks and Monographs.

[7] Kate Cowles, The University of Iowa. Lecture Note 11 (September 24, 2006).
