ML | Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is an iterative method, widely used in unsupervised machine learning, for estimating unknown parameters in statistical models. It is especially useful when some of the data is missing or hidden.
It works in two steps:
- E-step (Expectation Step): Estimates missing or hidden values using current parameter estimates.
- M-step (Maximization Step): Updates model parameters to maximize the likelihood based on the estimated values from the E-step.
This process repeats until the model reaches a stable solution, with the likelihood of the data improving (or at least not decreasing) at each iteration. EM is widely used in clustering (e.g., Gaussian Mixture Models) and for handling missing data; a schematic of the loop is sketched below.
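As a rough Python sketch of that loop (the e_step, m_step and log_likelihood functions here are placeholders whose details depend on the chosen model; a concrete Gaussian mixture version is implemented step by step later in this article):
def run_em(X, params, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    # Generic EM skeleton: alternate E- and M-steps until the fit stops improving.
    prev_ll = float('-inf')
    for _ in range(max_iter):
        hidden = e_step(X, params)      # E-step: estimate hidden values / responsibilities
        params = m_step(X, hidden)      # M-step: re-fit parameters to those estimates
        ll = log_likelihood(X, params)  # how well the current parameters explain the data
        if ll - prev_ll < tol:          # stop once the improvement is negligible
            break
        prev_ll = ll
    return params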

Expectation-Maximization in EM Algorithm
By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of the observed data. It is commonly used for clustering, where latent variables are inferred, and it has applications in many fields, including machine learning, computer vision, and natural language processing.
Key Terms in Expectation-Maximization (EM) Algorithm
Let's go over some of the most commonly used terms in the Expectation-Maximization (EM) algorithm:
- Latent Variables: These are hidden or unmeasured variables that affect what we can observe in the data. We can’t directly see them, but we can make educated guesses about them based on the data we can see.
- Likelihood: This refers to the probability of seeing the data we have, based on certain assumptions or parameters. The EM algorithm tries to find the best parameters that make the data most likely.
- Log-Likelihood: This is just the natural log of the likelihood function. It’s used to make calculations easier and measure how well the model fits the data. The EM algorithm tries to maximize the log-likelihood to improve the model fit.
- Maximum Likelihood Estimation (MLE): This is a technique for estimating the parameters of a model. It does this by finding the parameter values that make the observed data most likely (maximizing the likelihood).
- Posterior Probability: In Bayesian terms, this is the probability of an unknown quantity after the observed data has been taken into account. In EM, the E-step computes the posterior probability of each latent variable given the observed data and the current parameters; these values are often called responsibilities (the formulas after this list make this concrete).
- Expectation (E) Step: In this step, the algorithm estimates the missing or hidden information (latent variables) based on the observed data and current parameters. It calculates probabilities for the hidden values given what we can see.
- Maximization (M) Step: This step updates the parameters by finding the values that maximize the likelihood, based on the estimates from the E-step. It often involves running optimization methods to get the best parameters.
- Convergence: The point at which the algorithm has reached a stable solution. This is checked by seeing whether the changes in the model's parameters or the log-likelihood are small enough to stop the process.
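To make the likelihood-related terms concrete: for a model with parameters \theta and observed data points x_1, \dots, x_n, the quantities above are (standard definitions, stated here for reference):
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad
\log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta), \qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log L(\theta)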
How the Expectation-Maximization (EM) Algorithm Works
So far, we’ve discussed the key terms in the EM algorithm. Now, let’s dive into how the EM algorithm works. Here’s a step-by-step breakdown of the process:

EM Algorithm Flowchart
- Initialization:
- The algorithm starts with initial parameter values and assumes the observed data comes from a specific model.
- E-Step (Expectation Step):
- Estimate the missing or hidden data based on the current parameters.
- Calculate the posterior probability (responsibility) of each latent variable given the observed data.
- Compute the log-likelihood of the observed data using the current parameter estimates.
- M-Step (Maximization Step):
- Update the model parameters by maximizing the log-likelihood computed in the E-step.
- This involves solving an optimization problem to find parameter values that improve the model fit.
- Convergence:
- Check if the model parameters are stable (converging).
- If the changes in the log-likelihood or the parameters are below a set threshold, stop; otherwise, repeat the E-step and M-step until convergence is reached (the closed-form updates for the Gaussian mixture example used later are shown right after this list).
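For the two-component Gaussian mixture fitted later in this article, both steps have a simple closed form (these are the standard GMM updates; \gamma_{1i} denotes the responsibility of component 1 for point x_i, and the formulas for component 2 are symmetric):
E-step: \quad \gamma_{1i} = \frac{\pi_1 \,\mathcal{N}(x_i \mid \mu_1, \sigma_1^2)}{\pi_1 \,\mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2 \,\mathcal{N}(x_i \mid \mu_2, \sigma_2^2)}
M-step: \quad \mu_1 \leftarrow \frac{\sum_i \gamma_{1i}\, x_i}{\sum_i \gamma_{1i}}, \qquad
\sigma_1^2 \leftarrow \frac{\sum_i \gamma_{1i}\, (x_i - \mu_1)^2}{\sum_i \gamma_{1i}}, \qquad
\pi_1 \leftarrow \frac{1}{n} \sum_i \gamma_{1i}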
Expectation-Maximization Algorithm: Step-by-Step Implementation
Step 01: Import the necessary libraries
import numpy as np
import seaborn as sns
from scipy.stats import norm
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
Step 02: Generate a dataset with two Gaussian components
mu1, sigma1 = 2, 1      # mean and standard deviation of the first Gaussian component
mu2, sigma2 = -1, 0.8   # mean and standard deviation of the second Gaussian component
X1 = np.random.normal(mu1, sigma1, size=200)   # 200 samples from component 1
X2 = np.random.normal(mu2, sigma2, size=600)   # 600 samples from component 2
X = np.concatenate([X1, X2])                   # combined, unlabeled dataset
sns.kdeplot(X)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.show()
Output:

Density Plot
Step 03: Initialize parameters
# Note: for simplicity these starting values use the true component samples;
# in practice the assignments are unknown, so random initial guesses are used.
mu1_hat, sigma1_hat = np.mean(X1), np.std(X1)
mu2_hat, sigma2_hat = np.mean(X2), np.std(X2)
pi1_hat, pi2_hat = len(X1) / len(X), len(X2) / len(X)
Step 04: Perform EM algorithm
- Iterates for the specified number of epochs (20 in this case).
- In each epoch, the E-step calculates the responsibilities (gamma values) by evaluating the Gaussian probability densities for each component and weighting them by the corresponding proportions.
- The M-step updates each component's mean, standard deviation, and mixing proportion, using the responsibilities as weights; the log-likelihood is recorded after every epoch.
num_epochs = 20
log_likelihoods = []

for epoch in range(num_epochs):
    # E-step: compute responsibilities (posterior probability of each component)
    gamma1 = pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
    gamma2 = pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)
    total = gamma1 + gamma2
    gamma1 /= total
    gamma2 /= total

    # M-step: update parameters using the responsibilities as weights
    mu1_hat = np.sum(gamma1 * X) / np.sum(gamma1)
    mu2_hat = np.sum(gamma2 * X) / np.sum(gamma2)
    sigma1_hat = np.sqrt(np.sum(gamma1 * (X - mu1_hat)**2) / np.sum(gamma1))
    sigma2_hat = np.sqrt(np.sum(gamma2 * (X - mu2_hat)**2) / np.sum(gamma2))
    pi1_hat = np.mean(gamma1)
    pi2_hat = np.mean(gamma2)

    # Track the log-likelihood of the data under the current parameters
    log_likelihood = np.sum(np.log(pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
                                   + pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)))
    log_likelihoods.append(log_likelihood)

plt.plot(range(1, num_epochs + 1), log_likelihoods)
plt.xlabel('Epoch')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood vs. Epoch')
plt.show()
Output:

Epoch vs Log-likelihood
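As a quick sanity check (reusing the variables from the loop above), the estimated parameters can be printed and compared with the values used to generate the data in Step 02 (mu1 = 2, sigma1 = 1, pi1 = 0.25 and mu2 = -1, sigma2 = 0.8, pi2 = 0.75):
# Compare the fitted parameters with the true generating values
print(f"Component 1: mu = {mu1_hat:.2f}, sigma = {sigma1_hat:.2f}, pi = {pi1_hat:.2f}")
print(f"Component 2: mu = {mu2_hat:.2f}, sigma = {sigma2_hat:.2f}, pi = {pi2_hat:.2f}")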
Step 05: Plot the final estimated density
X_sorted = np.sort(X)
density_estimation = (pi1_hat * norm.pdf(X_sorted, mu1_hat, sigma1_hat)
                      + pi2_hat * norm.pdf(X_sorted, mu2_hat, sigma2_hat))
plt.plot(X_sorted, gaussian_kde(X_sorted)(X_sorted), color='green', linewidth=2)
plt.plot(X_sorted, density_estimation, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.legend(['Kernel Density Estimation', 'Mixture Density'])
plt.show()
Output:

Estimated density
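In practice you rarely need to hand-code these updates: library implementations such as scikit-learn's GaussianMixture run the same EM procedure internally. A minimal sketch (assuming scikit-learn is installed and reusing the X array from Step 02):
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, max_iter=20, random_state=0)
gmm.fit(X.reshape(-1, 1))                  # scikit-learn expects shape (n_samples, n_features)
print(gmm.means_.ravel())                  # estimated component means
print(np.sqrt(gmm.covariances_).ravel())   # estimated component standard deviations
print(gmm.weights_)                        # estimated mixing proportions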
Advantages of the EM Algorithm
- Guaranteed progress: each iteration is guaranteed not to decrease the likelihood of the observed data, so every step moves the model toward a (locally) better fit.
- Simple to implement: the two steps (E-step and M-step) are often straightforward to code for many problems.
- Efficient updates: in many models the M-step has a direct closed-form solution, which makes each iteration cheap to compute.
Disadvantages of the EM Algorithm
- Slow convergence: it can take many iterations to reach a stable solution.
- Local optima: instead of finding the globally best solution, EM may settle into a local optimum, so the result depends on the initial parameter values.
- Needs extra probabilities: in some applications (for example, training hidden Markov models with the Baum-Welch algorithm), EM requires both forward and backward probabilities, which makes it somewhat more complex than methods that use only a forward pass.
Conclusion
The EM algorithm iteratively estimates missing data and updates model parameters to improve accuracy. By alternating between the E-step and M-step, it refines the model until it converges, making it a powerful tool for handling hidden or incomplete data in machine learning.