
Expectation Maximization (EM)

Assume the prior distribution of the parameters is $P(\theta)$ and the distribution of the data is $P(X)$. Then the joint probability of the data and parameters is

\begin{displaymath}P(X,\theta)=P(X\vert\theta) P(\theta) \end{displaymath}

The posterior distribution of the model parameters can be obtained according to Bayes' theorem:

\begin{displaymath}P(\theta\vert X)=\frac{P(\theta)P(X\vert\theta)}{P(X)}
=\frac{P(\theta)P(X\vert\theta)}{\int P(\theta) P(X\vert\theta)\;d\theta}
\end{displaymath}

where $P(X\vert\theta)=L(\theta\vert X)$ is the likelihood function of the parameters $\theta$, given the observed data $X$. The goal of maximum-likelihood estimation is to find $\theta$ that maximizes the likelihood $L$:

\begin{displaymath}\theta^*=argmax_{\theta}\;L(\theta\vert X) \end{displaymath}
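To illustrate the two formulas above numerically, here is a minimal Python sketch, assuming the exponential model $p(x\vert\theta)=\theta e^{-\theta x}$ used in the example at the end of this section and an arbitrarily chosen $Exp(1)$ prior: it discretizes $\theta$ on a grid, multiplies prior by likelihood, and normalizes by a sum approximating the integral in the denominator.

\begin{verbatim}
import numpy as np

x = 5.0                                  # one observed data point
theta = np.linspace(1e-3, 2.0, 2000)     # grid over the parameter
prior = np.exp(-theta)                   # assumed prior P(theta): Exp(1)
like = theta * np.exp(-theta * x)        # P(X|theta) for the exponential model
post = prior * like                      # numerator P(theta) P(X|theta)
dtheta = theta[1] - theta[0]
post = post / (post.sum() * dtheta)      # approximates dividing by P(X)

print(theta[np.argmax(like)])            # MLE: argmax of L(theta|X), about 1/x = 0.2
\end{verbatim}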

The expectation-maximization (EM) algorithm is a method for finding maximum likelihood estimates of the parameters $\theta$ of a probabilistic model, based on unobserved or hidden variables $Y$, as well as the observed variables $X$.

Let $(X,Y)$ be the complete data, containing both the observed (but incomplete) data $X$ and the missing or hidden data $Y$. Then the complete-data likelihood of the parameters $\theta$ is

\begin{displaymath}L(\theta\vert X,Y)=p(X,Y\vert\theta)=p(Y\vert X,\theta) p(X\vert\theta) \end{displaymath}

This likelihood is a random variable, since the missing data $Y$ is unknown and assumed to be random with some distribution. By Bayes' rule, the incomplete-data likelihood is

\begin{displaymath}L(\theta\vert X)=p(X\vert\theta)=\frac{p(X,Y\vert\theta)}{p(Y\vert X,\theta)} \end{displaymath}

The EM algorithm attempts to find the value of $\theta$ which maximizes $p(X\vert\theta)$ given the observed $X$, by making use of the associated family $p(X, Y\vert\theta)$ [Dempster et al., 1977]. EM alternates between the following two steps:

The E step computes the conditional expectation of the complete-data log-likelihood, given the observed data $X$ and the current parameter estimate $\theta_t$:

\begin{displaymath}Q(\theta,\theta_t)=E_Y[\,log\;p(X,Y\vert\theta)\;\vert\;X,\theta_t\,]
=\int log\;p(X,Y\vert\theta)\;p(Y\vert X,\theta_t)\;dY
=\int log\;p(X,Y\vert\theta)\;\frac{p(Y,X\vert\theta_t)}{p(X\vert\theta_t)}\;dY \end{displaymath}

The M step finds the new parameter estimate that maximizes this expectation:

\begin{displaymath}\theta_{t+1}=argmax_{\theta}\;Q(\theta,\theta_t) \end{displaymath}

In the E step, the first argument $\theta$ of $Q(\theta,\theta_t)$ represents the parameter to be optimized to maximize the likelihood, while the second argument $\theta_t$ represents the parameters used to evaluate the expectation. In the M step, $\theta_{t+1}$ is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables $X$ under the previous parameter value $\theta_t$. The parameters $\theta_{t+1}$ found in the M step are then used to begin another E step, and the process is repeated.

Also note that the last equal sign above uses $p(Y\vert X,\theta_t)=p(Y,X\vert\theta_t)/p(X\vert\theta_t)$, where the denominator $p(X\vert\theta_t)$ is not a function of $\theta$ and is therefore a constant with respect to the maximization. In other words, up to this constant factor, the $Q$ function in the E step can also be written as

\begin{displaymath}Q(\theta,\theta_t)=\int log\;p(X,Y\vert\theta) \;p(Y,X\vert\theta_t) dY \end{displaymath}
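To make the alternation concrete, here is a minimal sketch of the generic EM loop in Python. It assumes the E step has already been reduced to a closed-form $Q(\theta,\theta_t)$ (for runnability, the one from the exponential example at the end of this section, with observed $x=5$) and carries out the M step by numerical maximization; the function names are illustrative, not part of the algorithm.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

x = 5.0                             # the observed data

def Q(theta, theta_t):
    # E step result, borrowed from the worked example below:
    # Q(theta, theta_t) = 2 log(theta) - x*theta - theta/theta_t
    return 2.0 * np.log(theta) - x * theta - theta / theta_t

theta_t = 1.0                       # initial guess theta_1
for _ in range(50):
    # M step: maximize Q(., theta_t) over theta > 0
    res = minimize_scalar(lambda th: -Q(th, theta_t),
                          bounds=(1e-9, 10.0), method="bounded")
    theta_t = res.x

print(theta_t)                      # converges to 1/x = 0.2
\end{verbatim}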

Theorem: The procedure in the M step above guarantees that

\begin{displaymath}log[L(\theta_{t+1}\vert X)] \ge log[L(\theta_t\vert X)] \end{displaymath}

with equality iff $Q(\theta_{t+1},\theta_t)=Q(\theta_t,\theta_t)$, for $t=1,2,\cdots$.

Proof: As $p(X\vert\theta)=p(X,Y\vert\theta)/p(Y\vert X,\theta)$, we have

\begin{eqnarray*}
log\;L(\theta\vert X) &=& log\;p(X\vert\theta)=log\;p(X,Y\vert\theta)-log\;p(Y\vert X,\theta) \\
&=& E_Y[\,log\;p(X,Y\vert\theta)\;\vert\;X,\theta_t\,]-E_Y[\,log\;p(Y\vert X,\theta)\;\vert\;X,\theta_t\,]
=Q(\theta,\theta_t)-H(\theta,\theta_t)
\end{eqnarray*}

(the second line follows by taking the expectation of both sides with respect to $p(Y\vert X,\theta_t)$, which leaves the left-hand side unchanged since it does not depend on $Y$)

where $H(\theta,\theta_t)$ is defined as

\begin{displaymath}H(\theta,\theta_t)\stackrel{\triangle}{=}E_Y[\,log\;p(Y\vert X,\theta)\;\vert\;X,\theta_t\,]
=\int log\;p(Y\vert X,\theta)\;p(Y\vert X,\theta_t)\;dY \end{displaymath}

Now we get:

\begin{eqnarray*}
log[L(\theta_{t+1}\vert X)]-log[L(\theta_t\vert X)] &=& log[p(X\vert\theta_{t+1})]-log[p(X\vert\theta_t)] \\
&=& [Q(\theta_{t+1},\theta_t)-Q(\theta_t,\theta_t)]+[H(\theta_t,\theta_t)-H(\theta_{t+1},\theta_t)] \ge 0
\end{eqnarray*}

since the first term is greater than or equal to zero due to the M step of the algorithm, and the second term can be shown to be greater than or equal to zero as well:
\begin{eqnarray*}
H(\theta_t,\theta_t)-H(\theta_{t+1},\theta_t) &=& \int log\;p(Y\vert X,\theta_t)\;p(Y\vert X,\theta_t)\;dY
-\int log\;p(Y\vert X,\theta_{t+1})\;p(Y\vert X,\theta_t)\;dY \\
&=& \int log\,\frac{p(Y\vert X,\theta_t)}{p(Y\vert X,\theta_{t+1})}\;p(Y\vert X,\theta_t)\;dY \;\ge\; 0
\end{eqnarray*}

The last expression is the Kullback-Leibler divergence (KL divergence) between $p(Y\vert X,\theta_t)$ and $p(Y\vert X,\theta_{t+1})$, which is always non-negative, and is zero iff

\begin{displaymath}p(Y\vert X,\theta_t)=p(Y\vert X,\theta_{t+1}) \end{displaymath}
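For exponential conditionals like those in the example below, this KL divergence has the closed form $log(a/b)+b/a-1$ for rates $a$ and $b$. A small Python sketch (assuming scipy is available) compares the closed form against direct numerical integration to confirm non-negativity:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

def kl_exponential(a, b):
    # closed-form KL( Exp(a) || Exp(b) ) = log(a/b) + b/a - 1 >= 0
    return np.log(a / b) + b / a - 1.0

def kl_numeric(a, b):
    # direct integration of int_0^inf p_a(y) log( p_a(y)/p_b(y) ) dy,
    # with the log ratio expanded as (log a - a y) - (log b - b y)
    integrand = lambda y: a * np.exp(-a * y) * ((np.log(a) - a * y) - (np.log(b) - b * y))
    val, _ = quad(integrand, 0.0, np.inf)
    return val

a, b = 0.5, 0.2     # e.g. theta_t and theta_{t+1} in the H-term above
print(kl_exponential(a, b), kl_numeric(a, b))   # both about 0.316, and >= 0
\end{verbatim}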

Example: Assume the observed variable $x=5$ and the hidden variable $y$ are independently and identically generated by an exponential distribution $p(x\vert\theta)=\theta\;e^{-\theta x}$ ($x\ge 0$):

\begin{displaymath}\int_0^\infty p(x\vert\theta) dx=\int_0^\infty \theta e^{-\theta x} dx =1\end{displaymath}


\begin{displaymath}E[x]=\int_0^\infty x\;p(x\vert\theta)\;dx
=\int_0^\infty x\;\theta e^{-\theta x}\;dx=\frac{1}{\theta} \end{displaymath}

Then the joint distribution of $x$ and $y$ is

\begin{displaymath}p(x,y\vert\theta)=p(x\vert\theta) p(y\vert\theta)=\theta e^{-\theta x}\;\theta e^{-\theta y} \end{displaymath}

and the log likelihood is

\begin{displaymath}log\;L(\theta\vert x,y)=log\;p(x,y\vert\theta)=log[\theta e^{-\theta x}\;\theta e^{-\theta y}]
=2 log\;\theta-\theta x-\theta y=2 log\;\theta-5\theta-\theta y \end{displaymath}

As $x$ and $y$ are independent, the conditional probability of $y$ is

\begin{displaymath}p(y\vert x,\theta)=p(y\vert\theta)=\theta\; e^{-\theta y} \end{displaymath}

The E step: compute $Q(\theta,\theta_t)$:

\begin{displaymath}Q(\theta,\theta_t)=\int_0^\infty log\;p(x,y\vert\theta)\;p(y\vert x,\theta_t)\;dy
=\int_0^\infty (2 log\;\theta-5\theta-\theta y)\;\theta_t e^{-\theta_t y}\;dy
=2 log\;\theta-5\theta-\frac{\theta}{\theta_t} \end{displaymath}
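This closed form can be verified by numerical integration; a quick sketch, assuming scipy is available:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

x = 5.0

def Q_numeric(theta, theta_t):
    # integrand: log p(x,y|theta) times p(y|x,theta_t)
    f = lambda y: (2*np.log(theta) - theta*x - theta*y) * theta_t * np.exp(-theta_t * y)
    val, _ = quad(f, 0.0, np.inf)
    return val

def Q_closed(theta, theta_t):
    return 2*np.log(theta) - 5.0*theta - theta/theta_t

print(Q_numeric(0.3, 0.7), Q_closed(0.3, 0.7))  # the two values should agree
\end{verbatim}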

The M step: maximize $Q(\theta,\theta_t)$:

\begin{displaymath}\frac{d}{d\theta}[2 log\;\theta-5\theta-\frac{\theta}{\theta_t}]=0 \end{displaymath}

which can be solved to yield the iteration formula:

\begin{displaymath}\theta_{t+1}=\frac{2\theta_t}{5\theta_t +1} \end{displaymath}

This iteration always converges to $\theta=0.2$, independent of the (positive) initial value $\theta_1$. In fact, for any value of the observed sample $x$, the iteration converges to $\theta=1/x$, which agrees with $E[x]=1/\theta$ for the exponential distribution.
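A quick numerical check of this convergence, and of the monotone-ascent theorem via the observed-data log-likelihood $log\;L(\theta\vert x)=log\;\theta-5\theta$ (a minimal sketch):

\begin{verbatim}
import math

x = 5.0
theta = 3.0                         # arbitrary positive starting value theta_1
prev_ll = -math.inf
for t in range(30):
    ll = math.log(theta) - theta * x           # log L(theta_t | x)
    assert ll >= prev_ll - 1e-12               # theorem: monotone ascent
    prev_ll = ll
    theta = 2.0 * theta / (5.0 * theta + 1.0)  # the M-step update derived above
print(theta)                                   # -> 0.2 = 1/x
\end{verbatim}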


Ruye Wang 2006-10-11