Factor Analysis and Expectation Maximization

The method of factor analysis (FA) models a set of $d$ observed manifest variables in ${\bf x}=[x_1,\cdots,x_d]^T$ as a linear combination of a set of $d'<d$ unobserved hidden latent variables, or common factors, in ${\bf z}=[z_1,\cdots,z_{d'}]^T$. The goal is to explain the variability and dependency among the $d$ observed variables, which are typically correlated, in terms of the latent variables, which are assumed to be independent and therefore uncorrelated. FA can therefore be considered a method for dimensionality reduction.

Specifically, we assume each of the observed variables in ${\bf x}$ is a linear combination of the $d'$ factors in ${\bf z}$

$\displaystyle x_i=\sum_{j=1}^{d'} w_{ij} z_j+e_i
=[w_{i1},\cdots,w_{id'}]
\left[\begin{array}{c}z_1\\ \vdots\\ z_{d'}\end{array}\right]+e_i,
\;\;\;\;\;\;\;\;\;(i=1,\cdots,d)$ (121)

or in matrix form

$\displaystyle {\bf x}
=\left[\begin{array}{c}x_1\\ \vdots\\ x_d\end{array}\right]
=\left[\begin{array}{ccc}w_{11}&\cdots&w_{1d'}\\ \vdots&\ddots&\vdots\\
w_{d1}&\cdots&w_{dd'}\end{array}\right]
\left[\begin{array}{c}z_1\\ \vdots\\ z_{d'}\end{array}\right]
+\left[\begin{array}{c}e_1\\ \vdots\\ e_d\end{array}\right]
={\bf Wz}+{\bf e}$ (122)

where ${\bf W}$ is a $d\times d'$ factor loading matrix, and $e_i$ is the noise associated with $x_i$. Also, for simplicity and without loss of generality, we assume the dataset has a zero mean. If ${\bf W}$ were available, the $d'$ factors in ${\bf z}$ could be found by solving this over-determined linear system of $d$ equations with $d'<d$ unknowns by the least-squares method (with minimum squared error $\vert\vert{\bf e}\vert\vert^2$):

$\displaystyle \hat{\bf z}={\bf W}^-{\bf x}
=({\bf W}^T{\bf W})^{-1}{\bf W}^T{\bf x}$ (123)

where ${\bf W}^-=({\bf W}^T{\bf W})^{-1}{\bf W}^T$ is the left pseudo-inverse of ${\bf W}$.
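
To make the model and the least-squares recovery concrete, here is a minimal Python sketch (the dimensions and values are arbitrary illustrations, not from the text) that generates one observation from ${\bf x}={\bf Wz}+{\bf e}$ with a known loading matrix and estimates $\hat{\bf z}$ by Eq. (123):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 5, 2                        # illustrative dimensions

W = rng.standard_normal((d, d_prime))    # a known loading matrix, for illustration
z = rng.standard_normal(d_prime)         # latent factors
e = 0.1 * rng.standard_normal(d)         # small observation noise
x = W @ z + e                            # observed vector, Eq. (122)

# Least-squares estimate of Eq. (123): z_hat = (W^T W)^{-1} W^T x
z_hat = np.linalg.solve(W.T @ W, W.T @ x)
print(z, z_hat)                          # z_hat approximates z when the noise is small
```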

However, as ${\bf W}$ is unavailable, it needs to be estimated together with ${\bf z}$, based on the given dataset ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$, typically with $N\gg d$. This can be done by the general method of expectation-maximization (EM), an iterative algorithm widely used in machine learning for finding maximum likelihood (ML) or maximum a posteriori (MAP) estimates of model parameters.

Specifically, we treat both ${\bf z}$ and ${\bf e}$ as random vectors, and make the following assumptions:

  - The common factors are normally distributed with zero mean and identity covariance, $p({\bf z})={\cal N}({\bf 0},{\bf I})$, i.e., $E[{\bf z}]={\bf 0}$ and $E[{\bf zz}^T]={\bf I}$;
  - The noise is normally distributed with zero mean and a diagonal covariance ${\bf\Psi}$, $p({\bf e})={\cal N}({\bf 0},{\bf\Psi})$, i.e., $E[{\bf e}]={\bf 0}$ and $E[{\bf ee}^T]={\bf\Psi}$;
  - The factors and the noise are independent of each other, so that $E[{\bf ze}^T]={\bf 0}$.

The two matrices ${\bf W}$ and ${\bf\Psi}$ defined above are the parameters of the FA model, denoted by ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$.

Based on the assumptions above, we desire to find the conditional pdf $p({\bf z}\vert{\bf x})$ of the latent variable ${\bf z}$ given the observed variable ${\bf x}$. To do so, we first find the pdf of ${\bf x}={\bf Wz}+{\bf e}$, which, as a linear combination of the two normally distributed random vectors ${\bf z}$ and ${\bf e}$, is also normally distributed with $p({\bf x})={\cal N}({\bf m}_x,{\bf\Sigma}_x)$, where

$\displaystyle {\bf m}_x=E[{\bf x}]
=E[{\bf Wz}+{\bf e}]
={\bf W}E[{\bf z}]+E[{\bf e}]={\bf 0}$ (129)

$\displaystyle {\bf\Sigma}_x=Cov[{\bf x}]
=E[({\bf x}-{\bf m}_x)({\bf x}-{\bf m}_x)^T]
=E[{\bf xx}^T]
=E\left[({\bf Wz}+{\bf e})({\bf Wz}+{\bf e})^T\right]$
$\displaystyle \;\;\;\;\;\;\;\;
={\bf W}E[{\bf zz}^T]{\bf W}^T+E[{\bf ez}^T]{\bf W}^T
+{\bf W}E[{\bf ze}^T]+E[{\bf ee}^T]
={\bf WW}^T+{\bf\Psi}$ (130)

As both ${\bf m}_x$ and ${\bf\Sigma}_x$ depend on the model parameters ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$, the pdf of ${\bf x}$ is conditioned on ${\bf\theta}$:

$\displaystyle p({\bf x}\vert{\bf\theta})={\cal N}({\bf m}_x,{\bf\Sigma}_x)
={\cal N}({\bf0},{\bf WW}^T+{\bf\Psi})$ (131)
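
As a quick numerical check (a sketch with arbitrary illustrative values), we can draw many samples under the assumptions above and compare the sample covariance of ${\bf x}$ with ${\bf WW}^T+{\bf\Psi}$ from Eq. (130):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_prime, N = 4, 2, 200_000                            # illustrative sizes

W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

Z = rng.standard_normal((N, d_prime))                    # rows are samples of z ~ N(0, I)
E = rng.multivariate_normal(np.zeros(d), Psi, size=N)    # rows are samples of e ~ N(0, Psi)
X = Z @ W.T + E                                          # rows are samples of x = Wz + e

sample_cov = X.T @ X / N                                 # estimate of E[xx^T] (zero-mean data)
model_cov = W @ W.T + Psi                                # Eq. (130)
print(np.max(np.abs(sample_cov - model_cov)))            # small for large N
```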

The joint distribution $p({\bf z},{\bf x})$ is also Gaussian with a zero mean

$\displaystyle {\bf m}
=E\left[\begin{array}{c}{\bf z}\\ {\bf x}\end{array}\right]
=\left[\begin{array}{c}{\bf m}_z\\ {\bf m}_x\end{array}\right]
=\left[\begin{array}{c}{\bf 0}\\ {\bf 0}\end{array}\right]$ (132)

and a covariance ${\bf\Sigma}$ composed of four submatrices:

$\displaystyle {\bf\Sigma}=\left[\begin{array}{cc}{\bf\Sigma}_{zz}&{\bf\Sigma}_{zx}\\
{\bf\Sigma}_{xz}&{\bf\Sigma}_{xx}\end{array}\right]$ (133)

where
$\displaystyle {\bf\Sigma}_{zz}={\bf I}$
$\displaystyle {\bf\Sigma}_{xx}={\bf WW}^T +{\bf\Psi}$
$\displaystyle {\bf\Sigma}_{zx}={\bf\Sigma}_{xz}^T
=E[({\bf z}-{\bf m}_z)({\bf x}-{\bf m}_x)^T]=E[{\bf zx}^T]
=E[{\bf z}({\bf Wz}+{\bf e})^T]$
$\displaystyle \;\;\;\;\;\;\;\;
=E[{\bf zz}^T]{\bf W}^T+E[{\bf ze}^T]
={\bf I}\,{\bf W}^T+{\bf 0}={\bf W}^T$ (134)

The joint normal distribution $p({\bf z},{\bf x}\vert{\bf\theta})$ can now be expressed as:

$\displaystyle p({\bf z},{\bf x}\vert{\bf\theta})
={\cal N}({\bf m},{\bf\Sigma})
={\cal N}\left(\left[\begin{array}{c}{\bf m}_z\\ {\bf m}_x\end{array}\right],
\left[\begin{array}{cc}{\bf\Sigma}_{zz}&{\bf\Sigma}_{zx}\\
{\bf\Sigma}_{xz}&{\bf\Sigma}_{xx}\end{array}\right]\right)
={\cal N}\left(\left[\begin{array}{c}{\bf 0}\\ {\bf 0}\end{array}\right],
\left[\begin{array}{cc}{\bf I}&{\bf W}^T\\
{\bf W}&{\bf WW}^T+{\bf\Psi}\end{array}\right]\right)$ (135)
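
The off-diagonal block of this joint covariance can also be checked numerically; the sketch below (illustrative values) estimates $E[{\bf zx}^T]$ from samples and compares it with ${\bf W}^T$ from Eq. (134):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_prime, N = 4, 2, 200_000                            # illustrative sizes
W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))

Z = rng.standard_normal((N, d_prime))                    # rows are samples of z ~ N(0, I)
X = Z @ W.T + rng.multivariate_normal(np.zeros(d), Psi, size=N)   # x = Wz + e

cross_cov = Z.T @ X / N                                  # sample estimate of E[z x^T]
print(np.max(np.abs(cross_cov - W.T)))                   # close to 0, so Sigma_zx = W^T
```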

Based on this joint pdf of both ${\bf x}$ and ${\bf z}$, we can further find the desired conditional pdfs $p({\bf x}\vert{\bf z})$ and $p({\bf z}\vert{\bf x})$ (see properties of Normal distributions):

$\displaystyle p({\bf x}\vert{\bf z},{\bf\theta})={\cal N}({\bf Wz},{\bf\Psi})$ (136)

$\displaystyle p({\bf z}\vert{\bf x},{\bf\theta})={\cal N}({\bf m}_{z\vert x},{\bf\Sigma}_{z\vert x}),
\;\;\;\;\;\;\;\;
{\bf m}_{z\vert x}={\bf\Sigma}_{zx}{\bf\Sigma}_{xx}^{-1}{\bf x}={\bf Bx},
\;\;\;\;
{\bf\Sigma}_{z\vert x}={\bf\Sigma}_{zz}-{\bf\Sigma}_{zx}{\bf\Sigma}_{xx}^{-1}{\bf\Sigma}_{xz}={\bf I}-{\bf BW}$ (137)

where we have defined

$\displaystyle {\bf B}={\bf W}^T({\bf WW}^T+{\bf\Psi})^{-1}$ (138)

Note that while $p({\bf z})={\cal N}({\bf0},{\bf I})$ has zero mean and diagonal covariance, the conditional distribution $p({\bf z}\vert{\bf x},{\bf\theta})={\cal N}({\bf m}_{z\vert x},{\bf\Sigma}_{z\vert x})$ has non-zero mean ${\bf m}_{z\vert x}$ and non-diagonal covariance ${\bf\Sigma}_{z\vert x}$.
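
These posterior quantities are straightforward to compute; the sketch below (illustrative values) forms ${\bf B}$, ${\bf m}_{z\vert x}$ and ${\bf\Sigma}_{z\vert x}$ directly from ${\bf W}$ and ${\bf\Psi}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_prime = 4, 2                                # illustrative sizes
W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))
x = rng.standard_normal(d)                       # one observed vector

B = W.T @ np.linalg.inv(W @ W.T + Psi)           # Eq. (138)
m_z_given_x = B @ x                              # posterior mean of Eq. (137)
Sigma_z_given_x = np.eye(d_prime) - B @ W        # posterior covariance of Eq. (137)
print(m_z_given_x, Sigma_z_given_x)
```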

The computational complexity for the inversion of the $d\times d$ matrix ${\bf WW}^T+{\bf\Psi}$ is $O(d^3)$. However, this cost can be reduced by applying the Woodbury matrix identity:

$\displaystyle ({\bf\Psi}+{\bf WW}^T)^{-1} ={\bf\Psi}^{-1}-{\bf\Psi}^{-1}
{\bf W}({\bf I}+{\bf W}^T{\bf\Psi}^{-1}{\bf W})^{-1}{\bf W}^T{\bf\Psi}^{-1}$ (139)

where ${\bf\Psi}$, being a diagonal matrix, can be easily inverted, and ${\bf I}+{\bf W}^T{\bf\Psi}^{-1}{\bf W}$ is a $d'\times d'$ matrix that can be inverted with complexity $O(d'^3)\ll O(d^3)$.
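
The identity is easy to verify numerically; this sketch (arbitrary sizes) compares the direct inverse with the Woodbury form of Eq. (139):

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_prime = 6, 2                                        # illustrative sizes
W = rng.standard_normal((d, d_prime))
psi = rng.uniform(0.1, 0.5, size=d)                      # diagonal entries of Psi
Psi_inv = np.diag(1.0 / psi)                             # a diagonal matrix inverts elementwise

direct = np.linalg.inv(np.diag(psi) + W @ W.T)           # O(d^3) direct inversion
small = np.linalg.inv(np.eye(d_prime) + W.T @ Psi_inv @ W)    # only a d' x d' inverse
woodbury = Psi_inv - Psi_inv @ W @ small @ W.T @ Psi_inv      # right-hand side of Eq. (139)
print(np.max(np.abs(direct - woodbury)))                 # essentially zero; the two forms agree
```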

The model parameters ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$ can now be estimated based on the given dataset ${\bf X}$ by the EM algorithm, an iterative process alternating between an expectation step (E-step) and a maximization step (M-step).

As the E-step and M-step are interdependent on each other, they need to be carried out iteratively based on some initial guess of the parameters in ${\bf\theta}$. In summary, here are the steps of the EM method (a code sketch follows the list):

  1. Initialize parameters $\theta_{old}=\{{\bf W}_{old},{\bf\Psi}_{old}\}$;

  2. E-step:

    Find $E_{z\vert x}({\bf z})$ and $E_{z\vert x}({\bf zz}^T)$ in Eq. (146) based on ${\bf m}_{z\vert x_n}$ and ${\bf\Sigma}_{z\vert x_n}$ in Eq. (137), which in turn are based on ${\bf\theta}_{old}=\{{\bf W}_{old},{\bf\Psi}_{old}\}$;

  3. M-step:

    Find ${\bf\theta}_{new}=\{{\bf W}_{new},{\bf\Psi}_{new}\}$ in Eqs. (145) and (149), based on $E_{z\vert x}({\bf z})$ and $E_{z\vert x}({\bf zz}^T)$;

  4. Terminate if the convergence criterion is satisfied; otherwise replace $\theta_{old}$ by $\theta_{new}$ and return to the E-step.
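
The following Python sketch puts these steps together on a small synthetic dataset. It assumes the standard EM updates for factor analysis (the posterior moments of Eqs. (137)-(138) in the E-step, and the usual loading and noise-variance updates in the M-step), used here in place of the updates referenced as Eqs. (145), (146) and (149); the sizes are illustrative and a fixed iteration count replaces a convergence test:

```python
import numpy as np

rng = np.random.default_rng(5)
d, d_prime, N = 6, 2, 2000                       # illustrative sizes

# Synthetic zero-mean data drawn from the FA model with a "true" W and Psi.
W_true = rng.standard_normal((d, d_prime))
psi_true = rng.uniform(0.1, 0.5, size=d)
X = (rng.standard_normal((N, d_prime)) @ W_true.T
     + rng.standard_normal((N, d)) * np.sqrt(psi_true))     # rows are samples x_n

# 1. Initialize theta_old = {W_old, Psi_old}.
W = rng.standard_normal((d, d_prime))
psi = np.ones(d)                                 # diagonal of Psi

S = X.T @ X / N                                  # sample covariance (zero-mean data)
for _ in range(100):                             # fixed number of iterations for simplicity
    # 2. E-step: posterior moments of z given each x_n, Eqs. (137)-(138).
    B = W.T @ np.linalg.inv(W @ W.T + np.diag(psi))          # d' x d
    Ez = X @ B.T                                             # row n is E[z | x_n]
    Sigma_z_x = np.eye(d_prime) - B @ W                      # I - BW
    Ezz = N * Sigma_z_x + Ez.T @ Ez                          # sum_n E[z z^T | x_n]

    # 3. M-step: update W and Psi (standard FA updates).
    W = (X.T @ Ez) @ np.linalg.inv(Ezz)
    psi = np.maximum(np.diag(S - W @ (X.T @ Ez).T / N), 1e-6)  # keep only the diagonal

# 4. (Convergence test omitted.)  The fitted covariance should be close to the sample covariance.
print(np.max(np.abs(W @ W.T + np.diag(psi) - S)))
```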