Factor Analysis and Expectation Maximization

The method of factor analysis (FA) models a set of $d$ observed manifest variables in ${\bf x}=[x_1,\cdots,x_d]^T$ as a linear combination of a set of $d'<d$ unobserved hidden latent variables, or common factors, in ${\bf z}=[z_1,\cdots,z_{d'}]^T$. The goal is to explain the variability and dependency among the $d$ observed variables, which are typically correlated, in terms of the latent variables, which are assumed to be independent and therefore uncorrelated. FA can therefore be considered a method for dimensionality reduction.

Specifically, we assume each of the observed variables in ${\bf x}$ is a linear combination of the $d'$ factors in ${\bf z}$

$\displaystyle x_i=\sum_{j=1}^{d'} w_{ij} z_j+e_i
=[w_{i1},\cdots,w_{id'}]
\left[\begin{array}{c}z_1\\ \vdots\\ z_{d'}\end{array}\right]+e_i,
\;\;\;\;\;\;\;\;\;(i=1,\cdots,d)$ (121)

or in matrix form

$\displaystyle {\bf x}
=\left[\begin{array}{c}x_1\\ \vdots\\ x_d\end{array}\right]
=\left[\begin{array}{ccc}w_{11}&\cdots&w_{1d'}\\ \vdots&\ddots&\vdots\\
w_{d1}&\cdots&w_{dd'}\end{array}\right]
\left[\begin{array}{c}z_1\\ \vdots\\ z_{d'}\end{array}\right]
+\left[\begin{array}{c}e_1\\ \vdots\\ e_d\end{array}\right]
={\bf Wz}+{\bf e}$ (122)

where ${\bf W}$ is a $d\times d'$ factor loading matrix, and $e_i$ is the noise associated with $x_i$. Also, for simplicity and without loss of generality, we assume the dataset has a zero mean. If ${\bf W}$ were available, the $d'$ factors in ${\bf z}$ could be found by solving this over-determined linear system of $d$ equations with $d'<d$ unknowns by the least-squares method (with minimum squared error $\vert\vert{\bf e}\vert\vert^2$):

$\displaystyle \hat{\bf z}={\bf W}^-{\bf x}
=({\bf W}^T{\bf W})^{-1}{\bf W}^T{\bf x}$ (123)

where ${\bf W}^-=({\bf W}^T{\bf W})^{-1}{\bf W}^T$ is the left pseudo-inverse of ${\bf W}$.
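
To make the model and the least-squares recovery concrete, here is a minimal Python sketch (the dimensions and values are arbitrary illustrations, not from the text) that generates one observation from ${\bf x}={\bf Wz}+{\bf e}$ with a known loading matrix and estimates $\hat{\bf z}$ by Eq. (123):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 5, 2                        # illustrative dimensions

W = rng.standard_normal((d, d_prime))    # a known loading matrix, for illustration
z = rng.standard_normal(d_prime)         # latent factors
e = 0.1 * rng.standard_normal(d)         # small observation noise
x = W @ z + e                            # observed vector, Eq. (122)

# Least-squares estimate of Eq. (123): z_hat = (W^T W)^{-1} W^T x
z_hat = np.linalg.solve(W.T @ W, W.T @ x)
print(z, z_hat)                          # z_hat approximates z when the noise is small
```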

However, as ${\bf W}$ is unavailable, it needs to be estimated together with ${\bf z}$, based on the given dataset ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$, typically with $N\gg d$. This can be done by the general method of expectation-maximization (EM), an iterative algorithm widely used in machine learning for finding maximum likelihood (ML) or maximum a posteriori (MAP) estimates of model parameters.

Specifically, we treat both ${\bf z}$ and ${\bf e}$ as random vectors, and make the following assumptions:

  - The common factors are normally distributed with zero mean and identity covariance, $p({\bf z})={\cal N}({\bf 0},{\bf I})$, i.e., $E[{\bf z}]={\bf 0}$ and $E[{\bf zz}^T]={\bf I}$;
  - The noise is normally distributed with zero mean and a diagonal covariance ${\bf\Psi}$, $p({\bf e})={\cal N}({\bf 0},{\bf\Psi})$, i.e., $E[{\bf e}]={\bf 0}$ and $E[{\bf ee}^T]={\bf\Psi}$;
  - The factors and the noise are independent of each other, so that $E[{\bf ze}^T]={\bf 0}$.

The two matrices ${\bf W}$ and ${\bf\Psi}$ defined above are the parameters of the FA model, denoted by ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$.

Based on the assumptions above, we desire to find the conditional pdf $p({\bf z}\vert{\bf x})$ of the latent variable ${\bf z}$ given the observed variable ${\bf x}$. To do so, we first find the pdf of ${\bf x}={\bf Wz}+{\bf e}$, which, as a linear combination of the two normally distributed random vectors ${\bf z}$ and ${\bf e}$, is also normally distributed with $p({\bf x})={\cal N}({\bf m}_x,{\bf\Sigma}_x)$, where

$\displaystyle {\bf m}_x=E[{\bf x}]
=E[{\bf Wz}+{\bf e}]
={\bf W}E[{\bf z}]+E[{\bf e}]={\bf 0}$ (129)

$\displaystyle {\bf\Sigma}_x=Cov[{\bf x}]
=E[({\bf x}-{\bf m}_x)({\bf x}-{\bf m}_x)^T]
=E[{\bf xx}^T]
=E\left[({\bf Wz}+{\bf e})({\bf Wz}+{\bf e})^T\right]$
$\displaystyle \;\;\;\;\;\;\;\;
={\bf W}E[{\bf zz}^T]{\bf W}^T+E[{\bf ez}^T]{\bf W}^T
+{\bf W}E[{\bf ze}^T]+E[{\bf ee}^T]
={\bf WW}^T+{\bf\Psi}$ (130)

As both ${\bf m}_x$ and ${\bf\Sigma}_x$ depend on the model parameters ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$, the pdf of ${\bf x}$ is conditioned on ${\bf\theta}$:

$\displaystyle p({\bf x}\vert{\bf\theta})={\cal N}({\bf m}_x,{\bf\Sigma}_x)
={\cal N}({\bf0},{\bf WW}^T+{\bf\Psi})$ (131)
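
As a quick numerical check (a sketch with arbitrary illustrative values), we can draw many samples under the assumptions above and compare the sample covariance of ${\bf x}$ with ${\bf WW}^T+{\bf\Psi}$ from Eq. (130):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_prime, N = 4, 2, 200_000                            # illustrative sizes

W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

Z = rng.standard_normal((N, d_prime))                    # rows are samples of z ~ N(0, I)
E = rng.multivariate_normal(np.zeros(d), Psi, size=N)    # rows are samples of e ~ N(0, Psi)
X = Z @ W.T + E                                          # rows are samples of x = Wz + e

sample_cov = X.T @ X / N                                 # estimate of E[xx^T] (zero-mean data)
model_cov = W @ W.T + Psi                                # Eq. (130)
print(np.max(np.abs(sample_cov - model_cov)))            # small for large N
```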

The joint distribution $p({\bf z},{\bf x})$ is also Gaussian with a zero mean

$\displaystyle {\bf m}
=E\left[\begin{array}{c}{\bf z}\\ {\bf x}\end{array}\right]
=\left[\begin{array}{c}{\bf m}_z\\ {\bf m}_x\end{array}\right]
=\left[\begin{array}{c}{\bf 0}\\ {\bf 0}\end{array}\right]$ (132)

and a covariance ${\bf\Sigma}$ composed of four submatrices:

$\displaystyle {\bf\Sigma}=\left[\begin{array}{cc}{\bf\Sigma}_{zz}&{\bf\Sigma}_{zx}\\
{\bf\Sigma}_{xz}&{\bf\Sigma}_{xx}\end{array}\right]$ (133)

where
$\displaystyle {\bf\Sigma}_{zz}={\bf I}$
$\displaystyle {\bf\Sigma}_{xx}={\bf WW}^T +{\bf\Psi}$
$\displaystyle {\bf\Sigma}_{zx}={\bf\Sigma}_{xz}^T
=E[({\bf z}-{\bf m}_z)({\bf x}-{\bf m}_x)^T]=E[{\bf zx}^T]
=E[{\bf z}({\bf Wz}+{\bf e})^T]$
$\displaystyle \;\;\;\;\;\;\;\;
=E[{\bf zz}^T]{\bf W}^T+E[{\bf ze}^T]
={\bf I}\,{\bf W}^T+{\bf 0}={\bf W}^T$ (134)

The joint normal distribution $p({\bf z},{\bf x}\vert{\bf\theta})$ can now be expressed as:

$\displaystyle p({\bf z},{\bf x}\vert{\bf\theta})
={\cal N}({\bf m},{\bf\Sigma})
={\cal N}\left(\left[\begin{array}{c}{\bf m}_z\\ {\bf m}_x\end{array}\right],
\left[\begin{array}{cc}{\bf\Sigma}_{zz}&{\bf\Sigma}_{zx}\\
{\bf\Sigma}_{xz}&{\bf\Sigma}_{xx}\end{array}\right]\right)
={\cal N}\left(\left[\begin{array}{c}{\bf 0}\\ {\bf 0}\end{array}\right],
\left[\begin{array}{cc}{\bf I}&{\bf W}^T\\
{\bf W}&{\bf WW}^T+{\bf\Psi}\end{array}\right]\right)$ (135)
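
The off-diagonal block of this joint covariance can also be checked numerically; the sketch below (illustrative values) estimates $E[{\bf zx}^T]$ from samples and compares it with ${\bf W}^T$ from Eq. (134):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_prime, N = 4, 2, 200_000                            # illustrative sizes
W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))

Z = rng.standard_normal((N, d_prime))                    # rows are samples of z ~ N(0, I)
X = Z @ W.T + rng.multivariate_normal(np.zeros(d), Psi, size=N)   # x = Wz + e

cross_cov = Z.T @ X / N                                  # sample estimate of E[z x^T]
print(np.max(np.abs(cross_cov - W.T)))                   # close to 0, so Sigma_zx = W^T
```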

Based on this joint pdf of both ${\bf x}$ and ${\bf z}$, we can further find the desired conditional pdfs $p({\bf x}\vert{\bf z})$ and $p({\bf z}\vert{\bf x})$ (see properties of Normal distributions):

$\displaystyle p({\bf x}\vert{\bf z},{\bf\theta})={\cal N}({\bf Wz},{\bf\Psi})$ (136)

$\displaystyle p({\bf z}\vert{\bf x},{\bf\theta})={\cal N}({\bf m}_{z\vert x},{\bf\Sigma}_{z\vert x}),
\;\;\;\;\;\;\;\;
{\bf m}_{z\vert x}={\bf\Sigma}_{zx}{\bf\Sigma}_{xx}^{-1}{\bf x}={\bf Bx},
\;\;\;\;
{\bf\Sigma}_{z\vert x}={\bf\Sigma}_{zz}-{\bf\Sigma}_{zx}{\bf\Sigma}_{xx}^{-1}{\bf\Sigma}_{xz}={\bf I}-{\bf BW}$ (137)

where we have defined

$\displaystyle {\bf B}={\bf W}^T({\bf WW}^T+{\bf\Psi})^{-1}$ (138)

Note that while $p({\bf z})={\cal N}({\bf0},{\bf I})$ has zero mean and diagonal covariance, the conditional distribution $p({\bf z}\vert{\bf x},{\bf\theta})={\cal N}({\bf m}_{z\vert x},{\bf\Sigma}_{z\vert x})$ has non-zero mean ${\bf m}_{z\vert x}$ and non-diagonal covariance ${\bf\Sigma}_{z\vert x}$.
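
These posterior quantities are straightforward to compute; the sketch below (illustrative values) forms ${\bf B}$, ${\bf m}_{z\vert x}$ and ${\bf\Sigma}_{z\vert x}$ directly from ${\bf W}$ and ${\bf\Psi}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_prime = 4, 2                                # illustrative sizes
W = rng.standard_normal((d, d_prime))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))
x = rng.standard_normal(d)                       # one observed vector

B = W.T @ np.linalg.inv(W @ W.T + Psi)           # Eq. (138)
m_z_given_x = B @ x                              # posterior mean of Eq. (137)
Sigma_z_given_x = np.eye(d_prime) - B @ W        # posterior covariance of Eq. (137)
print(m_z_given_x, Sigma_z_given_x)
```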

The computational complexity for the inversion of the $d\times d$ matrix ${\bf WW}^T+{\bf\Psi}$ is $O(d^3)$. However, this cost can be reduced by applying the Woodbury matrix identity:

$\displaystyle ({\bf\Psi}+{\bf WW}^T)^{-1} ={\bf\Psi}^{-1}-{\bf\Psi}^{-1}
{\bf W}({\bf I}+{\bf W}^T{\bf\Psi}^{-1}{\bf W})^{-1}{\bf W}^T{\bf\Psi}^{-1}$ (139)

where ${\bf\Psi}$, being a diagonal matrix, can be easily inverted, and ${\bf I}+{\bf W}^T{\bf\Psi}^{-1}{\bf W}$ is a $d'\times d'$ matrix that can be inverted with complexity $O(d'^3)\ll O(d^3)$.
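
The identity is easy to verify numerically; this sketch (arbitrary sizes) compares the direct inverse with the Woodbury form of Eq. (139):

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_prime = 6, 2                                        # illustrative sizes
W = rng.standard_normal((d, d_prime))
psi = rng.uniform(0.1, 0.5, size=d)                      # diagonal entries of Psi
Psi_inv = np.diag(1.0 / psi)                             # a diagonal matrix inverts elementwise

direct = np.linalg.inv(np.diag(psi) + W @ W.T)           # O(d^3) direct inversion
small = np.linalg.inv(np.eye(d_prime) + W.T @ Psi_inv @ W)    # only a d' x d' inverse
woodbury = Psi_inv - Psi_inv @ W @ small @ W.T @ Psi_inv      # right-hand side of Eq. (139)
print(np.max(np.abs(direct - woodbury)))                 # essentially zero; the two forms agree
```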

The model parameters ${\bf\theta}=\{{\bf W},{\bf\Psi}\}$ can now be estimated based on the given dataset ${\bf X}$ by the EM algorithm, an iterative process alternating between an expectation step (E-step) and a maximization step (M-step).

As the E-step and M-step are interdependent on each other, they need to be carried out iteratively based on some initial guess of the parameters in ${\bf\theta}$. In summary, here are the steps of the EM method (a code sketch follows the list):

  1. Initialize parameters $\theta_{old}=\{{\bf W}_{old},{\bf\Psi}_{old}\}$;

  2. E-step:

    Find $E_{z\vert x}({\bf z})$ and $E_{z\vert x}({\bf zz}^T)$ in Eq. (146) based on ${\bf m}_{z\vert x_n}$ and ${\bf\Sigma}_{z\vert x_n}$ in Eq. (137), which in turn are based on ${\bf\theta}_{old}=\{{\bf W}_{old},{\bf\Psi}_{old}\}$;

  3. M-step:

    Find ${\bf\theta}_{new}=\{{\bf W}_{new},{\bf\Psi}_{new}\}$ in Eqs. (145) and (149), based on $E_{z\vert x}({\bf z})$ and $E_{z\vert x}({\bf zz}^T)$;

  4. Terminate if the convergence criterion is satisfied; otherwise replace $\theta_{old}$ by $\theta_{new}$ and return to the E-step.
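
The following Python sketch puts these steps together on a small synthetic dataset. It assumes the standard EM updates for factor analysis (the posterior moments of Eqs. (137)-(138) in the E-step, and the usual loading and noise-variance updates in the M-step), used here in place of the updates referenced as Eqs. (145), (146) and (149); the sizes are illustrative and a fixed iteration count replaces a convergence test:

```python
import numpy as np

rng = np.random.default_rng(5)
d, d_prime, N = 6, 2, 2000                       # illustrative sizes

# Synthetic zero-mean data drawn from the FA model with a "true" W and Psi.
W_true = rng.standard_normal((d, d_prime))
psi_true = rng.uniform(0.1, 0.5, size=d)
X = (rng.standard_normal((N, d_prime)) @ W_true.T
     + rng.standard_normal((N, d)) * np.sqrt(psi_true))     # rows are samples x_n

# 1. Initialize theta_old = {W_old, Psi_old}.
W = rng.standard_normal((d, d_prime))
psi = np.ones(d)                                 # diagonal of Psi

S = X.T @ X / N                                  # sample covariance (zero-mean data)
for _ in range(100):                             # fixed number of iterations for simplicity
    # 2. E-step: posterior moments of z given each x_n, Eqs. (137)-(138).
    B = W.T @ np.linalg.inv(W @ W.T + np.diag(psi))          # d' x d
    Ez = X @ B.T                                             # row n is E[z | x_n]
    Sigma_z_x = np.eye(d_prime) - B @ W                      # I - BW
    Ezz = N * Sigma_z_x + Ez.T @ Ez                          # sum_n E[z z^T | x_n]

    # 3. M-step: update W and Psi (standard FA updates).
    W = (X.T @ Ez) @ np.linalg.inv(Ezz)
    psi = np.maximum(np.diag(S - W @ (X.T @ Ez).T / N), 1e-6)  # keep only the diagonal

# 4. (Convergence test omitted.)  The fitted covariance should be close to the sample covariance.
print(np.max(np.abs(W @ W.T + np.diag(psi) - S)))
```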