Non-Gaussianity is Independence

The theoretical foundation of ICA is the central limit theorem, which states that, under mild conditions, the distribution of the sum (or average, or linear combination) of $N$ independent random variables approaches a Gaussian as $N\rightarrow \infty$. For example, the face value of a fair die has a uniform distribution from 1 to 6. But the sum of a pair of dice is no longer uniformly distributed: its probability peaks at the mean of 7. As the number of dice increases, the distribution of the sum of the face values is better and better approximated by a Gaussian.
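The dice example can be checked exactly rather than by simulation. The sketch below assumes fair six-sided dice and uses excess kurtosis (zero for a Gaussian) as one convenient measure of distance from Gaussianity; the helper names `sum_pmf` and `excess_kurtosis` are illustrative and not part of the original text.

```python
import numpy as np

# pmf of a single fair die: P(face = 1), ..., P(face = 6)
die = np.full(6, 1.0 / 6.0)

def sum_pmf(k):
    """Exact pmf of the sum of k fair dice, indexed by the sums k, ..., 6k."""
    pmf = np.array([1.0])
    for _ in range(k):
        # the pmf of a sum of independent variables is the convolution of their pmfs
        pmf = np.convolve(pmf, die)
    return pmf

def excess_kurtosis(values, pmf):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    mean = np.sum(values * pmf)
    var = np.sum((values - mean) ** 2 * pmf)
    m4 = np.sum((values - mean) ** 4 * pmf)
    return m4 / var ** 2 - 3.0

# for two dice, the most probable sum is the mean, 7
two = sum_pmf(2)
print("most probable sum of two dice:", np.arange(2, 13)[np.argmax(two)])

# the excess kurtosis moves toward 0 as more dice are added
for k in (1, 2, 5, 20):
    values = np.arange(k, 6 * k + 1)
    print(k, "dice: excess kurtosis =", excess_kurtosis(values, sum_pmf(k)))
```

Because the fourth cumulants of independent variables add, the excess kurtosis of the sum of $k$ dice equals that of a single die divided by $k$, so it shrinks toward the Gaussian value of zero.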

Let $x_1,\cdots,x_N$ be random variables independently drawn from an arbitrary distribution with finite mean $\mu$ and variance $\sigma^2$. Then the distribution of the mean $x=\sum_{i=1}^N x_i/N$ approaches a Gaussian with mean $\mu$ and variance $\sigma^2/N$.
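As a quick numerical illustration of the mean and variance stated above, the following sketch draws from an exponential distribution with $\mu=\sigma^2=1$. The choice of distribution, $N=50$, and the number of trials are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50            # number of variables averaged in each trial
trials = 20_000   # number of independent sample means
mu, sigma2 = 1.0, 1.0  # mean and variance of an exponential with scale 1

# each row holds N independent draws; average across each row
samples = rng.exponential(scale=1.0, size=(trials, N))
means = samples.mean(axis=1)

print("mean of the sample means:    ", means.mean(), "(expected", mu, ")")
print("variance of the sample means:", means.var(), "(expected", sigma2 / N, ")")
```

The exponential distribution is strongly skewed, yet a histogram of `means` would already look close to a bell curve at this value of $N$.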

To solve the BSS problem, we want to find a matrix ${\bf W}$ so that ${\bf y}={\bf Wx}={\bf WAs}$ is as close to the independent sources ${\bf s}$ as possible. This can be seen as the reverse process of the central limit theorem above. Consider the $j$th component $y_j={\bf w}_j^T{\bf As}$ of ${\bf y}$, where ${\bf w}_j^T$ is the $j$th row of ${\bf W}$. As a linear combination of all components of ${\bf s}$, $y_j$ is in general more Gaussian than the individual source components $\{s_1,\cdots,s_n\}$, unless $y_j$ is equal to one of them up to a scaling factor (i.e., ${\bf w}_j^T{\bf A}$ has only one non-zero component). Therefore, for ${\bf y}$ to be an estimate of ${\bf s}$, we want to find the ${\bf W}$ that maximizes the non-Gaussianity of ${\bf y}={\bf WAs}$, so that ${\bf y}$ is least Gaussian. This is the essence of all ICA methods. Obviously, if all source variables are Gaussian, this approach will not work, because any linear combination of independent Gaussian variables is again Gaussian.
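The claim that mixing makes a signal more Gaussian can be illustrated numerically. The sketch below assumes two independent, unit-variance uniform sources and again uses excess kurtosis as the measure of non-Gaussianity; the vector `q` is a hypothetical stand-in for the row vector ${\bf w}_j^T{\bf A}$, normalized so that $y_j$ has unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# two independent, zero-mean, unit-variance uniform sources (excess kurtosis -1.2)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))

def excess_kurtosis(y):
    """Sample excess kurtosis: zero for a Gaussian."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def mixture_kurtosis(q):
    """Excess kurtosis of y = q^T s, with q scaled to unit length."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    return excess_kurtosis(q @ s)

k_single = mixture_kurtosis([1.0, 0.0])   # y equals one source
k_uneven = mixture_kurtosis([2.0, 1.0])   # y is dominated by one source
k_even = mixture_kurtosis([1.0, 1.0])     # y is an equal mixture

print("one non-zero weight:", k_single)
print("uneven mixture:     ", k_uneven)
print("equal mixture:      ", k_even)
```

For unit-variance independent sources and unit-length `q`, the excess kurtosis of the mixture is $\sum_i q_i^4\,\mathrm{kurt}(s_i)$, so its magnitude is largest exactly when `q` has a single non-zero component.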

Based on the above discussion, we obtain the following requirements and constraints for ICA methods:

The number of observed variables must be no fewer than the number of independent sources (this is assumed in the following).
The source components are stochastically independent and have to be non-Gaussian (with the possible exception of at most one Gaussian source).
${\bf A}$ and ${\bf s}$ can be estimated only up to a scaling factor. Let

$\displaystyle {\bf C}=diag(c_1,\cdots,c_n), \;\;\;\;\;\; {\bf C}^{-1}=diag(1/c_1,\cdots,1/c_n)$ (206)

with non-zero $c_i$, and ${\bf A}'={\bf AC}^{-1}$, ${\bf s}'={\bf Cs}$. Then we have

$\displaystyle {\bf x}={\bf As}=[{\bf AC}^{-1}][{\bf Cs}]={\bf A}'{\bf s}'$ (207)

The scaling factor could be either positive or negative, so the sign of each component is also undetermined. To remove the ambiguity in magnitude, we will always assume the independent components have zero mean and unit variance $E\{s_i^2\}=1$. As they are also uncorrelated (all independent variables are uncorrelated), we have $E\{s_is_j\}=\delta_{ij}$, i.e.,

$\displaystyle E( {\bf ss}^T )={\bf I}$ (208)

The estimated independent components are not in any particular order. When the components of ${\bf s}$ and the corresponding columns of ${\bf A}$ are permuted in the same way, ${\bf x=As}$ still holds.
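The scaling, sign, and ordering ambiguities listed above are easy to verify numerically. In this sketch the mixing matrix, the sources, the scale factors `c`, and the permutation `perm` are all arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 1000

A = rng.normal(size=(n, n))      # arbitrary mixing matrix
s = rng.laplace(size=(n, T))     # arbitrary non-Gaussian sources
x = A @ s                        # observations

# scaling ambiguity: x = (A C^{-1}) (C s); note the negative factor (sign flip)
c = np.array([2.0, -0.5, 3.0])
C = np.diag(c)
A_scaled = A @ np.linalg.inv(C)
s_scaled = C @ s
scaling_ok = np.allclose(x, A_scaled @ s_scaled)

# ordering ambiguity: permute the rows of s and the columns of A identically
perm = np.array([2, 0, 1])
A_perm = A[:, perm]
s_perm = s[perm, :]
perm_ok = np.allclose(x, A_perm @ s_perm)

print("same x after rescaling:", scaling_ok)
print("same x after permuting:", perm_ok)
```

Both pairs reproduce the same observations ${\bf x}$, so no method that sees only ${\bf x}$ can distinguish between them.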

Based on the fundamental approach discussed above, all ICA algorithms can be considered as an optimization process that finds the matrix ${\bf W}$ maximizing a certain objective function that measures the non-Gaussianity of ${\bf y}={\bf Wx}$, and thereby the independence of its components. In the following, we will discuss some common objective functions.
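To make the optimization view concrete, here is a minimal two-source sketch. It is not one of the specific algorithms discussed later: it assumes the absolute excess kurtosis as the objective, whitens the data first so that only a rotation remains unknown, and finds that rotation by a brute-force search over angles. The sources, the mixing matrix `A`, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000

# hypothetical unit-variance sources: one uniform, one Laplacian
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),
    rng.laplace(0.0, 1.0 / np.sqrt(2), T),
])
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
x = A @ s                                 # observed mixtures

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def whiten(x):
    """Return z = V (x - mean) with identity covariance, and the matrix V."""
    xc = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(xc))
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return V @ xc, V

def non_gaussianity(y):
    """Objective: sum of absolute excess kurtosis over the rows of y."""
    return sum(abs(excess_kurtosis(row)) for row in y)

def rotation(theta):
    c, sn = np.cos(theta), np.sin(theta)
    return np.array([[c, sn], [-sn, c]])

z, V = whiten(x)

# after whitening, W = R(theta) V; angles in [0, pi/2) cover all solutions
# up to the sign and ordering ambiguities
angles = np.linspace(0.0, np.pi / 2, 360, endpoint=False)
scores = [non_gaussianity(rotation(t) @ z) for t in angles]
best = angles[int(np.argmax(scores))]

W = rotation(best) @ V
y = W @ (x - x.mean(axis=1, keepdims=True))

# absolute correlation between each estimate (rows) and each source (columns)
match = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(match, 3))
```

Each row of `match` should contain one entry close to 1 and one close to 0, meaning every estimated component matches a single source up to sign and order. Practical algorithms replace the grid search with gradient or fixed-point iterations and often use more robust objectives than kurtosis.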