
Minimization of Mutual Information

The mutual information $I(x,y)$ of two random variables $x$ and $y$ is defined as

\begin{displaymath}I(x,y)=H(x)+H(y)-H(x,y)=H(x)-H(x\vert y)=H(y)-H(y\vert x) \end{displaymath}

Obviously, when $x$ and $y$ are independent, i.e., $H(y\vert x)=H(y)$ and $H(x\vert y)=H(x)$, their mutual information $I(x,y)$ is zero.

[Figure: mutual_info.gif, an illustration of the mutual information $I(x,y)$]
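
As a quick numerical illustration of the definition above, the following sketch (assuming a small, made-up joint probability table for two discrete variables) computes $I(x,y)=H(x)+H(y)-H(x,y)$ and confirms that it vanishes when the joint distribution factorizes:

# A minimal numerical check of I(x,y) = H(x) + H(y) - H(x,y) for two
# discrete random variables given by a hypothetical joint probability table.
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability array; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)          # marginal of x
p_y = p_xy.sum(axis=0)          # marginal of y

I = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
print(I)                        # > 0, since x and y are dependent here

# For an independent pair p(x,y) = p(x) p(y), the same formula gives 0:
p_indep = np.outer(p_x, p_y)
print(entropy(p_x) + entropy(p_y) - entropy(p_indep.ravel()))  # ~ 0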

Similarly the mutual information $I(y_1,\cdots,y_n)$ of a set of $n$ variables $y_i$ ($i=1,\cdots,n$) is defined as

\begin{displaymath}I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n) \end{displaymath}

If a random vector ${\mathbf y}=[y_1,\cdots,y_n]^T$ is a linear transform of another random vector ${\mathbf x}=[x_1,\cdots,x_n]^T$:

\begin{displaymath}y_i=\sum_{j=1}^n w_{ij} x_j,\;\;\;\;\;\mbox{or}\;\;\;\;{\mathbf y=Wx} \end{displaymath}

then the entropy of ${\mathbf y}$ is related to that of ${\mathbf x}$ by

\begin{eqnarray*}
H(y_1,\cdots,y_n) & = & H(x_1,\cdots,x_n)+E\,\{\log\,\vert J(x_1,\cdots,x_n)\vert\} \\
 & = & H(x_1,\cdots,x_n)+\log\,\vert\det{\mathbf W}\vert
\end{eqnarray*}

where $J(x_1,\cdots,x_n)$ is the Jacobian of the above transformation:

\begin{displaymath}
J(x_1,\cdots,x_n)=\left\vert \begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{array} \right\vert
=\det{\mathbf W}
\end{displaymath}
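
The relation $H({\mathbf y})=H({\mathbf x})+\log\,\vert\det{\mathbf W}\vert$ can be checked numerically in a case where the differential entropy has a closed form. The sketch below (assuming ${\mathbf x}$ is a zero-mean Gaussian with an arbitrary example covariance) compares the entropy of ${\mathbf y}={\mathbf Wx}$ with $H({\mathbf x})+\log\,\vert\det{\mathbf W}\vert$:

# A small check, under a Gaussian assumption so that differential entropy has a
# closed form, that H(y) = H(x) + log|det W| when y = W x.
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance cov."""
    n = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

rng = np.random.default_rng(0)
C_x = np.array([[2.0, 0.5],
                [0.5, 1.0]])            # covariance of x (arbitrary example)
W = rng.normal(size=(2, 2))             # an arbitrary (almost surely invertible) matrix

H_x = gaussian_entropy(C_x)
H_y = gaussian_entropy(W @ C_x @ W.T)   # covariance of y = W x is W C_x W^T

print(H_y, H_x + np.log(abs(np.linalg.det(W))))   # the two values agree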

The mutual information above can be written as

\begin{eqnarray*}
I(y_1,\cdots,y_n) & = & \sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n) \\
 & = & \sum_{i=1}^n H(y_i)-H(x_1,\cdots,x_n)-\log\,\vert\det{\mathbf W}\vert
\end{eqnarray*}

We further assume $y_i$ to be uncorrelated and of unit variance, i.e., the covariance matrix of ${\mathbf y}$ is

\begin{displaymath}
E\{{\mathbf yy^T}\}={\mathbf W}E\{{\mathbf xx^T}\}{\mathbf W^T}={\mathbf I}
\end{displaymath}

and its determinant is

\begin{displaymath}
\det{\mathbf I}=1=(\det{\mathbf W})\;(\det E\{{\mathbf xx}^T\})\;(\det{\mathbf W}^T)
\end{displaymath}
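
The following sketch (with an arbitrary example covariance for ${\mathbf x}$) verifies this numerically: any matrix ${\mathbf W}$ satisfying the whitening constraint, whether obtained from the eigendecomposition of $E\{{\mathbf xx}^T\}$ or from a further rotation of it, has the same value of $\vert\det{\mathbf W}\vert=(\det E\{{\mathbf xx}^T\})^{-1/2}$:

# A sketch showing that the whitening constraint W C_x W^T = I pins down
# |det W| = (det C_x)^(-1/2), the same for every admissible W; only a
# rotation of W remains free.
import numpy as np

C_x = np.array([[3.0, 1.0],
                [1.0, 2.0]])                     # covariance of x (arbitrary example)

# One whitening matrix from the eigendecomposition C_x = E D E^T.
d, E = np.linalg.eigh(C_x)
W1 = np.diag(d ** -0.5) @ E.T

# Any rotation of it is another valid whitening matrix.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W2 = R @ W1

for W in (W1, W2):
    print(np.allclose(W @ C_x @ W.T, np.eye(2)),   # the constraint holds
          abs(np.linalg.det(W)),                   # same value for both W ...
          np.linalg.det(C_x) ** -0.5)              # ... equal to (det C_x)^(-1/2)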

This means $\vert\det{\mathbf W}\vert=(\det E\{{\mathbf xx}^T\})^{-1/2}$ is a constant, the same for any ${\mathbf W}$ satisfying the whitening constraint. As the term $H(x_1,\cdots,x_n)$ in the mutual information expression is also a constant (invariant with respect to ${\mathbf W}$), we have

\begin{displaymath}I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)+\mbox{Constant} \end{displaymath}

i.e., minimization of mutual information $I(y_1,\cdots,y_n)$ is achieved by minimizing the entropies

\begin{displaymath}H(y_i)=-\int p_i(y_i)\,\log p_i(y_i)\,dy_i=-E\,\{\log p_i(y_i)\} \end{displaymath}

As the Gaussian density has maximal entropy among all densities of a given variance, minimizing the entropy $H(y_i)$ is equivalent to maximizing the non-Gaussianity of $y_i$.
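
For instance, comparing the closed-form differential entropies of a few unit-variance densities shows the Gaussian attaining the maximum:

# Closed-form differential entropies (nats) of a few unit-variance densities,
# illustrating that the Gaussian attains the maximum.
import numpy as np

H_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # Gaussian, ~ 1.419
H_uniform = np.log(2 * np.sqrt(3.0))         # uniform on [-sqrt(3), sqrt(3)], ~ 1.243
H_laplace = 1 + np.log(np.sqrt(2.0))         # Laplace with scale b = 1/sqrt(2): 1 + ln(2b), ~ 1.347

print(H_gauss, H_uniform, H_laplace)         # the Gaussian value is the largest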

Moreover, since all $y_i$ have the same unit variance, their negentropy becomes

\begin{displaymath}J(y_i)=H(y_G)-H(y_i)=C-H(y_i) \end{displaymath}

where $C=H(y_G)$ is the entropy of a Gaussian variable with unit variance, the same for all $y_i$. Substituting $H(y_i)=C-J(y_i)$ into the expression for the mutual information, and noting that the other two terms $H(x_1,\cdots,x_n)$ and $\log\,\vert\det{\mathbf W}\vert$ are both constant (the same for any ${\mathbf W}$), we get

\begin{displaymath}I(y_1,\cdots,y_n)=Const-\sum_{i=1}^n J(y_i) \end{displaymath}

where $Const$ is a constant (collecting the terms $C$, $H(x_1,\cdots,x_n)$ and $\log\,\vert\det{\mathbf W}\vert$) that is the same for any linear transform matrix ${\mathbf W}$. This is the fundamental relation between the mutual information and the negentropies of the variables $y_i$: decreasing the mutual information (making the variables less dependent) increases the total negentropy $\sum_{i=1}^n J(y_i)$, i.e., makes the $y_i$ less Gaussian. We therefore want to find a linear transform matrix ${\mathbf W}$ that minimizes the mutual information $I(y_1,\cdots,y_n)$, or, equivalently, maximizes the total negentropy (under the constraint that the $y_i$ are uncorrelated and of unit variance).
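
The sketch below illustrates this equivalence end to end on a toy two-source problem: the mixtures are whitened, and the remaining free rotation is chosen to maximize an approximate total negentropy. The kurtosis-based proxy for $J(y_i)$ is used here purely for illustration (it is not the only possible approximation), and the mixing matrix, sample size and rotation sweep are all arbitrary choices for the demonstration. Maximizing the total negentropy of the rotated outputs recovers the independent sources, i.e., minimizes their mutual information:

# A rough end-to-end sketch: after whitening, search over the remaining
# rotation for the one that maximizes an approximate total negentropy
# sum_i J(y_i), which by the relation above minimizes I(y_1, y_2).
import numpy as np

rng = np.random.default_rng(1)

# Two independent non-Gaussian (uniform, unit-variance) sources, linearly mixed.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                      # mixing matrix (arbitrary example)
X = A @ S

# Whiten: z = V x with E{z z^T} = I.
C = np.cov(X)
d, E = np.linalg.eigh(C)
V = np.diag(d ** -0.5) @ E.T
Z = V @ X

def approx_negentropy(y):
    """Kurtosis-based proxy for the negentropy of a zero-mean, unit-variance signal."""
    kurt = np.mean(y ** 4) - 3.0
    return kurt ** 2 / 48.0

# Sweep the free rotation angle; the best angle maximizes the total negentropy.
best_theta, best_J = 0.0, -np.inf
for theta in np.linspace(0, np.pi / 2, 181):
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Y = R @ Z
    J_total = sum(approx_negentropy(y) for y in Y)
    if J_total > best_J:
        best_theta, best_J = theta, J_total

W = np.array([[np.cos(best_theta), -np.sin(best_theta)],
              [np.sin(best_theta),  np.cos(best_theta)]]) @ V
print(W @ A)   # close to a signed permutation matrix: sources recovered up to order and sign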


Ruye Wang 2018-03-26