Minimization of Mutual Information

The mutual information $I(x,y)$ of two random variables $x$ and $y$ is defined as

$\displaystyle I(x,y)=H(x)+H(y)-H(x,y)=H(x)-H(x\vert y)=H(y)-H(y\vert x)$ (216)

Obviously, when $x$ and $y$ are independent, i.e., $H(y\vert x)=H(y)$ and $H(x\vert y)=H(x)$, their mutual information $I(x,y)$ is zero.

[Figure: mutual_info.gif]
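As a quick numerical check of Eq. (216) for discrete variables (a sketch; the joint distribution below is made up for illustration), the mutual information computed from the three entropies is positive for a dependent pair and vanishes when the joint distribution factorizes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution, ignoring zero cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Made-up 2x2 joint distribution p(x,y) of two dependent binary variables
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)          # marginals

I = entropy(px) + entropy(py) - entropy(pxy)       # Eq. (216)
print("dependent pair:   I(x,y) =", round(I, 4))   # > 0

# If p(x,y) = p(x)p(y), the mutual information vanishes
I0 = entropy(px) + entropy(py) - entropy(np.outer(px, py))
print("independent pair: I(x,y) =", round(I0, 4))  # ~ 0
```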

Similarly, the mutual information $I(y_1,\cdots,y_n)$ of a set of $n$ variables $y_i$ ($i=1,\cdots,n$) is defined as

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n)$ (217)
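For jointly Gaussian variables, Eq. (217) has a closed form, since $H(y_i)=\frac{1}{2}log(2\pi e\sigma_i^2)$ and $H(y_1,\cdots,y_n)=\frac{1}{2}log\,((2\pi e)^n\,det\,{\bf\Sigma})$. A small sketch with a made-up covariance matrix:

```python
import numpy as np

def gaussian_mutual_info(Sigma):
    """I(y_1,...,y_n) in nats for a zero-mean Gaussian with covariance Sigma,
       as the sum of marginal entropies minus the joint entropy (Eq. 217)."""
    Sigma = np.asarray(Sigma, dtype=float)
    n = Sigma.shape[0]
    h_marginal = 0.5 * np.log(2 * np.pi * np.e * np.diag(Sigma))        # H(y_i)
    h_joint = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
    return h_marginal.sum() - h_joint

print(gaussian_mutual_info([[1.0, 0.8],
                            [0.8, 1.0]]))   # correlated components -> I > 0
print(gaussian_mutual_info(np.eye(2)))      # independent components -> I = 0
```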

If random vector ${\bf y}=[y_1,\cdots,y_n]^T$ is a linear transform of another random vector ${\bf x}=[x_1,\cdots,x_n]^T$:

$\displaystyle y_i=\sum_{j=1}^n w_{ij} x_j,\;\;\;\;\;$or$\displaystyle \;\;\;\;{\bf y=Wx}$ (218)

then the entropy of ${\bf y}$ is related to that of ${\bf x}$ by:
$\displaystyle H(y_1,\cdots,y_n)=H(x_1,\cdots,x_n)+E\;\{ log\;\vert J(x_1,\cdots,x_n)\vert\}
=H(x_1,\cdots,x_n)+ log\;\vert det\; {\bf W}\vert$ (219)

where $J(x_1,\cdots,x_n)$ is the Jacobian of the above transformation:

$\displaystyle J(x_1,\cdots,x_n)=\left\vert \begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{array} \right\vert =det\;{\bf W}$ (220)
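Equation (219) can be verified in closed form when ${\bf x}$ is Gaussian, since ${\bf y=Wx}$ is then also Gaussian with covariance ${\bf W\Sigma_x W}^T$; the covariance and transform below are made up for illustration:

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance Sigma."""
    n = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))

Sigma_x = np.array([[2.0, 0.5],      # made-up covariance of x
                    [0.5, 1.0]])
W = np.array([[1.0, 2.0],            # made-up invertible transform
              [0.0, 3.0]])

H_x = gaussian_entropy(Sigma_x)
H_y = gaussian_entropy(W @ Sigma_x @ W.T)   # y = Wx has covariance W Sigma_x W^T

print(H_y)                                  # left-hand side of Eq. (219)
print(H_x + np.log(abs(np.linalg.det(W))))  # right-hand side: H(x) + log|det W|
```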

The mutual information above can be written as

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n)
=\sum_{i=1}^n H(y_i)-H(x_1,\cdots,x_n)-log\;\vert det\; {\bf W}\vert$ (221)

We further assume $y_i$ to be uncorrelated and of unit variance, i.e., the covariance matrix of ${\bf y}$ is

$\displaystyle E\{{\bf yy^T}\}={\bf W}E\{{\bf xx^T}\}{\bf W^T}={\bf I}$ (222)

and its determinant is

$\displaystyle det\;{\bf I}=1=(det\;{\bf W})\;(det\;E\{{\bf xx}^T\})\;(det\;{\bf W}^T)$ (223)

This means $\vert det\;{\bf W}\vert=(det\;E\{{\bf xx}^T\})^{-1/2}$ is a constant, determined by the data alone and therefore the same for any ${\bf W}$ satisfying the whitening constraint. Moreover, the second term in the mutual information expression, $H(x_1,\cdots,x_n)$, is likewise a constant (invariant with respect to ${\bf W}$), so we have

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)+$Constant (224)
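The constancy of $\vert det\;{\bf W}\vert$ under the whitening constraint of Eq. (222) can be checked numerically; the sketch below compares two standard whitening matrices (PCA and ZCA whitening) of the same made-up data set:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 10000)) * np.array([[3.0], [1.0], [0.5]])
A = np.array([[1.0, 0.4, 0.0],        # made-up mixing to correlate the components
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = A @ S

C = np.cov(X)                         # sample estimate of E{x x^T}
d, E = np.linalg.eigh(C)

W_pca = np.diag(d ** -0.5) @ E.T      # PCA whitening
W_zca = E @ np.diag(d ** -0.5) @ E.T  # ZCA whitening

for W in (W_pca, W_zca):
    print(np.round(np.cov(W @ X), 3))  # ~ identity, Eq. (222)
    print(abs(np.linalg.det(W)))       # same value for both W
print(np.linalg.det(C) ** -0.5)        # = (det E{xx^T})^{-1/2}
```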

That is, minimization of the mutual information $I(y_1,\cdots,y_n)$ is achieved by minimizing the marginal entropies

$\displaystyle H(y_i)=-\int p_i(y_i) log\;p_i(y_i) dy_i=-E\{ log\;p_i(y_i) \}$ (225)

As the Gaussian density has the maximal entropy among all densities with the same variance, minimizing the entropy $H(y_i)$ at fixed unit variance is equivalent to making $y_i$ as non-Gaussian as possible.
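This can be seen from the closed-form entropies of a few unit-variance densities; the sketch below also computes $C-H$, the quantity identified below as the negentropy:

```python
import numpy as np

# Closed-form differential entropies (nats) of three unit-variance densities
H_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # N(0,1)
H_uniform = np.log(2 * np.sqrt(3.0))         # uniform on [-sqrt(3), sqrt(3)]
H_laplace = 1.0 + np.log(np.sqrt(2.0))       # Laplace with scale b = 1/sqrt(2)

C = H_gauss                                  # entropy of the unit-variance Gaussian
for name, H in [("gaussian", H_gauss), ("uniform", H_uniform), ("laplace", H_laplace)]:
    print(f"{name:8s}  H = {H:.4f}   C - H = {C - H:.4f}")
```

The Gaussian has the largest entropy of the three, so $C-H$ is zero for the Gaussian and positive for the other two.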

Moreover, since all $y_i$ have the same unit variance, their negentropy becomes

$\displaystyle J(y_i)=H(y_G)-H(y_i)=C-H(y_i)$ (226)

where $C=H(y_G)$ is the entropy of a unit-variance Gaussian, the same for all $y_i$. Substituting $H(y_i)=C-J(y_i)$ into the expression for the mutual information, and noting that the other two terms $H({\bf x})$ and $log\;\vert det\;{\bf W}\vert$ are both constant (the same for any whitening ${\bf W}$), we get

$\displaystyle I(y_1,\cdots,y_n)=Const-\sum_{i=1}^n J(y_i)$ (227)

where $Const$ is a constant (collecting the terms $C$, $H({\bf x})$ and $log\;\vert det\;{\bf W}\vert$) that is the same for any linear transform matrix ${\bf W}$ satisfying the whitening constraint. This is the fundamental relation between the mutual information and the negentropies of the variables $y_i$: if the mutual information of the set of variables decreases (the variables become less dependent), the total negentropy $\sum_{i=1}^n J(y_i)$ increases, i.e., the $y_i$ become less Gaussian. We therefore seek a linear transform matrix ${\bf W}$ that minimizes the mutual information $I(y_1,\cdots,y_n)$, or, equivalently, maximizes the total negentropy (under the assumption that the $y_i$ are uncorrelated and of unit variance).
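To make this concrete, the following sketch mixes two independent uniform (non-Gaussian) sources, whitens the mixture, and restricts ${\bf W}$ to a rotation of the whitened data; the rotation that maximizes the total negentropy recovers the independent sources. The $log\,cosh$-based negentropy approximation used here is a common proxy that is not derived in this section, and the mixing matrix is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000

# Two independent, non-Gaussian (uniform), zero-mean, unit-variance sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, N))
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                    # made-up mixing matrix
X = A @ S

# Whiten the mixture so that the remaining freedom in W is a rotation
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d ** -0.5) @ E.T @ X

G = lambda u: np.log(np.cosh(u))
g_gauss = np.mean(G(rng.standard_normal(N)))  # E{G(nu)}, nu ~ N(0,1)

def negentropy(y):
    """Approximate negentropy J(y) ~ (E{G(y)} - E{G(nu)})^2 (a proxy, not exact)."""
    return (np.mean(G(y)) - g_gauss) ** 2

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

angles = np.linspace(0.0, np.pi / 2, 91)
best = max(angles, key=lambda a: sum(negentropy(y) for y in rot(a) @ Z))
Y = rot(best) @ Z

# Each row of Y should match one source up to sign: the cross-correlation of
# Y with S is close to a signed permutation matrix
print(np.round(np.corrcoef(np.vstack([Y, S]))[:2, 2:], 2))
```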