
Gaussian process

The function $f({\bf x},{\bf w})$ can be written as

\begin{displaymath}f_n=\sum_{h=1}^H w_h\phi_h({\bf x}^{(n)})\;\;(n=1,\cdots,N)\;\;\;
\mbox{or in matrix form} \;\;\;{\bf f}={\bf\Phi} {\bf w} \end{displaymath}

If the prior distribution of the weights ${\bf w}$ is zero-mean Gaussian:

\begin{displaymath}p({\bf w})=N(0,\Sigma_p) \end{displaymath}

then the function values ${\bf f}={\bf\Phi}{\bf w}$, being a linear function of ${\bf w}$, are also zero-mean Gaussian:

\begin{displaymath}p({\bf f})=N(0,{\bf Q}) \end{displaymath}

where ${\bf Q}_{N\times N}$ is the covariance matrix of ${\bf f}$:

\begin{displaymath}{\bf Q}=Cov({\bf f})=E({\bf ff}^T)=E({\bf\Phi w}({\bf\Phi w})^T)
={\bf\Phi}E({\bf ww}^T){\bf\Phi}^T={\bf\Phi}\Sigma_p{\bf\Phi}^T \end{displaymath}

with the mn-th component being:

\begin{displaymath}Q_{mn}=\phi^T({\bf x}^{(m)}) \Sigma_p \phi({\bf x}^{(n)})
=k({\bf x}^{(m)},{\bf x}^{(n)}) \end{displaymath}

In particular, when $\Sigma_p=\sigma^2_p{\bf I}$, we get

\begin{displaymath}Q_{mn}=\sigma^2_p \phi^T({\bf x}^{(m)}) \phi({\bf x}^{(n)})
=\sigma^2_p \sum_{h=1}^H \phi_h({\bf x}^{(m)})\phi_h({\bf x}^{(n)}) \end{displaymath}
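As a quick numerical sketch of this relation (assuming a hypothetical polynomial basis $\phi_h(x)=x^{h-1}$ and $\sigma_p=1$, neither of which is specified in the text), the covariance of ${\bf f}={\bf\Phi}{\bf w}$ can be checked empirically:

\begin{verbatim}
import numpy as np

# Empirical check: f = Phi w with w ~ N(0, sigma_p^2 I) has covariance
# Q = E[f f^T] = sigma_p^2 Phi Phi^T.
rng = np.random.default_rng(0)
sigma_p, H = 1.0, 4
x = np.array([0.3, 1.2, 2.5])                   # N = 3 input points
Phi = np.vstack([x**h for h in range(H)]).T     # N x H matrix of basis values
Q = sigma_p**2 * Phi @ Phi.T                    # theoretical covariance
W = rng.normal(0.0, sigma_p, size=(H, 100000))  # many weight samples
F = Phi @ W                                     # corresponding function values
Q_emp = F @ F.T / W.shape[1]                    # empirical E[f f^T]
print(np.max(np.abs(Q_emp - Q)) / np.max(Q))    # small relative error
\end{verbatim}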

The function $f$ is a Gaussian process, as the distribution $p(f({\bf x}^{(1)}),\cdots,f({\bf x}^{(N)}))$ of any $N$ of its values corresponding to $N$ input points is Gaussian. As the noise $\epsilon \sim N(0,\sigma_n^2)$ is also Gaussian, the distribution of the $N$ values of the output $y=f({\bf x})+\epsilon$ is also Gaussian:

\begin{displaymath}{\bf y} \sim N(0,{\bf C})=N(0, {\bf Q}+\sigma_n^2{\bf I}) \end{displaymath}

where ${\bf C}_{N\times N}$ is the covariance matrix of ${\bf y}$:

\begin{displaymath}{\bf C}={\bf Q}+\sigma_n^2{\bf I}={\bf\Phi}\Sigma_p{\bf\Phi}^T
+\sigma_n^2{\bf I} \end{displaymath}

with the mn-th component being:

\begin{displaymath}C_{mn}=k({\bf x}^{(m)},{\bf x}^{(n)})+\delta_{mn}\sigma_n^2 \end{displaymath}

The result above can be generalized to the case $H \rightarrow \infty$, i.e., the function $f$ is expressed as a linear combination (an integration rather than a summation) of basis functions. For example, assume $d=1$ and the $h$-th basis is a radial function centered at $h$:

\begin{displaymath}\phi_h(x^{(n)})=exp(-\frac{(x^{(n)}-h)^2}{2r^2}) \end{displaymath}

and $\Sigma_p=\sigma^2_p{\bf I}$, the covariance $Q_{mn}$ becomes

\begin{eqnarray*}
Q_{mn} &=& Cov(x^{(m)},x^{(n)})
= S\;\int_{-\infty}^{\infty} \phi_h(x^{(m)})\phi_h(x^{(n)})\; dh \\
&=& S\;\int_{-\infty}^{\infty} exp(-\frac{(x^{(m)}-h)^2}{2r^2})\;
exp(-\frac{(x^{(n)}-h)^2}{2r^2})\; dh \\
&=& S'\; exp(-\frac{(x^{(m)}-x^{(n)})^2}{4r^2})
\end{eqnarray*}
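The last step is a standard Gaussian integral: completing the square in $h$,

\begin{displaymath}(x^{(m)}-h)^2+(x^{(n)}-h)^2
=2\left(h-\frac{x^{(m)}+x^{(n)}}{2}\right)^2+\frac{(x^{(m)}-x^{(n)})^2}{2} \end{displaymath}

and integrating the remaining Gaussian factor over $h$ gives $\sqrt{\pi}\,r$, so that $S'=\sqrt{\pi}\,r\,S$.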

where $S$ and $S'$ are scaling factors (including $\sigma^2_p$). More generally, when $d>1$, the covariance matrix of the $N$ function values ${\bf f}$ at the $N$ input points ${\bf x}^{(n)},\;(n=1,\cdots,N)$ can be defined as

\begin{displaymath}Q_{mn}=k({\bf x}^{(m)},{\bf x}^{(n)}),\;\;\;\;(m,n=1,\cdots,N) \end{displaymath}

Now the regression problem can be approached from an entirely different point of view. Instead of specifying the basis functions and some model parameters (e.g., the weights ${\bf w}$), we can assume the $N$ function values ${\bf f}$ to form a Gaussian process and construct its covariance matrix directly (while always assuming a zero mean vector):

\begin{displaymath}\Sigma_f={\bf Q}=Cov({\bf f})=\left[ \begin{array}{ccc}
k({\bf x}^{(1)},{\bf x}^{(1)}) & \cdots & k({\bf x}^{(1)},{\bf x}^{(N)}) \\
\vdots & \ddots & \vdots \\
k({\bf x}^{(N)},{\bf x}^{(1)}) & \cdots & k({\bf x}^{(N)},{\bf x}^{(N)})
\end{array} \right]_{N\times N}
=K({\bf X},{\bf X}) \end{displaymath}

The mn-th component of the covariance matrix is

\begin{displaymath}Q_{mn}=k({\bf x}^{(m)},{\bf x}^{(n)}) \end{displaymath}

which can be constructed according to the specific problem. For example, the following covariance functions can be used:

\begin{displaymath}Q_{mn}=c\;exp(-\frac{({\bf x}^{(m)}-{\bf x}^{(n)})^T({\bf x}^{(m)}-{\bf x}^{(n)})}{2r^2}) \end{displaymath}


\begin{displaymath}Q_{mn}=c\;exp(-\frac{sin^2(\pi
({\bf x}^{(m)}-{\bf x}^{(n)})^T({\bf x}^{(m)}-{\bf x}^{(n)}))}{2r^2}) \end{displaymath}


\begin{displaymath}Q_{mn}=c\;exp(-\frac{({\bf x}^{(m)}-{\bf x}^{(n)})^T({\bf x}^{(m)}-{\bf x}^{(n)})}{2r^2})+x^{(m)}x^{(n)} \end{displaymath}
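As a rough sketch (written for scalar inputs, with hypothetical values of $c$ and $r$; not code from the text), these three covariance functions can be expressed as:

\begin{verbatim}
import numpy as np

# Three example covariance functions for scalar inputs xm, xn
# (c and r are hypothetical hyperparameters).
def k_se(xm, xn, c=1.0, r=1.0):       # squared-exponential covariance
    return c * np.exp(-(xm - xn)**2 / (2 * r**2))

def k_per(xm, xn, c=1.0, r=1.0):      # periodic-type covariance
    return c * np.exp(-np.sin(np.pi * (xm - xn)**2)**2 / (2 * r**2))

def k_se_lin(xm, xn, c=1.0, r=1.0):   # squared-exponential plus linear term
    return k_se(xm, xn, c, r) + xm * xn
\end{verbatim}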

Comments:

For the output $y=f({\bf x})+\epsilon$, the covariance matrix becomes:

\begin{displaymath}C_{mn}=Q_{mn}+\sigma_n^2\delta_{mn},\;\;\;\;(m,n=1,\cdots,N)
\;\;\;\;\mbox{or}\;\;\;\;{\bf C}={\bf Q}+\sigma_n^2 {\bf I} \end{displaymath}

and its prior distribution is:

\begin{displaymath}p({\bf y})=N(0,{\bf C})=\frac{1}{Z}e^{-\frac{1}{2}{\bf y}^T{\bf C}^{-1}{\bf y}} \end{displaymath}

Samples can be drawn from this prior distribution $N(0,{\bf C})$ to obtain $N$ values of the output $y$ at the $N$ input points, i.e., various curves in 1-D space.
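For instance, the following minimal sketch (assuming a 1-D input grid, the squared-exponential covariance, and hypothetical values of $c$, $r$ and $\sigma_n$) draws a few such sample curves:

\begin{verbatim}
import numpy as np

# Build C = Q + sigma_n^2 I on a 1-D grid, with the squared-exponential
# covariance Q_mn = c * exp(-(x_m - x_n)^2 / (2 r^2)), then sample curves.
c, r, sigma_n = 1.0, 1.0, 0.1                    # hypothetical hyperparameters
x = np.linspace(-5.0, 5.0, 100)                  # N = 100 input points
Q = c * np.exp(-(x[:, None] - x[None, :])**2 / (2 * r**2))
C = Q + sigma_n**2 * np.eye(len(x))
curves = np.random.multivariate_normal(np.zeros(len(x)), C, size=5)
\end{verbatim}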

[Figure gpfig1.gif: sample curves drawn from the prior distribution $N(0,{\bf C})$]

Now the regression problem of finding the underlying function $f({\bf x})$ to fit the observed data

\begin{displaymath}{\cal D}=({\bf X}, {\bf y})=\{ ({\bf x}^{(n)}, y^{(n)}), (n=1,\cdots,N)\}\end{displaymath}

is turned into a problem of finding the posterior distribution of the output:

\begin{displaymath}p({\bf y}^*\vert{\bf X}^*,{\cal D})=p({\bf y}^*\vert{\bf X}^*,{\bf X},{\bf y})
=\frac{p({\bf y}^*,{\bf y}\vert{\bf X}^*,{\bf X})}{p({\bf y}\vert{\bf X}^*,{\bf X})}
\end{displaymath}

As the function $f({\bf x})$ as well as the output $y=f({\bf x})+\epsilon$ are Gaussian processes, the conditional distribution of ${\bf y}^*$ given ${\bf y}$ can be found as shown below (proof given in Appendix A).

We first consider finding the conditional distribution of function values ${\bf f}^*$. Set up the function vector containing both the known values ${\bf f}$ and the prediction ${\bf f}^*$:

\begin{displaymath}\left[ \begin{array}{c} {\bf f} \\ {\bf f}^* \end{array} \right] \end{displaymath}

The normal distribution of this vector is determined by the mean vector (assumed to be zero) and the constructed covariance matrix, which is expressed in four blocks according to the dimensions (number of points) of ${\bf f}$ and ${\bf f}^*$:

\begin{displaymath}\Sigma_f=Cov\left[ \begin{array}{c} {\bf f} \\ {\bf f}^* \end{array} \right]
=\left[ \begin{array}{cc} K({\bf X},{\bf X}) & K({\bf X},{\bf X}^*) \\
K({\bf X}^*,{\bf X}) & K({\bf X}^*,{\bf X}^*) \end{array} \right] \end{displaymath}

Now the conditional distribution of ${\bf f}^*$ given ${\bf f}$ can be found:

\begin{displaymath}p({\bf f}^*\vert{\bf X}^*,{\bf X},{\bf f})=N(\mu_{(f^*\vert f)},\Sigma_{(f^*\vert f)}) \end{displaymath}

where

\begin{displaymath}\left\{ \begin{array}{l}
\mu_{(f^*\vert f)}=K({\bf X}^*,{\bf X})K({\bf X},{\bf X})^{-1}{\bf f} \\
\Sigma_{(f^*\vert f)}=K({\bf X}^*,{\bf X}^*)-K({\bf X}^*,{\bf X})K({\bf X},{\bf X})^{-1}K({\bf X},{\bf X}^*) \end{array} \right. \end{displaymath}

As the covariance matrix $K({\bf X},{\bf X})$ is symmetric, $K({\bf X}^*,{\bf X})^T=K({\bf X},{\bf X}^*)$.

If ${\bf Q}$ above is replaced by ${\bf C}={\bf Q}+\sigma_n^2{\bf I}$, the discussion also holds for the output ${\bf y}$.

The samples drawn from this posterior distribution $p({\bf y}^*\vert{\bf X}^*,{\cal D})=p({\bf y}^*\vert{\bf X}^*,{\bf X},{\bf y})$ are different curves that interpolate (fit) the observed data, i.e., the $N$ data points ${\cal D}=\{({\bf x}^{(n)},y^{(n)}), (n=1,\cdots,N) \}$, and they also predict the outputs $y^*$ at arbitrary input points ${\bf x}^*$.
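A minimal numerical sketch of these predictive equations (assuming the squared-exponential covariance with hypothetical values of $c$, $r$ and $\sigma_n$; not the author's code):

\begin{verbatim}
import numpy as np

def gp_predict(X, y, Xs, c=1.0, r=1.0, sigma_n=0.1):
    """Posterior mean and covariance of y* at 1-D test inputs Xs:
       mu    = K(X*,X) [K(X,X)+sigma_n^2 I]^{-1} y
       Sigma = K(X*,X*) - K(X*,X) [K(X,X)+sigma_n^2 I]^{-1} K(X,X*)"""
    k = lambda A, B: c * np.exp(-(A[:, None] - B[None, :])**2 / (2 * r**2))
    K   = k(X, X) + sigma_n**2 * np.eye(len(X))     # C = Q + sigma_n^2 I
    Ks  = k(Xs, X)                                  # K(X*, X)
    Kss = k(Xs, Xs)                                 # K(X*, X*)
    mu    = Ks @ np.linalg.solve(K, y)
    Sigma = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu, Sigma

# Example usage with a few 1-D data points:
X  = np.array([-2.0, 0.0, 1.5])
y  = np.array([ 0.5, -0.3, 1.0])
Xs = np.linspace(-3.0, 3.0, 50)
mu, Sigma = gp_predict(X, y, Xs)
\end{verbatim}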

[Figure gpfig2.gif: sample curves drawn from the posterior distribution $p({\bf y}^*\vert{\bf X}^*,{\cal D})$, fitting the observed data points]

In summary, the regression problem can be approached in two different ways: the weight space view and the function space view.

These two views can be unified by Mercer's theorem: a given positive definite covariance function can be decomposed as

\begin{displaymath}Q_{mn}=k({\bf x}^{(m)},{\bf x}^{(n)})=\sum_{i=1}^\infty \lambda_i
\phi_i({\bf x}^{(m)})\phi_i({\bf x}^{(n)}) \end{displaymath}

where $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions respectively:

\begin{displaymath}\int k({\bf x}^{(m)},{\bf x}^{(n)})\phi_i({\bf x}^{(m)})d{\bf x}^{(m)}
=\lambda_i \phi_i({\bf x}^{(n)}) \end{displaymath}
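As a rough numerical illustration (a Nystrom-type discretization on a uniform grid with a hypothetical squared-exponential kernel; not part of the text), the eigenvalues and eigenfunctions can be approximated from the kernel matrix:

\begin{verbatim}
import numpy as np

# Discretize the integral eigenproblem
#   int k(x, x') phi_i(x') dx' = lambda_i phi_i(x)
# on a uniform grid: eigenpairs of K*dx approximate (lambda_i, phi_i).
x = np.linspace(-3.0, 3.0, 200)
dx = x[1] - x[0]
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 1.0**2))  # SE kernel matrix
evals, evecs = np.linalg.eigh(K)                          # ascending order
lam = evals[::-1] * dx                 # approximate Mercer eigenvalues
phi = evecs[:, ::-1] / np.sqrt(dx)     # approximately orthonormal eigenfunctions
\end{verbatim}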

In the weight space view, the covariance matrix is a result of the basis functions and their weights, while in the function space view, the covariance matrix is constructed directly without explicitly specifying the basis functions and their weights.


Ruye Wang 2006-11-14