
Linear Regression

In a general regression problem, the observed data (training data) are

\begin{displaymath}{\cal D}=\{ ({\bf x}^{(n)},y^{(n)}),\;\;n=1,\cdots, N \} \end{displaymath}

where ${\bf x}^{(n)}=[x_1^{(n)},\cdots,x_d^{(n)}]^T$ ($n=1,\cdots,N$) is one of the $N$ input vectors, each of dimensionality $d$, and $y^{(n)}$ is the corresponding scalar output, assumed to be generated by some underlying process described by a function $f({\bf x})$ with additive noise, i.e.,

\begin{displaymath}y=f( {\bf x} )+\epsilon \end{displaymath}

The goal of the regression is to infer the function $f$ based on ${\cal D}$, and to predict the output $y^*$ for a new input ${\bf x}^*$.

The simplest form of regression is linear regression based on the assumption that the underlying function $f({\bf x})$ is a linear combination of all components of the input vector with weights ${\bf w}=[w_1,\cdots,w_d]^T$:

\begin{displaymath}y=f({\bf x})={\bf x}^T{\bf w}=\sum_{i=1}^d w_i x_i \end{displaymath}

This can be expressed in matrix form for all $N$ data points:

\begin{displaymath}{\bf X}^T {\bf w}={\bf y} \end{displaymath}

where ${\bf X}=[{\bf x}^{(1)},\cdots,{\bf x}^{(N)}]$ is a $d\times N$ matrix whose $n$th column is the input vector ${\bf x}^{(n)}$, and ${\bf y}=[y^{(1)},\cdots,y^{(N)}]^T$ is an $N$-dimensional vector of the $N$ output values. In general $N>d$, and the linear regression can be solved by the least-squares method to get

\begin{displaymath}\hat{{\bf w}}=({\bf X}^T)^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y} \end{displaymath}

where $({\bf X}^T)^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the pseudo-inverse of the matrix ${\bf X}^T$ in the over-determined system above.
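As a quick numerical sanity check, here is a minimal NumPy sketch of this least-squares solution; the data matrix, true weights, and noise level below are made up purely for illustration.

import numpy as np

# Synthetic data: N=50 points of dimension d=3, generated from a known weight
# vector plus Gaussian noise (all values here are assumed for illustration only).
rng = np.random.default_rng(0)
d, N = 3, 50
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(d, N))              # d x N matrix; nth column is x^(n)
y = X.T @ w_true + 0.1 * rng.normal(size=N)

# Least-squares estimate  w_hat = (X X^T)^{-1} X y
w_hat = np.linalg.solve(X @ X.T, X @ y)
print(w_hat)                             # should be close to w_true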

Alternatively, the regression problem can be viewed as a Bayesian inference process. We can assume both the model parameters and the noise are normally distributed:

\begin{displaymath}{\bf w} \sim N(0,\Sigma_p),\;\;\;\;\;\;\epsilon \sim N(0,\sigma_n^2) \end{displaymath}

i.e., the noise $\epsilon$ in the $N$ different data points is independent. The likelihood of the model parameters ${\bf w}$ given the data ${\cal D}$ is
\begin{eqnarray*}
{\cal L}({\bf w}\vert{\cal D}) &=& p({\cal D}\vert{\bf w})=p({\bf y}\vert{\bf X},{\bf w}) \\
&=& \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma_n^2}}\exp\left[-\frac{(y^{(n)}-{\bf w}^T{\bf x}^{(n)})^2}{2\sigma_n^2}\right]
= N({\bf X}^T{\bf w},\;\sigma_n^2 I)
\end{eqnarray*}
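To make the likelihood concrete, the short sketch below (reusing the synthetic X, y and w_hat, and an assumed noise level sigma_n, from the previous snippet) evaluates the Gaussian log-likelihood of the data for a given weight vector.

# Log-likelihood of the data under y = X^T w + noise, with independent
# zero-mean Gaussian noise of variance sigma_n^2 at each of the N points.
sigma_n = 0.1                            # assumed noise standard deviation
def log_likelihood(w, X, y, sigma_n):
    r = y - X.T @ w                      # residuals y^(n) - w^T x^(n)
    return -0.5 * (r @ r) / sigma_n**2 - 0.5 * len(y) * np.log(2 * np.pi * sigma_n**2)

print(log_likelihood(w_hat, X, y, sigma_n))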

According to Bayes' theorem, the posterior of the parameters is proportional to the product of the likelihood and the prior:
\begin{eqnarray*}
p({\bf w}\vert{\bf y},{\bf X}) &=& \frac{p({\bf y}\vert{\bf X},{\bf w})\, p({\bf w})}{p({\bf y}\vert{\bf X})}
\propto p({\bf y}\vert{\bf X},{\bf w})\, p({\bf w}) \\
&=& c_1\,\exp\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}({\bf y}-{\bf X}^T{\bf w})^T({\bf y}-{\bf X}^T{\bf w})+{\bf w}^T\Sigma_p^{-1}{\bf w}\right)\right] \\
&=& c_1\,\exp\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}{\bf y}^T{\bf y}+\frac{1}{\sigma_n^2}{\bf w}^T{\bf X}{\bf X}^T{\bf w}
-\frac{2}{\sigma_n^2}{\bf w}^T{\bf X}{\bf y}+{\bf w}^T\Sigma_p^{-1}{\bf w}\right)\right] \\
&=& c_2\,\exp\left[-\frac{1}{2}\left({\bf w}^T\left(\frac{1}{\sigma_n^2}{\bf X}{\bf X}^T+\Sigma_p^{-1}\right){\bf w}
-\frac{2}{\sigma_n^2}{\bf w}^T{\bf X}{\bf y}\right)\right] \\
&=& c_3\,\exp\left[-\frac{1}{2}({\bf w}-\mu_w)^T\Sigma_w^{-1}({\bf w}-\mu_w)\right]
= N(\mu_w,\Sigma_w)
\end{eqnarray*}

where

\begin{displaymath}\Sigma_w=\left(\frac{1}{\sigma_n^2}{\bf X}{\bf X}^T+\Sigma_p^{-1}\right)^{-1},\;\;\;\;\;\;
\mu_w=\frac{1}{\sigma_n^2}\Sigma_w {\bf X}{\bf y} \end{displaymath}
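The posterior mean and covariance can be computed directly from these expressions; the sketch below assumes an isotropic prior $\Sigma_p=\sigma_p^2 I$ with a made-up prior scale, and reuses X, y and sigma_n from the snippets above.

# Posterior of w:  Sigma_w = (X X^T / sigma_n^2 + Sigma_p^{-1})^{-1},
#                  mu_w    = Sigma_w X y / sigma_n^2
sigma_p = 1.0                                    # assumed prior scale, Sigma_p = sigma_p^2 I
Sigma_p_inv = np.eye(d) / sigma_p**2
Sigma_w = np.linalg.inv(X @ X.T / sigma_n**2 + Sigma_p_inv)
mu_w = Sigma_w @ (X @ y) / sigma_n**2
print(mu_w)                                      # shrunk toward 0 relative to w_hat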

The predictive distribution of $y^*$ given ${\bf x}^*$ is the average over all possible parameter values weighted by their posterior probability:
\begin{eqnarray*}
p(y^*\vert{\bf x}^*,{\bf X},{\bf y}) &=& \int p(y^*,{\bf w}\vert {\bf x}^*,{\bf X},{\bf y})\, d{\bf w}
=\int p(y^*\vert {\bf x}^*,{\bf w})\, p({\bf w}\vert{\bf X},{\bf y})\, d{\bf w} \\
&=& N(\mu_w^T {\bf x}^*,\; {\bf x}^{*T}\Sigma_w {\bf x}^*)
\end{eqnarray*}

i.e., as $y^*={\bf w}^T{\bf x}^*$ is a linear function of the Gaussian-distributed ${\bf w}$, the prediction is again Gaussian, with mean $\mu_w^T{\bf x}^*$ and variance ${\bf x}^{*T}\Sigma_w{\bf x}^*$.
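Given $\mu_w$ and $\Sigma_w$, the predictive mean and variance at a new test input follow in one line each; the test point x_star below is arbitrary.

# Predictive distribution of y* = w^T x* at a test input x*.
x_star = np.array([0.5, 1.0, -0.3])              # arbitrary test input
pred_mean = mu_w @ x_star                         # mu_w^T x*
pred_var = x_star @ Sigma_w @ x_star              # x*^T Sigma_w x*
print(pred_mean, pred_var)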


Ruye Wang 2006-11-14