
Linear Regression

In a general regression problem, the observed data (training data) are

\begin{displaymath}{\cal D}=\{ ({\bf x}^{(n)},y^{(n)}),\;\;n=1,\cdots, N \} \end{displaymath}

where ${\bf x}^{(n)}=[x_1^{(n)},\cdots,x_d^{(n)}]^T$ ($n=1,\cdots,N$) is one of the $N$ input vectors, each of dimensionality $d$, and $y^{(n)}$ is the corresponding scalar output, assumed to be generated by some underlying process described by a function $f({\bf x})$ with additive noise, i.e.,

\begin{displaymath}y=f( {\bf x} )+\epsilon \end{displaymath}

The goal of the regression is to infer the function $f$ based on ${\cal D}$, and to predict the output $y^*$ for a new input ${\bf x}^*$.

The simplest form of regression is linear regression based on the assumption that the underlying function $f({\bf x})$ is a linear combination of all components of the input vector with weights ${\bf w}=[w_1,\cdots,w_d]^T$:

\begin{displaymath}y=f({\bf x})={\bf x}^T{\bf w}=\sum_{i=1}^d w_i x_i \end{displaymath}

This can be expressed in matrix form for all $N$ data points:

\begin{displaymath}{\bf X}^T {\bf w}={\bf y} \end{displaymath}

where ${\bf X}=[{\bf x}^{(1)},\cdots,{\bf x}^{(N)}]$ is a $d\times N$ matrix whose $n$th column is the input vector ${\bf x}^{(n)}$, and ${\bf y}=[y^{(1)},\cdots,y^{(N)}]^T$ is an $N$-dimensional vector of the $N$ output values. In general $N>d$, and the linear regression can be solved by the least-squares method to get

\begin{displaymath}\hat{{\bf w}}=({\bf X}^T)^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y} \end{displaymath}

where $({\bf X}^T)^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the pseudo-inverse of the matrix ${\bf X}^T$ in the over-determined system above.
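As a quick numerical sanity check, here is a minimal NumPy sketch of this least-squares solution; the data matrix, true weights, and noise level below are made up purely for illustration.

import numpy as np

# Synthetic data: N=50 points of dimension d=3, generated from a known weight
# vector plus Gaussian noise (all values here are assumed for illustration only).
rng = np.random.default_rng(0)
d, N = 3, 50
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(d, N))              # d x N matrix; nth column is x^(n)
y = X.T @ w_true + 0.1 * rng.normal(size=N)

# Least-squares estimate  w_hat = (X X^T)^{-1} X y
w_hat = np.linalg.solve(X @ X.T, X @ y)
print(w_hat)                             # should be close to w_true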

Alternatively, the regression problem can be viewed as a Bayesian inference process. We can assume both the model parameters and the noise are normally distributed:

\begin{displaymath}{\bf w} \sim N(0,\Sigma_p),\;\;\;\;\;\;\epsilon \sim N(0,\sigma_n^2) \end{displaymath}

i.e., the noise $\epsilon$ in the $N$ different data points is independent. The likelihood of the model parameters ${\bf w}$ given the data ${\cal D}$ is
\begin{eqnarray*}
{\cal L}({\bf w}\vert{\cal D}) &=& p({\cal D}\vert{\bf w})=p({\bf y}\vert{\bf X},{\bf w}) \\
&=& \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma_n^2}}\exp\left[-\frac{(y^{(n)}-{\bf w}^T{\bf x}^{(n)})^2}{2\sigma_n^2}\right]
= N({\bf X}^T{\bf w},\;\sigma_n^2 I)
\end{eqnarray*}
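To make the likelihood concrete, the short sketch below (reusing the synthetic X, y and w_hat, and an assumed noise level sigma_n, from the previous snippet) evaluates the Gaussian log-likelihood of the data for a given weight vector.

# Log-likelihood of the data under y = X^T w + noise, with independent
# zero-mean Gaussian noise of variance sigma_n^2 at each of the N points.
sigma_n = 0.1                            # assumed noise standard deviation
def log_likelihood(w, X, y, sigma_n):
    r = y - X.T @ w                      # residuals y^(n) - w^T x^(n)
    return -0.5 * (r @ r) / sigma_n**2 - 0.5 * len(y) * np.log(2 * np.pi * sigma_n**2)

print(log_likelihood(w_hat, X, y, sigma_n))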

According to Bayes' theorem, the posterior of the parameters is proportional to the product of the likelihood and the prior:
\begin{eqnarray*}
p({\bf w}\vert{\bf y},{\bf X}) &=& \frac{p({\bf y}\vert{\bf X},{\bf w})\, p({\bf w})}{p({\bf y}\vert{\bf X})}
\propto p({\bf y}\vert{\bf X},{\bf w})\, p({\bf w}) \\
&=& c_1\,\exp\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}({\bf y}-{\bf X}^T{\bf w})^T({\bf y}-{\bf X}^T{\bf w})+{\bf w}^T\Sigma_p^{-1}{\bf w}\right)\right] \\
&=& c_1\,\exp\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}{\bf y}^T{\bf y}+\frac{1}{\sigma_n^2}{\bf w}^T{\bf X}{\bf X}^T{\bf w}
-\frac{2}{\sigma_n^2}{\bf w}^T{\bf X}{\bf y}+{\bf w}^T\Sigma_p^{-1}{\bf w}\right)\right] \\
&=& c_2\,\exp\left[-\frac{1}{2}\left({\bf w}^T\left(\frac{1}{\sigma_n^2}{\bf X}{\bf X}^T+\Sigma_p^{-1}\right){\bf w}
-\frac{2}{\sigma_n^2}{\bf w}^T{\bf X}{\bf y}\right)\right] \\
&=& c_3\,\exp\left[-\frac{1}{2}({\bf w}-\mu_w)^T\Sigma_w^{-1}({\bf w}-\mu_w)\right]
= N(\mu_w,\Sigma_w)
\end{eqnarray*}

where

\begin{displaymath}\Sigma_w=\left(\frac{1}{\sigma_n^2}{\bf X}{\bf X}^T+\Sigma_p^{-1}\right)^{-1},\;\;\;\;\;\;
\mu_w=\frac{1}{\sigma_n^2}\Sigma_w {\bf X}{\bf y} \end{displaymath}
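The posterior mean and covariance can be computed directly from these expressions; the sketch below assumes an isotropic prior $\Sigma_p=\sigma_p^2 I$ with a made-up prior scale, and reuses X, y and sigma_n from the snippets above.

# Posterior of w:  Sigma_w = (X X^T / sigma_n^2 + Sigma_p^{-1})^{-1},
#                  mu_w    = Sigma_w X y / sigma_n^2
sigma_p = 1.0                                    # assumed prior scale, Sigma_p = sigma_p^2 I
Sigma_p_inv = np.eye(d) / sigma_p**2
Sigma_w = np.linalg.inv(X @ X.T / sigma_n**2 + Sigma_p_inv)
mu_w = Sigma_w @ (X @ y) / sigma_n**2
print(mu_w)                                      # shrunk toward 0 relative to w_hat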

The predictive distribution of $y^*$ given ${\bf x}^*$ is the average over all possible parameter values weighted by their posterior probability:
\begin{eqnarray*}
p(y^*\vert{\bf x}^*,{\bf X},{\bf y}) &=& \int p(y^*,{\bf w}\vert {\bf x}^*,{\bf X},{\bf y})\, d{\bf w}
=\int p(y^*\vert {\bf x}^*,{\bf w})\, p({\bf w}\vert{\bf X},{\bf y})\, d{\bf w} \\
&=& N(\mu_w^T {\bf x}^*,\; {\bf x}^{*T}\Sigma_w {\bf x}^*)
\end{eqnarray*}

i.e., as $y^*={\bf w}^T{\bf x}^*$ is a linear function of the Gaussian-distributed ${\bf w}$, the prediction is again Gaussian, with mean $\mu_w^T{\bf x}^*$ and variance ${\bf x}^{*T}\Sigma_w{\bf x}^*$.
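Given $\mu_w$ and $\Sigma_w$, the predictive mean and variance at a new test input follow in one line each; the test point x_star below is arbitrary.

# Predictive distribution of y* = w^T x* at a test input x*.
x_star = np.array([0.5, 1.0, -0.3])              # arbitrary test input
pred_mean = mu_w @ x_star                         # mu_w^T x*
pred_var = x_star @ Sigma_w @ x_star              # x*^T Sigma_w x*
print(pred_mean, pred_var)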


Ruye Wang 2006-11-14