Least Squares Linear Regression

In general, the goal of regression analysis is to estimate the relationship between a dependent variable $y$ and $d$ independent variables, modeled as a function $y=f(x_1,\cdots,x_d)=f({\bf x})$ of the d-dimensional vector ${\bf x}=[x_1,\cdots,x_d]^T$. The estimation is based on a given dataset of $N$ observed points ${\bf x}_1,\cdots,{\bf x}_N$ and the corresponding values $y_1,\cdots,y_N$ of the dependent variable, which can be more concisely represented as

$\displaystyle {\cal D}=\{ ({\bf x}_n,y_n),\; (n=1,\cdots, N) \}
=({\bf X},\;{\bf y})$ (1)

where ${\bf y}=[y_1,\cdots,y_N]^T$ is an N-dimensional vector containing the corresponding values of the dependent variable $y$, and ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ is a $d\times N$ matrix containing the $N$ observed vectors as its columns. Regression analysis can be considered as a learning process, in which case the observed dataset is the training set.

As the simplest form of regression, linear regression models the given dataset by a linear relationship between $y$ and ${\bf x}$:

$\displaystyle \hat{y}=f_{\bf w}({\bf x})=b+\sum_{i=1}^d w_ix_i=\sum_{i=0}^d w_ix_i
={\bf x}^T{\bf w}$ (2)

where vector ${\bf w}=[w_0=b,w_1,\cdots,w_d]^T$ contains the $d+1$ model parameters to be determined from the observed data, and ${\bf x}=[x_0=1,x_1,\cdots,x_d]^T$ is redefined as an augmented $(d+1)$-dimensional vector containing $x_0=1$ as well as the $d$ independent variables. Geometrically the equation $y={\bf x}^T{\bf w}$ represents a straight line if $d=1$, or a plane (hyperplane) defined over a $d\ge 2$ dimensional space, with intercept $f_{\bf w}(0,\cdots,0)=w_0=b$ and normal direction $[-1,w_1,\cdots,w_d]^T$.
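
As a minimal sketch of Eq. (2) (assuming NumPy; the function name linear_predict and the layout of ${\bf X}$ with one input vector per column are only illustrative choices):

    import numpy as np

    def linear_predict(X, w):
        # X: d x N matrix with one (un-augmented) input vector x_n per column
        # w: (d+1)-vector [b, w_1, ..., w_d], with the bias b stored as w_0
        N = X.shape[1]
        Xa = np.vstack([np.ones((1, N)), X])   # augment each input with x_0 = 1
        return Xa.T @ w                        # y_hat[n] = x_n^T w for n = 1, ..., N

For $d=1$ this simply evaluates $\hat{y}=b+w_1x_1$ at each of the $N$ points.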

There are in general two different methods for finding the model parameters ${\bf w}$ that fit the given data optimally. First, the least squares (LS) method is based on the residual, the difference between the value $f_{\bf w}({\bf x}_n)$ predicted by the model and the actual value $y_n$ given in the dataset:

$\displaystyle {\bf r}=\left[\begin{array}{c}r_1\\ \vdots\\ r_N\end{array}\right]
=\left[\begin{array}{c}y_1\\ \vdots\\ y_N\end{array}\right]
-\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]{\bf w}
={\bf y}-{\bf X}^T {\bf w}
={\bf y}-\hat{\bf y}$ (3)

where ${\bf x}_n$ is the augmented vector corresponding to the $n$th data point, and ${\bf X}$ is now a $(d+1)\times N$ matrix containing all $N$ data points. We assume there are more training samples than unknown parameters, i.e., $N\ge d+1$. To find the optimal parameters in ${\bf w}$ that minimize the squared error

$\displaystyle \varepsilon({\bf w})=\frac{1}{2}\vert\vert{\bf r}\vert\vert^2
=\frac{1}{2}({\bf y}-{\bf X}^T{\bf w})^T({\bf y}-{\bf X}^T{\bf w})
=\frac{1}{2}\left({\bf y}^T{\bf y}-2{\bf w}^T{\bf Xy}
+{\bf w}^T{\bf X}{\bf X}^T{\bf w}\right)$ (4)

we set its gradient with respect to ${\bf w}$ to zero:

$\displaystyle {\bf g}_\varepsilon({\bf w})=\frac{d\varepsilon({\bf w})}{d{\bf w}}
={\bf XX}^T{\bf w}-{\bf Xy}={\bf0}$ (5)

and solve the resulting normal equations, a system of $d+1$ linear equations in the $d+1$ unknowns of ${\bf w}$, to get

$\displaystyle {\bf w}=({\bf X}{\bf X}^T)^{-1}{\bf X}{\bf y}={\bf X}^-{\bf y}$ (6)

where ${\bf X}^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the $(d+1)\times N$ pseudo-inverse of ${\bf X}$. This is actually the least squares solution of the over-determined equation system ${\bf X}^T{\bf w}={\bf y}$, which has $N\ge d+1$ equations but only $d+1$ unknowns.
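
A minimal sketch of Eq. (6) (assuming NumPy, with X the augmented $(d+1)\times N$ data matrix of Eq. (3)); solving the normal equations ${\bf XX}^T{\bf w}={\bf Xy}$ directly avoids forming the inverse explicitly:

    import numpy as np

    def least_squares_fit(X, y):
        # X: (d+1) x N augmented data matrix, y: N-vector of observed outputs
        # Solve the normal equations X X^T w = X y of Eq. (5)
        return np.linalg.solve(X @ X.T, X @ y)

    # Equivalently, the over-determined system X^T w = y can be handed to a
    # built-in least-squares solver:
    #     w, *_ = np.linalg.lstsq(X.T, y, rcond=None)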

Substituting ${\bf w}={\bf X}^-{\bf y}$ back into the squared residual ${\bf r}^T{\bf r}=2\,\varepsilon({\bf w})$ we get

$\displaystyle {\bf r}^T{\bf r}$ $\displaystyle =$ $\displaystyle {\bf y}^T{\bf y}-2{\bf w}^T{\bf Xy}+{\bf w}^T{\bf X}{\bf X}^T{\bf w}
={\bf y}^T{\bf y}-{\bf y}^T{\bf X}^T({\bf XX}^T)^{-1}{\bf Xy}$  
  $\displaystyle =$ $\displaystyle {\bf y}^T{\bf y}-{\bf y}^T{\bf X}^T{\bf X}^-{\bf y}
={\bf y}^T{\bf y}-{\bf y}^T\hat{\bf y}
={\bf y}^T({\bf y}-\hat{\bf y})={\bf y}^T{\bf r}$ (7)

i.e.,

$\displaystyle {\bf r}^T{\bf r}=({\bf y}-\hat{\bf y})^T({\bf y}-\hat{\bf y})
={\bf y}^T{\bf y}-2{\bf y}^T\hat{\bf y}+\hat{\bf y}^T\hat{\bf y}
={\bf y}^T{\bf r}={\bf y}^T({\bf y}-\hat{\bf y})
={\bf y}^T{\bf y}-{\bf y}^T\hat{\bf y}
$

We therefore get

$\displaystyle \hat{\bf y}^T\hat{\bf y}={\bf y}^T\hat{\bf y}
$
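
This means the residual is orthogonal to the prediction:

$\displaystyle \hat{\bf y}^T{\bf r}=\hat{\bf y}^T({\bf y}-\hat{\bf y})
={\bf y}^T\hat{\bf y}-\hat{\bf y}^T\hat{\bf y}=0
$

and consequently ${\bf y}^T{\bf y}=(\hat{\bf y}+{\bf r})^T(\hat{\bf y}+{\bf r})=\hat{\bf y}^T\hat{\bf y}+{\bf r}^T{\bf r}$, so the squared norms of the prediction and the residual add up to that of ${\bf y}$.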


Alternatively, the relationship between the input ${\bf x}$ and the output $y$ can be described by a function $f({\bf x})$ with additive noise $e$, i.e.,

$\displaystyle y=f( {\bf x} )+e$ (8)

and the goal of the regression problem is to infer the function $f({\bf x})$ based on the training set ${\cal D}$, and to predict the output $y_*$ for any given test input ${\bf x}_*$ in the test set.

The simplest form of this regression problem is linear regression, based on the assumption that the function $f({\bf x})={\bf x}^T{\bf w}$ is a linear combination of the components of the input vector ${\bf x}$ with weights ${\bf w}=[w_1,\cdots,w_d]^T$:

$\displaystyle y=f({\bf x})+e={\bf x}^T{\bf w}+e=\sum_{i=1}^d w_i x_i+e$ (9)

Applying this linear model to the $N$ points in the training set, we get

$\displaystyle y_n=f({\bf x}_n)+e_n={\bf x}_n^T{\bf w}+e_n,\;\;\;\;\;\;(n=1,\cdots,N)$ (10)

or in matrix form:

$\displaystyle {\bf y}=\left[\begin{array}{c}y_1\\ \vdots\\ y_N\end{array}\right]
=\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]{\bf w}
+\left[\begin{array}{c}e_1\\ \vdots\\ e_N\end{array}\right]
={\bf X}^T {\bf w}+{\bf e}={\bf f}({\bf X})+{\bf e}$ (11)

where ${\bf f}({\bf X})={\bf X}^T{\bf w}$, and ${\bf e}=[e_1,\cdots,e_N]^T$. Assuming $N>d$, we can use the least-squares method to solve the linear regression problem and find the parameters ${\bf w}$ that minimize the squared error $\vert\vert{\bf e}\vert\vert^2={\bf e}^T{\bf e}$:

$\displaystyle \hat{{\bf w}}={\bf X}^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y}$ (12)

where ${\bf X}^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the $d\times N$ pseudo-inverse of matrix ${\bf X}$.
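
As a small sketch of Eqs. (11) and (12) (assuming NumPy; the dimensions, true weights, and noise level below are arbitrary choices for illustration), we can generate noisy data from a known ${\bf w}$ and recover it by least squares:

    import numpy as np

    rng = np.random.default_rng(0)
    d, N = 3, 100
    w_true = np.array([2.0, -1.0, 0.5])       # true weights (no bias term, as in Eq. (9))
    X = rng.standard_normal((d, N))           # d x N matrix of input vectors
    e = 0.1 * rng.standard_normal(N)          # additive noise
    y = X.T @ w_true + e                      # Eq. (11): y = X^T w + e

    w_hat = np.linalg.solve(X @ X.T, X @ y)   # Eq. (12): w = (X X^T)^{-1} X y
    print(w_hat)                              # close to w_true when the noise is small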

Example

$\displaystyle y=w_1+w_2x$ (13)

$\displaystyle {\cal D}=\{ (0,\,1),\;(2,\,1.9),\;(4,\,3.2) \}$ (14)

$\displaystyle {\bf X}=\left[\begin{array}{ccc}1&1&1\\ 0&2&4\end{array}\right],
\;\;\;\;\;{\bf y}=[1,\,1.9,\,3.2]^T$ (15)

$\displaystyle \hat{{\bf w}}={\bf X}^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y}
=[0.93,\,0.55]^T$ (16)
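
The result in Eq. (16) can be verified numerically (assuming NumPy):

    import numpy as np

    X = np.array([[1.0, 1.0, 1.0],
                  [0.0, 2.0, 4.0]])           # Eq. (15)
    y = np.array([1.0, 1.9, 3.2])

    w_hat = np.linalg.solve(X @ X.T, X @ y)   # w = (X X^T)^{-1} X y
    print(np.round(w_hat, 2))                 # [0.93 0.55], matching Eq. (16)

The fitted straight line of Eq. (13) is therefore $\hat{y}=0.93+0.55x$.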