Least Squares Linear Regression

In general, the goal of regression analysis is to estimate the relationship between a dependent variable $y$ and $d$ independent variables, modeled as a function $y=f(x_1,\cdots,x_d)=f({\bf x})$ of the d-dimensional vector ${\bf x}=[x_1,\cdots,x_d]^T$. The estimation is based on a given dataset of $N$ observed points ${\bf x}_1,\cdots,{\bf x}_N$ and the corresponding values $y_1,\cdots,y_N$ of the dependent variable, which can be more concisely represented as

$\displaystyle {\cal D}=\{ ({\bf x}_n,y_n),\; (n=1,\cdots, N) \}
=({\bf X},\;{\bf y})$ (1)

where ${\bf y}=[y_1,\cdots,y_N]^T$ is an N-dimensional vector containing the corresponding values of the dependent variable $y$, and ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ is a $d\times N$ matrix containing the $N$ observed vectors as its columns. Regression analysis can be considered as a learning process, in which case the observed dataset is the training set.

As the simplest form of regression, linear regression models the given dataset by a linear relationship between $y$ and ${\bf x}$:

$\displaystyle \hat{y}=f_{\bf w}({\bf x})=b+\sum_{i=1}^d w_ix_i=\sum_{i=0}^d w_ix_i
={\bf x}^T{\bf w}$ (2)

where vector ${\bf w}=[w_0=b,w_1,\cdots,w_d]^T$ contains the $d+1$ model parameters to be determined from the observed data, and ${\bf x}=[x_0=1,x_1,\cdots,x_d]^T$ is redefined as an augmented $(d+1)$-dimensional vector containing $x_0=1$ as well as the $d$ independent variables. Geometrically the equation $y={\bf x}^T{\bf w}$ represents a straight line if $d=1$, or a plane (hyperplane) defined over a $d\ge 2$ dimensional space, with intercept $f_{\bf w}(0,\cdots,0)=w_0=b$ and normal direction $[-1,w_1,\cdots,w_d]^T$.
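
As a minimal sketch of Eq. (2) (assuming NumPy; the function name linear_predict and the layout of ${\bf X}$ with one input vector per column are only illustrative choices):

    import numpy as np

    def linear_predict(X, w):
        # X: d x N matrix with one (un-augmented) input vector x_n per column
        # w: (d+1)-vector [b, w_1, ..., w_d], with the bias b stored as w_0
        N = X.shape[1]
        Xa = np.vstack([np.ones((1, N)), X])   # augment each input with x_0 = 1
        return Xa.T @ w                        # y_hat[n] = x_n^T w for n = 1, ..., N

For $d=1$ this simply evaluates $\hat{y}=b+w_1x_1$ at each of the $N$ points.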

There are in general two different methods for finding the model parameters ${\bf w}$ that fit the given data optimally. First, the least squares (LS) method is based on the residual, the difference between the value $f_{\bf w}({\bf x}_n)$ predicted by the model and the actual value $y_n$ given in the dataset:

$\displaystyle {\bf r}=\left[\begin{array}{c}r_1\\ \vdots\\ r_N\end{array}\right]
=\left[\begin{array}{c}y_1\\ \vdots\\ y_N\end{array}\right]
-\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]{\bf w}
={\bf y}-{\bf X}^T {\bf w}
={\bf y}-\hat{\bf y}$ (3)

where ${\bf x}_n$ is the augmented vector corresponding to the $n$th data point, and ${\bf X}$ is now a $(d+1)\times N$ matrix containing all $N$ data points. We assume there are more training samples than unknown parameters, i.e., $N\ge d+1$. To find the optimal parameters in ${\bf w}$ that minimize the squared error

$\displaystyle \varepsilon({\bf w})=\frac{1}{2}\vert\vert{\bf r}\vert\vert^2
=\frac{1}{2}({\bf y}-{\bf X}^T{\bf w})^T({\bf y}-{\bf X}^T{\bf w})
=\frac{1}{2}\left({\bf y}^T{\bf y}-2{\bf w}^T{\bf Xy}
+{\bf w}^T{\bf X}{\bf X}^T{\bf w}\right)$ (4)

we set its gradient with respect to ${\bf w}$ to zero:

$\displaystyle {\bf g}_\varepsilon({\bf w})=\frac{d\varepsilon({\bf w})}{d{\bf w}}
={\bf XX}^T{\bf w}-{\bf Xy}={\bf0}$ (5)

and solve the resulting normal equations, a system of $d+1$ linear equations in the $d+1$ unknowns of ${\bf w}$, to get

$\displaystyle {\bf w}=({\bf X}{\bf X}^T)^{-1}{\bf X}{\bf y}={\bf X}^-{\bf y}$ (6)

where ${\bf X}^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the $(d+1)\times N$ pseudo-inverse of ${\bf X}$. This is actually the least squares solution of the over-determined equation system ${\bf X}^T{\bf w}={\bf y}$, which has $N\ge d+1$ equations but only $d+1$ unknowns.
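
A minimal sketch of Eq. (6) (assuming NumPy, with X the augmented $(d+1)\times N$ data matrix of Eq. (3)); solving the normal equations ${\bf XX}^T{\bf w}={\bf Xy}$ directly avoids forming the inverse explicitly:

    import numpy as np

    def least_squares_fit(X, y):
        # X: (d+1) x N augmented data matrix, y: N-vector of observed outputs
        # Solve the normal equations X X^T w = X y of Eq. (5)
        return np.linalg.solve(X @ X.T, X @ y)

    # Equivalently, the over-determined system X^T w = y can be handed to a
    # built-in least-squares solver:
    #     w, *_ = np.linalg.lstsq(X.T, y, rcond=None)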

Substituting ${\bf w}={\bf X}^-{\bf y}$ back into the squared residual ${\bf r}^T{\bf r}=2\,\varepsilon({\bf w})$ we get

$\displaystyle {\bf r}^T{\bf r}$ $\displaystyle =$ $\displaystyle {\bf y}^T{\bf y}-2{\bf w}^T{\bf Xy}+{\bf w}^T{\bf X}{\bf X}^T{\bf w}
={\bf y}^T{\bf y}-{\bf y}^T{\bf X}^T({\bf XX}^T)^{-1}{\bf Xy}$  
  $\displaystyle =$ $\displaystyle {\bf y}^T{\bf y}-{\bf y}^T{\bf X}^T{\bf X}^-{\bf y}
={\bf y}^T{\bf y}-{\bf y}^T\hat{\bf y}
={\bf y}^T({\bf y}-\hat{\bf y})={\bf y}^T{\bf r}$ (7)

i.e.,

$\displaystyle {\bf r}^T{\bf r}=({\bf y}-\hat{\bf y})^T({\bf y}-\hat{\bf y})
={\bf y}^T{\bf y}-2{\bf y}^T\hat{\bf y}+\hat{\bf y}^T\hat{\bf y}
={\bf y}^T{\bf r}={\bf y}^T({\bf y}-\hat{\bf y})
={\bf y}^T{\bf y}-{\bf y}^T\hat{\bf y}
$

We therefore get

$\displaystyle \hat{\bf y}^T\hat{\bf y}={\bf y}^T\hat{\bf y}
$
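
This means the residual is orthogonal to the prediction:

$\displaystyle \hat{\bf y}^T{\bf r}=\hat{\bf y}^T({\bf y}-\hat{\bf y})
={\bf y}^T\hat{\bf y}-\hat{\bf y}^T\hat{\bf y}=0
$

and consequently ${\bf y}^T{\bf y}=(\hat{\bf y}+{\bf r})^T(\hat{\bf y}+{\bf r})=\hat{\bf y}^T\hat{\bf y}+{\bf r}^T{\bf r}$, so the squared norms of the prediction and the residual add up to that of ${\bf y}$.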


Alternatively, the relationship between the input ${\bf x}$ and the output $y$ can be described by a function $f({\bf x})$ with additive noise $e$, i.e.,

$\displaystyle y=f( {\bf x} )+e$ (8)

and the goal of the regression problem is to infer the function $f({\bf x})$ based on the training set ${\cal D}$, and to predict the output $y_*$ for any given test input ${\bf x}_*$ in the test set.

The simplest form of this regression problem is linear regression, based on the assumption that the function $f({\bf x})={\bf x}^T{\bf w}$ is a linear combination of the components of the input vector ${\bf x}$ with weights ${\bf w}=[w_1,\cdots,w_d]^T$:

$\displaystyle y=f({\bf x})+e={\bf x}^T{\bf w}+e=\sum_{i=1}^d w_i x_i+e$ (9)

Applying this linear model to the $N$ points in the training set, we get

$\displaystyle y_n=f({\bf x}_n)+e_n={\bf x}_n^T{\bf w}+e_n,\;\;\;\;\;\;(n=1,\cdots,N)$ (10)

or in matrix form:

$\displaystyle {\bf y}=\left[\begin{array}{c}y_1\\ \vdots\\ y_N\end{array}\right]
=\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]{\bf w}
+\left[\begin{array}{c}e_1\\ \vdots\\ e_N\end{array}\right]
={\bf X}^T {\bf w}+{\bf e}={\bf f}({\bf X})+{\bf e}$ (11)

where ${\bf f}({\bf X})={\bf X}^T{\bf w}$, and ${\bf e}=[e_1,\cdots,e_N]^T$. Assuming $N>d$, we can use the least-squares method to solve the linear regression problem and find the parameters ${\bf w}$ that minimize the squared error $\vert\vert{\bf e}\vert\vert^2={\bf e}^T{\bf e}$:

$\displaystyle \hat{{\bf w}}={\bf X}^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y}$ (12)

where ${\bf X}^-=({\bf X}{\bf X}^T)^{-1}{\bf X}$ is the $d\times N$ pseudo-inverse of matrix ${\bf X}$.
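
As a small sketch of Eqs. (11) and (12) (assuming NumPy; the dimensions, true weights, and noise level below are arbitrary choices for illustration), we can generate noisy data from a known ${\bf w}$ and recover it by least squares:

    import numpy as np

    rng = np.random.default_rng(0)
    d, N = 3, 100
    w_true = np.array([2.0, -1.0, 0.5])       # true weights (no bias term, as in Eq. (9))
    X = rng.standard_normal((d, N))           # d x N matrix of input vectors
    e = 0.1 * rng.standard_normal(N)          # additive noise
    y = X.T @ w_true + e                      # Eq. (11): y = X^T w + e

    w_hat = np.linalg.solve(X @ X.T, X @ y)   # Eq. (12): w = (X X^T)^{-1} X y
    print(w_hat)                              # close to w_true when the noise is small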

Example

$\displaystyle y=w_1+w_2x$ (13)

$\displaystyle {\cal D}=\{ (0,\,1),\;(2,\,1.9),\;(4,\,3.2) \}$ (14)

$\displaystyle {\bf X}=\left[\begin{array}{ccc}1&1&1\\ 0&2&4\end{array}\right],
\;\;\;\;\;{\bf y}=[1,\,1.9,\,3.2]^T$ (15)

$\displaystyle \hat{{\bf w}}={\bf X}^- {\bf y}=({\bf X}{\bf X}^T)^{-1}{\bf X}\;{\bf y}
=[0.93,\,0.55]^T$ (16)
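
The result in Eq. (16) can be verified numerically (assuming NumPy):

    import numpy as np

    X = np.array([[1.0, 1.0, 1.0],
                  [0.0, 2.0, 4.0]])           # Eq. (15)
    y = np.array([1.0, 1.9, 3.2])

    w_hat = np.linalg.solve(X @ X.T, X @ y)   # w = (X X^T)^{-1} X y
    print(np.round(w_hat, 2))                 # [0.93 0.55], matching Eq. (16)

The fitted straight line of Eq. (13) is therefore $\hat{y}=0.93+0.55x$.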