The method of regression can be used to model the relationship between an independent variable $x$ and a dependent variable $y$ by a function $y=f(x)$, based on a set of observed data pairs $\{(x_i,\,y_i),\;i=1,\cdots,N\}$. If there is reason to believe that the relationship is linear, then we can assume $\hat{y}=f(x)=w_0+w_1\,x$, where the two model parameters $w_0$ (intercept) and $w_1$ (slope) are to be found for the model to fit the given data optimally, in the sense that the total squared error below is minimized:
$$\varepsilon=\frac{1}{2}\sum_{i=1}^N r_i^2
=\frac{1}{2}\sum_{i=1}^N (y_i-\hat{y}_i)^2
=\frac{1}{2}\sum_{i=1}^N\left[y_i-(w_0+w_1\,x_i)\right]^2
\qquad(65)$$
where $r_i=y_i-\hat{y}_i$ is the residual of the $i$th data pair, assumed to be an i.i.d. sample of a normal distribution $\mathcal{N}(0,\,\sigma^2)$.
To find the optimal coefficients $w_0$ and $w_1$ that minimize the squared error $\varepsilon$, we set its derivatives with respect to $w_0$ and $w_1$ to zero:

$$\frac{\partial\varepsilon}{\partial w_0}=-\sum_{i=1}^N\left[y_i-(w_0+w_1\,x_i)\right]=0,
\qquad
\frac{\partial\varepsilon}{\partial w_1}=-\sum_{i=1}^N\left[y_i-(w_0+w_1\,x_i)\right]x_i=0$$

and solve the equations to get:

$$w_1=\frac{\sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^N(x_i-\bar{x})^2}
=\frac{\sigma_{xy}}{\sigma_x^2},
\qquad
w_0=\bar{y}-w_1\,\bar{x}$$

where

$$\bar{x}=\frac{1}{N}\sum_{i=1}^N x_i,\qquad
\bar{y}=\frac{1}{N}\sum_{i=1}^N y_i,\qquad
\sigma_{xy}=\frac{1}{N}\sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y}),\qquad
\sigma_x^2=\frac{1}{N}\sum_{i=1}^N(x_i-\bar{x})^2$$

The regression function becomes:

$$\hat{y}=w_0+w_1\,x=\bar{y}+w_1(x-\bar{x})$$

We see that the slope of the linear regression function can also be written as

$$w_1=\frac{\sigma_{xy}}{\sigma_x^2}
=\frac{\sigma_{xy}}{\sigma_x\,\sigma_y}\,\frac{\sigma_y}{\sigma_x}
=r_{xy}\,\frac{\sigma_y}{\sigma_x}$$

where $r_{xy}=\sigma_{xy}/(\sigma_x\sigma_y)$ is the correlation coefficient between $x$ and $y$, defined in Eq. (83) below.
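As a quick numerical illustration, the closed-form expressions above can be evaluated directly. The sketch below is a minimal example under assumed conditions: the data, the variable names (`x`, `y`, `w0`, `w1`), and the use of `np.polyfit` as an independent check are all ours, not from the text.

```python
import numpy as np

# Hypothetical data: y is roughly linear in x, plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.7 * x + rng.normal(0.0, 0.5, size=x.size)

# Closed-form estimates of the slope and intercept:
#   w1 = sigma_xy / sigma_x^2,   w0 = y_bar - w1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
sigma_xy = np.mean((x - x_bar) * (y - y_bar))   # covariance of x and y
sigma_x2 = np.mean((x - x_bar) ** 2)            # variance of x
w1 = sigma_xy / sigma_x2
w0 = y_bar - w1 * x_bar

# Independent check with NumPy's own least-squares polynomial fit
w1_ref, w0_ref = np.polyfit(x, y, deg=1)
print(w0, w1)          # intercept and slope from the closed-form expressions
print(w0_ref, w1_ref)  # should agree up to floating-point error
```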
The univariate linear regression model considered above can be generalized to multivariate linear regression, by which the relationship between a dependent variable $y$ and $D$ independent variables $x_1,\cdots,x_D$ is modeled by a linear function

$$\hat{y}=f(x_1,\cdots,x_D)=w_0+w_1\,x_1+\cdots+w_D\,x_D=\sum_{d=0}^D w_d\,x_d
\qquad(66)$$
where $x_0=1$. We desire to find the model parameters $w_0,\,w_1,\cdots,w_D$ so that the model optimally fits a set of $N$ observed sample points $\{(x_{n1},\cdots,x_{nD},\;y_n),\;n=1,\cdots,N\}$. Substituting the observed data into the model we get $N$ equations

$$\hat{y}_n=w_0+w_1\,x_{n1}+\cdots+w_D\,x_{nD}=\sum_{d=0}^D w_d\,x_{nd},
\qquad n=1,\cdots,N
\qquad(67)$$
which can be written in matrix form:

$$\hat{\bf y}={\bf X}{\bf w}$$

where we have defined

$${\bf w}=\left[\begin{array}{c}w_0\\ w_1\\ \vdots\\ w_D\end{array}\right]_{(D+1)\times 1},
\qquad
{\bf y}=\left[\begin{array}{c}y_1\\ y_2\\ \vdots\\ y_N\end{array}\right]_{N\times 1},
\qquad
{\bf X}=[{\bf x}_0={\bf 1},\;{\bf x}_1,\cdots,{\bf x}_D]_{N\times(D+1)}
\qquad(68)$$

Each column ${\bf x}_d=[x_{1d},\cdots,x_{Nd}]^T$ of ${\bf X}$ holds the $N$ observed values of the $d$th variable, and the first column ${\bf x}_0={\bf 1}$ is all ones.
Ideally, we want to find ${\bf w}$ so that ${\bf X}{\bf w}={\bf y}$, i.e., the observed ${\bf y}$ can be expressed as a linear combination of the $D+1$ column vectors ${\bf x}_0,\,{\bf x}_1,\cdots,{\bf x}_D$. However, as this is an overdetermined equation system with $N$ equations but only $D+1$ unknowns $w_0,\cdots,w_D$ (typically $N>D+1$), there does not exist an exact solution. We therefore can only find the least squares (LS) approximation that minimizes the squared error of the residual ${\bf r}={\bf y}-{\bf X}{\bf w}$:

$$\varepsilon=\frac{1}{2}\Vert{\bf r}\Vert^2
=\frac{1}{2}\Vert{\bf y}-{\bf X}{\bf w}\Vert^2
=\frac{1}{2}({\bf y}-{\bf X}{\bf w})^T({\bf y}-{\bf X}{\bf w})
\qquad(69)$$
To do so, we set its gradient (the vector of derivatives) with respect to ${\bf w}$ to zero:

$$\frac{d}{d{\bf w}}\,\varepsilon
=\frac{d}{d{\bf w}}\,\frac{1}{2}({\bf y}-{\bf X}{\bf w})^T({\bf y}-{\bf X}{\bf w})
=-{\bf X}^T({\bf y}-{\bf X}{\bf w})={\bf 0}
\qquad(70)$$
and solve the resulting equation to get:

$${\bf w}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}={\bf X}^-{\bf y}
\qquad(71)$$

where ${\bf X}^-=({\bf X}^T{\bf X})^{-1}{\bf X}^T$ is the pseudo-inverse of ${\bf X}$.
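A small numerical sketch of Eq. (71) follows, under assumed conditions: the data are made up, and the design matrix `X` is built with a leading column of ones for $x_0=1$ as in Eq. (68). Three equivalent ways of computing ${\bf w}$ are compared.

```python
import numpy as np

# Hypothetical data: N samples of D independent variables, plus noise
rng = np.random.default_rng(1)
N, D = 100, 3
w_true = np.array([1.0, -2.0, 0.5, 3.0])                    # [w0, w1, w2, w3], used only to generate data
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # x_0 = 1 in the first column, Eq. (68)
y = X @ w_true + rng.normal(0.0, 0.1, size=N)

# Eq. (71): w = (X^T X)^{-1} X^T y = X^- y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations directly
w_pinv = np.linalg.pinv(X) @ y                    # use the pseudo-inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]    # NumPy's built-in least-squares solver

print(w_normal)
print(w_pinv)
print(w_lstsq)   # all three should agree, and be close to w_true
```

In practice `np.linalg.lstsq` (or an explicit pseudo-inverse) is preferred over forming ${\bf X}^T{\bf X}$, which squares the condition number of the problem.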
We further consider some properties of the solution. First, we note that, as a direct consequence of Eq. (70),

$${\bf X}^T{\bf r}=\left[\begin{array}{c}{\bf x}_0^T\\
{\bf x}_1^T\\ \vdots\\ {\bf x}_D^T\end{array}\right]{\bf r}={\bf 0}
\qquad(72)$$

of which the first component is

$${\bf x}_0^T{\bf r}=[1,\cdots,1]\,{\bf r}=\sum_{n=1}^N r_n=0
\qquad(73)$$
i.e., the sum of all $N$ residuals is zero. Now the regression model can be written as ${\bf y}={\bf X}{\bf w}+{\bf r}=\hat{\bf y}+{\bf r}$, and we have

$$\sum_{n=1}^N y_n=\sum_{n=1}^N(\hat{y}_n+r_n)=\sum_{n=1}^N\hat{y}_n+\sum_{n=1}^N r_n
\qquad(74)$$

and $\sum_{n=1}^N r_n=0$ by Eq. (73), i.e.,

$$\sum_{n=1}^N y_n=\sum_{n=1}^N\hat{y}_n
\qquad(75)$$

we therefore get

$$\bar{y}=\frac{1}{N}\sum_{n=1}^N y_n=\frac{1}{N}\sum_{n=1}^N\hat{y}_n=\bar{\hat{y}}
\qquad(76)$$

i.e., the fitted values $\hat{y}_n$ have the same mean as the observed values $y_n$.
We can also show that all three vectors ${\bf x}_d$ (for $d=0,\cdots,D$), $\hat{\bf y}={\bf X}{\bf w}$, and $\hat{\bf y}-\bar{y}{\bf 1}$ are perpendicular to the residual vector ${\bf r}$:

$${\bf x}_d^T{\bf r}=0,\qquad
\hat{\bf y}^T{\bf r}=({\bf X}{\bf w})^T{\bf r}={\bf w}^T{\bf X}^T{\bf r}=0,\qquad
(\hat{\bf y}-\bar{y}{\bf 1})^T{\bf r}=\hat{\bf y}^T{\bf r}-\bar{y}\,{\bf x}_0^T{\bf r}=0
\qquad(77)$$

The fact that the residual ${\bf r}={\bf y}-\hat{\bf y}$ is perpendicular to $\hat{\bf y}={\bf X}{\bf w}$ indicates that among all linear combinations of the $D+1$ vectors ${\bf x}_0,\cdots,{\bf x}_D$ (all points in the subspace spanned by these vectors), $\hat{\bf y}={\bf X}{\bf w}$ is indeed the optimal one with the minimum residual $\Vert{\bf r}\Vert=\Vert{\bf y}-\hat{\bf y}\Vert$.
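These orthogonality properties can be checked numerically. The sketch below is self-contained and uses assumed data and names of our own choosing; it refits a small model and verifies Eqs. (72), (73), (76), and (77) up to floating-point error.

```python
import numpy as np

# Hypothetical data and least-squares fit
rng = np.random.default_rng(2)
N, D = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # x_0 = 1 in the first column
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(0.0, 0.2, size=N)

w = np.linalg.pinv(X) @ y        # Eq. (71)
y_hat = X @ w
r = y - y_hat                    # residual vector

print(np.allclose(X.T @ r, 0.0, atol=1e-8))     # Eq. (72): r is perpendicular to every column x_d
print(np.isclose(r.sum(), 0.0, atol=1e-8))      # Eq. (73): the residuals sum to zero
print(np.isclose(y_hat @ r, 0.0, atol=1e-8))    # Eq. (77): y_hat is perpendicular to r
print(np.isclose(y.mean(), y_hat.mean()))       # Eq. (76): y and y_hat have the same mean
```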
How well the regression model fits the observed data can be
quantitatively measured based on the following sums of
squares:
- The total sum of squares (SST): the total variation in the data:

  $$SST=\sum_{n=1}^N(y_n-\bar{y})^2 \qquad(78)$$
- The explained sum of squares (SSE): the variation of the data explained by the regression model:

  $$SSE=\sum_{n=1}^N(\hat{y}_n-\bar{y})^2 \qquad(79)$$
- The residual sum of squares (SSR): the variation of the data not explained by the model, due to noise, or the discrepancy between the data and the model:

  $$SSR=\sum_{n=1}^N(y_n-\hat{y}_n)^2=\sum_{n=1}^N r_n^2=\Vert{\bf r}\Vert^2 \qquad(80)$$
We can show that the total sum of squares is the sum of the explained sum of squares and the residual sum of squares:

$$SST=\sum_{n=1}^N(y_n-\bar{y})^2
=\sum_{n=1}^N\left[(\hat{y}_n-\bar{y})+r_n\right]^2
=\sum_{n=1}^N(\hat{y}_n-\bar{y})^2+2\sum_{n=1}^N(\hat{y}_n-\bar{y})\,r_n+\sum_{n=1}^N r_n^2
=SSE+SSR$$

The last equality is due to the fact that the two middle terms are both zero:

$$\sum_{n=1}^N\hat{y}_n r_n=\hat{\bf y}^T{\bf r}=0,\qquad
\bar{y}\sum_{n=1}^N r_n=\bar{y}\,{\bf x}_0^T{\bf r}=0
\qquad(81)$$
We now define the coefficient of determination, denoted by $R^2$ (R-squared), as a measure of the goodness of the model, the percentage of variance explained by the model:

$$R^2=\frac{SSE}{SST}=1-\frac{SSR}{SST}
\qquad(82)$$

If SSR is small, i.e., the model residual is small, then SSE is large and most of the variation in the data is explained by the model; $R^2$ is then large (close to 1), indicating a good fit of the model to the data.
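The three sums of squares and $R^2$ can be computed directly. The sketch below is a minimal illustration on made-up univariate data (names and data are ours); it also verifies the decomposition $SST=SSE+SSR$.

```python
import numpy as np

# Hypothetical univariate data and fit
rng = np.random.default_rng(3)
x = np.linspace(-5.0, 5.0, 80)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)
w1, w0 = np.polyfit(x, y, deg=1)
y_hat = w0 + w1 * x

SST = np.sum((y - y.mean()) ** 2)       # Eq. (78): total sum of squares
SSE = np.sum((y_hat - y.mean()) ** 2)   # Eq. (79): explained sum of squares
SSR = np.sum((y - y_hat) ** 2)          # Eq. (80): residual sum of squares

print(np.isclose(SST, SSE + SSR))       # the decomposition SST = SSE + SSR
print(SSE / SST, 1.0 - SSR / SST)       # Eq. (82): two equivalent forms of R^2
```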
In particular, when $D=1$, we find the parameters of the linear regression model $\hat{y}=w_0+w_1\,x$, the slope $w_1=\sigma_{xy}/\sigma_x^2$ and the intercept $w_0=\bar{y}-w_1\bar{x}$. We may ask two different questions regarding the data and the model:
- How closely are the two variables $x$ and $y$ related to each other? This question can be addressed by the correlation coefficient defined as:

  $$r_{xy}=\frac{\sigma_{xy}}{\sigma_x\,\sigma_y}
  =\frac{\sum_{n=1}^N(x_n-\bar{x})(y_n-\bar{y})}
  {\sqrt{\sum_{n=1}^N(x_n-\bar{x})^2}\,\sqrt{\sum_{n=1}^N(y_n-\bar{y})^2}}
  \qquad(83)$$

  A few simple examples of different correlation coefficients can be found here.
- How well does the regression model fit the observed data? This question can be addressed by the goodness measure $R^2$ discussed above.
In fact, $r_{xy}$ for correlation and $R^2$ for regression are closely related. Specifically, consider

$$r_{xy}^2=\frac{\sigma_{xy}^2}{\sigma_x^2\,\sigma_y^2}
=\frac{w_1^2\,\sigma_x^2}{\sigma_y^2}
=\frac{\sum_{n=1}^N(\hat{y}_n-\bar{y})^2}{\sum_{n=1}^N(y_n-\bar{y})^2}
=\frac{SSE}{SST}=1-\frac{SSR}{SST}=R^2
\qquad(84)$$

We see that if $r_{xy}$ is large, indicating the two variables $x$ and $y$ are highly correlated, then SSR is small, i.e., the error of the model is small, and therefore $R^2$ is large, indicating the model is a good fit to the data.
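The equality $R^2=r_{xy}^2$ for the single-variable model can also be confirmed numerically. The sketch below uses assumed data and relies on `np.corrcoef` for the correlation coefficient and `np.polyfit` for the fit.

```python
import numpy as np

# Hypothetical univariate data
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 3.0 - 1.2 * x + rng.normal(0.0, 0.8, size=x.size)

# Correlation coefficient r_xy, Eq. (83)
r_xy = np.corrcoef(x, y)[0, 1]

# R^2 of the fitted regression line, Eq. (82)
w1, w0 = np.polyfit(x, y, deg=1)
y_hat = w0 + w1 * x
R2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_xy ** 2, R2)   # per Eq. (84), these agree for the single-variable model
```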
Although correlation and regression are closely related to each other, they are different in several aspects:
- Only if the two variables $x$ and $y$ are correlated will regression analysis be meaningful.
- In regression, $y$ is a dependent variable, possibly random, treated as a function of $x$, a deterministic independent variable. But they are treated equally (both possibly random) in correlation.
- Regression provides a model, a specific mathematical function $\hat{y}=f(x)$, by which the given samples can be interpolated and extrapolated; correlation does not.