The goal of regression analysis is to model the relationship between a dependent variable $y$, typically a scalar, and a set of independent variables or predictors $x_1, \dots, x_d$, represented as a column vector $\mathbf{x} = [x_1, \dots, x_d]^T$ in a $d$-dimensional space. Here both $y$ and the components of $\mathbf{x}$ take numerical values.
Regression can be considered a supervised learning method that learns the essential relationship between the dependent and independent variables, based on a training dataset containing $N$ observed data samples:

$$\mathcal{D} = \{ (\mathbf{x}_n,\, y_n),\; n = 1, \dots, N \} \tag{99}$$
More specifically, a regression algorithm models the relationship between the independent variable $\mathbf{x}$ and the dependent variable $y$ by a hypothesized regression function $\hat{y} = f(\mathbf{x}, \boldsymbol{\theta})$, containing a set of parameters symbolically denoted by $\boldsymbol{\theta}$. Geometrically this regression function represents a curve if $d = 1$ (for example, a linear model $f(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ is a straight line in the plane), a surface if $d = 2$, or a hypersurface if $d > 2$.
Typically, the form of the function $f$ (e.g., linear, polynomial, or exponential) is assumed to be known based on prior knowledge, while the parameter $\boldsymbol{\theta}$ is to be estimated by the regression algorithm, so that the predicted value $\hat{y} = f(\mathbf{x}, \boldsymbol{\theta})$ matches the ground truth $y$ optimally in some sense, without being unduly affected by the inevitable observation noise in the data. In other words, a regression algorithm should neither overfit nor underfit the data.
Regression analysis can also be interpreted as system modeling/identification, when the independent variable $\mathbf{x}$ and the dependent variable $y$ are treated respectively as the input (stimuli) and output (responses) of a system, the behavior of which is described by the relationship between such input and output, modeled by the regression function $f(\mathbf{x}, \boldsymbol{\theta})$.
Regression analysis is also closely related to pattern recognition/classification, when the independent vector variable $\mathbf{x}$ in the data is treated as a set of features that characterize a pattern or object of interest, and the corresponding dependent variable $y$ is treated as a categorical label indicating to which of a set of classes a pattern belongs. In this case, the modeling of the relationship between $\mathbf{x}$ and $y$ becomes supervised pattern classification or recognition, to be discussed in a later chapter.
In general the regression problem can be addressed from different philosophical viewpoints. In the frequentist point of view, the unknown model parameters in $\boldsymbol{\theta}$ are fixed deterministic variables that can be estimated based on the observed data; a typical method based on this viewpoint is the least squares method.
Alternatively, in the Bayesian inference point of view, the model parameters in $\boldsymbol{\theta}$ are random variables. Their prior probability distribution $p(\boldsymbol{\theta})$ before any data are observed can be estimated based on some prior knowledge. If no such prior knowledge is available, $p(\boldsymbol{\theta})$ can simply be a uniform distribution, i.e., all possible values of $\boldsymbol{\theta}$ are equally likely. Once the training set $\mathcal{D}$ becomes available, we can further obtain the posterior probability of $\boldsymbol{\theta}$ based on Bayes' theorem:

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}) \tag{100}$$
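For concreteness, here is a minimal numerical sketch of Eq. (100): it evaluates the unnormalized posterior of a single slope parameter on a grid of candidate values, assuming (for illustration only) synthetic data, a known Gaussian noise level, and a Gaussian prior.

```python
import numpy as np

# Minimal grid-based illustration of Bayes' theorem, Eq. (100), for a single
# hypothetical slope parameter theta in the model y = theta * x + noise.
# The data, noise level, and Gaussian prior below are all assumptions.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)  # synthetic data, true slope 2
sigma = 0.1                                        # assumed known noise std

theta = np.linspace(0.0, 4.0, 401)                 # grid of candidate parameters
# log p(D | theta) under i.i.d. Gaussian noise, with constant terms dropped
log_lik = np.array([-0.5 * np.sum((y - t * x) ** 2) / sigma**2 for t in theta])
log_prior = -0.5 * (theta - 1.0) ** 2              # assumed Gaussian prior, mean 1
log_post = log_lik + log_prior                     # Bayes' theorem up to a constant

post = np.exp(log_post - log_post.max())           # guard against underflow
post /= post.sum() * (theta[1] - theta[0])         # normalize to integrate to 1
print("posterior mode:", theta[np.argmax(post)])   # lies between data and prior
```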
Based on either of these two viewpoints, different regression algorithms can be used to find the parameters $\boldsymbol{\theta}$ of the model function $f(\mathbf{x}, \boldsymbol{\theta})$ that fit the observed dataset $\mathcal{D}$ in some optimal way, based on different criteria, as shown below.
The least squares method, based on the frequentist viewpoint, measures how well the regression function models the training data by the residual $r_n$, defined as the difference between the ground truth labeling value $y_n$ and the model prediction $f(\mathbf{x}_n, \boldsymbol{\theta})$ corresponding to $\mathbf{x}_n$, for each of the $N$ data points in the observed data:

$$r_n = y_n - f(\mathbf{x}_n, \boldsymbol{\theta}), \qquad n = 1, \dots, N \tag{101}$$

The optimal parameter is the one that minimizes the sum of squared residuals $\varepsilon(\boldsymbol{\theta}) = \sum_{n=1}^N r_n^2$.
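As a sketch of how this works in practice, the following fits a linear model $f(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ by minimizing the sum of squared residuals with NumPy's least squares solver; the synthetic data and model choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.3, size=x.shape)  # synthetic noisy data

# Design matrix for the linear model f(x, theta) = theta0 + theta1 * x.
# With 50 equations and 2 unknowns the system is over-determined, so we
# minimize the sum of squared residuals sum_n (y_n - f(x_n, theta))^2.
A = np.column_stack([np.ones_like(x), x])
theta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ theta
print("estimated theta:", theta)
print("sum of squared residuals:", np.sum(residuals ** 2))
```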
The maximum likelihood estimation (MLE) method, based on the Bayesian inference viewpoint, measures how well the regression function models the training data in terms of the likelihood $L(\boldsymbol{\theta} \mid \mathcal{D})$ of the model parameter $\boldsymbol{\theta}$ based on the observed dataset $\mathcal{D}$, which is proportional to the conditional probability of $\mathcal{D}$ given $\boldsymbol{\theta}$:

$$L(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \tag{103}$$

Assume each observed dependent value is related to the independent variable by the regression function plus an additive observation noise term $\varepsilon$:

$$y = f(\mathbf{x}, \boldsymbol{\theta}) + \varepsilon \tag{104}$$

where $\varepsilon$ is typically modeled as a zero-mean Gaussian random variable with variance $\sigma^2$:

$$p(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\varepsilon^2}{2\sigma^2} \right) \tag{105}$$
The zero-mean pdf of the noise $\varepsilon$ can also be considered as the pdf of $y$ with mean $f(\mathbf{x}, \boldsymbol{\theta})$, i.e., the conditional pdf $p(y \mid \mathbf{x}, \boldsymbol{\theta})$ of $y$ given $\mathbf{x}$ as well as $\boldsymbol{\theta}$, based on the assumption that $\mathbf{x}$ and $y$ are indeed related by $y = f(\mathbf{x}, \boldsymbol{\theta}) + \varepsilon$. Then we can find the likelihood of the model parameter $\boldsymbol{\theta}$ given the $N$ samples in the training set, all assumed to be independent and identically distributed (i.i.d.):

$$L(\boldsymbol{\theta} \mid \mathcal{D}) = \prod_{n=1}^N p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y_n - f(\mathbf{x}_n, \boldsymbol{\theta}) \right)^2}{2\sigma^2} \right)$$

The MLE method estimates $\boldsymbol{\theta}$ as the value that maximizes this likelihood, or equivalently its logarithm.
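Under the Gaussian noise model above, maximizing the log-likelihood differs from minimizing the sum of squared residuals only by constants, so for this model MLE reproduces the least squares solution. A minimal sketch, assuming synthetic data, a known $\sigma$, and a linear model, that maximizes the log-likelihood numerically with SciPy:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.3, size=x.shape)  # synthetic data
sigma = 0.3  # noise standard deviation, assumed known here

def neg_log_likelihood(theta):
    # Negative Gaussian log-likelihood -sum_n log N(y_n; f(x_n, theta), sigma^2).
    # Constant terms are dropped since they do not affect the maximizer.
    pred = theta[0] + theta[1] * x
    return 0.5 * np.sum((y - pred) ** 2) / sigma**2

result = minimize(neg_log_likelihood, x0=np.zeros(2))
print("MLE estimate:", result.x)  # matches the least squares solution
```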
The maximum a posteriori (MAP) method is also a Bayesian inference method; it measures how well the regression function models the training data by the posterior probability of the parameter $\boldsymbol{\theta}$ given $\mathcal{D}$ in Eq. (100), proportional to the product of the likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$ and the prior $p(\boldsymbol{\theta})$. If no prior knowledge about $\boldsymbol{\theta}$ is available, then $p(\boldsymbol{\theta})$ is a uniform distribution and MAP is equivalent to MLE. However, if certain prior knowledge regarding $\boldsymbol{\theta}$ does exist and the prior is not uniform, then the posterior characterizes $\boldsymbol{\theta}$ better than the likelihood does, and MAP may produce a better result than MLE.
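For example, under the assumption (for illustration) of a zero-mean Gaussian prior $p(\boldsymbol{\theta}) \propto \exp(-\lambda \|\boldsymbol{\theta}\|^2 / 2)$, maximizing the posterior reduces to minimizing the sum of squared residuals plus a quadratic penalty, i.e., ridge-regularized least squares. A minimal sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.3, size=x.shape)  # synthetic data

A = np.column_stack([np.ones_like(x), x])  # linear model f(x, theta) = theta0 + theta1*x
lam = 1.0  # assumed prior strength (absorbing the noise variance); lam -> 0 recovers MLE

# MAP with a zero-mean Gaussian prior on theta solves the normal equations
# (A^T A + lam * I) theta = A^T y, i.e., ridge-regularized least squares.
theta_map = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)
theta_mle = np.linalg.lstsq(A, y, rcond=None)[0]
print("MAP:", theta_map, " MLE:", theta_mle)  # MAP shrinks toward the prior mean 0
```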
We see that regression analysis can be treated as an optimization problem, in which either the sum-of-squares error is minimized or the likelihood function is maximized. Also, the optimization problem is in general over-determined, as there are typically many more observed data points in the training data than unknown parameters. Algorithms based on these different methods are considered in detail in later sections of this chapter.
We further note that regression analysis can also be treated as binary classification, when the regression function $\hat{y} = f(\mathbf{x}, \boldsymbol{\theta})$, as a hypersurface in the $(d+1)$-dimensional space spanned by $\mathbf{x}$ and $y$, is thresholded by a constant $C$ (e.g., $C = 0$). The resulting equation $f(\mathbf{x}, \boldsymbol{\theta}) = C$ defines a hypersurface in the $d$-dimensional space spanned by $x_1, \dots, x_d$, which partitions the space into two parts in such a way that all points on one side of the hypersurface satisfy $f(\mathbf{x}, \boldsymbol{\theta}) > C$, while all points on the other side satisfy $f(\mathbf{x}, \boldsymbol{\theta}) < C$. In other words, the regression function is a binary classifier that separates every point $\mathbf{x}$ in the $d$-dimensional space into two classes $\omega_+$ and $\omega_-$, depending on whether $f(\mathbf{x}, \boldsymbol{\theta})$ is greater or smaller than $C$. Now each $y_n$ corresponding to $\mathbf{x}_n$ in the given dataset $\mathcal{D}$ can be treated as a label indicating that $\mathbf{x}_n$ belongs to class $\omega_+$ if $y_n > C$, or $\omega_-$ if $y_n < C$, and the regression problem becomes a binary classification problem.
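To make the connection concrete, the sketch below fits a linear regression function by least squares to synthetic two-dimensional data with labels $y_n \in \{-1, +1\}$ (both assumptions for illustration) and thresholds it at $C = 0$ to classify points:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two synthetic classes in d = 2, labeled y = +1 and y = -1.
X_pos = rng.normal([2.0, 2.0], 0.7, size=(30, 2))
X_neg = rng.normal([0.0, 0.0], 0.7, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(30), -np.ones(30)])

# Fit the linear regression function f(x, theta) = theta0 + theta1*x1 + theta2*x2
# by least squares, treating the labels as the dependent variable.
A = np.column_stack([np.ones(len(X)), X])
theta = np.linalg.lstsq(A, y, rcond=None)[0]

# Threshold at C = 0: f(x, theta) > C -> class omega_plus, else omega_minus.
C = 0.0
pred = np.where(A @ theta > C, 1, -1)
print("training accuracy:", np.mean(pred == y))
```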