The goal of regression analysis is to model the relationship between a dependent variable $y$, typically a scalar, and a set of independent variables or predictors, represented as a column vector $\mathbf{x}=[x_1,\cdots,x_d]^T$ in a $d$-dimensional space. Here both $y$ and the components of $\mathbf{x}$ take numerical values.
Regression can be considered a supervised learning method that learns the essential relationship between the dependent and independent variables, based on a training dataset containing $N$ observed data samples:

$$\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;\;n=1,\cdots,N\} \tag{99}$$
More specifically, a regression algorithm models the relationship between the independent variable $\mathbf{x}$ and the dependent variable $y$ by a hypothesized regression function $\hat{y}=f(\mathbf{x},\mathbf{a})$, containing a set of parameters symbolically denoted by $\mathbf{a}$. Geometrically this regression function represents a curve if $d=1$, a surface if $d=2$, or a hypersurface if $d>2$.
Typically, the form of the function $f$ (e.g., linear, polynomial, or exponential) is assumed to be known based on prior knowledge, while the parameter $\mathbf{a}$ is to be estimated by the regression algorithm, so that the predicted value $\hat{y}=f(\mathbf{x},\mathbf{a})$ matches the ground truth $y$ optimally in some sense, without being affected by the inevitable observation noise in the data. In other words, a regression algorithm should neither overfit nor underfit the data.
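As a minimal sketch of this point, the example below assumes a synthetic one-dimensional dataset and a polynomial form for $f$ (both are illustrative assumptions, not part of the discussion above); a degree that is too low underfits the data, while an unnecessarily high degree fits the observation noise even though its training error is smallest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = sin(2*pi*x) + Gaussian observation noise (hypothetical example)
N = 20
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Hypothesize a polynomial form for f(x, a) and estimate a by least squares
for degree in (1, 3, 9):
    a = np.polyfit(x, y, degree)        # estimated parameters for this degree
    y_hat = np.polyval(a, x)            # predictions on the training points
    sse = np.sum((y_hat - y) ** 2)      # sum of squared residuals on the training set
    print(f"degree {degree}: training SSE = {sse:.3f}")
    # A small training SSE alone does not mean a good model: degree 9 fits the noise.
```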
Regression analysis can also be interpreted as system modeling/identification, when the independent variable $\mathbf{x}$ and the dependent variable $y$ are treated respectively as the input (stimuli) and output (responses) of a system, the behavior of which is described by the relationship between such input and output modeled by the regression function $f(\mathbf{x},\mathbf{a})$.
Regression analysis is also closely related to pattern recognition/classification, when the independent vector variable $\mathbf{x}$ in the data is treated as a set of features that characterize a pattern or object of interest, and the corresponding dependent variable $y$ is treated as a categorical label indicating to which of a set of classes the pattern belongs. In this case, the modeling process of the relationship between $\mathbf{x}$ and $y$ becomes supervised pattern classification or recognition, to be discussed in a later chapter.
In general the regression problem can be addressed from different philosophical viewpoints. In the frequentist point of view, the unknown model parameters in $\mathbf{a}$ are fixed deterministic variables that can be estimated based on the observed data; a typical method based on this viewpoint is the least squares method.
Alternatively, in the Bayesian inference point of view, the model parameters in $\mathbf{a}$ are random variables. Their prior probability distribution $p(\mathbf{a})$ before any data are observed can be estimated based on some prior knowledge. If no such prior knowledge is available, $p(\mathbf{a})$ can simply be a uniform distribution, i.e., all possible values of $\mathbf{a}$ are equally likely. Once the training set $\mathcal{D}$ becomes available, we can further obtain the posterior probability $p(\mathbf{a}|\mathcal{D})$ based on Bayes' theorem:

$$p(\mathbf{a}|\mathcal{D})=\frac{p(\mathcal{D}|\mathbf{a})\,p(\mathbf{a})}{p(\mathcal{D})} \tag{100}$$
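As a concrete illustration of this posterior update, the following sketch assumes a linear regression function with Gaussian observation noise and a Gaussian prior on the parameters (a conjugate special case chosen only for illustration; the variable names, noise level, and prior strength are assumptions). The posterior over the parameters is then available in closed form, and the prediction at a new point comes with both a mean and a variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear data: y = 2.0*x - 1.0 + noise (hypothetical ground truth)
N, sigma = 30, 0.3
x = rng.uniform(-1.0, 1.0, N)
X = np.column_stack([x, np.ones(N)])            # design matrix with a bias column
y = X @ np.array([2.0, -1.0]) + rng.normal(0.0, sigma, N)

# Gaussian prior on the parameters a: p(a) = N(0, alpha^-1 I)
alpha = 1.0
prior_cov_inv = alpha * np.eye(2)

# Posterior p(a | D) is also Gaussian (conjugacy): N(post_mean, post_cov)
post_cov = np.linalg.inv(prior_cov_inv + X.T @ X / sigma**2)
post_mean = post_cov @ (X.T @ y) / sigma**2

# Predictive distribution at a new point: a mean and a variance, not a single value
x_new = np.array([0.5, 1.0])
pred_mean = x_new @ post_mean
pred_var = sigma**2 + x_new @ post_cov @ x_new
print("posterior mean of a:", post_mean)
print(f"prediction at x=0.5: mean={pred_mean:.3f}, var={pred_var:.3f}")
```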
In general, the posterior is a narrower distribution than the prior, and therefore a more accurate description of the model parameters. The output of such a regression model based on Bayesian inference is no longer a deterministic value, but a random variable described by its mean and variance.

Based on either of these two viewpoints, different regression algorithms can be used to find the parameters $\mathbf{a}$ for the model function $f(\mathbf{x},\mathbf{a})$ to fit the observed dataset in some optimal way based on different criteria, as shown below.
This frequentist method, the least squares method mentioned above, measures how well the regression function $f(\mathbf{x},\mathbf{a})$ models the training data by the residual $r_n$, defined as the difference between the model prediction $\hat{y}_n=f(\mathbf{x}_n,\mathbf{a})$ and the ground truth value $y_n$ corresponding to $\mathbf{x}_n$, for each of the $N$ data points in the observed data:

$$r_n=\hat{y}_n-y_n=f(\mathbf{x}_n,\mathbf{a})-y_n,\;\;\;\;n=1,\cdots,N \tag{101}$$

The optimal parameters are those that minimize the sum of squared residuals $\varepsilon(\mathbf{a})=\sum_{n=1}^N r_n^2$.
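A minimal numerical sketch of this least squares criterion, assuming a linear form for $f(\mathbf{x},\mathbf{a})$ and synthetic data (the variable names and the use of NumPy's least squares solver are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic over-determined problem: N = 50 samples, d = 2 predictors plus a bias term
N = 50
X = np.column_stack([rng.uniform(-1, 1, (N, 2)), np.ones(N)])
a_true = np.array([1.5, -0.7, 0.4])                 # hypothetical ground-truth parameters
y = X @ a_true + rng.normal(0.0, 0.1, N)            # observations with additive noise

# Least squares estimate: minimize the sum of squared residuals over a
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = X @ a_hat - y
print("estimated parameters:", a_hat)
print("sum of squared residuals:", np.sum(residuals**2))
```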
This Bayesian inference method, maximum likelihood estimation (MLE), measures how well the regression function $f(\mathbf{x},\mathbf{a})$ models the training data in terms of the likelihood $L(\mathbf{a}|\mathcal{D})$ of the model parameter $\mathbf{a}$ based on the observed dataset $\mathcal{D}$, which is proportional to the conditional probability of $\mathcal{D}$ given $\mathbf{a}$:

$$L(\mathbf{a}|\mathcal{D})\propto p(\mathcal{D}|\mathbf{a}) \tag{103}$$

We assume each observed value $y$ is related to $\mathbf{x}$ by the regression function but contaminated by additive observation noise $e$:

$$y=f(\mathbf{x},\mathbf{a})+e \tag{104}$$

where the noise $e$ is assumed to be a zero-mean Gaussian random variable with variance $\sigma^2$:

$$p(e)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{e^2}{2\sigma^2}\right) \tag{105}$$
The zero-mean pdf of $e=y-f(\mathbf{x},\mathbf{a})$ can also be considered as the pdf of $y$ with mean $f(\mathbf{x},\mathbf{a})$, i.e., the conditional pdf $p(y|\mathbf{x},\mathbf{a})$ of $y$ given $\mathbf{x}$ as well as $\mathbf{a}$, based on the assumption that $\mathbf{x}$ and $y$ are indeed related by $y=f(\mathbf{x},\mathbf{a})+e$. Then we can find the likelihood of the model parameter $\mathbf{a}$ given the $N$ samples in the training set $\mathcal{D}$, all assumed to be independent and identically distributed (i.i.d.):

$$L(\mathbf{a}|\mathcal{D})\propto p(\mathcal{D}|\mathbf{a})=\prod_{n=1}^N p(y_n|\mathbf{x}_n,\mathbf{a})=\prod_{n=1}^N \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(y_n-f(\mathbf{x}_n,\mathbf{a})\right)^2}{2\sigma^2}\right)$$
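Under this Gaussian noise assumption, maximizing the likelihood (equivalently, minimizing its negative logarithm) over $\mathbf{a}$ reduces to minimizing the sum of squared residuals, so MLE and least squares coincide here. The sketch below checks this numerically for a linear model; the synthetic data and the use of a generic optimizer from SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Synthetic data from a linear model with Gaussian noise (hypothetical setup)
N, sigma = 40, 0.2
x = rng.uniform(-1, 1, N)
X = np.column_stack([x, np.ones(N)])
y = X @ np.array([0.8, 0.3]) + rng.normal(0.0, sigma, N)

def neg_log_likelihood(a):
    # -log L(a|D) for i.i.d. Gaussian noise, dropping terms independent of a
    r = X @ a - y
    return np.sum(r**2) / (2 * sigma**2)

a_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
a_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("MLE estimate:          ", a_mle)
print("least squares estimate:", a_ls)   # agrees with the MLE up to numerical error
```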
This is also a Bayesian inference method that measures how well the regression function models the training data, now by the posterior probability $p(\mathbf{a}|\mathcal{D})$ of the parameter $\mathbf{a}$ given $\mathcal{D}$ in Eq. (100), proportional to the product of the likelihood $p(\mathcal{D}|\mathbf{a})$ and the prior $p(\mathbf{a})$. If no prior knowledge about $\mathbf{a}$ is available, then $p(\mathbf{a})$ is a uniform distribution and MAP is equivalent to MLE. However, if certain prior knowledge regarding $\mathbf{a}$ does exist and the prior $p(\mathbf{a})$ is not uniform, then the posterior characterizes $\mathbf{a}$ better than the likelihood does, and MAP may produce a better result than MLE.
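For instance, if the prior on $\mathbf{a}$ is a zero-mean Gaussian and the noise is Gaussian as above, maximizing the posterior is equivalent to a regularized (ridge-style) least squares problem. The sketch below compares the resulting MAP estimate with the MLE; the synthetic data and the regularization weight `lam` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Small, noisy dataset where prior knowledge can help (hypothetical setup)
N, sigma = 10, 0.5
x = rng.uniform(-1, 1, N)
X = np.column_stack([x, np.ones(N)])
y = X @ np.array([1.0, 0.0]) + rng.normal(0.0, sigma, N)

# MLE / least squares: a = (X^T X)^-1 X^T y
a_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with zero-mean Gaussian prior p(a) = N(0, tau^2 I):
# maximizing the posterior is equivalent to minimizing
# ||X a - y||^2 + lam * ||a||^2  with  lam = sigma^2 / tau^2
lam = 1.0
a_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("MLE estimate:", a_mle)
print("MAP estimate:", a_map)   # shrunk toward the prior mean (zero)
```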
We see that regression analysis can be treated as an optimization problem, in which either the sum of squared errors is minimized or the likelihood or posterior probability is maximized. Also, the optimization problem is in general over-determined, as there are typically many more observed data points in the training set than unknown parameters. Algorithms based on these different methods will be considered in detail in later sections of this chapter.
We further note that regression analysis can be considered as binary classification, when the regression function $\hat{y}=f(\mathbf{x},\mathbf{a})$, as a hypersurface in the $(d+1)$-dimensional space spanned by $\mathbf{x}$ and $y$, is thresholded by a constant $C$ (e.g., $C=0$). The resulting equation $f(\mathbf{x},\mathbf{a})=C$ defines a hypersurface in the $d$-dimensional space spanned by $\mathbf{x}$, which partitions the space into two parts in such a way that all points on one side of the hypersurface satisfy $f(\mathbf{x},\mathbf{a})>C$, while all points on the other side satisfy $f(\mathbf{x},\mathbf{a})<C$. In other words, the thresholded regression function is a binary classifier that assigns every point $\mathbf{x}$ in the $d$-dimensional space to one of two classes, depending on whether $f(\mathbf{x},\mathbf{a})$ is greater or smaller than $C$. Now each $y_n$ corresponding to $\mathbf{x}_n$ in the given dataset can be treated as a label indicating that $\mathbf{x}_n$ belongs to one class if $y_n>C$, or to the other if $y_n<C$, and the regression problem becomes a binary classification problem.
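The sketch below illustrates this thresholding view on synthetic data: a linear regression function is fitted to labels $y_n\in\{-1,+1\}$ and then thresholded at $C=0$ to classify points (the data, labels, and threshold are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)

# Two Gaussian clusters in d = 2 dimensions, labeled y = +1 or y = -1 (hypothetical data)
N = 100
X_pos = rng.normal([1.5, 1.5], 1.0, (N, 2))
X_neg = rng.normal([-1.5, -1.5], 1.0, (N, 2))
X = np.vstack([X_pos, X_neg])
X = np.column_stack([X, np.ones(2 * N)])          # append a bias column
y = np.concatenate([np.ones(N), -np.ones(N)])

# Fit the regression function f(x, a) = a^T x by least squares
a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Threshold at C = 0: the hypersurface f(x, a) = 0 separates the two classes
C = 0.0
y_pred = np.where(X @ a > C, 1.0, -1.0)
print("training accuracy:", np.mean(y_pred == y))
```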