In a generic regression problem, the goal is to estimate the unknown relationship $f(\mathbf{x})$ between the dependent variable $y$ and the independent variables $\mathbf{x}=[x_1,\dots,x_d]^T$, by a model function $\hat{f}(\mathbf{x})$, based on a set of given training data $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\ n=1,\dots,N\}$, contaminated by some zero-mean random noise $e$ with variance $\sigma^2$:

$$y_n = f(\mathbf{x}_n) + e_n, \qquad n = 1,\dots,N \tag{136}$$
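As a concrete illustration (not part of the original formulation), the following short Python sketch draws a training set according to Eq. (136); the ground-truth function $f(x)=\sin(2\pi x)$ and the noise level are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical "unknown" relationship; any smooth function would do
    return np.sin(2 * np.pi * x)

N = 20                                   # number of training samples
sigma = 0.2                              # noise standard deviation (variance sigma^2)
x = rng.uniform(0.0, 1.0, N)             # independent variable
y = f(x) + rng.normal(0.0, sigma, N)     # Eq. (136): y_n = f(x_n) + e_n
```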
How well the model $\hat f(\mathbf{x})$ fits the training data can be measured by the mean of its squared error (SE), defined as $\mathrm{SE}_n = \big(y_n - \hat f(\mathbf{x}_n)\big)^2$, called the mean squared error (MSE):

$$\text{MSE} = \frac{1}{N}\sum_{n=1}^N \big(y_n - \hat f(\mathbf{x}_n)\big)^2 \tag{137}$$

Taking the expectation with respect to the random noise in the training data, the squared error at any point $\mathbf{x}$ decomposes into three terms:

$$E\Big[\big(y - \hat f(\mathbf{x})\big)^2\Big] = \underbrace{\big(f(\mathbf{x}) - E[\hat f(\mathbf{x})]\big)^2}_{\text{bias}^2} + \underbrace{E\Big[\big(\hat f(\mathbf{x}) - E[\hat f(\mathbf{x})]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}} \tag{138}$$
We therefore get the following general relationship in terms of the three types of error:

$$\text{Mean Squared Error} = \text{bias}^2\ \text{error} + \text{variance error} + \text{irreducible error} \tag{139}$$

$$\text{Mean Squared Error} = \underbrace{\big(f(\mathbf{x}) - E[\hat f(\mathbf{x})]\big)^2 + E\Big[\big(\hat f(\mathbf{x}) - E[\hat f(\mathbf{x})]\big)^2\Big]}_{\text{reducible error}} + \underbrace{\sigma^2}_{\text{irreducible error}} \tag{140}$$
Given the total MSE and the irreducible error $\sigma^2$, the bias error and the variance error are complementary, i.e., if one error is higher, the other is lower, depending on the model function $\hat f(\mathbf{x})$ produced by the specific algorithm. This is called the bias-variance tradeoff, which is related to the major issue of overfitting vs. underfitting in supervised learning:
$$\begin{cases}\text{underfitting:} & \text{high bias error, low variance error (model too simple)}\\[2pt] \text{overfitting:} & \text{low bias error, high variance error (model too complex)}\end{cases} \tag{141}$$
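The decomposition in Eqs. (138)-(140) can be checked numerically. The sketch below (an illustrative Monte Carlo estimate, not from the original text, with an assumed ground-truth function and noise level) repeatedly regenerates noisy training sets, fits a least-squares polynomial of a chosen degree, and accumulates bias², variance, and total MSE at a fixed test point; the polynomial degree stands in for model complexity:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # assumed ground truth
sigma, N, trials, degree = 0.2, 20, 2000, 3
x0 = 0.35                                # fixed test point

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 1.0, N)
    y = f(x) + rng.normal(0.0, sigma, N)     # fresh noisy training set
    w = np.polyfit(x, y, degree)             # least-squares polynomial fit
    preds[t] = np.polyval(w, x0)             # prediction f_hat(x0) for this trial

bias2    = (preds.mean() - f(x0)) ** 2       # squared bias of f_hat(x0)
variance = preds.var()                       # variance of f_hat(x0)
mse      = np.mean((f(x0) + rng.normal(0.0, sigma, trials) - preds) ** 2)
print(bias2, variance, sigma ** 2, mse)      # mse ~= bias2 + variance + sigma^2
```

Increasing the degree drives the bias term down and the variance term up, while their sum plus $\sigma^2$ tracks the total MSE, as stated in Eq. (139).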
As both overfitting and underfitting result in poor prediction capability, they need to be avoided by making a proper tradeoff between bias-error/underfitting and variance-error/overfitting. This is an important issue not only in the context of regression, but also in general in all supervised learning algorithms, for the purpose of predicting either a value as in regression, or a categorical class in classification, based on the output $\hat y = \hat f(\mathbf{x})$ as a function of the given data point $\mathbf{x}$. A general framework for all such algorithms is shown below, where the algorithm in the box is to come up with a model function $\hat f(\mathbf{x})$ of which the output $\hat y_n = \hat f(\mathbf{x}_n)$ needs to match the labeling $y_n$ of the data point $\mathbf{x}_n$ in some optimal way:
Due to the inevitable noise in the training data, all such algorithms face the same issue of how to make a proper tradeoff between overfitting and underfitting, as illustrated below.
We see that in regression on the left, a set of data points is modeled by two different regression functions, while in classification on the right, the 2-D space is partitioned into two regions corresponding to two classes by two different decision boundaries. The issue is which of the two models, in either regression or classification, fits the data better, in terms of underfitting versus overfitting.

The simple linear regression function or decision boundary (red) may underfit the data, as it may miss some legitimate variations in the data.

The complex regression model, e.g., a high-order polynomial, or decision boundary (blue) may overfit the data, as it may be overly affected by the variation due to the observation noise.
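To make the contrast concrete, here is a minimal Python sketch (not from the original text; the signal, noise level, and polynomial degrees are assumptions) that fits both a degree-1 and a degree-9 polynomial to the same noisy sample; the high-degree fit drives the training error toward zero while chasing the noise:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)      # assumed underlying signal
x = np.sort(rng.uniform(0.0, 1.0, 15))
y = f(x) + rng.normal(0.0, 0.2, 15)      # noisy observations
grid = np.linspace(0.0, 1.0, 200)        # dense grid for comparing against the signal

for degree in (1, 9):                    # simple vs. complex model
    w = np.polyfit(x, y, degree)         # least-squares fit
    train_err = np.mean((np.polyval(w, x) - y) ** 2)
    true_err  = np.mean((np.polyval(w, grid) - f(grid)) ** 2)
    print(degree, train_err, true_err)   # degree 9: tiny training error, large true error
```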
In general, based on a single set of training data points, it is impossible to distinguish between signal and random noise, and to judge whether a learning algorithm over- or underfits the data. However, based on multiple datasets contaminated by different random noise, the overfitting problem can be detected if the algorithm produces significantly different results when trained by different datasets. Based on this idea, the method of cross-validation can be used to train a learning algorithm on a subset of the training dataset and then test the algorithm on a different subset of the data. If the residual error during testing is large, then the algorithm may suffer from overfitting.
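The following sketch (illustrative only, with made-up data and candidate model degrees) applies this idea with a simple hold-out split: each candidate model is fit on one subset of the data and evaluated on the other, and a validation error much larger than the training error flags overfitting:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)              # assumed signal
x = rng.uniform(0.0, 1.0, 30)
y = f(x) + rng.normal(0.0, 0.2, 30)

idx = rng.permutation(30)                        # random split of the dataset
train, test = idx[:20], idx[20:]                 # 20 points to train, 10 to validate

for degree in (1, 3, 9):
    w = np.polyfit(x[train], y[train], degree)   # fit on the training subset only
    err_train = np.mean((np.polyval(w, x[train]) - y[train]) ** 2)
    err_test  = np.mean((np.polyval(w, x[test])  - y[test])  ** 2)
    print(degree, err_train, err_test)           # err_test >> err_train suggests overfitting
```

Full k-fold cross-validation repeats this split k times with different held-out folds and averages the validation errors, which gives a more stable estimate than a single split.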
Many regression and classification algorithms address the issue of overfitting versus underfitting by a process called regularization to prevent overfitting, especially when the problem is ill-conditioned, so that the model is less sensitive to noise in the training data and therefore more generalizable, while still able to capture the essential nature and behavior of the signal. Typically this is done by adjusting some hyperparameter of the model to make a proper tradeoff between the two extremes of overfitting and underfitting.
As an example, consider the method of ridge regression for LS linear regression. If the data points in the training set are barely independent, then the matrix $\mathbf{X}^T\mathbf{X}$ has full rank but is very close to singularity, and its smallest eigenvalue $\lambda_{\min}=\sigma_{\min}^2$, where $\sigma_{\min}$ is the smallest singular value of the data matrix $\mathbf{X}$, is close to zero, so the normal equation $\mathbf{X}^T\mathbf{X}\,\mathbf{w}=\mathbf{X}^T\mathbf{y}$ (Eq. (115)) is ill-conditioned. Now the solution $\hat{\mathbf{w}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ based on the pseudo-inverse is unstable and prone to noise. Any minor change due to noise in the dataset (in either $\mathbf{X}$ or $\mathbf{y}$) may cause a huge change in $\hat{\mathbf{w}}$, i.e., the resulting $\hat f(\mathbf{x})$ has a large variance error, and the regression model overfits the data and therefore generalizes poorly.
In this case, the ridge regression method can be used to regularize the ill-conditioned problem by constructing an objective function $J(\mathbf{w})$ that contains a regularization or penalty term $\lambda\|\mathbf{w}\|^2$ to discourage large weights while minimizing the error $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2$:

$$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\,\|\mathbf{w}\|^2 \tag{143}$$
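Setting the gradient of Eq. (143) to zero gives the regularized normal equation $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\,\mathbf{w} = \mathbf{X}^T\mathbf{y}$, i.e., $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The Python sketch below is only an illustration of this closed form on made-up, nearly collinear data (for simplicity the intercept is penalized along with the other weights):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x1 = rng.normal(0.0, 1.0, N)
x2 = x1 + rng.normal(0.0, 1e-3, N)       # nearly collinear column -> X^T X close to singular
X = np.column_stack([np.ones(N), x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0.0, 0.1, N)

# ordinary least squares: w = (X^T X)^{-1} X^T y  (unstable for this X)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# ridge regression: w = (X^T X + lambda I)^{-1} X^T y, minimizing Eq. (143)
lam = 0.1                                # regularization hyperparameter lambda
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ls)                              # large, noise-sensitive weights
print(w_ridge)                           # smaller, more stable weights
```

The hyperparameter $\lambda$ controls the tradeoff: $\lambda=0$ recovers the unregularized (possibly overfitting) LS solution, while a very large $\lambda$ shrinks all weights toward zero and underfits.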