In a generic regression problem, the goal is to estimate the unknown relationship $y=f(\mathbf{x})$ between the dependent variable $y$ and the independent variables $\mathbf{x}$, by a model function $\hat{f}(\mathbf{x})$, based on a set of given training data $\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$, contaminated by some zero-mean random noise $\epsilon$ with variance $\sigma^2$:

$$y_n = f(\mathbf{x}_n) + \epsilon_n, \qquad E[\epsilon_n]=0,\quad \mathrm{Var}(\epsilon_n)=\sigma^2 \tag{136}$$
How well the model $\hat{f}(\mathbf{x})$ fits the training data can be measured by the mean of its squared error (SE), defined as $e^2=(y-\hat{f}(\mathbf{x}))^2$, called the mean squared error (MSE):

$$\mathrm{MSE} = E\left[\left(y-\hat{f}(\mathbf{x})\right)^2\right]
= E\left[\left(f(\mathbf{x})+\epsilon-\hat{f}(\mathbf{x})\right)^2\right]
= E\left[\left(f(\mathbf{x})-\hat{f}(\mathbf{x})\right)^2\right]+\sigma^2 \tag{137}$$

where the cross term $2E[\epsilon\,(f(\mathbf{x})-\hat{f}(\mathbf{x}))]$ vanishes because the noise at a test point is zero-mean and independent of the model. The first term can be further decomposed into the variance and the squared bias of the model:

$$E\left[\left(f(\mathbf{x})-\hat{f}(\mathbf{x})\right)^2\right]
= E\left[\left(\hat{f}(\mathbf{x})-E[\hat{f}(\mathbf{x})]\right)^2\right]
+ \left(E[\hat{f}(\mathbf{x})]-f(\mathbf{x})\right)^2
= \mathrm{Var}\left(\hat{f}(\mathbf{x})\right)+\mathrm{Bias}^2\left(\hat{f}(\mathbf{x})\right) \tag{138}$$
We therefore get the following general relationship in terms of the three types of error:

$$\underbrace{E\left[\left(y-\hat{f}(\mathbf{x})\right)^2\right]}_{\text{Mean Squared Error}}
= \underbrace{\mathrm{Var}\left(\hat{f}(\mathbf{x})\right)}_{\text{Variance Error}}
+ \underbrace{\mathrm{Bias}^2\left(\hat{f}(\mathbf{x})\right)}_{\text{Bias Error}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}} \tag{139}$$

As the irreducible error $\sigma^2$ is caused by the observation noise and cannot be reduced by any choice of model, only the remaining two terms can be minimized:

$$\text{Mean Squared Error} - \text{Irreducible Error} = \text{Variance Error} + \text{Bias Error} \tag{140}$$
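As a purely illustrative numerical example (the numbers are assumed, not taken from the notes): suppose that at some test point the irreducible noise variance is $\sigma^2=0.04$, while a particular model has variance error $0.01$ and bias error $0.02$. Then by Eq. (139),

$$\mathrm{MSE} = 0.01 + 0.02 + 0.04 = 0.07,$$

of which only the $0.01+0.02=0.03$ on the right-hand side of Eq. (140) can be reduced by a better choice of model.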
Given the total MSE and the irreducible error $\sigma^2$, the bias error and the variance error are complementary, i.e., if one error is higher, the other is lower, depending on the model function $\hat{f}(\mathbf{x})$ produced by the specific algorithm. This is called the bias-variance tradeoff, which is related to the major issue of overfitting vs. underfitting in supervised learning:

$$\begin{cases}
\text{Underfitting:} & \text{model too simple, high bias error, low variance error}\\
\text{Overfitting:} & \text{model too complex, low bias error, high variance error}
\end{cases} \tag{141}$$
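To make the tradeoff concrete, the decomposition in Eq. (139) can be estimated empirically by Monte Carlo simulation: repeatedly generate noisy training sets from a known signal, fit models of different complexity, and measure the bias and variance of their predictions at a test point. The following Python sketch is only illustrative; the sinusoidal signal, the polynomial degrees, and all numerical settings are assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # "True" signal, assumed only for this illustration
    return np.sin(2 * np.pi * x)

N, sigma, trials = 20, 0.3, 500
x_train = np.linspace(0, 1, N)
x_test = 0.5                      # single test point where bias and variance are measured

for degree in (1, 3, 9):          # low, moderate, and high model complexity
    preds = np.empty(trials)
    for t in range(trials):
        # New training set with fresh zero-mean noise of variance sigma^2 each trial
        y_train = f(x_train) + sigma * rng.standard_normal(N)
        coef = np.polyfit(x_train, y_train, degree)      # LS polynomial fit
        preds[t] = np.polyval(coef, x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2              # (E[f_hat] - f)^2
    var = preds.var()                                    # E[(f_hat - E[f_hat])^2]
    mse = bias2 + var + sigma ** 2                       # Eq. (139)
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}, MSE = {mse:.4f}")
```

Typically the low-degree fit shows a large bias error and a small variance error (underfitting), while the high-degree fit shows the reverse (overfitting).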
As both overfitting and underfitting result in poor prediction capability, they need to be avoided by making a proper tradeoff between bias error/underfitting and variance error/overfitting. This is an important issue not only in the context of regression, but in general in all supervised learning algorithms, for the purpose of predicting either a value as in regression, or a categorical class as in classification, based on the output $\hat{y}=\hat{f}(\mathbf{x})$ as a function of the given data point $\mathbf{x}$. A general framework for all such algorithms is shown below, where the algorithm in the box is to come up with a model function $\hat{f}$ whose output needs to match the labeling $y$ of the data point $\mathbf{x}$ in some optimal way.
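In code, this shared structure can be captured by a minimal interface: a training step that produces the model function $\hat{f}$, and a prediction step that evaluates $\hat{y}=\hat{f}(\mathbf{x})$ on new data points. The following is only a schematic Python sketch of that framework; the class names SupervisedLearner and PolynomialRegressor are illustrative assumptions, not part of the original notes.

```python
from abc import ABC, abstractmethod
import numpy as np

class SupervisedLearner(ABC):
    """Common shape of a supervised learning algorithm: training produces the
    model function f_hat, prediction evaluates y_hat = f_hat(x) on new points."""

    @abstractmethod
    def fit(self, X, y):           # learn f_hat from labeled training data
        ...

    @abstractmethod
    def predict(self, X):          # return y_hat = f_hat(x) for each data point
        ...

class PolynomialRegressor(SupervisedLearner):
    """A regression instance of the framework: 1-D LS polynomial fitting."""
    def __init__(self, degree):
        self.degree = degree
        self.coef = None

    def fit(self, X, y):
        self.coef = np.polyfit(X, y, self.degree)
        return self

    def predict(self, X):
        return np.polyval(self.coef, X)
```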
Due to the inevitable noise in the training data, all such algorithms face the same issue of how to make a proper tradeoff between overfitting and underfitting, as illustrated below.
We see that in regression on the left, a set of data points is modeled by two different regression functions, while in classification on the right, the 2-D space is partitioned into two regions corresponding to two classes by two different decision boundaries. The issue is which of the two models, in either regression or classification, fits the data better in terms of underfitting versus overfitting.
The simple linear regression function or decision boundary (red) may underfit the data, as it may miss some legitimate variations in the data.
The complex regression model (e.g., a high-order polynomial) or decision boundary (blue) may overfit the data, as it may be overly affected by the variation due to the observation noise.
In general, based on a single set of training data points, it is impossible to distinguish between signal and random noise, and to judge whether a learning algorithm over- or underfits the data. However, based on multiple datasets contaminated by different random noise, the overfitting problem can be detected if the algorithm produces significantly different results when trained by different datasets. Based on this idea, the method of cross-validation can be used to train a learning algorithm on a subset of the training dataset and then test the algorithm on a different subset of the data. If the residual error during testing is large, then the algorithm may suffer from overfitting.
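The sketch below illustrates this idea with a simple hand-rolled $k$-fold cross-validation in Python; the polynomial model, the number of folds, and the helper name cv_mse are assumptions made for this example rather than details taken from the text.

```python
import numpy as np

def cv_mse(x, y, degree, k=5, seed=0):
    """Mean held-out MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                 # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)     # train on k-1 folds
        resid = y[test] - np.polyval(coef, x[test])        # test on the held-out fold
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

# Illustrative data: noisy samples of a smooth signal
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)
print(cv_mse(x, y, degree=3))    # moderate model: small held-out error
print(cv_mse(x, y, degree=12))   # complex model: much larger held-out error suggests overfitting
```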
Many regression and classification algorithms address the issue of overfitting versus underfitting by a process called regularization to prevent overfitting, especially when the problem is ill-conditioned, so that the model is less sensitive to noise in the training data and therefore more generalizable, while still able to capture the essential nature and behavior of the signal. Typically this is done by adjusting some hyperparameter of the model to make a proper tradeoff between the two extremes of over- and underfitting.
As an example, consider the method of ridge regression for LS linear regression. If the data points in the training set are barely independent, then the matrix $\mathbf{X}^T\mathbf{X}$ has full rank but is very close to singularity, and its smallest eigenvalue $\lambda_{\min}$, the square of the smallest singular value of $\mathbf{X}$, is close to zero, so the normal equation (Eq. (115)) is ill-conditioned. Now the solution $\hat{\mathbf{w}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ based on the pseudo-inverse is unstable and prone to noise. Any minor change due to noise in the dataset (in either $\mathbf{X}$ or $\mathbf{y}$) may cause a huge change in $\hat{\mathbf{w}}$, i.e., the resulting $\hat{\mathbf{w}}$ has a large variance error, and the regression model overfits the data and therefore generalizes poorly.
In this case, the ridge regression method can be used to regularize the ill-conditioned problem by constructing an objective function that contains a regularization or penalty term $\lambda\|\mathbf{w}\|^2$ to discourage large weights while minimizing the LS error $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2$:

$$J(\mathbf{w}) = \|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\,\|\mathbf{w}\|^2 \tag{142}$$
This method is in general called weight decay, and the hyperparameter $\lambda\ge 0$ is called the weight decay parameter. Solving the equation $\frac{d}{d\mathbf{w}}J(\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{y}) + 2\lambda\mathbf{w} = \mathbf{0}$, we get the regularized solution:

$$\hat{\mathbf{w}} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y} \tag{143}$$
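A direct implementation of the closed-form solution in Eq. (143) is straightforward. The sketch below builds an intentionally ill-conditioned design matrix with two nearly collinear columns and compares the ordinary LS solution with the ridge solution for a few values of the weight decay parameter; the specific data and the values of $\lambda$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear regressors, so X^T X has full rank but is close to singular
x1 = rng.standard_normal(30)
x2 = x1 + 1e-4 * rng.standard_normal(30)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.standard_normal(30)            # underlying weights roughly (1, 0)

def ridge(X, y, lam):
    """Regularized LS solution of Eq. (143): w = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ordinary LS (lam = 0) tends to produce large, unstable weights here
print("LS:", np.linalg.lstsq(X, y, rcond=None)[0])
# Increasing the weight decay parameter shrinks the weights and stabilizes the solution
for lam in (1e-3, 1e-1, 1.0):
    print(f"ridge (lambda = {lam}):", ridge(X, y, lam))
```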