Bias-Variance Tradeoff and Ridge Regression

In a generic regression problem, the goal is to estimate the unknown relationship $y=f({\bf x})$ between the dependent variable $y$ and the $d$ independent variables ${\bf x}=[x_1,\cdots,x_d]^T$ by a model function $\hat{f}({\bf x})$, based on a set of given training data ${\cal D}=\{({\bf x}_n,\,y_n),\;n=1,\cdots,N\}$, contaminated by some zero-mean random noise $e$ with variance $\sigma^2$:

$\displaystyle y=f({\bf x})+e,\;\;\;\;\;\;\;
E[e]=0,\;\;\;\; \Var[e]=E[(e-E[e])^2]=E[e^2]=\sigma^2$ (136)
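For concreteness, the following is a minimal sketch (with an assumed target function $f$, sample size, and noise level, none of which are specified in the text) of how a training set following this noise model could be generated:

```python
import numpy as np

# Sketch (illustrative assumptions): synthetic data following Eq. (136),
# y = f(x) + e, with zero-mean Gaussian noise of variance sigma^2.
rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)        # assumed "true" function f(x)

N, sigma = 30, 0.2                       # assumed training-set size and noise std
x = rng.uniform(0.0, 1.0, size=N)        # independent variable samples
e = rng.normal(0.0, sigma, size=N)       # zero-mean noise, Var[e] = sigma^2
y = f(x) + e                             # noisy observations of f
```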

As a function of the random training samples in ${\cal D}$, the estimated function $\hat{f}({\bf x})$ is also random, and, by assumption, it is independent of the noise $e$, i.e., $p(e,\hat{f})=p(e)\;p(\hat{f})$, and therefore $E[e\,\hat{f}]=E[e]\,E[\hat{f}]$.

How well the model $\hat{f}({\bf x})$ fits the training data can be measured by the expectation of the squared error (SE) $[y-\hat{f}({\bf x})]^2$, called the mean squared error (MSE):

$\displaystyle E[(y-\hat{f})^2]$ $\displaystyle =$ $\displaystyle E[(f+e-\hat{f})^2]
=E\left[\left((f-E\hat{f})+(E\hat{f}-\hat{f})+e\right)^2\right]$  
  $\displaystyle =$ $\displaystyle E\left[(f-E\hat{f})^2+(E\hat{f}-\hat{f})^2+e^2
+2e(f-E\hat{f})+2e(E\hat{f}-\hat{f})
+2(f-E\hat{f})(E\hat{f}-\hat{f})\right]$  
  $\displaystyle =$ $\displaystyle E[(f-E\hat{f})^2]+E[(E\hat{f}-\hat{f})^2]+E[e^2]
+2E[e(f-E\hat{f})]+2E[e(E\hat{f}-\hat{f})]
+2E[(E\hat{f}-\hat{f})(f-E\hat{f})]$  
  $\displaystyle =$ $\displaystyle E[(f-E\hat{f})^2]+E[(E\hat{f}-\hat{f})^2]+E[e^2]$ (137)

The last equality is due to the fact that all three cross terms are zero:

$\displaystyle \left\{ \begin{array}{l}
E[e(f-E\hat{f})]=E[e]\;E[f-E\hat{f}]=0 \\
E[e(E\hat{f}-\hat{f})]=E[e]\;E[E\hat{f}-\hat{f}]=0 \\
E[(E\hat{f}-\hat{f})(f-E\hat{f})]=(E\hat{f}-E\hat{f})(f-E\hat{f})=0
\end{array}\right.$ (138)

The three remaining terms represent three different types of error: the bias error $E[(f-E\hat{f})^2]$, measuring how far the average model is from the true function; the variance error $E[(\hat{f}-E\hat{f})^2]$, measuring how much the model varies with the random training data; and the irreducible error $E[e^2]=\sigma^2$ due to the noise.

We therefore get the following general relationship in terms of the three types of error:

$\displaystyle \mbox{Mean Squared Error} = \mbox{Variance} + \mbox{Bias Error} + \mbox{Irreducible Error}$ (139)

In the noise-free case with $E[e^2]=0$, we have:

$\displaystyle \mbox{Mean Squared Error} = \mbox{Variance} + \mbox{Bias Error}$ (140)
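The decomposition in Eq. (139) can be checked numerically. The sketch below (an illustration with assumed settings, not taken from the text) repeatedly draws training sets from the noise model of Eq. (136), fits a least-squares polynomial to each, and compares the sample bias error, variance error, and noise variance against the MSE of predicting a fresh noisy observation at a fixed point:

```python
import numpy as np

# Monte Carlo check of Eq. (139); all settings below are illustrative assumptions.
rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)

N, sigma, degree, trials = 30, 0.2, 3, 5000   # assumed experiment settings
x0 = 0.35                                     # fixed evaluation point

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 1.0, N)
    y = f(x) + rng.normal(0.0, sigma, N)
    coeffs = np.polyfit(x, y, degree)         # least-squares polynomial fit
    preds[t] = np.polyval(coeffs, x0)         # f_hat(x0) for this training set

bias2    = (preds.mean() - f(x0)) ** 2        # (E[f_hat] - f)^2
variance = preds.var()                        # E[(f_hat - E[f_hat])^2]
mse      = np.mean((f(x0) + rng.normal(0.0, sigma, trials) - preds) ** 2)
print(bias2 + variance + sigma**2, mse)       # the two values should be close
```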

The variance error and bias error are illustrated below:

[Figure BiasVariance.png: illustration of the bias error and the variance error]

Given the total MSE and the irreducible error $\sigma^2$, the bias error and the variance error are complementary, i.e., if one error is higher, the other is lower, depending on the model function $\hat{f}$ produced by the specific algorithm. This is called the bias-variance tradeoff, which is related to the major issue of overfitting vs. underfitting in supervised learning:

$\displaystyle \begin{tabular}{l\vert\vert c\vert c}\hline
 & Overfit & Underfit \\ \hline\hline
Variance error & High & Low \\ \hline
Bias error & Low & High \\ \hline
\end{tabular}$ (141)

[Figure UnderOverFit0.png: illustration of overfitting versus underfitting]

As both overfitting and underfitting result in poor prediction capability, they need to be avoided by making a proper tradeoff between bias-error/underfitting and variance-error/overfitting. This is an important issue not only in the context of regression, but also in general in all supervised learning algorithms, for the purpose of predicting either a value as in regression, or a categorical class in classification, based on the output $\hat{y}=f({\bf x})$ as a function of the given data point ${\bf x}$. A general framework for all such algorithms is shown below, where the algorithm in the box is to come up with a model function $f({\bf x},{\bf\theta})$ whose output $\hat{y}$ needs to match the labeling $y$ of the data point ${\bf x}$ in some optimal way:

[Figure MLmodel.png: general framework of a supervised learning algorithm]

Due to the inevitable noise in the training data, all such algorithms face the same issue of how to make a proper tradeoff between overfitting and underfitting, as illustrated below.

[Figure UnderOverFitting.png: underfitting and overfitting in regression (left) and classification (right)]

We see that in regression (left), a set of data points is modeled by two different regression functions, while in classification (right), the 2-D space is partitioned into two regions corresponding to two classes by two different decision boundaries. The issue is which of the two models, in either regression or classification, fits the data better in terms of underfitting versus overfitting.

In general, based on a single set of training data points, it is impossible to distinguish between signal and random noise, and to judge whether a learning algorithm overfits or underfits the data. However, based on multiple datasets contaminated by different random noise, an overfitting problem can be detected if the algorithm produces significantly different results when trained on different datasets. Based on this idea, the method of cross-validation can be used to train a learning algorithm on one subset of the training dataset and then test it on a different subset. If the residual error during testing is large, then the algorithm may suffer from overfitting.
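The sketch below illustrates this idea with a simple hold-out split rather than full cross-validation; the data, the split, and the two polynomial degrees are arbitrary assumptions made for illustration. A model whose test error is much larger than its training error is likely fitting the noise:

```python
import numpy as np

# Hold-out validation sketch (assumed setup): compare training vs. test error
# of a low-degree and a high-degree polynomial model.
rng = np.random.default_rng(2)

def f(x):
    return np.sin(2 * np.pi * x)

N, sigma = 40, 0.2
x = rng.uniform(0.0, 1.0, N)
y = f(x) + rng.normal(0.0, sigma, N)

idx = rng.permutation(N)
train, test = idx[: N // 2], idx[N // 2:]     # split the data into two subsets

for degree in (1, 9):                         # underfit vs. likely overfit
    coeffs = np.polyfit(x[train], y[train], degree)
    err_train = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
    err_test  = np.mean((y[test]  - np.polyval(coeffs, x[test]))  ** 2)
    print(f"degree {degree}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")
```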

Many regression and classification algorithms address the issue of overfitting versus underfitting by a process called regularization to prevent overfitting, especially when the problem is ill-conditioned, so that the model is less sensitive to noise in the training data and therefore more generalizable, while still able to capture the essential nature and behavior of the signal. Typically this is done by adjusting some hyperparameter of the model to make a proper tradeoff between the two extremes of overfitting and underfitting.

As an example, consider the method of ridge regression for LS linear regression. If the $N$ data points in the training set are barely independent, then the matrix ${\bf XX}^T$ has full rank but is very close to singular, its smallest eigenvalue $\lambda_{min}$, corresponding to the smallest singular value $\sigma_{min}=\sqrt{\lambda_{min}}$ of ${\bf A}$, is close to zero, and the normal equation ${\bf XX}^T{\bf w}={\bf Xy}$ (Eq. (115)) is ill-conditioned. In this case the solution ${\bf w}=({\bf X}^T)^-{\bf y}$ based on the pseudo-inverse is unstable and prone to noise. Any minor change due to noise in the dataset (in either ${\bf X}$ or ${\bf y}$) may cause a huge change in ${\bf w}$, i.e., the resulting ${\bf w}$ has a large variance error, and the regression model overfits the data and is therefore poorly generalizable.

In this case, the ridge regression method can be used to regularize the ill-conditioned problem by constructing an objective function $J({\bf w})$ that contains a regularization or penalty term $\lambda\vert\vert{\bf w}\vert\vert^2/2$ to discourage large weights while minimizing the error $\varepsilon=\vert\vert{\bf r}\vert\vert^2/2$:

$\displaystyle J({\bf w})=\frac{1}{2}\vert\vert{\bf y}-{\bf X}^T{\bf w}\vert\vert^2
+\frac{\lambda}{2}\vert\vert{\bf w}\vert\vert^2$ (142)

This method is in general called weight decay, and the hyperparameter $\lambda$ is called the weight decay parameter. Solving the equation

$\displaystyle \frac{d}{d{\bf w}} J({\bf w})
={\bf XX}^T{\bf w}-{\bf Xy}+\lambda{\bf w}
=({\bf XX}^T+\lambda{\bf I}){\bf w}-{\bf Xy}
={\bf0}$ (143)

we get

$\displaystyle {\bf w}^*=({\bf XX}^T+\lambda{\bf I})^{-1}{\bf Xy}$ (144)

Now a proper tradeoff between accuracy and stability can be made by adjusting the hyperparameter $\lambda$:
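A larger $\lambda$ penalizes large weights more heavily, reducing the variance error at the cost of additional bias, while $\lambda\rightarrow 0$ recovers the ordinary LS solution. The sketch below (with assumed synthetic data, not from the text) solves Eq. (144) for several values of $\lambda$, following the text's convention that ${\bf X}$ is $d\times N$ with one data point per column and the model is $\hat{y}={\bf X}^T{\bf w}$:

```python
import numpy as np

# Ridge regression per Eq. (144): w* = (X X^T + lambda I)^{-1} X y.
# The data and the lambda values below are illustrative assumptions.
rng = np.random.default_rng(3)

d, N, sigma = 5, 50, 0.1
w_true = rng.normal(size=d)                   # assumed "true" weights
X = rng.normal(size=(d, N))                   # one data point per column
y = X.T @ w_true + rng.normal(0.0, sigma, N)  # noisy targets, Eq. (136)

def ridge(X, y, lam):
    d = X.shape[0]
    # Solve (X X^T + lambda I) w = X y instead of forming an explicit inverse.
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

for lam in (0.0, 0.1, 10.0):                  # sweep the weight-decay parameter
    w = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w||={np.linalg.norm(w):.3f}  "
          f"fit MSE={np.mean((y - X.T @ w) ** 2):.4f}")
```

As $\lambda$ increases, the norm of the estimated ${\bf w}$ shrinks and the fitting error grows, which is the accuracy/stability tradeoff controlled by the hyperparameter.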