Regression as Optimization Problems

The goal of regression analysis is to model the relationship between a dependent variable $y$, typically a scalar, and a set of independent variables or predictors $\{x_1,\cdots,x_d\}$, represented as a column vector ${\bf x}=[x_1,\cdots,x_d]^T$ in a $d$-dimensional space. Here both $y$ and the components of ${\bf x}$ take numerical values.

Regression can be considered as a supervised learning method that learns the essential relationship between the dependent and independent variables, based on the training dataset containing $N$ observed data samples

$\displaystyle {\cal D}=\{ ({\bf x}_n,\,y_n),\; (n=1,\cdots, N) \}$ (99)

where each ${\bf x}_n$ is labeled by $y_n$, the ground truth value corresponding to ${\bf x}_n$. We also represent the training set as ${\cal D}=\{{\bf X},\;{\bf y}\}$ in terms of a $d\times N$ matrix ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ containing the $N$ sample points and an $N$-dimensional vector ${\bf y}=[y_1,\cdots,y_N]^T$ containing their labels.
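To make the notation concrete, the following sketch (not part of the original development; the numbers are arbitrary) stores such a dataset with $d=2$ and $N=4$ as a NumPy array ${\bf X}$ of shape $d\times N$ together with a label vector ${\bf y}$ of length $N$.

    import numpy as np

    # Hypothetical example: d = 2 predictors, N = 4 samples.
    # Each column of X is one sample point x_n; y holds the labels y_n.
    X = np.array([[0.5, 1.0, 1.5, 2.0],     # values of x_1 for the N samples
                  [1.2, 0.7, 2.3, 1.9]])    # values of x_2 for the N samples
    y = np.array([1.1, 0.9, 2.4, 2.1])      # ground-truth labels y_1, ..., y_N

    d, N = X.shape                           # d = 2, N = 4
    assert y.shape == (N,)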

More specifically, a regression algorithm models the relationship between the independent variable ${\bf x}$ and the dependent variable $y$ by a hypothesized regression function $\hat{y}=f({\bf x},{\bf\theta})$, containing a set of parameters symbolically denoted by ${\bf\theta}=[\theta_1,\cdots,\theta_M]^T$. Geometrically this regression function represents a curve if $d=1$, a surface if $d=2$, or a hypersurface if $d>2$.

Typically, the form of the function $f({\bf x},{\bf\theta})$ (e.g., linear, polynomial, or exponential) is assumed to be known based on prior knowledge, while the parameter ${\bf\theta}$ is to be estimated by the regression algorithm, so that the predicted value $\hat{y}$ matches the ground truth $y$ optimally in some sense, without being unduly affected by the inevitable observation noise in the data. In other words, a regression algorithm should neither overfit nor underfit the data, as illustrated below.
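As a minimal illustration of this point (with assumed one-dimensional data generated from a noisy linear relationship), the sketch below fits both a linear model and a high-degree polynomial by least squares; the high-degree fit achieves a smaller training error only by fitting the noise, i.e., by overfitting.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 1-D data: the underlying relationship is linear, y = 2x + 1,
    # observed with additive noise.
    x = np.linspace(0, 1, 20)
    y = 2 * x + 1 + 0.1 * rng.standard_normal(x.size)

    # Two hypothesized model forms f(x, theta):
    theta_lin = np.polyfit(x, y, deg=1)   # linear model, matching the true form
    theta_hi = np.polyfit(x, y, deg=9)    # high-degree polynomial, prone to overfitting

    # The higher-degree fit has a smaller training error, but much of that
    # improvement comes from fitting the noise rather than the relationship.
    err_lin = np.sum((np.polyval(theta_lin, x) - y) ** 2)
    err_hi = np.sum((np.polyval(theta_hi, x) - y) ** 2)
    print(err_lin, err_hi)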

Regression analysis can also be interpreted as system modeling/identification, when the independent variable ${\bf x}$ and the dependent variable $y$ are treated respectively as the input (stimuli) and output (responses) of a system, the behavior of which is described by the relationship between such input and output modeled by the regression function $y=f({\bf x})$.

Regression analysis is also closely related to pattern recognition/classification, when the independent vector variable ${\bf x}$ in the data is treated as a set of features that characterize a pattern or object of interest, and the corresponding dependent variable $y$ is treated as a categorical label indicating to which of a set of $K$ classes $\{C_1,\cdots,C_K\}$ a pattern ${\bf x}$ belongs. In this case, the modeling process of the relationship between ${\bf x}$ and $y$ becomes supervised pattern classification or recognition, to be discussed in a later chapter.

In general, the regression problem can be addressed from different philosophical viewpoints. From the frequentist point of view, the unknown model parameters in ${\bf\theta}$ are fixed deterministic variables that can be estimated based on the observed data, and a typical method based on this viewpoint is the least squares method.
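For example, when the hypothesized function is linear in the parameters, the least squares estimate of ${\bf\theta}$ minimizes the sum of squared errors $\sum_n(y_n-\hat{y}_n)^2$ and can be computed directly. The sketch below reuses the assumed data from the earlier example and is only a preview of the method developed in detail later.

    import numpy as np

    # Assumed d x N data matrix X and label vector y from the earlier sketch.
    X = np.array([[0.5, 1.0, 1.5, 2.0],
                  [1.2, 0.7, 2.3, 1.9]])
    y = np.array([1.1, 0.9, 2.4, 2.1])

    # Design matrix A (N x (d+1)): one row per sample, plus a constant 1 for the bias.
    A = np.hstack([X.T, np.ones((X.shape[1], 1))])

    # Least squares estimate of theta = [theta_1, theta_2, bias]^T,
    # minimizing ||A theta - y||^2 over the over-determined system (N > d+1).
    theta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

    y_hat = A @ theta          # fitted values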

Alternatively, from the Bayesian inference point of view, the model parameters in ${\bf\theta}$ are random variables. Their prior probability distribution $p({\bf\theta})$ before any data are observed can be estimated based on some prior knowledge. If no such prior knowledge is available, $p({\bf\theta})$ can simply be a uniform distribution, i.e., all possible values of ${\bf\theta}$ are equally likely. Once the training set ${\bf X}$ becomes available, we can further obtain the posterior probability based on Bayes' theorem:

$\displaystyle p({\bf\theta}\vert{\bf X}) =\frac{ p({\bf X}\vert{\bf\theta})\,p({\bf\theta})}{p({\bf X})} \propto p({\bf X}\vert{\bf\theta})\,p({\bf\theta})$ (100)

In general, the posterior is a narrower distribution than the prior, and therefore a more accurate description of the model parameters. The output $\hat{y}=f({\bf x},{\bf\theta})$ of such a regression model based on Bayesian inference is no longer a deterministic value, but a random variable described by its mean and variance.
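As a simplified illustration, assume a model linear in the parameters, a zero-mean Gaussian prior ${\bf\theta}\sim{\cal N}({\bf 0},\alpha^{-1}{\bf I})$, and Gaussian observation noise with precision $\beta$; under these assumptions the posterior of ${\bf\theta}$ is also Gaussian with a closed-form mean and covariance, and the prediction at a new point has both a mean and a variance. The values of $\alpha$ and $\beta$ below are arbitrary, and the data are the same assumed samples as before.

    import numpy as np

    # Assumed data and design matrix A, as in the least squares sketch.
    X = np.array([[0.5, 1.0, 1.5, 2.0],
                  [1.2, 0.7, 2.3, 1.9]])
    y = np.array([1.1, 0.9, 2.4, 2.1])
    A = np.hstack([X.T, np.ones((X.shape[1], 1))])

    alpha = 1.0    # assumed prior precision of theta
    beta = 25.0    # assumed noise precision of the observations

    # Gaussian prior + Gaussian likelihood => Gaussian posterior:
    #   covariance S = (alpha*I + beta*A^T A)^(-1),  mean m = beta * S * A^T y
    S = np.linalg.inv(alpha * np.eye(A.shape[1]) + beta * A.T @ A)
    m = beta * S @ A.T @ y

    # The prediction at a new input is a random variable: its mean is a^T m and
    # its variance is 1/beta + a^T S a (noise plus parameter uncertainty).
    a_new = np.array([1.0, 1.5, 1.0])    # [x_1, x_2, bias] for a new input
    pred_mean = a_new @ m
    pred_var = 1.0 / beta + a_new @ S @ a_new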

Based on either of these two viewpoints, different regression algorithms can be used to find the parameters ${\bf\theta}$ for the model function $\hat{y}=f({\bf x},{\bf\theta})$ to fit the observed dataset ${\cal D}=\{ {\bf X},{\bf y}\}$ in some optimal way based on different criteria, as shown below.

We see that regression analysis can be treated as an optimization problem, in which either the sum of squares error is minimized or the likelihood function is maximized. Also, the optimization problem is in general over-determined, as there are typically many more observed data points in the training set than unknown parameters. Algorithms based on these different methods are considered in detail in later sections of this chapter.

We further note that regression analysis can be considered as binary classification, when the regression function $y=f({\bf x})$, as a hypersurface in the $(d+1)$-dimensional space spanned by $\{x_1,\cdots,x_d,\,y\}$, is thresholded by a constant $C$ (e.g., $C=0$). The resulting equation $f({\bf x})=C$ defines a hypersurface in the $d$-dimensional space spanned by $\{x_1,\cdots,x_d\}$, which partitions the space into two parts, such that all points on one side of the hypersurface satisfy $f({\bf x})>C$, while all points on the other side satisfy $f({\bf x})<C$. In other words, the regression function is a binary classifier that assigns every point ${\bf x}$ in the $d$-dimensional space to one of two classes $C_+$ and $C_-$, depending on whether $f({\bf x})$ is greater or smaller than $C$. Now each $y_n$ corresponding to ${\bf x}_n$ in the given dataset ${\cal D}=\{ {\bf X},{\bf y}\}$ can be treated as a label indicating that ${\bf x}_n$ belongs to class $C_+$ if $y_n>C$, or to class $C_-$ if $y_n<C$, and the regression problem becomes a binary classification problem, as illustrated in the figure below.

Figure: a linear regression function thresholded to serve as a binary classifier (LinearRegressionClassifier.png).
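The sketch below illustrates this idea with assumed two-dimensional data labeled by $y_n\in\{-1,+1\}$: a linear regression function is fitted by least squares and then thresholded at $C=0$ to classify new points.

    import numpy as np

    # Hypothetical 2-D samples with binary labels y_n in {-1, +1}.
    X = np.array([[0.2, 0.5, 0.9, 2.1, 2.5, 3.0],
                  [0.3, 0.8, 0.4, 2.2, 2.8, 2.4]])
    y = np.array([-1, -1, -1, +1, +1, +1])

    # Fit f(x) = theta_1*x_1 + theta_2*x_2 + bias by least squares.
    A = np.hstack([X.T, np.ones((X.shape[1], 1))])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)

    def classify(x, C=0.0):
        """Assign x to class C_+ if f(x) > C, to class C_- otherwise."""
        f = np.append(x, 1.0) @ theta
        return +1 if f > C else -1

    print(classify(np.array([0.4, 0.5])))   # near the first cluster: -1
    print(classify(np.array([2.6, 2.5])))   # near the second cluster: +1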