We revisit the linear regression problem, but from the viewpoint
of Bayesian inference. Now the parameter $\mathbf{w}\in\mathbb{R}^d$ in the linear
regression model $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ is
assumed to be a random vector, and, as a function of $\mathbf{w}$,
the regression function $f(\mathbf{x})$ is also random, and
so is the residual based on the training set
$\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\cdots,N\}$, i.e.,

$$\mathbf{r}=\mathbf{y}-\mathbf{X}^T\mathbf{w},\qquad\text{i.e.}\qquad r_n=y_n-\mathbf{w}^T\mathbf{x}_n,\quad n=1,\cdots,N \tag{157}$$

where $\mathbf{X}=[\mathbf{x}_1,\cdots,\mathbf{x}_N]$ and $\mathbf{y}=[y_1,\cdots,y_N]^T$.
We further assume $\mathbf{w}$ has a zero-mean normal prior pdf:

$$p(\mathbf{w})=\mathcal{N}(\mathbf{w}\,|\,\mathbf{0},\,\boldsymbol{\Sigma}_p) \tag{158}$$

based on the desire that the weights in $\mathbf{w}$ should take
values around zero, instead of large values on either the positive
or the negative side, to avoid potential overfitting. We also assume
the residual $\mathbf{r}$ has a zero-mean normal pdf with a
diagonal covariance matrix, i.e., all components of $\mathbf{r}$
are independent of each other:

$$p(\mathbf{r})=\mathcal{N}(\mathbf{r}\,|\,\mathbf{0},\,\sigma_N^2\mathbf{I}) \tag{159}$$
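To make the probabilistic model above concrete, here is a minimal Matlab sketch that samples a weight vector from the prior of Eq. (158) and generates noisy training outputs according to Eqs. (157) and (159). The dimensions d and N and the values of SigmaP and sigmaN are arbitrary illustrative choices, not values taken from the text.

d=2; N=20;                      % dimension of w and number of training points (illustrative)
sigmaN=0.3;                     % standard deviation of the residual r (illustrative)
SigmaP=eye(d);                  % prior covariance of w
X=2*rand(d,N)-1;                % training inputs as columns of a d-by-N matrix (arbitrary)
w=chol(SigmaP)'*randn(d,1);     % sample w from the zero-mean Gaussian prior N(0,SigmaP)
r=sigmaN*randn(N,1);            % sample the residual r from N(0,sigmaN^2 I)
y=X'*w+r;                       % noisy training outputs y = X'w + r, Eq. (157)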
As a linear transformation of $\mathbf{w}$, the regression function
$f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ also has a zero-mean Gaussian
distribution, and so does $y=f(\mathbf{x})+r$, as
the sum of two Gaussians.
To obtain the model parameter $\mathbf{w}$, we consider its
likelihood $p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})$ based on the
observed data $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$:

$$p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})=\mathcal{N}(\mathbf{y}\,|\,\mathbf{X}^T\mathbf{w},\,\sigma_N^2\mathbf{I})
\propto\exp\left(-\frac{1}{2\sigma_N^2}\,\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2\right) \tag{160}$$

This is the same Gaussian as $p(\mathbf{r})$ in Eq. (159),
but shifted by $\mathbf{X}^T\mathbf{w}$.
Now we can find the optimal model parameter $\hat{\mathbf{w}}$
by maximizing the likelihood $p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})$. This
method is called maximum likelihood estimation (MLE).
But as maximizing $p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})$ is equivalent to
minimizing the sum-of-squares error (SSE) $\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2$, the MLE method is
equivalent to the least squares method considered before.
Specifically, maximizing the likelihood above is equivalent to minimizing
the negative log likelihood:

$$-\ln p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})=\frac{1}{2\sigma_N^2}\,\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2+\frac{N}{2}\ln(2\pi\sigma_N^2) \tag{161}$$

which, up to an additive constant and a positive scaling factor, is the same as the squared error
defined in Eq. (114) to be minimized by
the LS method previously considered. We therefore see that the
least squares method and the maximum likelihood method are
equivalent to each other in this case.
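As a quick numerical check of this equivalence, the following hedged Matlab sketch minimizes the negative log likelihood of Eq. (161) with fminsearch and compares the result with the closed-form least squares solution; the data X, y and the values of d and sigmaN are assumed to come from a setup like the one sketched earlier.

nll=@(w) norm(y-X'*w)^2/(2*sigmaN^2);   % negative log likelihood, up to an additive constant
wML=fminsearch(nll,zeros(d,1));         % numerical maximum likelihood estimate
wLS=(X*X')\(X*y);                       % closed-form least squares solution
disp(norm(wML-wLS))                     % the two estimates should be nearly identical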
We can further find the posterior $p(\mathbf{w}\,|\,\mathbf{X},\mathbf{y})$ of the
model parameter $\mathbf{w}$ given the observed data $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$,
as the product of the prior $p(\mathbf{w})$
and the likelihood $p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})$:

$$p(\mathbf{w}\,|\,\mathbf{X},\mathbf{y})\propto p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})\,p(\mathbf{w})
\propto\exp\left(-\frac{1}{2\sigma_N^2}\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2\right)
\exp\left(-\frac{1}{2}\mathbf{w}^T\boldsymbol{\Sigma}_p^{-1}\mathbf{w}\right) \tag{162}$$

where all constant scaling factors (irrelevant to $\mathbf{w}$) are
dropped for simplicity. As both the prior and the likelihood are
Gaussian, this posterior, when normalized, is also Gaussian
in terms of the mean $\boldsymbol{\mu}_w$ and covariance $\boldsymbol{\Sigma}_w$:

$$p(\mathbf{w}\,|\,\mathbf{X},\mathbf{y})=\mathcal{N}(\mathbf{w}\,|\,\boldsymbol{\mu}_w,\,\boldsymbol{\Sigma}_w)
\propto\exp\left(-\frac{1}{2}(\mathbf{w}-\boldsymbol{\mu}_w)^T\boldsymbol{\Sigma}_w^{-1}(\mathbf{w}-\boldsymbol{\mu}_w)\right) \tag{163}$$

with the following covariance and mean:

$$\boldsymbol{\Sigma}_w=\left(\frac{\mathbf{X}\mathbf{X}^T}{\sigma_N^2}+\boldsymbol{\Sigma}_p^{-1}\right)^{-1},\qquad
\boldsymbol{\mu}_w=\boldsymbol{\Sigma}_w\,\frac{\mathbf{X}\mathbf{y}}{\sigma_N^2} \tag{164}$$
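To see where the covariance and mean in Eq. (164) come from, one can expand the exponent of Eq. (162) and complete the square in $\mathbf{w}$; the following is a sketch of that standard step.

$$\begin{aligned}
-\frac{1}{2\sigma_N^2}\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2-\frac{1}{2}\mathbf{w}^T\boldsymbol{\Sigma}_p^{-1}\mathbf{w}
&=-\frac{1}{2}\mathbf{w}^T\!\left(\frac{\mathbf{X}\mathbf{X}^T}{\sigma_N^2}+\boldsymbol{\Sigma}_p^{-1}\right)\!\mathbf{w}
+\frac{1}{\sigma_N^2}\mathbf{w}^T\mathbf{X}\mathbf{y}-\frac{1}{2\sigma_N^2}\mathbf{y}^T\mathbf{y}\\
&=-\frac{1}{2}(\mathbf{w}-\boldsymbol{\mu}_w)^T\boldsymbol{\Sigma}_w^{-1}(\mathbf{w}-\boldsymbol{\mu}_w)+\text{const}
\end{aligned}$$

with $\boldsymbol{\Sigma}_w^{-1}=\mathbf{X}\mathbf{X}^T/\sigma_N^2+\boldsymbol{\Sigma}_p^{-1}$ and $\boldsymbol{\mu}_w=\boldsymbol{\Sigma}_w\mathbf{X}\mathbf{y}/\sigma_N^2$, which are exactly the covariance and mean in Eq. (164).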
Now we can find the optimal model parameter
$\hat{\mathbf{w}}=\boldsymbol{\mu}_w$, equal to the mean of the Gaussian
posterior, where the posterior is maximized. This solution is called the
maximum a posteriori (MAP) estimate of the model parameter
$\mathbf{w}$.
We can now predict the function value of any unlabeled test
data point $\mathbf{x}_*$ based on the conditional probability
distribution $p(f_*\,|\,\mathbf{x}_*,\mathbf{X},\mathbf{y})$ of the regression
model $f_*=f(\mathbf{x}_*)=\mathbf{w}^T\mathbf{x}_*$, which,
as a linear transformation of the random parameters in $\mathbf{w}$,
is also Gaussian
(see properties of Gaussians):

$$p(f_*\,|\,\mathbf{x}_*,\mathbf{X},\mathbf{y})=\mathcal{N}(f_*\,|\,\mu_*,\,\sigma_*^2) \tag{165}$$

with the following mean and variance:

$$\mu_*=E[f_*]=\boldsymbol{\mu}_w^T\mathbf{x}_* \tag{166}$$

$$\sigma_*^2=\text{Var}[f_*]=\mathbf{x}_*^T\boldsymbol{\Sigma}_w\mathbf{x}_* \tag{167}$$
Furthermore, given a set of $M$ unlabeled test data points in
$\mathbf{X}_*=[\mathbf{x}_{*1},\cdots,\mathbf{x}_{*M}]$, we can find

$$\mathbf{f}_*=[f(\mathbf{x}_{*1}),\cdots,f(\mathbf{x}_{*M})]^T=\mathbf{X}_*^T\mathbf{w} \tag{168}$$

As a linear transformation of the Gaussian $\mathbf{w}$,
$\mathbf{f}_*$ is also Gaussian:

$$p(\mathbf{f}_*\,|\,\mathbf{X}_*,\mathbf{X},\mathbf{y})=\mathcal{N}(\mathbf{f}_*\,|\,\boldsymbol{\mu}_*,\,\boldsymbol{\Sigma}_*) \tag{169}$$

with the following mean and covariance:

$$\boldsymbol{\mu}_*=E[\mathbf{f}_*]=\mathbf{X}_*^T\boldsymbol{\mu}_w \tag{170}$$

$$\boldsymbol{\Sigma}_*=\text{Cov}[\mathbf{f}_*]=\mathbf{X}_*^T\boldsymbol{\Sigma}_w\mathbf{X}_* \tag{171}$$

The mean $\boldsymbol{\mu}_*$ above
can be considered as the estimated function values
at the test points in $\mathbf{X}_*$, while the covariance
$\boldsymbol{\Sigma}_*$ (in particular, its diagonal of variances)
indicates the confidence or certainty of the estimation.
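For instance, assuming the posterior mean muw and covariance Sigmaw of Eq. (164) have already been computed and Xstar holds the test points as columns, a minimal Matlab sketch of Eqs. (170) and (171) is:

mustar=Xstar'*muw;               % predictive mean at the M test points, Eq. (170)
Sigmastar=Xstar'*Sigmaw*Xstar;   % predictive covariance, Eq. (171)
sgm=sqrt(diag(Sigmastar));       % per-point standard deviations
errorbar(mustar,2*sgm)           % plot the estimates with +/- 2 standard deviation bars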
Such a result produced by the Bayesian method, in terms of the mean
and covariance of $\mathbf{f}_*$ as a random vector, is
quite different from the result produced by a frequentist method such
as linear least squares regression, in terms of an estimated value
of $\mathbf{f}_*$ as a deterministic variable. However, in the special
case of $\sigma_N^2\rightarrow 0$, i.e., the residual is no longer treated
as a random variable, Eq. (170) above becomes the same
as Eq. (117) produced by linear least squares regression; in
other words, the linear least squares method can be considered as a special
case of the Bayesian regression when the residual is deterministic.
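The following hedged Matlab sketch illustrates this limit numerically: with the training data X, y assumed to be defined as before, the Bayesian predictive mean approaches the least squares prediction as sigmaN is made small. The test points and prior covariance below are illustrative choices only.

Xstar=2*rand(d,5)-1;                     % a few illustrative test points
sigmaN=1e-4;                             % nearly deterministic residual
SigmaP=eye(d);                           % prior covariance of w
Sigmaw=inv(X*X'/sigmaN^2+inv(SigmaP));   % posterior covariance, Eq. (164)
muw=Sigmaw*X*y/sigmaN^2;                 % posterior mean, Eq. (164)
yBayes=Xstar'*muw;                       % Bayesian predictive mean, Eq. (170)
yLS=Xstar'*((X*X')\(X*y));               % least squares prediction (cf. Eq. (117))
disp(norm(yBayes-yLS))                   % nearly zero when sigmaN is small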
The method considered above is based on the linear regression model
$f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$, but it can also be generalized to
nonlinear functions, same as in the case of
generalized linear regression
discussed previously. To do so, the regression function is assumed
to be a linear combination of a set of nonlinear mapping functions
$\phi_k(\mathbf{x}),\;k=1,\cdots,K$:

$$f(\mathbf{x})=\sum_{k=1}^K w_k\,\phi_k(\mathbf{x})=\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \tag{172}$$

where

$$\boldsymbol{\phi}(\mathbf{x})=[\phi_1(\mathbf{x}),\cdots,\phi_K(\mathbf{x})]^T,\qquad
\mathbf{w}=[w_1,\cdots,w_K]^T \tag{173}$$

For example, some commonly used mapping functions are listed in Eq.
(155).
Applying this regression model to each of the observed data
points in $\mathcal{D}$, we
get $y_n=\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)+r_n$ with residual
$r_n$, which can be written in matrix form:

$$\mathbf{y}=\boldsymbol{\Phi}^T\mathbf{w}+\mathbf{r} \tag{174}$$

where, as previously defined in Eq. (152),

$$\boldsymbol{\Phi}=[\boldsymbol{\phi}(\mathbf{x}_1),\cdots,\boldsymbol{\phi}(\mathbf{x}_N)] \tag{175}$$

As this nonlinear model $\mathbf{y}=\boldsymbol{\Phi}^T\mathbf{w}+\mathbf{r}$ is in the
same form as the linear model $\mathbf{y}=\mathbf{X}^T\mathbf{w}+\mathbf{r}$, all
previous discussion for the linear Bayesian regression is still
valid for this nonlinear model, if $\mathbf{X}$ is replaced
by $\boldsymbol{\Phi}$.
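As an illustration, the following Matlab sketch builds the matrix $\boldsymbol{\Phi}$ for a scalar input using a simple polynomial basis $\boldsymbol{\phi}(x)=[1,\,x,\,x^2]^T$ and then reuses the linear formulas with X replaced by Phi. This particular basis, the synthetic target, and the noise level are stand-ins for illustration; the actual mapping functions of Eq. (155) may differ.

x=linspace(-1,1,20);                        % scalar training inputs (illustrative)
y=1+0.5*x'-2*x'.^2+0.1*randn(20,1);         % synthetic outputs from a hypothetical quadratic target
phi=@(x) [ones(size(x)); x; x.^2];          % hypothetical polynomial basis with K=3 mapping functions
Phi=phi(x);                                 % K-by-N design matrix of Eq. (175)
sigmaN=0.1; SigmaQ=eye(3);                  % noise level and prior covariance in the feature space
Sigmaw=inv(Phi*Phi'/sigmaN^2+inv(SigmaQ));  % posterior covariance, with X replaced by Phi
muw=Sigmaw*Phi*y/sigmaN^2;                  % posterior mean of w in the feature space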
We can get the covariance matrix and mean vector of the posterior
$p(\mathbf{w}\,|\,\boldsymbol{\Phi},\mathbf{y})$ in the K-dimensional feature space based on the
observed data $\mathcal{D}$:

$$\boldsymbol{\Sigma}_w=\left(\frac{\boldsymbol{\Phi}\boldsymbol{\Phi}^T}{\sigma_N^2}+\boldsymbol{\Sigma}_q^{-1}\right)^{-1},\qquad
\boldsymbol{\mu}_w=\boldsymbol{\Sigma}_w\,\frac{\boldsymbol{\Phi}\mathbf{y}}{\sigma_N^2} \tag{176}$$

where $\boldsymbol{\Sigma}_q$ is the prior covariance of $\mathbf{w}$ in the feature space.
The function values $\mathbf{f}_*$ at the multiple
test points in $\mathbf{X}_*$ are
a linear transformation of $\mathbf{w}$:

$$\mathbf{f}_*=\boldsymbol{\Phi}_*^T\mathbf{w},\qquad
\boldsymbol{\Phi}_*=[\boldsymbol{\phi}(\mathbf{x}_{*1}),\cdots,\boldsymbol{\phi}(\mathbf{x}_{*M})] \tag{177}$$

and it also has a Gaussian joint distribution:

$$p(\mathbf{f}_*\,|\,\mathbf{X}_*,\mathbf{X},\mathbf{y})=\mathcal{N}(\mathbf{f}_*\,|\,\boldsymbol{\mu}_*,\,\boldsymbol{\Sigma}_*) \tag{178}$$

with the mean and covariance:

$$\boldsymbol{\mu}_*=\boldsymbol{\Phi}_*^T\boldsymbol{\mu}_w=\frac{1}{\sigma_N^2}\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_w\boldsymbol{\Phi}\mathbf{y},\qquad
\boldsymbol{\Sigma}_*=\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_w\boldsymbol{\Phi}_* \tag{179}$$

They can be reformulated to become:

$$\boldsymbol{\mu}_*=\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}\left(\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}+\sigma_N^2\mathbf{I}\right)^{-1}\mathbf{y} \tag{180}$$

and

$$\boldsymbol{\Sigma}_*=\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_*-\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}\left(\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}+\sigma_N^2\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_* \tag{181}$$

where we have used Theorem 1 in the Appendices.
We note that in these expressions the mapping functions
only show up in the quadratic forms
$\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}$,
$\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_*$,
or $\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_*$. This suggests that we
may consider using the
kernel method, by
which some significant advantage may be gained (as in the
support vector machines).
Specifically, if we replace the mn-th component
$\boldsymbol{\phi}^T(\mathbf{x}_m)\boldsymbol{\Sigma}_q\boldsymbol{\phi}(\mathbf{x}_n)$ of
$\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}$ by a kernel function
expressed only in terms of $\mathbf{x}_m$ and $\mathbf{x}_n$:

$$k(\mathbf{x}_m,\mathbf{x}_n)=\boldsymbol{\phi}^T(\mathbf{x}_m)\,\boldsymbol{\Sigma}_q\,\boldsymbol{\phi}(\mathbf{x}_n) \tag{182}$$

then the mapping function $\boldsymbol{\phi}(\mathbf{x})$ no longer needs to
be explicitly specified. This idea is called the kernel trick.
Now we have

$$\mathbf{K}=\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi} \tag{183}$$

and similarly we also have $\mathbf{K}_*=\boldsymbol{\Phi}^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_*$
and $\mathbf{K}_{**}=\boldsymbol{\Phi}_*^T\boldsymbol{\Sigma}_q\boldsymbol{\Phi}_*$. The kernel
function is symmetric, the same as all covariance matrices. Now the mean
and covariance of $\mathbf{f}_*$ given above can be expressed as:

$$\boldsymbol{\mu}_*=\mathbf{K}_*^T\left(\mathbf{K}+\sigma_N^2\mathbf{I}\right)^{-1}\mathbf{y},\qquad
\boldsymbol{\Sigma}_*=\mathbf{K}_{**}-\mathbf{K}_*^T\left(\mathbf{K}+\sigma_N^2\mathbf{I}\right)^{-1}\mathbf{K}_* \tag{184}$$
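As a concrete illustration of Eq. (184), the following Matlab sketch carries out the kernelized prediction with a Gaussian (RBF) kernel; this kernel choice is only an example, and the variables X, y, Xstar and sigmaN are assumed to have been set up as in the earlier sketches rather than taken from the text.

kfun=@(a,b) exp(-0.5*sum((a-b).^2));   % hypothetical Gaussian (RBF) kernel, an example choice
N=size(X,2); M=size(Xstar,2);
K=zeros(N,N); Ks=zeros(N,M); Kss=zeros(M,M);
for m=1:N, for n=1:N, K(m,n)=kfun(X(:,m),X(:,n)); end, end
for m=1:N, for n=1:M, Ks(m,n)=kfun(X(:,m),Xstar(:,n)); end, end
for m=1:M, for n=1:M, Kss(m,n)=kfun(Xstar(:,m),Xstar(:,n)); end, end
A=K+sigmaN^2*eye(N);                   % K + sigmaN^2 I in Eq. (184)
mustar=Ks'*(A\y);                      % predictive mean of Eq. (184)
Sigmastar=Kss-Ks'*(A\Ks);              % predictive covariance of Eq. (184)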
As a general summary of the method discussed above, we note that it is
essentially different from the frequentist approach, in the sense that
the estimated result produced is no longer a set of specific values of some
unknown variables in question (function values, parameters of regression
models, etc.), but the probability distribution of these variables in terms
of the expectation and covariance. Consequently, we can not only treat the
expectation as the estimated values, but also use the variance as the
confidence or certainty of such an estimation.
Below is a segment of Matlab code for estimating the mean and covariance
of the weight vector $\mathbf{w}$ by Bayesian linear regression. Here X
and y
are for $\mathbf{X}$ and $\mathbf{y}$ of the training data $\mathcal{D}$,
and Xstar, Phi0
and Phistar
are for $\mathbf{X}_*$, $\boldsymbol{\Phi}$
and $\boldsymbol{\Phi}_*$, respectively.
SigmaP=eye(d);                                  % prior covariance of w in the original space
SigmaQ=eye(D);                                  % prior covariance of w in the feature space
Sigmaw1=inv(X*X'/sigmaN^2+inv(SigmaP));         % posterior covariance of w by Bayes regression, Eq. (164)
Sigmaw2=inv(Phi0*Phi0'/sigmaN^2+inv(SigmaQ));   % posterior covariance of w with basis functions, Eq. (176)
w0=inv(X*X')*X*y                                % weight vector w by LS method (pseudo-inverse)
w1=Sigmaw1*X*y/sigmaN^2                         % posterior mean of w by Bayes regression, Eq. (164)
w2=Sigmaw2*Phi0*y/sigmaN^2;                     % posterior mean of w by Bayes regression with basis functions
y0=w0'*Xstar;                                   % y by least squares method
y1=w1'*Xstar;                                   % y by Bayes regression, Eq. (170)
y2=w2'*Phistar;                                 % y by Bayes regression with basis functions
for i=1:M
    sgm1(i)=sqrt(Xstar(:,i)'*Sigmaw1*Xstar(:,i));      % standard deviation of estimated y, Eq. (171)
    sgm2(i)=sqrt(Phistar(:,i)'*Sigmaw2*Phistar(:,i));  % same, with basis functions
end
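The code above assumes that the data matrices and the scalars d, D, M and sigmaN have already been set up. A hedged example of such a setup, using a simple quadratic basis as a stand-in for the mapping functions of Eq. (155) and hypothetical true weights, might look like:

d=2; M=50; sigmaN=0.2;                           % dimensions and noise level (illustrative values)
N=30; X=[ones(1,N); 2*rand(1,N)-1];              % training inputs, first row of ones for the intercept
w_true=[1; -2];                                  % hypothetical true weights, for illustration only
y=X'*w_true+sigmaN*randn(N,1);                   % synthetic noisy training outputs
Xstar=[ones(1,M); linspace(-1,1,M)];             % test inputs
phi=@(X) [X; X(2,:).^2];                         % hypothetical basis: augment with a squared term
Phi0=phi(X); Phistar=phi(Xstar); D=size(Phi0,1); % feature-space design matrices and dimension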
Example

Consider a 2-D linear function

$$y=f(x)+r=w_0+w_1 x+r=\mathbf{w}^T\mathbf{x}+r,\qquad
\mathbf{x}=[1,\;x]^T,\quad\mathbf{w}=[w_0,\;w_1]^T \tag{185}$$

where values are assumed for the weights $w_0$ and $w_1$, i.e., the
function is actually a straight line
with intercept $w_0$ and slope $w_1$. The additive noise
(residual) $r$ has the zero-mean normal distribution of Eq. (159). The prior distribution
of the weights is assumed to be the zero-mean Gaussian of Eq. (158), and the posterior
$p(\mathbf{w}\,|\,\mathcal{D})$ can be obtained based on the observed data $\mathcal{D}$,
both shown in the figure below. We see that the posterior is concentrated
around $\hat{w}_0$ and $\hat{w}_1$, the estimates of the true
weights $w_0$ and $w_1$, based on the observed noisy data.
Various regression methods, including both the least squares method and the
Bayesian linear regression method, in the original space as well as in the feature
space spanned by a set of basis functions based on each of
the three functions given above, are carried out with their results shown
in the figure below:

The LS errors for these methods are listed below:

(186)

We see that the linear methods (the pseudo-inverse and the Bayesian
regression) produce similar errors in the estimated weights, but a smaller error
can be achieved by the same Bayesian linear regression method if a
proper nonlinear mapping is used (e.g., the last two mapping functions).