Gaussian process regression (GPR) is yet another regression method that fits a regression function to the data samples in a given training set. Different from all previously considered algorithms, which treat the regression function as a deterministic function with an explicitly specified form, GPR treats the regression function $f(\mathbf{x})$ as a stochastic process called a Gaussian process (GP), i.e., the function value $f(\mathbf{x})$ at any point $\mathbf{x}$ is assumed to be a random variable with a Gaussian distribution. The joint probability distribution of the vector $\mathbf{f}=[f(\mathbf{x}_1),\dots,f(\mathbf{x}_N)]^T$ of function values at all $N$ i.i.d. samples in the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$ is also Gaussian:

\[ p(\mathbf{f}\,|\,\mathbf{X})=\mathcal{N}(\mathbf{f}\,|\,\mathbf{0},\,\boldsymbol{\Sigma}_f) \tag{251} \]
We also assume that the labeling $y_n$ of $\mathbf{x}_n$ in the training set is contaminated by noise $e_n$, i.e., $y_n=f(\mathbf{x}_n)+e_n$, where the noise has a zero-mean normal pdf:

\[ p(e)=\mathcal{N}(e\,|\,0,\,\sigma_n^2) \tag{252} \]
The unknown covariance matrix $\boldsymbol{\Sigma}_f$ can be constructed based on the desired smoothness of the regression function $f(\mathbf{x})$, in the sense that any two function values $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$ are likely to be more correlated if the distance $d=\|\mathbf{x}_i-\mathbf{x}_j\|$ between the two corresponding points is small, but less so if the distance is large. Such a covariance can be modeled by a squared exponential (SE) kernel function:

\[ k(\mathbf{x}_i,\mathbf{x}_j)=\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{a^2}\right) \tag{253} \]
The smoothness of the function $f(\mathbf{x})$ plays a very important role, as it determines whether the function overfits or underfits the sample points in the training data. If the function is not smooth enough, it may overfit, as it may be overly affected by some noisy samples; on the other hand, if the function is too smooth, it may underfit, as it may not follow the training samples closely enough. The smoothness of the regression function can be controlled by the hyperparameter $a$ in Eq. (253).
Consider two extreme cases. If $d\ll a$, i.e., the value of the SE kernel approaches 1, then $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$ are highly correlated and $f(\mathbf{x})$ is smooth; on the other hand, if $d\gg a$, i.e., the SE kernel approaches 0, then $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$ are not correlated and $f(\mathbf{x})$ is not smooth. Therefore, by adjusting the value of $a$, a proper tradeoff between overfitting and underfitting can be made.
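As a quick illustration, the SE kernel of Eq. (253) can be evaluated for a few distances to see how the correlation falls off relative to $a$; the values of $a$ and $d$ below are arbitrary choices for illustration only:

a = 1;                    % smoothness parameter (arbitrary illustrative value)
d = [0.1 1 3];            % distances between pairs of points (arbitrary)
k = exp(-(d/a).^2)        % SE kernel values: close to 1 for d << a, close to 0 for d >> a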
Given the covariance $k(\mathbf{x}_i,\mathbf{x}_j)$ of the function values at two different sample points $\mathbf{x}_i$ and $\mathbf{x}_j$, we can represent the covariance matrix of the prior $p(\mathbf{f}\,|\,\mathbf{X})$ as

\[ \boldsymbol{\Sigma}_f=\mathbf{K}(\mathbf{X},\mathbf{X})=\big[k(\mathbf{x}_i,\mathbf{x}_j)\big]_{N\times N} \tag{254} \]
Based on the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$, we can then find the mean and covariance of the function values $\mathbf{f}_*=[f(\mathbf{x}_{*1}),\dots,f(\mathbf{x}_{*M})]^T$ at a test dataset $\mathbf{X}_*$ containing $M$ unlabeled test samples. As both $\mathbf{f}$ and $\mathbf{f}_*$ come from the same zero-mean Gaussian process, their joint distribution (given $\mathbf{X}$ as well as $\mathbf{X}_*$) is also a zero-mean Gaussian:

\[ p\left(\begin{bmatrix}\mathbf{f}\\ \mathbf{f}_*\end{bmatrix}\right)=\mathcal{N}\left(\begin{bmatrix}\mathbf{f}\\ \mathbf{f}_*\end{bmatrix}\,\bigg|\,\mathbf{0},\;\begin{bmatrix}\mathbf{K}(\mathbf{X},\mathbf{X}) & \mathbf{K}(\mathbf{X},\mathbf{X}_*)\\ \mathbf{K}(\mathbf{X}_*,\mathbf{X}) & \mathbf{K}(\mathbf{X}_*,\mathbf{X}_*)\end{bmatrix}\right) \tag{256} \]
Having found the joint pdf of both $\mathbf{f}$ and $\mathbf{f}_*$ in Eq. (256), we can further find the conditional distribution of $\mathbf{f}_*$ given $\mathbf{f}$ (as well as $\mathbf{X}$ and $\mathbf{X}_*$), based on the properties of the Gaussian distribution discussed in the Appendices:

\[ p(\mathbf{f}_*\,|\,\mathbf{f},\mathbf{X},\mathbf{X}_*)=\mathcal{N}(\mathbf{f}_*\,|\,\boldsymbol{\mu}_{*|f},\,\boldsymbol{\Sigma}_{*|f}) \tag{257} \]

\[ \boldsymbol{\mu}_{*|f}=\mathbf{K}(\mathbf{X},\mathbf{X}_*)^T\,\mathbf{K}(\mathbf{X},\mathbf{X})^{-1}\,\mathbf{f},\qquad \boldsymbol{\Sigma}_{*|f}=\mathbf{K}(\mathbf{X}_*,\mathbf{X}_*)-\mathbf{K}(\mathbf{X},\mathbf{X}_*)^T\,\mathbf{K}(\mathbf{X},\mathbf{X})^{-1}\,\mathbf{K}(\mathbf{X},\mathbf{X}_*) \tag{258} \]
Similarly, we can also find the joint distribution of $\mathbf{y}$ and $\mathbf{f}_*$, both of which are zero-mean Gaussian. As $\mathbf{y}=\mathbf{f}+\mathbf{e}$ with the noise independent of the function values, the covariance of $\mathbf{y}$ and its cross-covariance with $\mathbf{f}_*$ are

\[ \text{Cov}(\mathbf{y})=\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I} \tag{259} \]

\[ \text{Cov}(\mathbf{y},\mathbf{f}_*)=\mathbf{K}(\mathbf{X},\mathbf{X}_*) \tag{260} \]

so that

\[ p\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix}\right)=\mathcal{N}\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix}\,\bigg|\,\mathbf{0},\;\begin{bmatrix}\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I} & \mathbf{K}(\mathbf{X},\mathbf{X}_*)\\ \mathbf{K}(\mathbf{X}_*,\mathbf{X}) & \mathbf{K}(\mathbf{X}_*,\mathbf{X}_*)\end{bmatrix}\right) \tag{261} \]

Conditioning on the observed labels $\mathbf{y}$ then gives the Gaussian posterior

\[ p(\mathbf{f}_*\,|\,\mathbf{y},\mathbf{X},\mathbf{X}_*)=\mathcal{N}(\mathbf{f}_*\,|\,\boldsymbol{\mu}_*,\,\boldsymbol{\Sigma}_*) \tag{262} \]

with mean and covariance

\[ \boldsymbol{\mu}_*=\mathbf{K}(\mathbf{X},\mathbf{X}_*)^T\left[\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I}\right]^{-1}\mathbf{y},\qquad \boldsymbol{\Sigma}_*=\mathbf{K}(\mathbf{X}_*,\mathbf{X}_*)-\mathbf{K}(\mathbf{X},\mathbf{X}_*)^T\left[\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}(\mathbf{X},\mathbf{X}_*) \tag{263} \]
In the special noise-free case of $\mathbf{e}=\mathbf{0}$ (with zero mean and zero covariance, $\sigma_n^2=0$), we have $\mathbf{y}=\mathbf{f}$, the mean and covariance in Eq. (263) become the same as those in Eq. (258), and Eqs. (262) and (257) become the same:

\[ p(\mathbf{f}_*\,|\,\mathbf{y},\mathbf{X},\mathbf{X}_*)=p(\mathbf{f}_*\,|\,\mathbf{f},\mathbf{X},\mathbf{X}_*)=\mathcal{N}(\mathbf{f}_*\,|\,\boldsymbol{\mu}_{*|f},\,\boldsymbol{\Sigma}_{*|f}) \tag{264} \]
The Matlab code below for the Gaussian process regression algorithm takes the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$, the test samples in $\mathbf{X}_*$, and the standard deviation $\sigma_n$ of the noise as its input, and generates as its output the posterior $p(\mathbf{f}_*\,|\,\mathbf{y},\mathbf{X},\mathbf{X}_*)$, in terms of its mean, used as the GPR regression function, and its covariance matrix, of which the variances on the diagonal represent the confidence or certainty of the regression.
function [fmean, fcov] = GPR(X, y, Xstar, sigmaN)
    % given training set D={X,y} (samples as columns of X), test samples Xstar,
    % and noise standard deviation sigmaN, find the posterior mean fmean
    % (the regression function) and covariance fcov at the test samples
    Sy  = Kernel(X,X) + sigmaN^2*eye(size(X,2));  % covariance of the noisy training labels y
    Syf = Kernel(X,Xstar);                        % cross-covariance of training and test samples
    Sf  = Kernel(Xstar,Xstar);                    % covariance of test samples
    fmean = Syf'*inv(Sy)*y;                       % mean of posterior, used as the regression function
    fcov  = Sf - Syf'*inv(Sy)*Syf;                % covariance of posterior
end
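For instance, the noise-free case of Eq. (264) corresponds to calling this function with a zero noise level (assuming a training set X, y and test samples Xstar are already defined, and that Kernel(X,X) remains invertible):

[fmean, fcov] = GPR(X, y, Xstar, 0);   % noise-free case: sigmaN = 0, so Sy reduces to Kernel(X,X)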
The following code is for a function that generates the squared exponential kernel matrix based on two sets of data samples in X1 and X2:
function K = Kernel(X1, X2)
    a = 1;                        % parameter that controls smoothness of f(x)
    m = size(X1,2);               % number of samples (columns) in X1
    n = size(X2,2);               % number of samples (columns) in X2
    K = zeros(m,n);
    for i = 1:m
        for j = 1:n
            d = norm(X1(:,i) - X2(:,j));   % distance between the two samples
            K(i,j) = exp(-(d/a)^2);        % squared exponential kernel value, Eq. (253)
        end
    end
end
Example:
A sinusoidal function contaminated by zero-mean Gaussian noise is sampled at random positions over the duration of the function. These samples form the training dataset $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$. The method of GPR is then used to estimate the mean and variance of $f(\mathbf{x}_*)$ at 91 test points $\mathbf{x}_*$, evenly distributed over the duration of the function.
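A minimal Matlab sketch of such an experiment, using the GPR and Kernel functions above, might look as follows; the noise level, number of training samples, and interval are assumed values chosen for illustration, not necessarily those used to produce the figures:

% sketch of the sinusoidal GPR experiment (all settings below are assumed values)
sigmaN = 0.1;                              % assumed noise standard deviation
N      = 20;                               % assumed number of training samples
X      = 2*pi*rand(1,N);                   % random sample positions over an assumed interval [0, 2*pi]
y      = sin(X)' + sigmaN*randn(N,1);      % noisy labels of the sinusoid
Xstar  = linspace(0, 2*pi, 91);            % 91 evenly spaced test points
[fmean, fcov] = GPR(X, y, Xstar, sigmaN);  % posterior mean and covariance at the test points
fvar   = diag(fcov);                       % variance of the estimate at each test point
plot(Xstar, fmean, 'r', Xstar, fmean+fvar, 'b--', Xstar, fmean-fvar, 'b--', X, y, 'k.');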
The 8 panels of the figure below show the estimated mean $\boldsymbol{\mu}_*$ and, both above and below it, the variance, forming a region of uncertainty of the estimates. The wider the region, the more uncertain the results are (the less the confidence).
The top left panel represents the prior probability of the estimated model output $f(\mathbf{x}_*)$ at each point $\mathbf{x}_*$, with no knowledge from the training set (no training samples used yet), which is assumed to have zero mean and unit variance. The subsequent seven panels show the mean and variance of the posterior based on an increasing number of training samples in the training set $\mathcal{D}$. Note that wherever a training sample $(\mathbf{x}_n, y_n)$ becomes available, the estimated mean is modified to be close to the true function value, and the variance of the posterior around $\mathbf{x}_n$ is reduced, indicating that the uncertainty of the estimate is reduced, or the confidence is much increased. When all training samples are used, the general shape of the original function is reasonably reconstructed.
The 8 panels in the figure below show 12 sample curves drawn from each posterior $p(\mathbf{f}_*\,|\,\mathbf{y},\mathbf{X},\mathbf{X}_*)$. They can be treated as different interpolations of the observed samples in the training set $\mathcal{D}$, and any of them can be used to predict the output at any input point $\mathbf{x}_*$. The mean of these sample curves, shown as the red curve, is simply the same as the mean $\boldsymbol{\mu}_*$ shown in the previous figure.
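Such sample curves can be drawn from the Gaussian posterior by factoring its covariance matrix; a minimal sketch, continuing from the experiment script above, is given below (the small jitter term is an assumption added only for numerical stability of the Cholesky factorization):

% draw sample curves from the posterior N(fmean, fcov) returned by GPR above
nCurves = 12;
jitter  = 1e-6;                                        % assumed small value for numerical stability
L = chol(fcov + jitter*eye(size(fcov,1)), 'lower');    % Cholesky factor of the covariance
fsamples = repmat(fmean, 1, nCurves) + L*randn(length(fmean), nCurves);  % one sample curve per column
plot(Xstar, fsamples);                                 % plot the 12 sample curves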
We see that wherever a training sample is available at $\mathbf{x}_n$, the variance and thereby the width of the region of uncertainty is reduced, and correspondingly the sample curves in the neighborhood of $\mathbf{x}_n$ are more tightly distributed. Specifically, we see that the sample curves drawn from the prior in the top left panel are very loosely distributed, while those drawn from the posterior based on all training points in the bottom right panel are relatively tightly distributed. In particular, note that due to the lack of training data in one interval of the input, the region of uncertainty there is significantly wider than elsewhere, and the sample curves are much more loosely located.
The parameter $a$ of the kernel function plays an important role in the GPR results, as shown in the three panels in the figure below, corresponding respectively to $a$ values of 0.5, 1.5, and 3.0. When $a$ is too small (top panel), the model may overfit the data, as the regression function fits the training samples well but is not smooth, and it does not interpolate the regions between the training samples well (with large variance); on the other hand, when $a$ is too large (bottom panel), the model may underfit the data, as the regression curve is smooth but does not fit the training samples well. We therefore see that the value of $a$ needs to be chosen carefully for a proper tradeoff between overfitting and underfitting (middle panel).
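Since the smoothness parameter is hardcoded inside the Kernel function above, one simple way to run such a comparison is a variation of that function that takes $a$ as an argument; this is only a sketch of one possible modification, not the code used for the figures:

% variation of the Kernel function above, with the smoothness parameter a passed in as an argument
function K = KernelA(X1, X2, a)
    m = size(X1,2);               % number of samples (columns) in X1
    n = size(X2,2);               % number of samples (columns) in X2
    K = zeros(m,n);
    for i = 1:m
        for j = 1:n
            d = norm(X1(:,i) - X2(:,j));   % distance between the two samples
            K(i,j) = exp(-(d/a)^2);        % squared exponential kernel with smoothness a
        end
    end
end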