The goal of nonlinear least squares (LS) regression is to model the nonlinear relationship between the dependent variable $y$ and the independent variable $x$, based on a given training dataset $D=\{(x_n,\,y_n),\;n=1,\cdots,N\}$, by a nonlinear regression function $f(x,\mathbf{a})$ parameterized by a set of $M$ parameters $\mathbf{a}=[a_1,\cdots,a_M]^T$:

$$y=f(x,\mathbf{a}) \tag{187}$$

Evaluating the model at the $N$ given data points, we get an over-determined system of $N$ nonlinear equations in the $M<N$ unknown parameters:

$$\mathbf{F}(\mathbf{a})=[f(x_1,\mathbf{a}),\cdots,f(x_N,\mathbf{a})]^T=[y_1,\cdots,y_N]^T=\mathbf{y} \tag{188}$$

In general this system cannot be satisfied exactly, and the parameters are instead found by minimizing the sum-of-squares error

$$\varepsilon(\mathbf{a})=\frac{1}{2}\sum_{n=1}^N\left[y_n-f(x_n,\mathbf{a})\right]^2=\frac{1}{2}\|\mathbf{y}-\mathbf{F}(\mathbf{a})\|^2=\frac{1}{2}\|\mathbf{r}\|^2 \tag{189}$$

where $\mathbf{r}=\mathbf{y}-\mathbf{F}(\mathbf{a})$ is the residual vector. The gradient of this error function with respect to the parameters is

$$\mathbf{g}(\mathbf{a})=\nabla_{\mathbf{a}}\,\varepsilon(\mathbf{a})=-\sum_{n=1}^N\left[y_n-f(x_n,\mathbf{a})\right]\nabla_{\mathbf{a}}f(x_n,\mathbf{a}) \tag{190}$$

where $\nabla_{\mathbf{a}}f=[\partial f/\partial a_1,\cdots,\partial f/\partial a_M]^T$. In terms of the $N\times M$ Jacobian of $\mathbf{F}(\mathbf{a})$,

$$\mathbf{J}(\mathbf{a})=\frac{d\,\mathbf{F}(\mathbf{a})}{d\,\mathbf{a}},\qquad J_{nm}=\frac{\partial f(x_n,\mathbf{a})}{\partial a_m} \tag{191}$$

the gradient can also be written in matrix form:

$$\mathbf{g}(\mathbf{a})=-\mathbf{J}^T(\mathbf{a})\,\mathbf{r} \tag{192}$$
The optimal parameters $\mathbf{a}$ that minimize $\varepsilon(\mathbf{a})$ can now be found iteratively by the gradient descent method:

$$\mathbf{a}_{k+1}=\mathbf{a}_k-\eta\,\nabla_{\mathbf{a}}\varepsilon(\mathbf{a}_k)=\mathbf{a}_k+\eta\sum_{n=1}^N\left[y_n-f(x_n,\mathbf{a}_k)\right]\nabla_{\mathbf{a}}f(x_n,\mathbf{a}_k) \tag{193}$$

where $k$ is the iteration index and $\eta>0$ is the step size (learning rate).
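As a rough illustration (a sketch not in the original text, assuming the exponential model $f(x,\mathbf{a})=a_1e^{a_2x}+a_3$ used in the Matlab code later in this section, with the observed data stored in column vectors x and y, and with an arbitrarily chosen step size and iteration count), the update in Eq. (193) could be implemented as follows:

f  = @(x,a) a(1)*exp(a(2)*x)+a(3);                   % assumed regression function
gf = @(x,a) [exp(a(2)*x); a(1)*x*exp(a(2)*x); 1];    % gradient of f with respect to a1, a2, a3
eta = 1e-3;                                          % step size (learning rate), chosen arbitrarily
a = ones(3,1);                                       % initial guess of the parameters
for k = 1:10000                                      % gradient descent iterations
    g = zeros(3,1);
    for n = 1:length(x)                              % sum over the entire batch of N data points
        g = g-(y(n)-f(x(n),a))*gf(x(n),a);           % gradient of the error as in Eq. (190)
    end
    a = a-eta*g;                                     % gradient descent update of Eq. (193)
end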
This gradient method is also called batch gradient descent (BGD), as the gradient in Eq. (190) is calculated based on the entire batch of all $N$ data points in the dataset $D$ in each iteration. When the size of the dataset is large, BGD may be slow due to the high computational complexity. In this case the method of stochastic gradient descent (SGD) can be used, which approximates the gradient based on only one of the $N$ data points uniformly randomly chosen from the dataset in each iteration, i.e., the summation in Eq. (190) contains only one term, corresponding to the chosen sample $(x_n,\,y_n)$:

$$\nabla_{\mathbf{a}}\varepsilon(\mathbf{a})\approx-\left[y_n-f(x_n,\mathbf{a})\right]\nabla_{\mathbf{a}}f(x_n,\mathbf{a}) \tag{194}$$

$$\mathbf{a}_{k+1}=\mathbf{a}_k+\eta\left[y_n-f(x_n,\mathbf{a}_k)\right]\nabla_{\mathbf{a}}f(x_n,\mathbf{a}_k) \tag{195}$$
As the data are typically noisy, the SGD gradient based on only one data point may be in a direction very different from the true gradient based on all data points, so the resulting search path in the $M$-dimensional parameter space may follow a jagged zig-zag shape and the iteration may converge slowly. However, as the computational complexity of each iteration is much reduced, we can afford to carry out a large number of iterations, with an overall search path generally coinciding with that of the BGD method. A trade-off between BGD and SGD can be made by estimating the gradient based on more than one but fewer than all of the data points in the dataset; the resulting method, called mini-batch gradient descent, combines the advantages of both. The methods of stochastic and mini-batch gradient descent are widely used in machine learning, for example in artificial neural networks, which can be treated as a nonlinear regression problem, as considered in a later chapter.
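As an illustration (again only a sketch, reusing the hypothetical f and gf defined in the sketch above, with the mini-batch size, step size, and iteration count chosen arbitrarily), the mini-batch variant could be written as follows; setting b=1 gives SGD, while b=N recovers BGD:

N = length(x);                         % number of data points in the dataset
b = 16;                                % mini-batch size (b=1 gives SGD, b=N gives BGD)
eta = 1e-3;                            % step size
a = ones(3,1);                         % initial guess of the parameters
for k = 1:20000
    idx = randi(N,1,b);                % b samples drawn uniformly at random from the dataset
    g = zeros(3,1);
    for n = idx                        % the summation contains only the chosen samples
        g = g-(y(n)-f(x(n),a))*gf(x(n),a);
    end
    a = a-eta*g/b;                     % update based on the averaged mini-batch gradient
end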
Alternatively, the optimal parameters $\mathbf{a}$ of the nonlinear regression function $f(x,\mathbf{a})$ can also be found by the Gauss-Newton method, which directly solves the nonlinear over-determined equation system in Eq. (188), based on the iterative Newton-Raphson method discussed in Chapter 2:

$$\mathbf{a}_{k+1}=\mathbf{a}_k+\mathbf{J}^-(\mathbf{a}_k)\left[\mathbf{y}-\mathbf{F}(\mathbf{a}_k)\right]=\mathbf{a}_k+\left(\mathbf{J}^T\mathbf{J}\right)^{-1}\mathbf{J}^T\left[\mathbf{y}-\mathbf{F}(\mathbf{a}_k)\right] \tag{197}$$

where $\mathbf{J}^-=\left(\mathbf{J}^T\mathbf{J}\right)^{-1}\mathbf{J}^T$ is the pseudo-inverse of the Jacobian $\mathbf{J}=\mathbf{J}(\mathbf{a}_k)$ in Eq. (191).
In the special case of a regression function that is linear in its parameters,

$$f(x,\mathbf{a})=\sum_{m=1}^M a_m\,\phi_m(x) \tag{198}$$

we have

$$\mathbf{F}(\mathbf{a})=\mathbf{J}\mathbf{a},\qquad J_{nm}=\frac{\partial f(x_n,\mathbf{a})}{\partial a_m}=\phi_m(x_n) \tag{199}$$

i.e., the Jacobian no longer depends on $\mathbf{a}$, and the error function becomes

$$\varepsilon(\mathbf{a})=\frac{1}{2}\|\mathbf{y}-\mathbf{J}\mathbf{a}\|^2 \tag{200}$$

which is minimized, in a single step of Eq. (197) from any initial guess, by the linear least squares solution

$$\mathbf{a}=\mathbf{J}^-\mathbf{y}=\left(\mathbf{J}^T\mathbf{J}\right)^{-1}\mathbf{J}^T\mathbf{y} \tag{201}$$
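For example (an illustrative sketch only; the quadratic basis $\phi_1(x)=1$, $\phi_2(x)=x$, $\phi_3(x)=x^2$ is an arbitrary choice, and the observed samples and their values are assumed to be stored in column vectors x and y), the Jacobian can be built directly from the data and the solution in Eq. (201) is obtained without any iteration:

J = [ones(size(x)) x x.^2];        % Jacobian (design matrix), independent of the parameters
a = pinv(J)*y;                     % equivalent to (J'*J)\(J'*y), the LS solution in a single step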
The Matlab code for the essential parts of the algorithm is listed below, where data(:,1) and data(:,2) are the observed data containing the samples $x_n$ in the first column and their corresponding function values $y_n$ in the second column. In the code segment below, the regression function (e.g., an exponential model) is symbolically defined, and then a function NLregression is called to estimate the function parameters:
syms x;                                         % symbolic variable
a=sym('a',[3 1]);                               % symbolic parameters
f=@(x,a) a(1)*exp(a(2)*x)+a(3);                 % regression function (exponential model)
[a er]=NLregression(data(:,1),data(:,2),f,3);   % call regression to estimate the 3 parameters
The code below implements the function NLregression, which estimates the parameters of the regression function based on the Gauss-Newton method.
function [a er]=NLregression(x,y,f,M)  % nonlinear regression by the Gauss-Newton method
    N=length(x);                  % number of data samples
    a=sym('a',[M 1]);             % M symbolic parameters
    F=[];
    for i=1:N
        F=[F; f(x(i),a)];         % evaluate the regression function at the N given data points
    end
    J=jacobian(F,a);              % symbolic Jacobian of F with respect to the parameters
    J=matlabFunction(J,'Vars',{a});   % convert to a true function of the parameter vector
    F=matlabFunction(F,'Vars',{a});
    tol=1e-6;                     % tolerance for terminating the iteration
    a1=ones(M,1);                 % initial guesses for the M parameters
    n=0;                          % iteration index
    da=1;
    while da>tol                  % terminate if no more progress is made
        n=n+1;
        a0=a1;                    % update parameters
        J0=J(a0);                 % evaluate Jacobian based on a0
        Jinv=pinv(J0);            % pseudo-inverse of the Jacobian
        r=y-F(a0);                % residual
        a1=a0+Jinv*r;             % Gauss-Newton iteration (Eq. (197))
        er=norm(r);               % residual error
        da=norm(a0-a1);           % progress made in this iteration
    end
    a=a1;                         % return the final parameter estimates
end

where tol is a small tolerance value (here $10^{-6}$) used to terminate the iteration when little progress is made.

Example 1
Consider three sets of data (shown as the solid dots in the three columns of the plots below), each fitted by three different models (shown as the open circles in the plots). The parameters of the three models fitted to each of the three datasets are listed below. We see that the first dataset can be fitted equally well by all three models with similar errors (when its nonlinear parameter is near zero, model 2 reduces approximately to a linear function), the second dataset can be well fitted by either the second or the third model, while the third dataset can only be fitted by the third, exponential, model.
Example 2
Based on the following three functions, each with four parameters $a_1,\cdots,a_4$, three sets of 2-D data points are generated (with some random noise added), shown in the figure below as the dots in the three rows:

(202)

Each of the three datasets is then fitted by each of the three functions used as models, with the optimal parameters found by the Gauss-Newton method.
The Levenberg-Marquardt algorithm (LMA), or damped least-squares method, is widely used to minimize a sum-of-squares error $\varepsilon(\mathbf{a})$ as in Eq. (189). The LMA is an interpolation or trade-off between the gradient descent method, based on the first order derivative of the error function to be minimized, and the Gauss-Newton method, based on both the first and second order derivatives of the error function, both discussed above. As the LMA can take advantage of both methods, it is more robust: even if the initial guess is far from the solution corresponding to the minimum of the objective function, the iteration can still converge toward it.
Specifically, the LMA modifies the Gauss-Newton iteration in Eq. (197) by adding a damping term $\lambda\mathbf{I}$ to the matrix $\mathbf{J}^T\mathbf{J}$ to be inverted:

$$\left(\mathbf{J}^T\mathbf{J}+\lambda\mathbf{I}\right)\Delta\mathbf{a}=\mathbf{J}^T\left[\mathbf{y}-\mathbf{F}(\mathbf{a}_k)\right]=\mathbf{J}^T\mathbf{r} \tag{203}$$

$$\mathbf{a}_{k+1}=\mathbf{a}_k+\Delta\mathbf{a}=\mathbf{a}_k+\left(\mathbf{J}^T\mathbf{J}+\lambda\mathbf{I}\right)^{-1}\mathbf{J}^T\mathbf{r} \tag{204}$$

where $\lambda\ge 0$ is the damping parameter.
In each iterative step, if $\lambda$ is small and the error function $\varepsilon(\mathbf{a})$ decreases quickly, indicating that the Gauss-Newton method is effective in the sense that the error function is well modeled as a quadratic function by the first and second order terms of its Taylor series, we keep $\lambda$ small. On the other hand, if $\varepsilon(\mathbf{a})$ decreases slowly, indicating that the error function may not be well modeled as a quadratic function, possibly because the current $\mathbf{a}_k$ is far away from the minimum, then we let $\lambda$ take a large value, so that the term $\mathbf{J}^T\mathbf{J}$ becomes insignificant relative to $\lambda\mathbf{I}$, and Eq. (204) approaches

$$\mathbf{a}_{k+1}\approx\mathbf{a}_k+\frac{1}{\lambda}\mathbf{J}^T\mathbf{r} \tag{205}$$

$$=\mathbf{a}_k-\frac{1}{\lambda}\nabla_{\mathbf{a}}\varepsilon(\mathbf{a}_k) \tag{206}$$

which is simply the gradient descent method in Eq. (193) with step size $\eta=1/\lambda$.
By adjusting the parameter $\lambda$, the LMA can switch between the two methods for the same purpose of minimizing $\varepsilon(\mathbf{a})$. When the value of $\lambda$ is small, the iteration is dominated by the Gauss-Newton method, which approximates the error function by a quadratic function based on the second order derivatives. If the error reduces slowly, indicating that the error function cannot be well approximated by a quadratic function, a greater value of $\lambda$ is used to switch toward the gradient descent method.
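A minimal sketch of this adaptive strategy is given below (not part of the original text); it assumes that numeric functions F and J for the model and its Jacobian have already been generated, e.g., by matlabFunction with 'Vars',{a} as in the NLregression code above, that y holds the observed values and M is the number of parameters, and that the initial value of lambda and the factor 10 are common but arbitrary choices:

lambda = 1e-3;                              % initial damping parameter
a0 = ones(M,1);                             % initial guess of the M parameters
e0 = norm(y-F(a0))^2;                       % current sum-of-squares error
for k = 1:100
    J0 = J(a0);  r0 = y-F(a0);              % Jacobian and residual at the current estimate
    da = (J0'*J0+lambda*eye(M))\(J0'*r0);   % damped step of Eq. (204)
    if norm(y-F(a0+da))^2 < e0              % error reduced: accept the step,
        a0 = a0+da;  e0 = norm(y-F(a0))^2;
        lambda = lambda/10;                 % and move toward the Gauss-Newton method
    else                                    % error not reduced: reject the step,
        lambda = lambda*10;                 % and move toward the gradient descent method
    end
end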