The goal of nonlinear least squares (LS) regression is to model the nonlinear relationship between the dependent variable $y$ and the independent variables in $\mathbf{x}$, based on the given training dataset $\mathcal{D} = \{(\mathbf{x}_n,\, y_n),\; n = 1, \dots, N\}$, by a nonlinear regression function $f(\mathbf{x}, \mathbf{a})$ parameterized by a set of $M$ parameters represented by $\mathbf{a} = [a_1, \dots, a_M]^T$:
\[ y \approx f(\mathbf{x}, \mathbf{a}) = f(\mathbf{x}, a_1, \dots, a_M) \tag{187} \]

Evaluated at the $N$ given data points, the model becomes an over-determined system of $N > M$ nonlinear equations
\[ f(\mathbf{x}_n, \mathbf{a}) = y_n, \qquad n = 1, \dots, N \tag{188} \]
whose least-squares solution minimizes the sum-of-squares error
\[ \varepsilon(\mathbf{a}) = \frac{1}{2} \sum_{n=1}^N r_n^2 = \frac{1}{2} \sum_{n=1}^N \left[ y_n - f(\mathbf{x}_n, \mathbf{a}) \right]^2 \tag{189} \]
with gradient
\[ \mathbf{g}_\varepsilon(\mathbf{a}) = \frac{d\,\varepsilon(\mathbf{a})}{d\,\mathbf{a}} = -\sum_{n=1}^N \left[ y_n - f(\mathbf{x}_n, \mathbf{a}) \right] \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}} = -\mathbf{J}^T \mathbf{r} \tag{190} \]
where $\mathbf{r} = [r_1, \dots, r_N]^T = \mathbf{y} - \mathbf{f}(\mathbf{a})$ is the residual vector and $\mathbf{J} = \partial\, \mathbf{f}(\mathbf{a}) / \partial\, \mathbf{a}$ is the $N \times M$ Jacobian of the model function. The least-squares estimate of the parameters is therefore
\[ \hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \varepsilon(\mathbf{a}) \tag{192} \]
The optimal parameters $\hat{\mathbf{a}}$ in Eq. (192) that minimize $\varepsilon(\mathbf{a})$ can now be found iteratively by the gradient descent method:
\[ \mathbf{a}_{k+1} = \mathbf{a}_k - \delta\, \mathbf{g}_\varepsilon(\mathbf{a}_k) = \mathbf{a}_k + \delta \sum_{n=1}^N \left[ y_n - f(\mathbf{x}_n, \mathbf{a}_k) \right] \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}} \bigg|_{\mathbf{a} = \mathbf{a}_k} \tag{193} \]
where $\delta$ is the step size.
This gradient method is also called batch gradient descent (BGD), as the gradient in Eq. (190) is calculated based on the entire batch of all data points in the dataset in each iteration. When the dataset is large, BGD may be slow due to its high computational complexity. In this case the method of stochastic gradient descent (SGD) can be used, which approximates the gradient based on only one data point uniformly randomly chosen from the dataset in each iteration, i.e., the summation in Eq. (190) contains only the single term corresponding to the chosen sample $(\mathbf{x}_n, y_n)$:
\[ \mathbf{g}_\varepsilon(\mathbf{a}) \approx -\left[ y_n - f(\mathbf{x}_n, \mathbf{a}) \right] \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}} \tag{194} \]

\[ \mathbf{a}_{k+1} = \mathbf{a}_k + \delta \left[ y_n - f(\mathbf{x}_n, \mathbf{a}_k) \right] \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}} \bigg|_{\mathbf{a} = \mathbf{a}_k} \tag{195} \]
As the data are typically noisy, the SGD gradient based on only one data point may point in a direction very different from the true gradient based on all data points; the resulting search path in the M-dimensional parameter space may take a jagged zig-zag shape, and the iteration may converge slowly. However, as the computational complexity of each iteration is much reduced, we can afford to carry out a large number of iterations, with an overall search path generally coinciding with that of the BGD method.
A trade-off between BGD and SGD can be made by estimating the gradient based on more than one but fewer than all data points in the dataset. The resulting method, called mini-batch gradient descent, can take advantage of both methods, as illustrated by the sketch below. The methods of stochastic and mini-batch gradient descent are widely used in machine learning, for example in artificial neural networks, which can be treated as a nonlinear regression problem, considered in a later chapter.
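To make this trade-off concrete, the following minimal Matlab sketch performs mini-batch gradient descent on the model $f(x, \mathbf{a}) = a_1 e^{a_2 x} + a_3$, where data(:,1) and data(:,2) hold the observed samples and their function values as in the code later in this section. It is not part of the listings given later; the model, step size, batch size and iteration count are illustrative assumptions. Setting the batch size B to 1 gives SGD, and setting it to N gives BGD.

f  = @(x,a) a(1)*exp(a(2)*x) + a(3);                            % regression function
df = @(x,a) [exp(a(2)*x), a(1)*x.*exp(a(2)*x), ones(size(x))];  % rows of the Jacobian df/da
N  = size(data,1);        % number of data points
B  = 10;                  % batch size: 1 for SGD, N for BGD (illustrative value)
delta = 1e-3;             % step size (illustrative value)
a = ones(3,1);            % initial guess for the parameters
for k = 1:10000           % fixed number of iterations for simplicity
    idx = randperm(N, min(B,N));           % uniformly randomly chosen mini-batch
    xb = data(idx,1);  yb = data(idx,2);
    r = yb - f(xb,a);                      % residuals of the mini-batch
    g = -df(xb,a)'*r;                      % gradient of the LS error, as in Eq. (190)
    a = a - delta*g;                       % gradient descent update, as in Eq. (193)
end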
Alternatively, the optimal parameters of the nonlinear regression function can also be found by the Gauss-Newton method, which directly solves the nonlinear over-determined equation system in Eq. (188) based on the iterative Newton-Raphson method discussed in Chapter 2:
\[ \mathbf{a}_{k+1} = \mathbf{a}_k - \mathbf{J}^- \left[ \mathbf{f}(\mathbf{a}_k) - \mathbf{y} \right] = \mathbf{a}_k + \left( \mathbf{J}^T \mathbf{J} \right)^{-1} \mathbf{J}^T \left[ \mathbf{y} - \mathbf{f}(\mathbf{a}_k) \right] \tag{196} \]
The last equality is due to the fact that $\mathbf{J}^- = (\mathbf{J}^T \mathbf{J})^{-1} \mathbf{J}^T$. Here $\mathbf{J}^T \mathbf{J}$ in the pseudo-inverse is actually an approximation of the Hessian $\mathbf{H}$ of the error function $\varepsilon(\mathbf{a})$, as shown in the example in a previous chapter, and $-\mathbf{J}^T \left[ \mathbf{y} - \mathbf{f}(\mathbf{a}_k) \right] = \mathbf{g}_\varepsilon(\mathbf{a}_k)$ is the gradient of the error function given in Eq. (190). So the iteration above is essentially Newton's method:
\[ \mathbf{a}_{k+1} = \mathbf{a}_k - \mathbf{H}^{-1} \mathbf{g}_\varepsilon(\mathbf{a}_k) \approx \mathbf{a}_k - \left( \mathbf{J}^T \mathbf{J} \right)^{-1} \mathbf{g}_\varepsilon(\mathbf{a}_k) \tag{197} \]
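This approximation can be verified by differentiating the error function in Eq. (189) twice; the brief derivation below is included as a sketch for reference (it is not the example referred to above):
\[ \mathbf{H} = \frac{\partial^2 \varepsilon(\mathbf{a})}{\partial \mathbf{a}\, \partial \mathbf{a}^T} = \sum_{n=1}^N \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}}\, \frac{\partial f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}^T} - \sum_{n=1}^N r_n \frac{\partial^2 f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}\, \partial \mathbf{a}^T} = \mathbf{J}^T \mathbf{J} - \sum_{n=1}^N r_n \frac{\partial^2 f(\mathbf{x}_n, \mathbf{a})}{\partial \mathbf{a}\, \partial \mathbf{a}^T} \approx \mathbf{J}^T \mathbf{J} \]
where the second term can be neglected when the residuals $r_n$ are small, i.e., when the model fits the data reasonably well.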
In the special case of a linear regression function $f(\mathbf{x}, \mathbf{a}) = \mathbf{x}^T \mathbf{a}$, we have:
\[ \mathbf{f}(\mathbf{a}) = \mathbf{X} \mathbf{a}, \qquad \mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]^T \tag{198} \]
\[ \mathbf{J} = \frac{\partial\, \mathbf{f}(\mathbf{a})}{\partial\, \mathbf{a}} = \mathbf{X} \tag{199} \]
so that the Gauss-Newton iteration in Eq. (196) becomes
\[ \mathbf{a}_{k+1} = \mathbf{a}_k + \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{a}_k \right) = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} \tag{200} \]
\[ \hat{\mathbf{a}} = \mathbf{X}^-\, \mathbf{y} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} \tag{201} \]
i.e., independently of the previous estimate $\mathbf{a}_k$, the iteration reaches the closed-form least-squares solution in a single step.
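As a quick numerical illustration of this special case, the minimal Matlab sketch below (with synthetic data and an assumed straight-line model $y \approx a_1 + a_2 x$, not taken from the text) computes the single-step least-squares solution with the pseudo-inverse:

x = (0:0.2:10)';                      % sample points
y = 2 + 0.5*x + 0.3*randn(size(x));   % noisy observations of a linear model
X = [ones(size(x)) x];                % data matrix, equal to the constant Jacobian J
a = pinv(X)*y;                        % single-step LS solution, as in Eq. (201)
er = norm(y - X*a);                   % residual error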
The Matlab code for the essential parts of the algorithm is listed below, where data(:,1) and data(:,2) are the observed data, containing the samples $x_n$ in the first column and their corresponding function values $y_n$ in the second column. In the code segment below, the regression function (e.g., an exponential model) is symbolically defined, and then a function NLregression is called to estimate the function parameters:
syms x                                   % symbolic variable
a = sym('a', [3 1]);                     % symbolic parameters a1, a2, a3
f = @(x,a) a(1)*exp(a(2)*x) + a(3);      % regression function (an exponential model)
[a, er] = NLregression(data(:,1), data(:,2), f, 3);   % call the regression function
The function NLregression below estimates the parameters of the regression function by the Gauss-Newton method.
function [a, er] = NLregression(x, y, f, M)   % nonlinear regression by Gauss-Newton
    N = length(x);                     % number of data samples
    a = sym('a', [M 1]);               % M symbolic parameters
    F = [];
    for i = 1:N
        F = [F; f(x(i), a)];           % model function evaluated at the N given data points
    end
    J = jacobian(F, a);                % symbolic Jacobian of F with respect to the parameters
    J = matlabFunction(J, 'Vars', {a});   % convert to numeric functions taking the
    F = matlabFunction(F, 'Vars', {a});   % parameter vector a as a single argument
    tol = 1e-6;                        % tolerance below which the iteration stops
    a1 = ones(M, 1);                   % initial guesses for the M parameters
    n = 0;                             % iteration index
    da = 1;
    while da > tol                     % terminate if no more progress is made
        n = n + 1;
        a0 = a1;                       % update parameters
        J0 = J(a0);                    % evaluate Jacobian based on a0
        Jinv = pinv(J0);               % pseudo-inverse of the Jacobian
        r = y - F(a0);                 % residual
        a1 = a0 + Jinv*r;              % Gauss-Newton iteration, Eq. (196)
        er = norm(r);                  % residual error
        da = norm(a0 - a1);            % progress made in this iteration
    end
    a = a1;                            % return the estimated parameters
end

where tol is a small value for the tolerance, such as the $10^{-6}$ used above, below which the change in the estimated parameters is considered negligible.
Example 1
Consider three sets of data (shown as the solid dots in the three columns of the plots below), each fitted by three different models (shown as the open circles in the plots):
The estimated parameters of the three models fitted to each of the three datasets are:
We see that the first dataset can be fitted equally well by all three models with similar error (when the corresponding parameter in model 2 is close to zero, it reduces approximately to a linear function); the second dataset can be well fitted by either the second or the third model; while the third dataset can only be fitted by the third, exponential, model.
Example 2
Based on the following three functions, each with four parameters $a_1, a_2, a_3, a_4$, three sets of 2-D data points are generated (with some random noise added), shown in the figure below as the dots in the three rows.
\[ y = f_i(x,\, a_1, a_2, a_3, a_4), \qquad i = 1, 2, 3 \tag{202} \]
Then each of the three datasets is fitted by each of the three functions used as models, with the optimal parameters found by the Gauss-Newton method.
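The fitting procedure of this example can be sketched in Matlab as below, using the NLregression function given above. The three model functions written here are placeholders with four parameters each, not necessarily the actual functions of Eq. (202):

models = { @(x,a) a(1) + a(2)*x + a(3)*x^2 + a(4)*x^3, ...    % placeholder model 1
           @(x,a) a(1)*sin(a(2)*x + a(3)) + a(4), ...         % placeholder model 2
           @(x,a) a(1)*exp(a(2)*x) + a(3)*x + a(4) };         % placeholder model 3
for m = 1:length(models)                                      % fit the dataset with each model
    [a, er] = NLregression(data(:,1), data(:,2), models{m}, 4);
    fprintf('model %d: residual error = %f\n', m, er);        % compare the fitting errors
end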
The Levenberg-Marquardt algorithm (LMA), or damped least-squares method, is widely used to minimize a sum-of-squares error as in Eq. (189). The LMA is an interpolation, or trade-off, between the gradient descent method, based on the first order derivative of the error function to be minimized, and the Gauss-Newton method, based on both the first and second order derivatives of the error function, both discussed above. As the LMA can take advantage of both methods, it is more robust: even if the initial guess is far from the solution corresponding to the minimum of the objective function, the iteration can still converge toward the solution.
Specifically, the LMA modifies the Gauss-Newton iteration in Eq. (196) by adding a damping term $\lambda \mathbf{I}$ to $\mathbf{J}^T \mathbf{J}$, so that the update $\Delta \mathbf{a} = \mathbf{a}_{k+1} - \mathbf{a}_k$ is found by solving
\[ \left( \mathbf{J}^T \mathbf{J} + \lambda \mathbf{I} \right) \Delta \mathbf{a} = \mathbf{J}^T \mathbf{r} = \mathbf{J}^T \left[ \mathbf{y} - \mathbf{f}(\mathbf{a}_k) \right] \tag{203} \]
\[ \mathbf{a}_{k+1} = \mathbf{a}_k + \Delta \mathbf{a} = \mathbf{a}_k + \left( \mathbf{J}^T \mathbf{J} + \lambda \mathbf{I} \right)^{-1} \mathbf{J}^T \left[ \mathbf{y} - \mathbf{f}(\mathbf{a}_k) \right] \tag{204} \]
In each iterative step, if $\lambda$ is small and the error function decreases quickly, indicating that the Gauss-Newton method is effective, in the sense that the error function is well modeled as a quadratic function by the first and second order terms of its Taylor series, we keep $\lambda$ small. On the other hand, if the error decreases slowly, indicating that the error function may not be well modeled as a quadratic function, possibly because the current estimate $\mathbf{a}_k$ is far away from the minimum, we let $\lambda$ take a large value, so that the term $\mathbf{J}^T \mathbf{J}$ becomes insignificant in comparison with $\lambda \mathbf{I}$, and Eq. (204) approaches
\[ \Delta \mathbf{a} \approx \frac{1}{\lambda}\, \mathbf{J}^T \left[ \mathbf{y} - \mathbf{f}(\mathbf{a}_k) \right] = -\frac{1}{\lambda}\, \mathbf{g}_\varepsilon(\mathbf{a}_k) \tag{205} \]

\[ \mathbf{a}_{k+1} = \mathbf{a}_k - \frac{1}{\lambda}\, \mathbf{g}_\varepsilon(\mathbf{a}_k) \tag{206} \]
i.e., the gradient descent method in Eq. (193) with step size $\delta = 1/\lambda$.
By adjusting the parameter $\lambda$, the LMA can switch between the two methods for the same purpose of minimizing $\varepsilon(\mathbf{a})$. When $\lambda$ is small, the iteration is dominated by the Gauss-Newton method, which approximates the error function by a quadratic function based on the second order derivatives. If the error decreases slowly, indicating that the error function cannot be well approximated by a quadratic function, a greater value of $\lambda$ is used to switch to the gradient descent method.
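A minimal Matlab sketch of this strategy is given below. It assumes, as in the NLregression function above, that y is the vector of observed values and that F(a) and J(a) are the numeric model and Jacobian functions of the M-element parameter vector produced by matlabFunction; the initial value of $\lambda$, the factor of 10, and the fixed iteration count are illustrative assumptions rather than part of the text.

lambda = 1e-3;                       % initial damping factor (illustrative value)
a = ones(M,1);                       % initial guess for the M parameters
r = y - F(a);  er = norm(r);         % initial residual and its error
for k = 1:100                        % fixed number of iterations for simplicity
    J0 = J(a);                                   % Jacobian at the current estimate
    da = (J0'*J0 + lambda*eye(M)) \ (J0'*r);     % damped update, as in Eq. (203)
    r1 = y - F(a + da);                          % residual of the trial step
    if norm(r1) < er                             % error reduced: accept the step and
        a = a + da;  r = r1;  er = norm(r1);     %   behave more like Gauss-Newton
        lambda = lambda/10;
    else                                         % error not reduced: reject the step and
        lambda = lambda*10;                      %   behave more like gradient descent
    end
end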