Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm (also known as the damped least-squares method) can be viewed as an interpolation, or trade-off, between the Gauss-Newton method and the gradient descent method. While all such methods can be used to minimize an objective function, such as the LS error $\varepsilon({\bf a})$ of a data fitting (modeling) problem, the Levenberg-Marquardt method is more robust: even when the initial guess is far from the minimum of the objective function, the iteration can still converge toward the solution.

Specifically, the normal equation obtained in the Gauss-Newton method

$\displaystyle \left({\bf J}^T{\bf J}\right)\,\triangle{\bf a}={\bf J}^T({\bf y}-{\bf f}({\bf a}))$ (46)

is modified by adding an extra term proportional to the identity matrix ${\bf I}$:

$\displaystyle \left({\bf J}^T{\bf J}+\lambda {\bf I}\right)\,\triangle{\bf a}
={\bf J}^T ({\bf y}-{\bf f}({\bf a}))$ (47)

Here $\lambda$ is a non-negative damping factor, which is adjusted at each iteration. Solving for $\triangle{\bf a}$, we get

$\displaystyle \triangle{\bf a}=({\bf J}^T{\bf J}+\lambda{\bf I})^{-1}
\left[{\bf J}^T ({\bf y}-{\bf f}({\bf a}))\right]$ (48)
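
As a minimal numerical sketch (not part of the original text), Eq. (48) can be evaluated with NumPy; the function name and arguments below are illustrative assumptions:

```python
import numpy as np

def lm_step(J, r, lam):
    """One damped least-squares update (Eq. (48)), assuming:
    J   : m-by-n Jacobian of f(a) with respect to a
    r   : residual vector y - f(a), of length m
    lam : non-negative damping factor lambda
    Returns delta_a solving (J^T J + lam I) delta_a = J^T r.
    """
    n = J.shape[1]
    A = J.T @ J + lam * np.eye(n)   # damped normal matrix J^T J + lambda I
    b = J.T @ r                     # right-hand side J^T (y - f(a))
    return np.linalg.solve(A, b)    # solve the linear system instead of forming the inverse
```

Solving the linear system directly, rather than explicitly inverting ${\bf J}^T{\bf J}+\lambda{\bf I}$, is the usual choice for numerical stability.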

When $\lambda$ is close to zero, the above is essentially the Gauss-Newton method discussed previously. But when $\lambda$ is large, the term ${\bf J}^T{\bf J}$ becomes insignificant, and $\triangle{\bf a}$ is approximately proportional to ${\bf J}^T({\bf y}-{\bf f}({\bf a}))$, i.e., to the negative gradient of $\varepsilon({\bf a})$ (Eq. (22)):

$\displaystyle {\bf g}_\varepsilon({\bf a}) =-2\,{\bf J}^T({\bf y}-{\bf f}({\bf a}))$ (49)
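
To make the large-$\lambda$ limit explicit, note that $({\bf J}^T{\bf J}+\lambda{\bf I})^{-1}\approx{\bf I}/\lambda$ when $\lambda$ dominates, so Eq. (48) together with Eq. (49) gives

$\displaystyle \triangle{\bf a}\approx\frac{1}{\lambda}\,{\bf J}^T({\bf y}-{\bf f}({\bf a}))
=-\frac{1}{2\lambda}\,{\bf g}_\varepsilon({\bf a})$

i.e., a gradient descent step with step size $1/(2\lambda)$.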

We see that the iteration is actually a combination of the Gauss-Newton method and the gradient descent method, with the parameter $\lambda$ controlling how much weight each method receives in minimizing $\varepsilon({\bf a})$. Initially $\lambda$ can be set to a small value, so that the iteration is dominated by the Gauss-Newton method. But if the objective function decreases too slowly, the value of $\lambda$ is increased, and the iteration then shifts toward the gradient descent method. A sketch of the resulting iteration is given below.
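
The following is a minimal sketch of the full iteration under one common $\lambda$-adjustment heuristic (accept a step only if it reduces the error, and scale $\lambda$ by a factor of ten either way); the model function `f`, Jacobian `jac`, the scaling factors, and the stopping test are illustrative assumptions, not specified in the text above.

```python
import numpy as np

def levenberg_marquardt(f, jac, y, a0, lam=1e-3, max_iter=100, tol=1e-8):
    """Minimize the LS error eps(a) = ||y - f(a)||^2 by damped least squares.

    f        : callable returning the model values f(a)
    jac      : callable returning the m-by-n Jacobian J at a
    y        : observed data vector
    a0       : initial parameter guess
    lam      : initial damping factor (small, so the start is close to Gauss-Newton)
    """
    a = np.asarray(a0, dtype=float)
    err = np.sum((y - f(a))**2)
    for _ in range(max_iter):
        J = jac(a)
        r = y - f(a)
        # Damped normal equations (J^T J + lam I) da = J^T r, as in Eq. (47)
        da = np.linalg.solve(J.T @ J + lam * np.eye(a.size), J.T @ r)
        new_err = np.sum((y - f(a + da))**2)
        if new_err < err:
            # Error reduced: accept the step and lean toward Gauss-Newton
            a, err = a + da, new_err
            lam *= 0.1
        else:
            # Error not reduced: increase lambda, lean toward gradient descent
            lam *= 10.0
        if np.linalg.norm(da) < tol:
            break
    return a
```

The accept/reject test and the factor-of-ten scaling are only one common heuristic; many implementations also stop when the error itself changes by less than a tolerance, or scale the damping term by $\mathrm{diag}({\bf J}^T{\bf J})$ instead of the identity.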