Quasi-Newton Methods

As we have seen above, Newton's method can be used both to solve a nonlinear system, i.e., to find the roots of a set of simultaneous equations, and to solve an optimization problem, i.e., to minimize a scalar-valued objective function, based on iterations of the same form:

$\displaystyle {\bf x}_{n+1}={\bf x}_n-{\bf J}^{-1}({\bf x}_n)\,{\bf f}({\bf x}_n),\;\;\;\;\;\mbox{or}\;\;\;\;\;{\bf x}_{n+1}={\bf x}_n-{\bf H}^{-1}({\bf x}_n)\,{\bf g}({\bf x}_n)$

If the Jacobian ${\bf J}({\bf x})$ or the Hessian matrix ${\bf H}({\bf x})$ is not available, or if computing its inverse is too costly (with complexity $O(N^3)$), a quasi-Newton method can be used instead to approximate the Hessian matrix or its inverse based only on the first-order derivative, the gradient ${\bf g}$ of $f({\bf x})$ (with complexity $O(N^2)$), similar to Broyden's method considered previously.

In the following, we consider the minimization of a function $f({\bf x})$. Its Taylor expansion around point ${\bf x}_{n+1}$ is

$\displaystyle f({\bf x})=f({\bf x}_{n+1})+({\bf x}-{\bf x}_{n+1})^T{\bf g}_{n+1}+\frac{1}{2}({\bf x}-{\bf x}_{n+1})^T{\bf H}_{n+1}({\bf x}-{\bf x}_{n+1})+O(\vert\vert{\bf x}-{\bf x}_{n+1}\vert\vert^3)$ (80)

Taking the derivative with respect to ${\bf x}$, we get

$\displaystyle \frac{d}{d{\bf x}}f({\bf x})={\bf g}({\bf x})={\bf g}_{n+1}+{\bf H}_{n+1}({\bf x}-{\bf x}_{n+1})+O(\vert\vert{\bf x}-{\bf x}_{n+1}\vert\vert^2)$ (81)

Evaluating at ${\bf x}={\bf x}_n$, we have ${\bf g}({\bf x}_n)={\bf g}_n$, and the above can be written as
$\displaystyle {\bf g}_{n+1}-{\bf g}_n={\bf H}_{n+1}({\bf x}_{n+1}-{\bf x}_n)+O(\vert\vert{\bf x}_{n+1}-{\bf x}_n\vert\vert^2)={\bf B}_{n+1}({\bf x}_{n+1}-{\bf x}_n)$ (82)

where matrix ${\bf B}_n$ denotes the secant approximation of the Hessian matrix ${\bf H}_n$, and the last equality is called the secant equation. For convenience, we further define:

$\displaystyle {\bf s}_n={\bf x}_{n+1}-{\bf x}_n,\;\;\;\;\;\;{\bf y}_n={\bf g}_{n+1}-{\bf g}_n$ (83)

so that the equation above can be written as

$\displaystyle {\bf B}_{n+1}{\bf s}_n={\bf y}_n,\;\;\;\;\;\mbox{or}\;\;\;\;\;{\bf B}_{n+1}^{-1}{\bf y}_n={\bf s}_n$ (84)
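As a quick numerical illustration of Eq. (84) (a minimal sketch; the quadratic test function and points below are arbitrary choices for illustration): for a quadratic objective the Hessian is constant, so the secant equation holds exactly, with no truncation error.

```python
import numpy as np

# Quadratic test objective f(x) = 0.5 x^T A x - b^T x, with constant
# Hessian A, so y_n = H s_n holds exactly (the O(||s_n||^2) term in
# Eq. (82) vanishes).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric positive definite Hessian
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b                  # gradient g(x) = A x - b

x_n, x_n1 = np.array([0.0, 0.0]), np.array([0.5, -0.3])
s = x_n1 - x_n                        # s_n = x_{n+1} - x_n
y = grad(x_n1) - grad(x_n)            # y_n = g_{n+1} - g_n
print(np.allclose(A @ s, y))          # True: B_{n+1} = A satisfies B s = y
```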

Eq. (84) is the quasi-Newton equation, i.e., the secant condition that must be satisfied by matrix ${\bf B}_n$, or its inverse ${\bf B}_n^{-1}$, in any of the quasi-Newton algorithms, which all take the following general steps:
  1. Initialize ${\bf x}_0$ and ${\bf B}_0$, set $n=0$;
  2. Compute gradient ${\bf g}_n$ and the search direction ${\bf d}_n=-{\bf B}_n^{-1}{\bf g}_n$;
  3. Compute ${\bf x}_{n+1}={\bf x}_n+\delta{\bf d}_n$ with a step size $\delta$ satisfying the Wolfe conditions;
  4. Update ${\bf B}_{n+1}={\bf B}_n+\Delta{\bf B}_n$ or ${\bf B}_{n+1}^{-1}={\bf B}_n^{-1}+\Delta{\bf B}_n^{-1}$ so that the quasi-Newton equation is satisfied;
  5. If the termination condition is not satisfied, set $n=n+1$ and go back to step 2.
For this iteration to converge to a local minimum of $f({\bf x})$, ${\bf B}_n$ must be a positive definite matrix, like the Hessian matrix ${\bf H}$ it approximates. In particular, if ${\bf B}_n={\bf I}$, then ${\bf d}_n=-{\bf g}_n$ and the algorithm reduces to the gradient descent method; if ${\bf B}_n={\bf H}_n$, the Hessian matrix, it becomes Newton's method.
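The five steps above can be written as a short generic driver. The sketch below is only an illustration (the function names and tolerances are my own choices; the standard BFGS update of ${\bf B}_n^{-1}$ is used in step 4 as one concrete choice of $\Delta{\bf B}_n^{-1}$):

```python
import numpy as np
from scipy.optimize import line_search

def quasi_newton(f, grad, x0, max_iter=100, tol=1e-8):
    """Generic quasi-Newton loop following steps 1-5 above; the BFGS
    update of B^{-1} is used as one concrete choice in step 4."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    Binv = np.eye(n)                         # step 1: B_0^{-1} = I
    g = grad(x)
    for _ in range(max_iter):
        d = -Binv @ g                        # step 2: search direction
        alpha = line_search(f, grad, x, d)[0]  # step 3: Wolfe step size
        if alpha is None:                    # line search failed: tiny step
            alpha = 1e-4
        s = alpha * d                        # s_n = x_{n+1} - x_n
        x = x + s
        g_new = grad(x)
        y = g_new - g                        # y_n = g_{n+1} - g_n
        g = g_new
        if np.linalg.norm(g) < tol:          # step 5: termination test
            break
        if y @ s > 1e-12:                    # curvature condition: keep B PD
            rho = 1.0 / (y @ s)              # step 4: BFGS update of B^{-1}
            I = np.eye(n)
            Binv = (I - rho * np.outer(s, y)) @ Binv \
                   @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    return x
```

For example, calling `quasi_newton(rosen, rosen_der, [-1.2, 1.0])` with SciPy's Rosenbrock helpers should converge to $(1, 1)$.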

In a quasi-Newton method, we can choose to update either ${\bf B}_n$ or its inverse ${\bf B}_n^{-1}$, based on one of the two forms of the quasi-Newton equation. Note that there is a dual relationship between the two updates: by swapping ${\bf y}_n$ and ${\bf s}_n$, an update formula for ${\bf B}_n$ can be directly applied to its inverse, and vice versa.
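This duality can be checked numerically. In the sketch below (random test data; `rank2_update` is a name introduced here for illustration), the same rank-2 formula applied to $({\bf B}_n,{\bf s}_n,{\bf y}_n)$ gives the BFGS update of ${\bf B}_n$, and with the roles of ${\bf s}_n$ and ${\bf y}_n$ swapped gives the DFP update of ${\bf B}_n^{-1}$; both satisfy the corresponding form of the secant equation:

```python
import numpy as np

def rank2_update(M, u, v):
    # Generic rank-2 quasi-Newton update: with (M,u,v) = (B, s, y) this
    # is the BFGS update of B; with (M,u,v) = (B^{-1}, y, s) it is the
    # DFP update of B^{-1} -- the dual pair obtained by swapping s and y.
    return M - (M @ np.outer(u, u) @ M) / (u @ M @ u) \
             + np.outer(v, v) / (u @ v)

rng = np.random.default_rng(0)
B = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
B = B @ B.T                                  # a positive definite B_n
s = rng.standard_normal(3)
y = B @ s + 0.01 * rng.standard_normal(3)    # a y_n with s^T y > 0

B_bfgs = rank2_update(B, s, y)                  # BFGS update of B_n
H_dfp  = rank2_update(np.linalg.inv(B), y, s)   # DFP update of B_n^{-1}
# Both satisfy their form of the secant equation, Eq. (84):
print(np.allclose(B_bfgs @ s, y), np.allclose(H_dfp @ y, s))
```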

In both the DFP and BFGS methods, matrix ${\bf B}_n^{-1}$, as well as ${\bf B}_n$, must be positive definite, i.e., ${\bf z}^T{\bf B}_n{\bf z} > 0$ must hold for any ${\bf z}\ne{\bf0}$. We now prove that this requirement is satisfied if the curvature condition of the Wolfe conditions holds:

$\displaystyle {\bf g}_{n+1}^T{\bf d}_n \ge c_2\,{\bf g}_n^T{\bf d}_n,\;\;\;\;\;\mbox{i.e.,}\;\;\;\;\;\left({\bf g}_{n+1}-c_2\,{\bf g}_n\right)^T{\bf d}_n \ge 0$ (119)

Multiplying by the step size $\delta>0$ and replacing $\delta{\bf d}_n={\bf x}_{n+1}-{\bf x}_n={\bf s}_n$, we get:

$\displaystyle ({\bf g}_{n+1}-c_2{\bf g}_n)^T{\bf s}_n={\bf g}_{n+1}^T{\bf s}_n-c_2{\bf g}_n^T{\bf s}_n \ge 0$ (120)

We also have

$\displaystyle {\bf y}_n^T{\bf s}_n=({\bf g}_{n+1}-{\bf g}_n)^T{\bf s}_n={\bf g}_{n+1}^T{\bf s}_n-{\bf g}_n^T{\bf s}_n$ (121)

Subtracting the first equation from the second, we get

$\displaystyle {\bf y}_n^T{\bf s}_n-({\bf g}_{n+1}^T{\bf s}_n-c_2{\bf g}_n^T{\bf s}_n)=(c_2-1){\bf g}_n^T{\bf s}_n > 0$ (122)

The last inequality is due to the fact that ${\bf g}_n^T{\bf s}_n=\delta\,{\bf g}_n^T{\bf d}_n<0$, as ${\bf d}_n$ is a descent direction, and $c_2<1$. We therefore have

$\displaystyle {\bf y}_n^T{\bf s}_n > {\bf g}_{n+1}^T{\bf s}_n-c_2{\bf g}_n^T{\bf s}_n \ge 0$ (123)
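This guarantee can be checked numerically. A minimal sketch, assuming SciPy's `line_search` (which enforces the strong Wolfe conditions, with $c_2=0.9$ by default) and the built-in Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

# Verify numerically that a step satisfying the Wolfe conditions gives
# y_n^T s_n > 0, as derived in Eqs. (119)-(123).
x = np.array([-1.2, 1.0])            # a common Rosenbrock starting point
g = rosen_der(x)
d = -g                               # descent direction: g^T d < 0
alpha = line_search(rosen, rosen_der, x, d, c2=0.9)[0]  # Wolfe step size
s = alpha * d                        # s_n = x_{n+1} - x_n
y = rosen_der(x + s) - g             # y_n = g_{n+1} - g_n
print(y @ s > 0)                     # True: the curvature condition holds
```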

Given ${\bf y}_n^T{\bf s}_n>0$, we can further prove by induction that both ${\bf B}_{n+1}$ and ${\bf B}_{n+1}^{-1}$, based respectively on the update formulae in Eqs. (98) and (117), are positive definite. Assume ${\bf B}_0$ is positive definite and, as the induction hypothesis, so is ${\bf B}_n$; then ${\bf B}_n$ can be expressed in terms of its Cholesky decomposition, ${\bf B}_n={\bf LL}^T$. We further define ${\bf a}={\bf L}^T{\bf z}$ and ${\bf b}={\bf L}^T{\bf s}_n$, and get

$\displaystyle {\bf a}^T{\bf a}={\bf z}^T{\bf B}_n{\bf z},\;\;\;\;{\bf b}^T{\bf b}={\bf s}_n^T{\bf B}_n{\bf s}_n,\;\;\;\;{\bf a}^T{\bf b}={\bf z}^T{\bf B}_n{\bf s}_n$ (124)

Now we can show that the quadratic form of the first two terms of Eq. (98) is non-negative:

$\displaystyle {\bf z}^T\left[{\bf B}_n-\frac{{\bf B}_n{\bf s}_n{\bf s}_n^T{\bf B}_n}{{\bf s}_n^T{\bf B}_n{\bf s}_n}\right]{\bf z}={\bf z}^T{\bf B}_n{\bf z}-\frac{({\bf z}^T{\bf B}_n{\bf s}_n)^2}{{\bf s}_n^T{\bf B}_n{\bf s}_n}={\bf a}^T{\bf a}-\frac{({\bf a}^T{\bf b})^2}{{\bf b}^T{\bf b}}\ge 0$ (125)

The last step is due to the Cauchy-Schwarz inequality. Also, as ${\bf s}_n^T{\bf y}_n>0$, the quadratic form of the last term of Eq. (98) is also non-negative:

$\displaystyle {\bf z}^T\left(\frac{{\bf y}_n{\bf y}_n^T}{{\bf s}_n^T{\bf y}_n}\right){\bf z}=\frac{({\bf y}_n^T{\bf z})^2}{{\bf s}_n^T{\bf y}_n}\ge 0$ (126)

Combining these two results, we get

$\displaystyle {\bf z}^T{\bf B}_{n+1}{\bf z}={\bf z}^T\left({\bf B}_n-\frac{{\bf B}_n{\bf s}_n{\bf s}_n^T{\bf B}_n}{{\bf s}_n^T{\bf B}_n{\bf s}_n}\right){\bf z}+{\bf z}^T\left(\frac{{\bf y}_n{\bf y}_n^T}{{\bf s}_n^T{\bf y}_n}\right){\bf z}\ge 0$ (127)

Moreover, equality holds in Eq. (125) only when ${\bf a}$ and ${\bf b}$ are parallel, i.e., when ${\bf z}$ is parallel to ${\bf s}_n$; in that case the second term $({\bf y}_n^T{\bf z})^2/({\bf s}_n^T{\bf y}_n)$ is strictly positive, so ${\bf z}^T{\bf B}_{n+1}{\bf z}>0$ for all ${\bf z}\ne{\bf0}$, i.e., ${\bf B}_{n+1}$ based on the update formula in Eq. (98) is positive definite. Following the same steps, we can also show that ${\bf B}_{n+1}^{-1}$ based on Eq. (117) is positive definite.
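A numerical illustration of this induction (a sketch with random data; the BFGS form of the rank-2 update is assumed here for Eq. (98)): starting from ${\bf B}_0={\bf I}$ and updating only when ${\bf y}_n^T{\bf s}_n>0$, the smallest eigenvalue of ${\bf B}_n$ remains positive.

```python
import numpy as np

def bfgs_update(B, s, y):
    # Rank-2 update of B_n (the BFGS form assumed here for Eq. (98))
    return B - (B @ np.outer(s, s) @ B) / (s @ B @ s) \
             + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(1)
B = np.eye(4)                          # positive definite B_0
for _ in range(50):
    s, y = rng.standard_normal(4), rng.standard_normal(4)
    if s @ y > 0.1:                    # enforce the curvature condition
        B = bfgs_update(B, s, y)
print(np.linalg.eigvalsh(B).min() > 0) # True: B_n stays positive definite
```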

Examples:

The figures below show the search paths of the SR1, DFP, and BFGS methods when applied to find the minimum of the Rosenbrock function.

(Figures: RosenbrockSR1.png, RosenbrockDFP.png, RosenbrockBFGS.png — search paths of the SR1, DFP, and BFGS methods on the Rosenbrock function.)
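The BFGS result is easy to reproduce with SciPy (a sketch; the starting point $(-1.2, 1)$ is a conventional choice and not necessarily the one used to generate the figures):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock function with BFGS, starting from (-1.2, 1);
# the unique minimum is at (1, 1), where f = 0.
res = minimize(rosen, np.array([-1.2, 1.0]), jac=rosen_der, method='BFGS')
print(res.x, res.nit)   # converges to ~[1. 1.] in a few dozen iterations
```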