As we have seen above, Newton's method can be used both to solve a nonlinear system, i.e., to find the roots of a set of simultaneous equations $\mathbf{f}(\mathbf{x})=\mathbf{0}$, and to solve an optimization problem, i.e., to minimize a scalar-valued objective function $f(\mathbf{x})$, based on iterations of the same form. Specifically,

$$\mathbf{x}_{n+1}=\mathbf{x}_n-\mathbf{J}^{-1}(\mathbf{x}_n)\,\mathbf{f}(\mathbf{x}_n)
\qquad\text{or}\qquad
\mathbf{x}_{n+1}=\mathbf{x}_n-\mathbf{H}^{-1}(\mathbf{x}_n)\,\mathbf{g}(\mathbf{x}_n)$$

If the Jacobian $\mathbf{J}$ or the Hessian matrix $\mathbf{H}$ is not available, or if it is too computationally costly to calculate its inverse (with complexity $O(N^3)$), the quasi-Newton methods can be used to approximate the Hessian matrix or its inverse based only on the first-order derivative, the gradient $\mathbf{g}(\mathbf{x})$ of $f(\mathbf{x})$ (with complexity $O(N^2)$), similar to Broyden's method considered previously.
In the following, we consider the minimization of a function $f(\mathbf{x})$. Its Taylor expansion around point $\mathbf{x}_{n+1}$ is

$$f(\mathbf{x})\approx f(\mathbf{x}_{n+1})+(\mathbf{x}-\mathbf{x}_{n+1})^T\mathbf{g}_{n+1}
+\frac{1}{2}(\mathbf{x}-\mathbf{x}_{n+1})^T\mathbf{H}_{n+1}(\mathbf{x}-\mathbf{x}_{n+1}) \tag{80}$$

Taking the derivative with respect to $\mathbf{x}$, we get

$$\mathbf{g}(\mathbf{x})\approx\mathbf{g}_{n+1}+\mathbf{H}_{n+1}(\mathbf{x}-\mathbf{x}_{n+1}) \tag{81}$$

Evaluating at $\mathbf{x}=\mathbf{x}_n$, we have $\mathbf{g}_n\approx\mathbf{g}_{n+1}+\mathbf{H}_{n+1}(\mathbf{x}_n-\mathbf{x}_{n+1})$, and the above can be written as

$$\mathbf{g}_{n+1}-\mathbf{g}_n\approx\mathbf{H}_{n+1}(\mathbf{x}_{n+1}-\mathbf{x}_n)
=\mathbf{B}_{n+1}(\mathbf{x}_{n+1}-\mathbf{x}_n) \tag{82}$$

where matrix $\mathbf{B}_{n+1}$ is the secant approximation of the Hessian matrix $\mathbf{H}_{n+1}$, and the last equality is called the secant equation. For convenience, we further define:

$$\mathbf{s}_n=\mathbf{x}_{n+1}-\mathbf{x}_n,\qquad \mathbf{y}_n=\mathbf{g}_{n+1}-\mathbf{g}_n \tag{83}$$

so that the equation above can be written as

$$\mathbf{B}_{n+1}\mathbf{s}_n=\mathbf{y}_n
\qquad\text{or}\qquad
\mathbf{D}_{n+1}\mathbf{y}_n=\mathbf{s}_n \tag{84}$$
This is the quasi-Newton equation, which is the secant condition that must be satisfied by matrix $\mathbf{B}_{n+1}$, or its inverse $\mathbf{D}_{n+1}=\mathbf{B}_{n+1}^{-1}$, in any of the quasi-Newton algorithms, all taking the following general steps (a code sketch of this loop is given after the list):
- Initialize $\mathbf{x}_0$ and $\mathbf{B}_0=\mathbf{I}$ (or $\mathbf{D}_0=\mathbf{I}$), set $n=0$;
- Compute the gradient $\mathbf{g}_n=\mathbf{g}(\mathbf{x}_n)$ and the search direction $\mathbf{d}_n=-\mathbf{B}_n^{-1}\mathbf{g}_n=-\mathbf{D}_n\mathbf{g}_n$;
- Get $\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\mathbf{d}_n$ with step size $\delta_n$ satisfying the Wolfe conditions;
- Update $\mathbf{B}_n$ to $\mathbf{B}_{n+1}$, or $\mathbf{D}_n$ to $\mathbf{D}_{n+1}$, so that the quasi-Newton equation in Eq. (84) is satisfied;
- If the termination condition is not satisfied, set $n=n+1$ and go back to the second step.
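As a concrete illustration of these steps, the following is a minimal Python/NumPy sketch of the generic loop (the function names `quasi_newton` and `update`, and the fallback step size, are choices of this sketch, not part of the text; SciPy's `line_search` is used to find a step size satisfying the Wolfe conditions):

```python
import numpy as np
from scipy.optimize import line_search

def quasi_newton(f, grad, x0, update, tol=1e-8, max_iter=500):
    """Generic quasi-Newton iteration following the steps listed above.

    `update(D, s, y)` returns the next inverse-Hessian approximation
    D_{n+1}; plugging in different updates gives the SR1, DFP or BFGS
    method."""
    x = np.asarray(x0, dtype=float)
    D = np.eye(x.size)                    # D_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:       # termination condition
            break
        d = -D @ g                        # search direction d_n = -D_n g_n
        # step size satisfying the Wolfe conditions
        alpha = line_search(f, grad, x, d)[0]
        if alpha is None:                 # line search failed: small safe step
            alpha = 1e-4
        s = alpha * d                     # s_n = x_{n+1} - x_n
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                     # y_n = g_{n+1} - g_n
        D = update(D, s, y)               # quasi-Newton update of D_n
        x, g = x_new, g_new
    return x
```

Passing different `update` functions to this loop yields the SR1, DFP, or BFGS methods described below.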
For this iteration to converge to a local minimum of $f(\mathbf{x})$, $\mathbf{B}_n$ must be a positive definite matrix, same as the Hessian matrix $\mathbf{H}_n$ it approximates. In particular, if $\mathbf{B}_n=\mathbf{I}$, then $\mathbf{d}_n=-\mathbf{g}_n$ and the algorithm becomes the gradient descent method; also, if $\mathbf{B}_n=\mathbf{H}_n$ is the Hessian matrix, then the algorithm becomes Newton's method.
In a quasi-Newton method, we can choose to update either $\mathbf{B}_n$ or its inverse $\mathbf{D}_n=\mathbf{B}_n^{-1}$, based on one of the two forms of the quasi-Newton equation in Eq. (84). We note that there is a dual relationship between the two ways to update, i.e., by swapping $\mathbf{s}_n$ and $\mathbf{y}_n$, an update formula for $\mathbf{B}_n$ can be directly applied to its inverse $\mathbf{D}_n$, and vice versa.
- The Symmetric Rank 1 (SR1) Algorithm

Here matrix $\mathbf{B}_n$ is updated by an additional term:

$$\mathbf{B}_{n+1}=\mathbf{B}_n+\alpha\,\mathbf{u}\mathbf{u}^T \tag{85}$$

where $\alpha$ is a scalar, $\mathbf{u}$ is a vector, and $\alpha\,\mathbf{u}\mathbf{u}^T$ is a matrix of rank 1. Imposing the secant condition in Eq. (84), we get

$$(\mathbf{B}_n+\alpha\,\mathbf{u}\mathbf{u}^T)\,\mathbf{s}_n=\mathbf{y}_n,
\qquad\text{i.e.,}\qquad
\alpha\,(\mathbf{u}^T\mathbf{s}_n)\,\mathbf{u}=\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n \tag{86}$$

This equation indicates that $\mathbf{u}$ is in the same direction as $\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n$, and it can therefore be written as $\mathbf{u}=\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n$. Substituting this back we get

$$\alpha\left[(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n\right](\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)
=\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n \tag{87}$$

Solving this we get

$$\alpha=\frac{1}{(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n} \tag{88}$$

and the iteration of $\mathbf{B}_n$ becomes

$$\mathbf{B}_{n+1}=\mathbf{B}_n
+\frac{(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)^T}
{(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n} \tag{89}$$

Applying the Sherman-Morrison formula, we can further get the inverse $\mathbf{D}_{n+1}=\mathbf{B}_{n+1}^{-1}$:

$$\mathbf{D}_{n+1}=\mathbf{D}_n
+\frac{(\mathbf{s}_n-\mathbf{D}_n\mathbf{y}_n)(\mathbf{s}_n-\mathbf{D}_n\mathbf{y}_n)^T}
{(\mathbf{s}_n-\mathbf{D}_n\mathbf{y}_n)^T\mathbf{y}_n} \tag{90}$$
We note that this equation is dual to the previous one, and it can also be obtained based on the duality between the secant conditions for $\mathbf{B}_{n+1}$ and $\mathbf{D}_{n+1}$. Specifically, same as above, we impose the secant condition $\mathbf{D}_{n+1}\mathbf{y}_n=\mathbf{s}_n$ on a rank-1 update of the inverse of the following form

$$\mathbf{D}_{n+1}=\mathbf{D}_n+\beta\,\mathbf{v}\mathbf{v}^T \tag{91}$$

and get

$$\beta\,(\mathbf{v}^T\mathbf{y}_n)\,\mathbf{v}=\mathbf{s}_n-\mathbf{D}_n\mathbf{y}_n \tag{92}$$

Then by repeating the same process as above, we get the same formula as in Eq. (90) for directly updating $\mathbf{D}_{n+1}$.
For the iteration to converge to a local minimum, matrix $\mathbf{D}_n$ needs to be positive definite, as does $\mathbf{B}_n$; we therefore require

$$(\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n>0 \tag{93}$$

It may be difficult to maintain this requirement throughout the iteration. This problem can be avoided in the following DFP and BFGS methods.
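For illustration, the SR1 update of the inverse in Eq. (90) can be sketched as follows (the name `sr1_update` and the skip threshold are hypothetical choices of this sketch; skipping the update when the denominator is near zero is a common safeguard, reflecting the difficulty just mentioned):

```python
import numpy as np

def sr1_update(D, s, y, eps=1e-8):
    """SR1 update of the inverse Hessian approximation, Eq. (90)."""
    v = s - D @ y                     # s_n - D_n y_n
    denom = v @ y                     # (s_n - D_n y_n)^T y_n
    # Skip the update when the denominator is too small relative to
    # the vectors involved, to avoid numerical blow-up.
    if abs(denom) < eps * np.linalg.norm(v) * np.linalg.norm(y):
        return D
    return D + np.outer(v, v) / denom
```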
- The BFGS (Broyden-Fletcher-Goldfarb-Shanno) Algorithm

This is a rank 2 algorithm in which matrix $\mathbf{B}_n$ is updated by two rank-1 terms:

$$\mathbf{B}_{n+1}=\mathbf{B}_n+\alpha\,\mathbf{u}\mathbf{u}^T+\beta\,\mathbf{v}\mathbf{v}^T \tag{94}$$

Imposing the secant condition $\mathbf{B}_{n+1}\mathbf{s}_n=\mathbf{y}_n$, we get

$$\mathbf{B}_n\mathbf{s}_n+\alpha\,(\mathbf{u}^T\mathbf{s}_n)\,\mathbf{u}
+\beta\,(\mathbf{v}^T\mathbf{s}_n)\,\mathbf{v}=\mathbf{y}_n \tag{95}$$

or

$$\alpha\,(\mathbf{u}^T\mathbf{s}_n)\,\mathbf{u}+\beta\,(\mathbf{v}^T\mathbf{s}_n)\,\mathbf{v}
=\mathbf{y}_n-\mathbf{B}_n\mathbf{s}_n \tag{96}$$

which can be satisfied if we let

$$\mathbf{u}=\mathbf{y}_n,\quad \alpha=\frac{1}{\mathbf{y}_n^T\mathbf{s}_n},\qquad
\mathbf{v}=\mathbf{B}_n\mathbf{s}_n,\quad \beta=-\frac{1}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n} \tag{97}$$

Substituting these into Eq. (94), we get

$$\mathbf{B}_{n+1}=\mathbf{B}_n+\frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}
-\frac{\mathbf{B}_n\mathbf{s}_n\mathbf{s}_n^T\mathbf{B}_n}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n} \tag{98}$$
Given this update formula for $\mathbf{B}_{n+1}$, we can further find its inverse $\mathbf{D}_{n+1}=\mathbf{B}_{n+1}^{-1}$ and then the search direction $\mathbf{d}_{n+1}=-\mathbf{D}_{n+1}\mathbf{g}_{n+1}$. Alternatively, given $\mathbf{D}_n$, we can derive an update formula directly for the inverse matrix $\mathbf{D}_{n+1}$, so that the search direction can be obtained without carrying out the inversion computation. Specifically, we first define

$$\mathbf{B}'=\mathbf{B}_n+\frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n},
\qquad \mathbf{w}=\mathbf{B}_n\mathbf{s}_n \tag{99}$$

so that the expression above can be written as

$$\mathbf{B}_{n+1}=\mathbf{B}'-\frac{\mathbf{w}\mathbf{w}^T}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n} \tag{100}$$

and then apply the Sherman-Morrison formula to $\mathbf{B}'$ to get

$$\mathbf{D}'=(\mathbf{B}')^{-1}
=\mathbf{D}_n-\frac{\mathbf{D}_n\mathbf{y}_n\mathbf{y}_n^T\mathbf{D}_n}
{\mathbf{y}_n^T\mathbf{s}_n+\mathbf{y}_n^T\mathbf{D}_n\mathbf{y}_n} \tag{101}$$

Applying the Sherman-Morrison formula a second time, now to Eq. (100), we get the expression for $\mathbf{D}_{n+1}=\mathbf{B}_{n+1}^{-1}$:

$$\mathbf{D}_{n+1}=\mathbf{D}'
+\frac{(\mathbf{D}'\mathbf{w})(\mathbf{D}'\mathbf{w})^T}
{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n-\mathbf{w}^T\mathbf{D}'\mathbf{w}} \tag{108}$$

where

$$\mathbf{D}'\mathbf{w}=\mathbf{s}_n
-\frac{\mathbf{y}_n^T\mathbf{s}_n}{\mathbf{y}_n^T\mathbf{s}_n+\mathbf{y}_n^T\mathbf{D}_n\mathbf{y}_n}\,\mathbf{D}_n\mathbf{y}_n \tag{109}$$

$$\mathbf{w}^T\mathbf{D}'\mathbf{w}=\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n
-\frac{(\mathbf{y}_n^T\mathbf{s}_n)^2}{\mathbf{y}_n^T\mathbf{s}_n+\mathbf{y}_n^T\mathbf{D}_n\mathbf{y}_n} \tag{110}$$

Substituting these terms back into Eq. (108) and simplifying, we get the update formula for $\mathbf{D}_{n+1}$:

$$\mathbf{D}_{n+1}
=\left(\mathbf{I}-\frac{\mathbf{s}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)\mathbf{D}_n
\left(\mathbf{I}-\frac{\mathbf{y}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)
+\frac{\mathbf{s}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n} \tag{112}$$
In summary, $\mathbf{D}_{n+1}$ can be found either by Eq. (98) for $\mathbf{B}_{n+1}$ followed by an additional inversion operation, or by Eq. (112) directly for $\mathbf{D}_{n+1}$. The results obtained by these two methods are equivalent.
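As an illustration, the direct update of Eq. (112) can be coded without any matrix inversion (the name `bfgs_update` is a choice of this sketch; it can be passed to the generic `quasi_newton` loop sketched earlier):

```python
import numpy as np

def bfgs_update(D, s, y):
    """BFGS update of the inverse Hessian approximation, Eq. (112)."""
    rho = 1.0 / (y @ s)               # 1 / (y_n^T s_n)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)      # I - s_n y_n^T / (y_n^T s_n)
    return V @ D @ V.T + rho * np.outer(s, s)
```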
- The DFP (Davidon-Fletcher-Powell) Algorithm

Same as the BFGS algorithm, the DFP algorithm is also a rank 2 algorithm, where instead of $\mathbf{B}_n$, the inverse matrix $\mathbf{D}_n$ is updated by two rank-1 terms:

$$\mathbf{D}_{n+1}=\mathbf{D}_n+\alpha\,\mathbf{u}\mathbf{u}^T+\beta\,\mathbf{v}\mathbf{v}^T \tag{113}$$

where $\alpha$ and $\beta$ are real scalars, and $\mathbf{u}$ and $\mathbf{v}$ are vectors. Imposing the secant condition $\mathbf{D}_{n+1}\mathbf{y}_n=\mathbf{s}_n$ for the inverse, we get

$$\mathbf{D}_n\mathbf{y}_n+\alpha\,(\mathbf{u}^T\mathbf{y}_n)\,\mathbf{u}
+\beta\,(\mathbf{v}^T\mathbf{y}_n)\,\mathbf{v}=\mathbf{s}_n \tag{114}$$

or

$$\alpha\,(\mathbf{u}^T\mathbf{y}_n)\,\mathbf{u}+\beta\,(\mathbf{v}^T\mathbf{y}_n)\,\mathbf{v}
=\mathbf{s}_n-\mathbf{D}_n\mathbf{y}_n \tag{115}$$

which can be satisfied if we let

$$\mathbf{u}=\mathbf{s}_n,\quad \alpha=\frac{1}{\mathbf{s}_n^T\mathbf{y}_n},\qquad
\mathbf{v}=\mathbf{D}_n\mathbf{y}_n,\quad \beta=-\frac{1}{\mathbf{y}_n^T\mathbf{D}_n\mathbf{y}_n} \tag{116}$$

Substituting these into Eq. (113), we get

$$\mathbf{D}_{n+1}=\mathbf{D}_n+\frac{\mathbf{s}_n\mathbf{s}_n^T}{\mathbf{s}_n^T\mathbf{y}_n}
-\frac{\mathbf{D}_n\mathbf{y}_n\mathbf{y}_n^T\mathbf{D}_n}{\mathbf{y}_n^T\mathbf{D}_n\mathbf{y}_n} \tag{117}$$

Following the same procedure used for the BFGS method, we can also obtain the update formula of matrix $\mathbf{B}_n$ for the DFP method:

$$\mathbf{B}_{n+1}
=\left(\mathbf{I}-\frac{\mathbf{y}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)\mathbf{B}_n
\left(\mathbf{I}-\frac{\mathbf{s}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)
+\frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n} \tag{118}$$
We note that Eqs. (117) and (98) form a duality pair, and Eqs. (118) and (112) form another duality pair. In other words, the BFGS and DFP methods are dual of each other.
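For comparison with the BFGS sketch above, here is the DFP update of Eq. (117) for the inverse matrix (the name `dfp_update` is a choice of this sketch); note that it is Eq. (98) with $\mathbf{s}_n$ and $\mathbf{y}_n$ swapped and $\mathbf{B}_n$ replaced by $\mathbf{D}_n$, reflecting the duality just described:

```python
import numpy as np

def dfp_update(D, s, y):
    """DFP update of the inverse Hessian approximation, Eq. (117)."""
    Dy = D @ y                               # D_n y_n
    return (D
            + np.outer(s, s) / (s @ y)       # s_n s_n^T / (s_n^T y_n)
            - np.outer(Dy, Dy) / (y @ Dy))   # D_n y_n y_n^T D_n / (y_n^T D_n y_n)
```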
In both the DFP and BFGS methods, matrix $\mathbf{D}_n$, as well as $\mathbf{B}_n$, must be positive definite, i.e., $\mathbf{x}^T\mathbf{B}_n\mathbf{x}>0$ must hold for any $\mathbf{x}\ne\mathbf{0}$.
We now prove that this requirement is satisfied if the curvature condition of the Wolfe conditions is satisfied:

$$\mathbf{d}_n^T\,\mathbf{g}(\mathbf{x}_n+\delta_n\mathbf{d}_n)\ge c_2\,\mathbf{d}_n^T\,\mathbf{g}(\mathbf{x}_n),
\qquad\text{i.e.,}\qquad
\mathbf{d}_n^T\mathbf{g}_{n+1}\ge c_2\,\mathbf{d}_n^T\mathbf{g}_n \tag{119}$$

where $0<c_2<1$. Multiplying by $\delta_n>0$ and replacing $\delta_n\mathbf{d}_n$ by the step $\mathbf{s}_n=\mathbf{x}_{n+1}-\mathbf{x}_n$, we get:

$$\mathbf{s}_n^T\mathbf{g}_{n+1}\ge c_2\,\mathbf{s}_n^T\mathbf{g}_n \tag{120}$$

We also have

$$\mathbf{s}_n^T\mathbf{g}_n=\delta_n\,\mathbf{d}_n^T\mathbf{g}_n
=-\delta_n\,\mathbf{g}_n^T\mathbf{D}_n\mathbf{g}_n<0 \tag{121}$$

Subtracting $\mathbf{s}_n^T\mathbf{g}_n$ from both sides of Eq. (120), we get

$$\mathbf{s}_n^T\mathbf{y}_n=\mathbf{s}_n^T(\mathbf{g}_{n+1}-\mathbf{g}_n)
\ge(c_2-1)\,\mathbf{s}_n^T\mathbf{g}_n>0 \tag{122}$$

The last inequality is due to the fact that $c_2<1$ and $\mathbf{s}_n^T\mathbf{g}_n<0$. We therefore have

$$\mathbf{y}_n^T\mathbf{s}_n>0 \tag{123}$$
Given $\mathbf{y}_n^T\mathbf{s}_n>0$, we can further prove by induction that both $\mathbf{B}_{n+1}$ and $\mathbf{D}_{n+1}$, based respectively on the update formulae in Eqs. (98) and (117), are positive definite. We first assume $\mathbf{B}_n$ and $\mathbf{D}_n$ are both positive definite, and express $\mathbf{B}_n$ in terms of its Cholesky decomposition, $\mathbf{B}_n=\mathbf{L}\mathbf{L}^T$. We further define $\mathbf{a}=\mathbf{L}^T\mathbf{x}$, $\mathbf{b}=\mathbf{L}^T\mathbf{s}_n$, and get

$$\mathbf{x}^T\mathbf{B}_n\mathbf{x}=\mathbf{a}^T\mathbf{a},\qquad
\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n=\mathbf{b}^T\mathbf{b},\qquad
\mathbf{x}^T\mathbf{B}_n\mathbf{s}_n=\mathbf{a}^T\mathbf{b} \tag{124}$$

Now we can show that the sum of the first and third terms of Eq. (98) is positive semi-definite:

$$\mathbf{x}^T\mathbf{B}_n\mathbf{x}
-\frac{(\mathbf{x}^T\mathbf{B}_n\mathbf{s}_n)^2}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n}
=\mathbf{a}^T\mathbf{a}-\frac{(\mathbf{a}^T\mathbf{b})^2}{\mathbf{b}^T\mathbf{b}}\ge 0 \tag{125}$$

The last step is due to the Cauchy-Schwarz inequality, with equality only when $\mathbf{a}$ and $\mathbf{b}$ are parallel, i.e., only when $\mathbf{x}$ is proportional to $\mathbf{s}_n$. Also, as $\mathbf{y}_n^T\mathbf{s}_n>0$, the second term of Eq. (98) is positive semi-definite:

$$\frac{(\mathbf{x}^T\mathbf{y}_n)^2}{\mathbf{y}_n^T\mathbf{s}_n}\ge 0 \tag{126}$$

The two equalities cannot hold simultaneously, because when $\mathbf{x}$ is proportional to $\mathbf{s}_n$, the term in Eq. (126) is proportional to $(\mathbf{s}_n^T\mathbf{y}_n)^2>0$. Combining these two results, we get

$$\mathbf{x}^T\mathbf{B}_{n+1}\mathbf{x}
=\mathbf{x}^T\mathbf{B}_n\mathbf{x}
+\frac{(\mathbf{x}^T\mathbf{y}_n)^2}{\mathbf{y}_n^T\mathbf{s}_n}
-\frac{(\mathbf{x}^T\mathbf{B}_n\mathbf{s}_n)^2}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n}>0 \tag{127}$$

i.e., $\mathbf{B}_{n+1}$ based on the update formula in Eq. (98) is positive definite. Following the same steps, we can also show that $\mathbf{D}_{n+1}$ based on Eq. (117) is positive definite.
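As a quick numerical illustration of this result (a sanity check of this rewrite, not part of the proof), one can verify that the update of Eq. (98) preserves positive definiteness whenever $\mathbf{y}_n^T\mathbf{s}_n>0$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = A @ A.T + 4 * np.eye(4)          # a random symmetric positive definite B_n
s = rng.normal(size=4)
y = rng.normal(size=4)
if y @ s < 0:                        # enforce the curvature result y_n^T s_n > 0
    y = -y
Bs = B @ s
B_next = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)   # Eq. (98)
print(np.linalg.eigvalsh(B_next))    # all eigenvalues come out positive
```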
Examples:
The figures below show the search paths of the DFP and BFGS methods when applied to find the minimum of the Rosenbrock function.
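A sketch reproducing such an experiment, assuming the `quasi_newton`, `bfgs_update`, and `dfp_update` helpers from the earlier sketches are in scope, might look as follows; both methods should approach the minimum at $(1,1)$:

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock function, with global minimum f(1, 1) = 0."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Analytical gradient of the Rosenbrock function."""
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

x0 = np.array([-1.2, 1.0])                  # classical starting point
for update in (bfgs_update, dfp_update):    # compare the two rank-2 methods
    x_min = quasi_newton(rosenbrock, rosenbrock_grad, x0, update)
    print(update.__name__, x_min)
```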