Quasi-Newton Methods
As we have seen above, Newton's method can be used, with iterations of the same form, both to solve nonlinear systems, i.e., to find the roots of a set of simultaneous equations, and to solve optimization problems, i.e., to minimize a scalar-valued objective function. If the Jacobian $\mathbf{J}$ or the Hessian matrix $\mathbf{H}$ is not available, or if it is too computationally costly to calculate its inverse (with complexity $O(N^3)$), a quasi-Newton method can be used to approximate the Hessian matrix or its inverse based only on the first-order derivative, the gradient $\mathbf{g}$ of $f(\mathbf{x})$ (with complexity $O(N^2)$), similar to Broyden's method considered previously (in the section on the Newton-Raphson method for multivariate problems).
In the following, we consider the minimization of a function $f(\mathbf{x})$. Its Taylor expansion around a point $\mathbf{x}_n$ is

$$ f(\mathbf{x}) \approx f(\mathbf{x}_n) + \mathbf{g}_n^T(\mathbf{x}-\mathbf{x}_n) + \frac{1}{2}(\mathbf{x}-\mathbf{x}_n)^T\mathbf{H}_n(\mathbf{x}-\mathbf{x}_n) \qquad (59) $$

where $\mathbf{g}_n$ and $\mathbf{H}_n$ are respectively the gradient and Hessian of $f$ at $\mathbf{x}_n$. Taking the derivative with respect to $\mathbf{x}$, we get

$$ \mathbf{g}(\mathbf{x}) \approx \mathbf{g}_n + \mathbf{H}_n(\mathbf{x}-\mathbf{x}_n) \qquad (60) $$

Evaluating this at $\mathbf{x} = \mathbf{x}_{n+1}$, we have $\mathbf{g}(\mathbf{x}_{n+1}) = \mathbf{g}_{n+1}$, and the above can be written as

$$ \mathbf{g}_{n+1} - \mathbf{g}_n \approx \mathbf{H}_n(\mathbf{x}_{n+1}-\mathbf{x}_n) = \mathbf{B}_{n+1}(\mathbf{x}_{n+1}-\mathbf{x}_n) \qquad (61) $$

where matrix $\mathbf{B}_{n+1}$ is the secant approximation of the Hessian matrix $\mathbf{H}_n$, and the last equality is called the secant equation.
For convenience, we further define:

$$ \mathbf{s}_n = \mathbf{x}_{n+1} - \mathbf{x}_n, \qquad \mathbf{y}_n = \mathbf{g}_{n+1} - \mathbf{g}_n \qquad (62) $$

so that the equation above can be written as

$$ \mathbf{B}_{n+1}\,\mathbf{s}_n = \mathbf{y}_n, \qquad \text{or equivalently} \qquad \mathbf{B}_{n+1}^{-1}\,\mathbf{y}_n = \mathbf{s}_n \qquad (63) $$
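The quasi-Newton equation has an exact interpretation for quadratic objectives, which the following sketch verifies numerically (NumPy assumed; the matrix $\mathbf{A}$, vector $\mathbf{b}$, and the two points are arbitrary test values, not from the text):

```python
import numpy as np

# For a quadratic f(x) = 0.5 x^T A x + b^T x the gradient is g(x) = A x + b,
# so y_n = g_{n+1} - g_n = A (x_{n+1} - x_n) = A s_n exactly: the true Hessian
# A satisfies the secant condition B s_n = y_n.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)          # symmetric positive definite Hessian
b = rng.standard_normal(3)

def grad(x):
    return A @ x + b

x_n  = rng.standard_normal(3)        # two arbitrary points
x_n1 = rng.standard_normal(3)
s = x_n1 - x_n                       # step s_n
y = grad(x_n1) - grad(x_n)           # gradient change y_n

print(np.allclose(A @ s, y))         # secant condition holds exactly
```

For a general nonlinear $f$, the secant condition holds only approximately for the true Hessian, which is why the approximation $\mathbf{B}_{n+1}$ is instead constructed to satisfy it exactly.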
This is the quasi-Newton equation, the secant condition that must be satisfied by the matrix $\mathbf{B}_{n+1}$, or its inverse $\mathbf{B}_{n+1}^{-1}$, in any of the quasi-Newton algorithms, all of which take the following general steps:

- Initialize $\mathbf{x}_0$ and $\mathbf{B}_0$, and set $n = 0$;
- Compute the gradient $\mathbf{g}_n$ and the search direction $\mathbf{d}_n = -\mathbf{B}_n^{-1}\mathbf{g}_n$;
- Get $\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n\mathbf{d}_n$ with a step size $\alpha_n$ satisfying the Wolfe conditions;
- Update $\mathbf{B}_n$ to $\mathbf{B}_{n+1}$ (or $\mathbf{B}_n^{-1}$ to $\mathbf{B}_{n+1}^{-1}$) so that the quasi-Newton equation is satisfied;
- If the termination condition is not satisfied, set $n = n+1$ and go back to the second step.
For this iteration to converge to a local minimum of $f(\mathbf{x})$, $\mathbf{B}_n$ must be a positive definite matrix, the same as the Hessian matrix $\mathbf{H}_n$ it approximates. In particular, if $\mathbf{B}_n = \mathbf{I}$, then $\mathbf{d}_n = -\mathbf{g}_n$ and the algorithm becomes the gradient descent method; also, if $\mathbf{B}_n = \mathbf{H}_n$ is the Hessian matrix, then the algorithm becomes Newton's method.
In a quasi-Newton method, we can choose to update either $\mathbf{B}_n$ or its inverse $\mathbf{B}_n^{-1}$, based on one of the two forms of the quasi-Newton equation. We note that there is a dual relationship between the two ways to update: by swapping $\mathbf{s}_n$ and $\mathbf{y}_n$, an update formula for $\mathbf{B}_n$ can be directly applied to its inverse $\mathbf{B}_n^{-1}$, and vice versa.
- The Symmetric Rank 1 (SR1) Algorithm

Here matrix $\mathbf{B}_n$ is updated by an additional term:

$$ \mathbf{B}_{n+1} = \mathbf{B}_n + \alpha\,\mathbf{u}\mathbf{u}^T \qquad (64) $$

where $\mathbf{u}$ is a vector and $\mathbf{u}\mathbf{u}^T$ is a matrix of rank 1. Imposing the secant condition, we get

$$ \mathbf{B}_{n+1}\mathbf{s}_n = \mathbf{B}_n\mathbf{s}_n + \alpha\,\mathbf{u}\,(\mathbf{u}^T\mathbf{s}_n) = \mathbf{y}_n \qquad (65) $$

This equation indicates that $\mathbf{u}$ is in the same direction as $\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n$, and can therefore be written as $\mathbf{u} = \mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n$ (any scaling can be absorbed into $\alpha$). Substituting this back we get

$$ \alpha\,(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)\left[(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n\right] = \mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n \qquad (66) $$

Solving this we get

$$ \alpha = \frac{1}{(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n} \qquad (67) $$

and the iteration of $\mathbf{B}_n$ becomes

$$ \mathbf{B}_{n+1} = \mathbf{B}_n + \frac{(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)^T}{(\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n} \qquad (68) $$
Applying the Sherman-Morrison formula (see the section on the Sherman-Morrison-Woodbury identity), we further get the inverse $\mathbf{B}_{n+1}^{-1}$:

$$ \mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} + \frac{(\mathbf{s}_n - \mathbf{B}_n^{-1}\mathbf{y}_n)(\mathbf{s}_n - \mathbf{B}_n^{-1}\mathbf{y}_n)^T}{(\mathbf{s}_n - \mathbf{B}_n^{-1}\mathbf{y}_n)^T\mathbf{y}_n} \qquad (69) $$

We note that this equation is dual to the previous one, and it can also be obtained from the duality between the secant conditions for $\mathbf{B}_{n+1}$ and $\mathbf{B}_{n+1}^{-1}$. Specifically, same as above, we impose the secant condition for the inverse in the form

$$ \mathbf{B}_{n+1}^{-1}\mathbf{y}_n = \mathbf{s}_n \qquad (70) $$

and, assuming a symmetric rank-1 correction $\mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} + \beta\,\mathbf{v}\mathbf{v}^T$, get

$$ \mathbf{B}_n^{-1}\mathbf{y}_n + \beta\,\mathbf{v}\,(\mathbf{v}^T\mathbf{y}_n) = \mathbf{s}_n \qquad (71) $$

Then by repeating the same process as above, with the roles of $\mathbf{s}_n$ and $\mathbf{y}_n$ (and of $\mathbf{B}_n$ and $\mathbf{B}_n^{-1}$) swapped, we get the same update formula for $\mathbf{B}_{n+1}^{-1}$ as in Eq. (69). This is the formula for directly updating $\mathbf{B}_{n+1}^{-1}$.

For the iteration to converge to a local minimum, matrix $\mathbf{B}_{n+1}$, as well as $\mathbf{B}_{n+1}^{-1}$, needs to be positive definite; we therefore require

$$ (\mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n)^T\mathbf{s}_n > 0 \qquad (72) $$

It may be difficult to maintain this requirement throughout the iteration. This problem can be avoided in the DFP and BFGS methods considered next.
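The SR1 update of Eq. (68) takes only a few lines. The following sketch (NumPy assumed; $\mathbf{B}_n$, $\mathbf{s}_n$, $\mathbf{y}_n$ are arbitrary test values, not from the text) confirms that the updated matrix satisfies the secant condition:

```python
import numpy as np

# One SR1 step: B_{n+1} = B_n + u u^T / (u^T s_n) with u = y_n - B_n s_n.
rng = np.random.default_rng(1)
B = np.eye(4)                        # current Hessian approximation B_n
s = rng.standard_normal(4)           # step s_n = x_{n+1} - x_n
y = rng.standard_normal(4)           # gradient change y_n = g_{n+1} - g_n

u = y - B @ s
denom = u @ s                        # denominator (y_n - B_n s_n)^T s_n
B_new = B + np.outer(u, u) / denom

print(np.allclose(B_new @ s, y))     # secant condition B_{n+1} s_n = y_n
```

By construction the correction is the unique symmetric rank-1 matrix achieving the secant condition; a practical implementation also skips the update when the denominator is near zero.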
- The BFGS (Broyden-Fletcher-Goldfarb-Shanno) Algorithm

This is a rank 2 algorithm in which matrix $\mathbf{B}_n$ is updated by two rank-1 terms:

$$ \mathbf{B}_{n+1} = \mathbf{B}_n + \alpha\,\mathbf{u}\mathbf{u}^T + \beta\,\mathbf{v}\mathbf{v}^T \qquad (73) $$

Imposing the secant condition for $\mathbf{B}_{n+1}$, we get

$$ \mathbf{B}_{n+1}\mathbf{s}_n = \mathbf{B}_n\mathbf{s}_n + \alpha\,\mathbf{u}\,(\mathbf{u}^T\mathbf{s}_n) + \beta\,\mathbf{v}\,(\mathbf{v}^T\mathbf{s}_n) = \mathbf{y}_n \qquad (74) $$

or

$$ \alpha\,\mathbf{u}\,(\mathbf{u}^T\mathbf{s}_n) + \beta\,\mathbf{v}\,(\mathbf{v}^T\mathbf{s}_n) = \mathbf{y}_n - \mathbf{B}_n\mathbf{s}_n \qquad (75) $$

which can be satisfied if we let

$$ \mathbf{u} = \mathbf{y}_n,\quad \alpha = \frac{1}{\mathbf{y}_n^T\mathbf{s}_n},\qquad \mathbf{v} = \mathbf{B}_n\mathbf{s}_n,\quad \beta = -\frac{1}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n} \qquad (76) $$

Substituting these into $\mathbf{B}_{n+1} = \mathbf{B}_n + \alpha\,\mathbf{u}\mathbf{u}^T + \beta\,\mathbf{v}\mathbf{v}^T$, we get

$$ \mathbf{B}_{n+1} = \mathbf{B}_n + \frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n} - \frac{\mathbf{B}_n\mathbf{s}_n\mathbf{s}_n^T\mathbf{B}_n}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n} \qquad (77) $$

Given this update formula for $\mathbf{B}_{n+1}$, we can further find its inverse $\mathbf{B}_{n+1}^{-1}$ and then the search direction $\mathbf{d}_{n+1} = -\mathbf{B}_{n+1}^{-1}\mathbf{g}_{n+1}$.
Alternatively, given $\mathbf{B}_{n+1}$, we can derive an update formula directly for the inverse matrix $\mathbf{B}_{n+1}^{-1}$, so that the search direction $\mathbf{d}_{n+1}$ can be obtained without carrying out the inversion computation. Specifically, we first define

$$ \mathbf{U} = [\,\mathbf{y}_n,\;\mathbf{B}_n\mathbf{s}_n\,], \qquad \mathbf{C} = \begin{bmatrix} 1/(\mathbf{y}_n^T\mathbf{s}_n) & 0 \\ 0 & -1/(\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n) \end{bmatrix} \qquad (78) $$

so that the expression above can be written as:

$$ \mathbf{B}_{n+1} = \mathbf{B}_n + \mathbf{U}\mathbf{C}\mathbf{U}^T \qquad (79) $$

and then apply the Sherman-Morrison-Woodbury formula to get:

$$ \mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} - \mathbf{B}_n^{-1}\mathbf{U}\left(\mathbf{C}^{-1} + \mathbf{U}^T\mathbf{B}_n^{-1}\mathbf{U}\right)^{-1}\mathbf{U}^T\mathbf{B}_n^{-1} \qquad (80) $$

Expanding the $2\times 2$ inverse in Eq. (80) in terms of the scalars $\mathbf{y}_n^T\mathbf{s}_n$, $\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n$, and $\mathbf{y}_n^T\mathbf{B}_n^{-1}\mathbf{y}_n$, and collecting terms, we get the update formula for $\mathbf{B}_{n+1}^{-1}$:

$$ \mathbf{B}_{n+1}^{-1} = \left(\mathbf{I} - \frac{\mathbf{s}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)\mathbf{B}_n^{-1}\left(\mathbf{I} - \frac{\mathbf{y}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right) + \frac{\mathbf{s}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n} \qquad (91) $$

One can verify directly that this expression satisfies the secant condition $\mathbf{B}_{n+1}^{-1}\mathbf{y}_n = \mathbf{s}_n$: the right-hand factor annihilates $\mathbf{y}_n$, since $(\mathbf{I} - \mathbf{y}_n\mathbf{s}_n^T/(\mathbf{y}_n^T\mathbf{s}_n))\,\mathbf{y}_n = \mathbf{y}_n - \mathbf{y}_n = \mathbf{0}$, leaving only the last term, which maps $\mathbf{y}_n$ to $\mathbf{s}_n$.
In summary, $\mathbf{B}_{n+1}^{-1}$ can be found either by Eq. (77) for $\mathbf{B}_{n+1}$ followed by an additional inversion operation, or by Eq. (91) directly for $\mathbf{B}_{n+1}^{-1}$. The results obtained by these two methods are equivalent.
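The equivalence of the two routes can be checked numerically. The following sketch (NumPy assumed; $\mathbf{B}_n$, $\mathbf{s}_n$, $\mathbf{y}_n$ are arbitrary test values with $\mathbf{s}_n^T\mathbf{y}_n > 0$) performs one BFGS step in both forms:

```python
import numpy as np

# One BFGS step in both forms: the update of B_n (Eq. 77) and the direct
# update of its inverse H_n = B_n^{-1} (Eq. 91).  Inverting the updated B
# must give the updated H, and H_{n+1} must satisfy H_{n+1} y_n = s_n.
rng = np.random.default_rng(2)
B = np.eye(3)                        # current approximation B_n
H = np.linalg.inv(B)                 # its inverse H_n
s = rng.standard_normal(3)
y = rng.standard_normal(3)
if s @ y < 0:                        # enforce the curvature condition
    y = -y

rho = 1.0 / (y @ s)
Bs = B @ s
B_new = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) * rho    # Eq. (77)
I = np.eye(3)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)                                    # Eq. (91)

print(np.allclose(H_new, np.linalg.inv(B_new)))  # the two forms agree
print(np.allclose(H_new @ y, s))                 # secant: H_{n+1} y_n = s_n
```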
- The DFP (Davidon-Fletcher-Powell) Algorithm

Same as the BFGS algorithm, the DFP algorithm is also a rank 2 algorithm, where instead of $\mathbf{B}_n$, the inverse matrix $\mathbf{B}_n^{-1}$ is updated by two rank-1 terms:

$$ \mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} + \alpha\,\mathbf{u}\mathbf{u}^T + \beta\,\mathbf{v}\mathbf{v}^T \qquad (92) $$

where $\alpha$ and $\beta$ are real scalars, and $\mathbf{u}$ and $\mathbf{v}$ are vectors. Imposing the secant condition for the inverse of $\mathbf{B}_{n+1}$, we get

$$ \mathbf{B}_{n+1}^{-1}\mathbf{y}_n = \mathbf{B}_n^{-1}\mathbf{y}_n + \alpha\,\mathbf{u}\,(\mathbf{u}^T\mathbf{y}_n) + \beta\,\mathbf{v}\,(\mathbf{v}^T\mathbf{y}_n) = \mathbf{s}_n \qquad (93) $$

or

$$ \alpha\,\mathbf{u}\,(\mathbf{u}^T\mathbf{y}_n) + \beta\,\mathbf{v}\,(\mathbf{v}^T\mathbf{y}_n) = \mathbf{s}_n - \mathbf{B}_n^{-1}\mathbf{y}_n \qquad (94) $$

which can be satisfied if we let

$$ \mathbf{u} = \mathbf{s}_n,\quad \alpha = \frac{1}{\mathbf{s}_n^T\mathbf{y}_n},\qquad \mathbf{v} = \mathbf{B}_n^{-1}\mathbf{y}_n,\quad \beta = -\frac{1}{\mathbf{y}_n^T\mathbf{B}_n^{-1}\mathbf{y}_n} \qquad (95) $$

Substituting these into $\mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} + \alpha\,\mathbf{u}\mathbf{u}^T + \beta\,\mathbf{v}\mathbf{v}^T$, we get

$$ \mathbf{B}_{n+1}^{-1} = \mathbf{B}_n^{-1} + \frac{\mathbf{s}_n\mathbf{s}_n^T}{\mathbf{s}_n^T\mathbf{y}_n} - \frac{\mathbf{B}_n^{-1}\mathbf{y}_n\mathbf{y}_n^T\mathbf{B}_n^{-1}}{\mathbf{y}_n^T\mathbf{B}_n^{-1}\mathbf{y}_n} \qquad (96) $$

Following the same procedure used for the BFGS method, we can also obtain the update formula of matrix $\mathbf{B}_n$ for the DFP method:

$$ \mathbf{B}_{n+1} = \left(\mathbf{I} - \frac{\mathbf{y}_n\mathbf{s}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right)\mathbf{B}_n\left(\mathbf{I} - \frac{\mathbf{s}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\right) + \frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n} \qquad (97) $$
We note that Eqs. (96) and (77) form a duality pair, and Eqs. (97) and (91) form another duality pair. In other words, the BFGS and DFP methods are duals of each other.
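A sketch of one DFP step of Eq. (96) makes this duality concrete (NumPy assumed; $\mathbf{H}_n = \mathbf{B}_n^{-1}$, $\mathbf{s}_n$, $\mathbf{y}_n$ are arbitrary illustrative values): the code is literally the BFGS update of $\mathbf{B}_n$ in Eq. (77) with the roles of $\mathbf{s}_n$ and $\mathbf{y}_n$ (and of $\mathbf{B}_n$ and $\mathbf{B}_n^{-1}$) swapped.

```python
import numpy as np

# One DFP update of the inverse Hessian approximation H_n = B_n^{-1}:
# H_{n+1} = H_n - H_n y y^T H_n / (y^T H_n y) + s s^T / (y^T s)   (Eq. 96)
rng = np.random.default_rng(3)
H = np.eye(3)                        # current inverse approximation H_n
s = rng.standard_normal(3)
y = rng.standard_normal(3)
if s @ y < 0:
    y = -y                           # keep the curvature condition s^T y > 0

Hy = H @ y
H_new = H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

print(np.allclose(H_new @ y, s))     # inverse secant condition H_{n+1} y = s
```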
In both the DFP and BFGS methods, matrix $\mathbf{B}_n^{-1}$, as well as $\mathbf{B}_n$, must be positive definite, i.e., $\mathbf{x}^T\mathbf{B}_n\mathbf{x} > 0$ must hold for any $\mathbf{x} \ne \mathbf{0}$. We now prove that this requirement is satisfied if the curvature condition of the Wolfe conditions is satisfied:

$$ \mathbf{g}_{n+1}^T\mathbf{d}_n \ge c_2\,\mathbf{g}_n^T\mathbf{d}_n, \qquad 0 < c_2 < 1 \qquad (98) $$

Replacing the search direction $\mathbf{d}_n$ by $\mathbf{s}_n = \alpha_n\mathbf{d}_n$ (with step size $\alpha_n > 0$), we get:

$$ \mathbf{g}_{n+1}^T\mathbf{s}_n \ge c_2\,\mathbf{g}_n^T\mathbf{s}_n \qquad (99) $$

We also have

$$ \mathbf{g}_n^T\mathbf{s}_n = \alpha_n\,\mathbf{g}_n^T\mathbf{d}_n < 0 \qquad (100) $$

since $\mathbf{d}_n$ is a descent direction. Subtracting $\mathbf{g}_n^T\mathbf{s}_n$ from both sides of Eq. (99), we get

$$ \mathbf{y}_n^T\mathbf{s}_n = (\mathbf{g}_{n+1} - \mathbf{g}_n)^T\mathbf{s}_n \ge (c_2 - 1)\,\mathbf{g}_n^T\mathbf{s}_n > 0 \qquad (101) $$

The last inequality is due to the fact that $c_2 - 1 < 0$ and $\mathbf{g}_n^T\mathbf{s}_n < 0$. We therefore have

$$ \mathbf{s}_n^T\mathbf{y}_n > 0 \qquad (102) $$
Given $\mathbf{s}_n^T\mathbf{y}_n > 0$, we can further prove by induction that both $\mathbf{B}_{n+1}$ and $\mathbf{B}_{n+1}^{-1}$, based respectively on the update formulae in Eqs. (77) and (96), are positive definite. We first assume $\mathbf{B}_n$ and $\mathbf{B}_n^{-1}$ are both positive definite, and express $\mathbf{B}_n$ in terms of its Cholesky decomposition, $\mathbf{B}_n = \mathbf{L}\mathbf{L}^T$. We further define $\mathbf{a} = \mathbf{L}^T\mathbf{x}$ and $\mathbf{b} = \mathbf{L}^T\mathbf{s}_n$, and get, for any $\mathbf{x} \ne \mathbf{0}$,

$$ \mathbf{x}^T\mathbf{B}_n\mathbf{x} = \mathbf{a}^T\mathbf{a}, \qquad \mathbf{x}^T\mathbf{B}_n\mathbf{s}_n = \mathbf{a}^T\mathbf{b}, \qquad \mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n = \mathbf{b}^T\mathbf{b} \qquad (103) $$

Now we can show that the sum of the first and third terms of Eq. (77) is positive semi-definite:

$$ \mathbf{x}^T\left(\mathbf{B}_n - \frac{\mathbf{B}_n\mathbf{s}_n\mathbf{s}_n^T\mathbf{B}_n}{\mathbf{s}_n^T\mathbf{B}_n\mathbf{s}_n}\right)\mathbf{x} = \mathbf{a}^T\mathbf{a} - \frac{(\mathbf{a}^T\mathbf{b})^2}{\mathbf{b}^T\mathbf{b}} \ge 0 \qquad (104) $$

The last step is due to the Cauchy-Schwarz inequality, with equality only when $\mathbf{a}$ is parallel to $\mathbf{b}$, i.e., when $\mathbf{x}$ is parallel to $\mathbf{s}_n$. Also, as $\mathbf{y}_n^T\mathbf{s}_n > 0$, the second term of Eq. (77) is positive semi-definite:

$$ \mathbf{x}^T\,\frac{\mathbf{y}_n\mathbf{y}_n^T}{\mathbf{y}_n^T\mathbf{s}_n}\,\mathbf{x} = \frac{(\mathbf{y}_n^T\mathbf{x})^2}{\mathbf{y}_n^T\mathbf{s}_n} \ge 0 \qquad (105) $$

and it is strictly positive precisely when $\mathbf{x}$ is parallel to $\mathbf{s}_n$, since then $(\mathbf{y}_n^T\mathbf{x})^2$ is a positive multiple of $(\mathbf{y}_n^T\mathbf{s}_n)^2 > 0$. As the two terms cannot vanish simultaneously, combining these two results we get

$$ \mathbf{x}^T\mathbf{B}_{n+1}\mathbf{x} > 0 \qquad (106) $$

i.e., $\mathbf{B}_{n+1}$ based on the update formula in Eq. (77) is positive definite. Following the same steps, we can also show that $\mathbf{B}_{n+1}^{-1}$ based on Eq. (96) is positive definite.
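The inductive step just proved can be illustrated numerically: starting from an arbitrary positive definite $\mathbf{B}_n$ and data satisfying $\mathbf{s}_n^T\mathbf{y}_n > 0$ (random test values below, NumPy assumed), the BFGS update of Eq. (77) again yields a positive definite matrix.

```python
import numpy as np

# If B_n is positive definite and s^T y > 0, the BFGS-updated B_{n+1} is
# again positive definite (checked here via its eigenvalues).
rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
B = M @ M.T + np.eye(4)              # a positive definite B_n
s = rng.standard_normal(4)
y = rng.standard_normal(4)
if s @ y < 0:
    y = -y                           # enforce the curvature condition

Bs = B @ s
B_new = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

eigvals = np.linalg.eigvalsh(B_new)  # B_{n+1} is symmetric
print(np.all(eigvals > 0))           # all eigenvalues positive, as proved
```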
Examples:

Figs. 2.11 through 2.13 show the search paths of the SR1, DFP, and BFGS methods when applied to find the minimum of the Rosenbrock function.
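A complete run like those shown in the figures can be sketched as follows: a minimal BFGS implementation in NumPy, updating the inverse approximation directly via Eq. (91). The starting point, constants, and the backtracking Armijo line search (standing in for a full Wolfe search) are illustrative choices, not taken from the text; updates with $\mathbf{s}_n^T\mathbf{y}_n \le 0$ are skipped to keep the approximation positive definite.

```python
import numpy as np

def f(x):                            # Rosenbrock function
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):                         # its gradient
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                     200 * (x[1] - x[0]**2)])

x = np.array([-1.2, 1.0])            # conventional starting point
H = np.eye(2)                        # initial inverse Hessian approximation
I = np.eye(2)
for _ in range(500):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:     # termination condition
        break
    d = -H @ g                       # search direction d_n = -H_n g_n
    a = 1.0                          # backtracking Armijo line search
    while f(x + a * d) > f(x) + 1e-4 * a * (g @ d):
        a *= 0.5
    s = a * d                        # step s_n
    x_new = x + s
    y = grad(x_new) - g              # gradient change y_n
    if y @ s > 1e-12:                # curvature check before updating
        rho = 1.0 / (y @ s)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)   # inverse update, Eq. (91)
    x = x_new

print(np.allclose(x, [1.0, 1.0], atol=1e-4))  # minimum of Rosenbrock at (1,1)
```

With an exact Wolfe line search the curvature condition $\mathbf{s}_n^T\mathbf{y}_n > 0$ is guaranteed, so the explicit check above is a common safeguard when a simpler line search is used instead.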
Figure 2.11: SR1 search path on the Rosenbrock function [figures/RosenbrockSR1.eps]
Figure 2.12: DFP search path on the Rosenbrock function [figures/RosenbrockDFP.eps]
Figure 2.13: BFGS search path on the Rosenbrock function [figures/RosenbrockBFGS.eps]