In general, in any iterative algorithm for minimizing a
function $f(\mathbf{x})$, including Newton's method
(Eq. (\ref{NewtonIteration})) and the gradient descent method
(Eq. (\ref{GradientIteration})), the variable $\mathbf{x}$ is
updated iteratively by
\[
\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n ,
\]
based on both the search direction $\mathbf{d}_n$ and the step
size $\delta_n$, which need to be determined in such a way that
the function value $f(\mathbf{x}_{n+1})$ is maximally reduced at each
iteration step.
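Before specializing to any particular method, this general scheme can be summarized in a short Python sketch; the helper names \texttt{direction\_fn} and \texttt{step\_fn} below are hypothetical placeholders for whatever rule a specific method uses to choose $\mathbf{d}_n$ and $\delta_n$:
\begin{verbatim}
import numpy as np

def iterative_minimize(grad, x0, direction_fn, step_fn,
                       tol=1e-8, max_iter=1000):
    """Generic iteration x_{n+1} = x_n + delta_n * d_n (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                   # gradient g_n at the current point
        if np.linalg.norm(g) < tol:   # stop when the gradient (nearly) vanishes
            break
        d = direction_fn(x, g)        # search direction d_n
        delta = step_fn(x, g, d)      # step size delta_n
        x = x + delta * d             # update x_{n+1} = x_n + delta_n d_n
    return x
\end{verbatim}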
First, the search direction $\mathbf{d}_n$ needs to point away
from the gradient $\mathbf{g}_n = \mathbf{g}(\mathbf{x}_n)$, along which $f(\mathbf{x})$
increases most rapidly. In other words, the angle between
$\mathbf{d}_n$ and $\mathbf{g}_n$ should be greater than $90^\circ$:
\[
\mathbf{d}_n^T \mathbf{g}_n < 0 .
\tag{30}
\]
This condition is indeed satisfied in both the gradient descent
method with $\mathbf{d}_n = -\mathbf{g}_n$ and Newton's method with
$\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$:
\[
-\mathbf{g}_n^T \mathbf{g}_n < 0, \qquad
-\mathbf{g}_n^T \mathbf{H}_n^{-1} \mathbf{g}_n < 0 .
\tag{31}
\]
Here the Hessian $\mathbf{H}_n$ is assumed to be positive definite so that a
minimum exists.
Next, the optimal step size $\delta_n$ needs to be such that the
function value $f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n + \delta_n \mathbf{d}_n)$
is minimized along the search direction $\mathbf{d}_n$. To do so, we
set to zero the derivative of the function value with respect to
$\delta_n$, which is the \emph{directional derivative} (Section
\ref{vectorMatrixDifferentiation}) along the direction of
$\mathbf{d}_n$, obtained by the chain rule:
\[
\frac{d}{d\delta_n} f(\mathbf{x}_n + \delta_n \mathbf{d}_n)
= \mathbf{g}^T(\mathbf{x}_n + \delta_n \mathbf{d}_n)\,\mathbf{d}_n
= \mathbf{g}_{n+1}^T \mathbf{d}_n = 0 .
\tag{32}
\]
This result indicates that the gradient
$\mathbf{g}_{n+1} = \mathbf{g}(\mathbf{x}_{n+1})$ at the next point $\mathbf{x}_{n+1}$
should be perpendicular to the search direction $\mathbf{d}_n$. In
other words, when traversing along $\mathbf{d}_n$, we should stop at
the point $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n$ at which the
gradient $\mathbf{g}_{n+1}$ is perpendicular to $\mathbf{d}_n$, i.e.,
it has zero component along $\mathbf{d}_n$, and the corresponding
$\delta_n$ is the optimal step size.
To find the actual optimal step size $\delta_n$, we need
to solve a 1-D optimization problem, treating
$f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n + \delta\,\mathbf{d}_n)$ as a function
of the step size $\delta$, and approximating it by the first
three terms of its Taylor series at $\delta = 0$ (the Maclaurin
series of this single-variable function of $\delta$):
\[
f(\mathbf{x}_n + \delta\,\mathbf{d}_n)
\approx f(\mathbf{x}_n)
+ \delta \left.\frac{d}{d\delta} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\right|_{\delta=0}
+ \frac{\delta^2}{2} \left.\frac{d^2}{d\delta^2} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\right|_{\delta=0},
\tag{33}
\]
where
\[
f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\Big|_{\delta=0} = f(\mathbf{x}_n),
\tag{34}
\]
\[
\left.\frac{d}{d\delta} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\right|_{\delta=0}
= \mathbf{g}^T(\mathbf{x}_n + \delta\,\mathbf{d}_n)\,\mathbf{d}_n\Big|_{\delta=0}
= \mathbf{g}_n^T\mathbf{d}_n,
\tag{35}
\]
\[
\left.\frac{d^2}{d\delta^2} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\right|_{\delta=0}
= \left.\frac{d}{d\delta}\,\mathbf{g}^T(\mathbf{x}_n + \delta\,\mathbf{d}_n)\,\mathbf{d}_n\right|_{\delta=0}
= \mathbf{d}_n^T\mathbf{H}(\mathbf{x}_n + \delta\,\mathbf{d}_n)\,\mathbf{d}_n\Big|_{\delta=0}
= \mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n,
\tag{36}
\]
where $\mathbf{H}_n = \mathbf{H}(\mathbf{x}_n)$ is the Hessian matrix of $f(\mathbf{x})$
at $\mathbf{x}_n$. Substituting these back into the Taylor series above,
we get:
\[
f(\mathbf{x}_n + \delta\,\mathbf{d}_n)
\approx f(\mathbf{x}_n) + \delta\,\mathbf{g}_n^T\mathbf{d}_n
+ \frac{\delta^2}{2}\,\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n .
\tag{37}
\]
To find the $\delta$ that minimizes $f(\mathbf{x}_n + \delta\,\mathbf{d}_n)$,
we set to zero its derivative with respect to $\delta$:
\[
\frac{d}{d\delta} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)
\approx \mathbf{g}_n^T\mathbf{d}_n + \delta\,\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n = 0,
\tag{38}
\]
and solve the resulting equation for $\delta$ to get the optimal
step size based on both $\mathbf{g}_n$ and $\mathbf{H}_n$:
\[
\delta_n = -\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n} .
\tag{39}
\]
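As a quick illustration of Eq. (39), the Python sketch below computes this step size for an assumed quadratic function $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x}-\mathbf{b}^T\mathbf{x}$ (whose gradient is $\mathbf{H}\mathbf{x}-\mathbf{b}$), and checks that the gradient at the new point is perpendicular to the search direction, as required by Eq. (32); the numerical values are made up for illustration only:
\begin{verbatim}
import numpy as np

def optimal_step(g, d, H):
    """Optimal step size of Eq. (39): delta_n = -(g^T d) / (d^T H d)."""
    return -float(g @ d) / float(d @ H @ d)

# Illustrative quadratic f(x) = 0.5 x^T H x - b^T x with gradient H x - b:
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
g = H @ x - b                    # gradient at the current point
d = -g                           # steepest-descent direction
delta = optimal_step(g, d, H)
g_new = H @ (x + delta * d) - b  # gradient at the new point
print(delta, g_new @ d)          # second value ~ 0: g_{n+1} is perpendicular to d_n
\end{verbatim}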
Given this result, we can find the optimal step size for
Newton's method and the gradient descent method considered previously:

- \textbf{Newton's method:}
The search direction is $\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$, and the
optimal step size is
\[
\delta_n = -\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n}
= \frac{\mathbf{g}_n^T\mathbf{H}_n^{-1}\mathbf{g}_n}
{\mathbf{g}_n^T\mathbf{H}_n^{-1}\mathbf{H}_n\mathbf{H}_n^{-1}\mathbf{g}_n}
= 1 .
\tag{40}
\]
This confirms that the iteration in Eq. (\ref{NewtonIteration})
is indeed optimal:
\[
\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\mathbf{d}_n
= \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n .
\tag{41}
\]
- \textbf{Gradient descent method:}
The search direction is $\mathbf{d}_n = -\mathbf{g}_n$, and
Eq. (\ref{searchDirectionPerpendicular}) becomes:
\[
\mathbf{g}_{n+1}^T\mathbf{d}_n = -\mathbf{g}_{n+1}^T\mathbf{g}_n = 0,
\tag{42}
\]
i.e., the search direction $\mathbf{d}_{n+1} = -\mathbf{g}_{n+1}$ is always
perpendicular to the previous one $\mathbf{d}_n = -\mathbf{g}_n$, so that the
iteration follows a zigzag path composed of a sequence of segments
from the initial guess to the final solution. The optimal step size
is
\[
\delta_n = -\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n}
= \frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_n^T\mathbf{H}_n\mathbf{g}_n} .
\tag{43}
\]
Given $\mathbf{H}_n$ as well as $\mathbf{g}_n$, we can find this optimal
step size $\delta_n$ with complexity $O(N^2)$ in each iteration.
However, as we assume the Hessian $\mathbf{H}_n$ is not available in
the gradient descent method, the optimal step size above cannot be
computed. We can instead approximate $\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n$ in
the third term of Eq. (\ref{OptimalStepTaylor}) using the gradients at two nearby points
$\mathbf{x}_n$ and $\mathbf{x}_n + \epsilon\,\mathbf{d}_n$, where $\epsilon$ is a small value:
\[
\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n
\approx \mathbf{d}_n^T\,\frac{\mathbf{g}(\mathbf{x}_n + \epsilon\,\mathbf{d}_n) - \mathbf{g}(\mathbf{x}_n)}{\epsilon}
= \frac{\mathbf{d}_n^T(\mathbf{g}_\epsilon - \mathbf{g}_n)}{\epsilon},
\tag{44}
\]
where $\mathbf{g}_n = \mathbf{g}(\mathbf{x}_n)$ and
$\mathbf{g}_\epsilon = \mathbf{g}(\mathbf{x}_n + \epsilon\,\mathbf{d}_n)$.
This approximation can be used to replace $\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n$
in Eq. (\ref{OptimalDelta1}) above:
\[
\mathbf{g}_n^T\mathbf{d}_n + \delta\,\frac{\mathbf{d}_n^T(\mathbf{g}_\epsilon - \mathbf{g}_n)}{\epsilon} = 0 .
\tag{45}
\]
Solving for $\delta$ we get the estimated optimal step size:
\[
\delta_n = -\frac{\epsilon\,\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T(\mathbf{g}_\epsilon - \mathbf{g}_n)} .
\tag{46}
\]
Specifically, in the gradient descent method with $\mathbf{d}_n = -\mathbf{g}_n$,
this optimal step size becomes:
\[
\delta_n = \frac{\epsilon\,\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_n^T(\mathbf{g}_n - \mathbf{g}_\epsilon)},
\qquad \mathbf{g}_\epsilon = \mathbf{g}(\mathbf{x}_n - \epsilon\,\mathbf{g}_n),
\tag{47}
\]
and the iteration becomes
\[
\mathbf{x}_{n+1} = \mathbf{x}_n - \delta_n\,\mathbf{g}_n
= \mathbf{x}_n - \frac{\epsilon\,\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_n^T(\mathbf{g}_n - \mathbf{g}_\epsilon)}\,\mathbf{g}_n .
\tag{48}
\]
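The resulting Hessian-free iteration of Eqs. (47) and (48) can be sketched in Python as follows; the function name, the default value of $\epsilon$, and the stopping rule are illustrative assumptions rather than part of the derivation above:
\begin{verbatim}
import numpy as np

def gradient_descent_secant(grad, x0, eps=1e-4, tol=1e-8, max_iter=10000):
    """Gradient descent with the secant-estimated step size of Eq. (47)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gg = float(g @ g)                 # g_n^T g_n
        if gg < tol**2:                   # stop when the gradient (nearly) vanishes
            break
        g_eps = grad(x - eps * g)         # gradient at the nearby point x_n - eps*g_n
        denom = float(g @ (g - g_eps))    # g_n^T (g_n - g_eps)
        delta = eps * gg / denom if denom != 0.0 else eps   # Eq. (47), with a fallback
        x = x - delta * g                 # Eq. (48)
    return x
\end{verbatim}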
\textbf{Example 2:}
The gradient descent method is applied to solve the same
three-variable equation system previously solved by Newton's
method:
#math490# #tex2html_wrap_indisplay19074#
The step size $\delta_n$ is determined by the secant method of
Eq. (47) with a small $\epsilon$. The iteration from an initial guess $\mathbf{x}_0$
is shown below:
#math493# #tex2html_wrap_indisplay19079#
We see that after the first 100 iterations the error is reduced
to about #math494##tex2html_wrap_inline7592#, and
with 500 additional iterations the algorithm converges to the
following approximate solution with an accuracy of
#math495##tex2html_wrap_inline7594#:
#math496# #tex2html_wrap_indisplay19083#
(49)
Although the gradient descent method requires many more iterations
than Newton's method to converge, the computational cost of each
iteration is much lower, as no matrix inversion is needed.
When it is difficult or too computationally costly to find the
optimal step size along the search direction, some suboptimal
step size may be acceptable, such as in the quasi-Newton methods
(Section \ref{quasiNewton})
for minimization problems. In this case, although the step size
$\delta_n$ is no longer required to be such that the function value
at the new position $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\mathbf{d}_n$
is minimized along the search direction $\mathbf{d}_n$, the step size
$\delta_n$ still has to satisfy the following \emph{Wolfe conditions}:
\[
f(\mathbf{x}_n + \delta_n\mathbf{d}_n) \le f(\mathbf{x}_n) + c_1\,\delta_n\,\mathbf{g}_n^T\mathbf{d}_n
\qquad \mbox{(sufficient decrease)},
\tag{50}
\]
\[
\mathbf{g}^T(\mathbf{x}_n + \delta_n\mathbf{d}_n)\,\mathbf{d}_n \ge c_2\,\mathbf{g}_n^T\mathbf{d}_n
\qquad \mbox{(curvature condition)},
\tag{51}
\]
where the two constants $c_1$ and $c_2$ satisfy $0 < c_1 < c_2 < 1$.
In general, these conditions are motivated by the desired effect
that after each iterative step the function should have a shallower
slope along $\mathbf{d}_n$, as well as a lower value, so that eventually
the solution can be approached, at which $f(\mathbf{x})$ is minimized and
the gradient is zero.
Specifically, to understand the first condition above, we represent
the function to be minimized as a single-variable function of the
step size, $u(\delta) = f(\mathbf{x}_n + \delta\,\mathbf{d}_n)$, and its
tangent line at the point $\delta = 0$ as a linear function
$l(\delta) = a\,\delta + b$, where the intercept can be found
as $b = u(0) = f(\mathbf{x}_n)$, and the
slope $a$ can be found as the derivative of
$u(\delta)$ at $\delta = 0$:
\[
a = u'(0)
= \left.\frac{d}{d\delta} f(\mathbf{x}_n + \delta\,\mathbf{d}_n)\right|_{\delta=0}
= \mathbf{g}_n^T\mathbf{d}_n,
\tag{53}
\]
which is required to be negative for the function value to be
reduced, i.e., $\mathbf{g}_n^T\mathbf{d}_n < 0$. Now the
function of the tangent line can be written as
\[
l(\delta) = f(\mathbf{x}_n) + \delta\,\mathbf{g}_n^T\mathbf{d}_n .
\tag{54}
\]
Comparing this with the constant line $f(\mathbf{x}_n)$ of
slope zero, we see that any straight line between $f(\mathbf{x}_n)$
and $l(\delta)$ can be described by
$l_{c_1}(\delta) = f(\mathbf{x}_n) + c_1\,\delta\,\mathbf{g}_n^T\mathbf{d}_n$ with
$0 < c_1 < 1$, with a slope $c_1\,\mathbf{g}_n^T\mathbf{d}_n$.
The Armijo rule is to find any $\delta$ that satisfies
\[
f(\mathbf{x}_n + \delta\,\mathbf{d}_n) \le l_{c_1}(\delta)
= f(\mathbf{x}_n) + c_1\,\delta\,\mathbf{g}_n^T\mathbf{d}_n .
\tag{55}
\]
We see that the value of $f(\mathbf{x})$ is guaranteed to be reduced.
The second condition requires that at the new position
$\mathbf{x}_{n+1}$ the slope of the function along the search direction
$\mathbf{d}_n$, given by $\mathbf{g}_{n+1}^T\mathbf{d}_n$, be sufficiently
reduced in magnitude (by a factor determined by $c_2$) in comparison
to the slope $\mathbf{g}_n^T\mathbf{d}_n$ at the old position $\mathbf{x}_n$.
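For concreteness, a candidate step size $\delta$ can be tested against these two conditions with a small Python sketch like the one below; the default values $c_1 = 10^{-4}$ and $c_2 = 0.9$ are common choices in the literature, not values prescribed here:
\begin{verbatim}
import numpy as np

def satisfies_wolfe(f, grad, x, d, delta, c1=1e-4, c2=0.9):
    """Check the two Wolfe conditions for a candidate step size delta (sketch)."""
    g = grad(x)
    slope = float(g @ d)          # g_n^T d_n, assumed negative (descent direction)
    x_new = x + delta * d
    sufficient_decrease = f(x_new) <= f(x) + c1 * delta * slope
    curvature = float(grad(x_new) @ d) >= c2 * slope
    return sufficient_decrease and curvature
\end{verbatim}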
Figure 2.5: Iterations of the Secant Method (figures/GradientDescent.eps)
The reason why $0 < c_1 < c_2 < 1$ can also be explained
geometrically. As shown in Fig. \ref{GradientDescent}, the gradient
vectors at various points along the direction of $\mathbf{d}_n$ are
shown by the blue arrows, and their projections onto the direction
represent the slopes of the
function $f(\mathbf{x}_n + \delta\,\mathbf{d}_n)$. Obviously, at the point
$\mathbf{x}_{n+1}$ where $f(\mathbf{x}_n + \delta\,\mathbf{d}_n)$ reaches its minimum, its slope is
zero. In other words, the projection of the gradient $\mathbf{g}_{n+1}$
onto the direction of $\mathbf{d}_n$ is zero, i.e.,
$\mathbf{g}_{n+1}^T\mathbf{d}_n = 0$, or $\mathbf{g}_{n+1} \perp \mathbf{d}_n$.
The gradient descent method gradually approaches the solution $\mathbf{x}^*$
of an N-D minimization problem by moving from the initial guess $\mathbf{x}_0$
along a zigzag path composed of a set of segments, with any two consecutive
segments perpendicular to each other. The number of steps depends greatly
on the initial guess. As illustrated in the example below for an $N=2$
dimensional case, the best possible case is that the solution happens to
lie along the gradient direction at the initial guess, so that it can be
reached in a single step; the worst possible case is that the gradient
direction at the initial guess happens to be 45 degrees off from the
gradient direction of the best case, so that it takes many zigzag steps,
going back and forth around the optimal path, to reach the solution.
Many of the steps are in the same direction as some of the previous steps.
To improve the performance of the gradient descent method, we can include in
the iteration a momentum term representing the search direction previously
traversed:
\[
\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\,\mathbf{d}_n
+ \mu\,(\mathbf{x}_n - \mathbf{x}_{n-1}) .
\tag{56}
\]
Now two consecutive search directions are no longer perpendicular to
each other, and the resulting search path is smoother than the zigzag
path obtained without the momentum term. The parameter $\mu$ controls how
much momentum is added.
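A minimal Python sketch of gradient descent with such a momentum term might look as follows; the fixed step size and the value of $\mu$ are arbitrary illustrative choices:
\begin{verbatim}
import numpy as np

def gradient_descent_momentum(grad, x0, delta=1e-3, mu=0.9,
                              tol=1e-8, max_iter=100000):
    """Gradient descent with a momentum term, as in Eq. (56) with d_n = -g_n."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x_new = x - delta * g + mu * (x - x_prev)   # momentum reuses the previous move
        x_prev, x = x, x_new
    return x
\end{verbatim}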
Obviously it is most desirable not to repeat any of the previous
directions traveled so that the solution can be reached in N
steps, each in a unique direction in the N-D space. In other words,
the subsequent steps are independent of each other, never interfering
with the results achieved in the previous steps. Such a method will
be discussed in the next section.
Figure 2.6: Best and Worst Initial Guess (figures/GradientDescent1.eps)
\textbf{Example 3:}
The Rosenbrock function
\[
f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2
\]
is a two-variable non-convex function with a global minimum
$f(1,1) = 0$ at the point $(x, y) = (1, 1)$, which lies inside a long,
parabolically shaped valley, as shown in the figure below. As the
slope along the valley is very shallow, it is difficult for an
algorithm to converge quickly to the minimum. For this reason,
the Rosenbrock function is often used to test various minimization
algorithms.
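For readers who wish to reproduce such experiments, a minimal Python sketch of the Rosenbrock function and its gradient (in the standard form used above) is given below; it can be passed directly to any of the gradient-based sketches shown earlier:
\begin{verbatim}
import numpy as np

def rosenbrock(p):
    """Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    x, y = p
    return (1.0 - x)**2 + 100.0 * (y - x**2)**2

def rosenbrock_grad(p):
    """Gradient of the Rosenbrock function."""
    x, y = p
    return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

# At the global minimum (1, 1) both the function and its gradient vanish:
print(rosenbrock([1.0, 1.0]), rosenbrock_grad([1.0, 1.0]))
\end{verbatim}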
Figure 2.7: Rosenbrock Surface (figures/Rosenbrock.eps)
The figure below shows the search path of the gradient
descent method based on the optimal step size given in Eq.
(\ref{GradientStepSizeOptimal}). We see that the search path
is composed of a long sequence of 90-degree turns between consecutive
segments.
Figure 2.8: Gradient Descent for Rosenbrock Surface (figures/RosenbrockGradient0.eps)
Figure 2.9: Search Path of the Gradient Descent (figures/RosenbrockGradient01.eps)
When Newton's method is applied to this minimization problem,
it takes only four iterations for the algorithm to converge to the
minimum, as shown in the figure below:
Figure 2.10: Search Path of Newton's Method (figures/RosenbrockNewton.eps)