Line minimization

In general, in any iterative algorithm for minimizing a function $f(\mathbf{x})$, including Newton's method (Eq. (\ref{NewtonIteration})) and the gradient descent method (Eq. (\ref{GradientIteration})), the variable $\mathbf{x}$ is updated iteratively,
$$
\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n,
$$
based on both the search direction $\mathbf{d}_n$ and the step size $\delta_n$, which need to be determined in such a way that the function value $f(\mathbf{x}_{n+1})$ is maximally reduced at each iteration step.

First, the search direction $\mathbf{d}_n$ needs to point away from the gradient $\mathbf{g}_n = \nabla f(\mathbf{x}_n)$, along which $f(\mathbf{x})$ increases most rapidly. In other words, the angle $\theta$ between $\mathbf{d}_n$ and $\mathbf{g}_n$ should be greater than $90^\circ$:

$$
\mathbf{d}_n^T \mathbf{g}_n = \|\mathbf{d}_n\|\,\|\mathbf{g}_n\|\,\cos\theta < 0.
\tag{30}
$$
This condition is indeed satisfied in both the gradient descent method with $\mathbf{d}_n = -\mathbf{g}_n$ and Newton's method with $\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$:
$$
\mathbf{d}_n^T \mathbf{g}_n = -\mathbf{g}_n^T \mathbf{g}_n = -\|\mathbf{g}_n\|^2 < 0,
\qquad
\mathbf{d}_n^T \mathbf{g}_n = -\mathbf{g}_n^T \mathbf{H}_n^{-1} \mathbf{g}_n < 0.
\tag{31}
$$
Here the Hessian $\mathbf{H}_n$ (and therefore $\mathbf{H}_n^{-1}$) is assumed to be positive definite so that a minimum exists.
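As a quick numerical illustration (an assumed example, not from the text), the sketch below checks the descent condition of Eq. (30) for both directions with a randomly generated positive definite Hessian:

```python
import numpy as np

# Assumed example: verify the descent condition d^T g < 0 of Eq. (30) for both the
# gradient-descent direction d = -g and the Newton direction d = -H^{-1} g,
# using a randomly generated positive definite Hessian H.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
H = M @ M.T + 3 * np.eye(3)          # positive definite Hessian
g = rng.standard_normal(3)           # some nonzero gradient

d_gd = -g                            # gradient descent direction
d_newton = -np.linalg.solve(H, g)    # Newton direction
print(d_gd @ g, d_newton @ g)        # both values are negative
```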

Next, the optimal step size $\delta_n$ needs to be such that the function value $f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n + \delta_n \mathbf{d}_n)$ is minimized along the search direction $\mathbf{d}_n$. To do so, we set to zero the derivative of the function value with respect to $\delta_n$, which by the chain rule is the directional derivative (Section \ref{vectorMatrixDifferentiation}) along the direction of $\mathbf{d}_n$:

$$
\frac{d}{d\delta_n} f(\mathbf{x}_n + \delta_n \mathbf{d}_n)
= \mathbf{g}(\mathbf{x}_n + \delta_n \mathbf{d}_n)^T \mathbf{d}_n
= \mathbf{g}_{n+1}^T \mathbf{d}_n = 0.
\tag{32}
$$
This result indicates that the gradient $\mathbf{g}_{n+1} = \mathbf{g}(\mathbf{x}_{n+1})$ at the next point $\mathbf{x}_{n+1}$ should be perpendicular to the search direction $\mathbf{d}_n$. In other words, when traversing along $\mathbf{d}_n$, we should stop at the point $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n$ at which the gradient $\mathbf{g}_{n+1}$ is perpendicular to $\mathbf{d}_n$, i.e., it has zero component along $\mathbf{d}_n$; the corresponding $\delta_n$ is the optimal step size.

To find the actual optimal step size $\delta_n$, we need to solve a 1-D optimization problem, treating $f(\mathbf{x}_n + \delta \mathbf{d}_n)$ as a function of the step size $\delta$ and approximating it by the first three terms of its Taylor series at $\delta = 0$ (the Maclaurin series of the function $f(\delta)$):

$$
f(\delta) = f(\mathbf{x}_n + \delta \mathbf{d}_n)
\approx f(0) + f'(0)\,\delta + \frac{1}{2} f''(0)\,\delta^2,
\tag{33}
$$
where
$$
f(0) = f(\mathbf{x}_n),
\tag{34}
$$
$$
f'(0) = \left.\frac{d}{d\delta} f(\mathbf{x}_n + \delta \mathbf{d}_n)\right|_{\delta=0}
= \mathbf{g}_n^T \mathbf{d}_n,
\tag{35}
$$
$$
f''(0) = \left.\frac{d^2}{d\delta^2} f(\mathbf{x}_n + \delta \mathbf{d}_n)\right|_{\delta=0}
= \left.\frac{d}{d\delta}\,\mathbf{g}(\mathbf{x}_n + \delta \mathbf{d}_n)^T \mathbf{d}_n\right|_{\delta=0}
= \mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n,
\tag{36}
$$
where $\mathbf{H}_n = \mathbf{H}(\mathbf{x}_n)$ is the Hessian matrix of $f(\mathbf{x})$ at $\mathbf{x}_n$. Substituting these back into the Taylor series above, we get:
$$
f(\delta) \approx f(\mathbf{x}_n) + \delta\,\mathbf{g}_n^T \mathbf{d}_n
+ \frac{\delta^2}{2}\,\mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n.
\tag{37}
$$
To find the $\delta$ that minimizes $f(\delta)$, we set to zero its derivative with respect to $\delta$:
$$
\frac{d}{d\delta} f(\delta) = \mathbf{g}_n^T \mathbf{d}_n + \delta\,\mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n = 0,
\tag{38}
$$
and solve the resulting equation for $\delta$ to get the optimal step size based on both $\mathbf{g}_n$ and $\mathbf{H}_n$:
$$
\delta_n = -\frac{\mathbf{g}_n^T \mathbf{d}_n}{\mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n}.
\tag{39}
$$
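Expressed as code, Eq. (39) is a one-line computation. The sketch below (an assumed example) also checks numerically that the formula returns a step size of $1$ for the Newton direction, consistent with Eq. (40) below:

```python
import numpy as np

def optimal_step(g, d, H):
    """Optimal step size of Eq. (39): delta = -(g^T d) / (d^T H d),
    obtained from the quadratic (second-order Taylor) model of f along d."""
    return -(g @ d) / (d @ H @ d)

# Assumed example: for the Newton direction d = -H^{-1} g the optimal step is 1
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])           # a positive definite Hessian
g = np.array([1.0, -2.0])            # a gradient vector
print(optimal_step(g, -np.linalg.solve(H, g), H))   # prints 1.0 (up to rounding)
```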
Given this result, we can find the optimal step size for Newton's method and the gradient descent method considered previously:

  • Newton's method:

    The search direction is $\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$, and the optimal step size is

    $$
    \delta_n = -\frac{\mathbf{g}_n^T(-\mathbf{H}_n^{-1}\mathbf{g}_n)}{(\mathbf{H}_n^{-1}\mathbf{g}_n)^T \mathbf{H}_n (\mathbf{H}_n^{-1}\mathbf{g}_n)}
    = \frac{\mathbf{g}_n^T \mathbf{H}_n^{-1}\mathbf{g}_n}{\mathbf{g}_n^T \mathbf{H}_n^{-1}\mathbf{g}_n} = 1.
    \tag{40}
    $$
    This confirms that the iteration in Eq. (\ref{NewtonIteration}) is indeed optimal:
    $$
    \mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n.
    \tag{41}
    $$

  • Gradient descent method:

    The search direction is $\mathbf{d}_n = -\mathbf{g}_n$, and Eq. (\ref{searchDirectionPerpendicular}) becomes:

    $$
    \mathbf{g}_{n+1}^T \mathbf{d}_n = -\mathbf{g}_{n+1}^T \mathbf{g}_n = 0,
    \tag{42}
    $$
    i.e., the search direction $\mathbf{d}_{n+1} = -\mathbf{g}_{n+1}$ is always perpendicular to the previous one $\mathbf{d}_n = -\mathbf{g}_n$, so the iteration follows a zigzag path composed of a sequence of segments from the initial guess to the final solution (see the numerical sketch after this list). The optimal step size is
    $$
    \delta_n = -\frac{\mathbf{g}_n^T \mathbf{d}_n}{\mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n}
    = \frac{\mathbf{g}_n^T \mathbf{g}_n}{\mathbf{g}_n^T \mathbf{H}_n \mathbf{g}_n}.
    \tag{43}
    $$
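The following sketch (an assumed example on a simple quadratic objective $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T\mathbf{x}$, whose gradient is $A\mathbf{x} - \mathbf{b}$ and whose Hessian is the constant matrix $A$) runs gradient descent with the optimal step size of Eq. (43); successive gradients come out numerically perpendicular, producing the zigzag path described above:

```python
import numpy as np

# Assumed example: gradient descent with the exact step size of Eq. (43)
# on the quadratic f(x) = 0.5 x^T A x - b^T x (gradient A x - b, Hessian A).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite Hessian
b = np.array([1.0, 1.0])

x = np.array([4.0, -3.0])             # initial guess
for n in range(100):
    g = A @ x - b                     # gradient g_n
    if np.linalg.norm(g) < 1e-10:
        break
    delta = (g @ g) / (g @ A @ g)     # optimal step size, Eq. (43)
    x = x - delta * g                 # move along d_n = -g_n
    # consecutive gradients are (numerically) perpendicular: g_{n+1}^T g_n ~ 0
print(x, np.linalg.solve(A, b))       # both print (approximately) A^{-1} b = (0.2, 0.4)
```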

Given $\mathbf{g}_n$ as well as $\mathbf{H}_n$, we can find the optimal step size $\delta_n$ with complexity $O(N^2)$ per iteration, dominated by the matrix-vector product $\mathbf{H}_n\mathbf{d}_n$. However, as we assume the Hessian $\mathbf{H}_n$ is not available in the gradient descent method, the optimal step size above cannot be computed. We can instead approximate $f''(0) = \mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n$ in the third term of Eq. (\ref{OptimalStepTaylor}) by the change of the directional derivative between two nearby points $\mathbf{x}_n$ and $\mathbf{x}_n + \varepsilon\mathbf{d}_n$, where $\varepsilon$ is a small value:

$$
f''(0) = \mathbf{d}_n^T \mathbf{H}_n \mathbf{d}_n
\approx \frac{f'(\varepsilon) - f'(0)}{\varepsilon}
= \frac{(\mathbf{g}'_n - \mathbf{g}_n)^T \mathbf{d}_n}{\varepsilon},
\tag{44}
$$
where $\mathbf{g}_n = \mathbf{g}(\mathbf{x}_n)$ and $\mathbf{g}'_n = \mathbf{g}(\mathbf{x}_n + \varepsilon\mathbf{d}_n)$. This approximation can be used to replace $\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n$ in Eq. (\ref{OptimalDelta1}) above:
$$
\mathbf{g}_n^T \mathbf{d}_n + \delta\,\frac{(\mathbf{g}'_n - \mathbf{g}_n)^T \mathbf{d}_n}{\varepsilon} = 0.
\tag{45}
$$
Solving for $\delta$ we get the estimated optimal step size:
$$
\delta_n = -\frac{\varepsilon\,\mathbf{g}_n^T \mathbf{d}_n}{(\mathbf{g}'_n - \mathbf{g}_n)^T \mathbf{d}_n}.
\tag{46}
$$
Specifically, in the gradient descent method with $\mathbf{d}_n = -\mathbf{g}_n$, this optimal step size becomes:
$$
\delta_n = \frac{\varepsilon\,\mathbf{g}_n^T \mathbf{g}_n}{(\mathbf{g}_n - \mathbf{g}'_n)^T \mathbf{g}_n},
\qquad \mathbf{g}'_n = \mathbf{g}(\mathbf{x}_n - \varepsilon\mathbf{g}_n),
\tag{47}
$$
and the iteration becomes
$$
\mathbf{x}_{n+1} = \mathbf{x}_n - \delta_n \mathbf{g}_n
= \mathbf{x}_n - \frac{\varepsilon\,\mathbf{g}_n^T \mathbf{g}_n}{(\mathbf{g}_n - \mathbf{g}'_n)^T \mathbf{g}_n}\,\mathbf{g}_n.
\tag{48}
$$
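A minimal sketch of this Hessian-free iteration is given below, assuming only that a function returning the gradient of the objective is available; the function name gd_secant_step and the test objective are illustrative, not from the text:

```python
import numpy as np

def gd_secant_step(f_grad, x0, eps=1e-3, tol=1e-8, max_iter=10000):
    """Sketch of gradient descent with the secant-estimated step size of Eq. (47);
    f_grad returns the gradient of the objective (no Hessian is needed)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = f_grad(x)
        if np.linalg.norm(g) < tol:
            break
        gp = f_grad(x - eps * g)                  # gradient at the nearby point x - eps*g
        delta = eps * (g @ g) / ((g - gp) @ g)    # estimated step size, Eq. (47)
        x = x - delta * g                         # gradient descent update, Eq. (48)
    return x

# Hypothetical usage on f(x, y) = (x - 1)^2 + 2 (y + 2)^2
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
print(gd_secant_step(grad, [0.0, 0.0]))           # approaches the minimizer (1, -2)
```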

Example 2:

The gradient descent method is applied to solve the same three-variable equation system previously solved by Newton's method:

#math490#   #tex2html_wrap_indisplay19074#  

The step size $\delta_n$ is determined by the secant method described above with a small $\varepsilon$. The iteration from an initial guess $\mathbf{x}_0$ is shown below:

#math493#   #tex2html_wrap_indisplay19079#  
We see that the error is reduced substantially over the first 100 iterations, and with 500 additional iterations the algorithm converges to the following approximate solution:
#math496#   #tex2html_wrap_indisplay19083# (49)
Although the gradient descent method requires many more iterations than Newton's method to converge, the computational cost of each iteration is much lower, as no matrix inversion is needed.

When it is difficult or too computationally costly to find the optimal step size along the search direction, a suboptimal step size may be acceptable, as in the quasi-Newton methods (Section \ref{quasiNewton}) for minimization problems. In this case, although the step size $\delta_n$ is no longer required to minimize the function value at the new position $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\mathbf{d}_n$ along the search direction $\mathbf{d}_n$, it still has to satisfy the following Wolfe conditions:

  • Sufficient decrease (Armijo rule):
    $$
    f(\mathbf{x}_n + \delta_n \mathbf{d}_n) \le f(\mathbf{x}_n) + c_1\,\delta_n\,\mathbf{g}_n^T \mathbf{d}_n.
    \tag{50}
    $$
  • Curvature condition:
    $$
    \mathbf{g}(\mathbf{x}_n + \delta_n \mathbf{d}_n)^T \mathbf{d}_n \ge c_2\,\mathbf{g}_n^T \mathbf{d}_n.
    \tag{51}
    $$
    As $\mathbf{g}_n^T \mathbf{d}_n < 0$, this condition can also be written in the following alternative form:
    $$
    -\mathbf{g}(\mathbf{x}_n + \delta_n \mathbf{d}_n)^T \mathbf{d}_n \le -c_2\,\mathbf{g}_n^T \mathbf{d}_n = c_2\,\left|\mathbf{g}_n^T \mathbf{d}_n\right|.
    \tag{52}
    $$
Here the two constants $c_1$ and $c_2$ above satisfy $0 < c_1 < c_2 < 1$.

In general, these conditions are motivated by the desired effect that after each iterative step the function should have both a lower value and a shallower slope along $\mathbf{d}_n$, so that eventually the solution, where $f(\mathbf{x})$ is minimized and the gradient is zero, can be approached.

Specifically, to understand the first condition above, we represent the function to be minimized as a single-variable function of the step size, $\phi(\delta) = f(\mathbf{x}_n + \delta\mathbf{d}_n)$, and its tangent line at the point $\delta = 0$ as a linear function $l(\delta) = b + a\delta$, where the intercept can be found as $b = \phi(0) = f(\mathbf{x}_n)$, and the slope $a$ can be found as the derivative of $\phi(\delta)$ at $\delta = 0$:

$$
a = \phi'(0) = \left.\frac{d}{d\delta} f(\mathbf{x}_n + \delta\mathbf{d}_n)\right|_{\delta=0} = \mathbf{g}_n^T \mathbf{d}_n,
\tag{53}
$$
which is required to be negative for the function value to be reduced, $a = \mathbf{g}_n^T\mathbf{d}_n < 0$. Now the function of the tangent line can be written as
$$
l(\delta) = f(\mathbf{x}_n) + \delta\,\mathbf{g}_n^T\mathbf{d}_n.
\tag{54}
$$
Comparing this with a constant line $l_0(\delta) = f(\mathbf{x}_n)$ of slope zero, we see that any straight line between $l_0(\delta)$ and $l(\delta)$ can be described by $l_{c_1}(\delta) = f(\mathbf{x}_n) + c_1\,\delta\,\mathbf{g}_n^T\mathbf{d}_n$ with $0 < c_1 < 1$, with a slope $c_1\,\mathbf{g}_n^T\mathbf{d}_n$. The Armijo rule is to find any $\delta_n$ that satisfies
$$
f(\mathbf{x}_n + \delta_n\mathbf{d}_n) \le l_{c_1}(\delta_n) = f(\mathbf{x}_n) + c_1\,\delta_n\,\mathbf{g}_n^T\mathbf{d}_n < f(\mathbf{x}_n).
\tag{55}
$$
We see that the value of $f(\mathbf{x})$ is guaranteed to be reduced.

The second condition requires that at the new position $\mathbf{x}_{n+1}$ the slope of the function along the search direction $\mathbf{d}_n$, i.e., $\mathbf{g}_{n+1}^T\mathbf{d}_n$, be sufficiently reduced in magnitude, to below a specified value (determined by $c_2$), in comparison to the slope $\mathbf{g}_n^T\mathbf{d}_n$ at the old position $\mathbf{x}_n$.
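The two conditions are straightforward to test in code. The sketch below is an assumed example; the function name wolfe_satisfied and the default values c1=1e-4, c2=0.9 are illustrative choices, not the text's:

```python
import numpy as np

def wolfe_satisfied(f, grad, x, d, delta, c1=1e-4, c2=0.9):
    """Sketch: test whether the trial step size delta satisfies the Wolfe
    conditions, Eq. (50) (sufficient decrease) and Eq. (51) (curvature)."""
    g0_d = grad(x) @ d                                     # slope at delta = 0 (negative)
    armijo = f(x + delta * d) <= f(x) + c1 * delta * g0_d  # Eq. (50)
    curvature = grad(x + delta * d) @ d >= c2 * g0_d       # Eq. (51)
    return armijo and curvature

# Hypothetical usage on the quadratic f(x) = x^T x with the steepest-descent direction
f = lambda v: v @ v
grad = lambda v: 2 * v
x = np.array([1.0, -2.0])
d = -grad(x)
print([delta for delta in (1.0, 0.5, 0.1, 0.01) if wolfe_satisfied(f, grad, x, d, delta)])
# prints [0.5, 0.1]: delta = 1.0 fails the Armijo rule, delta = 0.01 fails the curvature condition
```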

Figure 2.5: Iterations of the Secant Method
[width=90mm] figures/GradientDescent.eps

The reason behind the curvature condition of Eq. (51) can also be explained geometrically. As shown in Fig. \ref{GradientDescent}, the gradient vectors at various points along the direction of $\mathbf{d}_n$ are shown by the blue arrows, and their projections onto this direction represent the slopes of the function $f(\mathbf{x}_n + \delta\mathbf{d}_n)$. Obviously, at the point where $f(\mathbf{x}_n + \delta\mathbf{d}_n)$ reaches its minimum, its slope is zero. In other words, the projection of the gradient $\mathbf{g}_{n+1}$ onto the direction of $\mathbf{d}_n$ is zero, i.e., $\mathbf{g}_{n+1}^T\mathbf{d}_n = 0$ or $\mathbf{g}_{n+1} \perp \mathbf{d}_n$.

The gradient descent method gradually approaches the solution $\mathbf{x}^*$ of an N-D minimization problem by moving from the initial guess $\mathbf{x}_0$ along a zigzag path composed of a set of segments, with any two consecutive segments perpendicular to each other. The number of steps depends greatly on the initial guess. As illustrated in the example below for an N=2 dimensional case, in the best possible case the solution happens to lie along the gradient direction at the initial guess and can be reached in a single step, while in the worst possible case the gradient direction at the initial guess is 45 degrees away from that of the best case, and many zigzag steps around the optimal path are needed to reach the solution. Many of the steps are in the same direction as some of the previous steps.

To improve the performance of the gradient descent method we can include in the iteration a momentum term representing the search direction previously traversed:

$$
\mathbf{x}_{n+1} = \mathbf{x}_n - \delta_n\,\mathbf{g}_n + \alpha\,(\mathbf{x}_n - \mathbf{x}_{n-1}).
\tag{56}
$$
Now two consecutive search directions are no longer perpendicular to each other, and the resulting search path is smoother than the zigzag path obtained without the momentum term. The parameter $\alpha$ controls how much momentum is added.
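A minimal sketch of this momentum variant is shown below, assuming a fixed step size and the update form of Eq. (56); the test objective $f(x, y) = x^2 + 10y^2$ and the parameter values are illustrative:

```python
import numpy as np

def gd_momentum(grad, x0, delta=0.05, alpha=0.7, n_iter=200):
    """Sketch of gradient descent with a momentum term (Eq. (56)):
    each update keeps a fraction alpha of the previous step."""
    x = np.asarray(x0, dtype=float)
    step = np.zeros_like(x)
    for _ in range(n_iter):
        step = -delta * grad(x) + alpha * step   # new step = gradient part + momentum
        x = x + step
    return x

# Hypothetical usage on the narrow-valley quadratic f(x, y) = x^2 + 10 y^2
grad = lambda v: np.array([2 * v[0], 20 * v[1]])
print(gd_momentum(grad, [5.0, 1.0]))             # approaches the minimum (0, 0)
```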

Obviously it is most desirable not to repeat any of the previous directions traveled so that the solution can be reached in N steps, each in a unique direction in the N-D space. In other words, the subsequent steps are independent of each other, never interfering with the results achieved in the previous steps. Such a method will be discussed in the next section.

Figure 2.6: Best and Worst Initial Guess
[width=90mm] figures/GradientDescent1.eps

Example 3:

The Rosenbrock function, here in its standard form,

$$
f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2,
$$
is a two-variable non-convex function with a global minimum of $f(1, 1) = 0$ at the point $(x, y) = (1, 1)$, which lies inside a long, parabolically shaped valley, as shown in the figure below. As the slope along the valley is very shallow, it is difficult for an algorithm to converge quickly to the minimum. For this reason, the Rosenbrock function is often used to test minimization algorithms.
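For reference, a short code definition of this function and its analytic gradient is sketched below (matching the standard form above); the gradient can, for example, be passed to the Hessian-free routine gd_secant_step sketched earlier:

```python
import numpy as np

# The Rosenbrock function in its standard form and its analytic gradient;
# the global minimum f = 0 is at (x, y) = (1, 1).
def rosenbrock(v):
    x, y = v
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),   # df/dx
                     200 * (y - x**2)])                      # df/dy

print(rosenbrock(np.array([1.0, 1.0])))                      # prints 0.0 at the minimum
```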

Figure 2.7: Rosenbrock Surface
[width=70mm] figures/Rosenbrock.eps

The figure below shows the search path of the gradient descent method based on the optimal step size given in Eq. (\ref{GradientStepSizeOptimal}). We see that the search path is composed of a long sequence of 90-degree turns between consecutive segments.

Figure 2.8: Gradient Descent for the Rosenbrock Surface
[width=70mm] figures/RosenbrockGradient0.eps
Figure 2.9: Search Path of the Gradient Descent
[width=60mm] figures/RosenbrockGradient01.eps

When Newton's method is applied to this minimization problem, it takes only four iterations for the algorithm to converge to the minimum, as shown in the figure below:

Figure 2.10: Search Path of Newton's Method
[width=70mm] figures/RosenbrockNewton.eps
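A corresponding Newton iteration is sketched below, using the analytic gradient and Hessian of the standard Rosenbrock function; the initial guess is an assumption for illustration, not the starting point used in the figure:

```python
import numpy as np

# Sketch of the Newton iteration x_{n+1} = x_n - H_n^{-1} g_n applied to the
# standard Rosenbrock function, with its analytic gradient and Hessian.
def grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400 * (y - x**2) + 800 * x**2, -400 * x],
                     [-400 * x, 200.0]])

x = np.array([-1.0, 1.0])                          # assumed initial guess
for n in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:
        break
    x = x - np.linalg.solve(hess(x), g)            # Newton step with delta = 1
print(n, x)                                        # reaches the minimum (1, 1) in a few steps
```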