Gradient Descent Method

Newton's method discussed above is based on both the gradient $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})$ and the Hessian $\mathbf{H}(\mathbf{x})$, which represent respectively the local variation and the more global curvature of the function $f(\mathbf{x})$ to be minimized. The method is not applicable if the Hessian $\mathbf{H}$ is not available, either because the analytical form of the function is not known or because the cost of computing the inverse $\mathbf{H}^{-1}$ is too high. In such cases, we can use the gradient descent method, which is based only on the gradient and no longer requires the Hessian.

We first consider the minimization of a single-variable function $f(x)$. From any initial point $x_0$, we can move to a nearby point $x_1$ along the direction opposite to the derivative, so that the function value becomes smaller than that at the current position $x_0$:

$$x_1 = x_0 - \delta\, f'(x_0) \qquad (25)$$

We see that no matter whether $f'(x_0)$ is positive or negative, the function value $f(x_1)$, approximated by the first two terms of its Taylor series, is always reduced if the positive step size $\delta > 0$ is small enough:

$$f(x_1) = f\big(x_0 - \delta\, f'(x_0)\big) \approx f(x_0) - \delta\,[f'(x_0)]^2 \le f(x_0) \qquad (26)$$

This process can be carried out iteratively

$$x_{n+1} = x_n - \delta\, f'(x_n), \qquad n = 0, 1, 2, \dots \qquad (27)$$

until eventually reaching a point $x^*$ at which $f'(x^*) = 0$ and no further progress can be made, i.e., a local minimum of the function is reached.
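As a rough sketch of this one-variable iteration, the Python fragment below implements Eq. (27); the test function $f(x) = (x-2)^2$, the step size, and the stopping tolerance are illustrative choices, not part of the text.

```python
def gradient_descent_1d(df, x0, delta=0.1, tol=1e-8, max_iter=1000):
    """Iterate x_{n+1} = x_n - delta * f'(x_n) until |f'(x)| is small (Eq. 27)."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:          # f'(x) ~ 0: a local minimum is reached
            break
        x = x - delta * g         # move against the derivative
    return x

# Example: f(x) = (x - 2)^2 with f'(x) = 2(x - 2); minimum at x* = 2
print(gradient_descent_1d(lambda x: 2.0 * (x - 2.0), x0=-5.0))
```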

This simple method can be generalized to minimize a multi-variable objective function $f(\mathbf{x})$ in N-D space. The derivative of the 1-D case is generalized to the gradient vector $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})$ of the function $f(\mathbf{x})$, which points in the direction along which the function increases most rapidly, with the steepest slope, perpendicular to the contours or iso-lines of the function. The fastest way to reduce $f(\mathbf{x})$ is therefore to go downhill along the direction opposite to the gradient vector.
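That the gradient is indeed the direction of steepest increase can be verified from the directional derivative of $f(\mathbf{x})$ along any unit vector $\mathbf{u}$:

$$D_{\mathbf{u}} f(\mathbf{x}) = \mathbf{g}(\mathbf{x})^T\mathbf{u} = \Vert\mathbf{g}(\mathbf{x})\Vert\,\cos\theta$$

where $\theta$ is the angle between $\mathbf{g}(\mathbf{x})$ and $\mathbf{u}$. This rate of change is maximized when $\mathbf{u}$ is along $\mathbf{g}(\mathbf{x})$ ($\theta = 0$), and is most negative when $\mathbf{u}$ is along $-\mathbf{g}(\mathbf{x})$ ($\theta = \pi$), the steepest descent direction.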

Specifically, the gradient descent method (also called the steepest descent or downhill method) carries out the iteration $\mathbf{x}_{n+1} = \mathbf{x}_n - \delta\,\mathbf{g}_n$, where $\mathbf{g}_n = \mathbf{g}(\mathbf{x}_n)$; its effect can be seen from the first two terms of the Taylor series of $f(\mathbf{x}_{n+1})$ with $\Delta\mathbf{x}_n = \mathbf{x}_{n+1} - \mathbf{x}_n = -\delta\,\mathbf{g}_n$:

$$f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n - \delta\,\mathbf{g}_n) \approx f(\mathbf{x}_n) - \delta\,\mathbf{g}_n^T\mathbf{g}_n = f(\mathbf{x}_n) - \delta\,\Vert\mathbf{g}_n\Vert^2 \le f(\mathbf{x}_n) \qquad (28)$$

As the function value is always reduced by this iteration, it can be carried out repeatedly

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \delta\,\mathbf{g}(\mathbf{x}_n), \qquad n = 0, 1, 2, \dots \qquad (29)$$

until eventually reaching a point $\mathbf{x}^*$ at which $\mathbf{g}(\mathbf{x}^*) = \nabla f(\mathbf{x}^*) = \mathbf{0}$ and $f(\mathbf{x})$ is minimized.
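A minimal sketch of iteration (29) in Python (using NumPy) is shown below; the stopping test on the gradient norm, the fixed step size, and the example function are illustrative assumptions rather than choices prescribed by the text.

```python
import numpy as np

def gradient_descent(grad, x0, delta=0.1, tol=1e-8, max_iter=10000):
    """Iterate x_{n+1} = x_n - delta * g(x_n) until ||g(x)|| is small (Eq. 29)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # g(x*) ~ 0: a (local) minimum is reached
            break
        x = x - delta * g             # step against the gradient
    return x

# Example: f(x, y) = x^2 + 2y^2 with gradient [2x, 4y]; minimum at the origin
print(gradient_descent(lambda v: np.array([2.0 * v[0], 4.0 * v[1]]), [3.0, -2.0]))
```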

Comparing the gradient descent iteration in Eq. (29) with Newton's method, whose search direction $-\mathbf{H}^{-1}\mathbf{g}$ is based on both $\mathbf{g}$ and $\mathbf{H}$, we see that in the gradient descent method the search direction $-\mathbf{g}$ relies only on the local information contained in the gradient $\mathbf{g}$, without the more global information contained in the second-order Hessian $\mathbf{H}$. The gradient descent method therefore does not have as much information as Newton's method and may not be as efficient. For example, when the function is quadratic, as discussed before, Newton's method finds the solution in a single step from any initial guess, but the gradient descent method takes multiple steps to reach the solution, as it always follows the negative direction of the local gradient, which in general does not point directly toward the solution. However, the gradient descent method is computationally less expensive, as the Hessian matrix is not needed.
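This contrast can be seen numerically. The sketch below, which assumes a particular positive definite matrix $\mathbf{A}$, starting point, and step size chosen only for illustration, applies both methods to the quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$, whose gradient is $\mathbf{A}\mathbf{x}$ and whose Hessian is $\mathbf{A}$: Newton's step lands on the minimum immediately, while gradient descent needs many steps.

```python
import numpy as np

A = np.array([[3.0, 1.0],     # illustrative positive definite Hessian
              [1.0, 2.0]])
x0 = np.array([2.0, -1.5])    # illustrative starting point

grad = lambda x: A @ x        # gradient of f(x) = 0.5 x^T A x
hess_inv = np.linalg.inv(A)   # Hessian of the quadratic is A itself

# Newton's method: a single step reaches the minimizer x* = 0
x_newton = x0 - hess_inv @ grad(x0)

# Gradient descent: repeated small steps along -g(x)
x, delta, steps = x0.copy(), 0.1, 0
while np.linalg.norm(grad(x)) > 1e-8:
    x = x - delta * grad(x)
    steps += 1

print("Newton after 1 step:", x_newton)          # [0. 0.]
print(f"Gradient descent after {steps} steps:", x)
```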

Example 1:

Consider a two-variable quadratic function in the following general form:

$$f(x, y) = \frac{1}{2}\left(a x^2 + 2 b x y + c y^2\right)$$

If

$$\mathbf{A} = \begin{bmatrix} a & b \\ b & c \end{bmatrix}$$

is a symmetric positive semidefinite matrix, i.e., $a,\,c \ge 0$ and $ac \ge b^2$, then the function becomes

$$f(x, y) = \frac{1}{2}\,[x\;\;y]\begin{bmatrix} a & b \\ b & c \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{2}\,\mathbf{x}^T\mathbf{A}\mathbf{x} \ge 0$$

which has a minimum $f(0, 0) = 0$ at $\mathbf{x}^* = [0,\;0]^T$.
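For this quadratic function the gradient and Hessian follow directly from the matrix form:

$$\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} = \begin{bmatrix} a x + b y \\ b x + c y \end{bmatrix}, \qquad \mathbf{H}(\mathbf{x}) = \mathbf{A}$$

so the Hessian is the constant matrix $\mathbf{A}$ everywhere.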

Figure 2.3: Gradient of a Quadratic Surface (figures/gradient1.eps)

We let the initial guess be $\mathbf{x}_0 = [x_0,\;y_0]^T$, at which the gradient is $\mathbf{g}_0 = \mathbf{g}(\mathbf{x}_0) = \mathbf{A}\mathbf{x}_0$. Now we compare the gradient descent method with Newton's method, in terms of the search direction $\mathbf{d}_0$ from this point and the progress of the iterations: Newton's method moves along $\mathbf{d}_0 = -\mathbf{H}^{-1}\mathbf{g}_0 = -\mathbf{A}^{-1}\mathbf{A}\mathbf{x}_0 = -\mathbf{x}_0$, which points straight at the minimum $\mathbf{x}^* = \mathbf{0}$, while the gradient descent method moves along $\mathbf{d}_0 = -\mathbf{g}_0 = -\mathbf{A}\mathbf{x}_0$, which in general does not point toward the origin.

Fig. 2.4 shows the search direction of Newton's method (red), which reaches the solution in a single step, based not only on the gradient $\mathbf{g}_0$, the local information at the point $\mathbf{x}_0$, but also on the Hessian matrix $\mathbf{H} = \mathbf{A}$, the more global information of the quadratic function representing its curvature, in comparison with the gradient descent method (blue), which is based only on the local gradient $\mathbf{g}_0$. Without the global information in $\mathbf{H}$ representing the elliptical shape of the contour lines of the quadratic function, the gradient descent method requires an iterative sequence of steps, and is therefore less effective.

Figure 2.4: Comparison of Gradient Descent and Newton's Method (figures/QuadraticContourEx2.eps)

In the gradient descent method, we also need to determine a proper step size $\delta$. If $\delta$ is too small, the iteration may converge too slowly, especially when $f(\mathbf{x})$ decreases slowly toward its minimum. On the other hand, if $\delta$ is too large while $f(\mathbf{x})$ has some rapid variations in the local region, the minimum of the function may be skipped over and the iteration may not converge. We will consider how to find the optimal step size in the next section.
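As a rough numerical illustration of this trade-off, the sketch below (the test function $f(x) = x^2$, starting point, number of steps, and the three step sizes are arbitrary choices for demonstration) runs a fixed number of gradient descent steps: the small step makes slow progress, the moderate one converges quickly, and the too-large one diverges.

```python
def run_gd(delta, x0=5.0, steps=50):
    """Fixed-step gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x = x - delta * 2.0 * x   # x_{n+1} = (1 - 2*delta) * x_n
    return x

for delta in (0.01, 0.4, 1.1):    # too small, moderate, too large
    print(f"delta = {delta}: x after 50 steps = {run_gd(delta):.4g}")
```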