Gradient Descent Method
Newton's method discussed above is based on the gradient $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})$ and the Hessian $\mathbf{H}(\mathbf{x})$, representing respectively the local variation and the more global curvature of the function $f(\mathbf{x})$ to be minimized. The method is not applicable if the Hessian $\mathbf{H}$ is not available, either because the analytical form of the function is not known or because the cost of computing the inverse $\mathbf{H}^{-1}$ is too high. In such cases, we can use the gradient descent method, which is based only on the gradient and no longer requires the Hessian.
We first consider the minimization of a single-variable function $f(x)$. From any initial point $x_0$, we can move to a nearby point $x_1$ along the opposite direction of the gradient, so that the function value will be smaller than that at the current position $x_0$:
$$x_1 = x_0 - \delta\, f'(x_0) \qquad\qquad (25)$$
We see that no matter whether $f'(x_0)$ is positive or negative, the function value $f(x_1)$, approximated by the first two terms of its Taylor series, is always reduced if the positive step size $\delta$ is small enough:
$$f(x_1) = f\big(x_0 - \delta f'(x_0)\big) \approx f(x_0) - \delta\,[f'(x_0)]^2 \le f(x_0) \qquad\qquad (26)$$
This process can be carried out iteratively
$$x_{n+1} = x_n - \delta_n\, f'(x_n), \qquad n = 0, 1, 2, \ldots \qquad\qquad (27)$$
until eventually reaching a point $x^*$ at which $f'(x^*) = 0$ and no further progress can be made, i.e., a local minimum of the function is reached.
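As a minimal sketch of the 1-D iteration in Eq. (27), the following Python code applies it to a sample function; the derivative df, the step size delta, the tolerance tol, and the example function are all illustrative choices, not taken from the text above.
\begin{verbatim}
def gradient_descent_1d(df, x0, delta=0.1, tol=1e-8, max_iter=1000):
    """Iterate x_{n+1} = x_n - delta * f'(x_n) until |f'(x)| is small."""
    x = x0
    for _ in range(max_iter):
        g = df(x)               # derivative f'(x_n)
        if abs(g) < tol:        # f'(x*) ~ 0: a local minimum is reached
            break
        x = x - delta * g       # move against the derivative (Eq. 27)
    return x

# Example: f(x) = (x - 2)^2 + 1, with f'(x) = 2(x - 2); minimum at x = 2
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 2.0), x0=10.0, delta=0.1)
print(x_min)   # approximately 2.0
\end{verbatim}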
This simple method can be generalized to minimize a multi-variable objective function $f(\mathbf{x}) = f(x_1, \ldots, x_N)$ in N-D space. The derivative of the 1-D case is generalized to the gradient vector $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})$ of the function $f(\mathbf{x})$, which points in the direction along which the function increases most rapidly with the steepest slope, perpendicular to the contours or iso-lines of the function $f(\mathbf{x})$. The fastest way to reduce $f(\mathbf{x})$ is therefore to go downhill along the opposite direction of the gradient vector.
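This can be seen from the first-order Taylor expansion: for a small step of size $\delta > 0$ along a unit direction $\mathbf{d}$,
$$f(\mathbf{x} + \delta\mathbf{d}) \approx f(\mathbf{x}) + \delta\,\mathbf{g}(\mathbf{x})^T\mathbf{d}, \qquad \mathbf{g}^T\mathbf{d} \ge -\|\mathbf{g}\|\,\|\mathbf{d}\| = -\|\mathbf{g}\|,$$
by the Cauchy-Schwarz inequality, with equality when $\mathbf{d} = -\mathbf{g}/\|\mathbf{g}\|$; the negative gradient direction therefore gives the greatest first-order decrease of $f$.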
Specifically, the gradient descent method (also called the steepest descent or downhill method) carries out the iteration $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n \mathbf{d}_n$, which can be considered as the first two terms of the Taylor series of $f(\mathbf{x}_{n+1})$ with the search direction $\mathbf{d}_n = -\mathbf{g}_n = -\nabla f(\mathbf{x}_n)$:
$$f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n + \delta_n \mathbf{d}_n) \approx f(\mathbf{x}_n) + \delta_n\, \mathbf{g}_n^T \mathbf{d}_n = f(\mathbf{x}_n) - \delta_n \|\mathbf{g}_n\|^2 \le f(\mathbf{x}_n) \qquad\qquad (28)$$
As the function value is always reduced by this iteration, it can be carried out repeatedly
$$\mathbf{x}_{n+1} = \mathbf{x}_n - \delta_n \mathbf{g}_n = \mathbf{x}_n - \delta_n \nabla f(\mathbf{x}_n), \qquad n = 0, 1, 2, \ldots \qquad\qquad (29)$$
until eventually reaching a point $\mathbf{x}^*$ at which $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $f(\mathbf{x})$ is minimized.
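A minimal sketch of the N-D iteration in Eq. (29), assuming the gradient is available as a function; the fixed step size, the stopping tolerance, and the example function are illustrative choices.
\begin{verbatim}
import numpy as np

def gradient_descent(grad, x0, delta=0.1, tol=1e-8, max_iter=10000):
    """Iterate x_{n+1} = x_n - delta * grad(x_n) until the gradient is ~ 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                       # gradient vector g_n
        if np.linalg.norm(g) < tol:       # grad f(x*) ~ 0: minimum reached
            break
        x = x - delta * g                 # steepest-descent step (Eq. 29)
    return x

# Example: f(x) = x1^2 + 3*x2^2, grad f = [2*x1, 6*x2]; minimum at the origin
x_min = gradient_descent(lambda x: np.array([2.0 * x[0], 6.0 * x[1]]),
                         x0=[4.0, -2.0], delta=0.1)
print(x_min)   # approximately [0, 0]
\end{verbatim}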
Comparing the iteration of Eq. (29) with Newton's method, in which the search direction $\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$ is based on both $\mathbf{g}_n$ and $\mathbf{H}_n$, we see that in the gradient descent method the search direction $\mathbf{d}_n = -\mathbf{g}_n$ relies only on the local information contained in the gradient $\mathbf{g}_n$, without the more global information contained in the second-order Hessian $\mathbf{H}_n$; i.e., the gradient method does not have as much information as Newton's method and therefore may not be as efficient. For example, when the function is quadratic, as discussed before, Newton's method can find the solution in a single step from any initial guess, but the gradient descent method takes multiple steps to reach the solution, as it always follows the negative direction of the local gradient, which in general does not point directly toward the solution. However, the gradient descent method is computationally less expensive, as the Hessian matrix is not needed.
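To see why a single Newton step suffices for a quadratic, take the standard form $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $\mathbf{A}$ symmetric positive definite (a form chosen here only for illustration):
$$\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}, \qquad \mathbf{H} = \mathbf{A}, \qquad \mathbf{x}_1 = \mathbf{x}_0 - \mathbf{H}^{-1}\mathbf{g}(\mathbf{x}_0) = \mathbf{x}_0 - \mathbf{A}^{-1}(\mathbf{A}\mathbf{x}_0 - \mathbf{b}) = \mathbf{A}^{-1}\mathbf{b},$$
which is exactly the minimizer, reached from any $\mathbf{x}_0$. The gradient descent step $-\delta_0(\mathbf{A}\mathbf{x}_0 - \mathbf{b})$, in contrast, points toward $\mathbf{A}^{-1}\mathbf{b}$ only when $\mathbf{x}_0 - \mathbf{A}^{-1}\mathbf{b}$ happens to be an eigenvector of $\mathbf{A}$.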
Example 1:
Consider a two-variable quadratic function in the following general form:
$$f(\mathbf{x}) = f(x, y) = \frac{1}{2}\,\mathbf{x}^T\mathbf{A}\,\mathbf{x} - \mathbf{b}^T\mathbf{x}$$
If
$$\mathbf{A} = \begin{bmatrix} a & b \\ b & c \end{bmatrix}$$
is a symmetric positive semidefinite matrix, i.e., $\mathbf{A}^T = \mathbf{A}$ and $\mathbf{x}^T\mathbf{A}\mathbf{x} \ge 0$ for all $\mathbf{x}$, then the function becomes
$$f(x, y) = \frac{1}{2}\left(a x^2 + 2b\,xy + c y^2\right) - b_1 x - b_2 y,$$
which has a minimum $f(\mathbf{x}^*) = -\frac{1}{2}\,\mathbf{b}^T\mathbf{x}^*$ at the point $\mathbf{x}^*$ satisfying $\mathbf{A}\mathbf{x}^* = \mathbf{b}$, i.e., $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$ when $\mathbf{A}$ is invertible.
Figure 2.3: Gradient of a Quadratic Surface (figures/gradient1.eps)
We let the initial guess be $\mathbf{x}_0$, at which the gradient is $\mathbf{g}_0 = \mathbf{g}(\mathbf{x}_0) = \mathbf{A}\mathbf{x}_0 - \mathbf{b}$. Now we compare the gradient descent method with Newton's method, in terms of the search direction $\mathbf{d}_0$ and the progress of the iterations. Fig. 2.4 shows the search direction of Newton's method (red), which reaches the solution in a single step, based on not only the gradient $\mathbf{g}_0$, the local information at the point $\mathbf{x}_0$, but also the Hessian matrix $\mathbf{H} = \mathbf{A}$, the more global information of the quadratic function representing its curvature, in comparison to the gradient descent method (blue), based only on the local gradient $\mathbf{g}_0$. Without the global information in $\mathbf{H}$ representing the elliptical shape of the contour lines of the quadratic function, the gradient descent method requires multiple iterations and is therefore less effective.
Figure 2.4: Comparison of Gradient Descent and Newton's Method (figures/QuadraticContourEx2.eps)
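A sketch of this comparison in code; the values of $\mathbf{A}$, $\mathbf{b}$, $\mathbf{x}_0$, the step size, and the tolerance are chosen arbitrarily for illustration and are not the values used to produce the figures.
\begin{verbatim}
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite
b = np.array([1.0, 1.0])
x0 = np.array([3.0, -2.0])

def grad(x):
    return A @ x - b                # g(x) = A x - b

x_star = np.linalg.solve(A, b)      # true minimizer A^{-1} b

# Newton's method: one step from any starting point
x_newton = x0 - np.linalg.solve(A, grad(x0))
print("Newton after 1 step:", x_newton, "(exact:", x_star, ")")

# Gradient descent: many small steps along -g
x, delta, steps = x0.copy(), 0.1, 0
while np.linalg.norm(grad(x)) > 1e-8:
    x = x - delta * grad(x)
    steps += 1
print("Gradient descent   :", x, "after", steps, "steps")
\end{verbatim}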
In the gradient descent method, we also need to determine a proper step size $\delta_n$. If $\delta_n$ is too small, the iteration may converge too slowly, especially when $f(\mathbf{x})$ decreases only slowly toward its minimum. On the other hand, if $\delta_n$ is too large while $f(\mathbf{x})$ has some rapid variations in the local region, the minimum of the function may be skipped over and the iteration may not converge, as the simple example below illustrates. We will consider how to find the optimal step size in the next section.
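As a simple illustration of this trade-off, consider the 1-D quadratic $f(x) = \frac{a}{2}x^2$ with $a > 0$ and a fixed step size $\delta$:
$$x_{n+1} = x_n - \delta f'(x_n) = (1 - \delta a)\,x_n \quad\Longrightarrow\quad x_n = (1 - \delta a)^n x_0,$$
so the iteration converges to the minimum $x^* = 0$ only if $|1 - \delta a| < 1$, i.e., $0 < \delta < 2/a$; a very small $\delta$ converges slowly, while $\delta > 2/a$ overshoots the minimum and diverges.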