Newton's method discussed above is based on the Hessian matrix $\mathbf{H}$ and the gradient $\mathbf{g} = \nabla f(\mathbf{x})$ of the function $f(\mathbf{x})$ to be minimized. However, the method is not applicable if the Hessian is not available, or if the cost of computing its inverse $\mathbf{H}^{-1}$ is too high. In such cases, the gradient descent method can be used, as it does not require the Hessian matrix.
We first consider the minimization of a single-variable function $f(x)$. From any initial point $x_0$, we can move to a nearby point along the direction opposite to the derivative, so that the function value is smaller than that at the current position $x_n$:

$$\Delta x = -\delta\, f'(x_n), \qquad \delta > 0 \qquad (46)$$

$$x_{n+1} = x_n + \Delta x = x_n - \delta\, f'(x_n) \qquad (47)$$

$$f(x_{n+1}) \approx f(x_n) + f'(x_n)\,\Delta x = f(x_n) - \delta\,[f'(x_n)]^2 \le f(x_n) \qquad (48)$$
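As a minimal sketch of this one-dimensional iteration, the following Python snippet applies the update $x_{n+1} = x_n - \delta f'(x_n)$ with a fixed step size to a hypothetical example function $f(x) = x^4 - 3x^3 + 2$ (chosen only for illustration, not taken from the text):

```python
# Minimal sketch of 1-D gradient descent with a fixed step size delta.
# The objective f and its derivative df are hypothetical examples.

def f(x):
    return x**4 - 3*x**3 + 2      # example objective (not from the text)

def df(x):
    return 4*x**3 - 9*x**2        # its derivative f'(x)

def gradient_descent_1d(x0, delta=0.01, n_iter=100):
    x = x0
    for _ in range(n_iter):
        x = x - delta * df(x)     # x_{n+1} = x_n - delta * f'(x_n)
    return x

print(gradient_descent_1d(x0=4.0))   # approaches the minimizer x = 9/4 = 2.25
```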
This simple method can be generalized to minimize a multi-variable objective function $f(\mathbf{x})$ in N-D space. The derivative of the 1-D case is generalized to the gradient vector $\mathbf{g} = \nabla f(\mathbf{x})$ of the function $f(\mathbf{x})$, which points in the direction along which the function increases most rapidly with the steepest slope, perpendicular to the contours or iso-lines of the function. The fastest way to reduce $f(\mathbf{x})$ is therefore to go downhill along the direction opposite to the gradient vector. Specifically, the gradient descent method (also called the steepest descent or downhill method) carries out the following approximation (the first two terms of the Taylor series) with $\Delta\mathbf{x}_n = -\delta\,\mathbf{g}_n$, i.e., $\mathbf{x}_{n+1} = \mathbf{x}_n - \delta\,\mathbf{g}_n$:

$$f(\mathbf{x}_{n+1}) = f(\mathbf{x}_n + \Delta\mathbf{x}_n) \approx f(\mathbf{x}_n) + \mathbf{g}_n^T \Delta\mathbf{x}_n = f(\mathbf{x}_n) - \delta\,\|\mathbf{g}_n\|^2 \le f(\mathbf{x}_n) \qquad (49)$$
Comparing the gradient descent method with Newton's method, we see that here the Hessian matrix is no longer needed. The iteration simply follows the search direction $\mathbf{d}_n = -\mathbf{g}_n$, which is different from the search direction of Newton's method, $\mathbf{d}_n = -\mathbf{H}_n^{-1}\mathbf{g}_n$, based on the Hessian $\mathbf{H}_n$ as well as the gradient $\mathbf{g}_n$. In particular, when $\mathbf{H}_n = \mathbf{I}$, the two methods become the same.
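To make the two search directions concrete, the sketch below evaluates $-\mathbf{g}$ and $-\mathbf{H}^{-1}\mathbf{g}$ at the same point for a hypothetical example function $f(x_1, x_2) = x_1^2 + 4x_2^2$ (an assumption for illustration); the Newton direction points straight at the minimum at the origin, while the negative gradient generally does not:

```python
# Sketch: gradient descent vs. Newton search direction at a point x,
# assuming the gradient and Hessian of f(x1, x2) = x1^2 + 4*x2^2 (example).
import numpy as np

def grad(x):
    return np.array([2*x[0], 8*x[1]])      # gradient of the example function

def hessian(x):
    return np.array([[2.0, 0.0],
                     [0.0, 8.0]])          # Hessian of the example function

x = np.array([2.0, 1.0])
g = grad(x)

d_gradient = -g                               # gradient descent: d = -g
d_newton = -np.linalg.solve(hessian(x), g)    # Newton: d = -H^{-1} g

print(d_gradient)   # [-4. -8.]
print(d_newton)     # [-2. -1.], pointing from (2, 1) straight toward the minimum (0, 0)
```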
As the gradient descent method relies only on the gradient vector of the objective function without any information contained in the second order derivatives in the Hessian matrix, it does not have as much information as Newton's method and therefore may not be as efficient. For example, when the function is quadratic, as discussed before, Newton's method can find the solution in a single step from any initial guess, but it may take the gradient descent method many steps to reach the solution, because it always follows the negative direction of the local gradient, which typically does not lead to the solution directly. However, for the same reason, the gradient descent method is computationally less expensive as the Hessian matrix is not used.
Example:
Consider a two-variable quadratic function in the following general form:

$$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A}\mathbf{x} + \mathbf{b}^T\mathbf{x} + c$$

We assume the initial guess is $\mathbf{x}_0$, at which the gradient is $\mathbf{g}_0 = \nabla f(\mathbf{x}_0) = \mathbf{A}\mathbf{x}_0 + \mathbf{b}$. Now we compare the gradient descent method with Newton's method, in terms of the search direction and the progress of the iterations:

$$\text{gradient descent:}\;\; \mathbf{x}_1 = \mathbf{x}_0 - \delta\,\mathbf{g}_0, \qquad \text{Newton:}\;\; \mathbf{x}_1 = \mathbf{x}_0 - \mathbf{A}^{-1}\mathbf{g}_0 = -\mathbf{A}^{-1}\mathbf{b}$$
The figure below shows the search direction of Newton's method (red), which reaches the solution in a single step, based not only on the gradient $\mathbf{g}_0$, the local information at the point $\mathbf{x}_0$, but also on the Hessian matrix $\mathbf{A}$, the global information about the quadratic function, in comparison with the gradient descent method (blue), based only on the local gradient $\mathbf{g}_0$. Without the global information representing the elliptical shape of the contour lines of the quadratic function, the gradient descent method requires many iterations, and is therefore less efficient.
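The same comparison can be carried out numerically. The following sketch uses an illustrative quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{b}^T\mathbf{x}$ with assumed values of $\mathbf{A}$ and $\mathbf{b}$ (not from the text): Newton's single step lands exactly on the minimizer, while fixed-step gradient descent needs many iterations to reach the same point.

```python
# Sketch: Newton's method vs. gradient descent on a 2-variable quadratic
# f(x) = 1/2 x^T A x + b^T x, with illustrative A and b (not from the text).
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])           # symmetric positive definite Hessian
b = np.array([-1.0, -2.0])

def grad(x):
    return A @ x + b                 # gradient of the quadratic

x0 = np.array([3.0, 3.0])            # initial guess
x_star = np.linalg.solve(A, -b)      # true minimizer, where the gradient vanishes

# Newton's method: a single step is exact for a quadratic function.
x_newton = x0 - np.linalg.solve(A, grad(x0))
print(np.allclose(x_newton, x_star))          # True

# Gradient descent with a fixed step size: many steps along -g.
x = x0.copy()
delta = 0.1
steps = 0
while np.linalg.norm(grad(x)) > 1e-8:
    x = x - delta * grad(x)
    steps += 1
print(steps, np.allclose(x, x_star))          # on the order of 70 steps here, True
```

Both methods end at the same minimizer, but Newton's step uses the curvature encoded in $\mathbf{A}$ to get there directly, which is exactly the contrast shown in the figure.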
In the gradient descent method, we also need to determine a proper step size $\delta$. If $\delta$ is too small, the iteration may converge too slowly, especially where $f(\mathbf{x})$ decreases slowly toward its minimum. On the other hand, if $\delta$ is too large while $f(\mathbf{x})$ varies rapidly in the local region, the minimum of the function may be skipped over and the iteration may not converge. We will consider how to find the optimal step size in the next section.
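As a quick illustration of this sensitivity before moving on, the sketch below runs three fixed step sizes on the hypothetical 1-D example $f(x) = x^2$, for which $f'(x) = 2x$: a very small $\delta$ barely makes progress, a moderate $\delta$ converges quickly, and a $\delta$ that is too large makes the iteration diverge.

```python
# Sketch: effect of the step size delta on 1-D gradient descent for f(x) = x^2
# (hypothetical example; the update is x <- x - delta * 2x = (1 - 2*delta) * x).

def run(delta, x0=1.0, n_iter=50):
    x = x0
    for _ in range(n_iter):
        x = x - delta * 2 * x     # x_{n+1} = x_n - delta * f'(x_n)
    return x

print(run(0.001))   # ~0.90   : too small, still far from the minimum at x = 0
print(run(0.1))     # ~1.4e-05: converges quickly
print(run(1.1))     # ~9.1e+03: too large, the iteration diverges
```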