In general, in an algorithm for minimizing an objective function $f({\bf x})$, such as Newton's method (Eq. (36)) and the gradient descent method (Eq. (50)), the variable ${\bf x}$ is iteratively updated to gradually reduce the value of $f({\bf x})$. Specifically, both the search direction ${\bf d}_n$ and the step size $\delta_n$ in the iteration ${\bf x}_{n+1}={\bf x}_n+\delta_n{\bf d}_n$ need to be determined so that $f({\bf x}_{n+1})$ is maximally reduced.
First we realize that the search direction ${\bf d}_n$ needs to point away from the gradient ${\bf g}_n={\bf g}({\bf x}_n)=\nabla f({\bf x}_n)$, along which $f({\bf x})$ increases most rapidly. In other words, the angle $\theta$ between ${\bf d}_n$ and ${\bf g}_n$ should be greater than $90^\circ$:

$$\cos\theta=\frac{{\bf d}_n^T{\bf g}_n}{\|{\bf d}_n\|\,\|{\bf g}_n\|}<0 \tag{51}$$

$$\mbox{i.e.,}\qquad {\bf d}_n^T{\bf g}_n<0 \tag{52}$$
Next, we need to find the optimal step size $\delta_n$ so that the function value $f({\bf x}_{n+1})=f({\bf x}_n+\delta{\bf d}_n)$ is minimized along the search direction ${\bf d}_n$. To do so, we set to zero the derivative of the function with respect to $\delta$, i.e., the directional derivative of $f$ along the direction of ${\bf d}_n$, and get the following by the chain rule:

$$\frac{d}{d\delta}f({\bf x}_n+\delta{\bf d}_n)={\bf d}_n^T\,{\bf g}({\bf x}_n+\delta{\bf d}_n)=0 \tag{53}$$
To find the actual optimal step size $\delta_n$, we need to solve this 1-D optimization problem by treating $f({\bf x}_n+\delta{\bf d}_n)$ as a function of the step size $\delta$, and approximating it by the first three terms of its Taylor series at $\delta=0$ (the Maclaurin series of the function):

$$f({\bf x}_n+\delta{\bf d}_n)\approx f({\bf x}_n)+\delta\,{\bf g}_n^T{\bf d}_n+\frac{\delta^2}{2}\,{\bf d}_n^T{\bf H}_n{\bf d}_n \tag{54}$$

where ${\bf H}_n$ is the Hessian of $f({\bf x})$ evaluated at ${\bf x}_n$. Setting the derivative of this approximation with respect to $\delta$ to zero,

$$\frac{d}{d\delta}f({\bf x}_n+\delta{\bf d}_n)\approx\frac{d}{d\delta}\left[f({\bf x}_n)+\delta\,{\bf g}_n^T{\bf d}_n+\frac{\delta^2}{2}\,{\bf d}_n^T{\bf H}_n{\bf d}_n\right] \tag{55}$$

$$={\bf g}_n^T{\bf d}_n+\delta\,{\bf d}_n^T{\bf H}_n{\bf d}_n \tag{56}$$

$$=0 \tag{57}$$

and solving for $\delta$, we get the optimal step size

$$\delta_n=-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf H}_n{\bf d}_n} \tag{58}$$
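As a concrete illustration (not from the original text), the step size of Eq. (58) can be evaluated directly once the gradient and Hessian at ${\bf x}_n$ are known; the quadratic test function and the function name below are hypothetical examples:

```python
import numpy as np

def optimal_step(g, H, d):
    """Optimal step size along direction d by Eq. (58):
    delta = -(g^T d) / (d^T H d), valid when d^T H d > 0."""
    return -(g @ d) / (d @ H @ d)

# Example: quadratic f(x) = 0.5 x^T A x - b^T x, so g(x) = A x - b and H = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
g = A @ x - b              # gradient at x
d = -g                     # e.g. the gradient descent direction
delta = optimal_step(g, A, d)
print(delta, x + delta * d)
```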
Based on this result, we can find the optimal step size for Newton's method and the gradient descent method considered previously:
For Newton's method, the search direction is ${\bf d}_n=-{\bf H}_n^{-1}{\bf g}_n$, and the optimal step size given by Eq. (58) is

$$\delta_n=-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf H}_n{\bf d}_n}=\frac{{\bf g}_n^T{\bf H}_n^{-1}{\bf g}_n}{{\bf g}_n^T{\bf H}_n^{-1}{\bf H}_n{\bf H}_n^{-1}{\bf g}_n} \tag{61}$$

$$=\frac{{\bf g}_n^T{\bf H}_n^{-1}{\bf g}_n}{{\bf g}_n^T{\bf H}_n^{-1}{\bf g}_n}=1 \tag{62}$$

i.e., the full Newton step is already optimal under the quadratic approximation.
For the gradient descent method, the search direction is ${\bf d}_n=-{\bf g}_n$, and Eq. (53) becomes:

$$\frac{d}{d\delta}f({\bf x}_n-\delta{\bf g}_n)=-{\bf g}_n^T\,{\bf g}({\bf x}_n-\delta{\bf g}_n)=0 \tag{63}$$

while the optimal step size of Eq. (58) becomes

$$\delta_n=-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf H}_n{\bf d}_n}=\frac{{\bf g}_n^T{\bf g}_n}{{\bf g}_n^T{\bf H}_n{\bf g}_n} \tag{64}$$

Given ${\bf g}_n$ as well as ${\bf H}_n$, we can find the optimal step size with complexity $O(N^2)$ for each iteration.
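The following minimal sketch applies this step size in a gradient descent loop, assuming user-supplied `grad` and `hess` callables (hypothetical names) that return ${\bf g}({\bf x})$ and ${\bf H}({\bf x})$:

```python
import numpy as np

def gradient_descent_optimal(grad, hess, x0, tol=1e-8, max_iter=100):
    """Gradient descent with the optimal step size of Eq. (64):
    delta_n = (g^T g) / (g^T H g).  Once g and H are available, the cost
    per iteration is O(N^2), dominated by the product H @ g."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)
        delta = (g @ g) / (g @ H @ g)   # Eq. (64)
        x = x - delta * g               # move against the gradient
    return x

# Example on a quadratic bowl f(x) = 0.5 x^T A x - b^T x
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
xmin = gradient_descent_optimal(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(xmin, np.linalg.solve(A, b))    # the two results should agree
```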
However, if the Hessian ${\bf H}_n$ is not available, such as in the gradient descent method, the optimal step size above cannot be computed. We can instead approximate the factor ${\bf d}_n^T{\bf H}_n{\bf d}_n$ in the third term of Eq. (54) by the finite difference of the directional derivative evaluated at two nearby points $\delta=0$ and $\delta=\varepsilon$, where $\varepsilon$ is a small value:

$$\left.\frac{d}{d\delta}f({\bf x}_n+\delta{\bf d}_n)\right|_{\delta=0}={\bf d}_n^T{\bf g}({\bf x}_n) \tag{65}$$

$$\left.\frac{d}{d\delta}f({\bf x}_n+\delta{\bf d}_n)\right|_{\delta=\varepsilon}={\bf d}_n^T{\bf g}({\bf x}_n+\varepsilon{\bf d}_n) \tag{66}$$

$${\bf d}_n^T{\bf H}_n{\bf d}_n\approx\frac{{\bf d}_n^T{\bf g}({\bf x}_n+\varepsilon{\bf d}_n)-{\bf d}_n^T{\bf g}({\bf x}_n)}{\varepsilon} \tag{67}$$

Substituting this into Eq. (58), we get the step size by the secant method:

$$\delta_n\approx-\frac{\varepsilon\,{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T\left[{\bf g}({\bf x}_n+\varepsilon{\bf d}_n)-{\bf g}({\bf x}_n)\right]} \tag{69}$$
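A rough sketch of a gradient descent iteration using the secant step size of Eq. (69); the `grad` callable, the value of $\varepsilon$, and the stopping rule are illustrative assumptions:

```python
import numpy as np

def secant_step(grad, x, d, eps=1e-4):
    """Step size along d by the secant approximation of Eq. (69):
    delta = -eps * (g(x)^T d) / ((g(x + eps*d) - g(x))^T d)."""
    g0 = grad(x)
    g1 = grad(x + eps * d)
    return -eps * (g0 @ d) / ((g1 - g0) @ d)

def gradient_descent_secant(grad, x0, tol=1e-8, max_iter=500):
    """Gradient descent where each step size comes from the secant method,
    so the Hessian is never formed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                           # steepest-descent direction
        x = x + secant_step(grad, x, d) * d
    return x

# Example: quadratic objective with gradient g(x) = A x - b
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(gradient_descent_secant(lambda x: A @ x - b, np.zeros(2)))
```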
Example:
The gradient descent method is applied to solve the same three-variable equation system previously solved by Newton's method. The step size is determined by the secant method with a small value of $\varepsilon$. The iteration from an initial guess is shown below:

(70) [iteration listing]
When it is difficult or too computationally costly to find the optimal step size along the search direction, some suboptimal step size may be acceptable, such as in the quasi-Newton methods for minimization problems. In this case, although the step size $\delta_n$ is no longer required to be such that the function value $f({\bf x}_{n+1})$ at the new position ${\bf x}_{n+1}={\bf x}_n+\delta_n{\bf d}_n$ is minimized along the search direction ${\bf d}_n$, the step size still has to satisfy the following Wolfe conditions:

$$f({\bf x}_n+\delta_n{\bf d}_n)\le f({\bf x}_n)+c_1\delta_n\,{\bf d}_n^T{\bf g}_n \tag{71}$$

$${\bf d}_n^T{\bf g}({\bf x}_n+\delta_n{\bf d}_n)\ge c_2\,{\bf d}_n^T{\bf g}_n \tag{72}$$

$$\mbox{where}\qquad 0<c_1<c_2<1 \tag{73}$$
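For illustration, the two conditions of Eqs. (71) and (72) can be checked directly for a candidate step size; the constants $c_1=10^{-4}$ and $c_2=0.9$ below are commonly used defaults, not values specified in the text:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, delta, c1=1e-4, c2=0.9):
    """Return True if step size `delta` along direction `d` satisfies
    both Wolfe conditions (Eqs. (71) and (72)) at the point x."""
    g0_d = grad(x) @ d                    # slope at the old position (negative)
    x_new = x + delta * d
    sufficient_decrease = f(x_new) <= f(x) + c1 * delta * g0_d    # Eq. (71)
    curvature = grad(x_new) @ d >= c2 * g0_d                      # Eq. (72)
    return sufficient_decrease and curvature

# Example on f(x) = x^T x, stepping along the negative gradient
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 1.0]); d = -grad(x)
print(satisfies_wolfe(f, grad, x, d, 0.25), satisfies_wolfe(f, grad, x, d, 1.0))
```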
In general, these conditions are motivated by the desired effect that after each iterative step, the function should have a shallower slope along ${\bf d}_n$, as well as a lower value, so that eventually the solution can be approached where $f({\bf x})$ reaches its minimum and the gradient is zero.
Specifically, to understand the first condition above, we represent the function to be minimized as a single-variable function $\varphi(\delta)=f({\bf x}_n+\delta{\bf d}_n)$ of the step size $\delta$, and its tangent line at the point $\delta=0$ as a linear function $l(\delta)=\varphi(0)+\delta\,\varphi'(0)$, where the intercept $\varphi(0)$ can be found as $f({\bf x}_n)$, and the slope $\varphi'(0)$ can be found as the derivative of $\varphi(\delta)$ at $\delta=0$:

$$\varphi(0)=f({\bf x}_n) \tag{74}$$

$$\varphi'(0)=\left.\frac{d}{d\delta}f({\bf x}_n+\delta{\bf d}_n)\right|_{\delta=0}={\bf d}_n^T{\bf g}_n \tag{75}$$

$$l(\delta)=f({\bf x}_n)+\delta\,{\bf d}_n^T{\bf g}_n \tag{76}$$

As $0<c_1<1$ and ${\bf d}_n^T{\bf g}_n<0$, the first condition requires the new function value $\varphi(\delta_n)=f({\bf x}_n+\delta_n{\bf d}_n)$ to lie below the relaxed tangent line $f({\bf x}_n)+c_1\delta_n\,{\bf d}_n^T{\bf g}_n$, i.e., the reduction of the function value must be at least proportional to the step size.
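This tangent-line picture suggests a simple backtracking strategy, sketched below, that repeatedly shrinks a trial step until the sufficient-decrease condition of Eq. (71) holds; the initial step, the shrink factor, and $c_1$ are illustrative choices, not values prescribed by the text:

```python
import numpy as np

def backtracking_step(f, grad, x, d, delta0=1.0, c1=1e-4, shrink=0.5):
    """Shrink the trial step size until the sufficient-decrease condition
    of Eq. (71) holds: f(x + delta d) <= f(x) + c1 * delta * d^T g."""
    slope = grad(x) @ d                  # d^T g_n, negative for a descent direction
    delta = delta0
    while f(x + delta * d) > f(x) + c1 * delta * slope:
        delta *= shrink                  # step fails Eq. (71): try a shorter one
    return delta

# Example on f(x) = x^T x, stepping along the negative gradient
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([2.0, -1.0])
print(backtracking_step(f, grad, x, -grad(x)))
```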
The second condition requires that at the new position ${\bf x}_{n+1}={\bf x}_n+\delta_n{\bf d}_n$ the slope of the function along the search direction ${\bf d}_n$, i.e., the directional derivative ${\bf d}_n^T{\bf g}({\bf x}_{n+1})$, be sufficiently reduced in magnitude to be less than a specified fraction (determined by $c_2$) of the slope ${\bf d}_n^T{\bf g}_n$ at the old position ${\bf x}_n$.
The reason why ${\bf d}_n^T{\bf g}_{n+1}=0$ when the optimal step size is used can also be explained geometrically. As shown in the figure, the gradient vectors at various points along the direction of ${\bf d}_n$ are shown by the blue arrows, and their projections onto the direction of ${\bf d}_n$ represent the slopes of the function $f({\bf x}_n+\delta{\bf d}_n)$. Obviously at the point $\delta=\delta_n$ where $f({\bf x}_n+\delta{\bf d}_n)$ reaches its minimum, its slope is zero. In other words, the projection of the gradient ${\bf g}_{n+1}={\bf g}({\bf x}_n+\delta_n{\bf d}_n)$ onto the direction of ${\bf d}_n$ is zero, i.e., ${\bf d}_n^T{\bf g}_{n+1}=0$, or ${\bf d}_n\perp{\bf g}_{n+1}$.
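This orthogonality can also be checked numerically; the sketch below takes one gradient descent step with the optimal step size of Eq. (64) on a quadratic function (for which the quadratic approximation is exact) and verifies that the new gradient is perpendicular to the search direction:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x, gradient g(x) = A x - b, Hessian A
A = np.array([[5.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, -1.0])

x = np.array([4.0, 4.0])
g = A @ x - b                          # gradient at the current position
d = -g                                 # steepest-descent direction
delta = (g @ g) / (g @ A @ g)          # optimal step size, Eq. (64)
g_new = A @ (x + delta * d) - b        # gradient at the new position

print(d @ g_new)   # ~0: the new gradient is perpendicular to d
```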
The gradient descent method gradually approaches a solution of an N-D
minimization problem by moving from the initial guess ${\bf x}_0$ along a zigzag
path composed of a set of segments with any two consecutive segments
perpendicular to each other. The number of steps depends greatly on the initial
guess. As illustrated in the example below in an N=2 dimensional case, the best
possible case is that the solution happens to be on the gradient direction of the
initial guess, which could be reached in a single step, while the worst possible
case is that the gradient direction at the initial guess happens to be 45 degrees
away from that of the best case, and it takes many zigzag steps back and forth
around the direct path to reach the solution. Many of these steps
are in the same direction as some of the previous steps.
To improve the performance of the gradient descent method, we can include in the iteration a momentum term representing the search direction previously traversed:

$${\bf x}_{n+1}={\bf x}_n+\delta_n{\bf d}_n+\alpha\,({\bf x}_n-{\bf x}_{n-1}),\qquad 0<\alpha<1 \tag{77}$$
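A minimal sketch of gradient descent with a momentum term in the spirit of Eq. (77); the fixed step size and the momentum weight $\alpha$ used below are illustrative choices, not values from the text:

```python
import numpy as np

def gradient_descent_momentum(grad, x0, delta=0.01, alpha=0.9,
                              tol=1e-8, max_iter=10000):
    """Gradient descent with a momentum term (cf. Eq. (77)):
    x_{n+1} = x_n - delta * g_n + alpha * (x_n - x_{n-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x, x_prev = x - delta * g + alpha * (x - x_prev), x
    return x

# Example on the quadratic f(x) = 0.5 x^T A x - b^T x
A = np.array([[10.0, 0.0], [0.0, 1.0]])   # poorly scaled, zigzag-prone
b = np.array([1.0, 1.0])
print(gradient_descent_momentum(lambda x: A @ x - b, np.zeros(2)))
```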
Obviously it is most desirable not to repeat any of the previous directions traveled so that the solution can be reached in N steps, each in a unique direction in the N-D space. In other words, the subsequent steps are independent of each other, never interfering with the results achieved in the previous steps. Such a method will be discussed in the next section.
Example:
The Rosenbrock function $f(x,y)=(1-x)^2+100\,(y-x^2)^2$ has its global minimum at $(x,y)=(1,1)$.
The figure below shows the search path of the gradient descent method based on the optimal step size given in Eq. (64). We see that the search path is composed of a long sequence of 90-degree turns between consecutive segments.
When Newton's method is applied to this minimization problem, it takes only four iterations for the algorithm to converge to the minimum, as shown in the figure below.
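For reference, the sketch below compares the two approaches on the Rosenbrock function; the starting point is an arbitrary choice, and a backtracking line search stands in for the exact line search of Eq. (64) used for the figure:

```python
import numpy as np

def f(p):
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def hess(p):
    x, y = p
    return np.array([[2 - 400 * (y - 3 * x ** 2), -400 * x],
                     [-400 * x, 200.0]])

# Gradient descent with a backtracking line search (a stand-in for the
# exact line search): slow zigzag progress along the curved valley.
x = np.array([-1.0, 1.0])                     # illustrative starting point
for _ in range(5000):
    g = grad(x)
    delta = 1.0
    while f(x - delta * g) > f(x) - 1e-4 * delta * (g @ g):
        delta *= 0.5
    x = x - delta * g
print("gradient descent:", x, "f =", f(x))

# Newton's method: x <- x - H^{-1} g reaches the minimum (1, 1) in a few steps.
x = np.array([-1.0, 1.0])
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))
print("Newton:", x, "f =", f(x))
```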