In general, in an algorithm for minimizing an objective function $f(\mathbf{x})$, such as Newton's method (Eq. (36)) and the gradient descent method (Eq. (50)), the variable $\mathbf{x}$ is iteratively updated by $\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\,\mathbf{d}_n$ to gradually reduce the value of $f(\mathbf{x})$. Specifically, both the search direction $\mathbf{d}_n$ and the step size $\delta_n$ in the iteration need to be determined so that $f(\mathbf{x})$ is maximally reduced.
First we note that the search direction $\mathbf{d}_n$ needs to point away from the gradient $\mathbf{g}_n=\mathbf{g}(\mathbf{x}_n)=\nabla f(\mathbf{x}_n)$, along which $f(\mathbf{x})$ increases most rapidly. In other words, the angle $\theta$ between $\mathbf{d}_n$ and $\mathbf{g}_n$ should be greater than $90^\circ$:
$$\cos\theta=\frac{\mathbf{g}_n^T\mathbf{d}_n}{\lVert\mathbf{g}_n\rVert\,\lVert\mathbf{d}_n\rVert}<0 \tag{51}$$
i.e.,
$$\mathbf{g}_n^T\mathbf{d}_n<0 \tag{52}$$
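For example, the steepest-descent choice $\mathbf{d}_n=-\mathbf{g}_n$ (used below by the gradient descent method) always satisfies this condition, since
$$\mathbf{g}_n^T\mathbf{d}_n=-\mathbf{g}_n^T\mathbf{g}_n=-\lVert\mathbf{g}_n\rVert^2<0$$
whenever $\mathbf{g}_n\ne\mathbf{0}$.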
Next, we need to find the optimal step size $\delta_n$ so that the function value $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$ is minimized along the search direction $\mathbf{d}_n$. To do so, we set to zero the derivative of the function with respect to $\delta$, the directional derivative along the direction of $\mathbf{d}_n$, and get the following by the chain rule:
$$\frac{d}{d\delta}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)=\mathbf{g}^T(\mathbf{x}_n+\delta\,\mathbf{d}_n)\,\mathbf{d}_n=0 \tag{53}$$
This result indicates that the gradient $\mathbf{g}(\mathbf{x}_{n+1})$ at the next point should be perpendicular to the search direction $\mathbf{d}_n$. In other words, when traversing along $\mathbf{d}_n$, we should stop at the point at which the gradient is perpendicular to $\mathbf{d}_n$, i.e., it has zero component along $\mathbf{d}_n$, and the corresponding $\delta$ is the optimal step size $\delta_n$. To find the actual optimal step size $\delta_n$, we need to solve a 1-D optimization problem, treating $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$ as a function of the step size $\delta$ as the variable, and approximate it by the first three terms of its Taylor series at $\delta=0$ (the Maclaurin series of this single-variable function of $\delta$):
$$f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\approx f(0)+\delta\,f'(0)+\frac{\delta^2}{2}\,f''(0) \tag{54}$$
where
$$f(0)=f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=0}=f(\mathbf{x}_n) \tag{55}$$
$$f'(0)=\frac{d}{d\delta}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=0}=\mathbf{g}_n^T\mathbf{d}_n \tag{56}$$
$$f''(0)=\frac{d^2}{d\delta^2}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=0}=\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n \tag{57}$$
Setting the derivative of this quadratic approximation with respect to $\delta$ to zero, we get the optimal step size
$$\delta_n=-\frac{f'(0)}{f''(0)}=-\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n} \tag{58}$$
where $\mathbf{H}_n=\mathbf{H}(\mathbf{x}_n)$ is the Hessian of $f(\mathbf{x})$ at $\mathbf{x}_n$.
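As a minimal numerical sketch of Eq. (58) (the helper name `optimal_step_size` and the quadratic test problem are illustrative assumptions, not from the text):

```python
import numpy as np

def optimal_step_size(g, H, d):
    """Optimal step size of Eq. (58): delta = -(g^T d) / (d^T H d),
    obtained from the quadratic (second-order Taylor) model of f along d."""
    return -float(g @ d) / float(d @ H @ d)

# Illustrative quadratic objective f(x) = 0.5 x^T A x - b^T x, whose Hessian is A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # symmetric positive definite
b = np.array([1.0, 2.0])
x = np.zeros(2)

g = A @ x - b                       # gradient of f at x
d = -g                              # steepest-descent direction, so g^T d < 0
delta = optimal_step_size(g, A, d)
x_new = x + delta * d               # line-minimizing step along d
print(delta, x_new)
```

For a quadratic objective the second-order model is exact, so this step lands exactly at the minimizer along $\mathbf{d}_n$; for a general nonlinear $f(\mathbf{x})$ it is only an approximation.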
Based on this result, we can find the optimal step size for Newton's method and the gradient descent method considered previously:
Newton's method: The search direction is $\mathbf{d}_n=-\mathbf{H}_n^{-1}\mathbf{g}_n$, and the optimal step size is
$$\delta_n=-\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n}
=\frac{\mathbf{g}_n^T\mathbf{H}_n^{-1}\mathbf{g}_n}{(\mathbf{H}_n^{-1}\mathbf{g}_n)^T\mathbf{H}_n(\mathbf{H}_n^{-1}\mathbf{g}_n)}
=\frac{\mathbf{g}_n^T\mathbf{H}_n^{-1}\mathbf{g}_n}{\mathbf{g}_n^T\mathbf{H}_n^{-1}\mathbf{g}_n}=1 \tag{61}$$
so that the iteration is simply Newton's method of Eq. (36):
$$\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\,\mathbf{d}_n=\mathbf{x}_n-\mathbf{H}_n^{-1}\mathbf{g}_n \tag{62}$$
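A quick numeric sanity check of this result (an illustrative sketch assuming NumPy; the random symmetric positive definite "Hessian" and "gradient" below are made up for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
H = M @ M.T + 3.0 * np.eye(3)     # a symmetric positive definite "Hessian"
g = rng.standard_normal(3)        # an arbitrary "gradient"

d = -np.linalg.solve(H, g)        # Newton search direction -H^{-1} g
delta = -(g @ d) / (d @ H @ d)    # optimal step size of Eq. (58)
print(delta)                      # 1.0 up to round-off error
```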
Gradient descent method: The search direction is $\mathbf{d}_n=-\mathbf{g}_n$, and Eq. (53) becomes
$$\mathbf{g}^T(\mathbf{x}_n-\delta\,\mathbf{g}_n)\,\mathbf{g}_n\approx(\mathbf{g}_n-\delta\,\mathbf{H}_n\mathbf{g}_n)^T\mathbf{g}_n=0$$
i.e.,
$$\delta_n=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_n^T\mathbf{H}_n\mathbf{g}_n} \tag{63}$$
Given $\mathbf{g}_n$ as well as $\mathbf{H}_n$, we can find the optimal step size with complexity $O(N^2)$ for each iteration.
However, if the Hessian $\mathbf{H}_n$ is not available, such as in the gradient descent method, the optimal step size above cannot be computed. We can instead approximate $\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n$ in the third term of Eq. (54) using the gradients at two nearby points $\mathbf{x}_n$ (at $\delta=0$) and $\mathbf{x}_n+\varepsilon\,\mathbf{d}_n$ (at $\delta=\varepsilon$), where $\varepsilon$ is a small value:
$$f'(0)=\frac{d}{d\delta}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=0}=\mathbf{g}^T(\mathbf{x}_n)\,\mathbf{d}_n \tag{65}$$
$$f'(\varepsilon)=\frac{d}{d\delta}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=\varepsilon}=\mathbf{g}^T(\mathbf{x}_n+\varepsilon\,\mathbf{d}_n)\,\mathbf{d}_n \tag{66}$$
$$f''(0)=\mathbf{d}_n^T\mathbf{H}_n\mathbf{d}_n\approx\frac{f'(\varepsilon)-f'(0)}{\varepsilon}
=\frac{\left[\mathbf{g}(\mathbf{x}_n+\varepsilon\,\mathbf{d}_n)-\mathbf{g}(\mathbf{x}_n)\right]^T\mathbf{d}_n}{\varepsilon} \tag{67}$$
Substituting this into Eq. (58), we get the step size of the secant method:
$$\delta_n\approx-\frac{\varepsilon\,\mathbf{g}^T(\mathbf{x}_n)\,\mathbf{d}_n}{\left[\mathbf{g}(\mathbf{x}_n+\varepsilon\,\mathbf{d}_n)-\mathbf{g}(\mathbf{x}_n)\right]^T\mathbf{d}_n} \tag{69}$$
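A minimal sketch of this secant-based step size (the function name `secant_step_size` and its default $\varepsilon$ are illustrative assumptions):

```python
import numpy as np

def secant_step_size(grad, x, d, eps=1e-4):
    """Step size of Eq. (69): d^T H d in Eq. (58) is replaced by a finite
    difference of the directional derivative, so only two gradient
    evaluations are needed and the Hessian is never formed."""
    g0 = grad(x)
    g1 = grad(x + eps * d)
    return -eps * float(g0 @ d) / float((g1 - g0) @ d)

# Illustrative usage on a simple quadratic f(x) = 0.5 x^T A x - b^T x:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b
x = np.zeros(2)
d = -grad(x)                          # steepest-descent direction
print(secant_step_size(grad, x, d))   # matches Eq. (58) exactly for a quadratic
```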
Example:
The gradient descent method is applied to solve the same three-variable equation system previously solved by Newton's method.
The step size is determined by the secant method above with a small value of $\varepsilon$. The iteration from an initial guess is shown in Eq. (70).
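As an illustration of the overall procedure (the system below is a made-up stand-in, not the system from the text), a sketch of gradient descent with the secant step size of Eq. (69), applied to a hypothetical three-variable system $\mathbf{E}(\mathbf{x})=\mathbf{0}$ by minimizing $f(\mathbf{x})=\frac{1}{2}\lVert\mathbf{E}(\mathbf{x})\rVert^2$:

```python
import numpy as np

# Hypothetical three-variable system E(x) = 0 with exact solution (1, 1, 1),
# solved by minimizing f(x) = 0.5 * ||E(x)||^2.
def E(x):
    return np.array([x[0]**2 + x[1] - 2.0,
                     x[1]**2 + x[2] - 2.0,
                     x[2]**2 + x[0] - 2.0])

def grad_f(x):
    """Gradient of f(x) = 0.5*||E(x)||^2, i.e., J(x)^T E(x)."""
    J = np.array([[2*x[0], 1.0,    0.0   ],
                  [0.0,    2*x[1], 1.0   ],
                  [1.0,    0.0,    2*x[2]]])
    return J.T @ E(x)

x, eps = np.array([2.0, 2.0, 2.0]), 1e-4     # initial guess and secant offset
for n in range(100):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-9:             # stop when the gradient vanishes
        break
    d = -g                                   # steepest-descent direction
    g1 = grad_f(x + eps * d)
    delta = -eps * (g @ d) / ((g1 - g) @ d)  # secant step size, Eq. (69)
    x = x + delta * d
print(n, x, np.linalg.norm(E(x)))
```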
When it is difficult or too computationally costly to find the optimal step size along the search direction, some suboptimal step size may be acceptable, such as in the quasi-Newton methods for minimization problems. In this case, although the step size $\delta_n$ is no longer required to be such that the function value at the new position $\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\,\mathbf{d}_n$ is minimized along the search direction $\mathbf{d}_n$, the step size still has to satisfy the following Wolfe conditions:
$$f(\mathbf{x}_n+\delta_n\,\mathbf{d}_n)\le f(\mathbf{x}_n)+c_1\,\delta_n\,\mathbf{g}_n^T\mathbf{d}_n \tag{71}$$
$$\mathbf{d}_n^T\,\mathbf{g}(\mathbf{x}_n+\delta_n\,\mathbf{d}_n)\ge c_2\,\mathbf{d}_n^T\,\mathbf{g}_n \tag{72}$$
where
$$0<c_1<c_2<1 \tag{73}$$
In general, these conditions are motivated by the desired effect that after each iterative step, the function should have a shallower slope along $\mathbf{d}_n$ as well as a lower value, so that eventually the solution can be approached at which $f(\mathbf{x})$ is minimized and the gradient is zero.
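As a sketch, the check below implements the two conditions for a candidate step size; the test objective, the constants $c_1$ and $c_2$, and all names are illustrative assumptions:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, delta, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions of Eqs. (71)-(73) for step size delta
    along a descent direction d (assumes g^T d < 0, Eq. (52))."""
    g0_d = grad(x) @ d
    x_new = x + delta * d
    sufficient_decrease = f(x_new) <= f(x) + c1 * delta * g0_d   # Eq. (71)
    curvature = grad(x_new) @ d >= c2 * g0_d                     # Eq. (72)
    return sufficient_decrease and curvature

# Illustrative usage with a simple quadratic-plus-sine objective:
f = lambda x: float(x @ x) + np.sin(x[0])
grad = lambda x: 2 * x + np.array([np.cos(x[0]), 0.0])
x = np.array([1.5, -2.0])
d = -grad(x)                      # steepest-descent direction
for delta in (1.0, 0.1, 0.01):
    print(delta, satisfies_wolfe(f, grad, x, d, delta))
```

With these made-up numbers, the very small step size satisfies the sufficient-decrease condition (71) but fails the curvature condition (72), which is exactly the kind of overly timid step the second condition is meant to rule out.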
Specifically, to understand the first condition above, we represent the function to be minimized as a single-variable function $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$ of the step size $\delta$, and its tangent line at the point $\delta=0$ as a linear function $l(\delta)$, where the intercept can be found as the function value at $\delta=0$, and the slope can be found as the derivative of $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$ at $\delta=0$:
$$l(0)=f(\mathbf{x}_n) \tag{74}$$
$$l'(0)=\frac{d}{d\delta}f(\mathbf{x}_n+\delta\,\mathbf{d}_n)\Big|_{\delta=0}=\mathbf{g}_n^T\mathbf{d}_n \tag{75}$$
$$l(\delta)=f(\mathbf{x}_n)+\delta\,\mathbf{g}_n^T\mathbf{d}_n \tag{76}$$
As $\mathbf{g}_n^T\mathbf{d}_n<0$ and $0<c_1<1$, the line $f(\mathbf{x}_n)+c_1\,\delta\,\mathbf{g}_n^T\mathbf{d}_n$ on the right-hand side of the first condition lies above this tangent line, so the condition requires the function value at the new position to fall below a line that decreases less steeply than the tangent, i.e., the function value must be sufficiently reduced.
The second condition requires that at the new position $\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\,\mathbf{d}_n$ the slope of the function along the search direction, $\mathbf{d}_n^T\,\mathbf{g}(\mathbf{x}_{n+1})$, be sufficiently reduced in steepness, to within a specified fraction (determined by $c_2$) of the slope $\mathbf{d}_n^T\,\mathbf{g}_n$ at the old position $\mathbf{x}_n$.
The reason for this condition can also be explained geometrically. As shown in the figure, the gradient vectors at various points along the direction of $\mathbf{d}_n$ are shown by the blue arrows, and their projections onto the direction of $\mathbf{d}_n$ represent the slopes of the function $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$. Obviously at the point where $f(\mathbf{x}_n+\delta\,\mathbf{d}_n)$ reaches its minimum, its slope is zero. In other words, the projection of the gradient onto the direction of $\mathbf{d}_n$ is zero, i.e., $\mathbf{d}_n^T\,\mathbf{g}(\mathbf{x}_n+\delta\,\mathbf{d}_n)=0$, or the gradient is perpendicular to $\mathbf{d}_n$.
The gradient descent method gradually approaches a solution of an N-D minimization problem by moving from the initial guess along a zigzag path composed of a set of segments with any two consecutive segments perpendicular to each other. The number of steps depends greatly on the initial guess. As illustrated in the example below in an N=2 dimensional case, the best possible case is that the solution happens to be on the gradient direction of the initial guess, which could be reached in a single step, while the worst possible case is that the gradient direction of the initial guess happens to be 45 degrees off from the gradient direction of the optimal case, and it takes many zigzag steps to go around the optimal path to reach the solution. Many of the steps are in the same direction as some of the previous steps.
To improve the performance of the gradient descent method, we can include in the iteration a momentum term representing the search direction previously traversed:
$$\mathbf{x}_{n+1}=\mathbf{x}_n-\delta_n\,\mathbf{g}_n+\alpha\,(\mathbf{x}_n-\mathbf{x}_{n-1}),\qquad 0<\alpha<1 \tag{77}$$
Obviously it is most desirable not to repeat any of the previous directions traveled so that the solution can be reached in N steps, each in a unique direction in the N-D space. In other words, the subsequent steps are independent of each other, never interfering with the results achieved in the previous steps. Such a method will be discussed in the next section.
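A minimal sketch of gradient descent with a momentum term in the style of Eq. (77); the fixed step size, the momentum coefficient $\alpha$, and the elongated quadratic test function are assumptions for illustration:

```python
import numpy as np

def gd_momentum(grad, x0, delta=0.1, alpha=0.8, iters=100):
    """Gradient descent with a momentum term: each update adds a fraction
    alpha of the previous displacement to the current steepest-descent step."""
    x = np.asarray(x0, dtype=float)
    prev_step = np.zeros_like(x)
    for _ in range(iters):
        step = -delta * grad(x) + alpha * prev_step   # Eq. (77)-style update
        x = x + step
        prev_step = step
    return x

# Elongated quadratic bowl f(x) = 0.5 x^T A x, where plain gradient descent zigzags:
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
print(gd_momentum(grad, [5.0, 1.0]))   # approaches the minimum at the origin
```

The momentum term reinforces displacement components that persist from step to step while damping the components that alternate in sign, which is what suppresses the zigzag behavior described above.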
Example:
The Rosenbrock function $f(x,y)=(1-x)^2+100\,(y-x^2)^2$ has its global minimum $f(1,1)=0$ at $(x,y)=(1,1)$.
The figure below shows the search path of the gradient descent method based on the optimal step size given in Eq. (64). We see that the search path is composed of a long sequence of 90-degree turns between consecutive segments.
When Newton's method is applied to this minimization problem, it takes only four iterations for the algorithm to converge to the minimum, as shown in the figure below:
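A sketch of this kind of experiment with Newton's method on the Rosenbrock function (the starting point below is an assumption, and the iteration count depends on where the search starts):

```python
import numpy as np

def f(v):
    x, y = v
    return (1.0 - x)**2 + 100.0 * (y - x**2)**2

def grad(v):
    x, y = v
    return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

def hessian(v):
    x, y = v
    return np.array([[2.0 - 400.0 * (y - x**2) + 800.0 * x**2, -400.0 * x],
                     [-400.0 * x,                               200.0    ]])

x = np.array([-1.2, 1.0])                    # assumed starting point (not from the text)
for n in range(1, 51):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:            # stop when the gradient vanishes
        break
    x = x - np.linalg.solve(hessian(x), g)   # Newton step: x - H^{-1} g
print(n, x, f(x))                            # converges to the minimum at (1, 1)
```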