The gradient descent method can be used to solve a minimization problem when the Hessian matrix of the objective function is not available. However, this method may be inefficient if it gets into a zigzag search pattern and repeats essentially the same search directions many times. This problem can be avoided in the conjugate gradient (CG) method. If the objective function of $N$ variables is quadratic, the CG method converges to the solution in at most $N$ iterations, without repeating any of the directions previously traversed. If the objective function is not quadratic, the CG method can still significantly improve the performance in comparison to the gradient descent method.
Again consider the approximation of the function $f(\mathbf{x})$ to be minimized by the first three terms of its Taylor series about a point $\mathbf{x}_0$:

$$f(\mathbf{x})\approx f(\mathbf{x}_0)+(\mathbf{x}-\mathbf{x}_0)^T\mathbf{g}_0+\frac{1}{2}(\mathbf{x}-\mathbf{x}_0)^T\mathbf{A}(\mathbf{x}-\mathbf{x}_0) \tag{128}$$

where $\mathbf{g}_0=\mathbf{g}(\mathbf{x}_0)$ is the gradient and $\mathbf{A}$ is the Hessian of $f(\mathbf{x})$ at $\mathbf{x}_0$, assumed to be symmetric and positive definite. Collecting terms, this quadratic model can be written as

$$f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{b}^T\mathbf{x}+c \tag{129}$$

with gradient and Hessian

$$\mathbf{g}(\mathbf{x})=\nabla f(\mathbf{x})=\mathbf{A}\mathbf{x}-\mathbf{b} \tag{130}$$

$$\nabla^2 f(\mathbf{x})=\mathbf{A} \tag{131}$$

As $\mathbf{A}$ is positive definite, $f(\mathbf{x})$ has a unique minimum $\mathbf{x}^*$ at which the gradient vanishes:

$$\mathbf{g}(\mathbf{x}^*)=\mathbf{A}\mathbf{x}^*-\mathbf{b}=\mathbf{0} \tag{132}$$

We also see that the minimization of the quadratic function in Eq. (129) is equivalent to solving the linear equation system

$$\mathbf{A}\mathbf{x}=\mathbf{b} \tag{133}$$

with a symmetric positive definite coefficient matrix $\mathbf{A}$. The CG method considered here can therefore be used for solving both problems.
Conjugate basis vectors
We first review the concept of conjugate vectors, which is of essential importance in the CG method. Two vectors $\mathbf{d}_i$ and $\mathbf{d}_j$ are mutually conjugate (or A-orthogonal, A-conjugate) to each other with respect to a symmetric matrix $\mathbf{A}$ if they satisfy:

$$\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j=0 \qquad (i\ne j) \tag{134}$$
Similar to a set of $N$ orthogonal vectors that can be used as a basis spanning an N-D space, a set of $N$ mutually conjugate vectors $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ satisfying Eq. (134) can also be used as a basis to span the N-D space. Any vector in the space can be expressed as a linear combination of these basis vectors.
Also, we note that any set of $N$ independent vectors can be converted by the Gram-Schmidt process into a set of basis vectors that are either orthogonal or A-orthogonal.
Example:
Given two independent basis vectors of the 2-D space and a positive definite matrix $\mathbf{A}$, an A-orthogonal pair of vectors can be constructed by the Gram-Schmidt process, as in the sketch below.
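The following Matlab sketch carries out this A-orthogonalization numerically. The two independent vectors and the positive definite matrix used here are assumed example values chosen only for illustration:

% Gram-Schmidt A-orthogonalization of two independent vectors (assumed example values)
A  = [3 1; 1 2];                          % an assumed symmetric positive definite matrix
u1 = [1; 0];                              % first independent vector
u2 = [0; 1];                              % second independent vector
d1 = u1;                                  % keep the first vector unchanged
d2 = u2 - (u2.'*A*d1)/(d1.'*A*d1)*d1;     % subtract the A-projection of u2 onto d1
d1.'*A*d2                                 % verify A-orthogonality: this evaluates to zero

Note that the resulting vectors are A-orthogonal but, in general, no longer orthogonal in the ordinary sense.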
Search along a conjugate basis
Similar to the gradient descent method, which iteratively improves the estimated solution $\mathbf{x}_n$ by following a sequence of search directions $\mathbf{d}_n=-\mathbf{g}_n$ with consecutive directions orthogonal to each other, $\mathbf{d}_{n+1}^T\mathbf{d}_n=0$, the CG method also follows a sequence of search directions $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$, but A-orthogonal to each other, i.e., $\mathbf{d}_m^T\mathbf{A}\mathbf{d}_n=0$ for all $m\ne n$:

$$\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\mathbf{d}_n \tag{135}$$

We define the error at the nth step as $\mathbf{e}_n=\mathbf{x}_n-\mathbf{x}^*$. Then subtracting $\mathbf{x}^*$ from both sides of Eq. (135), we get the iteration in terms of the errors:

$$\mathbf{e}_{n+1}=\mathbf{e}_n+\delta_n\mathbf{d}_n \tag{136}$$

Due to $\mathbf{A}\mathbf{x}^*=\mathbf{b}$ in Eq. (133), we can find the gradient at the nth step as

$$\mathbf{g}_n=\mathbf{A}\mathbf{x}_n-\mathbf{b}=\mathbf{A}\mathbf{x}_n-\mathbf{A}\mathbf{x}^*=\mathbf{A}\mathbf{e}_n \tag{137}$$

As the gradient at the solution is zero, $\mathbf{g}(\mathbf{x}^*)=\mathbf{0}$, we can consider the gradient at $\mathbf{x}_n$, or equivalently the residual

$$\mathbf{r}_n=-\mathbf{g}_n=\mathbf{b}-\mathbf{A}\mathbf{x}_n=-\mathbf{A}\mathbf{e}_n \tag{138}$$

of the nth iteration, as an error measurement representing how close $\mathbf{x}_n$ is to the true solution $\mathbf{x}^*$.
The optimal step size given in Eq. (60) can now be written as

$$\delta_n=-\frac{\mathbf{d}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}=-\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n} \tag{139}$$

As the A-orthogonal directions $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ form a basis of the N-D space, the initial error can be expressed as a linear combination of them:

$$\mathbf{e}_0=\sum_{k=0}^{N-1}\alpha_k\mathbf{d}_k \tag{140}$$

Pre-multiplying $\mathbf{d}_m^T\mathbf{A}$ on both sides and using the A-orthogonality in Eq. (134), we get the coefficients

$$\alpha_m=\frac{\mathbf{d}_m^T\mathbf{A}\mathbf{e}_0}{\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m} \tag{141}$$

On the other hand, iterating Eq. (136) from $\mathbf{e}_0$ gives $\mathbf{e}_n=\mathbf{e}_0+\sum_{k=0}^{n-1}\delta_k\mathbf{d}_k$, so that $\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n=\mathbf{d}_n^T\mathbf{A}\mathbf{e}_0$ due to the A-orthogonality of the directions, and the step size in Eq. (139) becomes

$$\delta_n=-\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}=-\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_0}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}=-\alpha_n \tag{142}$$

i.e., each step exactly cancels one component of the initial error, and the error remaining after $n$ steps is

$$\mathbf{e}_n=\mathbf{e}_0+\sum_{k=0}^{n-1}\delta_k\mathbf{d}_k=\sum_{k=0}^{N-1}\alpha_k\mathbf{d}_k-\sum_{k=0}^{n-1}\alpha_k\mathbf{d}_k=\sum_{k=n}^{N-1}\alpha_k\mathbf{d}_k \tag{143}$$

In particular, $\mathbf{e}_N=\mathbf{0}$, i.e., the exact solution is reached in at most $N$ steps. Pre-multiplying $\mathbf{d}_m^T\mathbf{A}$ (for any $m<n$) on both sides of the equation above, we get

$$\mathbf{d}_m^T\mathbf{A}\mathbf{e}_n=\sum_{k=n}^{N-1}\alpha_k\,\mathbf{d}_m^T\mathbf{A}\mathbf{d}_k=0 \qquad (m<n) \tag{144}$$

Due to Eq. (137), the equation above can also be written as

$$\mathbf{d}_m^T\mathbf{g}_n=0 \qquad (m<n) \tag{145}$$
i.e., the gradient $\mathbf{g}_n$ is orthogonal to all previous search directions. In the figure below, the conjugate gradient method is compared with the gradient descent method for the case of $N=2$. We see that the first search direction $\mathbf{d}_0=-\mathbf{g}_0$ is the same for both methods. However, the next search direction $\mathbf{d}_1$ is A-orthogonal to $\mathbf{d}_0$, the same as the next error $\mathbf{e}_1$, different from the search direction of the gradient descent method, which is orthogonal to $\mathbf{d}_0$. The conjugate gradient method finds the solution in $N=2$ steps, while the gradient descent method has to go through many more steps, all orthogonal to each other, before it finds the solution.
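This behavior can be verified numerically. The sketch below uses an assumed 2-D quadratic (the matrix, right-hand side, directions and starting point are example values): starting from the initial guess, two exact steps along a pair of A-orthogonal directions land exactly on the solution of $\mathbf{A}\mathbf{x}=\mathbf{b}$.

% Searching along A-orthogonal directions with the optimal step size (assumed example values)
A  = [3 1; 1 2];  b = [1; 2];                 % assumed quadratic f(x) = x'Ax/2 - b'x
xs = A\b;                                     % true minimizer, i.e. solution of Ax = b
d0 = [1; 0];                                  % first search direction
d1 = [0; 1] - ([0;1].'*A*d0)/(d0.'*A*d0)*d0;  % second direction, made A-orthogonal to d0
x  = [0; 0];                                  % initial guess
for d = [d0 d1]                               % one step along each of the two directions
    g     = A*x - b;                          % gradient at the current estimate
    delta = -(d.'*g)/(d.'*A*d);               % optimal step size along d, Eq. (139)
    x     = x + delta*d;                      % update the estimate
end
norm(x - xs)                                  % zero (up to round-off): solution reached in N = 2 steps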
Find the A-orthogonal basis
The A-orthogonal search directions can be constructed based on any set of independent vectors $\{\mathbf{u}_0,\dots,\mathbf{u}_{N-1}\}$ by the Gram-Schmidt process:

$$\mathbf{d}_n=\mathbf{u}_n-\sum_{m=0}^{n-1}\beta_{nm}\mathbf{d}_m \tag{146}$$

where $\beta_{nm}=\mathbf{u}_n^T\mathbf{A}\mathbf{d}_m/\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m$, and $\beta_{nm}\mathbf{d}_m$ is the A-projection of $\mathbf{u}_n$ onto each of the previous directions $\mathbf{d}_m$ ($m<n$). We will gain some significant computational advantage if we choose to use $\mathbf{u}_n=-\mathbf{g}_n$. Now the Gram-Schmidt process above becomes:

$$\mathbf{d}_n=-\mathbf{g}_n-\sum_{m=0}^{n-1}\beta_{nm}\mathbf{d}_m \tag{147}$$

where

$$\beta_{nm}=-\frac{\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m}{\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m} \tag{148}$$

The equation above indicates that $\mathbf{g}_n$ can be written as a linear combination of the search directions $\mathbf{d}_0,\dots,\mathbf{d}_n$:

$$-\mathbf{g}_n=\mathbf{d}_n+\sum_{m=0}^{n-1}\beta_{nm}\mathbf{d}_m \tag{149}$$

Combining this with Eq. (145), which states that $\mathbf{g}_n$ is orthogonal to all previous search directions, we see that the gradients are mutually orthogonal as well:

$$\mathbf{g}_m^T\mathbf{g}_n=-\left(\mathbf{d}_m+\sum_{k=0}^{m-1}\beta_{mk}\mathbf{d}_k\right)^T\mathbf{g}_n=0 \qquad (m<n) \tag{150}$$

Pre-multiplying $\mathbf{g}_n^T$ on both sides of Eq. (147), we get

$$\mathbf{g}_n^T\mathbf{d}_n=-\mathbf{g}_n^T\mathbf{g}_n-\sum_{m=0}^{n-1}\beta_{nm}\,\mathbf{g}_n^T\mathbf{d}_m=-\mathbf{g}_n^T\mathbf{g}_n \tag{151}$$

so that the optimal step size in Eq. (139) can also be written as

$$\delta_n=-\frac{\mathbf{d}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n} \tag{152}$$

To evaluate the coefficients $\beta_{nm}$ in Eq. (148), note that the gradient satisfies the recursion

$$\mathbf{g}_{m+1}=\mathbf{A}\mathbf{x}_{m+1}-\mathbf{b}=\mathbf{A}(\mathbf{x}_m+\delta_m\mathbf{d}_m)-\mathbf{b}=\mathbf{g}_m+\delta_m\mathbf{A}\mathbf{d}_m \tag{153}$$

i.e.,

$$\mathbf{A}\mathbf{d}_m=\frac{1}{\delta_m}\left(\mathbf{g}_{m+1}-\mathbf{g}_m\right) \tag{154}$$

Therefore

$$\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m=\frac{1}{\delta_m}\left(\mathbf{g}_n^T\mathbf{g}_{m+1}-\mathbf{g}_n^T\mathbf{g}_m\right) \tag{155}$$

and the Gram-Schmidt coefficient becomes

$$\beta_{nm}=-\frac{\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m}{\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m}=-\frac{\mathbf{g}_n^T\mathbf{g}_{m+1}-\mathbf{g}_n^T\mathbf{g}_m}{\delta_m\,\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m} \tag{156}$$
which, due to the orthogonality of the gradients in Eq. (150), is non-zero only when $m=n-1$, i.e., there is only one non-zero term in the summation of the Gram-Schmidt formula for $\mathbf{d}_n$. This is the reason why we choose $\mathbf{u}_n=-\mathbf{g}_n$. We can now drop the second subscript in $\beta_{n,n-1}$ and denote it by $\beta_n$, and Eq. (146) becomes:

$$\mathbf{d}_n=-\mathbf{g}_n-\beta_n\mathbf{d}_{n-1},\qquad
\beta_n=\beta_{n,n-1}=-\frac{\mathbf{g}_n^T\mathbf{g}_n-\mathbf{g}_n^T\mathbf{g}_{n-1}}{\delta_{n-1}\,\mathbf{d}_{n-1}^T\mathbf{A}\mathbf{d}_{n-1}}=-\frac{\mathbf{g}_n^T\mathbf{g}_n}{\delta_{n-1}\,\mathbf{d}_{n-1}^T\mathbf{A}\mathbf{d}_{n-1}} \tag{157}$$

Substituting the step size $\delta_{n-1}=\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}/\mathbf{d}_{n-1}^T\mathbf{A}\mathbf{d}_{n-1}$ (Eq. (152)) into the above expression for $\beta_n$, we get

$$\beta_n=-\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}} \tag{158}$$

We note that the matrix $\mathbf{A}$ no longer appears in this expression.

The CG algorithm
Summarizing the above, we finally get the conjugate gradient algorithm in the following steps:
1. Set $n=0$ and choose an initial guess $\mathbf{x}_0$.
2. Find the initial gradient and use its negative as the first search direction:
$$\mathbf{d}_0=-\mathbf{g}_0=\mathbf{b}-\mathbf{A}\mathbf{x}_0 \tag{159}$$
3. Find the optimal step size along $\mathbf{d}_n$:
$$\delta_n=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n} \tag{160}$$
4. Update the estimated solution:
$$\mathbf{x}_{n+1}=\mathbf{x}_n+\delta_n\mathbf{d}_n \tag{161}$$
5. Update the gradient:
$$\mathbf{g}_{n+1}=\mathbf{g}_n+\delta_n\mathbf{A}\mathbf{d}_n \tag{162}$$
6. Find the new search direction:
$$\beta_{n+1}=-\frac{\mathbf{g}_{n+1}^T\mathbf{g}_{n+1}}{\mathbf{g}_n^T\mathbf{g}_n},\qquad \mathbf{d}_{n+1}=-\mathbf{g}_{n+1}-\beta_{n+1}\mathbf{d}_n \tag{163}$$
Then increment $n$ and go back to step 3, until the gradient $\|\mathbf{g}_{n+1}\|$ is small enough; for a quadratic function the exact solution is reached after at most $N$ iterations.
The algorithm above assumes the objective function $f(\mathbf{x})$ to be quadratic with a known Hessian matrix $\mathbf{A}$. But when $f(\mathbf{x})$ is not quadratic, $\mathbf{A}$ is no longer available; we can still approximate $f(\mathbf{x})$ as a quadratic function by the first three terms of its Taylor series in the neighborhood of its minimum. Also, we need to modify the algorithm so that it does not depend on $\mathbf{A}$. Specifically, the optimal step size calculated in step 3 above based on $\mathbf{A}$ can alternatively be found by line minimization based on any suitable algorithm for 1-D optimization.
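For the quadratic case, where $\mathbf{A}$ and $\mathbf{b}$ are known, the steps above can be coded directly. Here is a minimal Matlab sketch; the function name myCGlinear, the stopping criterion and the test values mentioned after it are assumptions made for illustration only:

function x = myCGlinear(A, b, tol)     % CG for f(x) = x'Ax/2 - b'x, i.e. for solving Ax = b
    x = zeros(size(b));                % initial guess
    g = A*x - b;                       % initial gradient
    d = -g;                            % initial search direction, Eq. (159)
    while norm(g) > tol
        delta = (g.'*g)/(d.'*A*d);     % optimal step size, Eq. (160)
        x  = x + delta*d;              % update the estimate, Eq. (161)
        g1 = g + delta*A*d;            % update the gradient, Eq. (162)
        beta = -(g1.'*g1)/(g.'*g);     % Eq. (163)
        d  = -g1 - beta*d;             % new search direction, Eq. (163)
        g  = g1;
    end
end

For example, with an assumed system A = [4 1 0; 1 3 1; 0 1 2] and b = [1; 2; 3], the call myCGlinear(A, b, 1e-10) should agree with A\b, converging in at most three iterations for this 3-D problem.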
The Matlab code for the general, line-search based conjugate gradient algorithm is listed below:
function xn=myCG(o,tol)            % o is the objective function to minimize
    syms d;                        % variable for 1-D symbolic function f(d)
    x=symvar(o).';                 % symbolic variables in objective function
    O=matlabFunction(o);           % the objective function
    G=jacobian(o).';               % symbolic gradient of o(x)
    G=matlabFunction(G);           % the gradient function
    xn=zeros(length(x),1);         % initial guess of x
    xc=num2cell(xn);
    e=O(xc{:});                    % initial error
    gn=G(xc{:});                   % gradient at xn
    dn=-gn;                        % use negative gradient as search direction
    n=0;
    while e>tol
        n=n+1;
        f=subs(o,x,xn+d*dn);       % convert n-D f(x) to 1-D f(d)
        delta=Opt1d(f);            % find delta that minimizes 1-D function f(d)
        xn=xn+delta*dn;            % update variable x
        xc=num2cell(xn);
        e=O(xc{:});                % new error
        gn1=G(xc{:});              % new gradient
        bt=-(gn1.'*gn1)/(gn.'*gn); % find beta
        dn=-gn1-bt*dn;             % new search direction
        gn=gn1;                    % update gradient
        fprintf('%d\t',n); fprintf('%.4f ',xn); fprintf('\t%e\n',e);  % print iteration, estimate and error
    end
end

Here is the function that uses Newton's method to find the optimal step size $\delta$ that minimizes the objective function as a 1-D function $f(\delta)$:
function x=Opt1d(f)      % f is 1-D symbolic function to minimize
    syms x;
    tol=10^(-3);
    d1=diff(f);          % 1st order derivative
    d2=diff(d1);         % 2nd order derivative
    f=matlabFunction(f);
    d1=matlabFunction(d1);
    d2=matlabFunction(d2);
    x=0.0;               % initial guess of delta
    if d2(x)<=0          % second order derivative needs to be greater than 0
        x=rand-0.5;
    end
    y=x+1;
    while abs(x-y) > tol % minimization
        y=x;
        x=y-d1(y)/d2(y); % Newton iteration
    end
end
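As a usage illustration, the hypothetical call below minimizes an assumed three-variable quadratic test function (the Symbolic Math Toolbox is required); the objective and the tolerance are example values only. Since the stopping rule in myCG compares the objective value itself with the tolerance, the test function is chosen to have a minimum value of zero:

% Hypothetical usage of myCG on an assumed test function
syms x1 x2 x3                                % symbolic variables of the objective
o  = (x1-1)^2 + 5*(x2+2)^2 + 20*(x3-3)^2;    % assumed objective, minimum 0 at (1,-2,3)
xn = myCG(o, 1e-6)                           % should return approximately [1; -2; 3]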
Example:
To compare the conjugate gradient method and the gradient descent method, consider a very simple 2-D quadratic function. For one such example, it takes over 4000 iterations for the gradient descent method to converge to the specified error tolerance, but only 9 iterations for the CG method to reach the same tolerance.
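A comparison of this kind can be reproduced with the sketch below, which runs gradient descent with the optimal step size against MATLAB's built-in CG solver pcg on an assumed ill-conditioned 2-D quadratic; the matrix, right-hand side and tolerance are example values, so the iteration counts will differ from those quoted above:

% Gradient descent vs. conjugate gradient on an assumed 2-D quadratic f(x) = x'Ax/2 - b'x
A = [1 0; 0 20];  b = [1; 40];  tol = 1e-8;   % assumed example system, solution x* = [1; 2]
x = [0; 0];  g = A*x - b;  n = 0;
while norm(g) > tol*norm(b)                   % gradient descent with exact line search
    delta = (g.'*g)/(g.'*A*g);                % optimal step size along the negative gradient
    x = x - delta*g;  g = A*x - b;  n = n+1;
end
fprintf('gradient descent: %d iterations\n', n)
[~,~,~,iter] = pcg(A, b, tol, 10);            % conjugate gradient on the same system
fprintf('conjugate gradient: %d iterations\n', iter)

With these assumed values, gradient descent needs well over a hundred iterations, while CG converges in $N=2$.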
Example:
The figure below shows the search path of the conjugate gradient method applied to the minimization of the Rosenbrock function $f(x_1,x_2)=(1-x_1)^2+100\,(x_2-x_1^2)^2$:
Example:
Solve the following non-linear equation system by the CG method:
This equation system can be represented in vector form as $\mathbf{E}(\mathbf{x})=\mathbf{0}$, and the objective function to be minimized is $f(\mathbf{x})=\|\mathbf{E}(\mathbf{x})\|^2=\mathbf{E}(\mathbf{x})^T\mathbf{E}(\mathbf{x})$, which reaches its minimum value of zero at the solution. The iteration of the CG method with an initial guess $\mathbf{x}_0$ is shown below:
In comparison, the gradient descent method would need over 200 iterations (though each iteration has a much lower computational complexity) to reach this level of error.
Conjugate gradient method used for solving linear equation systems:
As discussed before, if $\mathbf{x}^*$ is the solution that minimizes the quadratic function $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{b}^T\mathbf{x}+c$, with $\mathbf{A}$ being symmetric and positive definite, it also satisfies $\mathbf{A}\mathbf{x}^*=\mathbf{b}$. In other words, the optimization problem is equivalent to the problem of solving the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$; both can be solved by the conjugate gradient method.
Now consider solving the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$ with $\mathbf{A}$ symmetric and positive definite. Let $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ be a set of A-orthogonal vectors satisfying $\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j=0$ for $i\ne j$, which can be generated from any set of independent vectors, such as the standard basis vectors, by the Gram-Schmidt method. The solution of the equation can be represented by these vectors as
$$\mathbf{x}^*=\sum_{k=0}^{N-1}\alpha_k\mathbf{d}_k \tag{164}$$

Substituting this into the equation, we get

$$\mathbf{A}\mathbf{x}^*=\sum_{k=0}^{N-1}\alpha_k\mathbf{A}\mathbf{d}_k=\mathbf{b} \tag{165}$$

Pre-multiplying $\mathbf{d}_m^T$ on both sides and using the A-orthogonality of the basis vectors, we get

$$\mathbf{d}_m^T\mathbf{A}\mathbf{x}^*=\sum_{k=0}^{N-1}\alpha_k\,\mathbf{d}_m^T\mathbf{A}\mathbf{d}_k=\alpha_m\,\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m=\mathbf{d}_m^T\mathbf{b} \tag{166}$$

i.e., the coefficients are

$$\alpha_m=\frac{\mathbf{d}_m^T\mathbf{b}}{\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m} \tag{167}$$

and the solution is

$$\mathbf{x}^*=\sum_{k=0}^{N-1}\frac{\mathbf{d}_k^T\mathbf{b}}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\,\mathbf{d}_k \tag{168}$$

If this sum is accumulated one term at a time, starting from $\mathbf{x}_0=\mathbf{0}$, we recover the iteration of the CG method:

$$\mathbf{x}_{n+1}=\mathbf{x}_n+\alpha_n\mathbf{d}_n,\qquad \mathbf{x}_N=\mathbf{x}^* \tag{169}$$
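The construction above can be verified numerically with the sketch below; the 3-by-3 system is an assumed example. The standard basis vectors are first A-orthogonalized by the Gram-Schmidt process, and the solution is then assembled from the coefficients in Eq. (167):

% Solving Ax = b by expanding x in an A-orthogonal basis (assumed example system)
A = [4 1 0; 1 3 1; 0 1 2];  b = [1; 2; 3];             % assumed symmetric positive definite system
N = length(b);
D = eye(N);                                            % start from the standard basis vectors
for n = 2:N                                            % Gram-Schmidt A-orthogonalization
    for m = 1:n-1
        D(:,n) = D(:,n) - (D(:,n).'*A*D(:,m))/(D(:,m).'*A*D(:,m))*D(:,m);
    end
end
x = zeros(N,1);
for k = 1:N
    x = x + (D(:,k).'*b)/(D(:,k).'*A*D(:,k))*D(:,k);   % add alpha_k d_k, Eqs. (167)-(168)
end
norm(x - A\b)                                          % zero up to round-off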
One application of the conjugate gradient method is to solve the normal equation to find the least-squares solution of an over-constrained equation system $\mathbf{A}\mathbf{x}=\mathbf{b}$, where the coefficient matrix $\mathbf{A}$ is $M\times N$ with $M>N$ and rank $N$. As discussed previously, the normal equation of this system is
$$\mathbf{A}^T\mathbf{A}\,\mathbf{x}=\mathbf{A}^T\mathbf{b} \tag{170}$$

in which the coefficient matrix $\mathbf{A}^T\mathbf{A}$ is symmetric and positive definite, so the normal equation can be solved by the conjugate gradient method.
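A minimal sketch of this application, using MATLAB's built-in CG solver pcg on an assumed over-determined system (the data values are examples only; for large sparse problems one would avoid forming $\mathbf{A}^T\mathbf{A}$ explicitly and pass a function handle to pcg instead):

% Least-squares solution of an over-constrained system via CG on the normal equations
A = [1 1; 1 2; 1 3; 1 4];  b = [6; 5; 7; 10];   % assumed 4x2 system of rank 2
M = A.'*A;  c = A.'*b;                          % normal equations M x = c, M symmetric positive definite
x = pcg(M, c, 1e-10, 100)                       % conjugate gradient solution of the normal equations
% x agrees with the least-squares solution A\b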