The gradient descent method can be used to solve a minimization problem
when the Hessian matrix of the objective function is not available. However,
this method may be inefficient if it falls into a zigzag search pattern and
repeats essentially the same search directions many times. This problem is
avoided in the conjugate gradient (CG) method.
If the objective function is quadratic in $N$ variables, the CG method
converges to the exact solution in at most $N$ iterations, without ever
repeating a direction previously traversed. If the objective function is not
quadratic, the CG method can still significantly improve performance in
comparison to the gradient descent method.
Again consider the approximation of the function $f(\mathbf{x})$ to be
minimized by the first three terms of its Taylor series about some point $\mathbf{x}_0$:

$$f(\mathbf{x}) \approx f(\mathbf{x}_0) + (\mathbf{x}-\mathbf{x}_0)^T\mathbf{g}_0 + \frac{1}{2}(\mathbf{x}-\mathbf{x}_0)^T\mathbf{H}(\mathbf{x}-\mathbf{x}_0) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} + c$$

where $\mathbf{g}_0$ and $\mathbf{H}$ are the gradient and the Hessian of $f$ at $\mathbf{x}_0$, and

$$\mathbf{A} = \mathbf{H}, \qquad \mathbf{b} = \mathbf{H}\mathbf{x}_0 - \mathbf{g}_0, \qquad c = f(\mathbf{x}_0) - \mathbf{g}_0^T\mathbf{x}_0 + \frac{1}{2}\mathbf{x}_0^T\mathbf{H}\mathbf{x}_0.$$

The gradient of this quadratic function is

$$\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b},$$

which is zero at the minimum.
We also see that the minimization of the quadratic function $f(\mathbf{x})$
is equivalent to solving the linear equation system

$$\mathbf{A}\mathbf{x} = \mathbf{b}$$

with a symmetric positive definite coefficient matrix $\mathbf{A} = \mathbf{A}^T > 0$. The CG method
considered here can therefore be used for solving both problems.
Conjugate basis vectors
We first review the concept of conjugate vectors, which is of essential
importance in the CG method. Two vectors $\mathbf{d}_i$ and $\mathbf{d}_j$
are mutually conjugate (or A-orthogonal, or A-conjugate) to each other with
respect to a symmetric matrix $\mathbf{A}$ if they satisfy

$$\mathbf{d}_i^T\mathbf{A}\,\mathbf{d}_j = 0, \qquad i \ne j. \qquad (134)$$
Similar to a set of $N$ orthogonal vectors that can be used as the basis
spanning an N-D space, a set of $N$ mutually conjugate vectors
$\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ satisfying
$\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j = 0$ ($i \ne j$) can also be used as a
basis to span the N-D space, i.e., any vector in the space can be expressed
as a linear combination of these basis vectors. Also, we note that any set
of $N$ independent vectors can be converted by the Gram-Schmidt process to a
set of $N$ basis vectors that are either orthogonal or A-orthogonal.
Example:
Given two independent basis vectors of the 2-D space and a positive definite
matrix, the Gram-Schmidt process converts them into a pair of A-orthogonal
vectors, as sketched below.
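Since the specific vectors and matrix of this example are not reproduced here, the following MATLAB sketch uses illustrative values (A, u1, u2 below are arbitrary) to carry out the same A-orthogonalization:

% A-orthogonalize two independent 2-D vectors by the Gram-Schmidt process.
% The matrix A and the vectors u1, u2 are illustrative values only.
A  = [3 1; 1 2];                             % a symmetric positive definite matrix
u1 = [1; 0];                                 % first independent vector
u2 = [0; 1];                                 % second independent vector

d1 = u1;                                     % keep the first vector as is
d2 = u2 - (u2.'*A*d1)/(d1.'*A*d1)*d1;        % remove the A-projection onto d1

disp(d1.'*A*d2)                              % numerically zero: d1 and d2 are A-orthogonal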
Search along a conjugate basis
Similar to the gradient descent method, which iteratively improves the
estimated solution by following a sequence of orthogonal search directions
$\mathbf{d}_n = -\mathbf{g}_n$ with $\mathbf{g}_n = \mathbf{g}(\mathbf{x}_n)$
satisfying $\mathbf{g}_{n+1}^T\mathbf{g}_n = 0$, the CG method also follows a
sequence of search directions $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$, but
these are A-orthogonal to each other, i.e.,
$\mathbf{d}_m^T\mathbf{A}\mathbf{d}_n = 0$ for $m \ne n$:

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\,\mathbf{d}_n.$$
The optimal step size given in Eq. (60) can now be written as

$$\delta_n = -\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}.$$

Substituting the iteration $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\mathbf{d}_n$
into $\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$, the gradient is updated by

$$\mathbf{g}_{n+1} = \mathbf{A}\mathbf{x}_{n+1} - \mathbf{b} = \mathbf{g}_n + \delta_n\,\mathbf{A}\mathbf{d}_n.$$

Pre-multiplying $\mathbf{d}_n^T$ on both sides of the equation above, we get

$$\mathbf{d}_n^T\mathbf{g}_{n+1} = \mathbf{d}_n^T\mathbf{g}_n + \delta_n\,\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n.$$

Substituting the optimal step size $\delta_n$ above, the right-hand side vanishes,
i.e., the gradient at the new point is orthogonal to the previous search direction:

$$\mathbf{d}_n^T\mathbf{g}_{n+1} = 0.$$
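As a quick numerical check of this result, the following MATLAB sketch builds a random quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ (the matrix, right-hand side, starting point, and direction below are arbitrary illustrative data), takes one exact line step, and verifies that the new gradient is orthogonal to the search direction:

% Check that after an exact line step along d, the new gradient is orthogonal to d.
N = 4;
M = randn(N); A = M.'*M + N*eye(N);    % a random symmetric positive definite matrix
b = randn(N,1);
x = randn(N,1);                        % current estimate x_n
g = A*x - b;                           % current gradient g_n
d = -g;                                % a search direction (here the negative gradient)
delta = -(g.'*d)/(d.'*A*d);            % optimal step size along d
g1 = A*(x + delta*d) - b;              % new gradient g_{n+1}
disp(d.'*g1)                           % essentially zero: d_n'*g_{n+1} = 0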
In the figure below, the conjugate gradient method is compared with the
gradient descent method for the case of $N = 2$. We see that the first search
direction $\mathbf{d}_0 = -\mathbf{g}_0$ is the same for both methods.
However, the next search direction $\mathbf{d}_1$ of the CG method is
A-orthogonal to $\mathbf{d}_0$, the same as the next error
$\mathbf{e}_1 = \mathbf{x}_1 - \mathbf{x}^*$, and therefore different from
the next search direction in the gradient descent method, which is simply
orthogonal to $\mathbf{d}_0$. The conjugate gradient method finds the
solution in $N = 2$ steps, while the gradient descent method has to go
through many more steps, all orthogonal to each other, before it finds the
solution.
Find the A-orthogonal basis
The A-orthogonal search directions $\mathbf{d}_0,\dots,\mathbf{d}_{N-1}$ can
be constructed based on any set of $N$ independent vectors
$\mathbf{u}_0,\dots,\mathbf{u}_{N-1}$ by the Gram-Schmidt process:

$$\mathbf{d}_n = \mathbf{u}_n + \sum_{i=0}^{n-1}\beta_{n,i}\,\mathbf{d}_i, \qquad \beta_{n,i} = -\frac{\mathbf{u}_n^T\mathbf{A}\mathbf{d}_i}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}.$$

We will gain a significant computational advantage if we choose to use
$\mathbf{u}_n = -\mathbf{g}_n$, the negative gradient at the $n$-th step.
Now the Gram-Schmidt process above becomes:

$$\mathbf{d}_n = -\mathbf{g}_n + \sum_{i=0}^{n-1}\beta_{n,i}\,\mathbf{d}_i, \qquad \beta_{n,i} = \frac{\mathbf{g}_n^T\mathbf{A}\mathbf{d}_i}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}.$$
Pre-multiplying $\mathbf{g}_n^T$ on both sides of the gradient update
$\mathbf{g}_{i+1} = \mathbf{g}_i + \delta_i\mathbf{A}\mathbf{d}_i$, we get

$$\mathbf{g}_n^T\mathbf{g}_{i+1} = \mathbf{g}_n^T\mathbf{g}_i + \delta_i\,\mathbf{g}_n^T\mathbf{A}\mathbf{d}_i, \qquad\text{i.e.,}\qquad \mathbf{g}_n^T\mathbf{A}\mathbf{d}_i = \frac{\mathbf{g}_n^T\mathbf{g}_{i+1} - \mathbf{g}_n^T\mathbf{g}_i}{\delta_i}.$$

As the gradients generated by the method are mutually orthogonal
($\mathbf{g}_m^T\mathbf{g}_n = 0$ for $m \ne n$), this quantity, for $i < n$,
is non-zero only when $i = n-1$, i.e., there is only one non-zero term in the
summation of the Gram-Schmidt formula for $\mathbf{d}_n$. This is the reason
why we choose $\mathbf{u}_n = -\mathbf{g}_n$. We can now drop the second
subscript in $\beta_{n,i}$, and the Gram-Schmidt formula above becomes:

$$\mathbf{d}_n = -\mathbf{g}_n + \beta_n\,\mathbf{d}_{n-1}, \qquad \beta_n = \frac{\mathbf{g}_n^T\mathbf{A}\mathbf{d}_{n-1}}{\mathbf{d}_{n-1}^T\mathbf{A}\mathbf{d}_{n-1}} = \frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}},$$

where the last equality follows from $\mathbf{A}\mathbf{d}_{n-1} = (\mathbf{g}_n - \mathbf{g}_{n-1})/\delta_{n-1}$ and the orthogonality relations above.
The CG algorithm
Summarizing the above, we finally get the conjugate gradient algorithm in the following steps:

1. Set $n = 0$, choose an initial guess $\mathbf{x}_0$, and compute $\mathbf{g}_0 = \mathbf{A}\mathbf{x}_0 - \mathbf{b}$ and $\mathbf{d}_0 = -\mathbf{g}_0$.
2. If $\|\mathbf{g}_n\|$ is smaller than a preset tolerance, stop and return $\mathbf{x}_n$ as the solution.
3. Compute the optimal step size $\delta_n = -\dfrac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n} = \dfrac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$.
4. Update the estimate and the gradient: $\mathbf{x}_{n+1} = \mathbf{x}_n + \delta_n\mathbf{d}_n$, $\quad\mathbf{g}_{n+1} = \mathbf{g}_n + \delta_n\mathbf{A}\mathbf{d}_n$.
5. Update the search direction: $\beta_{n+1} = \dfrac{\mathbf{g}_{n+1}^T\mathbf{g}_{n+1}}{\mathbf{g}_n^T\mathbf{g}_n}$, $\quad\mathbf{d}_{n+1} = -\mathbf{g}_{n+1} + \beta_{n+1}\mathbf{d}_n$; set $n = n+1$ and go back to step 2.
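The following MATLAB sketch implements the five steps above for the quadratic case with known $\mathbf{A}$; the function name cgQuad and its interface are choices made here for illustration, not part of the original listing:

function x = cgQuad(A, b, x, tol)
    % CG for the quadratic f(x) = x'*A*x/2 - b'*x, with A symmetric positive definite
    g = A*x - b;                      % initial gradient (step 1)
    d = -g;                           % initial search direction (step 1)
    while norm(g) > tol               % stopping test (step 2)
        delta = (g.'*g)/(d.'*A*d);    % optimal step size (step 3)
        x = x + delta*d;              % update the estimate (step 4)
        g1 = g + delta*A*d;           % update the gradient (step 4)
        beta = (g1.'*g1)/(g.'*g);     % Fletcher-Reeves coefficient (step 5)
        d = -g1 + beta*d;             % new A-orthogonal search direction (step 5)
        g = g1;
    end
end

For example, cgQuad([4 1; 1 3], [1; 2], [0; 0], 1e-10) should return the solution of the corresponding 2-by-2 system, reached after two iterations up to rounding.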
The algorithm above assumes the objective function $f(\mathbf{x})$ to be
quadratic with a known Hessian $\mathbf{A}$. When $f(\mathbf{x})$ is not
quadratic, $\mathbf{A}$ is no longer available, but we can still approximate
$f(\mathbf{x})$ as a quadratic function by the first three terms of its
Taylor series in the neighborhood of its minimum. Also, we need to modify
the algorithm so that it does not depend on $\mathbf{A}$. Specifically, the
optimal step size $\delta_n$ calculated in step 3 above based on $\mathbf{A}$
can alternatively be found by line minimization using any suitable algorithm
for 1-D optimization, and the gradient $\mathbf{g}_{n+1}$ in step 4 is
obtained directly as $\nabla f(\mathbf{x}_{n+1})$.
The Matlab code for the conjugate gradient algorithm is listed below:
function xn=myCG(o,tol)            % o is the objective function to minimize
    syms d;                        % variable for 1-D symbolic function f(d)
    x=symvar(o).';                 % symbolic variables in objective function
    O=matlabFunction(o);           % the objective function
    G=jacobian(o).';               % symbolic gradient of o(x)
    G=matlabFunction(G);           % the gradient function
    xn=zeros(length(x),1);         % initial guess of x
    xc=num2cell(xn);
    e=O(xc{:});                    % initial error
    gn=G(xc{:});                   % gradient at xn
    dn=-gn;                        % use negative gradient as search direction
    n=0;
    while e>tol
        n=n+1;
        f=subs(o,x,xn+d*dn);       % convert n-D f(x) to 1-D f(d)
        delta=Opt1d(f);            % find delta that minimizes 1-D function f(d)
        xn=xn+delta*dn;            % update variable x
        xc=num2cell(xn);
        e=O(xc{:});                % new error
        gn1=G(xc{:});              % new gradient
        bt=-(gn1.'*gn1)/(gn.'*gn); % find beta
        dn=-gn1-bt*dn;             % new search direction
        gn=gn1;                    % update gradient
        fprintf('%d\t(%.4f, %.4f, %.4f)\t%e\n',n,xn(1),xn(2),xn(3),e)
    end
end

Here is the function that uses Newton's method to find the optimal step size:
function x=Opt1d(f)                % f is 1-D symbolic function to minimize
    syms x;
    tol=10^(-3);
    d1=diff(f);                    % 1st order derivative
    d2=diff(d1);                   % 2nd order derivative
    f=matlabFunction(f);
    d1=matlabFunction(d1);
    d2=matlabFunction(d2);
    x=0.0;                         % initial guess of delta
    if d2(x)<=0                    % second order derivative needs to be greater than 0
        x=rand-0.5;
    end
    y=x+1;
    while abs(x-y) > tol           % minimization
        y=x;
        x=y-d1(y)/d2(y);           % Newton iteration
    end
end
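As a usage sketch (assuming MATLAB's Symbolic Math Toolbox is available), myCG can be called with a symbolic objective whose minimum value is zero, since the listed code uses the objective value itself as the error measure; the three-variable objective below is purely illustrative, chosen to match the three components printed by the fprintf statement:

syms x1 x2 x3
o  = (x1-1)^2 + 2*(x2+1)^2 + 3*x3^2;   % an arbitrary objective with minimum value 0
xn = myCG(o, 1e-9)                     % starts from the origin, prints each iteration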
Example:
To compare the conjugate gradient method and the gradient descent method,
consider a very simple 2-D quadratic function. For one such example, it takes
over 4000 iterations for the gradient descent method to converge to the
specified error tolerance, but only 9 iterations for the CG method to
converge to the same tolerance.
Example:
The figure below shows the search path of the conjugate gradient method
applied to the minimization of the Rosenbrock function
$f(x_1, x_2) = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$.
Example:
Solve the following non-linear equation system by the CG method:
This equation system can be represented in vector form as
$\mathbf{f}(\mathbf{x}) = \mathbf{0}$, and the objective function to be
minimized is $o(\mathbf{x}) = \|\mathbf{f}(\mathbf{x})\|^2$, which reaches
its minimum value of zero at the solution. The iteration of the CG method
starting from an initial guess is shown below:
In comparison, the gradient descent method needs over 200 iterations (though each iteration is much cheaper) to reach the same level of error.
Conjugate gradient method used for solving linear equation systems:
As discussed before, if $\mathbf{x}^*$ is the solution that minimizes the
quadratic function $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} + c$,
with $\mathbf{A}$ being symmetric and positive definite, it also satisfies
$\mathbf{A}\mathbf{x}^* = \mathbf{b}$. In other words, the optimization
problem is equivalent to the problem of solving the linear system
$\mathbf{A}\mathbf{x} = \mathbf{b}$; both can be solved by the conjugate
gradient method.
Now consider solving the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$
with $\mathbf{A} = \mathbf{A}^T > 0$. Let $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$
be a set of $N$ A-orthogonal vectors satisfying
$\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j = 0$ ($i \ne j$), which can be
generated from any $N$ independent vectors, such as the standard basis
vectors, by the Gram-Schmidt method. The solution $\mathbf{x}^*$ of the
equation $\mathbf{A}\mathbf{x} = \mathbf{b}$ can be represented by these
vectors as

$$\mathbf{x}^* = \sum_{i=0}^{N-1}\alpha_i\,\mathbf{d}_i.$$

Pre-multiplying $\mathbf{d}_j^T\mathbf{A}$ on both sides, we get

$$\mathbf{d}_j^T\mathbf{A}\mathbf{x}^* = \mathbf{d}_j^T\mathbf{b} = \sum_{i=0}^{N-1}\alpha_i\,\mathbf{d}_j^T\mathbf{A}\mathbf{d}_i = \alpha_j\,\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j,$$

as all terms with $i \ne j$ vanish by A-orthogonality. Solving for the
coefficients, we obtain

$$\alpha_j = \frac{\mathbf{d}_j^T\mathbf{b}}{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j}, \qquad\text{and therefore}\qquad \mathbf{x}^* = \sum_{j=0}^{N-1}\frac{\mathbf{d}_j^T\mathbf{b}}{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j}\,\mathbf{d}_j.$$
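As an illustration of this expansion, the following MATLAB sketch (with arbitrary illustrative data) generates an A-orthogonal basis from the standard basis vectors by the Gram-Schmidt process and assembles the solution from the coefficients $\alpha_j$:

% Solve A*x = b by expanding x in an A-orthogonal basis (illustrative data)
A = [4 1 0; 1 3 1; 0 1 2];             % a symmetric positive definite matrix
b = [1; 2; 3];
N = length(b);
D = eye(N);                            % start from the standard basis vectors
for n = 2:N                            % Gram-Schmidt with the A-inner product
    for i = 1:n-1
        D(:,n) = D(:,n) - (D(:,n).'*A*D(:,i))/(D(:,i).'*A*D(:,i))*D(:,i);
    end
end
x = zeros(N,1);
for j = 1:N                            % x = sum_j (d_j'*b)/(d_j'*A*d_j) * d_j
    x = x + (D(:,j).'*b)/(D(:,j).'*A*D(:,j))*D(:,j);
end
disp(norm(A*x - b))                    % essentially zero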
One application of the conjugate gradient method is to solve the normal
equation to find the least-squares solution of an over-constrained equation
system $\mathbf{A}\mathbf{x} = \mathbf{b}$, where the coefficient matrix
$\mathbf{A}$ is $M$ by $N$ (with $M > N$) and has rank $N$. As discussed
previously, the normal equation of this system is

$$\mathbf{A}^T\mathbf{A}\,\mathbf{x} = \mathbf{A}^T\mathbf{b},$$

whose coefficient matrix $\mathbf{A}^T\mathbf{A}$ is symmetric and positive
definite, so the system can be solved by the conjugate gradient method.
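As a sketch of this application (with illustrative data, and reusing the cgQuad function outlined after the algorithm above), the normal equation of a small over-constrained system can be formed and solved by CG as follows; the rectangular matrix is named F here only to avoid clashing with the square matrix passed to cgQuad:

F = [1 1; 1 2; 1 3; 1 4];              % M-by-N coefficient matrix, M > N, rank N
y = [2.1; 3.9; 6.2; 8.1];              % right-hand side of the over-constrained system
A = F.'*F;                             % symmetric positive definite N-by-N matrix
b = F.'*y;                             % right-hand side of the normal equation
x = cgQuad(A, b, zeros(2,1), 1e-12);   % CG applied to the normal equation
disp(norm(x - F\y))                    % agrees with MATLAB's least-squares solution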