#math627##tex2html_wrap_inline7920#, at which the function is minimized
to
#math628# #tex2html_wrap_indisplay19425#
(111)
At the solution #tex2html_wrap_inline7922# that minimizes #tex2html_wrap_inline7924#, the gradient of the function is
#math629# #tex2html_wrap_indisplay19429#
(112)
We also see that the minimization of the quadratic function #tex2html_wrap_inline7926#
is equivalent to solving the linear system #math630##tex2html_wrap_inline7928# with a
symmetric positive definite coefficient matrix #tex2html_wrap_inline7930#. The CG method
considered here can therefore be used to solve either problem.
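For reference, and since the displays above appear only as placeholders in this copy, the relationship can be spelled out as follows, assuming the standard quadratic form used in this kind of derivation (with #tex2html_wrap_inline7926# defined up to an additive constant):
f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}, \qquad
\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}, \qquad
\mathbf{g}(\mathbf{x}^*) = \mathbf{0} \;\Longleftrightarrow\; \mathbf{A}\mathbf{x}^* = \mathbf{b}.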
<#2660#>Conjugate basis vectors<#2660#>
We first review the concept of <#2661#>conjugate vectors<#2661#>
(Section #GramSchmidt#2662>), which is essential to the CG method.
Two vectors #tex2html_wrap_inline7932# and #tex2html_wrap_inline7934# are
<#2665#>mutually conjugate<#2665#> or <#2666#>A-conjugate<#2666#> to each other with
respect to a symmetric matrix #math631##tex2html_wrap_inline7936# if they satisfy:
#math632# #tex2html_wrap_indisplay19437#
(113)
In particular, if #math633##tex2html_wrap_inline7938#, the two conjugate vectors become
orthogonal to each other, i.e., #math634##tex2html_wrap_inline7940#.
Just as a set of #tex2html_wrap_inline7942# orthogonal vectors can serve as a basis
spanning an N-D space, a set of #tex2html_wrap_inline7944# mutually conjugate vectors
#math635##tex2html_wrap_inline7946# satisfying #math636##tex2html_wrap_inline7948#
(#tex2html_wrap_inline7950#) can also serve as a basis of the N-D space. Any vector in
the space can be expressed as a linear combination of these basis vectors.
We also note that any set of #tex2html_wrap_inline7952# linearly independent vectors can be
converted by the Gram-Schmidt process (Section #GramSchmidt#2690>) into a set of #tex2html_wrap_inline7954#
basis vectors that are either orthogonal or A-conjugate to each other.
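As an illustration of this conversion, here is a minimal Matlab sketch of the Gram-Schmidt process carried out with respect to the A-inner product; the function name gram_schmidt_A and its argument names are illustrative choices, not from the original text.

function D = gram_schmidt_A(V, A)
% Convert the M independent columns of V into M mutually A-conjugate
% columns of D, i.e., D(:,i)'*A*D(:,j) = 0 for i ~= j.
    [N, M] = size(V);
    D = zeros(N, M);
    for n = 1:M
        d = V(:, n);
        for i = 1:n-1
            % subtract the A-projection of V(:,n) onto D(:,i)
            d = d - (D(:,i)'*A*V(:,n)) / (D(:,i)'*A*D(:,i)) * D(:,i);
        end
        D(:, n) = d;
    end
end

For example, gram_schmidt_A(eye(3), A) turns the standard basis of the 3-D space into an A-conjugate basis.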
<#2691#>Example:<#2691#>
Given two independent basis vectors of the 2-D space, and
a positive-definite matrix:
#math637# #tex2html_wrap_indisplay19448#
we can construct two A-orthogonal basis vectors by the Gram-Schmidt method:
#math638# #tex2html_wrap_indisplay19450#
The projections of a vector #math639##tex2html_wrap_inline7956# onto #tex2html_wrap_inline7958# and #tex2html_wrap_inline7960#
are:
#math640# #tex2html_wrap_indisplay19455#
The A-projections of the same vector #tex2html_wrap_inline7962# onto #tex2html_wrap_inline7964# and #tex2html_wrap_inline7966#
are:
#math641# #tex2html_wrap_indisplay19460#
The original vector #tex2html_wrap_inline7968# can be represented in either of the two bases:
#math642#
#tex2html_wrap_indisplay19463# |
#tex2html_wrap_indisplay19464# |
#tex2html_wrap_indisplay19465# |
|
|
#tex2html_wrap_indisplay19466# |
#tex2html_wrap_indisplay19467# |
|
Figure 2.14: A-Orthogonality based on Matrix #tex2html_wrap_inline7970# (figures/Aorthogonal.eps)
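As a quick numerical check of this kind of representation, one can verify in Matlab that a vector is recovered by summing its A-projections onto an A-conjugate basis produced by the gram_schmidt_A sketch above; the matrix and vectors below are illustrative stand-ins, not the ones of the example.

% Illustrative 2-D data (not the matrix and vectors of the example above)
A = [3 1; 1 2];                       % symmetric positive definite
D = gram_schmidt_A([1 1; 0 1], A);    % A-conjugate basis from two independent vectors
x = [2; -1];
x_rebuilt = zeros(2, 1);
for i = 1:2
    c = (D(:,i)'*A*x) / (D(:,i)'*A*D(:,i));   % A-projection coefficient
    x_rebuilt = x_rebuilt + c*D(:,i);
end
disp(norm(x - x_rebuilt))             % should be numerically zero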
<#2835#>Search along a conjugate basis<#2835#>
Similar to the gradient descent method, which iteratively improves
the estimated solution by following a sequence of orthogonal
search directions #math643##tex2html_wrap_inline7972# with
#math644##tex2html_wrap_inline7974# satisfying #math645##tex2html_wrap_inline7976#,
the CG method also follows a sequence of search directions
#math646##tex2html_wrap_inline7978# A-orthogonal to each other, i.e.,
#math647##tex2html_wrap_inline7980#:
#math648# #tex2html_wrap_indisplay19479#
(114)
We define the error at the nth step as #math649##tex2html_wrap_inline7982#.
Then subtracting #tex2html_wrap_inline7984# from both sides, we get the iteration
in terms of the errors:
#math650# #tex2html_wrap_indisplay19483#
(115)
Due to #math651##tex2html_wrap_inline7986# in Eq. (#gAx0#2875>), we can find the
gradient at the nth step #tex2html_wrap_inline7988# as
#math652# #tex2html_wrap_indisplay19487#
(116)
As the gradient at the solution is zero, #math653##tex2html_wrap_inline7990#,
we can treat the gradient #tex2html_wrap_inline7992# at #tex2html_wrap_inline7994# as the
residual of the nth iteration, and #math654##tex2html_wrap_inline7996#
as an error measure of how close #tex2html_wrap_inline7998# is to the
true solution #tex2html_wrap_inline8000#.
The optimal step size given in Eq. (#OptimalDelta#2903>) can now be
written as
#math655#
#tex2html_wrap_indisplay19495# |
#tex2html_wrap_indisplay19496# |
#tex2html_wrap_indisplay19497# |
|
|
#tex2html_wrap_indisplay19498# |
#tex2html_wrap_indisplay19499# |
(117) |
The last equality holds because #tex2html_wrap_inline8002# and #tex2html_wrap_inline8004#
are A-orthogonal, i.e., #math656##tex2html_wrap_inline8006#.
Substituting this into Eq. (#CGerror#2949>) we get
#math657# #tex2html_wrap_indisplay19504#
(118)
On the other hand, we can represent the error #math658##tex2html_wrap_inline8008#
associated with the initial guess #tex2html_wrap_inline8010# as a linear combination
of the A-orthogonal search vectors #math659##tex2html_wrap_inline8012#,
which serve as #tex2html_wrap_inline8014# basis vectors spanning the N-D vector space:
#math660# #tex2html_wrap_indisplay19510#
(119)
where #math661##tex2html_wrap_inline8016# is the A-projection of #tex2html_wrap_inline8018#
onto the ith basis vector #tex2html_wrap_inline8020#:
#math662# #tex2html_wrap_indisplay19515#
(120)
Note that the coefficient #tex2html_wrap_inline8022# happens to be the negative of the optimal step
size in Eq. (#delta_gd#3010>):
#math663# #tex2html_wrap_indisplay19518#
(121)
Now the expression for #tex2html_wrap_inline8024# in Eq. (#CGerror#3021>) can be written as
#math664# #tex2html_wrap_indisplay19521#
(122)
We see that in each iteration, the number of terms in the summation
for #tex2html_wrap_inline8026# is reduced by one, i.e., the nth component
#math665##tex2html_wrap_inline8028# of #tex2html_wrap_inline8030# along the direction
of #tex2html_wrap_inline8032# is completely eliminated. After #tex2html_wrap_inline8034# such iterations,
the error is reduced from #tex2html_wrap_inline8036# to #math666##tex2html_wrap_inline8038#, and the
true solution is obtained #math667##tex2html_wrap_inline8040#.
Pre-multiplying both sides of the equation above by #math668##tex2html_wrap_inline8042#,
we get
#math669# #tex2html_wrap_indisplay19532#
(123)
We see that after #tex2html_wrap_inline8044# iterations the remaining error #tex2html_wrap_inline8046#
is A-orthogonal to all previous directions #math670##tex2html_wrap_inline8048#.
Due to Eq. (#geqAe#3072>), the equation above can also be written as
#math671# #tex2html_wrap_indisplay19537#
(124)
i.e., the gradient #tex2html_wrap_inline8050# is orthogonal to all previous
search directions.
In the figure below, the conjugate gradient method is compared with
the gradient descent method for the case of #tex2html_wrap_inline8052#. We see that the
first search direction #tex2html_wrap_inline8054# is the same for both methods.
However, the next search direction #tex2html_wrap_inline8056# of the CG method is A-orthogonal to
#tex2html_wrap_inline8058#, as is the next error #tex2html_wrap_inline8060#, unlike
the search direction #tex2html_wrap_inline8062# of the gradient descent method. The
conjugate gradient method finds the solution #tex2html_wrap_inline8064# in #tex2html_wrap_inline8066#
steps, while the gradient descent method has to take many more
steps, each orthogonal to the previous one, before it reaches the solution.
Figure 2.15: Conjugate Gradient Descent (figures/ConjugateGradient.eps)
<#3096#>Find the A-orthogonal basis<#3096#>
The #tex2html_wrap_inline8068# A-orthogonal search directions #math672##tex2html_wrap_inline8070#
can be constructed based on any set of #tex2html_wrap_inline8072# independent vectors
#math673##tex2html_wrap_inline8074# by the Gram-Schmidt process:
#math674# #tex2html_wrap_indisplay19555#
(125)
where #math675##tex2html_wrap_inline8076#
and #math676##tex2html_wrap_inline8078# is the A-projection of #tex2html_wrap_inline8080# onto each
of the previous directions #tex2html_wrap_inline8082#.
We gain a significant computational advantage if we choose
#math677##tex2html_wrap_inline8084#. The Gram-Schmidt process above then
becomes:
#math678# #tex2html_wrap_indisplay19562#
(126)
where
#math679# #tex2html_wrap_indisplay19564#
(127)
The equation above indicates that #tex2html_wrap_inline8086# can be written
as a linear combination of all previous search directions
#math680##tex2html_wrap_inline8088#:
#math681# #tex2html_wrap_indisplay19568#
(128)
Pre-multiplying both sides by #math682##tex2html_wrap_inline8090#, we get
#math683# #tex2html_wrap_indisplay19571#
(129)
The last equality follows from #math684##tex2html_wrap_inline8092# in Eq. (#gd0#3192>).
We see that #tex2html_wrap_inline8094# is also orthogonal to all previous gradients
#math685##tex2html_wrap_inline8096#.
Pre-multiplying both sides of Eq. (#GSCG#3198>) by #math686##tex2html_wrap_inline8098#,
we get
#math687# #tex2html_wrap_indisplay19577#
(130)
Note that all terms in the summation are zero, since #math688##tex2html_wrap_inline8100#
for all #tex2html_wrap_inline8102# (Eq. (#gd0#3218>)). Substituting
#math689##tex2html_wrap_inline8104# into Eq. (#delta_gd#3222>), we get
#math690# #tex2html_wrap_indisplay19582#
(131)
Next we consider
#math691# #tex2html_wrap_indisplay19584#
(132)
Pre-multiplying #tex2html_wrap_inline8106# by #tex2html_wrap_inline8108# on both sides, we get
#math692# #tex2html_wrap_indisplay19588#
(133)
where #math693##tex2html_wrap_inline8110# (#tex2html_wrap_inline8112#). Solving for #math694##tex2html_wrap_inline8114#
we get
#math695# #tex2html_wrap_indisplay19593#
(134)
Substituting this into Eq. (#GSbeta#3289>) we get
#math696# #tex2html_wrap_indisplay19595#
(135)
which is non-zero only when #tex2html_wrap_inline8116#, i.e., there is only one non-zero term
in the summation of the Gram-Schmidt formula for #tex2html_wrap_inline8118#. This is
why we chose #math697##tex2html_wrap_inline8120#. We can now drop the second
subscript #tex2html_wrap_inline8122# in #tex2html_wrap_inline8124#, and Eq. (#CGdn#3313>) becomes:
#math698# #tex2html_wrap_indisplay19602#
(136)
Substituting the step size
#math699##tex2html_wrap_inline8126#
(Eq. (#CGstepsize#3334>)) into the above expression for #math700##tex2html_wrap_inline8128#,
we get
#math701# #tex2html_wrap_indisplay19606#
(137)
We note that the matrix #tex2html_wrap_inline8130# no longer appears in the expression.
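In the standard form of this derivation, the resulting coefficient is the Fletcher-Reeves expression
\beta_n = \frac{\mathbf{g}_{n+1}^T\,\mathbf{g}_{n+1}}{\mathbf{g}_n^T\,\mathbf{g}_n},
which involves only the gradients and requires no multiplication by #tex2html_wrap_inline8130#.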
<#3343#>The CG algorithm<#3343#>
Summarizing the above, we finally get the conjugate gradient algorithm
in the following steps:
- Set #tex2html_wrap_inline8132# and initialize the search direction (same as gradient
descent):
#math702# #tex2html_wrap_indisplay19610#
(138)
- Terminate if the error #math703##tex2html_wrap_inline8134# is smaller
than a preset threshold. Otherwise, continue with the following:
- Find the optimal step size (Eq. (#CGstepsize#3350>)) and step
forward:
#math704# #tex2html_wrap_indisplay19613#
(139)
- Update gradient:
#math705# #tex2html_wrap_indisplay19615#
(140)
- Find the coefficient for the Gram-Schmidt process
(Eq. (#getBeta#3369>)):
#math706# #tex2html_wrap_indisplay19617#
(141)
- Update search direction (Eq. (#CDdn1#3376>)):
#math707# #tex2html_wrap_indisplay19619#
(142)
Set #tex2html_wrap_inline8136# and go back to step 2.
The algorithm above assumes the objective function #tex2html_wrap_inline8138# to be quadratic
with known #tex2html_wrap_inline8140#. When #tex2html_wrap_inline8142# is not quadratic, #tex2html_wrap_inline8144# is no
longer available, but we can still approximate #tex2html_wrap_inline8146# by a quadratic function,
given by the first three terms of its Taylor series, in the neighborhood of its minimum.
We also need to modify the algorithm so that it does not depend on #tex2html_wrap_inline8148#.
Specifically, the optimal step size #tex2html_wrap_inline8150#, calculated in step 3 above based
on #tex2html_wrap_inline8152#, can alternatively be found by line minimization using any
suitable algorithm for 1-D optimization.
The Matlab code for the conjugate gradient algorithm is listed below:
verbatim337#
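The following is a minimal sketch of the algorithm for the quadratic case with known #tex2html_wrap_inline8140#; the function name conjugate_gradient and its interface are illustrative choices and not necessarily those of the listing above.

function x = conjugate_gradient(A, b, x, tol)
% Minimize f(x) = x'*A*x/2 - b'*x, i.e., solve A*x = b for a symmetric
% positive definite A, following steps 1-6 of the algorithm above.
    g = A*x - b;                       % initial gradient (negative residual)
    d = -g;                            % initial search direction (steepest descent)
    for n = 1:length(b)
        if norm(g) < tol, break; end   % step 2: terminate if the error is small
        alpha = (g'*g) / (d'*A*d);     % step 3: optimal step size
        x = x + alpha*d;               %         step forward
        g_new = g + alpha*(A*d);       % step 4: update gradient
        beta = (g_new'*g_new)/(g'*g);  % step 5: Gram-Schmidt coefficient
        d = -g_new + beta*d;           % step 6: update search direction
        g = g_new;
    end
end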
Here is the function that uses Newton's method to find the optimal
step size #tex2html_wrap_inline8154# that minimizes the objective function as a 1-D
function of #tex2html_wrap_inline8156#:
verbatim338#
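As an illustrative stand-in for such a function (the derivatives here are approximated by central finite differences; the original listing may compute them differently), a Newton iteration for the 1-D step-size problem can be sketched as:

function delta = line_search_newton(f, x, d, delta0)
% Find the step size delta minimizing phi(delta) = f(x + delta*d)
% by Newton's method on phi'(delta) = 0, using central differences.
    h = 1e-5;
    delta = delta0;
    phi = @(t) f(x + t*d);
    for k = 1:20
        d1 = (phi(delta+h) - phi(delta-h)) / (2*h);              % phi'(delta)
        d2 = (phi(delta+h) - 2*phi(delta) + phi(delta-h)) / h^2; % phi''(delta)
        if abs(d2) < eps, break; end
        step = d1/d2;
        delta = delta - step;          % Newton update
        if abs(step) < 1e-10, break; end
    end
end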
<#3397#>Example:<#3397#>
To compare the conjugate gradient method and the gradient descent method, consider a very
simple 2-D quadratic function
#math708# #tex2html_wrap_indisplay19632#
The performance of the gradient descent method depends significantly on
the initial guess. For the specific initial guess of #math709##tex2html_wrap_inline8158#,
the iteration falls into a zigzag pattern and converges very slowly,
as shown below:
#math710# #tex2html_wrap_indisplay19635#
However, as expected, the conjugate gradient method takes exactly
#tex2html_wrap_inline8160# steps from any initial guess to reach the solution:
#math711# #tex2html_wrap_indisplay19638#
Figure 2.16: Gradient vs. Conjugate Gradient (figures/GDvsCG2.eps)
For an #tex2html_wrap_inline8162# example of #math712##tex2html_wrap_inline8164# with
#math713# #tex2html_wrap_indisplay19645#
from an initial guess #math714##tex2html_wrap_inline8166#, it takes the gradient
descent method 41 iterations to reach
#math715##tex2html_wrap_inline8168# corresponding
to #math716##tex2html_wrap_inline8170#. From the same initial guess, it takes the
conjugate gradient method only #tex2html_wrap_inline8172# iterations to converge to the
solution:
#math717# #tex2html_wrap_indisplay19651#
For an #tex2html_wrap_inline8174# example, it takes over 4000 iterations for the gradient
descent method to converge with #math718##tex2html_wrap_inline8176#, but exactly 9
iterations for the CG method to converge with #math719##tex2html_wrap_inline8178#.
<#3454#>Example:<#3454#>
The figure below shows the search path of the conjugate gradient method
applied to the minimization of the Rosenbrock function:
Figure 2.17: Conjugate Gradient Applied to Rosenbrock Surface (figures/RosenbrockCG.eps)
<#3460#>Example:<#3460#>
Solve the following #tex2html_wrap_inline8180# non-linear equation system by the CG method:
#math720# #tex2html_wrap_indisplay19660#
The solution is known to be #math721##tex2html_wrap_inline8182#.
This equation system can be represented in vector form as
#math722##tex2html_wrap_inline8184# and the objective function is
#math723##tex2html_wrap_inline8186#. The iteration
of the CG method with an initial guess #math724##tex2html_wrap_inline8188# is
shown below:
#math725# #tex2html_wrap_indisplay19666#
In comparison, the gradient descent method would need over 200
iterations (though at a much reduced complexity per iteration) to reach this level of
error.
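The general setup of this example can be sketched as follows; the system F below is a hypothetical two-equation stand-in (the actual equations are those displayed above), meant only to show how the objective function is formed and minimized.

% Hypothetical stand-in system (NOT the system of the example above),
% used only to illustrate the construction of the objective function:
F = @(x) [ x(1)^2 + x(2) - 3 ;
           x(1) - x(2)^2 + 1 ];
f = @(x) F(x)' * F(x);          % objective: f(x) = ||F(x)||^2
% f can now be minimized by the nonlinear CG iteration described above,
% with the step size found by 1-D line minimization (e.g., line_search_newton).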
<#3483#>Conjugate gradient method used for solving linear equation systems:<#3483#>
As discussed before, if #tex2html_wrap_inline8190# is the solution that minimizes the
quadratic function #math726##tex2html_wrap_inline8192#,
with #math727##tex2html_wrap_inline8194# being symmetric and positive definite, it also satisfies
#math728##tex2html_wrap_inline8196#. In other words, the
optimization problem is equivalent to the problem of solving the linear system
#math729##tex2html_wrap_inline8198#, and both can be solved by the conjugate gradient
method.
Now consider solving the linear system #math730##tex2html_wrap_inline8200# with
#math731##tex2html_wrap_inline8202#. Let #math732##tex2html_wrap_inline8204# be a set of #tex2html_wrap_inline8206#
A-orthogonal vectors satisfying #math733##tex2html_wrap_inline8208#,
which can be generated from any #tex2html_wrap_inline8210# independent vectors, such as the
standard basis vectors, by the Gram-Schmidt method. The solution #tex2html_wrap_inline8212#
of the equation #math734##tex2html_wrap_inline8214# can be represented by these #tex2html_wrap_inline8216#
vectors as
#math735# #tex2html_wrap_indisplay19682#
(143)
Now we have
#math736# #tex2html_wrap_indisplay19684#
(144)
Pre-multiplying both sides by #tex2html_wrap_inline8218#, we get
#math737# #tex2html_wrap_indisplay19687#
(145)
Solving for #tex2html_wrap_inline8220# we get
#math738# #tex2html_wrap_indisplay19690#
(146)
Substituting this back into the expression for #tex2html_wrap_inline8222# we get the solution
of the equation:
#math739# #tex2html_wrap_indisplay19693#
(147)
Also note that as #math740##tex2html_wrap_inline8224#, the ith term of the summation above
is simply the A-projection of #tex2html_wrap_inline8226# onto the ith direction #tex2html_wrap_inline8228#:
#math741# #tex2html_wrap_indisplay19698#
(148)
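A minimal Matlab sketch of this direct construction, reusing the gram_schmidt_A helper sketched earlier (the matrix and right-hand side below are illustrative):

% Solve A*x = b by expanding x in an A-conjugate basis (Eq. (147))
A = [4 1 0; 1 3 1; 0 1 2];           % illustrative symmetric positive definite matrix
b = [1; 2; 3];
D = gram_schmidt_A(eye(3), A);       % A-conjugate basis from the standard basis
x = zeros(3, 1);
for i = 1:3
    x = x + (D(:,i)'*b) / (D(:,i)'*A*D(:,i)) * D(:,i);  % ith A-projection term
end
disp(norm(A*x - b))                  % should be numerically zero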
One application of the conjugate gradient method is to solve the normal
equation to find the least-squares solution of an over-constrained equation
system #math742##tex2html_wrap_inline8230#, where the coefficient matrix #tex2html_wrap_inline8232# is #tex2html_wrap_inline8234#
by #tex2html_wrap_inline8236# with rank #tex2html_wrap_inline8238#. As discussed previously, the normal equation of this
system is
#math743# #tex2html_wrap_indisplay19705#
(149)
Here #math744##tex2html_wrap_inline8240# is an #tex2html_wrap_inline8242# by #tex2html_wrap_inline8244# symmetric, positive definite matrix.
This normal equation can be solved by the conjugate gradient method.
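A minimal sketch of this application, reusing the conjugate_gradient function sketched earlier (the data below are illustrative):

% Least-squares solution of an over-constrained system A*x = b (m > n)
% by solving the normal equation (A'*A)*x = A'*b with the CG routine.
A = [1 1; 1 2; 1 3; 1 4];            % illustrative 4-by-2 coefficient matrix of rank 2
b = [2; 3; 5; 6];
x = conjugate_gradient(A'*A, A'*b, zeros(2, 1), 1e-10);
% x agrees with the least-squares solution given by A\b.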