Newton's method

Newton's method is based on the Taylor series expansion of the function $f(x)$ to be minimized near some point $x_0$:

$\displaystyle f(x) = f(x_0)+f'(x_0)(x-x_0)+\frac{1}{2}f''(x_0)(x-x_0)^2 + \cdots
+\frac{1}{n!}f^{(n)}(x_0)(x-x_0)^n + \cdots
\approx f(x_0)+f'(x_0)(x-x_0)+\frac{1}{2}f''(x_0)(x-x_0)^2 =q(x)$ (28)

(Figure: NewtonEx1.png)

If $f(x)=q(x)$ is a quadratic function, then its Taylor series contains only the first three terms (constant, linear, and quadratic). In this case, we can find the vertex point at which $f(x)=q(x)$ reaches its extremum by first setting its derivative to zero:

$\displaystyle q'(x) = \frac{d}{dx}q(x)=\frac{d}{dx}\left[ f(x_0)+f'(x_0)(x-x_0)
+\frac{1}{2}f''(x_0)(x-x_0)^2 \right]
= f'(x_0)+f''(x_0)(x-x_0)=0$ (29)

and then solving the resulting equation for $x$ to get:

$\displaystyle x^*=x_0-\frac{f'(x_0)}{f''(x_0)}=x_0+\Delta x_0$ (30)

where $\Delta x_0=-f'(x_0)/f''(x_0)$ is the step we need to take to go from any initial point $x_0$ to the solution $x^*$ in a single step. Note that $f(x^*)$ is a minimum if $f''(x^*)>0$ but a maximum if $f''(x^*)<0$.

If $f(x)$ is not quadratic, so that $f(x)\ne q(x)$, the result above can still be used as an approximation of the solution, which can be improved iteratively from the initial guess $x_0$ to eventually approach the true solution:

$\displaystyle x_{n+1}=x_n+\Delta x_n=x_n-\frac{f'(x_n)}{f''(x_n)},\;\;\;\;\;\; n=0,\,1,\,2,\cdots$ (31)

where $\Delta x_n=-f'(x_n)/f''(x_n)$. We see that in each step of the iteration, the function $f(x)$ is fitted at $x_n$ by a quadratic function $q_n(x)$, whose vertex at $x_{n+1}$ is used as the updated approximate solution, at which the function is again fitted by another quadratic function $q_{n+1}(x)$ for the next iteration. Through this process the solution is approached as $n\rightarrow\infty$.

We note that the iteration $x_{n+1}=x_n-f'(x_n)/f''(x_n)$ is just the Newton-Raphson method for equation $f'(x)=0$, as a necessary condition for an extremum of $f(x)$.
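
To make the iteration of Eq. (31) concrete, here is a minimal Python sketch; the function names and the test function $f(x)=x^2+e^x$ are illustrative assumptions, not taken from the text.

\begin{verbatim}
import math

def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """1-D Newton iteration x_{n+1} = x_n - f'(x_n)/f''(x_n), Eq. (31)."""
    x = x0
    for _ in range(max_iter):
        step = -df(x) / d2f(x)           # Delta x_n = -f'(x_n)/f''(x_n)
        x += step
        if abs(step) < tol:              # stop when the update is negligible
            break
    return x

# Illustration: f(x) = x^2 + exp(x), f'(x) = 2x + exp(x), f''(x) = 2 + exp(x)
x_star = newton_1d(lambda x: 2*x + math.exp(x),
                   lambda x: 2 + math.exp(x),
                   x0=0.0)
print(x_star)    # about -0.3517, where f'(x*) = 0 and f''(x*) > 0 (a minimum)
\end{verbatim}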

Newton's method for the minimization of a single-variable function $f(x)$ can be generalized to the minimization of a multi-variable function $f({\bf x})=f(x_1,\cdots,x_N)$. Again, we approximate $f({\bf x})$ by a quadratic function $q({\bf x})$ containing only the first three terms of its Taylor series at some initial point ${\bf x}_0$:

$\displaystyle f({\bf x})\approx f({\bf x}_0)+{\bf g}_0^T({\bf x}-{\bf x}_0)
+\frac{1}{2}({\bf x}-{\bf x}_0)^T{\bf H}_0\,({\bf x}-{\bf x}_0)
=q({\bf x})$ (32)

where ${\bf g}_0$ and ${\bf H}_0$ are respectively the gradient vector and Hessian matrix of the function $f({\bf x})$ at ${\bf x}_0$:
$\displaystyle {\bf g}_0={\bf g}_f({\bf x}_0)=\frac{d}{d{\bf x}} f({\bf x}_0)
=\left[\begin{array}{c} \frac{\partial f({\bf x}_0)}{\partial x_1}\\
\vdots \\ \frac{\partial f({\bf x}_0)}{\partial x_N}\end{array}\right],
\;\;\;\;\;\;
{\bf H}_0={\bf H}_f({\bf x}_0)=\frac{d}{d{\bf x}} {\bf g}({\bf x}_0)
=\left[\begin{array}{ccc}
\frac{\partial^2 f({\bf x}_0)}{\partial x_1^2} & \cdots &
\frac{\partial^2 f({\bf x}_0)}{\partial x_1\partial x_N}\\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f({\bf x}_0)}{\partial x_N\partial x_1} & \cdots &
\frac{\partial^2 f({\bf x}_0)}{\partial x_N^2}
\end{array}\right]$ (33)
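
When analytical derivatives are not available, the gradient and Hessian in Eq. (33) can be approximated by finite differences. The following Python sketch is one such approximation; the step sizes are arbitrary tuning choices.

\begin{verbatim}
import numpy as np

def num_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient g_f(x) in Eq. (33)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2*h)
    return g

def num_hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian H_f(x) in Eq. (33)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4*h*h)
    return H
\end{verbatim}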

The stationary point of $f({\bf x})=q({\bf x})$ can be found from any point ${\bf x}_0$ by setting its derivative to zero
$\displaystyle \frac{d}{d{\bf x}}q({\bf x})=\frac{d}{d{\bf x}}\left[f({\bf x}_0)+{\bf g}_0^T({\bf x}-{\bf x}_0)
+\frac{1}{2}({\bf x}-{\bf x}_0)^T{\bf H}_0\,({\bf x}-{\bf x}_0)\right]
={\bf g}_0+{\bf H}_0\,({\bf x}-{\bf x}_0)={\bf0}$ (34)

and solving the resulting equation to get

$\displaystyle {\bf x}^*={\bf x}_0-{\bf H}({\bf x}_0)^{-1}{\bf g}({\bf x}_0)$ (35)

Similar to the 1-D case, where $f(x^*)$ is a minimum or a maximum depending on whether the second order derivative $f''(x^*)$ is greater or smaller than zero, here whether $f({\bf x}^*)$ is a minimum, a maximum, or neither depends on the second order derivatives collected in the Hessian matrix ${\bf H}^*={\bf H}({\bf x}^*)$: $f({\bf x}^*)$ is a minimum if ${\bf H}^*$ is positive definite, a maximum if ${\bf H}^*$ is negative definite, and a saddle point if ${\bf H}^*$ is indefinite.

If $f({\bf x})$ is not quadratic, so that $f({\bf x})\ne q({\bf x})$, the result above is not a true minimum or maximum, but it can still be used as an approximation of the solution, which can be improved iteratively to approach the true solution:

$\displaystyle {\bf x}_{n+1}={\bf x}_n+\Delta{\bf x}_n
={\bf x}_n-{\bf H}_n^{-1}{\bf g}_n ={\bf x}_n+{\bf d}_n,
\;\;\;\;\;\;n=0,\,1,\,2,\cdots$ (36)

where ${\bf g}_n={\bf g}({\bf x}_n)$, ${\bf H}_n={\bf H}({\bf x}_n)$, and the increment ${\bf d}_n=\Delta{\bf x}_n=-{\bf H}_n^{-1}{\bf g}_n$ is called the Newton search direction. We note that the computational complexity of each iteration is $O(N^3)$ due to the inversion of ${\bf H}_n$.

We note that the iteration above is just a generalization of $x_{n+1}=x_n-f'(x_n)/f''(x_n)$ in the 1-D case, where ${\bf g}_n=f'(x_n)$ and ${\bf H}_n=f''(x_n)$. Also, the iteration ${\bf x}_{n+1}={\bf x}_n-{\bf H}_n^{-1}\,{\bf g}_n$ above is just the Newton-Raphson method for solving the equation ${\bf g}_f({\bf x})={\bf0}$, the necessary condition for an extremum of $f({\bf x})$.
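
A minimal Python sketch of the iteration in Eq. (36) follows (function names are illustrative). Rather than forming ${\bf H}_n^{-1}$ explicitly, the linear system ${\bf H}_n{\bf d}_n=-{\bf g}_n$ is solved for the search direction; this has the same $O(N^3)$ cost but is numerically preferable to an explicit inverse.

\begin{verbatim}
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton iteration x_{n+1} = x_n - H_n^{-1} g_n of Eq. (36)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                      # gradient g_n
        H = hess(x)                      # Hessian H_n
        d = np.linalg.solve(H, -g)       # Newton direction d_n = -H_n^{-1} g_n
        x = x + d
        if np.linalg.norm(d) < tol:      # stop when the update is negligible
            break
    return x
\end{verbatim}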

The speed of convergence of the iteration can be adjusted by a parameter $\delta_n>0$ that controls the step size:

$\displaystyle {\bf x}_{n+1}={\bf x}_n+\delta_n \; \Delta{\bf x}_n
={\bf x}_n-\delta_n \; {\bf H}_n^{-1}{\bf g}_n$ (37)
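
The step-size parameter of Eq. (37) can be folded into the same update; in the sketch below the default $\delta=0.5$ is only a placeholder choice.

\begin{verbatim}
import numpy as np

def damped_newton_step(grad, hess, x, delta=0.5):
    """One update x <- x - delta * H^{-1} g, as in Eq. (37)."""
    return x - delta * np.linalg.solve(hess(x), grad(x))
\end{verbatim}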

Example:

An over-constrained nonlinear equation system ${\bf f}({\bf x})={\bf0}$ of $M$ equations and $N<M$ unknowns can be solved by minimizing the following sum-of-squares error:

$\displaystyle \varepsilon({\bf x})=\frac{1}{2}\vert\vert{\bf f}({\bf x})\vert\vert^2
=\frac{1}{2}\sum_{m=1}^M f_m^2({\bf x})$ (38)

This minimization problem can be solved by Newton's method.

The gradient of the error function is:

$\displaystyle {\bf g}_\varepsilon({\bf x})=\frac{d}{d{\bf x}}\varepsilon({\bf x})
=\frac{d}{d{\bf x}}\,\frac{1}{2}\vert\vert{\bf f}({\bf x})\vert\vert^2
=\frac{d {\bf f}}{d{\bf x}}\;{\bf f} ={\bf J}_f^T{\bf f}$ (39)

of which the $i$th component is:

$\displaystyle g_i=\frac{\partial}{\partial x_i}\,\frac{1}{2}\vert\vert{\bf f}\vert\vert^2
=\frac{\partial}{\partial x_i}\,\frac{1}{2}\sum_{m=1}^M f_m^2
=\sum_{m=1}^M \frac{\partial f_m}{\partial x_i}\;f_m
=\sum_{m=1}^M J_{mi}\;f_m$ (40)

where $J_{mi}=\partial f_m/\partial x_i$ is the component of the function's Jacobian ${\bf J}_f({\bf x})$ in the $m$th row and $i$th column.

We further find the component of the Hessian ${\bf H}_\varepsilon$ of the error function in the $i$th row and $j$th column:

$\displaystyle H_{ij} = \frac{\partial^2}{\partial x_i\partial x_j}\,\frac{1}{2}\vert\vert{\bf f}\vert\vert^2
=\sum_{m=1}^M \frac{\partial}{\partial x_j}\left[ f_m\frac{\partial f_m}{\partial x_i}\right]
=\sum_{m=1}^M \left[\frac{\partial f_m}{\partial x_i}\;\frac{\partial f_m}{\partial x_j}
+f_m \frac{\partial^2 f_m}{\partial x_i\partial x_j} \right]
\approx \sum_{m=1}^M \frac{\partial f_m}{\partial x_i}\;\frac{\partial f_m}{\partial x_j}
=\sum_{m=1}^M J_{mi}J_{mj}$ (41)

where we have dropped the second order term $f_m \;(\partial^2 f_m/\partial x_i\partial x_j)$. Now the Hessian can be written as

$\displaystyle {\bf H}_\varepsilon({\bf x})
=\frac{d^2}{d{\bf x}^2}\,\varepsilon({\bf x})
=\frac{d}{d{\bf x}}\,{\bf g}_\varepsilon({\bf x})
\approx {\bf J}_f^T{\bf J}_f$ (42)

Based on Newton's method, the optimal solution ${\bf x}^*$ that minimizes $\varepsilon({\bf x})$ can be found iteratively:

$\displaystyle {\bf x}_{n+1}={\bf x}_n-{\bf H}^{-1}_n{\bf g}_n
\approx {\bf x}_n-({\bf J}_n^T{\bf J}_n)^{-1} {\bf J}_n^T{\bf f}_n$ (43)

This is the same as the pseudo-inverse method in Eq. ([*]) in the previous chapter.
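
The following Python sketch of the iteration in Eq. (43) assumes the caller supplies ${\bf f}$ and its Jacobian ${\bf J}_f$; the step $-({\bf J}^T{\bf J})^{-1}{\bf J}^T{\bf f}$ is computed as a linear least-squares solve rather than by an explicit inverse.

\begin{verbatim}
import numpy as np

def gauss_newton(f, jac, x0, tol=1e-12, max_iter=100):
    """Minimize (1/2)||f(x)||^2 by the iteration of Eq. (43)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        F = f(x)                          # residual vector f(x_n), length M
        J = jac(x)                        # Jacobian J_f(x_n), shape (M, N)
        # step = -(J^T J)^{-1} J^T f, computed as a least-squares solve
        step, *_ = np.linalg.lstsq(J, -F, rcond=None)
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x
\end{verbatim}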

Example:

Consider the following quadratic function:

$\displaystyle q(x_1,x_2)=\frac{1}{2} [x_1,\;x_2]
\left[\begin{array}{cc}a&b/2\\ b/2&c\end{array}\right]
\left[\begin{array}{c}x_1\\ x_2\end{array}\right]
=\frac{1}{2}(ax_1^2+bx_1x_2+cx_2^2)$

We have

$\displaystyle {\bf g}=\left[\begin{array}{c}ax_1+bx_2/2\\ bx_1/2+cx_2\end{array}\right],
\;\;\;\;\;\;{\bf H}=\left[\begin{array}{cc}a&b/2\\ b/2&c\end{array}\right],
\;\;\;\;\;\;\det{\bf H}=ac-b^2/4$

We let $a=1$, $c=2$, and consider several different values of $b$, as illustrated in the figure below:

(Figure: QuadraticSurf.png)
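
Since $q$ is exactly quadratic, a single Newton step from any starting point lands on its stationary point at the origin. The quick numerical check below uses $b=1$ and an arbitrary starting point, both of which are just illustrative choices.

\begin{verbatim}
import numpy as np

# The text fixes a = 1, c = 2; the value b = 1 here is just an illustration.
a, b, c = 1.0, 1.0, 2.0
H = np.array([[a,   b/2],
              [b/2, c  ]])               # Hessian of q(x1, x2)

x0 = np.array([3.0, -2.0])               # arbitrary starting point
g0 = H @ x0                              # gradient of q at x0 is H x0
x1 = x0 - np.linalg.solve(H, g0)         # one Newton step, as in Eq. (36)

print(np.linalg.det(H))                  # det H = a*c - b^2/4 = 1.75
print(x1)                                # [0. 0.]: the minimum, reached in one step
\end{verbatim}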

We can speed up the convergence by using a larger step size $\delta>1$. However, if $\delta$ is too large, the iteration may overshoot the solution and fail to converge, oscillating around it, or even diverge. For such reasons, a smaller step size $\delta<1$ may sometimes be preferred.

In summary, Newton's method approximates the function $f({\bf x})$ at an estimated solution ${\bf x}_n$ by a quadratic function (the first three terms of the Taylor series) based on the gradient ${\bf g}_n$ and Hessian ${\bf H}_n$ of the function at ${\bf x}_n$, and treats the vertex of this quadratic as the updated estimate ${\bf x}_{n+1}$.

Newton's method requires the Hessian matrix as well as the gradient to be available. Moreover, it is necessary to compute the inverse of the Hessian matrix in each iteration, which may be computationally expensive.

Example:

Newton's method is applied to solve the following nonlinear equation system of $N=3$ variables:

$\displaystyle \left\{\begin{array}{l}
f_1(x_1,\,x_2,\,x_3)=3x_1-(x_2x_3)^2-3/2\\
f_2(x_1,\,x_2,\,x_3)=4x_1^2-625\,x_2^2+2x_2-1\\
f_3(x_1,\,x_2,\,x_3)=\exp(-x_1x_2)+20x_3+9\end{array}\right.$

with the exact solution $(x_1=0.5,\;x_2=0,\;x_3=-0.5)$. These equations can be expressed in vector form as ${\bf f}({\bf x})={\bf0}$ and solved as an optimization problem with the objective function $o({\bf x})={\bf f}^T({\bf x}){\bf f}({\bf x})$. The iteration from an initial guess ${\bf x}_0={\bf0}$ is shown below.

(Table: iteration history listing $n$, ${\bf x}=[x_1,\,x_2,\,x_3]$, and the corresponding error at each step; the last recoverable row reads ${\bf x}\approx[\ldots,\;0.003202,\;-0.499920]$ with error $2.371437\times 10^{-14}$.) (44)

We see that after 13 iterations the algorithm converges to the following approximate solution (a root of the system different from the exact solution listed above), with an accuracy of $o({\bf x})=\vert\vert{\bf f}({\bf x}^*)\vert\vert^2\approx 10^{-28}$:

$\displaystyle {\bf x}^*=\left[\begin{array}{r}0.5000008539707297\\ 0.0032017070323056\\
-0.4999200212218281\end{array}\right]$ (45)
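
The example can be reproduced with a short Python sketch. Since $M=N=3$ here, the Gauss-Newton step coincides with the Newton-Raphson step $-{\bf J}^{-1}{\bf f}$; the hand-coded Jacobian below is derived from the system as written above.

\begin{verbatim}
import numpy as np

def f(x):
    x1, x2, x3 = x
    return np.array([3*x1 - (x2*x3)**2 - 1.5,
                     4*x1**2 - 625*x2**2 + 2*x2 - 1,
                     np.exp(-x1*x2) + 20*x3 + 9])

def jac(x):
    x1, x2, x3 = x
    return np.array([[3.0,                -2*x2*x3**2,          -2*x2**2*x3],
                     [8*x1,               -1250*x2 + 2,          0.0],
                     [-x2*np.exp(-x1*x2), -x1*np.exp(-x1*x2),    20.0]])

x = np.zeros(3)                            # initial guess x_0 = 0
for n in range(30):
    step = np.linalg.solve(jac(x), -f(x))  # Newton-Raphson update
    x = x + step
    if np.linalg.norm(step) < 1e-14:
        break

print(x)             # tends to [0.5000009, 0.0032017, -0.4999200], as in Eq. (45)
print(f(x) @ f(x))   # objective o(x) = ||f(x)||^2, essentially zero
\end{verbatim}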