Optimization

The goal of mathematical optimization (or mathematical programming) is to solve an optimization problem for a given real multivariate function $y=f({\bf x})=f(x_1,\cdots,x_N)$ by searching its domain, an N-dimensional space, to find the optimal point ${\bf x}^*=[x_1^*,\cdots,x_N^*]^T\in\mathbb{R}^N$ at which the corresponding function value $f({\bf x}^*)$ is either maximized or minimized. As maximizing $f({\bf x})$ is equivalent to minimizing $-f({\bf x})$, we need only consider minimization problems. The function $f({\bf x})$ is called an objective or cost function.

$\displaystyle {\bf x}^*=\argmin_{\bf x} f({\bf x}),\;\;\;\;\;\;\mbox{i.e.,}\;\;\;\;\;f({\bf x}^*)=\min_{\bf x} f({\bf x})$ (1)
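As a quick numerical illustration of this equivalence (a minimal sketch only; the concave test function and the use of SciPy's minimize_scalar are illustrative choices, not part of the text), a maximum of $f(x)$ can be located by minimizing $-f(x)$:

from scipy.optimize import minimize_scalar

# A simple concave test function with a known maximum at x = 2, where f(2) = 3.
def f(x):
    return -(x - 2.0)**2 + 3.0

# Maximize f by minimizing its negation -f.
res = minimize_scalar(lambda x: -f(x))

print(res.x)     # approximately 2.0, the maximizer x*
print(-res.fun)  # approximately 3.0, the maximum value f(x*)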

If the independent variables in ${\bf x}$ are allowed to take any values inside the function domain $\mathbb{R}^N$, the optimization problem is unconstrained; otherwise, if ${\bf x}$ is restricted to a subset of $\mathbb{R}^N$, the problem is constrained.
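To make the distinction concrete, the following sketch compares an unconstrained minimization with a constrained one over a box subset of $\mathbb{R}^2$ (the quadratic objective, the bounds, and the use of SciPy's minimize are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize

# Objective f(x) = (x1 - 3)^2 + (x2 + 1)^2, minimized at x* = [3, -1].
def f(x):
    return (x[0] - 3.0)**2 + (x[1] + 1.0)**2

x0 = np.zeros(2)

# Unconstrained: x may take any value in R^2.
unconstrained = minimize(f, x0)

# Constrained: x restricted to the box 0 <= x1 <= 1, 0 <= x2 <= 1,
# which excludes the unconstrained optimum [3, -1].
constrained = minimize(f, x0, bounds=[(0.0, 1.0), (0.0, 1.0)])

print(unconstrained.x)  # approximately [3, -1]
print(constrained.x)    # approximately [1, 0], the closest feasible point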

For example, in machine learning we need to carry out regression analysis and classification, for which we may need to develop a model $y=f({\bf x},{\bf a})$ to fit a set of observed data points ${\cal D}=\{ ({\bf x}_n,\,y_n),\;\;n=1,\cdots,N \}$ by estimating the parameters ${\bf a}=[a_1,\cdots,a_M]^T$ of the function. Typically the form of the function is assumed to be known, such as the linear model $y=f(x)=ax+b$ with parameters $a$ for the slope and $b$ for the intercept, to be estimated based on the observed data.

Different methods can be used to solve this parameter estimation problem. If the least squares method is used, the objective function can be the squared error, the difference between the model output and the given data, which is to be minimized:

$\displaystyle o({\bf a})=\sum_{n=1}^N [y_n-f({\bf x}_n,{\bf a})]^2$ (2)
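For instance, for the linear model $y=ax+b$ above, the squared-error objective in Eq. (2) can be minimized numerically. The sketch below is illustrative only; the synthetic data and the use of SciPy's general-purpose minimize (rather than a closed-form least squares solution) are assumptions made here:

import numpy as np
from scipy.optimize import minimize

# Synthetic data roughly following y = 2x + 1 plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

# Squared-error objective o(a) of Eq. (2) for the linear model f(x, a) = a[0]*x + a[1].
def o(a):
    return np.sum((y - (a[0] * x + a[1]))**2)

res = minimize(o, x0=np.zeros(2))
print(res.x)  # approximately [2, 1], the estimated slope and intercept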

Alternatively, if the Bayesian method is used, the objective function can be the likelihood function of the parameters in ${\bf a}$, which is to be maximized:

$\displaystyle o({\bf a})=L({\bf a}\vert{\cal D}) =p({\cal D}\vert{\bf a})=\prod_{n=1}^N p(y_n\vert{\bf x}_n,{\bf a})$ (3)
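In practice the product in Eq. (3) is usually replaced by the log-likelihood to avoid numerical underflow, and maximizing it is equivalent to minimizing the negative log-likelihood. The following sketch assumes a Gaussian noise model with known variance for the same linear model; the data and the SciPy calls are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data y = a*x + b + Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
sigma = 0.1  # assumed known noise standard deviation

# Negative log-likelihood -sum_n log p(y_n | x_n, a) under the Gaussian assumption.
def neg_log_likelihood(a):
    mean = a[0] * x + a[1]
    return -np.sum(norm.logpdf(y, loc=mean, scale=sigma))

# Maximizing L(a|D) is the same as minimizing the negative log-likelihood.
res = minimize(neg_log_likelihood, x0=np.zeros(2))
print(res.x)  # approximately [2, 1], the maximum-likelihood estimate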

Solving such optimization problems, we get the optimal model parameters in ${\bf a}^*$.

If an optimization problem is unconstrained, and the analytical expression of the objective function is explicitly given, then the solution ${\bf x}^*$ at which the objective function $f({\bf x})$ is minimized may be found by solving the equation system obtained by setting the gradient of $f({\bf x})$ to zero:

$\displaystyle {\bf g}_f=\bigtriangledown_{\bf x} f({\bf x})=\frac{d f({\bf x})}{d{\bf x}}=\left[\frac{\partial f({\bf x})}{\partial x_1},\cdots,\frac{\partial f({\bf x})}{\partial x_N}\right]^T ={\bf0}$ (4)

where ${\bf g}_f=\bigtriangledown_{\bf x} f({\bf x})=d\,f({\bf x})/d{\bf x}$ denotes the gradient vector of the scalar function $f({\bf x})$ with respect to its vector argument ${\bf x}=[x_1,\cdots,x_N]^T$. The root ${\bf x}^*$ of the equation ${\bf g}_f={\bf0}$ is called a critical or stationary point of $f({\bf x})$.
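As a numerical check, the stationary point can be found by treating ${\bf g}_f({\bf x})={\bf 0}$ as a root-finding problem. In the sketch below, the test function $f({\bf x})=(x_1-1)^2+(x_2+2)^2$ and the use of SciPy's fsolve are illustrative assumptions:

import numpy as np
from scipy.optimize import fsolve

# f(x) = (x1 - 1)^2 + (x2 + 2)^2 has gradient g_f(x) = [2(x1 - 1), 2(x2 + 2)]^T.
def gradient(x):
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)])

# Solve g_f(x) = 0 for the stationary (critical) point x*.
x_star = fsolve(gradient, x0=np.zeros(2))
print(x_star)  # approximately [1, -2]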

Whether the function value $f({\bf x}^*)$ at this stationary point is a maximum, a minimum, or a saddle point depends on the Hessian matrix of second-order derivatives of $f({\bf x})$:

$\displaystyle {\bf H}_f=\left[ \begin{array}{ccc}
\frac{\partial^2 f({\bf x})}{\partial x_1^2} & \cdots & \frac{\partial^2 f({\bf x})}{\partial x_1\,\partial x_N}\\
\vdots & \ddots & \vdots\\
\frac{\partial^2 f({\bf x})}{\partial x_N\,\partial x_1} & \cdots & \frac{\partial^2 f({\bf x})}{\partial x_N^2}
\end{array} \right]$

Let ${\bf H}^*={\bf H}_f({\bf x})\vert _{{\bf x}={\bf x}^*}$ be the Hessian evaluated at the stationary point; then $f({\bf x}^*)$ is a minimum if ${\bf H}^*$ is positive definite, a maximum if ${\bf H}^*$ is negative definite, and a saddle point if ${\bf H}^*$ is indefinite.

For example, the gradient vectors and Hessian matrices of the three functions $f_1({\bf x})=x_1^2+x_2^2$, $f_2({\bf x})=-x_1^2-x_2^2$, and $f_3({\bf x})=x_1^2-x_2^2$ are:

$\displaystyle {\bf g}_{f_1}=\left[\begin{array}{c}2x_1\\ 2x_2\end{array}\right],\;\;\;\;
{\bf g}_{f_2}=\left[\begin{array}{r}-2x_1\\ -2x_2\end{array}\right],\;\;\;\;
{\bf g}_{f_3}=\left[\begin{array}{r}2x_1\\ -2x_2\end{array}\right]$

$\displaystyle {\bf H}_{f_1}=\left[\begin{array}{rr}2 & 0\\ 0 & 2\end{array}\right],\;\;\;\;
{\bf H}_{f_2}=\left[\begin{array}{rr}-2 & 0\\ 0 & -2\end{array}\right],\;\;\;\;
{\bf H}_{f_3}=\left[\begin{array}{rr}2 & 0\\ 0 & -2\end{array}\right]$

At the common stationary point ${\bf x}^*=[0,\,0]^T$ of all three functions, at which $f_1({\bf x}^*)=f_2({\bf x}^*)=f_3({\bf x}^*)=0$, $f_1({\bf x}^*)$ is a minimum, $f_2({\bf x}^*)$ is a maximum, and $f_3({\bf x}^*)$ is a saddle point, i.e., $f_3({\bf x}^*)$ is a minimum in the $x_1$ direction but a maximum in the $x_2$ direction.
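These classifications can be checked numerically from the eigenvalue signs of each Hessian at ${\bf x}^*$: all positive (positive definite) gives a minimum, all negative (negative definite) a maximum, and mixed signs (indefinite) a saddle point. The short NumPy sketch below does this; the helper function classify is hypothetical:

import numpy as np

# Hessians of f1, f2, f3, constant here and equal to their values at x* = [0, 0]^T.
H_f1 = np.array([[2.0, 0.0], [0.0, 2.0]])
H_f2 = np.array([[-2.0, 0.0], [0.0, -2.0]])
H_f3 = np.array([[2.0, 0.0], [0.0, -2.0]])

def classify(H):
    # Hypothetical helper: classify a stationary point by the eigenvalue signs of its Hessian.
    eigenvalues = np.linalg.eigvalsh(H)  # real eigenvalues of a symmetric matrix
    if np.all(eigenvalues > 0):
        return "minimum"        # positive definite
    if np.all(eigenvalues < 0):
        return "maximum"        # negative definite
    return "saddle point"       # indefinite (mixed signs)

print(classify(H_f1))  # minimum
print(classify(H_f2))  # maximum
print(classify(H_f3))  # saddle point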

The problems of minimization and equation solving are essentially the same, as each can be converted into the other: minimizing $f({\bf x})$ can be carried out by solving the gradient equation ${\bf g}_f({\bf x})={\bf 0}$, and conversely the roots of an equation system ${\bf h}({\bf x})={\bf 0}$ can be found by minimizing $\Vert{\bf h}({\bf x})\Vert^2$.
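As a small sketch of this equivalence (the test functions and the SciPy calls are illustrative assumptions), a root of ${\bf h}({\bf x})={\bf 0}$ can be found by minimizing $\Vert{\bf h}({\bf x})\Vert^2$, and a minimizer of $f({\bf x})$ can be found by solving its gradient equation:

import numpy as np
from scipy.optimize import minimize, fsolve

# Equation solving as minimization: find a root of h(x) = [x1 + x2 - 3, x1 - x2 - 1]^T
# by minimizing the squared norm ||h(x)||^2, which is zero exactly at a root.
def h(x):
    return np.array([x[0] + x[1] - 3.0, x[0] - x[1] - 1.0])

root = minimize(lambda x: np.sum(h(x)**2), x0=np.zeros(2))
print(root.x)  # approximately [2, 1]

# Minimization as equation solving: minimize f(x) = (x1 - 2)^2 + (x2 - 1)^2
# by solving its gradient equation [2(x1 - 2), 2(x2 - 1)]^T = 0.
minimizer = fsolve(lambda x: np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]), x0=np.zeros(2))
print(minimizer)  # approximately [2, 1]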