Optimization

The goal of mathematical optimization (or mathematical programming) is to solve an optimization problem for a given real multivariate function $y=f({\bf x})=f(x_1,\cdots,x_N)$ by searching its domain, an N-dimensional space, to find the optimal point ${\bf x}^*=[x_1^*,\cdots,x_N^*]^T\in\mathbb{R}^N$ at which the corresponding function value $f({\bf x}^*)$ is either maximized or minimized. As maximizing $f({\bf x})$ is equivalent to minimizing $-f({\bf x})$, we need only consider minimization problems. The function $f({\bf x})$ is called an objective or cost function.

$\displaystyle {\bf x}^*=\argmin_{\bf x} f({\bf x}),\;\;\;\;\;\;\mbox{i.e.,}\;\;\;\;\;f({\bf x}^*)=\min_{\bf x} f({\bf x})$ (1)
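As a quick numerical illustration of this equivalence (a minimal sketch only; the concave test function and the use of SciPy's minimize_scalar are illustrative choices, not part of the text), a maximum of $f(x)$ can be located by minimizing $-f(x)$:

from scipy.optimize import minimize_scalar

# A simple concave test function with a known maximum at x = 2, where f(2) = 3.
def f(x):
    return -(x - 2.0)**2 + 3.0

# Maximize f by minimizing its negation -f.
res = minimize_scalar(lambda x: -f(x))

print(res.x)     # approximately 2.0, the maximizer x*
print(-res.fun)  # approximately 3.0, the maximum value f(x*)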

If the independent variables in ${\bf x}$ are allowed to take any values inside the function domain $\mathbb{R}^N$, the optimization problem is unconstrained; otherwise, if ${\bf x}$ is restricted to a subset of $\mathbb{R}^N$, the problem is constrained.
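To make the distinction concrete, the following sketch compares an unconstrained minimization with a constrained one over a box subset of $\mathbb{R}^2$ (the quadratic objective, the bounds, and the use of SciPy's minimize are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize

# Objective f(x) = (x1 - 3)^2 + (x2 + 1)^2, minimized at x* = [3, -1].
def f(x):
    return (x[0] - 3.0)**2 + (x[1] + 1.0)**2

x0 = np.zeros(2)

# Unconstrained: x may take any value in R^2.
unconstrained = minimize(f, x0)

# Constrained: x restricted to the box 0 <= x1 <= 1, 0 <= x2 <= 1,
# which excludes the unconstrained optimum [3, -1].
constrained = minimize(f, x0, bounds=[(0.0, 1.0), (0.0, 1.0)])

print(unconstrained.x)  # approximately [3, -1]
print(constrained.x)    # approximately [1, 0], the closest feasible point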

For example, in machine learning we need to carry out regression analysis and classification, for which we may need to develop a model $y=f({\bf x},{\bf a})$ to fit a set of observed data points ${\cal D}=\{ ({\bf x}_n,\,y_n),\;\;n=1,\cdots,N \}$ by estimating the parameters ${\bf a}=[a_1,\cdots,a_M]^T$ of the function. Typically the form of the function is assumed to be known, such as the linear model $y=f(x)=ax+b$ with parameters $a$ for the slope and $b$ for the intercept, to be estimated based on the observed data.

Different methods can be used to solve this parameter estimation problem. If the least squares method is used, the objective function can be the squared error, the difference between the model output and the given data, which is to be minimized:

$\displaystyle o({\bf a})=\sum_{n=1}^N [y_n-f({\bf x}_n,{\bf a})]^2$ (2)
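For instance, for the linear model $y=ax+b$ above, the squared-error objective in Eq. (2) can be minimized numerically. The sketch below is illustrative only; the synthetic data and the use of SciPy's general-purpose minimize (rather than a closed-form least squares solution) are assumptions made here:

import numpy as np
from scipy.optimize import minimize

# Synthetic data roughly following y = 2x + 1 plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

# Squared-error objective o(a) of Eq. (2) for the linear model f(x, a) = a[0]*x + a[1].
def o(a):
    return np.sum((y - (a[0] * x + a[1]))**2)

res = minimize(o, x0=np.zeros(2))
print(res.x)  # approximately [2, 1], the estimated slope and intercept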

Alternatively, if the Bayesian method is used, the objective function can be the likelihood function of the parameters in ${\bf a}$, which is to be maximized:

$\displaystyle o({\bf a})=L({\bf a}\vert{\cal D}) =p({\cal D}\vert{\bf a})=\prod_{n=1}^N p(y_n\vert{\bf x}_n,{\bf a})$ (3)
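In practice the product in Eq. (3) is usually replaced by the log-likelihood to avoid numerical underflow, and maximizing it is equivalent to minimizing the negative log-likelihood. The following sketch assumes a Gaussian noise model with known variance for the same linear model; the data and the SciPy calls are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data y = a*x + b + Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
sigma = 0.1  # assumed known noise standard deviation

# Negative log-likelihood -sum_n log p(y_n | x_n, a) under the Gaussian assumption.
def neg_log_likelihood(a):
    mean = a[0] * x + a[1]
    return -np.sum(norm.logpdf(y, loc=mean, scale=sigma))

# Maximizing L(a|D) is the same as minimizing the negative log-likelihood.
res = minimize(neg_log_likelihood, x0=np.zeros(2))
print(res.x)  # approximately [2, 1], the maximum-likelihood estimate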

Solving such optimization problems, we get the optimal model parameters in ${\bf a}^*$.

If an optimization problem is unconstrained, and the analytical expression of the objective function is explicitly given, then the solution ${\bf x}^*$ at which the objective function $f({\bf x})$ is minimized may be found by solving the equation system obtained by setting the gradient of $f({\bf x})$ to zero:

$\displaystyle {\bf g}_f=\bigtriangledown_{\bf x} f({\bf x})=\frac{d f({\bf x})}{d{\bf x}}=\left[\frac{\partial f({\bf x})}{\partial x_1},\cdots,\frac{\partial f({\bf x})}{\partial x_N}\right]^T ={\bf0}$ (4)

where ${\bf g}_f=\bigtriangledown_{\bf x} f({\bf x})=d\,f({\bf x})/d{\bf x}$ denotes the gradient vector of the scalar function $f({\bf x})$ with respect to its vector argument ${\bf x}=[x_1,\cdots,x_N]^T$. The root ${\bf x}^*$ of the equation ${\bf g}_f={\bf0}$ is called a critical or stationary point of $f({\bf x})$.
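As a numerical check, the stationary point can be found by treating ${\bf g}_f({\bf x})={\bf 0}$ as a root-finding problem. In the sketch below, the test function $f({\bf x})=(x_1-1)^2+(x_2+2)^2$ and the use of SciPy's fsolve are illustrative assumptions:

import numpy as np
from scipy.optimize import fsolve

# f(x) = (x1 - 1)^2 + (x2 + 2)^2 has gradient g_f(x) = [2(x1 - 1), 2(x2 + 2)]^T.
def gradient(x):
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)])

# Solve g_f(x) = 0 for the stationary (critical) point x*.
x_star = fsolve(gradient, x0=np.zeros(2))
print(x_star)  # approximately [1, -2]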

Whether the function value $f({\bf x}^*)$ at this stationary point is a maximum, a minimum, or a saddle point depends on the Hessian matrix of second-order derivatives of $f({\bf x})$:

$\displaystyle {\bf H}_f=\left[ \begin{array}{ccc}
\frac{\partial^2 f({\bf x})}{\partial x_1^2} & \cdots & \frac{\partial^2 f({\bf x})}{\partial x_1\,\partial x_N}\\
\vdots & \ddots & \vdots\\
\frac{\partial^2 f({\bf x})}{\partial x_N\,\partial x_1} & \cdots & \frac{\partial^2 f({\bf x})}{\partial x_N^2}
\end{array} \right]$

Let ${\bf H}^*={\bf H}_f({\bf x})\vert _{{\bf x}={\bf x}^*}$ be the Hessian evaluated at the stationary point; then $f({\bf x}^*)$ is a minimum if ${\bf H}^*$ is positive definite, a maximum if ${\bf H}^*$ is negative definite, and a saddle point if ${\bf H}^*$ is indefinite.

For example, the gradient vectors and Hessian matrices of the three functions $f_1({\bf x})=x_1^2+x_2^2$, $f_2({\bf x})=-x_1^2-x_2^2$, and $f_3({\bf x})=x_1^2-x_2^2$ are:

$\displaystyle {\bf g}_{f_1}=\left[\begin{array}{c}2x_1\\ 2x_2\end{array}\right],\;\;\;\;
{\bf g}_{f_2}=\left[\begin{array}{r}-2x_1\\ -2x_2\end{array}\right],\;\;\;\;
{\bf g}_{f_3}=\left[\begin{array}{r}2x_1\\ -2x_2\end{array}\right]$

$\displaystyle {\bf H}_{f_1}=\left[\begin{array}{rr}2 & 0\\ 0 & 2\end{array}\right],\;\;\;\;
{\bf H}_{f_2}=\left[\begin{array}{rr}-2 & 0\\ 0 & -2\end{array}\right],\;\;\;\;
{\bf H}_{f_3}=\left[\begin{array}{rr}2 & 0\\ 0 & -2\end{array}\right]$

At the common stationary point ${\bf x}^*=[0,\,0]^T$ of all three functions, at which $f_1({\bf x}^*)=f_2({\bf x}^*)=f_3({\bf x}^*)=0$, $f_1({\bf x}^*)$ is a minimum, $f_2({\bf x}^*)$ is a maximum, and $f_3({\bf x}^*)$ is a saddle point, i.e., $f_3({\bf x}^*)$ is a minimum in the $x_1$ direction but a maximum in the $x_2$ direction.
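These classifications can be checked numerically from the eigenvalue signs of each Hessian at ${\bf x}^*$: all positive (positive definite) gives a minimum, all negative (negative definite) a maximum, and mixed signs (indefinite) a saddle point. The short NumPy sketch below does this; the helper function classify is hypothetical:

import numpy as np

# Hessians of f1, f2, f3, constant here and equal to their values at x* = [0, 0]^T.
H_f1 = np.array([[2.0, 0.0], [0.0, 2.0]])
H_f2 = np.array([[-2.0, 0.0], [0.0, -2.0]])
H_f3 = np.array([[2.0, 0.0], [0.0, -2.0]])

def classify(H):
    # Hypothetical helper: classify a stationary point by the eigenvalue signs of its Hessian.
    eigenvalues = np.linalg.eigvalsh(H)  # real eigenvalues of a symmetric matrix
    if np.all(eigenvalues > 0):
        return "minimum"        # positive definite
    if np.all(eigenvalues < 0):
        return "maximum"        # negative definite
    return "saddle point"       # indefinite (mixed signs)

print(classify(H_f1))  # minimum
print(classify(H_f2))  # maximum
print(classify(H_f3))  # saddle point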

The problems of minimization and equation solving are essentially the same, as each can be converted into the other: minimizing $f({\bf x})$ can be carried out by solving the gradient equation ${\bf g}_f({\bf x})={\bf 0}$, and conversely the roots of an equation system ${\bf h}({\bf x})={\bf 0}$ can be found by minimizing $\Vert{\bf h}({\bf x})\Vert^2$.
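As a small sketch of this equivalence (the test functions and the SciPy calls are illustrative assumptions), a root of ${\bf h}({\bf x})={\bf 0}$ can be found by minimizing $\Vert{\bf h}({\bf x})\Vert^2$, and a minimizer of $f({\bf x})$ can be found by solving its gradient equation:

import numpy as np
from scipy.optimize import minimize, fsolve

# Equation solving as minimization: find a root of h(x) = [x1 + x2 - 3, x1 - x2 - 1]^T
# by minimizing the squared norm ||h(x)||^2, which is zero exactly at a root.
def h(x):
    return np.array([x[0] + x[1] - 3.0, x[0] - x[1] - 1.0])

root = minimize(lambda x: np.sum(h(x)**2), x0=np.zeros(2))
print(root.x)  # approximately [2, 1]

# Minimization as equation solving: minimize f(x) = (x1 - 2)^2 + (x2 - 1)^2
# by solving its gradient equation [2(x1 - 2), 2(x2 - 1)]^T = 0.
minimizer = fsolve(lambda x: np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]), x0=np.zeros(2))
print(minimizer)  # approximately [2, 1]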