
7.3: Multi-Dimensional Continuous Optimization

    • Franz S. Hover & Michael S. Triantafyllou
    • Massachusetts Institute of Technology via MIT OpenCourseWare

    Now we consider the case in which the cost function \(f\) depends on more than one variable: \(X\) is a multi-dimensional space, and \(x\) is a vector. We again restrict our attention to continuous functions. At a minimum, as in the single-dimension case, the first and second derivatives must satisfy certain conditions:

    \begin{align} \nabla f(x^*) \, &= \, \left[ f_{x_1}, \, f_{x_2}, \, \cdots \right]_{x=x^*} = [0, \, 0, \, \cdots] \\[4pt] \nabla^2 f(x) \Big|_{x=x^*} \, &= \, \begin{bmatrix} f_{x_1 x_1} & f_{x_1 x_2} & \cdots \\[4pt] f_{x_2 x_1} & f_{x_2 x_2} & \cdots \\[4pt] \vdots & \vdots & \ddots \end{bmatrix} \, > \, 0, \end{align}where the notation that a matrix is greater than zero denotes that it is positive definite. We will present three practical methods for finding the minimum. First is the method of steepest descent. The idea here is to find the downhill direction and take a step \(\delta\) that way:

    \begin{align} e \, &= \, -\nabla f(x) / ||\nabla f(x)|| \\[4pt] \delta \, &= \, \gamma e. \end{align}

    Note that, as written, this is a different algorithm from the first method given in one dimension, because here the direction vector \(e\) is normalized, so the magnitude of the step in \(x\) is the same no matter how steep the function. We note also that there exists a value \(\alpha\) such that \(x + \alpha e\) is the minimizing argument of \(f\) along the \(e\) direction passing through the point \(x\). This is the result of the so-called line search, and a reasonable steepest descent procedure is to iterate a two-step procedure: a gradient calculation followed by a line search in the downhill direction.
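    The two-step procedure above can be sketched in code. This is a minimal illustration, not an algorithm from the text: the function and variable names are ours, and the line search is a crude backtracking search (shrink the step until the cost decreases) rather than an exact minimization along \(e\).

    ```python
    import numpy as np

    def steepest_descent(f, grad, x0, step=0.5, shrink=0.5, tol=1e-8, max_iter=200):
        """Steepest descent: normalized downhill direction plus a
        simple backtracking line search (illustrative sketch)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            norm = np.linalg.norm(g)
            if norm < tol:            # gradient ~ 0: stationary point reached
                break
            e = -g / norm             # normalized downhill direction
            alpha = step
            # backtracking: shrink alpha until the cost actually decreases
            while f(x + alpha * e) >= f(x) and alpha > 1e-12:
                alpha *= shrink
            x = x + alpha * e
        return x

    # Example: quadratic bowl f(x) = x1^2 + 4 x2^2, minimum at the origin
    f = lambda x: x[0]**2 + 4.0 * x[1]**2
    grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
    x_min = steepest_descent(f, grad, [3.0, 1.0])
    ```

    Because the direction is normalized, the steepness of \(f\) affects only how quickly the line search shrinks \(\alpha\), not the nominal step magnitude.
    
    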

    The performance of successive line searches can be quite good, or very poor. Poor performance results in some cases because the successive downhill directions are constrained to be orthogonal to each other. This has to be the case because at the minimum on the line, the gradient in the direction of the line is zero by definition. In fact none of these downhill directions may actually point to the minimum, and so many pointless steps might be taken. A solution to this is the conjugate gradient method, wherein we make a useful modification to the downhill directions used for each of the line searches.

    We will call the downhill direction vector corresponding with the \(k\)'th guess \(d_k\). Letting \(g(x) = \nabla f(x)\) and \(d_0 = -g(x_0)\), we will let \(d_{k+1} = -g(x_{k+1}) + \beta d_k\). This says that we will deflect the next search direction from the downhill direction by the term \(\beta d_k\); the scalar factor \(\beta\) is given by

    \[ \beta \, = \, \dfrac{g(x_{k+1})^T g(x_{k+1})} {g(x_k)^T g(x_k)}. \]

    Note here that \(d\) is not normalized. The algebra needed to derive this rule is not difficult, and a simple example will illustrate that it is a very powerful one.
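    A short sketch makes the power of this rule concrete. For the quadratic \(f(x) = \tfrac{1}{2} x^T H x - b^T x\) the line search along \(d\) has a closed form, so the whole iteration can be written in a few lines; the names below are illustrative, and the step-size formula is standard for quadratics rather than something stated in the text.

    ```python
    import numpy as np

    def conjugate_gradient(H, b, x0):
        """Minimize f(x) = 0.5 x^T H x - b^T x (H symmetric positive definite)
        by conjugate gradient, using the beta factor from the text.
        Sketch: assumes the gradient stays nonzero during the n steps."""
        x = np.asarray(x0, dtype=float)
        g = H @ x - b                      # gradient of the quadratic
        d = -g                             # first direction: plain downhill
        for _ in range(len(x)):            # exact answer in at most n steps
            alpha = (g @ g) / (d @ H @ d)  # exact line search along d
            x = x + alpha * d
            g_new = H @ x - b
            beta = (g_new @ g_new) / (g @ g)  # deflection factor from the text
            d = -g_new + beta * d
            g = g_new
        return x

    # 2-D quadratic: the minimum solves H x = b, reached in exactly 2 steps
    H = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x_min = conjugate_gradient(H, b, [0.0, 0.0])
    ```

    Note how each new direction is the downhill direction deflected by \(\beta d_k\), which is exactly what frees the method from the orthogonality constraint of successive line searches.
    
    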

    Finally, we mention Newton’s second-order method in multiple dimensions, using the second derivative. It comes from the multivariable version of the Taylor series expansion:

    \[ f(x + \delta) \, \approx \, f(x) + \nabla f(x) \delta + \dfrac{1}{2} \delta^T \nabla^2 f(x) \delta. \]

    Following from the one-dimensional case, we try to select \(\delta\) so as to cause \(\partial f(x + \delta) / \partial \delta = 0\). This gives

    \begin{align} - \nabla f(x) \, &= \, \delta^T \nabla^2 f(x) \quad \Longrightarrow \\[4pt] \delta \, &= \, -[\nabla^2 f(x)]^{-1} \nabla f(x), \end{align}where the second line uses the symmetry of the Hessian matrix.

    In words, Newton’s method takes the first- and second-derivative information and tries to jump to the solution in a single step. The algorithm has extremely good convergence properties and is recommended when second derivatives are available.
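    The Newton update is simple enough to write directly. This sketch (names ours) solves the linear system for \(\delta\) rather than forming the Hessian inverse explicitly, which is the usual numerically preferable choice:

    ```python
    import numpy as np

    def newton_step(grad, hess, x):
        """One Newton step: solve [Hessian] delta = -gradient for delta,
        then move to x + delta (illustrative sketch)."""
        delta = np.linalg.solve(hess(x), -grad(x))
        return x + delta

    # Quadratic f(x) = 0.5 x^T H x - b^T x: one step lands on the minimum,
    # which solves H x = b
    H = np.array([[2.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, 1.0])
    grad = lambda x: H @ x - b
    hess = lambda x: H
    x_min = newton_step(grad, hess, np.array([5.0, -3.0]))
    ```

    For a non-quadratic \(f\), the same step would be applied repeatedly until the gradient norm falls below a tolerance.
    
    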

    It is important to understand that both the conjugate gradient method and Newton’s second-order method get the exact answer when the function is in fact quadratic: conjugate gradient after at most \(n\) line searches in \(n\) dimensions (two in the plane), and Newton in a single step. Thinking of computational cost, the conjugate gradient algorithm needs the gradient vector at two points per update, whereas Newton needs the gradient plus the Hessian matrix at the starting point. If these derivatives have to be computed numerically and accurately, the Hessian can clearly be quite expensive. For this reason, the conjugate gradient method may be preferred.

    Figure \(\PageIndex{1}\): Behavior of the steepest-descent versus conjugate gradient methods in locating the optimal point. Steepest descent proceeds toward the target in straight, mutually orthogonal segments, which is less efficient than the conjugate gradient method, whose path bends smoothly toward the minimum.

    This page titled 7.3: Multi-Dimensional Continuous Optimization is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Franz S. Hover & Michael S. Triantafyllou (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.