Newton-Raphson Optimization
Steve Kroon
Disclaimer: these notes do not explicitly indicate whether values are vectors or scalars, but expect the reader to discern this from the context.
The method makes use of the derivative of the function, so we need to be able to obtain that to use this approach: the update rule used is $x_{k+1} = x_k - f(x_k)/f'(x_k)$,
or, for the multivariate case where $f : \mathbb{R}^n \to \mathbb{R}^n$, $x_{k+1} = x_k - [J_f(x_k)]^{-1} f(x_k)$, where $J_f(x)$ is the Jacobian matrix of $f$ at $x$.

Let us try to develop some intuition for this. Suppose we want to find the root of a linear function of one variable, $f(x) = ax + b$. We know that the root is $-b/a$. Let us see how Newton's method yields this answer, using $a = 3$ and $b = 2$, starting from an initial guess of $x_0 = 4$. Now $f(4) = 14$ and $f'(4) = 3$. Thus, for each unit we increase $x$ by, we increase $f(x)$ by 3. Since to get to zero from $x_0$ we need a reduction of 14, our next guess should be $14/3$ less than the current guess. Thus we use $x_1 = x_0 - f(x_0)/f'(x_0) = 4 - 14/3 = -2/3$ as our next guess, and this is the zero of the function. For linear functions, it should be obvious that this approach will always find the root in one step.

What does this approach do when trying to find the root of a quadratic function? It approximates the function at $x_k$ using the tangent line to the parabola, identifies the zero of that line, and uses that as the next estimate $x_{k+1}$. This intuition carries over to any univariate function. Note that the step size is automatically determined. (In one dimension, the direction is simply whether to increase or decrease $x_k$.)

The key insight into using Newton's root-finding method for optimization is to realize that optimizing a scalar function of $n$ variables is the same as finding the zeroes of an $n$-dimensional vector-valued function of $n$ variables: $\nabla f(x) = 0$. Thus, given an estimate of the optimum $x_k$, we would like to use the update rule $x_{k+1} = x_k - [J_{\nabla f}(x_k)]^{-1} \nabla f(x_k)$.

Let us consider optimizing a quadratic function of a single variable, $f(x) = ax^2 + bx + c$, where we know the optimum is at $-b/2a$. In this case, $f'(x) = 2ax + b$. Thus, starting at a point $x_0$, our update rule specifies $x_1 = x_0 - (2a)^{-1}(2ax_0 + b) = -b/2a$, so that the optimum is found in a single step.
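To make the one-dimensional updates concrete, here is a small illustrative Python sketch (the function names are my own, not part of the notes) that applies the root-finding update to the linear example above and the optimization update to a quadratic:

```python
def newton_root_step(f, f_prime, x):
    """One Newton-Raphson root-finding update: x - f(x)/f'(x)."""
    return x - f(x) / f_prime(x)

def newton_opt_step(f_prime, f_double_prime, x):
    """One Newton-Raphson optimization update: x - f'(x)/f''(x)."""
    return x - f_prime(x) / f_double_prime(x)

# Root of f(x) = 3x + 2, starting from x0 = 4: one step reaches -2/3.
f = lambda x: 3 * x + 2
fp = lambda x: 3.0
print(newton_root_step(f, fp, 4.0))        # -0.666...

# Optimum of f(x) = a x^2 + b x + c, reached from any x0 in one step: -b/(2a).
a, b = 1.0, 4.0
fp = lambda x: 2 * a * x + b
fpp = lambda x: 2 * a
print(newton_opt_step(fp, fpp, 10.0))      # -2.0  (= -b/(2a))
```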
When the vector-valued function of a vector is already a vector of partial derivatives of a real-valued function $f$, as in our case, it follows that the contents of the Jacobian matrix are second-order derivatives, of the form $\frac{\partial^2 f}{\partial x_{(j)} \partial x_{(i)}}$, at some point $x$. (Here, the bracketed subscripts refer to components of the vector $x$.) Such a matrix is called the Hessian matrix of the function $f$ at $x$, $H_f(x)$, and is also denoted by $\nabla^2 f(x)$. In this notation, Newton's rule becomes
$$x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k) \quad\text{or}\quad x_{k+1} - x_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k).$$
The Hessian matrix is relevant because it emerges as the natural high-dimensional analog of the second derivative of a function when approximating the function using Taylor expansions: the equivalent of
$$f(x + \epsilon) \approx f(x) + \epsilon f'(x) + \tfrac{1}{2}\epsilon^2 f''(x)$$
for vector-valued inputs turns out to be
$$f(x + \epsilon) \approx f(x) + \epsilon^T \nabla f(x) + \tfrac{1}{2}\epsilon^T H_f(x)\,\epsilon,$$
noting that $\nabla f = J_f^T$ for scalar functions.
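As an illustration of the multivariate rule, here is a short Python sketch under my own naming (not from the notes). It translates the formula directly, using an explicit inverse purely to mirror the notation; the implementation notes below describe the preferred linear-solve form.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One multivariate Newton step: x_{k+1} = x_k - [H_f(x_k)]^{-1} grad f(x_k)."""
    return x - np.linalg.inv(hess(x)) @ grad(x)

# Example: f(x, y) = x^2 + 3y^2 + 2x, a quadratic with optimum at (-1, 0).
grad = lambda x: np.array([2 * x[0] + 2, 6 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(newton_step(grad, hess, np.array([5.0, 5.0])))  # [-1.  0.] in one step
```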
Implementation notes
Avoid inverting the Hessian. When implementing Newton-Raphson optimization in practice, one would like to avoid the computationally expensive process of inverting the Hessian. Noting that we are only actually interested in obtaining $x_{k+1}$, and since we already have $x_k$ and $\nabla f(x_k)$, we can bypass the inversion step by left-multiplying the update rule by the Hessian to obtain the system of equations
$$H_f(x_k)[x_{k+1} - x_k] = -\nabla f(x_k),$$
which can then be solved directly to yield $x_{k+1} - x_k$. Solving this system of equations is beneficial since it is less computationally demanding than inverting the Hessian.

When to stop. Unless the function being optimized is quadratic, Newton-Raphson optimization is unlikely to ever find the exact optimum, but instead approaches it very closely. As such, one should terminate the algorithm when the current estimate is satisfactory. To ensure this, it is useful to evaluate the function at each estimate, and terminate when the relative improvement in successive steps becomes too small: a rule of thumb is to stop after a relative improvement of less than 1%. In some problematic cases, Newton-Raphson optimization can behave badly: for example, estimate updates may overshoot the minimum, leading to oscillation or slow convergence.³ To address cases like this, one approach is to limit the number of iterations of Newton-Raphson updates. If the limit is reached, the best estimate so far can be output, with a warning about the non-convergence.
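These implementation notes can be combined into a short sketch. The code below is my own illustration (the name newton_optimize and the test function are not from the notes), assuming NumPy: it solves $H_f(x_k)\,d = -\nabla f(x_k)$ with np.linalg.solve rather than inverting the Hessian, stops when the relative improvement in the function value drops below 1%, and caps the number of iterations, warning and returning the best estimate seen if the cap is reached.

```python
import warnings
import numpy as np

def newton_optimize(f, grad, hess, x0, rel_tol=0.01, max_iter=50):
    """Newton-Raphson optimization following the implementation notes above."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    f_prev = best_f
    for _ in range(max_iter):
        # Solve H_f(x) d = -grad f(x) for the step d, instead of inverting H_f(x).
        d = np.linalg.solve(hess(x), -grad(x))
        x = x + d
        f_curr = f(x)
        if f_curr < best_f:
            best_x, best_f = x.copy(), f_curr
        # Stop once the relative improvement drops below rel_tol (1% by default).
        if abs(f_prev - f_curr) < rel_tol * abs(f_prev):
            return x
        f_prev = f_curr
    warnings.warn("Newton-Raphson did not converge; returning best estimate seen.")
    return best_x

# Example usage on a non-quadratic function, f(x, y) = x^4 + x^2 + y^2 + 5:
f = lambda x: x[0] ** 4 + x[0] ** 2 + x[1] ** 2 + 5
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])
print(newton_optimize(f, grad, hess, [2.0, 3.0]))  # returns a point close to the optimum (0, 0)
```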
³ In the case of overshooting, it can help to reduce the step size by half.