Newton-Raphson Optimization
Steve Kroon
Disclaimer: these notes do not explicitly indicate whether values are vectors or scalars, but expect the reader to discern this from the context.
The method makes use of the derivative of the function, so we need to be able to obtain that to use this approach: the update rule used is $x_{k+1} = x_k - f(x_k)/f'(x_k)$,
or, for the multivariate case where $f : \mathbb{R}^n \to \mathbb{R}^n$, $x_{k+1} = x_k - [J_f(x_k)]^{-1} f(x_k)$, where $J_f(x)$ is the Jacobian matrix of $f$ at $x$.

Let us try to develop some intuition for this. Suppose we want to find the root of a linear function of one variable, $f(x) = ax + b$. We know that the root is $-b/a$. Let us see how Newton's method yields this answer, using $a = 3$ and $b = 2$, starting from an initial guess of $x_0 = 4$. Now $f(4) = 14$ and $f'(4) = 3$. Thus, for each unit we increase $x$ by, we increase $f(x)$ by 3. Since to get to zero from $x_0$ we need a reduction of 14, our next guess should be $14/3$ less than the current guess. Thus we use $x_1 = x_0 - f(x_0)/f'(x_0) = 4 - 14/3 = -2/3$ as our next guess, and this is the zero of the function. For linear functions, it should be obvious that this approach will always find the root in one step.

What does this approach do when trying to find the root of a quadratic function? It approximates the function at $x_k$ using the tangent line to the parabola, identifies the zero of that line, and uses that as the next estimate $x_{k+1}$. This intuition carries over to any univariate function. Note that the step size is automatically determined. (In one dimension, the direction is simply whether to increase or decrease $x_k$.)

The key insight into using Newton's root-finding method for optimization is to realize that optimizing a scalar function of $n$ variables is the same as finding the zeroes of an $n$-dimensional vector-valued function of $n$ variables: $\nabla f(x) = 0$. Thus, given an estimate of the optimum $x_k$, we would like to use the update rule $x_{k+1} = x_k - [J_{\nabla f}(x_k)]^{-1} \nabla f(x_k)$.

Let us consider optimizing a quadratic function of a single variable, $f(x) = ax^2 + bx + c$, where we know the optimum is at $-b/2a$. In this case, $f'(x) = 2ax + b$. Thus, starting at a point $x_0$, our update rule specifies $x_1 = x_0 - (2a)^{-1}(2ax_0 + b) = -b/2a$, so that the optimum is found in a single step.
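To make the one-dimensional updates concrete, here is a small illustrative Python sketch (the function names are my own, not part of the notes) that applies the root-finding update to the linear example above and the optimization update to a quadratic:

```python
def newton_root_step(f, f_prime, x):
    """One Newton-Raphson root-finding update: x - f(x)/f'(x)."""
    return x - f(x) / f_prime(x)

def newton_opt_step(f_prime, f_double_prime, x):
    """One Newton-Raphson optimization update: x - f'(x)/f''(x)."""
    return x - f_prime(x) / f_double_prime(x)

# Root of f(x) = 3x + 2, starting from x0 = 4: one step reaches -2/3.
f = lambda x: 3 * x + 2
fp = lambda x: 3.0
print(newton_root_step(f, fp, 4.0))        # -0.666...

# Optimum of f(x) = a x^2 + b x + c, reached from any x0 in one step: -b/(2a).
a, b = 1.0, 4.0
fp = lambda x: 2 * a * x + b
fpp = lambda x: 2 * a
print(newton_opt_step(fp, fpp, 10.0))      # -2.0  (= -b/(2a))
```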
When the vector-valued function of a vector is already a vector of partial derivatives of a real-valued function $f$, as in our case, it follows that the contents of the Jacobian matrix are second-order derivatives, of the form $\frac{\partial^2 f}{\partial x_{(j)} \partial x_{(i)}}$, at some point $x$. (Here, the bracketed subscripts refer to components of the vector $x$.) Such a matrix is called the Hessian matrix of the function $f$ at $x$, $H_f(x)$, and is also denoted by $\nabla^2 f(x)$. In this notation, Newton's rule becomes
$$x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k) \quad\text{or}\quad x_{k+1} - x_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k).$$
The Hessian matrix is relevant because it emerges as the natural high-dimensional analog of the second derivative of a function when approximating the function using Taylor expansions: the equivalent of
$$f(x + \epsilon) \approx f(x) + \epsilon f'(x) + \tfrac{1}{2}\epsilon^2 f''(x)$$
for vector-valued inputs turns out to be
$$f(x + \epsilon) \approx f(x) + \epsilon^T \nabla f(x) + \tfrac{1}{2}\epsilon^T H_f(x)\,\epsilon,$$
noting that $\nabla f = J_f^T$ for scalar functions.
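As an illustration of the multivariate rule, here is a short Python sketch under my own naming (not from the notes). It translates the formula directly, using an explicit inverse purely to mirror the notation; the implementation notes below describe the preferred linear-solve form.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One multivariate Newton step: x_{k+1} = x_k - [H_f(x_k)]^{-1} grad f(x_k)."""
    return x - np.linalg.inv(hess(x)) @ grad(x)

# Example: f(x, y) = x^2 + 3y^2 + 2x, a quadratic with optimum at (-1, 0).
grad = lambda x: np.array([2 * x[0] + 2, 6 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(newton_step(grad, hess, np.array([5.0, 5.0])))  # [-1.  0.] in one step
```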
Implementation notes
Avoid inverting the Hessian. When implementing Newton-Raphson optimization in practice, one would like to avoid the computationally expensive process of inverting the Hessian. Noting that we are only actually interested in obtaining $x_{k+1}$, and since we already have $x_k$ and $\nabla f(x_k)$, we can bypass the inversion step by left-multiplying the update rule by the Hessian to obtain the system of equations
$$H_f(x_k)[x_{k+1} - x_k] = -\nabla f(x_k),$$
which can then be solved directly to yield $x_{k+1} - x_k$. Solving this system of equations is beneficial since it is less computationally demanding than inverting the Hessian.

When to stop. Unless the function being optimized is quadratic, Newton-Raphson optimization is unlikely to ever find the exact optimum, but instead approaches it very closely. As such, one should terminate the algorithm when the current estimate is satisfactory. To ensure this, it is useful to evaluate the function at each estimate, and terminate when the relative improvement in successive steps becomes too small: a rule of thumb is to stop after a relative improvement of less than 1%. In some problematic cases, Newton-Raphson optimization can behave badly: for example, estimate updates may overshoot the minimum, leading to oscillation or slow convergence.³ To address cases like this, one approach is to limit the number of iterations of Newton-Raphson updates. If the limit is reached, the best estimate so far can be output, with a warning about the non-convergence.
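These implementation notes can be combined into a short sketch. The code below is my own illustration (the name newton_optimize and the test function are not from the notes), assuming NumPy: it solves $H_f(x_k)\,d = -\nabla f(x_k)$ with np.linalg.solve rather than inverting the Hessian, stops when the relative improvement in the function value drops below 1%, and caps the number of iterations, warning and returning the best estimate seen if the cap is reached.

```python
import warnings
import numpy as np

def newton_optimize(f, grad, hess, x0, rel_tol=0.01, max_iter=50):
    """Newton-Raphson optimization following the implementation notes above."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    f_prev = best_f
    for _ in range(max_iter):
        # Solve H_f(x) d = -grad f(x) for the step d, instead of inverting H_f(x).
        d = np.linalg.solve(hess(x), -grad(x))
        x = x + d
        f_curr = f(x)
        if f_curr < best_f:
            best_x, best_f = x.copy(), f_curr
        # Stop once the relative improvement drops below rel_tol (1% by default).
        if abs(f_prev - f_curr) < rel_tol * abs(f_prev):
            return x
        f_prev = f_curr
    warnings.warn("Newton-Raphson did not converge; returning best estimate seen.")
    return best_x

# Example usage on a non-quadratic function, f(x, y) = x^4 + x^2 + y^2 + 5:
f = lambda x: x[0] ** 4 + x[0] ** 2 + x[1] ** 2 + 5
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])
print(newton_optimize(f, grad, hess, [2.0, 3.0]))  # returns a point close to the optimum (0, 0)
```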
³ In the case of overshooting, it can help to reduce the step size by half.