Solution of NonLinear Inverse Problems and The Levenberg-Marquardt Method
Jose Pujol¹
ABSTRACT
Although the Levenberg-Marquardt damped least-squares method is an extremely powerful tool for the iterative solution of nonlinear problems, its theoretical basis has not been described adequately in the literature. This is unfortunate, because Levenberg and Marquardt approached the solution of nonlinear problems in different ways and presented results that go far beyond the simple equation that characterizes the method. The idea of damping the solution was introduced by Levenberg, who also showed that it is possible to do that while at the same time reducing the value of a function that must be minimized iteratively. This result is not obvious, although it is taken for granted. Moreover, Levenberg derived a solution more general than the one currently used. Marquardt started with the current equation and showed that it interpolates between the ordinary least-squares method and the steepest-descent method. In this tutorial, the two papers are combined into a unified presentation, which will help the reader gain a better understanding of what happens when solving nonlinear problems. Because the damped least-squares and steepest-descent methods are intimately related, the latter is also discussed, in particular in its relation to the gradient. When the inversion parameters have the same dimensions (and units), the direction of steepest descent is equal to the direction of minus the gradient. In other cases, it is necessary to introduce a metric (i.e., a definition of distance in the parameter space) to establish a relation between the two directions. Although neither Levenberg nor Marquardt discussed these matters, their results imply the introduction of a metric. Some of the concepts presented here are illustrated with the inversion of synthetic gravity data corresponding to a buried sphere of unknown radius and depth. Finally, the work done by early researchers that rediscovered the damped least-squares method is put into a historical context.
INTRODUCTION
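Anyone working on inverse problems will immediately recognize the equation

$$(\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I})\,\boldsymbol{\delta} = \mathbf{A}^T\mathbf{c}, \tag{1}$$

where $\lambda \ge 0$ and $\mathbf{I}$ is the identity matrix. The symbols used may be different, but the meaning of equation 1 will be clear, namely it represents the Levenberg-Marquardt damped least-squares solution of the equation

$$\mathbf{A}\boldsymbol{\delta} = \mathbf{c}. \tag{2}$$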
Equation 1 arises in a number of situations. The one to be discussed here is the solution of linearized nonlinear inverse problems, with $\boldsymbol{\delta}$ a vector of adjustments (or corrections) to the values of the parameters about which the problem is linearized. Once equation 1 is solved, $\boldsymbol{\delta}$ is added to the initial vector of parameters and the resulting vector is used as a new initial vector. In this way an iterative process is established, but as early workers noted, convergence is not assured when $\boldsymbol{\delta}$ is computed using ordinary least squares (i.e., when equation 1 with $\lambda = 0$ is used). A typical reason for the lack of convergence is that the initial values of the parameters are far from the values that actually solve the nonlinear problem, which means that the assumptions behind the linearization of the problem are not valid. As a consequence, the absolute values of the components of $\boldsymbol{\delta}$ may become larger or oscillate as the iterations proceed, instead of going through the steady decrease that characterizes a convergent process. This problem is solved by Levenberg (1944) and, independently, by Marquardt (1963), who use different approaches that lead to similar solutions. The purpose of this tutorial is to go through their arguments in detail, which will help to gain a better understanding of what really
Manuscript received by the Editor June 27, 2006; revised manuscript received January 10, 2007; published online May 30, 2007. 1 University of Memphis, Department of Earth Sciences, Memphis, Tennessee. E-mail: [email protected]. 2007 Society of Exploration Geophysicists. All rights reserved.
happens when solving a linearized nonlinear problem. To give the readers a flavor of the matters to be discussed, the basic features of the two approaches are summarized below.

Levenberg (1944) solves the problem of the lack of convergence by introducing (and naming) the damped least-squares method. The basic idea is to damp (i.e., limit) the values of the parameters at each iteration. Specifically, instead of using the function $S$ (see below), whose minimization leads to the ordinary least-squares solution, Levenberg minimizes the function $\bar{S} = wS + Q$, where $w > 0$ and $Q$ is a linear combination of the components of $\boldsymbol{\delta}$ squared. The result of this minimization is a generalization of equation 1, with $\mathbf{I}$ replaced by a diagonal matrix $\mathbf{D}$ with nonnegative elements. Levenberg's contribution to the solution of the problem did not stop here, however. Equally important are his proofs that the minimization of $\bar{S}$ leads to a decrease in the values of $S$ and of the function whose linearization leads to $S$ (i.e., the function $s$ below). The reduction in the value of $s$ does not occur for all values of $w$ (which is equal to $1/\lambda$ when $\mathbf{D} = \mathbf{I}$), and Levenberg suggests a way to find the value of $w$ leading to a reduction in the value of $s$. Another important result is that the $Q$ corresponding to the damped least-squares solution is always (i.e., for all $w$) smaller than when damping is not applied. These results are not obvious, but are rarely considered when the damped least-squares method is introduced in the literature.

Marquardt (1963), on the other hand, starts with equation 1 and investigates the angle between the computed $\boldsymbol{\delta}$ and the direction of steepest descent of $s$ (equal to $-\nabla s$, where $\nabla$ stands for gradient). When $\lambda$ goes to infinity, the contribution of $\mathbf{A}^T\mathbf{A}$ in equation 1 becomes negligible and the result is the equation used in the method of steepest descent. This method generally produces a significant reduction in the value of $s$ in the early iterations, but becomes extremely slow after that, to the point that convergence to the solution may not be achieved even after a large number (hundreds or more) of iterations (e.g., Gill et al., 1981, and the example below). On the other hand, the ordinary least-squares method (known as the Gauss-Newton method) has the ability to converge to the solution quickly when the starting point is close to the solution (and even when far from it, as the example below shows). Marquardt proves that equation 1 interpolates between the two methods, and shows that the angle between $\boldsymbol{\delta}$ and $-\nabla s$ is a monotonically decreasing function of $\lambda$, with the angle going to zero as $\lambda$ goes to infinity. Based on this fact, Marquardt proposes a simple algorithm in which at each iteration the value of $\lambda$ is modified to assure that the corresponding value of $s$ becomes smaller than in the previous iteration. Marquardt also recognizes that the term $\lambda\mathbf{I}$ in equation 1 assures that the matrix in parentheses is better conditioned than $\mathbf{A}^T\mathbf{A}$ and that the angle between $\boldsymbol{\delta}$ and $-\nabla s$ is always less than 90°. If this condition is not met, the iterative process may not be convergent.

Although neither Levenberg nor Marquardt discusses the steepest-descent method itself, this tutorial would be incomplete without consideration of the relation between the direction of steepest descent and the gradient, which is not unique when inversion parameters have different dimensions or units. In such cases, it is not obvious how to measure the distance between two points in parameter space, and as a result, equating the direction of steepest descent to the direction of minus the gradient becomes meaningless (Feder, 1963). It is only when a definition of distance (i.e., a metric) is introduced that the two directions become uniquely related. These questions will be discussed first to put Levenberg's and Marquardt's approaches into a broader perspective, as they involve, either directly or indirectly, the introduction of a metric.
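The interpolation property attributed to Marquardt can be checked numerically. The following is a minimal sketch, not taken from the paper: the 3 × 2 matrix `A` and the vector `c` are hypothetical values chosen only for illustration, and the helper names `damped_step` and `angle_deg` are assumptions. It solves equation 1 for several values of $\lambda$ and reports the length of $\boldsymbol{\delta}$ and its angle with $\mathbf{A}^T\mathbf{c}$, which is proportional to $-\nabla s$.

```python
import numpy as np

def damped_step(A, c, lam):
    """Solve (A^T A + lam*I) delta = A^T c (equation 1) for delta."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ c)

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    cosang = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

# Hypothetical linearized system, used only for illustration.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = np.array([1.0, 0.0, -1.0])
descent = A.T @ c   # proportional to -grad s

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
steps = [damped_step(A, c, lam) for lam in lams]
norms = [np.linalg.norm(d) for d in steps]
angles = [angle_deg(d, descent) for d in steps]

for lam, nrm, ang in zip(lams, norms, angles):
    print(f"lambda = {lam:7.1f}   |delta| = {nrm:8.5f}   angle = {ang:7.3f} deg")
```

For this small system, $\lambda = 0$ gives the ordinary least-squares adjustment, and increasing $\lambda$ both shortens the step and rotates it toward the steepest-descent direction, with the angle staying below 90° throughout.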
The concepts introduced here are illustrated with a simple example involving the inversion of gravity data corresponding to a buried sphere. In this case, the unknown parameters are the radius of the sphere and the depth to its center. By limiting the number of inversion parameters to two, it is easy to visualize both the function $s$ and the path followed by the parameters as a function of the iterations for different initial values of the parameters and $\lambda$, and for solutions obtained using the damped and ordinary least-squares methods and the steepest-descent method.

This tutorial concludes with a historical note. Although Levenberg's paper was published almost twenty years before Marquardt's, it went almost unnoticed in spite of its practical importance. Interestingly, an internet search uncovered a paper by Feder (1963) on computerized lens design that shows that ideas similar to those of Levenberg had been rediscovered more than once. Feder's paper, in turn, led to a paper by Wynne (1959), which anticipates some of the ideas in Marquardt's approach. Yet the fact remains that it was Marquardt's paper that popularized the damped least-squares method, a fact he attributed to his having distributed hundreds of copies of his FORTRAN code!
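Let the observations $o_i$ be approximated by model values that depend on the parameters $\mathbf{x}$:

$$o_i \approx f(\mathbf{v}_i, \mathbf{x}) \equiv f_i(\mathbf{x}); \qquad i = 1, \ldots, m, \tag{3}$$

and define the residuals $\epsilon_i(\mathbf{x})$ as

$$\epsilon_i(\mathbf{x}) = o_i - f_i(\mathbf{x}); \qquad i = 1, \ldots, m. \tag{4}$$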
We are interested in finding the set of parameters $x_j$ that minimize the sum of residuals squared, namely

$$s(\mathbf{x}) = \sum_{i=1}^{m} \epsilon_i^2(\mathbf{x}). \tag{5}$$

A function that measures the misfit between observations and model values, such as $s(\mathbf{x})$, is known as a merit function. Other terms found in the parameter estimation and optimization literature are objective, loss, and risk function. If $f_i(\mathbf{x})$ is a nonlinear function of the parameters, the minimization of equation 5 generally requires the use of numerical methods. A typical approach is to express $f_i(\mathbf{x})$ in terms of its linearized Taylor expansion about an initial solution $x_j^o$ ($j = 1, \ldots, n$) at which $s$ does not have a stationary point. This gives

$$f_i(\mathbf{x}) \approx f_i(\mathbf{x}^o) + \sum_{j=1}^{n} \left.\frac{\partial f_i}{\partial x_j}\right|_{\mathbf{x}=\mathbf{x}^o} (x_j - x_j^o); \qquad i = 1, \ldots, m, \tag{6}$$

where $\mathbf{x}^o$ has components $x_j^o$. Using this expression with equation 3, we can introduce a new set of residuals
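$$r_i = o_i - f_i(\mathbf{x}^o) - \sum_{j=1}^{n} a_{ij}\delta_j, \tag{7}$$

where

$$\delta_j = x_j - x_j^o, \tag{8}$$

and

$$a_{ij} = \left.\frac{\partial f_i}{\partial x_j}\right|_{\mathbf{x}=\mathbf{x}^o}. \tag{9}$$

Note that $f_i(\mathbf{x}^o)$ and the derivatives have specific numerical values, while the $\delta_j$ are unknown. Equation 7 will be written in matrix form as

$$\mathbf{r} = \mathbf{c} - \mathbf{A}\boldsymbol{\delta}, \tag{10}$$

where $\mathbf{r}$ and $\boldsymbol{\delta}$ have components $r_i$ and $\delta_i$, $\mathbf{A}$ denotes the matrix of derivatives, and $\mathbf{c}$ is the vector with components

$$c_i = o_i - f_i(\mathbf{x}^o). \tag{11}$$

Now we will look for the vector $\boldsymbol{\delta}$ that minimizes

$$S(\mathbf{x}) = \sum_{i=1}^{m} r_i^2 = \mathbf{r}^T\mathbf{r} = (\mathbf{c}^T - \boldsymbol{\delta}^T\mathbf{A}^T)(\mathbf{c} - \mathbf{A}\boldsymbol{\delta}) = \mathbf{c}^T\mathbf{c} - 2\mathbf{c}^T\mathbf{A}\boldsymbol{\delta} + \boldsymbol{\delta}^T\mathbf{A}^T\mathbf{A}\boldsymbol{\delta}. \tag{12}$$

The superscript $T$ indicates transposition. Before proceeding, however, it is necessary to make a comment on notation. Strictly speaking, the right-hand side of equation 12 is a function of $\boldsymbol{\delta}$, but because the $\mathbf{x}$ that minimizes equation 5 will be determined iteratively, we will be interested in deriving results involving $\mathbf{x}$, which from equation 8 is equal to

$$\mathbf{x} = \mathbf{x}^o + \boldsymbol{\delta}. \tag{13}$$

Therefore, $S$ can be considered a function of $\mathbf{x}$. The minimization of $S$ requires computing its derivative with respect to $\boldsymbol{\delta}$ and setting it equal to zero:

$$\frac{\partial S}{\partial \boldsymbol{\delta}} = \left(\frac{\partial S}{\partial \delta_1}, \frac{\partial S}{\partial \delta_2}, \ldots\right)^T = -2\mathbf{A}^T\mathbf{c} + 2\mathbf{A}^T\mathbf{A}\boldsymbol{\delta} = \mathbf{0}, \tag{14}$$

which gives

$$\mathbf{A}^T\mathbf{A}\boldsymbol{\delta} = \mathbf{A}^T\mathbf{c}. \tag{15}$$

In this section, we will assume that $(\mathbf{A}^T\mathbf{A})^{-1}$ exists, which means that equation 15 can be solved for $\boldsymbol{\delta}$. When this assumption is not valid, a different method (such as damped least-squares) should be used. Now it remains to show that the $\boldsymbol{\delta}$ obtained from equation 15 minimizes $S$. To see that, we must examine the Hessian of $S$, $\mathbf{H}_S$, which is the matrix of second derivatives of $S$ with respect to the components of $\boldsymbol{\delta}$. Because the quadratic terms in equation 12 are of the form $(\mathbf{A}^T\mathbf{A})_{mn}\delta_m\delta_n$ and $\mathbf{A}^T\mathbf{A}$ is symmetric,

$$(\mathbf{H}_S)_{kl} = \frac{\partial^2 S}{\partial \delta_k\,\partial \delta_l} = 2(\mathbf{A}^T\mathbf{A})_{kl}. \tag{16}$$

A condition for a minimum of $S$ to exist is that $\mathbf{H}_S$ be positive definite (e.g., Apostol, 1969), which is the case when $(\mathbf{A}^T\mathbf{A})^{-1}$ exists (see Appendix A). These results allow us to establish, in principle, the following iterative process. Solve equation 15 for $\boldsymbol{\delta}$, use equation 13 to compute $\mathbf{x}$, which becomes a new $\mathbf{x}^o$, and use it to compute new values of $\mathbf{A}$ and $\mathbf{c}$ (using equations 9 and 11). Then solve for $\boldsymbol{\delta}$ again. To make the process clear, we will describe the two steps that lead to the estimate $\mathbf{x}^{(p+1)}$ for the $(p+1)$th iteration. First, solve

$$(\mathbf{A}^T\mathbf{A})^{(p)}\boldsymbol{\delta}^{(p)} = (\mathbf{A}^T\mathbf{c})^{(p)}, \tag{17}$$

where the superscript $(p)$ indicates iteration number and $\mathbf{A}$ and $\mathbf{c}$ have components

$$a_{ij}^{(p)} = \left.\frac{\partial f_i}{\partial x_j}\right|_{\mathbf{x}=\mathbf{x}^{(p)}}; \qquad c_i^{(p)} = o_i - f_i(\mathbf{x}^{(p)}). \tag{18}$$

Second, compute

$$\mathbf{x}^{(p+1)} = \mathbf{x}^{(p)} + \boldsymbol{\delta}^{(p)}. \tag{19}$$

We will let $\mathbf{x}^{(o)} = \mathbf{x}^o$ and will apply the superscript $(p)$ to any function of $\mathbf{x}$ computed at $\mathbf{x}^{(p)}$. To stop the iterative process, one can introduce conditions such as

$$|\delta_j^{(p)}| \le \delta_j^{\min}, \tag{20}$$

with similar conditions (equations 21 and 22) involving $s^{(p+1)}$ and $p$, where the values on the right sides of these equations are preestablished values. The iterative minimization method with the $\boldsymbol{\delta}^{(p)}$ in equation 19 computed using equation 17 is known as the Gauss-Newton method, and can be derived using a different approach (e.g., Gill et al., 1981; see also the note at the end of the next section). As noted in the Introduction, however, a problem with this method is that $\boldsymbol{\delta}^{(o)}$ may be too large if $\mathbf{x}^o$ is too far from its optimal value, which in turn may lead to a nonconvergent iterative process. The following example will illustrate some of the features of the method.

Example 1a

Consider a buried homogeneous sphere of radius $a$ with center at $(y_o, z)$, where $y_o$ is measured along the $y$-axis (horizontal) and $z$ is depth. The vertical component of the gravitational attraction caused by the sphere at a point $y_i$ at zero depth is given by

$$g(y_i, z, a) = \frac{4}{3}\pi\gamma\,\frac{D a^3 z}{[(y_i - y_o)^2 + z^2]^{3/2}} \tag{23}$$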
(e.g., Dobrin, 1976), where $\gamma$ is the gravitational constant and $D$ is the density contrast (equal to the difference of the densities of the sphere and the surrounding medium, assumed homogeneous). For distance and density in km and g/cm³ and gravity in mGal (used here), the numerical value of $\gamma$ is 6.672. The inverse problem that we will solve is the following. Given $m$ gravity values $G_i$ corresponding to points $y_i$ along the $y$ axis, find the values of $a$ and $z$ of the sphere whose gravity best fits the $G_i$. It will be assumed that $D$ and $y_o$ are known. Clearly, this problem is nonlinear in both $a$ and $z$, which play the role of the parameters $x_1$ and $x_2$. In practice, the $G_i$ should be observed values, but for the purposes of this tutorial they will be synthetic data generated using equation 23 with the following values: $z = 7$, $a = 5$, $y_o = 0$, all in km; $D = 0.25$ g/cm³, $m = 20$, $y_1 = -10$ km, and $y_{i+1} - y_i = 1$ km. To stop the iterative process, the condition that the adjustments $|\delta_1|$ and $|\delta_2|$ become smaller than or equal to $1 \times 10^{-5}$ km was assumed. In this example, the estimated variance of the residuals, given by

$$\sigma^2(z, a) = \frac{1}{m - 2}\sum_{i=1}^{m}\,[G_i - g(y_i, z, a)]^2, \tag{24}$$

plays the role of the merit function $s$ to be minimized. The 2 in the denominator is introduced to make $\sigma^2$ an unbiased estimate (Jenkins and Watts, 1968). Clearly, $a$ cannot be larger than $z$ when the $y_i$ are assumed to be at the same elevation, but for the analysis that follows we will be concerned with the mathematical, not the physical, aspects of the problem. There are two reasons for the use of $\sigma^2$. One is its statistical significance and the other is that $\sigma^2$ is a normalized form of $s$, which allows a comparison of results obtained for different numbers of observations or for different models. The following results, however, are shown in terms of the standard deviation $\sigma$, which has the same units as $g$ (i.e., mGal).

Representative contour lines of $\sigma(z, a)$ are shown in Figure 1. Note that the shapes of the contours are highly variable. For values of $\sigma$ less than about eight they are close to highly elongated ellipses (closed or open), whereas the other contours are mostly straight or slightly curved with changing slopes. This fact must be kept in mind because solving the inverse problem is equivalent to finding a path in the $(z, a)$ plane that connects an initial point $(z_o, a_o)$ and the point $(z_M, a_M)$ that minimizes $\sigma$ (and thus, $\sigma^2$). In our example, $(z_M, a_M) = (7, 5)$, and at this point $\sigma = 0$. It may happen, however, that because of the complexity of $\sigma$, no path can be found, in which case the inverse problem has not been solved. To investigate this question the initial points labeled A, B, C, and D in Figure 1 were used. Some of these initial values are too far from the true values (see Figure 2), but they were chosen for demonstration purposes, not as reasonable initial estimates for this problem. In addition, the corresponding results can be useful for cases where the function to be minimized is not equal to zero at its minimum, and there is no easy way to assess whether the initial estimates are reasonably close to the optimal values.

The results of the inversion are summarized in Table 1 and the paths followed by the intermediate pairs $(z^{(p)}, a^{(p)})$ ($p$ = iteration number) are shown in Figure 1. For the initial point D there was no convergence, but for the other three points, the minimum was reached in five iterations (points B and C) or 10 iterations (point A). These results are interesting for several reasons. First, convergence can be achieved even when the assumptions behind the linearization of the problem are completely violated. Second, convergence is not always achieved. Third, whether an initial point leads to convergence or not is not directly related to its distance to the point that minimizes $\sigma$. Finally, inspection of the inversion paths does not give any clue as to the path corresponding to any other initial point within the range of Figure 1. These facts are typical of nonlinear problems, and the other methods discussed below have been designed to address some of them.
Figure 1. Contour lines (cyan curves) of the function $\sigma$ (see equation 24) and paths followed by the points $(z_i, a_i)$ (indicated by circles), where $i$ is iteration number, for the Gauss-Newton inversion method (equation 17). The numbers next to the contours indicate the value of $\sigma$ (in mGal). The contours between the numbered ones are equispaced. The points labeled A, B, C, and D are initial points $(x_o, y_o)$ for the inversion. Figure 2 shows the corresponding gravity values. For D the method did not converge. The value of $\sigma$ for this point is 11.5. See Table 1 for additional inversion results. The large + is centered at (7, 5), which is at the minimum of $\sigma$ ($\sigma = 0$). Axes: depth (km), horizontal; radius (km), vertical.

Figure 2. Gravity values computed using equation 23 for several $(z, a)$ pairs, listed on the upper-right corner of the figure. Each pair is identified by a different symbol and by a letter. The gravity values identified by a T are the true values, while the others correspond to the initial values used for the inversion of the true values. The gravity scale is logarithmic. Axes: $x$ (km), horizontal; $g(x, z, a)$ (mGal), vertical.
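The Gauss-Newton iteration of equations 17 and 19 applied to the buried-sphere problem of Example 1a can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's code: the function names (`gravity`, `jacobian`, `gauss_newton`) are hypothetical, the derivatives are computed analytically from equation 23, and the stopping rule is the one quoted in the example (both adjustments at most $10^{-5}$ km).

```python
import numpy as np

GAMMA = 6.672               # gravitational constant for km, g/cm^3, and mGal
D = 0.25                    # density contrast (g/cm^3)
Y0 = 0.0                    # horizontal position of the sphere center (km)
y = -10.0 + np.arange(20)   # m = 20 points, y_1 = -10 km, 1 km spacing

def gravity(z, a):
    """Vertical attraction of the sphere (equation 23), in mGal."""
    return 4.0 / 3.0 * np.pi * GAMMA * D * a**3 * z / ((y - Y0)**2 + z**2)**1.5

def jacobian(z, a):
    """Matrix A of derivatives of g with respect to (z, a), equation 18."""
    g = gravity(z, a)
    r2 = (y - Y0)**2 + z**2
    return np.column_stack([g * (1.0 / z - 3.0 * z / r2), 3.0 * g / a])

def gauss_newton(z, a, G, tol=1e-5, max_iter=50):
    """Iterate equations 17 and 19 until both adjustments are <= tol."""
    path = [(z, a)]
    for _ in range(max_iter):
        A = jacobian(z, a)
        c = G - gravity(z, a)                       # equation 18
        delta = np.linalg.solve(A.T @ A, A.T @ c)   # equation 17
        z, a = z + delta[0], a + delta[1]           # equation 19
        path.append((z, a))
        if np.all(np.abs(delta) <= tol):
            break
    return path

G_obs = gravity(7.0, 5.0)   # synthetic data for z = 7 km, a = 5 km
starts = {"B": (10.0, 10.0), "C": (2.0, 2.0)}
paths = {name: gauss_newton(z0, a0, G_obs) for name, (z0, a0) in starts.items()}

for name, path in paths.items():
    print(name, "first step:", np.round(path[1], 2),
          "final:", np.round(path[-1], 4), "iterations:", len(path) - 1)
```

Only the initial points B and C are run here. Point D, for which the text reports no convergence, is left out so that the sketch terminates cleanly.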
The components of the gradient of $s$ (equation 5) are

$$(\nabla s)_j = \frac{\partial s}{\partial x_j} = -2\sum_{i=1}^{m}\,[o_i - f_i(\mathbf{x})]\,\frac{\partial f_i(\mathbf{x})}{\partial x_j}, \tag{25}$$

which at $\mathbf{x} = \mathbf{x}^o$ can be written (see equations 9 and 11) as

$$(\nabla s)_j = -2\sum_{i=1}^{m} c_i\,\frac{\partial f_i}{\partial x_j} = -2(\mathbf{A}^T\mathbf{c})_j, \tag{26}$$

so that

$$\nabla s = -2\mathbf{A}^T\mathbf{c}. \tag{27}$$

Now consider equation 1 with $\boldsymbol{\delta}$ and $\mathbf{c}$ the vectors that appear in equation 15 and let $\lambda$ go to infinity. In this case, the first term on the left side of the equation becomes negligible and the solution becomes

$$\boldsymbol{\delta}_g = \frac{1}{\lambda}\mathbf{A}^T\mathbf{c} = -\frac{1}{2\lambda}\nabla s \to \mathbf{0}; \qquad \lambda \to \infty. \tag{28}$$

Therefore, in this limiting case the damped least-squares solution $\boldsymbol{\delta}_g$ is in the direction of minus the gradient. This fact is emphasized by use of the subscript $g$. In addition, $\boldsymbol{\delta}_g$ goes to zero. The direction $-\nabla s$ is the basis of the steepest-descent method of minimization, which is one of the oldest methods used. A heuristic introduction of the steepest-descent method is as follows. Using the notation introduced above, we are interested in an iterative approach such that

$$s^{(p+1)} < s^{(p)}. \tag{29}$$

To achieve this goal, we will use the fact that $\nabla s$ points in the direction of steepest ascent. This is a well-known result from calculus (e.g., Apostol, 1969) and will be proved below in a more general context. Therefore, the initial estimate for the $(p+1)$th iteration will be computed using

$$\mathbf{x}^{(p+1)} = \mathbf{x}^{(p)} - \frac{1}{\lambda^{(p)}}\nabla s(\mathbf{x}^{(p)}), \tag{30}$$

where $\lambda^{(p)}$ is a scalar that assures that equation 29 is satisfied. The question is how to choose the value of $\lambda^{(p)}$. A general discussion of this question is presented below, but for the time being, we note that the gradient of a function is a local property, which means that, in general, the direction of steepest descent will change as a function of position. Therefore, if $\lambda^{(p)}$ is not selected carefully, it may happen that the value of $s$ is not reduced, as desired. For this reason, a number of strategies for the selection of $\lambda^{(p)}$ have been designed (e.g., Beveridge and Schechter, 1970; Dorny, 1975), but for the example considered next we will use a very simple approach, based on the use of equation 30 with a large constant value of $\lambda^{(p)}$, say $\lambda$, which will assure a small step in the steepest-descent direction. In this way, we will be able to see the steepest-descent path clearly, which will be used for
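Table 1. Gravity inversion results obtained using different methods and values of the initial parameters.

| Point | $z_o$ | $a_o$ | $g_o^M$ | $z_1$ | $a_1$ | $g_1^M$ | $N$ | $z_N$ | $a_N$ | $\lambda$ or $\lambda_o$ | $\lambda_N$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gauss-Newton method (equation 17, Figure 1) | | | | | | | | | | | |
| A | 2 | 10 | 1747 | 2.0 | 6.8 | 531 | 10 | 7.00 | 5.00 | | |
| B | 10 | 10 | 70 | 9.1 | 6.9 | 28 | 5 | 7.00 | 5.00 | | |
| C | 2 | 2 | 14 | 5.9 | 4.6 | 19 | 5 | 7.00 | 5.00 | | |
| Steepest-descent method (equation 30, $\lambda^{(p)} = \lambda$, Figure 3) | | | | | | | | | | | |
| A | 2 | 10 | 1747 | 2.3 | 9.9 | 1317 | 28904 | 6.52 | 5.00 | $2 \times 10^7$ | |
| A | 2 | 10 | 1747 | 4.7 | 8.9 | 229 | 7141 | 7.06 | 5.04 | $2 \times 10^6$ | |
| B | 10 | 10 | 70 | 10.3 | 9.4 | 54 | 13326 | 7.02 | 5.01 | $2 \times 10^4$ | |
| C | 2 | 2 | 14 | 2.0 | 2.0 | 15 | 8481 | 6.98 | 4.99 | $2 \times 10^4$ | |
| D | 10 | 2 | 1 | 10.0 | 2.0 | 1 | 10265 | 7.02 | 5.01 | $2 \times 10^4$ | |
| Levenberg-Marquardt method (equation 108, $\lambda^{(p+1)} = \lambda^{(p)}/2$, Figure 5) | | | | | | | | | | | |
| A | 2 | 10 | 1747 | 2.8 | 9.5 | 762 | 23 | 7.00 | 5.00 | $1 \times 10^6$ | 0.24 |
| A | 2 | 10 | 1747 | 2.2 | 7.1 | 550 | 17 | 7.00 | 5.00 | $1 \times 10^4$ | 0.15 |
| B | 10 | 10 | 70 | 10.0 | 10.0 | 70 | 24 | 7.00 | 5.00 | $1 \times 10^6$ | 0.12 |
| B | 10 | 10 | 70 | 10.3 | 7.6 | 29 | 11 | 7.00 | 5.00 | $1 \times 10^2$ | 0.10 |
| C | 2 | 2 | 14 | 2.0 | 2.0 | 14 | 24 | 7.00 | 5.00 | $1 \times 10^6$ | 0.12 |
| C | 2 | 2 | 14 | 2.7 | 2.9 | 22 | 11 | 7.00 | 5.00 | $1 \times 10^2$ | 0.10 |
| D | 10 | 2 | 1 | 10.0 | 2.0 | 1 | 24 | 7.00 | 5.00 | $1 \times 10^6$ | 0.12 |
| D | 10 | 2 | 1 | 9.9 | 3.2 | 2 | 11 | 7.00 | 5.00 | $1 \times 10^2$ | 0.10 |

Point: see Figure 1; $z_o$, $a_o$: initial values of $z$, $a$ for the inversion; $g_o^M$, $g_1^M$: maximum values of $g$ (mGal, equation 23) for $(z_o, a_o)$ and $(z_1, a_1)$, respectively; $z_i$, $a_i$: values of $z$, $a$ after the $i$th iteration; $N$: total number of iterations; $\lambda_o$, $\lambda_N$: initial and final values of $\lambda$. The gravity values to be inverted were generated using equation 23 with $z = z_M = 7$ and $a = a_M = 5$. The corresponding maximum value is 18.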
comparison with paths corresponding to the Gauss-Newton method and the Levenberg-Marquardt solutions obtained for different choices of $\lambda$. This example will also show the problems that affect the method of steepest descent, which are removed when the Levenberg-Marquardt method is used.

Example 1b

The merit function is the $\sigma^2$ introduced in equation 24. For the initial points B, C, and D the same value of $\lambda$ was used, while for point A two other values were used (see Figure 3). Let us consider the most salient aspects of this example. For three of the initial points (B, C, D) the corresponding endpoints are very close to, although not exactly at, $(z_M, a_M)$ (see Table 1), but the number of iterations is extremely large (more than 7000). Recall that with the Gauss-Newton method, convergence for points B and C is achieved in five iterations. Note that for the three points the paths have sharp bends, after which the paths follow the major axes of the roughly elliptical contours. These bends occur when the paths become approximately tangent to the contours, which is a general property of the method (see the discussion following equation 48 and Figure 4). For point A, the results are different. First, the value of $\lambda$ used for the other three points did not lead to convergence. Second, when using the larger value of $\lambda$ in Table 1, the path reaches a point close to where it should bend, but this bending does not occur even after a very large number of iterations (about 29,000). When a somewhat smaller value of $\lambda$ is used, the path reaches a point close to the minimum with a much smaller number of iterations (about 7000), but the path between $(x_o, y_o)$ and $(x_1, y_1)$ is clearly different from the steepest-descent path.

The previous example illustrates the well-known slow convergence of the steepest-descent method, which makes it computationally inefficient, particularly when compared to the Levenberg-Marquardt method. On the other hand, a study of some of the properties of the gradient and the steepest-descent method is very fruitful because it sheds light on certain questions that arise when solving inverse problems that involve parameters with different dimensions. This type of problem is not uncommon. In seismology, for example, the parameters may be time, position, velocity, and density, among others. If the dimensions of two or more of the parameters are different, a question that arises is how to define distance in the parameter space.

When all the parameters have the same dimensions and are measured in the same units, the gradient of a function $s(\mathbf{x})$ gives the direction along which $s$ has the largest rate of change. In other words, for a given $|\Delta\mathbf{x}|$, $[s(\mathbf{x} + \Delta\mathbf{x}) - s(\mathbf{x})]/|\Delta\mathbf{x}|$ is largest when $\Delta\mathbf{x}$ is in the direction of $\nabla s$ computed at $\mathbf{x}$. In this case, it is meaningful to speak of the direction of steepest ascent and to equate it to the gradient direction. For any other case, however, a distance in parameter space must be defined. Once this is done, the direction of steepest ascent is well defined, as we now show. The following results originate with Crockett and Chernoff (1955). Although in this paper we are interested in the steepest-descent direction, here we consider its opposite direction (corresponding to the steepest ascent) to avoid introducing an inconvenient minus sign. A general definition of distance $d$ between two points represented by vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ is

$$d = \Big[\sum_{i,j} b_{ij}(\alpha_i - \beta_i)(\alpha_j - \beta_j)\Big]^{1/2} = \big[(\boldsymbol{\alpha} - \boldsymbol{\beta})^T\mathbf{B}(\boldsymbol{\alpha} - \boldsymbol{\beta})\big]^{1/2}, \tag{31}$$

where $\mathbf{B}$ is a positive definite symmetric matrix (see Appendix A). With this condition on $\mathbf{B}$, $d$ is always a nonnegative real number, with $d = 0$ only if $\boldsymbol{\alpha} = \boldsymbol{\beta}$. The definition of distance is known as the metric of the space of points under consideration. If $\mathbf{B} = \mathbf{I}$, $d$ is the

Figure 3. Similar to Figure 1, showing the paths corresponding to the steepest-descent method (equation 30) with $\lambda^{(p)} = \lambda$ = constant for the values of $\lambda$ given at the top of the figure ($\lambda = 2 \times 10^4$, $2 \times 10^6$, and $2 \times 10^7$). For point A the dot-dash path was far from the minimum value (see Table 1). The large symbol on each path corresponds to $(z_1, a_1)$. The two contours corresponding to $\sigma \approx 3$ are different from those in Figure 1, and were drawn to show their relations to the bend in the paths for points B and D (see Figure 4 and corresponding text for further details). Axes: depth (km), horizontal; radius (km), vertical.

Figure 4. Elliptical contour lines corresponding to a 2D quadratic merit function given by equation 55 (with $\mathbf{x}_1$ a point of minimum). The contour spacing is not uniform. The points corresponding to the centers of the small circles are identified by the letters next to them. The solid and dashed lines are tangent to the contours at points B and C. The segments AB and BC are in the directions of the gradient at A and B. The positions of points B and C were determined using equations 47, 54, and 56. The two segments are perpendicular to each other. The two pairs of closely spaced contours were drawn to show that if the segments AB and BC extend beyond points B and C, the value of the quadratic function becomes larger.
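The constant-$\lambda$ steepest-descent iteration of Example 1b can be sketched as follows for the initial point B. This is a minimal sketch under stated assumptions, not the author's code: the function names are hypothetical, and the adjustment is taken as $\mathbf{A}^T\mathbf{c}/\lambda$, that is, $-\nabla s/(2\lambda)$ as in equation 28, because that scaling gives a first step close to the $(z_1, a_1)$ listed for point B in Table 1. The factor of two relative to equation 30 only rescales $\lambda$.

```python
import numpy as np

GAMMA, D, Y0 = 6.672, 0.25, 0.0   # constants of Example 1a
y = -10.0 + np.arange(20)         # observation points (km)

def gravity(z, a):
    """Vertical attraction of the sphere (equation 23), in mGal."""
    return 4.0 / 3.0 * np.pi * GAMMA * D * a**3 * z / ((y - Y0)**2 + z**2)**1.5

def jacobian(z, a):
    """Derivatives of g with respect to (z, a)."""
    g = gravity(z, a)
    r2 = (y - Y0)**2 + z**2
    return np.column_stack([g * (1.0 / z - 3.0 * z / r2), 3.0 * g / a])

def steepest_descent(z, a, G, lam, tol=1e-5, max_iter=20000):
    """Constant-lambda steepest descent; returns the list of (z, a) pairs."""
    path = [(z, a)]
    for _ in range(max_iter):
        c = G - gravity(z, a)
        delta = jacobian(z, a).T @ c / lam   # = -grad s / (2 lam), equation 28
        z, a = z + delta[0], a + delta[1]
        path.append((z, a))
        if np.all(np.abs(delta) <= tol):
            break
    return path

G_obs = gravity(7.0, 5.0)                            # synthetic data
path = steepest_descent(10.0, 10.0, G_obs, lam=2e4)  # initial point B

print("first step:", np.round(path[1], 2))
print("end point: ", np.round(path[-1], 3), "after", len(path) - 1, "iterations")
```

The run takes thousands of small steps, which is the slow convergence described in the text. The endpoint is close to, but not exactly at, $(7, 5)$ because the iteration stops as soon as both adjustments fall below the tolerance.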
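usual Euclidean distance. Given a point with coordinates $\mathbf{x}^o$, the points $\mathbf{x}$ at a distance $d$ from it are on the ellipsoid

$$(\mathbf{x} - \mathbf{x}^o)^T\mathbf{B}(\mathbf{x} - \mathbf{x}^o) = d^2. \tag{32}$$

Given a function $s$, the direction of steepest ascent in the $d$ neighborhood of $\mathbf{x}^o$ is defined as the direction from $\mathbf{x}^o$ to the point on the ellipsoid of equation 32 for which the value of $s$ is greatest. Let $\boldsymbol{\delta}$ represent that direction. To find it, we will maximize $s(\mathbf{x}^o + \boldsymbol{\delta})$ under the constraint

$$\boldsymbol{\delta}^T\mathbf{B}\boldsymbol{\delta} = d^2, \tag{33}$$

where

$$\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}^o. \tag{34}$$

The notation used here has been chosen to emphasize the connection of this section to the preceding one. In particular, $s$ and $\boldsymbol{\delta}$ could be those introduced in equations 5 and 13. To solve this problem, we will use the method of Lagrange multipliers, which requires introducing the function

$$u = s(\mathbf{x}^o + \boldsymbol{\delta}) + \mu(d^2 - \boldsymbol{\delta}^T\mathbf{B}\boldsymbol{\delta}), \tag{35}$$

where $\mu$ is the Lagrange multiplier, computing its derivatives with respect to $x_i$ and $\mu$, and setting them equal to zero. This gives

$$\frac{\partial s(\mathbf{x}^o + \boldsymbol{\delta})}{\partial x_i} = 2\mu\sum_j b_{ij}\delta_j, \tag{36}$$

which is equivalent to

$$\nabla s(\mathbf{x}^o + \boldsymbol{\delta}) = 2\mu\mathbf{B}\boldsymbol{\delta}, \tag{37}$$

and

$$d^2 = \boldsymbol{\delta}^T\mathbf{B}\boldsymbol{\delta}. \tag{38}$$

Next, write equation 37 as

$$\boldsymbol{\delta} = \frac{1}{2\mu}\mathbf{B}^{-1}\nabla s, \tag{39}$$

introduce this result in equation 38, use the symmetry of $\mathbf{B}$, and apply the square root to both sides of the resulting equation. Thus we get

$$d = \frac{1}{2\mu}\big(\nabla s^T\mathbf{B}^{-1}\nabla s\big)^{1/2}. \tag{40}$$

Next, solve this equation for $1/(2\mu)$ and introduce the result in equation 39. This gives

$$\boldsymbol{\delta} = \frac{d\,\mathbf{B}^{-1}\nabla s}{\big(\nabla s^T\mathbf{B}^{-1}\nabla s\big)^{1/2}}, \tag{41}$$

where $\nabla s$ is computed at $\mathbf{x}^o + \boldsymbol{\delta}$. Now we will let $d$ go to zero, which means that $\boldsymbol{\delta}$ goes to the zero vector, and linearize $\nabla s(\mathbf{x}^o + \boldsymbol{\delta})$ in the vicinity of $\mathbf{x}^o$. Writing in component form we have

$$\frac{\partial s(\mathbf{x}^o + \boldsymbol{\delta})}{\partial x_i} \approx \frac{\partial s(\mathbf{x}^o)}{\partial x_i} + \sum_j \frac{\partial^2 s(\mathbf{x}^o)}{\partial x_i\,\partial x_j}\,\delta_j, \tag{42}$$

or, in vector form,

$$\nabla s(\mathbf{x}^o + \boldsymbol{\delta}) \approx \nabla s(\mathbf{x}^o) + \mathbf{H}\boldsymbol{\delta}, \qquad d \to 0, \tag{43}$$

where $\mathbf{H}$ is the Hessian of $s$ computed at $\mathbf{x}^o$. Therefore,

$$\nabla s(\mathbf{x}^o + \boldsymbol{\delta}) \to \nabla s(\mathbf{x}^o); \tag{44}$$

and, from equation 41,

$$\boldsymbol{\delta} \to \frac{d\,\mathbf{B}^{-1}\nabla s(\mathbf{x}^o)}{\big[\nabla s(\mathbf{x}^o)^T\mathbf{B}^{-1}\nabla s(\mathbf{x}^o)\big]^{1/2}}; \qquad d \to 0. \tag{45}$$

Here, we are assuming that $\nabla s(\mathbf{x}^o) \ne \mathbf{0}$. If $\nabla s(\mathbf{x}^o) = \mathbf{0}$, $\mathbf{x}^o$ corresponds to a critical point (i.e., a point where $s$ has an extremal value or a saddle point). In conclusion, the direction of steepest ascent at any point $\mathbf{x}$ is given by the vector

$$\boldsymbol{\xi}(\mathbf{x}) = \mathbf{B}^{-1}\nabla s(\mathbf{x}). \tag{46}$$

When $\mathbf{B} = \mathbf{I}$, $\boldsymbol{\xi}$ is in the direction of the gradient, as expected. Finally, we will address the choice of $\mathbf{B}$. In principle, the choice is somewhat arbitrary, because different metrics should lead to the same minimum. In fact, equation 46 shows that $\boldsymbol{\xi}(\mathbf{x}) = \mathbf{0}$ implies $\nabla s = \mathbf{0}$ and vice versa (Feder, 1963). However, Crockett and Chernoff (1955) showed that the most computationally efficient steepest-ascent method requires that $\mathbf{B} = \mathbf{H}$. The following proof has two parts, originating from Davies and Whitting (1972) and Greenstadt (1967), respectively. For the first part, consider the iterative minimization (or maximization) of a function $s(\mathbf{x})$. At a given iteration, we have a point $\mathbf{x}^o$ and move to a new point $\mathbf{x}_1$ in a direction defined by a vector $\mathbf{u}$. Let

$$\mathbf{x}_1 = \mathbf{x}^o + \alpha\mathbf{u} \tag{47}$$

and

$$F(\alpha) \equiv s(\mathbf{x}_1) = s(\mathbf{x}^o + \alpha\mathbf{u}). \tag{48}$$

From a computational point of view, we are interested in an iterative process with the least number of steps. This requires finding the $\alpha$ that reduces (or increases, in the case of maximization) the value of $F$ as much as possible in the direction $\mathbf{u}$ at every step, which in turn means that $\mathbf{x}_1$ becomes a point of tangency to one of the contours of $s$. This situation is illustrated in Figure 4, which shows the contours corresponding to a 2D quadratic merit function with a point of minimum. The points indicated by A and B correspond to $\mathbf{x}^o$ and $\mathbf{x}_1$, respectively, for a given iteration. Going from A to B, the value of $s$ keeps decreasing, while moving past B leads to an increase in the value of $s$. For the next iteration B and C become the new $\mathbf{x}^o$ and $\mathbf{x}_1$. The points B and C correspond to tangency points and were determined using equations 54 and 56 below. The directions at A and B are given by $\nabla s$ computed at those points. Because the gradient is perpendicular to the contours, the segment BC is perpendicular to AB. If the contours were circular, the minimum would be reached in one step. To find the value of $\alpha$ that will lead to a point of tangency, we will expand $F$ to second order about $\mathbf{x}^o$:

$$F \approx F_o + \left(\frac{dF}{d\alpha}\right)_o \alpha + \frac{1}{2}\left(\frac{d^2F}{d\alpha^2}\right)_o \alpha^2, \tag{49}$$

where the subscript $o$ indicates evaluation at $\mathbf{x}^o$. Then, expanding $dF/d\alpha$ to first order about $\mathbf{x}^o$ and setting it equal to zero at the point of tangency we obtain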
W8
Pujol
dF d
dF d
+
o
d 2F d 2
= 0.
o
50
FB = FN
60
If F were a quadratic function, these relations would be exact. Now, differentiating equation 48 with respect to gives
61
dF = d
and
s ui = uT s xi
51
so that
s = B1/2p.
62
d 2F = d 2
i,j
s uiu j = uTHu, xi x j
52
Because B is positive denite and symmetric, so is B1/2 see Appendix A . Using these two equations, becomes
=
where
where H is the Hessian matrix. Introducing these two expressions in equation 50 and solving for gives
pT p 2 , p Mp pT M1p
T
63
dF d
d 2F d 2
uT s = T u Hu
.
x = xo
53
M = B1/2 HB1/2 .
64
sT s s TH s
Matrix M is positive denite see Appendix A . An upper bound to can be established by using the following generalization of Schwarzs inequality
.
x = xo
54
aT b
aT Ca bT C1b
65
This expression is exact when s is a quadratic function. For example, s may be of the form
see Appendix A , where C is a positive denite matrix. Application of this expression to gives
s = x x1 T P x x1 ,
1. 55
66
Now we will apply the Kantorovich inequality Luenberger, 1973 to the right-side of equation 63, which immediately gives
4
n
n 1
s = 2P x x1 ;
H = 2P.
56
where
1
4 1/ n 1 + 1/ n
4 1+
2,
67
If x1 minimizes s, s x1 = 0 and P is positive denite e.g., Apostol, 1969 . For the second part of the proof we will consider the difference F between F and Fo, which is determined from equations 49 and 5153:
and
=
In summary,
1/ n.
68
F = F Fo =
1 uT s 2 2 uT Hu
.
x = xo
57
4 1+
1.
69
This result applies to both the minimization and maximization of s, with the sign of F depending on whether H is positive or negative denite see Appendix A , corresponding to whether s has a minimum or a maximum e.g., Apostol, 1969 . Now we will set u = see equation 46 for two cases: 1 B is any positive denite symmetric matrix, and 2 B = H. The latter case corresponds to the socalled Newton or Newton-Raphson method, and to distinguish between the two possibilities we will use subscripts B and N. Thus,
FB =
and
1 sTB1 s 2 2 sTB1HB1 s
58
FN =
1 T 1 s H s. 2
59
To investigate the relative efciency of the methods represented by the two choices we will consider the ratio
This result shows that the efficiency of the method depends on $\kappa$, which is the condition number of M (e.g., Gill et al., 1981). The better conditioned this matrix is, the higher the efficiency. In particular, $\gamma = 1$ when $\kappa = 1$, which in turn requires M = cI. Without losing generality, we can take c = 1, in which case B = H (see equation 64). Therefore, the Newton step is the most efficient (assuming that $\nabla s$ is arbitrary). Crockett and Chernoff's (1955) proof of this result is based on a different approach and different resulting expressions. This choice of B, however, is not always the most advisable, for two reasons. First, when minimizing s, H may not always be positive definite; in fact, some of its eigenvalues may be negative and $\Delta F_N$ may become positive. Second, even though H may be positive definite throughout the iterative process, the required computation of second derivatives increases the computational costs. On the other hand, this choice has special relevance in statistics because, as Crockett and Chernoff (1955) note, when solving maximum likelihood estimation problems the function s is the logarithm of the likelihood function, in which case $-H^{-1}$ represents an estimate of the covariance matrix of the maximum likelihood estimate (Seber and Wild, 1989).
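The bounds on the relative efficiency can be checked numerically. The following is a minimal Python sketch, not part of the original derivation: the matrices B and H, the gradient vector, the random seed, and the function names are hypothetical choices made here for illustration. It evaluates the ratio of equation 60 from equations 58 and 59 and compares it with the Kantorovich lower bound, using the fact that $B^{-1}H$ and $M = B^{-1/2}HB^{-1/2}$ are similar matrices and therefore share eigenvalues.

```python
import numpy as np

def efficiency_ratio(B, H, grad):
    """Ratio dF_B / dF_N (equation 60) for the step direction -B^{-1} grad."""
    Binv_g = np.linalg.solve(B, grad)
    # Equation 58: decrease obtained with the metric B.
    dF_B = -0.5 * (grad @ Binv_g) ** 2 / (Binv_g @ H @ Binv_g)
    # Equation 59: decrease obtained with the Newton choice B = H.
    dF_N = -0.5 * grad @ np.linalg.solve(H, grad)
    return dF_B / dF_N

def kantorovich_bound(B, H):
    """Lower bound 4*kappa/(1+kappa)^2 built from the eigenvalues of M."""
    # B^{-1} H is similar to M = B^{-1/2} H B^{-1/2}; eigenvalues are real, positive.
    mu = np.linalg.eigvals(np.linalg.solve(B, H)).real
    kappa = mu.min() / mu.max()
    return 4.0 * kappa / (1.0 + kappa) ** 2

# Hypothetical symmetric positive definite test matrices.
rng = np.random.default_rng(0)
n = 4
X = rng.normal(size=(n, n))
H = X @ X.T + n * np.eye(n)
Y = rng.normal(size=(n, n))
B = Y @ Y.T + n * np.eye(n)
grad = rng.normal(size=n)

gamma_B = efficiency_ratio(B, H, grad)   # arbitrary metric B
gamma_N = efficiency_ratio(H, H, grad)   # Newton choice, B = H
lower = kantorovich_bound(B, H)
print(lower, gamma_B, gamma_N)
```

Under these assumptions the printed ratio for an arbitrary B should fall between the lower bound and one, and the Newton choice should give a ratio of one to within rounding.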
The covariance matrix is also related to our discussion of the metric via the Mahalanobis distance, named after the Indian statistician who introduced the concept in 1936. If x is a random vector from a population with mean $\mu$ and covariance matrix V, then the distance between x and $\mu$ is given by

$$d_M = \left[(x-\mu)^T V^{-1}(x-\mu)\right]^{1/2}. \tag{70}$$

A good qualitative justification for this definition can be found in Krzanowski (1988), who also notes the relation between this distance and the maximum likelihood function. Finally, it is worth noting that the choice $B = H_S$ (see equation 16) leads to the Gauss-Newton method. In fact, using equations 46, 16, 27, and 15 gives

$$H_S^{-1}\nabla s = -(A^TA)^{-1}A^Tc = -\delta \tag{71}$$

(provided that the inverse exists). Now using equation 34 with this vector as the direction (the minus sign being used to specialize to the steepest-descent case) and then using equation 71, we have

$$x = x_o - H_S^{-1}\nabla s = x_o + \delta. \tag{72}$$

Comparison of this expression with equation 13 shows that we have recovered the Gauss-Newton method. This result is consistent with the fact that one way to derive this method is to assume that $H \approx H_S$ in the Newton method (e.g., Gill et al., 1981).

Levenberg's method is based on the minimization of the function

$$\bar S(x) = wS(x) + Q(x), \tag{73}$$

where

$$Q(x) = d_1\delta_1^2 + \cdots + d_n\delta_n^2 = \delta^T D\,\delta, \tag{74}$$

w and the $d_i$ are positive weighting factors independent of x, and D is a diagonal matrix with elements $(D)_{ii} = d_i$. A comparison of equations 74 and 31 shows that Levenberg's method implicitly introduces a non-Euclidean norm in the parameter space. Moreover, the results of the analysis below are valid when D is a symmetric positive definite matrix. Let us establish two important results concerning $\bar S$, S, and Q. Let $x_w$ be the value of x that minimizes $\bar S$ for a given value of w, i.e.,

$$\bar S(x_w) < \bar S(x); \qquad x \ne x_w. \tag{75}$$

Because

$$Q(x_o) = 0 \tag{76}$$

(see equation 8), we can write

$$wS(x_w) \le wS(x_w) + Q(x_w) = \bar S(x_w) < \bar S(x_o) = wS(x_o) + Q(x_o) = wS(x_o), \tag{77}$$

so that

$$S(x_w) < S(x_o), \tag{78}$$

which means that the minimization of $\bar S$ will lead to a decrease in S. Now, letting $x_\infty$ denote the ordinary least-squares solution (the reason for this notation is explained below), we have

$$wS(x_w) + Q(x_w) = \bar S(x_w) < \bar S(x_\infty) = wS(x_\infty) + Q(x_\infty) \le wS(x_w) + Q(x_\infty), \tag{79}$$

so that

$$Q(x_w) < Q(x_\infty). \tag{80}$$

The second inequality in equation 79 arises because $x_\infty$ minimizes S. Inequality 80 shows that the minimization of $\bar S$ also leads to a decrease in the weighted sum of adjustments squared. Next, we will derive the equation for the solution that minimizes $\bar S$, but before proceeding we note that Levenberg derived his results using scalar notation, not the more familiar matrix notation used here. The starting point is equation 73, which will be rewritten using equation 12:

$$\bar S = w\left[\delta^T\left(A^TA + \frac{1}{w}D\right)\delta - 2c^TA\,\delta + c^Tc\right]. \tag{81}$$

Aside from a factor of w, equation 81 is formally similar to equation 12, with $A^TA$ in the latter replaced by the symmetric matrix in parentheses in the former. Therefore, by analogy with equation 15, the minimization of $\bar S$ leads to the damped least-squares solution

$$\left(A^TA + \frac{1}{w}D\right)\delta = A^Tc, \tag{82}$$

so that the only difference from the ordinary least-squares solution is the addition of a diagonal matrix to $A^TA$. Because the inverse of the matrix in parentheses always exists for $w < \infty$ (see Appendix A), equation 82 has a solution even when $(A^TA)^{-1}$ does not exist and the Gauss-Newton method is not applicable. Also note that for $w = \infty$ the second term on the left side of equation 82 vanishes and we get the ordinary Gauss-Newton solution (provided it exists). This is why we introduced the $x_\infty$ used in equations 79 and 80. On the other hand, if w goes to zero, 1/w goes to infinity and the first term on the left becomes negligible, which means that we can write

$$\frac{1}{w}D\,\delta \approx A^Tc; \qquad w \to 0. \tag{83}$$
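The behavior of equation 82 in the three regimes just described can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the matrices A and A2, the vector c, the weights in D, and the function name damped_step are hypothetical and do not come from the paper's gravity example. The first matrix is deliberately rank deficient, so that the Gauss-Newton equations have no unique solution while equation 82 still does.

```python
import numpy as np

def damped_step(A, c, D, w):
    """Solve (A^T A + D / w) delta = A^T c, i.e., equation 82."""
    return np.linalg.solve(A.T @ A + D / w, A.T @ c)

# Hypothetical rank-deficient matrix: the second column is twice the first.
A = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
c = np.array([1.0, 0.0, 2.0])
D = np.diag([1.0, 3.0])                   # positive weights d_i

rank = np.linalg.matrix_rank(A.T @ A)     # A^T A is singular here
delta = damped_step(A, c, D, w=1.0)       # the damped system is still solvable

# Small w: the solution approaches w * D^{-1} A^T c (equation 83).
small_w = 1e-8
delta_small = damped_step(A, c, D, small_w)
limit_small = small_w * np.linalg.solve(D, A.T @ c)

# Very large w with a full-rank matrix: the Gauss-Newton solution is recovered.
A2 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
delta_big = damped_step(A2, c, D, w=1e12)
gauss_newton = np.linalg.solve(A2.T @ A2, A2.T @ c)
print(rank, delta, delta_small, delta_big)
```

The design choice here is to solve the linear system directly rather than forming an inverse, which is the usual practice for small dense problems.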
In addition, because the diagonal elements of D are nonzero, its inverse always exists and we can write

$$\delta \approx wD^{-1}A^Tc = -\frac{1}{2}\,wD^{-1}\nabla s \to 0; \qquad w \to 0 \tag{84}$$

(see equation 27). This result is also valid when D is symmetric and positive definite (so that its inverse exists, see Appendix A), in which case it agrees with equation 46. The difference in sign between $\delta$ and the vector in equation 46 is due to the fact that they are in the directions of steepest descent and ascent, respectively. So far, we have concentrated on S and $\bar S$, but as we will see next, we can derive several important results concerning s, which is the quantity that is of most interest to us. In the following we will focus on the case of w going to zero, which means that we can use equation 84. Then, letting

$$\delta = x_w - x_o \tag{85}$$

we find that

$$\frac{dx_w}{dw} = \frac{d\delta}{dw} = D^{-1}A^Tc; \qquad w \to 0. \tag{86}$$

Furthermore,

$$\frac{ds(x_w)}{dw} = \sum_{j=1}^{n}\frac{\partial s}{\partial x_j}\frac{dx_j}{dw} = \left.(\nabla s)^T\frac{dx_w}{dw}\right|_{x=x_w}. \tag{87}$$

Because of equations 84 and 85, $x_w \to x_o$ when $w \to 0$. Then, introducing equations 86 and 27 in equation 87 and operating gives

$$\left.\frac{ds(x_w)}{dw}\right|_{w=0} = -2(A^Tc)^T D^{-1}A^Tc = -2\left(D^{-1/2}A^Tc\right)^T\left(D^{-1/2}A^Tc\right) = -2\left\|D^{-1/2}A^Tc\right\|^2 < 0 \tag{88}$$

(see also equation 44). The inequality arises because of the assumption that $x_o$ is not a stationary point of s, which means that the partial derivatives cannot all be equal to zero. Therefore, because $s(x_w)$ is decreasing at w = 0, there are (positive) values of w that will reduce the value of s. In principle, the value of w that minimizes s could be determined by setting ds/dw equal to zero, but because of the complexity of this equation in practical cases, Levenberg proposed to write $s(x_w)$ in terms of its linearized Taylor expansion

$$s(x_w) \approx s(x_o) + w\left.\frac{ds}{dw}\right|_{w=0}, \tag{89}$$

and to use the value of w for which the right-hand side vanishes, namely

$$w = -\frac{s(x_o)}{\left.(ds/dw)\right|_{w=0}} = \frac{s(x_o)}{2\left\|D^{-1/2}A^Tc\right\|^2}, \tag{90}$$

where equation 88 was used. According to Levenberg, this type of approximation was published by Cauchy in 1847. The results derived above do not depend on the values of the weights $d_j$. To determine them, Levenberg proposed two approaches. One was to choose the $d_i$ such that the directional derivative of s along the curve defined by $x = x_w$, taken at w = 0, has a minimum value. The directional derivative is given by equation 87 because $dx_w/dw$ is a vector tangent to $x_w$ (e.g., Spiegel, 1959). Furthermore, because the product on the right side is the matrix form of the scalar product, we can write

$$\frac{ds(x_w)}{dw} = \left\|\nabla s\right\|\left\|\frac{dx_w}{dw}\right\|\cos\theta, \tag{91}$$

where $\theta$ is the angle between the two vectors. The minimum value of the derivative is attained for $\theta = \pi$. Introducing this value of $\theta$ as well as equations 27, 86, and 88 into equation 91 we obtain

$$\left\|D^{-1/2}A^Tc\right\|^2 = \left\|A^Tc\right\|\left\|D^{-1}A^Tc\right\|. \tag{92}$$

This equation is satisfied when D = dI, with d equal to a constant, which results in a factor of $d^{-1}$ on both sides of the equation. Taking d = 1 and letting $\lambda = 1/w$, we find that equation 82 becomes the well-known equation

$$\left(A^TA + \lambda I\right)\delta = A^Tc; \qquad \lambda = \frac{1}{w}. \tag{93}$$

The second approach was to take

$$d_i = (A^TA)_{ii}, \tag{94}$$

in which case the matrix in parentheses in equation 82 becomes the matrix $A^TA$ with its diagonal elements multiplied by $1 + \lambda$. Levenberg did not give a motivation for this choice, but it is directly related to the scaling introduced by Marquardt.

The first of the results presented by Marquardt is as follows. Let $\lambda \ge 0$ be arbitrary and let $\delta_o$ satisfy

$$\left(A^TA + \lambda I\right)\delta_o = A^Tc. \tag{95}$$

Then $\delta_o$ minimizes S on the sphere whose radius $\|\delta\|$ satisfies

$$\|\delta\|^2 = \|\delta_o\|^2. \tag{96}$$

This result was proved using the method of Lagrange multipliers, which requires minimizing the function

$$u(\delta,\lambda) = S + \lambda\left(\|\delta\|^2 - \|\delta_o\|^2\right) \tag{97}$$

with respect to $\delta$ and $\lambda$, where $\lambda$ is a Lagrange multiplier. This requires finding the derivatives of u with respect to $\delta$ and $\lambda$ and setting them equal to zero. Because $\delta_o$ does not depend on $\delta$, the derivative with respect to $\delta$ can be determined as done in connection with the minimization of $\bar S$ in equation 81. In fact, setting w = 1 and $D = \lambda I$ in equation 82 immediately gives

$$\left(A^TA + \lambda I\right)\delta = A^Tc. \tag{98}$$

This proof is more general than that provided by Marquardt, which assumed the existence of $(A^TA)^{-1}$. Next, setting the derivative of u with respect to $\lambda$ equal to zero gives

$$\|\delta\|^2 = \|\delta_o\|^2, \tag{99}$$
which proves the result. For the sake of simplicity, the subscript in $\delta_o$ will be dropped. The second result requires writing $A^TA$ in terms of its eigenvalue decomposition (e.g., Noble and Daniel, 1977), namely

$$A^TA = U\Lambda U^T, \tag{100}$$

where U is a matrix whose columns are the eigenvectors of $A^TA$ and $\Lambda$ is the diagonal matrix of its eigenvalues (all real numbers). This decomposition applies to any symmetric matrix. The elements of the diagonal matrix $\Lambda$ will be indicated with $\lambda_i$. There should be no confusion between the damping parameter $\lambda$ and the $\lambda_i$ because the former is never subscripted. Using equation 100 and the property

$$UU^T = U^TU = I \tag{101}$$

we obtain

$$A^TA + \lambda I = U\left(\Lambda + \lambda I\right)U^T. \tag{102}$$

The matrix in parentheses is diagonal, and, as shown in Appendix A, all of its diagonal elements are always positive when $\lambda > 0$. Therefore its inverse always exists, which allows writing

$$\delta = U\left(\Lambda + \lambda I\right)^{-1}U^TA^Tc = U\left(\Lambda + \lambda I\right)^{-1}u, \tag{103}$$

where

$$u = U^TA^Tc. \tag{104}$$

Then

$$\|\delta\|^2 = u^T\left(\Lambda + \lambda I\right)^{-2}u = \sum_i \frac{u_i^2}{(\lambda_i + \lambda)^2}. \tag{105}$$

From this equation we see that $\|\delta\|$ is a decreasing function of $\lambda$. This result and the previous one are from Morrison (1960, unpublished, quoted by Marquardt). Marquardt's final result concerns the angle $\theta$ between $\delta$ and $A^Tc$, which is proportional to $-\nabla s$ (see equation 27). Using equations 103, 104, and 101, we can write

$$\cos\theta = \frac{\delta^TA^Tc}{\|\delta\|\,\|A^Tc\|} = \frac{u^T\left(\Lambda + \lambda I\right)^{-1}U^TA^Tc}{\left[u^T\left(\Lambda + \lambda I\right)^{-2}u\right]^{1/2}\|A^Tc\|} = \frac{u^T\left(\Lambda + \lambda I\right)^{-1}u}{\left[u^T\left(\Lambda + \lambda I\right)^{-2}u\right]^{1/2}\|A^Tc\|}. \tag{106}$$

The angle $\theta$ is a function of $\lambda$. When $\lambda$ goes to infinity, we already saw that $\delta$ becomes proportional to $-\nabla s$ (see equation 28), so that $\theta$ goes to zero. When $\lambda = 0$, two cases must be considered. If the inverse of $A^TA$ exists, all the $\lambda_i$ are positive, $\cos\theta > 0$, and $\theta < 90^\circ$. If the inverse does not exist, then equation 15 cannot be solved. Now we will address the question of what happens to $\theta$ for other values of $\lambda$. To answer it, Marquardt investigates the sign of the derivative of $\cos\theta$ with respect to $\lambda$ and finds that

$$\frac{d\cos\theta}{d\lambda} > 0 \tag{107}$$

(Appendix B). The main consequence of this result is that $\theta$ is a monotonic decreasing function of $\lambda$, which assures that it is always possible to find a value of $\lambda$ that will reduce the value of s. This observation leads to the algorithm introduced by Marquardt, which he applies to a scaled version of the problem. However, because this scaling is not essential (and is not always used; e.g., Gill et al., 1981), the basis of the algorithm is described first. At the pth iteration the following equation is solved for $\delta_p$:

$$\left(A^TA + \lambda_p I\right)\delta_p = A^Tc, \tag{108}$$

and the parameters are updated using

$$x_{p+1} = x_p + \delta_p. \tag{109}$$

If

$$s_{p+1} < s_p, \tag{110}$$

the value of $\lambda_p$ is reduced. Otherwise, its value is increased. After this step a new iteration is started. Marquardt introduced three tests that determined the value of $\lambda_{p+1}$, but a simpler approach, described below, works well. In any case, the important point to note here is that the Marquardt algorithm is based on a trial-and-error approach for the selection of the appropriate value of $\lambda$ at each iteration, which is simpler than Levenberg's approach (equation 90). Marquardt applies his algorithm after introducing the scaled matrix $(A^TA)^*$ and vector $(A^Tc)^*$, with components given by

$$(A^TA)^*_{ij} = s_{ii}\,s_{jj}\,(A^TA)_{ij} \tag{111}$$

and

$$(A^Tc)^*_i = s_{ii}\,(A^Tc)_i, \tag{112}$$

where

$$s_{ii} = \frac{1}{\sqrt{(A^TA)_{ii}}}. \tag{113}$$

Note that the diagonal elements of $(A^TA)^*$ are all equal to one. In terms of this scaling, equation 15 becomes

$$(A^TA)^*\,\delta^* = (A^Tc)^*, \tag{114}$$

with

$$\delta_i = s_{ii}\,\delta_i^*. \tag{115}$$

To verify that equation 115 is correct, we will proceed as follows. First, introduce a diagonal matrix S with elements

$$(S)_{ii} = s_{ii}. \tag{116}$$

Then

$$(A^TA)^* = SA^TAS \tag{117}$$

and

$$(A^Tc)^* = SA^Tc, \tag{118}$$

and equation 114 becomes

$$SA^TAS\,\delta^* = SA^Tc, \tag{119}$$
or, equivalently,

$$A^TAS\,\delta^* = A^Tc, \tag{120}$$

so that

$$\delta = S\delta^* \tag{121}$$

(provided that $(A^TA)^{-1}$ exists). Marquardt's algorithm is based on the solution of equation 108 after introducing the scaling described above. Solving the scaled equation gives $\delta_p^*$, which is converted to $\delta_p$ using equation 115, which in turn is used in equation 109. The reasons given by Marquardt to scale the problem are twofold. First, it was known (Curry, 1944) that the properties of the steepest-descent method are not scale invariant. Second, the particular scaling he chose was widely used in linear least-squares problems to improve their numerical aspects. These questions are discussed in, e.g., Draper and Smith (1981) and Gill et al. (1981). What is not obvious, however, is that equation 115 is applicable after the scaled version of equation 108 is solved, but this fact can be proved using the ideas developed by Levenberg. To see that, let us multiply both sides of equation 82 by $D^{-1/2}$ on the left, rewrite it slightly, and operate. Using $\lambda = 1/w$, this gives

$$D^{-1/2}\left(A^TA + \lambda D^{1/2}D^{1/2}\right)D^{-1/2}D^{1/2}\delta = \left(D^{-1/2}A^TAD^{-1/2} + \lambda I\right)D^{1/2}\delta = D^{-1/2}A^Tc. \tag{122}$$

If D is the diagonal matrix with elements given by equation 94, then $D^{-1/2} = S$ and equation 122 becomes

$$\left(SA^TAS + \lambda I\right)\delta^* = SA^Tc, \tag{123}$$

where $\delta^* = S^{-1}\delta$ (see equation 121). This result has two consequences. First, its comparison with equations 117-121 justifies Marquardt's procedure. Second, it shows that this procedure is equivalent to Levenberg's second choice of matrix D, namely $D = S^{-2}$. This equivalence, noted without proof by Davies and Whitting (1972), shows that Marquardt's method implicitly introduces a non-Euclidean norm that changes at each iteration because it is based on the value of x for that iteration.

Example 1c

Here, we will apply the Levenberg-Marquardt method with and without scaling to the gravity data introduced before. There are two reasons for using the two options. One is that because both of them are used in practice, a comparison of their performances will be useful. The second reason is that scaling is equivalent to changing the shape of s at each iteration, so that a direct comparison with the Gauss-Newton and steepest-descent methods is possible only for the unscaled version. First, we will consider the results obtained using the unscaled version (corresponding to equation 108). The initial values of $\lambda$ (indicated by $\lambda_o$) are given in Table 1. The following procedure to handle $\lambda$ at each iteration is simpler than that proposed by Marquardt but was found to be effective when applied to a variety of inverse problems. As before, the merit function of the previous examples plays the role of s. Then, if equation 110 is satisfied, the value of $\lambda_{p+1}$ is set equal to $\lambda_p/c$, where c is a constant (here, c = 2). If not, the values of $\lambda$ and the parameters are set equal to those they had in the iteration p - 1. Then a new iteration is started. The selection of $\lambda_o$ depends on the type of inverse problem being solved and on the initial values given to the parameters to be determined. For points A, B, and C, equation 110 was always satisfied and the choice of $\lambda_o$ was not critical (recall that the Gauss-Newton method converged for these points). For point D, the situation is different (see below). For other inverse problems, the best approach to the selection of $\lambda_o$ is to invert synthetic data that resemble the actual data as much as possible and to experiment with different values of $\lambda_o$ and even the constant c. Two values of $\lambda_o$ for each initial point were used. The first value (equal to $1 \times 10^6$ for all the points) was chosen very large to see the relation between the convergence paths and the corresponding steepest-descent paths. Interestingly, the two paths are extremely close to each other for the four initial values, but the number of iterations is several orders of magnitude smaller (just 23 or 24) and the endpoints coincide with $(z_M, a_M)$ (Table 1). This similarity of paths was not expected, and it is not clear how the changes in $\lambda$, $A^TA$, and $A^Tc$ at each iteration combine to produce the observed paths. For a comparison, the largest initial values of $A^TA$ for the points A, B, C, and D are close to $5 \times 10^6$, $5 \times 10^3$, $1 \times 10^3$, and 8, respectively, and the corresponding value at the minimum point is 920. The second value of $\lambda_o$ is equal to $1 \times 10^4$ for point A and $1 \times 10^2$ for the other ones. In all cases, the point $(z_M, a_M)$ is reached exactly, and the number of iterations becomes smaller (17 for point A and 11 for the other points) than for the previous value of $\lambda_o$. The values of $\lambda_o$ used here were chosen so that the convergence paths are intermediate between the previous ones and those obtained using the Gauss-Newton method. For smaller values of $\lambda_o$, convergence is even faster for all points except D. For this point, values of $\lambda_o$ less than about 10 lead to a larger number of iterations. Recall that this was the only point for which the Gauss-Newton method did not converge.

Although it is not possible to give a conclusive explanation for these differences in convergence speeds, they may be related to the fact that point D is in a region of the (z, a) plane with a very slow rate of change in the value of the merit function, so that to assure a decrease in its value, the adjustments to z and a must be smaller than for some of the other points, thus requiring larger values of $\lambda_o$. The application of the scaled version of the method is based on equation 123. Using $\lambda_o = 1$, convergence to the true values of z and a was achieved in 12 iterations for points A, B, and C (Figure 6) and in 27 iterations for point D. The problem for point D is that the merit function initially increases for this value of $\lambda_o$, which requires an increase in the values of $\lambda$ used in subsequent iterations. Therefore, to reduce the total number of iterations a larger $\lambda_o$ is needed. For this particular point, the smallest number of iterations is 17 for $\lambda_o = 20$ (Figure 6). Let us examine the convergence paths. For points A and C there are no significant qualitative differences with the corresponding paths seen in Figure 5 for the smaller values of $\lambda_o$, but for the other two points the differences are significant. For point B, the first three points of the path do not interpolate between the Gauss-Newton and steepest-descent paths. Recall that the interpolation property discussed here applies to the unscaled version of the method, so that it cannot be expected that it will always apply when scaling is introduced. For point D, the path is completely unexpected, with $(z_1, a_1)$ equal to (2.79, 2.79), much further away from $(x_o, y_o)$ than for any of the other three initial points. In addition, after this first point has been reached, the path is similar to that for point C. The inversion was repeated using equation 82 with the diagonal elements of D given by equation 94, $1/w = \lambda$, and the same values of $\lambda_o$. The results obtained in this way agree with those shown in Figure 6, thus providing a numerical confirmation of the equivalence of this choice of D and the Marquardt scaling proved analytically.

In summary, for this particular example, the unscaled and scaled versions of the Levenberg-Marquardt method perform similarly. It may be argued that the scaled version makes it easier to choose the value of $\lambda_o$, which can be taken close to one, but as point D showed, this value may not lead to a smaller number of iterations. Obviously, this is not a major concern in our case, but it may be so when solving inverse problems involving large numbers of parameters. Also note that the results for points C and D in Figure 6 show that it is difficult to make general statements regarding the convergence paths for linearized nonlinear problems, even for a relatively simple 2D case. Again, convergence to a solution may become more of an issue as the number of inversion parameters increases. In particular, the function s may have local minima in addition to an absolute minimum, in which case the inversion results may depend on the initial solution and on the selection of $\lambda_o$. These facts must be borne in mind by those beginning their work in nonlinear inverse problems. Each problem will have features that make it different from other problems and, as noted above, the best way to investigate it is through the inversion of realistic synthetic data (i.e., the model is realistic). In addition, because actual data are always affected by measurement or observational errors, representative errors should be added to the synthetic data.
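The unscaled iteration with the simple handling of the damping parameter can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the example: the exponential model, the data, the starting point, and all function names are hypothetical stand-ins for the gravity problem, and on a failed step the sketch keeps the current parameters and multiplies the damping parameter by c, which is one possible reading of "its value is increased."

```python
import numpy as np

def levenberg_marquardt(f, jac, d, x0, lam0=1.0, c=2.0, max_iter=200, tol=1e-10):
    """Unscaled damped least squares with a divide-or-multiply rule for lambda."""
    x = np.asarray(x0, dtype=float)
    lam = lam0
    r = d - f(x)                  # residual vector (plays the role of c in the text)
    s = r @ r                     # merit function: sum of squared residuals
    for _ in range(max_iter):
        A = jac(x)                # matrix of partial derivatives of the model
        # Damped normal equations: (A^T A + lam I) delta = A^T r.
        delta = np.linalg.solve(A.T @ A + lam * np.eye(x.size), A.T @ r)
        x_new = x + delta
        r_new = d - f(x_new)
        s_new = r_new @ r_new
        if s_new < s:             # merit function decreased: accept, reduce lambda
            x, r, s = x_new, r_new, s_new
            lam /= c
        else:                     # otherwise keep x and increase lambda (assumption)
            lam *= c
        if np.linalg.norm(delta) < tol * (1.0 + np.linalg.norm(x)):
            break
    return x, s, lam

# Hypothetical two-parameter model: amplitude times a decaying exponential.
t = np.linspace(0.0, 2.0, 20)
model = lambda x: x[0] * np.exp(x[1] * t)
jacobian = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
x_true = np.array([2.0, -1.5])
data = model(x_true)              # noise-free synthetic data

x_hat, s_hat, lam_hat = levenberg_marquardt(model, jacobian, data, x0=[1.0, -0.5])
print(x_hat, s_hat, lam_hat)
```

As the text recommends, experimenting with different initial damping values and with the constant c on synthetic data of this kind is a cheap way to see how sensitive the iteration count is to those choices.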
[Figure 5 image not reproduced; axis label: Depth (km).]

Figure 5. Similar to Figure 1, showing the paths corresponding to the unscaled Levenberg-Marquardt method (equation 108) using the initial values of $\lambda$ (i.e., $\lambda_o$) given at the top of the figure (circles). For a comparison, some of the steepest-descent paths in Figure 3 are also shown here (black lines).

[Figure 6 image not reproduced; axis label: Depth (km).]

Figure 6. Similar to Figure 5, showing the paths corresponding to the scaled Levenberg-Marquardt method (based on equation 123) using the initial values of $\lambda_o$ given at the top of the figure (circles).

HISTORICAL NOTE

In spite of its importance, Levenberg's (1944) paper went largely unnoticed until it was referred to in Marquardt's (1963) paper. When Levenberg published his paper he was working at the Engineering Research Section of the U. S. Army Frankford Arsenal (Philadelphia), and according to Curry (1944) the engineers there preferred Levenberg's method over the steepest-descent method. Interestingly, the Frankford Arsenal supported the work of Rosen and Eldert (1954) on lens design using least squares, but they did not use Levenberg's method. The computerized design of lenses was an area of research with military and civilian applications, with early results summarized by Feder (1963). Regarding Levenberg's paper, Feder notes that it had come to his attention in 1956 and that other people had rediscovered the damped least-squares method, although some of the work was supported by the military and could not be made public until several years later because of its classified nature. One of the rediscoverers of the damped least-squares method was Wynne (1959), who notes that the problems affecting the ordinary least-squares method when the initial solutions did not lead to an approximate linearization could be addressed by limiting the size of the adjustment vector. Using the notation introduced here, Wynne added the constraint

$$p\,\delta = 0, \tag{124}$$

where p is a weighting factor, to an equation similar to

$$A\delta = c. \tag{125}$$

Wynne noted that the least-squares solution of the combined equations 124 and 125 minimizes a function similar to

$$\bar S = r^Tr + p^2\,\delta^T\delta. \tag{126}$$

Note that this is a special case of equation 73 with w = 1 and $D = p^2I$. Wynne, however, did not give an explicit expression for the solution, which can be derived as follows. Equations 124 and 125 will be written as a single equation involving partitioned matrices, namely

$$B\delta = u, \tag{127}$$

where

$$B = \begin{pmatrix} A \\ pI \end{pmatrix}; \qquad u = \begin{pmatrix} c \\ 0 \end{pmatrix}. \tag{128}$$
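The equivalence between the least-squares solution of Wynne's combined equations 124 and 125 and the damped equation can be checked numerically. The following short Python sketch uses hypothetical random values for A, c, and p (they are not taken from any of the cited papers): it stacks A on top of pI, solves the stacked system in the least-squares sense, and compares the result with the solution of the damped normal equations with the damping parameter set to the square of p.

```python
import numpy as np

# Hypothetical overdetermined system and weighting factor.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))
c = rng.normal(size=6)
p = 0.7

# Partitioned system: A stacked on p*I, and c stacked on a zero vector.
B = np.vstack([A, p * np.eye(3)])
u = np.concatenate([c, np.zeros(3)])

# Least-squares solution of the stacked system.
delta_aug, *_ = np.linalg.lstsq(B, u, rcond=None)

# Damped normal equations with damping parameter p**2.
delta_damped = np.linalg.solve(A.T @ A + p**2 * np.eye(3), A.T @ c)
print(delta_aug, delta_damped)
```

Solving the stacked system with an orthogonal-factorization-based routine avoids forming the product of A with its transpose, which is why this formulation is often preferred numerically.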
The least-squares solution of equation 127 satisfies

$$B^TB\,\delta = B^Tu, \tag{129}$$

which after performing the multiplications indicated becomes equation 93 with $\lambda$ replaced by $p^2$. This result is quoted without proof in Wynne and Wormell (1963). Feder (1963), however, provides a proof that started with equation 126. Wynne's (1959) paper is also interesting because it notes that for p going to infinity the solution approaches that which is obtained using the method of steepest descent, thus anticipating Marquardt's results. In Wynne's method, the selection of p was empirical, and was based on the condition that the computed value of $\delta$ was small enough to assure that the linearization of the nonlinear problem was approximately satisfied. A simpler approach, suggested by Feder (1963), is to start with a large value of p and to reduce it gradually so as to assure convergence to a solution. An application of Wynne's method was provided in Nunn and Wynne (1959). Another rediscoverer was Girard (1958), although his work was not as extensive as that of Wynne. Girard's approach was to add the constraint

$$K\delta = 0, \tag{130}$$

where K is a diagonal matrix, to an equation similar to equation 125. This led to a merit function similar to $\bar S$ and to a matrix equation similar to equation 82 with $(1/w)D$ replaced by a diagonal matrix. Neither the derivation of the equation nor an expression for the elements of the matrix were given, although the latter can be derived by replacing pI with K in the matrix B in equation 128 and proceeding as before. This brief summary shows that Levenberg's method was known among people working on optical design, but this knowledge did not spread further. The widespread lack of recognition of Levenberg's work may have to do with the unavailability of adequate computational capabilities at the time his paper appeared, and the possibility that his way of finding the optimal value of w at each iteration was deemed too complicated for its computer implementation. For example, Hartley (1961) notes the need to know higher derivatives of the merit function and concludes that the method was not well suited for computer programming. Interestingly, it was Hartley who brought Levenberg's paper to the attention of Marquardt as a reviewer of the latter's paper. Marquardt's work, on the other hand, became popular rather quickly, which raises the question of why this happened. According to Davis (1993), Marquardt explained the success of his method by the fact that he implemented it in a FORTRAN code and that he gave away hundreds of copies of it. Other interesting comments by Marquardt on his paper can be found at http://garfield.library.upenn.edu/classics1979/A1979HZ24400001.pdf.

ACKNOWLEDGMENT

I gratefully acknowledge the constructive comments of one of the reviewers, Bill Rodi, which led to an improved presentation of the paper, his careful checking of the equations, and the comments that motivated the note on the relation between the Newton and Gauss-Newton methods.

APPENDIX A

SOME RESULTS CONCERNING POSITIVE DEFINITE AND SEMIDEFINITE MATRICES

(1) A square symmetric matrix C is said to be positive semidefinite if

$$y^TCy \ge 0; \qquad y \ne 0. \tag{A-1}$$

The matrix C is positive definite if the $\ge$ sign above is replaced by $>$. If $v_i$ and $\lambda_i$ are an eigenvector of C and its corresponding eigenvalue, then

$$v_i^TCv_i = \lambda_i v_i^Tv_i = \lambda_i\|v_i\|^2. \tag{A-2}$$

Because $v_i \ne 0$, if C is positive semidefinite, $\lambda_i \ge 0$. If C is positive definite, $\lambda_i > 0$ and its inverse exists because $C^{-1} = U\Lambda^{-1}U^T$ (see equations 100 and 101). If the sign in equation A-1 is replaced by $<$, the matrix C is said to be negative definite and its eigenvalues are negative.

(1a) Given any matrix A, $A^TA$ is either positive definite or semidefinite, as can be seen from

$$y^T(A^TA)y = (Ay)^TAy = \|Ay\|^2 \ge 0; \qquad y \ne 0. \tag{A-3}$$

If the inverse of $A^TA$ exists, all of its eigenvalues will be larger than zero and $A^TA$ will be positive definite. If the inverse does not exist, some of the eigenvalues will be equal to zero and the matrix will be positive semidefinite.

(1b) A diagonal matrix D with elements $(D)_{ii} = d_i > 0$ is positive definite because

$$y^TDy = \sum_i d_i y_i^2 > 0; \qquad y \ne 0,\ d_i > 0. \tag{A-4}$$

(1c) If matrices C and P are positive semidefinite and definite, respectively, and $\lambda > 0$, then $C + \lambda P$ is positive definite because

$$y^T(C + \lambda P)y = y^TCy + \lambda\,y^TPy > 0; \qquad y \ne 0,\ \lambda > 0. \tag{A-5}$$

These three results are important in the context of the damped least-squares method because if $C = A^TA$, then the matrices in parentheses in equations 82 and 93 will be positive definite as long as $w < \infty$ and $\lambda > 0$, and their inverses will exist. For the particular case of P = I the eigenvalues of $C + \lambda I$ are $\lambda_i + \lambda$, which are always positive as long as $\lambda > 0$ (Feder, 1963).

(2) If B is a symmetric positive semidefinite matrix, there exists a unique matrix C such that

$$B = C^2, \tag{A-6}$$
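where C is symmetric positive semidefinite. To see that, start with the eigenvalue decomposition of B, given by

$$B = U\Lambda U^T \tag{A-7}$$

(see equation 100) and introduce the matrix

$$C = U\Lambda^{1/2}U^T, \tag{A-8}$$

where $\Lambda^{1/2}$ is the diagonal matrix with elements $\lambda_i^{1/2}$. Then

$$C^2 = U\Lambda^{1/2}U^TU\Lambda^{1/2}U^T = U\Lambda U^T = B, \tag{A-9}$$

where equation 101 has been used. The matrix C is known as the square root of B, and is indicated by $B^{1/2}$. This matrix is unique (for a proof see, e.g., Harville, 1997; Zhang, 1999). If B is positive definite,

$$B^{-1/2} = C^{-1} = U\Lambda^{-1/2}U^T, \tag{A-10}$$

where equation 101 has been used. This matrix is also symmetric and positive definite.

(3) In its standard form, Schwarz's inequality states that

$$(a^Tb)^2 \le (a^Ta)\,(b^Tb) \tag{A-11}$$

for any vectors a and b (e.g., Arfken, 1985). Let C be a symmetric positive definite matrix. In equation A-11, replace a and b by $C^{1/2}a$ and $C^{-1/2}b$, respectively. Because $C^{1/2}$ is symmetric, this immediately gives equation 65 (e.g., Zhang, 1999).

(4) The matrix M in equation 64 is symmetric (because so are the matrices on the right) and positive definite. To show that, let y be an arbitrary nonzero vector. Then

$$y^TMy = y^TB^{-1/2}HB^{-1/2}y = \left(B^{-1/2}y\right)^TH\left(B^{-1/2}y\right) > 0, \tag{A-12}$$

because H is assumed to be positive definite. The vector in parentheses on the right-hand side of this equation is arbitrary.

APPENDIX B

To prove equation 107, equation 106 is first written in terms of sums:

$$\cos\theta = \frac{\displaystyle\sum_i \frac{u_i^2}{\lambda_i+\lambda}}{\left[\displaystyle\sum_i \frac{u_i^2}{(\lambda_i+\lambda)^2}\right]^{1/2}\|A^Tc\|}. \tag{B-1}$$

Then

$$\frac{d\cos\theta}{d\lambda} = \frac{1}{C}\left\{\left[\sum_i \frac{u_i^2}{\lambda_i+\lambda}\right]\left[\sum_i \frac{u_i^2}{(\lambda_i+\lambda)^3}\right] - \left[\sum_i \frac{u_i^2}{(\lambda_i+\lambda)^2}\right]^2\right\}, \tag{B-2}$$

where

$$C = \left[\sum_i \frac{u_i^2}{(\lambda_i+\lambda)^2}\right]^{3/2}\|A^Tc\|. \tag{B-3}$$

The factor 1/C is positive. To find the sign of the factor in braces in equation B-2 we must perform all the operations indicated. The resulting expression is

$$\frac{\left[\displaystyle\sum_i u_i^2\,\Pi_{1i}\right]\left[\displaystyle\sum_i u_i^2\,\Pi_{3i}\right] - \left[\displaystyle\sum_i u_i^2\,\Pi_{2i}\right]^2}{\left[\displaystyle\prod_i(\lambda_i+\lambda)^2\right]^2}, \tag{B-4}$$

where

$$\Pi_{ni} = \prod_{j\ne i}(\lambda_j+\lambda)^n; \qquad n = 1,2,3. \tag{B-5}$$

To show how this result is derived it will be assumed that the number of terms in the sums is three; the extension to any other number is straightforward. Let

$$a_i = \lambda_i + \lambda. \tag{B-6}$$

Then, the fractions within braces in equation B-2 can be rewritten as follows:

$$\frac{u_1^2}{a_1^n} + \frac{u_2^2}{a_2^n} + \frac{u_3^2}{a_3^n} = \frac{u_1^2a_2^na_3^n + u_2^2a_1^na_3^n + u_3^2a_1^na_2^n}{a_1^na_2^na_3^n} = \frac{u_1^2\,\Pi_{n1} + u_2^2\,\Pi_{n2} + u_3^2\,\Pi_{n3}}{a_1^na_2^na_3^n} = \frac{\displaystyle\sum_i u_i^2\,\Pi_{ni}}{\displaystyle\prod_i a_i^n}, \tag{B-7}$$

where

$$\Pi_{n1} = a_2^na_3^n; \qquad \Pi_{n2} = a_1^na_3^n; \qquad \Pi_{n3} = a_1^na_2^n; \qquad n = 1,2,3. \tag{B-8}$$

Note that the first subindex in $\Pi_{ni}$ refers to the power to which the factors in the product are raised and the second one indicates which factor is excluded from the product. The denominator in equation B-4 is common to the two terms within the braces in equation B-2. The denominator in equation B-4 is also positive. To find out the sign of the numerator, a new transformation is needed, based on the fact that

$$\Pi_{1i}\,\Pi_{3i} = \Pi_{2i}^2. \tag{B-9}$$

The numerator in equation B-4 then becomes

$$\left[\sum_i\left(u_i\,\Pi_{1i}^{1/2}\right)^2\right]\left[\sum_i\left(u_i\,\Pi_{3i}^{1/2}\right)^2\right] - \left[\sum_i\left(u_i\,\Pi_{1i}^{1/2}\right)\left(u_i\,\Pi_{3i}^{1/2}\right)\right]^2, \tag{B-10}$$

which is always positive because of the Schwarz inequality (see equation A-11). To apply it to equation B-10 let a and b be vectors with components $u_i\,\Pi_{1i}^{1/2}$ and $u_i\,\Pi_{3i}^{1/2}$, respectively. This result shows that

$$\frac{d\cos\theta}{d\lambda} > 0. \tag{B-11}$$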
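The monotonic behavior established in this appendix, together with the decrease of the length of the adjustment vector with increasing damping, can be illustrated numerically. The Python sketch below is only an illustration: the random matrix A, the vector c, the seed, and the grid of damping values are hypothetical choices. It solves the damped equations for an increasing sequence of damping values, records the norm of the solution and the cosine of the angle between the solution and the vector obtained by multiplying the transpose of A by c, and also checks the eigenvalue expression for the squared norm at one damping value.

```python
import numpy as np

# Hypothetical full-rank system.
rng = np.random.default_rng(2)
A = rng.normal(size=(8, 3))
c = rng.normal(size=8)
g = A.T @ c                                  # proportional to minus the gradient

lams = np.logspace(-2, 2, 20)                # increasing damping values
norms, cosines = [], []
for lam in lams:
    delta = np.linalg.solve(A.T @ A + lam * np.eye(3), g)
    norms.append(np.linalg.norm(delta))
    cosines.append(delta @ g / (np.linalg.norm(delta) * np.linalg.norm(g)))
norms = np.array(norms)
cosines = np.array(cosines)

# Eigenvalue form of the squared norm: sum of u_i^2 / (lambda_i + lambda)^2.
eigvals, U = np.linalg.eigh(A.T @ A)
u = U.T @ g
lam = lams[5]
norm_sq_eig = np.sum(u**2 / (eigvals + lam) ** 2)
print(norms[[0, -1]], cosines[[0, -1]], norm_sq_eig)
```

For generic data such as these, the norms should decrease and the cosines should increase along the grid, so the angle shrinks as the damping grows.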
REFERENCES
Arfken, G., 1985, Mathematical methods for physicists: Academic Press Inc.
Apostol, T., 1969, Calculus, vol. II: Blaisdell Publishing Company.
Beveridge, G., and R. Schechter, 1970, Optimization: Theory and practice: McGraw-Hill Book Company.
Crockett, J., and H. Chernoff, 1955, Gradient methods of maximization: Pacific Journal of Mathematics, 5, 33-50.
Curry, H., 1944, The method of steepest descent for non-linear minimization problems: Quarterly of Applied Mathematics, 2, 258-261.
Davies, M., and I. Whitting, 1972, A modified form of Levenberg's correction, in F. Lootsma, ed., Numerical methods for non-linear optimization: Academic Press Inc., 191-201.
Davis, P., 1993, Levenberg-Marquart [sic] methods and nonlinear estimation: SIAM News, 26, no. 2.
Dobrin, M., 1976, Introduction to geophysical prospecting: McGraw-Hill Book Co.
Dorny, C., 1975, A vector space approach to models and optimization: John Wiley & Sons.
Draper, N., and H. Smith, 1981, Applied regression analysis: John Wiley & Sons.
Feder, D., 1963, Automatic optical design: Applied Optics, 2, 1209-1226.
Gill, P., W. Murray, and M. Wright, 1981, Practical optimization: Academic Press Inc.
Girard, A., 1958, Calcul automatique en optique géométrique: Revue d'Optique, 37, 225-241.
Greenstadt, J., 1967, On the relative efficiencies of gradient methods: Mathematics of Computation, 21, 360-367.
Hartley, H., 1961, The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares: Technometrics, 3, 269-280.
Harville, D., 1997, Matrix algebra from a statistician's perspective: Springer, Pub. Co., Inc.
Jenkins, G., and D. Watts, 1968, Spectral analysis: Holden-Day.
Krzanowski, W., 1988, Principles of multivariate analysis: Oxford University Press.
Levenberg, K., 1944, A method for the solution of certain non-linear problems in least squares: Quarterly of Applied Mathematics, 2, 164-168.
Luenberger, D., 1973, Introduction to linear and nonlinear programming: Addison-Wesley Publishing Company.
Marquardt, D., 1963, An algorithm for least-squares estimation of nonlinear parameters: SIAM Journal, 11, 431-441.
Noble, B., and J. Daniel, 1977, Applied linear algebra: Prentice-Hall.
Nunn, M., and C. Wynne, 1959, Lens designing by electronic digital computer: II: Proceedings of the Physical Society, 74, 316-329.
Rosen, S., and C. Eldert, 1954, Least-squares method for optical correction: Journal of the Optical Society of America, 44, 250-252.
Seber, G., 1977, Linear regression analysis: John Wiley & Sons, Inc.
Seber, G., and C. Wild, 1989, Nonlinear regression: John Wiley & Sons, Inc.
Spiegel, M., 1959, Vector analysis: McGraw-Hill Book Co.
Wynne, C., 1959, Lens designing by electronic digital computer: I: Proceedings of the Physical Society (London), 73, 777-787.
Wynne, C., and P. Wormell, 1963, Lens design by computer: Applied Optics, 2, 1233-1238.
Zhang, F., 1999, Matrix theory: Springer, Pub. Co., Inc.