Levenberg-Marquardt Optimization
Sam Roweis
Abstract
Levenberg-Marquardt Optimization is a virtual standard in nonlinear optimization which significantly outperforms gradient descent and conjugate gradient methods for medium-sized problems. It is a pseudo-second-order method which means that it works with only function evaluations and gradient information but it estimates the Hessian matrix using the sum of outer products of the gradients. This note reviews the mathematical motivations for Levenberg-Marquardt and also details the algorithm.
0.1 Improvements To Simple Gradient Descent

Adding a momentum term to the weight changes alleviates many of the above problems. Momentum is an example of an improvement on our simple first-order method that keeps it first order but tries to get some curvature information by averaging the gradients locally. However, in order to cut down on cost, it only averages gradients that have already been computed. Another technique, which improves basic gradient descent by estimating curvature information and which works extremely well for medium-sized models, is discussed below.
For a model f(x; w) with targets y, the error we minimize is the mean squared prediction error:

E(w) = \langle (f(x; w) - y)^2 \rangle

where the angle brackets denote the mean over input-output pairs. For linear functions f, this has a quadratic form:
E(w) = a + 2 b^T w + w^T C w
where a, b, and C depend on averages over the input-output pairs. We can find the minimum in closed form by solving for where the gradient goes to zero: setting \nabla E(w) = 2b + 2Cw = 0 gives the solution below. Don't be concerned by all the matrix notation; this is exactly the same thing that you know well from the scalar case: if the function is linear then the squared error is quadratic, and everybody knows how to solve for the minimum of a parabola.
w_{opt} = -C^{-1} b
Here we don't have to do any gradient descent; we can just jump directly to the minimum. For nonlinear choices of f, of course, our function E will be more complex than the above quadratic form and we won't just be able to hop to a minimum. However, close to a minimum we can linearize the function, which approximates E by a quadratic form, and use the above method to guess where the minimum is. Even if we don't jump right to the true minimum the first time, if the approximation is good we will converge much more quickly than with steepest descent. In essence, what we are doing is looking at the curvature of the error surface where we are and assuming that the curvature we see is due to a parabolic bowl. Then we jump to the bottom of this fictitious bowl and re-evaluate things wherever we end up.
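To make the closed-form jump concrete, here is a minimal NumPy sketch (not from the original note) for the linear model f(x; w) = w^T x, in which case a = \langle y^2 \rangle, b = -\langle y x \rangle, and C = \langle x x^T \rangle; the data, dimensions, and noise level are made-up choices for illustration.

```python
import numpy as np

# Synthetic data for a linear model f(x; w) = w @ x (illustrative setup,
# not from the note): N input-output pairs with true weights w_true.
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))            # rows are input vectors x
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

# Averages over input-output pairs (the angle brackets in the text):
# E(w) = a + 2 b^T w + w^T C w  with  b = -<y x>  and  C = <x x^T>.
C = (X.T @ X) / N
b = -(X.T @ y) / N

# Closed-form jump to the minimum: w_opt = -C^{-1} b.
w_opt = -np.linalg.solve(C, b)
print(w_opt)                           # recovers w_true (up to the noise)
```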
We will derive this quadratic approximation to E in a series of steps. First, consider the general deterministic model f(x; w). It is a function of both the data x and its parameters w. When we use our model to predict the behavior of the unknown target function, we have fixed w and are more interested in f as a function of x. However, when we are training our model by optimizing the weights to reduce E, we are interested in f as a function of w. For the discussion below we will concentrate on this view of f, and all derivatives and gradients will be with respect to w. One way to approximate E as locally quadratic in w near a minimum is to approximate f(x; w) as a linear function of w, which we will now derive. Remember that all gradients are with respect to w and all averages are over input-output pairs. First we write a new function \hat{f}(x; w) which is a linear approximation of f(x; w) in the neighborhood of a specific weight value w_0:

\hat{f}(x; w) = f(x; w_0) + (w - w_0)^T \nabla f(x; w_0)

Assuming the model is \hat{f}, we write expressions for E(w) and \nabla E(w) in terms of y, \hat{f}(x; w), and \nabla \hat{f}(x; w):
E(w) = \langle (\hat{f}(x; w) - y)^2 \rangle
Substituting for \hat{f}, we solve for \nabla E(w) in terms of y, w, w_0, f(x; w_0), and \nabla f(x; w_0). First note the simple form of \nabla \hat{f}(x; w): \nabla \hat{f}(x; w) = \nabla f(x; w_0), because \hat{f} is a linear function of w. Now we can solve for \nabla E:

\nabla E(w) = 2 \langle (f(x; w_0) + (w - w_0)^T \nabla f(x; w_0) - y) \nabla f(x; w_0) \rangle

Let
d = \langle (f(x; w_0) - y) \nabla f(x; w_0) \rangle

H = \langle \nabla f(x; w_0) \nabla f(x; w_0)^T \rangle
where the letter d stands for derivative and the letter H stands for Hessian. Notice that while d is exactly the average error gradient, H is not the true Hessian matrix of mixed partials of the error; in other words, H_{ij} \neq \partial^2 E / \partial w_i \partial w_j. Instead, H is an approximation to the Hessian which is obtained by averaging outer products of the first-order gradients. This approximation is exact if f is linear, but in general may be quite poor. However, the trick that we will soon see is that we rely on this approximation only in regions where a linear approximation to f is reasonable. One final point: it may look as though H is of rank one since it is made from outer products of
vectors; however, remember that it is the average of many such outer products, each of rank one, and so in general it is full rank. Back to our derivation to finish up the quadratic approximation: we write the previous equation in terms of d and H, then solve for the w where \nabla E goes to zero.

\nabla E(w) = 2H(w - w_0) + 2d

\nabla E(w_{opt}) = 0 \quad\Rightarrow\quad 2H(w_{opt} - w_0) + 2d = 0
w_{opt} = w_0 - H^{-1} d
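As a concrete illustration of how d and H are assembled from per-example gradients, here is a small NumPy sketch; the toy model f(x; w) = w_1 \sin(w_0 x), the data, and the starting point w_0 are all hypothetical choices for illustration, not anything prescribed by the note.

```python
import numpy as np

# A toy nonlinear model f(x; w) = w[1] * sin(w[0] * x) -- a hypothetical
# choice just to make d and H concrete; the derivation is model-agnostic.
def f(x, w):
    return w[1] * np.sin(w[0] * x)

def grad_f(x, w):
    # Gradient of f with respect to the weights w (not the inputs x);
    # one column per input-output pair, shape (n_params, n_points).
    return np.array([w[1] * x * np.cos(w[0] * x), np.sin(w[0] * x)])

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
w_true = np.array([2.0, 0.7])
y = f(x, w_true) + 0.01 * rng.normal(size=x.size)

w0 = np.array([1.8, 0.5])          # current weights, assumed near a minimum
g = grad_f(x, w0)

# d: average error gradient; H: average of the rank-one outer products
# (full rank here because it averages many such products).
err = f(x, w0) - y
d = (g * err).mean(axis=1)
H = (g @ g.T) / x.size

# Jump to the bottom of the fictitious parabolic bowl: w_opt = w0 - H^{-1} d.
w_opt = w0 - np.linalg.solve(H, d)
print(w_opt)                       # typically much closer to w_true
```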
Recall, for comparison, the simple steepest descent update rule (with some small learning rate \epsilon):

w_{i+1} = w_i - \epsilon d
Compare this to the update rule based on our quadratic approximation:
w_{i+1} = w_i - H^{-1} d
Our quadratic rule is not universally better, since it assumes a linear approximation of f's dependence on w, which is only valid near a minimum. The technique invented by Levenberg involves "blending" between these two extremes: we can use a steepest-descent-type method until we approach a minimum, then gradually switch to the quadratic rule. We can try to guess how close we are to a minimum by how our error is changing. In particular, Levenberg's algorithm is formalized as follows: let \lambda be a blending factor which will determine the mix between steepest descent and the quadratic approximation. The update rule is
w_{i+1} = w_i - (H + \lambda I)^{-1} d
where I is the identity matrix. As \lambda gets small, the rule approaches the quadratic approximation update rule above. If \lambda is large, the rule approaches

w_{i+1} = w_i - \frac{1}{\lambda} d

which is steepest descent with a small step size. The algorithm adjusts \lambda according to whether E is increasing or decreasing, as follows:
1. Do an update as directed by the rule above.
2. Evaluate the error at the new weight vector.
3. If the error has increased as a result of the update, then retract the step (i.e. reset the weights to their previous values) and increase \lambda by a factor of 10 or some other significant factor. Then go to 1 and try an update again.
4. If the error has decreased as a result of the update, then accept the step (i.e. keep the weights at their new values) and decrease \lambda by a factor of 10 or so.

The intuition is that if the error is increasing, our quadratic approximation is not working well and we are likely not near a minimum, so we should increase \lambda in order to blend more towards simple gradient descent. Conversely, if the error is decreasing, our approximation is working well and we expect that we are getting closer to a minimum, so \lambda is decreased to bank more on the Hessian.

Marquardt improved this method with a clever incorporation of estimated local curvature information, resulting in the Levenberg-Marquardt method. The insight of Marquardt was that even when \lambda is high and we are doing essentially gradient descent, we can still get some benefit from the Hessian matrix that we estimated. In essence, he suggested that we should move further in the directions in which the gradient is smaller, in order to get around the classic "error valley" problem. So he replaced the identity matrix in Levenberg's original equation with the diagonal of the Hessian, giving the Levenberg-Marquardt update rule:

w_{i+1} = w_i - (H + \lambda\, \mathrm{diag}[H])^{-1} d
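Putting the whole recipe together, below is a compact NumPy sketch of the Levenberg-Marquardt loop just described, reusing the hypothetical sine model from the earlier sketch; the initial \lambda, the factor-of-10 schedule, the 1e12 bail-out, and the fixed step budget are assumptions of this illustration rather than prescriptions from the note.

```python
import numpy as np

# Same hypothetical toy model as before: f(x; w) = w[1] * sin(w[0] * x).
def f(x, w):
    return w[1] * np.sin(w[0] * x)

def grad_f(x, w):
    # Gradient with respect to the weights, shape (n_params, n_points).
    return np.array([w[1] * x * np.cos(w[0] * x), np.sin(w[0] * x)])

def error(x, y, w):
    return np.mean((f(x, w) - y) ** 2)

def levenberg_marquardt(x, y, w, lam=1e-3, n_steps=20):
    for _ in range(n_steps):
        g = grad_f(x, w)
        d = (g * (f(x, w) - y)).mean(axis=1)  # average error gradient
        H = (g @ g.T) / x.size                # outer-product Hessian estimate
        # Steps 1-4: try a step; retract and raise lambda until error drops.
        while lam < 1e12:                     # bail-out guard (assumption)
            # Marquardt's refinement: damp with diag[H] instead of I.
            w_new = w - np.linalg.solve(H + lam * np.diag(np.diag(H)), d)
            if error(x, y, w_new) < error(x, y, w):
                w, lam = w_new, lam / 10.0    # accept: trust quadratic more
                break
            lam *= 10.0                       # reject: blend toward descent
    return w

# Usage on synthetic data (illustrative numbers only).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
w_true = np.array([2.0, 0.7])
y = f(x, w_true) + 0.01 * rng.normal(size=x.size)
print(levenberg_marquardt(x, y, np.array([1.0, 0.2])))  # approaches w_true
```

Note how, for large \lambda, the damping term \lambda\, \mathrm{diag}[H] rescales each component of the step by the local curvature estimate, so the algorithm moves further along directions with small gradients, which is exactly Marquardt's fix for the "error valley" problem.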