Advanced Gradient Descent
Irène Waldspurger∗
December 5, 2022
• heavy ball, which is the simplest form of gradient descent with momentum, and already provides significant speed-ups,
We assume that a minimizer exists, and denote it x∗.
Gradient descent is an iterative algorithm, which moves from one iterate
to the next following the direction given by the (opposite of the) gradient.
Its definition is recalled in Algorithm 1.
Consequently, −∇f(x) is the direction along which f decays the fastest around x. It is therefore a reasonable direction to follow if we want to minimize f.
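As an illustration, plain gradient descent can be sketched in a few lines of Python; the test function f(x) = ||x||²/2, the stepsize and the starting point below are assumptions made for the example, not part of the lecture.

```python
def gradient_descent(grad_f, x0, alpha, T):
    """Run T steps of x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = list(x0)
    for _ in range(T):
        g = grad_f(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Example: f(x) = 0.5 * ||x||^2, so grad_f(x) = x and the minimizer is 0.
# Each coordinate is multiplied by (1 - alpha) at every step.
x_final = gradient_descent(lambda x: list(x), [1.0, -2.0], alpha=0.1, T=100)
```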
\[ f(x_t) - f(x_\ast) \]
goes to zero. This rate depends on the properties of the objective function f,
so we must specify what we assume about f . In this lecture, we will assume
either
1. f is convex and smooth, or
2. f is strongly convex and smooth.
Remark
A strongly convex function is necessarily convex.
\[ f(x_t) - f(x_\ast) \le \frac{2L\,\|x_0 - x_\ast\|^2}{t+4}. \]
If, in addition to being smooth, f is assumed to be strongly convex, then
the convergence rate can be shown to be geometric. The geometric decay
rate is 1 − κ, where κ is the conditioning of the problem:
\[ \kappa = \frac{\mu}{L}, \]
where µ and L are, respectively, the strong convexity and smoothness con-
stants.
Theorem 2: gradient descent - smooth strongly convex case
Let 0 < µ < L be fixed. Let f be L-smooth and µ-strongly convex.
We consider gradient descent with constant stepsize: $\alpha_t = \frac{1}{L}$ for all t.
Then, for any t ∈ N,
\[ f(x_t) - f(x_\ast) \le \frac{L}{2} \left( 1 - \frac{\mu}{L} \right)^t \|x_0 - x_\ast\|^2. \]
Theorems 1 and 2 are optimal, in the sense that there exist functions
f for which the inequality is an equality (up to minor modifications in the
constants).
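As a sanity check, the bound of Theorem 2 can be verified numerically. The sketch below uses a diagonal quadratic whose strong convexity and smoothness constants are exactly µ and L; the particular values µ = 0.5, L = 2 and the starting point are assumptions for the example.

```python
mu, L = 0.5, 2.0
f = lambda x: 0.5 * (mu * x[0] ** 2 + L * x[1] ** 2)   # minimized at x_* = (0, 0), f(x_*) = 0
grad = lambda x: [mu * x[0], L * x[1]]

x = [1.0, 1.0]
dist0_sq = x[0] ** 2 + x[1] ** 2                        # ||x0 - x_*||^2
bound_ok = True
for t in range(50):
    gap = f(x)                                          # f(x_t) - f(x_*)
    bound = (L / 2) * (1 - mu / L) ** t * dist0_sq      # right-hand side of Theorem 2
    bound_ok = bound_ok and gap <= bound + 1e-12
    x = [xi - (1 / L) * gi for xi, gi in zip(x, grad(x))]   # stepsize 1/L
```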
2 Motivation of momentum
In this section, we motivate the introduction of momentum: we consider a
simple function f for which gradient descent converges slowly, explain why
convergence is slow, and why momentum can speed it up.
Let f be a simple quadratic function over R2 :
\[ \forall (x_1, x_2) \in \mathbb{R}^2, \quad f(x_1, x_2) = \frac{1}{2} \left( \lambda_1 x_1^2 + \lambda_2 x_2^2 \right), \]
for parameters 0 < λ1 < λ2 . The unique minimizer of f is
x∗ = (0, 0).
The gradient of f is
\[ \forall (x_1, x_2) \in \mathbb{R}^2, \quad \nabla f(x_1, x_2) = (\lambda_1 x_1, \lambda_2 x_2). \]
If we run gradient descent with a constant stepsize α > 0, the relation between iterates $x_t = (x_{t,1}, x_{t,2})$ and $x_{t+1} = (x_{t+1,1}, x_{t+1,2})$ is
\[
\begin{aligned}
(x_{t+1,1}, x_{t+1,2}) &= x_t - \alpha \nabla f(x_t) \\
&= (x_{t,1}, x_{t,2}) - \alpha (\lambda_1 x_{t,1}, \lambda_2 x_{t,2}) \\
&= \left( (1 - \alpha\lambda_1)\, x_{t,1},\; (1 - \alpha\lambda_2)\, x_{t,2} \right).
\end{aligned}
\]
If λ1 and λ2 are of the same order, this is fine: it suffices to pick α of the order of $\frac{1}{\lambda_1} \sim \frac{1}{\lambda_2}$.
But if λ1 is much smaller than λ2 (that is, the problem is ill-conditioned), there is no good choice of α. If we set $\alpha \approx \frac{1}{\lambda_1}$, then
\[ 1 - \alpha\lambda_2 = 1 - \frac{\lambda_2}{\lambda_1} < -1 \]
and the second coordinate of the iterates, xt,2 , diverges when t → ∞. If, on
the other hand, we set $\alpha \approx \frac{1}{\lambda_2}$, then the second coordinate goes to 0, and fast, but the first one converges very slowly:
\[ 1 - \alpha\lambda_1 = 1 - \frac{\lambda_1}{\lambda_2} \approx 1. \]
In this situation, gradient descent is slow. Figure 1a displays the first fifteen
iterates in the case where λ1 = 0.1 and λ2 = 1, for α = 4/3 (that is, of the order of $\frac{1}{\lambda_2}$). As expected, the second coordinate goes fast to zero, but the
first one decays only slowly.
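The two decay factors can be checked directly; in the Python snippet below, λ1, λ2 and α are the values of Figure 1a, while the starting point (1, 1) is an assumption.

```python
lam1, lam2, alpha = 0.1, 1.0, 4 / 3
x1, x2 = 1.0, 1.0
for _ in range(15):
    x1 *= 1 - alpha * lam1   # factor 13/15 ~ 0.867: slow decay
    x2 *= 1 - alpha * lam2   # factor -1/3: fast decay, with sign flips
# After 15 iterations, x2 is essentially zero while x1 has only
# decreased to about 0.12.
```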
A possible remedy to this slow convergence is to use the information given by the past gradients when we define xt+1 from xt: instead of moving in the direction given by −∇f(xt), we move in a direction −mt+1, where mt+1 is a (weighted) average of ∇f(xt) and the previous gradients ∇f(x0), ..., ∇f(xt−1). Concretely, this yields the following iteration formula:
\[
\begin{aligned}
m_{t+1} &= \gamma_t\, m_t + (1 - \gamma_t)\, \nabla f(x_t), \\
x_{t+1} &= x_t - \alpha_t\, m_{t+1},
\end{aligned}
\]
where $\alpha_t > 0$ is the stepsize, $\gamma_t \in [0, 1)$ is the momentum parameter, and $m_0 = \nabla f(x_0)$.
[Figure 1: the first fifteen iterates of gradient descent on f, (a) without momentum, for α = 4/3, and (b) with momentum and a larger stepsize.]
Proof of the remark. From the second equation in the iteration formula:
\[ \forall t \in \mathbb{N}, \quad m_{t+1} = \frac{x_t - x_{t+1}}{\alpha_t}, \]
\[ \Rightarrow\ \forall t \in \mathbb{N} - \{0\}, \quad m_t = \frac{x_{t-1} - x_t}{\alpha_{t-1}}. \]
We plug these equalities into the first iteration formula:
\[ \forall t \in \mathbb{N} - \{0\}, \quad \frac{x_t - x_{t+1}}{\alpha_t} = \gamma_t\, \frac{x_{t-1} - x_t}{\alpha_{t-1}} + (1 - \gamma_t)\, \nabla f(x_t), \]
\[ \Rightarrow\ \forall t \in \mathbb{N} - \{0\}, \quad x_{t+1} = x_t - \alpha_t (1 - \gamma_t)\, \nabla f(x_t) + \frac{\alpha_t \gamma_t}{\alpha_{t-1}} (x_t - x_{t-1}). \]
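In particular, when the parameters are constant, $\alpha_t = \alpha$ and $\gamma_t = \gamma$, the equivalent form above simplifies to

```latex
x_{t+1} = x_t - \alpha (1 - \gamma)\, \nabla f(x_t) + \gamma\, (x_t - x_{t-1}),
```

that is, a gradient step with effective stepsize $\alpha(1-\gamma)$, plus a momentum term $\gamma(x_t - x_{t-1})$; this is the iteration performed by the heavy ball method of the next section.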
Using momentum instead of the plain gradient in the iteration formula makes it possible to use a larger stepsize. Indeed, for large stepsizes, αt∇f(xt) diverges when t grows, which causes the divergence of plain gradient descent. But it is possible that αt mt stays bounded, in which case gradient descent with momentum does not diverge: αt mt is an average of potentially large gradients pointing in different directions, which may therefore compensate each other. This can be seen in Figure 1b: compared to Figure 1a, the stepsize is larger; consequently, the first coordinate converges faster towards zero, yet the second coordinate does not diverge.
3 Heavy ball
The simplest version of gradient descent with momentum is when the momentum and stepsize parameters are constant. It is due to Polyak, and often called heavy ball.
Input: Starting point x0, number of iterations T, stepsize α, momentum parameter γ.
Set m0 = ∇f(x0);
for t = 0, . . . , T − 1 do
define
mt+1 = γ mt + (1 − γ)∇f(xt);
xt+1 = xt − α mt+1;
end
return xT
Algorithm 2: Heavy ball
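A minimal Python sketch of heavy ball, written with the momentum variable m of the iteration formula of Section 2. The quadratic test problem, the starting point, and the initialization m0 = ∇f(x0) are assumptions for the example; α and γ are set as in Theorem 3 below.

```python
def heavy_ball(grad_f, x0, alpha, gamma, T):
    """m_{t+1} = gamma*m_t + (1-gamma)*grad_f(x_t);  x_{t+1} = x_t - alpha*m_{t+1}."""
    x = list(x0)
    m = grad_f(x)               # assumed initialization of the momentum term
    for _ in range(T):
        g = grad_f(x)
        m = [gamma * mi + (1 - gamma) * gi for mi, gi in zip(m, g)]
        x = [xi - alpha * mi for xi, mi in zip(x, m)]
    return x

# Ill-conditioned quadratic from Section 2: f(x) = 0.5*(0.1*x1^2 + x2^2),
# so mu = 0.1 and L = 1.
mu, L = 0.1, 1.0
grad = lambda x: [mu * x[0], L * x[1]]
alpha = 1 / (mu * L) ** 0.5                                   # 1 / sqrt(mu L)
gamma = ((L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)) ** 2
x = heavy_ball(grad, [1.0, 1.0], alpha, gamma, 100)
# Both coordinates shrink roughly like sqrt(gamma)^t, here ~0.52^t.
```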
Theorem 3: heavy ball - quadratic case
Let 0 < µ < L be fixed. Let f be a quadratic function, which is L-smooth and µ-strongly convex. We set
\[ \alpha = \frac{1}{\sqrt{\mu L}}, \qquad \gamma = \left( \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^2. \]
Then there exists a constant C > 0 such that, for any t ∈ N,
\[ f(x_t) - f(x_\ast) \le C\, t^2\, \gamma^t\, \|x_0 - x_\ast\|^2. \]
Before proving the theorem, let us compare the convergence rate with
gradient descent. From Theorem 2, gradient descent converges geometrically,
with decay rate
\[ 1 - \frac{\mu}{L}. \]
Theorem 3, on the other hand, guarantees for heavy ball a convergence with
decay rate
\[ \left( \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^2 \approx 1 - 4\sqrt{\frac{\mu}{L}} \quad \text{when } \mu \ll L. \]
For instance, when L = 100µ, dividing $f(x_t) - f(x_\ast)$ by a factor 10 requires of the order of
\[ \frac{\ln(10)}{-\ln\left( \left( \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^2 \right)} \approx 6 \]
iterations.
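The comparison between the two decay rates can be made concrete with a short computation; the conditioning value µ/L = 1/100 below is an illustrative assumption.

```python
import math

mu, L = 1.0, 100.0
gd_rate = 1 - mu / L                                # decay rate of gradient descent
hb_rate = ((math.sqrt(L) - math.sqrt(mu)) /
           (math.sqrt(L) + math.sqrt(mu))) ** 2     # decay rate of heavy ball

# Number of iterations needed to divide the suboptimality gap by 10.
iters_gd = math.log(10) / -math.log(gd_rate)        # about 229 iterations
iters_hb = math.log(10) / -math.log(hb_rate)        # about 6 iterations
```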
Proof of Theorem 3. Up to a change of coordinates, we can assume that f is
of the form
\[ f(x_1, \ldots, x_n) = \frac{1}{2} \left( \lambda_1 x_1^2 + \cdots + \lambda_n x_n^2 \right), \]
where
\[ L \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge \mu > 0 \]
are the eigenvalues of the matrix representing f .
Denoting $x_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,n})$, we have, for each $k$ and all $t \in \mathbb{N}$,
\[
\begin{pmatrix} m_{t,k} \\ x_{t,k} \end{pmatrix}
= G_k \begin{pmatrix} (\sigma_k^{(1)})^t & g_{t,k} \\ 0 & (\sigma_k^{(2)})^t \end{pmatrix} G_k^{-1}
\begin{pmatrix} m_{0,k} \\ x_{0,k} \end{pmatrix},
\]
with $g_{t,k} = \left( (\sigma_k^{(1)})^{t-1} + (\sigma_k^{(1)})^{t-2} \sigma_k^{(2)} + \cdots + (\sigma_k^{(2)})^{t-1} \right) g_k$.
(The triple bar denotes the spectral norm.)
For some constants $C, C' > 0$, the spectral norm can be upper bounded by
\[
\left|\!\left|\!\left| \begin{pmatrix} (\sigma_k^{(1)})^t & g_{t,k} \\ 0 & (\sigma_k^{(2)})^t \end{pmatrix} \right|\!\right|\!\right|
\le C \max\left( |\sigma_k^{(1)}|^t,\; |\sigma_k^{(2)}|^t,\; |g_{t,k}| \right)
\le C'\, t\, \max\left( |\sigma_k^{(1)}|,\; |\sigma_k^{(2)}| \right)^t.
\]
We must compute $\max\left( |\sigma_k^{(1)}|, |\sigma_k^{(2)}| \right)$, where we recall that $\sigma_k^{(1)}, \sigma_k^{(2)}$ are the eigenvalues of $M_k$. These eigenvalues are the roots of the characteristic polynomial of $M_k$. A (slightly tedious) computation shows that the polynomial has a negative discriminant. The eigenvalues are therefore complex conjugates of each other:
\[ |\sigma_k^{(1)}|^2 = |\sigma_k^{(2)}|^2 = \sigma_k^{(1)} \sigma_k^{(2)} = \det(M_k) = \gamma. \]
In particular, $\max\left( |\sigma_k^{(1)}|, |\sigma_k^{(2)}| \right) = \sqrt{\gamma}$, and we get
\[ \forall k, \quad \left\| \begin{pmatrix} m_{t,k} \\ x_{t,k} \end{pmatrix} \right\| \le C'\, t\, \gamma^{t/2} \left\| \begin{pmatrix} m_{0,k} \\ x_{0,k} \end{pmatrix} \right\| \]
\[ \Rightarrow\ |x_{t,k}| \le C'\, t\, \gamma^{t/2} \sqrt{x_{0,k}^2 + m_{0,k}^2} \le C'\, t\, \gamma^{t/2} \sqrt{1 + L^2}\, |x_{0,k}| \]
\[ \Rightarrow\ f(x_t) - f(x_\ast) = \frac{1}{2} \sum_{k=1}^{n} \lambda_k x_{t,k}^2 \le L (1 + L^2)\, C'^2\, t^2\, \gamma^t\, \|x_0\|^2. \]
The theorem we just proved does not extend from strongly convex quadratic
functions to general strongly convex functions. Indeed, there are unfavorable
strongly convex functions, on which gradient descent with momentum is not
faster than its standard version (or even where it diverges whereas plain
gradient descent converges). Fortunately, many “interesting” functions are
either quadratic or, more frequently, approximately quadratic in the neigh-
borhood of a minimizer. For these functions, heavy ball is usually better
than plain gradient descent.
4 Nesterov’s method
In the previous section, we saw that heavy ball has a faster convergence rate than gradient descent for quadratic problems, but not for all strongly convex problems. In addition, it does not apply when the objective function is not strongly convex. In this final section, we present an algorithm which solves both of these issues. As it was discovered by Yurii Nesterov, it is often called "Nesterov's method".
The iteration formula for this algorithm is
\[ x_{t+1} = x_t - \alpha_t\, \nabla f\big( x_t + \beta_t (x_t - x_{t-1}) \big) + \beta_t (x_t - x_{t-1}). \]
Input: Starting point x0, number of iterations T, smoothness parameter L, strong convexity parameter µ.
Set x−1 = x0, α = 1/L, β = (√L − √µ)/(√L + √µ);
for t = 0, . . . , T − 1 do
define
xt+1 = xt − α∇f(xt + β(xt − xt−1)) + β(xt − xt−1);
end
return xT
Algorithm 3: Nesterov's algorithm with constant parameters
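A Python sketch of Algorithm 3; the update mirrors Algorithm 4 with a constant extrapolation parameter β, and the quadratic test problem and starting point are assumptions for the example.

```python
def nesterov_const(grad_f, x0, L, mu, T):
    """Nesterov's method with constant parameters alpha = 1/L and
    beta = (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))."""
    alpha = 1 / L
    beta = (L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)
    x_prev, x = list(x0), list(x0)     # x_{-1} = x_0
    for _ in range(T):
        # Extrapolated point y_t = x_t + beta * (x_t - x_{t-1}).
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad_f(y)
        x, x_prev = [yi - alpha * gi for yi, gi in zip(y, g)], x
    return x

# Quadratic with mu = 0.1 and L = 1 (the example of Section 2).
grad = lambda x: [0.1 * x[0], 1.0 * x[1]]
x = nesterov_const(grad, [1.0, 1.0], L=1.0, mu=0.1, T=100)
```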
Input: Starting point x0, number of iterations T, smoothness parameter L.
Set x−1 = x0, α = 1/L, λ−1 = 0;
for t = 0, . . . , T − 1 do
define
λt = (1 + √(1 + 4λt−1²))/2;
βt = (λt−1 − 1)/λt;
xt+1 = xt − α∇f(xt + βt(xt − xt−1)) + βt(xt − xt−1).
end
return xT
Algorithm 4: Nesterov's algorithm with changing parameters
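Algorithm 4 can be sketched in Python as follows; the quadratic test problem and starting point are assumptions for the example.

```python
def nesterov(grad_f, x0, L, T):
    """Nesterov's method with changing parameters (Algorithm 4)."""
    alpha = 1 / L
    x_prev, x = list(x0), list(x0)     # x_{-1} = x_0
    lam_prev = 0.0                     # lambda_{-1} = 0
    for _ in range(T):
        lam = (1 + (1 + 4 * lam_prev ** 2) ** 0.5) / 2
        beta = (lam_prev - 1) / lam
        # Gradient step taken at the extrapolated point x_t + beta*(x_t - x_{t-1}).
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad_f(y)
        x, x_prev = [yi - alpha * gi for yi, gi in zip(y, g)], x
        lam_prev = lam
    return x

# L-smooth convex quadratic: f(x) = 0.5*(0.1*x1^2 + x2^2), with L = 1.
f = lambda x: 0.5 * (0.1 * x[0] ** 2 + x[1] ** 2)
grad = lambda x: [0.1 * x[0], 1.0 * x[1]]
x = nesterov(grad, [1.0, 1.0], L=1.0, T=200)
```

The O(1/t²) guarantee stated below applies to these iterates.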
\[ f(x_t) - f(x_\ast) \le \frac{2L\,\|x_0 - x_\ast\|^2}{(t+1)^2}. \]
which is L-smooth and convex, such that, after t steps,
\[ f(x_t) - f(x_\ast) \ge \frac{3L\,\|x_0 - x_\ast\|^2}{32(t+1)^2}. \]
This means that, up to the constant, no first-order algorithm can achieve a
better convergence rate than the one in Theorem 5.
Nesterov’s method is also optimal for smooth strongly convex functions
among all first-order algorithms: no first-order algorithm can achieve a better
convergence rate, for L-smooth and µ-strongly convex functions, than the one
guaranteed by Theorem 4.
5 References
The main references used to prepare these notes are the original article where
Polyak introduced the heavy ball algorithm,
• Some methods of speeding up the convergence of iteration methods, by B. T. Polyak, USSR Computational Mathematics and Mathematical Physics, volume 4(5), pages 1-17 (1964),
two classical books on optimization (the second one a classic in the making),
• Introductory lectures on convex optimization: a basic course, by Y.
Nesterov, Springer Science & Business Media, volume 87 (2003),
• Optimization for data analysis, by S. J. Wright and B. Recht, Cam-
bridge University Press (2022),
and two blog posts by S. Bubeck on Nesterov's method for smooth convex
functions,
• http://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent/,
• http://blogs.princeton.edu/imabandit/2018/11/21/a-short-proof-for-nesterovs-momentum/.
For another presentation of the advanced aspects of gradient descent, the
reader can also refer to
• Lecture notes on advanced gradient descent, by C. Royer, https://www.lamsade.dauphine.fr/%7Ecroyer/ensdocs/GD/LectureNotesOML-GD.pdf (2021).