

Gradient descent with momentum

Irène Waldspurger∗
December 5, 2022

Gradient descent is by far the most well-known optimization algorithm.


Because of its simplicity and flexibility, it is a method of choice for many
problems. However, it is oftentimes inconveniently slow. In this lecture, we
will see that it is possible to speed up gradient descent by incorporating into
it a term called momentum. We will present two forms of momentum, leading
to the following two algorithms:

• heavy ball, which is the simplest form of gradient descent with momen-
tum, and already provides significant speed-ups,

• Nesterov’s method, which is slightly more complex, but performs much
  better than gradient descent on a larger range of problems than heavy
  ball.

1 Reminder on gradient descent


1.1 Definition
Here, we go back to the most basic form of optimization problems, discussed
in the first lectures: unconstrained minimization of a convex and differentiable
function over R^n. Throughout the lecture, f : R^n → R is a convex and
differentiable function, and we want to

find x_* such that f(x_*) = min_{x ∈ R^n} f(x). (1)


∗ [email protected]

We assume that a minimizer exists, and denote it x_*.¹
Gradient descent is an iterative algorithm, which moves from one iterate
to the next following the direction given by the (opposite of the) gradient.
Its definition is recalled in Algorithm 1.

Input: Starting point x_0, number of iterations T, sequence of
stepsizes (α_t)_{0≤t≤T−1}.
for t = 0, . . . , T − 1 do
    Define x_{t+1} = x_t − α_t ∇f(x_t).
end
return x_T
Algorithm 1: Gradient descent
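
For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the
objective, starting point and stepsizes are placeholders chosen for
illustration):

```python
import numpy as np

def gradient_descent(grad_f, x0, stepsizes):
    """Plain gradient descent (Algorithm 1): x_{t+1} = x_t - alpha_t * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for alpha in stepsizes:
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = 0.5 * ||x||^2, whose gradient is x and whose minimizer is 0.
x_hat = gradient_descent(lambda x: x, x0=[1.0, -2.0], stepsizes=[0.5] * 20)
print(x_hat)  # close to the minimizer (0, 0)
```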

The rationale behind this definition is that the gradient of f at a point
x ∈ R^n provides a linear approximation of f in a neighborhood of x:
informally,

∀y close to x, f(y) ≈ f(x) + ⟨∇f(x), y − x⟩.

Consequently, −∇f (x) is the direction along which f decays the fastest
around x. It is therefore a reasonable direction to follow if we want to mini-
mize f .

1.2 Convergence rate


The objective of this lecture is to propose improvements to gradient descent
which are faster than the “basic” version. To make our discussion rigorous,
we need to formally define the speed of an optimization algorithm. Several
definitions are possible. Here, assuming the algorithm returns a sequence of
iterates (xt )t∈N , we will focus on the rate at which the sequence

f(x_t) − f(x_*)

goes to zero. This rate depends on the properties of the objective function f ,
so we must specify what we assume about f . In this lecture, we will assume
either

1. f is convex and smooth;


¹ At least, we denote one of them x_*: the minimizer may not be unique.

2
2. f is strongly convex and smooth.

(See below for the definitions of strongly convex and smooth.)


These two possible sets of assumptions do not encompass all functions
which appear in practical optimization problems, but are nevertheless sat-
isfied in a number of settings. Therefore, they offer a good compromise
between practical relevance and simplicity of theoretical analysis.
Definition 1: smoothness
For any L > 0, we say that a differentiable function f is L-smooth if
∇f is L-Lipschitz, that is

∀x, y ∈ R^n, ||∇f(x) − ∇f(y)|| ≤ L||x − y||.

Definition 2: strong convexity


For any µ > 0, we say that a differentiable function f is µ-strongly
convex if

∀x, y ∈ R^n, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)||y − x||².

Remark
A strongly convex function is necessarily convex.
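
To make these definitions concrete, consider a quadratic f(x) = (1/2)⟨Ax, x⟩
with A symmetric positive definite: then ∇f(x) = Ax, so f is L-smooth with
L = λ_max(A) and µ-strongly convex with µ = λ_min(A). A small numeric
illustration (a sketch; the matrix A is arbitrary):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[1.0, 0.0],
              [0.0, 0.1]])
eigenvalues = np.linalg.eigvalsh(A)
mu, L = eigenvalues.min(), eigenvalues.max()
print(f"f is {L}-smooth and {mu}-strongly convex")  # L = 1.0, mu = 0.1
```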

When f is assumed to be smooth and convex, the convergence rate of
gradient descent is, in the worst case, O(1/t).
Theorem 1: gradient descent - smooth convex case
We assume that f is convex and L-smooth, for some L > 0. We
consider gradient descent with constant stepsize

∀t ∈ N, α_t = 1/L.

Then, for any t ∈ N,

f(x_t) − f(x_*) ≤ 2L||x_0 − x_*||² / (t + 4).

If, in addition to being smooth, f is assumed to be strongly convex, then
the convergence rate can be shown to be geometric. The geometric decay
rate is 1 − κ, where κ is the conditioning of the problem:

κ = µ/L,

where µ and L are, respectively, the strong convexity and smoothness
constants.
Theorem 2: gradient descent - smooth strongly convex case
Let 0 < µ < L be fixed. Let f be L-smooth and µ-strongly convex.
We consider gradient descent with constant stepsize: α_t = 1/L for all t.
Then, for any t ∈ N,

f(x_t) − f(x_*) ≤ (L/2) (1 − µ/L)^t ||x_0 − x_*||².

Theorems 1 and 2 are optimal, in the sense that there exist functions
f for which the inequality is an equality (up to minor modifications in the
constants).

2 Motivation of momentum
In this section, we motivate the introduction of momentum: we consider a
simple function f for which gradient descent converges slowly, explain why
convergence is slow, and why momentum can speed it up.
Let f be a simple quadratic function over R²:

∀(x_1, x_2) ∈ R², f(x_1, x_2) = (1/2)(λ_1 x_1² + λ_2 x_2²),

for parameters 0 < λ_1 < λ_2. The unique minimizer of f is

x_* = (0, 0).

The gradient of f is

∀(x_1, x_2) ∈ R², ∇f(x_1, x_2) = (λ_1 x_1, λ_2 x_2).
If we run gradient descent with a constant stepsize α > 0, the relation
between iterates x_t = (x_{t,1}, x_{t,2}) and x_{t+1} = (x_{t+1,1}, x_{t+1,2}) is

(x_{t+1,1}, x_{t+1,2}) = x_t − α∇f(x_t)
                       = (x_{t,1}, x_{t,2}) − α(λ_1 x_{t,1}, λ_2 x_{t,2})
                       = ((1 − αλ_1) x_{t,1}, (1 − αλ_2) x_{t,2}).

Since we want the iterates to go as fast as possible to zero, we would like
to choose α such that

|1 − αλ_1| ≪ 1 and |1 − αλ_2| ≪ 1.

If λ_1 and λ_2 are of the same order, this is fine: it suffices to pick α of the
order of 1/λ_1 ∼ 1/λ_2.
But if λ_1 is much smaller than λ_2 (that is, the problem is ill-conditioned),
there is no good choice of α. If we set α ≈ 1/λ_1, then

1 − αλ_2 = 1 − λ_2/λ_1 < −1

and the second coordinate of the iterates, x_{t,2}, diverges when t → ∞. If, on
the other hand, we set α ≈ 1/λ_2, then the second coordinate goes to 0, and
fast, but the first one converges very slowly:

1 − αλ_1 = 1 − λ_1/λ_2 ≈ 1.

In this situation, gradient descent is slow. Figure 1a displays the first fifteen
iterates in the case where λ_1 = 0.1 and λ_2 = 1, for α = 4/3 (that is, of the
order of 1/λ_2). As expected, the second coordinate goes fast to zero, but the
first one decays only slowly.
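
This behavior is easy to reproduce in a few lines of NumPy (a sketch with
the same λ_1 = 0.1, λ_2 = 1 and α = 4/3 as in Figure 1a; the starting point
is arbitrary):

```python
import numpy as np

lam = np.array([0.1, 1.0])   # eigenvalues lambda_1, lambda_2
alpha = 4 / 3                # stepsize of the order of 1/lambda_2
x = np.array([1.0, 2.0])     # arbitrary starting point

for t in range(15):
    x = x - alpha * lam * x  # gradient step: x <- (1 - alpha*lam) * x
    print(t + 1, x)
# The second coordinate oscillates and shrinks fast (|1 - alpha*lam_2| = 1/3),
# while the first one decays slowly (1 - alpha*lam_1 ~ 0.87).
```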
A possible remedy to this slow convergence is to use the information
given by the past gradients when we define x_{t+1} from x_t: instead of moving
in the direction given by −∇f(x_t), we move in a direction m_{t+1} which is a
(weighted) average between −∇f(x_t) and the previous gradients −∇f(x_0),
. . . , −∇f(x_{t−1}). Concretely, this yields the following iteration formula:

m_{t+1} = γ_t m_t + (1 − γ_t) ∇f(x_t),
x_{t+1} = x_t − α_t m_{t+1}.

[Figure 1: First 15 iterates of gradient descent, for λ_1 = 0.1, λ_2 = 1.
(a) Standard gradient descent. (b) Gradient descent with momentum.]

Here, γ_t and α_t are respectively the momentum and stepsize parameters.
The quantity m_t, which is a weighted average of all gradients up to step t,
is called momentum.
Remark
An equivalent iteration formula is

x_{t+1} = x_t − α̃_t ∇f(x_t) + β̃_t (x_t − x_{t−1}), (2)

with α̃_t = α_t (1 − γ_t) and β̃_t = α_t γ_t / α_{t−1}.

Proof of the remark. From the second equation in the iteration formula:

∀t ∈ N, m_{t+1} = (x_t − x_{t+1}) / α_t,
⇒ ∀t ∈ N − {0}, m_t = (x_{t−1} − x_t) / α_{t−1}.

We plug these equalities into the first iteration formula:

∀t ∈ N − {0}, (x_t − x_{t+1}) / α_t = γ_t (x_{t−1} − x_t) / α_{t−1} + (1 − γ_t) ∇f(x_t),
⇒ ∀t ∈ N − {0}, x_{t+1} = x_t − α_t (1 − γ_t) ∇f(x_t) + (α_t γ_t / α_{t−1}) (x_t − x_{t−1}).
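
The equivalence is easy to check numerically. In the sketch below (constant
parameters α_t = α, γ_t = γ, chosen arbitrarily; m_0 = 0 and x_{−1} = x_0, so
that both recursions start identically), the two forms produce the same
iterates:

```python
import numpy as np

grad = lambda x: np.array([0.1, 1.0]) * x  # gradient of the quadratic example
alpha, gamma = 0.5, 0.8
x0 = np.array([1.0, 1.0])

# Momentum form: m_{t+1} = gamma*m_t + (1-gamma)*grad(x_t), x_{t+1} = x_t - alpha*m_{t+1}.
x, m, xs = x0.copy(), np.zeros(2), [x0.copy()]
for _ in range(10):
    m = gamma * m + (1 - gamma) * grad(x)
    x = x - alpha * m
    xs.append(x.copy())

# Form (2): with constant parameters, beta_t = alpha*gamma/alpha = gamma.
y, y_prev, ys = x0.copy(), x0.copy(), [x0.copy()]
for _ in range(10):
    y, y_prev = y - alpha * (1 - gamma) * grad(y) + gamma * (y - y_prev), y
    ys.append(y.copy())

print(np.allclose(xs, ys))  # True
```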

Using momentum instead of the plain gradient in the iteration formula
allows the use of a larger stepsize. Indeed, for large stepsizes, α_t ∇f(x_t)
diverges when t grows, which causes the divergence of plain gradient descent.
But it is possible that α_t m_t stays bounded, in which case gradient descent
with momentum does not diverge: α_t m_t is an average of potentially large
gradients pointing in different directions, which may therefore compensate
each other. This can be seen in Figure 1b: compared to Figure 1a, the
stepsize is larger; consequently, the first coordinate converges faster towards
zero, but the second coordinate does not diverge.

3 Heavy ball
The simplest version of gradient descent with momentum is obtained when
the momentum and stepsize parameters are constant. It is due to Polyak,
and often called heavy ball.²

Input: Starting point x_0, number of iterations T, stepsize α,
momentum parameter γ.
Set m_0 = ∇f(x_0);
for t = 0, . . . , T − 1 do
    define
        m_{t+1} = γ m_t + (1 − γ) ∇f(x_t);
        x_{t+1} = x_t − α m_{t+1}.
end
return x_T
Algorithm 2: Heavy ball
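
A minimal NumPy sketch of Algorithm 2, run on a quadratic placeholder
objective with the parameters of Theorem 3 below:

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, gamma, T):
    """Heavy ball (Algorithm 2): constant stepsize alpha and momentum gamma."""
    x = np.asarray(x0, dtype=float)
    m = grad_f(x)  # m_0 = grad f(x_0)
    for _ in range(T):
        m = gamma * m + (1 - gamma) * grad_f(x)
        x = x - alpha * m
    return x

# Quadratic example with mu = 0.1, L = 1; alpha and gamma as in Theorem 3.
lam = np.array([0.1, 1.0])
mu, L = lam.min(), lam.max()
alpha = 1 / np.sqrt(mu * L)
gamma = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
print(heavy_ball(lambda x: lam * x, [1.0, 2.0], alpha, gamma, T=60))  # near (0, 0)
```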

For proper choices of parameters, heavy ball exhibits a faster convergence
rate than plain gradient descent on many natural problems. We will prove
this fact for quadratic strongly convex functions.
² The name comes from the fact that the momentum term can be seen as an inertia
term, reminiscent of the movement of a heavy ball falling down a mountain towards a
valley.

Theorem 3: heavy ball - quadratic case
Let 0 < µ < L be fixed. Let f be a quadratic function, which is
L-smooth and µ-strongly convex. We set

α = 1/√(µL),   γ = ((√L − √µ)/(√L + √µ))².

There exists a constant C_{µ,L} > 0 such that, for any t ∈ N,

f(x_t) − f(x_*) ≤ C_{µ,L} t² ((√L − √µ)/(√L + √µ))^{2t} ||x_0 − x_*||².

Before proving the theorem, let us compare the convergence rate with
gradient descent. From Theorem 2, gradient descent converges geometrically,
with decay rate

1 − µ/L.

Theorem 3, on the other hand, guarantees for heavy ball a convergence with
decay rate

((√L − √µ)/(√L + √µ))² ≈ 1 − 4√(µ/L) when µ ≪ L.

For ill-conditioned problems, √(µ/L) is much larger than µ/L, resulting in a
significant speed-up. As an example, if µ/L = 0.01, dividing f(x_t) − f(x_*)
by a factor 10 necessitates around

ln(10) / (−ln(1 − µ/L)) ≈ 230

iterations with gradient descent, and only

ln(10) / (−ln(((√L − √µ)/(√L + √µ))²)) ≈ 6

with heavy ball.
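
These iteration counts can be checked numerically; both rates only depend
on the ratio µ/L (a quick sketch with µ/L = 0.01):

```python
import numpy as np

ratio = 0.01                          # mu / L
rate_gd = 1 - ratio                   # gradient descent decay rate (Theorem 2)
r = np.sqrt(ratio)
rate_hb = ((1 - r) / (1 + r)) ** 2    # heavy ball decay rate (Theorem 3)

print(np.log(10) / -np.log(rate_gd))  # ~229 iterations for gradient descent
print(np.log(10) / -np.log(rate_hb))  # ~5.7 iterations for heavy ball
```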

Proof of Theorem 3. Up to a change of coordinates, we can assume that f is
of the form

f(x_1, . . . , x_n) = (1/2)(λ_1 x_1² + · · · + λ_n x_n²),

where

L ≥ λ_1 ≥ λ_2 ≥ · · · ≥ λ_n ≥ µ > 0

are the eigenvalues of the matrix representing f.
Denoting x_t = (x_{t,1}, x_{t,2}, . . . , x_{t,n}), we have, for each t,

∇f(x_t) = (λ_1 x_{t,1}, . . . , λ_n x_{t,n}),

hence the evolution equation of heavy ball is, for each t ∈ N,

∀k ≤ n,  m_{t+1,k} = γ m_{t,k} + (1 − γ) λ_k x_{t,k};
         x_{t+1,k} = x_{t,k} − α m_{t+1,k} = (1 − α(1 − γ)λ_k) x_{t,k} − αγ m_{t,k}.

This can be written in matrix form: for each t ∈ N, k ∈ {1, . . . , n},

(m_{t+1,k}, x_{t+1,k})ᵀ = M_k (m_{t,k}, x_{t,k})ᵀ,  with  M_k = [  γ        (1 − γ)λ_k      ]
                                                                [ −αγ    1 − α(1 − γ)λ_k   ]

⇒ (m_{t,k}, x_{t,k})ᵀ = M_k^t (m_{0,k}, x_{0,k})ᵀ.

For any k, the matrix M_k can be triangularized in a (complex) orthonormal
basis: for some unitary matrix G_k, we can write it under the form

M_k = G_k [ σ_k^{(1)}    g_k     ] G_k^{−1}.
          [    0      σ_k^{(2)}  ]

For all t ∈ N,

(m_{t,k}, x_{t,k})ᵀ = G_k [ (σ_k^{(1)})^t    g_{t,k}     ] G_k^{−1} (m_{0,k}, x_{0,k})ᵀ,
                          [      0        (σ_k^{(2)})^t  ]

with g_{t,k} = ((σ_k^{(1)})^{t−1} + (σ_k^{(1)})^{t−2} σ_k^{(2)} + · · · + (σ_k^{(2)})^{t−1}) g_k.

As G_k is unitary, it does not change the norm:

||(m_{t,k}, x_{t,k})ᵀ|| ≤ ||| [ (σ_k^{(1)})^t    g_{t,k}     ] ||| ||(m_{0,k}, x_{0,k})ᵀ||.
                              [      0        (σ_k^{(2)})^t  ]

(The triple bar denotes the spectral norm.)
For some constants C, C′ > 0, the spectral norm can be upper bounded
by

||| [ (σ_k^{(1)})^t    g_{t,k}     ] |||  ≤ C max(|σ_k^{(1)}|^t, |σ_k^{(2)}|^t, |g_{t,k}|)
    [      0        (σ_k^{(2)})^t  ]      ≤ C′ t (max(|σ_k^{(1)}|, |σ_k^{(2)}|))^t.

We must compute max(|σ_k^{(1)}|, |σ_k^{(2)}|), where we recall that σ_k^{(1)} and σ_k^{(2)} are
the eigenvalues of M_k. These eigenvalues are the roots of the characteristic
polynomial of M_k. A (slightly tedious) computation shows that this
polynomial has a negative discriminant. The eigenvalues are therefore complex
conjugates of each other:

|σ_k^{(1)}|² = |σ_k^{(2)}|² = σ_k^{(1)} σ_k^{(2)} = det(M_k) = γ.
In particular, max(|σ_k^{(1)}|, |σ_k^{(2)}|) = √γ, and we get, for all k,

||(m_{t,k}, x_{t,k})ᵀ|| ≤ C′ t γ^{t/2} ||(m_{0,k}, x_{0,k})ᵀ||

⇒ |x_{t,k}| ≤ C′ t γ^{t/2} √(x_{0,k}² + m_{0,k}²) ≤ C′ t γ^{t/2} √(1 + L²) |x_{0,k}|,

where the last inequality uses m_{0,k} = λ_k x_{0,k} with λ_k ≤ L. Summing over k,

f(x_t) − f(x_*) = (1/2) Σ_{k=1}^{n} λ_k x_{t,k}² ≤ L(1 + L²) C′² t² γ^t ||x_0||².

If we set C_{µ,L} = L(1 + L²) C′² and recall that

γ = ((√L − √µ)/(√L + √µ))²,

we get the announced result (note that x_* = 0 in these coordinates, so
||x_0|| = ||x_0 − x_*||):

f(x_t) − f(x_*) ≤ C_{µ,L} t² ((√L − √µ)/(√L + √µ))^{2t} ||x_0 − x_*||².

The theorem we just proved does not extend from strongly convex quadratic
functions to general strongly convex functions. Indeed, there are unfavorable
strongly convex functions on which gradient descent with momentum is no
faster than its standard version (or even diverges, whereas plain gradient
descent converges). Fortunately, many “interesting” functions are
either quadratic or, more frequently, approximately quadratic in the neigh-
borhood of a minimizer. For these functions, heavy ball is usually better
than plain gradient descent.

4 Nesterov’s method
In the previous section, we saw that heavy ball has a faster convergence
rate than gradient descent for quadratic problems, but not for all strongly
convex problems. In addition, it does not apply when the objective function
is not strongly convex. In this final section, we present an algorithm which
solves both of these issues. As it was discovered by Yurii Nesterov, it is often
called “Nesterov’s method”.
The iteration formula for this algorithm is

x_{t+1} = x_t − α_t ∇f(x_t + β_t (x_t − x_{t−1})) + β_t (x_t − x_{t−1}), (3)

for a proper choice of parameters α_t, β_t. We see that it is very similar to the
general form of gradient descent with momentum, as described in Equation
(2), with the (important) difference that the gradient is not evaluated at the
point x_t, but at x_t + β_t (x_t − x_{t−1}).
If f is assumed to be L-smooth and µ-strongly convex, a simple choice is
possible for the coefficients α_t, β_t:

∀t, α_t = 1/L and β_t = (√L − √µ)/(√L + √µ).

This yields the following algorithm.

Input: Starting point x_0, number of iterations T, smoothness
parameter L, strong convexity parameter µ.
Set x_{−1} = x_0, α = 1/L, β = (√L − √µ)/(√L + √µ);
for t = 0, . . . , T − 1 do
    define
        x_{t+1} = x_t − α ∇f(x_t + β(x_t − x_{t−1})) + β(x_t − x_{t−1}).
end
return x_T
Algorithm 3: Nesterov’s algorithm with constant parameters
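
A minimal NumPy sketch of Algorithm 3 (the quadratic objective is again a
placeholder):

```python
import numpy as np

def nesterov_constant(grad_f, x0, mu, L, T):
    """Nesterov's method with constant parameters (Algorithm 3)."""
    alpha = 1 / L
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(T):
        y = x + beta * (x - x_prev)           # extrapolated point
        x, x_prev = y - alpha * grad_f(y), x  # gradient step taken at y
    return x

lam = np.array([0.1, 1.0])  # f(x) = 0.5*sum(lam*x**2): mu = 0.1, L = 1
print(nesterov_constant(lambda x: lam * x, [1.0, 2.0], mu=0.1, L=1.0, T=60))  # near (0, 0)
```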

With this choice, Nesterov’s method converges to the minimizer linearly,
with decay rate

1 − √(µ/L),

which is similar to the convergence rate of heavy ball, but true for all strongly
convex functions, not only quadratic ones!
Theorem 4: Nesterov’s method: smooth strongly convex case
Let 0 < µ < L be fixed. Let f be an L-smooth and µ-strongly convex
function.
Let (x_t)_{t∈N} be the sequence computed by Algorithm 3. For all t ∈ N,

f(x_t) − f(x_*) ≤ 2 (1 − √(µ/L))^t (f(x_0) − f(x_*)).

When f is not strongly convex, it is not possible to set parameters α_t and
β_t to constant values. A more complicated (and admittedly mysterious, at
first sight) definition must be used, described in the following algorithm.

Input: Starting point x_0, number of iterations T, smoothness
parameter L.
Set x_{−1} = x_0, α = 1/L, λ_{−1} = 0;
for t = 0, . . . , T − 1 do
    define
        λ_t = (1 + √(1 + 4λ_{t−1}²)) / 2;
        β_t = (λ_{t−1} − 1) / λ_t;
        x_{t+1} = x_t − α ∇f(x_t + β_t (x_t − x_{t−1})) + β_t (x_t − x_{t−1}).
end
return x_T
Algorithm 4: Nesterov’s algorithm with changing parameters
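
A sketch of Algorithm 4 in the same style, on a smooth convex (but not
strongly convex) placeholder objective:

```python
import numpy as np

def nesterov_varying(grad_f, x0, L, T):
    """Nesterov's method with changing parameters (Algorithm 4)."""
    alpha, lam_prev = 1 / L, 0.0
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(T):
        lam = (1 + np.sqrt(1 + 4 * lam_prev ** 2)) / 2
        beta = (lam_prev - 1) / lam
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * grad_f(y), x
        lam_prev = lam
    return x

# f(x) = 0.5 * x_1^2 is 1-smooth and convex, but not strongly convex on R^2.
grad = lambda x: np.array([x[0], 0.0])
print(nesterov_varying(grad, [1.0, 1.0], L=1.0, T=100))  # first coordinate near 0
```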

The convergence rate of this algorithm is given in the following theorem.

Theorem 5: Nesterov’s method: smooth convex case
Let L > 0 be fixed. Let f be an L-smooth convex function.
Let (x_t)_{t∈N} be the sequence computed by Algorithm 4. For all t ∈ N,

f(x_t) − f(x_*) ≤ 2L ||x_0 − x_*||² / (t + 1)².

Comparing the rates in Theorems 1 and 5 shows the superiority of
Nesterov’s method over gradient descent for smooth convex functions f:

gradient descent rate: O(1/t);
Nesterov’s method rate: O(1/t²).

Actually, it is possible to show that Nesterov’s method is optimal for
smooth convex functions among all first-order algorithms. In other words, for
any first-order algorithm (that is, an algorithm which only exploits gradient
information about f), there exists an “adversarial” objective function f,
which is L-smooth and convex, such that, after t steps,

f(x_t) − f(x_*) ≥ 3L ||x_0 − x_*||² / (32(t + 1)²).

This means that, up to the constant, no first-order algorithm can achieve a
better convergence rate than the one in Theorem 5.
Nesterov’s method is also optimal for smooth strongly convex functions
among all first-order algorithms: no first-order algorithm can achieve a better
convergence rate, for L-smooth and µ-strongly convex functions, than the one
guaranteed by Theorem 4.

5 References

The main references used to prepare these notes are the original article where
Polyak introduced the heavy ball algorithm,

• Some methods of speeding up the convergence of iteration methods,
by B. T. Polyak, USSR Computational Mathematics and Mathematical
Physics, volume 4(5), pages 1-17 (1964),

two classical (classical-to-be, for the second one) books on optimization,

• Introductory lectures on convex optimization: a basic course, by Y.
Nesterov, Springer Science & Business Media, volume 87 (2003),

• Optimization for data analysis, by S. J. Wright and B. Recht, Cambridge
University Press (2022),

and two blog posts by S. Bubeck on Nesterov’s method for smooth convex
functions,

• http://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent/,

• http://blogs.princeton.edu/imabandit/2018/11/21/a-short-proof-for-nesterovs-momentum/.

For another presentation of the advanced aspects of gradient descent, the
reader can also refer to

• Lecture notes on advanced gradient descent, by C. Royer,
https://www.lamsade.dauphine.fr/%7Ecroyer/ensdocs/GD/LectureNotesOML-GD.pdf (2021).
