
L. Vandenberghe ECE236B (Winter 2022)

10. Unconstrained minimization

• terminology and assumptions


• gradient descent method
• steepest descent method
• Newton’s method
• self-concordant functions
• implementation

10.1
Unconstrained minimization

minimize 𝑓 (𝑥)

• 𝑓 convex, twice continuously differentiable (hence dom 𝑓 open)


• we assume optimal value 𝑝★ = inf 𝑥 𝑓 (𝑥) is attained (and finite)

Unconstrained minimization methods

• produce sequence of points 𝑥 (𝑘) ∈ dom 𝑓 , 𝑘 = 0, 1, . . . , with

𝑓 (𝑥 (𝑘) ) → 𝑝★

• can be interpreted as iterative methods for solving optimality condition

∇ 𝑓 (𝑥★) = 0

Unconstrained minimization 10.2


Initial point and sublevel set

algorithms in this chapter require a starting point 𝑥 (0) such that

• 𝑥 (0) ∈ dom 𝑓
• sublevel set 𝑆 = {𝑥 | 𝑓 (𝑥) ≤ 𝑓 (𝑥 (0) )} is closed

2nd condition is hard to verify, except when all sublevel sets are closed:

• equivalent to condition that epi 𝑓 is closed


• true if dom 𝑓 = R𝑛
• true if 𝑓 (𝑥) → ∞ as 𝑥 → bd dom 𝑓

examples of differentiable functions with closed sublevel sets:

𝑓 (𝑥) = log( Σ𝑖=1…𝑚 exp(𝑎𝑇𝑖 𝑥 + 𝑏𝑖 )),       𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

Unconstrained minimization 10.3


Strong convexity and implications

𝑓 is strongly convex on 𝑆 if there exists an 𝑚 > 0 such that

∇2 𝑓 (𝑥) ⪰ 𝑚𝐼 for all 𝑥 ∈ 𝑆

Implications

• for 𝑥, 𝑦 ∈ 𝑆 ,

𝑓 (𝑦) ≥ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 (𝑦 − 𝑥) + (𝑚/2) k𝑦 − 𝑥k22

• 𝑆 is bounded
• 𝑝★ > −∞ and, for 𝑥 ∈ 𝑆 ,

𝑓 (𝑥) − 𝑝★ ≤ (1/(2𝑚)) k∇ 𝑓 (𝑥) k22

useful as stopping criterion (if you know 𝑚 )

Unconstrained minimization 10.4
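The suboptimality bound above can be checked numerically on a strongly convex quadratic; the quadratic and constants below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Sketch: verify f(x) - p* <= ||grad f(x)||_2^2 / (2m) for the quadratic
# f(x) = (1/2) x^T Q x, which has p* = 0 and strong convexity constant
# m = lambda_min(Q) on all of R^n.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q = B @ B.T + np.eye(5)          # positive definite Hessian
m = np.linalg.eigvalsh(Q).min()  # strong convexity constant

x = rng.standard_normal(5)
f_val = 0.5 * x @ Q @ x          # f(x); here p* = 0
grad = Q @ x                     # gradient of f at x
bound = grad @ grad / (2 * m)

assert f_val <= bound + 1e-12    # the stopping-criterion bound holds
```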


Descent methods

𝑥 (𝑘+1) = 𝑥 (𝑘) + 𝑡 (𝑘) Δ𝑥 (𝑘) with 𝑓 (𝑥 (𝑘+1) ) < 𝑓 (𝑥 (𝑘) )

• other notations:
𝑥 + = 𝑥 + 𝑡Δ𝑥, 𝑥 := 𝑥 + 𝑡Δ𝑥

• Δ𝑥 is the step, or search direction; 𝑡 is the step size, or step length


• for convex 𝑓 : if 𝑓 (𝑥 +) < 𝑓 (𝑥) then Δ𝑥 must be a descent direction:

∇ 𝑓 (𝑥)𝑇 Δ𝑥 < 0

General descent method


given: a starting point 𝑥 ∈ dom 𝑓
repeat
1. determine a descent direction Δ𝑥
2. line search: choose a step size 𝑡 > 0
3. update: 𝑥 := 𝑥 + 𝑡Δ𝑥
until stopping criterion is satisfied

Unconstrained minimization 10.5


Line search types

Exact line search: 𝑡 = argmin𝑡>0 𝑓 (𝑥 + 𝑡Δ𝑥)

Backtracking line search (with parameters 𝛼 ∈ (0, 1/2) , 𝛽 ∈ (0, 1) )


• starting at 𝑡 = 1, repeat 𝑡 := 𝛽𝑡 until

𝑓 (𝑥 + 𝑡Δ𝑥) < 𝑓 (𝑥) + 𝛼𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥

• graphical interpretation: backtrack until 𝑡 ≤ 𝑡0

[figure: graph of 𝑓 (𝑥 + 𝑡Δ𝑥) versus 𝑡, with the lines 𝑓 (𝑥) + 𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥 and 𝑓 (𝑥) + 𝛼𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥 ; the backtracking condition holds for 𝑡 ∈ (0, 𝑡0] ]
Unconstrained minimization 10.6
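The backtracking rule above fits in a few lines; the test function and parameter values below are illustrative assumptions:

```python
import numpy as np

# Sketch of backtracking line search with parameters alpha, beta;
# f and grad are callables supplied by the caller, dx a descent direction.
def backtracking(f, grad, x, dx, alpha=0.3, beta=0.8):
    t = 1.0
    g = grad(x)
    # shrink t until f(x + t dx) < f(x) + alpha t grad f(x)^T dx
    while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
        t *= beta
    return t

# usage on f(x) = x1^2 + 10 x2^2 with the negative-gradient direction
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x0 = np.array([1.0, 1.0])
t = backtracking(f, grad, x0, -grad(x0))
assert f(x0 + t * (-grad(x0))) < f(x0)   # the accepted step decreases f
```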
Gradient descent method

Gradient descent: general descent method with Δ𝑥 = −∇ 𝑓 (𝑥)


given: a starting point 𝑥 ∈ dom 𝑓
repeat
1. Δ𝑥 := −∇ 𝑓 (𝑥)
2. line search: choose step size 𝑡 via exact or backtracking line search
3. update: 𝑥 := 𝑥 + 𝑡Δ𝑥
until stopping criterion is satisfied

• stopping criterion usually of the form k∇ 𝑓 (𝑥) k2 ≤ 𝜖


• convergence result: for strongly convex 𝑓 ,

𝑓 (𝑥 (𝑘) ) − 𝑝★ ≤ 𝑐 𝑘 ( 𝑓 (𝑥 (0) ) − 𝑝★)

𝑐 ∈ (0, 1) depends on 𝑚 , 𝑥 (0) , line search type


• very simple, but often very slow

Unconstrained minimization 10.7
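The algorithm above can be sketched directly; the test problem is an illustrative assumption:

```python
import numpy as np

# Gradient descent with backtracking line search and stopping criterion
# ||grad f(x)||_2 <= eps, following the algorithm above.
def gradient_descent(f, grad, x, eps=1e-6, alpha=0.3, beta=0.8, max_iter=10000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        dx = -g
        t = 1.0
        while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# usage: minimize f(x) = (1/2)(x1^2 + 10 x2^2), minimizer (0, 0)
f = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
grad = lambda x: np.array([x[0], 10 * x[1]])
x_star = gradient_descent(f, grad, np.array([10.0, 1.0]))
assert np.linalg.norm(x_star) < 1e-5
```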


Quadratic problem in R2

𝑓 (𝑥) = (1/2)(𝑥1² + 𝛾𝑥2²) (𝛾 > 0)

with exact line search, starting at 𝑥 (0) = (𝛾, 1) :

𝑥1(𝑘) = 𝛾 ( (𝛾 − 1)/(𝛾 + 1) )^𝑘 ,       𝑥2(𝑘) = ( −(𝛾 − 1)/(𝛾 + 1) )^𝑘

• very slow if 𝛾 ≫ 1 or 𝛾 ≪ 1
• example for 𝛾 = 10:

[figure: contour lines of 𝑓 with the iterates 𝑥 (0) , 𝑥 (1) , . . . zigzagging toward the minimizer at the origin]
Unconstrained minimization 10.8
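The closed-form iterates can be verified numerically; for a quadratic, exact line search along the negative gradient has the explicit step size 𝑡 = (𝑔𝑇 𝑔)/(𝑔𝑇 𝑄𝑔):

```python
import numpy as np

# Sketch: gradient descent with exact line search on
# f(x) = (1/2)(x1^2 + gamma*x2^2), starting at x(0) = (gamma, 1),
# checked against the closed-form iterates above.
gamma = 10.0
Q = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])

for k in range(5):
    r = (gamma - 1) / (gamma + 1)
    assert np.allclose(x, [gamma * r**k, (-r)**k])  # matches closed form
    g = Q @ x                    # gradient at the current iterate
    t = (g @ g) / (g @ Q @ g)    # exact line search step for a quadratic
    x = x - t * g
```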
Nonquadratic example

𝑓 (𝑥1, 𝑥2) = exp(𝑥1 + 3𝑥2 − 0.1) + exp(𝑥1 − 3𝑥2 − 0.1) + exp(−𝑥1 − 0.1)

[figure: two panels showing the iterates 𝑥 (0) , 𝑥 (1) , 𝑥 (2) , . . . on the contour lines of 𝑓 ; left: backtracking line search, right: exact line search]

Unconstrained minimization 10.9


Example in R100

𝑓 (𝑥) = 𝑐𝑇 𝑥 − Σ𝑖=1…500 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ (from about 10^4 down to 10^−4) versus 𝑘 (0 to 200), for exact and backtracking line search]

‘linear’ convergence, i.e., a straight line on a semilog plot

Unconstrained minimization 10.10


Steepest descent method

Normalized steepest descent direction (at 𝑥 , for norm k · k ):

Δ𝑥nsd = argmin{∇ 𝑓 (𝑥)𝑇 𝑣 | k𝑣k = 1}

interpretation: for small 𝑣,

𝑓 (𝑥 + 𝑣) ≈ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 𝑣

direction Δ𝑥 nsd is unit-norm step with most negative directional derivative

(Unnormalized) steepest descent direction

Δ𝑥sd = k∇ 𝑓 (𝑥) k∗Δ𝑥nsd

satisfies ∇ 𝑓 (𝑥)𝑇 Δ𝑥 sd = −k∇ 𝑓 (𝑥) k∗2

Steepest descent method


• general descent method with Δ𝑥 = Δ𝑥sd
• convergence properties similar to gradient descent

Unconstrained minimization 10.11


Examples

• Euclidean norm: Δ𝑥sd = −∇ 𝑓 (𝑥)


• quadratic norm k𝑥k𝑃 = (𝑥𝑇 𝑃𝑥)^(1/2) (𝑃 ∈ S𝑛++):

Δ𝑥sd = −𝑃−1 ∇ 𝑓 (𝑥)

• ℓ1-norm: Δ𝑥sd = −(𝜕 𝑓 (𝑥)/𝜕𝑥𝑖 )𝑒𝑖 , where |𝜕 𝑓 (𝑥)/𝜕𝑥𝑖 | = k∇ 𝑓 (𝑥) k∞

unit balls, steepest descent directions for a quadratic norm and ℓ1-norm:

[figure: each unit ball shown with −∇ 𝑓 (𝑥) and the resulting Δ𝑥nsd]
Unconstrained minimization 10.12


Choice of norm for steepest descent

[figure: two panels of iterates 𝑥 (0) , 𝑥 (1) , 𝑥 (2) , . . . for steepest descent in two different quadratic norms]

• steepest descent with backtracking line search for two quadratic norms
• ellipses show {𝑥 | k𝑥 − 𝑥 (𝑘) k𝑃 = 1}
• equivalent interpretation of steepest descent with quadratic norm k · k𝑃 :
gradient descent after change of variables 𝑥¯ = 𝑃1/2𝑥

shows choice of 𝑃 has strong effect on speed of convergence

Unconstrained minimization 10.13
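The change-of-variables interpretation above can be verified numerically; the matrices 𝑃 and 𝑄 below are illustrative assumptions:

```python
import numpy as np

# Sketch: one steepest descent step in the P-norm equals one gradient
# step after the change of variables xbar = P^{1/2} x, mapped back.
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
P = B @ B.T + np.eye(3)                  # P in S^n_++
w, V = np.linalg.eigh(P)
Phalf = (V * np.sqrt(w)) @ V.T           # symmetric square root P^{1/2}

Q = np.diag([1.0, 4.0, 9.0])             # f(x) = (1/2) x^T Q x
grad = lambda x: Q @ x

x = rng.standard_normal(3)
t = 0.1
# steepest descent step for the quadratic norm ||.||_P
x_sd = x - t * np.linalg.solve(P, grad(x))
# gradient step in xbar coordinates; grad of f(P^{-1/2} xbar) is P^{-1/2} grad f(x)
xbar = Phalf @ x
gbar = np.linalg.solve(Phalf, grad(x))
x_gd = np.linalg.solve(Phalf, xbar - t * gbar)
assert np.allclose(x_sd, x_gd)           # the two steps coincide
```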


Newton step

Δ𝑥nt = −∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)

Interpretations

• 𝑥 + Δ𝑥nt minimizes second order approximation

𝑓ˆ(𝑥 + 𝑣) = 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 𝑣 + (1/2) 𝑣𝑇 ∇2 𝑓 (𝑥)𝑣

• 𝑥 + Δ𝑥nt solves linearized optimality condition

∇ 𝑓 (𝑥 + 𝑣) ≈ ∇ 𝑓ˆ(𝑥 + 𝑣) = ∇ 𝑓 (𝑥) + ∇2 𝑓 (𝑥)𝑣 = 0

[figure: left, 𝑓 and its quadratic model 𝑓ˆ through (𝑥, 𝑓 (𝑥)) and (𝑥 + Δ𝑥nt, 𝑓 (𝑥 + Δ𝑥nt)); right, 𝑓 ′ and its linearization through (𝑥, 𝑓 ′ (𝑥)) and (𝑥 + Δ𝑥nt, 𝑓 ′ (𝑥 + Δ𝑥nt))]

Unconstrained minimization 10.14


• Δ𝑥nt is steepest descent direction at 𝑥 in local Hessian norm

k𝑢k∇2 𝑓 (𝑥) = (𝑢𝑇 ∇2 𝑓 (𝑥)𝑢) 1/2

[figure: points 𝑥 + Δ𝑥nt and 𝑥 + Δ𝑥nsd shown inside the ellipse]

dashed lines are contour lines of 𝑓 ; ellipse is {𝑥 + 𝑣 | 𝑣𝑇 ∇2 𝑓 (𝑥)𝑣 = 1}

arrow shows −∇ 𝑓 (𝑥)

Unconstrained minimization 10.15


Newton decrement

𝜆(𝑥) = (∇ 𝑓 (𝑥)𝑇 ∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)) 1/2

a measure of the proximity of 𝑥 to 𝑥★

Properties

• gives an estimate of 𝑓 (𝑥) − 𝑝★, using the quadratic approximation 𝑓ˆ:

𝑓 (𝑥) − inf𝑦 𝑓ˆ(𝑦) = (1/2) 𝜆(𝑥)²

• equal to the norm of the Newton step in the quadratic Hessian norm

𝜆(𝑥) = (Δ𝑥nt𝑇 ∇2 𝑓 (𝑥)Δ𝑥nt)^(1/2)

• directional derivative in the Newton direction: ∇ 𝑓 (𝑥)𝑇 Δ𝑥nt = −𝜆(𝑥) 2


• affine invariant (unlike k∇ 𝑓 (𝑥) k2)

Unconstrained minimization 10.16
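The identities above can be checked numerically; the logistic-type convex test function below is an illustrative assumption:

```python
import numpy as np

# Sketch: verify the Newton-decrement identities for a smooth convex f.
def grad(x):
    return 1 / (1 + np.exp(-x)) + x          # grad of sum(log(1+e^x_i)) + ||x||^2/2

def hess(x):
    s = 1 / (1 + np.exp(-x))
    return np.diag(s * (1 - s)) + np.eye(len(x))

x = np.array([0.5, -1.0, 2.0])
g, H = grad(x), hess(x)
dx_nt = -np.linalg.solve(H, g)               # Newton step
lam = np.sqrt(g @ np.linalg.solve(H, g))     # Newton decrement

assert np.isclose(lam, np.sqrt(dx_nt @ H @ dx_nt))  # Hessian-norm identity
assert np.isclose(g @ dx_nt, -lam**2)               # directional derivative
```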


Newton’s method

given: a starting point 𝑥 ∈ dom 𝑓 , tolerance 𝜖 > 0


repeat
1. compute the Newton step and decrement

Δ𝑥nt := −∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥) ; 𝜆2 := ∇ 𝑓 (𝑥)𝑇 ∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)


2. stopping criterion: quit if 𝜆2/2 ≤ 𝜖
3. line search: choose step size 𝑡 by backtracking line search
4. update: 𝑥 := 𝑥 + 𝑡Δ𝑥 nt

Affine invariance

• Newton iterates for 𝑓˜(𝑦) = 𝑓 (𝑇 𝑦) with starting point 𝑦 (0) = 𝑇 −1𝑥 (0) are

𝑦 (𝑘) = 𝑇 −1𝑥 (𝑘)

• independent of linear changes of coordinates

Unconstrained minimization 10.17
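The algorithm above translates directly to code; the smooth convex test function and its derivatives below are illustrative assumptions:

```python
import numpy as np

# Newton's method with backtracking line search and stopping criterion
# lambda^2/2 <= eps, following the algorithm above.
def newton(f, grad, hess, x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)      # Newton step
        lam2 = -g @ dx                   # lambda(x)^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:
            break
        t = 1.0
        while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# usage: minimize f(x) = e^{x1+x2} + e^{-x1} + x2^2 (smooth, convex)
f = lambda x: np.exp(x[0] + x[1]) + np.exp(-x[0]) + x[1]**2
grad = lambda x: np.array([np.exp(x[0] + x[1]) - np.exp(-x[0]),
                           np.exp(x[0] + x[1]) + 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0] + x[1]) + np.exp(-x[0]), np.exp(x[0] + x[1])],
                           [np.exp(x[0] + x[1]), np.exp(x[0] + x[1]) + 2]])
x_star = newton(f, grad, hess, np.zeros(2))
assert np.linalg.norm(grad(x_star)) < 1e-4
```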


Classical convergence analysis

Assumptions

• 𝑓 strongly convex on 𝑆 with constant 𝑚


• ∇2 𝑓 is Lipschitz continuous on 𝑆 , with constant 𝐿 > 0:

k∇2 𝑓 (𝑥) − ∇2 𝑓 (𝑦) k2 ≤ 𝐿k𝑥 − 𝑦k2

( 𝐿 measures how well 𝑓 can be approximated by a quadratic function)

Outline: there exist constants 𝜂 ∈ (0, 𝑚²/𝐿) , 𝛾 > 0 such that

• if k∇ 𝑓 (𝑥) k2 ≥ 𝜂, then 𝑓 (𝑥 (𝑘+1) ) − 𝑓 (𝑥 (𝑘) ) ≤ −𝛾


• if k∇ 𝑓 (𝑥) k2 < 𝜂, then

(𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘+1) ) k2 ≤ ( (𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘) ) k2 )²

Unconstrained minimization 10.18


Classical convergence analysis

Damped Newton phase ( k∇ 𝑓 (𝑥) k2 ≥ 𝜂 )

• most iterations require backtracking steps


• function value decreases by at least 𝛾
• if 𝑝★ > −∞, this phase ends after at most ( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 iterations

Quadratically convergent phase ( k∇ 𝑓 (𝑥) k2 < 𝜂 )

• all iterations use step size 𝑡 = 1


• k∇ 𝑓 (𝑥) k2 converges to zero quadratically: if k∇ 𝑓 (𝑥 (𝑘) ) k2 < 𝜂, then

(𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑙) ) k2 ≤ ( (𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘) ) k2 )^(2^(𝑙−𝑘)) ≤ (1/2)^(2^(𝑙−𝑘)) ,       𝑙 ≥ 𝑘

Unconstrained minimization 10.19


Classical convergence analysis

Conclusion: number of iterations until 𝑓 (𝑥) − 𝑝★ ≤ 𝜖 is bounded above by

( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 + log2 log2 (𝜖0/𝜖)

• 𝛾 , 𝜖0 are constants that depend on 𝑚 , 𝐿 , 𝑥 (0)


• second term is small (of the order of 6) and almost constant for practical
purposes

• in practice, constants 𝑚 , 𝐿 (hence 𝛾 , 𝜖0) are usually unknown


• provides qualitative insight into convergence properties (i.e., explains the two phases of the algorithm)

Unconstrained minimization 10.20


Examples

Example in R2 (page 10.9)

[figure: left, Newton iterates 𝑥 (0) , 𝑥 (1) on the contour lines of 𝑓 ; right, semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 , falling from about 10^5 to 10^−15 within 5 iterations]

• backtracking parameters 𝛼 = 0.1, 𝛽 = 0.7


• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10.21


Examples

Example in R100 (page 10.10)

[figure: left, semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 for exact and backtracking line search; right, step size 𝑡 (𝑘) versus 𝑘 for both line searches]

• backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5


• backtracking line search almost as fast as exact l.s. (and much simpler)
• clearly shows two phases in algorithm

Unconstrained minimization 10.22


Examples

Example in R10000 (with sparse 𝑎𝑖 )

𝑓 (𝑥) = − Σ𝑖=1…10000 log(1 − 𝑥𝑖²) − Σ𝑖=1…100000 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 (0 to 20), decreasing from about 10^5 to below 10^−5]

• backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5


• performance similar to that for the small examples

Unconstrained minimization 10.23


Self-concordance

Shortcomings of classical convergence analysis

• depends on unknown constants (𝑚 , 𝐿 , . . . )


• bound is not affinely invariant, although Newton’s method is

Convergence analysis via self-concordance (Nesterov and Nemirovski)

• does not depend on any unknown constants


• gives affine-invariant bound
• applies to special class of convex functions (‘self-concordant’ functions)
• developed to analyze polynomial-time interior-point methods for convex
optimization

Unconstrained minimization 10.24


Self-concordant functions

Definition

• convex 𝑓 : R → R is self-concordant if

| 𝑓 ′′′ (𝑥)| ≤ 2 𝑓 ′′ (𝑥)^(3/2) for all 𝑥 ∈ dom 𝑓

• 𝑓 : R𝑛 → R is self-concordant if 𝑔(𝑡) = 𝑓 (𝑥 + 𝑡𝑣) is s.c. for all 𝑥 ∈ dom 𝑓 and 𝑣

Examples on R

• linear and quadratic functions


• negative logarithm 𝑓 (𝑥) = − log 𝑥
• negative entropy plus negative logarithm: 𝑓 (𝑥) = 𝑥 log 𝑥 − log 𝑥

Affine invariance: if 𝑓 : R → R is s.c., then 𝑓˜(𝑦) = 𝑓 (𝑎𝑦 + 𝑏) is s.c.:

𝑓˜′′′ (𝑦) = 𝑎³ 𝑓 ′′′ (𝑎𝑦 + 𝑏),       𝑓˜′′ (𝑦) = 𝑎² 𝑓 ′′ (𝑎𝑦 + 𝑏)

Unconstrained minimization 10.25
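For the negative logarithm, the defining inequality holds with equality, which is easy to confirm numerically:

```python
import numpy as np

# Sketch: for f(x) = -log x on x > 0, f''(x) = 1/x^2 and f'''(x) = -2/x^3,
# so |f'''(x)| = 2 f''(x)^{3/2} exactly (the self-concordance bound is tight).
for x in np.linspace(0.1, 10.0, 50):
    f2 = 1 / x**2
    f3 = -2 / x**3
    assert np.isclose(abs(f3), 2 * f2**1.5)
```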


Self-concordant calculus

Properties

• preserved under sums and positive scaling by factor ≥ 1


• preserved under composition with affine function
• if 𝑔 is convex with dom 𝑔 = R++ and |𝑔′′′ (𝑥)| ≤ 3𝑔′′ (𝑥)/𝑥 then

𝑓 (𝑥) = − log(−𝑔(𝑥)) − log 𝑥

is self-concordant (on {𝑥 | 𝑥 > 0, 𝑔(𝑥) < 0})

Examples: properties can be used to show that the following are s.c.
• 𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥) on {𝑥 | 𝑎𝑇𝑖 𝑥 < 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚}
• 𝑓 (𝑋) = − log det 𝑋 on S𝑛++
• 𝑓 (𝑥, 𝑦) = − log(𝑦² − 𝑥𝑇 𝑥) on {(𝑥, 𝑦) | k𝑥k2 < 𝑦}

Unconstrained minimization 10.26


Convergence analysis for self-concordant functions

Summary: there exist constants 𝜂 ∈ (0, 1/4] , 𝛾 > 0 such that

• if 𝜆(𝑥 (𝑘) ) > 𝜂, then

𝑓 (𝑥 (𝑘+1) ) − 𝑓 (𝑥 (𝑘) ) ≤ −𝛾

• if 𝜆(𝑥 (𝑘) ) ≤ 𝜂, then

2𝜆(𝑥 (𝑘+1) ) ≤ ( 2𝜆(𝑥 (𝑘) ) )²

(𝜂 and 𝛾 only depend on backtracking parameters 𝛼, 𝛽)

Complexity bound: number of Newton iterations bounded by

( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 + log2 log2 (1/𝜖)

for 𝛼 = 0.1, 𝛽 = 0.8, 𝜖 = 10−10, bound evaluates to 375( 𝑓 (𝑥 (0) ) − 𝑝★) + 6

Unconstrained minimization 10.27


Numerical example

150 randomly generated instances of

minimize 𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: scatter plot of the number of Newton iterations (0 to 25) versus 𝑓 (𝑥 (0) ) − 𝑝★ (0 to 35); ◦: 𝑚 = 100, 𝑛 = 50; □: 𝑚 = 1000, 𝑛 = 500; ^: 𝑚 = 1000, 𝑛 = 50]
• number of iterations much smaller than 375( 𝑓 (𝑥 (0) ) − 𝑝★) + 6
• bound of the form 𝑐( 𝑓 (𝑥 (0) ) − 𝑝★) + 6 with smaller 𝑐 (empirically) valid
Unconstrained minimization 10.28
Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

𝐻Δ𝑥 = −𝑔

where 𝐻 = ∇2 𝑓 (𝑥) , 𝑔 = ∇ 𝑓 (𝑥)

Via Cholesky factorization

𝐻 = 𝐿𝐿𝑇 , Δ𝑥nt = −𝐿 −𝑇 𝐿 −1 𝑔, 𝜆(𝑥) = k𝐿 −1 𝑔k2

• cost (1/3)𝑛3 flops for unstructured system


• cost ≪ (1/3)𝑛3 if 𝐻 sparse, banded

Unconstrained minimization 10.29


Example of dense Newton system with structure

𝑓 (𝑥) = Σ𝑖=1…𝑛 𝜓𝑖 (𝑥𝑖 ) + 𝜓0 ( 𝐴𝑥 + 𝑏),       𝐻 = 𝐷 + 𝐴𝑇 𝐻0 𝐴

• assume 𝐴 ∈ R 𝑝×𝑛 , dense, with 𝑝 ≪ 𝑛


• 𝐷 diagonal with diagonal elements 𝜓𝑖′′ (𝑥𝑖 ) ; 𝐻0 = ∇2𝜓0 ( 𝐴𝑥 + 𝑏)

Method 1: form 𝐻 , solve via dense Cholesky factorization (cost (1/3)𝑛3)

Method 2 (page 9.15): factor 𝐻0 = 𝐿 0 𝐿𝑇0 ; write Newton system as

𝐷Δ𝑥 + 𝐴𝑇 𝐿 0 𝑤 = −𝑔, 𝐿𝑇0 𝐴Δ𝑥 − 𝑤 = 0

eliminate Δ𝑥 from first equation; compute 𝑤 and Δ𝑥 from

(𝐼 + 𝐿𝑇0 𝐴𝐷 −1 𝐴𝑇 𝐿 0)𝑤 = −𝐿𝑇0 𝐴𝐷 −1 𝑔, 𝐷Δ𝑥 = −𝑔 − 𝐴𝑇 𝐿 0 𝑤

cost: 2𝑝 2 𝑛 (dominated by computation of 𝐿𝑇0 𝐴𝐷 −1 𝐴𝑇 𝐿 0)


Unconstrained minimization 10.30
