
L. Vandenberghe ECE236B (Winter 2022)

10. Unconstrained minimization

• terminology and assumptions


• gradient descent method
• steepest descent method
• Newton’s method
• self-concordant functions
• implementation

10.1
Unconstrained minimization

minimize 𝑓 (𝑥)

• 𝑓 convex, twice continuously differentiable (hence dom 𝑓 open)


• we assume optimal value 𝑝★ = inf 𝑥 𝑓 (𝑥) is attained (and finite)

Unconstrained minimization methods

• produce sequence of points 𝑥 (𝑘) ∈ dom 𝑓 , 𝑘 = 0, 1, . . . , with

𝑓 (𝑥 (𝑘) ) → 𝑝★

• can be interpreted as iterative methods for solving optimality condition

∇ 𝑓 (𝑥★) = 0

Unconstrained minimization 10.2


Initial point and sublevel set

algorithms in this chapter require a starting point 𝑥 (0) such that

• 𝑥 (0) ∈ dom 𝑓
• sublevel set 𝑆 = {𝑥 | 𝑓 (𝑥) ≤ 𝑓 (𝑥 (0) )} is closed

2nd condition is hard to verify, except when all sublevel sets are closed:

• equivalent to condition that epi 𝑓 is closed


• true if dom 𝑓 = R𝑛
• true if 𝑓 (𝑥) → ∞ as 𝑥 → bd dom 𝑓

examples of differentiable functions with closed sublevel sets:

𝑓 (𝑥) = log( Σ𝑖=1…𝑚 exp(𝑎𝑇𝑖 𝑥 + 𝑏𝑖 )),       𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

Unconstrained minimization 10.3


Strong convexity and implications

𝑓 is strongly convex on 𝑆 if there exists an 𝑚 > 0 such that

∇2 𝑓 (𝑥) ⪰ 𝑚𝐼 for all 𝑥 ∈ 𝑆

Implications

• for 𝑥, 𝑦 ∈ 𝑆 ,

𝑓 (𝑦) ≥ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 (𝑦 − 𝑥) + (𝑚/2) k𝑦 − 𝑥k22

• 𝑆 is bounded
• 𝑝★ > −∞ and, for 𝑥 ∈ 𝑆 ,

𝑓 (𝑥) − 𝑝★ ≤ (1/(2𝑚)) k∇ 𝑓 (𝑥) k22

useful as stopping criterion (if you know 𝑚 )

Unconstrained minimization 10.4
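The suboptimality bound above can be checked numerically on a strongly convex quadratic; the quadratic and constants below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Sketch: verify f(x) - p* <= ||grad f(x)||_2^2 / (2m) for the quadratic
# f(x) = (1/2) x^T Q x, which has p* = 0 and strong convexity constant
# m = lambda_min(Q) on all of R^n.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q = B @ B.T + np.eye(5)          # positive definite Hessian
m = np.linalg.eigvalsh(Q).min()  # strong convexity constant

x = rng.standard_normal(5)
f_val = 0.5 * x @ Q @ x          # f(x); here p* = 0
grad = Q @ x                     # gradient of f at x
bound = grad @ grad / (2 * m)

assert f_val <= bound + 1e-12    # the stopping-criterion bound holds
```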


Descent methods

𝑥 (𝑘+1) = 𝑥 (𝑘) + 𝑡 (𝑘) Δ𝑥 (𝑘) with 𝑓 (𝑥 (𝑘+1) ) < 𝑓 (𝑥 (𝑘) )

• other notations:
𝑥 + = 𝑥 + 𝑡Δ𝑥, 𝑥 := 𝑥 + 𝑡Δ𝑥

• Δ𝑥 is the step, or search direction; 𝑡 is the step size, or step length


• for convex 𝑓 : if 𝑓 (𝑥 +) < 𝑓 (𝑥) then Δ𝑥 must be a descent direction:

∇ 𝑓 (𝑥)𝑇 Δ𝑥 < 0

General descent method


given: a starting point 𝑥 ∈ dom 𝑓
repeat
1. determine a descent direction Δ𝑥
2. line search: choose a step size 𝑡 > 0
3. update: 𝑥 := 𝑥 + 𝑡Δ𝑥
until stopping criterion is satisfied

Unconstrained minimization 10.5


Line search types

Exact line search: 𝑡 = argmin𝑡>0 𝑓 (𝑥 + 𝑡Δ𝑥)

Backtracking line search (with parameters 𝛼 ∈ (0, 1/2) , 𝛽 ∈ (0, 1) )


• starting at 𝑡 = 1, repeat 𝑡 := 𝛽𝑡 until

𝑓 (𝑥 + 𝑡Δ𝑥) < 𝑓 (𝑥) + 𝛼𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥

• graphical interpretation: backtrack until 𝑡 ≤ 𝑡0

[figure: graph of 𝑓 (𝑥 + 𝑡Δ𝑥) versus 𝑡, with the lines 𝑓 (𝑥) + 𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥 and 𝑓 (𝑥) + 𝛼𝑡∇ 𝑓 (𝑥)𝑇 Δ𝑥 ; the backtracking condition holds for 𝑡 ∈ (0, 𝑡0] ]
Unconstrained minimization 10.6
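The backtracking rule above fits in a few lines; the test function and parameter values below are illustrative assumptions:

```python
import numpy as np

# Sketch of backtracking line search with parameters alpha, beta;
# f and grad are callables supplied by the caller, dx a descent direction.
def backtracking(f, grad, x, dx, alpha=0.3, beta=0.8):
    t = 1.0
    g = grad(x)
    # shrink t until f(x + t dx) < f(x) + alpha t grad f(x)^T dx
    while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
        t *= beta
    return t

# usage on f(x) = x1^2 + 10 x2^2 with the negative-gradient direction
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x0 = np.array([1.0, 1.0])
t = backtracking(f, grad, x0, -grad(x0))
assert f(x0 + t * (-grad(x0))) < f(x0)   # the accepted step decreases f
```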
Gradient descent method

Gradient descent: general descent method with Δ𝑥 = −∇ 𝑓 (𝑥)


given: a starting point 𝑥 ∈ dom 𝑓
repeat
1. Δ𝑥 := −∇ 𝑓 (𝑥)
2. line search: choose step size 𝑡 via exact or backtracking line search
3. update: 𝑥 := 𝑥 + 𝑡Δ𝑥
until stopping criterion is satisfied

• stopping criterion usually of the form k∇ 𝑓 (𝑥) k2 ≤ 𝜖


• convergence result: for strongly convex 𝑓 ,

𝑓 (𝑥 (𝑘) ) − 𝑝★ ≤ 𝑐 𝑘 ( 𝑓 (𝑥 (0) ) − 𝑝★)

𝑐 ∈ (0, 1) depends on 𝑚 , 𝑥 (0) , line search type


• very simple, but often very slow

Unconstrained minimization 10.7
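The algorithm above can be sketched directly; the test problem is an illustrative assumption:

```python
import numpy as np

# Gradient descent with backtracking line search and stopping criterion
# ||grad f(x)||_2 <= eps, following the algorithm above.
def gradient_descent(f, grad, x, eps=1e-6, alpha=0.3, beta=0.8, max_iter=10000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        dx = -g
        t = 1.0
        while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# usage: minimize f(x) = (1/2)(x1^2 + 10 x2^2), minimizer (0, 0)
f = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
grad = lambda x: np.array([x[0], 10 * x[1]])
x_star = gradient_descent(f, grad, np.array([10.0, 1.0]))
assert np.linalg.norm(x_star) < 1e-5
```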


Quadratic problem in R2

𝑓 (𝑥) = (1/2)(𝑥1² + 𝛾𝑥2²) (𝛾 > 0)

with exact line search, starting at 𝑥 (0) = (𝛾, 1) :

𝑥1(𝑘) = 𝛾 ( (𝛾 − 1)/(𝛾 + 1) )^𝑘 ,       𝑥2(𝑘) = ( −(𝛾 − 1)/(𝛾 + 1) )^𝑘

• very slow if 𝛾 ≫ 1 or 𝛾 ≪ 1
• example for 𝛾 = 10:

[figure: contour lines of 𝑓 with the iterates 𝑥 (0) , 𝑥 (1) , . . . zigzagging toward the minimizer at the origin]
Unconstrained minimization 10.8
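The closed-form iterates can be verified numerically; for a quadratic, exact line search along the negative gradient has the explicit step size 𝑡 = (𝑔𝑇 𝑔)/(𝑔𝑇 𝑄𝑔):

```python
import numpy as np

# Sketch: gradient descent with exact line search on
# f(x) = (1/2)(x1^2 + gamma*x2^2), starting at x(0) = (gamma, 1),
# checked against the closed-form iterates above.
gamma = 10.0
Q = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])

for k in range(5):
    r = (gamma - 1) / (gamma + 1)
    assert np.allclose(x, [gamma * r**k, (-r)**k])  # matches closed form
    g = Q @ x                    # gradient at the current iterate
    t = (g @ g) / (g @ Q @ g)    # exact line search step for a quadratic
    x = x - t * g
```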
Nonquadratic example

𝑓 (𝑥1, 𝑥2) = exp(𝑥1 + 3𝑥2 − 0.1) + exp(𝑥1 − 3𝑥2 − 0.1) + exp(−𝑥1 − 0.1)

[figure: two panels showing the iterates 𝑥 (0) , 𝑥 (1) , 𝑥 (2) , . . . on the contour lines of 𝑓 ; left: backtracking line search, right: exact line search]

Unconstrained minimization 10.9


Example in R100

𝑓 (𝑥) = 𝑐𝑇 𝑥 − Σ𝑖=1…500 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ (from about 10^4 down to 10^−4) versus 𝑘 (0 to 200), for exact and backtracking line search]

‘linear’ convergence, i.e., a straight line on a semilog plot

Unconstrained minimization 10.10


Steepest descent method

Normalized steepest descent direction (at 𝑥 , for norm k · k ):

Δ𝑥nsd = argmin{∇ 𝑓 (𝑥)𝑇 𝑣 | k𝑣k = 1}

interpretation: for small 𝑣,

𝑓 (𝑥 + 𝑣) ≈ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 𝑣

direction Δ𝑥 nsd is unit-norm step with most negative directional derivative

(Unnormalized) steepest descent direction

Δ𝑥sd = k∇ 𝑓 (𝑥) k∗Δ𝑥nsd

satisfies ∇ 𝑓 (𝑥)𝑇 Δ𝑥 sd = −k∇ 𝑓 (𝑥) k∗2

Steepest descent method


• general descent method with Δ𝑥 = Δ𝑥sd
• convergence properties similar to gradient descent

Unconstrained minimization 10.11


Examples

• Euclidean norm: Δ𝑥sd = −∇ 𝑓 (𝑥)


• quadratic norm k𝑥k𝑃 = (𝑥𝑇 𝑃𝑥)^(1/2) (𝑃 ∈ S𝑛++):

Δ𝑥sd = −𝑃−1 ∇ 𝑓 (𝑥)

• ℓ1-norm: Δ𝑥sd = −(𝜕 𝑓 (𝑥)/𝜕𝑥𝑖 )𝑒𝑖 , where |𝜕 𝑓 (𝑥)/𝜕𝑥𝑖 | = k∇ 𝑓 (𝑥) k∞

unit balls, steepest descent directions for a quadratic norm and ℓ1-norm:

[figure: each unit ball shown with −∇ 𝑓 (𝑥) and the resulting Δ𝑥nsd]
Unconstrained minimization 10.12


Choice of norm for steepest descent

[figure: two panels of iterates 𝑥 (0) , 𝑥 (1) , 𝑥 (2) , . . . for steepest descent in two different quadratic norms]

• steepest descent with backtracking line search for two quadratic norms
• ellipses show {𝑥 | k𝑥 − 𝑥 (𝑘) k𝑃 = 1}
• equivalent interpretation of steepest descent with quadratic norm k · k𝑃 :
gradient descent after change of variables 𝑥¯ = 𝑃1/2𝑥

shows choice of 𝑃 has strong effect on speed of convergence

Unconstrained minimization 10.13
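The change-of-variables interpretation above can be verified numerically; the matrices 𝑃 and 𝑄 below are illustrative assumptions:

```python
import numpy as np

# Sketch: one steepest descent step in the P-norm equals one gradient
# step after the change of variables xbar = P^{1/2} x, mapped back.
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
P = B @ B.T + np.eye(3)                  # P in S^n_++
w, V = np.linalg.eigh(P)
Phalf = (V * np.sqrt(w)) @ V.T           # symmetric square root P^{1/2}

Q = np.diag([1.0, 4.0, 9.0])             # f(x) = (1/2) x^T Q x
grad = lambda x: Q @ x

x = rng.standard_normal(3)
t = 0.1
# steepest descent step for the quadratic norm ||.||_P
x_sd = x - t * np.linalg.solve(P, grad(x))
# gradient step in xbar coordinates; grad of f(P^{-1/2} xbar) is P^{-1/2} grad f(x)
xbar = Phalf @ x
gbar = np.linalg.solve(Phalf, grad(x))
x_gd = np.linalg.solve(Phalf, xbar - t * gbar)
assert np.allclose(x_sd, x_gd)           # the two steps coincide
```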


Newton step

Δ𝑥nt = −∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)

Interpretations

• 𝑥 + Δ𝑥nt minimizes second order approximation

𝑓ˆ(𝑥 + 𝑣) = 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 𝑣 + (1/2) 𝑣𝑇 ∇2 𝑓 (𝑥)𝑣

• 𝑥 + Δ𝑥nt solves linearized optimality condition

∇ 𝑓 (𝑥 + 𝑣) ≈ ∇ 𝑓ˆ(𝑥 + 𝑣) = ∇ 𝑓 (𝑥) + ∇2 𝑓 (𝑥)𝑣 = 0

[figure: left, 𝑓 and its quadratic model 𝑓ˆ through (𝑥, 𝑓 (𝑥)) and (𝑥 + Δ𝑥nt, 𝑓 (𝑥 + Δ𝑥nt)); right, 𝑓 ′ and its linearization through (𝑥, 𝑓 ′ (𝑥)) and (𝑥 + Δ𝑥nt, 𝑓 ′ (𝑥 + Δ𝑥nt))]

Unconstrained minimization 10.14


• Δ𝑥nt is steepest descent direction at 𝑥 in local Hessian norm

k𝑢k∇2 𝑓 (𝑥) = (𝑢𝑇 ∇2 𝑓 (𝑥)𝑢) 1/2

[figure: points 𝑥 + Δ𝑥nt and 𝑥 + Δ𝑥nsd shown inside the ellipse]

dashed lines are contour lines of 𝑓 ; ellipse is {𝑥 + 𝑣 | 𝑣𝑇 ∇2 𝑓 (𝑥)𝑣 = 1}

arrow shows −∇ 𝑓 (𝑥)

Unconstrained minimization 10.15


Newton decrement

𝜆(𝑥) = (∇ 𝑓 (𝑥)𝑇 ∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)) 1/2

a measure of the proximity of 𝑥 to 𝑥★

Properties

• gives an estimate of 𝑓 (𝑥) − 𝑝★, using the quadratic approximation 𝑓ˆ:

𝑓 (𝑥) − inf𝑦 𝑓ˆ(𝑦) = (1/2) 𝜆(𝑥)²

• equal to the norm of the Newton step in the quadratic Hessian norm

𝜆(𝑥) = (Δ𝑥nt𝑇 ∇2 𝑓 (𝑥)Δ𝑥nt)^(1/2)

• directional derivative in the Newton direction: ∇ 𝑓 (𝑥)𝑇 Δ𝑥nt = −𝜆(𝑥) 2


• affine invariant (unlike k∇ 𝑓 (𝑥) k2)

Unconstrained minimization 10.16
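The identities above can be checked numerically; the logistic-type convex test function below is an illustrative assumption:

```python
import numpy as np

# Sketch: verify the Newton-decrement identities for a smooth convex f.
def grad(x):
    return 1 / (1 + np.exp(-x)) + x          # grad of sum(log(1+e^x_i)) + ||x||^2/2

def hess(x):
    s = 1 / (1 + np.exp(-x))
    return np.diag(s * (1 - s)) + np.eye(len(x))

x = np.array([0.5, -1.0, 2.0])
g, H = grad(x), hess(x)
dx_nt = -np.linalg.solve(H, g)               # Newton step
lam = np.sqrt(g @ np.linalg.solve(H, g))     # Newton decrement

assert np.isclose(lam, np.sqrt(dx_nt @ H @ dx_nt))  # Hessian-norm identity
assert np.isclose(g @ dx_nt, -lam**2)               # directional derivative
```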


Newton’s method

given: a starting point 𝑥 ∈ dom 𝑓 , tolerance 𝜖 > 0


repeat
1. compute the Newton step and decrement

Δ𝑥nt := −∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥) ; 𝜆2 := ∇ 𝑓 (𝑥)𝑇 ∇2 𝑓 (𝑥) −1 ∇ 𝑓 (𝑥)


2. stopping criterion: quit if 𝜆2/2 ≤ 𝜖
3. line search: choose step size 𝑡 by backtracking line search
4. update: 𝑥 := 𝑥 + 𝑡Δ𝑥 nt

Affine invariance

• Newton iterates for 𝑓˜(𝑦) = 𝑓 (𝑇 𝑦) with starting point 𝑦 (0) = 𝑇 −1𝑥 (0) are

𝑦 (𝑘) = 𝑇 −1𝑥 (𝑘)

• independent of linear changes of coordinates

Unconstrained minimization 10.17
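The algorithm above translates directly to code; the smooth convex test function and its derivatives below are illustrative assumptions:

```python
import numpy as np

# Newton's method with backtracking line search and stopping criterion
# lambda^2/2 <= eps, following the algorithm above.
def newton(f, grad, hess, x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)      # Newton step
        lam2 = -g @ dx                   # lambda(x)^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:
            break
        t = 1.0
        while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# usage: minimize f(x) = e^{x1+x2} + e^{-x1} + x2^2 (smooth, convex)
f = lambda x: np.exp(x[0] + x[1]) + np.exp(-x[0]) + x[1]**2
grad = lambda x: np.array([np.exp(x[0] + x[1]) - np.exp(-x[0]),
                           np.exp(x[0] + x[1]) + 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0] + x[1]) + np.exp(-x[0]), np.exp(x[0] + x[1])],
                           [np.exp(x[0] + x[1]), np.exp(x[0] + x[1]) + 2]])
x_star = newton(f, grad, hess, np.zeros(2))
assert np.linalg.norm(grad(x_star)) < 1e-4
```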


Classical convergence analysis

Assumptions

• 𝑓 strongly convex on 𝑆 with constant 𝑚


• ∇2 𝑓 is Lipschitz continuous on 𝑆 , with constant 𝐿 > 0:

k∇2 𝑓 (𝑥) − ∇2 𝑓 (𝑦) k2 ≤ 𝐿k𝑥 − 𝑦k2

( 𝐿 measures how well 𝑓 can be approximated by a quadratic function)

Outline: there exist constants 𝜂 ∈ (0, 𝑚²/𝐿) , 𝛾 > 0 such that

• if k∇ 𝑓 (𝑥) k2 ≥ 𝜂, then 𝑓 (𝑥 (𝑘+1) ) − 𝑓 (𝑥 (𝑘) ) ≤ −𝛾


• if k∇ 𝑓 (𝑥) k2 < 𝜂, then

(𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘+1) ) k2 ≤ ( (𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘) ) k2 )²

Unconstrained minimization 10.18


Classical convergence analysis

Damped Newton phase ( k∇ 𝑓 (𝑥) k2 ≥ 𝜂 )

• most iterations require backtracking steps


• function value decreases by at least 𝛾
• if 𝑝★ > −∞, this phase ends after at most ( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 iterations

Quadratically convergent phase ( k∇ 𝑓 (𝑥) k2 < 𝜂 )

• all iterations use step size 𝑡 = 1


• k∇ 𝑓 (𝑥) k2 converges to zero quadratically: if k∇ 𝑓 (𝑥 (𝑘) ) k2 < 𝜂, then

(𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑙) ) k2 ≤ ( (𝐿/(2𝑚²)) k∇ 𝑓 (𝑥 (𝑘) ) k2 )^(2^(𝑙−𝑘)) ≤ (1/2)^(2^(𝑙−𝑘)) ,       𝑙 ≥ 𝑘

Unconstrained minimization 10.19


Classical convergence analysis

Conclusion: number of iterations until 𝑓 (𝑥) − 𝑝★ ≤ 𝜖 is bounded above by

( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 + log2 log2 (𝜖0/𝜖)

• 𝛾 , 𝜖0 are constants that depend on 𝑚 , 𝐿 , 𝑥 (0)


• second term is small (of the order of 6) and almost constant for practical
purposes

• in practice, constants 𝑚 , 𝐿 (hence 𝛾 , 𝜖0) are usually unknown


• provides qualitative insight into convergence properties (i.e., explains the two phases of the algorithm)

Unconstrained minimization 10.20


Examples

Example in R2 (page 10.9)

[figure: left, Newton iterates 𝑥 (0) , 𝑥 (1) on the contour lines of 𝑓 ; right, semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 , falling from about 10^5 to 10^−15 within 5 iterations]

• backtracking parameters 𝛼 = 0.1, 𝛽 = 0.7


• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10.21


Examples

Example in R100 (page 10.10)

[figure: left, semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 for exact and backtracking line search; right, step size 𝑡 (𝑘) versus 𝑘 for both line searches]

• backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5


• backtracking line search almost as fast as exact l.s. (and much simpler)
• clearly shows two phases in algorithm

Unconstrained minimization 10.22


Examples

Example in R10000 (with sparse 𝑎𝑖 )

𝑓 (𝑥) = − Σ𝑖=1…10000 log(1 − 𝑥𝑖²) − Σ𝑖=1…100000 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: semilog plot of 𝑓 (𝑥 (𝑘) ) − 𝑝★ versus 𝑘 (0 to 20), decreasing from about 10^5 to below 10^−5]

• backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5


• performance similar to that for the small examples

Unconstrained minimization 10.23


Self-concordance

Shortcomings of classical convergence analysis

• depends on unknown constants (𝑚 , 𝐿 , . . . )


• bound is not affinely invariant, although Newton’s method is

Convergence analysis via self-concordance (Nesterov and Nemirovski)

• does not depend on any unknown constants


• gives affine-invariant bound
• applies to special class of convex functions (‘self-concordant’ functions)
• developed to analyze polynomial-time interior-point methods for convex
optimization

Unconstrained minimization 10.24


Self-concordant functions

Definition

• convex 𝑓 : R → R is self-concordant if

| 𝑓 ′′′ (𝑥)| ≤ 2 𝑓 ′′ (𝑥)^(3/2) for all 𝑥 ∈ dom 𝑓

• 𝑓 : R𝑛 → R is self-concordant if 𝑔(𝑡) = 𝑓 (𝑥 + 𝑡𝑣) is s.c. for all 𝑥 ∈ dom 𝑓 and 𝑣

Examples on R

• linear and quadratic functions


• negative logarithm 𝑓 (𝑥) = − log 𝑥
• negative entropy plus negative logarithm: 𝑓 (𝑥) = 𝑥 log 𝑥 − log 𝑥

Affine invariance: if 𝑓 : R → R is s.c., then 𝑓˜(𝑦) = 𝑓 (𝑎𝑦 + 𝑏) is s.c.:

𝑓˜′′′ (𝑦) = 𝑎³ 𝑓 ′′′ (𝑎𝑦 + 𝑏),       𝑓˜′′ (𝑦) = 𝑎² 𝑓 ′′ (𝑎𝑦 + 𝑏)

Unconstrained minimization 10.25
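For the negative logarithm, the defining inequality holds with equality, which is easy to confirm numerically:

```python
import numpy as np

# Sketch: for f(x) = -log x on x > 0, f''(x) = 1/x^2 and f'''(x) = -2/x^3,
# so |f'''(x)| = 2 f''(x)^{3/2} exactly (the self-concordance bound is tight).
for x in np.linspace(0.1, 10.0, 50):
    f2 = 1 / x**2
    f3 = -2 / x**3
    assert np.isclose(abs(f3), 2 * f2**1.5)
```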


Self-concordant calculus

Properties

• preserved under sums and positive scaling by factor ≥ 1


• preserved under composition with affine function
• if 𝑔 is convex with dom 𝑔 = R++ and |𝑔′′′ (𝑥)| ≤ 3𝑔′′ (𝑥)/𝑥 then

𝑓 (𝑥) = − log(−𝑔(𝑥)) − log 𝑥

is self-concordant (on {𝑥 | 𝑥 > 0, 𝑔(𝑥) < 0})

Examples: properties can be used to show that the following are s.c.
• 𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥) on {𝑥 | 𝑎𝑇𝑖 𝑥 < 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚}
• 𝑓 (𝑋) = − log det 𝑋 on S𝑛++
• 𝑓 (𝑥, 𝑦) = − log(𝑦² − 𝑥𝑇 𝑥) on {(𝑥, 𝑦) | k𝑥k2 < 𝑦}

Unconstrained minimization 10.26


Convergence analysis for self-concordant functions

Summary: there exist constants 𝜂 ∈ (0, 1/4] , 𝛾 > 0 such that

• if 𝜆(𝑥 (𝑘) ) > 𝜂, then

𝑓 (𝑥 (𝑘+1) ) − 𝑓 (𝑥 (𝑘) ) ≤ −𝛾

• if 𝜆(𝑥 (𝑘) ) ≤ 𝜂, then

2𝜆(𝑥 (𝑘+1) ) ≤ ( 2𝜆(𝑥 (𝑘) ) )²

(𝜂 and 𝛾 only depend on backtracking parameters 𝛼, 𝛽)

Complexity bound: number of Newton iterations bounded by

( 𝑓 (𝑥 (0) ) − 𝑝★)/𝛾 + log2 log2 (1/𝜖)

for 𝛼 = 0.1, 𝛽 = 0.8, 𝜖 = 10−10, bound evaluates to 375( 𝑓 (𝑥 (0) ) − 𝑝★) + 6

Unconstrained minimization 10.27


Numerical example

150 randomly generated instances of

minimize 𝑓 (𝑥) = − Σ𝑖=1…𝑚 log(𝑏𝑖 − 𝑎𝑇𝑖 𝑥)

[figure: scatter plot of the number of Newton iterations (0 to 25) versus 𝑓 (𝑥 (0) ) − 𝑝★ (0 to 35); ◦: 𝑚 = 100, 𝑛 = 50; □: 𝑚 = 1000, 𝑛 = 500; ^: 𝑚 = 1000, 𝑛 = 50]
• number of iterations much smaller than 375( 𝑓 (𝑥 (0) ) − 𝑝★) + 6
• bound of the form 𝑐( 𝑓 (𝑥 (0) ) − 𝑝★) + 6 with smaller 𝑐 (empirically) valid
Unconstrained minimization 10.28
Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

𝐻Δ𝑥 = −𝑔

where 𝐻 = ∇2 𝑓 (𝑥) , 𝑔 = ∇ 𝑓 (𝑥)

Via Cholesky factorization

𝐻 = 𝐿𝐿𝑇 , Δ𝑥nt = −𝐿 −𝑇 𝐿 −1 𝑔, 𝜆(𝑥) = k𝐿 −1 𝑔k2

• cost (1/3)𝑛3 flops for unstructured system


• cost ≪ (1/3)𝑛3 if 𝐻 sparse, banded

Unconstrained minimization 10.29


Example of dense Newton system with structure

𝑓 (𝑥) = Σ𝑖=1…𝑛 𝜓𝑖 (𝑥𝑖 ) + 𝜓0 ( 𝐴𝑥 + 𝑏),       𝐻 = 𝐷 + 𝐴𝑇 𝐻0 𝐴

• assume 𝐴 ∈ R 𝑝×𝑛 , dense, with 𝑝 ≪ 𝑛


• 𝐷 diagonal with diagonal elements 𝜓𝑖′′ (𝑥𝑖 ) ; 𝐻0 = ∇2𝜓0 ( 𝐴𝑥 + 𝑏)

Method 1: form 𝐻 , solve via dense Cholesky factorization (cost (1/3)𝑛3)

Method 2 (page 9.15): factor 𝐻0 = 𝐿 0 𝐿𝑇0 ; write Newton system as

𝐷Δ𝑥 + 𝐴𝑇 𝐿 0 𝑤 = −𝑔, 𝐿𝑇0 𝐴Δ𝑥 − 𝑤 = 0

eliminate Δ𝑥 from first equation; compute 𝑤 and Δ𝑥 from

(𝐼 + 𝐿𝑇0 𝐴𝐷 −1 𝐴𝑇 𝐿 0)𝑤 = −𝐿𝑇0 𝐴𝐷 −1 𝑔, 𝐷Δ𝑥 = −𝑔 − 𝐴𝑇 𝐿 0 𝑤

cost: 2𝑝 2 𝑛 (dominated by computation of 𝐿𝑇0 𝐴𝐷 −1 𝐴𝑇 𝐿 0)


Unconstrained minimization 10.30
