CO 463
Felix Zhou
From Professor Walaa Moursi's lectures at the University of Waterloo in Winter 2021.
Contents

1 Convex Sets
1.1 Introduction
2 Convex Functions
2.6 Minimizers
2.7 Conjugates
2.10 Differentiability
2.11 Conjugacy
3.6 Iterative Shrinkage Thresholding Algorithm
Chapter 1
Convex Sets
1.1 Introduction
In the case when C = Rn, the minimizers of f occur at the critical points of f, namely at x ∈ Rn where ∇f(x) = 0. This is known as "Fermat's rule".
In this course, we seek to approach (P) when f is not differentiable but f is convex and when ∅ ≠ C ⊊ Rn is a convex set.
Recall that S ⊆ Rn is affine if for all x, y ∈ S and λ ∈ R,
λx + (1 − λ)y ∈ S.
Definition 1.2.3 (Affine Hull)
Let S ⊆ Rn. The affine hull of S is
aff(S) := ⋂ {T : S ⊆ T ⊆ Rn, T is affine}.
Example 1.2.1
Let L be a linear subspace of Rn and a ∈ Rn. Then a + L is affine, and aff(a + L) = a + L.
Definition 1.3.1
C ⊆ Rn is convex if for all x, y ∈ C and λ ∈ (0, 1),
λx + (1 − λ)y ∈ C.
Example 1.3.1
∅, Rn, balls, affine sets, and half-spaces are all examples of convex sets.
Theorem 1.3.2
The intersection of an arbitrary collection of convex sets is convex.
Proof
Let I be an index set and let (Ci)i∈I be a collection of convex subsets of Rn. Put
C := ⋂_{i∈I} Ci.
Pick x, y ∈ C and λ ∈ (0, 1). For each i ∈ I, x, y ∈ Ci, so λx + (1 − λ)y ∈ Ci by convexity. Hence λx + (1 − λ)y ∈ C.
Corollary 1.3.2.1
Let bi ∈ Rn and βi ∈ R for i ∈ I for some arbitrary index set I.
The set
C := {x ∈ Rn : ⟨x, bi⟩ ≤ βi, ∀i ∈ I}
is convex.
Theorem 1.4.1
C ⊆ Rn is convex if and only if it contains all convex combinations of its elements.
Proof
(⇐) Apply the definition of convex combination with m = 2.
(⇒) Induct on the number m of points in the combination, splitting off the last term.
Definition 1.4.2 (Convex Hull)
The convex hull of S ⊆ Rn is
conv S := ⋂ {T : S ⊆ T ⊆ Rn, T is convex}.
Theorem 1.4.2
Let S ⊆ Rn. conv S consists of all convex combinations of elements of S.
Proof
Let D be the set of convex combinations of elements of S.
(D ⊆ conv S) By the previous theorem, the convexity of conv S means that it contains all convex combinations of its elements. In particular, it contains all convex combinations of elements of S ⊆ conv S.
The distance function to S ⊆ Rn is
dS : x ↦ inf_{s∈S} ‖x − s‖.
A point p ∈ C attains the distance from x to C when
dC(x) = ‖x − p‖.
Recall that a Cauchy sequence (xn)n∈N in Rn is a sequence such that
‖xm − xn‖ → 0
as min(m, n) → ∞. Since Rn is a complete metric space under the Euclidean metric, every Cauchy sequence converges in Rn.
Lemma 1.5.1
Let x, y, z ∈ Rn. Then
‖x − y‖² = 2‖z − x‖² + 2‖z − y‖² − 4‖z − (x + y)/2‖².
Proof
This is by direct computation, expanding each norm via ‖a‖² = ⟨a, a⟩.
Lemma 1.5.2
Let x, y ∈ Rn. Then ⟨x, y⟩ ≤ 0 if and only if ‖x‖ ≤ ‖x − λy‖ for all λ ≥ 0.
Proof
(⇒) Suppose ⟨x, y⟩ ≤ 0. Then for every λ ≥ 0,
‖x − λy‖² − ‖x‖² = λ²‖y‖² − 2λ⟨x, y⟩ ≥ 0.
(⇐) Conversely, squaring ‖x‖ ≤ ‖x − λy‖ and rearranging gives, for λ > 0,
⟨x, y⟩ ≤ (λ/2)‖y‖² → 0 as λ → 0⁺.
Hence ⟨x, y⟩ ≤ 0.
Proof (i)
Recall that
dC(x) := inf_{c∈C} ‖x − c‖.
Hence there is a sequence (cn)n∈N in C such that
dC(x) = lim_{n→∞} ‖cn − x‖.
By the lemma above, applied with z = x and the pair cn, cm (whose midpoint lies in C by convexity), as m, n → ∞,
0 ≤ ‖cn − cm‖² ≤ 2‖x − cn‖² + 2‖x − cm‖² − 4dC(x)² → 4dC(x)² − 4dC(x)² = 0,
so (cn) is a Cauchy sequence. But then there is some p ∈ C such that cn → p, by the closedness (completeness) of C. Consequently,
‖x − cn‖ → dC(x) = ‖x − p‖.
Next, suppose p, q ∈ C both attain the infimum. Since (p + q)/2 ∈ C by convexity,
0 ≤ ‖p − q‖² = 2‖p − x‖² + 2‖q − x‖² − 4‖x − (p + q)/2‖² ≤ 2dC(x)² + 2dC(x)² − 4dC(x)² = 0.
So ‖p − q‖ = 0 =⇒ p = q.
Proof (ii)
Observe that p = PC (x) if and only if p ∈ C and
kx − pk2 = dC (x)2 .
Since C is convex,
∀α ∈ [0, 1], yα := αy + (1 − α)p ∈ C.
Thus
‖x − p‖² = dC(x)²
⇐⇒ ∀y ∈ C, α ∈ [0, 1], ‖x − p‖² ≤ ‖x − yα‖²
⇐⇒ ∀y ∈ C, α ∈ [0, 1], ‖x − p‖² ≤ ‖x − p − α(y − p)‖²
⇐⇒ ∀y ∈ C, ⟨x − p, y − p⟩ ≤ 0, by the auxiliary lemma above.
In the absence of closedness, PC (x) does not in general exist unless x ∈ C. In the absence
of convexity, uniqueness does not in general hold.
Example 1.5.4
Fix ε > 0 and let C = B(0; ε) be the closed ball around 0 of radius ε. Then
PC(x) = ε x / max(‖x‖, ε).
In other words, points of C are fixed and points outside C are rescaled onto the boundary.
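As a quick numerical sanity check (our own sketch, not part of the notes; the helper name proj_ball is ours), the closed-form ball projection can be verified against the projection theorem's characterization ⟨x − p, y − p⟩ ≤ 0 for all y ∈ C:

```python
import numpy as np

def proj_ball(x, eps):
    """Projection onto the closed ball B(0; eps): P_C(x) = eps*x / max(||x||, eps)."""
    return eps * x / max(np.linalg.norm(x), eps)

# p = P_C(x) is characterized by <x - p, y - p> <= 0 for every y in C.
rng = np.random.default_rng(0)
eps = 1.0
x = np.array([3.0, 4.0])                     # ||x|| = 5, so x lies outside B(0; 1)
p = proj_ball(x, eps)                        # expected: x / 5 = (0.6, 0.8)
for _ in range(1000):
    y = proj_ball(rng.normal(size=2), eps)   # arbitrary points of C
    assert np.dot(x - p, y - p) <= 1e-12
```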
The Minkowski sum of C, D ⊆ Rn is
C + D := {c + d : c ∈ C, d ∈ D}.
If C and D are convex, then so is C + D.
Proof
If either of C, D is empty, then C + D = ∅ by definition. Otherwise, given c1 + d1, c2 + d2 ∈ C + D and λ ∈ (0, 1),
λ(c1 + d1) + (1 − λ)(c2 + d2) = (λc1 + (1 − λ)c2) + (λd1 + (1 − λ)d2) ∈ C + D,
as required.
Proposition 1.6.2
Let ∅ ≠ C, D ⊆ Rn be closed and convex. Moreover, suppose that D is bounded. Then C + D ≠ ∅ is closed and convex.
Proof
We have already shown non-emptiness and convexity in the previous theorem.
Let (xn + yn )n∈N be a convergent sequence in C + D. Say that xn + yn → z.
Since D is bounded, there is a subsequence (ykn )n∈N such that ykn → y ∈ D. It follows
that
xkn = (xkn + ykn) − ykn → z − y ∈ C
by the closedness of C. Hence z = (z − y) + y ∈ C + D.
Note that without boundedness, the sum of two closed sets need not be closed!
Theorem 1.6.3
Let C ⊆ Rn be convex and λ1 , λ2 ≥ 0. Then
(λ1 + λ2 )C = λ1 C + λ2 C.
Proof
(⊆) This is always true, even if C is not convex.
(⊇) Without loss of generality, we may assume that λ1 + λ2 > 0. By convexity, we have
(λ1/(λ1 + λ2)) C + (λ2/(λ1 + λ2)) C ⊆ C.
In other words, λ1 C + λ2 C ⊆ (λ1 + λ2)C.
We will write
B(x; ε) := {y ∈ Rn : ‖y − x‖ ≤ ε},
B := B(0; 1).
Definition 1.7.1 (Interior)
The interior of C ⊆ Rn is
int C := {x ∈ C : B(x; ε) ⊆ C for some ε > 0}.
Proposition 1.7.1
Let C ⊆ Rn. Suppose that int C ≠ ∅. Then int C = ri C.
Proof
Let x ∈ int C. There is some ε > 0 such that B(x; ε) ⊆ C. Hence
Rn = aff(B(x; ε)) ⊆ aff C ⊆ Rn.
For an affine set A, the corresponding linear subspace is
L := A − A.
It may be useful to consider
A − A = ⋃_{a∈A} (A − a).
Proposition 1.7.2
Let C ⊆ Rn be convex. For all x ∈ int C and y ∈ C̄,
[x, y) ⊆ int C.
Proof
Let λ ∈ [0, 1). We argue that (1 − λ)x + λy ∈ int C. It suffices to show that
(1 − λ)x + λy + εB ⊆ C
for some ε > 0.
Theorem 1.7.3
Let C ⊆ Rn be convex. Then for all x ∈ ri C and y ∈ C̄,
[x, y) ⊆ ri C.
Proof
Case I: int C 6= ∅ This follows by the observation that ri C = int C.
Case II: int C = ∅ We must have dim C = m < n. Let L := aff C − aff C be the corre-
sponding linear subspace of dimension m.
Through translation by −c for some c ∈ C if necessary, we may assume without loss of generality that C ⊆ L ≅ Rm.
Theorem 1.7.4
Let C ⊆ Rn be convex. The following hold:
(i) C̄ is convex.
(ii) int C is convex.
(iii) If int C ≠ ∅, then int C = int C̄ and C̄ = cl(int C).
Proof (i)
Let x, y ∈ C̄ and λ ∈ (0, 1). There are sequences (xn), (yn) in C such that
xn → x, yn → y.
By convexity, λxn + (1 − λ)yn ∈ C, and λxn + (1 − λ)yn → λx + (1 − λ)y, so λx + (1 − λ)y ∈ C̄. Hence C̄ is convex.
Proof (ii)
If int C = ∅, the conclusion is clear.
Proof (iii)
Since C ⊆ C̄, it must hold that int C ⊆ int C̄.
Conversely, let y ∈ int C̄. If y ∈ int C, then we are done. Thus suppose otherwise. There is some ε > 0 such that B(y; ε) ⊆ C̄. We may thus choose some x ∈ int C with x ≠ y and λ > 0 sufficiently small such that y + λ(y − x) ∈ C̄.
By a previous proposition applied with the endpoint y + λ(y − x) ∈ C̄, we have y ∈ [x, y + λ(y − x)) ⊆ int C. It follows by the arbitrary choice of y that int C̄ ⊆ int C. We now turn to the second identity.
Since int C ⊆ C, we must have cl(int C) ⊆ C̄. Conversely, let y ∈ C̄ and x ∈ int C. For λ ∈ [0, 1), define
yλ := (1 − λ)x + λy.
Then yλ ∈ [x, y) ⊆ int C, and yλ → y as λ → 1⁻, so y ∈ cl(int C).
Theorem 1.7.5
Let C ⊆ Rn be convex. Then ri C, C̄ are convex.
Moreover,
C ≠ ∅ ⇐⇒ ri C ≠ ∅.
We say C1, C2 ⊆ Rn are strongly separated by b ≠ 0 if
sup_{c1∈C1} ⟨c1, b⟩ < inf_{c2∈C2} ⟨c2, b⟩.
Theorem 1.8.1
Let ∅ ≠ C ⊆ Rn be closed and convex and suppose x ∉ C. Then x is strongly separated from C.
separated from C.
Proof
The goal is to find some b 6= 0 such that
hy − p, x − pi ≤ 0 ∀y ∈ C
hy − (x − b), x − (x − b)i ≤ 0 p=x−b
hy − x, bi ≤ −hb, bi
= −kbk2
suphy, bi − hx, bi ≤ −kbk2
y∈C
<0
as desired.
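The proof is constructive, which makes it easy to illustrate numerically (our own sketch, using the unit ball as C): build b := x − PC(x) and check the strict gap between the support value over C and ⟨x, b⟩.

```python
import numpy as np

def proj_unit_ball(x):
    # projection onto the closed unit ball
    return x / max(np.linalg.norm(x), 1.0)

x = np.array([2.0, 2.0])               # a point outside the unit ball
p = proj_unit_ball(x)
b = x - p                              # separating vector from the proof; b != 0
# For the unit ball, sup_{c in C} <c, b> = ||b|| (attained at c = b/||b||).
sup_C = np.linalg.norm(b)
assert sup_C < np.dot(x, b)            # strong separation of x from C
```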
Corollary 1.8.1.1
Let C1 ∩ C2 = ∅ be nonempty subsets of Rn such that C1 − C2 is closed and convex.
Then C1 , C2 are strongly separated.
Proof
By definition, C1, C2 are strongly separated if and only if there is b ≠ 0 such that
sup_{c1∈C1} ⟨c1, b⟩ < inf_{c2∈C2} ⟨c2, b⟩,
which holds if and only if 0 is strongly separated from C1 − C2.
Since C1 ∩ C2 = ∅, we know that 0 ∉ C1 − C2. Hence C1 − C2 is strongly separated from 0 and the conclusion follows.
Corollary 1.8.1.2
Let ∅ ≠ C1, C2 ⊆ Rn be closed and convex such that C1 ∩ C2 = ∅ and C2 is bounded. Then C1, C2 are strongly separated.
Proof
C1 ∩ C2 = ∅ implies 0 ∉ C1 − C2. In addition, −C2 is also closed, bounded, and convex. It follows by a previous theorem that C1 + (−C2) is nonempty, closed, and convex, so the previous corollary applies.
Theorem 1.8.2
Let ∅ 6= C1 , C2 ⊆ Rn be closed and convex such that C1 ∩ C2 = ∅. Then C1 , C2 are
separated.
Proof
For each n ∈ N, set
Dn := C2 ∩ B(0; n).
Observe that C1 ∩ Dn = ∅ for all n. Moreover, Dn is bounded by construction.
But the sequence (un) is bounded, hence there is a convergent subsequence (ukn) where ukn → u with ‖u‖ = 1.
1.9 More Convex Sets
Proposition 1.9.1
Let C ⊆ Rn. The following hold:
(i) cone C = R++ C
(ii) the smallest closed cone containing C is cl(cone C)
(iii) cone(conv C) = conv(cone C)
(iv) cl(cone(conv C)) = cl(conv(cone C))
The proofs of all these are trivial if C = ∅. Thus in our proofs, we assume that C is
nonempty.
Proof (i)
Set D := R++ C. It is clear that C ⊆ D with D being a cone. Hence cone C ⊆ D.
Conversely, for y ∈ D, there is some λ > 0, c ∈ C for which y = λc. Then y ∈ cone C and
D ⊆ cone C.
Proof (ii)
cone(C) is a closed cone with C ⊆ cone(C). Hence
cone(C) ⊆ cone C.
22
Proof (iii)
(⊆) Let x ∈ cone(conv C). By (i), there are λ > 0 and y ∈ conv C such that x = λy. Since y ∈ conv C, we can express it as a convex combination y = Σ_{i=1}^m λi xi with each xi ∈ C. Then
x = λy = λ Σ_{i=1}^m λi xi = Σ_{i=1}^m λi (λxi) ∈ conv(cone C).
(⊇) Let x ∈ conv(cone C). We can write x as a convex combination of scalar multiples of elements of C:
x = Σ_{i=1}^m µi (λi xi).
Assuming x ≠ 0 (the case x = 0 is clear),
x = (Σ_{i=1}^m λi µi) Σ_{i=1}^m (λi µi / Σ_{j=1}^m λj µj) xi =: α Σ_{i=1}^m βi xi,
a positive multiple of a convex combination of elements of C, which lies in cone(conv C).
Proof (iv)
This is a direct consequence of iii.
Lemma 1.9.2
Let 0 ∈ C ⊆ Rn be convex with int C ≠ ∅. The following are equivalent:
(i) 0 ∈ int C
(ii) cone C = Rn
(iii) cl(cone C) = Rn
Proof
(i) =⇒ (ii) Suppose 0 ∈ int C. Then B(0; ε) ⊆ C for some ε > 0. But then
Rn = cone(B(0; ε)) ⊆ cone C ⊆ Rn.
(ii) =⇒ (iii) Rn = cone C ⊆ cl(cone C) ⊆ Rn.
(iii) =⇒ (i) Since C is convex,
conv(cone C) = cone C.
By assumption,
∅ ≠ int C ⊆ int(cone C),
so cone C has nonempty interior. Recall that
int(cone C) = int(cl(cone C))
as cone C is convex with nonempty interior. Hence
Rn = int Rn = int(cl(cone C)) = int(cone C) = cone(int C),
so 0 = λc for some λ > 0 and c ∈ int C, which forces c = 0 ∈ int C.
Definition 1.9.4 (Tangent Cone)
Let ∅ ≠ C ⊆ Rn with x ∈ Rn. The tangent cone to C at x is
TC(x) := cl(cone(C − x)) = cl(⋃_{λ∈R++} λ(C − x)) if x ∈ C, and TC(x) := ∅ if x ∉ C.
Theorem 1.9.3
Let ∅ ≠ C ⊆ Rn be closed and convex and let x ∈ C. Both NC(x) and TC(x) are closed convex cones.
Lemma 1.9.4
Let ∅ ≠ C ⊆ Rn be closed and convex with x ∈ C. Then n ∈ NC(x) if and only if ⟨n, t⟩ ≤ 0 for every t ∈ TC(x).
Proof
(⇒) Let n ∈ NC(x) and t ∈ TC(x). Recall that TC(x) = cl(cone(C − x)). Thus there are λk > 0 and tk ∈ Rn such that
x + λk tk ∈ C
and tk → t.
Since n ∈ NC (x) and x + λk tk ∈ C, it follows that for all k, hn, λk tk i ≤ 0. But then as
k → ∞ we see that
hn, ti ≤ 0.
(⇐) Suppose that ∀t ∈ TC(x), we have ⟨n, t⟩ ≤ 0. Pick y ∈ C and observe that
y − x ∈ C − x ⊆ cone(C − x) ⊆ cl(cone(C − x)) =: TC(x).
Hence ⟨n, y − x⟩ ≤ 0, and since y ∈ C was arbitrary, n ∈ NC(x).
Theorem 1.9.5
Let C ⊆ Rn be convex such that int C 6= ∅. Let x ∈ C. The following are equivalent.
(1) x ∈ int C
(2) TC (x) = Rn
(3) NC (x) = {0}
Proof
(1) ⇐⇒ (2) Observe that x ∈ int C if and only if 0 ∈ int(C − x) if and only if there is
some ε > 0 with
B(0; ε) ⊆ C − x.
Now,
Rn = cone(B(0; ε)) ⊆ cone(C − x) ⊆ cl(cone(C − x)) = TC(x) ⊆ Rn.
Indeed, by the projection theorem
hy − p, t − pi ≤ 0
for all t ∈ TC(x). In particular, it holds for t = 0, 2p ∈ TC(x) (TC(x) is a cone containing 0). So
⟨y − p, ±p⟩ ≤ 0 =⇒ ⟨y − p, p⟩ = 0.
But then hy − p, ti ≤ 0 for all t ∈ TC (x), which implies that y − p ∈ NC (x) = {0} and
y = p ∈ TC (x)
as desired.
Chapter 2
Convex Functions
Definition 2.1.5 (Lower Semicontinuous)
f is lower semicontinuous (l.s.c.) if epi(f ) is closed.
Proposition 2.1.1
Let f : Rn → [−∞, ∞] be convex. Then dom f is convex.
Proof
Consider the linear transformation L : Rn+1 → Rn given by
(x, α) ↦ x.
Then dom f = L(epi f) is the image of a convex set under a linear map, hence convex.
Theorem 2.1.2
Let f : Rm → [−∞, ∞]. Then f is convex if and only if for all x, y ∈ dom f and
λ ∈ (0, 1),
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).
Proof
If f ≡ ∞ (equivalently epi f = ∅, equivalently dom f = ∅), then the result is trivial. Hence let us suppose that f ≢ ∞, i.e. dom f ≠ ∅.
( =⇒ ) Pick x, y ∈ dom f and λ ∈ (0, 1). Observe that (x, f (x)), (y, f (y)) ∈ epi f . By
convexity,
λ(x, f(x)) + (1 − λ)(y, f(y)) = (λx + (1 − λ)y, λf(x) + (1 − λ)f(y)) ∈ epi(f),
so f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
( ⇐= ) Conversely, suppose the function inequality holds. Pick (x, α), (y, β) ∈ epi f as
well as λ ∈ (0, 1). Now,
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) ≤ λα + (1 − λ)β,
and hence
(λx + (1 − λ)y, λα + (1 − λ)β) ∈ epi f
as desired.
Remark that continuity implies lower semicontinuity. One can show that the two definitions
of l.s.c. are equivalent, but we omit the proof.
Theorem 2.2.1
Let C ⊆ Rm . Then the following hold:
(i) C 6= ∅ if and only if δC is proper
(ii) C is convex if and only if δC is convex
(iii) C is closed if and only if δC is l.s.c.
Proof (iii)
Observe that C = ∅ ⇐⇒ epi δC = ∅, which is certainly closed. Thus we proceed
assuming C 6= ∅.
Pick a convergent sequence (xn, αn) → (x, α) with every element in epi δC. Observe that (xn) is a sequence in C, and C is closed, hence x ∈ C. Moreover, αn ∈ [0, ∞) and α ≥ 0.
By the definition of δC , it suffices to show that δC (x) = 0.
By lower semicontinuity,
0 ≤ δC (x)
≤ lim inf δC (xn )
=0
Proposition 2.2.2
Let I be an indexing set and let (fi )i∈I be a family of l.s.c. convex functions on Rn .
Then
F := sup_{i∈I} fi
is convex and l.s.c.
Proof
We claim that epi F = ⋂_{i∈I} epi fi. Indeed,
The result follows by the definition of convex functions and lower semicontinuity as inter-
sections preserve both set convexity and closedness.
The support function of C is
σC : u ↦ sup_{c∈C} ⟨c, u⟩.
Proposition 2.3.1
Let ∅ 6= C ⊆ Rn . Then σC is convex, l.s.c., and proper.
Proof
For each c ∈ C, define
fc(x) := ⟨x, c⟩.
Then fc is linear and hence proper, l.s.c., and convex. Moreover,
σC = sup fc .
c∈C
Combined with our previous proposition, we learn that σC is convex and l.s.c. For properness, pick any c̄ ∈ C; then for each u,
σC(u) = sup_{c∈C} ⟨u, c⟩ ≥ ⟨u, c̄⟩ > −∞.
Let f : Rm → [−∞, ∞] be proper. Then f is strictly convex if for every x ≠ y ∈ dom f and λ ∈ (0, 1),
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).
Moreover, f is strongly convex with constant β > 0 if for every x, y ∈ dom f, λ ∈ (0, 1),
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − (β/2) λ(1 − λ)‖x − y‖².
Clearly, strong convexity implies strict convexity, which in turn implies convexity.
2.5 Operations Preserving Convexity
Proposition 2.5.1
Let I be a finite indexing set and (fi )i∈I a family of convex functions Rm → [−∞, ∞].
Then
Σ_{i∈I} fi
is convex.
Proposition 2.5.2
Let f be convex and l.s.c. and pick λ > 0. Then
λf
is convex and l.s.c.
2.6 Minimizers
We say x̄ ∈ dom f is a (global) minimizer of f if for all x ∈ Rm,
f(x̄) ≤ f(x).
Proposition 2.6.1
Let f : Rm → (−∞, ∞] be proper and convex. Then every local minimizer of f is a
global minimizer.
Proof
Let x be a local minimizer of f. There is some ρ > 0 such that f(x) ≤ f(z) for all z ∈ dom f ∩ B(x; ρ). Let y ∈ dom f; we may assume ‖y − x‖ > ρ, since otherwise the local inequality already gives f(x) ≤ f(y). Set λ := 1 − ρ/‖y − x‖ ∈ (0, 1) and
z := λx + (1 − λ)y ∈ dom f.
We know this is in the domain as dom f is convex by our prior work. We have
z − x = (1 − λ)y − (1 − λ)x = (1 − λ)(y − x),
so
‖z − x‖ = ‖(1 − λ)(y − x)‖ = (ρ/‖y − x‖) ‖y − x‖ = ρ.
By the convexity of f ,
f (x) ≤ f (z)
≤ λf (x) + (1 − λ)f (y)
(1 − λ)f (x) ≤ (1 − λ)f (y)
f (x) ≤ f (y).
Proposition 2.6.2
Let f : Rm → (−∞, ∞] be proper and convex. Let C ⊆ Rm . Suppose that x is a
minimizer of f over C such that x ∈ int C. Then x is a minimizer of f .
Proof
There is some ε > 0 such that x minimizes f over B(x; ε) ⊆ int C. Since x is thus a local minimizer, it is a global minimizer as well.
2.7 Conjugates
Recall that a closed convex set is the intersection of the closed halfspaces containing it. The idea is that the epigraph of a convex, l.s.c. function f can be recovered as the supremum of the affine functions majorized by f.
f (x) ≥ hu, xi − α ∀x ∈ Rn
α ≥ hu, xi − f (x) ∀x ∈ Rn .
Thus f ∗ (u) := supx∈Rn hu, xi−f (x) is the best translation such that hu, xi−f ∗ (u) is majorized
by f .
Proposition 2.7.1
Let f : Rm → [−∞, ∞]. Then f ∗ is convex and l.s.c.
Proof
Observe that f ≡ ∞ ⇐⇒ dom f = ∅. Hence if f ≡ ∞, for all u ∈ Rm,
f*(u) = sup_{x∈∅} ⟨u, x⟩ − f(x) = −∞,
and f* ≡ −∞ is convex and l.s.c.
Now suppose that f ≢ ∞. We claim that f*(u) = sup_{(x,α)∈epi f} ⟨x, u⟩ − α; indeed, for fixed x the supremum over α ≥ f(x) of ⟨x, u⟩ − α is approached as α ↓ f(x). Observe that
f_{(x,α)} := ⟨x, ·⟩ − α is an affine function. But then
f*(u) = sup_{(x,α)∈epi f} f_{(x,α)}(u)
is a supremum of convex and l.s.c. (affine) functions, which is convex and l.s.c. by our earlier work.
Example 2.7.2
Let 1 < p, q be such that
1/p + 1/q = 1.
Then for f(x) := |x|^p / p,
f*(u) = |u|^q / q.
Example 2.7.3
Let f(x) := e^x. Then
f*(u) = u ln u − u if u > 0; 0 if u = 0; ∞ if u < 0.
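Both conjugate formulas can be checked by brute force, approximating the supremum over a fine grid (our own sketch; the helper name conj is ours):

```python
import numpy as np

xs = np.linspace(-20, 20, 400001)            # grid for the sup over x

def conj(f_vals, u):
    """Grid approximation of f*(u) = sup_x (u*x - f(x))."""
    return np.max(u * xs - f_vals)

p, q = 3.0, 1.5                              # 1/p + 1/q = 1
f_pow = np.abs(xs) ** p / p
f_exp = np.exp(xs)

for u in (0.5, 1.0, 2.0):
    assert abs(conj(f_pow, u) - abs(u) ** q / q) < 1e-3     # f = |x|^p / p
    assert abs(conj(f_exp, u) - (u * np.log(u) - u)) < 1e-3  # f = e^x, u > 0
```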
Example 2.7.4
Let C ⊆ Rm , then
δC∗ = σC .
37
By definition,
δC*(u) = sup_{x∈Rm} ⟨x, u⟩ − δC(x) = sup_{y∈C} ⟨y, u⟩ = σC(u).
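For a concrete instance (our own sketch), take C = [−1, 1]²: the sup of a linear functional over C is attained at a vertex, so δC* = σC has the closed form |u1| + |u2|:

```python
import numpy as np

# Vertices (extreme points) of the box C = [-1, 1]^2.
vertices = np.array([[s1, s2] for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)])

def sigma_box(u):
    # sup_{c in C} <c, u> over a box is attained at an extreme point
    return np.max(vertices @ u)

assert sigma_box(np.array([1.0, 2.0])) == 3.0    # |1| + |2|
assert sigma_box(np.array([-0.5, 3.0])) == 3.5   # |-0.5| + |3|
```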
The idea is that for a differentiable convex function, the derivative at x ∈ Rn is the slope of the tangent line at x, which lies below f. If f is not differentiable at x, we can still ask for slopes of lines through (x, f(x)) which lie below f.
Proof
Let x ∈ Rm .
Example 2.8.2
Consider f (x) = |x|. Then
∂f(x) = {−1} if x < 0; [−1, 1] if x = 0; {1} if x > 0.
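The set ∂f(0) = [−1, 1] can be checked directly against the subgradient inequality f(y) ≥ f(0) + u·y (a small sketch of ours):

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 1001)
# every u in [-1, 1] is a subgradient of |.| at 0: |y| >= u*y for all y
for u in np.linspace(-1.0, 1.0, 21):
    assert np.all(np.abs(ys) >= u * ys - 1e-12)
# a slope outside [-1, 1] fails the inequality somewhere
assert not np.all(np.abs(ys) >= 1.5 * ys)
```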
Lemma 2.8.3
Let f : Rm → (−∞, ∞] be proper. Then
dom ∂f ⊆ dom f.
Proof
We argue by the contrapositive: suppose x ∉ dom f. Then f(x) = ∞ and ∂f(x) = ∅.
Proposition 2.8.4
Let ∅ ≠ C ⊆ Rm be closed and convex. Then for each x ∈ C,
∂δC(x) = NC(x).
Proof
Let u ∈ Rm and x ∈ C = dom δC. Then u ∈ ∂δC(x) if and only if for all y ∈ C, ⟨y − x, u⟩ ≤ δC(y) − δC(x) = 0, i.e. u ∈ NC(x).
Consider the constrained optimization problem min f (x), x ∈ C, where f is proper, convex,
l.s.c. and C 6= ∅ is closed and convex. We can rephrase this as min f (x) + δC (x).
In some cases, ∂(f + δC ) = ∂f + ∂δC = ∂f + NC (x). Thus by Fermat’s theorem, we look for
some x where
0 ∈ ∂f (x) + NC (x).
The main question we are concerned with is whether the subdifferential operator is additive.
Proposition 2.9.1
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then
In particular,
ri(dom f) = ri(dom ∂f),
cl(dom f) = cl(dom ∂f).
A problem with the definition of separated is that a set can be separated from itself. Indeed,
the x-axis is separated from itself with itself as a separating hyperplane. To be properly
separated, there must be some c1 ∈ C1, c2 ∈ C2 such that ⟨c1, b⟩ < ⟨c2, b⟩.
Proposition 2.9.2
Let ∅ 6= C1 , C2 ⊆ Rm be convex. Then C1 , C2 are properly separated if and only if
ri C1 ∩ ri C2 = ∅.
Proposition 2.9.3
Let C1 , C2 ⊆ Rm be convex. Then
ri(C1 + C2 ) = ri C1 + ri C2 .
Moreover,
ri(λC) = λ(ri C)
for all λ ∈ R.
Proposition 2.9.4
Let C1 ⊆ Rm and C2 ⊆ Rp be convex. Then
ri(C1 ⊕ C2 ) = ri C1 ⊕ ri C2 .
Theorem 2.9.5
Let C1, C2 ⊆ Rm be convex such that ri C1 ∩ ri C2 ≠ ∅. For each x ∈ C1 ∩ C2,
N_{C1∩C2}(x) = N_{C1}(x) + N_{C2}(x).
Proof
The reverse inclusion is not hard. Hence we check the forward inclusion (⊆) only.
Let n ∈ N_{C1∩C2}(x), so that for all y ∈ C1 ∩ C2,
⟨n, y − x⟩ ≤ 0.
Define E1 := C1 × [0, ∞) and E2 := {(y, α) : y ∈ C2, α ≤ ⟨n, y − x⟩}. By a previous fact,
ri E1 = ri C1 × (0, ∞).
Similarly,
ri E2 = {(y, α) : y ∈ ri C2, α < ⟨n, y − x⟩}.
If (y, α) ∈ ri E1 ∩ ri E2, then 0 < α < ⟨n, y − x⟩ ≤ 0, which is impossible.
It follows by a previous fact that E1 , E2 are properly separated. Namely, there is (b, γ) ∈
Rm × R \ {0} such that for all (y, α) ∈ E1 and (z, β) ∈ E2,
⟨y, b⟩ + αγ ≤ ⟨z, b⟩ + βγ.
Taking (x, 1) ∈ E1 and (x, 0) ∈ E2 gives ⟨x, b⟩ + γ ≤ ⟨x, b⟩ =⇒ γ ≤ 0.
41
Next we claim that γ ≠ 0. Suppose to the contrary that γ = 0. But then b ≠ 0 and ⟨y, b⟩ ≤ ⟨z, b⟩ for all y ∈ C1, z ∈ C2, so C1, C2 are properly separated. From our earlier fact, this contradicts the assumption that ri C1 ∩ ri C2 ≠ ∅. Altogether, γ < 0.
First, we claim that b ∈ N_{C1}(x). Indeed, taking (y, 0) ∈ E1 for y ∈ C1 and (x, 0) ∈ E2 in the separation inequality gives
⟨y, b⟩ + 0 · γ ≤ ⟨x, b⟩ + 0 · γ.
Now, for all y ∈ C2, (y, ⟨n, y − x⟩) ∈ E2 by construction. Hence for all y ∈ C2, taking (x, 0) ∈ E1,
⟨x, b⟩ ≤ ⟨y, b⟩ + γ⟨n, y − x⟩.
Dividing by γ < 0 and rearranging, equivalently,
⟨b/γ + n, y − x⟩ ≤ 0.
This shows that
b/γ + n ∈ N_{C2}(x).
Finally, since N_{C1}(x) is a cone and −1/γ > 0, we conclude n = (−1/γ)b + (b/γ + n) ∈ N_{C1}(x) + N_{C2}(x).
Proposition 2.9.6
Let f : Rm → (−∞, ∞] be convex, l.s.c. and proper. Let x, u ∈ Rm. Then
Proof
Observe that epi f 6= ∅ and is convex since f is proper and convex. Now let u ∈ Rm .
Then
Theorem 2.9.7
Let f, g : Rm → (−∞, ∞] be convex, l.s.c., and proper. Suppose that ri dom f ∩ ri dom g ≠ ∅. Then for all x ∈ Rm,
∂(f + g)(x) = ∂f(x) + ∂g(x).
Proof
Let x ∈ Rm. If x ∉ dom(f + g) = dom f ∩ dom g, then ∂f(x) + ∂g(x) = ∅ and also ∂(f + g)(x) = ∅.
Suppose now that x ∈ dom f ∩ dom g = dom(f + g). Let u ∈ ∂(f + g)(x). Define E1 := epi f × R and E2 := {(y, α, β) ∈ Rm × R × R : g(y) ≤ β}. We claim that
(u, −1, −1) ∈ N_{E1∩E2}(x, f(x), g(x)).
Indeed, let (y, α, β) ∈ E1 ∩ E2. We have by construction f(y) − α ≤ 0 and g(y) − β ≤ 0, so
⟨u, y − x⟩ − (α − f(x)) − (β − g(x)) ≤ ⟨u, y − x⟩ − (f(y) − f(x)) − (g(y) − g(x)) ≤ 0,
using u ∈ ∂(f + g)(x).
Now,
ri E1 = ri(epi f × R) = ri(epi f) × R.
Similarly,
ri E2 = {(y, α, β) ∈ Rm × R × R : g(y) < β}.
Pick z ∈ ri dom f ∩ ri dom g. Then (z, f(z) + 1, g(z) + 1) ∈ ri E1 ∩ ri E2, so ri E1 ∩ ri E2 ≠ ∅.
NE1 ∩E2 (x, f (x), g(x)) = NE1 (x, f (x), g(x)) + NE2 (x, f (x), g(x)).
Now, it can be shown that Nepi f ×R = Nepi f × NR and similarly for E2 . Therefore, there
are some u1, u2 ∈ Rm and α, β ∈ R with (u, −1, −1) = (u1, α, 0) + (u2, 0, β), which forces α = β = −1. Since (u1, −1) ∈ N_{epi f}(x, f(x)) and (u2, −1) ∈ N_{epi g}(x, g(x)), we conclude
u = u1 + u2 ∈ ∂f(x) + ∂g(x),
as desired.
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper, and let ∅ ≠ C ⊆ Rm be closed and convex. Furthermore, suppose ri C ∩ ri dom f ≠ ∅. Consider the problem
min f (x) (P )
x∈C
Indeed, we convert this to the unconstrained minimization problem min f +δC . This function
is convex, l.s.c., and proper. By Fermat’s theorem, x̄ solves P if and only if
0 ∈ ∂(f + δC )(x̄).
Now, ri dom f ∩ ri dom δC = ri dom f ∩ ri C ≠ ∅. Hence by the previous theorem, x̄ solves (P) if and only if
0 ∈ ∂f(x̄) + ∂δC(x̄) = ∂f(x̄) + NC(x̄).
Example 2.9.8
Let d ∈ Rm and ∅ 6= C ⊆ Rm be convex and closed. Consider
min_{x∈C} ⟨d, x⟩ (P)
Then x̄ solves (P) if and only if
−d ∈ NC(x̄).
2.10 Differentiability
Definition 2.10.2 (Differentiable)
Let f : Rm → (−∞, ∞] be proper and x ∈ dom f. f is differentiable at x if there is a vector ∇f(x) ∈ Rm, called the gradient of f at x, that satisfies
lim_{0≠‖y‖→0} |f(x + y) − f(x) − ⟨∇f(x), y⟩| / ‖y‖ = 0.
Theorem 2.10.1
Let f : Rm → (−∞, ∞] be convex. Suppose f (x) < ∞. For each y, the quotient in
the definition of f 0 (x; y) is a non-decreasing function of λ > 0. So f 0 (x; y) exists and
f'(x; y) = inf_{λ>0} (f(x + λy) − f(x)) / λ.
Theorem 2.10.2
Let f : Rm → (−∞, ∞] be convex and proper. Let x ∈ dom f and u ∈ Rm . Then u
is a subgradient of f at x if and only if for all y ∈ Rm, ⟨u, y⟩ ≤ f'(x; y).
Proof
By definition,
Theorem 2.10.3
Let f : Rm → (−∞, ∞] be convex and proper. Suppose x ∈ dom f . If f is differen-
tiable at x, then ∇f (x) is the unique subgradient of f at x.
Proof
Recall that for each y ∈ Rm ,
f 0 (x; y) = h∇f (x), yi.
Lemma 2.10.4
Let ϕ : R → (−∞, ∞] be a proper function that is differentiable on an interval
∅ 6= I ⊆ dom ϕ. If ϕ0 is increasing on I, then ϕ is convex on I.
Proof
Fix x, y ∈ I and λ ∈ (0, 1). Let ψ : R → (−∞, ∞] be given by
ψ(z) := λϕ(x) + (1 − λ)ϕ(z) − ϕ(λx + (1 − λ)z).
Then
ψ'(z) = (1 − λ)ϕ'(z) − (1 − λ)ϕ'(λx + (1 − λ)z)
and ψ'(x) = 0 = ψ(x).
Since ϕ' is increasing, ψ'(z) ≤ 0 when z < x and ψ'(z) ≥ 0 whenever z > x. It follows that ψ achieves its infimum on I at x. In particular, 0 = ψ(x) ≤ ψ(y), i.e.
ϕ(λx + (1 − λ)y) ≤ λϕ(x) + (1 − λ)ϕ(y),
as desired.
Proposition 2.10.5
Let f : Rm → (−∞, ∞] be proper. Suppose that dom f is open and convex, and that f
is differentiable on dom f . The following are equivalent.
(i) f is convex
(ii) ∀x, y ∈ dom f, hx − y, ∇f (y)i + f (y) ≤ f (x)
(iii) ∀x, y ∈ dom f, hx − y, ∇f (x) − ∇f (y)i ≥ 0
Proof
(i) =⇒ (ii) ∇f (y) is the unique subgradient of f at y. Hence for all x ∈ Rm and y ∈ dom f ,
f (x) ≥ hx − y, ∇f (y)i + f (y).
(iii) =⇒ (i) Fix x, y ∈ dom f. By assumption, dom f is open. Thus there is some ε > 0 such that
y + (1 + ε)(x − y) = x + ε(x − y) ∈ dom f,
y − ε(x − y) = y + ε(y − x) ∈ dom f.
By the convexity of dom f, for every α ∈ (−ε, 1 + ε), y + α(x − y) ∈ dom f.
That is, ϕ0 is increasing on C and ϕ is convex on C. But then
Example 2.10.6
Let A be an m × m matrix, and let f : Rm → R be given by
2.11 Conjugacy
Proposition 2.11.1
Let f, g be functions from Rm → [−∞, ∞]. Then
(1) f ∗∗ := (f ∗ )∗ ≤ f
(2) f ≤ g =⇒ f ∗ ≥ g ∗ , f ∗∗ ≤ g ∗∗
Proof
By definition, f* takes the value −∞ at some point if and only if f ≡ ∞, in which case both claims are trivial; hence we may assume f* > −∞ everywhere.
as desired.
Proposition 2.11.3
Let f : Rm → (−∞, ∞] be convex and proper. For x, u ∈ Rm,
u ∈ ∂f(x) ⇐⇒ f(x) + f*(u) = ⟨x, u⟩.
Proof
We have
u ∈ ∂f (x)
⇐⇒ ∀y ∈ dom f, hy − x, ui + f (x) ≤ f (y)
⇐⇒ ∀y ∈ dom f, hy, ui − f (y) ≤ hx, ui − f (x)
⇐⇒ f*(u) = sup_{y∈Rm} ⟨y, u⟩ − f(y) ≤ ⟨x, u⟩ − f(x).
Since f*(u) ≥ ⟨x, u⟩ − f(x) always holds, this is equivalent to f(x) + f*(u) = ⟨x, u⟩.
Proposition 2.11.4
Let f : Rm → (−∞, ∞] be convex and proper. Pick x ∈ Rm such that ∂f(x) ≠ ∅. Then
f ∗∗ (x) = f (x).
Proof
Let u ∈ ∂f (x). By the previous proposition,
Consequently,
f**(x) = sup_{v∈Rm} ⟨x, v⟩ − f*(v) ≥ ⟨x, u⟩ − f*(u) = f(x),
where the final equality is the previous proposition applied to u ∈ ∂f(x). Conversely, f**(x) ≤ f(x) by an earlier proposition. Hence f**(x) = f(x).
Proposition 2.11.5
Let f : Rm → (−∞, ∞] be proper. Then f is convex and l.s.c. if and only if
f = f ∗∗ .
Corollary 2.11.5.1
Let f : Rm → (−∞, ∞] be convex, l.s.c. and proper. Then
(i) f ∗ is convex, l.s.c., and proper
(ii) f ∗∗ = f
Proof
To see (i), combine the previous proposition and the fact that f ∗ is always convex and
l.s.c.
Proposition 2.11.6
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then
u ∈ ∂f (x) ⇐⇒ x ∈ ∂f ∗ (u).
Proof
Recall that
u ∈ ∂f (x) ⇐⇒ f (x) + f ∗ (u) = hx, ui.
By a previous proposition, g := f* satisfies g* = f. Moreover, g is convex, l.s.c., and
proper.
Hence,
x ∈ ∂f*(u) = ∂g(u) ⇐⇒ g(u) + g*(x) = ⟨u, x⟩ ⇐⇒ f(x) + f*(u) = ⟨x, u⟩ ⇐⇒ u ∈ ∂f(x),
as desired.
2.12 Coercive Functions
Theorem 2.12.1
Let f : Rm → (−∞, ∞] be proper and l.s.c., and let C ⊆ Rm be compact with
C ∩ dom f ≠ ∅.
Then f is bounded below over C and attains its minimal value over C.
Proof
(i): Suppose towards a contradiction that f is not bounded below over C. There is a
sequence xn in C such that
lim f (xn ) = −∞.
n
A function f : Rm → (−∞, ∞] is coercive if
lim_{‖x‖→∞} f(x) = ∞.
Definition 2.12.2 (Super Coercive)
Let f : Rm → (−∞, ∞]. Then f is super coercive if
lim_{‖x‖→∞} f(x)/‖x‖ = ∞.
Theorem 2.12.2
Let f : Rm → (−∞, ∞] be proper, l.s.c., and coercive. Let C ⊆ Rm be a closed subset
of Rm satisfying
C ∩ dom f 6= ∅.
Then f attains its minimal value over C.
Proof
Let x ∈ C ∩ dom f. Since f is coercive, there is some M ≥ ‖x‖ such that f(y) > f(x) whenever ‖y‖ > M.
But then the set of minimizers of f over C is the same as the set of minimizers of f over
C ∩ B(0; M ). This set is compact. Hence by the previous theorem, f attains its minimal
value over C.
A map T is Lipschitz continuous with constant L ≥ 0 if for all x, y,
‖Tx − Ty‖ ≤ L‖x − y‖.
Example 2.13.1
Let f : Rm → R be given by
x ↦ ½⟨x, Ax⟩ + ⟨b, x⟩ + c
where A ⪰ 0 is symmetric positive semi-definite, b ∈ Rm and c ∈ R. Then
(i) ∇f(x) = Ax + b for all x ∈ Rm
(ii) ∇f is Lipschitz with constant ‖A‖, the operator norm of A
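Both claims are easy to probe numerically (our own sketch; A is a random symmetric PSD matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M @ M.T                        # symmetric positive semi-definite
b = rng.normal(size=4)

grad = lambda x: A @ x + b         # gradient of 0.5<x, Ax> + <b, x> + c
L = np.linalg.norm(A, 2)           # operator norm of A

for _ in range(200):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
```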
Example 2.13.2
Let ∅ 6= C ⊆ Rm be closed and convex. Then PC is Lipschitz continuous with constant
1.
Proof
Recall that the fundamental theorem of calculus implies that
f(y) − f(x) = ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt
= ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt.
Hence, if ∇f is L-Lipschitz, the Cauchy–Schwarz inequality bounds the integrand by Lt‖y − x‖², and integrating gives
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².
Theorem 2.13.4
Let f : Rm → R be convex and differentiable and L > 0. The following are equivalent:
(i) ∇f is L-Lipschitz
(ii) for all x, y ∈ Rm, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²
(iii) for all x, y ∈ Rm, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖²
(iv) for all x, y ∈ Rm, ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²
Proof
(i) =⇒ (ii): This is the descent lemma.
(ii) =⇒ (iii): If ∇f(x) = ∇f(y), then this follows immediately from the subgradient inequality and the fact that ∂f(x) = {∇f(x)}.
Indeed,
By construction, ∇hx (x) = 0. But the convexity of hx then asserts that x is a global
minimizer of hx . That is, for all z ∈ Rn ,
hx (x) ≤ hx (z).
Pick y, v ∈ Rm such that ‖v‖ = 1 and ⟨∇hx(y), v⟩ = ‖∇hx(y)‖. Set
z := y − (‖∇hx(y)‖/L) v.
Then
0 = hx(x) ≤ hx(z) = hx(y − (‖∇hx(y)‖/L) v).
Applying (ii) to hx, whose gradient ∇hx(·) = ∇f(·) − ∇f(x) is also L-Lipschitz,
0 = hx(x)
≤ hx(y) − (‖∇hx(y)‖/L)⟨∇hx(y), v⟩ + (1/(2L))‖∇hx(y)‖²‖v‖²
= hx(y) − (1/L)‖∇hx(y)‖² + (1/(2L))‖∇hx(y)‖²
= hx(y) − (1/(2L))‖∇hx(y)‖²
= f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2L))‖∇f(x) − ∇f(y)‖².
That is,
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖².
(iii) =⇒ (iv): Adding this inequality to its counterpart with the roles of x and y swapped,
f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (1/(2L))‖∇f(y) − ∇f(x)‖²,
yields (iv).
(iv) =⇒ (i): If ∇f (x) = ∇f (y), the implication is trivial. We proceed assuming otherwise.
We have
Example 2.13.5 (Firm Nonexpansiveness)
Let ∅ ≠ C ⊆ Rm be closed and convex. Then PC is firmly nonexpansive: for each x, y ∈ Rm,
‖PC x − PC y‖² ≤ ⟨x − y, PC x − PC y⟩.
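Firm nonexpansiveness is again easy to probe numerically with the unit-ball projection (our own sketch):

```python
import numpy as np

def proj_unit_ball(x):
    # projection onto the closed unit ball
    return x / max(np.linalg.norm(x), 1.0)

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.normal(size=3) * 3, rng.normal(size=3) * 3
    px, py = proj_unit_ball(x), proj_unit_ball(y)
    # ||Px - Py||^2 <= <x - y, Px - Py>
    assert np.dot(px - py, px - py) <= np.dot(x - y, px - py) + 1e-12
```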
Example 2.13.6
Let ∅ 6= C ⊆ Rm be closed and convex. Let f : Rm → R be given by
1
f (x) = d2C (x).
2
Then f is differentiable with ∇f = Id − PC.
Theorem 2.13.7
Let f : Rm → R be convex and twice continuously differentiable, and let L ≥ 0. Then ∇f is L-Lipschitz if and only if ‖∇²f(x)‖ ≤ L for all x ∈ Rm.
Proof
(i) =⇒ (ii) Suppose that ∇f is L-Lipschitz continuous. For any y ∈ Rm and α > 0,
That is,
‖∇²f(x)(y)‖ = lim_{α→0⁺} ‖∇f(x + αy) − ∇f(x)‖/α ≤ L‖y‖.
Equivalently,
‖∇²f(x)‖ ≤ L
as desired. Note that we used the fact that ∇²f(x)(y) = (∇f)'(x; y).
(ii) =⇒ (i) Suppose that k∇2 f (x)k ≤ L and fix x, y ∈ Rm . By the fundamental theorem
of calculus,
∇f(x) = ∇f(y) + ∫₀¹ ∇²f(y + α(x − y))(x − y) dα
= ∇f(y) + (∫₀¹ ∇²f(y + α(x − y)) dα)(x − y).
Hence
‖∇f(x) − ∇f(y)‖ ≤ ‖∫₀¹ ∇²f(y + α(x − y)) dα‖ · ‖x − y‖
≤ ∫₀¹ ‖∇²f(y + α(x − y))‖ dα · ‖x − y‖
≤ L‖x − y‖.
Proposition 2.13.8
For a symmetric A ∈ Rm×m,
‖A‖ = max{|λ| : λ is an eigenvalue of A}.
Proof
Write x as a linear combination of some orthonormal eigenvector basis of A.
Proposition 2.13.9
A twice continuously differentiable function f : Rm → R is convex if and only if ∇²f(x) is positive semi-definite for every x ∈ Rm.
Proof
See A3.
Corollary 2.13.9.1
Let f : Rm → R be convex and twice continuously differentiable. Suppose L ≥ 0. Then
∇f is L-Lipschitz if and only if for all x ∈ Rm,
λmax(∇²f(x)) ≤ L.
Proof
Since f is convex and twice continuously differentiable, ∇²f(x) is positive semidefinite everywhere. Combined with the earlier result,
L ≥ ‖∇²f(x)‖ = |λmax(∇²f(x))| = λmax(∇²f(x)).
Example 2.13.10
Let f : Rm → R be given by
x ↦ √(1 + ‖x‖²).
Then
(i) f is convex
(ii) ∇f is 1-Lipschitz
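Here ∇f(x) = x/√(1 + ‖x‖²); a quick numerical check of both claims (our own sketch):

```python
import numpy as np

f = lambda x: np.sqrt(1 + np.dot(x, x))
grad = lambda x: x / np.sqrt(1 + np.dot(x, x))

rng = np.random.default_rng(4)
for _ in range(500):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # midpoint convexity
    assert f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12
    # 1-Lipschitz gradient
    assert np.linalg.norm(grad(x) - grad(y)) <= np.linalg.norm(x - y) + 1e-12
```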
Proposition 2.13.11
Let β > 0. f : Rm → (−∞, ∞] is β-strongly convex if and only if
f − (β/2)‖·‖²
is convex.
Proof
See A3.
Proposition 2.13.12
Let f, g : Rm → (−∞, ∞] and β > 0. Suppose that f is β-strongly convex and that g is
convex. Then f + g is β-strongly convex.
Proof
Define
h := (f − (β/2)‖·‖²) + g.
Then h is convex as it is the sum of two convex functions, and h = (f + g) − (β/2)‖·‖². Thus applying the previous proposition yields the result.
Proposition 2.13.13
Let f : Rm → (−∞, ∞] be strongly convex, l.s.c., and proper. Then f has a unique
minimizer.
Theorem 2.14.1
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then for every x ∈ Rm ,
Proxf (x) is a singleton.
Proof
For a fixed x ∈ Rm,
hx := ½‖· − x‖²
is 1-strongly convex. Therefore,
gx := f + hx
is strongly convex.
We know that gx is l.s.c. as f, hx are l.s.c. Moreover, gx is proper as f, hx are proper with dom gx = dom f. Thus from the previous proposition, gx has a unique minimizer, namely Proxf(x).
Example 2.14.2
For ∅ 6= C ⊆ Rm closed and convex,
ProxδC = PC .
Proposition 2.14.3
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Let x, p ∈ Rm . Then p = Proxf (x)
if and only if for all y ∈ Rm ,
hy − p, x − pi + f (p) ≤ f (y).
Proof
( =⇒ ) Suppose that p = Proxf (x). For each λ ∈ (0, 1), set
pλ := λy + (1 − λ)p.
Thus, since p minimizes u ↦ f(u) + ½‖u − x‖²,
f(p) ≤ f(pλ) + ½‖x − pλ‖² − ½‖x − p‖²
= f(pλ) + ½‖x − λy − (1 − λ)p‖² − ½‖x − p‖²
= f(pλ) + ½⟨x − p − λ(y − p) − (x − p), x − p − λ(y − p) + (x − p)⟩
= f(pλ) + ½⟨−λ(y − p), 2(x − p) − λ(y − p)⟩
= f(pλ) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩
= f(λy + (1 − λ)p) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩.
By convexity of f,
f(p) ≤ λf(y) + (1 − λ)f(p) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩,
so
λ⟨x − p, y − p⟩ + λf(p) ≤ λf(y) + (λ²/2)‖y − p‖².
Division by λ and taking the limit as λ → 0⁺ yields the result.
(⇐) Suppose that for all y ∈ Rm,
⟨y − p, x − p⟩ + f(p) ≤ f(y).
Then
f(p) ≤ f(y) − ⟨y − p, x − p⟩ = f(y) + ⟨x − p, p − y⟩.
It follows that
f(p) + ½‖x − p‖² ≤ f(y) + ⟨x − p, p − y⟩ + ½‖x − p‖²
≤ f(y) + ⟨x − p, p − y⟩ + ½‖x − p‖² + ½‖p − y‖²
= f(y) + ½‖x − p + p − y‖²
= f(y) + ½‖x − y‖²,
so p = Proxf(x).
Example 2.14.4
Let f : R → R be given by
x ↦ |x|.
Then
Proxf(x) = x − 1 if x > 1; 0 if x ∈ [−1, 1]; x + 1 if x < −1.
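This is the soft-thresholding operator (here with threshold 1); a brute-force comparison against the defining argmin (our own sketch):

```python
import numpy as np

def soft_threshold(x, lam=1.0):
    # prox of lam*|.|: shrink |x| by lam, snapping to 0 inside [-lam, lam]
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

us = np.linspace(-10, 10, 200001)
for x in (-3.0, -0.5, 0.0, 0.7, 2.5):
    brute = us[np.argmin(np.abs(us) + 0.5 * (us - x) ** 2)]
    assert abs(brute - soft_threshold(x)) < 1e-3
```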
Proposition 2.14.5
Let f : Rm → R be convex, l.s.c., and proper. Then x minimizes f over Rm if and only
if
x = Proxf (x).
Proof
By the previous proposition, x = Proxf(x) if and only if for all y ∈ Rm,
f(x) = ⟨y − x, x − x⟩ + f(x) ≤ f(y),
i.e. x minimizes f over Rm.
Example 2.14.6
Let g, h : R → R be given by
g(x) := 0 if x ≠ 0; λ if x = 0,
h(x) := 0 if x ≠ 0; −λ if x = 0,
where λ > 0.
Then
Proxh(x) = {x} if |x| > √(2λ); {0, x} if |x| = √(2λ); {0} if |x| < √(2λ),
and
Proxg(x) = {x} if x ≠ 0; ∅ if x = 0.
For f(x) := λ|x| and all x ∈ R,
Proxf(x) = x − λ if x > λ; 0 if x ∈ [−λ, λ]; x + λ if x < −λ.
Theorem 2.14.8
Suppose f : Rm → (−∞, ∞] is given by
f(x) := Σ_{i=1}^m fi(xi)
where each fi : R → (−∞, ∞] is convex, l.s.c., and proper. Then
Proxf(x) = (Proxf1(x1), …, Proxfm(xm)).
Proof
From A2, f is convex, l.s.c., and proper. We know that
Conversely, if fi (yi ) ≥ fi (pi ) + (yi − pi )(xi − pi ) for each i ∈ [m], then clearly p = Proxf (x).
Example 2.14.9
Let g : Rm → (−∞, ∞] be given by
x ↦ −α Σ_{i=1}^m log xi if x > 0, and ∞ otherwise,
where α > 0. Then
Proxg(x) = ((xi + √(xi² + 4α))/2)_{i=1}^m
since
Proxgi(xi) = (xi + √(xi² + 4α))/2.
This can be proven by differentiating to find the minimizer of hi(yi) := gi(yi) + ½(yi − xi)².
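The one-dimensional formula can indeed be verified from the stationarity condition −α/p + (p − x) = 0, i.e. p² − xp − α = 0 (our own sketch):

```python
import numpy as np

def prox_log(x, alpha):
    # positive root of p^2 - x*p - alpha = 0
    return (x + np.sqrt(x ** 2 + 4 * alpha)) / 2

for x in (-2.0, 0.0, 1.0, 5.0):
    for alpha in (0.5, 1.0, 3.0):
        p = prox_log(x, alpha)
        assert p > 0                              # stays in dom g
        assert abs(-alpha / p + (p - x)) < 1e-9   # stationarity of g + 0.5(.-x)^2
```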
Theorem 2.14.10
Let g : Rm → (−∞, ∞] be proper and c > 0. Let a ∈ Rm, γ ∈ R. For each x ∈ Rm, define
f(x) = g(x) + (c/2)‖x‖² + ⟨a, x⟩ + γ.
Then for all x ∈ Rm,
Proxf(x) = Prox_{(1/(c+1))g} ((x − a)/(c + 1)).
Proof
Indeed, recall that
Proxf(x) := argmin_{u∈Rm} f(u) + ½‖u − x‖²
= argmin_{u∈Rm} g(u) + (c/2)‖u‖² + ⟨a, u⟩ + γ + ½‖u − x‖².
Now,
(c/2)‖u‖² + ⟨a, u⟩ + ½‖u − x‖² = (c/2)‖u‖² + ⟨a, u⟩ + ½‖u‖² − ⟨u, x⟩ + ½‖x‖²
= ((c+1)/2)‖u‖² − ⟨u, x − a⟩ + ½‖x‖²
= ((c+1)/2) ‖u − (x − a)/(c+1)‖² − ‖x − a‖²/(2(c+1)) + ½‖x‖².
Finally, since minimizers are preserved under positive scalar multiplication and addition of constants,
Proxf(x) = argmin_{u∈Rm} g(u) + ((c+1)/2) ‖u − (x − a)/(c+1)‖²
= argmin_{u∈Rm} (1/(c+1)) g(u) + ½ ‖u − (x − a)/(c+1)‖²
=: Prox_{(1/(c+1))g} ((x − a)/(c+1)).
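The theorem can be exercised numerically with g = |·| in one dimension, whose scaled prox is soft thresholding; we compare against a brute-force argmin (our own sketch; all helper names are ours):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_f(x, c, a):
    # with g = |.|: Prox_f(x) = Prox_{(1/(c+1))g}((x - a)/(c + 1))
    return soft_threshold((x - a) / (c + 1), 1.0 / (c + 1))

c, a, gamma = 2.0, 0.3, 5.0
us = np.linspace(-10, 10, 200001)
for x in (-4.0, -1.0, 0.0, 0.2, 3.0):
    objective = np.abs(us) + c / 2 * us ** 2 + a * us + gamma + 0.5 * (us - x) ** 2
    brute = us[np.argmin(objective)]
    assert abs(brute - prox_f(x, c, a)) < 1e-3
```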
Example 2.14.11
Let µ ∈ R and α ≥ 0. Consider f : R → (−∞, ∞] given by
f(x) := µx if x ∈ [0, α], and ∞ otherwise.
For each x ∈ R,
f(x) = µx + δ[0,α](x).
Moreover,
Proxf(x) = min(max(x − µ, 0), α).
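A brute-force check of this formula (our own sketch):

```python
import numpy as np

def prox_linear_box(x, mu, alpha):
    # prox of f(x) = mu*x + delta_[0, alpha](x)
    return min(max(x - mu, 0.0), alpha)

mu, alpha = 0.7, 2.0
us = np.linspace(0.0, alpha, 200001)       # the feasible set [0, alpha]
for x in (-3.0, 0.5, 1.0, 2.9, 10.0):
    brute = us[np.argmin(mu * us + 0.5 * (us - x) ** 2)]
    assert abs(brute - prox_linear_box(x, mu, alpha)) < 1e-3
```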
Theorem 2.14.12
Let g : R → (−∞, ∞] be convex, l.s.c. and proper such that dom g ⊆ [0, ∞), and let f : Rm → (−∞, ∞] be given by
f(x) = g(‖x‖).
Then
Proxf(x) = Proxg(‖x‖) x/‖x‖ if x ≠ 0, and Proxf(x) = {u ∈ Rm : ‖u‖ = Proxg(0)} if x = 0.
Proof
Case I: x = 0. By definition,
Proxf(0) = argmin_{u∈Rm} f(u) + ½‖u‖².
The objective depends on u only through ‖u‖, so with the change of variable w = ‖u‖, the set of optimal norms is
argmin_{w≥0} g(w) + ½w² =: Proxg(0),
and hence Proxf(0) = {u ∈ Rm : ‖u‖ = Proxg(0)}.
66
Case II: x ≠ 0. By definition, Prox_f(x) is the set of solutions to the minimization problem
  min_{u∈R^m} g(‖u‖) + ½‖u − x‖²
  = min_{u∈R^m} g(‖u‖) + ½‖u‖² − ⟨u, x⟩ + ½‖x‖²
  = min_{α≥0} min_{u : ‖u‖=α} g(α) + ½α² − ⟨u, x⟩ + ½‖x‖².
Now, ⟨u, x⟩ ≤ ‖u‖ · ‖x‖ by the Cauchy–Schwarz inequality, with equality when u = λx for some λ ≥ 0. Thus the inner minimum is attained at u = α · x/‖x‖, and the problem becomes
  min_{α≥0} g(α) + ½α² − α‖x‖ + ½‖x‖²
  = min_{α≥0} g(α) + ½(α − ‖x‖)².
This is precisely the problem defining Prox_g(‖x‖). Hence
  Prox_f(x) = Prox_g(‖x‖) · x/‖x‖
as desired.
Example 2.14.13
Let α > 0, λ ≥ 0, and f : R → (−∞, ∞] be given by
  f(x) = λ|x| if |x| ≤ α;  ∞ if |x| > α.
Define
  g(x) = λx if x ∈ [0, α];  ∞ if x ∉ [0, α],
so that f(x) = g(|x|). By the previous theorem,
  Prox_f(x) = Prox_g(|x|) sgn(x) if x ≠ 0;  0 if x = 0
            = min(max(|x| − λ, 0), α) · sgn(x).
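The resulting clipped soft threshold is easy to verify against a direct minimization over the domain. A minimal sketch; function names are mine:

```python
import math

def prox_clipped_abs(x, lam, alpha):
    """Prox of f(x) = lam*|x| on [-alpha, alpha] (inf outside), per
    Example 2.14.13: soft threshold, then clip to the domain."""
    return math.copysign(min(max(abs(x) - lam, 0.0), alpha), x)

def prox_brute(x, lam, alpha):
    # Grid search restricted to the domain [-alpha, alpha]
    grid = [-alpha + 2 * alpha * k / 100000 for k in range(100001)]
    return min(grid, key=lambda u: lam * abs(u) + 0.5 * (u - x) ** 2)

for x in [-3.0, -0.2, 0.0, 0.7, 5.0]:
    assert abs(prox_clipped_abs(x, 0.5, 1.25) - prox_brute(x, 0.5, 1.25)) < 1e-3
```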
Example 2.14.14
Let w, α ∈ R^m_+ and f : R^m → (−∞, ∞] be given by
  f(x) = Σ_{i=1}^m w_i|x_i| if −α ≤ x ≤ α;  ∞ otherwise.
By Theorem 2.14.8 and the previous example,
  Prox_f(x) = ( min(max(|x_i| − w_i, 0), α_i) · sgn(x_i) )_{i=1}^m.

Let the sequence (x_n)_{n≥0} be recursively defined by x_0 ∈ R^m and x_{n+1} := Prox_f(x_n). Then x_n → x̄ where x̄ is a minimizer of (P).
Recall that T : R^m → R^m is nonexpansive if for all x, y ∈ R^m,
  ‖Tx − Ty‖ ≤ ‖x − y‖,
and firmly nonexpansive (f.n.e.) if
  ‖Tx − Ty‖² + ‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖².
68
Definition 2.15.3 (Averaged)
Let T : Rm → Rm and α ∈ (0, 1). Then T is α-averaged if there is some N : Rm → Rm
such that N is nonexpansive and
T = (1 − α) Id +αN.
Proposition 2.15.1
Let T : R^m → R^m. The following are equivalent:
(i) T is f.n.e.
(ii) Id − T is f.n.e.
(iii) 2T − Id is nonexpansive.
(iv) For all x, y ∈ R^m, ‖Tx − Ty‖² ≤ ⟨x − y, Tx − Ty⟩.
(v) For all x, y ∈ R^m, ⟨Tx − Ty, (Id − T)x − (Id − T)y⟩ ≥ 0.
Proof
(i) ⇐⇒ (ii): This is clear from the definition.
Proposition 2.15.2
Let T : R^m → R^m be linear. Then the following are equivalent:
(i) T is f.n.e.
(ii) ‖2T − Id‖ ≤ 1.
(iii) For all x ∈ R^m, ‖Tx‖² ≤ ⟨x, Tx⟩.
(iv) For all x ∈ R^m, ⟨Tx, x − Tx⟩ ≥ 0.
Proof
(i) ⇐⇒ (ii): We know that T is f.n.e. if and only if 2T − Id is nonexpansive. This happens if and only if for all x ≠ y,
  ‖(2T − Id)x − (2T − Id)y‖ = ‖(2T − Id)(x − y)‖ ≤ ‖x − y‖,
which holds if and only if ‖2T − Id‖ ≤ 1.
(i) ⇐⇒ (iii): This is easily seen from the previous proposition and the fact that Tx − Ty = T(x − y).
(i) ⇐⇒ (iv): This follows by applying the previous proposition and observing that Tx − Ty = T(x − y) as well as (Id − T)x − (Id − T)y = (Id − T)(x − y).
Example 2.15.3
Let ∅ ≠ C ⊆ R^m be convex and closed. Then P_C is f.n.e. Indeed, for all x, y ∈ R^m,
  ⟨P_C x − P_C y, x − y⟩ ≥ ‖P_C x − P_C y‖².
Example 2.15.4
Suppose that T = −½ Id. Then T is averaged but NOT f.n.e. We have
  T = ¼ Id + ¾ (−Id),
and so T is ¾-averaged.
Example 2.15.5
T := −Id is nonexpansive but NOT averaged. Indeed, suppose there is some nonexpansive N : R^m → R^m and α ∈ (0, 1) such that
  T = (1 − α) Id + αN ⇐⇒ −Id = (1 − α) Id + αN
                      ⇐⇒ (α − 2) Id = αN
                      ⇐⇒ N = ((α − 2)/α) Id.
But then
  ‖N‖ = |α − 2|/α ≤ 1 ⇐⇒ (2 − α)/α ≤ 1 ⇐⇒ 2 − α ≤ α ⇐⇒ α ≥ 1,
a contradiction.
Proposition 2.15.6
Let T : Rm → Rm be nonexpansive. Then T is continuous.
Proof
Suppose xn → x̄. Then
  ‖T x_n − T x̄‖ ≤ ‖x_n − x̄‖ → 0.
Example 2.16.1
Suppose Fix T ≠ ∅ for some nonexpansive T : R^m → R^m. For any x_0 ∈ R^m, the sequence defined recursively by
  x_n := T(x_{n−1})
is Fejér monotone with respect to Fix T, i.e. ‖x_{n+1} − c‖ ≤ ‖x_n − c‖ for every c ∈ Fix T and n ∈ N.
71
Proposition 2.16.2
Let ∅ ≠ C ⊆ R^m and let (x_n)_{n≥0} be a Fejér monotone sequence in R^m with respect to C. The following hold:
(i) (xn ) is bounded
(ii) for every c ∈ C, (kxn − ck)n≥0 converges
(iii) (dC (xn ))n≥0 is decreasing and converges
Proof
Fix c ∈ C. By Fejér monotonicity we have
  ‖x_{n+1} − c‖ ≤ ‖x_n − c‖ ≤ ⋯ ≤ ‖x_0 − c‖,
so (x_n) is bounded, proving (i). Now, (‖x_n − c‖)_{n≥0} is bounded below by 0 and monotonic, hence necessarily converges to its infimum, proving (ii). Finally, (iii) follows by taking the infimum over c ∈ C.
Proposition 2.16.3
A bounded sequence (xn )n∈N in Rm converges if and only if it has a unique cluster point.
Proof
The forward direction is clear. Suppose now that (x_n)_{n∈N} has a unique cluster point x̄ but x_n ↛ x̄. Then there is some ε > 0 and a subsequence (x_{k_n}) such that for all n,
  ‖x_{k_n} − x̄‖ ≥ ε.
But then (x_{k_n})_{n∈N} is bounded and hence contains a convergent subsequence. This is still a subsequence of (x_n)_{n∈N}, and its limit keeps distance at least ε from x̄, so it is a cluster point distinct from x̄. It follows that (x_n)_{n∈N} has more than one cluster point, a contradiction. Hence x_n → x̄.
72
Lemma 2.16.4
Let (xn )n∈N be a sequence in Rm and ∅ 6= C ⊆ Rm be such that for all c ∈ C,
(kxn − ck)n∈N converges and every cluster point of (xn )n∈N lies in C.
Then (xn )n∈N converges to a point in C.
Proof
(xn ) is necessarily bounded since kxn k ≤ kck + kxn − ck is bounded. It suffices by the
previous proposition to show that (xn )n∈N has a unique cluster point.
Let x, y be two cluster points of (xn )n∈N . That is, there are subsequences
xkn → x, x`n → y.
Observe that
  2⟨x_n, x − y⟩ = ‖x_n − y‖² − ‖x_n − x‖² + ‖x‖² − ‖y‖² → L ∈ R,
since (‖x_n − y‖) and (‖x_n − x‖) converge (x, y ∈ C by assumption). Taking the limit along the two subsequences,
  2⟨x, x − y⟩ = L = 2⟨y, x − y⟩,
so that
  ⟨x − y, x − y⟩ = ‖x − y‖² = 0, and hence x = y.
Theorem 2.16.5
Let ∅ 6= C ⊆ Rm and (xn )n∈N a sequence in Rm . Suppose that (xn )n∈N is Féjer
monotone with respect to C, and that every cluster point of (xn )n∈N lies in C. Then
(xn )n∈N converges to a point in C.
Proof
We know that for all c ∈ C,
kxn − ck
converges. Hence the result follows from the previous lemma.
73
Let x, y ∈ R^m and α ∈ R. By computation,
  ‖αx + (1 − α)y‖² + α(1 − α)‖x − y‖² = α‖x‖² + (1 − α)‖y‖².
Theorem 2.16.6
Let α ∈ (0, 1) and T : R^m → R^m be α-averaged such that Fix T ≠ ∅. Let x_0 ∈ R^m.
Define
xn+1 := T xn .
The following hold:
(i) (xn )n∈N is Fejér monotone with respect to Fix T .
(ii) T xn − xn → 0.
(iii) (xn )n∈N converges to a point in Fix T .
Proof
Now, T being averaged implies that it is nonexpansive. The example earlier then shows that (x_n)_{n∈N} is Fejér monotone with respect to Fix T, proving (i). Write
  T = (1 − α) Id + αN
with N nonexpansive and Fix T = Fix N. Let f ∈ Fix T. By the identity above,
  ‖x_{n+1} − f‖² = (1 − α)‖x_n − f‖² + α‖N x_n − N f‖² − α(1 − α)‖N x_n − x_n‖²
                 ≤ ‖x_n − f‖² − α(1 − α)‖N x_n − x_n‖².
Summing over n, α(1 − α) Σ_n ‖N x_n − x_n‖² ≤ ‖x_0 − f‖² < ∞, so N x_n − x_n → 0. In particular,
  T x_n − x_n = α(N x_n − x_n) → 0,
proving (ii). Now, (x_n)_{n∈N} is Fejér monotone with respect to Fix T = Fix N. Let x̄ be a cluster point of (x_n)_{n∈N}, say x_{k_n} → x̄. Observe that N being nonexpansive implies that N is continuous. By continuity and (ii),
  N x̄ = lim_n N x_{k_n} = lim_n x_{k_n} = x̄.
That is, every cluster point of (x_n)_{n∈N} lies in Fix N = Fix T. Combined with a previous theorem, this yields (iii).
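The behavior in this theorem can be observed numerically. A minimal sketch, assuming a hypothetical nonexpansive map N (a rotation by 90°, with Fix N = {0}) and building the ¾-averaged operator T from it:

```python
import math

def N(p):
    # A nonexpansive map with Fix N = {(0, 0)}: rotation by 90 degrees
    x, y = p
    return (-y, x)

def T(p, alpha=0.75):
    # The alpha-averaged operator T = (1 - alpha) Id + alpha N
    x, y = p
    nx, ny = N(p)
    return ((1 - alpha) * x + alpha * nx, (1 - alpha) * y + alpha * ny)

p = (4.0, 3.0)
norms = []
for _ in range(200):
    q = T(p)
    norms.append(math.hypot(q[0] - p[0], q[1] - p[1]))  # residual ||T x_n - x_n||
    p = q

assert math.hypot(*p) < 1e-6   # (iii): x_n converges to the fixed point 0
assert norms[-1] < norms[0]    # (ii): residuals T x_n - x_n shrink
```

Plain iteration of N itself would circle forever; the averaging step is what forces convergence.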
Corollary 2.16.6.1
Let T : R^m → R^m be f.n.e. and suppose that Fix T ≠ ∅. Let x_0 ∈ R^m. Recursively define
xn+1 := T xn .
There is some x̄ ∈ Fix T such that
xn → x̄.
Proof
Since T is f.n.e., T is ½-averaged. The result then follows from the previous theorem.
Proposition 2.16.7
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then Proxf is f.n.e.
Proof
Let x, y ∈ R^m. Set p := Prox_f(x) and q := Prox_f(y). By the characterization of proximal points, for all z ∈ R^m,
  ⟨z − p, x − p⟩ + f(p) ≤ f(z)
  ⟨z − q, y − q⟩ + f(q) ≤ f(z).
Choosing z = q in the first inequality and z = p in the second,
  ⟨q − p, x − p⟩ + f(p) ≤ f(q)
  ⟨p − q, y − q⟩ + f(q) ≤ f(p).
Adding these and simplifying,
  ⟨q − p, (x − p) − (y − q)⟩ ≤ 0, i.e. ⟨p − q, (x − p) − (y − q)⟩ ≥ 0,
which is precisely characterization (v) of firm nonexpansiveness.
Corollary 2.16.7.1
Let f : R^m → (−∞, ∞] be convex, l.s.c., and proper such that argmin f ≠ ∅. Let x_0 ∈ R^m and update via
xn+1 = Proxf (xn ).
There is some x̄ ∈ argmin f such that xn → x̄.
Proof
Recall that
x ∈ argmin f ⇐⇒ x = Proxf (x) ⇐⇒ x ∈ Fix Proxf .
Thus argmin f = Fix Prox_f ≠ ∅.
By the previous proposition, Proxf is f.n.e. Thus the result follows from a previous
theorem.
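The resulting proximal point algorithm can be seen in one dimension with f = |·|, whose prox is the soft threshold derived earlier. A short sketch:

```python
def prox_abs(x, lam=1.0):
    # Prox of lam*|.|: the soft threshold
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

x = 5.25
trace = [x]
for _ in range(10):
    x = prox_abs(x)       # proximal point iteration x_{n+1} = Prox_f(x_n)
    trace.append(x)

# argmin |.| = {0}: each step moves distance 1 toward it, then stays put
assert x == 0.0
assert trace[1] == 4.25
```

Once an iterate lands in [−1, 1], it is thresholded to 0 = Fix Prox_f exactly, illustrating argmin f = Fix Prox_f.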
Proposition 2.17.1
Let T : R^m → R^m be nonexpansive and α ∈ (0, 1). The following are equivalent:
(i) T is α-averaged.
(ii) (1 − 1/α) Id + (1/α) T is nonexpansive.
76
Proof
(i) ⇐⇒ (ii): T is α-averaged if and only if there is some nonexpansive N : R^m → R^m such that
  T = (1 − α) Id + αN
  ⇐⇒ N = (1/α)(T − (1 − α) Id)
  ⇐⇒ N = (1 − 1/α) Id + (1/α) T,
if and only if (1 − 1/α) Id + (1/α) T is nonexpansive.

Moreover, by the identity ‖βu + (1 − β)v‖² = β‖u‖² + (1 − β)‖v‖² − β(1 − β)‖u − v‖² (valid for all β ∈ R), N is nonexpansive if and only if for all x, y ∈ R^m,
  ‖x − y‖² ≥ ‖(1 − 1/α)(x − y) + (1/α)(Tx − Ty)‖²
           = (1 − 1/α)‖x − y‖² + (1/α)‖Tx − Ty‖² + ((1 − α)/α²)‖(x − Tx) − (y − Ty)‖².
Rearranging and multiplying through by α > 0, this is equivalent to
  0 ≤ ‖x − y‖² − ((1 − α)/α)‖(x − Tx) − (y − Ty)‖² − ‖Tx − Ty‖².
Theorem 2.17.2
Let α₁, α₂ ∈ (0, 1) and let T_i : R^m → R^m be α_i-averaged. Define
  T := T₁T₂,
  α := (α₁ + α₂ − 2α₁α₂) / (1 − α₁α₂).
Then T is α-averaged.
Proof
First observe that by computation,
  α ∈ (0, 1) ⇐⇒ α₁(1 − α₂) < 1 − α₂,
which holds since α₁ < 1.
77
By the previous proposition, for each x, y ∈ R^m,
  ‖Tx − Ty‖² = ‖T₁T₂x − T₁T₂y‖²
  ≤ ‖T₂x − T₂y‖² − ((1 − α₁)/α₁)‖(Id − T₁)T₂x − (Id − T₁)T₂y‖²
  ≤ ‖x − y‖² − ((1 − α₂)/α₂)‖(Id − T₂)x − (Id − T₂)y‖² − ((1 − α₁)/α₁)‖(Id − T₁)T₂x − (Id − T₁)T₂y‖²
  =: ‖x − y‖² − V₁ − V₂.
Set
  β := (1 − α₁)/α₁ + (1 − α₂)/α₂ > 0.
By computation,
  V₁ + V₂ ≥ ((1 − α₁)(1 − α₂)/(βα₁α₂)) ‖(Id − T)x − (Id − T)y‖².
Consequently,
  ‖Tx − Ty‖² ≤ ‖x − y‖² − ((1 − α₁)(1 − α₂)/(βα₁α₂)) ‖(Id − T)x − (Id − T)y‖²
             = ‖x − y‖² − ((1 − α)/α) ‖(Id − T)x − (Id − T)y‖².
78
Chapter 3
  min_{x ∈ C} f(x)   (P)
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, with C ≠ ∅ being convex and closed.
79
Theorem 3.1.1
Let f : Rm → (−∞, ∞] be proper and g : Rm → (−∞, ∞] convex, l.s.c., proper with
dom g ⊆ int(dom f ). Consider the problem
Proof (i)
Let y ∈ dom g. Since g is convex, we know that dom g is convex. Hence for any λ ∈ (0, 1),
  x* + λ(y − x*) = (1 − λ)x* + λy =: x_λ ∈ dom g.
Proof (ii)
Suppose that f is convex and observe that (i) proves the forward direction.
80
Now suppose −∇f (x∗ ) ∈ ∂g(x∗ ). By definition, for each y ∈ dom g,
Lemma 3.1.2
For all x ∈ R^m, F(x) ≥ 0. Moreover, the solutions of (P) are precisely the minimizers of F, namely the set
  {x : F(x) = 0}.
81
Proof
Let x ∈ R^m.
Case Ia: x is infeasible. Then there is some j ∈ [n] such that g_j(x) > 0. Hence F(x) ≥ g_j(x) > 0.
Case Ib: x is feasible but not optimal. Then g_i(x) ≤ 0 for each i ∈ [n] but f(x) > µ. Thus F(x) ≥ g₀(x) > 0, where g₀ := f − µ.
Case II: x solves (P). Then x is feasible and f(x) = µ. Hence F(x) = 0.
Now, let
  x ∈ ∩_{i=1}^n int dom g_i.
By the max rule for subdifferentials, we have
  ∂F(x) = conv( ∪_{i∈A(x)} ∂g_i(x) ),
where A(x) := {0 ≤ i ≤ n : g_i(x) = F(x)} is the set of active indices.

Theorem (Fritz–John; Necessary Conditions)
Suppose x* solves (P). Then there exist α₀, …, α_n ≥ 0, not all zero, such that
  0 ∈ α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*),  with  α_i g_i(x*) = 0 for each i ∈ [n].
Proof
Recall that F(x) := max{f(x) − µ, g₁(x), …, g_n(x)}. By the previous lemma, x* minimizes F and F(x*) = 0. Hence
  0 ∈ ∂F(x*) = conv( ∪_{i∈A(x*)} ∂g_i(x*) ),
where A(x*) := {0 ≤ i ≤ n : g_i(x*) = 0}, g₀ := f − µ, and
  ∂g₀ = ∂f.
By our work with convex hulls, there are some α₀, …, α_n ≥ 0 with Σ_{i∈A(x*)} α_i = 1 (so α_j = 0 if j ∉ A(x*)) such that
  0 ∈ Σ_{i∈A(x*)} α_i ∂g_i(x*)
    = α₀ ∂g₀(x*) + Σ_{i∈A(x*)∖{0}} α_i ∂g_i(x*)
    = α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*).
Moreover, α_i g_i(x*) = 0 for each i ∈ [n], since α_i = 0 whenever g_i(x*) < 0.
Theorem (Karush–Kuhn–Tucker; Necessary Conditions)
Suppose x* solves (P) and Slater's condition holds: there is some s such that
  g_i(s) < 0 for each i ∈ [n].
Then there exist λ₁, …, λ_n ≥ 0 such that the KKT conditions hold:
(stationarity)
  0 ∈ ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*)
(complementary slackness)
  λ_i g_i(x*) = 0.
83
Proof
By the Fritz–John necessary conditions, there are α₀, α₁, …, α_n ≥ 0, not all 0, such that
  0 ∈ α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*).
Suppose towards a contradiction that α₀ = 0. Then there exist y_i ∈ ∂g_i(x*) such that
  Σ_{i=1}^n α_i y_i = 0.
Evaluating the subgradient inequalities at the Slater point s and using α_i g_i(x*) = 0,
  Σ_{i=1}^n α_i g_i(s) ≥ Σ_{i=1}^n α_i [ g_i(x*) + ⟨y_i, s − x*⟩ ] = ⟨Σ_{i=1}^n α_i y_i, s − x*⟩ = 0,
which is absurd since g_i(s) < 0 for every i and not all α_i are zero. Hence α₀ > 0, and dividing through by α₀ we may take λ_i := α_i/α₀.
84
Theorem 3.1.6 (Karush-Kuhn-Tucker; Sufficient Conditions)
Suppose f, g₁, …, g_n are convex and x* ∈ R^m is feasible and, together with some λ₁, …, λ_n ≥ 0, satisfies the KKT conditions:
  0 ∈ ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*)  and  λ_i g_i(x*) = 0 for each i ∈ [n].
Then x* solves (P).
Proof
Define
  h(x) := f(x) + Σ_{i=1}^n λ_i g_i(x).
By assumption,
  0 ∈ ∂h(x*) = ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*),
so x* minimizes h. Then for any feasible x,
  f(x) ≥ f(x) + Σ_{i=1}^n λ_i g_i(x) = h(x) ≥ h(x*) = f(x*) + Σ_{i=1}^n λ_i g_i(x*) = f(x*),
using λ_i ≥ 0, g_i(x) ≤ 0, and complementary slackness.
85
3.2 Gradient Descent
  min_{x ∈ R^m} f(x)   (P)
A direction d is a descent direction for f at x if f′(x; d) < 0, where
  f′(x; d) = lim_{λ→0⁺} (f(x + λd) − f(x)) / λ.
Thus f′(x; d) < 0 implies that there is some ε > 0 such that λ ∈ (0, ε) implies
  (f(x + λd) − f(x)) / λ < 0 ⇐⇒ f(x + λd) < f(x).
1. Initialize x0 ∈ Rm .
2. For each n ∈ N:
(a) Pick tn ∈ argmint≥0 f (xn − t∇f (xn )).
(b) Update xn+1 := xn − tn ∇f (xn )
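For a quadratic f(x) = ½xᵀAx, the exact line search in step (a) has a closed form, t = gᵀg / gᵀAg with g = ∇f(x). A minimal sketch on a diagonal example (the specific matrix is my choice, not from the notes):

```python
def grad(x):
    # f(x) = 0.5 * (x1^2 + 10 * x2^2), so grad f(x) = (x1, 10 * x2)
    return [x[0], 10.0 * x[1]]

def exact_step(g):
    # For f = 0.5 x^T A x with A = diag(1, 10):
    # argmin_t f(x - t g) = g^T g / (g^T A g)
    num = g[0] ** 2 + g[1] ** 2
    den = g[0] ** 2 + 10.0 * g[1] ** 2
    return num / den

x = [10.0, 1.0]
for _ in range(100):
    g = grad(x)
    t = exact_step(g)
    x = [x[0] - t * g[0], x[1] - t * g[1]]

assert abs(x[0]) < 1e-6 and abs(x[1]) < 1e-6   # converges to the minimizer 0
```

The iterates zig-zag between two directions, converging linearly at a rate governed by the condition number of A.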
86
Example 3.2.2 (L. Vandenberghe)
Negative subgradients are NOT necessarily descent directions. Consider f : R2 → R+
given by
(x1 , x2 ) 7→ |x1 | + 2|x2 |.
Then f is convex as it is a (separable) sum of convex functions.
  C = { x ∈ R² : x₁² + x₂²/γ ≤ 1, x₂ ≥ 1/√(1 + γ) }.
  D := R² ∖ ((−∞, 0] × {0}).
Let x₀ := (γ, 1) ∈ D.
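The failure of negative subgradients as descent directions is easy to witness numerically for f(x₁, x₂) = |x₁| + 2|x₂|: at x = (1, 0) the subdifferential is {1} × [−2, 2], and the particular choice (1, 2) ∈ ∂f(x) makes things worse. A short check:

```python
def f(x1, x2):
    return abs(x1) + 2 * abs(x2)

x = (1.0, 0.0)
g = (1.0, 2.0)   # a valid subgradient, since ∂f(1, 0) = {1} x [-2, 2]

# Moving along the NEGATIVE subgradient increases f for every small t > 0:
# f(1 - t, -2t) = (1 - t) + 4t = 1 + 3t > 1 = f(x)
for t in [1e-4, 1e-3, 1e-2, 0.1]:
    assert f(x[0] - t * g[0], x[1] - t * g[1]) > f(*x)
```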
87
3.3 Projected Subgradient Method
Consider
min f (x) (P )
x∈C
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, and ∅ ≠ C ⊆ int dom f is convex and
closed.
Suppose
  S := argmin_{x∈C} f(x) ≠ ∅,
  µ := min_{x∈C} f(x).
1) Get x0 ∈ C.
2) Given x_n, pick a stepsize t_n > 0 and a subgradient f′(x_n) ∈ ∂f(x_n).
3) Update xn+1 := PC (xn − tn f 0 (xn )).
Recall that C ⊆ int dom f, hence each x_n ∈ int dom f and ∂f(x_n) ≠ ∅. Thus the algorithm
is well-defined.
Lemma 3.3.1
Let s ∈ S := argmin_{x∈C} f(x). Then
  ‖x_{n+1} − s‖² ≤ ‖x_n − s‖² − 2t_n(f(x_n) − µ) + t_n²‖f′(x_n)‖².
Observe that S ⊆ C.
Proof
We have
88
To pick the stepsize minimizing this bound, set
  0 = d/dt_n ( −2t_n(f(x_n) − µ) + t_n²‖f′(x_n)‖² ) = −2(f(x_n) − µ) + 2t_n‖f′(x_n)‖²,
which yields Polyak's stepsize
  t_n := (f(x_n) − µ) / ‖f′(x_n)‖².
Theorem 3.3.2
We have
(i) For all s ∈ S and n ∈ N, ‖x_{n+1} − s‖ ≤ ‖x_n − s‖, i.e. (x_n)_{n∈N} is Fejér monotone with respect to S.
(ii) f(x_n) → µ.
(iii) µ_n − µ ≤ L·d_S(x_0)/√(n + 1) ∈ O(1/√n), where µ_n := min_{0≤k≤n} f(x_k).
(iv) Given ε > 0, if n ≥ L²d_S²(x_0)/ε² − 1, then µ_n − µ ≤ ε.
89
Proof (i)
Let s ∈ S and n ∈ N. By computation,
Proof (ii)
From our work in (i): for all k ∈ N,
  (f(x_k) − µ)²/L² ≤ ‖x_k − s‖² − ‖x_{k+1} − s‖².
Summing and letting n → ∞,
  0 ≤ Σ_{k=0}^∞ (f(x_k) − µ)² ≤ L²‖x_0 − s‖² < ∞,
so (f(x_k) − µ)² → 0 and hence f(x_k) → µ.
Proof (iii)
Recall that
µn := min f (xk ).
0≤k≤n
90
Minimizing over s ∈ S, we get that
  (n + 1)(µ_n − µ)²/L² ≤ d_S²(x_0).
Proof (iv)
Suppose that
  n ≥ L²d_S²(x_0)/ε² − 1 ⇐⇒ d_S²(x_0)L²/(n + 1) ≤ ε².
Combining with (iii) yields µ_n − µ ≤ ε.
Recall that if (xn )n∈N is Fejér monotone with respect to some ∅ 6= C ⊆ Rm , and every
cluster point lies in C, then xn → c ∈ C.
Proof
We have already shown that (xn ) is Fejér monotone with respect to S. Thus the sequence
is also bounded. Also, by the previous theorem,
f (xn ) → µ = min f (x).
x∈C
91
Let x̄ be a cluster point of (x_n), say x_{k_n} → x̄. Since C is closed, x̄ ∈ C, and by lower semicontinuity, f(x̄) ≤ lim inf_n f(x_{k_n}) = µ. Hence x̄ ∈ S. That is, all cluster points of (x_n)_{n∈N} lie in S, and the result follows.
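The projected subgradient method with Polyak's stepsize is only a few lines in practice. A minimal sketch for minimizing the ℓ₁ norm over C = {x : x ≥ 1} (my choice of test problem; the minimizer is the all-ones vector, so µ = 3 in R³):

```python
def f(x):          # objective: the l1 norm (equals sum(x) on C since x >= 1)
    return sum(abs(v) for v in x)

def subgrad(x):    # a subgradient of the l1 norm (sign pattern, 1 at 0)
    return [1.0 if v >= 0 else -1.0 for v in x]

def proj_C(x):     # projection onto C = {x : x_i >= 1 for all i}
    return [max(v, 1.0) for v in x]

mu = 3.0           # known optimal value, attained at (1, 1, 1)
x = proj_C([5.0, 2.0, 1.0])
for _ in range(200):
    g = subgrad(x)
    t = (f(x) - mu) / sum(v * v for v in g)   # Polyak's stepsize
    x = proj_C([xi - t * gi for xi, gi in zip(x, g)])

assert all(abs(v - 1.0) < 1e-6 for v in x)    # x_n converges into S
```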
Example 3.3.4
Let C ⊆ Rm be convex, closed, and non-empty. Fix x ∈ Rm .
  ∂d_C(x) = { (x − P_C(x))/d_C(x) } if x ∉ C;  N_C(x) ∩ B(0; 1) if x ∈ C.
Lemma 3.3.5
Let f be convex, l.s.c., and proper. Fix λ > 0. Then
∂(λf ) = λ∂f.
Problem 1
Given k closed convex subsets S_i ⊆ R^m such that
  S := ∩_{i=1}^k S_i ≠ ∅,
find x ∈ S.
We take
f (x) := max{dSi (x) : i ∈ [k]}.
The domain is C := R^m. Observe that f ≥ 0, with f(x) = 0 if and only if x ∈ S, so µ = 0 and argmin f = S. Recall that the max rule for subdifferentials implies that for all x ∉ S,
  ∂f(x) = conv( ∪_{i : d_{S_i}(x) = f(x)} ∂d_{S_i}(x) ).
Thus ‖∂f(x)‖ ≤ 1, as a convex combination of unit vectors preserves the norm bound (so we may take L = 1).
Given x_n, pick an index ī such that d_{S_ī}(x_n) = f(x_n) > 0. Set
  f′(x_n) := (x_n − P_{S_ī}(x_n)) / d_{S_ī}(x_n).
Since µ = 0 and ‖f′(x_n)‖ = 1, Polyak's stepsize is
  t_n = d_{S_ī}(x_n),
and the update becomes x_{n+1} = x_n − d_{S_ī}(x_n) f′(x_n) = P_{S_ī}(x_n): project onto the currently farthest set.
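The "project onto the farthest set" rule is concrete enough to run. A minimal sketch with two half-planes in R² (my choice of sets, chosen so both projections are trivial):

```python
import math

# Two closed convex sets in R^2 with nonempty intersection:
# S1 = {x : x[0] >= 2},  S2 = {x : x[1] >= 3}
def proj_S1(p): return (max(p[0], 2.0), p[1])
def proj_S2(p): return (p[0], max(p[1], 3.0))
def dist(p, q): return math.hypot(p[0] - q[0], p[1] - q[1])

p = (0.0, 0.0)
for _ in range(50):
    candidates = [proj_S1(p), proj_S2(p)]
    # project onto the FARTHEST set, i.e. the one realizing f(x) = max_i d_{S_i}(x)
    q = max(candidates, key=lambda c: dist(p, c))
    if dist(p, q) == 0.0:      # already feasible: f(p) = 0
        break
    p = q

assert p[0] >= 2.0 - 1e-9 and p[1] >= 3.0 - 1e-9   # p lies in S1 ∩ S2
```

On this example the method lands in S exactly after two projections.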
Note that in practice, it is possible that µ := min_{x∈C} f(x) is NOT known to us. In this case, replace Polyak's stepsize by a sequence (t_n)_{n∈N} such that
  (Σ_{k=0}^n t_k²) / (Σ_{k=0}^n t_k) → 0, n → ∞.
For example, t_k := 1/(k + 1). One can show that
  µ_n := min_{0≤k≤n} f(x_k) → µ
as n → ∞.
93
We shall assume that S := argmin_{x∈R^m} F(x) ≠ ∅, where F := f + g, and define
  µ := min_{x∈R^m} F(x).
Here f is "nice" in that it is convex, l.s.c., proper, and differentiable on int dom f ≠ ∅. Moreover, ∇f is L-Lipschitz on int dom f.
  ∅ ≠ ri dom g ⊆ dom g ⊆ int dom f = ri dom f
  ⟹ ri dom g ∩ ri dom f = ri dom g ≠ ∅.
Example 3.4.1
We can model constrained optimization by taking g := δ_C, in which case Prox_{(1/L)g} = P_C.

  x⁺ := Prox_{(1/L)g}( x − (1/L)∇f(x) )
      = argmin_{y∈R^m} (1/L)g(y) + ½‖y − (x − (1/L)∇f(x))‖²
      ∈ dom g ⊆ int dom f = dom ∇f.

Define
  T := Prox_{(1/L)g} ∘ (Id − (1/L)∇f).
94
Theorem 3.4.2
Let x ∈ R^m. Then
  x ∈ S = argmin_{R^m} F = argmin_{R^m}(f + g) ⇐⇒ x = Tx ⇐⇒ x ∈ Fix T.
Proof
By Fermat’s theorem,
Proposition 3.4.3
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Fix β > 0. Then f is β-strongly
convex if and only if for all x ∈ dom ∂f, u ∈ ∂f (x),
  f(y) ≥ f(x) + ⟨u, y − x⟩ + (β/2)‖y − x‖².
95
3.4.1 Proximal-Gradient Inequality
Proposition 3.4.4
Let x ∈ R^m, y ∈ int dom f, and
  y⁺ := Prox_{(1/L)g}( y − (1/L)∇f(y) ) = Ty.
Then
  F(x) − F(y⁺) ≥ (L/2)‖x − y⁺‖² − (L/2)‖x − y‖² + D_f(x, y),
where
  D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
Proof
Define
  h(z) := f(y) + ⟨∇f(y), z − y⟩ + g(z) + (L/2)‖z − y‖².
Then h is L-strongly convex and y⁺ = argmin h. Applying the previous proposition with u = 0 ∈ ∂h(y⁺) yields
  h(x) ≥ h(y⁺) + ⟨0, x − y⁺⟩ + (L/2)‖x − y⁺‖² = h(y⁺) + (L/2)‖x − y⁺‖²,
i.e.
  h(x) − h(y⁺) ≥ (L/2)‖x − y⁺‖².
By the descent lemma,
  h(y⁺) = f(y) + ⟨∇f(y), y⁺ − y⟩ + g(y⁺) + (L/2)‖y⁺ − y‖² ≥ f(y⁺) + g(y⁺) = F(y⁺).
Moreover,
  h(x) = f(y) + ⟨∇f(y), x − y⟩ + g(x) + (L/2)‖x − y‖² = F(x) − D_f(x, y) + (L/2)‖x − y‖².
Combining the three displays,
  F(x) − D_f(x, y) + (L/2)‖x − y‖² − F(y⁺) ≥ (L/2)‖x − y⁺‖²,
which rearranges to the claim.
97
Proof
Taking x = y in the previous proposition,
  F(y) − F(y⁺) ≥ (L/2)‖y − y⁺‖² − (L/2)‖y − y‖² + D_f(y, y) = (L/2)‖y − y⁺‖²,
since D_f(y, y) = 0. That is,
  F(y⁺) ≤ F(y) − (L/2)‖y − y⁺‖².
Proof
(i): Recall the previous proposition that
98
In particular, by setting s := PS (x0 ) ∈ S, we obtain
Equivalently,
  0 ≤ F(x_n) − µ ≤ L d_S²(x_0)/(2n),
and F(x_n) → µ.
S := argminx∈Rm F (x).
Proof
By the previous theorem we know that (xn ) is Fejér monotone with respect to S. Thus it
suffices to show that every cluster point of (xn ) lies in S.
Suppose x̄ is a cluster point of (x_n), say x_{k_n} → x̄. We argue that F(x̄) = µ. Indeed, by lower semicontinuity,
  µ ≤ F(x̄) ≤ lim inf_n F(x_{k_n}) = µ.
Proposition 3.4.8
The following hold:
(i) (1/L)∇f is f.n.e.
(ii) Id − (1/L)∇f is f.n.e.
(iii) T = Prox_{(1/L)g} ∘ (Id − (1/L)∇f) is 2/3-averaged.
Proof
(i), (ii): Recall that for real-valued, convex, differentiable functions with L-Lipschitz gradient,
  ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²,
equivalently,
  ⟨(1/L)∇f(x) − (1/L)∇f(y), x − y⟩ ≥ ‖(1/L)∇f(x) − (1/L)∇f(y)‖².
The result then follows from two equivalent characterizations of f.n.e.: Id − T is f.n.e. whenever T is, and T is f.n.e. if and only if
  ⟨x − y, Tx − Ty⟩ ≥ ‖Tx − Ty‖².
(iii): Recall that Prox_{(1/L)g} is f.n.e. Hence Prox_{(1/L)g} and Id − (1/L)∇f are both ½-averaged. Consequently, by Theorem 2.17.2 the composition
  T = Prox_{(1/L)g} ∘ (Id − (1/L)∇f)
is α-averaged with α = (½ + ½ − 2·¼)/(1 − ¼) = 2/3.
Theorem 3.4.9
The PGM iteration satisfies
  ‖x_{n+1} − x_n‖ ≤ √2 d_S(x_0)/√n ∈ O(1/√n).
Proof
Using the previous remark, we have that for all x, y,
  ½‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖² − ‖Tx − Ty‖².
Let s ∈ S and observe that s ∈ Fix T by a previous theorem. Applying the above inequality with x = x_k, y = s yields
  ½‖x_k − x_{k+1}‖² ≤ ‖x_k − s‖² − ‖x_{k+1} − s‖².
Summing over k = 0, …, n − 1 telescopes to
  ½ Σ_{k=0}^{n−1} ‖x_k − x_{k+1}‖² ≤ ‖x_0 − s‖² − ‖x_n − s‖² ≤ ‖x_0 − s‖².
Now, T is 2/3-averaged and thus nonexpansive, so ‖x_{k+1} − x_k‖ = ‖Tx_k − Tx_{k−1}‖ ≤ ‖x_k − x_{k−1}‖ is decreasing in k. Therefore
  (n/2)‖x_{n+1} − x_n‖² ≤ ‖x_0 − s‖²,
and taking s := P_S(x_0) gives the claim.
101
Corollary 3.4.9.1 (Classical Proximal Point Algorithm)
Let g : Rm → (−∞, ∞] be convex, l.s.c., and proper. Fix c > 0. Consider the problem
min g(x) (P )
x ∈ Rm
Proof
Set f ≡ 0 and observe that ∇f ≡ 0, which is L-Lipschitz for any L > 0; specifically, take L := 1/c > 0. Hence
  T := Prox_{(1/L)g}( Id − (1/L)∇f ) = Prox_{cg},
and the PGM iteration reduces to x_{n+1} = Prox_{cg}(x_n).
102
3.5 Fast Iterative Shrinkage Thresholding
S := argmin_{x∈R^m} F(x) ≠ ∅

  t_{n+1} = (1 + √(1 + 4t_n²)) / 2
  x_{n+1} = Prox_{(1/L)g}( y_n − (1/L)∇f(y_n) ) = T y_n
  y_{n+1} = x_{n+1} + ((t_n − 1)/t_{n+1})(x_{n+1} − x_n)
          = (1 − (1 − t_n)/t_{n+1}) x_{n+1} + ((1 − t_n)/t_{n+1}) x_n
          ∈ aff{x_n, x_{n+1}}.

Observe that
  t_{n+1}² − t_{n+1} = t_n².
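The update above translates directly into code. A minimal FISTA sketch on a separable problem f(x) = ½Σᵢ aᵢ(xᵢ − bᵢ)², g = λ‖·‖₁ (my choice of test data, with t₀ := 1 and y₀ := x₀ as the standard initialization):

```python
import math

def soft(z, lam):
    # Prox of lam*|.| at z
    return math.copysign(max(abs(z) - lam, 0.0), z)

a = [1.0, 4.0]          # f(x) = 0.5 * sum a_i (x_i - b_i)^2, so L = max(a) = 4
b = [3.0, -1.0]
lam, L = 0.5, 4.0       # g(x) = lam * ||x||_1

x = [0.0, 0.0]
y, t = list(x), 1.0
for _ in range(2000):
    grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
    x_new = [soft(yi - gi / L, lam / L) for yi, gi in zip(y, grad)]   # x_{n+1} = T y_n
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2
    y = [xn + ((t - 1) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
    x, t = x_new, t_new

# coordinatewise closed-form minimizer of 0.5*a_i(x - b_i)^2 + lam*|x|
x_star = [math.copysign(max(abs(bi) - lam / ai, 0.0), bi) for ai, bi in zip(a, b)]
assert all(abs(xi - si) < 1e-2 for xi, si in zip(x, x_star))
```

The tolerance is deliberately loose: it is what the worst-case O(1/n²) bound guarantees after 2000 steps; in practice the iterates are far closer.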
103
3.5.2 Correctness
Proposition 3.5.1
The sequence (t_n)_{n∈N} satisfies
  t_n ≥ (n + 2)/2 ≥ 1.
Proof
Induction.
  0 ≤ F(x_n) − µ ≤ 2L d_S²(x_0)/(n + 1)² ∈ O(1/n²).
Proof
Set s := P_S(x_0). By the convexity of F,
  F( (1/t_n)s + (1 − 1/t_n)x_n ) ≤ (1/t_n)F(s) + (1 − 1/t_n)F(x_n).
For each n ∈ N, set
  s_n := F(x_n) − µ ≥ 0.
By computation,
  (1 − 1/t_n)s_n − s_{n+1} ≥ F( (1/t_n)s + (1 − 1/t_n)x_n ) − F(x_{n+1}).
104
Applying the proximal-gradient inequality with x = (1/t_n)s + (1 − 1/t_n)x_n and y = y_n yields
  F( (1/t_n)s + (1 − 1/t_n)x_n ) − F(x_{n+1})
  ≥ (L/(2t_n²))‖t_n x_{n+1} − (s + (t_n − 1)x_n)‖² − (L/(2t_n²))‖t_n y_n − (s + (t_n − 1)x_n)‖².
It follows, writing u_n := t_{n−1} x_n − (s + (t_{n−1} − 1)x_{n−1}), that
  (2/L) t_{n−1}² s_n ≤ (2/L) t_{n−1}² s_n + ‖u_n‖²
  ≤ ⋯
  ≤ (2/L) t_0² s_1 + ‖u_1‖²
  = (2/L)(F(x_1) − µ) + ‖x_1 − s‖²
  ≤ ‖x_0 − s‖²,
where the last inequality follows from the proximal gradient inequality.
In other words,
  F(x_n) − µ = s_n
  ≤ (L/2)‖x_0 − s‖² · (1/t_{n−1}²)
  ≤ (L/2)‖x_0 − s‖² · 4/(n + 1)²        (since t_{n−1} ≥ (n + 1)/2)
  = 2L d_S²(x_0)/(n + 1)².
105
3.6 Iterative Shrinkage Thresholding Algorithm
  min ‖x‖₂ subject to Ax = b   (P1)
  min ‖x‖₁ subject to Ax = b   (P2)
Example 3.6.1
Consider the problem
  min_{x∈R^m} ½‖Ax − b‖₂² + λ‖x‖₁   (P)
where λ > 0 and A ∈ Rn×m .
Recall that ∇f is L-Lipschitz if and only if the spectral norm of the Hessian is bounded
by L. Thus ∇f is L-Lipschitz for
L := λmax (AT A).
106
To see that the assumption S := argmin_{x∈R^m} F(x) ≠ ∅ holds, observe that F is continuous, convex, and coercive, with dom F = R^m. We then use the fact that if F is convex, l.s.c., proper, and coercive and ∅ ≠ C is closed and convex with dom F ∩ C ≠ ∅, then F has a minimizer over C.
Now, m can be very large and λmax (AT A) may be difficult to compute. It suffices to use
some upper bound on eigenvalues such as the Frobenius norm
  ‖A‖_F² = Σ_{j=1}^m Σ_{i=1}^n a_{ij}²
         = tr(AᵀA)
         = Σ_{i=1}^m λ_i(AᵀA)
         ≥ λ_max(AᵀA).
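Putting the pieces together gives ISTA with the easy Frobenius-norm stepsize. A minimal sketch on a tiny hand-rolled instance (my choice of A, b, λ), verified against the first-order optimality condition −∇f(x) ∈ λ∂‖·‖₁(x):

```python
import math

def soft(z, lam):
    return math.copysign(max(abs(z) - lam, 0.0), z)

A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [3.0, -2.0, 1.0]
lam = 0.1

# L = lambda_max(A^T A) <= ||A||_F^2 = sum of squared entries (cheap bound)
L = sum(aij * aij for row in A for aij in row)

def grad(x):  # gradient of f(x) = 0.5*||Ax - b||^2 is A^T (Ax - b)
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

x = [0.0, 0.0]
for _ in range(5000):
    g = grad(x)
    x = [soft(xj - gj / L, lam / L) for xj, gj in zip(x, g)]

# optimality: -grad f(x) must lie in lam * ∂||.||_1 at x
g = grad(x)
for xj, gj in zip(x, g):
    if xj != 0.0:
        assert abs(gj + lam * math.copysign(1.0, xj)) < 1e-6
    else:
        assert abs(gj) <= lam + 1e-6
```

Using ‖A‖_F² instead of λ_max(AᵀA) only slows the iteration by a constant factor; correctness is unaffected since any L′ ≥ L is a valid Lipschitz constant.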
Define
Rf := 2 Proxf − Id
Rg := 2 Proxg − Id .
107
Lemma 3.7.1
The following hold:
(i) Rf , Rg are nonexpansive
(ii) T = ½(Id + R_g R_f)
(iii) T is firmly nonexpansive
Proof
Since Prox_f and Prox_g are f.n.e., the reflectors 2Prox_f − Id and 2Prox_g − Id are nonexpansive as shown in the assignments, proving (i). For (ii), by definition,
  T = ½(Id + R_g R_f).
For (iii), (ii) exhibits T as ½-averaged with nonexpansive operator R_g R_f, which is equivalent to T being firmly nonexpansive.
Proposition 3.7.2
Fix T = Fix Rg Rf .
Proof
Let x ∈ Rm . Then
  x ∈ Fix T ⇐⇒ x = ½(x + R_g R_f x) ⇐⇒ x = R_g R_f x ⇐⇒ x ∈ Fix R_g R_f.
Proposition 3.7.3
Proxf (Fix T ) ⊆ S.
Proof
Let x ∈ Fix T and set s := Prox_f(x) = (Id + ∂f)^{−1}(x). Then
  x − s ∈ ∂f(s)  and  R_f(x) = 2s − x.
Moreover, since Fix T = Fix R_g R_f, we have R_g(R_f x) = x, so
  Prox_g(R_f x) = ½(R_f x + x) = s,  hence  R_f(x) − s ∈ ∂g(s).
It follows that
  0 = (x − s) + (R_f(x) − s)
    ∈ ∂f(s) + ∂g(s)
    ⊆ ∂(f + g)(s),
so s ∈ S by Fermat's theorem.
Theorem 3.7.4
Let x_0 ∈ R^m and update via x_{n+1} := T x_n. Then x_n → x̄ ∈ Fix T and Prox_f(x_n) → Prox_f(x̄) ∈ S.
Proof
Remark that x_{n+1} = T x_n = T^{n+1} x_0. Since T is f.n.e., we know that x_n → x̄ ∈ Fix T. Since Prox_f is (firmly) nonexpansive, hence continuous, Prox_f(x_n) → Prox_f(x̄), which lies in S by the previous proposition.
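This Douglas–Rachford iteration is short to implement. A minimal one-dimensional sketch for min |x| + δ_{[2,5]}(x) (my choice of f and g; the minimizer is 2), tracking the shadow sequence Prox_f(x_n):

```python
def prox_f(x):      # prox of f = |.| : soft threshold with parameter 1
    return x - 1.0 if x > 1.0 else (x + 1.0 if x < -1.0 else 0.0)

def prox_g(x):      # prox of g = indicator of [2, 5]: clamp (projection)
    return min(max(x, 2.0), 5.0)

def R(prox, x):     # reflected prox R = 2 Prox - Id
    return 2.0 * prox(x) - x

x = 10.0
for _ in range(300):
    x = 0.5 * (x + R(prox_g, R(prox_f, x)))   # T = (Id + R_g R_f)/2

# the shadow sequence Prox_f(x_n) converges to argmin(|.| + delta_[2,5]) = {2}
assert abs(prox_f(x) - 2.0) < 1e-9
```

Note it is Prox_f(x_n), not x_n itself, that solves the problem: here x_n settles at the fixed point x̄ = 3, whose shadow Prox_f(3) = 2.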
f is convex, l.s.c., and proper, ∅ ≠ C ⊆ int dom f is closed and convex, and
  S := argmin_{x∈C} f(x) ≠ ∅.
Set
  µ := min f(C).
  Σ_{n=0}^∞ t_n = ∞,
  (Σ_{k=0}^n t_k²) / (Σ_{k=0}^n t_k) → 0 as n → ∞,
for example t_n = α/(n + 1) for some α > 0.
Let us write
µk := min{f (xi ) : 0 ≤ i ≤ k}.
Theorem 3.8.1
Assuming the previous assumptions hold, then E[µk ] → µ as k → ∞.
Proof
Pick s ∈ S and let n ∈ N. Then
0 ≤ kxn+1 − sk2
= kPC (xn − tn gn ) − PC sk2
≤ k(xn − tn gn ) − sk2
= k(xn − s) − tn gn k2
= kxn − sk2 − 2tn hgn , xn − si + t2n kgn k2
110
Taking the conditional expectation given x_n, telescoping over n = 0, …, k, and using the second-moment bound E[‖g_n‖² | x_n] ≤ L² yields
  0 ≤ E[‖x_{k+1} − s‖²] ≤ ‖x_0 − s‖² − 2 Σ_{n=0}^k t_n (E[f(x_n)] − µ) + L² Σ_{n=0}^k t_n².
Rearranging yields
  0 ≤ E[µ_k] − µ ≤ (‖x_0 − s‖² + L² Σ_{n=0}^k t_n²) / (2 Σ_{n=0}^k t_n) → 0, k → ∞.
  min_{x∈C} Σ_{i=1}^r f_i(x)   (P)
where each f_i is convex, l.s.c., and proper, and ∅ ≠ C ⊆ int dom f_i. We also assume that for each i ∈ [r], there is some L_i ≥ 0 for which
  ‖∂f_i(C)‖ ≤ L_i.
Proposition 3.8.2
sup ‖∂f_i(C)‖ ≤ L_i if and only if f_i restricted to C is L_i-Lipschitz.
Let us assume that (P) has a solution. We verify (A1), (A2) to justify solving the problem
with SPSM.
Next,
  E[‖g_n‖² | x_n] = Σ_{i=1}^r (1/r) ‖r f_i′(x_n)‖²
                  = r Σ_{i=1}^r ‖f_i′(x_n)‖²
                  ≤ r Σ_{i=1}^r L_i².
112
3.9 Duality
Let
  p := inf_{x∈R^m} f(x) + g(x),
  d := inf_{u∈R^m} f*(u) + g*(−u).
113
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, g : R^n → (−∞, ∞] is convex, l.s.c., and proper, and A ∈ R^{n×m}. As before, let
  p := inf_{x∈R^m} f(x) + g(Ax).
Lemma 3.9.1
Let h : R^m → (−∞, ∞] be convex, l.s.c., and proper, and for each x ∈ R^m define
  h^∨(x) := h(−x).
Then h^∨ is convex, l.s.c., and proper.
Proof
The convexity, lower semicontinuity, and properness are verified directly from the definition.
  T_d := ½(Id + R_{(g*)^∨} R_{f*}).
Lemma 3.9.2
Let h : Rm → (−∞, ∞] be convex, l.s.c., and proper. The following hold:
(i) Prox_{h^∨} = −Prox_h ∘ (−Id)
(ii) R_{h*} = −R_h
(iii) R_{(h*)^∨} = R_h ∘ (−Id)
Proof
(i): This is shown using the relation Prox_f = (Id + ∂f)^{−1} as well as the lemma ∂(h^∨) = −∂h ∘ (−Id).
(ii): This can be proven by expanding the definition of R_{h*} as well as the relation Prox_{h*} = Id − Prox_h proven in A4.
(iii): Combine (i) applied to h* with the relation Prox_{h*} = Id − Prox_h.
Theorem 3.9.3
Tp = Td .
Proof
From our previous lemma,
  T_d := ½(Id + R_{(g*)^∨} R_{f*})
       = ½(Id + [R_g ∘ (−Id)] ∘ (−R_f))
       = ½(Id + R_g R_f)
       = T_p.
115
Theorem 3.9.4
Let x_0 ∈ R^m and update via x_{n+1} := T_p x_n. Then Prox_f(x_n) converges to a minimizer of f + g, and Prox_{f*}(x_n) converges to a minimizer of f* + (g*)^∨.
Proof
We already know that Proxf (xn ) converges to a minimizer of f + g. Since Tp = Td ,
Proxf ∗ (xn ) converges to a minimizer of f ∗ +(g ∗ )v . Using the fact that Proxf ∗ = Id − Proxf ,
we conclude the proof.
116