CO 463
Felix Zhou
From Professor Walaa Moursi's lectures at the University of Waterloo in Winter 2021.
Contents

1 Convex Sets
1.1 Introduction
2 Convex Functions
2.6 Minimizers
2.7 Conjugates
2.10 Differentiability
2.11 Conjugacy
3.6 Iterative Shrinkage Thresholding Algorithm
Chapter 1
Convex Sets
1.1 Introduction
In the case when C = Rn, the minimizers of f occur at the critical points of f, namely at x ∈ Rn where ∇f(x) = 0. This is known as "Fermat's rule".
In this course, we seek to approach (P) when f is not differentiable but f is convex and when ∅ ≠ C ⊊ Rn is a convex set.
Recall that S ⊆ Rn is affine if for all x, y ∈ S and λ ∈ R,
λx + (1 − λ)y ∈ S.
Definition 1.2.3 (Affine Hull)
Let S ⊆ Rn. The affine hull of S is
aff(S) := ⋂ {T : S ⊆ T ⊆ Rn, T is affine}.
Example 1.2.1
Let L be a linear subspace of Rn and a ∈ Rn. Then a + L is affine, and aff(a + L) = a + L.
Definition 1.3.1
C ⊆ Rn is convex if for all x, y ∈ C and λ ∈ (0, 1),
λx + (1 − λ)y ∈ C.
Example 1.3.1
∅, Rn, balls, affine sets, and half-spaces are all examples of convex sets.
Theorem 1.3.2
The intersection of an arbitrary collection of convex sets is convex.
Proof
Let I be an index set and let (Ci)i∈I be a collection of convex subsets of Rn. Put
C := ⋂_{i∈I} Ci.
Pick x, y ∈ C and λ ∈ (0, 1). For each i ∈ I, x, y ∈ Ci, so λx + (1 − λ)y ∈ Ci by convexity. Hence λx + (1 − λ)y ∈ C.
Corollary 1.3.2.1
Let bi ∈ Rn and βi ∈ R for i ∈ I for some arbitrary index set I.
The set
C := {x ∈ Rn : ⟨x, bi⟩ ≤ βi, ∀i ∈ I}
is convex.
Theorem 1.4.1
C ⊆ Rn is convex if and only if it contains all convex combinations of its elements.
Proof
(⇐) Apply the definition of convex combination with m = 2.
(⇒) Induct on the number m of points in the combination, splitting off the last term.
Definition 1.4.2 (Convex Hull)
The convex hull of S ⊆ Rn is
conv S := ⋂ {T : S ⊆ T ⊆ Rn, T is convex}.
Theorem 1.4.2
Let S ⊆ Rn. conv S consists of all convex combinations of elements of S.
Proof
Let D be the set of convex combinations of elements of S.
(D ⊆ conv S) By the previous theorem, the convexity of conv S means that it contains all convex combinations of its elements. In particular, it contains all convex combinations of elements of S ⊆ conv S.
The distance function to S ⊆ Rn is
dS : x ↦ inf_{s∈S} ‖x − s‖.
A point p ∈ C attains the distance from x to C when
dC(x) = ‖x − p‖.
Recall that a Cauchy sequence (xn)n∈N in Rn is a sequence such that
‖xm − xn‖ → 0
as min(m, n) → ∞. Since Rn is a complete metric space under the Euclidean metric, every Cauchy sequence converges in Rn.
Lemma 1.5.1
Let x, y, z ∈ Rn. Then
‖x − y‖² = 2‖z − x‖² + 2‖z − y‖² − 4‖z − (x + y)/2‖².
Proof
This is by direct computation, expanding each norm via ‖a‖² = ⟨a, a⟩.
Lemma 1.5.2
Let x, y ∈ Rn. Then ⟨x, y⟩ ≤ 0 if and only if ‖x‖ ≤ ‖x − λy‖ for all λ ≥ 0.
Proof
(⇒) Suppose ⟨x, y⟩ ≤ 0. Then for every λ ≥ 0,
‖x − λy‖² − ‖x‖² = λ²‖y‖² − 2λ⟨x, y⟩ ≥ 0.
(⇐) Conversely, squaring ‖x‖ ≤ ‖x − λy‖ and rearranging gives, for λ > 0,
⟨x, y⟩ ≤ (λ/2)‖y‖² → 0 as λ → 0⁺.
Hence ⟨x, y⟩ ≤ 0.
Proof (i)
Recall that
dC(x) := inf_{c∈C} ‖x − c‖.
Hence there is a sequence (cn)n∈N in C such that
dC(x) = lim_{n→∞} ‖cn − x‖.
By the lemma above, applied with z = x and the pair cn, cm (whose midpoint lies in C by convexity), as m, n → ∞,
0 ≤ ‖cn − cm‖² ≤ 2‖x − cn‖² + 2‖x − cm‖² − 4dC(x)² → 4dC(x)² − 4dC(x)² = 0,
so (cn) is a Cauchy sequence. But then there is some p ∈ C such that cn → p, by the closedness (completeness) of C. Consequently,
‖x − cn‖ → dC(x) = ‖x − p‖.
Next, suppose p, q ∈ C both attain the infimum. Since (p + q)/2 ∈ C by convexity,
0 ≤ ‖p − q‖² = 2‖p − x‖² + 2‖q − x‖² − 4‖x − (p + q)/2‖² ≤ 2dC(x)² + 2dC(x)² − 4dC(x)² = 0.
So ‖p − q‖ = 0 =⇒ p = q.
Proof (ii)
Observe that p = PC (x) if and only if p ∈ C and
kx − pk2 = dC (x)2 .
Since C is convex,
∀α ∈ [0, 1], yα := αy + (1 − α)p ∈ C.
Thus
‖x − p‖² = dC(x)²
⇐⇒ ∀y ∈ C, α ∈ [0, 1], ‖x − p‖² ≤ ‖x − yα‖²
⇐⇒ ∀y ∈ C, α ∈ [0, 1], ‖x − p‖² ≤ ‖x − p − α(y − p)‖²
⇐⇒ ∀y ∈ C, ⟨x − p, y − p⟩ ≤ 0, by the auxiliary lemma above.
In the absence of closedness, PC (x) does not in general exist unless x ∈ C. In the absence
of convexity, uniqueness does not in general hold.
Example 1.5.4
Fix ε > 0 and let C = B(0; ε) be the closed ball around 0 of radius ε. Then
PC(x) = ε x / max(‖x‖, ε).
In other words, points of C are fixed and points outside C are rescaled onto the boundary.
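As a quick numerical sanity check (our own sketch, not part of the notes; the helper name proj_ball is ours), the closed-form ball projection can be verified against the projection theorem's characterization ⟨x − p, y − p⟩ ≤ 0 for all y ∈ C:

```python
import numpy as np

def proj_ball(x, eps):
    """Projection onto the closed ball B(0; eps): P_C(x) = eps*x / max(||x||, eps)."""
    return eps * x / max(np.linalg.norm(x), eps)

# p = P_C(x) is characterized by <x - p, y - p> <= 0 for every y in C.
rng = np.random.default_rng(0)
eps = 1.0
x = np.array([3.0, 4.0])                     # ||x|| = 5, so x lies outside B(0; 1)
p = proj_ball(x, eps)                        # expected: x / 5 = (0.6, 0.8)
for _ in range(1000):
    y = proj_ball(rng.normal(size=2), eps)   # arbitrary points of C
    assert np.dot(x - p, y - p) <= 1e-12
```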
The Minkowski sum of C, D ⊆ Rn is
C + D := {c + d : c ∈ C, d ∈ D}.
If C and D are convex, then so is C + D.
Proof
If either of C, D is empty, then C + D = ∅ by definition. Otherwise, given c1 + d1, c2 + d2 ∈ C + D and λ ∈ (0, 1),
λ(c1 + d1) + (1 − λ)(c2 + d2) = (λc1 + (1 − λ)c2) + (λd1 + (1 − λ)d2) ∈ C + D,
as required.
Proposition 1.6.2
Let ∅ ≠ C, D ⊆ Rn be closed and convex. Moreover, suppose that D is bounded. Then C + D ≠ ∅ is closed and convex.
Proof
We have already shown non-emptiness and convexity in the previous theorem.
Let (xn + yn )n∈N be a convergent sequence in C + D. Say that xn + yn → z.
Since D is bounded, there is a subsequence (ykn )n∈N such that ykn → y ∈ D. It follows
that
xkn = (xkn + ykn) − ykn → z − y ∈ C
by the closedness of C. Hence z = (z − y) + y ∈ C + D.
Note that without boundedness, the sum of two closed sets need not be closed!
Theorem 1.6.3
Let C ⊆ Rn be convex and λ1 , λ2 ≥ 0. Then
(λ1 + λ2 )C = λ1 C + λ2 C.
Proof
(⊆) This is always true, even if C is not convex.
(⊇) Without loss of generality, we may assume that λ1 + λ2 > 0. By convexity, we have
(λ1/(λ1 + λ2)) C + (λ2/(λ1 + λ2)) C ⊆ C.
In other words, λ1 C + λ2 C ⊆ (λ1 + λ2)C.
We will write
B(x; ε) := {y ∈ Rn : ‖y − x‖ ≤ ε},
B := B(0; 1).
Definition 1.7.1 (Interior)
The interior of C ⊆ Rn is
int C := {x ∈ C : B(x; ε) ⊆ C for some ε > 0}.
Proposition 1.7.1
Let C ⊆ Rn. Suppose that int C ≠ ∅. Then int C = ri C.
Proof
Let x ∈ int C. There is some ε > 0 such that B(x; ε) ⊆ C. Hence
Rn = aff(B(x; ε)) ⊆ aff C ⊆ Rn.
For an affine set A, the corresponding linear subspace is
L := A − A.
It may be useful to consider
A − A = ⋃_{a∈A} (A − a).
Proposition 1.7.2
Let C ⊆ Rn be convex. For all x ∈ int C and y ∈ C̄,
[x, y) ⊆ int C.
Proof
Let λ ∈ [0, 1). We argue that (1 − λ)x + λy ∈ int C. It suffices to show that
(1 − λ)x + λy + εB ⊆ C
for some ε > 0.
Theorem 1.7.3
Let C ⊆ Rn be convex. Then for all x ∈ ri C and y ∈ C̄,
[x, y) ⊆ ri C.
Proof
Case I: int C 6= ∅ This follows by the observation that ri C = int C.
Case II: int C = ∅ We must have dim C = m < n. Let L := aff C − aff C be the corre-
sponding linear subspace of dimension m.
Through translation by −c for some c ∈ C if necessary, we may assume without loss of generality that C ⊆ L ≅ Rm.
Theorem 1.7.4
Let C ⊆ Rn be convex. The following hold:
(i) C̄ is convex.
(ii) int C is convex.
(iii) If int C ≠ ∅, then int C = int C̄ and C̄ = cl(int C).
Proof (i)
Let x, y ∈ C̄ and λ ∈ (0, 1). There are sequences (xn), (yn) in C such that
xn → x, yn → y.
By convexity, λxn + (1 − λ)yn ∈ C, and λxn + (1 − λ)yn → λx + (1 − λ)y, so λx + (1 − λ)y ∈ C̄. Hence C̄ is convex.
Proof (ii)
If int C = ∅, the conclusion is clear.
Proof (iii)
Since C ⊆ C̄, it must hold that int C ⊆ int C̄.
Conversely, let y ∈ int C̄. If y ∈ int C, then we are done. Thus suppose otherwise. There is some ε > 0 such that B(y; ε) ⊆ C̄. We may thus choose some x ∈ int C with x ≠ y and λ > 0 sufficiently small such that y + λ(y − x) ∈ C̄.
By a previous proposition applied with the endpoint y + λ(y − x) ∈ C̄, we have y ∈ [x, y + λ(y − x)) ⊆ int C. It follows by the arbitrary choice of y that int C̄ ⊆ int C. We now turn to the second identity.
Since int C ⊆ C, we must have cl(int C) ⊆ C̄. Conversely, let y ∈ C̄ and x ∈ int C. For λ ∈ [0, 1), define
yλ := (1 − λ)x + λy.
Then yλ ∈ [x, y) ⊆ int C, and yλ → y as λ → 1⁻, so y ∈ cl(int C).
Theorem 1.7.5
Let C ⊆ Rn be convex. Then ri C, C̄ are convex.
Moreover,
C ≠ ∅ ⇐⇒ ri C ≠ ∅.
We say C1, C2 ⊆ Rn are strongly separated by b ≠ 0 if
sup_{c1∈C1} ⟨c1, b⟩ < inf_{c2∈C2} ⟨c2, b⟩.
Theorem 1.8.1
Let ∅ ≠ C ⊆ Rn be closed and convex and suppose x ∉ C. Then x is strongly separated from C.
separated from C.
Proof
The goal is to find some b 6= 0 such that
hy − p, x − pi ≤ 0 ∀y ∈ C
hy − (x − b), x − (x − b)i ≤ 0 p=x−b
hy − x, bi ≤ −hb, bi
= −kbk2
suphy, bi − hx, bi ≤ −kbk2
y∈C
<0
as desired.
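The proof is constructive, which makes it easy to illustrate numerically (our own sketch, using the unit ball as C): build b := x − PC(x) and check the strict gap between the support value over C and ⟨x, b⟩.

```python
import numpy as np

def proj_unit_ball(x):
    # projection onto the closed unit ball
    return x / max(np.linalg.norm(x), 1.0)

x = np.array([2.0, 2.0])               # a point outside the unit ball
p = proj_unit_ball(x)
b = x - p                              # separating vector from the proof; b != 0
# For the unit ball, sup_{c in C} <c, b> = ||b|| (attained at c = b/||b||).
sup_C = np.linalg.norm(b)
assert sup_C < np.dot(x, b)            # strong separation of x from C
```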
Corollary 1.8.1.1
Let C1 ∩ C2 = ∅ be nonempty subsets of Rn such that C1 − C2 is closed and convex.
Then C1 , C2 are strongly separated.
Proof
By definition, C1, C2 are strongly separated if and only if there is b ≠ 0 such that
sup_{c1∈C1} ⟨c1, b⟩ < inf_{c2∈C2} ⟨c2, b⟩,
which holds if and only if 0 is strongly separated from C1 − C2.
Since C1 ∩ C2 = ∅, we know that 0 ∉ C1 − C2. Hence C1 − C2 is strongly separated from 0 and the conclusion follows.
Corollary 1.8.1.2
Let ∅ ≠ C1, C2 ⊆ Rn be closed and convex such that C1 ∩ C2 = ∅ and C2 is bounded. Then C1, C2 are strongly separated.
Proof
C1 ∩ C2 = ∅ implies 0 ∉ C1 − C2. In addition, −C2 is also closed, bounded, and convex. It follows by a previous theorem that C1 + (−C2) is nonempty, closed, and convex, so the previous corollary applies.
Theorem 1.8.2
Let ∅ 6= C1 , C2 ⊆ Rn be closed and convex such that C1 ∩ C2 = ∅. Then C1 , C2 are
separated.
Proof
For each n ∈ N, set
Dn := C2 ∩ B(0; n).
Observe that C1 ∩ Dn = ∅ for all n. Moreover, Dn is bounded by construction.
But the sequence (un) is bounded, hence there is a convergent subsequence (ukn) where ukn → u with ‖u‖ = 1.
1.9 More Convex Sets
Proposition 1.9.1
Let C ⊆ Rn. The following hold:
(i) cone C = R++ C
(ii) the smallest closed cone containing C is cl(cone C)
(iii) cone(conv C) = conv(cone C)
(iv) cl(cone(conv C)) = cl(conv(cone C))
The proofs of all these are trivial if C = ∅. Thus in our proofs, we assume that C is
nonempty.
Proof (i)
Set D := R++ C. It is clear that C ⊆ D with D being a cone. Hence cone C ⊆ D.
Conversely, for y ∈ D, there is some λ > 0, c ∈ C for which y = λc. Then y ∈ cone C and
D ⊆ cone C.
Proof (ii)
cone(C) is a closed cone with C ⊆ cone(C). Hence
cone(C) ⊆ cone C.
22
Proof (iii)
(⊆) Let x ∈ cone(conv C). By (i), there are λ > 0 and y ∈ conv C such that x = λy. Since y ∈ conv C, we can express it as a convex combination y = Σ_{i=1}^m λi xi with each xi ∈ C. Then
x = λy = λ Σ_{i=1}^m λi xi = Σ_{i=1}^m λi (λxi) ∈ conv(cone C).
(⊇) Let x ∈ conv(cone C). We can write x as a convex combination of scalar multiples of elements of C:
x = Σ_{i=1}^m µi (λi xi).
Assuming x ≠ 0 (the case x = 0 is clear),
x = (Σ_{i=1}^m λi µi) Σ_{i=1}^m (λi µi / Σ_{j=1}^m λj µj) xi =: α Σ_{i=1}^m βi xi,
a positive multiple of a convex combination of elements of C, which lies in cone(conv C).
Proof (iv)
This is a direct consequence of iii.
Lemma 1.9.2
Let 0 ∈ C ⊆ Rn be convex with int C ≠ ∅. The following are equivalent:
(i) 0 ∈ int C
(ii) cone C = Rn
(iii) cl(cone C) = Rn
Proof
(i) =⇒ (ii) Suppose 0 ∈ int C. Then B(0; ε) ⊆ C for some ε > 0. But then
Rn = cone(B(0; ε)) ⊆ cone C ⊆ Rn.
(ii) =⇒ (iii) Rn = cone C ⊆ cl(cone C) ⊆ Rn.
(iii) =⇒ (i) Since C is convex,
conv(cone C) = cone C.
By assumption,
∅ ≠ int C ⊆ int(cone C),
so cone C has nonempty interior. Recall that
int(cone C) = int(cl(cone C))
as cone C is convex with nonempty interior. Hence
Rn = int Rn = int(cl(cone C)) = int(cone C) = cone(int C),
so 0 = λc for some λ > 0 and c ∈ int C, which forces c = 0 ∈ int C.
Definition 1.9.4 (Tangent Cone)
Let ∅ ≠ C ⊆ Rn with x ∈ Rn. The tangent cone to C at x is
TC(x) := cl(cone(C − x)) = cl(⋃_{λ∈R++} λ(C − x)) if x ∈ C, and TC(x) := ∅ if x ∉ C.
Theorem 1.9.3
Let ∅ ≠ C ⊆ Rn be closed and convex and let x ∈ C. Both NC(x) and TC(x) are closed convex cones.
Lemma 1.9.4
Let ∅ ≠ C ⊆ Rn be closed and convex with x ∈ C. Then n ∈ NC(x) if and only if ⟨n, t⟩ ≤ 0 for every t ∈ TC(x).
Proof
(⇒) Let n ∈ NC(x) and t ∈ TC(x). Recall that TC(x) = cl(cone(C − x)). Thus there are λk > 0 and tk ∈ Rn such that
x + λk tk ∈ C
and tk → t.
Since n ∈ NC (x) and x + λk tk ∈ C, it follows that for all k, hn, λk tk i ≤ 0. But then as
k → ∞ we see that
hn, ti ≤ 0.
(⇐) Suppose that ∀t ∈ TC(x), we have ⟨n, t⟩ ≤ 0. Pick y ∈ C and observe that
y − x ∈ C − x ⊆ cone(C − x) ⊆ cl(cone(C − x)) =: TC(x).
Hence ⟨n, y − x⟩ ≤ 0, and since y ∈ C was arbitrary, n ∈ NC(x).
Theorem 1.9.5
Let C ⊆ Rn be convex such that int C 6= ∅. Let x ∈ C. The following are equivalent.
(1) x ∈ int C
(2) TC (x) = Rn
(3) NC (x) = {0}
Proof
(1) ⇐⇒ (2) Observe that x ∈ int C if and only if 0 ∈ int(C − x) if and only if there is
some ε > 0 with
B(0; ε) ⊆ C − x.
Now,
Rn = cone(B(0; ε)) ⊆ cone(C − x) ⊆ cl(cone(C − x)) = TC(x) ⊆ Rn.
Indeed, by the projection theorem
hy − p, t − pi ≤ 0
for all t ∈ TC(x). In particular, it holds for t = 0, 2p ∈ TC(x) (TC(x) is a cone containing 0). So
⟨y − p, ±p⟩ ≤ 0 =⇒ ⟨y − p, p⟩ = 0.
But then hy − p, ti ≤ 0 for all t ∈ TC (x), which implies that y − p ∈ NC (x) = {0} and
y = p ∈ TC (x)
as desired.
Chapter 2
Convex Functions
Definition 2.1.5 (Lower Semicontinuous)
f is lower semicontinuous (l.s.c.) if epi(f ) is closed.
Proposition 2.1.1
Let f : Rn → [−∞, ∞] be convex. Then dom f is convex.
Proof
Consider the linear transformation L : Rn+1 → Rn given by
(x, α) ↦ x.
Then dom f = L(epi f) is the image of a convex set under a linear map, hence convex.
Theorem 2.1.2
Let f : Rm → [−∞, ∞]. Then f is convex if and only if for all x, y ∈ dom f and
λ ∈ (0, 1),
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).
Proof
If f ≡ ∞ (equivalently epi f = ∅, equivalently dom f = ∅), then the result is trivial. Hence let us suppose that f ≢ ∞, i.e. dom f ≠ ∅.
( =⇒ ) Pick x, y ∈ dom f and λ ∈ (0, 1). Observe that (x, f (x)), (y, f (y)) ∈ epi f . By
convexity,
λ(x, f(x)) + (1 − λ)(y, f(y)) = (λx + (1 − λ)y, λf(x) + (1 − λ)f(y)) ∈ epi(f),
so f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
( ⇐= ) Conversely, suppose the function inequality holds. Pick (x, α), (y, β) ∈ epi f as
well as λ ∈ (0, 1). Now,
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) ≤ λα + (1 − λ)β,
and hence
(λx + (1 − λ)y, λα + (1 − λ)β) ∈ epi f
as desired.
Remark that continuity implies lower semicontinuity. One can show that the two definitions
of l.s.c. are equivalent, but we omit the proof.
Theorem 2.2.1
Let C ⊆ Rm . Then the following hold:
(i) C 6= ∅ if and only if δC is proper
(ii) C is convex if and only if δC is convex
(iii) C is closed if and only if δC is l.s.c.
Proof (iii)
Observe that C = ∅ ⇐⇒ epi δC = ∅, which is certainly closed. Thus we proceed
assuming C 6= ∅.
Pick a convergent sequence (xn, αn) → (x, α) with every element in epi δC. Observe that (xn) is a sequence in C, and C is closed, hence x ∈ C. Moreover, αn ∈ [0, ∞) and α ≥ 0.
By the definition of δC , it suffices to show that δC (x) = 0.
By lower semicontinuity,
0 ≤ δC (x)
≤ lim inf δC (xn )
=0
Proposition 2.2.2
Let I be an indexing set and let (fi )i∈I be a family of l.s.c. convex functions on Rn .
Then
F := sup_{i∈I} fi
is convex and l.s.c.
Proof
We claim that epi F = ⋂_{i∈I} epi fi. Indeed,
The result follows by the definition of convex functions and lower semicontinuity as inter-
sections preserve both set convexity and closedness.
The support function of C is
σC : u ↦ sup_{c∈C} ⟨c, u⟩.
Proposition 2.3.1
Let ∅ 6= C ⊆ Rn . Then σC is convex, l.s.c., and proper.
Proof
For each c ∈ C, define
fc(x) := ⟨x, c⟩.
Then fc is linear and hence proper, l.s.c., and convex. Moreover,
σC = sup fc .
c∈C
Combined with our previous proposition, we learn that σC is convex and l.s.c. For properness, pick any c̄ ∈ C; then for each u,
σC(u) = sup_{c∈C} ⟨u, c⟩ ≥ ⟨u, c̄⟩ > −∞.
Let f : Rm → [−∞, ∞] be proper. Then f is strictly convex if for every x ≠ y ∈ dom f and λ ∈ (0, 1),
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).
Moreover, f is strongly convex with constant β > 0 if for every x, y ∈ dom f, λ ∈ (0, 1),
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − (β/2) λ(1 − λ)‖x − y‖².
Clearly, strong convexity implies strict convexity, which in turn implies convexity.
2.5 Operations Preserving Convexity
Proposition 2.5.1
Let I be a finite indexing set and (fi )i∈I a family of convex functions Rm → [−∞, ∞].
Then
Σ_{i∈I} fi
is convex.
Proposition 2.5.2
Let f be convex and l.s.c. and pick λ > 0. Then
λf
is convex and l.s.c.
2.6 Minimizers
We say x̄ ∈ dom f is a (global) minimizer of f if for all x ∈ Rm,
f(x̄) ≤ f(x).
Proposition 2.6.1
Let f : Rm → (−∞, ∞] be proper and convex. Then every local minimizer of f is a
global minimizer.
Proof
Let x be a local minimizer of f. There is some ρ > 0 such that f(x) ≤ f(z) for all z ∈ dom f ∩ B(x; ρ). Let y ∈ dom f; we may assume ‖y − x‖ > ρ, since otherwise the local inequality already gives f(x) ≤ f(y). Set λ := 1 − ρ/‖y − x‖ ∈ (0, 1) and
z := λx + (1 − λ)y ∈ dom f.
We know this is in the domain as dom f is convex by our prior work. We have
z − x = (1 − λ)y − (1 − λ)x = (1 − λ)(y − x),
so
‖z − x‖ = ‖(1 − λ)(y − x)‖ = (ρ/‖y − x‖) ‖y − x‖ = ρ.
By the convexity of f ,
f (x) ≤ f (z)
≤ λf (x) + (1 − λ)f (y)
(1 − λ)f (x) ≤ (1 − λ)f (y)
f (x) ≤ f (y).
Proposition 2.6.2
Let f : Rm → (−∞, ∞] be proper and convex. Let C ⊆ Rm . Suppose that x is a
minimizer of f over C such that x ∈ int C. Then x is a minimizer of f .
Proof
There is some ε > 0 such that x minimizes f over B(x; ε) ⊆ int C. Since x is thus a local minimizer, it is a global minimizer as well.
2.7 Conjugates
Recall that a closed convex set is the intersection of the closed halfspaces containing it. The idea is that the epigraph of a convex, l.s.c. function f can be recovered as the supremum of the affine functions majorized by f.
f (x) ≥ hu, xi − α ∀x ∈ Rn
α ≥ hu, xi − f (x) ∀x ∈ Rn .
Thus f ∗ (u) := supx∈Rn hu, xi−f (x) is the best translation such that hu, xi−f ∗ (u) is majorized
by f .
Proposition 2.7.1
Let f : Rm → [−∞, ∞]. Then f ∗ is convex and l.s.c.
Proof
Observe that f ≡ ∞ ⇐⇒ dom f = ∅. Hence if f ≡ ∞, for all u ∈ Rm,
f*(u) = sup_{x∈∅} ⟨u, x⟩ − f(x) = −∞,
and f* ≡ −∞ is convex and l.s.c.
Now suppose that f ≢ ∞. We claim that f*(u) = sup_{(x,α)∈epi f} ⟨x, u⟩ − α; indeed, for fixed x the supremum over α ≥ f(x) of ⟨x, u⟩ − α is approached as α ↓ f(x). Observe that
f_{(x,α)} := ⟨x, ·⟩ − α is an affine function. But then
f*(u) = sup_{(x,α)∈epi f} f_{(x,α)}(u)
is a supremum of convex and l.s.c. (affine) functions, which is convex and l.s.c. by our earlier work.
Example 2.7.2
Let 1 < p, q be such that
1/p + 1/q = 1.
Then for f(x) := |x|^p / p,
f*(u) = |u|^q / q.
Example 2.7.3
Let f(x) := e^x. Then
f*(u) = u ln u − u if u > 0; 0 if u = 0; ∞ if u < 0.
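Both conjugate formulas can be checked by brute force, approximating the supremum over a fine grid (our own sketch; the helper name conj is ours):

```python
import numpy as np

xs = np.linspace(-20, 20, 400001)            # grid for the sup over x

def conj(f_vals, u):
    """Grid approximation of f*(u) = sup_x (u*x - f(x))."""
    return np.max(u * xs - f_vals)

p, q = 3.0, 1.5                              # 1/p + 1/q = 1
f_pow = np.abs(xs) ** p / p
f_exp = np.exp(xs)

for u in (0.5, 1.0, 2.0):
    assert abs(conj(f_pow, u) - abs(u) ** q / q) < 1e-3     # f = |x|^p / p
    assert abs(conj(f_exp, u) - (u * np.log(u) - u)) < 1e-3  # f = e^x, u > 0
```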
Example 2.7.4
Let C ⊆ Rm , then
δC∗ = σC .
37
By definition,
δC*(u) = sup_{x∈Rm} ⟨x, u⟩ − δC(x) = sup_{y∈C} ⟨y, u⟩ = σC(u).
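For a concrete instance (our own sketch), take C = [−1, 1]²: the sup of a linear functional over C is attained at a vertex, so δC* = σC has the closed form |u1| + |u2|:

```python
import numpy as np

# Vertices (extreme points) of the box C = [-1, 1]^2.
vertices = np.array([[s1, s2] for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)])

def sigma_box(u):
    # sup_{c in C} <c, u> over a box is attained at an extreme point
    return np.max(vertices @ u)

assert sigma_box(np.array([1.0, 2.0])) == 3.0    # |1| + |2|
assert sigma_box(np.array([-0.5, 3.0])) == 3.5   # |-0.5| + |3|
```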
The idea is that for a differentiable convex function, the derivative at x ∈ Rn is the slope of the tangent line at x, which lies below f. If f is not differentiable at x, we can still ask for slopes of lines through (x, f(x)) which lie below f.
Proof
Let x ∈ Rm .
Example 2.8.2
Consider f (x) = |x|. Then
∂f(x) = {−1} if x < 0; [−1, 1] if x = 0; {1} if x > 0.
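The set ∂f(0) = [−1, 1] can be checked directly against the subgradient inequality f(y) ≥ f(0) + u·y (a small sketch of ours):

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 1001)
# every u in [-1, 1] is a subgradient of |.| at 0: |y| >= u*y for all y
for u in np.linspace(-1.0, 1.0, 21):
    assert np.all(np.abs(ys) >= u * ys - 1e-12)
# a slope outside [-1, 1] fails the inequality somewhere
assert not np.all(np.abs(ys) >= 1.5 * ys)
```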
Lemma 2.8.3
Let f : Rm → (−∞, ∞] be proper. Then
dom ∂f ⊆ dom f.
Proof
We argue by the contrapositive: suppose x ∉ dom f. Then f(x) = ∞ and ∂f(x) = ∅.
Proposition 2.8.4
Let ∅ ≠ C ⊆ Rm be closed and convex. Then for each x ∈ C,
∂δC(x) = NC(x).
Proof
Let u ∈ Rm and x ∈ C = dom δC. Then u ∈ ∂δC(x) if and only if for all y ∈ C, ⟨y − x, u⟩ ≤ δC(y) − δC(x) = 0, i.e. u ∈ NC(x).
Consider the constrained optimization problem min f (x), x ∈ C, where f is proper, convex,
l.s.c. and C 6= ∅ is closed and convex. We can rephrase this as min f (x) + δC (x).
In some cases, ∂(f + δC ) = ∂f + ∂δC = ∂f + NC (x). Thus by Fermat’s theorem, we look for
some x where
0 ∈ ∂f (x) + NC (x).
The main question we are concerned with is whether the subdifferential operator is additive.
Proposition 2.9.1
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then
In particular,
ri(dom f) = ri(dom ∂f),
cl(dom f) = cl(dom ∂f).
A problem with the definition of separated is that a set can be separated from itself. Indeed,
the x-axis is separated from itself with itself as a separating hyperplane. To be properly
separated, there must be some c1 ∈ C1, c2 ∈ C2 such that ⟨c1, b⟩ < ⟨c2, b⟩.
Proposition 2.9.2
Let ∅ 6= C1 , C2 ⊆ Rm be convex. Then C1 , C2 are properly separated if and only if
ri C1 ∩ ri C2 = ∅.
Proposition 2.9.3
Let C1 , C2 ⊆ Rm be convex. Then
ri(C1 + C2 ) = ri C1 + ri C2 .
Moreover,
ri(λC) = λ(ri C)
for all λ ∈ R.
Proposition 2.9.4
Let C1 ⊆ Rm and C2 ⊆ Rp be convex. Then
ri(C1 ⊕ C2 ) = ri C1 ⊕ ri C2 .
Theorem 2.9.5
Let C1, C2 ⊆ Rm be convex such that ri C1 ∩ ri C2 ≠ ∅. For each x ∈ C1 ∩ C2,
N_{C1∩C2}(x) = N_{C1}(x) + N_{C2}(x).
Proof
The reverse inclusion is not hard. Hence we check the forward inclusion (⊆) only.
Let n ∈ N_{C1∩C2}(x), so that for all y ∈ C1 ∩ C2,
⟨n, y − x⟩ ≤ 0.
Define E1 := C1 × [0, ∞) and E2 := {(y, α) : y ∈ C2, α ≤ ⟨n, y − x⟩}. By a previous fact,
ri E1 = ri C1 × (0, ∞).
Similarly,
ri E2 = {(y, α) : y ∈ ri C2, α < ⟨n, y − x⟩}.
If (y, α) ∈ ri E1 ∩ ri E2, then 0 < α < ⟨n, y − x⟩ ≤ 0, which is impossible.
It follows by a previous fact that E1 , E2 are properly separated. Namely, there is (b, γ) ∈
Rm × R \ {0} such that for all (y, α) ∈ E1 and (z, β) ∈ E2,
⟨y, b⟩ + αγ ≤ ⟨z, b⟩ + βγ.
Taking (x, 1) ∈ E1 and (x, 0) ∈ E2 gives ⟨x, b⟩ + γ ≤ ⟨x, b⟩ =⇒ γ ≤ 0.
41
Next we claim that γ ≠ 0. Suppose to the contrary that γ = 0. But then b ≠ 0 and ⟨y, b⟩ ≤ ⟨z, b⟩ for all y ∈ C1, z ∈ C2, so C1, C2 are properly separated. From our earlier fact, this contradicts the assumption that ri C1 ∩ ri C2 ≠ ∅. Altogether, γ < 0.
First, we claim that b ∈ N_{C1}(x). Indeed, taking (y, 0) ∈ E1 for y ∈ C1 and (x, 0) ∈ E2 in the separation inequality gives
⟨y, b⟩ + 0 · γ ≤ ⟨x, b⟩ + 0 · γ.
Now, for all y ∈ C2, (y, ⟨n, y − x⟩) ∈ E2 by construction. Hence for all y ∈ C2, taking (x, 0) ∈ E1,
⟨x, b⟩ ≤ ⟨y, b⟩ + γ⟨n, y − x⟩.
Dividing by γ < 0 and rearranging, equivalently,
⟨b/γ + n, y − x⟩ ≤ 0.
This shows that
b/γ + n ∈ N_{C2}(x).
Finally, since N_{C1}(x) is a cone and −1/γ > 0, we conclude n = (−1/γ)b + (b/γ + n) ∈ N_{C1}(x) + N_{C2}(x).
Proposition 2.9.6
Let f : Rm → (−∞, ∞] be convex, l.s.c. and proper. Let x, u ∈ Rm. Then
Proof
Observe that epi f 6= ∅ and is convex since f is proper and convex. Now let u ∈ Rm .
Then
Theorem 2.9.7
Let f, g : Rm → (−∞, ∞] be convex, l.s.c., and proper. Suppose that ri dom f ∩ ri dom g ≠ ∅. Then for all x ∈ Rm,
∂(f + g)(x) = ∂f(x) + ∂g(x).
Proof
Let x ∈ Rm. If x ∉ dom(f + g) = dom f ∩ dom g, then ∂f(x) + ∂g(x) = ∅ and also ∂(f + g)(x) = ∅.
Suppose now that x ∈ dom f ∩ dom g = dom(f + g). Let u ∈ ∂(f + g)(x). Define E1 := epi f × R and E2 := {(y, α, β) ∈ Rm × R × R : g(y) ≤ β}. We claim that
(u, −1, −1) ∈ N_{E1∩E2}(x, f(x), g(x)).
Indeed, let (y, α, β) ∈ E1 ∩ E2. We have by construction f(y) − α ≤ 0 and g(y) − β ≤ 0, so
⟨u, y − x⟩ − (α − f(x)) − (β − g(x)) ≤ ⟨u, y − x⟩ − (f(y) − f(x)) − (g(y) − g(x)) ≤ 0,
using u ∈ ∂(f + g)(x).
Now,
ri E1 = ri(epi f × R) = ri(epi f) × R.
Similarly,
ri E2 = {(y, α, β) ∈ Rm × R × R : g(y) < β}.
Pick z ∈ ri dom f ∩ ri dom g. Then (z, f(z) + 1, g(z) + 1) ∈ ri E1 ∩ ri E2, so ri E1 ∩ ri E2 ≠ ∅.
NE1 ∩E2 (x, f (x), g(x)) = NE1 (x, f (x), g(x)) + NE2 (x, f (x), g(x)).
Now, it can be shown that Nepi f ×R = Nepi f × NR and similarly for E2 . Therefore, there
are some u1, u2 ∈ Rm and α, β ∈ R with (u, −1, −1) = (u1, α, 0) + (u2, 0, β), which forces α = β = −1. Since (u1, −1) ∈ N_{epi f}(x, f(x)) and (u2, −1) ∈ N_{epi g}(x, g(x)), we conclude
u = u1 + u2 ∈ ∂f(x) + ∂g(x),
as desired.
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper, and let ∅ ≠ C ⊆ Rm be closed and convex. Furthermore, suppose ri C ∩ ri dom f ≠ ∅. Consider the problem
min f (x) (P )
x∈C
Indeed, we convert this to the unconstrained minimization problem min f +δC . This function
is convex, l.s.c., and proper. By Fermat’s theorem, x̄ solves P if and only if
0 ∈ ∂(f + δC )(x̄).
Now, ri dom f ∩ ri dom δC = ri dom f ∩ ri C ≠ ∅. Hence by the previous theorem, x̄ solves (P) if and only if
0 ∈ ∂f(x̄) + ∂δC(x̄) = ∂f(x̄) + NC(x̄).
Example 2.9.8
Let d ∈ Rm and ∅ 6= C ⊆ Rm be convex and closed. Consider
min_{x∈C} ⟨d, x⟩ (P)
Then x̄ solves (P) if and only if
−d ∈ NC(x̄).
2.10 Differentiability
Definition 2.10.2 (Differentiable)
Let f : Rm → (−∞, ∞] be proper and x ∈ dom f. f is differentiable at x if there is a vector ∇f(x) ∈ Rm, called the gradient of f at x, that satisfies
lim_{0≠‖y‖→0} |f(x + y) − f(x) − ⟨∇f(x), y⟩| / ‖y‖ = 0.
Theorem 2.10.1
Let f : Rm → (−∞, ∞] be convex. Suppose f (x) < ∞. For each y, the quotient in
the definition of f 0 (x; y) is a non-decreasing function of λ > 0. So f 0 (x; y) exists and
f'(x; y) = inf_{λ>0} (f(x + λy) − f(x)) / λ.
Theorem 2.10.2
Let f : Rm → (−∞, ∞] be convex and proper. Let x ∈ dom f and u ∈ Rm . Then u
is a subgradient of f at x if and only if for all y ∈ Rm, ⟨u, y⟩ ≤ f'(x; y).
Proof
By definition,
Theorem 2.10.3
Let f : Rm → (−∞, ∞] be convex and proper. Suppose x ∈ dom f . If f is differen-
tiable at x, then ∇f (x) is the unique subgradient of f at x.
Proof
Recall that for each y ∈ Rm ,
f 0 (x; y) = h∇f (x), yi.
Lemma 2.10.4
Let ϕ : R → (−∞, ∞] be a proper function that is differentiable on an interval
∅ 6= I ⊆ dom ϕ. If ϕ0 is increasing on I, then ϕ is convex on I.
Proof
Fix x, y ∈ I and λ ∈ (0, 1). Let ψ : R → (−∞, ∞] be given by
ψ(z) := λϕ(x) + (1 − λ)ϕ(z) − ϕ(λx + (1 − λ)z).
Then
ψ'(z) = (1 − λ)ϕ'(z) − (1 − λ)ϕ'(λx + (1 − λ)z)
and ψ'(x) = 0 = ψ(x).
Since ϕ' is increasing, ψ'(z) ≤ 0 when z < x and ψ'(z) ≥ 0 whenever z > x. It follows that ψ achieves its infimum on I at x. In particular, 0 = ψ(x) ≤ ψ(y), i.e.
ϕ(λx + (1 − λ)y) ≤ λϕ(x) + (1 − λ)ϕ(y),
as desired.
Proposition 2.10.5
Let f : Rm → (−∞, ∞] be proper. Suppose that dom f is open and convex, and that f
is differentiable on dom f . The following are equivalent.
(i) f is convex
(ii) ∀x, y ∈ dom f, hx − y, ∇f (y)i + f (y) ≤ f (x)
(iii) ∀x, y ∈ dom f, hx − y, ∇f (x) − ∇f (y)i ≥ 0
Proof
(i) =⇒ (ii) ∇f (y) is the unique subgradient of f at y. Hence for all x ∈ Rm and y ∈ dom f ,
f (x) ≥ hx − y, ∇f (y)i + f (y).
(iii) =⇒ (i) Fix x, y ∈ dom f. By assumption, dom f is open. Thus there is some ε > 0 such that
y + (1 + ε)(x − y) = x + ε(x − y) ∈ dom f,
y − ε(x − y) = y + ε(y − x) ∈ dom f.
By the convexity of dom f, for every α ∈ (−ε, 1 + ε), y + α(x − y) ∈ dom f.
That is, ϕ0 is increasing on C and ϕ is convex on C. But then
Example 2.10.6
Let A be an m × m matrix, and let f : Rm → R be given by
2.11 Conjugacy
Proposition 2.11.1
Let f, g be functions from Rm → [−∞, ∞]. Then
(1) f ∗∗ := (f ∗ )∗ ≤ f
(2) f ≤ g =⇒ f ∗ ≥ g ∗ , f ∗∗ ≤ g ∗∗
Proof
By definition, f* takes the value −∞ at some point if and only if f ≡ ∞, in which case both claims are trivial; hence we may assume f* > −∞ everywhere.
as desired.
Proposition 2.11.3
Let f : Rm → (−∞, ∞] be convex and proper. For x, u ∈ Rm,
u ∈ ∂f(x) ⇐⇒ f(x) + f*(u) = ⟨x, u⟩.
Proof
We have
u ∈ ∂f (x)
⇐⇒ ∀y ∈ dom f, hy − x, ui + f (x) ≤ f (y)
⇐⇒ ∀y ∈ dom f, hy, ui − f (y) ≤ hx, ui − f (x)
⇐⇒ f*(u) = sup_{y∈Rm} ⟨y, u⟩ − f(y) ≤ ⟨x, u⟩ − f(x).
Since f*(u) ≥ ⟨x, u⟩ − f(x) always holds, this is equivalent to f(x) + f*(u) = ⟨x, u⟩.
Proposition 2.11.4
Let f : Rm → (−∞, ∞] be convex and proper. Pick x ∈ Rm such that ∂f(x) ≠ ∅. Then
f ∗∗ (x) = f (x).
Proof
Let u ∈ ∂f (x). By the previous proposition,
Consequently,
f**(x) = sup_{v∈Rm} ⟨x, v⟩ − f*(v) ≥ ⟨x, u⟩ − f*(u) = f(x),
where the final equality is the previous proposition applied to u ∈ ∂f(x). Conversely, f**(x) ≤ f(x) by an earlier proposition. Hence f**(x) = f(x).
Proposition 2.11.5
Let f : Rm → (−∞, ∞] be proper. Then f is convex and l.s.c. if and only if
f = f ∗∗ .
Corollary 2.11.5.1
Let f : Rm → (−∞, ∞] be convex, l.s.c. and proper. Then
(i) f ∗ is convex, l.s.c., and proper
(ii) f ∗∗ = f
Proof
To see (i), combine the previous proposition and the fact that f ∗ is always convex and
l.s.c.
Proposition 2.11.6
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then
u ∈ ∂f (x) ⇐⇒ x ∈ ∂f ∗ (u).
Proof
Recall that
u ∈ ∂f (x) ⇐⇒ f (x) + f ∗ (u) = hx, ui.
By a previous proposition, g := f* satisfies g* = f. Moreover, g is convex, l.s.c., and
proper.
Hence,
x ∈ ∂f*(u) = ∂g(u) ⇐⇒ g(u) + g*(x) = ⟨u, x⟩ ⇐⇒ f(x) + f*(u) = ⟨x, u⟩ ⇐⇒ u ∈ ∂f(x),
as desired.
2.12 Coercive Functions
Theorem 2.12.1
Let f : Rm → (−∞, ∞] be proper and l.s.c., and let C ⊆ Rm be compact with
C ∩ dom f ≠ ∅.
Then f is bounded below over C and attains its minimal value over C.
Proof
(i): Suppose towards a contradiction that f is not bounded below over C. There is a
sequence xn in C such that
lim f (xn ) = −∞.
n
A function f : Rm → (−∞, ∞] is coercive if
lim_{‖x‖→∞} f(x) = ∞.
Definition 2.12.2 (Super Coercive)
Let f : Rm → (−∞, ∞]. Then f is super coercive if
lim_{‖x‖→∞} f(x)/‖x‖ = ∞.
Theorem 2.12.2
Let f : Rm → (−∞, ∞] be proper, l.s.c., and coercive. Let C ⊆ Rm be a closed subset
of Rm satisfying
C ∩ dom f 6= ∅.
Then f attains its minimal value over C.
Proof
Let x ∈ C ∩ dom f. Since f is coercive, there is some M ≥ ‖x‖ such that f(y) > f(x) whenever ‖y‖ > M.
But then the set of minimizers of f over C is the same as the set of minimizers of f over
C ∩ B(0; M ). This set is compact. Hence by the previous theorem, f attains its minimal
value over C.
A map T is Lipschitz continuous with constant L ≥ 0 if for all x, y,
‖Tx − Ty‖ ≤ L‖x − y‖.
Example 2.13.1
Let f : Rm → R be given by
x ↦ ½⟨x, Ax⟩ + ⟨b, x⟩ + c
where A ⪰ 0 is symmetric positive semi-definite, b ∈ Rm and c ∈ R. Then
(i) ∇f(x) = Ax + b for all x ∈ Rm
(ii) ∇f is Lipschitz with constant ‖A‖, the operator norm of A
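Both claims are easy to probe numerically (our own sketch; A is a random symmetric PSD matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M @ M.T                        # symmetric positive semi-definite
b = rng.normal(size=4)

grad = lambda x: A @ x + b         # gradient of 0.5<x, Ax> + <b, x> + c
L = np.linalg.norm(A, 2)           # operator norm of A

for _ in range(200):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
```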
Example 2.13.2
Let ∅ 6= C ⊆ Rm be closed and convex. Then PC is Lipschitz continuous with constant
1.
Proof
Recall that the fundamental theorem of calculus implies that
f(y) − f(x) = ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt
= ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt.
Hence, if ∇f is L-Lipschitz, the Cauchy–Schwarz inequality bounds the integrand by Lt‖y − x‖², and integrating gives
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².
Theorem 2.13.4
Let f : Rm → R be convex and differentiable and L > 0. The following are equivalent:
(i) ∇f is L-Lipschitz
(ii) for all x, y ∈ Rm, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²
(iii) for all x, y ∈ Rm, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖²
(iv) for all x, y ∈ Rm, ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²
Proof
(i) =⇒ (ii): This is the descent lemma.
(ii) =⇒ (iii): If ∇f(x) = ∇f(y), then this follows immediately from the subgradient inequality and the fact that ∂f(x) = {∇f(x)}.
Indeed,
By construction, ∇hx (x) = 0. But the convexity of hx then asserts that x is a global
minimizer of hx . That is, for all z ∈ Rn ,
hx (x) ≤ hx (z).
Pick y, v ∈ Rm such that ‖v‖ = 1 and ⟨∇hx(y), v⟩ = ‖∇hx(y)‖. Set
z := y − (‖∇hx(y)‖/L) v.
Then
0 = hx(x) ≤ hx(z) = hx(y − (‖∇hx(y)‖/L) v).
Applying (ii) to hx, whose gradient ∇hx(·) = ∇f(·) − ∇f(x) is also L-Lipschitz,
0 = hx(x)
≤ hx(y) − (‖∇hx(y)‖/L)⟨∇hx(y), v⟩ + (1/(2L))‖∇hx(y)‖²‖v‖²
= hx(y) − (1/L)‖∇hx(y)‖² + (1/(2L))‖∇hx(y)‖²
= hx(y) − (1/(2L))‖∇hx(y)‖²
= f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2L))‖∇f(x) − ∇f(y)‖².
That is,
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖².
(iii) =⇒ (iv): Adding this inequality to its counterpart with the roles of x and y swapped,
f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (1/(2L))‖∇f(y) − ∇f(x)‖²,
yields (iv).
(iv) =⇒ (i): If ∇f (x) = ∇f (y), the implication is trivial. We proceed assuming otherwise.
We have
Example 2.13.5 (Firm Nonexpansiveness)
Let ∅ ≠ C ⊆ Rm be closed and convex. Then PC is firmly nonexpansive: for each x, y ∈ Rm,
‖PC x − PC y‖² ≤ ⟨x − y, PC x − PC y⟩.
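Firm nonexpansiveness is again easy to probe numerically with the unit-ball projection (our own sketch):

```python
import numpy as np

def proj_unit_ball(x):
    # projection onto the closed unit ball
    return x / max(np.linalg.norm(x), 1.0)

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.normal(size=3) * 3, rng.normal(size=3) * 3
    px, py = proj_unit_ball(x), proj_unit_ball(y)
    # ||Px - Py||^2 <= <x - y, Px - Py>
    assert np.dot(px - py, px - py) <= np.dot(x - y, px - py) + 1e-12
```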
Example 2.13.6
Let ∅ 6= C ⊆ Rm be closed and convex. Let f : Rm → R be given by
1
f (x) = d2C (x).
2
Then f is differentiable with ∇f = Id − PC.
Theorem 2.13.7
Let f : Rm → R be convex and twice continuously differentiable, and let L ≥ 0. Then ∇f is L-Lipschitz if and only if ‖∇²f(x)‖ ≤ L for all x ∈ Rm.
Proof
(i) =⇒ (ii) Suppose that ∇f is L-Lipschitz continuous. For any y ∈ Rm and α > 0,
That is,
‖∇²f(x)(y)‖ = lim_{α→0⁺} ‖∇f(x + αy) − ∇f(x)‖/α ≤ L‖y‖.
Equivalently,
‖∇²f(x)‖ ≤ L
as desired. Note that we used the fact that ∇²f(x)(y) = (∇f)'(x; y).
(ii) =⇒ (i) Suppose that k∇2 f (x)k ≤ L and fix x, y ∈ Rm . By the fundamental theorem
of calculus,
∇f(x) = ∇f(y) + ∫₀¹ ∇²f(y + α(x − y))(x − y) dα
= ∇f(y) + (∫₀¹ ∇²f(y + α(x − y)) dα)(x − y).
Hence
‖∇f(x) − ∇f(y)‖ ≤ ‖∫₀¹ ∇²f(y + α(x − y)) dα‖ · ‖x − y‖
≤ ∫₀¹ ‖∇²f(y + α(x − y))‖ dα · ‖x − y‖
≤ L‖x − y‖.
Proposition 2.13.8
For a symmetric A ∈ Rm×m,
‖A‖ = max{|λ| : λ is an eigenvalue of A}.
Proof
Write x as a linear combination of some orthonormal eigenvector basis of A.
Proposition 2.13.9
A twice continuously differentiable function f : Rm → R is convex if and only if ∇²f(x) is positive semi-definite for every x ∈ Rm.
Proof
See A3.
Corollary 2.13.9.1
Let f : Rm → R be convex and twice continuously differentiable. Suppose L ≥ 0. Then
∇f is L-Lipschitz if and only if for all x ∈ Rm,
λmax(∇²f(x)) ≤ L.
Proof
Since f is convex and twice continuously differentiable, ∇²f(x) is positive semidefinite everywhere. Combined with the earlier result,
L ≥ ‖∇²f(x)‖ = |λmax(∇²f(x))| = λmax(∇²f(x)).
Example 2.13.10
Let f : Rm → R be given by
x ↦ √(1 + ‖x‖²).
Then
(i) f is convex
(ii) ∇f is 1-Lipschitz
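Here ∇f(x) = x/√(1 + ‖x‖²); a quick numerical check of both claims (our own sketch):

```python
import numpy as np

f = lambda x: np.sqrt(1 + np.dot(x, x))
grad = lambda x: x / np.sqrt(1 + np.dot(x, x))

rng = np.random.default_rng(4)
for _ in range(500):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # midpoint convexity
    assert f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12
    # 1-Lipschitz gradient
    assert np.linalg.norm(grad(x) - grad(y)) <= np.linalg.norm(x - y) + 1e-12
```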
Proposition 2.13.11
Let β > 0. f : Rm → (−∞, ∞] is β-strongly convex if and only if
f − (β/2)‖·‖²
is convex.
Proof
See A3.
Proposition 2.13.12
Let f, g : Rm → (−∞, ∞] and β > 0. Suppose that f is β-strongly convex and that g is
convex. Then f + g is β-strongly convex.
Proof
Define
h := (f − (β/2)‖·‖²) + g.
Then h is convex as it is the sum of two convex functions, and h = (f + g) − (β/2)‖·‖². Thus applying the previous proposition yields the result.
Proposition 2.13.13
Let f : Rm → (−∞, ∞] be strongly convex, l.s.c., and proper. Then f has a unique
minimizer.
Theorem 2.14.1
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then for every x ∈ Rm ,
Proxf (x) is a singleton.
Proof
For a fixed x ∈ Rm,
hx := ½‖· − x‖²
is 1-strongly convex. Therefore,
gx := f + hx
is strongly convex.
We know that gx is l.s.c. as f, hx are l.s.c. Moreover, gx is proper as f, hx are proper with dom gx = dom f. Thus from the previous proposition, gx has a unique minimizer, namely Proxf(x).
Example 2.14.2
For ∅ 6= C ⊆ Rm closed and convex,
ProxδC = PC .
Proposition 2.14.3
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Let x, p ∈ Rm . Then p = Proxf (x)
if and only if for all y ∈ Rm ,
hy − p, x − pi + f (p) ≤ f (y).
Proof
( =⇒ ) Suppose that p = Proxf (x). For each λ ∈ (0, 1), set
pλ := λy + (1 − λ)p.
Thus, since p minimizes u ↦ f(u) + ½‖u − x‖²,
f(p) ≤ f(pλ) + ½‖x − pλ‖² − ½‖x − p‖²
= f(pλ) + ½‖x − λy − (1 − λ)p‖² − ½‖x − p‖²
= f(pλ) + ½⟨x − p − λ(y − p) − (x − p), x − p − λ(y − p) + (x − p)⟩
= f(pλ) + ½⟨−λ(y − p), 2(x − p) − λ(y − p)⟩
= f(pλ) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩
= f(λy + (1 − λ)p) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩.
By convexity of f,
f(p) ≤ λf(y) + (1 − λ)f(p) + (λ²/2)‖y − p‖² − λ⟨x − p, y − p⟩,
so
λ⟨x − p, y − p⟩ + λf(p) ≤ λf(y) + (λ²/2)‖y − p‖².
Division by λ and taking the limit as λ → 0⁺ yields the result.
(⇐) Suppose that for all y ∈ Rm,
⟨y − p, x − p⟩ + f(p) ≤ f(y).
Then
f(p) ≤ f(y) − ⟨y − p, x − p⟩ = f(y) + ⟨x − p, p − y⟩.
It follows that
f(p) + ½‖x − p‖² ≤ f(y) + ⟨x − p, p − y⟩ + ½‖x − p‖²
≤ f(y) + ⟨x − p, p − y⟩ + ½‖x − p‖² + ½‖p − y‖²
= f(y) + ½‖x − p + p − y‖²
= f(y) + ½‖x − y‖²,
so p = Proxf(x).
Example 2.14.4
Let f : R → R be given by
x ↦ |x|.
Then
Proxf(x) = x − 1 if x > 1; 0 if x ∈ [−1, 1]; x + 1 if x < −1.
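This is the soft-thresholding operator (here with threshold 1); a brute-force comparison against the defining argmin (our own sketch):

```python
import numpy as np

def soft_threshold(x, lam=1.0):
    # prox of lam*|.|: shrink |x| by lam, snapping to 0 inside [-lam, lam]
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

us = np.linspace(-10, 10, 200001)
for x in (-3.0, -0.5, 0.0, 0.7, 2.5):
    brute = us[np.argmin(np.abs(us) + 0.5 * (us - x) ** 2)]
    assert abs(brute - soft_threshold(x)) < 1e-3
```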
Proposition 2.14.5
Let f : Rm → R be convex, l.s.c., and proper. Then x minimizes f over Rm if and only
if
x = Proxf (x).
Proof
By the previous proposition, x = Proxf(x) if and only if for all y ∈ Rm,
f(x) = ⟨y − x, x − x⟩ + f(x) ≤ f(y),
i.e. x minimizes f over Rm.
Example 2.14.6
Let g, h : R → R be given by
g(x) := 0 if x ≠ 0; λ if x = 0,
h(x) := 0 if x ≠ 0; −λ if x = 0,
where λ > 0.
Then
Proxh(x) = {x} if |x| > √(2λ); {0, x} if |x| = √(2λ); {0} if |x| < √(2λ),
and
Proxg(x) = {x} if x ≠ 0; ∅ if x = 0.
For f(x) := λ|x| and all x ∈ R,
Proxf(x) = x − λ if x > λ; 0 if x ∈ [−λ, λ]; x + λ if x < −λ.
Theorem 2.14.8
Suppose f : Rm → (−∞, ∞] is given by
f(x) := Σ_{i=1}^m fi(xi)
where each fi : R → (−∞, ∞] is convex, l.s.c., and proper. Then
Proxf(x) = (Proxf1(x1), …, Proxfm(xm)).
Proof
From A2, f is convex, l.s.c., and proper. We know that
Conversely, if fi (yi ) ≥ fi (pi ) + (yi − pi )(xi − pi ) for each i ∈ [m], then clearly p = Proxf (x).
Example 2.14.9
Let g : Rm → (−∞, ∞] be given by
x ↦ −α Σ_{i=1}^m log xi if x > 0, and ∞ otherwise,
where α > 0. Then
Proxg(x) = ((xi + √(xi² + 4α))/2)_{i=1}^m
since
Proxgi(xi) = (xi + √(xi² + 4α))/2.
This can be proven by differentiating to find the minimizer of hi(yi) := gi(yi) + ½(yi − xi)².
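The one-dimensional formula can indeed be verified from the stationarity condition −α/p + (p − x) = 0, i.e. p² − xp − α = 0 (our own sketch):

```python
import numpy as np

def prox_log(x, alpha):
    # positive root of p^2 - x*p - alpha = 0
    return (x + np.sqrt(x ** 2 + 4 * alpha)) / 2

for x in (-2.0, 0.0, 1.0, 5.0):
    for alpha in (0.5, 1.0, 3.0):
        p = prox_log(x, alpha)
        assert p > 0                              # stays in dom g
        assert abs(-alpha / p + (p - x)) < 1e-9   # stationarity of g + 0.5(.-x)^2
```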
Theorem 2.14.10
Let g : Rm → (−∞, ∞] be proper and c > 0. Let a ∈ Rm, γ ∈ R. For each x ∈ Rm, define
f(x) = g(x) + (c/2)‖x‖² + ⟨a, x⟩ + γ.
Then for all x ∈ Rm,
Proxf(x) = Prox_{(1/(c+1))g} ((x − a)/(c + 1)).
Proof
Indeed, recall that
Proxf(x) := argmin_{u∈Rm} f(u) + ½‖u − x‖²
= argmin_{u∈Rm} g(u) + (c/2)‖u‖² + ⟨a, u⟩ + γ + ½‖u − x‖².
Now,
(c/2)‖u‖² + ⟨a, u⟩ + ½‖u − x‖² = (c/2)‖u‖² + ⟨a, u⟩ + ½‖u‖² − ⟨u, x⟩ + ½‖x‖²
= ((c+1)/2)‖u‖² − ⟨u, x − a⟩ + ½‖x‖²
= ((c+1)/2) ‖u − (x − a)/(c+1)‖² − ‖x − a‖²/(2(c+1)) + ½‖x‖².
Finally, since minimizers are preserved under positive scalar multiplication and addition of constants,
Proxf(x) = argmin_{u∈Rm} g(u) + ((c+1)/2) ‖u − (x − a)/(c+1)‖²
= argmin_{u∈Rm} (1/(c+1)) g(u) + ½ ‖u − (x − a)/(c+1)‖²
=: Prox_{(1/(c+1))g} ((x − a)/(c+1)).
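The theorem can be exercised numerically with g = |·| in one dimension, whose scaled prox is soft thresholding; we compare against a brute-force argmin (our own sketch; all helper names are ours):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_f(x, c, a):
    # with g = |.|: Prox_f(x) = Prox_{(1/(c+1))g}((x - a)/(c + 1))
    return soft_threshold((x - a) / (c + 1), 1.0 / (c + 1))

c, a, gamma = 2.0, 0.3, 5.0
us = np.linspace(-10, 10, 200001)
for x in (-4.0, -1.0, 0.0, 0.2, 3.0):
    objective = np.abs(us) + c / 2 * us ** 2 + a * us + gamma + 0.5 * (us - x) ** 2
    brute = us[np.argmin(objective)]
    assert abs(brute - prox_f(x, c, a)) < 1e-3
```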
Example 2.14.11
Let µ ∈ R and α ≥ 0. Consider f : R → (−∞, ∞] given by
f(x) := µx if x ∈ [0, α], and ∞ otherwise.
For each x ∈ R,
f(x) = µx + δ[0,α](x).
Moreover,
Proxf(x) = min(max(x − µ, 0), α).
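A brute-force check of this formula (our own sketch):

```python
import numpy as np

def prox_linear_box(x, mu, alpha):
    # prox of f(x) = mu*x + delta_[0, alpha](x)
    return min(max(x - mu, 0.0), alpha)

mu, alpha = 0.7, 2.0
us = np.linspace(0.0, alpha, 200001)       # the feasible set [0, alpha]
for x in (-3.0, 0.5, 1.0, 2.9, 10.0):
    brute = us[np.argmin(mu * us + 0.5 * (us - x) ** 2)]
    assert abs(brute - prox_linear_box(x, mu, alpha)) < 1e-3
```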
Theorem 2.14.12
Let g : R → (−∞, ∞] be convex, l.s.c. and proper such that dom g ⊆ [0, ∞), and let f : Rm → (−∞, ∞] be given by
f(x) = g(‖x‖).
Then
Proxf(x) = Proxg(‖x‖) x/‖x‖ if x ≠ 0, and Proxf(x) = {u ∈ Rm : ‖u‖ = Proxg(0)} if x = 0.
Proof
Case I: x = 0. By definition,
Proxf(0) = argmin_{u∈Rm} f(u) + ½‖u‖².
The objective depends on u only through ‖u‖, so with the change of variable w = ‖u‖, the set of optimal norms is
argmin_{w≥0} g(w) + ½w² =: Proxg(0),
and hence Proxf(0) = {u ∈ Rm : ‖u‖ = Proxg(0)}.
66
Case II: x ≠ 0. By definition, Prox_f(x) is the set of solutions to the minimization problem
  min_{u∈R^m} g(‖u‖) + ½‖u − x‖²
  = min_{u∈R^m} g(‖u‖) + ½‖u‖² − ⟨u, x⟩ + ½‖x‖²
  = min_{α≥0} min_{u : ‖u‖=α} g(α) + ½α² − ⟨u, x⟩ + ½‖x‖².
Now, ⟨u, x⟩ ≤ ‖u‖ · ‖x‖ by the Cauchy–Schwarz inequality, with equality when u = λx for some λ ≥ 0. Thus the inner minimum is attained at u = α · x/‖x‖, and the problem becomes
  min_{α≥0} g(α) + ½α² − α‖x‖ + ½‖x‖²
  = min_{α≥0} g(α) + ½(α − ‖x‖)².
This is precisely the problem defining Prox_g(‖x‖). Hence
  Prox_f(x) = Prox_g(‖x‖) · x/‖x‖
as desired.
Example 2.14.13
Let α > 0, λ ≥ 0, and f : R → (−∞, ∞] be given by
  f(x) = λ|x| if |x| ≤ α;  ∞ if |x| > α.
Define
  g(x) = λx if x ∈ [0, α];  ∞ if x ∉ [0, α],
so that f(x) = g(|x|). By the previous theorem,
  Prox_f(x) = Prox_g(|x|) sgn(x) if x ≠ 0;  0 if x = 0
            = min(max(|x| − λ, 0), α) · sgn(x).
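The resulting clipped soft threshold is easy to verify against a direct minimization over the domain. A minimal sketch; function names are mine:

```python
import math

def prox_clipped_abs(x, lam, alpha):
    """Prox of f(x) = lam*|x| on [-alpha, alpha] (inf outside), per
    Example 2.14.13: soft threshold, then clip to the domain."""
    return math.copysign(min(max(abs(x) - lam, 0.0), alpha), x)

def prox_brute(x, lam, alpha):
    # Grid search restricted to the domain [-alpha, alpha]
    grid = [-alpha + 2 * alpha * k / 100000 for k in range(100001)]
    return min(grid, key=lambda u: lam * abs(u) + 0.5 * (u - x) ** 2)

for x in [-3.0, -0.2, 0.0, 0.7, 5.0]:
    assert abs(prox_clipped_abs(x, 0.5, 1.25) - prox_brute(x, 0.5, 1.25)) < 1e-3
```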
Example 2.14.14
Let w, α ∈ R^m_+ and f : R^m → (−∞, ∞] be given by
  f(x) = Σ_{i=1}^m w_i|x_i| if −α ≤ x ≤ α;  ∞ otherwise.
By Theorem 2.14.8 and the previous example,
  Prox_f(x) = ( min(max(|x_i| − w_i, 0), α_i) · sgn(x_i) )_{i=1}^m.

Let the sequence (x_n)_{n≥0} be recursively defined by x_0 ∈ R^m and x_{n+1} := Prox_f(x_n). Then x_n → x̄ where x̄ is a minimizer of (P).
Recall that T : R^m → R^m is nonexpansive if for all x, y ∈ R^m,
  ‖Tx − Ty‖ ≤ ‖x − y‖,
and firmly nonexpansive (f.n.e.) if
  ‖Tx − Ty‖² + ‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖².
68
Definition 2.15.3 (Averaged)
Let T : Rm → Rm and α ∈ (0, 1). Then T is α-averaged if there is some N : Rm → Rm
such that N is nonexpansive and
T = (1 − α) Id +αN.
Proposition 2.15.1
Let T : R^m → R^m. The following are equivalent:
(i) T is f.n.e.
(ii) Id − T is f.n.e.
(iii) 2T − Id is nonexpansive.
(iv) For all x, y ∈ R^m, ‖Tx − Ty‖² ≤ ⟨x − y, Tx − Ty⟩.
(v) For all x, y ∈ R^m, ⟨Tx − Ty, (Id − T)x − (Id − T)y⟩ ≥ 0.
Proof
(i) ⇐⇒ (ii): This is clear from the definition.
Proposition 2.15.2
Let T : R^m → R^m be linear. Then the following are equivalent:
(i) T is f.n.e.
(ii) ‖2T − Id‖ ≤ 1.
(iii) For all x ∈ R^m, ‖Tx‖² ≤ ⟨x, Tx⟩.
(iv) For all x ∈ R^m, ⟨Tx, x − Tx⟩ ≥ 0.
Proof
(i) ⇐⇒ (ii): We know that T is f.n.e. if and only if 2T − Id is nonexpansive. This happens if and only if for all x ≠ y,
  ‖(2T − Id)x − (2T − Id)y‖ = ‖(2T − Id)(x − y)‖ ≤ ‖x − y‖,
which holds if and only if ‖2T − Id‖ ≤ 1.
(i) ⇐⇒ (iii): This is easily seen from the previous proposition and the fact that Tx − Ty = T(x − y).
(i) ⇐⇒ (iv): This follows by applying the previous proposition and observing that Tx − Ty = T(x − y) as well as (Id − T)x − (Id − T)y = (Id − T)(x − y).
Example 2.15.3
Let ∅ ≠ C ⊆ R^m be convex and closed. Then P_C is f.n.e. Indeed, for all x, y ∈ R^m,
  ⟨P_C x − P_C y, x − y⟩ ≥ ‖P_C x − P_C y‖².
Example 2.15.4
Suppose that T = −½ Id. Then T is averaged but NOT f.n.e. We have
  T = ¼ Id + ¾ (−Id),
and so T is ¾-averaged.
Example 2.15.5
T := −Id is nonexpansive but NOT averaged. Indeed, suppose there is some nonexpansive N : R^m → R^m and α ∈ (0, 1) such that
  T = (1 − α) Id + αN ⇐⇒ −Id = (1 − α) Id + αN
                      ⇐⇒ (α − 2) Id = αN
                      ⇐⇒ N = ((α − 2)/α) Id.
But then
  ‖N‖ = |α − 2|/α ≤ 1 ⇐⇒ (2 − α)/α ≤ 1 ⇐⇒ 2 − α ≤ α ⇐⇒ α ≥ 1,
a contradiction.
Proposition 2.15.6
Let T : Rm → Rm be nonexpansive. Then T is continuous.
Proof
Suppose xn → x̄. Then
  ‖T x_n − T x̄‖ ≤ ‖x_n − x̄‖ → 0.
Example 2.16.1
Suppose Fix T ≠ ∅ for some nonexpansive T : R^m → R^m. For any x_0 ∈ R^m, the sequence defined recursively by
  x_n := T(x_{n−1})
is Fejér monotone with respect to Fix T, i.e. ‖x_{n+1} − c‖ ≤ ‖x_n − c‖ for every c ∈ Fix T and n ∈ N.
71
Proposition 2.16.2
Let ∅ ≠ C ⊆ R^m and let (x_n)_{n≥0} be a Fejér monotone sequence in R^m with respect to C. The following hold:
(i) (xn ) is bounded
(ii) for every c ∈ C, (kxn − ck)n≥0 converges
(iii) (dC (xn ))n≥0 is decreasing and converges
Proof
Fix c ∈ C. By Fejér monotonicity we have
  ‖x_{n+1} − c‖ ≤ ‖x_n − c‖ ≤ ⋯ ≤ ‖x_0 − c‖,
so (x_n) is bounded, proving (i). Now, (‖x_n − c‖)_{n≥0} is bounded below by 0 and monotonic, hence necessarily converges to its infimum, proving (ii). Finally, (iii) follows by taking the infimum over c ∈ C.
Proposition 2.16.3
A bounded sequence (xn )n∈N in Rm converges if and only if it has a unique cluster point.
Proof
The forward direction is clear. Suppose now that (x_n)_{n∈N} has a unique cluster point x̄ but x_n ↛ x̄. Then there is some ε > 0 and a subsequence (x_{k_n}) such that for all n,
  ‖x_{k_n} − x̄‖ ≥ ε.
But then (x_{k_n})_{n∈N} is bounded and hence contains a convergent subsequence. This is still a subsequence of (x_n)_{n∈N}, and its limit keeps distance at least ε from x̄, so it is a cluster point distinct from x̄. It follows that (x_n)_{n∈N} has more than one cluster point, a contradiction. Hence x_n → x̄.
72
Lemma 2.16.4
Let (xn )n∈N be a sequence in Rm and ∅ 6= C ⊆ Rm be such that for all c ∈ C,
(kxn − ck)n∈N converges and every cluster point of (xn )n∈N lies in C.
Then (xn )n∈N converges to a point in C.
Proof
(xn ) is necessarily bounded since kxn k ≤ kck + kxn − ck is bounded. It suffices by the
previous proposition to show that (xn )n∈N has a unique cluster point.
Let x, y be two cluster points of (xn )n∈N . That is, there are subsequences
xkn → x, x`n → y.
Observe that
  2⟨x_n, x − y⟩ = ‖x_n − y‖² − ‖x_n − x‖² + ‖x‖² − ‖y‖² → L ∈ R,
since (‖x_n − y‖) and (‖x_n − x‖) converge (x, y ∈ C by assumption). Taking the limit along the two subsequences,
  2⟨x, x − y⟩ = L = 2⟨y, x − y⟩,
so that
  ⟨x − y, x − y⟩ = ‖x − y‖² = 0, and hence x = y.
Theorem 2.16.5
Let ∅ 6= C ⊆ Rm and (xn )n∈N a sequence in Rm . Suppose that (xn )n∈N is Féjer
monotone with respect to C, and that every cluster point of (xn )n∈N lies in C. Then
(xn )n∈N converges to a point in C.
Proof
We know that for all c ∈ C,
kxn − ck
converges. Hence the result follows from the previous lemma.
73
Let x, y ∈ R^m and α ∈ R. By computation,
  ‖αx + (1 − α)y‖² + α(1 − α)‖x − y‖² = α‖x‖² + (1 − α)‖y‖².
Theorem 2.16.6
Let α ∈ (0, 1) and T : R^m → R^m be α-averaged such that Fix T ≠ ∅. Let x_0 ∈ R^m.
Define
xn+1 := T xn .
The following hold:
(i) (xn )n∈N is Fejér monotone with respect to Fix T .
(ii) T xn − xn → 0.
(iii) (xn )n∈N converges to a point in Fix T .
Proof
Now, T being averaged implies that it is nonexpansive. The example earlier then shows that (x_n)_{n∈N} is Fejér monotone with respect to Fix T, proving (i). Write
  T = (1 − α) Id + αN
with N nonexpansive and Fix T = Fix N. Let f ∈ Fix T. By the identity above,
  ‖x_{n+1} − f‖² = (1 − α)‖x_n − f‖² + α‖N x_n − N f‖² − α(1 − α)‖N x_n − x_n‖²
                 ≤ ‖x_n − f‖² − α(1 − α)‖N x_n − x_n‖².
Summing over n, α(1 − α) Σ_n ‖N x_n − x_n‖² ≤ ‖x_0 − f‖² < ∞, so N x_n − x_n → 0. In particular,
  T x_n − x_n = α(N x_n − x_n) → 0,
proving (ii). Now, (x_n)_{n∈N} is Fejér monotone with respect to Fix T = Fix N. Let x̄ be a cluster point of (x_n)_{n∈N}, say x_{k_n} → x̄. Observe that N being nonexpansive implies that N is continuous. By continuity and (ii),
  N x̄ = lim_n N x_{k_n} = lim_n x_{k_n} = x̄.
That is, every cluster point of (x_n)_{n∈N} lies in Fix N = Fix T. Combined with a previous theorem, this yields (iii).
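The behavior in this theorem can be observed numerically. A minimal sketch, assuming a hypothetical nonexpansive map N (a rotation by 90°, with Fix N = {0}) and building the ¾-averaged operator T from it:

```python
import math

def N(p):
    # A nonexpansive map with Fix N = {(0, 0)}: rotation by 90 degrees
    x, y = p
    return (-y, x)

def T(p, alpha=0.75):
    # The alpha-averaged operator T = (1 - alpha) Id + alpha N
    x, y = p
    nx, ny = N(p)
    return ((1 - alpha) * x + alpha * nx, (1 - alpha) * y + alpha * ny)

p = (4.0, 3.0)
norms = []
for _ in range(200):
    q = T(p)
    norms.append(math.hypot(q[0] - p[0], q[1] - p[1]))  # residual ||T x_n - x_n||
    p = q

assert math.hypot(*p) < 1e-6   # (iii): x_n converges to the fixed point 0
assert norms[-1] < norms[0]    # (ii): residuals T x_n - x_n shrink
```

Plain iteration of N itself would circle forever; the averaging step is what forces convergence.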
Corollary 2.16.6.1
Let T : R^m → R^m be f.n.e. and suppose that Fix T ≠ ∅. Let x_0 ∈ R^m. Recursively define
xn+1 := T xn .
There is some x̄ ∈ Fix T such that
xn → x̄.
Proof
Since T is f.n.e., T is ½-averaged. The result then follows from the previous theorem.
Proposition 2.16.7
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Then Proxf is f.n.e.
Proof
Let x, y ∈ R^m. Set p := Prox_f(x) and q := Prox_f(y). By the characterization of proximal points, for all z ∈ R^m,
  ⟨z − p, x − p⟩ + f(p) ≤ f(z)
  ⟨z − q, y − q⟩ + f(q) ≤ f(z).
Choosing z = q in the first inequality and z = p in the second,
  ⟨q − p, x − p⟩ + f(p) ≤ f(q)
  ⟨p − q, y − q⟩ + f(q) ≤ f(p).
Adding these and simplifying,
  ⟨q − p, (x − p) − (y − q)⟩ ≤ 0, i.e. ⟨p − q, (x − p) − (y − q)⟩ ≥ 0,
which is precisely characterization (v) of firm nonexpansiveness.
Corollary 2.16.7.1
Let f : R^m → (−∞, ∞] be convex, l.s.c., and proper such that argmin f ≠ ∅. Let x_0 ∈ R^m and update via
xn+1 = Proxf (xn ).
There is some x̄ ∈ argmin f such that xn → x̄.
Proof
Recall that
x ∈ argmin f ⇐⇒ x = Proxf (x) ⇐⇒ x ∈ Fix Proxf .
Thus argmin f = Fix Prox_f ≠ ∅.
By the previous proposition, Proxf is f.n.e. Thus the result follows from a previous
theorem.
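The resulting proximal point algorithm can be seen in one dimension with f = |·|, whose prox is the soft threshold derived earlier. A short sketch:

```python
def prox_abs(x, lam=1.0):
    # Prox of lam*|.|: the soft threshold
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

x = 5.25
trace = [x]
for _ in range(10):
    x = prox_abs(x)       # proximal point iteration x_{n+1} = Prox_f(x_n)
    trace.append(x)

# argmin |.| = {0}: each step moves distance 1 toward it, then stays put
assert x == 0.0
assert trace[1] == 4.25
```

Once an iterate lands in [−1, 1], it is thresholded to 0 = Fix Prox_f exactly, illustrating argmin f = Fix Prox_f.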
Proposition 2.17.1
Let T : R^m → R^m be nonexpansive and α ∈ (0, 1). The following are equivalent:
(i) T is α-averaged.
(ii) (1 − 1/α) Id + (1/α) T is nonexpansive.
76
Proof
(i) ⇐⇒ (ii): T is α-averaged if and only if there is some nonexpansive N : R^m → R^m such that
  T = (1 − α) Id + αN
  ⇐⇒ N = (1/α)(T − (1 − α) Id)
  ⇐⇒ N = (1 − 1/α) Id + (1/α) T,
if and only if (1 − 1/α) Id + (1/α) T is nonexpansive.

Moreover, by the identity ‖βu + (1 − β)v‖² = β‖u‖² + (1 − β)‖v‖² − β(1 − β)‖u − v‖² (valid for all β ∈ R), N is nonexpansive if and only if for all x, y ∈ R^m,
  ‖x − y‖² ≥ ‖(1 − 1/α)(x − y) + (1/α)(Tx − Ty)‖²
           = (1 − 1/α)‖x − y‖² + (1/α)‖Tx − Ty‖² + ((1 − α)/α²)‖(x − Tx) − (y − Ty)‖².
Rearranging and multiplying through by α > 0, this is equivalent to
  0 ≤ ‖x − y‖² − ((1 − α)/α)‖(x − Tx) − (y − Ty)‖² − ‖Tx − Ty‖².
Theorem 2.17.2
Let α₁, α₂ ∈ (0, 1) and let T_i : R^m → R^m be α_i-averaged. Define
  T := T₁T₂,
  α := (α₁ + α₂ − 2α₁α₂) / (1 − α₁α₂).
Then T is α-averaged.
Proof
First observe that by computation,
  α ∈ (0, 1) ⇐⇒ α₁(1 − α₂) < 1 − α₂,
which holds since α₁ < 1.
77
By the previous proposition, for each x, y ∈ R^m,
  ‖Tx − Ty‖² = ‖T₁T₂x − T₁T₂y‖²
  ≤ ‖T₂x − T₂y‖² − ((1 − α₁)/α₁)‖(Id − T₁)T₂x − (Id − T₁)T₂y‖²
  ≤ ‖x − y‖² − ((1 − α₂)/α₂)‖(Id − T₂)x − (Id − T₂)y‖² − ((1 − α₁)/α₁)‖(Id − T₁)T₂x − (Id − T₁)T₂y‖²
  =: ‖x − y‖² − V₁ − V₂.
Set
  β := (1 − α₁)/α₁ + (1 − α₂)/α₂ > 0.
By computation,
  V₁ + V₂ ≥ ((1 − α₁)(1 − α₂)/(βα₁α₂)) ‖(Id − T)x − (Id − T)y‖².
Consequently,
  ‖Tx − Ty‖² ≤ ‖x − y‖² − ((1 − α₁)(1 − α₂)/(βα₁α₂)) ‖(Id − T)x − (Id − T)y‖²
             = ‖x − y‖² − ((1 − α)/α) ‖(Id − T)x − (Id − T)y‖².
78
Chapter 3
  min_{x ∈ C} f(x)   (P)
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, with C ≠ ∅ being convex and closed.
79
Theorem 3.1.1
Let f : Rm → (−∞, ∞] be proper and g : Rm → (−∞, ∞] convex, l.s.c., proper with
dom g ⊆ int(dom f ). Consider the problem
Proof (i)
Let y ∈ dom g. Since g is convex, we know that dom g is convex. Hence for any λ ∈ (0, 1),
  x* + λ(y − x*) = (1 − λ)x* + λy =: x_λ ∈ dom g.
Proof (ii)
Suppose that f is convex and observe that (i) proves the forward direction.
80
Now suppose −∇f (x∗ ) ∈ ∂g(x∗ ). By definition, for each y ∈ dom g,
Lemma 3.1.2
For all x ∈ R^m, F(x) ≥ 0. Moreover, the solutions of (P) are precisely the minimizers of F, namely the set
  {x : F(x) = 0}.
81
Proof
Let x ∈ R^m.
Case Ia: x is infeasible. Then there is some j ∈ [n] such that g_j(x) > 0. Hence F(x) ≥ g_j(x) > 0.
Case Ib: x is feasible but not optimal. Then g_i(x) ≤ 0 for each i ∈ [n] but f(x) > µ. Thus F(x) ≥ g₀(x) > 0, where g₀ := f − µ.
Case II: x solves (P). Then x is feasible and f(x) = µ. Hence F(x) = 0.
Now, let
  x ∈ ∩_{i=1}^n int dom g_i.
By the max rule for subdifferentials, we have
  ∂F(x) = conv( ∪_{i∈A(x)} ∂g_i(x) ),
where A(x) := {0 ≤ i ≤ n : g_i(x) = F(x)} is the set of active indices.

Theorem (Fritz–John; Necessary Conditions)
Suppose x* solves (P). Then there exist α₀, …, α_n ≥ 0, not all zero, such that
  0 ∈ α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*),  with  α_i g_i(x*) = 0 for each i ∈ [n].
Proof
Recall that F(x) := max{f(x) − µ, g₁(x), …, g_n(x)}. By the previous lemma, x* minimizes F and F(x*) = 0. Hence
  0 ∈ ∂F(x*) = conv( ∪_{i∈A(x*)} ∂g_i(x*) ),
where A(x*) := {0 ≤ i ≤ n : g_i(x*) = 0}, g₀ := f − µ, and
  ∂g₀ = ∂f.
By our work with convex hulls, there are some α₀, …, α_n ≥ 0 with Σ_{i∈A(x*)} α_i = 1 (so α_j = 0 if j ∉ A(x*)) such that
  0 ∈ Σ_{i∈A(x*)} α_i ∂g_i(x*)
    = α₀ ∂g₀(x*) + Σ_{i∈A(x*)∖{0}} α_i ∂g_i(x*)
    = α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*).
Moreover, α_i g_i(x*) = 0 for each i ∈ [n], since α_i = 0 whenever g_i(x*) < 0.
Theorem (Karush–Kuhn–Tucker; Necessary Conditions)
Suppose x* solves (P) and Slater's condition holds: there is some s such that
  g_i(s) < 0 for each i ∈ [n].
Then there exist λ₁, …, λ_n ≥ 0 such that the KKT conditions hold:
(stationarity)
  0 ∈ ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*)
(complementary slackness)
  λ_i g_i(x*) = 0.
83
Proof
By the Fritz–John necessary conditions, there are α₀, α₁, …, α_n ≥ 0, not all 0, such that
  0 ∈ α₀ ∂f(x*) + Σ_{i=1}^n α_i ∂g_i(x*).
Suppose towards a contradiction that α₀ = 0. Then there exist y_i ∈ ∂g_i(x*) such that
  Σ_{i=1}^n α_i y_i = 0.
Evaluating the subgradient inequalities at the Slater point s and using α_i g_i(x*) = 0,
  Σ_{i=1}^n α_i g_i(s) ≥ Σ_{i=1}^n α_i [ g_i(x*) + ⟨y_i, s − x*⟩ ] = ⟨Σ_{i=1}^n α_i y_i, s − x*⟩ = 0,
which is absurd since g_i(s) < 0 for every i and not all α_i are zero. Hence α₀ > 0, and dividing through by α₀ we may take λ_i := α_i/α₀.
84
Theorem 3.1.6 (Karush-Kuhn-Tucker; Sufficient Conditions)
Suppose f, g₁, …, g_n are convex and x* ∈ R^m is feasible and, together with some λ₁, …, λ_n ≥ 0, satisfies the KKT conditions:
  0 ∈ ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*)  and  λ_i g_i(x*) = 0 for each i ∈ [n].
Then x* solves (P).
Proof
Define
  h(x) := f(x) + Σ_{i=1}^n λ_i g_i(x).
By assumption,
  0 ∈ ∂h(x*) = ∂f(x*) + Σ_{i=1}^n λ_i ∂g_i(x*),
so x* minimizes h. Then for any feasible x,
  f(x) ≥ f(x) + Σ_{i=1}^n λ_i g_i(x) = h(x) ≥ h(x*) = f(x*) + Σ_{i=1}^n λ_i g_i(x*) = f(x*),
using λ_i ≥ 0, g_i(x) ≤ 0, and complementary slackness.
85
3.2 Gradient Descent
  min_{x ∈ R^m} f(x)   (P)
A direction d is a descent direction for f at x if f′(x; d) < 0, where
  f′(x; d) = lim_{λ→0⁺} (f(x + λd) − f(x)) / λ.
Thus f′(x; d) < 0 implies that there is some ε > 0 such that λ ∈ (0, ε) implies
  (f(x + λd) − f(x)) / λ < 0 ⇐⇒ f(x + λd) < f(x).
1. Initialize x0 ∈ Rm .
2. For each n ∈ N:
(a) Pick tn ∈ argmint≥0 f (xn − t∇f (xn )).
(b) Update xn+1 := xn − tn ∇f (xn )
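For a quadratic f(x) = ½xᵀAx, the exact line search in step (a) has a closed form, t = gᵀg / gᵀAg with g = ∇f(x). A minimal sketch on a diagonal example (the specific matrix is my choice, not from the notes):

```python
def grad(x):
    # f(x) = 0.5 * (x1^2 + 10 * x2^2), so grad f(x) = (x1, 10 * x2)
    return [x[0], 10.0 * x[1]]

def exact_step(g):
    # For f = 0.5 x^T A x with A = diag(1, 10):
    # argmin_t f(x - t g) = g^T g / (g^T A g)
    num = g[0] ** 2 + g[1] ** 2
    den = g[0] ** 2 + 10.0 * g[1] ** 2
    return num / den

x = [10.0, 1.0]
for _ in range(100):
    g = grad(x)
    t = exact_step(g)
    x = [x[0] - t * g[0], x[1] - t * g[1]]

assert abs(x[0]) < 1e-6 and abs(x[1]) < 1e-6   # converges to the minimizer 0
```

The iterates zig-zag between two directions, converging linearly at a rate governed by the condition number of A.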
86
Example 3.2.2 (L. Vandenberghe)
Negative subgradients are NOT necessarily descent directions. Consider f : R2 → R+
given by
(x1 , x2 ) 7→ |x1 | + 2|x2 |.
Then f is convex as it is a (separable) sum of convex functions.
  C = { x ∈ R² : x₁² + x₂²/γ ≤ 1, x₂ ≥ 1/√(1 + γ) }.
  D := R² ∖ ((−∞, 0] × {0}).
Let x₀ := (γ, 1) ∈ D.
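The failure of negative subgradients as descent directions is easy to witness numerically for f(x₁, x₂) = |x₁| + 2|x₂|: at x = (1, 0) the subdifferential is {1} × [−2, 2], and the particular choice (1, 2) ∈ ∂f(x) makes things worse. A short check:

```python
def f(x1, x2):
    return abs(x1) + 2 * abs(x2)

x = (1.0, 0.0)
g = (1.0, 2.0)   # a valid subgradient, since ∂f(1, 0) = {1} x [-2, 2]

# Moving along the NEGATIVE subgradient increases f for every small t > 0:
# f(1 - t, -2t) = (1 - t) + 4t = 1 + 3t > 1 = f(x)
for t in [1e-4, 1e-3, 1e-2, 0.1]:
    assert f(x[0] - t * g[0], x[1] - t * g[1]) > f(*x)
```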
87
3.3 Projected Subgradient Method
Consider
min f (x) (P )
x∈C
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, and ∅ ≠ C ⊆ int dom f is convex and
closed.
Suppose
  S := argmin_{x∈C} f(x) ≠ ∅,
  µ := min_{x∈C} f(x).
1) Get x0 ∈ C.
2) Given x_n, pick a stepsize t_n > 0 and a subgradient f′(x_n) ∈ ∂f(x_n).
3) Update xn+1 := PC (xn − tn f 0 (xn )).
Recall that C ⊆ int dom f, hence each x_n ∈ int dom f and ∂f(x_n) ≠ ∅. Thus the algorithm
is well-defined.
Lemma 3.3.1
Let s ∈ S := argmin_{x∈C} f(x). Then
  ‖x_{n+1} − s‖² ≤ ‖x_n − s‖² − 2t_n(f(x_n) − µ) + t_n²‖f′(x_n)‖².
Observe that S ⊆ C.
Proof
We have
88
To pick the stepsize minimizing this bound, set
  0 = d/dt_n ( −2t_n(f(x_n) − µ) + t_n²‖f′(x_n)‖² ) = −2(f(x_n) − µ) + 2t_n‖f′(x_n)‖²,
which yields Polyak's stepsize
  t_n := (f(x_n) − µ) / ‖f′(x_n)‖².
Theorem 3.3.2
We have
(i) For all s ∈ S and n ∈ N, ‖x_{n+1} − s‖ ≤ ‖x_n − s‖, i.e. (x_n)_{n∈N} is Fejér monotone with respect to S.
(ii) f(x_n) → µ.
(iii) µ_n − µ ≤ L·d_S(x_0)/√(n + 1) ∈ O(1/√n), where µ_n := min_{0≤k≤n} f(x_k).
(iv) Given ε > 0, if n ≥ L²d_S²(x_0)/ε² − 1, then µ_n − µ ≤ ε.
89
Proof (i)
Let s ∈ S and n ∈ N. By computation,
Proof (ii)
From our work in (i): for all k ∈ N,
  (f(x_k) − µ)²/L² ≤ ‖x_k − s‖² − ‖x_{k+1} − s‖².
Summing and letting n → ∞,
  0 ≤ Σ_{k=0}^∞ (f(x_k) − µ)² ≤ L²‖x_0 − s‖² < ∞,
so (f(x_k) − µ)² → 0 and hence f(x_k) → µ.
Proof (iii)
Recall that
µn := min f (xk ).
0≤k≤n
90
Minimizing over s ∈ S, we get that
  (n + 1)(µ_n − µ)²/L² ≤ d_S²(x_0).
Proof (iv)
Suppose that
  n ≥ L²d_S²(x_0)/ε² − 1 ⇐⇒ d_S²(x_0)L²/(n + 1) ≤ ε².
Combining with (iii) yields µ_n − µ ≤ ε.
Recall that if (xn )n∈N is Fejér monotone with respect to some ∅ 6= C ⊆ Rm , and every
cluster point lies in C, then xn → c ∈ C.
Proof
We have already shown that (xn ) is Fejér monotone with respect to S. Thus the sequence
is also bounded. Also, by the previous theorem,
f (xn ) → µ = min f (x).
x∈C
91
Let x̄ be a cluster point of (x_n), say x_{k_n} → x̄. Since C is closed, x̄ ∈ C, and by lower semicontinuity, f(x̄) ≤ lim inf_n f(x_{k_n}) = µ. Hence x̄ ∈ S. That is, all cluster points of (x_n)_{n∈N} lie in S, and the result follows.
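The projected subgradient method with Polyak's stepsize is only a few lines in practice. A minimal sketch for minimizing the ℓ₁ norm over C = {x : x ≥ 1} (my choice of test problem; the minimizer is the all-ones vector, so µ = 3 in R³):

```python
def f(x):          # objective: the l1 norm (equals sum(x) on C since x >= 1)
    return sum(abs(v) for v in x)

def subgrad(x):    # a subgradient of the l1 norm (sign pattern, 1 at 0)
    return [1.0 if v >= 0 else -1.0 for v in x]

def proj_C(x):     # projection onto C = {x : x_i >= 1 for all i}
    return [max(v, 1.0) for v in x]

mu = 3.0           # known optimal value, attained at (1, 1, 1)
x = proj_C([5.0, 2.0, 1.0])
for _ in range(200):
    g = subgrad(x)
    t = (f(x) - mu) / sum(v * v for v in g)   # Polyak's stepsize
    x = proj_C([xi - t * gi for xi, gi in zip(x, g)])

assert all(abs(v - 1.0) < 1e-6 for v in x)    # x_n converges into S
```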
Example 3.3.4
Let C ⊆ Rm be convex, closed, and non-empty. Fix x ∈ Rm .
  ∂d_C(x) = { (x − P_C(x))/d_C(x) } if x ∉ C;  N_C(x) ∩ B(0; 1) if x ∈ C.
Lemma 3.3.5
Let f be convex, l.s.c., and proper. Fix λ > 0. Then
∂(λf ) = λ∂f.
Problem 1
Given k closed convex subsets S_i ⊆ R^m such that
  S := ∩_{i=1}^k S_i ≠ ∅,
find x ∈ S.
We take
f (x) := max{dSi (x) : i ∈ [k]}.
The domain is C := R^m. Observe that f ≥ 0, with f(x) = 0 if and only if x ∈ S, so µ = 0 and argmin f = S. Recall that the max rule for subdifferentials implies that for all x ∉ S,
  ∂f(x) = conv( ∪_{i : d_{S_i}(x) = f(x)} ∂d_{S_i}(x) ).
Thus ‖∂f(x)‖ ≤ 1, as a convex combination of unit vectors preserves the norm bound (so we may take L = 1).
Given x_n, pick an index ī such that d_{S_ī}(x_n) = f(x_n) > 0. Set
  f′(x_n) := (x_n − P_{S_ī}(x_n)) / d_{S_ī}(x_n).
Since µ = 0 and ‖f′(x_n)‖ = 1, Polyak's stepsize is
  t_n = d_{S_ī}(x_n),
and the update becomes x_{n+1} = x_n − d_{S_ī}(x_n) f′(x_n) = P_{S_ī}(x_n): project onto the currently farthest set.
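The "project onto the farthest set" rule is concrete enough to run. A minimal sketch with two half-planes in R² (my choice of sets, chosen so both projections are trivial):

```python
import math

# Two closed convex sets in R^2 with nonempty intersection:
# S1 = {x : x[0] >= 2},  S2 = {x : x[1] >= 3}
def proj_S1(p): return (max(p[0], 2.0), p[1])
def proj_S2(p): return (p[0], max(p[1], 3.0))
def dist(p, q): return math.hypot(p[0] - q[0], p[1] - q[1])

p = (0.0, 0.0)
for _ in range(50):
    candidates = [proj_S1(p), proj_S2(p)]
    # project onto the FARTHEST set, i.e. the one realizing f(x) = max_i d_{S_i}(x)
    q = max(candidates, key=lambda c: dist(p, c))
    if dist(p, q) == 0.0:      # already feasible: f(p) = 0
        break
    p = q

assert p[0] >= 2.0 - 1e-9 and p[1] >= 3.0 - 1e-9   # p lies in S1 ∩ S2
```

On this example the method lands in S exactly after two projections.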
Note that in practice, it is possible that µ := min_{x∈C} f(x) is NOT known to us. In this case, replace Polyak's stepsize by a sequence (t_n)_{n∈N} such that
  (Σ_{k=0}^n t_k²) / (Σ_{k=0}^n t_k) → 0, n → ∞.
For example, t_k := 1/(k + 1). One can show that
  µ_n := min_{0≤k≤n} f(x_k) → µ
as n → ∞.
93
We shall assume that S := argmin_{x∈R^m} F(x) ≠ ∅, where F := f + g, and define
  µ := min_{x∈R^m} F(x).
Here f is "nice" in that it is convex, l.s.c., proper, and differentiable on int dom f ≠ ∅. Moreover, ∇f is L-Lipschitz on int dom f.
  ∅ ≠ ri dom g ⊆ dom g ⊆ int dom f = ri dom f
  ⟹ ri dom g ∩ ri dom f = ri dom g ≠ ∅.
Example 3.4.1
We can model constrained optimization by taking g := δ_C, in which case Prox_{(1/L)g} = P_C.

  x⁺ := Prox_{(1/L)g}( x − (1/L)∇f(x) )
      = argmin_{y∈R^m} (1/L)g(y) + ½‖y − (x − (1/L)∇f(x))‖²
      ∈ dom g ⊆ int dom f = dom ∇f.

Define
  T := Prox_{(1/L)g} ∘ (Id − (1/L)∇f).
94
Theorem 3.4.2
Let x ∈ R^m. Then
  x ∈ S = argmin_{R^m} F = argmin_{R^m}(f + g) ⇐⇒ x = Tx ⇐⇒ x ∈ Fix T.
Proof
By Fermat’s theorem,
Proposition 3.4.3
Let f : Rm → (−∞, ∞] be convex, l.s.c., and proper. Fix β > 0. Then f is β-strongly
convex if and only if for all x ∈ dom ∂f, u ∈ ∂f (x),
  f(y) ≥ f(x) + ⟨u, y − x⟩ + (β/2)‖y − x‖².
95
3.4.1 Proximal-Gradient Inequality
Proposition 3.4.4
Let x ∈ R^m, y ∈ int dom f, and
  y⁺ := Prox_{(1/L)g}( y − (1/L)∇f(y) ) = Ty.
Then
  F(x) − F(y⁺) ≥ (L/2)‖x − y⁺‖² − (L/2)‖x − y‖² + D_f(x, y),
where
  D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
Proof
Define
  h(z) := f(y) + ⟨∇f(y), z − y⟩ + g(z) + (L/2)‖z − y‖².
Then h is L-strongly convex and y⁺ = argmin h. Applying the previous proposition with u = 0 ∈ ∂h(y⁺) yields
  h(x) ≥ h(y⁺) + ⟨0, x − y⁺⟩ + (L/2)‖x − y⁺‖² = h(y⁺) + (L/2)‖x − y⁺‖²,
i.e.
  h(x) − h(y⁺) ≥ (L/2)‖x − y⁺‖².
By the descent lemma,
  h(y⁺) = f(y) + ⟨∇f(y), y⁺ − y⟩ + g(y⁺) + (L/2)‖y⁺ − y‖² ≥ f(y⁺) + g(y⁺) = F(y⁺).
Moreover,
  h(x) = f(y) + ⟨∇f(y), x − y⟩ + g(x) + (L/2)‖x − y‖² = F(x) − D_f(x, y) + (L/2)‖x − y‖².
Combining the three displays,
  F(x) − D_f(x, y) + (L/2)‖x − y‖² − F(y⁺) ≥ (L/2)‖x − y⁺‖²,
which rearranges to the claim.
97
Proof
Taking x = y in the previous proposition,
  F(y) − F(y⁺) ≥ (L/2)‖y − y⁺‖² − (L/2)‖y − y‖² + D_f(y, y) = (L/2)‖y − y⁺‖²,
since D_f(y, y) = 0. That is,
  F(y⁺) ≤ F(y) − (L/2)‖y − y⁺‖².
Proof
(i): Recall the previous proposition that
98
In particular, by setting s := PS (x0 ) ∈ S, we obtain
Equivalently,
  0 ≤ F(x_n) − µ ≤ L d_S²(x_0)/(2n),
and F(x_n) → µ.
S := argminx∈Rm F (x).
Proof
By the previous theorem we know that (xn ) is Fejér monotone with respect to S. Thus it
suffices to show that every cluster point of (xn ) lies in S.
Suppose x̄ is a cluster point of (x_n), say x_{k_n} → x̄. We argue that F(x̄) = µ. Indeed, by lower semicontinuity,
  µ ≤ F(x̄) ≤ lim inf_n F(x_{k_n}) = µ.
Proposition 3.4.8
The following hold:
(i) (1/L)∇f is f.n.e.
(ii) Id − (1/L)∇f is f.n.e.
(iii) T = Prox_{(1/L)g} ∘ (Id − (1/L)∇f) is 2/3-averaged.
Proof
(i), (ii): Recall that for real-valued, convex, differentiable functions with L-Lipschitz gradient,
  ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²,
equivalently,
  ⟨(1/L)∇f(x) − (1/L)∇f(y), x − y⟩ ≥ ‖(1/L)∇f(x) − (1/L)∇f(y)‖².
The result then follows from two equivalent characterizations of f.n.e.: Id − T is f.n.e. whenever T is, and T is f.n.e. if and only if
  ⟨x − y, Tx − Ty⟩ ≥ ‖Tx − Ty‖².
(iii): Recall that Prox_{(1/L)g} is f.n.e. Hence Prox_{(1/L)g} and Id − (1/L)∇f are both ½-averaged. Consequently, by Theorem 2.17.2 the composition
  T = Prox_{(1/L)g} ∘ (Id − (1/L)∇f)
is α-averaged with α = (½ + ½ − 2·¼)/(1 − ¼) = 2/3.
Theorem 3.4.9
The PGM iteration satisfies
  ‖x_{n+1} − x_n‖ ≤ √2 d_S(x_0)/√n ∈ O(1/√n).
Proof
Using the previous remark, we have that for all x, y,
  ½‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖² − ‖Tx − Ty‖².
Let s ∈ S and observe that s ∈ Fix T by a previous theorem. Applying the above inequality with x = x_k, y = s yields
  ½‖x_k − x_{k+1}‖² ≤ ‖x_k − s‖² − ‖x_{k+1} − s‖².
Summing over k = 0, …, n − 1 telescopes to
  ½ Σ_{k=0}^{n−1} ‖x_k − x_{k+1}‖² ≤ ‖x_0 − s‖² − ‖x_n − s‖² ≤ ‖x_0 − s‖².
Now, T is 2/3-averaged and thus nonexpansive, so ‖x_{k+1} − x_k‖ = ‖Tx_k − Tx_{k−1}‖ ≤ ‖x_k − x_{k−1}‖ is decreasing in k. Therefore
  (n/2)‖x_{n+1} − x_n‖² ≤ ‖x_0 − s‖²,
and taking s := P_S(x_0) gives the claim.
101
Corollary 3.4.9.1 (Classical Proximal Point Algorithm)
Let g : Rm → (−∞, ∞] be convex, l.s.c., and proper. Fix c > 0. Consider the problem
min g(x) (P )
x ∈ Rm
Proof
Set f ≡ 0 and observe that ∇f ≡ 0, which is L-Lipschitz for any L > 0; specifically, take L := 1/c > 0. Hence
  T := Prox_{(1/L)g}( Id − (1/L)∇f ) = Prox_{cg},
and the PGM iteration reduces to x_{n+1} = Prox_{cg}(x_n).
102
3.5 Fast Iterative Shrinkage Thresholding
S := argmin_{x∈R^m} F(x) ≠ ∅

  t_{n+1} = (1 + √(1 + 4t_n²)) / 2
  x_{n+1} = Prox_{(1/L)g}( y_n − (1/L)∇f(y_n) ) = T y_n
  y_{n+1} = x_{n+1} + ((t_n − 1)/t_{n+1})(x_{n+1} − x_n)
          = (1 − (1 − t_n)/t_{n+1}) x_{n+1} + ((1 − t_n)/t_{n+1}) x_n
          ∈ aff{x_n, x_{n+1}}.

Observe that
  t_{n+1}² − t_{n+1} = t_n².
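The update above translates directly into code. A minimal FISTA sketch on a separable problem f(x) = ½Σᵢ aᵢ(xᵢ − bᵢ)², g = λ‖·‖₁ (my choice of test data, with t₀ := 1 and y₀ := x₀ as the standard initialization):

```python
import math

def soft(z, lam):
    # Prox of lam*|.| at z
    return math.copysign(max(abs(z) - lam, 0.0), z)

a = [1.0, 4.0]          # f(x) = 0.5 * sum a_i (x_i - b_i)^2, so L = max(a) = 4
b = [3.0, -1.0]
lam, L = 0.5, 4.0       # g(x) = lam * ||x||_1

x = [0.0, 0.0]
y, t = list(x), 1.0
for _ in range(2000):
    grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
    x_new = [soft(yi - gi / L, lam / L) for yi, gi in zip(y, grad)]   # x_{n+1} = T y_n
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2
    y = [xn + ((t - 1) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
    x, t = x_new, t_new

# coordinatewise closed-form minimizer of 0.5*a_i(x - b_i)^2 + lam*|x|
x_star = [math.copysign(max(abs(bi) - lam / ai, 0.0), bi) for ai, bi in zip(a, b)]
assert all(abs(xi - si) < 1e-2 for xi, si in zip(x, x_star))
```

The tolerance is deliberately loose: it is what the worst-case O(1/n²) bound guarantees after 2000 steps; in practice the iterates are far closer.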
103
3.5.2 Correctness
Proposition 3.5.1
The sequence (t_n)_{n∈N} satisfies
  t_n ≥ (n + 2)/2 ≥ 1.
Proof
Induction.
  0 ≤ F(x_n) − µ ≤ 2L d_S²(x_0)/(n + 1)² ∈ O(1/n²).
Proof
Set s := P_S(x_0). By the convexity of F,
  F( (1/t_n)s + (1 − 1/t_n)x_n ) ≤ (1/t_n)F(s) + (1 − 1/t_n)F(x_n).
For each n ∈ N, set
  s_n := F(x_n) − µ ≥ 0.
By computation,
  (1 − 1/t_n)s_n − s_{n+1} ≥ F( (1/t_n)s + (1 − 1/t_n)x_n ) − F(x_{n+1}).
104
Applying the proximal-gradient inequality with x = (1/t_n)s + (1 − 1/t_n)x_n and y = y_n yields
  F( (1/t_n)s + (1 − 1/t_n)x_n ) − F(x_{n+1})
  ≥ (L/(2t_n²))‖t_n x_{n+1} − (s + (t_n − 1)x_n)‖² − (L/(2t_n²))‖t_n y_n − (s + (t_n − 1)x_n)‖².
It follows, writing u_n := t_{n−1} x_n − (s + (t_{n−1} − 1)x_{n−1}), that
  (2/L) t_{n−1}² s_n ≤ (2/L) t_{n−1}² s_n + ‖u_n‖²
  ≤ ⋯
  ≤ (2/L) t_0² s_1 + ‖u_1‖²
  = (2/L)(F(x_1) − µ) + ‖x_1 − s‖²
  ≤ ‖x_0 − s‖²,
where the last inequality follows from the proximal gradient inequality.
In other words,
  F(x_n) − µ = s_n
  ≤ (L/2)‖x_0 − s‖² · (1/t_{n−1}²)
  ≤ (L/2)‖x_0 − s‖² · 4/(n + 1)²        (since t_{n−1} ≥ (n + 1)/2)
  = 2L d_S²(x_0)/(n + 1)².
105
3.6 Iterative Shrinkage Thresholding Algorithm
  min ‖x‖₂ subject to Ax = b   (P1)
  min ‖x‖₁ subject to Ax = b   (P2)
Example 3.6.1
Consider the problem
  min_{x∈R^m} ½‖Ax − b‖₂² + λ‖x‖₁   (P)
where λ > 0 and A ∈ Rn×m .
Recall that ∇f is L-Lipschitz if and only if the spectral norm of the Hessian is bounded
by L. Thus ∇f is L-Lipschitz for
L := λmax (AT A).
106
To see that the assumption S := argmin_{x∈R^m} F(x) ≠ ∅ holds, observe that F is continuous, convex, and coercive, with dom F = R^m. We then use the fact that if F is convex, l.s.c., proper, and coercive and ∅ ≠ C is closed and convex with dom F ∩ C ≠ ∅, then F has a minimizer over C.
Now, m can be very large and λmax (AT A) may be difficult to compute. It suffices to use
some upper bound on eigenvalues such as the Frobenius norm
  ‖A‖_F² = Σ_{j=1}^m Σ_{i=1}^n a_{ij}²
         = tr(AᵀA)
         = Σ_{i=1}^m λ_i(AᵀA)
         ≥ λ_max(AᵀA).
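Putting the pieces together gives ISTA with the easy Frobenius-norm stepsize. A minimal sketch on a tiny hand-rolled instance (my choice of A, b, λ), verified against the first-order optimality condition −∇f(x) ∈ λ∂‖·‖₁(x):

```python
import math

def soft(z, lam):
    return math.copysign(max(abs(z) - lam, 0.0), z)

A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [3.0, -2.0, 1.0]
lam = 0.1

# L = lambda_max(A^T A) <= ||A||_F^2 = sum of squared entries (cheap bound)
L = sum(aij * aij for row in A for aij in row)

def grad(x):  # gradient of f(x) = 0.5*||Ax - b||^2 is A^T (Ax - b)
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

x = [0.0, 0.0]
for _ in range(5000):
    g = grad(x)
    x = [soft(xj - gj / L, lam / L) for xj, gj in zip(x, g)]

# optimality: -grad f(x) must lie in lam * ∂||.||_1 at x
g = grad(x)
for xj, gj in zip(x, g):
    if xj != 0.0:
        assert abs(gj + lam * math.copysign(1.0, xj)) < 1e-6
    else:
        assert abs(gj) <= lam + 1e-6
```

Using ‖A‖_F² instead of λ_max(AᵀA) only slows the iteration by a constant factor; correctness is unaffected since any L′ ≥ L is a valid Lipschitz constant.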
Define
Rf := 2 Proxf − Id
Rg := 2 Proxg − Id .
107
Lemma 3.7.1
The following hold:
(i) Rf , Rg are nonexpansive
(ii) T = ½(Id + R_g R_f)
(iii) T is firmly nonexpansive
Proof
Since Prox_f and Prox_g are f.n.e., the reflectors 2Prox_f − Id and 2Prox_g − Id are nonexpansive as shown in the assignments, proving (i). For (ii), by definition,
  T = ½(Id + R_g R_f).
For (iii), (ii) exhibits T as ½-averaged with nonexpansive operator R_g R_f, which is equivalent to T being firmly nonexpansive.
Proposition 3.7.2
Fix T = Fix Rg Rf .
Proof
Let x ∈ Rm . Then
  x ∈ Fix T ⇐⇒ x = ½(x + R_g R_f x) ⇐⇒ x = R_g R_f x ⇐⇒ x ∈ Fix R_g R_f.
Proposition 3.7.3
Proxf (Fix T ) ⊆ S.
Proof
Let x ∈ Fix T and set s := Prox_f(x) = (Id + ∂f)^{−1}(x). Then
  x − s ∈ ∂f(s)  and  R_f(x) = 2s − x.
Moreover, since Fix T = Fix R_g R_f, we have R_g(R_f x) = x, so
  Prox_g(R_f x) = ½(R_f x + x) = s,  hence  R_f(x) − s ∈ ∂g(s).
It follows that
  0 = (x − s) + (R_f(x) − s)
    ∈ ∂f(s) + ∂g(s)
    ⊆ ∂(f + g)(s),
so s ∈ S by Fermat's theorem.
Theorem 3.7.4
Let x_0 ∈ R^m and update via x_{n+1} := T x_n. Then x_n → x̄ ∈ Fix T and Prox_f(x_n) → Prox_f(x̄) ∈ S.
Proof
Remark that x_{n+1} = T x_n = T^{n+1} x_0. Since T is f.n.e., we know that x_n → x̄ ∈ Fix T. Since Prox_f is (firmly) nonexpansive, hence continuous, Prox_f(x_n) → Prox_f(x̄), which lies in S by the previous proposition.
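This Douglas–Rachford iteration is short to implement. A minimal one-dimensional sketch for min |x| + δ_{[2,5]}(x) (my choice of f and g; the minimizer is 2), tracking the shadow sequence Prox_f(x_n):

```python
def prox_f(x):      # prox of f = |.| : soft threshold with parameter 1
    return x - 1.0 if x > 1.0 else (x + 1.0 if x < -1.0 else 0.0)

def prox_g(x):      # prox of g = indicator of [2, 5]: clamp (projection)
    return min(max(x, 2.0), 5.0)

def R(prox, x):     # reflected prox R = 2 Prox - Id
    return 2.0 * prox(x) - x

x = 10.0
for _ in range(300):
    x = 0.5 * (x + R(prox_g, R(prox_f, x)))   # T = (Id + R_g R_f)/2

# the shadow sequence Prox_f(x_n) converges to argmin(|.| + delta_[2,5]) = {2}
assert abs(prox_f(x) - 2.0) < 1e-9
```

Note it is Prox_f(x_n), not x_n itself, that solves the problem: here x_n settles at the fixed point x̄ = 3, whose shadow Prox_f(3) = 2.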
f is convex, l.s.c., and proper, ∅ ≠ C ⊆ int dom f is closed and convex, and
  S := argmin_{x∈C} f(x) ≠ ∅.
Set
  µ := min f(C).
  Σ_{n=0}^∞ t_n = ∞,
  (Σ_{k=0}^n t_k²) / (Σ_{k=0}^n t_k) → 0 as n → ∞,
for example t_n = α/(n + 1) for some α > 0.
Let us write
µk := min{f (xi ) : 0 ≤ i ≤ k}.
Theorem 3.8.1
Assuming the previous assumptions hold, then E[µk ] → µ as k → ∞.
Proof
Pick s ∈ S and let n ∈ N. Then
0 ≤ kxn+1 − sk2
= kPC (xn − tn gn ) − PC sk2
≤ k(xn − tn gn ) − sk2
= k(xn − s) − tn gn k2
= kxn − sk2 − 2tn hgn , xn − si + t2n kgn k2
110
Taking the conditional expectation given x_n, telescoping over n = 0, …, k, and using the second-moment bound E[‖g_n‖² | x_n] ≤ L² yields
  0 ≤ E[‖x_{k+1} − s‖²] ≤ ‖x_0 − s‖² − 2 Σ_{n=0}^k t_n (E[f(x_n)] − µ) + L² Σ_{n=0}^k t_n².
Rearranging yields
  0 ≤ E[µ_k] − µ ≤ (‖x_0 − s‖² + L² Σ_{n=0}^k t_n²) / (2 Σ_{n=0}^k t_n) → 0, k → ∞.
  min_{x∈C} Σ_{i=1}^r f_i(x)   (P)
where each f_i is convex, l.s.c., and proper, and ∅ ≠ C ⊆ int dom f_i. We also assume that for each i ∈ [r], there is some L_i ≥ 0 for which
  ‖∂f_i(C)‖ ≤ L_i.
Proposition 3.8.2
sup ‖∂f_i(C)‖ ≤ L_i if and only if f_i restricted to C is L_i-Lipschitz.
Let us assume that (P) has a solution. We verify (A1), (A2) to justify solving the problem
with SPSM.
Next,
  E[‖g_n‖² | x_n] = Σ_{i=1}^r (1/r) ‖r f_i′(x_n)‖²
                  = r Σ_{i=1}^r ‖f_i′(x_n)‖²
                  ≤ r Σ_{i=1}^r L_i².
112
3.9 Duality
Let
  p := inf_{x∈R^m} f(x) + g(x),
  d := inf_{u∈R^m} f*(u) + g*(−u).
113
where f : R^m → (−∞, ∞] is convex, l.s.c., and proper, g : R^n → (−∞, ∞] is convex, l.s.c., and proper, and A ∈ R^{n×m}. As before, let
  p := inf_{x∈R^m} f(x) + g(Ax).
Lemma 3.9.1
Let h : R^m → (−∞, ∞] be convex, l.s.c., and proper, and for each x ∈ R^m define
  h^∨(x) := h(−x).
Then h^∨ is convex, l.s.c., and proper.
Proof
The convexity, lower semicontinuity, and properness are verified directly from the definition.
  T_d := ½(Id + R_{(g*)^∨} R_{f*}).
Lemma 3.9.2
Let h : Rm → (−∞, ∞] be convex, l.s.c., and proper. The following hold:
(i) Prox_{h^∨} = −Prox_h ∘ (−Id)
(ii) R_{h*} = −R_h
(iii) R_{(h*)^∨} = R_h ∘ (−Id)
Proof
(i): This is shown using the relation Prox_f = (Id + ∂f)^{−1} as well as the lemma ∂(h^∨) = −∂h ∘ (−Id).
(ii): This can be proven by expanding the definition of R_{h*} as well as the relation Prox_{h*} = Id − Prox_h proven in A4.
(iii): Combine (i) applied to h* with the relation Prox_{h*} = Id − Prox_h.
Theorem 3.9.3
Tp = Td .
Proof
From our previous lemma,
  T_d := ½(Id + R_{(g*)^∨} R_{f*})
       = ½(Id + [R_g ∘ (−Id)] ∘ (−R_f))
       = ½(Id + R_g R_f)
       = T_p.
115
Theorem 3.9.4
Let x_0 ∈ R^m and update via x_{n+1} := T_p x_n. Then Prox_f(x_n) converges to a minimizer of f + g, and Prox_{f*}(x_n) converges to a minimizer of f* + (g*)^∨.
Proof
We already know that Proxf (xn ) converges to a minimizer of f + g. Since Tp = Td ,
Proxf ∗ (xn ) converges to a minimizer of f ∗ +(g ∗ )v . Using the fact that Proxf ∗ = Id − Proxf ,
we conclude the proof.
116