Boyd
EE364b Homework 2
1. Subgradient optimality conditions for nondifferentiable inequality constrained optimization. Consider the problem
    minimize    f0(x)
    subject to  fi(x) ≤ 0,   i = 1, . . . , m,
with variable x ∈ Rⁿ. We do not assume that f0, . . . , fm are convex. Suppose that x̃
and λ̃ ⪰ 0 satisfy primal feasibility,
fi (x̃) ≤ 0, i = 1, . . . , m,
dual feasibility,
    0 ∈ ∂f0(x̃) + Σ_{i=1}^m λ̃i ∂fi(x̃),
and the complementarity condition
λ̃i fi (x̃) = 0, i = 1, . . . , m.
Show that x̃ is optimal, using only a simple argument and the definition of subgradient.
Recall that we do not assume the functions f0, . . . , fm are convex.
Solution. Let g be defined by g(x) = f0(x) + Σ_{i=1}^m λ̃i fi(x). Then 0 ∈ ∂g(x̃). By
the definition of subgradient, this means that for every y, g(y) ≥ g(x̃) + 0ᵀ(y − x̃) = g(x̃).
For each i, complementarity implies that either λ̃i = 0 or fi(x̃) = 0. Hence, for any
feasible y (for which fi(y) ≤ 0), each term λ̃i(fi(y) − fi(x̃)) is either zero or negative.
Therefore, any feasible y also satisfies f0(y) ≥ f0(x̃), and x̃ is optimal.
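Written out, the argument chains together as follows, for any feasible y:

\begin{align*}
f_0(y) &\ge f_0(y) + \sum_{i=1}^m \tilde\lambda_i f_i(y)
        && (\tilde\lambda_i \ge 0,\ f_i(y) \le 0) \\
       &= g(y) \\
       &\ge g(\tilde x) + 0^T (y - \tilde x)
        && (\text{definition of subgradient, } 0 \in \partial g(\tilde x)) \\
       &= f_0(\tilde x) + \sum_{i=1}^m \tilde\lambda_i f_i(\tilde x) \\
       &= f_0(\tilde x)
        && (\tilde\lambda_i f_i(\tilde x) = 0).
\end{align*}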
2. Optimality conditions and coordinate-wise descent for ℓ1 -regularized minimization. We
consider the problem of minimizing
    φ(x) = f(x) + λ‖x‖₁,
where f : Rn → R is convex and differentiable, and λ ≥ 0. The number λ is the
regularization parameter, and is used to control the trade-off between small f and
small ‖x‖₁. When ℓ1-regularization is used as a heuristic for finding a sparse x for
which f (x) is small, λ controls (roughly) the trade-off between f (x) and the cardinality
(number of nonzero elements) of x.
(a) Show that x = 0 is optimal for this problem (i.e., minimizes φ) if and only if
‖∇f(0)‖∞ ≤ λ. In particular, for λ ≥ λmax = ‖∇f(0)‖∞, ℓ1 regularization yields
the sparsest possible x, the zero vector.
Remark. The value λmax gives a good reference point for choosing a value of the
penalty parameter λ in ℓ1 -regularized minimization. A common choice is to start
with λ = λmax /2, and then adjust λ to achieve the desired sparsity/fit trade-off.
Solution. A necessary and sufficient condition for optimality of x = 0 is that
0 ∈ ∂φ(0). Now ∂φ(0) = ∇f(0) + λ∂‖0‖₁ = ∇f(0) + λ[−1, 1]ⁿ. In other words,
x = 0 is optimal if and only if −∇f(0) ∈ [−λ, λ]ⁿ, which is equivalent to ‖∇f(0)‖∞ ≤ λ.
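For instance, taking f(x) = ‖Ax − b‖₂² with randomly generated A and b (a choice made just for illustration), a short Python sketch using CVXPY in place of CVX confirms the threshold:

# Numerical check of the lambda_max threshold for f(x) = ||Ax - b||_2^2,
# for which grad f(0) = -2 A^T b, so lambda_max = ||2 A^T b||_inf.
import numpy as np
import cvxpy as cp

np.random.seed(0)
m, n = 20, 10
A = np.random.randn(m, n)
b = np.random.randn(m)
lam_max = np.linalg.norm(2 * A.T @ b, np.inf)

def solve_l1(lam):
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1))).solve()
    return x.value

print(np.max(np.abs(solve_l1(1.01 * lam_max))))   # essentially zero: x = 0 is optimal
print(np.max(np.abs(solve_l1(0.5 * lam_max))))    # nonzero entries appear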
(b) Coordinate-wise descent. In the coordinate-wise descent method for minimizing
a convex function g, we first minimize over x1 , keeping all other variables fixed;
then we minimize over x2 , keeping all other variables fixed, and so on. After
minimizing over xn , we go back to x1 and repeat the whole process, repeatedly
cycling over all n variables.
Show that coordinate-wise descent fails for the function
    g(x) = |x1 − x2| + 0.1(x1 + x2).
(In particular, verify that the algorithm terminates after one step at the point
(x2^(0), x2^(0)), while inf_x g(x) = −∞.) Thus, coordinate-wise descent need not work
for general convex functions.
Solution. We first minimize over x1, with x2 fixed at x2^(0). The optimal choice is
x1 = x2^(0), since the derivative on the left is −0.9, and on the right it is 1.1. We
then arrive at the point (x2^(0), x2^(0)). We now optimize over x2; but x2^(0) is already
optimal, with the same left and right derivatives, so x is unchanged. We're now at a
fixed point of the coordinate-descent algorithm.
On the other hand, taking x = (−t, −t) and letting t → ∞, we see that g(x) =
−0.2t → −∞.
It’s good to visualize coordinate-wise descent for this function, to see why x gets
stuck at the crease along x1 = x2 . The graph looks like a folded piece of paper,
with the crease along the line x1 = x2 . The bottom of the crease has a small
tilt in the direction (−1, −1), so the function is unbounded below. Moving along
either axis increases g, so coordinate-wise descent is stuck. But moving in the
direction (−1, −1), for example, decreases the function.
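A short Python sketch of this behavior, using exact minimization over each coordinate of the function g above:

# Coordinate-wise descent applied to g(x) = |x1 - x2| + 0.1*(x1 + x2).
def g(x1, x2):
    return abs(x1 - x2) + 0.1 * (x1 + x2)

# With the other coordinate fixed, the exact minimizer sits at the kink
# (left derivative -0.9 < 0 < 1.1 = right derivative), i.e., at x1 = x2.
x1, x2 = 3.0, -1.0              # arbitrary starting point
for _ in range(10):
    x1 = x2                     # exact minimization over x1
    x2 = x1                     # exact minimization over x2
print((x1, x2), g(x1, x2))      # stuck at (x2^(0), x2^(0)) = (-1.0, -1.0)

# Yet g is unbounded below along the crease direction (-1, -1):
print([g(-t, -t) for t in (1.0, 10.0, 100.0)])   # -0.2, -2.0, -20.0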
(c) Now consider coordinate-wise descent for minimizing the specific function φ defined
above. Assuming f is strongly convex (say), it can be shown that the iterates
converge to a fixed point x̃. Show that x̃ is optimal, i.e., minimizes φ.
Thus, coordinate-wise descent works for ℓ1 -regularized minimization of a differ-
entiable function.
Solution. For each i, x̃i minimizes the function φ, with all other variables kept
fixed. It follows that
    0 ∈ ∂xi φ(x̃) = (∂f/∂xi)(x̃) + λIi,   i = 1, . . . , n,
where Ii is the subdifferential of | · | at x̃i : Ii = {−1} if x̃i < 0, Ii = {+1} if
x̃i > 0, and Ii = [−1, 1] if x̃i = 0.
But this is the same as saying 0 ∈ ∇f(x̃) + λ∂‖x̃‖₁, which means that x̃ minimizes
φ.
The subtlety here lies in the general formula that relates the subdifferential of
a function to its partial subdifferentials with respect to its components. For a
separable function h : R2 → R, we have
∂h(x) = ∂x1 h(x) × ∂x2 h(x),
but this is false in general.
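For example, for the non-separable function h(x) = |x1 − x2|, equality fails at x = 0:

\[
\partial h(0) = \{ (t, -t) : t \in [-1, 1] \}
\;\subsetneq\;
[-1, 1] \times [-1, 1]
= \partial_{x_1} h(0) \times \partial_{x_2} h(0).
\]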
(d) Work out an explicit form for coordinate-wise descent for ℓ1 -regularized least-
squares, i.e., for minimizing the function
    ‖Ax − b‖₂² + λ‖x‖₁.
You might find the deadzone function
    ψ(u) = u − 1 for u > 1,   ψ(u) = 0 for |u| ≤ 1,   ψ(u) = u + 1 for u < −1
useful. Generate some data and try out the coordinate-wise descent method.
Check the result against the solution found using CVX, and produce a graph show-
ing convergence of your coordinate-wise method.
Solution. At each step we choose an index i, and minimize ‖Ax − b‖₂² + λ‖x‖₁
over xi, while holding all other xj, with j ≠ i, constant.
Selecting the optimal xi for this problem is equivalent to selecting the optimal xi
in the problem
    minimize   axi² + cxi + |xi|,
where a = (AᵀA)ii/λ and c = (2/λ)(Σ_{j≠i} (AᵀA)ij xj − (bᵀA)i). Using the theory
discussed above, any minimizer xi will satisfy 0 ∈ 2axi + c + ∂|xi |. Now we note
that a is positive, so the minimizer of the above problem will have opposite sign
to c. From there we deduce that the (unique) minimizer x⋆i will be
    x⋆i = 0 if c ∈ [−1, 1],   and   x⋆i = (1/(2a))(−c + sign(c)) otherwise,
where sign(u) = −1 for u < 0, sign(u) = 0 for u = 0, and sign(u) = 1 for u > 0.
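As a check, when c > 1 the formula gives x⋆i = (1 − c)/(2a) < 0, so ∂|x⋆i| = {−1} and
2ax⋆i + c − 1 = (1 − c) + c − 1 = 0, verifying that 0 ∈ 2ax⋆i + c + ∂|x⋆i|; the cases
c < −1 and c ∈ [−1, 1] are verified in the same way.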
Finally, we make use of the deadzone function ψ defined above and write
    x⋆i = −ψ( (2/λ)(Σ_{j≠i} (AᵀA)ij xj − (bᵀA)i) ) / ( (2/λ)(AᵀA)ii ).
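A minimal Python sketch of the resulting method, with randomly generated data and λ = 1 chosen arbitrarily, and with CVXPY standing in for CVX as the reference solver:

# Coordinate-wise descent for ||Ax - b||_2^2 + lam*||x||_1, using the
# closed-form single-coordinate update derived above, checked against CVXPY.
import numpy as np
import cvxpy as cp

def deadzone(u):
    # psi(u) = u - 1 for u > 1, 0 for |u| <= 1, u + 1 for u < -1
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

np.random.seed(0)
m, n, lam = 40, 20, 1.0
A = np.random.randn(m, n)
b = np.random.randn(m)
G, h = A.T @ A, A.T @ b

def objective(x):
    return np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Reference optimal value.
xv = cp.Variable(n)
pstar = cp.Problem(cp.Minimize(cp.sum_squares(A @ xv - b) + lam * cp.norm(xv, 1))).solve()

x = np.zeros(n)
gaps = []
for sweep in range(30):
    for i in range(n):
        c = (2.0 / lam) * (G[i, :] @ x - G[i, i] * x[i] - h[i])
        x[i] = -deadzone(c) / ((2.0 / lam) * G[i, i])
    gaps.append(objective(x) - pstar)
print(gaps)   # a semilogy plot of these gaps gives the convergence figure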
[Figure: convergence of the coordinate-wise descent method for ℓ1-regularized least-squares; vertical axis on a logarithmic scale from 10^1 down to 10^−7, horizontal axis from 0 to 30 (iterations).]