
EE364b                                                        Prof. S. Boyd

EE364b Homework 2
1. Subgradient optimality conditions for nondifferentiable inequality constrained optimization. Consider the problem

       minimize    f0(x)
       subject to  fi(x) ≤ 0,   i = 1, . . . , m,

   with variable x ∈ Rⁿ. We do not assume that f0, . . . , fm are convex. Suppose that x̃ and λ̃ ⪰ 0 satisfy primal feasibility,

       fi(x̃) ≤ 0,   i = 1, . . . , m,

   dual feasibility,

       0 ∈ ∂f0(x̃) + ∑_{i=1}^m λ̃i ∂fi(x̃),

   and the complementarity condition

       λ̃i fi(x̃) = 0,   i = 1, . . . , m.

   Show that x̃ is optimal, using only a simple argument and the definition of subgradient. Recall that we do not assume the functions f0, . . . , fm are convex.
   Solution. Let g be defined by g(x) = f0(x) + ∑_{i=1}^m λ̃i fi(x). Then, 0 ∈ ∂g(x̃). By definition of subgradient, this means that for any y,

       g(y) ≥ g(x̃) + 0ᵀ(y − x̃).

   Thus, for any y,

       f0(y) ≥ f0(x̃) − ∑_{i=1}^m λ̃i (fi(y) − fi(x̃)).

   For each i, complementarity implies that either λ̃i = 0 or fi(x̃) = 0. Hence, for any feasible y (for which fi(y) ≤ 0), each λ̃i (fi(y) − fi(x̃)) term is either zero or nonpositive. Therefore, any feasible y also satisfies f0(y) ≥ f0(x̃), and x̃ is optimal.
2. Optimality conditions and coordinate-wise descent for ℓ1-regularized minimization. We consider the problem of minimizing

       φ(x) = f(x) + λ‖x‖₁,

   where f : Rⁿ → R is convex and differentiable, and λ ≥ 0. The number λ is the regularization parameter, and is used to control the trade-off between small f and small ‖x‖₁. When ℓ1-regularization is used as a heuristic for finding a sparse x for which f(x) is small, λ controls (roughly) the trade-off between f(x) and the cardinality (number of nonzero elements) of x.
   (a) Show that x = 0 is optimal for this problem (i.e., minimizes φ) if and only if ‖∇f(0)‖∞ ≤ λ. In particular, for λ ≥ λmax = ‖∇f(0)‖∞, ℓ1-regularization yields the sparsest possible x, the zero vector.
       Remark. The value λmax gives a good reference point for choosing a value of the penalty parameter λ in ℓ1-regularized minimization. A common choice is to start with λ = λmax/2, and then adjust λ to achieve the desired sparsity/fit trade-off.
       Solution. A necessary and sufficient condition for optimality of x = 0 is that 0 ∈ ∂φ(0). Now ∂φ(0) = ∇f(0) + λ∂‖0‖₁ = ∇f(0) + λ[−1, 1]ⁿ. In other words, x = 0 is optimal if and only if −∇f(0) ∈ [−λ, λ]ⁿ. This is equivalent to ‖∇f(0)‖∞ ≤ λ.
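       As an illustration (a sketch, not part of the original assignment), the Matlab snippet below checks this threshold numerically for the hypothetical choice f(x) = ‖Ax − b‖₂², for which ∇f(0) = −2Aᵀb and hence λmax = 2‖Aᵀb‖∞; the names A, b, and lambda_max are made up for this example.

% Sketch: check the zero-optimality threshold for f(x) = ||A*x - b||_2^2,
% where grad f(0) = -2*A'*b, so lambda_max = norm(2*A'*b, inf).
randn('state', 0);
m = 20; n = 10;
A = randn(m, n); b = randn(m, 1);

g0 = -2*A'*b;                  % gradient of f at x = 0
lambda_max = norm(g0, inf);    % smallest lambda for which x = 0 is optimal

for lambda = [0.5*lambda_max, 2*lambda_max]
    zero_is_optimal = (norm(g0, inf) <= lambda);
    fprintf('lambda = %.3f, x = 0 optimal: %d\n', lambda, zero_is_optimal);
end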
   (b) Coordinate-wise descent. In the coordinate-wise descent method for minimizing a convex function g, we first minimize over x1, keeping all other variables fixed; then we minimize over x2, keeping all other variables fixed, and so on. After minimizing over xn, we go back to x1 and repeat the whole process, repeatedly cycling over all n variables.
       Show that coordinate-wise descent fails for the function

           g(x) = |x1 − x2| + 0.1(x1 + x2).

       (In particular, verify that the algorithm terminates after one step at the point (x2^(0), x2^(0)), while inf_x g(x) = −∞.) Thus, coordinate-wise descent need not work, for general convex functions.
       Solution. We first minimize over x1, with x2 fixed at x2^(0). The optimal choice is x1 = x2^(0), since the derivative on the left is −0.9, and on the right it is 1.1. We then arrive at the point (x2^(0), x2^(0)). We now optimize over x2. But the current value is already optimal, with the same left and right derivatives, so x is unchanged. We're now at a fixed point of the coordinate-descent algorithm.
       On the other hand, taking x = (−t, −t) and letting t → ∞, we see that g(x) = −0.2t → −∞.
It’s good to visualize coordinate-wise descent for this function, to see why x gets
stuck at the crease along x1 = x2 . The graph looks like a folded piece of paper,
with the crease along the line x1 = x2 . The bottom of the crease has a small
tilt in the direction (−1, −1), so the function is unbounded below. Moving along
either axis increases g, so coordinate-wise descent is stuck. But moving in the
direction (−1, −1), for example, decreases the function.
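       To see this concretely, here is a small Matlab sketch (not part of the original solution) that runs coordinate-wise descent on g, using fminbnd over a wide interval as a stand-in for the exact one-dimensional minimizations; the starting point and interval are arbitrary choices for the illustration.

% Sketch: coordinate-wise descent on g(x) = |x1 - x2| + 0.1*(x1 + x2).
g = @(x) abs(x(1) - x(2)) + 0.1*(x(1) + x(2));
x = [3; -1];                        % arbitrary starting point (x1^(0), x2^(0))

for iter = 1:5
    x(1) = fminbnd(@(u) g([u; x(2)]), -100, 100);   % minimize over x1
    x(2) = fminbnd(@(u) g([x(1); u]), -100, 100);   % minimize over x2
    fprintf('iter %d: x = (%.4f, %.4f), g(x) = %.4f\n', iter, x(1), x(2), g(x));
end

% The iterates stay (numerically) stuck near (x2^(0), x2^(0)) = (-1, -1),
% even though g is unbounded below along (-1, -1): g(-t, -t) = -0.2*t.

       The printed iterates remain at (approximately) (−1, −1), matching the fixed point described above.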
   (c) Now consider coordinate-wise descent for minimizing the specific function φ defined above. Assuming f is strongly convex (say), it can be shown that the iterates converge to a fixed point x̃. Show that x̃ is optimal, i.e., minimizes φ.
       Thus, coordinate-wise descent works for ℓ1-regularized minimization of a differentiable function.
       Solution. For each i, x̃i minimizes the function φ over xi, with all other variables kept fixed. It follows that

           0 ∈ ∂xi φ(x̃) = ∂f/∂xi (x̃) + λIi,   i = 1, . . . , n,

       where Ii is the subdifferential of | · | at x̃i: Ii = {−1} if x̃i < 0, Ii = {+1} if x̃i > 0, and Ii = [−1, 1] if x̃i = 0.
       But this is the same as saying 0 ∈ ∇f(x̃) + λ∂‖x̃‖₁, which means that x̃ minimizes φ.
       The subtlety here lies in the general formula that relates the subdifferential of a function to its partial subdifferentials with respect to its components. For a separable function h : R² → R, we have

           ∂h(x) = ∂x1 h(x) × ∂x2 h(x),

       but this is false in general. The function g from part (b) is an example: at any point with x1 = x2, both partial subdifferentials contain zero, yet such a point does not minimize g.
   (d) Work out an explicit form for coordinate-wise descent for ℓ1-regularized least-squares, i.e., for minimizing the function

           ‖Ax − b‖₂² + λ‖x‖₁.

       You might find the deadzone function

           ψ(u) = u − 1   if u > 1,
                  0       if |u| ≤ 1,
                  u + 1   if u < −1

       useful. Generate some data and try out the coordinate-wise descent method. Check the result against the solution found using CVX, and produce a graph showing convergence of your coordinate-wise method.
       Solution. At each step we choose an index i, and minimize ‖Ax − b‖₂² + λ‖x‖₁ over xi, while holding all other xj, with j ≠ i, constant.
       Selecting the optimal xi for this problem is equivalent to selecting the optimal xi in the problem

           minimize   a xi^2 + c xi + |xi|,

       where a = (AᵀA)ii/λ and c = (2/λ)(∑_{j≠i} (AᵀA)ij xj + (bᵀA)i). Using the theory discussed above, any minimizer xi will satisfy 0 ∈ 2a xi + c + ∂|xi|. Now we note that a is positive, so the minimizer of the above problem will have opposite sign to c. From there we deduce that the (unique) minimizer x⋆i will be

           x⋆i = 0                         if c ∈ [−1, 1],
                 (1/(2a))(−c + sign(c))    otherwise,

       where

           sign(u) = −1 if u < 0,   0 if u = 0,   1 if u > 0.

       Finally, we make use of the deadzone function ψ defined above and write

           x⋆i = −ψ((2/λ)(∑_{j≠i} (AᵀA)ij xj + (bᵀA)i)) / ((2/λ)(AᵀA)ii).
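       As a quick sanity check (again a sketch, not part of the original solution), one can compare this closed-form coordinate minimizer with a brute-force one-dimensional minimization; the helper psi below is the deadzone function from the problem statement, and a, c are random test values.

% Sketch: verify the closed-form minimizer of a*x^2 + c*x + |x| (a > 0),
% namely x = -psi(c)/(2*a), against a numerical 1-D minimization.
psi = @(u) (u - 1).*(u > 1) + (u + 1).*(u < -1);    % deadzone function
randn('state', 0);

for trial = 1:5
    a = abs(randn) + 0.1;  c = 3*randn;
    h = @(x) a*x.^2 + c*x + abs(x);
    x_closed = -psi(c)/(2*a);                       % closed-form minimizer
    x_num = fminbnd(h, -100, 100);                  % numerical minimizer
    fprintf('a = %.3f, c = %.3f, closed form %.5f, numerical %.5f\n', ...
        a, c, x_closed, x_num);
end

       Since 2a = (2/λ)(AᵀA)ii, this is exactly the update used in the sample code below, where sign(c)·pos(|c| − 1) plays the role of ψ(c).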

       Coordinate descent was implemented in Matlab for a random problem instance with A ∈ R^{400×200}. When solving to within 0.1% accuracy, the iterative method required only about a third of the time needed by CVX. Sample code appears below, followed by a graph showing the coordinate-wise descent method's function value converging to the CVX function value.

% Generate a random problem instance.
randn('state', 10239); m = 400; n = 200;
A = randn(m, n); ATA = A'*A;
b = randn(m, 1);
l = 0.1;                 % regularization parameter lambda
TOL = 0.001;             % relative accuracy of the stopping criterion
xcoord = zeros(n, 1);

% Solve in CVX as a benchmark.
cvx_begin
    variable xcvx(n);
    minimize(sum_square(A*xcvx + b) + l*norm(xcvx, 1));
cvx_end

% Solve using coordinate-wise descent, sweeping over the coordinates
% until the objective is within TOL (relative) of the CVX optimal value.
while abs(cvx_optval - (sum_square(A*xcoord + b) + ...
        l*norm(xcoord, 1)))/cvx_optval > TOL
    for i = 1:n
        % c is the coefficient of the single-coordinate subproblem,
        % computed with xcoord(i) temporarily set to zero.
        xcoord(i) = 0; ei = zeros(n,1); ei(i) = 1;
        c = 2/l*ei'*(ATA*xcoord + A'*b);
        % Closed-form update: sign(c)*pos(abs(c)-1) equals the deadzone psi(c).
        xcoord(i) = -sign(c)*pos(abs(c) - 1)/(2*ATA(i,i)/l);
    end
end

[Figure: convergence of the coordinate-wise descent method's function value to the CVX optimal value; vertical axis on a log scale from 10^1 down to 10^-7, horizontal axis from 0 to 30 iterations.]
