Stanford Machine Learning Mathematical Foundations, 41-48


3.4 Sublevel Sets
Convex functions give rise to a particularly important type of convex set called an α-sublevel
set. Given a convex function f : R^n → R and a real number α ∈ R, the α-sublevel set is
defined as

{x ∈ D(f) : f(x) ≤ α}.

In other words, the α-sublevel set is the set of all points x such that f(x) ≤ α.
To show that this is a convex set, consider any x, y ∈ D(f) such that f(x) ≤ α and
f(y) ≤ α. Then, for any 0 ≤ θ ≤ 1,

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) ≤ θα + (1 − θ)α = α,

so θx + (1 − θ)y also belongs to the α-sublevel set, which establishes convexity.
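As a quick sanity check of this argument, the following Python sketch (an added illustration, not part of the original notes) samples pairs of points in the α-sublevel set of the convex function f(x) = x^T x and verifies that every convex combination of them stays in the set; the choice of f, α, and the tolerance are arbitrary assumptions made for the example.

import numpy as np

# Illustrative convex function and sublevel threshold (assumed for the example).
f = lambda x: float(x @ x)          # f(x) = x^T x, a convex quadratic
alpha = 0.5

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, size=3)
    y = rng.uniform(-1.0, 1.0, size=3)
    if f(x) > alpha or f(y) > alpha:
        continue                     # keep only pairs inside the alpha-sublevel set
    theta = rng.uniform()
    z = theta * x + (1.0 - theta) * y
    # The convex combination must remain in the sublevel set.
    assert f(z) <= alpha + 1e-12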

3.5 Examples
We begin with a few simple examples of convex functions of one variable, then move on to
multivariate functions. A short numerical check of these convexity conditions follows the list.

• Exponential. Let f : R → R, f(x) = e^{ax} for any a ∈ R. To show f is convex, we can
simply take the second derivative f''(x) = a^2 e^{ax}, which is nonnegative for all x (and
strictly positive whenever a ≠ 0).

• Negative logarithm. Let f : R → R, f(x) = − log x with domain D(f) = R_{++}
(here, R_{++} denotes the set of strictly positive real numbers, {x : x > 0}). Then
f''(x) = 1/x^2 > 0 for all x in the domain.

• Affine functions. Let f : R^n → R, f(x) = b^T x + c for some b ∈ R^n, c ∈ R. In
this case the Hessian is ∇_x^2 f(x) = 0 for all x. Because the zero matrix is both positive
semidefinite and negative semidefinite, f is both convex and concave. In fact, affine
functions of this form are the only functions that are both convex and concave.

• Quadratic functions. Let f : R^n → R, f(x) = (1/2) x^T A x + b^T x + c for a symmetric
matrix A ∈ S^n, b ∈ R^n and c ∈ R. In our previous section notes on linear algebra, we
showed that the Hessian for this function is given by

∇_x^2 f(x) = A.

Therefore, the convexity or non-convexity of f is determined entirely by whether or
not A is positive semidefinite: if A is positive semidefinite then the function is convex
(and analogously for strictly convex, concave, strictly concave). If A is indefinite then
f is neither convex nor concave.
Note that the squared Euclidean norm f(x) = ||x||_2^2 = x^T x is the special case of a
quadratic function with A = I, b = 0, c = 0; since I is positive definite, f is strictly convex.

• Norms. Let f : R^n → R be some norm on R^n. Then by the triangle inequality and
positive homogeneity of norms, for x, y ∈ R^n, 0 ≤ θ ≤ 1,

f(θx + (1 − θ)y) ≤ f(θx) + f((1 − θ)y) = θf(x) + (1 − θ)f(y).

This is an example of a convex function where it is not possible to prove convexity based
on the second or first order conditions, because norms are not generally differentiable
everywhere (e.g., the 1-norm, ||x||_1 = \sum_{i=1}^n |x_i|, is non-differentiable at all points
where any x_i is equal to zero).
• Nonnegative weighted sums of convex functions. Let f_1, f_2, . . . , f_k be convex
functions and w_1, w_2, . . . , w_k be nonnegative real numbers. Then

f(x) = \sum_{i=1}^k w_i f_i(x)

is a convex function, since


f(θx + (1 − θ)y) = \sum_{i=1}^k w_i f_i(θx + (1 − θ)y)
                 ≤ \sum_{i=1}^k w_i (θ f_i(x) + (1 − θ) f_i(y))
                 = θ \sum_{i=1}^k w_i f_i(x) + (1 − θ) \sum_{i=1}^k w_i f_i(y)
                 = θ f(x) + (1 − θ) f(y).
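The short Python sketch below (an added illustration, not part of the original notes) numerically checks two of the conditions used above: that a matrix of the form M^T M is a positive semidefinite Hessian for a quadratic, and that the 1-norm satisfies the defining convexity inequality at random points; the matrix size, sample counts, and tolerance are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

# Quadratic f(x) = (1/2) x^T A x + b^T x + c is convex iff A is positive semidefinite.
M = rng.standard_normal((4, 4))
A = M.T @ M                                   # M^T M is always positive semidefinite
print("smallest eigenvalue of A:", np.linalg.eigvalsh(A).min())  # >= 0 up to roundoff

# The 1-norm is convex: f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y).
f = lambda x: np.abs(x).sum()
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    theta = rng.uniform()
    lhs = f(theta * x + (1.0 - theta) * y)
    rhs = theta * f(x) + (1.0 - theta) * f(y)
    assert lhs <= rhs + 1e-9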

4 Convex Optimization Problems


Armed with the definitions of convex functions and sets, we are now equipped to consider
convex optimization problems. Formally, a convex optimization problem is an optimization
problem of the form

minimize    f(x)
subject to  x ∈ C

where f is a convex function, C is a convex set, and x is the optimization variable. However,
since this formulation can be a little vague, we often write it as

minimize    f(x)
subject to  g_i(x) ≤ 0,   i = 1, . . . , m
            h_i(x) = 0,   i = 1, . . . , p

where f is a convex function, the g_i are convex functions, the h_i are affine functions, and
x is the optimization variable.

It is important to note the direction of these inequalities: a convex function g_i must be
constrained to be less than or equal to zero. This is because the 0-sublevel set of g_i is a
convex set, so the feasible region, which is the intersection of many convex sets, is also
convex (recall that affine subspaces are convex sets as well). If we were to require that
g_i ≥ 0 for some convex g_i, the feasible region would no longer be a convex set, and the
algorithms we apply for solving these problems would no longer be guaranteed to find the
global optimum. Also notice that only affine functions are allowed to be equality constraints.
Intuitively, you can think of this as being due to the fact that an equality constraint is
equivalent to the two inequalities h_i ≤ 0 and h_i ≥ 0. However, these will both be valid
constraints if and only if h_i is both convex and concave, i.e., h_i must be affine.
The optimal value of an optimization problem is denoted p⋆ (or sometimes f⋆) and is
equal to the minimum possible value of the objective function in the feasible region:^7

p⋆ = min{f(x) : g_i(x) ≤ 0, i = 1, . . . , m, h_i(x) = 0, i = 1, . . . , p}.

(Footnote 7: Math majors might note that the min appearing here should more correctly be an
inf. We won't worry about such technicalities, and use min for simplicity.)

We allow p⋆ to take on the values +∞ and −∞ when the problem is either infeasible (the
feasible region is empty) or unbounded below (there exist feasible points such that f(x) →
−∞), respectively. We say that x⋆ is an optimal point if f(x⋆) = p⋆. Note that there can
be more than one optimal point, even when the optimal value is finite.
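To make these possibilities concrete, here is a small illustrative example (added here, not part of the original notes): for the one-dimensional problem of minimizing x^2 subject to x ≥ 1, the optimal value is p⋆ = 1, attained at the unique optimal point x⋆ = 1; if we instead impose the two constraints x ≥ 1 and x ≤ 0, the feasible region is empty and p⋆ = +∞; and if we minimize x with no constraints at all, the problem is unbounded below and p⋆ = −∞.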

4.1 Global Optimality in Convex Problems


Before stating the result on global optimality in convex problems, let us formally define
the concepts of local optima and global optima. Intuitively, a feasible point is called locally
optimal if there are no “nearby” feasible points that have a lower objective value. Similarly,
a feasible point is called globally optimal if there are no feasible points at all that have a
lower objective value. To formalize this a bit more, we give the following two definitions.

Definition 4.1 A point x is locally optimal if it is feasible (i.e., it satisfies the constraints
of the optimization problem) and if there exists some R > 0 such that all feasible points z
with ||x − z||_2 ≤ R satisfy f(x) ≤ f(z).

Definition 4.2 A point x is globally optimal if it is feasible and for all feasible points z,
f (x) ≤ f (z).

We now come to the crucial element of convex optimization problems, from which they
derive most of their utility. The key idea is that for a convex optimization problem,

all locally optimal points are globally optimal.
Let’s give a quick proof of this property by contradiction. Suppose that x is a locally
optimal point which is not globally optimal, i.e., there exists a feasible point y such that
f(x) > f(y). By the definition of local optimality, there exist no feasible points z such that
||x − z||_2 ≤ R and f(z) < f(x). But now suppose we choose the point
z = θy + (1 − θ)x    with    θ = R / (2||x − y||_2).

(If ||x − y||_2 < R/2 then θ > 1; but in that case y itself lies within distance R of x, which
already contradicts the local optimality of x, so we may assume 0 ≤ θ ≤ 1.)
Then

||x − z||_2 = || x − ( (R / (2||x − y||_2)) y + (1 − R / (2||x − y||_2)) x ) ||_2
            = || (R / (2||x − y||_2)) (x − y) ||_2
            = R/2 ≤ R.
In addition, by the convexity of f we have

f(z) = f(θy + (1 − θ)x) ≤ θf(y) + (1 − θ)f(x) < f(x).

Furthermore, since the feasible set is a convex set, and since x and y are both feasible,
z = θy + (1 − θ)x will be feasible as well. Therefore, z is a feasible point with ||x − z||_2 ≤ R
and f(z) < f(x). This contradicts our assumption, showing that x cannot be locally optimal.

4.2 Special Cases of Convex Problems


For a variety of reasons, it is often convenient to consider special cases of the general
convex programming formulation. For these special cases we can often devise extremely
efficient algorithms that can solve very large problems, and because of this you will probably
see these special cases referred to whenever people use convex optimization techniques.
• Linear Programming. We say that a convex optimization problem is a linear
program (LP) if both the objective function f and inequality constraints g_i are affine
functions. In other words, these problems have the form

minimize    c^T x + d
subject to  Gx ⪯ h
            Ax = b

where x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, G ∈ R^{m×n}, h ∈ R^m,
A ∈ R^{p×n}, b ∈ R^p are defined by the problem, and ‘⪯’ denotes elementwise inequality.
• Quadratic Programming. We say that a convex optimization problem is a quadratic
program (QP) if the inequality constraints g_i are still all affine, but the objective
function f is a convex quadratic function. In other words, these problems have the
form

minimize    (1/2) x^T P x + c^T x + d
subject to  Gx ⪯ h
            Ax = b

where again x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, G ∈ R^{m×n}, h ∈ R^m,
A ∈ R^{p×n}, b ∈ R^p are defined by the problem, but we also have P ∈ S_+^n, a symmetric
positive semidefinite matrix.

• Quadratically Constrained Quadratic Programming. We say that a convex
optimization problem is a quadratically constrained quadratic program (QCQP)
if both the objective f and the inequality constraints g_i are convex quadratic functions,

minimize    (1/2) x^T P x + c^T x + d
subject to  (1/2) x^T Q_i x + r_i^T x + s_i ≤ 0,   i = 1, . . . , m
            Ax = b

where, as before, x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, A ∈ R^{p×n}, b ∈ R^p,
P ∈ S_+^n, but we also have Q_i ∈ S_+^n, r_i ∈ R^n, s_i ∈ R, for i = 1, . . . , m.

• Semidefinite Programming. This last example is a bit more complex than the previous
ones, so don’t worry if it doesn’t make much sense at first. However, semidefinite
programming is becoming more and more prevalent in many different areas of machine
learning research, so you might encounter these at some point, and it is good to have an
idea of what they are. We say that a convex optimization problem is a semidefinite
program (SDP) if it is of the form

minimize    tr(CX)
subject to  tr(A_i X) = b_i,   i = 1, . . . , p
            X ⪰ 0

where the symmetric matrix X ∈ S^n is the optimization variable, the symmetric matrices
C, A_1, . . . , A_p ∈ S^n are defined by the problem, and the constraint X ⪰ 0 means
that we are constraining X to be positive semidefinite. This looks a bit different from
the problems we have seen previously, since the optimization variable is now a matrix
instead of a vector. If you are curious as to why such a formulation might be useful,
you should look into a more advanced course or book on convex optimization.

It should be fairly obvious from the definitions that quadratic programs are more general
than linear programs (since a linear program is just a special case of a quadratic program
where P = 0), and likewise that quadratically constrained quadratic programs are more
general than quadratic programs. However, what is not obvious at all is that semidefinite
programs are in fact more general than all of the previous types. That is, any quadratically
constrained quadratic program (and hence any quadratic program or linear program) can
be expressed as a semidefinite program. We won’t discuss this relationship further in this
document, but this might give you just a small idea as to why semidefinite programming
could be useful. A short code sketch that sets up an LP and a QP in the standard forms
above is given below.
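To make the standard forms above concrete, here is a minimal sketch using the cvxpy modeling library (the choice of library and the random problem data are assumptions added for illustration, not part of the original notes). It sets up a small LP and a small QP in the forms given above and reports their optimal values.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 3, 2

# Arbitrary problem data, chosen so the feasible region is nonempty and bounded.
x0 = rng.standard_normal(n)                   # a point to build feasible constraints around
G = np.vstack([rng.standard_normal((m, n)),   # m random inequalities ...
               np.eye(n), -np.eye(n)])        # ... plus box constraints -5 <= x <= 5
h = np.concatenate([G[:m] @ x0 + 1.0, 5.0 * np.ones(2 * n)])
A = rng.standard_normal((p, n))
b = A @ x0
c = rng.standard_normal(n)
d = 0.0
M = rng.standard_normal((n, n))               # (1/2) x^T (M^T M) x == (1/2) ||M x||_2^2

x = cp.Variable(n)
constraints = [G @ x <= h, A @ x == b]

# Linear program: affine objective, affine inequality/equality constraints.
lp = cp.Problem(cp.Minimize(c @ x + d), constraints)
lp.solve()
print("LP optimal value p* =", lp.value)

# Quadratic program: convex quadratic objective with P = M^T M, same constraints.
qp = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(M @ x) + c @ x + d), constraints)
qp.solve()
print("QP optimal value p* =", qp.value)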

4.3 Examples
Now that we’ve covered plenty of the boring math and formalisms behind convex optimiza-
tion, we can finally get to the fun part: using these techniques to solve actual problems.
We’ve already encountered a few such optimization problems in class, and in nearly every
field, there is a good chance that someone has tried to apply convex optimization to solve
some problem.
• Support Vector Machines. One of the most prevalent applications of convex op-
timization methods in machine learning is the support vector machine classifier. As
discussed in class, finding the support vector classifier (in the case with slack variables)
can be formulated as the optimization problem
minimize    (1/2)||w||_2^2 + C \sum_{i=1}^m ξ_i
subject to  y^{(i)} (w^T x^{(i)} + b) ≥ 1 − ξ_i,   i = 1, . . . , m
            ξ_i ≥ 0,   i = 1, . . . , m
with optimization variables w ∈ R^n, ξ ∈ R^m, b ∈ R, and where C ∈ R and x^{(i)}, y^{(i)},
i = 1, . . . , m are defined by the problem. This is an example of a quadratic program, which
we can see by putting the problem into the form described in the previous section. In
particular, if we define k = m + n + 1 and let the optimization variable be

            [ w ]
x ∈ R^k ≡   [ ξ ]
            [ b ]
and define the matrices

              [ I  0  0 ]              [   0   ]
P ∈ R^{k×k} = [ 0  0  0 ],   c ∈ R^k = [ C · 1 ],
              [ 0  0  0 ]              [   0   ]

               [ −diag(y)X  −I  −y ]                  [ −1 ]
G ∈ R^{2m×k} = [                    ],   h ∈ R^{2m} = [    ],
               [     0      −I   0 ]                  [  0 ]

where I is the identity, 1 is the vector of all ones, and X and y are defined as in class,

              [ x^{(1)T} ]              [ y^{(1)} ]
X ∈ R^{m×n} = [ x^{(2)T} ],   y ∈ R^m = [ y^{(2)} ].
              [    ⋮     ]              [    ⋮    ]
              [ x^{(m)T} ]              [ y^{(m)} ]

You should try to convince yourself that the quadratic program described in the previous
section, when using the matrices defined above, is equivalent to the SVM optimization
problem. In reality, it is fairly easy to see that the SVM optimization problem has a
quadratic objective and linear constraints, so we typically don’t need to put it into standard
form to “prove” that it is a QP, and would only do so if we are using an off-the-shelf solver
that requires the input to be in standard form. (A short code sketch of this problem in a
modeling library appears after this list of examples.)

• Constrained least squares. In class we have also considered the least squares problem,
where we want to minimize ||Ax − b||_2^2 for some matrix A ∈ R^{m×n} and b ∈ R^m.
As we saw, this particular problem can actually be solved analytically via the normal
equations. However, suppose that we also want to constrain the entries in the solution
x to lie within some predefined ranges. In other words, suppose we wanted to solve
the optimization problem

minimize    (1/2)||Ax − b||_2^2
subject to  l ⪯ x ⪯ u
with optimization variable x and problem data A ∈ R^{m×n}, b ∈ R^m, l ∈ R^n, and u ∈ R^n.
This might seem like a fairly simple additional constraint, but it turns out that there
will no longer be an analytical solution. However, you should be able to convince
yourself that this optimization problem is a quadratic program, with matrices defined
by

P ∈ R^{n×n} = A^T A,   c ∈ R^n = −A^T b,   d ∈ R = (1/2) b^T b,

               [ −I ]                 [ −l ]
G ∈ R^{2n×n} = [    ],   h ∈ R^{2n} = [    ].
               [  I ]                 [  u ]
• Maximum Likelihood for Logistic Regression. For homework one, you were
required to show that the log-likelihood of the data in a logistic model was concave.
The log-likelihood under such a model is

ℓ(θ) = \sum_{i=1}^n [ y^{(i)} ln g(θ^T x^{(i)}) + (1 − y^{(i)}) ln(1 − g(θ^T x^{(i)})) ]

where g(z) denotes the logistic function g(z) = 1/(1 + e^{−z}). Finding the maximum
likelihood estimate is then a task of maximizing the log-likelihood (or equivalently,
minimizing the negative log-likelihood, a convex function), i.e.,

minimize −ℓ(θ)

with optimization variable θ ∈ R^n and no constraints.
Unlike the previous two examples, it turns out that it is not so easy to put this problem
into a “standard” form optimization problem. Nevertheless, you’ve seen on the homework
that the fact that ℓ is a concave function means that you can very efficiently find the
global solution using an algorithm such as Newton’s method (a brief Newton’s method
sketch follows the SVM sketch below).
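As referenced in the support vector machine example above, the following sketch (an added illustration, not part of the original notes) forms the soft-margin SVM problem directly in the cvxpy modeling library on synthetic data; the data generation, the value of C, and the choice of library are all assumptions made for the example.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, C = 40, 2, 1.0

# Synthetic, roughly separable data with labels in {-1, +1}.
X = rng.standard_normal((m, n))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)             # slack variables xi_i >= 0

# (1/2)||w||_2^2 + C * sum_i xi_i subject to y_i (w^T x_i + b) >= 1 - xi_i.
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
prob = cp.Problem(objective, constraints)
prob.solve()
print("SVM optimal value p* =", prob.value)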
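Similarly, for the Maximum Likelihood for Logistic Regression example, the concavity of ℓ means Newton's method finds the global solution quickly. The sketch below (again an added illustration; the synthetic data and fixed iteration count are arbitrary choices) minimizes the negative log-likelihood by iterating θ ← θ − H^{-1} g, where g and H are the gradient and Hessian of the negative log-likelihood.

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3

# Synthetic data: features X (m x n) and labels y in {0, 1}.
X = rng.standard_normal((m, n))
theta_true = np.array([1.0, -2.0, 0.5])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))       # the logistic function g(z)
y = (rng.uniform(size=m) < sigmoid(X @ theta_true)).astype(float)

theta = np.zeros(n)
for _ in range(10):                                # Newton's method on -l(theta)
    p = sigmoid(X @ theta)                         # predicted probabilities
    grad = X.T @ (p - y)                           # gradient of the negative log-likelihood
    H = X.T @ (X * (p * (1.0 - p))[:, None])       # Hessian, positive semidefinite
    theta -= np.linalg.solve(H, grad)              # Newton step

print("estimated theta:", theta)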

Convex Optimization Overview (cnt’d)
Chuong B. Do
October 26, 2007

1 Recap
During last week’s section, we began our study of convex optimization, the study of
mathematical optimization problems of the form,
minimize_{x ∈ R^n}   f(x)
subject to           g_i(x) ≤ 0,   i = 1, . . . , m,                    (1)
                     h_i(x) = 0,   i = 1, . . . , p,
where x ∈ R^n is the optimization variable, f : R^n → R and g_i : R^n → R are convex functions,
and h_i : R^n → R are affine functions. In a convex optimization problem, the convexity of both
the objective function f and the feasible region (i.e., the set of x’s satisfying all constraints)
allows us to conclude that any feasible locally optimal point must also be globally optimal.
This fact provides the key intuition for why convex optimization problems can in general be
solved efficiently.
In these lecture notes, we continue our foray into the field of convex optimization. In
particular, we will introduce the theory of Lagrange duality for convex optimization problems
with inequality and equality constraints. We will also discuss generic yet efficient algorithms
for solving convex optimization problems, and then briefly mention directions for further
exploration.

2 Duality
To explain the fundamental ideas behind duality theory, we start with a motivating example
based on CS 229 homework grading. We prove a simple weak duality result in this setting,
and then relate it to duality in optimization. We then discuss strong duality and the KKT
optimality conditions.

2.1 A motivating example: CS 229 homework grading


In CS 229, students must complete four homeworks throughout the quarter, each consisting
of five questions. Suppose that during one year that the course is offered, the TAs
