Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Contents

1 Introduction
2 Precursors
  2.1 Dual Ascent
  2.2 Dual Decomposition
  2.3 Augmented Lagrangians and the Method of Multipliers
3 Alternating Direction Method of Multipliers
  3.1 Algorithm
  3.2 Convergence
  3.3 Optimality Conditions and Stopping Criterion
  3.4 Extensions and Variations
  3.5 Notes and References
4 General Patterns
  4.1 Proximity Operator
  4.2 Quadratic Objective Terms
  4.3 Smooth Objective Terms
  4.4 Decomposition
5 Constrained Convex Optimization
  5.1 Convex Feasibility
  5.2 Linear and Quadratic Programming
6 ℓ1-Norm Problems
  6.1 Least Absolute Deviations
  6.2 Basis Pursuit
  6.3 General ℓ1 Regularized Loss Minimization
  6.4 Lasso
  6.5 Sparse Inverse Covariance Selection
7 Consensus and Sharing
  7.1 Global Variable Consensus Optimization
  7.2 General Form Consensus Optimization
  7.3 Sharing
8 Distributed Model Fitting
  8.1 Examples
  8.2 Splitting across Examples
  8.3 Splitting across Features
9 Nonconvex Problems
  9.1 Nonconvex Constraints
  9.2 Bi-convex Problems
10 Implementation
  10.1 Abstract Implementation
  10.2 MPI
  10.3 Graph Computing Frameworks
  10.4 MapReduce
11 Numerical Examples
12 Conclusions
Acknowledgments
A Convergence Proof
References
Abstract
Many problems of recent interest in statistics and machine learning
can be posed in the framework of convex optimization. Due to the
explosion in size and complexity of modern datasets, it is increasingly
important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection
or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this
review, we argue that the alternating direction method of multipliers
is well suited to distributed convex optimization, and in particular to
large-scale problems arising in statistics, machine learning, and related
areas. The method was developed in the 1970s, with roots in the 1950s,
and is equivalent or closely related to many other algorithms, such
as dual decomposition, the method of multipliers, Douglas-Rachford
splitting, Spingarn's method of partial inverses, Dykstra's alternating
projections, Bregman iterative algorithms for ℓ1 problems, proximal
methods, and others. After briefly surveying the theory and history of
the algorithm, we discuss applications to a wide variety of statistical
and machine learning problems of recent interest, including the lasso,
sparse logistic regression, basis pursuit, covariance selection, support
vector machines, and many others. We also discuss general distributed
optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop
MapReduce implementations.
1 Introduction
Many problems in statistics and machine learning can be posed in the framework of convex optimization. Given the significant work on decomposition methods and
decentralized algorithms in the optimization community, it is natural
to look to parallel optimization algorithms as a mechanism for solving
large-scale statistical tasks. This approach also has the benefit that one
algorithm could be flexible enough to solve many problems.
This review discusses the alternating direction method of multipliers (ADMM), a simple but powerful algorithm that is well suited to
distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. It takes the form of a
decomposition-coordination procedure, in which the solutions to small
local subproblems are coordinated to find a solution to a large global
problem. ADMM can be viewed as an attempt to blend the benefits
of dual decomposition and augmented Lagrangian methods for constrained optimization, two earlier approaches that we review in §2. It
turns out to be equivalent or closely related to many other algorithms
as well, such as Douglas-Rachford splitting from numerical analysis,
Spingarn's method of partial inverses, Dykstra's alternating projections method, Bregman iterative algorithms for ℓ1 problems in signal
processing, proximal methods, and many others. The fact that it has
been re-invented in different fields over the decades underscores the
intuitive appeal of the approach.
It is worth emphasizing that the algorithm itself is not new, and that
we do not present any new theoretical results. It was first introduced
in the mid-1970s by Gabay, Mercier, Glowinski, and Marrocco, though
similar ideas emerged as early as the mid-1950s. The algorithm was
studied throughout the 1980s, and by the mid-1990s, almost all of the
theoretical results mentioned here had been established. The fact that
ADMM was developed so far in advance of the ready availability of
large-scale distributed computing systems and massive optimization
problems may account for why it is not as widely known today as we
believe it should be.
The main contributions of this review can be summarized as follows:

(1) We provide a simple, cohesive discussion of the extensive
literature in a way that emphasizes and unifies the aspects
of primary importance in applications.
(2) We show, through a number of examples, that the algorithm
is well suited for a wide variety of large-scale distributed modern problems. We derive methods for decomposing a wide
class of statistical problems by training examples and by features, which is not easily accomplished in general.
(3) We place a greater emphasis on practical large-scale implementation than most previous references. In particular, we
discuss the implementation of the algorithm in cloud computing environments using standard frameworks and provide
easily readable implementations of many of our examples.
Throughout, the focus is on applications rather than theory, and a main
goal is to provide the reader with a kind of toolbox that can be applied
in many situations to derive and implement a distributed algorithm of
practical use. Though the focus here is on parallelism, the algorithm
can also be used serially, and it is interesting to note that with no
tuning, ADMM can be competitive with the best known methods for
some problems.
While we have emphasized applications that can be concisely
explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our
focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period
portfolio optimization, time series analysis, network flow, or scheduling.
Outline
We begin in §2 with a brief review of dual decomposition and the
method of multipliers, two important precursors to ADMM. This section is intended mainly for background and can be skimmed. In §3,
we present ADMM, including a basic convergence theorem, some variations on the basic version that are useful in practice, and a survey of
some of the key literature. A complete convergence proof is given in
appendix A.

In §4, we describe some general patterns that arise in applications
of the algorithm, such as cases when one of the steps in ADMM can
be carried out particularly efficiently.
2 Precursors

2.1 Dual Ascent
Consider the equality-constrained convex optimization problem

    minimize f(x)
    subject to Ax = b,    (2.1)

with variable x ∈ R^n, where A ∈ R^{m×n} and f : R^n → R is convex. The Lagrangian for (2.1) is L(x, y) = f(x) + yᵀ(Ax − b), and the dual function is g(y) = inf_x L(x, y). The dual problem is

    maximize g(y),

with variable y ∈ R^m. Assuming that strong duality holds, the optimal
values of the primal and dual problems are the same. We can recover
a primal optimal point x* from a dual optimal point y* as

    x* = argmin_x L(x, y*),

provided there is only one minimizer of L(x, y*). In the dual ascent method, we iterate

    x^{k+1} := argmin_x L(x, y^k)    (2.2)
    y^{k+1} := y^k + α^k (Ax^{k+1} − b),    (2.3)
where α^k > 0 is a step size, and the superscript is the iteration counter.
The first step (2.2) is an x-minimization step, and the second step (2.3)
is a dual variable update. The dual variable y can be interpreted as
a vector of prices, and the y-update is then called a price update or
price adjustment step. This algorithm is called dual ascent since, with
appropriate choice of α^k, the dual function increases in each step, i.e.,
g(y^{k+1}) > g(y^k).

The dual ascent method can be used even in some cases when g is
not differentiable. In this case, the residual Ax^{k+1} − b is not the gradient of g, but the negative of a subgradient of −g. This case requires a
different choice of α^k than when g is differentiable, and convergence
is not monotone; it is often the case that g(y^{k+1}) ≯ g(y^k). In this case,
the algorithm is usually called the dual subgradient method [152].

If α^k is chosen appropriately and several other assumptions hold,
then x^k converges to an optimal point and y^k converges to an optimal
dual point. However, these assumptions do not hold in many applications, so dual ascent often cannot be used. As an example, if f is a
nonzero affine function of any component of x, then the x-update (2.2)
fails, since L is unbounded below in x for most y.
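To make the iteration concrete, here is a minimal numerical sketch (not from the original text) of dual ascent applied to an equality-constrained problem with f(x) = (1/2)‖x − c‖₂², for which the x-minimization has a closed form. The data and the step size rule are illustrative assumptions.

    import numpy as np

    # Dual ascent for: minimize (1/2)||x - c||_2^2  subject to  Ax = b.
    # L(x, y) = (1/2)||x - c||^2 + y^T (Ax - b), so argmin_x L = c - A^T y.
    np.random.seed(0)
    m, n = 5, 20
    A = np.random.randn(m, n)
    b = A @ np.random.randn(n)
    c = np.random.randn(n)

    y = np.zeros(m)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # small enough: dual Hessian is -A A^T
    for k in range(1000):
        x = c - A.T @ y                        # x-minimization step (2.2)
        y = y + alpha * (A @ x - b)            # dual (price) update (2.3)

    print(np.linalg.norm(A @ x - b))           # primal residual tends to zero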
2.2 Dual Decomposition

The major benefit of the dual ascent method is that it can lead to a
decentralized algorithm in some cases. Suppose, for example, that the
objective f is separable (with respect to a partition or splitting of the
variable into subvectors), meaning that

    f(x) = Σ_{i=1}^N f_i(x_i),

where x = (x₁, ..., x_N) and the x_i ∈ R^{n_i} are subvectors of x. Partitioning the matrix A conformably as A = [A₁ ⋯ A_N], so that Ax = Σ_{i=1}^N A_i x_i, the Lagrangian can be written as

    L(x, y) = Σ_{i=1}^N L_i(x_i, y) = Σ_{i=1}^N ( f_i(x_i) + yᵀA_i x_i − (1/N)yᵀb ),

which is also separable in x. This means that the x-minimization step in dual ascent splits into N separate minimizations that can be carried out in parallel, giving the dual decomposition iteration

    x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k)    (2.4)
    y^{k+1} := y^k + α^k (Ax^{k+1} − b).    (2.5)
2.3 Augmented Lagrangians and the Method of Multipliers

The augmented Lagrangian for (2.1) is

    L_ρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2)‖Ax − b‖₂²,    (2.6)

where ρ > 0 is called the penalty parameter. Applying dual ascent to this problem yields the method of multipliers,

    x^{k+1} := argmin_x L_ρ(x, y^k)    (2.7)
    y^{k+1} := y^k + ρ(Ax^{k+1} − b).    (2.8)

The optimality conditions for (2.1) are primal feasibility, Ax* − b = 0, and dual feasibility,

    ∇f(x*) + Aᵀy* = 0.

Since x^{k+1} minimizes L_ρ(x, y^k), we have

    0 = ∇_x L_ρ(x^{k+1}, y^k)
      = ∇f(x^{k+1}) + Aᵀ( y^k + ρ(Ax^{k+1} − b) )
      = ∇f(x^{k+1}) + Aᵀy^{k+1}.
We see that by using ρ as the step size in the dual update, the iterate
(x^{k+1}, y^{k+1}) is dual feasible. As the method of multipliers proceeds, the
primal residual Ax^{k+1} − b converges to zero, yielding optimality.

The greatly improved convergence properties of the method of multipliers over dual ascent come at a cost. When f is separable, the augmented Lagrangian L_ρ is not separable, so the x-minimization step (2.7)
cannot be carried out separately in parallel for each x_i. This means that
the basic method of multipliers cannot be used for decomposition. We
will see how to address this issue next.

Augmented Lagrangians and the method of multipliers for constrained optimization were first proposed in the late 1960s by Hestenes
[97, 98] and Powell [138]. Many of the early numerical experiments on
the method of multipliers are due to Miele et al. [124, 125, 126]. Much
of the early work is consolidated in a monograph by Bertsekas [15],
who also discusses similarities to older approaches using Lagrangians
and penalty functions [6, 5, 71], as well as a number of generalizations.
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm for problems of the form

    minimize f(x) + g(z)
    subject to Ax + Bz = c,    (3.1)

with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, c ∈ R^p, and f and g are convex. The augmented Lagrangian is

    L_ρ(x, z, y) = f(x) + g(z) + yᵀ(Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖₂²,

and ADMM consists of the iterations

    x^{k+1} := argmin_x L_ρ(x, z^k, y^k)    (3.2)
    z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)    (3.3)
    y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} − c),    (3.4)

where ρ > 0. The algorithm is very similar to dual ascent and the
method of multipliers: it consists of an x-minimization step (3.2), a
z-minimization step (3.3), and a dual variable update (3.4). As in the
method of multipliers, the dual variable update uses a step size equal
to the augmented Lagrangian parameter ρ.

The method of multipliers for (3.1) has the form

    (x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)
    y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} − c).

Here the augmented Lagrangian is minimized jointly over both primal variables, whereas in ADMM x and z are updated in an alternating or sequential fashion, which accounts for the term alternating direction.
3.1.1 Scaled Form

ADMM can be written in a more convenient scaled form by combining the linear and quadratic terms in the augmented Lagrangian. Defining the residual r = Ax + Bz − c and the scaled dual variable u = (1/ρ)y, the iterations become

    x^{k+1} := argmin_x ( f(x) + (ρ/2)‖Ax + Bz^k − c + u^k‖₂² )    (3.5)
    z^{k+1} := argmin_z ( g(z) + (ρ/2)‖Ax^{k+1} + Bz − c + u^k‖₂² )    (3.6)
    u^{k+1} := u^k + Ax^{k+1} + Bz^{k+1} − c.    (3.7)

It follows that the scaled dual variable is the running sum of the residuals:

    u^k = u^0 + Σ_{j=1}^k r^j,

where r^k = Ax^k + Bz^k − c is the primal residual at iteration k.
3.2 Convergence

There are many convergence results for ADMM discussed in the literature; here, we limit ourselves to a basic but still very general result
that applies to all of the examples we will consider. We will make one
assumption about the functions f and g, and one assumption about problem (3.1).

3.2.1 Convergence

3.2.2 Convergence in Practice
3.3 Optimality Conditions and Stopping Criterion

The necessary and sufficient optimality conditions for the ADMM problem (3.1) are primal feasibility,

    Ax* + Bz* − c = 0,    (3.8)

and dual feasibility,

    0 ∈ ∂f(x*) + Aᵀy*    (3.9)
    0 ∈ ∂g(z*) + Bᵀy*.    (3.10)

Here, ∂ denotes the subdifferential operator; see, e.g., [140, 19, 99].
(When f and g are differentiable, the subdifferentials ∂f and ∂g can
be replaced by the gradients ∇f and ∇g, and ∈ can be replaced by =.)
Since z^{k+1} minimizes L_ρ(x^{k+1}, z, y^k) by definition, we have that

    0 ∈ ∂g(z^{k+1}) + Bᵀy^k + ρBᵀ(Ax^{k+1} + Bz^{k+1} − c)
      = ∂g(z^{k+1}) + Bᵀy^k + ρBᵀr^{k+1}
      = ∂g(z^{k+1}) + Bᵀy^{k+1}.

This means that z^{k+1} and y^{k+1} always satisfy (3.10), so attaining optimality comes down to satisfying (3.8) and (3.9). This phenomenon is
analogous to the iterates of the method of multipliers always being dual
feasible; see §2.3.

Since x^{k+1} minimizes L_ρ(x, z^k, y^k) by definition, we have that

    0 ∈ ∂f(x^{k+1}) + Aᵀy^k + ρAᵀ(Ax^{k+1} + Bz^k − c)
      = ∂f(x^{k+1}) + Aᵀ( y^k + ρr^{k+1} + ρB(z^k − z^{k+1}) )
      = ∂f(x^{k+1}) + Aᵀy^{k+1} + ρAᵀB(z^k − z^{k+1}),

or equivalently,

    ρAᵀB(z^{k+1} − z^k) ∈ ∂f(x^{k+1}) + Aᵀy^{k+1}.

This means that the quantity

    s^{k+1} = ρAᵀB(z^{k+1} − z^k)

can be viewed as a residual for the dual feasibility condition (3.9).
We will refer to s^{k+1} as the dual residual at iteration k + 1, and to
r^{k+1} = Ax^{k+1} + Bz^{k+1} − c as the primal residual at iteration k + 1.
In summary, the optimality conditions for the ADMM problem consist of three conditions, (3.8)–(3.10). The last condition (3.10) always
holds for (x^{k+1}, z^{k+1}, y^{k+1}); the residuals for the other two, (3.8) and
(3.9), are the primal and dual residuals r^{k+1} and s^{k+1}, respectively.
These two residuals converge to zero as ADMM proceeds. (In fact, the
convergence proof in appendix A shows B(z^{k+1} − z^k) converges to zero,
which implies s^k converges to zero.)
3.3.1 Stopping Criteria

It can be shown that the objective suboptimality satisfies

    f(x^k) + g(z^k) − p* ≤ −(y^k)ᵀr^k + (x^k − x*)ᵀs^k.    (3.11)

This shows that when the residuals r^k and s^k are small, the objective
suboptimality also must be small. We cannot use this inequality directly
in a stopping criterion, however, since we do not know x*. But if we
guess or estimate that ‖x^k − x*‖₂ ≤ d, we have that

    f(x^k) + g(z^k) − p* ≤ −(y^k)ᵀr^k + d‖s^k‖₂ ≤ ‖y^k‖₂‖r^k‖₂ + d‖s^k‖₂.

The middle or righthand terms can be used as an approximate bound
on the objective suboptimality (which depends on our guess of d).

This suggests that a reasonable termination criterion is that the
primal and dual residuals must be small, i.e.,

    ‖r^k‖₂ ≤ ε^pri  and  ‖s^k‖₂ ≤ ε^dual,    (3.12)

where ε^pri > 0 and ε^dual > 0 are feasibility tolerances for the primal and
dual feasibility conditions (3.8) and (3.9), respectively. These tolerances
can be chosen using an absolute and relative criterion, such as

    ε^pri = √p ε^abs + ε^rel max{ ‖Ax^k‖₂, ‖Bz^k‖₂, ‖c‖₂ },
    ε^dual = √n ε^abs + ε^rel ‖Aᵀy^k‖₂,

where ε^abs > 0 is an absolute tolerance and ε^rel > 0 is a relative tolerance. (The factors √p and √n account for the fact that the ℓ2 norms are
in R^p and R^n, respectively.) A reasonable value for the relative stopping
criterion might be ε^rel = 10⁻³ or 10⁻⁴, depending on the application.
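In code, the termination test (3.12) might look as follows; this sketch again assumes the special case A = I, B = −I, c = 0 (so p = n and Aᵀy^k = ρu^k), and the function name and default tolerances are our own choices.

    import numpy as np

    def converged(x, z, z_old, u, rho, eps_abs=1e-4, eps_rel=1e-3):
        n = x.size
        r = x - z                 # primal residual r^k = x^k - z^k
        s = rho * (z_old - z)     # dual residual s^k = rho*(z^{k-1} - z^k)
        eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x),
                                                       np.linalg.norm(z))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * u)
        return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual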
3.4 Extensions and Variations

3.4.1 Varying Penalty Parameter

A standard extension is to use possibly different penalty parameters ρ^k in each iteration, with the goal of improving convergence in practice. A simple scheme that often works well is

    ρ^{k+1} := { τ^incr ρ^k    if ‖r^k‖₂ > μ‖s^k‖₂
               { ρ^k / τ^decr  if ‖s^k‖₂ > μ‖r^k‖₂    (3.13)
               { ρ^k           otherwise,

where μ > 1, τ^incr > 1, and τ^decr > 1 are parameters. Typical choices
might be μ = 10 and τ^incr = τ^decr = 2. The idea behind this penalty
parameter update is to try to keep the primal and dual residual norms
within a factor of μ of one another as they both converge to zero.

The ADMM update equations suggest that large values of ρ place a
large penalty on violations of primal feasibility and so tend to produce
small primal residuals, while small values of ρ tend to reduce the dual residual.
3.4.3 Over-relaxation

3.4.4 Inexact Minimization
3.4.5 Update Ordering

Several variations on ADMM involve performing the x-, z-, and y-updates in varying orders or multiple times. For example, we can divide
the variables into k blocks, and update each of them in turn, possibly
multiple times, before performing each dual variable update; see, e.g.,
[146]. Carrying out multiple x- and z-updates before the y-update can
be interpreted as executing multiple Gauss-Seidel passes instead of just
one; if many sweeps are carried out before each dual update, the resulting algorithm is very close to the standard method of multipliers [17,
§3.4.4]. Another variation is to perform an additional y-update between
the x- and z-updates, with half the step length [17].
3.4.6 Related Algorithms
There are also a number of other algorithms distinct from but inspired
by ADMM. For instance, Fukushima [80] applies ADMM to a dual
problem formulation, yielding a dual ADMM algorithm, which is
shown in [65] to be equivalent to the primal Douglas-Rachford method
discussed in [57, §3.5.6]. As another example, Zhu et al. [183] discuss
variations of distributed ADMM (discussed in §7, §8, and §10) that
can cope with various complicating factors, such as noise in the messages exchanged for the updates, or asynchronous updates, which can
be useful in a regime when some processors or subsystems randomly
fail. There are also algorithms resembling a combination of ADMM
and the proximal method of multipliers [141], rather than the standard
method of multipliers; see, e.g., [33, 60]. Other representative publications include [62, 143, 59, 147, 158, 159, 42].
3.5 Notes and References
4 General Patterns

4.1 Proximity Operator
Consider the proximal x-update

    x⁺ := argmin_x ( f(x) + (ρ/2)‖x − v‖₂² ),

whose righthand side is known as the proximity operator of f with penalty ρ in convex analysis; the associated function

    f̃(v) = inf_x ( f(x) + (ρ/2)‖x − v‖₂² )

is the Moreau envelope of f. When f is the indicator function of a closed nonempty convex set C, the x-update reduces to projection,

    x⁺ := Π_C(v),

where Π_C denotes projection (in the Euclidean norm) onto C. This holds
independently of the choice of ρ. As an example, if f is the indicator
function of the nonnegative orthant R^n_+, we have x⁺ = (v)₊, the vector
obtained by taking the nonnegative part of each component of v.
4.2 Quadratic Objective Terms

Suppose f is a (convex) quadratic function,

    f(x) = (1/2)xᵀPx + qᵀx + r,

with P ∈ S^n_+, the set of symmetric positive semidefinite n × n matrices. In this case the x-update is the solution of a linear system:

    x⁺ := (P + ρAᵀA)⁻¹(ρAᵀv − q).    (4.1)
4.2.1 Direct Methods
We assume here that a direct method is used to carry out the x-update;
the case when an iterative method is used is discussed in §4.3. Direct
methods for solving a linear system Fx = g are based on first factoring
F = F₁F₂⋯F_k into a product of simpler matrices, and then computing
x = F⁻¹g by solving a sequence of problems of the form F_i z_i = z_{i−1},
where z₁ = F₁⁻¹g and x = z_k. The solve step is sometimes also called
a back-solve. The computational cost of factorization and back-solve
operations depends on the sparsity pattern and other properties of F.
The cost of solving Fx = g is the sum of the cost of factoring F and
the cost of the back-solve.

In our case, the coefficient matrix is F = P + ρAᵀA and the righthand side is g = ρAᵀv − q, where P ∈ S^n_+ and A ∈ R^{p×n}. Suppose we
exploit no structure in A or P, i.e., we use generic methods that work
for any matrix. We can form F = P + ρAᵀA at a cost of O(pn²) flops
(floating point operations). We then carry out a Cholesky factorization
of F at a cost of O(n³) flops; the back-solve cost is O(n²). (The cost of
forming g is negligible compared to the costs listed above.) When p is
on the order of, or more than n, the overall cost is O(pn²). (When p is
less than n in order, the matrix inversion lemma described below can
be used to carry out the update in O(p²n) flops.)
4.2.2 Exploiting Sparsity

When A and P are such that F is sparse (i.e., has enough zero entries
to be worth exploiting), much more efficient factorization and back-solve routines can be employed. As an extreme case, if P and A are
diagonal n × n matrices, then both the factor and solve costs are
O(n). If P and A are banded, then so is F. If F is banded with
bandwidth k, the factorization cost is O(nk²) and the back-solve cost
is O(nk). In this case, the x-update can be carried out at a cost
O(nk²), plus the cost of forming F. The same approach works when
P + ρAᵀA has a more general sparsity pattern; in this case, a permuted
Cholesky factorization can be used, with permutations chosen to reduce
fill-in.
4.2.3 Caching Factorizations

Now suppose we need to solve multiple linear systems, say, Fx^{(i)} = g^{(i)},
i = 1, ..., N, with the same coefficient matrix but different righthand
sides. This occurs in ADMM when the parameter ρ is not changed. In
this case, the factorization of the coefficient matrix F can be computed
once and then back-solves can be carried out for each righthand side.
If t is the factorization cost and s is the back-solve cost, then the total
cost becomes t + Ns instead of N(t + s), which would be the cost if
we were to factor F each iteration. As long as ρ does not change, we
can factor P + ρAᵀA once, and then use this cached factorization in
subsequent solve steps. For example, if we do not exploit any structure
and use the standard Cholesky factorization, the x-update steps can
be carried out a factor n more efficiently than a naive implementation,
in which we solve the equations from scratch in each iteration.

When structure is exploited, the ratio between t and s is typically
less than n but often still significant, so here too there are performance
gains. However, in this case, there is less benefit to keeping ρ fixed, so
we can weigh the benefit of varying ρ against the benefit of not recomputing the factorization of P + ρAᵀA. In general, an implementation
should cache the factorization of P + ρAᵀA and then only recompute
it if and when ρ changes.
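The caching strategy is simple to express in code. The sketch below is our own illustration (class and method names are hypothetical, and scipy is assumed available); it factors F = P + ρAᵀA once and reuses the factor for every subsequent x-update.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    class CachedXUpdate:
        def __init__(self, P, A, rho):
            self.A, self.rho = A, rho
            # Factor F = P + rho*A^T A once: cost t (O(n^3) for dense F).
            self.chol = cho_factor(P + rho * A.T @ A)

        def solve(self, v, q):
            # Back-solve only, righthand side g = rho*A^T v - q: cost s.
            return cho_solve(self.chol, self.rho * self.A.T @ v - q)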
4.2.4 Matrix Inversion Lemma

We can also exploit structure using the matrix inversion lemma, which
states that the identity

    (P + ρAᵀA)⁻¹ = P⁻¹ − ρP⁻¹Aᵀ(I + ρAP⁻¹Aᵀ)⁻¹AP⁻¹

holds when all the inverses exist. This means that if linear systems
with coefficient matrix P can be solved efficiently, and p is small, or
at least no larger than n in order, then the x-update can be computed
efficiently as well. The same trick as above can also be used to obtain
an efficient method for computing multiple updates: The factorization
of I + ρAP⁻¹Aᵀ can be cached and cheaper back-solves can be used
in computing the updates.
4.2.5 Quadratic Function Restricted to an Affine Set

The same comments hold for the slightly more complex case of a convex
quadratic function restricted to an affine set:

    f(x) = (1/2)xᵀPx + qᵀx + r,  dom f = {x | Fx = g}.

Here the x-update involves solving the KKT system

    [ P + ρI   Fᵀ ] [ x^{k+1} ]   [ q − ρ(z^k − u^k) ]
    [ F        0  ] [ ν       ] + [ −g               ] = 0,

where ν is a dual variable for the constraint Fx = g.
4.3 Smooth Objective Terms

4.3.1 Iterative Solvers

4.3.2 Early Termination

4.3.3 Warm Start

In ADMM, a natural warm start for computing x^{k+1} is the value x^k from the previous ADMM iteration; far fewer iterations are then typically needed (of the
iterative method used to compute the update xk+1 ) than if the iterative
method were started at zero or some other default initialization. This
is especially the case when ADMM has almost converged, in which case
the updates will not change significantly from their previous values.
4.3.4 Quadratic Objective Terms
4.4 Decomposition

4.4.1 Block Separability
4.4.2 Component Separability
4.4.3 Soft Thresholding

For an example that will come up often in the sequel, consider f(x) = λ‖x‖₁ (with λ > 0) and A = I. In this case the (scalar) x_i-update is

    x_i⁺ := argmin_{x_i} ( λ|x_i| + (ρ/2)(x_i − v_i)² ).

Even though the first term is not differentiable, we can easily compute
a simple closed-form solution to this problem by using subdifferential
calculus; see [140, 23] for background. Explicitly, the solution is

    x_i⁺ := S_{λ/ρ}(v_i),

where the soft thresholding operator S_κ is defined as

    S_κ(a) = { a − κ,  a > κ
             { 0,      |a| ≤ κ
             { a + κ,  a < −κ,

or equivalently,

    S_κ(a) = (a − κ)₊ − (−a − κ)₊.

Yet another formula, which shows that the soft thresholding operator
is a shrinkage operator (i.e., moves a point toward zero), is

    S_κ(a) = (1 − κ/|a|)₊ a    (4.2)

(for a ≠ 0). We refer to updates that reduce to this form as elementwise soft thresholding. In the language of §4.1, soft thresholding is the
proximity operator of the ℓ1 norm.
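In code, elementwise soft thresholding is a one-line vectorized operation; the sketch below uses the form S_κ(a) = (a − κ)₊ − (−a − κ)₊.

    import numpy as np

    def soft_threshold(v, kappa):
        # Elementwise soft thresholding operator S_kappa.
        return np.maximum(v - kappa, 0.0) - np.maximum(-v - kappa, 0.0)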
5 Constrained Convex Optimization

The generic constrained convex optimization problem is

    minimize f(x)
    subject to x ∈ C,    (5.1)

with variable x ∈ R^n, where f and C are convex. In ADMM form, this can be written as minimizing f(x) + g(z) subject to x − z = 0, where g is the indicator function of C. The scaled form of ADMM is then

    x^{k+1} := argmin_x ( f(x) + (ρ/2)‖x − z^k + u^k‖₂² )
    z^{k+1} := Π_C(x^{k+1} + u^k)
    u^{k+1} := u^k + x^{k+1} − z^{k+1},

and the primal residual is

    r^k = x^k − z^k.
5.1 Convex Feasibility

5.1.1 Alternating Projections

A classical problem is to find a point in the intersection of two closed nonempty convex sets C and D. In ADMM form, we can take f to be the indicator function of C and g the indicator function of D, giving the iterations

    x^{k+1} := Π_C(z^k − u^k)
    z^{k+1} := Π_D(x^{k+1} + u^k)    (5.2)
    u^{k+1} := u^k + x^{k+1} − z^{k+1}.

This algorithm is a special case of Dykstra's alternating projections
method [56, 9], which is far more efficient than the classical method
that does not use the dual variable u.

Here, the norm of the primal residual ‖x^k − z^k‖₂ has a nice interpretation. Since x^k ∈ C and z^k ∈ D, ‖x^k − z^k‖₂ is an upper bound on
dist(C, D), the Euclidean distance between C and D. If we terminate
with ‖r^k‖₂ ≤ ε^pri, then we have found a pair of points, one in C and
one in D, that are no more than ε^pri apart. Alternatively, the point
(1/2)(x^k + z^k) is no more than a distance ε^pri/2 from both C and D.
5.1.2 Parallel Projections

This method can be extended to finding a point in the intersection of N sets A₁, ..., A_N ⊆ R^n, with the projections carried out in parallel, by taking C = A₁ × ⋯ × A_N and

    D = {(x₁, ..., x_N) ∈ R^{nN} | x₁ = x₂ = ⋯ = x_N}.

The resulting ADMM algorithm is

    x_i^{k+1} := Π_{A_i}(z^k − u_i^k)
    z^{k+1} := (1/N) Σ_{i=1}^N (x_i^{k+1} + u_i^k)
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

Here the first and third steps are carried out in parallel, for i = 1, ..., N.
(The description above involves a small abuse of notation in dropping
the index i from z_i, since they are all the same.) This can be viewed as a
special case of constrained optimization, as described in §4.4, where the
indicator function of A₁ ∩ ⋯ ∩ A_N splits into the sum of the indicator
functions of each A_i.
5.2 Linear and Quadratic Programming

The standard form quadratic program (QP) is

    minimize (1/2)xᵀPx + qᵀx
    subject to Ax = b, x ≥ 0,    (5.3)

with variable x ∈ R^n, where P ∈ S^n_+; when P = 0, this reduces to the standard form linear program (LP). In ADMM form, we take

    f(x) = (1/2)xᵀPx + qᵀx,  dom f = {x | Ax = b},

and g the indicator function of the nonnegative orthant R^n_+. The x-update is an equality-constrained least squares problem of the kind discussed in §4.2.5, and the z-update is projection onto the nonnegative orthant:

    z^{k+1} := (x^{k+1} + u^k)₊.

All of the comments on efficient computation from §4.2, such as storing
factorizations so that subsequent iterations can be carried out cheaply,
also apply here. For example, if P is diagonal, possibly zero, the first
x-update can be carried out at a cost of O(np²) flops, where p is the
number of equality constraints in the original quadratic program. Subsequent updates only cost O(np) flops.

5.2.1 Linear and Quadratic Cone Programming
6 ℓ1-Norm Problems

The problems addressed in this section will help illustrate why ADMM
is a natural fit for machine learning and statistical problems in particular. The reason is that, unlike dual ascent or the method of multipliers,
ADMM explicitly targets problems that split into two distinct parts, f
and g, that can then be handled separately. Problems of this form are
pervasive in machine learning, because a significant number of learning
problems involve minimizing a loss function together with a regularization term or side constraints. In other cases, these side constraints are
introduced through problem transformations like putting the problem
in consensus form, as will be discussed in §7.1.

This section contains a variety of simple but important problems
involving ℓ1 norms. There is widespread current interest in many of these
problems across statistics, machine learning, and signal processing, and
applying ADMM yields interesting algorithms that are state-of-the-art,
or closely related to state-of-the-art methods. We will see that ADMM
naturally lets us decouple the nonsmooth ℓ1 term from the smooth loss
term, which is computationally advantageous. In this section, we focus on
the non-distributed versions of these problems for simplicity; the problem
of distributed model fitting will be treated in the sequel.
6.1 Least Absolute Deviations

6.1.1 Huber Fitting

A problem that lies in between least squares and least absolute deviations is Huber function fitting,

    minimize g^{hub}(Ax − b),

where the Huber penalty function g^{hub} is quadratic for small arguments
and transitions to an absolute value for larger values. For scalar a, it
is given by

    g^{hub}(a) = { a²/2,       |a| ≤ 1
                 { |a| − 1/2,  |a| > 1,

and extends to vector arguments as the sum of the scalar Huber functions of the components. Here ADMM differs from the least absolute deviations case only in the z-update, which becomes

    z^{k+1} := (ρ/(1+ρ)) (Ax^{k+1} − b + u^k) + (1/(1+ρ)) S_{1+1/ρ}(Ax^{k+1} − b + u^k).
6.2 Basis Pursuit

6.3 General ℓ1 Regularized Loss Minimization

Consider the generic ℓ1 regularized problem

    minimize l(x) + λ‖x‖₁,    (6.1)

where l is any convex loss function. In ADMM form, with f = l and g(z) = λ‖z‖₁, the z-update is elementwise soft thresholding,

    z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k).
6.4 Lasso

An important special case of (6.1) is the lasso,

    minimize (1/2)‖Ax − b‖₂² + λ‖x‖₁,    (6.2)

where λ > 0 is a scalar regularization parameter that is usually chosen by cross-validation. In typical applications, there are many more
features than training examples, and the goal is to find a parsimonious model for the data. For general background on the lasso, see [95,
§3.4.2]. The lasso has been widely applied, particularly in the analysis of biological data, where only a small fraction of a huge number of
possible factors are actually predictive of some outcome of interest; see
[95, §18.4] for a representative case study.

In ADMM form, the lasso problem can be written as

    minimize f(x) + g(z)
    subject to x − z = 0,

where f(x) = (1/2)‖Ax − b‖₂² and g(z) = λ‖z‖₁. By §4.2 and §4.4,
ADMM becomes

    x^{k+1} := (AᵀA + ρI)⁻¹(Aᵀb + ρ(z^k − u^k))
    z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k)
    u^{k+1} := u^k + x^{k+1} − z^{k+1}.

Note that AᵀA + ρI is always invertible, since ρ > 0. The x-update
is essentially a ridge regression (i.e., quadratically regularized least
squares) computation, so ADMM can be interpreted as a method for
solving the lasso problem by iteratively carrying out ridge regression.
When using a direct method, we can cache an initial factorization to
make subsequent iterations much cheaper. See [1] for an example of an
application in image processing.
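Putting together the cached factorization of §4.2.3 and the soft thresholding of §4.4.3, a complete serial lasso solver is only a few lines. The following sketch reflects our own choices of interface (fixed iteration count, scipy for the Cholesky factorization), not a reference implementation.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lasso_admm(A, b, lam, rho=1.0, iters=200):
        # ADMM for: minimize (1/2)||Ax - b||_2^2 + lam*||x||_1.
        m, n = A.shape
        Atb = A.T @ b
        chol = cho_factor(A.T @ A + rho * np.eye(n))   # factored once, cached
        x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
        for k in range(iters):
            x = cho_solve(chol, Atb + rho * (z - u))   # ridge-regression step
            v = x + u
            z = np.maximum(v - lam/rho, 0) - np.maximum(-v - lam/rho, 0)
            u = u + x - z
        return z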
6.4.1 Generalized Lasso

The lasso problem can be generalized to

    minimize (1/2)‖Ax − b‖₂² + λ‖Fx‖₁,    (6.3)

where F is an arbitrary linear transformation. An important special case is when F ∈ R^{(n−1)×n} is the difference matrix,

    F_{ij} = { 1,   j = i + 1
             { −1,  j = i
             { 0,   otherwise,

and A = I, in which case the generalization reduces to

    minimize (1/2)‖x − b‖₂² + λ Σ_{i=1}^{n−1} |x_{i+1} − x_i|.    (6.4)

The second term is the total variation of x. This problem is often called
total variation denoising [145], and has applications in signal processing. When A = I and F is a second difference matrix, the problem (6.3)
is called ℓ1 trend filtering [101].

In ADMM form, the problem (6.3) can be written as

    minimize (1/2)‖Ax − b‖₂² + λ‖z‖₁
    subject to Fx − z = 0,

which yields the ADMM algorithm

    x^{k+1} := (AᵀA + ρFᵀF)⁻¹(Aᵀb + ρFᵀ(z^k − u^k))
    z^{k+1} := S_{λ/ρ}(Fx^{k+1} + u^k)
    u^{k+1} := u^k + Fx^{k+1} − z^{k+1}.

For the special case of total variation denoising (6.4), AᵀA + ρFᵀF
is tridiagonal, so the x-update can be carried out in O(n) flops [90, §4.3].
For ℓ1 trend filtering, the matrix is pentadiagonal, so the x-update is
still O(n) flops.
6.4.2 Group Lasso

As an extension of the lasso, consider the group lasso problem

    minimize (1/2)‖Ax − b‖₂² + λ Σ_{i=1}^N ‖x_i‖₂,

where x = (x₁, ..., x_N), with subvectors x_i ∈ R^{n_i}. ADMM for this problem is the same as above with the z-update
replaced with block soft thresholding,

    z_i^{k+1} := S_{λ/ρ}(x_i^{k+1} + u_i^k),  i = 1, ..., N,

where the vector soft thresholding operator S_κ is S_κ(a) = (1 − κ/‖a‖₂)₊ a, with S_κ(0) = 0. The approach carries over to the case of overlapping groups, with regularizer λ Σ_{i=1}^N ‖x_{G_i}‖₂, where x_{G_i} is the subvector of x indexed by the ith group; this case is difficult for many other methods but can still be handled by ADMM.

6.5 Sparse Inverse Covariance Selection

Given a dataset consisting of samples a_i ∈ R^n, i = 1, ..., N, drawn from a zero-mean Gaussian distribution with covariance Σ,
consider the task of estimating the covariance matrix Σ under the prior
assumption that Σ⁻¹ is sparse. Since (Σ⁻¹)_{ij} is zero if and only if
the ith and jth components of the random variable are conditionally
independent, given the other variables, this problem is equivalent to the
structure learning problem of estimating the topology of the undirected
graphical model representation of the Gaussian [104]. Determining the
sparsity pattern of the inverse covariance matrix Σ⁻¹ is also called the
covariance selection problem.

For n very small, it is feasible to search over all sparsity patterns
in Σ⁻¹ since for a fixed sparsity pattern, determining the maximum
likelihood estimate of Σ is a tractable (convex optimization) problem.
A good heuristic that scales to much larger values of n is to minimize
the negative log-likelihood (with respect to the parameter X = Σ⁻¹)
with an ℓ1 regularization term to promote sparsity of the estimated
inverse covariance matrix [7]. If S is the empirical covariance matrix
(1/N) Σ_{i=1}^N a_i a_iᵀ, then the estimation problem can be written as

    minimize Tr(SX) − log det X + λ‖X‖₁,

with variable X ∈ S^n_+, where ‖·‖₁ is defined elementwise, i.e., as the
sum of the absolute values of the entries, and the domain of log det is
S^n_{++}, the set of symmetric positive definite n × n matrices. This is a
special case of the general ℓ1 regularized problem (6.1) with (convex)
loss function l(X) = Tr(SX) − log det X.

The idea of covariance selection is originally due to Dempster [48]
and was first studied in the sparse, high-dimensional regime by Meinshausen and Bühlmann [123]. The form of the problem above is due to
Banerjee et al. [7]. Some other recent papers on this problem include
Friedman et al.'s graphical lasso [79], Duchi et al. [55], Lu [115], Yuan
[178], and Scheinberg et al. [148], the last of which shows that ADMM
outperforms state-of-the-art methods for this problem.
The ADMM algorithm for sparse inverse covariance selection is

    X^{k+1} := argmin_X ( Tr(SX) − log det X + (ρ/2)‖X − Z^k + U^k‖_F² )
    Z^{k+1} := argmin_Z ( λ‖Z‖₁ + (ρ/2)‖X^{k+1} − Z + U^k‖_F² )
    U^{k+1} := U^k + X^{k+1} − Z^{k+1},

where ‖·‖_F is the Frobenius norm, i.e., the square root of the sum of
the squares of the entries.

This algorithm can be simplified much further. The Z-minimization
step is elementwise soft thresholding,

    Z_{ij}^{k+1} := S_{λ/ρ}(X_{ij}^{k+1} + U_{ij}^k),

and the X-minimization also has an analytical solution. Its first-order optimality condition is that the gradient vanish:

    ρX − X⁻¹ = ρ(Z^k − U^k) − S.    (6.5)
We will construct a matrix X that satisfies this condition and thus minimizes the X-minimization objective. First, take the orthogonal eigenvalue decomposition of the righthand side,

    ρ(Z^k − U^k) − S = QΛQᵀ,

where Λ = diag(λ₁, ..., λ_n), and QᵀQ = QQᵀ = I. Multiplying (6.5)
by Qᵀ on the left and by Q on the right gives

    ρX̃ − X̃⁻¹ = Λ,

where X̃ = QᵀXQ. We can now construct a diagonal solution of this
equation, i.e., find positive numbers X̃_{ii} that satisfy ρX̃_{ii} − 1/X̃_{ii} = λ_i.
By the quadratic formula,

    X̃_{ii} = ( λ_i + √(λ_i² + 4ρ) ) / (2ρ),

which are always positive since ρ > 0. It follows that X = QX̃Qᵀ satisfies the optimality condition (6.5), so this is the solution to the X-minimization. The computational effort of the X-update is that of an
eigenvalue decomposition of a symmetric matrix.
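The X-update thus costs one symmetric eigendecomposition; a short sketch in Python (our own illustration, with a hypothetical function name):

    import numpy as np

    def covsel_x_update(S, Z, U, rho):
        # Solves rho*X - X^{-1} = rho*(Z - U) - S, i.e., condition (6.5).
        lam, Q = np.linalg.eigh(rho * (Z - U) - S)
        x_tilde = (lam + np.sqrt(lam**2 + 4 * rho)) / (2 * rho)  # positive roots
        return (Q * x_tilde) @ Q.T    # X = Q diag(x_tilde) Q^T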
7 Consensus and Sharing

7.1 Global Variable Consensus Optimization

We first consider the case with a single global variable, with the objective and constraint terms split into N parts:

    minimize f(x) = Σ_{i=1}^N f_i(x),

with variable x ∈ R^n, where the f_i : R^n → R ∪ {+∞} are convex. The goal is to
solve the problem above in such a way that each term can be handled
by its own processing element, such as a thread or processor.
This problem arises in many contexts. In model fitting, for example, x represents the parameters in a model and f_i represents the loss
function associated with the ith block of data or measurements. In this
case, we would say that x is found by collaborative filtering, since the
data sources are collaborating to develop a global model.
This problem can be rewritten with local variables x_i ∈ R^n and a
common global variable z:

    minimize Σ_{i=1}^N f_i(x_i)
    subject to x_i − z = 0,  i = 1, ..., N.    (7.1)

This is called the global consensus problem, since the constraint is that
all the local variables should agree, i.e., be equal. Consensus can be
viewed as a simple technique for turning additive objectives Σ_{i=1}^N f_i(x),
which show up frequently but do not split due to the variable being
shared across terms, into separable objectives Σ_{i=1}^N f_i(x_i), which split
easily. For a useful recent discussion of consensus algorithms, see [131]
and the references therein.
ADMM for the problem (7.1) can be derived either directly from
the augmented Lagrangian

    L_ρ(x₁, ..., x_N, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_iᵀ(x_i − z) + (ρ/2)‖x_i − z‖₂² ),

or simply as a special case of the general form (3.1). The resulting algorithm is

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kᵀ}(x_i − z^k) + (ρ/2)‖x_i − z^k‖₂² )
    z^{k+1} := (1/N) Σ_{i=1}^N ( x_i^{k+1} + (1/ρ)y_i^k )
    y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z^{k+1}).

Writing x̄^k = (1/N) Σ_{i=1}^N x_i^k for the average of the local variables (and similarly ȳ^k), the z-update is z^{k+1} = x̄^{k+1} + (1/ρ)ȳ^k; averaging the y-updates shows that ȳ^{k+1} = 0 after the first iteration, so z^{k+1} = x̄^{k+1}. Using this, ADMM can be written in the simplified form

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kᵀ}x_i + (ρ/2)‖x_i − x̄^k‖₂² )
    y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − x̄^{k+1}).

Here, the squared norm of the primal residual is

    ‖r^k‖₂² = Σ_{i=1}^N ‖x_i^k − x̄^k‖₂²,

so the primal residual norm is a natural measure of the lack of consensus among the local variables.
7.1.1 Global Variable Consensus with Regularization

A useful variation adds a regularization term g that is handled by the central collector:

    minimize Σ_{i=1}^N f_i(x_i) + g(z)
    subject to x_i − z = 0,  i = 1, ..., N.    (7.2)
The resulting ADMM algorithm is

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kᵀ}(x_i − z^k) + (ρ/2)‖x_i − z^k‖₂² )    (7.3)
    z^{k+1} := argmin_z ( g(z) + Σ_{i=1}^N ( −y_i^{kᵀ}z + (ρ/2)‖x_i^{k+1} − z‖₂² ) )    (7.4)
    y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z^{k+1}).    (7.5)

By collecting the linear and quadratic terms, we can express the z-update as an averaging step, as in consensus ADMM, followed by a
proximal step involving g:

    z^{k+1} := argmin_z ( g(z) + (Nρ/2)‖z − x̄^{k+1} − (1/ρ)ȳ^k‖₂² ).

In the scaled dual form, the algorithm is

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − z^k + u_i^k‖₂² )    (7.6)
    z^{k+1} := argmin_z ( g(z) + (Nρ/2)‖z − x̄^{k+1} − ū^k‖₂² )    (7.7)
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.    (7.8)

In many cases, this version is simpler and easier to work with than the
unscaled form.
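The scaled iteration (7.6)–(7.8) can be prototyped serially in a few lines; in the sketch below (our own illustration), the local proximal operators are passed as a list, the loop over i stands in for parallel execution, and prox_g reduces to the identity when g = 0.

    import numpy as np

    def consensus_admm(prox_fs, prox_g, n, rho=1.0, iters=100):
        N = len(prox_fs)
        x = np.zeros((N, n))
        u = np.zeros((N, n))
        z = np.zeros(n)
        for k in range(iters):
            for i in range(N):                    # parallelizable across i
                x[i] = prox_fs[i](z - u[i], rho)  # x-update (7.6)
            z = prox_g(x.mean(axis=0) + u.mean(axis=0), N * rho)  # z-update (7.7)
            u = u + x - z                         # u-update (7.8), broadcast over i
        return z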
7.2 General Form Consensus Optimization

We now consider a more general form of the consensus problem, in which each local variable x_i ∈ R^{n_i} consists of a selection of the components of a global variable z ∈ R^n: component (x_i)_j, for j = 1, ..., n_i, corresponds to the global variable component z_{G(i,j)}, where the mapping G(i, j) gives the global index of local variable component (x_i)_j. Writing z̃_i ∈ R^{n_i} for the corresponding selection of global components, (z̃_i)_j = z_{G(i,j)}, the problem is

    minimize Σ_{i=1}^N f_i(x_i)
    subject to x_i − z̃_i = 0,  i = 1, ..., N.    (7.9)

Fig. 7.1. General form consensus optimization. Local objective terms are on the left; global
variable components are on the right. Each edge in the bipartite graph is a consistency
constraint, linking a local variable and a global variable component.

The augmented Lagrangian is

    L_ρ(x, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_iᵀ(x_i − z̃_i) + (ρ/2)‖x_i − z̃_i‖₂² ),

with dual variables y_i ∈ R^{n_i}.
The resulting ADMM algorithm is

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kᵀ}x_i + (ρ/2)‖x_i − z̃_i^k‖₂² )
    z^{k+1} := argmin_z Σ_{i=1}^N ( −y_i^{kᵀ}z̃_i + (ρ/2)‖x_i^{k+1} − z̃_i‖₂² )
    y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z̃_i^{k+1}),

where the x_i- and y_i-updates can be carried out independently in parallel for each i.

The z-update step decouples across the components of z, since L_ρ
is fully separable in its components:

    z_g^{k+1} := ( Σ_{G(i,j)=g} ( (x_i^{k+1})_j + (1/ρ)(y_i^k)_j ) ) / ( Σ_{G(i,j)=g} 1 ),

so z_g is found by averaging all entries of x_i^{k+1} + (1/ρ)y_i^k that correspond
to the global index g. Applying the same type of argument as in the
global variable consensus case, we can show that after the first iteration,

    Σ_{G(i,j)=g} (y_i^k)_j = 0,

i.e., the sum of the dual variable entries that correspond to any given
global index g is zero. The z-update step can thus be written in the
simpler form

    z_g^{k+1} := (1/k_g) Σ_{G(i,j)=g} (x_i^{k+1})_j,    (7.10)

where k_g is the number of local variable entries that correspond to the global variable entry z_g.
7.3 Sharing

Another canonical problem that will prove useful in the sequel is the
sharing problem,

    minimize Σ_{i=1}^N f_i(x_i) + g( Σ_{i=1}^N x_i ),    (7.11)

with variables x_i ∈ R^n, i = 1, ..., N, where f_i is a local cost function for
subsystem i, and g is the shared objective, which takes as argument the
sum of the variables. We can think of the variable x_i as being the choice
of agent i; the sharing problem involves each agent adjusting its variable
to minimize its individual cost f_i(x_i), as well as the shared objective
term g( Σ_{i=1}^N x_i ). The sharing problem is important both because many
useful problems can be put into this form and because it enjoys a dual
relationship with the consensus problem, as discussed below.

Sharing can be written in ADMM form by copying all the variables:

    minimize Σ_{i=1}^N f_i(x_i) + g( Σ_{i=1}^N z_i )
    subject to x_i − z_i = 0,  i = 1, ..., N,    (7.12)

with variables x_i, z_i ∈ R^n, i = 1, ..., N.
with variables xi , zi Rn , i = 1, . . . , N . The scaled form of ADMM is
k
k 2
xk+1
f
:=
argmin
(x
)
+
(/2)x
z
+
u
i i
i
i
i 2
i
xi
N
k xk+1 2
z k+1 := argmin g( N
z
)
+
(/2)
z
u
i
i
2
i
i=1
i=1
i
z
uk+1
i
:=
uki
+ xk+1
zik+1 .
i
The rst and last steps can be carried out independently in parallel for
each i = 1, . . . , N . As written, the z-update requires solving a problem
7.3 Sharing
57
Minimizing over z₁, ..., z_N with the average z̄ fixed yields z_i = a_i + z̄ − ā, where a_i = u_i^k + x_i^{k+1}, so the z-update reduces to

    z̄^{k+1} := argmin_{z̄} ( g(Nz̄) + (Nρ/2)‖z̄ − ū^k − x̄^{k+1}‖₂² ),    (7.13)

and the u-update becomes

    u_i^{k+1} = ū^k + x̄^{k+1} − z̄^{k+1},    (7.14)

which shows that the dual variables u_i^k are all equal (i.e., in consensus)
and can be replaced with a single dual variable u ∈ R^n. Substituting
in the expression for z_i^k in the x-update, the final algorithm becomes

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − x_i^k + x̄^k − z̄^k + u^k‖₂² )
    z̄^{k+1} := argmin_{z̄} ( g(Nz̄) + (Nρ/2)‖z̄ − u^k − x̄^{k+1}‖₂² )
    u^{k+1} := u^k + x̄^{k+1} − z̄^{k+1}.

The x-update can be carried out in parallel, for i = 1, ..., N. The z-update step requires gathering the x_i^{k+1} to form the averages, and then
solving a problem with n variables. After the u-update, the new value
of x̄^{k+1} − z̄^{k+1} + u^{k+1} is scattered to the subsystems.
7.3.1 Duality

7.3.2 Optimal Exchange
An important special case of sharing is the exchange problem,

    minimize Σ_{i=1}^N f_i(x_i)
    subject to Σ_{i=1}^N x_i = 0,

with variables x_i ∈ R^n, which is the sharing problem with g the indicator function of {0}. The components of the x_i can be viewed as quantities of commodities exchanged among N agents, with the constraint requiring that the market clear. ADMM for the exchange problem is

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − x_i^k + x̄^k + u^k‖₂² )
    u^{k+1} := u^k + x̄^{k+1}.

For comparison, dual decomposition for this problem has the form

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y^{kᵀ}x_i )
    y^{k+1} := y^k + α^k x̄^{k+1}.

The dual update has a classical economic interpretation as a form of tâtonnement, a process in which the market
acts via price adjustment, i.e., increasing or decreasing the price of each
good depending on whether there is an excess demand or excess supply
of the good, respectively.

Dual decomposition is the simplest algorithmic expression of tâtonnement. In this setting, each agent adjusts his consumption x_i
to minimize his individual cost f_i(x_i) adjusted by the cost yᵀx_i, where
y is the price vector. The central collector (called the secretary of market in [163]) works toward equilibrium by adjusting the prices y up or
down depending on whether each commodity or good is overproduced
or underproduced. ADMM differs only in the inclusion of the proximal
regularization term in the updates for each agent. As y^k converges to
an optimal price vector y*, the effect of the proximal regularization
term vanishes. The proximal regularization term can be interpreted as
each agent's commitment to help clear the market.
8 Distributed Model Fitting

The general convex model fitting problem can be written as

    minimize l(Ax − b) + r(x),    (8.1)

with parameters x ∈ R^n, where A ∈ R^{m×n} is the data matrix, b ∈ R^m is the response vector, l is a convex loss function, and r is a convex regularization function. We assume that l is additive, so that

    l(Ax − b) = Σ_{i=1}^m l_i(a_iᵀx − b_i),

where a_iᵀ is the ith row of A, b_i is the ith component of b, and l_i is the loss for the ith training example. A slightly more complex regularizer arises when some parameters are not regularized, such as the offset parameter in a classification model. This corresponds to, for example, r(x) = λ‖x_{1:n−1}‖₁,
where x_{1:n−1} is the subvector of x consisting of all but the last component of x; with this choice of r, the last component of x is not regularized.
The next section discusses some examples that have the general
form above. We then consider two ways to solve (8.1) in a distributed
manner, namely, by splitting across training examples and by splitting
across features. While we work with the assumption that l and r are
separable at the component level, we will see that the methods we
describe work with appropriate block separability as well.
8.1 Examples

8.1.1 Regression

8.1.2 Classification
Many classification problems can also be put in the form of the general
model fitting problem (8.1), with A, b, l, and r appropriately chosen. We
follow the standard setup from statistical learning theory, as described
in, e.g., [8]. Let p_i ∈ R^{n−1} denote the feature vector of the ith example
and let q_i ∈ {−1, 1} denote the binary outcome or class label, for i =
1, ..., m. The goal is to find a weight vector w ∈ R^{n−1} and offset v ∈ R
such that

    sign(p_iᵀw + v) = q_i

holds for many examples. Viewed as a function of p_i, the expression
p_iᵀw + v is called a discriminant function. The condition that the sign
of the discriminant function and the response should agree can also be
written as μ_i > 0, where μ_i = q_i(p_iᵀw + v) is called the margin of the
ith training example.

In the context of classification, loss functions are generally written
as a function of the margin, so the loss for the ith example is

    l_i(μ_i) = l_i(q_i(p_iᵀw + v)).

A classification error is made if and only if the margin is negative, so
l_i should be positive and decreasing for negative arguments and zero
or small for positive arguments. To find the parameters w and v, we
minimize the average loss plus a regularization term on the weights:

    (1/m) Σ_{i=1}^m l_i(q_i(p_iᵀw + v)) + r^{wt}(w).    (8.2)

This has the generic model fitting form (8.1), with x = (w, v), a_i =
(q_i p_i, q_i), b_i = 0, and regularizer r(x) = r^{wt}(w). (We also need to scale
l_i by 1/m.) In the sequel, we will address such problems using the form
(8.1) without comment, assuming that this transformation has been
carried out.

In statistical learning theory, the problem (8.2) is referred to as
penalized empirical risk minimization or structural risk minimization.
When the loss function is convex, this is sometimes termed convex
risk minimization. In general, fitting a classifier by minimizing a surrogate loss function, i.e., a convex upper bound to the 0-1 loss, is a
well studied and widely used approach in machine learning; see, e.g.,
[165, 180, 8].

Many classification models in machine learning correspond to different choices of loss function l_i and regularization or penalty r^{wt}.
8.2 Splitting across Examples
Here we discuss how to solve the model fitting problem (8.1) with
a modest number of features but a very large number of training
examples. Most classical statistical estimation problems belong to this
regime, with large volumes of relatively low-dimensional data. The goal
is to solve the problem in a distributed way, with each processor handling a subset of the training data. This is useful either when there are
so many training examples that it is inconvenient or impossible to process them on a single machine or when the data is naturally collected
or stored in a distributed fashion. This includes, for example, online
social network data, webserver access logs, wireless sensor networks,
and many cloud computing applications more generally.
We partition A and b by rows,

    A = [ A₁ ]        b = [ b₁ ]
        [ ⋮  ]            [ ⋮  ]
        [ A_N ],          [ b_N ],

with A_i ∈ R^{m_i×n} and b_i ∈ R^{m_i}, where Σ_{i=1}^N m_i = m. Thus, A_i and b_i
represent the ith block of data and will be handled by the ith processor.

We first put the model fitting problem in the consensus form

    minimize Σ_{i=1}^N l_i(A_i x_i − b_i) + r(z)
    subject to x_i − z = 0,  i = 1, ..., N,    (8.3)

with local variables x_i ∈ R^n and global variable z ∈ R^n; here l_i refers to the loss function for the ith block of data.
The resulting ADMM algorithm, in scaled form, is

    x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i − b_i) + (ρ/2)‖x_i − z^k + u_i^k‖₂² )
    z^{k+1} := argmin_z ( r(z) + (Nρ/2)‖z − x̄^{k+1} − ū^k‖₂² )
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

The first step, which consists of an ℓ2-regularized model fitting problem, can be carried out in parallel for each data block. The second
step requires gathering variables to form the average. The minimization in the second step can be carried out componentwise (and usually
analytically) when r is assumed to be fully separable.

The algorithm described above only requires that the loss function
l be separable across the blocks of data; the regularizer r does not need
to be separable at all. (However, when r is not separable, the z-update
may require the solution of a nontrivial optimization problem.)
8.2.1 Lasso

For the lasso, the algorithm above becomes

    x_i^{k+1} := argmin_{x_i} ( (1/2)‖A_i x_i − b_i‖₂² + (ρ/2)‖x_i − z^k + u_i^k‖₂² )
    z^{k+1} := S_{λ/(ρN)}(x̄^{k+1} + ū^k)
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

Each x_i-update takes the form of a Tikhonov-regularized least squares
(i.e., ridge regression) problem, with analytical solution

    x_i^{k+1} := (A_iᵀA_i + ρI)⁻¹(A_iᵀb_i + ρ(z^k − u_i^k)).

The techniques from §4.2 apply: If a direct method is used, then the factorization of A_iᵀA_i + ρI can be cached to speed up subsequent updates,
and if m_i < n, then the matrix inversion lemma can be applied to let
us factor the smaller matrix A_iA_iᵀ + ρI instead.

Comparing this distributed-data lasso algorithm with the serial version in §6.4, we see that the only difference is the collection and averaging steps, which couple the computations for the data blocks.
An ADMM-based distributed lasso algorithm is described in [121],
with applications in signal processing and wireless communications.
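A serial prototype of this distributed-data lasso is sketched below (our own illustration); each block's factorization is cached, and the inner loop over blocks stands in for parallel execution on separate processors.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def distributed_lasso(A_blocks, b_blocks, lam, rho=1.0, iters=100):
        N, n = len(A_blocks), A_blocks[0].shape[1]
        chols = [cho_factor(Ai.T @ Ai + rho * np.eye(n)) for Ai in A_blocks]
        Atbs = [Ai.T @ bi for Ai, bi in zip(A_blocks, b_blocks)]
        x = np.zeros((N, n)); u = np.zeros((N, n)); z = np.zeros(n)
        for k in range(iters):
            for i in range(N):                     # one ridge problem per block
                x[i] = cho_solve(chols[i], Atbs[i] + rho * (z - u[i]))
            v = x.mean(axis=0) + u.mean(axis=0)
            kappa = lam / (rho * N)                # z-update: S_{lam/(rho*N)}
            z = np.maximum(v - kappa, 0) - np.maximum(-v - kappa, 0)
            u = u + x - z
        return z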
8.2.2 Sparse Logistic Regression

Consider solving (8.1) with logistic loss functions l_i and ℓ1 regularization. We ignore the intercept term for notational simplicity; the algorithm can be easily modified to incorporate an intercept. The ADMM
algorithm is

    x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i) + (ρ/2)‖x_i − z^k + u_i^k‖₂² )
    z^{k+1} := S_{λ/(ρN)}(x̄^{k+1} + ū^k)
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

This is identical to the distributed lasso algorithm, except for the x_i-update, which here involves an ℓ2 regularized logistic regression problem that can be efficiently solved by algorithms like L-BFGS.
8.2.3 Support Vector Machine

For a support vector machine with hinge loss and quadratic regularization r(x) = (1/(2λ))‖x‖₂², the algorithm is

    x_i^{k+1} := argmin_{x_i} ( 1ᵀ(A_i x_i + 1)₊ + (ρ/2)‖x_i − z^k + u_i^k‖₂² )
    z^{k+1} := ( Nρ / ((1/λ) + Nρ) ) (x̄^{k+1} + ū^k)
    u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

Each x_i-update essentially involves fitting a support vector machine to
the local data A_i (with an offset in the quadratic regularization term),
so this can be carried out efficiently using an existing SVM solver for
serial problems.

The use of ADMM to train support vector machines in a distributed
fashion was described in [74].
8.3
Now we consider the model tting problem (8.1) with a modest number of examples and a large number of features. Statistical problems
of this kind frequently arise in areas like natural language processing
and bioinformatics, where there are often a large number of potential
67
explanatory variables for any given outcome. For example, the observations may be a corpus of documents, and the features could include
all words and pairs of adjacent words (bigrams) that appear in each
document. In bioinformatics, there are usually relatively few people
in a given association study, but there can be a very large number of
potential features relating to factors like observed DNA mutations in
each individual. There are many examples in other areas as well, and
the goal is to solve such problems in a distributed fashion with each
processor handling a subset of the features. In this section, we show
how this can be done by formulating it as a sharing problem from 7.3.
We partition the parameter vector x as x = (x1 , . . . , xN ), with xi
Rni , where N
i=1 ni = n. Conformably partition the data matrix A as
A = [A1 AN ], with Ai Rmni , and the regularization function as
N
r(x) = N
i=1 ri (xi ). This implies that Ax =
i=1 Ai xi , i.e., Ai xi can
be thought of as a partial prediction of b using only the features referenced in x_i. The model fitting problem (8.1) becomes

    minimize l( Σ_{i=1}^N A_i x_i − b ) + Σ_{i=1}^N r_i(x_i).

Following the approach used for the sharing problem (7.12), we express
the problem as

    minimize l( Σ_{i=1}^N z_i − b ) + Σ_{i=1}^N r_i(x_i)
    subject to A_i x_i − z_i = 0,  i = 1, ..., N,

with new variables z_i ∈ R^m. The scaled form of ADMM is

    x_i^{k+1} := argmin_{x_i} ( r_i(x_i) + (ρ/2)‖A_i x_i − z_i^k + u_i^k‖₂² )
    z^{k+1} := argmin_z ( l( Σ_{i=1}^N z_i − b ) + (ρ/2) Σ_{i=1}^N ‖A_i x_i^{k+1} − z_i + u_i^k‖₂² )
    u_i^{k+1} := u_i^k + A_i x_i^{k+1} − z_i^{k+1}.
As in the discussion for the sharing problem, we carry out the z-update
by first solving for the average z̄^{k+1}:

    z̄^{k+1} := argmin_{z̄} ( l(Nz̄ − b) + (Nρ/2)‖z̄ − \overline{Ax}^{k+1} − ū^k‖₂² )
    z_i^{k+1} := z̄^{k+1} + A_i x_i^{k+1} + u_i^k − \overline{Ax}^{k+1} − ū^k,

where \overline{Ax}^{k+1} = (1/N) Σ_{i=1}^N A_i x_i^{k+1}. Substituting the last expression
into the update for u_i, we find that

    u_i^{k+1} = \overline{Ax}^{k+1} + ū^k − z̄^{k+1},

which shows that, as in the sharing problem, all the dual variables are
equal. Using a single dual variable u^k ∈ R^m, and eliminating z_i, we
arrive at the algorithm

    x_i^{k+1} := argmin_{x_i} ( r_i(x_i) + (ρ/2)‖A_i x_i − A_i x_i^k + \overline{Ax}^k − z̄^k + u^k‖₂² )
    z̄^{k+1} := argmin_{z̄} ( l(Nz̄ − b) + (Nρ/2)‖z̄ − \overline{Ax}^{k+1} − u^k‖₂² )
    u^{k+1} := u^k + \overline{Ax}^{k+1} − z̄^{k+1}.
8.3.1 Lasso

For the lasso, this yields the distributed algorithm

    x_i^{k+1} := argmin_{x_i} ( (ρ/2)‖A_i x_i − A_i x_i^k + \overline{Ax}^k − z̄^k + u^k‖₂² + λ‖x_i‖₁ )
    z̄^{k+1} := (1/(N + ρ)) ( b + ρ\overline{Ax}^{k+1} + ρu^k )
    u^{k+1} := u^k + \overline{Ax}^{k+1} − z̄^{k+1}.

Each x_i-update is itself a lasso problem with n_i variables, which can be solved by any serial lasso method; a simple test on the righthand side data shows in advance whether x_i^{k+1} = 0, without solving the subproblem.
When this occurs, the x_i-update is fast (compared to the case when
x_i^{k+1} ≠ 0). In a parallel implementation, there is no benefit to speeding
up only some of the tasks being executed in parallel, but in a serial
setting we do benefit.
8.3.2 Group Lasso

Consider the group lasso problem with the feature groups coinciding
with the blocks of features, and ℓ2 norm (not squared) regularization:

    minimize (1/2)‖Ax − b‖₂² + λ Σ_{i=1}^N ‖x_i‖₂.

The z̄-update and u-update are the same as for the lasso, but the x_i-update becomes

    x_i^{k+1} := argmin_{x_i} ( (ρ/2)‖A_i x_i − A_i x_i^k + \overline{Ax}^k − z̄^k + u^k‖₂² + λ‖x_i‖₂ ).

(Only the subscript on the last norm differs from the lasso case.) This
involves minimizing a function of the form

    (ρ/2)‖A_i x_i − v‖₂² + λ‖x_i‖₂,

which can be carried out as follows. The solution is x_i = 0 if and only
if ‖A_iᵀv‖₂ ≤ λ/ρ. Otherwise, the solution has the form

    x_i = (A_iᵀA_i + νI)⁻¹A_iᵀv,

for the value of ν > 0 that gives ‖x_i‖₂ = λ/(ρν). This value can be found
using a one-parameter search (e.g., via bisection) over ν. We can speed
up the computation of x_i for several values of ν (as needed for the
parameter search) by computing and caching an eigendecomposition of
A_iᵀA_i. Assuming A_i is tall, i.e., m ≥ n_i (a similar method works when
m < n_i), we compute an orthogonal Q for which A_iᵀA_i = Q diag(η)Qᵀ,
where η is the vector of eigenvalues of A_iᵀA_i (i.e., the squares of the
singular values of A_i). The cost is O(mn_i²) flops, dominated (in order)
by forming A_iᵀA_i. We subsequently compute ‖x_i‖₂ using

    ‖x_i‖₂ = ‖ diag(η + ν1)⁻¹ QᵀA_iᵀv ‖₂.

This can be computed in O(n_i) flops, once QᵀA_iᵀv is computed, so
the search over ν is costless (in order). The cost per iteration is thus
O(mn_i) (to compute QᵀA_iᵀv), a factor of n_i better than carrying out
the x_i-update without caching.
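The x_i-update just described can be implemented as follows; this is a sketch with our own function name and a simple bracketing heuristic for the bisection, and in a real implementation the eigendecomposition would be computed once and cached across ADMM iterations.

    import numpy as np

    def group_block_update(A, v, lam, rho):
        # Minimizes (rho/2)||A x - v||_2^2 + lam*||x||_2.
        w = A.T @ v
        if np.linalg.norm(w) <= lam / rho:
            return np.zeros(A.shape[1])            # zero block
        eta, Q = np.linalg.eigh(A.T @ A)           # cacheable across iterations
        Qw = Q.T @ w
        # Find nu > 0 with ||x(nu)||_2 = lam/(rho*nu), x(nu) = (A^T A + nu I)^{-1} w.
        gap = lambda nu: np.linalg.norm(Qw / (eta + nu)) - lam / (rho * nu)
        lo, hi = 1e-12, 1.0
        while gap(hi) < 0:                         # grow bracket: gap(lo) < 0 < gap(hi)
            hi *= 2.0
        for _ in range(100):                       # bisection on nu
            mid = 0.5 * (lo + hi)
            if gap(mid) < 0:
                lo = mid
            else:
                hi = mid
        nu = 0.5 * (lo + hi)
        return Q @ (Qw / (eta + nu))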
8.3.3 Sparse Logistic Regression

The algorithm is identical to the lasso problem above, except that the
z̄-update becomes

    z̄^{k+1} := argmin_{z̄} ( l(Nz̄) + (ρ/2)‖z̄ − \overline{Ax}^{k+1} − u^k‖₂² ),

where l is the logistic loss function. This splits to the component level,
and involves the proximity operator for l. This can be very efficiently
computed by a lookup table that gives the approximate value, followed
by one or two Newton steps (for a scalar problem).

It is interesting to see that in distributed sparse logistic regression,
the dominant computation is the solution of N parallel lasso problems.
8.3.4 Support Vector Machine

The algorithm is

    x_i^{k+1} := argmin_{x_i} ( λ‖x_i‖₂² + (ρ/2)‖A_i x_i − A_i x_i^k + \overline{Ax}^k − z̄^k + u^k‖₂² )
    z̄^{k+1} := argmin_{z̄} ( 1ᵀ(Nz̄ + 1)₊ + (ρ/2)‖z̄ − \overline{Ax}^{k+1} − u^k‖₂² )
    u^{k+1} := u^k + \overline{Ax}^{k+1} − z̄^{k+1}.

The z̄-update can be carried out componentwise, with v = \overline{Ax}^{k+1} + u^k:

    z̄_i^{k+1} := { v_i − N/ρ,  v_i > −1/N + N/ρ
                { −1/N,      v_i ∈ [−1/N, −1/N + N/ρ]
                { v_i,       v_i < −1/N.
8.3.5 Generalized Additive Models

A generalized additive model replaces the linear predictor with a sum of feature functions,

    Σ_{j=1}^n f_j(a_j),

where a_j is the jth feature and each f_j : R → R is taken from a set F_j of allowed functions. Splitting across the n features, the ADMM algorithm is

    f_j^{k+1} := argmin_{f_j ∈ F_j} ( r_j(f_j) + (ρ/2) Σ_{i=1}^m ( f_j(a_{ij}) − f_j^k(a_{ij}) + f̄_i^k − z̄_i^k + u_i^k )² )
    z̄^{k+1} := argmin_{z̄} ( Σ_{i=1}^m l_i(Nz̄_i − b_i) + (ρ/2)‖z̄ − f̄^{k+1} − u^k‖₂² )
    u^{k+1} := u^k + f̄^{k+1} − z̄^{k+1},

where f̄_i^k = (1/n) Σ_{j=1}^n f_j^k(a_{ij}), the average value of the predicted
response Σ_{j=1}^n f_j^k(a_{ij}) for the ith training example.

The f_j-update is an ℓ2 (squared) regularized function fit. The z̄-update can be carried out componentwise.
9 Nonconvex Problems

9.1 Nonconvex Constraints

ADMM can also be applied, as a heuristic, to problems with a nonconvex constraint set,

    minimize f(x)
    subject to x ∈ S,

where f is convex but S is nonconvex. As in §5, the z-update is a projection, here onto the nonconvex set S:

    z^{k+1} := Π_S(x^{k+1} + u^k).

We assume this projection can be carried out exactly, as it can for the special cases considered below; since S is nonconvex, however, ADMM is no longer guaranteed to converge, let alone to find a global solution.

9.1.1 Regressor Selection

For S = {x ∈ R^n | card(x) ≤ c}, where card counts nonzero entries, the problem is least squares with a cardinality constraint (regressor selection). The projection Π_S keeps the c largest-magnitude entries, so the resulting algorithm is the same as ADMM for the lasso, with soft thresholding in the z-update
replaced with hard thresholding. This close connection is hardly surprising, since lasso can be thought of as a heuristic for solving the
regressor selection problem. From this viewpoint, the lasso controls the
trade-off between least squares error and sparsity through the parameter λ, whereas in ADMM for regressor selection, the same trade-off is
controlled by the parameter c, the exact cardinality desired.
9.1.2 Factor Model Fitting

The goal is to approximate a symmetric matrix Σ (say, an empirical covariance matrix) as a sum of a rank-k and a diagonal positive
semidefinite matrix. Using the Frobenius norm to measure approximation error, we have the problem

    minimize (1/2)‖X + diag(d) − Σ‖_F²
    subject to X ⪰ 0, Rank(X) = k, d ≥ 0,

with variables X ∈ S^n, d ∈ R^n. (Any convex loss function could be used
in lieu of the Frobenius norm.)

We take

    f(X) = inf_{d≥0} (1/2)‖X + diag(d) − Σ‖_F²
         = (1/2) Σ_{i≠j} (X_{ij} − Σ_{ij})² + (1/2) Σ_{i=1}^n (X_{ii} − Σ_{ii})₊²,

and g the indicator function of S = {X ∈ S^n | X ⪰ 0, Rank(X) = k}, giving the ADMM iteration

    X^{k+1} := argmin_X ( f(X) + (ρ/2)‖X − Z^k + U^k‖_F² )
    Z^{k+1} := Π_S(X^{k+1} + U^k)
    U^{k+1} := U^k + X^{k+1} − Z^{k+1},

where Z, U ∈ S^n. The X-update is separable to the component level,
and can be expressed as

    (X^{k+1})_{ij} := (1/(1+ρ)) ( Σ_{ij} + ρ(Z_{ij}^k − U_{ij}^k) ),  i ≠ j
    (X^{k+1})_{ii} := (1/(1+ρ)) ( Σ_{ii} + ρ(Z_{ii}^k − U_{ii}^k) ),  if Σ_{ii} ≤ Z_{ii}^k − U_{ii}^k
    (X^{k+1})_{ii} := Z_{ii}^k − U_{ii}^k,  otherwise.
9.2 Bi-convex Problems

Another problem that admits exact ADMM updates is the general bi-convex problem,

    minimize F(x, z)
    subject to G(x, z) = 0,

where F : R^n × R^m → R is bi-convex, i.e., convex in x for each z and
convex in z for each x, and G : R^n × R^m → R^p is bi-affine, i.e., affine
in x for each fixed z, and affine in z for each fixed x. When F is
separable in x and z, and G is jointly affine in x and z, this reduces
to the standard ADMM problem form (3.1). For this problem ADMM
has the form

    x^{k+1} := argmin_x ( F(x, z^k) + (ρ/2)‖G(x, z^k) + u^k‖₂² )
    z^{k+1} := argmin_z ( F(x^{k+1}, z) + (ρ/2)‖G(x^{k+1}, z) + u^k‖₂² )
    u^{k+1} := u^k + G(x^{k+1}, z^{k+1}).

Each step is a convex problem, since G is bi-affine. In a matrix factorization example of this form, the first step splits across the rows of X and V, so it can be performed
by solving a set of quadratic programs, in parallel, to find each row of
X and V separately; the second splits across the columns of W, so it can be
performed by solving parallel quadratic programs to find each column.
10 Implementation

10.1 Abstract Implementation

In abstract form, the global consensus algorithm of §7.1 can be implemented with each subsystem i repeating the steps

    u_i := u_i + x_i − z
    x_i := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − z + u_i‖₂² ),

followed by a global aggregation step that computes z from the averages of the x_i and u_i.
Here, the iteration indices are omitted because in an actual implementation, we can simply overwrite previous values of these variables. Note
that the u-update must be done before the x-update in order to match
(7.6)–(7.8). If g = 0, then the z-update simply involves computing x̄, and
the u_i are not part of the aggregation, as discussed in §7.1.
This suggests that the main features required to implement ADMM
are the following:

Mutable state. Each subsystem i must store the current values
of x_i and u_i.

Local computation. Each subsystem must be able to solve a
small convex problem, where 'small' means that the problem
is solvable using a serial algorithm. In addition, each local
process must have local access to whatever data are required
to specify f_i.

Global aggregation. There must be a mechanism for averaging local variables and broadcasting the result back to each
subsystem, either by explicitly using a central collector or via
some other approach like distributed averaging [160, 172]. If
computing z involves a proximal step (i.e., if g is nonzero),
this can either be performed centrally or at each local node;
the latter is easier to implement in some frameworks.

Synchronization. All the local variables must be updated
before performing global aggregation, and the local updates
must all use the latest global variable. One way to implement
this synchronization is via a barrier, a system checkpoint at
which all subsystems must stop and wait until all other subsystems reach it.
When actually implementing ADMM, it helps to consider whether to
take the local perspective of a subsystem performing local processing
and communicating with a central collector, or the global perspective
of a central collector coordinating the work of a set of subsystems.
Which is more natural depends on the software framework used.

From the local perspective, each node i receives z, updates u_i and
then x_i, sends them to the central collector, waits, and then receives the
updated z.
10.2 MPI
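As a sketch of the local perspective described in §10.1, the following mpi4py fragment (our own illustration; local_prox and prox_g are hypothetical stand-ins for the subsystem's x-update and the regularizer's proximal step) runs one subsystem per MPI process and forms the averages with Allreduce, which also acts as the synchronization barrier.

    import numpy as np
    from mpi4py import MPI

    def consensus_admm_mpi(local_prox, prox_g, n, rho=1.0, iters=100):
        comm = MPI.COMM_WORLD
        N = comm.Get_size()                  # one subsystem per process
        x, u, z = np.zeros(n), np.zeros(n), np.zeros(n)
        total = np.zeros(n)
        for k in range(iters):
            x = local_prox(z - u, rho)       # local x-update
            comm.Allreduce(x + u, total, op=MPI.SUM)   # sum, not average:
            z = prox_g(total / N, N * rho)   # summation is associative
            u = u + x - z                    # local dual update
        return z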
10.3 Graph Computing Frameworks

10.4 MapReduce
The Reducer computes the sum ẑ = Σ_{i=1}^N (x_i + u_i) rather than z̄ or z, because summation is associative while averaging is not. We assume N is known (or, alternatively,
the Reducer can compute the sum Σ_{i=1}^N 1). We have N Mappers, one
for each subsystem, and each Mapper updates u_i and x_i using the
z from the previous iteration. Each Mapper independently executes
the proximal step to compute z, but this is usually a cheap operation like soft thresholding. It emits an intermediate key-value pair that
essentially serves as a message to the central collector. There is a single Reducer, playing the role of a central collector, and its incoming
values are the messages from the Mappers. The updated records are
then written out so they are available to the Mappers in the next iteration.
Since each Mapper only handles a single block of data, there will
usually be a number of Mappers running on the same machine. To
reduce the amount of data transferred over the network, Hadoop supports the use of combiners, which essentially Reduce the results of all
the Map tasks on a given node so only one set of intermediate key-value pairs needs to be transferred across machines for the final Reduce
task. In other words, the Reduce step should be viewed as a two-step
process: First, the results of all the Mappers on each individual node
are reduced with Combiners, and then the records across each machine
are Reduced. This is a major reason why the Reduce function must be
commutative and associative.
Since the input value to a Mapper is a block of data, we also need a
mechanism for a Mapper to read in local variables, and for the Reducer
to store the updated variables for the next iteration. Here, we use
HBase, a distributed database built on top of HDFS that provides
fast random read-write access. HBase, like BigTable, provides a distributed multi-dimensional sorted map. The map is indexed by a row
key, a column key, and a timestamp. Each cell in an HBase table can
contain multiple versions of the same data indexed by timestamp; in
our case, we can use the iteration counts as the timestamps to store
and access data from previous iterations; this is useful for checking termination criteria, for example. The row keys in a table are strings, and
HBase maintains data in lexicographic order by row key. This means
that rows with lexicographically adjacent keys will be stored on the
same machine or nearby. In our case, variables should be stored with
the subsystem identifier at the beginning of the row key, so information for
the same subsystem is stored together and is efficient to access. For
more details, see [32, 170].
The discussion and pseudocode above omit and gloss over many
details for simplicity of exposition. MapReduce frameworks like Hadoop
also support much more sophisticated implementations, which may
be necessary for very large scale problems. For example, if there
are too many values for a single Reducer to handle, we can use an
approach analogous to the one suggested for MPI: Mappers emit pairs
to regional reduce jobs, and then an additional MapReduce step is carried out that uses an identity mapper and aggregates regional results
86
Implementation
into a global result. In this section, our goal is merely to give a general avor of some of the issues involved in implementing ADMM in
a MapReduce framework, and we refer to [46, 170, 111] for further
details. There has also been some recent work on alternative MapReduce systems that are specically designed for iterative computation,
which are likely better suited for ADMM [25, 179], though the implementations are less mature and less widely available. See [37, 93] for
examples of recent papers discussing machine learning and optimization in MapReduce frameworks.
11 Numerical Examples
11.1 Small Dense Lasso

11.1.1 Single Problem

Fig. 11.1. Norms of primal residual (top) and dual residual (bottom) versus iteration, for a lasso problem. The dashed lines show ε^pri (top) and ε^dual (bottom).

Fig. 11.2. Objective suboptimality versus iteration for a lasso problem. The stopping criterion is satisfied at iteration 15, indicated by the vertical dashed line.
Since A is fat (i.e., m < n), we apply the matrix inversion lemma to (A^T A + ρI)^{-1} and instead compute the factorization of the smaller matrix I + (1/ρ)AA^T, which is then cached for subsequent x-updates. The factor step itself takes about nm² + (1/3)m³ flops, which is the cost of forming AA^T and computing the Cholesky factorization. Subsequent updates require two matrix-vector multiplications and forward-backward solves, which require approximately 4mn + 2m² flops. (The cost of the soft thresholding step in the z-update is negligible.) For these problem dimensions, the flop count analysis suggests a factor/solve ratio of around 350, which means that 350 subsequent ADMM iterations can be carried out for the cost of the initial factorization.
In our basic implementation, the factorization step takes about 1 second, and subsequent x-updates take around 30 ms. (This gives a factor/solve ratio of only 33, less than predicted, due to a particularly efficient matrix-matrix multiplication routine used in Matlab.) Thus the total cost of solving an entire lasso problem is around 1.5 seconds, only 50% more than the initial factorization. In terms of parameter estimation, we can say that computing the lasso estimate requires only 50%
more time than a ridge regression estimate. (Moreover, in an implementation with a higher factor/solve ratio, the additional effort for the lasso would have been even smaller.)

Fig. 11.3. Iterations needed versus λ for warm start (solid line) and cold start (dashed line).
Finally, we report the effect of varying the parameter ρ on convergence time. Varying ρ over the 100:1 range from 0.1 to 10 yields a solve time ranging between 1.45 seconds and around 4 seconds. (In an implementation with a larger factor/solve ratio, the effect of varying ρ would have been even smaller.) Over-relaxation with α = 1.5 does not significantly change the convergence time with ρ = 1, but it does reduce the worst convergence time over the range ρ ∈ [0.1, 10] to only 2.8 seconds.
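The steps just described fit in a few lines of code. The following NumPy sketch (ours, not the Matlab implementation timed above; the name lasso_admm, the warm-start arguments, and the default tolerances are our own choices) caches the Cholesky factor of the smaller m × m matrix and applies the matrix inversion lemma in every x-update; it also includes over-relaxation with parameter α:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lasso_admm(A, b, lam, rho=1.0, alpha=1.0, iters=1000,
                   eps_abs=1e-4, eps_rel=1e-2, z0=None, u0=None):
        # ADMM for minimize (1/2)||Ax - b||_2^2 + lam*||x||_1, with A fat (m < n).
        m, n = A.shape
        Atb = A.T @ b
        # Factor step: form and factor the smaller m x m matrix I + (1/rho) A A^T
        # once; it is cached and reused in every subsequent solve step.
        chol = cho_factor(np.eye(m) + (1.0 / rho) * (A @ A.T))
        z = np.zeros(n) if z0 is None else z0.copy()
        u = np.zeros(n) if u0 is None else u0.copy()
        for _ in range(iters):
            # x-update via the matrix inversion lemma:
            # (A^T A + rho I)^{-1} q = q/rho - A^T (I + (1/rho) A A^T)^{-1} (A q) / rho^2
            q = Atb + rho * (z - u)
            x = q / rho - A.T @ cho_solve(chol, A @ q) / rho**2
            x_hat = alpha * x + (1.0 - alpha) * z          # over-relaxation
            z_old, v = z, x_hat + u
            z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft threshold
            u = u + x_hat - z
            # primal and dual residuals and the stopping criterion of Section 3.3
            r = np.linalg.norm(x - z)
            s = rho * np.linalg.norm(z - z_old)
            eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
            eps_dual = np.sqrt(n) * eps_abs + eps_rel * rho * np.linalg.norm(u)
            if r < eps_pri and s < eps_dual:
                break
        return z, u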
11.1.2 Regularization Path
To illustrate computing the regularization path, we solve the lasso problem for 100 values of λ, spaced logarithmically from 0.01λ_max (where x⋆ has around 800 nonzeros) to 0.95λ_max (where x⋆ has two nonzero entries). We first solve the lasso problem as above for λ = 0.01λ_max, and for each subsequent value of λ, we then initialize (warm start) z
and u at their optimal values for the previous λ. This requires only one factorization for all the computations; warm starting ADMM at the solution for the previous λ also substantially reduces the number of iterations required, as Figure 11.3 shows. The resulting timings are summarized in the following table:

Task                               Time (s)
Factorization                      1.1
Subsequent x-update                0.03
Single lasso solve                 1.5
Regularization path, cold start    160
Regularization path, warm start    13
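With the sketch above, the warm-started path computation is a short loop; here λ_max = ‖A^T b‖_∞ is the smallest λ for which the lasso solution is zero, and A, b are the problem data from the instance above:

    lam_max = np.linalg.norm(A.T @ b, np.inf)
    z = u = None
    path = []
    for lam in np.logspace(np.log10(0.01), np.log10(0.95), 100) * lam_max:
        z, u = lasso_admm(A, b, lam, z0=z, u0=u)  # warm start at previous lambda's solution
        path.append(z)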
11.2 Distributed ℓ1 Regularized Logistic Regression
In this example, we use consensus ADMM to fit an ℓ1 regularized logistic regression model. Following §8, the problem is

minimize   ∑_{i=1}^m log(1 + exp(−b_i(a_i^T w + v))) + λ‖w‖₁,    (11.1)

with optimization variables w ∈ R^n and v ∈ R. The training set consists of m pairs (a_i, b_i), where a_i ∈ R^n is a feature vector and b_i ∈ {−1, 1} is the corresponding label.
We generated a problem instance with m = 10^6 training examples and n = 10^4 features. The m examples are distributed among N = 100 subsystems, so each subsystem has 10^4 training examples. Each feature vector a_i was generated to have approximately 10 nonzero features, each sampled independently from a standard normal distribution. We chose a true weight vector w^true ∈ R^n to have 100 nonzero values, and these entries, along with the true intercept v^true, were sampled independently from a standard normal distribution.
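In the consensus formulation, each subsystem's x_i-update is an ℓ2-regularized logistic regression over its local data block, which can be solved with any smooth optimizer such as L-BFGS [26, 113]. The following is a minimal sketch (ours; the stacking convention x = (v, w) and the function name are our own, and a hand-coded gradient would speed this up considerably):

    import numpy as np
    from scipy.optimize import minimize

    def local_x_update(Ai, bi, z, ui, rho):
        # x_i-update for consensus ADMM on (11.1): minimize the local logistic
        # loss plus the quadratic penalty (rho/2)||x - z + u_i||_2^2.
        # x stacks the intercept and weights as x = (v, w); z and ui match.
        def f(x):
            margins = -bi * (Ai @ x[1:] + x[0])
            return np.logaddexp(0.0, margins).sum() + (rho / 2.0) * np.sum((x - z + ui) ** 2)
        return minimize(f, x0=z - ui, method="L-BFGS-B").x

The z-update then averages the (x_i + u_i) and soft thresholds the weight block; the intercept v is not regularized in (11.1), so its entry is simply averaged.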
Fig. 11.4. Progress of primal and dual residual norm for the distributed ℓ1 regularized logistic regression problem. The dashed lines show ε^pri (top) and ε^dual (bottom).
Figure 11.5 shows the suboptimality p^k − p⋆ for the consensus variable, where

p^k = ∑_{i=1}^m log(1 + exp(−b_i(a_i^T w^k + v^k))) + λ‖w^k‖₁.
11.3 Group Lasso with Feature Splitting
Fig. 11.6. Norms of primal residual (top) and dual residual (bottom) versus iteration, for the distributed group lasso problem. The dashed lines show ε^pri (top) and ε^dual (bottom).
The suboptimality reported in Figure 11.7 is computed from the group lasso objective

p^k = (1/2)‖Ax^k − b‖₂² + λ ∑_{i=1}^K ‖x^k_i‖₂.
Fig. 11.7. Suboptimality of distributed group lasso versus iteration. The stopping criterion is satisfied at iteration 47, indicated by the vertical dashed line.
11.4 Distributed Large-Scale Lasso with MPI
that the overall problem has a skinny coefficient matrix but each of the subproblems has a fat coefficient matrix. We emphasize that the coefficient matrix is dense, so the full dataset requires over 30 GB to store and has 3.2 billion nonzero entries in the total coefficient matrix A. This is far too large to be solved efficiently, or at all, using standard serial methods on commonly available hardware.
We solved the problem using a cluster of 10 machines. We used
Cluster Compute instances, which have 23 GB of RAM, two quad-core
Intel Xeon X5570 Nehalem chips, and are connected to each other
with 10 Gigabit Ethernet. We used hardware virtual machine images
running CentOS 5.4. Since each node had 8 cores, we ran the code
with 80 processes, so each subsystem ran on its own core. In MPI,
communication between processes on the same machine is performed
locally via the shared-memory Byte Transfer Layer (BTL), which provides low latency and high bandwidth communication, while communication across machines goes over the network. The data was sized
so all the processes on a single machine could work entirely in RAM.
Each node had its own attached Elastic Block Storage (EBS) volume
that contained only the local data relevant to that machine, so disk
throughput was shared among processes on the same machine but not
across machines. This is to emulate a scenario where each machine is
only processing the data on its local disk, and none of the dataset is
transferred over the network. We emphasize that usage of a cluster set
up in this fashion costs under $20 per hour.
We solved the problem with a deliberately naive implementation of the algorithm, based directly on the discussion of §6.4, §8.2, and §10.2. The implementation consists of a single file of C code, under 400 lines despite extensive comments. The linear algebra (BLAS operations and the Cholesky factorization) was performed using a stock installation of the GNU Scientific Library.
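For illustration, the core loop maps directly onto MPI; the following mpi4py sketch is ours (the actual implementation is the C code described above), and load_local_block is a hypothetical loader for the per-process data block:

    import numpy as np
    from mpi4py import MPI
    from scipy.linalg import cho_factor, cho_solve

    comm = MPI.COMM_WORLD
    N = comm.Get_size()                        # one process per subsystem
    A, b = load_local_block(comm.Get_rank())   # hypothetical: read A_i, b_i from local disk
    m, n = A.shape                             # each local A_i is fat (m < n)
    rho, lam = 1.0, 0.5

    Atb = A.T @ b
    chol = cho_factor(np.eye(m) + (1.0 / rho) * (A @ A.T))  # factor once, cache

    x, u, z = np.zeros(n), np.zeros(n), np.zeros(n)
    w = np.empty(n)
    for k in range(100):
        # local x_i-update via the cached factorization: two matrix-vector
        # multiplies and one forward-backward solve
        q = Atb + rho * (z - u)
        x = q / rho - A.T @ cho_solve(chol, A @ q) / rho**2
        # all the message passing: every process receives the sum of x_j + u_j
        comm.Allreduce(x + u, w, op=MPI.SUM)
        # z-update: soft threshold the average
        v = w / N
        z = np.sign(v) * np.maximum(np.abs(v) - lam / (rho * N), 0.0)
        u = u + x - z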
We now report the breakdown of the wall-clock runtime. It took roughly 30 seconds to load all the data into memory. It then took 4–5 minutes to form and then compute the Cholesky factorizations of I + (1/ρ)A_i A_i^T. After caching these factorizations, it then took 0.5–2 seconds for each subsequent ADMM iteration. This includes the backsolves in the x_i-updates and all the message passing. For this problem,
the problem dimensions and timing data are summarized in the following table:

Total dataset size             30 GB
Number of subsystems           80
Dimensions of A                400000 × 8000
Dimensions of each A_i         5000 × 8000
Loading data                   30 seconds
Factorization                  5 minutes
Subsequent ADMM iteration      1 second
Total solve time               6 minutes
11.5 Regressor Selection
Fig. 11.8. Fit versus cardinality for the lasso (dotted line), lasso with posterior least squares fit (dashed line), and regressor selection (solid line).
The x-update step has exactly the same expression as in the lasso example, so we use the same method, based on the matrix inversion lemma and caching, described in that example. The z-update step consists of keeping the c largest magnitude components of x + u and zeroing the rest. For the sake of clarity, we implemented this with a full sort of the components, but more efficient schemes are possible. In any case, the cost of the z-update is negligible compared with that of the x-update.
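One such more efficient scheme replaces the full sort with a linear-time selection; a small sketch (the function name is ours):

    import numpy as np

    def keep_top_c(v, c):
        # z-update for regressor selection: keep the c largest-magnitude
        # components of v = x + u and zero the rest. np.argpartition does a
        # partial selection in O(n) rather than an O(n log n) full sort.
        z = np.zeros_like(v)
        if c > 0:
            idx = np.argpartition(np.abs(v), -c)[-c:]
            z[idx] = v[idx]
        return z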
Convergence of ADMM for a nonconvex problem such as this one is not guaranteed, and even when it does converge, the final result can depend on the choice of ρ and on the initial values for z and u. To explore this, we ran 100 ADMM simulations with randomly chosen initial values and with ρ ranging between 0.1 and 100. Indeed, some of them did not converge, or at least were converging slowly. But most of them converged, though not to exactly the same points. However, the objective values obtained by those that converged were reasonably close to each other, typically within 5%. The different values of x found had small differences.
12 Conclusions
We have discussed ADMM and illustrated its applicability to distributed convex optimization in general and many problems in statistical machine learning in particular. We argue that ADMM can serve
as a good general-purpose tool for optimization problems arising in the
analysis and processing of modern massive datasets. Much like gradient
descent and the conjugate gradient method are standard tools of great
use when optimizing smooth functions on a single machine, ADMM
should be viewed as an analogous tool in the distributed regime.
ADMM sits at a higher level of abstraction than classical optimization algorithms like Newton's method. In such algorithms, the base operations are low-level, consisting of linear algebra operations and the computation of gradients and Hessians. In the case of ADMM, the base operations include solving small convex optimization problems (which in some cases can be done via a simple analytical formula). For example, when applying ADMM to a very large model fitting problem, each update reduces to a (regularized) model fitting problem on a smaller dataset.
dataset. These subproblems can be solved using any standard serial
algorithm suitable for small to medium sized problems. In this sense,
ADMM builds on existing algorithms for single machines, and so can be viewed as a natural way to extend them to the distributed setting.
Acknowledgments
We are very grateful to Rob Tibshirani and Trevor Hastie for encouraging us to write this review. Thanks also to Alexis Battle, Dimitri
Bertsekas, Danny Bickson, Tom Goldstein, Dimitri Gorinevsky, Daphne
Koller, Vicente Malave, Stephen Oakley, and Alex Teichman for helpful comments and discussions. Yang Wang and Matt Kraning helped in
developing ADMM for the sharing and exchange problems, and Arezou
Keshavarz helped work out ADMM for generalized additive models. We
thank Georgios Giannakis and Alejandro Ribeiro for pointing out some
very relevant references that we had missed in an earlier version. We
thank John Duchi for a very careful reading of the manuscript and for
suggestions that greatly improved it.
Support for this work was provided in part by AFOSR grant
FA9550-09-0130 and NASA grant NNX07AEIIA. Neal Parikh was supported by the Cortlandt and Jean E. Van Rensselaer Engineering
Fellowship from Stanford University and by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-0645962.
Eric Chu was supported by the Pan Wen-Yuan Foundation Scholarship.
A Convergence Proof
The basic convergence result given in §3.2 can be found in several references, such as [81, 63]. Many of these give more sophisticated results, with more general penalties or inexact minimization. For completeness, we give a proof here.

We will show that if f and g are closed, proper, and convex, and the Lagrangian L0 has a saddle point, then we have primal residual convergence, meaning that r^k → 0, and objective convergence, meaning that p^k → p⋆, where p^k = f(x^k) + g(z^k). We will also see that the dual residual s^k = ρA^T B(z^k − z^{k−1}) converges to zero.
Let (x⋆, z⋆, y⋆) be a saddle point for L0, and define

V^k = (1/ρ)‖y^k − y⋆‖₂² + ρ‖B(z^k − z⋆)‖₂².

We will see that V^k is a Lyapunov function for the algorithm, i.e., a nonnegative quantity that decreases in each iteration. (Note that V^k is not known while the algorithm runs, since it depends on the unknown values z⋆ and y⋆.)
We first outline the main idea. The proof relies on three key inequalities, which we will prove below using basic results from convex analysis along with simple algebra. The first inequality is

V^{k+1} ≤ V^k − ρ‖r^{k+1}‖₂² − ρ‖B(z^{k+1} − z^k)‖₂².    (A.1)
This states that V^k decreases in each iteration by an amount that depends on the norm of the primal residual and on the change in z over one iteration. Iterating (A.1) gives

ρ ∑_{k=0}^∞ ( ‖r^{k+1}‖₂² + ‖B(z^{k+1} − z^k)‖₂² ) ≤ V^0,

which implies that r^k → 0 and B(z^{k+1} − z^k) → 0 as k → ∞. Multiplying the second expression by ρA^T shows that the dual residual s^k = ρA^T B(z^{k+1} − z^k) converges to zero. (This shows that the stopping criterion (3.12), which requires the primal and dual residuals to be small, will eventually hold.)
The second key inequality is

p^{k+1} − p⋆ ≤ −(y^{k+1})^T r^{k+1} − ρ (B(z^{k+1} − z^k))^T (−r^{k+1} + B(z^{k+1} − z⋆)),    (A.2)

and the third inequality is

p⋆ − p^{k+1} ≤ (y⋆)^T r^{k+1}.    (A.3)
Adding (A.2) and (A.3), regrouping terms, and multiplying through by 2 gives

2(y^{k+1} − y⋆)^T r^{k+1} − 2ρ(B(z^{k+1} − z^k))^T r^{k+1} + 2ρ(B(z^{k+1} − z^k))^T (B(z^{k+1} − z⋆)) ≤ 0.    (A.4)
The result (A.1) will follow from this inequality after some manipulation and rewriting.
We begin by rewriting the first term. Substituting y^{k+1} = y^k + ρr^{k+1} gives

2(y^k − y⋆)^T r^{k+1} + ρ‖r^{k+1}‖₂² + ρ‖r^{k+1}‖₂²,
and substituting r^{k+1} = (1/ρ)(y^{k+1} − y^k) in the first two terms gives

(2/ρ)(y^k − y⋆)^T (y^{k+1} − y^k) + (1/ρ)‖y^{k+1} − y^k‖₂² + ρ‖r^{k+1}‖₂².
Since y^{k+1} − y^k = (y^{k+1} − y⋆) − (y^k − y⋆), this can be written as

(1/ρ)(‖y^{k+1} − y⋆‖₂² − ‖y^k − y⋆‖₂²) + ρ‖r^{k+1}‖₂².    (A.5)
We now rewrite the remaining terms, i.e.,

ρ‖r^{k+1}‖₂² − 2ρ(B(z^{k+1} − z^k))^T r^{k+1} + 2ρ(B(z^{k+1} − z^k))^T (B(z^{k+1} − z⋆)),

where ρ‖r^{k+1}‖₂² is taken from (A.5). Substituting

z^{k+1} − z⋆ = (z^{k+1} − z^k) + (z^k − z⋆)

in the last term gives

ρ‖r^{k+1} − B(z^{k+1} − z^k)‖₂² + ρ‖B(z^{k+1} − z^k)‖₂² + 2ρ(B(z^{k+1} − z^k))^T (B(z^k − z⋆)),
and substituting

z^{k+1} − z^k = (z^{k+1} − z⋆) − (z^k − z⋆)

in the last two terms, we get

ρ‖r^{k+1} − B(z^{k+1} − z^k)‖₂² + ρ‖B(z^{k+1} − z⋆)‖₂² − ρ‖B(z^k − z⋆)‖₂².
With the previous step, this implies that (A.4) can be written as

V^k − V^{k+1} ≥ ρ‖r^{k+1} − B(z^{k+1} − z^k)‖₂².    (A.6)
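As an empirical sanity check of (A.1) (our own illustration, not part of the proof), one can run ADMM on a small lasso instance, where the constraint is x − z = 0 so that A = I and B = −I in the notation above, and verify that V^k never increases. A long preliminary run stands in for the exact saddle point, so a small tolerance absorbs the approximation error:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 30, 10
    A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
    lam, rho = 0.1, 1.0

    def admm_iter(x, z, u):
        # scaled-form lasso ADMM with constraint x - z = 0 (so y = rho * u)
        x = np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b + rho * (z - u))
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        u = u + x - z
        return x, z, u

    # run to (near) convergence to approximate z* and y* = rho * u*
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(5000):
        x, z, u = admm_iter(x, z, u)
    z_star, y_star = z.copy(), rho * u.copy()

    # check that V^k = (1/rho)||y^k - y*||^2 + rho||z^k - z*||^2 is nonincreasing
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    V_prev = np.inf
    for _ in range(100):
        x, z, u = admm_iter(x, z, u)
        V = np.linalg.norm(rho * u - y_star) ** 2 / rho + rho * np.linalg.norm(z - z_star) ** 2
        assert V <= V_prev + 1e-9   # tolerance for the approximate saddle point
        V_prev = V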
References
[1] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, Fast image recovery using variable splitting and constrained optimization, IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2345–2356, 2010.
[2] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, An Augmented Lagrangian Approach to the Constrained Optimization Formulation of Imaging Inverse Problems, IEEE Transactions on Image Processing, vol. 20, pp. 681–695, 2011.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK: A portable linear algebra library for high-performance computers. IEEE Computing Society Press, 1990.
[4] K. J. Arrow and G. Debreu, Existence of an equilibrium for a competitive economy, Econometrica: Journal of the Econometric Society, vol. 22, no. 3, pp. 265–290, 1954.
[5] K. J. Arrow, L. Hurwicz, and H. Uzawa, Studies in Linear and Nonlinear Programming. Stanford University Press: Stanford, 1958.
[6] K. J. Arrow and R. M. Solow, Gradient methods for constrained maxima, with weakened assumptions, in Studies in Linear and Nonlinear Programming, (K. J. Arrow, L. Hurwicz, and H. Uzawa, eds.), Stanford University Press: Stanford, 1958.
[7] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
[26] R. H. Byrd, P. Lu, and J. Nocedal, A Limited Memory Algorithm for Bound Constrained Optimization, SIAM Journal on Scientific and Statistical Computing, vol. 16, no. 5, pp. 1190–1208, 1995.
[27] E. J. Candès and Y. Plan, Near-ideal model selection by ℓ1 minimization, Annals of Statistics, vol. 37, no. 5A, pp. 2145–2177, 2009.
[28] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, vol. 52, no. 2, p. 489, 2006.
[29] E. J. Candès and T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
[30] Y. Censor and S. A. Zenios, Proximal minimization algorithm with D-functions, Journal of Optimization Theory and Applications, vol. 73, no. 3, pp. 451–464, 1992.
[31] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
[32] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, BigTable: A distributed storage system for structured data, ACM Transactions on Computer Systems, vol. 26, no. 2, pp. 1–26, 2008.
[33] G. Chen and M. Teboulle, A proximal-based decomposition method for convex minimization problems, Mathematical Programming, vol. 64, pp. 81–101, 1994.
[34] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Review, vol. 43, pp. 129–159, 2001.
[35] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate, ACM Transactions on Mathematical Software, vol. 35, no. 3, p. 22, 2008.
[36] W. Cheney and A. A. Goldstein, Proximity maps for convex sets, Proceedings of the American Mathematical Society, vol. 10, no. 3, pp. 448–450, 1959.
[37] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, MapReduce for machine learning on multicore, in Advances in Neural Information Processing Systems, 2007.
[38] J. F. Claerbout and F. Muir, Robust modeling with erratic data, Geophysics, vol. 38, p. 826, 1973.
[39] P. L. Combettes, The convex feasibility problem in image recovery, Advances in Imaging and Electron Physics, vol. 95, pp. 155–270, 1996.
[40] P. L. Combettes and J. C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal on Selected Topics in Signal Processing, vol. 1, no. 4, pp. 564–574, 2007.
[41] P. L. Combettes and J. C. Pesquet, Proximal Splitting Methods in Signal Processing, arXiv:0912.3522, 2009.
[42] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling and Simulation, vol. 4, no. 4, pp. 1168–1200, 2006.
[79] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics, vol. 9, no. 3, p. 432, 2008.
[80] M. Fukushima, Application of the alternating direction method of multipliers to separable convex programming problems, Computational Optimization and Applications, vol. 1, pp. 93–111, 1992.
[81] D. Gabay, Applications of the method of multipliers to variational inequalities, in Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, (M. Fortin and R. Glowinski, eds.), North-Holland: Amsterdam, 1983.
[82] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Computers and Mathematics with Applications, vol. 2, pp. 17–40, 1976.
[83] M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, M. Booth, and F. Rossi, GNU Scientific Library Reference Manual. Network Theory Ltd., third ed., 2002.
[84] A. M. Geoffrion, Generalized Benders decomposition, Journal of Optimization Theory and Applications, vol. 10, no. 4, pp. 237–260, 1972.
[85] S. Ghemawat, H. Gobioff, and S. T. Leung, The Google file system, ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29–43, 2003.
[86] R. Glowinski and A. Marrocco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Revue Française d'Automatique, Informatique, et Recherche Opérationnelle, vol. 9, pp. 41–76, 1975.
[87] R. Glowinski and P. L. Tallec, Augmented Lagrangian methods for the solution of variational problems, Tech. Rep. 2965, University of Wisconsin-Madison, 1987.
[88] T. Goldstein and S. Osher, The split Bregman method for ℓ1 regularized problems, SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323–343, 2009.
[89] E. G. Golshtein and N. V. Tretyakov, Modified Lagrangians in convex programming and their generalizations, Point-to-Set Maps and Mathematical Programming, pp. 86–97, 1979.
[90] G. H. Golub and C. F. van Loan, Matrix Computations. Johns Hopkins University Press, third ed., 1996.
[91] D. Gregor and A. Lumsdaine, The Parallel BGL: A generic library for distributed graph computations, Parallel Object-Oriented Scientific Computing, 2005.
[92] A. Halevy, P. Norvig, and F. Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, vol. 24, no. 2, 2009.
[93] K. B. Hall, S. Gilpin, and G. Mann, MapReduce/BigTable for distributed optimization, in Neural Information Processing Systems: Workshop on Learning on Cores, Clusters, and Clouds, 2010.
[94] T. Hastie and R. Tibshirani, Generalized Additive Models. Chapman & Hall, 1990.
[95] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, second ed., 2009.
[96] B. S. He, H. Yang, and S. L. Wang, Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities, Journal of Optimization Theory and Applications, vol. 106, no. 2, pp. 337–356, 2000.
[97] M. R. Hestenes, Multiplier and gradient methods, Journal of Optimization Theory and Applications, vol. 4, pp. 302–320, 1969.
[98] M. R. Hestenes, Multiplier and gradient methods, in Computing Methods in Optimization Problems, (L. A. Zadeh, L. W. Neustadt, and A. V. Balakrishnan, eds.), Academic Press, 1969.
[99] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of Convex Analysis. Springer, 2001.
[100] P. J. Huber, Robust estimation of a location parameter, Annals of Mathematical Statistics, vol. 35, pp. 73–101, 1964.
[101] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky, ℓ1 trend filtering, SIAM Review, vol. 51, no. 2, pp. 339–360, 2009.
[102] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for large-scale ℓ1-regularized least squares, IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 606–617, 2007.
[103] K. Koh, S.-J. Kim, and S. Boyd, An interior-point method for large-scale ℓ1-regularized logistic regression, Journal of Machine Learning Research, vol. 8, pp. 1519–1555, 2007.
[104] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[105] S. A. Kontogiorgis, Alternating directions methods for the parallel solution of large-scale block-structured optimization problems. PhD thesis, University of Wisconsin-Madison, 1994.
[106] S. A. Kontogiorgis and R. R. Meyer, A variable-penalty alternating directions method for convex optimization, Mathematical Programming, vol. 83, pp. 29–53, 1998.
[107] L. S. Lasdon, Optimization Theory for Large Systems. MacMillan, 1970.
[108] J. Lawrence and J. E. Spingarn, On fixed points of non-expansive piecewise isometric mappings, Proceedings of the London Mathematical Society, vol. 3, no. 3, p. 605, 1987.
[109] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software, vol. 5, no. 3, pp. 308–323, 1979.
[110] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, vol. 13, 2001.
[111] J. Lin and M. Schatz, Design Patterns for Efficient Graph Algorithms in MapReduce, in Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 78–85, 2010.
[112] P. L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, vol. 16, pp. 964–979, 1979.
[113] D. C. Liu and J. Nocedal, On the Limited Memory Method for Large Scale Optimization, Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.
[130] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[131] A. Nedić and A. Ozdaglar, Cooperative distributed multi-agent optimization, in Convex Optimization in Signal Processing and Communications, (D. P. Palomar and Y. C. Eldar, eds.), Cambridge University Press, 2010.
[132] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, vol. 27, no. 2, pp. 372–376, 1983.
[133] Y. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Paper 2007/76, Catholic University of Louvain, 2007.
[134] M. Ng, P. Weiss, and X. Yuang, Solving Constrained Total-Variation Image Restoration and Reconstruction Problems via Alternating Direction Methods, ICM Research Report, available at http://www.optimization-online.org/DB_FILE/2009/10/2434.pdf, 2009.
[135] J. Nocedal and S. J. Wright, Numerical Optimization. Springer-Verlag, 1999.
[136] H. Ohlsson, L. Ljung, and S. Boyd, Segmentation of ARX-models using sum-of-norms regularization, Automatica, vol. 46, pp. 1107–1111, 2010.
[137] D. W. Peaceman and H. H. Rachford, The numerical solution of parabolic and elliptic differential equations, Journal of the Society for Industrial and Applied Mathematics, vol. 3, pp. 28–41, 1955.
[138] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization, (R. Fletcher, ed.), Academic Press, 1969.
[139] A. Ribeiro, I. Schizas, S. Roumeliotis, and G. Giannakis, Kalman filtering in wireless sensor networks: Incorporating communication cost in state estimation problems, IEEE Control Systems Magazine, vol. 30, pp. 66–86, Apr. 2010.
[140] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[141] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Mathematics of Operations Research, vol. 1, pp. 97–116, 1976.
[142] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, vol. 14, p. 877, 1976.
[143] R. T. Rockafellar and R. J.-B. Wets, Scenarios and policy aggregation in optimization under uncertainty, Mathematics of Operations Research, vol. 16, no. 1, pp. 119–147, 1991.
[144] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer-Verlag, 1998.
[145] L. Rudin, S. J. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D, vol. 60, pp. 259–268, 1992.
[146] A. Ruszczyński, An augmented Lagrangian decomposition method for block diagonal linear programming problems, Operations Research Letters, vol. 8, no. 5, pp. 287–294, 1989.
[147] A. Ruszczyński, On convergence of an augmented Lagrangian decomposition method for sparse convex optimization, Mathematics of Operations Research, vol. 20, no. 3, pp. 634–656, 1995.
[182] H. Zhu, A. Cano, and G. B. Giannakis, Distributed consensus-based demodulation: algorithms and error analysis, IEEE Transactions on Wireless Communications, vol. 9, no. 6, pp. 2044–2054, 2010.
[183] H. Zhu, G. B. Giannakis, and A. Cano, Distributed in-network channel decoding, IEEE Transactions on Signal Processing, vol. 57, no. 10, pp. 3970–3983, 2009.