papr11

Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

784 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO.

2, JUNE 2021

Asynchronous Delayed Optimization With


Time-Varying Minibatches
Haider Al-Lawati , Graduate Student Member, IEEE, Tharindu B. Adikari , Graduate Student Member, IEEE,
and Stark C. Draper , Senior Member, IEEE

Abstract—Large-scale learning and optimization problems are analyzed and implemented. The aim of these methods is to
often solved in parallel. In a master-worker distributed setup, parallelize gradient-based algorithms, which are traditionally
worker nodes are most often assigned fixed-sized minibatches serial in nature, to achieve faster convergence. Despite the
of data points to process. However, workers may take different
amounts of time to complete their per-batch calculations. To deal additional computing and storage resources distributed systems
with such variability in processing times, an alternative approach provide, there are certain factors that can limit the performance
has recently been proposed wherein each worker is assigned a of such systems. For instance, computation speed can unpre-
fixed duration to complete the calculations associated with each dictably vary across workers causing some workers to take
batch. This fixed-time approach results in time-varying minibatch much longer to complete their tasks. These slow nodes, known
sizes and has been shown to outperform the fixed minibatch
approach in synchronous optimization. In this paper we make a as stragglers [1], [2], can significantly slow down the over-
number of contributions in the analysis and experimental verifi- all computation. In real distributed systems, the performance
cation of such systems. First, we formally present a system model of synchronous methods can deteriorate due to the presence
of an asynchronous optimization scheme with variable-sized mini- of stragglers. On the other hand, asynchronous methods are
batches and derive the expected minibatch size. Second, we show inherently robust to the straggler effects. However, they suf-
that for our fixed-time asynchronous approach, the expected
gradient staleness does not depend on the number of workers con- fer from stale gradients as the optimization parameters are
trary to existing schemes. Third, we prove that for convex smooth updated using gradients calculated with respect to out-of-date
objective functions the asynchronous variable minibatch method parameters. Stale gradients can slow down convergence.
achieves the optimal regret and optimality gap bounds. Finally,
we run experiments comparing the performances of the asyn- A. Related Work
chronous fixed-time and fixed-minibatch methods. We present
results for CIFAR-10 and ImageNet. Distributed optimization has been studied for more
than three decades at least since the seminal work of
Index Terms—Asynchronous optimization, distributed algo- Tsitsiklis et al. [3] in which they studied distributed first-
rithms, delayed gradients methods, convergence.
order methods. Asynchronous distributed optimization has
received growing attention in recent years especially in solv-
I. I NTRODUCTION ing machine learning problems as well as online learning
ISTRIBUTED optimization has been extensively studied and prediction problems [4]–[15]. For instance, various tech-
D and implemented in solving various engineering prob-
lems. In distributed optimization, multiple compute nodes
niques to implement asynchronous parallel stochastic gradient
descent (SGD) have been developed and analyzed for dif-
known as workers, often connected to a central node known ferent distributed settings such as multicore [7], [8], [12]
as master, collaborate to solve the problem of interest. The and hub-and spoke [5], [6], [13], [16], [17]. Besides SGD,
master-worker setup is also known as hub-and-spoke. Both parallelizing other gradient-based algorithms has been stud-
synchronous and asynchronous methods have been proposed, ied and analyzed such as asynchronous dual averaging [18],
asynchronous coordinate descent [19], and dual stochastic
Manuscript received October 14, 2020; revised March 19, 2021; accepted
May 8, 2021. Date of publication May 12, 2021; date of current version coordinate ascent [20], to name a few. To mitigate stale gradi-
June 21, 2021. This work was supported in part by Huawei Technologies ent effects and improve convergence, various techniques have
Canada; in part by the Natural Science and Engineering Research Council been developed such as using approximate second-order [21]
(NSERC) of Canada through a Discovery Research Grant; and in part by
the Omani Government Postgraduate Scholarship. (Corresponding author: and using adaptive step size based on the degree of stale-
Haider Al-Lawati.) ness [17]. For more comprehensive survey on asynchronous
The authors are with the Electrical and Computer Engineering methods, we refer the reader to [22].
Department, University of Toronto, Toronto, ON M5S 2E4, Canada
(e-mail: [email protected]; [email protected]; A typical approach in distributed optimization under the
[email protected]). hub-and-spoke configuration is to allocate a fixed amount
This article has supplementary downloadable material available at of work (i.e., a minibatch) to each worker to process in
https://doi.org/10.1109/JSAIT.2021.3079856, provided by the authors. The
material includes the remaining proofs of theorems and lemmas that are not each iteration [23]–[25]. This is known as the fixed-minibatch
found in the main paper and some additional experiments for implementing (FMB) approach. In synchronous optimization, using FMB
the algorithm on a real distributed system along with engineering lessons. can result in a slow convergence with respect to wall-clock
The codes used in this paper’s experiments are available at https://github.
com/StarkDraperLaboratory/asyncTimed time due to stragglers. To alleviate this, fixed-time approaches
Digital Object Identifier 10.1109/JSAIT.2021.3079856 such as Anytime Minibatch (AMB) have been proposed and
2641-8770 
c 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 785

analyzed in [26]–[29]. In this approach, each worker is given problems of the form
a fixed amount of time to compute gradients instead of a
fixed number of gradients to compute. This approach has been minimize F(w) where F(w) := EQ [f (w, x)], (1)
w∈W
shown to be robust against computational stragglers and to
where the expectation is taken with respect to some (unknown)
outperform FMB. In [30], the authors argue that under long
distribution Q over the set X ⊆ Rd . Async-Timed scheme
communication delay, the fixed-time approach can also suffer
uses gradient-based algorithms to solve (1) iteratively. Before
from slow convergence, hence they propose to combine the
delving into the details of the system model and the algorithm,
fixed-time approach with stale gradients methods. The authors
we start in Section II-A by detailing some standard assump-
assume a simple communication model in which communi-
tions made in optimization problems [31]. These assumptions
cation time is deterministic and demonstrate empirically that
will be useful later when analyzing the convergence of the
the stale gradient method can converge faster than the syn-
proposed scheme in Section III. Then, in Section II-B through
chronous method but do not provide a convergence analysis
Section II-E, we detail the assumptions we make about the
for their approach. Furthermore, the fixed-time method under
operation of the optimization system, in Sections II-F and II-G
random communication delay has never been studied which is
describe the optimization method and formally define impor-
the motivation of this paper.
tant system parameters, and in Sections II-H and II-I conclude
with an illustrative example and describe further possible
B. Contribution
points of improvement.
This paper contributes to both the theory and practice of
the fixed-time approach. To the best of our knowledge this is A. Preliminaries and Assumptions
the first work that combines the fixed-time approach with stale
gradients under random communication delay. We summarize We assume that W is a closed convex set and f (., .) is
the contributions as follows. a convex loss function. We assume that the solution of (1),
• First, we formally present a system model of an asyn-
w∗ = arg minw∈W F(w), exists. In addition, we assume that
chronous optimization scheme with variable-sized mini- f (w, x) is differentiable and convex in w for every x ∈ X
batches. To build a framework for our analysis, we which implies that F(w) is also convex and differentiable.
consider a hub-and-spoke distributed system and model Furthermore ∇F(w) = EQ [∇f (w, x)] and ∇f (w, x) is assumed
the compute times that each worker takes to process a to have a bounded variance. That is, there exists σ ≥ 0 such
fixed minibatch as independent and identically distributed that for all w ∈ W, EQ [ ∇f (w, x) − ∇F(w) 2 ] ≤ σ 2 .
(i.i.d.) random variables. We analyze such a system when We also assume that F(w) is J-Lipschitz continuous. That
a fixed compute time is imposed and derive a closed form is, |F(w1 ) − F(w2 )| ≤ J w1 − w2 for all w1 , w2 ∈ W,
expression for the expected minibatch per worker. where · denotes the Euclidean norm. Furthermore, f (w, x)
• We propose a new asynchronous scheme that we term
is L-Lipschitz smooth. In other words, for all w1 , w2 ∈ W and
Async-Timed that is based on the fixed-time approach. for all x ∈ X ,
We assume that workers never idle and that communi- ∇f (w1 , x) − ∇f (w2 , x) ≤ L w1 − w2 , (2)
cation times from workers to the master and vice versa
f (w2 , x) ≤ f (w1 , x) + ∇f (w1 , x), w2 − w1
are i.i.d. random variables. As a result, the master must
L
use stale gradients to update the optimization parame- + w1 − w2 2 . (3)
ters. Gradient staleness of existing asynchronous schemes 2
can grow with the number of workers. In our analysis, A nonnegative function ψ(w) ≥ 0 is said to be 1-strongly
we show that the expected staleness in our scheme is convex if
independent of the number of workers. 1
• We present a convergence analysis for our scheme for
ψ(w2 ) ≥ ψ(w1 ) + g, w2 − w1 + w2 − w1 2 (4)
2
smooth convex objective function. We show that the
for all w1 , w2 ∈ W and for all g ∈ ∂ψ(w) where ∂ψ(w)
regret
√ and the optimality
√ gap are respectively bounded by
is the set of sub-gradients of ψ defined as ∂ψ(w) := {g ∈
O( m) and O(1/ m) asymptotically in the number of
Rd |ψ(w2 ) ≥ ψ(w1 ) + g, w2 − w1 , ∀w1 , w2 ∈ dom(ψ)}.
observed data points, m, as long as the second moment of
Finally, we have the following compactness assumptions.
the staleness parameter is bounded. These are the optimal
There exists a C ∈ R such that ψ(w∗ ) ≤ C2 /2 where w∗ is
bounds [23].
the optimal solution defined in (1) and for all w ∈ W, the
• Finally, we conduct various experiments using CIFAR-10
Bregman divergence between w∗ and w Dψ (w∗ , w) ≤ C2 . For
and ImageNet datasets to demonstrate the performance
any x, y ∈ X , Dψ (x, y) is defined as
of Async-Timed. Our numerical results show that Async-
Timed is almost twice as fast as the asynchronous fixed- Dψ (x, y) = ψ(x) − ψ(y) − ∇ψ(y), x − y) . (5)
minibatch approach. We detail the software engineering
lessons learned in transitioning our theory to practice. B. Distributed System
To solve (1) in a distributed manner, we consider a dis-
II. S YSTEM M ODEL tributed system arranged according to the hub-and-spoke
In this section, we present a distributed system and configuration. The system consists of a master and n work-
the Async-Timed scheme to solve stochastic optimization ers. The primary job of the workers is to calculate gradients

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
786 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

and send them to the master. Workers never idle and con- Then bi (r) = Ni (rTp )−Ni ((r−1)Tp ). Since Ni (t) is a stochastic
tinuously process gradients. Worker i takes Xi (j) seconds to process, bi (r) is a random variable and Async-Timed must
calculate the j-th gradient. Thus, worker  i finishes calculat- work with randomly-distributed variable-sized minibatches.
j
ing its first j gradients at time Si (j) = l=1 Xi (l). Define
jb
Yi (j) := l=(j−1)b+1 i X (l) for some nonnegative integer b. D. Parameter Update
Note that Yi (j) is the time that worker i takes to compute The master updates the optimization parameters every Tu
the j-th minibatch of size b. We assume that Yi (j) are i.i.d. seconds, where Tu is a fixed inter-update time. In the k-th
and that E[Yi (j)] = bμ for all i ∈ [n], where the index set iteration, the master waits for Tu seconds during which it
[n] denotes the set of positive integers from 1 to n, and for all receives gradients from the workers. At the end of the iteration,
j ∈ Z+ where Z+ denotes the set of positive integers. One way the master updates the parameters and broadcasts the update
to think about this model is that while the load on the physical (i)
to all workers. The updated parameter takes Tm (k) seconds to
machine may vary across time, it holds relatively constant over (i)
reach the i-th worker. We assume that the Tm (k) are i.i.d. ran-
the duration of a single batch. This is analogous to the block- dom variables for all i ∈ [n] and for all k ∈ Z+ with expected
fading model in wireless. To formally satisfy this assumption (i)
value E[Tm (r)] = μm , and bounded variance. The subscript
mathematically, one could assume that the Xi (l) are i.i.d. for w in Tw and μw stands for worker while the subscript u in Tu
all i ∈ [n] and for l ∈ Z+ . Another possible scenario is to stands for update.
assume that given a minibatch of size b, the compute times Since all workers start their first transmission at time Tp , we
of each individual gradient are all equal. That is, if Yi (j) = y, assume that the master waits initially for Tp seconds before
then Xi (l) = y/b for (j − 1)b + 1 ≤ l ≤ jb. Let Ni (t) denote the starting to listen to the workers. Recall that Tp and Tu are
number of gradients worker i has processed by time t where deterministic algorithmic parameters that are under our con-
t ≥ 0. Finally, define Ñi (t) to be the total number of mini- (i) (i)
trol while Tm (k) and Tw (r) are random variables that are
batches of size b worker i has computed by time t. Ñi (t) is a outside of our control. The k-th update of the parameters is
renewal process since it is a counting process with i.i.d. inter- broadcasted at time Tp + kTu and reaches worker i at time
arrival times {Yi (j), j = 1, 2, . . . , }. If b = 1, then Yi (j) = Xi (j) (i)
Tp + kTu + Tm (k). The updated parameters are used by work-
for all j ∈ Z+ and Ñi (t) = Ni (t) for all t ≥ 0.
ers at the beginning of the next compute round. We note that
The master receives gradients, updates parameters and
indices r and k are related to Tu and Tp as follows. At any
broadcasts the updated parameters to all workers. Hence, we
given time t ≥ 0, the compute iteration is r = t/Tp  and
can break down the entire optimization process into two sub-
the update iteration is k = max((t − Tp )/Tu , 1), where ·
processes: gradient computation by the workers and parameter
denotes the ceiling function.
updates at the master. Since these are iterative procedures,
we index the gradient computation iterations at the workers
E. Asynchronous Optimization
with r and the parameter updates at the master with k where
both r, k ∈ Z+ . The purpose of having different indices is We solve (1), by finding w∗ in an iterative manner. The
that the rates of gradient computation and parameter updates process starts with a common initial parameter wi (1) =
may differ. For instance, worker i may be calculating the 5- w(1) for all i ∈ [n]. In each compute iteration r, worker
th minibatch (r = 5) while the master may have updated the i calculates bi (r) gradients with respect to wi (r) and data
parameters only once (k = 2). points {xi (r, s)}, s ∈ [bi (r)]; the subscript s stands for “sam-
ple”. At the end of the r-th iteration at time rTp , worker i

C. Gradient Computation bi (r) mi (r) = (gi (r), bi (r)) to the master, where
sends message
gi (r) = s=1 ∇w f (wi (r), xi (r, s)) is the sum of bi (r) gradi-
In Async-Timed, gradient calculations proceeds in itera- ents. The xi (r, s) ∈ X are sampled i.i.d. according to Q. The
tions, termed compute iterations, each of fixed duration of Tp worker also checks if it has received any updates from the
seconds. The fixed compute time Tp ensures that compute iter- master. If not then wi (r + 1) = wi (r), otherwise, wi (r + 1) is
ations across workers stay synchronized as the r-th iteration set to the updated parameters and the worker starts the r +1-th
at any worker begins at time (r − 1)Tp . After each iteration, iteration. We note that due to random communication, workers
worker i ∈ [n] transmits message mi (r) consisting of the num- may receive updates out of order. That is, w(k) may arrive at
ber of calculated gradients and the sum of the gradients to the worker i after w(k+a) has arrived where a is an integer a ≥ 1.
master and starts the next compute iteration. Communication If this happens then workers discard out-of-date updates. We
completes in the background. Message mi (r) takes Tw(i) (r) sec- refer the reader to Section II-G where we formally establish
(i)
onds to reach the master where the Tw (r) are i.i.d. random the relationship between wi (r) at worker i and w(k) at the
(i)
variables for all i ∈ [N] and all r ∈ Z+ with E[Tw (r)] = μw master.
and have bounded variance. The subscripts m in Tm and μm The initial optimization parameters at the master is w(1),
and p in Tp stand for master and processing, respectively. which is the same as the initial parameters at all workers. Let
One way to conceive of the gradient computation process in b(k) denote the total number of gradients the master receives
Async-Timed is by assuming that the system is continuously from all workers in the k-th update iteration. We remind the
processing gradients and then we collect these gradients every reader that k indexes the update iterations at the master while r
Tp seconds and send them to the master. Let bi (r) denote the indexes the iterations at the workers and it may be that k = r.
number of gradients worker i calculates in the r-th iteration. Recall from the discussion in Section II-C that message mi (r)

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 787

TABLE I
L IST OF S YSTEM PARAMETERS

(i)
is received at time rTp + Tw (r) at the master. Recalling that according to the following rules.
(i)
Tw (r) is random and that the master waits initially for Tp
seconds, gi (r) is used in the k-th update if (k − 1)Tu < (r − z(k + 1) = z(k) + g(k),
 
(i)
1)Tp + Tw (r) ≤ kTu . Define the indicator function Ii (k, r) to   1
w(k + 1) = arg min z(k + 1), w + ψ(w) , (7)
nu <
equal 1 if (k−1)T (r−1)Tp +Tw(i) (r) ≤ kTu and 0 otherwise. w∈W α(k + 1)
Then, b(k) = i=1 ∞ r=1 Ii (k, r)bi (r) and the corresponding where α(k) is a sequence of nonincreasing step sizes and ψ(w)
average of the gradients received at update iteration k, denoted is a proximal function that is used to regularize the parameters
by g(k) is w(k). In dual-averaging, the proximal function is nonnegative

1  and 1-strongly convex. When operating in Euclidean space,
n
g(k) = Ii (k, r)gi (r) a natural choice of such regularizer is the l2 -norm, ψ(w) =
b(k)
r=1 i=1 w 2 . The system initializes z(1) = w(1) = 0. Note that we
∞  i (r)
n b used index k as these operations take place at the master.
1
= Ii (k, r)∇w f (wi (r), xi (r, s)).
b(k)
r=1 i=1 s=1
G. Formal Definitions of System Parameters
The optimization parameters wi (r) in gradient To complete the introduction of our system model, we
∇w f (wi (r), xi (r, s)) with Ii (k, r) = 1 is the same as the herein present some formal definitions. We show the rela-
optimization parameters that has been already sent by the tionship between the local optimization parameters, wi (r) at
master some time in the past. Thus, wi (r) = w(ki (r)) worker i with the optimization parameters at the master w(k).
for some ki (r) ≤ k. Define τi (k, r) = k − ki (r), then For ease of reference, we have collected the system parameters
wi (r) = w(k − τi (k, r)). τi (k, r) ∈ [k − 1] is the staleness in Table I.
suffered by the bi (r) gradients computed by worker i in 1) Relating wi (r) to w(k): In our scheme, workers calculate
iteration r and received by the master in the k-th update gradients in the r-th iteration with respect to wi (r). However,
round. For the relationship between wi (r) and w(k) and the as workers receive updates from the master, these wi (r) are
formal definition of staleness, please see Section II-G. We somehow related to w(k) produced by the master. The master
can re-write g(k) as sends w(k) at time Tp +kTu and w(k) is received at worker i at
(i)
1 
i ∞ n b (r) time Tp +kTu +Tm (k). In the first few iterations, wi (r) = w(1)
g(k) = Ii (k, r) until worker i receives its first update. The first update arrives
b(k)
r=1 i=1 s=1 at worker i at time Tp + Tu + Tm(i) (1) while the r-th compute
× ∇w f (w(k − τi (k, r)), xi (k, s)). (6) iteration starts at time (r − 1)Tp . Thus, wi (r) = w(1) for all
r ∈ Z+ such that Tp + Tu + Tm(i) (1) ≥ (r − 1)Tp ; equivalently
(i)
F. Dual Averaging for all r such that r ≤ T1p [Tu + Tm (1)] + 2.
The above system model description can be adopted with For the iterations after receiving the first update, we derive
any gradient-based algorithm. In this paper, we study the con- wi (r) as follows. Before the beginning of each compute round,
vergence of Async-Timed under dual averaging [32], [33] each worker checks if it has received any update and it uses
to solve the stochastic optimization problem in (1). In dual the optimization parameters with the largest index k that it has
averaging, in each update iteration k, the master updates two received before the beginning of the next compute round. That
variables: the primal variable w(k) and the dual variable z(k) is, for all i ∈ [n] and all r ∈ Z+ , r > T1p [Tu + Tm(i) (1)] + 2,

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
788 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

wi (r) = w(k∗ (i, r)) where k∗ (i, r) is waits for Tp seconds. It waits for Tu seconds thereafter to
receive the messages in the first update iteration. During this
arg max k iteration, it receives g2 (1), g1 (1), g3 (1), g1 (2), and g2 (2) and
k∈Z+
their respective sizes. The master computes b(1) = b2 (1) +
subject to Tp + kTu + Tm(i) (k) ≤ (r − 1)Tp . (8)
b1 (1) + b3 (1) + b1 (2) + b2 (2) and the average gradient g(1) =
b(1) (g2 (1) + g1 (1) + g3 (1) + g1 (2) + g2 (2)).
1
2) Staleness Parameter: In the k-th update iteration, the
master receives gi (r) from worker i for some r ∈ Z+ , where Figure 1(b) depicts the remainder of the update phase and
gi (r) has been calculated with respect to wi (r) which in turn the master-to-worker communication. The initial optimization
is equal to w(ki (r)) for some ki (r) ≤ k as explained in the parameter at the master is w(1) which is the same as the ini-
previous section. In particular, for all r ∈ Z+ such that r ≤ tial parameter at each worker i denoted by wi (1). The master
(i)
Tp [Tu + Tm (1)] + 2, wi (r) = w(1), thus all the bi (r) gradients
1 generates and sends the first update (i.e., w(2)) at the end
of the first update iteration at time Tp + Tu . Due to random
in gi (r) are delayed by k − 1 steps. Whereas for r > 1
Tp [Tu +
(i)
communication times, each worker receives w(2) at differ-
Tm (1)]+2, wi (r) = w(ki∗ (r))where ki∗ (r)
is the solution of the (3)
ent times. For instance, it takes Tm (1) seconds for w(2) to
optimization problem in (8). Therefore, the these gradients are reach worker 3 as is illustrated by the solid blue horizon-
delayed by k−ki∗ (r) steps. Putting these together, the staleness tal arrow above the variable-sized boxes of worker 1, while
parameter τi (k, r) for the bi (r) gradients in gi (r) received by (2)
it take the same optimization parameters Tm (1) seconds to
the master from the i-th worker in the k-th update iteration (i.e., (3) (2)
reach worker 2, where Tm (1) ≤ Tm (1). Workers use w(2)
in the time interval [Tp + (k − 1)Tu , Tp + kTu ]) is defined as to compute gradients at the beginning of the next compute
 iteration. For example, workers 1 and 3 receive w(2) dur-
(i)
k−1 if r ≤ T1p Tu + Tm (1) + 2 ing the fourth compute iteration (i.e., r = 4) while worker 2
τi (k, r) = (9)
τi∗ (k, r) otherwise, receives it during the fifth compute iteration. Therefore, both
where τi∗ (k, r) is the solution of the following optimization workers 1 and 3 will update their local optimization parame-
problem ters at the beginning of the fifth round computation. That is,
w1 (5) = w3 (5) = w(2). However, worker 2 will continue using
min k − k (i, r) the same optimization parameters as the previous iterations
k (i,r)≤k
w2 (5) = w2 (4) = w(1). Worker 2 will eventually use w(2) in
1 the sixth iteration, i.e., w2 (6) = w(2). This example illustrates
s.t r > Tu + Tm(i) (1) + 2
Tp why the optimization process is asynchronous as workers may
Tp + kTu + Tm(i) (k) ≤ (r − 1)Tp be calculating gradients with respect to different parameters.
Ii (k, r) = 1. (10) Furthermore, Fig. 1(a) shows that g3 (2) is received at the mas-
ter at the second update iteration (i.e., k = 2). Per Fig. 1(b),
g3 (2) is calculated with respect to w(1), while the most up-to-
H. Example of Async-Timed Scheme
date parameters at the master when g3 (2) is received is w(2).
In this part, we provide an example to illustrate how Async- Hence, all the b3 (2) gradients in g3 (2) are stale gradients with
Timed works to help the reader understand the notations and staleness parameter τ3 (2, 2) = 1.
the descriptions introduced in the previous subsection. In our
example, we consider a system consisting of 3 workers and a I. Improving Async-Timed
master. Fig. 1(a) depicts 6 compute iterations labeled r = 1 to
r = 6, each of duration Tp seconds. The horizontal axis rep- The system presented in this section and the previous exam-
resents wall-clock time. In the r-th compute iteration, worker ple assume that each worker i uses wi (r) throughout the r-th
i computes the sum of bi (r) gradients denoted by gi (r). For iteration. The parameters are updated at the end of the iteration
instance, the second iteration starts at time 2Tp and finishes even if an update is received mid-iteration while gradients
at time 3Tp , during which time worker 1 computes b1 (2) gra- are being computed. This approach is consistent with the
dients whose sum is denoted by g1 (2). The boxes denote the state-of-the-art methods using fixed minibatch. However, if
gradients with their heights indicating the size of the mini- we allow workers to update their parameters even while com-
batch. Worker 1 processes smaller local minibatch in the first puting gradients, the resulting gradients of that iteration will
iteration than it does in the second iteration since the height be partially calculated with respect to an updated version of
of the leftmost top box is smaller than the one adjacent to it. the optimization parameters. Hence, when these gradients are
The solid blue horizontal arrows represent the communi- received by the master, they will have a lower average stal-
cation latency incurred by each message each worker sends eness and hence will have a better impact on convergence.
to the master. For example, worker 2 sends the first message Algorithms 1 and 2 depict the pseudocode of the Async-Timed
(g2 (1), b2 (1)) at time Tp at the end of the first compute round. with the said improvement.
The message takes Tw(2) (1) seconds to reach the master at time
(2) III. A NALYSIS
Tp + Tw (1). This is indicated on the horizontal axis at the
bottom. In this section, we analyze Async-Timed. We first derive
The events at the master node are depicted at the bottom some system properties and then present the convergence anal-
of the figure just above the time axis. The master initially ysis. In our analysis, we assume Tu ≥ Tp . This is because if

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 789

Fig. 1. Illustration of Async-Timed.

Algorithm 1 Async-Timed Algorithm (Worker i) Algorithm 2 Async-Timed Algorithm (Master Node)


1: for all r = 1, 2, · · · do 1: for all k = 1, 2, · · · do
2: initialize gi (r) = 0, bi (r) = 0, s = 0 2: initialize b(k) = 0, g(k) = 0.
3: T0 = current_time 3: T0 = current_time
4: while current_time − T0 ≤ Tp do 4: while current_time − T0 ≤ Tu do
5: bi (r)++, s++. 5: if receive mi (r) = (gi (r), bi (r)) from worker i then
6: sample i.i.d input data xi (k, s) from P 6: b(k) += bi (r)
7: calculate gi (r) = gi (r) + ∇f (wi (r), xi (r, s)) 7: g(k) += gi (r)
8: if received updated parameter w̃ from master then 8: end if
9: set wi (r) = w̃ 9: end while
10: end if 10: g(k) = g(k)/b(k)
11: end while 11: z(k + 1) = z(k) + g(k)
12: send mi (r) = (gi (r), bi (r)) to the master 12: w(k + 1) = arg min w, z(k + 1) + α(k+1) 1 ψ(w)
13: if received updated parameter w̃ from master then 13: send w(k + 1) to all workers
14: set wi (r) = w̃ 14: end for
15: end if
16: end for

AMB when μm = μw = 0, and Tp = Tu while our deriva-


Tu < Tp , then the rate of parameter update 1/Tu is higher than
tions above depends on neither μm nor μw . According to [28],

the rate at which workers produce local minibatches 1/Tp . This
may result in having update iterations that receive no gradients. i E[bi (r)] ≥ nTp /μ − n. Thus, in the many-iteration limit,
i E[bi (r)] is also bounded
 below by nTp /μ − n. However,
Hence, it makes sense to set Tu ≥ Tp . Additionally, to simplify
Corollary 1 shows that i E[bi (r)] approaches nTp /μ which
our analysis, we assume throughout that Tu = mTp , m ∈ Z+ .
is at least as great as nTp /μ − n asymptotically.
According to Lemma 1, we can obtain the desired expected
A. System Properties local (i.e., per-worker) minibatch per iteration by choosing the
We analyze the system proposed in Section II and derive appropriate Tp value. For example, if the desired E[bi (r)] = b,
the expected minibatch each worker produces per iteration, then we should set Tp = bμ.
the expected minibatch the master uses per update, and the Lemma 2: The expected total gradients the master
expected staleness of each received gradients. receives per update iteration in the many-iteration limit is
Lemma 1: The expected number of gradients processed by lim E[b(k)] = nTu /μ.
k→∞
worker i per compute iteration in the many-iteration limit is The proof is provided in Appendix C, but we herein pro-
limr→∞ E[bi (r)] = Tp /μ. vide a sketch. From Corollary 1, the system produces nTp /μ
Corollary 1: The expected total number of gradients pro- gradients every Tp seconds on average in the long term. This
duced by the system per compute iteration in the many- means that the system produces gradients at a rate of n/μ.
iteration limit is nTp /μ. Since each gradient sent by any worker is eventually received
The proofs of the lemma and corollary are found in by the master, then the master also receives gradients at a
Appendices A and B. The result above is also applicable rate n/μ. Hence, over an interval of duration Tu , the master
to the AMB scheme of [28] since Async-Timed reduces to receives on average nTu /μ gradients.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
790 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

This lemma provides a means to set the Tu parameter at k-th iteration is delayed by τ (k, s) steps. Note that τ (k, s) is
the master in order to get the desired expected per-iteration related to τi (k, r) as explained in Appendix E. Let m be the
minibatch. For instance, if we would like to have E[b(k)] = b̃, total number
K of samples observed over the K iterations; i.e.,
then we can achieve this asymptotically by setting Tu = b̃μ/n, m = k=1 b(k). Define the sample path Btot = {b(k), k ∈
where b̃ ≥ n. [K]}. First, we analyze the regret conditioned on the sam-
Lemma 3: Let τi (k, r) be the staleness parameter of the gra- ple path Btot , then we use the result to derive the expected
dients in gi (r) received at the master from the i-th worker in regret bound in general. The expected regret conditioned on
the k-th update iteration. Assume that the optimization param- the sample path is defined as
eters sent by the master are not received out of order; i.e.,  
workers always receive w(k) before receiving w(k + 1) for E R(T)Btot
 K 
all k. Then,   ∗ 
:= E f (w(k + 1), x(k + 1)) − F w Btot . (13)
μm + μw + Tp
lim E[τi (k, r)] = . (11) k=1
k→∞ Tu
Note that the expectation above is over a realization of data
The proof is found in Appendix D. Lemma 3 provides an points sampled in an i.i.d. manner according to Q and over
insight about how to control gradient staleness. Since large the staleness distribution, which is induced by the distribution
staleness is not desirable and may significantly slow down (i) (i)
of Tm and Tw . The number of data points per iteration is
convergence, we can tune the system parameters Tp and Tu to given by the sample path Btot . Let bmax , bmin and bavg be
obtain the desired long-term expected staleness. The expres- the maximum, minimum and the average empirical minibatch
sion in (11) shows that the expected staleness can be reduced sizes observed across all iterations, respectively. Then bmax =
by either reducing Tp or increasing Tu . Decreasing Tp results in max b(k), bmin = min b(k) and bavg = m/K.
a smaller average local minibatches processed by workers and k∈[K] k∈[K]
shorter compute iterations which in turn results in more fre- Theorem 1: Assume that the master receives gradients over
quent transmissions from workers. However, if Tp  μm +μw , K iterations per Btot during which workers sample m i.i.d. data
then lim E[τ ] ≈ (μm + μw )/Tu . Then to reduce the staleness, points from Q in. Let bmax , bmin and bavg be as defined above.
we should increase Tu . However, larger Tu means larger mini- Conditioned on the sample
 path Btot and choosing the step
batches as well as slower update rate with respect to wall-clock size α(k) = 1/(L + 12 η (K + k − 1)/bavg ) for all k ∈ [K] and
time. The choice of Tp and Tu depends on the application for some fixed η > 0, the expected regret for Async-Timed is
and the objective function and how sensitive it is to gradient   LC2 bmax 2LJ 2 bavg
staleness. E R(K)Btot ≤ + 6JC2 + ( + 1)2
2 bmin η2
One of the advantages of Async-Timed proved by Lemma 3  2 
σ C2 η √
is that the expected gradient staleness does not depend on + + m, (14)
the size of the network unlike other schemes (e.g., [7], [12]) ηbmin 2bavg
whose average staleness scales linearly with the number of where the Lipschitz constants J and L and the upper bounds
workers. In Appendix B, we discuss how we can further σ 2 and C are as defined earlier in Section II-A and  is a
reduce the average staleness in Async-Timed and thus improve constant that satisfies E[τ (k, s)2 |Btot ] ≤ 2 .
convergence. The proof of the theorem is provided in Appendix E.
The proof technique follows a similar approach to the one
B. Convergence Analysis developed in [4] for the FMB approach. The proof uses the
Next, we present the convergence analysis of Async-Timed convexity and smoothness properties of F(w) and f (w, x) to
when using dual averaging. We measure the performance of establish some bounds on F(w(k + 1)) − F(w∗ ). We define the
Async-Timed in terms of the expected regret and the expected gradient error in the k-th iteration as e(k) := ∇F(w(k)) − g(k)
optimality gap. Recalling that w∗ = arg minw∈W F(w) the and incorporate dual averaging rules to further bound F(w(k +
regret after K update iterations at the master, denoted by R(K) 1)) − F(w∗ ). Our proofs differ from that of the FMB approach
where in that we have to account for an additional random variable
which is b(t). Furthermore, some of the intermediate steps and

K
  bounds found in [4] are not applicable to our analysis. Thus,
R(K) = f (w(k + 1), x(k + 1)) − F w∗ .
we develop different bounds for the variable minibatch case.
k=1
 See Appendix H in the supplementary materials for details.
Let ŵ(K) := K1 K t=1 w(t + 1) denote the time-average of Theorem 1 provides a necessary (but not √ sufficient) con-
optimization parameters over K iterations. The optimality gap, dition on Tp and Tu to get order-optimal O( m) scaling of
denoted by G(K), is defined as the conditional
√ regret. According to theorem, the conditional
    regret is O( m) as long as  ≤ O(m1/4 ). Since 2 is an
G(K) = F ŵ(k + 1) − F w∗ . (12)
upper bound on the second moment of the staleness parameter,
τi (k, r), which in turn is an upper bound to E[τi (k, r)|B√ 2
tot ] ,
C. Regret and Optimality Gap Analysis a necessary condition to achieve a regret bound of O( m)
Assume that for update iterations k = 1, 2, . . . , K, the mas- is that E[τ (k, s)|Btot ] ≤ O(m1/4 ). Combining this with the
ter receives b(k) gradients and that the s-th gradient in the result from Lemma 3, a necessary condition to achieve an

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 791


optimal regret bound of O( m) is to pick Tp and Tu such asymptotically bounded above by the same value. In both the
that√(μm + μw + Tp )/Tu ≤ O(m1/4 ). To ensure the optimal regret bound and the convergence rate, the deterministic com-
O( m) bound, the terms with bmin , bmax , and bavg must not ponent is of order O(1/m̄) since we used dual averaging. This
dominate. In practice, these sample-path dependent parame- can be further improved by using an accelerated version of
ters would be determined by the characteristics of the workers SGD [31]. This accelerated implementation can help reduce
and their computing speeds, the choices of Tp and Tu , and the the order of the deterministic component of the convergence
communication delay. In Section IV-C, we show how bmin , rate to O(1/m̄2 ). Such analysis can be a possible extension to
bmax , and bavg scale with Tp , Tu , and the communication delay this work.
profile in our experiments.
Theorem 2: Let m̄ be the expected value of the total num- IV. N UMERICAL R ESULTS
ber of data points observed over K iterations. Assume that
In this section, we present the performance of the asyn-
E[bmax ] ≤ β1 and E[1/bmin ] ≤ 1/β2 where β1 and β2
chronous fixed-time approach and compare it with that of
 Let b̄ = E[b(k)]. Then by choosing
are positive constants.
other asynchronous and synchronous schemes. In particular,
α(k) = 1/(L + 12 η (K − k − 1)/b̄) for all k ∈ [K] and for
we compare the performance of our proposed method with that
some fixed η > 0, and letting C, L, J, σ, η and  be as defined
of asynchronous fixed-minibatch (Async-FMB), a.k.a. Asy-
in Theorem 1, the expected regret is bounded by
SG [10]. We also compare the performance of asynchronous
LC2 β1 2LJ 2 b̄ fixed-time approach with its synchronous counterpart.
E[R(K)] ≤ + 6JC2 + ( + 1)2
2 β2 η2
 2 
σ C2 η √ A. Async-Timed vs. Async-FMB
+ + m̄. (15)
ηβ2 2b̄ First we present numerical results comparing the
performance of the asynchronous fixed-time approach
The proof of Theorem 2 is provided in Appendix F of the with that of Async-FMB. In Async-FMB, each worker
supplementary materials. calculates local minibatches of b gradients each and the
Corollary 2: Let b̄, m̄, β1 , and β2 be as defined in master updates the parameters when it receives K batches
Theorem 2. The expected optimality gap for the Async-Timed (Kb gradients). The purpose of the experiments is to compare
scheme is the performance of the fixed-time method with the state-of-

LC2 6JC2 β1 2LJ 2 b̄ the-art fixed minibatch approach in an asynchronous settings.
E[G(K)] ≤ b̄ + + 2 ( + 1)2 Techniques that speed up the convergence of asynchronous
2m̄ m̄ β2 η m̄
 2   optimization by tackling the effect of stale gradients such
σ C η 1
2
as those found in [17] and [21] can also be applied to
+ + √ . (16)
ηβ2 2b̄ m̄ our scheme. We train neural network classifiers using two
datasets: CIFAR-10 [34] and ImageNet [35]. We use the
The proof of Corollary 2 is provided in Appendix G in the TensorFlow package for computations and Open MPI for
supplementary materials. inter-node communications.
The regret bound √ achieved by Async-Timed according to We train a classifier for the CIFAR-10 dataset on the SciNet
Theorem 2 is O( m̄) asymptotically in the average number high-performance compute cluster. We use a convolutional
of data points sampled m̄. This is the same asymptotic bound neural network with 4 convolutional layers followed by 4 feed
achieved by existing FMB-based methods. Furthermore, this fully connected layers. All layers have batch normalization.
is the optimal bound for convex smooth objective functions Each convolutional layer is followed by a max pooling layer.
as mentioned earlier. The main difference between our results The total number of trainable parameters in the network is
and the previous results such as that of [4] is that the second 4.6 million. Our distributed setting consists of n = 4 work-
term in our expression contains an extra constant β1 /β2 and ers and a master. Each node is equipped with a total of 40
the last term has a factor of 1/β2 instead of 1/b̄. Both are due Intel Skylake CPU cores and 202 GB of memory. On SciNet,
to variable minibatches and both terms are minimized when inter-node communication time is very short relative to the
there are no stragglers or when using FMB methods. One may per-iteration compute time needed to train the classifier. We
therefore infer that FMB outperforms the fixed-time approach. therefore induce artificial communication delays to mimic an
However, we remind the reader that this measure of optimality operating environment subject to longer communication laten-
is with respect to the number of steps and not wall-clock time. cies. The transmission of each local minibatch is independently
Indeed the fixed-time approach outperforms FMB with respect delayed with a delay chosen i.i.d. from the uniform distri-
to wall-clock time. The terms with β1 and β2 suggest that the bution over [2, 8] seconds. The parameter updates from the
fixed-time approach trades off the per-iteration performance master to each worker are similarly delayed, according to the
for faster wall-clock time convergence. Furthermore, asymp- same distribution so μm = μw = 5 sec. For Async-Timed we
totically
√ the dominant term is the last term (i.e., the one with set Tp = Tu = 10 sec. This choice results in an average per-
1/ m̄). Hence, the ratio between the regret bound of Async- iteration minibatch size of roughly 240. In Async-FMB, the
Timed to that of FMB asymptotically is bounded above by minibatch size is b = 60 and we set K = 4, meaning the mas-
2σ 2 b̄/β2 +C2 η2
2σ 2 +C2 η2
asymptotically. Similarly, the ratio between the ter receives gradients calculated with respect to a total of 240
convergence rate bound of Async-Time to that of FMB is samples across workers. Fig. 2 illustrates that Async-Timed

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
792 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

Fig. 2. CIFAR-10 experiment results.

is almost twice faster than Async-FMB. For instance, it takes Async-Timed takes gradient steps at a higher rate. This enables
Async-FMB almost 5 hours to achieve the same training error Async-Timed to converge faster with respect to time. However,
rate achieve by Async-Timed in about 2.5 hours. Furthermore, note that in Fig. 3 when we plot loss versus the number of
after 7 hours Async-Timed achieves almost 10% higher testing steps, Async-FMB performs slightly better. The reason can be
accuracy. understood by examining the plot of the cumulative number
1) ImageNet Classification - Without Induced Delays: In of examples that the master has seen versus the number of
our ImageNet experiments we use a master and n = 3 workers. steps. This metric is higher for Async-FMB. The histogram
The processes were run on four separate Tesla P100 16GB in second column-third row shows the number of messages
GPUs hosted in a single machine. Due to hardware limitations the master has received before taking a step. The mode of
we use the down-sampled ImageNet dataset [36] in which the the Async-FMB plot is 3 since Async-FMB master waits to
images are of size 32 × 32 pixels. All other properties such as receive exactly 3 messages. Async-Timed on the other hand
the number of images and the number of classes are the same consumes a smaller amount of messages per step, but takes
as the original ImageNet dataset. We train a wide residual steps more frequently. The bottom-right histogram shows the
network classifier [37] of depth 28 and width 2 without any time taken for worker computations. Computation time for
weight regularization. The master employs gradient descent Async-Timed is slightly higher than Async-FMB due to the
with a constant learning rate 0.1. For Async-FMB we set b = partitioning overhead.
1024 and K = 3. Since there are 3 workers in the system, In our experiments we did not observe any significant
gradients arrive at the master 3 times the rate a single worker computation stragglers, although, there is some non-trivial
produces them. Therefore, in Async-Timed it is sufficient to variation in communication time as observed in the master
set Tu ≥ Tp /3 for master to receive at least one gradient by wait time histogram. This causes the Async-FMB master to
the time limit Tu . For our simulation we set Tp = 290 ms wait for a variable amount of time before taking gradient
and Tu = 200 ms. Unlike CIFAR-10, for ImageNet we do not steps. However, the Async-Timed master takes a step every
induce any type of stragglers. 200 ms, causing the training loss to drop faster than Async-
The results are presented in Fig. 3. The master wait time FMB. Async-Timed also attains higher accuracy versus time.
histogram (bottom-left) shows that almost always the Async- However, when plotting loss versus the number of steps,
Timed master only waits for 200 ms, meaning the master has Async-FMB performs slightly better. If computation strag-
received at least one gradient by that time. Recall that the glers were present we would have seen additional variability
Async-Timed workers process gradients sequentially until the in the master wait time histogram of Async-FMB. In such
processing time limit Tp occurs. In our implementation we a scenario the difference between the convergence rates of
enable workers to receive the master’s updates in between the the two algorithms would be even more apparent. We provide
sequential computations. The pseudo code for Async-Timed additional details of the experiments, results, and implemen-
including this improvement is presented in Section II-I. In tational lessons in Section IV-D and in Appendix J, in the
Fig. 3, the training loss versus step plots are very similar. Since supplementary materials. In Appendix J-C, in the supplemen-
we do not employ weight regularization in our experiments we tary materials we repeat the experiment in Fig. 3 with K = 15
see that the test accuracy plots in Fig. 3 start to decrease after workers. Next, we make note of the following regarding the
reaching their maximums. Note that Async-FMB experiences ImageNet experiments.
the decrease earlier than Async-Timed because the effective • In our simulations we did not observe significant com-
batch size for gradients applied master is higher for Async- putation stragglers. This means there is little or no
FMB. In fairness to Async-FMB, early over fitting can be benefit of employing the time-limited gradient compu-
avoided using techniques such as weight regularization and tation approach. If one compares the performance of
image augmentation. As is suggested by the upper-right sub- the time-limited versus fixed minibatch approaches in a
figure which plots the number of steps versus wall clock time, straggler-free setting, the former will certainly be at a

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 793

Fig. 3. Results with ImageNet dataset. Both algorithms run for 6 hours. The top-5 test accuracy is measured and Step indicates the master iteration k. Line
plots have been smoothed using a Gaussian filter with standard deviation 20. The dashed lines in histograms mark the average in each case.

disadvantage due to the overhead in Async-Timed we In the next section we conduct additional experiments
describe in Section IV-D. To keep the overheads at a with the ImageNet dataset by inducing communication and
minimum we set the number of partitions to 2. This computation delays, and present results.
means the size of a partition described in Section IV-D 2) ImageNet Classification - With Induced Delays: The
2 = 512.
is 1024 experimental setup is same as that for the results presented
• Although the four GPUs reside in the same machine, in Fig. 3, except for the following changes.
communications between master and workers occur 1) We induce additional communication delays for both
through network sockets accessed via Open MPI. This Async-FMB and Async-Timed. This is true for both
mimics a multi-node environment in which inter-node worker to master and master to worker messages. The
communications is conducted via the network. delay is obtained by taking the absolute value of a ran-
• In both Async-FMB and Async-Timed, workers make dom sample from a zero-mean Gaussian distribution
gradient send requests using the non-blocking method with a standard deviation of 50 ms.
Isend in Open MPI. In our implementation, if the send 2) The Async-FMB master only waits for two messages
request for the previous iteration is not complete work- from the three workers. Recall that for the results in
ers wait for it to complete before making a new request. Fig. 3 the Async-FMB master waited for three messages.
This prevents workers from overwhelming the network. 3) We induce additional computation delays. This is done
The average waiting time is around 10 ms which is only by making the workers sleep a small amount of time
around 3.4% of the computation time. before starting computations. In each iteration, work-
• The Async-Timed algorithm requires synchronization of ers independently determine a sleep time duration as
clocks across different nodes. Clocks of most modern follows. With probability 0.6, workers sample from the
clusters are synchronized to within a few milliseconds distribution shown in bottom-right sub-figure in Fig. 4
accuracy. This degree of synchronization is more than and sleep for the sampled duration. This also means that
sufficient for Async-Timed since the time limits Tp and the workers do not induce a sleep time with probability
Tu in applications of interest are on the order of hundreds 0.4. In the same sub-figure, the three modes simulate
of milliseconds. light, average and heavy computation stragglers.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
794 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

Fig. 4. Results from the ImageNet experiments with induced delays. Line plots have been smoothed using a Gaussian filter with standard deviation 20.
The sub-figure in the bottom-right corner plots the distribution used to sample sleep time when inducing computation stragglers. The vertical arrow at zero
represents that workers do not induce any delay with probability 0.4.

4) Since there is enough variability in computation times already been shown to outperform the synchronous fixed-
due to induced delays, we increase the number of minibatch (FMB) approach. Therefore, we choose to compare
microbatches in time-limited computation to P = 8. Async-Timed with AMB instead of the synchronous FMB. We
Recall that in the ImageNet experiments presented in run our experiments on the SciNet computing cluster using a
Section IV we set P = 2 due to absence of computa- master and 4 workers. As mentioned before, each node in
tion stragglers. With P = 8 the Async-Timed workers our cluster is equipped with a total of 40 Intel Skylake CPU
have enough granularity to check on the time limit while cores and 202 GB of memory. The choices of Tp , Tu , and
keepting the overheads at a minimum. the induced communication delays are the same as those we
The results are presented in Fig. 4. We make following defined in Section IV. In AMB, we also set Tp = 10; thus
observations about the results. The training loss versus step workers in both schemes have the same fixed compute time.
plots for both algorithms are very similar. However, training Since AMB is synchronous, the master waits to receive from
loss versus wall clock time decreases faster for Async-Timed. all workers before it updates the parameters. Therefore, long
The gradient staleness histogram indicates that Async-Timed communication delays slow down the convergence of AMB
experiences a lower staleness. In this case, the difference with respect to wall-clock time. We observe in Fig. 5 that
between the average staleness in each algorithm is even more AMB converges faster than Async-Time with respect to the
apparent than the results presented in Fig. 3. number of steps. This is expected since AMB uses fresh gra-
dients whereas Async-Timed uses stale gradients which slow
convergence. However, AMB, which is designed to be robust
B. Async-Timed vs. AMB against compute stragglers, suffers from long communication
Figure 5 compares the performance of Async-Timed with delays. On the other hand, Async-Timed is robust against such
its synchronous counterpart known as AMB [28] on the delays. Hence, Async-Timed converges faster than AMB with
CIFAR-10 dataset. We remind the reader that AMB has respect to wall-clock time.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 795

Fig. 5. Async-Timed vs. AMB performances on CIFAR-10.

Fig. 6. bmax , bmin , bavg , and bmax /bmin for the CIFAR-10 with induced uniform communication delay (left) and induced exponentially distributed
communication delay (right).

C. Effects of Tp , Tu , and Communication Delay on bmax , computation. The gradient computation phase at each worker
bmin , and bavg in the Async-Timed algorithm is inherently sequential. It
In our analysis in Section III, we observe that the expected requires the samples to be processed sequentially, so that a
regret depends on several parameters that are system specific. processing time limit Tp can be imposed. However, this is
These are bmin , bmax , and bavg . We conduct experiments on somewhat in conflict with how TensorFlow-like packages work
SciNet computing cluster using n = 4 workers and record our on top of modern hardware accelerators. The low-level oper-
observations for these quantities. As mentioned before, each ations in TensorFlow are implemented to run in parallel. This
node is equipped with a total of 40 Intel Skylake CPU cores is commonly referred to as batch processing. For example
and 202 GB of memory. We run our algorithm for several val- assume a minibatch of size 32. A common requirement in
ues of the processing time, Tp ∈ {5, 10, 15, 20} sec. In all of machine learning applications is to apply the same function
these experiments we set Tu = Tp . For each choice of Tp , we to all 32 samples. The function may be composed of basic
run the algorithm for 100 update iterations and recorded the operations such as vector-matrix multiplication or convolution.
maximum (bmax ), the minimum (bmin ), and the average (bavg ) Batch processing is similar to the single instruction, multiple
minibatches across these 100 iterations. We run our experi- data (SIMD) concept in computer design. In other words, we
ments for two types of communication delay profiles: uniform need to perform the same set of operations for all available
distribution over [3, 7] and the exponential distribution with data. Hardware accelerators such as GPUs provide an efficient
mean 5. Fig. 6 depicts our observed values for these param- way of doing this. Since GPUs consist of multiple cores, each
eters for both communication delay profiles. In both cases, sample can be independently processed in a separate core.
all parameters scale almost linearly with Tp . Hence, for a Although the exact implementation depends on the software
given Tp , these parameters are bounded and thus the regret package, conceptually, the 32 input elements are allocated 32
bound in (14)√and in (15) cores and are processed in parallel. This means that if a GPU
√ are dominated by the term with,
respectively, m and m̄ asymptotically. Furthermore, the is used for computations, the time required to process 1 sample
ratio bmax /bmin is observed to be bounded above as can be is same as that for 32 samples. Since hardware accelerators are
read from the right y-axis. specifically designed to exploit parallelizable computations,
their operation is fundamentally different from the assumption
D. Implementing Time-Limited Gradient Computation underlying our Async-Timed algorithm.
In this section we describe our implementation of In our implementation we find middle ground between
the Async-Timed algorithm that relies on time limited sequential and batch processing by implementing a time limit

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
796 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

The claim is proved by setting t = (r − 1)Tp to get E[Ni (t +


Tp ) − Ni (t)] = E[Ni (rTp ) − Ni ((r − 1)Tp )] = E[bi (r)] and
observing that r → ∞ implies t → ∞.

Fig. 7. The data minibatch is partitioned into P microbatches, each of A PPENDIX B


size N/P. P ROOF OF C OROLLARY 1
The claim follows since the compute iterations are synchro-
in the data input layer. Figure 7 helps explain the method. nized across the workers
 and the total number of gradients in
Consider an input minibatch of size N. We partition the input the r-th iteration is ni=1 bi (r). Hence,
into P “micro”-batches, each containing N/P data points.
Here, P, the number of microbatches, is a hyper-parameter. 
n 
n
nTp
lim E bi (r) = lim E[bi (r)] = .
The sequential behaviour of time-limited computation can be r→∞ r→∞ μ
i=1 i=1
emulated by instructing TensorFlow to sequentially process
the microbatches starting from the first.
Each microbatch can be processed batch-wise on the GPUs. A PPENDIX C
Out of the total number of P microbatches, we expect slow P ROOF OF L EMMA 2
workers to complete only a small number by the time limit Tp . Let Ni (t) be the total number of gradients worker i has
On the other hand, fast workers will finish a larger number of calculated by time t. Let Mi (t) be the number of gradients the
microbatches. In practice we tune N and P so that even the fast master has received from worker i by time t. Note that the k-th
workers cannot process all P microbatches by the time limit. iteration at the master starts at t0 (k)
 = Tp +(k −1)Tu and ends
This type of sequential behaviour we need can be achieved at t1 (k) = Tp +kTu . Thus, b(k) = ni=1 Mi (t0 (k))−Mi (t1 (k)) .
through the tf.while_loop function in TensorFlow. j
Let Si (j) = l=1 Xi (l), then we can express Ni (t) and Mi (t)
Our approach of partitioning each minibatch into P micro- as
batches introduced additional overhead which we need to


quantify. While larger P provides increased time granular-
Ni (t) = I Si (j) ≤ t , (17)
ity, it also means a smaller microbatch size which may not
j=1
make the most efficient use of computing hardware. We
numerically measure how much overhead is induced by the and
partitioning. The results are presented in Appendix J-A, in the ∞

supplementary materials. Mi (t) = I Si (j) + Tw(i) (j) ≤ t (18)
j=1
V. C ONCLUSION
In conclusion, we study a system deploying the fixed-time where I(·) is the indicator function.
approach for distributed optimization with random inter-node For compactness we will drop the index i from the rest of
communication delays. The resulting scheme is an asynchronous derivations unless necessary. Let δ ≥ 0. Then, in the limit the
optimization method with time-varying minibatch sizes. We expected number of gradients the master receives from worker
analyze the system and derive the expected minibatch size i in the interval [t, t + δ] is
per worker and the expected gradient staleness. Moreover, we 
present a convergence analysis of our scheme for smooth convex lim E M(t + δ) − M(t)
t→∞ ∞
objective functions.
√ We show that we achieve the optimal regret
√ = lim E I S(j) + Tw (j) ≤ t + δ
bound of O( m) and the optimal optimality gap of O(1/ m) t→∞
j=1
asymptotically in the number of data sampled, m, assuming the ∞
 
staleness parameter has bounded variance.We run experiments − I S(j) + Tw (j) ≤ t (19)
using real datasets and observe that Async-Timed outperforms j=1
the asynchronous fixed minibatch approach. ∞  
 
= lim E I S(j)+ ≤ t + δ − I S(j) + Tw (j) ≤ t
t→∞
j=1
A PPENDIX A
(20)
P ROOF OF L EMMA 1 ∞ 
 
The expected number of gradients processed by worker i in = lim E I S(j) + Tw (j) ≤ t + δ
t→∞
an interval of duration Tp is E[Ni (t + Tp ) − Ni (t)]. Recall that j=1
  
Ñi (t) is the total number of minibatches worker i has computed 
− I S(j)+Tw (j) ≤ t Tw (j) = z fTw (j) (z)dz (21)
by time t and that Ñi (t) is a renewal process. Also, observe that
  
∞ 
(Ñi (t) − 1) ≤ Ni (t)/b ≤ Ñi (t). Since limt→∞ Ñi (t)/t = 1/(bμ) = lim E I S(j) + Tw (j) ≤ t + δ
by the strong law of renewal theorem, then limt→∞ Ni (t)/t = t→∞
1/μ. Now we can apply Blackwell’s theorem [38] to get
j=1
  

− I S(j) + Tw (j) ≤ t Tw (j) = z fTw (z)dz
   Tp
lim E Ni t + Tp − Ni (t) = .
t→∞ μ (22)

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 797

 
∞ ∞
 
Therefore, taking the expectation in (27) and solving for E[τ ]
= lim E I S(j) ≤ t + δ − z − I S(j) ≤ t − z
t→∞ we get
j=1 j=1
× fTw (z)dz μm + μw + Tp
   E[τ ] = . (28)
Tu
= lim E N(t + δ − z) − N(t − z) fTw (z)dz (23) Now, we consider the case Tu = mTp where m is a positive
t→∞
 integer. In this case, worker i calculates li local minibatches
δ δ
= fTw (z)dz = (24) using w(k) before receiving w(k + 1). The master receives the
μ μ
j-th local minibatch from the i-th worker where 1 ≤ j ≤ li at
where (21) follows by conditioning on Tw (j) using fTw (j) (z), time t1,j = t0 + Tm(i) (k) + δi + jTp + Tw(i) (r + j).
the probability density function of Tw (j), (22) follows since Suppose that the last local minibatch of gradients with
the Tw (j) are i.i.d. and hence fTw (j) (z) = fTw (z) for all j, (23) respect to w(k) is received by the master from worker i before
follows from (17), and (24) follows from Blackwell’s theorem. the master updates the optimization parameter is the j -th local
Now we let t = Tp + (k − 1)Tu and δ = Tu . minibatch. Then, the master uses all the local minibatches up

lim E M(t + δ) − M(t) to j -th one to calculate the next update. The last local mini-
t→∞ batch waits for δj seconds. Note that E[δj ] = 12 Tp . To see
    Tu
= lim E M Tp + kTu − M Tp + (k − 1)Tu = . (25) this, imagine slicing the update iteration of duration Tu into m
k→∞ μ sub-iterations each of length Tp . Then suppose that the master
 
Finally, since b(k) = ni=1 Mi (t0 (k)) − Mi (t1 (k)) at the end of each sub-iteration sends the last update w(k). In
 n   other words, the master broadcasts w(k) m times. This setup
         still results in the same staleness as no parameter updates hap-
lim E b k = lim E Mi t0 k − Mi t1 k
k→∞ k→∞ pen during these sub-intervals. Now, this setup resembles the
i=1
case Tu = Tp and the expected residual life of the last local

n
     
= lim E Mi t0 k − Mi t1 k minibatch is 12 Tp . Therefore
k→∞
i=1
Tm(i) (k) + δi + j Tp + Tw(i) (r + j) + δj
n
Tu nTu τ +1 =
= = . (26) Tu
μ μ μm + 12 Tp + E[j ]Tp + μw + 12 Tp − Tu
i=1
E[τ ] = . (29)
Tu
A PPENDIX D
P ROOF OF L EMMA 3 Now observe that E[j ] is the expected number of local mini-
batches worker i calculates over an interval of length Tu . Thus,
In this part, we prove Lemma 3. Assume that the master E[j ] = TTup = m. By substituting this in (29) and noticing that
broadcasts w(k) at time t0 = Tp + (k − 1)Tu for k ≥ 2. Worker Tu = mTp , we get
i receives w(k) at time ti (k) = t0 +Tm(i) (k). Furthermore, assume
μm + μw + Tp
that the worker receives w(k) while it is calculating gradients E[τ ] = . (30)
in the r-th iteration and the time remaining to finish the com- Tu
pute iteration is δi . The worker starts the (r + 1)-st iteration Note that equation (30) is the expected number of updates
(i)
at time t0 + Tm (k) + δi . to the optimization parameters that takes place at the master
First, we consider the case Tu = Tp . The worker finishes from the time parameters w(k) for some k ≥ 2 is dispatched
calculating gradients with respect to w(k) and transmits them to all workers till the time gradients with respect to this w(k)
(i)
to the master at time ti (r + 1) = t0 + Tm (k) + δi + Tp . The are received from worker i and used by the master. Therefore,
master receives these gradients from worker i at time t1 = E[τ ] is the expected staleness suffered by gradients received
t0 + Tm(i) (k) + δi + Tp + Tw(i) (r + 1). from worker i provided that these gradients are not calculated
Assume that at time t1 , the optimization parameter at the with respect to w(1), the initial value. This is because workers
master is w(k + τ ). The master waits for δm before using the may receive the first update (i.e., w(2)) from the master when
gradients received at time t1 from worker i. That is the (k + r > 2, and therefore, the first few local minibatches sent by
τ + 1)-st update takes place at time t3 = t1 + δm . In this case, worker i are all with respect to w(1) while the calculations
gradients with respect to w(k) are used to calculate w(k+τ +1), above assumes that worker i has received w(2) or later updates.
and hence the gradient staleness parameter in this case is τ . Therefore, E[τ ] = E[τi (k, r)] after some initial transient time.
On the other hand, τ + 1 are the total number of updates
the master makes in the interval t(t0 , t3 ]. A PPENDIX E
P ROOF OF T HEOREM 1
t3 − t 0 t1 + δm − t0
τ +1 = = In this section, we prove the convergence of our asyn-
Tu Tu
(i) (i) chronous scheme. The proof follows similar steps to the one
Tm (k) + δi + Tp + Tw (r + 1) + δm
= . (27) used in [4] except that we make the necessary adjustments and
Tu establish some intermediate results to suit our scheme that uses
Note that δi is the residual life of a deterministic interval variable minibatch per iteration. To begin, we simplify equa-
of duration Tp . Hence, E[δi ] = 12 Tp . Similarly, E[δm ] = 12 Tp . tion (6) which gives the average of the gradients the master

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
798 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

receives in iteration k. Since the master receives b(k) gradients Next, we substitute (37) in (36) to get
in each iteration, (6) can be written as a sum of b(k) terms as
 F(w(k + 1)) − F(w∗ )
   
b k
         ≤ z(k + 1), w(k + 1) − w∗ − z(k), w(k) − w∗
g k = ∇f w k − τ k, s , x k, s (31) 1 
+ ψ(w(k + 1)) − ψ(w(k))
s=1 α(k)
where each pair (τ (k, s), x(k, s)) in (31) is mapped to a unique   1  
+ e(k), w(k + 1) − w∗ − Dψ w(k + 1), w(k)
(τi (k, r), xi (k, s)) with Ii (k, r) = 1 in (6). α(k)
In each iteration k, the master uses g(k) to evaluate w(k +1) L
+ w(k) − w(k + 1) .
2
(38)
according to (7). We observe that the received gradients suffer 2
from two sources of errors. First, the gradients are calculated Let the sequence of step sizes be defined as α(k) =
with respect to f (w, x) instead of F(w). Furthermore, the gra- (L+η(k))−1 for some nondecreasing sequence η(t) ≥ 0. Since
dients may be stale as they may have been calculated with ψ(w) is 1-strongly convex, then from (4) and (5), we have
respect to w(l) for l < k. Hence, we define e(k) as the k-th
2 w(k) − w(k + 1)
1 2 ≤ D w(k + 1), w(k) . Substituting this
ψ
iteration gradient error. In other words, upper bound along with α(k)−1 = L+η(k) in (38) and observ-
e(k) := ∇F(w(k)) − g(k). (32) ing that the the term LDψ (w(k + 1), w(k)) cancel, the upper
bound in (38) becomes
Now we use the convexity and Lipschitz smoothness of
F(w) to upper bound the term F(w(k)) − F(w∗ ) F(w(k + 1)) − F(w∗ )
   
  ≤ z(k + 1), w(k + 1) − w∗ − z(k), w(k) − w∗
F(w(k)) − F(w∗ ) ≤ ∇F(w(k)), w(k) − w∗    
  + e(k), w(k + 1) − w∗ − η(k)Dψ w(k + 1), w(k)
= ∇F(w(k)), w(k + 1) − w∗ 
+ (L + η(k)) ψ(w(k + 1)) − ψ(w(k)) . (39)
+ ∇F(w(k)), w(k) − w(k + 1) (33)
  By summing both sides of (39) from k = 1 to k = K
≤ ∇F(w(k)), w(k + 1) − w∗
and observing that certain terms telescope and that
+ F(w(k)) − F(w(k + 1)) z(1), w(1) − w∗ = 0 since z(1) = 0, we get
L 2
+ w(k) − w(k + 1) , (34) 
K
2  
F w(k + 1) − F(w∗ )
where (33) follows from the convexity of F while (34) follows k=1
from L-Lipschitz continuity of ∇F as detailed in Section II-A.  
≤ z(K + 1), w(K + 1) − w∗ + (L + η(K))ψ(w(K + 1))
According to the dual averaging update rule in (7), g(k) =

K
z(k + 1) − z(k). Substituting this in (32), we can express the − (L + η(1))ψ(w(1)) + ψ(w(t)) η(k − 1) − η(k)
actual gradient in the k-th iteration as ∇F(w(k)) = e(k)+z(k+ k=2
1) − z(k). Substituting this in (34) to get the following after 
K
  K
 
some rearrangements − η(k)Dψ w(k + 1), w(t) + e(k), w(k + 1) − w∗ .
    t=1 k=1
F w(k + 1) − F w∗ (40)
 
≤ e(k) + z(k + 1) − z(k), w(k + 1) − w∗ 
L 2 Note that since w(K + 1) minimizes z(K + 1), w +
+ w(k) − w(k + 1)
α(K+1) ψ(w) over all possible values of w ∈ W according
1
 2   
= z(k + 1) − z(k), w(k + 1) − w∗ + e(k), w(k + 1) − w∗ to (7), we can upper bound the first two terms on the right
L 2 hand side of (40) by substituting w∗ for w(K + 1). This elim-
+ w(k) − w(k + 1) (35) inates the first inner product while the second term becomes
 2 ∗
  ∗
 (L + η(K))ψ(w∗ ). Furthermore, the third and fourth terms are
= z(k + 1), w(k + 1) − w − z(k), w(k + 1) − w
  L 2 both negative so we can remove them to make the right hand
+ e(k), w(k + 1) − w∗ + w(k) − w(k + 1) . (36) side larger to get
2
According to the primal variable update rule in (7), w(k) = 
K
 
F w(k + 1) − F(w∗ )
arg minw∈W { z(k), w + α(k)−1 ψ(w)}. Using Lemma 1 in
k=1
Appendix I, in the supplementary material, we get the follow-
ing inequality after replacing the terms x+ , x, z and, A from 
K
 
≤ (L + η(K + 1))ψ(w∗ ) − η(k)Dψ w(k + 1), w(k)
Lemma 1 with w(k), w(k+1), z(k), and [α(k)]−1 , respectively, k=1
 
− z(k), w(k + 1) − w∗ 
K
 
+ e(k), w(k + 1) − w∗ . (41)
  1 
≤ − z(k), w(k) − w∗ + ψ(w(k + 1)) − ψ(w(k)) k=1
α(k)
1   Define g̃(k, s) := ∇F(w(k − τ (k, s)). The sum of the inner
− Dψ w(k + 1), w(k) . (37)
α(k) products term in (41) can be expressed as

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 799


K 
  E g̃(k) − g(k)|b(k)
e(k), w(k + 1) − w∗
1 
b(k) 
k=1 
  =E g̃(k, s) − g(k)b(k)
K 
b(k)
1 b(k)
s=1

= ∇F(w(k)) − g̃(k, s), w(k + 1) − w
1 
b(k)
b(k) 
k=1
 b(k)
s=1
 = E ∇F(w(k−τ (k, s)))−f (w(k−τ (k, s)), x(t, s)) = 0
 K  b(k)
1 s=1

+ g̃(k, s) − g(k), w(k + 1) − w (46)
b(k)
t=1 s=1 

K b(k) since ∇F(w) = E ∇f (w, x) by definition. Conditioned on
1  
= ∇F(w(k)) − g̃(k, s), w(k + 1) − w∗ b(k), the expectation of the second term on the right hand side
b(k) of (45) can be bounded using the bounded variance assumption
k=1 s=1
K  b(k) to get
1
+ g̃(k, s) − g(k), w(k + 1) − w∗ . (42)  2  
 1 
b(k)
b(k)  
k=1 s=1 E  g̃(k, s) − g(k) b(k)
b(k)
Since α(k) is nonincreasing,
√ then there exists an η > s=1
 
0 such that α(k) ≤
2 bavg
√ for all k ∈ [K]. Assume  1 
b(k)
 
2 
η K =E  g̃(k, s)−f (w(k−τ (k, s)), x(k, s))  b(k)
that the E[τ (k, s)2 |Btot ] ≤ 2 . Let Y be the first term b(k)
s=1
on
K theright hand side of (42). In other words, Y := σ2
b(k) 1 ∗ ≤ .
s=1 b(k) ∇F(w(k)) − g̃(k, s), w(k + 1) − w .
k=1
(47)
b(k)
Then, using Lemma H.1 from Appendix H in the supple-
mentary material, E Y Btot is bounded as Note that the bound above follows since x(k, s) are sampled
i.i.d and hence the cross terms are all zero. Combining (46)
  bmax 2LJ 2  2 and (47) the expectation of (45) conditioned on b(k) is
E Y Btot ≤ 6JC2 + 2 bavg  + 1 . (43)
bmin η  
b(k)  
1 
∗ 
Next, we look at the summand of the second term on the E g̃(k, s) − g(k), w(k + 1) − w b(k)
b(k)
right hand side of (42). s=1
 
1 σ2 η(k)  
2 
∗ 

b(k) ≤ + E w(k) − w  b(k) . (48)
1 2η(k) b(k) 2
g̃(k, s) − g(k), w(t + 1) − w∗
b(k) Taking the expectation on both sides in (42) conditioned
s=1
! " on the sample path Btot := b(k), k ∈ [K] , substituting (43)
1 
b(k)
= g̃(k, s) − g(k), w(k) − w∗ and (48), and summing over k
b(k)
s=1
! " 
K

1 
b(k)
E e(k), w(k + 1) − w∗ Btot
+ g̃(k, s) − g(k), w(k + 1) − w(k) . (44) k=1
b(k)
s=1
bmax LJ 2   1 σ2 K
b(k) ≤ 6JC2 + 2 bavg  + 1)2 +
Let u = s=1 g̃(k) − g(k), v = w(k + 1) − w(k), and bmin 2η 2η(k) b(k)
k=1
h(u) = η(k) 2
2 u . Let the the convex conjugate of h be
∗ 
K
η(k)  2 
h (v) = 2η(k) v 2 . Then according to Fenchel-Young inequal- E w(k) − w∗  Btot .
1
+ (49)
ity, we have u, v ≤ h(u) + h∗ (v). So, we can upper bound k=1
2
the second term on the right hand side of (44) and get
Now we can calculate E[R(K)|Btot ] from (41) and substitute

b(k)
1
in the bounds from (49) to get
g̃(k, s) − g(k), w(k + 1) − w∗   
s=1
b(k) E R K |Btot
 
!
1 
b(k) "
1
K
    ∗ 
≤ ∗
g̃(k, s) − g(k), w(k) − w + =E F w t + 1 − F w Btot
b(k) 2η(t) k=1
s=1
 2      bmax
 1  η(k)  2 ≤ L + η K + 1 ψ w∗ + 6JC2
b(k)
  ∗
×  g̃(k, s) − g(k) +  w(k) − w  . (45) bmin
b(k)  2
s=1 2LJ 2  2   
K
   
+ 2 bavg  + 1 − η k E Dψ w k + 1 |Btot
The expected value of the inner product on the right hand η
k=1
side of (45) is zero since if we condition on g(1), . . . , g(k −  
K
1 σ 2 
K
η k    2
1), then w(k) is not random. Hence it does not depend on +    + E w k − w∗  |Btot . (50)
g̃(k, s) − g(k). Moreover, 2η k b k 2
k=1 k=1

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
800 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 2, NO. 2, JUNE 2021

2
Finally, we use the facts that ψ(w∗ ) ≤ C2 , m = Kbavg , [12] M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are fast,”
and 12 w(k +√1) − w(k) 2 ≤ Dψ (w(k + 1), w(k)), and substi- in Advances in Neural Information Processing Systems. Red Hook, NY,
USA: Curran Assoc., 2009, pp. 2331–2339.
η (K+k−1)/bavg
tute η(k) = 2 to derive the regret bound as stated [13] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized stochas-
tic gradient descent,” in Advances in Neural Information Processing
in Theorem 1 Systems. Red Hook, NY, USA: Curran Assoc., 2010, pp. 2595–2603.
   [14] M. Nokleby and W. U. Bajwa, “Stochastic optimization from dis-
   η 2K/bavg C2 bmax tributed streaming data in rate-limited networks,” IEEE Trans. Signal
E R K |Btot ≤ L + + 6JC2 Inf. Process. Netw., vol. 5, no. 1, pp. 152–167, Mar. 2019.
2 2 bmin
[15] M. Nokleby, H. Raja, and W. U. Bajwa, “Scaling-up distributed pro-
2LJ bavg 
2 2 cessing of data streams for machine learning,” Proc. IEEE, vol. 108,
+ +1 no. 11, pp. 1984–2012, Nov. 2020.
η2
 [16] Q. Ho et al., “More effective distributed ML via a stale synchronous
K
bavg σ2 parallel parameter server,” in Advances in Neural Information Processing
+ #  bk Systems. Red Hook, NY, USA: Curran Assoc., 2013, pp. 1223–1231.
k=1 η K+k−1 [17] W. Zhang, S. Gupta, X. Lian, and J. Liu, “Staleness-aware async-SGD
for distributed deep learning,” in Proc. Int. Joint Conf. Artif. Intell.,
C2 $ η# % bmax 2016, pp. 2350–2356.
≤ L+ 2m/b2avg + 6JC2 [18] J. Duchi, M. I. Jordan, and B. McMahan, “Estimation, optimization, and
2 2 b parallelism when data is sparse,” in Advances in Neural Information
√ min
2LJ 2 bavg  2 σ2 m Processing Systems. Red Hook, NY, USA: Curran Assoc., 2013,
+ +1 + pp. 2832–2840.
η2 bmin η [19] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordi-
avg  2
2b nate descent for L1-regularized loss minimization,” Jun. 2011. [Online].
1 2 b max 2LJ
≤ C L + 6JC2 + +1 Available: arXiv:1105.5379.
2 bmin η 2 [20] K. Tran, S. Hosseini, L. Xiao, T. Finley, and M. Bilenko, “Scaling up
 2  stochastic dual coordinate ascent,” in Proc. Int. Conf. Knowl. Disc. Data
σ C2 η √
+ + m. (51) Min., 2015, pp. 1185–1194.
ηbmin 2bavg [21] S. Zheng et al., “Asynchronous stochastic gradient descent with delay
compensation,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 4120–4129.
[22] B. M. Assran, A. Aytekin, H. R. Feyzmahdavian, M. Johansson, and
ACKNOWLEDGMENT M. G. Rabbat, “Advances in asynchronous parallel and distributed
optimization,” Proc. IEEE, vol. 108, no. 11, pp. 2013–2031, Nov. 2020.
The authors thank Jason Lam and Zhenhua Hu of Huawei [23] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal dis-
Technologies Canada for technical discussions. They thank tributed online prediction using mini-batches,” J. Mach. Learn. Res.,
vol. 13, no. 1, pp. 165–202, Jan. 2012.
Compute Canada for providing their high performance com- [24] K. I. Tsianos and M. G. Rabbat, “Efficient distributed online prediction
puting platform to run experiments throughout this project. and stochastic optimization with approximate distributed averaging,”
IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 4, pp. 489–506,
Dec. 2016.
R EFERENCES [25] M. Nokleby and W. U. Bajwa, “Distributed mirror descent for stochas-
tic learning over rate-limited networks,” in Proc. IEEE Int. Workshop
[1] P. Garraghan, X. Ouyang, R. Yang, D. McKee, and J. Xu, “Straggler Comput. Adv. Multi-Sensor Adapt. Process., Dec. 2017, pp. 1–5.
root-cause and impact analysis for massive-scale virtualized cloud dat- [26] N. Ferdinand, B. Gharachorloo, and S. C. Draper, “Anytime exploitation
acenters,” IEEE Trans. Services Comput., vol. 12, no. 1, pp. 91–104, of stragglers in synchronous stochastic gradient descent,” in Proc. IEEE
Jan./Feb. 2019. Int. Conf. Mach. Learn. Appl., Dec. 2017, pp. 141–146.
[2] D. Wang, G. Joshi, and G. Wornell, “Using straggler replication [27] N. Ferdinand and S. C. Draper, “Anytime stochastic gradient descent:
to reduce latency in large-scale parallel computing,” SIGMETRICS A time to hear from all the workers,” in Proc. Annu. Allerton Conf.
Perform. Eval. Rev., vol. 43, no. 3, pp. 7–11, Dec. 2015. Commun. Control Comput., 2018, pp. 552–559.
[3] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous [28] N. Ferdinand, H. Al-Lawati, S. C. Draper, and M. S. Nokleby, “Anytime
deterministic and stochastic gradient optimization algorithms,” IEEE MiniBatch: Exploiting stragglers in online distributed optimization,” in
Trans. Autom. Control, vol. AC-31, no. 9, pp. 803–812, Sep. 1986. Proc. Int. Conf. Learn. Represent., 2019.
[4] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic [29] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani,
optimization,” in Advances in Neural Information Processing Systems. “Robust and communication-efficient collaborative learning,” in
Red Hook, NY, USA: Curran Assoc., 2011, pp. 873–881. Advances in Neural Information Processing Systems. Red Hook, NY,
[5] A. Agarwal and J. C. Duchi, “Large scale distributed deep networks,” USA: Curran Assoc., 2019, pp. 8386–8397.
in Advances in Neural Information Processing Systems. Red Hook, NY, [30] H. Al-Lawati, N. Ferdinand, and S. C. Draper, “Anytime minibatch with
USA: Curran Assoc., 2012, pp. 1223–1231. stale gradients,” in Proc. Conf. Inf. Sci. Syst., Mar. 2019, pp. 1–5.
[6] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and [31] G. Lan, “An optimal method for stochastic composite optimization,”
stale gradients can win the race: Error-runtime trade-offs in distributed Math. Program., vol. 133, pp. 365–397, Jun. 2012.
SGD,” in Proc. Int. Conf. Artif. Intell. Stat., 2018, pp. 803–812. [32] Y. Nesterov, “Primal-dual subgradient methods for convex problems,”
[7] N. Feng, B. Recht, C. Re, and S. J. Wright, “HOGWILD!: A lock-free Math. Program., vol. 120, pp. 221–259, Apr. 2009.
approach to parallelizing stochastic gradient descent,” in Advances in [33] L. Xiao, “Dual averaging method for regularized stochastic learning
Neural Information Processing Systems. Red Hook, NY, USA: Curran and online optimization,” in Advances in Neural Information Processing
Assoc., 2011, pp. 693–701. Systems. Red Hook, NY, USA: Curran Assoc., 2009, pp. 2116–2124.
[8] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson, “An asynchronous [34] A. Krizhevsky, V. Nair, and G. Hinton. (2014). The CIFAR-10 Dataset.
mini-batch algorithm for regularized stochastic optimization,” IEEE [Online]. Available: http://www.cs.toronto.edu/kriz/cifar.html
Trans. Autom. Control, vol. 61, no. 12, pp. 3740–3754, Dec. 2016. [35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
[9] G. Lan and Y. Zhou, “Asynchronous decentralized accelerated stochastic A large-scale hierarchical image database,” in Proc. Conf. Comput. Vis.
gradient descent,” 2018. [Online]. Available: arXiv:1809.09258. Pattern Recognit., 2009, pp. 248–255.
[10] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochas- [36] P. Chrabaszcz, I. Loshchilov, and F. Hutter, “A downsampled variant
tic gradient for nonconvex optimization,” in Advances in Neural of ImageNet as an alternative to the CIFAR datasets,” 2017. [Online].
Information Processing Systems. Red Hook, NY, USA: Curran Assoc., Available: arXiv:1707.08819.
2015, pp. 2737–2745. [37] S. Zagoruyko and N. Komodakis, “Wide residual networks,” 2016.
[11] A. Nedić, D. P. Bertsekas, and V. S. Borkar, “Distributed asyn- [Online]. Available: arXiv:1605.07146.
chronous incremental subgradient methods,” Stud. Comput. Math., [38] R. G. Gallager, Discrete Stochastic Processes. Boston, MA, USA:
vol. 8, pp. 381–407, Dec. 2001. Springer, 2012.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.
AL-LAWATI et al.: ASYNCHRONOUS DELAYED OPTIMIZATION WITH TIME-VARYING MINIBATCHES 801

[39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classifica- Stark C. Draper (Senior Member, IEEE) received
tion with deep convolutional neural networks,” in Proc. NIPS, 2012, the B.S. degree in electrical engineering and the
pp. 1097–1105. B.A. degree in history from Stanford University
and the M.S. and Ph.D. degrees from the
Massachusetts Institute of Technology (MIT). He is
a Professor of Electrical and Computer Engineering
with the University of Toronto (UofT) and was
Haider Al-Lawati (Graduate Student Member, an Associate Professor with the University of
IEEE) received the B.Sc. and M.Sc. degrees in engi- Wisconsin, Madison, WI, USA. As a Research
neering and mathematics from Queen’s University, Scientist, he has worked with the Mitsubishi Electric
Kingston, ON, Canada, in 2005 and 2007, respec- Research Labs, Disney’s Boston Research Lab,
tively. He is currently pursuing the Ph.D. degree Arraycomm Inc., the C. S. Draper Laboratory, and Ktaadn Inc. He com-
with the Electrical and Computer Engnieering pleted postdoctoral research with UofT and with the University of California,
Department, University of Toronoto. From 2008 to Berkeley, CA, USA. He spent the 2019–2020 academic year on sabbati-
2015, he worked with the Telecom Industry as an cal at the Chinese University of Hong Kong, Shenzhen, China, and visiting
RF Planning Engineer, a Project Manager, and the the Canada-France-Hawaii Telescope in Hawaii, USA. His research interests
Head of a Department working on various mobile include information theory, optimization, error-correction coding, security, and
technologies, such as GSM, WCDMA, LTE, and the application of tools and perspectives from these fields in communica-
WiMAX. Throughout his academic and industrial careers, he received a tions, computing, learning, and astronomy. He is a recipient of the NSERC
number of scholarships and awards. His current research interests include Discovery Award, the NSF CAREER Award, the 2010 MERL President’s
distributed computing, distributed optimization, and machine learning. Award, and teaching awards from the UofT, the University of Wisconsin,
and MIT. He received an Intel Graduate Fellowship, Stanford’s Frederick E.
Terman Engineering Scholastic Award, and a U.S. State Department Fulbright
Fellowship. He Chairs the Machine Intelligence major at UofT, is a member
Tharindu B. Adikari (Graduate Student Member, of the IEEE Information Theory Society Board of Governors, and serves as
IEEE) received the B.Sc. degree from the University the Faculty of Applied Science and Engineering representative on the UofT
of Moratuwa, Sri Lanka, in 2014, and the M.A.Sc. Governing Council.
degree from the University of Toronto, Canada, in
2018, where he is currently pursuing the Ph.D.
degree in electrical and computer engineering. From
2014 to 2016, he worked with LSEG Technology
(formerly, MillenniumIT), Sri Lanka, primarily on
stock market surveillance systems. His current
research interests include communication efficient
distributed optimization, source coding, and machine
learning.

Authorized licensed use limited to: East Carolina University. Downloaded on June 24,2021 at 02:02:13 UTC from IEEE Xplore. Restrictions apply.

You might also like