Hierarchical Work Stealing on Manycore Clusters
Seung-Jai Min1
Costin Iancu1
Katherine Yelick1,2
1Lawrence Berkeley National Laboratory and 2University of California at Berkeley
Abstract
Partitioned Global Address Space languages like UPC offer a convenient
way of expressing large shared data structures, especially for irregular
structures that require asynchronous random access. But the static SPMD
parallelism model of UPC does not support divide and conquer parallelism
or other forms of dynamic parallelism. We introduce a dynamic tasking
library for UPC that provides a simple and effective way of adding task
parallelism to SPMD programs. The task library, called HotSLAW, pro-
vides a high-level API that abstracts concurrent task management details
and performs dynamic load balancing. To achieve scalability, we propose
a topology-aware hierarchical work stealing strategy that exploits locality
in distributed-memory clusters. HotSLAW extends state-of-the-art techniques for shared- and distributed-memory implementations with two mechanisms: Hierarchical Victim Selection (HVS) finds the
nearest victim thread to preserve locality and Hierarchical Chunk Selec-
tion (HCS) dynamically determines the amount of work to steal based on
the locality of the victim thread. We evaluate the performance of our run-
time on shared- and distributed-memory systems using irregular applica-
tions. On shared memory, HotSLAW provides performance comparable or
better than hand tuned OpenMP implementations. On distributed memory
systems, the combination of Hierarchical Victim Selection and Hierarchi-
cal Chunk Selection provides better performance than state of the art ap-
proaches using a random victim selection with a StealHalf strategy.
1. Introduction
The Unified Parallel C (UPC) [22] language provides the conve-
nience of a global address space with the locality control needed
for high performance and scalability. It has been shown to perform
well on both shared and distributed memory machines, and the one-
sided communication model that results from reading and writing
to the global address space is both lightweight and takes advantage
of modern interconnect features. However, UPC uses a Single Pro-
gram Multiple Data (SPMD) model of parallelism in which a fixed
set of threads run throughout the program execution. While that
SPMD model matches the underlying hardware, it does not directly
support applications that involve dynamic tasking. Dynamic task-
ing is especially important for problems in which the work and par-
allelism unfold dynamically throughout the program execution. In
this case, it is impossible to determine a priori how to divide work
among processors so as to evenly balance the load. With dynamic
tasking, computations are explicitly and dynamically created and a
runtime scheduler provides load balancing services; work-stealing is the most popular runtime scheduling approach.
In this paper we present a dynamic tasking library called Hot-
SLAW for UPC. HotSLAW provides a high-level API that abstracts
concurrent task management details and performs dynamic load
balancing. Besides API expressiveness and conciseness, a major
goal of our library is to achieve scalable and robust performance on
distributed memory multicore clusters. The implementation takes
advantage of the one-sided data access mechanism to implement
work-stealing efficiently on large scale systems, and also builds
on prior art in dynamic load balancing for both shared and dis-
tributed memory machines. This includes a large body of research
on dynamic tasking and work-stealing on shared-memory architec-
tures, such as Cilk [8], Intel Threading Building Blocks, Microsoft
Task Parallel Library, and OpenMP 3.0. Distributed memory task-
ing at scale has been demonstrated using one-sided communica-
tion paradigms such as the Aggregate Remote Memory Copy Inter-
face [14] (ARMCI) [3, 4] or Unified Parallel C [17, 21].
Our work builds in particular on that of Guo et al. [10], who
show that scalable work stealing on shared memory systems is
achieved using a dynamic task scheduling policy with locality
awareness. Their Scalable Locality-aware Adaptive Work-stealing
scheduler (SLAW) combines work-first and help-first scheduling
policies while ensuring bounded memory usage. HotSLAW ex-
tends the principles behind these state-of-the art approaches to ad-
dress the inherently hierarchical nature of memory locality in mod-
ern systems, without assuming application or language level local-
ity hints. SLAW allows stealing to occur only within a place (a
programmer defined locality domain), while HotSLAW provides
a hardware topology aware hierarchical victim selection strategy
where the algorithm tries to steal work from the lowest hierarchy
level (e.g. socket) before moving up the hierarchy (e.g. inter-socket
or inter-node). For distributed memory, we extend the implementa-
tion proposed by Dinan et al. [4]. SLAW steals a fixed amount of
work and Dinan’s implementation steals a fixed ratio of work per
event. HotSLAW uses a hierarchical chunk selection approach to
dynamically select the amount of work stolen based on the “local-
ity” hierarchy used.
We have evaluated our approach using programs with irregular
parallelism on shared- and distributed memory systems up to 256
cores. The benchmarks and the experimental setup are described
in Section 4. On shared memory systems, HotSLAW is able to
match or exceed by as much as 109% the performance of manually
optimized OpenMP implementations. On distributed memory sys-
tems, hierarchical victim selection is able to improve performance
by up to 52% when compared to the default random victim selec-
tion policy. Enabling hierarchical chunk selection provides perfor-
mance improvements up to 122% when compared to the StealHalf
method for distributed memory introduced by Dinan et al. [4]. Our
experimental results indicate that using a three-level (intra-socket, inter-socket, and inter-node) hierarchy, HotSLAW is able to match hand
tuned performance obtained using an exhaustive search of the space
of the parameters that determine the performance of work-stealing:
scheduling policy, space bounds, victim selection and stealing gran-
ularity.
The rest of the paper is organized as follows. We describe the
task parallel programming API for UPC in Section 2. Section 3
explains the design and implementation of HotSLAW. We present
the experimental setup in Section 4 and a thorough performance
evaluation of HotSLAW in Section 5. We discuss related work in
Section 6, followed by a conclusion in Section 7.
2. Unified Parallel C
Unified Parallel C (UPC) is an extension to ISO C 99 that provides
a Partitioned Global Address Space (PGAS) abstraction using Sin-
gle Program Multiple Data (SPMD) parallelism. The memory is
partitioned into a task-local heap and a global heap. All tasks can ac-
cess memory residing in the global heap, while access to the local
heap is allowed only for the owner. The global heap is logically
partitioned between tasks and each task is said to have local affin-
ity with its sub-partition. Global memory can be accessed either
using pointer dereferences (load and store) or using bulk communi-
cation primitives (memget(), memput()). The language provides
synchronization primitives, namely locks, barriers and split phase
barriers. Most of the existing UPC implementations also provide
non-blocking communication primitives, e.g. upc_memget_nb().
The language also provides a memory consistency model which
imposes constraints on message ordering.
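As an illustration of these mechanisms, the following minimal UPC sketch (not part of the paper; the array size and blocking factor are illustrative) declares a shared array in the global heap, gives each thread affinity to one block, and performs a one-sided bulk read of a remote block into private memory:

#include <upc.h>

#define BLOCK 256
/* Shared array in the global heap: each thread has affinity to one
   BLOCK-sized sub-partition (illustrative blocking factor). */
shared [BLOCK] double data[BLOCK * THREADS];

double buf[BLOCK];   /* private (task-local) buffer */

int main(void) {
  int neighbor = (MYTHREAD + 1) % THREADS;
  /* One-sided bulk read of the neighbor's block into private memory. */
  upc_memget(buf, &data[neighbor * BLOCK], BLOCK * sizeof(double));
  upc_barrier;       /* language-provided synchronization */
  return 0;
}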
2.1 UPC Task Library API
taskq_t *taskq_all_alloc(int nFunc, void *func1,
                         int input_size1, int output_size1, ...);
int  taskq_put(taskq_t *taskq, void *func, void *in, void *out);
int  taskq_execute(taskq_t *taskq);
int  taskq_steal(taskq_t *taskq);
void taskq_wait(taskq_t *taskq);
void taskq_fence(taskq_t *taskq);
int  taskq_all_isEmpty(taskq_t *taskq);
Figure 1. Task Library API
We provide a library API for instantiating and controlling dy-
namic task parallelism from UPC applications. Figure 1 shows a list
of the most commonly invoked functions. The taskq_all_alloc
function allocates a distributed data structure to represent a global
task queue; this is a collective function call where arguments have
to match across all threads. The first input argument, nFunc, is the
number of different task functions that can be put into this task
queue. Each function is represented by a triplet which consists of a
pointer to the proper function, an input data size, and an output data
size. Logically, the task queue is split into a thread private portion
(or local) and a public portion.
The taskq_put function creates a task and puts it into the queue when space is available. As described later, we implement a combination of the work-first and help-first scheduling policies with bounded queues; when queue space is unavailable, tasks are executed inline (serialized) inside taskq_put.
The taskq_execute function removes a task from the head of the private task queue and executes it, while the taskq_steal function attempts to steal tasks from the public queue. We further
discuss the stealing strategy in Section 3.2.
The library also provides primitives for inter-task synchroniza-
tion and ordering. The taskq_fence is a non-blocking operation
used to enforce ordering between task sub-graphs: it ensures that
any task spawned after calling fence will not be scheduled for exe-
cution until all the tasks spawned before the fence have completed.
The taskq_wait is a blocking call that returns when all the spawned tasks have completed. Internally, taskq_wait calls the execute and steal functions to ensure progress.
The taskq_all_isEmpty function is a collective function for termination detection; it returns 1 if the global task queue is empty and 0 otherwise. In addition, we provide configuration functions (taskq_set_*) to specify user-defined stealing hierarchies and other configurable behavior.
The complete description of the task library API can be found
at http://upc.lbl.gov/task.shtml.
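To make the API flow concrete, the sketch below (not from the paper) shows a typical SPMD driver: a collective queue allocation, a root task seeded by thread 0, and a blocking wait during which idle threads execute and steal. The header name taskq.h, the sizes passed to taskq_all_alloc, and the problem size are assumptions; FIB is the task function from Figure 2.

#include <stdio.h>
#include <upc.h>
#include "taskq.h"               /* assumed header exposing the API in Figure 1 */

taskq_t *taskq;                  /* global queue handle referenced by FIB */
void FIB(int *n, int *out);      /* task function from Figure 2           */

int main(void) {
  /* Collective allocation; arguments must match on all UPC threads:
     one task function, 4-byte input, 4-byte output (illustrative). */
  taskq = taskq_all_alloc(1, (void *)FIB, sizeof(int), sizeof(int));

  int n = 30, result = 0;        /* illustrative problem size */
  if (MYTHREAD == 0)
    taskq_put(taskq, (void *)FIB, &n, &result);   /* seed the root task */

  /* Blocking wait: idle threads execute local tasks and steal remote
     ones until every spawned task has completed. */
  taskq_wait(taskq);

  if (MYTHREAD == 0)
    printf("fib(%d) = %d\n", n, result);
  return 0;
}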
2.2 Programming Example
01: void FIB( int *n, int *out ) {
02:   int n1 = *n-1;
03:   int n2 = *n-2;
04:   int x, y;
05:   if (*n < CUTOFF) {
06:     FIB_serial(n, out);
07:     return;
08:   }
09:   taskq_put(taskq, FIB, &n1, &x);
10:   taskq_put(taskq, FIB, &n2, &y);
11:   taskq_wait(taskq);
12:   *out = x + y;
13: }
Figure 2. UPC task example for Fibonacci
Recursive divide-and-conquer functions and parallel-for loops
(a.k.a. parallel do-all loops) with potential load imbalance are good
candidates for dynamic tasking. Figure 2 shows a Fibonacci num-
bers generator implemented using our API. The task function FIB spawns two child tasks at lines 9 and 10 and waits at line 11 until these children complete. While waiting inside the taskq_wait function, the runtime library will consume tasks or try stealing from other threads if the local task queue is empty.
Tasks are specified at function granularities, with a signature
containing pointers to input and output data buffers, e.g.:
void my func(void *input, void *output);
In contrast with the API used by Dinan [4], which uses global addresses to specify input/output data, we use proper C void* pointers in order to improve interoperability with other programming models. Input and output data are stored in contiguous memory space; they are copied into the library space when a task is created and travel with the task whenever migration occurs.
To exploit the PGAS support of UPC, data fields in the input and output regions can be either values or references. However, a reference field must be a pointer to shared data, which remains valid after steals. A task function can read its
local variables, its input parameter, shared variables, and file scope
constant variables and can write to its local variables, its output
parameter, and shared variables accessed within its task function.
Our library maintains and at termination propagates the content
of the task output buffer. The content of this region is undefined
until task termination.
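A minimal sketch of this convention (ours; the names SUM_BLOCK and sum_in_t are illustrative) shows a task whose input record mixes copied values with a pointer-to-shared reference, which stays valid even if the task is stolen and executed on another thread or node:

#include <upc.h>

/* Hypothetical input record for a task that sums a block of a shared
   array. */
typedef struct {
  shared double *vec;   /* reference: must point to shared data        */
  int            lo;    /* values: copied with the task when it moves  */
  int            hi;
} sum_in_t;

void SUM_BLOCK(void *in, void *out) {
  sum_in_t *a = (sum_in_t *)in;
  double s = 0.0;
  for (int i = a->lo; i < a->hi; i++)
    s += a->vec[i];     /* may be a remote (one-sided) read            */
  *(double *)out = s;   /* the output buffer is propagated by the
                           library at task termination                 */
}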
3. HotSLAW Implementation
Work-stealing has been widely popularized by Cilk [8] and ex-
tensively studied. Implementations use two scheduling policies:
work-first and help-first. Under work-first, the scheduler executes
the spawned task eagerly and leaves the parent task to be stolen.
Since we provide a library based approach, in our context a work-
first policy amounts to invoking the task function directly from the
taskq_put library call. With the help-first policy, new tasks are
made available for stealing while the execution continues with the
parent task.
It is reported [9, 10] that the work-first policy is good when stealing is rare, but its implementation can cause stack overflow for applications with very deep task execution trees. The help-first policy is favorable when stealing is frequent and can be implemented with low stack usage, but it is not space-efficient in general. Like any work-first implementation, ours has the potential for stack overflow.
SLAW [10] presents an implementation using an adaptive pol-
icy: tasks are generated and initially executed using a help-first
policy until several constraints are met, at which point the sched-
uler switches to a work-first policy. In order to reduce overhead,
SLAW re-evaluates the spawning policy after a “certain” number
of spawn operations. In our implementation, we use the SLAW
approach with simpler heuristics to implement queue bounds and
policy switches.
SLAW uses three parameters to guide policy selection: S repre-
sents a threshold for the number of active stack frames, after which
the implementation switches from work-first to help-first; F is the threshold on the number of tasks in the queue after which the implementation switches from help-first to work-first; and INT is the number of spawn events between periodic re-evaluations of the stealing policy. A typical execution starts with help-first to build paral-
lelism and then switches to work-first in order to reduce the over-
head of task creation and execution. If the stack bound S is reached,
the execution reverts to help-first and increases heap memory pres-
sure.
In our implementation we use bounded queues, which are the
equivalent of the parameter F in SLAW. The execution starts with
help-first, and when queue space is exhausted reverts to work-first.
Help-first is resumed whenever queue space becomes available and
the scheduling policy is re-evaluated at each task spawning point.
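A minimal sketch of this policy switch (ours, not the library's actual source; local_queue_full, enqueue_private, and run_task_inline are hypothetical helpers) captures the decision made inside taskq_put: enqueue under help-first while bounded queue space remains, otherwise execute the task inline in work-first style.

typedef struct taskq taskq_t;   /* opaque queue type from Figure 1 */

/* Hypothetical helpers standing in for internal queue operations. */
int  local_queue_full(taskq_t *q);
void enqueue_private(taskq_t *q, void *func, void *in, void *out);
void run_task_inline(void *func, void *in, void *out);

/* Sketch of the bounded-queue policy switch inside taskq_put. */
int taskq_put(taskq_t *q, void *func, void *in, void *out) {
  if (!local_queue_full(q))
    enqueue_private(q, func, in, out);  /* help-first: make stealable   */
  else
    run_task_inline(func, in, out);     /* serialize: execute in place  */
  return 0;
}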
3.1 Task Queue Implementation
Achieving scalability of work-stealing implementations in distributed-memory environments requires careful data structure design that minimizes locking on the critical path, together with fast task creation and synchronization primitives. Dinan et al. [4] discuss the
implementation details of work-stealing in large scale systems and
our implementation builds directly on their work.
Figure 3 shows the block diagram of a global task queue dis-
tributed on the participating UPC threads. In our implementation, a
“global” task queue is stitched from per-thread task queues, man-
aged locally to reduce contention. In the global view, contention
overhead is further reduced by splitting the per-thread queue into
a private region and a public region [3, 17]; tasks are stolen from
the public region and accesses to it are serialized through a lock.
Accesses to the private portion of the task queue do not require
locking. Moving tasks between the public and the private region is
accomplished by updating the split pointer that marks the boundary
between these two regions. If space is available, tasks are inserted
into the private queue region. When this region is full, tasks are
made available for stealing by moving them into the public queue
region.
The private portion of the task queue is manipulated like a stack
in a LIFO fashion, as executing the most recently created task has a
higher chance of exploiting cache locality. The public portion of the
queue is manipulated using a FIFO policy, as the oldest task in the
queue has the potential to contain the largest amount of work in the
task graph. Shared memory runtimes such as Cilk or SLAW steal one task at a time, while distributed memory implementations [4] advocate stealing half the number of available tasks. As discussed later, our implementation uses hardware topology information to provide an adaptive strategy to control the chunk size.

Figure 3. A Global Task Queue Structure in the UPC task library. (Diagram omitted: per-thread queues for Threads 0-3 stitched into a global task queue, with head, split, and tail pointers separating tasks in the shared region from tasks in the private region.)
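The split-queue design can be sketched as follows (ours; a shared-memory analogue using a pthread mutex, whereas the actual distributed implementation serializes remote steals through a lock over one-sided accesses). The owner pushes and pops at the private end without locking; thieves take the oldest task from the public end under the lock; moving the split pointer, which transfers tasks between the two regions, is omitted.

#include <pthread.h>

#define QCAP 64   /* illustrative per-thread queue capacity */

typedef struct {
  void *task[QCAP];
  int head;             /* oldest public task (steals take from here)  */
  int split;            /* boundary between public and private regions */
  int tail;             /* next free private slot (owner pushes here)  */
  pthread_mutex_t lock; /* serializes access to the public region      */
} local_queue_t;

/* Owner-side push: no locking, operates on the private region. */
static int push_private(local_queue_t *q, void *t) {
  if (q->tail == QCAP) return 0;      /* full: caller serializes the task */
  q->task[q->tail++] = t;
  return 1;
}

/* Owner-side pop: LIFO from the private end for cache locality. */
static void *pop_private(local_queue_t *q) {
  return (q->tail > q->split) ? q->task[--q->tail] : NULL;
}

/* Thief-side steal: FIFO from the public end, under the lock. */
static void *steal_public(local_queue_t *q) {
  void *t = NULL;
  pthread_mutex_lock(&q->lock);
  if (q->head < q->split) t = q->task[q->head++];
  pthread_mutex_unlock(&q->lock);
  return t;
}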
Besides using an adaptive stealing granularity, we provide extensions to support dependent task execution. The wait_task_queue data structure stores tasks that have been spawned but cannot be scheduled because they depend on tasks not yet completed. The sync_node_map is a map whose key is a task and whose value is a sync_node entry that associates the key task with the list of tasks that depend on it. The sync_node entry has a reference counter and a list of dependent tasks stored in the wait_task_queue. When a task with an output dependency completes, the runtime library accesses the sync_node_map to find the corresponding sync_node entry and decrements its reference counter by one. If the reference counter reaches zero, the dependent tasks of that sync_node become ready; they are moved from the wait_task_queue to the ready_task_queue so that they can be scheduled for execution.
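The dependency-tracking logic can be summarized by the following sketch (ours; field and function names such as sync_node_t and move_to_ready_queue are illustrative, and the fixed-size dependent list is a simplification of the wait_task_queue):

typedef struct task task_t;
void move_to_ready_queue(task_t *t);   /* hypothetical helper */

typedef struct {
  int     ref_count;         /* outstanding producer tasks                */
  task_t *dependents[16];    /* dependent tasks parked until producers end */
  int     n_dependents;
} sync_node_t;

/* Invoked by the runtime when a task with an output dependency finishes. */
void on_task_completion(sync_node_t *node) {
  if (--node->ref_count == 0) {
    /* All producers done: move dependents to the ready queue. */
    for (int i = 0; i < node->n_dependents; i++)
      move_to_ready_queue(node->dependents[i]);
    node->n_dependents = 0;
  }
}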
3.2 Hierarchical Work Stealing
Most of the existing research on work-stealing for shared-memory
systems ignores locality concerns or the non-uniform access, hier-
archical nature of current architectures. SLAW extends work steal-
ing implementations for shared memory with a notion of locality.
In their approach, locality is under programmer control and it is
expressed at the application level using the place construct. Work-
stealing occurs only within a place.
Places provide for a two-level abstract view of the system, local
or non-local, while architecturally locality has a more hierarchi-
cal nature: shared cache, shared NUMA domain, shared node and
inter-node. Yan et al. [23] introduce the hierarchical place trees abstraction but do not implement the runtime mechanisms required for proper support of work stealing. Our work provides these mecha-
nisms together with a thorough performance evaluation of hierar-
chical work stealing.
Allowing programmers to specify locality domains provides
useful optimization information but can lead to non-portable pro-
grams where logical locality domains do not naturally map to hard-
ware. HotSLAW aims to provide support for specifying and com-
posing hierarchical work-stealing schedulers. As the first prototype,
we provide a hardware topology aware hierarchical scheduler. This
hardware aware scheduling is orthogonal and easily composable
with application specific notions of locality.
We propose and evaluate mechanisms for victim (task queue) selection and heuristics to guide the granularity of steal events: the
Hierarchical Victim Selection (HVS) policy determines from which
thread a thief thread steals work and the Hierarchical Chunk Se-
lection (HCS) policy dictates how much work a thief thread steals
from the victim thread.
Benchmark  | Tasks Created | Avg. Task Time | Input / Output Size | Task Creation Ovhd. | Steals | Tasks Serialized (%)
Fibonacci  | 2,692,537     | 1.163 µs       | 4 / 8 bytes         | 0.172 µs            | 95     | 258,928 (8.7%)
NQueens    | 306,719       | 23.270 µs      | 80 / 4 bytes        | 0.174 µs            | 47     | 129,012 (29.6%)
UTS (T1L)  | 102,181,082   | 0.089 µs       | 32 / 0 bytes        | 0.162 µs            | 485    | 93,553,030 (47.8%)
UTS (T2L)  | 96,793,510    | 0.114 µs       | 32 / 0 bytes        | 0.161 µs            | 378    | 82,249,556 (45.9%)
UTS (T3L)  | 111,345,631   | 0.075 µs       | 32 / 0 bytes        | 0.159 µs            | 46,703 | 108,983,482 (49.4%)
SparseLU   | 1,430,912     | 6.281 µs       | 16,16,24 / 0 bytes  | 0.166 µs            | 2,320  | 1,344,733 (48.4%)
Table 1. Execution statistics on an 8-core Nehalem. “Tasks Created”, “Steals”, and “Tasks Serialized” are the total numbers contributed by all
8 UPC threads. “Input / Output Size” indicates the size of the input and output of the task function in each benchmark (SparseLU has three
different task functions). “Tasks Serialized (%)” shows the number of cutoff serialized tasks due to the bounded task queue mechanism and
its percentage relative to the total number of tasks created.
3.2.1 Hierarchical Victim Selection
Random selection has been the state-of-the-art method in selecting
victims for work-stealing in shared-memory architectures. SLAW
exploits locality by stealing only from threads within a place, de-
fined as sharing an L2 cache in their study. The size of a place can be arbitrarily extended, but the approach has the inherent limitation of describing only two-level hierarchies.
In HotSLAW we allow the specification of an arbitrary local-
ity hierarchy directly at the scheduler level. The implementation
provides a default hierarchy that mimics hardware locality: cache,
socket, node. In the Hierarchical Victim-Selection (HVS) policy, a
thread first attempts to steal work from the nearest neighbors, and
gradually moves up the locality hierarchy. Depending on the num-
ber of trial steals at each level of the hierarchy, the policy can be
tuned to favor different levels of locality. The default HotSLAW
policy for any shared memory (inner) domains is to perform a num-
ber of random stealing attempts equal to the number of cores within
the domain. When reaching the topmost domain, the implementa-
tion performs a tunable number of attempts at that level, before re-
verting to stealing from the lowest domain. In our implementation,
the number of attempts at the topmost level is chosen as 4 * log(N), where N is the number of cores in the system. For all benchmarks presented we have obtained good performance results with these default settings. All these settings are also user configurable.
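A minimal sketch of the HVS loop (ours; domain_t, try_steal_from, and the level layout are illustrative) tries random victims within each domain before moving up the hierarchy, with 4 * log(N) attempts at the topmost level as described above; reverting to the lowest domain after a full pass is left to the caller.

#include <stdlib.h>
#include <math.h>

typedef struct { int *victims; int count; } domain_t;

int try_steal_from(int victim);   /* hypothetical: returns 1 on success */

int hvs_steal(domain_t *domains, int nlevels, int ncores_total) {
  for (int level = 0; level < nlevels; level++) {
    domain_t *d = &domains[level];
    /* Inner (shared-memory) levels: as many random attempts as cores in
       the domain. Topmost level: 4 * log(N) attempts. */
    int attempts = (level == nlevels - 1)
                     ? (int)(4 * log((double)ncores_total))
                     : d->count;
    for (int a = 0; a < attempts; a++) {
      int victim = d->victims[rand() % d->count];
      if (try_steal_from(victim))
        return 1;                 /* work obtained */
    }
  }
  return 0;                       /* caller may revert to the lowest domain */
}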
3.2.2 Hierarchical Chunk Selection
Previous work has shown that the performance of work stealing
algorithms is sensitive to the number of tasks stolen at each step:
this amount is referred to as chunk size.
Work stealing algorithms for recursive parallelism on shared-
memory architectures, such as Cilk’s runtime system, steal one task
from the tail of the victim’s queue, hoping to maximize the proba-
bility of stealing the task with the maximum amount of work [2].
In particular, for structured task execution trees where all leaf nodes have similar depth or all parent nodes have a similar number of children, stealing this single task at the tail of the queue often leaves the thief and the victim with similar amounts of work.
In order to provide scalability on distributed memory, work-
stealing schedulers [4, 17] use a StealHalf policy, where thieves
steal one half of the victim’s queue. The StealHalf strategy has been
shown to be effective when inter-node work stealing is frequent or
when the task execution tree is unstructured and exhibits bursty
behavior.
Our approach, the Hierarchical Chunk Selection (HCS), allows
for tuning the chunk size for the various levels of the locality hi-
erarchy. Based on the distance between the thief and the victim
thread, HCS steals a fixed-sized chunk for inner hierarchy levels
and uses StealHalf at the topmost level, e.g. inter-node. For shared-
memory domains, small chunk sizes (1,2) have been shown to pro-
vide good performance for both structured and bursty/unstructured
task graphs. Inter-node work stealing is expensive in a distributed
memory environment and StealHalf reduces the number of steals.
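The chunk selection rule can be sketched as follows (ours; the level numbering and the default inner chunk size are illustrative): cheap intra-node steals take a small fixed chunk, while expensive inter-node steals take half of the victim's public queue to amortize communication cost.

enum { LEVEL_SOCKET = 0, LEVEL_NODE = 1, LEVEL_CLUSTER = 2 };

int hcs_chunk_size(int steal_level, int victim_public_tasks,
                   int inner_chunk /* e.g. 1 or 2 */) {
  if (steal_level < LEVEL_CLUSTER)
    return inner_chunk;            /* cheap intra-node steal: small fixed chunk */
  /* Expensive inter-node steal: StealHalf of the victim's public queue. */
  int half = victim_public_tasks / 2;
  return half > 0 ? half : 1;
}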
4. Evaluation Setup
We evaluated the performance of HotSLAW on both shared- and
distributed-memory systems. The shared-memory machine is a
two-socket quad-core Intel Xeon 5530 (Nehalem) 2.4 GHz sys-
tem. The distributed-memory machine is an IBM iDataPlex system, with each node having two quad-core Intel Xeon 5500 (Nehalem) 2.67 GHz processors, for a total of 8 cores per node; nodes are connected by 4X QDR InfiniBand.
We use four benchmarks: Fibonacci, NQueens, Unbalanced
Tree Search (UTS) and SparseLU. Fibonacci recursively creates a
number sequence where each subsequent number is the sum of the
previous two. NQueens calculates the number of possible solutions
to place N queens on a N×N chess board. UTS [16] is a synthetic
benchmark specifically designed to evaluate scheduling, load bal-
ancing, termination detection and task coarsening strategies. It per-
forms a parallel traversal of an irregular and unpredictable search
space. UTS counts the number of nodes in an implicitly defined
tree; any sub-tree in the tree can be generated completely from
the description of its parent. The number of children of a node is
a function of the node’s description, which is obtained from its
parent node. Load balancing of UTS is particularly challenging since the distribution of sub-tree sizes exhibits extreme variation: frequent small sub-trees and occasional enormous ones. The
SparseLU (Sparse LU Decomposition) kernel computes an LU
matrix factorization. The matrix is organized in blocks and not all
blocks are allocated due to the sparsity of the matrix. This sparsity
causes load imbalance during computation.
We have developed UPC implementations for all these bench-
marks using our dynamic tasking library. On shared memory, we
compare the performance of HotSLAW with the performance of
OpenMP tasking features: we use implementations from the BOTS
(Barcelona OpenMP Task Suite) suite [6]. The OpenMP version
of UTS is from the UTS-1.1 distribution [16], which uses its own
work-stealing library written in OpenMP. For the evaluation we use
the Berkeley UPC compiler and use the gcc version 4.4.3 compiler
to build the UPC runtime. OpenMP task applications are compiled
with gcc version 4.4.3 and icc version 11.1.
5. Overhead of Work Stealing
Work-stealing runtimes introduce overhead associated with gener-
ating and manipulating tasks, as well as overhead due to degrada-
tion of task performance when locality assumptions are breached
after a steal. Loss of task performance needs to be carefully consid-
ered in the UPC environment, where tasks are allowed to carry and
perform remote data accesses.
Figure 4. Average time to steal an empty task with varying input argument sizes on the IBM iDataPlex cluster. (Plot omitted: average steal overhead in µs vs. task data size from 16 bytes to 1 MB, with intra-socket, inter-socket, and inter-node series; plot title "NUMA Effect on the Work Stealing Overhead".)
Figure 5. Memory bandwidth on the IBM iDataPlex cluster. Intra-node measures the inter-socket bandwidth while inter-node measures the InfiniBand bandwidth. (Plot omitted: upc_memget bandwidth in MB/s vs. transfer size from 16 bytes to 128 MB, with intra-node and inter-node series; plot title "UPC_MEMGET Performance on Carver".)
In Table 1 we present statistics obtained from executing applications on an eight-core Nehalem system. As illustrated, creating
a task with an eight byte argument payload takes around 0.2µs.
To capture the overhead of queue manipulation, we measure the
overhead of stealing an empty task on the iDataPlex system. In this benchmark, all threads except one thief generate work. We vary the size
of the input data buffer and present performance numbers for three
levels of hierarchy in Figure 4. The overhead of stealing an empty
task within the socket is around 2µs, the overhead of stealing inter-
socket is around 3.8µs, while the inter-node overhead is 36µs. For
empty tasks, lock acquisition dominates the overhead of work steal-
ing. Increasing the task payload to more than 32K bytes shows the
effects of bandwidth on the performance of our work stealing im-
plementation.
Inter-node work-stealing is 10 to 15 times slower than intra-node work-stealing until the data size reaches 4 KB. The gap narrows as the input data size increases; asymptotically, inter-node stealing is only about two times slower than intra-node.
Overall, Figure 4 indicates that even when differences in steal-
ing latency within a shared memory hierarchy are observable, their
effect is unlikely to cause large end-to-end application performance
degradation due to their short duration when compared with the
tasks proper.
UPC applications are often optimized to exploit data locality
such that shared data accesses tend to access memory with local
affinity. Therefore, stealing a task from a remote node will increase
the chance to access the victim’s memory. In Figure 5, we show the
Figure 6. The number of tasks in the task queue, sampled every 1000 taskq_put operations. (Plots omitted: task queue occupancy over samples for FIBONACCI, which stays below roughly 30 tasks, and for UTS (T3L), which ranges into the thousands.)
bandwidth of a upc_memget operation obtained within a node and
across nodes. We see two orders of magnitude difference between intra-node and inter-node bandwidth when transferring less than 1 KB of data. This bandwidth gap narrows as the data size increases and, beyond 1 MB, intra-node bandwidth is consistently 50% higher than inter-node.
Figure 5 suggests that tasks containing fine grained accesses
are likely to observe a higher relative impact than tasks performing
large transfers.
5.1 Implementing Space Bounds
Work-first and help-first scheduling policies have different per-
formance characteristics and impose different stack and memory
bounds. Work-first is optimal for scenarios where stealing is rare
and its implementation might overflow the program stack. Help-
first is good for scenarios where stealing is frequent.
Figure 6 shows the number of tasks queued in the system during
runs of the Fibonacci and the UTS benchmarks. For this experiment
we provide unbounded task queues, use help-first, and sample a random task queue every 1000 taskq_put operations.
These two benchmarks illustrate the two ends of the task queue
behavior spectrum. Fibonacci produces and consumes tasks at similar rates, such that fewer than 10 tasks are maintained in the queue most of the time. On the other hand, UTS with the T3L input shows a bursty task generation pattern, with some intervals containing only a few tasks while others contain thousands.
Most if not all work stealing implementations use a static mem-
ory allocation for task queue management. Besides simplicity of
implementation and guaranteeing memory bounds, this approach
fits well with practical optimization goals: generating enough work
and parallelism at application startup using help-first, then switch-
ing to work-first and executing tasks inline to avoid creation and
manipulation overhead.
Tree-depth cutoff techniques [12] use this approach to reduce
queuing overhead. Programmers often manually bound the depth of the recursion tree, use help-first until reaching a threshold depth,
Figure 7. Performance comparison between the runtime bounded-queue (BoundedQ) and tree-depth (TreeDepth) cutoff optimizations for Fibonacci(47) and UTS (T3L). (Plots omitted: execution time in seconds vs. cutoff setting, i.e. depth of the task execution tree for TreeDepth and depth of the task queue for BoundedQ.)
then switch to code that amounts to work-first. Lines five through
eight in Figure 2 illustrate this tree-depth optimization approach.
Tree-depth cutoff alleviates task queue memory pressure, but it is
oblivious to the internal task queue memory consumption and man-
agement. Thus, work stealing environments provide an additional
runtime cutoff by imposing task queue bounds and transparently
switch scheduling policies whenever queue space is exhausted.
This is the approach taken in SLAW and implicitly in HotSLAW.
Figure 7 illustrates the differences between these two ap-
proaches. For all benchmarks we implement application level cut-
off based on the depth of the recursion tree and run experiments
with unbounded task queues. The series labeled “TreeDepth” plots
the execution time corresponding to the cutoff setting “Depth of
Task Tree”. The series labeled “BoundedQ” plots execution time
when queues are bounded to contain “Depth of Task Queue” entries. For this setting the tree-depth cutoff is disabled.
For Fibonacci, we observe a 15X performance difference be-
tween the best and worst tree-depth cutoff (TreeDepth) setting.
For the runtime bounded-queue (BoundedQ) technique, we observe at most a 2X performance difference between the best and the worst setting. For UTS, we observe a 2X performance difference for TreeDepth and at most 20% for BoundedQ. For Fibonacci, TreeDepth provides the best performance, while for UTS the best performance is provided by BoundedQ.
In Fibonacci, the best tree-depth cutoff performance at the tree
depth of 21 is three times faster than the best runtime bounded-
queue cutoff performance at the task queue depth of 12. Fibonacci
has a structured task execution graph whose parent nodes have
the same number of child nodes. With tree-depth cutoff, tasks
are executed inline directly in the application, while with run-
time bounded-queue cutoff, they are executed inline inside the
taskq put library call. The overhead of the additional function
calls explains this difference.
Contrary to Fibonacci, the UTS benchmark shows that the best
bounded-queue cutoff at the task queue depth of 20 is 24.9% faster
than the best tree-depth cutoff at the threshold depth of 22,000. This
is because the UTS benchmark with the T3L input has a very deep
Figure 8. Performance comparison between HotSLAW with unbounded (UPC Unbounded Queue) and bounded (UPC Bounded Queue) queues. Both UPC versions use the tree-depth cutoff optimization. Experiments are run on an 8-core Nehalem SMP. (Plot omitted: speedup normalized to serial execution time for FIB(47), NQueens(14), UTS(T1L), UTS(T2L), UTS(T3L), and SpLU(256,16).)
and unstructured task graph and the tree-depth cutoff prematurely
serializes a large sub-tree and hence causes a load imbalance. On
the other hand, the bounded-queue cutoff method does not serialize
a whole sub-tree and instead it serializes one task at a time.
We have repeated these experiments for the NQueens and SparseLU benchmarks. Overall, tree-depth cutoff produces large per-
formance variations and the per application optimal threshold (usu-
ally recursion depth) has a wide range. On the other hand, bounding
the task queue size provides good performance and little variation
within a small range of values for optimal queue sizes. Overall,
our results indicate that setting a queue threshold of 20 active tasks
provides good results in practice for the workload considered. The
types of parallelism present in the benchmarks have been discussed in this section. For reference, the average task length on an eight-core Nehalem is shown in Table 1.
Figure 8 shows the performance impact of bounding runtime
queues to 20 slots per thread. For each benchmark implementation,
we perform a search to detect the best tree-depth cutoff threshold
and use this setting for comparisons. This process is commonly
used in practice by application developers since the tree-depth
cutoff threshold can greatly affect performance of recursive parallel
applications.
The series labeled “UPC Bounded Queue” shows the effects of
enforcing queue bounds on the performance of the UPC implemen-
tation. Note that the SparseLU benchmark uses parallel-for instead
of recursive fork-join parallelism and it is not optimized with tree-
depth cutoff.
Imposing queue bounds in HotSLAW does not affect the per-
formance of the FIB, NQueens and UTS(T1L) benchmarks. These
have well “balanced” task graphs and the tree-depth cutoff is able
to throttle task generation at the right moment. For UTS(T2L, T3L)
which have deep unstructured task graphs, bounding the queues
provides additional performance improvements of up to 18%. For
SparseLU, the benchmark generates a small number of coarse-
grained tasks and the bounded-queue cutoff improves performance
only slightly.
5.2 Impact of Hierarchical Victim Selection
Figure 9 shows the impact of hierarchical victim selection (HVS)
on the performance of HotSLAW on shared memory architectures.
As a frame of reference, we present the performance of hand tuned
OpenMP implementations from the BOTS [6] suite or developed independently for UTS [16]. These implementations use tree-depth
based cutoff and we select for comparison the best performance ob-
Figure 9. Performance of OpenMP and victim selection strategies on an 8-core Nehalem SMP. Execution times are normalized to the execution time of the gcc-OpenMP version (lower is better). All UPC versions use a fixed chunk size of 1, except UPC (RAND+BestChunk), which uses the best chunk size found by search. For reference, the raw speedups of the UPC (HVS) scheme for (FIB, NQueens, UTS(T1L), UTS(T2L), UTS(T3L), SparseLU) are (7.7, 7.9, 7.5, 7.2, 7.5, 5.9), respectively. (Bar chart omitted: series gcc-OpenMP, icc-OpenMP, UPC (Intra-Socket), UPC (HVS), UPC (RAND), and UPC (RAND+BestChunk).)
tained after searching the cutoff parameter space. For lack of space, we do not include details about the OpenMP implementations; note that performance has been shown to vary manifold [5, 18] depending on the cut-off parameter selection. For all UPC implemen-
tations we present performance when stealing one task per event
(chunk size of 1), which is the default value in shared-memory
work-stealing runtimes.
The series labeled “Intra-Socket” presents the performance
when HotSLAW is configured with a single-level hierarchy, contained within a NUMA domain. The series labeled “HVS” presents
the performance when using a two level hierarchy spanning the
whole node. For reference, we show the performance of HotSLAW
when using a randomized stealing policy (used as default in many
systems) with different chunk-size granularities: we show data for
the default value of 1, as well as for the best value determined
through searching.
As our results indicate, HotSLAW provides comparable or
better performance than both the gcc and icc implementations
of OpenMP. HotSLAW outperforms both gcc and icc OpenMP
versions in FIB, NQueens, and SparseLU. In UTS benchmarks,
icc OpenMP shows the best performance and HotSLAW and gcc
OpenMP have similar performance. For the benchmarks studied,
confining stealing only within one socket reduces performance
when compared to whole node stealing. When allowing node wide
stealing, refining the hierarchy levels does not change performance
and HotSLAW with HVS achieves the same performance as Hot-
SLAW with random stealing. This behavior is caused by the fact
that all benchmarks use irregular data structures with no data lo-
cality. Similar results were obtained by Guo et al. [10] for SLAW, where they report improvements only when using places for dense linear algebra codes that fit in cache.
Figure 10 shows the performance of HotSLAW on the IBM
iDataPlex cluster, using 256 cores. Again, we perform a search for
selecting the setting for chunk size that provides the best perfor-
mance and use this value for comparisons. The series “Intra-Node”
uses a setting where work is hierarchically stolen only within a
shared memory node, the series “HVS” uses hierarchical victim
selection, and the series “Random” uses the default random steal-
Figure 10. Performance comparison of stealing strategies on distributed memory: 256 cores on the Carver IBM iDataPlex cluster. The speedup numbers are normalized to the speedup of the random policy. (Bar chart omitted: INTRA-NODE, HVS, and RANDOM series for FIB(56), Nqueens(16), UTS(T1L), UTS(T2L), UTS(T3L), and SpLU(200,100).)
ing policy. For each setting, we present speedup normalized to the
speedup of the random victim selection policy.
The first two benchmarks, FIB and NQueens, which have infre-
quent steals with small input arguments and no shared data accesses
within a task function, show that both HVS and Random perform
equally, while Intra-Node works poorly. The UTS benchmark has
frequent steals, but no shared data accesses, which makes the pro-
posed HVS significantly outperform both Intra-Node and Random
policies. In particular, HVS is able to provide a maximum 52%
speedup when compared to Random. The SparseLU benchmark
exhibits a relatively large number of steals and performs multiple
shared data accesses within its task functions. In this case, con-
taining the locality hierarchy within a node provides 20% better
performance than the other settings. HotSLAW with HVS is able
to match the performance of Intra-Node for SparseLU when tuning
the number of steal trials within a shared memory level.
For reference, the raw speedup numbers of HVS schemes on
256 cores for (FIB, NQueens, UTS(T1L), UTS(T2L), UTS(T3L),
SparseLU) are (179.5, 247.8, 200.1, 222.5, 98.8, 32.1), respec-
tively.
These results indicate that HotSLAW handles both recursive fork-join parallelism and parallel-for well, using only small queue bounds and independently of the inherent load balance present
in the application. Furthermore, imposing queue bounds eliminates
the need for manual application tuning: HotSLAW is always able
to improve performance independent of the tree-depth based cutoff.
In particular, due to the tight space bounds, HotSLAW provides
similar performance independent of the application level tuning.
5.3 Impact of Hierarchical Chunk Selection
Figure 11 shows the impact on application performance of stealing
with different granularities (ChunkSize). In this experiment we use
HVS with a single fixed chunk size for all levels of the hierarchy.
We report performance on 256 cores of the IBM iDataPlex cluster.
FIB shows the best performance at “ChunkSize = 1” and its per-
formance drastically drops for larger values. FIB has a structured
task graph with a well balanced computation where stealing more
than one task causes actual load imbalance. The performance of
NQueens, UTS(T1L), UTS(T2L), and SparseLU is relatively insensitive to the variation of ChunkSize. Note that performance degradation sometimes occurs for ChunkSize=16: considering the default queue bound of 20, this setting implements a policy even more aggressive than StealHalf [4]. UTS(T3L) exhibits different trends
than all other benchmarks: in this case “ChunkSize=16” provides
best performance. For this particular T3L input, UTS generates an
Figure 11. Performance of the fixed chunk selection (Fixed-Chunk) method with varying chunk sizes on the 256-core Carver IBM iDataPlex cluster. (Bar chart omitted: speedup normalized to serial execution time for ChunkSize = 1, 2, 4, 8, 16 across FIB, Nqueens, UTS(T1L), UTS(T2L), UTS(T3L), and SparseLU.)
Figure 12. Performance of the hierarchical chunk selection (HCS) method with varying chunk sizes on the 256-core Carver IBM iDataPlex cluster. (Bar chart omitted: speedup normalized to serial execution time for ChunkSize = 1, 2, 4, 8, 16 across the same benchmarks as Figure 11.)
unstructured task graph with bursty sub-tree patterns, which makes
larger chunk sizes more efficient for consuming the excess number
of tasks in the queues.
In Figure 12, we present the performance of the hierarchical
chunk selection (HCS) method. In this case, we present results for
a setting using a single granularity (fixed chunk) for all intra-node
steals and StealHalf for any inter-node steals. As shown, the overall
performance trends are similar to the pure fixed chunk method, and HCS is able to reduce the performance impact of a poorly chosen steal granularity. For example, in the Fixed-Chunk mode, the
performance of FIB, NQueens, UTS(T1L), and UTS(T2L) drops
drastically for ChunkSize ≥ 8, while HCS maintains a reasonable
level across all chunk sizes. The HCS method outperforms the
StealHalf method for structured task graph cases, such as FIB,
NQueens, and UTS(T1L).
Figure 13 summarizes the performance trends observed for all
possible chunk selection methods. “Fixed-Chunk (Best)” denotes
the best performance obtained when using fixed chunk and search-
ing through all the possible settings. StealHalf denotes the perfor-
mance when using HVS with a steal-half policy at each hierarchy
level. For applications with structured task graphs, such as FIB,
NQueens, and UTS(T1L), both Fixed-Chunk and HCS strategies
outperform StealHalf. StealHalf is better at frequent stealing opera-
tions and unstructured and bursty task patterns, such as UTS(T3L).
HCS shows the performance when using a fixed chunk size of 1
for its intra-node stealing and the steal-half policy for its inter-node
stealing for all the benchmarks. For reference, we compare perfor-
mance with the random victim selection with StealHalf chunk pol-
icy setting used by Dinan et al, denoted as “Random+StealHalf”.
Figure 13. Performance of different chunk size selection policies on 256 cores of Carver. Fixed-Chunk (Best) uses the best chunk size for each application and HCS uses a chunk of size one for its intra-node stealing. Fixed-Chunk (Best), StealHalf, and HCS use HVS for victim selection, while Random+StealHalf uses random victim selection. (Bar chart omitted: speedup normalized to serial execution time for FIB(56), Nqueens(16), UTS(T1L), UTS(T2L), UTS(T3L), and SpLU(256,64).)
The geometric mean speedups of Fixed-Chunk (Best), StealHalf, HCS, and Random+StealHalf are 116.4, 94.9, 106.1, and 83.3, respectively. Fixed-Chunk shows the best performance attainable
when manually searching the space of possible settings of the ChunkSize parameter. For reference, the raw speedups of HCS and
Random+StealHalf on 256 cores for (FIB, NQueens, UTS(T1L),
UTS(T2L), UTS(T3L), SparseLU) are (172.2, 240.1, 194.2, 202.9,
68.3, 12.8) and (82.7, 222.5, 121.0, 159.2, 64.9, 14.5), respectively.
On average, HCS performs 27% better than the Random+StealHalf method.
While the Fixed-Chunk method can attain the highest peak per-
formance, it is also sensitive to the value chosen for ChunkSize
and it can perform very poorly for the wrong choice. The StealHalf
method does not require any tuning as it always steals half of the
victim’s tasks. However, its peak performance is lower than Fixed-
Chunk and HCS for applications with structured task trees be-
cause the hierarchical victim selection (HVS) optimization, which we described in Section 3.2.1, forces the StealHalf method to steal from within the node most of the time. The peak performance of
the HCS method is close to the peak performance of the Fixed-
Chunk method because of the HVS optimization. On the other hand,
among these three methods, the HCS method shows the highest av-
erage performance for all chunk sizes, which shows the robustness
of the proposed technique.
6. Related Work
Dynamic tasking and work stealing runtimes have been exten-
sively studied. Recent work presents scalable and locality aware
implementations for both shared- and distributed memory sys-
tems. SLAW (Scalable Locality-Aware Work-stealing) [10] is de-
signed to support the Habanero-Java [1] task-parallel language.
Habanero-Java provides a flat hierarchy of locality con-
structs (places) and SLAW performs work stealing only within a
place. Guo et al. show performance improvements when enabling
cache locality for dense linear algebra kernels. Similar to SLAW,
our approach uses an adaptive work stealing policy. In addition,
HotSLAW provides a locality hierarchy with an adaptive stealing
granularity policy.
The Scioto framework [3] provides locality-aware dynamic
load balancing and inter-operates with MPI [7], ARMCI [14], and
Global Arrays [15]. Scioto implements locality-aware load balanc-
ing by prioritizing the task queue; tasks with high local affinity
are placed toward the head of the task queue and tasks with low
local affinity are placed toward the tail of the task queue. High
local affinity tasks are executed by the local thread while the low
local affinity tasks are stolen by other threads. Affinity in Scioto
is declared with respect to a particular process and the mapping
of processes to cores is left unspecified. In contrast, our default
locality notion takes into account the hardware memory hierarchy.
Dinan et al. [4] present a scalable implementation of work stealing
for ARMCI. Our implementation is directly based on theirs and it
extends it with hierarchical victim and chunk selection.
Very recently, Saraswat et al. [20] introduced lifeline-based load balancing, an alternative to random stealing on large-scale systems. Their main goal is to minimize the impact of
missed steals on performance and their work introduces the notion
of lifeline graphs. A lifeline graph provides a meta-topology used
for work stealing: when a node fails to steal, it quiesces and informs
the outgoing edges in the lifeline graph. Work arrives from a lifeline
and it is pushed by nodes onto all their active outgoing lifelines. To
implement lifeline graphs the authors advocate cyclic hypercubes:
it is not clear how well these graphs map onto hardware locality
domains.
The PGAS programming model presents the programmer with
a single global address space, with efficient one-sided access to
global data. The one-sided data access model provides a natu-
ral way to implement [17, 21] work stealing operations. Shet et al. [21] proposed the Asynchronous Partitioned Global Address
Space model (APGAS) and Asynchronous Remote Methods as po-
tential extensions to the UPC language standard. They implement
a fixed-size global task queue with an estimated upper bound for
APGAS support.
Olivier and Prins [17] developed an optimized application level
runtime library for the UTS benchmark. The library is written in
UPC and they discuss multiple implementation aspects. In particu-
lar, their approach provides the first discussion of the importance of
quiescing threads that fail to steal and provides the seed for lifeline
based load balancing. Olivier and Prins describe an implementa-
tion where tasks mark their intention to steal and work is pushed
from producers. Our implementation uses a pull based approach
with locking. We plan to extend our implementation with a similar
approach and provide a performance comparison. Their distributed-
memory implementation uses the half-chunk policy for work steal-
ing.
Prior work on shared-memory scheduling addressed the problem of dynamically scheduling for-all loops with unpredictable iteration execution times: Guided self-scheduling [19] and Taper [13] choose large chunk sizes at first to reduce scheduling overhead and smaller chunk sizes later for load balance. These loop scheduling
approaches assume that the number of tasks is a priori known and
the task pool only shrinks after initialization.
A recursion tree-depth based cut-off mechanism is an important
overhead reduction technique, especially for fine-grained tasks.
Besides the techniques described for SLAW, there exists a large
body of work applied to OpenMP implementations. Duran et al. [5]
proposed an adaptive cut-off technique for OpenMP implemented
in the Nanos OpenMP environment. Their cut-off mechanism uses
runtime profiling to decide whether to create or serialize a new task.
Their profiling technique assumes that all tasks at a given recursion
level will have a similar behavior, a condition that will not hold for
unstructured and bursty task graphs.
Zheng et al. [24] developed a hierarchical load balancing for
Charm++ [11] applications using an object based data model.
They divide the processors into hierarchical autonomous groups,
thereby decentralizing the load balancing task. Charm++ uses run-
time monitoring to periodically assess the computational load for
each object and builds a load database. The runtime consults this
load database to detect load imbalance and migrates computation objects to lightly loaded processors.
7. Conclusion
While PGAS languages provide support for irregular data struc-
tures, the SPMD instances of these languages do not support irregu-
lar computational structures. This is increasingly important as peo-
ple parallelize a wider class of applications, and there are growing
concerns about performance heterogeneity of large-scale systems.
In this paper, we presented HotSLAW, a dynamic tasking library for
the Unified Parallel C programming language. HotSLAW provides
a simple and effective way of adding task parallelism to SPMD
programs that takes advantage of the global address space both
in the implementation and expression of tasks, which may contain
global pointers that are location independent. To exploit locality in
distributed-memory many-core clusters, we presented two hierar-
chical work-stealing optimization techniques: hierarchical victim
selection (HVS) steals work from the nearest available victims to
preserve locality and hierarchical chunk selection (HCS) dynami-
cally determines the amount of work to steal based on the locality
of the victim thread. HotSLAW allows UPC programmers to selec-
tively load balance all or parts of their computation, and it is de-
signed to work with multiple UPC implementations. We evaluated
HotSLAW performance on both shared and distributed memory ar-
chitectures. On shared memory systems, HotSLAW provides per-
formance comparable or better than manually optimized OpenMP
implementations, improving by as much as 109%. On distributed-
memory systems, HVS is able to improve performance by up to
52% when compared to the default random victim selection and
HCS provides performance improvements up to 122% compared to
the StealHalf method. The combination of HVS and HCS enables
HotSLAW to achieve 27% better performance than the state of the
art approach using a random victim selection and StealHalf strat-
egy.
References
[1] The Habanero Java (HJ) Programming Language. [Online]. Available: http://habanero.rice.edu/hj.
[2] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded com-
putations by work stealing. J. ACM, 46:720–748, September 1999.
[3] J. Dinan, S. Krishnamoorthy, D. B. Larkins, J. Nieplocha, and P. Sa-
dayappan. Scioto: A framework for global-view task parallelism.
In Proceedings of the 37th Intl. Conference on Parallel Processing
(ICPP), 2008.
[4] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and
J. Nieplocha. Scalable Work Stealing. In Proceedings of the Confer-
ence on High Performance Computing Networking, Storage and Anal-
ysis, SC’09, pages 1–11, New York, NY, USA, 2009. ACM.
[5] A. Duran, J. Corbalán, and E. Ayguadé. An adaptive cut-off for
task parallelism. In Proceedings of the 2008 ACM/IEEE conference
on Supercomputing, SC’08, pages 36:1–36:11, Piscataway, NJ, USA,
2008. IEEE Press.
[6] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé.
Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the
Exploitation of Task Parallelism in OpenMP. In 38th International
Conference on Parallel Processing (ICPP ’09), page 124–131, Vienna,
Austria, September 2009. IEEE Computer Society, IEEE Computer
Society.
[7] The MPI Forum. MPI: A Message Passing Interface Standard, 1993.
[8] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation
of the Cilk-5 multithreaded language. In Proceedings of the ACM
SIGPLAN 1998 conference on Programming language design and
implementation, PLDI ’98, pages 212–223, New York, NY, USA,
1998. ACM.
[9] Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first
scheduling policies for async-finish task parallelism. In IPDPS’09,
pages 1–12, 2009.
[10] Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: A scalable locality-
aware adaptive work-stealing scheduler. In IPDPS’10, pages 1–12,
2010.
[11] L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object-oriented system based on C++. In Proceedings of the eighth annual
conference on Object-oriented programming systems, languages, and
applications, OOPSLA ’93, pages 91–108, New York, NY, USA,
1993. ACM.
[12] H.-W. Loidl and K. Hammond. On the granularity of divide-and-
conquer parallelism. In Proceedings of the Glasgow Workshop on
Functional Programming, pages 8–10, Ullapool, Scotland, July 1995.
[13] S. Lucco. A dynamic scheduling method for irregular parallel pro-
grams. In Proceedings of the ACM SIGPLAN 1992 conference on
Programming language design and implementation, PLDI ’92, pages
200–211, New York, NY, USA, 1992. ACM.
[14] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. In IPDPS’99, pages 533–546, 1999.
[15] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: a
portable ”shared-memory” programming model for distributed mem-
ory computers. In Proceedings of the 1994 conference on Supercom-
puting, Supercomputing ’94, pages 340–349, Los Alamitos, CA, USA,
1994. IEEE Computer Society Press.
[16] S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W.
Tseng. UTS: An unbalanced tree search benchmark. In Proceedings
of the 19th international conference on Languages and compilers for
parallel computing, LCPC’06, pages 235–250, Berlin, Heidelberg,
2007. Springer-Verlag.
[17] S. Olivier and J. Prins. Scalable dynamic load balancing using UPC.
In Proceedings of the 2008 37th International Conference on Parallel
Processing, ICPP ’08, pages 123–131, Washington, DC, USA, 2008.
IEEE Computer Society.
[18] S. Olivier and J. Prins. Evaluating OpenMP 3.0 runtime systems on unbalanced task graphs. In IWOMP’09, pages 63–78, 2009.
[19] C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: A
practical scheduling scheme for parallel supercomputers. IEEE Trans.
Comput., 36:1425–1439, December 1987.
[20] V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, and S. Krish-
namoorthy. Lifeline-based global load balancing. In Proceedings of
the 16th ACM symposium on Principles and practice of parallel pro-
gramming, PPoPP ’11, pages 201–212, New York, NY, USA, 2011.
ACM.
[21] A. G. Shet and R. J. Harrison. Asynchronous programming in UPC:
A case study and potential for improvement. In The 1st Workshop on
Asynchrony in the PGAS Programming Model (APGAS), 2009.
[22] UPC Language Specification, Version 1.0. Available at http://upc.gwu.edu.
[23] Y. Yan, J. Zhao, Y. Guo, and V. Sarkar. Hierarchical Place Trees:
A portable abstraction for task parallelism and data movement. In
Languages and Compilers for Parallel Computing, pages 172–187,
2009.
[24] G. Zheng, E. Meneses, A. Bhatele, and L. V. Kale. Hierarchical
load balancing for Charm++ applications on large supercomputers. In
Proceedings of the 2010 39th International Conference on Parallel
Processing Workshops, ICPPW ’10, pages 436–444, Washington, DC,
USA, 2010. IEEE Computer Society.