Unified Parallel C (UPC) is an extension to ISO C99 that provides
a Partitioned Global Address Space (PGAS) abstraction using Single
Program Multiple Data (SPMD) parallelism. Memory is partitioned
into a task-local heap and a global heap. All tasks can access
memory residing in the global heap, while the local heap can be
accessed only by its owner. The global heap is logically
partitioned between tasks, and each task is said to have local
affinity with its sub-partition. Global memory can be accessed
either through pointer dereferences (loads and stores) or through
bulk communication primitives (upc_memget(), upc_memput()). The
language provides synchronization primitives, namely locks,
barriers, and split-phase barriers. Most existing UPC
implementations also provide non-blocking communication
primitives, e.g. upc_memget_nb(). The language also defines a
memory consistency model (strict and relaxed accesses) that
constrains the ordering of shared-memory accesses.
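As a concrete illustration of affinity, the sketch below computes, in plain C, which thread owns element i of a UPC array declared `shared [B] T a[N]` and where that element sits inside the owner's partition. It follows the standard UPC block-cyclic layout (blocks of B consecutive elements are dealt round-robin to the threads; without a layout qualifier B is 1). The helper names are hypothetical; in UPC itself, upc_threadof() answers the first question for a pointer-to-shared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers modelling UPC's block-cyclic layout for an array
   declared `shared [block] T a[N]` and run on nthreads threads. */

/* Thread with affinity to element i: blocks of `block` consecutive
   elements are dealt round-robin to threads 0..nthreads-1. */
static int affinity_of(size_t i, size_t block, int nthreads) {
    assert(block > 0 && nthreads > 0);
    return (int)((i / block) % (size_t)nthreads);
}

/* Position of element i inside its owner's local partition, counted in
   elements: each owner stores its blocks contiguously, in global order. */
static size_t local_offset(size_t i, size_t block, int nthreads) {
    assert(block > 0 && nthreads > 0);
    size_t global_block = i / block;                  /* block index in a[] */
    size_t owner_block = global_block / (size_t)nthreads; /* index on owner */
    return owner_block * block + i % block;
}
```

For example, with a block size of 2 on 3 threads, elements 0 and 1 live on thread 0, 2 and 3 on thread 1, 4 and 5 on thread 2, and 6 and 7 wrap back to thread 0 at local offsets 2 and 3.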
2.1 UPC Task Library API
taskq_t *taskq_all_alloc(int nFunc, void *func1,
int input_size1, int output_size1, ...);
int taskq_put(taskq_t *taskq, void *func,
void *in, void *out);
int taskq_execute(taskq_t *taskq);
int taskq_steal(taskq_t *taskq);
void taskq_wait(taskq_t *taskq);
void taskq_fence(taskq_t *taskq);
int taskq_all_isEmpty(taskq_t *taskq);
Figure 1. Task Library API
We provide a library API for instantiating and controlling dynamic
task parallelism from UPC applications. Figure 1 lists the most
commonly invoked functions. The taskq_all_alloc function allocates
a distributed data structure that represents a global task queue;
it is a collective call, and its arguments must match across all
threads. The first argument, nFunc, is the number of different
task functions that can be put into this task queue. Each function
is then described by a triplet consisting of a pointer to the task
function, its input data size, and its output data size. Logically,
the task queue is split into a thread-private (local) portion and a
public portion.
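The variadic triplet convention of taskq_all_alloc can be pictured with a minimal, single-process sketch. Everything here is our own assumption: the struct names, registry_alloc(), registry_find(), and registry_free() are hypothetical, task functions are typed as void (*)(void *, void *) (the signature given in Section 2.2) instead of being passed as void *, and the collective allocation of the distributed queue is omitted entirely.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

/* Task functions take pointers to their input and output buffers. */
typedef void (*task_fn)(void *input, void *output);

/* One registered task kind: the (function, input size, output size) triplet. */
typedef struct {
    task_fn func;
    int input_size;
    int output_size;
} task_kind;

typedef struct {
    int nFunc;
    task_kind *kinds;
} task_registry;

/* Read nFunc triplets from the variadic arguments, in the order used by
   taskq_all_alloc(). Sizes are read as int, so callers must cast sizeof. */
static task_registry *registry_alloc(int nFunc, ...) {
    assert(nFunc > 0);
    task_registry *r = malloc(sizeof *r);
    if (r == NULL) return NULL;
    r->kinds = malloc((size_t)nFunc * sizeof *r->kinds);
    if (r->kinds == NULL) { free(r); return NULL; }
    r->nFunc = nFunc;

    va_list ap;
    va_start(ap, nFunc);
    for (int i = 0; i < nFunc; i++) {
        r->kinds[i].func = va_arg(ap, task_fn);
        r->kinds[i].input_size = va_arg(ap, int);
        r->kinds[i].output_size = va_arg(ap, int);
    }
    va_end(ap);
    return r;
}

/* Find the triplet registered for func, or NULL if it was never registered. */
static const task_kind *registry_find(const task_registry *r, task_fn func) {
    for (int i = 0; i < r->nFunc; i++)
        if (r->kinds[i].func == func) return &r->kinds[i];
    return NULL;
}

static void registry_free(task_registry *r) {
    if (r != NULL) { free(r->kinds); free(r); }
}

/* Example task kind: output = input * input. */
static void square_task(void *input, void *output) {
    int v = *(const int *)input;
    *(int *)output = v * v;
}
```

A call such as registry_alloc(1, square_task, (int)sizeof(int), (int)sizeof(int)) registers one task kind; the explicit casts matter because a variadic callee reads exactly the types it asks for, and size_t is not int on common platforms.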
The taskq_put function creates a task and puts it into the queue
when space is available. As described later, we implement a
combination of the work-first and help-first scheduling policies
with bounded queues: when the queue is full, taskq_put executes
the task inline (serialized) instead of enqueuing it.
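The bounded-queue fallback can be sketched in a few lines of single-threaded C. The model is ours, not the library's: QCAP, MAX_INPUT, bq_put(), bq_execute(), and double_task() are hypothetical. A put copies the input into a queue slot while there is room; when the queue is full it runs the task immediately in the caller, serializing it. How the library actually combines work-first and help-first scheduling is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QCAP 4          /* hypothetical queue bound */
#define MAX_INPUT 64    /* hypothetical per-task input limit, in bytes */

typedef void (*task_fn)(void *input, void *output);

typedef struct {
    task_fn func;
    unsigned char input[MAX_INPUT];  /* private copy of the caller's input */
    void *output;                    /* where the task writes its result */
} queued_task;

typedef struct {
    queued_task slots[QCAP];
    int count;      /* tasks currently queued */
    int inlined;    /* puts that were executed immediately */
} bounded_q;

/* Queue the task if there is room; otherwise run it now in the caller.
   Returns 1 if the task was queued, 0 if it was executed inline. */
static int bq_put(bounded_q *q, task_fn f, void *input, size_t in_size,
                  void *output) {
    assert(in_size <= MAX_INPUT);
    if (q->count < QCAP) {
        queued_task *t = &q->slots[q->count++];
        t->func = f;
        memcpy(t->input, input, in_size);
        t->output = output;
        return 1;
    }
    q->inlined++;
    f(input, output);   /* serialized: runs to completion before returning */
    return 0;
}

/* Run the most recently queued task; returns 0 when the queue is empty.
   The task is copied out first so that puts made by the task itself
   may safely reuse the freed slot. */
static int bq_execute(bounded_q *q) {
    if (q->count == 0) return 0;
    queued_task t = q->slots[--q->count];
    t.func(t.input, t.output);
    return 1;
}

/* Example task: output = 2 * input (memcpy avoids alignment assumptions). */
static void double_task(void *input, void *output) {
    int v;
    memcpy(&v, input, sizeof v);
    v *= 2;
    memcpy(output, &v, sizeof v);
}
```

Because the input is copied at put time, the caller may overwrite or discard its own variable immediately after bq_put() returns.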
The taskq_execute function removes a task from the head of the
private task queue and executes it, while the taskq_steal function
attempts to steal tasks from the public portion of another thread's
queue. We further discuss the stealing strategy in Section 3.2.
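The private/public division can be modelled with one array and three indices. This is a sequential sketch under our own assumptions: split_q and the sq_* functions are hypothetical, exposing the older half of the private tasks in sq_release() is an arbitrary policy, and all synchronization between the owner and thieves is omitted. We also assume the owner's "head" is the newest end, while thieves take the oldest public task; the library's actual strategy is the subject of Section 3.2.

```c
#include <assert.h>

#define SQ_CAP 16

/* items[bottom, split) is public (stealable); items[split, top) is the
   owner's private portion. Task ids stand in for task descriptors. */
typedef struct {
    int items[SQ_CAP];
    int bottom, split, top;
} split_q;

static void sq_push(split_q *q, int task) {
    if (q->bottom == q->top)            /* empty: rewind the indices */
        q->bottom = q->split = q->top = 0;
    assert(q->top < SQ_CAP);
    q->items[q->top++] = task;
}

/* Owner: make the older half (rounded up) of the private tasks public. */
static void sq_release(split_q *q) {
    q->split += (q->top - q->split + 1) / 2;
}

/* Owner: run the newest private task, reclaiming public work when the
   private portion is empty. Returns the task id, or -1 if none is left. */
static int sq_execute(split_q *q) {
    if (q->top == q->split) {
        if (q->split == q->bottom) return -1;
        q->split = q->bottom;           /* reclaim all public tasks */
    }
    return q->items[--q->top];
}

/* Thief: take the oldest public task, or -1 if the public portion is empty. */
static int sq_steal(split_q *q) {
    if (q->bottom == q->split) return -1;
    return q->items[q->bottom++];
}
```

In recursive computations such as the Fibonacci example, the oldest tasks tend to represent the largest remaining subproblems, which is why thieves in this model take from that end.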
The library also provides primitives for inter-task synchronization
and ordering. taskq_fence is a non-blocking operation used to
enforce ordering between task sub-graphs: it ensures that no task
spawned after the fence is scheduled for execution until all tasks
spawned before the fence have completed. taskq_wait is a blocking
call that returns when all spawned tasks have completed; to ensure
progress, it internally calls the execute and steal functions.
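The ordering guarantee of a fence and the progress loop of a wait can be illustrated with a single-threaded model. In this sketch (fence_q and the fq_* functions are hypothetical), a fence only advances an epoch counter, so it returns immediately; a task becomes eligible only once every task from an earlier epoch has finished; and the wait loop keeps executing eligible tasks until none remain. Unlike the library, the model waits for all tasks it holds, and it has no one to steal from.

```c
#include <assert.h>

#define FQ_MAX_TASKS 32
#define FQ_MAX_EPOCHS 8

typedef struct { int id, epoch, done; } fq_task;

typedef struct {
    fq_task tasks[FQ_MAX_TASKS];
    int n;                       /* tasks spawned so far */
    int epoch;                   /* current epoch; a fence advances it */
    int pending[FQ_MAX_EPOCHS];  /* unfinished tasks per epoch */
    int order[FQ_MAX_TASKS];     /* ids in execution order, for inspection */
    int n_run;
} fence_q;

static void fq_put(fence_q *q, int id) {
    assert(q->n < FQ_MAX_TASKS);
    q->tasks[q->n++] = (fq_task){ id, q->epoch, 0 };
    q->pending[q->epoch]++;
}

/* Non-blocking: records the boundary and returns without running anything. */
static void fq_fence(fence_q *q) {
    assert(q->epoch + 1 < FQ_MAX_EPOCHS);
    q->epoch++;
}

/* Oldest epoch that still has unfinished tasks, or -1 if all are done. */
static int fq_oldest_open(const fence_q *q) {
    for (int e = 0; e <= q->epoch; e++)
        if (q->pending[e] > 0) return e;
    return -1;
}

/* Run one eligible task (newest first within the oldest open epoch).
   Returns 0 if nothing is eligible. */
static int fq_execute(fence_q *q) {
    int e = fq_oldest_open(q);
    if (e < 0) return 0;
    for (int i = q->n - 1; i >= 0; i--) {
        fq_task *t = &q->tasks[i];
        if (!t->done && t->epoch == e) {
            t->done = 1;
            q->pending[e]--;
            q->order[q->n_run++] = t->id;
            return 1;
        }
    }
    return 0;
}

/* Blocking: make progress until every spawned task has completed. */
static void fq_wait(fence_q *q) {
    while (fq_oldest_open(q) >= 0)
        if (!fq_execute(q))
            break;   /* a real runtime would try to steal here */
}
```

With tasks 1 and 2 spawned before a fence and task 3 after it, the model runs 2, then 1, then 3: task 3 is newer than the others but cannot start until both pre-fence tasks are done.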
The taskq_all_isEmpty function is a collective function for
termination detection: it returns 1 if the global task queue is
empty and 0 otherwise. In addition, we provide configuration
functions, taskq_set_*, to specify user-defined stealing
hierarchies and other configurable behavior.
The complete description of the task library API can be found
at http://upc.lbl.gov/task.shtml.
2.2 Programming Example
01: void FIB( int *n, int *out ) {
02:   int n1 = *n-1;
03:   int n2 = *n-2;
04:   int x, y;
05:   if (*n < CUTOFF) {
06:     FIB_serial(n, out);
07:     return;
08:   }
09:   taskq_put(taskq, FIB, &n1, &x);
10:   taskq_put(taskq, FIB, &n2, &y);
11:   taskq_wait(taskq);
12:   *out = x + y;
13: }
Figure 2. UPC task example for Fibonacci
Recursive divide-and-conquer functions and parallel-for loops
(a.k.a. parallel do-all loops) with potential load imbalance are good
candidates for dynamic tasking. Figure 2 shows a Fibonacci number
generator implemented using our API. The task function FIB spawns
two child tasks at lines 9 and 10 and waits at line 11 until these
children complete. While waiting inside taskq_wait, the runtime
library executes queued tasks or, if the local task queue is empty,
tries to steal from other threads.
Tasks are specified at function granularity, with a signature
containing pointers to the input and output data buffers, e.g.:
void my_func(void *input, void *output);
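Because a task receives exactly one input and one output buffer, a task that needs several parameters packs them into a single contiguous struct. The sketch below (range_in, range_out, range_sum_task(), and parallel_for_sum() are hypothetical) turns a parallel-for loop into one task per chunk. For brevity each task body is called directly; in a UPC program each call would instead be a taskq_put(), and the partial results would be combined only after taskq_wait().

```c
#include <assert.h>
#include <string.h>

/* Input of one loop chunk: iterate over [lo, hi). Packed as one struct so
   the whole input is a single contiguous region that can be copied. */
typedef struct { long lo, hi; } range_in;

/* Output of one chunk: the partial sum of i*i over its range. */
typedef struct { long long sum; } range_out;

/* Task body with the generic signature void f(void *input, void *output). */
static void range_sum_task(void *input, void *output) {
    range_in in;
    memcpy(&in, input, sizeof in);          /* read the packed input */
    range_out out = { 0 };
    for (long i = in.lo; i < in.hi; i++)
        out.sum += (long long)i * i;
    memcpy(output, &out, sizeof out);       /* fill the output region */
}

#define MAX_CHUNKS 64

/* Split [0, n) into `chunks` pieces, run one task per piece, and combine.
   Here each task runs immediately instead of being spawned. */
static long long parallel_for_sum(long n, int chunks) {
    assert(chunks > 0 && chunks <= MAX_CHUNKS);
    range_in ins[MAX_CHUNKS];
    range_out outs[MAX_CHUNKS];
    for (int c = 0; c < chunks; c++) {
        ins[c].lo = n * c / chunks;
        ins[c].hi = n * (c + 1) / chunks;
        range_sum_task(&ins[c], &outs[c]);
    }
    long long total = 0;
    for (int c = 0; c < chunks; c++) total += outs[c].sum;
    return total;
}
```

The chunk boundaries n*c/chunks cover [0, n) exactly once even when n is not divisible by the number of chunks.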
In contrast with the API used by Dinan [4], which uses global
addresses to specify input/output data, we use plain C void*
pointers in order to improve interoperability with other programming
models. Input and output data are stored in contiguous memory; they
are copied into library space when a task is created and travel with
the task whenever it migrates.
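One way to picture a task that travels with its data is as a flat record: a small header followed by the input bytes and space for the output. The layout below is a hypothetical illustration, not the library's format; identifying the function by its index in the registered list rather than by a raw pointer is our assumption, and task_pack() and the accessor functions are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header of a migratable task record. */
typedef struct {
    int func_index;   /* position of the task function in the registered list */
    int input_size;   /* bytes of input that follow the header */
    int output_size;  /* bytes reserved for the output after the input */
} task_header;

/* Total bytes needed to hold a task with the given buffer sizes. */
static size_t task_record_size(int input_size, int output_size) {
    return sizeof(task_header) + (size_t)input_size + (size_t)output_size;
}

/* Copy the header and the caller's input into buf, zeroing the output area.
   The caller's input may change afterwards. Returns bytes written, or 0 if
   buf is too small. */
static size_t task_pack(unsigned char *buf, size_t cap, int func_index,
                        const void *input, int input_size, int output_size) {
    size_t need = task_record_size(input_size, output_size);
    if (need > cap) return 0;
    task_header h = { func_index, input_size, output_size };
    memcpy(buf, &h, sizeof h);
    memcpy(buf + sizeof h, input, (size_t)input_size);
    memset(buf + sizeof h + (size_t)input_size, 0, (size_t)output_size);
    return need;
}

/* Views into a packed record, e.g. on the thread that stole it. */
static task_header task_header_of(const unsigned char *buf) {
    task_header h;
    memcpy(&h, buf, sizeof h);
    return h;
}

static unsigned char *task_input_of(unsigned char *buf) {
    return buf + sizeof(task_header);
}

static unsigned char *task_output_of(unsigned char *buf) {
    return buf + sizeof(task_header) + (size_t)task_header_of(buf).input_size;
}
```

With such a layout, moving a task between threads is a single bulk copy of task_record_size() bytes (in UPC, for example, with upc_memget()), which is one reason to keep the input and output contiguous.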
To exploit the PGAS support of UPC, data fields in the input and
output regions can be either values or references. However, if such
a field is a reference, it should be a pointer to shared data that
remains valid after the task is stolen by another thread. A task
function can read its local variables, its input parameter, shared
variables, and file-scope constant variables; it can write to its
local variables, its output parameter, and shared variables accessed
within its task function.
Our library maintains the content of the task output buffer and
propagates it when the task terminates. The content of this region
is undefined until the task has terminated.
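The output contract can be modelled as follows: the task body writes into a library-owned buffer, and only when the task terminates are those bytes copied to the location the spawner supplied, so before that point the destination still holds whatever it held before. The names below are hypothetical, the model is single-threaded, and input handling is simplified (the input is referenced, not copied). In Figure 2, the spawner reads x and y only after taskq_wait(), which is consistent with this contract.

```c
#include <assert.h>
#include <string.h>

#define OUT_MAX 32   /* hypothetical per-task output limit, in bytes */

typedef void (*task_fn)(void *input, void *output);

/* A task in flight: its output goes to a private buffer, not to `dest`. */
typedef struct {
    task_fn func;
    void *input;
    unsigned char out_buf[OUT_MAX];  /* library-owned output region */
    size_t out_size;
    void *dest;                      /* spawner's output location */
    int done;
} running_task;

static void task_start(running_task *t, task_fn f, void *input,
                       void *dest, size_t out_size) {
    assert(out_size <= OUT_MAX);
    t->func = f;
    t->input = input;
    t->dest = dest;
    t->out_size = out_size;
    t->done = 0;
}

/* Execute the body, then propagate the output on termination. */
static void task_run_to_completion(running_task *t) {
    t->func(t->input, t->out_buf);
    memcpy(t->dest, t->out_buf, t->out_size);  /* only now is dest updated */
    t->done = 1;
}

/* Example body: output = input + 1. */
static void incr_task(void *input, void *output) {
    int v;
    memcpy(&v, input, sizeof v);
    v += 1;
    memcpy(output, &v, sizeof v);
}
```

Writing into a private buffer rather than directly into the destination is what lets a stolen task run on another thread and still deliver its result to the original spawner.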