1 Introduction

The problem of priority scheduling is ubiquitous in computer systems, and it can be formulated abstractly as follows. There is a work-set W of tasks that must be executed by some number of processors. The time to execute a task may be unpredictable and may vary by task. When a task is executed, it may add new tasks to W. Tasks in W can be processed in any order; however, some orders may be more efficient than others—for example, the order may affect the time taken to process a given task, and it may even affect the total number of tasks created during the execution of the program. Therefore, each task has an associated integer called its priority that is an application-specific, heuristic measure of its relative importance for early scheduling. The problem of priority scheduling is to assign tasks to processors according to the specified order (priority) with the goal of minimizing the total execution time of the program.

In this paper, we focus on a particular instance of this problem that arises when implementing irregular graph algorithms such as single-source, shortest-path (sssp), preflow-push maxflow computation (pfp), Delaunay mesh generation and refinement, and betweenness-centrality (bc). Each task in such an algorithm is associated with a node called its active node [14] and it makes an update to a small region of the graph containing its active node, such as modifying node and edge data or adding and removing nodes and edges. Tasks that update disjoint regions of the graph can be executed in parallel.

An important feature of many such algorithms is that although the semantics of the algorithm permit tasks to be performed in arbitrary order, some orders may be far more efficient than others. There are several reasons for this.

  • The work-efficiency and even the asymptotic complexity of the program may depend on the schedule; sssp and preflow-push are well-known examples.

  • Some schedules may exploit locality better than others. For example, in Delaunay mesh refinement, working on recently generated triangles has significant locality benefits. It may also be desirable to schedule tasks with overlapping working sets on the same core (affinity scheduling).

  • In some algorithms such as the Metis graph partitioner [8], the quality of the result may depend on the schedule even if the asymptotic complexity does not.

Priority scheduling can be used to achieve the desired task execution order. For sssp, the priority of an active node is the length of the shortest known path from the source to that node; processing active nodes in increasing distance order, as is done by Dijkstra’s algorithm, is good for work-efficiency. For pfp, each active node is associated with an integer called its height, which is a heuristic estimate of its distance from the sink in the residual graph; processing nodes in decreasing height order improves work-efficiency [4].

Priority scheduling for sequential programs is straightforward: use a priority queue. The priority of items is defined by a user-supplied priority function that encodes the less-than relation between items. There are many implementations of priority queues; one of the most commonly used representations is a heap.
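For a sequential implementation in C++, a minimal sketch can wrap std::priority_queue with a user-supplied comparator; the Task type and its priority field below are hypothetical, and the comparison is inverted so that the smallest priority (e.g., the sssp distance) is popped first.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical task: an active node together with the priority under which it
// was scheduled (for sssp, the tentative distance of the node).
struct Task {
  uint32_t node;
  uint32_t priority;
};

// User-supplied less-than relation.  std::priority_queue is a max-heap, so the
// comparison is inverted to pop the smallest (earliest) priority first.
struct LaterPriority {
  bool operator()(const Task& a, const Task& b) const {
    return a.priority > b.priority;
  }
};

using SequentialScheduler =
    std::priority_queue<Task, std::vector<Task>, LaterPriority>;

// Typical use: push tasks as they are created, pop the earliest-priority task.
//   SequentialScheduler pq;
//   pq.push({source, 0});
//   while (!pq.empty()) { Task t = pq.top(); pq.pop(); /* execute t */ }
```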

For parallel programs, it is possible in principle to use a concurrent priority queue that uses either locks or lock-free approaches to synchronize insertions and removals from the priority queue. In this paper, we argue that concurrent priority queues are not good priority schedulers for parallel programs. Tasks in the parallel programming context may execute only a few hundred or thousand instructions; for example, sssp tasks take roughly 1,500 cycles (about 300 instructions) on the machines described in Sect. 4. Therefore, it is imperative that scheduling be a lightweight operation. In Sect. 2, we survey prior work that uses concurrent priority queues for priority scheduling. Using sssp, we show experimentally that parallel scaling is severely limited with these approaches.

To address these problems, we introduce a novel priority scheduler in Sect. 3. This ordered-by-integer-metric (obim) scheduler does not use priority queues and has much lower overhead than concurrent priority queues. Its efficiency comes from exploitation of two insights.

  • Exploiting priority inversion. Algorithms that use priorities are often robust to some priority inversion. Although a substantial number of priority inversions can hurt work efficiency, we show that allowing a small number can dramatically reduce communication, synchronization, and coordination between threads.

  • Architecture-aware design. The memory systems of multicores are hierarchical and communication between remote cores is expensive. The design of obim exploits the memory hierarchy to minimize and control coherence traffic.

In Sect. 4, we evaluate the end-to-end performance of seven irregular benchmarks that benefit from priority scheduling, using obim and concurrent priority queues on four multicore machines. For almost all machine/benchmark/input combinations, obim provides far superior performance; for some of them, the obim-based implementation is 50 times faster than a concurrent priority-queue-based implementation.

2 Prior Work on Concurrent Priority Scheduling

In this section, we evaluate the pros and cons of three different ways in which concurrent priority queues have been used in the literature to implement high-performance parallel sssp. Our conclusions apply to other irregular programs as well, but sssp is a good model problem because scheduling strategies for this problem have been studied extensively.

2.1 Schedulers Based on Priority Queues

We study three ways to use concurrent priority queues for parallel scheduling.

Heap: a central concurrent priority queue. There are many choices of concurrent priority queues, which we discuss below.

Sheap: a concurrent priority queue for each thread with work-stealing. New work created by a thread is always pushed to its own local priority queue, although it may get stolen later. Bertsekas et al. implemented one of the first parallel sssp programs using this approach [1].

Pheap: a concurrent priority queue for each thread, with logically partitioned data structures and an owner-computes rule for task assignment. When a new task is created, the owner-computes rule determines which priority queue the task is pushed onto. This policy has been used by Tang et al. [17]; it was also mentioned in [1]. Work-stealing is usually not performed.

Table 1. Number of iterations by type for sssp on machine m1 (Table 3) at 8 threads.

All of these require a concurrent priority queue. We used the concurrent priority queue from the Intel TBB library. We also evaluated a centralized priority scheduler based on a concurrent skip-list [15], but we found that the absolute performance of the TBB priority queue was substantially better; although the concurrent skip-list scaled better than the TBB priority queue, it never caught up in absolute performance. Besides concurrent skip-lists, many other concurrent priority queues have been proposed [3, 7, 16]. These have various limitations, such as being blocking, invalidation-heavy, or supporting only bounded ranges, which make them unsuitable for scheduling very small tasks on multiprocessors with high remote-cache access latency.

2.2 Priority Scheduling for Work Efficiency

For sssp, updates to the graph are called relaxations. Each node A has a label d(A) that contains the length of the shortest known path to that node from the source. For an edge \(A \rightarrow B\) with weight \(w(A,B)\), the relaxation operator updates d(B) to \(d(A)+w(A,B)\) if this value is less than the current value of d(B). Initially, only the source is active. If the distance of a node is lowered by a relaxation, it becomes active in turn. We classify the relaxations into three categories called good work, empty work, and bad work.

  • Good work: relaxation that lowers the distance value of a node to its final value.

  • Empty work: attempted relaxation to a value higher than the current value.

  • Bad work: relaxation of a label to a value greater than its final value.

Relaxations can be applied in any order but ordering them by the distance labels of the active nodes minimizes work. Dijkstra’s algorithm [5] performs only good and empty work. It uses the priority queue to store pending updates to nodes and updates the node label in the graph only when the first (smallest) update to that node reaches the head of the priority queue. In contrast, asynchronous label-correcting algorithms perform relaxations in a random order and may perform a lot of bad work [13]. Table 1 shows the breakdown of the different types of work performed by different implementations on a machine with 8 cores. The input is described in Sect. 4. The amount of good work is the same for all implementations, but the amount of bad work and empty work differ. In particular, sheap performs a lot of bad work.
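For concreteness, the following is a minimal sequential sketch of the relaxation operator just described; the adjacency-list graph representation and all names (Graph, relax, the distance array d) are illustrative, not taken from any particular implementation.

```cpp
#include <cstdint>
#include <vector>

struct Edge { uint32_t dst; uint32_t weight; };
using Graph = std::vector<std::vector<Edge>>;   // Graph[u] = outgoing edges of u

// Relax all edges out of an active node u: for each edge u -> v with weight w,
// lower d(v) to d(u) + w if that value is smaller.  Returns the nodes whose
// labels were lowered; they become active in turn.
std::vector<uint32_t> relax(const Graph& g, std::vector<uint32_t>& d, uint32_t u) {
  std::vector<uint32_t> newlyActive;
  for (const Edge& e : g[u]) {
    uint32_t candidate = d[u] + e.weight;   // d[u] is finite for an active node
    if (candidate < d[e.dst]) {             // otherwise this is empty work
      d[e.dst] = candidate;                 // good work if candidate is the final
      newlyActive.push_back(e.dst);         // value, bad work otherwise
    }
  }
  return newlyActive;
}
```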

It is useful to characterize the instantaneous behavior of these schedulers by plotting the priorities of the work processed by each thread over time. Figure 1 shows this data using the total iterations executed as a proxy for time. In each graph, there is a line for each of the 8 threads; in addition, the priority of the work processed by a sequential implementation using a heap is superimposed in black.

Fig. 1. Priorities processed over time by different implementations. Each line corresponds to the priority values processed by one thread. For reference, the sequential heap is shown in black.

Figure 1a shows that with sheap, threads quickly diverge from processing the globally earliest priority work. Threads eventually converge to processing the earlier priority work through work-stealing. Each of the drops in priorities processed corresponds to a thread stealing earlier priority work from another thread. Figure 1b shows that pheap is much better at keeping threads working on early priority work. This is because the graph is a random graph and thus fairly uniform, so the average earliest priority among t partitions (where t is the number of threads) is close to the earliest priority globally. This may not be true for non-random graphs, and the performance of pheap implementations will be poor for such graphs. Figure 1c shows that obim is successful in keeping all the threads working on the globally earliest priority work.

Although sheap performs poorly, priority scheduling using sheap may still be a significant improvement over not using priorities at all. On the same input, a random scheduler runs for several hours, while sheap finishes in about 2 minutes (obim completes in 11 s).

Parallel Overheads: If work efficiency were the only concern, choosing a parallel scheduler would be easy: always pick the one that sticks closest to the ideal priority order. However, the end-to-end performance of a program also depends on the parallel overheads of the scheduler. These overheads come from two sources: the sequential cost of performing a scheduling operation, and the synchronization and communication cost of making the scheduler concurrent. We find that the sequential scheduling cost of the heap-based variants is approximately 2x that of obim. Using a sampling profiler, we find that at 8 threads the costs diverge further. Obim scales essentially perfectly: its overhead per task is the same as in the sequential run. The concurrent heap, being a centralized data structure, scales extremely poorly, taking 14.5x more time for scheduling than it did serially; pheap takes 2x and sheap 5x more time than each did serially. As we saw in Table 1, sheap also performs significantly more iterations.

End-to-end performance of sssp: Figure 2 shows the end-to-end performance of the four implementations of sssp on a 24-core Intel Xeon. The baseline for speedup is a sequential implementation of sssp using the Intel TBB priority heap (which performed substantially better than the serial priority queue in libstdc++). The two factors discussed above, work efficiency of the algorithm and the parallel overheads of the priority scheduler, limit the speed-up of the concurrent-priority-queue-based implementations to roughly 3 on 24 cores. In contrast, the obim scheduler gives almost perfect speed-up.

2.3 Priority Scheduling for Output Quality

Priority scheduling is also useful for improving output quality in algorithms such as Metis, a multi-level graph partitioner, which uses a lowest-degree-first heuristic for graph coarsening. Figure 3 shows the effect on the edge cut, a measure of partition quality, of varying the scheduling policy in the coarsening phase. For comparison, random chooses nodes at random to match next, and simple implements a simple work-stealing scheduler. We see that obim provides consistent quality across thread counts, producing better results than random scheduling, whereas simple scheduling produces widely varying quality. In these tests, both simple and obim had similar runtimes.

Fig. 2. Speed-up of sssp on 24 cores relative to the best single-threaded version.

Fig. 3. Metis-style quality results by scheduling heuristic and threads.

Fig. 4. Priority map in obim.

3 A Scalable Priority Scheduler

The design of obim exploits the observation that algorithms that use priorities are robust to small amounts of priority inversion. This observation is used to (i) enhance parallelism by allowing each thread to schedule work asynchronously, and (ii) minimize communication by using an approximate consensus protocol with communication matched to the memory system topology. A full discussion with pseudocode can be found in [10]. A simplified, high-level picture of obim is shown in Fig. 4. The obim scheduler is built out of bags, which are used to hold tasks at the same priority level, and priority maps, which are used to hold a collection of bags at different priority levels.

3.1 Implementation of Bags

There is one bag per priority level in the entire system but it is implemented in a distributed, machine-topology-aware way as follows. For a given bag, each core has a data structure called a chunk, which is a ring-buffer that can contain 8–64 tasks (size chosen at compile time). In addition, each package has a list of chunks. When the chunk associated with a core becomes full, it is moved to the package-level list. When the chunk associated with a core becomes empty, the core probes its package-level list to obtain a chunk. If the package-level list is also empty, the core probes the lists of other packages to find work. To reduce traffic on the inter-package connection network, only one hungry core hunts for work in other packages on behalf of all hungry cores in a package.
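A minimal sketch of this organization is shown below. It is deliberately simplified: the per-core chunk is a fixed-size ring buffer, the package-level list is protected by a plain mutex rather than the lock-free structure a production scheduler would use, probing of other packages is only indicated in a comment, and the topology query packageOf is assumed to be provided by the runtime.

```cpp
#include <array>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

constexpr int CHUNK_SIZE = 64;          // in obim, chosen at compile time (8-64)

int packageOf(int core);                // topology query, assumed provided elsewhere

template <typename Task>
struct Chunk {                          // ring buffer, touched only by its owning core
  std::array<Task, CHUNK_SIZE> buf;
  int head = 0, tail = 0, count = 0;
  bool push(const Task& t) {
    if (count == CHUNK_SIZE) return false;
    buf[tail] = t; tail = (tail + 1) % CHUNK_SIZE; ++count; return true;
  }
  std::optional<Task> pop() {
    if (count == 0) return std::nullopt;
    Task t = buf[head]; head = (head + 1) % CHUNK_SIZE; --count; return t;
  }
};

template <typename Task>
struct PackageList {                    // full chunks shared within one package
  std::mutex lock;
  std::deque<Chunk<Task>*> chunks;
  void push(Chunk<Task>* c) { std::lock_guard<std::mutex> g(lock); chunks.push_back(c); }
  Chunk<Task>* pop() {
    std::lock_guard<std::mutex> g(lock);
    if (chunks.empty()) return nullptr;
    Chunk<Task>* c = chunks.back(); chunks.pop_back(); return c;
  }
};

template <typename Task>
struct Bag {                            // one bag per priority level, distributed by topology
  std::vector<Chunk<Task>> perCore;     // current chunk of each core
  std::vector<PackageList<Task>> perPackage;

  void push(int core, const Task& t) {
    if (!perCore[core].push(t)) {                       // chunk full:
      perPackage[packageOf(core)].push(new Chunk<Task>(perCore[core]));
      perCore[core] = Chunk<Task>();                    // start a fresh chunk
      perCore[core].push(t);
    }
  }

  std::optional<Task> pop(int core) {
    if (auto t = perCore[core].pop()) return t;
    if (Chunk<Task>* c = perPackage[packageOf(core)].pop()) {   // refill from package list
      perCore[core] = *c; delete c;
      return perCore[core].pop();
    }
    return std::nullopt;   // a real implementation would now probe other packages
  }
};
```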

3.2 Implementation of Priority Map

The priority map is also implemented in a distributed way by (i) a global map of priorities to bags, and (ii) an approximate copy of the global map within each thread. Each thread operates on its thread local map, synchronizing with the global map only when necessary, as explained next.

The thread-local map is implemented by a non-concurrent sorted vector of pairs; the implementation is straightforward and not presented. Each thread also maintains a version number recording the last version of the global map it synchronized with, as well as the current priority it is working on and the bag for that priority. This priority and bag are used by the thread for pop operations.

To minimize synchronization overhead, the global map uses a log-based structure which stores bag-priority pairs created by insert operations on the global map. Each insertion operation also updates a global version number, which corresponds to the length of the log. When a thread cannot find a bag for a particular priority using only its local map, it must synchronize with the global map and possibly create a new mapping there. A thread atomically appends a record to the log and increments the version number. The implementation ensures that the log can be appended in the presence of concurrent readers without requiring locks.
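The sketch below illustrates this split into a log-based global map and per-thread local maps. It is a simplification: in particular, a real implementation lays out the log so that readers can traverse it without locks while a writer appends (for example by pre-allocating it or chaining fixed-size blocks), which the plain std::vector used here does not guarantee.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

template <typename Bag>
struct GlobalMap {
  std::vector<std::pair<int, Bag*>> log;   // append-only (priority, bag) records
  std::atomic<std::size_t> version{0};     // == length of the log
  std::mutex writeLock;                    // writers serialize; readers never lock

  Bag* append(int priority, Bag* b) {      // called when a new priority level appears
    std::lock_guard<std::mutex> g(writeLock);
    log.emplace_back(priority, b);
    version.store(log.size(), std::memory_order_release);
    return b;
  }
};

template <typename Bag>
struct LocalMap {
  std::vector<std::pair<int, Bag*>> sorted;  // thread-private, sorted by priority
  std::size_t lastVersion = 0;

  void sync(const GlobalMap<Bag>& g) {       // replay log entries added since last sync
    std::size_t v = g.version.load(std::memory_order_acquire);
    for (std::size_t i = lastVersion; i < v; ++i) {
      auto rec = g.log[i];
      sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), rec,
                        [](const auto& a, const auto& b) { return a.first < b.first; }),
                    rec);
    }
    lastVersion = v;
  }

  Bag* find(int priority) const {            // no bag known locally -> nullptr
    auto it = std::lower_bound(sorted.begin(), sorted.end(), priority,
                        [](const auto& p, int pr) { return p.first < pr; });
    return (it != sorted.end() && it->first == priority) ? it->second : nullptr;
  }
};
```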

Push: A thread pushing a task uses its local map to find the bag to insert into. If the local map has no bag for that priority, the global map is consulted; if the bag is found there, the local map is updated appropriately. If the priority of the pushed item is earlier than the thread's current priority, the thread immediately updates its current working priority so that it operates on the earlier-priority work.
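Continuing the sketch above, a push operation might look as follows. ThreadState and createMapping are illustrative names; the bag's push is assumed to take the calling thread's id, and each thread is assumed to be pinned to one core so the thread id doubles as the core index.

```cpp
#include <climits>

template <typename Bag>
struct ThreadState {                  // per-thread scheduler state (illustrative)
  LocalMap<Bag> local;
  int currentPriority = INT_MAX;
  Bag* currentBag = nullptr;
  int threadId = 0;
};

// Re-checks the global log under the write lock and appends a new bag for
// `priority` if no other thread created one in the meantime.
template <typename Bag>
Bag* createMapping(GlobalMap<Bag>& g, int priority);

template <typename Task, typename Bag>
void push(ThreadState<Bag>& ts, GlobalMap<Bag>& g, int priority, const Task& t) {
  Bag* bag = ts.local.find(priority);
  if (bag == nullptr) {
    ts.local.sync(g);                       // catch up with the global log
    bag = ts.local.find(priority);
    if (bag == nullptr)
      bag = createMapping(g, priority);     // create the mapping globally
  }
  bag->push(ts.threadId, t);
  if (priority < ts.currentPriority) {      // earlier-priority work: switch to it
    ts.currentPriority = priority;
    ts.currentBag = bag;
  }
}
```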

Pop: To keep close to the ideal schedule, all threads should be working on the earliest-priority work. We adopt the heuristic that threads scan for earlier-priority work only when they find that the bag they are working on is empty. Thus, if the bag for the current priority is not empty, a task from that bag is retrieved. Otherwise, the thread scans the priority space looking for the earliest available work. We call this procedure back-scan.

Because a scan over the entire global map can be expensive, especially if there are many bags (which often happens with algorithms on high-diameter graphs), an approximate consensus heuristic is used to locally estimate the earliest priority work available and to prune the length of the back-scans, which we call back-scan prevention. Each thread makes available its estimate of the earliest priority work. When a thread needs to scan for work, it looks at this value for all threads that share the same package and uses the earliest priority it finds to start the scan for work. To propagate information between packages, in addition to scanning all the threads in its package, one leader thread per package will scan the other package leaders. This restriction allows most threads to incur only a small amount of local communication.

Once a thread has a starting point for a scan, it simply tries to pop work from each bag from the scan point onwards. The implementation ensures that attempting to pop from empty bags does not perform any writes to shared-memory, so popping from an empty bag, while not free, does not incur poor locality or communication. This back-scan prevention method is especially effective in many algorithms because it exploits the common structure of priority spaces. In most algorithms such as BFS, the priority space is populated monotonically: processing work at one priority will usually generate work at the same or later priority. Thus back-scan prevention can easily limit the scan to just a few bags.
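Putting the pieces together, a pop with back-scan and back-scan prevention might look like the following sketch. The set of thread states visible to the scanning thread (those in the same package, plus the other leaders if the thread is a package leader) is assumed to be supplied by the runtime, and reading another thread's currentPriority is treated as an approximate read; a real implementation would use an atomic field.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

template <typename Task, typename Bag>
std::optional<Task> pop(ThreadState<Bag>& ts, GlobalMap<Bag>& g,
                        const std::vector<ThreadState<Bag>*>& nearby) {
  // Fast path: keep draining the current bag.
  if (ts.currentBag != nullptr) {
    if (auto t = ts.currentBag->pop(ts.threadId)) return t;
  }
  // Back-scan prevention: start the scan at the earliest priority any nearby
  // thread claims to be working on.
  int start = ts.currentPriority;
  for (const auto* other : nearby)
    start = std::min(start, other->currentPriority);
  // Back-scan: probe bags from the start point onwards.  Probing an empty bag
  // performs no shared-memory writes, so over-scanning is cheap.
  ts.local.sync(g);
  for (auto& [prio, bag] : ts.local.sorted) {
    if (prio < start) continue;
    if (auto t = bag->pop(ts.threadId)) {
      ts.currentPriority = prio;
      ts.currentBag = bag;
      return t;
    }
  }
  return std::nullopt;   // no work found; the caller may retry or terminate
}
```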

Table 2. Obim variants used in evaluation.
Table 3. Properties of machines used in the evaluation.

3.3 Evaluation of OBIM Design Choices

To evaluate the obim design decisions, we implemented several de-optimized variants of the obim scheduler. Table 2 lists these variants, which focus on two main optimizations: (i) the use of distributed bags and (ii) back-scan prevention. Table 3 shows the four machines we used for the evaluation. The numa8x4 is an SGI Ultraviolet (strong NUMA). The other three machines are standard Intel Xeons with multiple packages connected by QPI.

We use three inputs which stress the priority scheduler in different ways. The first input is a large random graph, which has many work items and stresses the bag implementation. The second one is the USA road network, which is a smaller graph with a large diameter. It stresses the efficiency of the priority map implementation and the ability of the scheduler to find highest priority work efficiently. The third input is a scale-free rmat graph of \(2^{27}\) nodes.

Fig. 5. Scaling of obim variants for sssp.

The bottom row of Fig. 5 shows the speedup of sssp for the small input for the four obim variants on the four machines. Speedup is relative to the best overall single-threaded execution time. The first conclusion is that the back-scan optimization is critical for performance: peak speedup goes from 2.5 (cmn and dmn) to 5 (cmb and dmb). Given the back-scan optimization (cmb and dmb), the second conclusion is that using distributed bags is also important for performance: without this optimization, speedup is never more than 5 on any machine. Without back-scan prevention, a distributed bag is less efficient than a centralized one on this high-diameter input because it is more efficient to check that a centralized bag is still empty than it is to perform this check on a distributed bag.

The top row of Fig. 5 shows the speedup of sssp for the large input. We see that for this input, back-scan prevention is almost irrelevant. However, distributed bags are even more important on this input than for the small input, pushing scaling from 8 to 25. The power-law graph (rmat) behaves similarly to the large graph. Machine m2x4 did not have enough RAM to load the rmat graph.

We investigated how the differences between variants manifest at the architectural level through sample-based profiling using hardware counters. Briefly, back-scan prevention significantly reduces CPU cycles, chiefly by reducing the total instruction count. The communication profiles with back-scan prevention (dmb) and without it (dmn) are similar at the L3 level. This shows that (a) the amount of communication added to perform priority consensus is small, and (b) making bag probes write-free reduces communication far more than avoiding those probes via back-scan prevention does. The former optimization is shared by both variants, and adding back-scan prevention does not change the L3 profile.

The second dimension we investigate is centralized versus distributed bags. There is little difference in the total number of instructions executed between these two classes of implementations: we find no more than a 6% difference at 24 threads. However, the centralized bags incur more than twice the communication cost of the per-package bags.

Table 4. Test programs and inputs

4 Experimental Evaluation

We implemented seven applications on the four machines in Table 3, using obim and the three priority-queue-based schedulers. All the machines run Linux 2.6.32 with gcc 4.6. Processor affinity was used (and is necessary for the topology-aware code). Each application was run on a large input graph and a small input graph; for lack of space, we only show the results for the large input graph.

We used the following seven applications from the Lonestar benchmark suite [9] in our study. The Lonestar suite publishes comparisons of these benchmarks to third-party serial and parallel implementations, so we do not repeat those results here; descriptions of the benchmarks are likewise published with Lonestar. Brief descriptions of the benchmark programs and inputs are given in Table 4.

Fig. 6. Speedup. “#” indicates runs that timed out after ten minutes.

4.1 End-to-End Performance

Figure 6 shows speed-up for large inputs, relative to the best serial times for these algorithms. First, we see that for most applications the obim scheduler gives the best performance even on one thread. This is due to the lower overhead of pushing and popping tasks with obim’s bucketing scheme compared to a heap; the heap’s minor improvements in scheduling order do not make up for this overhead.

Second, at full scale, the obim scheduler is almost always substantially faster than all the priority-queue-based implementations. For instance, at 24 threads on machine m4x6, on sssp with the large input, obim (4.4 s) is about 7 times faster than the partitioned heap schedulers (about 32 s) and 50 times faster than a concurrent heap (about 227 s). There are a few application/machine combinations, such as avi, for which obim is slower than the heap-based schedulers on one thread, but as the number of threads increases, the performance of obim surpasses that of the other priority schedulers for almost all application/machine combinations.

The results for numa8x4 show that on a machine with a high NUMA penalty, obim does not scale much beyond one NUMA node, although it does perform better than the other schedulers. All the machines are NUMA, but the latency penalty on numa8x4 is significantly higher. We do not optimize for memory-bank locality when scheduling work; we only distribute the graph evenly between nodes. Graph partitioning and partitioning-aware scheduling must be applied in this case; we leave this for future work.

Fig. 7. Total iterations relative to best sequential scheduler for each combination. Outliers exist for pheap (28) and sheap (16).

4.2 Differences in Application-Level Work

Figure 7 shows how many iterations were executed with each scheduler for each application, input, and machine, relative to the best single-threaded scheduler. An important caveat is that the number of iterations is only a rough, though easily understood, proxy for the total amount of useful work, as discussed in Sect. 2.2.

First, we see that sheap and pheap can perform many more iterations than obim; the extra iterations come from priority inversion. Second, the heap scheduler can sometimes generate more work than the serial heap scheduler, because in the parallel implementation, pushes and pops from different threads can interleave in a different order than in the sequential implementation. For a few benchmark/input/machine combinations, obim performs more iterations than the best single-threaded scheduler for that combination.

Fig. 8. Scaling of Metis for coarsening, initial partitioning, and refinement phases as well as total scaling.

4.3 A Full Application: Metis

We also evaluated obim for parallelizing a complete application, the Metis graph partitioner [8]. Figure 8 shows the scaling of Metis as well as the scaling of the coarsening, initial partitioning, and refinement phases on m4x10 (Table 3). Creating 4 partitions of the USA road map takes roughly 35 s with sequential Metis, 4 s with our parallel Metis, and 2 s with Mt-Metis, a hand-parallelized version of Metis from the University of Minnesota, while creating 1000 partitions takes roughly 38 s, 5 s, and 7 s, respectively. The Mt-Metis program uses data structures optimized for graph partitioning, while we use generic Galois data structures.

5 Conclusion

We presented a concurrent priority scheduler called the ordered-by-integer-metric (obim) scheduler, which (i) exploits the seemingly innocuous fact that algorithms amenable to priority scheduling are usually robust to small deviations from a strict priority schedule, and (ii) is optimized for the cache hierarchy of current multicore processors. Across a suite of seven complex, irregular benchmarks and four machines, we showed that implementations that use obim almost always outperformed implementations that used concurrent priority queues; for some benchmarks, end-to-end performance improved by a factor of 50. We also showed that obim could be used to successfully parallelize Metis, a complete and complex application, improving running time by roughly a factor of 10 compared to sequential Metis.