1. Introduction
Many important graph problems can be implemented using either ordered or unordered parallel algorithms. Ordered algorithms process active vertices following a dynamic priority-based ordering, potentially reducing redundant work. By contrast, unordered algorithms process active vertices in an arbitrary order, improving parallelism while potentially performing a significant amount of redundant work. In practice, optimized ordered graph algorithms are up to two orders of magnitude faster than the unordered versions (Dhulipala:2017; Hassaan:2011:OVU:1941553.1941557; DBLP:journals/corr/BeamerAP15; Hassaan:2015:KDG:2694344.2694363), as shown in Figure 1. For example, computing single-source shortest paths (SSSP) on graphs with non-negative edge weights can be implemented using either the Bellman-Ford algorithm (BellmanFord) (an unordered algorithm) or the Δ-stepping algorithm (MEYER2003114) (an ordered algorithm). (Footnote 1: In this paper, we define Δ-stepping as an ordered algorithm, in contrast to previous work (Hassaan:2011:OVU:1941553.1941557), which defines Δ-stepping as an unordered algorithm.) Bellman-Ford updates the shortest path distances to all active vertices on every iteration. Δ-stepping, on the other hand, reduces the number of vertices that need to be processed in every iteration by first updating path distances to vertices that are closer to the source vertex, before processing vertices farther away (detailed descriptions of the algorithms are provided in the supplemental materials).
Writing high-performance ordered graph algorithms is challenging for users who are not experts in performance optimization. Existing frameworks that support ordered graph algorithms (Dhulipala:2017; DBLP:journals/corr/BeamerAP15; nguyen13sospgalois) require users to be familiar with C/C++ data structures, atomic synchronization, bitvector manipulation, and other performance optimizations. For example, Figure 2 shows a snippet of a user-defined function for Δ-stepping in Julienne (Dhulipala:2017), a state-of-the-art framework for ordered graph algorithms. The code involves atomics and low-level C/C++ operations.
We propose a new programming framework, PriorityGraph, that simplifies writing parallel ordered graph algorithms. PriorityGraph separates algorithm specifications from performance optimization strategies. The user specifies the high-level algorithm with the algorithm language and uses a separate scheduling language to configure different performance optimizations. The algorithm language introduces a set of priority-based data structures and operators to maintain execution ordering, while hiding low-level details such as synchronization, deduplication, and the data structures that maintain the ordering of execution. Figure 3 shows the implementation of Δ-stepping using PriorityGraph, which dequeues vertices with the lowest priority and updates their neighbors’ distances in each round of the while loop. The while loop terminates when all the vertices’ distances are finalized. The algorithm uses an abstract priority queue data structure, pq (Line LABEL:line:pq_decl), and the operators updatePriorityMin (Line LABEL:line:updatePriorityMin) and dequeueReadySet (Line LABEL:line:dequeueReadySet) to maintain priorities.
PriorityGraph uses a bucketing data structure (Dhulipala:2017; DBLP:journals/corr/BeamerAP15) to maintain the execution ordering. Each bucket stores active vertices of the same priority, and the buckets are sorted in priority order. The program processes one bucket at a time in priority order and dynamically moves active vertices to new buckets when their priorities change. Updates to the bucket structure can be implemented using either an eager bucket update approach (DBLP:journals/corr/BeamerAP15) or a lazy bucket update approach (Dhulipala:2017). With eager bucket updates, buckets are updated immediately when the priorities of active vertices change. Lazy bucketing buffers the updates and later performs a single bucket update per vertex. Existing frameworks for ordered parallel graph algorithms support only one of these two bucketing strategies. However, using a suboptimal bucketing strategy can result in a slowdown of more than 10×, as we show later. Eager and lazy bucketing implementations use different data structures and parallelization schemes, making it difficult to combine both approaches within a single framework.
In PriorityGraph, programmers can switch between lazy and eager bucket update strategies, and combine bucketing optimizations with other edge traversal optimizations using the scheduling language. The compiler leverages program analyses and transformations to generate efficient code for different optimizations. The separation of algorithm and schedule also enables us to build an autotuner for PriorityGraph that can automatically find high-performance combinations of optimizations for a given algorithm and graph.
Bucketing incurs high synchronization overheads, slowing down algorithms that spend most of their time on bucket operations. We introduce a new performance optimization, bucket fusion, which drastically reduces synchronization overheads. In an ordered algorithm, a bucket can be processed in multiple rounds under a bulk synchronous processing execution model. In every round, the current bucket is emptied, and vertices whose priorities are updated to the current bucket’s priority are added to the bucket. The algorithm moves on to the next bucket when no more vertices are added to the current bucket. The key idea of bucket fusion is to fuse consecutive rounds that process the same bucket. Using bucket fusion in PriorityGraph results in up to 3× speedup on road networks with large diameters over existing work.
We implement PriorityGraph as a language and compiler extension to GraphIt (graphit:2018), a domain-specific language for writing high-performance graph algorithms. PriorityGraph achieves up to 3× speedup on six ordered graph algorithms (Δ-stepping based single-source shortest paths (SSSP), Δ-stepping based point-to-point shortest path (PPSP), weighted BFS (wBFS), A* search, k-core decomposition, and approximate set cover (SetCover)) over the fastest state-of-the-art frameworks that support ordered algorithms (Julienne (Dhulipala:2017) and Galois (nguyen13sospgalois)) and hand-optimized implementations (GAPBS (DBLP:journals/corr/BeamerAP15)). Figure 4 shows that PriorityGraph is up to 16.9× and 1.94× faster than Julienne and Galois, respectively, on the four selected algorithms and supports more algorithms than Galois. Using PriorityGraph also reduces the lines of code compared to existing frameworks and libraries by up to 4×.
The contributions of this paper are as follows.

PriorityGraph, a new priority-based programming model that simplifies the programming of ordered graph algorithms and makes it easy to switch between and combine different optimizations (Section 4).

Compiler extensions that leverage program analyses, program transformations, and code generation to produce efficient implementations with different combinations of optimization strategies (Section 5).

A comprehensive evaluation of PriorityGraph that shows that it is up to 3× faster than state-of-the-art graph frameworks on six ordered graph algorithms (Section 6). PriorityGraph also significantly reduces the lines of code compared to existing frameworks and libraries.
2. Preliminaries
We first define ordered graph processing as used throughout this paper. Each vertex $v$ has a priority $p_v$. Initially, users can explicitly initialize the priorities of vertices, with the default priority being a null value. These priorities change dynamically throughout the execution. However, the priorities can only change monotonically; that is, they can only be increased or only be decreased. We say that a vertex is finalized if its priority can no longer be updated. The vertices are processed and finalized based on the sorted priority ordering. By default, the ordered execution stops when all vertices with non-null priority values are finalized. Alternatively, the user can define a customized stop condition, for example to halt once a certain vertex has been finalized.
We define priority coarsening as an optimization that coarsens the priority value $p_v$ of a vertex to $p'_v$ by dividing the original priority by a coarsening factor $\Delta$ such that $p'_v = \lfloor p_v / \Delta \rfloor$. The optimization is inspired by Δ-stepping for SSSP, and enables greater parallelism at the cost of losing some algorithmic work-efficiency. Priority coarsening is used in algorithms that tolerate some priority inversions, such as A* search, SSSP, and PPSP, but not in k-core and SetCover.
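To make the effect of coarsening concrete, the following minimal Python sketch groups vertices into buckets by their coarsened priority. The vertex names, distances, and helper names (`coarsen`, `group_by_bucket`) are hypothetical and chosen only for illustration; integer division stands in for the division by the coarsening factor described above.

```python
def coarsen(priority, delta):
    """Map a raw priority to its coarsened bucket index (integer division by delta)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return priority // delta


def group_by_bucket(dist, delta):
    """Group vertices by coarsened priority; returns {bucket index: [vertices]}."""
    buckets = {}
    for vertex, d in dist.items():
        buckets.setdefault(coarsen(d, delta), []).append(vertex)
    return buckets


# Hypothetical tentative distances for five vertices.
dist = {"a": 0, "b": 3, "c": 7, "d": 8, "e": 15}

fine = group_by_bucket(dist, 1)    # no coarsening: one bucket per distinct distance
coarse = group_by_bucket(dist, 8)  # distances 0..7 share bucket 0, 8..15 share bucket 1
print(len(fine), coarse)
```

With a factor of 8, the five buckets collapse into two, so more vertices can be processed in parallel per bucket, while vertices within a bucket are no longer processed in strict distance order.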
3. Performance Optimizations for Ordered Graph Algorithms
We use Δ-stepping for single-source shortest paths (SSSP) as a running example to illustrate the performance tradeoffs between the lazy and eager bucket update approaches, and to introduce our new bucket fusion optimization.
3.1. Lazy Bucket Update
We first consider using the lazy bucket update approach for the Δ-stepping algorithm, with pseudocode for the approach shown in Figure 5. The algorithm first constructs a bucketing data structure in Line 3, which groups the vertices into buckets according to their priority. It then repeatedly extracts the bucket with the minimum priority (Line 6), and finishes the computation once all of the buckets have been processed (Lines 5–13). To process a bucket, the algorithm iterates over each vertex in the bucket, and updates the priority of its outgoing neighbor destination vertices by updating the neighbor’s distance (Line 10). With priority coarsening, the algorithm computes the new priority by dividing the distance by the coarsening factor Δ. The corresponding bucket update (the vertex and its updated priority) is added to a buffer with a synchronized append (Line 11). The syncAppend can be implemented using atomic operations, or with a prefix sum to avoid atomics. The buffer is later reduced so that each vertex has only one final bucket update (Line 12). Finally, the buckets are updated in bulk with bulkUpdateBuckets (Line 13).
The lazy bucket update approach can be very efficient when a vertex changes buckets multiple times within a round. The lazy approach buffers the bucket updates, and makes a single insertion to the final bucket. Furthermore, the lazy approach can be combined with other optimizations, such as histogram-based reduction on priority updates, to further reduce runtime overheads. However, the lazy approach adds runtime overhead from maintaining a buffer (Line 7) and performing reductions on the buffer (Line 12) at the end of each round. These overheads can incur a significant cost in cases where there are only a few updates per round (e.g., in SSSP on large-diameter road networks).
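The buffer, reduce, and bulk-update steps described above can be sketched sequentially in Python. This is an illustrative simplification, not the pseudocode of Figure 5: the function name, the dictionary-based graph representation, and the example graph are assumptions, the synchronized append is replaced by a plain list append, and stale bucket entries are filtered when a bucket is processed instead of being removed when a vertex moves.

```python
import math


def delta_stepping_lazy(graph, source, delta):
    """Sequential sketch of delta-stepping with lazy bucket updates.

    graph maps each vertex to a list of (neighbor, non-negative weight) pairs.
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    buckets = {0: {source}}              # bucket index -> set of active vertices
    while buckets:
        cur = min(buckets)               # extract the bucket with minimum priority
        frontier = buckets.pop(cur)
        buffer = []                      # buffered (vertex, new bucket) updates
        for u in frontier:
            if dist[u] // delta != cur:
                continue                 # stale entry: u already moved to a lower bucket
            for v, w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    buffer.append((v, dist[v] // delta))
        # Reduce the buffer so that each vertex keeps one final (lowest) bucket.
        final = {}
        for v, b in buffer:
            final[v] = min(b, final.get(v, b))
        # Bulk update: a single bucket insertion per updated vertex.
        for v, b in final.items():
            buckets.setdefault(b, set()).add(v)
    return dist


example = {0: [(1, 2), (2, 9)], 1: [(2, 3), (3, 10)], 2: [(3, 1)], 3: []}
print(delta_stepping_lazy(example, 0, 4))  # {0: 0, 1: 2, 2: 5, 3: 6}
```

In a parallel setting, the reduction step is what lets a vertex whose distance improves several times in one round be inserted into a bucket only once.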
3.2. Eager Bucket Update
Another approach for implementing Δ-stepping is to use an eager bucket update approach (shown in Figure 6) that directly updates the bucket of a vertex when its priority changes. The algorithm is naturally implemented using thread-local buckets, which are updated in parallel across different threads (Line 9). Each thread works on a disjoint subset of vertices in the current bucket (Line 10). Using thread-local buckets avoids atomic synchronization overheads on bucket updates (Lines 3 and 12–13). To extract the next bucket, the algorithm first identifies the smallest priority across all threads and then has each thread copy its local bucket of that priority to a global minBucket (Line 8). If a thread does not have a local bucket of the next smallest priority, it skips the copying process. Copying local buckets into a global bucket helps redistribute the work among threads for better load balancing.
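The following Python sketch simulates the eager approach with thread-local bins. It is a sequential stand-in under stated assumptions: the "threads" are loop iterations that each take a disjoint slice of the shared frontier, the function and variable names are hypothetical, and no deduplication is performed, so a vertex may appear in a frontier more than once.

```python
import math


def delta_stepping_eager(graph, source, delta, num_threads=2):
    """Sequential simulation of delta-stepping with eager, thread-local bucket updates.

    Returns (dist, rounds), where rounds counts globally synchronized rounds.
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    # local_bins[t][b] holds the vertices that simulated thread t put in bucket b.
    local_bins = [{} for _ in range(num_threads)]
    frontier, cur, rounds = [source], 0, 0
    while frontier:
        rounds += 1
        for t in range(num_threads):
            # Each simulated thread works on a disjoint slice of the frontier.
            for u in frontier[t::num_threads]:
                if dist[u] // delta < cur:
                    continue             # stale entry: u was finalized in a lower bucket
                for v, w in graph[u]:
                    new_dist = dist[u] + w
                    if new_dist < dist[v]:
                        dist[v] = new_dist
                        # Eager update: insert into this thread's own bin right away.
                        local_bins[t].setdefault(new_dist // delta, []).append(v)
        # Agree on the smallest non-empty bucket index across all threads.
        nonempty = [b for bins in local_bins for b, vs in bins.items() if vs]
        if not nonempty:
            break
        cur = min(nonempty)
        # Each thread copies its local bucket of that priority to the global frontier.
        frontier = []
        for bins in local_bins:
            frontier.extend(bins.pop(cur, []))
    return dist, rounds


example = {0: [(1, 2), (2, 9)], 1: [(2, 3), (3, 10)], 2: [(3, 1)], 3: []}
dist, rounds = delta_stepping_eager(example, 0, 4)
print(dist)  # {0: 0, 1: 2, 2: 5, 3: 6}
```

Because each simulated thread appends only to its own bins, the bucket update itself needs no atomics; the only shared step is choosing the next bucket and rebuilding the global frontier.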
3.3. Eager Bucket Fusion Optimization
A major challenge in bucketing is that a large number of buckets need to be processed, resulting in thousands or even tens of thousands of processing rounds. Since each round requires at least one global synchronization, reducing the number of rounds while maintaining priority ordering can significantly reduce synchronization overhead.
Often in practice, many consecutive rounds process a bucket of the same priority. For example, in Δ-stepping, the priorities of vertices that are higher than the current priority can be lowered by edge relaxations to the current priority in a single round. As a result, the same priority bucket may be processed again in the next round. The process repeats until no new vertices are added to the current bucket. This pattern is common in ordered graph algorithms that use priority coarsening. We observe that rounds processing the same bucket can be fused without violating priority ordering.
Based on this observation, we propose a novel bucket fusion optimization for the eager bucket update approach that allows a thread to execute the next round processing the current bucket without synchronizing with other threads. We illustrate bucket fusion using the Δ-stepping algorithm in Figure 7. The same optimization can be applied in other applications, such as wBFS, A* search, and point-to-point shortest path. This algorithm extends the eager bucket update algorithm (Figure 6) by adding a while loop inside each local thread (Figure 7, Line 14). The while loop executes if the current local bucket is non-empty. If the current local bucket’s size is below a certain threshold, the algorithm immediately processes the current bucket without synchronizing with other threads (Figure 7, Line 16). If the current local bucket is large, it is copied over to the global bucket and distributed across other threads. The threshold is important to avoid creating straggler threads that process too many vertices, leading to load imbalance. The bucket processing logic in the while loop (Figure 7, Lines 17–20) is the same as the original processing logic (Figure 7, Lines 10–13). This optimization is hard to apply to the lazy approach since a global synchronization is needed before bucket updates.
Bucket fusion is particularly useful for road networks, as there are many rounds processing the same bucket. For example, bucket fusion reduces the number of rounds by more than 30× for SSSP on the RoadUSA graph, leading to more than 3× speedup by significantly reducing the amount of global synchronization (details in Section 6).
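The reduction in rounds can be illustrated with a single-thread Python sketch that counts globally synchronized rounds with and without fusion. The function name `sssp_rounds`, the `fusion_threshold` parameter, and the 8-vertex path graph are assumptions made for this example; the path graph is a toy stand-in for a large-diameter road network.

```python
import math


def sssp_rounds(graph, source, delta, fusion_threshold=0):
    """Single-thread sketch of eager delta-stepping; returns (dist, global_rounds).

    fusion_threshold=0 disables fusion. Otherwise, after each round the thread keeps
    draining its local bin for the current bucket while that bin is non-empty and
    smaller than the threshold, without starting a new global round.
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    bins = {}                            # local bins: bucket index -> list of vertices
    frontier, cur, rounds = [source], 0, 0

    def process(vertices):
        for u in vertices:
            if dist[u] // delta < cur:
                continue                 # stale entry
            for v, w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    bins.setdefault(dist[v] // delta, []).append(v)

    while frontier:
        rounds += 1                      # one global synchronization per round
        process(frontier)
        # Bucket fusion: fused rounds on the same bucket, no synchronization.
        while 0 < len(bins.get(cur, [])) < fusion_threshold:
            process(bins.pop(cur))
        nonempty = [b for b, vs in bins.items() if vs]
        if not nonempty:
            break
        cur = min(nonempty)
        frontier = bins.pop(cur)
    return dist, rounds


# A path 0 -> 1 -> ... -> 7 with unit weights.
path = {i: [(i + 1, 1)] for i in range(7)}
path[7] = []

_, plain_rounds = sssp_rounds(path, 0, delta=4)
_, fused_rounds = sssp_rounds(path, 0, delta=4, fusion_threshold=4)
print(plain_rounds, fused_rounds)  # 8 2
```

On this path, every hop within a bucket costs one synchronized round without fusion, whereas with fusion each of the two buckets is drained in a single round. The threshold bounds how much work one thread may absorb before the bucket is handed back for redistribution.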
4. Programming Model
Table 1. Priority-based data structures and operators in the algorithm language.

| Edgeset Apply Operator | Return Type | Description |
| --- | --- | --- |
| applyUpdatePriority(func f) | none | Applies f(src, dst) to every edge. The f function updates priorities of vertices. |

| Priority Queue Operators | Return Type | Description |
| --- | --- | --- |
| new priority_queue(bool allow_priority_coarsening, string priority_direction, vector priority_vector, Vertex optional_start_vertex) | priority_queue | The constructor for the priority queue. It specifies whether priority coarsening is allowed or not, whether higher or lower priority gets executed first, the vector that is used to compute the priority values, and an optional start vertex. A lower_first priority direction specifies that lower priority values are executed first, whereas a higher_first indicates higher priority values are executed first. |
| pq.dequeueReadySet() | vertexset | Returns a bucket with all the vertices that are currently ready to be processed. |
| pq.finished() | bool | Checks if there is any bucket left to process. |
| pq.finishedVertex(Vertex v) | bool | Checks if a vertex’s priority is finalized (finished processing). |
| pq.getCurrentPriority() | priority_type | Returns the priority of the current bucket. |
| pq.updatePriorityMin(Vertex v, ValT new_val) | void | Decreases the value of the priority of the specified vertex v to the new_val. |
| pq.updatePriorityMax(Vertex v, ValT new_val) | void | Increases the value of the priority of the specified vertex v to the new_val. |
| pq.updatePrioritySum(Vertex v, ValT sum_diff, ValType min_threshold) | void | Adds sum_diff to the priority of the Vertex v. The user can specify an optional minimum threshold so that the priority will not go below the threshold. |

PriorityGraph separates the algorithm specification from the performance optimizations, similar to Halide (RaganKelley:2013:HLC:2499370.2462176) and GraphIt (graphit:2018). The user writes high-level algorithm specifications using the algorithm language and specifies optimization strategies using the scheduling language. PriorityGraph introduces a set of priority-based data structures and operators to maintain execution ordering in the algorithm language and supports bucketing optimizations in the scheduling language. PriorityGraph’s programming model is built on top of GraphIt.
4.1. Algorithm Language
The algorithm language exposes opportunities for eager bucket update, eager update with bucket fusion, lazy bucket update, and other optimizations. The high-level operators hide low-level implementation details such as atomic synchronization, deduplication, bucket updates, and priority coarsening. The algorithm language shares with the GraphIt algorithm language the vertex and edge sets, as well as the operators that apply user-defined functions on those sets.
PriorityGraph proposes high-level priority queue-based abstractions to switch between thread-local and global buckets. PriorityGraph also introduces priority update operators to hide the bucket update mechanisms, and provides a new edgeset apply operator, applyUpdatePriority. The priority-based data structures and operators are shown in Table 1.
Figure 3 shows an example of Δ-stepping for SSSP. PriorityGraph works on vertex and edge sets. The algorithm specification first sets up the edgeset data structures (Lines LABEL:line:vertexset_edgeset_setup_begin–LABEL:line:vertexset_edgeset_setup_end), and sets the distances to all the vertices in dist to INT_MAX to represent ∞ (Line LABEL:line:dist_init). We declare the global priority queue, pq, on Line LABEL:line:pq_decl. This priority queue can be referenced in user-defined functions and the main function. The user then defines a function, updateEdge, that processes each edge (Lines LABEL:line:udf_start–LABEL:line:udf_end). In updateEdge, the user computes a new distance value, and then updates the priority of the destination vertex using the updatePriorityMin operator defined in Table 1. In other algorithms, such as k-core, the user can use updatePrioritySum when the priority is decremented or incremented by a sum difference. The updatePrioritySum operator can detect whether the difference to the sum is a constant, and use that information to apply further optimizations. The priority update operators, updatePriorityMin and updatePrioritySum, hide bucket update operations, allowing the compiler to generate different code for lazy and eager bucket update strategies.
Programmers use the constructor of the priority queue (Lines LABEL:line:pq_constructor_start–LABEL:line:pq_constructor_end) to specify algorithmic information, such as priority ordering, support for priority coarsening, and the direction in which priorities change (documented in Table 1). The abstract priority queue also hides low-level bucket implementation details and provides a mapping between vertex data and their priorities. The user specifies a priority_vector that stores the vertex data values used for computing priorities. In SSSP, we use the dist vector and the coarsening parameter (Δ, specified using the scheduling language) to perform priority coarsening. The while loop (Line LABEL:line:ordered_processing_operator_start) processes vertices from a bucket until all buckets are finished processing. In each iteration of the while loop, a new bucket is extracted with dequeueReadySet (Line LABEL:line:dequeueReadySet). The edgeset operator on Line LABEL:line:delta_stepping_apply_update_priority uses the from operator to keep only the edges that come out of the vertices in the bucket. Then it uses applyUpdatePriority to apply the updateEdge function to the outgoing edges of the bucket. Label (#s1#) is later used by the scheduling language to configure optimization strategies.
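The shape of the algorithm described above (declare a priority queue, loop until it is finished, dequeue a ready set, and apply a priority-updating function to its outgoing edges) can be mimicked in plain Python. The class below is a hypothetical mock written for this illustration, not the PriorityGraph implementation or its syntax: it borrows the operator names from Table 1, takes Δ as a constructor argument for simplicity, and uses a single-threaded dictionary of buckets.

```python
import math


class MockPriorityQueue:
    """Hypothetical Python stand-in for the abstract priority queue (lower_first)."""

    def __init__(self, priority_vector, delta, start_vertex):
        self.values = priority_vector    # vertex data used to compute priorities
        self.delta = delta               # coarsening factor
        self.buckets = {self.values[start_vertex] // delta: {start_vertex}}

    def finished(self):
        return not any(self.buckets.values())

    def dequeueReadySet(self):
        lowest = min(b for b, vs in self.buckets.items() if vs)
        return self.buckets.pop(lowest)

    def updatePriorityMin(self, v, new_val):
        old = self.values[v]
        if new_val >= old:
            return
        if not math.isinf(old):
            # Remove v from its previous bucket if it is still waiting there.
            self.buckets.get(old // self.delta, set()).discard(v)
        self.values[v] = new_val
        self.buckets.setdefault(new_val // self.delta, set()).add(v)


def sssp(graph, source, delta):
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    pq = MockPriorityQueue(dist, delta, source)
    while not pq.finished():
        bucket = pq.dequeueReadySet()
        for src in bucket:                      # edges leaving the bucket
            for dst, weight in graph[src]:      # role of the updateEdge function
                pq.updatePriorityMin(dst, dist[src] + weight)
    return dist


example = {0: [(1, 2), (2, 9)], 1: [(2, 3), (3, 10)], 2: [(3, 1)], 3: []}
print(sssp(example, 0, 4))  # {0: 0, 1: 2, 2: 5, 3: 6}
```

The `sssp` function never touches buckets directly, which is the property that lets a compiler substitute a lazy or eager bucket implementation behind the same operators.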
4.2. Scheduling Language
Table 2. Scheduling API extensions.

| Apply Scheduling Functions | Descriptions |
| --- | --- |
| configApplyPriorityUpdate(label, config); | Config options: eager_with_fusion, eager_no_fusion, lazy_constant_sum, and lazy. |
| configApplyPriorityUpdateDelta(label, config); | Configures the parameter for coarsening the priority range. |
| configBucketFusionThreshold(label, config); | Configures the threshold for the bucket fusion optimization. |
| configNumBuckets(label, config); | Configures the number of buckets that are materialized for the lazy bucket update approach. |

The scheduling language allows users to specify different optimization strategies without changing the algorithm. We extend the scheduling language of GraphIt with new commands to enable switching between eager and lazy bucket update strategies. Users can also tune other parameters, such as the coarsening factor for priority coarsening. The scheduling API extensions are shown in Table 2.
Figure 8 shows a set of schedules for Δ-stepping. PriorityGraph uses labels (#label#) to identify the algorithm language statements on which the scheduling language commands are applied. The programmer adds the label s1 to the edgeset applyUpdatePriority statement. After the schedule keyword, the programmer calls the scheduling functions. The configApplyPriorityUpdate function allows the programmer to use the lazy bucket update optimization. We use optimizations from the GraphIt scheduling language to configure the direction of edge traversal (configApplyDirection) and load balance schemes (configApplyParallelization). Optimizations from GraphIt can be combined with lazy priority update schedules. We use configApplyPriorityUpdateDelta to set the delta for priority coarsening.
Users can change the schedules to generate code with different combinations of optimizations as shown in Figure 9. Figure 9(a) shows code generated by combining the lazy bucket update strategy and other edge traversal optimizations from the GraphIt scheduling language. Scheduling function configApplyDirection configures the data layout of the frontier and direction of the edge traversal (SparsePush means sparse frontier and push direction). Figure 9(b) shows the code generated when we combine a different traversal direction (DensePull) with the lazy bucketing strategy. Figure 9(c) shows code generated with the eager bucket update strategy. Code generation is explained in Section 5.
5. Compiler Implementation
We demonstrate how the compiler generates code for different bucketing optimizations. The key challenges are how to insert low-level synchronization and deduplication instructions, and how to combine bucket optimizations with direction optimization and other optimizations in the original GraphIt scheduling language. Furthermore, the compiler has to perform global program transformations and code generation to switch between lazy and eager approaches.
5.1. Lazy Bucket Update Schedules
To support the lazy bucket update approach, the compiler applies program analyses and transformations on the user-defined functions (UDFs). The compiler uses dependence analysis on updatePriorityMin and updatePrioritySum to determine if there are write-write conflicts and inserts atomic instructions as necessary (Figure 9(a), Line 20). Additionally, the compiler needs to insert variables to track whether a vertex’s priority has been updated or not (tracking_var in Figure 9(a), Line 18). This variable is used in the generated code to determine which vertices should be added to the buffer outEdges (Figure 9(a), Line 21). Deduplication is enabled with a compare-and-swap (CAS) on deduplication flags (Line 21) to ensure that each vertex is inserted into outEdges only once. Deduplication is required for correctness in applications such as k-core. Changing the schedules with different traversal directions or frontier layouts affects the code generation for edge traversal and user-defined functions (Figure 9(b)). In the DensePull traversal direction, no atomics are needed for the destination nodes.
We built runtime libraries to manage the buffer and update buckets. The compiler generates appropriate calls to the library (getNextBucket, setupFrontier, and updateBuckets). The setupFrontier API (Figure 9(a), Line 24) uses a prefix sum on the outEdges buffer to compute the next frontier. We use a lazy priority queue (declared in Figure 9(a), Line 2) for storing active vertices based on their priorities. The lazy bucket is based on Julienne’s bucket data structure, which materializes only a few buckets and keeps vertices outside of the current range in an overflow bucket (Dhulipala:2017). We improved its performance by redesigning the lazy priority queue interface. Julienne’s original interface invokes a customized lambda function call to compute the priority. PriorityGraph computes the priority using a priority vector and the Δ value for priority coarsening, eliminating the extra function call.
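The idea of materializing only a few buckets and parking the remaining vertices in an overflow bucket can be sketched as follows. This is a deliberately simplified, sequential Python illustration with hypothetical names (`OverflowBuckets`, `num_open`, `next_bucket`); it is not Julienne's or PriorityGraph's data structure, which additionally tracks per-vertex bucket identifiers and supports parallel bulk updates.

```python
class OverflowBuckets:
    """Keeps num_open consecutive buckets materialized; all others go to overflow."""

    def __init__(self, num_open):
        self.num_open = num_open
        self.base = 0                                # priority of open bucket 0
        self.open = [[] for _ in range(num_open)]
        self.overflow = []                           # (priority, vertex) pairs

    def insert(self, vertex, priority):
        slot = priority - self.base
        if 0 <= slot < self.num_open:
            self.open[slot].append(vertex)
        else:
            self.overflow.append((priority, vertex))

    def next_bucket(self):
        """Return (priority, vertices) of the lowest non-empty bucket, or None."""
        while True:
            for slot, bucket in enumerate(self.open):
                if bucket:
                    self.open[slot] = []
                    return self.base + slot, bucket
            if not self.overflow:
                return None
            # Open range exhausted: slide it to the lowest pending priority
            # and redistribute the overflow bucket.
            pending, self.overflow = self.overflow, []
            self.base = min(p for p, _ in pending)
            for p, v in pending:
                self.insert(v, p)


pq = OverflowBuckets(num_open=2)
for vertex, priority in [("a", 0), ("b", 1), ("c", 5), ("d", 6), ("e", 9)]:
    pq.insert(vertex, priority)

order = []
while (item := pq.next_bucket()) is not None:
    order.append(item)
print(order)  # [(0, ['a']), (1, ['b']), (5, ['c']), (6, ['d']), (9, ['e'])]
```

The number of open buckets trades memory and bookkeeping for how often the overflow bucket must be redistributed, which is the kind of parameter that configNumBuckets in Table 2 exposes.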
Lazy with constant sum reduction. We also incorporated a specialized histogram-based reduction optimization (first proposed in Julienne (Dhulipala:2017)) to reduce priority updates that change the priority by a constant value each time. This optimization can be combined with the lazy bucket update strategy to improve performance. For k-core, since the priority of each vertex always decreases by one at each update, we can optimize it further by keeping track of only the number of updates with a histogram. This way, we avoid contention on vertices that have a large number of neighbors on the frontier.
To generate code for the histogram optimization, the compiler first analyzes the user-defined function to determine whether the change to the priority of the vertex is a fixed value and whether it is a sum reduction (Figure 10 (top), Line LABEL:line:constant_update). The compiler ensures that there is only one priority update operator in the user-defined function. It then extracts the fixed value (1), the minimum priority (k), and the vertex identifier (dst). In the transformed function (Figure 10 (bottom)), an if statement and a max operator are generated to check and maintain the priority of the vertex. The applyUpdatePriority operator gets the counts of updates to each vertex using a histogram approach and supplies the vertex and count as arguments to the transformed function (Figure 10 (bottom), Line LABEL:line:count_arg). The compiler copies all of the expressions used in the priority update operator, and the expressions that they depend on, into the transformed function.
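A minimal Python sketch of the histogram idea for k-core follows. It is an illustration under stated assumptions rather than the generated code of Figure 10: the function names and example graph are hypothetical, the graph is an undirected adjacency list, and `collections.Counter` plays the role of the histogram that counts how many frontier neighbors update each vertex.

```python
from collections import Counter


def apply_constant_sum_round(graph, frontier, priority, k):
    """Apply one round of 'decrement by 1' updates using a histogram of update counts.

    Returns {vertex: new priority} for every vertex whose priority changed.
    """
    # Histogram: number of frontier neighbors of each destination vertex.
    counts = Counter(dst for src in frontier for dst in graph[src])
    moved = {}
    for dst, count in counts.items():
        if priority[dst] > k:
            # One combined update per vertex, clamped at the minimum priority k.
            priority[dst] = max(priority[dst] - count, k)
            moved[dst] = priority[dst]
    return moved


def k_core(graph):
    """Peel vertices in increasing order of induced degree; returns coreness values."""
    priority = {v: len(nbrs) for v, nbrs in graph.items()}
    remaining = set(graph)
    coreness = {}
    while remaining:
        k = min(priority[v] for v in remaining)
        frontier = {v for v in remaining if priority[v] == k}
        while frontier:
            for v in frontier:
                coreness[v] = k
            remaining -= frontier
            moved = apply_constant_sum_round(graph, frontier, priority, k)
            frontier = {v for v, p in moved.items() if p == k and v in remaining}
    return coreness


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
triangle_with_tail = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(k_core(triangle_with_tail))
```

A high-degree vertex with many neighbors on the frontier receives a single combined decrement instead of one update per neighbor, which is where the contention savings come from.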
5.2. Eager Bucket Update Schedules
The compiler uses program analysis to determine the feasibility of the transformation, transforms user-defined functions and edge traversal code, and uses optimized runtime libraries to generate efficient code for the eager bucket update approach.
The compiler first analyzes the while loop (Figure 3, Lines LABEL:line:ordered_processing_operator_start–LABEL:line:ordered_processing_operator_end) to look for the pattern of an iterative priority update with a termination criterion. The analysis checks that there is no other use of the generated vertexset (bucket) except for the applyUpdatePriority operator, ensuring correctness.
Once the analysis identifies the while loop and edge traversal operator, the compiler replaces the while loop with an ordered processing operator. The ordered processing operator uses an OpenMP parallel region (Figure 9(c), Lines 12–32) to set up thread-local data structures, such as local_bins. We built an optimized runtime library for the ordered processing operator based on the Δ-stepping implementation in GAPBS (DBLP:journals/corr/BeamerAP15). A global vertex frontier (Figure 9(c), Line 11) keeps track of vertices of the next priority (the next bucket). In each iteration of the while loop, the #pragma omp for (Figure 9(c), Lines 15–16) distributes work among the threads. After priorities and buckets are updated, each local thread proposes its next bucket priority, and the smallest priority across threads is selected (omitted code on Figure 9(c), Line 28). Once the next bucket priority is decided, each thread copies the vertices in its next local bucket to the global frontier (Figure 9(c), Line 30).
Finally, the compiler transforms the user-defined functions by appending the local buckets to the argument list and inserting appropriate synchronization instructions. These transformations allow priority update operators to directly update thread-local buckets (Figure 9(c), Lines 23–26).
Bucket Fusion. The bucket fusion optimization adds another while loop after the end of the for loop on Line 27 of Figure 9(c), and before finding the next bucket across threads on Line 28. This inner while loop processes the current bucket in the local priority queue (local_bins) if it is not empty and its size is less than a threshold. In the inner while loop, vertices are processed using the same transformed user-defined functions as before. The size threshold improves load balancing, as large buckets are distributed across different threads.
5.3. Autotuning
We built an autotuner on top of PriorityGraph to automatically find high-performance schedules for a given algorithm and graph. The autotuner is built using OpenTuner (ansel:pact:2014) and stochastically searches through a large number of optimization strategies generated with the scheduling language. It uses an ensemble of search methods, such as the area under curve bandit meta technique, to find good combinations of optimizations within a reasonable amount of time.
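To convey the flavor of a stochastic schedule search, the sketch below uses plain random sampling in Python. It deliberately does not use OpenTuner or its bandit-based ensemble, and it is not the autotuner described above. The `priority_update` options are taken from Table 2, while the candidate values for `delta` and `fusion_threshold`, the `random_search` helper, and the `toy_measure` cost function are hypothetical placeholders; in practice `measure` would compile and time the program under the given schedule.

```python
import random

# Hypothetical search space; only the priority_update options come from Table 2.
SEARCH_SPACE = {
    "priority_update": ["eager_with_fusion", "eager_no_fusion", "lazy_constant_sum", "lazy"],
    "delta": [1, 2, 4, 8, 16, 32],
    "fusion_threshold": [128, 512, 1024, 4096],
}


def random_search(measure, trials, seed=0):
    """Sample schedules uniformly at random and keep the fastest one seen.

    measure(schedule) must return a running time (lower is better).
    """
    rng = random.Random(seed)
    best_schedule, best_time = None, float("inf")
    for _ in range(trials):
        schedule = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        elapsed = measure(schedule)
        if elapsed < best_time:
            best_schedule, best_time = schedule, elapsed
    return best_schedule, best_time


def toy_measure(schedule):
    """Synthetic cost model used only to exercise the search loop."""
    penalty = 0 if schedule["priority_update"].startswith("eager") else 5
    return abs(schedule["delta"] - 8) + penalty


best, best_time = random_search(toy_measure, trials=50)
print(best, best_time)
```

Because the schedule is separate from the algorithm, each sampled configuration only changes scheduling commands, so the search loop never needs to modify the algorithm specification.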
6. Evaluation
We compare the performance of PriorityGraph to state-of-the-art frameworks and libraries. We also analyze performance tradeoffs among different PriorityGraph schedules. We use a dual-socket system with Intel Xeon E5-2695 v3 CPUs with 12 cores each, for a total of 24 cores and 48 hyperthreads. The system has 128 GB of DDR3-1600 memory and a 30 MB last-level cache on each socket, and runs Ubuntu 18.04 with Transparent Huge Pages (THP) enabled.
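Table 3. Input graphs and their sizes.

| Type | Dataset | Num. Vertices | Num. Edges |
| --- | --- | --- | --- |
| Social Networks | Orkut (OK) (friendster) | 3 M | 234 M |
| Social Networks | LiveJournal (LJ) (davis11acmfloridasparse) | 5 M | 69 M |
| Social Networks | Twitter (TW) (kwak10wwwtwitter) | 41 M | 1469 M |
| Social Networks | Friendster (FT) (friendster) | 125 M | 3612 M |
| Web Graph | WebGraph (WB) (sdgraph) | 101 M | 2043 M |
| Road Networks | Massachusetts (MA) (openstreetmap) | 0.45 M | 1.2 M |
| Road Networks | Germany (GE) (openstreetmap) | 12 M | 32 M |
| Road Networks | RoadUSA (RD) (roadgraph) | 24 M | 58 M |
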
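Table 4. Execution times of PriorityGraph and other frameworks.

SSSP

| Framework | LJ | OK | TW | FT | WB | GE | RD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PriorityGraph | 0.093 | 0.106 | 3.094 | 5.637 | 2.902 | 0.207 | 0.224 |
| GAPBS | 0.1 | 0.107 | 3.547 | 6.094 | 3.304 | 0.59 | 0.765 |
| Galois | 0.123 | 0.234 | 2.93 | 7.996 | 3.005 | 0.244 | 0.276 |
| Julienne | 0.169 | 0.334 | 4.522 | x | 4.11 | 3.104 | 3.685 |
| GraphIt (unordered) | 0.221 | 0.479 | 6.376 | 38.458 | 8.521 | 90.524 | 122.374 |
| Ligra (unordered) | 0.301 | 0.604 | 7.778 | x | x | 94.162 | 129.2 |

PPSP

| Framework | LJ | OK | TW | FT | WB | GE | RD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PriorityGraph | 0.043 | 0.061 | 2.597 | 4.063 | 2.473 | 0.049 | 0.045 |
| GAPBS | 0.042 | 0.063 | 2.707 | 4.312 | 2.628 | 0.116 | 0.112 |
| Galois | 0.084 | 0.165 | 2.625 | 7.092 | 2.606 | 0.059 | 0.051 |
| Julienne | 0.104 | 0.16 | 4.904 | x | 4.107 | 1.836 | 0.687 |
| GraphIt (unordered) | 0.221 | 0.479 | 6.376 | 38.458 | 8.521 | 90.524 | 122.374 |
| Ligra (unordered) | 0.301 | 0.604 | 7.778 | x | x | 94.162 | 129.2 |

wBFS

| Framework | LJ | OK | TW | FT | WB |
| --- | --- | --- | --- | --- | --- |
| PriorityGraph | 0.072 | 0.104 | 1.822 | 7.563 | 2.129 |
| GAPBS | 0.072 | 0.107 | 1.903 | 7.879 | 2.228 |
| Galois | – | – | – | – | – |
| Julienne | 0.148 | 0.145 | 2.32 | x | 2.813 |
| GraphIt (unordered) | 0.12 | 0.198 | 2.519 | 21.77 | 3.659 |
| Ligra (unordered) | 0.164 | 0.257 | 3.054 | x | x |

k-core

| Framework | LJ | OK | TW | FT | WB | GE | RD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PriorityGraph | 0.745 | 1.634 | 10.294 | 14.423 | 12.876 | 0.173 | 0.305 |
| GAPBS | – | – | – | – | – | – | – |
| Galois | – | – | – | – | – | – | – |
| Julienne | 0.752 | 1.62 | 10.5 | 14.6 | 13.1 | 0.184 | 0.327 |
| GraphIt (unordered) | 6.131 | 8.152 | x | 325.286 | x | 0.421 | 1.757 |
| Ligra (unordered) | 5.99 | 8.09 | 225.102 | 324 | x | 0.708 | 1.76 |

Approximate Set Cover

| Framework | LJ | OK | TW | FT | WB | GE | RD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PriorityGraph | 0.494 | 0.564 | 5.299 | 11.499 | 7.57 | 0.545 | 0.859 |
| GAPBS | – | – | – | – | – | – | – |
| Galois | – | – | – | – | – | – | – |
| Julienne | 0.703 | 0.868 | 6.89 | 13.2 | 10.7 | 0.66 | 1.03 |
| GraphIt (unordered) | – | – | – | – | – | – | – |
| Ligra (unordered) | – | – | – | – | – | – | – |

A* search

| Framework | MA | GE | RD |
| --- | --- | --- | --- |
| PriorityGraph | 0.010 | 0.060 | 0.075 |
| GAPBS | 0.03 | 0.142 | 0.221 |
| Galois | 0.078 | 0.066 | 0.083 |
| Julienne | 0.181 | 1.551 | 4.876 |
| GraphIt (unordered) | 0.456 | 90.524 | 122.374 |
| Ligra (unordered) | 0.832 | 94.162 | 129.2 |
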
Applications. We evaluate PriorityGraph on SSSP with Δ-stepping, weighted breadth-first search (wBFS), point-to-point shortest path (PPSP), A* search, k-core decomposition (k-core), and approximate set cover (SetCover). We benchmarked wBFS only on the social networks, with edge weights chosen following the convention in previous work (Dhulipala:2017), because wBFS is designed to run only on graphs with this weight and degree distribution. The k-core algorithm finds all the k-cores in a graph, not just the $k$-th core. Detailed descriptions of the algorithms are provided in the supplementary materials and other papers (Dhulipala:2017).
Data Sets. Table 3 shows our input graphs and their sizes. For core and SetCover, we symmetrize the input graphs. For stepping based SSSP, wBFS, PPSP using stepping, and A search, we use the original directed versions of graphs with integer edge weights. The RoadUSA (RD), Germany(GE) and Massachusetts (MA) road graphs are used for the A search algorithm, as they have the longitude and latitude data for each vertex. GE and MA are constructed from data downloaded from OpenStreetMap (openstreetmap). Weight distributions used for experiments are described in the caption of Table 4.
Existing Frameworks. Galois v4 (nguyen13sospgalois) uses approximate priority ordering with an ordered list abstraction for SSSP. We implemented PPSP and A search using the ordered list. To the best of our knowledge and from communications with the developers, strict prioritybased ordering is currently not supported for Galois. Galois does not provide implementations of wBFS, core and SetCover, which need strict priority ordering. GAPBS (DBLP:journals/corr/BeamerAP15) is a suite of C++ implementations of graph algorithms and uses eager bucket update for SSSP. GAPBS does not provide implementations of core and SetCover. We used Julienne (Dhulipala:2017) from early 2019. The developers of Julienne have since tried to incorporate the optimized bucketing interface proposed in this paper in the latest version. GraphIt (graphit:2018) and Ligra (shun13ppoppligra) are two of the fastest unordered graph frameworks. We used the best configurations (e.g., priority coarsening factor and number of cores) for the comparison frameworks. Schedules and parameters used are in the supplementary materials.
6.1. Comparisons with other Frameworks
Table 4 shows the execution times of PriorityGraph and other frameworks. PriorityGraph outperforms the next fastest of Julienne, Galois, GAPBS, GraphIt, and Ligra by up to 3× and is never more than 6% slower than the fastest framework. PriorityGraph is up to 16.8× faster than Julienne, 7.8× faster than Galois, and 3.5× faster than the hand-optimized implementations in GAPBS. Compared to the unordered frameworks, GraphIt and Ligra, PriorityGraph achieves speedups ranging from 1.67× to more than 600× due to improved algorithm efficiency. The times for SSSP and wBFS are averaged over 10 starting vertices. The times for PPSP and A* search are averaged over 10 source-destination pairs. We chose start and end points to have a balanced selection of different distances.
PriorityGraph has the fastest SSSP performance on six out of the seven input graphs. Julienne executes significantly more instructions than PriorityGraph (up to 16.4× as many). On every iteration, Julienne computes an out-degree sum for the vertices on the frontier to use the direction optimization, which adds significant runtime overhead. PriorityGraph avoids this overhead by disabling the direction optimization with the scheduling language. Julienne also uses lazy bucket update, which generates extra instructions to buffer the bucket updates, whereas PriorityGraph saves instructions by using eager bucket update. PriorityGraph is faster than GAPBS because of the bucket fusion optimization, which allows PriorityGraph to process more vertices in each round and use fewer rounds (details shown in Table 6). The optimization is especially effective for road networks, where synchronization overhead is a significant performance bottleneck. Galois achieves good performance on SSSP because it does not have as much overhead from the global synchronization needed to enforce strict priority. However, it is slower than PriorityGraph on most graphs because approximate priority ordering sacrifices some work-efficiency.
PriorityGraph is the fastest on most of the graphs for PPSP, wBFS, and A* search, which use a variant of the Δ-stepping algorithm with priority coarsening. Both PriorityGraph and GAPBS use eager bucket update for these algorithms. PriorityGraph outperforms GAPBS because of bucket fusion. Galois is often slower than PriorityGraph due to lower work-efficiency with the approximate priority ordering. Julienne uses lazy bucket update and is slower than PriorityGraph due to the runtime overheads of the lazy approach.
PPSP and A* search are faster than SSSP as they only run until the distance to the destination vertex is finalized. A* search is sometimes slower than PPSP because of additional random memory accesses and computation needed for estimating distances to the destination.
For k-core and SetCover, PriorityGraph runs faster than Julienne because the optimized lazy bucketing interface uses the priority vector to compute the priorities of each vertex. Julienne uses a user-defined function to compute the priority every time, resulting in function call overheads and redundant computations. Galois does not provide ordered algorithms for k-core and SetCover, which require strict priority and synchronizations after processing each priority.
Delta Selection for Priority Coarsening. The best Δ value for each algorithm depends on the size and the structure of the graph. The best Δ values for social networks (ranging from 1 to 100) are much smaller than deltas for road networks with large diameters (ranging from to ). Social networks need only a small Δ value because they have ample parallelism with large frontiers, so work-efficiency is more important. Road networks need larger Δ values to expose more parallelism. We also tuned the Δ values for the comparison frameworks to provide the best performance.
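To illustrate priority coarsening, the following minimal sketch maps tentative distances to bucket indices for two values of the coarsening factor. The function name `bucket_index` and the sample distances are hypothetical and chosen only for illustration; integer distances are assumed.

```python
def bucket_index(distance: int, delta: int) -> int:
    """Map a tentative distance to its coarsened priority (bucket index)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return distance // delta


# Hypothetical tentative distances of six active vertices.
distances = [0, 3, 7, 8, 15, 16]

# delta = 1: one bucket per distinct distance (strict order, little parallelism).
fine = [bucket_index(d, 1) for d in distances]

# delta = 8: fewer, larger buckets (more parallelism, possibly redundant work).
coarse = [bucket_index(d, 8) for d in distances]
```

With the larger delta, the six vertices fall into three buckets instead of six, which is the trade-off between parallelism and work-efficiency described above.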
Autotuning. The autotuner for PriorityGraph is able to automatically find schedules that perform within 5% of the hand-tuned schedules used for Table 4. The autotuning process finished within 5000 seconds for even the largest graphs.
Algorithm  PriorityGraph  GAPBS  Galois  Julienne
SSSP  28  77  90  65
PPSP  24  80  99  103
A* search  74  105  139  84
k-core  24  –  –  35
SetCover  70  –  –  72
Line Count Comparisons. Table 5 shows the line counts of the five graph algorithms implemented in four frameworks. GAPBS, Galois, and Julienne all require the programmer to take care of implementation details such as atomic synchronization and deduplication. PriorityGraph uses the compiler to automatically generate these instructions. For A* search and SetCover, PriorityGraph needs to use long extern functions that significantly increase the line counts.
6.2. Scalability Analysis
We analyze the scalability of different frameworks on social networks and the road networks in Figure 11. The social networks (TW and FT) have very small diameters and large numbers of vertices. As a result, they have a lot of parallelism in each bucket, and all three frameworks scale reasonably well (Figure 11(a) and (b)). Compared to GAPBS, PriorityGraph uses bucket fusion to significantly reduce synchronization overheads and improves parallelism on the RoadUSA network (Figure 11(c)). GAPBS suffers from NUMA accesses when going beyond a single socket (12 cores). Julienne’s overheads from lazy bucket updates make it hard to scale on the RoadUSA graph.
6.3. Performance of Different Schedules
Dataset  with Fusion  without Fusion
TW  3.09s [1025 rounds]  3.55s [1489 rounds]
FT  5.64s [5604 rounds]  6.09s [7281 rounds]
WB  2.90s [772 rounds]  3.30s [2248 rounds]
RD  0.22s [1069 rounds]  0.77s [48407 rounds]
Dataset  k-core (Eager Update)  k-core (Lazy Update)  SSSP with Δ-stepping (Eager Update)  SSSP with Δ-stepping (Lazy Update)
LJ  0.84  0.75  0.093  0.24
TW  44.43  10.29  3.09  6.66
FT  46.59  14.42  5.64  10.34
WB  35.58  12.88  2.90  7.82
RD  0.55  0.31  0.22  9.48
Table 6 shows that SSSP with bucket fusion achieves a speedup of up to roughly 3.5× over the version without bucket fusion on road networks (0.22s versus 0.77s on RD), where a large number of rounds is spent processing each bucket. The optimization improves running time by significantly reducing the number of rounds needed to complete the algorithm (from 48407 to 1069 rounds on RD).
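The effect of bucket fusion on the number of rounds can be illustrated with a small sequential simulation. This is only a sketch under simplifying assumptions: a single worker stands in for the per-thread local buckets, every pass over a frontier that ends in a global synchronization counts as one round, and the names `sssp_rounds` and `fusion_threshold` are hypothetical rather than part of PriorityGraph.

```python
def sssp_rounds(adj, source, delta, fusion_threshold=0):
    """Bucketed SSSP that counts synchronized rounds.

    adj maps each vertex to a list of (neighbor, non-negative integer weight).
    With fusion_threshold > 0, new work that lands in the current bucket is
    processed immediately (no new round) as long as it is smaller than the
    threshold.
    """
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    buckets = {0: [source]}
    rounds = 0
    while buckets:
        cur = min(buckets)            # smallest non-empty bucket
        frontier = buckets.pop(cur)
        rounds += 1                   # one globally synchronized round
        while frontier:
            same_bucket = []
            for u in frontier:
                if dist[u] // delta != cur:
                    continue          # stale entry: u moved to an earlier bucket
                for v, w in adj[u]:
                    nd = dist[u] + w
                    if nd < dist[v]:
                        dist[v] = nd
                        b = nd // delta
                        if b == cur:
                            same_bucket.append(v)
                        else:
                            buckets.setdefault(b, []).append(v)
            if same_bucket and len(same_bucket) < fusion_threshold:
                frontier = same_bucket            # fused: keep going locally
            else:
                if same_bucket:                   # defer to a new round
                    buckets.setdefault(cur, []).extend(same_bucket)
                frontier = []
    return dist, rounds


# A 10-vertex path with unit weights mimics a long, thin road network.
path = {i: [(i + 1, 1)] for i in range(9)}
path[9] = []
dist_plain, rounds_plain = sssp_rounds(path, 0, delta=16)
dist_fused, rounds_fused = sssp_rounds(path, 0, delta=16, fusion_threshold=1000)
```

On this path graph both runs compute the same distances, but the fused run finishes the single bucket in one round while the unfused run needs one round per vertex, mirroring the round reduction reported for RD.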
Table 7 shows the performance impact of eager versus lazy bucket updates on k-core and SSSP. k-core performs a large number of redundant updates on the priority of each vertex: every vertex’s priority is updated as many times as its out-degree. In this case, using the lazy bucket update approach drastically reduces the number of bucket insertions. Additionally, with a lazy approach we can buffer the priority updates and later reduce them with a histogram approach (lazy with constant sum reduction optimization). This histogram-based reduction avoids overhead from atomic operations. For SSSP, there are not many redundant updates, and the lazy approach introduces significant runtime overhead over the eager approach.
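The histogram-based reduction can be sketched as follows. This is a sequential illustration only, assuming an adjacency-list dictionary; the function name `lazy_degree_updates` is hypothetical, and the generated parallel code is not shown in this paper section.

```python
from collections import Counter


def lazy_degree_updates(removed, adj, degree):
    """Buffer one decrement per removed edge, then apply one update per vertex."""
    # Count how many removed neighbors each surviving vertex has.
    histogram = Counter(
        v for u in removed for v in adj[u] if v not in removed
    )
    # A single priority (degree) update per vertex instead of one per edge.
    for v, count in histogram.items():
        degree[v] -= count
    return histogram


adj = {0: [1, 2, 3], 1: [0, 3], 2: [0, 3], 3: [0, 1, 2]}
degree = {v: len(neighbors) for v, neighbors in adj.items()}
hist = lazy_degree_updates({1, 2}, adj, degree)
```

Here vertices 0 and 3 each lose two neighbors, but each receives only one combined update, which is the source of the savings for applications with many redundant updates.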
7. Related Work
Parallel Graph Processing Frameworks. There has been a significant amount of work on unordered graph processing frameworks (e.g., (shun13ppoppligra; gluon2018; Zhu16gemni; Grossman2018; Yunming2017; kyrola12osdigraphchi; Ham:2016:GHE:3195638.3195707; prabhakaran12atcgrace; Sakr2017; Wang:2018:LLD:3178487.3178508; Gonzalez2012; sundaram15vldbgraphmat; Wang:2017:GGG:3131890.3108140; GSwitch2019; Xu:PnP; KickStarter2017; Pai2016; graphit:2018; DBLP:journals/pvldb/SongLWGLJ18), among many others). These frameworks do not have data structures and operators to support efficient implementations of ordered algorithms, and cannot support a wide selection of ordered graph algorithms. A few unordered frameworks (Wang16gunrock; GSwitch2019; sundaram15vldbgraphmat) have the users define functions that filter out vertices to support Δ-stepping for SSSP. This approach is not very efficient and does not generalize to other ordered algorithms.
GraphIt (graphit:2018) decouples the algorithm specification from optimizations for unordered graph algorithms to achieve high performance. PriorityGraph introduces new priority-based operators to the algorithm language, proposes new optimizations for the ordered algorithms in the scheduling language, and extends the compiler to generate efficient code.
Wonderland (Zhang:2018:WNA:3173162.3173208) uses abstraction-guided priority-based scheduling to reduce the total number of iterations for some graph algorithms. However, it requires preprocessing and does not implement a strict ordering of the ordered graph algorithms. PnP (Xu:PnP) proposes direction-based optimizations for point-to-point queries, which are orthogonal to the optimizations in this paper and can be combined with them to potentially achieve even better performance.
Bucketing. Bucketing is a common way to exploit parallelism and maintain ordering in ordered graph algorithms. It is expressive enough to implement many parallel ordered graph algorithms (Dhulipala:2017; DBLP:journals/corr/BeamerAP15). Existing frameworks support either the lazy or the eager bucket update approach. GAPBS (DBLP:journals/corr/BeamerAP15) is a suite of hand-optimized C++ programs that includes an implementation of SSSP using the eager bucket update approach. Julienne (Dhulipala:2017) is a high-level programming framework that uses the lazy bucket update approach, which is efficient for applications that have a lot of redundant updates, such as k-core and SetCover. However, it is not as efficient for applications that have fewer redundant updates and less work per bucket, such as SSSP and A* search. PriorityGraph unifies both the eager and lazy bucket update approaches with a new programming model and compiler extensions to achieve consistent high performance.
Speculative Execution. Speculative execution is another approach to exploit parallelism in ordered graph algorithms (Hassaan:2011:OVU:1941553.1941557; Hassaan:2015:KDG:2694344.2694363). This approach can potentially generate more parallelism as vertices with different priorities are executed in parallel as long as the dependencies are preserved. This is particularly important for many discrete simulation applications that lack parallelism. However, speculative execution in software incurs significant performance overheads, as a commit queue has to be maintained, conflicts need to be detected, and values are buffered for potential rollback on conflicts. Hardware solutions have been proposed to reduce the overheads of speculative execution (Jeffrey:Swarm; Suvinay:Fractal:2017; Jeffrey:DataCentricExectuion; Jeffrey:2018), but it is costly to build customized hardware. Furthermore, some ordered graph algorithms, such as approximate set cover and k-core, cannot be easily expressed with speculative execution.
Approximate Priority Ordering. Some work disregards a strict priority ordering and uses an approximate priority ordering (nguyen13sospgalois; gluon2018; alistarh2015spraylist; DBLP:conf/spaa/Alistarh0KLN18). This approach uses several “relaxed” priority queues in parallel to maintain local priority ordering. However, it does not synchronize globally among the different priority queues. To the best of our knowledge and from communications with the developers, Galois (nguyen13sospgalois; gluon2018) does not currently support strict priority ordering and only supports an approximate ordering. This approach cannot implement certain ordered algorithms that require strict priority ordering, such as work-efficient k-core decomposition and SetCover. D-Galois (Dathathri:2019:PSR:3297858.3304056) implements k-core for only a specific k, whereas PriorityGraph’s k-core finds the max core. Furthermore, the approximate ordering can be inefficient for applications such as point-to-point shortest path or A* search because approximate ordering cannot easily guarantee that all vertices with priorities less than the current priority of the destination vertex have finished execution.
8. Conclusion
We introduce PriorityGraph, a new priority-based programming framework that simplifies the programming of parallel ordered graph algorithms and generates different combinations of performance optimizations. We also propose a novel bucket fusion optimization that significantly improves the performance of many ordered graph algorithms on road networks. PriorityGraph achieves up to 3× speedup on six ordered algorithms over state-of-the-art frameworks (Julienne, Galois, and GAPBS) while significantly reducing the number of lines of code.
9. Acknowledgments
This research was supported by DOE Early Career Award #DE-SC0018947, NSF CAREER Award #CCF-1845763, DARPA SDH Award #HR0011-18-3-0007, Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, Toyota Research Institute, DoE Exascale award #DE-SC0008923, and DARPA D3M Award #FA8750-17-2-0126.
References
10. Appendix
10.1. Bellman-Ford vs Δ-stepping
We use Figure 12 to demonstrate the difference between Bellman-Ford (an unordered algorithm) and Δ-stepping (an ordered algorithm) for SSSP. Bellman-Ford updates the distance to one vertex in round 1 and then immediately propagates updates to three further vertices in both rounds 2 and 3. However, these updates (highlighted in red) constitute wasted work, since the first vertex’s correct shortest-path distance is only set through a 3-hop path through two other vertices. Δ-stepping only propagates updates to those three vertices after the distance to the first vertex is finalized in round 3. Δ-stepping sacrifices some parallelism but avoids redundant updates, leading to significant speedup over Bellman-Ford on many graphs.
10.2. Applications
Our evaluation is done on six applications: single-source shortest paths (SSSP), weighted breadth-first search (wBFS), point-to-point shortest path (PPSP), A* search, k-core decomposition (k-core), and approximate set cover (SetCover).
SSSP and Weighted Breadth-First Search (wBFS). SSSP with Δ-stepping solves the single-source shortest path problem. In Δ-stepping, vertices are partitioned into buckets with interval Δ according to their current shortest distance. The Δ-stepping algorithm can be broken down into a number of iterations. In each iteration, the smallest non-empty bucket i, which contains all vertices with distance in [iΔ, (i+1)Δ), is chosen. All outgoing edges of vertices in bucket i are relaxed using Bellman-Ford iterations until bucket i becomes empty. Then the algorithm continues to the next bucket and repeats the same process until all buckets are empty. wBFS is a special case of Δ-stepping for graphs with positive integer edge weights, with Δ fixed to 1 (Dhulipala:2017). It is especially useful for social and web graphs with power-law degree distribution.
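The bucketed iteration described above can be sketched sequentially as follows. This is a minimal illustration, assuming non-negative integer edge weights and an adjacency-list dictionary; the function name `delta_stepping` and the example graph are hypothetical, and none of the parallel optimizations of PriorityGraph are modeled.

```python
import math


def delta_stepping(adj, source, delta):
    """Sequential sketch of delta-stepping.

    adj maps each vertex to a list of (neighbor, non-negative integer weight).
    """
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    buckets = {0: {source}}
    while buckets:
        i = min(buckets)              # smallest non-empty bucket
        frontier = buckets.pop(i)
        for u in frontier:
            if dist[u] // delta != i:
                continue              # stale entry: u moved to an earlier bucket
            for v, w in adj[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    # May re-populate bucket i, which is then chosen again.
                    buckets.setdefault(nd // delta, set()).add(v)
    return dist


graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 1)],
    "c": [],
}
result = delta_stepping(graph, "s", delta=2)
```

Calling the same function with `delta=1` corresponds to the wBFS special case; the computed distances do not depend on the choice of delta, only the amount of work and the number of iterations do.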
Point-to-point Shortest Path (PPSP). PPSP takes a graph, a source vertex, and a destination vertex as inputs and computes the shortest path between the source and the destination. In our PPSP application, we used the Δ-stepping algorithm with priority coarsening. It terminates the program early when it enters an iteration i where iΔ is greater than or equal to the shortest distance between the source and the destination that it has already found.
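The early-termination rule can be sketched by adding one check to a sequential bucketed shortest-path loop. This is an illustrative sketch assuming non-negative integer weights; the function name `ppsp` and the returned count of expanded vertices are hypothetical additions for demonstration.

```python
import math


def ppsp(adj, source, dest, delta):
    """Return (distance to dest, number of vertices expanded)."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    buckets = {0: {source}}
    expanded = 0
    while buckets:
        i = min(buckets)
        # Early exit: every unprocessed vertex has distance >= i * delta,
        # so no remaining path can improve the distance already found.
        if i * delta >= dist[dest]:
            break
        frontier = buckets.pop(i)
        for u in frontier:
            if dist[u] // delta != i:
                continue              # stale entry
            expanded += 1
            for v, w in adj[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    buckets.setdefault(nd // delta, set()).add(v)
    return dist[dest], expanded


chain = {"s": [("a", 1)], "a": [("b", 1)], "b": [("c", 1)],
         "c": [("d", 1)], "d": []}
near = ppsp(chain, "s", "a", delta=1)
far = ppsp(chain, "s", "d", delta=1)
```

For the nearby destination only the source is expanded before the loop stops, which is why PPSP is typically cheaper than a full SSSP computation.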
A* Search. The A* search algorithm also tries to find the shortest path between two points. The difference between A* search and Δ-stepping is that, instead of using the current shortest distance to a vertex as the priority, A* search uses the estimated distance of the shortest path that goes from the source to the target vertex through the current vertex. Our A* search implementation, in addition to edge information, also needs the longitude and latitude of the vertices. For a given vertex v, the estimated distance is calculated by summing up the current distance to v and the linear distance between v and the target vertex. The linear distance is calculated using the coordinates of the vertices (Russell:2003:AIM:773294).
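The priority computation can be sketched as below. This is only an illustration: it assumes planar coordinates in the same units as the edge weights and uses the Euclidean distance, whereas real longitude and latitude data would need an appropriate geographic distance. The function names are hypothetical and are not the extern function used in the PriorityGraph implementation.

```python
import math


def estimated_distance(dist_to_v, coord_v, coord_target):
    """Current distance to v plus the straight-line distance from v to the target."""
    return dist_to_v + math.dist(coord_v, coord_target)


def priority_bucket(dist_to_v, coord_v, coord_target, delta):
    """Coarsened priority: the bucket of width delta that the estimate falls into."""
    return int(estimated_distance(dist_to_v, coord_v, coord_target) // delta)


estimate = estimated_distance(5, (0.0, 0.0), (3.0, 4.0))
bucket = priority_bucket(5, (0.0, 0.0), (3.0, 4.0), delta=4)
```

Because the straight-line term never overestimates the remaining distance under these assumptions, vertices that lead away from the target receive larger priorities and are processed later.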
k-core. A k-core of an undirected graph G refers to a maximal connected subgraph of G where all vertices in the subgraph have induced-degree at least k. The k-core problem takes an undirected graph G as input and, for each vertex v, computes the maximum k-core that v is contained in (this value is referred to as the coreness of the vertex, and the problem is sometimes called the coreness problem). The k-core problem is solved using a peeling procedure that is sometimes referred to as the Matula-Beck algorithm (Matula:1983:SOC:2402.322385). In each iteration of the peeling procedure, vertices with degrees under a certain threshold are removed and their degrees become their core numbers. Then, as their incident edges are also removed from the graph, the degrees of the remaining vertices are updated.
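The peeling procedure can be sketched sequentially as follows. This is a minimal illustration, assuming a symmetric adjacency-list dictionary; the function name `coreness` and the example graph are hypothetical, and the bucketing structure used by the parallel implementation is replaced by simple scans.

```python
def coreness(adj):
    """Compute the coreness of every vertex by repeated peeling."""
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # The next threshold is the smallest remaining induced degree.
        k = max(k, min(degree[v] for v in remaining))
        peel = [v for v in remaining if degree[v] <= k]
        while peel:
            for v in peel:
                core[v] = k
                remaining.discard(v)
            touched = set()
            for v in peel:
                for u in adj[v]:
                    if u in remaining:
                        degree[u] -= 1    # incident edge is removed
                        touched.add(u)
            # Vertices whose degree dropped to the threshold are peeled next.
            peel = [u for u in touched if degree[u] <= k]
    return core


# A triangle (0, 1, 2) with one pendant vertex (3) attached to vertex 0.
example = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
cores = coreness(example)
```

The pendant vertex is peeled at threshold 1, after which the triangle remains as the 2-core.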
Approximate Set Cover. The set cover problem takes as input a universe U containing a set of ground elements, a collection of sets F s.t. ∪_{f∈F} f = U, and a cost function c: F → R+. The problem is to find the cheapest collection of sets A ⊆ F that covers U, i.e., ∪_{a∈A} a = U. In this paper we implement the unweighted version of the problem, where c: F → 1, but the algorithm used easily generalizes to the weighted case (Dhulipala:2017). The algorithm at a high level works by bucketing the sets based on their cost per element, i.e., the ratio of the number of uncovered elements they cover to their cost. At each step, a nearly-independent subset of sets from the highest bucket (sets with the best cost per element) is chosen and removed, and the remaining sets are reinserted into a bucket corresponding to their new cost per element. We refer to the following papers by Blelloch et al. (blelloch11manis; blelloch12setcover) for algorithmic details and a survey of related work.
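For intuition about selecting sets by the number of uncovered elements they cover, the sketch below uses the classic sequential greedy algorithm for the unweighted case. It is a deliberately simpler stand-in, not the parallel bucketing algorithm with nearly-independent subsets described above; the function name and the example instance are hypothetical.

```python
def greedy_set_cover(universe, sets):
    """Sequential greedy cover: repeatedly pick the set covering the most
    still-uncovered elements (unweighted case, every set has cost 1)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        gain = sets[best] & uncovered
        if not gain:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


candidate_sets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
cover = greedy_set_cover(range(1, 6), candidate_sets)
```

After set A is chosen, the number of uncovered elements each remaining set covers changes, which is the reason the bucketed algorithm reinserts sets into new buckets at each step.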
10.3. Schedules and Parameters used in Evaluation
SSSP, PPSP, wBFS, A* search. For SSSP, PPSP, and A* search in Julienne and GAPBS, we used the single-socket performance on the road networks (MA, GE, RD) and the two-socket performance on the social networks and web graphs. Using both sockets results in more than a 4× slowdown for these two frameworks on the road networks. We used the no_dense flag for Julienne on wBFS and SSSP for the best performance. The performance of PriorityGraph and Galois scales across two sockets for all graphs. We used eager bucketing with bucket fusion for SSSP, PPSP, wBFS, and A* search for PriorityGraph on all graphs except for the smaller social networks (LiveJournal and Orkut), and we used a merge threshold of 1000.
k-core and SetCover. Both PriorityGraph and Julienne adopt the lazy bucketing strategy for k-core and SetCover. For k-core, we used 16 open buckets for both PriorityGraph and Julienne for all graphs. We found that using 16 buckets outperforms using 128 buckets by 10–20%. For SetCover, we used 128 buckets, which outperforms 16 buckets by up to 2×.
10.4. Additional Algorithms in PriorityGraph
Figure 13 and Figure 14 show the PriorityGraph implementations of the ordered k-core and A* search algorithms. k-core computes all the cores in the graph. A* search computes the shortest distance between a start and an end point. A* search uses an extern function defined in C++, calc_dist, to compute the estimated distance from the current point to the destination point.