Many important graph problems can be implemented using either ordered or unordered parallel algorithms. Ordered algorithms process active vertices following a dynamic priority-based ordering, potentially reducing redundant work. By contrast, unordered algorithms process active vertices in an arbitrary order, improving parallelism while potentially performing a significant amount of redundant work. In practice, optimized ordered graph algorithms are up to two orders of magnitude faster than the unordered versions (Dhulipala:2017; Hassaan:2011:OVU:1941553.1941557; DBLP:journals/corr/BeamerAP15; Hassaan:2015:KDG:2694344.2694363), as shown in Figure 1. For example, computing single-source shortest paths (SSSP) on graphs with non-negative edge weights can be implemented using either the Bellman-Ford algorithm (BellmanFord), an unordered algorithm, or the Δ-stepping algorithm (MEYER2003114), an ordered algorithm. (In this paper, we define Δ-stepping as an ordered algorithm, in contrast to previous work (Hassaan:2011:OVU:1941553.1941557), which defines Δ-stepping as an unordered algorithm.) Bellman-Ford updates the shortest path distances to all active vertices on every iteration. On the other hand, Δ-stepping reduces the number of vertices that need to be processed every iteration by updating path distances to vertices that are closer to the source vertex first, before processing vertices farther away.
Writing high-performance ordered graph algorithms is challenging for users who are not experts in performance optimization. Existing frameworks that support ordered graph algorithms (Dhulipala:2017; DBLP:journals/corr/BeamerAP15; nguyen13sosp-galois) require users to be familiar with C/C++ data structures, atomic synchronizations, bitvector manipulations, and other performance optimizations. For example, Figure 2 shows a snippet of a user-defined function for Δ-stepping in Julienne (Dhulipala:2017), a state-of-the-art framework for ordered graph algorithms. The code involves atomics and low-level C/C++ operations.
We propose a new priority-based extension to GraphIt that simplifies writing parallel ordered graph algorithms. GraphIt separates algorithm specifications from performance optimization strategies. The user specifies the high-level algorithm with the algorithm language and uses a separate scheduling language to configure different performance optimizations. The algorithm language extension introduces a set of priority-based data structures and operators to maintain execution ordering while hiding low-level details such as synchronization, deduplication, and the data structures that maintain the ordering of execution. Figure 3 shows the implementation of Δ-stepping using the priority-based extension, which dequeues vertices with the lowest priority and updates their neighbors' distances in each round of the while loop. The while loop terminates when all the vertices' distances are finalized. The algorithm uses an abstract priority queue data structure, pq, and the operators updatePriorityMin and dequeueReadySet to maintain priorities.
The priority-based extension uses a bucketing data structure (Dhulipala:2017; DBLP:journals/corr/BeamerAP15) to maintain the execution ordering. Each bucket stores active vertices of the same priority, and the buckets are sorted in priority order. The program processes one bucket at a time in priority order and dynamically moves active vertices to new buckets when their priorities change. Updates to the bucket structure can be implemented using either an eager bucket update (DBLP:journals/corr/BeamerAP15) approach or a lazy bucket update (Dhulipala:2017) approach. With eager bucket updates, buckets are immediately updated when the priorities of active vertices change. Lazy bucketing buffers the updates and later performs a single bucket update per vertex. Existing frameworks supporting ordered parallel graph algorithms only support one of the two bucketing strategies described above. However, using a suboptimal bucketing strategy can result in more than 10× slowdown, as we show later. Eager and lazy bucketing implementations use different data structures and parallelization schemes, making it difficult to combine both approaches within a single framework.
With the priority-based extension, programmers can switch between lazy and eager bucket update strategies and combine bucketing optimizations with other optimizations using the scheduling language. The compiler leverages program analyses and transformations to generate efficient code for different optimizations. The separation of algorithm and schedule also enables us to build an autotuner for GraphIt that can automatically find high-performance combinations of optimizations for a given ordered algorithm and graph.
Bucketing incurs high synchronization overheads, slowing down algorithms that spend most of their time on bucket operations. We introduce a new performance optimization, bucket fusion, which drastically reduces synchronization overheads. In an ordered algorithm, a bucket can be processed in multiple rounds under a bulk synchronous processing execution model. In every round, the current bucket is emptied, and vertices whose priorities are updated to the current bucket's priority are added to the bucket. The algorithm moves on to the next bucket when no more vertices are added to the current bucket. The key idea of bucket fusion is to fuse consecutive rounds that process the same bucket. Using bucket fusion in GraphIt results in up to 3× speedup on road networks with large diameters over existing work.
We implement the priority-based model as a language and compiler extension to GraphIt (graphit:2018; https://github.com/GraphIt-DSL/graphit), a domain-specific language for writing high-performance graph algorithms. With the extension, GraphIt achieves up to 3× speedup on six ordered graph algorithms (Δ-stepping based single-source shortest paths (SSSP), Δ-stepping based point-to-point shortest path (PPSP), weighted BFS (wBFS), A* search, k-core decomposition, and approximate set cover (SetCover)) over the fastest state-of-the-art frameworks that support ordered algorithms (Julienne (Dhulipala:2017) and Galois (nguyen13sosp-galois)) and hand-optimized implementations (GAPBS (DBLP:journals/corr/BeamerAP15)). Figure 4 shows that GraphIt is up to 16.9× and 1.94× faster than Julienne and Galois, respectively, on the four selected algorithms and supports more algorithms than Galois. Using GraphIt also reduces the lines of code compared to existing frameworks and libraries by up to 4×.
The contributions of this paper are as follows.
A new priority-based programming model in GraphIt that simplifies the programming of ordered graph algorithms and makes it easy to switch between and combine different optimizations (Section 4).
Compiler extensions that leverage program analyses, program transformations, and code generation to produce efficient implementations with different combinations of optimization strategies (Section 5).
A comprehensive evaluation of GraphIt that shows that it is up to 3× faster than state-of-the-art graph frameworks on six ordered graph algorithms (Section 6). GraphIt also significantly reduces the lines of code compared to existing frameworks and libraries.
We first define ordered graph processing as used throughout this paper. Each vertex $v$ has a priority $p_v$. Initially, users can explicitly initialize the priorities of vertices, with the default priority being a null value. These priorities change dynamically throughout the execution. However, the priorities can only change monotonically; that is, they can only be increased or only be decreased. We say that a vertex is finalized if its priority can no longer be updated. The vertices are processed and finalized based on the sorted priority ordering. By default, the ordered execution stops when all vertices with non-null priority values are finalized. Alternatively, the user can define a customized stop condition, for example to halt once a certain vertex has been finalized.
We define priority coarsening as an optimization that coarsens the priority value $p_v$ of a vertex $v$ to $p'_v$ by dividing the original priority by a coarsening factor Δ such that $p'_v = \lfloor p_v / \Delta \rfloor$. The optimization is inspired by Δ-stepping for SSSP and enables greater parallelism at the cost of losing some algorithmic work-efficiency. Priority coarsening is used in algorithms that tolerate some priority inversions, such as A* search, SSSP, and PPSP, but not in k-core and SetCover.
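The effect of coarsening can be illustrated with a minimal Python sketch. It assumes non-negative integer priorities and floor division by the coarsening factor; the function name `coarsen` and the sample distances are illustrative and do not come from GraphIt.

```python
def coarsen(priority: int, delta: int) -> int:
    """Map a priority to its coarsened value, floor(priority / delta)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return priority // delta


# Hypothetical tentative distances of six vertices.
distances = [0, 3, 4, 7, 8, 15]

# With delta = 1, every distinct distance is its own priority level.
fine = [coarsen(d, 1) for d in distances]

# With delta = 4, distances 0..3 share level 0, 4..7 share level 1, and so on.
coarse = [coarsen(d, 4) for d in distances]
```

Vertices that share a coarsened priority can be processed together in parallel, even though their original priorities differ. This is the source of both the extra parallelism and the possible priority inversions.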
3. Performance Optimizations for Ordered Graph Algorithms
We use Δ-stepping for single-source shortest paths (SSSP) as a running example to illustrate the performance tradeoffs between the lazy and eager bucket update approaches, and to introduce our new bucket fusion optimization.
3.1. Lazy Bucket Update
We first consider using the lazy bucket update approach for the Δ-stepping algorithm, with pseudocode shown in Figure 5. The algorithm constructs a bucketing data structure in Line 3, which groups the vertices into buckets according to their priority. It then repeatedly extracts the bucket with the minimum priority (Line 6), and finishes the computation once all of the buckets have been processed (Lines 5–13). To process a bucket, the algorithm iterates over each vertex in the bucket, and updates the priority of its outgoing neighbor destination vertices by updating the neighbor's distance (Line 10). With priority coarsening, the algorithm computes the new priority by dividing the distance by the coarsening factor, Δ. The corresponding bucket update (the vertex and its updated priority) is added to a buffer with a synchronized append (Line 11). The syncAppend can be implemented using atomic operations, or with a prefix sum to avoid atomics. The buffer is later reduced so that each vertex will only have one final bucket update (Line 12). Finally, the buckets are updated in bulk with bulkUpdateBuckets (Line 13).
The lazy bucket update approach can be very efficient when a vertex changes buckets multiple times within a round. The lazy approach buffers the bucket updates, and makes a single insertion to the final bucket. Furthermore, the lazy approach can be combined with other optimizations such as histogram-based reduction on priority updates to further reduce runtime overheads. However, the lazy approach adds additional runtime overhead from maintaining a buffer (Line 7), and performing reductions on the buffer (Line 12) at the end of each round. These overheads can incur a significant cost in cases where there are only a few updates per round (e.g., in SSSP on large diameter road networks).
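The lazy scheme can be modeled in a few lines of sequential Python. This is a sketch under stated assumptions and not the pseudocode of Figure 5: the graph is a dictionary of `(neighbor, weight)` lists with non-negative integer weights, a plain list stands in for the synchronized buffer, and the names `delta_stepping_lazy` and `bucket_of` are hypothetical.

```python
from collections import defaultdict

INF = float("inf")


def delta_stepping_lazy(graph, source, delta):
    """Sequential model of Δ-stepping with lazy (buffered) bucket updates."""
    dist = {v: INF for v in graph}
    dist[source] = 0
    buckets = defaultdict(set)   # coarsened priority -> queued vertices
    bucket_of = {source: 0}      # bucket currently holding each queued vertex
    buckets[0].add(source)
    while True:
        nonempty = [p for p, b in buckets.items() if b]
        if not nonempty:
            break
        prio = min(nonempty)
        frontier = buckets.pop(prio)          # extract the minimum bucket
        for u in frontier:
            del bucket_of[u]
        buffer = []                           # buffered (vertex, priority) updates
        for u in frontier:
            for v, w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    buffer.append((v, dist[v] // delta))  # stands in for syncAppend
        final = {}                            # reduce: keep one update per vertex
        for v, p in buffer:
            final[v] = min(p, final.get(v, p))
        for v, p in final.items():            # bulk bucket update
            old = bucket_of.get(v)
            if old is not None:
                buckets[old].discard(v)
            buckets[p].add(v)
            bucket_of[v] = p
    return dist


graph = {0: [(1, 2), (2, 7)], 1: [(2, 3), (3, 8)], 2: [(3, 1)], 3: []}
result = delta_stepping_lazy(graph, 0, delta=4)
```

However many times a vertex's distance improves within a round, it is moved between buckets at most once, at the end of that round. The price is the buffer and the reduction step, which are paid even when a round produces only a handful of updates.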
3.2. Eager Bucket Update
Another approach for implementing Δ-stepping is to use an eager bucket update approach (shown in Figure 6) that directly updates the bucket of a vertex when its priority changes. The algorithm is naturally implemented using thread-local buckets, which are updated in parallel across different threads (Line 9). Each thread works on a disjoint subset of vertices in the current bucket (Line 10). Using thread-local buckets avoids atomic synchronization overheads on bucket updates (Lines 3 and 12–13). To extract the next bucket, the algorithm first identifies the smallest priority across all threads and then has each thread copy over its local bucket of that priority to a global minBucket (Line 8). If a thread does not have a local bucket of the next smallest priority, then it will skip the copying process. Copying local buckets into a global bucket helps redistribute the work among threads for better load balancing.
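A sequential Python model of the eager scheme is sketched below. It is an illustration rather than the pseudocode of Figure 6: the "threads" are simulated one after another, so no atomics appear, and the names `delta_stepping_eager` and `local_bins` as used here are assumptions of the sketch.

```python
INF = float("inf")


def delta_stepping_eager(graph, source, delta, num_threads=2):
    """Sequential model of Δ-stepping with eager, thread-local bucket updates."""
    dist = {v: INF for v in graph}
    dist[source] = 0
    # One private {priority: [vertices]} map per simulated thread.
    local_bins = [{} for _ in range(num_threads)]
    frontier, prio = [source], 0
    while frontier:
        for t in range(num_threads):
            bins = local_bins[t]
            for u in frontier[t::num_threads]:     # disjoint slice per thread
                if dist[u] // delta != prio:       # stale entry, already handled
                    continue
                for v, w in graph[u]:
                    nd = dist[u] + w
                    if nd < dist[v]:
                        dist[v] = nd
                        # Eager update: insert into the local bucket right away.
                        bins.setdefault(nd // delta, []).append(v)
        # Pick the smallest pending priority across all threads.
        pending = [p for bins in local_bins for p, b in bins.items() if b]
        if not pending:
            break
        prio = min(pending)
        # Copy every thread's local bucket of that priority into the global frontier.
        frontier = [u for bins in local_bins for u in bins.pop(prio, [])]
    return dist


graph = {0: [(1, 2), (2, 7)], 1: [(2, 3), (3, 8)], 2: [(3, 1)], 3: []}
result = delta_stepping_eager(graph, 0, delta=4)
```

In this sketch, old entries are never removed from a bucket when a vertex's priority improves. They are instead filtered by the staleness check when they are eventually dequeued.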
3.3. Eager Bucket Fusion Optimization
A major challenge in bucketing is that a large number of buckets need to be processed, resulting in thousands or even tens of thousands of processing rounds. Since each round requires at least one global synchronization, reducing the number of rounds while maintaining priority ordering can significantly reduce synchronization overhead.
Often in practice, many consecutive rounds process a bucket of the same priority. For example, in Δ-stepping, the priorities of vertices that are higher than the current priority can be lowered by edge relaxations to the current priority in a single round. As a result, the same priority bucket may be processed again in the next round. The process repeats until no new vertices are added to the current bucket. This pattern is common in ordered graph algorithms that use priority coarsening. We observe that rounds processing the same bucket can be fused without violating priority ordering.
Based on this observation, we propose a novel bucket fusion optimization for the eager bucket update approach that allows a thread to execute the next round processing the current bucket without synchronizing with other threads. We illustrate bucket fusion using the Δ-stepping algorithm in Figure 7. The same optimization can be applied in other applications, such as wBFS, A* search, and point-to-point shortest path. This algorithm extends the eager bucket update algorithm (Figure 6) by adding a while loop inside each local thread (Figure 7, Line 14). The while loop executes if the current local bucket is non-empty. If the current local bucket's size is below a certain threshold, the algorithm immediately processes the current bucket without synchronizing with other threads (Figure 7, Line 16). If the current local bucket is large, it will be copied over to the global bucket and distributed across other threads. The threshold is important to avoid creating straggler threads that process too many vertices, leading to load imbalance. The bucket processing logic in the while loop (Figure 7, Lines 17–20) is the same as the original processing logic (Figure 7, Lines 10–13). This optimization is hard to apply for the lazy approach since a global synchronization is needed before bucket updates.
Bucket fusion is particularly useful for road networks where multiple rounds frequently process the same bucket. For example, bucket fusion reduces the number of rounds by more than 30× for SSSP on the RoadUSA graph, leading to more than 3× speedup by significantly reducing the amount of global synchronization (details in Section 6).
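The reduction in rounds can be reproduced with a small sequential Python model. This is a sketch and not the code of Figure 7: threads are simulated one after another, each outer-loop iteration stands for one globally synchronized round, and the parameter `fusion_threshold` (with 0 meaning fusion is disabled) is a hypothetical name.

```python
INF = float("inf")


def delta_stepping(graph, source, delta, num_threads=2, fusion_threshold=0):
    """Model of eager Δ-stepping that counts globally synchronized rounds."""
    dist = {v: INF for v in graph}
    dist[source] = 0
    local_bins = [{} for _ in range(num_threads)]
    frontier, prio, global_rounds = [source], 0, 0

    def process(u, bins):
        if dist[u] // delta != prio:          # stale entry
            return
        for v, w in graph[u]:
            nd = dist[u] + w
            if nd < dist[v]:
                dist[v] = nd
                bins.setdefault(nd // delta, []).append(v)

    while frontier:
        global_rounds += 1                    # one synchronization per round
        for t, bins in enumerate(local_bins):
            for u in frontier[t::num_threads]:
                process(u, bins)
            # Bucket fusion: keep draining the local bucket of the current
            # priority while it is non-empty and below the threshold.
            while 0 < len(bins.get(prio, [])) < fusion_threshold:
                for u in bins.pop(prio):
                    process(u, bins)
        pending = [p for bins in local_bins for p, b in bins.items() if b]
        if not pending:
            break
        prio = min(pending)
        frontier = [u for bins in local_bins for u in bins.pop(prio, [])]
    return dist, global_rounds


# A directed path of 10 vertices, a toy stand-in for a large-diameter road network.
path = {i: [(i + 1, 1)] for i in range(9)}
path[9] = []
dist_plain, rounds_plain = delta_stepping(path, 0, delta=100)
dist_fused, rounds_fused = delta_stepping(path, 0, delta=100, fusion_threshold=8)
```

On this path graph every vertex falls into the same coarsened bucket. Without fusion each vertex costs one synchronized round, whereas with fusion the whole chain is drained inside a single round by one simulated thread.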
4. Programming Model
|Edgeset Apply Operator||Return Type||Description|
|applyUpdatePriority(func f)||none||Applies f(src, dst) to every edge. The f function updates priorities of vertices.|
|Priority Queue Operators|
|priority_queue(bool allow_priority_coarsening, …)||priority_queue||The constructor for the priority queue. It specifies whether priority coarsening is allowed or not, whether higher or lower priority gets executed first, the vector that is used to compute the priority values, and an optional start vertex. A lower_first priority direction specifies that lower priority values are executed first, whereas a higher_first direction indicates that higher priority values are executed first.|
|pq.dequeueReadySet()||vertexset||Returns a bucket with all the vertices that are currently ready to be processed.|
|pq.finished()||bool||Checks if there is any bucket left to process.|
|pq.finishedVertex(Vertex v)||bool||Checks if a vertex’s priority is finalized (finished processing).|
|pq.getCurrentPriority()||priority_type||Returns the priority of the current bucket.|
|pq.updatePriorityMin(Vertex v, ValT new_val)||void||Decreases the value of the priority of the specified vertex v to the new_val.|
|pq.updatePriorityMax(Vertex v, ValT new_val)||void||Increases the value of the priority of the specified vertex v to the new_val.|
|pq.updatePrioritySum(Vertex v, ValT sum_diff, ValType min_threshold)||void||Adds sum_diff to the priority of the Vertex v. The user can specify an optional minimum threshold so that the priority will not go below the threshold.|
The new priority-based extension follows the design of GraphIt and separates the algorithm specification from the performance optimizations, similar to Halide (Ragan-Kelley:2013:HLC:2499370.2462176) and Tiramisu (Baghdadi:2019:TPC). The user writes a high-level algorithm using the algorithm language and specifies optimization strategies using the scheduling language. The extension introduces a set of priority-based data structures and operators to GraphIt to maintain execution order in the algorithm language and adds support for bucketing optimizations in the scheduling language.
4.1. Algorithm Language
The algorithm language exposes opportunities for eager bucket update, eager update with bucket fusion, lazy bucket update, and other optimizations. The high-level operators hide low-level implementation details such as atomic synchronization, deduplication, bucket updates, and priority coarsening. The algorithm language shares the vertex and edge sets, and operators that apply user-defined functions on the sets with the GraphIt algorithm language.
The new priority-based extension proposes high-level priority queue-based abstractions to switch between thread-local and global buckets. The extension to GraphIt also introduces priority update operators to hide the bucket update mechanisms, and provides a new edgeset apply operator, applyUpdatePriority. The priority-based data structures and operators are shown in Table 1.
Figure 3 shows an example of Δ-stepping for SSSP. GraphIt works on vertex and edge sets. The algorithm specification first sets up the edgeset data structures and sets the distances to all the vertices in dist to INT_MAX to represent ∞. It declares the global priority queue, pq, which can be referenced in user-defined functions and the main function. The user then defines a function, updateEdge, that processes each edge. In updateEdge, the user computes a new distance value, and then updates the priority of the destination vertex using the updatePriorityMin operator defined in Table 1. In other algorithms, such as k-core, the user can use updatePrioritySum when the priority is decremented or incremented by a given value. The updatePrioritySum operator can detect if the change to the priority is a constant, and use this fact to perform more optimizations. The priority update operators, updatePriorityMin and updatePrioritySum, hide bucket update operations, allowing the compiler to generate different code for lazy and eager bucket update strategies.
Programmers use the constructor of the priority queue to specify algorithmic information, such as the priority ordering, support for priority coarsening, and the direction that priorities change (documented in Table 1). The abstract priority queue hides low-level bucket implementation details and provides a mapping between vertex data and their priorities. The user specifies a priority_vector that stores the vertex data values used for computing priorities. In SSSP, we use the dist vector and the coarsening parameter (Δ, specified using the scheduling language) to perform priority coarsening. The while loop processes vertices from a bucket until all buckets are finished processing. In each iteration of the while loop, a new bucket is extracted with dequeueReadySet. The edgeset operator in the loop body uses the from operator to keep only the edges that come out of the vertices in the bucket. Then it uses applyUpdatePriority to apply the updateEdge function to outgoing edges of the bucket. Label (#s1#) is later used by the scheduling language to configure optimization strategies.
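To make the division of labor concrete, the following Python sketch models the abstract priority queue and the loop structure described above. It is a toy, sequential stand-in and not GraphIt code: the class `PriorityQueueModel`, its snake_case method names, and the example graph are assumptions that loosely mirror the operators in Table 1, and only the lower_first direction is modeled.

```python
INT_MAX = 2**31 - 1


class PriorityQueueModel:
    """Toy sequential model of the abstract priority queue (lower_first only)."""

    def __init__(self, priority_vector, delta=1, start_vertex=None):
        self.vec = priority_vector      # vertex data used to compute priorities
        self.delta = delta              # coarsening factor
        self.buckets = {}               # coarsened priority -> set of vertices
        self.current = None
        if start_vertex is not None:
            self._insert(start_vertex)

    def _insert(self, v):
        self.buckets.setdefault(self.vec[v] // self.delta, set()).add(v)

    def update_priority_min(self, v, new_val):
        # Lower the vertex data and move the vertex to its new bucket.
        if new_val < self.vec[v]:
            old = self.vec[v] // self.delta
            self.buckets.get(old, set()).discard(v)
            self.vec[v] = new_val
            self._insert(v)

    def finished(self):
        return not any(self.buckets.values())

    def dequeue_ready_set(self):
        self.current = min(p for p, b in self.buckets.items() if b)
        return self.buckets.pop(self.current)

    def get_current_priority(self):
        return self.current


graph = {0: [(1, 2), (2, 7)], 1: [(2, 3), (3, 8)], 2: [(3, 1)], 3: []}
dist = [INT_MAX] * len(graph)
dist[0] = 0
pq = PriorityQueueModel(dist, delta=4, start_vertex=0)


def update_edge(src, dst, weight):
    # The user-defined function only states the priority update.
    pq.update_priority_min(dst, dist[src] + weight)


while not pq.finished():
    bucket = pq.dequeue_ready_set()
    for src in bucket:                  # edges leaving vertices in the bucket
        for dst, weight in graph[src]:
            update_edge(src, dst, weight)
```

The user-defined function never touches a bucket directly. That is the property that lets a compiler substitute a lazy or an eager implementation behind the same operators.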
4.2. Scheduling Language
|Apply Scheduling Functions||Descriptions|
|configApplyPriorityUpdate(label,config);||Config options: eager_with_fusion, eager_no_fusion, lazy_constant_sum, and lazy.|
|configApplyPriorityUpdateDelta(label,config);||Configures the parameter for coarsening the priority range.|
|configBucketFusionThreshold(label, config);||Configures the threshold for the bucket fusion optimization.|
|configNumBuckets(label,config);||Configures the number of buckets that are materialized for the lazy bucket update approach.|
The scheduling language allows users to specify different optimization strategies without changing the algorithm. We extend the scheduling language of GraphIt with new commands to enable switching between eager and lazy bucket update strategies. Users can also tune other parameters, such as the coarsening factor for priority coarsening. The scheduling API extensions are shown in Table 2.
Figure 8 shows a set of schedules for Δ-stepping. GraphIt uses labels (#label#) to identify the algorithm language statements to which the scheduling language commands are applied. The programmer adds the label s1 to the edgeset applyUpdatePriority statement. After the schedule keyword, the programmer calls the scheduling functions. The configApplyPriorityUpdate function allows the programmer to use the lazy bucket update optimization. The programmer can use the original GraphIt scheduling language to configure the direction of edge traversal (configApplyDirection) and the load balance strategies (configApplyParallelization). Direction optimizations can be combined with lazy priority update schedules. configApplyPriorityUpdateDelta is used to set the delta for priority coarsening.
Users can change the schedules to generate code with different combinations of optimizations as shown in Figure 9. Figure 9(a) shows code generated by combining the lazy bucket update strategy and other edge traversal optimizations from the GraphIt scheduling language. The scheduling function configApplyDirection configures the data layout of the frontier and direction of the edge traversal (SparsePush means sparse frontier and push direction). Figure 9(b) shows the code generated when we combine a different traversal direction (DensePull) with the lazy bucketing strategy. Figure 9(c) shows code generated with the eager bucket update strategy. Code generation is explained in Section 5.
5. Compiler Implementation
We demonstrate how the compiler generates code for different bucketing optimizations. The key challenges are in how to insert low-level synchronization and deduplication instructions, and how to combine bucket optimizations with direction optimization and other optimizations in the original GraphIt scheduling language. Furthermore, the compiler has to perform global program transformations and code generation to switch between lazy and eager approaches.
5.1. Lazy Bucket Update Schedules
To support the lazy bucket update approach, the compiler applies program analyses and transformations on the user-defined functions (UDFs). The compiler uses dependence analysis on updatePriorityMin and updatePrioritySum to determine if there are write-write conflicts and inserts atomic instructions as necessary (Figure 9(a), Line 20). Additionally, the compiler needs to insert variables to track whether a vertex's priority has been updated or not (tracking_var in Figure 9(a), Line 18). This variable is used in the generated code to determine which vertices should be added to the buffer outEdges (Figure 9(a), Line 21). Deduplication is enabled with a compare-and-swap (CAS) on deduplication flags (Line 21) to ensure that each vertex is inserted into outEdges only once. Deduplication is required for correctness for applications such as k-core. Changing the schedules with different traversal directions or frontier layouts affects the code generation for edge traversal and user-defined functions (Figure 9(b)). In the DensePull traversal direction, no atomics are needed for the destination nodes.
We built runtime libraries to manage the buffer and update buckets. The compiler generates appropriate calls to the library (getNextBucket, setupFrontier, and updateBuckets). The setupFrontier API (Figure 9(a), Line 24) performs a prefix sum on the outEdges buffer to compute the next frontier. We use a lazy priority queue (declared in Figure 9(a), Line 2) for storing active vertices based on their priorities. The lazy bucketing is based on Julienne's bucket data structures that only materialize a few buckets, and keep vertices outside of the current range in an overflow bucket (Dhulipala:2017). We improve its performance by redesigning the lazy priority queue interface. Julienne's original interface invokes a lambda function call to compute the priority. The new priority-based extension computes the priorities using a priority vector and a Δ value for priority coarsening, eliminating extra function calls.
Lazy with constant sum reduction. We also incorporated a specialized histogram-based reduction optimization (first proposed in Julienne (Dhulipala:2017)) for priority updates that change the priority by a constant value each time. This optimization can be combined with the lazy bucket update strategy to improve performance. For k-core, since the priority of a vertex always decreases by one at each update, we can optimize further by keeping track of only the number of updates with a histogram. This way, we avoid contention on vertices that have a large number of neighbors on the frontier.
To generate code for the histogram optimization, the compiler first analyzes the user-defined function to determine whether the change to the priority of the vertex is a fixed value and whether it is a sum reduction (Figure 10 (top)). The compiler ensures that there is only one priority update operator in the user-defined function. It then extracts the fixed value (-1), the minimum priority (k), and the vertex identifier (dst). In the transformed function (Figure 10 (bottom)), an if statement and a max operator are generated to check and maintain the priority of the vertex. The applyUpdatePriority operator gets the counts of updates to each vertex using a histogram approach and supplies the vertex and count as arguments to the transformed function (Figure 10 (bottom)). The compiler copies all of the expressions used in the priority update operator, and the expressions that they depend on, into the transformed function.
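The idea behind the transformed function can be sketched in Python as follows. The exact generated code is in Figure 10 and is not reproduced here, so the guard-plus-max form below is an assumption based on the description above, and the names `apply_constant_sum` and `update_targets` are hypothetical.

```python
from collections import Counter


def apply_constant_sum(priority, update_targets, k, diff=-1):
    """Apply a round of constant-sum priority updates through a histogram.

    update_targets lists one destination vertex per individual update.
    Instead of applying each update separately, the updates are counted
    per vertex and applied once, never lowering a priority below k.
    """
    counts = Counter(update_targets)            # histogram: vertex -> update count
    for dst, count in counts.items():
        if priority[dst] > k:                   # assumed form of the generated check
            priority[dst] = max(priority[dst] + diff * count, k)
    return counts


# Vertex 'a' has four neighbors on the frontier, 'b' and 'c' have one each.
priority = {"a": 7, "b": 3, "c": 2}
counts = apply_constant_sum(priority, ["a", "a", "b", "a", "c", "a"], k=2)
```

Each vertex is written once per round regardless of how many frontier neighbors it has, which is what removes the contention on high-degree vertices.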
5.2. Eager Bucket Update Schedules
The compiler uses program analysis to determine feasibility of the transformation, transforms user-defined functions and edge traversal code, and uses optimized runtime libraries to generate efficient code for the eager bucket update approach.
The compiler analyzes the while loop (Figure 3) to look for the pattern of an iterative priority update with a termination criterion. The analysis checks that there is no other use of the generated vertexset (bucket) except for the applyUpdatePriority operator, ensuring correctness.
Once the analysis identifies the while loop and edge traversal operator, the compiler replaces the while loop with an ordered processing operator. The ordered processing operator uses an OpenMP parallel region (Figure 9(c), Lines 12–32) to set up thread-local data structures, such as local_bins. We built an optimized runtime library for the ordered processing operator based on the Δ-stepping implementation in GAPBS (DBLP:journals/corr/BeamerAP15). A global vertex frontier (Figure 9(c), Line 11) keeps track of vertices of the next priority (the next bucket). In each iteration of the while loop, the #pragma omp for (Figure 9(c), Lines 15–16) distributes work among the threads. After priorities and buckets are updated, each local thread proposes its next bucket priority, and the smallest priority across threads will be selected (omitted code on Figure 9(c), Line 28). Once the next bucket priority is decided, each thread will copy vertices in its next local bucket to the global frontier (Figure 9(c), Line 30).
Finally, the compiler transforms the user-defined functions by appending the local buckets to the argument list and inserting appropriate synchronization instructions. These transformations allow priority update operators to directly update thread-local buckets (Figure 9(c), Lines 23–26).
Bucket Fusion. The bucket fusion optimization adds another while loop after the end of the for-loop on Line 27 of Figure 9(c), and before finding the next bucket across threads on Line 28. This inner while loop processes the current bucket in the local priority queue (local_bins) if it is not empty and its size is less than a threshold. In the inner while loop, vertices are processed using the same transformed user-defined functions as before. The size threshold improves load balancing, as large buckets are distributed across different threads.
We built an autotuner on top of the extension to automatically find high-performance schedules for a given algorithm and graph. The autotuner is built using OpenTuner (ansel:pact:2014) and stochastically searches through a large number of optimization strategies generated with the scheduling language. It uses an ensemble of search methods, such as the area under curve bandit meta technique, to find good combinations of optimizations within a reasonable amount of time.
We compare the performance of the new priority-based extension in GraphIt to state-of-the-art frameworks and analyze performance tradeoffs among different GraphIt schedules. We use a dual-socket system with Intel Xeon E5-2695 v3 CPUs with 12 cores each for a total of 24 cores and 48 hyper-threads. The system has 128 GB of DDR3-1600 memory and 30 MB last level cache on each socket and runs with Transparent Huge Pages (THP) enabled and Ubuntu 18.04.
|Type||Dataset||Num. Verts||Num. Edges||Symmetric Num. Edges|
|Social Graphs||Orkut (OK) (friendster)||3 M||234 M||234 M|
| ||LiveJournal (LJ) (davis11acm-florida-sparse)||5 M||69 M||85 M|
| ||Twitter (TW) (kwak10www-twitter)||41 M||1469 M||2405 M|
| ||Friendster (FT) (friendster)||125 M||3612 M||3612 M|
| ||WebGraph (WB) (sd-graph)||101 M||2043 M||3880 M|
|Road Graphs||Massachusetts (MA) (openstreetmap)||0.45 M||1.2 M||1.2 M|
| ||Germany (GE) (openstreetmap)||12 M||32 M||32 M|
| ||RoadUSA (RD) (road-graph)||24 M||58 M||58 M|
|GraphIt with extension (ordered)||0.093||0.106||3.094||5.637||2.902||0.207||0.224||0.043||0.061||2.597||4.063||2.473||0.049||0.045||0.072||0.104||1.822||7.563||2.129|
|Algorithm||k-core||Approximate Set Cover||A* search|
|GraphIt with extension (ordered)||0.745||1.634||10.294||14.423||12.876||0.173||0.305||0.494||0.564||5.299||11.499||7.57||0.545||0.859||0.010||0.060||0.075|
Data Sets. Table 3 shows our input graphs and their sizes. For k-core and SetCover, we symmetrize the input graphs. For Δ-stepping based SSSP, wBFS, PPSP using Δ-stepping, and A* search, we use the original directed versions of graphs with integer edge weights. The RoadUSA (RD), Germany (GE), and Massachusetts (MA) road graphs are used for the A* search algorithm, as they have the longitude and latitude data for each vertex. GE and MA are constructed from data downloaded from OpenStreetMap (openstreetmap). Weight distributions used for experiments are described in the caption of Table 4.
Existing Frameworks. Galois v4 (nguyen13sosp-galois) uses approximate priority ordering with an ordered list abstraction for SSSP. We implemented PPSP and A* search using the ordered list. To the best of our knowledge and from communications with the developers, strict priority-based ordering is currently not supported by Galois. Galois does not provide implementations of wBFS, k-core, and SetCover, which require strict priority ordering. GAPBS (DBLP:journals/corr/BeamerAP15) is a suite of C++ implementations of graph algorithms and uses eager bucket update for SSSP. GAPBS does not provide implementations of k-core and SetCover. We used Julienne (Dhulipala:2017) from early 2019. The developers of Julienne have since incorporated the optimized bucketing interface proposed in this paper in the latest version. GraphIt (graphit:2018) and Ligra (shun13ppopp-ligra) are two of the fastest unordered graph frameworks. We used the best configurations (e.g., priority coarsening factor and the number of cores) for the comparison frameworks. Schedules and parameters used are in the artifact.
We evaluate the extension to GraphIt on SSSP with Δ-stepping, weighted breadth-first search (wBFS), point-to-point shortest path (PPSP), A* search, k-core decomposition (k-core), and approximate set cover (SetCover).
SSSP and Weighted Breadth-First Search (wBFS). SSSP with Δ-stepping solves the single-source shortest path problem as shown in Figure 5. In Δ-stepping, vertices are partitioned into buckets with interval Δ according to their current shortest distance. In each iteration, the smallest non-empty bucket i, which contains all vertices with distance in [iΔ, (i+1)Δ), is processed. wBFS is a special case of Δ-stepping for graphs with positive integer edge weights, with Δ fixed to 1 (Dhulipala:2017). We benchmarked wBFS on only the social networks and web graphs with weights in the range , following the convention in previous work (Dhulipala:2017).
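The bucketing scheme described above can be illustrated with a short sequential sketch in Python. This is not GraphIt's generated code: the function name `delta_stepping` and the adjacency-list dictionary are assumptions made for the example, and parallelism within a bucket is omitted. The sketch moves a vertex to its new bucket as soon as its distance improves, in the spirit of an eager bucket update.

```python
import math

def delta_stepping(graph, source, delta):
    """Sequential sketch of Δ-stepping.

    graph: dict mapping a vertex to a list of (neighbor, weight) pairs with
    non-negative integer weights (a hypothetical representation).
    """
    dist = {source: 0}
    # Bucket i holds vertices whose distance lies in [i*delta, (i+1)*delta).
    buckets = {0: {source}}
    while buckets:
        i = min(buckets)              # smallest non-empty bucket
        frontier = buckets.pop(i)
        for u in frontier:
            for v, w in graph.get(u, ()):
                new_dist = dist[u] + w
                old_dist = dist.get(v, math.inf)
                if new_dist >= old_dist:
                    continue
                # Remove v from its old bucket if that bucket is still pending.
                old_bucket = None if old_dist == math.inf else old_dist // delta
                if old_bucket in buckets:
                    buckets[old_bucket].discard(v)
                    if not buckets[old_bucket]:
                        del buckets[old_bucket]
                dist[v] = new_dist
                buckets.setdefault(new_dist // delta, set()).add(v)
    return dist
```

Calling the function with `delta=1` on a graph with positive integer weights corresponds to the wBFS special case, and larger values of `delta` trade some redundant relaxations for fewer, larger buckets.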
Point-to-point Shortest Path (PPSP). Point-to-point shortest path (PPSP) takes a graph G, a source vertex s, and a destination vertex d as inputs and computes the shortest path between s and d. In our PPSP application, we used the Δ-stepping algorithm with priority coarsening. It terminates the program early when it enters iteration i where iΔ is greater than or equal to the shortest distance between s and d that it has already found.
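The early-termination rule can be sketched as follows. The code is a simplified sequential illustration under assumed names (`ppsp_delta_stepping`, a dictionary-based adjacency list, non-negative integer weights); it skips stale bucket entries instead of removing them, which keeps the example short.

```python
import math

def ppsp_delta_stepping(graph, source, target, delta):
    """Return the shortest distance from source to target, or math.inf."""
    dist = {source: 0}
    buckets = {0: [source]}
    while buckets:
        i = min(buckets)
        # Every vertex still in a bucket has distance >= i * delta, so no
        # shorter path to the target can be found once this bound is reached.
        if i * delta >= dist.get(target, math.inf):
            break
        for u in buckets.pop(i):
            if dist[u] // delta != i:
                continue  # stale entry: u was already moved to a lower bucket
            for v, w in graph.get(u, ()):
                new_dist = dist[u] + w
                if new_dist < dist.get(v, math.inf):
                    dist[v] = new_dist
                    buckets.setdefault(new_dist // delta, []).append(v)
    return dist.get(target, math.inf)
```

The bound check relies on edge weights being non-negative: a vertex in bucket i or later cannot lead to a path shorter than iΔ.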
A* Search. The A* search algorithm finds the shortest path between two points. The difference between A* search and Δ-stepping is that, instead of using the current shortest distance to a vertex as the priority, A* search uses the estimated length of the shortest path from the source to the target vertex that passes through the current vertex. Our A* search implementation is based on related work (chronos) and needs the longitude and latitude of the vertices.
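To make the priority concrete, the following sketch computes it as the distance found so far plus a great-circle estimate of the remaining distance. Several choices here are assumptions for illustration only: the haversine formula as the estimate, edge weights in kilometers, the names `haversine_km` and `a_star`, and a binary heap (`heapq`) in place of the bucketed priority structure used in the paper.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def a_star(graph, coords, source, target):
    """graph: vertex -> list of (neighbor, weight in km); coords: vertex -> (lat, lon)."""
    dist = {source: 0.0}
    heap = [(haversine_km(coords[source], coords[target]), source)]
    while heap:
        _, u = heapq.heappop(heap)
        if u == target:
            return dist[u]
        for v, w in graph.get(u, ()):
            new_dist = dist[u] + w
            if new_dist < dist.get(v, math.inf):
                dist[v] = new_dist
                # Priority = distance so far + estimated distance to the target.
                estimate = haversine_km(coords[v], coords[target])
                heapq.heappush(heap, (new_dist + estimate, v))
    return math.inf
```

The result is exact only if no edge is shorter than the straight-line distance between its endpoints, which is the usual condition for this kind of estimate on road graphs.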
k-core. A k-core of an undirected graph G refers to a maximal connected sub-graph of G where all vertices in the sub-graph have induced-degree at least k. The k-core problem takes an undirected graph G as input and, for each vertex u of G, computes the maximum k-core that u is contained in (this value is referred to as the coreness of the vertex) using a peeling procedure (Matula:1983:SOC:2402.322385).
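The peeling procedure can be sketched with buckets keyed by induced degree. The sketch below is sequential and uses assumed names (`coreness`, a dictionary of neighbor sets); it clamps a vertex's priority at the current k, so a vertex whose degree drops below k is peeled at level k.

```python
def coreness(adj):
    """adj: vertex -> set of neighbors (undirected). Returns vertex -> coreness."""
    priority = {v: len(neighbors) for v, neighbors in adj.items()}
    buckets = {}
    for v, d in priority.items():
        buckets.setdefault(d, set()).add(v)
    core = {}
    while buckets:
        k = min(buckets)                 # lowest remaining priority
        frontier = buckets.pop(k)
        for v in frontier:
            core[v] = k                  # peel every vertex in bucket k
        for v in frontier:
            for u in adj[v]:
                if u in core or priority[u] <= k:
                    continue             # already peeled, or already at level k
                old = priority[u]
                new = max(old - 1, k)    # never drop below the current level
                priority[u] = new
                buckets[old].discard(u)
                if not buckets[old]:
                    del buckets[old]
                buckets.setdefault(new, set()).add(u)
    return core
```

For a triangle with one pendant vertex attached, the pendant vertex has coreness 1 and the triangle vertices have coreness 2.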
Approximate Set Cover. The set cover problem takes as input a universe $U$ containing a set of ground elements, a collection of sets $\mathcal{F}$ s.t. $\bigcup_{f \in \mathcal{F}} f = U$, and a cost function $c : \mathcal{F} \to \mathbb{R}_{+}$. The problem is to find the cheapest collection of sets $\mathcal{A} \subseteq \mathcal{F}$ that covers $U$, i.e. $\bigcup_{a \in \mathcal{A}} a = U$. In this paper, we implement the unweighted version of the problem, where $c(f) = 1$ for every $f \in \mathcal{F}$, but the algorithm used easily generalizes to the weighted case (Dhulipala:2017). At a high level, the algorithm works by bucketing the sets based on their cost per element, i.e., the ratio of the number of uncovered elements they cover to their cost. At each step, a nearly-independent subset of sets from the highest bucket (sets with the best cost per element) is chosen and removed, and the remaining sets are reinserted into a bucket corresponding to their new cost per element. We refer to the following papers by Blelloch et al. (blelloch11manis; blelloch12setcover) for algorithmic details and a survey of related work.
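The bucketing and reinsertion steps can be illustrated with a sequential greedy sketch for the unweighted case. It is a deliberate simplification and not the parallel algorithm of Blelloch et al.: it picks one set at a time instead of a nearly-independent subset, and it buckets by the exact number of uncovered elements. The name `greedy_set_cover` and the dictionary input are assumptions for the example.

```python
def greedy_set_cover(sets):
    """sets: name -> set of elements. Returns the names of the chosen sets."""
    uncovered = set().union(*sets.values())
    # Bucket each set by the number of uncovered elements it covered when it
    # was last examined; this count can only be an overestimate later on.
    buckets = {}
    for name, elements in sets.items():
        if elements:
            buckets.setdefault(len(elements), []).append(name)
    chosen = []
    while uncovered:
        top = max(buckets)               # bucket with the best cost per element
        name = buckets[top].pop()
        if not buckets[top]:
            del buckets[top]
        gain = len(sets[name] & uncovered)
        if gain == top:
            chosen.append(name)          # priority is up to date: take the set
            uncovered -= sets[name]
        elif gain > 0:
            # Priority was stale: reinsert into the bucket for the new value.
            buckets.setdefault(gain, []).append(name)
    return chosen
```

Because a set's stored priority never underestimates its current gain, taking a set only when the two agree yields the same choices as a plain greedy pass.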
6.2. Comparisons with other Frameworks
Table 4 shows the execution times of GraphIt with the new priority-based extension and other frameworks. GraphIt outperforms the next fastest of Julienne, Galois, GAPBS, GraphIt, and Ligra by up to 3× and is no more than 6% slower than the fastest. GraphIt is up to 16.8× faster than Julienne, 7.8× faster than Galois, and 3.5× faster than hand-optimized GAPBS. Compared to the unordered frameworks, GraphIt without the priority-based extension (unordered) and Ligra, GraphIt with the extension achieves speedups from 1.67× to more than 600× due to improved algorithm efficiency. The times for SSSP and wBFS are averaged over 10 starting vertices. The times for PPSP and A* search are averaged over 10 source-destination pairs. We chose start and end points to have a balanced selection of different distances.
GraphIt with the priority extension has the fastest SSSP performance on six out of the seven input graphs. Julienne executes significantly more instructions than GraphIt (up to 16.4× as many). On every iteration, Julienne computes an out-degree sum for the vertices on the frontier to use the direction optimization, which adds significant runtime overhead. GraphIt avoids this overhead by disabling the direction optimization with the scheduling language. Julienne also uses lazy bucket update, which generates extra instructions to buffer the bucket updates, whereas GraphIt saves instructions by using eager bucket update. GraphIt is faster than GAPBS because of the bucket fusion optimization, which allows GraphIt to process more vertices in each round and use fewer rounds (details are shown in Table 6). The optimization is especially effective for road networks, where synchronization overhead is a significant performance bottleneck. Galois achieves good performance on SSSP because it does not have as much overhead from the global synchronization needed to enforce strict priority. However, it is slower than GraphIt on most graphs because approximate priority ordering sacrifices some work-efficiency.
GraphIt with the priority extension is the fastest on most of the graphs for PPSP, wBFS, and A* search, which use a variant of the Δ-stepping algorithm with priority coarsening. Both GraphIt and GAPBS use eager bucket update for these algorithms. GraphIt outperforms GAPBS because of bucket fusion. Galois is often slower than GraphIt due to lower work-efficiency with the approximate priority ordering. Julienne uses lazy bucket update and is slower than GraphIt due to the runtime overheads of the lazy approach.
PPSP and A* search are faster than SSSP as they only run until the distance to the destination vertex is finalized. A* search is sometimes slower than PPSP because of additional random memory accesses and computation needed for estimating distances to the destination.
For k-core and SetCover, the extended GraphIt runs faster than Julienne because the optimized lazy bucketing interface uses the priority vector to compute the priority of each vertex. Julienne uses a user-defined function to compute the priority every time, resulting in function call overheads and redundant computations. Galois does not provide ordered algorithms for k-core and SetCover, which require strict priority and synchronizations after processing each priority.
Delta Selection for Priority Coarsening. The best Δ value for each algorithm depends on the size and the structure of the graph. The best Δ values for social networks (ranging from 1 to 100) are much smaller than the Δ values for road networks with large diameters (ranging from to ). Social networks need only a small Δ value because they have ample parallelism with large frontiers, so work-efficiency is more important. Road networks need larger Δ values to expose more parallelism. We also tuned the Δ values for the comparison frameworks to provide the best performance.
Autotuning. The autotuner for GraphIt is able to automatically find schedules that perform within 5% of the hand-tuned schedules used for Table 4. For most graphs, the autotuner can find a high-performance schedule within 300 seconds after trying 30-40 schedules (including tuning integer parameters) in a large space of about schedules. The autotuning process finished within 5000 seconds for the largest graphs. Users can specify a time limit to reduce autotuning time.
|GraphIt with extension||GAPBS||Galois||Julienne|
Line Count Comparisons. Table 5 shows the line counts of the five graph algorithms implemented in four frameworks. GAPBS, Galois, and Julienne all require the programmer to take care of implementation details such as atomic synchronization and deduplication. GraphIt uses the compiler to automatically generate these instructions. For A* search and SetCover, GraphIt needs to use long extern functions that significantly increase the line counts.
6.3. Scalability Analysis
We analyze the scalability of different frameworks in Figure 11 for SSSP on social and road networks. The social networks (TW and FT) have very small diameters and large numbers of vertices. As a result, they have a lot of parallelism in each bucket, and all three frameworks scale reasonably well (Figure 11(a) and (b)). Compared to GAPBS, GraphIt uses bucket fusion to significantly reduce synchronization overheads and improve parallelism on the RoadUSA network (Figure 11(c)). GAPBS suffers from NUMA accesses when going beyond a single socket (12 cores). Julienne’s overheads from lazy bucket updates make it hard to scale on the RoadUSA graph.
6.4. Performance of Different Schedules
|Datasets||with Fusion||without Fusion|
|TW||3.09s [1025 rounds]||3.55s [1489 rounds]|
|FT||5.64s [5604 rounds]||6.09s [7281 rounds]|
|WB||2.90s [772 rounds]||3.30s [2248 rounds]|
|RD||0.22s [1069 rounds]||0.77s [48407 rounds]|
|k-core||SSSP with Δ-stepping|
|Datasets||Eager Update||Lazy Update||Eager Update||Lazy Update|
Table 6 shows that SSSP with bucket fusion achieves up to 3.5× speedup over the version without bucket fusion on road networks, where a large number of rounds are needed to process each bucket. The optimization improves running time by significantly reducing the number of rounds needed to complete the algorithm.
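The effect of bucket fusion on the round count can be illustrated with a small sequential simulation. This is only a sketch under simplifying assumptions: a single worker, no threshold on the size of the local bucket, and hypothetical names (`sssp_rounds`, the `fuse` flag). With fusion enabled, a vertex that lands in the bucket currently being processed is handled in the same round instead of waiting for the next synchronization.

```python
import math

def sssp_rounds(graph, source, delta, fuse):
    """Return (distances, number of rounds) for a bucketed SSSP sketch."""
    dist = {source: 0}
    buckets = {0: [source]}
    rounds = 0
    while buckets:
        i = min(buckets)
        frontier = buckets.pop(i)
        rounds += 1                      # one global synchronization per round
        while frontier:
            u = frontier.pop()
            if dist[u] // delta != i:
                continue                 # stale entry from an earlier update
            for v, w in graph.get(u, ()):
                new_dist = dist[u] + w
                if new_dist < dist.get(v, math.inf):
                    dist[v] = new_dist
                    bucket = new_dist // delta
                    if fuse and bucket == i:
                        frontier.append(v)   # same priority: no barrier needed
                    else:
                        buckets.setdefault(bucket, []).append(v)
    return dist, rounds
```

On a path of nine vertices with unit weights and `delta=4`, the sketch uses 9 rounds without fusion and 3 rounds with fusion while producing the same distances, which mirrors the behavior on large-diameter road networks.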
Table 7 shows the performance impact of eager versus lazy bucket updates on k-core and SSSP. k-core performs a large number of redundant updates on the priority of each vertex: every vertex’s priority is updated as many times as its out-degree. In this case, using the lazy bucket update approach drastically reduces the number of bucket insertions. Additionally, with a lazy approach, we can buffer the priority updates and later reduce them with a histogram approach (lazy with constant sum reduction optimization). This histogram-based reduction avoids overhead from atomic operations. For SSSP, there are not many redundant updates, and the lazy approach introduces significant runtime overhead over the eager approach.
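The histogram-based reduction can be sketched for one peeling round of k-core. The function and parameter names below (`lazy_priority_updates`, `finished`) are assumptions for illustration, and `collections.Counter` stands in for a parallel histogram: the updates are buffered, counted per vertex, and applied once, so a vertex that loses several neighbors in the same round is moved to a new bucket only once.

```python
from collections import Counter

def lazy_priority_updates(adj, frontier, degree, finished, k):
    """Apply one round of buffered k-core updates; return vertex -> new priority."""
    # Buffer one entry per edge from a peeled vertex to a live neighbor.
    buffered = [u for v in frontier for u in adj[v] if u not in finished]
    histogram = Counter(buffered)        # number of lost neighbors per vertex
    moved = {}
    for u, count in histogram.items():
        new_priority = max(degree[u] - count, k)   # clamp at the current level
        if new_priority != degree[u]:
            degree[u] = new_priority
            moved[u] = new_priority      # a single bucket move for u
    return moved
```

For a star whose three leaves are peeled at k = 1, the center receives three buffered updates but is moved to a new bucket only once.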
7. Related Work
Parallel Graph Processing Frameworks. There has been a significant amount of work on unordered graph processing frameworks (e.g., (shun13ppopp-ligra; gluon2018; Zhu16gemni; Grossman2018; Yunming2017; kyrola12osdi-graphchi; Ham:2016:GHE:3195638.3195707; prabhakaran12atc-grace; Sakr2017; Wang:2018:LLD:3178487.3178508; Gonzalez2012; sundaram15vldb-graphmat; Wang:2017:GGG:3131890.3108140; GSwitch2019; Xu:PnP; KickStarter2017; Pai2016; graphit:2018; DBLP:journals/pvldb/SongLWGLJ18; Mukkara:18:TS; Mukkara:2019:PAS; Dhulipala:2019:LGS:3314221.3314598), among many others). These frameworks do not have data structures and operators to support efficient implementations of ordered algorithms, and cannot support a wide selection of ordered graph algorithms. A few unordered frameworks (Wang:2017:GGG:3131890.3108140; GSwitch2019; sundaram15vldb-graphmat) have the users define functions that filter out vertices to support Δ-stepping for SSSP. This approach is not very efficient and does not generalize to other ordered algorithms. Wonderland (Zhang:2018:WNA:3173162.3173208) uses abstraction-guided priority-based scheduling to reduce the total number of iterations for some graph algorithms. However, it requires preprocessing and does not implement a strict ordering of the ordered graph algorithms. PnP (Xu:PnP) proposes direction-based optimizations for point-to-point queries, which are orthogonal to the optimizations in this paper; the two can be combined to potentially achieve even better performance. GraphIt (graphit:2018) decouples the algorithm from optimizations for unordered graph algorithms. This paper introduces new priority-based operators to the algorithm language, proposes new optimizations for the ordered algorithms in the scheduling language, and extends the compiler to generate efficient code.
Bucketing. Bucketing is a common way to exploit parallelism and maintain ordering in ordered graph algorithms. It is expressive enough to implement many parallel ordered graph algorithms (Dhulipala:2017; DBLP:journals/corr/BeamerAP15). Existing frameworks support either the lazy or the eager bucket update approach. GAPBS (DBLP:journals/corr/BeamerAP15) is a suite of hand-optimized C++ programs that includes SSSP using the eager bucket update approach. Julienne (Dhulipala:2017) is a high-level programming framework that uses the lazy bucket update approach, which is efficient for applications that have a lot of redundant updates, such as k-core and SetCover. However, it is not as efficient for applications that have fewer redundant updates and less work per bucket, such as SSSP and A* search. GraphIt with the priority-based extension unifies both the eager and lazy bucket update approaches with a new programming model and compiler extensions to achieve consistently high performance.
Speculative Execution. Speculative execution can also exploit parallelism in ordered graph algorithms (Hassaan:2011:OVU:1941553.1941557; Hassaan:2015:KDG:2694344.2694363). This approach can potentially generate more parallelism as vertices with different priorities are executed in parallel as long as the dependencies are preserved. This is particularly important for many discrete simulation applications that lack parallelism. However, speculative execution in software incurs significant performance overheads as a commit queue has to be maintained, conflicts need to be detected, and values are buffered for potential rollback on conflicts. Hardware solutions have been proposed to reduce the overheads of speculative execution (Jeffrey:Swarm; Suvinay:Fractal:2017; Jeffrey:DataCentricExectuion; Jeffrey:2018; chronos), but it is costly to build customized hardware. Furthermore, some ordered graph algorithms, such as approximate set cover and k-core, cannot be easily expressed with speculative execution.
Approximate Priority Ordering. Some work disregards a strict priority ordering and uses an approximate priority ordering (nguyen13sosp-galois; gluon2018; alistarh2015spraylist; DBLP:conf/spaa/Alistarh0KLN18). This approach uses several “relaxed” priority queues in parallel to maintain local priority ordering. However, it does not synchronize globally among the different priority queues. To the best of our knowledge and from communications with the developers, Galois (nguyen13sosp-galois; gluon2018) does not currently support strict priority ordering and only supports an approximate ordering. Galois (nguyen13sosp-galois) provides an ordered list abstraction, which does not explicitly synchronize after each priority. As a result, it is hard to implement algorithms that require explicit synchronization, such as k-core. Galois also requires users to handle atomic synchronizations for correctness. This approach cannot implement certain ordered algorithms that require strict priority ordering, such as work-efficient k-core decomposition and SetCover. D-galois (Dathathri:2019:PSR:3297858.3304056) implements k-core for only a specific k, whereas GraphIt’s k-core finds all cores.
Synchronization Relaxation. There have been a number of frameworks that relax synchronizations in graph algorithms for better performance while preserving correctness (Harshvardhan:2014:KNA; Vora:2014:AEA; Ben-Nun:2017:GAM). Compared to existing synchronization relaxation work, bucket fusion in our new priority-based extension is more restrictive about when synchronization is relaxed. The synchronization between rounds can be removed only when the vertices processed in the next round have the same priority as the vertices processed in the current round. This way, we ensure that no priority inversion happens.
We introduce a new priority-based extension to GraphIt that simplifies the programming of parallel ordered graph algorithms and generates high-performance implementations. We propose a novel bucket fusion optimization that significantly improves the performance of many ordered graph algorithms on road networks. GraphIt with the extension achieves up to 3× speedup on six ordered algorithms over state-of-the-art frameworks (Julienne, Galois, and GAPBS) while significantly reducing the number of lines of code.
Appendix A Artifact Evaluation Information
Algorithms: SSSP with Δ-stepping, PPSP, wBFS, A* search, k-core, and Approximate Set Cover
Compilation: C++ compiler with C++14 support, Cilk Plus and OpenMP
Binary: Compiled C++ code
Data set: Social, Web, and Road graphs
Run-time environment: Ubuntu 11.04
Hardware: 2-socket Intel Xeon E5-2695 v3 CPUs with Transparent Huge Pages enabled
Publicly available? Yes
Code licenses (if publicly available)? MIT License
The detailed instructions to evaluate the artifact are available at https://github.com/GraphIt-DSL/graphit/blob/master/graphit_eval/priority_graph_cgo2020_eval/readme.md.
The evaluation in the link first demonstrates how SSSP with Δ-stepping with different schedules is compiled to C++ programs (Figure 9). Then we provide instructions on how to run different algorithms on small graphs serially. Finally, there is an optional part that replicates the parallel performance on a more powerful 2-socket machine for the LiveJournal, Twitter, and RoadUSA graphs (Table 4).
We thank Maleen Abeydeera for help with A* search and Mark Jeffrey for helpful comments. This research was supported by DOE Early Career Award #DE-SC0018947, NSF CAREER Award #CCF-1845763, MIT Research Support Committee Award, DARPA SDH Award #HR0011-18-3-0007, Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, Toyota Research Institute, DoE Exascale award #DE-SC0008923, DARPA D3M Award #FA8750-17-2-0126.