Many application areas, including machine learning, social networking, business intelligence, and bioinformatics, place great importance on solving problems modeled as graphs [17, 19, 11]. A graph consists of vertices and edges, the former conceptually modeling objects and the latter the relationships between objects. A graph computation begins by assigning initial values to each vertex. Vertex values are updated iteratively by transmitting information along the edges as messages to neighbors [31, 12], which are then aggregated locally at each vertex to produce a new value. The entire process repeats until some application-specific convergence condition is reached.
It is often the case that within one iteration only a subset of vertices receive an updated value that differs from their previous value. During the following iteration, it is only useful to propagate outbound messages from these active vertices. Exploiting this behavior, known as the frontier optimization, can significantly reduce the work required to complete an iteration [1, 2, 40]. A graph processing framework implements this optimization by tracking which vertices an application updates using a data structure called the frontier [40, 16, 44, 51, 31].
Execution of a graph processing application may follow either a push [35, 49, 40, 51, 44, 18] or a pull processing pattern [40, 51, 47]. Push-based engines iterate over source vertices and propagate outbound messages. When parallelizing this write-heavy workload, we must use synchronization to avoid conflicting updates to destination vertices. Conversely, pull-based engines iterate over destination vertices and aggregate inbound messages. Reads are dominant and, since writes occur once per vertex, no synchronization is required when an iteration is parallelized. Hence, a pull-based engine achieves significantly higher throughput than a push-based engine. However, the push pattern enables an efficient implementation of the frontier optimization [16, 4]. The frontier tracks vertices that are active as sources, which is in alignment with the source orientation of the push pattern. A push-based engine can simply process the out-edges for the vertices included in the frontier. Conversely, the pull pattern’s destination orientation is out of alignment with the frontier. A pull-based engine must check every single edge in order to find out if its source vertex is active.
Figure 0(a) quantifies the performance trade-off. For applications that do not use the frontier optimization, such as PageRank (PR), the higher throughput of a pull engine leads to speedups of over a push engine, a result reflective of the state-of-the-art for both patterns and consistent with Grazelle’s published evaluation . On the other hand, the efficient handling of frontiers by the push engine leads to performance gains of up to for frontier-driven applications such as Breadth-First Search (BFS), Connected Components (CC), and Single-Source Shortest Path (SSSP).
Existing graph processing frameworks manage this tradeoff using hybrid engines that implement both patterns and dynamically select a push or pull iteration based on the frontier density [40, 51, 16]. In Figure 0(a) the hybrid configuration achieves the best of both worlds and outperforms push-only mode by up to 10% for the frontier-driven applications shown. However, there are two key disadvantages of this approach. First, programmers must write and optimize their graph applications twice, once for the pull and once for the push pattern. Second, iterations that execute using the push-based engine incur reduced processing throughput. Figure 0(b) shows the time division between push and pull for the three frontier-based applications we tested when run in hybrid mode. The time spent on sparse iterations that use the lower-throughput push engine ranges from 10% to 100%. Were these iterations able to use the high-throughput pull engine without sacrificing frontier handling efficiency, the overall performance benefits could be significant.
Rather than continuing to juggle push and pull, we propose eliminating the push pattern entirely. Our key contribution is Wedge, a high-throughput pull-only graph processing framework that implements the frontier optimization efficiently, despite conventional wisdom stating that this is fundamentally impossible. Wedge’s pull engine continues to produce the traditional source-oriented frontier, but we add a transformation step that converts it to a more pull-friendly format called the Wedge Frontier. The Wedge Frontier is destination-oriented but, rather than tracking active vertices, it tracks individual active edges; simply flipping the frontier to track destination vertices would introduce a potentially unbounded amount of wasted work.
Wedge’s transformation step is application-independent, but its overhead can be significant even when parallelized. Hence, we propose two key optimizations to make it practical. First, we borrow the key concept of hybrid engines and only transform the traditional frontier into the Wedge Frontier when it is sufficiently empty, thus requiring little work to transform. Otherwise we simply execute the pull engine on the entire graph without a frontier. In both cases, Wedge uses the pull engine for processing, improving processing throughput and eliminating the need to implement the application multiple times. Second, we tweak the granularity of the Wedge Frontier so that one element within it can correspond to multiple edges. This represents a trade-off between pull engine and transformation step performance: a coarser granularity adds potentially unnecessary work to the pull engine but reduces the transformation step overheads since fewer elements need to be added to the Wedge Frontier. The opposite is true of a finer granularity.
Our implementation of Wedge is built on top of Grazelle, a state-of-the-art open-source hybrid graph processing framework, resulting in a new pull-only version. Wedge’s two key optimization strategies respectively improve its performance by up to and , enabling it to outperform existing graph processing frameworks Grazelle, Ligra, and GraphMat respectively by up to , , and .
Wedge is publicly available on GitHub. It can be accessed at https://github.com/stanford-mast/Wedge.
Our focus is in-memory graph processing on a single server machine, although the techniques we propose with Wedge are not conceptually restricted to this setup. A modern server can house several terabytes of DRAM and many real-world problems can comfortably fit within this capacity [26, 27]. Processing graphs in memory on a single machine leads to significantly higher performance than can be achieved using distributed [31, 39, 23, 22, 13, 9, 52, 49, 29] or out-of-core [56, 37, 25, 55, 36] approaches.
A graph processing application proceeds in two phases. In the first phase, messages are exchanged along edges and aggregated at each destination vertex. In the second phase, local computations are performed on each vertex to produce its updated value based on the aggregation of incoming messages. An application alternates between these two phases iteratively until some convergence condition is reached. If a graph engine follows the Bulk-Synchronous Parallel model and completes processing of all vertices in one phase before moving on to the next, it is said to be synchronous; otherwise it is asynchronous. Existing work has found that there is no clear winner between these two types of graph engines [12, 50]. We focus on synchronous processing because of its relative simplicity.
2.1 Push vs. Pull
The fundamental unit of work in the first phase of graph processing is a single edge: a message is propagated from the vertex at the source to the vertex at the destination. While edges can be processed in any order, it is common to group them by source or destination vertex. The former grouping produces a push processing pattern (Figure 1(a)), whereby a vertex is read once and its outgoing message is aggregated at its outbound neighbors. If we parallelize the first phase, this write-heavy workload requires synchronization because threads may conflict as they update destination vertices. Conversely, the latter grouping produces a pull processing pattern (Figure 1(b)), in which inbound messages are read, and the result of the aggregation is written a single time to each destination vertex. This workload is dominated by reads and requires no synchronization because each vertex receives exactly one write. In terms of programmability the two patterns are equivalent: any application can be represented using either. However, the read-heavy and unsynchronized pattern of a pull-based engine leads to significantly higher processing throughput than a push-based engine, as reflected in the PageRank results of Figure 0(a).
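The two groupings can be sketched on a toy graph. This is only an illustration under assumptions of our own: the four-vertex graph, the vertex values, and the choice of `min` as the aggregation operator are hypothetical and are not taken from any framework discussed here.

```python
# Hypothetical 4-vertex graph; edges are (source, destination) pairs.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
VALUES = [0, 5, 7, 9]  # one value per vertex
INF = float("inf")

def push_iteration(edges, values):
    """Group edges by source: each source writes into its out-neighbors."""
    out_edges = {}
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
    agg = [INF] * len(values)
    for src, dsts in out_edges.items():
        msg = values[src]  # the source value is read once
        for dst in dsts:
            # Many writes land on shared destinations; a parallel version
            # would need an atomic min or a lock at this point.
            agg[dst] = min(agg[dst], msg)
    return agg

def pull_iteration(edges, values):
    """Group edges by destination: each destination reads its in-neighbors."""
    in_edges = {}
    for src, dst in edges:
        in_edges.setdefault(dst, []).append(src)
    agg = [INF] * len(values)
    for dst, srcs in in_edges.items():
        # Reads dominate; the aggregate is written exactly once per vertex.
        agg[dst] = min(values[src] for src in srcs)
    return agg
```

Both functions compute the same aggregate, which mirrors the programmability equivalence noted above; they differ only in where the writes land.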
Despite its higher throughput, a pull engine is severely disadvantaged compared to a push engine in terms of its ability to implement the frontier optimization. This important optimization exploits the common application behavior that only a subset of the graph may need to be processed during each iteration. Subset size varies per iteration and can be as small as a single-digit number of edges. More specifically, an application will add a vertex to the set of active vertices (called the frontier) during a particular iteration if its value is updated during that same iteration and so should be transmitted outbound to its neighbors during the next iteration. Adding a vertex to the frontier during one iteration means that message propagation must occur along all its out-edges during the next. Out-edges from vertices not in the frontier can be skipped.
A push engine iterates over source vertices and is therefore properly aligned with the frontier. It can implement the frontier optimization by iterating over the vertices present in the frontier and processing the out-edges associated with each. In contrast, a pull engine iterates over destination vertices and, as a result, is out of alignment with the frontier. It must examine each individual incoming edge before it knows the source vertex and is able to check its frontier membership. Whereas a push engine consults the frontier before accessing any edges, the pull engine must access edges before checking the frontier and must therefore unconditionally scan through all the edges in the graph. The frontier optimization is so effective that, despite the push engine's lower throughput and per Figure 0(a), a push engine can achieve a speedup of up to over a pull engine on frontier-driven applications.
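The difference in work can be made concrete by counting the edges each engine examines. The sketch below uses a hypothetical nine-edge graph and a frontier of two source vertices, both chosen purely for illustration.

```python
# Hypothetical graph: (source, destination) pairs; vertices 2 and 4 are active.
EDGES = [(2, 4), (2, 6), (4, 8), (4, 9), (1, 4), (5, 6), (7, 9), (0, 1), (3, 5)]
FRONTIER = {2, 4}

def push_edges_examined(edges, frontier):
    """Push consults the frontier first, then touches only active out-edges."""
    out_edges = {}
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
    return sum(len(out_edges.get(src, [])) for src in frontier)

def pull_edges_examined(edges, frontier):
    """Pull must read each in-edge before it can test the source's membership."""
    in_edges = {}
    for src, dst in edges:
        in_edges.setdefault(dst, []).append(src)
    examined = 0
    useful = 0
    for dst, srcs in in_edges.items():
        for src in srcs:
            examined += 1        # the edge is read unconditionally
            if src in frontier:  # only now can the frontier be checked
                useful += 1
    return examined, useful
```

On this example the push engine examines four edges, while the pull engine examines all nine to find the same four useful ones.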
Previous work has proposed hybrid graph processing frameworks as a means of overcoming the trade-off between push and pull [40, 51, 16], an approach that, per Figure 0(a), is able to achieve the best of both worlds. A block diagram illustrating the top-level control flow of a hybrid framework is shown in Figure 3. When the frontier is almost full, both push and pull would propagate messages along a very high fraction of the edges in the graph, so the deciding factor is throughput and therefore pull is selected. Conversely, when the frontier is sufficiently empty, enough work would be saved by exploiting the frontier optimization that choosing push is worthwhile despite its lower throughput. In determining which engine to use, a hybrid framework computes the sum of the out-degrees of all of the vertices in the frontier and compares the result to a pre-determined threshold, typically a certain fraction of the total number of edges in the graph. The exact value to use is often determined experimentally [40, 16], and the decision of push or pull is made per iteration. While effective from a performance perspective, the disadvantage of a hybrid engine is that the application writer must implement the application twice (once for push and once for pull).
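The per-iteration engine selection described above can be sketched as follows. The function name, the default threshold fraction, and the example out-degrees are assumptions made for illustration; real frameworks determine the threshold experimentally.

```python
def choose_engine(frontier, out_degree, num_edges, threshold_fraction=0.05):
    """Pick an engine for the next iteration from the frontier's edge count.

    threshold_fraction is a hypothetical tuning value, expressed as a
    fraction of the total number of edges in the graph.
    """
    active_edges = sum(out_degree[v] for v in frontier)
    if active_edges > threshold_fraction * num_edges:
        return "pull"  # frontier is nearly full: throughput decides
    return "push"      # frontier is sparse: skipping work decides

# Hypothetical 10-vertex, 9-edge graph described by its out-degrees.
OUT_DEGREE = [1, 1, 2, 1, 2, 1, 0, 1, 0, 0]
NUM_EDGES = sum(OUT_DEGREE)
```

With these values, `choose_engine({2, 4}, OUT_DEGREE, NUM_EDGES, 0.5)` selects push because only four of nine edges are active, while a frontier containing every vertex selects pull.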
2.2 Graph Representation
Implementations of both push-based and pull-based engines are highly dependent upon the data structures used to represent vertex values, edge information (topology plus optional weight values), and the frontier. Since every vertex has a value, it is common to represent vertex values with an array indexed by vertex identifier [40, 51, 37, 16, 44, 49].
Edge information is often represented using a two-level data structure known as Compressed-Sparse. Each instance of this data structure can either represent out-edges (Compressed-Sparse-Row, or CSR) or in-edges (Compressed-Sparse-Column, or CSC). A hybrid graph processing framework would create one instance of each type for the push (CSR) and pull (CSC) engines. The top level in Compressed-Sparse, the vertex index, contains one element per vertex and indicates that vertex’s starting position within the bottom-level edge array. The latter contains one element per edge, each element identifying the vertex at the other end of the edge. Edge weights can be encoded directly into the edge array or placed into a parallel array.
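A minimal sketch of building either form from an edge list is shown below. It assumes unweighted edges and, as a common convention not stated in the text, appends one sentinel entry to the vertex index so that each vertex's range is `vertex_index[v]` to `vertex_index[v + 1]`.

```python
def build_compressed_sparse(num_vertices, edges, by="dst"):
    """Build CSC (by="dst", in-edges) or CSR (by="src", out-edges).

    edges is a list of (source, destination) pairs. Returns the vertex
    index (with a trailing sentinel) and the edge array.
    """
    key = 1 if by == "dst" else 0  # field that groups the edges
    other = 1 - key                # field stored in the edge array
    counts = [0] * num_vertices
    for edge in edges:
        counts[edge[key]] += 1
    vertex_index = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        vertex_index[v + 1] = vertex_index[v] + counts[v]
    edge_array = [0] * len(edges)
    cursor = vertex_index[:-1]     # next free slot per vertex (a copy)
    for edge in edges:
        edge_array[cursor[edge[key]]] = edge[other]
        cursor[edge[key]] += 1
    return vertex_index, edge_array

# Hypothetical example graph.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
CSR = build_compressed_sparse(4, EDGES, by="src")
CSC = build_compressed_sparse(4, EDGES, by="dst")
```

Here the CSR edge array holds destination identifiers grouped by source, and the CSC edge array holds source identifiers grouped by destination.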
Vector-Sparse, introduced with Grazelle, is a modified form of Compressed-Sparse with two key functional differences. First, edges in the edge array are packed into vectors of up to four edges per vector, all of which correspond to the same vertex in the vertex index. Second, each such vector encodes the identity of the corresponding vertex in the vertex index. As a result, it is possible to process the entire graph by streaming through the edge vector array without ever consulting the vertex index.
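The two properties can be illustrated with a simplified packing routine. This is not Grazelle's actual encoding, whose bit-level layout is not described here; each vector is modeled as a plain `(vertex, neighbors)` tuple, and the input arrays are hypothetical.

```python
VECTOR_WIDTH = 4  # up to four edges per vector, as described above

def build_vector_sparse(vertex_index, edge_array):
    """Pack each vertex's edges into vectors tagged with that vertex."""
    vectors = []
    for v in range(len(vertex_index) - 1):
        start, end = vertex_index[v], vertex_index[v + 1]
        for base in range(start, end, VECTOR_WIDTH):
            neighbors = edge_array[base:min(base + VECTOR_WIDTH, end)]
            vectors.append((v, neighbors))  # the vector carries its vertex id
    return vectors

def stream_edges(vectors):
    """Recover every (vertex, neighbor) pair without the vertex index."""
    return [(v, n) for v, neighbors in vectors for n in neighbors]

# Hypothetical Compressed-Sparse input: vertex 0 has 5 edges, vertex 1 has 1.
VERTEX_INDEX = [0, 5, 6, 6, 6, 6, 6]
EDGE_ARRAY = [1, 2, 3, 4, 5, 0]
VECTORS = build_vector_sparse(VERTEX_INDEX, EDGE_ARRAY)
```

Vertex 0 spills into a second, partially filled vector, and `stream_edges` reconstructs all six edges from the vector array alone.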
The frontier is a data structure that tracks active vertices. In existing work it is often implemented densely as a bit-mask with one bit per vertex [40, 51, 16, 44] or sparsely as a list of active vertices [40, 51]. Two instances of this data structure typically exist: one is consumed during an iteration of the application while the other is simultaneously being produced for the next iteration. Because a vertex is defined as being “active” if it has an updated value to propagate out-bound to its neighbors, the frontier optimization is fundamentally source-oriented.
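A dense frontier of this kind can be sketched as a byte-backed bit-mask. The class and method names below are hypothetical, and the two instances at the end illustrate the consume/produce pairing described above.

```python
class DenseFrontier:
    """Bit-mask frontier with one bit per vertex (illustrative sketch)."""

    def __init__(self, num_vertices):
        self.num_vertices = num_vertices
        self.bits = bytearray((num_vertices + 7) // 8)

    def insert(self, v):
        # Setting a bit twice is harmless, so duplicates vanish for free.
        self.bits[v >> 3] |= 1 << (v & 7)

    def contains(self, v):
        return bool(self.bits[v >> 3] & (1 << (v & 7)))

    def clear(self):
        for i in range(len(self.bits)):
            self.bits[i] = 0

    def to_sparse(self):
        """Equivalent sparse form: a sorted list of active vertices."""
        return [v for v in range(self.num_vertices) if self.contains(v)]

# One frontier is consumed while the other is produced, then they swap.
current, upcoming = DenseFrontier(10), DenseFrontier(10)
for vertex in (2, 4, 2):
    upcoming.insert(vertex)
current, upcoming = upcoming, current
upcoming.clear()
```

After the swap, `current` holds vertices 2 and 4 and `upcoming` is empty and ready for the next iteration.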
Wedge is a pull-only graph processing engine that distinguishes itself from all prior work by including an efficient pull-oriented version of the frontier optimization. This allows Wedge to overcome the trade-off between the push and pull patterns and to eliminate the need for a push engine altogether. Wedge’s frontier optimization design is application- and framework-independent. When integrated into a specific graph processing framework there is no impact on its programming model, and no changes are required to graph applications other than to remove the now-unnecessary push-based version.
We built the new pull-oriented frontier optimization based on the three key design requirements described in §3.1. We summarize the high-level operation of Wedge in §3.2 and present in more detail its two key components, the transformation step and the new frontier representation, in §3.3. Finally, in §3.4 we describe the techniques we use to make Wedge’s frontier optimization design practical.
3.1 Design Requirements
Sections 1 and 2 explained why a pull-based graph engine cannot exploit a traditional source-oriented vertex-based frontier. The key issue is the misalignment between the frontier’s source orientation and the pull engine’s destination orientation, which forces a pull engine to scan unconditionally through the entire graph. Given that the conventionally-built frontier optimization does not work for pull engines, our first goal is to establish the requirements for a version of the frontier optimization that does.
Requirement 1: Insertion into the frontier must be vertex-based and source-oriented. We begin by observing that the difficulty a pull engine faces does not lie in inserting vertices into the frontier. Both push-based and pull-based engines ultimately compute updated values for vertices, and both are equally capable of knowing which vertices they are updating at the time they produce these updates. Since the traditional source-oriented vertex-based method of insertion is not an issue, and since any other method of insertion would incur additional processing overhead in the pull engine, our first requirement is that this style of insertion be preserved.
Requirement 2: Traversal of the frontier must be destination-oriented. Our ultimate goal is to arrive at a frontier that can be traversed using a destination orientation. Figure 4 illustrates precisely what this means on an example graph. Suppose an application adds vertices 2 and 4 to the frontier, marking them active as source vertices. The active subset of the graph consists of the four highlighted edges in Figure 4(a), all of which are out-edges of vertices 2 and 4. The equivalent destination-oriented frontier, shown in Figure 4(b), contains the destination vertices of each edge in the active subset of the graph, namely vertices 4, 6, 8, and 9; the out-edges highlighted in Figure 4(a) are highlighted as in-edges in Figure 4(b). If we construct a destination-oriented frontier containing these four vertices, a pull engine would be properly aligned with it, could iterate efficiently over its member vertices, and in so doing would propagate messages along all of the edges in the active subset of the graph.
Requirement 3: Filtering out inactive edges within each destination vertex must be supported. While switching from a source vertex orientation to a destination vertex orientation solves the problem of misalignment between pull engine and frontier, doing this alone is insufficient. Highlighted in Figure 4(b) are seven edges in total, four of which (blue edges) comprise the active subset, and three of which (red edges) are extra edges that the pull engine would process unnecessarily. These useless edges are present because activating a vertex as a destination means processing all of its in-edges, while only some of them are part of the active subset. In Figure 4(b) only three additional edges are highlighted, representing an overhead of 75% relative to the four active edges. However, if vertex 6 hypothetically had an in-degree of 1 million, the overhead would quickly dominate.
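The overhead arithmetic can be reproduced on a small graph. The edge list below is hypothetical: it is constructed only to match the counts quoted above (four active edges, destinations 4, 6, 8, and 9, three extra in-edges) and is not the graph from the figure.

```python
# Hypothetical edge list consistent with the counts quoted in the text.
EDGES = [(2, 4), (2, 6), (4, 8), (4, 9), (1, 4), (5, 6), (7, 9), (0, 1), (3, 5)]

def destination_frontier_overhead(edges, source_frontier):
    """Cost of naively flipping a source frontier to destination vertices."""
    active = [(s, d) for s, d in edges if s in source_frontier]
    dest_frontier = {d for _, d in active}
    # Activating a destination means processing every one of its in-edges.
    processed = [(s, d) for s, d in edges if d in dest_frontier]
    extra = len(processed) - len(active)
    return sorted(dest_frontier), len(active), extra, extra / len(active)

SMALL = destination_frontier_overhead(EDGES, {2, 4})

# Give vertex 6 a thousand additional (inactive) in-edges, modeled here as
# repeated parallel edges from vertex 0 purely to inflate its in-degree.
HUB = destination_frontier_overhead(EDGES + [(0, 6)] * 1000, {2, 4})
```

The first call yields three extra edges for four active ones (an overhead of 0.75); the second shows the extra work growing with the in-degree of a single destination while the active subset stays at four edges.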
3.2 Wedge Overview
Figure 5 shows a top-level block diagram of Wedge, our proposed pull-only graph processing framework. Compared to a hybrid framework (Figure 3), the push engine has been replaced with a frontier transformation step that consumes the traditional source-oriented vertex-based frontier and produces the Wedge Frontier (§3.3), which contains the same information as the traditional frontier but is formatted such that a pull-based engine can traverse it efficiently. The pull engine continues to produce the traditional source-oriented vertex-based frontier as output, just as it did in the hybrid framework, meaning that this system design follows Requirement 1. The transformation step itself is application-independent, as it is simply converting from one representation format to another. All of the application-specific logic remains encapsulated within the pull engine, meaning that Wedge introduces no impact on programmability, and using it requires no code changes to applications other than to remove the now-obsolete push-based version.
3.3 Wedge Frontier
The primary difference between the Wedge Frontier and the traditional frontier is that the Wedge Frontier is edge-based rather than vertex-based. The traditional frontier identifies the active subset of the graph by explicitly identifying which vertices are members of the frontier such that the active subset implicitly consists of all of the in-edges or out-edges of those member vertices. Conversely, the Wedge Frontier directly identifies the edges that comprise the active subset. More concretely, whereas values in the traditional frontier are vertex identifiers, values in the Wedge Frontier identify edges by their positions within the edge topology data structure. The exact meaning of each value stored in the Wedge Frontier is therefore dependent on the data structure selected for representing edge topology. As many frameworks use Compressed-Sparse, Figure 6 shows the content of the Wedge Frontier for the example in Figure 4 assuming the use of CSC to represent in-edges. Each element in the Wedge Frontier identifies a position in both the vertex index and the edge array. A Vector-Sparse version (§2.2) would only need to store the latter.
By switching from a vertex basis to an edge basis and aligning the Wedge Frontier with the layout of the destination-oriented edge data structure we are able to follow both Requirements 2 and 3. Aligning itself with the destination-oriented edge data structure means that the Wedge Frontier is traversable using a destination orientation. Furthermore, the Wedge Frontier operates at the granularity of edges rather than vertices, meaning that it enables filtering out individual edges within each destination vertex. A pull-based graph processing engine is able to scan through the Wedge Frontier and limit message propagation to only those edges in the active subset of the graph, as these would be the only edges present in the Wedge Frontier.
Creation of the Wedge Frontier occurs by means of the transformation step shown in Figure 5, which consumes the traditional vertex-based source-oriented frontier as input. The process is conceptually very simple: for each source vertex present in the traditional frontier produced by the pull engine, insert values into the Wedge Frontier that capture that vertex’s out-edges. In a hybrid framework we know each vertex’s out-edges because they are encoded in an out-edge data structure like CSR. The frontier transformation step consumes a similar kind of data structure, called an edge index, except instead of encoding out-neighbors by vertex identifier it encodes out-edges by position within the in-edge data structure. In other words, it maps each vertex to the Wedge Frontier values that capture its out-edges. Therefore, we simply need to traverse this data structure and insert the values encoded within it into the Wedge Frontier.
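The transformation can be sketched on top of a CSC representation. Several simplifications here are our own assumptions: the graph is hypothetical, the edge index is a list of lists rather than a two-level array, and each Wedge Frontier value is only a position in the CSC edge array, with the destination recovered by binary search on the vertex index.

```python
from bisect import bisect_right

EDGES = [(2, 4), (2, 6), (4, 8), (4, 9), (1, 4), (5, 6), (7, 9), (0, 1), (3, 5)]
NUM_VERTICES = 10

def build_csc(num_vertices, edges):
    """In-edge structure: vertex index (with sentinel) and source array."""
    vertex_index = [0] * (num_vertices + 1)
    for _, dst in edges:
        vertex_index[dst + 1] += 1
    for v in range(num_vertices):
        vertex_index[v + 1] += vertex_index[v]
    cursor = vertex_index[:-1]
    edge_array = [0] * len(edges)
    for src, dst in edges:
        edge_array[cursor[dst]] = src
        cursor[dst] += 1
    return vertex_index, edge_array

def build_edge_index(num_vertices, csc_edge_array):
    """Map each source vertex to the CSC positions of its out-edges."""
    edge_index = [[] for _ in range(num_vertices)]
    for pos, src in enumerate(csc_edge_array):
        edge_index[src].append(pos)
    return edge_index

def transform(frontier, edge_index):
    """Convert a source-oriented vertex frontier into edge positions."""
    wedge = set()
    for src in frontier:
        wedge.update(edge_index[src])
    return sorted(wedge)

def active_edges(wedge, vertex_index, edge_array):
    """What a pull engine would visit when driven by the Wedge Frontier."""
    return [(edge_array[pos], bisect_right(vertex_index, pos) - 1)
            for pos in wedge]

VERTEX_INDEX, EDGE_ARRAY = build_csc(NUM_VERTICES, EDGES)
EDGE_INDEX = build_edge_index(NUM_VERTICES, EDGE_ARRAY)
WEDGE = transform({2, 4}, EDGE_INDEX)
```

For the frontier {2, 4}, the resulting positions decode to exactly the four out-edges of vertices 2 and 4, visited in destination order.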
Instantiating the edge index does not consume any additional memory over what a hybrid graph processing framework already consumes. Because the source-oriented edge data structure used in a hybrid graph processing framework is not necessary in a pull-only framework, we can simply repurpose that space for the edge index. In fact, doing so may actually decrease memory consumption because the source-oriented edge data structure might need to store edge weights, whereas the edge index does not.
Generating the Wedge Frontier represents a per-edge processing overhead: a sequential read to the edge index followed by a frontier insertion operation. An iteration of an application executed using the pull engine along with the Wedge Frontier will outperform that same iteration executed using a push engine as long as the cost of a sequential access plus a frontier insertion operation is low enough so as not to overcome the throughput difference between push and pull. A push engine performs a random-access atomic update operation to a large data structure (the vertex values) for each edge encountered, whereas the frontier transformation writes only to the frontier, which is much smaller and is therefore much more likely to result in cache hits when updated. For more complicated applications that update multiple vertex values, a push engine would be burdened with heavier synchronization operations, such as per-vertex locks, whereas the frontier transformation overhead is fixed irrespective of application complexity.
3.4 Frontier Transformation
Executing the frontier transformation step can be quite expensive, particularly when the resulting Wedge Frontier is almost full, so blindly executing it every iteration would result in significantly slower performance than can be achieved using a hybrid framework. The transformation executes in time and is multi-threaded. To make Wedge’s frontier optimization design practical, we introduce two key optimizations, which we describe in this subsection. Each optimization exposes a tuning parameter, and their respective impacts on performance are evaluated in §5.2.
Frontier Fullness Threshold: To reduce the time spent in the frontier transformation step, we borrow the key concept of a hybrid framework and execute it selectively. This is reflected in the frontier fullness decision shown in Figure 5. If the pull engine produces as output a frontier that is sufficiently full, the following iteration is executed on the entire graph without a frontier; otherwise the transformation step produces the Wedge Frontier and the pull engine consumes it as input. Whether a frontier is “sufficiently full” is determined by comparing the number of edges it contains to the frontier fullness threshold tuning parameter. Had we integrated the frontier transformation directly into the pull engine, we would have lost the ability to execute it selectively because metrics like frontier fullness cannot be evaluated until a processing iteration is fully completed. For the applications and graphs we tested (§5), the ideal threshold value ranged from 1% to 48%, which resulted in up to 30% of iterations running without the Wedge Frontier.
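The decision can be sketched as a small planning function. The names, the return labels, and the use of a greater-or-equal comparison are assumptions; the threshold is expressed as a fraction of the total edge count, matching the percentages quoted above.

```python
def plan_iteration(frontier, out_degree, num_edges, fullness_threshold):
    """Decide how the next pull iteration runs.

    fullness_threshold is a fraction of all edges (a tuning parameter).
    Either way the pull engine does the processing; only its input differs.
    """
    frontier_edges = sum(out_degree[v] for v in frontier)
    if frontier_edges >= fullness_threshold * num_edges:
        return "pull_entire_graph"         # skip the transformation
    return "transform_then_pull_frontier"  # build the Wedge Frontier first

# Hypothetical 10-vertex, 9-edge graph described by its out-degrees.
OUT_DEGREE = [1, 1, 2, 1, 2, 1, 0, 1, 0, 0]
NUM_EDGES = sum(OUT_DEGREE)
```

With the frontier {2, 4} covering four of nine edges, a threshold of 0.2 skips the transformation, whereas a threshold of 0.5 runs it.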
Frontier Precision: Graphs often contain many more edges than vertices [26, 27], so switching the frontier from a vertex basis to an edge basis can incur two overheads. First, an edge-based frontier will need to hold more values than a vertex-based frontier (one value per edge instead of one value per vertex), so the frontier data structure itself will consume more memory space. Second, inserting vertices into an edge-based frontier takes longer than doing so into a vertex-based frontier because there are more values to insert.
To counteract the effects of these overheads, we introduce the frontier precision tuning parameter, which can be adjusted to allow multiple edges to be represented using a single value in the Wedge Frontier. The highest possible precision, whereby each edge is uniquely represented in the Wedge Frontier per Figure 6, comes with the highest time and space overheads. Lowering the frontier precision means grouping contiguous edges together; doing so can reduce the time and space overheads associated with the Wedge Frontier but at the expense of inserting a small number of extra edges into the Wedge Frontier that themselves are not part of the active subset. For example, if we set the group size to 2 edges per group as depicted in Figure 7, we see that the Wedge Frontier continues to represent the active subset of the graph but also includes some additional edges that the pull engine will process unnecessarily. Edge group numbers identify edge groups by starting position in the edge topology data structure. The maximum number of unnecessary edges is bounded by the group size and cannot grow arbitrarily large, thus avoiding the 1-million in-degree problem we identified in §3.1. We can arbitrarily reduce the frontier precision without introducing correctness issues because existing pull-based application implementations typically already support execution without any frontier.
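The effect of lowering the precision can be sketched with two helper functions. The function names and the example edge positions are hypothetical; groups are identified by their starting position in the edge array, as described above.

```python
def to_edge_groups(edge_positions, group_size):
    """Collapse active edge positions into group starting positions."""
    return sorted({(pos // group_size) * group_size for pos in edge_positions})

def expand_groups(group_starts, group_size, num_edges):
    """Edge positions the pull engine processes for the given groups."""
    return [pos
            for start in group_starts
            for pos in range(start, min(start + group_size, num_edges))]

ACTIVE = [1, 4, 6, 7]  # hypothetical active edge positions, 9 edges in total
GROUPS = to_edge_groups(ACTIVE, 2)
PROCESSED = expand_groups(GROUPS, 2, 9)
EXTRA = [pos for pos in PROCESSED if pos not in ACTIVE]
```

With a group size of 2, four active edges collapse into three frontier values, at the cost of two extra edges (positions 0 and 5). Each group can contribute at most `group_size - 1` unnecessary edges, which is the bound referred to above.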
Wedge is designed to be agnostic to specific implementation characteristics. For the purposes of evaluation, it is prototyped in software and integrated into Grazelle, a state-of-the-art hybrid graph processing framework, resulting in a new pull-only version. The purpose of this section is to describe the details of our software implementation of Wedge and its integration into Grazelle. In particular, we address data structure selection, frontier transformation implementation, parallelization across multiple cores, and scaling to multiple processor sockets. Grazelle uses Vector-Sparse to represent edge topology, so Wedge operates at the granularity of an edge vector such that the size of an edge group is measured in terms of the number of edge vectors, rather than individual edges, it contains. Therefore, even at its highest precision, the Wedge Frontier does not uniquely identify each edge. Wedge’s frontier transformation step is implemented in approximately 140 LOC of C code, and modifications to Grazelle’s pull engine total less than 50 LOC of C code.
Data Structures: The Wedge Frontier is implemented densely as a bit-mask, with one bit per edge group. A sparse implementation is possible, following one of the main contributions in Ligra, but we leave this as an engineering task for future work. Grazelle also uses a dense bit-mask to represent its traditional frontiers, so mimicking this behavior in our implementation of Wedge facilitates a fairer comparison. Furthermore, using a bit-mask comes with benefits such as fast insertion and automatic elimination of duplicates. We experimented with implementing the Wedge Frontier using a hierarchical bit-mask data structure but observed no noticeable performance impact. Using a single-level bit-mask is possible because Vector-Sparse does not depend on the vertex index to identify both ends of each edge (§2.2). A Compressed-Sparse version would require one bit-mask for the vertex index and a second for the edge array.
The edge index follows the Compressed-Sparse-Row format. However, instead of destination vertex identifiers, the values in the second-level edge array identify bit positions in the Wedge Frontier that need to be set for each source vertex identified in the first-level vertex index. For each bit that is set in the traditional frontier, the frontier transformation step simply looks up that vertex in the first-level array of the edge index, iterates over its values in the second-level array of the edge index, and sets the corresponding bits in the Wedge Frontier.
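A sequential sketch of this lookup follows. It is written in Python for brevity rather than in C, uses one bit per edge (a group size of 1) instead of one bit per edge vector, and takes a hypothetical array of in-edge sources as input.

```python
def build_edge_index(num_vertices, in_edge_sources):
    """Two-level edge index in Compressed-Sparse-Row form.

    in_edge_sources[pos] is the source of the in-edge stored at position
    pos. The second level holds Wedge Frontier bit positions per source.
    """
    per_source = [[] for _ in range(num_vertices)]
    for pos, src in enumerate(in_edge_sources):
        per_source[src].append(pos)  # bit position == edge position here
    first_level = [0]
    second_level = []
    for positions in per_source:
        second_level.extend(positions)
        first_level.append(len(second_level))
    return first_level, second_level

def transform(frontier_bits, num_vertices, first_level, second_level, num_bits):
    """Set one Wedge Frontier bit per out-edge of every active vertex."""
    wedge = bytearray((num_bits + 7) // 8)
    for v in range(num_vertices):
        if frontier_bits[v >> 3] & (1 << (v & 7)):
            for bit in second_level[first_level[v]:first_level[v + 1]]:
                wedge[bit >> 3] |= 1 << (bit & 7)
    return wedge

# Hypothetical input: sources of 9 in-edges, laid out in destination order.
IN_EDGE_SOURCES = [0, 2, 1, 3, 2, 5, 4, 4, 7]
FIRST, SECOND = build_edge_index(10, IN_EDGE_SOURCES)
FRONTIER_BITS = bytearray([0b00010100, 0])  # vertices 2 and 4 are active
WEDGE_BITS = transform(FRONTIER_BITS, 10, FIRST, SECOND, 9)
```

The resulting bit-mask has bits 1, 4, 6, and 7 set, one for each out-edge of vertices 2 and 4.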
Parallelization: Pull engine parallelization across multiple cores is unchanged from the manner in which Grazelle implements it. We therefore refer interested readers to the original Grazelle publication for the details. Parallelization of the Wedge transformation is implemented by slicing the traditional frontier into equally-sized pieces and dynamically scheduling each piece as threads become available to process them. Pieces are sized statically; we consider neither the number of bits set to 1 within each piece nor the out-degrees of the vertices represented by each bit. This decision has load balance implications, which we evaluate in §5.3. We note, however, that load balance issues can be resolved using any known load balancing technique, such as work-stealing, and leave this engineering task for future work. In principle there is no need for synchronization between threads because all bit-setting operations are idempotent. In practice, however, every such operation needs to be atomic because the addressable data unit is 1 byte, which gives rise to false sharing of bits within the same byte.
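The slicing scheme can be sketched with a thread pool. This is an illustration only: the piece size, worker count, and input arrays are hypothetical, and a `threading.Lock` stands in for the atomic bit-set instruction that a native implementation would use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def parallel_transform(frontier_bits, first_level, second_level, num_bits,
                       piece_bytes=1, workers=4):
    """Transform fixed-size slices of the frontier on a pool of threads."""
    wedge = bytearray((num_bits + 7) // 8)
    lock = threading.Lock()  # stand-in for an atomic OR on one byte

    def process_piece(first_byte):
        last_byte = min(first_byte + piece_bytes, len(frontier_bits))
        for byte_no in range(first_byte, last_byte):
            byte = frontier_bits[byte_no]
            for bit in range(8):
                if byte & (1 << bit):
                    v = byte_no * 8 + bit
                    for pos in second_level[first_level[v]:first_level[v + 1]]:
                        # Two threads may target different bits of the same
                        # byte, so the read-modify-write must not interleave.
                        with lock:
                            wedge[pos >> 3] |= 1 << (pos & 7)

    # Pieces are equal in size; idle workers pick up the next pending piece.
    piece_starts = range(0, len(frontier_bits), piece_bytes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_piece, piece_starts))
    return wedge

# Hypothetical two-level edge index for a 10-vertex, 9-edge graph.
FIRST = [0, 1, 2, 4, 5, 7, 8, 8, 9, 9, 9]
SECOND = [0, 2, 1, 4, 3, 6, 7, 5, 8]
RESULT = parallel_transform(bytearray([0b00010100, 0]), FIRST, SECOND, 9)
```

Because pieces are sized without regard to how many bits are set or how large the corresponding out-degrees are, one piece may carry most of the work, which is the load balance concern noted above.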
Scaling to multiple processor sockets occurs through graph partitioning, which is left to the graph processing framework. Wedge assumes each socket has its own destination-oriented edge list, representing a partition of the graph, and locally generates a corresponding edge index for each one. The traditional source-oriented vertex-based frontier is globally shared across sockets. However, the Wedge Frontier is local, and one such data structure exists on each socket to correspond to the edge list partition for that same socket. The frontier transformation runs locally and in parallel on each socket, consuming the global traditional frontier and producing the local Wedge Frontier. The decision to run or skip frontier transformation is global, although in future work each socket could make this decision independently.
We use all available cores to perform the frontier transformation because, barring load balance issues, we found that increasing the number of cores continually increased performance (§5.3) and that even using all cores does not saturate the memory system. In-memory compression, a topic studied in existing work, can be applied to the edge index to reduce further the amount of memory bandwidth required and increase the potential performance to be gained by adding additional cores. Furthermore, because it is application-independent and not limited by the memory system, the transformation step could conceivably be implemented as an accelerator. If such an accelerator were built to share the processor’s memory system, it would need to run the transformation step only when the pull engine is not running because the pull engine can saturate the memory system, and any memory bandwidth interference would reduce its performance. Otherwise it can run in parallel with Grazelle as long as its operation could be terminated immediately and prematurely upon determination that production of the Wedge Frontier is unwarranted.
Server & Datasets: We evaluate Wedge using a four-socket server equipped with four Intel Xeon E7-8890 v3 processors (18 physical/36 logical cores and 45 MB LLC each) and a total of 1 TB DRAM running Ubuntu 14.04 LTS. Our experiments use the six input datasets listed in Table 1. All six are real-world datasets that span a wide variety of application areas and feature highly variable distributions of vertex degrees [11, 17, 19]. dimacs-usa and twitter-2010 are often featured in the evaluations of other graph processing frameworks [40, 51, 33, 37, 25]. dimacs-usa is unique in that it is a mesh network, having a relatively small and even distribution of edges to vertices. The others are scale-free graphs following a power-law degree distribution of varying skew level. The most extreme skew is found in uk-2007, which contains more vertices having in-degree over 100,000 than twitter-2010, the second-most skewed graph [6, 5, 16]. Our plots sometimes refer to datasets by their shown abbreviations.
Applications: We focus our evaluation on three graph processing applications, all of which differ in their interaction with the frontier optimization: Breadth-First Search (BFS), Connected Components (CC), and Single-Source Shortest Path (SSSP). In BFS, the frontier is initialized to contain just a single vertex (the root vertex of the traversal), and each vertex is inserted into the frontier for at most one iteration throughout the entire application. The frontier begins nearly empty, grows in size, and finally empties fully, at which point the application converges. Furthermore, because each vertex is inserted into the frontier at most once, the frontier changes completely from one iteration to the next and generally remains very sparse. CC is very much the opposite: the frontier is initialized to contain every vertex in the graph and gradually empties as the algorithm progresses. Vertices are often inserted into the frontier for multiple iterations. SSSP falls somewhere in the middle in that the frontier is initialized to contain a single vertex (the root vertex of the search), but the application behaves like CC in that there is no limit to the number of times a vertex may be inserted into the frontier. SSSP is also the only one of these applications that uses edge weights. Edge weights do not affect frontier behavior but can affect the balance of performance between push and pull by biasing the memory access pattern towards sequential. This is because edges are accessed sequentially in the edge list, and edge weights increase the amount of data that must be loaded per edge. Our implementations of CC and SSSP are respectively based on HCC (label propagation) and Bellman-Ford. With the exception of dimacs-usa, the graphs listed in Table 1 are unweighted, so for SSSP we generate weights using a multiplicative hash algorithm.
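The contrasting frontier behaviors of BFS and CC described above can be observed on a toy graph with the following sketch. It is written in a simple push style for brevity and is not how Wedge or Grazelle execute these applications; the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
def bfs_frontier_sizes(adj, root):
    """Frontier size per iteration for BFS; each vertex enters at most once."""
    visited = {root}
    frontier = [root]            # starts with only the root vertex
    sizes = []
    while frontier:
        sizes.append(len(frontier))
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return sizes


def cc_frontier_sizes(adj):
    """Frontier size per iteration for label-propagation CC (minimum label wins)."""
    label = {v: v for v in adj}
    frontier = set(adj)          # every vertex starts active
    sizes = []
    while frontier:
        sizes.append(len(frontier))
        new_label = dict(label)
        for u in frontier:
            for v in adj[u]:
                if label[u] < new_label[v]:
                    new_label[v] = label[u]
        # A vertex is active next iteration only if its label changed.
        frontier = {v for v in adj if new_label[v] != label[v]}
        label = new_label
    return sizes
```

On an undirected path of four vertices, `cc_frontier_sizes` starts with all four vertices active and shrinks by one each iteration, with the last vertex on the path re-entering the frontier every time, whereas BFS starts from a single vertex and never revisits one.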
| Abbr. | Dataset | Vertices | Edges |
| T | twitter-2010 [5, 6] | 41.7M | 1.47B |
| U | uk-2007 [5, 6] | 105.9M | 3.74B |
We limit our evaluation scope to just these three applications because including other applications would not contribute any additional insights. For example, PageRank and Collaborative Filtering are commonly evaluated in the literature [40, 37, 51, 13, 49, 44], but they do not exploit the frontier optimization and are therefore irrelevant to our analysis. Furthermore, other important applications are built on top of the fundamental applications we evaluate, meaning that anything learned from the studies we conduct carries over to those applications as well. For example, implementations of Betweenness Centrality are based on BFS [7, 32, 14].
Experiments: §5.1, §5.2, and §5.3 respectively provide in-depth analyses of performance characteristics, sensitivity to various tuning parameters, and scalability. Experiments in §5.1 and §5.2 are executed using only a single socket, whereas multi-socket scaling experiments in §5.3 use multiple. Finally, in §5.4, we present the overall performance of Wedge as compared to that of other state-of-the-art graph processing frameworks: Grazelle, Ligra, and GraphMat. Both Ligra and Grazelle are hybrid graph processing frameworks that use the traditional source-oriented vertex-based frontier implementation. Where they differ is that Ligra supports dynamically switching between sparse and dense representations of the frontier data structure, whereas Grazelle only supports a dense frontier representation but features much higher-throughput push-based and pull-based engines. As a result, Ligra performs especially well for applications in which the active subset of the graph is consistently very small, such as BFS, outperforming Grazelle in some cases. GraphMat is a push-only high-throughput graph processing framework built on a sparse matrix-vector multiplication back-end. Prior to the introduction of Grazelle, GraphMat was considered the best-performing framework available.
In all experiments except sensitivity tests (§5.2) we configure Wedge with experimentally-determined tuning parameters. For CC and SSSP we use a frontier precision of 1 bit per 4 edge vectors and a frontier fullness threshold of 20%. For BFS we use a frontier precision of 1 bit per 8 edge vectors and a frontier fullness threshold of 1%. We use a much higher frontier fullness threshold for uk-2007 because it is both extremely skewed (frontier size can grow very large) and has a high diameter (frontier size changes relatively slowly): 48% for CC and SSSP, and 12% for BFS.
5.1 Wedge Performance
Figure 8 provides an in-depth comparison of the execution time of Wedge and Grazelle. Because Wedge is built on top of Grazelle, this comparison highlights the performance impact of switching from push to pull and the overheads of Wedge’s frontier transformation step. Also shown is the division of time between the two key parts of the execution in each case: push and pull for Grazelle and pull and frontier transform for Wedge. Results are normalized to Grazelle’s total runtime, shown as 1.0 in each plot.
The throughput improvements obtained by switching from push to pull are observable by comparing the height of the “Wedge (Pull)” bars to 1.0. The improvement is generally bounded by the fraction of time Grazelle spends executing the push-based engine, which in turn is determined by the size of the active subset of the graph throughout the application. Since BFS consistently maintains a relatively small active subset, the biggest difference is observed for BFS, followed by SSSP, and finally CC. Most notably, Grazelle uses the push engine exclusively for BFS on cit-Patents, dimacs-usa, and uk-2007, so in these cases the throughput advantage of the pull pattern translates most fully into a performance improvement.
Frontier transformation overheads are reflected in the “Wedge (Frontier Transform)” results and account for a relatively small percentage of the overall execution time. Excessive time spent executing the Wedge transformation, marked by larger blocks on the plot (in particular BFS on uk-2007, which spends more time transforming the frontier than running the pull engine), is mostly attributable to issues of load balance between threads, which we discuss in more detail in §5.3.
An end-to-end performance comparison between Grazelle and Wedge can be made by comparing the overall bar heights between the two. Performance improvement is smallest, at approximately 1%, for CC executed on friendster and largest for BFS executed on dimacs-usa. In the case of CC executed on twitter-2010 the pull engine’s execution time with Wedge is slower than the overall execution time with Grazelle due to the added work that results from the imprecision of the Wedge Frontier.
It is clear from these results that Wedge matches or performs substantially better than Grazelle. Per Figure 1, a pull-only graph processing framework without Wedge’s pull-friendly frontier optimization would be orders of magnitude slower than a hybrid framework on all three of these applications. We therefore conclude that Wedge enables a pull-based engine to exploit the frontier optimization efficiently.
5.2 Tuning Parameters
Frontier Fullness Threshold: We evaluate the effect of frontier fullness threshold selection on Wedge performance by profiling the individual iterations of all three applications executed on each of the six graph datasets. Our per-iteration profiles capture execution time without a Wedge Frontier (pull only, operating on the entire graph), execution time with a Wedge Frontier (pull plus Wedge transformation), execution time with the conventional push pattern, and the size of the active subset (percentage of edges in the Wedge Frontier). The last of these quantities is computed by summing the out-degrees of all the vertices the pull engine adds to the source-oriented vertex-based frontier it produces as output. Normally the resulting value would be compared with the frontier fullness threshold to determine whether or not to produce a Wedge Frontier for the next iteration. If the value is less than the threshold, a Wedge Frontier is generated; otherwise, the pull engine runs without a Wedge Frontier.
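The decision rule described above amounts to a few lines of arithmetic, sketched below. The function and parameter names are hypothetical, and the threshold is expressed as a fraction of all edges (for example, 0.20 for the 20% setting); this illustrates the rule and is not code taken from Wedge.

```python
def should_build_wedge_frontier(active_vertices, out_degree, total_edges, threshold):
    """Return True when the active subset is sparse enough to justify the transform.

    active_vertices: vertices the pull engine added to its output frontier.
    out_degree: out_degree[v] is the number of out-edges of vertex v.
    total_edges: number of edges in the graph.
    threshold: frontier fullness threshold as a fraction of all edges.
    """
    # Size of the active subset, measured in edges rather than vertices.
    active_edges = sum(out_degree[v] for v in active_vertices)
    fullness = active_edges / total_edges
    # Strictly below the threshold: generate the Wedge Frontier.
    # Otherwise the pull engine runs over the entire graph.
    return fullness < threshold
```

Measuring fullness in edges rather than vertices matters on skewed graphs, where a handful of high-degree vertices can account for a large share of the work.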
Our plots, shown in Figure 9, are intended to highlight the effectiveness of generating the Wedge Frontier selectively rather than unconditionally. However, due to space limitations we cannot show every such combination and instead selected an assortment that highlights behavioral variety. On the left vertical axis we show actual per-iteration execution times in milliseconds of the pull engine with and without a frontier, in the former case including the time taken by the frontier transformation step. The goal in setting the frontier fullness threshold is to pick whichever mode produces a lower execution time for every iteration. On the right vertical axis we show the size of the active subset, expressed as percentage of edges. The horizontal axis shows iterations of the application progressing from left to right. They are unnumbered because the numbers themselves are unimportant in this analysis. Per-iteration results are also shown for Grazelle’s push engine to illustrate the benefit of Wedge over using the push pattern. Wedge generally outperforms the push engine when executed with a frontier, particularly at the tail ends of an application’s execution. Where it does not, differences are small enough and iteration times short enough as not to impact end-to-end performance noticeably.
Execution times of application iterations that use the Wedge Frontier scale with the size of the active subset because said size determines the amount of work in both the frontier transformation step and the accompanying pull engine iteration. This is unlike the execution times of iterations that do not use the Wedge Frontier, which are relatively constant across all iterations. Such iterations iterate over the entire graph and are therefore bound by pull engine throughput. BFS is an exception and sees steadily decreasing no-frontier iteration execution times. This is because vertices that have already been visited are skipped irrespective of the Wedge Frontier, and the number of visited vertices increases as the application progresses.
These results very clearly show a significant performance benefit to generating and consuming the Wedge Frontier when it is sufficiently empty. Iterations at the left of Figure 8(f), for example, are faster when run with the Wedge Frontier than without. Also apparent is the performance penalty associated with unconditionally generating the Wedge Frontier every iteration. When the active subset is large enough, using the Wedge Frontier can result in a per-iteration slowdown, per Figure 8(a). This per-iteration performance difference can accumulate to the point of becoming dominant, reducing overall application performance. End-to-end results (Figure 10) show that performance improves when the Wedge Frontier is generated only selectively.
Frontier Precision: Figure 11 shows performance sensitivity to frontier precision. To conserve space we show results for a subset of the datasets such that various behaviors are captured. Edge group size (edge vectors per Wedge Frontier bit) varies from 1 to 16. Results are shown separately for the pull engine and the Wedge transformation step and are normalized to the overall execution time of the highest-precision configuration (1 bit per vector). Normalized execution time of Grazelle is overlaid to provide context.
In general we expect that a reduced frontier precision results in an increased pull engine time and a decreased frontier transformation time, trends we observe in many of the shown results. The former is due to the increased number of unnecessary edges with which the pull engine is burdened: a single bit adds more edges to the active subset, which in turn increases the amount of potentially-unnecessary processing the pull engine must do. The latter is due to a reduction in the size of the Wedge Frontier data structure, which in turn produces two complementary effects. First, a smaller Wedge Frontier improves caching efficacy. Second, a greater number of edge vectors per bit reduces the number of bits that need to be set to capture the active subset, which reduces the number of memory write operations required.
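The trade-off described above can be made concrete with a small counting sketch. It assumes a hypothetical model in which the edge list is a sequence of numbered edge vectors and each Wedge Frontier bit covers `group_size` consecutive vectors; the function name and return layout are invented for illustration.

```python
def coarsen(active_vectors, num_vectors, group_size):
    """Map active edge vectors onto a frontier with `group_size` vectors per bit.

    Returns (bits_in_frontier, bits_set, vectors_processed): the size of the
    Wedge Frontier in bits, the number of distinct bits the transformation
    must set, and the number of edge vectors the pull engine then visits.
    """
    bits_in_frontier = -(-num_vectors // group_size)  # ceiling division
    set_bits = {v // group_size for v in active_vectors}
    # Every vector covered by a set bit is processed, needed or not.
    # The final group may be shorter than group_size.
    vectors_processed = sum(
        min(group_size, num_vectors - b * group_size) for b in set_bits
    )
    return bits_in_frontier, len(set_bits), vectors_processed
```

For five active vectors at positions 0 to 3 and 40 in a 64-vector edge list, moving from 1 to 4 to 16 vectors per bit shrinks the frontier from 64 to 16 to 4 bits and the bits set from 5 to 2, while the vectors visited by the pull engine grow from 5 to 8 to 32. When the active vectors are contiguous and aligned to a group, coarsening adds no extra work at all.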
Notable exceptions to these general trends are dimacs-usa and uk-2007, for which pull engine performance improves as frontier precision is reduced. This behavior can occur when multiple contiguous Wedge Frontier bits are set in a higher-precision configuration such that reducing the precision does not introduce wasted work but rather reduces frontier-checking overheads. For CC and SSSP, there is a second effect at play. dimacs-usa and to a lesser extent uk-2007 are characterized by having a relatively higher diameter than the other graph datasets in our evaluation. This translates to the structures of these graphs having long chains of vertices along which messages must flow. In a perfectly-precise frontier a message would propagate one hop per iteration. With a less-precise frontier edges that would ordinarily not be processed until later iterations are processed earlier, leading to messages propagating multiple hops per iteration and significantly reducing the number of iterations required to reach convergence. These effects have limits, evidenced by the slowdowns when transitioning from 8 to 16 edge vectors per bit for SSSP running on dimacs-usa.
Wedge uses a frontier precision of 4 or 8 edge vectors per bit depending on the application, the former for CC and SSSP and the latter for BFS. This is generally a beneficial decision, resulting in a speedup in most cases and, at worst, a slowdown of 5%.
5.3 Scalability
Performance scaling results within a single socket are shown in Figure 12. To conserve space we selected three representative cases per application, with the goal of showcasing differing behaviors. Pull engine performance generally scales well with core count, limited primarily by saturation of the memory system. This bottleneck, particularly visible in Figures 11(a), 11(g), and 11(i), can be overcome by introducing locality optimizations into the pull engine.
Performance of our implementation of the frontier transformation generally scales with increasing core count, but in many cases very slowly, limited by cores rather than by the memory system. Scalability plateaus are primarily due to load imbalance between threads. To quantify this we measured the time spent waiting at synchronization barriers for each thread and aggregated the results into Figure 13, which shows the average per-thread division of time between doing useful work (either in the pull engine or in the frontier transformation step) and being idle. While the pull engine is effectively load-balanced, we observe that cases of limited scaling in Figure 12 are associated with up to 89% of time wasted due to load imbalance. This means that fixing our implementation, which can be done using any known software load balancing technique such as work-stealing, could substantially reduce the time spent in the frontier transformation step.
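As a rough guide to what the idle-time measurement above implies, the following sketch computes an idealized upper bound on the speedup available from perfect load balancing. It assumes that all idle time could be eliminated and that rebalancing adds no overhead, so it is an optimistic bound derived from the 89% figure and not a measured result; the function name is hypothetical.

```python
def ideal_balanced_speedup(idle_fraction):
    """Upper bound on speedup if all idle time were removed by perfect balancing.

    idle_fraction: average fraction of per-thread time spent waiting at
        barriers (0.89 for the worst case reported above).
    """
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    # Useful work occupies (1 - idle_fraction) of the elapsed time, so a
    # perfectly balanced run would take only that fraction of the time.
    return 1.0 / (1.0 - idle_fraction)
```

With 89% of time idle, the useful work occupies 11% of the elapsed time, giving a bound of 1 / 0.11, or roughly 9 times faster.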
Multi-socket performance scaling results are shown in Figure 14. Each data point is shown with respect to the corresponding single-socket result, and all results were obtained by fully utilizing the cores in each socket. As a result, the theoretical maximum speedup is 2× for two-socket results and 4× for four-socket results. Scalability of the pull engine is highly dependent on the proportion of node-local versus node-remote memory accesses, which in turn depends on the quality of the partitioning of the graph across sockets. As Grazelle’s partitioning scheme is very simple, it is not surprising that scaling results vary substantially between input datasets. Nevertheless, most results for CC and SSSP indicate good scaling with all four sockets active. BFS is much more difficult to scale effectively because the active subset remains relatively small and is completely different each iteration. The active subset is so small for dimacs-usa that many iterations slow down because the only active edges result in node-remote memory accesses.
Frontier transformation scaling characteristics differ from those of the pull engine. Its workload is entirely dependent on the distribution of edges within the active subset, which affects load balance between sockets and is extremely difficult to predict. Thus, even the best possible partitioning of the graph may be suboptimal for the transformation step, and in some cases the opposite is true. Furthermore, node-remote memory accesses are guaranteed because the source-oriented vertex-based frontier produced by the pull engine must be consumed by all sockets running the frontier transformation step. Nevertheless, for CC and SSSP it generally scales at least marginally with increased socket count. As with the pull engine, BFS is difficult to scale and in some cases slows down. Overall scaling results are biased towards the scalability of the pull engine because the frontier transformation step does not dominate overall execution time (Figure 8).
5.4 Overall Comparison
We compare the end-to-end performance of Wedge with that of Grazelle version 1.0.1 configured both in hybrid mode and pull-only mode, Ligra version 1.5, and GraphMat version 1.0. All frameworks were compiled with gcc version 5.5.0 and linked with the external libraries recommended by the framework authors (Intel Cilk Plus for Ligra, OpenMP for GraphMat). Ligra allows the user to specify whether to compile it to use 32-bit or 64-bit integers internally. Both Wedge and Grazelle use 64-bit integers to avoid artificially limiting graph size, so we compiled Ligra to do the same. GraphMat only supports signed 32-bit integers internally and therefore cannot process uk-2007 using any application because the number of edges exceeds their representational capability.
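The 32-bit limitation noted above follows from simple arithmetic, checked in the short sketch below. The edge counts are the rounded values listed in Table 1; the helper name is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(count):
    """Return True if `count` is representable as a signed 32-bit integer."""
    return count <= INT32_MAX


# Rounded edge counts from Table 1.
edges = {"twitter-2010": 1_470_000_000, "uk-2007": 3_740_000_000}
representable = {name: fits_in_int32(n) for name, n in edges.items()}
```

The signed 32-bit maximum is 2,147,483,647, so the roughly 1.47 billion edges of twitter-2010 fit while the roughly 3.74 billion edges of uk-2007 do not.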
Results are shown as per-framework heat maps in Figure 15 such that each number displays the slowdown compared to the fastest-observed time for that particular case. Results were generally captured using all four sockets, but in some indicated cases it was faster to run with fewer sockets. All cores in each socket are fully utilized.
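The normalization used for such heat maps can be sketched as follows. The nested-dictionary layout and function name are assumptions made for the example, and any timings passed in are the caller's own; none are taken from Figure 15.

```python
def slowdown_table(times):
    """Normalize execution times to the fastest framework for each case.

    times: times[framework][case] -> execution time. A framework may omit a
        case it cannot run.
    Returns the same nested layout with each entry replaced by its slowdown
    relative to the fastest observed time for that case (1.0 means fastest).
    """
    cases = {case for per_fw in times.values() for case in per_fw}
    best = {
        case: min(per_fw[case] for per_fw in times.values() if case in per_fw)
        for case in cases
    }
    return {
        fw: {case: t / best[case] for case, t in per_fw.items()}
        for fw, per_fw in times.items()
    }
```

A value of 1.0 therefore marks the fastest framework for a given application and dataset, and a framework that cannot run a case simply has no entry for it.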
Wedge consistently outperforms Ligra on BFS, CC, and SSSP. It also generally outperforms Grazelle, though by a smaller margin because both Grazelle and Wedge share pull engine implementations. Its ability to outperform both demonstrates the effectiveness of Wedge at enabling a pull engine to execute a frontier-based application efficiently. Part of the benefit of using Wedge over Ligra comes from its higher-throughput pull engine, but the remainder (and its entire benefit over Grazelle) is a result of replacing the time spent in the push engine with a smaller amount of time spent in the pull engine. Because the active subset size with BFS is typically very small, Ligra’s sparse frontier optimization is particularly effective, enabling it to outperform Grazelle in some cases. However, using exclusively a pull engine with the Wedge Frontier is enough to more than overcome this performance gap. Where Grazelle outperforms Wedge, it does so because the frontier transformation step scales less effectively with multiple sockets than Grazelle’s push engine. Single-socket results other than CC with twitter-2010 favor Wedge.
GraphMat is uncompetitive with any of the other frameworks tested. Despite being push-only and supporting the frontier optimization, its implementation of the frontier is suboptimal. Rather than updating the frontier continuously as an iteration is executed, GraphMat compares old vertex values with new upon iteration completion and uses the results to create the frontier in a separate pass.
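The difference between the two ways of building the frontier described above can be illustrated with the following simplified sketch. It is not GraphMat's code or any other framework's; the function names and the representation of updates as `(vertex, new_value)` pairs are assumptions made for the example.

```python
def frontier_continuous(old, updates):
    """Mark a vertex active at the moment its value changes (single pass)."""
    values = list(old)
    frontier = [False] * len(old)
    for v, new_value in updates:
        if new_value != values[v]:
            values[v] = new_value
            frontier[v] = True
    return values, frontier


def frontier_separate_pass(old, updates):
    """Apply all updates first, then compare old and new values in a second pass."""
    values = list(old)
    for v, new_value in updates:
        values[v] = new_value
    # Extra scan over every vertex, regardless of how few were updated.
    frontier = [values[v] != old[v] for v in range(len(old))]
    return values, frontier
```

Both functions produce the same frontier when each vertex is updated at most once, but the second always pays for a scan over all vertices, which becomes costly when only a small fraction of them change in an iteration.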
Arguably the most striking insight, and the key takeaway of this analysis, is obtained by comparing the performance of Grazelle running in pull-only mode (shown as “Grazelle (Pull)”) with that of Wedge, essentially the “before Wedge and after Wedge” view of a pull engine. Given that Grazelle’s pull-only mode can be orders of magnitude slower than all other frameworks except GraphMat but Wedge’s performance is almost always the best among the frameworks tested, we can conclude that Wedge has achieved our goal of rebuilding the frontier optimization such that a pull engine is able to exploit it efficiently.
6 Related Work
Wedge is an entirely new approach that improves both the performance and utility of pull-based graph processing engines. Existing work, whether it targets software, custom hardware, GPUs, or specialized accelerators, has predominantly focused on improving push-based engines [37, 49, 53, 47, 18, 12, 36, 8]. This is likely due to the conventional wisdom that pull-based engines cannot effectively exploit the frontier optimization, an idea that has persisted for years and represents the state of the art. The pull pattern began as an optimization for the Breadth-First Search application specifically [1, 2] but has since been generalized to other applications through the introduction of hybrid graph processing frameworks [16, 40, 51]. Garaph is the closest existing work to Wedge; its “notify-pull” approach proposes using a vertex-centric destination-oriented frontier that does not address the superfluous edge problem.
Other areas of work in the graph processing community have improved graph processing performance through optimizations that target work scheduling across cores [49, 33, 37], graph partitioning across sockets [49, 51], and optimizing synchronization and communication [49, 51, 33, 16], which represent typical concerns for any parallel program [24, 38, 48]. More recent work has attempted to take greater advantage of the underlying processor’s hardware features, such as by improving caching effectiveness [43, 28, 53, 3], reducing memory traffic through data structure compression, and optimizing data structures for SIMD vectorization. All of these strategies are orthogonal to Wedge and can generally be leveraged to improve the performance of both the frontier transformation step and the pull engine itself.
Domain-specific languages such as GraphIt combine many of these optimizations together into a single compiler. They attempt to address the problem of requiring both push-based and pull-based versions of application code by generating both automatically from a single algorithm description. However, even the state of the art does not eliminate the need for multiple implementations: a second scheduling program is required to assist the compiler in selecting appropriate optimizations.
7 Conclusion
Conventional wisdom states that a pull-based graph processing engine, despite having significantly higher throughput than a push-based engine, is fundamentally incapable of exploiting the frontier optimization. Our key contribution in this work, Wedge, defies this wisdom by rebuilding the frontier optimization so that it is suited for the pull pattern.
Wedge showcases a fundamentally new approach to addressing the trade-off between push and pull patterns in graph processing. Prior work employed hybrid graph processing frameworks that support both patterns and dynamically switch between them. This approach comes with two key disadvantages. First, iterations that use push suffer from reduced performance. Second, applications must be implemented multiple times (once each for push and pull). Instead of continuing to juggle between both patterns, Wedge directly targets a pull engine’s frontier-related deficiencies.
Wedge introduces the Wedge Frontier, a representation of the frontier data structure that can efficiently be consumed by a pull engine, and a transformation step for generating it. Because the transformation process can be expensive even when parallelized, we proposed two optimizations to make it practical. First, we generate the Wedge Frontier only when it is sufficiently sparse. Second, we coarsen the granularity of the representation to reduce both its size and the number of operations needed to generate it.
Wedge is implemented in software on top of Grazelle, a state-of-the-art hybrid graph processing framework, resulting in a new pull-only version. It can outperform Grazelle, Ligra, and GraphMat, and each of Wedge’s two key optimizations improves its performance.
References
-  S. Beamer, K. Asanović, and D. A. Patterson. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for Graph500. Technical report, EECS Department, University of California, Berkeley, November 2011.
-  S. Beamer, K. Asanović, and D. A. Patterson. Direction-optimizing breadth-first search. In SC ’12, pages 1–10. IEEE Computer Society, 2012.
-  S. Beamer, K. Asanović, and D. A. Patterson. Locality exists in graph processing: Workload characterization on an Ivy Bridge server. In IISWC ’15, pages 56–65. IEEE, 2015.
-  M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. To push or to pull: On reducing communication and synchronization in graph computations. In HDPC ’17, pages 93–104. ACM, 2017.
-  P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW ’11, pages 587–596. ACM, 2011.
-  P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW ’04, pages 595–601. ACM, 2004.
-  U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
-  L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. Efficient and simplified parallel graph processing over CPU and MIC. In IPDPS ’15, pages 819–828. IEEE, 2015.
-  R. Chen, J. Shi, Y. Chen, and H. Chen. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In EuroSys ’15, pages 1:1–1:15. ACM, 2015.
-  C. Demetrescu. 9th DIMACS Implementation Challenge. http://www.dis.uniroma1.it/challenge9/download.shtml, 2010.
-  B. Elser and A. Montresor. An evaluation study of bigdata frameworks for graph processing. In BigData ’13, pages 60–67. IEEE, 2013.
-  J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI ’12, pages 17–30. USENIX, 2012.
-  J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI ’14, pages 599–613. USENIX, 2014.
-  O. Green, R. McColl, and D. A. Bader. A fast algorithm for streaming betweenness centrality. In PASSAT ’12, pages 11–20. IEEE, 2012.
-  S. Grossman, H. Litz, and C. Kozyrakis. Grazelle. https://github.com/stanford-mast/Grazelle-PPoPP18/archive/v1.0.1.tar.gz, 2018.
-  S. Grossman, H. Litz, and C. Kozyrakis. Making pull-based graph processing performant. In PPoPP ’18, pages 246–260. ACM, 2018.
-  Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How well do graph-processing platforms perform? an empirical performance evaluation and analysis. In IPDPS ’14, pages 395–404. IEEE, 2014.
-  T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In MICRO ’16, pages 1–13. IEEE, 2016.
-  M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endowment, 7(12):1047–1058, August 2014.
-  Intel. CilkPlus. https://www.cilkplus.org/, 2014.
-  Intel. Intel Xeon Processor E7-8890 v3. https://ark.intel.com/products/84685, 2015.
-  U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system implementation and observations. In ICDM ’09, pages 229–238. IEEE, 2009.
-  Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A system for dynamic load balancing in large-scale graph processing. In EuroSys ’13, pages 169–182. ACM, 2013.
-  A. Krishnamurthy and K. Yelick. Optimizing parallel programs with explicit synchronization. In PLDI ’95, pages 196–204. ACM, 1995.
-  A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI ’12, pages 31–46. USENIX, 2012.
-  Laboratory for Web Algorithmics. Datasets. http://law.di.unimi.it/datasets.php, 2012.
-  J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
-  L. Li, R. Geda, A. B. Hayes, Y. Chen, P. Chaudhari, E. Z. Zhang, and M. Szegedy. A simple yet effective balanced edge partition model for parallel computing. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(1):14:1–14:21, June 2017.
-  Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endowment, 5(8):716–727, April 2012.
-  L. Ma, Z. Yang, H. Chen, J. Xue, and Y. Dai. Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication. In ATC ’17, pages 195–207. USENIX, 2017.
-  G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD ’10, pages 135–146. ACM, 2010.
-  A. McLaughlin and D. A. Bader. Scalable and high performance betweenness centrality on the GPU. In SC ’14, pages 572–583. IEEE, 2014.
-  D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP ’13, pages 456–471. ACM, 2013.
-  OpenMP ARB. OpenMP. http://www.openmp.org/, 2016.
-  V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In ATC ’12, pages 41–52. USENIX, 2012.
-  A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In SOSP ’15, pages 410–424. ACM, 2015.
-  A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In SOSP ’13, pages 472–488. ACM, 2013.
-  L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In SPAA ’91, pages 237–245. ACM, 1991.
-  S. Salihoglu and J. Widom. GPS: A graph processing system. In SSDBM ’13, pages 22:1–22:12. ACM, 2013.
-  J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In PPoPP ’13, pages 135–146. ACM, 2013.
-  J. Shun and G. E. Blelloch. Ligra. https://github.com/jshun/ligra/archive/v.1.5.tar.gz, 2016.
-  J. Shun, L. Dhulipala, and G. E. Blelloch. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In DCC ’15, pages 403–412. IEEE, 2015.
-  J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. Accelerating graph analytics by utilising the memory locality of graph partitioning. In ICPP ’17, pages 181–190. IEEE, 2017.
-  N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. GraphMat: High performance graph analytics made productive. Proc. VLDB Endowment, 8(11):1214–1225, July 2015.
-  N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. GraphMat. https://github.com/narayanan2004/GraphMat/archive/v1.0-single-node.tar.gz, 2016.
-  L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
-  Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: A high-performance graph processing library on the GPU. In PPoPP ’15, pages 265–266. ACM, 2015.
-  M. H. Willebeek-LeMair and A. P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(9):979–993, September 1993.
-  M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, and L. Zhou. GraM: Scaling graph computation to the trillions. In SoCC ’15, pages 408–421. ACM, 2015.
-  C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP ’15, pages 194–204. ACM, 2015.
-  K. Zhang, R. Chen, and H. Chen. NUMA-aware graph-structured analytics. In PPoPP ’15, pages 183–193. ACM, 2015.
-  M. Zhang, Y. Wu, K. Chen, X. Qian, X. Li, and W. Zheng. Exploring the hidden dimension in graph processing. In OSDI ’16, pages 285–300. USENIX, 2016.
-  Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. Making caches work for graph analytics. In BigData ’17, pages 293–302. IEEE, 2017.
-  Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe. GraphIt: A high-performance DSL for graph analytics. arXiv, 2018.
-  D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In FAST ’15, pages 45–58. USENIX, 2015.
-  X. Zhu, W. Han, and W. Chen. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In ATC ’15, pages 375–386. USENIX, 2015.