I Introduction
Large graphs are becoming ubiquitous due to the rise of large connected real-life networks, including social, transportation, web and gene-expression networks, which can be abstracted and modeled as graphs. There is considerable interest in accelerating graph processing algorithms and applications, including breadth-first search (BFS), single-source shortest path (SSSP), minimum spanning tree (MST) and betweenness centrality, on GPUs [1, 2, 3, 4, 5, 6, 7], due both to the large processing needs of these applications and to benchmark efforts like Graph500 [8].
Traditionally, graph applications are considered difficult to analyze and parallelize. This difficulty stems from the unpredictable data-access and control-flow patterns, termed irregularity, which are inherent in graph applications. It is quite challenging to statically predict the access pattern without any knowledge about the input graph. Therefore, most of the effective techniques for the analysis and parallelization of graph algorithms are dynamic in nature.
One of the key challenges while dealing with irregular graph algorithms is maintaining load balance across GPU threads. Depending upon the graph, the task distribution can get skewed. As an example, many of the GPU-based implementations [1, 9, 2, 3, 4] employ a node-based task-distribution, i.e., they assign a thread to a set of graph nodes. Such a node-based distribution works well with the compressed sparse-row (CSR) storage format often used to store graphs compactly (CSR represents a graph in adjacency-list format, but the adjacencies of all nodes are concatenated to form a monolithic adjacency list of size equal to the number of edges; each node's adjacency list then starts at some offset in this monolithic list). However, this node-based distribution is unsuitable for graphs with wide variance in node degrees, such as social networks. In many graph applications, the activity at each node involves propagating some information to its neighbors or collecting it from them; the work done is therefore proportional to the degree of the node. Thus, node-based task-distribution can lead to high load imbalance for skewed-degree graphs.
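As an illustration, the CSR layout described above can be sketched in host-side Python (the node and edge data are made up for illustration; the paper's actual implementations are in CUDA):

```python
# Build a CSR representation (offsets + one concatenated adjacency array)
# from an edge list.
def build_csr(num_nodes, edges):
    """edges: list of (src, dst) tuples."""
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degree of each node
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]   # prefix sum -> start of each adjacency list
    adj = [0] * len(edges)
    cursor = offsets[:]                # next free slot per node
    for src, dst in edges:
        adj[cursor[src]] = dst
        cursor[src] += 1
    return offsets, adj

# Node 0 has out-degree 3, node 1 has out-degree 1, node 2 has out-degree 0.
offsets, adj = build_csr(3, [(0, 1), (0, 2), (1, 2), (0, 0)])
# Node i's neighbours are adj[offsets[i]:offsets[i + 1]].
```

Under node-based distribution, the thread owning node `i` scans exactly the slice `adj[offsets[i]:offsets[i + 1]]`, which is why its work grows with the node's degree.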
To address the load-balancing problem, previous research has explored edge-based task-distribution [3]. In this method, a thread is assigned a set of edges (instead of nodes as in the node-based method above). This leads to near-perfect load balancing across threads, since each thread processes (almost) the same number of edges. However, there are a couple of issues with edge-based processing. First, it may not always be feasible to convert a node-based processing algorithm into an edge-based one. Theoretically, such a conversion requires the node activity to be distributive, which need not necessarily hold. Second, it poses restrictions on the graph format: to assign an edge to a thread, the graph should either be in a coordinate-list (COO) format or be converted to such a format (COO represents a graph as a sequence of edges, each edge stored as a tuple of its source and destination nodes, plus an optional weight). The former consumes more memory while the latter has conversion overheads. The memory requirement is a key factor for GPUs, as they continue to have small memories (up to 12 GB). In fact, in our experiments with large graphs, we found that the edge-based methods which rely on the COO format cannot be executed due to insufficient memory. These restrictions make edge-based task-distribution unsuitable as a general solution.
Despite the aforementioned issues, node-based and edge-based task-distribution methods are attractive because of their simplicity, which stems from the one-time assignment of graph elements to threads. Thus, existing methods are static in nature. Unfortunately, the characteristics of irregular graph algorithms often change dynamically: the processing workload at different parts of the graph varies as the algorithm progresses. Therefore, static load-balancing techniques are often inadequate, and we need dynamic load-balancing mechanisms while dealing with graph algorithms.
In this paper, we propose three techniques to address the issues with the existing methods. The first, workload decomposition, combines node-based and edge-based methods. It assigns a set of edges to a thread, similar to edge-based task-distribution; however, the considered edges belong only to the set of active nodes (i.e., nodes in the worklist where there is work to be done), resembling a node-based task-distribution. The second method, which we call node splitting, avoids load imbalance by changing the underlying graph structure. As the name suggests, it splits a high-degree node into several small-degree nodes, thereby reducing the load imbalance. The third method uses hierarchical processing and employs a hierarchy of worklists. The imbalance in the task distribution from the main (super) worklist is handled by creating several sub-worklists distributed evenly across threads and changing the number of threads in proportion to the size of the sub-worklist.
The following are the primary contributions of our paper.

We identify the limitations of the currently prevalent node-based and edge-based approaches to load-balancing graph algorithms on GPUs.

We propose three methods to address the above limitations: workload decomposition, node splitting and hierarchical processing. We discuss the various trade-offs associated with these methods; a method needs to be chosen depending upon the application requirements and the nature of the input graphs. We argue that a single load-balancing strategy is unlikely to be suitable in all scenarios.

We perform a comprehensive evaluation of the existing and the proposed methods using a range of real-world and synthetic graphs and two graph algorithms: breadth-first search (BFS) and single-source shortest paths (SSSP) computation. We illustrate the utility of each method and quantitatively show the trade-offs involved.
The rest of the paper is organized as follows. In Section II we explain the existing node-based and edge-based task-distributions and discuss their advantages and limitations. In Section III we propose our load-balancing strategies and discuss the trade-offs involved. In Section IV we illustrate the effectiveness of our proposed techniques using two graph algorithms. In Section V we compare and contrast our work with other proposed methods dealing with graph applications on GPUs. Finally, in Section VI we conclude and mention our plans for future work.
II Motivation
In this section we motivate the need for dynamic load-balancing strategies. Towards this goal, we first explain the existing static approaches, namely node-based task-distribution and edge-based task-distribution, along with their advantages and disadvantages.
II-A Node-based Task-distribution
In node-based task-distribution, the unit of processing is a node. Thus, when a GPU kernel is invoked, the number of threads in the launch configuration is proportional to the number of graph nodes. In an extreme case, one may assign each node to a different thread. For a node-based distribution it is natural to represent the graph in the CSR format; such a representation is also space-efficient. Each thread then operates on its assigned set of nodes, and may propagate computed information along the set of edges incident on each node. Therefore, the amount of work done by a thread is proportional to the degree of the node on which it operates. Since this degree is unknown statically and varies across inputs, node-based task-distribution (and, in general, any static load-balancing technique) incurs a high load imbalance if the variance in the node degrees is large (e.g., in social networks, wherein degrees follow a power-law distribution).
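The resulting imbalance can be quantified with a small sketch. The degree lists and the `node_based_imbalance` helper below are illustrative, not from the paper:

```python
# Under node-based distribution with one thread per node, each thread's
# work is its node's out-degree; a simple imbalance metric is the ratio
# of the maximum per-thread work to the mean per-thread work.
def node_based_imbalance(out_degrees):
    mean = sum(out_degrees) / len(out_degrees)
    return max(out_degrees) / mean

road_like = [1, 2, 3, 2, 4, 2, 3, 1]    # low variance (road-network-like)
power_law = [1, 1, 2, 1, 255, 2, 1, 1]  # one hub node (web/social-graph-like)
# The skewed distribution yields a far larger imbalance factor, so the
# thread owning the hub dominates the kernel's runtime while the rest wait.
```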
Figures 1(b) and 1(a) show the out-degree distributions of the USA road network and the Stanford web graph, respectively. We find that the Stanford graph has a relatively larger variation in the node out-degrees (USA: minimum=1, maximum=9, average=2.4; Stanford: minimum=1, maximum=255, average=8.2). We observe similarly skewed degree distributions across other social networks as well (flickr, citations, twitter, etc.). For such graphs, if a thread is assigned to a node, the threads operating on high out-degree nodes would dominate the computation time. This causes the threads operating on low out-degree nodes to incur a long wait, resulting in inefficient GPU resource utilization. In summary, node-based task-distribution, which is used in several recent works [1, 9, 2, 3, 4], may deliver poor performance on high-skew graphs (it should be noted that load balancing was not their primary goal).
II-B Edge-based Task-distribution
In order to improve load balance, prior research has proposed edge-based task-distribution for graph algorithms on GPUs [3]. In this approach, threads operate on the graph edges; that is, the number of threads in the kernel launch configuration is proportional to the number of graph edges. Edge-based task-distribution provides near-perfect load balance for propagation algorithms (such as BFS, SSSP, etc.), since the edges can be almost evenly distributed across threads. Such an edge-based graph partitioning is also oblivious to the degree distribution, and hence is likely to work well across different types of graphs. Given E edges and T threads (with T ≤ E), the threads are assigned to the edges in a round-robin manner. This ensures coalesced access, since neighboring edges are assigned to consecutive threads. For our work, we spawned the GPU kernel with the maximum number of active threads possible for a given CUDA architecture.
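The round-robin assignment can be sketched as follows (illustrative Python; on the GPU this is simply a strided loop over the edge array):

```python
# Round-robin assignment of E edges to T threads (T <= E): thread t
# handles edges t, t + T, t + 2T, ...; at each step, consecutive threads
# touch consecutive edges, which keeps global-memory accesses coalesced.
def edges_for_thread(t, num_threads, num_edges):
    return list(range(t, num_edges, num_threads))

# 10 edges across 4 threads: per-thread edge counts differ by at most one,
# which is the near-perfect balance described above.
counts = [len(edges_for_thread(t, 4, 10)) for t in range(4)]
```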
Figure 1 shows the pseudocode for edge-based parallelism using SSSP as an example. SSSP maintains two worklists, inputWl and outputWl, for reading and writing, respectively. The outer while loop processes all the edges in the current input worklist by repeatedly calling sssp_kernel on the GPU. In the end, the computed distance values (attribute dist) are copied from the GPU to the CPU.
Parallel edge-based SSSP requires synchronization across threads. This is needed when two edges operated on by two threads point to the same node, resulting in a conflict. We use atomic primitives (atomicMin) to update the distance values. Populating a shared worklist such as outputWl also necessitates synchronization, and we rely on atomic primitives for inserting edges. A naïve approach of inserting each outgoing edge of a node separately results in considerable overhead. We instead employ work-chunking [9] to club together multiple edges and use a single atomic instruction for adding all the edges of a node.
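A minimal sketch of the work-chunking idea, with a plain Python counter standing in for CUDA's atomic fetch-and-add on the worklist tail (the `Worklist` class is our own illustration, not the paper's implementation):

```python
# Work-chunking: instead of one atomic insert per edge, a thread reserves
# a contiguous chunk of the output worklist with a single atomic
# fetch-and-add on the tail index, then fills the reserved slots without
# further synchronization. `self.tail` models the atomic counter.
class Worklist:
    def __init__(self, capacity):
        self.items = [None] * capacity
        self.tail = 0

    def push_chunk(self, edges):
        start = self.tail            # one "atomicAdd(&tail, len(edges))"
        self.tail += len(edges)
        for i, e in enumerate(edges):
            self.items[start + i] = e
        return start

wl = Worklist(8)
wl.push_chunk([(0, 1), (0, 2)])  # all of node 0's edges: a single atomic
wl.push_chunk([(3, 4)])
```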
Despite its advantages over the node-based approach, edge-based task-distribution suffers from the following limitations. First, it may not always be possible to use an edge-based distribution. For a node-based computation to be converted into a functionally equivalent edge-based computation, the computation needs to have a distributivity property: a function f distributes over another function g if f(g(x, y)) = g(f(x), f(y)). For example, in BFS, computing the minimum level for the destination node distributes over the addition (one plus) operation on the source node level.
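The BFS case can be checked concretely. The two helpers below are illustrative: one computes a destination's level node-wise (reduce first, then add one), the other edge-wise (add one per edge, then reduce):

```python
# In BFS, a destination's level is the minimum over (source level + 1)
# contributions. Edge-based processing is valid because the +1 update
# distributes over the min reduction: both evaluation orders agree.
def bfs_level_nodewise(src_levels):
    return min(src_levels) + 1             # node-based: reduce, then update

def bfs_level_edgewise(src_levels):
    return min(l + 1 for l in src_levels)  # edge-based: update per edge, then reduce

levels = [4, 2, 7]
# Both orders give the same destination level, so independent per-edge
# atomicMin updates are safe; operations without this property cannot be
# converted to edge-based form.
```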
The second disadvantage of edge-based distribution is its higher space complexity. The space-efficient CSR representation is unsuitable for edge-based distribution, since the source-node information is not duplicated across all the outgoing edges of a node. Therefore, edge-based distribution often demands the less-normalized COO format, wherein the source node is duplicated across the edges emanating from it. For a graph with E edges, the COO representation requires 2E elements for storing the two arrays (in contrast to V + E elements for the CSR representation in the case of node-based distribution, where V is the number of nodes). This additional storage cost can be substantial for denser graphs, and is especially relevant for GPUs, which continue to have smaller main memories (a maximum of 12 GB for NVIDIA GPUs). If the vertices in the arrays are represented by 4-byte integers, the maximum number of edges that can be accommodated on a GPU with 4 GB of device memory is about 500 million for non-weighted graphs and about 350 million for weighted graphs. Third, edge-based task-distribution can lead to a size explosion of worklists. The size of a worklist can become greater than the number of edges, since the outgoing edges of a node may be added redundantly to the worklist by multiple threads. This requires condensing the worklist and removing redundancy at the end of each GPU kernel invocation, resulting in a condensing overhead [2]. Large worklists also stress the memory resources.
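The capacity figures above follow from a back-of-envelope calculation, sketched here under the stated assumption of 4-byte vertex ids and 4-byte edge weights (the 4 GB device size is our assumption for the quoted ballpark):

```python
# Rough capacity check for edge-based (COO) storage: two 4-byte arrays
# (src, dst) per edge for unweighted graphs, three for weighted ones.
def max_coo_edges(device_bytes, weighted):
    arrays = 3 if weighted else 2
    return device_bytes // (4 * arrays)

GB = 10 ** 9
# On a 4 GB device this gives 500 million unweighted edges and about
# 333 million weighted ones, in the ballpark of the figures quoted above.
unweighted_cap = max_coo_edges(4 * GB, False)
weighted_cap = max_coo_edges(4 * GB, True)
```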
III Proposed Load Balancing Strategies
We propose three load-balancing strategies: workload decomposition, node splitting and hierarchical processing, which we discuss in the following subsections. All our strategies implement data-driven GPU executions [9], in which only the active elements are processed using a worklist.
III-A Workload Decomposition
Workload decomposition combines node-based and edge-based task-distribution. It can be viewed as a form of space decomposition. In this approach, the processing elements in the worklist continue to be nodes, but the workload of the nodes, namely the edges, is decomposed across threads using a block distribution. The edges of the worklist nodes are partitioned across threads such that each thread receives a contiguous chunk of edges for processing. Thus, a given thread processes a subset of edges corresponding to a subset of nodes, and all the edges outgoing from a node may not be processed by the same thread. Figure 2 illustrates the workload decomposition strategy with two nodes. In this figure, four threads process three edges each from two nodes, with the first node having five outgoing edges and the second node having seven. We find that Thread 2 processes two edges from node 1 and one edge from node 8.
Figure 2 also presents the pseudocode for the strategy. Each GPU thread processes edgesPerThread edges, starting from a particular outgoing edge of a particular node. Each worklist maintains the nodes to be processed and each node's out-degree as two associative arrays. Both arrays are populated while updating the worklist, but only the second array (node out-degrees) is used in the prefix-sum computation to determine edgesPerThread and to populate the offsets array of structures. These offsets are calculated on the GPU in a separate kernel function, find_offsets. A GPU thread in this kernel uses the prefix sums of the out-degrees of the nodes and the value of edgesPerThread to find the particular node and edge with which it has to start processing. The prefix sums of the out-degrees are also calculated on the GPU using a separate kernel; we use NVIDIA's Thrust API for the inclusive scan. The while loop in the kernel handles the situation when a thread, after processing a subset of edges of a node, moves on to processing the next set of edges from the next node in the worklist.
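The offset computation can be sketched on the host as follows (illustrative Python; this `find_offsets` mirrors the logic of the kernel of the same name, using a binary search over the inclusive prefix sums of the out-degrees):

```python
import bisect

# Workload decomposition: block-distribute the active edges across T
# threads. Each thread locates its starting (node, edge-within-node)
# pair from the inclusive prefix sums of the worklist nodes' out-degrees.
def find_offsets(out_degrees, num_threads):
    prefix = []
    total = 0
    for d in out_degrees:                  # inclusive scan (Thrust on the GPU)
        total += d
        prefix.append(total)
    per_thread = -(-total // num_threads)  # ceil(total / T) edges per thread
    offsets = []
    for t in range(num_threads):
        first_edge = t * per_thread
        if first_edge >= total:
            break
        node = bisect.bisect_right(prefix, first_edge)  # node owning the edge
        within = first_edge - (prefix[node - 1] if node else 0)
        offsets.append((node, within))
    return offsets

# Two worklist nodes with 5 and 7 outgoing edges, 4 threads, 3 edges each:
# thread 1 starts at edge 3 of node 0; thread 2 at edge 1 of node 1.
offs = find_offsets([5, 7], 4)
```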
An advantage of workload decomposition is that it works with the CSR format and therefore has a lower space complexity. Assuming a conservative estimate of the number of nodes equal to half the number of edges, graphs of at least 350 million edges can be accommodated in the CSR format on the GPU. Also, since it distributes the edges of a node across threads, it achieves better load balancing than a node-based distribution. A drawback of workload decomposition is that it can lead to uncoalesced accesses, since a node's edges may get separated. The method also incurs some overhead for the prefix-sum operations, for the extra kernel calls to obtain node offsets, and for the atomic instructions required because a node may be operated upon by multiple threads. Despite saving space compared to the COO format, workload decomposition requires extra space to store the node and edge offsets for each thread.
In our experiments, we observe that the limitations of workload decomposition affect its performance for large-diameter graphs (such as the road networks), but the method performs very well for scale-free graphs such as the social networks (Section IV).
III-B Node Splitting
The second approach we propose is based on changing the graph structure itself to balance the load. Since the root cause of the load imbalance is a skewed out-degree distribution across graph nodes, node splitting preprocesses the graph to split each high-degree node into multiple low-degree child nodes. This approach is implemented as follows. We define an input parameter called the maximum-out-degree threshold (MDT). If a node's out-degree is more than MDT, then the node is split into ceil(outdegree / MDT) nodes, with the outgoing edges of the node distributed evenly among the original (parent) and the split (child) nodes. For example, Figure 3 depicts our node-splitting approach, where node 8 is split into two nodes, 8 (parent) and 8' (child), which share the outgoing edges. Multiple child nodes can be formed from a parent node. Note that the resulting graph no longer contains the original high-degree node. By repeating this splitting procedure for each of the high-degree graph nodes, we obtain another graph with maximum degree bounded by MDT.
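A host-side sketch of the splitting step (illustrative; the tuple-based child ids and the dict-based graph representation are our own conventions, not the paper's):

```python
# Node splitting: any node whose out-degree exceeds MDT is replaced by
# ceil(degree / MDT) nodes (the parent plus child copies) that share its
# outgoing edges evenly. The graph is given as {node: [neighbours]}.
def split_high_degree_nodes(adj, mdt):
    new_adj, children = {}, {}
    for node, nbrs in adj.items():
        if len(nbrs) <= mdt:
            new_adj[node] = list(nbrs)
            continue
        pieces = [nbrs[i:i + mdt] for i in range(0, len(nbrs), mdt)]
        new_adj[node] = pieces[0]            # parent keeps the first chunk
        children[node] = []
        for k, piece in enumerate(pieces[1:], 1):
            child = (node, k)                # child id; the parent tracks it
            new_adj[child] = piece
            children[node].append(child)
    return new_adj, children

# Node 8 with 7 outgoing edges and MDT = 4 becomes parent 8 (4 edges)
# plus one child (3 edges), as in the splitting example above.
adj, kids = split_high_degree_nodes({8: [0, 1, 2, 3, 4, 5, 6]}, 4)
```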
As is evident, the node-splitting approach has the advantage that it can work with the space-efficient CSR representation. Since node splitting creates duplicates of a node, incoming edges to the original node need special consideration. To address this issue without increasing any existing node degrees, we maintain the incoming edges of a node only at the parent node. The parent node, in turn, keeps track of its children. The algorithm is modified to reflect the attributes (distance in the case of SSSP) of a parent node onto its children. This strategy ensures that no new edges get added to the graph (the parent-child relationship does not interfere with the normal graph edges).
Automatic Determination of the Node Splitting Threshold: A salient feature of our node-splitting strategy is that it automatically determines the threshold MDT for node splitting. Obvious methods based on a fixed threshold or the maximum degree do not work in general. For instance, we cannot fix the value of MDT to a constant, since no constant is suitable across graphs and degree distributions. Nor can we fix MDT to the maximum degree in the graph, since there could be a big skew in the degree distribution, with a few very high degree nodes and a large number of medium-degree nodes. A better alternative is to use the difference between the average degree and the maximum degree in the graph, but such a function would be influenced by the graph size. Another constraint is that the number of nodes to be split should be as small as possible, to reduce the splitting (and processing) cost. To account for these issues, we use a histogram-based method with HistogramBinCount bins representing ranges of out-degrees of the nodes in the original graph. The number of bins is given as an input parameter to our algorithm. We then find the distribution of the out-degrees across the bins and identify the bin, or range, with the maximum height, i.e., the range of out-degrees containing the maximum number of nodes. Let binIndex be the index of this bin. We then set the maximum degree threshold MDT for the out-degrees to (binIndex / HistogramBinCount) × maxDegree, where maxDegree is the maximum out-degree in the graph. Our node-splitting algorithm then finds the nodes in the graph with out-degrees greater than MDT and splits them into child nodes such that each parent and child node has at most MDT outgoing edges. Our histogram-based approach to finding MDT attempts to maximize the number of nodes (parent and child) with MDT outgoing edges.
By choosing the bin with the maximum height, in which the nodes already have out-degrees close to MDT, our algorithm minimizes the amount of splitting. Since the histogram-based method considers the degree distribution of a graph to achieve load balancing, it can be applied to different graphs with different kinds of distributions, including highly skewed ones.
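The threshold selection can be sketched as follows (illustrative Python; the exact bin-edge conventions used in the paper may differ in detail):

```python
# Histogram-based choice of the maximum-out-degree threshold (MDT):
# bucket the out-degrees into HistogramBinCount equal ranges, pick the
# tallest bin, and set MDT = (binIndex / HistogramBinCount) * maxDegree.
def find_mdt(out_degrees, bin_count):
    max_degree = max(out_degrees)
    heights = [0] * bin_count
    for d in out_degrees:
        # map degree d to a bin index in [0, bin_count - 1]
        idx = min(d * bin_count // max_degree, bin_count - 1)
        heights[idx] += 1
    bin_index = heights.index(max(heights)) + 1  # 1-based upper edge of the bin
    return bin_index * max_degree // bin_count

# Mostly low-degree nodes plus a few hubs: the tallest bin is the lowest
# one, so MDT lands near the bulk of the distribution, not near maxDegree.
degrees = [2, 3, 2, 4, 3, 2, 90, 100]
mdt = find_mdt(degrees, 10)
```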
An advantage of the node-splitting approach is that it continues to work with the space-efficient CSR format. Another advantage over workload decomposition (Section III-A) is that all the edges of a node are processed by the same thread, reducing bookkeeping and improving the scope for memory coalescing. The primary disadvantage of node splitting is that it results in extra atomic operations to update the child nodes whenever the parent node gets updated. A secondary disadvantage is the overhead of computing the histogram to find the MDT. One may suspect that the strategy has a space overhead due to the explicit splitting, but we found in our experiments that less than 5% of the nodes undergo a split, resulting in negligible space overhead.
As we observe in our experiments (Section IV), node splitting provides considerably better load balancing. In addition, it provides comparable performance for large-diameter graphs (such as road networks), but it has a high overhead for graphs with power-law degree distributions.
III-C Hierarchical Processing
Hierarchical processing performs a time decomposition of the workload. It achieves this by partitioning the main (super) worklist into several sub-worklists. If a sub-worklist is large, it can be further partitioned into sub-sub-worklists, and so on. This builds a hierarchy of worklists. The depth of this hierarchy is tunable, and we utilize the histogram-based approach from node splitting (Section III-B) for finding the maximum degree threshold (MDT), which determines when to split a worklist into sub-worklists.
An iteration for processing a node worklist is composed of sub-iterations. In each sub-iteration, a sub-list consisting of the nodes remaining to be processed from the super-worklist is formed, and a GPU kernel is invoked to process this sub-list. Each GPU thread considers a set of nodes in the sub-list, processing only up to MDT unprocessed outgoing edges of these nodes. Thus, all the threads corresponding to the kernel invocation of a sub-iteration are load-balanced within this threshold. The nodes in the sub-list with at most MDT unprocessed outgoing edges finish in the current sub-iteration; the next sub-list, with a reduced set of nodes, is processed in the next sub-iteration. This continues until all the nodes in the super-list are processed, before moving on to the next super-list in the next iteration.
Figure 4 illustrates the hierarchical processing of sub-lists within an iteration for an input graph. The super-worklist in the figure contains two nodes, 1 and 8, with five and seven outgoing edges, respectively. In each kernel invocation for a sub-iteration, two threads are spawned for processing the two nodes. Let MDT be 3. Then each thread in a sub-iteration processes at most three edges.
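The shrinking of the sub-lists over sub-iterations can be sketched as follows (illustrative Python, tracking only the sub-list sizes rather than the actual edge processing):

```python
# Hierarchical processing: in each sub-iteration a thread processes at
# most MDT unprocessed edges of its node; nodes with no edges left drop
# out of the next sub-list. Returns the sub-list size per sub-iteration.
def subiteration_sizes(out_degrees, mdt):
    remaining = list(out_degrees)
    sizes = []
    while remaining:
        sizes.append(len(remaining))  # sub-list for this kernel launch
        remaining = [d - mdt for d in remaining if d > mdt]
    return sizes

# Two worklist nodes with 5 and 7 edges, MDT = 3: both nodes need a
# second sub-iteration, and only the 7-edge node needs a third.
sizes = subiteration_sizes([5, 7], 3)
```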
The sizes of the sub-lists decrease over the sub-iterations due to the removal of the processed nodes. Since the GPU kernel is spawned using a node-parallel approach in which the number of GPU threads is proportional to the size of the sub-list, the reduction of the sub-list size can result in a situation where the GPU kernel is invoked with a small number of threads to process a small number of nodes, resulting in very low GPU occupancy. For example, consider a situation where, after a few sub-iterations, only one node remains to be processed and this node has twenty times MDT edges remaining to be relaxed. Twenty more sub-iterations would then invoke twenty more GPU kernels successively, each spawning a single GPU thread to process MDT edges of the node. To avoid this situation, our strategy switches to the workload-decomposition technique when the sub-list size becomes smaller than a threshold. A natural threshold is governed by the GPU kernel block size used in our experiments. We also switch to workload decomposition for processing the super-worklist at the beginning of the top-level iterations when the size of the super-list becomes smaller than the block size.
The fundamental advantage of the hierarchical processing strategy over node splitting is that it avoids the space and time costs of creating new nodes. By following a hybrid method, using workload decomposition for a small number of nodes and the sub-iteration technique for a larger number of nodes, it combines the advantages of multiple approaches. However, the hierarchical processing method incurs extra overhead due to the additional kernel invocations corresponding to the sub-iterations. The method also incurs increased space complexity and atomic operations for the sub-worklists.
In our experiments (Section IV) we found that, despite its overheads, hierarchical processing is a scalable mechanism. For the larger graphs in our experimental suite, where the other proposed strategies fail to execute due to insufficient memory, hierarchical processing successfully completes execution, offering good benefits in terms of load balancing.
Table I summarizes the advantages and disadvantages of the different load balancing strategies.
Strategy  Advantages  Disadvantages
Existing
  Node-based Distribution (BS)  Space-efficient CSR format; simple to implement  High load imbalance for skewed degree distributions
  Edge-based Distribution (EP)  Near-perfect load balance; simple to implement  Requires the memory-hungry COO format; requires distributivity; worklist size explosion
Proposed
  Workload Decomposition (WD)  Works with CSR; distributes a node's edges across threads  Uncoalesced accesses; prefix-sum, kernel-call and atomic overheads
  Node Splitting (NS)  Works with CSR; a node's edges stay with one thread (coalescing)  Extra atomics to update child nodes; one-time splitting overhead
  Hierarchical Processing (HP)  Scales to large graphs; no graph modification  Extra kernel invocations for sub-iterations; sub-worklist space and atomics
IV Experiments and Results
To assess the effectiveness of our proposed techniques, we embed them in the implementation of two graph algorithms: breadth-first search (BFS) and single-source shortest paths (SSSP) computation. Both algorithms are fundamental to several domains and form the building blocks of several interesting applications. We compare our implementations against the LonestarGPU benchmark implementations [10], which use a node-based task-distribution; we used LonestarGPU 1.02, an older version available at the time of our work. For our experiments, we use both synthetically generated and real-world graphs. The synthetic ones are RMAT graphs based on the recursive matrix model [11] and random graphs based on the Erdős–Rényi model (denoted ER*); both are generated using GTgraph [11]. For the real-world graphs, we use the USA road networks (for West, Florida and overall USA). To assess scalability, we include three relatively larger graphs obtained using the graph-generation tool available in the Graph500 benchmark [8]. The tool accepts three parameters, the number of nodes, the number of edges and a seed value for random-number generation, and generates a corresponding graph; depending upon the seed value, the graph connectivity differs. The properties of all the graphs are given in Table II. The last column of the table indicates the amount of load imbalance in the form of the skewness of the out-degree distribution of the nodes, reported in terms of the maximum, average and standard deviation (σ) of the out-degrees. We observe in Table II that the Graph500 and RMAT graphs have a high maximum degree as well as a high variance in the out-degrees. RMAT graphs are also characterized by the small-world property due to their low diameter. In contrast, the road networks have a very small maximum degree and little variation in the out-degree distribution; they have large diameters (not shown) in comparison to the RMAT and ER graphs. ER graphs have a random distribution of edges and a higher maximum degree as well as standard deviation than the road networks; however, they have neither the large diameters of the road networks nor the small-world property. It should be noted that despite this variance, the average degrees of all the graphs remain comparable. Together, these graphs test various aspects of our strategies.
Graph  Nodes (Million)  Edges (Million)  Out-degrees: Max  Avg  σ
rmat20  1.05  8.26  1,181  8  177.40
roadFLA  1.07  2.71  8  3  2.45
roadW  6.26  15.12  9  4  2.74
roadUSA  23.95  57.71  9  3  2.74
ER20  1.05  4.19  15  4  4.47
ER23  8.39  33.55  10  3  4.46
Graph500 (three graphs)  16.78  335.00  924,000  20  20,900
We implemented all our proposed strategies in CUDA. Our experiments were performed on a Kepler-based GPU system. The GPU is a Tesla K20c with 13 SMXs, each having 192 CUDA cores (2,496 CUDA cores in total), 4.66 GB of main memory, 1 MB of L2 cache and 64 KB of registers per SMX. It has a configurable 64 KB of fast memory per SMX that is split between the L1 data cache and shared memory. The programs were compiled using nvcc version 5.0 with the -O3 -arch=sm_35 flags. The CPU is a hex-core Intel Xeon E5-2620 2.0 GHz workstation with CentOS 6.4, 16 GB RAM and a 500 GB hard disk.
IV-A Performance Comparison of Strategies
In this section, we compare the various strategies in terms of performance, i.e., execution times. The strategies are denoted BS for the LonestarGPU baseline version that implements node-based task-distribution, EP for edge-based distribution, WD for workload decomposition, NS for node splitting, and HP for hierarchical processing. We split the overall execution time into useful kernel time and the overhead associated with implementing a strategy. The overheads encompass all the corresponding initializations, extra kernel invocations and bookkeeping. Note that BS also has an overhead component.
Figure 5 shows the comparison results for SSSP. We find that all our strategies perform significantly better than the baseline (BS) method in almost all cases, for graphs with small as well as large diameters. This is because in SSSP, especially for graphs with large diameters, the kernel times dominate the overheads (unlike in BFS, discussed below). This shows that our load-balancing strategies in particular, and load balancing in general, are fruitful for applications that perform even a reasonable amount of computation; we believe the techniques would be even more useful for computation-intensive irregular applications. The edge-based parallelism (EP) method performs the best, giving 60–80% smaller execution times than the baseline. Unfortunately, due to its high storage requirement, EP is unable to run on larger graphs such as Graph500. Among the node-based strategies, the workload decomposition (WD) method performs the best for graphs with highly skewed or random degree distributions. For such graphs (RMAT and ER), node splitting (NS) performs the worst, since its node-creation overhead coupled with the highly skewed degree distribution dominates the kernel times. However, when the deviation in the neighborhood sizes is small, the NS method performs the best among the node-based strategies, since its node-creation overhead is a one-time cost and is amortized by the relatively large total kernel execution times.
Hierarchical processing (HP) performs in between the WD and NS methods for smaller graphs. However, the main advantage of HP is seen when dealing with larger graphs such as Graph500. At the time of writing this paper, of the three load-balancing strategies (WD, NS and HP), we were able to execute only the HP strategy on these large graphs. As mentioned, edge-based parallelism (EP) has a high storage complexity related to storing the edges and hence cannot be executed on these large graphs either. We find that the HP method gives large improvements, resulting in a 48–75% reduction in execution times for these large graphs. Thus the HP method will become more important as we explore real-world big-data graphs.
Figure 6 shows the comparison results for BFS. It is noteworthy that BFS is a memory-bound kernel that performs only a little computation. Therefore, the associated overheads are large in general, unlike in SSSP where the overheads were smaller than the computations; only when the graphs get sufficiently big do the overheads get amortized. The EP method, as with SSSP, consistently delivers better performance than BS; however, its high storage requirements could not be accommodated for the large Graph500 graphs. For the graphs with small diameters, namely the RMAT and ER graphs, the execution time with EP is 48–68% smaller than that of BS (0.17 MTEPS (BS) vs. 0.54 MTEPS (EP) for RMAT20). For the graphs with large diameters, namely the road networks, the maximum performance gain of EP over BS is about 10%.
Similar to the SSSP results, for the BFS application the WD method performs the best among the node-based approaches for graphs with small diameters; the NS method involves considerable overhead for these graphs. For graphs with large diameters, the NS method performs the best since it incurs lower one-time overheads. For relatively larger graphs such as Graph500, HP performs considerably (about 2×) better than BS, while the EP method fails to complete execution due to insufficient memory.
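To make the contrast between node-based and edge-based task distribution concrete, the following is a minimal CPU-side sketch (not the paper's GPU code). It builds a CSR representation and models "work per thread" as the number of edges a thread touches; the star-shaped example graph and the even block partitioning of nodes and edges are our own illustrative assumptions.

```python
# Sketch: node-based vs. edge-based work assignment over a CSR graph.
# Work per thread is modeled as the number of edges that thread processes.

def build_csr(num_nodes, edges):
    """Build CSR offsets and concatenated adjacency list from (u, v) pairs."""
    offsets = [0] * (num_nodes + 1)
    for u, _ in edges:
        offsets[u + 1] += 1
    for i in range(num_nodes):                 # prefix sum of degrees
        offsets[i + 1] += offsets[i]
    adj = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for u, v in edges:
        adj[cursor[u]] = v
        cursor[u] += 1
    return offsets, adj

def node_based_work(offsets, num_threads):
    """Edges touched per thread when each thread owns a block of nodes."""
    n = len(offsets) - 1
    per = (n + num_threads - 1) // num_threads
    return [offsets[min((t + 1) * per, n)] - offsets[min(t * per, n)]
            for t in range(num_threads)]

def edge_based_work(offsets, num_threads):
    """Edges touched per thread when the edge list is divided evenly."""
    m = offsets[-1]
    per = (m + num_threads - 1) // num_threads
    return [min((t + 1) * per, m) - min(t * per, m)
            for t in range(num_threads)]
```

On a star graph (one hub with all the edges), the node-based split gives one thread all the work while others idle, whereas the edge-based split is nearly uniform; this is the load imbalance the EP method removes.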
IV-B Performance, Space Complexity and Implementation Trade-offs
While the previous section focused on the performance aspects of the strategies, in this section we compare the strategies along three axes: (i) execution time, (ii) memory requirement, and (iii) implementation complexity. The first two are quantitative, while the last is a qualitative assessment based on our experience. Figure 7 shows the relative rankings of the strategies in terms of the three aspects. On each axis, a strategy that is closer to the origin is superior in terms of the corresponding factor.
Overall, it is clear that no single technique fares well in all aspects. This suggests that we may have to use different load balancing strategies depending upon the performance requirements, the amount of GPU memory, and the personnel expertise available. Despite the lack of a clear winner, Edge-based Processing (EP) ranks better on two axes (execution time and implementation complexity). This makes it the more desirable option when memory is not an issue. Node-based processing (which we use as the baseline, BS) is also easy to implement and has a low memory requirement (due to the CSR representation), but in our experience it performs the worst. Another useful choice could be Hierarchical Processing (HP). It incurs a smaller memory penalty and has moderate implementation complexity. HP does not fare well in terms of performance for small graphs but performs well for large graphs. Workload Decomposition (WD) and Node Splitting (NS) could be the methods of choice when performance is more important but memory is insufficient to execute EP. NS (implemented as a static phase) is likely to perform better than WD despite the graph modification, but it incurs a larger memory overhead.
IV-C Degree Distribution due to Node Splitting
The NS method modifies the graph by creating child nodes to distribute the out-degrees. Therefore, the degree distribution in the modified graph differs from the original. Figure 8 shows the distribution of out-degrees of the nodes before (red curve) and after (green curve) node splitting for two synthetic graphs. The maximum degree thresholds (MDTs) determined using the histogram approach are also shown. It is evident from the figure that NS achieves better load balancing by confining all the nodes to out-degrees within a small range (represented by the green curves). We obtained similar results for the other graphs. It should also be noted that by exploiting histogramming, the MDT does not get biased to a range based on the graph size. For instance, for road networks and random graphs, the MDT is 2–4, whereas for the RMAT graph it is rightly computed as 118.
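The splitting transformation itself can be sketched as follows. This is an illustrative model, not the paper's implementation: every node whose out-degree exceeds the MDT is split into child nodes that each take at most MDT of its edges, and the `pick_mdt` heuristic here (upper bound of the most populous histogram bucket, with an assumed bucket count) merely stands in for the paper's histogram-based MDT computation.

```python
# Sketch of node splitting: cap every node's out-degree at the MDT by
# handing surplus edges to newly created child nodes.

from collections import Counter

def pick_mdt(degrees, num_buckets=10):
    """Assumed stand-in heuristic: histogram the degrees and return the
    upper bound of the fullest bucket."""
    max_d = max(degrees)
    width = max(1, (max_d + num_buckets - 1) // num_buckets)
    counts = Counter(d // width for d in degrees)
    fullest_bucket = counts.most_common(1)[0][0]
    return (fullest_bucket + 1) * width

def split_nodes(adjacency, mdt):
    """Return a new adjacency list in which no node keeps more than mdt
    edges, plus a map from each new node id to its original node."""
    new_adj = []
    parent = []                      # new node id -> original node id
    for u, nbrs in enumerate(adjacency):
        # max(len(nbrs), 1) keeps degree-0 nodes as a single (empty) child
        for start in range(0, max(len(nbrs), 1), mdt):
            new_adj.append(nbrs[start:start + mdt])
            parent.append(u)
    return new_adj, parent
```

Note that the edge count is preserved; only node ids change, which is why NS pays a one-time graph-construction cost and a modest memory overhead for the extra nodes.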
IV-D Work Chunking Optimization for Edge-based Parallelism
While using atomic operations to add edges to the worklist in the GPU kernel, we use work chunking, in which we collect all edges of a node and add them together using a single atomic operation. We compare this strategy with the default strategy of using an atomic operation for adding every edge. Figure 9 shows the speedups obtained by the work-chunking EP method over the default EP method. We find that work chunking results in speedups of 1.11–3.125× over the default method, with an average of 1.82×.
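The effect of work chunking can be modeled with a simple counter standing in for the GPU's atomic fetch-and-add: the chunked version reserves a slot range for all of a node's edges at once, so the number of atomic operations drops from one per edge to one per (non-isolated) node. This is a sequential model of the contention being reduced, not the kernel itself.

```python
# Model: worklist_len plays the role of the atomically updated worklist
# tail pointer; 'atomics' counts how many atomic operations were issued.

def push_per_edge(worklist_len, node_edges):
    """Default strategy: one atomic increment per pushed edge."""
    atomics = 0
    for edges in node_edges:
        for _ in edges:
            worklist_len += 1        # each push is its own atomic op
            atomics += 1
    return worklist_len, atomics

def push_chunked(worklist_len, node_edges):
    """Work chunking: one atomic op reserves a slot range per node."""
    atomics = 0
    for edges in node_edges:
        if edges:
            worklist_len += len(edges)   # single fetch-and-add of len(edges)
            atomics += 1
    return worklist_len, atomics
```

Both strategies produce the same final worklist length; the chunked version simply contends on the shared counter far less often, which is where the observed 1.11–3.125× speedups come from.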
V Related Work
While we have employed fundamental algorithms for BFS and SSSP, various optimized algorithms and implementations for these and other graph applications have been developed on a variety of architectures, including distributed and shared-memory supercomputers and multicore machines [12, 13, 14, 6, 5, 15, 16, 17]. BFS has received significant attention over the years [18, 19, 2, 20]. Merrill et al. [2] developed a work-efficient queue-based algorithm for BFS. Δ-stepping or its derivatives [21, 22] are commonly used for SSSP. Harish and Narayanan [23] describe CUDA implementations of graph algorithms such as BFS and single-source shortest paths computation. Vineet et al. [24] and Nobari et al. [25] propose GPU algorithms for computing minimum spanning trees and forests. The primary objective of our work is to propose and evaluate load balancing strategies within a common framework. While we have used the LonestarGPU framework and algorithms, our strategies are equally applicable to the optimized algorithms mentioned above.
Merrill et al. [2] implemented BFS traversal on GPUs using prefix sum computations. Their work introduces techniques for local duplicate detection to avoid race conditions, gathering neighbors for edges, concurrent discovery of duplicates, and strategies for graph contraction and expansion. Nasre et al. implemented topology-driven and data-driven versions of several graph applications and quantitatively compared the two versions [9]. In the topology-driven algorithms, GPU threads are spawned for all nodes in a graph, while in the data-driven algorithms, worklists are built dynamically and threads are spawned only for the active elements/nodes in a time step. In another work, Nasre et al. developed execution strategies to address challenges related to morph graphs, in which the structure of the graph changes during execution [1]. They propose optimizations for concurrent node addition, deletion and refinement, including reorganizing the memory layout, conflict resolution, adaptive parallelism and reducing warp divergence. Gharaibeh et al. [4] proposed hybrid executions of graph applications utilizing both CPU and GPU cores; their work devises and compares different strategies for partitioning the graph nodes among the CPUs and GPUs.
All these efforts assign the GPU threads to the nodes of the graph, thus performing node-based parallelism. None of these strategies addresses the resulting load imbalance due to node-level parallelism. Sariyüce et al. [3] evaluate both node-based and edge-based parallelism for the betweenness centrality problem. They identify the load imbalance in node-based parallelism and show that edge-based parallelism results in good load balance. They also propose the concept of virtual nodes, in which duplicate nodes are created for the actual nodes with high out-degrees. One of our strategies, namely the node splitting approach, is similar to the virtual nodes concept. However, in our method, the node-splitting level, i.e., the number of virtual nodes, is determined automatically using a novel heuristic. Our work is also more comprehensive since it considers multiple load balancing strategies and multiple graph applications. Our work also proposes a novel hierarchical processing method for load balancing. While for the betweenness centrality problem the authors show that the virtual node strategy performs uniformly better than edge-based parallelism, we show cases in which edge-based parallelism gives the best results and analyze the reasons. We also show that different application scenarios demand different strategies and there is no
one-size-fits-all solution.

VI Conclusions and Future Work
In this paper, we evaluated four load balancing strategies for the BFS and SSSP applications on different graphs. We found that the edge-based processing method performs the best, giving about 10% better performance than the baseline for BFS and about 60–80% better performance than the baseline for SSSP. Among the node-based strategies, the workload decomposition method performs the best for graphs with small diameters, while the node splitting method performs the best for graphs with large diameters. While the node-based strategies gave worse performance than the baseline in BFS, all our load balancing strategies gave significantly better results (at least 20% better) than the baseline for SSSP. This shows that load balancing becomes essential for computationally intensive graph applications, especially for large graphs. For very large graphs, on which some of our load balancing strategies cannot be executed due to memory constraints, our novel hierarchical processing method gives a 48–75% reduction in execution time compared to the baseline. In the future, we plan to explore our strategies for other graph applications, including minimum spanning tree and betweenness centrality. We also plan to explore the dynamic parallelism offered by modern GPU architectures for load balancing graphs. Finally, we plan to build data reorganization strategies for improved coalescing.
References
 [1] R. Nasre, M. Burtscher, and K. Pingali, “Morph Algorithms on GPUs,” in Principles and Practice of Parallel Programming, ser. PPoPP ’13, 2013.
 [2] D. Merrill, M. Garland, and A. Grimshaw, “Scalable GPU Graph Traversal,” in Principles and Practice of Parallel Programming, ser. PPoPP ’12, 2012.
 [3] A. Sariyüce, K. Kaya, E. Saule, and Ü. Çatalyürek, “Betweenness Centrality on GPUs and Heterogeneous Architectures,” in 6th Workshop on General Purpose Processor Using Graphics Processing Units, ser. GPUGPU ’13, 2013.
 [4] A. Gharaibeh, L. Costa, E. Santos-Neto, and M. Ripeanu, “On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest,” in International Parallel and Distributed Processing Symposium, ser. IPDPS ’13, 2013.
 [5] D. Ediger, K. Jiang, E. Riedy, and D. Bader, “GraphCT: Multithreaded Algorithms for Massive Graph Analysis,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 11, 2013.
 [6] A. Buluç and K. Madduri, “Parallel Breadth-First Search on Distributed Memory Systems,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’11, 2011.
 [7] A. McLaughlin and D. A. Bader, “Scalable and High Performance Betweenness Centrality on the GPU,” in 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis (SC), 2014.
 [8] “Graph 500,” http://www.graph500.org.
 [9] R. Nasre, M. Burtscher, and K. Pingali, “Data-driven Versus Topology-driven Irregular Computations on GPUs,” in International Parallel and Distributed Processing Symposium, ser. IPDPS ’13, 2013.
 [10] “LonestarGPU,” http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu.
 [11] K. Madduri and D. A. Bader, “GTgraph: A Suite of Synthetic Random Graph Generators.”
 [12] K. Madduri, D. Ediger, K. Jiang, D. Bader, and D. Chavarría-Miranda, “A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets,” in International Parallel and Distributed Processing Symposium, ser. IPDPS ’09, 2009.
 [13] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, “Lonestar: A Suite of Parallel Irregular Programs,” in ISPASS, 2009.
 [14] V. Agarwal, F. Petrini, D. Pasetto, and D. Bader, “Scalable Graph Exploration on Multicore Processors,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’10, 2010.
 [15] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, “A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L,” in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, ser. SC ’05. Washington, DC, USA: IEEE Computer Society, 2005. [Online]. Available: http://dx.doi.org/10.1109/SC.2005.4
 [16] D. A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” in Proceedings of the 2006 International Conference on Parallel Processing, ser. ICPP ’06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 523–530. [Online]. Available: http://dx.doi.org/10.1109/ICPP.2006.34
 [17] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew, “Optimistic Parallelism Requires Abstractions,” SIGPLAN Not. (Proceedings of PLDI), vol. 42, no. 6, pp. 211–222, 2007. [Online]. Available: http://iss.ices.utexas.edu/Publications/Papers/PLDI2007.pdf
 [18] L. Luo, M. Wong, and W.-m. Hwu, “An Effective GPU Implementation of Breadth-First Search,” in Proceedings of the 47th Design Automation Conference, ser. DAC ’10, New York, NY, USA, 2010, pp. 52–55. [Online]. Available: http://doi.acm.org/10.1145/1837274.1837289
 [19] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient Parallel Graph Exploration on Multi-Core CPU and GPU,” in 20th International Conference on Parallel Architectures and Compilation Techniques, 2011.
 [20] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu, “A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing,” in The 21st International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’12, 2012.
 [21] U. Meyer and P. Sanders, “Δ-stepping: A Parallelizable Shortest Path Algorithm,” J. Algorithms, vol. 49, no. 1, pp. 114–152, 2003.
 [22] V. Chakaravarthy, F. Checconi, F. Petrini, and Y. Sabharwal, “Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, May 19–23, 2014, 2014.
 [23] P. Harish and P. J. Narayanan, “Accelerating Large Graph Algorithms on the GPU using CUDA,” in HiPC ’07: Proceedings of the 14th International Conference on High Performance Computing. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 197–208.
 [24] V. Vineet, P. Harish, S. Patidar, and P. J. Narayanan, “Fast Minimum Spanning Tree for Large Graphs on the GPU,” in Proceedings of the Conference on High Performance Graphics 2009, ser. HPG ’09. New York, NY, USA: ACM, 2009, pp. 167–171. [Online]. Available: http://doi.acm.org/10.1145/1572769.1572796
 [25] S. Nobari, T.-T. Cao, P. Karras, and S. Bressan, “Scalable Parallel Minimum Spanning Forest Computation,” in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’12. New York, NY, USA: ACM, 2012, pp. 205–214. [Online]. Available: http://doi.acm.org/10.1145/2145816.2145842