1. Introduction
Bulk-synchronous parallel (BSP) programming (Valiant:1990:ABM) is the traditional model for GPU applications. It is a natural fit for statically schedulable, regular problems, such as many dense matrix, image analysis, and structured-grid computations. Programming environments like NVIDIA's CUDA and Khronos's SYCL support this relatively simple model, which maps efficiently to the massive fine-grained parallelism of GPUs and can deliver near-peak performance.
However, some important problems are instead irregular, with frequent control-flow branches, non-unit-stride memory accesses, variable amounts of work across loop iterations, and dynamically varying degrees of parallelism. Algorithms that operate on graphs or trees, or those with recursive formulations, often exhibit such irregularity. These problems map more naturally to a task-based programming model. Programming systems like Legion (bauer2012legion), PTask (rossbach2011ptask), and StarPU (augonnet2010data) use tasking on the CPU to feed GPUs with kernels, masking the latency of communication and keeping the GPU busy. In contrast, we consider the problem of very fine-grained tasking, where a set of similar application-level tasks is aggregated to form a data-parallel GPU task. This idea is used in state-of-the-art GPU graph libraries like Gunrock (gunrock2017), where the application-level tasks are vertices or edges. In the BSP model, each frontier in a graph sweep is launched as a separate GPU kernel; this raises issues of insufficient parallelism, uneven finish times, and high kernel-launch overhead for small frontiers.

To address these issues, we present the design of Atos, a task-scheduling framework for GPUs that is adaptable to different usage scenarios:

It supports both expensive and inexpensive frontiers by providing persistent and non-persistent task schedulers. The persistent scheduler is a GPU kernel that runs continuously to minimize launch overhead.

It allows the user to trade off task and data parallelism by selecting the worker size (the number of GPU threads within each worker) and the number of items in each task.

It uses a single shared task queue, which balances load more quickly than a distributed queue, yet is fast enough to keep GPU workers occupied.

It supports asynchronous execution across frontiers to maximize available parallelism, while mostly preserving cross-frontier ordering and thus minimizing overwork.
We study three graph algorithms on a variety of graph datasets that stress the importance of each of the above controls. These problems have nested parallelism with outer-loop dependencies, and the ability to relax those dependencies comes at the cost of possible overwork. Thus a second major theme in this paper is the tradeoff between increased parallelism and overwork. Each of our three algorithms explores this tradeoff in a somewhat different manner; detailed discussion and analysis are presented in Sections 5 and 6.
Our contributions include:

Developing a generalized GPU task-parallel framework that explores a broad design space of possible task-parallel implementations;

A demonstration of the benefits of mixed task and data parallelism for fine-grained parallel applications;

Identifying relaxed-synchronization applications as a strong candidate for acceleration with a GPU task-parallel framework; and

A detailed analysis of application performance that highlights the impact of design decisions both within the task scheduler and at the application level.
2. A Dynamic, Irregular Application Pattern
Atos handles a broad set of applications with fine-grained task and data parallelism, but we choose to focus here on a particularly challenging class of irregular nested loops with the following form:
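A minimal sequential sketch of this nested-loop shape (the helper names `run_nested_loops` and `process` are invented here; `workload` and `in_frontier` follow the identifiers used in the text):

```python
def run_nested_loops(workload, in_frontier, process):
    """workload[i] lists the items of task i; process(i, j) returns the
    new tasks generated by item j of task i."""
    while in_frontier:                          # outer loop: one BSP step
        out_frontier = []
        for i in in_frontier:                   # inner loop over tasks
            for j in range(len(workload[i])):   # inner loop over items
                out_frontier.extend(process(i, j))
        in_frontier = out_frontier              # loop-carried dependence
```

On a GPU, the two inner loops would be flattened and run in parallel; the outer loop carries the dependence that Atos later relaxes.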
The inner loops produce data parallelism that may be flattened, as in NESL (Blelloch:1990:CCL) or Gunrock (gunrock2017), to maximize parallelism and to implement a data-parallel load-balancing technique. For example, if the inner loops iterate over graph vertices and their outgoing edges, some pre-analysis may be used to evenly distribute the edges rather than the vertices. Our work also relaxes the outer-loop iterations, which expands opportunities to find parallelism.
Our applications exhibit one or more of the following forms of dynamic, irregular parallelism:

The number of tasks varies across outer-loop frontiers: Work is generated dynamically, and the number of output tasks produced from each input task is not fixed.

The cost of each inner-loop task (lines 4–5 in Listing 1) may vary: The loop bound (workload(i).size()) in line 5 is not fixed.
Total work may vary: The outer loop has a loop-carried dependence, so while dynamic generation of tasks will enforce some dependencies, parallel execution of tasks across frontiers can change program behavior, including the total number of tasks processed.
2.1. Performance Challenges of Dynamic, Irregular Problems
Traditional BSP implementations of such applications usually launch a series of kernels, with each kernel corresponding to a bulk-synchronous step (one iteration of the outer loop). Between steps is a global synchronization barrier. The kernels themselves parallelize over the inner loops, with the list of work passed between iterations forming the "frontier". While optimized BSP implementations of many dynamic irregular applications achieve impressive performance, we identify three performance challenges with this approach.
Small frontier problem The runtime of a kernel is the sum of fixed costs (the cost of the global synchronization barrier plus the kernel launch cost) and the amount of time to process the input frontier. The size of input frontiers across iterations may vary significantly over the lifetime of the program. A small frontier has two performance consequences: (1) The fixed costs dominate the overall processing cost; the GPU is spending a significant amount of time setting up or waiting for computation rather than performing it. (2) A small frontier may not be large enough to fill the GPU with work, leaving processing units idle (Gunrockfrontier).
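This fixed-cost argument can be made concrete with a toy cost model (the constants below are illustrative, not measured):

```python
def fixed_cost_fraction(frontier_size, fixed_us=5.0, per_item_ns=1.0):
    """Fraction of a kernel's runtime spent on fixed costs (launch +
    barrier), under a simple linear cost model: total time = fixed cost
    plus per-item processing time. The constants are made up."""
    work_us = frontier_size * per_item_ns / 1000.0
    return fixed_us / (fixed_us + work_us)
```

Under these (hypothetical) constants, a 100-item frontier spends almost all of its time on fixed costs, while a 10M-item frontier amortizes them away.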
Load imbalance Because of the irregular nature of work in the inner loop, efficiently assigning work to GPU threads is critical to achieving the highest performance. Statically computing the work assignment is infeasible because work is produced dynamically, so lines 4–6 in Listing 1 are usually implemented with data-parallel load-balancing techniques (merrillbfs; davidsonsssp; gunrock2017) that compute the work assignment at runtime. However, the optimal load-balancing technique is problem-dependent, and even for a fixed problem it may be input-dependent. Moreover, the runtime cost of that technique may be high.
Loss of concurrency opportunities The BSP model enforces an ordering between every operation in iteration i and every operation in iteration i+1, even if some of those pairs are independent. We relax this cross-frontier ordering in two ways. One is to consider problems whose typical formulation is as nested loops but whose computation is amenable to reordering across BSP iterations. The other is to speculate that work in later iterations does not depend on work in earlier iterations, and to repair the misspeculation or retry the computation if it is incorrect.
3. Our Task-Parallel Programming Model and Design Space
Our programming model allows implementations of task-parallel formulations of the workloads we discussed in Section 2, with a focus on providing solutions to BSP's three challenges: small frontiers, load imbalance, and loss of concurrency opportunities. We use the following terminology:
worker: one or a group of GPU threads.

task: one or more pieces of work that are scheduled as a single unit in our system.

application function: the code that processes each task. Each application function declares the worker size it requires to run.
At a high level, our model maintains a queue of tasks. Workers fetch a task from the queue, process the task, and add newly generated tasks (if any) to the queue. The program runs until either a stop condition is met or the queue is empty (Listing 2). This programming model addresses the performance challenges from Section 2.1:

We can implement Listing 2 with a single kernel invocation, avoiding multiple launches of small kernels.

Task parallelism is implicitly load-balanced because workers can run independently and stay busy even if tasks require different amounts of work.

Listing 2 has no global synchronization; instead the programmer controls the scheduling of work, allowing more flexible dependencies and thus more opportunities for concurrency.
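The worker loop of Listing 2 can be sketched on the host with a shared thread-safe queue; this is only an illustrative CPU analogue (`run_workers` and the in-flight counter are inventions of this sketch, not the Atos API):

```python
import queue
import threading

def run_workers(initial_tasks, apply_fn, num_workers=4):
    """Every worker pops from one shared queue, runs apply_fn on the
    popped task, and pushes any tasks apply_fn generates; all workers
    stop once no tasks remain in flight (a simplified stop condition)."""
    q = queue.Queue()
    for t in initial_tasks:
        q.put(t)
    lock = threading.Lock()
    state = {"in_flight": len(initial_tasks)}

    def worker():
        while True:
            with lock:
                if state["in_flight"] == 0:
                    return                  # queue drained: stop
            try:
                task = q.get(timeout=0.01)
            except queue.Empty:
                continue                    # another worker may still push
            for new_task in apply_fn(task):
                with lock:
                    state["in_flight"] += 1
                q.put(new_task)
            with lock:
                state["in_flight"] -= 1     # this task is now complete

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Note there is no barrier: a worker that finishes early immediately picks up whatever task is next in the shared queue.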
The primary focus of this paper is how to best implement Listing 2 on the GPU. We identify below the four key design decisions in such an implementation.
Relaxing barriers: A BSP implementation separates the work in each iteration with a global barrier. Can we benefit from relaxing this global barrier constraint?

Worker size: We can choose which GPU resources we assign to each worker. What worker size yields the best performance?

Data vs. task parallelism: We expect to leverage the parallelism between tasks. We can also choose the size of our tasks and, within each task, potentially leverage data parallelism. What is the right balance between task and data parallelism?

Kernel strategy: Listing 2 is written in a "persistent" style, implementable with a single kernel call. We could alternatively interchange its outer and inner loops, making one "discrete" kernel call per iteration. Which choice yields the best performance?
3.1. Relaxing Barriers
As discussed in Section 2, many applications have the nested-loop structure of Listing 1, and their BSP implementations may lose concurrency opportunities because of the global barrier between outer-loop iterations. In many cases, we can remove the barrier and relax the outer-loop dependency while still computing the correct result. How? Consider two tasks a and b that, in a BSP implementation, are ordered: a is in an iteration that precedes b's, and thus a must run before b.

One possibility is to speculate that we can compute a and b at the same time, or even in the order b then a, without changing the correctness of the computation. If our speculation is correct, then we expose more concurrency. If our speculation is incorrect, then we must fix it. This fix might be cheap or costly.

A second possibility is a problem formulation that is robust to computing items out of order. This is also called Dijkstra's "don't care" nondeterminism (Dijkstra:1976:ADO).
In either case, we can relax global barriers and expose additional concurrency; depending on the problem and dataset, we may see significant performance gains. However, relaxing barriers may incur additional costs: the cost of performing incorrectly speculated work, the cost of repairing incorrectly speculated work, and less predictable convergence rates when compared to the BSP counterpart. Overall, if the performance improvements from increased concurrency outweigh these costs, we can deliver performance gains.
Related work: Hassaan et al. (Hassaan:2011:OVU) studied unordered and ordered versions of several algorithms, demonstrating a tradeoff between parallelism and work efficiency. However, the relaxed-barrier formulations we study differ from unordered ones. Consider breadth-first search (BFS). Both Hassaan et al. and we begin with a work-efficient Dijkstra BFS, but they compare to a work-inefficient Bellman-Ford BFS, while we compare to a relaxed (speculative) Dijkstra BFS (Section 5.1). The speculative Dijkstra BFS is more work-efficient than Bellman-Ford BFS. Empirically, speculative Dijkstra's workload is within a small constant factor of that of BSP Dijkstra, which is O(V+E) (see Table 4). This is much smaller than Bellman-Ford's workload of O(V·E). Kulkarni et al. (kulkarni2007optimistic) studied an abstraction and runtime scheme for workloads with optimistic parallelism, which differs from the barrier relaxation we study in this paper. Their notion of optimistic parallelism assumes many tasks can run in parallel and stops a task the moment it violates a dependency. In contrast, we allow the computation to commit even if it violates a dependency, and only fix the mistake afterwards.
3.2. Worker Size
Previous task-parallelism work (Cederman:2008:ODL; chen2010dynamic; augonnet2010data; rossbach2011ptask; krieder2014design; ben2017groute) uses the whole GPU as a single worker. These frameworks maintain a task-dependency graph on the CPU side and orchestrate execution by launching a CUDA kernel for each task once its dependencies are satisfied. We use GPU-wide-worker to describe such an organization. Tasks in these GPU-wide-worker task-parallel frameworks are usually very large, to better utilize the entire GPU's resources. This organization is easy to program and has low scheduling overhead.
However, the GPU-wide-worker scheme is a poor match when task dependencies require finer management. Consider the following extreme example: Task A and Task B both contain 10,000 data items, and only a single data item in B depends on a single item in A. A GPU-wide-worker implementation must wait for A to complete before beginning work on B, even though most of the data items can be processed independently and concurrently. We use the term false dependency to describe the situation where a data item has all its dependencies satisfied but cannot be processed because another data item in the same task has unresolved dependencies. One can reduce such false dependencies by decomposing a large task into many smaller tasks to expose more parallelism. As a result, workers should be smaller, matching the size of tasks and allowing full utilization of the GPU's resources. This approach has motivated a number of recent task-parallelism frameworks (tzeng2010task; tzeng2012gpu; yeh2017pagoda; juggler; Steinberger:2014:WTS), which size workers as either warps or Cooperative Thread Arrays (CTAs). The resulting additional complexity of scheduling many smaller workers also motivates moving scheduling decisions from the CPU to the GPU.
Most task-parallelism frameworks provide only one worker size. Our framework provides thread-, warp-, and CTA-sized workers to support tasks of different sizes and different synchronization requirements. The only prior work that uses multiple granularities is Whippletree (Steinberger:2014:WTS). Whippletree's thread and warp worker sizes are primarily a programming-model concept and suffer from synchronization penalties at the implementation level: threads are still synchronized within entire CTAs, incurring false dependencies if tasks require finer-grained synchronization than CTA granularity.
3.3. Balance Between Data and Task Parallelism
Many parallel applications have work items that require different amounts of processing. Traditional BSP applications address this challenge with explicitly coded data-parallel load-balancing techniques. We describe two such approaches in the context of Listing 1:
One widely used technique is load-balancing search (davidsonsssp), which dynamically computes the prefix sum of workload(i).size() for each i in in_frontier, flattens the two for-loops into one large array, and redistributes the work in that array across CUDA threads (see Baxter (lbs) for details). Another popular data-parallel load-balancing technique separates the work in in_frontier into buckets based on workload(i).size() and launches, for each bucket, a separate kernel with the best processing strategy for that size (merrillbfs).
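The prefix-sum step of load-balancing search can be sketched sequentially (a GPU implementation would compute the prefix sum and the searches in parallel; `load_balance_search` is a name invented here):

```python
import bisect
import itertools

def load_balance_search(sizes):
    """Map each flattened work item back to its source task via a binary
    search over the exclusive prefix sum of task sizes: item t belongs
    to the last task whose starting offset is <= t."""
    prefix = [0] + list(itertools.accumulate(sizes))
    total = prefix[-1]
    return [bisect.bisect_right(prefix, t) - 1 for t in range(total)]
```

Threads then take equal chunks of the flattened array, so a thread's share of work no longer depends on any single task's size.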
Task parallelism is a natural fit for these irregular applications. Workers in our framework do not directly synchronize with each other. A primary advantage of this lack of coordination is that workers can attend to any work items as soon as they become available ("implicit task-parallel load balancing").
Task parallelism and data parallelism are not exclusive: individual tasks of sufficient size may also exploit data parallelism in their execution. Thus we consider a continuous spectrum with pure task-parallel and pure data-parallel load balancing at the extremes, and expect that the optimal tradeoff will be application-dependent. Our framework supports two worker sizes larger than a thread (warp and CTA) and lets the programmer exploit data parallelism within each warp-sized or CTA-sized task. In the framework, workers operate on tasks asynchronously, but each individual worker executes synchronously. Therefore, we can use a worker's capacity as a parameter to control the tradeoff between data and task parallelism. Given a fixed number of available threads, increasing a worker's capacity reduces the total number of workers available to the application. At the same time, increased worker capacity creates more opportunities for data-parallel load balancing within each worker. In the extreme, setting a worker's capacity to the entire GPU leaves no room for task parallelism and is equivalent to the BSP model. We found that data-parallel load balancing inside each worker, combined with the task-parallel load balancing provided by Atos, results in better overall load balancing (Section 6). We are not aware of any previous work that combines these two types of load balancing.
3.4. Kernel Strategy
Traditional GPU kernels divide a variable amount of input work into fixed-size CTAs and launch a kernel with a CTA count proportional to the amount of input work. Persistent kernels (Gupta:2012:ASO) decouple the data size from the number of launched CTAs: a persistent kernel launches only enough CTAs to fill the GPU. These CTAs remain resident for the entire kernel and run a loop that maps naturally to the task-parallel model of Listing 2.
Advantages of persistent kernels Persistent kernels reduce kernel-launch overhead and CPU/GPU communication. This is particularly significant where many small kernels would otherwise be required. The persistent kernel reduces CPU involvement in favor of programmer-written GPU logic within the persistent kernel.
Disadvantages of persistent kernels GPU workers in a persistent kernel concurrently pop from a shared queue; this requires atomic operations to ensure exclusive pops. Persistent kernels also have higher register usage than discrete kernels, requiring extra registers to maintain the queue loop.
Intuitively, if a discrete-kernel application suffers from large kernel overhead (i.e., exhibits the small frontier problem), a persistent kernel may be preferred. Otherwise, it may be better and simpler to choose a discrete kernel. Previous task-parallelism frameworks use either discrete kernels (Cederman:2008:ODL; augonnet2010data; rossbach2011ptask; krieder2014design) or persistent kernels (tzeng2010task; tzeng2012gpu; yeh2017pagoda; juggler; Steinberger:2014:WTS), but none of them provide both or expose that decision to the programmer. In Section 6.6, we discuss our experimental results with respect to this choice.
4. Framework API
The launch* API functions launch workers that repeatedly pop tasks from the work queue; each worker applies function f1 to the popped task. When a worker fails to pop, it instead runs function f2 (a no-op by default). In persistent-kernel mode, numThread × numBlock cannot exceed the maximum number of threads that can concurrently reside on the GPU, given the application's register and shared-memory usage. By default, these are set to the maximum allowed values, but they can be overridden by the user.
As discussed in Section 3, when we execute launchCTA, the choice of numThread determines the split between task and data parallelism. The argument FETCH_SIZE defines how many data items constitute a task to be popped from the queue by a worker. Given numBlock, FETCH_SIZE should be set accordingly: increasing FETCH_SIZE reduces the tasks available to other workers but increases local data parallelism, and decreasing FETCH_SIZE has the opposite effect.
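The pop semantics described above can be sketched as follows (`worker_pop_loop` and `max_idle` are stand-ins invented for this sketch; the real framework's queue, f1/f2 dispatch, and stop condition live on the GPU):

```python
def worker_pop_loop(task_queue, f1, f2, fetch_size, max_idle=3):
    """One worker's loop: pop up to fetch_size items as a single task
    and apply f1 to the batch; on a failed pop, run f2 instead. The
    idle counter is a toy stand-in for the real stop condition."""
    idle = 0
    while idle < max_idle:
        # atomically reserving fetch_size items is modeled by slicing
        chunk, task_queue[:] = task_queue[:fetch_size], task_queue[fetch_size:]
        if chunk:
            f1(chunk)       # process the popped task
            idle = 0
        else:
            f2()            # failed pop: run the fallback function
            idle += 1
```

A larger fetch_size means fewer, bigger tasks (more data parallelism inside f1); a smaller one means more tasks for other workers.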
5. Three Case Studies
In this paper, we select three classic nested-loop problems from the domain of graph computation for in-depth study. We choose them because their implementations are well studied in the BSP model, so we can be confident that our results are meaningful. Also, when run on particular datasets, their BSP implementations exhibit one or more of the challenges described in Section 2.1. In Algorithms 1–6, we use CUDA atomic-operation APIs (cuda_api).
5.1. Breadth-First Search
The pseudocode of BSP BFS and relaxed-barrier BFS ("Speculative BFS") is shown in Algorithms 1 and 2. Algorithm 1 shows that BSP BFS is exactly Dijkstra's algorithm: BSP BFS advances the outer iteration only when all vertices in in_frontier have been traversed; thus strict breadth-first ordering is preserved, ensuring every vertex is first reached via its optimal path.
In contrast, in Algorithm 2, Speculative BFS relaxes the barrier constraints of the outer loop of BSP BFS. Speculative BFS has a single dynamic frontier, and many independent CUDA workers asynchronously push and pop vertices to and from the frontier. Consequently, vertices at different distances can be processed simultaneously. This raises the possibility of extra work, since out-of-order iteration may require visiting vertices multiple times to find the shortest path.
Despite this difference, we stress that both algorithms are based on a similar vertex traversal. Speculative BFS is more precisely seen as Dijkstra's algorithm with relaxed barrier constraints, and differs from SSSP algorithms such as Bellman-Ford.
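The effect of relaxed ordering can be illustrated with a sequential label-correcting sketch, where the pop policy stands in for asynchronous scheduling (`relaxed_bfs` is a name invented here, not the paper's code):

```python
def relaxed_bfs(adj, src, pop=list.pop):
    """BFS where vertices are popped in an arbitrary order (LIFO by
    default), so a vertex may first be reached via a longer path and
    later repaired: depths are still correct, at the cost of overwork."""
    depth = {v: float("inf") for v in adj}
    depth[src] = 0
    frontier = [src]
    pops = 0
    while frontier:
        u = pop(frontier)
        pops += 1
        for v in adj[u]:
            if depth[u] + 1 < depth[v]:
                depth[v] = depth[u] + 1   # repair: shorter path found
                frontier.append(v)
    return depth, pops
```

With a FIFO pop this is exactly strict breadth-first order; any other order gives the same final depths but may pop (revisit) more vertices.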
5.2. PageRank
PageRank computes the importance (rank) of nodes in a graph.
PageRank has a "push" and a "pull" formulation. Both BSP and relaxed-barrier PageRank (asynchronous PageRank) use push; their pseudocode is shown in Algorithms 3 and 4. The algorithm begins with an initial rank and residue value per node. In Algorithm 3, BSP PageRank is computed by two kernels that are repeatedly called in an outer loop until all vertices converge. The first kernel pushes a fraction of the residue of each vertex on the frontier to its neighbors. The second kernel aggregates all vertices whose residue exceeds the convergence threshold and adds them to the frontier. Having two separate kernels allows threads to be remapped for more flexible load balancing.
Asynchronous PageRank (Algorithm 4) removes the global synchronization and fuses the two kernels of the BSP implementation. In asynchronous PageRank, CUDA workers pop vertices from the queue asynchronously, push their residue to their neighbors, then exclusively reserve Check_Size vertices and push onto the queue those whose residue exceeds the threshold.
PageRank is an iterative, unordered algorithm, as it is essentially a random walk (chung2010pagerank; fortunato2007random). In theory, the BSP global barrier is neither necessary nor beneficial. In our experiments, BSP PageRank does not converge faster than asynchronous PageRank.
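A sequential sketch of push-style residual PageRank without a barrier (`async_pagerank`, `eps`, and the queueing details are illustrative stand-ins for Algorithm 4, not its actual code):

```python
def async_pagerank(adj, alpha=0.85, eps=1e-7):
    """Vertices are popped in arbitrary order; each pop converts the
    vertex's residue into rank and pushes alpha * residue to its
    out-neighbors. A neighbor re-enters the queue whenever its residue
    exceeds eps, so no global iteration barrier is needed."""
    rank = {v: 0.0 for v in adj}
    residue = {v: 1.0 - alpha for v in adj}
    queued = {v: True for v in adj}
    frontier = list(adj)
    while frontier:
        u = frontier.pop()
        queued[u] = False
        r, residue[u] = residue[u], 0.0
        rank[u] += r                      # absorb residue into rank
        if adj[u]:
            share = alpha * r / len(adj[u])
            for v in adj[u]:
                residue[v] += share
                if residue[v] > eps and not queued[v]:
                    queued[v] = True
                    frontier.append(v)
    return rank
```

Because the algorithm is unordered, any pop order converges to the same ranks (up to the eps tolerance).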
5.3. Graph Coloring
Graph coloring is usually implemented using either an independent-set-based approach or a greedy approach. In this paper, both the BSP and relaxed-barrier graph colorings use a speculative greedy coloring algorithm (gebremedhin2000scalable), shown in Algorithms 5 and 6.
In Algorithm 5, the BSP implementation uses a double buffer and consists of two kernels that are called in the outer loop until all vertices are appropriately colored. The first kernel assigns colors to vertices from in_frontier while avoiding color conflicts with their neighbors. The second kernel aggregates each vertex that, after the first kernel's color assignment, still has a color conflict with a neighbor, and adds it to out_frontier. The speculative part of this algorithm is in the first kernel: it allows a vertex to be colored using outdated color information from its neighbors; if the vertex's color assignment fails (detected in the second kernel), that vertex is re-added to the frontier for re-coloring.
In contrast, in Algorithm 6, the relaxed-barrier version fuses the two kernels into an uber-kernel (hargreaves2005generating); we use the sign of the vertex ID to distinguish the color-assignment task from the conflict-detection task. Specifically, if we pop a vertex with a positive ID, we perform color assignment (lines 8–11 of Algorithm 6, identical to lines 11–14 of Algorithm 5), and if we pop a vertex with a negative ID, we perform conflict detection (lines 16–18 of Algorithm 6, identical to lines 21–23 of Algorithm 5). Thus, Algorithm 6 is almost identical to Algorithm 5; the only difference is that Algorithm 6 does not enforce a global barrier between color assignment and conflict detection.
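The speculate-then-repair structure can be sketched sequentially, with a snapshot standing in for the stale reads that concurrent execution produces (`speculative_coloring` and the lowest-ID tie-break are inventions of this sketch, not the paper's exact rule):

```python
def speculative_coloring(adj):
    """Round-based sketch: each round, frontier vertices pick the
    smallest color absent from a snapshot of their neighbors' colors
    (so two neighbors may speculatively pick the same color); conflict
    detection then re-queues the higher-numbered endpoint."""
    color = {v: -1 for v in adj}
    frontier = sorted(adj)
    while frontier:
        snapshot = dict(color)            # stale reads, as in speculation
        for v in frontier:
            used = {snapshot[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # a vertex retries only if it clashes with a lower-numbered neighbor
        frontier = [v for v in frontier
                    if any(color[u] == color[v] and u < v for u in adj[v])]
    return color
```

Each round the lowest-numbered vertex of every conflict keeps its color, so the frontier shrinks and the loop terminates with a valid coloring.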
6. Experiments and Analysis
6.1. System and Hardware Information
All experiments in this paper are run on a Linux workstation with 2.20 GHz hyperthreaded Intel Xeon E5-2698 v4 CPUs, 528 GB of main memory, and an NVIDIA V100 GPU with 32 GB of on-board memory. All programs were compiled with NVIDIA's nvcc compiler (version 11.1.168) and gcc 9.3.0, both with the -O3 flag. All results ignore transfer time and are averaged over 20 runs.
6.2. Experimental Overview
We evaluate the design principles discussed in Section 3 using three implementation variants based on combinations of Atos configurations. (1) "persist-32" uses persistent kernels with warp-sized workers. It has no data-parallel load balancing within a worker, relying only on implicit task-parallel load balancing. (2) "persist-worker_size-FETCH_SIZE" uses persistent kernels with CTA-sized workers. (3) "discrete-worker_size-FETCH_SIZE" uses discrete kernels and CTA-sized workers. Both CTA variants use load-balancing search (merrillbfs) (a data-parallel load-balancing technique) inside workers, in conjunction with implicit task-parallel load balancing.
For BFS and PageRank, we compare the performance of our implementations to Gunrock (gunrock2017), a state-of-the-art single-GPU BSP-based graph framework that extensively uses data-parallel load-balancing techniques. For graph coloring, Gunrock's independent-set coloring algorithm is not comparable, so we faithfully implemented a BSP graph coloring using the same speculative greedy coloring algorithm. Our BSP implementation uses Gunrock's bucket-based data-parallel load-balancing method (gunrock2017), described in Section 3.3.
We run the three case studies on three scale-free and two mesh-like datasets (see Table 2) and summarize the runtime and speedup results for the four implementations in Table 1.
Table 1. Runtime (speedup vs. BSP in parentheses) for the four implementations. persist-CTA and discrete-CTA denote the CTA-sized-worker variants of Section 6.2; ^s marks scale-free and ^m mesh-like datasets.

Application: BFS (BSP = Gunrock)
Dataset             BSP   persist-32    persist-CTA   discrete-CTA
soc-LiveJournal1^s  15.3  22.3 (x0.68)  12.4 (x1.23)  10.7 (x1.42)
hollywood_2009^s    9.26  12.2 (x0.75)  6.23 (x1.48)  4.56 (x2.02)
indochina_2004^s    13.2  15.6 (x0.84)  8.03 (x1.65)  7.42 (x1.79)
road_usa^m          604   327 (x1.84)   46.9 (x12.8)  174 (x3.46)
roadNet_ca^m        55.9  39.6 (x1.41)  4.35 (x12.8)  15.5 (x3.58)

Application: PageRank (BSP = Gunrock)
Dataset             BSP   persist-32    persist-CTA   discrete-CTA
soc-LiveJournal1^s  262   156 (x1.68)   113 (x2.31)   116 (x2.25)
hollywood_2009^s    87.1  80.0 (x1.08)  68.5 (x1.27)  72.4 (x1.20)
indochina_2004^s    159   84.7 (x1.88)  52.6 (x3.02)  49.6 (x3.20)
road_usa^m          221   169 (x1.30)   121 (x1.81)   112 (x1.95)
roadNet_ca^m        20.5  16.2 (x1.26)  10.1 (x2.03)  8.28 (x2.47)

Application: Graph Coloring
Dataset             BSP   persist-32    persist-CTA   discrete-CTA
soc-LiveJournal1^s  96.5  20.4 (x4.71)  36.1 (x2.67)  63.2 (x1.52)
hollywood_2009^s    77.9  31.9 (x2.4)   59.3 (x1.31)  274 (x0.28)
indochina_2004^s    673   74.1 (x9.08)  184 (x3.65)   2073 (x0.32)
road_usa^m          38.2  51.4 (x0.74)  19.3 (x1.97)  81.9 (x0.46)
roadNet_ca^m        9.11  4.18 (x2.18)  3.52 (x2.58)  12.0 (x0.75)
Table 2. Dataset characteristics (^s: scale-free; ^m: mesh-like).
Dataset             Vertices  Edges  Diam.  Max. in-deg.  Max. out-deg.  Avg. degree
soc-LiveJournal1^s  4.8M      68M    20     13,905        20,292         14
hollywood_2009^s    1.1M      11M    11     11,467        11,467         105
indochina_2004^s    7.4M      191M   26     256,425       6,984          8
road_usa^m          23.9M     57M    6,809  9             9              2
roadNet_ca^m        1.9M      5M     849    12            12             2
6.3. Performance Challenges in the Three Case Studies
Each of our study cases embodies a subset of the BSP performance challenges discussed in Section 2.1, which informs our choice of design decisions in the Atos framework. We summarize the BSP performance challenges in Table 3.
Load Imbalance Problem:
All three algorithms involve iterating over a vertex's neighbor list; thus variance in vertex degree leads to load imbalance. This issue is much more severe on scale-free datasets, where degree variance is high. In contrast, mesh-like graphs have low maximum degree, and hence low degree variance (see Table 2).
Small Frontier Problem: In Figures 1, 2, and 3, we plot throughput against time for each BSP implementation of the three algorithms. Low throughput over a long duration indicates the presence of the small-frontier problem.
BFS: In Figure 1, Gunrock has high throughput on scale-free datasets and thus does not suffer the small-frontier problem: scale-free datasets have low diameter (a small number of BSP iterations) and high average degree (a large amount of work per iteration). In contrast, Gunrock on mesh-like datasets does exhibit the small-frontier problem: these datasets have high diameter and small average degree, so there are many iterations with little work per iteration, leading to low throughput over many iterations.
PageRank: In Figure 2, neither scale-free nor mesh-like datasets exhibit the small-frontier problem for Gunrock PageRank: both have high throughput over most of the execution time and converge in fewer than 35 iterations (though indochina_2004 exhibits a long flat tail in the latter half of execution).
Graph Coloring: In Figure 3, for BSP graph coloring, scale-free datasets have low throughput for more than 70% of execution time and thus suffer the small-frontier problem. Mesh-like datasets terminate in fewer than 40 iterations with short tails, and thus do not. The reason is that on scale-free datasets, the high-degree vertices must be re-colored many times, leading to a large number of iterations during which the frontier contains only a few high-degree vertices with color conflicts. In contrast, mesh-like datasets have low average degree and are less likely to have color conflicts.
Table 3. BSP performance challenges by dataset type.
            BFS             PageRank        Graph Coloring
Scale-Free  Load Imbalance  Load Imbalance  Load Imbalance + Small Frontier
Mesh-Like   Small Frontier  None            None
6.4. Relaxing Barriers
Table 4. Overwork factors for the three case studies (column labels follow the variant names of Section 6.2).

Application: BFS
Dataset             persist-32  persist-CTA  discrete-CTA
soc-LiveJournal1^s  1.43        1.06         1.01
hollywood_2009^s    2.26        1.19         1.07
indochina_2004^s    1.28        1.00         1.00
road_usa^m          3.56        1.05         1.04
roadNet_ca^m        2.05        1.02         1.04

Application: PageRank
Dataset             persist-32  persist-CTA  discrete-CTA
soc-LiveJournal1^s  0.73        0.72         0.72
hollywood_2009^s    1.08        1.18         0.9
indochina_2004^s    0.76        0.73         0.75
road_usa^m          0.79        0.79         0.92
roadNet_ca^m        1.18        1.11         0.97

Application: Graph Coloring
Dataset             BSP   persist-32  persist-CTA  discrete-CTA
soc-LiveJournal1^s  1.17  1.00        1.74         2.78
hollywood_2009^s    3.31  1.15        5.24         37.34
indochina_2004^s    1.96  1.04        4.45         16.97
road_usa^m          1.22  1.00        1.46         1.41
roadNet_ca^m        2.55  1.00        1.74         2.44
As discussed in Sections 2–3, relaxing barriers exposes more concurrency, giving higher throughput and shorter execution time. However, relaxing barriers may result in extra work. If the performance improvement from increased concurrency outweighs the cost of extra work, we obtain a net performance gain.
There are two key factors influencing this tradeoff. First, we find that in the presence of a small frontier problem, the increase in concurrency from relaxing barriers is always more significant than the cost of extra work. Second, on naturally unordered algorithms such as PageRank, one can always relax the barrier: although the barrier gives the BSP implementation a more predictable convergence rate, the barrier generally does not make the convergence faster.
Table 4 summarizes the extra work for the three case studies. Figures 1, 2, and 3 plot the throughput of the four implementations (BSP + three Atos variants) of the three case studies against time for four datasets. Note that these plots show normalized throughput: the measured throughput divided by the overwork factors in Table 4. This gives a fair measure of overall performance, as it incorporates both the benefit of improved concurrency (higher absolute throughput) and the cost of extra work. Essentially, normalized throughput measures "useful" rather than raw throughput. We provide a detailed analysis of each application below.
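The normalization can be stated as a one-line formula (the numbers in the example below are made up, not taken from the tables):

```python
def normalized_throughput(items_processed, seconds, overwork_factor):
    """Normalized throughput = raw throughput / overwork factor, so an
    implementation that processes 2x the items at 2x the rate gains
    nothing: only useful work counts."""
    return (items_processed / seconds) / overwork_factor
```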
BFS: Figure 1 shows that for the two mesh-like datasets, all three Atos implementations achieve considerably higher normalized throughput than Gunrock. Why? Table 3 shows that Gunrock suffers a severe small-frontier problem on mesh-like datasets, so the increased concurrency of the three Atos implementations offers a significant performance advantage. Table 4 indicates that the persist-warp implementation generates 3.5x extra work vs. Gunrock, but despite this extra work, Atos's normalized throughput is still significantly higher than Gunrock's. Scale-free graphs, on the other hand, exhibit more parallelism and do not suffer from the small-frontier problem. Atos's fastest implementations are still faster than Gunrock's, but by a smaller margin than on the mesh networks.
On all BFS experiments, Atos's CTA implementations are faster than its warp ones, for two reasons. First, the CTA implementations use a combination of task-parallel and data-parallel load-balancing techniques (see Section 6.5 for details), and thus balance load better than the warp implementations, which use only task-parallel load balancing. This leads to higher GPU utilization and hence higher absolute throughput. Second, the CTA implementations produce less extra work than the warp ones (see Table 4). Because of the better load balancing, the workload of each CTA worker has lower variance. If a worker receives too much work, there is a long delay before its vertices' updated depths become visible to other workers; this increases the likelihood that downstream vertices are first reached via other, suboptimal paths, which leads to extra work.
PageRank: Unlike BFS, PageRank is naturally unordered: it satisfies Dijkstra's don't-care nondeterminism (Dijkstra:1976:ADO). Therefore, relaxing the barrier in the outer loop generates no misspeculation and hence no wasted work. In fact, Table 4 shows that the Atos implementations generally perform less work than Gunrock. This is because the BSP barrier forces each vertex to be processed at most once per iteration. By relaxing this barrier, the Atos implementations can update certain important vertices (e.g., vertices with high centrality) more frequently than other vertices, leading to more efficient propagation of rank.
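The effect of relaxing the barrier on an unordered algorithm can be sketched in a few lines (our illustration, not Atos code): a Jacobi-style, double-buffered iteration models BSP PageRank, while in-place Gauss-Seidel-style updates model the relaxed barrier. Don't-care nondeterminism means both reach the same fixed point.

```python
def pagerank(out_links, damping=0.85, iters=200, asynchronous=False):
    """Tiny PageRank sketch (ours). asynchronous=False is BSP-style
    Jacobi iteration (a barrier between rounds, ranks double-buffered);
    asynchronous=True is Gauss-Seidel-style in-place updating, modelling
    the relaxed barrier: a vertex may read ranks already updated this
    round. Both converge to the same fixed point."""
    n = len(out_links)
    in_links = [[u for u in range(n) if v in out_links[u]] for v in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        src = rank if asynchronous else list(rank)  # relaxed vs. snapshot
        for v in range(n):
            rank[v] = (1 - damping) / n + damping * sum(
                src[u] / len(out_links[u]) for u in in_links[v])
    return rank

# 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1 (no dangling vertices)
graph = [[1], [2], [0, 1]]
sync_ranks = pagerank(graph, asynchronous=False)
async_ranks = pagerank(graph, asynchronous=True)
assert all(abs(a - b) < 1e-6 for a, b in zip(sync_ranks, async_ranks))
```

The asynchronous variant typically propagates rank in fewer rounds, which is the mechanism behind the sub-1.0 overwork factors in Table 4.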
Figure 2 shows that all three Atos implementations compact the workload and process it with higher normalized throughput (persist-CTA has a higher profiling cost). Though PageRank does not suffer from the small-frontier problem, the three Atos implementations nonetheless outperform Gunrock, because relaxing the barrier increases concurrency. In addition, relaxing the barrier lowered the overall workload in practice, even though in theory it may lead to a less predictable convergence rate.
Graph Coloring: Unlike BFS and PageRank, all graph coloring implementations (including BSP) use a speculative approach (greedy graph coloring) and thus all perform extra work. Table 4 summarizes the multiplicative factor of extra work, defined as the ratio of color assignments to the number of vertices in the graph (the lowest possible workload). Atos's persist-warp has the least extra work; on some datasets the extra work is under 1%, meaning that after the first color assignment, only 1% of vertices have a color conflict and must be recolored. Atos's discrete-warp has the most extra work (37.34x on hollywood-2009). The extra work stems from the combination of two factors:
1. Conflicts tend to arise when neighboring vertices are colored concurrently: From Section 5.3, given a vertex, the algorithm first checks its neighbors’ colors, then assigns a color to the vertex that does not conflict with its neighbors. The color assignment is speculative because it is done using possibly outdated color information from the vertex’s neighbors. When neighboring vertices are colored simultaneously, they read outdated colors from each other, leading to conflicts and recoloring.
2. Consecutive vertices on the work queue are likely to be neighbors: On many if not most graphs, the vertex ID is semantically meaningful: vertices whose IDs are numerically close are more likely to be neighbors. At the beginning of graph coloring, all vertices are inserted onto the work queue in order of vertex ID.
Since consecutive vertices on the work queue tend to be assigned colors concurrently, the above implies a high likelihood of color conflicts. We verified that the large amount of extra work is indeed due to semantically meaningful vertex IDs: running the exact same experiment with randomly permuted vertex IDs, the extra work drops to less than 1.5x for all four implementations on all datasets. ID permutation leads to the following runtime improvements (in ms, before → after permutation) on scale-free datasets:
Impl.            soc-LiveJournal1   hollywood-2009   indochina-2004
discrete-warp        63 → 31           274 → 26        2073 → 222
persist-CTA          36 → 21            59 → 28         184 → 50
BSP                  96 → 89            77 → 61         673 → 485
The BSP implementation shows a more modest improvement because BSP's thread-warp-CTA load-balancing scheme (merrillbfs) already divides each bucket into three individually load-balanced sub-buckets, reducing inter-bucket conflicts. Persist-warp changes little because it has almost no extra work even before permutation. Notably, after permutation, all three Atos variants are faster than the BSP implementation on scale-free datasets.
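The speculative coloring loop and the effect of ID ordering described above can be sketched as follows (our simplified, sequential model; a `batch` stands in for the set of vertices that real workers would color concurrently from one snapshot of colors):

```python
import random

def greedy_color(adj, order, batch=32):
    """Speculative greedy coloring sketch (ours): vertices in the same
    batch read one snapshot of neighbor colors, so adjacent vertices in
    a batch can pick the same color. Conflicts are detected afterwards
    and the higher-ID endpoint is requeued for recoloring. Returns
    (colors, total color assignments); assignments beyond len(adj) are
    the extra work."""
    color = [None] * len(adj)
    work, assignments = list(order), 0
    while work:
        for i in range(0, len(work), batch):
            snap = list(color)            # colors visible to this batch
            for v in work[i:i + batch]:
                taken = {snap[u] for u in adj[v]}
                c = 0
                while c in taken:
                    c += 1
                color[v] = c              # speculative assignment
                assignments += 1
        # conflict resolution: the lower-ID endpoint keeps its color
        work = sorted(v for v in set(work)
                      if any(u < v and color[u] == color[v] for u in adj[v]))
    return color, assignments

# Path graph 0-1-2-...-255: consecutive IDs are always neighbors.
n = 256
adj = [[u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)]
colors_id, work_id = greedy_color(adj, range(n))
random.seed(1)
colors_rand, work_rand = greedy_color(adj, random.sample(range(n), n))
assert work_id > n   # ID order forces recoloring beyond the minimum n
```

On this path graph, every vertex in the first ID-ordered batch reads an empty snapshot and picks color 0, so most of the batch must be recolored; under the random permutation, adjacent vertices rarely share a batch, so far fewer conflicts typically arise, mirroring the permutation experiment above.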
Comparing persist-warp and persist-CTA: persist-CTA has better load balancing, allowing more (potentially adjacent) vertices to be colored simultaneously and thus generating more extra work than persist-warp; Table 4 confirms this. Roughly speaking, persist-CTA's extra work is more significant on scale-free graphs, where a vertex can have a large number of neighbors and therefore more potential conflicts. Consequently, persist-CTA outperforms persist-warp on mesh-like graphs, where the increased concurrency and better load balancing outweigh the cost of wasted work; conversely, on scale-free datasets, persist-CTA is slower than persist-warp because the cost of extra work is too high (see Tables 1 and 4).
Comparing persist-warp and discrete-warp: discrete-warp has more extra work than persist-warp, hurting its performance, for two reasons.
1) The scheduling policies of discrete and persistent kernels differ. When kernels are launched from the CPU (the discrete-kernel strategy), a kernel launched earlier always has a higher (hardware) scheduling priority than one launched later. This effectively causes vertices to be colored in roughly their initial order (by vertex ID, which causes many conflicts). In contrast, the persistent kernel incurs only one kernel launch, and the warps within it are scheduled by the hardware scheduler, whose decisions are much less ordered by vertex ID. Thus persist-warp has fewer adjacency-induced coloring conflicts and hence less overwork.
2) discrete-warp has lower register usage than persist-warp (42 vs. 72 registers), so persist-warp achieves only 43% occupancy per SM while discrete-warp achieves 62%. Discrete-warp therefore assigns colors to more vertices simultaneously, leading to a greater likelihood of conflicts than persist-warp. Unlike our other applications, in graph coloring the cost of extra work largely offsets the benefit of increased concurrency. On scale-free datasets (without random permutation), our highest performance comes from the configuration with lower concurrency and less overwork (the persist-warp variant), which achieves lower absolute throughput but higher normalized throughput (and hence higher overall performance).
6.5. Worker Size and the Trade-off between Task- and Data-Parallelism Load Balancing
As discussed in Section 3, Atos enables the user to trade off between task-parallelism and data-parallelism load balancing by adjusting the worker size and FETCH_SIZE. Atos persist-CTA, like persist-warp, uses a persistent kernel to exploit task parallelism, but its task-parallel work units are fewer and larger (the size of a CTA), allowing more data parallelism within each CTA. In most cases, persist-CTA outperforms persist-warp in both normalized and absolute throughput; the exception is graph coloring on scale-free datasets, where it achieves only higher absolute throughput. Figure 4 illustrates this trade-off for BFS and PageRank on soc-LiveJournal (scale-free) and road_usa (mesh-like). We exclude graph coloring because its high register usage (72) and high shared memory usage (46 KB) restrict it to a single CTA size.
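The worker/FETCH_SIZE mechanism can be sketched with CPU threads standing in for GPU workers (a simplification of ours; the names are not the Atos API):

```python
import itertools
import threading

def run_workers(items, num_workers, fetch_size, process):
    """Each worker atomically reserves fetch_size items from a shared
    queue and processes the chunk; on the GPU the chunk would be handled
    data-parallel by the worker's threads (a warp or a CTA), while the
    workers themselves run task-parallel."""
    counter = itertools.count(0, fetch_size)
    lock = threading.Lock()          # stands in for an atomic fetch-and-add

    def worker():
        while True:
            with lock:
                start = next(counter)   # reserve the next chunk
            if start >= len(items):
                return                  # queue drained; worker retires
            for item in items[start:start + fetch_size]:
                process(item)           # data-parallel on the GPU

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

processed = []
run_workers(list(range(1000)), 4, 32, processed.append)
assert sorted(processed) == list(range(1000))
```

A larger FETCH_SIZE means fewer trips to the shared queue but coarser-grained load balancing; a smaller one balances load more finely at the cost of more queue contention, which is the trade-off the worker-size experiments explore.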
6.6. Kernel Strategy
From Section 3, the chief advantage of the persistent kernel is eliminating the overhead of kernel invocation, which matters most for fine-grained tasks that involve many small kernel launches. Based on the performance results in Table 1 and Figure 1, the performance gap between the persistent and discrete kernels is particularly large for BFS on mesh-like graphs, which require many small kernel launches due to their high diameter and small workload per iteration. Graph coloring on indochina-2004 also shows a large kernel launch overhead: with randomly permuted vertex IDs (see Section 6.4), Atos's persistent variant is 4.3x faster than its discrete variant.
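The launch-overhead argument can be captured with a toy cost model (our numbers are illustrative, not measurements):

```python
def modeled_time(frontier_work_us, launch_overhead_us=5.0, persistent=False):
    """Toy cost model (ours): a discrete-kernel BFS pays the launch
    overhead once per frontier; a persistent kernel pays it once total."""
    if persistent:
        return launch_overhead_us + sum(frontier_work_us)
    return sum(launch_overhead_us + w for w in frontier_work_us)

# A high-diameter mesh-like graph: thousands of tiny frontiers, so the
# per-launch overhead dominates the discrete-kernel strategy.
frontiers = [2.0] * 5000   # 5000 iterations, 2 us of real work each
assert modeled_time(frontiers, persistent=True) < modeled_time(frontiers)
```

Under this model the discrete strategy spends 25 ms on launches alone for the 5000 tiny frontiers, while the persistent kernel pays one launch, matching the BFS-on-mesh results above.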
7. Conclusion
In this paper, we present our task-parallel GPU dynamic scheduling framework, Atos, and analyze its performance across numerous design parameters on three case studies. Our analysis yields the following guidelines on which applications are well suited to a task-parallel framework such as Atos, and on which Atos configuration to use given an application's characteristics:

If the dynamic application either exhibits the small-frontier problem or has load imbalance, Atos will have a performance advantage.

If the application exhibits the small-frontier problem, it should be run with a persistent kernel.

If the application exhibits load imbalance, it should be run with task- and data-parallelism load balancing in tandem to achieve better performance. The optimal trade-off point varies across applications.

By relaxing the outer-loop dependency in the application, Atos increases concurrency at the cost of extra work due to misspeculation, or of a less predictable convergence rate. The optimal trade-off between the increased concurrency and this additional cost is application-dependent. When an application is naturally unordered (e.g., PageRank) or has the small-frontier problem (e.g., BFS on mesh-like datasets and graph coloring on scale-free datasets), the increased concurrency usually outweighs the cost. Conversely, on problems such as BFS on scale-free graphs or graph coloring on mesh-like graphs, the cost of extra work can hurt performance. The best way to reduce extra work is application-dependent and may include better load balancing (e.g., BFS) or reduced concurrency (e.g., graph coloring).