As the emerging trend in graph-based deep learning, Graph Neural Networks (GNNs) have recently attracted significant research attention from various domains. However, existing GNN implementations fail to keep pace with evolving GNN architectures, ever-increasing graph sizes, and growing node-embedding dimensionality, and thus suffer from unsatisfactory performance. To break this hurdle, we propose GNNAdvisor, an efficient runtime system that systematically accelerates GNN applications on GPUs. First, GNNAdvisor identifies graph-structure information (e.g., graph community) as a new driving force for GNN acceleration. Second, GNNAdvisor implements a novel and highly efficient group-based workload management scheme tailored for GNN computation to improve thread-level performance on GPUs. GNNAdvisor further capitalizes on the GPU memory hierarchy by coordinating GNN execution according to the characteristics of the GPU memory structure. Moreover, GNNAdvisor incorporates a Modeling & Estimating strategy that offers sufficient flexibility for automatic performance tuning across various GNN architectures and input datasets. Extensive experiments show that GNNAdvisor provides considerable average speedups over the state-of-the-art GNN execution frameworks Deep Graph Library, NeuGraph, and GunRock.
Graph Neural Networks (GNNs) have emerged at the front line of many graph-based deep learning tasks (e.g., node classification [kaspar2010graph, gibert2012graph, duran2017learning] and link prediction [chen2005link, kunegis2009learning, tylenda2009towards]). Compared with standard methods for graph analytics, such as random walks [grover2016node2vec, deepWalk] and graph Laplacians [luo2011cauchy, luo2009non, cheng2018deep], GNNs distinguish themselves with significantly higher accuracy [GCNConv, GINConv, GATConv] and better generality [SageConv]. In addition, well-trained GNNs [GCNConv, GINConv, SageConv, GATConv] can easily be applied to different types of graph structures or to dynamic graphs without much re-computation overhead. However, the performance of GNNs [wang2019dgl, pyG, ma2019neugraph, wang2016gunrock] fails to keep up with increasingly complicated GNN architectures (i.e., more layers and higher dimensionality in each layer), larger graph datasets, and higher-dimensional node embeddings.
The major reason behind their limited performance is the unique computing paradigm of GNNs, which contrasts with both standard neural networks [GCNConv, GINConv] and classic graph algorithms [page1999pagerank]. To generate the embedding for each node in a graph, a GNN combines graph-based operations (e.g., scatter and gather) for aggregating information from the node's neighbors with classic neural network computation (e.g., dense matrix-matrix multiplication (GEMM)) for updating/learning the embedding vector. These two phases are often referred to as Aggregation and Update, respectively. The aggregation phase is usually sparse in computation and highly irregular in memory access, while the update phase involves NN-operations that are dense in computation and regular in memory access.
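The two-phase computation can be sketched in a few lines of plain Python (an illustrative sketch only; the actual kernels run on GPUs, and the helper names and CSR layout here are assumptions, not the real implementation):

```python
# Minimal sketch of one GNN layer's two phases: a sparse, irregular
# neighbor aggregation followed by a dense GEMM-like update.
# The graph is in CSR form: row_ptr[v]..row_ptr[v+1] indexes v's neighbors.

def aggregate(row_ptr, col_idx, emb):
    """Sum-aggregate neighbor embeddings (sparse, irregular memory access)."""
    dim = len(emb[0])
    out = []
    for v in range(len(row_ptr) - 1):
        acc = [0.0] * dim
        for e in range(row_ptr[v], row_ptr[v + 1]):
            u = col_idx[e]
            for d in range(dim):
                acc[d] += emb[u][d]
        out.append(acc)
    return out

def update(agg, weight):
    """Dense NN update: a plain matrix-matrix product (regular access)."""
    rows, inner, cols = len(agg), len(weight), len(weight[0])
    return [[sum(agg[i][k] * weight[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]
```

The sketch makes the contrast visible: `aggregate` follows data-dependent indices through `col_idx`, while `update` touches memory in a fully regular pattern.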
To tackle the challenge of GNN workloads, previous works can be categorized into two broad types. The first type [wang2016gunrock, ma2019neugraph, sysMLROC] builds on popular graph processing systems and adds NN-operations. The second type [pyG, wang2019dgl], conversely, starts from existing deep learning frameworks and extends them to support graph operations. However, neither type can confidently handle the computational specialty of GNNs (mainly throttled by the aggregation phase), especially on GPU platforms. The NN-operations of the update phase, thanks to their regularity, are well-suited for GPU-based acceleration, and such optimizations have been well explored by prior research [tan2011fast, li2009note, li2012optimized] and industrial libraries such as cuBLAS [cublas] and cuDNN [cuDNN]. But the aggregation phase still has to rely on “outdated” graph operations developed for standard graph processing/analytics [wang2016gunrock, khorasani2014cusha], where each node carries only a single scalar attribute in contrast to the high-dimensional embedding vectors in GNNs. Moreover, these existing solutions are still preliminary and inevitably fall short in the following two major aspects.
Lack of optimizations based on GNN inputs.
Existing works on accelerating GNNs on GPUs fail to leverage input-level properties to guide more effective GNN optimization. While some prior efforts [balaji2018graph, rabbit-order] in the systems domain exploit graph properties to optimize traditional graph algorithms (e.g., PageRank [page1999pagerank]) on CPUs, the context of applying such optimizations on GPUs for GNNs is largely different.
Lack of configurability for performance tuning.
While today’s GNN datasets show significant diversity, most existing GNN frameworks offer users very limited options for seeking better performance: their major building blocks are either hidden or tightly packed, exposing only high-level callable interfaces (APIs) without tuning options. This leads to poor configurability and adaptability for efficiently handling ever-changing GNN architectures and inputs.
To this end, we propose GNNAdvisor, an efficient runtime system for GNN acceleration on GPUs. As shown in Figure 1, GNNAdvisor consists of several key components that facilitate GNN optimization and execution on GPUs. First, GNNAdvisor incorporates an input extractor to “squeeze” out the input-level information that can guide our system-level optimizations. Second, GNNAdvisor utilizes a kernel & runtime crafter to customize the GNN kernel and CUDA runtime settings through effective group-based workload management and memory optimizations, considering several performance-critical factors (e.g., GNN computation pattern and CUDA kernel block settings). Third, GNNAdvisor forms an optimization loop (consisting of the kernel & runtime crafter, GPU profiling, and a performance evaluator), which offers significant configurability for performance tuning and incorporates a Modeling & Estimating strategy to reduce manual effort in hyper-parameter exploration.
Overall, we make the following contributions.
We exploit the performance benefits from the GNN input-level properties, such as communities in graphs, node degree, and dimensionality of node embedding. Such information will guide our system-level optimizations on GPUs (e.g., node renumbering, group-based workload partitioning, and dimension-based workload sharing).
We propose a group-based workload generation and mapping technique tailored for GNN computation to balance intra-thread efficiency and inter-thread parallelism.
We carefully craft the GPU data placement to reduce the high-overhead memory operations (e.g., global memory access), which are the major obstacles for many GNN applications.
We also introduce a set of performance-related parameters for user tuning flexibility and incorporate a Modeling & Estimating strategy to automate the optimization process with minor manual efforts.
Comprehensive experiments demonstrate the strength of GNNAdvisor over state-of-the-art GNN execution frameworks.
3 Background and Related Works
In this section, we introduce the basics of graph neural networks (GNNs) and two major types of GNN frameworks: graph-based systems and deep learning frameworks.
3.1 Graph Neural Networks
Graph Neural Networks (GNNs) are becoming a major way of gaining insights from graph structures. A GNN generally includes several graph convolutional layers, each of which consists of a neighbor-aggregation step and a node-update step. It computes the embedding $h_v^{(k+1)}$ of each node $v$ at layer $k+1$ based on the embeddings at layer $k$, where $0 \le k < K$ for a $K$-layer GNN.
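The body of Equation 1, referenced below, appears to have been lost in extraction; a standard per-layer formulation consistent with the surrounding description (aggregation over the neighbor set $\mathcal{N}(v)$, followed by an NN-based update) is:

```latex
a_v^{(k+1)} = \mathrm{Aggregate}\left(\left\{\, h_u^{(k)} : u \in \mathcal{N}(v) \,\right\}\right),
\qquad
h_v^{(k+1)} = \mathrm{Update}\left( a_v^{(k+1)},\; h_v^{(k)} \right)
```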
As shown in Equation 1, $h_v^{(k)}$ is the embedding vector of node $v$ at layer $k$, and $a_v^{(k+1)}$ is the result of aggregating the neighbors' information (e.g., node embeddings). The aggregation method varies across GNNs: some methods depend purely on the properties of neighbor nodes, while others also leverage edge properties such as weights. The Update function is generally a single fully connected layer or a multi-layer perceptron (MLP) that applies NN-operations to both the aggregated neighbor result and the node's current embedding at layer $k$.
3.2 Graph-based Systems
Previous works [wang2016gunrock, sysMLROC, ma2019neugraph], such as GunRock, are built upon graph-based systems. These systems are usually equipped with a set of highly optimized graph-based operators, such as the advance and filter operators in GunRock and the SAGA (Scatter-ApplyEdge-Gather-ApplyVertex) model in NeuGraph. To support NN-based operations for node update, they usually try to adapt graph-based operators that work on nodes with a single scalar attribute to work on nodes with embedding vectors. However, this strategy has limitations: 1) it counts on a set of system-level optimizations for boosting performance while overlooking the performance benefits of input-level information; 2) it borrows the optimization principles of traditional graph-processing algorithms while ignoring the unique context (e.g., computation and memory-access patterns) of GNNs. In contrast, we foresee the potential benefits of GNNs' input-level information and gracefully leverage it to guide our system-level optimizations while respecting the unique flavor of GNN computation.
3.3 Deep Learning Framework
Another type of work, such as Pytorch-Geometric (PyG) [pyG] and Deep Graph Library (DGL) [wang2019dgl], extends existing NN-based frameworks (e.g., Pytorch [pytorch] and Tensorflow [tensorflow2015]) to support graph-based operators (e.g., Scatter and Gather). These frameworks allow users to easily build their own GNNs through the provided high-level interfaces without touching the complicated implementation details. To support such programmability at the lower implementation level, PyG uses the torch-scatter library [torch-scatter] as its major building block for aggregation operations, while DGL crafts its own aggregation kernels in CUDA/C++ and further incorporates SpMM-based kernel fusion for the types of GNNs that require SUM-based aggregation.
While this type of framework provides users with high-level abstractions that improve programmability, their shortcomings are also noteworthy: 1) their graph-based operation kernels (e.g., torch-scatter in PyG and SpMM-based kernel fusion in DGL) are not efficient at handling the diversity of GNN architectures, input graph structures, and embedding dimensionalities, due to excessive data movement and thread-synchronization overheads; 2) their major computation kernels (e.g., aggregation and update) are tightly wrapped and expose only high-level APIs to users, leaving no space for further performance tuning to meet users' demands.
4 Input Information
In this section, we detail the two types of GNN input information (graph properties and GNN architecture) collected by GNNAdvisor's input extractor, and identify their potential for guiding our system-level optimizations.
4.1 Graph Properties
Graph-based data (e.g., relations among objects) in non-Euclidean space carries rich information compared with Euclidean data (e.g., vectors). GNNAdvisor spots the potential of these graph properties to improve GNN performance on GPUs. Overall, GNNAdvisor leverages three major graph properties: node degree, node embedding, and graph community.
4.1.1 Node Degree
Real-world graphs generally follow a power-law distribution [graph-power-law] of node degrees. In parallel graph-processing systems [liu2015enterprise, Mizan, han2014experimental], such a distribution causes workload imbalance (Figure 2a). In GNN aggregation, this imbalance (Figure 2b) is further exacerbated because the node-embedding vector scales up the per-node workload compared with the single scalar value in graph processing, leading to even more severe performance degradation.
However, previous methods (graph-based systems and deep learning frameworks) lack viable solutions to this problem, because 1) they may choose the coarse-grained vertex-centric processing paradigm, which favors high programmability, or 2) they may overcorrect by taking very fine-grained edge-centric processing, which can lead to inferior overall performance (excessive thread-launching and synchronization overheads). GNNAdvisor, in contrast, fully exploits the node-degree property to facilitate group-based workload management (a middle ground between vertex-centric and edge-centric processing) and also uses it as a “hint” for performance tuning (trading off single-thread execution efficiency against multi-thread workload balance). We discuss this in more detail in Section 5.
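To make the imbalance concrete, a tiny sketch (plain Python; the helper name is hypothetical) shows how degree skew translates directly into per-thread workload skew under vertex-centric mapping, where each node's aggregation cost is its degree times the embedding dimension:

```python
def workload_stats(degrees, emb_dim):
    """Per-node aggregation cost under vertex-centric mapping:
    degree * embedding dimension. Degree skew becomes inter-thread
    workload imbalance, amplified by the embedding dimensionality."""
    loads = [d * emb_dim for d in degrees]
    return max(loads), sum(loads) / len(loads)
```

For a power-law-like degree list such as `[1, 2, 100]` with a 16-dimensional embedding, the busiest thread does roughly three times the average work of the whole group, and the gap grows linearly with the embedding dimension.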
4.1.2 Node Embedding
As the major difference between traditional graph processing and GNNs, node embedding reshapes the computation paradigm, memory layout, and optimization strategies. On one hand, node embedding invalidates optimizations well-suited for graph processing, such as a series of aggressive cache-based optimizations [lakhotia2017recall, xu2014graph]; on the other hand, it opens space for a new set of optimizations, such as workload sharing along the embedding dimension. To this end, GNNAdvisor utilizes the node-embedding size to facilitate a set of optimizations (e.g., dimension-based workload sharing) and performance tuning. We detail these in Section 5.4.
4.1.3 Graph Community
Graph community [fortunato2010community, lancichinetti2008benchmark, newman2013spectral] is a key feature of real-world graphs: a small group of nodes tends to hold “strong” intra-group connections (many edges) while maintaining “weak” connections (fewer edges) with the rest of the graph. Works in the graph processing/analytics domain [hendrickson2000graph, newman2013spectral] leverage graph community to optimize performance, e.g., community-based graph partitioning for parallel/distributed graph processing. In GNNAdvisor, we leverage graph community to improve data locality during GNN aggregation.
GNN aggregation features intensive yet highly irregular memory access. Improving memory performance is therefore the key to breaking the GNN performance bottleneck in its time-consuming aggregation phase. As exemplified in Figure 3, when a GNN applies neighbor aggregation to all nodes within a community, the loaded embedding of a node can be fully shared among all of its neighbors, thereby avoiding additional memory accesses. Following such a community pattern during aggregation improves the spatial and temporal locality of node embeddings: since nodes within a community are more likely to share neighbors, unnecessary memory accesses are reduced.
This idea sounds promising, but the effort to capitalize on its benefits is non-trivial. The major challenge is to “capture” the communities of a graph appropriately such that we can improve the computation and memory locality during aggregation. In GNNAdvisor, we leverage an efficient yet lightweight node renumbering (Section 6.1) to solve this problem.
4.2 GNN Architecture
The computation pattern of GNN aggregation also affects how we apply the optimization effectively. Specifically, the mainstream aggregation methods of GNNs can be simply categorized into two types: 1) order-independent aggregation (e.g., sum, min, and max) with only the embeddings of neighbor nodes, such as GCN [GCNConv]; 2) order-independent aggregation with special edge features (e.g., weights, and edge vectors) applied to each neighbor node, such as GIN [GINConv], GAT [GATConv].
For the first type of GNN, the common design practice is to reduce the node-embedding dimensionality before the neighbor aggregation at each graph convolution layer [GCNConv, pyG, wang2019dgl], which strikes a better balance between overall accuracy and model complexity. In this case, locality-aware node renumbering and memory organization largely lower the memory-access and computation overhead and contribute more to overall performance. The second type of aggregation, on the other hand, must work on full-dimensional node embeddings to compute the special edge features, leading to a large amount of data movement and computation. In this case, finer-grained workload sharing among threads benefits performance more. Such differences in GNN architecture are also critical for GNNAdvisor to lay out the corresponding workload management (Section 5) and memory organization (Section 6) that benefit overall runtime performance on GPUs.
5 Group-based Workload Management
GNNAdvisor’s kernel & runtime crafter features its group-based workload management with four techniques: group-based partitioning, leader-node scheme, block-based mapping, and dimension-based sharing.
5.1 Group-based Partitioning
Group-based workload partitioning is a novel workload-balancing technique tailored for GNNs on GPUs. It breaks the neighbors of a node into groups and assigns the aggregation workload of each group to a thread. Specifically, it relies on the input-level information (node degree and node-embedding size) to determine the size of each neighbor group and thus the aggregation workload assigned to each thread. As shown in Figure 4a, the neighbors of a node are divided into three groups with a pre-determined group size of 4. During aggregation, each thread handles the intra-group aggregation workload (i.e., element-wise accumulation of the neighbors' embedding vectors). After the intra-group aggregation finishes, an inter-group reduction gathers the final result for each node.
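A minimal sketch of the partitioning step (plain Python; the function name and CSR layout are assumptions, not the actual implementation):

```python
def make_groups(row_ptr, group_size):
    """Split each node's CSR neighbor range into fixed-size groups.
    Returns (start, end, node) triples; one group maps to one thread.
    The last group of a node may be smaller than group_size."""
    groups = []
    for v in range(len(row_ptr) - 1):
        start, end = row_ptr[v], row_ptr[v + 1]
        for g in range(start, end, group_size):
            groups.append((g, min(g + group_size, end), v))
    return groups
```

With group size 4, a node with 9 neighbors yields three groups (of 4, 4, and 1 neighbors), so its aggregation is spread across three threads instead of one.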
The benefits of applying group-based aggregation are two-fold: 1) compared with the coarser-grained node-centric aggregation (Figure 4b), which distributes workloads by nodes, group-based partitioning largely mitigates inter-thread imbalance by dividing workloads into groups of equal size; 2) compared with the finer-grained edge-centric aggregation (Figure 4b), which seemingly achieves high performance by using more threads for massive parallelism, the group-based solution avoids the overheads of managing excessive threads, which can hurt performance in many ways (e.g., resource contention and synchronization). To further exploit its performance benefits, we introduce an adjustable parameter, the group size (i.e., the number of neighbor nodes in each group), and detail its value selection in Section 7.1.
5.2 Leader-node Scheme
While group-based workload partitioning largely mitigates workload imbalance, costly thread-level synchronization still throttles overall performance. As shown in Figure 4a, to complete the aggregation of a node, each thread has to rely on high-overhead thread-level synchronization (i.e., atomic operations) to guarantee correctness under parallel execution.
GNNAdvisor, however, effectively eliminates unnecessary thread-level synchronization through a leader-node aggregation scheme. Each group of neighbor nodes maintains a leader that temporarily holds the result of the intra-group aggregation. Since each thread manages the aggregation of neighbor nodes within a single group, its execution can proceed in an atomic-free manner without inter-thread contention. After the intra-group aggregation, a wait-free inter-group reduction gathers the final results, since each thread can push its result to the target node as soon as it finishes its own intra-group aggregation.
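The leader-node scheme can be sketched as follows (illustrative Python; the real system runs these "threads" as CUDA threads, and the group triples follow a hypothetical (start, end, node) layout):

```python
def leader_aggregate(groups, col_idx, emb):
    """Each 'thread' (one group: a (start, end, node) triple over the
    CSR edge array) accumulates into its own private leader buffer,
    atomic-free. A final inter-group reduction gathers per-node results."""
    dim = len(emb[0])
    partial = []  # one private leader buffer per group
    for start, end, v in groups:
        acc = [0.0] * dim
        for e in range(start, end):
            u = col_idx[e]
            for d in range(dim):
                acc[d] += emb[u][d]
        partial.append((v, acc))
    result = {}
    for v, acc in partial:  # inter-group reduction into target nodes
        dst = result.setdefault(v, [0.0] * dim)
        for d in range(dim):
            dst[d] += acc[d]
    return result
```

The intra-group loop is the contention-free part; only the short reduction at the end touches shared per-node state, which is where the actual GPU kernel would need its (now much rarer) atomic updates.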
5.3 Block-based Mapping
To better utilize the GPU for our group-based GNN workloads, GNNAdvisor incorporates an efficient block-based workload mapping strategy. As shown in Figure 5, the workload of each node is handled by several threads according to its group-based workload partitioning. In addition, the threads within each GPU block (which usually consists of up to 1024 threads) can serve the neighbor aggregation of multiple nodes simultaneously. Meanwhile, each block owns a shared memory that can temporarily cache “hot-spot” data for fast access by all threads of the block. Similar to the leader-node pattern in intra-group aggregation, we use a leader thread per node (shown in Figure 5) to manage the final result updates from shared memory to global memory, which further reduces the overhead of expensive thread-level synchronization on global memory. This strategy also reduces the number of threads concurrently accessing memory, alleviating inter-thread contention.
Considering the memory and computation resource constraints of a block, as well as the input-level information (e.g., the number of nodes and edges), the number of threads per block should be determined to balance the thread efficiency (e.g., thread management, and resource sharing) at the intra-block level and parallelization at the inter-block level.
5.4 Dimension-based Sharing
GNNs distinguish themselves from traditional graph algorithms by computing on high-dimensional node embeddings. To tackle this challenge efficiently, we further offload the group-based workload from a single thread to a group of threads through dimension-based workload sharing. As shown in Figure 6a, the original group-based workload is distributed to three working threads, each managing a part (around 1/3) of the workload during node aggregation. This strategy has two benefits. First, it can efficiently execute node aggregation across a diverse range of node-embedding and hidden-layer dimensions, since the group-based workload is shared by multiple threads instead of a single thread. Second, it introduces another performance-related parameter, the number of working threads along the embedding dimension, for configurability.
We also identify the memory-access pattern as a key factor for even better performance: threads with consecutive IDs accessing consecutive memory addresses facilitate memory coalescing on GPUs. To this end, as shown in Figure 6b, we adjust the thread-to-workload mapping so that adjacent threads operate on adjacent dimensions.
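The two thread-to-dimension mappings can be sketched as follows (hypothetical helper in Python). On a GPU, the interleaved (strided) mapping is the coalesced one: at each step, adjacent threads touch adjacent addresses of the embedding vector.

```python
def dims_for_thread(t, num_workers, dim, interleaved=True):
    """Embedding dimensions handled by dimension-worker t of num_workers.
    Interleaved (strided) mapping: adjacent threads own adjacent
    dimensions at every step, so their loads coalesce on a GPU."""
    if interleaved:
        return list(range(t, dim, num_workers))
    # Chunked mapping: each worker owns one contiguous slice, so at a
    # given step adjacent threads touch addresses a whole chunk apart.
    chunk = (dim + num_workers - 1) // num_workers
    return list(range(t * chunk, min((t + 1) * chunk, dim)))
```

With 3 workers over an 8-dimensional embedding, interleaved threads 0/1/2 touch dimensions (0,1,2) on their first step, while chunked threads would touch (0,3,6), scattering the accesses.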
6 Memory Optimization
In this section, we detail the memory optimization used in our kernel & runtime crafter, including node renumbering and memory organization.
6.1 Community-aware Node Renumbering
Node renumbering “renames” each node ID of a graph to improve temporal and spatial locality in the GNN aggregation phase. It is based on the idea that the closeness of node IDs affects both WHEN nodes get processed (temporal adjacency) and WHERE they get handled (spatial proximity), such as blocks and SMs on a GPU.
Specifically, applying node renumbering takes three steps. First, we identify the communities that maximize the overall modularity of the graph [rabbit-order] and assign nodes their corresponding community IDs. Second, we traverse the nodes inside each community (e.g., with the Reverse Cuthill-McKee (RCM) algorithm [RCM-Algorithm]) to maximize neighbor sharing among nodes with consecutive IDs. These two steps yield a one-to-one mapping from old node IDs to new IDs. Note that this renumbering process is lightweight in computation and memory cost and can be effectively parallelized. Afterwards, nodes within each community are assigned numerically contiguous IDs, and nodes with consecutive IDs that share more neighbors are more likely to be processed with temporal and spatial locality. We quantify these benefits in Section 8.6.2.
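The per-community traversal step can be sketched with a plain RCM implementation (illustrative Python over an adjacency dict; the actual system applies this after community detection, and in parallel across communities):

```python
from collections import deque

def rcm_order(adj):
    """Reverse Cuthill-McKee on one community's adjacency dict:
    BFS from a minimum-degree node, visiting neighbors in increasing
    degree order, then reverse the visit order. Returns a dict mapping
    old node IDs to new, locality-friendly consecutive IDs."""
    order, seen = [], set()
    for seed in sorted(adj, key=lambda v: len(adj[v])):
        if seed in seen:
            continue  # start a new BFS for each connected component
        seen.add(seed)
        q = deque([seed])
        while q:
            v = q.popleft()
            order.append(v)
            for u in sorted(adj[v], key=lambda w: len(adj[w])):
                if u not in seen:
                    seen.add(u)
                    q.append(u)
    order.reverse()
    return {old: new for new, old in enumerate(order)}
```

The output is a permutation of the community's node IDs: nodes placed next to each other in the new numbering tend to be graph neighbors, so their aggregation workloads touch overlapping embeddings.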
6.2 Block-aware Memory Organizing
To exploit the benefits of shared memory, we tailor the GPU memory organization to support the block-based mapping (Section 5.3) of group-based workloads. As shown in Figure 5, each block can handle the aggregation workloads of several nodes. GNNAdvisor utilizes shared memory, a valuable but limited resource, to hold block-level aggregation results while reducing unnecessary accesses to global memory. After a block finishes all of its workload, the leader thread (shown in red in Figure 5) of each node within the block flushes the shared-memory result to global memory.
The detailed mapping algorithm is described in Algorithm 1. Specifically, each neighbor group mapped to a thread has three properties: the shared-memory address that holds the intra-group aggregation result of its target node, the global index of that target node, and a boolean leader flag indicating whether this group (thread) is responsible for flushing the result to global memory. Note that the number of nodes cached per block should be determined based on the shared-memory size of each block and the embedding size of each node. Algorithm 1 shows the major organizing routine, which covers several cases. First, if the current thread is the first thread of a block, we select it as a group leader and allocate shared memory for its aggregation target node. Second, if the current thread is in the middle of a block, we either 1) keep its target node's address in shared memory the same as its predecessor's if both aggregate towards the same target node, or 2) allocate new shared memory for its target node if they aggregate towards different target nodes.
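The case analysis of Algorithm 1 can be sketched as follows (illustrative Python; the property names, the per-block group ordering, and the offset-based addressing are assumptions):

```python
def assign_shared_mem(group_nodes, emb_dim):
    """For a block's sequence of neighbor groups (one target node per
    group/thread, groups of the same node assumed contiguous), assign
    each group a shared-memory offset for its target node's partial
    result, and mark one leader per node to flush shared memory to
    global memory."""
    addr, leader, next_off, prev = [], [], 0, None
    for i, node in enumerate(group_nodes):
        if i == 0 or node != prev:
            # First thread of the block, or a new target node:
            # allocate a fresh buffer and elect this thread as leader.
            addr.append(next_off)
            leader.append(True)
            next_off += emb_dim
        else:
            # Same target node as the predecessor: reuse its buffer.
            addr.append(addr[-1])
            leader.append(False)
        prev = node
    return addr, leader
```

For a block serving groups of nodes `[5, 5, 5, 7, 7]` with a 16-element embedding, only two shared-memory buffers are allocated, and exactly one leader per node performs the flush.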
7 Design Optimization
In this section, we introduce design optimization through GNNAdvisor's performance evaluator, which consists of Modeling and Estimating. As illustrated in Figure 7, the performance evaluator takes the input settings (Graph, GNN, and GPU parameters) captured by our input extractor to determine the optimizations executed by the kernel & runtime crafter.
Modeling measures the performance impact of the hyper-parameters of our system-level optimizations. Three major hyper-parameters are utilized.
Group size: balances the workloads and reduces the overhead of inter-thread synchronization (e.g., atomic operations). Increasing the group size reduces the frequency of thread synchronization and memory accesses. However, when the group size exceeds most node degrees, this benefit diminishes due to the irregular sizes of unfilled groups. Increasing the group size also places more work on a single thread, compromising parallelism. Conversely, aggressively decreasing the group size may underutilize each thread, since the workload cannot fill each thread's compute capability enough to offset the thread-launching cost.
Threads per block: balances intra-block and inter-block efficiency. Increasing the number of threads per block accommodates more groups in each thread block, but the inter-thread contention within each block becomes more severe. On the contrary, if we decrease it too much, the temporal and spatial locality of neighbor sharing diminishes. This parameter also affects the total number of blocks running in parallel, which impacts resource isolation and scheduling flexibility.
Dimension workers: partition the aggregation workloads along the node-embedding dimension and distribute them to multiple threads. Increasing the number of dimension workers benefits computation parallelism, which contributes to overall performance through reduced latency. Meanwhile, its value should trade off single-thread efficiency against multi-thread parallelism.
To estimate the runtime latency, we use a formula parameterized by the number of nodes, the number of edges, the maximum number of threads a block can hold, and the node-embedding dimension. A correction factor handles graphs with different node-degree variation, based on the standard deviation of the node degrees: the larger the standard deviation, the larger the factor, which we bound within an empirically determined range. Our insight is that performance improves as the group size approaches the typical node degree, whether from below (lower) or above (higher).
In addition, we should respect single-thread efficiency, which can be measured by the computation throughput (GB/s) of the SMs on GPUs. Meanwhile, we must respect the shared-memory size of each block: the shared memory consumed by a block (a function of the average node degree, the embedding dimension, and the size of the data type, 4 bytes for floating point) must not exceed the per-block capacity.
While graph structures vary from graph to graph, their underlying community structures are relatively fixed in size and shape. Based on this observation, we propose a community-based GNN profiling strategy, which largely eases the profiling effort. Specifically, we follow three major steps. First, we broadly collect the most typical graph-community sizes and randomly generate edge connections with 90%, 70%, and 50% density. Second, we evaluate them on GNNAdvisor with the most popular GNN hidden-layer embedding sizes (e.g., 16 in GCN and 256 in GIN), group sizes, and thread-per-block sizes. Note that this step also helps us explore the best hyper-parameter values (Equation 2) under different settings. Third, we select a set of appropriate hyper-parameters to approximate the overall performance for a given input setting. In this way, we can easily get a “sense” of the expected performance for a given GNN input.
To optimize the hyper-parameters, we 1) start from a set of randomly generated settings based on previous profiling results; 2) approximate their performance with the above method and keep the settings that deliver sufficiently high performance; 3) apply crossover to the kept settings to generate a new set of settings for the next iteration, beginning again from step 2. In general, 10 to 15 iterations of this process are enough to generate a “premium” setting with satisfactory performance that meets users' requirements.
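The search loop can be sketched generically (illustrative Python; the `estimate` function stands in for the Modeling & Estimating step, settings are tuples of hyper-parameter values, and all names are hypothetical):

```python
import random

def tune(candidates, estimate, iters=10, keep=4, seed=0):
    """Sketch of the iterative search: score settings with an estimated
    performance function, keep the best survivors, and crossover their
    per-parameter values to seed the next iteration."""
    rng = random.Random(seed)
    pop = list(candidates)
    for _ in range(iters):
        pop.sort(key=estimate)            # lower estimate = better
        best = pop[:keep]                 # survivors
        children = []
        for _ in range(len(pop) - keep):  # crossover parameter-by-parameter
            a, b = rng.sample(best, 2)
            children.append(tuple(rng.choice(pair) for pair in zip(a, b)))
        pop = best + children
    return min(pop, key=estimate)
```

Because survivors are always carried over, the best setting found so far can never be lost between iterations, which is what makes the short 10-to-15-iteration budget workable.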
8 Evaluation
In this section, we show the strength of GNNAdvisor through extensive experiments over various GNN models and datasets.
8.1 Experiment Setup
8.1.1 GNN Models
We choose two representative GNN models to cover the mainstream types of operations in the aggregation phase.
Graph Convolutional Network (GCN) [GCNConv] is one of the most popular GNN architectures. It has been widely adopted in node classification, graph classification, and link prediction tasks. Besides, it is also the key backbone network for many other GNNs, such as GraphSage [SageConv], and differentiable pooling (Diffpool) [diffpool]. Therefore, improving the performance of GCN will also benefit a broad range of GNNs.
Graph Isomorphism Network (GIN) [GINConv], another typical GNN, aims to distinguish graph structures that cannot be identified by GCN. GIN differs from GCN in its aggregation function, which weights the embedding values from the node itself. In addition, GIN is the reference architecture for many other advanced GNNs with more edge properties, such as the Graph Attention Network (GAT) [GATConv].
8.1.2 Datasets
We choose three different types of datasets to cover the vast majority of GNN inputs. Type I graphs are the typical datasets used by many previous GNN algorithm papers [GCNConv, GINConv, SageConv]. They are usually small in the number of nodes and edges but rich in node-embedding information with high dimensionality. These graphs validate pure algorithmic performance (e.g., the accuracy of link prediction). Type II graphs [KKMMN2016] are popular benchmark datasets for graph kernels and are built-in datasets for DGL [wang2019dgl] and PyG [pyG]. Each dataset consists of a set of small graphs that have only intra-graph edges and no inter-graph edges. These graphs are generally used for batched training or inference. Type III graphs [snapnets, GCNConv] are large in the number of nodes and edges. They demonstrate high irregularity in their structures, which is challenging for most existing GNN frameworks. Details of these datasets are listed in Table 1.
8.1.3 Baseline Implementations
Deep Graph Library (DGL) [wang2019dgl]
is the state-of-the-art GNN framework on GPUs, built upon the popular tensor-oriented platform PyTorch [pytorch]. For its low-level implementation, DGL is optimized with 1) kernel fusion to fuse the send and recv steps (relying on cuSPARSE [cusparse]), and 2) batch processing of nodes/edges by stacking their features. DGL significantly outperforms the other existing GNN framework [pyG] over various datasets on many mainstream GNN architectures. Therefore, we make an in-depth comparison with DGL in our evaluation (Section 8.2).
Pytorch-Geometric (PyG) [pyG] is another GNN framework in which users can define their own edge convolutions when building customized GNN aggregation layers. For its low-level implementation, PyG relies on the high-performance torch-scatter library [torch-scatter], a set of dedicated CUDA kernels optimized for scatter-and-gather operations, as the major building block for aggregation operations.
NeuGraph [ma2019neugraph] is a dataflow-centered GNN system on GPUs built on Tensorflow [tensorflow2015]. The major part of NeuGraph is its graph propagation engine, which mainly focuses on a set of traditional system-level optimizations, such as operation fusion and scheduling.
GunRock [wang2016gunrock] is a GPU-based graph-processing framework with state-of-the-art performance on traditional graph algorithms (e.g., PageRank). Recently, it released an implementation of GraphSage [SageConv], a 2-layer GCN with some additional features, such as sampling.
8.1.4 Platforms & Metrics
Our major evaluation platform is a server with an 8-core 16-thread Intel Xeon Silver 4110 processor [xeon] (clock frequency: 2.1 GHz; memory: 64 GB DDR4) and a Quadro P6000 [quardo] (3,840 CUDA cores; memory: 24 GB GDDR5X; peak memory bandwidth: 432 GB/s; peak single-precision performance: 12 TFLOPs). Besides, we use a Tesla V100 [tesla-v100] on the Nvidia DGX-1 system [dgx] for an additional study to demonstrate the generality of the proposed runtime system.
To measure the performance improvement, we calculate the average speedup of 100 measurements under the same setting. For detailed kernel metric analysis, we use CUDA kernel profiling metrics [nvprof-metrics] from NVProf [nvprof]. Note that we focus on evaluating GNNAdvisor for GNN inference, but GNNAdvisor's optimizations also apply to GNN training, which uses the same aggregation-update pattern in both its forward value propagation and its backward gradient propagation.
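The averaged-speedup methodology above can be sketched as a minimal timing harness. This is a hypothetical helper (the function names are ours, not part of GNNAdvisor or any framework's API), and real GPU timing would additionally require device synchronization before reading the clock.

```python
import time

def average_speedup(baseline_fn, optimized_fn, runs=100):
    """Average speedup of `optimized_fn` over `baseline_fn` across `runs`
    repeated measurements under the same setting (illustrative sketch)."""
    def mean_time(fn):
        total = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            fn()                      # the workload being timed
            total += time.perf_counter() - start
        return total / runs
    # Speedup = baseline latency / optimized latency.
    return mean_time(baseline_fn) / mean_time(optimized_fn)
```

In practice, the two callables would wrap the framework invocations being compared (e.g., one inference pass in DGL versus one in GNNAdvisor).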
8.2 Compared with DGL
As shown in Figure 8, GNNAdvisor achieves and speedup on average compared to DGL [wang2019dgl] over the three types of datasets for GCN and GIN, respectively. This is because GNNAdvisor can fully leverage input-level information, such as node degree, to guide system-level optimizations, whereas DGL only applies a set of generic optimizations without effectively using the input properties. We next provide detailed analysis for each type of dataset and give insights into the benefits based on low-level GPU kernel metrics.
Type I Graphs: The performance improvement over DGL is significantly higher for GCN (on average ) than for GIN (on average ). The major reason is their different GNN computation patterns. For GCN, node dimension reduction (DGEMM) is always placed before aggregation. This largely reduces data movement and thread-synchronization overhead during the aggregation phase, which can thus gain more benefit from GNNAdvisor's group-based workload management and memory optimizations for data-locality improvement. GIN, on the other hand, must finish its aggregation phase before node dimension reduction (an MLP, essentially a DGEMM operation). Thus, it cannot avoid high-volume memory access and data movement during the aggregation phase, and gains less from data locality and the GPU shared memory for fast, low-overhead memory access. However, our effective dimension-based workload sharing among threads can still handle these large-dimension cases for better performance. We also observe that when the node embedding dimensionality of GIN is relatively low, even if the size of the graph (the number of nodes and edges) is much larger, such as PPI, GNNAdvisor provides better performance ().
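The ordering argument above can be made concrete with a first-order data-movement model (an illustrative assumption of ours, not a measured quantity): per-edge aggregation traffic is proportional to the dimensionality in which aggregation runs, so reducing the dimension before aggregation (GCN) touches far less data than aggregating in the raw input dimension (GIN).

```python
def aggregation_traffic(num_edges, dim):
    # First-order model: each edge moves one dim-sized neighbor embedding.
    return num_edges * dim

def gcn_style_traffic(num_edges, in_dim, hidden_dim):
    # GCN: dimension reduction (DGEMM) runs first, so aggregation
    # operates on hidden_dim-sized embeddings.
    return aggregation_traffic(num_edges, hidden_dim)

def gin_style_traffic(num_edges, in_dim, hidden_dim):
    # GIN: aggregation must finish before the MLP, so it operates
    # on full in_dim-sized embeddings.
    return aggregation_traffic(num_edges, in_dim)
```

For a Type I-like input dimension of 1421 and a hidden dimension of 16, the GCN-style ordering moves roughly 89× less data per edge during aggregation under this model.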
Type II Graphs: The performance shows less difference between GCN () and GIN () on the same datasets, except for TWITTER-Partial, which has the highest node embedding dimensionality (1323) among Type II graphs. It is worth noticing that the speedup for GIN is consistently better than on Type I graphs, for two major reasons: 1) the node embedding dimensionality is much lower (average 66.5, excluding TWITTER-Partial) than for Type I (average 1421), which gains more performance benefit from the spatial and temporal data locality of our memory optimizations; 2) Type II graphs intrinsically have good locality in their graph structure, since each Type II dataset consists of small graphs with very dense intra-graph connections but no inter-graph edges, and nodes within each small graph are assigned consecutive IDs. Therefore, the performance gains from such graph-structure-level locality scale up when combined with GNNAdvisor's efficient workload and memory optimizations.
Type III Graphs: The speedup is also evident (average for GCN and average for GIN) on graphs with a large number of nodes and edges, such as amazon0505, because the high-overhead inter-thread synchronization and global memory access can be largely reduced through our group-based workload management and memory organization, which offer significant performance-tuning flexibility. Besides, our node renumbering strategy further facilitates efficient workload sharing among adjacent threads (working on a group of nodes) by improving the spatial and temporal locality of data such as node embeddings. On the dataset artist, which has the smallest number of nodes and edges within Type III, we notice a lower speedup for GIN. We find that artist has the highest standard deviation of graph-community sizes within Type III graphs, which makes it challenging to 1) use the graph community information to capture node temporal and spatial locality in the GNN aggregation phase, and 2) capitalize on such a community structure for guiding system-level optimizations (e.g., workload mapping) on GPUs, which generally have a fixed number of computation and memory units within each block/SM.
Kernel Metrics: To gain more insight into the performance strength of GNNAdvisor, we further measure two performance-critical (computation and memory) GPU kernel metrics via NVProf: Streaming Multiprocessor (SM) efficiency and cache (L1 + L2 + texture) hit rate. As shown in Figure 9a, GNNAdvisor achieves on average and higher SM efficiency than DGL for GCN and GIN, respectively, which indicates that our group-based workload management strategy strikes a good balance between single-thread efficiency and multi-thread parallelism, both crucial to the overall performance improvement. From Figure 9b, we can see that GNNAdvisor achieves on average and better cache hit rate than DGL for GCN and GIN, respectively, which also demonstrates the benefit of our memory optimizations.
8.3 Compared with PyG
We choose the Type II datasets for the comparison with PyG, since PyG has the most optimizations (e.g., mini-batch handling) for effectively processing such batched graphs with block-diagonal adjacency matrices. For Type I and III datasets, we find that PyG cannot deliver performance comparable to DGL's and thus do not include those results here.
As shown in Figure 10a, GNNAdvisor outperforms PyG with and speedup on average for GCN and GIN, respectively. For GCN, GNNAdvisor achieves significant speedup on datasets with high-dimensional node embeddings, such as TWITTER-Partial, through 1) node dimension reduction before aggregation and 2) workload sharing among node groups and dimensions. For GIN, GNNAdvisor reaches speedup on datasets with a higher average degree, such as DD, since GNNAdvisor can effectively distribute each node's workload along the embedding dimension to active threads while balancing single-thread efficiency and inter-thread parallelism. PyG, however, achieves inferior performance because 1) it has poor thread management for balancing workloads and controlling synchronization overhead, and 2) it heavily relies on the scatter-and-gather kernel, which may have performance advantages on some datasets but cannot generalize such benefits across various inputs due to its lack of performance-related configurability.
8.4 Compared with NeuGraph
For a fair comparison with NeuGraph, which is not open-sourced, we 1) use a GPU (Quadro P6000 [quardo]) that is comparable to NeuGraph's GPU (Tesla P100 [tesla-p100]) in performance-critical factors, such as GPU architecture (both are Pascal) and number of CUDA cores; and 2) use the same set of inputs as NeuGraph on the same GNN architecture [ma2019neugraph].
| Benchmark | NeuGraph (ms) | GNNAdvisor (ms) |
As shown in Table 2, GNNAdvisor outperforms NeuGraph by a significant margin ( to speedup) in terms of computation and memory performance. NeuGraph relies on general GPU kernel optimizations and largely ignores input information. Moreover, the optimizations in NeuGraph are "built-in" and "fixed" inside the framework, without performance-tuning flexibility. In contrast, GNNAdvisor leverages GNN-featured GPU optimizations and demonstrates the key contribution of input-level insights to system-level optimizations, while maintaining a significant amount of configurability for performance tuning.
8.5 Compared with GunRock
We make a performance comparison between GNNAdvisor and GunRock [wang2016gunrock] on GraphSage over the Type III graphs. Note that GraphSage is the only GNN implementation officially released by GunRock; it is essentially a 2-layer GCN with an additional neighbor-sampling step, which has been disabled for a fair comparison. As shown in Figure 10b, GNNAdvisor outperforms GunRock with to speedup. The speedup is especially prominent on graphs with a large number of nodes and high-dimensional node embeddings, such as soc-BlogCatalog, because 1) GunRock is composed of a set of graph-processing-oriented operators that are optimized for traditional graph algorithms, whose computation and memory patterns differ considerably from those of GNNs; and 2) GunRock applies optimizations that are generally applicable regardless of input differences, thus lacking input-aware optimizations to boost performance. In contrast, GNNAdvisor combines GNN-tailored optimizations with effective use of input-level properties, which delivers much better performance.
8.6 Optimization Analysis
8.6.1 Group-based Workload
We analyze the advantage of GNNAdvisor's group-based workload by exploring the impact of group-size, thread-per-block, and dimension-worker on GCN performance.
Group-size: From Figure 11a, we can see that as the group size increases, the running time of GNNAdvisor first decreases, since a larger group helps saturate the computation capability of each thread while enjoying the benefits of data locality and fewer atomic operations (i.e., lower inter-thread synchronization overhead). However, once the group size grows beyond a certain threshold (e.g., 32 for the artist dataset), each thread reaches its computation-capability upper bound, and further increasing the group size offers no additional benefit but only imposes more stress on each thread, leading to performance degradation. Besides, an even larger group size also results in fewer threads during execution, thus limiting the parallelism of GPUs.
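The group-based workload construction can be illustrated with a small sketch over a CSR graph. This is our simplified interpretation of the idea, not GNNAdvisor's actual kernel code: each node's neighbor range is split into fixed-size groups, and each group becomes one unit of work for a thread.

```python
def build_neighbor_groups(row_ptr, col_idx, group_size):
    """Split each node's CSR neighbor range into groups of at most
    `group_size` neighbors (illustrative sketch of group-based workload).
    Returns (node_id, start, end) index ranges into `col_idx`."""
    groups = []
    for node in range(len(row_ptr) - 1):
        begin, end = row_ptr[node], row_ptr[node + 1]
        for s in range(begin, end, group_size):
            # One group = one thread's unit of work over a neighbor slice.
            groups.append((node, s, min(s + group_size, end)))
    return groups
```

A small group size yields many fine-grained work units (more parallelism, more atomics); a large one concentrates work on fewer threads, matching the trade-off discussed above.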
| Processor | Architecture | SMs | CUDA Cores | Frequency | Throughput | Cache | Max. Mem. | Mem. B/W |
| Quadro P6000 | Pascal | 30 | 3,840 | 1.506 GHz | 12 TFLOPs | 3 MB L2 | 24 GB | 432 GB/s |
| Tesla V100 | Volta | 80 | 5,120 | 1.530 GHz | 14 TFLOPs | 6 MB L2 | 16 GB | 900 GB/s |
Thread-per-block: As shown in Figure 11b, the performance impact of thread-per-block follows a pattern similar to that of the group size above. Increasing thread-per-block first improves the overall performance, then compromises it once it crosses a certain threshold determined by input-level properties (e.g., node degrees and node embedding size). For example, on the com-amazon dataset we reach the "optimal" performance when thread-per-block is 128; placing more threads on the same block negates the speedup because of exacerbated inter-thread contention within the block.
Dimension-worker: As shown in Figure 11c, the dimension-worker setting has a more evident performance impact than the above two factors. When the number of dimension workers reaches around 16, the performance on Type III datasets reaches its optimum, balancing single-worker efficiency and multi-worker parallelism. An even larger number of dimension workers does more harm than good, since some launched threads get no dimension to work on while others stay "hungry" due to insufficient workload. In this case, the overall performance suffers from excessive kernel-thread launching and thread underutilization.
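A simple cyclic assignment of embedding dimensions to workers (a hypothetical sketch, not necessarily GNNAdvisor's exact scheme) illustrates both the dimension sharing and the "hungry worker" effect described above:

```python
def dims_for_worker(worker_id, num_workers, embedding_dim):
    """Cyclic dimension assignment: worker i handles dimensions
    i, i + num_workers, i + 2*num_workers, ... (illustrative sketch)."""
    return list(range(worker_id, embedding_dim, num_workers))
```

With 16 workers and a 16-dimensional embedding, each worker gets exactly one dimension; with only 8 dimensions, workers 8 through 15 receive no work at all, matching the underutilization behavior discussed above.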
8.6.2 Node Renumbering Benefits
We demonstrate the benefit of the node renumbering optimization by profiling Type III datasets for GCN and GIN. As shown in Figure 12a, effectively renumbering the nodes within a graph brings up to and speedup for GCN and GIN, respectively, since it increases the spatial and temporal data locality during GNN aggregation.
To quantify the performance benefits of node renumbering, we extract a detailed GPU kernel metric – memory access, in terms of bytes read from and written to DRAM. From Figure 12b, we can see that node renumbering effectively reduces memory access overhead (on average 40.62% for GCN and 42.33% for GIN) at runtime, since loaded node embeddings are more likely to be shared among nodes with consecutive IDs. We also notice one input that benefits less from our optimization – artist – since 1) the community sizes inside artist display a large variation (high standard deviation), making it challenging to capture neighboring adjacency and locality; and 2) such variation also hinders system-level optimizations from capitalizing on the renumbering benefits through effective mapping of computation and memory resources onto the underlying GPU hardware.
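The renumbering idea can be sketched as follows, assuming community labels for each node are already available (community detection itself is outside this sketch, and the function is a simplified stand-in for the actual renumbering strategy): nodes in the same community receive consecutive new IDs, so their embeddings land near each other and are more likely to be shared once loaded.

```python
def renumber_by_community(communities):
    """Assign consecutive new IDs within each community.

    `communities` maps old node ID -> community label (assumed given).
    Returns a dict mapping old ID -> new ID such that members of the
    same community occupy a contiguous ID range (illustrative sketch)."""
    # Sort nodes by (community label, old ID) so each community
    # forms one contiguous block in the new numbering.
    order = sorted(communities, key=lambda n: (communities[n], n))
    return {old: new for new, old in enumerate(order)}
```

After renumbering, aggregation over a community touches a contiguous slice of the embedding array, which is the locality effect measured in Figure 12b.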
8.6.3 Block-level Optimization Benefits
We show the advantage of our block-level optimizations (block-based workload mapping (Section 5.3) and block-aware memory organizing (Section 6.2)) by analyzing two kernel metrics – atomic-operation reduction and DRAM access reduction – on three large graphs. As shown in Figure 12c, GNNAdvisor's block-level optimizations reduce atomic operations and DRAM memory access by 47.85% and 57.93% on average compared with a baseline without block-level optimizations. This result also demonstrates that 1) the group-based workload and its block-based mapping effectively eliminate a large portion of the high-overhead inter-thread synchronization – atomic operations; and 2) utilizing shared memory avoids a significant amount of unnecessary, costly DRAM access.
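A back-of-the-envelope model (our illustrative assumption, not profiled data) shows why block-local accumulation cuts global atomic operations: without it, every neighbor contribution updates the output in global memory; with it, each neighbor group first reduces within shared memory and issues only a single global atomic per group.

```python
import math

def global_atomics_per_node(num_neighbors, group_size, block_local=True):
    """Count global atomic updates needed to aggregate one node's
    neighbors (first-order model, not measured counts)."""
    if not block_local:
        # One global atomic per neighbor contribution.
        return num_neighbors
    # One shared-memory reduction per group, then one global atomic
    # per group to merge the partial result.
    return math.ceil(num_neighbors / group_size)
```

For a node with 64 neighbors and a group size of 16, this model predicts 64 global atomics without block-level accumulation versus only 4 with it, consistent in spirit with the measured atomic-operation reduction.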
8.7 Case Studies
Hidden Dimensions of GNN
In this experiment, we analyze the impact of the GNN architecture in terms of hidden-dimension size for GCN and GIN. As shown in Figure 13a, as the hidden dimension of GCN increases, the running time of GNNAdvisor increases as well, due to more computation (e.g., additions) and memory operations (e.g., data movements) during the aggregation phase and a larger node embedding matrix during the node update phase. Meanwhile, GIN shows a sharper increase in latency than GCN (Figure 13b), mainly because of its larger number of layers (GCN: 2 vs. GIN: 5), which amplifies the effect of such changes.
Performance on Tesla V100
To demonstrate the potential of GNNAdvisor in modern data-center environments, we showcase its performance on an enterprise-level GPU – Tesla V100 [tesla-v100]. A detailed comparison of the Tesla V100 and the Quadro P6000 is listed in Table 3. As shown in Figure 13c, GNNAdvisor scales well to such a high-end device, achieving and speedup compared with the P6000 for GCN and GIN, thanks to more computation resources (e.g., 80 vs. 30 SMs, 5,120 vs. 3,840 CUDA cores, and 14 vs. 12 TFLOPs peak throughput) and higher memory bandwidth (900 vs. 432 GB/s). This comparison shows that GNNAdvisor adapts well to more advanced GPU hardware configurations for better performance. Moreover, we foresee that GNNAdvisor can be extended to multi-GPU or distributed data-center settings, benefiting overall performance by improving single-machine/GPU efficiency.
In this work, we propose GNNAdvisor, an efficient GNN runtime system that overcomes the issues in previous works (e.g., graph-based systems and deep learning frameworks). Specifically, GNNAdvisor incorporates input-level information (e.g., graph properties), system-level optimizations (e.g., group-based workload management and memory optimizations), performance-tuning options (e.g., design parameters), and a Modeling & Estimating strategy. Overall, GNNAdvisor provides users with a handy tool to accelerate GNNs on GPUs systematically and comprehensively.