TC-GNN with Pytorch integration
Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, demonstrate great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to the highly sparse and irregular graph-based operations. To this end, we propose, TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "Sparse" GNN computation with "Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workload. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the Pytorch framework for ease of programming. Rigorous experiments show an average of 1.70X speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.READ FULL TEXT VIEW PDF
TC-GNN with Pytorch integration
Over the recent years, with the increasing popularity of graph-based learning, graph neural networks (GNNs) (Kipf and Welling, 2017; Xu et al., 2019; Thekumparampil et al., 2018) become dominant in the computing of essential tasks across various domains, including e-commerce, financial services, etc. Compared with standard methods for graph analytics, such as random walk (Grover and Leskovec, 2016; Perozzi et al., 2014; Huang et al., 2021) and graph laplacians (Luo et al., 2011, 2009; Cheng et al., 2018), GNNs highlight themselves with significantly higher accuracy (Kipf and Welling, 2017; Xu et al., 2019; Veličković et al., 2018) and better generality (Hamilton et al., 2017). From the computation perspective, GNNs feature an interleaved execution phase of both graph operations (scatter-and-gather (Gonzalez et al., 2012)) at the Aggregation phase and Neural Network (NN) operations (matrix multiplication) at the Update phase. Our experimental studies further show that the aggregation phase which involves highly sparse computation on irregular input graphs generally takes more than 80% running time for both GNN training and inference. Existing GNN frameworks, e.g., Deep Graph Library (Wang et al., 2019) and Pytorch-Geometric (Fey and Lenssen, 2019a), are mostly built upon the popular NN frameworks that are originally optimized for dense operations, such as general matrix-matrix multiplication (GEMM). To support sparse computations in GNNs, their common strategy is to incorporate sparse primitives (such as cuSPARSE (Nvidia, )) for their backend implementations. However, cuSPARSE leverages the sparse linear algebra (LA) algorithm which involves lots of high-cost indirect memory accesses on non-zero elements of a sparse matrix. Therefore, cuSPARSE cannot enjoy the same level of optimizations (e.g., data reuse) as its dense counterpart, such as cuBLAS (Nvidia, ). Moreover, cuSPARSE is designed to only utilize on CUDA core. Therefore, it cannot benefit from the recent technical advancement on GPU hardware features, such as Tensor Core Unit (TCU), which can significantly boost the GPU performance of dense LA algorithms (e.g.
This work focuses on exploring the potentials of TCU for accelerating such GNN-based graph learning. We remark that making TCU effective for general GNN computing is a non-trivial task. Our initial study shows that naively applying the TCU to sparse GNN computation would even result in inferior performance compared with the existing sparse implementations on CUDA core. There are several challenges. First, directly resolving the sparse GNN computing problem with the pure dense GEMM solution is impractical due to the extremely large memory cost (, where is the number of nodes). Besides, traversing the matrix tiles already known to be filled with all-zero elements would cause excessive unnecessary computations and memory access. Second
, simply employing TCU to process non-zero matrix tiles of the sparse graph adjacency matrix would still waste most of the TCU computation and memory access efforts. This is because TCU input matrix tiles are defined with fixed dimension settings (e.g.,
), whereas the non-zero elements of a sparse graph adjacency matrix are distributed irregularly. Thus, it requires intensive zero-value padding to satisfy such a rigid input constraint.Third, even though the recent CUDA release update enables TCU to exploit the benefit of certain types of sparsity (Nvidia, ), it only supports blocked SpMM, where non-zero elements must be first fit into well-shaped blocks and the number of blocks must be the same across different rows. Such a rigid input restriction makes it hard to handle highly irregular sparse graphs from real-world GNN applications efficiently.
To this end, we introduce, TC-GNN, the first TCU-based GNN acceleration design on GPUs. Our key insight is to let the input sparse graph fit the dense computation of TCU. At the input level, instead of exhaustively traversing all sparse matrix tiles and determine whether to process each tile, we develop a new sparse graph translation (SGT) technique that can effectively identify those non-zero tiles and condense non-zero elements from these tiles into a fewer number of “dense” tiles. Our major observation is that neighbor sharing is very common among nodes in real-world graphs. Therefore, applying SGT can effectively merge the unnecessary data loading of the shared neighbors among different nodes to avoid high-cost memory access. Our SGT is generally applicable towards any kind of sparse pattern of input graphs and can always yield the correct results as the original sparse algorithm. At the kernel level, for efficiently processing GNN sparse workloads, TC-GNN exploits the benefits of CUDA core and TCU collaboration. The major design idea is that the CUDA core which is more excel at fine-grained thread-level execution would be a good candidate for managing memory-intensive data access. While TCU which is more powerful in handling simple arithmetic operations (e.g., multiplication and addition) can be well-suited for compute-intensive GEMM on dense tiles generated from SGT. At the framework level, we integrate TC-GNN with the popular Pytorch (Paszke et al., 2019) framework. Thereby, users only need to interact with their familiar Pytorch programming environment by using TC-GNN APIs. This can significantly reduce extra learning efforts meanwhile improving user productivity and code portability across different platforms.
To sum, we summarize our contributions as follows:
We conduct a detailed analysis () of several existing solutions (e.g., SpMM on CUDA core) and identify the potentials of using TCU for accelerating the sparse GNN workloads.
We introduce a sparse graph translation technique (). It can make the sparse and irregular GNN input graphs easily fit the dense computing of TCU for acceleration.
We build a TCU-tailored GPU kernel with effective CUDA core and TCU collaboration (). It consists of a novel two-level workload mapping strategy for computation optimization and a TCU-optimized dataflow design for memory access optimization.
We deliver an end-to-end GNN framework design with seamless integration with the popular Pytorch framework for high programmability and configurability.
Extensive experiments show the significant speedup (on average 1.70) over the state-of-the-art GNN computing framework, Deep Graph Library, across various mainstream GNN models and dataset settings.
Graph neural networks (GNNs) are an effective tool for graph-based machine learning. The detailed computing flow of GNNs is illustrated in Figure 1
. GNNs basically compute the node feature vector (embedding) for nodeat layer based on the embedding information at layer (), as shown in Equation 1,
where is the embedding vector for node at layer ; is the aggregation results through collecting neighbors’ information (e.g., node embeddings); is the neighbor set of node . The aggregation method and the order of aggregation and update could vary across different GNNs. Some methods (Kipf and Welling, 2017; Hamilton et al., 2017) just rely on the neighboring nodes while others (Veličković et al., 2018)
also leverage the edge properties that are computed by applying vector dot-product between source and destination node embeddings. The update function is generally composed of standard NN operations, such as a single fully connected layer or a multi-layer perceptron (MLP) in the form of, where and are the weight and bias parameter, respectively. The common choices for node embedding dimensions are 16, 64, and 128, and the embedding dimension may change across different layers. After several iterations of aggregation and update (i.e., several GNN layers), we will get the output feature embedding of each node, which can usually be used for various downstream graph-based deep learning tasks, such as node classification (Kaspar and Horst, 2010; Gibert et al., 2012; Duran and Niepert, 2017) and link prediction (Chen et al., 2005; Kunegis and Lommatzsch, 2009; Tylenda et al., 2009).
The sparse computing in the aggregation phase is generally formalized as the sparse-matrix dense-matrix multiplication (SpMM), as illustrated in Figure 2(a), and is handled by many sparse libraries (e.g., cuSPARSE (Nvidia, )) in many state-of-the-art GNN frameworks (Wang et al., 2019, 2021). These designs only count on GPU CUDA cores for computing, which waste the modern GPUs with diverse computing units, such as the Tensor Core Unit (TCU). Specifically, we formalized the neighbor aggregation as SpMM-like operations (Equation 2)
where is the graph adjacency matrix stored in CSR format. is a node feature embedding matrix stored in dense format. is the number of nodes in the graph, and is the size of node feature embedding dimension; is the elementwise multiplication and is the standard matrix-matrix multiplication; is the edge feature matrix in CSR format and can be computed by SDDMM-like operations (Equation 3), as illustrated in Figure 2(b).
Note that the computation of is optional in GNNs, which is generally adopted by Attention-based Graph Neural Network in Pytorch (Thekumparampil et al., 2018) for identifying more complicated graph structural information. Other GNNs, such as Graph Convolutional Network (Kipf and Welling, 2017), Graph Isomorphism Network (Xu et al., 2019), only use the graph adjacency matrix for neighbor aggregation.
In the most recent GPU architectures (since Volta (Nvidia, )), NVIDIA announced a new type of computing unit, Tensor Core Unit (TCU), for accelerating dense deep-learning operations (e.g., Dense GEMM). A GPU Streaming-Multiprocessor (w/ TCU) is illustrated in Figure 3. Note that FP64, FP32, INT, and SFU are for double-precision, single-precision, integer, and special function units, respectively. Different from scalar computation on CUDA Cores, TCU provides tile-based matrix-matrix computation primitives on register fragments, which can deliver more than throughput improvement. In particular, TCU supports the compute primitive of , where and are required to be a certain type of precision (e.g., half, TF-32), while and are stored in FP32. Depending on the data precision and GPU architecture version, the matrix size (MMA shape) of , , and should follow some principles (NVIDIA, ). For example, TF-32 TCU computing requires and . In the recent CUDA release (>=) on Ampere (>=), TF-32 serves as a good alternative of float/double on TCU-based GPU computing for modern deep-learning applications, according to NVIDIA’s in-depth studies (NVIDIA, ).
TCU can be utilized in several ways. The simplest way is to call cuBLAS (Nvidia, ) by using the cublasSgemmEX API. The second way is to call the Warp Matrix Multiply-Accumulate (WMMA) (nvcuda::wmma) API (Nvidia, ) in CUDA C++ to operate TCU directly. There are four major types of operations (Listing 1). wmma::fragment defines the input matrix tile for TCU computation. Each fragment consists of thread-local registers from a warp of threads. wmma::load_matrix_sync loads the input matrix tiles from global/shared memory to register fragments. wmma::mma_sync executes the matrix multiplication on loaded matrix tiles in register fragments. Finally, wmma::store_matrix_sync moves the results from registers to global/shared memory.
Since the appearance of the TCU, research efforts have been devoted to accelerating high-performance computing workloads with TCU. Ahmad et al. (Abdelfattah et al., ) process the batched small-size GEMM on TCU for acceleration. Boyuan et al. (Feng et al., 2021) introduce GEMM-based scientific computing on TCU with extended precision and high performance. These prior efforts mostly use the TCU in the dense applications that TCU is initially designed for, while TC-GNN jumps out of the scope defined by TCU designer by accelerating the sparse GNN operations using TCU.
In this section, we will discuss the major technical thrust for us to leverage TCU for accelerating sparse GNN computation. We use the optimization of SpMM as the major example in this discussion, and the acceleration of SDDMM would also benefit from similar optimization principles. We first characterize the existing GNN computation solutions, including SpMM on CUDA core, Dense GEMM on CUDA core/TCU, and a hybrid Sparse-Dense solution. Then we give insights based on pros/cons analysis and our motivation.
|.1em.05em.05em Dataset||Aggr. (%)||Update (%)||Cache(%)||Occ.(%)|
|.1em.05em.05em Solution||Mem. Consumption||Effective Mem. Access||Computation Intensity||Effective Computation|
As the major components of sparse linear algebra operation, SpMM has been incorporated in many off-the-shelf libraries (26; E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen (1999); W. Bosma, J. Cannon, and C. Playoust (1997); 16; G. Huang, G. Dai, Y. Wang, and H. Yang (2020)). GE-SpMM (Huang et al., 2020) accelerates GNN computations on GPU through a hand-optimized CUDA kernel with coalesced row caching and coarse-grained warp merging to improve the computation parallelism and memory performance. The close-sourced cuSPARSE (Nvidia, ) library developed by NVIDIA is the most popular solution and it can deliver state-of-the-art performance for most GPU-based SpMM computation. cuSPARSE has also been widely adopted by the many GNN computing framework, such as Deep Graph Library (DGL) (Wang et al., 2019), as the backend for the sparse neighbor aggregation operations. To understand its characters, we profile DGL on one layer of a GCN (Kipf and Welling, 2017) model (neighbor aggregation + node update) on NVIDIA RTX3090. We report two key kernel matrices for only neighbor aggregation kernel, including L1/texture cache hit rate () and the achieved Streaming-Multiprocessor (SM) occupancy (). We select three representative GNN datasets: Cora with 3,327 nodes, 9,464 edges, and 3,703 node embedding dimensions; Citeseer with 2,708 nodes, 10,858 edges, and 1,433 dimensions; Pubmed with 19,717 nodes, 88,676 edges, and 500 dimensions. From Table 1, we have several observations: First, the aggregation phase usually dominates the overall execution of the GNN execution. From these three commonly used GNN datasets, we can see that the aggregation phase usually takes more than 80% of the overall execution time, which demonstrates the key performance bottleneck of the GNNs is to improve the performance of the sparse neighbor aggregation. Second, sparse operations in GNNs show very low memory performance. The column Cache of Table 1 shows GNN sparse operations could not well benefit from the GPU cache system, thus, showing a low cache-hit ratio (around 37%) and frequent global memory access. Third, sparse operations of GNNs show very inefficient computation. As described in the column Occupancy of Table 1, sparse operations of GNNs could hardly keep the GPU busy because 1) its low computation intensity (the number of non-zero elements in the sparse matrix is generally small); 2) its highly irregular memory access for fetching rows of the dense matrix during the computation, resulting in memory-bound computation; 3) it currently can only leverage CUDA core for computation, which naturally has limited throughput performance. On the other side, this study also points out several potential directions of improving the SpMM performance on GPUs, such as improving the computation intensity (e.g., assigning more workload to each thread/warp/block), boosting memory access efficiency (e.g., crafting specialized memory layout for coalesced memory access), and breaking the computation performance ceiling (e.g., using TCU).
While the Dense GEMM is mainly utilized for dense NN computations (e.g., linear transformation and convolution), it can also be leveraged for GNN aggregation under some circumstances. For example, when an input graph has a very limited number of nodes, we can directly use the dense adjacency matrix of the graph and accelerate the intrinsically sparse neighbor aggregation computation on CUDA core/TCU by calling cuBLAS (Nvidia, ). However, such an assumption may not hold even for medium-size graphs in real-world GNN applications.
|.1em.05em.05em Dataset||# Nodes||# Edges||Memory||Eff.Comp|
As shown in Table 3, for these selected datasets, the memory consumption of their dense graph adjacent matrix (as a 2D float array) would easily exceed the memory constraint of today’s GPU (<100GB). Even if we assume the dense adjacent matrix can fit into the GPU memory, the extremely low effective computations (last column of Table 3) would also be a major obstacle for us to achieve high performance. We measure the effective computation as , where is the number of the non-zero elements (indicating edges) in the graph adjacent matrix and is the number of nodes in the graph. The number of is tiny in comparison with the . Therefore, computations and memory access on zero elements are wasted.
Another type of work (Kurt et al., 2020; Nvidia, ) takes the path of mixing the sparse control (tile-based iteration) with Dense GEMM computation. They first apply a convolution-like (2D sliding window) operation on the adjacent matrix and traverse all possible dense tiles that contain non-zero elements. Then, for all identified non-zero tiles, they invoke GEMM on CUDA Core/TCU for computation. However, this strategy has two shortcomings. First, sparse control itself would cause high overhead. Based on our empirical study, the non-zero elements are highly scattered on the adjacent matrix of a sparse graph. Therefore, traversing all blocks in a super large adjacent matrix would be time-consuming. Second, the identified sparse tiles would still waste lots of computations. The irregular edge connections of the real-world graphs could hardly fit into these fixed-shape tile frames. Therefore, most of the dense tiles would still have very low occupation (few non-zero elements in each tile).
Inspired by the above studies, we make several key design choices in order to achieve high-performance sparse GNN operations. 1) At the algorithm level, we choose the hybrid sparse-dense solution as the starting point. This can give us more flexibility for optimizations at the sparse control (e.g., traversing fewer tiles) and dense computation (e.g., increasing the effective computation/memory access when processing each tile), 2) At the GPU kernel level, we employ the shared memory as the key space for GPU kernel-level data management. It can help us to re-organize the irregular GNN input data in a more “regularized” way such that both the memory access efficiency and computing performance can be well improved. 3) At the hardware level, we choose TCU as our major computing unit since it can bring significantly higher computing throughput performance in comparison with CUDA Core. This also indicates the great potential of using TCU for harvesting more performance gains. Finally, we crystallize our idea into TC-GNN that effectively coordinates the execution of GNN sparse operations on dense TCU. We show a brief qualitative comparison among TC-GNN and the above three solutions in Table 2 and we justify these benefits through a detailed discussion of TC-GNN in the next two sections. Note that Memory Consumption is the size of memory used by the sparse/dense graph adjacency matrix; The Effective Memory Access is the ratio between the size of the accessed data that is actually involved in the later computation and the total size of data being accessed; The Computation Intensity is the ratio of the number of computing operations versus the data being accessed; The Effective Computation is the ratio between the operations for generating the final result and the total operations.
We will detail TC-GNN, including three algorithmic designs: Sparse Graph Translation, Sparse (SpMM-like) Neighbor Aggregation, and Sparse (SDDMM-like) Edge Feature Computing.
As the major component of TC-GNN, we propose a novel Sparse Graph Translation (SGT) technique to facilitate the TCU acceleration of GNNs. Our core idea is that the pattern of the graph sparsity can be well-tuned for TCU computation through effective graph structural manipulation meanwhile guaranteeing the output correctness.
Specifically, we condense the highly-scattered neighbor ids without losing key information (e.g., edge connections). As exemplified in Figure 4(a) and (b), we take the regular graph in CSR format as the input and condense the columns of each row window (in the red-colored rectangular box) to build TCU blocks () (a.k.a., the input operand shape of a single MMA instruction), in the orange-colored rectangular box. is the row pointer array is the edges of each node stored continuously. In this paper, we demonstrate the use of standard MMA shape for TF-32 of TCU on Ampere GPU architecture, and other MMA shapes (NVIDIA, ) can also be used if different computation precision (e.g., half and int8) and GPU architecture (e.g., Turing) are specified.
Our sparse graph translation scheme takes several steps for processing each row window, as detailed in Algorithm 1 and visualized in Figure 4(c). Note that winPartition is an array for maintaining the number of TC blocks in each row window. edgeToCol is an array for maintaining the mapping between the edges and their corresponding position in the graph after SGT. We choose the size of the row window (=TC_BLK_H) and column width (TC_BLK_W) according to TCU MMA specification (e.g., TC_BLK_H=16, TC_BLK_W=8 in TF-32). After condensing the graph within each row window, the time complexity of sliding the can be reduced from to only , where is the total number of nodes in the graph and is the size of the unique neighbor within the current row window, which equals in Algorithm 1. Besides, the density (computation intensity) of each identified TCU block can be largely improved. Considering the case in Figure 4, after the sparse graph translation, we can achieve higher density on individual TCU blocks (Figure 4(b)) compared with the original one (Figure 4
(a)). Note that SGT is applicable for both the SpMM and SDDMM in GNN sparse operations, and it can be easily parallelized because the processing of individual row windows is independent of each other. Besides, the sparse graph translation only needs to execute once and its result can be reused across many epochs/rounds of GNN training/inference.
Neighbor Aggregation The major part of GNN sparse computing is the neighbor aggregation, which can generally be formalized as SpMM operations by many state-of-the-art frameworks (Wang et al., 2019). And they employ the cuSPARSE (Nvidia, ) on CUDA core as a black-box technique for supporting sparse GNN computation. In contrast, our TC-GNN design targets at TCU for the major neighbor aggregation computation which demands a specialized algorithmic design. TC-GNN focuses on maximizing the net performance gains by gracefully batching the originally highly irregular SpMM as dense GEMM computation and solving it on TCU effectively. As illustrated in Algorithm 2, the node aggregation processes all TC blocks from each row window. nodePointer and edgeList are directly from graph CSR, while edgeToCol and winPartition are generated from SGT discussed in the previous section. Note that InitSparse is to initialize a sparse tile in dense format according to the translated graph structure of the current TC block. Meanwhile, FetchDense returns a dense node embedding matrix tile for TCU computation, and the corresponding column range (embedding dimension range) of matrix . This is to handle the case that the width of one could not cover the full-width (all dimensions) of . Therefore, the will be used to put the current TCU computation output to the correct location in the updated node embedding matrix .
Edge Feature Computing Previous research efforts (Thekumparampil et al., 2018; Veličković et al., 2018) have demonstrated the great importance of incorporating the edge feature for a better GNN model algorithmic performance (e.g., accuracy, and F1-score). The underlying building block to generate edge features is the Sparse Dense-Dense Matrix Multiplication (SDDMM)-like operation. In TC-GNN, we support SDDMM with the collaboration of the above sparse graph translation and TCU-tailored algorithm design, as described in Algorithm 3. The overall algorithm structure and inputs are similar to the above neighbor aggregation. The major difference is the output. In the case of neighbor aggregation, our output is the updated dense node embedding matrix (), where edge feature computing will generate a sparse output with the same shape as the graph edge lists. Note that fetching the only needs to consecutively access the node embedding matrix by rows while fetching the requires first computing the TCU block column-id to node-id () to fetch the corresponding neighbor node embeddings from the same node embedding matrix .
We will detail TC-GNN by mapping the above algorithmic design to low-level primitives (e.g., warp/block) and shared memory layout. We discuss two key techniques: two-level workload mapping and TCU-optimized dataflow design.
Different from previous work (Wang et al., 2019; Fey and Lenssen, 2019a) focusing on CUDA core only, TC-GNN highlights itself with CUDA core and TCU collaboration through effective two-level workload mapping. The idea is based on the fact that CUDA Cores work in SIMT fashion and are operated by individual threads, while TCU designated for GEMM computation requires the collaboration from a warp of threads (32 threads). Our key design principle is to mix these two types of computing units as a single GPU kernel, which can efficiently coordinate the kernel execution at different levels of execution granularity.
In TC-GNN, we operate CUDA cores by thread blocks and manage TCU by thread warps. Specifically, threads running CUDA cores from the same thread block will load data (e.g., edges) from the global memory to shared memory. Note that in our design we assign each row window (discussed in Section 4.1) to one thread block. The number of threads in each block should be divisible by the number of threads in each warp (32) for better performance. Once threads running on CUDA core (CUDA-core threads) finish the data loading, threads from each warp (TCU threads) will operate TCU for GEMM computation (including loading the data from the shared memory to thread-local registers (fragments), applying GEMM computation on data in registers, accumulating results on registers, and storing the final results back to global memory). Note that there would be a large overlap of the CUDA-core threads and TCU threads, both of which are threads from the same blocks but running at a different time frame. In general, we use more CUDA-core threads than TCU threads considering that global memory access demanding more parallelization. There are two major benefits of such a two-level workload mapping strategy. First, threads from the same block can work together to improve the memory access parallelization to better utilize memory bandwidth. Second, warps from the same block can reuse the loaded data, including the information (e.g., column index mapping) of the translated graph and the tiles from the dense node embedding matrix. Therefore, redundant high-cost global memory operations can largely be avoided.
As the major technique to improve the GPU performance, shared memory is customized for our TCU-based sparse kernel design for re-organizing data layout for dense TCU computation and reducing the redundant global memory traffic. Our design takes the TCU specialty into careful consideration from two aspects, 1) the input matrix tile size of the TCU, which is M(16)N(16)K(8) in case of TF-32, and 2) the tile fragment layout for fast computation. The common practice of the loaded tile A and B are stored in row-major and column-major for better performance. Next, we will detail our TCU-optimized dataflow design for both neighbor aggregation and edge feature computation.
Neighbor Aggregation As visualized in Figure 5(a) and detailed in Listing 2, shared memory is mainly used for caching several most frequently used information, including the tile of sparse matrix A (sparse_A), the column-id of the sparse matrix to row-id of node embedding matrix (sparse_AToX_index), and the dense tile of (dense_X). When handling each TCU block, we assign all threads from the same block of threads for loading the sparse tile while allowing several warps to concurrently load the dense row tile from the matrix . The reasons for enforcing such caching are two-folds. First, it can bridge the gap between the sparse graph data and the dense GEMM computing that requires continuous data layout. For example, the adjacent matrix is input as CSR format that cannot be directed feed to TCU GEMM computation, therefore, we use a shared memory sparse_A to initialize its equivalent dense tile. Similarly, we cache rows of according to the columns of to the row of mapping after our sparse graph translation, where originally scattered columns of (rows of ) are condensed. Second, it can enable the data reuse on sparse_AToX_index and sparse_A. This is because in general, the BLK_H (16) cannot cover all dimensions of a node embedding (e.g., 64), multiple warps will be initiated of the same block to operate TCU in parallel to working on non-overlapped dense tiles while using the same sparse tile.
Edge Feature Computation Similar to the shared memory design in neighbor aggregation, for edge feature computing, as visualized in Figure 5(b) and detailed in Listing 3 at the next page, the shared memory is utilized for sparse tile A sparse_A, the column-id of sparse to row-id of the matrix sparse_AToX_index, and the dense tile dense_X from the matrix . We assign all threads from the same block of threads for loading the sparse tile while allowing several warps to concurrently load the dense row tile from the matrix . Compared with dataflow design in neighbor aggregation, edge feature computing demonstrates several differences. First, the sizes of sparse_A are different. In the neighbor aggregation computation, the sparse matrix is used as one operand in the SpMM-like computation, therefore, the minimal processing granularity is , while in edge feature computing by following SDDMM-like operation, the sparse matrix is served as the output matrix, thus, maintaining the minimum processing granularity is . To reuse the same translated sparse graph as SpMM, we need to recalculate the total number of TC blocks (Line 9). Second, iterations along the embedding dimension would be different. Compared with neighbor aggregation, edge feature computing requires the result accumulation along the embedding dimension. Therefore, the result will only be output until all iterations have finished. In neighbor aggregation, the node embedding vector is divided among several warps, each of which will output their aggregation result to non-overlapped embedding dimension range in parallel. Third, the output format has changed. Compared with SpMM-like neighbor aggregation which directly output computing result as an updated dense matrix , SDDMM-like edge feature computing requires a sparse format (the same shape as edgeList) output for compatibility with neighbor aggregation and memory space. Therefore, one more step of dense-to-sparse translation is employed after the result accumulation.
Benchmarks: We choose the two most representative GNN models widely used by previous work (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., 2019) on node classification tasks, which can cover different types of aggregation. Specifically, 1) Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is one of the most popular GNN model architectures. It is also the key backbone network for many other GNNs, such as GraphSAGE (Hamilton et al., 2017), and differentiable pooling (Diffpool) (Ying et al., 2018). Therefore, improving the performance of GCN will also benefit a broad range of GNNs. For GCN evaluation, we use the setting: 2 layers with 16 hidden dimensions, which is also the setting from the original paper (Kipf and Welling, 2017). 2) Attention-based Graph Neural Network (AGNN) (Thekumparampil et al., 2018). AGNN differs from GCN and GIN in its aggregation function, which compute edge feature (via embedding vector dot-product between source and destination vertices) before the node aggregation. AGNN is also the reference architecture for many other recent GNNs for better model algorithmic performance. For AGNN evaluation, we use the setting: 4 layers with 32 hidden dimensions.
Baselines: 1) Deep Graph Library (DGL) (Wang et al., 2019) is the state-of-the-art GNN framework on GPUs, which is built with the high-performance CUDA-core-based cuSPARSE (Nvidia, ) library as the backend and uses Pytorch (Paszke et al., 2019) as its front-end programming interface. DGL significantly outperforms other existing GNN frameworks (Fey and Lenssen, 2019a) over various datasets on many mainstream GNN model architectures. Therefore, we make an in-depth comparison with DGL. 2) Pytorch-Geometric (PyG) (Fey and Lenssen, 2019a)
is another GNN framework. PyG leverages torch-scatter(Fey and Lenssen, 2019b) library (highly-engineered CUDA-core kernel) as the backend support, which highlights its performance on batched small graph settings; 3) Blocked-SpMM (Nvidia, ) (bSpMM) accelerates SpMM on TCU. It is included in the recent update (March. 2021) on cuSPARSE library (CUDA 11.2). bSpMM requires the sparse matrix with Blocked-Ellpack format for computation. Its computation on non-zero blocks can be seen as the hybrid sparse-dense solution discussed in Section 3.3. Note that the bSpMM has not been incorporated in any existing GNN frameworks. We also compare TC-GNN with tSparse (Zachariadis et al., 2020) and Triton (Tillet et al., 2019) for non-vendor developed and highly optimized kernels on TCU.
Datasets, Platforms, and Metrics: We cover three types of datasets (Table 4), which have been used in previous GNN-related work (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., 2019). Specifically, Type I graphs are the typical datasets used by previous GNN algorithm papers (Kipf and Welling, 2017; Xu et al., 2019; Hamilton et al., 2017). They are usually small in the number of nodes and edges, but rich in node embedding information with high dimensionality. Type II graphs (Kersting et al., 2016) are the popular benchmark datasets for graph kernels and are selected as the built-in datasets for PyG (Fey and Lenssen, 2019a). Each dataset consists of a set of small graphs, which only have intra-graph edge connections without inter-graph edge connections. Type III graphs (Leskovec and Krevl, 2014; Kipf and Welling, 2017) are large in terms of the number of nodes and edges. These graphs demonstrate high irregularity in its structures, which are challenging for most of the existing GNN frameworks. TC-GNN backend is implemented with C++ and CUDA C, and the front-end is implemented in Python. Our major evaluation platform is a server with an 8-core 16-thread Intel Xeon Silver 4110 CPU and an NVIDIA RTX3090 GPU. To measure the performance speedup, we calculate the averaged latency of 200 end-to-end results.
As shown in Figure 6(a), TC-GNN achieves speedup on average compared to DGL over three types of datasets across GCN and AGNN model on end-to-end training.
Type I Graphs: The performance improvements against DGL are significantly higher for GCN (on average ) compared to AGNN (on average ). The major reason is their different GNN computation patterns. For GCN, it only consists of a neighbor aggregation phase (SpMM-like operation) and a node update phase (GEMM operation). Whereas in the AGNN, the aggregation phase would also require an additional edge attention value (feature) computation based on SDDMM-like operations. Compared with SpMM-like operations, edge attention computation in SDDMM is more sensitive to the irregular sparse graph structure because of much more intensive computations and memory access. Thus, the performance improvement is relatively lower.
Type II Graphs: TC-GNN achieves GCN () and AGNN () on the Type II graphs. Speedup on Type II graphs is relatively lower compared with Type I, since Type II datasets consisting of a set of small graphs with very dense intra-graph connections but no inter-graph edges. This leads to a lower benefit from the sparse graph translation that would show more effectiveness on highly irregular and sparse graphs. Such a clustered graph structure would also benefit cuSPARSE due to more efficient memory access, i.e., the fewer irregular data fetching from the sparse matrix. In addition, for AGNN, TC-GNN can still demonstrate evident performance benefits over the DGL (CUDA core only), which can mainly contribute to our TCU-based SDDMM-like operation designs that can fully exploit the power of GPU through an effective TCU and CUDA core collaboration.
Type III Graphs: The speedup is also evident (on average 1.59 for GCN and on average 1.51 for AGNN) on graphs with a large number of nodes and edges and irregular graph structures. The reason is the high overhead global memory access can be well reduced through our spare graph translation. Besides, our dimension-split strategy further facilitates efficient workload sharing among warps through improving the data spatial/temporal locality. On the dataset artist and soc-BlogCatalog, which have a higher average degree within Type III datasets, we notice a better speedup performance for both GCN and AGNN. This is because 1) more neighbors per node can lead to the higher density of non-zero elements within each tile/fragment. Thus, it can fully exploit the computation benefits of each TCU GEMM operation; 2) it can also facilitate more efficient memory access. For example, in AGNN, fetching one dense row (node embedding of one node) from the dense matrix can be reused more times by applying dot-product between and many columns of (node embedding of neighbors) the dense matrix .
Comparison with PyG We further compare TC-GNN with PyG (Fey and Lenssen, 2019a), which is another popular GNN computing framework built on the highly engineered torch-scatter (Fey and Lenssen, 2019b) library running on CUDA core. As shown in Figure 6(b), TC-GNN can outperform PyG with an average of speedup on GCN and an average of speedup on AGNN. For GCN, TC-GNN achieves significant speedup on datasets with high-dimensional node embedding, such as Yeast, through effective TCU acceleration through a TCU-aware sparse graph translation while reducing the synchronization overhead by employing our highly parallelized TCU-tailored algorithm design. PyG, however, achieves inferior performance because its underlying GPU kernel can only leverage CUDA core, thus, intrinsically bounded by the CUDA core computing performance. its kernel implementation heavily relies on the high-overhead atomic operations for thread-level synchronization, thus, suffering from performance degradation.
|.1em.05em.05em Dataset||tSparse (ms)||Triton (ms)||TC-GNN (ms)|
Compared with cuSPARSE bSpMM We compare our TC-GNN SpMM kernel with cuSPARSE bSpMM to demonstrate the performance advantage of TC-GNN compared with the state-of-the-art hybrid sparse-dense solution on TCU. Figure 6(c) shows that TC-GNN can outperform bSpMM with on average speedup on neighbor aggregation. Our SGT technique can maximize the non-zero density of each non-zero tile and significantly reduce the number of non-zero tiles to be processed. However, bSpMM in cuSPARSE has to comply with the strict input sparse pattern (indicated in official API documentation (Nvidia, )). For example, all rows in the arrays must have the same number of non-zero blocks. Thus, more redundant computations (on padding those non-structural zero blocks) in bSpMM lead to inferior performance.
Compared with tSparse and Triton We compare TC-GNN SpMM kernel with tSparse (Zachariadis et al., 2020) and Triton (Tillet et al., 2019) SpMM on Type III datasets. From Table 5 (Column-2,4), TC-GNN can outperform tSparse with on average speedup on SpMM. The major reason behind this is that TC-GNN can well reduce the graph structural-level irregularity through our novel sparse graph translation scheme to benefit the dense TCU-based computation. In contrast, tSparse only considers partitioning the input sparse matrix into dense/sparse tiles based on their non-zeros elements but ignoring the potential of compressing non-zero elements into fewer tiles to reduce the workload. As shown in Table 5 (Column-3,4), TC-GNN can outperform Triton with on average speedup on SpMM. Triton’s block-sparse GEMM for TCU acceleration is designed for block-sparse traditional Neural Networks (focusing on feature maps’ sparsity), which is quite different from GNNs (focusing on the graph adjacency matrix’s sparsity) with significant larger sparse matrix size and more irregular pattern. The real-world graphs are highly irregular in edge connections, which will be reflected as a highly scattered distribution of non-zeros elements on adjacency matrices.
SGT Effectiveness we conduct a quantitive analysis of SGT in terms of the total number of TCU blocks between graphs w/o SGT and the graphs w/ SGT applied. As shown in Figure 7(a), across all types of datasets, our SGT technique can significantly reduce the number of traversed TCU blocks (on average 67.47%). The major reason is that SGT can largely improve the density of non-zero elements within each TCU Block. In contrast, the graphs w/o SGT would demonstrate a large number of highly sparse TCU blocks. What is also worth noticing is that on Type II graphs, such a reduction benefit is lower. The reason is that Type II graphs consist of a set of small subgraphs that only maintain the intra-subgraph connections, which already maintain dense columns.
SGT Overhead We further evaluate the overhead of our TC-aware sparse graph translation technique. Here we use the training for illustration, and the inference in real GNN application setting would also use the same graph structure many times (Hamilton et al., 2017; Kipf and Welling, 2017) while only changing the node embeddings input. As shown in Figure 7(b), its overhead is consistently tiny (on average 4.43%) compared with the overall training time. We thus conclude that such one-time overhead can be amortized during the GNN computation, which demonstrates its applicability in real-world GNNs.
In this paper, we introduce the first GNN acceleration framework on TCU of GPUs. We design a novel sparse graph translation technique to gracefully fit the sparse GNN workload on dense TCU. Our TCU-tailored GPU kernel design maximizes the TCU performance gains for GNN computing through effective CUDA core and TCU collaboration and a set of memory/data flow optimizations. Our seamless integration with the Pytorch framework further facilitates end-to-end GNN computing with high programmability.
Attention-based graph neural network for semi-supervised learning. Cited by: §1, §2.1, §4.2, §6.