TC-GNN with PyTorch Integration
Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to highly sparse and irregular graph-based operations. To this end, we propose TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "sparse" GNN computation with the "dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workloads. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming. Rigorous experiments show an average of 1.70× speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.
Over recent years, with the increasing popularity of graph-based learning, graph neural networks (GNNs) (Kipf and Welling, 2017; Xu et al., 2019; Thekumparampil et al., 2018) have become dominant in the computing of essential tasks across various domains, including e-commerce, financial services, etc. Compared with standard methods for graph analytics, such as random walks (Grover and Leskovec, 2016; Perozzi et al., 2014; Huang et al., 2021) and graph Laplacians (Luo et al., 2011, 2009; Cheng et al., 2018), GNNs highlight themselves with significantly higher accuracy (Kipf and Welling, 2017; Xu et al., 2019; Veličković et al., 2018) and better generality (Hamilton et al., 2017). From the computation perspective, GNNs feature interleaved execution phases of both graph operations (scatter-and-gather (Gonzalez et al., 2012)) in the aggregation phase and neural network (NN) operations (matrix multiplication) in the update phase. Our experimental studies further show that the aggregation phase, which involves highly sparse computation on irregular input graphs, generally takes more than 80% of the running time for both GNN training and inference.

Existing GNN frameworks, e.g., Deep Graph Library (Wang et al., 2019) and PyTorch Geometric (Fey and Lenssen, 2019a), are mostly built upon popular NN frameworks that were originally optimized for dense operations, such as general matrix-matrix multiplication (GEMM). To support sparse computations in GNNs, their common strategy is to incorporate sparse primitives (such as cuSPARSE (Nvidia)) in their backend implementations. However, cuSPARSE relies on sparse linear algebra (LA) algorithms that involve many high-cost indirect memory accesses on the non-zero elements of a sparse matrix. Therefore, cuSPARSE cannot enjoy the same level of optimization (e.g., data reuse) as its dense counterpart, cuBLAS (Nvidia). Moreover, cuSPARSE is designed to utilize only CUDA cores.
Therefore, it cannot benefit from recent advances in GPU hardware, such as the Tensor Core Unit (TCU), which can significantly boost GPU performance on dense LA algorithms (e.g., linear transformation and convolution) in most conventional deep-learning applications.
This work focuses on exploring the potential of TCUs for accelerating such GNN-based graph learning. We remark that making TCUs effective for general GNN computing is a non-trivial task. Our initial study shows that naively applying TCUs to sparse GNN computation would even result in inferior performance compared with existing sparse implementations on CUDA cores. There are several challenges. First, directly resolving the sparse GNN computing problem with a pure dense GEMM solution is impractical due to the extremely large memory cost (O(N²), where N is the number of nodes). Besides, traversing matrix tiles already known to be filled with all-zero elements causes excessive unnecessary computation and memory access. Second, simply employing the TCU to process the non-zero matrix tiles of the sparse graph adjacency matrix would still waste most of the TCU computation and memory-access effort. This is because TCU input matrix tiles are defined with fixed dimension settings (e.g., 16×8 for TF32), whereas the non-zero elements of a sparse graph adjacency matrix are distributed irregularly. Thus, intensive zero-value padding is required to satisfy such a rigid input constraint.
Third, even though a recent CUDA release enables the TCU to exploit certain types of sparsity (Nvidia), it only supports blocked SpMM, where non-zero elements must first fit into well-shaped blocks and the number of blocks must be the same across different rows. Such a rigid input restriction makes it hard to handle the highly irregular sparse graphs of real-world GNN applications efficiently.

To this end, we introduce TC-GNN, the first TCU-based GNN acceleration design on GPUs. Our key insight is to let the input sparse graph fit the dense computation of the TCU. At the input level, instead of exhaustively traversing all sparse matrix tiles and determining whether to process each tile, we develop a new sparse graph translation (SGT) technique that can effectively identify the non-zero tiles and condense non-zero elements from these tiles into a smaller number of "dense" tiles. Our major observation is that neighbor sharing is very common among nodes in real-world graphs. Therefore, applying SGT can effectively merge the unnecessary loading of shared neighbors among different nodes to avoid high-cost memory access. SGT is generally applicable to any sparsity pattern of input graphs and always yields the same correct results as the original sparse algorithm. At the kernel level, to process GNN sparse workloads efficiently, TC-GNN exploits the benefits of CUDA core and TCU collaboration. The major design idea is that the CUDA core, which excels at fine-grained thread-level execution, is a good candidate for managing memory-intensive data access, while the TCU, which is more powerful at handling simple arithmetic operations (e.g., multiplication and addition), is well suited for compute-intensive GEMM on the dense tiles generated by SGT. At the framework level, we integrate TC-GNN with the popular PyTorch (Paszke et al., 2019) framework.
Thereby, users only need to interact with their familiar PyTorch programming environment through TC-GNN APIs. This can significantly reduce extra learning effort while improving user productivity and code portability across different platforms.
In summary, our contributions are as follows:
We conduct a detailed analysis of several existing solutions (e.g., SpMM on CUDA cores) and identify the potential of using TCUs to accelerate sparse GNN workloads.
We introduce a sparse graph translation (SGT) technique, which makes the sparse and irregular GNN input graphs fit the dense computing of TCUs for acceleration.
We build a TCU-tailored GPU kernel with effective CUDA core and TCU collaboration. It consists of a novel two-level workload-mapping strategy for computation optimization and a TCU-optimized dataflow design for memory-access optimization.
We deliver an end-to-end GNN framework design with seamless integration into the popular PyTorch framework for high programmability and configurability.
Extensive experiments show a significant speedup (1.70× on average) over the state-of-the-art GNN computing framework, Deep Graph Library, across various mainstream GNN models and dataset settings.
Graph neural networks (GNNs) are an effective tool for graph-based machine learning. The detailed computing flow of GNNs is illustrated in Figure 1. A GNN computes the feature vector (embedding) $h_v^{(k+1)}$ for node $v$ at layer $k+1$ based on the embedding information at layer $k$, as shown in Equation 1:

$$a_v^{(k+1)} = \mathbf{Aggregate}^{(k+1)}\big(\{h_u^{(k)} \mid u \in \mathcal{N}(v)\} \cup \{h_v^{(k)}\}\big), \quad h_v^{(k+1)} = \mathbf{Update}^{(k+1)}\big(a_v^{(k+1)}\big) \tag{1}$$

where $h_v^{(k)}$ is the embedding vector of node $v$ at layer $k$; $a_v^{(k+1)}$ is the aggregation result obtained by collecting neighbors' information (e.g., node embeddings); and $\mathcal{N}(v)$ is the neighbor set of node $v$. The aggregation method and the order of aggregation and update can vary across different GNNs. Some methods (Kipf and Welling, 2017; Hamilton et al., 2017) rely only on the neighboring nodes, while others (Veličković et al., 2018) also leverage edge properties computed by applying a vector dot-product between source and destination node embeddings. The update function is generally composed of standard NN operations, such as a single fully connected layer or a multi-layer perceptron (MLP) in the form of $\sigma(W a_v^{(k+1)} + b)$, where $W$ and $b$ are the weight and bias parameters, respectively. Common choices for the node embedding dimension are 16, 64, and 128, and the embedding dimension may change across layers. After several iterations of aggregation and update (i.e., several GNN layers), we obtain the output feature embedding of each node, which can be used for various downstream graph-based deep learning tasks, such as node classification (Kaspar and Horst, 2010; Gibert et al., 2012; Duran and Niepert, 2017) and link prediction (Chen et al., 2005; Kunegis and Lommatzsch, 2009; Tylenda et al., 2009).

The sparse computing in the aggregation phase is generally formalized as sparse-matrix dense-matrix multiplication (SpMM), as illustrated in Figure 2(a), and is handled by sparse libraries (e.g., cuSPARSE (Nvidia)) in many state-of-the-art GNN frameworks (Wang et al., 2019, 2021). These designs count only on GPU CUDA cores for computing, leaving the diverse computing units of modern GPUs, such as the Tensor Core Unit (TCU), unused. Specifically, we formalize the neighbor aggregation as SpMM-like operations (Equation 2):
$$\hat{X} = (A \odot E) \cdot X \tag{2}$$
where $A \in \{0,1\}^{N \times N}$ is the graph adjacency matrix stored in CSR format; $X \in \mathbb{R}^{N \times D}$ is the node feature embedding matrix stored in dense format; $N$ is the number of nodes in the graph, and $D$ is the size of the node feature embedding dimension; $\odot$ is element-wise multiplication and $\cdot$ is standard matrix-matrix multiplication; $E$ is the edge feature matrix in CSR format, which can be computed by SDDMM-like operations (Equation 3), as illustrated in Figure 2(b).
$$E = A \odot (X \cdot X^{\top}) \tag{3}$$
Note that the computation of $E$ is optional in GNNs; it is generally adopted by the Attention-based Graph Neural Network (Thekumparampil et al., 2018) to capture more complicated graph structural information. Other GNNs, such as the Graph Convolutional Network (Kipf and Welling, 2017) and the Graph Isomorphism Network (Xu et al., 2019), use only the graph adjacency matrix for neighbor aggregation.
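To make Equations 2 and 3 concrete, here is a minimal dense NumPy sketch of the two operations. This is for illustration only: real frameworks store A and E in CSR and execute these kernels on the GPU, and the function names here are our own, not TC-GNN APIs.

```python
import numpy as np

def neighbor_aggregation(A, X, E=None):
    """SpMM-like aggregation (Equation 2): X_hat = (A * E) @ X.

    A : (N, N) dense 0/1 adjacency matrix (illustration only).
    X : (N, D) node embedding matrix.
    E : optional (N, N) edge feature matrix; defaults to all-ones.
    """
    if E is None:
        E = np.ones_like(A)
    return (A * E) @ X

def edge_feature_computing(A, X):
    """SDDMM-like edge feature computation (Equation 3): E = A * (X @ X.T).

    Only positions where A is non-zero carry meaningful edge features.
    """
    return A * (X @ X.T)

# Tiny 3-node example: node 0 is connected to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
X = np.arange(6, dtype=np.float64).reshape(3, 2)
print(neighbor_aggregation(A, X))
print(edge_feature_computing(A, X))
```

The masking by A in the SDDMM step is exactly why a sparse output format suffices: every zero entry of A yields a zero edge feature.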
In the most recent GPU architectures (since Volta (Nvidia)), NVIDIA introduced a new type of computing unit, the Tensor Core Unit (TCU), for accelerating dense deep-learning operations (e.g., dense GEMM). A GPU Streaming Multiprocessor (with TCUs) is illustrated in Figure 3. Note that FP64, FP32, INT, and SFU denote double-precision, single-precision, integer, and special function units, respectively. Different from scalar computation on CUDA cores, the TCU provides tile-based matrix-matrix computation primitives on register fragments, which can deliver substantially higher throughput. In particular, the TCU supports the compute primitive $D = A \cdot B + C$, where $A$ and $B$ are required to be in a certain precision (e.g., half, TF32), while $C$ and $D$ are stored in FP32. Depending on the data precision and GPU architecture version, the matrix sizes (MMA shapes) of $A$, $B$, and $C$ must follow certain principles (NVIDIA). For example, TF32 TCU computing requires $M = N = 16$ and $K = 8$. In recent CUDA releases (≥ 11.0) on the Ampere architecture, TF32 serves as a good alternative to float/double for TCU-based GPU computing in modern deep-learning applications, according to NVIDIA's in-depth studies (NVIDIA).
TCUs can be utilized in several ways. The simplest is to call cuBLAS (Nvidia) through the cublasSgemmEx API. The second way is to use the Warp Matrix Multiply-Accumulate (WMMA) API (nvcuda::wmma) (Nvidia) in CUDA C++ to operate the TCU directly. There are four major types of operations (Listing 1). wmma::fragment defines the input matrix tile for TCU computation; each fragment consists of thread-local registers from a warp of threads. wmma::load_matrix_sync loads the input matrix tiles from global/shared memory into register fragments. wmma::mma_sync executes the matrix multiplication on the loaded matrix tiles in register fragments. Finally, wmma::store_matrix_sync moves the results from registers back to global/shared memory.
Since the appearance of the TCU, research efforts have been devoted to accelerating high-performance computing workloads with it. Abdelfattah et al. (Abdelfattah et al.) process batched small-size GEMM on TCUs for acceleration. Feng et al. (Feng et al., 2021) introduce GEMM-based scientific computing on TCUs with extended precision and high performance. These prior efforts mostly use the TCU in the dense applications it was initially designed for, whereas TC-GNN steps outside the scope defined by the TCU designers by accelerating sparse GNN operations with TCUs.
In this section, we discuss the major technical thrusts for leveraging the TCU to accelerate sparse GNN computation. We use the optimization of SpMM as the main example in this discussion; the acceleration of SDDMM benefits from similar optimization principles. We first characterize existing GNN computation solutions, including SpMM on CUDA cores, dense GEMM on CUDA cores/TCUs, and a hybrid sparse-dense solution. We then give insights based on a pros/cons analysis and our motivation.
Dataset    Aggr. (%)   Update (%)   Cache (%)   Occ. (%)
Cora       88.56       11.44        37.22       15.06
Citeseer   86.52       13.47        38.18       15.19
Pubmed     94.39        5.55        37.22       16.24
Solution              Mem. Consumption   Effective Mem. Access   Computation Intensity   Effective Computation
Sparse GEMM           Low                Low                     Low                     High
Dense GEMM            High               High                    High                    Low
Hybrid Sparse-Dense   High               Low                     Low                     High
TC-GNN                Low                High                    High                    High
As a major component of sparse linear algebra, SpMM has been incorporated into many off-the-shelf libraries (E. Anderson et al., 1999; W. Bosma et al., 1997; G. Huang et al., 2020, among others). GE-SpMM (Huang et al., 2020) accelerates GNN computations on GPUs through a hand-optimized CUDA kernel with coalesced row caching and coarse-grained warp merging to improve computation parallelism and memory performance. The closed-source cuSPARSE (Nvidia) library developed by NVIDIA is the most popular solution and delivers state-of-the-art performance for most GPU-based SpMM computation. cuSPARSE has also been widely adopted by many GNN computing frameworks, such as Deep Graph Library (DGL) (Wang et al., 2019), as the backend for sparse neighbor-aggregation operations. To understand its characteristics, we profile DGL on one layer of a GCN (Kipf and Welling, 2017) model (neighbor aggregation + node update) on an NVIDIA RTX 3090. We report two key kernel metrics for the neighbor-aggregation kernel only: the L1/texture cache hit rate (Cache) and the achieved Streaming Multiprocessor (SM) occupancy (Occ.). We select three representative GNN datasets: Citeseer with 3,327 nodes, 9,464 edges, and 3,703 node-embedding dimensions; Cora with 2,708 nodes, 10,858 edges, and 1,433 dimensions; Pubmed with 19,717 nodes, 88,676 edges, and 500 dimensions. From Table 1, we make several observations. First, the aggregation phase usually dominates the overall GNN execution. On these three commonly used GNN datasets, the aggregation phase takes more than 80% of the overall execution time, which shows that the key to addressing the performance bottleneck of GNNs is improving the performance of sparse neighbor aggregation. Second, sparse operations in GNNs show very low memory performance.
The Cache column of Table 1 shows that GNN sparse operations cannot benefit well from the GPU cache system, exhibiting a low cache-hit ratio (around 37%) and frequent global memory access. Third, sparse operations in GNNs show very inefficient computation. As the Occupancy column of Table 1 indicates, sparse operations in GNNs can hardly keep the GPU busy because of 1) their low computation intensity (the number of non-zero elements in the sparse matrix is generally small); 2) their highly irregular memory access for fetching rows of the dense matrix during computation, resulting in memory-bound execution; and 3) the fact that they currently leverage only CUDA cores, which naturally limits throughput. On the other side, this study also points out several potential directions for improving SpMM performance on GPUs: improving the computation intensity (e.g., assigning more workload to each thread/warp/block), boosting memory-access efficiency (e.g., crafting a specialized memory layout for coalesced access), and breaking the computation performance ceiling (e.g., using TCUs).
While dense GEMM is mainly used for dense NN computations (e.g., linear transformation and convolution), it can also be leveraged for GNN aggregation under some circumstances. For example, when an input graph has a very limited number of nodes, we can directly use the graph's dense adjacency matrix and accelerate the intrinsically sparse neighbor-aggregation computation on CUDA cores/TCUs by calling cuBLAS (Nvidia). However, such an assumption does not hold even for medium-size graphs in real-world GNN applications.
Dataset    # Nodes     # Edges     Memory        Eff. Comp.
OVCAR-8H   1,890,931   3,946,402   14302.48 GB   0.36%
Yeast      1,714,644   3,636,546   11760.02 GB   0.32%
DD         334,925     1,686,092   448.70 GB     0.03%
As shown in Table 3, for these selected datasets, the memory consumption of the dense graph adjacency matrix (as a 2D float array) easily exceeds the memory capacity of today's GPUs (<100 GB). Even if we assume the dense adjacency matrix could fit into GPU memory, the extremely low effective computation (last column of Table 3) would still be a major obstacle to high performance. We measure the effective computation in terms of #E, the number of non-zero elements (indicating edges) in the graph adjacency matrix, and N, the number of nodes in the graph; #E is tiny in comparison with N². Therefore, the computation and memory accesses spent on zero elements are wasted.
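The memory column of Table 3 can be sanity-checked with a short script. The helper below is our own (not part of TC-GNN); it assumes a dense float32 adjacency matrix and GB = 10^9 bytes, and reports the matrix density #E/N² as one natural measure of how few entries are useful (the paper's exact Eff. Comp. metric may be defined differently).

```python
def dense_adjacency_cost(num_nodes, num_edges):
    """Cost of storing a dense float32 N x N adjacency matrix.

    Returns (memory in GB, density), where density = #E / N^2 is the
    fraction of matrix entries that correspond to actual edges.
    """
    mem_gb = num_nodes ** 2 * 4 / 1e9        # 4 bytes per float32 entry
    density = num_edges / num_nodes ** 2     # useful fraction of the matrix
    return mem_gb, density

# DD dataset from Table 3: 334,925 nodes and 1,686,092 edges.
mem, density = dense_adjacency_cost(334_925, 1_686_092)
print(f"{mem:.2f} GB of memory, {density:.6%} of entries are non-zero")
```

For DD this reproduces the 448.70 GB figure in the table, and for OVCAR-8H the 14302.48 GB figure; the density is orders of magnitude below 1%, so almost all dense-GEMM work would be spent on zeros.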
Another line of work (Kurt et al., 2020; Nvidia) takes the path of mixing sparse control (tile-based iteration) with dense GEMM computation. These approaches first apply a convolution-like (2D sliding window) operation on the adjacency matrix and traverse all possible dense tiles that contain non-zero elements. Then, for all identified non-zero tiles, they invoke GEMM on CUDA cores/TCUs. However, this strategy has two shortcomings. First, the sparse control itself causes high overhead. Based on our empirical study, the non-zero elements are highly scattered across the adjacency matrix of a sparse graph; therefore, traversing all blocks of a very large adjacency matrix is time-consuming. Second, the identified sparse tiles still waste many computations: the irregular edge connections of real-world graphs can hardly fit into these fixed-shape tile frames, so most of the dense tiles have very low occupancy (few non-zero elements per tile).
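The low tile occupancy can be demonstrated with a small sketch (our own illustration, with hypothetical names): it maps each edge of a toy graph to a fixed 16×8 tile of the adjacency matrix and measures how many tiles are non-empty and how densely they are occupied.

```python
from collections import Counter

def nonzero_tile_stats(edges, num_nodes, tile_h=16, tile_w=8):
    """Count the non-empty tile_h x tile_w tiles of the N x N adjacency
    matrix and report their average occupancy (non-zeros per tile slot)."""
    counts = Counter((src // tile_h, dst // tile_w) for src, dst in edges)
    total = ((num_nodes + tile_h - 1) // tile_h) * \
            ((num_nodes + tile_w - 1) // tile_w)
    avg_density = sum(counts.values()) / (len(counts) * tile_h * tile_w)
    return len(counts), total, avg_density

# Toy example: a 64-node ring graph (each node points to its successor).
edges = [(i, (i + 1) % 64) for i in range(64)]
nonzero, total, density = nonzero_tile_stats(edges, 64)
print(nonzero, "of", total, f"tiles non-empty, avg occupancy {density:.1%}")
```

Even on this tiny, perfectly regular graph only about 4% of each non-empty tile's slots are occupied; on scattered real-world graphs the waste is far larger.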
Inspired by the above studies, we make several key design choices to achieve high-performance sparse GNN operations. 1) At the algorithm level, we choose the hybrid sparse-dense solution as the starting point. This gives us more flexibility for optimization in the sparse control (e.g., traversing fewer tiles) and the dense computation (e.g., increasing the effective computation/memory access when processing each tile). 2) At the GPU kernel level, we employ shared memory as the key space for kernel-level data management. It helps us reorganize the irregular GNN input data in a more "regularized" way so that both memory-access efficiency and computing performance are improved. 3) At the hardware level, we choose the TCU as our major computing unit since it brings significantly higher computing throughput than CUDA cores, indicating great potential for harvesting more performance gains. Finally, we crystallize our ideas into TC-GNN, which effectively coordinates the execution of GNN sparse operations on dense TCUs. We show a brief qualitative comparison between TC-GNN and the above three solutions in Table 2 and justify these benefits through a detailed discussion of TC-GNN in the next two sections. Note that Memory Consumption is the size of memory used by the sparse/dense graph adjacency matrix; Effective Memory Access is the ratio between the size of the accessed data actually involved in later computation and the total size of data accessed; Computation Intensity is the ratio of the number of computing operations to the amount of data accessed; and Effective Computation is the ratio between the operations contributing to the final result and the total operations.
We now detail TC-GNN's three algorithmic designs: sparse graph translation, sparse (SpMM-like) neighbor aggregation, and sparse (SDDMM-like) edge feature computing.
As the major component of TC-GNN, we propose a novel sparse graph translation (SGT) technique to facilitate TCU acceleration of GNNs. Our core idea is that the sparsity pattern of the graph can be well tuned for TCU computation through effective graph structural manipulation while guaranteeing output correctness.
Specifically, we condense the highly scattered neighbor ids without losing key information (e.g., edge connections). As exemplified in Figure 4(a) and (b), we take the regular graph in CSR format as input and condense the columns of each row window (in the red rectangular box) to build TCU blocks (a.k.a. the input operand shape of a single MMA instruction), shown in the orange rectangular box. nodePointer is the row pointer array, and edgeList stores the edges of each node contiguously. In this paper, we demonstrate the use of the standard M(16)×N(16)×K(8) MMA shape for TF32 on the Ampere GPU architecture; other MMA shapes (NVIDIA) can also be used if different computation precision (e.g., half and int8) and GPU architectures (e.g., Turing) are specified.
Our sparse graph translation scheme takes several steps to process each row window, as detailed in Algorithm 1 and visualized in Figure 4(c). Note that winPartition is an array maintaining the number of TC blocks in each row window, and edgeToCol is an array maintaining the mapping between edges and their corresponding positions in the graph after SGT. We choose the row window height (TC_BLK_H) and column width (TC_BLK_W) according to the TCU MMA specification (e.g., TC_BLK_H=16, TC_BLK_W=8 in TF32). After condensing the graph within each row window, the cost of sliding the tile window across a row window's columns is reduced from the total number of nodes N to only the number of unique neighbors within the current row window (computed in Algorithm 1). Besides, the density (computation intensity) of each identified TCU block is largely improved. Considering the case in Figure 4, after the sparse graph translation we achieve higher density on individual TCU blocks (Figure 4(b)) compared with the original layout (Figure 4(a)). Note that SGT is applicable to both SpMM and SDDMM in GNN sparse operations, and it can easily be parallelized because individual row windows are processed independently of each other. Besides, the sparse graph translation only needs to execute once, and its result can be reused across many epochs/rounds of GNN training/inference.
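The condensing step can be sketched in plain Python as follows. This is our own simplification of Algorithm 1: the array names follow the text (nodePointer, edgeList, winPartition, edgeToCol), but the implementation details are assumptions, not TC-GNN's actual code.

```python
def sparse_graph_translation(node_pointer, edge_list, tc_blk_h=16, tc_blk_w=8):
    """SGT sketch: condense the scattered neighbor columns of each row
    window into consecutive columns so they fill dense TCU tiles.

    node_pointer, edge_list: the graph in CSR format.
    Returns (win_partition, edge_to_col):
      win_partition[w] = number of TC blocks in row window w,
      edge_to_col[e]   = condensed column of edge e after translation.
    """
    num_nodes = len(node_pointer) - 1
    num_windows = (num_nodes + tc_blk_h - 1) // tc_blk_h
    win_partition = [0] * num_windows
    edge_to_col = [0] * len(edge_list)
    for w in range(num_windows):
        lo = node_pointer[w * tc_blk_h]
        hi = node_pointer[min((w + 1) * tc_blk_h, num_nodes)]
        # Map each unique neighbor id in this window to a condensed column.
        col_of = {}
        for e in range(lo, hi):
            col_of.setdefault(edge_list[e], len(col_of))
            edge_to_col[e] = col_of[edge_list[e]]
        # Enough TC blocks to cover all unique neighbors of this window.
        win_partition[w] = (len(col_of) + tc_blk_w - 1) // tc_blk_w
    return win_partition, edge_to_col

# Toy 4-node CSR graph; tiny 2x2 blocks are used here for readability.
wp, e2c = sparse_graph_translation([0, 2, 4, 5, 7], [1, 3, 1, 2, 0, 2, 3],
                                   tc_blk_h=2, tc_blk_w=2)
print(wp, e2c)
```

Because each row window is handled independently, the outer loop over `w` is trivially parallelizable, matching the observation in the text.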
Neighbor Aggregation. The major part of GNN sparse computing is the neighbor aggregation, which is generally formalized as SpMM operations by many state-of-the-art frameworks (Wang et al., 2019), which employ cuSPARSE (Nvidia) on CUDA cores as a black-box technique for supporting sparse GNN computation. In contrast, our TC-GNN design targets TCUs for the major neighbor-aggregation computation, which demands a specialized algorithmic design. TC-GNN focuses on maximizing the net performance gain by gracefully batching the originally highly irregular SpMM into dense GEMM computation and solving it on TCUs effectively. As illustrated in Algorithm 2, the node aggregation processes all TC blocks of each row window. nodePointer and edgeList come directly from the graph CSR, while edgeToCol and winPartition are generated by the SGT discussed in the previous section. Note that InitSparse initializes a sparse tile in dense format according to the translated graph structure of the current TC block. Meanwhile, FetchDense returns a dense node-embedding matrix tile for TCU computation, along with the corresponding column range (embedding-dimension range) of matrix $X$. This handles the case where the width of one TC block cannot cover the full width (all dimensions) of $X$; the returned column range is then used to place the current TCU computation output at the correct location in the updated node embedding matrix $\hat{X}$.
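The per-row-window flow of Algorithm 2 can be sketched in plain Python. This is our own stand-in, not TC-GNN's kernel: the innermost tile product emulates what a single wmma::mma_sync would compute on the TCU, and the comments mirror InitSparse/FetchDense only loosely.

```python
def aggregate_row_window(w, node_pointer, edge_list, edge_to_col,
                         win_partition, X, tc_blk_h=16, tc_blk_w=8):
    """Aggregate the tc_blk_h rows of row window w over its TC blocks.

    edge_to_col / win_partition are the SGT outputs; X is the (N, D)
    node embedding matrix as a list of lists.
    """
    num_nodes = len(node_pointer) - 1
    D = len(X[0])
    out = [[0.0] * D for _ in range(tc_blk_h)]
    for blk in range(win_partition[w]):
        c0 = blk * tc_blk_w
        # InitSparse: dense tc_blk_h x tc_blk_w tile of the translated A.
        tile = [[0.0] * tc_blk_w for _ in range(tc_blk_h)]
        col_to_node = [None] * tc_blk_w   # condensed column -> neighbor id
        for r in range(tc_blk_h):
            node = w * tc_blk_h + r
            if node >= num_nodes:
                break
            for e in range(node_pointer[node], node_pointer[node + 1]):
                c = edge_to_col[e] - c0
                if 0 <= c < tc_blk_w:
                    tile[r][c] = 1.0      # unweighted edge
                    col_to_node[c] = edge_list[e]
        # FetchDense + tile product (one TCU MMA on the GPU):
        for r in range(tc_blk_h):
            for c in range(tc_blk_w):
                if tile[r][c] != 0.0:
                    for d in range(D):
                        out[r][d] += tile[r][c] * X[col_to_node[c]][d]
    return out

# Toy CSR graph with precomputed SGT outputs (TC_BLK_H=2, TC_BLK_W=2).
out = aggregate_row_window(0, [0, 2, 4, 5, 7], [1, 3, 1, 2, 0, 2, 3],
                           [0, 1, 0, 2, 0, 1, 2], [2, 2],
                           [[1., 0.], [0., 1.], [1., 1.], [2., 2.]],
                           tc_blk_h=2, tc_blk_w=2)
print(out)
```

Each row of `out` is the sum of the embeddings of that node's neighbors, gathered through the condensed columns rather than their original scattered positions.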
Edge Feature Computing. Previous research (Thekumparampil et al., 2018; Veličković et al., 2018) has demonstrated the great importance of incorporating edge features for better GNN model algorithmic performance (e.g., accuracy and F1-score). The underlying building block for generating edge features is the sparse dense-dense matrix multiplication (SDDMM)-like operation. In TC-GNN, we support SDDMM through the collaboration of the above sparse graph translation and a TCU-tailored algorithm design, as described in Algorithm 3. The overall algorithm structure and inputs are similar to neighbor aggregation; the major difference is the output. In neighbor aggregation, the output is the updated dense node embedding matrix, whereas edge feature computing generates a sparse output with the same shape as the graph edge list. Note that fetching the first operand tile only requires consecutively accessing the node embedding matrix $X$ by rows, while fetching the second operand tile requires first mapping TCU block column ids back to node ids (via sparse_AToX_index) to fetch the corresponding neighbor node embeddings from the same matrix $X$.
We now detail TC-GNN by mapping the above algorithmic designs to low-level primitives (e.g., warps/blocks) and the shared-memory layout. We discuss two key techniques: two-level workload mapping and TCU-optimized dataflow design.
Different from previous work (Wang et al., 2019; Fey and Lenssen, 2019a) that focuses on CUDA cores only, TC-GNN highlights itself with CUDA core and TCU collaboration through effective two-level workload mapping. The idea is based on the fact that CUDA cores work in SIMT fashion and are operated by individual threads, while the TCU, designated for GEMM computation, requires the collaboration of a warp of threads (32 threads). Our key design principle is to mix these two types of computing units in a single GPU kernel, which can efficiently coordinate kernel execution at different granularities.
In TC-GNN, we operate CUDA cores by thread blocks and manage TCUs by thread warps. Specifically, threads running on CUDA cores from the same thread block load data (e.g., edges) from global memory to shared memory. Note that in our design we assign each row window (discussed in Section 4.1) to one thread block. The number of threads in each block should be divisible by the number of threads per warp (32) for better performance. Once the threads running on CUDA cores (CUDA-core threads) finish data loading, the threads of each warp (TCU threads) operate the TCU for GEMM computation, including loading the data from shared memory into thread-local registers (fragments), applying the GEMM computation on the data in registers, accumulating results in registers, and storing the final results back to global memory. Note that there is a large overlap between CUDA-core threads and TCU threads: both are threads from the same blocks but running in different time frames. In general, we use more CUDA-core threads than TCU threads, considering that global memory access demands more parallelism. There are two major benefits of such a two-level workload-mapping strategy. First, threads from the same block work together to improve memory-access parallelism and better utilize memory bandwidth. Second, warps from the same block can reuse the loaded data, including the information of the translated graph (e.g., the column-index mapping) and the tiles of the dense node embedding matrix. Therefore, redundant high-cost global memory operations are largely avoided.
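The mapping described above can be summarized with a small planning sketch (our own illustration; the names and the policy of cycling warps over embedding slices are assumptions): one thread block per row window, with the block's warps tiling the embedding dimension in non-overlapping slices while sharing the same cached sparse tile.

```python
def map_workload(num_nodes, emb_dim, warps_per_block, tc_blk_h=16, blk_w=16):
    """Two-level workload mapping plan.

    Block b handles row window b (its CUDA-core threads cooperatively
    stage edges into shared memory); each warp then drives the TCU on a
    non-overlapping [d0, d1) slice of the embedding dimension.
    Returns {block_id: [(warp_id, d0, d1), ...]}.
    """
    num_windows = (num_nodes + tc_blk_h - 1) // tc_blk_h
    plan = {}
    for block in range(num_windows):
        slices = []
        for s in range((emb_dim + blk_w - 1) // blk_w):
            warp = s % warps_per_block   # warps cycle if slices outnumber warps
            slices.append((warp, s * blk_w, min((s + 1) * blk_w, emb_dim)))
        plan[block] = slices
    return plan

# Example: a 40-node graph, 64-dim embeddings, 2 warps per block.
plan = map_workload(40, 64, 2)
print(len(plan), "row windows;", plan[0])
```

Because all slices of a block refer to the same row window, every warp in the plan reuses the same staged sparse tile, which is exactly the data-reuse benefit described above.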
As a major technique for improving GPU performance, shared memory is customized in our TCU-based sparse kernel design to reorganize the data layout for dense TCU computation and to reduce redundant global memory traffic. Our design takes the TCU's specifics into careful consideration from two aspects: 1) the input matrix tile size of the TCU, which is M(16)×N(16)×K(8) in the case of TF32, and 2) the tile fragment layout for fast computation. Following common practice, the loaded tiles A and B are stored in row-major and column-major order, respectively, for better performance. Next, we detail our TCU-optimized dataflow design for both neighbor aggregation and edge feature computation.
Neighbor Aggregation. As visualized in Figure 5(a) and detailed in Listing 2, shared memory is mainly used to cache the most frequently used information, including the tile of the sparse matrix A (sparse_A), the mapping from sparse-matrix column ids to node-embedding-matrix row ids (sparse_AToX_index), and the dense tile of X (dense_X). When handling each TCU block, we assign all threads of the same thread block to load the sparse tile while allowing several warps to concurrently load the dense row tile from the matrix X. The reasons for this caching are twofold. First, it bridges the gap between the sparse graph data and the dense GEMM computation that requires a contiguous data layout. For example, the adjacency matrix comes in CSR format, which cannot be fed directly to TCU GEMM computation; therefore, we use a shared-memory buffer sparse_A to initialize its equivalent dense tile. Similarly, we cache rows of X according to the column-of-A to row-of-X mapping produced by our sparse graph translation, where the originally scattered columns of A (rows of X) are condensed. Second, it enables data reuse of sparse_AToX_index and sparse_A. Because BLK_H (16) generally cannot cover all dimensions of a node embedding (e.g., 64), multiple warps of the same block are initiated to operate TCUs in parallel, working on non-overlapping dense tiles while sharing the same sparse tile.
Edge Feature Computation. Similar to the shared-memory design in neighbor aggregation, for edge feature computing, as visualized in Figure 5(b) and detailed in Listing 3, shared memory holds the sparse tile sparse_A, the column-of-A to row-of-X mapping sparse_AToX_index, and the dense tile dense_X from the matrix X. We assign all threads of the same block to load the sparse tile while allowing several warps to concurrently load the dense row tile from the matrix X. Compared with the dataflow design of neighbor aggregation, edge feature computing demonstrates several differences. First, the size of sparse_A differs. In neighbor aggregation, the sparse matrix serves as one operand of the SpMM-like computation, so the minimal processing granularity is a 16×8 operand tile, whereas in edge feature computing, which follows an SDDMM-like operation, the sparse matrix serves as the output matrix, so the minimal processing granularity is a 16×16 output tile. To reuse the same translated sparse graph as SpMM, we need to recalculate the total number of TC blocks (Line 9). Second, the iterations along the embedding dimension differ. Compared with neighbor aggregation, edge feature computing requires result accumulation along the embedding dimension; therefore, the result is output only after all iterations have finished. In neighbor aggregation, the node embedding vector is divided among several warps, each of which outputs its aggregation result to a non-overlapping embedding-dimension range in parallel. Third, the output format changes. Compared with SpMM-like neighbor aggregation, which directly outputs its result as an updated dense matrix, SDDMM-like edge feature computing requires a sparse-format output (the same shape as edgeList) for compatibility with neighbor aggregation and for memory-space savings. Therefore, one more dense-to-sparse translation step is employed after result accumulation.
Type  Dataset          #Vertex    #Edge      Dim.  #Class
I     Citeseer         3,327      9,464      3703  6
      Cora             2,708      10,858     1433  7
      Pubmed           19,717     88,676     500   3
      PPI              56,944     818,716    50    121
II    PROTEINS_full    43,471     162,088    29    2
      OVCAR-8H         1,890,931  3,946,402  66    2
      Yeast            1,714,644  3,636,546  74    2
      DD               334,925    1,686,092  89    2
      YeastH           3,139,988  6,487,230  75    2
III   amazon0505       410,236    4,878,875  96    22
      artist           50,515     1,638,396  100   12
      com-amazon       334,863    1,851,744  96    22
      soc-BlogCatalog  88,784     2,093,195  128   39
      amazon0601       403,394    3,387,388  96    22



Benchmarks: We choose the two most representative GNN models widely used by previous work (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., 2019) on node classification tasks, which cover different types of aggregation. Specifically, 1) Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is one of the most popular GNN model architectures. It is also the key backbone network for many other GNNs, such as GraphSAGE (Hamilton et al., 2017) and differentiable pooling (Diffpool) (Ying et al., 2018); improving the performance of GCN therefore also benefits a broad range of GNNs. For the GCN evaluation, we use the setting of 2 layers with 16 hidden dimensions, which is also the setting of the original paper (Kipf and Welling, 2017). 2) Attention-based Graph Neural Network (AGNN) (Thekumparampil et al., 2018) differs from GCN in its aggregation function, which computes edge features (via embedding-vector dot-products between source and destination vertices) before the node aggregation. AGNN is also the reference architecture for many other recent GNNs aiming at better model algorithmic performance. For the AGNN evaluation, we use the setting of 4 layers with 32 hidden dimensions.
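For reference, the aggregation patterns of the two benchmarks can be contrasted in a few lines of NumPy. This is a dense sketch of the math only (not of TC-GNN's sparse kernels): GCN needs a single SpMM-like step, while AGNN inserts an SDDMM-like edge-attention step before aggregating. The layer functions and the simplified edge-softmax normalization are our own assumptions.

```python
import numpy as np

def gcn_layer(A, X, W):
    # GCN: SpMM-like aggregation (A @ X), GEMM update (@ W), then ReLU.
    return np.maximum(A @ X @ W, 0.0)

def agnn_layer(A, X):
    # AGNN: SDDMM-like step first -- dot-products only where edges exist...
    mask = (A != 0)
    E = (X @ X.T) * mask
    # ...then a (simplified) edge softmax over each node's neighbors...
    P = np.exp(E) * mask
    P = P / (P.sum(axis=1, keepdims=True) + 1e-9)
    # ...then an SpMM-like aggregation with the attention weights.
    return P @ X

A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])  # toy adjacency
X = np.array([[1., 0.], [0., 1.], [1., 1.]])              # toy embeddings
W = np.eye(2)
h_gcn = gcn_layer(A, X, W)
h_agnn = agnn_layer(A, X)
```

The extra SDDMM-like step in agnn_layer is precisely why AGNN stresses the edge-feature kernel discussed above, while GCN exercises only the SpMM path.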
Baselines: 1) Deep Graph Library (DGL) (Wang et al., 2019) is the state-of-the-art GNN framework on GPUs, which is built with the high-performance CUDA-core-based cuSPARSE (Nvidia) library as the backend and uses Pytorch (Paszke et al., 2019) as its frontend programming interface. DGL significantly outperforms other existing GNN frameworks (Fey and Lenssen, 2019a) over various datasets on many mainstream GNN model architectures. Therefore, we make an in-depth comparison with DGL. 2) Pytorch-Geometric (PyG) (Fey and Lenssen, 2019a) is another GNN framework. PyG leverages the torch-scatter (Fey and Lenssen, 2019b) library (a highly engineered CUDA-core kernel) as the backend support, which highlights its performance on batched small-graph settings. 3) Blocked-SpMM (Nvidia) (bSpMM) accelerates SpMM on TCU. It is included in the recent update (March 2021) of the cuSPARSE library (CUDA 11.2). bSpMM requires the sparse matrix in Blocked-Ellpack format for computation. Its computation on nonzero blocks can be seen as the hybrid sparse-dense solution discussed in Section 3.3. Note that bSpMM has not been incorporated in any existing GNN framework. We also compare TC-GNN with tSparse (Zachariadis et al., 2020) and Triton (Tillet et al., 2019) for non-vendor-developed yet highly optimized kernels on TCU.

Datasets, Platforms, and Metrics: We cover three types of datasets (Table 4), which have been used in previous GNN-related work (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., 2019). Specifically, Type I graphs are the typical datasets used by previous GNN algorithm papers (Kipf and Welling, 2017; Xu et al., 2019; Hamilton et al., 2017). They are usually small in the number of nodes and edges but rich in node embedding information with high dimensionality. Type II graphs (Kersting et al., 2016) are the popular benchmark datasets for graph kernels and are selected as the built-in datasets of PyG (Fey and Lenssen, 2019a). Each dataset consists of a set of small graphs that only have intra-graph edge connections without inter-graph edge connections. Type III graphs (Leskovec and Krevl, 2014; Kipf and Welling, 2017) are large in terms of the number of nodes and edges. These graphs demonstrate high irregularity in their structures, which is challenging for most existing GNN frameworks. The TC-GNN backend is implemented with C++ and CUDA C, and the frontend is implemented in Python. Our major evaluation platform is a server with an 8-core 16-thread Intel Xeon Silver 4110 CPU and an NVIDIA RTX 3090 GPU.
To measure the performance speedup, we report the average latency over 200 end-to-end runs.
As shown in Figure 6(a), TC-GNN achieves 1.70X speedup on average compared to DGL over the three types of datasets across the GCN and AGNN models on end-to-end training.
Type I Graphs: The performance improvement against DGL is significantly higher for GCN than for AGNN. The major reason is their different GNN computation patterns. GCN consists only of a neighbor aggregation phase (an SpMM-like operation) and a node update phase (a GEMM operation), whereas AGNN's aggregation phase additionally requires computing edge attention values (features) based on SDDMM-like operations. Compared with SpMM-like operations, edge attention computation in SDDMM is more sensitive to the irregular sparse graph structure because of its much more intensive computation and memory access. Thus, the performance improvement is relatively lower.
Type II Graphs: TC-GNN also achieves speedup on both GCN and AGNN on the Type II graphs. The speedup on Type II graphs is relatively lower compared with Type I, since the Type II datasets consist of sets of small graphs with very dense intra-graph connections but no inter-graph edges. This lowers the benefit from the sparse graph translation, which shows more effectiveness on highly irregular and sparse graphs. Such a clustered graph structure also benefits cuSPARSE due to more efficient memory access, i.e., less irregular data fetching from the sparse matrix. In addition, for AGNN, TC-GNN still demonstrates evident performance benefits over DGL (CUDA core only), which we mainly attribute to our TCU-based SDDMM-like operation designs that fully exploit the power of the GPU through effective TCU and CUDA core collaboration.
Type III Graphs: The speedup is also evident (on average 1.59X for GCN and 1.51X for AGNN) on graphs with large numbers of nodes and edges and irregular graph structures. The reason is that the high-overhead global memory access can be well reduced through our sparse graph translation. Besides, our dimension-split strategy further facilitates efficient workload sharing among warps by improving the spatial/temporal data locality. On the datasets artist and soc-BlogCatalog, which have a higher average degree within the Type III datasets, we notice better speedup for both GCN and AGNN. This is because 1) more neighbors per node lead to a higher density of nonzero elements within each tile/fragment, which fully exploits the computation benefit of each TCU GEMM operation; and 2) it also facilitates more efficient memory access: in AGNN, for example, one dense row (the embedding of one node) fetched from the dense embedding matrix can be reused more times by applying dot-products between it and the embeddings of many neighbors.
Comparison with PyG We further compare TC-GNN with PyG (Fey and Lenssen, 2019a), another popular GNN computing framework built on the highly engineered torch-scatter (Fey and Lenssen, 2019b) library running on CUDA cores. As shown in Figure 6(b), TC-GNN outperforms PyG on both GCN and AGNN. For GCN, TC-GNN achieves significant speedup on datasets with high-dimensional node embeddings, such as Yeast, through effective TCU acceleration enabled by our TCU-aware sparse graph translation, while reducing synchronization overhead with our highly parallelized TCU-tailored algorithm design. PyG, however, achieves inferior performance because its underlying GPU kernel can only leverage CUDA cores and is thus intrinsically bounded by CUDA core computing performance. Moreover, its kernel implementation heavily relies on high-overhead atomic operations for thread-level synchronization and thus suffers from performance degradation.
Dataset          tSparse (ms)  Triton (ms)  TC-GNN (ms)
amazon0505       18.60         31.64        4.09
artist           9.15          12.86        3.06
com-amazon       13.84         15.50        3.26
soc-BlogCatalog  9.74          14.38        3.59
amazon0601       11.93         21.78        3.41
Comparison with cuSPARSE bSpMM We compare our TC-GNN SpMM kernel with cuSPARSE bSpMM to demonstrate the performance advantage of TC-GNN over the state-of-the-art hybrid sparse-dense solution on TCU. Figure 6(c) shows that TC-GNN outperforms bSpMM on neighbor aggregation. Our SGT technique maximizes the nonzero density of each nonzero tile and significantly reduces the number of nonzero tiles to be processed. bSpMM in cuSPARSE, however, has to comply with a strict input sparsity pattern (indicated in its official API documentation (Nvidia)): for example, all rows in the arrays must have the same number of nonzero blocks. The resulting redundant computation (on padding those non-structural zero blocks) in bSpMM thus leads to inferior performance.
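The padding cost of the Blocked-Ellpack constraint can be illustrated with a small sketch: since every block-row must store the same number of column blocks (the maximum across block-rows), block-rows with fewer nonzero blocks pay for all-zero padding blocks that still cost full TCU GEMMs. The counting function below is our own simplified model of that format, not the cuSPARSE implementation.

```python
import numpy as np

def blocked_ell_cost(A, blk):
    """Count nonzero blocks actually present vs. blocks stored under the
    Blocked-ELL rule that every block-row holds the same number of blocks."""
    nblk_rows = A.shape[0] // blk
    per_row = [sum(1 for j in range(A.shape[1] // blk)
                   if A[i*blk:(i+1)*blk, j*blk:(j+1)*blk].any())
               for i in range(nblk_rows)]
    real = sum(per_row)                     # blocks with actual nonzeros
    padded = max(per_row) * nblk_rows       # uniform ell-width forces padding
    return real, padded

A = np.zeros((8, 8))
A[0, 0] = A[0, 2] = A[0, 4] = 1.0   # block-row 0 touches three 2x2 column blocks
A[2, 0] = A[4, 0] = A[6, 0] = 1.0   # block-rows 1-3 touch one block each
real, padded = blocked_ell_cost(A, blk=2)
```

Here one dense block-row inflates the storage (and GEMM count) of every other block-row to its width, which is exactly the redundancy that SGT's condensed, per-window tiling avoids.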
Comparison with tSparse and Triton We compare the TC-GNN SpMM kernel with tSparse (Zachariadis et al., 2020) and Triton (Tillet et al., 2019) SpMM on the Type III datasets. From Table 5 (Columns 2 and 4), TC-GNN outperforms tSparse on SpMM. The major reason is that TC-GNN can well reduce the graph structural-level irregularity through our novel sparse graph translation scheme to benefit the dense TCU-based computation. In contrast, tSparse only considers partitioning the input sparse matrix into dense/sparse tiles based on their nonzero elements, ignoring the potential of compressing nonzero elements into fewer tiles to reduce the workload. As shown in Table 5 (Columns 3 and 4), TC-GNN also outperforms Triton on SpMM. Triton's block-sparse GEMM for TCU acceleration is designed for block-sparse traditional neural networks (focusing on the sparsity of feature maps), which differ from GNNs (focusing on the sparsity of the graph adjacency matrix) in their significantly larger sparse matrix sizes and more irregular patterns. Real-world graphs are highly irregular in edge connections, which is reflected as a highly scattered distribution of nonzero elements on adjacency matrices.


SGT Effectiveness We conduct a quantitative analysis of SGT in terms of the total number of TCU blocks on graphs without SGT versus graphs with SGT applied. As shown in Figure 7(a), across all types of datasets, our SGT technique significantly reduces the number of traversed TCU blocks (on average 67.47%). The major reason is that SGT largely improves the density of nonzero elements within each TCU block, whereas graphs without SGT exhibit a large number of highly sparse TCU blocks. It is also worth noticing that on Type II graphs this reduction benefit is lower: Type II graphs consist of sets of small subgraphs that only maintain intra-subgraph connections and thus already exhibit dense columns.
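The block-count comparison can be sketched as follows, under our assumed semantics: without SGT, a 16-row window pays one TCU block for every 16-wide column window its nonzeros touch, while with SGT the nonzero columns of the window are condensed first, leaving only ceil(#unique columns / 16) blocks. The function is an illustrative model of the metric, not TC-GNN's actual counting code.

```python
import math

def tcu_blocks(row_ptr, col_idx, blk=16):
    """Count TCU blocks per 16-row window, before and after column condensing."""
    n = len(row_ptr) - 1
    before = after = 0
    for base in range(0, n, blk):
        # unique nonzero columns touched by this row window
        cols = {col_idx[e]
                for r in range(base, min(base + blk, n))
                for e in range(row_ptr[r], row_ptr[r + 1])}
        if cols:
            before += len({c // blk for c in cols})   # scattered column windows
            after += math.ceil(len(cols) / blk)       # condensed by SGT
    return before, after

# One 16-row window touching columns 0, 17, and 40: three scattered
# column windows without SGT, a single condensed tile with SGT.
before, after = tcu_blocks([0, 3], [0, 17, 40])
```

On real graphs the scattered-to-condensed gap is what produces the 67.47% average reduction reported in Figure 7(a).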
SGT Overhead We further evaluate the overhead of our TC-aware sparse graph translation technique. Here we use training for illustration; inference in real GNN application settings would likewise reuse the same graph structure many times (Hamilton et al., 2017; Kipf and Welling, 2017) while only changing the input node embeddings. As shown in Figure 7(b), its overhead is consistently tiny (on average 4.43%) compared with the overall training time. We thus conclude that such a one-time overhead can be amortized during the GNN computation, which demonstrates its applicability to real-world GNNs.
In this paper, we introduce the first GNN acceleration framework on the TCU of GPUs. We design a novel sparse graph translation technique to gracefully fit the sparse GNN workload onto dense TCUs. Our TCU-tailored GPU kernel design maximizes the TCU performance gains for GNN computing through effective CUDA core and TCU collaboration and a set of memory/dataflow optimizations. Our seamless integration with the Pytorch framework further facilitates end-to-end GNN computing with high programmability.
Thekumparampil et al. (2018). Attention-based graph neural network for semi-supervised learning.