1. Introduction
With the surge in popularity of graph neural networks (GNNs) (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017), full-precision GNNs have been widely studied in terms of their algorithms (Kipf and Welling, 2017; Xu et al., 2019) and their execution performance (Ma et al., ; Wang et al., 2019; Fey and Lenssen, 2019a) relative to traditional graph analytical methods, such as Random Walk (Huang et al., 2021). Meanwhile, quantized GNNs (QGNNs) (Feng et al., ; Tailor et al., 2021) have recently attracted considerable attention thanks to their negligible accuracy loss, resilience to malicious attacks, and significantly lower computation and memory overhead. We summarize several key features of GNNs that make them intrinsically suitable for quantization. First, the adjacency matrix of a GNN is naturally well-suited for quantization, since 0/1 values suffice to indicate the existence of edge connections. Thus, using low bits for such information saves both memory and computation. Second, quantizing the weights and node embeddings can also be beneficial, because the tiny precision loss from quantization is largely offset by the fusion of node information through the iterative neighbor aggregation process of GNNs. In addition, the quantization of floating-point numbers can absorb input perturbations from adversarial attacks.
Despite the theoretical promise of QGNNs, realizing these benefits on high-performance GPUs still faces tremendous challenges. Existing GPU-based GNN frameworks (Fey and Lenssen, 2019a; Wang et al., 2019, 2021) are designed and tailored for GPU CUDA cores, which are intrinsically bounded by their peak throughput and can only handle byte-based data types (e.g., int32). Although quantized computation can be achieved via pure algorithmic emulation, the actual bit-level performance gains can hardly be harvested, since all underlying arithmetic operations still have to rely on the well-defined data types from CUDA/C++ libraries.
To tackle these challenges, we turn to a recent GPU hardware feature, the Tensor Core (TC). A modern NVIDIA GPU with the TC design is illustrated in Figure 1. TC provides native support for bit-level operations (XOR, AND), which can serve as the major ingredient of quantized computation. Besides, TC easily beats CUDA cores with significantly higher throughput (more than ) on conventional NN operations (e.g., linear transformation and convolution). This demonstrates the potential of using TC to accelerate QGNNs. However, directly using TC for QGNN computation encounters several challenges.
First, the current TC supports only a limited choice of bitwidths (e.g., 1-bit and 4-bit), which may not meet users' demands for any-bitwidth (e.g., 2-bit) computation. Second, TC was initially tailored for GEMM computation and may not directly fit the context of sparse GNN computation. A huge amount of computation and memory access would be wasted on non-existent edges, because the hard constraint on the TC input matrix tile size (e.g., $8 \times 128$ for 1-bit GEMM) has to be satisfied, which may require excessive zero padding.
Third, low-bit computation causes a compatibility issue, since the existing deep-learning frameworks (Paszke et al., 2019; Abadi et al., ) cannot directly operate on low-bit data types.
Therefore, we remark that several aspects must be considered in order to use TC for QGNNs:
1) Hardware-level Support. This inspires us to explore high-performance GPU hardware features that can efficiently support QGNN computation. Even though it is hard to find a GPU hardware feature that directly supports any-bitwidth QGNN, some indirect hardware features are potentially helpful. For example, NVIDIA introduced the 1-bit TC-based GEMM on the Turing architecture, which can essentially be used to compose any-bitwidth GEMM.
2) Software-level Optimizations. This motivates us to optimize the kernel computation according to the characteristics of QGNN. GNN computation is highly sparse and irregular, so it is intrinsically unfavorable for the dense GPU computation flow tailored for traditional NN operators. Thus, how to handle such input-level irregularity from the computation and memory perspectives is essential to the performance of QGNN. For example, minibatch GNN computation based on subgraph partitioning (Karypis and Kumar, 2009) has been used to increase computation efficiency without compromising model accuracy.
3) Framework-level Integration. This encourages us to bridge the gap between quantized low-bit implementations and deep-learning frameworks built for full-precision computation, so that our whole system-level design can be seamlessly integrated with state-of-the-art mainstream NN frameworks to benefit both execution performance and development productivity.
To this end, we propose the first TC computation framework (Figure 2) to support any-bitwidth QGNN on GPUs.
At the input side, we incorporate METIS (Karypis and Kumar, 2009) graph partitioning to generate a set of dense subgraphs from the highly irregular and sparse input graphs. The insight here is that nodes in real-world graphs are likely to form clusters, and such information can be used to benefit both the efficiency of GNN computing and the algorithmic performance of the model.
At the algorithm side, we leverage the insight that any-bitwidth QGNN computation can always be decomposed into 1-bit computation. Each bit in the output can be generated by different combinations of bits from the input. Thus, we use a quantized low-bit data representation and bit-decomposed computation based on the “atomic” 1-bit type.
At the GPU kernel side, we craft a low-bit computation design tailored for QGNN computation on batched dense subgraphs. We address the key performance bottlenecks of low-bit GNN computing from the memory and computing perspectives. Specifically, we use only a 1-bit binarized representation for the subgraph adjacency matrix, which is memory efficient for representing the presence/absence of edge connections between nodes. Besides, we use a 3D-stacked bit-compression technique for maintaining quantized low-bit node embedding features and weights. In addition, we fully exploit the intra-subgraph sparsity through zero-tile skipping and non-zero tile reuse, which further avoids unnecessary computation and improves data locality.
At the framework side, we integrate QGTC with the state-of-the-art Tensor-based PyTorch (Paszke et al., 2019) framework. We introduce the new notions of the bit-Tensor data type and bit-Tensor computation and wrap them up as a new set of PyTorch API extensions. End users can directly interact with the QGTC PyTorch APIs to access all functionalities. This largely improves programmability and extensibility.
Overall, our key contributions can be summarized as follows:

We propose a novel 1-bit composition technique for any-bitwidth arithmetic design (), which can support QGNN with diverse precision demands.

We introduce a highly efficient implementation of QGNN () built on top of the GPU Tensor Core by applying a series of computation optimizations and memory optimizations.

We integrate QGTC with PyTorch () by introducing the bit-Tensor data type and bit-Tensor computation for better programmability and extensibility.

Extensive experiments demonstrate the advantages of QGTC in terms of better performance compared with the state-of-the-art DGL framework on mainstream GNN models across various datasets.
2. Background and Related Work
In this section, we will introduce the background of GNNs, the quantization of GNNs, and basics of GPU Tensor Core.
2.1. Graph Neural Networks
Graph neural networks (GNNs) are an effective tool for graph-based machine learning. The detailed computing flow of GNNs is illustrated in Figure 3. GNNs compute the node feature vector (embedding) for node $v$ at layer $k+1$ based on the embedding information at layer $k$ ($k \geq 0$), as shown in Equation 1,

$$a_v^{(k+1)} = \mathrm{Aggregate}^{(k+1)}\big(\{h_u^{(k)} \mid u \in N(v)\} \cup \{h_v^{(k)}\}\big), \qquad h_v^{(k+1)} = \mathrm{Update}^{(k+1)}\big(a_v^{(k+1)}\big) \tag{1}$$

where $h_v^{(k)}$ is the embedding vector for node $v$ at layer $k$; $a_v^{(k+1)}$ is the aggregation result obtained by collecting neighbors' information (e.g., node embeddings); $N(v)$ is the neighbor set of node $v$. The aggregation method and the order of aggregation and update may vary across different GNNs. Some methods (Kipf and Welling, 2017; Hamilton et al., 2017) rely only on the neighboring nodes, while others (Veličković et al., 2018) also leverage edge properties that are computed by applying a vector dot-product between source and destination node embeddings. The update function is generally composed of standard NN operations, such as a single fully connected layer or a multi-layer perceptron (MLP) in the form of $w \cdot a_v^{(k+1)} + b$, where $w$ and $b$ are the weight and bias parameters, respectively. The common choices for node embedding dimensions are 16, 64, and 128, and the embedding dimension may change across different layers.

The most recent advancement of GNNs is batched computation (Chiang et al., 2019), which has also been adopted by many state-of-the-art GNN computing frameworks (Wang et al., 2019; Fey and Lenssen, 2019a) for large graphs that cannot easily fit into GPU/CPU memory for direct computation. Batched GNN computation has been shown to deliver good accuracy and runtime performance (Chiang et al., 2019) in comparison with full-graph computation. Batched GNN computation takes several steps. First, it decomposes the input graph by employing a state-of-the-art graph partitioning toolset, such as METIS (Karypis and Kumar, 2009), which minimizes the loss of graph structural information while maximizing the number of edge connections within each subgraph (i.e., improving the subgraph modularity). Second, it feeds the small subgraphs into the GNN model for computation, which generates the node feature vectors for each subgraph. Third, the generated node embeddings can be used in multiple downstream tasks, such as node/graph classification, link prediction, and community detection (Liben-Nowell and Kleinberg, 2007; Grover and Leskovec, 2016; Huang et al., 2021, 2022).
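The aggregate-then-update flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gnn_layer`, the sum aggregator, the single linear update, and the ReLU activation are choices made for this sketch, not details prescribed by the text.

```python
def gnn_layer(adj, h, w, b):
    """One GNN layer: sum-aggregate neighbor embeddings, then a linear update.

    adj: dict mapping a node to the list of its neighbors
    h:   dict mapping a node to its embedding (list of floats) at layer k
    w:   weight matrix as a list of rows (in_dim x out_dim)
    b:   bias vector of length out_dim
    """
    out = {}
    for v, emb in h.items():
        # Aggregate: element-wise sum over the node itself and its neighbors.
        a_v = list(emb)
        for u in adj.get(v, []):
            a_v = [x + y for x, y in zip(a_v, h[u])]
        # Update: a_v * w + b, followed by ReLU (assumed activation).
        out[v] = [
            max(0.0, sum(a_v[i] * w[i][j] for i in range(len(a_v))) + b[j])
            for j in range(len(b))
        ]
    return out


adj = {0: [1, 2], 1: [0], 2: [0]}
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
w = [[1.0, -1.0], [0.5, 2.0]]
b = [0.0, 0.1]
h1 = gnn_layer(adj, h0, w, b)
```

Stacking several such calls, each with its own `w` and `b`, gives a multi-layer model in which the embedding dimension may change from layer to layer.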
2.2. Quantization of GNNs
Besides the research efforts on full-precision GNNs, recent focus has also shifted towards quantized GNNs. For example, Feng et al. (Feng et al., ) propose the first framework for running quantized GNNs, in which several types of quantization schemes can be applied to GNNs (e.g., quantization based on the GNN layer, node degrees, and edge weights). Their experimental results demonstrate the effectiveness of GNN quantization in terms of memory saving and model accuracy. Tailor et al. (Tailor et al., 2021) introduce an architecturally agnostic and stable method, Degree-Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures (e.g., CNNs). They achieve up to speedups on CPU when using int8 compared with float. Compared with full-precision GNNs, low-bit GNNs bring the benefits of model robustness against adversarial attacks and low computation and memory overheads. However, the work of (Feng et al., ) only showcases the theoretical memory and computation benefits via software-level quantization simulation, where the underlying computation is still carried out in 32-bit full-precision float. The work of (Tailor et al., 2021) only demonstrates such gains on CPUs, which has limited applicability in real-world GNN computation settings. This encourages us to harvest the real performance benefits of quantization on the modern, widely used GPU platforms.
2.3. Tensor Core on GPUs
Recent advances in GPU hardware technology have pushed computing power to a new level. Among those innovations, the most significant one is the Tensor Core (TC) on NVIDIA GPUs. Different from the scalar-scalar computation on CUDA cores, TC provides a matrix-matrix compute primitive, which can deliver more than higher computation throughput. The initial version of TC was designed for handling GEMM with half-precision input and full-precision output. More variants (e.g., int8, int4, and int1 inputs with 32-bit unsigned integer (uint32) output) have been introduced with the recent CUDA release (11.0) and newer GPU microarchitectures (e.g., Turing and Ampere).
In particular, TC supports the compute primitive of $D = A \cdot B + C$, where matrix tiles $A$ and $B$ are required to be of a certain precision (e.g., 1-bit), while matrix tiles $C$ and $D$ use uint32. Depending on the input data precision and the version of the GPU microarchitecture, the matrix tile sizes $M$, $N$, and $K$ may have different choices. For example, 1-bit TC computing requires $M = N = 8$ and $K = 128$. Different from CUDA cores, which require users to define the execution flow of each thread (i.e., the work of individual threads), TC requires the collaboration of a warp of threads (32 threads) (i.e., the work of individual warps). This is reflected in two ways. First, before calling TC for computation, all registers of a warp of threads need to collaboratively store the matrix tile into a new memory hierarchy (called Fragment (NVIDIA, 2017)), which allows data sharing across registers. This intra-warp sharing provides opportunities for fragment-based memory optimizations. Second, during the computation, these loaded matrix fragments are taken as the TC input to generate the output fragment, which also consists of the registers from each thread in a warp. Data movements among these registers are also managed collaboratively by a warp of threads.
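To make the 1-bit tile primitive concrete, the sketch below emulates it on the CPU in plain Python. It assumes the AND variant of the bit-level operation followed by a population count, and the $8 \times 8 \times 128$ tile shape of the 1-bit case; the function name `bmma_and_popc` and the use of Python integers as 128-bit rows are illustrative assumptions, and no real Tensor Core API is involved.

```python
M, N, K = 8, 8, 128  # tile shape assumed for the 1-bit case


def bmma_and_popc(a_rows, b_cols, c):
    """Emulate D = popc(A AND B) + C on one 8x8x128 tile.

    a_rows: 8 Python ints, each holding one 128-bit row of A
    b_cols: 8 Python ints, each holding one 128-bit column of B
    c:      8x8 accumulator (list of lists of ints)
    """
    assert len(a_rows) == M and len(b_cols) == N
    return [
        [c[i][j] + bin(a_rows[i] & b_cols[j]).count("1") for j in range(N)]
        for i in range(M)
    ]


full = (1 << K) - 1                      # a row with all 128 bits set
a = [full] * M
b = [1 << j for j in range(N)]           # column j has only bit j set
zero = [[0] * N for _ in range(M)]
d = bmma_and_popc(a, b, zero)            # every entry counts exactly one bit
```

On the hardware, the whole tile is produced by one warp-level operation rather than by 64 separate loops; the emulation only mirrors the arithmetic.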
Prior research efforts have been devoted to accelerating high-performance computing workloads with TC. Abdelfattah et al. (Abdelfattah et al., 2019) process batched small-size GEMM on TC for acceleration. Li and Su (Li and Su, 2020) leverage the 1-bit GEMM capability of the Turing GPU TC for accelerating binary neural network inference. Dakkak et al. (Dakkak et al., ) accelerate the half-precision scan on TC by transforming the scan into a GEMM. Feng et al. (Feng et al., 2021) introduce GEMM-based scientific computing on TC with extended precision and high performance. QGTC enlarges the application range of TC by accelerating any-bitwidth quantized GNN computation, which is not directly covered by any existing research, any release of the cuBLAS (Nvidia, ) or CUTLASS (NVIDIA, ) libraries, or the TC hardware design.
TC can be used in several ways. The simplest one is to call the cuBLAS (Nvidia, ) cublasSgemmEx API. However, the cuBLAS API only supports computation on the most common fixed bitwidths on TC, such as 8-bit and half-precision (16-bit); thus, it cannot support arbitrary bitwidth precision directly. The second way is to call the Warp Matrix Multiply-Accumulate (WMMA) (nvcuda::wmma) API (Nvidia, ) in CUDA C++ to operate TC directly. There are basically four types of operations (Listing 1). In this project, we follow the second way, because it offers more design and implementation flexibility for composing arbitrary-bit computation and eases the optimization (e.g., data loading and reuse) of batched GNN-specific workloads at the GPU kernel level.
3. QGTC Algorithm Design
In this section, we first introduce the basics of low-bit computation. Then we discuss our TC-tailored algorithm design for quantized GNNs.
3.1. 1-bit Composition for Quantized Ops.
Over the last few years, quantized deep neural networks (QDNNs) (Feng et al., ; Tailor et al., 2021) have been extensively studied, largely due to their memory savings and high computation performance. For GNNs, however, similar work is largely lagging behind. The work of (Feng et al., ) demonstrates that GNNs are actually insensitive to quantization: even very low-bit quantization does not lead to evident accuracy loss, because the graph-like aggregation operations amortize the influence of quantization. Another work (Bahri et al., 2020) demonstrates that even binarized GNNs are beneficial in some application scenarios. In this work, we foresee that support for any-bitwidth precision computation on GNNs is vital to satisfy various users' demands (e.g., execution time).
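Given a quantization bit $q$ and a 32-bit floating-point value $\alpha^{(f32)}$, we quantize it as a $q$-bit value $\alpha^{(q)}$ by using

$$\alpha^{(q)} = \left\lfloor \frac{\alpha^{(f32)} - \alpha_{\min}}{\mathit{scale}} \right\rfloor \tag{2}$$

where $\alpha_{\min}$ is an empirical lower bound that can be determined by users or application settings; $\mathit{scale}$ is the ratio between the range ($[\alpha_{\min}, \alpha_{\max}]$, where $\alpha_{\max}$ is an empirical upper bound) and the $q$-bit representation range ($2^q$); $\lfloor \cdot \rfloor$ is the floor function.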
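The quantization rule of Equation 2 can be sketched as a small Python function. The name `quantize`, its argument names, and in particular the final clamp to the representable range $[0, 2^q - 1]$ are assumptions of this sketch; the text itself only specifies the shift, the division by the scale, and the floor.

```python
import math


def quantize(x, x_min, x_max, q):
    """Map a float x to a q-bit integer level using a lower bound and a scale."""
    scale = (x_max - x_min) / (1 << q)        # range divided by 2**q levels
    level = math.floor((x - x_min) / scale)   # shift, rescale, floor
    # Assumed step: clamp values at or beyond the empirical bounds.
    return min(max(level, 0), (1 << q) - 1)


levels = [quantize(v, 0.0, 1.0, 2) for v in (0.0, 0.3, 0.5, 1.0)]
```

With bounds 0.0 and 1.0 and `q = 2`, the scale is 0.25, so the four inputs above land on levels 0, 1, 2, and 3, the last one only because of the clamp.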
For any-bitwidth computation on quantized values, we propose a new type of arithmetic based on the “atomic” 1-bit computation widely used in binarized NNs (Hubara et al., 2016).
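Any-bitwidth Scalar-Scalar Multiplication: Assume we have a 3-bit scalar value ($a$) and multiply it with a 2-bit scalar value ($b$). We can first represent these two values as

$$a = a_2 \cdot 2^2 + a_1 \cdot 2^1 + a_0 \cdot 2^0, \qquad b = b_1 \cdot 2^1 + b_0 \cdot 2^0 \tag{3}$$

where $a_i$ and $b_j$ indicate the bit value (0/1) at the corresponding bit position after bit decomposition. By following the general rule of multiplication, we can get $a \cdot b$ as

$$a \cdot b = (a_2 \cdot 2^2 + a_1 \cdot 2^1 + a_0 \cdot 2^0)(b_1 \cdot 2^1 + b_0 \cdot 2^0) \tag{4}$$

and through simplification we get

$$a \cdot b = a_2 b_1 \cdot 2^3 + (a_2 b_0 + a_1 b_1) \cdot 2^2 + (a_1 b_0 + a_0 b_1) \cdot 2^1 + a_0 b_0 \cdot 2^0 \tag{5}$$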
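The scalar composition above can be checked with a short Python sketch: the product is rebuilt from 1-bit AND operations, shifts, and additions only. The helper names `bits` and `bit_composed_mul` are hypothetical and chosen for this illustration.

```python
def bits(value, nbits):
    """Bit decomposition: the 0/1 value at each position, lowest bit first."""
    return [(value >> i) & 1 for i in range(nbits)]


def bit_composed_mul(a, a_bits, b, b_bits):
    """Multiply two small unsigned integers using only 1-bit products."""
    total = 0
    for i, ai in enumerate(bits(a, a_bits)):
        for j, bj in enumerate(bits(b, b_bits)):
            # The 1-bit product ai * bj is a logical AND; its weight is 2**(i + j).
            total += (ai & bj) << (i + j)
    return total


product = bit_composed_mul(5, 3, 3, 2)  # 3-bit value 5 times 2-bit value 3
```

The same function works for any pair of bitwidths, which is the property the composition relies on.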
Any-bitwidth Vector-Vector Multiplication: We extend the any-bitwidth scalar-scalar computation to any-bitwidth vector-vector computation between a 3-bit vector $\vec{a}$ and a 2-bit vector $\vec{b}$, each of which has $k$ elements. The above scalar-scalar multiplication formula can therefore be extended to $k$-dimensional vector-vector multiplication as follows,

$$\vec{a} \cdot \vec{b} = (\vec{a}_2 \cdot \vec{b}_1) \cdot 2^3 + (\vec{a}_2 \cdot \vec{b}_0 + \vec{a}_1 \cdot \vec{b}_1) \cdot 2^2 + (\vec{a}_1 \cdot \vec{b}_0 + \vec{a}_0 \cdot \vec{b}_1) \cdot 2^1 + (\vec{a}_0 \cdot \vec{b}_0) \cdot 2^0 \tag{6}$$

where $\vec{a}_i$ and $\vec{b}_j$ collect the bit at position $i$ (respectively $j$) of every element of $\vec{a}$ and $\vec{b}$.

From the above formula, we can see that in order to compute the result of an any-bitwidth vector-vector multiplication, we first do bit decomposition on all elements of each vector, then do bit-bit multiplication between elements from each vector, and finally do bit shifting and reduction to get the final result. For example, after bit decomposition of $\vec{a}$ and $\vec{b}$, we get $\vec{a}$ at bit position 2 as $\vec{a}_2 = (a_2^{1}, \ldots, a_2^{k})$ and $\vec{b}$ at bit position 1 as $\vec{b}_1 = (b_1^{1}, \ldots, b_1^{k})$, where $a_2^{i}, b_1^{i} \in \{0, 1\}$. From the multiplication and addition, we get the multiplication result of $\vec{a} \cdot \vec{b}$ at bit position 3. Such a 1-bit vector-vector multiplication can be effectively implemented as

$$\vec{a}_2 \cdot \vec{b}_1 = \mathrm{popcnt}(\vec{a}_2 \,\&\, \vec{b}_1) \tag{7}$$

where $\mathrm{popcnt}$ counts the total number of 1s in the bit representation of the result (e.g., $\mathrm{popcnt}$ returns 3 for the binary number $1011$). A similar procedure can be applied to generate the results at bit positions 0, 1, and 2. After all these individual temporary results are ready, we do bit shifting and reduction to get the final result. Based on such any-bitwidth vector-vector results, we can easily derive the any-bitwidth matrix-matrix multiplication scheme, where each element in the output matrix can be seen as the result of an any-bitwidth vector-vector multiplication.
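The vector case can be sketched the same way, this time packing each bit position of a vector into a single machine word so that one AND plus one population count replaces $k$ separate 1-bit products. The helper names `pack_bit_plane`, `popcnt`, and `bit_dot` are assumptions made for this illustration, and Python integers stand in for hardware words.

```python
def pack_bit_plane(vec, bit):
    """Pack bit `bit` of every element of vec into one Python int."""
    word = 0
    for idx, value in enumerate(vec):
        word |= ((value >> bit) & 1) << idx
    return word


def popcnt(word):
    """Number of 1s in the binary representation of word."""
    return bin(word).count("1")


def bit_dot(a, a_bits, b, b_bits):
    """Dot product of two unsigned integer vectors via AND, popcnt, and shifts."""
    a_planes = [pack_bit_plane(a, i) for i in range(a_bits)]
    b_planes = [pack_bit_plane(b, j) for j in range(b_bits)]
    total = 0
    for i, ap in enumerate(a_planes):
        for j, bp in enumerate(b_planes):
            total += popcnt(ap & bp) << (i + j)   # shift by the combined bit weight
    return total


a = [5, 7, 2, 0]   # 3-bit values
b = [3, 1, 2, 3]   # 2-bit values
result = bit_dot(a, 3, b, 2)
```

Here `result` equals the ordinary dot product 5·3 + 7·1 + 2·2 + 0·3 = 26, obtained from six AND-plus-popcnt steps.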
3.2. Quantized Computation in GNNs
Applying any-bitwidth precision computation to GNNs reveals two major characteristics. First, the adjacency matrix ($A$) of a GNN only needs binary (1-bit) numbers to represent the presence/absence of edges. Second, the node embedding matrix ($X$) and the weight matrix ($W$) can be represented with any bitwidth to meet different precision demands.
As described in Algorithm 1, each layer of an any-bitwidth GNN consists of a neighbor aggregation phase and a node embedding update phase. Specifically, neighbor aggregation is carried out through a 1-bit-by-$s$-bit matrix multiplication, and the node update is carried out through an $s$-bit-by-$t$-bit matrix multiplication. At Lines 1 to 3, we apply bitDecompose to the subgraph adjacency matrix $A$, the embedding matrix $X$, and the weight matrix $W$. For each scalar int32 number, bitDecompose first quantizes it to another int32 number in an $n$-bit data range by using Equation 2. Then, it applies bit shifting to extract each bit (0/1) from the quantized int32 number. Our 3D-stacked bit compression (Section 4.2) happens after these two steps have been applied to each element of a matrix, and it packs the extracted bits of the whole matrix together. Here, for ease of algorithm description, we maintain the different bits of a matrix as a list, e.g., the first list entry holds bit 0 of all elements in the matrix. At Lines 5 to 7, we apply bit-matrix multiplication between each bit matrix from $X$ and the binary 1-bit matrix $A$; the results of this step are still a set of bit matrices and are stored in a list. At Lines 8 to 14, we apply a similar bit-matrix multiplication between the aggregated embeddings and $W$, and the results of this step are stored as bit matrices as well for the following final-result generation. To avoid any data overflow during the reduction (Lines 15 to 19), the reduction result should also use a full-bit data type (e.g., int32). For large graphs, the adjacency matrices cannot easily fit into the GPU device memory directly. In this scenario, we employ METIS (Karypis and Kumar, 2009) for graph partitioning and run the GNN as batched subgraph computation, as is done in the popular Cluster-GCN (Chiang et al., 2019) design.
Considering that the number of subgraphs generated by METIS (Karypis and Kumar, 2009) is usually within a reasonable range (2,000 to 20,000), such batched GNN computation can be accommodated on a single modern GPU without violating its memory constraints. Note that to reduce the runtime overhead, the bit decomposition of the matrices $A$ and $W$ can be precomputed and cached before the GNN computation at each layer, because matrix $A$ can be reused across different GNN layers of the same subgraph, while matrix $W$ can be reused across different subgraphs at the same GNN layer.
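The layer-level flow (bit decomposition, 1-bit matrix products, then shift-and-add reduction into a full-width accumulator) can be sketched in Python as follows. This is not a transcription of Algorithm 1: the function names are hypothetical, the matrices are plain nested lists, and the `requantize` clamp between aggregation and update is an assumed stand-in for whatever quantization step brings the aggregated values back to $s$ bits.

```python
def bit_decompose(mat, nbits):
    """Split an integer matrix into nbits binary matrices, lowest bit first."""
    return [[[(v >> k) & 1 for v in row] for row in mat] for k in range(nbits)]


def bit_mm(x, y):
    """Product of two 0/1 matrices; each entry is a sum of 1-bit ANDs."""
    inner, cols = len(y), len(y[0])
    return [[sum(row[p] & y[p][j] for p in range(inner)) for j in range(cols)]
            for row in x]


def any_bit_mm(x, x_bits, y, y_bits):
    """Compose an x_bits-by-y_bits product from 1-bit products and shifts."""
    rows, cols = len(x), len(y[0])
    out = [[0] * cols for _ in range(rows)]   # full-width accumulator
    for i, xp in enumerate(bit_decompose(x, x_bits)):
        for j, yp in enumerate(bit_decompose(y, y_bits)):
            partial = bit_mm(xp, yp)
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)
    return out


def requantize(mat, nbits):
    """Clamp to the nbits range; an assumed placeholder for quantization."""
    top = (1 << nbits) - 1
    return [[min(v, top) for v in row] for row in mat]


def qgnn_layer(adj, x, s, w, t):
    agg = any_bit_mm(adj, 1, x, s)      # aggregation: 1-bit by s-bit
    agg = requantize(agg, s)            # back to s bits (assumed step)
    return any_bit_mm(agg, s, w, t)     # update: s-bit by t-bit


adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]   # 1-bit adjacency with self loops
x = [[1, 2], [3, 0], [2, 1]]              # embeddings, treated as 3-bit
w = [[1, 0], [2, 3]]                      # weights, treated as 2-bit
out = qgnn_layer(adj, x, 3, w, 2)
```

In this small example the aggregated values stay below 8, so the clamp is a no-op and `out` matches the ordinary integer product `adj @ x @ w`.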
4. Implementation
4.1. Subgraph Partitioning and Batching
Real-world graphs usually come with a large number of nodes and a highly irregular graph structure (edge connections). This brings two levels of difficulty for GNN computing. The first is memory consumption, since GPU device memory cannot accommodate all nodes, edges, and node embedding features at the same time. The second is inefficient execution, since irregular and sparse edge connections lead to low memory access efficiency and poor computation performance. To this end, in QGTC, we combine the state-of-the-art graph partitioning technique METIS (Karypis and Kumar, 2009) with subgraph batch processing (Chiang et al., 2019) to handle input graphs of different sizes effectively. Compared with other solutions, such as graph clustering approaches (Boldi et al., 2011; Raghavan et al., 2007; Karantasis et al., 2014) and BFS-based methods (Cuthill and McKee, ), METIS achieves better quality of the captured subgraph partitions (more edges in each subgraph) and significantly higher runtime performance owing to its partial parallelization. Note that the number of subgraphs/partitions is determined by users and is passed as a runtime parameter to METIS.
After the subgraph partitioning, we conduct a batching step for GNN computation on GPUs. This step gathers a set of subgraph partitions based on a user-defined batch size. Later, during GNN computing, subgraphs are loaded into GPU memory batch by batch. Using the partitioning and batching strategy for GNN computing gives users control of workloads at two levels of granularity. First, the workload granularity is defined by the number of subgraphs/partitions. This governs the size of each subgraph partition and the edge connection density of each subgraph. In general, a larger number of subgraphs/partitions leads to denser edge connections within each subgraph, which may bring better computation and memory locality. Second, the processing granularity is controlled by the batch size. This determines the size of the graphs that are fit into the GPU at each round of execution. The batch size should be selected to maximize the utilization of the GPU while respecting its computation and memory resource constraints.
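The two granularity knobs can be illustrated with a small Python sketch that takes a node-to-partition assignment as given. The assignment list `part_of` is assumed to come from a partitioner such as METIS, which is not invoked here; the function names, the dictionary layout, and the choice to drop cross-partition edges are assumptions of this sketch.

```python
def build_subgraphs(edges, part_of):
    """Group nodes by partition id and keep only intra-partition edges."""
    members = {}
    for node, pid in enumerate(part_of):
        members.setdefault(pid, []).append(node)
    subgraphs = {pid: {"nodes": nodes, "edges": []}
                 for pid, nodes in members.items()}
    for u, v in edges:
        if part_of[u] == part_of[v]:          # cross-partition edges are dropped
            subgraphs[part_of[u]]["edges"].append((u, v))
    return [subgraphs[pid] for pid in sorted(subgraphs)]


def make_batches(subgraphs, batch_size):
    """Gather subgraphs into consecutive batches of at most batch_size."""
    return [subgraphs[i:i + batch_size]
            for i in range(0, len(subgraphs), batch_size)]


edges = [(0, 1), (1, 2), (2, 3), (4, 5), (3, 4)]
part_of = [0, 0, 1, 1, 2, 2]                  # hypothetical partitioner output
subgraphs = build_subgraphs(edges, part_of)
batches = make_batches(subgraphs, batch_size=2)
```

The number of partitions fixes how many entries `subgraphs` has and how dense each one is, while `batch_size` fixes how many of them would be moved to the device per round.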
4.2. 3D-Stacked Bit Compression
Existing NN frameworks are developed for full-precision computation, which leads to two major challenges. First, a low-bit quantized data type cannot directly borrow a full-precision data type as the “vehicle” for computation, because full-precision data types such as float and int32 bring no memory or computation savings. Second, low-bit quantization does not fit any standard bit alignment, since its bit-level boundary is mostly not divisible by the size of a byte (8 bits), making it hard to retrieve the actual value.
To this end, we propose a novel 3D-stacked bit-compression technique to handle any-bitwidth data types effectively. The major idea is to compress any-bitwidth input with 32-bit alignment for ease of value retrieval and memory alignment. As exemplified in Figure 4(a), we have a 3-bit input matrix with the shape of $M \times K$. For each bit of the elements in the matrix, we store it in a 1-bit matrix ($M \times K$), and these bit matrices are stacked along the vertical axis. During the computation of the any-bitwidth matrix multiplication $C = A \cdot B$, two types of 3D-stacked bit compression are employed. For matrix $A$, we use column-wise compression with 32-bit alignment, as illustrated in Figure 4(b). The main reason for choosing column-wise compression is that the matrix multiplication benefits from coalesced across-column memory access along each row of matrix $A$. The 32-bit alignment benefits read performance through coalesced loading from global memory to the fragment. After the compression on matrix $A$ (1-bit, $M \times K$), we get a 32-bit compressed 3-bit $A$ with the shape of $3 \times \mathrm{PAD8}(M) \times \mathrm{PAD128}(K)/32$, where PAD8 and PAD128 pad the rows/columns whose count is not divisible by the basic TC computing size ($8$ and $128$).
For matrix $B$, we use row-wise compression with 32-bit alignment, as shown in Figure 4(c), which benefits the across-row access along each column of matrix $B$. After the compression on matrix $B$ (1-bit, $K \times N$), we get a 32-bit compressed 2-bit $B$ with the shape of $2 \times \mathrm{PAD128}(K)/32 \times \mathrm{PAD8}(N)$ for the output layer. Note that if $B$ belongs to a hidden layer of a GNN model, the padding strategy on matrix $B$ is slightly different, considering that the result matrix $C$ will become a new matrix $A$ in the next layer, which demands 128-bit padding. In this case, to avoid additional padding overhead, we get the 2-bit $B$ with the shape of $2 \times \mathrm{PAD128}(K)/32 \times \mathrm{PAD128}(N)$.
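The column-wise packing with padding can be sketched in Python as follows. The sketch assumes that rows are padded to a multiple of 8, that the packed dimension is padded to a multiple of 128 bits and stored as 32-bit words, and that bits are placed lowest-bit-first inside each word; the names `pad_to` and `pack_columnwise` and the bit order are illustrative assumptions.

```python
def pad_to(n, multiple):
    """Round n up to the next multiple."""
    return (n + multiple - 1) // multiple * multiple


def pack_columnwise(mat, nbits):
    """Pack an M x K matrix of nbits-wide values into nbits stacked bit planes.

    Each plane has PAD8(M) rows, and every row holds PAD128(K) bits stored as
    32-bit words, giving a result of shape nbits x PAD8(M) x PAD128(K)/32.
    """
    rows, cols = len(mat), len(mat[0])
    prow, pcol = pad_to(rows, 8), pad_to(cols, 128)
    planes = [[[0] * (pcol // 32) for _ in range(prow)] for _ in range(nbits)]
    for b in range(nbits):
        for r in range(rows):
            for c in range(cols):
                if (mat[r][c] >> b) & 1:
                    planes[b][r][c // 32] |= 1 << (c % 32)   # bit c of row r
    return planes


mat = [[0] * 40 for _ in range(3)]   # a 3 x 40 matrix of 3-bit values
mat[0][0] = 5                        # binary 101: set in planes 0 and 2
mat[0][33] = 2                       # binary 010: set in plane 1, second word
planes = pack_columnwise(mat, 3)
```

For this input the packed shape is 3 x 8 x 4 words: 96 words of storage, of which all but the first three rows of each plane are padding.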
Compared with the previous work (Cowan et al., 2020) that also leverages bit-level data packing, there are several differences. The first is the padding strategy. In QGTC, the padding on different tensor dimensions can differ, whereas bit-level padding is ignored in (Cowan et al., 2020). For example, QGTC may apply PAD8 or PAD128, depending on whether the following computation is carried out in a low-bit or a 32-bit format, thereby avoiding unnecessary conversion. The second is the packing data type. The work of (Cowan et al., 2020) uses uint4 for packing 128 contiguous bits, while QGTC uses a 32-bit format for better interoperability with PyTorch. The third is the bit-level layout. The work of (Cowan et al., 2020) does not consider further bit-level layout optimization. In QGTC, for the GEMM operation ($C = A \cdot B$), we use column-wise compression for matrix $A$ and row-wise compression for matrix $B$.
4.3. Zero-tile Jumping
Even though subgraph partitioning such as METIS (Karypis and Kumar, 2009) makes each subgraph denser (more edge connections within each subgraph), there are still some TC tiles (i.e., the input matrix tiles of a single TC computation) that are filled with all-zero elements. Directly iterating through these zero tiles would incur unnecessary memory cost (loading data from global memory to thread-local registers) and computation cost (processing 1-bit TC-GEMM on an input adjacency matrix tile that contains only zeros). Based on this observation, we introduce a novel zero-tile jumping technique to reduce the unnecessary computation by using bit-wise OR operations and warp-level synchronization primitives.
As illustrated in Figure 5, each 1-bit TC-GEMM works on a register fragment with a tile size of $8 \times 128$ bits. This can be partitioned into $8 \times 4$ int32 elements. To check whether the tile contains only zeros, we first employ only 8 threads from a warp to each fetch one uint4_v vector (each uint4_v element in CUDA consists of 4 int32 elements placed at contiguous memory addresses). The reason for using uint4_v is to maximize memory access efficiency by issuing fewer global memory requests. Once all uint4_v elements have been loaded, each thread applies a bit-wise OR across all 4 int32 elements, which checks whether its row of the TC tile is all-zero. The next step is to tell whether the whole tile is all-zero across the different rows; for this, we use a warp-level primitive to synchronize the information across the 8 active threads in the warp. This step generates an int32 number. If this number is zero, all elements in the input TC tile are zero. Otherwise, we still need to conduct the 1-bit TC-GEMM on the current tile. We give a more quantitative analysis of zero-tile jumping in Section 6.3.
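The two-step zero check can be emulated on the CPU with the Python sketch below. Each tile is modelled as 8 rows of four 32-bit words; the per-row OR mirrors the work of one thread, and the final OR over the 8 row results stands in for the warp-level combination step. The function names are hypothetical, and no CUDA intrinsic is being modelled precisely.

```python
def tile_is_zero(tile_rows):
    """tile_rows: 8 rows, each a 4-tuple of 32-bit words (one 128-bit row)."""
    # Step 1: each of the 8 "threads" ORs the four words of its own row.
    row_flags = [w0 | w1 | w2 | w3 for (w0, w1, w2, w3) in tile_rows]
    # Step 2: combine the per-row results, standing in for the warp-level step.
    combined = 0
    for flag in row_flags:
        combined |= flag
    return combined == 0


def nonzero_tiles(tiles):
    """Indices of the tiles that still need a 1-bit GEMM."""
    return [idx for idx, tile in enumerate(tiles) if not tile_is_zero(tile)]


empty = [(0, 0, 0, 0)] * 8
busy = list(empty)
busy[3] = (0, 0, 1 << 5, 0)          # a single edge in row 3
todo = nonzero_tiles([empty, busy, empty])
```

Only the tile with at least one set bit survives the check, so the two empty tiles would be skipped without any load into the fragment or any GEMM call.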
4.4. Non-Zero Tile Reuse
In addition to jumping over the zero tiles, we further consider reusing the non-zero tiles to improve data locality. In the aggregation step of the GNN computation, we generate the output feature map at different bit levels separately. For example, when we choose a 1-bit subgraph adjacency matrix and a 4-bit feature embedding matrix, we execute the iteration 4 times to generate the output. One straightforward solution, called cross-bit reduction, is to generate the complete output matrix tile at each bit level first. This requires loading the matrix tiles repeatedly, as shown in Figure 6(a). The problem is that each non-zero tile of the adjacency matrix is loaded again for every bit matrix of the embedding matrix.
In fact, we can reorder the steps so as to maximize the benefit of each non-zero tile of the subgraph adjacency matrix. As shown in Figure 6(b), we introduce a cross-tile reduction strategy. Specifically, for each loaded non-zero fragment, we use it to generate an output tile at all bit levels and do a localized reduction (only on the current tile) to generate a partial aggregation result. Once this part is done, we move forward to the next non-zero tile and repeat the same process until all non-zero tiles have been processed. The complexity of loading each non-zero tile can thus be reduced from $O(s)$ to $O(1)$, where $s$ is the number of bits for node embeddings.
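The effect of the loop reordering can be seen in a small Python simulation that counts how often an adjacency tile is fetched under each strategy while checking that both produce the same result. To keep it short, each tile is reduced to a single bit mask and the output to a single number; the function name `aggregate` and this simplified data layout are assumptions of the sketch.

```python
def popcnt(word):
    return bin(word).count("1")


def aggregate(tiles, planes, order):
    """Accumulate popcnt(tile AND plane) << bit over all tiles and bits.

    tiles:  adjacency bit masks, one per non-zero tile
    planes: planes[b][t] is the embedding bit mask for bit b matching tile t
    Returns (result, number_of_tile_loads).
    """
    loads, result = 0, 0
    if order == "cross-bit":
        for b, plane in enumerate(planes):        # one full pass per bit level
            for t in range(len(tiles)):
                tile = tiles[t]
                loads += 1                        # tile fetched again per bit
                result += popcnt(tile & plane[t]) << b
    else:                                         # "cross-tile"
        for t in range(len(tiles)):
            tile = tiles[t]
            loads += 1                            # tile fetched once, then reused
            for b, plane in enumerate(planes):
                result += popcnt(tile & plane[t]) << b
    return result, loads


tiles = [0b1011, 0b0110]
planes = [[0b0011, 0b0100], [0b1000, 0b0010], [0b1111, 0b0000], [0b0001, 0b0110]]
by_bit = aggregate(tiles, planes, "cross-bit")
by_tile = aggregate(tiles, planes, "cross-tile")
```

With two tiles and 4-bit embeddings, the cross-bit order performs 8 tile loads and the cross-tile order 2, while both accumulate the same value.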
4.5. Inter-layer Kernel Fusion
Across the GNN layers, we transfer data in low-bit form. Specifically, the output of one hidden layer is handed over directly to the next layer as its input. We use several strategies. First, we apply data quantization and bit decomposition at the end of the computation kernels, such as neighbor aggregation and node update. This avoids writing the result to the slow global memory and applying the data manipulation again. Second, standalone activation function kernels, such as ReLU and tanh, can be effectively fused into our computation kernel as a device function, which directly operates on the results in shared memory to achieve high performance. For the batch normalization (BN) layers that follow the graph convolution layers, we can also do layer fusion based on

$$\hat{x}_i = \frac{x_i - E[x]}{\sqrt{Var[x] + \epsilon}}, \qquad y_i = \gamma \cdot \hat{x}_i + \beta \tag{8}$$

where $\epsilon$, $\gamma$, and $\beta$ are the BN parameters that can be incorporated into the low-bit convolutional kernel to avoid launching a BN kernel. One thing worth noting is that the computation at the hidden layers and at the output layer is slightly different. For hidden layers, each kernel applies quantization and bit decomposition to the output activation, since the next layer relies on low-bit data as the input for computation. For the last layer, once the full-precision accumulation is complete, the kernel directly outputs the full-precision result for the softmax layer (considering the node classification task) to generate logits that demand high precision.
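One common way to fuse batch normalization into a preceding kernel is to fold it into a single multiply-add per element, which the Python sketch below illustrates for scalar inputs. It assumes the standard BN form with fixed (inference-time) mean and variance; the function names and the default `eps` value are assumptions of this sketch rather than details taken from the text.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference BN: normalize with fixed statistics, then scale and shift."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Precompute (scale, shift) so that BN(x) == scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


scale, shift = fold_batch_norm(gamma=1.5, beta=0.2, mean=3.0, var=4.0)
fused = [scale * x + shift for x in (0.0, 3.0, 7.5)]
reference = [batch_norm(x, 1.5, 0.2, 3.0, 4.0) for x in (0.0, 3.0, 7.5)]
```

Because `scale` and `shift` are constants per channel, a kernel can apply them to its accumulator just before writing the output, with no separate BN launch.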
4.6. BandwidthOptimized Subgraph Packing
During the GNN computation of the subgraphs, data communication between the CPU host and GPU device is also nonavoidable. It will swap the subgraph data (such as edge lists and node embedding) in/out of the GPU device. One basic approach is to transfer the dense adjacent matrix considering that a single subgraph is generally within the range of the modern GPU memory. However, this could easily lead to a huge amount of data traffic between the CPU and GPU host. The transferring performance is heavily bounded by PCIe bandwidth (32 GB/s for PCIe 4.0x16) between the CPU host and the GPU device. For the node embedding matrix, the current practice is to transfer the node embedding matrix by initializing another standalone PCIe transferring, which incurs additional overheads and is unable to maximize the bandwidth usage.
To overcome these issues, we employ a new strategy called bandwidth-optimized subgraph packing. Instead of transferring the large dense adjacency matrix, we transfer only the sparse edge list and then convert it to the dense matrix on the GPU device. This trades low-cost on-device computation for high-cost PCIe data transfer. In addition, to transfer the node embedding matrix at the same time, we first pack the sparse edge list and the node feature embedding into a compound memory object (occupying consecutive memory space) on the host, and then initiate a single transfer of this compound object from the host CPU to the GPU device.
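The packing step can be illustrated with a small host-side sketch. It is only an assumption-laden model of the idea: the buffer layout (a three-integer header, then the edge list, then the features), the function names `pack_subgraph` and `unpack_subgraph`, and the use of Python's `struct` module are all hypothetical, and the "device side" here is ordinary Python standing in for the on-GPU sparse-to-dense conversion.

```python
import struct

HEADER_BYTES = 12  # three little-endian uint32 values

def pack_subgraph(num_nodes, edges, features):
    """Pack header, edge list, and node features into one contiguous buffer."""
    dim = len(features[0])
    flat_edges = [v for edge in edges for v in edge]
    flat_feats = [x for row in features for x in row]
    header = struct.pack("<3I", num_nodes, len(edges), dim)
    body = struct.pack(f"<{len(flat_edges)}I{len(flat_feats)}f",
                       *flat_edges, *flat_feats)
    return header + body

def unpack_subgraph(buf):
    """Receiver side: split the buffer and expand the edge list to a dense matrix."""
    num_nodes, num_edges, dim = struct.unpack_from("<3I", buf, 0)
    offset = HEADER_BYTES
    flat_edges = struct.unpack_from(f"<{2 * num_edges}I", buf, offset)
    offset += 8 * num_edges  # two uint32 values per edge
    flat_feats = struct.unpack_from(f"<{num_nodes * dim}f", buf, offset)

    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_edges):
        adj[flat_edges[2 * i]][flat_edges[2 * i + 1]] = 1
    feats = [list(flat_feats[r * dim:(r + 1) * dim]) for r in range(num_nodes)]
    return adj, feats

edges = [(0, 1), (1, 2), (2, 0)]
features = [[0.5, 1.0], [2.0, -1.5], [0.25, 4.0]]
buf = pack_subgraph(3, edges, features)      # one buffer, one transfer
adj, feats = unpack_subgraph(buf)
```

In this sketch a single buffer replaces two separate transfers, and the dense adjacency matrix is only ever materialized on the receiving side.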
5. Integration with PyTorch
Besides the highly efficient kernel design and data transfer optimization, we integrate QGTC with the popular PyTorch framework for better usability and programmability. There are two key technical challenges. The first is how to represent quantized low-bit numbers in Tensor-based frameworks that are built on byte-based data types (e.g., int32). The second is how to perform valid computation between quantized low-bit numbers and those well-defined byte-based numbers. For example, how can we obtain the correct result when multiplying a 2-bit number by a 32-bit integer? To this end, we introduce two new techniques.
Bit-Tensor Data Type: We use the 32-bit IntTensor in PyTorch as the "vehicle" for holding any-bitwidth quantized data, and we leverage our 3D-stacked bit compression technique (Section 4.2) to package the quantized data. We offer a PyTorch API, Tensor.to_bit(nbits), for this data type conversion. Note that existing PyTorch APIs, such as print, are only defined for complete data types, such as Int. Therefore, to access the element values of a bit-Tensor, we provide Tensor.to_val(nbits), which decodes a bit-Tensor into an int32 Tensor (converting each element from a low-bit number to an int32 number). This keeps our design compatible with existing PyTorch functionality.
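To make the encode/decode round trip concrete, here is a minimal sketch in plain Python of packing low-bit values into 32-bit words by bit plane and decoding them again. It is not the QGTC extension: the free functions `to_bit` and `to_val` only borrow the names of the paper's Tensor methods, and the layout (one list of 32-bit words per bit plane, least significant plane first) is an assumption chosen for illustration.

```python
def to_bit(values, nbits):
    """Split each value into `nbits` bit planes; pack each plane into 32-bit words."""
    if any(v < 0 or v >= (1 << nbits) for v in values):
        raise ValueError("value does not fit in the requested bitwidth")
    words_per_plane = (len(values) + 31) // 32
    planes = []
    for b in range(nbits):
        words = [0] * words_per_plane
        for i, v in enumerate(values):
            if (v >> b) & 1:
                words[i // 32] |= 1 << (i % 32)
        planes.append(words)
    return planes

def to_val(planes, length):
    """Decode bit planes back into ordinary integers."""
    values = [0] * length
    for b, words in enumerate(planes):
        for i in range(length):
            if (words[i // 32] >> (i % 32)) & 1:
                values[i] |= 1 << b
    return values

quantized = [3, 0, 2, 1]             # 2-bit values
planes = to_bit(quantized, nbits=2)  # two planes, one 32-bit word each
decoded = to_val(planes, len(quantized))
```

With this layout, four 2-bit values occupy two 32-bit words in total, and up to 32 values share the same two words, which is the storage saving a 32-bit integer carrier makes possible.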
Bit-Tensor Computation: We handle two types of computation: 1) operations that involve only bit-Tensors and 2) operations that involve both a bit-Tensor and a float or int32 Tensor. For the first type, we build two APIs, depending on whether the output should be int32 or remain a quantized low-bit bit-Tensor. For any-bitwidth MM with low-bit output, the API is bitMM2Bit(C, A, B, bit_A, bit_B, bit_C), where A and B are bit-Tensors and bit_A, bit_B, and bit_C are bitwidth parameters. For any-bitwidth MM with int32 output, the API is bitMM2Int(C, A, B, bit_A, bit_B). For the second type, we first decode the bit-Tensor into a float/int32 Tensor by using Tensor.to_val(nbits), and then call the official PyTorch APIs for regular full-precision computation.
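The arithmetic behind an any-bitwidth matrix multiplication with integer output can be sketched as a sum of 1-bit products: each operand is split into bit planes, every pair of planes is combined with AND plus popcount, and the partial result is weighted by the corresponding power of two. The code below is a plain-Python illustration under that assumption. The names `pack_rows` and `bit_mm_to_int` are hypothetical and do not reproduce the signature of `bitMM2Int`.

```python
def pack_rows(matrix, bit):
    """One Python int per row; bit k of that int is bit `bit` of matrix[row][k]."""
    return [sum(((v >> bit) & 1) << k for k, v in enumerate(row))
            for row in matrix]

def bit_mm_to_int(A, B, bit_a, bit_b):
    """Integer product A @ B computed only from 1-bit AND + popcount operations."""
    Bt = [list(col) for col in zip(*B)]   # pack B along the reduction dimension
    rows, cols = len(A), len(Bt)
    C = [[0] * cols for _ in range(rows)]
    for i in range(bit_a):
        a_rows = pack_rows(A, i)
        for j in range(bit_b):
            b_cols = pack_rows(Bt, j)
            for r in range(rows):
                for c in range(cols):
                    ones = bin(a_rows[r] & b_cols[c]).count("1")
                    C[r][c] += ones << (i + j)   # weight by 2^(i+j)
    return C

A = [[1, 0, 1],
     [0, 1, 1]]          # 1-bit adjacency entries
B = [[3, 1],
     [2, 0],
     [1, 2]]             # 2-bit embedding entries
C = bit_mm_to_int(A, B, bit_a=1, bit_b=2)
```

A low-bit-output variant would additionally requantize `C` to the target bitwidth before packing it again. That step is omitted here because the quantizer is not specified in this section.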
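Table 1. Datasets used for evaluation.

| Type | Dataset | #Vertex | #Edge | Dim. | #Class |
|------|---------|---------|-------|------|--------|
| I | Proteins | 43,471 | 162,088 | 29 | 2 |
| I | artist | 50,515 | 1,638,396 | 100 | 12 |
| II | BlogCatalog | 88,784 | 2,093,195 | 128 | 39 |
| II | PPI | 56,944 | 818,716 | 50 | 121 |
| III | ogbn-arxiv | 169,343 | 1,166,243 | 128 | 40 |
| III | ogbn-products | 2,449,029 | 61,859,140 | 100 | 47 |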
6. Evaluation



Benchmarks: We choose the two most representative GNN models, widely used by previous work (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., ) on the node classification task, to cover different types of aggregation. 1) Cluster GCN (Kipf and Welling, 2017) is one of the most popular GNN model architectures. It is also the key backbone network for many other GNNs, such as GraphSAGE (Hamilton et al., 2017) and differentiable pooling (Diffpool) (Ying et al., 2018). For the cluster GCN evaluation, we use 3 layers with 16 hidden dimensions per layer. 2) Batched GIN (Xu et al., 2019) differs from cluster GCN in its order of aggregation and node update: batched GIN aggregates neighbor embeddings before the node feature update (via linear transformation). According to Xu et al. (2019), GIN can capture graph properties that GCN cannot. Therefore, improving the performance of GIN will benefit more advanced GNNs, such as GAT (Veličković et al., 2018) and AGNN (Thekumparampil et al., 2018). For the batched GIN evaluation, we use 3 layers with 64 hidden dimensions per layer. For the quantization bitwidth, we cover the settings from existing quantized GNN studies (Feng et al., ; Tailor et al., 2021) and also conduct a comprehensive experimental analysis on different bitwidth settings.
Baselines: We choose several baselines for comparison. 1) Deep Graph Library (DGL) (Wang et al., 2019) is the state-of-the-art GNN framework on GPUs, which is built with the cuSPARSE (Nvidia, ) library as the backend and uses PyTorch (Paszke et al., 2019) as its front end. DGL significantly outperforms the other existing GNN frameworks (Fey and Lenssen, 2019a) over various datasets on many mainstream GNN model architectures. 2) PyTorch Geometric (PyG) (Fey and Lenssen, 2019a) is another popular GNN framework, which leverages the highly engineered torch-scatter (Fey and Lenssen, 2019b) library as the backend.



Datasets: We cover all three types of datasets that have been used in many previous GNN-related works (Wang et al., 2019; Fey and Lenssen, 2019a; Ma et al., ). Details of these datasets are listed in Table 1. Specifically, Type I graphs are the popular GNN datasets evaluated by many GNN algorithmic papers (Kipf and Welling, 2017; Xu et al., 2019; Hamilton et al., 2017). Type II graphs (Kersting et al., 2016) are the popular benchmark datasets for graph kernels and are selected as the built-in datasets for PyG (Fey and Lenssen, 2019a). Type III graphs (Hu et al., 2020) are challenging GNN datasets because of their large number of nodes and edges, and they exhibit high irregularity in their structures. Note that we partition the graphs with METIS (Karypis and Kumar, 2009) and set the total number of subgraphs to 1500, following prior work (Chiang et al., 2019; Zeng and Prasanna, ).
Platforms & Metrics: The QGTC backend is implemented in C++ and CUDA C, while the QGTC front end is implemented in Python. Our major evaluation platform is an Ubuntu 16.04 server with an 8-core 16-thread Intel Xeon Silver 4110 CPU @ 2.8GHz, 64GB of host memory, and an NVIDIA Ampere RTX 3090 GPU with 24GB of device memory. The GPU device kernel is compiled with CUDA (v11.0), and the CPU host code is compiled with GCC 7.5.0 with the compilation option "-std=c++14 -O3" for integration with the PyTorch framework. To measure the performance speedup, we calculate the average latency over 200 rounds of end-to-end runs.
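The measurement protocol above (average latency over 200 rounds, then a speedup ratio) can be sketched as follows. This is an illustrative harness, not the authors' benchmarking script: the function names are hypothetical, and the warm-up iterations are an added assumption that the text does not mention. For GPU workloads, the timed callable would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before the timer is read.

```python
import time

def average_latency_ms(fn, rounds=200, warmup=10):
    """Mean wall-clock latency of `fn` in milliseconds over `rounds` calls."""
    for _ in range(warmup):      # assumption: untimed warm-up runs
        fn()
    start = time.perf_counter()
    for _ in range(rounds):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / rounds

def speedup(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline (values > 1 mean faster)."""
    return baseline_ms / candidate_ms

# Stand-in workload; a real run would call the end-to-end inference here.
latency = average_latency_ms(lambda: sum(range(1000)), rounds=200)
```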
6.1. Compared with DGL
In this section, we conduct a detailed comparison with DGL under different choices of bitwidth. As shown in Figure 7(a) and Figure 7(b), QGTC achieves up to 1.38× and 1.63× end-to-end inference speedup over DGL across the three types of datasets for cluster GCN and batched GIN, respectively. The performance benefit is closely tied to the chosen bitwidth: the 16-bit to 32-bit settings perform markedly differently from the 1-bit to 8-bit settings. We next provide a detailed analysis and give insights for each type of dataset.

With fewer bits for both the weights and the node embedding features, QGTC is more likely to reach higher performance, because a smaller bitwidth leads to less memory access and fewer computations. QGTC can therefore harvest the bit-level benefits. DGL reaches inferior performance for two reasons: 1) FP32 computation has higher computational complexity than our low-bit design, and 2) DGL can only rely on CUDA cores, whose peak computation performance is naturally bounded, whereas QGTC runs on TC with much higher throughput. Compared with batched GIN, cluster GCN shows higher benefits since it applies the node update before the neighbor aggregation, which effectively reduces data movement and computation overhead.

Compared with Type I and Type II datasets, Type III datasets achieve relatively lower performance improvements for two reasons. First, under the same number of partitions, the size of each partition (subgraph) increases because of the large number of nodes and edges in the original graph. This also increases the CPU-GPU data transfer time for the edge list and node embedding features, which to some degree offsets the performance benefits of low-bit GNN computation. Second, the extra cost of input bit-compression for the adjacency matrix and node feature embeddings also increases. However, both types of overhead can be managed by selecting a reasonable number of partitions and batch size to balance the computation performance and accuracy of GNNs (Chiang et al., 2019; Zeng and Prasanna, ).
Accuracy w.r.t. Quantization Bits. To build the QGNN model, we apply quantization-aware training and evaluate the model's test accuracy w.r.t. the number of quantization bits on the two large Type III datasets, using the GCN model for demonstration.
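Table 2. Test accuracy of the GCN model w.r.t. quantization bits.

| Dataset | FP32 | 16 bits | 8 bits | 4 bits | 2 bits |
|---------|------|---------|--------|--------|--------|
| ogbn-products | 0.791 | 0.791 | 0.783 | 0.739 | 0.620 |
| ogbn-arxiv | 0.724 | 0.708 | 0.707 | 0.685 | 0.498 |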
As shown in Table 2, the GNN model is resilient to low-bit quantization down to 8 bits, where accuracy stays within 0.02 of FP32 on both datasets (0.783 vs. 0.791 and 0.707 vs. 0.724). The loss grows at 4 bits (0.739 and 0.685) and becomes substantial at 2 bits (0.620 and 0.498). Combining these results with the performance evaluation above under different quantization bits, we conclude that choosing the right tradeoff between runtime performance and model accuracy is meaningful and can benefit different application settings.
6.2. Compared with Other Implementations
Compared with PyG. We also compare QGTC on cluster GCN with PyG (Fey and Lenssen, 2019a), another popular GNN computing framework, which is equipped with the highly engineered torch-scatter (Fey and Lenssen, 2019b) CUDA library for GNN computation. As shown in Figure 7(c), QGTC outperforms PyG for cluster GCN. QGTC achieves more evident speedups on datasets with higher node degree and embedding dimension, such as PPI, for two reasons: 1) QGTC applies effective TC acceleration for the low-bit execution of GNNs, which breaks the throughput limits of CUDA cores, and 2) the PyG kernel design uses floating-point arithmetic operations, which require more data movement and therefore yield inferior performance compared with QGTC.
Comparison with cuBLAS GEMM on TC. We further compare the throughput of our low-bit computation (from 2-bit to 7-bit) with the state-of-the-art cuBLAS cublasGemmEx quantized (int8) GEMM on Tensor Core. Note that int8 is the minimum bitwidth that cuBLAS currently supports for quantized computation on Tensor Core. In this study, we mainly focus on the computation of $A \cdot X$ for the neighbor aggregation phase (i.e., an $N \times N$ adjacency matrix times an $N \times D$ embedding matrix, where $N$ is the number of nodes and $D$ is the node embedding dimension). As shown in Figure 8(a), QGTC achieves significant throughput improvement over Tensor Core cuBLAS (int8) in low-bit settings. The major reason is that our QGTC design effectively reduces computation and data movement at the bit level, thereby harvesting the real performance gains of low-bit quantization on GPUs. As the number of quantization bits approaches 8, the performance gains decrease because of the increasing amount of computation.
6.3. Additional Studies
Zero-Tile Jumping. We compute the ratio of the nonzero TC tiles (8×128) that are actually involved in our computation to the total number of TC tiles in the adjacency matrix. As shown in Figure 8(b), our zero-tile jumping technique largely avoids the effort of processing all-zero tiles. Based on our observation, all-zero TC tiles come from two sources. The first is subgraph batching, because there are no edge connections among nodes across different subgraphs. This type dominates the overall number of all-zero tiles. The second is the missing edge connections between nodes within each subgraph. Although this type is minor in quantity compared with the first, skipping it can still reduce memory access and computation.
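The nonzero-tile ratio reported above can be reproduced on a toy input. The sketch below counts which 8×128 tiles of a batched (block-diagonal) adjacency matrix contain at least one edge. The helper names and the two ring-shaped subgraphs are illustrative assumptions, not data from the paper.

```python
TILE_ROWS, TILE_COLS = 8, 128

def ceil_div(a, b):
    return -(-a // b)

def nonzero_tiles(edges, tile_rows=TILE_ROWS, tile_cols=TILE_COLS):
    """Set of (tile_row, tile_col) coordinates that contain at least one edge."""
    return {(r // tile_rows, c // tile_cols) for r, c in edges}

def total_tiles(num_nodes, tile_rows=TILE_ROWS, tile_cols=TILE_COLS):
    return ceil_div(num_nodes, tile_rows) * ceil_div(num_nodes, tile_cols)

def ring(size, offset):
    """Directed ring over `size` nodes whose global ids start at `offset`."""
    return [(offset + i, offset + (i + 1) % size) for i in range(size)]

# Batch two 128-node subgraphs: no edges cross the subgraph boundary.
edges = ring(128, 0) + ring(128, 128)
touched = nonzero_tiles(edges)
ratio = len(touched) / total_tiles(256)

# A kernel applying zero-tile jumping would visit only the tiles in `touched`.
```

In this toy batch only the two diagonal blocks contain edges, so half of the tiles are all-zero purely because of batching, which corresponds to the first and dominant source of zero tiles described above.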
Adjacency Matrix Size. We demonstrate the impact of the subgraph adjacency matrix size on the performance of QGTC. The adjacency matrix size can be controlled by specifying the number of subgraphs (in METIS) and the batch size (in the data loader). It affects the performance of aggregation in terms of computation and data movement, and it also determines whether our GNN computation can fully utilize the available GPU resources. We use 1 bit for both the adjacency matrix and the node embedding matrix in this study, and we mainly focus on the computation of $A \cdot X$ for the neighbor aggregation phase (i.e., an $N \times N$ adjacency matrix times an $N \times D$ embedding matrix, where $N$ is the number of nodes and $D$ is the node embedding dimension). In Figure 8(c), different colored lines represent different embedding sizes. We observe that, under the same $D$, our 1-bit GEMM computation kernel scales well as the number of nodes (i.e., the value of $N$) increases. For small subgraph sizes (128 to 512), the increase in overall computation throughput is not evident, because the computation is small and most of the available GPU resources, such as SMs, achieve low utilization. For subgraph sizes from 512 to 16384, we see a more significant increase in TFLOPs, because the additional bit-level computation engages more SMs in the BMM computation and thereby improves the overall GPU throughput. For large subgraph sizes (> 16384), the overall throughput hardly increases, mainly because almost all available GPU computation units are already in use. One characteristic of batched GNN computation compared with traditional NN computation is that its matrices are more skewed in terms of the ratio between $N$ and $D$, which to some degree limits the achievable peak performance on TC. It is also worth noting that, across the lines (different choices of $D$), a larger $D$ usually leads to better GPU utilization, since more computation and memory resources of the GPU become active for higher throughput.

Nonzero Tile Reuse. We demonstrate the effectiveness of our nonzero tile reuse with a controlled-variable study. We eliminate the impact of the number of nonzero tiles on performance by making all tiles nonzero (i.e., filling the initial matrix with all ones). We then choose the neighbor aggregation process ($A \cdot X$) for the study, fix $D$ to 1024, and vary $N$ from 1024 to 8192. Three bit combinations are used in our evaluation, where $A$ consistently uses 1 bit while $X$ uses 4, 8, and 16 bits. As shown in Figure 9(a), nonzero tile reuse improves throughput on large matrices with a higher number of bits. The major reason is that reusing a nonzero tile largely reduces the global memory accesses for repeatedly fetching the same 1-bit adjacency matrix tile, which is the key performance bottleneck for large matrices. The setting without nonzero tile reuse performs better on smaller matrices, because the overhead of repeatedly loading the same adjacency matrix tile is not pronounced compared with the GEMM operations on TC. This study motivates a more intelligent strategy or heuristic for deciding when nonzero tile reuse brings performance benefits, which we leave for future work.


CPU-to-GPU Data Transfer. We compare three methods of CPU-GPU communication. The first implementation (I) separately transfers the dense adjacency matrix and the dense node embedding matrix. The second implementation (II) separately transfers the sparse adjacency matrix and the dense node embedding matrix, and then calls sparse_to_dense on the GPU to initialize the adjacency matrix. The third implementation (III), used by QGTC, first packs the sparse adjacency matrix and the dense node embedding matrix and then transfers the packed data. Once the transfer finishes, it calls sparse_to_dense on the GPU to initialize the adjacency matrix. We select the two ogb datasets for this experiment. As shown in Figure 9(b), the third implementation used in QGTC consistently outperforms both Implementation I and Implementation II. There are two major reasons. First, transferring the edge list and embedding matrix separately incurs the overhead of initiating one additional PCIe data communication, which goes through a slow warm-up process. Second, separate transfers make it hard to maximize PCIe bandwidth utilization.
7. Conclusion
In this paper, we propose QGTC, the first QGNN computing framework to support any-bitwidth computation via GPU Tensor Core. Specifically, we introduce the first GNN-tailored any-bitwidth arithmetic design that can emulate different bitwidth computations to meet end users' demands. We craft a TC-tailored CUDA kernel design that incorporates 3D-stacked bit compression, zero-tile jumping, and nonzero tile reuse techniques to maximize the performance gains from GPU Tensor Core. We also incorporate an effective bandwidth-optimized subgraph packing strategy to maximize data transfer efficiency. Finally, we integrate QGTC with the popular PyTorch framework for better programmability and extensibility. Extensive experiments show significant performance gains of QGTC in practice.
References
[1] TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16).
[2] Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[3] Binary graph neural networks. arXiv preprint arXiv:2012.15823.
[4] Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web (WWW).
[5] Cluster-GCN: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM International Conference on Knowledge Discovery & Data Mining.
[6] Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, pp. 305–316.
[7] Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference.
[8] Accelerating reduction and scan using tensor core units. In Proceedings of the ACM International Conference on Supercomputing (ICS '19).
[9] EGEMM-TC: accelerating scientific computing on tensor cores with extended precision. ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP).
[10] SGQuant: squeezing the last bit on graph neural networks with specialized quantization. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI).
[11] Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[12] PyTorch extension library of optimized scatter operations.
[13] Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
[14] Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS).
[15] Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.
[16] A broader picture of random-walk based graph embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 685–695.
[17] POLE: polarized embedding for signed networks.
[18] Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems.
[19] Parallelization of reordering algorithms for bandwidth and wavefront reduction. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
[20] METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0. University of Minnesota, Minneapolis, MN. Note: http://www.cs.umn.edu/~metis
[21] Benchmark data sets for graph kernels.
[22] Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR).
[23] Accelerating binarized neural networks via bit-tensor-cores in Turing GPUs. IEEE Transactions on Parallel and Distributed Systems (TPDS).
[24] The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58 (7), pp. 1019–1031.
[25] NeuGraph: parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19).
[26] CUDA sparse matrix library (cuSPARSE). Note: developer.nvidia.com/cusparse
[27] CUDA template library for dense linear algebra at all levels and scales (CUTLASS).
[28] Dense linear algebra on GPUs. Note: developer.nvidia.com/cublas
[29] Warp matrix multiply-accumulate (WMMA). Note: docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
[30] PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS) 32.
[31] Near linear time algorithm to detect community structures in large-scale networks. Physical Review E.
[32] Degree-Quant: quantization-aware training for graph neural networks. International Conference on Learning Representations (ICLR).
[33] Programming tensor cores in CUDA 9. Note: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
[34] Attention-based graph neural network for semi-supervised learning.
[35] Graph attention networks. In International Conference on Learning Representations (ICLR).
[36] Deep graph library: towards efficient and scalable deep learning on graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds.
[37] GNNAdvisor: an efficient runtime system for GNN acceleration on GPUs. In USENIX Symposium on Operating Systems Design and Implementation (OSDI '21).
[38] How powerful are graph neural networks? In International Conference on Learning Representations (ICLR).
[39] Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS).
[40] GraphACT: accelerating GCN training on CPU-FPGA heterogeneous platforms. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).