With the surge in popularity of graph neural networks (GNNs) (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017), full-precision GNNs have been widely studied in terms of their algorithms (Kipf and Welling, 2017; Xu et al., 2019) and execution performance (Ma et al., 2019; Wang et al., 2019; Fey and Lenssen, 2019) relative to traditional graph analytical methods, such as Random Walk (Huang et al., 2021). On the other side, quantized GNNs (QGNNs) (Feng et al., 2020; Tailor et al., 2021) have recently attracted considerable attention thanks to their negligible accuracy loss, resilience towards malicious attacks, and significantly lower computation and memory overhead. We summarize several key features of GNNs that make them intrinsically suitable for quantization. First, the adjacency matrix of a GNN is naturally well-suited for quantization, since we only need 0/1 to indicate the existence of edge connections. Thus, using low bits for such information can save both memory and computation. Second, the quantization of weights and node embeddings can also be beneficial, because the tiny precision loss in quantization can largely be offset by the node information fusion through the iterative neighbor aggregation process of GNNs. Third, the quantization of floating-point numbers can absorb input perturbations from adversarial attacks.
Despite the theoretical promise of QGNNs, realizing these benefits on high-performance GPUs still faces tremendous challenges. Existing GPU-based GNN frameworks (Fey and Lenssen, 2019; Wang et al., 2019, 2021) are designed and tailored for GPU CUDA cores, which are intrinsically bounded by their peak throughput and can only handle byte-based data types (e.g., int32). Although quantized computation can be achieved via pure algorithmic emulation, the actual bit-level performance gains can hardly be harvested, since all underlying arithmetic operations still have to rely on the well-defined data types from CUDA/C++ libraries.
To tackle these challenges, we turn to a recent GPU hardware feature, the Tensor Core (TC). The modern NVIDIA GPU with TC design is illustrated in Figure 1. TC provides native support for bit-level operations (XOR, AND), which could be the major ingredient for quantized computation. Besides, TC can easily beat CUDA cores with a significantly higher throughput on conventional NN operations (e.g., linear transformation and convolution). This demonstrates the potential of using TC to accelerate QGNNs. However, directly using TC for QGNN computation encounters several challenges.
First, current TC supports only limited choices of bitwidth (e.g., 1-bit and 4-bit), which may not meet users' demands for any-bitwidth (e.g., 2-bit) computation. Second, TC was initially tailored for GEMM computation and may not directly fit the context of sparse GNN computation. A huge amount of computation and memory access would be wasted on non-existent edges, because the hard constraint on the TC input matrix tile size (e.g., 8×128 for 1-bit GEMM) has to be satisfied, which may require excessive zero padding. Third, low-bit computation causes a compatibility issue, since existing deep-learning frameworks (Paszke et al., 2019; Abadi et al., 2016) cannot directly operate on low-bit data types.
Therefore, we remark that several aspects need to be considered in order to use TC for QGNNs: 1) Hardware-level Support. This inspires us to explore high-performance GPU hardware features that can efficiently support QGNN computation. Even though it is hard to find a GPU hardware feature that directly supports any-bitwidth QGNN, some indirect hardware features are potentially helpful. For example, NVIDIA introduced the 1-bit TC-based GEMM on the Turing architecture, which essentially can be used to compose any-bitwidth GEMM. 2) Software-level Optimizations. This motivates us to optimize the kernel computation according to the characteristics of QGNN. GNN computation is highly sparse and irregular, which is intrinsically unfavorable for the dense GPU computation flow tailored for traditional NN operators. Thus, how to handle such input-level irregularity from the computation and memory perspectives is essential to the performance of QGNN. For example, subgraph partitioning (Karypis and Kumar, 2009) based mini-batch GNN computation has been used to increase computation efficiency without compromising model accuracy. 3) Framework-level Integration. This encourages us to bridge the gap between quantized low-bit implementations and deep-learning frameworks built for full-precision computation, so that our whole system-level design can be seamlessly integrated with state-of-the-art mainstream NN frameworks to benefit both execution performance and developer productivity.
At the input level, we incorporate the METIS (Karypis and Kumar, 2009) graph partitioning to generate a set of dense subgraphs from the highly irregular and sparse input graphs. The insight here is that nodes in real-world graphs are likely to form clusters, and such information can be used to benefit the efficiency of GNN computing and model algorithmic performance.
At the algorithm level, we leverage the insight that any-bitwidth QGNN computation can always be decomposed into 1-bit computation. Each bit in the output can be generated by different combinations of bits from the input. Thus, we use a quantized low-bit data representation and bit-decomposed computation based on the “atomic” 1-bit type.
At the GPU kernel level, we craft a low-bit computation design tailored for QGNN computation on batched dense subgraphs. We address the key performance bottleneck of low-bit GNN computing from the memory and computing perspectives. Specifically, we use only a 1-bit binarized representation for the subgraph adjacency matrix, which is memory efficient for representing the presence/absence of edge connections between nodes. Besides, we use a 3D-stacked bit-compression technique for maintaining quantized low-bit node embedding features and weights. In addition, we fully exploit the intra-subgraph sparsity through zero-tile skipping and non-zero tile reuse, which can further avoid unnecessary computations and improve data locality.
At the framework level, we integrate QGTC with the state-of-the-art Tensor-based PyTorch (Paszke et al., 2019) framework. We introduce the new notions of the bit-Tensor data type and bit-Tensor computation and wrap them up as a new set of PyTorch API extensions. End-users can directly interact with the QGTC PyTorch APIs to access all functionalities. This largely improves programmability and extensibility.
Overall, we summarize our key contributions as:
We propose a novel 1-bit composition technique for any-bitwidth arithmetic design, which can support QGNN with diverse precision demands.
We introduce a highly efficient implementation of QGNN built on top of the GPU Tensor Core by applying a series of computation optimizations (e.g., subgraph partitioning and batching, and zero-tile jumping) and memory optimizations (e.g., 3D-stacked bit-compression and non-zero tile reuse).
We integrate QGTC with PyTorch by introducing the bit-Tensor data type and bit-Tensor computation for better programmability and extensibility.
Extensive experiments demonstrate the advantages of QGTC in terms of better performance compared with the state-of-the-art DGL framework on mainstream GNN models across various datasets.
2. Background and Related Work
In this section, we will introduce the background of GNNs, the quantization of GNNs, and basics of GPU Tensor Core.
2.1. Graph Neural Networks
Graph neural networks (GNNs) are an effective tool for graph-based machine learning. The detailed computing flow of GNNs is illustrated in Figure 3. GNNs compute the node feature vector (embedding) for node $v$ at layer $k+1$ based on the embedding information at layer $k$ ($k \geq 0$), as shown in Equation 1,

$$a_v^{(k+1)} = \mathrm{Aggregate}^{(k+1)}\big(h_u^{(k)} \mid u \in N(v) \cup \{v\}\big), \qquad h_v^{(k+1)} = \mathrm{Update}^{(k+1)}\big(a_v^{(k+1)}\big) \tag{1}$$

where $h_v^{(k)}$ is the embedding vector for node $v$ at layer $k$; $a_v^{(k+1)}$ is the aggregation result obtained by collecting neighbors' information (e.g., node embeddings); $N(v)$ is the neighbor set of node $v$. The aggregation method and the order of aggregation and update can vary across different GNNs. Some methods (Kipf and Welling, 2017; Hamilton et al., 2017) rely only on the neighboring nodes, while others (Veličković et al., 2018) also leverage edge properties that are computed by applying a vector dot-product between source and destination node embeddings. The update function is generally composed of standard NN operations, such as a single fully connected layer or a multi-layer perceptron (MLP) in the form of $\sigma(W \cdot a_v^{(k+1)} + b)$, where $W$ and $b$ are the weight and bias parameters, respectively. The common choices for node embedding dimensions are 16, 64, and 128, and the embedding dimension may change across different layers.
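As a minimal sketch of one such layer, assuming sum aggregation over a node and its neighbors and a single fully connected update with ReLU (both are illustrative choices, and all function and variable names here are hypothetical), the aggregate-then-update flow can be written in plain Python:

```python
def gnn_layer(h, neighbors, weight, bias):
    """One GNN layer: sum-aggregate each node with its neighbors, then ReLU(W a + b)."""
    out = {}
    for v, h_v in h.items():
        # Aggregate: element-wise sum over the node itself and its neighbors.
        a_v = list(h_v)
        for u in neighbors[v]:
            a_v = [x + y for x, y in zip(a_v, h[u])]
        # Update: one fully connected layer followed by ReLU.
        out[v] = [
            max(0.0, sum(w * x for w, x in zip(row, a_v)) + b)
            for row, b in zip(weight, bias)
        ]
    return out


# Tiny example graph: node 0 is connected to nodes 1 and 2.
h = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [3.0, 0.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weight, chosen only for readability
bias = [0.0, 0.0]
h_next = gnn_layer(h, neighbors, weight, bias)
```

With the identity weight, each output embedding is simply the (rectified) sum of the node's own embedding and those of its neighbors, which makes the aggregation step easy to inspect.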
The most recent advancement of GNNs is batched computation (Chiang et al., 2019), which has been adopted by many state-of-the-art GNN computing frameworks (Wang et al., 2019; Fey and Lenssen, 2019) for large graphs that cannot easily fit into GPU/CPU memory directly. Batched GNN computation has been shown to achieve good accuracy and runtime performance (Chiang et al., 2019) in comparison with full-graph computation. Batched GNN computation takes several steps. First, it decomposes the input graph by employing a state-of-the-art graph partitioning toolset, such as METIS (Karypis and Kumar, 2009), which minimizes the loss of graph structural information while maximizing the number of edge connections within each subgraph (i.e., improving the subgraph modularity). Second, it feeds the small subgraphs into the GNN model for computation, which generates the node feature vectors for each subgraph. Third, the generated node embeddings can be used in multiple downstream tasks, such as node/graph classification, link prediction, and community detection (Liben-Nowell and Kleinberg, 2007; Grover and Leskovec, 2016; Huang et al., 2021, 2022).
2.2. Quantization of GNNs
Besides the research efforts on full-precision GNNs, recent focus has also shifted towards quantized GNNs. For example, Feng et al. (2020) propose the first framework for running quantized GNNs, in which several types of quantization schemes can be applied to GNNs (e.g., quantization based on the GNN layer, node degrees, and edge weights). Their experimental results demonstrate the effectiveness of GNN quantization in terms of memory saving and model accuracy. Tailor et al. (2021) introduce an architecturally agnostic and stable method, Degree Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures (e.g., CNNs). They report speedups on CPU when using int8 compared with float. Compared with full-precision GNNs, low-bit GNNs bring the benefits of model robustness towards adversarial attacks and low computation and memory overheads. However, the work of Feng et al. (2020) only showcases the theoretical memory and computation benefits via software-level quantization simulation, where the underlying computation is still carried out in 32-bit full-precision float. The work of Tailor et al. (2021) only demonstrates such gains on CPUs, which has limited applicability in real-world GNN computation settings. This encourages us to harvest the real performance benefits on modern, widely used GPU platforms.
2.3. Tensor Core on GPUs
The recent advancement of GPU hardware technology has pushed computing power to a new level. Among those innovations, the most significant one is the Tensor Core (TC) on NVIDIA GPUs. Different from the scalar-scalar computation on CUDA Cores, TC provides a matrix-matrix compute primitive, which can deliver substantially higher computation throughput. The initial version of TC was designed for handling GEMM with half-precision input and full-precision output. More variants (e.g., int8, int4, and int1 inputs with 32-bit unsigned integer (uint32) output) have been introduced with the recent CUDA release (11.0) and newer GPU microarchitectures (e.g., Turing and Ampere).
In particular, TC supports the compute primitive $D = A \cdot B + C$, where matrix tiles $A$ and $B$ are required to be of a certain precision (e.g., 1-bit), while matrix tiles $C$ and $D$ use uint32. Depending on the input data precision and the version of the GPU microarchitecture, the matrix tile sizes $M$, $N$, and $K$ may have different choices. For example, 1-bit TC computing requires $M = N = 8$ and $K = 128$. Different from CUDA Cores, which require users to define the execution flow of each thread (i.e., the work of individual threads), TC requires the collaboration of a warp of threads (32 threads) (i.e., the work of individual warps). This is reflected in two ways. First, before calling TC for computation, all registers of a warp of threads need to collaboratively store the matrix tile into a new memory hierarchy (called Fragment (NVIDIA, 2017)), which allows data sharing across registers. This intra-warp sharing provides opportunities for fragment-based memory optimizations. Second, during the computation, the loaded matrix fragments are taken as the TC input to generate the output fragment, which also consists of registers from each thread in a warp. Data movements among these registers are also managed collaboratively by a warp of threads.
Prior research efforts have been devoted to accelerating high-performance computing workloads with TC. Abdelfattah et al. (2019) process batched small-size GEMM on TC for acceleration. Li and Su (2020) leverage the 1-bit GEMM capability of Turing GPU TC for accelerating binary neural network inference. Dakkak et al. (Dakkak et al., ) accelerate the half-precision scan on TC by transforming the scan into a GEMM. Feng et al. (2021) introduce GEMM-based scientific computing on TC with extended precision. QGTC enlarges the application range of TC by accelerating any-bitwidth quantized GNN computation, which is not directly covered by any existing research, any release of the cuBLAS (Nvidia, ) or CUTLASS (NVIDIA, ) libraries, or the GPU TC hardware.
TC can be used in several ways. The simplest one is to call the cuBLAS (Nvidia, ) cublasSgemmEx API. However, the cuBLAS API only supports computation on the most common fixed bitwidths on TC, such as 8-bit and half-precision (16-bit); thus, it cannot support any-bitwidth precision directly. The second way is to call the Warp Matrix Multiply-Accumulate (WMMA) (nvcuda::wmma) API (Nvidia, ) in CUDA C++ to operate TC directly. There are basically four types of operations (Listing 1). In this project, we follow the second way for more low-level implementation customization for batched GNN computation, because it offers more design/implementation flexibility for composing arbitrary-bit computation and eases the optimization (e.g., data loading and reuse) of batched GNN-specific workloads at the GPU kernel level.
3. QGTC Algorithm Design
In this section, we first introduce the basics of low-bit computation. Then we will discuss our TC-tailored algorithm design for quantized GNN.
3.1. 1-bit Composition for Quantized Ops.
Over the last few years, quantized deep neural networks (QDNNs) (Feng et al., 2020; Tailor et al., 2021) have been extensively studied, largely due to their memory saving and high computation performance. For GNNs, however, similar work lags behind. The work of Feng et al. (2020) demonstrates that GNNs are actually insensitive to quantization: even very low-bit quantization does not lead to evident accuracy loss, because the graph-like aggregation operations can amortize the influence of quantization. Another work (Bahri et al., 2020) also demonstrates that even binarized GNNs can be beneficial in some application scenarios. In this work, we foresee that support for any-bitwidth precision computation on GNNs is vital to satisfy various user demands (e.g., execution time).
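Given a quantization bit $q$ and a 32-bit floating-point value $\alpha$, we quantize it as a $q$-bit value $\alpha^{(q)}$ by using

$$\alpha^{(q)} = \left\lfloor \frac{\alpha - \alpha_{min}}{scale} \right\rfloor \tag{2}$$

where $\alpha_{min}$ is an empirical lower bound that can be determined by users or application settings; $scale$ is the ratio between the range ($[\alpha_{min}, \alpha_{max}]$, where $\alpha_{max}$ is an empirical upper bound) and the $q$-bit representation range ($2^q$); $\lfloor \cdot \rfloor$ is the floor function.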
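The quantization step described above can be sketched as follows. This is only an illustration under stated assumptions: the scale is taken as the value range divided by $2^q$, and the final clamp to the valid $q$-bit levels (so that the upper bound itself does not overflow) is our addition rather than something stated in the text; the function name `quantize` is hypothetical.

```python
import math


def quantize(alpha, alpha_min, alpha_max, q):
    """Map a float in [alpha_min, alpha_max] to an integer level in [0, 2**q - 1]."""
    # Assumption: scale = (value range) / (number of q-bit levels).
    scale = (alpha_max - alpha_min) / (2 ** q)
    level = math.floor((alpha - alpha_min) / scale)
    # Assumption: clamp so out-of-range inputs (and alpha_max itself) stay representable.
    return min(max(level, 0), 2 ** q - 1)


levels = [quantize(x, -1.0, 1.0, 2) for x in (-1.0, 0.0, 1.0)]
```

With 2 bits and the range $[-1, 1]$, each quantization level covers an interval of width 0.5, so the three sample inputs land on the lowest, a middle, and the highest level.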
For any-bitwidth computation on quantized values, we propose a new type of arithmetic based on the “atomic” 1-bit computation widely used in binarized NNs (Hubara et al., 2016).
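Any-bitwidth Scalar-Scalar Multiplication: Assume we have a 3-bit scalar value ($a$) and multiply it by a 2-bit scalar value ($b$). We can first represent these two values as

$$a = a_2 \cdot 2^2 + a_1 \cdot 2^1 + a_0 \cdot 2^0, \qquad b = b_1 \cdot 2^1 + b_0 \cdot 2^0$$

where $a_i$ and $b_j$ indicate the bit value (0/1) at the corresponding bit position after bit decomposition. By following the general rule of multiplication, we can get $a \cdot b$ as

$$a \cdot b = (a_2 \cdot 2^2 + a_1 \cdot 2^1 + a_0 \cdot 2^0)(b_1 \cdot 2^1 + b_0 \cdot 2^0)$$

Through simplification, we get

$$a \cdot b = a_2 b_1 \cdot 2^3 + (a_2 b_0 + a_1 b_1) \cdot 2^2 + (a_1 b_0 + a_0 b_1) \cdot 2^1 + a_0 b_0 \cdot 2^0$$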
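The scalar decomposition can be checked with a short sketch. The helper names `bit_decompose` and `bitwise_multiply` are hypothetical; the code multiplies individual bits with AND and shifts each 1-bit product by the sum of the two bit positions, which is exactly the grouping by powers of two described above.

```python
def bit_decompose(value, nbits):
    """Return the bits of `value`, least significant first."""
    return [(value >> i) & 1 for i in range(nbits)]


def bitwise_multiply(a, a_bits, b, b_bits):
    """Multiply two small unsigned integers using only 1-bit products and shifts."""
    result = 0
    for i, a_i in enumerate(bit_decompose(a, a_bits)):
        for j, b_j in enumerate(bit_decompose(b, b_bits)):
            # A 1-bit product is a logical AND; its weight is 2**(i + j).
            result += (a_i & b_j) << (i + j)
    return result


product = bitwise_multiply(5, 3, 3, 2)  # 3-bit value 5 times 2-bit value 3
```

The same routine works for any pair of bitwidths, since the double loop simply enumerates every combination of bit positions.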
Any-bitwidth Vector-Vector Multiplication: We extend the any-bitwidth scalar-scalar computation towards any-bitwidth vector-vector computation between a 3-bit vector $\vec{a}$ and a 2-bit vector $\vec{b}$, each of which has $n$ elements. The above scalar-scalar multiplication formula can then be extended to $n$-dimensional vector-vector multiplication

$$\vec{a} \cdot \vec{b} = (\vec{a}_2 \cdot \vec{b}_1) \cdot 2^3 + (\vec{a}_2 \cdot \vec{b}_0 + \vec{a}_1 \cdot \vec{b}_1) \cdot 2^2 + (\vec{a}_1 \cdot \vec{b}_0 + \vec{a}_0 \cdot \vec{b}_1) \cdot 2^1 + (\vec{a}_0 \cdot \vec{b}_0) \cdot 2^0$$

From the above formula, we can see that in order to compute the result of any-bitwidth vector-vector multiplication, we first do bit decomposition on all elements in each vector, then do bit-bit multiplication between elements from each vector, and finally do bit shifting and reduction to get the final result. For example, after bit decomposition of $\vec{a}$ and $\vec{b}$, we can get $\vec{a}$ at bit position 2 as $\vec{a}_2$ and $\vec{b}$ at bit position 1 as $\vec{b}_1$, where $\vec{a}_2, \vec{b}_1 \in \{0, 1\}^n$. From the multiplication and addition, we can get the multiplication result of $\vec{a} \cdot \vec{b}$ at bit position 3. Such a 1-bit vector-vector multiplication can be effectively implemented as

$$\vec{a}_2 \cdot \vec{b}_1 = \mathrm{popcnt}(\vec{a}_2 \,\&\, \vec{b}_1)$$

where $\mathrm{popcnt}$ counts the total number of 1s in the bit representation of the result. A similar procedure can be applied to generate the result at bit positions 0, 1, and 2. After all these individual bits in the temporary results are ready, we can do bit shifting and reduction to get the final result. Based on such any-bitwidth vector-vector results, we can easily derive the any-bitwidth matrix-matrix multiplication scheme, where each element in the output matrix can be seen as the result of an any-bitwidth vector-vector multiplication.
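The vector case can be sketched in the same spirit. Here each bit plane of a vector is packed into a single Python integer (element $i$ maps to bit $i$, a layout chosen only for this illustration), and a 1-bit dot product becomes an AND followed by a population count; all helper names are hypothetical.

```python
def pack_bit_plane(vec, bit):
    """Pack bit `bit` of every element of `vec` into one integer (element i -> bit i)."""
    plane = 0
    for i, x in enumerate(vec):
        plane |= ((x >> bit) & 1) << i
    return plane


def popcnt(x):
    """Count the 1 bits in the binary representation of a non-negative integer."""
    return bin(x).count("1")


def bit_dot(a, a_bits, b, b_bits):
    """Dot product of two unsigned integer vectors built from 1-bit AND + popcount."""
    total = 0
    for i in range(a_bits):
        for j in range(b_bits):
            partial = popcnt(pack_bit_plane(a, i) & pack_bit_plane(b, j))
            total += partial << (i + j)  # shift by the combined bit position
    return total


a = [5, 3, 7, 0]  # 3-bit values
b = [2, 1, 3, 3]  # 2-bit values
result = bit_dot(a, 3, b, 2)
```

The six AND + popcount operations (3 bit planes times 2 bit planes) replace the four full-precision multiplications of the ordinary dot product, which is the trade that makes 1-bit hardware primitives usable for wider bitwidths.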
3.2. Quantized Computation in GNNs
Applying any-bitwidth precision computation in GNNs has two notable properties. First, the adjacency matrix ($A$) of a GNN only needs binary (1-bit) numbers to represent the presence/absence of edges. Second, the node embedding matrix ($X$) and the weight matrix ($W$) can be represented with any bitwidth to meet different precision demands.
As described in Algorithm 1, each layer of an any-bitwidth GNN consists of a neighbor aggregation phase and a node embedding update phase. Specifically, neighbor aggregation is carried out through a 1-bit-and-s-bit matrix multiplication, and the node update is carried out through an s-bit-and-t-bit matrix multiplication. At Line 1 to 3, we do bitDecompose for the subgraph adjacency matrix $A$, the embedding matrix $X$, and the weight matrix $W$. For scalar int32 numbers, our bitDecompose first quantizes each number to another int32 number in an n-bit data range by using Equation 2. Then, it applies bit-shifting to extract each bit (0/1) from the quantized int32 number. Our 3D-stacked bit compression (Section 4.2) happens after these two steps have been applied to each element of a matrix, and it packs the extracted bits for the whole matrix together. Here, for ease of algorithm description, we maintain the different bits of a matrix as a list, e.g., the first entry of the list for $X$ holds bit 0 of all elements of $X$. At Line 5 to 7, we apply bit-matrix multiplication between each bit matrix from $X$ and the binary 1-bit matrix $A$; the results of this step are still a set of bit matrices and are stored in a list. At Line 8 to 14, we apply a similar bit-matrix multiplication between the aggregation result and $W$, and the results of this step are stored as bit matrices as well for the following final-result generation. To avoid any data overflow during the reduction (Line 15 to 19), the accumulated output should also use a full-bit data type (e.g., int32). For large graphs, the adjacency matrices cannot easily fit into GPU device memory directly. In this scenario, we employ METIS (Karypis and Kumar, 2009) for graph partitioning and run the GNN as batched subgraph computation, as in the popular Cluster-GCN (Chiang et al., 2019) design. Considering that the number of subgraphs generated by METIS (Karypis and Kumar, 2009) is usually within a reasonable size (2,000 to 20,000), such a batched GNN computation can be accommodated on a single modern GPU without violating its memory constraints. Note that to reduce the runtime overhead, the bit-decomposition of the matrices $A$ and $W$ can be pre-computed and cached before the GNN computation at each layer. The major reason is that across different GNN layers of the same subgraph, the adjacency matrix can be reused. On the other side, across different subgraphs at the same GNN layer, the weight matrix can be reused for later computation.
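The layer-level flow (bit decomposition, 1-bit matrix products, then shift and reduce) can be sketched in plain Python. This is a simplified illustration rather than the paper's Algorithm 1: all names are hypothetical, bits are kept as unpacked 0/1 lists, and the aggregation result is passed to the update step at its full bitwidth instead of being re-quantized.

```python
def bit_planes(mat, nbits):
    """Split an unsigned integer matrix into `nbits` 0/1 matrices (plane k holds bit k)."""
    return [[[(x >> k) & 1 for x in row] for row in mat] for k in range(nbits)]


def binary_matmul(p, q):
    """Multiply two 0/1 matrices; each entry is an AND followed by a count of ones."""
    cols = list(zip(*q))
    return [[sum(x & y for x, y in zip(row, col)) for col in cols] for row in p]


def any_bit_matmul(p, p_bits, q, q_bits):
    """Compose a p_bits-by-q_bits matrix product from 1-bit matrix products."""
    rows, cols = len(p), len(q[0])
    out = [[0] * cols for _ in range(rows)]  # full-width accumulator avoids overflow
    for i, p_plane in enumerate(bit_planes(p, p_bits)):
        for j, q_plane in enumerate(bit_planes(q, q_bits)):
            partial = binary_matmul(p_plane, q_plane)
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)
    return out


adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]  # 1-bit adjacency matrix with self-loops
emb = [[3, 1], [2, 0], [1, 3]]           # 2-bit node embeddings
wgt = [[1, 2], [3, 0]]                   # 2-bit weights

agg = any_bit_matmul(adj, 1, emb, 2)     # neighbor aggregation (1-bit by 2-bit)
# Assumption of this sketch: keep the aggregate at whatever width it needs.
agg_bits = max(x for row in agg for x in row).bit_length()
upd = any_bit_matmul(agg, agg_bits, wgt, 2)  # node update
```

Because `adj` and `wgt` do not change across layers or subgraphs, respectively, their bit planes could be computed once and cached, which mirrors the pre-computation noted above.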
4.1. Subgraph Partitioning and Batching
Real-world graphs usually come with a large number of nodes and highly irregular graph structure (edge connections). This brings two levels of difficulties for GNN computing. The first one is the memory consumption, since GPU device memory cannot accommodate all nodes, edges, and node embedding features at the same time. The second one is the inefficient execution since the irregular and sparse edge connections lead to low memory access efficiency and poor computation performance. To this end, in QGTC, we combine the state-of-the-art graph partitioning technique METIS (Karypis and Kumar, 2009) and subgraph batch processing (Chiang et al., 2019) to handle different sizes of input graphs effectively. Compared with other solutions, such as graph clustering approaches (Raghavan et al., 2007; Karantasis et al., 2014) and BFS-based methods (Cuthill and McKee, ), METIS achieves a better quality of its captured subgraph partitions (more edges in each subgraph) and the significantly higher runtime performance owing to its partial parallelization. Note that the number of subgraphs/partitions is determined by users and is passed as a runtime parameter to METIS.
After the subgraph partitioning, we conduct a batching step for GNN computation on GPUs. This step gathers a set of subgraph partitions based on a user-defined batch size. Later, during GNN computing, subgraphs are loaded to GPU memory batch by batch. Using the partitioning and batching strategy for GNN computing gives users control of workloads at two levels of granularity. First, the workload granularity is defined by the number of subgraphs/partitions. This determines the size of each subgraph partition and the edge connection density of each subgraph. In general, a larger number of subgraphs/partitions leads to denser edge connections within each subgraph, which may bring better computation and memory locality. Second, the processing granularity is controlled by the batch size. This determines the size of the graphs that are fit into the GPU at each round of execution. The batch size should be selected to maximize the utilization of the GPU while respecting the GPU computation and memory resource constraints.
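The two levels of granularity can be illustrated with a small sketch. It assumes a node-to-partition assignment is already available (in practice produced by a partitioner such as METIS; the `membership` list below is made up for illustration), and all function names are hypothetical.

```python
from collections import defaultdict


def group_by_partition(membership):
    """Turn a node -> partition id list into a list of node lists, one per partition."""
    parts = defaultdict(list)
    for node, part in enumerate(membership):
        parts[part].append(node)
    return [parts[p] for p in sorted(parts)]


def make_batches(subgraphs, batch_size):
    """Group consecutive subgraphs into batches of at most `batch_size`."""
    return [subgraphs[i:i + batch_size] for i in range(0, len(subgraphs), batch_size)]


def intra_partition_edges(edges, membership):
    """Keep only edges whose endpoints fall into the same partition."""
    return [(u, v) for u, v in edges if membership[u] == membership[v]]


membership = [0, 0, 1, 1, 2, 2, 3]  # made-up partition id for each of 7 nodes
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6)]

subgraphs = group_by_partition(membership)       # workload granularity
batches = make_batches(subgraphs, batch_size=2)  # processing granularity
kept = intra_partition_edges(edges, membership)
```

The number of partitions fixes how large and dense each subgraph is, while `batch_size` only decides how many of them are processed per round, so the two knobs can be tuned independently.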
4.2. 3D-Stacked Bit Compression
Existing NN frameworks are developed for full-precision computation, which leads to two major challenges. First, a low-bit quantized data type cannot directly borrow a full-precision data type as the “vehicle” for computation, because full-precision data types such as float and int32 do not bring any memory or computation savings. Second, low-bit quantization does not naturally fit any bit alignment, since its bit-level boundaries are mostly not divisible by the size of a byte (8 bits), making it hard to retrieve the actual values.
To this end, we propose a novel 3D-stacked bit-compression technique to handle any-bitwidth data types effectively. The major idea is to compress any-bitwidth input with 32-bit alignment for ease of value retrieval and memory alignment. As exemplified in Figure 4(a), we have an input matrix of 3-bit values. For each bit of an element in the matrix, we store it in a 1-bit matrix, and these bit matrices are stacked along the vertical axis. During the computation of any-bitwidth matrix multiplication $C = A \cdot B$, two types of 3D-stacked bit-compression are employed. For matrix $A$, we use column-wise compression with 32-bit alignment, as illustrated in Figure 4(b). The main reason for choosing column-wise compression is that the matrix multiplication benefits from coalesced across-column memory access along each row of matrix $A$. The 32-bit alignment benefits read performance through coalesced loading from global memory to fragments. After the compression, each 1-bit matrix of $A$ is stored as 32-bit words, with its rows and columns padded by PAD8 and PAD128, respectively, where PAD8 and PAD128 pad rows/columns whose sizes are not divisible by the basic TC computing size ($8 \times 128$).
For matrix $B$, we use row-wise compression with 32-bit alignment, as shown in Figure 4(c), which benefits the across-row access along each column of matrix $B$. After the compression, each 1-bit matrix of $B$ is likewise stored as 32-bit words with padded rows and columns for the output layer. Note that if the current layer is a hidden layer of a GNN model, the padding strategy on matrix $B$ is slightly different, considering that the result matrix $C$ will become the new matrix $A$ in the next layer, which demands 128-bit padding. In this case, to avoid additional padding overhead, we pad $B$ so that the resulting $C$ already satisfies the 128-bit padding.
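To make the packing concrete, the following sketch compresses a small matrix into stacked bit planes of 32-bit words. It illustrates only the column-wise variant; the padding multiples (8 rows, 128 columns), the least-significant-bit-first ordering inside each word, and all function names are assumptions of this sketch rather than details confirmed by the text.

```python
def pad_to(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def pack_bit_planes(mat, nbits, row_pad=8, col_pad=128):
    """Pack an unsigned integer matrix into `nbits` planes of 32-bit words, row by row."""
    rows, cols = len(mat), len(mat[0])
    p_rows, p_cols = pad_to(rows, row_pad), pad_to(cols, col_pad)
    planes = []
    for k in range(nbits):
        plane = [[0] * (p_cols // 32) for _ in range(p_rows)]  # zero padding
        for r in range(rows):
            for c in range(cols):
                bit = (mat[r][c] >> k) & 1
                plane[r][c // 32] |= bit << (c % 32)  # 32 columns share one word
        planes.append(plane)
    return planes


def unpack_bit_planes(planes, rows, cols):
    """Recover the original integer matrix from its packed bit planes."""
    out = [[0] * cols for _ in range(rows)]
    for k, plane in enumerate(planes):
        for r in range(rows):
            for c in range(cols):
                out[r][c] |= ((plane[r][c // 32] >> (c % 32)) & 1) << k
    return out


mat = [[5, 0, 7], [1, 2, 3]]           # 3-bit values
planes = pack_bit_planes(mat, nbits=3)
restored = unpack_bit_planes(planes, rows=2, cols=3)
```

A row-wise variant would pack along the other axis in the same way, so that consecutive words run down each column instead of across each row.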
Compared with the previous work (Cowan et al., 2020) that also leverages bit-level data packing, there are several differences. The first one is the padding strategy. Padding in QGTC can differ across tensor dimensions, whereas bit-level padding is ignored in the work of Cowan et al. (2020). For example, QGTC may use PAD8 or PAD128, depending on whether the following computation is carried out in low-bit or 32-bit format, thereby avoiding unnecessary conversion. The second one is the packing data type. The work of Cowan et al. (2020) uses uint4 for packing 128 contiguous bits, while QGTC uses a 32-bit format for better interoperability with PyTorch. The third one is the bit-level layout. The work of Cowan et al. (2020) does not consider further bit-level layout optimization. In QGTC, for the GEMM operation ($C = A \cdot B$), we use column-wise compression for matrix $A$ and row-wise compression for matrix $B$.
4.3. Zero-tile Jumping
Even though subgraph partitioning such as METIS (Karypis and Kumar, 2009) makes each subgraph denser (more edge connections within each subgraph), there are still some TC tiles (i.e., the input matrix tile for a single TC computation) that are filled with all-zero elements. Directly iterating through these zero tiles would introduce unnecessary memory cost (loading data from global memory to thread-local registers) and computation cost (processing 1-bit TC-GEMM on an input adjacency matrix tile that contains only zeros). Based on this observation, we introduce a novel zero-tile jumping technique to reduce the unnecessary computations by using bitwise OR operations and warp-level synchronization primitives.
As illustrated in Figure 5, each 1-bit TC-GEMM works on a register fragment with the tile size of $8 \times 128$. This can be partitioned into $8 \times 4$ int32 elements. To check whether the tile contains only zeros, we first employ only 8 threads from a warp to each fetch a uint4_v vector (each uint4_v element in CUDA consists of 4 int32 elements placed at contiguous memory addresses). The reason for using uint4_v is to maximize the memory access efficiency by issuing fewer global memory requests. Once all uint4_v elements have been loaded, each thread applies a bitwise OR across its 4 int32 elements, which checks whether each row of a TC tile is all-zero. The next step is to tell whether the whole tile is all-zero across different rows; here we use a warp-level primitive to synchronize the information across the 8 active threads in the warp. This step generates an int32 number. If this number is zero, all elements in the input TC tile are zero. Otherwise, we still need to conduct the 1-bit TC-GEMM on the current tile. We give a more quantitative analysis of zero-tile jumping in Section 6.3.
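The zero-tile check can be emulated on the host as a sanity check of the logic. This sketch is not GPU code: the 8 "threads" are simulated by a loop, the warp-level vote is emulated by assembling one flag bit per lane into a mask, and all names are hypothetical.

```python
def row_or(words):
    """Bitwise OR of the four 32-bit words held by one (simulated) thread."""
    acc = 0
    for w in words:
        acc |= w
    return acc


def tile_is_zero(tile):
    """Check an 8 x 128-bit tile, stored as 8 rows of four 32-bit words."""
    mask = 0
    for lane, row in enumerate(tile):
        flag = int(row_or(row) != 0)  # does this lane's row contain any 1 bit?
        mask |= flag << lane          # emulates combining per-lane flags in a warp
    return mask == 0


def tiles_to_compute(tiles):
    """Indices of tiles that still need a 1-bit GEMM (all-zero tiles are skipped)."""
    return [i for i, tile in enumerate(tiles) if not tile_is_zero(tile)]


zero_tile = [[0, 0, 0, 0] for _ in range(8)]
sparse_tile = [row[:] for row in zero_tile]
sparse_tile[5][2] = 1 << 17  # a single edge somewhere in the tile

todo = tiles_to_compute([zero_tile, sparse_tile, zero_tile])
```

Only the second tile survives the check, so two of the three tiles would incur neither the remaining loads nor the GEMM.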
4.4. Non-Zero Tile Reuse
In addition to jumping over the zero tiles, we further consider reusing the non-zero tiles to improve data locality. In the aggregation step of the GNN computation, we generate the output feature map at different bit levels separately. For example, when we choose a 1-bit subgraph adjacency matrix and a 4-bit feature embedding matrix, we execute the iteration 4 times to generate the output. One straightforward solution, called cross-bit reduction, is to generate the complete output matrix tile at each bit level first. This requires loading the matrix tiles repeatedly, as shown in Figure 6(a). The problem is that each non-zero tile from the adjacency matrix is loaded again when computing with each bit matrix from the embedding matrix.
Instead, we can reorder the steps to maximize the benefit of each non-zero tile of the subgraph adjacency matrix. As shown in Figure 6(b), we introduce a cross-tile reduction strategy. Specifically, for each loaded non-zero fragment, we use it to generate an output tile at all bit levels and do a localized reduction (only on the current tile) to generate a partial aggregation result. Once this is done, we move forward to the next non-zero tile and repeat the same process until all non-zero tiles have been processed. The cost of loading the non-zero tiles is thereby reduced by a factor of $s$, where $s$ is the number of bits for node embeddings, since each non-zero tile is loaded once rather than once per bit level.
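The difference between the two loop orders can be sketched with a one-row toy example that counts how often each non-zero adjacency tile is loaded. Tiles and bit planes are represented as small packed integers, and all names and data are made up for illustration.

```python
def popcnt(x):
    """Count the 1 bits of a non-negative integer."""
    return bin(x).count("1")


def cross_bit_reduction(adj_tiles, emb_planes):
    """Outer loop over bit levels: every non-zero tile is loaded once per bit level."""
    total, loads = 0, 0
    for bit, plane in enumerate(emb_planes):
        for t, tile in enumerate(adj_tiles):
            if tile == 0:
                continue  # zero-tile jumping
            loads += 1
            total += popcnt(tile & plane[t]) << bit
    return total, loads


def cross_tile_reduction(adj_tiles, emb_planes):
    """Outer loop over tiles: each non-zero tile is loaded once and reused for all bits."""
    total, loads = 0, 0
    for t, tile in enumerate(adj_tiles):
        if tile == 0:
            continue  # zero-tile jumping
        loads += 1
        partial = 0  # localized reduction on the current tile
        for bit, plane in enumerate(emb_planes):
            partial += popcnt(tile & plane[t]) << bit
        total += partial
    return total, loads


adj_tiles = [0b1011, 0b0000, 0b0110]   # one adjacency row split into three tiles
emb_planes = [
    [0b1111, 0b1010, 0b0101],          # bit 0 of the embeddings, per tile
    [0b0011, 0b1100, 0b0110],          # bit 1 of the embeddings, per tile
]

by_bit = cross_bit_reduction(adj_tiles, emb_planes)
by_tile = cross_tile_reduction(adj_tiles, emb_planes)
```

Both orders produce the same aggregated value, but the cross-tile order loads each non-zero tile once instead of once per embedding bit.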
4.5. Inter-layer Kernel Fusion
Across the GNN layers, we incorporate low-bit data transferring. Specifically, the output of one hidden layer is directly handed over to the next layer as its input. There are several strategies we use. First, we apply data quantization and bit-decomposition at the end of computation kernels such as the neighbor aggregation and node update. This helps to avoid writing the result to the slow global memory and applying the data manipulation again. Second, standalone activation function kernels, such as ReLU and tanh, can be effectively fused into our computation kernel as a device function, which can directly operate on the shared memory results to achieve high performance. For the batch normalization (BN) layers that follow the graph convolution layers, we can also do layer fusion based on

$$y = \gamma \cdot \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta$$

where $\gamma$, $\beta$, and $\epsilon$ are the BN parameters that can be incorporated into the low-bit convolutional kernel to avoid launching a BN kernel. One thing worth noting is that computation at the hidden layers and the output layer is slightly different. For hidden layers, each kernel applies quantization and bit-decomposition to the output activation, since the next layer relies on low-bit data as its input. For the last layer, once the full-precision accumulation is complete, the kernel directly outputs the full-precision result for the softmax layer (considering the node classification task) to generate logits that demand high precision.
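The BN fusion idea can be illustrated with a short sketch, assuming the standard inference-time batch normalization with a learned scale and shift and fixed running statistics (the parameter names below are conventional but are assumptions here, as is the function naming). Folding reduces BN to a single multiply-add that can be applied at the end of the preceding kernel.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch normalization of a scalar activation."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold BN into one (scale, shift) pair so that bn(x) == x * scale + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


scale, shift = fold_batch_norm(gamma=2.0, beta=0.5, mean=1.0, var=4.0, eps=0.0)
fused = 3.0 * scale + shift  # applied as an epilogue of the convolution kernel
reference = batch_norm(3.0, gamma=2.0, beta=0.5, mean=1.0, var=4.0, eps=0.0)
```

Since `scale` and `shift` depend only on fixed parameters, they can be computed once on the host and passed to the kernel, so no separate BN kernel launch is needed.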
4.6. Bandwidth-Optimized Subgraph Packing
During the GNN computation of the subgraphs, data communication between the CPU host and GPU device is also unavoidable. It will swap the subgraph data (such as edge lists and node embedding) in/out of the GPU device. One basic approach is to transfer the dense adjacent matrix in floating point numbers considering that the size of a single subgraph is generally within the range of the modern GPU memory. However, this could easily lead to a huge amount of data traffic between the CPU and GPU host. The transferring performance is heavily bounded by PCIe bandwidth (32 GB/s for PCIe 4.0x16) between the CPU host and the GPU device. For the node embedding matrix, the current practice is to transfer the node embedding matrix by initializing another standalone PCIe transferring, which incurs additional overheads and is unable to maximize the bandwidth usage.
To overcome these issues, we employ a new strategy called bandwidth-optimized subgraph packing. Instead of directly migrating the dense or sparse adjacency matrix in single-precision floating point, we transfer only the compressed low-bit adjacency matrix and the low-bit embedding matrix. This significantly reduces the data traffic on the high-cost PCIe-based data communication. In addition, we first pack the low-bit adjacency matrix and the low-bit embedding matrix into a compound memory object on the host (by using torch.nn.Module and register_buffer) and then initiate a single transfer of this memory object from the host CPU to the GPU device.
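As a rough, back-of-the-envelope illustration of why packing helps, the sketch below estimates the bytes moved over PCIe for one subgraph. The subgraph size (4,096 nodes, 64-dimensional embeddings), the 4-bit embedding setting, and the helper names are assumptions chosen for illustration; the 32 GB/s figure is the PCIe 4.0 x16 bandwidth quoted above, and real transfers also pay latency and protocol overheads that this model ignores.

```python
def transfer_bytes(num_nodes, dim, adj_bits, emb_bits):
    """Bytes needed for a dense N x N adjacency matrix plus an N x dim embedding matrix."""
    adj_bit_count = num_nodes * num_nodes * adj_bits
    emb_bit_count = num_nodes * dim * emb_bits
    return (adj_bit_count + emb_bit_count + 7) // 8  # round up to whole bytes


def pcie_seconds(nbytes, bandwidth_gb_s=32.0):
    """Idealized transfer time at the given bandwidth (GB/s), ignoring latency."""
    return nbytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    n, d = 4096, 64                                        # assumed subgraph shape
    fp32 = transfer_bytes(n, d, adj_bits=32, emb_bits=32)  # dense FP32 baseline
    packed = transfer_bytes(n, d, adj_bits=1, emb_bits=4)  # 1-bit adjacency, 4-bit embedding
    print(fp32, packed, fp32 / packed)
    print(pcie_seconds(fp32), pcie_seconds(packed))
```

Under these assumptions the packed representation moves roughly 30 times fewer bytes than the dense FP32 one, and sending both matrices as one object avoids a second transfer launch.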
5. Integration with PyTorch
Besides the highly efficient kernel design and data transfer optimization, we integrate QGTC with the popular PyTorch framework for better usability and programmability. There are, however, two key technical challenges. The first is how to represent quantized low-bit numbers in Tensor-based frameworks that are built on byte-based data types (e.g., int32). The second is how to apply valid computation between quantized low-bit numbers and those well-defined byte-based numbers. For example, how can we obtain the correct result when multiplying a 2-bit number by a 32-bit integer? To this end, we introduce two new techniques.
Bit-Tensor Data Type: We use the 32-bit IntTensor in PyTorch as the “vehicle” for holding any-bitwidth quantized data, and we leverage our 3D-stacked bit compression technique (Section 4.2) to package the quantized data. We offer a PyTorch API, Tensor.to_bit(nbits), for this data type conversion. Note that existing PyTorch APIs, such as print, are only defined for complete data types, such as Int. Therefore, to access the element values of a bit-Tensor, we provide Tensor.to_val(nbits) to decode a bit-Tensor into an int32 Tensor (converting each element from a low-bit number to an int32 number). This makes our design compatible with existing PyTorch functionality.
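To make the encode/decode round trip concrete, here is a minimal pure-Python sketch of the idea behind such a conversion: each value is split into bit planes, and each plane is packed 32 elements per 32-bit word. The free functions `to_bit` and `to_val` below only borrow the names of the APIs above; their signatures and the word layout are assumptions for illustration and do not reproduce the actual 3D-stacked layout used by QGTC.

```python
def to_bit(values, nbits):
    """Split unsigned integers into `nbits` bit planes, packing 32 elements per word."""
    num_words = (len(values) + 31) // 32
    planes = []
    for b in range(nbits):
        words = [0] * num_words
        for i, v in enumerate(values):
            if (v >> b) & 1:
                words[i // 32] |= 1 << (i % 32)
        planes.append(words)
    return planes  # planes[b][w] holds bit b of elements 32*w .. 32*w + 31


def to_val(planes, length):
    """Decode bit planes back into a list of ordinary integers."""
    values = [0] * length
    for b, words in enumerate(planes):
        for i in range(length):
            if (words[i // 32] >> (i % 32)) & 1:
                values[i] |= 1 << b
    return values


if __name__ == "__main__":
    data = [0, 1, 2, 3, 3, 0]      # 2-bit quantized values
    packed = to_bit(data, nbits=2)
    print(packed)
    print(to_val(packed, len(data)))
```

Decoding back to ordinary integers is what lets existing framework functions, which expect complete data types, operate on the data.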
Bit-Tensor Computation: We handle two different types of computation: 1) operations that involve only bit-Tensors and 2) operations that involve both a bit-Tensor and a float or int32 Tensor. For the first type, we built two APIs, depending on whether we want an int32 output or a quantized low-bit output as a bit-Tensor. For any-bitwidth MM with low-bit output, the API is bitMM2Bit(C, A, B, bit_A, bit_B, bit_C), where A and B are bit-Tensors and bit_A/B/C are bitwidth parameters. For any-bitwidth MM with int32 output, the API is bitMM2Int(C, A, B, bit_A, bit_B). For the second type, we first decode the bit-Tensor into a float/int32 Tensor by using Tensor.to_val(nbits) and then call the official PyTorch APIs for regular full-precision computation.
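The arithmetic behind an any-bitwidth matrix multiplication with integer output can be sketched as a sum of 1-bit products: each operand is decomposed into bit planes, every pair of planes is multiplied with a bitwise AND followed by a population count, and the partial result is weighted by the corresponding power of two. The sketch below is a CPU emulation under the assumption of unsigned quantized values; `bit_mm` and its helpers are hypothetical names and are not the bitMM2Int implementation.

```python
def bit_plane(matrix, b):
    """Extract bit b of every element as a 0/1 matrix."""
    return [[(v >> b) & 1 for v in row] for row in matrix]


def pack_bits(bits):
    """Pack a list of 0/1 values into one Python integer (bit k = bits[k])."""
    word = 0
    for k, bit in enumerate(bits):
        word |= bit << k
    return word


def bit_mm(A, B, bits_a, bits_b):
    """Compute C = A @ B for unsigned integer matrices using only 1-bit products."""
    m, n = len(A), len(B[0])
    B_T = [list(col) for col in zip(*B)]  # columns of B as rows
    C = [[0] * n for _ in range(m)]
    for i in range(bits_a):
        a_rows = [pack_bits(r) for r in bit_plane(A, i)]
        for j in range(bits_b):
            b_cols = [pack_bits(c) for c in bit_plane(B_T, j)]
            for r in range(m):
                for c in range(n):
                    # 1-bit dot product: AND then popcount, weighted by 2^(i+j)
                    C[r][c] += bin(a_rows[r] & b_cols[c]).count("1") << (i + j)
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 0]]  # 2-bit values
    B = [[1, 3], [2, 1]]  # 2-bit values
    print(bit_mm(A, B, bits_a=2, bits_b=2))
```

The number of 1-bit products grows with bits_a × bits_b, which is why lower bitwidths translate into less bit-level work.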
Benchmarks: We choose the two most representative GNN models widely used by previous work (Wang et al., 2019; Fey and Lenssen, 2019; Ma et al., 2019) on the node classification task to cover different types of aggregation. 1) Cluster GCN (Kipf and Welling, 2017) is one of the most popular GNN model architectures. It is also the key backbone network for many other GNNs, such as GraphSAGE (Hamilton et al., 2017) and differentiable pooling (Diffpool) (Ying et al., 2018). For the cluster GCN evaluation, we use 3 layers with 16 hidden dimensions per layer. 2) Batched GIN (Xu et al., 2019) differs from cluster GCN in its order of aggregation and node update: batched GIN aggregates neighbor embeddings before the node feature update (via linear transformation). According to Xu et al. (2019), GIN demonstrates its strength by capturing graph properties that cannot be captured by GCN. Therefore, improving the performance of GIN will benefit more advanced GNNs, such as GAT (Veličković et al., 2018). For the batched GIN evaluation, we use 3 layers with 64 hidden dimensions per layer. For the quantization bitwidth, we cover the bitwidth settings from existing quantized GNN studies (Feng et al., 2020; Tailor et al., 2021) and also conduct a comprehensive experimental analysis of different bitwidth settings.
Baselines: We choose several baselines for comparison. For the end-to-end runtime performance comparison, we choose Deep Graph Library (DGL) (Wang et al., 2019), which is the state-of-the-art GNN framework on GPUs. DGL is built with highly optimized CUDA-based GNN kernels as the backend and uses PyTorch (Paszke et al., 2019) as its front-end. For the GNN aggregation kernel performance comparison, we choose the state-of-the-art quantized GEMM implementations on GPU Tensor Core: cuBLAS (NVIDIA) with int8 precision and CUTLASS (NVIDIA) with int4 precision.
Datasets: We cover all three types of datasets that have been used in many previous GNN-related works (Wang et al., 2019; Fey and Lenssen, 2019). Details of these datasets are listed in Table 1. Specifically, Type I graphs are the popular GNN datasets evaluated by many GNN algorithmic papers (Kipf and Welling, 2017; Xu et al., 2019). Type II graphs (Kersting et al., 2016) are the popular benchmark datasets for graph kernels in many frameworks for GNN algorithmic research. Type III graphs (Hu et al., 2020) are challenging GNN datasets because of their large number of nodes and edges. These graphs demonstrate high irregularity in their structures. Note that we partition the graphs by using METIS (Karypis and Kumar, 2009) and set the total number of subgraphs to 1,500, following prior work (Chiang et al., 2019; Zeng and Prasanna, 2020).
Platforms & Metrics: The QGTC backend is implemented in C++ and CUDA C, while the QGTC front-end is implemented in Python. Our major evaluation platform is an Ubuntu server (16.04) with an 8-core 16-thread Intel Xeon Silver 4110 CPU@2.8GHz with 64GB host memory and an NVIDIA Ampere RTX3090 GPU with 24GB device memory. The GPU device kernel is compiled with CUDA (v11.0), and the CPU host code is compiled with GCC 7.5.0 with the compilation option “-std=c++14 -O3” for integration with the PyTorch framework. To measure the performance speedup, we calculate the average latency over 200 rounds of end-to-end runs.
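The measurement procedure can be sketched as a small timing harness. The helper names and the warm-up phase are assumptions added for illustration; only the 200-round averaging comes from the text. When timing GPU work from PyTorch, the timed callable would additionally need to wait for the device to finish (for example by calling torch.cuda.synchronize()), since kernel launches are asynchronous.

```python
import time


def average_latency_ms(run_once, rounds=200, warmup=10):
    """Average wall-clock latency of `run_once` in milliseconds over `rounds` calls."""
    for _ in range(warmup):   # assumed warm-up, excluded from timing
        run_once()
    start = time.perf_counter()
    for _ in range(rounds):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1e3 / rounds


def speedup(baseline_ms, ours_ms):
    """Speedup of the proposed system relative to the baseline latency."""
    return baseline_ms / ours_ms


if __name__ == "__main__":
    latency = average_latency_ms(lambda: sum(range(1000)))
    print(f"{latency:.4f} ms per round")
```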
6.1. Compared with DGL
In this section, we conduct a detailed end-to-end comparison with the DGL framework under different choices of bitwidth. As shown in Figure 7(a) and Figure 7(b), QGTC achieves on average 2.6× and 2.8× end-to-end inference speedup over DGL across the three types of datasets for cluster GCN and batched GIN, respectively. We also notice that the performance benefit is closely related to the chosen bitwidth: the 16-bit to 32-bit settings show a large performance difference compared with the 1-bit to 8-bit settings. We next provide a detailed analysis and give insights for each type of dataset. With fewer bits for both the weights and the node embedding features, QGTC is more likely to reach higher performance, because a smaller bitwidth leads to less memory access and fewer computations at the bit level. DGL reaches inferior performance because 1) FP32 computation has a higher computational complexity than our QGTC low-bit design, and 2) DGL can only rely on CUDA cores for computation, whose peak performance is lower than the throughput QGTC obtains on TC.
|Settings||FP32||16 bits||8 bits||4 bits||2 bits|
Compared with cluster GCN, the experimental results on batched GIN show higher benefits of QGTC over DGL. This is because batched GIN aggregates the neighbor embeddings before the node update, which leads to a higher computation-to-communication ratio. QGTC achieves relatively higher performance improvements on Type III datasets. The major reason is that, under the same number of partitions, the size of each partition (subgraph) increases with the larger number of nodes and edges. This also improves the computation intensity, which highlights QGTC’s performance advantages of quantized low-bit computation on GPU Tensor Cores.
Accuracy w.r.t. Quantization Bits. To build the QGNN model, we apply quantization-aware training and evaluate the model testing accuracy w.r.t. quantization bits on two large Type III datasets with the GCN model for demonstration. As shown in Table 2, the GNN model is resilient to low-bit quantization and can maintain the model accuracy to a large extent. Combining these results with the above performance evaluation under different quantization bits, we conclude that making the right tradeoff between runtime performance and model accuracy is meaningful and can bring benefits to different application settings.
6.2. Compared with other baselines
Compared with cuBLAS-int8 on TC. We further compare our low-bit computation (from 2-bit to 7-bit) against the state-of-the-art cublasGemmEx quantized (int8) GEMM solution on Tensor Core in terms of throughput. Note that int8 is the minimum bitwidth that cuBLAS currently supports for quantized computation on Tensor Core. In this study, we mainly focus on the computation of $A \cdot X$ (i.e., $N \times N \times D$, where $N$ is the number of nodes and $D$ is the node embedding dimension) for the neighbor aggregation phase. As shown in Figure 7(c), QGTC achieves significant throughput improvement over Tensor Core cuBLAS (int8) in low-bit settings. The major reason is that our QGTC design effectively reduces the computation and the data movement at the bit level, thereby harvesting the real performance gains of low-bit quantization on GPUs. As the number of quantization bits approaches 8, the performance gains decrease because of the increase in bit-level computations.
Compared with CUTLASS-int4 on TC. We also compare against the latest CUTLASS (NVIDIA) (v2.7) with the int4 Tensor Core GEMM in terms of throughput (TFLOPS) for $A \cdot X$. The results are summarized in Table 3, where we can clearly see the throughput advantage over the CUTLASS implementation. Note that all reported decimal numbers are in TFLOPS; N is the adjacency matrix size and Dim is the node feature embedding dimension. The graph adjacency matrix is stored in 1-bit. QGTC (2-bit) means a 2-bit representation for the embedding matrix.
|N||Dim||CUTLASS (int4)||QGTC (1-bit)||QGTC (2-bit)||QGTC (3-bit)||QGTC (4-bit)|
The major reason behind this performance improvement is that our QGTC design can use a 1-bit binary representation for the graph adjacency matrix and an n-bit (n=1,2,3,4) representation for the node embedding matrix, while CUTLASS int4 only supports 4-bit × 4-bit computation. Thus, CUTLASS has to use a 4-bit representation for both the adjacency matrix and the embedding matrix during computation.
6.3. Additional Studies
In this section, we will conduct detailed studies to demonstrate the effectiveness of our design and optimizations.
Zero-Tile Jumping. We compute the ratio of the non-zero TC tiles (8×128) that are actually involved in our computation to the total number of TC tiles in the adjacency matrix.
As shown in Figure 8, our zero-tile jumping technique can largely save the effort of processing all-zero tiles. Based on our observation, the all-zero TC tiles come from two sources. The first type comes from batching subgraphs, because there are no edge connections among nodes across different subgraphs. This type dominates the overall number of all-zero tiles. The second type comes from the missing edge connections between nodes within each subgraph. Although this type is minor in quantity compared with the first, skipping it can still reduce memory access and computation.
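The ratio discussed above can be reproduced on a toy example. The sketch below, with hypothetical helper names, scans a 0/1 adjacency matrix in 8×128 tiles and counts the tiles that contain at least one edge; only those tiles would need to be processed. The block-diagonal construction, which uses fully connected subgraphs, is an assumption chosen to mimic the effect of batching subgraphs that share no edges.

```python
def block_diagonal_ones(sizes):
    """Adjacency of batched, fully connected subgraphs with no cross-subgraph edges."""
    n = sum(sizes)
    adj = [[0] * n for _ in range(n)]
    offset = 0
    for s in sizes:
        for r in range(offset, offset + s):
            for c in range(offset, offset + s):
                adj[r][c] = 1
        offset += s
    return adj


def nonzero_tile_ratio(adj, tile_rows=8, tile_cols=128):
    """Return (non-zero tiles, total tiles) for a 0/1 matrix split into tiles."""
    total = nonzero = 0
    for r0 in range(0, len(adj), tile_rows):
        for c0 in range(0, len(adj[0]), tile_cols):
            total += 1
            rows = adj[r0:r0 + tile_rows]
            if any(any(row[c0:c0 + tile_cols]) for row in rows):
                nonzero += 1  # this tile must be processed; all others are skipped
    return nonzero, total


if __name__ == "__main__":
    adj = block_diagonal_ones([128, 128])  # two batched subgraphs of 128 nodes
    print(nonzero_tile_ratio(adj))
```

In this toy batch, half of the tiles lie entirely in the off-diagonal blocks and can be skipped, which corresponds to the first (dominant) source of all-zero tiles described above.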
Adjacency Matrix Size. We demonstrate the impact of the subgraph adjacency matrix size on the performance of QGTC. Specifically, the adjacency matrix size can be controlled by specifying the number of subgraphs (in METIS) and the batch size (in the data loader). The size of the adjacency matrix affects the aggregation performance in terms of computation and data movement, and it also determines whether our GNN computation can fully utilize the available GPU resources. We use 1-bit for both the adjacency matrix and the node embedding matrix in this study. As shown in Figure 9, under the same embedding dimension $D$, our major 1-bit GEMM computation kernel scales up its performance well as the number of nodes $N$ increases. Note that different colored lines represent different embedding sizes, and we mainly focus on the computation of $A \cdot X$ (i.e., $N \times N \times D$, where $N$ is the number of nodes and $D$ is the node embedding dimension) for the neighbor aggregation phase. For small subgraph sizes (128 to 512), the increase in overall computation throughput is not evident, because the computation size is small and most of the available GPU resources, such as SMs, achieve low utilization. For subgraph sizes from 512 to 16,384, we notice a more significant increase in TFLOPS, because more computation from the bit-level data manipulation triggers more SMs to participate in the BMM computation, thereby improving the overall GPU throughput. For large subgraph sizes (> 16,384), the overall throughput hardly increases, mainly because almost all available GPU computation units are already in use.
One specialty of batched GNN computation compared with traditional NN computation is that batched GNNs have more skewed-sized matrices in terms of the ratio between the number of nodes $N$ and the embedding dimension $D$. This, to some degree, limits the achievable peak performance on TC. It is also worth noticing that among the different lines (different choices of $D$), a larger $D$ usually leads to better utilization of the GPU, since more computation and memory resources of the GPU become active for higher throughput.
Non-zero Tile Reuse. We demonstrate the effectiveness of our non-zero tile reuse with a control-variable study. We eliminate the impact of the number of non-zero tiles on performance by setting all tiles to non-zero tiles (i.e., filling the initial matrix with all ones). We then choose the neighbor aggregation process ($A \cdot X$) for the study and fix the embedding dimension $D$ to 1,024. We vary the number of nodes $N$ from 1,024 to 8,192. Three bit combinations are used in our evaluation, where the adjacency matrix $A$ consistently uses 1 bit while the embedding matrix $X$ uses 4, 8, and 16 bits.
As shown in Figure 10, our non-zero tile reuse improves the throughput on large matrix sizes with a higher number of bits. The major reason is that reusing the non-zero tile largely reduces the global memory accesses for repeatedly fetching the same 1-bit adjacency matrix tile, which is the key performance bottleneck for large matrices. The setting without non-zero tile reuse shows more advantage on smaller matrices, because the overhead of repeatedly loading the same adjacency matrix tile is not pronounced compared with the GEMM operations on TC. This study inspires us to devise a more intelligent strategy or heuristic to determine under which conditions non-zero tile reuse brings performance benefits, and we leave this exploration for future work.
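The memory-access argument can be made concrete with a simplified load-counting model. It assumes that, without reuse, the adjacency tile is fetched from global memory once per embedding bit plane, whereas with reuse it is fetched once and then multiplied against every bit plane. The function name and the loop structure are illustrative assumptions; the model ignores caching and the other costs of the actual kernel.

```python
def adjacency_tile_loads(num_nonzero_tiles, emb_bits, reuse):
    """Count global-memory fetches of 1-bit adjacency tiles under two loop orders."""
    loads = 0
    if reuse:
        # Tile-major order: fetch each tile once, then sweep all embedding bit planes.
        for _tile in range(num_nonzero_tiles):
            loads += 1
            for _plane in range(emb_bits):
                pass  # 1-bit tile product against this bit plane (no new tile fetch)
    else:
        # Plane-major order: every bit plane re-fetches every adjacency tile.
        for _plane in range(emb_bits):
            for _tile in range(num_nonzero_tiles):
                loads += 1
    return loads


if __name__ == "__main__":
    for bits in (4, 8, 16):
        without = adjacency_tile_loads(1000, bits, reuse=False)
        with_reuse = adjacency_tile_loads(1000, bits, reuse=True)
        print(bits, without, with_reuse)
```

Under this model the saving in tile fetches grows linearly with the embedding bitwidth, which is consistent with the larger benefit reported for higher bit settings.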
In this paper, we propose QGTC, the first QGNN computing framework to support any-bitwidth computation via GPU Tensor Core. Specifically, we introduce the first GNN-tailored any-bitwidth arithmetic design that can emulate different bitwidth computations to meet end users' demands. We craft a TC-tailored CUDA kernel design by incorporating 3D-stacked bit compression, zero-tile jumping, and non-zero tile reuse techniques to maximize the performance gains from GPU Tensor Core. We also incorporate an effective bandwidth-optimized subgraph packing strategy to maximize the data transfer efficiency. Finally, we integrate QGTC with the popular PyTorch framework for better programmability and extensibility. Extensive experiments show significant performance gains of QGTC in practice.
We would like to thank the anonymous PPoPP paper reviewers for their valuable suggestions on our paper writing and the PPoPP artifact reviewers for helping us improve our software artifact functionality and reusability to benefit future research. This work was supported by the National Science Foundation under award number 2124039.
- TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, Savannah, GA, USA.
- Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
- Binary graph neural networks. arXiv preprint arXiv:2012.15823.
- Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM International Conference on Knowledge Discovery & Data Mining.
- Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, pp. 305–316.
- Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference.
- Accelerating reduction and scan using tensor core units. In Proceedings of the ACM International Conference on Supercomputing.
- EGEMM-tc: accelerating scientific computing tensor cores with extended precision. ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP).
- SGQuant: squeezing the last bit on graph neural networks with specialized quantization. IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI).
- Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (ICLR).
- Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.
- Inductive representation learning on large graphs. In Advances in neural information processing systems (NeurIPS).
- Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.
- A broader picture of random-walk based graph embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 685–695.
- POLE: polarized embedding for signed networks.
- Binarized neural networks. In Proceedings of the 30th international conference on neural information processing systems.
- Parallelization of reordering algorithms for bandwidth and wavefront reduction. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
- MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0. University of Minnesota, Minneapolis, MN. http://www.cs.umn.edu/~metis
- Benchmark data sets for graph kernels.
- Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR).
- Accelerating binarized neural networks via bit-tensor-cores in turing gpus. IEEE Transactions on Parallel and Distributed Systems (TPDS).
- The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (7), pp. 1019–1031.
- Neugraph: parallel deep neural network computation on large graphs. In USENIX Annual Technical Conference.
- CUBLAS library.
- CUDA template library for dense linear algebra at all levels and scales (cutlass).
- Warp matrix multiply-accumulate (wmma).
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS) 32.
- Near linear time algorithm to detect community structures in large-scale networks. Physical review E.
- Degree-quant: quantization-aware training for graph neural networks. International Conference on Learning Representations.
- Programming tensor cores in cuda 9.
- Graph attention networks. In International Conference on Learning Representations (ICLR).
- Deep graph library: towards efficient and scalable deep learning on graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds.
- GNNAdvisor: an efficient runtime system for gnn acceleration on gpus. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’21).
- How powerful are graph neural networks?. In International Conference on Learning Representations (ICLR).
- Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS).
- Graphact: accelerating gcn training on cpu-fpga heterogeneous platforms. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
Appendix A Artifact Appendix
QGTC is an efficient design and implementation for Quantized GNN computing on NVIDIA Ampere GPUs (e.g., A100 and RTX3090). QGTC consists of two parts. The first part is the host-side CPU Python program. It is responsible for dataset loading, runtime configuration generation, and invoking the GPU-side program. The second part is the device-side GPU kernel. It is responsible for the major computation of the Quantized GNN model through floating-point number bit-decomposition and GEMM-based computation for quantized GNNs. QGTC improves the performance of Quantized GNN computing with its kernel design and optimization based on 1-bit Tensor Core primitive from NVIDIA Ampere Architecture. Moreover, the runtime configuration generation on the host-side CPU program makes QGTC more adaptive towards various kinds of input settings.
Intel CPU x86_64 with host memory >= 32GB. Tested on Intel Xeon Silver 4110 (8-core with 16-thread) CPU with 64GB host memory.
NVIDIA Ampere GPU with device memory >= 16GB. Tested on NVIDIA RTX 3070 and RTX3090 (both compute capability 8.6). Note that when creating this artifact, we mainly evaluated our design on the RTX3090. The execution time may differ across devices, but the overall speedup is similar.
OS & Compiler: NVIDIA-Docker-2.0, Ubuntu 16.04+, GCC 7.5+, CMAKE 3.14+, CUDA 11.3.
Step-1: Set up the basic environment. Two options:
Set up the environment via Docker (Recommended).
Run docker pull happy233/qgtc:updated
Run docker run -it --rm --gpus all -v $PWD/:/qgtc happy233/qgtc:updated /bin/bash
Set up via conda and pip.
Create a new conda environment: conda create -n env_name python=3.6
Activate conda environment: conda activate env_name
Install PyTorch: conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge and pip install torch requests.
Install DGL: conda install -c dglteam dgl-cuda11.0.
Install QGTC: TORCH_CUDA_ARCH_LIST="8.6" python setup.py clean --all install
Details of these options can be found in README.md.
Step-2: Install QGTC PyTorch Binding.
Go to QGTC_module/
Run ./build.sh to install the QGTC modules for running QGTC kernel.
Step-3: Download datasets. We preprocess the graph datasets into .npy format; they can be downloaded and extracted automatically by running ./download_dataset.sh. Note that the initial node embeddings are not included. For performance evaluation only, we generate an all-ones embedding matrix at runtime according to the user's input dimension parameter.
Running DGL baseline on GNN inference (Figure 7(a,b)).
Go to the root directory of this project.
Run ./1_7a_eval_DGL_cluster_GCN.py for the cluster GCN and ./1_7b_eval_DGL_batched_GIN.py for the batched GIN of the DGL baseline. Each script will automatically generate a .csv result file.
Running cuBLASgemmEX for INT8 GEMM kernel comparison (Figure 7(c)).
Go to cuBLASgemmEX/ directory.
Run ./compile.sh to compile cuBLAS baseline.
Run ./bench_cuBLAS_INT8.py to profile cuBLAS Tensor Core GEMM in INT8 precision.
Go to the project root directory.
Run ./2_7c_QGTC_GEMM_INT8.py to profile our QGTC low-bit GEMM built on 1-bit Tensor Core primitive for comparison.
Running QGTC on the cluster GCN and the batched GIN (Figure 7(a,b)).
Go to project root directory.
Run ./0_7a_eval_QGTC_cluster_GCN.py for the cluster GCN and ./0_7b_eval_QGTC_batched_GIN.py for the batched GIN and generate .csv result files.
Running some additional studies (Figure 8 and 9). Detailed commands of running all these studies can be found in README.md.
Note that in this artifact, we focus on the evaluation of the quantized GNN inference computation, and the reported time per epoch includes the quantized low-bit GNN model forward pass. We exclude the time of data loading and some other data preprocessing tasks.