I. Introduction
Recently, Graph Neural Networks (GNNs) have emerged as a new tool for various graph-based deep learning tasks (e.g., node classification [10, 6, 4] and link prediction [2, 12, 20]). Compared with standard methods for graph analytics, such as random walks [7, 18] and graph Laplacians [17, 16, 3], GNNs distinguish themselves with significantly higher accuracy [11, 22, 21] and better generality [8]. In addition, well-learned GNNs [11, 22, 8, 21] can easily be applied to different types of graph structures or dynamic graphs without much recomputation overhead.

However, the high memory footprint of GNNs prevents them from being effectively deployed in the vast majority of resource-constrained settings, such as embedded systems and IoT devices, which are essential for many domains. There are two major reasons behind this awkward situation. First, the input of GNNs consists of two parts, the graph structure (edge list) and the node features (embeddings), whose storage sizes grow dramatically as the graph becomes large. This stresses the very limited memory budgets of such small devices. Second, larger graphs demand more data operations (e.g., additions and multiplications) and data movements (e.g., memory transactions), which consume a lot of energy and drain the limited power budget of these tiny devices. To tackle these challenges, data quantization emerges as a "one-stone-two-birds" solution for resource-constrained devices: it can 1) effectively reduce the memory size of both the graph structure and the node embeddings, leading to less memory usage; and 2) effectively minimize the size of the manipulated data, leading to less power consumption.
Nevertheless, an efficient approach for GNN quantization is still missing. Existing approaches may 1) apply a simple yet aggressive uniform quantization to all data to minimize memory and power cost, which leads to high accuracy loss; or 2) apply a very conservative quantization to maintain accuracy, which leads to suboptimal memory- and energy-saving performance. While numerous works have explored quantization on CNNs [9, 1, 24, 14], directly applying these existing techniques without considering GNN-specific properties would easily result in unsatisfactory quantization performance. To address these problems, we believe three critical questions are noteworthy: 1) what types of data (weights or features) should be quantized? 2) what is an efficient quantization scheme suitable for GNNs? 3) how to determine the quantization bits?
To answer these questions, we make the following observations: a) quantization on node embedding features is more effective; as shown in Figure 1, the features take up most of the overall memory size, which demonstrates their significant memory impact; b) GNN computing paradigms differ across layers, graph nodes, and components, and these differences can be leveraged as the major "guideline" for enforcing a more efficient character-driven quantization.
Based on these observations, we make the following contributions in this paper to systematically quantize GNNs, as illustrated in Figure 2.

We propose a GNN-tailored quantization algorithm design to reduce memory consumption and a GNN quantization fine-tuning scheme to maintain accuracy.

We propose a multi-granularity quantization scheme featuring component-wise, topology-aware, and layer-wise quantization to meet diverse data precision demands.

We propose an automatic end-to-end bit selection that makes the most appropriate choice for the aforementioned quantization granularities.

Rigorous experiments show that SGQuant can reduce the memory consumption by up to 31.9x (from 4.25x) compared with the original full-precision model, while limiting the accuracy drop to 0.4% on average.
II. Background and Related Work
In this section, we first introduce the basics of Graph Neural Networks (GNNs) and then give some background on applying data quantization to GNNs.
II-A Graph Neural Network
Graph Neural Networks (GNNs) are now becoming a major way of gaining insights from graph structures. A GNN generally includes several graph convolutional layers, each of which consists of two components: an attention component and a combination component, as illustrated in Figure 3. Formally, given two neighboring nodes $i$ and $j$ (i.e., $j \in N(i)$) and their node embeddings $h_i^{(l)}$ and $h_j^{(l)}$ at layer $l$, GNNs first use the attention component to measure the relationship between these two nodes:
$a_{i,j}^{(l)} = \mathrm{Atten}\big(h_i^{(l)}, h_j^{(l)}\big)$   (1)
One instantiation of the attention component is a single-layer neural network that concatenates the node embeddings $h_i^{(l)}$ and $h_j^{(l)}$ and multiplies them with the attention weight matrix $W_{atten}^{(l)}$, as is the case in GAT [21]. Note that GCN [11] is a special case whose attention weight matrix has all elements equal to one. Overall, $a^{(l)}$ is an attention matrix measuring the pairwise relationships between nodes, whose memory consumption increases quadratically with the number of nodes.
Then, the GNN computes the node embedding $h_i^{(l+1)}$ for node $i$ at layer $l+1$ with the combination component:
$h_i^{(l+1)} = \mathrm{Combine}\big(\{a_{i,j}^{(l)}, h_j^{(l)} \mid j \in N(i)\}\big)$   (2)
One popular instantiation of the combination component is to 1) average the embeddings from neighboring nodes weighted by the attention matrix $a^{(l)}$; and 2) multiply the averaged embedding with a combination weight matrix $W^{(l)}$:
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} a_{i,j}^{(l)} h_j^{(l)}\Big) W^{(l)}$   (3)
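As a concrete illustration, the following sketch computes Equations 1 to 3 for every node in plain Python. The scalar embeddings, adjacency-list format, and constant GCN-style attention score are simplifying assumptions for readability, not the paper's implementation.

```python
# Minimal sketch of one GNN layer (Equations 1-3). Scalars stand in for
# embedding vectors, and W is a scalar combination weight.

def atten(h_i, h_j):
    # Toy attention component. Returning a constant 1.0 corresponds to the
    # GCN special case where all attention weights equal one.
    return 1.0

def gnn_layer(h, neighbors, W):
    """h: list of node embeddings (floats); neighbors: adjacency list;
    W: combination weight. Returns the next-layer embeddings."""
    h_next = []
    for i, nbrs in enumerate(neighbors):
        # Equation 1: attention score for each neighbor j of node i
        a = {j: atten(h[i], h[j]) for j in nbrs}
        # Equation 3: attention-weighted aggregation, then combination weight
        agg = sum(a[j] * h[j] for j in nbrs)
        h_next.append(agg * W)
    return h_next
```

For a 3-node path graph with embeddings `[1.0, 2.0, 3.0]` and `W = 0.5`, the middle node aggregates both neighbors while the endpoints aggregate only one.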
For each layer $l$, GNNs maintain a $d^{(l)}$-dimensional embedding vector for each node and an $n \times d^{(l)}$ embedding matrix when storing the embeddings of all $n$ nodes. This embedding matrix grows linearly with the number of nodes and introduces heavy memory overhead for large graphs (e.g., Reddit [13] with 232,965 nodes).

Besides the concepts of Layer and Component, GNNs also involve a third concept, Topology. The topology of GNNs characterizes the graph structure based on the properties of nodes and the edge connections among them.
II-B Quantization
Numerous quantization works focus on data compression for Convolutional Neural Networks (CNNs). Song et al. [9] reduce the size of CNNs without accuracy loss through network pruning, weight quantization, and post-quantization network fine-tuning. Ron et al. [1] offer a post-training quantization targeting weights and activations that minimizes memory consumption at the tensor level. Z. Chen et al. [24] propose a ternary-value-based weight quantization to reduce the size of neural networks with minimal accuracy loss. Darryl [14] introduces a layer-wise quantizer for fixed-point implementations of DNNs.

Despite such great success on CNNs, efficient GNN quantization is yet to come, and we believe that GNNs have great potential for quantization, for several reasons: 1) GNN architectures display different levels of computation hierarchy, which allows for more specialized quantization based on their properties (e.g., different quantization strategies for nodes with different degrees, or for layers with different hidden dimensions) and facilitates a fine-grained quantization scheme based on the types of operations, whereas CNNs have feature maps of fixed shape that pass through the same set of NN layers; 2) GNNs are more diverse in their inputs, such as graph topologies (e.g., edge connections) and node features (embeddings), which enables quantization based on the categories of the data. Overall, we are the first to systematically and comprehensively explore quantization on GNNs by exploiting graph properties and the GNN architecture.
III. GNN-Tailored Quantization
In this section, we introduce our GNN-tailored quantization, which converts a full-precision GNN into a quantized GNN with reduced memory consumption.
III-A Quantization Algorithm Design
Two key designs differentiate our GNN quantization from existing work on CNN quantization. First, SGQuant quantizes both the attention matrix $a^{(l)}$ and the embedding matrix $h^{(l)}$, while CNN quantization generally considers only feature quantization, due to the intrinsic model difference between GNNs and CNNs. Second, when assigning different quantization bits to the attention matrix and the embedding matrix, SGQuant contains a "rematching" mechanism that matches their quantization bits and enables the computation in Equation 3.
Formally, given a quantization bit $q$ and the 32-bit attention matrix $a^{(l)}$ computed from Equation 1, we quantize it as a $q$-bit attention matrix:
$Q(a_{i,j}^{(l)}) = \Big\lfloor \frac{a_{i,j}^{(l)} - a_{min}}{s} \Big\rfloor, \quad s = \frac{a_{max} - a_{min}}{2^q - 1}$   (4)
where $a_{min}$ is an empirical lower bound of the attention matrix values, $s$ is the ratio between the attention matrix range and the $q$-bit representation range, and $\lfloor\cdot\rfloor$ is the floor function. Specifically, we evaluate the GNN on large graph benchmarks and collect statistics on its attention matrices, including the minimal value $a_{min}$ and the maximum value $a_{max}$. Then, we can compute the 32-bit scale parameter $s$ as the feature range $a_{max} - a_{min}$ over the $q$-bit representation range $2^q - 1$. While Equation 1 generates a 32-bit attention matrix requiring $32 \times n^2$ bits of memory, our quantized $q$-bit attention matrix requires only $q \times n^2$ bits. In this way, our quantization on the attention matrix reduces the memory consumption to $q/32$ of its full-precision version. In particular, once we have computed a 32-bit attention value, we can immediately quantize it into a $q$-bit value and store it in memory. Similarly, given a quantization bit $q$ and the 32-bit embedding matrix $h^{(l)}$ from Equation 3, we can generate a $q$-bit embedding matrix that reduces the memory consumption to $q/32$ of its full-precision version.
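The quantize/recover pair of Equation 4 can be sketched as follows; the function names and the scalar interface are illustrative assumptions.

```python
import math

def quantize(x, x_min, x_max, q):
    """Quantize a 32-bit float into a q-bit integer level (Equation 4).
    s is the scale: the value range over the q-bit representation range."""
    s = (x_max - x_min) / (2 ** q - 1)
    return int(math.floor((x - x_min) / s))

def dequantize(level, x_min, x_max, q):
    """Recover an approximate 32-bit value from a q-bit level."""
    s = (x_max - x_min) / (2 ** q - 1)
    return level * s + x_min
```

For example, with `q=4` the range `[0, 1]` is split into 15 steps, so a value of 0.5 maps to level 7 and recovers as roughly 0.467; the recovered value is always within one scale step of the original.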
Suppose we assign different quantization bits $q_{atten}$ and $q_{comb}$ to the attention matrix and the embedding matrix; there would then be an "unmatching bit" problem in Equation 3, since the attention value $Q(a_{i,j}^{(l)})$ and the embedding $Q(h_j^{(l)})$ have unmatching bits. To solve this problem, we propose a "rematching" mechanism that recovers the quantized values to 32-bit values before they enter the combination component. Specifically, we compute the recovered 32-bit attention as $\hat{a}_{i,j}^{(l)} = Q(a_{i,j}^{(l)}) \cdot s_{atten} + a_{min}$. Similarly, we can compute the recovered 32-bit embedding $\hat{h}_j^{(l)} = Q(h_j^{(l)}) \cdot s_{comb} + h_{min}$. Feeding these recovered values into the combination component, we can compute Equation 3 as
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} \hat{a}_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (5)
Note that the "rematching" mechanism introduces negligible memory overhead since, when we compute the node embedding $h_i^{(l+1)}$, we only recover the small set of nodes that have edge connections with node $i$. In addition, the $W^{(l)}$ here is a 32-bit value, since SGQuant only quantizes the GNN features, as discussed in Figure 1. Similarly, we can compute the attention component at layer $l+1$ as
$a_{i,j}^{(l+1)} = \mathrm{Atten}\big(\hat{h}_i^{(l+1)}, \hat{h}_j^{(l+1)}\big)$   (6)
where $a_{i,j}^{(l+1)}$ is a 32-bit value and can be quantized into $q$-bit values with Equation 4. Due to this similarity, we will only discuss quantizing the combination component in the following sections.
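A minimal sketch of the rematching step, assuming scalar embeddings and illustrative helper names: attention levels stored with `q_atten` bits and embedding levels stored with `q_comb` bits are each recovered to floats with their own scales before the weighted aggregation of Equation 5.

```python
# Sketch of the "rematching" mechanism (Equation 5). Attention and embedding
# values are stored at different bit widths and recovered just before use.

def rematch_combine(qa, qh, nbrs, W,
                    a_min, a_max, q_atten,
                    h_min, h_max, q_comb):
    """qa: dict neighbor id -> quantized attention level; qh: list of
    quantized embedding levels; nbrs: neighbor ids of the target node."""
    s_a = (a_max - a_min) / (2 ** q_atten - 1)
    s_h = (h_max - h_min) / (2 ** q_comb - 1)
    agg = 0.0
    for j in nbrs:
        a_hat = qa[j] * s_a + a_min   # recover 32-bit attention value
        h_hat = qh[j] * s_h + h_min   # recover 32-bit embedding value
        agg += a_hat * h_hat
    return agg * W                    # W stays in full precision
```

Since only the neighbors of the target node are recovered, the extra full-precision storage is proportional to the node degree, matching the paper's claim of negligible overhead.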
III-B GNN Quantization Fine-tuning
One challenge in GNN quantization is that directly applying quantization to GNNs during inference usually leads to a high accuracy loss. This accuracy loss can be largely recovered when we fine-tune the quantized GNNs. Note that this fine-tuning procedure only needs to be conducted once for a quantized GNN model. Overall, SGQuant uses the same loss as the original GNN model (e.g., the negative log-likelihood (NLL) loss for the semi-supervised node classification task). For the backpropagation related to GNN quantization, we derive the gradient as follows:
$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial \hat{h}} \cdot \frac{\partial \hat{h}}{\partial Q(h)} \cdot \frac{\partial Q(h)}{\partial h}$   (7)
Note that the computation of $Q(h)$ uses a floor function $\lfloor\cdot\rfloor$, whose gradient is zero almost everywhere and hinders the backpropagation. SGQuant uses the straight-through estimator, which assigns the gradient of the floor function to be 1. To this end, we can rewrite the gradient in Equation 7 as

$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial \hat{h}} \cdot s \cdot \frac{1}{s} = \frac{\partial L}{\partial \hat{h}}$   (8)
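The forward and backward passes around the floor function can be sketched as follows; this is a plain-Python illustration of the straight-through estimator, not the PyTorch-Geometric layer the paper implements.

```python
import math

def ste_forward(h, h_min, s):
    """Forward pass: quantize then recover,
    h_hat = s * floor((h - h_min) / s) + h_min."""
    level = math.floor((h - h_min) / s)
    return level * s + h_min

def ste_backward(grad_wrt_h_hat):
    """Backward pass with the straight-through estimator: the floor
    function's gradient is taken as 1, so d(h_hat)/dh == 1 and the
    upstream gradient passes through unchanged (Equation 8)."""
    return grad_wrt_h_hat
```

The forward pass still produces the quantized value (e.g., 0.73 with step 0.25 snaps to 0.5), but the backward pass ignores the snapping, which is what makes end-to-end fine-tuning of the quantized GNN possible.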
We implement a tailored GNN quantization layer in PyTorch-Geometric [5] that supports both quantized inference and backpropagation, such that SGQuant can easily conduct end-to-end fine-tuning.
IV. Multi-Granularity Quantization
When designing our specialized graph quantization (SGQuant) method, the quantization granularity is an important aspect to consider. In this section, we propose four different types of granularity: component-wise, topology-aware, layer-wise, and uniform, as illustrated in Figure 4. The simplest granularity is uniform quantization, which applies the same quantization bits to all layers and components in the GNN. It reduces memory consumption by replacing the 32-bit values with the corresponding $q$-bit quantized data representation. However, applying the same quantization bit to all layers, nodes, and components ignores their different sensitivity to quantization bits and the introduced numerical error, leading to degraded accuracy. To this end, we need quantization at finer granularity to cater to the different sensitivities.
IV-A Component-wise Quantization
Component-wise Quantization (CWQ) considers the quantization sensitivity of each GNN component and applies different quantization bits to different components, as illustrated in Figure 4(a). In each layer, modern GNNs usually contain an attention component for measuring the relationship between each pair of nodes and a combination component for computing the embeddings for the next layer. While the combination component is critical for providing fine-grained features for the next GNN layer, the attention component usually only provides a coarse-grained hint on the importance of one node $j$ to another node $i$. Our key insight is that the attention component is more robust to the numerical error in GNN quantization than the combination component. Thus, we can usually apply a lower quantization bit to the attention component than to the combination component.
Formally, CWQ maintains a quantization configuration

$S_{CWQ} = \{q_{atten}, q_{comb}\}$   (9)

for the quantization bits of each GNN component, where $q_{atten}$ and $q_{comb}$ are the quantization bits for the attention and the combination component, respectively. During quantization, we check the quantization bits for each component and conduct quantization correspondingly. Formally, we compute the quantized attention matrix $Q(a^{(l)})$ and the quantized embedding $Q(h^{(l)})$ as described in Equation 4. Note that the scale parameter $s$ in Equation 4 varies across components according to the assigned quantization bits $q_{atten}$ and $q_{comb}$. While CWQ may lead to multiplying two values with "unmatching" bits in the combination component, we can resolve this with the "rematching" mechanism in Equation 5. In particular, during combination, we first recover the quantized component values to their corresponding 32-bit representations $\hat{a}_{i,j}^{(l)}$ and $\hat{h}_j^{(l)}$, then compute with 32-bit values
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} \hat{a}_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (10)
IV-B Topology-aware Quantization
Topology-aware Quantization (TAQ) exploits the graph topology information and applies different quantization bits to different nodes based on their most essential topology property, the degree, as illustrated in Figure 4(b). In GNN computation, nodes with higher degrees usually receive more abundant information from their neighboring nodes, which makes them more robust to low quantization bits, since the random error from quantization can usually be averaged out by a large number of aggregation operations. In particular, given a quantization bit $q$, the quantization error of each node is a random variable that follows a uniform distribution whose range is determined by $h_{max} - h_{min}$, the difference between the maximum and the minimum embedding value, and the number of quantization levels $2^q - 1$. For a node with a large degree, we aggregate the embeddings from the node and a large number of its neighboring nodes, and the averaged error converges to its expectation following the law of large numbers [19]. To this end, nodes with a large degree are more robust to the quantization error, and we can apply smaller quantization bits to these high-degree nodes.

Formally, TAQ maintains a quantization configuration whose bits are selected according to the node degrees
$S_{TAQ} = \{q_1, q_2, \dots, q_K\}$   (11)
where the features of a node are assigned the quantization bit $q_k$ if its degree falls into the $k$-th degree interval. The template bits and the degree split points are set empirically, as illustrated in Figure 5. Suppose there are three nodes, node1, node2, and node3, with different node degrees; TAQ determines the quantization bits of each node based on these degrees. To get the appropriate quantization bit for different nodes, we propose and implement an Fbit function, as illustrated in Figure 5(b). We first create the most commonly used quantization bits as a template list (std_qbit) and predefine the degree split_point list. The Fbit function then maps each node to the corresponding quantization bits based on its degree. The strategy behind this mapping is to maintain higher quantization bits for low-degree nodes while penalizing high-degree nodes with low-bit quantization.
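A minimal sketch of the Fbit mapping follows; the `std_qbit` template and `split_point` values here are illustrative assumptions, since the paper's exact settings appear only in Figure 5.

```python
# Sketch of the Fbit mapping used by TAQ: higher-degree nodes get fewer
# bits. The template bits and degree thresholds below are illustrative,
# not the paper's exact configuration.

STD_QBIT = [8, 6, 4, 2]       # candidate bits, high to low precision
SPLIT_POINT = [4, 16, 64]     # degree thresholds between buckets

def fbit(degree, std_qbit=STD_QBIT, split_point=SPLIT_POINT):
    """Return the quantization bit for a node of the given degree."""
    for bits, threshold in zip(std_qbit, split_point):
        if degree < threshold:
            return bits
    return std_qbit[-1]       # highest-degree bucket: lowest bits
```

With these illustrative thresholds, a degree-2 node keeps 8 bits while a degree-100 hub is quantized down to 2 bits, following the strategy described above.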
Once we have assigned different quantization bits to different nodes, there is still an "unmatching" bits problem across nodes, similar to the "unmatching" problem across components. We can apply the "rematching" technique to the node embeddings and compute the combination component as
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} a_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (12)
where $a_{i,j}^{(l)}$ is a 32-bit value. TAQ does not quantize the attention matrix, since we consider only the first-order topology information and skip the second-order topology information that, for an edge $(i, j)$, the two nodes $i$ and $j$ may have different degrees.
IV-C Layer-wise Quantization
Layer-wise Quantization (LWQ) exploits the diverse quantization sensitivity of individual GNN layers and provides different quantization bits to each layer. Our key motivation is that leading layers usually take in detailed data and capture low-level features, while succeeding layers abstract these low-level details into high-level features. To this end, leading layers require large quantization bits to represent the low-level details, while succeeding layers need only small quantization bits for storing the high-level features. Our evaluation empirically confirms that, under the same memory consumption, assigning higher bits to the leading layers generally leads to higher accuracy than assigning higher bits to the succeeding layers.
Formally, LWQ maintains a quantization configuration

$S_{LWQ} = \{q^{(1)}, q^{(2)}, \dots, q^{(L)}\}$   (13)

where $q^{(l)}$ is the quantization bit at layer $l$ and $L$ is the number of GNN layers. In particular, a GNN quantized with LWQ uses the same quantization bit $q^{(l)}$ for both the attention matrix and the embedding matrix at layer $l$ and computes the combination component at layer $l$ as
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} \hat{a}_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (14)
IV-D Combining Multiple Granularities
Besides applying the above granularities standalone, SGQuant can effectively combine them in collaborative ways. We detail two major types of combinations as follows.
LWQ+CWQ
Note that LWQ and CWQ are complementary and can easily be combined to provide a more fine-grained quantization configuration

$S_{LWQ+CWQ} = \{q_{atten}^{(l)}, q_{comb}^{(l)}\}_{l=1}^{L}$   (15)
Our evaluation shows that LWQ+CWQ can use lower quantization bits at the same accuracy, compared to applying LWQ or CWQ alone. The main insight is that LWQ+CWQ provides finer granularity and can generate models with higher accuracy under the same memory budget. Formally, a GNN quantized with LWQ+CWQ computes the combination component as
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} \hat{a}_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (16)
We can similarly use LWQ+TAQ and CWQ+TAQ. We omit the details of these two combinations here due to page limits.
LWQ+TAQ+CWQ
We can also combine TAQ, LWQ, and CWQ to generate the quantization configuration

$S_{LWQ+TAQ+CWQ} = \{q_{atten}^{(l)},\; q_{comb}^{(l)}(d_i)\}_{l=1}^{L}$   (17)
Note that the quantization bits of the attention matrix do not depend on the topology information, as is the case in TAQ. Formally, a GNN with LWQ+TAQ+CWQ computes
$h_i^{(l+1)} = \Big(\sum_{j \in N(i)} \hat{a}_{i,j}^{(l)} \hat{h}_j^{(l)}\Big) W^{(l)}$   (18)
where the quantization bit of $\hat{h}_j^{(l)}$ is decided by the degree of node $j$.
V. Auto-bit Selection
Given the rich set of quantization granularities, one natural question arises: how can we assign quantization bits across the different granularities to achieve the sweet spot between accuracy and memory saving? Essentially, we need to solve a combinatorial optimization problem that minimizes the end-to-end loss by selecting a group of discrete quantization bits. Supposing we consider LWQ+TAQ+CWQ, we can formalize the combinatorial optimization problem as

$\min_{S,\, W} \; \mathcal{L}\big(\mathrm{GNN}(G;\, S, W)\big), \quad S = \{q_{atten}^{(l)},\; q_{comb}^{(l)}(d_i)\}_{l=1}^{L}$   (19)
where $\mathcal{L}$ is typically the cross-entropy loss for classification tasks and a norm-based loss for regression tasks, $q_{atten}^{(l)}$ is the quantization bits for the attention matrix at layer $l$, and $q_{comb}^{(l)}(d_i)$ is the quantization bits for the node embedding at layer $l$ for node $i$. We also include the weight matrices $W^{(l)}$ and $W_{atten}^{(l)}$ in the loss function, since we conduct end-to-end fine-tuning for the quantized GNN, as discussed in Section III-B.

There are three challenges in solving this combinatorial optimization problem. First, there is a large design space due to the abundant quantization granularities. When we apply LWQ, CWQ, and TAQ simultaneously, the number of possible quantization configurations increases exponentially, leading to huge manual effort in exploration. Second, large diversity exists in GNN model designs in terms of the attention generation in the aggregation component and the neural network design in the combination component. This diversity makes it hard to analytically compute the end-to-end quantization error and the impact of the quantization bits on the GNN predictions. Third, the graph topology usually varies in terms of the number of nodes and the degree distribution, making the quantization error intractable to measure analytically. As discussed in Section IV-B, this topology information usually has a high impact on the quantization error, requiring consideration of both the graph topology and the GNN design during the selection of quantization bits.
To address these challenges, we build an auto-bit selection (ABS) module with two main components: a machine learning cost model that predicts the accuracy of the quantized GNN under a given quantization configuration, and an exploration scheme that selects promising configurations.
V-A Machine Learning Cost Model
Before we dive into our machine learning (ML) cost model, we first discuss two baseline approaches. The first is random search with trial-and-error, which randomly samples a large number of quantization configurations and examines all samples to find the best one. However, this approach usually requires a large number of samples to find a good configuration. The second is to build a predefined cost model that analyzes the impact of quantization bits on the predictions for a particular GNN model and graph topology. However, this approach usually fails to generalize to various GNN models and graph inputs.
To this end, we build an ML cost model that learns on the fly the interaction among the quantization bits, the GNN model, and the graph topology. Figure 6 illustrates our ML cost model design. Given the quantization granularity and the bits to select, we randomly generate a set of configurations and extract their quantization bits as features. Then, we train and evaluate these configurations and use the measured accuracy as the true labels. Finally, we use the collected features and labels to train our ML cost model and use it to predict the accuracy of the remaining configurations. We treat this task as a regression problem and use a traditional ML model, a regression tree [15], as our cost model. We prefer the regression tree over neural networks since the former has a faster inference speed and does not require a large amount of training data.
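To illustrate the idea, the sketch below fits a one-level regression tree (a decision stump) on (bit-configuration, measured-accuracy) pairs; a full regression tree as used in the paper would recurse on each side of the split, and all names here are illustrative.

```python
# Dependency-free stand-in for the regression-tree cost model: a decision
# stump trained on quantization-bit feature vectors and measured accuracies.

def fit_stump(features, labels):
    """Pick the (dimension, threshold) split minimizing squared error;
    each leaf predicts the mean label of its side."""
    best = None
    for d in range(len(features[0])):
        for t in sorted({f[d] for f in features}):
            left = [y for f, y in zip(features, labels) if f[d] <= t]
            right = [y for f, y in zip(features, labels) if f[d] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - ml) ** 2 for y in left) + \
                  sum((y - mr) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, d, t, ml, mr)
    _, d, t, ml, mr = best
    return lambda f: ml if f[d] <= t else mr
```

Trained on a toy set where the first bit dominates accuracy, the stump correctly learns to split on that dimension and predicts the group means for unseen configurations.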
V-B Exploration Scheme
Given the ML cost model, a simple exploration scheme would evaluate all remaining quantization configurations and select the one with the highest predicted accuracy and the lowest memory size. However, this approach may fail in two cases. First, it is time-consuming to evaluate all remaining quantization configurations, especially when we use LWQ, CWQ, and TAQ simultaneously for a large GNN. Second, to reduce the overhead of auto-bit selection, we may use only a small number of quantization configurations for training the ML cost model, such that the trained model cannot predict precise accuracies for all remaining configurations.
To this end, we propose an exploration scheme that iteratively trains the ML cost model and selects promising configurations. In this way, we can balance the low overhead in training the ML cost model and the precise prediction of configuration accuracies. In particular, there are five steps in our exploration scheme.

Step 1: Randomly select a small number of configurations, extract their features, and measure their accuracies.

Step 2: Train the ML cost model on the collected features and labels.

Step 3: Sample a large number of configurations, use the ML cost model to predict their accuracies, and find the ones with the highest predicted accuracy.

Step 4: Extract the features of the selected configurations and measure their accuracies.

Step 5: Repeat Step 2 to Step 4 until reaching the preset number of iterations.
During this procedure, only configurations with a negligible accuracy drop will be kept. Among the remaining configurations, we select the one with the lowest memory consumption. The number of initially sampled configurations, the number of configurations selected per iteration, and the number of iterations are hyperparameters in ABS that balance the selection overhead and the ML cost model accuracy. Smaller values lead to lower selection overhead by reducing the number of quantization configurations that are trained and evaluated. We have experimented with diverse settings on an extensive collection of GNNs and datasets, and find that small values hit the balance between selection overhead and ML cost model accuracy. The reason is that our cost model is a traditional regression tree that can be trained with a small amount of data. Using more iterations, we can generally select configurations with lower memory consumption and higher accuracy, since more configurations are evaluated by our ML cost model. Our default setting leads to negligible latency at each iteration due to the fast inference speed of the regression tree.
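The five steps above can be sketched end to end as follows; `evaluate()` and `fit_cost_model()` are hypothetical stand-ins for profiling a quantized GNN and training the regression-tree cost model, and the hyperparameter defaults are illustrative, not the paper's.

```python
import random

# End-to-end sketch of the ABS exploration scheme (Steps 1-5).

def explore(configs, evaluate, fit_cost_model,
            n_init=8, n_select=4, n_iters=3, seed=0):
    rng = random.Random(seed)
    pool = list(configs)
    # Step 1: randomly profile a small seed set of configurations
    tried = {tuple(c): evaluate(c) for c in rng.sample(pool, n_init)}
    for _ in range(n_iters):
        # Step 2: train the cost model on everything measured so far
        feats, labels = zip(*[(list(k), v) for k, v in tried.items()])
        model = fit_cost_model(list(feats), list(labels))
        # Step 3: rank untried configurations by predicted accuracy
        rest = [c for c in pool if tuple(c) not in tried]
        rest.sort(key=model, reverse=True)
        # Step 4: measure the most promising candidates
        for c in rest[:n_select]:
            tried[tuple(c)] = evaluate(c)
        # Step 5: the loop repeats until n_iters is reached
    # Keep configurations with negligible accuracy drop, then return
    # the cheapest one (fewest total bits as a memory proxy).
    best_acc = max(tried.values())
    good = [c for c, acc in tried.items() if acc >= best_acc - 0.005]
    return min(good, key=sum)
```

On a toy search space where any per-layer bit of at least 4 preserves accuracy, the scheme returns the cheapest accurate configuration regardless of the random seed.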


TABLE I: GNN architecture configurations.

Arch  | Specification
------|----------------------
GCN   | hidden=32,  #layers=2
AGNN  | hidden=16,  #layers=4
GAT   | hidden=256, #layers=2



TABLE II: Dataset statistics.

Dataset         | #Vertex | #Edge       | #Dim  | #Class
----------------|---------|-------------|-------|-------
Citeseer        | 3,327   | 9,464       | 3,703 | 6
Cora            | 2,708   | 10,858      | 1,433 | 7
Pubmed          | 19,717  | 88,676      | 500   | 3
Amazon-computer | 13,381  | 245,778     | 767   | 10
Reddit          | 232,965 | 114,615,892 | 602   | 41

TABLE III: Accuracy, average bits, memory size, and memory saving of the full-precision and SGQuant reduced-precision models.

Dataset         | Network                  | Accuracy (%) | Average Bits | Memory Size (MB) | Saving
----------------|--------------------------|--------------|--------------|------------------|-------
Cora            | GCN (Full-Precision)     | --           | 32           | --               | --
                | GCN (Reduced-Precision)  | --           | --           | --               | --
                | AGNN (Full-Precision)    | --           | 32           | --               | --
                | AGNN (Reduced-Precision) | --           | --           | --               | --
                | GAT (Full-Precision)     | --           | 32           | --               | --
                | GAT (Reduced-Precision)  | --           | --           | --               | --
Citeseer        | GCN (Full-Precision)     | --           | 32           | --               | --
                | GCN (Reduced-Precision)  | --           | --           | --               | --
                | AGNN (Full-Precision)    | --           | 32           | --               | --
                | AGNN (Reduced-Precision) | --           | --           | --               | --
                | GAT (Full-Precision)     | --           | 32           | --               | --
                | GAT (Reduced-Precision)  | --           | --           | --               | --
Pubmed          | GCN (Full-Precision)     | 80.36        | 32           | 43.71            | --
                | GCN (Reduced-Precision)  | 80.28        | 2.9          | 4.01             | --
                | AGNN (Full-Precision)    | 80.44        | 32           | 43.46            | --
                | AGNN (Reduced-Precision) | 80.31        | 3.07         | 4.17             | --
                | GAT (Full-Precision)     | 78.00        | 32           | 44.48            | --
                | GAT (Reduced-Precision)  | 77.30        | 3.77         | 5.26             | --
Reddit          | GCN (Full-Precision)     | 81.07        | 32           | 328.70           | --
                | GCN (Reduced-Precision)  | 80.36        | 3.72         | 38.25            | --
                | AGNN (Full-Precision)    | 74.63        | 32           | 643.92           | --
                | AGNN (Reduced-Precision) | 74.40        | 4            | 113.92           | 5.65x
                | GAT (Full-Precision)     | 92.66        | 32           | 311.85           | --
                | GAT (Reduced-Precision)  | 92.23        | 4.07         | 39.70            | --
Amazon-Computer | GCN (Full-Precision)     | 89.57        | 32           | 44.58            | --
                | GCN (Reduced-Precision)  | 89.39        | 3.29         | 4.59             | --
                | AGNN (Full-Precision)    | 77.69        | 32           | 44.16            | --
                | AGNN (Reduced-Precision) | 77.33        | 4            | 5.99             | --
                | GAT (Full-Precision)     | 93.10        | 32           | 45.71            | --
                | GAT (Reduced-Precision)  | 92.60        | 7.53         | 10.75            | --
VI. Evaluation
In this section, we show the strength of our proposed quantization method through extensive experiments on various GNN models and datasets.
VI-A Experiment Setup
VI-A1 GNN Architectures
Graph Convolutional Network (GCN) [11] is the most basic and popular GNN architecture. It has been widely adopted in node classification, graph classification, and link prediction tasks. Besides, it is also the key backbone network for many other GNNs, such as GraphSage [8] and Diffpool [23]. Attention-based Graph Neural Network (AGNN) [22] aims to reduce the parameter size and computation by replacing the fully connected layers with specialized propagation layers. Graph Attention Network (GAT) [21] is a reference architecture for many other advanced GNNs with more edge properties, and it provides state-of-the-art accuracy on many GNN tasks. Details of their configurations are shown in Table I.
VI-A2 Datasets
We select two categories of graph datasets to cover the vast majority of GNN inputs. The first category includes the most typical datasets (Citeseer, Cora, and Pubmed) used by many GNN papers [11, 22, 8]. They are usually small in the number of nodes and edges but come with high-dimensional feature embeddings. The second category (Amazon-computer and Reddit) consists of graphs [13, 11] that are large in the number of nodes and edges. Details of these datasets are listed in Table II.
VI-B Overall Performance
In this section, we demonstrate the benefits of SGQuant by evaluating its accuracy loss and memory saving. As shown in Table III, our specialized quantization method effectively reduces the memory consumption of GCN, AGNN, and GAT by a large factor, while limiting the accuracy loss to a negligible level on average compared with the original full-precision models.
Moreover, there are several noteworthy observations. Across different datasets: on smaller datasets, such as Cora and Citeseer, our specialized quantization method can reduce the memory size more aggressively while maintaining accuracy by selecting relatively low average bits, such as 1.22 bits for GCN on Cora. This is because smaller datasets, with their limited numbers of nodes and edge connections, make the quantization precision loss less significant. Across different models: we find that, to maintain accuracy, SGQuant selects higher average bits for more complex models. For example, on the Amazon-computer dataset, the GAT model settles at 7.53 average bits, while AGNN and GCN settle at 4 bits and 3.29 bits, respectively. We observe a similar pattern on all the other datasets we evaluated. The major reason is that more complex GNN models involve more intricate computations that easily enlarge the accuracy loss of quantization and require higher bits to offset this loss. For instance, GAT has to first compute neighbor-specific attention values and scale them with the number of attention heads before the combination component. In contrast, AGNN and GCN have simpler combination components that require much less computation, keeping their quantization loss well under control even with lower bits.
It is also worth mentioning that on large datasets, such as Reddit, the absolute memory saving is significant, reducing the memory occupation by several hundred MB. This also demonstrates the potential of SGQuant to make GNNs practical on memory-constrained devices.
VI-C Breakdown Analysis of Multi-Granularity Quantization
In this experiment, we break down the benefits of multi-granularity quantization. Specifically, we apply GAT on Cora. We first evaluate the performance of uniform quantization (Uniform) and layer-wise quantization (LWQ). Then, we evaluate a finer granularity by combining LWQ with component-wise quantization (CWQ) and applying different quantization bits to individual components at each layer. For example, for GAT with 2 layers and 2 components (aggregation and combination) at each layer, the quantization configuration with LWQ+CWQ contains four quantization bits. We additionally impose topology-aware quantization (TAQ) to study the performance of SGQuant when considering LWQ, CWQ, and TAQ simultaneously.
Figure 7 shows the error rate of each quantization granularity at each memory size. Specifically, Uniform shows the highest error rate at each memory size. This error rate increases significantly when we compress the model below a certain size (2.5 MB). Compared with Uniform, LWQ achieves a lower error rate, thanks to the flexibility of selecting different bits for different layers. Moreover, we observe that LWQ+CWQ further mitigates the accuracy degradation when reducing the model memory footprint aggressively. The reason is that LWQ+CWQ takes the properties of different layers and different components into consideration, which strikes a good balance between memory saving and accuracy. Finally, this experiment also shows that, by incorporating the node information (degree) with LWQ+CWQ+TAQ, SGQuant can achieve an even lower error rate at each memory size. The major reason is that high-degree nodes intrinsically gather more information from their neighbors than nodes with a limited number of neighbors. In other words, applying more aggressive quantization on high-degree nodes causes only minor information loss.
| Quantization Method | Configuration @ MemSize = 2MB | Error Rate |
| --- | --- | --- |
| Uniform | | 18.90% |
| LWQ | | 18.60% |
| LWQ + CWQ | | 17.90% |
| LWQ + CWQ + TAQ | | 16.70% |
As a case study, Table IV shows the allocated bit-widths and error rates of GAT on Cora under different granularities, with the memory size around 2MB. We observe a similar trend as in Figure 7: fine-grained granularities generally lead to a lower error rate at a given memory size. One interesting observation is that LWQ achieves a lower error rate than Uniform, even though LWQ chooses a lower quantization bit than Uniform at layer 2. The insight is that a low quantization bit may introduce a regularization effect and prevent overfitting during training. Also, LWQ usually assigns higher bits to leading layers, as discussed in Section IV-C. For LWQ+CWQ, we assign smaller quantization bits to the attention component, since the attention component is more robust to numerical error in GNN quantization, as discussed in Section IV-A. The most fine-grained granularity is LWQ+CWQ+TAQ, which reduces the error rate by under the same memory size, compared with uniform quantization.
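The degree-based intuition behind TAQ can be sketched as follows. The helper name, percentile thresholds, and bit values are illustrative assumptions for exposition, not the exact rule used by SGQuant; the sketch only captures the direction of the policy (fewer bits for higher-degree nodes).

```python
import numpy as np

def assign_bits_by_degree(degrees, base_bits=8, min_bits=2):
    """Hypothetical TAQ rule: high-degree nodes gather redundant information
    from many neighbors, so they tolerate more aggressive quantization."""
    bits = np.full(len(degrees), base_bits, dtype=int)
    bits[degrees >= np.percentile(degrees, 75)] = min_bits + 2  # aggressive
    bits[degrees >= np.percentile(degrees, 90)] = min_bits      # most aggressive
    return bits

degrees = np.array([1, 2, 3, 50, 200])   # toy node degrees
bits = assign_bits_by_degree(degrees)    # -> [8, 8, 8, 4, 2]
```

Low-degree nodes keep the full base precision, while the hub nodes receive the smallest bit-widths, mirroring the observation above that quantizing high-degree nodes causes only minor information loss.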
VI-D Effectiveness of Auto-Bit Selection
In this experiment, we evaluate auto-bit selection (ABS) with the machine learning (ML) cost model. As discussed in Section V, we iteratively select and evaluate quantization configurations. Among these evaluated quantization configurations, we only keep those that show a negligible accuracy drop compared to the full-precision GNN, and we report their memory saving relative to the full-precision GNN. We compare our ABS with a random search approach, which randomly picks 200 quantization configurations and selects the one with the lowest memory size that also shows a negligible accuracy drop.
Figure 8 exhibits the results on AGNN and the Cora dataset; a similar trend can be observed on other GNNs and datasets. Overall, our ML cost model converges within 200 trials of quantization configurations and achieves two advantages over the random search approach. First, the ML cost model locates the appropriate quantization bits more swiftly than the naive random search solution. Second, for the final results, the ML cost model pinpoints a more "optimal" bit setting that offers higher memory saving (25) compared with random search (). The major reasons behind such success are twofold. First, we build our initial model on several key features (configuration parameters) of SGQuant, which effectively capture the core relation between their values and the final quantization performance (accuracy). Second, the ML cost model is iteratively updated as it sees more data samples, which helps it refine itself and make wiser selections. Besides, we observe similar performance between the ML cost model and random search in the first trials. The reason is that, starting with no training data, our ABS randomly samples and profiles (=40) configurations at the beginning, where we can expect performance similar to the random search approach.
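The profile-then-refit loop described above can be sketched as follows. A toy linear accuracy function stands in for real GNN profiling, and a least-squares fit stands in for SGQuant's actual tree-based cost model (Section V-A); all names, thresholds, and the candidate space are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def profile_accuracy(cfg):
    """Stand-in for profiling a real quantized GNN: accuracy drops by
    0.005 for every bit removed below 8 in any slot (toy linear model)."""
    return 0.85 - 0.005 * float((8 - np.asarray(cfg)).sum())

def memory_cost(cfg):
    return int(np.sum(cfg))  # proportional to total assigned bits

# All (layer, component) bit assignments for a 2-layer, 2-component GNN.
candidates = np.array(list(itertools.product(range(2, 9), repeat=4)))

# Warm-up: profile 40 random configurations to seed the cost model.
idx = rng.choice(len(candidates), size=40, replace=False)
X = candidates[idx]
y = np.array([profile_accuracy(c) for c in X])

best = None
for _ in range(5):  # a few model-guided refinement rounds
    # Fit a linear cost model acc ~ w . bits + b on the profiled data.
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    preds = np.hstack([candidates, np.ones((len(candidates), 1))]) @ w
    ok = candidates[preds >= 0.8399]        # predicted "negligible drop"
    pick = ok[np.argmin(ok.sum(axis=1))]    # cheapest predicted-safe config
    acc = profile_accuracy(pick)            # profile the chosen config for real
    X, y = np.vstack([X, pick]), np.append(y, acc)
    if acc >= 0.8399 and (best is None or memory_cost(pick) < memory_cost(best)):
        best = pick
```

Each round re-fits the model on the newly profiled configuration, so the search concentrates on low-memory configurations that the model still predicts to be accurate, mirroring how ABS outpaces pure random search after the warm-up phase.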
VII Conclusion
In this paper, we propose and implement a specialized GNN quantization scheme, SGQuant, to resolve the memory overhead of GNN computing. Specifically, our multi-granularity quantization incorporates layer-wise, component-wise, and topology-aware quantization granularities that can intelligently compress the GNN features while minimizing the accuracy drop. To efficiently select the most appropriate bits for all these quantization granularities, we further offer an ML-based automatic bit-selecting (ABS) strategy that minimizes the users' effort in design exploration. Rigorous experiments show that SGQuant can effectively reduce the memory size by up to with a negligible accuracy drop. In sum, SGQuant paves a promising way for GNN quantization and can facilitate the deployment of GNNs on resource-constrained devices.
References
 [1] (2019) Post training 4-bit quantization of convolutional networks for rapid deployment. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7948–7956. Cited by: §I, §II-B.
 [2] (2005) Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pp. 141–142. Cited by: §I.
 [3] (2018) Deep feature learning via structured graph Laplacian embedding for person re-identification. Pattern Recognition 82, pp. 94–104. Cited by: §I.
 [4] (2017) Learning graph representations with embedding propagation. In Advances in neural information processing systems (NIPS), pp. 5119–5130. Cited by: §I.
 [5] (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (ICLR), Cited by: §IIIB.
 [6] (2012) Graph embedding in vector spaces by node attribute statistics. Pattern Recognition 45 (9), pp. 3072–3083. Cited by: §I.
 [7] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM international conference on Knowledge discovery and data mining (SIGKDD), pp. 855–864. Cited by: §I.
 [8] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Cited by: §I, §VI-A1, §VI-A2.
 [9] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §I, §II-B.
 [10] (2010) Graph classification and clustering based on vector space embedding. Vol. 77, World Scientific. Cited by: §I.
 [11] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR). Cited by: §I, §II-A, §VI-A1, §VI-A2.
 [12] (2009) Learning spectral graph transformations for link prediction. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 561–568. Cited by: §I.
 [13] (2014) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: §II-A, §VI-A2.
 [14] (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning (ICML), pp. 2849–2858. Cited by: §I, §II-B.
 [15] (2011) Classification and regression trees. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1, pp. 14–23. Cited by: §V-A.
 [16] (2009) Nonnegative Laplacian embedding. In 2009 Ninth IEEE International Conference on Data Mining (ICDM), pp. 337–346. Cited by: §I.
 [17] (2011) Cauchy graph embedding. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 553–560. Cited by: §I.
 [18] (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), New York, NY, USA, pp. 701–710. External Links: ISBN 9781450329569, Link, Document Cited by: §I.
 [19] (2003) Mathematical statistics. Springer Texts in Statistics, Springer. External Links: ISBN 9780387953823, LCCN 98045794, Link Cited by: §IVB.
 [20] (2009) Towards timeaware link prediction in evolving social networks. In Proceedings of the 3rd workshop on social network mining and analysis, pp. 1–10. Cited by: §I.
 [21] (2018) Graph attention networks. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §I, §IIA, §VIA1.
 [22] (2019) How powerful are graph neural networks?. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §I, §VIA1, §VIA2.
 [23] (2018) Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), Red Hook, NY, USA, pp. 4805–4815. Cited by: §VI-A1.
 [24] (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §I, §II-B.