1 Introduction
Graph neural networks (GNNs) have recently emerged as the dominant paradigm for learning and analyzing non-Euclidean data, which contain rich node content information as well as topological relational information [7, 12, 55]. As such, a massive number of GNN architectures have been developed [21, 48, 57, 62, 65]. The success of GNNs has also triggered a great surge of interest in applying elaborate graph networks to tasks across many domains, such as object detection [11, 8], pose estimation [61], point cloud processing [22, 53, 38], and visual SLAM [42]. These GNN-based applications, in general, rely on cumbersome graph architectures to deliver gratifying results. For example, SuperGlue, a GNN-based feature matching approach, requires 12M network parameters to achieve state-of-the-art performance [42]. In practice, however, such applications typically require a compact and lightweight architecture for real-time interaction, especially in resource-constrained environments. In autonomous driving [31], for example, it is critical for GNN-based SLAM algorithms to respond quickly to complex traffic conditions, leading to an urgent need to compress cumbersome GNN models. The work of [60], as the first attempt, leverages knowledge distillation to learn a compact student GNN with fewer parameters. In spite of the improved efficiency, this approach still relies on expensive floating-point operations, not to mention a well-performing teacher model that must be pre-trained in the first place.


In this paper, we strive to take one step further towards ultra-lightweight GNNs. Our goal is to train a customized 1-bit GNN, as shown in Fig. 1, that allows for favorable memory efficiency while enjoying competitive performance. We start by developing a naïve GNN binarization framework, achieved by converting 32-bit features and parameters into 1-bit ones and then leveraging the straight-through estimator to optimize the binarized model. The derived vanilla binarized GNN enjoys favorable memory efficiency; however, its performance is not as encouraging as expected. By parsing its underlying process, we identified that the binarization yields limited expressive power, making the model incapable of distinguishing different graph topologies. An illustrative example is shown in Fig. 2(a), where a mean aggregator, which is commonly adopted by full-precision GNNs, produces identical aggregation results for two distinct graph topologies with binarized features, thereby leading to inferior performance.
Inspired by this discovery, we introduce into the proposed GNN binarization framework a learnable and adaptive neighborhood aggregator, so as to alleviate the aforementioned dilemma and enhance the distinguishability of 1-bit graphs. Unlike existing GNNs that rely on a pre-defined and fixed aggregator, our meta neighborhood aggregators enable dynamically selecting (Fig. 2(b)) or generating (Fig. 2(c)) customized input- and layer-specific aggregation schemes. As such, we explicitly account for the customized characteristics of binarized graph features, and further strengthen the discriminative power for handling topological structures.
Towards this end, we propose two variants of meta aggregators: an exclusive meta aggregator, termed the Greedy Gumbel Neighborhood Aggregator (GNA), that adaptively selects an optimal aggregator in a learnable manner, and a diffused meta aggregator, termed the Adaptable Hybrid Neighborhood Aggregator (ANA), that either approximates a single aggregator or dynamically generates a hybrid aggregation behavior. Specifically, GNA incorporates the discrete decisions over the candidate aggregators, conditioned on the individual graph features, into the gradient descent process by leveraging greedy Gumbel sampling. Inevitably, the performance of GNA is bottlenecked by the individual aggregators in the candidate pool. We thus further devise ANA, which generates a hybrid aggregator dynamically based on the input 1-bit graphs. ANA simultaneously preserves the strengths of multiple individual aggregators, leading to favorable competence in handling the challenging 1-bit graph features. Moreover, the proposed GNA and ANA can be readily integrated as portable modules into general full-precision GNN models to enhance their expressive capability.
In sum, our contribution is a novel GNN-customized binarization framework that generates a 1-bit lightweight GNN model with competitive performance, making it competent for resource-constrained applications such as edge computing. This is specifically achieved through an adaptive meta aggregation scheme to accommodate the challenging quantized graph features. We evaluate the proposed customized framework on several large-scale benchmarks across different domains and graph tasks. Experimental results demonstrate that the proposed meta aggregators achieve results superior to the state-of-the-art, not only on the devised 1-bit binarized GNN models, but also on the general full-precision models.
2 Related Work
Graph Neural Networks. The concept of graph neural networks was proposed in [43], which generalized existing neural networks to handle graph data represented in the non-Euclidean domain. Over the past few years, graph neural networks have achieved unprecedented advances with various approaches being developed [21, 19, 65, 7, 59, 58, 26, 28, 50, 13, 25, 33, 27]. For example, graph attention network in [48] introduces a novel attention mechanism for efficient graph processing. GraphSAGE [10], on the other hand, addresses the scalability issues on large-scale graphs by sampling and aggregating feature representations from local neighborhoods.
The success of GNNs has also boosted the applications of graph networks in a wide range of problem domains [65], including semantic segmentation [54, 22, 38, 36], object detection [11, 8], pose estimation [61], interaction detection [37, 17], and visual SLAM [42]. Specifically, Wang et al. [54] propose a dynamic graph convolutional model for point cloud classification and semantic segmentation, which combines the advantages of PointNet [35] and the graph convolutional network [21]. Despite the encouraging performance, there is a lack of research on compressing cumbersome GNN models, which is critical for deployment in resource-constrained environments such as mobile terminals. Upon publication, we also noticed two related works [24, 34] that focus on generalized aggregation functions. However, our work is conceptually very different from [24, 34]: in fact, our work is the first dedicated study on functionals of GNNs, dealing with functions of functions; in other words, we focus on learning aggregators of aggregators, where the inputs are themselves aggregators. This has not yet been explored in prior works.
Network Binarization. In the field of model compression [64, 44, 45, 5, 46], network binarization techniques aim to save memory and accelerate network inference by binarizing network parameters and then utilizing bitwise operations [14, 15, 4]. In recent years, various CNN binarization methods have been proposed, which can be categorized into direct binarization [6, 14, 15, 20] and optimization-based binarization [40, 4, 30]. Specifically, direct binarization quantizes the weights and activations to 1 bit with a pre-defined binarization function. In contrast, optimization-based binarization introduces scaling factors for the binarized parameters to improve the representation ability, but inevitably leads to inferior efficiency.
Driven by the success of the aforementioned binarization techniques in the CNN domain, in this paper we propose a GNN-specific binarization method. Specifically, we primarily focus on direct binarization for GNNs, since our goal is to develop super-lightweight GNN models. We also notice three concurrent works [49, 51, 1] that likewise aim to accelerate the forward process of GNN models. However, [49, 51] directly apply CNN-based binarization techniques without considering the characteristics of GNNs, and will in fact serve as the baseline method in our experiments. The other work [1] only focuses on improving the efficiency of the dynamic graph convolutional model [54], by speeding up the dynamic construction of k-nearest-neighbor graphs in the Hamming space. Unlike [49, 51, 1], we aim to devise a more general GNN-specific binarization framework that is applicable to most existing GNN models.
3 Vanilla Binary GNN and Pre-analysis
In this section, we first develop a vanilla binary GNN framework by simply binarizing model parameters and activations. We then show the limitations of this vanilla binary GNN by looking into its internal message aggregation process, and accordingly develop two possible solutions to address these limitations. Finally, building upon these solutions, we introduce the idea of the proposed customized GNN binarization framework with meta aggregators.
Formulation of GNN Models.
GNNs leverage graph topologies and node/edge features to learn a representation vector of a node, an edge, or the entire graph. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a directed/undirected graph with nodes $\mathcal{V}$ and edges $\mathcal{E}$, where $\mathcal{N}(v)$ is the set of neighboring nodes of node $v$. Each node $v$ has an associated node feature $x_v$. For example, in the task of 3D object classification, $x_v$ can be set as the 3D coordinates. Existing GNNs follow an iterative neighborhood aggregation scheme at each GNN layer, where each node iteratively gathers features from its neighboring nodes to capture the structural information [23, 57]. Let $h_v^{(l)}$ denote the feature vector of node $v$ at layer $l$. The corresponding updated feature vector in a GNN can then be formulated as:

$$h_v^{(l+1)} = f^{(l)}\left(h_v^{(l)},\ \left\{h_u^{(l)} : u \in \mathcal{N}(v)\right\}\right), \tag{1}$$

where $\{h_u^{(l)} : u \in \mathcal{N}(v)\}$ represents the features associated with the neighboring nodes, and $f^{(l)}$ is a mapping function that takes $h_v^{(l)}$ as well as the neighboring features as inputs. The choice of the mapping $f^{(l)}$ corresponds to different architectures of GNNs.
For the sake of simplicity, we take the graph convolutional network (GCN) proposed by Kipf and Welling [21] as an example GNN architecture for the following discussions. We denote Mean as the mean aggregator that computes an average of the incoming messages, and $W^{(l)}$ as the learnable weight matrix for feature transformation at layer $l$. The general GNN form in Eq. 1 can then be instantiated for GCN as:

$$h_v^{(l+1)} = W^{(l)} \cdot \mathrm{Mean}\left\{h_u^{(l)} : u \in \mathcal{N}(v)\right\} \quad \text{or} \quad h_v^{(l+1)} = \mathrm{Mean}\left\{W^{(l)} h_u^{(l)} : u \in \mathcal{N}(v)\right\},$$

which respectively correspond to the case where aggregation comes first or comes after the feature transformation step [52].
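To make the two orderings concrete, below is a minimal sketch of such a mean-aggregation layer, assuming dense PyTorch tensors and a simple neighbor-list representation; the class and argument names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """Mean-aggregation GCN layer: `aggregate_first` toggles between the
    aggregate-then-transform and transform-then-aggregate forms of Eq. 1."""

    def __init__(self, in_dim, out_dim, aggregate_first=True):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.aggregate_first = aggregate_first

    def forward(self, h, neighbors):
        # h: [N, in_dim] node features; neighbors: list of neighbor-index tensors, one per node
        if self.aggregate_first:
            agg = torch.stack([h[nb].mean(dim=0) for nb in neighbors])   # Mean first,
            return self.W(agg)                                           # then transform
        msgs = self.W(h)                                                 # transform first,
        return torch.stack([msgs[nb].mean(dim=0) for nb in neighbors])   # then Mean
```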
Vanilla 1-bit GNN Models.
We develop a naïve binarized GNN framework to compress cumbersome GNN models, by directly binarizing 32-bit input features and learnable weights in the feature transformation step into 1-bit ones.
Specifically, for the case of the vanilla binary GCN, the binarization in the forward propagation process can be modeled as:

$$w_b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \ge 0, \\ -1, & \text{otherwise}, \end{cases} \tag{2}$$

where $w$ represents an element in the learnable weight matrix $W^{(l)}$. We also binarize the graph features in the same manner, by replacing $w$ in Eq. 2 with the feature element $h$.
During backward propagation, it is not feasible to simply apply the back-propagation (BP) algorithm [41], as most full-precision models do, to optimize binarized graph networks, due to the non-differentiable binarization function, i.e., sign in Eq. 2. The derivative of the sign function is 0 almost everywhere, resulting in the vanishing gradient problem. To alleviate this dilemma, we leverage the Straight-Through Estimator (STE) [2] for the backward propagation process in the binarized graph nets, formulated as:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial w_b} \cdot \mathbf{1}_{|w| \le 1}, \tag{3}$$

where $\mathcal{L}$ represents the loss function. Essentially, Eq. 3 can be considered as propagating the gradient through a hard tanh function, defined as $\mathrm{Htanh}(x) = \max(-1, \min(1, x))$. We illustrate in Fig. 3 the computational workflow at an example binarized GCN layer for the case where the aggregation comes after the feature transformation. A similar scheme can be observed for the GCN model where the aggregation happens first. With compact node features and net weights, the binarized GCN only relies on 1-bit XNOR and bit-count operations for graph-based processing, leading to an efficient and lightweight graph model that is competent for edge computing.
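As a concrete illustration of the sign-based binarization (Eq. 2) and the STE gradient rule (Eq. 3), here is a minimal PyTorch sketch; the class and function names are ours, and details such as scaling factors are deliberately omitted.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; straight-through (hard-tanh) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        out = torch.sign(x)
        out[out == 0] = 1.0              # map zeros to +1 so outputs are strictly {-1, +1}
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient through only where |x| <= 1 (derivative of hard tanh), as in Eq. 3
        return grad_output * (x.abs() <= 1).float()

def binary_gcn_message(h, W):
    """Transform step of a binarized GCN layer: 1-bit features times 1-bit weights."""
    hb = BinarizeSTE.apply(h)            # binarized node features
    Wb = BinarizeSTE.apply(W)            # binarized weight matrix
    return hb @ Wb                       # {-1,+1} matmul, deployable as XNOR + bit-count
```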

Challenges and Possible Solutions.
Despite the compact binarized parameters and features, we empirically observed that the results of the developed vanilla binary GNN were not as promising as expected. Specifically, we conduct a preliminary experiment on the ZINC dataset [16] with the GCN architecture in [7]. Averaged over 25 independent runs, the full-precision GCN model achieves a performance of 0.407±0.018 in terms of the mean absolute error (MAE), whereas the vanilla binarized GCN yields 0.669±0.070 MAE, which is far behind that of the full-precision one.
We explore the reason behind this unsatisfactory performance by looking into the internal computational process of binarized GNNs. Specifically, we look back at Fig. 3, which shows the example workflow at a binarized GCN layer where the feature transformation is performed before the aggregation step. Notably, the result of the 1-bit operations lies in the discrete integer domain. The resulting feature space is thereby much smaller than that of 32-bit floating-point operations. In other words, the outputs of 1-bit operations are less distinguishable from each other. This property, when it appears in the graph domain, makes it difficult to extract and discriminate graph topologies in the neighborhood aggregation process, which in fact is the key to the success of graph networks.
To further illustrate this dilemma, we show a couple of examples in Fig. 4, covering both the max and mean aggregation schemes that are commonly leveraged in GNNs. Fig. 4(a) shows the aggregation results of a 32-bit GNN layer, where both the max and mean aggregators successfully distinguish the two different topological structures. However, for the aggregation results of discrete integer features in binarized GNNs (Fig. 4(b)), neither the max nor the mean aggregator can discriminate the corresponding two graph structures. Moreover, the situation becomes even more challenging for the case where the aggregation happens before the transformation, since the features fed into the aggregator are limited to only $+1$ or $-1$.
Nevertheless, from Fig. 4(b), we also found that, by combining different aggregation schemes, various graph topologies can in fact become distinguishable. This observation motivates us to develop possible solutions to alleviate the aforementioned dilemma in vanilla binarized GNNs. Specifically, we propose a couple of straightforward mixed multi-aggregators that combine the benefits of various aggregation schemes in two different ways. The first one performs message aggregation multiple times with several different aggregators and then sums the aggregation results, leading to a performance of 0.599±0.017 MAE with five standard aggregators. The second one, on the other hand, concatenates the results from several independent aggregators, achieving an average result of 0.614±0.045 over 25 runs.
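As a hypothetical numeric illustration of this observation (the integer feature values below are made up and are not taken from Fig. 4), two neighborhoods that a single mean or max aggregator cannot tell apart become distinguishable once several aggregators are combined, either by summation or by concatenation:

```python
import torch

# Two hypothetical neighborhoods of integer-valued features after 1-bit operations
nbr_a = torch.tensor([2., 0., -2.])    # mean 0, max 2, min -2
nbr_b = torch.tensor([2., -1., -1.])   # mean 0, max 2, min -1

for name, nbr in [("A", nbr_a), ("B", nbr_b)]:
    aggs = torch.stack([nbr.mean(), nbr.max(), nbr.min()])
    # mean and max alone are identical for A and B; min differs, so both the
    # summed and the concatenated multi-aggregator outputs separate the two graphs
    print(name, "sum:", float(aggs.sum()), "concat:", aggs.tolist())
```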
In spite of the improved performance, these possible solutions need to perform feature aggregation multiple times at each GNN layer, resulting in a heavy computational burden. Motivated by this limitation, we introduce the proposed meta neighborhood aggregators, which aim to enhance the discriminative capability for topological structures while enjoying efficient computations.


4 Meta Neighborhood Aggregation
4.1 Overview
Towards addressing the aforementioned limitations of the devised mixed multi-aggregators, we introduce in this section the proposed concept of the Meta Aggregator, which aims to adaptively and efficiently adjust the way information is aggregated in a learnable manner. To this end, we propose two specific forms of meta aggregators, i.e., the exclusive meta aggregation method and the diffused meta aggregation method, as illustrated in Fig. 5.
The exclusive form, termed the Greedy Gumbel Neighborhood Aggregator (GNA), learns to determine a single optimal aggregation scheme from a pool of candidate aggregators, according to the individual characteristics of the quantized graph features, as shown in the upper part of Fig. 5. The diffused form, on the other hand, adaptively learns a customized aggregation formulation that can potentially incorporate the properties of several independent aggregators, and is thereby termed the Adaptable Hybrid Neighborhood Aggregator (ANA), shown in the lower part of Fig. 5.
In what follows, we detail the two devised forms of meta neighborhood aggregation, i.e., GNA and ANA, as well as the associated training strategy.
Table 1: Ablation results on the ZINC graph regression task with GAT and GCN backbones, comparing full-precision, vanilla 1-bit, and the proposed GNA/ANA 1-bit models (averaged over 25 runs).

| Methods | Full (GAT) [48] | Vanilla (GAT) [14] | GNA (GAT) | ANA (GAT) | Full (GCN) [21] | Vanilla (GCN) [14] | GNA (GCN) | ANA (GCN) |
|---|---|---|---|---|---|---|---|---|
| Bit-width | 32/32 | 1/1 | 1/1 | 1/1 | 32/32 | 1/1 | 1/1 | 1/1 |
| Param Size | 399.941 KB | 81.7070 KB | 82.0610 KB | 81.8799 KB | 402.645 KB | 82.2002 KB | 82.5566 KB | 82.3740 KB |
| Test MAE±SD | 0.476±0.006 | 0.670±0.064 | 0.592±0.013 | 0.566±0.012 | 0.407±0.018 | 0.669±0.070 | 0.608±0.024 | 0.607±0.020 |
| Train MAE±SD | 0.300±0.024 | 0.610±0.066 | 0.531±0.013 | 0.453±0.019 | 0.303±0.026 | 0.624±0.069 | 0.558±0.027 | 0.564±0.021 |
| p-value | GNA vs. Vanilla: 3.01010; ANA vs. Vanilla: 2.35910 (GAT) | | | | GNA vs. Vanilla: 1.59710; ANA vs. Vanilla: 9.78710 (GCN) | | | |
4.2 Greedy Gumbel Aggregator
Motivated by the observation from Fig. 4, where different single aggregators each work well for a particular set of cases, as explained in Sect. 3, we propose the idea of adaptively determining the optimal aggregator depending on the specific input graphs, as depicted in the upper part of Fig. 5.
To this end, there are a few challenges to be addressed. First, the aggregation selector should understand the underlying characteristics of various input graphs without introducing much additional computational cost. To address this issue, we propose to leverage a 1-bit graph auto-encoder to extract meaningful information from input graphs, which is then exploited to guide the decision of different aggregation methods.
The second challenge is how to incorporate the discrete selections into the gradient descent process when training GNNs. One straightforward solution would be to model the discrete determination process as a state classification problem and to treat the various aggregators in the candidate pool as different labels. However, this naïve attempt does not account for the uncertainty of the selector, which is likely to cause the model collapse problem where the output choice is independent of the input graphs, such as always or never picking a specific aggregator.
To alleviate this dilemma, we propose to impose stochasticity on the aggregator decision process with greedy Gumbel sampling [29, 47], and to propagate gradients through the stochastic neurons via the continuous form of the Gumbel-Max trick [18]. Specifically, we introduce such stochasticity by greedily sampling noise from the Gumbel distribution, exploiting the Gumbel-Max trick [9]: in terms of Gumbel random variables, the Gumbel-Max trick can be utilized to parameterize discrete distributions. However, the Gumbel-Max trick contains an argmax operation, which is not differentiable. We thereby resort to its continuous relaxation, termed the Gumbel-softmax estimator, which replaces the non-differentiable argmax with a softmax function.
With the aforementioned graph auto-encoder and the Gumbel-softmax estimator addressing the two challenges, respectively, the proposed greedy Gumbel aggregator (GNA) for node $v$ can then be formulated as:

$$z_{v,i}^{(l)} = \frac{\exp\!\left(\left(\mathcal{E}^{(l)}(\mathcal{G}_v)_i + g_i\right)/\tau\right)}{\sum_{j} \exp\!\left(\left(\mathcal{E}^{(l)}(\mathcal{G}_v)_j + g_j\right)/\tau\right)}, \tag{4}$$

where $\mathcal{E}^{(l)}$ represents the binarized graph auto-encoder at layer $l$ that extracts principal and meaningful information, and $g$ denotes the sampled Gumbel random noise. $\mathcal{G}_v$ is the input subgraph with one centered node $v$ and the set of its neighboring nodes $u$ with connections $(u, v) \in \mathcal{E}$. $\tau$ is a constant that denotes the temperature of the softmax. $z_v^{(l)}$ is the output vector, which approaches a one-hot vector for small $\tau$ and indicates the aggregator decision at node $v$ and layer $l$ from a pool of candidate aggregators such as mean and max.
In this way, the proposed greedy Gumbel aggregator adaptively decides the optimal aggregator conditioned on each specific node and layer in a learnable manner, which can significantly improve the topological discriminative capability of the vanilla binary GNN model.
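A minimal sketch of this selection step is given below, assuming `torch.nn.functional.gumbel_softmax` for the relaxed one-hot sampling and a per-node logit vector produced by the 1-bit encoder; the function names, the candidate pool, and the encoder interface are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def gna_aggregate(h, neighbors, logits, tau=1.0):
    """Greedy Gumbel aggregation: pick one candidate aggregator per node.

    h:         [N, D] (binarized) node features
    neighbors: list of neighbor-index tensors, one per node
    logits:    [N, 3] per-node scores over candidate aggregators (mean, max, min),
               e.g. produced by the 1-bit graph auto-encoder
    """
    # Hard one-hot decision in the forward pass, soft Gumbel-softmax gradients in the backward pass
    z = F.gumbel_softmax(logits, tau=tau, hard=True)             # [N, 3]
    out = []
    for v, nb in enumerate(neighbors):
        msgs = h[nb]                                             # [deg(v), D]
        candidates = torch.stack([msgs.mean(dim=0),
                                  msgs.max(dim=0).values,
                                  msgs.min(dim=0).values])       # [3, D]
        out.append(z[v] @ candidates)                            # select the chosen aggregator
    return torch.stack(out)
```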
4.3 Adaptable Hybrid Aggregator
Despite the improved representational ability, the performance of the greedy Gumbel aggregator is bottlenecked by that of the existing standard aggregators, which leaves room for further improvement. Motivated by this observation, we further devise an adaptable hybrid neighborhood aggregator (ANA) that can generate a hybrid form of the several standard aggregators in a learnable manner, thereby simultaneously retaining the advantages of different aggregators. The overall computational pipeline of ANA is demonstrated in the lower part of Fig. 5.
We start by giving the graph-based mathematical formulation for the diffused message aggregation, defined as follows:

$$\mathcal{A}_v^{(l)} = \frac{1}{\beta_v^{(l)}}\,\log\!\left(\frac{1}{d_v}\sum_{u:\,(u,v)\in\mathcal{E}} \exp\!\left(\beta_v^{(l)}\, h_u^{(l)}\right)\right), \qquad \beta_v^{(l)} = \mathcal{E}^{(l)}(\mathcal{G}_v), \tag{5}$$

where $d_v$ is the in-degree of node $v$ and $\mathcal{G}_v$ is the graph sample with edges $(u, v) \in \mathcal{E}$. We use $\mathcal{E}^{(l)}$ to denote the 1-bit graph auto-encoder at layer $l$, as also used in Eq. 4. $h_u^{(l)}$ represents the feature vector of the neighboring node $u$ at layer $l$, whereas $\mathcal{A}_v^{(l)}$ is the obtained diffused aggregation result.
Eq. 5 can essentially approximate the max and mean functions, depending on the output $\beta_v^{(l)}$ of the graph auto-encoder. Specifically, a higher $\beta_v^{(l)}$ leads to a behavior similar to that of the max aggregator, while smaller values of $\beta_v^{(l)}$ generate the effect of mean neighborhood aggregation. A detailed mathematical proof is provided in the supplementary material.
By slightly changing the form of Eq. 5, we can also approximate other aggregators. For example, by simply adding a minus sign to the input graph features, Eq. 5 can approach the behavior of the min aggregation. Also, by utilizing the fact that $\mathrm{Var}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$, the variance aggregator can be approximated by adding square operations to Eq. 5. More detailed derivations and mathematical proofs can be found in the supplement. Furthermore, it is also possible to simultaneously combine the benefits of all these approximated aggregators, by summing multiple terms of the form of Eq. 5 with graph-based learnable weighting factors that adaptively control the diffused degree of the various aggregator approximations. We provide the corresponding formulation and more detailed explanations in the supplementary material.
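Below is a minimal sketch of such a diffused aggregator in the spirit of Eq. 5, assuming a per-node scalar temperature β predicted by the encoder; the helper names and the clamping detail are our own illustrative choices.

```python
import math
import torch

def ana_aggregate(h, neighbors, beta, eps=1e-6):
    """Diffused aggregation (Eq. 5): log-sum-exp with a learnable per-node temperature.

    h:         [N, D] node features
    neighbors: list of neighbor-index tensors, one per node
    beta:      [N] per-node temperatures from the 1-bit graph auto-encoder
               (large beta ~ max-like aggregation, small beta ~ mean-like aggregation)
    """
    out = []
    for v, nb in enumerate(neighbors):
        msgs = h[nb]                                   # [deg(v), D]
        b = beta[v].clamp(min=eps)                     # keep the temperature strictly positive
        lse = torch.logsumexp(b * msgs, dim=0)         # log sum_u exp(beta * h_u), per feature dim
        out.append((lse - math.log(len(nb))) / b)      # (1/beta) * log((1/d_v) * sum exp(beta * h_u))
    return torch.stack(out)
```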
4.4 Training Strategy
We also propose a training strategy tailored to the proposed method. As a whole, the principal operations for training a 1-bit GNN model with the proposed meta neighborhood aggregation approaches are summarized in Alg. 1. For the sake of clarity, we omit the bias terms in our illustration, which behave similarly to the GNN weights. Also, we take the case where the feature transformation happens before the aggregation step as an example to illustrate the overall workflow.
As can be observed from Alg. 1, at each layer the input graph is fed into the lightweight 1-bit graph auto-encoder to extract useful information that benefits the subsequent meta aggregators. Following this graph encoding process, the meta neighborhood aggregation module receives the encoded features and either exclusively determines an optimal aggregator, or produces a diffused aggregator that amalgamates the behaviors of several independent aggregators. The desired 1-bit GNN model is eventually obtained by optimizing the model over the training epochs with the straight-through estimator, as explained in Sect. 3.
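A high-level training sketch mirroring this description might look as follows; the optimizer, loss, and module interfaces are illustrative assumptions rather than the exact contents of Alg. 1.

```python
import torch

def train_binary_gnn(model, loader, criterion, epochs=100, lr=1e-3):
    """Train a 1-bit GNN whose layers binarize weights/features with the STE and
    aggregate neighborhoods with a meta aggregator (GNA or ANA)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for graphs, targets in loader:
            optimizer.zero_grad()
            # Per layer inside the model: the 1-bit auto-encoder encodes the input graph,
            # the meta aggregator selects/generates an aggregation scheme, and the binarized
            # transform + aggregation produce the next-layer features.
            preds = model(graphs)
            loss = criterion(preds, targets)
            loss.backward()      # gradients flow through sign() via the straight-through estimator
            optimizer.step()     # the full-precision latent weights are updated
    return model
```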
5 Experiments
In this section, we perform extensive experiments on three publicly available benchmarks across diverse problem domains, including graph regression, node classification, and 3D object recognition. Following the evaluations, we further provide detailed discussions on the strengths and weaknesses of the devised meta aggregators.
Table 2: Graph regression results of full-precision GNNs on the ZINC dataset; the last two rows replace the pre-defined aggregator in GAT with the proposed GNA and ANA.

| Methods | Param Size | Test MAE±SD | Train MAE±SD |
|---|---|---|---|
| GatedGCN [3] | 413.027 KB | 0.426±0.012 | 0.272±0.023 |
| GraphSage [10] | 371.004 KB | 0.475±0.007 | 0.296±0.030 |
| GIN [57] | 402.652 KB | 0.387±0.019 | 0.319±0.020 |
| MoNet [32] | 414.070 KB | 0.386±0.009 | 0.299±0.016 |
| GCN [21] | 402.645 KB | 0.407±0.018 | 0.303±0.026 |
| GAT [48] | 399.941 KB | 0.476±0.006 | 0.300±0.024 |
| GNA (Ours) | 411.270 KB | 0.337±0.021 | 0.160±0.026 |
| ANA (Ours) | 404.504 KB | 0.325±0.015 | 0.109±0.014 |
5.1 Experimental Settings
Datasets. We validate the effectiveness of the proposed meta aggregation methods on three different datasets, each of which specializes in a distinct task. Specifically, for the task of graph regression, we use the ZINC dataset [16], which is one of the most popular real-world molecular datasets [7]. The goal of ZINC is to regress a specific molecular property, the constrained solubility, which is a critical property for developing GNNs for molecules [63].
Also, for the node classification task, we adopt the protein-protein interaction (PPI) dataset [66], which is a multi-label dataset with 24 graphs corresponding to different human tissues. Each node in the PPI dataset is labeled with various protein functions. The objective of PPI is thereby to predict the 121 protein functions from the interactions of human tissue proteins. Furthermore, we utilize ModelNet40 [56] for the evaluation on the task of 3D object classification. ModelNet40 is a popular dataset for 3D object analysis [35, 36], containing 12,311 meshed CAD models from 40 shape categories in total. Each object comprises a set of 3D points, with the 3D coordinates as the features. The goal is to predict the category of each 3D shape.
Implementation Details.
We primarily use three heterogeneous architectures, including graph convolutional network (GCN) [21], graph attention network (GAT) [48], as well as dynamic graph convolutional model (DGCNN) [54] to evaluate the proposed meta aggregation approach. For other settings such as learning rates and batch size, we follow those in the works of [7], [48], and [54] for the tasks of graph regression, node classification, and point cloud classification, respectively.
In particular, for more convincing evaluations, we report the results on the ZINC dataset over 25 independent runs with 25 different random seeds. Also, as done in the field of CNN binarization [39], we keep the first and the last GNN layer at full precision and binarize the other GNN layers for all the compared methods. More detailed task-by-task architecture designs as well as the hyperparameter settings can be found in the supplementary material.
Table 3: Node classification results on the PPI dataset with the GAT architecture.

| Methods | Bit-width | Param Size | Score |
|---|---|---|---|
| Full Prec. [48] | 32/32 | 43.7712 MB | 98.70 |
| Vanilla [14] | 1/1 | 28.2560 MB | 92.68 |
| GNA (Ours) | 1/1 | 28.2572 MB | 97.52 |
| ANA (Ours) | 1/1 | 28.2565 MB | 97.71 |

5.2 Results
Graph Regression. Tab. 1 shows the ablation results of the vanilla 1-bit GNN models and those of GNNs with the proposed meta neighborhood aggregators GNA and ANA. Specifically, we report the results on two GNN architectures, i.e., GCN [21] and GAT [48], averaged over 25 independent runs with 25 seeds.
The proposed GNA and ANA, as shown in Tab. 1, achieve gratifying performance in terms of both test and train MAE, while maintaining a compact model size. Moreover, we provide in the last row of Tab. 1 the p-values of the paired t-test between the 1-bit GNNs with a fixed aggregator (Vanilla) and those with the proposed learnable meta aggregators. The corresponding results statistically validate the effectiveness of the proposed method.
Furthermore, we show in Tab. 2 the results of extending the proposed meta aggregators to full-precision GNNs and compare them with those of the state-of-the-art approaches [3, 10, 57, 32, 21, 48]. Specifically, the results in the last two rows of Tab. 2 are obtained by simply replacing the pre-defined aggregator in GAT with the proposed GNA and ANA. As can be observed from Tab. 2, the proposed method outperforms other approaches by a large margin, and meanwhile introduces few additional parameters.
Node Classification. In Tab. 3, we show the results of different methods with the GAT architecture. The proposed GNA and ANA, as shown in Tab. 3, yield results on par with those of the 32-bit full-precision model, but come with a much more lightweight architecture. The proposed method also outperforms the vanilla 1-bit GNN model that relies on a fixed aggregation scheme.
Table 4: 3D object recognition results on the ModelNet40 dataset.

| Methods | Bit-width | Param Size | Acc (%) | mAcc (%) |
|---|---|---|---|---|
| Full Prec. [54] | 32/32 | 1681.66 KB | 92.42 | 89.51 |
| Vanilla [14] | 1/1 | 1091.20 KB | 74.19 | 65.95 |
| GNA (Ours) | 1/1 | 1091.30 KB | 78.36 | 71.67 |
| ANA (Ours) | 1/1 | 1091.30 KB | 84.64 | 78.89 |
3D Object Recognition.
The results of the proposed approach and other methods on the ModelNet40 dataset are shown in Tab. 4. We build our network here based on the architecture designed in [60]. We also show in Fig. 6 the corresponding visualization results of different approaches, where the column termed “Fixed Aggr.” in Fig. 6 corresponds to the “Vanilla” model in Tab. 4. With the proposed meta aggregation schemes, the 1-bit GNN model gains a boost of more than 10% in both the overall accuracy and the mean class accuracy. This improvement is also illustrated in Fig. 6, where the proposed meta aggregators help the 1-bit GNN learn a feature structure closer to that of the full-precision GNN model.
5.3 Discussions
We provide here a detailed account of the strengths and weaknesses of the proposed two meta aggregators, GNA and ANA. For the exclusive form GNA, the performance can potentially be further enhanced with the advance of novel aggregation schemes. In other words, the results of GNA depend on those of every single aggregator in the candidate pool, which at the same time is a weakness of GNA, since its performance is bottlenecked by those of the individual aggregators. The diffused form ANA, on the other hand, may simultaneously retain the benefits of several popular aggregators. However, the mathematical form of Eq. 5 limits the types of aggregators that ANA can potentially approximate, meaning that ANA may not have much room for further improvement even with the emergence of novel and prevailing aggregators in the future.
6 Conclusions
In this paper, we propose a couple of learnable aggregation schemes for 1-bit compact GNNs. The goal of the proposed method is to enhance the topological discriminative ability of the 1-bit GNNs. This is achieved by adaptively selecting a single aggregator, or generating a hybrid aggregation form that can simultaneously maintain the strengths of several aggregators. Moreover, the proposed meta aggregation schemes can be readily extended to the full-precision GNN models. Experiments across various domains demonstrate that, with the proposed meta aggregators, the 1-bit GNN yields results on par with those of the cumbersome full-precision ones. In our future work, we will strive to generalize the proposed aggregator to compact and lightweight visual transformers.
Acknowledgements.
Mr Yongcheng Jing is supported by ARC FL-170100117. Xinchao Wang is supported by the Start-up Fund of National University of Singapore.
References
- [1] Mehdi Bahri, Gaétan Bahl, and Stefanos Zafeiriou. Binary graph neural networks. arXiv preprint arXiv:2012.15823, 2020.
- [2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [3] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
- [4] Adrian Bulat and Georgios Tzimiropoulos. Xnor-net++: Improved binary neural networks. In BMVC, 2019.
- [5] Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. Addernet: Do we really need multiplications in deep learning? In CVPR, 2020.
- [6] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.
- [7] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
- [8] Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, and Jifeng Dai. Learning region features for object detection. In ECCV, 2018.
- [9] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures. US Government Printing Office, 1954.
- [10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
- [11] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.
- [12] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
- [13] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. In NeurIPS, 2018.
- [14] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NeurIPS, 2016.
- [15] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
- [16] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 2012.
- [17] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In CVPR, 2016.
- [18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- [19] Yongcheng Jing, Yiding Yang, Xinchao Wang, Mingli Song, and Dacheng Tao. Amalgamating knowledge from heterogeneous graph neural networks. In CVPR, 2021.
- [20] Minje Kim and Paris Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016.
- [21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- [22] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, 2018.
- [23] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In ICCV, 2019.
- [24] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. arXiv preprint arXiv:2006.07739, 2020.
- [25] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
- [26] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In AAAI, 2018.
- [27] Huihui Liu, Yiding Yang, and Xinchao Wang. Overcoming catastrophic forgetting in graph neural networks. In AAAI, 2021.
- [28] Jianxin Ma, Peng Cui, Kun Kuang, Xin Wang, and Wenwu Zhu. Disentangled graph convolutional networks. In ICML, 2019.
- [29] Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In NeurIPS, 2014.
- [30] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In ICLR, 2020.
- [31] Stefan Milz, Georg Arbeiter, Christian Witt, Bassam Abdallah, and Senthil Yogamani. Visual slam for automated driving: Exploring the applications of deep learning. In CVPR Workshop, 2018.
- [32] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, 2017.
- [33] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.
- [34] Giovanni Pellegrini, Alessandro Tibo, Paolo Frasconi, Andrea Passerini, and Manfred Jaeger. Learning aggregation functions. In IJCAI, 2021.
- [35] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
- [36] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
- [37] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
- [38] Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 3d graph neural networks for rgbd semantic segmentation. In ICCV, 2017.
- [39] Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, and Nicu Sebe. Binary neural networks: A survey. Pattern Recognition, 2020.
- [40] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
- [41] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 1986.
- [42] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
- [43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. TNN, 2008.
- [44] Chengchao Shen, Xinchao Wang, Jie Song, Li Sun, and Mingli Song. Amalgamating knowledge towards comprehensive classification. In AAAI, 2019.
- [45] Chengchao Shen, Xinchao Wang, Youtan Yin, Jie Song, Sihui Luo, and Mingli Song. Progressive network grafting for few-shot knowledge distillation. In AAAI, 2021.
- [46] Chengchao Shen, Mengqi Xue, Xinchao Wang, Jie Song, Li Sun, and Mingli Song. Customizing student networks from heterogeneous teachers via adaptive knowledge amalgamation. In ICCV, 2019.
- [47] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. IJCV, 2020.
- [48] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
- [49] Hanchen Wang, Defu Lian, Ying Zhang, Lu Qin, Xiangjian He, Yiguang Lin, and Xuemin Lin. Binarized graph neural network. arXiv preprint arXiv:2004.11147, 2020.
- [50] Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. Graphgan: Graph representation learning with generative adversarial nets. In AAAI, 2018.
- [51] Junfu Wang, Yunhong Wang, Zhen Yang, Liang Yang, and Yuanfang Guo. Bi-gcn: Binary graph convolutional network. arXiv preprint arXiv:2010.07565, 2020.
- [52] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR Workshop, 2019.
- [53] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. TOG, 2019.
- [54] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. TOG, 2019.
- [55] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. TNNLS, 2020.
- [56] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
- [57] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
- [58] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
- [59] Yiding Yang, Zunlei Feng, Mingli Song, and Xinchao Wang. Factorizable graph convolutional networks. In NeurIPS, 2020.
- [60] Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang. Distilling knowledge from graph convolutional networks. In CVPR, 2020.
- [61] Yiding Yang, Zhou Ren, Haoxiang Li, Chunluan Zhou, Xinchao Wang, and Gang Hua. Learning dynamics via graph neural networks for human pose estimation and tracking. In CVPR, 2021.
- [62] Yiding Yang, Xinchao Wang, Mingli Song, Junsong Yuan, and Dacheng Tao. Spagan: Shortest path graph attention network. In IJCAI, 2019.
- [63] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NeurIPS, 2018.
- [64] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In CVPR, 2017.
- [65] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
- [66] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 2017.