Meta-Aggregator: Learning to Aggregate for 1-bit Graph Neural Networks

In this paper, we study a novel meta aggregation scheme towards binarizing graph neural networks (GNNs). We begin by developing a vanilla 1-bit GNN framework that binarizes both the GNN parameters and the graph features. Despite the lightweight architecture, we observed that this vanilla framework suffered from insufficient discriminative power in distinguishing graph topologies, leading to a dramatic drop in performance. This discovery motivates us to devise meta aggregators to improve the expressive power of vanilla binarized GNNs, of which the aggregation schemes can be adaptively changed in a learnable manner based on the binarized features. Towards this end, we propose two dedicated forms of meta neighborhood aggregators, an exclusive meta aggregator termed as Greedy Gumbel Neighborhood Aggregator (GNA), and a diffused meta aggregator termed as Adaptable Hybrid Neighborhood Aggregator (ANA). GNA learns to exclusively pick one single optimal aggregator from a pool of candidates, while ANA learns a hybrid aggregation behavior to simultaneously retain the benefits of several individual aggregators. Furthermore, the proposed meta aggregators may readily serve as a generic plugin module into existing full-precision GNNs. Experiments across various domains demonstrate that the proposed method yields results superior to the state of the art.




1 Introduction

Graph neural networks (GNNs) have recently emerged as the dominant paradigm for learning and analyzing non-Euclidean data, which contain rich node content information as well as topological relational information [7, 12, 55]. As such, a massive number of GNN architectures have been developed [21, 48, 57, 62, 65]. The success of GNNs also triggers a great surge of interest in applying elaborated graph networks to various tasks across many domains, such as object detection [11, 8], pose estimation [61], point cloud processing [22, 53, 38], and visual SLAM [42]. These GNN-based applications, in general, rely on cumbersome graph architectures to deliver gratifying results. For example, SuperGlue, a GNN-based feature matching approach, requires 12M network parameters to achieve the state-of-the-art performance [42].

In practice, however, such applications typically require a compact and lightweight architecture for real-time interaction, especially in resource-constrained environments. In the case of autonomous driving [31], for example, it is critical to maintain fast and timely responses for GNN-based SLAM algorithms to handle complex traffic conditions, thereby leading to the urgent need for compressing cumbersome GNN models. The work of [60], as the first attempt, leverages knowledge distillation to learn a compact student GNN with fewer parameters. In spite of the improved efficiency, this approach still relies on expensive floating-point operations, and it additionally requires a well-performing teacher model to be pre-trained in the first place.

Figure 1: Illustrations of the computational workflow in (a) conventional full-precision GNNs and (b) the proposed 1-bit GNNs. In particular, we devise two meta aggregators for the proposed model, termed as Greedy Gumbel Aggregator (GNA) and Adaptable Hybrid Aggregator (ANA), that learn to perform adaptive aggregation in a graph-aware and layer-aware manner.
Figure 2: Example aggregation results of the two graphs with different topological structures for (a) the conventional pre-defined and fixed aggregator, (b) the proposed exclusive form of meta aggregators GNA, and (c) the proposed diffused form of meta aggregators ANA.

In this paper, we strive to make one step further towards ultra lightweight GNNs. Our goal is to train a customized 1-bit GNN, as shown in Fig. 1, that allows for favorable memory efficiency and meanwhile enjoys competitive performance. We start with developing a naïve GNN binarization framework, achieved through converting 32-bit features and parameters into 1-bit ones, followed by leveraging a straight-through estimator to optimize the binarized model. The derived vanilla binarized GNN enjoys favorable memory efficiency; however, its performance is not as encouraging as expected. Through parsing its underlying process, we identified that the binarization yields limited expressive power, making the model incapable of distinguishing different graph topologies. An illustrating example is shown in Fig. 2(a), where a mean aggregator, which is commonly adopted by full-precision GNNs, produces identical aggregation results for two different graph topologies with binarized features, thereby leading to inferior performance.

Inspired by this discovery, we introduce to the proposed GNN binarization framework a learnable and adaptive neighborhood aggregator, so as to alleviate the aforementioned dilemma and enhance the distinguishability of 1-bit graphs. Unlike existing GNNs that rely on a pre-defined and fixed aggregator, our elaborate meta neighborhood aggregators enable dynamically selecting (Fig. 2(b)) or generating (Fig. 2(c)) customized input- and layer-specific aggregation schemes. As such, we explicitly account for the customized characteristics of binarized graph features, and further strengthen the discriminative power for handling topological structures.

Towards this end, we propose two variants of meta aggregators: an exclusive meta aggregator, termed as Greedy Gumbel Neighborhood Aggregator (GNA), that adaptively selects an optimal aggregator in a learnable manner, as well as a diffused meta aggregator, termed as Adaptable Hybrid Neighborhood Aggregator (ANA), that either approximates a single aggregator or dynamically generates a hybrid aggregation behavior. Specifically, GNA incorporates the discrete decisions from the candidate aggregators, conditioned on the individual graph features, into the gradient descent process by leveraging Greedy Gumbel Sampling. Inevitably, the performance of GNA is bottlenecked by the individual aggregators in the candidate pool. Thus, we further devise ANA that enables generating a hybrid aggregator dynamically based on the input 1-bit graphs. ANA simultaneously preserves the strengths of multiple individual aggregators, leading to favorable competence to handle the challenging 1-bit graph features. Moreover, the proposed GNA and ANA can be readily extended as portable modules into the general full-precision GNN models to enhance the expressive capability.

In sum, our contribution is a novel GNN-customized binarization framework that generates a 1-bit lightweight GNN model with competitive performance, making it competent for resource-constrained applications such as edge computing. This is specifically achieved through an adaptive meta aggregation scheme to accommodate the challenging quantized graph features. We evaluate the proposed customized framework on several large-scale benchmarks across different domains and graph tasks. Experimental results demonstrate that the proposed meta aggregators achieve results superior to the state-of-the-art, not only on the devised 1-bit binarized GNN models, but also on the general full-precision models.

2 Related Work

Graph Neural Networks. The concept of graph neural networks was proposed in [43], which generalized existing neural networks to handle graph data represented in the non-Euclidean domain. Over the past few years, graph neural networks have achieved unprecedented advances with various approaches being developed [21, 19, 65, 7, 59, 58, 26, 28, 50, 13, 25, 33, 27]. For example, graph attention network in [48] introduces a novel attention mechanism for efficient graph processing. GraphSAGE [10], on the other hand, addresses the scalability issues on large-scale graphs by sampling and aggregating feature representations from local neighborhoods.

The success of GNNs has also boosted the applications of graph networks in a wide range of problem domains [65], including semantic segmentation [54, 22, 38, 36], object detection [11, 8], pose estimation [61], interaction detection [37, 17], and visual SLAM [42]. Specifically, Wang et al. [54] propose a dynamic graph convolutional model for point cloud classification and semantic segmentation, which combines the advantages of PointNet [35] and the graph convolutional network [21]. Despite the encouraging performance, there is a lack of research on compressing cumbersome GNN models, which is critical for deployment in resource-constrained environments such as mobile terminals.

We also notice two related works in [24, 34] upon publication that focus on generalized aggregation functions. However, our work is conceptually very different from [24, 34]: in fact, our work is the first dedicated study on functionals of GNNs, dealing with functions of functions; in other words, we focus on learning aggregators of aggregators, where the inputs are themselves aggregators. This has not yet been explored in prior works.

Network Binarization. In the field of model compression [64, 44, 45, 5, 46], network binarization techniques aim to reduce memory occupancy and accelerate network inference by binarizing network parameters and then utilizing bitwise operations [14, 15, 4]. In recent years, various CNN binarization methods have been proposed, which can be categorized into direct binarization [6, 14, 15, 20] and optimization-based binarization [40, 4, 30]. Specifically, direct binarization quantizes the weights and activations to 1 bit with a pre-defined binarization function. In contrast, optimization-based binarization introduces scaling factors for the binarized parameters to improve the representation ability, but this inevitably leads to inferior efficiency.

Driven by the success of the aforementioned binarization techniques in the CNN domain, in this paper, we propose a GNN-specific binarization method. Specifically, we primarily focus on GNN-based direct binarization, since our goal is to develop super lightweight GNN models. We also notice three concurrent works [49, 51, 1] that aim to accelerate the forward process for GNN models. However, [49, 51] directly apply CNN-based binarization techniques without considering the characteristics of GNNs, and in fact serve as the baseline method in our experiments. The other work in [1] only focuses on improving the efficiency of the dynamic graph convolutional model [54], by speeding up the dynamic construction of k-nearest-neighbor graphs in the Hamming space. Unlike [49, 51, 1], we aim to devise a more general GNN-specific binarization framework that is applicable to most existing GNN models.

3 Vanilla Binary GNN and Pre-analysis

In this section, we first develop a vanilla binary GNN framework by simply binarizing model parameters and activations. We then show the limitations of this vanilla binary GNN by looking into the internal message aggregation process and accordingly develop two possible solutions to address these limitations. Eventually, built upon the possible solutions, we introduce the idea of the proposed customized GNN binarization framework with the meta aggregators.

Formulation of GNN Models.

GNNs leverage graph topologies and node/edge features to learn a representation vector of a node, an edge or the entire graph. Let $G = (V, E)$ denote a directed/undirected graph with nodes $V$ and edges $E$, where $\mathcal{N}(i)$ is the set of neighboring nodes of node $i$. Each node $i \in V$ has an associated node feature $x_i$. For example, in the task of 3D object classification, $x_i$ can be set as the 3D coordinates.

Existing GNNs follow an iterative neighborhood aggregation scheme at each GNN layer, where each node iteratively gathers features from its neighboring nodes to capture the structural information [23, 57]. Let $h_i^{(l)}$ denote the feature vector of the node $i$ at layer $l$. The corresponding updated feature vector $h_i^{(l+1)}$ in a GNN can then be formulated as:

$$h_i^{(l+1)} = f\big(h_i^{(l)},\ \{h_j^{(l)} : j \in \mathcal{N}(i)\}\big), \tag{1}$$

where $h_j^{(l)}$, with $j \in \mathcal{N}(i)$, represents the feature associated with the neighboring nodes of node $i$. $f$ is a mapping function that takes $h_i^{(l)}$ as well as $\{h_j^{(l)}\}$ as inputs. The choice of the mapping $f$ corresponds to different architectures of GNNs.

For the sake of simplicity, we take the graph convolutional network (GCN) proposed by Kipf and Welling [21] as an example GNN architecture for the following discussions. We denote Mean as the mean aggregator that computes an average of the incoming messages and $W$ as the learnable weight matrix for feature transformation. The general GNN form in Eq. 1 can then be instantiated for GCN in two ways: either Mean is applied to the neighboring features and the result is then transformed by $W$, or each neighboring feature is first transformed by $W$ and the results are then averaged by Mean. These respectively correspond to the case where aggregation comes first or comes after the feature transformation step [52].
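The two orderings can be made concrete with a small sketch. The pure-Python example below is illustrative only: it assumes a plain mean over neighbors with no self-loops, degree normalization, bias, or nonlinearity, and the names `gcn_layer`, `mean_aggregate`, and `transform` are hypothetical helpers rather than part of any GNN library.

```python
def mean_aggregate(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def transform(weight, vector):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def gcn_layer(features, neighbors, weight, aggregate_first=True):
    """One simplified layer: Mean over neighbors plus a linear transformation."""
    updated = {}
    for node, nbrs in neighbors.items():
        messages = [features[j] for j in nbrs]
        if aggregate_first:
            # Aggregation comes first, then the feature transformation.
            updated[node] = transform(weight, mean_aggregate(messages))
        else:
            # Feature transformation comes first, then the aggregation.
            updated[node] = mean_aggregate([transform(weight, m) for m in messages])
    return updated


# Toy graph (assumed values): three nodes with 2-dimensional features.
features = {0: [1.0, 2.0], 1: [3.0, 0.0], 2: [-1.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
weight = [[0.5, -1.0], [2.0, 0.25]]

agg_then_transform = gcn_layer(features, neighbors, weight, aggregate_first=True)
transform_then_agg = gcn_layer(features, neighbors, weight, aggregate_first=False)
```

In this simplified full-precision setting the two orderings coincide, because both steps are linear; they can differ once a nonlinearity or a quantization step is placed between them.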

Vanilla 1-bit GNN Models.

We develop a naïve binarized GNN framework to compress cumbersome GNN models, by directly binarizing 32-bit input features and learnable weights in the feature transformation step into 1-bit ones.

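Specifically, for the case of vanilla binary GCN, the forward propagation process can be modeled as:

$$w_b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \ge 0, \\ -1, & \text{otherwise}, \end{cases} \tag{2}$$

where $w$ represents an element in the learnable weight matrix $W$ and $w_b$ is its binarized counterpart. We also binarize the graph features in the same manner, by replacing $w$ in Eq. 2 with the feature element $h$.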

During the backward propagation, it is not feasible to simply exploit the backpropagation (BP) algorithm [41], as most full-precision models do, to optimize binarized graph networks, due to the non-differentiable binarization function, i.e., sign in Eq. 2. The derivative of the sign function is 0 almost everywhere, thereby resulting in the vanishing gradient problem. To alleviate this dilemma, we leverage the Straight-through Estimator (STE) [2] for the backward propagation process in the binarized graph nets, formulated as:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial w_b} \cdot \mathbf{1}_{|w| \le 1}, \tag{3}$$

where $\mathcal{L}$ represents the loss function, $w$ is a full-precision weight, and $w_b = \mathrm{sign}(w)$ is its binarized value. Essentially, Eq. 3 can be considered as propagating the gradient through the hard tanh function, defined as: $\mathrm{Htanh}(x) = \max(-1, \min(1, x))$.
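As a rough illustration of the forward binarization and the straight-through backward pass, the sketch below works on scalar weights in plain Python. It assumes the common STE variant in which the gradient passes unchanged where the weight magnitude is at most 1 and is zeroed elsewhere, which matches the derivative of hard tanh; the function names `binarize`, `ste_grad`, and `hard_tanh` are hypothetical.

```python
def binarize(w):
    """Forward pass: map a full-precision value to +1 or -1 with sign."""
    return 1.0 if w >= 0 else -1.0


def hard_tanh(x):
    """Clip x to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def ste_grad(w, grad_wrt_binarized):
    """Backward pass (assumed STE variant): pass the gradient where |w| <= 1."""
    return grad_wrt_binarized if abs(w) <= 1.0 else 0.0


weights = [-1.7, -0.3, 0.0, 0.4, 2.5]       # full-precision latent weights
binary_weights = [binarize(w) for w in weights]

upstream = [0.5] * len(weights)              # gradient w.r.t. the binarized weights
grads = [ste_grad(w, g) for w, g in zip(weights, upstream)]
```

Weights whose magnitude exceeds 1 receive no gradient here, which is the same region where hard tanh is flat.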

We illustrate in Fig. 3 the computational workflow at an example binarized GCN layer for the case where the aggregation comes after the feature transformation. A similar scheme can be observed for the GCN model where the aggregation happens first. With compact node features and net weights, binarized GCN only relies on 1-bit XNOR and bit-count operations for graph-based processing, leading to an efficient and lightweight graph model that is competent for edge computing.
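To show why XNOR and bit-count suffice, the following sketch computes the dot product of two vectors with entries in {+1, -1} using only bitwise operations. The bit encoding (+1 as bit 1, -1 as bit 0) and the helper names `pack` and `xnor_popcount_dot` are assumptions made for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, one bit per entry."""
    bits = 0
    for position, value in enumerate(values):
        if value > 0:
            bits |= 1 << position
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 where the entries match
    matches = bin(agree).count("1")        # bit-count
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - n


a = [1, -1, -1, 1, 1, -1]
b = [1, 1, -1, -1, 1, 1]

reference = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(pack(a), pack(b), len(a))
```

The result is always an integer between -n and n, which is the discrete output domain discussed in the next subsection.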

Figure 3: Illustrations of the computational workflow at an example binarized GNN layer. Despite the efficient 1-bit operations, the output features are less distinguishable between each other, leading to the challenge in the aggregation step shown in Fig. 4.

Challenges and Possible Solutions.

Despite the compact binarized parameters and features, we empirically observed that the results of the developed vanilla GNN were not as promising as expected. Specifically, we conduct a preliminary experiment on the ZINC dataset [16] with the GCN architecture in [7]. Averaged over 25 independent runs, the full-precision GCN model achieves 0.407 ± 0.018 in terms of the mean absolute error (MAE), whereas the vanilla binarized GCN yields 0.669 ± 0.070 in MAE, which is far behind that of the full-precision one.

We explore the reason behind this unsatisfactory performance by looking into the internal computational process in binarized GNNs. Specifically, we look back on Fig. 3, which shows the example workflow at a binarized GCN layer where the feature transformation is performed before the aggregation step. It is noticeable that the result of 1-bit operations lies in the discrete integer domain. The resulting feature space is thereby much smaller than that of the 32-bit floating-point operations. In other words, the outputs of 1-bit operations are less distinguishable from each other. This property, when appearing in the graph domain, makes it difficult to extract and discriminate graph topologies in the neighborhood aggregation process, which in fact is the key to the success of graph networks.

To further illustrate this dilemma, we demonstrate a couple of examples in Fig. 4, including both max and mean aggregation schemes that are commonly leveraged in GNNs. Fig. 4(a) shows the aggregation results of the 32-bit GNN layer, where both the max and mean aggregators successfully distinguish the two different topological structures. However, for the aggregation results of discrete integer features in binarized GNNs (Fig. 4(b)), neither the max nor the mean aggregator can discriminate the corresponding two graph structures. Moreover, the situation will be more challenging for the case where the aggregation happens before the transformation, since the features fed into the aggregator are limited to only $+1$ or $-1$.
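The collapse can be reproduced with a few made-up numbers. The neighborhoods below are hypothetical and are not the ones drawn in Fig. 4; they only illustrate that sign binarization can make two distinguishable neighborhoods identical, and that a single aggregator can collide on integer-valued features where a pair of aggregators does not.

```python
def sign(x):
    return 1 if x >= 0 else -1


def mean(values):
    return sum(values) / len(values)


# Two full-precision neighborhoods (assumed values) that mean and max tell apart.
full_a = [0.3, 0.9, -0.4]
full_b = [0.6, 0.2, -0.1]

# After sign binarization both neighborhoods become [1, 1, -1].
bin_a = [sign(x) for x in full_a]
bin_b = [sign(x) for x in full_b]

# Integer-valued features, as produced by XNOR and bit-count operations.
int_a, int_b, int_c = [1, 3], [2, 2], [3, 3]


def combined(values):
    """Pair of aggregation results: (mean, max)."""
    return (mean(values), max(values))


mean_collides = mean(int_a) == mean(int_b)    # mean cannot separate int_a and int_b
max_collides = max(int_a) == max(int_c)       # max cannot separate int_a and int_c
signatures = {combined(v) for v in (int_a, int_b, int_c)}
```

Each single aggregator fails on one pair, while the combined signature separates all three neighborhoods, which is the observation the mixed multi-aggregators build on.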

Nevertheless, from Fig. 4(b), we also found that, by combining different aggregation schemes, various graph topologies could in fact become distinguishable. This observation motivates us to develop possible solutions to alleviate the aforementioned dilemma in vanilla binarized GNNs. Specifically, we propose two straightforward mixed multi-aggregators that combine the benefits of various aggregation schemes in two different ways. The first one performs message aggregation multiple times with several different aggregators and then computes the sum over the aggregation results, leading to a performance of 0.599 ± 0.017 in MAE with five standard aggregators. The second one, on the other hand, concatenates the results from several independent aggregators, achieving an average result of 0.614 ± 0.045 over 25 runs.
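A minimal sketch of the two mixing strategies is given below for scalar messages. The text does not list which five aggregators were used, so the pool here (mean, max, min, sum, and standard deviation) is an assumption, and `sum_mix` and `concat_mix` are hypothetical names.

```python
import statistics

# Assumed candidate pool; the five aggregators used in the experiment are not named.
AGGREGATORS = {
    "mean": statistics.mean,
    "max": max,
    "min": min,
    "sum": sum,
    "std": statistics.pstdev,
}


def sum_mix(messages):
    """First strategy: run every aggregator and sum the results."""
    return float(sum(agg(messages) for agg in AGGREGATORS.values()))


def concat_mix(messages):
    """Second strategy: run every aggregator and concatenate the results."""
    return [float(agg(messages)) for agg in AGGREGATORS.values()]


neigh_a = [1, 3]
neigh_b = [2, 2]   # same mean and same sum as neigh_a

mixed_sum_a, mixed_sum_b = sum_mix(neigh_a), sum_mix(neigh_b)
mixed_cat_a, mixed_cat_b = concat_mix(neigh_a), concat_mix(neigh_b)
```

Both strategies call every aggregator in the pool for every node, so the aggregation cost grows with the pool size.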

In spite of the improved performance, the devised possible solutions need to perform feature aggregation multiple times at each GNN layer, resulting in a heavy computational burden. Motivated by this limitation, we introduce the proposed meta neighborhood aggregators, which aim to enhance the discriminative capability for topological structures and meanwhile enjoy efficient computations.

Figure 4: Example aggregation results of (a) conventional 32-bit GNN layer and (b) binarized GNN layer, corresponding to Fig. 3. For (a), both mean and max aggregators can distinguish the two graph structures; however, for binarized GNN (b), max and mean aggregators fail to differentiate between two topologies.
Figure 5: The overall framework of the proposed meta neighborhood aggregation methods. The upper row illustrates the workflow of the exclusive meta aggregator GNA, which receives the encoded graph features from the binarized graph auto-encoder (i.e., the pink trapezoid) and exclusively determines a single optimal layer-wise and node-wise aggregator from a candidate aggregator pool. The lower row, on the other hand, demonstrates the diffused meta aggregator ANA, which amalgamates various aggregation behaviors.

4 Meta Neighborhood Aggregation

4.1 Overview

To address the aforementioned limitations of the devised mixed multi-aggregators, we introduce in this section the proposed concept of the Meta Aggregator, which aims to adaptively and efficiently adjust the way information is aggregated in a learnable manner. Towards this end, we propose two specific forms of meta aggregators, i.e., the exclusive meta aggregation method and the diffused meta aggregation method, as illustrated in Fig. 5.

The exclusive form, termed as Greedy Gumbel Neighborhood Aggregator (GNA), learns to determine a single optimal aggregation scheme from a pool of candidate aggregators, according to the individual characteristics of the quantized graph features, as shown in the upper part of Fig. 5. The diffused meta form, on the other hand, adaptively learns a customized aggregation formulation that can potentially incorporate the properties of several independent aggregators, thereby termed as Adaptable Hybrid Neighborhood Aggregator (ANA) shown in the lower part of Fig. 5.

In what follows, we detail the devised two forms of meta neighborhood aggregation methods, i.e., GNA and ANA, and also the associated training strategy.

Input: $L$: the number of layers; $W$: the GNN model weight; $G$: input graph data with nodes $V$ and edges $E$; $x$: the input binarized node feature vector; the graph auto-encoder; Meta-Aggre. ∈ {GNA, ANA}: the choice of meta neighborhood aggregators.
Output: Target 1-bit binarized GNN model.
for $l = 1$ to $L$ do
     Feed the graph sample $G$ into the GNN layer $l$;
     Binarize the GNN weight $W$ into its 1-bit form by Eq. 2;
     Perform 1-bit transformation with the binarized weight and the binarized features;
     Binarize the weight of the graph auto-encoder into its 1-bit form by Eq. 2;
     Obtain the encoded features with the binarized graph auto-encoder;
     // Identify the choice from the two meta aggregators
     if Meta-Aggre. is GNA then
         // Exclusively decide an optimal aggregator
         Feed the encoded features into the GNA module;
         Obtain the aggregator decision for each node by Eq. 4;
         Perform aggregations with the obtained decision;
     else if Meta-Aggre. is ANA then
         // Generate a diffused aggregator
         Feed the encoded features into the ANA module;
         Obtain the diffused aggregator by Eq. 5;
         Perform aggregations with the obtained diffused aggregator;
     end if
end for
Optimize the binarized GNN over the training epochs by Eq. 3.

Algorithm 1 Training a lightweight 1-bit GNN model with the proposed meta neighborhood aggregators.


| Methods | Full (GAT) [48] | Vanilla (GAT) [14] | GNA (GAT) | ANA (GAT) | Full (GCN) [21] | Vanilla (GCN) [14] | GNA (GCN) | ANA (GCN) |
|---|---|---|---|---|---|---|---|---|
| Bit-width | 32/32 | 1/1 | 1/1 | 1/1 | 32/32 | 1/1 | 1/1 | 1/1 |
| Param Size | 399.941KB | 81.7070KB | 82.0610KB | 81.8799KB | 402.645KB | 82.2002KB | 82.5566KB | 82.3740KB |
| Test MAE ± SD | 0.476 ± 0.006 | 0.670 ± 0.064 | 0.592 ± 0.013 | 0.566 ± 0.012 | 0.407 ± 0.018 | 0.669 ± 0.070 | 0.608 ± 0.024 | 0.607 ± 0.020 |
| Train MAE ± SD | 0.300 ± 0.024 | 0.610 ± 0.066 | 0.531 ± 0.013 | 0.453 ± 0.019 | 0.303 ± 0.026 | 0.624 ± 0.069 | 0.558 ± 0.027 | 0.564 ± 0.021 |
| p-value (vs. Vanilla) | - | - | 3.010 × 10^(…) | 2.359 × 10^(…) | - | - | 1.597 × 10^(…) | 9.787 × 10^(…) |


Table 1: Results on the ZINC dataset with different architectures, in terms of the mean absolute error (MAE). From left to right: the results of the full-precision GNNs (Full), those of the 1-bit GNNs without the proposed meta aggregators (Vanilla), and the results of the 1-bit GNNs with GNA and ANA. We also provide the p-value of the paired t-test to demonstrate the statistically meaningful improvements by the proposed GNA and ANA.

4.2 Greedy Gumbel Aggregator

Motivated by the observation from Fig. 4, where different single aggregators work for a corresponding set of cases as explained in Sect. 3, we propose the idea of adaptively determining the optimal aggregator depending on the specific input graphs, as depicted in the upper part of Fig. 5.

To this end, there are a few challenges to be addressed. First, the aggregation selector should understand the underlying characteristics of various input graphs without introducing much additional computational cost. To address this issue, we propose to leverage a 1-bit graph auto-encoder to extract meaningful information from input graphs, which is then exploited to guide the decision of different aggregation methods.

The second challenge is how to incorporate the discrete selections into the gradient descent process in training GNNs. One straightforward solution would be to model the discrete determination process as a state classification problem and to consider the various aggregators in the candidate pool as different labels. However, this naïve attempt does not account for the uncertainty of the selector, which is likely to cause the model collapse problem where the output choice is independent of the input graphs, such as always or never picking up a specific aggregator.

To alleviate this dilemma, we propose to impose stochasticity in the aggregator decision process with greedy Gumbel sampling [29, 47] and propagate gradients through stochastic neurons via the continuous form of the Gumbel-Max trick [18]. Specifically, we introduce such stochasticity by greedily sampling noise from the Gumbel distribution, owing to its Gumbel-Max property [9]. In terms of Gumbel random variables, the Gumbel-Max trick can be utilized to parameterize discrete distributions. However, the Gumbel-Max trick contains an argmax operation, which is not differentiable. We thereby resort to its continuous relaxation, termed the Gumbel-softmax estimator, which replaces the non-differentiable argmax function with a softmax function.

With the aforementioned graph auto-encoder and also the Gumbel-softmax estimator to address the two challenges, respectively, the proposed greedy Gumbel aggregator (GNA) for node can then be formulated as:


In Eq. 4, the binarized graph auto-encoder at each layer extracts principal and meaningful information, and the noise term is sampled Gumbel random noise. The input is the subgraph formed by one center node and the set of its connected neighboring nodes. The temperature of the softmax is a constant. The output is a one-hot vector that indicates the aggregator decision at the given node and layer, chosen from the pool of candidate aggregators.

In this way, the proposed greedy Gumbel aggregator adaptively decides the optimal aggregator conditioned on each specific node and layer in a learnable manner, which can significantly improve the topological discriminative capability of the vanilla binary GNN model.
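For intuition, a standard Gumbel-softmax sample can be sketched in plain Python as follows. This is the generic estimator rather than the exact GNA formulation: the logits stand in for hypothetical auto-encoder scores over a three-aggregator pool, and `sample_gumbel`, `gumbel_softmax`, and `hard_choice` are illustrative names.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep the logarithms finite
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Soft, differentiable relaxation of sampling one category."""
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    peak = max(noisy)                                # subtract the max for stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(probs):
    """One-hot vector at the largest entry (the discrete decision)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]   # hypothetical scores for, e.g., (mean, max, sum)
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
one_hot = hard_choice(soft)
```

A lower temperature pushes the soft output closer to a one-hot vector, while the injected noise keeps the selection stochastic during training.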

4.3 Adaptable Hybrid Aggregator

Despite the improved representational ability, the performance of the greedy Gumbel aggregator is bottlenecked by that of the existing standard aggregators, which leaves room for further improvement. Motivated by this observation, we further devise an adaptable hybrid neighborhood aggregator (ANA) that can generate a hybrid form of the several standard aggregators in a learnable manner, thereby simultaneously retaining the advantages of different aggregators. The overall computational pipeline of ANA is demonstrated in the lower part of Fig. 5.

We start by giving the developed graph-based mathematical formulation for diffused message aggregation, defined as follows:


Eq. 5 involves the in-degree of the node and the graph sample with its edges. The 1-bit graph auto-encoder at the given layer is the same one used in Eq. 4. The inputs are the feature vectors of the neighboring nodes at that layer, and the output is the obtained diffused aggregator.

Eq. 5 can essentially approximate the max and mean functions, depending on the output of the graph auto-encoder. Specifically, a higher output leads to a behavior similar to that of the max aggregator, while smaller output values generate the effect of mean neighborhood aggregation. A detailed mathematical proof is provided in the supplementary material.
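One well-known smooth function with this interpolation property is the log-mean-exp, sketched below. It is used here only as an assumed stand-in to illustrate the behavior; it should not be read as the exact form of Eq. 5. The parameter `beta` plays the role of the controlling value that, in the proposed method, would come from the graph auto-encoder.

```python
import math


def log_mean_exp(values, beta):
    """(1 / beta) * log(mean(exp(beta * v))) for beta != 0, computed stably."""
    scaled = [beta * v for v in values]
    peak = max(scaled)
    avg = sum(math.exp(s - peak) for s in scaled) / len(values)
    return (peak + math.log(avg)) / beta


messages = [1.0, 2.0, 6.0]                    # toy neighbor features

near_mean = log_mean_exp(messages, 1e-3)      # small beta: close to the mean (3.0)
near_max = log_mean_exp(messages, 50.0)       # large beta: close to the max (6.0)
```

Dividing by the number of neighbors inside the logarithm is what makes the small-beta limit a mean rather than a sum.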

By slightly changing the form of Eq. 5, we can also approximate other aggregators. For example, by simply adding a minus sign to the input graph features, Eq. 5 can approach the behavior of the min aggregation. Also, by utilizing the fact that $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, the variance aggregator can be approximated by adding the square operations to Eq. 5. More detailed derivations and mathematical proofs can be found in the supplement.
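The two variations can be illustrated with the same kind of smooth aggregator. The sketch below again uses log-mean-exp as an assumed stand-in for Eq. 5, and `soft_min` and `soft_variance` are hypothetical helpers: the first negates the inputs (and the result), the second applies the identity Var(X) = E[X^2] - (E[X])^2 with a near-mean setting.

```python
import math


def log_mean_exp(values, beta):
    """(1 / beta) * log(mean(exp(beta * v))) for beta != 0, computed stably."""
    scaled = [beta * v for v in values]
    peak = max(scaled)
    avg = sum(math.exp(s - peak) for s in scaled) / len(values)
    return (peak + math.log(avg)) / beta


def soft_min(values, beta):
    """Min-like behavior: negate the inputs, take a max-like value, negate back."""
    return -log_mean_exp([-v for v in values], beta)


def soft_variance(values, beta):
    """Variance-like behavior: near-mean of the squares minus squared near-mean."""
    mean_of_squares = log_mean_exp([v * v for v in values], beta)
    mean_value = log_mean_exp(values, beta)
    return mean_of_squares - mean_value ** 2


messages = [1.0, 2.0, 6.0]

approx_min = soft_min(messages, 50.0)          # close to min(messages) = 1.0
approx_var = soft_variance(messages, 1e-4)     # close to the population variance
exact_var = sum((v - 3.0) ** 2 for v in messages) / len(messages)
```

Because each variant reuses one smooth function with different inputs, several of them could be weighted and summed, in the spirit of the hybrid formulation described next.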

Furthermore, it is also possible to simultaneously combine the benefits of all these approximated aggregators, by summing multiple terms in Eq. 5 with graph-based learnable weighting factors that adaptively control the diffused degree of various aggregator approximations. We illustrate the corresponding sophisticated formulation and also more detailed explanations in the supplementary material.

4.4 Training Strategy

We also propose a training strategy tailored for the proposed method. As a whole, the principal operations of training a 1-bit GNN model with the proposed meta neighborhood aggregation approaches are summarized in Alg. 1. For the sake of clarity, we omit the bias terms in our illustration, which behave similarly to the GNN weight $W$. Also, we take the case where the feature transformation happens before the aggregation step as an example to illustrate the overall workflow.

As can be observed from Alg. 1, at each layer, the input graph is fed into the lightweight 1-bit graph auto-encoder to extract useful information that is beneficial to the following meta aggregators. Following this graph encoding process, the meta neighborhood aggregation module receives the encoded features and either exclusively determines an optimal aggregator or produces a diffused aggregator that amalgamates the behaviors of several independent aggregators. The desired 1-bit GNN model can eventually be obtained by optimizing the model over the training epochs with the straight-through estimator, as explained in Sect. 3.

5 Experiments

In this section, we perform extensive experiments on three publicly available benchmarks across diversified problem domains, including graph regression, node classification, and 3D object recognition. Following the evaluations, we further provide detailed discussions regarding the strengths and weaknesses of the devised meta aggregators.


| Methods | Param Size | Test MAE ± SD | Train MAE ± SD |
|---|---|---|---|
| GatedGCN [3] | 413.027KB | 0.426 ± 0.012 | 0.272 ± 0.023 |
| GraphSage [10] | 371.004KB | 0.475 ± 0.007 | 0.296 ± 0.030 |
| GIN [57] | 402.652KB | 0.387 ± 0.019 | 0.319 ± 0.020 |
| MoNet [32] | 414.070KB | 0.386 ± 0.009 | 0.299 ± 0.016 |
| GCN [21] | 402.645KB | 0.407 ± 0.018 | 0.303 ± 0.026 |
| GAT [48] | 399.941KB | 0.476 ± 0.006 | 0.300 ± 0.024 |
| GNA (Ours) | 411.270KB | 0.337 ± 0.021 | 0.160 ± 0.026 |
| ANA (Ours) | 404.504KB | 0.325 ± 0.015 | 0.109 ± 0.014 |


Table 2: Results of the proposed meta aggregation methods and other approaches for 32-bit full-precision models on the ZINC dataset, in terms of MAE. The results are averaged over 25 independent runs with 25 different random seeds.

5.1 Experimental Settings

Datasets. We validate the effectiveness of the proposed meta aggregation methods on three different datasets, each of which specializes in a distinct task. Specifically, for the task of graph regression, we use the ZINC dataset [16], which is one of the most popular real-world molecular datasets [7]. The goal of ZINC is to regress a specific molecular property, the constrained solubility, which is a critical property for developing GNNs for molecules [63].

Also, for the node classification task, we adopt the protein-protein interaction (PPI) dataset [66], which is a multi-label dataset with 24 graphs corresponding to different human tissues. Each node in the PPI dataset is labeled with various protein functions. The objective of PPI is thereby to predict the 121 protein functions from the interactions of human tissue proteins. Furthermore, we utilize ModelNet40 [56] for the evaluation on the task of 3D object classification. ModelNet40 is a popular dataset for 3D object analysis [35, 36], containing 12,311 meshed CAD models from 40 shape categories in total. Each object comprises a set of 3D points, with the 3D coordinates as the features. The goal is to predict the category of each 3D shape.

Implementation Details.

We primarily use three heterogeneous architectures, including graph convolutional network (GCN) [21], graph attention network (GAT) [48], as well as dynamic graph convolutional model (DGCNN) [54] to evaluate the proposed meta aggregation approach. For other settings such as learning rates and batch size, we follow those in the works of [7], [48], and [54] for the tasks of graph regression, node classification, and point cloud classification, respectively.

In particular, for more convincing evaluations, we report the results on the ZINC dataset over 25 independent runs with 25 different random seeds. Also, as done in the field of CNN binarization [39], we keep the first and the last GNN layer full-precision and binarize the other GNN layers for all the comparison methods. More detailed task-by-task architecture designs as well as the hyperparameter settings can be found in the supplementary material.
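The layer-wise precision policy described above can be illustrated with a rough storage estimate. The following is a minimal sketch, assuming hypothetical per-layer parameter counts and counting weight storage only (1 bit per binarized weight, 32 bits per full-precision weight); the function names are illustrative, and the numbers are not meant to reproduce the sizes reported in the tables.

```python
def precision_plan(num_layers):
    """Bit-width per layer: first and last layers stay 32-bit, the rest are 1-bit."""
    if num_layers < 1:
        raise ValueError("need at least one layer")
    bits = [1] * num_layers
    bits[0] = 32
    bits[-1] = 32
    return bits


def model_size_bytes(layer_param_counts):
    """Weight storage in bytes under the first/last full-precision policy."""
    bits = precision_plan(len(layer_param_counts))
    total_bits = sum(n * b for n, b in zip(layer_param_counts, bits))
    return total_bits / 8


# Hypothetical 4-layer network; these counts are illustrative, not from the paper.
layer_param_counts = [1_000, 20_000, 20_000, 500]
full_precision_bytes = sum(layer_param_counts) * 32 / 8
mixed_bytes = model_size_bytes(layer_param_counts)
print(precision_plan(4))                  # [32, 1, 1, 32]
print(full_precision_bytes, mixed_bytes)  # 166000.0 11000.0
```

In this toy setting most of the remaining storage comes from the two full-precision layers, which is why a binarized model is not necessarily 32 times smaller than its full-precision counterpart.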


Methods           Bit-width   Param Size   Score
Full Prec. [48]   32/32       43.7712MB    98.70
Vanilla [14]      1/1         28.2560MB    92.68
GNA (Ours)        1/1         28.2572MB    97.52
ANA (Ours)        1/1         28.2565MB    97.71


Table 3: Results on the PPI dataset for the task of node classification, in terms of micro-averaged F1 score. Detailed network architectures can be found in the supplementary material.

Figure 6: Visualization results of the learned feature space, depicted as the distance between the red point and the rest of the others. The visualized features are extracted from the intermediate layer of the models. More results can be found in the supplementary material.

5.2 Results

Graph Regression. Tab. 1 shows the ablation results of the vanilla 1-bit GNN models and those of GNNs with the proposed meta neighborhood aggregators GNA and ANA. Specifically, we report the results on two GNN architectures, i.e., GCN [21] and GAT [48], by averaging over 25 independent runs with 25 seeds.

The proposed GNA and ANA, as shown in Tab. 1, achieve strong performance in terms of both test and train MAE while maintaining a compact model size. Moreover, we provide in the last row of Tab. 1 the p-value of the paired t-test between the 1-bit GNNs with a fixed aggregator (Vanilla) and those with the proposed learnable meta aggregators. The corresponding results statistically validate the effectiveness of the proposed method.
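To make the reporting protocol concrete, the sketch below computes a mean and standard deviation over seeds and the paired t statistic between two methods evaluated on the same seeds. It is a minimal illustration using only the Python standard library; the per-seed MAE values are hypothetical placeholders, not results from the paper. The p-value would then follow from the Student t distribution with n - 1 degrees of freedom (for instance via `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation over independent runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical test MAEs for 5 shared seeds (the paper uses 25 seeds).
vanilla_mae = [0.70, 0.72, 0.68, 0.71, 0.69]
meta_mae = [0.60, 0.63, 0.59, 0.60, 0.58]

print(mean_and_std(vanilla_mae))
print(mean_and_std(meta_mae))
t_stat, dof = paired_t_statistic(vanilla_mae, meta_mae)
print(t_stat, dof)
```

Pairing by seed removes the run-to-run variance shared by both methods, so the test focuses on the per-seed difference rather than on the spread of each method alone.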

Furthermore, we show in Tab. 2 the results of extending the proposed meta aggregators to full-precision GNNs and compare them with those of the state-of-the-art approaches [3, 10, 57, 32, 21, 48]. Specifically, the results in the last two rows of Tab. 2 are obtained by simply replacing the pre-defined aggregator in GAT with the proposed GNA and ANA. As can be observed from Tab. 2, the proposed method outperforms other approaches by a large margin, and meanwhile introduces few additional parameters.

Node Classification. In Tab. 3, we report the results of different methods with the GAT architecture. The proposed GNA and ANA, as shown in Tab. 3, yield results on par with those of the 32-bit full-precision models, but come with a more lightweight architecture. The proposed method also outperforms the vanilla 1-bit GNN model that relies on a fixed aggregation scheme.
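For reference, the micro-averaged F score used for multi-label node classification pools true positives, false positives, and false negatives over all nodes and all labels before computing a single score. The following is a minimal sketch in plain Python, assuming binary label matrices given as lists of 0/1 rows (one row per node); the toy labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions given as 0/1 rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: 3 nodes, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
print(micro_f1(y_true, y_pred))  # 0.8 (TP=4, FP=1, FN=1)
```

Because the counts are pooled before the ratio is taken, frequent labels contribute more to the score than rare ones.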


Methods           Bit-width   Param Size   Acc (%)   mAcc (%)
Full Prec. [54]   32/32       1681.66KB    92.42     89.51
Vanilla [14]      1/1         1091.20KB    74.19     65.95
GNA (Ours)        1/1         1091.30KB    78.36     71.67
ANA (Ours)        1/1         1091.30KB    84.64     78.89


Table 4: Results on the ModelNet40 dataset for 3D object recognition, in terms of the overall accuracy (Acc) and the mean class accuracy (mAcc).

3D Object Recognition. The results of the proposed approach and other methods on the ModelNet40 dataset are shown in Tab. 4. We build our network here based on the architecture designed in [60]. We also demonstrate in Fig. 6 the corresponding visualization results of different approaches, where the column termed as “Fixed Aggr.” in Fig. 6 corresponds to the “Vanilla” model in Tab. 4. With the proposed ANA, the 1-bit GNN model gains more than 10 percentage points over the vanilla model in both the overall accuracy and the mean class accuracy, while GNA improves both metrics by roughly 4 to 6 points. This improvement is also illustrated in Fig. 6, where the proposed meta aggregators help the 1-bit GNN learn a feature structure closer to that of the full-precision GNN model.
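The two metrics in Tab. 4 differ in how they weight classes: the overall accuracy averages over samples, whereas the mean class accuracy averages the per-class accuracies, so small classes count as much as large ones. A minimal sketch in plain Python, with hypothetical labels and predictions, is given below.

```python
from collections import defaultdict


def overall_and_mean_class_accuracy(labels, preds):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    overall = sum(correct.values()) / sum(total.values())
    mean_class = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, mean_class


# Hypothetical imbalanced example with 3 classes.
labels = [0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 1, 0, 2, 1]
acc, macc = overall_and_mean_class_accuracy(labels, preds)
print(acc, macc)  # 0.75 and about 0.667
```

In this toy case the dominant class is classified perfectly, which lifts the overall accuracy above the mean class accuracy; the same pattern appears in Tab. 4, where Acc exceeds mAcc for every method.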

5.3 Discussions

We discuss here the strengths and weaknesses of the two proposed meta aggregators, GNA and ANA. For the exclusive form GNA, the performance can potentially be further enhanced as novel aggregation schemes emerge, since such schemes can be added to the candidate aggregation pool. At the same time, the results of GNA depend on those of the individual aggregators in the pool, which is also a weakness: its performance is bottlenecked by that of the single selected aggregator. The diffused form ANA, on the other hand, may simultaneously retain the benefits of several popular aggregators. However, the mathematical form in Eq. 5 limits the types of aggregators that ANA can approximate, meaning that ANA may have little room for further improvement even as novel and prevailing aggregators emerge in the future.
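The contrast between the exclusive and the diffused form can be sketched on a toy scalar example. The code below is only an illustration under stated assumptions: the candidate pool (sum, mean, max), the function names, and the logits are hypothetical; the exclusive branch uses the generic Gumbel-max trick to pick one aggregator; and the hybrid branch uses a softmax-weighted mixture as a stand-in, which is not the form of Eq. 5.

```python
import math
import random

# Hypothetical candidate pool acting on scalar neighbor features.
AGGREGATORS = {
    "sum": lambda xs: sum(xs),
    "mean": lambda xs: sum(xs) / len(xs),
    "max": lambda xs: max(xs),
}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exclusive_aggregate(neighbor_values, logits, rng=None):
    """Pick exactly one aggregator (Gumbel-max; noise is off when rng is None)."""
    names = list(AGGREGATORS)
    scores = []
    for logit in logits:
        noise = 0.0
        if rng is not None:
            u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # avoid log(0)
            noise = -math.log(-math.log(u))
        scores.append(logit + noise)
    chosen = names[scores.index(max(scores))]
    return chosen, AGGREGATORS[chosen](neighbor_values)


def hybrid_aggregate(neighbor_values, logits):
    """Softmax-weighted mixture of all aggregators (a stand-in, not Eq. 5)."""
    weights = softmax(logits)
    outputs = [agg(neighbor_values) for agg in AGGREGATORS.values()]
    return sum(w * o for w, o in zip(weights, outputs))


neighbors = [1.0, -1.0, 1.0]  # binarized features of three neighbors
print(exclusive_aggregate(neighbors, [0.0, 0.0, 2.0]))  # ('max', 1.0)
print(hybrid_aggregate(neighbors, [0.0, 0.0, 0.0]))     # about 0.778
print(exclusive_aggregate(neighbors, [0.0, 0.0, 0.0], rng=random.Random(0))[0])
```

The exclusive output is always the output of one pool member, so it can be no better than the best candidate, whereas the mixture can produce values that no single candidate yields but is restricted to the family its parametric form can express.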

6 Conclusions

In this paper, we propose two learnable aggregation schemes for 1-bit compact GNNs. The goal of the proposed method is to enhance the topological discriminative ability of 1-bit GNNs. This is achieved by adaptively selecting a single aggregator, or by generating a hybrid aggregation form that simultaneously maintains the strengths of several aggregators. Moreover, the proposed meta aggregation schemes can be readily extended to full-precision GNN models. Experiments across various domains demonstrate that, with the proposed meta aggregators, the 1-bit GNN yields results on par with those of the cumbersome full-precision ones. In our future work, we will strive to generalize the proposed aggregators to compact and lightweight visual transformers.


Mr Yongcheng Jing is supported by ARC FL-170100117. Xinchao Wang is supported by the Start-up Fund of National University of Singapore.