1. Introduction
*Corresponding author (Email: jiangbo@ahu.edu.cn)

Graph Neural Networks (GNNs) have achieved great success in many graph representation and learning tasks, such as semi-supervised learning, clustering, and graph classification. The main aspect powering existing GNNs is the multi-layer network architecture, which learns rich nonlinear graph representations for specific learning tasks.
It is known that the core message-propagation mechanism in multi-layer GNNs is neighborhood information aggregation, in which each node is updated by aggregating the information from its neighbors. Most existing GNNs adopt either linear neighborhood aggregation (e.g., mean, sum) (Kipf and Welling, 2017; Xu et al., 2019), max aggregation (Hamilton et al., 2017), or a combination of them (Dehmamy et al., 2019; Corso et al., 2020) in their layer-wise message propagation. For example, Kipf and Welling (2017) present Graph Convolutional Networks (GCN), which adopt a weighted summation as the aggregation function. Veličković et al. (2018) present Graph Attention Networks (GAT), which use a learned weighted-mean aggregator via a self-attention mechanism. Hamilton et al. (2017) propose the general GraphSAGE, which employs mean, sum, and max operations respectively for neighborhood aggregation. Dehmamy et al. (2019) propose a modular GCN design that combines different neighborhood aggregation rules (mean, sum, normalized mean) together with residual connections. Xu et al. (2019) propose Graph Isomorphism Networks (GIN), in which the summation operation is utilized for neighborhood information aggregation in graph classification. Geisler et al. (2020) propose a reliable GNN by developing a robust aggregation function based on robust statistics, implemented via an adaptively weighted mean. Corso et al. (2020) propose Principal Neighbourhood Aggregation (PNA), which integrates multiple aggregators (e.g., mean, max, min) via degree-scalers. Cai et al. (2021) propose Graph Neural Architecture Search (GNAS) to learn the optimal depth of message passing with max and sum neighbor aggregations.

Reviewing the above GNNs on various graph learning tasks, we observe the following. First, most existing GNNs adopt linear aggregation functions as neighborhood information aggregators in their layer-wise message propagation, i.e., they learn the context-aware representation of each node by linearly aggregating the information from its neighbors. Thus, the overall nonlinearity of these linear-aggregator-based GNNs is determined by the number of hidden layers (the depth of the network). However, it is known that deeper GNNs usually suffer from the over-smoothing issue (Kipf and Welling, 2017; Li et al., 2018; Zhao and Akoglu, 2020). Therefore, the overall nonlinearity and capacity of these GNNs are generally limited. Second, the nonlinear max aggregator has been utilized in some works (Hamilton et al., 2017; Lee et al., 2017; Xu et al., 2019); however, the max aggregator obviously fails to preserve the detailed information of node representations within each node's neighborhood. Third, some recent works attempt to combine mean/sum and max aggregators to obtain a combined aggregator (Corso et al., 2020). However, the combined aggregator explicitly depends on the individual mean/sum and max aggregators.
To address these issues, in this work we rethink the message aggregation mechanism in the layer-wise propagation of GNNs and aim to fully exploit flexible nonlinear aggregation functions for information aggregation in GNNs. Specifically, we develop three kinds of nonlinear neighborhood aggregators for GNNs' message propagation by exploiting the $\ell_p$-norm, polynomial, and softmax functions, respectively. Overall, there are three main aspects of the proposed methods. First, an important property of the proposed nonlinear neighborhood aggregators is that they can be regarded as intermediates between the linear mean/summation and nonlinear max aggregators, and thus provide flexible, balanced aggregation strategies for GNNs. Hence, our aggregation mechanisms inherit both (i) nonlinearity, which increases the network's functional complexity and capacity, and (ii) detail-sensitivity, which preserves the detailed information of representations in GNNs' message propagation. Second, the proposed aggregators are all differentiable, allowing end-to-end training. Third, our aggregation mechanisms are general schemes that can be integrated with many GNN architectures to enhance existing GNNs' capacities and learning performance. Overall, we summarize the main contributions as follows:

We propose three kinds of nonlinear neighborhood aggregation schemes for general GNNs' layer-wise propagation.

We analyze the main properties of the proposed nonlinear neighborhood aggregators and show the balanced behavior of the proposed models.

We incorporate the proposed nonlinear aggregation mechanisms into GNNs and obtain several novel networks for graph data representation and learning tasks.

We integrate the proposed nonlinear aggregators into several GNN architectures; experiments on several datasets demonstrate the effectiveness of the proposed aggregators.
2. Revisiting Neighborhood Aggregation in GNNs
GNNs provide a multi-layer network architecture for graph data representation and learning. It is known that the core operation of GNNs is neighborhood aggregation for the network's layer-wise message propagation. Let $G(A, H)$ be the input graph, where $A \in \mathbb{R}^{n \times n}$ denotes the adjacency matrix and $H = (h_1, h_2, \dots, h_n)^\top \in \mathbb{R}^{n \times d}$ denotes the collection of node features. The neighborhood aggregation for layer-wise message propagation in GNNs (Hamilton et al., 2017; Geisler et al., 2020) can generally be formulated as follows,
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\,\mathrm{AGG}\big(\{h^{(l)}_j \mid j \in \mathcal{N}_i\}\big)\Big) \qquad (1)$$
where $\mathcal{N}_i$ denotes the neighborhood of node $v_i$ (including node $v_i$ itself) and matrix $W^{(l)}$ denotes the layer-wise weight parameter. Function $\sigma(\cdot)$ denotes the activation function, such as $\mathrm{ReLU}(\cdot)$ or $\mathrm{softmax}(\cdot)$, and $\mathrm{AGG}(\cdot)$ denotes the aggregation function, such as mean, summation, max, etc. (Hamilton et al., 2017; Xu et al., 2019; Geisler et al., 2020).

For example, Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) adopt a weighted summation aggregation over the normalized graph in their layer-wise propagation as,
$$H^{(l+1)} = \sigma\big(\tilde{A}\, H^{(l)}\, W^{(l)}\big) \qquad (2)$$
where $\tilde{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ denotes the normalized adjacency matrix, in which $I$ is the identity matrix and $\tilde{D}$ is the diagonal matrix with $\tilde{D}_{ii} = \sum_j (A + I)_{ij}$. Graph Attention Networks (GAT) (Veličković et al., 2018) first compute an attention coefficient for each graph edge as,
$$\alpha^{(l)}_{ij} = \mathrm{softmax}_{j \in \mathcal{N}_i}\big(f(W^{(l)} h^{(l)}_i,\ W^{(l)} h^{(l)}_j;\ \theta)\big) \qquad (3)$$
where $W^{(l)}$ denotes the layer-specific weight matrix and $\mathrm{softmax}_{j \in \mathcal{N}_i}$ denotes the softmax function defined over the topological graph. The function $f(\cdot\,;\theta)$ denotes a learnable metric function parameterized by $\theta$. The parameter $\theta$ is shared across different edges to achieve information communication. Based on the learned edge attention $\alpha^{(l)}_{ij}$, GAT then conducts the layer-wise information aggregation as,
$$h^{(l+1)}_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^{(l)}_{ij}\, W^{(l)} h^{(l)}_j\Big) \qquad (4)$$
GraphSAGE (Hamilton et al., 2017) employs a max-pooling operation over the neighborhood and implements layer-wise message aggregation as follows,
$$h^{(l+1)}_i = \max\big(\{\mathrm{MLP}(h^{(l)}_j) \mid j \in \mathcal{N}_i\}\big) \qquad (5)$$
where $\mathrm{MLP}(\cdot)$ denotes a fully-connected network with an activation function and a learnable parameter matrix. The function $\max(\cdot)$ denotes the element-wise max operation, which aggregates the dimension-wise maximum value over each node's neighbors.
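To make the standard aggregators above concrete, here is a minimal NumPy sketch of mean, sum, and element-wise max neighborhood aggregation (illustrative only, not the implementation of any cited work; the dense adjacency representation, the toy graph, and the function name are our own assumptions):

```python
import numpy as np

def aggregate(H, A, mode="mean"):
    """Aggregate neighbor features (rows of H) per node.

    H: (n, d) node features; A: (n, n) adjacency with self-loops.
    Returns the (n, d) matrix of aggregated neighborhood features.
    """
    out = np.zeros_like(H, dtype=float)
    for i in range(A.shape[0]):
        nbrs = H[A[i] > 0]             # features of N_i (incl. node i)
        if mode == "mean":
            out[i] = nbrs.mean(axis=0)
        elif mode == "sum":
            out[i] = nbrs.sum(axis=0)
        elif mode == "max":
            out[i] = nbrs.max(axis=0)  # element-wise (dimension-wise) max
    return out

# Toy graph: 3 nodes on a path, with self-loops.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
H = np.array([[1.0, 0.0],
              [2.0, 3.0],
              [0.0, 5.0]])
print(aggregate(H, A, "max"))  # node 1 aggregates over all three nodes
```

Note how the max aggregator keeps only the dimension-wise maximum and thus discards the detailed values of the remaining neighbors, which is exactly the detail-sensitivity issue discussed above.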
3. The Proposed Method
Most existing GNNs obtain nonlinear representations mainly via the activation function $\sigma(\cdot)$, such as $\mathrm{ReLU}(\cdot)$, in their layer-wise message propagation. Therefore, the nonlinearity of the whole network is determined by the number of hidden layers. However, it is known that deeper GNNs usually suffer from the over-smoothing issue (Kipf and Welling, 2017; Li et al., 2018; Zhao and Akoglu, 2020). Therefore, the nonlinearity and overall learning capacity of existing GNNs are still limited. Besides, some GNNs use the nonlinear max aggregation as the aggregator, which fails to preserve the detailed information of node representations within each node's neighborhood. To overcome these issues, we rethink the message aggregation mechanism in the layer-wise propagation of GNNs and develop several nonlinear message aggregation functions to enhance the learning capacity of GNNs. To be specific, we propose three kinds of nonlinear aggregations, i.e., $\ell_p$-norm, polynomial, and softmax aggregation. All of these nonlinear aggregations provide a balanced, self-adjusting interpolation between the mean and max functions, as discussed in detail below.
3.1. $\ell_p$-norm Aggregation

As a general nonlinear function, the $\ell_p$-norm has been commonly used in the computer vision and signal processing fields (Gulcehre et al., 2014; Estrach et al., 2014). Here, we propose to exploit the $\ell_p$-norm function for neighborhood aggregation in GNNs' layer-wise message propagation.

Let $G(A, H)$ be the input graph, where $A$ denotes the adjacency matrix and $H = (h_1, h_2, \dots, h_n)^\top$ denotes the collection of node features. One straightforward way to adopt the $\ell_p$-norm for neighborhood aggregation can be formulated as
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\Big(\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big|h^{(l)}_j\big|^{p}\Big)^{1/p}\Big) \qquad (6)$$
Here, the $\ell_p$-norm of a vector $x = (x_1, x_2, \dots, x_n)$ is defined as $\|x\|_p = \big(\sum_j |x_j|^p\big)^{1/p}$, where $p \geq 1$ is a parameter that can be learned adaptively. However, the negative information of node features is ignored in Eq. (6) due to the absolute-value operation. To address this issue, we adapt the above $\ell_p$-norm function and propose our $\ell_p$-norm aggregation as follows,
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\Big[\Big(\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big(h^{(l)}_j - h_{\min}\big)^{p}\Big)^{1/p} + h_{\min}\Big]\Big) \qquad (7)$$
where $h_{\min}$ denotes the minimum element of matrix $H$ and $h^{(l)}_j$ denotes the feature vector of node $v_j$.

Remark. The proposed $\ell_p$-norm aggregation is a general scheme which can be integrated with many GNN architectures, such as GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), etc. For example, one can replace the weighted summation aggregation with the $\ell_p$-norm aggregation in Eq. (2) to produce a GCN variant (Kipf and Welling, 2017). Also, one can use the learned graph attention in Eq. (4) in place of $\tilde{A}_{ij}$ to generate a new GAT model.
In particular, one main property of the proposed $\ell_p$-norm aggregation is that it provides a balanced aggregator between the max and summation aggregators. Specifically, we have the following Proposition 1.
Proposition 1:

When $p = 1$, the $\ell_p$-norm aggregation reduces to the weighted summation aggregation; when $p \to +\infty$, it is equivalent to the max aggregation.
Proof. When $p = 1$, one can easily observe that the $\ell_p$-norm aggregation reduces to the weighted summation aggregation. For the case $p \to +\infty$, we first prove the following conclusion.

To be specific, given any vector $x = (x_1, x_2, \dots, x_n)$ with $x_j \geq 0$ and weights $a_j > 0$, without loss of generality, we assume $x_1$ to be the unique maximum value of $x$. Below, we show that
$$\lim_{p \to +\infty}\Big(\sum_{j=1}^{n} a_j\, x_j^{p}\Big)^{1/p} = x_1 \qquad (8)$$
For any $j \neq 1$, since $x_j < x_1$, we have $\lim_{p \to +\infty} \big(\tfrac{x_j}{x_1}\big)^{p} = 0$. Thus, we obtain
$$\lim_{p \to +\infty} \sum_{j=1}^{n} a_j \Big(\frac{x_j}{x_1}\Big)^{p} = a_1 \qquad (9)$$
Based on the above analysis, one can see that
$$\lim_{p \to +\infty}\Big(\sum_{j=1}^{n} a_j\, x_j^{p}\Big)^{1/p} = x_1 \lim_{p \to +\infty}\Big(\sum_{j=1}^{n} a_j \Big(\frac{x_j}{x_1}\Big)^{p}\Big)^{1/p} = x_1 \lim_{p \to +\infty} a_1^{1/p} = x_1 \qquad (10)$$
This completes the proof.
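As a numerical sanity check of Proposition 1, the following NumPy sketch implements the min-shifted $\ell_p$ aggregation of Eq. (7) on a dense toy graph (the matrix form, variable names, and toy data are illustrative assumptions, not the paper's code): at $p = 1$ it returns the weighted sum of the shifted features, while for large $p$ it approaches the element-wise max over the neighborhood.

```python
import numpy as np

def lp_aggregate(H, A_tilde, p):
    """ℓp-norm neighborhood aggregation with the min-shift of Eq. (7).

    H: (n, d) features; A_tilde: (n, n) nonnegative weights; p >= 1.
    Shifting by the global minimum keeps every entry nonnegative
    before the element-wise p-th power is applied.
    """
    h_min = H.min()
    Hs = H - h_min                       # nonnegative shifted features
    agg = (A_tilde @ Hs ** p) ** (1.0 / p)
    return agg + h_min                   # shift back

# Toy fully connected 2-node graph with uniform weights.
A_t = np.array([[0.5, 0.5],
                [0.5, 0.5]])
H = np.array([[-1.0, 2.0],
              [ 3.0, 0.5]])

print(lp_aggregate(H, A_t, 1))    # p = 1: weighted sum of shifted features
print(lp_aggregate(H, A_t, 200))  # large p: approaches element-wise max
```

For $p = 200$ the output is within a few hundredths of the element-wise maxima $(3, 2)$, illustrating the balanced behavior between summation and max.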
3.2. Polynomial Aggregation
In addition to the $\ell_p$-norm, we also explore the polynomial function (Wei et al., 2019) for neighborhood information aggregation. Let $G(A, H)$ be the input graph, where $A$ denotes the adjacency matrix and $H$ denotes the node features. Formally, we propose to define the polynomial aggregation as
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\, \frac{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big(h^{(l)}_j\big)^{\alpha + 1}}{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big(h^{(l)}_j\big)^{\alpha}}\Big) \qquad (11)$$
where the division in Eq. (11) is element-wise and the power $(h^{(l)}_j)^{\alpha}$ is applied element-wise, in which $\alpha \geq 0$ is a parameter that controls the polynomial order. In this paper, the value of $\alpha$ is learned adaptively. Similar to the above $\ell_p$-norm function, considering the inevitable negative elements in $H$ and the power operation (Wei et al., 2019), we adapt Eq. (11) and finally propose our polynomial aggregation as follows,
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\Big[\frac{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big(h^{(l)}_j - h_{\min}\big)^{\alpha + 1}}{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \big(h^{(l)}_j - h_{\min}\big)^{\alpha}} + h_{\min}\Big]\Big) \qquad (12)$$
where $h_{\min}$ is the minimum element of $H$ and the division is element-wise.
Similar to the $\ell_p$-norm aggregation, the polynomial aggregation also gives a general scheme and can be integrated with many GNN architectures, such as GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), etc. In addition, the polynomial aggregation also provides a balanced aggregator between the max and mean aggregators (Wei et al., 2019). Formally, we have the following Proposition 2.
Proposition 2:

When $\alpha = 0$, the polynomial aggregation reduces to the weighted mean aggregation; when $\alpha \to +\infty$, it is equivalent to the max aggregation.
The mathematical proof of Proposition 2 can be obtained similarly by referring to the work of Wei et al. (2019).
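Proposition 2 can likewise be checked numerically. The sketch below implements the min-shifted polynomial aggregation of Eq. (12) on a dense toy graph (the matrix form, the small epsilon, and the toy data are illustrative assumptions, not the authors' code): $\alpha = 0$ yields the weighted mean, while a large $\alpha$ approaches the element-wise max.

```python
import numpy as np

def poly_aggregate(H, A_tilde, alpha):
    """Polynomial neighborhood aggregation in the spirit of Eq. (12).

    Element-wise ratio of (alpha+1)-th to alpha-th powers of the
    min-shifted features; alpha = 0 gives the weighted mean, large
    alpha approaches the element-wise max.
    """
    h_min = H.min()
    Hs = H - h_min + 1e-12               # small epsilon avoids 0/0
    num = A_tilde @ Hs ** (alpha + 1)
    den = A_tilde @ Hs ** alpha
    return num / den + h_min             # shift back

# Toy fully connected 2-node graph with uniform weights.
A_t = np.array([[0.5, 0.5],
                [0.5, 0.5]])
H = np.array([[-1.0, 2.0],
              [ 3.0, 0.5]])

print(poly_aggregate(H, A_t, 0))   # weighted mean of the two rows
print(poly_aggregate(H, A_t, 60))  # close to the element-wise max
```

The ratio form is convenient because, unlike the $\ell_p$ aggregator, it never takes a $p$-th root, so the order parameter $\alpha$ enters only through element-wise powers.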
3.3. Softmax Aggregation
The nonlinear softmax function has been commonly used as an activation function in GNNs. Given a vector $x = (x_1, x_2, \dots, x_n)$, the standard softmax function is defined as follows,
$$\mathrm{softmax}_\beta(x)_j = \frac{\exp(\beta x_j)}{\sum_{k=1}^{n} \exp(\beta x_k)} \qquad (13)$$
where $\beta$ is a scale factor. It is known that softmax is a smooth approximation of the argmax function, i.e., letting $x_{\max}$ be the maximum value of $x$, then
$$\lim_{\beta \to +\infty} \sum_{j=1}^{n} \mathrm{softmax}_\beta(x)_j\, x_j = x_{\max} \qquad (14)$$
In our work, we exploit the softmax function for neighborhood information aggregation in GNNs. To be specific, let $G(A, H)$ be the input graph, where $A$ denotes the adjacency matrix and $H$ denotes the node features. Formally, our softmax-based aggregator is defined as
$$h^{(l+1)}_i = \sigma\Big(W^{(l)}\, \frac{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \exp\big(\beta h^{(l)}_j\big) \odot h^{(l)}_j}{\sum_{j \in \mathcal{N}_i} \tilde{A}_{ij}\, \exp\big(\beta h^{(l)}_j\big)}\Big) \qquad (15)$$
where $\beta$ denotes a scalar parameter and $\odot$ indicates the element-wise product. $h^{(l)}_j$ denotes the feature vector of node $v_j$, and the division in Eq. (15) is element-wise. The operation $\exp(\beta h^{(l)}_j)$ is defined element-wise on the vector $\beta h^{(l)}_j$.
Similar to the above two nonlinear aggregators, the softmax aggregation can also be combined with GNN architectures such as GCN (Kipf and Welling, 2017) and GAT (Veličković et al., 2018). In addition, we can show that the softmax aggregation provides an adaptive aggregator between the max and mean aggregations.
Formally, we have the following proposition.
Proposition 3:

When $\beta = 0$, the softmax aggregation reduces to the weighted mean aggregation; when $\beta \to +\infty$, it is equivalent to the max aggregation.

When $\beta = 0$, Eq. (15) obviously reduces to the weighted mean aggregator. For $\beta \to +\infty$, the proof can easily be obtained by using Eq. (14).
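Proposition 3 admits the same kind of numerical check. The sketch below implements the softmax aggregator of Eq. (15) in dense form (an illustrative assumption, not the authors' implementation); $\beta$ smoothly interpolates between the weighted mean at $\beta = 0$ and the element-wise max as $\beta \to +\infty$.

```python
import numpy as np

def softmax_aggregate(H, A_tilde, beta):
    """Softmax neighborhood aggregation in the spirit of Eq. (15).

    Each output entry is a softmax-weighted average of the neighbors'
    entries: beta = 0 recovers the weighted mean and a large beta
    approaches the element-wise max (negative beta favors the min).
    """
    E = np.exp(beta * H)                  # element-wise exp(beta * h_j)
    num = A_tilde @ (E * H)               # weighted sum of exp(.) ⊙ h_j
    den = A_tilde @ E
    return num / den                      # element-wise division

# Toy fully connected 2-node graph with uniform weights.
A_t = np.array([[0.5, 0.5],
                [0.5, 0.5]])
H = np.array([[-1.0, 2.0],
              [ 3.0, 0.5]])

print(softmax_aggregate(H, A_t, 0.0))   # weighted mean
print(softmax_aggregate(H, A_t, 20.0))  # close to element-wise max
```

Unlike the $\ell_p$ and polynomial aggregators, no min-shift is needed here, since $\exp(\cdot)$ is well-defined and positive for negative inputs.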
Table 1. Statistics of the datasets used in our experiments.

Dataset  Cora  Citeseer  Pubmed  Amazon Photo  Amazon Computers  PPI
Nodes  2708  3327  19717  7487  13381  56944
Features  1433  3703  500  745  767  50
Edges  5429  4732  44338  119043  245778  266144
Classes  7  6  3  8  10  121
Table 2. Comparison results (%) on the transductive node classification task.

Method  Cora  Citeseer  Pubmed  Photo  Computers
GCN (baseline)  81.24±1.68  69.92±1.75  79.82±1.60  89.05±1.80  80.02±2.83
GCN + $\ell_p$-norm  82.57±1.88  71.03±0.74  80.72±1.00  90.86±1.06  81.98±2.27
GCN + polynomial  82.20±1.60  71.12±1.10  81.01±1.29  91.80±0.50  82.86±1.41
GCN + softmax  83.10±1.28  71.50±1.12  81.50±0.82  91.25±0.80  82.02±2.11
Masked GCN (baseline)  81.44±1.19  69.10±1.98  80.19±1.40  88.60±3.08  76.06±7.09
Masked GCN + $\ell_p$-norm  82.56±1.45  69.96±1.33  80.92±1.18  90.56±1.19  80.63±4.92
Masked GCN + polynomial  82.00±1.02  69.94±1.23  81.03±0.81  90.98±1.95  78.15±6.44
Masked GCN + softmax  82.88±1.00  70.10±1.04  81.50±1.06  90.57±1.56  81.50±2.75
GAT (baseline)  82.02±1.55  70.88±1.36  80.05±0.97  89.72±2.02  79.72±2.20
GAT + $\ell_p$-norm  83.22±0.78  72.24±0.99  80.49±1.14  90.37±2.10  81.14±1.80
GAT + polynomial  83.30±1.15  71.56±1.46  80.29±1.05  90.70±1.49  82.60±1.62
GAT + softmax  83.14±1.30  72.15±1.01  81.05±1.22  90.65±2.10  81.26±2.10
CAT (baseline)  82.45±1.40  72.46±1.33  80.12±1.10  88.04±2.14  80.53±1.85
CAT + $\ell_p$-norm  82.93±1.30  72.50±0.87  81.52±1.28  90.54±1.08  81.10±1.92
CAT + polynomial  82.60±1.45  71.78±1.15  81.35±1.91  90.65±0.81  82.66±1.43
CAT + softmax  83.34±1.41  72.61±0.80  82.04±1.40  91.21±0.66  82.90±1.98
4. Experiments
To validate the effectiveness of the proposed nonlinear aggregators, we take two GCN-based models (GCN (Kipf and Welling, 2017) and Masked GCN (Yang et al., 2019)) and two GAT-based models (GAT (Veličković et al., 2018) and CAT (He et al., 2021)) as baseline architectures and perform evaluations on several widely used graph learning datasets.
4.1. Experimental Settings
Dataset Setting. We test the proposed models on six datasets: Cora, Citeseer, Pubmed (Sen et al., 2008), PPI (Hamilton et al., 2017), and Amazon Photo and Computers (Shchur et al., 2018). For the citation networks, similar to work (Kipf and Welling, 2017), we randomly select 20 nodes per class as the training set, and 500 nodes and 1000 nodes as the validation and testing sets, respectively. For Amazon Photo and Computers, following work (Shchur et al., 2018), we randomly select 20 nodes per class as the training set, 30 nodes per class as the validation set, and the remaining nodes as the testing set. For the PPI dataset, similar to the setting in work (Veličković et al., 2018), we take 20 graphs as the training set, 2 graphs as the validation set and 2 graphs as the testing set. The statistics of these datasets are summarized in Table 1.
Parameter Setting. We integrate the proposed nonlinear aggregators with the GCN and GAT architectures, respectively. For the GCN-based methods, the number of hidden-layer units is selected on the validation set for each dataset; for the GAT-based methods, it is kept fixed across all datasets. The weight decay is set separately for the citation and Amazon datasets. The network parameters are trained and optimized by gradient descent algorithms (Kingma and Ba, 2015; Qian, 1999).
Table 3. Comparison results (%) on the semi-supervised node clustering task.

Method  Cora  Citeseer  Pubmed  Photo  Computers
GCN (baseline)  80.24±0.97  70.15±0.60  79.10±1.22  89.00±1.45  80.36±2.40
GCN + $\ell_p$-norm  82.40±0.82  70.58±0.83  79.98±1.21  91.04±1.10  82.43±1.95
GCN + polynomial  82.14±0.95  71.00±0.89  79.96±0.98  92.06±0.33  83.14±1.38
GCN + softmax  82.87±0.91  71.28±0.82  80.64±0.89  91.22±0.82  82.53±1.82
Masked GCN (baseline)  80.74±0.78  69.03±1.78  79.14±1.05  88.83±3.03  76.19±7.25
Masked GCN + $\ell_p$-norm  82.40±0.66  69.76±1.07  80.26±0.85  90.76±1.16  80.92±5.00
Masked GCN + polynomial  81.90±0.42  69.83±1.02  80.00±1.06  91.15±1.90  78.74±6.28
Masked GCN + softmax  82.58±0.85  69.96±1.08  80.76±1.26  90.79±1.52  81.87±2.69
GAT (baseline)  82.55±0.40  71.02±0.30  80.01±0.71  88.55±0.70  78.98±1.20
GAT + $\ell_p$-norm  83.62±0.56  72.30±0.45  80.40±1.16  90.61±1.33  81.93±1.91
GAT + polynomial  83.68±0.65  71.76±1.10  80.22±0.80  90.78±1.18  81.06±1.20
GAT + softmax  83.50±0.65  71.73±0.76  80.89±1.10  90.84±1.15  80.54±1.94
CAT (baseline)  81.98±1.01  70.35±0.72  77.85±3.48  89.67±1.16  80.55±1.69
CAT + $\ell_p$-norm  82.59±0.82  71.68±0.66  79.75±1.27  90.72±1.05  81.88±1.78
CAT + polynomial  82.10±1.10  70.49±0.85  79.92±1.16  90.83±0.77  83.00±1.39
CAT + softmax  82.60±1.03  71.78±0.75  80.40±1.24  91.42±0.61  83.25±1.94
4.2. Comparison Results
Node Classification. We first test the proposed models on the transductive learning task, taking several popular GNNs as baselines, including Graph Convolutional Networks (GCN) (Kipf and Welling, 2017), Graph Attention Networks (GAT) (Veličković et al., 2018), Masked GCN (Yang et al., 2019) and CAT (He et al., 2021). Table 2 reports the comparison results. One can observe that the proposed nonlinear aggregators consistently improve the baseline models, which demonstrates their effectiveness in extending the network's capacity and thus enhancing GNNs' learning performance. We then test the proposed models on the inductive learning task. Table 4 reports the comparison results. The nonlinear GNN models achieve better results than the vanilla GNNs, which further indicates the advantages of the proposed nonlinear aggregators.
Table 4. Comparison results on the inductive learning task (PPI dataset).

Method  PPI
GCN (baseline)  97.25±0.34
GCN + $\ell_p$-norm  98.64±0.01
GCN + polynomial  98.60±0.02
GCN + softmax  98.70±0.44
Masked GCN (baseline)  97.30±0.50
Masked GCN + $\ell_p$-norm  98.41±0.18
Masked GCN + polynomial  98.52±0.11
Masked GCN + softmax  97.95±0.29
GAT (baseline)  97.30±0.30
GAT + $\ell_p$-norm  98.22±0.60
GAT + polynomial  98.60±0.40
GAT + softmax  98.54±0.70
CAT (baseline)  97.80±0.25
CAT + $\ell_p$-norm  98.32±0.13
CAT + polynomial  98.44±0.08
CAT + softmax  98.12±0.17
Node Clustering. We also evaluate the proposed methods on the semi-supervised node clustering task, as suggested in work (He et al., 2021). Table 3 reports the comparison results. Note that the nonlinear-aggregation-based models obtain higher performance, which shows the benefits of the proposed nonlinear message aggregation mechanisms for GNNs' learning.
4.3. Intuitive Demonstration
Here, we provide some visualizations to intuitively demonstrate the effect of the proposed nonlinear aggregations. We first use 2D t-SNE (Maaten and Hinton, 2008) to visualize the learned node embeddings. Figures 1 and 2 respectively show the comparison results of GCN (Kipf and Welling, 2017) and its GCN-based variants on the Amazon Photo and Computers datasets. As shown in Figures 1 and 2, the node embeddings obtained by our nonlinear GCNs are distributed more compactly and clearly than those of the vanilla GCN (Kipf and Welling, 2017). This is consistent with the experimental results shown in Tables 2 and 3 and demonstrates the greater expressive capacity of the proposed nonlinear aggregators. We then show the convergence property of the nonlinear-aggregation-based GNNs. Figure 3 shows the training loss across epochs of GCN (Kipf and Welling, 2017) and its $\ell_p$-norm, polynomial and softmax variants on the citation datasets. One can note that the three proposed nonlinear GCNs attain lower training loss, which indicates the higher capacity of the proposed aggregators.

5. Conclusion
In this paper, we rethink the neighborhood aggregation mechanism of GNNs and propose nonlinear message aggregation schemes to extend GNNs' learning capacity. The proposed nonlinear aggregation operators are general and flexible strategies for GNNs, providing intermediates between the commonly used max and mean/sum aggregations. We integrate the proposed nonlinear aggregators into several GNNs. Experiments on several benchmark datasets show the effectiveness of the proposed nonlinear aggregations in enhancing the learning capacity and performance of GNNs.
References
Shaofei Cai, Liang Li, Jincan Deng, Beichen Zhang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2021. Rethinking Graph Neural Architecture Search from Message-Passing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6657–6666.
Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. 2020. Principal Neighbourhood Aggregation for Graph Nets. In Advances in Neural Information Processing Systems (NeurIPS) 33.
Nima Dehmamy, Albert-László Barabási, and Rose Yu. 2019. Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology. In Advances in Neural Information Processing Systems.
Joan Bruna Estrach, Arthur Szlam, and Yann LeCun. 2014. Signal Recovery from Pooling Representations. In Proceedings of the 31st International Conference on Machine Learning. 307–315.
Simon Geisler, Daniel Zügner, and Stephan Günnemann. 2020. Reliable Graph Neural Networks via Robust Aggregation. In Advances in Neural Information Processing Systems. 13272–13284.
Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio. 2014. Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. 530–546.
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Neural Information Processing Systems. 1024–1034.
Tiantian He, L. Bai, and Yew Soon Ong. 2021. Learning Conjoint Attentions for Graph Neural Nets. In Advances in Neural Information Processing Systems (NeurIPS) 34. Curran Associates, Inc.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.
Chen-Yu Lee, Patrick Gallagher, and Zhuowen Tu. 2017. Generalizing Pooling Functions in CNNs: Mixed, Gated, and Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 863–875.
Q. Li, Z. Han, and X.-M. Wu. 2018. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In The Thirty-Second AAAI Conference on Artificial Intelligence.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
Ning Qian. 1999. On the Momentum Term in Gradient Descent Learning Algorithms. Neural Networks 12, 1 (1999), 145–151.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Magazine 29, 3 (2008), 93.
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of Graph Neural Network Evaluation. arXiv preprint arXiv:1811.05868 (2018).
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
Zhen Wei, Jingyi Zhang, Li Liu, Fan Zhu, Fumin Shen, Yi Zhou, Si Liu, Yao Sun, and Ling Shao. 2019. Building Detail-Sensitive Semantic Segmentation Networks with Polynomial Pooling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7115–7123.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations.
Liang Yang, Fan Wu, Yingkui Wang, Junhua Gu, and Yuanfang Guo. 2019. Masked Graph Convolutional Network. In International Joint Conference on Artificial Intelligence. 4070–4077.
Lingxiao Zhao and Leman Akoglu. 2020. PairNorm: Tackling Oversmoothing in GNNs. In International Conference on Learning Representations.