1. Introduction
*Corresponding author (Email: email@example.com)
Graph Neural Networks (GNNs) have achieved great success in many graph data representation and learning tasks, such as semi-supervised learning, clustering and graph classification. The main component powering existing GNNs is the multi-layer network architecture, which learns rich nonlinear graph representations for specific learning tasks.
The core message propagation mechanism in multi-layer GNNs is neighborhood information aggregation, in which each node is updated by aggregating the information from its neighbors. Most existing GNNs adopt linear neighborhood aggregation (e.g., mean, sum) (Kipf and Welling, 2017; Xu et al., 2019), max aggregation (Hamilton et al., 2017), or a combination of them (Dehmamy et al., 2019; Corso et al., 2020) in their layer-wise message propagation. For example, Kipf and Welling (2017) present Graph Convolutional Networks (GCN), which adopt a weighted summation as the aggregation function. Veličković et al. (2018) present Graph Attention Networks (GAT), which use a learned weighted mean aggregator obtained via a self-attention mechanism. Hamilton et al. (2017) propose the general GraphSAGE framework, which employs mean, sum and max operations respectively for neighborhood aggregation. Dehmamy et al. (2019) propose a modular GCN design that combines different neighborhood aggregation rules (mean, sum, normalized mean) with residual connections. Xu et al. (2019) propose Graph Isomorphism Networks (GIN), in which summation is used for neighborhood information aggregation in graph classification tasks. Geisler et al. (2020) propose a reliable GNN by developing an aggregation function based on robust statistics, implemented as an adaptively weighted mean aggregation. Corso et al. (2020) propose Principal Neighbourhood Aggregation (PNA), which integrates multiple aggregators (e.g., mean, max, min) via degree-scalers. Cai et al. (2021) propose Graph Neural Architecture Search (GNAS) to learn the optimal depth of message passing with max and sum neighbor aggregations.
Reviewing previous GNNs on various graph learning tasks, we make the following observations. First, most existing GNNs adopt linear aggregation functions as neighborhood information aggregators in their layer-wise message propagation, i.e., they learn the context-aware representation of each node by linearly aggregating the information from its neighbors. Thus, the overall nonlinearity of these linear-aggregator-based GNNs is determined by the number of hidden layers (the depth of the network). However, deeper GNNs usually suffer from the over-smoothing issue (Kipf and Welling, 2017; Li et al., 2018; Zhao and Akoglu, 2020). Therefore, the overall nonlinearity and capacity of these GNNs are generally limited. Second, the nonlinear max aggregator has been utilized in some works (Hamilton et al., 2017; Lee et al., 2017; Xu et al., 2019). However, the max aggregator keeps only the largest value in each feature dimension and thus fails to preserve the detailed information of node representations within each node’s neighborhood. Third, some recent works combine mean/sum and max aggregators to provide a combined aggregator (Corso et al., 2020). However, the combined aggregator depends explicitly on the individual mean/sum and max aggregators.
To address these issues, in this work we re-think the message aggregation mechanism in the layer-wise propagation of GNNs and aim to exploit flexible nonlinear aggregation functions for information aggregation in GNNs. Specifically, we develop three kinds of nonlinear neighborhood aggregators for GNNs’ message propagation by exploiting $\ell_p$-norm, polynomial and softmax functions respectively. Overall, there are three main aspects of the proposed methods. First, one important property of the proposed nonlinear neighborhood aggregators is that they can be regarded as intermediates between the linear mean/summation and the nonlinear max aggregators, and thus provide flexible, balanced aggregation strategies for GNNs. Our aggregation mechanisms therefore inherit both (i) nonlinearity, which increases the network’s functional complexity/capacity, and (ii) detail-sensitivity, which preserves the detailed information of representations in GNNs’ message propagation. Second, the proposed aggregators are all differentiable, which allows end-to-end training. Third, our aggregation mechanisms are general schemes which can be integrated with many GNN architectures to enhance existing GNNs’ capacities and learning performance. Overall, we summarize the main contributions as follows:
- We propose three kinds of nonlinear neighborhood aggregation schemes for the layer-wise propagation of general GNNs.
- We analyze the main properties of the proposed nonlinear neighborhood aggregators and show the balanced behavior of the proposed models.
- We incorporate the proposed nonlinear aggregation mechanisms into GNNs and propose several novel networks for graph data representation and learning tasks.
- We integrate our nonlinear aggregators into several GNN architectures, and experiments on several datasets show the effectiveness of the proposed aggregators.
2. Revisiting Neighborhood Aggregation in GNNs
GNNs provide a multi-layer network architecture for graph data representation and learning. The core operation of GNNs is the neighborhood aggregation used in the network’s layer-wise message propagation. Let $G(\mathbf{A}, \mathbf{H})$ be the input graph, where $\mathbf{A}$ denotes the adjacency matrix and $\mathbf{H}$ denotes the collection of node features. The neighborhood aggregation for layer-wise message propagation in GNNs (Hamilton et al., 2017; Geisler et al., 2020) can generally be formulated as follows,
$$\mathbf{h}_i^{(k+1)} = \sigma\Big(\mathcal{A}\big\{\mathbf{A}_{ij}\,\mathbf{h}_j^{(k)}\mathbf{W}^{(k)},\ j \in \mathcal{N}_i\big\}\Big) \tag{1}$$
where $\mathcal{N}_i$ denotes the neighborhood of node $i$ (including node $i$) and matrix $\mathbf{W}^{(k)}$ denotes the layer-wise weight parameter. Function $\sigma(\cdot)$ denotes the activation function, such as ReLU, and $\mathcal{A}$ denotes the aggregation function, such as mean, summation, max, etc. (Hamilton et al., 2017; Xu et al., 2019; Geisler et al., 2020).
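To make the role of the aggregation function concrete, the following is a minimal NumPy sketch of one propagation step with sum, mean and max aggregators. It is an illustration under stated assumptions rather than the implementation of any cited model: the function name `aggregate`, the dense adjacency layout and the choice of ReLU as the activation are hypothetical, and every node is assumed to have at least one neighbor (e.g., a self-loop).

```python
import numpy as np

def aggregate(A, H, W, mode="mean"):
    """One propagation step (hypothetical helper, dense matrices).

    A: (n, n) adjacency; A[i, j] != 0 means node j is a neighbor of node i.
    H: (n, d) node features; W: (d, d_out) layer weights.
    """
    A = np.asarray(A, dtype=float)
    Z = np.asarray(H, dtype=float) @ W           # transform every node once
    out = np.empty((A.shape[0], Z.shape[1]))
    for i in range(A.shape[0]):
        nbrs = np.nonzero(A[i])[0]               # neighborhood of node i
        weights = A[i, nbrs][:, None]            # edge weights, shape (k, 1)
        if mode == "sum":
            out[i] = (weights * Z[nbrs]).sum(axis=0)
        elif mode == "mean":
            out[i] = (weights * Z[nbrs]).sum(axis=0) / weights.sum()
        elif mode == "max":
            out[i] = Z[nbrs].max(axis=0)         # element-wise max, edge weights ignored
        else:
            raise ValueError(f"unknown mode: {mode}")
    return np.maximum(out, 0.0)                  # ReLU activation (assumed)

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
H = np.array([[1.0, -2.0], [3.0, 0.0], [5.0, 4.0]])
for mode in ("sum", "mean", "max"):
    print(mode, aggregate(A, H, np.eye(2), mode))
```

With a binary adjacency matrix, `mode="mean"` is the plain neighborhood average, while `mode="max"` keeps only the largest entry per feature dimension and discards the remaining neighbors.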
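For example, Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) adopt weighted summation aggregation over the normalized graph in their layer-wise propagation as,
$$\mathbf{H}^{(k+1)} = \sigma\big(\hat{\mathbf{A}}\,\mathbf{H}^{(k)}\mathbf{W}^{(k)}\big) \tag{2}$$
where $\hat{\mathbf{A}}$ denotes the normalized adjacency matrix in which $\hat{\mathbf{A}} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ and $\mathbf{D}$ is the diagonal matrix with $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$. Graph Attention Network (GAT) (Veličković et al., 2018) first computes the attention for each graph edge as,
$$\alpha_{ij} = \mathrm{softmax}_G\big(f(\mathbf{h}_i^{(k)}\mathbf{W}^{(k)},\ \mathbf{h}_j^{(k)}\mathbf{W}^{(k)};\ \Theta)\big) \tag{3}$$
where $\mathbf{W}^{(k)}$ denotes the layer-specific weight matrix and $\mathrm{softmax}_G$ denotes the softmax function defined on the topological graph. The function $f(\cdot;\Theta)$ denotes the learnable metric function parameterized by $\Theta$. The parameter $\Theta$ is shared across different edges to achieve information communication. Based on the learned edge attention $\alpha_{ij}$, GAT then conducts the layer-wise information aggregation as,
$$\mathbf{h}_i^{(k+1)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\,\mathbf{h}_j^{(k)}\mathbf{W}^{(k)}\Big) \tag{4}$$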
GraphSAGE (Hamilton et al., 2017) employs the max-pooling operation over the neighborhood and implements layer-wise message aggregation as follows,
$$\mathbf{h}_i^{(k+1)} = \max\big\{\mathrm{FC}_{\Phi}\big(\mathbf{h}_j^{(k)}\big),\ j \in \mathcal{N}_i\big\} \tag{5}$$
where $\mathrm{FC}_{\Phi}$ denotes the fully-connected network with an activation function and learnable parameter $\Phi$. The function $\max$ denotes the element-wise max operation, which takes the dimension-wise maximum value over a node’s neighbors.
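The GCN-style weighted summation over a normalized graph can be sketched in a few lines of NumPy. This is a hedged illustration, not the reference implementation: the name `gcn_layer` is hypothetical, dense matrices are used for readability, ReLU is assumed as the activation, and self-loops are added before the symmetric normalization, as in Kipf and Welling (2017).

```python
import numpy as np

def gcn_layer(A, H, W):
    """Symmetric-normalized weighted-sum propagation followed by ReLU (sketch)."""
    A = np.asarray(A, dtype=float)
    A_tilde = A + np.eye(A.shape[0])           # add self-loops (assumed absent in A)
    d = A_tilde.sum(axis=1)                    # node degrees including the self-loop
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency matrix
    return np.maximum(A_hat @ H @ W, 0.0)      # linear aggregation, then ReLU

# Two connected nodes: each output is the average of both inputs.
A = np.array([[0, 1], [1, 0]])
H = np.array([[2.0], [4.0]])
print(gcn_layer(A, H, np.array([[1.0]])))
```

The aggregation itself is the linear map `A_hat @ H`; the only nonlinearity in the layer comes from the activation applied afterwards.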
3. The Proposed Method
Most existing GNNs obtain nonlinear representations mainly via the activation function $\sigma(\cdot)$, such as ReLU, in their layer-wise message propagation. Therefore, the nonlinearity of the whole network is determined by the number of hidden layers. However, deeper GNNs usually suffer from the over-smoothing issue (Kipf and Welling, 2017; Li et al., 2018; Zhao and Akoglu, 2020). Therefore, the nonlinearity and the overall learning capacity of existing GNNs are still limited. Besides, some GNNs use the nonlinear max aggregation, which fails to preserve the detailed information of node representations within each node’s neighborhood. To overcome these issues, we re-think the message aggregation mechanism in the layer-wise propagation of GNNs and develop several nonlinear message aggregation functions to enhance the learning capacity of GNNs. To be specific, we propose three kinds of nonlinear aggregations, i.e., $\ell_p$-norm, polynomial and softmax aggregation. All these nonlinear aggregations provide a balanced, self-adjusting trade-off between the mean and max functions, as discussed in detail below.
3.1. $\ell_p$-norm Aggregation
As a general nonlinear function, the $\ell_p$-norm has been commonly used in the computer vision and signal processing fields (Gulcehre et al., 2014; Estrach et al., 2014). Here, we propose to exploit the $\ell_p$-norm function for neighborhood aggregation in GNNs’ layer-wise message propagation.
Given be the input graph where denotes the adjacency matrix and denotes the collection of node features. One straightforward way to adopt -norm for the neighborhood aggregation can be formulated as
Here, is defined as
where is a parameter which can be learned adaptively. However, the negative information has been ignored in Eq.(6). To address this issue, we adapt the above -norm function and propose our aggregation as follows,
where denotes the minimum element of matrix H and
denotes the feature vector of node.
The proposed $\ell_p$-norm aggregator is a general scheme which can be integrated with many GNN architectures, such as GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), etc. For example, one can replace the weighted summation in Eq.(2) with the proposed aggregator to produce a GCN variant (Kipf and Welling, 2017). Also, one can use the learned graph attention in Eq.(4) in place of the adjacency weights to generate a new GAT model.
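The behavior of an $\ell_p$-norm aggregator can be explored numerically with a short NumPy sketch. The exact form used here is an assumption made for illustration and is not taken from the paper's equations: features are shifted by the minimum element of H so that all entries are non-negative, powers are taken element-wise, and the shift is added back at the end. The names `lp_aggregate`, `h_min` and `scale` are hypothetical.

```python
import numpy as np

def lp_aggregate(A, H, p):
    """Assumed form: out_i = (sum_j A[i, j] * (H[j] - h_min)**p)**(1/p) + h_min."""
    A = np.asarray(A, dtype=float)
    H = np.asarray(H, dtype=float)
    h_min = H.min()                    # minimum element of H
    S = H - h_min                      # shifted features, all entries >= 0
    scale = S.max() or 1.0             # rescale to [0, 1] so large p does not overflow
    powered = (S / scale) ** p         # element-wise power
    return (A @ powered) ** (1.0 / p) * scale + h_min

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
H = np.array([[0.0, 2.0], [1.0, 5.0], [4.0, 3.0]])
print(lp_aggregate(A, H, p=1))    # weighted sum of the shifted neighbor features
print(lp_aggregate(A, H, p=200))  # close to the element-wise neighborhood max
```

In this sketch, intermediate values of `p` interpolate between the two printed extremes, which is the balancing behavior the section describes. Making `p` a learnable parameter would require an automatic differentiation framework and is left out here.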
In particular, one main property of the proposed $\ell_p$-norm aggregator is that it provides a balance between the max aggregator and the summation aggregator. Specifically, we have the following Proposition 1.
Proposition 1. When $p = 1$, the proposed aggregator reduces to the weighted summation aggregation; when $p \to \infty$, it is equivalent to the max aggregation.
Proof. When $p = 1$, one can easily observe that the aggregator reduces to the weighted summation aggregation. For the case $p \to \infty$, we first prove the following conclusion.
To be specific, given any vector and , without loss of generality, we assume be the unique maximum value of . Below, we can show that
For any , since , thus . Thus, we have
Based on the above analysis, one can see that
This completes the proof.
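The limiting step of the proof can be written out explicitly. The derivation below is a sketch under stated assumptions: $\mathbf{x} \in \mathbb{R}^n$ has non-negative entries, $x_1 > 0$ is its unique maximum, and the symbols are illustrative rather than copied from the original equations.

```latex
\begin{aligned}
\lim_{p\to\infty}\Big(\sum_{i=1}^{n} x_i^{p}\Big)^{1/p}
  &= \lim_{p\to\infty} x_1\Big(1+\sum_{i=2}^{n}\Big(\frac{x_i}{x_1}\Big)^{p}\Big)^{1/p}, \\
1 \;\le\; \Big(1+\sum_{i=2}^{n}\Big(\frac{x_i}{x_1}\Big)^{p}\Big)^{1/p}
  &\le\; n^{1/p} \;\xrightarrow[p\to\infty]{}\; 1
  \qquad \text{since } 0 \le \frac{x_i}{x_1} < 1, \\
\text{hence}\quad
\lim_{p\to\infty}\Big(\sum_{i=1}^{n} x_i^{p}\Big)^{1/p}
  &= x_1 = \max_i x_i .
\end{aligned}
```

Positive edge weights do not change the limit, because a fixed positive weight raised to the power $1/p$ also tends to 1.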
3.2. Polynomial Aggregation
In addition to the $\ell_p$-norm, we also explore the polynomial function (Wei et al., 2019) for neighborhood information aggregation.
Let $G(\mathbf{A}, \mathbf{H})$ be the input graph, where $\mathbf{A}$ denotes the adjacency matrix and $\mathbf{H}$ denotes the node features. Formally, we propose to define polynomial aggregation as
where the division operation in Eq.(11) is the element-wise division and is defined as
where is a parameter which controls its polynomial order. In this paper, the parameter value of is learned adaptively. Similar to the above -norm function, considering the inevitable negative elements in H and the power operation (Wei et al., 2019), we adapt the Eq.(11) and finally propose our as follows,
where is the minimum element of H and the division operation is the element-wise division.
Similar to the $\ell_p$-norm aggregator, the polynomial aggregator also gives a general scheme and can be integrated with many GNN architectures, such as GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), etc. In addition, it also provides a balanced aggregator between the max and mean aggregators (Wei et al., 2019). Formally, we have the following Proposition 2.
Proposition 2. When the polynomial order parameter is 0, the polynomial aggregator reduces to the weighted mean aggregation; as the parameter tends to infinity, it is equivalent to the max aggregation.
The mathematical proof of Proposition 2 can be obtained similarly by referring to (Wei et al., 2019).
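A polynomial aggregator of this kind can be sketched as a ratio of two weighted power sums, in the spirit of the polynomial pooling of Wei et al. (2019). The form below is an assumption for illustration, not the paper's exact equation: features are shifted by the minimum element of H, the numerator uses order `alpha + 1`, the denominator uses order `alpha`, and the shift is added back. The names `poly_aggregate`, `alpha` and `scale` are hypothetical.

```python
import numpy as np

def poly_aggregate(A, H, alpha):
    """Assumed form: out_i = sum_j A[i,j] s_j**(alpha+1) / sum_j A[i,j] s_j**alpha + h_min,
    where s_j = H[j] - h_min and powers and division are element-wise."""
    A = np.asarray(A, dtype=float)
    H = np.asarray(H, dtype=float)
    h_min = H.min()                        # minimum element of H
    S = H - h_min                          # shifted features, all entries >= 0
    scale = S.max() or 1.0                 # rescale to [0, 1] for numerical safety
    U = S / scale
    num = A @ (U ** (alpha + 1))
    den = A @ (U ** alpha)
    # Where the denominator vanishes, every neighbor sits at h_min: return h_min.
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return ratio * scale + h_min

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
H = np.array([[0.0, 2.0], [1.0, 5.0], [4.0, 3.0]])
print(poly_aggregate(A, H, alpha=0))    # weighted mean over each neighborhood
print(poly_aggregate(A, H, alpha=200))  # close to the element-wise neighborhood max
```

Under this assumed form, the shift cancels exactly at `alpha=0`, so the output equals the weighted mean of the original features.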
3.3. Softmax Aggregation
The nonlinear function softmax has been commonly used as activity function in GNNs. Given a vector , the standard softmax function is defined as follows,
where is a scale factor. It is known that softmax is a smooth approximation of argmax function, i.e., let be the maximum value of , then
In our work, we exploit softmax function for neighborhood information aggregation in GNNs. To be specific, given be the input graph where denotes the adjacency matrix and denotes the node features. Formally, our softmax based aggregator is defined as
where denotes the scalar parameter and indicates the element-wise product. denotes the feature vector of node . The division operation in Eq.(15) denotes the element-wise division. The operation is defined as
Similar to the above two nonlinear aggregators, the softmax aggregator can also be combined with GNN architectures such as GCN (Kipf and Welling, 2017) and GAT (Veličković et al., 2018). In addition, we can show that it provides an adaptive aggregator between the max and mean aggregations. Formally, we have the following proposition.
Proposition 3. When the scalar parameter is 0, the softmax aggregator reduces to the weighted mean aggregation; as the scalar parameter tends to infinity, it is equivalent to the max aggregation.
Proof. When the scalar parameter is 0, the aggregator obviously reduces to the weighted mean aggregator. For the limiting case, the proof can be easily obtained by using Eq.(14).
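The two limiting cases can be checked numerically with a small NumPy sketch of a softmax-weighted aggregator. The form is an assumption for illustration rather than the paper's exact equation: each neighbor's features are weighted, per feature dimension, by the edge weight times the exponential of the scaled feature, and the weights are normalized over the neighborhood. The names `softmax_aggregate` and `tau` are hypothetical, `tau` is assumed non-negative, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def softmax_aggregate(A, H, tau):
    """Assumed form: out_i = sum_j w_ij * H[j], with
    w_ij = A[i,j] exp(tau H[j]) / sum_k A[i,k] exp(tau H[k]) (element-wise)."""
    A = np.asarray(A, dtype=float)
    H = np.asarray(H, dtype=float)
    out = np.empty_like(H)
    for i in range(A.shape[0]):
        nbrs = np.nonzero(A[i])[0]                  # neighborhood of node i
        Hn = H[nbrs]                                # neighbor features, shape (k, d)
        logits = tau * (Hn - Hn.max(axis=0))        # shift by the max: exp never overflows
        w = A[i, nbrs][:, None] * np.exp(logits)    # unnormalized weights
        out[i] = (w * Hn).sum(axis=0) / w.sum(axis=0)
    return out

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
H = np.array([[0.0, 2.0], [1.0, 5.0], [4.0, 3.0]])
print(softmax_aggregate(A, H, tau=0.0))   # weighted mean over each neighborhood
print(softmax_aggregate(A, H, tau=50.0))  # close to the element-wise neighborhood max
```

Subtracting the neighborhood maximum before exponentiating leaves the normalized weights unchanged, and unlike the two previous sketches no shift by the minimum element is needed, since the exponential is defined for negative features.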
| Dataset | Cora | Citeseer | Pubmed | Amazon Photo | Amazon Computers | PPI |
| Masked GCN (baseline) | 81.44±1.19 | 69.10±1.98 | 80.19±1.40 | 88.60±3.08 | 76.06±7.09 |
4. Experiments
To validate the effectiveness of our proposed nonlinear aggregators, we take two GCN-based models (GCN (Kipf and Welling, 2017) and Masked GCN (Yang et al., 2019)) and two GAT-based models (GAT (Veličković et al., 2018) and CAT (He et al., 2021)) as baseline architectures and perform evaluations on several widely used graph learning datasets.
4.1. Experimental Settings
Dataset Setting. We test the proposed models on six datasets including Cora, Citeseer, Pubmed (Sen et al., 2008), PPI (Hamilton et al., 2017), Amazon Photo and Computers (Shchur et al., 2018). Similar to work (Kipf and Welling, 2017), we randomly select nodes per class as training set, nodes and nodes as validation and testing set respectively. For Amazon Photo and Computers (Shchur et al., 2018), following work (Shchur et al., 2018), we randomly select nodes per class as training set, nodes per class and the remaining nodes as validation and testing set respectively. For PPI dataset, similar to the setting in work (Veličković et al., 2018), we take graphs as training set, graphs as validation set and graphs as testing set. The introduction and usage of these datasets are summarized in Table 1.
Parameter Setting. We integrate the proposed nonlinear aggregators with GCNs and GATs respectively. For GCN-based methods, the number of hidden layer units is selected from for all datasets. For GAT-based methods, the number of hidden layer units is set to on all datasets. The weight decay is set to and for the citation and Amazon datasets, respectively. Our network parameters are trained and optimized by gradient descent algorithms (Kingma and Ba, 2015; Qian, 1999).
| Masked GCN (baseline) | 80.74±0.78 | 69.03±1.78 | 79.14±1.05 | 88.83±3.03 | 76.19±7.25 |
4.2. Comparison Results
Node Classification. We first test our proposed models on the transductive learning task and take some popular GNNs as baselines, including Graph Convolutional Network (GCN) (Kipf and Welling, 2017), Graph Attention Networks (GAT) (Veličković et al., 2018), Masked GCN (Yang et al., 2019) and CAT (He et al., 2021). Table 2 reports the comparison results. One can observe that the proposed nonlinear aggregators consistently improve the baseline models, which demonstrates the effectiveness of our nonlinear aggregation schemes in extending the network’s capacity and thus enhancing GNNs’ learning performance. We then test our proposed models on the inductive learning task. Table 4 reports the comparison results. The nonlinear GNN models achieve better results than the vanilla GNNs, which further indicates the advantages of our proposed nonlinear aggregators.
| Masked GCN (baseline) | 97.30±0.50 |
Node Clustering. We also evaluate the proposed methods on the semi-supervised node clustering task, as suggested in (He et al., 2021). Table 3 reports the comparison results. The nonlinear-aggregation-based models obtain higher performance, which shows the benefits of our proposed nonlinear message aggregation mechanisms for GNN learning.
4.3. Intuitive Demonstration
Here, we provide some visualizations to intuitively demonstrate the effect of the proposed nonlinear aggregations. We first use 2D t-SNE visualization (Maaten and Hinton, 2008). Figures 1 and 2 respectively show the comparison results of GCN (Kipf and Welling, 2017) and GCN-based variants on the Amazon Photo and Computers datasets. As shown in Figures 1 and 2, the node embeddings obtained by our proposed nonlinear GCNs are distributed more compactly and are more clearly separated than those of the vanilla GCN (Kipf and Welling, 2017). This is consistent with the experimental results reported in Section 4.2 and demonstrates the more expressive capacity of the proposed nonlinear aggregators. We then show the convergence behavior of GNNs with the nonlinear aggregations. Figure 3 shows the training loss across epochs for GCN (Kipf and Welling, 2017) and the three proposed nonlinear GCN variants on the citation datasets. One can note that our three proposed nonlinear GCNs have lower training loss, which indicates the higher capacity of the proposed aggregators.
5. Conclusion
In this paper, we re-think the neighborhood aggregation mechanism of GNNs and propose nonlinear message aggregation schemes to extend GNNs’ learning capacity. The proposed nonlinear aggregation operators are general and flexible strategies for GNNs, which provide intermediates between the commonly used max and mean/sum aggregations. We integrate our proposed nonlinear aggregators into several GNNs. Experiments on several benchmark datasets show the effectiveness of our proposed nonlinear aggregations in enhancing the learning capacity and performance of GNNs.
References
- Cai et al. (2021) Shaofei Cai, Liang Li, Jincan Deng, Beichen Zhang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2021. Rethinking graph neural architecture search from message-passing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6657–6666.
- Corso et al. (2020) Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Lio, and Petar Velickovic. 2020. Principal Neighbourhood Aggregation for Graph Nets. In Neural Information Processing Systems (NeurIPS) 33.
- Dehmamy et al. (2019) Nima Dehmamy, Albert-Laszlo Barabasi, and Rose Yu. 2019. Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology. In Advances in Neural Information Processing Systems.
- Estrach et al. (2014) Joan Bruna Estrach, Arthur Szlam, and Yann LeCun. 2014. Signal recovery from Pooling Representations. In Proceedings of the 31st International Conference on Machine Learning. 307–315.
- Geisler et al. (2020) Simon Geisler, Daniel Zügner, and Stephan Günnemann. 2020. Reliable Graph Neural Networks via Robust Aggregation. In Advances in Neural Information Processing Systems. 13272–13284.
- Gulcehre et al. (2014) Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio. 2014. Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks. In Proceedings of the 2014 European Conference on Machine Learning and Knowledge Discovery. 530–546.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Neural Information Processing Systems. 1024–1034.
- He et al. (2021) Tiantian He, L Bai, and Yew Soon Ong. 2021. Learning Conjoint Attentions for Graph Neural Nets. In Advances in Neural Information Processing Systems (NeurIPS) 34. Curran Associates, Inc.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.
- Lee et al. (2017) Chen-Yu Lee, Patrick Gallagher, and Zhuowen Tu. 2017. Generalizing Pooling Functions in CNNs: Mixed, Gated, and Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 863–875.
- Li et al. (2018) Q. Li, Z. Han, and X.-M. Wu. 2018. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In The Thirty-Second AAAI Conference on Artificial Intelligence.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
- Qian (1999) Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural networks 12, 1 (1999), 145–151.
- Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93–93.
- Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations (2018).
- Wei et al. (2019) Zhen Wei, Jingyi Zhang, Li Liu, Fan Zhu, Fumin Shen, Yi Zhou, Si Liu, Yao Sun, and Ling Shao. 2019. Building detail-sensitive semantic segmentation networks with polynomial pooling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7115–7123.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations.
- Yang et al. (2019) Liang Yang, Fan Wu, Yingkui Wang, Junhua Gu, and Yuanfang Guo. 2019. Masked Graph Convolutional Network. In International Joint Conference on Artificial Intelligence. 4070–4077.
- Zhao and Akoglu (2020) Lingxiao Zhao and Leman Akoglu. 2020. PairNorm: Tackling Oversmoothing in GNNs. In International Conference on Learning Representations.