Introduction
Many real-world applications involve graph data, such as social networks [Zhang and Chen 2018], chemical molecules [Gilmer et al. 2017], and recommender systems [Berg, Kipf, and Welling 2017]. The complicated structures of these graphs have inspired new machine learning methods [Cai, Zheng, and Chang 2018, Wu et al. 2019b]. Much recent attention and progress has centered on graph neural networks, which have been successfully applied to social network analysis [Battaglia et al. 2016], recommendation systems [Ying et al. 2018], and machine reading comprehension [Tu et al. 2019, De Cao, Aziz, and Titov 2018]. Recently, a novel architecture that brings the attention mechanism into Graph Neural Networks (GNNs), called Graph Attention Networks (GATs), was introduced [Veličković et al. 2017].
GAT was motivated by the attention mechanism in natural language processing [Vaswani et al. 2017, Devlin et al. 2018]. It computes the representation of each node by attending to its neighbors via masked self-attention. For each node, different weights are learned by the attention function, so that the nodes in the same neighborhood receive different weights in the feature aggregation step. Inspired by this attention-based architecture, several new attention-based GNNs have been proposed and have achieved state-of-the-art performance on node classification benchmarks [Liu et al. 2018, Zhang et al. 2018, Ryu, Lim, and Kim 2018, Lee, Rossi, and Kong 2018, Thekumparampil et al. 2018, Abu-El-Haija et al. 2018].

However, attention-based GNNs suffer from overfitting and over-smoothing: (1) learned attention functions, which use node features to assign an importance weight to every neighboring node, tend to overfit the training data because the masked self-attention forces attention weights to be computed only for direct neighbors. (2) Over-smoothing arises for nodes that are connected but lie on different sides of the class decision boundary. Due to information exchange over these edges, stacking multiple attention layers causes excessive smoothing of node features and makes nodes from different classes indistinguishable.
Here we develop a framework called Constrained Graph Attention Networks (CGATs) that addresses the above shortcomings of GATs via margin-based constraints. Margin-based constraints act as regularizers that prevent the overfitting of GATs. For example, by requiring that the learned attention weights of nodes in the neighborhood be greater than those of nodes outside the neighborhood by a large margin, we guide the attention weights to separate a node's neighboring and non-neighboring nodes by a margin, which the attention weights in GATs cannot do. This helps the model learn an attention function that generalizes well to unseen graph structures.
To overcome the problem of over-smoothing, we propose constraints on attention based on class labels, together with a new feature aggregation function that selects only the neighbors with the top-$k$ attention weights for feature aggregation. The purpose of the proposed aggregation function is to reduce information propagation between nodes belonging to different classes.
To train the proposed CGAT model efficiently and effectively, we develop a layer-wise adaptive negative sampling strategy. In contrast to uniform negative sampling, which is inefficient because many negative samples provide no meaningful information, our negative sampling method obtains highly informative negative nodes in a layer-wise adaptive way.
We evaluate the proposed approach on four node classification benchmarks: the Cora, Citeseer, and Pubmed citation networks, as well as an inductive protein–protein interaction (PPI) dataset. Extensive experiments demonstrate the effectiveness of our approach in terms of classification accuracy and generalization to unseen graph structures: our CGAT models improve consistently over the state-of-the-art GATs on all four datasets, and in particular achieve a new state-of-the-art accuracy on the inductive PPI data.
In summary, we make the following contributions in this paper:


We provide new insights and mathematical analysis of attention-based GNN models for node classification and their associated challenges;

We propose a new constrained graph attention network (CGAT) that places constraints on the node attention to overcome the problems of overfitting and over-smoothing, together with an adaptive layer-wise negative sampling strategy to train CGAT efficiently and effectively;

We propose an aggregation strategy that further remedies over-smoothing at the class boundary by selecting the top-$k$ neighbors for feature aggregation;

Our extensive experimental results and analysis demonstrate the benefit of the CGAT model, showing consistent gains over state-of-the-art graph attention models on standard benchmark datasets for graph node classification.
Related Work
GNNs can be broadly divided into two groups, spectral and non-spectral models [Cai, Zheng, and Chang 2018, Hamilton, Ying, and Leskovec 2017, Veličković et al. 2017], according to the type of convolution operation defined on graphs. The former builds convolution operations from Laplacian eigenvectors [Bruna et al. 2013, Henaff, Bruna, and LeCun 2015, Defferrard, Bresson, and Vandergheynst 2016], and such models are usually difficult to generalize to graphs with unseen structures [Monti et al. 2017]. Non-spectral methods define convolution operations directly on spatially close neighbors and usually exhibit better performance on unseen graphs [Duvenaud et al. 2015, Atwood and Towsley 2016, Hamilton, Ying, and Leskovec 2017, Niepert, Ahmed, and Kutzkov 2016, Monti et al. 2017]. Our algorithm conceptually belongs to the non-spectral approaches.

Graph Convolutional Neural Networks (GCNs)
generalize convolution operations from traditional image data to graphs. The key point is to find a function that generates a node's representation by aggregating its own features as well as its neighbors' features [Wu et al. 2019b]. Example models include SSE [Dai et al. 2018], MPNN [Gilmer et al. 2017], GraphSAGE [Hamilton, Ying, and Leskovec 2017], DCNN [Atwood and Towsley 2016], StoGCN [Chen, Zhu, and Song 2017], LGCN [Gao, Wang, and Ji 2018], and more. These GCNs usually treat all nodes of the same neighborhood equally in the feature aggregation step.

Graph Attention Networks (GATs) generalize the attention operation to graph data. GATs assign different importance to nodes of the same neighborhood at the feature aggregation step and thereby increase the model capacity of GNNs [Veličković et al. 2017]. Based on this framework, different attention-based GNN architectures have been proposed, including GaAN [Zhang et al. 2018], AGNN [Thekumparampil et al. 2018], GeniePath [Liu et al. 2018], and others. Different models usually use different attention functions to compute the importance of the nodes in a neighborhood. However, such attention functions suffer from overfitting when learning the attention weights, and if there are edges between different clusters, these GNNs easily over-smooth the node representations, which hurts performance on the downstream node classification task.
Analysis of GATs
In this section we briefly review the GAT model and identify its weaknesses.
Notation. Let $G = (V, E)$ be a graph, where $V$ is the set of nodes (or vertices), $E$ is the set of edges connecting pairs of nodes in $V$, and $X \in \mathbb{R}^{n \times d}$ represents the node input features, where each row $x_i = X[i, :]$ is a $d$-dimensional vector of attribute values of node $i$ ($i = 1, \dots, n$). In this paper, we consider undirected graphs. Let $A$ be the (weighted) adjacency matrix of $G$, with degree matrix $D = \mathrm{diag}(d_1, \dots, d_n)$ and $d_i = \sum_j A_{ij}$. The graph Laplacian of $G$ is defined as $L = D - A$, and the random-walk normalized Laplacian is $L_{rw} = D^{-1} L = I - D^{-1} A$.

Node classification. Suppose a subset of the nodes in $V$ is labeled; the goal of node classification is to predict the labels of the remaining unlabeled nodes. Many graph-based node-labeling methods make the cluster assumption, which assumes that connected nodes in the graph are likely to share the same label [Weston et al. 2012, Li, Han, and Wu 2018].
Attention-based GNNs. An attention-based GNN uses the following layer-wise attention-based aggregation function to compute the embedding of each node $i$:

$$h_i^{(l+1)} = \sigma\Big(\textstyle\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\Big), \qquad \alpha_{ij}^{(l)} = \operatorname{softmax}_{j \in \mathcal{N}(i)}\big(f(h_i^{(l)}, h_j^{(l)}; \theta^{(l)})\big) \qquad (1)$$

where $W^{(l)}$ is a trainable weight matrix shared by the $l$-th layer, $\sigma$ is the activation function, $h_i^{(l)}$ is the node embedding produced by the $l$-th layer, and $h_i^{(0)} = x_i$. $\mathcal{N}(i)$ is the set of $i$'s one-hop neighboring nodes and also includes $i$ itself (i.e., there is a self-loop on each node). $\alpha_{ij}^{(l)}$ is the $l$-th-layer attention weight between the target node $i$ and the neighboring node $j$, generated by applying a softmax to the values computed by an attention function $f(\cdot\,; \theta)$, where $\theta$ denotes the trainable parameters of the attention function. For the GAT model [Veličković et al. 2017], $f(h_i, h_j) = \mathrm{LeakyReLU}\big(a^\top [W h_i \,\|\, W h_j]\big)$, where $\|$ denotes concatenation as in [Veličković et al. 2017]. In this paper, we use GAT to refer to all attention-based GNN models.

The overfitting problem of GATs. The attention functions in GAT compute attention values based on the features of pairs of connected nodes (see Eq. 1). To train such attention functions, there is only one source of guidance for their parameters: the classification error. In other words, the supervised information used to learn these attention functions comes only from the labels of the nodes.
There are two common sources of overfitting in machine learning: 1) lack of sufficient supervision for parameter learning, and 2) an over-parameterized model. We believe that the overfitting of GAT comes from the former: lack of sufficient supervision. The supervision available to GAT for learning the attention parameters is limited and indirect, since the supervision signal can only come from the node labels used for classification. In general, less supervision leads to more overfitting [Trevor, Robert, and JH 2009]. The learned attention function performs well on the training data but fails to generalize and is not robust to perturbation. We demonstrate this in the experimental section with a robustness test.
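To make Eq. 1 concrete, here is a minimal NumPy sketch of a single attention-based aggregation layer, assuming the concatenation-based LeakyReLU scorer of [Veličković et al. 2017]; the function name and the dense double loop are illustrative, not the authors' implementation:

```python
import numpy as np

def gat_layer(H, adj, W, a, slope=0.2):
    """One attention-based aggregation layer in the spirit of Eq. 1.

    H:   (N, F)  node features from the previous layer
    adj: (N, N)  0/1 adjacency matrix with self-loops added
    W:   (F, Fp) trainable projection matrix
    a:   (2*Fp,) trainable attention vector (concatenation scorer)
    Returns the aggregated features and the attention matrix alpha.
    """
    Z = H @ W
    n = Z.shape[0]
    e = np.full((n, n), -np.inf)              # masked self-attention:
    for i in range(n):                        # scores only for direct neighbors
        for j in range(n):
            if adj[i, j] > 0:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else slope * s   # LeakyReLU
    # row-wise softmax over each node's neighborhood
    ex = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = ex / ex.sum(axis=1, keepdims=True)
    return alpha @ Z, alpha
```

Each row of `alpha` sums to one, which is what lets the later analysis treat the attention matrix as a random-walk transition matrix.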
The over-smoothing problem of GATs. To facilitate the analysis, we focus on the attention aggregation and simplify Eq. 1 in matrix form as $H^{(l+1)} = M^{(l)} H^{(l)} W^{(l)}$,¹ where $M^{(l)}$ is the attention matrix, with $M^{(l)}_{ij} = \alpha^{(l)}_{ij}$ if $j \in \mathcal{N}(i)$ and $M^{(l)}_{ij} = 0$ otherwise, and $H^{(l)}$ stacks the node embeddings $h^{(l)}_i$ as rows. Then we have the following proposition (see the proof in the Appendix): a single attention layer acts as a kind of random-walk Laplacian smoothing.

¹ Similar to [Li, Han, and Wu 2018], we omit the nonlinear activation function $\sigma$. In fact, [Wu et al. 2019a] shows evidence that similar performance is observed when there is no nonlinearity after the aggregation step.
Proposition 1.
Let the matrix $\tilde{L} = I - M$ be a random-walk normalized Laplacian of the graph $G$. Then a single attention layer, $H^{(l+1)} = (I - \tilde{L}) H^{(l)} W^{(l)}$, is equivalent to a Laplacian smoothing operation.
Let $P$ be the transition probability matrix of a connected undirected graph $G$ with $n$ nodes, let $P^k_{ij}$ be the probability of being at node $j$ after a $k$-step walk on $G$ starting at node $i$, and let $d_j$ be the degree of node $j$. Then we have the following theorem (see the proof in the Appendix).

Theorem 1.
If the graph $G$ has no bipartite components, the random walk on $G$ with transition probability matrix $P$ converges to a unique stationary distribution $\pi$ with $\pi_j = d_j / \sum_k d_k$. That is, for any pair of nodes $(i, j)$, $\lim_{k \to \infty} P^k_{ij} = \pi_j$.
We can view the attention weight matrix $M$ as a random-walk transition probability matrix, since $M_{ij} \ge 0$ and $\sum_j M_{ij} = 1$. Therefore, suppose there are several connected components in the graph $G$. According to Theorem 1, by repeatedly applying random-walk Laplacian smoothing (which is similar to increasing the depth of GAT), the features of the nodes in each connected component converge to the same values. Under the cluster assumption of node classification, that nodes in the same connected component tend to share the same labels, this smoothing results in an easier classification problem. This is why GAT works for node classification.
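The convergence argument above can be checked numerically: repeatedly multiplying node features by a row-stochastic attention matrix of a connected, non-bipartite graph collapses all node features toward a common value. A small sketch, using uniform attention weights as a stand-in for learned ones:

```python
import numpy as np

# Path graph on 4 nodes with self-loops (self-loops break bipartiteness).
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic "attention" matrix
X = np.array([[1.0], [0.0], [0.0], [5.0]]) # initial 1-d node features

for _ in range(100):                       # stacking 100 "layers"
    X = P @ X
# all node features have converged to (nearly) the same value
```

The same collapse happens for any connected component, which is exactly the smoothing (and, when taken too far, over-smoothing) effect discussed above.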
Different Attention Weights at Every Layer. In practice, the attention weight matrices vary across layers. This differs from Theorem 1, which multiplies an identical matrix repeatedly. In fact, stacking multiple GAT layers is equivalent to a matrix-chain multiplication over multiple different attention weight matrices. Theorem 2 (see the proof in the Appendix) shows that GATs suffer from over-smoothing when they go deep, since the attention matrix at each layer can be viewed as a transition probability matrix on the graph.

Theorem 2.
Let $M^{(l)}$ ($l = 1, 2, \dots$) be transition probability matrices of the connected undirected graph $G$, corresponding to the attention scores of the $l$-th GAT layer. Then, for any pair of nodes $(i, j)$, $\lim_{L \to \infty} \big(\prod_{l=1}^{L} M^{(l)}\big)_{ij} = \pi_j$, where $\pi$ is the unique stationary distribution in Theorem 1.
In practice, most graphs contain bridge nodes that connect components with different labels. Theorem 2 states that if we increase the depth of GAT, then due to these boundary nodes the aggregated node features of different components become indistinguishable, leading to worse performance for deep GATs (see the observation of over-smoothing in the Appendix). We call this phenomenon over-smoothing.
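Theorem 2 can likewise be illustrated numerically: even when the row-stochastic attention matrix differs at every layer, the matrix chain still converges to a rank-one matrix, so all rows, and hence all aggregated node features, become identical. A sketch with randomly generated attention weights on a fixed graph:

```python
import numpy as np

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)

def random_attention(adj, rng):
    """A fresh row-stochastic attention matrix supported on the graph edges."""
    w = rng.random(adj.shape) * adj            # random positive scores on edges
    return w / w.sum(axis=1, keepdims=True)    # row-wise normalization

prod = np.eye(4)
for _ in range(200):                           # a different matrix per "layer"
    prod = prod @ random_attention(adj, rng)
# the chain converges to a rank-one matrix: all rows are (nearly) equal
```

Because every factor is row-stochastic on the same connected, non-bipartite graph, the product contracts toward a rank-one matrix, mirroring the matrix-chain argument in the text.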
Multi-head Attention is employed in GAT [Veličković et al. 2017]. Specifically, multiple independent attention heads are computed for feature aggregation at each layer, and the output of that layer is the concatenation of the outputs from all heads. To facilitate the analysis, we again focus only on the attention aggregation and simplify Eq. 1 on each head as $H^{(l+1)}_k = M^{(l)}_k H^{(l)} W^{(l)}_k$, where $M^{(l)}_k$ is the attention matrix of the $k$-th head ($1 \le k \le K_l$) in the $l$-th layer of GAT and $K_l$ is the number of heads in the $l$-th layer. The output of the $l$-th layer is $H^{(l+1)} = \big[H^{(l+1)}_1 \| \cdots \| H^{(l+1)}_{K_l}\big]$, where $\|$ denotes concatenation along the column (hidden) dimension. Expanding this equation for the previous layer, each independent component becomes $M^{(l)}_k H^{(l)} W^{(l)}_k = M^{(l)}_k \big[M^{(l-1)}_1 H^{(l-1)} W^{(l-1)}_1 \| \cdots\big] W^{(l)}_k$, in which each block contains a product of attention matrices such as $M^{(l)}_k M^{(l-1)}_{k'}$. We can perform this expansion recursively for all layers. Therefore, the output of the $l$-th layer consists of multiple components, each of which can be viewed as a matrix-chain multiplication of attention matrices from different heads and layers. According to Theorem 2, these matrix chains converge to the unique stationary distribution as $l \to \infty$. This means that multi-head GATs still suffer from over-smoothing if they go deep.
Residual connections are an effective way to maintain performance when increasing the depth of convolutional neural networks [He et al. 2016], and they have also been employed in GATs [Veličković et al. 2017, Liu et al. 2018]. To formally analyze the effect of residual connections, we introduce the concept of a lazy random walk. Let $P$ be a random-walk transition probability matrix; we define the transition probability matrix of the corresponding lazy random walk as $\tilde{P} = \frac{1}{2}(I + P)$. At every step, the lazy random walk stays at the current node with probability $1/2$ and moves away from it with probability $1/2$. A residual connection is thus a lazy-random-walk-based smoothing, up to a constant factor.² Therefore, since the lazy random walk matrix is itself a transition probability matrix, by Theorem 2 the features of all nodes in a connected component converge to the same values as more GAT layers are stacked. Moreover, the following theorem from [Chung 2005] answers how fast the lazy-random-walk-based smoothing process converges to its stationary distribution.

² The constant factor depends only on the number of layers and is the same for all nodes.
Theorem 3.
Suppose that a strongly connected directed graph $G$ on $n$ nodes has Laplacian eigenvalues $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$. Then $G$ has a lazy random walk with a rate of convergence of order $2\lambda_1^{-1}\big(-\log \min_i \pi_i\big)$; namely, after at most on the order of $\lambda_1^{-1}\big(-\log \min_i \pi_i\big)$ steps, the lazy random walk is arbitrarily close to its stationary distribution $\pi$.

Theorem 3 implies that it is difficult to prevent the over-smoothing of a deep GAT by simply adding residual connections: the lazy walk converges more slowly, but it still converges. This phenomenon has also been confirmed by the experiments in [Liu et al. 2018].
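The lazy-random-walk view of residual connections can also be checked numerically: the lazy transition matrix $\frac{1}{2}(I + P)$ mixes more slowly, but the node features still collapse to common values, matching the claim that residual connections alone cannot prevent over-smoothing. A small sketch:

```python
import numpy as np

# Path graph on 3 nodes with self-loops.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)
P_lazy = 0.5 * np.eye(3) + 0.5 * P         # residual connection as a lazy walk

X = np.array([[1.0], [0.0], [4.0]])        # initial 1-d node features
for _ in range(200):
    X = P_lazy @ X
# features still converge to a common value, only at a slower rate
```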
Constrained Graph Attention Networks
To address the problems of overfitting and over-smoothing in GATs, we propose a framework called constrained graph attention networks (CGATs) that adds constraints to both the attention function and the feature aggregation function. With these constraints, we improve the generalization ability and alleviate the over-smoothing problem of GAT. In the following, we first introduce two constraints on the attention computation, which give rise to two margin-based losses that guide the training of the GNN. Then, based on the constrained attention, we propose a new aggregation function that chooses a subset of neighboring nodes based on attention weights, rather than all neighbors, for feature aggregation, further reducing the over-smoothing of GAT.
Margin-based Constraints on Attention
To address the problem of overfitting, we can either use more data or apply regularization techniques when training the attention function. [Hamilton, Ying, and Leskovec 2017] utilized graph structure data to guide graph representation learning, achieving impressive performance improvements for node classification. They required that the similarity between nearby nodes be larger than that between disparate nodes, where nearby nodes are identified by a fixed-length random walk. This shows that the graph structure is very important for graph representation learning. Inspired by this idea, our first constraint on the attention function induces the computed attention weights to reflect the graph structure. More precisely, we require the attention weights between one-hop neighboring nodes to be greater than those between disparate nodes (including multi-hop neighboring nodes). This can be viewed as a simplified version of [Hamilton, Ying, and Leskovec 2017].
We apply a second constraint on the attention computation to address over-smoothing in GAT. Over-smoothing occurs when a pair of nodes with different class labels is connected, as the information of different classes gets mixed via such pairs of nodes. To limit this information exchange, we require that the attention weights between nodes sharing the same class label be greater than those between nodes with different class labels. We call this the class-boundary-based constraint.
For a given node $i$, let $\mathcal{N}(i)$ be the set of its one-hop neighboring nodes, and let $\mathcal{N}^-(i)$ and $\mathcal{N}^+(i)$ be the subsets of neighbors whose class labels differ from and agree with that of $i$, respectively.³ Fig. 1 illustrates the two margin-based constraints.

³ $i \in \mathcal{N}^+(i)$ due to the self-loop connection.

Loss from the graph-structure-based constraint:

$$\mathcal{L}_g = \sum_{i \in V} \sum_{j \in \mathcal{N}(i)} \sum_{k \notin \mathcal{N}(i)} \max\big(0,\; \zeta_1 + f(x_i, x_k; \theta) - f(x_i, x_j; \theta)\big) \qquad (2)$$

Loss from the class-boundary constraint:

$$\mathcal{L}_b = \sum_{i \in V} \sum_{j \in \mathcal{N}^+(i)} \sum_{k \in \mathcal{N}^-(i)} \max\big(0,\; \zeta_2 + f(x_i, x_k; \theta) - f(x_i, x_j; \theta)\big) \qquad (3)$$

where $\zeta_1 > 0$ and $\zeta_2 > 0$ are slack variables that control the margin between attention values, and $f$ is the attention function. Letting $x_i$ and $x_j$ denote the features of nodes $i$ and $j$, we use $f(x_i, x_j; \theta) = x_i^\top W_a x_j$ to compute the attention between the two nodes, where $W_a$ is a trainable matrix.
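A minimal sketch of the two margin-based hinge losses, operating on precomputed attention scores; the function names and the mean reduction are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def structure_loss(f_pos, f_neg, margin):
    """Hinge loss pushing attention scores of one-hop neighbors (f_pos)
    above those of non-neighbors (f_neg) by at least `margin` (cf. Eq. 2)."""
    return np.maximum(0.0, margin - f_pos + f_neg).mean()

def boundary_loss(f_same, f_diff, margin):
    """Hinge loss pushing attention scores toward same-class neighbors (f_same)
    above different-class neighbors (f_diff) by at least `margin` (cf. Eq. 3)."""
    return np.maximum(0.0, margin - f_same + f_diff).mean()
```

Both losses are zero exactly when the required margin is already satisfied, so they act purely as constraint regularizers on top of the classification loss.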
Adaptive Negative Sampling for GNN Training. Negative sampling has proven to be an effective way to optimize the loss function in Eq. 2. Uniformly sampling negative examples is inefficient, since many negative samples become easy to separate as training proceeds, and such negative examples provide no meaningful information for model training. Here we propose a new approach that chooses negative examples adaptively for each layer.
For a given node, we assume that the important negative samples are the non-neighboring nodes that contribute heavily to the feature aggregation of other nodes. That is, the more a node contributes to feature aggregation, the more likely it is to be a good negative candidate. We therefore apply importance sampling to choose negative nodes, where the importance of a node can be estimated by Proposition 2 (see the proof in the Appendix).

Proposition 2.
The importance of node $j$ to the feature aggregation at the $l$-th layer is proportional to the column sum $\sum_i M^{(l)}_{ij}$ of the attention matrix.
According to Proposition 2, we construct negative samples with a weighted random sampler, whose weights can be efficiently computed from the attention matrix $M^{(l)}$ by summing the attention weights in each column.
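A sketch of the layer-wise importance-based negative sampler implied by Proposition 2: candidate weights are the column sums of the layer's attention matrix, and the node's own neighborhood is excluded. All names are illustrative:

```python
import numpy as np

def sample_negatives(alpha, node, neighbors, num_samples, rng):
    """Draw negative (non-neighbor) nodes with probability proportional to
    their aggregate attention contribution, i.e. the column sums of the
    layer's attention matrix `alpha` (hypothetical helper, for illustration)."""
    n = alpha.shape[0]
    weights = alpha.sum(axis=0)               # importance of each node (Prop. 2)
    mask = np.ones(n, dtype=bool)
    mask[list(neighbors) + [node]] = False    # exclude the node's neighborhood
    probs = weights * mask
    probs = probs / probs.sum()               # renormalize over the candidates
    return rng.choice(n, size=num_samples, replace=False, p=probs)
```

Nodes that no one attends to get zero weight and are never drawn, which is exactly the "uninformative negatives" that uniform sampling would waste effort on.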
With these two constraints, we optimize the following loss function for node classification:

$$\mathcal{L} = \mathcal{L}_c + \lambda_1 \mathcal{L}_g + \lambda_2 \mathcal{L}_b \qquad (4)$$

where $\mathcal{L}_c$ represents the loss derived from the node classification error (e.g., cross-entropy loss for multi-class node classification), $\mathcal{L}_g$ and $\mathcal{L}_b$ are the constraint losses of Eqs. 2 and 3, and $\lambda_1$ and $\lambda_2$ are two weight factors that trade off these losses; they are data dependent.
Constrained Feature Aggregation
According to our analysis of GATs, the over-smoothing of GAT arises from information mixing along bridging nodes that connect two different clusters. In this section, we propose a constrained feature aggregation function to prevent such information mixing. For each node, the aggregation function uses only the features from the neighbors with the top-$k$ attention weights, rather than all neighbors. Under the constraint on attention computation in Eq. 3, the attention weights between nodes from different classes should be small. Therefore, picking the nodes with the top-$k$ attention weights not only keeps the smoothing effect among features of nodes within the same class, but also drops edges connecting different classes, owing to their small attention weights.
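A minimal sketch of the constrained (top-$k$) aggregation step: each row of the attention matrix is truncated to its $k$ largest entries and renormalized before aggregation. The function name is illustrative:

```python
import numpy as np

def topk_aggregate(alpha, Z, k):
    """Aggregate each node's features using only its k highest-attention
    neighbors; the remaining weights are zeroed and the kept weights
    renormalized, so each row is still a proper convex combination."""
    n = alpha.shape[0]
    out = np.zeros_like(Z)
    for i in range(n):
        keep = np.argsort(alpha[i])[-k:]      # indices of the k largest weights
        w = np.zeros(n)
        w[keep] = alpha[i, keep]
        w = w / w.sum()                       # renormalize over the kept set
        out[i] = w @ Z
    return out
```

Zeroing a row entry effectively deletes that edge for this layer, which is how low-attention bridges between classes are dropped from the smoothing.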
Note that the parameter $k$ trades off smoothing against over-smoothing. In principle, for a connected graph, if all the selected top-$k$ nodes still form a connected graph, then the smoothing effect of GNN models is preserved. The top-$k$ feature aggregator can be viewed as a subgraph selector, which selects different subgraphs for Laplacian smoothing in different layers. It therefore alleviates the over-smoothing in existing GAT models, which always use the same graph structure for feature aggregation, and helps the model go deeper with more layers. However, while a small $k$ allows GAT to go deeper, it might cause high variance in aggregation. In practice, we should therefore select $k$ to balance these two aspects. The sensitivity analysis of $k$ in the experimental section demonstrates this trade-off.

Comparison to Attention Dropout in GAT. Attention dropout randomly selects a proportion of attention weights for feature aggregation [Veličković et al. 2017]. Experiments have confirmed that it helps GAT training on small graphs. It can also be viewed as a process of selecting different subgraphs for Laplacian smoothing, a motivation similar to our constrained feature aggregation. However, with a random dropout mechanism it is still difficult to prevent information mixing along bridging nodes, since a bridge is only removed with probability equal to the dropout probability, leading to poor performance. As an example, Figure 2 (a) shows that the performance of GAT with dropout drops significantly when we add noisy edges to the graphs. The noisy edges act as bridges between nodes with different class labels. We observe that GAT with dropout cannot effectively cope with the more pronounced over-smoothing due to noisy edges. In contrast, our method still performs well on these noisy datasets.
Experiments
Datasets. We evaluate the performance of the proposed algorithm CGAT (Constrained Graph Attention neTworks) on four node classification benchmarks: (1) categorizing academic papers in the citation network datasets Cora, Citeseer, and Pubmed [Sen et al. 2008]; and (2) classifying protein functions across various biological protein–protein interaction (PPI) graphs [Zitnik and Leskovec 2017]. Table 1 summarizes the statistics of these datasets. Our experiments are conducted over standard data splits [Huang et al. 2018]. Following the supervised learning scenario, we use all the labels of the training examples for model training.
Name          Nodes       Edges    Classes   Node features  Train/Dev/Test
Cora^{a}      2,708       5,429    7         1,433          1,208/500/1,000
Citeseer^{a}  3,327       4,732    6         3,703          1,827/500/1,000
Pubmed^{a}    19,717      88,651   3         500            18,217/500/1,000
PPI^{b}       56,944^{∗}  818,716  121^{⋆}   50             20/2/2^{⋄}

a: transductive problem; b: inductive problem; ⋆: multi-label; ∗: total nodes in 24 graphs; ⋄: 20 graphs for training, 2 graphs for validation, and 2 graphs for testing.
Hyperparameter Settings. For the three transductive learning problems, we use two hidden layers with hidden dimension 32 for Cora and 64 for Citeseer, and three hidden layers with hidden dimension 32 for Pubmed; we set the number of neighbors $k$ used in the feature aggregation function to 4 for Cora and Citeseer, and to 8 for Pubmed. For the inductive learning problem PPI, we use three hidden layers with hidden dimension 128 and set $k$ to 8. We use Adam as the optimizer and perform a hyperparameter search for all baselines and our method over the same validation set. The margin values used in the two constraint losses (Eqs. 2 and 3) are chosen from {0.1, 0.2, 0.3, 0.5}, the trade-off factors of the two losses from {1, 2, 5, 10}, the learning rate from {0.001, 0.003, 0.005, 0.01}, and the regularization factor from {0.0001, 0.0005, 0.001}. We train all models using early stopping with a window size of 100.
Baselines. We compare our CGAT with the following representative GNN models: GCN [Kipf and Welling 2017], GraphSAGE [Hamilton, Ying, and Leskovec 2017], and Graph Attention Networks (GAT) [Veličković et al. 2017]. For the transductive learning problems, since the results in [Kipf and Welling 2017, Hamilton, Ying, and Leskovec 2017, Veličković et al. 2017] were obtained in a semi-supervised setting, we report node classification results from our own experiments following the same hyperparameter settings reported in those papers. We take the best GraphSAGE results across different pooling strategies [Huang et al. 2018, Veličković et al. 2017]. For the inductive learning problem PPI, we also compare CGAT with two other representative attention-based GNN models, GaAN [Zhang et al. 2018] and GeniePath [Liu et al. 2018].
Evaluation Settings. We use the same metrics as GAT [Veličković et al. 2017] for classification performance evaluation. Specifically, classification accuracy is collected on Cora, Citeseer, and Pubmed, and micro-averaged F1 is collected on the multi-label classification problem PPI. We report the mean and standard deviation of these metrics over 10 runs of each model under different random seeds.
Experimental Results
We investigate the proposed algorithm CGAT in the following four aspects: (1) classification performance comparison; (2) robustness, which indicates whether CGAT is able to overcome the overfitting problem and improve generalization to unseen graph structures; (3) depth of GNN models, to demonstrate whether CGAT can prevent the over-smoothing problem suffered by GAT; and (4) sensitivity analysis of the number of neighbors $k$ used in the feature aggregation function.
Methods                            Cora   Citeseer   Pubmed   PPI^{∗}

GNN:
GCN                                                           71.0^{†}
GraphSAGE                                                     76.8^{‡}
GAT
CGAT

Ablation:
w/o graph-structure loss (Eq. 2)
w/o class-boundary loss (Eq. 3)
w/o top-$k$ aggregation
w/o NINS^{⋆}

∗: the accuracies of the attention-based GNN models on PPI are GaAN [Zhang et al. 2018] 98.7 ± 0.02 and GeniePath [Liu et al. 2018] 97.9, respectively;
†: the best accuracy of GCN on PPI reported in [Liu et al. 2018];
‡: the best accuracy of GraphSAGE on PPI reported in [Veličković et al. 2017];
⋆: NINS: Node Importance based Negative Sampling.
Classification. We report the results of the performance comparison and ablation study in Table 2. From this table, we observe that:
(a) Our model CGAT performs consistently better than all baseline models (GCN, GraphSAGE, and GAT) across all benchmarks. Specifically, we improve upon GAT with absolute accuracy gains of 1.2%, 2.6%, 0.6%, and 1.5% on Cora, Citeseer, Pubmed, and PPI, respectively. In particular, for the inductive learning problem PPI, we obtain new state-of-the-art classification performance [Zhang et al. 2018, Liu et al. 2018].
(b) From the ablation studies in Table 2, we observe that the proposed constraints and the selected-neighborhood-based aggregation function achieve especially large gains on PPI. In an inductive learning setting such as PPI, the testing graph is completely unseen, so overfitting of the attention is especially significant. This suggests that the proposed constraints enable the attention functions to generalize well to unseen graph structures. The last row of Table 2 shows the classification accuracy of CGAT with uniform negative sampling instead of the adaptive node-importance-based negative sampling proposed in this paper. The results imply that incorporating node importance into negative sampling benefits the training of CGAT.
Robustness Analysis. To demonstrate the robustness of CGAT, i.e., whether the learned attention function is robust to changes in the graph structure, we conduct experiments by perturbing edges in the "Cora" test data. Fig. 2 (a) presents the experimental results. We observe that when edges are randomly added at testing time, GAT shows a significant descending trend as the ratio of added edges increases. Randomly added edges may connect different classes, which aggravates the over-smoothing problem of GAT. In contrast, our algorithm CGAT still makes good predictions even when the ratio of added edges reaches 50%: because of the class-boundary constraint, CGAT assigns small attention values to these boundary edges, and the proposed selected-neighbor-based feature aggregation function further eliminates their negative impact. These results demonstrate the better generalization of CGAT over GAT on unseen graphs (see more robustness results in the Appendix).
Deeper GAT. Fig. 2 (b) compares CGAT and GAT at different depths on "Cora". In contrast to the degradation of GAT with deeper layers due to more pronounced over-smoothing, our proposed CGAT maintains good classification performance as the number of attention layers increases. Again, these results show that CGAT effectively overcomes the over-smoothing problem, which enables applications of CGAT to graph-level tasks where depth is critical [Bünz and Lamm 2017].
Sensitivity Analysis of the Neighbor Number $k$. We also analyze the sensitivity of the hyperparameter $k$, which controls the aggregation step based on the highest attention weights. We run CGAT with $k$ varying over {1, 2, 4, 6, 8, 10, 20} on "Cora". We also randomly add 10% extra edges to "Cora" to increase the chance of information propagation between different classes, and investigate how to set $k$ on these noisy graphs. The right subfigure in Fig. 2 (c) shows the impact of $k$ on the classification accuracy of CGAT on "Cora" and "Cora with 10% noisy edges".
We observe that the classification accuracy first increases to a peak value and then stabilizes or slightly decreases. This means that $k$ trades off under-smoothing (not enough smoothing to tackle noise) against over-smoothing. For example, in Fig. 2 a smaller $k$ is best for noisy graphs, whereas a larger $k$ achieves the best performance on the original graphs.
Conclusion
In this paper we analyze the weaknesses of GAT models: overfitting of the attention function and over-smoothing of node representations in deeper models. We propose a novel approach, the constrained graph attention network (CGAT), which addresses the overfitting and over-smoothing issues of GAT by guiding the attention during training with margin-based constraints. In addition, a layer-wise adaptive negative sampling approach is proposed for augmenting attention training with effective negative examples. Furthermore, to alleviate the over-smoothing problem, we propose a new feature aggregation function that selects only the neighbors with the top-$k$ attention weights rather than all the neighbors. Extensive experiments on common benchmark datasets verify the effectiveness of our approach and show significant gains in accuracy on standard node classification benchmarks, especially with deeper models and noisy tests, compared to state-of-the-art GAT models. A particularly interesting direction for future work is to explore more effective constraints on the attention computation of GAT for other downstream tasks (e.g., link prediction).
References
 [AbuElHaija et al.2018] AbuElHaija, S.; Perozzi, B.; AlRfou, R.; and Alemi, A. A. 2018. Watch your step: Learning node embeddings via graph attention. In NIPS, 9180–9190.
 [Atwood and Towsley2016] Atwood, J., and Towsley, D. 2016. Diffusionconvolutional neural networks. In NIPS, 1993–2001.
 [Battaglia et al.2016] Battaglia, P.; Pascanu, R.; Lai, M.; Rezende, D. J.; et al. 2016. Interaction networks for learning about objects, relations and physics. In NIPS, 4502–4510.
 [Berg, Kipf, and Welling2017] Berg, R. v. d.; Kipf, T. N.; and Welling, M. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263.
 [Bruna et al.2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
 [Bünz and Lamm2017] Bünz, B., and Lamm, M. 2017. Graph neural networks and boolean satisfiability. arXiv preprint arXiv:1702.03592.
 [Cai, Zheng, and Chang2018] Cai, H.; Zheng, V. W.; and Chang, K. C.-C. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. TKDE 30(9):1616–1637.
 [Chen, Zhu, and Song2017] Chen, J.; Zhu, J.; and Song, L. 2017. Stochastic training of graph convolutional networks with variance reduction. arXiv preprint arXiv:1710.10568.
 [Chung2005] Chung, F. 2005. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics 9(1):1–19.
 [Dai et al.2018] Dai, H.; Kozareva, Z.; Dai, B.; Smola, A.; and Song, L. 2018. Learning steady-states of iterative algorithms over graphs. In ICML, 1114–1122.
 [De Cao, Aziz, and Titov2018] De Cao, N.; Aziz, W.; and Titov, I. 2018. Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.
 [Defferrard, Bresson, and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844–3852.
 [Devlin et al.2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 [Duvenaud et al.2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
 [Gao, Wang, and Ji2018] Gao, H.; Wang, Z.; and Ji, S. 2018. Large-scale learnable graph convolutional networks. In KDD, 1416–1424. ACM.
 [Gilmer et al.2017] Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML, 1263–1272. JMLR.org.
 [Hamilton, Ying, and Leskovec2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS, 1024–1034.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
 [Henaff, Bruna, and LeCun2015] Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
 [Horn and Johnson2012] Horn, R. A., and Johnson, C. R. 2012. Matrix analysis. Cambridge university press.
 [Huang et al.2018] Huang, W.; Zhang, T.; Rong, Y.; and Huang, J. 2018. Adaptive sampling towards fast graph representation learning. In NIPS, 4558–4567.
 [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
 [Lee, Rossi, and Kong2018] Lee, J. B.; Rossi, R.; and Kong, X. 2018. Graph classification using structural attention. In KDD, 1666–1674. ACM.
 [Li, Han, and Wu2018] Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
 [Liu et al.2018] Liu, Z.; Chen, C.; Li, L.; Zhou, J.; Li, X.; Song, L.; and Qi, Y. 2018. Geniepath: Graph neural networks with adaptive receptive paths. arXiv preprint arXiv:1802.00910.
 [Lovász and others1993] Lovász, L., et al. 1993. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty 2(1):1–46.
 [Monti et al.2017] Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2017. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 5115–5124.
 [Niepert, Ahmed, and Kutzkov2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In ICML, 2014–2023.
 [Randall2006] Randall, D. 2006. Rapidly mixing Markov chains with applications in computer science and physics. Computing in Science & Engineering 8(2):30–41.
 [Ryu, Lim, and Kim2018] Ryu, S.; Lim, J.; and Kim, W. Y. 2018. Deeply learning molecular structure-property relationships using graph attention neural network. arXiv preprint arXiv:1805.10988.
 [Sen et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3):93–106.
 [Thekumparampil et al.2018] Thekumparampil, K. K.; Wang, C.; Oh, S.; and Li, L.-J. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735.
 [Trevor, Robert, and JH2009] Trevor, H.; Robert, T.; and JH, F. 2009. The elements of statistical learning: data mining, inference, and prediction.
 [Tu et al.2019] Tu, M.; Wang, G.; Huang, J.; Tang, Y.; He, X.; and Zhou, B. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. arXiv:1905.07374.
 [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998–6008.
 [Veličković et al.2017] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
 [Weston et al.2012] Weston, J.; Ratle, F.; Mobahi, H.; and Collobert, R. 2012. Deep learning via semisupervised embedding. In Neural Networks: Tricks of the Trade. Springer. 639–655.
 [Wu et al.2019a] Wu, F.; Zhang, T.; Souza Jr, A. H. d.; Fifty, C.; Yu, T.; and Weinberger, K. Q. 2019a. Simplifying graph convolutional networks. In ICML.
 [Wu et al.2019b] Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2019b. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
 [Ying et al.2018] Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD, 974–983. ACM.
 [Zhang and Chen2018] Zhang, M., and Chen, Y. 2018. Link prediction based on graph neural networks. In NIPS, 5165–5175.
 [Zhang et al.2018] Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; and Yeung, D.-Y. 2018. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.
 [Zitnik and Leskovec2017] Zitnik, M., and Leskovec, J. 2017. Predicting multicellular function through multilayer tissue networks. Bioinformatics 33(14):i190–i198.
Appendix A Proof of Proposition 1
Before we give the proof, we first introduce the concepts of the random walk normalized Laplacian and Laplacian smoothing as follows.
Random Walk Normalized Laplacian. Let $A$ be the attention weight matrix, and let $D = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i = \sum_j A_{ij}$; then the graph Laplacian of $A$ is defined as $L = D - A$, and $L_{rw} = D^{-1} L$ is the random walk normalized Laplacian of $A$.
Laplacian Smoothing [Li, Han, and Wu2018] on each row $x_i$ of the input feature matrix $X$ is defined as:
$\hat{x}_i = (1 - \gamma)\, x_i + \gamma \sum_{j} \frac{A_{ij}}{d_i}\, x_j$   (5)
where $\gamma \in (0, 1]$ is a parameter that controls the smoothness, i.e., the importance weight of a node's own features relative to the features of its neighbors. We can rewrite the Laplacian smoothing in Eq. 5 in matrix form:
$\hat{X} = \left(I - \gamma D^{-1} L\right) X$   (6)
Proof.
As $A$ is the attention weight matrix, $\sum_j A_{ij} = 1$ for every $i$, so we get $D = I$. The random walk normalization of $L$ is $L_{rw} = D^{-1} L = I - A$.
We can rewrite the graph attention operation as $H' = A H = (I - L_{rw}) H$. According to the formulation of Laplacian smoothing in Eq. 5, we can conclude that graph attention is a special form of Laplacian smoothing with $\gamma = 1$. ∎
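Proposition 1 can be checked numerically (a minimal NumPy sketch under the notation above; the random matrices are illustrative):

```python
import numpy as np

# With a row-stochastic attention matrix A we have D = I and L = D - A,
# so the attention step A @ H equals Laplacian smoothing
# (I - gamma * D^{-1} L) @ H with gamma = 1.
rng = np.random.default_rng(0)
A = rng.random((4, 4))
A = A / A.sum(axis=1, keepdims=True)   # rows sum to 1, hence D = I
H = rng.random((4, 3))

D = np.diag(A.sum(axis=1))             # degree matrix (identity here)
L = D - A                              # graph Laplacian of A
gamma = 1.0
smoothed = (np.eye(4) - gamma * np.linalg.inv(D) @ L) @ H
assert np.allclose(smoothed, A @ H)    # attention step == smoothing step
```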
Appendix B Proof of Theorem 1
Proof.
(1) We can view the random walk on the graph $G$ as a Markov chain with transition matrix $P = D^{-1} A$. As $G$ is an undirected, connected, and non-bipartite graph, the Markov chain is ergodic [Randall2006, Lovász and others1993], and any finite ergodic Markov chain converges to a unique stationary distribution [Randall2006]. (2) According to the Perron-Frobenius theorem [Horn and Johnson2012, Chung2005], such a stationary distribution is just the Perron vector of the transition matrix. For an undirected graph, this Perron vector w.r.t. $P$ is $\pi$ with $\pi_i = d_i / \sum_j d_j$. ∎
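Theorem 1 can be illustrated numerically (a small NumPy sketch; the example graph is ours, chosen to be undirected, connected, and non-bipartite):

```python
import numpy as np

# Random walk on an undirected, connected, non-bipartite graph converges
# to pi_i = d_i / sum_j d_j, independent of the starting distribution.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)   # contains a triangle: non-bipartite
deg = adj.sum(axis=1)
P = adj / deg[:, None]                        # transition matrix D^{-1} A

dist = np.array([1.0, 0.0, 0.0, 0.0])         # arbitrary starting distribution
for _ in range(200):
    dist = dist @ P

pi = deg / deg.sum()                          # predicted stationary distribution
assert np.allclose(dist, pi, atol=1e-8)
```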
Appendix C Proof of Theorem 2
Proof.
(1) Let $P^{(l)}$ be the transition matrix over the graph $G$ corresponding to the attention weight matrix of the $l$-th layer of GAT. According to Theorem 1, the random walk on the graph with transition matrix $P^{(l)}$ converges to a unique stationary distribution that depends only on the degrees of the graph, regardless of $l$; i.e., $\pi^{(l)} = \pi$, where $\pi^{(l)}$ denotes the stationary distribution w.r.t. $P^{(l)}$ and $\pi$ is the unique stationary distribution. (2) Let $p_i^{(l)}$ be the $i$-th row of $P^{(l)} \cdots P^{(1)}$. According to the convergence analysis of random walks in [Randall2006], we have $\|p_i^{(l)} - \pi\| \le \lambda_l\, \|p_i^{(l-1)} - \pi\|$, where $\lambda_l$ is the mixing rate of the random walk with transition matrix $P^{(l)}$. By expanding this inequality recursively, $\|p_i^{(l)} - \pi\| \le \left(\prod_{k=1}^{l} \lambda_k\right) \|p_i^{(0)} - \pi\|$. Moreover, for a strongly connected graph, the mixing rate satisfies $\lambda_k < 1$ according to [Randall2006]. Then, $\prod_{k=1}^{l} \lambda_k \to 0$ as $l \to \infty$; i.e., $p_i^{(l)} \to \pi$. ∎
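The resulting over-smoothing can be seen numerically: powers of a row-stochastic transition matrix collapse toward a rank-one matrix whose rows all equal the stationary distribution (a sketch on an assumed toy graph):

```python
import numpy as np

# Stacking many smoothing layers amounts to multiplying by the transition
# matrix repeatedly; every row converges to the same stationary
# distribution, so node representations become indistinguishable.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
deg = adj.sum(axis=1)
P = adj / deg[:, None]

M = np.linalg.matrix_power(P, 100)     # P^l for a deep stack of layers
pi = deg / deg.sum()
# every row of P^l approaches pi: the matrix collapses to rank one
assert np.allclose(M, np.tile(pi, (4, 1)), atol=1e-8)
```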
Appendix D Observation of Over-Smoothing on Data “Citeseer”
Fig. 3 shows the training loss, training error, and validation error of GAT models with different numbers of layers on the benchmark dataset “Citeseer” (see detailed information about the data in Table 1). From this figure, we observe that the deeper networks still converge, but a performance degradation problem occurs: as the depth increases, the accuracy degrades. In this paper, we demonstrate that this performance degradation is mainly due to the over-smoothing effect of deeper GAT models.
Appendix E Proof of Proposition 2
Proof.
Let us first review the feature aggregation function in GAT:
$h_i' = \sigma\left( \sum_{j} \alpha_{ij} W h_j \right)$   (7)
where $\alpha_{ij} > 0$ if $j \in \mathcal{N}(i)$, and $\alpha_{ij} = 0$ otherwise. We can view $\alpha_{ij}$ as the importance of node $j$ to node $i$ given the graph with features $H$. We can rewrite it as a form of conditional probability, $\alpha_{ij} = p(j \mid i)$. If we define $q(j)$ (denoted as $q_j$ for simplification) as the probability of sampling node $j$ given all the nodes of the current layer, then we get $p(j) = \sum_i p(j \mid i)\, q_i$. Then, according to Bayes' formula, we can get $p(i \mid j) = p(j \mid i)\, q_i / p(j)$. ∎
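The Bayes step can be illustrated numerically (a sketch under assumed notation: attention rows are treated as conditional probabilities p(j | i) and q as layer-wise sampling probabilities; these names are our assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = rng.random((5, 5))
alpha = alpha / alpha.sum(axis=1, keepdims=True)   # p(j | i): each row sums to 1

q = np.full(5, 1.0 / 5)                            # sampling probability q(i)
p_j = alpha.T @ q                                  # marginal p(j) = sum_i p(j|i) q(i)
p_i_given_j = alpha * q[:, None] / p_j[None, :]    # Bayes: p(i|j) = p(j|i) q(i) / p(j)

# each column of p(i | j) is a valid distribution over i
assert np.allclose(p_i_given_j.sum(axis=0), 1.0)
```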
Appendix F Experimental Results of Robust Analysis and Deeper GAT
To evaluate the robustness of CGAT, in particular whether the induced attention function is robust to changes in the graph structure, we conduct experiments by perturbing edges in the “Cora” data. Fig. 4 presents the experimental results. From this figure, we can observe that:
(a) When randomly dropping some edges in the training stage (see Fig. 4 (a)), CGAT achieves relatively stable performance as the ratio of dropped edges increases. In contrast, the performance of GAT shows a descending trend. This is because, for an edge missing in the testing stage, the attention value w.r.t. this edge in CGAT is still reliable thanks to the two constraints. That is, if the missing edge connects two nodes that share the same label, then according to the constraints the attention weight will be higher, resulting in a better smoothing operator; if the missing edge connects two nodes with different labels, then because of the proposed constraints and the proposed feature aggregation function, the impact of such an edge can be eliminated as well. For GAT, which lacks these constraints, information propagates regardless of whether the missing edge lies on the classification boundary; GAT may even assign large attention values to classification-boundary edges, leading to over-smoothing.
(b) When randomly adding some edges in the training stage (see Fig. 4 (b)), the performance of CGAT remains relatively stable, but GAT's performance decreases as the ratio of added edges increases. This is because the randomly added edges may connect different classes together, which results in more information propagation among different classes, easily leads to over-smoothing, and hurts the quality of the training data. The constraints in CGAT can be viewed as a data cleaner that improves the quality of the training data. In contrast, GAT has no such ability, so the induced model performs worse in the testing stage.
(c) Fig. 4 (c) compares CGAT and GAT with different depths on “Citeseer”. Our proposed CGAT maintains good classification performance with increasing numbers of attention layers. Again, these results show that over-smoothing is not an issue for CGAT.
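The edge perturbations used in (a) and (b) can be sketched with a small helper (a hypothetical sketch; the function name and sampling scheme are our assumptions, not the authors' exact protocol):

```python
import numpy as np

def perturb_edges(adj, drop_ratio=0.0, add_ratio=0.0, seed=0):
    """Randomly drop and/or add undirected edges in an adjacency matrix
    (illustrative helper for robustness experiments)."""
    rng = np.random.default_rng(seed)
    adj = adj.copy()
    n = adj.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    edges = [(i, j) for i, j in zip(iu, ju) if adj[i, j] > 0]
    non_edges = [(i, j) for i, j in zip(iu, ju) if adj[i, j] == 0]

    n_drop = int(drop_ratio * len(edges))
    if n_drop:
        for idx in rng.choice(len(edges), size=n_drop, replace=False):
            i, j = edges[idx]
            adj[i, j] = adj[j, i] = 0          # remove both directions
    n_add = min(int(add_ratio * len(edges)), len(non_edges))
    if n_add:
        for idx in rng.choice(len(non_edges), size=n_add, replace=False):
            i, j = non_edges[idx]
            adj[i, j] = adj[j, i] = 1          # add both directions
    return adj
```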