Improving Graph Attention Networks with Large Margin-based Constraints

10/25/2019, by Guangtao Wang et al.

Graph Attention Networks (GATs) are a state-of-the-art neural architecture for representation learning on graphs. GATs learn attention functions that assign weights to nodes so that different nodes exert different influence in the feature aggregation step. In practice, however, the induced attention functions are prone to over-fitting due to the increased number of parameters and the lack of direct supervision on the attention weights. GATs also suffer from over-smoothing at the decision boundary between node classes. Here we propose a framework that addresses these weaknesses via margin-based constraints on attention during training. We first theoretically demonstrate the over-smoothing behavior of GATs and then develop an approach that constrains the attention weights according to the class boundary and the feature aggregation pattern. Furthermore, to alleviate the over-fitting problem, we propose additional constraints based on the graph structure. Extensive experiments and ablation studies on common benchmark datasets demonstrate the effectiveness of our method, which yields significant improvements over previous state-of-the-art graph attention methods on all datasets.








Many real-world applications involve graph data, such as social networks [Zhang and Chen2018], chemical molecules [Gilmer et al.2017], and recommender systems [Berg, Kipf, and Welling2017]. The complicated structures of these graphs have inspired new machine learning methods [Cai, Zheng, and Chang2018, Wu et al.2019b]. Recently, much attention and progress has been devoted to graph neural networks, which have been successfully applied to social network analysis [Battaglia et al.2016], recommendation systems [Ying et al.2018], and machine reading comprehension [Tu et al.2019, De Cao, Aziz, and Titov2018].

Recently, a novel architecture leveraging the attention mechanism in Graph Neural Networks (GNNs), called Graph Attention Networks (GATs), was introduced [Veličković et al.2017]. GAT was motivated by attention mechanisms in natural language processing [Vaswani et al.2017, Devlin et al.2018]. It computes the representation of each node by attending to its neighbors via a masked self-attention. For each node, different weights are learned by attention functions so that nodes in the same neighborhood receive different weights in the feature aggregation step. Inspired by this attention-based architecture, several new attention-based GNNs have been proposed and have achieved state-of-the-art performance on node classification benchmarks [Liu et al.2018, Zhang et al.2018, Ryu, Lim, and Kim2018, Lee, Rossi, and Kong2018, Thekumparampil et al.2018, Abu-El-Haija et al.2018].

However, attention-based GNNs suffer from over-fitting and over-smoothing: (1) learned attention functions, which use node features to assign an importance weight to every neighboring node, tend to overfit the training data because the masked self-attention computes attention weights only for direct neighbors. (2) Over-smoothing arises for nodes that are connected but lie on different sides of the class decision boundary. Because information is exchanged over these edges, stacking multiple attention layers causes excessive smoothing of node features and makes nodes from different classes indistinguishable.

Here we develop a framework called Constrained Graph Attention Networks (C-GATs) that addresses the above shortcomings of GATs through margin-based constraints. The margin-based constraints act as regularizers that prevent over-fitting. For example, by requiring that the learned attention weights of nodes in the neighborhood be greater than those of nodes outside the neighborhood by a large margin, we guide the attention weights to separate a node's neighbors from its non-neighbors, something the unconstrained attention weights in GATs cannot do. This helps the model learn an attention function that generalizes well to unseen graph structures.

To overcome the problem of over-smoothing, we propose constraints on attentions based on class labels, and a new feature aggregation function which only selects the neighbors with top attention weights for feature aggregation. The purpose of the proposed aggregation function is to reduce information propagation between nodes belonging to different classes.

To train the proposed C-GAT model efficiently and effectively, we develop a layer-wise adaptive negative sampling strategy. In contrast to uniform negative sampling, which is inefficient because many negative samples provide no meaningful information, our method selects highly informative negative nodes in a layer-wise adaptive way.

We evaluate the proposed approach on four node-classification benchmarks: the Cora, Citeseer, and Pubmed citation networks, as well as an inductive protein-protein interaction (PPI) dataset. Extensive experiments demonstrate the effectiveness of our approach with respect to classification accuracy and generalization to unseen graph structure: our C-GAT models improve consistently over state-of-the-art GATs on all four datasets, and set a new state-of-the-art accuracy on the inductive PPI data.

In summary, we make the following contributions in this paper:


  • We provide new insights and mathematical analysis of attention-based GNN models for node classification and their associated challenges;

  • We propose a new constrained graph attention network (C-GAT) that imposes constraints on the node attentions to overcome over-fitting and over-smoothing, together with an adaptive layer-wise negative sampling strategy to train C-GAT efficiently and effectively;

  • We propose an aggregation strategy to further remedy the over-smoothing problem at the class boundary by selecting top neighbors for feature aggregation;

  • Our extensive experimental results and analysis demonstrate the benefit of the C-GAT model and show consistent gains over state-of-the-art graph attention models on standard benchmark datasets for graph node classification.

Related Works

GNNs can generally be divided into two groups, spectral and non-spectral models [Cai, Zheng, and Chang2018, Hamilton, Ying, and Leskovec2017, Veličković et al.2017], according to the type of convolution operation on graphs. The former defines convolution operations based on Laplacian eigenvectors [Bruna et al.2013, Henaff, Bruna, and LeCun2015, Defferrard, Bresson, and Vandergheynst2016], and these models are usually difficult to generalize to graphs with unseen structures [Monti et al.2017]. Non-spectral methods define convolution operations directly on spatially close neighbors and usually exhibit better performance on unseen graphs [Duvenaud et al.2015, Atwood and Towsley2016, Hamilton, Ying, and Leskovec2017, Niepert, Ahmed, and Kutzkov2016, Monti et al.2017]. Our algorithm conceptually belongs to the non-spectral approaches.

Graph Convolutional Neural Networks (GCNs) generalize convolution operations from traditional image data to graphs. The key is to find a function that generates a node's representation by aggregating its own features as well as its neighbors' features [Wu et al.2019b]. Example models include SSE [Dai et al.2018], MPNN [Gilmer et al.2017], GraphSAGE [Hamilton, Ying, and Leskovec2017], DCNN [Atwood and Towsley2016], StoGCN [Chen, Zhu, and Song2017], LGCN [Gao, Wang, and Ji2018], and more. These GCNs usually treat all nodes of the same neighborhood equally in the feature aggregation step.

Graph Attention Networks (GATs) generalize the attention operation to graph data. GATs assign different importance to nodes of the same neighborhood in the feature aggregation step, increasing the model capacity of GNNs [Veličković et al.2017]. Based on this framework, different attention-based GNN architectures have been proposed, including GaAN [Zhang et al.2018], AGNN [Thekumparampil et al.2018], GeniePath [Liu et al.2018], and others. These models use different attention functions to compute the importance of nodes in a neighborhood. However, such attention functions suffer from over-fitting when learning the attention weights, and if there are edges between different clusters, these GNNs easily over-smooth node representations, which hurts performance on the downstream node classification task.

Analysis of GATs

In this section we briefly review the GAT model and identify its weaknesses.

Notation. Let G = (V, E) be a graph, where V is the set of nodes (or vertices), E is the set of edges connecting pairs of nodes in V, and X ∈ R^{n×d} represents the node input features, where each row x_i = X[i, :] is a d-dimensional vector of attribute values of node v_i (i = 1, …, n). In this paper, we consider undirected graphs. Suppose A is the (weighted) adjacency matrix of G, and D = diag(d_1, …, d_n) is the degree matrix with d_i = Σ_j A_{ij}. The graph Laplacian of G is defined as L = D − A, and the random walk normalized Laplacian is L_rw = D^{-1}L = I − D^{-1}A.

Node classification. Suppose that a subset V_L ⊆ V consists of labeled nodes; the goal of node classification is to predict the labels of the remaining unlabeled nodes. Many graph-based node labeling methods make the cluster assumption, which assumes that connected nodes in the graph are likely to share the same label [Weston et al.2012, Li, Han, and Wu2018].

Attention based GNN. An attention-based GNN uses the following layer-wise attention-based aggregation function to embed each node v_i:

h_i^{(l)} = σ( Σ_{j ∈ N(i)} α_{ij}^{(l)} W^{(l)} h_j^{(l-1)} ),   (1)

where W^{(l)} is a trainable weight matrix shared by the l-th layer, σ is the activation function, h_i^{(l)} is the node embedding produced by the l-th layer with h_i^{(0)} = x_i, and N(i) is the set of v_i's one-hop neighbors, which also includes v_i itself (i.e., there is a self-loop on each node). α_{ij}^{(l)} is the l-th-layer attention weight between the target node v_i and the neighboring node v_j, generated by applying a softmax over N(i) to the values computed by an attention function f(h_i, h_j; θ), where θ denotes the trainable parameters of the attention function. For the GAT model [Veličković et al.2017], f(h_i, h_j) = LeakyReLU(a^T [W h_i ‖ W h_j]), where ‖ denotes concatenation as in [Veličković et al.2017]. In this paper, we use GAT to refer to all attention-based GNN models.
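To make this aggregation concrete, the following NumPy sketch implements one GAT-style layer with the masked softmax attention described above. The scoring function is the original GAT one, but the code itself, its variable names, and the LeakyReLU slope are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gat_layer(H, adj, W, a, leaky_slope=0.2):
    """One attention-based aggregation layer in the spirit of Eq. 1,
    with the GAT scoring function f(h_i, h_j) = LeakyReLU(a^T [W h_i || W h_j]).

    H:   (n, d)  node embeddings from the previous layer
    adj: (n, n)  0/1 adjacency matrix with self-loops on the diagonal
    W:   (d, d') shared trainable weight matrix
    a:   (2*d',) trainable attention vector
    """
    Z = H @ W                                   # (n, d')
    dp = Z.shape[1]
    # a^T [z_i || z_j] splits into a_1^T z_i + a_2^T z_j for every pair (i, j)
    scores = (Z @ a[:dp])[:, None] + (Z @ a[dp:])[None, :]
    scores = np.where(scores > 0, scores, leaky_slope * scores)  # LeakyReLU
    # masked softmax: attention is only computed over one-hop neighbors
    scores = np.where(adj > 0, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # rows sum to 1
    return alpha, np.maximum(alpha @ Z, 0.0)     # ReLU activation
```

Each row of `alpha` is a probability distribution over the node's neighbors, which is exactly the property the over-smoothing analysis below relies on.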

The over-fitting problem of GATs. The attention functions in GAT compute attention values based on the features of pairs of connected nodes (see Eq. 1). When training these attention functions, there is only one source of guidance for their parameters: the classification error. In other words, the supervision for learning the attention functions comes solely from the node labels.

There are two common sources of over-fitting in machine learning: 1) a lack of sufficient supervision for parameter learning, and 2) an over-parameterized model. We believe that the over-fitting of GAT comes from the former: the supervision for learning the attention parameters is limited and indirect, since the signal can only come from the node labels. In general, less supervision leads to more over-fitting [Trevor, Robert, and JH2009]. The learned attention function performs well on the training data but fails to generalize and is not robust to perturbation. We demonstrate this with a robustness test in the experimental section.

The over-smoothing problem of GATs. To facilitate the analysis, we focus on the attention aggregation and rewrite Eq. 1 in matrix form as H^{(l)} = A_att H^{(l-1)} W^{(l)} (similar to [Li, Han, and Wu2018], we omit the non-linear activation σ; in fact, [Wu et al.2019a] shows that similar performance is observed when there is no nonlinearity after the aggregation step), where A_att is the attention matrix with (A_att)_{ij} = α_{ij} if j ∈ N(i) and 0 otherwise, and each row of A_att sums to 1. Then we have the following proposition (see the proof in the Appendix) showing that a single attention layer acts as a kind of random walk Laplacian smoothing.

Proposition 1.

Let L_rw = I − D^{-1}A be the random walk normalized Laplacian of the graph G. Then a single attention layer is equivalent to a Laplacian smoothing operation.

Let P be the transition probability matrix of a connected undirected graph G with n nodes, let P^t(i, j) be the probability of being at node j after a t-step walk in G starting at i, and let d_i be the degree of node i. Then we have the following theorem (see the proof in the Appendix).

Theorem 1.

If the graph G has no bipartite components, there exists a random walk on G with transition probability matrix P that converges to a unique stationary distribution π. That is, for any pair of nodes i and j, lim_{t→∞} P^t(i, j) = π_j.

We can view the attention weight matrix A_att as a random walk transition probability matrix, since its entries are non-negative and each row sums to 1. Therefore, according to Theorem 1, by repeatedly applying random walk Laplacian smoothing (which is similar to increasing the depth of GAT), the features of the nodes in each connected component of G converge to the same values. Under the cluster assumption of node classification, that nodes in the same connected component tend to share the same label, this smoothing yields an easier classification problem. This is why GAT works for node classification.
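This convergence is easy to observe numerically. The toy example below is our own illustration: it builds the random walk matrix P = D^{-1}A of a small connected, non-bipartite graph and shows node features collapsing to a common value under repeated smoothing.

```python
import numpy as np

# Adjacency of a small connected graph with self-loops (hence non-bipartite).
A = np.array([[1, 1, 1, 0],
              [1, 1, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # initial 1-d node features

# Repeated smoothing: the spread of the features shrinks toward 0,
# i.e., all nodes in the connected component become indistinguishable.
smoothed = np.linalg.matrix_power(P, 200) @ X
spread = float(smoothed.max() - smoothed.min())
```

With a single connected component, every node converges to the same weighted average of the initial features, mirroring the statement of Theorem 1.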

Different Attention Weights at Every Layer. In practice, the attention weight matrices vary across layers. This differs from Theorem 1, which multiplies an identical matrix repeatedly. In fact, stacking multiple GAT layers is equivalent to a matrix-chain multiplication over multiple different attention weight matrices. Theorem 2 (see the proof in the Appendix) demonstrates that GATs still suffer from over-smoothing as they go deep, since the attention matrix at each layer can be viewed as a transition probability matrix on the graph.

Theorem 2.

Let P^{(l)} (l = 1, 2, …) be transition probability matrices of the connected undirected graph G, corresponding to the attention scores of the l-th GAT layer. Then lim_{L→∞} P^{(L)} P^{(L-1)} ⋯ P^{(1)}(i, j) = π_j, where π is the unique stationary distribution in Theorem 1.

In practice, most graphs contain bridge nodes that connect components with different labels. Theorem 2 states that, if we increase the depth of GAT, the aggregated node features of different components become indistinguishable due to these boundary nodes, leading to worse performance for deep GATs (see the observation of over-smoothing in the Appendix). We call this phenomenon over-smoothing.

Multi-head Attention is employed in GAT [Veličković et al.2017]. Specifically, K_l independent attention heads are computed for feature aggregation at each layer l, and the output of that layer is the concatenation of the outputs of all heads. To facilitate the analysis, we again focus on the attention aggregation and simplify Eq. 1 for each head as A_k^{(l)} H^{(l-1)} W_k^{(l)}, where A_k^{(l)} is the k-th (1 ≤ k ≤ K_l) head attention matrix in the l-th layer and K_l is the number of heads of that layer. The output of the l-th layer is H^{(l)} = ‖_{k=1}^{K_l} A_k^{(l)} H^{(l-1)} W_k^{(l)}, where ‖ denotes concatenation along the column (hidden) dimension. Expanding H^{(l-1)} in the same way, each component A_k^{(l)} H^{(l-1)} W_k^{(l)} = A_k^{(l)} (‖_{k'} A_{k'}^{(l-1)} H^{(l-2)} W_{k'}^{(l-1)}) W_k^{(l)} = Σ_{k'} A_k^{(l)} A_{k'}^{(l-1)} H^{(l-2)} W_{k'}^{(l-1)} W_{k,k'}^{(l)}, where W_{k,k'}^{(l)} is the corresponding row block of W_k^{(l)}. Performing this expansion recursively for all layers, the output of the l-th layer consists of multiple components, each of which can be viewed as a matrix-chain multiplication of attention matrices from different heads and layers. According to Theorem 2, these matrix chains converge to the unique stationary distribution as l → ∞. This means that multi-head GATs still suffer from the over-smoothing problem when they go deep.

Figure 1: The target node is the dark solid-bordered circle; nodes of the same color share the same class label; a white circle denotes a node unreachable from the target node. Left (orange circle): the graph structure based constraint requires a margin between attention from one-hop neighbors (black solid lines) and attention from multi-hop neighbors (blue dashed lines) or unreachable nodes (black dashed lines). Right (blue circle): the class boundary based constraint requires a margin between attention from neighbors with the same class label and neighbors with a different class label.

Residual connection is an effective way to ensure good performance when increasing the depth of convolutional neural networks [He et al.2016], and it has also been employed in GATs [Veličković et al.2017, Liu et al.2018]. To formally analyze the effect of residual connections, we introduce the concept of a lazy random walk as follows:

Let P be a random walk transition probability matrix; we define the transition probability matrix of the corresponding lazy random walk as P̃ = (I + P)/2. At every step, the lazy random walk has probability 1/2 of staying at the current node and probability 1/2 of moving away from it. Hence a residual connection is a lazy-random-walk-based smoothing up to a constant factor (the constant depends only on the number of layers and is the same for all nodes). Therefore, if the lazy random walk is viewed as a transition probability matrix, then by Theorem 2 the features of all nodes in a connected component converge to the same values as more GAT layers are stacked. Moreover, the following theorem from [Chung2005] answers how fast the lazy-random-walk-based smoothing process converges to the stationary distribution.
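As a quick numerical check (our own sketch, not part of the paper's experiments), the lazy walk P̃ = (I + P)/2 is again row-stochastic and converges to the same stationary distribution as P, illustrating that residual connections delay but do not avert smoothing:

```python
import numpy as np

# Plain random walk on a triangle graph and its lazy version.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
P_lazy = 0.5 * (np.eye(3) + P)   # stay put with probability 1/2

# After enough steps, the walk's distribution approaches the stationary
# distribution of P (uniform here, since the graph is regular).
pi_lazy = np.linalg.matrix_power(P_lazy, 100)[0]
```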

Theorem 3.

Suppose that a strongly connected directed graph G on n nodes has Laplacian eigenvalues 0 = λ_0 ≤ λ_1 ≤ … ≤ λ_{n-1}. Then G has a lazy random walk whose rate of convergence is of order O(λ_1^{-1}): after at most t = O(λ_1^{-1}(log n + log(1/ε))) steps, the walk is within ε of its stationary distribution.

Theorem 3 implies that it is difficult to prevent the over-smoothing of deep GAT by simply adding residual connections. This phenomenon has also been confirmed by experiments in [Liu et al.2018].

Constrained Graph Attention Networks

To address the over-fitting and over-smoothing problems of GATs, we propose a framework called constrained graph attention networks (C-GATs) that adds constraints on both the attention function and the feature aggregation function. With these constraints, we improve the generalization ability and alleviate the over-smoothing of GAT. In the following, we first introduce two constraints on the attention computation, which yield two margin-based losses that guide the training of the GNN. Then, based on the constrained attentions, we propose a new aggregation function that chooses a subset of neighboring nodes according to attention weights, rather than all neighbors, for feature aggregation, to further reduce over-smoothing.

Margin based Constraint on Attention

To address over-fitting, we can either make use of more data or apply regularization techniques when training the attention function. [Hamilton, Ying, and Leskovec2017] utilized graph structure to guide graph representation learning, achieving impressive performance improvements for node classification: they required the similarity between nearby nodes to be larger than that between disparate nodes, with nearby nodes identified by fixed-length random walks. This shows that the graph structure is very important for graph representation learning. Inspired by this idea, our first constraint on the attention function induces the computed attention weights to reflect the graph structure. More precisely, we require the attention weights between one-hop neighboring nodes to be greater than those involving disparate nodes (including multi-hop neighbors). This can be viewed as a simplified version of [Hamilton, Ying, and Leskovec2017].

We apply a second constraint on the attention computation to address over-smoothing in GAT. Over-smoothing occurs when a pair of nodes with different class labels are connected, as information from different classes gets mixed over such edges. To limit this communication, we require that the attention weights between nodes sharing the same class label be greater than those between nodes with different class labels. We call this the class boundary based constraint.

For a given node v_i, suppose N(i) is the set of its one-hop neighbors, and N^-(i) and N^+(i) are the neighbors with class labels different from and identical to that of v_i, respectively (N^+(i) ≠ ∅ due to the self-loop connection). Fig. 1 illustrates the two margin-based constraints.

  1. Loss from the Graph Structure based Constraint:

     L_g = Σ_i Σ_{j ∈ N(i)} Σ_{k ∉ N(i)} max(0, f(x_i, x_k) + ζ_g − f(x_i, x_j))   (2)

  2. Loss from the Class Boundary Constraint:

     L_b = Σ_i Σ_{j ∈ N^+(i)} Σ_{k ∈ N^-(i)} max(0, f(x_i, x_k) + ζ_b − f(x_i, x_j))   (3)

where ζ_g > 0 and ζ_b > 0 are slack variables that control the margin between attention values, and f is the attention function. Letting x_i and x_j be the features of nodes v_i and v_j, we use f(x_i, x_j) = x_i^T M x_j to compute the attention between the two nodes, where M is a trainable matrix.
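A minimal sketch of the two hinge-style constraint losses, under our own reading of the description above (the function names, loops, and margin defaults are illustrative assumptions, not the authors' implementation):

```python
def hinge(pos, neg, margin):
    """Penalize violations of pos > neg + margin (hinge penalty)."""
    return max(0.0, neg + margin - pos)

def structure_loss(f, i, neighbors, non_neighbors, margin=0.1):
    """Graph structure based constraint (sketch): the attention score
    f(i, j) for a one-hop neighbor j should exceed the score f(i, k)
    for any non-neighbor k by at least `margin`."""
    return sum(hinge(f(i, j), f(i, k), margin)
               for j in neighbors for k in non_neighbors)

def boundary_loss(f, i, same_class, diff_class, margin=0.1):
    """Class boundary based constraint (sketch): neighbors sharing i's
    label should receive larger attention scores than neighbors with
    a different label, again by at least `margin`."""
    return sum(hinge(f(i, j), f(i, k), margin)
               for j in same_class for k in diff_class)
```

In training, both losses would be accumulated over all nodes and added to the classification loss, with the margins playing the role of the slack variables above.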

Adaptive Negative Sampling for GNN Training. Negative sampling has proven to be an effective way to optimize the loss function in Eq. 2. Uniform sampling of negative examples is inefficient, since many negative samples become easy to classify as training progresses and thus provide no meaningful information. Here we propose a new approach that chooses negative examples adaptively for each layer.

For a given node, we assume that the important negative samples are the non-neighboring nodes that contribute heavily to the feature aggregation of the other nodes: the more a node contributes to feature aggregation, the more likely it is a good negative candidate. We therefore apply importance sampling to choose negative nodes. The importance of a node can be estimated by Proposition 2 (see the proof in the Appendix).

Proposition 2.

The importance of a node v_j to the feature aggregation in the l-th layer is proportional to Σ_i α_{ij}^{(l)}, the j-th column sum of the attention matrix.

According to Proposition 2, we construct the negative sampler as a weighted random sampler whose weights can be computed efficiently from the attention matrix A_att by summing the attention weights of the corresponding column.
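This sampler can be sketched as follows (our own illustration; the function signature and argument names are assumptions, not the authors' code):

```python
import numpy as np

def sample_negatives(att, node, neighbors, num, rng):
    """Layer-wise adaptive negative sampling (sketch): a non-neighbor's
    importance is taken proportional to its column sum in the layer's
    attention matrix `att`, i.e. its total contribution to feature
    aggregation across the graph."""
    n = att.shape[0]
    candidates = np.array([v for v in range(n)
                           if v not in neighbors and v != node])
    weights = att[:, candidates].sum(axis=0)  # column sums as importance
    probs = weights / weights.sum()
    return rng.choice(candidates, size=num, replace=True, p=probs)
```

Compared with a uniform sampler, nodes that dominate aggregation are drawn far more often, concentrating the margin loss on informative negatives.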

With these two constraints, we optimize the following loss function for node classification:

L = L_cls + λ_g L_g + λ_b L_b,

where L_cls is the loss derived from the node classification error (e.g., cross-entropy loss for multi-class node classification), and λ_g and λ_b are two weight factors that trade off the losses and are data dependent.

Constrained Feature Aggregation

According to our analysis of GATs, over-smoothing arises from information mixing along the bridge nodes connecting two different clusters. In this section, we propose a constrained feature aggregation function to prevent such mixing. For each node, the aggregation function uses only the features of the k neighbors with the largest attention weights rather than all neighbors. By the constraint on attention computation in Eq. 3, the attention weights between nodes from different classes should be small. Therefore, picking the nodes with the top attention weights not only preserves the smoothing effect among features of nodes within the same class, but also drops edges that connect different classes, owing to their small attention weights.

Note that the parameter k trades off smoothing against over-smoothing. In principle, for a connected graph, if all the selected top-k nodes still form a connected graph, the smoothing effect of GNN models is preserved. The top-k feature aggregator can be viewed as a sub-graph selector that chooses a different sub-graph for Laplacian smoothing in each layer. It thus alleviates the over-smoothing of existing GAT models, which always use the same graph structure for feature aggregation, and helps the model go deeper with more layers. However, while a small k allows GAT to go deeper, it might cause high variance in aggregation. In practice, we should therefore select k to balance these two aspects. The sensitivity analysis of k in the experimental section demonstrates this trade-off.
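The top-k aggregation step can be sketched as follows (our own illustration under the description above; names are assumptions):

```python
import numpy as np

def topk_aggregate(alpha, H, k):
    """Constrained aggregation (sketch): for each node keep only its k
    neighbors with the largest attention weights, renormalize the kept
    weights, and aggregate. Edges crossing a class boundary receive
    small attention under Eq. 3, so they tend to be dropped here."""
    n = alpha.shape[0]
    mask = np.zeros_like(alpha)
    for i in range(n):
        top = np.argsort(alpha[i])[-k:]      # indices of the k largest weights
        mask[i, top] = 1.0
    kept = alpha * mask
    kept /= kept.sum(axis=1, keepdims=True)  # renormalize to a distribution
    return kept @ H
```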

Comparison to Attention Dropout in GAT. Attention dropout randomly selects a proportion of the attentions for feature aggregation [Veličković et al.2017]. Experiments have confirmed that it helps GAT training on small graphs, and it can also be viewed as selecting different sub-graphs for Laplacian smoothing, a motivation similar to our constrained feature aggregation. However, the random dropout mechanism still struggles to prevent information mixing along bridge nodes, since a bridge is removed only with probability equal to the dropout probability, leading to poor performance. As an example, Figure 2 (a) shows that the performance of GAT with dropout drops significantly when we add noisy edges to the graphs. The noisy edges act as bridges between nodes with different class labels, and GAT with dropout cannot effectively cope with the more pronounced over-smoothing they cause. In contrast, our method still performs well on these noisy datasets.


Experiments

Data Set. We evaluate the performance of the proposed algorithm C-GAT (Constrained Graph Attention neTworks) on four node classification benchmarks: (1) categorizing academic papers in the citation network datasets Cora, Citeseer, and Pubmed [Sen et al.2008]; and (2) classifying protein functions across various biological protein-protein interaction (PPI) graphs [Zitnik and Leskovec2017]. Table 1 summarizes the statistics of these datasets. Our experiments are conducted over standard data splits [Huang et al.2018]. Following the supervised learning scenario, we use all the labels of the training examples for model training.

Name          Nodes    Edges    Classes  Node features  Train/Dev/Test
Cora (a)       2,708    5,429        7        1,433     1,208/500/1,000
Citeseer (a)   3,327    4,732        6        3,703     1,827/500/1,000
Pubmed (a)    19,717   88,651        3          500    18,217/500/1,000
PPI (b)       56,944  818,716      121*          50           20/2/2

  • a: transductive problem; b: inductive problem; *: multi-label; the 56,944 PPI nodes are spread over 24 graphs: 20 graphs for training, 2 for validation, and 2 for testing.

Table 1: Statistical Information on Benchmarks

Hyper-parameter Settings. For the three transductive learning problems, we use two hidden layers with hidden dimension 32 for Cora and 64 for Citeseer, and three hidden layers with hidden dimension 32 for Pubmed; we set k, the number of neighbors used in the feature aggregation function, to 4 for Cora and Citeseer, and to 8 for Pubmed. For the inductive learning problem PPI, we use three hidden layers with hidden dimension 128 and set k to 8. We use Adam as the optimizer and perform hyper-parameter search for all baselines and our method over the same validation set. The set of margin values used in the two constraint losses is {0.1, 0.2, 0.3, 0.5}, the trade-off factors of the two losses are chosen from {1, 2, 5, 10}, the learning rate from {0.001, 0.003, 0.005, 0.01}, and the regularization factor from {0.0001, 0.0005, 0.001}. We train all models using early stopping with a window size of 100.

Baselines. We compare our C-GAT with the following representative GNN models: GCN [Kipf and Welling2017], GraphSAGE [Hamilton, Ying, and Leskovec2017], and Graph Attention Network (GAT) [Veličković et al.2017]. For the transductive learning problems, since the results in [Kipf and Welling2017, Hamilton, Ying, and Leskovec2017, Veličković et al.2017] were obtained in a semi-supervised setting, we report node classification results from our own experiments following the same hyper-parameter settings reported in those papers. We take the best GraphSAGE results across its different pooling strategies [Huang et al.2018, Veličković et al.2017]. For the inductive learning problem PPI, we also compare C-GAT with two other representative attention-based GNN models, GaAN [Zhang et al.2018] and GeniePath [Liu et al.2018].

Evaluation Settings. We use the same metrics as GAT [Veličković et al.2017] for classification performance evaluation. Specifically, classification accuracy is reported for Cora, Citeseer, and Pubmed, and micro-F1 for the multi-label classification problem PPI. We report the mean and standard deviation of these metrics over 10 runs of the model with different random seeds.

Experimental Results

We investigate the proposed algorithm C-GAT in the following four aspects: (1) classification performance comparison; (2) robustness, which indicates whether C-GAT overcomes the over-fitting problem and improves generalization on unseen graph structure; (3) depth of GNN models, to demonstrate whether C-GAT prevents the over-smoothing problem suffered by GAT; and (4) sensitivity analysis of the number of neighbors used in the feature aggregation function.

Methods      Cora   Citeseer   Pubmed   PPI

GCN          71.0
GraphSAGE    76.8
GAT
C-GAT
  w/o top-k
  w/o NINS

Table 2: Classification Accuracy Ablation and Comparison

Classification. We report results of performance comparison and ablation study in Table 2. From this table, we observe that:

(a) Our model C-GAT performs consistently better than all the baseline models (GCN, GraphSAGE, and GAT) across all benchmarks. Specifically, we improve upon GAT with absolute accuracy gains of 1.2%, 2.6%, 0.6%, and 1.5% on Cora, Citeseer, PubMed, and PPI, respectively. In particular, on the inductive learning problem PPI, we achieve a new state-of-the-art classification performance [Zhang et al.2018, Liu et al.2018].

(b) From the ablation studies in Table 2, we observe that the proposed constraints and the selected-neighborhood aggregation function achieve especially large gains on PPI. In the inductive setting of PPI, the test graphs are completely unseen, where over-fitting of attention is especially significant; this suggests that the proposed constraints help the attention functions generalize well to unseen graph structure. The last row in Table 2 shows the classification accuracy of C-GAT with uniform negative sampling instead of the adaptive node-importance based negative sampling proposed in this paper. The results imply that incorporating node importance into negative sampling benefits the training of C-GAT.

Figure 2: Experiments on Cora. Left: Robustness Analysis: train on the original graphs and test on graphs with randomly added edges; to add edges, we first randomly select a set of nodes according to a given sampling ratio, and then randomly add one edge to each of these nodes. Middle: Deeper GAT: classification performance comparison between GAT and C-GAT at different depths. Right: Sensitivity Analysis: impact of the number of neighbors k on the classification performance of C-GAT.

Robustness Analysis. To demonstrate the robustness of C-GAT, i.e., whether the induced attention function is robust to changes in graph structure, we conduct experiments that perturb edges in the Cora test data. Fig. 2 (a) presents the results. We observe the following:

When edges are randomly added at test time, GAT shows a significant downward trend as the ratio of added edges increases. Randomly added edges may connect different classes, which aggravates the over-smoothing problem of GAT. However, our algorithm C-GAT still makes good predictions even when the ratio of added edges reaches 50%. Because of the class boundary constraint, C-GAT assigns small attention values to these boundary edges, and the proposed selected-neighbor feature aggregation function further eliminates their negative impact. These results demonstrate the better generalization of C-GAT over GAT on unseen graphs (see more robustness results in the Appendix).

Deeper GAT. Fig. 2 (b) compares C-GAT and GAT at different depths on Cora. In contrast to the degradation of GAT with deeper layers due to more pronounced over-smoothing, our proposed C-GAT maintains good classification performance as attention layers are added. Again, these results show that C-GAT effectively overcomes over-smoothing, enabling applications of C-GAT in graph-level tasks where depth is critical [Bünz and Lamm2017].

Sensitivity Analysis of Neighbor Number K. We also analyze the sensitivity of the hyper-parameter K, which controls how many high-attention neighbors are kept in the aggregation step. We run C-GAT with K varied over {1, 2, 4, 6, 8, 10, 20} on “Cora”. We also randomly add 10% noisy edges to “Cora” to increase the chance of information propagation between different classes, and investigate how to set K on these noisy graphs. Fig. 2 (c) shows the impact of K on the classification accuracy of C-GAT on “Cora” and “Cora with 10% noisy edges”.

We observe that the classification accuracy first increases to a peak and then stabilizes or slightly decreases. This means that K trades off between under-smoothing (not enough smoothing to tackle noise) and over-smoothing. For example, in Fig. 2 a smaller K works best on noisy graphs, whereas a larger K achieves the best performance on the original graph.
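The selected-neighbor aggregation that K controls can be sketched as follows, assuming a dense per-node attention matrix; the helper name and the renormalization of the kept weights are illustrative rather than the paper's exact implementation:

```python
import numpy as np

def topk_attention_aggregate(att, feats, k):
    # Keep only the k neighbors with the largest attention weights for
    # each node, renormalize the kept weights to sum to 1, and take the
    # weighted average of the corresponding neighbor features.
    out = np.zeros_like(feats)
    for i in range(att.shape[0]):
        keep = np.argsort(att[i])[-k:]      # indices of k largest weights
        w = att[i, keep]
        out[i] = (w / w.sum()) @ feats[keep]
    return out

att = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2],
                [0.1, 0.1, 0.8]])
feats = np.eye(3)
# keeping all neighbors recovers the plain attention aggregation
assert np.allclose(topk_attention_aggregate(att, feats, 3), att @ feats)
```

Shrinking k discards the low-attention (often boundary) edges before aggregation, which is the mechanism credited above for resisting noisy edges.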


In this paper we analyze two weaknesses of GAT models: over-fitting of the attention function and over-smoothing of node representations in deeper models. We propose a novel approach, the Constrained Graph Attention Network (C-GAT), which addresses these issues by guiding the attention during training with margin-based constraints. In addition, a layer-wise adaptive node-and-edge sampling approach is proposed to augment attention training with effective negative examples. Furthermore, to alleviate the over-smoothing problem, we propose a new feature aggregation function that selects only the neighbors with the top K attention weights rather than all neighbors. Extensive experiments on common benchmark datasets verify the effectiveness of our approach and show significant accuracy gains on standard node classification benchmarks, especially with deeper models and noisy test graphs, compared to state-of-the-art GAT models. A particularly interesting direction for future work is to explore more effective constraints on the attention computation of GAT for other downstream tasks (e.g., link prediction).


Appendix A Proof of Proposition 1

Before we give the proof, we first introduce the concepts of the random walk normalized Laplacian and Laplacian smoothing.

Random Walk Normalized Laplacian. Let $A$ be the attention weight matrix and $D = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i = \sum_j A_{ij}$; then the graph Laplacian of $A$ is defined as $L = D - A$, and $L_{rw} = D^{-1}L$ is the random walk normalized Laplacian of $A$.

Laplacian Smoothing [Li, Han, and Wu2018] on each row of the input feature matrix $X$ is defined as:

$$\hat{x}_i = (1-\gamma)\, x_i + \gamma \sum_{j} \frac{A_{ij}}{d_i}\, x_j, \qquad (5)$$

where $\gamma$ is a parameter that controls the smoothness, i.e., the importance weight of a node's own features relative to the features of its neighbors. We can rewrite the Laplacian smoothing in Eq. 5 in matrix form:

$$\hat{X} = X - \gamma D^{-1} L X = (I - \gamma L_{rw}) X.$$
Since $A$ is the attention weight matrix produced by a softmax, $\sum_j A_{ij} = 1$ for every $i$, so $D = I$ and $L = I - A$. The random walk normalization of $L$ is $L_{rw} = D^{-1}L = I - A$.

We can rewrite the graph attention operation as $H' = AH = H - (I - A)H = (I - L_{rw})H$. According to the formulation of Laplacian smoothing in Eq. 5, we conclude that graph attention is a special form of Laplacian smoothing with $\gamma = 1$. ∎
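The identity in this proof can be checked numerically. The sketch below builds a random row-stochastic attention matrix (so D = I) and verifies that the attention aggregation AH coincides with Laplacian smoothing at γ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Row-stochastic attention matrix: rows sum to 1, as after a softmax,
# so D = I, L = I - A, and L_rw = D^{-1} L = I - A.
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)
H = rng.random((n, d))                     # node features

L_rw = np.eye(n) - A
gamma = 1.0

# Laplacian smoothing H - gamma * L_rw @ H collapses to A @ H at
# gamma = 1, i.e. exactly the attention aggregation step.
smoothed = H - gamma * (L_rw @ H)
assert np.allclose(smoothed, A @ H)
```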

Appendix B Proof of Theorem 1


(1) We can view the random walk on the graph $\mathcal{G}$ as a Markov chain with transition matrix $P = D^{-1}A$. As $\mathcal{G}$ is an undirected, connected and non-bipartite graph, the Markov chain is ergodic [Randall2006, Lovász and others1993], and any finite ergodic Markov chain converges to a unique stationary distribution [Randall2006]. (2) According to the Perron-Frobenius Theorem [Horn and Johnson2012, Chung2005], this stationary distribution is exactly the Perron vector of $P$. For an undirected graph, the Perron vector w.r.t. $P$ is $\pi$ with $\pi_i = d_i / \sum_j d_j$. ∎
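This claim can be verified numerically on a small undirected, connected, non-bipartite graph: power-iterating the transition matrix P = D⁻¹A from any start distribution converges to the degree-proportional Perron vector:

```python
import numpy as np

# Adjacency of an undirected, connected, non-bipartite graph:
# a triangle (0-1-2) with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)                  # degrees: [2, 2, 3, 1]
P = A / d[:, None]                 # transition matrix D^{-1} A

# Power-iterate an arbitrary start distribution; ergodicity guarantees
# convergence to a unique stationary distribution.
pi = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    pi = pi @ P

# The limit is the Perron vector: pi_i = d_i / sum_j d_j.
assert np.allclose(pi, d / d.sum())
```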

Appendix C Proof of Theorem 2


(1) Let $P^{(l)}$ be the transition matrix over the graph $\mathcal{G}$ corresponding to the attention weight matrix of the $l$-th layer of GAT. According to Theorem 1, the random walk on the graph with $P^{(l)}$ converges to a unique stationary distribution that depends only on the degrees of the graph, regardless of $l$; i.e., $\pi^{(l)} = \pi$, where $\pi^{(l)}$ denotes the stationary distribution w.r.t. $P^{(l)}$ and $\pi$ is the unique stationary distribution. (2) Let $p_i^{(l)}$ be the $i$-th row of $P^{(l)} \cdots P^{(1)}$. According to the convergence analysis of random walks in [Randall2006], we have $\|p_i^{(l)} - \pi\| \le \lambda \|p_i^{(l-1)} - \pi\|$, where $\lambda$ is the mixing rate of the random walk. Expanding this inequality recursively, $\|p_i^{(l)} - \pi\| \le \lambda^{l} \|p_i^{(0)} - \pi\|$. Moreover, for a strongly connected graph, the mixing rate satisfies $\lambda < 1$ according to [Randall2006]. Then $\lim_{l \to \infty} \|p_i^{(l)} - \pi\| = 0$, i.e., $p_i^{(l)} \to \pi$. ∎
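The over-smoothing behavior the theorem describes can be illustrated numerically: stacking many attention (smoothing) layers drives all node representations to the same vector. The sketch below uses a single fixed row-stochastic matrix in place of per-layer attention weights, a simplification of the layer-wise setting in the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4

# Strictly positive row-stochastic matrix standing in for the attention
# weights of a strongly connected graph.
A = rng.random((n, n)) + 0.1
A /= A.sum(axis=1, keepdims=True)

H = rng.random((n, d))             # initial node features
Hl = H.copy()
for _ in range(200):               # stack many smoothing layers
    Hl = A @ Hl

# All rows converge to the same vector: node representations become
# indistinguishable, which is the over-smoothing effect.
assert np.allclose(Hl, Hl[0])
```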

Appendix D Observation of Over-Smoothing on Data “Citeseer”

Fig. 3 shows the training loss, training error and validation error of GAT models with different numbers of layers on the benchmark dataset “Citeseer” (see detailed information about the data in Table 1). From this figure, we observe that deeper networks still converge, but a performance degradation problem occurs: as depth increases, accuracy degrades. In this paper, we demonstrate that this performance degradation is mainly due to the over-smoothing effect of deeper GAT models.

Figure 3: Training loss (left), training error (middle) and validation error (right) on Citeseer with 2-layer, 4-layer, 6-layer and 8-layer GAT models. Deeper networks have higher training error, and thus higher validation error.
Figure 4: Left: randomly dropping edges at training time and testing on the original “Cora” graph; Middle: randomly adding edges at training time and testing on the original “Cora” graph. To add edges, we first randomly select a set of nodes according to a given sampling ratio and then randomly add one edge to each of these nodes. Right: classification performance of GAT and C-GAT at different depths on “Citeseer”.

Appendix E Proof of Proposition 2


Let us first review the feature aggregation function in GAT:

$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\Big),$$

where $\alpha_{ij} > 0$ if $j \in \mathcal{N}_i$ and $\alpha_{ij} = 0$ otherwise. We can view $\alpha_{ij}$ as the importance of node $j$ to node $i$ given the graph with its features, and rewrite it as the conditional probability $p(j \mid i)$. If we define $p(j)$ as the probability of sampling node $j$ given all the nodes of the current layer, then $p(j) = \sum_i p(j \mid i)\, p(i)$. According to Bayes' formula, $p(i \mid j) = p(j \mid i)\, p(i) / p(j)$. ∎
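The Bayes step can be illustrated with a small numerical sketch, using a uniform prior p(i) in place of whatever node prior the method actually employs:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# p(j|i): attention of node i on node j, each row a distribution.
cond = rng.random((n, n))
cond /= cond.sum(axis=1, keepdims=True)

# p(i): prior over the nodes of the current layer (uniform here).
prior = np.full(n, 1.0 / n)

marg = prior @ cond                           # p(j) = sum_i p(j|i) p(i)
post = cond * prior[:, None] / marg[None, :]  # p(i|j) via Bayes' formula

# Each column of post is a valid distribution over candidate nodes i.
assert np.allclose(post.sum(axis=0), 1.0)
```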

Appendix F Experimental Results of Robust Analysis and Deeper GAT

To evaluate the robustness of C-GAT, in particular whether the induced attention function is robust to perturbations of the graph structure, we conduct experiments by perturbing edges in the “Cora” data. Fig. 4 presents the experimental results. From this figure, we observe that:

(a) When some edges are randomly dropped at training time (see Fig. 4 (a)), C-GAT maintains relatively stable performance as the ratio of dropped edges increases, whereas the performance of GAT shows a downward trend. This is because, thanks to the two constraints, the attention value C-GAT assigns to an edge missing from training but present at test time remains reliable. If such an edge connects two nodes that share the same label, the constraints push its attention weight higher, yielding a better smoothing operator; if it connects two nodes with different labels, the proposed constraints and feature aggregation function eliminate its negative impact. In contrast, GAT, lacking these constraints, still propagates information across the edge regardless of whether it lies on the classification boundary, may even assign large attention values to boundary edges, and thus suffers from over-smoothing.

(b) When some edges are randomly added at training time (see Fig. 4 (b)), the performance of C-GAT remains relatively stable, while GAT's performance decreases as the ratio of added edges increases. This is because randomly added edges may connect different classes together, resulting in more information propagation between classes, which easily leads to over-smoothing and hurts the quality of the training data. The constraints in C-GAT can be viewed as a data cleaner that improves the quality of the training data; GAT has no such mechanism, so the induced model performs worse at testing time.

(c) Fig. 4 (c) compares C-GAT and GAT at different depths on “Citeseer”. Our proposed C-GAT maintains good classification performance as attention layers are added. Again, these results confirm that over-smoothing is not an issue for C-GAT.