1 Attention meets pooling in graph neural networks
The practical importance of attention in deep learning is well-established and there are many arguments in its favor [1], including interpretability [2, 3]. In graph neural networks (GNNs), attention can be defined over edges [4, 5] or over nodes [6]. In this work, we focus on the latter, because, despite being equally important in certain tasks, it is not as thoroughly studied [7]. To begin our description, we first establish a connection between attention and pooling methods. In convolutional neural networks (CNNs), pooling methods are generally based on uniformly dividing the regular grid (such as a one-dimensional temporal grid in audio) into local regions and taking a single value from each region (average, weighted average, max, stochastic, etc.). Attention in CNNs, in contrast, is typically a separate mechanism that weights the $N$-dimensional input:

$$z_i = \alpha_i \odot x_i, \quad i = 1, \dots, N, \tag{1}$$

where $z_i$ is the output for unit (node in a graph) $i$, $\alpha_i \in [0, 1]$ is its attention coefficient, $\odot$ denotes elementwise multiplication, and $N$ is the number of units in the input (i.e. the number of nodes in a graph).
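As a concrete illustration, Eq. 1 can be sketched in a few lines of numpy (a toy example; the node features and attention scores below are made up, and the softmax is used purely to produce valid coefficients):

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(X, alpha):
    """Eq. 1: z_i = alpha_i * x_i for each of the N units (nodes)."""
    # alpha: (N,) coefficients in [0, 1]; X: (N, C) node features.
    return alpha[:, None] * X  # broadcast the scalar alpha_i over features

# Toy graph with N = 4 nodes and C = 3 features per node (hypothetical values).
X = np.arange(12, dtype=float).reshape(4, 3)
alpha = softmax(np.array([0.1, 2.0, 0.3, 0.5]))
Z = attend(X, alpha)
```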
In GNNs, pooling methods generally follow the same pattern as in CNNs, but the pooling regions (sets of nodes) are often found based on clustering [8, 9, 10], since there is no grid that can be uniformly divided into regions in the same way across all examples (graphs) in the dataset. Recently, top-k pooling [11] was proposed, diverging from other methods: instead of clustering "similar" nodes, it propagates only part of the input, and this part is not uniformly sampled from the input. Top-k pooling can thus select some local part of the input graph, completely ignoring the rest. For this reason, at first glance, it does not appear to be a logical approach.
However, we can notice that pooled feature maps in [11, Eq. 2] are computed in the same way as attention outputs in Eq. 1 above, if we rewrite their Eq. 2 in the following way:

$$z_i = \begin{cases} \alpha_i \odot x_i, & i \in \mathbf{i} \\ \varnothing, & \text{otherwise,} \end{cases} \tag{2}$$

where $\mathbf{i}$ is a set of indices of pooled nodes, $|\mathbf{i}| = \lceil rN \rceil$, and $\varnothing$ denotes that the unit is absent in the output.

The only difference between Eq. 2 and Eq. 1 is that $|\mathbf{i}| \leq N$, i.e. the number of units in the output is smaller or, formally, there exists a ratio $r$ of preserved nodes. We leverage this finding to integrate attention and pooling into a unified computational block of a GNN. In CNNs, by contrast, this is challenging to achieve, because the input is defined on a regular grid, so the resolution must be maintained for all examples in the dataset after each pooling layer. In GNNs, we can remove any number of nodes, so that the next layer simply receives a smaller graph. When applied to the input layer, this form of attention-based pooling also brings interpretability of predictions, since the network makes a decision based only on the pooled nodes.
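In code, top-k pooling in the attention form of Eq. 2 amounts to keeping only the $\lceil rN \rceil$ highest-scoring nodes (a numpy sketch; the ratio and scores below are illustrative):

```python
import numpy as np

def topk_pool(X, alpha, r):
    """Eq. 2: keep only the ceil(r * N) nodes with the largest attention
    coefficients; surviving features are scaled by alpha as in Eq. 1."""
    N = X.shape[0]
    k = int(np.ceil(r * N))
    idx = np.sort(np.argsort(-alpha)[:k])  # pooled node indices, in order
    return alpha[idx, None] * X[idx], idx

# Toy example: 5 nodes, 2 features each, keep 60% of the nodes.
X = np.ones((5, 2))
alpha = np.array([0.05, 0.4, 0.1, 0.3, 0.15])
Z, idx = topk_pool(X, alpha, r=0.6)
```

Nodes 0 and 2 are dropped entirely here, which is exactly the "non-uniform sampling" behavior discussed above.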
[Figure 1: Examples from the three tasks with ground truth attention: (a) Colors, (b) Triangles, (c) MNIST. Ground truth attention is defined heuristically (see Section 3.1).]

Despite the appealing nature of attention, it is often unstable to train and the conditions under which it fails or succeeds are unclear. Motivated by insights of the recently proposed Graph Isomorphism Networks (GIN) [12], we design two simple graph reasoning tasks that allow us to study attention in a controlled environment where we know ground truth attention. The first task is counting colors in a graph (Colors), where a color is a unique discrete feature. The second task is counting the number of triangles in a graph (Triangles). We confirm our observations on a standard benchmark, MNIST [13] (Figure 1), and identify factors influencing the effectiveness of attention.
Our synthetic experiments also allow us to study the ability of attention GNNs to generalize to larger, more complex or noisy graphs. Aiming to provide a recipe to train more effective, stable and robust attention GNNs, we propose a weakly-supervised scheme to train attention that does not require ground truth attention scores and, as such, is agnostic to the dataset and the choice of model. We validate the effectiveness of this scheme on our synthetic datasets, as well as on MNIST and on real graph classification benchmarks in which ground truth attention is unavailable and hard to define, namely Collab [14, 15], Proteins [16], and D&D [17].
2 Model
We study two variants of GNNs: Graph Convolutional Networks (GCN) [18] and Graph Isomorphism Networks (GIN) [12]. One of the main ideas of GIN is to replace the Mean aggregator over nodes, such as the one in GCN, with a Sum aggregator, and to add more fully-connected layers after aggregating neighboring node features. The resulting model can distinguish a wider range of graph structures than previous models [12, Figure 3].
2.1 Thresholding by attention coefficients
To pool the nodes in a graph using the method from [11], a predefined ratio $r$ (Eq. 2) must be chosen for the entire dataset. For instance, for $r = 0.8$ only 80% of nodes are left after each pooling layer. Intuitively, it is clear that this ratio should be different for small and large graphs. Therefore, we propose to choose a threshold $\tilde{\alpha}$, such that only nodes with attention values $\alpha_i > \tilde{\alpha}$ are propagated:

$$z_i = \begin{cases} \alpha_i \odot x_i, & \alpha_i > \tilde{\alpha} \\ \varnothing, & \text{otherwise.} \end{cases} \tag{3}$$
Note that dropping nodes from a graph is different from keeping nodes with very small, or even zero, feature values, because a bias is added to node features after the following graph convolution layer, affecting the features of neighbors. An important potential issue of dropping nodes is the change of graph structure and the emergence of isolated nodes. However, in our experiments we typically observe that the model predicts similar $\alpha_i$ for nearby nodes, so that an entire local neighborhood is pooled or dropped, as opposed to clustering-based methods, which collapse each neighborhood to a single node. We provide a quantitative and qualitative comparison in Section 3.
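A sketch of the threshold-based variant (Eq. 3), including the adjacency-matrix slicing that gives the next layer its smaller graph (toy data; the threshold value is hypothetical):

```python
import numpy as np

def threshold_pool(X, A, alpha, thr):
    """Eq. 3: propagate only nodes with alpha_i > thr; the next layer
    receives the induced subgraph (features and adjacency sliced)."""
    idx = np.where(alpha > thr)[0]
    Z = alpha[idx, None] * X[idx]
    A_sub = A[np.ix_(idx, idx)]
    return Z, A_sub, idx

# Toy 4-node path graph 0-1-2-3 with one-hot features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
alpha = np.array([0.4, 0.3, 0.2, 0.1])
Z, A_sub, idx = threshold_pool(X, A, alpha, thr=0.25)
```

With `thr=0.25`, nodes 2 and 3 are dropped and the pooled subgraph keeps the 0-1 edge; dropping a middle node instead would create isolated nodes, the potential issue discussed above.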
2.2 Attention subnetwork
To train an attention model that predicts the coefficients $\alpha$ for nodes, we consider two approaches: (1) Linear Projection [11], where a single-layer projection is trained, $\alpha = \mathrm{softmax}(Xp)$ with a trainable vector $p$; and (2) DiffPool [10], where a separate GNN is trained:

$$\alpha = \mathrm{softmax}(\mathrm{GNN}(X, A)), \tag{4}$$

where $A$ is the adjacency matrix of a graph. In all cases, we use a softmax activation [1, 2] instead of the tanh in [11], because it provides more interpretable results and encourages sparse outputs. To train attention in a supervised or weakly-supervised way, we use the Kullback-Leibler divergence loss (see Section 3.3).

2.3 ChebyGIN
In some of our experiments, the performance of both GCNs and GINs is quite poor and, consequently, it is also hard for the attention subnetwork to learn. By combining GIN with ChebyNet [8], we propose a stronger model, ChebyGIN. ChebyNet is a multiscale extension of GCN [18]: for the first scale, $s = 0$, node features are the node features themselves; for $s = 1$, features are averaged over one-hop neighbors; for $s = 2$, over two-hop neighbors, and so forth. To implement the Sum aggregator in ChebyGIN, we multiply features by node degrees starting from the second scale. We also add more fully-connected layers after feature aggregation, as in GIN.
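The multiscale aggregation above can be approximated with a short numpy sketch (we use repeated multiplication by the row-normalized adjacency as a simplified stand-in for ChebyNet's actual Chebyshev polynomial recursion):

```python
import numpy as np

def multi_scale_features(X, A, S):
    """Concatenate features over S scales: s = 0 is the features themselves,
    s >= 1 averages over (up to) s-hop neighborhoods via repeated
    multiplication with the row-normalized adjacency (Mean aggregator).
    Multiplying by node degrees instead would give a Sum-style aggregator."""
    A_mean = A / np.maximum(A.sum(1, keepdims=True), 1)
    feats, H = [X], X
    for _ in range(1, S):
        H = A_mean @ H
        feats.append(H)
    return np.concatenate(feats, axis=1)  # shape (N, S * C)

# Toy triangle graph with one-hot features.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = np.eye(3)
F = multi_scale_features(X, A, S=3)
```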
3 Experiments
We introduce the color counting task (Colors) and the triangle counting task (Triangles), for which we generate synthetic training and test graphs. We also experiment with MNIST images [13] and three molecule and social datasets. In the Colors, Triangles and MNIST tasks (Figure 1), we assume to know ground truth attention, i.e. for each node $i$ we heuristically define its importance $\alpha_i^{GT}$ in solving the task correctly, which is necessary to train (in the supervised case) and evaluate our attention models.
3.1 Datasets
Colors.
We introduce the color counting task. We generate random graphs where each node's features are assigned one of three one-hot values (colors): [1,0,0] (red), [0,1,0] (green), [0,0,1] (blue). The task is to count the number of green nodes, $N_{green}$. This is a trivial task, but it lets us study the influence of the initialization of the attention model on the training dynamics. In this task, graph structure is unimportant and the edges of graphs act as a medium to exchange node features. Ground truth attention is $\alpha_i^{GT} = 1/N_{green}$ when node $i$ is green and $\alpha_i^{GT} = 0$ otherwise. We also extend this dataset to higher dimensional cases to study how model performance changes with input dimensionality. In these cases, node features are still one-hot vectors and we classify the number of nodes whose second feature equals one.
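A minimal sketch of the Colors label and ground-truth attention (feature layout as described above; the graph structure is omitted since it is unimportant for this task):

```python
import numpy as np

def colors_label_and_attention(X):
    """For one-hot color features, the label is the number of green nodes
    ([0, 1, 0]) and ground-truth attention is uniform over green nodes,
    zero elsewhere."""
    green = X[:, 1] == 1
    n_green = int(green.sum())
    alpha_gt = green / max(n_green, 1)  # boolean mask -> uniform weights
    return n_green, alpha_gt

X = np.array([[1, 0, 0],   # red
              [0, 1, 0],   # green
              [0, 1, 0],   # green
              [0, 0, 1]])  # blue
label, alpha_gt = colors_label_and_attention(X)
```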
Triangles.
Counting the number of triangles in a graph is a well-known task which can be solved analytically by computing $\mathrm{tr}(A^3)/6$, where $A$ is the adjacency matrix. This task turns out to be hard for GNNs, so we add node degree features as one-hot vectors to all graphs, so that the model can exploit both graph structure and features. Compared to the Colors task, here it is more challenging to study the effect of initializing the attention model, but we can still calculate ground truth attention as $\alpha_i^{GT} = T_i / \sum_j T_j$, where $T_i$ is the number of triangles that include node $i$, so that $\alpha_i^{GT} = 0$ for nodes that are not part of any triangle.
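Both the label and the ground-truth attention for Triangles follow directly from powers of the adjacency matrix:

```python
import numpy as np

def triangles_label_and_attention(A):
    """Total triangles = tr(A^3)/6; node i participates in T_i = (A^3)_ii / 2
    triangles, and ground-truth attention is T_i normalized over the graph
    (zero for nodes outside any triangle)."""
    A3 = A @ A @ A
    T = np.diag(A3) / 2.0
    total = int(round(np.trace(A3) / 6))
    alpha_gt = T / T.sum() if T.sum() > 0 else T
    return total, alpha_gt

# A triangle 0-1-2 with a pendant node 3 attached to node 0.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
total, alpha_gt = triangles_label_and_attention(A)
```

Node 3 belongs to no triangle, so its ground-truth attention is exactly zero, as stated above.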
MNIST-75sp.
MNIST [13] contains 70k grayscale images of size 28×28 pixels. While each of the 784 pixels could be represented as a node, we follow [19, 20] and consider an alternative approach to highlight the ability of GNNs to work on irregular grids. In particular, each image can be represented as a small set of superpixels without losing essential class-specific information (see Figure 2). We compute SLIC [21] superpixels for each image and build a graph in which each node corresponds to a superpixel, with node features being pixel intensity values and the coordinates of their centers of mass. We extract at most 75 superpixels per image, hence the dataset is denoted as MNIST-75sp. Edges are formed based on the spatial distance between superpixel centers, as in [8, Eq. 8]. Each image depicts a handwritten digit from 0 to 9 and the task is to classify the image. Ground truth attention is $\alpha_i^{GT} = 1/N_{nz}$ for superpixels with nonzero intensity, where $N_{nz}$ is the total number of such superpixels. The idea is that only nonzero superpixels determine the digit class.
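The superpixel graph construction can be sketched as follows (a Gaussian kernel over pairwise center distances in the spirit of [8, Eq. 8]; the bandwidth `sigma` and the absence of any nearest-neighbor sparsification are simplifying assumptions):

```python
import numpy as np

def superpixel_graph(centers, sigma=0.2):
    """Edge weights from spatial distance between superpixel centers:
    A_ij = exp(-||c_i - c_j||^2 / sigma^2), with no self-loops."""
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(A, 0.0)
    return A

# Three hypothetical superpixel centers in [0, 1]^2.
centers = np.array([[0.1, 0.1], [0.2, 0.1], [0.9, 0.9]])
A = superpixel_graph(centers)
```

Nearby superpixels (the first two) get a strong edge, while the distant one is connected only weakly.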
Molecule and social datasets.
We extend our study to more practical cases, where ground truth attention is not available, and experiment with protein datasets, Proteins [16] and D&D [17], and a scientific collaboration dataset, Collab [14, 15]. These are standard graph classification benchmarks. A standard way to evaluate models on these datasets is to perform 10-fold cross-validation and report average accuracy [22, 10]. In this work, we are concerned with a model's ability to generalize to larger and more complex or noisy graphs, and therefore we generate splits based on the number of nodes. For instance, for Proteins we train on graphs with 4-25 nodes and test on graphs with up to 620 nodes (see Table 2 for details about the splits of the other datasets and results).
A detailed description of the tasks and model hyperparameters is provided in the Appendix.

3.2 Generalization to larger and noisy graphs
One of the core strengths of attention is that it makes it easier to generalize to unseen, potentially more complex and/or noisy, inputs by reducing their complexity to that of similar inputs in the training set. To examine this phenomenon, for the Colors and Triangles tasks we add test graphs that can be several times larger (TestLarge) than the training ones. For Colors we further extend the test set by adding unseen colors (TestLargeC), constructed so that no new color has a nonzero value in the green channel. This can be interpreted as adding mixtures of red, blue and transparency channels, with nine possible colors in total as opposed to three in the training set (Figure 2).
[Figure 2: Examples of graphs from each dataset and split. Colors: Train, TestOrig, TestLarge, TestLargeC. Triangles: Train, TestOrig, TestLarge. MNIST-75sp: Train, TestOrig, TestNoisy, TestNoisyC.]
Neural networks (NNs) have been observed to be brittle when fed test samples corrupted in a subtle way, e.g. by adding noise [23] or changing a sample in an adversarial way [24], such that a human can still recognize them fairly well. To study this problem, test sets of standard image benchmarks have been enlarged by adding corrupted images [25].
Graph neural networks, as a particular case of NNs, inherit this weakness. The attention mechanism, if designed and trained properly, can improve a network's robustness by attending only to important parts (nodes) of the data and ignoring misleading ones. In this work, we explore the ability of GNNs with and without attention to generalize to noisy graphs and unseen node features. This should help us to understand the limits of GNNs with attention, and potentially of NNs in general, and the conditions under which attention succeeds and when it does not. To this end, we generate two additional test sets for MNIST-75sp. In the first set, TestNoisy, we add Gaussian noise to the superpixel intensity features, i.e. the shapes and coordinates of the superpixels are the same as in the original clean test set. In the second set, TestNoisyC, we colorize images by adding two more channels and add independent Gaussian noise to each channel (Figure 2).
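Generating the corrupted test sets is straightforward (a sketch; the noise standard deviation and the replication-based colorization are assumptions, as the exact values are not restated here):

```python
import numpy as np

def make_test_noisy(X_intensity, std, seed=0):
    """TestNoisy-style corruption: Gaussian noise on superpixel intensities
    only; superpixel shapes, coordinates and graph structure stay identical
    to the clean test set."""
    rng = np.random.default_rng(seed)
    return X_intensity + rng.normal(0.0, std, size=X_intensity.shape)

def make_test_noisy_colors(X_intensity, std, seed=0):
    """TestNoisyC-style corruption: expand to 3 channels (one plausible
    colorization), then add independent noise to each channel."""
    rng = np.random.default_rng(seed)
    X3 = np.repeat(X_intensity, 3, axis=1)
    return X3 + rng.normal(0.0, std, size=X3.shape)

X_clean = np.zeros((5, 1))          # 5 superpixels, 1 intensity channel
X_noisy = make_test_noisy(X_clean, std=0.4)
X_noisy_c = make_test_noisy_colors(X_clean, std=0.4)
```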
3.3 Network architectures and training
We build 2-layer GNNs for Colors and 3-layer GNNs for the other tasks, with 64 filters in each layer, except for MNIST-75sp, where we have more filters. Our baselines are GNNs with global sum or max pooling (gpool), DiffPool [10] and top-k pooling [11]. We add two layers of our pooling for Triangles, each of which is a GNN with 3 layers and 32 filters (Eq. 4); a single pooling layer in the form of a trainable vector $p$ is used in the other cases. We train all models with Adam [26], learning rate $10^{-3}$, batch size 32, weight decay $10^{-4}$ (see Appendix for details).

For Colors and Triangles we minimize the regression loss (MSE), and cross-entropy (CE) for the other tasks, jointly denoted as $\mathcal{L}_{MSE/CE}$. For experiments with supervised and weakly-supervised (described below in Section 3.4) attention, we additionally minimize the Kullback-Leibler (KL) divergence loss between ground truth attention $\alpha^{GT}$ and predicted coefficients $\alpha$, so that the total loss for some training graph with $N$ nodes becomes:

$$\mathcal{L} = \mathcal{L}_{MSE/CE} + \beta \frac{1}{N} \sum_{i=1}^{N} \alpha_i^{GT} \log \frac{\alpha_i^{GT}}{\alpha_i}, \tag{5}$$

where $\beta$ controls the scale and importance of the KL term.
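The KL term of Eq. 5 can be sketched as follows (assuming both attention vectors are normalized distributions over the N nodes of one graph):

```python
import numpy as np

def attention_kl(alpha_gt, alpha, eps=1e-12):
    """Per-node averaged KL(alpha_gt || alpha), the second term of Eq. 5;
    the total loss adds beta * attention_kl(...) to the MSE/CE loss."""
    a_gt = np.clip(alpha_gt, eps, 1.0)
    a = np.clip(alpha, eps, 1.0)
    return float(np.sum(a_gt * np.log(a_gt / a)) / len(a))

uniform = np.full(4, 0.25)                    # predicted attention
peaked = np.array([0.85, 0.05, 0.05, 0.05])   # hypothetical ground truth
loss_match = attention_kl(uniform, uniform)
loss_mismatch = attention_kl(peaked, uniform)
```

The loss vanishes when the prediction matches the ground truth and grows with the mismatch, which is what makes it usable as an attention supervision signal.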
We repeat experiments 10 times and report the average accuracy and standard deviation in Tables 1 and 2. For Colors we run experiments 100 times, since we observe larger variance. In Table 1 we report results on all test subsets independently. In all other experiments on Colors, Triangles and MNIST-75sp, we report the average accuracy on the combined test set. For Collab, Proteins and D&D, we run experiments 10 times using the splits described in Section 3.1.

The only hyperparameters that we tune in our experiments are the threshold $\tilde{\alpha}$ in our method (Eq. 3), the ratio $r$ in top-k (Eq. 2) and $\beta$ in Eq. 5. For the synthetic datasets, we tune them on a validation set generated in the same way as TestOrig. For MNIST-75sp, we use part of the training set. For Collab, Proteins and D&D, we tune them using 10-fold cross-validation on the training set.
Table 1: Classification accuracy and attention correctness (Attn, AUC), mean±std in %.

| | Model | Colors Orig | Colors Large | Colors LargeC | Colors Attn | Triangles Orig | Triangles Large | Triangles Attn | MNIST-75sp Orig | MNIST-75sp Noisy | MNIST-75sp NoisyC | MNIST-75sp Attn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Global pool | GCN | 97 | 72±15 | 20±3 | 99.6 | 46±1 | 23±1 | 79 | 78.3±2 | 38±4 | 36±4 | 72±2 |
| | GIN | 96±10 | 71±22 | 26±11 | 99.2 | 50±1 | 22±1 | 77 | 87.6±3 | 55±11 | 51±12 | 71±5 |
| | ChebyGIN | 100 | 93±12 | 15±7 | 99.8 | 66±1 | 30±1 | 79 | 97.4 | 80±12 | 79±11 | 72±3 |
| Unsuperv. | GIN, top-k | 99.6 | 17±4 | 9±3 | 75±6 | 47±2 | 18±1 | 63±5 | 86±6 | 59±26 | 55±23 | 65±34 |
| | GIN, ours | 94±18 | 13±7 | 11±6 | 72±15 | 47±3 | 20±2 | 68±3 | 82.6±8 | 51±28 | 47±24 | 58±31 |
| | ChebyGIN, top-k | 100 | 11±7 | 6±6 | 79±20 | 64±5 | 25±2 | 76±6 | 92.9±4 | 68±26 | 67±25 | 52±37 |
| | ChebyGIN, ours | 80±30 | 16±10 | 11±6 | 67±31 | 67±3 | 26±2 | 77±4 | 94.6±3 | 80±23 | 77±22 | 78±31 |
| Supervised | GIN, top-k | 87±1 | 39±18 | 28±8 | 99.9 | 49±1 | 20±1 | 88 | 90.5±1 | 85.5±2 | 79±5 | 99.3 |
| | GIN, ours | 100 | 96±9 | 89±18 | 99.8 | 49±1 | 22±1 | 76±1 | 90.9±0.4 | 85.0±1 | 80±3 | 99.3 |
| | ChebyGIN, top-k | 100 | 86±15 | 31±15 | 99.8 | 83±1 | 39±1 | 97 | 95.1±0.3 | 90.6±0.8 | 83±16 | 100 |
| | ChebyGIN, ours | 100 | 94±8 | 75±17 | 99.8 | 88±1 | 48±1 | 96 | 95.4±0.2 | 92.3±0.4 | 86±16 | 100 |
| Weak sup. | ChebyGIN, ours | 100 | 90±6 | 73±14 | 99.9 | 68±1 | 30±1 | 88 | 95.8±0.4 | 88.8±4 | 86±9 | 96.5±1 |
| Upper bound | GIN | 100 | 100 | 100 | 100 | 94±1 | 85±2 | 100 | 93.6±0.4 | 90.8±1 | 90.8±1 | 100 |
| | ChebyGIN | 100 | 100 | 100 | 100 | 99.8 | 99.4±1 | 100 | 96.9±0.1 | 94.8±0.3 | 95.1±0.3 | 100 |
Attention correctness.
We evaluate attention correctness using the area under the ROC curve (AUC) as an alternative to other metrics, such as [27], which can be overoptimistic in some extreme cases, for example when all attention is concentrated in a single node or attention is uniformly spread over all nodes. AUC evaluates the ranking of the $\alpha_i$ instead of their absolute values. To evaluate the attention correctness of models with global pooling, we follow the idea from convolutional neural networks [28]. After training a model, we remove node $i$ and compute the absolute difference from the prediction for the original graph:

$$\alpha_i = \frac{1}{Z} \left| \hat{y} - \hat{y}^{(i)} \right|, \tag{6}$$

where $\hat{y}^{(i)}$ is the model's prediction for the graph without node $i$ and $Z$ is a normalization constant. While this method shows surprisingly high AUC in some tasks, it is not built into training and thus does not help to train a better model; it only implicitly interprets a model's predictions (Figures 5 and 6). However, these results inspired us to design the weakly-supervised method described below.
3.4 Weakly-supervised attention supervision
Although for Colors, Triangles and MNIST-75sp we can define ground truth attention without manual labeling, in practice this is usually not the case: such annotations are hard to define and expensive, or it is even unclear how to produce them. Based on the results in Table 1, supervision of attention is necessary to reveal its power. Therefore, we propose a weakly-supervised approach, agnostic to the choice of dataset and model, that does not require ground truth attention labels, but can improve model performance and generalization ability. Our approach is based on generating attention coefficients (Eq. 6) and using them as labels to train our attention model with the loss defined in Eq. 5. We apply this approach to Colors, Triangles and MNIST-75sp and observe performance and robustness close to supervised models. We also apply it to Collab, Proteins and D&D, and in all cases we are able to improve results compared to unsupervised attention.


[Figure 3: Analysis of attention initialization on Colors, panels (a)-(f), including zoomed views of (a) and (b). (d) The probability of a good initialization is estimated as the proportion of cases when cosine similarity > 0.5; error bars indicate standard deviation. (c-e) show results using a higher dimensional attention model.]

4 Analysis of results
In this work, we aim to better understand attention and generalization in graph neural networks, and, based on our empirical findings, below we provide our analysis for the following questions.
How powerful is attention over nodes in GNNs?
Our results on the Colors, Triangles and MNIST-75sp datasets suggest that the main strength of attention over nodes in GNNs is the ability to generalize to more complex or noisy graphs at test time. This ability essentially transforms a model that fails to generalize into a fairly robust one. Indeed, the classification accuracy gap for Colors-LargeC between the best model without supervised attention (GIN with global pooling) and a similar model with supervised attention (GIN, sup) is more than 60%. For Triangles-Large this gap is 18% and for MNIST-75sp-Noisy it is more than 12%. This gap is even larger if compared to the upper bound cases, indicating that our supervised models can be further tuned and improved. Models with supervised or weakly-supervised attention also have a narrower spread of results (Figure 3).
[Figure 4: Training dynamics of unsupervised (a-c) and supervised (d-f) attention models for bad (cos. sim. = -0.75), good (cos. sim. = 0.75) and optimal (cos. sim. = 1.00) initializations.]
What are the factors influencing performance of GNNs with attention?
We identify three key factors influencing the performance of GNNs with attention: initialization of the attention model (i.e. the vector $p$ or the GNN in Eq. 4), the strength of the main GNN model (i.e. the model that actually performs classification), and finally other hyperparameters of the attention and GNN models.

We highlight initialization as the critical factor. We ran 100 experiments on Colors with random initializations of the attention vector $p$ (Figure 3, (a-e)) and measured how the performance of both attention and classification is affected depending on how close (in terms of cosine similarity) the initialized $p$ was to the optimal one. We disentangle the dependency between classification accuracy and cosine similarity into two functions to make the relationship clearer (Figure 3, (a, c)). Interestingly, we found that classification accuracy depends exponentially on attention correctness and becomes close to 100% only when attention is also close to being perfect. With even slightly worse attention, starting from an AUC of 99%, classification accuracy drops significantly. This is an important finding that may also hold for other, more realistic applications. In the Triangles task we only partially confirm this finding, because our attention models could not achieve an AUC high enough to boost classification. However, based on the upper bound results obtained by training with ground truth attention, we expect this boost to occur once attention becomes accurate enough.
Why is initialization of attention important?
One of the reasons that initialization is so important is that training GNNs with attention is a chicken-and-egg problem. In order to attend to important nodes, the model needs to have a clear understanding of the graph. Yet, in order to gain that level of understanding, the model needs strong attention to avoid focusing on noisy nodes. During training, the attention model predicts attention coefficients $\alpha$ that may be wrong, especially at the beginning of training, but the rest of the GNN model assumes those predictions to be correct and updates its parameters accordingly. This problem is revealed by taking the gradient of the attention function (Eq. 1), $\partial z / \partial \Theta = \alpha \odot \partial f(x; \Theta) / \partial \Theta$, where $x$ are node features and $f$ is some differentiable function with parameters $\Theta$ used to propagate node features, $z = \alpha \odot f(x; \Theta)$. The gradients used to update parameters in gradient descent are scaled by $\alpha$ and thus reinforce potentially wrong predictions, so the model's solution can diverge from the optimal one, as we observe in Figure 4 (a, b). Hence, the performance of such a model largely depends on its initial state, i.e. how accurate the $\alpha$ were after the first forward pass.
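A tiny numerical illustration of this effect (all numbers are made up): with $z = \alpha \odot f(x; \theta)$ and $f(x; \theta) = \theta x$, the gradient $\partial z / \partial \theta = \alpha \odot x$ is scaled by $\alpha$, so a node that the attention model wrongly suppresses contributes almost no learning signal, and the wrong attention is never corrected by the downstream loss alone.

```python
import numpy as np

# Two nodes with equal importance for the task, but attention wrongly
# suppresses node 1 at initialization.
x = np.array([2.0, 2.0])
alpha = np.array([0.9, 0.1])

# f(x; theta) = theta * x, so dz_i/dtheta = alpha_i * x_i.
grad = alpha * x
```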
Why is the variance of some results so high?
In Table 1 we report high variance of results, which is mainly due to initialization of the attention model as explained above. This variance is also caused by initialization of other trainable parameters of a GNN, but we show that once the attention model is perfect, other parameters can recover from a bad initialization leading to improved performance. The opposite, however, is not true: we never observed recovery of a model with poorly initialized attention.
How do results change with increased input dimensionality or capacity of the attention model?
We performed experiments using ChebyGIN-h, a model with a higher-dimensional input to the attention model (see Table 5 in the Appendix for details). In such cases, it becomes very unlikely to initialize the attention model close to the optimum (Figure 3, (c-e)), and attention accuracy concentrates in the 60-80% region. The effect of an attention model with such low accuracy is negligible or even harmful, especially on the large and noisy graphs. We also experimented with a deeper attention model (ChebyGIN-d), i.e. a 2-layer fully-connected network with 32 hidden units for Colors and MNIST-75sp, and a deeper GNN (Eq. 4) for Triangles. This has a positive effect overall, except for Triangles, where our attention models were already deep GNNs.
Table 2: Node-based splits and accuracy (mean±std, %) on Collab, Proteins and D&D.

| | Collab | Proteins | D&D | D&D |
|---|---|---|---|---|
| # train / test graphs | 500 / 4,500 | 500 / 613 | 462 / 716 | 500 / 678 |
| # nodes (N) train | 32-35 | 4-25 | 30-200 | 30-300 |
| # nodes (N) test | 32-492 | 6-620 | 201-5748 | 30-5748 |
| Global max | 65.91±3.37 | 72.66±1.41 | 29.74±4.89 | 72.65±3.64 |
| Unsup, ours | 65.68±3.52 | 76.33±0.62 | 51.90±5.33 | 77.17±2.87 |
| Weak-sup, ours | 66.97±1.65 | 77.09±0.66 | 54.25±4.98 | 78.36±1.09 |
How do results differ depending on the layer to which we apply the attention model?
When an attention model is attached to deeper layers (as we do for Triangles and MNIST-75sp), the signal that it receives is much stronger than at the first layers, which positively influences overall performance. In terms of computational cost, however, it is desirable to attach the attention model closer to the input layer, to reduce graph size at the beginning of a forward pass. This strategy is also more reasonable when we know that attention weights can be determined solely from input features (as in our Colors task), or when the goal is to interpret the model's predictions. In contrast, deeper features contain information about a large neighborhood of nodes, so the importance of a particular node represents the importance of an entire neighborhood, making attention less interpretable.
How does top-k compare to our threshold-based pooling method?
Our method to attend and pool nodes (Eq. 3) is based on top-k pooling [11], and we show that the proposed threshold-based pooling is superior in a principled way. When we use supervised attention, our results are better by more than 40% on Colors-LargeC, by 9% on Triangles-Large and by 3% on MNIST-75sp. In Figure 3 ((a, b) zoomed) we show that GIN and ChebyGIN models with supervised top-k pooling never reach an average accuracy of more than 80%, as opposed to our method, which reaches 100% in many cases.
What is the recipe for more powerful attention GNNs?
We showed that GNNs with supervised training of attention are significantly more accurate and robust, although in the case of a bad initialization it can take a long time to reach the performance of a better initialization. However, supervised attention is often infeasible. We suggested an alternative approach based on weakly-supervised training and validated it on our synthetic (Table 1) and real (Table 2) datasets. In the case of Colors, Triangles and MNIST-75sp we can compare to both unsupervised and supervised models and conclude that our approach shows performance, robustness and relatively low variation (i.e. sensitivity to initialization) similar to supervised models and much better than unsupervised models. In the case of Collab, Proteins and D&D we can only compare to unsupervised and global pooling models, and we confirm that our method can be effectively employed for a wide variety of graph classification tasks and attends to more relevant nodes (Figures 5 and 6).
[Figures 5 and 6: Qualitative comparison of attended nodes on Collab, Proteins and D&D for models with global pooling, unsupervised attention (and its pooled graphs) and weakly-supervised attention (and its pooled graphs).]
5 Conclusion
We have shown that learned attention can be extremely powerful in graph neural networks, but only if it is close to optimal. This is difficult to achieve due to the sensitivity to initialization, especially in the unsupervised setting, where we do not have access to ground truth attention. Thus, we have identified initialization of attention models for high dimensional inputs as an important open issue. We also show that attention can make GNNs more robust to larger and noisy graphs, and that the weakly-supervised approach proposed in our work brings advantages similar to those of supervised models, yet at the same time can be effectively applied to datasets without annotated attention.
Acknowledgments
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation. We are also thankful to Angus Galloway for feedback.
References
 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [2] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:1612.04757, 2016.
 [3] Andreea Deac, Petar Veličković, and Pietro Sormanni. Attentive cross-modal paratope prediction. Journal of Computational Biology, 2018.
 [4] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
 [5] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.
 [6] John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1666–1674. ACM, 2018.
 [7] John Boaz Lee, Ryan A Rossi, Sungchul Kim, Nesreen K Ahmed, and Eunyee Koh. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984, 2018.
 [8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 [9] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
 [10] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.
 [11] Hongyang Gao and Shuiwang Ji. Graph U-Net. Submitted to the Seventh International Conference on Learning Representations (ICLR), 2019.
 [12] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
 [13] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [14] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007.
 [15] Anshumali Shrivastava and Ping Li. A new space for comparing graphs. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 62–71. IEEE Press, 2014.
 [16] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
 [17] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from nonenzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
 [18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [19] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3, 2017.
 [20] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
 [21] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, Sabine Süsstrunk, et al. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
 [22] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
 [23] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th international conference on computer communication and networks (ICCCN), pages 1–7. IEEE, 2017.
 [24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [25] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
 [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [27] Chenxi Liu, Junhua Mao, Fei Sha, and Alan L Yuille. Attention correctness in neural image captioning. arXiv preprint arXiv:1605.09553, 2017.
 [28] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
Appendix
| | Colors | Triangles | MNIST-75sp |
|---|---|---|---|
| # train graphs | 500 | 30,000 | 60,000 |
| # val graphs | 2,500 | 5,000 | 5,000 (from the training set) |
| # test graphs (Orig) | 2,500 | 5,000 | 10,000 |
| # test graphs (Large/Noisy) | 2,500 | 5,000 | 10,000 |
| # test graphs (Large-C/Noisy-C) | 2,500 | – | 10,000 |
| # classes | 11 | 10 | 10 |
| # nodes, train/val | 4–25 | 4–25 | ≤75 |
| # nodes, test | 4–200 | 4–100 | ≤75 |
| # layers and filters | 2 layers, 64 filters in each | 3 layers, 64 filters in each | 3 layers: 4, 64, 512 filters |
| Dropout | 0 | 0 | 0.5 |
| Nonlinearity | ReLU | ReLU | ReLU |
| # pooling layers | 1 | 2 | 1 |
| READOUT layer | global sum | global max | global max |
| GIN aggregator | Sum, 2 layer MLP with 256 hid. units | Sum, 2 layer MLP with 64 hid. units | Sum, 2 layer MLP with 64 hid. units |
| ChebyGIN aggregator | Mean, 1 layer MLP | Sum, 2 layer MLP with 64 hid. units | Mean, 1 layer MLP |
| ChebyGIN max scale | 2 | 7 | 4 |
| Attention model of ChebyGIN-d | 2 layer MLP with 32 hid. units | 4 layer GNN with 32 filters | 2 layer MLP with 32 hid. units |
| Attention model of ChebyGIN-h | 32 features in the input instead of 4 | 128 filters in the first layer instead of 64 | 32 filters in the first layer instead of 4 |
| Attention model | applied to input layer | Same arch. as the class. GNN, but for ChebyGIN, applied to hidden layer (Eq. 4) | applied to hidden layer |
| Optimal weights of attention model | collinear to … | Unknown | Unknown |
| Ground truth attention for a node | | proportional to the number of triangles that include the node | uniform over superpixels (nodes) with nonzero intensity (1 divided by the total number of such superpixels); 0 for other nodes |
| Optimal threshold | chosen in the range from 0.0001 to 0.1 (usually values around … are the best) | | |
| Optimal ratio | chosen in the range from 0.05 to 1.0 with step 0.05 (usually values close to 1.0 are the best) | | |
| Coefficient in loss (Eq. 5 in the paper) | 100 | | |
| Number of clusters in DiffPool | 4 | 4 | 25 |
| Training params | 100 epochs (lr decay after 85 and 95 epochs) | 100 epochs (lr decay after 85 and 95 epochs) | 30 epochs (lr decay after 20 and 25 epochs) |

Rows with a single entry apply to all datasets.
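The threshold and ratio hyperparameters above both control how many nodes survive a pooling layer: the threshold keeps all nodes whose attention score clears it, while the ratio keeps a fixed fraction of top-scoring nodes. A minimal NumPy sketch of such attention-weighted node pooling (an illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def attention_pool(x, alpha, ratio=None, threshold=None):
    """Keep a subset of nodes based on attention scores alpha.

    If `ratio` is given, the top ceil(ratio * N) nodes are kept;
    if `threshold` is given, all nodes with alpha >= threshold are kept.
    Pooled features are x_i * alpha_i, i.e. the surviving nodes are
    weighted by their attention scores.
    """
    n = x.shape[0]
    if ratio is not None:
        k = max(1, int(np.ceil(ratio * n)))
        idx = np.argsort(-alpha)[:k]           # indices of the k largest scores
    else:
        idx = np.where(alpha >= threshold)[0]  # threshold-based selection
    idx = np.sort(idx)                         # preserve original node order
    return x[idx] * alpha[idx, None], idx
```

The next layer then operates on the smaller graph induced by the returned indices; with a ratio close to 1.0 (as the table suggests is often best) only a few low-scoring nodes are dropped per pooling layer.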
In DiffPool, the number of clusters returned after pooling must be fixed before training starts. While this number can be smaller or larger than the number of nodes in a graph, we did not find it beneficial to use DiffPool with more than 4 clusters (the minimal number of nodes in training graphs). Part of the issue is that we train on small graphs and test on large ones, so it is hard to choose a number of clusters suitable for graphs of all sizes.
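A simplified sketch of the DiffPool coarsening step may clarify why the number of clusters must be fixed in advance: the soft assignment matrix always has C columns regardless of the input graph size. This sketch assumes softmax assignments and omits the GNNs that produce the embeddings and assignment logits; all names are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diffpool_coarsen(x, adj, s_logits):
    """One DiffPool coarsening step (simplified).

    x:        (N, F) node features
    adj:      (N, N) adjacency matrix
    s_logits: (N, C) assignment logits; C, the number of clusters,
              is a hyperparameter fixed before training, independent of N.
    """
    s = softmax(s_logits, axis=1)   # soft assignment of each node to C clusters
    x_pool = s.T @ x                # (C, F) pooled cluster features
    adj_pool = s.T @ adj @ s        # (C, C) coarsened adjacency
    return x_pool, adj_pool
```

Because the output always has exactly C "nodes", a value of C chosen for small training graphs may be a poor fit for much larger test graphs, which matches the difficulty described above.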
Fewer epochs than for the attention models, since these models converged faster.
We found that using the Sum aggregator and 2 layer MLPs is not necessary for Colors and MNIST75sp, since the tasks are relatively easy and the standard ChebyNet models performed comparably. For MNIST75sp, the Sum aggregator and 2 layer MLPs were also unstable during training.
Since perfect attention weights can be predicted solely based on input features.
Attention applied to a hidden layer receives a stronger signal than attention applied to the input layer, which improves results and makes it unnecessary to use a GNN to predict attention weights as we do for Triangles.
For supervised and weakly-supervised models, we found it useful to set the ground truth attention to zero for nodes with superpixel intensity smaller than 0.5.
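The ground truth attention scheme for MNIST-75sp described above (uniform over superpixels whose intensity clears a cutoff, zero elsewhere) can be sketched as follows; the function name and `cutoff` argument are our own:

```python
import numpy as np

def ground_truth_attention(intensity, cutoff=0.0):
    """Uniform ground truth attention over 'active' superpixels.

    alpha_i = 1 / N0 for superpixels with intensity above `cutoff`,
    where N0 is the number of such superpixels; alpha_i = 0 otherwise.
    The resulting weights sum to 1 over the graph.
    """
    active = intensity > cutoff
    n0 = active.sum()
    alpha = np.zeros_like(intensity, dtype=float)
    if n0 > 0:
        alpha[active] = 1.0 / n0
    return alpha
```

With `cutoff=0.5`, this reproduces the variant used for the supervised and weakly-supervised models.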
| | Collab | Proteins | D&D | D&D |
|---|---|---|---|---|
| # input dimensionality | 492 | 3 | 89 | 89 |
| # train graphs | 500 | 500 | 462 | 500 |
| # test graphs | 4,500 | 613 | 716 | 678 |
| # classes | 3 (physics research areas) | 2 (enzyme vs non-enzyme) | 2 (enzyme vs non-enzyme) | 2 (enzyme vs non-enzyme) |
| # nodes, train | 32–35 | 4–25 | 30–200 | 30–300 |
| # nodes, test | 32–492 | 6–620 | 201–5748 | 30–5748 |
| # layers and filters | 3 layers, 64 filters in each, followed by a classification layer | | | |
| Dropout | 0.1 | | | |
| Nonlinearity | ReLU | | | |
| # pooling layers | 1 | | | |
| READOUT layer | global max | | | |
| ChebyGIN aggregator | Mean, 1 layer MLP (i.e. equivalent to ChebyNet) | | | |
| ChebyGIN max scale | 3 | | | |
| Optimal threshold | chosen in the range from 0.0001 to 0.1 | | | |
| Coefficient in loss (Eq. 5 in the paper) | chosen in the range from 0.1 to 100 | | | |
| Attention model | 2 layer MLP with 32 hidden units, applied to hidden layer | applied to hidden layer | 2 layer MLP with 32 hidden units, applied to hidden layer | |
| Training params | 50 epochs (lr decay after 25, 35 and 45 epochs) | | | |

Rows with a single entry apply to all datasets.
| Model | Colors Orig | Colors Large | Colors Large-C | Colors Attn | Triangles Orig | Triangles Large | Triangles Attn | MNIST-75sp Orig | MNIST-75sp Noisy | MNIST-75sp Noisy-C | MNIST-75sp Attn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GIN, global pool | 96±10 | 71±22 | 26±11 | 99.2 | 50±1 | 22±1 | 77 | 87.6±3 | 55±11 | 51±12 | 71±5 |
| GIN, DiffPool [10] | 58±4 | 16±2 | 28±3 | 97 | 39±1 | 18±1 | 82 | 83±1 | 54±6 | 43±3 | 50±2 |
| ChebyGIN-d, unsup, ours | 97±13 | 24±8 | 15±5 | 91±21 | 62±14 | 25±3 | 78±2 | 96.4±1 | 88.4±10 | 88.3±10 | 92±15 |
| ChebyGIN-h, unsup, ours | 67±38 | 15±8 | 11 | 69±25 | 59±13 | 25±4 | 76±4 | 95.5±3 | 76±20 | 65±18 | 74±33 |