1 Introduction
The advent of deep learning has led to extensive improvements in technology used to recognize and utilize patterns in data (LeCun et al., 2015). In particular, convolutional neural networks (CNNs) successfully leverage the properties of data such as images, speech, and video on Euclidean domains (grid structure) (Hinton et al., 2012; Krizhevsky et al., 2012; He et al., 2016; Karpathy et al., 2014). CNNs consist of convolutional layers and downsampling (pooling) layers. The convolutional and pooling layers exploit the shift-invariance (also known as stationarity) property and compositionality of grid-structured data (Simoncelli & Olshausen, 2001; Bronstein et al., 2017). As a result, CNNs perform well with a small number of parameters.
In various fields, however, a large amount of data, such as graphs, exists on non-Euclidean domains. For example, social networks, biological networks, and molecular structures can be represented by nodes and edges of graphs (Lazer et al., 2009; Davidson et al., 2002; Duvenaud et al., 2015). Therefore, attempts have been made to extend CNNs to the non-Euclidean domain. Most previous studies have redefined the convolution and pooling layers to process graph data.
To define graph convolution, studies have used spectral (Bruna et al., 2014; Henaff et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016) and non-spectral (Monti et al., 2017; Hamilton et al., 2017; Xu et al., 2018a; Veličković et al., 2018; Morris et al., 2018) methods. The application of graph convolution has resulted in outstanding performance in a variety of fields, including recommender systems (van den Berg et al., 2017; Yao & Li, 2018; Monti et al., 2017), chemical research (You et al., 2018; Zitnik et al., 2018), natural language processing (Bastings et al., 2017; Peng et al., 2018; Yao et al., 2018), and many other tasks as reported in Zhou et al.
There are fewer methods for graph pooling than for graph convolution. Previous studies have adopted pooling methods that consider only graph topology (Defferrard et al., 2016; Rhee et al., 2018). With growing interest in graph pooling, several improved methods have been proposed (Dai et al., 2016; Duvenaud et al., 2015; Gilmer et al., 2017b; Zhang et al., 2018b). They utilize node features to obtain a smaller graph representation. Recently, Ying et al., Gao & Ji, and Cangea et al. have proposed innovative pooling methods that can learn hierarchical representations of graphs. These methods allow Graph Neural Networks (GNNs) to attain scaled-down graphs after pooling in an end-to-end fashion.
However, the above pooling methods have room for improvement. For example, the differentiable hierarchical pooling method of Ying et al. has a quadratic storage complexity, and the number of its parameters depends on the number of nodes. Gao & Ji and Cangea et al. have addressed the complexity issue, but their methods do not take graph topology into account.
Here, we propose SAGPool, a Self-Attention Graph Pooling method for GNNs in the context of hierarchical graph pooling. Our method can learn hierarchical representations in an end-to-end fashion using relatively few parameters. The self-attention mechanism is exploited to distinguish between the nodes that should be dropped and the nodes that should be retained. Because the self-attention mechanism uses graph convolution to calculate attention scores, both node features and graph topology are considered. In short, SAGPool, which has the advantages of the previous methods, is the first method to use self-attention for graph pooling and achieve high performance.
2 Related Work
GNNs have drawn considerable attention due to their state-of-the-art performance on tasks in the graph domain. Studies on GNNs focus on extending the convolution and pooling operations, which are the main components of CNNs, to graphs.
2.1 Graph Convolution
The convolution operation on graphs can be defined in either the spectral or non-spectral domain. Spectral approaches focus on redefining the convolution operation in the Fourier domain, utilizing spectral filters that use the graph Laplacian. Kipf & Welling proposed a layer-wise propagation rule that simplifies the approximation of the graph Laplacian using the Chebyshev expansion method (Defferrard et al., 2016). The goal of non-spectral approaches is to define the convolution operation so that it works directly on graphs. In general non-spectral approaches, the central node aggregates features from adjacent nodes when its features are passed to the next layer, rather than defining the convolution operation in the Fourier domain. Hamilton et al. proposed GraphSAGE, which learns node embeddings through sampling and aggregation. While GraphSAGE operates in a fixed-size neighborhood, Graph Attention Network (GAT) (Veličković et al., 2018), based on attention mechanisms (Bahdanau et al., 2014), computes node representations in entire neighborhoods. Both approaches have improved performance on graph-related tasks.
2.2 Graph Pooling
Pooling layers enable CNN models to reduce the number of parameters by scaling down the size of representations, and thus avoid overfitting. To generalize CNNs to graphs, a pooling method for GNNs is necessary. Graph pooling methods can be grouped into the following three categories: topology based, global, and hierarchical pooling.
Topology based pooling
Earlier works used graph coarsening algorithms rather than neural networks. Spectral clustering algorithms use eigendecomposition to obtain coarsened graphs. However, alternatives were needed due to the time complexity of eigendecomposition. Graclus (Dhillon et al., 2007) computes clustered versions of given graphs without eigenvectors because of the mathematical equivalence between a general spectral clustering objective and a weighted kernel k-means objective. Even in recent GNN models (Defferrard et al., 2016; Rhee et al., 2018), Graclus is employed as a pooling module.
Global pooling
Unlike the previous methods, global pooling methods consider graph features. Global pooling methods use summation or neural networks to pool all the representations of nodes in each layer. Graphs with different structures can be processed because global pooling methods collect all the representations. Gilmer et al. viewed GNNs as message passing schemes, and proposed a general framework for graph classification where entire graph representations could be obtained by utilizing the Set2Set (Vinyals et al., 2015) method. SortPool (Zhang et al., 2018b) sorts embeddings for nodes according to the structural roles of a graph and feeds the sorted embeddings to the next layers.
Hierarchical pooling
Global pooling methods do not learn hierarchical representations, which are crucial for capturing structural information of graphs. The main motivation of hierarchical pooling methods is to build a model that can learn feature- or topology-based node assignments in each layer. Ying et al. proposed DiffPool, a differentiable graph pooling method that can learn assignment matrices in an end-to-end fashion. A learned assignment matrix in layer $l$, $S^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$, contains the probability values of nodes in layer $l$ being assigned to clusters in the next layer $l+1$. Here, $n_l$ denotes the number of nodes in layer $l$. Specifically, nodes are assigned by the following equation:
$$S^{(l)} = \mathrm{softmax}\left(\mathrm{GNN}_l(A^{(l)}, X^{(l)})\right), \qquad A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)} \tag{1}$$
where $X$ denotes the node feature matrix and $A$ is the adjacency matrix.
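The assignment step can be illustrated with a minimal NumPy sketch. It assumes the assignment GNN has already produced a matrix of logits (the `logits` array below is a hypothetical stand-in, filled with random numbers), applies a row-wise softmax, and coarsens the adjacency matrix as $S^\top A S$. The function names are illustrative and are not taken from the DiffPool reference implementation.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def diffpool_coarsen(adj, logits):
    """Soft-assign n_l nodes to n_{l+1} clusters and coarsen the adjacency.

    `logits` stands in for the output of the assignment GNN (assumption).
    """
    s = softmax_rows(logits)      # (n_l, n_{l+1}); every row sums to 1
    adj_next = s.T @ adj @ s      # (n_{l+1}, n_{l+1}); dense in general
    return s, adj_next

# Path graph 0-1-2-3 pooled to two clusters.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2))
s, adj_next = diffpool_coarsen(adj, logits)
```

Because the number of columns of the assignment matrix is fixed when the model is built, the cluster count has to be chosen before any particular input graph is seen.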
Cangea et al. utilized gPool (Gao & Ji, 2019) and achieved performance comparable to that of DiffPool. gPool requires a storage complexity of $O(|V| + |E|)$ whereas DiffPool requires $O(k|V|^2)$, where $V$, $E$, and $k$ denote vertices, edges, and pooling ratio, respectively. gPool uses a learnable vector $p$ to calculate projection scores, and then uses the scores to select the top-ranked nodes. Projection scores are obtained by the dot product between $p$ and the features of all the nodes. The scores indicate the amount of information of nodes that can be retained. The following equation roughly describes the pooling procedure in gPool.
$$y = X^{(l)} p^{(l)} / \lVert p^{(l)} \rVert, \qquad \mathrm{idx} = \text{top-rank}(y, \lceil kN \rceil), \qquad A^{(l+1)} = A^{(l)}_{\mathrm{idx},\mathrm{idx}} \tag{2}$$
As in Equation (2), the graph topology does not affect the projection scores.
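To make the last point concrete, the following NumPy sketch scores nodes by projecting their features onto a vector `p` and keeps the top-ranked ones. It covers only the rough procedure described above; the function name, the toy data, and the use of `ceil(ratio * N)` for the number of kept nodes are assumptions made for illustration. The scores come out identical for two different adjacency matrices, because the adjacency is only used for indexing after selection.

```python
import math
import numpy as np

def gpool_select(x, adj, p, ratio):
    """Score nodes by projection onto p and keep the top ceil(ratio * N)."""
    n = x.shape[0]
    y = x @ p / np.linalg.norm(p)                  # projection scores, shape (N,)
    keep = math.ceil(ratio * n)
    idx = np.argsort(-y, kind="stable")[:keep]     # indices of the largest scores
    return y, idx, adj[np.ix_(idx, idx)]           # adjacency is only indexed

x = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0],
              [0.5, 0.5]])
p = np.array([1.0, 1.0])

adj_path = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]], dtype=float)
adj_empty = np.zeros((4, 4))

y_a, idx_a, sub_a = gpool_select(x, adj_path, p, ratio=0.5)
y_b, idx_b, sub_b = gpool_select(x, adj_empty, p, ratio=0.5)
```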
To further improve graph pooling, we propose SAGPool which can use features and topology to yield hierarchical representations with a reasonable complexity of time and space.
3 Proposed Method
The key point of SAGPool is that it uses a GNN to provide self-attention scores. In Section 3.1, we describe the mechanism of SAGPool and its variants. Model architectures for the evaluations are described in Section 3.2. The SAGPool layer and the model architectures are illustrated in Figure 1 and Figure 2, respectively.
3.1 Self-Attention Graph Pooling
Self-attention mask
Attention mechanisms have been widely used in recent deep learning studies (Parikh et al., 2016; Cheng et al., 2016; Zhang et al., 2018a; Veličković et al., 2018). Such mechanisms make it possible to focus more on important features and less on unimportant features. In particular, self-attention, commonly referred to as intra-attention, allows input features to be the criteria for the attention itself (Vaswani et al., 2017). We obtain self-attention scores using graph convolution. For instance, if the graph convolution formula of Kipf & Welling is used, the self-attention score $Z \in \mathbb{R}^{N \times 1}$ is calculated as follows.
$$Z = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta_{att}\right) \tag{3}$$
where $\sigma$ is the activation function (e.g. $\tanh$), $\tilde{A} \in \mathbb{R}^{N \times N}$ is the adjacency matrix with self-connections (i.e. $\tilde{A} = A + I_N$), $\tilde{D} \in \mathbb{R}^{N \times N}$ is the degree matrix of $\tilde{A}$, $X \in \mathbb{R}^{N \times F}$ is the input features of the graph with $N$ nodes and $F$-dimensional features, and $\Theta_{att} \in \mathbb{R}^{F \times 1}$ is the only parameter of the SAGPool layer. By utilizing graph convolution to obtain self-attention scores, the result of the pooling is based on both graph features and topology. We adopt the node selection method of Gao & Ji and Cangea et al., which retains a portion of the nodes of the input graph even when graphs of varying sizes and structures are given as input. The pooling ratio $k \in (0, 1]$ is a hyperparameter that determines the number of nodes to keep. The top $\lceil kN \rceil$ nodes are selected based on the value of $Z$.
$$\mathrm{idx} = \text{top-rank}(Z, \lceil kN \rceil), \qquad Z_{mask} = Z_{\mathrm{idx}} \tag{4}$$
where top-rank is the function that returns the indices of the top $\lceil kN \rceil$ values, $\cdot_{\mathrm{idx}}$ is an indexing operation, and $Z_{mask}$ is the feature attention mask.
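Graph pooling
An input graph is processed by the operation notated as masking in Figure 1.
$$X' = X_{\mathrm{idx},:}, \qquad X_{out} = X' \odot Z_{mask}, \qquad A_{out} = A_{\mathrm{idx},\mathrm{idx}} \tag{5}$$
where $X_{\mathrm{idx},:}$ is the row-wise (i.e. node-wise) indexed feature matrix, $\odot$ is the broadcasted elementwise product, and $A_{\mathrm{idx},\mathrm{idx}}$ is the row-wise and col-wise indexed adjacency matrix. $X_{out}$ and $A_{out}$ are the new feature matrix and the corresponding adjacency matrix, respectively.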
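The scoring, selection, and masking steps of this section can be sketched together in a few lines of NumPy. This is a minimal dense illustration under stated assumptions, not the authors' implementation: the scores are computed with a Kipf & Welling style normalized adjacency and a tanh activation, the top `ceil(ratio * N)` nodes are kept, and the kept features are scaled by their scores. The names `gcn_norm`, `sag_pool`, and `theta_att`, and the random toy inputs, are hypothetical.

```python
import math
import numpy as np

def gcn_norm(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))   # degrees are >= 1
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sag_pool(x, adj, theta_att, ratio):
    """Score nodes with one graph convolution, keep the top ones, gate features."""
    n = x.shape[0]
    z = np.tanh(gcn_norm(adj) @ x @ theta_att).ravel()  # attention scores, (N,)
    keep = math.ceil(ratio * n)
    idx = np.argsort(-z, kind="stable")[:keep]          # top-ranked node indices
    x_out = x[idx] * z[idx][:, None]                    # broadcasted product
    adj_out = adj[np.ix_(idx, idx)]                     # row- and column-wise indexing
    return x_out, adj_out, idx

# Path graph on 5 nodes with 3 random features per node.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
theta_att = rng.normal(size=(3, 1))   # stands in for a learned weight

x_out, adj_out, idx = sag_pool(x, adj, theta_att, ratio=0.5)
```

With five nodes and a ratio of 0.5, three nodes are retained. The only weight is `theta_att`, whose shape depends on the feature dimension and not on the number of nodes.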
Variation of SAGPool
The main reason for using graph convolution in SAGPool is to reflect the topology as well as node features. The various formulas of GNNs can be substituted for Equation (3), if the GNNs take the node feature matrix and the adjacency matrix as inputs. The generalized equation for calculating the attention score $Z$ is as follows.
$$Z = \sigma\left(\mathrm{GNN}(X, A)\right) \tag{6}$$
where $X$ denotes the node feature matrix and $A$ is the adjacency matrix.
There are several ways to calculate attention scores using not only adjacent nodes but also multi-hop connected nodes. In Equations (7) and (8), we illustrate examples of using two-hop connections, which involve the augmentation of edges and the stacking of GNN layers, respectively. Adding the square of an adjacency matrix creates edges between two-hop neighbors.
$$Z = \sigma\left(\mathrm{GNN}(X, A + A^2)\right) \tag{7}$$
The stack of GNN layers allows for the indirect aggregation of two-hop nodes. In this case, the nonlinearity and the number of parameters of the SAGPool layer increase.
$$Z = \sigma\left(\mathrm{GNN}_2\left(\sigma\left(\mathrm{GNN}_1(X, A)\right), A\right)\right) \tag{8}$$
Equations (7) and (8) can be extended to multi-hop connections.
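The edge augmentation variant can be checked on a small example. The sketch below, written with NumPy and using an illustrative function name, adds the square of the adjacency matrix of a path graph to itself and shows that nodes two steps apart become connected while nodes three steps apart do not.

```python
import numpy as np

def two_hop_augment(adj):
    """Return A + A^2; entry (i, j) of A^2 counts the two-step walks from i to j."""
    return adj + adj @ adj

# Path graph 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
aug = two_hop_augment(adj)
```

Note that the diagonal of $A^2$ holds the node degrees, so the augmented matrix also gains weighted self-loops. Whether these are kept or removed is a design choice that the text above does not specify.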
Another variant is to average multiple attention scores. The average attention score is obtained by $M$ GNNs as follows:
$$Z = \frac{1}{M} \sum_{m=1}^{M} \sigma\left(\mathrm{GNN}_m(X, A)\right) \tag{9}$$
3.2 Model Architecture
According to Lipton & Steinhardt, if numerous modifications are made to a model, it may be difficult to identify which modification contributes to improving performance. For a fair comparison, we adopted the model architectures from Zhang et al. and Cangea et al., and compared the baselines and our method using the same architectures.
Convolution layer
As mentioned in Section 2.1, there are many definitions for graph convolution. Other types of graph convolution may improve performance, but we utilize the widely used graph convolution proposed by Kipf & Welling for all the models. Equation (10) is the same as Equation (3), except for the dimension of $\Theta$.
$$h^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} \Theta\right) \tag{10}$$
where $h^{(l)}$ is the node representation of the $l$-th layer and $\Theta \in \mathbb{R}^{F \times F'}$ is the convolution weight with input feature dimension $F$ and output feature dimension $F'$. The Rectified Linear Unit (ReLU) (Nair & Hinton, 2010) function is used as the activation function.
Readout layer
Inspired by the JK-net architecture (Xu et al., 2018b), Cangea et al. proposed a readout layer that aggregates node features to make a fixed-size representation. The summarized output feature of the readout layer is as follows:
$$s = \frac{1}{N} \sum_{i=1}^{N} x_i \,\Vert\, \max_{i=1}^{N} x_i \tag{11}$$
where $N$ is the number of nodes, $x_i$ is the feature vector of the $i$-th node, and $\Vert$ denotes concatenation.
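As a small illustration of the readout just described, the following NumPy sketch concatenates the node-wise mean and the node-wise maximum of a feature matrix. The function name and the toy matrix are chosen for illustration only. The output length is twice the feature dimension regardless of how many nodes the graph has, and it does not change when the nodes are reordered.

```python
import numpy as np

def readout(x):
    """Concatenate the mean and the max over nodes of an (N, F) feature matrix."""
    return np.concatenate([x.mean(axis=0), x.max(axis=0)])

x = np.array([[1.0, 4.0],
              [3.0, 0.0],
              [2.0, 2.0]])
s = readout(x)                 # mean part [2, 2], max part [3, 4]
s_permuted = readout(x[::-1])  # same nodes in reverse order
```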
Global pooling architecture
We implemented the global pooling architecture proposed by Zhang et al. As shown in Figure 2, the global pooling architecture consists of three graph convolutional layers, and the outputs of each layer are concatenated. Node features are aggregated in the readout layer, which follows the pooling layer. Then graph feature representations are passed to the linear layer for classification.
Table 1: Statistics of the datasets.
Data set | Number of Graphs | Number of Classes | Avg. # of Nodes per Graph | Avg. # of Edges per Graph
D&D | 1178 | 2 | 284.32 | 715.66
PROTEINS | 1113 | 2 | 39.06 | 72.82
NCI1 | 4110 | 2 | 29.87 | 32.30
NCI109 | 4127 | 2 | 29.68 | 32.13
FRANKENSTEIN | 4337 | 2 | 16.90 | 17.88
Hierarchical pooling architecture
In this setting, we implemented the hierarchical pooling architecture from the recent hierarchical pooling study of Cangea et al. As shown in Figure 2, the architecture is composed of three blocks, each of which consists of a graph convolutional layer and a graph pooling layer. The outputs of each block are summarized in the readout layer. The summation of the outputs of each readout layer is fed to the linear layer for classification.
4 Experiments
We evaluate the global pooling and hierarchical pooling methods on the graph classification task. In Section 4.1, we discuss the datasets used for evaluation. Section 4.3 describes how we train the models. The methods compared in the experiments are introduced in Sections 4.4 and 4.5.
Table 2: The grid search space for the hyperparameters.
Hyperparameter | Range
Learning rate | 1e-2, 5e-2, 1e-3, 5e-3, 1e-4, 5e-4
Hidden size | 16, 32, 64, 128
Weight decay (L2 regularization) | 1e-2, 1e-3, 1e-4, 1e-5
Pooling ratio | 1/2, 1/4
4.1 Datasets
Five datasets with a large number of graphs ($>$ 1k) were selected among the benchmark datasets (Kersting et al., 2016). The statistics of the datasets are summarized in Table 1.
D&D (Dobson & Doig, 2003; Shervashidze et al., 2011) contains graphs of protein structures. A node represents an amino acid, and an edge is constructed if the distance between two nodes is less than 6 Å. A label denotes whether a protein is an enzyme or non-enzyme. PROTEINS (Dobson & Doig, 2003; Borgwardt et al., 2005) is also a set of proteins, where nodes are secondary structure elements. If nodes have edges, the nodes are in an amino acid sequence or in a close 3D space. NCI (Wale et al., 2008) is a biological dataset used for anticancer activity classification. In the dataset, each graph represents a chemical compound, with nodes and edges representing atoms and chemical bonds, respectively. NCI1 and NCI109 are commonly used as benchmark datasets for graph classification. FRANKENSTEIN (Orsini et al., 2015) is a set of molecular graphs (Costa & Grave, 2010) with node features containing continuous values. A label denotes whether a molecule is a mutagen or non-mutagen.
Models | D&D | PROTEINS | NCI1 | NCI109 | FRANKENSTEIN
Set2Set$_g$
SortPool$_g$
SAGPool$_g$ (Ours)
DiffPool$_h$
gPool$_h$
SAGPool$_h$ (Ours)
Table 3: Average accuracy and standard deviation of the 20 random seeds. The subscript $g$ (e.g. SAGPool$_g$) denotes the global pooling architecture and the subscript $h$ (e.g. SAGPool$_h$) denotes the hierarchical pooling architecture.
Graph Convolution | D&D | PROTEINS

SAGPool  
SAGPool  
SAGPool  
SAGPool  
SAGPool  
SAGPool  
SAGPool  
SAGPool 
4.2 Evaluation of GNNs
The same early stopping criterion and hyperparameter selection strategy are used for all the models to ensure a fair comparison.
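A shared early stopping criterion of the kind described in Section 4.3 (stop when the validation loss has not improved for a fixed number of epochs, subject to a maximum epoch count) can be sketched in plain Python. The function below is an illustrative assumption about how such a rule might be written, not the authors' training code. It consumes a sequence of per-epoch validation losses, which stands in for a real training loop.

```python
def early_stopping_epoch(val_losses, patience=50, max_epochs=100_000):
    """Return (last_epoch_run, best_epoch) for a stream of validation losses.

    Training stops once `patience` epochs have passed without a new best
    loss, or when `max_epochs` epochs have been run.
    """
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if best_epoch is None or loss < best_loss:
            best_loss, best_epoch = loss, epoch     # new best: reset the counter
        elif epoch - best_epoch >= patience:
            break                                   # no improvement for `patience` epochs
    return last_epoch, best_epoch

losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
stopped_early = early_stopping_epoch(losses, patience=3)
ran_to_end = early_stopping_epoch(losses)           # default patience of 50
```

With a patience of 3, training stops at epoch 4 and never reaches the lower loss at epoch 5, which shows why the same patience value has to be used for every model being compared.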
4.3 Training Procedures
Shchur et al. demonstrate that different splits of data can affect the performance of GNN models. In our experiments, we evaluated the pooling methods over 20 random seeds using 10-fold cross validation. A total of 200 testing results were used to obtain the final accuracy of each method on each dataset. 10 percent of the training data was used for validation in the training session. The Adam optimizer (Kingma & Ba, 2014), early stopping criterion, patience, and hyperparameter selection strategy were the same for the global pooling architecture and the hierarchical pooling architecture. Training was stopped if the validation loss did not improve for 50 epochs, with a maximum of 100k epochs, as done in Shchur et al. (2018). The optimal hyperparameters are obtained by grid search. The ranges of grid search are summarized in Table 2.
4.4 Baselines
We consider the following four pooling methods as baselines: Set2Set, SortPool, DiffPool, and gPool. DiffPool, gPool, and SAGPool were compared using the hierarchical pooling architecture while Set2Set, SortPool, and SAGPool were compared using the global pooling architecture. We used the same hyperparameter search strategy for all the baselines and SAGPool. The hyperparameters are summarized in Table 2.
Set2Set (Vinyals et al., 2015) requires an additional hyperparameter, which is the number of processing steps for the LSTM (Hochreiter & Schmidhuber, 1997) module. We use 10 processing steps for all the experiments. We assume that the readout layer is unnecessary because the LSTM module produces embeddings for graphs that are invariant to the order of nodes.
SortPool (Zhang et al., 2018b) is a recent global pooling method which uses sorting for pooling. The number of nodes $K$ is set such that 60% of graphs have more than $K$ nodes. In the global pooling setting, SAGPool has the same number of output nodes as SortPool.
DiffPool (Ying et al., 2018) is the first end-to-end trainable graph pooling method that can produce hierarchical representations of graphs. We did not use batch normalization for DiffPool because it is not related to the pooling method. For the hyperparameter search, the pooling ratio ranges from 0.25 to 0.5 for the following reasons. In the reference implementation, the cluster size is set to 25% of the maximum number of nodes. DiffPool causes an out-of-memory error when the pooling ratio is larger than 0.5.
gPool (Gao & Ji, 2019) selects top-ranked nodes for pooling, which makes it similar to our method. The comparison between our method and gPool demonstrates that considering topology can help improve performance on the graph classification task.
4.5 Variations of SAGPool
As mentioned in Section 3.1, three variations of SAGPool are used to obtain attention scores $Z$. In our experiments, we compared each variant on the two datasets. First, any kind of GNN can be applied to Equation (6). We compared the performance of the three most widely used GNNs (SAGPool$_{Cheb}$, SAGPool$_{SAGE}$, SAGPool$_{GAT}$). Second, we made the following modifications to SAGPool so that it can consider the two-hop connection: an edge augmentation (SAGPool$_{augmentation}$) in Equation (7) and a stack of GNN layers (SAGPool$_{serial}$) in Equation (8). Last, multiple GNNs calculate attention scores and the scores are averaged to obtain the final attention score (SAGPool$_{parallel}$). We evaluated the performance of $M = 2$ and $M = 4$ using Equation (9). The results are summarized in Table 4.
4.6 Summary of Results
The results are summarized in Tables 3 and 4. The accuracies and standard deviations are given in percentages. The comparison of the global pooling methods with SAGPool shows that SAGPool generally performs well, and especially well on D&D and PROTEINS. In the hierarchical pooling setting, SAGPool outperformed the other hierarchical pooling methods on all the datasets. We also compared variants of SAGPool with the hierarchical pooling architecture on the two benchmark datasets. The performance of the variants of SAGPool varied. The experimental results of the SAGPool variants show that SAGPool has the potential to improve performance. A detailed analysis of the experimental results is provided in the next section.
5 Analysis
In this section, we provide an analysis of the experimental results. In Section 5.1, we compare global pooling and hierarchical pooling. Section 5.2 explains how the SAGPool method addresses the shortcomings of the gPool method. In Sections 5.3 and 5.4, we compare the efficiency of SAGPool with that of DiffPool. We provide an analysis of SAGPool variants in Section 5.5.
5.1 Global and Hierarchical Pooling
It is difficult to determine whether the global pooling architecture or the hierarchical pooling architecture is strictly more beneficial to graph classification. Since the global pooling architecture (SAGPool$_g$, SortPool$_g$, Set2Set$_g$) minimizes the loss of information, it performs better than the hierarchical pooling architecture (SAGPool$_h$, gPool$_h$, DiffPool$_h$) on datasets with fewer nodes (NCI1, NCI109, FRANKENSTEIN). However, the hierarchical pooling architecture is more effective on datasets with a large number of nodes (D&D, PROTEINS) because it efficiently extracts useful information from large-scale graphs. Therefore, it is important to use the pooling architecture that is the most suitable for the given data. Nonetheless, SAGPool tends to perform well with each architecture.
5.2 Effect of Considering Graph Topology
To calculate the attention scores of nodes, SAGPool utilizes the graph convolution in Equation (3). Unlike gPool, SAGPool uses the $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ term, which is the first-order approximation of the graph Laplacian. This term allows SAGPool to consider graph topology. As shown in Table 3, considering graph topology improves performance. In addition, the graph Laplacian does not have to be recalculated because it is the term used in a previous graph convolutional layer in the same block. Although SAGPool has the same number of parameters as gPool (Figure 3), it achieves superior performance in the graph classification task.
5.3 Sparse Implementation
Manipulating graph data with a sparse matrix is important for GNNs because the adjacency matrix is usually sparse. When graph convolution is calculated using a dense matrix, the computational complexity of the multiplication $AX$ is $O(|V|^2)$, where $A$ is the adjacency matrix, $X$ is the feature matrix of nodes, and $V$ denotes vertices. Pooling with a dense matrix causes the memory efficiency problem, as mentioned by Cangea et al. (2018). However, if a sparse matrix is used in the same operation, the computational complexity is reduced to $O(|E|)$, where $E$ represents the edges. Since SAGPool is a sparse pooling method, it can reduce its computational complexity, unlike DiffPool, which is a dense pooling method. Sparseness also affects space complexity. Since SAGPool uses a GNN for obtaining attention scores, SAGPool requires $O(|V| + |E|)$ of storage for sparse pooling whereas dense pooling methods need $O(|V|^2)$.
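The difference between the two representations can be seen in a plain-Python sketch that aggregates a single feature column over neighbors. The dense version loops over every entry of an N-by-N matrix, while the edge-list version performs one update per stored edge. Both functions and the toy graph are illustrative assumptions; a real implementation would rely on a sparse tensor library instead.

```python
def aggregate_dense(adj, x):
    """Compute A x from a dense N x N matrix: visits all N * N entries."""
    n = len(adj)
    return [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]

def aggregate_sparse(num_nodes, edges, x):
    """Compute A x from directed edges (i, j): one update per edge."""
    out = [0.0] * num_nodes
    for i, j in edges:
        out[i] += x[j]
    return out

# Undirected path 0-1-2-3, stored as directed edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
num_nodes = 4
adj = [[0.0] * num_nodes for _ in range(num_nodes)]
for i, j in edges:
    adj[i][j] = 1.0

x = [1.0, 2.0, 3.0, 4.0]            # one feature value per node
dense = aggregate_dense(adj, x)
sparse = aggregate_sparse(num_nodes, edges, x)
```

Here the dense version touches 16 matrix entries and the sparse version touches 6 edges. The gap widens quickly on graphs such as those in D&D, which average about 284 nodes but only about 716 edges.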
5.4 Relation with the Number of Nodes
In DiffPool, the cluster size has to be defined when constructing a model because a GNN produces an assignment matrix $S$ as stated in Equation (1). The cluster size has to be proportional to the maximum number of nodes according to the reference implementation. These requirements of DiffPool can lead to two problems. First, the number of parameters is dependent on the maximum number of nodes, as shown in Figure 3. Second, it is difficult to determine the right cluster size when the number of nodes varies greatly. For example, in D&D, only 10 out of 1178 graphs have over 1000 nodes, where the maximum number of nodes is 5748 and the minimum is 30. The cluster size is 574 if the pooling ratio is 10%, which expands the size of graphs after pooling for most of the data. On the other hand, in SAGPool, the number of parameters is independent of the cluster size. In addition, the cluster size can be changed based on the number of input nodes.
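The arithmetic behind this example can be written out in a few lines of Python. The sketch assumes that the fixed cluster size is obtained by truncating the product of the maximum node count and the pooling ratio (which reproduces the value 574 quoted above), and that a ratio-based method keeps `ceil(ratio * N)` nodes of each graph. Both helper functions are hypothetical and named for illustration.

```python
import math

def fixed_cluster_size(max_nodes, ratio):
    """Cluster count fixed at model construction (assumption: truncation)."""
    return int(max_nodes * ratio)

def kept_nodes(num_nodes, ratio):
    """Number of nodes a ratio-based pooling keeps for one graph."""
    return math.ceil(num_nodes * ratio)

max_nodes, min_nodes, ratio = 5748, 30, 0.10

fixed = fixed_cluster_size(max_nodes, ratio)   # same for every graph
smallest = kept_nodes(min_nodes, ratio)        # depends on the input graph
largest = kept_nodes(max_nodes, ratio)
```

With a fixed size of 574 clusters, the smallest graph (30 nodes) grows by a factor of about 19 after "pooling", whereas a per-graph ratio reduces it to 3 nodes.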
5.5 Comparison of the SAGPool Variants
To investigate the potential of our method, we evaluated SAGPool variants on two datasets. SAGPool can be modified in the following ways: changing the type of GNN, considering the two-hop connections, and averaging the attention scores of multiple GNNs. As shown in Table 4, the performance on graph classification varies depending on which dataset and type of GNN in SAGPool are used. We used two techniques to consider two-hop connections. The attention scores obtained by the two sequential GNN layers (SAGPool$_{serial}$) reflect the information of two-hop neighbors. Another technique is to add the square of an adjacency matrix to itself, resulting in a new adjacency matrix that has two-hop connectivity. Without any modifications to the SAGPool layer, the new adjacency matrix can be processed in SAGPool. The information of two-hop neighbors may help improve performance. The last variant of SAGPool averages the attention scores from multiple GNNs. We found that choosing the right $M$ for the dataset can help achieve stable performance.
5.6 Limitations
We retain a certain percentage (pooling ratio $k$) of nodes to handle different input graphs of various sizes, which has also been done in previous studies (Gao & Ji, 2019; Cangea et al., 2018). In SAGPool, we cannot parameterize the pooling ratios to find optimal values for each graph. To address this limitation, we used binary classification to decide which nodes to preserve, but this did not completely solve the issue.
6 Conclusion
In this paper, we proposed SAGPool, a novel graph pooling method based on self-attention. Our method has the following features: hierarchical pooling, consideration of both node features and graph topology, reasonable complexity, and end-to-end representation learning. SAGPool uses a consistent number of parameters regardless of the input graph size. Extensions of our work may include using learnable pooling ratios to obtain optimal cluster sizes for each graph and studying the effects of multiple attention masks in each pooling layer, where final representations can be derived by aggregating different hierarchical representations. Our experiments were run on an NVIDIA TitanXp GPU. We implemented all the baselines and SAGPool using PyTorch (Paszke et al., 2017) and the geometric deep learning extension library provided by Fey et al.
References
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.

 Bastings et al. (2017) Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., and Simaan, K. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1957–1967, 2017.
 Borgwardt et al. (2005) Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
 Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 Bruna et al. (2014) Bruna, J., Zaremba, W., Szlam, A., and Lecun, Y. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, 2014.
 Cangea et al. (2018) Cangea, C., Veličković, P., Jovanović, N., Kipf, T., and Liò, P. Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287, 2018.
 Cheng et al. (2016) Cheng, J., Dong, L., and Lapata, M. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 551–561, 2016.
 Costa & Grave (2010) Costa, F. and Grave, K. D. Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 255–262. Omnipress, 2010.
 Dai et al. (2016) Dai, H., Dai, B., and Song, L. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711, 2016.
 Davidson et al. (2002) Davidson, E. H., Rast, J. P., Oliveri, P., Ransick, A., Calestani, C., Yuh, C.-H., Minokawa, T., Amore, G., Hinman, V., Arenas-Mena, C., et al. A genomic regulatory network for development. science, 295(5560):1669–1678, 2002.
 Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
 Dhillon et al. (2007) Dhillon, I. S., Guan, Y., and Kulis, B. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11), 2007.
 Dobson & Doig (2003) Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
 Duvenaud et al. (2015) Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232, 2015.
 Fey et al. (2018) Fey, M., Lenssen, J. E., Weichert, F., and Müller, H. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 Gao & Ji (2019) Gao, H. and Ji, S. Graph U-Net, 2019. URL https://openreview.net/forum?id=HJePRoAct7.
 Gilmer et al. (2017a) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. CoRR, abs/1704.01212, 2017a. URL http://arxiv.org/abs/1704.01212.
 Gilmer et al. (2017b) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272, 2017b.
 Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Henaff et al. (2015) Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Karpathy et al. (2014) Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
 Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lazer et al. (2009) Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. Life in the network: the coming age of computational social science. Science (New York, NY), 323(5915):721, 2009.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Lipton & Steinhardt (2018) Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341, 2018.
 Monti et al. (2017) Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. CVPR, volume 1, pp. 3, 2017.
 Morris et al. (2018) Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. CoRR, abs/1810.02244, 2018. URL http://arxiv.org/abs/1810.02244.

 Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
 Orsini et al. (2015) Orsini, F., Frasconi, P., and De Raedt, L. Graph invariant kernels. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pp. 3756–3762, 2015.
 Parikh et al. (2016) Parikh, A., Täckström, O., Das, D., and Uszkoreit, J. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249–2255, 2016.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W, 2017.
 Peng et al. (2018) Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., and Yang, Q. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1063–1072. International World Wide Web Conferences Steering Committee, 2018.
 Rhee et al. (2018) Rhee, S., Seo, S., and Kim, S. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3527–3534. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/490. URL https://doi.org/10.24963/ijcai.2018/490.
 Shchur et al. (2018) Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. CoRR, abs/1811.05868, 2018. URL http://arxiv.org/abs/1811.05868.
 Shervashidze et al. (2011) Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Simoncelli & Olshausen (2001) Simoncelli, E. P. and Olshausen, B. A. Natural image statistics and neural representation. Annual review of neuroscience, 24(1):1193–1216, 2001.
 van den Berg et al. (2017) van den Berg, R., Kipf, T. N., and Welling, M. Graph convolutional matrix completion. stat, 1050:7, 2017.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.
 Vinyals et al. (2015) Vinyals, O., Bengio, S., and Kudlur, M. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
 Wale et al. (2008) Wale, N., Watson, I. A., and Karypis, G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
 Xu et al. (2018a) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018a.
 Xu et al. (2018b) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018b.
 Yao & Li (2018) Yao, K.-L. and Li, W.-J. Convolutional geometric matrix completion. arXiv preprint arXiv:1803.00754, 2018.
 Yao et al. (2018) Yao, L., Mao, C., and Luo, Y. Graph convolutional networks for text classification. arXiv preprint arXiv:1809.05679, 2018.
 Ying et al. (2018) Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. CoRR, abs/1806.08804, 2018. URL http://arxiv.org/abs/1806.08804.
 You et al. (2018) You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6412–6422. Curran Associates, Inc., 2018.
 Zhang et al. (2018a) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018a.
 Zhang et al. (2018b) Zhang, M., Cui, Z., Neumann, M., and Chen, Y. An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence, 2018b.
 Zhou et al. (2018) Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
 Zitnik et al. (2018) Zitnik, M., Agrawal, M., and Leskovec, J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):457–466, 2018.