1 Introduction
Convolutional Neural Networks (CNNs) have seen wide success in domains where data is restricted to a Euclidean space. These methods exploit properties such as stationarity of the data distributions, locality and a well-defined notion of translation, but cannot model data that is non-Euclidean in nature. Such structure is naturally present in many domains, such as chemistry, physics, social networks, transportation systems, and 3D geometry, and can be expressed by graphs bronstein2017geometric ; hamilton2017representation . By defining an operation on graphs analogous to convolution, Graph Convolutional Networks (GCNs) have extended CNNs to graph-based data. The earliest methods performed convolution in the spectral domain bruna2013spectral , but subsequent work has proposed generalizations of convolution in the spatial domain. There have been multiple successful applications of GCNs to node classification velickovic2017graph and link prediction schlichtkrull2018modeling , whereas we target graph classification, similarly to simonovsky2017dynamic .
Our focus is on multigraphs, i.e. graphs that are permitted to have multiple edges between the same pair of nodes. Multigraphs are important in many domains, such as chemistry and physics. The challenge of generalizing convolution to graphs and multigraphs is to have anisotropic convolution kernels (such as edge detectors). Anisotropic models, such as MoNet monti2017geometric and SplineCNN fey2018splinecnn , rely on coordinate structure and work well for vision tasks, but are suboptimal for non-visual graph problems. Other general models exist gilmer2017neural ; battaglia2018relational , but making them efficient for a variety of tasks conflicts with the “no free lunch theorem”.
Compared to non-spectral GCNs, spectral models have filters with more global support, which is important for capturing complex relationships. We rely on Chebyshev GCNs (ChebNet) defferrard2016convolutional , which enjoy explicit control of the receptive field size. Even though ChebNet was originally derived from spectral methods bruna2013spectral , it does not suffer from their main shortcoming: sensitivity of learned filters to graph size and structure.
Contributions: We propose a scalable spectral GCN that learns from multigraphs by capturing multi-relational graph paths as well as multiplicative and additive interactions to reduce model complexity and learn richer representations. We also learn new abstract relationships between graph nodes, beyond the ones annotated in the datasets. To our knowledge, we are the first to demonstrate that spectral methods can efficiently solve problems with variable graph size and structure, where this kind of method is generally believed not to perform well.
2 Multigraph Convolution
While we provide the background to understand our model, a review of spectral graph methods is beyond the scope of this paper. Section 6.1 of the Appendix reviews spectral graph convolution.
2.1 Approximate spectral graph convolution
We consider an undirected, possibly disconnected, graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes, $\mathcal{V}$, and edges, $\mathcal{E}$, having values in range $[0, 1]$. Nodes usually represent specific semantic concepts such as atoms in a chemical compound or users in a social network. Nodes can also denote abstract blocks of information with common properties, such as superpixels in images. Edges define the relationships between nodes and the scope over which node effects may propagate.
In spectral graph convolution bruna2013spectral , the filter is defined on the entire input space. Although this makes filters global, which helps to capture complex relationships, it is also desirable to have local support, since the data often have local structure and since we want to learn filters independent of the input size to make the model scalable.
To address this issue, we can model this filter as a function of the eigenvalues $\Lambda$ (which are assumed to be constant) of the normalized symmetric graph Laplacian $L$: $g_\theta(\Lambda)$. We can then approximate it as a sum of $K$ terms using the Chebyshev expansion, where each term contains powers $\Lambda^k$. Finally, we apply the property of eigendecomposition:

$$L^k = (U \Lambda U^T)^k = U \Lambda^k U^T. \qquad (1)$$

By combining this property with the Chebyshev expansion of $g_\theta(\Lambda)$, we exclude the eigenvectors $U$, which are often infeasible to compute, from spectral graph convolution, and instead express the convolution as a function of the graph Laplacian $L$. In general, for the input $X \in \mathbb{R}^{N \times C_{in}}$ with $N$ nodes and $C_{in}$-dimensional features in each node, the approximate convolution is defined as:

$$Y = \bar{X} \Theta, \qquad (2)$$

where $\bar{X} \in \mathbb{R}^{N \times C_{in} K}$ are features projected onto the Chebyshev basis $T_k(\tilde{L})$ and concatenated for all orders $k \in [0, K-1]$, and $\Theta \in \mathbb{R}^{C_{in} K \times C_{out}}$ are trainable weights, where $\tilde{L} = L - I$.
This approximation scheme was proposed in defferrard2016convolutional , and Eq. 2 defines the convolutional layer in the Chebyshev GCN (ChebNet), which is the basis for our method. Convolution is an essential computational block in graph networks, since it permits the gradual aggregation of information from neighboring nodes. By stacking the operator in Eq. 2, we capture increasingly larger neighborhoods and learn complex relationships in graph-structured data.
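As a rough illustration of the layer in Eq. 2, the following NumPy sketch projects node features onto the Chebyshev basis with the usual three-term recurrence and applies a weight matrix. It is a minimal dense sketch, not the authors' implementation: the function names, the toy path graph, and the random weights are all assumptions made for the example, and a practical version would use sparse matrix products.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    # Isolated nodes get a zero scaling factor instead of a division by zero.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_basis(L_tilde, X, K):
    """Return [T_0(L~)X, ..., T_{K-1}(L~)X] concatenated along the feature axis."""
    terms = [X]
    if K > 1:
        terms.append(L_tilde @ X)
    for _ in range(2, K):
        # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return np.concatenate(terms, axis=1)

def cheb_conv(A, X, Theta, K):
    """Approximate spectral convolution Y = X_bar Theta with L~ = L - I."""
    L_tilde = normalized_laplacian(A) - np.eye(A.shape[0])
    return chebyshev_basis(L_tilde, X, K) @ Theta

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy path graph with N = 4 nodes
X = rng.normal(size=(4, 3))                 # C_in = 3 features per node
K, C_out = 3, 5
Theta = rng.normal(size=(3 * K, C_out))     # random stand-in for trained weights
Y = cheb_conv(A, X, Theta, K)               # shape (N, C_out)
```

With $K = 3$ the output at a node depends only on nodes at most two hops away, so changing the features of the last node of the path leaves the output at the first node unchanged.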
2.2 Graphs with variable structure and size
The approximate spectral graph convolution (Eq. 2) enforces spatial locality of the filters by controlling the order of the Chebyshev polynomial $K$. Importantly, it reduces the computational complexity of spectral convolution from $O(N^2)$ to $O(K|\mathcal{E}|)$, making it much faster in practice, assuming the graph is sparsely connected and sparse matrix multiplication is implemented. In this work, we observe an important byproduct of this scheme: learned filters become less sensitive to changes in graph structure and size because the eigenvectors $U$ are excluded from spectral convolution, so the learned filters are not tied to $U$.
Figure 1: panels (a) ENZYMES, (b) MUTAG, (c) NCI1.
The only assumption that still makes a trainable filter $g_\theta$ sensitive to graph structure is that we model it as a function of eigenvalues, $g_\theta(\Lambda)$. However, the distribution of eigenvalues of the normalized Laplacian is concentrated in a limited range, making it a weaker dependency on graphs than the spectral convolution via eigenvectors, so that learned filters generalize well to new graphs. Moreover, since we use powers of $\tilde{L}$ in performing convolution (Eq. 2), the distribution of eigenvalues further contracts due to exponentiation of the middle term on the RHS of Eq. 1. We believe that this effect accounts for the robustness of learned filters to changes in graph size or structure (Figure 1).
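The bounded-spectrum argument can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses synthetic random graphs (not the ENZYMES, MUTAG or NCI1 graphs of Figure 1), fixes $\lambda_{max} = 2$ so that $\tilde{L} = L - I$, and the helper names are hypothetical. It confirms that the eigenvalues of $\tilde{L}$ stay in $[-1, 1]$ for every graph size, and that their average magnitude does not grow when raised to a power.

```python
import numpy as np

def rescaled_laplacian(A):
    """L~ = L - I for the symmetric normalized Laplacian L (lambda_max fixed to 2)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return L - np.eye(A.shape[0])

def random_graph(n, p, rng):
    """Symmetric binary adjacency matrix without self-loops (synthetic)."""
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T

rng = np.random.default_rng(0)
spectra, contraction = {}, {}
for n in (10, 40, 160):
    lam = np.linalg.eigvalsh(rescaled_laplacian(random_graph(n, 0.2, rng)))
    spectra[n] = lam
    # Mean magnitude of the eigenvalues after exponentiation, for k = 1, 2, 4.
    contraction[n] = [float(np.mean(np.abs(lam) ** k)) for k in (1, 2, 4)]
    print(n, contraction[n])
```

Eigenvalues whose magnitude is exactly 1 are unaffected by exponentiation, so the contraction applies to the bulk of the spectrum rather than to every eigenvalue.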
2.3 Graphs with multiple relation types
In the approximate spectral graph convolution (Eq. 2), the graph Laplacian $L$ encodes a single relation type between nodes. Yet, a graph may describe many types of distinct relations. In this section, we address this limitation by extending Eq. 2 to a multigraph, i.e. a graph with multiple (up to $R$) edges (relations) between the same nodes, encoded as a set of graph Laplacians $\{L^{(r)}\}_{r=1}^{R}$, where $R$ is an upper bound on the number of edges per dyad. Extensions to a multigraph can also be applied to early spectral models bruna2013spectral but, since ChebNet was shown to be superior in downstream tasks, we choose to focus on the latter model.
Two-dimensional Chebyshev polynomial.
The Chebyshev polynomial used in Eq. 2 (see Section 6.1 in Appendix for detail) can be extended for two variables (relations in our case) similarly to bilinear models, e.g. as in omar2010two :

$$T_{ij}(\tilde{L}^{(1)}, \tilde{L}^{(2)}) = T_i(\tilde{L}^{(1)}) \, T_j(\tilde{L}^{(2)}), \quad i, j = 0, \dots, K-1, \qquad (3)$$

and, analogously, for more variables. For $R = 2$, the convolution is then defined as:

$$Y = [\bar{X}_{00}, \bar{X}_{01}, \dots, \bar{X}_{ij}, \dots, \bar{X}_{K-1,K-1}] \, \Theta, \qquad (4)$$

where $\bar{X}_{ij} = T_i(\tilde{L}^{(1)}) \, T_j(\tilde{L}^{(2)}) X$. In this case, we allow the model to leverage graph paths consisting of multiple relation types (Figure 2). This flexibility, however, comes at a great computational cost, which is prohibitive for a large number of relations $R$ or large order $K$ due to exponential growth of the number of parameters: $O(K^R)$. Moreover, as we demonstrate in our experiments, such multi-relational paths do not necessarily lead to better performance.
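A minimal sketch of the two-relation case may help make the parameter growth concrete. It assumes the product form $T_i(\tilde{L}^{(1)}) T_j(\tilde{L}^{(2)})$ for the two-dimensional polynomial; the function names, the two toy adjacency matrices and the random weights are hypothetical, and dense matrices are used only for readability.

```python
import numpy as np

def rescaled_laplacian(A):
    """L~ = L - I for the symmetric normalized Laplacian L."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return -(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])

def cheb_polys(L_tilde, K):
    """Dense matrices T_0(L~), ..., T_{K-1}(L~) from the Chebyshev recurrence."""
    T = [np.eye(L_tilde.shape[0])]
    if K > 1:
        T.append(L_tilde.copy())
    for _ in range(2, K):
        T.append(2.0 * L_tilde @ T[-1] - T[-2])
    return T

def cheb2d_conv(L1, L2, X, Theta, K):
    """Two-relation convolution: one feature block per index pair (i, j)."""
    T1, T2 = cheb_polys(L1, K), cheb_polys(L2, K)
    blocks = [T1[i] @ (T2[j] @ X) for i in range(K) for j in range(K)]
    return np.concatenate(blocks, axis=1) @ Theta

rng = np.random.default_rng(0)
N, C_in, C_out, K = 5, 3, 4, 3
A1 = np.diag(np.ones(N - 1), 1)
A1 = A1 + A1.T                               # relation 1: toy path graph
A2 = np.zeros((N, N))
A2[0, 4] = A2[4, 0] = A2[1, 3] = A2[3, 1] = 1.0   # relation 2: two extra edges
L1, L2 = rescaled_laplacian(A1), rescaled_laplacian(A2)
X = rng.normal(size=(N, C_in))
Theta = rng.normal(size=(C_in * K * K, C_out))    # K^2 blocks for R = 2
Y = cheb2d_conv(L1, L2, X, Theta, K)
```

The weight matrix has $C_{in} K^2$ rows here, compared with $C_{in} K$ for a single relation, which is the $O(K^R)$ growth discussed in the text for $R = 2$.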
Figure 2: panels (a) and (b).
Multiplicative and additive fusion.
Motivated by multimodal fusion considered in the Visual Question Answering literature (e.g. kim2016hadamard ), we propose the multiplicative operator:

$$Y = \left[ f_1(\bar{X}^{(1)}) \odot f_2(\bar{X}^{(2)}) \odot \dots \odot f_R(\bar{X}^{(R)}) \right] \Theta, \qquad (5)$$

where $f_r$ is a learnable differentiable transformation for relation type $r$ and $\bar{X}^{(r)}$ are features projected onto the Chebyshev basis $T_k(\tilde{L}^{(r)})$. In this case, node features interact in a multiplicative way. The advantage of this method is that it can learn a separate $f_r$ for each relation and has fewer trainable parameters, which helps prevent overfitting and is especially important for large $K$ and $R$. The element-wise multiplication in Eq. 5 can be replaced with summation to perform additive fusion.
Shared projections.
Another potential strength of the approach in Eq. 5 is that we can further decrease model complexity by sharing the parameters of $f_r$ between the relation types, so that the total number of trainable parameters does not depend on the number of relations $R$. Despite these useful practical properties, as we demonstrate in the experiments, it is usually hard for a single shared $f$ to generalize between different relation types.
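The multiplicative, additive and shared variants differ only in how the relation-specific features are combined, which the following sketch tries to make explicit. Several details are assumptions rather than the authors' design: each projection is written as one linear layer followed by tanh, the fused features are multiplied by a final weight matrix `Theta`, and the symmetric matrices `L1` and `L2` are random stand-ins for rescaled Laplacians. All names are hypothetical.

```python
import numpy as np

def cheb_features(L_tilde, X, K):
    """Concatenate T_k(L~) X for k = 0..K-1 (Chebyshev recurrence)."""
    terms = [X]
    if K > 1:
        terms.append(L_tilde @ X)
    for _ in range(2, K):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return np.concatenate(terms, axis=1)

def fuse(laplacians, X, projections, Theta, K, mode="mul"):
    """Fuse relation-specific features by element-wise product or sum."""
    fused = None
    for L_tilde, (W, b) in zip(laplacians, projections):
        # Projection for one relation: linear layer + tanh (activation assumed).
        h = np.tanh(cheb_features(L_tilde, X, K) @ W + b)
        if fused is None:
            fused = h
        elif mode == "mul":
            fused = fused * h
        elif mode == "add":
            fused = fused + h
        else:
            raise ValueError(f"unknown mode: {mode}")
    return fused @ Theta

rng = np.random.default_rng(0)
N, C_in, H, C_out, K = 5, 3, 8, 4, 3
# Symmetric stand-ins for two rescaled Laplacians (toy values, not real graphs).
L1 = rng.uniform(-0.2, 0.2, size=(N, N)); L1 = 0.5 * (L1 + L1.T)
L2 = rng.uniform(-0.2, 0.2, size=(N, N)); L2 = 0.5 * (L2 + L2.T)
X = rng.normal(size=(N, C_in))
projections = [(rng.normal(size=(C_in * K, H)), np.zeros(H)) for _ in range(2)]
Theta = rng.normal(size=(H, C_out))

Y_mul = fuse([L1, L2], X, projections, Theta, K, mode="mul")
Y_add = fuse([L1, L2], X, projections, Theta, K, mode="add")
# Shared projections: the same (W, b) pair is reused for every relation type.
Y_shared = fuse([L1, L2], X, [projections[0]] * 2, Theta, K, mode="mul")
```

In the shared variant the number of projection parameters is that of a single `(W, b)` pair regardless of how many Laplacians are passed in.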
Concatenating edge features.
A more straightforward approach is to concatenate features $\bar{X}^{(r)}$ for all $R$ relation types and learn a single matrix of weights $\Theta \in \mathbb{R}^{C_{in} K R \times C_{out}}$:

$$Y = [\bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(R)}] \, \Theta. \qquad (6)$$

This method, however, does not scale well for large $R$, since the dimensionality of $\Theta$ grows linearly with $R$. Note that even though multi-relational paths are not explicit in Eq. 5 and 6, for a multilayer network, relation types will still communicate through node features. In Figure 2, a node will contain features of its neighbor after the first convolutional layer, so that in the second layer a filter centered at another node will have access to the features of that neighbor by accessing the features of the intermediate node. Compared to the 2D polynomial convolution defined by Eq. 4, the concatenation-based, multiplicative and additive approaches require more layers to have a larger multi-relational receptive field.
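To compare how the weight matrix grows under concatenation and under the two-dimensional polynomial, the sketch below implements concatenation-based fusion and counts the entries of the weight matrix for both schemes. The counts follow from the feature dimensions described in the text (bias terms are ignored); the function names, the random stand-in Laplacians and the example sizes are assumptions for illustration.

```python
import numpy as np

def cheb_features(L_tilde, X, K):
    """Concatenate T_k(L~) X for k = 0..K-1 (Chebyshev recurrence)."""
    terms = [X]
    if K > 1:
        terms.append(L_tilde @ X)
    for _ in range(2, K):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return np.concatenate(terms, axis=1)

def concat_conv(laplacians, X, Theta, K):
    """Concatenate Chebyshev features of all relations, then apply one weight matrix."""
    feats = [cheb_features(L_tilde, X, K) for L_tilde in laplacians]
    return np.concatenate(feats, axis=1) @ Theta

def n_weights(C_in, C_out, K, R, method):
    """Number of entries in the weight matrix (biases ignored)."""
    if method == "concat":
        return C_in * K * R * C_out      # linear in R
    if method == "2d":
        return C_in * K ** R * C_out     # exponential in R
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(0)
N, C_in, C_out, K, R = 5, 3, 4, 3, 2
# Symmetric stand-ins for rescaled Laplacians (toy values, not real graphs).
laplacians = []
for _ in range(R):
    M = rng.uniform(-0.2, 0.2, size=(N, N))
    laplacians.append(0.5 * (M + M.T))
X = rng.normal(size=(N, C_in))
Theta = rng.normal(size=(C_in * K * R, C_out))
Y = concat_conv(laplacians, X, Theta, K)

for K_demo in (2, 4, 6):
    print(K_demo, n_weights(37, 32, K_demo, 2, "concat"), n_weights(37, 32, K_demo, 2, "2d"))
```

For two relations the two counts coincide at $K = 2$ and diverge as $K$ grows, which matches the observation that the two-dimensional polynomial becomes costly for larger orders.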
3 Multigraph Convolutional Networks
A frequent assumption of current GCNs is that there is at most one edge between any pair of nodes in a graph. This restriction is usually implied by datasets with such structure, so that in many datasets, graphs are annotated with the single most important relation type, for example, whether two atoms in a molecule are bonded wale2008comparison ; duvenaud2015convolutional . Meanwhile, data is often complex and nodes tend to have multiple relationships of different semantic, physical, or abstract meanings. Therefore, we argue that there could be other relationships captured by relaxing this restriction and allowing for multiple kinds of edges, beyond those annotated in the data.
3.1 Learning edges
Prior work (e.g. schlichtkrull2018modeling ; bordes2013translating ) proposed methods to learn from multiple edges, but similarly to the methods using a single edge type kipf2016semi , they leveraged only predefined (annotated) edges in the data. We devise a more flexible model, which, in addition to learning from an arbitrary number of predefined relations between nodes (see Section 2.3), learns abstract edges jointly with a GCN. We propose to learn a new edge $e_{ij}$ between any pair of nodes $v_i$ and $v_j$ with features $X_i$ and $X_j$ using a trainable similarity function:

$$e_{ij} = \frac{\exp(f_{edge}(X_i, X_j))}{\sum_{k=1}^{N} \exp(f_{edge}(X_i, X_k))}, \qquad (7)$$

where the softmax is used to enforce sparse connections and $f_{edge}$ can be any differentiable function, such as the multilayer perceptron used in our work. This idea is similar to henaff2015deep , which was built on the early spectral convolution model bruna2013spectral and learned an adjacency matrix, but targeted classification tasks for non graph-structured data (e.g. document classification, with each document represented as a feature vector). Moreover, we learn this matrix jointly with a more recent graph classification model defferrard2016convolutional and, additionally, efficiently fuse predefined and learned relations. Eq. 7 is also similar to that of velickovic2017graph , which used this functional form to predict an attention coefficient for some existing edge. The attention model can only strengthen or weaken some existing relations, but cannot form new relations. We present a more general model that makes it possible to connect previously disconnected nodes and form new abstract relations. To enforce symmetry of predicted edges, we compute an average: $\hat{e}_{ij} = (e_{ij} + e_{ji}) / 2$.
3.2 Layer pooling versus global pooling
Inspired by convolutional networks, previous works bruna2013spectral ; defferrard2016convolutional ; monti2017geometric ; simonovsky2017dynamic ; fey2018splinecnn built an analogy of pooling layers in graphs, for example, using the Graclus clustering algorithm dhillon2007weighted . In CNNs, pooling is an effective way to reduce memory and computation, particularly for large inputs. It also provides additional robustness to local deformations and leads to faster growth of receptive fields. However, we can build a convolutional network without any pooling layers with similar performance in a downstream task springenberg2014striving ; it will just be relatively slow, since pooling is extremely cheap on regular grids, such as images. In graph classification tasks, the input dimensionality, which corresponds to the number of nodes $N$, is often very small and the benefits of pooling are less clear. Graph pooling, such as in dhillon2007weighted , is also computationally intensive since we need to run the clustering algorithm for each training example independently, which limits the scale of problems we can address. Aiming to simplify the model while maintaining classification accuracy, we exclude pooling layers between convolutional layers and perform global maximum pooling (GMP) over nodes following the last convolutional layer. This fixes the size of the penultimate feature vector regardless of the number of nodes (Figure 3).
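The two components of this section, learned edges (Section 3.1) and global max pooling (Section 3.2), can be sketched in a few lines of NumPy. This is a sketch under explicit assumptions, not the authors' code: the edge function is a two-layer perceptron with a ReLU hidden layer acting on concatenated node features, the softmax is taken over all nodes $k$ for each node $i$ (including $k = i$), and the weights are random stand-ins for trained parameters. All names are hypothetical.

```python
import numpy as np

def learned_adjacency(X, W1, b1, w2):
    """Predict a dense symmetric edge matrix from node features X of shape (N, C)."""
    n = X.shape[0]
    # Features of every ordered pair (i, j), concatenated: shape (N, N, 2 * C).
    pairs = np.concatenate([np.repeat(X[:, None, :], n, axis=1),
                            np.repeat(X[None, :, :], n, axis=0)], axis=2)
    hidden = np.maximum(pairs @ W1 + b1, 0.0)               # ReLU is an assumption
    scores = hidden @ w2                                    # shape (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    E = np.exp(scores)
    E = E / E.sum(axis=1, keepdims=True)                    # softmax over k for each i
    return 0.5 * (E + E.T)                                  # average to symmetrize

def global_max_pool(H):
    """Max over the node axis: (N, C) -> (C,), independent of N."""
    return H.max(axis=0)

rng = np.random.default_rng(0)
C, hidden_units = 3, 8                      # toy sizes, chosen for the example
W1 = rng.normal(size=(2 * C, hidden_units))
b1 = np.zeros(hidden_units)
w2 = rng.normal(size=hidden_units)

X_small = rng.normal(size=(5, C))
A_learned = learned_adjacency(X_small, W1, b1, w2)

# Graphs with 5 and 9 nodes give pooled vectors of the same length.
pooled_a = global_max_pool(rng.normal(size=(5, 16)))
pooled_b = global_max_pool(rng.normal(size=(9, 16)))
```

The resulting matrix can be treated as one more relation type and fused with the annotated edges by any of the methods of Section 2.3.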
4 Experiments
4.1 Dataset details
We evaluate our model on five chemical graph classification datasets frequently used in previous work: NCI1 and NCI109 wale2008comparison , MUTAG debnath1991structure , ENZYMES schomburg2004brenda , and PROTEINS borgwardt2005protein . For each dataset, there is a set of graphs with an arbitrary number of nodes $N$ and undirected binary edges of a single type ($R = 1$), and each graph has a single, categorical label that is to be predicted. Dataset statistics are presented in Table 2 of the Appendix.
Every graph represents some chemical compound labeled according to its functional properties. In NCI1, NCI109 and MUTAG, edges correspond to atom bonds (types of bonds) and vertices to atom properties; in ENZYMES, edges are formed based on spatial distance (edges connect nodes if they are neighbors along the amino acid sequence or if they are neighbors in space within the protein structure borgwardt2005protein ); in PROTEINS, edges are similarly formed based on spatial distance between amino acids in proteins. Since edges in these datasets are not directly related to the features of the nodes they connect, we expect that learned edges will enrich graph structure and improve graph classification.
Node features are discrete in these datasets and represented as one-hot vectors of length $C_{in}$. We do not use any additional node or edge attributes available for some of these datasets.
These datasets vary in the number of graphs (188 to 4127), class labels (2 to 6) and the number of nodes in a graph (2 to 620) and, thereby, represent a comprehensive benchmark for our method. We follow the standard approach to evaluation shervashidze2011weisfeiler ; yanardag2015deep and perform 10-fold cross-validation on these datasets. To minimize any random effects, we repeat experiments 10 times and report average classification accuracies together with standard deviations.
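The evaluation protocol can be outlined as follows. This is a schematic sketch: `evaluate_fold` is a hypothetical callback that would train a model on the training indices and return test accuracy, the splits are plain random permutations, and computing the standard deviation over the per-repeat averages is an assumption, since the text does not specify how it is aggregated.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Split a random permutation of range(n) into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def repeated_cv(n_graphs, evaluate_fold, n_folds=10, n_repeats=10, seed=0):
    """Repeat k-fold cross-validation and summarize accuracy over repeats."""
    rng = np.random.default_rng(seed)
    run_acc = []
    for _ in range(n_repeats):
        folds = kfold_indices(n_graphs, n_folds, rng)
        fold_acc = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            fold_acc.append(evaluate_fold(train_idx, test_idx))
        run_acc.append(np.mean(fold_acc))
    # Mean and standard deviation over repeats (aggregation choice is assumed).
    return float(np.mean(run_acc)), float(np.std(run_acc))

def dummy_evaluate(train_idx, test_idx):
    # Placeholder: a real evaluator would train on train_idx and score test_idx.
    return 0.5

mean_acc, std_acc = repeated_cv(188, dummy_evaluate)   # 188 graphs, as in MUTAG
```

Swapping `dummy_evaluate` for an actual training routine is all that is needed to reuse the loop for any of the five datasets.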
4.2 Architectural details and experimental setup
In all experiments, we train a ChebNet with three graph convolutional layers followed by global max pooling (GMP) and two fully connected layers (Figure 3). Batch normalization (BN) and the ReLU activation are added after each layer, whereas dropout is added only before the fully connected layers. Projections $f_r$ in Eq. 5 are modeled by a single-layer neural network with hidden units and the activation. The edge prediction function $f_{edge}$ (see Eq. 7, Section 3.1) is a two-layer neural network with 128 hidden units (32 for PROTEINS), which acts on concatenated node features. Detailed network architectures are presented in Table 2 of the Appendix.
We train all models using the Adam optimizer kingma2014adam with a learning rate of 0.001, weight decay of 0.0001, and batch size of 32. The learning rate is decayed after 25, 35, and 45 epochs and the models are trained for 50 epochs as in simonovsky2017dynamic . We run experiments for different fusion methods (Section 2.3) and Chebyshev orders $K$ in range from 2 to 6 (Section 2.1) and report the best results in Table 1.

Table 1: Classification accuracy (%), as mean ± standard deviation where available. A dash marks a result that is not reported.

| Model | NCI1 | NCI109 | MUTAG | ENZYMES | PROTEINS |
| --- | --- | --- | --- | --- | --- |
| WL shervashidze2011weisfeiler | 84.6 ± 0.4 | 84.5 ± 0.2 | 83.8 ± 1.5 | 59.1 ± 1.1 | - |
| WL-OA kriege2016valid | 86.1 ± 0.2 | 86.3 ± 0.2 | 84.5 ± 1.7 | 59.9 ± 1.1 | 76.4 ± 0.4 |
| structure2vec dai2016discriminative | 83.7 | 82.2 | 88.3 | 61.1 | - |
| DGK yanardag2015deep | 80.3 ± 0.5 | 80.3 ± 0.3 | 87.4 ± 2.7 | 53.4 ± 0.9 | 75.7 ± 0.5 |
| PSCN niepert2016learning | 78.6 ± 1.9 | - | 92.6 ± 4.2 | - | 75.9 ± 2.8 |
| ECC simonovsky2017dynamic | 83.8 | 82.1 | 88.3 | 53.5 | - |
| DGCNN zhang2018end | 74.4 ± 0.5 | - | 85.8 ± 1.7 | - | 76.3 ± 0.2 |
| Graph U-Net cangea2018towards | - | - | - | 64.2 | 75.5 |
| DiffPool ying2018hierarchical | - | - | - | 62.5 | 76.3 |
| MoNet monti2017geometric , ours* | 69.8 ± 0.2 | 70.0 ± 0.3 | 84.2 ± 1.2 | 36.4 ± 1.2 | 71.9 ± 1.2 |
| GCN kipf2016semi , ours* | 75.8 ± 0.7 | 73.4 ± 0.4 | 76.5 ± 1.4 | 40.7 ± 1.8 | 74.3 ± 0.5 |
| ChebNet defferrard2016convolutional , ours* | 83.1 ± 0.4 | 82.1 ± 0.2 | 84.4 ± 1.6 | 58.0 ± 1.4 | 75.5 ± 0.4 |
| Multigraph ChebNet | 83.4 ± 0.4 | 82.0 ± 0.3 | 89.1 ± 1.4 | 61.7 ± 1.3 | 76.5 ± 0.4 |
4.3 Results
Previous works typically show strong performance on one or two of the five datasets that we use (Table 1). In contrast, the Multigraph ChebNet, leveraging two relation types (annotated and learned, see Section 3.1), shows high accuracy across all datasets. On PROTEINS we outperform all previous methods, while on ENZYMES two recent works based on differentiable pooling ying2018hierarchical ; cangea2018towards are better; however, it is difficult to compare to their results without the standard deviation of accuracies. We also obtain competitive accuracy on NCI1, outperforming DGK yanardag2015deep , PSCN niepert2016learning , and DGCNN zhang2018end . Importantly, the Multigraph ChebNet with two edge types, i.e. predefined dataset annotations and the learned edges (Section 3.1), consistently outperforms the baseline ChebNet with a single edge type, which shows the efficacy of our approach and demonstrates the complementary nature of predefined and learned edges. Lower results on NCI1 and NCI109 can be explained by the fact that the node features in the graphs of these datasets are imbalanced, with some features appearing only a few times in the dataset. This is undesirable for our method, which learns new edges based on features, because the model can predict random values for unseen features. On MUTAG we surpass all but one method niepert2016learning . However, this dataset is tiny, consisting of 188 graphs, and the margin from the top method is not statistically significant.
Evaluation of edge fusion methods.
We train a model using each of the edge fusion methods proposed in Section 2.3 and report the summary of results in Figure 4. We count the number of times each method outperforms the others, treating all 10 folds independently. As expected, graph convolution based on the two-dimensional Chebyshev polynomial is better for lower orders of $K$, since it exploits multi-relational graph paths, effectively increasing the receptive field of filters. However, for larger $K$, the model complexity becomes too high due to quadratic growth of the number of parameters, and performance degrades. Sharing weights for multiplicative or additive fusion generally reduces performance, with a few exceptions in the multiplicative case. This implies that predefined and learned edges are of a different nature. It would be interesting to validate these fusion methods on a larger number of relation types.
Speed comparison.
We compare the forward pass speed of the proposed method to the baseline ChebNet, MoNet monti2017geometric and GCN kipf2016semi (Figure 5). Analogously to kipf2016semi , we generate random graphs with nodes and edges. ChebNet with 2 edge types (Multigraph ChebNet) is on average two times slower than the baseline ChebNet with a single annotated edge. MoNet is in turn two times slower than Multigraph ChebNet. Multigraph ChebNet with 2D edge fusion is the slowest due to exponential growth of parameters, while GCN is the fastest, although the gap with the baseline ChebNet is small. Therefore, we believe that Multigraph ChebNet with certain edge fusion methods provides a relatively fast, scalable and accurate model.
5 Related work and Discussion
Our method relies on a fast approximate spectral graph convolution known as ChebNet defferrard2016convolutional , which was designed for graph classification. A simplified and faster version of this model, Graph Convolutional Networks (GCN) kipf2016semi , which is practically equivalent to the ChebNet with order $K = 2$, has shown impressive node classification performance on citation and knowledge graph datasets in the transductive learning setting. In all our experiments, we noticed that using more global filters (with larger $K$) is important (Table 1). Other recent works hamilton2017inductive ; velickovic2017graph also focus on node classification and, therefore, are not empirically compared to in this work.
Recent work of ying2018hierarchical proposed a differentiable alternative to clustering-based graph pooling, showing strong results in graph classification tasks, but at a high computational cost. To alleviate this, a more scalable approach based on dropping nodes graphunet2018 ; cangea2018towards was introduced and can be integrated with our method to further improve results.
Closely related to our work, monti2017geometric formulated the generalized graph convolution model (MoNet) based on a trainable transformation to pseudo-coordinates, which led to learning anisotropic kernels and excellent results in visual tasks. However, in non-visual tasks, when coordinates are not naturally defined, the performance is worse (Table 1). Notably, the computational cost (both memory and speed) of MoNet is higher than for ChebNet due to the patch operator in (monti2017geometric, Eq. (9)-(11)) (Figure 5). The argument in favor of MoNet against ChebNet was the sensitivity of spectral convolution methods, including ChebNet, to changes in graph size and structure. We contradict this argument and show superior performance on chemical graph classification datasets. SplineCNN fey2018splinecnn is similar to MoNet and is good at classifying both graphs and nodes, but it is also based on pseudo-coordinates and, therefore, potentially has the same shortcoming as MoNet. Its performance on general graph classification problems where coordinates are not well defined is therefore expected to be inferior.
Another family of methods based on kernels shervashidze2011weisfeiler ; kriege2016valid shows strong performance on chemical datasets, but their application is limited to small-scale graph problems with discrete node features. Scalable extensions of kernel methods to graphs with continuous features were proposed niepert2016learning ; yanardag2015deep , but they showed weaker results.
6 Conclusion
In this work, we address several limitations of current graph convolutional networks and show competitive graph classification results on a number of chemical datasets. First, we revisit the spectral graph convolution model based on the Chebyshev polynomial, commonly believed to inherit shortcomings of earlier spectral methods, and demonstrate its ability to learn from graphs of arbitrary size and structure. Second, we design and study edge fusion methods for multi-relational graphs, and show the importance of validating these methods for each task to achieve optimal performance. Third, we propose a way to learn new edges in a graph jointly with a graph classification model. Our results show that the learned edges are complementary to edges already annotated, providing a significant gain in accuracy.
Acknowledgments
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation.
References

(1) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
(2) William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
(3) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
(4) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. 2018.
(5) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
(6) Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
(7) Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3, 2017.
(8) Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
(9) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272, 2017.
(10) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
(11) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
(12) Zaid Omar, Nikolaos Mitianoudis, and Tania Stathaki. Two-dimensional Chebyshev polynomials for image fusion. 2010.
(13) Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325, 2016.
(14) Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
(15) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
(16) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
(17) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
(18) Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
(19) Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
(20) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
(21) Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
(22) Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Christian Heldt, Gregor Huhn, and Dietmar Schomburg. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32(suppl_1):D431–D433, 2004.
(23) Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
(24) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
(25) Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
(26) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
(27) Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
(28) Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
(29) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
(30) Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence, 2018.
(31) Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287, 2018.
(32) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.
(33) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
(34) Anonymous. Graph U-Net. In Submitted to the Seventh International Conference on Learning Representations (ICLR), 2018.
(35) David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
(36) Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016.
Appendix
6.1 Overview of spectral graph convolution and its approximation
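Following the notation of [11], spectral convolution on a graph $\mathcal{G}$ having $N$ nodes is defined analogously to convolution in the Fourier domain (the convolution theorem) for some one-dimensional features over nodes $x \in \mathbb{R}^N$ and filter $g_\theta \in \mathbb{R}^N$ as [3, 1]:

$$y = g_\theta \ast x = U \left( (U^T g_\theta) \odot (U^T x) \right) = U \, \mathrm{diag}(U^T g_\theta) \, U^T x, \qquad (8)$$

where $U$ are the eigenvectors of the normalized symmetric graph Laplacian, $L = I - D^{-1/2} A D^{-1/2}$, where $A$ is an adjacency matrix of the graph $\mathcal{G}$ and $D$ are node degrees. $L = U \Lambda U^T$ follows from the definition of eigenvectors, where $\Lambda$ is a diagonal matrix of eigenvalues. The operator $\odot$ denotes the Hadamard product (element-wise multiplication), and $\mathrm{diag}(\cdot)$ is a diagonal matrix with the elements of its argument on the diagonal.
The spectral convolution in (8) can be approximated using the Chebyshev expansion, where $T_k(\tilde{L}) = 2 \tilde{L} \, T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L})$ with $T_0 = I$ and $T_1 = \tilde{L}$ (i.e. terms contain powers $\tilde{L}^k$), and the property of eigendecomposition:

$$L^k = (U \Lambda U^T)^k = U \Lambda^k U^T. \qquad (9)$$

Assuming eigenvalues $\Lambda$ are fixed constants, the filter $g_\theta$ can be represented as a function of eigenvalues, $g_\theta(\Lambda)$, such that (8) becomes:

$$y = U \, g_\theta(\Lambda) \, U^T x. \qquad (10)$$

The filter $g_\theta(\Lambda)$ can then be approximated as a Chebyshev polynomial of degree $K-1$ (a weighted sum of $K$ terms). Substituting the approximated $g_\theta(\Lambda)$ into Eq. 10 and exploiting Eq. 9, the approximate spectral convolution takes the following form (see [11, 17] for further analysis and [35] for derivations):

$$y \approx \sum_{k=0}^{K-1} \theta_k \, T_k(\tilde{L}) \, x = [\bar{x}_0, \bar{x}_1, \dots, \bar{x}_{K-1}] \, \theta, \qquad (11)$$

where $\tilde{L} = 2L/\lambda_{max} - I$ is a rescaled graph Laplacian with $\lambda_{max}$ as the largest eigenvalue of $L$, $\bar{x}_k = T_k(\tilde{L}) x$ are projections of input features onto the Chebyshev basis and $\theta$ are learnable weights shared across nodes. In this work, we further simplify the computation and fix $\lambda_{max} = 2$ ($\lambda_{max}$ varies from graph to graph), so that $\tilde{L} = L - I$, and assume no loops in a graph. $\tilde{L}$ has the same eigenvectors as $L$, but its eigenvalues are $\tilde{\Lambda} = \Lambda - I$.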
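The derivation above can be verified numerically on a small graph. The sketch below is an illustrative check under stated assumptions: it uses a synthetic random graph, fixes $\lambda_{max} = 2$, and picks arbitrary filter coefficients for a three-term Chebyshev filter. It confirms the eigendecomposition property and that filtering in the spectral domain with the eigenvectors gives the same result as applying the polynomial in the rescaled Laplacian directly, which is why the eigenvectors never need to be computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = np.triu(rng.random((n, n)) < 0.5, k=1).astype(float)
A = A + A.T                                   # synthetic undirected graph, no loops
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

lam, U = np.linalg.eigh(L)                    # L = U diag(lam) U^T
k = 3
L_power = np.linalg.matrix_power(L, k)
L_power_eig = U @ np.diag(lam ** k) @ U.T     # eigendecomposition property

theta = rng.normal(size=3)                    # arbitrary coefficients, K = 3 terms
x = rng.normal(size=n)

# Spectral route: evaluate the Chebyshev filter on the shifted eigenvalues.
lam_t = lam - 1.0
g = theta[0] + theta[1] * lam_t + theta[2] * (2.0 * lam_t ** 2 - 1.0)
y_spectral = U @ (g * (U.T @ x))

# Spatial route: apply T_0, T_1, T_2 of L~ = L - I directly, without U.
L_t = L - np.eye(n)
y_spatial = theta[0] * x + theta[1] * (L_t @ x) + theta[2] * (2.0 * L_t @ (L_t @ x) - x)

print(np.allclose(L_power, L_power_eig), np.allclose(y_spectral, y_spatial))
```

The spatial route only needs matrix-vector products with $\tilde{L}$, which stay cheap when the graph is sparse.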
6.2 Dataset statistics and network architectures
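Table 2: Dataset statistics and network architectures.

| Dataset | # graphs | Min. nodes | Max. nodes | Avg. nodes | Node feature dim. | Architecture |
| --- | --- | --- | --- | --- | --- | --- |
| NCI1 | 4110 | 3 | 111 | 29.87 | 37 | GC32-GC64-GC128-D0.1-FC256-D0.1-FC2 |
| NCI109 | 4127 | 4 | 111 | 29.68 | 38 | GC32-GC64-GC128-D0.1-FC256-D0.1-FC2 |
| MUTAG | 188 | 10 | 28 | 17.93 | 7 | GC32-GC32-GC32-D0.1-FC96-D0.1-FC2 |
| ENZYMES | 600 | 2 | 126 | 32.63 | 3 | GC32-GC64-GC512-D0.1-FC256-D0.1-FC6 |
| PROTEINS | 1113 | 4 | 620 | 39.06 | 3 | GC32-GC32-GC32-D0.1-FC96-D0.1-FC2 |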