1 Introduction and Related Work
Here we study the problem of graph classification: the task of learning to categorise graphs into classes. (The first two authors contributed equally.) This is a direct generalisation of image classification krizhevsky2012imagenet , as images may easily be cast as a special case of a “grid graph” (with each pixel of an image connected to its eight immediate neighbours). It is therefore natural to investigate and generalise CNN building blocks to graphs bronstein2017geometric ; hamilton2017representation ; battaglia2018relational .
Generalising the convolutional layer to graphs has been a very active area of research, with several graph convolutional layers bruna2013spectral ; defferrard2016convolutional ; kipf2016semi ; gilmer2017neural ; velickovic2018graph proposed in recent times, significantly advancing the state-of-the-art on many challenging node classification benchmarks (analogues of image segmentation in the graph domain), as well as on link prediction. Conversely, generalising pooling layers has received substantially less attention from the community.
The proposed strategies broadly fall into two categories: 1) aggregating node representations in a global pooling step after each message passing step duvenaud2015convolutional or after the final one li2015gated ; dai2016discriminative ; gilmer2017neural , and 2) aggregating node representations into clusters which coarsen the graph in a hierarchical manner bruna2013spectral ; niepert2016learning ; defferrard2016convolutional ; monti2017geometric ; simonovsky2017dynamic ; fey2018splinecnn ; mrowca2018flexible ; ying2018hierarchical ; anonymous2019graph . Apart from ying2018hierarchical ; anonymous2019graph , all earlier works in this area assume a fixed, pre-defined cluster assignment, obtained by running a clustering algorithm on the graph nodes, e.g. using the GraClus algorithm dhillon2007weighted to obtain structure-dependent cluster assignments, or finding clusters via k-means on node features mrowca2018flexible . The main insight of recent works ying2018hierarchical ; anonymous2019graph is that intermediate node representations (e.g. obtained after applying a graph convolutional layer) can be leveraged to derive both feature- and structure-based cluster assignments that are adaptive to the underlying data and can be learned in a differentiable manner.
The first end-to-end trainable graph CNN with a learnable pooling operator was recently pioneered, leveraging the DiffPool layer ying2018hierarchical . DiffPool computes soft clustering assignments of nodes from the original graph to nodes in the pooled graph. Through a combination of restricting the clustering scores to respect the input graph’s adjacency information, and a sparsity-inducing entropy regulariser, the clustering learnt by DiffPool eventually converges to an almost-hard clustering with interpretable structure, and leads to state-of-the-art results on several graph classification benchmarks.
The main limitation of DiffPool is the computation of the soft clustering assignments: while the assignments eventually converge, during the early phases of training an entire assignment matrix must be stored, relating nodes of the original graph to nodes of the pooled graph in an all-pairs fashion. This incurs a quadratic, O(kN²), storage complexity for any pooling scheme with a fixed pooling ratio, k, and is therefore prohibitive for large graphs.
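To make the storage gap concrete, a back-of-the-envelope comparison of a dense soft-assignment matrix against sparse graph storage can be sketched as follows (the graph sizes here are hypothetical, chosen purely for illustration):

```python
import math

# Dense soft-assignment (DiffPool-style): relates all N nodes of the input
# graph to ceil(k * N) clusters of the pooled graph, in an all-pairs fashion.
# Sparse alternative: only the graph itself (N node entries + E edge entries).
N = 100_000        # number of nodes (hypothetical)
E = 1_000_000      # number of edges (hypothetical)
k = 0.25           # pooling ratio

dense_scores = N * math.ceil(k * N)   # O(k N^2) assignment entries
sparse_storage = N + E                # O(N + E) entries for the input graph

# dense_scores is 2.5 billion entries here, several orders of magnitude
# more than the ~1.1 million entries needed to store the graph itself.
```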
In this work, we leverage recent advances in graph neural network design hamilton2017inductive ; anonymous2019graph ; xu2018representation to demonstrate that sparsity need not be sacrificed to obtain good performance on end-to-end graph convolutional architectures with pooling. We demonstrate performance that is comparable to variants of DiffPool on four standard graph classification benchmarks, all while using a graph CNN that only requires O(N + E) storage (comparable to the storage complexity of the input graph).
We assume a standard graph-based machine learning setup: the input graph is represented as a matrix of node features, X ∈ ℝ^{N×F}, and an adjacency matrix, A ∈ ℝ^{N×N}. Here, N is the number of nodes in the graph, and F the number of features per node. In cases where the graph is featureless, one may use node degree information (e.g. one-hot encoding the node degree for all degrees up to a given upper bound) to serve as artificial node features. While the adjacency matrix may consist of real numbers (and may even contain arbitrary edge features), here we restrict our attention to undirected and unweighted graphs; i.e. A is assumed to be binary and symmetric.
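A minimal NumPy sketch of such artificial node features (one-hot degree encoding, with the degree clipped at a chosen upper bound; the function name is illustrative):

```python
import numpy as np

def degree_features(A, max_degree=10):
    """One-hot encode each node's degree (clipped at max_degree) to serve
    as artificial node features for a featureless graph.
    A is a binary, symmetric adjacency matrix of shape (N, N)."""
    degrees = A.sum(axis=1).astype(int)        # degree of each node
    degrees = np.clip(degrees, 0, max_degree)  # clip at the upper bound
    X = np.zeros((A.shape[0], max_degree + 1))
    X[np.arange(A.shape[0]), degrees] = 1.0    # one-hot rows
    return X

# Example: a triangle graph, where every node has degree 2.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
X = degree_features(A, max_degree=4)  # shape (3, 5), each row one-hot at index 2
```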
To specify a CNN-inspired neural network for graph classification, we first require a convolutional and a pooling layer. In addition, we require a readout layer (analogous to a flattening layer in an image CNN) that converts the learnt representations into a fixed-size vector representation, to be used for final prediction (e.g. by a simple MLP). These layers are specified in the following paragraphs.
Given that our model will be required to classify unseen graph structures at test time, the main requirement of the convolutional layer in our architecture is that it is inductive, i.e. that it does not depend on a fixed and known graph structure. The simplest such layer is the mean-pooling propagation rule, as similarly used in GCN kipf2016semi or Const-GAT velickovic2018graph :
MP(X, A) = σ(D̂⁻¹ Â X Θ + X Θ′)

where Â = A + I_N is the adjacency matrix with inserted self-loops and D̂ is its corresponding degree matrix; i.e. D̂_ii = Σ_j Â_ij. We have used the rectified linear (ReLU) activation for σ. Θ and Θ′ are learnable linear transformations applied to every node. The transformation through Θ′ represents a simple skip-connection he2016deep , further encouraging preservation of information about the central node.
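The propagation rule above can be sketched densely in NumPy as follows (a real implementation would use sparse matrix operations; the parameter names `Theta` and `Theta_skip` are illustrative):

```python
import numpy as np

def mean_pool_conv(X, A, Theta, Theta_skip):
    """Sketch of MP(X, A) = ReLU(D^-1 Â X Θ + X Θ'), where Â = A + I
    inserts self-loops and D is the degree matrix of Â."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                            # insert self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)   # inverse degrees (>= 1 due to self-loops)
    H = D_inv * (A_hat @ X @ Theta)                  # mean over each node's neighbourhood
    H = H + X @ Theta_skip                           # skip-connection for the central node
    return np.maximum(H, 0.0)                        # ReLU

# Example on a random 5-node undirected graph with 3 input / 4 output features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
A = (rng.random((5, 5)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                       # binary, symmetric, no self-loops
Theta = rng.normal(size=(3, 4))
Theta_skip = rng.normal(size=(3, 4))
H = mean_pool_conv(X, A, Theta, Theta_skip)          # shape (5, 4), non-negative
```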
To make sure that a graph downsampling layer behaves idiomatically with respect to a wide class of graph sizes and structures, we adopt the approach of reducing the graph with a pooling ratio, k ∈ (0, 1]. This implies that a graph with N nodes will have ⌈kN⌉ nodes after application of such a pooling layer.
Unlike DiffPool, which attempts to do this by computing a clustering of the N nodes into ⌈kN⌉ clusters (and therefore incurs a quadratic penalty in storing cluster assignment scores), we leverage the recently proposed Graph U-Net architecture anonymous2019graph , which simply drops nodes from the original graph.
The choice of which nodes to drop is based on a projection score against a learnable vector, p. In order to enable gradients to flow into p, the projection scores are also used as gating values, such that retained nodes with lower scores experience less significant feature retention. Fully written out, the operation of this pooling layer (computing a pooled graph, (X′, A′), from an input graph, (X, A)) may be expressed as follows:

y = Xp / ‖p‖    i = top-k(y, ⌈kN⌉)    X′ = (X ⊙ tanh(y))_i    A′ = A_{i,i}

Here, ‖·‖ is the L2 norm, top-k selects the indices of the ⌈kN⌉ largest entries of a given input vector, ⊙ is (broadcasted) elementwise multiplication, and ·_i is an indexing operation which takes slices at the indices specified by i. This operation requires only a pointwise projection and slicing into the original feature and adjacency matrices, and therefore trivially retains sparsity.
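A minimal NumPy sketch of this pooling operation (assuming tanh gating of the normalised projection scores; the function name is illustrative):

```python
import numpy as np

def top_k_pool(X, A, p, k=0.8):
    """Project node features onto a learnable vector p, keep the
    ceil(k * N) highest-scoring nodes, and gate the retained features
    by tanh of the (normalised) projection scores."""
    y = X @ p / np.linalg.norm(p)                  # projection scores, shape (N,)
    num_keep = int(np.ceil(k * X.shape[0]))        # ceil(k * N) nodes survive
    idx = np.argsort(-y)[:num_keep]                # indices of top-scoring nodes
    X_pooled = X[idx] * np.tanh(y[idx])[:, None]   # gate retained features
    A_pooled = A[np.ix_(idx, idx)]                 # slice adjacency rows and columns
    return X_pooled, A_pooled, idx

# Example: pool a 10-node complete graph down to ceil(0.8 * 10) = 8 nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
A = np.ones((10, 10)) - np.eye(10)
p = rng.normal(size=4)
X_p, A_p, idx = top_k_pool(X, A, p, k=0.8)
```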
Lastly, we seek a “flattening” operation that will preserve information about the input graph in a fixed-size representation. A natural way to do this in CNNs is global average pooling, i.e. taking the average of all learnt node embeddings in the final layer. We further augment this by also performing global max pooling, which we found strengthened our representations. Finally, inspired by the JK-net architecture xu2018representation ; xu2018powerful , we perform this summarisation after each conv-pool block of the network, and aggregate all of the summaries together by taking their sum.
Concretely, to summarise the output graph of the l-th conv-pool block, (X^(l), A^(l)), we compute:

s^(l) = ( (1/N^(l)) Σ_{i=1}^{N^(l)} x_i^(l) ) ‖ ( max_{i=1}^{N^(l)} x_i^(l) )

where N^(l) is the number of nodes of the graph, x_i^(l) is the i-th node’s feature vector, and ‖ denotes concatenation. The final summary vector (for a graph CNN with L layers) is then obtained as the sum of all these summaries (i.e. s = Σ_{l=1}^{L} s^(l)) and submitted to an MLP for obtaining final predictions.
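The readout above can be sketched in a few lines of NumPy (function names are illustrative):

```python
import numpy as np

def readout(X):
    """Summary of one conv-pool block's output: concatenation of the
    mean and the elementwise max of all node embeddings (shape (2F,))."""
    return np.concatenate([X.mean(axis=0), X.max(axis=0)])

def graph_summary(block_outputs):
    """Final fixed-size summary: the sum of per-block readouts,
    in the spirit of JK-net aggregation."""
    return sum(readout(X) for X in block_outputs)

# Example: two blocks with constant embeddings (6 nodes, then 3 after pooling).
blocks = [np.ones((6, 4)), np.full((3, 4), 2.0)]
s = graph_summary(blocks)  # shape (8,): mean-and-max concat, summed over blocks
```

Note that the per-block summaries all share the same dimensionality (2F), regardless of how many nodes survive each pooling step, which is what makes their sum well defined.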
We find that aggregating across layers is important, not only to preserve information at different scales of processing, but also to efficiently retain information about smaller input graphs, which may quickly be pooled down to too few nodes.
The entire pipeline of our model may be visualised in Figure 1.
Datasets and evaluation procedure
To assess how well our sparse model can hierarchically compress the representation of a graph while still producing features relevant for classification, we evaluate the graph neural network architecture on several well-known benchmark tasks: biological (Enzymes, Proteins, D&D) and scientific collaboration (Collab) datasets KKMMN2016 . We report the performance achieved from carrying out 10-fold cross-validation on each of these, in relation to the results presented by Ying et al. ying2018hierarchical .
Our graph neural network architecture comprises three conv-pool blocks, each consisting of a graph convolutional layer with 128 features (Enzymes and Collab) or 64 features (D&D and Proteins), followed by a pooling step (refer to Section 2 for details). We ensure that enough information survives each coarsening stage by preserving 80% of the existing nodes (i.e. k = 0.8). A learning rate of 0.005 was used for Proteins and 0.0005 for all other datasets. The model was trained using the Adam optimizer kingma2014adam
for 100 epochs on Enzymes, 40 on Proteins, 20 on D&D and 30 on Collab.
Table 1 illustrates our comparison to the performances reported by Ying et al. ying2018hierarchical . In all cases, our algorithm significantly outperforms the GraphSAGE sparse aggregation method hamilton2017inductive , while remaining within 1 percentage point of accuracy of the three variants of DiffPool ying2018hierarchical , the most recent development in hierarchical graph representation learning. Unlike the latter, our method does not require quadratic memory, paving the way towards deploying scalable hierarchical graph classification algorithms on larger real-world datasets.
We also verify this claim empirically, through experiments on random inputs, in Figure 2, where we demonstrate that our method compares favourably to DiffPool on larger-scale graphs, even when our pooling layer does not drop any nodes (compared to a retain rate of 0.25 for DiffPool).
We would like to thank the developers of PyTorch paszke2017automatic . CC acknowledges funding by DREAM CDT. PV and PL have received funding from the European Union’s Horizon 2020 research and innovation programme PROPAG-AGEING under grant agreement No 634821. TK acknowledges funding by SAP SE. We especially thank Jian Tang and Max Welling for the extremely useful discussions.
-  Anonymous. Graph U-Net. Submitted to the Seventh International Conference on Learning Representations (ICLR), 2019. Under review.
-  Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
-  Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
-  Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
-  Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
-  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
-  Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
-  David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
-  Paul Erdos and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
-  Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
-  Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
-  Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
-  William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
-  Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. CVPR, volume 1, page 3, 2017.
-  Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Joshua B Tenenbaum, and Daniel LK Yamins. Flexible neural representation for physics prediction. arXiv preprint arXiv:1806.08047, 2018.
-  Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
-  Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
-  Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.
-  Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
-  Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
-  Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.
Appendix A Qualitative analysis
We qualitatively investigate the distribution of graph summaries, using a pre-trained model on a fold of the Collab dataset to produce 499 outputs across all 3 classes. Figure 3 shows that an evident clustering can be achieved, once the graph has been processed by the sequence of convolution and pooling layers leveraged by our architecture.