Towards Sparse Hierarchical Graph Classifiers

by Cătălina Cangea, et al.

Recent advances in representation learning on graphs, mainly leveraging graph convolutional networks, have brought a substantial improvement on many graph-based benchmark tasks. While novel approaches to learning node embeddings are highly suitable for node classification and link prediction, their application to graph classification (predicting a single label for the entire graph) remains mostly rudimentary, typically using a single global pooling step to aggregate node features or a hand-designed, fixed heuristic for hierarchical coarsening of the graph structure. An important step towards ameliorating this is differentiable graph coarsening: the ability to reduce the size of the graph in an adaptive, data-dependent manner within a graph neural network pipeline, analogous to image downsampling within CNNs. However, the previous prominent approach to pooling has quadratic memory requirements during training and is therefore not scalable to large graphs. Here we combine several recent advances in graph neural network design to demonstrate that competitive hierarchical graph classification results are possible without sacrificing sparsity. Our results are verified on several established graph classification benchmarks, and highlight an important direction for future research in graph-based neural networks.




1 Introduction and Related Work

Here we study the problem of graph classification: the task of learning to categorise graphs into classes. This is a direct generalisation of image classification krizhevsky2012imagenet, as images may be easily cast as a special case of a “grid graph” (with each pixel of an image connected to its eight immediate neighbours). Therefore, it is natural to investigate and generalise CNN elements to graphs bronstein2017geometric ; hamilton2017representation ; battaglia2018relational.

(The first two authors contributed equally.)

Generalising the convolutional layer to graphs has been a very active area of research, with several graph convolutional layers bruna2013spectral ; defferrard2016convolutional ; kipf2016semi ; gilmer2017neural ; velickovic2018graph proposed in recent times, significantly advancing the state-of-the-art on many challenging node classification benchmarks (analogues of image segmentation in the graph domain), as well as link prediction. In contrast, generalising pooling layers has received substantially less attention from the community.

The proposed strategies broadly fall into two categories: 1) aggregating node representations in a global pooling step after each duvenaud2015convolutional or after the final message passing step li2015gated ; dai2016discriminative ; gilmer2017neural, and 2) aggregating node representations into clusters which coarsen the graph in a hierarchical manner bruna2013spectral ; niepert2016learning ; defferrard2016convolutional ; monti2017geometric ; simonovsky2017dynamic ; fey2018splinecnn ; mrowca2018flexible ; ying2018hierarchical ; anonymous2019graph. Apart from ying2018hierarchical ; anonymous2019graph, all earlier works in this area assume a fixed, pre-defined cluster assignment that is obtained by running a clustering algorithm on the graph nodes, e.g. using the GraClus algorithm dhillon2007weighted to obtain structure-dependent cluster assignments or finding clusters via k-means on node features mrowca2018flexible. The main insight of recent works ying2018hierarchical ; anonymous2019graph is that intermediate node representations (e.g. after applying a graph convolution layer) can be leveraged to obtain both feature- and structure-based cluster assignments that are adaptive to the underlying data and that can be learned in a differentiable manner.

The first end-to-end trainable graph CNN with a learnable pooling operator was recently pioneered, leveraging the DiffPool layer ying2018hierarchical . DiffPool computes soft clustering assignments of nodes from the original graph to nodes in the pooled graph. Through a combination of restricting the clustering scores to respect the input graph’s adjacency information, and a sparsity-inducing entropy regulariser, the clustering learnt by DiffPool eventually converges to an almost-hard clustering with interpretable structure, and leads to state-of-the-art results on several graph classification benchmarks.

The main limitation of DiffPool is the computation of the soft clustering assignments: although the assignments eventually converge, during the early phases of training an entire assignment matrix must be stored, relating nodes from the original graph to nodes from the pooled graph in an all-pairs fashion. This incurs a quadratic $\mathcal{O}(kV^2)$ storage complexity (for a graph with $V$ nodes) for any pooling scheme with a fixed pooling ratio $k$, and is therefore prohibitive for large graphs.
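To make the scaling argument concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It counts the entries of a dense node-to-cluster assignment matrix and compares them with one score per node plus one record per edge. The function names, the ratio of 0.25 and the edge count of twice the node count are illustrative assumptions, and feature dimensions and constant factors are ignored.

```python
import math


def dense_assignment_entries(num_nodes: int, ratio: float) -> int:
    """Entries of a dense matrix assigning every node to ceil(ratio * N) clusters."""
    return num_nodes * math.ceil(ratio * num_nodes)


def sparse_entries(num_nodes: int, num_edges: int) -> int:
    """One scalar per node plus one record per edge."""
    return num_nodes + num_edges


for n in (1_000, 10_000, 100_000):
    dense = dense_assignment_entries(n, 0.25)
    sparse = sparse_entries(n, 2 * n)  # hypothetical edge count: twice the node count
    print(f"N={n:>7,}  dense assignment: {dense:>14,}  sparse: {sparse:>8,}")
```

Doubling the number of nodes quadruples the dense count but only doubles the sparse one, which is the gap the paragraph above refers to.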

In this work, we leverage recent advances in graph neural network design hamilton2017inductive ; anonymous2019graph ; xu2018representation to demonstrate that sparsity need not be sacrificed to obtain good performance on end-to-end graph convolutional architectures with pooling. We demonstrate performance that is comparable to variants of DiffPool on four standard graph classification benchmarks, all while using a graph CNN that only requires $\mathcal{O}(V + E)$ storage for a graph with $V$ nodes and $E$ edges (comparable to the storage complexity of the input graph).

2 Model

We assume a standard graph-based machine learning setup; the input graph is represented as a matrix of node features, $\mathbf{X} \in \mathbb{R}^{N \times F}$, and an adjacency matrix, $\mathbf{A} \in \mathbb{R}^{N \times N}$. Here, $N$ is the number of nodes in the graph, and $F$ the number of features. In the cases where the graph is featureless, one may use the node degree information (e.g. one-hot encoding the node degree for all degrees up to a given upper bound) to serve as artificial node features. While the adjacency matrix may consist of real numbers (and may even contain arbitrary edge features), here we restrict our attention to undirected and unweighted graphs; i.e. $\mathbf{A}$ is assumed to be binary and symmetric.
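As an illustration of the degree-based features mentioned for featureless graphs, here is a minimal sketch. The function name is hypothetical, the adjacency matrix is a dense list of lists purely for readability, and clipping degrees above the bound into the last bucket is an assumption: the text only says that degrees are encoded up to a given upper bound.

```python
def degree_one_hot(adjacency, max_degree):
    """One-hot encode each node's degree into max_degree + 1 buckets.

    adjacency:  binary, symmetric N x N matrix given as a list of rows
    max_degree: upper bound; larger degrees share the last bucket (assumption)
    """
    features = []
    for row in adjacency:
        degree = min(sum(row), max_degree)
        one_hot = [0.0] * (max_degree + 1)
        one_hot[degree] = 1.0
        features.append(one_hot)
    return features


# Path graph 0 - 1 - 2: the end nodes have degree 1, the middle node degree 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(degree_one_hot(path, max_degree=2))
```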

To specify a CNN-inspired neural network for graph classification, we first require a convolutional and a pooling layer. In addition, we require a readout layer (analogous to a flattening layer in an image CNN) that converts the learnt representations into a fixed-size vector representation, to be used for final prediction (e.g. by a simple MLP). These layers are specified in the following paragraphs.

Convolutional layer

Given that our model will be required to classify unseen graph structures at test time, the main requirement of the convolutional layer in our architecture is that it is inductive, i.e. that it does not depend on a fixed and known graph structure. The simplest such layer is the mean-pooling propagation rule, as similarly used in GCN kipf2016semi or Const-GAT velickovic2018graph:

$$\mathrm{MP}(\mathbf{X}, \mathbf{A}) = \sigma\left(\hat{\mathbf{D}}^{-1}\hat{\mathbf{A}}\mathbf{X}\mathbf{\Theta} + \mathbf{X}\mathbf{\Theta}'\right)$$

where $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$ is the adjacency matrix with inserted self-loops and $\hat{\mathbf{D}}$ is its corresponding degree matrix; i.e. $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$. We have used the rectified linear (ReLU) activation for $\sigma$. $\mathbf{\Theta}, \mathbf{\Theta}' \in \mathbb{R}^{F \times F'}$ are learnable linear transformations applied to every node. The transformation through $\mathbf{\Theta}'$ represents a simple skip-connection he2016deep, further encouraging preservation of information about the central node.
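The propagation rule can be sketched on an edge list, so that no dense adjacency matrix is ever built. This sketch assumes the rule takes the form ReLU(D̂⁻¹ÂXΘ + XΘ′), with Â the adjacency matrix plus self-loops. The function name and the plain-list data layout are illustrative choices, not the authors' implementation.

```python
def mean_conv(features, edges, theta, theta_skip):
    """Apply relu(D^-1 (A + I) X theta + X theta_skip) using an edge list.

    features:   list of N rows, each with F floats
    edges:      undirected edges (u, v), each listed once, no self-loops
    theta:      F x F' weight matrix (list of rows) for the neighbourhood mean
    theta_skip: F x F' weight matrix for the skip connection
    """
    n = len(features)
    # Start from each node's own features: this is the inserted self-loop.
    sums = [row[:] for row in features]
    degree = [1] * n
    for u, v in edges:
        sums[u] = [a + b for a, b in zip(sums[u], features[v])]
        sums[v] = [a + b for a, b in zip(sums[v], features[u])]
        degree[u] += 1
        degree[v] += 1

    def project(row, weight):
        # Row vector times matrix.
        return [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]

    output = []
    for i in range(n):
        mean = [s / degree[i] for s in sums[i]]
        mixed = project(mean, theta)
        skip = project(features[i], theta_skip)
        output.append([max(0.0, m + s) for m, s in zip(mixed, skip)])
    return output


# Path graph 0 - 1 - 2 with one scalar feature per node and identity weights.
print(mean_conv([[1.0], [2.0], [3.0]], [(0, 1), (1, 2)], [[1.0]], [[1.0]]))
```

Working from the edge list keeps the cost proportional to the number of nodes and edges, which matches the sparsity argument made in the introduction.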

Pooling layer

To make sure that a graph downsampling layer behaves idiomatically with respect to a wide class of graph sizes and structures, we adopt the approach of reducing the graph with a pooling ratio, $k \in (0, 1]$. This implies that a graph with $N$ nodes will have $\lceil kN \rceil$ nodes after application of such a pooling layer.

Unlike DiffPool, which attempts to do this via computing a clustering of the $N$ nodes into $\lceil kN \rceil$ clusters (and therefore incurs a quadratic penalty in storing cluster assignment scores), we leverage the recently proposed Graph U-Net architecture anonymous2019graph, which simply drops $N - \lceil kN \rceil$ nodes from the original graph.
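The effect of a fixed pooling ratio on graph size can be sketched as follows. This assumes that each pooling layer keeps the ceiling of the ratio times the current node count. The function name is hypothetical, and `Fraction` is used only so that decimal ratios such as 0.8 are handled exactly before rounding up.

```python
import math
from fractions import Fraction


def pooled_sizes(num_nodes, ratio, num_layers):
    """Node counts before pooling and after each of num_layers pooling steps."""
    k = Fraction(str(ratio))  # exact decimal ratio, avoids float round-off in ceil
    sizes = [num_nodes]
    for _ in range(num_layers):
        sizes.append(math.ceil(k * sizes[-1]))
    return sizes


print(pooled_sizes(100, 0.8, 3))  # [100, 80, 64, 52]
print(pooled_sizes(10, 0.25, 2))  # [10, 3, 1]
```

Under this rounding assumption, a ratio of 0.8 leaves graphs with fewer than five nodes unchanged, since the ceiling rounds back up to the original count.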

The choice of which nodes to drop is done based on a projection score against a learnable vector, $\vec{p}$. In order to enable gradients to flow into $\vec{p}$, the projection scores are also used as gating values, such that retained nodes receiving lower scores will experience less significant feature retention. Fully written out, the operation of this pooling layer (computing a pooled graph, $(\mathbf{X}', \mathbf{A}')$, from an input graph, $(\mathbf{X}, \mathbf{A})$), may be expressed as follows:

$$\vec{y} = \frac{\mathbf{X}\vec{p}}{\|\vec{p}\|} \qquad \vec{i} = \text{top-}k(\vec{y}, k) \qquad \mathbf{X}' = \left(\mathbf{X} \odot \tanh(\vec{y})\right)_{\vec{i}} \qquad \mathbf{A}' = \mathbf{A}_{\vec{i},\vec{i}}$$

Here, $\|\cdot\|$ is the $L_2$ norm, top-$k$ selects the top-$k$ indices from a given input vector, $\odot$ is (broadcasted) elementwise multiplication, and $\cdot_{\vec{i}}$ is an indexing operation which takes slices at indices specified by $\vec{i}$. This operation requires only a pointwise projection operation and slicing into the original feature and adjacency matrices, and therefore trivially retains sparsity.
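A minimal sketch of this node-dropping step on an edge list is given below. Several details are assumptions rather than facts stated in the text: the gate applied to retained features is taken to be tanh of the projection score, the number of retained nodes is the ceiling of the ratio times the node count, and ties between equal scores are broken by node index. The function name and data layout are illustrative.

```python
import math


def top_k_pool(features, edges, p, ratio):
    """Keep the ceil(ratio * N) nodes with the highest projection score.

    Returns (pooled_features, pooled_edges, kept_indices).
    """
    n = len(features)
    norm = math.sqrt(sum(w * w for w in p))  # L2 norm of the projection vector
    scores = [sum(x * w for x, w in zip(row, p)) / norm for row in features]

    keep_count = math.ceil(ratio * n)
    # Stable sort: nodes with equal scores stay in index order (assumption).
    kept = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep_count]

    # Gate retained features by their score so gradients can reach p.
    pooled_features = [[x * math.tanh(scores[i]) for x in features[i]] for i in kept]

    # Slice the adjacency: keep edges whose endpoints both survive, re-indexed.
    new_index = {old: new for new, old in enumerate(kept)}
    pooled_edges = [(new_index[u], new_index[v])
                    for u, v in edges
                    if u in new_index and v in new_index]
    return pooled_features, pooled_edges, kept


x = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
e = [(0, 1), (0, 2), (2, 3)]
pooled_x, pooled_e, kept = top_k_pool(x, e, p=[1.0, 0.0], ratio=0.5)
print(kept, pooled_e)  # [2, 0] [(1, 0)]
```

Nothing larger than one score per node and a filtered copy of the edge list is allocated, which is why this style of pooling keeps the graph sparse.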

Readout layer

Lastly, we seek a “flattening” operation that will preserve information about the input graph in a fixed-size representation. A natural way to do this in CNNs is global average pooling, i.e. the average of all learnt node embeddings in the final layer. We further augment this by performing global max pooling as well, which we found strengthened our representations. Lastly, inspired by the JK-net architecture xu2018representation ; xu2018powerful, we perform this summarisation after each conv-pool block of the network, and aggregate all of the summaries together by taking their sum.

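Concretely, to summarise the output graph of the $l$-th conv-pool block, $(\mathbf{X}^{(l)}, \mathbf{A}^{(l)})$:

$$\vec{s}^{(l)} = \frac{1}{N^{(l)}} \sum_{i=1}^{N^{(l)}} \vec{x}_i^{(l)} \;\Big\|\; \max_{i=1}^{N^{(l)}} \vec{x}_i^{(l)}$$

where $N^{(l)}$ is the number of nodes of the graph, $\vec{x}_i^{(l)}$ is the $i$-th node’s feature vector, and $\|$ denotes concatenation. Then, the final summary vector (for a graph CNN with $L$ layers) is obtained as the sum of all those summaries (i.e. $\vec{s} = \sum_{l=1}^{L} \vec{s}^{(l)}$) and submitted to an MLP for obtaining final predictions.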
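The readout described here (mean and max of the node features, concatenated per block, then summed across blocks) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes every block outputs the same feature width so that the per-block summaries can be added.

```python
def summarise(features):
    """Concatenate the feature-wise mean and feature-wise max over all nodes."""
    n = len(features)
    dims = range(len(features[0]))
    mean = [sum(row[j] for row in features) / n for j in dims]
    peak = [max(row[j] for row in features) for j in dims]
    return mean + peak


def readout(per_block_features):
    """Sum the per-block summaries into one fixed-size vector."""
    summaries = [summarise(block) for block in per_block_features]
    return [sum(values) for values in zip(*summaries)]


block_1 = [[1.0, 2.0], [3.0, 4.0]]  # two nodes after the first conv-pool block
block_2 = [[5.0, 6.0]]              # one node after the second block
print(readout([block_1, block_2]))  # [7.0, 9.0, 8.0, 10.0]
```

The output has twice the feature width regardless of how many nodes each block retains, which is what lets a fixed-size MLP consume it.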

We find that the aggregation across layers is important, not only to preserve information at different scales of processing, but also to retain information efficiently on smaller input graphs, which may quickly be pooled down to too few nodes.

Figure 1: The full pipeline of our model (for ), leveraging several stacks of interleaved convolutional/pooling layers (that, unlike DiffPool, drop rather than aggregate nodes), as well as a JK-net-style summary, combining information at different scales.

The entire pipeline of our model may be visualised in Figure 1.

3 Experiments

Datasets and evaluation procedure

To assess how well our sparse model can hierarchically compress the representation of a graph while still producing features relevant for classification, we evaluate the graph neural network architecture on several well-known benchmark tasks: biological (Enzymes, Proteins, D&D) and scientific collaboration (Collab) datasets KKMMN2016 . We report the performance achieved from carrying out 10-fold cross-validation on each of these, in relation to the results presented by Ying et al. ying2018hierarchical .

Model parameters

Our graph neural network architecture comprises three blocks, each of them consisting of a graph convolutional layer with 128 (Enzymes and Collab) or 64 features (D&D and Proteins), followed by a pooling step (refer to Section 2 for details). We ensure that there is enough information after each coarsening stage by preserving 80% of the existing nodes. A learning rate of 0.005 was used for Proteins and 0.0005 for all other datasets. The model was trained using the Adam optimizer kingma2014adam for 100 epochs on Enzymes, 40 on Proteins, 20 on D&D and 30 on Collab.
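For reference, the hyperparameters stated in this paragraph can be collected into a small configuration table. The class and field names below are hypothetical, and anything not stated in the text (batch size, weight decay, MLP shape) is deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    hidden_features: int
    learning_rate: float
    epochs: int
    num_blocks: int = 3         # three conv-pool blocks
    pooling_ratio: float = 0.8  # 80% of nodes preserved per pooling step
    optimizer: str = "adam"


CONFIGS = {
    "Enzymes":  TrainConfig(hidden_features=128, learning_rate=0.0005, epochs=100),
    "Collab":   TrainConfig(hidden_features=128, learning_rate=0.0005, epochs=30),
    "D&D":      TrainConfig(hidden_features=64, learning_rate=0.0005, epochs=20),
    "Proteins": TrainConfig(hidden_features=64, learning_rate=0.005, epochs=40),
}

for name, config in CONFIGS.items():
    print(name, config)
```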


Model          Enzymes  D&D    Collab  Proteins
Graphlet       41.03    74.85  64.66   72.91
Shortest-path  42.32    78.86  59.10   76.43
1-WL           53.43    74.02  78.61   73.76
WL-QA          60.13    79.04  80.74   75.26
PatchySAN      -        76.27  72.60   75.00
GraphSAGE      54.25    75.42  68.25   70.48
ECC            53.50    74.10  67.79   72.65
Set2Set        60.15    78.12  71.75   74.29
SortPool       57.12    79.37  73.76   75.54
DiffPool-Det   58.33    75.47  82.13   75.62
DiffPool-NoLP  62.67    79.98  75.63   77.42
DiffPool       64.23    81.15  75.50   78.10
Ours           64.17    78.59  74.54   75.46
Table 1: Classification accuracy percentages. Our model successfully outperforms the sparse aggregation-based GraphSAGE baseline, while being a close competitor to DiffPool variants, across all datasets. This confirms the effectiveness of leveraging learnable pooling while preserving sparsity.
Figure 2: GPU memory usage of our method (with no pooling; $k = 1$) and DiffPool ($k = 0.25$) during training on Erdős-Rényi graphs erdos1960evolution of varying node sizes (and ). Both methods ran with 128 input and hidden features, and three Conv-Pool layers. “OOM” denotes out-of-memory.

Table 1 illustrates our comparison to the performances reported by Ying et al. ying2018hierarchical. In all cases, our algorithm significantly outperforms the GraphSAGE sparse aggregation method hamilton2017inductive, while remaining competitive with the three variants of DiffPool ying2018hierarchical, the recent singular development in hierarchical graph representation learning: on every dataset, it either outperforms or trails by less than 1 percentage point at least one of them. Unlike DiffPool, our method does not require quadratic memory, paving the way to deploying scalable hierarchical graph classification algorithms on larger real-world datasets.
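The margins behind this comparison can be recomputed directly from the accuracies in Table 1. The short script below does only that arithmetic; the numbers are copied from the table and the variable and function names are illustrative.

```python
OURS = {"Enzymes": 64.17, "D&D": 78.59, "Collab": 74.54, "Proteins": 75.46}

BASELINES = {
    "GraphSAGE":     {"Enzymes": 54.25, "D&D": 75.42, "Collab": 68.25, "Proteins": 70.48},
    "DiffPool-Det":  {"Enzymes": 58.33, "D&D": 75.47, "Collab": 82.13, "Proteins": 75.62},
    "DiffPool-NoLP": {"Enzymes": 62.67, "D&D": 79.98, "Collab": 75.63, "Proteins": 77.42},
    "DiffPool":      {"Enzymes": 64.23, "D&D": 81.15, "Collab": 75.50, "Proteins": 78.10},
}


def margins(baseline):
    """Accuracy of our model minus the baseline, in percentage points."""
    return {d: round(OURS[d] - BASELINES[baseline][d], 2) for d in OURS}


for name in BASELINES:
    print(f"{name:<14}", margins(name))
```

A positive margin means the sparse model is ahead of that baseline on that dataset. The margins against GraphSAGE are positive on all four datasets, while those against the DiffPool variants vary in sign and size.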

We also verify this claim empirically, through experiments on random inputs, in Figure 2, where we demonstrate that our method compares favourably to DiffPool on larger-scale graphs, even if the pooling layer doesn’t drop any nodes (compared to a 0.25 retain rate for DiffPool).


We would like to thank the developers of PyTorch paszke2017automatic. CC acknowledges funding by DREAM CDT. PV and PL have received funding from the European Union’s Horizon 2020 research and innovation programme PROPAG-AGEING under grant agreement No 634821. TK acknowledges funding by SAP SE. We specially thank Jian Tang and Max Welling for the extremely useful discussions.


Appendix A Qualitative analysis

Figure 3: t-SNE plot illustrating the classification capabilities of our model. The points represent summaries of 499 Collab test graphs; each of the three classes corresponds to a different color.

We qualitatively investigate the distribution of graph summaries, using a pre-trained model on a fold of the Collab dataset to produce 499 outputs across all 3 classes. Figure 3 shows that an evident clustering can be achieved, once the graph has been processed by the sequence of convolution and pooling layers leveraged by our architecture.