Pooling in Graph Convolutional Neural Networks

04/07/2020 ∙ by Mark Cheung, et al. ∙ Carnegie Mellon University

Graph convolutional neural networks (GCNNs) are a powerful extension of deep learning techniques to graph-structured data problems. We empirically evaluate several pooling methods for GCNNs, and combinations of those graph pooling methods with three different architectures: GCN, TAGCN, and GraphSAGE. We confirm that graph pooling, especially DiffPool, improves classification accuracy on popular graph classification datasets and find that, on average, TAGCN achieves comparable or better accuracy than GCN and GraphSAGE, particularly for datasets with larger and sparser graph structures.

I Introduction

Over the past decade, deep learning techniques such as convolutional neural networks (CNNs) have transformed fields like computer vision and other Euclidean data domains (i.e., domains in which data have a uniform, grid-like structure). Many important domains, however, consist of non-Euclidean data (i.e., data with irregular relationships that require mathematical concepts like graphs or manifolds to model explicitly). Such domains include social networks, sensor feeds, web traffic, supply chains, and biological systems. As these data grow in size and complexity, deep learning is a natural candidate for classification and pattern recognition, but conventional deep learning approaches are often sharply limited when data lack a Euclidean structure to exploit. There are ongoing efforts to extend deep learning to these non-Euclidean domains, and such techniques have been dubbed geometric deep learning [3].

In parallel with advances in geometric deep learning are advances in graph signal processing (GSP) [19, 18]. Research in GSP attempts to generalize classical signal processing theory for irregular data defined on graphs. One attraction of GSP is that it provides a unified mathematical framework through which to view the spectral and vertex domains of a graph. Concepts like frequency or smoothness, which can be understood intuitively in classical signal processing, can be explicitly defined for data on graphs.

Graph convolutional neural networks (GCNNs), an extension of CNNs to graph-structured data, were first implemented with concepts from spectral graph theory [4], and methods based on the spectral approach have since been refined and expanded [7, 15]. Reference [9] proposes the topology adaptive graph convolutional network (TAGCN), which defines graph convolution directly in the vertex domain as multiplication by polynomials of the graph adjacency matrix. This is consistent with the concept of convolution in graph signal processing [19]. TAGCN designs a set of fixed-size learnable filters whose topologies adapt to the topology of the graph as the filters scan the graph to perform convolution (see also [8, 21]). Other implementations, such as GraphSAGE [12] and graph attention networks (GATs) [22], are also defined directly in the vertex domain of the graph and apply a learned, convolution-like aggregation function.

An important operation in conventional CNNs is pooling, a nonlinear downsampling operation. Pooling layers in a CNN shrink the number of dimensions of the feature representation, thereby reducing the computation cost, memory footprint, and number of learned parameters. As a result, pooling allows for deeper networks in practice and can help control overfitting. Additionally, pooling has translation invariance properties that are desirable in many applications. Recently, the use of pooling in CNNs has come into question, but it remains popular.

Just as convolution and convolution-like methods have been proposed to create graph convolutional layers in GCNNs, several methods have been proposed to perform pooling in GCNNs [24, 11, 25]. Unlike convolution, which has been derived in GSP [19], pooling has not been rigorously defined. Therefore, the current generation of pooling methods is based on ad hoc rather than systematic approaches. These methods have nonetheless shown improved accuracy on popular graph classification datasets.

In this paper, we perform experiments on graph classification datasets, varying both the graph convolution and the graph pooling used in the GCNN. Graph classification is a supervised learning task in which previously unseen graphs are classified based on labeled graphs. This task is analogous to image classification. As with CNNs and image classification, tools like pooling layers are important for constructing high-level representations from node-level information.

The paper is organized as follows: Section II presents the background and related work. Section III describes our proposed approach. Section IV discusses the datasets used and presents the results and analysis. Finally, Section V concludes the paper.

I-A Graph Signal Processing Perspective

The convolutional and pooling operators in graph neural networks have a theoretical foundation in GSP. GSP [19] extends traditional discrete signal processing to graph signals, i.e., signals that are indexed by the nodes of a graph.

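Let $\mathcal{G} = (\mathcal{V}, A)$ be a graph with adjacency matrix $A \in \mathbb{C}^{N \times N}$, where $\mathcal{V}$ is the set of $N$ nodes and a nonzero entry $A_{ij}$ denotes a directed edge from node $j$ to node $i$.

A vector $x$ on $\mathcal{G}$ is a graph signal, where $\mathcal{X}$ is the signal space over the nodes of $\mathcal{G}$ and $x \in \mathcal{X}$. We write $x = [x_1, \ldots, x_N]^T$, where $x_i$ represents a measurement at node $i$.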

The heart of GCNNs is applying convolutional filters to graph signals. In GSP, convolution is a matrix-vector multiplication of a polynomial of the adjacency matrix, $\sum_{k=0}^{K} h_k A^k$, and the graph signal $x$. This definition is used to create the graph convolutional layer in GCNNs.
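The GSP definition of convolution above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the names `graph_filter`, `mat_vec`, and the filter taps `h` are assumptions introduced here, and plain Python lists are used to keep the example dependency-free.

```python
def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def graph_filter(A, x, h):
    """Return (h[0]*I + h[1]*A + ... + h[K]*A^K) x without forming A^k."""
    y = [0.0] * len(x)
    shifted = list(x)                      # A^0 x
    for h_k in h:
        y = [y_i + h_k * s_i for y_i, s_i in zip(y, shifted)]
        shifted = mat_vec(A, shifted)      # advance to the next power of A
    return y


# Toy example (hypothetical data): adjacency matrix of a directed 3-node cycle.
A = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
x = [1.0, 2.0, 3.0]
y = graph_filter(A, x, h=[0.5, 0.5])       # 0.5 * x + 0.5 * A x
print(y)
```

Applying the shift repeatedly to the signal, rather than computing matrix powers, keeps the cost at one matrix-vector product per filter tap.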

The GSP literature also addresses sampling: [5] and [1] propose several sampling set selection and sampling methods. The pooling methods explored herein are not based specifically on these sampling methods, but we observe that there is a relationship between sampling in GSP and pooling in GCNNs. Both reduce the number of values in the signal and can reduce the number of nodes in the graph. The key difference is that sampling focuses on how to recover the original signal from the sampled signal, whereas recoverability is not required of pooling algorithms in GCNNs.

II Related Work

In this section, we describe the infrastructure for graph convolutional and pooling layers and the related literature.

II-A Graph Convolutional Layer

We concentrate on three implementations of GCNNs, derived from different definitions of graph convolution: graph convolutional networks (GCNs) [15], GraphSAGE [12], and topology-adaptive graph convolutional networks (TAGCNs) [9, 21].

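In GCN [15], given a graph signal $X^{(0)} \in \mathbb{R}^{N \times C}$ (where $0$ denotes the input layer, $N$ is the number of nodes, and $C$ is the number of features/input channels) and a graph structure $A$, a graph convolutional layer is defined as follows:

$$X^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{(l)} W^{(l)}\right) \qquad (1)$$

where $\tilde{A} = A + I$, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^{(l)} \in \mathbb{R}^{C \times F}$ is the trainable weight matrix, $\sigma$ is the nonlinear activation function, and $F$ is the number of output channels. $l = 0$ for the first layer, and we can propagate the graph signal through additional layers in the network. This approach is based on a first-order approximation of localized spectral filters on graphs [13].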
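A minimal sketch of one GCN layer follows, assuming the standard propagation rule of [15]: add self-loops, normalize the adjacency matrix symmetrically by node degree, multiply by the features and the weight matrix, then apply a nonlinearity. The function name `gcn_layer`, the choice of ReLU, and the toy inputs are assumptions made for illustration; in practice a library layer would be used instead.

```python
import math


def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gcn_layer(A, X, W):
    """One GCN-style layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(A)
    # Add self-loops so each node keeps its own features.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    # Symmetric degree normalization.
    A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    H = matmul(matmul(A_hat, X), W)
    return [[max(0.0, v) for v in row] for row in H]   # ReLU (assumed activation)


# Toy example (hypothetical data): path graph 0 - 1 - 2, one input channel,
# two output channels.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0], [2.0], [3.0]]
W = [[1.0, -1.0]]
out = gcn_layer(A, X, W)
print(out)
```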

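In GraphSAGE [12], graph convolution is defined as follows, for each node $v$ of $\mathcal{G}$:

$$x_v^{(l+1)} = \sigma\left(W^{(l)} \left[\, x_v^{(l)} \,\Vert\, \mathrm{AGG}\left(\{x_u^{(l)} : u \in \mathcal{N}(v)\}\right) \right]\right) \qquad (2)$$

where $\Vert$ denotes concatenation, $\mathrm{AGG}$ is an aggregator function (e.g., sum, mean, or max), and $\mathcal{N}(v)$ is a random sample of the node $v$'s neighbors.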
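The sketch below illustrates a GraphSAGE-style update with the mean aggregator, following the description in [12]: sample a fixed number of neighbors, average their features, and combine the result with the node's own features through learned weights. It is an illustration under stated assumptions, not the paper's code. The names `sage_mean_layer`, `W_self`, and `W_neigh` are hypothetical, and using two weight matrices is equivalent to applying a single matrix to the concatenation of the two vectors.

```python
import random


def linear(W, x):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sage_mean_layer(neighbors, X, W_self, W_neigh, num_samples, rng):
    """One GraphSAGE-style update with a mean aggregator over sampled neighbors."""
    out = []
    for v, x_v in enumerate(X):
        nbrs = neighbors[v]
        # Random neighbor sample; uses every neighbor if there are too few.
        sampled = rng.sample(nbrs, min(num_samples, len(nbrs)))
        if sampled:
            agg = [sum(X[u][c] for u in sampled) / len(sampled)
                   for c in range(len(x_v))]
        else:
            agg = [0.0] * len(x_v)          # isolated node: nothing to aggregate
        h = [a + b for a, b in zip(linear(W_self, x_v), linear(W_neigh, agg))]
        out.append([max(0.0, h_i) for h_i in h])   # ReLU (assumed activation)
    return out


# Toy example (hypothetical data): star graph with node 0 at the center.
neighbors = [[1, 2], [0], [0]]
X = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
H = sage_mean_layer(neighbors, X, identity, identity,
                    num_samples=2, rng=random.Random(0))
print(H)
```

With `num_samples=2` every neighbor is drawn in this toy graph, so the output does not depend on the random seed.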

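In TAGCN [9], it is defined as follows:

$$X^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^{k} X^{(l)} W_k^{(l)}\right) \qquad (3)$$

where $\hat{A} = D^{-1/2} A D^{-1/2}$ and $D_{ii} = \sum_j A_{ij}$. Here $K$ is the order of the polynomial filter and $W_k^{(l)}$ is the trainable weight matrix associated with the $k$-th power of $\hat{A}$.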
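As a hedged sketch of the TAGCN idea described in [9], the layer below sums polynomial terms of a degree-normalized adjacency matrix, each with its own weight matrix, so the order-0 term passes the input signal through directly. The function name `tagcn_layer`, the ReLU activation, and the toy weights are assumptions for illustration only.

```python
import math


def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def tagcn_layer(A, X, weights):
    """ReLU(sum_k A_norm^k X W_k), where weights[k] is the matrix for power k."""
    n = len(A)
    deg = [sum(row) for row in A]
    # Symmetric normalization; zero entries are skipped so isolated nodes are safe.
    A_norm = [[A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
               for j in range(n)] for i in range(n)]
    out = [[0.0] * len(weights[0][0]) for _ in range(n)]
    shifted = X                                  # A_norm^0 X
    for W_k in weights:
        term = matmul(shifted, W_k)
        out = [[o + t for o, t in zip(r_out, r_term)]
               for r_out, r_term in zip(out, term)]
        shifted = matmul(A_norm, shifted)        # next power of A_norm applied to X
    return [[max(0.0, v) for v in row] for row in out]


# Toy example (hypothetical data): two connected nodes, filter order K = 2.
A = [[0, 1],
     [1, 0]]
X = [[1.0], [2.0]]
weights = [[[1.0]], [[0.5]], [[0.25]]]           # W_0, W_1, W_2
out = tagcn_layer(A, X, weights)
print(out)
```

Each power of the adjacency matrix has a separate weight matrix here, which is the structural difference from stacking several first-order layers that share one propagation step each.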

II-B Graph Pooling Layer

Similar to graph convolution, graph pooling is inspired by pooling in CNNs. In addition to static pooling methods [17, 2], various differentiable methods have been proposed.

Using the same notation as (1), a graph pooling operator should yield a new signal $X^{(l+1)} \in \mathbb{R}^{N' \times F}$ and adjacency matrix $A^{(l+1)} \in \mathbb{R}^{N' \times N'}$, usually with $N' < N$. See Fig. 1 for an example.

Fig. 1: Graph pooling, yielding a new signal and adjacency matrix

An important benefit of graph pooling is the hierarchical representation of data and structure. Without pooling, global patterns in the data are usually not considered until the final aggregation layer of a network. Below we describe four recent graph pooling algorithms.

II-B1 Sort Pooling

Sort Pooling (SortPool) [25] operates after the last graph convolution layer. Instead of summing or averaging features, SortPool arranges the vertices in a consistent order and outputs a representation of fixed size, so that a conventional CNN can be trained on it.

The vertices are sorted based on their structural roles within the graph. Using the connection between graph convolution and the Weisfeiler-Lehman subtree kernel [20], SortPool sorts the nodes in descending order by their features from the last layer, breaks ties using the features of the layer before, and finally selects the top $k$ nodes.
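The sorting step can be sketched as below. This is an illustrative reading of SortPool as described in [25], not the authors' implementation: nodes are ordered by their last feature channel in descending order, ties fall back to the preceding channels, and the output is truncated or zero-padded to exactly `k` rows. The function name `sort_pool` and the padding behavior are assumptions stated here for the example.

```python
def sort_pool(X, k):
    """Return a fixed-size (k rows) node feature matrix in a consistent order."""
    num_channels = len(X[0])
    # Compare rows by the last channel first, then by earlier channels on ties.
    ordered = sorted(X, key=lambda row: tuple(reversed(row)), reverse=True)
    kept = ordered[:k]
    # Zero-pad graphs that have fewer than k nodes.
    kept += [[0.0] * num_channels for _ in range(k - len(kept))]
    return kept


# Toy example (hypothetical data): four nodes, two channels.
X = [[0.1, 0.5],
     [0.9, 0.2],
     [0.3, 0.5],
     [0.4, 0.7]]
pooled = sort_pool(X, k=3)
padded = sort_pool(X[:2], k=3)
print(pooled)
print(padded)
```

Because every graph yields the same output shape, the result can be fed to ordinary 1-D convolution or dense layers.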

II-B2 Differentiable Pooling

Differentiable Pooling (DiffPool) [24] is a differentiable graph pooling module that learns hierarchical representations of graphs by aggregating nodes through several pooling layers. It uses a learned assignment matrix $S^{(l)} \in \mathbb{R}^{N \times N'}$, which softly assigns each of the $N$ nodes to one of $N'$ clusters, and updates the graph signal and topology as follows:

$$X^{(l+1)} = {S^{(l)}}^{T} \, \mathrm{GNN}\left(A^{(l)}, X^{(l)}\right) \qquad (4)$$
$$A^{(l+1)} = {S^{(l)}}^{T} A^{(l)} S^{(l)} \qquad (5)$$

where $\mathrm{GNN}$ is the GraphSAGE [12] operator with the mean aggregator. DiffPool achieves significantly better prediction accuracy than GraphSAGE, SortPool, and certain kernel methods, especially when global features are important for classification [24].
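The coarsening step of DiffPool can be sketched as follows, based on the description in [24]: a row-wise softmax turns assignment scores into a soft cluster assignment, which then pools both the node embeddings and the adjacency matrix. In the actual method the embeddings and assignment scores come from learned GNNs; here `Z` and `S_logits` are fixed, hypothetical arrays, and the function names are assumptions for illustration.

```python
import math


def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(M):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def diff_pool(A, Z, S_logits):
    """Coarsen a graph: X_new = S^T Z and A_new = S^T A S."""
    S = softmax_rows(S_logits)                 # N x N' soft cluster assignment
    S_t = [list(col) for col in zip(*S)]       # transpose, N' x N
    X_new = matmul(S_t, Z)
    A_new = matmul(matmul(S_t, A), S)
    return X_new, A_new


# Toy example (hypothetical data): path graph 0 - 1 - 2 - 3 pooled to 2 clusters.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Z = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
S_logits = [[8.0, -8.0], [8.0, -8.0], [-8.0, 8.0], [-8.0, 8.0]]
X_new, A_new = diff_pool(A, Z, S_logits)
print(X_new)
print(A_new)
```

With these near-hard assignments, nodes 0 and 1 merge into one cluster and nodes 2 and 3 into the other, so the pooled adjacency matrix carries intra-cluster edge weight on its diagonal.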

II-B3 Top-k Pooling

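Top-k Pool [11] pools using a trainable projection vector $p$: it projects the node features onto $p$, selects the indices of the $k$ largest projection scores, and keeps the corresponding rows of the signal and the corresponding edges in $A$.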
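A minimal sketch of the Top-k selection, following the description in [11]: each node is scored by projecting its features onto a (normalized) projection vector, the highest-scoring nodes are kept, their features are gated by a sigmoid of the score, and the adjacency matrix is restricted to the kept nodes. The function name `top_k_pool` and the toy inputs are assumptions; the projection vector would be learned in practice.

```python
import math


def top_k_pool(A, X, p, k):
    """Keep the k nodes with the largest projection scores X p / ||p||."""
    norm = math.sqrt(sum(p_c * p_c for p_c in p))
    scores = [sum(x_c * p_c for x_c, p_c in zip(row, p)) / norm for row in X]
    # Indices of the k largest scores, highest first.
    idx = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)[:k]
    # Sigmoid gate lets gradients reach p through the kept features.
    gate = [1.0 / (1.0 + math.exp(-scores[i])) for i in idx]
    X_new = [[g * v for v in X[i]] for i, g in zip(idx, gate)]
    A_new = [[A[i][j] for j in idx] for i in idx]
    return X_new, A_new, idx


# Toy example (hypothetical data): path graph 0 - 1 - 2 - 3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
X = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.0, -1.0]]
p = [0.0, 1.0]
X_new, A_new, idx = top_k_pool(A, X, p, k=2)
print(idx, A_new)
```

Note that edges between a kept node and a dropped node are simply discarded, so the pooled graph can become disconnected.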

Top-k Pool is inspired by encoder-decoder architectures like U-Nets. In addition to the Top-k Pool operation, there is also an Unpool operation that reverses the process. Combined, these two operations create an encoder-decoder model on graphs, known as the graph U-Net [11]. Reference [11] shows that Top-k Pool with the U-Net structure performs better than DiffPool; here we examine whether it performs well on its own, relative to other pooling algorithms.

II-B4 Self-Attention Graph Pooling

Self-Attention Graph Pooling (SagPool) [16] uses an attention mechanism to select the important nodes:

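$$Z = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{(l)} \Theta_{att}\right) \qquad (6)$$
$$\mathrm{idx} = \text{top-rank}\left(Z, \lceil kN \rceil\right) \qquad (7)$$
$$X^{(l+1)} = X^{(l)}_{\mathrm{idx},:} \odot Z_{\mathrm{idx}} \qquad (8)$$
$$A^{(l+1)} = A^{(l)}_{\mathrm{idx},\mathrm{idx}} \qquad (9)$$

where $\tilde{A} = A^{(l)} + I$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\Theta_{att}$ is the trainable attention parameter, $k \in (0, 1]$ is the pooling ratio, and $\odot$ is the broadcasted elementwise product.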

The attention score is computed with a GCN layer, and the top-scoring nodes are selected. Since graph convolution is used to obtain the self-attention score, SagPool uses both the graph features and structure [16]. Reference [16] shows that SagPool performs better than DiffPool and Top-k Pool on several biochemical datasets.
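The selection step can be sketched as below, assuming the scheme described in [16]: a single-channel GCN-style convolution produces one attention score per node, a fixed ratio of the highest-scoring nodes is kept, and the kept features are scaled by their scores. The function name `sag_pool`, the `tanh` activation, and the parameter vector `theta` are assumptions made for this illustration.

```python
import math


def sag_pool(A, X, theta, ratio):
    """Keep ceil(ratio * N) nodes ranked by a GCN-based attention score."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    # Project features to one channel, then propagate over the normalized graph.
    proj = [sum(x_c * t_c for x_c, t_c in zip(row, theta)) for row in X]
    scores = [math.tanh(sum(A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) * proj[j]
                            for j in range(n)))
              for i in range(n)]
    k = math.ceil(ratio * n)
    idx = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    X_new = [[scores[i] * v for v in X[i]] for i in idx]   # gate kept features
    A_new = [[A[i][j] for j in idx] for i in idx]
    return X_new, A_new, idx


# Toy example (hypothetical data): path graph 0 - 1 - 2 - 3, constant features.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
X = [[1.0], [1.0], [1.0], [1.0]]
X_new, A_new, idx = sag_pool(A, X, theta=[1.0], ratio=0.5)
print(sorted(idx), A_new)
```

Even with identical features, the two interior nodes score higher than the endpoints, which shows how the score depends on graph structure as well as on features.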

III Proposed Method

We first compare GCN, GraphSAGE, and TAGCN for graph classification across four benchmark datasets. We then investigate how pooling affects these results by combining the different convolutional architectures with the four pooling techniques described above, i.e., SortPool [25], DiffPool [24], Top-k Pool [11], and SagPool [16]. In each instance, the pooling method is paired with GCN or GraphSAGE (whichever was used in the original pooling paper) and compared with the same pooling method paired with TAGCN.

IV Experiments

IV-A Datasets

To evaluate the efficacy of the different methods, we apply them to real-world graph kernel benchmarks. See Table I for the properties of these datasets. We evaluate our methods on bioinformatics datasets and social network datasets. Both the MUTAG and Proteins datasets are bioinformatics data. MUTAG [6] is a dataset consisting of chemical compounds represented by graphs. The task is to predict whether the chemical compound is mutagenic. Proteins [14] is a dataset consisting of proteins represented by graphs. The objective is to predict whether a protein functions as an enzyme. In both datasets, the nodes are structure elements, and two nodes are connected if there is a chemical bond between the structure elements represented by the nodes.

For social network datasets, we chose IMDB-Binary and Reddit-Binary. IMDB-Binary [23] is a set of graphs corresponding to ego-networks of actors and actresses. An edge is drawn between two actors if they were cast in the same movie. The task is to predict whether an ego-network comes from the romance or the action genre. In Reddit-Binary [23], each graph corresponds to an online discussion thread. An edge is drawn between two users if one has replied to the other. The task is to predict whether a thread belongs to a discussion forum or a question-answering forum.

Dataset         Graphs   Classes   Avg. Nodes   Avg. Edges
MUTAG           188      2         17.7         38.9
Proteins        1113     2         39.06        72.82
IMDB-Binary     1000     2         19.77        96.53
Reddit-Binary   2000     2         429.63       497.75
TABLE I: Properties of Graph Classification Datasets
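Since the analysis later refers to how sparse or dense the graphs are, a quick calculation on the Table I values gives a rough, relative indicator. This is only a sketch: the edges-per-node ratio is a helper introduced here, and the source does not state whether each undirected edge is counted once or twice, so the absolute values should not be read as average degrees.

```python
# (average nodes, average edges) per graph, copied from Table I.
datasets = {
    "MUTAG": (17.7, 38.9),
    "Proteins": (39.06, 72.82),
    "IMDB-Binary": (19.77, 96.53),
    "Reddit-Binary": (429.63, 497.75),
}

# Edges per node as a simple relative measure of density.
edges_per_node = {name: edges / nodes for name, (nodes, edges) in datasets.items()}

# Datasets ordered from sparsest to densest under this measure.
ranked = sorted(edges_per_node, key=edges_per_node.get)

for name in ranked:
    print(f"{name:14s} {edges_per_node[name]:.2f}")
```

Under this measure Reddit-Binary has both the largest and the sparsest graphs, while IMDB-Binary has the densest.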

IV-B Network Training

We perform 5-fold cross-validation to select the hyperparameters from the validation accuracy and to estimate the test accuracy. For the baselines, the hyperparameters are the number of graph convolutional layers, the number of channels in each layer, the dropout rate, the pooling rate (number or percentage of nodes to keep), and (for TAGCN) the order of the polynomial filter. For a fairer comparison, we considered 1-5 layers for TAGCN vs. 1-15 layers for GCN and GraphSAGE when using graph polynomial filters of degree 3 (to show that one layer of TAGCN with degree $K$ is not equivalent to $K$ layers of GCN/GraphSAGE). We use cross-entropy loss and Adam optimization with a starting learning rate of 0.01, a decay factor of 0.5, and a decay step size of 50. Experiments were performed in PyTorch using code from the PyTorch Geometric library [10].
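The learning-rate schedule described above can be written out explicitly. The sketch below assumes that the decay step is measured in epochs, which the text does not state, and the function name `learning_rate` is hypothetical. In PyTorch the same schedule would typically be obtained with `torch.optim.lr_scheduler.StepLR`.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.5, step=50):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# The rate stays constant within each 50-epoch window and halves between windows.
schedule = [learning_rate(e) for e in (0, 49, 50, 100, 150)]
print(schedule)
```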

IV-C Results

Fig. 2: Comparison of graph classification accuracies of GCNN variants (TAGCN, GCN, GraphSAGE) with no pooling, SortPool, DiffPool, Top-k Pool, and SagPool across four datasets (MUTAG, Proteins, IMDB-Binary, Reddit-Binary)

Fig. 2 shows the results of GCNN variants with no pooling, DiffPool, SagPool, SortPool, and Top-k Pool. The green, orange, and blue bars are the means of the cross-validated accuracy, and the black error bars are their standard deviations.

IV-C1 Graph Convolution Comparison

In general, TAGCN performs better than GCN and GraphSAGE on the four graph classification benchmarks. However, due to its increased complexity, TAGCN has high variance, especially on denser graph structures. TAGCN performs better as graphs become sparser, i.e., as average degree decreases.

We also show empirically that simply increasing the number of layers in GCN and GraphSAGE is not analogous to increasing the order of the polynomial filter in TAGCN. We attribute the advantage of TAGCN mainly to 1) passing a residual connection of the graph signal, and 2) having weights associated with each polynomial of the adjacency matrix. In comparison, GCN and GraphSAGE do not improve much after five layers, perhaps because they also suffer from oversmoothing.

IV-C2 Graph Convolution and Graph Pooling Comparison

Among the pooling algorithms, DiffPool generally performs the best. SagPool and SortPool perform better on MUTAG and Proteins, but similarly or worse on IMDB-Binary and Reddit-Binary. Top-k Pool performs poorly, suggesting that it requires the encoder-decoder structure to perform well. In general, only DiffPool is consistently better than no pooling.

The results for graph convolution carry over when graph pooling is added. TAGCN with pooling generally performs better than GCN and GraphSAGE with pooling but is more prone to overfitting, likely for the same reasons.

V Conclusion

On average, TAGCN performs well against GCN and GraphSAGE on graph classification datasets, with and without pooling, particularly for sparser and larger graphs. We also find that DiffPool generally outperforms the other pooling methods evaluated. For future work, we would like to develop a better theoretical understanding of GCNNs by studying problems such as oversmoothing and the design of different parameters.

References

  • [1] A. Anis, A. Gadde, and A. Ortega (2016-07) Efficient Sampling Set Selection for Bandlimited Graph Signals Using Graph Spectral Proxies. IEEE Transactions on Signal Processing 64 (14), pp. 3775–3789.
  • [2] Y. Boykov and O. Veksler (2006) Graph Cuts in Vision and Graphics: Theories and Application. In Handbook of Mathematical Models in Computer Vision, N. Paragios, Y. Chen, and O. D. Faugeras (Eds.), pp. 79–96.
  • [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017-07) Geometric Deep Learning: Going Beyond Euclidean Data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
  • [4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral Networks and Locally Connected Networks on Graphs. In International Conference on Learning Representations (ICLR).
  • [5] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević (2015-12) Discrete Signal Processing on Graphs: Sampling Theory. IEEE Transactions on Signal Processing 63 (24), pp. 6510–6523.
  • [6] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch (1991) Structure-activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with Molecular Orbital Energies and Hydrophobicity. Journal of Medicinal Chemistry 34 (2), pp. 786–797.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems (NIPS).
  • [8] J. Du, J. Shi, S. Kar, and J. M. F. Moura (2018-06) On Graph Convolution For Graph CNNs. In 2018 IEEE Data Science Workshop (DSW), pp. 1–5.
  • [9] J. Du, S. Zhang, G. Wu, J. M. F. Moura, and S. Kar (2017) Topology Adaptive Graph Convolutional Networks. Computing Research Repository abs/1710.10370.
  • [10] M. Fey and J. E. Lenssen (2019) Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
  • [11] H. Gao and S. Ji (2019-06) Graph U-Nets. In 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 2083–2092.
  • [12] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1024–1034.
  • [13] D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on Graphs via Spectral Graph Theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150.
  • [14] K. M. Borgwardt, C. S. Ong, S. Schönauer, S.V.N. Vishwanathan, and H.-P. Kriegel (2005) Protein Function Prediction via Graph Kernels. Intelligent Systems for Molecular Biology.
  • [15] T. N. Kipf and M. Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations (ICLR), 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • [16] J. Lee, I. Lee, and J. Kang (2019-06) Self-Attention Graph Pooling. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 3734–3743.
  • [17] E. Luzhnica, B. Day, and P. Liò (2019) Clique Pooling for Graph Classification. Computing Research Repository abs/1904.00374.
  • [18] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst (2018-05) Graph Signal Processing: Overview, Challenges, and Applications. Proceedings of the IEEE 106 (5), pp. 808–828.
  • [19] A. Sandryhaila and J. M. F. Moura (2013-04) Discrete Signal Processing on Graphs. IEEE Transactions on Signal Processing 61 (7), pp. 1644–1656.
  • [20] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011-11) Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, pp. 2539–2561.
  • [21] J. Shi, M. Cheung, J. Du, and J. M. F. Moura (2018-10) Classification with Vertex-Based Graph Convolutional Neural Networks. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 752–756.
  • [22] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. In 6th International Conference on Learning Representations (ICLR), 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • [23] P. Yanardag and S.V.N. Vishwanathan (2015) Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 1365–1374.
  • [24] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec (2018) Hierarchical Graph Representation Learning with Differentiable Pooling. In 32nd International Conference on Neural Information Processing Systems (NIPS), pp. 4805–4815.
  • [25] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An End-to-End Deep Learning Architecture for Graph Classification. In 32nd AAAI Conference on Artificial Intelligence.