In recent years, a fast-growing field of applying deep learning to graphs has emerged. Much of this work is inspired by generalizing CNNs to the non-Euclidean and sparsely connected data that graphs represent. But while a multitude of different GCNs have been proposed, the number of proposed pooling layers remains small.
Yet intelligent pooling on graphs holds significant promise: It might both identify clusters (feature- or structure-based) and reduce computational requirements by reducing the number of nodes. Together, these promise to abstract from flat nodes to hierarchical sets of nodes. They are also a stepping stone towards enabling GNN to modify graph structures instead of only node features.
We propose a new pooling layer based on edge contractions (EdgePool, see Fig. 1), which aims to correct weaknesses in previously proposed learned pooling layers. We do this by viewing the task not as choosing nodes but as choosing edges and pooling the connected nodes. This immediately and naturally takes the graph structure into account and ensures that we never drop nodes completely.
The main advantages of our proposed EdgePool layer are:
- EdgePool performs better than other pooling methods.
- EdgePool can be integrated in existing graph classification architectures.
- EdgePool can be used for node classification and improves performance.
2 Related work
Graph pooling strategies can be divided into two types: We can either use fixed pooling methods, usually based on graph topology, or use learned pooling methods. We concentrate on comparisons with learned pooling methods, since these appear to outperform fixed pooling methods.
Ying et al. (2018) were the first to propose a learned pooling layer. DiffPool learns to soft-assign each node to a fixed number of clusters based on their features. DiffPool works well, but suffers from three disadvantages: (a) The number of clusters has to be chosen in advance, which might cause performance issues when used on datasets with different graph sizes. (b) Since cluster assignment is based only on node features, nodes are assigned to the same cluster based on their features, ignoring distances. (c) The cluster assignment matrix is dense, and in $\mathbb{R}^{n \times k}$, where $n$ is the number of nodes and $k$ is the number of clusters. Since $k$ is usually chosen according to the total number of nodes, the cluster assignment matrix scales quadratically with the number of nodes, $\mathcal{O}(kn) = \mathcal{O}(n^2)$. DiffPool also needs several auxiliary objectives (link prediction, node feature regularization, cluster assignment entropy regularization) to train well. In addition, the density of the assignment matrix makes integration into the usually sparse GNNs difficult.
Graph U-Net, introduced by Gao & Ji (2018), uses a simple top-k choice of nodes for their gPool layer, learning a node score and dropping all but the top nodes. Cangea et al. (2018) later applied this to graph classification. While this approach is both sparse and variable in graph size, its node choice is dependent on global state. This introduces two new issues: (a) Adding nodes to a graph can change the pooling result of the whole graph. (b) Whole areas of a graph might see no node chosen, which causes loss of information.
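To make issue (a) concrete, the following is a minimal sketch of global top-k node selection in plain Python. It is not the gPool implementation: the function name `top_k_nodes`, the hand-written scores, and the rounding rule (ceiling of ratio times node count) are assumptions made for illustration only.

```python
import math


def top_k_nodes(node_scores, ratio=0.5):
    """Keep the ceil(ratio * n) highest-scoring nodes of the whole graph."""
    k = max(1, math.ceil(ratio * len(node_scores)))
    ranked = sorted(node_scores, key=node_scores.get, reverse=True)
    return set(ranked[:k])


# Hypothetical learned scores for a four-node graph.
small = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
kept_small = top_k_nodes(small)

# Two high-scoring nodes are added elsewhere in the graph.
large = dict(small, e=0.95, f=0.85)
kept_large = top_k_nodes(large)

print(kept_small)
print(kept_large)
```

In this sketch node `b` is kept in the small graph but dropped in the larger one, although neither its score nor its neighbourhood changed: the selection depends on the ranking over the whole graph.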
Lee et al. (2019) introduced SAGPool. A variant of TopKPool, SAGPool no longer uses only node features to compute node scores but uses graph convolutions to take neighbouring node features into account. While their method improves TopKPool qualitatively, the disadvantages remain.
For our work, we consider a graph $G = (V, E)$, where each node $v_i \in V$ has features $n_i$. Edges are represented as directed pairs of nodes without weights or features.
3.1 Edge contraction
We base our pooling operation on edge contractions. Contracting the edge $e = \{v_i, v_j\}$ introduces the new node $v_e$ and new edges such that $v_e$ is adjacent to all nodes that $v_i$ or $v_j$ has been adjacent to. $v_i$, $v_j$, and all their edges are deleted from the graph. Since edge contractions are commutative, we can also define the contraction of an edge set $E' = \{e_1, \ldots, e_m\}$. By constructing the set such that no two edges are incident to the same node, we can simply apply the naive notion of single-edge contraction multiple times.
Intuitively, we choose a single edge to contract by merging its nodes. This new node is then connected to all nodes the merged nodes had been connected to. We repeat this procedure multiple times, taking care not to include a newly-merged node into it.
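A single contraction can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the implementation used in our experiments: the graph is treated as undirected and stored as a dictionary of neighbour sets, and the helper name `contract_edge` and the label of the merged node are hypothetical.

```python
def contract_edge(adj, u, v, merged):
    """Contract the edge {u, v} of an undirected adjacency-set graph.

    `adj` maps each node to the set of its neighbours and `merged` is the
    label of the new node. A new adjacency dictionary is returned.
    """
    if v not in adj[u]:
        raise ValueError("u and v are not adjacent")
    # The merged node is adjacent to everything u or v was adjacent to.
    new_neighbours = (adj[u] | adj[v]) - {u, v}
    result = {}
    for node, neighbours in adj.items():
        if node in (u, v):
            continue  # both endpoints are deleted from the graph
        kept = neighbours - {u, v}
        if neighbours & {u, v}:
            kept = kept | {merged}  # redirect edges to the merged node
        result[node] = kept
    result[merged] = new_neighbours
    return result


# Path graph a - b - c - d; contracting {b, c} yields a - bc - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
contracted = contract_edge(path, "b", "c", "bc")
print(contracted)
```

Because the function returns a new dictionary, contractions of edges that share no node can be applied one after another in any order with the same result.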
3.2 Choosing edges
Given the preconditions mentioned above, we naively choose edges by computing a score for each edge, then iteratively contracting the highest-scoring edge which does not have a newly-merged node incident.
In our procedure, we compute a raw score for each edge as a simple linear combination of the concatenated node features. For an edge from node $i$ to node $j$, we compute the raw score as

$$ r_{ij} = W \, (n_i \,\Vert\, n_j) + b \tag{1} $$

where $n_i$ and $n_j$ are the node features and $W$ and $b$ are learned parameters.
To compute the final score for an edge, we employ a local softmax normalization over all edges of a node. (We also experimented with a simple gating function, but found softmax normalization to perform better.) We modify the final score such that the mean of the score range lies at $1$. Later on, this enables us to include the score in the unpooling procedure without issues due to numerical stability. We also found this to lead to better performance in the graph classification task, which we believe is because of better gradient flow. The final score then becomes:

$$ s_{ij} = 0.5 + \mathrm{softmax}_{r_{*j}}(r_{ij}) $$

where $r_{ij}$ is the raw score of the edge from node $i$ to node $j$ and the softmax runs over all edges ending in node $j$.
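The score computation can be sketched in plain Python as follows. This is a hedged illustration rather than our implementation: it assumes that the softmax is taken over all edges sharing the same target node, and that an offset of 0.5 is added so that scores fall into the range 0.5 to 1.5. The function names, the toy features, and the fixed weights standing in for learned parameters are all hypothetical.

```python
import math


def raw_scores(edges, features, weight, bias):
    """Linear score of the concatenated features of each directed edge (i, j)."""
    scores = {}
    for i, j in edges:
        concat = features[i] + features[j]  # list concatenation
        scores[(i, j)] = sum(w * x for w, x in zip(weight, concat)) + bias
    return scores


def final_scores(raw, offset=0.5):
    """Softmax over edges with the same target node, shifted by `offset` (assumed 0.5)."""
    by_target = {}
    for (_, j), r in raw.items():
        by_target.setdefault(j, []).append(r)
    result = {}
    for (i, j), r in raw.items():
        group = by_target[j]
        top = max(group)  # subtract the maximum for numerical stability
        denom = sum(math.exp(x - top) for x in group)
        result[(i, j)] = offset + math.exp(r - top) / denom
    return result


# Toy graph with one feature per node and fixed stand-in parameters.
features = {0: [1.0], 1: [1.0], 2: [0.0]}
edges = [(0, 2), (1, 2), (2, 0)]
raw = raw_scores(edges, features, weight=[0.5, 0.5], bias=0.0)
scores = final_scores(raw)
print(scores)
```

Node 2 has two incoming edges with equal raw scores, so each receives a softmax weight of one half; node 0 has a single incoming edge, which therefore receives the maximum possible score.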
Given the edge scores, we now iteratively contract edges according to the scores, ignoring those which have a newly-merged node incident. An illustration of the process is depicted in Fig. 2.
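The greedy selection can be sketched as below. This is a minimal illustration, not our implementation: the function name `choose_edges` and the example scores are hypothetical, and ties are broken by insertion order because Python's sort is stable.

```python
def choose_edges(scores):
    """Greedily pick the highest-scoring edges such that no node is used twice."""
    used = set()
    chosen = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (i, j), _ in ranked:
        if i in used or j in used:
            continue  # one endpoint already belongs to a newly-merged node
        chosen.append((i, j))
        used.update((i, j))
    return chosen


# Path graph 0 - 1 - 2 - 3 - 4 with hypothetical final edge scores.
scores = {(0, 1): 0.9, (1, 2): 1.4, (2, 3): 1.1, (3, 4): 1.2}
chosen = choose_edges(scores)
remaining_nodes = 5 - len(chosen)  # every contraction removes one node
print(chosen, remaining_nodes)
```

In this example edges (1, 2) and (3, 4) are contracted, while node 0 stays unmerged because its only neighbour is already taken. The five nodes are reduced to three.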
Note that this will always pool roughly 50% of the total nodes. Contrary to DiffPool and TopKPool, this ratio cannot be changed.
3.3 Computing new node features
There are many strategies for combining the features of pairs of nodes. In particular, we are not restricted to symmetric functions since the edges chosen have a specific direction. Nonetheless, we found that taking the sum of the node features works well.
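To enable the gradient to flow into the scores, we use gating and multiply the combined node features by the edge score:

$$ \hat{n}_{ij} = s_{ij} \, (n_i + n_j) $$

where $s_{ij}$ is the final score of the contracted edge and $n_i$ and $n_j$ are the features of its two nodes.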
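As a sketch of the feature combination, the following plain-Python function sums the features of each chosen pair and gates the sum with the edge score. It also returns the mapping from old nodes to pooled nodes. Carrying unmerged nodes over unchanged is an assumption of this sketch, and the names `pool_features` and `cluster` are hypothetical.

```python
def pool_features(features, chosen, scores):
    """Return the pooled feature list and the old-node to pooled-node mapping."""
    pooled, cluster = [], {}
    for i, j in chosen:
        s = scores[(i, j)]
        cluster[i] = cluster[j] = len(pooled)
        # Gated sum: edge score times the sum of both nodes' features.
        pooled.append([s * (a + b) for a, b in zip(features[i], features[j])])
    for node in features:
        if node not in cluster:
            # Assumption: nodes that were not merged keep their features.
            cluster[node] = len(pooled)
            pooled.append(list(features[node]))
    return pooled, cluster


features = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
pooled, cluster = pool_features(features, chosen=[(0, 1)], scores={(0, 1): 1.5})
print(pooled, cluster)
```

Since the score enters the pooled features multiplicatively, a framework with automatic differentiation would propagate gradients from the pooled features back into the score parameters.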
3.4 Computational performance
Given our procedure above, we immediately see that EdgePool can operate on sparse representations. When doing so, both runtime and memory scale linearly in the number of edges. This particularly avoids the scaling issues of DiffPool’s cluster assignment matrix.
Additionally, EdgePool is locally independent: As long as the node scores of two nodes $v_i$ and $v_j$ and of their neighbours do not change (by changing nodes within the receptive fields), the choice of edge will not change. Accordingly, EdgePool does not have to be computed for the whole graph at once. If the graph changes, only the pooling local to the changed areas needs to be updated.
3.5 Integrating edge features
EdgePool can be updated to take edge features $f_{ij}$ of edge $e_{ij}$ into account. To do so, we have to include them in the raw score computation (Eq. (1)). The simplest approach is to concatenate them:

$$ r_{ij} = W \, (n_i \,\Vert\, n_j \,\Vert\, f_{ij}) + b $$
Additionally, we will likely have to change the procedure to compute new node features; we propose using a weighted linear combination of both nodes’ features, the features of the chosen edge, and the features of the reverse edge if it exists.
Lastly, we need a procedure to combine the edge features of edges that ended at both merged nodes and will therefore be merged. We believe a simple sum should work well here, too. However, we have not conducted experiments on this.
3.6 Unpooling EdgePool
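To use pooling in the context of node classification, an unpooling operation is necessary. To do so, each EdgePool layer also emits the mapping of each of the previous graph’s nodes to the newly-pooled graph’s nodes. When unpooling, we then create an inverse mapping of pooled nodes to unpooled nodes. Since we assign each node to exactly one merged node, this mapping can be chained through many pooling layers. Additionally, we divide the unpooled node features by the corresponding edge score:

$$ n'_i = n'_j = \frac{\hat{n}_{ij}}{s_{ij}} $$

where $\hat{n}_{ij}$ are the features of the pooled node at unpooling time, $s_{ij}$ is the score of the edge that was contracted to create it, and $n'_i$ and $n'_j$ are the features assigned to the two unpooled nodes.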
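Unpooling can be sketched as the inverse lookup below. This is an illustration under assumptions, not our implementation: `cluster` is the old-node to pooled-node mapping emitted by the pooling step, `cluster_scores` holds one score per pooled node, and unmerged nodes are assumed to carry a score of 1. All names are hypothetical.

```python
def unpool_features(pooled, cluster, cluster_scores):
    """Copy each pooled node's features back to its members, divided by the score."""
    return {
        node: [x / cluster_scores[c] for x in pooled[c]]
        for node, c in cluster.items()
    }


# Pooled graph with two nodes: node 0 merges old nodes 0 and 1 (score 1.5),
# node 1 is the unmerged old node 2 (assumed score 1.0).
pooled = [[6.0, 9.0], [5.0, 6.0]]
cluster = {0: 0, 1: 0, 2: 1}
unpooled = unpool_features(pooled, cluster, cluster_scores=[1.5, 1.0])
print(unpooled)
```

Both members of a merged pair receive the same unpooled features. Dividing by the score undoes the gating, which is why scores close to zero would be numerically problematic.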
We design our experiments to answer three questions:
- Q1: Does EdgePool outperform alternative pooling methods?
- Q2: Can EdgePool be used as a plug-and-play addition for any GNN?
- Q3: Can EdgePool be used for node classification?
4.1 General Setup
We evaluate our models on multiple graph and node classification datasets, and share most of the training procedures between all models.
While there are many graph classification datasets available, most of these are small (in both nodes per graph and total graphs). As an example, the popular enzymes dataset contains only 600 graphs, making 10-fold cross-validation (at a test set size of 60) very difficult.
We conduct 10-fold cross-validation for all datasets and report mean and standard deviation. We choose all folds at random, eschewing the default planetoid split.
Graph classification datasets
For graph classification, we evaluate on four datasets from the collection by Kersting et al. (2016). At 1113 graphs, proteins (Borgwardt et al., 2005) is the smallest, but has been used extensively as a benchmark dataset. The task is to predict whether a given protein is an enzyme. The two reddit-based datasets (Yanardag & Vishwanathan, 2015) depict user responses in an online discussion. The task is to predict the subreddit, out of two (reddit-binary) or eleven (reddit-multi-12k) choices. Lastly, each collab (ibid.) graph models scientific collaborations of one researcher. The task is to classify which of three fields the researcher belongs to. Neither collab nor the two reddit-based datasets have any features.
Node classification datasets
We also evaluate EdgePool on five semi-supervised node classification datasets. cora (Namata et al., 2012), citeseer, and pubmed (Sen et al., 2008) model citation networks. In these, nodes are documents and edges model citations. The goal is to classify the subfield of each of the documents. The photo and computer datasets (Shchur et al., 2018) are part of the Amazon co-purchasing graph. Nodes are products and edges model co-purchases between products. The goal is to predict the product category.
Each of these datasets is a semi-supervised node classification task from bag-of-word features. We use 20 nodes per class as training data and 30 nodes per class as test data. Every other node is unlabelled.
While we use different models, several setup parameters have been chosen identically between all models. Each is trained for a total of 200 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of , which is halved every 50 epochs. 128 graphs are batched together at each step by treating them as a single unconnected graph. We use 128 channels except for proteins and the node classification datasets, where we used 64. This setup follows Ying et al. (2018).
All models use both dropout and batch normalization (Ioffe & Szegedy, 2015). We found batch normalization to suffer greatly when evaluated using population statistics and instead use mini-batch statistics even during testing.
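Two parts of this setup can be sketched in plain Python: batching several graphs into one unconnected graph by offsetting node indices, and the step-wise halving of the learning rate. This is an illustration only. The function names are hypothetical and the base learning rate is left as a parameter, so the value used below is a placeholder rather than the one used in our experiments.

```python
def batch_graphs(graphs):
    """Merge graphs given as (num_nodes, edge_list) into one unconnected graph.

    Returns the merged edge list and a batch vector that maps every node
    to the index of the graph it came from.
    """
    edges, batch, offset = [], [], 0
    for graph_id, (num_nodes, edge_list) in enumerate(graphs):
        edges.extend((i + offset, j + offset) for i, j in edge_list)
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes
    return edges, batch


def learning_rate(epoch, base_lr, step=50):
    """Learning rate that is halved every `step` epochs."""
    return base_lr * 0.5 ** (epoch // step)


graphs = [(3, [(0, 1), (1, 2)]), (2, [(0, 1)])]
edges, batch = batch_graphs(graphs)
print(edges, batch)

# Placeholder base rate of 1.0 to show the relative schedule over 200 epochs.
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in (0, 49, 50, 100, 199)]
print(schedule)
```

Because no edge crosses between the offset index ranges, message passing and pooling on the merged graph never mix nodes of different graphs, and the batch vector allows a per-graph global pooling afterwards.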
We also found using edge score dropout significantly increased EdgePool’s performance, and set every edge score to with a chance of .
4.3 Experimental design
To answer the questions we have posed, we design three different experiments.
4.3.1 Q1: Does EdgePool outperform alternative pooling approaches?
To evaluate this, we use the same architecture as used by Ying et al. (2018) for DiffPool: The model has three SAGEConv blocks (Hamilton et al., 2017) whose outputs are globally mean-pooled and concatenated. Final classification occurs after two fully-connected layers. The base model does not pool nodes; every other model pools after every block. Note that DiffPool uses a siamese architecture, using separate SAGEConv blocks to compute cluster assignments. We restrict DiffPool to a maximum of 750 nodes per graph and set TopKPool’s pool ratio to 0.5 to remain comparable to EdgePool.
Additionally, we only use the cross-entropy loss to train the model. To ensure a fair comparison, we also do this for DiffPool, which originally used three additional auxiliary losses and tasks to stabilize training and precomputed additional features.
4.3.2 Q2: Can EdgePool be integrated in existing architectures?
To evaluate whether EdgePool can be integrated into pre-existing architectures, we follow the model configuration from pytorch-geometric’s benchmarks (Fey & Lenssen, 2019). Specifically, we use a total of seven convolutional layers, followed by a global pooling layer and two fully-connected layers. If pooling is used, it is added after every second convolutional layer (i.e. there are three pooling layers).
The convolutional layers we evaluate this on are GCN (Kipf & Welling, 2017), GIN and GIN0 (Xu et al., 2019), and GraphSAGE (Hamilton et al., 2017), both with and without accumulating intermediate results (SAGE nacc). Additionally, we construct a model using node-independent MLPs, in which only pooling might lead to communication between nodes.
4.3.3 Q3: Can EdgePool be used for node classification?
On node classification tasks, we evaluate a simple architecture, varying the convolutional layers. We evaluate GCN, GIN and GIN0, and GAT (Veličković et al., 2017). Again, we also evaluate an MLP layer. As with Q2, we use seven convolutional layers. We pool after the second and fourth and unpool after the fifth and seventh, with shortcuts between the poolings. The concatenated features are then used by a two-layer MLP to predict each node’s class.
5 Results and discussion
We implemented the models using PyTorch (Paszke et al., 2017) and in particular the pytorch-geometric library (Fey & Lenssen, 2019). Experiments were conducted on several Geforce 1080Ti GPUs in parallel, leveraging Singularity containers (Kurtzer et al., 2017) for reproducibility.
5.1 EdgePool vs. alternative pooling approaches
Table 1 shows mean accuracy and standard deviation for graph classification tasks. As can be seen, EdgePool consistently improves performance over the non-pooling models and TopKPool. Discounting proteins due to close performance, it outperforms all other pooling approaches on two tasks, and is only outperformed by DiffPool on one task.
This answers Q1: EdgePool consistently outperforms all pooling methods but DiffPool. While DiffPool might perform better on some graphs, EdgePool scales far better and can be used on large graph sizes.
5.2 EdgePool in existing architectures
Table 2 shows comparative results for different benchmark models with and without EdgePool. On a large majority of GNN/dataset combinations, EdgePool increases performance, by an average of almost pp. GIN and GIN0 profit the least (mean improvement of ), while GraphSAGE profits the most ( pp).
Interestingly, we can see that EdgePool allows even the MLP model to perform fairly well. This model can only rely on pooling to gain information on the neighbourhood. Nonetheless, it performs competitively on proteins and collab.
Unfortunately, the performance increases of EdgePool are not consistent over datasets and models. This makes it impossible to make a specific recommendation on situations in which one should or should not include EdgePool in the model.
However, we can still answer Q2: It is easily possible to integrate EdgePool in existing architectures. Doing so will lead to an estimated improvement of about pp, but might for some combinations of model and dataset decrease performance.
5.3 EdgePool for node classification
As Table 3 shows, EdgePool can be integrated in node classification architectures and improves performance for 21 of 25 dataset/model combinations.
In particular, note the increase in performance for the MLP. In several of these tasks, an MLP augmented with EdgePool shows competitive performance to GNN algorithms. For GNN algorithms, EdgePool improves performance by an average of pp, performing worst on pubmed (no improvement on average) and for GCN (decrease by pp). It performs best for GIN and GIN0, at pp and pp improvements respectively.
This answers Q3: EdgePool will, in most cases, improve performance for node classification. The expected improvement is an average of pp.
5.4 Visual inspection
Both Fig. 1 and Fig. 3 show examples of the pooling resulting from using EdgePool. In particular, they show that EdgePool keeps the linearity of the original protein even after pooling. Fig. 2(a) shows how unconnected paths (orange arrows) are not pooled, keeping the original graph structure visible even after pooling. However, as Fig. 2(b) shows, there are situations in which EdgePool causes node poolings which are counter-intuitive to humans.
We have proposed EdgePool, a hard pooling method for GNN, based on edge contraction.
This pooling is both localized (and therefore independent of non-local graph changes) and sparse (and therefore computationally efficient even on large graphs).
Except for a single pooling procedure on a single dataset, EdgePool outperforms all previously proposed pooling approaches. We also show that EdgePool can be integrated into a large number of GNN architectures and usually improves performance on both node and graph classification tasks without any adaptations to training or architecture.
Besides the obvious use of EdgePool in improving existing GNN architectures, we hope it will serve as a stepping stone towards methods that learn how to modify graph structures. We believe this will lead towards methods that no longer operate on nodes but on abstracted groups of nodes.
- Borgwardt et al. (2005) Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, June 2005. ISSN 1367-4803. doi: 10.1093/bioinformatics/bti1007.
- Cangea et al. (2018) Cangea, C., Veličković, P., Jovanović, N., Kipf, T., and Liò, P. Towards Sparse Hierarchical Graph Classifiers. In NeurIPS 2018 Workshop on Relational Representation Learning, November 2018.
- Fey & Lenssen (2019) Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- Gao & Ji (2018) Gao, H. and Ji, S. Graph U-Net. September 2018. URL https://openreview.net/forum?id=HJePRoAct7.
- Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.
- Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs, stat], December 2014.
- Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR 2017.
- Kurtzer et al. (2017) Kurtzer, G. M., Sochat, V., and Bauer, M. W. Singularity: Scientific containers for mobility of compute. Plos One, 12(5):e0177459, 2017. ISSN 1932-6203. doi: 10.1371/journal.pone.0177459.
- Lee et al. (2019) Lee, J., Lee, I., and Kang, J. Self-Attention Graph Pooling. arXiv:1904.08082 [cs, stat], April 2019.
- Namata et al. (2012) Namata, G., London, B., Getoor, L., and Huang, B. Query-driven Active Surveying for Collective Classification. 10th International Workshop on Mining and Learning with Graphs, pp. 8, 2012.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective Classification in Network Data. AI Magazine, 29(3):93–93, September 2008. ISSN 2371-9621. doi: 10.1609/aimag.v29i3.2157.
- Shchur et al. (2018) Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of Graph Neural Network Evaluation. arXiv:1811.05868 [cs, stat], November 2018.
- Veličković et al. (2017) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks. October 2017.
- Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
- Yanardag & Vishwanathan (2015) Yanardag, P. and Vishwanathan, S. V. N. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, October 2015. ISBN 978-1-4503-3664-2. doi: 10.1145/2783258.2783417.
- Ying et al. (2018) Ying, R., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems 31, pp. 4800–4810. 2018.