1 Introduction
In recent years, a fast-growing field of applying deep learning to graphs has emerged. Many of these works are inspired by generalizing CNNs to the non-Euclidean and sparsely connected data that graphs represent. But while a multitude of different GCNs have been proposed, the number of proposed pooling layers remains small.
Yet intelligent pooling on graphs holds significant promise: It might both identify clusters (feature- or structure-based) and reduce computational requirements by reducing the number of nodes. Together, these promise to abstract from flat nodes to hierarchical sets of nodes. They are also a stepping stone towards enabling GNNs to modify graph structures instead of only node features.
We propose a new pooling layer based on edge contractions (EdgePool, see Fig. 1), which aims to correct weaknesses in previously proposed learned pooling layers. We do this by viewing the task not as choosing nodes but as choosing edges and pooling the connected nodes. This immediately and naturally takes the graph structure into account and ensures that we never drop nodes completely.
The main advantages of our proposed EdgePool layer are:

EdgePool performs better than other pooling methods.

EdgePool can be integrated in existing graph classification architectures.

EdgePool can be used for node classification and improves performance.
2 Related work
Graph pooling strategies can be divided into two types: We can either use fixed pooling methods, usually based on graph topology, or use learned pooling methods. We concentrate on comparisons with learned pooling methods, since these appear to outperform fixed pooling methods.
DiffPool
Ying et al. (2018) were the first to propose a learned pooling layer. DiffPool learns to soft-assign each node to a fixed number of clusters based on their features. DiffPool works well, but suffers from three disadvantages: (a) The number of clusters has to be chosen in advance, which might cause performance issues when used on datasets with different graph sizes. (b) Since cluster assignment is based only on node features, nodes are assigned to the same cluster based on their features, ignoring distances. (c) The cluster assignment matrix is dense, and in $\mathbb{R}^{n \times k}$, where $n$ is the number of nodes and $k$ is the number of clusters. Since $k$ is usually chosen according to the total number of nodes, the cluster assignment matrix scales quadratically with the number of nodes, $\mathcal{O}(kn) = \mathcal{O}(n^2)$. DiffPool also needs several auxiliary objectives (link prediction, node feature regularization, cluster assignment entropy regularization) to train well. In addition, the density makes integration into usually sparse GNNs difficult.
TopKPool
Graph U-Net, introduced by Gao & Ji (2018), uses a simple top-$k$ choice of nodes for its gPool layer, learning a node score and dropping all but the top $k$ nodes. Cangea et al. (2018) later applied this to graph classification. While this approach is both sparse and variable in graph size, its node choice is dependent on global state. This introduces two new issues: (a) Adding nodes to a graph can change the pooling result of the whole graph. (b) Whole areas of a graph might see no node chosen, which causes loss of information.
SAGPool
Lee et al. (2019) introduced SAGPool. A variant of TopKPool, SAGPool no longer uses only node features to compute node scores but uses graph convolutions to take neighbouring node features into account. While their method improves TopKPool qualitatively, the disadvantages remain.
3 EdgePool
For our work, we consider a graph $G = (V, E)$, where each of the $n$ nodes has $d$ features, $X \in \mathbb{R}^{n \times d}$. Edges are represented as directed pairs of nodes without weights or features.
3.1 Edge contraction
We base our pooling operation on edge contractions. Contracting the edge $e = \{v_i, v_j\}$ introduces the new node $v_e$ and new edges such that $v_e$ is adjacent to all nodes $v_i$ or $v_j$ has been adjacent to. $v_i$, $v_j$, and all their edges are deleted from the graph. Since edge contractions are commutative, we can also define an edge set contraction. By constructing the set $E'$ of edges to contract such that no two edges are incident to the same node, we can simply apply the naive notion of single-edge contraction multiple times.
Intuitively, we choose a single edge to contract by merging its nodes. This new node is then connected to all nodes the merged nodes had been connected to. We repeat this procedure multiple times, taking care not to include a newly-merged node in it.
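The contraction of a node-disjoint edge set can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the graph is treated as an undirected simple graph (the paper stores directed pairs), and the function name `contract_edges` and the `cluster` mapping are hypothetical names introduced here.

```python
def contract_edges(num_nodes, edges, chosen):
    """Contract a set of node-disjoint edges in an undirected graph.

    num_nodes: number of nodes, labelled 0..num_nodes-1
    edges:     iterable of (u, v) pairs
    chosen:    edges to contract; no two may share a node
    Returns (new_num_nodes, new_edges, cluster), where cluster[v] is the
    index of the merged node that original node v ends up in.
    """
    cluster = [None] * num_nodes
    next_id = 0
    for u, v in chosen:
        if cluster[u] is not None or cluster[v] is not None:
            raise ValueError("chosen edges must not share a node")
        cluster[u] = cluster[v] = next_id
        next_id += 1
    # Nodes that are not part of any chosen edge are kept as they are.
    for v in range(num_nodes):
        if cluster[v] is None:
            cluster[v] = next_id
            next_id += 1
    # Re-wire every edge to the merged nodes, dropping self-loops and duplicates.
    new_edges = set()
    for u, v in edges:
        a, b = cluster[u], cluster[v]
        if a != b:
            new_edges.add((min(a, b), max(a, b)))
    return next_id, sorted(new_edges), cluster


# Path graph 0-1-2-3-4: contract the edges (0, 1) and (2, 3).
n, e, c = contract_edges(5, [(0, 1), (1, 2), (2, 3), (3, 4)], [(0, 1), (2, 3)])
```

In this toy case the five-node path becomes a three-node path, and `c` records which merged node each original node belongs to.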
3.2 Choosing edges
Given the preconditions mentioned above, we naively choose edges by computing a score for each edge, then iteratively contracting the highest-scoring edge which does not have a newly-merged node incident.
In our procedure, we compute raw scores for each edge as a simple linear combination of the concatenated node features. For an edge from node $i$ to node $j$, we compute the raw score $r$ as
(1) $r(e_{ij}) = W (n_i \,\|\, n_j) + b$
where $n_i$ and $n_j$ are the node features, $\|$ denotes concatenation, and $W$ and $b$ are learned parameters.
To compute the final score for an edge, we employ a local softmax normalization over all edges of a node (we experimented with a simple gating function, but found softmax normalization to perform better). We modify the final score such that the mean of the score range lies at $1$. Later on, this enables us to include the score in the unpooling procedure without issues due to numerical stability. We also found this to lead to better performance in the graph classification task, which we believe is because of better gradient flow. The final score then becomes:
(2) $s_{ij} = 0.5 + \mathrm{softmax}_{r_{*j}}(r_{ij})$
Given the edge scores, we now iteratively contract edges according to the scores, ignoring those which have a newly-merged node incident. An illustration of the process is depicted in Fig. 2.
Note that this will always pool roughly 50% of the total nodes. Contrary to DiffPool and TopKPool, this ratio cannot be changed.
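The scoring and greedy selection described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the softmax is taken over all edges sharing the same target node and that the constant offset is 0.5 (which moves the midpoint of the softmax range from 0.5 to 1). The names `edge_scores` and `choose_edges`, and the toy weights, are hypothetical.

```python
import math


def edge_scores(x, edges, w, b, offset=0.5):
    """Linear raw score per directed edge (i, j), softmax over edges sharing
    the target node j, then a constant offset (assumed to be 0.5 here)."""
    raw = []
    for i, j in edges:
        concat = x[i] + x[j]  # list concatenation of the two feature vectors
        raw.append(sum(wk * ck for wk, ck in zip(w, concat)) + b)
    # Per-target maximum, subtracted before exp() for numerical stability.
    top = {}
    for (_, j), r in zip(edges, raw):
        top[j] = max(top.get(j, r), r)
    exp = [math.exp(r - top[j]) for (_, j), r in zip(edges, raw)]
    denom = {}
    for (_, j), e in zip(edges, exp):
        denom[j] = denom.get(j, 0.0) + e
    return [offset + e / denom[j] for (_, j), e in zip(edges, exp)]


def choose_edges(num_nodes, edges, scores):
    """Greedily pick the highest-scoring edges whose endpoints are both unmerged."""
    merged = [False] * num_nodes
    chosen = []
    order = sorted(range(len(edges)), key=lambda k: scores[k], reverse=True)
    for k in order:
        i, j = edges[k]
        if i != j and not merged[i] and not merged[j]:
            merged[i] = merged[j] = True
            chosen.append((i, j))
    return chosen


# Toy graph with one feature per node and three directed edges.
x = [[1.0], [0.0], [2.0]]
edges = [(0, 1), (2, 1), (1, 2)]
scores = edge_scores(x, edges, w=[1.0, 0.0], b=0.0)
chosen = choose_edges(3, edges, scores)
```

Here the edge (1, 2) is the only edge into node 2, so it receives the maximal score and is contracted first; the two remaining edges both touch a merged node and are skipped, leaving node 0 unmerged.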
3.3 Computing new node features
There are many strategies for combining the features of pairs of nodes. In particular, we are not restricted to symmetric functions since the edges chosen have a specific direction. Nonetheless, we found that taking the sum of the node features works well.
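To enable the gradient to flow into the scores, we use gating and multiply the combined node features by the edge score:
(3) $\hat{n}_{ij} = s_{ij} (n_i + n_j)$
where $n_i$ and $n_j$ are the features of the two merged nodes and $s_{ij}$ is the final score of the contracted edge.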
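As a small worked example of the gated sum, the sketch below pools node features given a node-to-cluster mapping. It assumes that nodes left unmerged are assigned a score of 1 so their features pass through unchanged; that choice, and the names `pool_features`, `cluster` and `cluster_score`, are assumptions made for illustration.

```python
def pool_features(x, cluster, cluster_score):
    """Sum the features of all nodes in a cluster, then gate by the cluster's score."""
    num_clusters = len(cluster_score)
    dim = len(x[0])
    pooled = [[0.0] * dim for _ in range(num_clusters)]
    for v, c in enumerate(cluster):
        for d in range(dim):
            pooled[c][d] += x[v][d]
    return [[cluster_score[c] * val for val in pooled[c]] for c in range(num_clusters)]


# Nodes 0 and 1 are merged via an edge with score 1.5; node 2 is kept (score 1).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = pool_features(x, cluster=[0, 0, 1], cluster_score=[1.5, 1.0])
```

The merged node ends up with 1.5 times the summed features of nodes 0 and 1, while the unmerged node keeps its original features.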
3.4 Computational performance
Given our procedure above, we immediately see that EdgePool can operate on sparse representations. When doing so, both runtime and memory scale linearly in the number of edges. This particularly avoids the scaling issues of DiffPool’s cluster assignment matrix.
Additionally, EdgePool is locally independent: As long as the node scores of two nodes $v_i$ and $v_j$ and of their neighbours do not change (by changing nodes within the receptive fields), the choice of edge will not change. Accordingly, EdgePool does not have to be computed for the whole graph at once. If the graph changes, only the pooling local to the changed areas needs to be updated.
3.5 Integrating edge features
EdgePool can be updated to take edge features $f_{ij}$ of edge $e_{ij}$ into account. To do so, we have to include them in the raw score computation (Eq. (1)). The simplest approach is to concatenate them:
(4) $r(e_{ij}) = W (n_i \,\|\, n_j \,\|\, f_{ij}) + b$
Additionally, we will likely have to change the procedure to compute new node features; we propose using a weighted linear combination of both nodes’ features, the features of the chosen edge, and the features of the reverse edge if it exists.
Lastly, we need a procedure to combine the edge features of edges that ended at both merged nodes and will therefore be merged. We believe a simple sum should work well here, too. However, we have not conducted experiments on this.
3.6 Unpooling EdgePool
To use pooling in the context of node classification, an unpooling operation is necessary. To do so, each EdgePool layer also emits the mapping of each of the previous graph’s nodes to the newly-pooled graph’s nodes. When unpooling, we then create an inverse mapping of pooled nodes to unpooled nodes. Since we assign each node to exactly one merged node, this mapping can be chained through many pooling layers. Additionally, we divide the unpooled node features by the corresponding edge score:
(5) $\hat{n}_i = \hat{n}_j = \hat{n}_{ij} / s_{ij}$
where $\hat{n}_{ij}$ are the features of the pooled node and $s_{ij}$ is the score of the edge that was contracted to create it.
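The unpooling step can be illustrated with the same kind of node-to-cluster mapping. The sketch below copies each pooled node's features back to its original nodes and divides by the score, and it shows how two such mappings compose across pooling layers. The names `unpool_features` and `chain` are hypothetical, and unmerged nodes are again assumed to carry a score of 1.

```python
def unpool_features(pooled_x, cluster, cluster_score):
    """Copy each pooled node's features back to its original nodes, undoing the score gating."""
    return [[val / cluster_score[c] for val in pooled_x[c]] for c in cluster]


def chain(first, second):
    """Compose two pooling maps: original node -> node after both poolings."""
    return [second[c] for c in first]


cluster = [0, 0, 1]  # nodes 0 and 1 were merged, node 2 was kept
restored = unpool_features([[6.0, 9.0], [5.0, 6.0]], cluster, [1.5, 1.0])
two_level = chain([0, 0, 1, 1, 2], [0, 0, 1])
```

Both original nodes of a merged pair receive the same unpooled features, which is why shortcut connections from before the pooling are useful for telling them apart again.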
4 Experiments
We design our experiments to answer three questions:
Q1: Does EdgePool outperform alternative pooling methods?
Q2: Can EdgePool be used as a plug-and-play addition for any GNN?
Q3: Can EdgePool be used for node classification?
4.1 General Setup
We evaluate our models on multiple graph and node classification datasets, and share most of the training procedures between all models.
4.1.1 Datasets
While there are many graph classification datasets available, most of these are small (in both nodes per graph and total graphs). As an example, the popular enzymes dataset contains only 600 graphs, making 10-fold cross-validation (at a test set size of 60) very difficult.
We conduct 10-fold cross-validation for all datasets and report mean and standard deviation. We choose all folds at random, eschewing the default planetoid split.
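Choosing folds at random can be sketched as below. The helper `random_folds` and the fixed seed are illustrative assumptions; the text does not specify how the folds were drawn beyond their being random.

```python
import random


def random_folds(num_items, k=10, seed=0):
    """Shuffle the item indices and deal them into k folds of near-equal size."""
    idx = list(range(num_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# With 600 graphs (the size quoted for enzymes), each test fold holds 60 graphs.
folds = random_folds(600)
fold_sizes = [len(f) for f in folds]
```

Each fold serves once as the test set while the remaining nine are used for training, and the ten resulting accuracies are summarized by their mean and standard deviation.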
Graph classification datasets
For graph classification, we evaluate on four datasets from the collection by Kersting et al. (2016). At 1113 graphs, proteins (Borgwardt et al., 2005) is the smallest, but has been used extensively as a benchmark dataset. The task is to predict whether a given protein is an enzyme. The two reddit-based datasets (Yanardag & Vishwanathan, 2015) depict user responses in an online discussion. The task is to predict the subreddit, out of two (reddit-binary) or eleven (reddit-multi-12k) choices. Lastly, each collab (ibid) graph models scientific collaborations of one researcher. The task is to classify which of three fields the researcher belongs to. Neither collab nor the two reddit-based datasets have any features.
Node classification datasets
We also evaluate EdgePool on five semi-supervised node classification datasets. cora (Namata et al., 2012), citeseer, and pubmed (Sen et al., 2008) model citation networks. In these, nodes are documents and edges model citations. The goal is to classify the subfield of each of the documents. The photo and computer datasets (Shchur et al., 2018) are part of the Amazon co-purchasing graph. Nodes are products and edges model co-purchases between products. The goal is to predict the product category.
Each of these datasets is a semi-supervised node classification task from bag-of-words features. We use 20 nodes per class as training data and 30 nodes per class as test data. Every other node is unlabelled.
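A per-class random split of this kind could look as follows. This is a sketch under assumptions: the function name `split_per_class`, the seed, and the synthetic labels in the usage lines are made up for illustration and are not taken from any of the datasets.

```python
import random


def split_per_class(labels, n_train=20, n_test=30, seed=0):
    """Pick n_train training and n_test test nodes per class, at random and disjoint."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for y in sorted(by_class):
        nodes = by_class[y]
        rng.shuffle(nodes)
        train += nodes[:n_train]
        test += nodes[n_train:n_train + n_test]
    return train, test


# Synthetic labels: 300 nodes spread evenly over 3 classes.
labels = [k % 3 for k in range(300)]
train, test = split_per_class(labels)
```

All nodes that appear in neither list are treated as unlabelled during training.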
4.2 Training
While we use different models, several setup parameters have been chosen identically between all models. Each is trained for a total of 200 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of , which is halved every 50 epochs. 128 graphs are batched together at each step by treating them as a single unconnected graph. We use 128 channels except for proteins and the node classification datasets, where we used 64. This setup follows Ying et al. (2018).
All models use both dropout and batch normalization (Ioffe & Szegedy, 2015). We found batch normalization to suffer greatly when evaluated using population statistics and instead use minibatch statistics even during testing.
We also found using edge score dropout significantly increased EdgePool’s performance, and set every edge score to with a chance of .
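Two parts of this setup are easy to make concrete: batching several graphs as one unconnected graph, and halving the learning rate every 50 epochs. The sketch below is illustrative only; `batch_graphs`, `step_decay` and the base learning rate of 1.0 used in the usage lines are hypothetical placeholders, not the values used in the experiments.

```python
def batch_graphs(graphs):
    """Merge graphs, given as (num_nodes, edges) pairs, into one unconnected graph.

    Returns (total_nodes, edges, batch), where batch[v] is the index of the
    graph that node v came from.
    """
    offset = 0
    all_edges, batch = [], []
    for g, (num_nodes, edges) in enumerate(graphs):
        # Shift node indices so that the graphs do not overlap.
        all_edges.extend((u + offset, v + offset) for u, v in edges)
        batch.extend([g] * num_nodes)
        offset += num_nodes
    return offset, all_edges, batch


def step_decay(base_lr, epoch, every=50, factor=0.5):
    """Learning rate after multiplying by `factor` once per `every` epochs."""
    return base_lr * factor ** (epoch // every)


total, merged_edges, batch = batch_graphs([(2, [(0, 1)]), (3, [(0, 1), (1, 2)])])
lrs = [step_decay(1.0, epoch) for epoch in (0, 49, 50, 199)]
```

Because no edges cross between the batched graphs, message passing and edge contraction never mix nodes from different graphs, and the `batch` vector is enough to pool each graph separately at the end.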
4.3 Experimental design
To answer the questions we have posed, we design three different experiments.
4.3.1 Q1: Does EdgePool outperform alternative pooling approaches?
To evaluate this, we use the same architecture as used by Ying et al. (2018) for DiffPool: The model has three SAGEConv blocks (Hamilton et al., 2017) whose outputs are globally mean-pooled and concatenated. Final classification occurs after two fully-connected layers. The base model does not pool nodes; every other model pools after every block. Note that DiffPool uses a siamese architecture, using separate SAGEConv blocks to compute cluster assignments. We restrict DiffPool to a maximum of 750 nodes per graph and set TopKPool’s pool ratio to 0.5 to remain comparable to EdgePool.
Additionally, we only use the cross-entropy loss to train the model. To ensure a fair comparison, we also do this for DiffPool, which originally used three additional auxiliary losses and tasks to stabilize training and precomputed additional features.
4.3.2 Q2: Can EdgePool be integrated in existing architectures?
To evaluate whether EdgePool can be integrated into pre-existing architectures, we follow the model configuration from pytorch-geometric’s benchmarks (Fey & Lenssen, 2019). Specifically, we use a total of seven convolutional layers, followed by a global pooling layer and two fully-connected layers. If pooling is used, it is added after every second convolutional layer (i.e. there are three pooling layers).
The convolutional layers we evaluate this on are GCN (Kipf & Welling, 2017), GIN and GIN0 (Xu et al., 2019), and GraphSAGE (Hamilton et al., 2017) both with and without accumulating intermediate results (SAGE nacc). Additionally, we construct a model using node-independent MLPs, in which only pooling might lead to communication between nodes.
4.3.3 Q3: Can EdgePool be used for node classification?
On node classification tasks, we evaluate a simple architecture, varying the convolutional layers. We evaluate GCN, GIN and GIN0, and GAT (Veličković et al., 2017). Again, we also evaluate an MLP layer. As with Q2, we use seven convolutional layers. We pool after the second and fourth and unpool after the fifth and seventh, with shortcuts between the poolings. The concatenated features are then used by a two-layer MLP to predict each node’s class.
5 Results and discussion
We implemented the models using PyTorch (Paszke et al., 2017) and in particular the pytorch-geometric library (Fey & Lenssen, 2019). Experiments were conducted on several GeForce 1080Ti GPUs in parallel, leveraging Singularity containers (Kurtzer et al., 2017) for reproducibility.
5.1 EdgePool vs. alternative pooling approaches
Table 1 shows mean accuracy and standard deviation for graph classification tasks. As can be seen, EdgePool consistently improves performance over the non-pooling models and TopKPool. Discounting proteins due to close performance, it outperforms all other pooling approaches on two tasks, and is only outperformed by DiffPool on one task.
This answers Q1: EdgePool consistently outperforms all pooling methods but DiffPool. While DiffPool might perform better on some graphs, EdgePool scales far better and can be used on large graph sizes.
proteins  rdtb  rdt12k  collab  

Base Model  
DiffPool [*]  
TopKPool  
SAGPool  
EdgePool 
5.2 EdgePool in existing architectures
Table 2 shows comparative results for different benchmark models with and without EdgePool. On a large majority of GNN/dataset combinations, EdgePool increases performance, by an average of almost pp. GIN and GIN0 profit the least (mean improvement of ), while GraphSAGE profits the most ( pp).
Interestingly, we can see that EdgePool allows even the MLP model to perform fairly well. This model can only rely on pooling to gain information on the neighbourhood. Nonetheless, it performs competitively on proteins and collab.
Unfortunately, the performance increases of EdgePool are not consistent over datasets and models. This makes it impossible to make a specific recommendation on situations in which one should or should not include EdgePool in the model.
However, we can still answer Q2: It is easily possible to integrate EdgePool in existing architectures. Doing so will lead to an estimated improvement of about pp, but might for some combinations of model and dataset decrease performance.
proteins  GCN  GIN  GIN0  SAGE  SAGE nacc  MLP 

No Pooling  
EdgePool  
RDTB  
No Pooling  
EdgePool  
RDT12K  
No Pooling  
EdgePool  
COLLAB  
No Pooling  
EdgePool 
5.3 EdgePool for node classification
As Table 3 shows, GNN using EdgePool can be integrated in node classification architectures and improves performance for 21 of 25 dataset/model combinations.
In particular, note the increase in performance for the MLP. In several of these tasks, an MLP augmented with EdgePool shows competitive performance to GNN algorithms. For GNN algorithms, EdgePool improves performance by an average of pp, performing worst on pubmed (no improvement on average) and for GCN (decrease by pp). It performs best for GIN and GIN0, at pp and pp improvements respectively.
This answers Q3: EdgePool will, in most cases, improve performance for node classification. The expected improvement is an average of pp.
cora  GCN  GIN  GIN0  GAT  MLP 

No Pooling  
EdgePool  
citeseer  
No Pooling  
EdgePool  
pubmed  
No Pooling  
EdgePool  
photo  
No Pooling  
EdgePool  
computer  
No Pooling  
EdgePool 
5.4 Visual inspection
Both Fig. 1 and Fig. 3 show examples of the pooling resulting from using EdgePool. In particular, they show that EdgePool keeps the linearity of the original protein even after pooling. Fig. 2(a) shows how unconnected paths (orange arrows) are not pooled, keeping the original graph structure visible even after pooling. However, as Fig. 2(b) shows, there are situations in which EdgePool causes node poolings which are counterintuitive to humans.
6 Conclusion
We have proposed EdgePool, a hard pooling method for GNN, based on edge contraction.
This pooling is both localized (and therefore independent of nonlocal graph changes) and sparse (and therefore computationally efficient even on large graphs).
Except for a single pooling procedure on a single dataset, EdgePool outperforms all previously proposed pooling approaches. We also show that EdgePool can be integrated into a large number of GNN architectures and usually improves performance on both node and graph classification tasks without any adaptations to training or architecture.
Besides the obvious use of EdgePool in improving existing GNN architectures, we hope it will serve as a stepping stone towards methods that learn how to modify graph structures. We believe this will lead towards methods that no longer operate on nodes but on abstracted groups of nodes.
References
 Borgwardt et al. (2005) Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, June 2005. ISSN 1367-4803. doi: 10.1093/bioinformatics/bti1007.
 Cangea et al. (2018) Cangea, C., Veličković, P., Jovanović, N., Kipf, T., and Liò, P. Towards Sparse Hierarchical Graph Classifiers. In NeurIPS 2018 Workshop on Relational Representation Learning, November 2018.
 Fey & Lenssen (2019) Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
 Gao & Ji (2018) Gao, H. and Ji, S. Graph U-Net. September 2018. URL https://openreview.net/forum?id=HJePRoAct7.
 Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, pp. 1–11, 2015.
 Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs, stat], December 2014.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR 2017.
 Kurtzer et al. (2017) Kurtzer, G. M., Sochat, V., and Bauer, M. W. Singularity: Scientific containers for mobility of compute. Plos One, 12(5):e0177459, 2017. ISSN 1932-6203. doi: 10.1371/journal.pone.0177459.
 Lee et al. (2019) Lee, J., Lee, I., and Kang, J. Self-Attention Graph Pooling. arXiv:1904.08082 [cs, stat], April 2019.
 Namata et al. (2012) Namata, G., London, B., Getoor, L., and Huang, B. Query-driven Active Surveying for Collective Classification. 10th International Workshop on Mining and Learning with Graphs, pp. 8, 2012.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPSW, 2017.
 Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective Classification in Network Data. AI Magazine, 29(3):93–93, September 2008. ISSN 2371-9621. doi: 10.1609/aimag.v29i3.2157.
 Shchur et al. (2018) Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of Graph Neural Network Evaluation. arXiv:1811.05868 [cs, stat], November 2018.
 Veličković et al. (2017) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks. October 2017.
 Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
 Yanardag & Vishwanathan (2015) Yanardag, P. and Vishwanathan, S. V. N. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, October 2015. ISBN 978-1-4503-3664-2. doi: 10.1145/2783258.2783417.
 Ying et al. (2018) Ying, R., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems 31, pp. 4800–4810. 2018.