1 Introduction
A fundamental component in deep convolutional neural networks is the pooling operation, which replaces the output of convolutions with local summaries of nearby points and is usually implemented by maximum or average operations. State-of-the-art architectures alternate convolutions, which extract local patterns irrespective of the specific location on the input signal, and pooling, which lets the ensuing convolutions capture aggregated patterns. Pooling allows the network to learn abstract representations in deeper layers by discarding information that is superfluous for the task, and keeps model complexity under control by limiting the growth of intermediate features.

Graph Neural Networks (GNNs) extend the convolution operation from regular domains, such as images or time series, to data with arbitrary topologies and unordered structures described by graphs (Battaglia et al., 2018). The development of pooling strategies for GNNs, however, has lagged behind the design of newer and more effective message-passing (MP) operations (Gilmer et al., 2017), such as graph convolutions, mainly due to the difficulty of defining an aggregated version of the original graph that supports the pooled signal.
A naïve pooling strategy in GNNs is to average all node features (Li et al., 2016), but it has limited flexibility, since it does not extract local summaries of the graph structure and no further MP operations can be applied afterwards. An alternative approach consists of precomputing coarsened versions of the original graph and then fitting the data to these deterministic structures (Bruna et al., 2013). While this aggregation accounts for the connectivity of the graph, it ignores task-specific objectives as well as the node features.
In this paper, we propose a differentiable pooling operation implemented as a neural network layer, which can be seamlessly combined with other MP layers (see Fig. 1). The parameters of the pooling layer are learned by combining the task-specific loss with an unsupervised regularization term, which optimizes a continuous relaxation of the normalized minCUT objective. The minCUT identifies dense graph components, where the node features become locally homogeneous after the message-passing. By gradually aggregating these components, the GNN learns to capture coarser properties of the graph. The proposed minCUT pooling operator (minCUTpool) yields partitions that 1) cluster together nodes which are similar and strongly connected on the graph, and 2) take into account the task-specific objective of the GNN.
2 Background
2.1 Mincut and spectral clustering
Given a graph $G = \{\mathcal{V}, \mathcal{E}\}$ with $|\mathcal{V}| = N$ nodes and the associated adjacency matrix $A \in \mathbb{R}^{N \times N}$, the $K$-way normalized minCUT (simply referred to as minCUT) aims at partitioning $\mathcal{V}$ in $K$ disjoint subsets by removing the minimum volume of edges. The problem is equivalent to maximizing

$$\frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i,j \in \mathcal{V}_k} e_{i,j}}{\sum_{i \in \mathcal{V}_k,\, j \in \mathcal{V} \setminus \mathcal{V}_k} e_{i,j}}, \tag{1}$$

where the numerator counts the edge volume within each cluster, and the denominator counts the edges between the nodes in a cluster and the rest of the graph (Shi & Malik, 2000). Let $C \in \{0,1\}^{N \times K}$ be a cluster assignment matrix, so that $c_{i,j} = 1$ if node $i$ belongs to cluster $j$, and 0 otherwise. The minCUT problem can be expressed as

$$\max_{C}\ \frac{1}{K} \sum_{k=1}^{K} \frac{C_k^T A C_k}{C_k^T D C_k}, \quad \text{s.t.}\ C \in \{0,1\}^{N \times K},\ C \mathbf{1}_K = \mathbf{1}_N, \tag{2}$$

where $C_k$ is the $k$-th column of $C$ and $D = \mathrm{diag}(A \mathbf{1}_N)$ is the degree matrix (Dhillon et al., 2004). Since problem (2) is NP-hard, it is usually recast in a relaxed formulation that can be solved in polynomial time and guarantees a near-optimal solution (Yu & Shi, 2003):

$$\max_{Q \in \mathbb{R}^{N \times K}}\ \frac{1}{K} \mathrm{Tr}\left(Q^T A Q\right), \quad \text{s.t.}\ Q = C (C^T C)^{-\frac{1}{2}},\ Q^T Q = I_K. \tag{3}$$
While the optimization problem (3) is still non-convex, there exists an optimal solution $Q^* = U_K O$, where $U_K \in \mathbb{R}^{N \times K}$ contains the eigenvectors of $A$ corresponding to the $K$ largest eigenvalues, and $O \in \mathbb{R}^{K \times K}$ is an arbitrary orthogonal transformation (Ikebe et al., 1987).

Since the elements of $Q^*$ are real values rather than binary cluster indicators, the spectral clustering (SC) approach can be used to find discrete cluster assignments. In SC, the rows of $Q^*$ are treated as node representations embedded in the eigenspace of the Laplacian, and are clustered together with standard algorithms such as $k$-means (Von Luxburg, 2007). One of the main limitations of SC lies in the computation of the spectrum of $A$, which has a memory complexity of $O(N^2)$ and a computational complexity of $O(N^3)$. This prevents its applicability to large datasets.

To deal with such scalability issues, the constrained optimization in (3) can be solved by gradient descent algorithms that refine the solution by iterating operations whose individual complexity is $O(N^2)$, or even $O(N)$ (Han & Filippone, 2017). Those algorithms search for the solution on the manifold induced by the orthogonality constraint on the columns of $Q$, by performing gradient updates along the geodesics (Wen & Yin, 2013; Collins et al., 2014). Alternative approaches rely on the QR factorization to constrain the space of feasible solutions (Damle et al., 2016), and alleviate the cost of the factorization by ensuring that orthogonality holds only on one mini-batch at a time (Shaham et al., 2018).
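To make the classic pipeline concrete, the following is a minimal NumPy/SciPy sketch of the relaxed problem (3) solved by eigendecomposition, followed by $k$-means discretization. The use of the symmetrically normalized adjacency and the k-means++ initialization are our own choices for the sketch, not prescriptions from the text.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_mincut(A, K):
    """Relaxed K-way normalized minCUT: take the K leading eigenvectors
    of the (normalized) adjacency as Q* in (3), then discretize the rows
    of Q* with k-means, as in standard spectral clustering."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt
    _, U = np.linalg.eigh(A_norm)   # eigenvalues in ascending order
    Q = U[:, -K:]                   # K leading eigenvectors
    _, labels = kmeans2(Q, K, minit='++')
    return labels

# Toy example: two 3-cliques joined by a single edge.
A = np.zeros((6, 6))
A[:3, :3] = 1; A[3:, 3:] = 1; A[0, 3] = A[3, 0] = 1
np.fill_diagonal(A, 0)
print(spectral_mincut(A, 2))  # e.g. [0 0 0 1 1 1], up to label permutation
```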
Other works based on neural networks include an autoencoder trained to map the $i$-th row of the Laplacian into the $i$-th components of the first eigenvectors, so as to avoid the spectral decomposition (Tian et al., 2014). Yi et al. (2017) use a soft orthogonality constraint to learn spectral embeddings as a volumetric reparametrization of a precomputed Laplacian eigenbase. Shaham et al. (2018) and Kampffmeyer et al. (2019) propose differentiable loss functions to partition generic data and process out-of-sample data at inference time. Nazi et al. (2019) generate balanced node partitions with a GNN, but adopt an optimization that does not encourage cluster assignments to be orthogonal.

2.2 Graph Neural Networks
Many approaches have been proposed to process graphs with neural networks, including recurrent architectures (Scarselli et al., 2009; Li et al., 2016) and convolutional operations inspired by the filters used in graph signal processing (Defferrard et al., 2016; Levie et al., 2018; Bianchi et al., 2019). Since our focus is on graph pooling, we base our GNN implementation on a simple MP operation, which combines the features of each node with those of its 1st-order neighbors. To account for the absence of self-loops, a typical approach is to add a (scaled) identity matrix to the diagonal of $A$ (Kipf & Welling, 2017). Since our pooling will also modify the structure of the adjacency matrix, we prefer an MP implementation that operates on the original $A$ and accounts for the initial node features by means of skip connections.

Let $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ be the symmetrically normalized adjacency matrix and $X \in \mathbb{R}^{N \times F}$ the matrix containing the node features. The output of the MP layer is
(4) 
where are the trainable weights relative to the mixing and skip component of the layer, respectively.
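As a reference, a minimal NumPy sketch of this layer is given below; the ReLU nonlinearity follows Eq. (4) as reconstructed above.

```python
import numpy as np

def mp_layer(X, A_norm, W_m, W_s):
    """Message-passing step of Eq. (4): a mixing term that propagates
    features over the normalized adjacency, plus a skip connection on
    the input features, followed by a ReLU."""
    return np.maximum(A_norm @ X @ W_m + X @ W_s, 0.0)
```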
3 Proposed method
The minCUT pooling strategy computes a cluster assignment matrix $S \in \mathbb{R}^{N \times K}$ by means of a multi-layer perceptron, which maps each node feature $\bar{x}_i$ into the $i$-th row of $S$:

$$S = \mathrm{softmax}\left(\mathrm{MLP}\left(\bar{X};\, \Theta_{\mathrm{MLP}}\right)\right), \tag{5}$$

where $\Theta_{\mathrm{MLP}}$ are trainable parameters. The softmax function guarantees that $s_{i,j} \in [0,1]$ and enforces the constraint $S \mathbf{1}_K = \mathbf{1}_N$ inherited from the optimization problem in (2). The parameters of the MP and pooling layers, $\Theta_{\mathrm{MP}}$ and $\Theta_{\mathrm{MLP}}$, are jointly optimized by minimizing the usual task-specific loss, as well as an unsupervised loss $\mathcal{L}_u$, which is composed of two terms
$$\mathcal{L}_u = \mathcal{L}_c + \mathcal{L}_o = -\frac{\mathrm{Tr}\left(S^T \tilde{A} S\right)}{\mathrm{Tr}\left(S^T \tilde{D} S\right)} + \left\lVert \frac{S^T S}{\lVert S^T S \rVert_F} - \frac{I_K}{\sqrt{K}} \right\rVert_F, \tag{6}$$

where $\tilde{D} = \mathrm{diag}(\tilde{A} \mathbf{1}_N)$ is the degree matrix of $\tilde{A}$ and $\lVert \cdot \rVert_F$ indicates the Frobenius norm.
The cut loss term, $\mathcal{L}_c$, evaluates the minCUT given by the cluster assignment $S$, and is bounded by $-1 \le \mathcal{L}_c \le 0$. Minimizing $\mathcal{L}_c$ encourages strongly connected nodes to be clustered together, since the inner product $\langle s_i, s_j \rangle$ increases when $\tilde{a}_{i,j}$ is large. $\mathcal{L}_c$ has a single maximum, reached when the numerator $\mathrm{Tr}(S^T \tilde{A} S) = 0$. This occurs if, for each pair of connected nodes (i.e., $\tilde{a}_{i,j} > 0$), the cluster assignments are orthogonal (i.e., $\langle s_i, s_j \rangle = 0$). $\mathcal{L}_c$ reaches its minimum, $-1$, when $\mathrm{Tr}(S^T \tilde{A} S) = \mathrm{Tr}(S^T \tilde{D} S)$. This occurs when, in a graph with $K$ disconnected components, the cluster assignments are equal for all the nodes in the same component and orthogonal to the cluster assignments of nodes in different components. However, $\mathcal{L}_c$ is a non-convex function and its minimization can lead to local minima or degenerate solutions. For example, given a connected graph, a trivial optimal solution is the one that assigns all nodes to the same cluster. As a consequence of the continuous relaxation, another degenerate minimum occurs when the cluster assignments are all uniform, that is, all nodes are equally assigned to all clusters. This problem is exacerbated by prior message-passing operations, which make the node features more uniform.
The orthogonality loss term, $\mathcal{L}_o$, penalizes the degenerate minima of $\mathcal{L}_c$ by encouraging the cluster assignments to be orthogonal and the clusters to be of similar size. Since the two matrices in $\mathcal{L}_o$ have unitary norm, it is easy to see that $0 \le \mathcal{L}_o \le 2$ and, therefore, $\mathcal{L}_o$ does not dominate over $\mathcal{L}_c$ (see Fig. 4 for an example). $I_K / \sqrt{K}$ can be interpreted as a (rescaled) clustering matrix $\hat{S}^T \hat{S}$, where $\hat{S}$ assigns exactly $N/K$ points to each cluster. The value of the Frobenius norm between clustering matrices is not dominated by the performance on the largest clusters (Law et al., 2017) and thus can be used to optimize intra-cluster variance.
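A compact sketch of the pooling layer's forward computation, Eqs. (5)-(6), is given below; the one-hidden-layer MLP is an illustrative choice (the depth of the MLP is a free design parameter).

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def assignments(X_bar, W1, W2):
    """Soft cluster assignments of Eq. (5): a (here, one-hidden-layer)
    MLP with a row-wise softmax, so each row of S sums to 1."""
    return softmax(np.maximum(X_bar @ W1, 0.0) @ W2)

def mincut_loss(S, A_norm):
    """Unsupervised loss of Eq. (6) for soft assignments S (N x K)."""
    K = S.shape[1]
    D_norm = np.diag(A_norm.sum(axis=1))
    # Cut term: negative ratio of intra-cluster edge volume to degree volume.
    L_c = -np.trace(S.T @ A_norm @ S) / np.trace(S.T @ D_norm @ S)
    # Orthogonality term: Frobenius distance between the normalized Gram
    # matrix of S and the rescaled identity (balanced, orthogonal clusters).
    SS = S.T @ S
    L_o = np.linalg.norm(SS / np.linalg.norm(SS) - np.eye(K) / np.sqrt(K))
    return L_c + L_o
```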
Contrarily to SC methods that search for feasible solutions only within the space of orthogonal matrices, $\mathcal{L}_o$ only introduces a soft constraint that can be violated during the learning procedure. Since $\mathcal{L}_u$ is non-convex, the violation compromises the theoretical guarantee of convergence to the optimum of (3). However, we note that:

1. in the GNN architecture, the minCUT objective is a regularization term and, therefore, a solution which is suboptimal for (3) could instead be adequate for the task-specific objective;

2. optimizing the task-specific loss helps the GNN to avoid the degenerate minima of $\mathcal{L}_c$.
3.1 Coarsening
The coarsened version of the adjacency matrix and the graph signal are computed as

$$A^{pool} = S^T \tilde{A} S \in \mathbb{R}^{K \times K}; \qquad X^{pool} = S^T X \in \mathbb{R}^{K \times F}, \tag{7}$$

where the entry $x^{pool}_{i,j}$ in $X^{pool}$ is the weighted average value of feature $j$ among the elements in cluster $i$. $A^{pool}$ is a symmetric matrix, whose diagonal entries $a^{pool}_{i,i}$ are the total number of edges between the nodes in cluster $i$, while the off-diagonal entries $a^{pool}_{i,j}$ are the number of edges between cluster $i$ and cluster $j$. Since $\mathrm{Tr}(A^{pool})$ corresponds to the numerator of $\mathcal{L}_c$ in (6), the trace maximization yields clusters with many internal connections that are weakly connected to each other. Hence, $A^{pool}$ will be a diagonal-dominant matrix, which describes a graph with self-loops much stronger than any other connection. Because self-loops hamper the propagation across adjacent nodes in the MP operations following the pooling layer, we compute the new adjacency matrix $\tilde{A}^{pool}$ by zeroing the diagonal and by applying the degree normalization
$$\hat{A} = A^{pool} - I_K\, \mathrm{diag}(A^{pool}); \qquad \tilde{A}^{pool} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}, \tag{8}$$

where $\mathrm{diag}(\cdot)$ returns the matrix diagonal and $\hat{D}$ is the degree matrix of $\hat{A}$.
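The coarsening step of Eqs. (7)-(8) then reduces to a few matrix products; a sketch:

```python
import numpy as np

def coarsen(S, A_norm, X):
    """Pooling step of Eqs. (7)-(8): aggregate features and connectivity
    with the soft assignments, then zero the (dominant) diagonal and
    re-apply the symmetric degree normalization."""
    X_pool = S.T @ X                           # (K, F)
    A_pool = S.T @ A_norm @ S                  # (K, K), diagonal-dominant
    A_hat = A_pool - np.diag(np.diag(A_pool))  # remove self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt, X_pool
```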
3.2 Discussion and relationship with spectral clustering
There are several differences between minCUTpool and classic SC methods. SC partitions the graph based on the Laplacian, but does not account for the node features. Instead, the cluster assignments found by minCUTpool depend on the node features $\bar{X}$, which works well when connected nodes have similar features. This is a reasonable assumption in GNNs since, even in disassortative graphs (i.e., networks where dissimilar nodes are likely to be connected (Newman, 2003)), the features tend to become similar due to the MP operations.
Another difference is that SC handles a single graph and is not conceived for tasks with multiple graphs to be partitioned independently. Instead, since the model parameters are independent of the number of nodes and of the graph spectrum, minCUTpool can generalize to out-of-sample data. This feature is fundamental in problems such as graph classification, where each sample is a graph with a different structure, and makes it possible to train the model on small graphs and process larger ones at inference time. Finally, minCUTpool directly uses the soft cluster assignments, rather than performing $k$-means afterwards.
4 Related work on pooling in GNNs
Trainable pooling methods.
Similarly to our method, these approaches learn how to generate coarsened versions of the graph through differentiable functions, which take as input the node features and are parametrized by weights optimized for the task at hand.
The work that is most related to our approach is Diffpool (Ying et al., 2018), which uses two MP layers in parallel: one to compute the new node features (as in Eq. (4)), and another to generate the cluster assignments $S$. In minCUTpool, instead, we compute $S$ by means of an MLP applied to $\bar{X}$. However, the main difference is in the unsupervised regularization loss, which in Diffpool consists of two terms. The first is the link prediction term $\lVert A - S S^T \rVert_F$, which minimizes the Frobenius norm of the difference between the adjacency and the Gram matrix of the cluster assignments, and encourages nearby nodes to be clustered together. The second term minimizes the entropy of the cluster assignments to make them similar to one-hot vectors.
The approach dubbed Top-$K$ pooling (Cangea et al., 2018; Hongyang Gao, 2019) learns a projection vector that is applied to each node feature to obtain a score. The nodes with the highest scores are retained, while the remaining ones are dropped. Top-$K$ is more memory efficient, as it avoids generating the cluster assignments. To prevent the graph from becoming disconnected when the nodes are removed, Top-$K$ drops the rows and the columns of $A^2$ and uses the result as the new adjacency matrix. However, computing $A^2$ is expensive, and it is inefficient to implement even with sparse operations.
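For comparison, a hedged NumPy sketch of the Top-$K$ selection step follows; the tanh gating is the choice made in Graph U-Nets, and the connectivity augmentation via $A^2$ follows Cangea et al.

```python
import numpy as np

def topk_pool(A, X, p, ratio=0.5):
    """Sketch of Top-K pooling: score nodes by projecting features onto
    the trainable vector p, keep the top-scoring ones, and slice A^2 to
    mitigate disconnections."""
    s = (X @ p) / np.linalg.norm(p)
    idx = np.argsort(s)[-int(np.ceil(ratio * len(s))):]
    A2 = A @ A
    return A2[np.ix_(idx, idx)], X[idx] * np.tanh(s[idx])[:, None]
```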
Topological pooling methods.
These methods precompute a pyramid of coarsened graphs, taking into account only the topology of $A$. During training, the node features are pooled with standard procedures and are fit to these deterministic graph structures. These methods are less flexible, but provide a stronger bias that can prevent degenerate solutions (e.g., coarsened graphs collapsing into a single node).
The approach proposed by Bruna et al. (2013), which has also been adopted in other GNN architectures (Defferrard et al., 2016; Fey et al., 2018), exploits GRACLUS (Dhillon et al., 2004), a hierarchical algorithm based on SC. At each pooling level, two vertices are clustered together into a new vertex. At inference phase, max-pooling is used to determine which node of the pair is kept. Fake vertices are added so that the number of nodes can be halved each time, although this injects noisy information into the graph.
Node decimation is a method originally proposed in the graph signal processing literature (Shuman et al., 2016), which has been adapted also for GNNs (Simonovsky & Komodakis, 2017; Bianchi et al., 2019). The nodes are partitioned in two sets, according to the signs of the eigenvector of the Laplacian associated with the largest eigenvalue. One of the two sets is dropped, reducing the number of nodes approximately by half each time. Kron reduction is used to compute a pyramid of coarsened Laplacians from the remaining nodes.
A procedure proposed in Gama et al. (2018) diffuses a signal from designated nodes on the graph and stores the observed sequence of diffused components. The resulting stream of information is interpreted as a time signal, where standard CNN pooling is applied. We also mention a pooling operation to coarsen binary unweighted graphs by aggregating maximal cliques (Luzhnica et al., 2019). Nodes assigned to the same clique are summarized by max or average pooling and become a new node in the coarsened graph.
5 Experiments
We consider both supervised and unsupervised tasks, and compare minCUTpool with several of the other popular pooling strategies described above. The Appendix reports further details on the experiments and a schematic depiction of the architectures used in each task.
5.1 Clustering the graph nodes
To evaluate the effectiveness of the proposed loss, we perform different node clustering tasks with a simple GNN composed of a single MP layer followed by a pooling layer. The GNN is trained by minimizing $\mathcal{L}_u$ only.
Clustering on synthetic networks
We consider two simple graphs: the first is a network with 6 communities and the second is a regular grid. The adjacency matrix is binary and the features are the 2D node coordinates. Fig. 2 depicts the node partitions generated by SC (a, d), Diffpool (b, e), and minCUTpool (c, f). Cluster indices for Diffpool and minCUTpool are obtained by taking the row-wise argmax of $S$. Compared to SC, Diffpool and minCUTpool also leverage the information contained in the node features $X$. minCUTpool generates very accurate and balanced partitions, demonstrating that the cluster assignment matrix $S$ is well formed. On the other hand, Diffpool assigns some nodes to the wrong community in the first example, and produces an imbalanced partition of the grid.
Image segmentation
Given an image, we build a Region Adjacency Graph (Trémeau & Colantoni, 2000) using as nodes the regions generated by an oversegmentation procedure (Felzenszwalb & Huttenlocher, 2004). The SC technique used in this example is the recursive normalized cut (Shi & Malik, 2000), which recursively clusters the nodes until convergence. For Diffpool and minCUTpool, the node features consist of the average and total color in each oversegmented region, and the number of desired clusters is fixed a priori. The results in Fig. 3 show that minCUTpool yields a more precise segmentation. On the other hand, Diffpool aggregates wrong regions and SC finds too many segments.
Clustering on citation networks
We cluster the nodes of three popular citation networks: Cora, Citeseer, and Pubmed. The nodes are documents represented by sparse bag-of-words feature vectors stored in $X$, and the binary undirected edges in $A$ indicate citation links between documents. Each node is labeled with the document class. To test the quality of the partitions generated by each method, we check the agreement between the cluster assignments and the original classes. Tab. 1 reports the Completeness Score, $\mathrm{CS} = 1 - \frac{H(C|Y)}{H(C)}$, and the Normalized Mutual Information, $\mathrm{NMI} = \frac{I(Y;C)}{\sqrt{H(Y)\,H(C)}}$, where $C$ denotes the cluster assignments, $Y$ the class labels, $H$ the entropy, and $I$ the mutual information.
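Both scores are available off the shelf, e.g., in scikit-learn (note that recent scikit-learn versions normalize the NMI by an arithmetic rather than geometric mean by default):

```python
from sklearn.metrics import completeness_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]  # document classes
y_pred = [1, 1, 0, 0, 2, 2]  # row-wise argmax of the soft assignments S
print(completeness_score(y_true, y_pred))            # 1.0 (label-permutation invariant)
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```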
The GNN architecture configured with minCUTpool achieves a higher NMI score than SC, which does not account for the node features when generating the partitions. Our pooling operation also outperforms Diffpool, indicating that the unsupervised loss in Diffpool is unable to converge to an optimal solution, possibly due to its highly non-convex nature. This can be seen in Fig. 4, which depicts the evolution of the unsupervised losses and of the NMI scores of Diffpool and minCUTpool during training.
Table 1: NMI and Completeness Score (mean ± standard deviation) for node clustering on the citation networks.

Dataset | K | Spectral clustering NMI | Spectral clustering CS | Diffpool NMI | Diffpool CS | minCUTpool NMI | minCUTpool CS
Cora | 7 | 0.025 ± 0.014 | 0.126 ± 0.042 | 0.315 ± 0.005 | 0.309 ± 0.005 | 0.404 ± 0.018 | 0.392 ± 0.018
Citeseer | 6 | 0.014 ± 0.003 | 0.033 ± 0.000 | 0.139 ± 0.016 | 0.153 ± 0.020 | 0.287 ± 0.047 | 0.283 ± 0.046
Pubmed | 3 | 0.182 ± 0.000 | 0.261 ± 0.000 | 0.079 ± 0.001 | 0.085 ± 0.001 | 0.200 ± 0.020 | 0.197 ± 0.019
5.2 Supervised graph classification
In this task, the $i$-th datum is a graph with $N_i$ nodes represented by a pair $(A_i, X_i)$ and must be associated to the correct label $y_i$. We test the models on different graph classification datasets. For featureless graphs, we use the node degree and the clustering coefficient as surrogate node features. We evaluate model performance with a 10-fold train/test split, using a portion of the training set in each fold as a validation set for early stopping. We adopt a fixed network architecture, MP(32)-pool-MP(32)-pool-MP(32)-GlobalAvgPool-softmax, where MP is the message-passing operation in (4) with 32 hidden units. The pooling module is implemented either by Graclus, Decimation pooling, Top-$K$ pooling, Diffpool, or the proposed minCUTpool. Each pooling method is configured to drop half of the nodes in a graph (a pooling ratio of 0.5 in Top-$K$, Diffpool, and minCUTpool). As baselines, we consider the popular Weisfeiler-Lehman (WL) graph kernel (Shervashidze et al., 2011), a network with only MP layers (Flat), and a fully connected network (Dense).
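Chaining the sketches from the previous sections, the forward pass of this architecture can be outlined as follows; `mp_layer`, `assignments`, `mincut_loss`, and `coarsen` are the functions sketched above, all weights are assumed pre-initialized, and the unsupervised loss is added to the cross-entropy of the task.

```python
def gnn_forward(A_norm, X, weights):
    """Sketch of MP(32)-pool-MP(32)-pool-MP(32)-GlobalAvgPool(-softmax)."""
    L_u = 0.0
    for W_m, W_s, W1, W2 in weights["blocks"]:  # two MP + minCUTpool blocks
        X = mp_layer(X, A_norm, W_m, W_s)
        S = assignments(X, W1, W2)
        L_u += mincut_loss(S, A_norm)           # unsupervised regularizer
        A_norm, X = coarsen(S, A_norm, X)
    X = mp_layer(X, A_norm, *weights["last_mp"])
    logits = X.mean(axis=0) @ weights["readout"]  # global average pooling
    return logits, L_u
```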
Table 2: Graph classification accuracy (mean ± standard deviation).

Dataset | WL | Dense | Flat | Graclus | Decim. | Diffpool | Top-K | minCUT
Bench-easy | 92.6 | 29.3 ± 0.3 | 98.5 ± 0.3 | 97.5 ± 0.5 | 97.9 ± 0.5 | 98.6 ± 0.4 | 82.4 ± 8.9 | 99.0 ± 0.0
Bench-hard | 60.0 | 29.4 ± 0.3 | 67.6 ± 2.8 | 69.0 ± 1.5 | 72.6 ± 0.9 | 69.9 ± 1.9 | 42.7 ± 15.2 | 73.8 ± 1.9
Mutagenicity | 84.4 | 68.4 ± 0.3 | 78.0 ± 1.3 | 74.4 ± 1.8 | 77.8 ± 2.3 | 77.6 ± 2.7 | 71.9 ± 3.7 | 79.9 ± 2.1
Proteins | 71.2 | 68.7 ± 3.3 | 72.6 ± 4.8 | 68.6 ± 4.6 | 70.4 ± 3.4 | 72.7 ± 3.8 | 69.6 ± 3.5 | 76.5 ± 2.6
DD | 78.6 | 70.6 ± 5.2 | 76.8 ± 1.5 | 70.5 ± 4.8 | 70.1 ± 3.0 | 79.3 ± 2.4 | 69.4 ± 7.8 | 80.3 ± 2.0
COLLAB | 74.8 | 79.3 ± 1.6 | 82.1 ± 1.8 | 77.1 ± 2.1 | 75.8 ± 2.2 | 81.8 ± 1.4 | 79.3 ± 1.8 | 83.4 ± 1.7
Reddit-Binary | 68.2 | 48.5 ± 2.6 | 80.3 ± 2.6 | 79.2 ± 0.4 | 84.3 ± 2.4 | 86.8 ± 2.1 | 74.7 ± 4.5 | 91.4 ± 1.5
Tab. 2 reports the classification results, highlighting those that are significantly better (in terms of $p$-value w.r.t. the method with the highest mean accuracy). The comparison with Flat helps to understand whether a pooling operation is useful or not. The results of Dense, instead, help to quantify how much additional information is brought by the graph structure, with respect to the node features alone. It can be seen that minCUTpool always obtains equal or better results with respect to every other GNN architecture. On the other hand, some pooling procedures do not always improve the performance compared to the Flat baseline, making them inadvisable in some cases. The WL kernel generally performs worse than the GNNs, except on the Mutagenicity dataset. This is probably because Mutagenicity has smaller graphs than the other datasets, and the adopted GNN architecture is overparametrized for this task. Interestingly, in some datasets such as Proteins and COLLAB it is possible to obtain fairly good classification accuracy with the Dense architecture, meaning that the graph structure adds only limited information.

Fig. 5 reports a comparison of the execution time per training epoch for each pooling algorithm. Graclus and Decimation are understandably the fastest methods, since the coarsened graphs are precomputed. Among the differentiable pooling methods, minCUTpool is faster than Diffpool, which uses a slower MP layer rather than an MLP to compute the cluster assignments, and than Top-$K$, which computes the square of $A$ at every forward pass.
5.3 GNN Autoencoder
To compare the amount of information retained by the pooling layers in the coarsened graphs, we train an autoencoder (AE) to reconstruct an input graph signal from its pooled version. The AE architecture is MP(32)-MP(32)-pool-unpool-MP(32)-MP(32)-MP, and is trained by minimizing the mean squared error between the original and the reconstructed graph signal, $\lVert X - X^{rec} \rVert^2$. All the pooling operations are configured to retain a fixed fraction of the original nodes.
In Diffpool and minCUTpool, the unpool step is simply implemented by transposing the original pooling operations:

$$X^{rec} = S\, X^{pool}; \qquad A^{rec} = S\, A^{pool} S^T. \tag{9}$$
Top-$K$ does not generate a cluster assignment matrix, but returns a binary mask $m \in \{0,1\}^N$ that indicates the nodes to drop (0) or to retain (1). Therefore, an upsampling matrix $U$ is built by dropping from the identity matrix $I_N$ the columns that correspond to a 0 in $m$. The unpooling operation is performed by replacing $S$ with $U$ in (9), and the resulting upscaled graph is a version of the original graph with zeros in correspondence of the dropped nodes.
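A sketch of both unpooling variants follows; the helper `topk_upsampling` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def unpool(S, A_pool, X_pool):
    """Unpooling of Eq. (9): transpose the pooling operations."""
    return S @ A_pool @ S.T, S @ X_pool

def topk_upsampling(mask):
    """Build the upsampling matrix U for Top-K: keep only the identity
    columns of retained nodes; dropped nodes map to all-zero rows."""
    return np.eye(len(mask))[:, np.asarray(mask, dtype=bool)]
```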
Figs. 6 and 7 report the original graph signal (the node features are the 2D coordinates of the nodes) and the reconstructions obtained with the different pooling methods, for a ring graph and a regular grid graph. The reconstruction produced by Diffpool is worse for the ring graph but almost perfect for the grid graph, while minCUTpool yields good results in both cases. On the other hand, Top-$K$ clearly fails to generate a coarsened representation that maintains enough information from the original graph.
This experiment highlights a major issue in Top-$K$ pooling, which retains the nodes associated with the highest values of a score vector $s$, computed by projecting the node features onto a trainable vector $p$: $s = X p / \lVert p \rVert$. Nodes that are connected on the graph usually share similar features, and their similarity further increases after the MP operations, which combine the features of neighboring nodes. Retaining the nodes associated with the top scores in $s$ amounts to keeping nodes that are alike and highly connected, as can be seen in Figs. 6 and 7. Therefore, Top-$K$ discards entire portions of the graph, which might contain important information. This explains why Top-$K$ fails to recover the original graph signal when used as the bottleneck of the AE, and yields the worst performance among all GNN methods in the graph classification task.
6 Conclusions
We proposed a pooling layer for GNNs that coarsens a graph by taking into account both the connectivity structure and the node features. The layer optimizes a regularization term based on the minCUT objective, which is minimized in conjunction with the task-specific loss to produce node partitions that are optimal for the task at hand.
We tested the effectiveness of our pooling strategy on unsupervised node clustering tasks, by optimizing only the unsupervised clustering loss, as well as supervised graph classification tasks on several popular benchmark datasets. Results show that minCUTpool performs significantly better than existing pooling strategies for GNNs.
References
 Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 Bianchi et al. (2019) Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph neural networks with convolutional ARMA filters. arXiv preprint arXiv:1901.01343, 2019.
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

 Cangea et al. (2018) Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. In Advances in Neural Information Processing Systems – Relational Representation Learning Workshop, 2018.
 Collins et al. (2014) Maxwell D Collins, Ji Liu, Jia Xu, Lopamudra Mukherjee, and Vikas Singh. Spectral clustering with a convex regularizer on millions of images. In European Conference on Computer Vision, pp. 282–298. Springer, 2014.
 Damle et al. (2016) Anil Damle, Victor Minden, and Lexing Ying. Robust and efficient multi-way spectral clustering. arXiv preprint arXiv:1609.08251, 2016.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

 Dhillon et al. (2004) Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM, 2004.
 Felzenszwalb & Huttenlocher (2004) Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

 Fey et al. (2018) Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877, 2018.
 Gama et al. (2018) Fernando Gama, Antonio G Marques, Geert Leus, and Alejandro Ribeiro. Convolutional neural network architectures for signals supported on graphs. IEEE Transactions on Signal Processing, 67(4):1034–1049, 2018.

 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning – Volume 70, pp. 1263–1272. JMLR.org, 2017.
 Han & Filippone (2017) Yufei Han and Maurizio Filippone. Mini-batch spectral clustering. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3888–3895. IEEE, 2017.
 Hongyang Gao (2019) Hongyang Gao and Shuiwang Ji. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
 Ikebe et al. (1987) Yasuhiko Ikebe, Toshiyuki Inagaki, and Sadaaki Miyamoto. The monotonicity theorem, Cauchy's interlace theorem, and the Courant-Fischer theorem. The American Mathematical Monthly, 94(4):352–354, 1987.
 Kampffmeyer et al. (2019) Michael Kampffmeyer, Sigurd Løkse, Filippo M. Bianchi, Lorenzo Livi, Arnt-Børre Salberg, and Robert Jenssen. Deep divergence-based approach to clustering. Neural Networks, 113:91–101, 2019.
 Kipf & Welling (2017) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR), 2017.
 Law et al. (2017) Marc T Law, Raquel Urtasun, and Richard S Zemel. Deep spectral clustering learning. In Proceedings of the 34th International Conference on Machine Learning – Volume 70, pp. 1985–1994. JMLR.org, 2017.
 Levie et al. (2018) Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97–109, 2018.
 Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. International Conference on Learning Representations (ICLR), 2016.
 Luzhnica et al. (2019) Enxhell Luzhnica, Ben Day, and Pietro Liò. Clique pooling for graph classification. International Conference on Learning Representations (ICLR) – Representation Learning on Graphs and Manifolds Workshop, 2019.
 Nazi et al. (2019) Azade Nazi, Will Hang, Anna Goldie, Sujith Ravi, and Azalia Mirhoseini. GAP: Generalizable approximate graph partitioning framework. arXiv preprint arXiv:1903.00614, 2019.
 Newman (2003) Mark EJ Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Shaham et al. (2018) Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
 Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shi & Malik (2000) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Departmental Papers (CIS), pp. 107, 2000.
 Shuman et al. (2016) David I Shuman, Mohammad Javad Faraji, and Pierre Vandergheynst. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.
 Simonovsky & Komodakis (2017) Martin Simonovsky and Nikos Komodakis. Dynamic edgeconditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Tian et al. (2014) Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and TieYan Liu. Learning deep representations for graph clustering. In AAAI, pp. 1293–1299, 2014.
 Trémeau & Colantoni (2000) Alain Trémeau and Philippe Colantoni. Regions adjacency graph applied to color image segmentation. IEEE Transactions on image processing, 9(4):735–744, 2000.
 Von Luxburg (2007) Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 Wen & Yin (2013) Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1–2):397–434, 2013.
 Yi et al. (2017) Li Yi, Hao Su, Xingwen Guo, and Leonidas J Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290, 2017.
 Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810, 2018.
 Yu & Shi (2003) Stella X. Yu and Jianbo Shi. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 313–319, 2003.
Appendix A Additional experiments
A.1 Graph regression of molecular properties on QM9
The QM9 chemical database is a collection of ~135k small organic molecules, associated with continuous labels describing several geometric, energetic, electronic, and thermodynamic properties (http://quantum-machine.org/datasets/). Each molecule in the dataset is represented as a graph $\{A_i, X_i\}$, where atoms are associated with nodes and edges represent chemical bonds. The atomic number of each atom (one-hot encoded; C, N, F, O) is taken as node feature, and the type of bond (one-hot encoded; single, double, triple, aromatic) can be used as edge attribute. In this experiment, we ignore the edge attributes in order to use all pooling algorithms without modifications.
The purpose of this experiment is to compare the trainable pooling methods also on a graph regression task, but it must be intended as a proof of concept. In fact, the graphs in this dataset are extremely small (the average number of nodes is 8) and, therefore, a pooling operation is arguably not necessary. We consider a GNN with architecture MP(32)-pool-MP(32)-GlobalAvgPool-Dense, where pool is implemented by Top-$K$, Diffpool, or minCUTpool. The network is trained to predict a given chemical property from the input molecular graphs. Performance is evaluated with cross-validation, using a portion of the training set for validation in each split. The GNNs are trained for 50 epochs, using Adam with learning rate 5e-4, batch size 32, and ReLU activations. We use the mean squared error (MSE) as supervised loss.
The MSE obtained on the prediction of each property with the different pooling methods is reported in Tab. 3. As expected, the flat baseline with no pooling operation (MP(32)-MP(32)-GlobalAvgPool-Dense) yields a lower error in most cases. Contrarily to the graph classification and AE tasks, Top-$K$ achieves better results than Diffpool on average. Once again, minCUTpool significantly outperforms the other pooling methods on each regression task and, in one case, also the flat baseline.
Table 3: MSE on the QM9 graph regression task.

Property | Top-K | Diffpool | minCUTpool | Flat baseline
mu | 0.600 | 0.651 | 0.538 | 0.559
alpha | 0.197 | 0.114 | 0.078 | 0.065
homo | 0.698 | 0.712 | 0.526 | 0.435
lumo | 0.601 | 0.646 | 0.540 | 0.515
gap | 0.630 | 0.698 | 0.584 | 0.552
r2 | 0.452 | 0.440 | 0.261 | 0.204
zpve | 0.402 | 0.410 | 0.328 | 0.284
u0_atom | 0.308 | 0.245 | 0.193 | 0.163
cv | 0.291 | 0.337 | 0.148 | 0.127
Appendix B Experimental details
The GNN architectures analyzed in this work have been implemented with the Spektral library (https://danielegrattarola.github.io/spektral/). For the WL kernel, we used the implementation provided in the GraKeL library (https://ysig.github.io/GraKeL/dev/). The pooling strategy based on Graclus is taken from the ChebyNets repository (https://github.com/mdeff/cnn_graph).
B.1 Clustering on citation networks
Diffpool and minCUTpool are configured with 16 hidden neurons with linear activations in the MLP and in the MP layer, respectively, used to compute the cluster assignment matrix $S$. The MP layer used to compute the propagated node features uses an ELU activation in both architectures. The learning rate for Adam is 5e-4, and the models are trained for 10000 iterations. The details of the citation network datasets are reported in Tab. 4.

Table 4: Details of the citation network datasets.

Dataset | Nodes | Edges | Node features | Node classes
Cora | 2708 | 5429 | 1433 | 7
Citeseer | 3327 | 9228 | 3703 | 6
Pubmed | 19717 | 88651 | 500 | 3
B.2 Graph classification
We train the GNN architectures with Adam, an L2 penalty loss with weight 1e-4, and 16 hidden units, both in the MLP of minCUTpool and in the internal MP of Diffpool. Mutagenicity, Proteins, DD, COLLAB, and Reddit-Binary are datasets representing real-world graphs and are taken from the repository of benchmark datasets for graph kernels (https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets). Bench-easy and Bench-hard (https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification) are datasets where the node features and the adjacency matrix are completely uninformative if considered alone. Hence, algorithms that account only for the node features or only for the graph structure will fail to classify the graphs. The statistics of all the datasets are reported in Tab. 5.

Table 5: Statistics of the graph classification datasets.

Dataset | samples | classes | avg. nodes | avg. edges | node attr. | node labels
Bench-easy | 1800 | 3 | 147.82 | 922.66 | – | yes
Bench-hard | 1800 | 3 | 148.32 | 572.32 | – | yes
Mutagenicity | 4337 | 2 | 30.32 | 30.77 | – | yes
Proteins | 1113 | 2 | 39.06 | 72.82 | 1 | no
DD | 1178 | 2 | 284.32 | 715.66 | – | yes
COLLAB | 5000 | 3 | 74.49 | 2457.78 | – | no
Reddit-Binary | 2000 | 2 | 429.63 | 497.75 | – | no
Appendix C Architectures schemata
The figures in this appendix report, in order: the schematic representation of the minCUTpool layer; the GNN architecture used in the clustering and segmentation tasks; the GNN architecture used in the graph classification task; the GNN architecture used in the graph regression task; and the graph autoencoder used in the graph signal reconstruction task.