1 Introduction
Convolutional neural networks (CNNs) are efficient at extracting hierarchical representations of signals residing on regular grids, such as audio and images. With the convolution and pooling operations, CNNs have achieved state-of-the-art performance in a variety of applications. With the increasing availability of various forms of network data, recent pioneering works [2, 5, 9, 10, 26] have generalized convolutional neural networks to irregular structures, including graph and point cloud data. Most current attempts at designing neural network representations of graph data however focus on the convolution operator. The pooling operator is mostly overlooked, yet it carries an important part of the ability of graph neural networks to distill effective hierarchical representations.
Hierarchical network data representations necessitate a careful design of all elements of the learning architecture. In tasks like graph classification, for instance, a global representation is required in addition to local features in order to predict the label of an entire graph. The pooling operator is an important component in the construction of such hierarchical architectures. In the case of image representation, the pooling operator downsamples data simply by sliding a predefined local receptive field over the image and aggregating information in each receptive field, from left to right and top to bottom, taking advantage of the inherent spatial order of lattice structures. However, in the general case of networks with diverse irregular connections between vertices and no spatial order of the nodes, the pooling operation becomes more challenging. In particular, it is not appropriate to directly generalize the pooling operator from images to graphs. Even for the simple pooling operation that downsamples image signals with stride 2, its counterpart on graphs, which can be formulated as a max-cut problem, becomes NP-hard
[3]. In addition, it is still a challenge to construct a coarsened graph with the pooled features. In other words, one has to solve the following two problems to generalize the pooling operation to graphs: (1) How to choose and aggregate information to characterize the signals residing on the graph? (2) How to coarsen the structure of graphs in the higher levels of representation? Graph theory provides several tools to generalize the pooling operator to networks with the help of graph clustering or graph coarsening algorithms [25, 21, 8]. However, although these methods perform well on fixed graph structures, they do not easily generalize to networks with different topologies. Furthermore, their high computational complexity might become a limitation in graph convolutional neural networks.

In this paper, we propose a pooling operator, called iPool, for data that comes with arbitrary graphs. This new operator is designed to generate a faithful representation of the signals supported on the graph in terms of their information. It first uses a criterion to evaluate the amount of information carried by each node given observations of its neighbors. Then, as informative nodes play an important role in characterizing graphs, we propose a strategy to construct coarsened graphs by aggregating information in accordance with the proposed information criterion, such that the intrinsic structure of the original graph is preserved. In particular, the node signals are predicted from the neighbor signal values to determine the most informative nodes, as shown in Fig. 1. A coarsened graph is constructed on these most informative nodes with a topology that maintains consistency with the original graph.
The proposed iPool scheme offers several advantages: 1) it is a parameter-free building block, and it can be paired with diverse graph convolution operators in order to deal with arbitrary graphs; 2) it is invariant to graph isomorphism, which means that coarsened versions of isomorphic graphs are identical through the pooling operation; 3) it exploits the signals residing on the graph in addition to the structure of the graph; 4) it provides two kinds of pooling schemes, namely global and local pooling, which permit to trade off preservation of graph structure against information aggregation. Finally, we resort to graph classification tasks in order to evaluate the proposed method, and we show that it outperforms state-of-the-art graph neural network based methods on a collection of public benchmark datasets. The proposed pooling operator also has promising perspectives in other discriminative tasks and generalizes easily to other non-Euclidean data, such as point clouds.
2 Related Work
We give a brief review below on a collection of works on graph representation learning. We focus on traditional graph coarsening schemes as well as on recent attempts in graph pooling, which are relevant to the main problem addressed in this work.
Graph neural networks (GNNs) were first proposed in [6, 17, 18], and some recent works have attempted to extend convolutional neural networks to graphs, termed graph convolution networks (GCNs). One category of methods defines convolutions for graphs on the basis of spectral filtering of graph signals, which leads to spectral GCNs [2, 5, 9, 10]. Another line of work focuses on defining the convolution operation in the spatial domain [2, 7, 26, 23, 29, 27] to address the basis-dependent problem, where a spectral graph neural network trained on one graph structure fails to transfer properly to other graph structures. Most of them generalize the convolution operation by defining a "message aggregation" scheme, where information in the neighborhood is aggregated. The convolution layers in these architectures can be combined with different pooling operators to generate hierarchical representations of graph data.
When it comes to building multiscale representations of graph data, previous works rely on variants of graph coarsening algorithms, or different forms of graph pooling. For general graphs, unfortunately, this becomes a complex problem, and a collection of approximate methods have been proposed in the literature, such as multiscale methods [21, 8]. Yet, their computational complexity stays high, so that these methods are mostly suitable for processing fixed-structure graphs offline, but have clear limitations in providing an online building block able to cope with graphs of various structures. An alternative to graph coarsening consists in implementing spectral clustering, which splits graphs into parts that can yield a hierarchical representation [25]. Spectral clustering is however not ideal for building multiscale representations of graph data.

Pooling probably provides the most constructive alternative to develop multiscale representations in graph neural networks, and different methods have been proposed for generalizing pooling ideas to graphs. For example,
Zhang et al. [29] propose SortPooling, which sorts the nodes based on the value of the last feature map in descending order and preserves the first nodes of this list. On the other hand, DIFFPOOL [28] follows clustering methods and assigns nodes softly by generating a cluster assignment matrix with neural networks. SET2SET [24] implements an equivalent global pooling by aggregating information through LSTMs. However, these two recent methods have limitations: DIFFPOOL requires another graph neural network, with the same capacity as the main-task network, to learn the projection matrices, and SET2SET depends on extra LSTM units, which are time-consuming to train, in order to gather information.

In contrast to previous methods, we propose a graph pooling scheme driven by the information of the graph signals, in order to enhance the capability of GNNs to construct hierarchical representations of graphs with arbitrary topologies. Similarly to the pooling operation for images, the proposed graph pooling operator is parameter-free and "plug and play", and thereby leads to fast implementations in both training and testing phases.
3 The iPool algorithm
We present in this section the main elements of our new iPool algorithm. We first introduce the general framework of graph neural network architectures as well as the notation used in the paper. We later define the role of the pooling algorithm and introduce an information gain criterion that drives our iPool algorithm. We finally show how our pooling algorithm can be used to construct hierarchical representations with both global and local graph pooling strategies.
3.1 Framework
In this paper, we consider the learning of hierarchical representations for network data with the help of a neural network architecture. Such an architecture is typically built on the concatenation of several layers, composed of graph convolution blocks and pooling operators. We focus here on the pooling operator, which is a key element for effective learning of hierarchical representations.
We define the notation used in this paper, where we employ capital letters and bold lowercase letters to indicate matrices and vectors, respectively. The network data at the input of the learning architecture is represented by an undirected graph G = (V, E), with V and E respectively the set of vertices and the set of edges. The adjacency matrix A represents the topology of the network, and presents a nonzero value at position (i, j) (i.e., A_{ij} ≠ 0) only if there is an edge in E that connects vertices v_i and v_j. We consider both unweighted and weighted graphs, which leads to A consisting of respectively unitary values or actual edge weights. We further define D and P as the degree matrix and respectively the transition matrix of G. The degree matrix D is a diagonal matrix with D_{ii} = Σ_j A_{ij}, and the transition matrix, defined as P = D^{-1} A, denotes the transition probability between each pair of nodes. Furthermore, each vertex v_i of the graph might further be attributed a signal value or feature, which is denoted by x_i and gathered in the feature matrix X. Finally, as we construct a hierarchical representation of graph data, the subscript l is used to denote features or parameters belonging to the l-th layer of the neural network. For example, for the graph G_l in the l-th neural network representation layer, the i-th vertex of the graph is written as v_i^{(l)}, and x_i^{(l)} represents the signal or feature residing on that node.

Most current graph convolution networks utilize a stack of graph convolution layers to learn a representation or embedding of the graph data. The graph convolution layers are usually designed to follow a neighborhood message aggregation scheme, which can be written as:
x_i^{(l+1)} = σ( AGG( { Â_{ij} x_j^{(l)} W_l : v_j ∈ N(v_i) ∪ {v_i} } ) ),   (1)

where Â can be any graph shift operator that has nonzero values only in the positions corresponding to edges in the graph and on its diagonal, including but not limited to A and P. The functions AGG and σ respectively denote a data aggregation function and a nonlinear activation function, and the parameters W_l are learned during the training of the neural network. The aggregation function varies across graph convolution network architectures and is usually defined as a (weighted) summation [28], a mean [7, 29], or even a Multilayer Perceptron (MLP)
[27].

A key element in these architectures is the graph pooling or coarsening operator, which aims at selecting a subset of the data and obtaining a reduced-dimension version that still represents the original graph data well. These operators equip the neural networks with the ability to construct multiscale representations of network data. However, due to the diverse structures of graphs and the absence of a regular grid-like topology in general, it is not possible to predefine the sampling structure or the local receptive fields on graphs, as can be done in image representation learning [3]. The graph pooling operator thus has to be defined adaptively, yet in a generic way, so that it can accommodate different network structures.
Generally, for an undirected graph G_l, the graph pooling process can be formulated as:

A_{l+1} = S_l A_l S_l^T,   (2)

where A_{l+1} and A_l are the adjacency matrices of the coarse and fine graphs, and S_l is the coarsening matrix. Correspondingly, the signal residing on the graph is downsampled by the coarsening matrix:

X_{l+1} = S_l X_l.   (3)
Notably, Eq. (2) and Eq. (3) are general formulations that cover both traditional graph coarsening and clustering methods [13, 4, 12] as well as state-of-the-art graph pooling methods [28].
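To make Eqs. (2) and (3) concrete, the short sketch below applies a selection-type coarsening matrix S to a 4-node path graph (NumPy and the toy graph are our own illustration, not part of the paper's implementation):

```python
import numpy as np

# Adjacency of a 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)  # one 2-d feature per node

# Selection-type coarsening matrix S (2 x 4): keep nodes 1 and 2
S = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

A_coarse = S @ A @ S.T   # Eq. (2): adjacency of the coarsened graph
X_coarse = S @ X         # Eq. (3): pooled node features

print(A_coarse)  # nodes 1 and 2 stay connected in the coarse graph
print(X_coarse)  # features of the two preserved nodes
```

With a 0/1 selection matrix, the products simply restrict A and X to the rows and columns of the preserved nodes, which is exactly what iPool does once the node indices are chosen.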
The design of effective hierarchical representations of graph data largely relies on the proper choice of the graph pooling operator, hence on a proper choice of the coarsening matrix S_l. We present below a novel pooling strategy using a neighborhood information gain criterion.
3.2 The neighborhood information gain criterion
In order to design a proper pooling operator, one first needs to define a criterion that governs the selection of the most important nodes in the graph. The objective is to coarsen the graph representation while keeping a faithful representation of the original one, with the coarsening matrix, or equivalently the pooling operator, generated on the basis of the structure of the graph and the signals that it supports. In general, if the signal residing on one particular node of the graph can be well predicted from the signals supported by other nodes, this node can probably be removed in the coarsened graph with negligible information loss. If we further consider the typical localization and smoothness properties of most signals, it is reasonable to limit the node signal prediction process to the node's neighborhood. We can therefore relate the amount of information carried by a graph node to the difficulty of predicting its signal value from the nodes in its neighborhood. We introduce below a measure, namely the neighborhood information gain criterion, to quantitatively evaluate the uncertainty or information of node signals given observations of the neighbors. We later use this measure to design a new pooling operator that eventually preserves the most important nodes in the graph.
The neighborhood information gain criterion is defined as the "Manhattan" distance between the observed signal and the one predicted from observations at the neighbor nodes.¹ We choose the "Manhattan" distance as it represents a common similarity measure that is especially convenient for the high-dimensional vectors that might be present in some graph datasets. Specifically, with a prediction function p(·) using information from the neighborhood, the neighborhood information gain of each node can be formulated as:

¹ For graphs without signals, the graph structure is implicitly taken as the signal, by using the features obtained after convolution layers with constant features as input (i.e., x_i = 1 for all nodes).
g_i = || x_i − p( {x_j : v_j ∈ N(v_i)} ) ||_1.   (4)
With this definition, the neighborhood information gain criterion gives the same importance to differences along each dimension; using an l1 norm prevents specific dimensions with large deviations from dominating the other ones, as typically happens with lp norms with p > 1. We now have different options for choosing the prediction function p(·). Among them, neighborhood aggregating functions are promising, in particular when considering the typical localization and smoothness properties of graph signals. We therefore choose to predict the node signal as the weighted average of the signals supported on nodes within its k-hop neighborhood. Given that the k-th power of the transition matrix, P^k, is an effective measure of the level of connection or dependence between any pair of nodes reachable within k hops, we adopt the elements of a modified matrix P̃_k as the weights in our prediction function, in order to give more confidence to nodes that have stronger connections. The modification turns P^k into an off-diagonal transition matrix, corresponding to a graph without k-hop circles. Finally, the prediction function can be formulated as
x̂_i = Σ_{v_j ∈ N_k(v_i)} (P̃_k)_{ij} x_j,   (5)

with

P̃_k = D̃^{-1} Ã_k,   (6)

where N_k(v_i) is the k-hop neighborhood of v_i, Ã_k is the k-hop adjacency matrix A^k whose diagonal values, corresponding to the k-hop circles, have been removed, and D̃ is the corresponding degree matrix. In this way, Eq. (4) can be further formulated as:
g_i = || x_i − x̂_i ||_1,   (7)

g = || (I − P̃_k) X ||_{1,row},   (8)

where || · ||_{1,row} indicates the l1 norm applied to each row of a matrix. The value of g_i will be high when a node signal is very different from the ones in its neighborhood, which means that the node carries high information and should be preserved by the pooling operator. Note that the definition of the information criterion is local, which is very important towards low complexity and possible distributed implementations.
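The criterion can be sketched in a few lines of dense NumPy (our own illustration; an actual GNN implementation would use sparse operations):

```python
import numpy as np

def neighborhood_info_gain(A, X, k=1):
    """Neighborhood information gain of each node: the row-wise l1 norm of
    (I - P~_k) X, where P~_k is the normalized k-hop adjacency with the
    k-hop circles (diagonal) removed (Eqs. (6) and (8))."""
    Ak = np.linalg.matrix_power(A.astype(float), k)
    np.fill_diagonal(Ak, 0.0)                 # remove k-hop circles
    deg = Ak.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                       # guard against isolated nodes
    P = Ak / deg                              # P~_k = D~^{-1} A~_k
    return np.abs(X - P @ X).sum(axis=1)      # row-wise l1 norm

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0], [1.0], [1.0], [5.0]])    # node 3 deviates from its neighbors
g = neighborhood_info_gain(A, X, k=1)
print(g)  # node 3 gets the largest gain; nodes 0 and 1 are perfectly predicted
```

On this toy graph the node whose signal deviates from its neighborhood receives the largest gain, which is exactly the behavior the pooling operator exploits.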
3.3 Graph Coarsening
With the neighborhood information gain criterion defined above, we can now identify the nodes that should be preserved by the pooling operator. The convolution layers based on message aggregation (see Eq. (1)) perform a series of equivalent low-pass filterings of the graph information. The pooling then identifies the nodes that have the highest information gain, and permits to adaptively downsample the graph while preserving the local characteristics of the graph signals. We describe below a local and a global version of the pooling algorithm, and then show how we construct coarsened graphs after pooling in order to build a multiscale representation of network data.
Similarly to the pooling strategies developed for image representations, a global pooling scheme can aggregate information over the whole graph, at the possible expense of losing graph structure information, while a local pooling strategy can preserve the general structure of the graph, with however a limited receptive field to collect information. As different applications have diverse requirements, it is worthwhile to explore both global and local pooling for graphs. Fortunately, with the neighborhood information gain criterion defined above, we can derive both global and local pooling strategies with only minor adjustments.
Global iPool strategy. On the basis of the neighborhood information gain, nodes are assigned different priorities to construct the coarse graph globally. In order to approximate the information of the graph, the pooling should preserve the nodes that cannot be well represented by their neighbors. In other words, the nodes with relatively high neighborhood information gain have to be preserved in the construction of a coarsened graph. Specifically, the graph nodes are reordered based on the value of their neighborhood information gain. The global pooling strategy then simply selects the ⌈αN_l⌉ nodes that have the highest information gain, as:
idx = rank(g, ⌈αN_l⌉),   (9)
where α ∈ (0, 1] is the pooling ratio, N_l is the number of nodes at layer l, and rank(·) represents the global ranking operator that returns the indices of the top-ranked nodes.
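A minimal realization of this global selection, with the ranking operator implemented by argsort (NumPy, our illustration):

```python
import numpy as np

def global_ipool(g, alpha):
    """Keep the ceil(alpha * N) nodes with the largest information gain (Eq. (9))."""
    n_keep = int(np.ceil(alpha * len(g)))
    # argsort is ascending: reverse it and keep the first n_keep indices
    return np.argsort(g)[::-1][:n_keep]

g = np.array([0.1, 2.0, 0.5, 3.0, 0.2, 1.0])
idx = global_ipool(g, alpha=0.5)
print(idx)  # -> [3 1 5]
```

The selection is a single sort over the gain vector, which is consistent with the parameter-free, "plug and play" nature of the operator.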
Local iPool strategy. Pooling can also be implemented locally, and nodes can be selected within each receptive field, similarly to what is done for images. Receptive fields may however overlap, and we propose to normalize the information gain in each neighborhood before local pooling. This reduces the probability that selected nodes mainly come from dominant neighborhoods and permits to have a better distribution of the pooled nodes over the original graph. Mathematically, the neighborhood information gain of each node is normalized by the average neighborhood information gain of its neighborhood:
g̃_i = g_i / ( (1/|N_k(v_i)|) Σ_{v_j ∈ N_k(v_i)} g_j ).   (10)
Then, the nodes are ordered and selected globally in terms of the normalized neighborhood information gain:
idx = rank(g̃, ⌈αN_l⌉).   (11)
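A sketch of this local variant (NumPy; whether the node itself enters the neighborhood average is an implementation choice, and here we exclude it, consistently with the off-diagonal weighting used in the prediction function):

```python
import numpy as np

def local_ipool(A, g, alpha, k=1):
    """Local iPool: normalize each gain by the mean gain of the k-hop
    neighborhood (Eq. (10)), then select the top nodes (Eq. (11))."""
    Ak = np.linalg.matrix_power(A.astype(float), k)
    np.fill_diagonal(Ak, 0.0)
    nbr = Ak > 0                                       # k-hop neighborhood mask
    nbr_mean = (nbr @ g) / np.maximum(nbr.sum(axis=1), 1)
    g_norm = g / np.maximum(nbr_mean, 1e-12)           # Eq. (10)
    n_keep = int(np.ceil(alpha * len(g)))
    return np.argsort(g_norm)[::-1][:n_keep]           # Eq. (11)

# Path graph 0 - 1 - 2 - 3: node 3 dominates its neighborhood, which
# depresses the normalized gain of its neighbor, node 2.
A = np.diag([1.0, 1, 1], 1) + np.diag([1.0, 1, 1], -1)
g = np.array([1.0, 1.0, 1.0, 10.0])
idx = local_ipool(A, g, alpha=0.5)
```

In this example the dominant node 3 is still selected, but its neighbor (node 2) is not, illustrating how the normalization spreads the pooled nodes across neighborhoods.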
Coarsened graph construction. The above iPool versions select the nodes to be preserved, and we can now construct a coarsened graph with the selected nodes. First, the coarsening matrix S_l of Eq. (2) and Eq. (3) can be obtained directly from the indices of the nodes selected by iPool. Concretely, with idx the indices of the pooled nodes, the coarsening matrix is constructed according to the following formulation, for m = 1, …, ⌈αN_l⌉ and n = 1, …, N_l:
S_l(m, n) = 1 if idx(m) = n, and S_l(m, n) = 0 otherwise.   (12)
The coarsening matrix is sparse; the cardinality of each row is 1, and the cardinality of each column is no more than 1.
The coarsened graph is constructed as follows. To facilitate the propagation of signals on the graph, we first alter the connections between nodes in the original graph. Specifically, there is an edge connecting two nodes if and only if there is a path consisting of at most k_s edges between these two nodes in the original graph. The weights of the edges are further set either as the corresponding elements of the k_s-th power of the adjacency matrix for weighted graphs, or as 1 for unweighted graphs. Namely, we set
Ã_l(i, j) = 1 if there is a path of at most k_s edges between v_i and v_j (i ≠ j), and 0 otherwise,   (13)
for unweighted graphs. For weighted graphs, we set
Ã_l(i, j) = (A_l^{k_s})_{ij} for i ≠ j,   (14)
where the value of k_s is set to either 1 or 2, depending on the properties of the datasets and applications. Notably, the hyperparameter k_s plays a similar role as the stride hyperparameter in image pooling. The coarsening matrix in Eq. (12) is applied to the adjacency matrix Ã_l of the expanded graph to obtain the coarsened graph. The graph and features at the next layer of the representation thus become:
A_{l+1} = S_l Ã_l S_l^T,   (15)
X_{l+1} = S_l X_l.   (16)
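Putting Eqs. (12)-(16) together, a dense sketch of the coarsened-graph construction (NumPy; we read "path of at most k_s edges" as reachability through (A + I)^{k_s}, which is our interpretation of the expansion step):

```python
import numpy as np

def coarsen_graph(A, X, keep_idx, k_s=2, weighted=False):
    """Build the coarsened graph: expand the adjacency to k_s-hop
    reachability, then restrict it to the selected nodes (Eqs. (12)-(16))."""
    N = A.shape[0]
    S = np.zeros((len(keep_idx), N))
    S[np.arange(len(keep_idx)), keep_idx] = 1.0      # selection matrix, Eq. (12)

    # (A + I)^{k_s} connects nodes reachable within k_s hops
    A_exp = np.linalg.matrix_power(A.astype(float) + np.eye(N), k_s)
    np.fill_diagonal(A_exp, 0.0)
    if not weighted:
        A_exp = (A_exp > 0).astype(float)            # unweighted case, Eq. (13)

    return S @ A_exp @ S.T, S @ X                    # Eqs. (15)-(16)

# Path graph 0 - 1 - 2 - 3; keep nodes 0 and 2
A = np.diag([1.0, 1, 1], 1) + np.diag([1.0, 1, 1], -1)
X = np.arange(4, dtype=float).reshape(4, 1)
A_c, X_c = coarsen_graph(A, X, keep_idx=np.array([0, 2]), k_s=2)
print(A_c)  # nodes 0 and 2 become adjacent through the 2-hop path via node 1
```

The expansion step is what keeps the coarsened graph connected: nodes 0 and 2 are not adjacent in the original path graph, but the k_s = 2 expansion links them before the restriction to the kept nodes.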
Note that, during the training phase of the neural network architecture, the iPool operator, like any general end-to-end differentiable layer, takes as inputs the adjacency matrix and node features of graphs, and produces the adjacency matrix and node features of the coarsened graphs. Furthermore, the operator is generic enough to be integrated in diverse architectures.
4 Properties of iPool
In this section, we discuss the relationship between the neighborhood information gain and the neighborhood conditional entropy, and derive some properties of iPool.
The neighborhood information gain is defined as the deviation between the observed signal and the signal predicted from its neighborhood. Empirically, this deviation reflects the uncertainty of one signal value given the other values in its neighborhood. In information theory, entropy quantifies uncertainty, and the conditional entropy specifically measures the amount of information in one variable given the values of other variable(s). Generalized to graph signals, we arrive at the neighborhood conditional entropy by taking as variables the signals supported on the nodes of the neighborhood:

H(x_i | x_{N(v_i)}) = H(x_i) − I(x_i ; x_{N(v_i)}),   (17)

where H(·) is the entropy and I(· ; ·) represents the mutual information. Actually, under a certain assumption on the conditional distribution, the proposed neighborhood information gain has a close relationship with the neighborhood conditional entropy.
Proposition 1.
Assume that the components of the neighborhood conditional distribution of each node are independent and that each component x_{ij} | x_{N(v_i)} follows a Laplace distribution Laplace(μ_{ij}, b) with μ_{ij} = x̂_{ij}; then the neighborhood information gain g_i of each node is an approximate empirical estimate of its neighborhood conditional entropy.
With Prop. 1, the proposed iPool algorithm is a constructive way to coarsen graphs in accordance with a maximum neighborhood conditional entropy strategy. Specifically, the global iPool preserves the nodes with the maximum neighborhood conditional entropy, while the local iPool selects nodes with relatively large neighborhood conditional entropy. Furthermore, the global iPool strategy implicitly assumes that all nodes share the same or similar variations of the neighborhood conditional distribution, while the local iPool strategy relaxes this restriction to nodes within the same neighborhood having similar variations of the neighborhood conditional distribution. The proof is presented in the Appendix.
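The connection asserted in Prop. 1 can be made concrete with two standard identities (a sketch under the stated Laplace assumption, not the appendix proof; taking the mean μ_{ij} as the prediction x̂_{ij} is part of that assumption):

```latex
H\big(x_{ij}\mid \mathbf{x}_{\mathcal{N}(v_i)}\big) = 1 + \ln(2b),
\qquad
-\ln p\big(x_{ij}\mid \mathbf{x}_{\mathcal{N}(v_i)}\big)
  = \ln(2b) + \frac{\lvert x_{ij} - \mu_{ij}\rvert}{b}.
```

Summing the negative log-likelihood over the independent components recovers, up to the affine terms in b, the l1 deviation g_i = Σ_j |x_{ij} − x̂_{ij}| of Eq. (7); averaging it over observations therefore yields an empirical estimate of the conditional entropy, which is what Prop. 1 states.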
We now further show that iPool is invariant to isomorphic graphs so that the graph neural networks consisting of iPool combined with other components that are also invariant to graph isomorphism, will produce invariant representations for isomorphic graphs. This property is important for a wide collection of discriminative tasks, such as graph classification. We prove this proposition in Appendix.
Proposition 2.
For any isomorphic graphs G and G′, iPool will produce the same coarsened graphs.
In summary, the information gain criterion used in iPool is actually a constructive approximation of the maximum neighborhood conditional entropy, which further validates the choice of this measure to preserve the graph information. Furthermore, the iPool algorithm is invariant under graph isomorphism, which is very important in numerous tasks.
5 Experiments
In this section, we evaluate the effectiveness of the proposed pooling scheme in graph classification tasks, and compare with a variety of kernel based methods and neural network based methods.
5.1 Experimental settings
We evaluate the proposed pooling algorithm in the context of deep graph convolution networks. The hierarchical graph convolution networks used in the experiments follow the general framework of GraphSAGE [7] and consist of a stack of convolution modules, iPool layers, a readout module, and a prediction module, as presented in Fig. 2.
Table 1: Graph classification accuracy (%).

Method  ENZY  D&D  REDM12K  COLL  PROT
GRAPHLET  41.03  74.85  21.73  64.66  72.91
SP  42.32  78.86  36.93  59.10  76.43
WL  53.43  74.02  39.03  78.61  73.76
WL-OA  60.13  79.04  44.38  80.74  75.26
PSCN  -  76.27±2.64  41.32±0.42  72.60±2.15  75.00±2.51
GRAPHSAGE  54.00±4.36  78.08±3.52  41.19±7.19  76.28±1.67  76.55±4.20
ECC  53.50  74.10  41.73  67.79  72.65
SET2SET  55.50±6.99  78.51±3.59  > 7 days  75.46±1.40  76.73±3.97
SORTPOOL  57.12  79.37  41.82  73.76  75.54
DIFFPOOL  58.67±5.37  79.19±4.17  47.37±1.02  76.38±1.93  77.00±2.93
Proposed global  56.00±7.72  78.76±3.45  47.02±1.01  76.86±1.67  76.46±3.22
Local ()  59.00±5.73  79.45±2.78  47.24±1.10  77.28±2.17  77.36±3.27
Local ()  57.50±6.20  78.93±4.07  47.64±1.56²  77.20±1.76  76.72±3.06

² Better performance is achieved with other configurations; detailed information is presented in the ablation studies.
Table 2: Statistics and properties of the datasets.

Dataset  ENZY  D&D  REDM12K  COLL  PROT  NCI1  NCI109  MUTAG  IMDBB  IMDBM
Avg. #nodes  32.63  284.32  391.41  74.49  39.06  29.87  29.68  17.93  19.77  13.00
Avg. #edges  62.14  715.66  456.89  2457.78  72.82  32.30  32.13  19.79  96.53  65.94
Node labels  yes  yes  no  no  yes  yes  yes  yes  no  no
#Classes  6  2  11  3  2  2  2  2  2  3
#Graphs  600  1178  11929  5000  1113  4110  4127  188  1000  1500
The convolution module consists of three graph convolution layers. For the graph convolution layer, we follow the design of common architectures and choose the general message propagation and aggregation formulation. Specifically, we adopt the following convolution module:
X_{l+1} = norm( ReLU( Â X_l W_l + b_l ) ),   (18)
where ReLU denotes the rectified linear unit activation function and norm(·) indicates an l2 normalization function, used to stabilize and accelerate the training process. In more detail, this convolution operation consists of three steps: (1) Neighborhood information aggregation: through Â X_l, each node propagates information in its neighborhood and makes a weighted summation of the information from its neighbors, utilizing the weights of the shift matrix Â. This is analogous to weighted average filtering, a common low-pass filtering, and thereby makes the graph convolution networks robust to noisy signals to some degree. (2) Affine transformation: if we unfold the formulation, we see that each node undergoes an affine transformation after aggregation, because of the shared parameters W_l and b_l. (3) Nonlinear mapping: the ReLU function further performs a pointwise nonlinear transformation. The feature maps of the convolution layers within a convolution module are concatenated to form the outputs of the module.

The pooling module then follows the convolution module to coarsen graphs in accordance with the iPool operator, as introduced in Section 3. In addition to the convolution and pooling modules, a readout module is utilized to obtain graph embeddings of the different coarsened versions, and these graph embeddings are concatenated to produce the final graph representation:
z = [ r(X_1) ∥ r(X_2) ∥ ⋯ ∥ r(X_L) ],   (19)
where r(·) indicates an element-wise operator that aggregates the information of all nodes along each feature dimension. Specifically, an element-wise mean operator is used on the biological datasets, and an element-wise sum operator is adopted on the social network datasets. A prediction module is finally added to the architecture for graph classification. It consists of two fully connected layers and a softmax layer, which predict the graph category based on the graph representation z.
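The convolution step of Eq. (18) and the readout of Eq. (19) can be sketched together in dense NumPy (our illustrative sketch; the actual implementation is in PyTorch, and the explicit bias term is our assumption):

```python
import numpy as np

def conv_layer(A_hat, X, W, b):
    """One convolution layer of Eq. (18): aggregate with the shift operator,
    affine-transform, apply ReLU, then l2-normalize each node embedding."""
    H = np.maximum(A_hat @ X @ W + b, 0.0)            # steps (1)-(3)
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, 1e-12)               # row-wise l2 normalization

def readout(feature_maps, op=np.mean):
    """Eq. (19): element-wise aggregation over nodes at each scale, then
    concatenation of the per-scale embeddings."""
    return np.concatenate([op(X, axis=0) for X in feature_maps])

rng = np.random.default_rng(0)
A_hat = np.eye(3) + np.array([[0., 1, 0], [1, 0, 1], [0, 1, 0]])  # A + I as shift operator
X = rng.standard_normal((3, 4))                        # 3 nodes, 4 input features
H = conv_layer(A_hat, X, rng.standard_normal((4, 8)), np.zeros(8))
z = readout([X, H], op=np.mean)                        # two scales concatenated
print(z.shape)  # (12,)
```

Swapping `op=np.sum` for `op=np.mean` reproduces the sum readout used on the social network datasets.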
Table 3: Graph classification accuracy (%).

Method  NCI1  NCI109  MUTAG  IMDBB  IMDBM
PSCN  76.34±1.68  -  88.95±4.37  71.00±2.29  45.23±2.84
GRAPHSAGE  78.76±1.73  77.92±2.47  86.78±8.88  72.10±2.66  49.20±3.94
ECC  76.82  75.03  76.11  -  -
SORTPOOL  74.44  -  85.83  70.03  47.83
k-GNNs  76.2  -  86.1  74.2  49.5
SET2SET  80.80±1.72  79.09±1.96  86.78±7.33  71.00±7.54  49.73±4.19
DIFFPOOL  80.46±1.22  79.04±1.25  88.87±6.75  73.00±3.22  49.60±3.87
Proposed global  80.46±1.66  78.80±2.62  89.42±5.68  72.90±3.08  50.73±3.68
Local ()  81.41±1.53  80.01±2.32  87.84±6.12  73.10±2.98  50.53±2.71
Local ()  81.58±1.46  80.03±2.05  90.42±4.68  73.30±2.72  51.27±3.44
We implement the proposed model in PyTorch [16]. We conduct experiments to classify graphs on ten public benchmark graph datasets, including biological datasets (MUTAG, ENZYMES, D&D, PROTEINS, NCI1, NCI109) and social network datasets (IMDB-BINARY, IMDB-MULTI, COLLAB, REDDIT-MULTI-12K).³ Statistics and properties of the datasets are presented in Table 2. Node categorical features are adopted in the biological datasets, while constant features are utilized in the social datasets. Following prior methods [28, 27], we perform 10-fold cross-validation on all of the datasets and report the best average accuracy. In the experiments, we employ two kinds of networks: a small one with 30 hidden neurons and a large one with 64 hidden neurons at each convolution layer. Also, one iPool layer is utilized for most datasets, except that two iPool layers are employed for D&D and REDDIT-MULTI-12K because of the large number of nodes per graph. The pooling ratio α is set to one value for the medium-sized datasets, namely ENZYMES, PROTEINS, and COLLAB, and to another for the other datasets. The networks are optimized by a mini-batch gradient descent algorithm (batch size = 20) with the Adam optimizer. The following hyperparameters are tuned for each dataset: (1) the learning rate; (2) the dropout ratio and the weight decay. Detailed information about the hyperparameters is presented in the Appendix (Table A1). Finally, we compare the proposed method with a collection of state-of-the-art kernel-based solutions as well as GNN-based graph representation learning methods. For graph kernel methods, we compare with the Weisfeiler-Lehman subtree kernel (WL) [20], the Weisfeiler-Lehman optimal assignment kernel (WL-OA) [11], the graphlet kernel [19], and the shortest-path kernel (SP) [1]. On the other hand, a series of graph neural network variants and different graph pooling schemes designed for deep graph neural networks are taken into consideration, including PSCN [15], GraphSAGE [7], ECC [22], k-GNNs [14], the global pooling scheme with LSTMs (SET2SET) [24], the sort pooling (SORTPOOL) scheme [29], and DIFFPOOL, which generates the coarsening matrix with an extra GNN [28]. Results of the baseline models are cited from the original works, except the results of the GraphSAGE, SET2SET, and DIFFPOOL methods, which are obtained using their public code with the same architecture as used for iPool.

³ Datasets can be downloaded from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

Method  NCI1  NCI109  MUTAG  IMDBB  IMDBM  Multiple

SET2SET  24.86 (7.1)  26.08 (7.1)  0.45 (2.0)  7.71 (7.1)  8.25 (7.1)  7.1 
DIFFPOOL  4.03 (1.1)  4.05 (1.1)  0.25 (1.1)  1.25 (1.1)  1.30 (1.1)  1.1 
Proposed  3.51  3.66  0.22  1.09  1.16  1 
Table 4: Running time (s/epoch); numbers in parentheses are multiples of the proposed method's running time.
5.2 Experimental results and analysis
Table 5: Ablation study on multiscale representations: classification accuracy (%).

Dataset  Base0  Base1  Base2  iPool1  iPool2
D&D  78.52±3.89  76.37±3.13  75.94±4.47  79.28±2.29  79.45±2.78
REDM12K  46.85±1.13  48.44±1.23  48.58±1.65  48.71±0.95  47.64±1.56
Table 1 and Table 3 respectively report the performance of the two kinds of networks, the small one (30 hidden neurons) and the large one (64 hidden neurons). The proposed iPool strategies outperform the other GNN-based methods on 9 out of 10 datasets. Compared with the state-of-the-art kernel-based methods, the iPool methods also achieve competitive performance. The limited number of training samples per category can cause the neural networks to overfit, which probably explains the inferior performance of the GNN-based methods on the ENZYMES dataset. With regard to the COLLAB dataset, the WL and WL-OA kernels outperform all GNN variants, which probably results from the large node degrees making it difficult for graph neural networks to effectively aggregate neighborhood information with convolution layers.
We also note that the local pooling strategy performs better than the global one on all of the datasets. The local strategy prefers nodes from different neighborhoods, so the coarsened graphs can better preserve the structure of the original graphs, as demonstrated in Fig. 3. In addition, the number of hops used in the prediction function (Eq. (5)) has an important impact on the classification performance and is closely related to the average node degree. Specifically, the sparser the connections between the nodes of a graph, the larger the number of hops needed to achieve the best performance. This is consistent with intuition: with a large number of hops, the prediction function in densely connected graphs aggregates global information about the whole graph rather than local information about each neighborhood.
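To make the node-selection criterion concrete, the following sketch scores each node by the residual between its signal and a prediction from its k-hop neighborhood. The mean predictor and the unnormalized residual are simplifying assumptions for illustration; the paper's exact prediction function is the one in Eq. (5).

```python
import numpy as np

def khop_information_gain(A, X, k=2):
    """Score each node by how poorly its signal is predicted from its
    k-hop neighbors (assumed mean predictor, for illustration only).

    A: (n, n) adjacency matrix; X: (n, d) node signals.
    """
    n = A.shape[0]
    # reachability within k hops, excluding the node itself
    reach = np.linalg.matrix_power(A + np.eye(n), k) > 0
    np.fill_diagonal(reach, False)
    gains = np.zeros(n)
    for i in range(n):
        nbrs = np.where(reach[i])[0]
        if nbrs.size == 0:
            continue                              # isolated node: nothing to predict from
        pred = X[nbrs].mean(axis=0)               # predict the node signal from its neighbors
        gains[i] = np.linalg.norm(X[i] - pred)    # neighborhood information gain (residual)
    return gains
```

Nodes with large gains are poorly predicted by their neighborhood, hence the most informative ones to keep. In a sparser graph, a larger k is needed before the predictor aggregates enough context, matching the observation above.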
Without a graph pooling layer, the GraphSAGE and SET2SET methods extract representations of graphs at the scale of the original signal. In contrast, DIFFPOOL and iPool obtain multiscale graph representations and achieve better results on most datasets. Moreover, the iPool scheme further outperforms DIFFPOOL on all of the datasets, probably because it better preserves the structure of the original graphs and the localization property of graph signals, as demonstrated in Fig. 3. In addition, iPool runs about 7 times faster than SET2SET and 1.1 times faster than DIFFPOOL, as illustrated in Table 4.
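A coarsening step in this spirit can be sketched as follows: keep the highest-gain nodes and connect two kept nodes whenever they are within two hops of each other in the original graph. The top-m selection follows the information criterion; the 2-hop connectivity rule is an assumption for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

def coarsen(A, gains, ratio=0.5):
    """Select the most informative nodes and induce a coarsened adjacency.

    Kept nodes are linked if they are within 2 hops in the original graph
    (illustrative rule; the paper's construction maintains consistency
    with the original topology).
    """
    n = A.shape[0]
    m = max(1, int(np.ceil(ratio * n)))
    keep = np.argsort(-gains)[:m]             # top-m nodes by information gain
    # 1- or 2-hop connectivity of the original graph, without self-loops
    A2 = ((A + A @ A) > 0).astype(float)
    np.fill_diagonal(A2, 0)
    A_coarse = A2[np.ix_(keep, keep)]         # restrict to the kept nodes
    return keep, A_coarse
```

On a 4-node path graph, for instance, keeping the two endpoints of a 2-hop pair yields a coarsened graph where they remain directly connected, preserving the original connectivity pattern at the coarser scale.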
Ablation studies. We further explore the effectiveness of the iPool operator for multiscale representations of graph data with a series of ablation experiments on the D&D and REDDIT-MULTI-12K datasets, the largest bioinformatics and social network datasets, respectively, in terms of the average number of nodes per graph. Specifically, we conduct experiments on three network architectures, presented in Fig. 2, that deal with single-scale (Base0: 1 convolution module), double-scale (iPool1: 2 convolution modules and 1 pooling module), and triple-scale (iPool2: 3 convolution modules and 2 pooling modules) graph representations. To eliminate the impact of the depth of the neural networks, we further extract single-scale graph representations with the same architecture as their multiscale counterparts except for the pooling layers, i.e., Base1 corresponds to iPool1 and Base2 to iPool2. We use the same procedure as for Table 1 and report 10-fold cross-validation results in terms of the mean and standard deviation of classification accuracy in Table 5.
According to Fig. 4 and Table 5, we have the following findings. (1) The iPool operation accelerates the convergence of representation learning in the training phase. Concretely, the iPool2 and iPool1 models converge faster than their single-scale counterparts Base2 and Base1, and far faster than the single-scale shallow model Base0, on both datasets. (2) The iPool operation improves the classification performance of the models thanks to its hierarchical representation. Specifically, models with iPool achieve the best performance on both datasets. Compared to iPool1, the inferior performance of iPool2 on the REDDIT-MULTI-12K dataset probably results from overfitting, given the simple constant node signals and the graphs scaled down by two pooling operators.
6 Conclusion
We have proposed in this paper a low-complexity and adaptive graph pooling operator for GNNs, which improves their ability to distill hierarchical representations of graphs and network data. The new operator has interesting properties in practice: it is mostly based on local computations, and it is invariant under graph isomorphism. The proposed iPool solution further achieves state-of-the-art performance on several graph classification datasets. An interesting future direction is to explore other neighborhood prediction functions, so as to align the neighborhood information gain with more general conditional distributions of nodes. It is also worthwhile to combine the proposed pooling operation with other convolution schemes, such as GINs.
References
 Borgwardt and Kriegel [2005] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining (ICDM’05), 8 pp. IEEE, 2005.
 Bruna et al. [2014] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference for Learning Representations, 2014.
 Bui and Jones [1992] T. N. Bui and C. Jones. Finding good approximate vertex and edge partitions is NP-hard. Information Processing Letters, 42(3):153–159, 1992.
 Chevalier and Safro [2009] C. Chevalier and I. Safro. Comparison of coarsening schemes for multilevel graph partitioning. In International Conference on Learning and Intelligent Optimization, pages 191–205. Springer, 2009.
 Defferrard et al. [2016] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 Gori et al. [2005] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE, 2005.
 Hamilton et al. [2017] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 Karypis and Kumar [1998] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998.
 Khasanova and Frossard [2017] R. Khasanova and P. Frossard. Graph-based isometry invariant representation learning. arXiv preprint arXiv:1703.00356, 2017.
 Kipf and Welling [2017] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference for Learning Representations, 2017.
 Kriege et al. [2016] N. M. Kriege, P.-L. Giscard, and R. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
 Kushnir et al. [2006] D. Kushnir, M. Galun, and A. Brandt. Fast multiscale clustering and manifold identification. Pattern Recognition, 39(10):1876–1891, 2006.
 Kushnir et al. [2009] D. Kushnir, M. Galun, and A. Brandt. Efficient multilevel eigensolvers with applications to data analysis tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1377–1391, 2009.
 Morris et al. [2019] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

 Niepert et al. [2016] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
 Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
 Scarselli et al. [2009a] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1):81–102, 2009a.
 Scarselli et al. [2009b] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009b.
 Shervashidze et al. [2009] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.
 Shervashidze et al. [2011] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shuman et al. [2016] D. I. Shuman, M. J. Faraji, and P. Vandergheynst. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.

 Simonovsky and Komodakis [2017] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2017.
 Veličković et al. [2018] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
 Vinyals et al. [2015] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. International Conference on Learning Representations, 2015.
 Von Luxburg [2007] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 Wang et al. [2018] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
 Xu et al. [2019] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
 Ying et al. [2018] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.

 Zhang et al. [2018] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Appendix
A1 Proof of Proposition 1
A2 Proof of Proposition 2
Proof.
Since the two graphs are isomorphic, there exists an edge-preserving bijection between their node sets:
(A2)
For each node of the first graph, there is a corresponding node of the second graph, and their neighborhood information gains are respectively:
(A3)
(A4)
Since the bijection is edge-preserving, any edge shares the same weight as its counterpart:
(A5)
Therefore,
(A6)
and the same holds for the normalized neighborhood information gain. Note that the ranking function depends only on the values of the (normalized) neighborhood information gain of the nodes under consideration, and that corresponding nodes are thus ranked identically. Then
(A7)
Thus, the iPool operation is invariant under graph isomorphism. ∎
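The invariance property can also be checked numerically: relabeling the nodes of a graph by a permutation (an isomorphism) permutes the information gains accordingly, so the ranking, and hence the pooled node set, is preserved up to relabeling. A small self-contained check, using a simplified mean-predictor gain as a stand-in for the paper's exact criterion:

```python
import numpy as np

def gains(A, X, k=1):
    """Simplified neighborhood information gain (assumed mean predictor)."""
    n = A.shape[0]
    reach = np.linalg.matrix_power(A + np.eye(n), k) > 0
    np.fill_diagonal(reach, False)
    g = np.zeros(n)
    for i in range(n):
        nbrs = np.where(reach[i])[0]
        if nbrs.size:
            g[i] = np.linalg.norm(X[i] - X[nbrs].mean(axis=0))
    return g

# a small graph and a random relabeling of its nodes
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
X = rng.normal(size=(4, 2))
perm = rng.permutation(4)
P = np.eye(4)[perm]                  # permutation matrix: new node i is old node perm[i]
A_iso, X_iso = P @ A @ P.T, P @ X    # isomorphic copy of the graph

# the gains of the relabeled graph are exactly the relabeled gains
assert np.allclose(gains(A_iso, X_iso), gains(A, X)[perm])
```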
A3 More information about experiments
Dataset  lr  s  kglobal  dropout  weight decay

ENZY  1e-3  2  1  0  1e-4
D&D  1e-4  2  1  0.5  0
REDM12K  1e-3  2  2  0  0
COLL  1e-3  1  1  0  0
PROT  1e-2  2  1  0  1e-4
NCI1  1e-3  2  2  0  1e-4
NCI109  1e-3  2  2  0.5  0
MUTAG  1e-2  2  2  0.5  3e-5
IMDBB  1e-3  1  2  0  1e-4
IMDBM  1e-3  1  2  0  0

Table A1: Hyperparameters for each dataset.