1 Introduction
Although data that can be represented as a grid structure on Euclidean domains, such as images [1], video [2], speech [3], and texts [4], has close connections with daily life, there is another major category of non-Euclidean data, namely graphs, which are constructed from irregularly arranged nodes and the edges indicating their connections. Examples include social networks [5], citation networks [6], road networks [7] and bioinformatics [8].
Different from Euclidean data, the convolution and pooling operations in Convolutional Neural Networks (CNNs) cannot be directly applied to non-Euclidean graph-structured data due to the irregularity and non-determinacy of the neighborhood of a central node. Consequently, it is important to learn a sufficiently expressive representation for graph-structured data in a reasonable way.
1.1 Motivation
A great deal of research on Graph Neural Networks (GNNs) has emerged to generalize the great success of convolution in CNNs to graph-structured data. In GNNs, the convolution operation evolves into neighborhood message aggregation of the central node along edges, thus capturing both node features and graph structural information. Following this principle, various GNNs have been proposed, such as GCN [9], GraphSAGE [10] and GAT [11]. All of them have achieved significant success in graph representation learning, especially for node-level tasks, including node classification [12, 11] and link prediction [13, 14]. However, for graph-level representation learning tasks, such as graph classification [15, 16, 17], graph matching [18] and graph similarity learning [19], the convolution operation alone is deficient. It is nontrivial to empower GNNs to produce discriminative graph-level representations with the help of pooling operations.
To remedy this problem, a few researchers have tried to further generalize the pooling mechanism from CNNs to GNNs for graph-level representation learning. Accordingly, it is natural to raise the question: what are the basic criteria for a high-quality pooling method in GNNs? Different from the pooling operation in CNNs, which reduces the number of computational parameters and preserves invariance, the basic idea of graph pooling is a node feature aggregator over the entire graph, analogous to the neighborhood aggregator in graph convolution. Therefore, a good graph pooling method should encourage graphs with approximate topology and similar node features to have resemblant representations to some extent. As a result, the major challenge for a high-quality pooling mechanism is to define a method that both effectively maintains pivotal node features and explicitly captures important structural information. To address this challenge, previous works have proposed graph pooling architectures in three ways.
First, universal maximum/average pooling methods [20, 19] are intuitively extended to graph models by simple element-wise max- or mean-downsampling over all node features. However, such methods have been proved to ignore feature multiplicities and to completely miss structural information [21]: graphs with different structures may obtain the same representation.
Second, Top-K methods [22, 17, 23, 24, 25] sort graph nodes in a consistent order using scores that represent their importance, and only the K nodes with the highest scores are selected to form the pooled graph. As a result, the nodes outside the top K are discarded outright, even though they may carry important features. Meanwhile, very few Top-K methods utilize local substructure information in the scoring stage, so the selected nodes are likely to be isolated, which may hinder information propagation in subsequent GNN layers.
Third, advanced graph pooling methods, such as DiffPool [16] and ASAP [26], learn graph representations in a hierarchical grouping manner to capture the comprehensive local substructures widespread in graphs. They usually group nodes into smaller clusters over multiple levels and utilize the final level as the graph representation. However, the grouping operation is usually executed on a fixed 1-hop neighborhood, thus forcing information to flow from a certain neighborhood to a specific coarsened cluster and neglecting the higher-order dependency among nodes that may hold significant information, as shown in Fig. 1(a).
1.2 Contributions
To the best of our knowledge, no existing graph pooling method adaptively handles graph local substructures and high-order dependency while capturing node features, and there is a general lack of systematic consideration of how pooling methods affect the generalization of graph-level representations. To bridge this gap, we propose a novel hierarchical graph pooling framework called Hierarchical Adaptive Pooling (HAP). The main contributions can be summarized as follows.

We introduce HAP, a supervised hierarchical pooling framework. HAP is capable of preserving node feature information with adaptive graph structure sensitivity to both local substructures and high-order dependency. We provide a more comprehensible framework by offering exhaustive theoretical analysis of the computational complexity, permutation invariance and design validity of HAP.

We propose master-orthogonal attention (MOA), a novel cross-level attention mechanism specifically designed for hierarchical graph pooling. MOA can be leveraged to capture cross-level interactions under the guidance of graph pattern properties in a more efficient and effective way. MOA also acts as a soft substructure extractor: attention weights for nodes in the receptive field of a possible local cluster are much higher than those beyond the receptive field. This ensures local-substructure sensitivity and introduces high-order node features.

We design GCont, an auto-learned global graph content that plays a significant role in MOA. The key innovation is that it incorporates high-level global pattern properties into the pooling method, making MOA sensitive to latent graph characteristics and producing a more adequate graph-level representation without interference from artificial factors. It is relatively stable during hierarchical pooling and flexible enough to be learned heuristically. GCont also guarantees generalization ability across graphs with the same form of features.
Extensive experiments demonstrate that (1) HAP significantly outperforms twelve graph pooling methods on six real-world datasets for the graph classification task with a maximum accuracy improvement of 22.79%; (2) HAP sharply outperforms the state-of-the-art GMN [18], which is specifically designed for the graph matching task, by boosting accuracy by up to 3.5%; (3) HAP also achieves a maximum accuracy gain of 16.7% compared with conventional approximate GED algorithms and GNN-based graph similarity learning models; (4) the graph coarsening module in HAP dramatically enhances the expressive ability of existing graph pooling architectures for graph-level tasks; (5) HAP achieves good generalization across graphs with the same form of features; and (6) HAP provides meaningful visualizations of graph-level representations.
2 Related Work
In this section, we summarize some related works on graph pooling in GNNs covering the two main types of methods: supervised and unsupervised.
2.1 Supervised Pooling
Supervised pooling methods can be divided into flat pooling and hierarchical pooling according to whether the graph-level representation is aggregated in a flat or hierarchical way with a view to local substructures. Further, flat pooling methods cover two families, universal pooling and Top-K pooling, depending on the number of nodes participating in the final aggregation.
2.1.1 Flat Universal Pooling
Flat universal pooling methods take all nodes into consideration. Earlier works directly borrow the mean- or max-pooling method from CNNs to extract features. Subsequently, Xu et al. [21] find that sum-pooling is much more powerful, because both the mean and max aggregators ignore the multiformity of features, thus struggling to distinguish graphs whose nodes have repeating features. Some other works rely on content-based attention operations. In Gated Graph Neural Networks (GGNNs) [20], the graph-level output is defined by a soft attention mechanism for each node to decide which nodes are more relevant to the graph-level task. Message Passing Neural Networks (MPNNs) [27] further utilize the Set2Set [22] method to take the order of nodes into consideration and find the importance of each node to the graph-level representation through time-consuming iterative soft attention. In SimGNN [19] and UGRAPHEMB [28], a graph content is defined as the average of node features and the attention is executed between the nodes and this content. Obviously, such a man-made design makes the final graph-level representation infinitely close to the output of the mean-pooling method, which is an inefficient method as mentioned above.
2.1.2 Flat Top-K Pooling
Flat Top-K pooling methods score nodes according to their importance, and the nodes with the k largest scores are preserved to form the new graph. SortPooling [23] refers to the graph labeling method WL [29]: it regards the output node features of each GCN layer as continuous WL colors and sorts the nodes according to the last GCN layer's colors. AttPool [24] calculates the scores using a global soft-attention mechanism; furthermore, a local attention variant accesses node degree information, which helps keep a balance between importance and dispersion. gPool [17] develops the new idea of using the projection of node features onto a trainable projection vector as the node score. SAGPool [25] considers both node features and graph topology during pooling by using a GCN to calculate attention scores. However, important information contained in the abandoned nodes may be ignored, although it should be explicitly captured in graph pooling. Moreover, no structural relationships among nodes are acquired during pooling, which may lead to unconnectedness of the selected nodes.
2.1.3 Hierarchical Group Pooling
Due to the fact that local substructures are present in real-world graphs, hierarchical group pooling methods have come into being. DiffPool [16] is the first differentiable group pooling approach; it learns a dense assignment matrix to group the 1-hop neighborhood of nodes into clusters in each hierarchical layer. Subsequently, to address the sparsity concerns in DiffPool, ASAP [26] is proposed, which combines both Top-K and group methods. Clusters are generated by aggregating the h-hop neighbors of each central node to leverage the graph structure, and then only the top-scoring clusters are maintained. However, ASAP still cannot guarantee connectivity between the selected clusters. Moreover, Mesquita et al. [30] indicate that a successful graph pooling method should not be restricted to nearby nodes. Thus we propose that high-order structural dependency also contributes to the construction of a good pooling region.
2.2 Unsupervised Pooling
Loss functions of the aforementioned graph pooling methods are usually task-based and supervised, except for DiffPool, which additionally exploits a link prediction loss and enforces nearby nodes to be pooled together. Recently, there is growing interest in unsupervised graph pooling that minimizes objectives related to graph structure characteristics or borrowed from graph theory. StructPool [31] employs conditional random fields (CRFs) to capture high-order structural relationships by minimizing the Gibbs energy. MinCutPool [32] continuously relaxes the normalized minCUT problem in graph theory and optimizes the cluster assignments by minimizing this objective. UGRAPHEMB [28] utilizes well-accepted and domain-agnostic graph proximity metrics to provide extra graph-graph proximity guidance during learning. These novel ideas offer the possibility of breaking the logjam of current graph pooling research.
It is also worth mentioning that there is a common challenge for universal pooling, Top-K pooling and group pooling alike: the element-wise aggregation, score ranking and cluster assignment learning processes are merely executed on a single fixed graph, lacking inductive capability for entirely new graphs.
3 Preliminaries
3.1 Problem Statement
A graph is represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{L})$, where $\mathcal{V}$ denotes the set of nodes, $e_{ij} \in \mathcal{E}$ is the edge linking node $v_i$ and node $v_j$, and $\mathcal{L}$ consists of the node labels (no node labels are provided in some cases). For a graph with $n$ nodes and $m$ edges, $A \in \mathbb{R}^{n \times n}$ represents the weighted adjacency matrix and $D$ is a diagonal matrix whose diagonal elements stand for the node degrees. $X \in \mathbb{R}^{n \times d}$ denotes the node feature matrix and $h_{\mathcal{G}}$ is the graph-level embedding. A label $y$ may also be attached to the graph $\mathcal{G}$. Detailed notations are summarized in TABLE I. Given a graph dataset, the graph pooling task aims to learn a mapping from a node feature matrix to a single graph representation.
Notations  Definitions or Descriptions 
$\mathcal{G}$, $\mathcal{G}'$  the input/coarsened graph 
$\mathcal{V}$, $\mathcal{E}$, $\mathcal{L}$  the node/edge/node label set of $\mathcal{G}$ 
$n$, $n'$  the number of nodes of $\mathcal{G}$ / $\mathcal{G}'$ 
$A$, $A'$  the adjacency matrix of $\mathcal{G}$ / $\mathcal{G}'$ 
$D$  the degree matrix of $\mathcal{G}$ 
$X$, $X'$  the node feature matrix of $\mathcal{G}$ / $\mathcal{G}'$ 
$h_{\mathcal{G}}$  the graph-level embedding of $\mathcal{G}$ 
$K$  the number of graph coarsening modules 
$C$  the auto-learned global graph content (GCont) 
$C_{i,:}$  row $i$ in $C$, referring to a node of the source graph $\mathcal{G}$ 
$C_{:,j}$  column $j$ in $C$, referring to a cluster of the target graph $\mathcal{G}'$ 
$M$  the MOA matrix 
$d$  the dimension of the input node features 
$d'$  the dimension of the output node features 
$d_g$  the dimension of the graph-level embedding 
$y$  the label of graph $\mathcal{G}$ 
3.2 Downstream Tasks
Most previous works only focus on the application of graph pooling to graph classification, which is an important but partial graph-level representation learning task. In this paper, we formally summarize and define the downstream graph pooling tasks and conduct exhaustive experiments on them:

Graph Classification: Given an input graph $\mathcal{G}$, the graph classification task tries to learn a mapping from the graph to the corresponding graph label $y$.

Graph Matching: Given an input graph pair $(\mathcal{G}_1, \mathcal{G}_2)$, the graph matching task aims to determine whether $\mathcal{G}_1$ and $\mathcal{G}_2$ are isomorphic to each other. (For a pair of graphs, graph isomorphism decides whether there exists a bijective function between them such that nodes are connected in the same way.)

Graph Similarity Learning: Given an input graph triple $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$, the graph similarity learning task manages to explore whether $\mathcal{G}_1$ is closer to $\mathcal{G}_2$ or to $\mathcal{G}_3$.
3.3 Graph Neural Networks
Given node features $X$ and graph structure $A$, modern GNNs usually learn useful node representations in a neighborhood aggregation fashion following the general "message-passing" architecture. The forward process comprises two phases, each of which iteratively runs for $T$ time steps. The message passing phase aggregates information along the edges of the central node from its neighbors. Then the combination phase updates the representation of the central node based on the message:

$$m_v^{(t)} = \mathrm{AGGREGATE}^{(t)}\big(\{\, h_u^{(t-1)} : u \in \mathcal{N}(v) \,\}\big) \qquad (1)$$

$$h_v^{(t)} = \mathrm{COMBINE}^{(t)}\big(h_v^{(t-1)},\, m_v^{(t)}\big) \qquad (2)$$

where $h_v^{(t)}$ is the embedding of node $v$ at the $t$-th iteration, initialized as $h_v^{(0)} = x_v$, and $h_u^{(t-1)}$ is the representation of node $v$'s neighbor $u \in \mathcal{N}(v)$, which depends on the adjacency matrix.
There are multiple selectable implementations of $\mathrm{AGGREGATE}$ and $\mathrm{COMBINE}$ that have been adapted successfully to different GNN models. Actually, our HAP pooling framework can be consolidated into any GNN model following the implementation of Equation 1 and Equation 2. After $T$ iterations, the representation of the central node captures the features and structural information within its $T$-hop neighborhood.
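As a concrete illustration, the following minimal PyTorch sketch performs one message-passing iteration with a sum aggregator and an MLP combiner; the function and variable names are illustrative, not taken from the paper.

```python
import torch

def message_passing_step(h, adj, combine_mlp):
    """One iteration of Eqs. (1)-(2): sum-aggregate neighbor embeddings,
    then combine them with the central node's previous embedding.

    h:   (n, d) node embeddings at step t-1
    adj: (n, n) adjacency matrix (1 where an edge exists)
    combine_mlp: torch.nn.Module mapping (n, 2d) -> (n, d)
    """
    messages = adj @ h                                      # Eq. (1): sum over 1-hop neighbors
    h_new = combine_mlp(torch.cat([h, messages], dim=-1))   # Eq. (2): combine with old state
    return h_new

# Example usage: stacking T such steps yields T-hop receptive fields.
n, d = 5, 8
h = torch.randn(n, d)
adj = (torch.rand(n, n) > 0.5).float()
mlp = torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.ReLU())
h = message_passing_step(h, adj, mlp)
```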
3.4 Graph Attention
A graph attention mechanism executed between a query and a key allows for allocating diverse alignment scores to different parts of the input, making the model focus on the most relevant portion. Existing graph attention mechanisms can be divided into node-level attention and master-level attention according to the attention scope. Specifically, node-level attention covers both self-attention and cross-attention.
Hard-Self-Attention (HSA) [11] chooses both the query and the key from the node features of the single input graph to find the node dependency on itself:

$$e_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\,T}\,[W h_i \,\|\, W h_j]\big) \qquad (3)$$

where $W$ and $\vec{a}$ are trainable parameters, and $\|$ is a concatenation operation.
Soft-Self-Attention (SSA) [20] decides which nodes are relevant to the current graph-level task, so the query is defined on the node features while no specific key is provided:

$$a_v = \mathrm{softmax}\big(f_{\mathrm{gate}}(h_v)\big) \qquad (4)$$

where $f_{\mathrm{gate}}$ is a learnable scoring function over the node features.
Cross-Attention (CA) [18] captures the differences between graphs by doing comparisons across the pair of graphs, choosing the query and the key from the node features of the pairwise input, thus fusing information from both graphs:

$$a_{i \to j} = \frac{\exp\big(s(h_i, h_j)\big)}{\sum_{j'} \exp\big(s(h_i, h_{j'})\big)} \qquad (5)$$

where $h_i$ and $h_j$ are node features from the two graphs and $s(\cdot, \cdot)$ is a similarity function.
4 The Proposed Method: HAP
In this section, we present HAP, a hierarchical graph pooling framework for graph-level tasks. Its key idea is the graph coarsening module, supported by the novel graph pattern property extracting technique GCont and the cross-level attention mechanism MOA, which complement and reinforce each other. This not only prompts the GNN model to be sensitive to both local substructures and high-order dependency, but also empowers it with stronger generalization ability. Below, we discuss the components of HAP in detail.
4.1 Hierarchical Framework
Figure 2 illustrates the overall architecture of the HAP. Given single, pairwise or triplet input graphs for differentiated graphlevel tasks, HAP extracts the node features and graph structure information for an endtoend training. The process can be decomposed into six main steps:

Input Construction Single-input graph classification and pairwise graph matching require no special operation on the given dataset, which consists of single graphs or pairs of graphs with true labels indicating which class a graph belongs to or whether a pair matches. However, a triplet generator is necessary for the graph similarity learning task.

Node & Cluster Embedding Subsequently, single or pairs of graphs, or the generated triplets are transferred into a node & cluster embedding module to learn a lowdimensional node vector representation for each node or coarsened cluster. The cluster representations will be maintained for hierarchical similarity measuring.

Graph Coarsening-I Then, a learnable GCont defines a coarsening preparation step for each graph by extracting global pattern properties. Its rows and columns correspond to source nodes before coarsening and target clusters after coarsening, respectively.

Graph Coarsening-II Furthermore, the MOA mechanism is utilized to obtain an attention assignment. The attention coefficient matrix, each element of which indicates the contribution of a node from the source graph $\mathcal{G}$ to a cluster of the target graph $\mathcal{G}'$, is derived from the GCont.

Graph Coarsening-III Afterwards, a cluster formation function is learned to compute the cluster representations after one coarsening.

Learning: The loop between Step 2 and Step 5 is executed until a satisfactory graph scale is reached. HAP then calculates the corresponding task loss with the hierarchical graph representations to constantly optimize all the weight parameters.
4.2 Input Construction
For the graph similarity learning task, training and testing data in triplet form is essential to learn the relative similarity among graphs. To the best of our knowledge, there is no ready-to-use graph dataset in triplet form. To bridge this gap, we propose a triplet generator in this subsection.
Given a dataset of single graphs, denoted as $\mathcal{D} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$, the similarity between every two graphs $\mathcal{G}_i$ and $\mathcal{G}_j$ can be measured under a graph-graph proximity metric, such as Graph Edit Distance (GED). The smaller the GED, the more similar the pair. The pairwise ground-truth proximity is then denoted as follows:

$$s_{ij} = \mathrm{GED}(\mathcal{G}_i, \mathcal{G}_j) \qquad (8)$$

Afterwards, we construct triplets by fixing the first position with one graph and randomly choosing two disparate graphs to fill the remaining two positions:

$$\mathcal{T} = \big\{ (\mathcal{G}_i, \mathcal{G}_j, \mathcal{G}_k) \;\big|\; \mathcal{G}_i, \mathcal{G}_j, \mathcal{G}_k \in \mathcal{D},\; i \neq j \neq k \big\} \qquad (9)$$

Synchronously, the ground-truth triplet proximity is generated as follows, in which a positive number means that graph $\mathcal{G}_j$ is more similar to graph $\mathcal{G}_i$ and a negative number means that graph $\mathcal{G}_k$ is more similar to graph $\mathcal{G}_i$:

$$t_{(i,j,k)} = s_{ik} - s_{ij} \qquad (10)$$
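The sampling procedure can be sketched as follows; `ged` stands in for any graph-graph proximity metric, and the sign convention of the relative proximity reconstructed in Eq. (10) (positive when the second graph is the closer one) is an assumption of this sketch.

```python
import random

def build_triplets(graphs, ged, num_triplets):
    """Fix the first position, sample two distinct other graphs (Eq. 9),
    and label the triplet with the relative GED difference (Eq. 10)."""
    triplets = []
    for _ in range(num_triplets):
        gi = random.choice(graphs)
        gj, gk = random.sample([g for g in graphs if g is not gi], 2)
        # Positive t: gj is more similar to gi; negative t: gk is more similar.
        t = ged(gi, gk) - ged(gi, gj)
        triplets.append((gi, gj, gk, t))
    return triplets
```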
4.3 Node & Cluster Embedding
Node or cluster embedding is required to extract node or cluster features before entering the next graph coarsening module. In this paper, we employ a two-layer GAT [11] or GCN [9] as the basic component, since both are well capable of capturing the local structural information of a node. Actually, any mainstream GNN can be integrated into the HAP framework. Please note that the number of GAT or GCN layers depends on the actual application graph data.
For the $k$-th GAT layer, it takes graph $\mathcal{G}$'s adjacency matrix $A$ and the hidden representation matrix $H^{(k-1)}$ as input, then formulates the phase as a weighted-attention-based operator:

$$H^{(k)} = \sigma\big((\alpha \odot A)\, H^{(k-1)} W^{(k)}\big) \qquad (11)$$

where $\sigma$ is a nonlinear activation function such as ReLU or Sigmoid, $\alpha$ is a trainable global attention assignment among all nodes, the element-wise product with $A$ picks one-hop neighborhood attention, and $W^{(k)}$ is a trainable weight matrix.
Similarly, the implementation of Equation 1 and Equation 2 for the forward-propagation operation of GCN is defined as:

$$H^{(k)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k)}\big) \qquad (12)$$

where $\tilde{A} = A + I$ is the adjacency matrix plus self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$ (i.e., $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency of graph $\mathcal{G}$. With one convolutional layer, GCN is able to preserve the first-order neighborhood information between nodes. By stacking multiple GCN layers, it is capable of encoding higher-order (e.g., k-hop) neighborhood relationships.
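For reference, a minimal PyTorch sketch of the GCN propagation rule in Eq. (12); it is a plain dense-matrix illustration, not the authors' implementation.

```python
import torch

def gcn_layer(x, adj, weight):
    """One GCN propagation step (Eq. 12): symmetric normalization of the
    self-loop-augmented adjacency, followed by a linear transform and ReLU.

    x:      (n, d) node features
    adj:    (n, n) adjacency matrix
    weight: (d, d_out) trainable weight matrix
    """
    a_tilde = adj + torch.eye(adj.size(0))            # add self-connections
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)         # degree^{-1/2}
    a_norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
    return torch.relu(a_norm @ x @ weight)
```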
Specifically, different from classical GAT or GCN where the graph scale is stable throughout training, HAP scales nodes down into clusters in the graph coarsening module before transferring them to the next node & cluster embedding layer. As a result, the adjacency matrix $A$, the representation matrix $H$ and the number of nodes $n$ change with the action of graph coarsening (cf. Eq. 18).
4.4 Graph Coarsening
We achieve graph coarsening through graph global pattern property extracting technique GCont and crosslevel attention mechanism MOA. We show the graph coarsening module architecture in Fig. 3 and elaborate the details in this subsection. Further, algorithm 1 gives the pseudocode for the graph coarsening module.
4.4.1 Attention Preparation using GCont
Given the node features of the source graph, the task of the coarsening process is to learn the cluster assignment matrix through an attention mechanism. However, one important thing ignored by all group pooling methods is that the pre- and post-coarsening graph content should remain stable without loss of important information. We observe that both DiffPool [16] and ASAP [26] receive no global guidance. Hence, we propose GCont, an auto-learned global graph content sustaining the coarsening process.
As an initial step, we propose using one learnable linear transformation, parametrized by the weight matrix $W_c \in \mathbb{R}^{d' \times n'}$, to generate GCont. The simple linear transformation also combines scalability with the ability to deal with relatively large graphs. The global graph content is converted from the node feature matrix as:

$$C = X W_c, \qquad C_{ij} = \sum_{k=1}^{d'} X_{ik}\,(W_c)_{kj} \qquad (13)$$

where $C_{ij}$ and $X_{ik}$ indicate the elements at the corresponding row and column positions of matrices $C$ and $X$, respectively. $C \in \mathbb{R}^{n \times n'}$ is the automatically learned global graph content matrix, in which each row corresponds to a node of the source graph $\mathcal{G}$ and each column corresponds to a cluster node of the target coarsened graph $\mathcal{G}'$.
The GCont bridges the gap between the source graph and the target graph and maintains their consistency. On one hand, the elements in $C$ reflect the interaction between nodes of the source graph and clusters of the coarsened graph. On the other hand, they contain the graph pattern properties cohered before and after coarsening, thus facilitating generalization across graphs with the same form of features.
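A minimal sketch of how such a GCont layer could be realized, assuming (as in the reconstructed Eq. (13)) a single bias-free linear map from node features to an $n \times n'$ content matrix; the class and parameter names are illustrative.

```python
import torch

class GCont(torch.nn.Module):
    """Auto-learned global graph content (Eq. 13): a single linear map
    from node features to an (n x n_clusters) content matrix whose rows
    correspond to source nodes and columns to target clusters."""

    def __init__(self, in_dim, n_clusters):
        super().__init__()
        self.w_c = torch.nn.Linear(in_dim, n_clusters, bias=False)

    def forward(self, x):           # x: (n, d') node feature matrix
        return self.w_c(x)          # C: (n, n_clusters)
```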
4.4.2 Attention Assignment using MOA
HAP intends to achieve graph downsampling through a cross-level attention-based aggregator for information interaction between the source graph and the coarsened target graph under global graph property guidance. However, we observe that both HSA and SSA described in Sec. 3.4 only focus on one single graph, while CA does not utilize any global information. Although MA introduces master information into the attention process, it is highly affected by the man-made master function. To that end, we propose a new variant of attention mechanism called Master-Orthogonal-Attention (MOA).
Computation of Attention Assignment: The input of the MOA mechanism is a well-learned representation matrix $X \in \mathbb{R}^{n \times d'}$, where $n$ is the number of nodes of the source graph $\mathcal{G}$ and $d'$ is the feature dimension of each node. The graph coarsening module then produces a new coarsened graph representation matrix $X' \in \mathbb{R}^{n' \times d'}$ as its output, where $n'$ is the number of clusters of the coarsened graph. Each cluster will then be regarded as an individual node. Meanwhile, the adjacency matrix $A$ will also be updated to $A'$. Please note that the number of graph coarsening modules and the coarsened graph size are determined by the actual application graph data. In our experiments, we employ two coarsening modules and evaluate this choice in the experiment section.
After having obtained the global graph content matrix, we can employ an orthogonal cross-level attention mechanism between nodes of the source graph and clusters of the target coarsened graph. (The terminology "orthogonal" here refers to rows and columns of a 2D matrix, which is different from the meaning of orthogonal vectors in a mathematical sense.) The attention matrix is formed with elements as follows:

$$M_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\,T}\,[C_{i,:} \,\|\, C_{:,j}]\big) \qquad (14)$$

where $\mathrm{LeakyReLU}$ is the LeakyReLU nonlinearity, $\|$ is a concatenation operation whose key part has a relaxed dimension, and $\vec{a}$ is the trainable shared attentional parameter with relaxed dimension. The reason for the relaxation is given below.
$M$, which is equivalent to a cross-level aggregator, offers a fully-connected information channel between the source-graph nodes and the target coarsened-graph clusters, with each element indicating the importance of node $i$'s features to cluster $j$. The local substructure is preserved by the attention mechanism, while the high-order dependency is also captured through the fully-connected information channel, thus strengthening feature reservation. We normalize it for better evaluation:

$$\hat{M}_{ij} = \frac{\exp(M_{ij})}{\sum_{j'=1}^{n'} \exp(M_{ij'})} \qquad (15)$$
The MOA mechanism synthesizes self-attention and cross-attention with master attention. On one hand, the proposed MOA mechanism calculates the attention coefficients based on the GCont alone, so it can be viewed as a self-attention mechanism. On the other hand, the attention is predicted between the source graph and the target coarsened graph, so it can also be classified as a cross-attention mechanism.
Relaxation of the Attentional Parameter: In the traditional graph attention scheme [11], the attention coefficient is calculated as follows:

$$\alpha_{ij} = \mathrm{softmax}_j\Big(\mathrm{LeakyReLU}\big(\vec{a}^{\,T}\,[W h_i \,\|\, W h_j]\big)\Big) \qquad (16)$$

where $\mathrm{LeakyReLU}$ is the LeakyReLU nonlinearity, $\vec{a} \in \mathbb{R}^{2d'}$ is the trainable shared attentional parameter, $W \in \mathbb{R}^{d' \times d}$ is a weight matrix producing new node features of cardinality $d'$ from cardinality $d$, $h_i$ and $h_j$ are the input node features, and $\|$ is a concatenation operation.
Apparently, the trainable shared attentional parameter in the conventional graph attention mechanism is irrelevant to the node number of the input graph. However, in our MOA mechanism, the dimension of the key $C_{:,j}$ is related to the node number of the input source graph, making the concatenation $[C_{i,:} \,\|\, C_{:,j}] \in \mathbb{R}^{n' + n}$. As a result, the trainable shared attentional parameter would have to be initialized as $\vec{a} \in \mathbb{R}^{n' + n}$, which is sensitive to the node number of the input source graph.
Manifestly, $n$ varies with the input and is unknown at the parameter initialization stage. Thus, a proper relaxation offers intriguingly good performance where standard techniques appear to suffer. We relax the dimension of the key part from $n$ to $n'$, so that $\vec{a} \in \mathbb{R}^{2n'}$. We theoretically analyze the validity of this relaxation with respect to the prediction outcome in Sec. 5.3.
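The following sketch illustrates one plausible realization of the MOA computation in Eqs. (14)-(15) under the assumptions made above: the key (column) part is padded or truncated to the cluster dimension to realize the relaxation, and scores are normalized per node over clusters. It is a hedged sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def moa_attention(c, att_vec, negative_slope=0.2):
    """Master-orthogonal attention (Eqs. 14-15, as reconstructed above).

    c:       (n, n_clusters) global graph content matrix
    att_vec: (2 * n_clusters,) trainable shared attention vector
    returns: (n, n_clusters) normalized attention/assignment matrix
    """
    n, k = c.shape
    rows = c.unsqueeze(1).expand(n, k, k)              # query part C_{i,:} for every (i, j)
    cols = c.t()                                        # columns of C, shape (k, n)
    cols = F.pad(cols, (0, max(0, k - n)))[:, :k]       # relax key length from n to k (assumption)
    keys = cols.unsqueeze(0).expand(n, k, k)            # key part for every (i, j)
    scores = F.leaky_relu(torch.cat([rows, keys], dim=-1) @ att_vec,
                          negative_slope)               # Eq. (14)
    return torch.softmax(scores, dim=-1)                # Eq. (15), per node over clusters
```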
4.4.3 Cluster Formation
The learning of the global graph content and the cross-level aggregator constitutes a concordant unity; they complement and restrict each other. Subsequently, we generate the coarsened graph representation matrix $X'$ and update the adjacency matrix $A'$:

$$X' = \hat{M}^{T} X \qquad (17)$$

$$A' = \hat{M}^{T} A \hat{M} \qquad (18)$$
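Assuming the matrix forms of Eqs. (17)-(18) reconstructed above, cluster formation reduces to two matrix products (a sketch operating on torch tensors):

```python
def coarsen(x, adj, assign):
    """Cluster formation (Eqs. 17-18): pool node features into cluster
    features and re-wire the adjacency through the soft assignment.

    x:      (n, d') node features
    adj:    (n, n) adjacency matrix
    assign: (n, n_clusters) normalized MOA matrix
    """
    x_coarse = assign.t() @ x                  # Eq. (17): cluster representations
    adj_coarse = assign.t() @ adj @ assign     # Eq. (18): coarsened adjacency
    return x_coarse, adj_coarse
```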
4.4.4 Soft Sampling
According to Lee et al. [33], handling the adjacency data with a sparse matrix in a GNN decreases the computational complexity from quadratic in the number of nodes to linear in the number of edges, and also reduces the space complexity. However, the adjacency matrix becomes dense after the soft assignment; that is, the structure of the source graph is refined into a fully-connected downsampled one. Proper edge sampling saves both time and storage without a dramatic loss of accuracy. As a workaround, we adopt the Gumbel-Softmax [34] to achieve soft sampling of the neighborhood relationship, thus decreasing the edge density of the sampled adjacency matrix $\hat{A}'$:

$$\hat{A}'_{ij} = \frac{\exp\big((\log A'_{ij} + g_{ij}) / \tau\big)}{\sum_{k=1}^{n'} \exp\big((\log A'_{ik} + g_{ik}) / \tau\big)} \qquad (19)$$

where $g_{ij} = -\log(-\log u_{ij})$ and $u_{ij} \sim \mathrm{Uniform}(0, 1)$. Here, we set the softmax temperature parameter $\tau$ to a small value to make the adjacency matrix distribution close to one-hot. This operation reduces the edge density as much as possible but preserves the connectivity of graphs.
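A hedged sketch of the soft sampling step using PyTorch's built-in Gumbel-Softmax; treating the coarsened edge weights as unnormalized probabilities is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def soft_sample_adjacency(adj_coarse, tau=0.1, eps=1e-10):
    """Sparsify the dense coarsened adjacency (Eq. 19) by applying
    Gumbel-Softmax row-wise; a small temperature tau pushes each row's
    distribution toward one-hot, pruning most soft edges."""
    logits = torch.log(adj_coarse + eps)          # edge weights as log-probabilities
    return F.gumbel_softmax(logits, tau=tau, dim=-1)
```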
4.5 Learning
The proposed HAP supports three types of input: a single graph for graph classification, pairwise graphs for graph matching, and triplet graphs for graph similarity learning. All of the input graphs will be coarsened to a 1-D vector at the final graph embedding layer, which can be used to compute graph similarity directly. Meanwhile, as demonstrated in the model structure, HAP alternates between node embedding and graph coarsening, thus generating a different graph representation matrix at each graph coarsening layer $l$. As a result, we also propose a hierarchical similarity measure that jointly utilizes the hierarchical graph representations.
4.5.1 Prediction
For graph classification tasks with a single input graph $\mathcal{G}$, the final graph representation $h_{\mathcal{G}}$ is directly fed into two fully-connected layers with a softmax activation on the output to obtain the predicted label $\hat{y}$. Then we optimize the model with a standard cross-entropy on the graphs that have ground-truth labels $y$. The fully-connected layers and the objective function can be represented as follows:

$$\hat{y} = \sigma_2\big(W_2\, \sigma_1(W_1 h_{\mathcal{G}} + b_1) + b_2\big) \qquad (20)$$

$$\mathcal{L}_{cls} = -\sum_{\mathcal{G} \in \mathcal{D}} \sum_{c=1}^{C} y_{\mathcal{G},c} \log \hat{y}_{\mathcal{G},c} \qquad (21)$$

where $W_i$ and $b_i$ represent the weights and biases of the $i$-th fully-connected layer for $i \in \{1, 2\}$, $\sigma_i$ is the adopted activation function (ReLU for the first layer and softmax for the second), $\mathcal{D}$ is the training set of single graphs and $C$ denotes the number of classes.
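A minimal sketch of the prediction head: two fully-connected layers as in Eq. (20), trained with the standard cross-entropy of Eq. (21) (PyTorch's CrossEntropyLoss applies the softmax internally). Layer sizes are illustrative, not the paper's configuration.

```python
import torch

class GraphClassifier(torch.nn.Module):
    """Two fully-connected layers on the graph embedding (Eq. 20)."""

    def __init__(self, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.fc1 = torch.nn.Linear(embed_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, h_graph):                 # h_graph: (batch, embed_dim)
        return self.fc2(torch.relu(self.fc1(h_graph)))   # class logits

# Cross-entropy training objective of Eq. (21).
model = GraphClassifier(128, 64, 2)
loss_fn = torch.nn.CrossEntropyLoss()
logits = model(torch.randn(4, 128))
loss = loss_fn(logits, torch.tensor([0, 1, 1, 0]))
```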
For graph matching tasks with pairwise input graphs $(\mathcal{G}_1, \mathcal{G}_2)$, pairs are labeled with $y = 1$ or $y = 0$, representing similar or dissimilar respectively. We optimize a normalization function that pushes the model to convert graph distances into similarity scores distributed in $(0, 1]$:

$$\hat{y}^{(l)} = \exp\big(-\gamma \cdot d^{(l)}(\mathcal{G}_1, \mathcal{G}_2)\big) \qquad (22)$$

where $\gamma$ denotes a scaling parameter sensitive to the range of distances and is determined by the actual application graph data; basically, we set it to 0.5. $d^{(l)}(\mathcal{G}_1, \mathcal{G}_2)$ represents the graph distance of the pair at coarsening level $l$, and here we use the Euclidean distance. The model is then optimized by the hierarchical cross-entropy function:

$$\mathcal{L}_{match} = -\sum_{(\mathcal{G}_1, \mathcal{G}_2) \in \mathcal{P}} \sum_{l} \Big[\, y \log \hat{y}^{(l)} + (1 - y) \log\big(1 - \hat{y}^{(l)}\big) \,\Big] \qquad (23)$$

where $\mathcal{P}$ is the training set of pairwise graphs and $y$ is the label of the pair.
For graph similarity learning tasks with triplet input graphs $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$, a hierarchical Mean Squared Error (MSE) loss function is employed:

$$\mathcal{L}_{sim} = \sum_{(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3) \in \mathcal{T}} \sum_{l} \big(\hat{t}^{(l)} - t\big)^2 \qquad (24)$$

where $\mathcal{T}$ is the training set of triplet graphs, $\hat{t}^{(l)}$ is the predicted relative proximity at coarsening level $l$, and $t$ denotes the ground-truth triplet proximity defined by the relative Graph Edit Distance (GED) in Sec. 4.2.
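The two hierarchical losses can be sketched as below, assuming (as in the reconstructed Eqs. (22)-(24)) a per-level Euclidean distance, an exponential distance-to-similarity mapping for matching, and a distance-gap prediction for triplets; these modeling choices are assumptions, not necessarily the authors' exact formulation.

```python
import torch

def hierarchical_matching_loss(h1_levels, h2_levels, label, gamma=0.5):
    """Sketch of Eqs. (22)-(23): per-level distance -> similarity score in (0, 1],
    summed binary cross-entropy. `label` is a float tensor (0. or 1.)."""
    loss = 0.0
    for h1, h2 in zip(h1_levels, h2_levels):              # one entry per coarsening level
        score = torch.exp(-gamma * torch.norm(h1 - h2))   # Eq. (22)
        loss = loss + torch.nn.functional.binary_cross_entropy(
            score.clamp(max=1.0 - 1e-6).unsqueeze(0), label.unsqueeze(0))
    return loss

def hierarchical_triplet_loss(ha_levels, hb_levels, hc_levels, t_true):
    """Sketch of Eq. (24): the predicted relative proximity at each level is the
    distance gap d(a, c) - d(a, b), regressed to the ground truth t."""
    loss = 0.0
    for ha, hb, hc in zip(ha_levels, hb_levels, hc_levels):
        t_pred = torch.norm(ha - hc) - torch.norm(ha - hb)
        loss = loss + (t_pred - t_true) ** 2
    return loss
```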
4.5.2 Hierarchical Prediction
As shown in Fig. 2, we adopt a hierarchical prediction strategy to further facilitate the training process and fully utilize the hierarchical intermediate features of coarsened graphs. The outputs of every coarsening process are summarized as the intermediate graph feature, which will be fed into the learning module for graph matching or graph similarity learning.
5 Theoretical Analysis
5.1 Computational Complexity Analysis
In the following, we theoretically analyze the computational complexity of the proposed HAP and show the superiority of the proposed graph coarsening module.
Claim 1 (Time Complexity)
The time complexity of the proposed HAP with $K$ graph coarsening modules and downsampling ratio $r$ is approximately $\mathcal{O}(n d d' + m d' + r n^2 d')$, where $n$ is the number of nodes of the original input graph.
Proof 1
The time complexity of HAP involves three parts corresponding to the three stages of GNN-based graph-level representation learning models: (1) node embedding; (2) graph coarsening; and (3) learning. The time complexity of the node embedding stage is $\mathcal{O}(n d d' + m d')$ [11], where $d'$ is the dimension of the output node features. After that, downsampling the node number from $n_{k-1}$ to $n_k = r\, n_{k-1}$ in the $k$-th graph coarsening module, where $n_0 = n$, requires $\mathcal{O}(n_{k-1}\, n_k\, d')$. Let us suppose $r$ remains constant among all the coarsening modules. Then the time complexity of all the graph coarsening modules is $\mathcal{O}\big(\sum_{k=1}^{K} r^{2k-1} n^2 d'\big)$. Since $r$ is less than 1, the terms for $k \geq 2$ are a couple of orders of magnitude smaller than the first term, so the time complexity of the graph coarsening stage is roughly equivalent to $\mathcal{O}(r n^2 d')$. Eventually, for the learning stage, the time complexity is $\mathcal{O}(d' d_g)$, where $d_g$ is the dimension of the graph-level embedding of the input graph. Therefore, the overall computational complexity of the proposed HAP framework is $\mathcal{O}(n d d' + m d' + r n^2 d')$.
Specifically, when a proper coarsening ratio is chosen such that $r n \ll d$, the actual execution time of the proposed HAP becomes almost linear in $n$.
5.2 Permutation Invariance
Graph pooling methods need to be permutation invariant, since they should guarantee that the graph-level representation does not vary with the input order of the node-level representations. We prove that the proposed graph coarsening module is graph permutation invariant.
Definition 5.1 (Permutation matrix)
$P \in \{0, 1\}^{n \times n}$ is a permutation matrix of size $n$ iff $P \mathbf{1}_n = \mathbf{1}_n$ and $P^{T} \mathbf{1}_n = \mathbf{1}_n$.
Claim 2 (Permutation invariance)
Let $P$ be any permutation matrix, $\mathcal{G} = (A, X)$ be any undirected graph, and $f(A, X)$ be a pooling operation depending on graph $\mathcal{G}$; graph permutation is defined as $f(P A P^{T}, P X)$. The proposed graph coarsening module is graph permutation invariant, i.e., $f(P A P^{T}, P X) = f(A, X)$.
Proof 2
$\hat{M}$ is computed by an attention mechanism between source nodes and target coarsened clusters. Since the attention function operates between the node set and the cluster set, the order of nodes or clusters has no effect on the per-node result; permuting the nodes simply permutes the rows of the assignment, so we have:

$$\hat{M}(P X, P A P^{T}) = P\, \hat{M}(X, A) \qquad (25)$$

Since $X' = \hat{M}^{T} X$ and any permutation matrix is orthogonal ($P^{T} P = I$), applying the permutation to it we get:

$$(P \hat{M})^{T} (P X) = \hat{M}^{T} P^{T} P X = \hat{M}^{T} X = X' \qquad (26)$$

Since $A' = \hat{M}^{T} A \hat{M}$, applying the permutation to it we get:

$$(P \hat{M})^{T} (P A P^{T}) (P \hat{M}) = \hat{M}^{T} A \hat{M} = A' \qquad (27)$$

As a result, $f(P A P^{T}, P X) = f(A, X)$, and HAP is graph permutation invariant.
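A quick numerical check of Claim 2, with a random matrix standing in for the MOA assignment (hypothetical stand-in, not the learned one):

```python
import torch

# Permuting the nodes of the source graph leaves the coarsened features and
# adjacency unchanged, because the assignment is permuted the same way and P^T P = I.
n, k, d = 6, 3, 4
x, adj = torch.randn(n, d), torch.rand(n, n)
adj = (adj + adj.t()) / 2                       # undirected graph
m = torch.rand(n, k)                            # stands in for the MOA matrix
p = torch.eye(n)[torch.randperm(n)]             # random permutation matrix

x1, a1 = m.t() @ x, m.t() @ adj @ m             # original coarsening (Eqs. 17-18)
x2 = (p @ m).t() @ (p @ x)                      # coarsening after permutation
a2 = (p @ m).t() @ (p @ adj @ p.t()) @ (p @ m)

print(torch.allclose(x1, x2), torch.allclose(a1, a2, atol=1e-5))   # True True
```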
5.3 Validity of Relaxation for Attentional Parameter
In Sec. 4.4.2, we perform a relaxation of the attentional parameter. Substantially, the relaxation is applied to the column dimension of GCont during concatenation. A natural question is whether the relaxation affects the accuracy of the attention coefficients, which could directly lead to neglecting important information during cross-level aggregation. We now analyze this question theoretically.
Definition 5.2 (LeakyReLU)
LeakyReLU is a monotonically increasing activation function:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \geq 0 \\ \beta x, & x < 0 \end{cases} \qquad (28)$$

where $0 < \beta < 1$ is the negative slope.
Claim 3
Let $\vec{a}$, $x$, and $y$ be the attentional parameter and the concatenated query and key vectors before relaxation, let $\vec{a}'$ and $y'$ be the corresponding vectors after relaxation, let $\|$ be the concatenation operation, and let LeakyReLU be the nonlinearity. Then $\mathrm{LeakyReLU}\big(\vec{a}^{\,T}[x \,\|\, y]\big) = \mathrm{LeakyReLU}\big(\vec{a}'^{\,T}[x \,\|\, y']\big)$.
Proof 3
The essence of $\vec{a}^{\,T}[x \,\|\, y]$ is a similarity comparison between the vector $\vec{a}$ and the vector $[x \,\|\, y]$. Because vectors with different dimensions are non-comparable, the lacking dimensions need to be padded with zeros, so that:

$$\vec{a}^{\,T}[x \,\|\, y] = [\vec{a} \,\|\, \mathbf{0}]^{T}\, [x \,\|\, y \,\|\, \mathbf{0}] \qquad (29)$$

When doing the comparison between $\vec{a}'$ and $[x \,\|\, y']$, we can pad them in the same way:

$$\vec{a}'^{\,T}[x \,\|\, y'] = [\vec{a}' \,\|\, \mathbf{0}]^{T}\, [x \,\|\, y' \,\|\, \mathbf{0}] \qquad (30)$$

$$[\vec{a} \,\|\, \mathbf{0}]^{T}\, [x \,\|\, y \,\|\, \mathbf{0}] = [\vec{a}' \,\|\, \mathbf{0}]^{T}\, [x \,\|\, y' \,\|\, \mathbf{0}] \qquad (31)$$

Hence, $\vec{a}^{\,T}[x \,\|\, y] = \vec{a}'^{\,T}[x \,\|\, y']$. Based on the known condition that LeakyReLU is monotonically increasing, $\mathrm{LeakyReLU}\big(\vec{a}^{\,T}[x \,\|\, y]\big) = \mathrm{LeakyReLU}\big(\vec{a}'^{\,T}[x \,\|\, y']\big)$.
As a result, the relaxation of the attentional parameter has no negative effect on the attention computation and feature extraction.
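The zero-padding step of the argument can be checked numerically: appending zeros to both the parameter and the concatenated vector leaves the inner product, and hence the LeakyReLU output, unchanged.

```python
import torch

# Zero-padding both sides of an inner product does not change its value,
# so a monotone LeakyReLU applied afterwards yields the same coefficient.
a = torch.randn(6)
v = torch.randn(6)
a_pad = torch.cat([a, torch.zeros(3)])
v_pad = torch.cat([v, torch.zeros(3)])
score = torch.nn.functional.leaky_relu(a @ v)
score_pad = torch.nn.functional.leaky_relu(a_pad @ v_pad)
print(torch.allclose(score, score_pad))   # True
```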
6 Experiments and Evaluation
We evaluate HAP against a number of stateoftheart methods to answer the following questions:
Q1: How does HAP compare with other baselines when evaluated with downstream tasks including graph classification, graph matching and graph similarity learning? (Sec. 6.2, Sec. 6.3, Sec. 6.4)
Q2: How does the original HAP compare with ablated variants in which the graph coarsening module is replaced by other state-of-the-art pooling algorithms? (Sec. 6.5.1)
Q3: How does the number of the graph coarsening modules influence the quality of graphlevel representations generated by HAP? (Sec. 6.5.2)
Q4: Do key designs of HAP contribute to better generalization performance? (Sec. 6.5.3)
Dataset  #Graphs  #Triples  #Pairs  Max. #Nodes  Avg. #Nodes  #Classes 
IMDB-B  1000      136  19.8  2 
IMDB-M  1500      89  13.0  3 
COLLAB  5000      492  74  3 
MUTAG  188      28  17.9  2 
PROTEINS  1113      620  39.1  2 
PTC  344      109  25.5  2 
AIDS    171900    10  8.9   
LINUX    409600    10  7.7   
Synthetic data      8750  300  105.7  2 
6.1 Experimental Setup
6.1.1 Datasets
We perform experiments on eight real-world datasets and one synthetic dataset, varying with the tasks. The graph statistics are summarized in TABLE II.
Evaluating the graph matching task requires benchmark datasets with ground-truth labels (true for matching and false for unmatching). To the best of our knowledge, no publicly available real-world dataset holds such ground-truth labels. To fill this gap, we construct a synthetic dataset, a collection of labeled graph pairs with a fixed edge probability, generated with the VF2 graph matching library [35]. Given a graph $\mathcal{G}$, a positive sample is the maximum connected subgraph randomly extracted with 1 to 3 fewer nodes than $\mathcal{G}$, and a negative sample is created by randomly adding 3 to 7 nodes with the same edge probability.
6.1.2 Baselines
We compare HAP with three kinds of baselines:
Graph pooling baseline: For universal pooling, we choose GCN-concat (concatenation of GCN-based node-level representations), SumPool [36], MeanPool, MeanAttPool [19], and Set2Set [22]. For Top-K pooling, we use SortPooling [23], AttPool [24], gPool [17] and SAGPool [25]. For group pooling, we compare with DiffPool [16] and ASAP [26]. We also evaluate an unsupervised method, StructPool [31].
Graph matching baseline: We focus on the Graph Matching Network (GMN) [18] specifically designed for pairwise graph similarity learning.
Graph similarity learning baseline: There are two categories of graph similarity learning baselines. Because the ground-truth triplet proximity for the graph similarity learning task is calculated by a conventional rigorous GED algorithm, the first category comprises conventional approximate GED algorithms, including Beam search [37], VJ [38] and the Hungarian [39] algorithm. The other category includes SimGNN [19] and GMN, which are GNN-based models.
6.1.3 Parameter Settings
For the basic model structure of HAP, we place two node & cluster embedding layers before each graph coarsening module, and a total of two coarsening modules are needed. The Adam optimizer is used with an initial learning rate of 0.01 for the graph classification datasets, 0.0015 for AIDS, and 0.0001 for LINUX and the synthetic data. For the social network datasets IMDB and COLLAB with no informative node features, we use a one-hot encoding of node degrees as the initial node input. Similarly, we adopt a one-hot encoding of node labels for the AIDS dataset, while the others are initialized identically. For graph classification and the other tasks, the initial dimension is 64 and 128, respectively. All of the datasets are randomly partitioned into 8:1:1 for training/validation/testing. For all the baseline methods, we conduct experiments under the default settings reported in the original works.
Method  Model  IMDBB  IMDBM  COLLAB  MUTAG  PROTEINS  PTC 
GCN-concat  74.01  48.03  63.22  72.22  70.27  58.82  
SumPool  76.02  52.01  72.83  89.47  73.21  68.57  
MeanPool  74.02  51.33  71.26  85.01  72.32  63.89  
MeanAttPool  75.01  52.03  72.65  85.06  73.21  63.89  
Universal pooling  Set2Set  68.02  50.66  64.23  88.89  71.17  55.88 
SortPooling  66.83  47.02  72.94  83.33  74.05  56.47  
AttPool-global  70.13  47.53  77.36  86.67  74.68  66.18  
AttPool-local  70.83  48.73  79.12  82.22  74.77  66.47  
gPool  78.02  54.67  82.25  87.72  73.87  68.57  
Top-K pooling  SAGPool  75.03  48.64  78.22  75.02  72.97  47.06 
DiffPool  77.04  52.03  61.87  77.78  73.87  58.82  
Group pooling  ASAP  73.04  50.13  80.52  78.35  73.92  58.01 
Unsupervised  StructPool  74.06  53.33  70.85  83.33  72.07  67.64 
HAP (ours)  79.04  55.33  73.95  95.01  76.46  69.44  
6.2 Task1: Graph Classification
We evaluate HAP on six benchmark graph classification datasets and compare it with several state-of-the-art approaches belonging to different pooling categories. For AttPool, we try different attention functions (global attention and local attention) to obtain the graph-level representations. For HAP, we try GAT and GCN for the node & cluster embedding operation and report the better accuracy. TABLE III shows the accuracy with the best results marked in bold. We observe that HAP obtains the best performance on five out of six datasets with an average improvement of 5.9%.
Of all the graph pooling methods, the universal pooling approaches are the most straightforward but achieve considerable effect, especially SumPool, whose underlying concept is consistent with our HAP. Intuitively, the higher the quality of the graph-level representations, the better the graph classification result. The element-wise sum aggregator in SumPool tends to capture all node features in consideration of higher-order node dependency, but the generated graph-level representations fail to attain sufficient quality; that is, the quality of graph-level representations is not positively associated with how many node features are acquired. Irrelevant features that may interfere with the results are absorbed without reducing their weights, so the final graph-level representations, mixed with excessive irrelevant information, are detrimental to the graph classification accuracy.
Top-K pooling approaches produce score-based representations that drop the nodes of the original graph with lower scores. As a result, potentially valuable information attached to these nodes and the related substructures may be ignored. From TABLE III, the performance of Top-K pooling approaches is generally inferior to that of other methods capturing more feature or structural information. Worse still, SortPooling and AttPool-global fail to return a result within 72 hours in practical execution. But there is an exception to the rule: gPool, with consistently better performance than other methods, even surpasses HAP on COLLAB. gPool computes scores by multiplying the node feature matrix with a trainable projection vector, so that the features of each node are covered in the estimated scalar projection values with different weights. This crucial ingredient leads to its outstanding performance. As for the remarkable performance on COLLAB, it might be due to the nature of the COLLAB dataset. COLLAB covers scientific collaboration: nodes represent authors and edges indicate co-author relationships. The classification task for each graph is to estimate the field the corresponding researcher belongs to. In this situation, the field can be distinguished easily by the Top-K authors with the most papers, who are likely domain experts, while the other, less-known authors actually act as noisy information. Nevertheless, our HAP still has advantages except for such exceptional circumstances.
For the MUTAG dataset, we observe that our HAP remarkably outperforms all the baselines with an average improvement of 12.3%. This significant result shows that HAP fits the properties of MUTAG adequately. MUTAG is a two-class nitro compound dataset whose nodes and edges represent atoms and chemical bonds, respectively. Note that molecules of both classes share the common nitro substructure, so higher-order information beyond the substructure is crucial for differentiation, which is correctly handled by HAP.
Visualization: To further conceptualize the effectiveness of the learned graph-level representations, we provide a t-SNE visualization on the PROTEINS and COLLAB datasets with features extracted by HAP, SAGPool, MeanAttPool and DiffPool (Fig. 4). In each figure, points of different colors exhibit discernible graph clusters with different labels in the projected two-dimensional space. Note that the separability of the cluster borders verifies the discriminative power. We find that HAP, performing consistently with MeanAttPool on PROTEINS, shows better discriminability of the two classes than SAGPool and DiffPool. As for COLLAB, HAP is far superior to its competitors, with three clearly separated classes, all of which is in accordance with the results in TABLE III.
6.3 Task2: Graph Matching
Four synthetic datasets are generated with different graph sizes for the graph matching task. TABLE IV shows the graph matching results w.r.t. graph size.
GMN, specifically designed for the graph matching task, makes the node embedding phase dependent on the pair through a cross-graph attention mechanism. However, as shown in TABLE IV, HAP drastically boosts the matching accuracy by up to 3.5% compared to GMN on graphs of size 20. When increasing the graph size, HAP achieves a steady rise while GMN even decreases from graph size 30 to 40. This reveals the key point: basic node embedding models are already perfectly capable of producing high-quality node-level representations; instead, the core of enhancing graph matching accuracy is to improve the quality of graph-level representations. After replacing the basic pooling module in GMN, the performance of GMN-HAP grows tremendously to be comparable with HAP, further confirming the strong ability of the proposed graph coarsening module.
Model  n = 20  n = 30  n = 40  n = 50 
GMN  95.01  97.82  96.84  99.41 
GMN-HAP  98.22  98.83  98.41  99.82 
HAP (ours)  98.42  98.83  99.43  99.83 
6.4 Task3: Graph Similarity Learning
We show the results of HAP for graph similarity learning compared with both conventional approximate GED algorithms and GNN-based models on the AIDS and LINUX datasets in Fig. 5. Note that evaluating the graph similarity learning task requires benchmark datasets with ground-truth GEDs processed by the exact algorithm A*. A recent study [40] shows that "no currently available algorithm manages to reliably compute GED within reasonable time between graphs with more than 16 nodes", and experiments on A* show that 10 nodes seem to be the limit of its ability. Therefore, we only accept benchmark datasets in which each graph has no more than 10 nodes. Our results demonstrate that HAP is capable of boosting the accuracy over state-of-the-art methods.
More specifically, compared with conventional approximate GED algorithms with high computational complexity, HAP improves accuracy by a relative gain of 16.7% and 15.1% on AIDS and LINUX, respectively. With regard to GNN-based models, HAP clearly outperforms SimGNN, which focuses more on optimizing the exact similarity score between graphs while neglecting relativity. This result, in one aspect, reflects that a single-minded pursuit of optimizing pairwise absolute similarity is not necessarily favorable to relative similarity tasks, which are more common in real-world applications. Similarly, HAP outperforms GMN by a margin of 13.1% and 3.6% on AIDS and LINUX, respectively. When replacing the pooling methods in SimGNN and GMN with the proposed graph coarsening module, both of them achieve slight improvements and GMN-HAP obtains accuracy comparable with HAP. These results indicate that our HAP and the coarsening module are conducive to a high-quality graph-level representation.
Ablated Model  Graph Classification  Graph Matching  Graph Similarity Learning  
IMDB-B  IMDB-M  COLLAB  MUTAG  PROTEINS  PTC  n = 20  n = 30  n = 40  n = 50  AIDS  LINUX  
HAP-MeanPool  74.02  51.14  71.13  85.02  72.31  63.94  55.81  56.82  56.33  58.12  68.62  66.34 
HAP-MeanAttPool  75.03  52.25  72.64  85.13  73.22  63.91  85.23  84.01  80.63  86.62  78.31  89.02 
HAP-SAGPool  70.34  44.21  72.15  75.22  59.86  61.13  61.55  63.62  60.13  60.05  74.42  66.53 
HAP-DiffPool  77.35  47.12  61.92  80.03  66.12  55.67  65.93  63.04  63.82  61.95  80.73  91.30 
HAP (ours)  79.04  55.33  73.95  95.01  76.46  69.44  98.42  98.81  99.43  99.84  85.62  93.11 
6.5 Ablation Studies
6.5.1 Comparison of Graph Pooling Mechanisms
To study the effectiveness of our proposed graph coarsening module, we fix the other components of the HAP framework and replace our coarsening module with four other differentiable graph pooling methods, i.e., MeanPool, MeanAttPool, SAGPool and DiffPool, referring to these variants as HAP-MeanPool, HAP-MeanAttPool, HAP-SAGPool and HAP-DiffPool, respectively. The performance of HAP and its four ablated variants on the graph classification, graph matching and graph similarity learning tasks is shown in TABLE V.
We observe that, compared with the four ablated variants whose performance fluctuates wildly among tasks, our HAP achieves superior performance on all twelve datasets for the three tasks. We also find that HAP-MeanPool ranks at the bottom across tasks, being especially inferior on graph matching and graph similarity learning by a margin of 17% to 42.61%. This validates that the multiformity of features, which may be redundant information in the single-input graph classification task, is crucial for multiple-input graph-level tasks that require horizontal comparison. On the contrary, HAP-MeanAttPool brings performance benefits over the other ablated variants, indicating that global-wise information aggregation can be helpful for graph-level representation learning. Further, with the help of the proposed graph coarsening module, our HAP achieves adaptive graph structure sensitivity based on global-wise information aggregation, which utilizes both local structure and global pattern properties, thus contributing to a high-quality graph-level representation. Similarly, when compared with HAP-DiffPool, a one-hop neighborhood aggregator, HAP can also improve graph-level representation quality by incorporating the high-order dependency among nodes that may hold significant information. Moreover, the comparison between HAP-SAGPool and HAP reveals that our HAP can indeed retain key graph information that may be attached to the abandoned nodes.
Model  Graph Matching  Graph Similarity Learning  
n = 20  n = 30  n = 40  n = 50  AIDS  LINUX  
baseline  85.21  84.02  80.65  86.63  75.52  67.63 
Coarsen=1  99.04  97.16  97.61  96.08  83.22  84.83 
Coarsen=2  99.72  98.76  99.78  98.4  84.11  89.42 
Coarsen=3  97.62  99.45  99.35  99.39  85.04  90.65 
6.5.2 Comparison of Different Number of Graph Coarsening Module
TABLE VI shows the performance on graph matching and graph similarity learning with different numbers of graph coarsening modules in HAP. All experiments use HAP-MeanAttPool as the baseline with a fixed coarsening ratio for the same dataset. We observe that replacing MeanAttPool with one of our graph coarsening modules, denoted as Coarsen = 1, improves the performance by at least 10.2%, which effectively demonstrates the significance of the proposed coarsening module. Furthermore, increasing the number of coarsening modules from one to two improves the performance by at most 5.4%. Finally, increasing it from two to three only slightly improves the performance, by an average of 0.7%. These results demonstrate that the proposed graph coarsening module can dramatically improve performance by coarsening graphs in a hierarchical manner.
Visualization: Fig. 6 visualizes how graph-level representations react to different numbers of graph coarsening modules in the graph classification task. It can be seen that challenging cases are progressively corrected as the number of graph coarsening modules increases from one to two, but are easily misclassified when there are three coarsening modules.
Synthesizing the above results, the greater the number of graph coarsening modules, the more attached parameters and additional memory usage. To balance the performance and resource usage, we choose Coarsen = 2 as default settings.
6.5.3 Comparison of Generalization Performance
While most GNNs are designed with the generalization ability to unseen nodes in mind, there is little research in the graph pooling area addressing generalization to unseen graphs. However, in practical applications such as protein molecular structure recognition, researchers are often interested in generalizing the knowledge learned from small-sized molecules to large-sized molecules with the same form of structures.
In this subsection, we justify the generalization capability of the models by training on small-sized graphs and testing on large-sized graphs with the same edge probability for the graph matching task. The results in TABLE VII indicate that only HAP achieves a natural generalization of the small-sized results to large-sized graphs. This is credited to the key strength of HAP: it effectively learns, via GCont, the global graph content that involves high-level pattern information for the training graphs, thus preserving the pattern properties that are inherited between the training and testing graphs. When applying our graph coarsening module to GMN, GMN-HAP achieves a significant improvement in prediction performance of 8.22% and 10.31%, respectively.
Model  n = 100  n = 200 
GMN  85.22  74.31 
GMN-HAP  93.44  84.62 
HAP-MeanPool  57.22  58.53 
HAP-MeanAttPool  83.52  87.84 
HAP-SAGPool  59.01  59.13 
HAP-DiffPool  64.04  59.22 
HAP (ours)  98.51  98.53 
7 Conclusion and Future Work
In this paper, we introduce HAP, a novel graph pooling framework for hierarchical graph-level representation learning that adaptively leverages graph structures. The key innovation of HAP is the graph coarsening module, assisted by the novel graph pattern property extracting technique GCont and the cross-level attention mechanism MOA. HAP clusters local substructures through the newly proposed cross-level attention mechanism MOA, which helps it naturally focus more on close neighborhoods while effectively capturing the higher-order dependency that may contain important information. We also propose GCont, an auto-learned global graph content that sustains the cross-attention process. HAP leverages GCont to provide global guidance during graph coarsening: it extracts graph pattern properties so that the pre- and post-coarsening graph content remains stable without loss of significant information. The learning of GCont also facilitates generalization across graphs with the same form of features. Theoretical analysis and extensive experiments demonstrate that HAP and its key graph coarsening module achieve state-of-the-art performance on four downstream tasks.
HAP shows its potential to improve other graph learning methods by producing a more informative graph embedding. Furthermore, there are promising opportunities for HAP to be further extended to more complex networks, such as attributed networks and heterogeneous networks, which may be more common in real-world applications.
References
 [1] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [2] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using cnn-rnn and c3d hybrid networks," in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445–450.
 [3] D. Palaz, R. Collobert et al., "Analysis of cnn-based speech recognition system using raw speech as input," Idiap, Tech. Rep., 2015.
 [4] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and rnn for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
 [5] S. A. Myers, A. Sharma, P. Gupta, and J. Lin, “Information network or social network? the structure of the twitter follow graph,” in Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 493–498.
 [6] N. Shibata, Y. Kajikawa, and I. Sakata, “Link prediction in citation networks,” Journal of the American society for information science and technology, vol. 63, no. 1, pp. 78–85, 2012.
 [7] J. Hu, C. Guo, B. Yang, and C. S. Jensen, “Stochastic weight completion for road networks using graph convolutional networks,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019, pp. 1274–1285.

 [8] Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao, "Deep learning in bioinformatics: Introduction, application, and perspective in the big data era," Methods, vol. 166, pp. 4–21, 2019.
 [9] T. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv: Learning, 2016.
 [10] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in neural information processing systems, 2017, pp. 1024–1034.
 [11] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 [12] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
 [13] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference. Springer, 2018, pp. 593–607.
 [14] S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar, “Compositionbased multirelational graph convolutional networks,” arXiv preprint arXiv:1911.03082, 2019.
 [15] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” arXiv: Learning, 2013.
 [16] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” arXiv preprint arXiv:1806.08804, 2018.
 [17] H. Gao and S. Ji, “Graph unets,” arXiv preprint arXiv:1905.05178, 2019.
 [18] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity of graph structured objects,” arXiv preprint arXiv:1904.12787, 2019.
 [19] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang, “Simgnn: A neural network approach to fast graph similarity computation,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 384–392.
 [20] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” arXiv: Learning, 2016.
 [21] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
 [22] O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” arXiv preprint arXiv:1511.06391, 2015.
 [23] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, "An end-to-end deep learning architecture for graph classification," in AAAI, vol. 18, 2018, pp. 4438–4445.

 [24] J. Huang, Z. Li, N. Li, S. Liu, and G. Li, "Attpool: Towards hierarchical feature representation in graph convolutional networks via attention mechanism," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6480–6489.
 [25] J. Lee, I. Lee, and J. Kang, "Self-attention graph pooling," arXiv preprint arXiv:1904.08082, 2019.
 [26] E. Ranjan, S. Sanyal, and P. P. Talukdar, “Asap: Adaptive structure aware pooling for learning hierarchical graph representations.” in AAAI, 2020, pp. 5470–5477.

 [27] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1263–1272.
 [28] Y. Bai, H. Ding, Y. Qiao, A. Marinovic, K. Gu, T. Chen, Y. Sun, and W. Wang, "Unsupervised inductive graph-level representation learning via graph-graph proximity," arXiv preprint arXiv:1904.01098, 2019.
 [29] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, pp. 2539–2561, 2011.
 [30] D. Mesquita, A. Souza, and S. Kaski, “Rethinking pooling in graph neural networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
 [31] H. Yuan and S. Ji, “Structpool: Structured graph pooling via conditional random fields,” in International Conference on Learning Representations, 2019.

 [32] F. M. Bianchi, D. Grattarola, and C. Alippi, "Spectral clustering with graph neural networks for graph pooling," in International Conference on Machine Learning. PMLR, 2020, pp. 874–883.
 [33] J. Lee, I. Lee, and J. Kang, "Self-attention graph pooling," in International Conference on Machine Learning. PMLR, 2019, pp. 3734–3743.
 [34] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.
 [35] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub) graph isomorphism algorithm for matching large graphs,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 10, pp. 1367–1372, 2004.
 [36] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.

 [37] M. Neuhaus, K. Riesen, and H. Bunke, "Fast suboptimal algorithms for the computation of graph edit distance," in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 2006, pp. 163–172.
 [38] S. Fankhauser, K. Riesen, and H. Bunke, "Speeding up graph edit distance computation through fast bipartite matching," in International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 2011, pp. 102–111.
 [39] K. Riesen and H. Bunke, “Approximate graph edit distance computation by means of bipartite graph matching,” Image and Vision computing, vol. 27, no. 7, pp. 950–959, 2009.
 [40] D. B. Blumenthal and J. Gamper, “On the exact computation of the graph edit distance,” Pattern Recognition Letters, vol. 134, pp. 46–57, 2020.