Although data that can be represented in a grid structure on Euclidean domains, such as images, video, speech, and text, is closely connected with daily life, there is another major category of non-Euclidean data, namely graphs, which are constructed from irregularly arranged nodes and the edges indicating their connections. Examples include social networks, citation networks, road networks, and bioinformatics networks. Different from Euclidean data, the convolution and pooling operations in Convolutional Neural Networks (CNNs) cannot be directly applied to non-Euclidean graph-structured data, because the neighborhood of a central node is irregular and nondeterministic. Consequently, it is important to learn a sufficiently expressive representation for graph-structured data in a principled way.
A great deal of research on Graph Neural Networks (GNNs) has emerged to generalize the success of convolution in CNNs to graph-structured data. In GNNs, the convolution operation evolves into aggregating neighborhood messages of a central node along its edges, thus capturing both node features and graph structural information. Following this principle, various GNNs have been proposed, such as GCN, GraphSAGE, and GAT. All of them have achieved remarkable success in graph representation learning, especially in node-level tasks, including node classification [12, 11] and link prediction [13, 14]. However, for graph-level representation learning tasks, such as graph classification [15, 16, 17], graph matching, and graph similarity learning, the convolution operation alone is insufficient. Empowering GNNs to produce discriminative graph-level representations requires an effective pooling operation.
To remedy this problem, a few researchers have tried to further generalize the pooling mechanism from CNNs to GNNs for graph-level representation learning. Accordingly, it is natural to ask: what are the basic criteria for a high-quality pooling method in GNNs? Different from the pooling operation in CNNs, which reduces the number of parameters and preserves invariance, a graph pooling technique is essentially a node feature aggregator over the entire graph, analogous to the neighborhood aggregator in graph convolution. Therefore, a good graph pooling method should encourage graphs with similar topology and similar node features to have similar representations. As a result, the major challenge for a high-quality pooling mechanism is to define a method that both effectively maintains pivotal node features and explicitly captures important structural information. To address this challenge, previous works have proposed graph pooling architectures in three ways.
First, universal maximum/average pooling methods [20, 19] intuitively extend to graph models by simple element-wise max- or mean-downsampling over all node features. However, such methods have been shown to ignore feature multiplicities and to miss the structural information completely: graphs with different structures may obtain the same representation.
Second, Top-K methods [22, 17, 23, 24, 25] sort the graph nodes in a consistent order by importance scores, and only the K nodes with the highest scores are selected to form the pooled graph. As a result, all nodes except the top K are discarded outright, even though they may carry important features. Meanwhile, very few Top-K methods utilize local substructure information in the scoring stage, so the selected nodes are likely to be isolated from one another, which may hamper information propagation in subsequent GNN layers.
Third, advanced graph pooling methods, such as DiffPool and ASAP, learn graph representations in a hierarchical grouping manner to capture the comprehensive local substructures widespread in graphs. They usually group nodes into smaller clusters over multiple levels and use the final level as the graph representation. However, the grouping operation is usually executed on a fixed 1-hop neighborhood, forcing information to flow from a certain neighborhood to a specific coarsened cluster and neglecting the higher-order dependencies among nodes that may hold significant information, as shown in Fig. 1(a).
To the best of our knowledge, no existing graph pooling method adaptively handles graph local substructures and high-order dependencies while capturing node features. There is also a general lack of systematic consideration of how pooling methods affect the generalization of graph-level representations. To bridge this gap, we propose a novel hierarchical graph pooling framework called Hierarchical Adaptive Pooling (HAP). The main contributions can be summarized as follows.
We introduce HAP, a supervised hierarchical pooling framework. HAP preserves node feature information while remaining adaptively sensitive to both local substructures and high-order dependencies in the graph structure. We make the framework more comprehensible by offering exhaustive theoretical analysis of its computational complexity, permutation invariance, and design validity.
We propose master-orthogonal attention (MOA), a novel cross-level attention mechanism specifically designed for hierarchical graph pooling. MOA can be leveraged to capture cross-level interactions under the guidance of graph pattern properties in an efficient and effective way. MOA also acts as a soft substructure extractor: attention weights for nodes within the receptive field of a possible local cluster are much higher than those beyond it. This ensures sensitivity to local substructures and introduces high-order node features.
We design GCont, an auto-learned global graph content that plays a significant role in MOA. The key innovation is that it incorporates high-level global pattern properties into the pooling method, making MOA sensitive to latent graph characteristics and producing a more adequate graph-level representation without the interference of artificial factors. It remains relatively stable during hierarchical pooling yet is flexible enough to be learned heuristically. GCont also guarantees generalization across graphs with the same form of features.
Extensive experiments demonstrate that (1) HAP significantly outperforms twelve graph pooling methods on six real-world datasets for the graph classification task, with a maximum accuracy improvement of 22.79%; (2) HAP sharply outperforms the state-of-the-art GMN, which is specifically designed for the graph matching task, boosting accuracy by up to 3.5%; (3) HAP also achieves a maximum accuracy gain of 16.7% compared with conventional approximate GED algorithms and GNN-based graph similarity learning models; (4) the graph coarsening module in HAP dramatically enhances the expressive ability of existing graph pooling architectures for graph-level tasks; (5) HAP achieves good generalization across graphs with the same form of features; and (6) HAP provides meaningful visualizations of graph-level representations.
2 Related Work
In this section, we summarize some related works on graph pooling in GNNs covering the two main types of methods: supervised and unsupervised.
2.1 Supervised Pooling
Supervised pooling methods can be divided into flat pooling and hierarchical pooling according to whether the graph-level representation is aggregated in a flat or hierarchical way with respect to local substructures. Flat pooling methods further cover two families, universal pooling and Top-K pooling, depending on the number of nodes participating in the final aggregation.
2.1.1 Flat Universal Pooling
Flat universal pooling methods take all nodes into consideration. Earlier works directly borrow the mean- or max-pooling methods from CNNs to extract features. Subsequently, Xu et al. find that sum-pooling is more powerful, because both the mean and max aggregators ignore the multiplicity of features and thus struggle to distinguish graphs whose nodes have repeating features. Some other works rely on content-based attention. In Gated Graph Neural Networks (GG-NNs), the graph-level output is defined by a soft attention mechanism that decides, for each node, how relevant it is to the graph-level task. Message Passing Neural Networks (MPNNs) further utilize the Set2Set method to take the order of nodes into consideration and find the importance of each node to the graph-level representation through time-consuming iterative soft attention. In SimGNN and UGRAPHEMB, a graph content is defined as the average of the node features, and attention is executed between the nodes and this content. Obviously, such a hand-crafted design makes the final graph-level representation infinitely close to the output of mean-pooling, an inefficient method as mentioned above.
2.1.2 Flat Top-K pooling
Flat Top-K pooling methods score the nodes according to their importance, and the nodes with the k largest scores are preserved to form the new graph. SortPooling draws on the graph labeling method WL: it regards the output node features of each GCN layer as continuous WL colors and sorts the nodes according to the last GCN layer's colors. AttPool calculates the scores using a global soft-attention mechanism; furthermore, a local attention variant incorporates node degree information, which helps keep a balance between importance and dispersion. gPool develops the new idea of using the projection of node features onto a trainable projection vector as the node score. SAGPool considers both node features and graph topology during pooling by using a GCN to calculate attention scores. However, important information contained in the abandoned nodes may be ignored, although it should be explicitly captured in graph pooling. Moreover, no structural relationships among nodes are acquired during pooling, which may leave the selected nodes disconnected.
2.1.3 Hierarchical Group Pooling
Because local substructures are prevalent in real-world graphs, hierarchical group pooling methods have come into being. DiffPool is the first differentiable group pooling approach; it learns a dense assignment matrix that groups the 1-hop neighborhoods of nodes into clusters in each hierarchical layer. Subsequently, to address the sparsity concerns in DiffPool, ASAP was proposed, combining Top-K and group methods: clusters are generated by aggregating the h-hop neighbors of each central node to leverage the graph structure, and then only the top-scoring clusters are maintained. However, ASAP still cannot guarantee the connectivity between the selected clusters. Moreover, Mesquita et al. indicate that a successful graph pooling method should not be restricted to nearby nodes. Thus we propose that high-order structural dependency also contributes to the construction of a good pooling region.
2.2 Unsupervised Pooling
Loss functions of the aforementioned graph pooling methods are usually task-supervised, except for DiffPool, which additionally exploits a link prediction loss that enforces nearby nodes to be pooled together. Recently, there has been growing interest in unsupervised graph pooling that minimizes objectives related to graph structural characteristics or borrowed from graph theory. StructPool employs conditional random fields (CRFs) to capture high-order structural relationships by minimizing the Gibbs energy. MinCutPool continuously relaxes the normalized minCUT problem from graph theory and optimizes the cluster assignments by minimizing this objective. UGRAPHEMB utilizes well-accepted, domain-agnostic graph proximity metrics to provide extra graph-graph proximity guidance during learning. These novel ideas offer the possibility of breaking the logjam of current graph pooling research.
It is also worth mentioning that universal pooling, Top-K pooling, and group pooling share a common challenge: the element-wise aggregation, score ranking, and cluster assignment learning processes are executed on a single fixed graph, lacking inductive capability for entirely new graphs.
3.1 Problem Statement
A graph is represented as $G = (V, E, L)$, where $V$ denotes the set of nodes, $E$ is the set of edges with $e_{ij} \in E$ linking nodes $v_i$ and $v_j$, and $L$ consists of the node labels (no node labels are provided in some cases). For a graph with $n$ nodes and $m$ edges, $A \in \mathbb{R}^{n \times n}$ represents the weighted adjacency matrix and $D$ is a diagonal matrix whose diagonal elements are the node degrees. $X \in \mathbb{R}^{n \times d}$ denotes the node feature matrix and $h_G$ is the graph-level embedding. A label $y$ may also be attached to the graph $G$. Detailed notations are summarized in TABLE I. Given a graph dataset, the graph pooling task aims to learn a mapping from a node feature matrix to a single graph representation.
| Notations | Definitions or Descriptions |
| --- | --- |
| $G$, $G'$ | the input / coarsened graph |
| $V$, $E$, $L$ | the node / edge / node label set of $G$ |
| $n$, $n'$ | the number of nodes of $G$ / $G'$ |
| $A$, $A'$ | the adjacency matrix of $G$ / $G'$ |
| $D$ | the degree matrix of $G$ |
| $X$, $X'$ | the node feature matrix of $G$ / $G'$ |
| $h_G$ | the graph-level embedding of $G$ |
| $K$ | the number of graph coarsening modules |
| $C$ | the auto-learned global graph content |
| row $i$ of $C$ | refers to node $i$ of the source graph $G$ |
| column $j$ of $C$ | refers to cluster $j$ of the target graph $G'$ |
| $M$ | the MOA matrix |
| $d$ | the dimension of the input node features |
| $d'$ | the dimension of the output node features |
| $d_g$ | the dimension of the graph-level embedding |
| $y$ | the label of graph $G$ |
3.2 Downstream Tasks
Most previous works focus only on the application of graph pooling to graph classification, which is an important graph-level representation learning task but only a partial view. In this paper, we formally summarize and define the downstream graph pooling tasks and conduct exhaustive experiments on them:
Graph Classification: Given an input graph $G$, the graph classification task tries to learn a mapping from the graph to the corresponding graph label $y$.
Graph Matching: Given an input graph pair $(G_1, G_2)$, the graph matching task aims to determine whether $G_1$ and $G_2$ are isomorphic¹ to each other. (¹ For a pair of graphs, graph isomorphism decides whether there exists a bijective function between them such that nodes are connected in the same way.)
Graph Similarity Learning: Given an input graph triple $(G_1, G_2, G_3)$, the graph similarity learning task explores whether $G_1$ is closer to $G_2$ or to $G_3$.
3.3 Graph Neural Networks
Given node features $X$ and graph structure $A$, modern GNNs usually learn useful node representations in a neighborhood aggregation fashion following the general "message-passing" architecture. The forward process comprises two phases, which iteratively run for $T$ time steps. The message passing phase aggregates information from the neighbors of the central node along its edges, and the combination phase then updates the representation of the central node based on the message:

$$m_v^{(t)} = \mathrm{AGGREGATE}\big(\{\, h_u^{(t-1)} : u \in \mathcal{N}(v) \,\}\big), \quad (1)$$

$$h_v^{(t)} = \mathrm{COMBINE}\big(h_v^{(t-1)},\, m_v^{(t)}\big), \quad (2)$$

where $h_v^{(t)}$ is the embedding of node $v$ at the $t$-th iteration, initialized as $h_v^{(0)} = x_v$, and $\mathcal{N}(v)$ denotes node $v$'s neighbors according to the adjacency matrix.

There are multiple selectable implementations of $\mathrm{AGGREGATE}$ and $\mathrm{COMBINE}$ adapted successfully to different GNN models. Actually, our HAP pooling framework can be consolidated into any GNN model following the implementation of Equation 1 and Equation 2. After $T$ iterations, the representation of the central node captures the features and structural information within its $T$-hop neighborhood.
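A minimal numerical sketch of one message-passing iteration may clarify the two phases; the mean aggregator and the ReLU-based combiner below are illustrative choices of this sketch, not the specific operators of any one model:

```python
import numpy as np

def message_passing_step(A, H, W_msg, W_upd):
    """One iteration of the generic message-passing scheme.

    A: (n, n) adjacency matrix, H: (n, d) node embeddings.
    AGGREGATE: mean over 1-hop neighbors; COMBINE: linear update + ReLU.
    """
    deg = A.sum(axis=1, keepdims=True)           # node degrees
    deg[deg == 0] = 1                            # guard against isolated nodes
    M = (A @ H) / deg                            # mean-aggregate neighbor messages
    return np.maximum(0, H @ W_upd + M @ W_msg)  # combine, then apply ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # toy 3-node graph
H = rng.normal(size=(3, 4))
W_msg, W_upd = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
H1 = message_passing_step(A, H, W_msg, W_upd)
```

Stacking $T$ such steps gives each node a view of its $T$-hop neighborhood, as stated above.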
3.4 Graph Attention
Graph attention, executed between a query and a key, allocates diverse alignment scores to different parts of the input, making the model focus on the most relevant portion. Existing graph attention mechanisms can be divided into node-level attention and master-level attention according to the attention scope. Specifically, node-level attention covers both self-attention and cross-attention.
Hard-Self-Attention (HSA) chooses both the query and the key from the node features of a single input graph to find the dependency of nodes on themselves:

$$e_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\top}[\, W h_i \,\|\, W h_j \,]\big),$$

where $W$ and $\vec{a}$ are trainable parameters, and $\|$ is a concatenation operation.
Soft-Self-Attention (SSA) decides which nodes are relevant to the current graph-level task, so the query is defined over the node features but no specific key is provided.
Cross-Attention (CA) captures the differences between graphs by making comparisons across a pair of graphs, choosing the query and the key from the node features of the pairwise input, thus fusing information from both graphs.
4 The Proposed Method: HAP
In this section, we present HAP, a hierarchical graph pooling framework for graph-level tasks. Its key component is the graph coarsening module, supported by the novel graph pattern property extraction technique GCont and the cross-level attention mechanism MOA, which complement and reinforce each other. This design not only prompts the GNN model to be sensitive to both local substructures and high-order dependencies, but also endows it with stronger generalization ability. Below, we discuss the components of HAP in detail.
4.1 Hierarchical Framework
Figure 2 illustrates the overall architecture of HAP. Given single, pairwise, or triplet input graphs for the different graph-level tasks, HAP extracts node features and graph structure information for end-to-end training. The process can be decomposed into six main steps:
Input Construction: The single-input graph classification and pairwise graph matching tasks require no special operation on the given dataset, which consists of single graphs or pairs of graphs with true labels indicating which class a graph belongs to or whether a pair matches. However, a triplet generator is necessary for the graph similarity learning task.
Node & Cluster Embedding: Subsequently, the single graphs, pairs, or generated triplets are passed into a node & cluster embedding module to learn a low-dimensional vector representation for each node or coarsened cluster. The cluster representations are retained for hierarchical similarity measuring.
Graph Coarsening-I: Then, a learnable GCont defines a coarsening preparation step for each graph by extracting global pattern properties. Its rows and columns correspond to the source nodes before coarsening and the target clusters after coarsening, respectively.
Graph Coarsening-II: Furthermore, the MOA mechanism is utilized to obtain an attention assignment. The attention coefficient matrix, each element of which indicates the contribution of a node from the source graph $G$ to a cluster of the target graph $G'$, is derived from the GCont.
Graph Coarsening-III: Afterwards, a cluster formation function is learned to compute the cluster representations after one coarsening step.
Learning: The loop between Step 2 and Step 5 is executed until a satisfactory graph scale is reached. HAP then calculates the corresponding task loss with the hierarchical graph representations to optimize all weight parameters.
4.2 Input Construction
For the graph similarity learning task, training and testing data in triplet form are essential to learn the relative similarity among graphs. To the best of our knowledge, there is no ready-to-use graph dataset in triplet form. To bridge this gap, we propose a triplet generator in this subsection.
Given a dataset of single graphs, denoted as $\mathcal{D} = \{G_1, G_2, \dots, G_N\}$, the similarity between any two graphs $G_i$ and $G_j$ can be measured under a graph-graph proximity metric $s(\cdot, \cdot)$, such as Graph Edit Distance (GED); the smaller the GED, the more similar the pair. The pairwise ground-truth proximity is then denoted as $s_{ij} = s(G_i, G_j)$.

Afterwards, we construct triplets by fixing the first position with one graph $G_i$ and randomly choosing two disparate graphs $G_j$ and $G_k$ to fill the remaining two positions.

Synchronously, the ground-truth triplet proximity is generated as $t_{ijk} = s_{ij} - s_{ik}$, in which a positive number means that graph $G_i$ is more similar to graph $G_j$, and a negative number means that graph $G_i$ is more similar to graph $G_k$.
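The triplet generator can be sketched as follows. The `proximity` callable stands in for a real graph-graph metric (e.g., a similarity derived from GED), and the toy "graphs" and all names here are illustrative assumptions of this sketch:

```python
import random

def make_triplets(graphs, proximity, num_triplets, seed=0):
    """Build (anchor, g_j, g_k) triplets with ground-truth proximity
    t = proximity(anchor, g_j) - proximity(anchor, g_k):
    t > 0 means the anchor is more similar to g_j, t < 0 to g_k."""
    rng = random.Random(seed)
    triplets, targets = [], []
    for _ in range(num_triplets):
        anchor = rng.choice(graphs)
        g_j, g_k = rng.sample([g for g in graphs if g != anchor], 2)
        triplets.append((anchor, g_j, g_k))
        targets.append(proximity(anchor, g_j) - proximity(anchor, g_k))
    return triplets, targets

# Toy "graphs" as node counts; proximity = negative size gap (a stand-in
# for a GED-based similarity).
graphs = [2, 3, 5, 8]
trips, ts = make_triplets(graphs, lambda a, b: -abs(a - b), num_triplets=10)
```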
4.3 Node & Cluster Embedding
Node or cluster features must be extracted before entering the next graph coarsening module. In this paper, we employ a two-layer GAT or GCN as the basic component, since both are well capable of capturing the local structural information of a node. Actually, any mainstream GNN can be integrated into the HAP framework. Please note that the number of GAT or GCN layers depends on the real application graph data.
For the $k$-th GAT layer, it takes graph $G$'s adjacency matrix $A$ and the hidden representation matrix $H^{(k)}$ as input, and formulates the phase as a weighted-attention-based operator:

$$H^{(k+1)} = \sigma\big((\mathrm{Att} \odot A)\, H^{(k)} W^{(k)}\big),$$

where $\sigma$ is a non-linear activation function such as ReLU or Sigmoid, $\mathrm{Att}$ is a trainable global attention assignment among all nodes, the element-wise product $\odot$ with $A$ picks the one-hop neighborhood attention, and $W^{(k)}$ is a trainable weight matrix.

For the $k$-th GCN layer, the propagation rule is:

$$H^{(k+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\, H^{(k)} W^{(k)}\big),$$

where $\tilde{A} = A + I$ is the adjacency matrix plus self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$ (i.e., $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the symmetrically normalized operator for graph $G$. With one convolutional layer, GCN preserves the first-order neighborhood information between nodes. By stacking multiple GCN layers, it can encode higher-order (e.g., k-hop) neighborhood relationships.
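The GCN propagation rule with self-loops and symmetric normalization admits a direct sketch in a few lines of NumPy; the toy graph and weights are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D_t^{-1/2} A_t D_t^{-1/2} H W), A_t = A + I."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D_t^{-1/2} diagonal
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0, A_hat @ H @ W)              # propagate, then ReLU

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3-node path graph
H = np.eye(3)                                             # one-hot node features
W = np.ones((3, 2))
out = gcn_layer(A, H, W)
```

With identical columns in `W`, both output columns coincide, which makes the normalization easy to eyeball.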
Specifically, different from classical GAT or GCN, where the graph scale is stable throughout training, HAP scales nodes down into clusters in the graph coarsening module before transferring them to the next node & cluster embedding layer. As a result, $A$, $H$, and the graph size change with the action of graph coarsening (cf. Eq. 18).
4.4 Graph Coarsening
We achieve graph coarsening through the graph global pattern property extraction technique GCont and the cross-level attention mechanism MOA. We show the graph coarsening module architecture in Fig. 3 and elaborate on the details in this subsection. Further, Algorithm 1 gives the pseudocode for the graph coarsening module.
4.4.1 Attention Preparation using GCont
Given the node features of the source graph, the task of the coarsening process is to learn the cluster assignment matrix through an attention mechanism. However, one important thing ignored by all group pooling methods is that the pre- and post-coarsening graph content should remain stable without loss of important information. We observe that both DiffPool and ASAP receive no global guidance. Hence, we propose GCont, an auto-learned global graph content sustaining the coarsening process.
As an initial step, we propose using one learnable linear transformation, parametrized by a weight matrix $W_c$, to generate GCont. The simple linear transformation also combines scalability with the ability to deal with relatively large graphs. The global graph content is converted from the node feature matrix as:

$$C = X W_c, \qquad C_{ij} = \sum_{k} X_{ik}\, (W_c)_{kj},$$

where $X_{ik}$ and $(W_c)_{kj}$ denote the elements at the $i$-th row and $k$-th column of $X$, and the $k$-th row and $j$-th column of $W_c$, respectively. $C \in \mathbb{R}^{n \times n'}$ is the automatically learned global graph content matrix, in which each row corresponds to a node of the source graph $G$ and each column corresponds to a cluster node of the target coarsened graph $G'$.

The GCont bridges the gap between the source graph and the target graph and maintains their consistency. On the one hand, the elements of $C$ reflect the interaction between nodes of the source graph and clusters of the coarsened graph. On the other hand, they contain the graph pattern properties cohering before and after coarsening, thus facilitating generalization across graphs with the same form of features.
4.4.2 Attention Assignment using MOA
HAP intends to achieve graph downsampling through a cross-level attention-based aggregator for information interaction between the source graph and the coarsened target graph, utilizing global graph property guidance. However, we observe that both HSA and SSA described in Sec. 3.4 focus on only one single graph, while CA does not utilize any global information. Although MA introduces master information into the attention process, it is highly affected by the man-made master function. To that end, we propose a new variant of the attention mechanism called Master-Orthogonal-Attention (MOA).
Computation of Attention Assignment: The input of the MOA mechanism is a well-learned representation matrix $H \in \mathbb{R}^{n \times d'}$, where $n$ is the number of nodes of the source graph $G$ and $d'$ is the feature dimension of each node. The graph coarsening module then produces a new coarsened graph representation matrix $X' \in \mathbb{R}^{n' \times d'}$ as its output, where $n'$ is the number of clusters of the coarsened graph. Each cluster is then regarded as an individual node. Meanwhile, the adjacency matrix $A$ is also updated to $A'$. Please note that the number of graph coarsening modules and the coarsened graph size are determined by the real application graph data; in our experiments, we employ two coarsening modules and evaluate this choice.
After obtaining the global graph content matrix, we can employ an orthogonal² cross-level attention mechanism between the nodes of the source graph and the clusters of the target coarsened graph. (² The terminology "orthogonal" here refers to rows and columns of a 2D matrix, which is different from the meaning of orthogonal vectors in the mathematical sense.) The attention matrix $M$ is formed with elements as follows:

$$M_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[\, C_{i,:} \,\|\, C_{:,j} \,]\big),$$

where $\mathrm{LeakyReLU}$ is the nonlinearity, $\|$ is a concatenation operation with the dimension of $C_{:,j}$ relaxed from $n$ to $n_{max}$, and $\vec{a}$ is the trainable shared attentional parameter with relaxed dimension $n' + n_{max}$. The reason for the relaxation is given below.

$M$, which is equivalent to a cross-level aggregator, offers a fully-connected information channel between the source-graph nodes and the target coarsened-graph clusters, with each element $M_{ij}$ indicating the importance of node $i$'s features to cluster $j$. The local substructure is preserved by the attention mechanism, while the high-order dependency is also captured through the fully-connected information channel, thus strengthening feature preservation. We normalize $M$ column-wise for better evaluation:

$$M_{ij} \leftarrow \frac{\exp(M_{ij})}{\sum_{k=1}^{n} \exp(M_{kj})}.$$
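The interplay of GCont and MOA can be sketched as follows. The symbol choices (`W_c`, `a`, `n_max`), the zero-padding of both concatenation operands to `n_max`, and the column-wise softmax are simplifying assumptions of this sketch, not necessarily the paper's exact formulation:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

def moa(X, W_c, a, n_max):
    """GCont + MOA sketch.

    C = X @ W_c: rows index source nodes, columns index target clusters.
    M[i, j] scores node i's contribution to cluster j via a shared vector `a`
    applied to [row_i(C) || col_j(C)], with both parts zero-padded to n_max
    so that `a` has a fixed size independent of the input graph."""
    n, n_prime = X.shape[0], W_c.shape[1]
    C = X @ W_c
    M = np.empty((n, n_prime))
    for i in range(n):
        row = np.zeros(n_max); row[:n_prime] = C[i, :]
        for j in range(n_prime):
            col = np.zeros(n_max); col[:n] = C[:, j]
            M[i, j] = leaky_relu(a @ np.concatenate([row, col]))
    # Column-wise softmax: contributions to each cluster sum to 1.
    M = np.exp(M - M.max(axis=0))
    return C, M / M.sum(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))        # 5 source nodes, 4-dim features
W_c = rng.normal(size=(4, 3))      # coarsen to 3 clusters
a = rng.normal(size=(2 * 6,))      # relaxed parameter, n_max = 6
C, M = moa(X, W_c, a, n_max=6)
```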
The MOA mechanism synthesizes self-attention and cross-attention with master-attention. On the one hand, the proposed MOA mechanism calculates the attention coefficients based on the GCont alone, so it can be sorted into self-attention. On the other hand, the attention is computed between the source graph and the target coarsened graph, so it may also be classified as cross-attention.
Relaxation of the Attentional Parameter: In the traditional graph attention scheme, the attention coefficient $e_{ij}$ is calculated as follows:

$$e_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\top}[\, W h_i \,\|\, W h_j \,]\big),$$

where $\mathrm{LeakyReLU}$ is the nonlinearity, $\vec{a} \in \mathbb{R}^{2d'}$ is the trainable shared attentional parameter, $W \in \mathbb{R}^{d' \times d}$ is a weight matrix producing new node features from cardinality $d$ to $d'$, $h_i$ and $h_j$ are the input node features, and $\|$ is a concatenation operation.
Apparently, the trainable shared attentional parameter in the conventional graph attention mechanism is independent of the node number of the input graph. However, in our MOA mechanism, the dimension of $C_{:,j}$ is related to the node number $n$ of the input source graph, making the concatenation of dimension $n' + n$. As a result, the trainable shared attentional parameter would have to be initialized with dimension $n' + n$, which is sensitive to the node number of the input source graph.

Manifestly, $n$ varies with the input and is unknown at the parameter initialization stage. Thus, a proper relaxation offers intriguingly good performance where standard techniques would suffer. We loosen the dimension $n$ to $n_{max}$, so that $\vec{a} \in \mathbb{R}^{n' + n_{max}}$. We theoretically analyze the validity of this relaxation with respect to the prediction outcome in Sec. 5.3.
4.4.3 Cluster Formation
The learning of the global graph content and the cross-level aggregator constitutes a concordant unity, whose parts complement and constrain each other. Subsequently, we generate the coarsened graph representation matrix $X'$ and update the adjacency matrix:

$$X' = M^{\top} H, \qquad A' = M^{\top} A M.$$
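The coarsening step can be illustrated with the standard construction used by hierarchical group pooling methods such as DiffPool, where a soft assignment matrix produces cluster features and a coarsened adjacency; whether HAP's exact formulas coincide should be checked against the paper, so treat this as a sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_prime, d = 6, 2, 4
H = rng.normal(size=(n, d))                      # node representations
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
M = rng.random((n, n_prime))
M /= M.sum(axis=1, keepdims=True)                # soft cluster assignment

X_coarse = M.T @ H        # (n', d) cluster features
A_coarse = M.T @ A @ M    # (n', n') coarsened adjacency
```

Note that `A_coarse` inherits symmetry from `A`, so the coarsened graph remains undirected.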
4.4.4 Soft Sampling
According to Lee et al., handling adjacency data with a sparse matrix in a GNN decreases both the computational and the space complexity. However, the coarsened adjacency matrix turns from a sparse matrix into a dense one under the soft assignment; that is, the structure of the source graph is refined into a fully-connected downsampled one. Proper edge sampling saves both time and storage without a dramatic loss of accuracy. As a workaround, we adopt the Gumbel-Softmax to achieve soft sampling of the neighborhood relationship, thus decreasing the edge density of the sampled adjacency matrix $\hat{A}$:

$$\hat{A}_{ij} = \frac{\exp\big((\log A'_{ij} + g_{ij}) / \tau\big)}{\sum_{k} \exp\big((\log A'_{ik} + g_{ik}) / \tau\big)},$$

where $g_{ij} \sim \mathrm{Gumbel}(0, 1)$ and $\tau$ is the softmax temperature. Here, we set $\tau$ small to make the adjacency distribution close to one-hot. This operation reduces the edge density as much as possible while preserving the connectivity of the graph.
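A row-wise Gumbel-Softmax sampling step can be sketched as follows; building the logits from a small dense adjacency, and the particular temperature, are illustrative assumptions:

```python
import numpy as np

def gumbel_softmax_rows(logits, tau=0.1, seed=0):
    """Row-wise Gumbel-Softmax: with a small temperature tau, each row of the
    dense coarsened adjacency is pushed toward a (near) one-hot distribution,
    sparsifying the graph while keeping the operation differentiable."""
    rng = np.random.default_rng(seed)
    # Sample Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))   # numerically stable softmax
    return y / y.sum(axis=1, keepdims=True)

dense_A = np.log(np.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.3, 0.3, 0.4]]))      # log-weights of a dense A'
A_sampled = gumbel_softmax_rows(dense_A, tau=0.1)
```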
4.5 Learning

The proposed HAP supports three types of input: a single graph for graph classification, pairwise graphs for graph matching, and triplet graphs for graph similarity learning. All input graphs are coarsened to a 1D vector at the final graph embedding layer, which can be used to compute graph similarity directly. Meanwhile, as demonstrated in the model structure, HAP alternates between node embedding and graph coarsening, generating a different graph representation matrix at each graph coarsening layer. As a result, we also propose a hierarchical similarity measure that jointly utilizes the hierarchical graph representations.
4.5.1 Loss Functions

For the graph classification task with a single input graph $G$, the final graph representation $h_G$ is directly fed into two fully-connected layers with a Softmax activation on the output to get the predicted label $\hat{y}$. We then optimize the model with a standard cross-entropy loss on graphs that have ground-truth labels $y$. The fully-connected layers and the objective function can be represented as follows:

$$z = \sigma_1(W_1 h_G + b_1), \qquad \hat{y} = \sigma_2(W_2 z + b_2),$$

$$\mathcal{L}_{cls} = -\sum_{G \in \mathcal{T}} \sum_{c=1}^{C} y_{G,c} \log \hat{y}_{G,c},$$

where $W_i$ and $b_i$ represent the weights and biases of the $i$-th fully-connected layer for $i \in \{1, 2\}$, $\sigma_1$ and $\sigma_2$ are the ReLU and Softmax activation functions, respectively, $\mathcal{T}$ is the training set of single graphs, and $C$ denotes the number of classes.
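The classification head described above admits a direct sketch; the weights, dimensions, and one-hot label below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(h_G, W1, b1, W2, b2):
    """Two fully-connected layers over the final graph embedding:
    ReLU after the first layer, Softmax after the second."""
    z = np.maximum(0, W1 @ h_G + b1)
    return softmax(W2 @ z + b2)

def cross_entropy(p, y):
    """Standard cross-entropy against a one-hot ground-truth label y."""
    return -float(np.log(p[np.argmax(y)] + 1e-12))

rng = np.random.default_rng(3)
h_G = rng.normal(size=4)                        # graph-level embedding (d_g = 4)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # 3 classes
p = classify(h_G, W1, b1, W2, b2)
loss = cross_entropy(p, np.array([0., 1., 0.]))
```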
For the graph matching task with pairwise input graphs $(G_1, G_2)$, pairs are labeled with $y = 1$ or $y = 0$, representing similar or dissimilar, respectively. We optimize a normalization function that pushes the model to convert graph distances into similarity scores with distribution $p^{(k)}$:

$$p^{(k)} = \exp\big(-\lambda\, d^{(k)}(G_1, G_2)\big),$$

where $\lambda$ denotes a softmax parameter sensitive to the range of distances and is determined by the real application graph data; basically, we set it to 0.5. $d^{(k)}(G_1, G_2)$ represents the distance between the representations of the graph pair at coarsening level $k$, for which we use the Euclidean distance. The model is then optimized by a hierarchical cross-entropy function as follows:

$$\mathcal{L}_{match} = -\sum_{(G_1, G_2) \in \mathcal{T}} \sum_{k} \Big[ y \log p^{(k)} + (1 - y) \log\big(1 - p^{(k)}\big) \Big],$$

where $\mathcal{T}$ is the training set of pairwise graphs and $y$ is the label of the pair.
For the graph similarity learning task with triplet input graphs $(G_1, G_2, G_3)$, a hierarchical Mean Squared Error (MSE) loss function is employed as follows:

$$\mathcal{L}_{sim} = \sum_{(G_1, G_2, G_3) \in \mathcal{T}} \sum_{k} \Big( \big(d^{(k)}(G_1, G_3) - d^{(k)}(G_1, G_2)\big) - t \Big)^2,$$

where $\mathcal{T}$ is the training set of triplet graphs and $t$ denotes the ground-truth triplet proximity defined by the relative Graph Edit Distance (GED) in Sec. 4.2.
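A hedged sketch of a hierarchical triplet MSE loss follows; the tanh squashing of the distance gap and all names here are modeling choices of this sketch rather than the paper's confirmed formulation:

```python
import numpy as np

def hierarchical_triplet_mse(reps1, reps2, reps3, t):
    """Hierarchical MSE over coarsening levels for a triplet (G1, G2, G3).

    reps*: lists of level-wise embeddings of each graph; t is the ground-truth
    triplet proximity (positive: G1 closer to G2). The distance gap is squashed
    with tanh before comparison -- an assumption of this sketch."""
    loss = 0.0
    for h1, h2, h3 in zip(reps1, reps2, reps3):
        gap = np.linalg.norm(h1 - h3) - np.linalg.norm(h1 - h2)
        loss += (np.tanh(gap) - t) ** 2
    return loss / len(reps1)

rng = np.random.default_rng(4)
reps1 = [rng.normal(size=8) for _ in range(2)]   # two coarsening levels
reps2 = [h + 0.01 for h in reps1]                # G2 nearly identical to G1
reps3 = [rng.normal(size=8) for _ in range(2)]
loss = hierarchical_triplet_mse(reps1, reps2, reps3, t=1.0)
```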
4.5.2 Hierarchical Prediction
As shown in Fig. 2, we adopt a hierarchical prediction strategy to further facilitate the training process and fully utilize the hierarchical intermediate features of coarsened graphs. The outputs of every coarsening process are summarized as the intermediate graph feature, which will be fed into the learning module for graph matching or graph similarity learning.
5 Theoretical Analysis
5.1 Computational Complexity Analysis
In the following, we theoretically analyze the computational complexity of the proposed HAP and show the superiority of the proposed graph coarsening module.
Claim 1 (Time Complexity)
The time complexity of the proposed HAP with graph coarsening modules in dowmsampling ratio is approximately , where is the number of nodes of the original input graph.
The time complexity of HAP involves three parts corresponding to the three stages of GNN-based graph-level representation learning models: (1) node embedding; (2) graph coarsening; and (3) learning. The time complexity of node embedding stage is , where is the dimension of output node features. After that, to downsample node number in the -th graph coarsening module, where , it requires . Let’s suppose remains constant among all the coarsening modules. Then the time complexity for all the graph coarsening modules is . Due to the fact that is less than 1, is a couple of orders of magnitude smaller than . So the time complexity of graph coarsening stage is roughly equivalent to . Eventually, for the learning stage, the time complexity is , where is the dimension of the graph level embedding for the input graph . Therefore, the overall computational complexity of the proposed HAP framework is .
Specifically, when a proper coarsening ratio is chosen such that $rn \le d$, the actual execution time of the proposed HAP becomes almost linear in $n$.
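As a sanity check on this claim, the following toy operation-count model (the per-stage costs are our own rough assumptions: embedding proportional to $nd^2$, the $i$-th coarsening module proportional to $rn_i^2d$) shows that the coarsening stage stays cheaper than the embedding stage whenever $rn \le d$, so the total grows roughly linearly with $n$:

```python
def hap_cost_estimate(n, d, r, num_modules=2):
    """Rough per-stage operation counts for HAP (an illustrative cost model,
    not a measurement): embedding ~ n*d^2; the i-th coarsening module
    compares n_i nodes against r*n_i clusters in dimension d, ~ r*n_i^2*d."""
    embed = n * d * d
    coarsen = 0.0
    n_i = n
    for _ in range(num_modules):
        coarsen += r * n_i * n_i * d
        n_i = int(r * n_i)
    return embed, coarsen

# Example: n=1000, d=128, r=0.1 satisfies r*n < d, so coarsening is
# dominated by the embedding stage and total time scales ~linearly in n.
embed, coarsen = hap_cost_estimate(n=1000, d=128, r=0.1)
```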
5.2 Permutation Invariance
Graph pooling methods need to be permutation invariant: they should guarantee that the graph-level representation does not vary with the input order of the node-level representations. We prove that the proposed graph coarsening module is permutation invariant.
Definition 5.1 (Permutation matrix)
$P \in \{0,1\}^{n \times n}$ is a permutation matrix of size $n$ iff $P\mathbf{1}_n = \mathbf{1}_n$ and $P^\top\mathbf{1}_n = \mathbf{1}_n$.
Claim 2 (Permutation invariance)
Let $P$ be any permutation matrix and $G = (A, X)$ be any undirected graph with adjacency matrix $A$ and node feature matrix $X$. Let $f(A, X)$ be a pooling operation depending on graph $G$; a graph permutation is defined as $f(PAP^\top, PX)$. The proposed graph coarsening module is graph permutation invariant.
The assignment matrix $S$ between source nodes and target coarsened clusters is computed by an attention mechanism. Since the attention function operates between the node set and the cluster set as a whole, the order of nodes or clusters has no effect on the result, so permuting the nodes merely permutes the rows of the assignment matrix: $S' = PS$.
Since any permutation matrix is orthogonal ($P^\top P = I$), applying the permutation to the coarsened adjacency matrix gives $(PS)^\top(PAP^\top)(PS) = S^\top AS$.
Since $P^\top P = I$, applying the permutation to the coarsened feature matrix gives $(PS)^\top(PX) = S^\top X$.
As a result, $f(PAP^\top, PX) = f(A, X)$, and HAP is graph permutation invariant.
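The claim can also be checked numerically with a toy attention pooling of the same shape (softmax attention from cluster queries to nodes; a simplified stand-in for the coarsening module, not its exact implementation): permuting the node order leaves the pooled cluster features unchanged.

```python
import math, random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(X, C):
    """Coarsen node features X (n x d) into len(C) cluster features via
    softmax attention between each cluster query in C (k x d) and the nodes."""
    pooled = []
    for c in C:
        scores = softmax([sum(a * b for a, b in zip(c, x)) for x in X])
        pooled.append([sum(w * x[j] for w, x in zip(scores, X))
                       for j in range(len(X[0]))])
    return pooled

random.seed(0)
X = [[random.random() for _ in range(4)] for _ in range(6)]   # 6 nodes, d = 4
C = [[random.random() for _ in range(4)] for _ in range(2)]   # 2 cluster queries
perm = list(range(6))
random.shuffle(perm)
Xp = [X[i] for i in perm]   # same graph, nodes reordered
# attention_pool(X, C) and attention_pool(Xp, C) agree up to float error
```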
5.3 Validity of Relaxation for Attentional Parameter
In Sec. 4.4.2, we perform a relaxation operation on the attentional parameter. In essence, the relaxation is applied to the column dimension of GCont during concatenation. A natural question is whether the relaxation affects the accuracy of the attention coefficients, which could directly lead to neglecting important information during cross-level aggregation. We now analyze this question theoretically.
Definition 5.2 (LeakyReLU)
LeakyReLU is a monotonically increasing activation function: $\mathrm{LeakyReLU}(x) = x$ for $x \ge 0$ and $\mathrm{LeakyReLU}(x) = \alpha x$ for $x < 0$, with a fixed slope $0 < \alpha < 1$.
Let $a$, $b$, and $c$ be the vectors before relaxation, let $a'$ and $c'$ be the corresponding vectors after relaxation, let $\|$ be the concatenation operation, and let LeakyReLU be the nonlinearity. Then the attention coefficient computed after relaxation equals the one computed before relaxation.
The essence of the attention computation is a similarity comparison between the concatenated vectors. Because vectors with different dimensions are non-comparable, the missing block has to be padded with zeros, so the relaxed vector is compared as $a'\|\mathbf{0}\|c'$. The zero block contributes nothing to the inner product:
$(a\|b\|c)^\top(a'\|\mathbf{0}\|c') = a^\top a' + c^\top c' = (a\|c)^\top(a'\|c').$
Hence the pre-activation values coincide, and since LeakyReLU is monotonically increasing, $\mathrm{LeakyReLU}\big((a\|b\|c)^\top(a'\|\mathbf{0}\|c')\big) = \mathrm{LeakyReLU}\big((a\|c)^\top(a'\|c')\big)$.
As a result, the relaxation of the attentional parameter has no negative effect on the attention computation and feature extraction.
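The padding argument is easy to verify numerically (a toy check with made-up vectors, not the paper's implementation): zero-padding the missing block contributes nothing to the inner product, and a monotone LeakyReLU preserves the equality.

```python
def leaky_relu(x, alpha=0.2):
    """Monotonically increasing nonlinearity with slope alpha below zero."""
    return x if x >= 0 else alpha * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a, b, c = [1.0, -2.0], [0.5, 3.0], [4.0, 1.0]   # vectors before relaxation
a2, c2 = [2.0, 0.5], [-1.0, 2.0]                # vectors after relaxation

full = dot(a + b + c, a2 + [0.0, 0.0] + c2)  # zero-pad the missing block
short = dot(a + c, a2 + c2)                  # compare without the block
# full == short, hence leaky_relu(full) == leaky_relu(short)
```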
6 Experiments and Evaluation
We evaluate HAP against a number of state-of-the-art methods to answer the following questions:
Q2: How does the original HAP compare with ablated ones whose graph coarsening module is replaced by other state-of-the-art pooling algorithms? (Sec. 6.5.1)
Q3: How does the number of the graph coarsening modules influence the quality of graph-level representations generated by HAP? (Sec. 6.5.2)
Q4: Do key designs of HAP contribute to better generalization performance? (Sec. 6.5.3)
6.1 Experimental Setup
We perform experiments on eight real-world datasets and one synthetic dataset, depending on the task. The graph statistics are summarized in TABLE II.
Evaluating the graph matching task requires benchmark datasets with ground-truth labels (true for matching and false for unmatching). To the best of our knowledge, no publicly available real-world dataset holds such ground-truth labels. To fill this gap, we construct a synthetic dataset, a collection of labeled graph pairs generated with a fixed edge probability and labeled by the VF2 graph matching library. Given a graph, a positive sample is a maximum connected subgraph randomly extracted with 1 to 3 fewer nodes, and a negative sample is created by randomly adding 3 to 7 nodes with the same edge probability.
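A standard-library sketch of how such a synthetic pair could be generated (the actual dataset uses the VF2 library for labeling; the node counts, edge probability, and helper names here are illustrative assumptions):

```python
import random

def er_graph(n, p, rng):
    """Erdos-Renyi-style random graph as an adjacency set per node."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j); adj[j].add(i)
    return adj

def largest_component(adj):
    """Return the induced subgraph on the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}

def positive_sample(adj, rng):
    """Drop 1-3 random nodes and keep the largest connected subgraph."""
    drop = set(rng.sample(sorted(adj), rng.randint(1, 3)))
    kept = {u: adj[u] - drop for u in adj if u not in drop}
    return largest_component(kept)

def negative_sample(adj, p, rng):
    """Randomly add 3-7 new nodes with the same edge probability."""
    adj = {u: set(vs) for u, vs in adj.items()}
    for new in range(len(adj), len(adj) + rng.randint(3, 7)):
        adj[new] = set()
        for u in list(adj):
            if u != new and rng.random() < p:
                adj[new].add(u); adj[u].add(new)
    return adj

rng = random.Random(0)
g = er_graph(20, 0.3, rng)
pos, neg = positive_sample(g, rng), negative_sample(g, 0.3, rng)
```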
We compare HAP with three kinds of baselines:
Graph pooling baseline: For comparison of total pooling, we choose GCN-concat (concatenation of GCN-based node-level representations), SumPool , MeanPool, MeanAttPool , and Set2Set . For TopK pooling, we use SortPooling , AttPool , gPool  and SAGPool . For group pooling, we compare with DiffPool  and ASAP . We also conduct evaluation on an unsupervised method StructPool .
Graph matching baseline: We focus on the Graph Matching Network (GMN)  specifically designed for pairwise graph similarity learning.
Graph similarity learning baseline: There are two categories of graph similarity learning baselines. Because the ground-truth triplet proximity for the graph similarity learning task is calculated by a conventional exact GED algorithm, the first type comprises conventional approximate GED algorithms, including the Beam search , VJ  and Hungarian  algorithms. The other type includes SimGNN  and GMN, which are GNN-based models.
6.1.3 Parameter Settings
For the basic model structure of HAP, we set two node & cluster embedding layers before each graph coarsening module, and a total of two coarsening modules are used. The Adam optimizer is used with an initial learning rate of 0.01 for the graph classification datasets, 0.0015 for AIDS, and 0.0001 for LINUX and the synthetic data. For the social network datasets IMDB and COLLAB with no informative node features, we use one-hot encodings of node degrees as the initial node input. Similarly, we adopt one-hot encodings of node labels for the AIDS dataset, while the others are initialized identically. For graph classification and the other tasks, the initial dimension is 64 and 128, respectively. All datasets are randomly partitioned 8:1:1 into training/validation/testing. For all baseline methods, we conduct experiments under the default settings reported in the original work.
6.2 Task1: Graph Classification
We evaluate HAP on six benchmark graph classification datasets and compare it with several state-of-the-art approaches from the different pooling categories. For AttPool, we try different attention functions (global attention and local attention) to obtain the graph-level representations. For HAP, we try GAT and GCN for the node & cluster embedding operation and report the better accuracy. TABLE III shows the accuracy with the best results marked in bold. We observe that HAP obtains the best performance on five out of six datasets with an average improvement of 5.9%.
Of all the graph pooling methods, the universal pooling approaches are the most straightforward yet achieve considerable effect, especially SumPool, which is consistent in underlying concept with our HAP. Intuitively, the higher the quality of the graph-level representations, the better the graph classification result. The element-wise sum aggregator in SumPool tends to capture all node features in consideration of higher-order node dependency, but the generated graph-level representations fail to reach sufficient quality, i.e., the quality of graph-level representations is not positively associated with how many node features are acquired. Irrelevant features that may interfere with the results are acquired without down-weighting, so the final graph-level representations, mixed with excessive irrelevant information, are detrimental to graph classification accuracy.
TopK pooling approaches produce score-based representations that drop nodes with lower scores from the original graph. As a result, potentially valuable information attached to these nodes and the related substructures may be ignored. From TABLE III, the performance of TopK pooling approaches is universally inferior to methods that capture more feature or structural information. Worse still, SortPool and AttPool-global fail to return a result within 72 hours in practical execution. But there is an exception to the rule: gPool, with consistently better performance than the other methods, even surpasses HAP on COLLAB. gPool computes scores by multiplying the node feature matrix with a trainable projection vector, so that the feature of each node is reflected in the estimated scalar projection values through different weights. This crucial ingredient leads to the outstanding performance. As for the striking performance on COLLAB, it may be due to the nature of the dataset. COLLAB covers scientific collaboration: nodes represent authors and edges indicate co-authorship. The classification task for each graph is to estimate the field the corresponding researcher belongs to. In this situation, the field can easily be distinguished by the Top-K authors with the most papers, who are likely domain experts, while the remaining authors are effectively noise. Nevertheless, our HAP still holds the advantage outside such exceptional circumstances.
For the dataset MUTAG, we observe that our HAP outperforms all the baselines by an average improvement of 12.3%. This significant result shows that HAP fits the properties of MUTAG well. MUTAG is a two-class nitro compound dataset with nodes and edges representing atoms and chemical bonds, respectively. Note that molecules of both classes share the common nitro substructure, so higher-order information beyond the substructure is crucial for differentiation, which is correctly handled by HAP.
Visualization: To further illustrate the effectiveness of the learned graph-level representations, we provide t-SNE visualizations on the PROTEINS and COLLAB datasets with features extracted by HAP, SAGPool, MeanAttPool and DiffPool (Fig. 4). In each figure, points of different colors exhibit discernible graph clusters with different labels in the projected two-dimensional space; the separability of the cluster borders reflects the discriminative power. We find that HAP, performing comparably with MeanAttPool on PROTEINS, shows better discriminability of the two classes than SAGPool and DiffPool. As for COLLAB, HAP is far superior to its competitors, with the three classes clearly separated, all of which is in accordance with the results in TABLE III.
6.3 Task2: Graph Matching
Four synthetic datasets are generated with different data sizes for the graph matching task. TABLE IV shows the graph matching results w.r.t. graph size.
GMN, specifically designed for the graph matching task, makes the node embedding phase dependent on the input pair through a cross-graph attention mechanism. However, as shown in TABLE IV, HAP boosts the matching accuracy by up to 3.5% over GMN. As graph size increases, HAP improves steadily while GMN degrades noticeably. This highlights the key point: basic node embedding models are already perfectly capable of producing high-quality node-level representations; the core of enhancing graph matching accuracy is instead improving the quality of graph-level representations. After replacing the basic pooling module in GMN, the performance of GMN-HAP grows tremendously to become comparable with HAP, further confirming the strong ability of the proposed graph coarsening module.
6.4 Task3: Graph Similarity Learning
We show the results of HAP for graph similarity learning compared with both conventional approximate GED algorithms and GNN-based models on the AIDS and LINUX datasets in Fig. 5. Note that evaluating the graph similarity learning task requires benchmark datasets with ground-truth GEDs computed by the exact algorithm A*. A recent study  shows that "no currently available algorithm manages to reliably compute GED within reasonable time between graphs with more than 16 nodes", and experiments on A* suggest that 10 nodes is close to the limit of what it can handle. We therefore only adopt benchmark datasets whose graphs have at most 10 nodes. Our results demonstrate that HAP is capable of boosting the accuracy of state-of-the-art methods.
More specifically, compared with conventional approximate GED algorithms of high computational complexity, HAP improves accuracy by a relative gain of 16.7% and 15.1% on AIDS and LINUX, respectively. Among the GNN-based models, HAP substantially outperforms SimGNN, which focuses more on optimizing the exact similarity score between graphs while neglecting relativity. This result reflects that a single-minded pursuit of pairwise absolute similarity is not necessarily favorable to relative similarity tasks, which are more common in real-world applications. Similarly, HAP outperforms GMN by a margin of 13.1% and 3.6% on AIDS and LINUX, respectively. When the pooling methods in SimGNN and GMN are replaced with the proposed graph coarsening module, both achieve slight improvements, and GMN-HAP obtains accuracy comparable with HAP. These results indicate that our HAP and its coarsening module are conducive to high-quality graph-level representations.
6.5 Ablation Studies
6.5.1 Comparison of Graph Pooling Mechanisms
To study the effectiveness of our proposed graph coarsening module, we fix the other components of the HAP framework and replace our coarsening module with four other differentiable graph pooling methods, i.e., MeanPool, MeanAttPool, SAGPool and DiffPool, referring to these variants as HAP-MeanPool, HAP-MeanAttPool, HAP-SAGPool and HAP-DiffPool, respectively. The performance of HAP and its four ablated variants on the graph classification, graph matching and graph similarity learning tasks is shown in TABLE V.
We observe that, compared with the four ablated variants whose performance fluctuates wildly among tasks, our HAP achieves superior performance on all twelve datasets across the three tasks. We also find that HAP-MeanPool ranks at the bottom across tasks, and is especially inferior on graph matching and graph similarity learning by a margin of 17% to 42.61%. This validates that the diverse features, which may be redundant in the single-input graph classification task, are crucial for multi-input graph-level tasks that require horizontal comparison. In contrast, HAP-MeanAttPool brings performance benefits over the other ablated variants, indicating that global-wise information aggregation is helpful for graph-level representation learning. Further, with the proposed graph coarsening module, our HAP achieves adaptive graph structure sensitivity on top of global-wise information aggregation, utilizing both local structure and global pattern properties and thus contributing to high-quality graph-level representations. Similarly, compared with HAP-DiffPool, a one-hop neighborhood aggregator, HAP improves graph-level representation quality by incorporating higher-order dependencies among nodes that may hold significant information. Moreover, the comparison between HAP-SAGPool and HAP reveals that our HAP can indeed retain key graph information that would otherwise be attached to the abandoned nodes.
6.5.2 Comparison of Different Numbers of Graph Coarsening Modules
TABLE VI shows the performance of graph matching and graph similarity learning when adopting different numbers of graph coarsening modules in HAP. All experiments use HAP-MeanAttPool as the baseline with a fixed coarsening ratio for the same dataset. We observe that replacing MeanAttPool with one of our graph coarsening modules, denoted as Coarsen = 1, improves the performance by at least 10.2%, which effectively demonstrates the significance of our proposed coarsening module. Furthermore, increasing the number of coarsening modules from one to two improves the performance by at most 5.4%. Finally, increasing from two to three slightly improves the performance by an average of 0.7%. These results demonstrate that the proposed graph coarsening module can dramatically improve performance by coarsening graphs in a hierarchical manner.
Visualization: Fig. 6 visualizes how graph-level representations change with the number of graph coarsening modules in the graph classification task. It can be seen that challenging samples are progressively classified correctly as the number of graph coarsening modules increases from one to two, but are easily misclassified when there are three coarsening modules.
Synthesizing the above results, the greater the number of graph coarsening modules, the more parameters and additional memory usage are incurred. To balance performance and resource usage, we choose Coarsen = 2 as the default setting.
6.5.3 Comparison of Generalization Performance
While most GNNs are designed with generalization to unseen nodes in mind, little research in the graph pooling area addresses generalization to unseen graphs. However, in practical applications such as protein molecular structure recognition, researchers are often interested in generalizing knowledge learned from small-sized molecules to large-sized molecules with the same form of structures.
In this subsection, we evaluate the generalization capability of the models by training on small-sized graphs and testing on large-sized graphs with the same edge probability for the graph matching task. The results shown in TABLE VII indicate that only HAP achieves a natural generalization of the small-sized results to large-sized graphs. This is credited to the key strength of HAP: it effectively learns, via GCont, the global graph content that encodes high-level pattern information for the training graphs, thus preserving the pattern properties shared between the training and testing graphs. When applying our graph coarsening module to GMN, GMN-HAP achieves significant improvements in prediction performance of 8.22% and 10.31%, respectively.
7 Conclusion and Future Work
In this paper, we introduce HAP, a novel graph pooling framework for hierarchical graph-level representation learning that adaptively leverages graph structures. The key innovation of HAP is the graph coarsening module, assisted by the novel graph pattern property extraction technique GCont and the cross-level attention mechanism MOA. HAP clusters local substructures through MOA, which helps it naturally focus more on close neighborhoods while effectively capturing higher-order dependencies that may contain important information. We also propose GCont, an auto-learned global graph content that sustains the cross-level attention process: HAP leverages GCont to provide global guidance during graph coarsening, extracting graph pattern properties so that the pre- and post-coarsening graph content remains stable without loss of significant information. The learning of GCont also facilitates generalization across graphs with the same form of features. Theoretical analysis and extensive experiments demonstrate that HAP and its key graph coarsening module achieve state-of-the-art performance on four downstream tasks.
HAP shows its potential to improve other graph learning methods by providing more informative graph embeddings. Furthermore, there are ample opportunities for HAP to be further extended to more complex networks, such as attributed networks and heterogeneous networks, which are more common in real-world applications.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
-  Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445–450.
-  D. Palaz, R. Collobert et al., “Analysis of cnn-based speech recognition system using raw speech as input,” Idiap, Tech. Rep., 2015.
-  W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and rnn for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
-  S. A. Myers, A. Sharma, P. Gupta, and J. Lin, “Information network or social network? the structure of the twitter follow graph,” in Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 493–498.
-  N. Shibata, Y. Kajikawa, and I. Sakata, “Link prediction in citation networks,” Journal of the American society for information science and technology, vol. 63, no. 1, pp. 78–85, 2012.
-  J. Hu, C. Guo, B. Yang, and C. S. Jensen, “Stochastic weight completion for road networks using graph convolutional networks,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019, pp. 1274–1285.
-  Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao, “Deep learning in bioinformatics: Introduction, application, and perspective in the big data era,” Methods, vol. 166, pp. 4–21, 2019.
-  T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv: Learning, 2016.
-  W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in neural information processing systems, 2017, pp. 1024–1034.
-  P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
-  T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
-  M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference. Springer, 2018, pp. 593–607.
-  S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar, “Composition-based multi-relational graph convolutional networks,” arXiv preprint arXiv:1911.03082, 2019.
-  J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” arXiv: Learning, 2013.
-  R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” arXiv preprint arXiv:1806.08804, 2018.
-  H. Gao and S. Ji, “Graph u-nets,” arXiv preprint arXiv:1905.05178, 2019.
-  Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity of graph structured objects,” arXiv preprint arXiv:1904.12787, 2019.
-  Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang, “Simgnn: A neural network approach to fast graph similarity computation,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 384–392.
-  Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” arXiv: Learning, 2016.
-  K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
-  O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” arXiv preprint arXiv:1511.06391, 2015.
-  M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification.” in AAAI, vol. 18, 2018, pp. 4438–4445.
-  J. Huang, Z. Li, N. Li, S. Liu, and G. Li, “Attpool: Towards hierarchical feature representation in graph convolutional networks via attention mechanism,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6480–6489.
-  J. Lee, I. Lee, and J. Kang, “Self-attention graph pooling,” arXiv preprint arXiv:1904.08082, 2019.
-  E. Ranjan, S. Sanyal, and P. P. Talukdar, “Asap: Adaptive structure aware pooling for learning hierarchical graph representations.” in AAAI, 2020, pp. 5470–5477.
-  J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 1263–1272.
-  Y. Bai, H. Ding, Y. Qiao, A. Marinovic, K. Gu, T. Chen, Y. Sun, and W. Wang, “Unsupervised inductive graph-level representation learning via graph-graph proximity,” arXiv preprint arXiv:1904.01098, 2019.
-  N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Machine Learning Research, vol. 12, pp. 2539–2561, 2011.
-  D. Mesquita, A. Souza, and S. Kaski, “Rethinking pooling in graph neural networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
-  H. Yuan and S. Ji, “Structpool: Structured graph pooling via conditional random fields,” in International Conference on Learning Representations, 2019.
-  F. M. Bianchi, D. Grattarola, and C. Alippi, “Spectral clustering with graph neural networks for graph pooling,” in International Conference on Machine Learning. PMLR, 2020, pp. 874–883.
-  J. Lee, I. Lee, and J. Kang, “Self-attention graph pooling,” in International Conference on Machine Learning. PMLR, 2019, pp. 3734–3743.
-  E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
-  L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub) graph isomorphism algorithm for matching large graphs,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 10, pp. 1367–1372, 2004.
-  K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
-  M. Neuhaus, K. Riesen, and H. Bunke, “Fast suboptimal algorithms for the computation of graph edit distance,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 2006, pp. 163–172.
-  S. Fankhauser, K. Riesen, and H. Bunke, “Speeding up graph edit distance computation through fast bipartite matching,” in International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 2011, pp. 102–111.
-  K. Riesen and H. Bunke, “Approximate graph edit distance computation by means of bipartite graph matching,” Image and Vision computing, vol. 27, no. 7, pp. 950–959, 2009.
-  D. B. Blumenthal and J. Gamper, “On the exact computation of the graph edit distance,” Pattern Recognition Letters, vol. 134, pp. 46–57, 2020.