I Introduction
Graphs are a powerful data representation and are widely used in areas such as social networks [31, 29, 16], biomedical analysis [4, 9], recommender systems [8], and computer security [28, 14]. Graph distance (or similarity) is important for many graph-based tasks such as graph similarity search [36, 35] and binary function analysis [34, 22]. (For conciseness, we refer to both graph distance and graph similarity as graph similarity, as it is easy to transform a distance measure into a similarity measure.) For example, in binary function analysis, there is a database of control-flow graphs that are known to have problems, and the goal is to find out whether a piece of software is prone to these problems. A natural solution is to search the graph database for control-flow graphs similar to the control-flow graph of the software, for which graph similarity computation is needed. More applications of graph distance can be found in [2].

Graph edit distance (GED) and maximum common subgraph (MCS) are two general measures of the similarity between two graphs [23]. GED is the minimum number of edit operations (e.g., node/edge deletion/insertion) needed to transform one graph into another. MCS is the size of the largest common subgraph (with respect to the number of nodes) shared by two graphs. Computing the exact GED and MCS between two graphs is known to be NP-hard and remains challenging in practice [6, 36]. Moreover, it is reported that state-of-the-art algorithms fail to compute the exact GED between two graphs with more than 16 nodes in a reasonable time [3].
Many methods have been proposed to compute graph similarity, and they usually provide approximate results in exchange for computation speedup. These methods can be roughly classified into two categories, i.e., graph-theory based methods and learning based methods. Among the graph-theory based methods, BEAM [21] uses beam search to avoid the high complexity of searching the full space. Hungarian [25] and VJ [10] use linear programming to approximate GED. HED [11] matches the nodes in two graphs using their local structures. MCSPLIT [20] uses a branch and bound algorithm to compute MCS. Among the learning based methods, GraphSIM [2] utilizes the graph convolutional network (GCN) [15] to compute node embeddings and builds an embedding correlation matrix that is used to predict graph similarity. Graph matching network (GMN) [18] adopts an attention layer to match the nodes in two graphs during embedding learning and computes the GED using the embeddings of the two graphs. Currently, the learning based methods are shown to outperform the graph-theory based methods in both accuracy and efficiency, and thus benefit tasks that require graph similarity estimation. We give a more detailed introduction to the related work and discuss how HGMN differs from it in Section II.

Existing learning based methods use either the embedding of each individual node or the embedding of an entire graph [1, 18]
, which fail to capture local topological structures at different scales. In this paper, we observe that graph similarity can benefit from a multi-scale view: if two graphs are similar to each other, they remain similar when compressed into more compact graphs; conversely, if two graphs are different, their compact graphs are also likely to be different. We propose the hierarchical graph matching network (HGMN), which uses multiple stages of spectral clustering to cluster the graphs into successively more compact graphs. In each stage of the clustering, the earth mover distance (EMD) [26] is used to explicitly align the nodes in the two graphs so that the network does not have to learn complex node permutations. We derive correlation matrices from the node embeddings in each stage, and these matrices are fed into a convolutional neural network (CNN) to predict graph similarity. The entire pipeline is trained end-to-end in a data-driven fashion.

We experimented on 4 datasets (i.e., AIDS, LINUX, IMDBMULTI, and PTC) and used 4 performance metrics (i.e., mean squared error, Spearman's rank correlation coefficient, Kendall's rank correlation coefficient, and precision at 10) to evaluate the accuracy of graph similarity approximation. The results show that HGMN consistently outperforms the state-of-the-art baselines on different datasets and performance metrics. Compared with the best performing baseline, the improvement in accuracy is 12.0% on average and can be up to 62.6%. Moreover, we also evaluated the key designs in HGMN, i.e., hierarchical graph clustering and explicit node matching, and the results show that both of them lead to performance improvements.
II Background and Related Work
II-A Problem Formulation
A graph is represented as G = (V, E), in which V is the set of nodes and E is the set of edges. We denote n = |V| and m = |E|. Each node v ∈ V can come with a feature vector x_v. We set x_v as a one-hot vector or an all-1 vector when the graph does not come with node feature vectors. We assume that edges do not have weights and focus on undirected graphs. The adjacency matrix of the graph is denoted as A.

For two graphs G_1 and G_2, their GED is the minimum number of edit operations in the optimal alignment that transforms one graph into the other [36]. The edit operations include edge deletion/insertion, node deletion/insertion, and relabeling a node. We transform GED into a similarity score in the range (0, 1] to ensure that its range is well-defined. A maximum common subgraph of two graphs G_1 and G_2 is a subgraph common to both G_1 and G_2 such that no other common subgraph of G_1 and G_2 contains more nodes [23]. We restrict the maximum common subgraph to be a connected graph, and MCS is defined as the number of nodes in the maximum common subgraph. An illustration of GED and MCS is provided in Figure 1.
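The exact normalization used for the distance-to-similarity transform is not reproduced in this text, so the sketch below assumes the exponential form popularized by SimGNN [1], which maps a GED normalized by the average graph size into (0, 1]; the function name and normalization are illustrative assumptions:

```python
import math

def ged_to_similarity(ged, n1, n2):
    """Map a GED value to a similarity score in (0, 1] (a sketch).

    Assumes the common normalization exp(-GED / ((n1 + n2) / 2));
    the exact transform used in the paper may differ.
    """
    normalized = ged / ((n1 + n2) / 2)
    return math.exp(-normalized)
```

With this choice, identical graphs (GED = 0) map to similarity 1, and the score decays smoothly toward 0 as the edit distance grows.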
Our goal is to learn a model that predicts the (transformed) GED or MCS between two graphs. Under reasonable computational complexity constraints, we want the predictions of the model to be as accurate as possible.
II-B Existing Methods
Graph-theory based methods. The A* algorithm [12] is widely used for GED computation and returns the exact result. A* is a best-first algorithm that tries to find an optimal path in a search tree. The idea is to formulate the problem using a tree structure, in which the root node is the starting point, inner nodes represent partial solutions, and leaf nodes are complete solutions. As A* has exponential time complexity, it is only suitable for small graphs and cannot finish in a reasonable time for large graphs. Several heuristics have been proposed to improve the execution time, which sacrifice accuracy for efficiency. For example, BEAM [21] utilizes beam search to avoid searching the full search space and introduces a heuristic rule to favor long partial edit paths over shorter ones.

Some methods measure GED via linear programming [5]. Based on bipartite graph matching, Hungarian [25] and VJ [10] replace the cost of editing a node by the cost of editing the 1-star graph centered at this node. The cost of substituting one star graph with another is further expressed as the solution of a square linear sum assignment problem. HED [11] matches the nodes in two graphs using their local structures, and GED is approximated by the Hausdorff distance [13] between the nodes in the two graphs. To compute MCS, a branch and bound algorithm is used in MCSPLIT [20], which induces the result by finding a maximum-cardinality mapping between the graphs.
Learning based methods. The learning based methods usually use graph neural networks (such as GCN) to learn embeddings and predict graph similarity from the embeddings. SMPNN predicts graph similarity using a summation of the similarities between the nodes in the two graphs [24]. GCNMEAN and GCNMAX [17] use GCN [15] to learn graph embeddings and train a fully connected neural network to compute graph similarity from the embeddings of two graphs. SIMGNN uses both graph embeddings and node similarities to predict graph similarity [1]. GMN [18] introduces a cross-graph attention layer to allow the nodes in the two graphs to interact with each other but still predicts graph similarity using graph embeddings. GraphSIM [2] utilizes GCNs with different numbers of layers to build multiple correlation matrices among the nodes in the two graphs and uses the correlation matrices to predict graph similarity.
Our contributions. HGMN adopts successful techniques from existing learning based methods, e.g., applying GCN to learn node embeddings and using the node correlation matrix as input for a neural network. However, HGMN has two fundamental differences from existing learning based methods. First, we use multiple stages of spectral clustering to create a multi-scale view of the similarity between graphs. The hierarchical clustering provides more information for the downstream neural network, as the differences between two graphs can be captured in the correlation matrices at different scales. Second, we explicitly align the nodes in the two graphs using the earth mover distance and compute the correlation matrix in the aligned order. Node alignment ensures that the correlation matrix stays the same under arbitrary node permutations, and thus the neural network does not need to learn to be robust to node permutations. We will show in the experiments that both designs are crucial for performance, especially on large graphs.
III Hierarchical Graph Matching Network
The data processing pipeline of HGMN is illustrated in Figure 2. HGMN uses multiple stages of spectral clustering to organize the graphs into successively more compact graphs. In each stage, an embedding pooling operator is applied to derive the initial node embeddings of this stage from the node embeddings of the previous stage. Then, the initial node embeddings are processed by a GCN to generate refined node embeddings. Based on the refined embeddings, we use the earth mover distance to build a one-to-one mapping between the nodes in the two graphs (i.e., node alignment) to ensure permutation invariance. The correlation matrices from all stages are fed into a CNN model to predict the similarity score of the two graphs. The GCNs for all hierarchical clustering stages and the CNN model are trained from data. In the following, we introduce the modules of the HGMN pipeline in more detail.
III-A Hierarchical Graph Clustering
The procedure of hierarchical graph clustering is described in Algorithm 1, which is conducted in T stages. In each stage t, a new graph G_t and its adjacency matrix A_t are constructed from the graph in the previous stage (i.e., G_{t-1}). The size sequence of the graphs satisfies n_0 > n_1 > ... > n_T, in which n_0 = n is the size of the original graph, n_t is the size of the compact graph in stage t, and we always use n_T = 1. Therefore, the graph becomes smaller and smaller as the stages go on. In the 4th line of Algorithm 1, the normalized Laplacian for a graph with adjacency matrix A is defined as L_norm = D^{-1/2} L D^{-1/2}, in which D is the diagonal degree matrix with D_ii = Σ_j A_ij and L = D - A is the Laplacian matrix. Each row of the eigenvector matrix corresponds to a node in G_{t-1}. The k-means in the 7th line of Algorithm 1 groups the nodes in G_{t-1} into n_t clusters, and we treat each cluster of nodes as a single node in the compact graph G_t. The for-loop in the 9th line of Algorithm 1 constructs the adjacency matrix of G_t, and we assume two nodes are connected in G_t if they contain nodes that are connected in G_{t-1}.

We use spectral clustering for graph compaction for two reasons. First, it preserves the local structure of the graph. As we focus on unweighted graphs, spectral clustering approximately minimizes the number of cross edges (normalized by the size of the graph clusters) between the graph clusters. This means that the nodes in each graph cluster tend to be strongly connected. Second, spectral clustering allows flexible control of the number of graph clusters by setting the number of k-means centers. We provide an illustration of hierarchical graph clustering in the leftmost part of Figure 2, which shows that well-connected nodes are grouped into the same graph cluster. Moreover, it can be observed that for two different graphs, their compact graphs in the same clustering stage are also different. Thus, hierarchical graph clustering provides the downstream neural network with a multi-scale view of the differences between two graphs, which makes the task of graph similarity prediction easier. Hierarchical clustering also makes HGMN more expressive and general than existing learning based methods for graph similarity approximation. As we use the original graph in the 0th stage and set n_T = 1 for the final stage, methods that use either node embeddings or graph embeddings can be regarded as special cases of HGMN.
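One stage of this compaction can be sketched as follows. Since Algorithm 1 itself is not reproduced here, the k-means initialization (deterministic farthest-point seeding) and other implementation details are assumptions, not the paper's exact procedure:

```python
import numpy as np

def _kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(dists.argmax())])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_compact(A, k):
    """One stage of graph compaction by spectral clustering (a sketch).

    A: (n, n) symmetric 0/1 adjacency matrix; k: size of the compact graph.
    Returns per-node cluster labels and the (k, k) adjacency matrix of the
    compact graph: two clusters are connected iff some pair of their member
    nodes is connected in A.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # Normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    labels = _kmeans(vecs[:, :k], k)     # embed nodes by k smallest eigenvectors
    A_new = np.zeros((k, k))
    for i in range(n):
        for j in range(n):
            if A[i, j] and labels[i] != labels[j]:
                A_new[labels[i], labels[j]] = 1.0
    return labels, A_new
```

Running this on a graph of two triangles joined by a bridge edge with k = 2 groups each triangle into one compact node and connects the two compact nodes.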
Embedding pooling. In each stage of hierarchical graph clustering, we derive the initial node embeddings for G_t from the node embeddings of G_{t-1}. We call this procedure embedding pooling, which is motivated by EigenPooling [19]. We show how embedding pooling works for a graph cluster Γ, which contains c nodes from G_{t-1} and is treated as a single node in G_t. Assume that the node embeddings of G_{t-1} have dimension d; the embedding matrix for Γ can be organized as X_Γ ∈ R^{c×d}, in which each row corresponds to the embedding of a node from Γ in G_{t-1}. We can also define an adjacency matrix A_Γ for the nodes in Γ by keeping the edges in G_{t-1} for which both end points are contained in Γ. With the adjacency matrix A_Γ, we can define the Laplacian matrix L_Γ for Γ and solve for its eigenvectors, in which u_1 is the eigenvector corresponding to the largest eigenvalue of L_Γ. The initial embedding vector of Γ in G_t is obtained as

x_Γ = u_1^T X_Γ, (1)

in which x_Γ ∈ R^{1×d} is used as the initial embedding for node Γ in G_t. The intuition is that u_1 corresponds to a high-frequency signal on Γ in spectral graph theory. By projecting X_Γ onto u_1, we keep the signal component in X_Γ that changes the fastest on Γ. Using more eigenvectors (e.g., u_2, u_3), we can create multiple initial embeddings for Γ, and these embeddings can work in parallel in a similar manner to multiple image channels in a CNN. Figure 2 (third column) shows that multiple initial embeddings can be used to generate multiple correlation matrices for a stage after they go through GCN update and node alignment.
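Embedding pooling for one cluster can be sketched as below; the use of the combinatorial Laplacian and of the single top eigenvector follows the description above, while the function name and NumPy realization are illustrative:

```python
import numpy as np

def eigen_pool(X_cluster, A_cluster):
    """Pool the embeddings of one graph cluster into a single vector (a sketch).

    X_cluster: (c, d) embeddings of the c nodes in the cluster.
    A_cluster: (c, c) adjacency among those nodes.
    Projects the embedding matrix onto the Laplacian eigenvector with the
    largest eigenvalue, i.e., the fastest-changing signal on the cluster.
    """
    D = np.diag(A_cluster.sum(axis=1))
    L = D - A_cluster                 # combinatorial Laplacian of the cluster
    _, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    u1 = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return u1 @ X_cluster             # (d,) pooled embedding, eq. (1)
```

Note that for a constant embedding matrix the pooled vector is zero, since the top eigenvector is orthogonal to the constant signal; it is the variation of the embeddings across the cluster that is kept.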
IiiB Node Embedding and Alignment
We use a graph convolutional network (GCN) [15] to refine the initial embeddings for each stage; the GCNs for different stages have the same number of layers but do not share model parameters. Assume that the graph G contains n nodes, the adjacency matrix of G is A, and the initial embedding matrix is H. A layer of GCN updates the embeddings as follows

H' = σ( D'^{-1/2} Ã D'^{-1/2} H W ), (2)

in which σ is the activation function, Ã = A + I is the augmented adjacency matrix with self-loops, D' is the degree matrix defined on the augmented adjacency matrix Ã, and W is a learnable mapping matrix. The layer in equation (2) can be stacked to form a multi-layer GCN. GCN is shown to achieve good performance on many graph-based tasks such as node classification [15], link prediction [27] and graph classification [27]. Recently, it has also been shown that GCN can approximate the Weisfeiler-Lehman (WL) graph isomorphism test [32], which decides whether two graphs are topologically identical. We choose GCN as the default embedding model for HGMN due to its expressiveness and simplicity, but more sophisticated graph neural network models such as GAT [30] and JK-Net [33] can also be easily incorporated into HGMN.

Similar to GRAPHSIM [2], we use the embedding correlation matrices as the input for the neural network that predicts graph similarity. Assume that for a stage, we have two graphs G_1 and G_2 containing n_1 and n_2 nodes, respectively. (Strictly speaking, only in the 0th stage, i.e., on the original graphs, can the two graphs have different sizes. In each stage of clustering, the same output graph size is pre-specified for all graphs such that the input matrices to the downstream CNN have a fixed size. We use n_1 and n_2 to cover the most general case.) Their GCN embeddings are denoted as H_1 and H_2, and the correlation matrix is S = H_1 H_2^T. However, as there is no canonical ordering of the nodes in a graph, the rows of H_1 and H_2 will be permuted under different node numberings, which results in different correlation matrices. We provide an illustration of this phenomenon in Figure 3, in which one graph is a node permutation of the other. We want the two correlation matrices to be identical, as the two cases essentially represent the same pair of graphs. However, as shown in Figure 3, the two correlation matrices are quite different. This means that the downstream CNN needs to be robust to node permutations, which makes the learning task difficult.
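The propagation rule of a single GCN layer in equation (2) above can be sketched in NumPy as follows; the choice of ReLU as the activation σ is an assumption:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation layer (Kipf & Welling), a minimal NumPy sketch.

    A: (n, n) adjacency matrix, H: (n, d_in) node embeddings,
    W: (d_in, d_out) learnable weights. Computes
    ReLU(D'^{-1/2} (A + I) D'^{-1/2} H W), where D' is the degree
    matrix of the self-loop-augmented adjacency A + I.
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5    # D'^{-1/2} on the diagonal
    norm = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(norm @ H @ W, 0.0)        # ReLU activation
```

Stacking this layer (with separate W per layer) yields the multi-layer GCN described above; in HGMN each clustering stage trains its own stack.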
Therefore, we use the earth mover distance [26] to explicitly align the nodes in G_1 and G_2. We define a distance matrix D on H_1 and H_2 as D_ij = ||H_1[i] - H_2[j]||, in which H[i] denotes the i-th row of matrix H. The earth mover distance between H_1 and H_2 is then defined as

EMD(H_1, H_2) = min_{T ≥ 0} Σ_i Σ_j T_ij D_ij   s.t.   Σ_j T_ij = 1/n_1, Σ_i T_ij = 1/n_2, (3)

in which T is the transport matrix. Intuitively, D_ij models the cost of transporting a unit of mass from H_1[i] to H_2[j], while T_ij models the amount of mass transported from H_1[i] to H_2[j]. As T can be marginalized into its row and column sums, each row of H_1 and H_2 will send/receive 1/n_1 and 1/n_2 units of the mass, respectively. By minimizing over T, the earth mover distance encourages T_ij to be large if the distance between H_1[i] and H_2[j] is small.
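The transport plan in equation (3) can be obtained with any linear programming solver; below is a sketch using scipy.optimize.linprog, which is an implementation choice not prescribed by the paper:

```python
import numpy as np
from scipy.optimize import linprog

def emd_transport(H1, H2):
    """Solve for the earth mover distance transport plan T (a sketch).

    H1: (n1, d), H2: (n2, d) node embeddings. Cost D[i, j] is the Euclidean
    distance between H1[i] and H2[j]; row marginals are 1/n1 and column
    marginals 1/n2, as in equation (3). Returns (T, EMD value).
    """
    n1, n2 = len(H1), len(H2)
    D = np.linalg.norm(H1[:, None, :] - H2[None, :, :], axis=-1)
    # Variables are T.ravel(); equality constraints fix the marginals.
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0      # row i sums to 1/n1
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0               # column j sums to 1/n2
    b_eq = np.concatenate([np.full(n1, 1 / n1), np.full(n2, 1 / n2)])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n1, n2), res.fun
```

For identical embedding sets the optimal plan puts all mass on the matching pairs and the EMD value is zero.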
Algorithm 2 shows how to obtain a node matching between two graphs using the transport matrix T optimized by the earth mover distance. The idea is to match node pairs with large T_ij in a greedy fashion. In the 5th line of Algorithm 2, ties are broken by taking the solution with the minimum value if there are multiple optimal solutions. As T_ij is likely to be large when H_1[i] and H_2[j] have a small distance, Algorithm 2 essentially matches a node in G_1 to a similar node in G_2. If we define an ordering for the nodes in G_1, e.g., the descending order of the first dimension of their embeddings, and arrange the nodes in G_2 using the matched order, the correlation matrix will be the same as long as one graph can be transformed into the other via node permutation. Therefore, with explicit node alignment, the downstream CNN does not need to be robust to node permutations, which simplifies learning.
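A minimal sketch of the greedy matching idea behind Algorithm 2, assuming a square transport matrix and a simplified tie-breaking rule (the first maximal entry found), since the algorithm listing itself is not reproduced here:

```python
import numpy as np

def greedy_match(T):
    """Greedy one-to-one node matching from a transport matrix (a sketch).

    T: (n, n) weights from the earth mover distance solution; a larger
    T[i, j] means node i of G1 and node j of G2 exchange more mass.
    Repeatedly picks the largest remaining entry and matches that pair.
    Returns match[i] = index of the G2 node matched to G1 node i.
    """
    T = T.copy()
    n = T.shape[0]
    match = [-1] * n
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(T), T.shape)
        match[i] = int(j)
        T[i, :] = -np.inf     # remove the matched row and column
        T[:, j] = -np.inf
    return match
```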
One subtlety is that the correlation matrix does not have a fixed size for the original graphs (i.e., the 0th stage of clustering). To ensure that the CNN has a fixed-size input, we use interpolation to upsample the correlation matrix of the 0th stage to a size of N × N, in which N is the size of the largest graph in the dataset.

III-C Network Structure and Loss Function
Our network uses the correlation matrices from all stages as input to predict graph similarity. As illustrated in the rightmost part of Figure 2, the network consists of multiple convolutional layers and fully connected layers. Since the correlation matrices are similar to images, the convolutional layers are utilized to extract spatial features from them. The fully connected layers allow the features from different stages to interact with each other. The output of the network is a single value that indicates the distance / similarity between two graphs.
We formulate graph similarity prediction as a regression problem and use the mean squared error as the loss function

ℓ(θ) = (1/|B|) Σ_{(G_1, G_2) ∈ B} ( d_θ(G_1, G_2) − d(G_1, G_2) )², (4)

in which θ is the model parameter, d_θ(G_1, G_2) is the graph distance predicted by the model for a graph pair, and d(G_1, G_2) is the ground-truth distance between the graph pair. We use mini-batch stochastic gradient descent (SGD) for training; in each mini-batch B, graph pairs are randomly sampled from the training set to compute the loss. The trainable parameters in the model include the CNN used for distance prediction and the GCNs used for embedding in all stages.

HGMN can compute graph similarity efficiently, especially for graph similarity search, in which the dataset is known beforehand. Hierarchical graph clustering and node embedding can be conducted for the graphs in the database before the query comes. Spectral clustering for the query graph and earth mover distance based node alignment have polynomial complexity in the size of the largest query graph, which is not a big problem if the graphs are not too large. For the other computations that involve neural networks, the complexity of HGMN is similar to that of existing learning based methods such as GRAPHSIM [2] and GMN [18].
TABLE I: Statistics of the datasets after preprocessing.
Dataset  Domains  # Graphs  MIN/MAX nodes per graph  AVG nodes per graph

AIDS  Chemical compounds  700  2/10  8.9 
LINUX  Program dependence graph  1000  4/10  7.6 
IMDBMULTI  Egonetworks  332  16/89  25.0 
PTC  Biochemistry  256  16/103  30.2 
IV Experimental Evaluation
In this part, we first introduce the experimental settings, including the datasets, performance metrics and baselines. Then we compare HGMN with the baselines on the accuracy of graph similarity prediction. Finally, we examine the key designs in HGMN, i.e., hierarchical graph clustering and earth mover distance based node alignment, and test the influence of the parameters on the performance. All code to reproduce the results of HGMN will be released after the review process.
IV-A Experimental Settings
We largely follow the experimental settings in [2] and introduce the details as follows.
Datasets. We used 4 real datasets for the experiments, i.e., AIDS, PTC, LINUX and IMDBMULTI. The AIDS dataset contains 42,687 chemical compound graphs from the Developmental Therapeutics Program at NCI/NIH, and each node in a graph is associated with one out of 29 labels. AIDS has been widely used for the evaluation of graph similarity computation [36, 37, 1], and we randomly sampled 700 graphs from the dataset. PTC consists of 344 chemical compound graphs that report the carcinogenicity for male and female rats. Each node in the PTC dataset has one out of 19 possible labels. LINUX has 48,747 Program Dependence Graphs (PDG) generated from the Linux kernel. In each PDG, a node is one statement and an edge models the dependency between two statements. We randomly sampled 1,000 graphs from the original LINUX dataset. IMDBMULTI is a movie-collaboration dataset containing 1,500 ego-networks of movie actors/actresses. In the ego-networks, each node represents a person and an edge models the collaboration between two persons. On IMDBMULTI and PTC, we removed graphs containing fewer than 16 nodes to test the scalability of our methods. For both LINUX and IMDBMULTI, the nodes do not come with labels, and we use the all-1 vector as the initial embedding of the nodes for these two datasets. The statistics of the datasets after preprocessing are reported in Table I.
Evaluation Methodology. For each dataset, we generated the training set, validation set and test set with a split ratio of 7:2:1. The model was trained on the training set, and the hyper-parameters (e.g., the number of stages in hierarchical graph clustering and the number of layers in the GCN) were tuned using the validation set. Graphs in the test set were treated as queries, and we evaluated how accurately the model approximates the similarity between the query graphs and the graphs in the entire dataset. For AIDS and LINUX, the A* algorithm was used to compute the ground-truth GED between the graphs. As A* has an exponential time complexity with respect to the number of nodes in the graphs, it took too much time for PTC and IMDBMULTI. Therefore, we computed the ground-truth GED for PTC and IMDBMULTI by taking the minimum of three approximate algorithms, i.e., Beam [21], Hungarian [25] and VJ [10]. The distances returned by these algorithms are larger than or equal to the true GED, and the same ground-truth approximation methodology was also adopted in [2]. To compute the ground-truth MCS, we used MCSPLIT [20], as it can finish in a long but tolerable time on our datasets. Note that the algorithms used to provide the ground-truth distance/similarity are typically orders of magnitude slower than the learning based methods [2]. We excluded the test set from model training to show that the trained models can generalize to unseen data and thus improve the efficiency of graph similarity search.
We used four performance metrics to evaluate the accuracy of graph similarity approximation, i.e., mean squared error (MSE), Spearman's rank correlation coefficient (ρ), Kendall's rank correlation coefficient (τ) and precision at 10 (p@10). MSE is the mean squared error of the predicted GED/MCS compared with the ground-truth GED/MCS. For a query graph, the graphs in the dataset were ranked according to their predicted graph similarities. Both ρ and τ evaluate how well the prediction-based ranking matches the ground-truth ranking, and a higher value means better performance. p@10 is the percentage of true top-10 nearest neighbors among the top-10 nearest neighbors obtained from the estimated graph similarity. MSE measures the accuracy of graph similarity approximation, while ρ, τ and p@10 evaluate how well the estimated graph similarity ranks the graphs, which is also important for graph similarity search.
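The four metrics can be computed per query as follows; this is a sketch using scipy.stats, and the helper name is illustrative:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def ranking_metrics(pred, truth, k=10):
    """Compute MSE, Spearman's rho, Kendall's tau and p@k for one query.

    pred, truth: arrays of predicted and ground-truth similarities between
    the query graph and every graph in the dataset.
    """
    mse = float(np.mean((pred - truth) ** 2))
    rho = spearmanr(pred, truth)[0]      # rank correlation of the two orderings
    tau = kendalltau(pred, truth)[0]
    top_pred = set(np.argsort(-pred)[:k])
    top_true = set(np.argsort(-truth)[:k])
    p_at_k = len(top_pred & top_true) / k
    return mse, rho, tau, p_at_k
```

In the experiments each metric would then be averaged over all query graphs in the test set.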
Baselines. As the learning based methods were shown to outperform the graph-theory based methods in both accuracy and efficiency [2], we mainly compared with the learning based methods. The baselines include SMPNN [24], GCNMEAN, GCNMAX [17], SIMGNN [1], GMN [18] and GraphSIM [2]. EMBAVG is a simple baseline introduced in [2] that computes graph similarity using the dot product of two graph embeddings. As the results of our runs for SIMGNN and GraphSIM on the AIDS and LINUX datasets are slightly worse than those reported in their papers, we reused the results from their papers. By default, HGMN uses 4 hierarchical clustering stages with sizes 6, 4, 2, 1 for AIDS and LINUX (small graphs), and 6 hierarchical clustering stages with sizes 64, 16, 8, 4, 2, 1 for PTC and IMDBMULTI (large graphs).
TABLE II: Accuracy of GED prediction.
Dataset and Metric  SMPNN  EMBAVG  GCNMEAN  GCNMAX  SIMGNN  GMN  GRAPHSIM  HGMN  Gain (%)

AIDS  MSE  4.725  3.185  2.124  3.423  1.189  1.741  0.787  0.752  4.4
ρ  0.306  0.642  0.653  0.628  0.843  0.751  0.874  0.883  1.0
τ  0.480  0.592  0.629  0.505  0.690  0.642  0.776  0.778  0.3
p@10  0.092  0.179  0.194  0.290  0.421  0.401  0.534  0.537  0.5
LINUX  MSE  11.523  11.244  7.541  6.341  1.509  1.027  0.058  0.056  3.4
ρ  0.046  0.245  0.579  0.724  0.939  0.941  0.981  0.984  0.3
τ  0.016  0.301  0.525  0.740  0.879  0.896  0.907  0.920  1.4
p@10  0.014  0.071  0.141  0.541  0.942  0.933  0.992  0.996  0.4
IMDBMULTI  MSE  32.596  71.789  68.823  58.425  2.964  3.210  1.924  0.719  62.6
ρ  0.107  0.229  0.402  0.449  0.781  0.725  0.825  0.930  12.7
τ  0.644  0.187  0.378  0.354  0.770  0.782  0.821  0.914  11.3
p@10  0.021  0.210  0.219  0.437  0.724  0.751  0.813  0.853  4.9
PTC  MSE  134.124  44.184  7.428  8.329  1.473  1.854  0.889  0.820  7.8
ρ  0.127  0.324  0.546  0.506  0.726  0.670  0.714  0.958  34.2
τ  0.167  0.315  0.490  0.468  0.678  0.592  0.719  0.941  30.9
p@10  0.087  0.144  0.210  0.241  0.475  0.374  0.541  0.623  15.2
TABLE III: Accuracy of MCS prediction.
Dataset and Metric  SMPNN  EMBAVG  GCNMEAN  GCNMAX  SIMGNN  GMN  GRAPHSIM  HGMN  Gain (%)

AIDS  MSE  4.268  6.148  6.234  4.156  3.433  2.234  2.402  2.213  0.9
ρ  0.772  0.723  0.756  0.801  0.822  0.901  0.858  0.902  0.1
τ  0.529  0.510  0.498  0.574  0.680  0.803  0.798  0.871  9.1
p@10  0.379  0.243  0.347  0.315  0.374  0.513  0.505  0.525  2.3
LINUX  MSE  3.397  2.784  2.689  2.170  0.729  0.794  0.164  0.153  6.7
ρ  0.134  0.475  0.521  0.714  0.859  0.939  0.962  0.960  0.2
τ  0.675  0.715  0.747  0.784  0.889  0.934  0.946  0.962  1.7
p@10  0.235  0.378  0.421  0.459  0.850  0.949  0.951  0.960  0.9
IMDBMULTI  MSE  15.145  19.354  10.457  10.124  2.451  0.590  1.287  0.529  10.3
ρ  0.310  0.478  0.746  0.841  0.930  0.941  0.976  0.981  0.5
τ  0.530  0.386  0.611  0.619  0.879  0.920  0.946  0.981  3.7
p@10  0.010  0.211  0.387  0.451  0.812  0.875  0.882  0.896  1.6
PTC  MSE  14.875  26.412  12.441  13.845  5.419  3.142  3.975  2.551  18.8
ρ  0.578  0.647  0.578  0.6617  0.712  0.782  0.779  0.811  3.7
τ  0.522  0.419  0.650  0.688  0.746  0.792  0.800  0.812  1.5
p@10  0.187  0.352  0.384  0.402  0.356  0.584  0.498  0.609  4.3
IV-B Main Performance Results
We report our main results in Table II and Table III, which compare HGMN with the baselines on the accuracy of GED and MCS prediction, respectively. We can make two observations from the results. First, HGMN consistently outperforms the baselines for both GED and MCS, across 4 different datasets and 4 different performance metrics. For GED, the performance improvement over the best performing baseline is 11.96% on average (averaged over all datasets and performance metrics) and can be up to 62.6%. Compared with the best performing baseline, the improvement for MCS is 4.15% on average and can be up to 18.8%. Second, the performance improvement of HGMN is significantly larger on the datasets with larger graphs (IMDBMULTI and PTC) than on those with smaller graphs (AIDS and LINUX). We conjecture that this is because larger graphs have richer structures when they are clustered into more compact graphs. These structures are better captured on the compact graphs than on the original graphs. In contrast, the graphs in AIDS and LINUX are small (with no more than 10 nodes), and thus a GCN with a moderate number of layers is already able to capture the structures at different scales. For large graphs, a GCN with a large number of layers would be required, but graph neural networks with too many layers are known to be prone to over-smoothing [7], which often leads to poor performance. This explanation suggests that hierarchical clustering may enable HGMN to perform well on even larger graphs, and we provide more evidence to support this explanation in Section IV-C.
TABLE IV: Ablation study on the key designs of HGMN.
HGMN Variants  AIDS  PTC
  MSE  ρ  τ  p@10  MSE  ρ  τ  p@10
HGMN  0.752  0.883  0.778  0.537  0.820  0.958  0.941  0.623 
Without node alignment  0.896  0.776  0.685  0.432  0.837  0.914  0.847  0.602 
Without hierarchical clustering  2.768  0.612  0.605  0.313  7.285  0.573  0.513  0.238 
IV-C Ablation Study and Parameter Analysis
In Table IV, we study how the two key designs of HGMN, i.e., hierarchical clustering and node alignment, influence the performance. For the variant without node alignment, we used a random ordering of the nodes in the graphs. For the variant without hierarchical clustering, we used only the embedding correlation matrix of the original graph, which is similar to the case of GRAPHSIM [2]. We tested on a dataset with small graphs (AIDS) and a dataset with large graphs (PTC). The results show that disabling either node alignment or hierarchical clustering degrades the performance. Comparatively, the performance degradation is more severe on the large PTC dataset than on the small AIDS dataset. This is another piece of evidence that hierarchical clustering helps achieve good performance on large datasets. Compared with node alignment, hierarchical clustering seems to be more important for the performance: without it, the MSE increases by 2.7x and 7.9x for AIDS and PTC, respectively.
In Figure 4, we check the influence of the hyper-parameters on the performance of HGMN. Figure 4(a) shows that when the number of eigenvectors used for embedding pooling increases, the performance of HGMN first improves and then stabilizes. Recall that the number of eigenvectors decides the number of correlation matrices provided by each stage for the downstream CNN. As more eigenvectors are used for pooling, more of the information in the node embeddings of the previous graph clustering stage is kept, and thus more information is provided to the CNN. However, with a sufficient number of eigenvectors, adding new eigenvectors does not help, as the first eigenvectors (corresponding to the largest eigenvalues) already encode the most significant signals in the embedding matrix.
Figure 4(b) shows that when the number of graph clustering stages increases, the performance of HGMN also first improves but then saturates, similar to the case of the pooling eigenvectors. This is because when too many stages are used, each stage only makes a small change in the graph structure (e.g., grouping two nodes into one) and thus does not provide much additional information. For the PTC dataset, the performance of HGMN stabilizes at 5 stages, while for the AIDS dataset it stabilizes at 3 stages. This is because the graphs in PTC are larger and can support more meaningful stages. This phenomenon also suggests that more stages are beneficial for even larger graphs, on which HGMN may achieve even greater performance improvements over the existing baselines, as they do not use hierarchical clustering.
IV-D Efficiency Comparison
We report the average query processing time for GED similarity search of the different methods in Figure 5. Query processing time is the time taken to compute the approximate GED between a query and all dataset items; the reported results are measured in single-thread mode on a machine with an Intel(R) Xeon(R) E5-2697 v3 @ 2.6GHz CPU (56 physical cores) and 512GB RAM. We did not include the graph theory based methods (e.g., Beam [21], Hungarian [25] and VJ [10]) as they have been shown to be orders of magnitude slower than learning based methods [2]. We used a dataset with small graphs (AIDS) and a dataset with large graphs (PTC) to check the influence of graph size.
The results show that HGMN takes more time than the other methods because of its hierarchical graph clustering and explicit node alignment. The higher computation complexity of HGMN is more pronounced on the larger dataset (PTC vs. AIDS), as larger graphs need more hierarchical clustering stages and make node alignment more complex. However, HGMN is not significantly slower than the other methods (e.g., 29.1% and 12.5% slower than GraphSim on PTC and AIDS, respectively) because graph neural network computation on the original graph, which all the methods require, dominates the overall complexity. EmbAvg is the most efficient of all the methods as it uses a simple dot product between the averaged embeddings of two graphs, but its accuracy is poor according to Table II and Table III. We believe HGMN offers a reasonable tradeoff, trading a small increase in complexity for better accuracy.
V Conclusions
In this paper, we proposed the hierarchical graph matching network (HGMN) for efficient graph similarity computation. Motivated by the observation that two similar graphs should remain similar when they are clustered into more compact graphs, HGMN uses hierarchical clustering to provide the learning algorithm with a multi-scale view of the differences between graphs. In addition, HGMN adopts techniques including eigenvector based embedding pooling and earth mover based node alignment to build a complete machine learning pipeline. Experimental results on 4 datasets and 4 performance metrics show that HGMN consistently outperforms the baselines. Moreover, there is evidence that HGMN can scale to large graphs.
References
 [1] (2019) Simgnn: a neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 384–392. Cited by: §I, §IIB, §IVA, §IVA.
 [2] (2020) Learning-based efficient graph similarity computation via multi-scale convolutional set matching. In AAAI Conference on Artificial Intelligence. Cited by: §I, §I, §IIB, §IIIB, §IIIC, §IVA, §IVA, §IVA, §IVC, §IVD.
 [3] (2018) On the exact computation of the graph edit distance. Pattern Recognition Letters. Cited by: §I.
 [4] (2005) Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56. Cited by: §I.
 [5] (2017) Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87, pp. 38–46. Cited by: §IIB.
 [6] (1998) A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters 19 (3-4), pp. 255–259. Cited by: §I.
 [7] (2019) Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211. Cited by: §IVB.
 [8] (2008) Feature weighting in content based recommendation system using social network analysis. In Proceedings of the 17th international conference on World Wide Web, pp. 1041–1042. Cited by: §I.
 [9] (2007) Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discovery Today 12 (5-6), pp. 225–233. Cited by: §I.
 [10] (2011) Speeding up graph edit distance computation through fast bipartite matching. In Proceedings of the 8th International Conference on Graph-Based Representations in Pattern Recognition, pp. 102–111. Cited by: §I, §IIB, §IVA, §IVD.
 [11] (2015) Approximation of graph edit distance based on hausdorff matching. Pattern Recognition 48 (2), pp. 331–343. Cited by: §I, §IIB.
 [12] (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §IIB.
 [13] (1993) Comparing images using the hausdorff distance. IEEE Transactions on pattern analysis and machine intelligence 15 (9), pp. 850–863. Cited by: §IIB.
 [14] (2011) Malware classification based on call graph clustering. Journal in computer virology 7 (4), pp. 233–245. Cited by: §I.
 [15] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §I, §IIB, §IIIB.
 [16] (2009) On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 195–202. Cited by: §I.
 [17] (2017) Distance metric learning using graph convolutional networks: application to functional brain networks. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pp. 469–477. Cited by: §IIB, §IVA.
 [18] (2019) Graph matching networks for learning the similarity of graph structured objects. In International Conference on Machine Learning, pp. 3835–3845. Cited by: §I, §I, §IIB, §IIIC, §IVA.
 [19] (2019) Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731. Cited by: §IIIA.
 [20] (2017) A partitioning algorithm for maximum common subgraph problems. Cited by: §I, §IIB, §IVA.
 [21] (2006) Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 163–172. Cited by: §I, §IIB, §IVA, §IVD.
 [22] (2010) Web graph similarity for anomaly detection. Journal of Internet Services and Applications 1 (1), pp. 19–30. Cited by: §I.
 [23] (2002) Rascal: calculation of graph similarity using maximum common edge subgraphs. The Computer Journal 45 (6), pp. 631–644. Cited by: §I, §IIA.
 [24] (2018) Learning graph distances with message passing neural networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2239–2244. Cited by: §IIB, §IVA.
 [25] (2009) Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing 27 (7), pp. 950–959. Cited by: §I, §IIB, §IVA, §IVD.
 [26] (1998) A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 59–66. Cited by: §I, §IIIB.
 [27] (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §IIIB.
 [28] (2010) Detecting malware variants via functioncall graph similarity. In 2010 5th International Conference on Malicious and Unwanted Software, pp. 113–120. Cited by: §I.
 [29] (1979) Error-correcting isomorphisms of attributed relational graphs for pattern analysis. IEEE Transactions on Systems, Man, and Cybernetics 9 (12), pp. 757–768. Cited by: §I.
 [30] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §IIIB.
 [31] (2010) Modeling relationship strength in online social networks. In Proceedings of the 19th international conference on World wide web, pp. 981–990. Cited by: §I.
 [32] (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §IIIB.
 [33] (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. Cited by: §IIIB.
 [34] (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 363–376. Cited by: §I.
 [35] (2008) Graph similarity scoring and matching. Applied mathematics letters 21 (1), pp. 86–94. Cited by: §I.
 [36] (2009) Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment 2 (1), pp. 25–36. Cited by: §I, §I, §IIA, §IVA.
 [37] (2013) Efficient processing of graph similarity queries with edit distance constraints. The VLDB Journal 22 (6), pp. 727–752. Cited by: §IVA.