Hierarchical Graph Matching Network for Graph Similarity Computation

06/30/2020 ∙ by Haibo Xiu, et al. ∙ MIT ∙ Zhejiang University ∙ The Chinese University of Hong Kong

Graph edit distance / similarity is widely used in many tasks, such as graph similarity search, binary function analysis, and graph clustering. However, computing the exact graph edit distance (GED) or maximum common subgraph (MCS) between two graphs is known to be NP-hard. In this paper, we propose the hierarchical graph matching network (HGMN), which learns to compute graph similarity from data. HGMN is motivated by the observation that two similar graphs should also be similar when they are compressed into more compact graphs. HGMN utilizes multiple stages of hierarchical clustering to organize a graph into successively more compact graphs. At each stage, the earth mover distance (EMD) is adopted to obtain a one-to-one mapping between the nodes in two graphs (on which graph similarity is to be computed), and a correlation matrix is also derived from the embeddings of the nodes in the two graphs. The correlation matrices from all stages are used as input for a convolutional neural network (CNN), which is trained to predict graph similarity by minimizing the mean squared error (MSE). Experimental evaluation on 4 datasets in different domains and 4 performance metrics shows that HGMN consistently outperforms existing baselines in the accuracy of graph similarity approximation.




I Introduction

Graphs are a powerful form of data representation and are widely used in areas such as social networks [31, 29, 16], biomedical analysis [4, 9], recommender systems [8], and computer security [28, 14]. Graph distance (or similarity) is important for many graph-based tasks such as graph similarity search [36, 35], binary function analysis [34] and anomaly detection [22]. (For conciseness, we refer to both graph distance and graph similarity as graph similarity, as it is easy to transform a distance measure into a similarity measure.) For example, in binary function analysis, there is a database of control-flow graphs that are known to have problems, and the goal is to find whether a piece of software is prone to these problems. A natural solution is to search the graph database to decide whether it contains control-flow graphs similar to the control-flow graph of the software, for which graph similarity computation is needed. More applications of graph distance can be found in [2].

Graph edit distance (GED) and maximum common subgraph (MCS) are two general measures for the similarity between two graphs [23]. GED is the minimum number of edit operations (e.g., node/edge deletion/insertion) to transform one graph into another. MCS is the size of the largest common subgraph (with respect to the number of nodes) shared by two graphs. Computing the exact GED and MCS between two graphs is known to be NP-hard and still challenging in practice [6, 36]. Moreover, it is reported that the state-of-the-art algorithms fail to compute the exact GED between 2 graphs with more than 16 nodes in a reasonable time [3].

Many methods have been proposed to compute graph similarity, and they usually provide approximate results in exchange for faster computation. These methods can be roughly classified into two categories, i.e., graph theory based methods and learning based methods. Among the graph theory based methods, BEAM [21] uses beam search to avoid the high complexity of searching the full space. Hungarian [25] and VJ [10] use linear programming to approximate GED. HED [11] matches the nodes in two graphs using their local structures. MC-SPLIT [20] uses a branch and bound algorithm to compute MCS. Among the learning based methods, GraphSIM [2] utilizes the graph convolutional network (GCN) [15] to compute the node embeddings and the embedding correlation matrix used to predict graph similarity. Graph matching network (GMN) [18] adopts an attention layer to match the nodes in two graphs during embedding learning and computes the GED using the embeddings of the two graphs. Currently, the learning based methods are shown to outperform the graph theory based methods in both accuracy and efficiency, and thus benefit tasks that require graph similarity estimation. We give a more detailed introduction to the related work and discuss how HGMN differs from it in Section II.


Existing learning based methods use either the embedding of each individual node or the embedding of an entire graph [1, 18], and thus fail to capture local topological structures at different scales. In this paper, we observe that graph similarity can benefit from a multi-scale view. That is, if two graphs are similar to each other, they are also similar when compressed into more compact graphs, and conversely, if two graphs are different, their compact graphs are also likely to be different. We propose the hierarchical graph matching network (HGMN), which uses multiple stages of spectral clustering to cluster the graphs into successively more compact graphs. In each stage of the clustering, the earth mover distance (EMD) [26] is used to explicitly align the nodes in the two graphs such that the network does not have to learn complex node permutations. We derive correlation matrices from the node embeddings in each stage and these matrices are fed into a convolutional neural network (CNN) to predict graph similarity. The entire pipeline is trained end-to-end in a data-driven fashion.

We experimented on 4 datasets (i.e., AIDS, LINUX, IMDB-MULTI, and PTC) and used 4 performance metrics (i.e., mean squared error, Spearman's rank correlation coefficient, Kendall's rank correlation coefficient, and precision at 10) to evaluate the accuracy of graph similarity approximation. The results show that HGMN consistently outperforms the state-of-the-art baselines on different datasets and performance metrics. Compared with the best performing baseline, the improvement in accuracy is 12.0% on average and can be up to 62.6%. Moreover, we also experimented with the key designs in HGMN, i.e., hierarchical graph clustering and explicit node matching, and the results show that both of them lead to performance improvement.

II Background and Related Work

Fig. 1: An illustration of GED and MCS, best viewed in color
Fig. 2: The pipeline of hierarchical graph matching network (HGMN), best viewed in color

II-A Problem Formulation

A graph is represented as $G = (V, E)$, in which $V$ is the set of nodes and $E$ is the set of edges. We denote $N = |V|$ and $M = |E|$. Each node $v_i \in V$ can come with a feature vector $x_i \in \mathbb{R}^{d}$. We set $x_i$ as a one-hot vector or the all-ones vector in $\mathbb{R}^{d}$ when the graph does not come with node feature vectors. We assume that edges do not have weights and focus on undirected graphs. The adjacency matrix of the graph is denoted by $A \in \{0, 1\}^{N \times N}$.

For two graphs $G_1$ and $G_2$, their GED is the minimum number of edit operations in the optimal alignment that transforms one graph into the other [36]. The edit operations include edge deletion/insertion, node deletion/insertion and relabeling a node. We transform GED into a similarity score so that its range is bounded and well-defined. A maximum common subgraph of two graphs $G_1$ and $G_2$ is a subgraph common to both $G_1$ and $G_2$ such that there is no other common subgraph of $G_1$ and $G_2$ that contains more nodes [23]. We restrict the maximum common subgraph to be a connected graph, and MCS is defined as the number of nodes in the maximum common subgraph. An illustration of GED and MCS is provided in Figure 1.
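To make the GED definition concrete, the following is a minimal brute-force sketch for tiny graphs. It assumes unlabeled, undirected graphs and unit cost for every node or edge insertion/deletion (relabeling is not modeled), and the function name `brute_force_ged` is hypothetical. It enumerates all node mappings, so it is only meant to illustrate the definition and why exact computation does not scale.

```python
from itertools import permutations

def brute_force_ged(n1, edges1, n2, edges2):
    """Exact GED between two small unlabeled, undirected graphs.

    Nodes are 0..n-1 and edges are (u, v) pairs without self-loops.
    Every node or edge insertion/deletion costs 1 (assumption).
    """
    # Make graph 1 the smaller one; GED is symmetric under unit costs.
    if n1 > n2:
        n1, edges1, n2, edges2 = n2, edges2, n1, edges1
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = None
    # Try every injective mapping of graph-1 nodes onto graph-2 nodes.
    for image in permutations(range(n2), n1):
        mapped = {frozenset(image[u] for u in e) for e in e1}
        # Unmatched nodes are inserted; edges present on only one side
        # are deleted or inserted.
        cost = (n2 - n1) + len(mapped ^ e2)
        if best is None or cost < best:
            best = cost
    return best

triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
print(brute_force_ged(3, triangle, 3, path))  # one edge deletion
```

The loop visits n2! / (n2 - n1)! mappings, which is consistent with the observation that exact GED quickly becomes impractical as graphs grow.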

Our goal is to learn a model that predicts the (transformed) GED or MCS between two graphs. Under reasonable computational complexity constraint, we want the prediction of the model to be as accurate as possible.

II-B Existing Methods

Graph-theory based methods. The A* algorithm [12] is widely used for GED computation and returns the exact result. A* is a best-first algorithm that tries to find an optimal path in a search tree. The idea is to formulate the problem using a tree structure, in which the root node is the starting point, inner nodes represent partial solutions, and leaf nodes are complete solutions. As A* has exponential time complexity, it is only suitable for small graphs and cannot finish in a reasonable time for large graphs. Several heuristics have been proposed to improve the execution time, which sacrifice accuracy for efficiency. For example, BEAM utilizes beam search to avoid searching the full search space and introduces a heuristic rule to favor long partial edit paths over shorter ones [21].
Some methods measure GED via linear programming [5]. Based on bipartite graph matching, Hungarian [25] and VJ [10] replace the cost of editing a node by the cost of editing the 1-star graph centered at this node. The cost of substituting a star graph by another one is further expressed as the solution of a square linear sum assignment problem. HED [11] matches the nodes in two graphs using their local structures, and GED is approximated by the Hausdorff distance [13] between the nodes in the two graphs. To compute MCS, a branch and bound algorithm is used in MC-SPLIT [20], which obtains the result by finding a maximum-cardinality mapping between the graphs.

Learning-based methods. The learning based methods usually use graph neural networks (such as GCN) to learn embeddings and predict graph similarity from the embeddings. SMPNN predicts graph similarity using a summation of the similarities between the nodes in the two graphs [24]. GCNMEAN and GCNMAX [17] use GCN [15] to learn graph embeddings and train a fully connected neural network to compute graph similarity from the embeddings of two graphs. SIMGNN uses both graph embeddings and node similarities to predict graph similarity [1]. GMN [18] introduces a cross-graph attention layer to allow the nodes in the two graphs to interact with each other but still predicts graph similarity using graph embeddings. GraphSIM [2] utilizes GCNs with different numbers of layers to build multiple correlation matrices among the nodes in the two graphs and uses the correlation matrices to predict graph similarity.

Our contributions. HGMN adopts successful techniques from existing learning based methods, e.g., applying GCN to learn node embeddings and using the node correlation matrix as input for a neural network. However, HGMN has two fundamental differences from existing learning-based methods. First, we use multiple stages of spectral clustering to create a multi-scale view of the similarity between graphs. The hierarchical clustering provides more information for the downstream neural network as the differences between two graphs can be captured in the correlation matrices of different scales. Second, we explicitly align the nodes in the two graphs using the earth mover distance and compute the correlation matrix in the aligned order. Node alignment ensures that the correlation matrix is the same under arbitrary node permutation and thus the neural network does not need to learn to be robust to node permutation. We will show in the experiments that both designs are crucial for performance, especially on large graphs.

III Hierarchical Graph Matching Network

The data processing pipeline of HGMN is illustrated in Figure 2. HGMN uses multiple stages of spectral clustering to organize the graphs into successively more compact graphs. In each stage, an embedding pooling operator is applied to derive the initial node embedding in this stage from the node embedding of the previous stage. Then, the initial node embeddings are processed by a GCN to generate refined node embeddings. Based on the refined embeddings, we use the earth mover distance to build a one-to-one mapping for nodes in the two graphs (i.e., node alignment) to ensure permutation invariance. The correlation matrices from all stages are fed into a CNN model to predict the similarity score of two graphs. The GCNs for all hierarchical clustering stages and the CNN model are trained from data. In the following, we introduce the modules of the HGMN pipeline in more detail.

III-A Hierarchical Graph Clustering

The procedure of hierarchical graph clustering is described in Algorithm 1, which is conducted in $K$ stages. In each stage $k$, a new graph $G^{(k)}$ and its adjacency matrix $A^{(k)}$ are constructed from the graph in the previous stage (i.e., $G^{(k-1)}$). The size sequence of the graphs satisfies $n_0 > n_1 > \dots > n_K$, in which $n_0$ is the size of the original graph, $n_k$ is the size of the compact graph in stage $k$ and we always use $n_K = 1$. Therefore, the graph becomes smaller and smaller as the stages go on. In the 4th line of Algorithm 1, the normalized Laplacian for a graph with adjacency matrix $A$ is defined as $L_{norm} = D^{-1/2} L D^{-1/2}$, in which $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ and $L = D - A$ is the Laplacian matrix. Each row (i.e., $u_i$) of the matrix $U$ corresponds to a node in $G^{(k-1)}$. The k-means in the 7th line of Algorithm 1 groups the nodes in $G^{(k-1)}$ into $n_k$ clusters and we treat each cluster of nodes as a single node in the compact graph $G^{(k)}$. The for-loop in the 9th line of Algorithm 1 constructs the adjacency matrix of $G^{(k)}$ and we assume two nodes are connected in $G^{(k)}$ if they contain connected nodes in $G^{(k-1)}$.

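1:  Input: A graph $G^{(0)}$, its adjacency matrix $A^{(0)}$, the number of clustering stages $K$ and the size $n_k$ of the compact graph for each stage $k$
2:  Output: successively more compact graphs $G^{(1)}, \dots, G^{(K)}$ and their adjacency matrices $A^{(1)}, \dots, A^{(K)}$
3:  for $k = 1, \dots, K$ do
4:     Compute the normalized Laplacian of $G^{(k-1)}$ as $L_{norm} = D^{-1/2} (D - A^{(k-1)}) D^{-1/2}$
5:     Compute the eigenvectors corresponding to the $n_k$ smallest eigenvalues of $L_{norm}$ and use them as the columns of $U$
6:     Normalize each row of matrix $U$ to unit norm
7:     Conduct k-means clustering to cluster the rows of $U$ into $n_k$ clusters $C_1, \dots, C_{n_k}$
8:     Initialize $A^{(k)} = 0$
9:     for each edge $(i, j)$ in $G^{(k-1)}$ do
10:        if $i \in C_p$, $j \in C_q$ and $p \neq q$ then
11:           $A^{(k)}_{pq} = A^{(k)}_{qp} = 1$
12:        end if
13:     end for
14:  end for
Algorithm 1 Hierarchical graph compaction with spectral clustering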
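One stage of the compaction procedure above can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names `kmeans` and `compact_graph` are hypothetical, the k-means step uses a simple deterministic farthest-point initialization (an assumption), and the number of eigenvectors is taken to be equal to the number of clusters.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with farthest-point initialization (assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def compact_graph(adj, k):
    """Cluster the nodes of `adj` into k super-nodes and build their adjacency."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.diag(deg) - adj
    lap_norm = d_inv_sqrt[:, None] * lap * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, vecs = np.linalg.eigh(lap_norm)
    u = vecs[:, :k]
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    u = u / np.where(norms > 0, norms, 1.0)
    labels = kmeans(u, k)
    compact = np.zeros((k, k), dtype=int)
    for i, j in zip(*np.nonzero(adj)):
        # Two super-nodes are connected if they contain connected nodes.
        if labels[i] != labels[j]:
            compact[labels[i], labels[j]] = 1
            compact[labels[j], labels[i]] = 1
    return labels, compact

# Two triangles joined by a single edge (2, 3).
adj = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
labels, compact = compact_graph(adj, 2)
print(labels, compact.tolist())
```

Calling `compact_graph` repeatedly on its own output with decreasing `k` would give the sequence of successively more compact graphs.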

We use spectral clustering for graph compaction for two reasons. Firstly, it preserves the local structure of the graph. As we focus on unweighted graphs, spectral clustering approximately minimizes the number of cross edges (normalized by the size of the graph clusters) between the graph clusters. This means that nodes in each graph cluster tend to be strongly connected. Secondly, spectral clustering also allows flexible control of the number of graph clusters by setting the number of k-means centers. We provide an illustration of hierarchical graph clustering in the leftmost part of Figure 2, which shows that well-connected nodes are grouped into the same graph cluster. Moreover, it can be observed that for two different graphs, their compact graphs in the same clustering stage are also different. Thus, hierarchical graph clustering provides the downstream neural network with a multi-scale view of the differences between two graphs, which makes the task of graph similarity prediction easier. Hierarchical clustering also makes HGMN more expressive and general than existing learning-based methods for graph similarity approximation. As we use the original graph in the 0-th stage and set the compact graph size to 1 for the final stage, methods that use either node embedding or graph embedding can be regarded as special cases of HGMN.

Embedding pooling. In each stage of hierarchical graph clustering, we derive the initial node embedding for $G^{(k)}$ from the node embedding of $G^{(k-1)}$. We call this procedure embedding pooling, which is motivated by EigenPooling [19]. We show how embedding pooling works for a graph cluster $C_p$, which contains nodes from $G^{(k-1)}$ and is treated as a single node in $G^{(k)}$. Assume that the node embedding of $G^{(k-1)}$ has dimension $d$; the embedding matrix for $C_p$ can be organized as $H_p \in \mathbb{R}^{|C_p| \times d}$, in which each row corresponds to the embedding of a node from $C_p$ in $G^{(k-1)}$. We can also define an adjacency matrix $A_p$ for the nodes in $C_p$ by keeping the edges in $G^{(k-1)}$ for which both end points are contained in $C_p$. With the adjacency matrix $A_p$, we can define the Laplacian matrix $L_p$ for $C_p$ and solve for the eigenvectors of the Laplacian matrix as $u_1, u_2, \dots, u_{|C_p|}$, in which $u_1$ is the eigenvector corresponding to the largest eigenvalue of $L_p$. The initial embedding vector of $C_p$ in $G^{(k)}$ is obtained as

$$x_p = H_p^{\top} u_1, \qquad (1)$$

in which $x_p \in \mathbb{R}^{d}$ and is used as the initial embedding for node $p$ in $G^{(k)}$. The intuition is that $u_1$ corresponds to the high frequency signal on $C_p$ in spectral graph theory. By projecting $H_p$ onto $u_1$, we keep the signal component in $H_p$ that changes the fastest on $C_p$. Using more eigenvectors (e.g., $u_2$, $u_3$), we can create multiple initial embeddings for $C_p$ and these embeddings can work in parallel in a similar manner to multiple image channels in a CNN. Figure 2 (third column) shows that multiple initial embeddings can be used to generate multiple correlation matrices for a stage after they go through GCN update and node alignment.
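The pooling step described above can be sketched as follows, under the assumption that the initial embedding of a cluster is the projection of its embedding matrix onto the Laplacian eigenvector(s) with the largest eigenvalue(s). The function name `pool_cluster` and the parameter `num_vectors` are hypothetical, and eigenvectors are only defined up to sign, so the pooled vector may flip sign between runs or libraries.

```python
import numpy as np

def pool_cluster(h, adj, members, num_vectors=1):
    """Project the embeddings of one cluster onto its top Laplacian eigenvectors.

    h:       (n, d) node embeddings of the previous stage
    adj:     (n, n) adjacency matrix of the previous stage
    members: indices of the nodes that form the cluster
    """
    members = np.asarray(members)
    h_c = h[members]                         # (|C|, d) cluster embeddings
    a_c = adj[np.ix_(members, members)]      # edges with both ends in the cluster
    lap = np.diag(a_c.sum(axis=1)) - a_c     # unnormalized Laplacian of the cluster
    _, vecs = np.linalg.eigh(lap)            # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :num_vectors]     # largest eigenvalues first
    return top.T @ h_c                       # (num_vectors, d)

h = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(pool_cluster(h, adj, [0, 1]))
```

Each row of the returned matrix plays the role of one initial embedding "channel" for the super-node.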

III-B Node Embedding and Alignment

We use a graph convolutional network (GCN) [15] to refine the initial embedding for each stage. The GCNs for different stages have the same number of layers but do not share model parameters. Assume that the graph $G$ contains $n$ nodes, the adjacency matrix of $G$ is $A \in \mathbb{R}^{n \times n}$ and the initial embedding matrix is $X \in \mathbb{R}^{n \times d}$. A layer of GCN updates the embedding as follows

$$X' = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W\right), \qquad (2)$$

in which $\sigma$ is the activation function, $\tilde{A} = A + I$ is the augmented adjacency matrix with self-loops, $\tilde{D}$ is the degree matrix defined on the augmented adjacency matrix $\tilde{A}$, $W \in \mathbb{R}^{d \times d'}$ is a learnable mapping matrix, and $d'$ is the dimension of the updated embedding for $G$. The layer in Equation (2) can be stacked to form a multi-layer GCN. GCN is shown to achieve good performance on many graph-based tasks such as node classification [15], link prediction [27] and graph classification [27]. Recently, it is also shown that GCN can approximate the Weisfeiler-Lehman (WL) graph isomorphism test [32], which decides whether two graphs are topologically identical. We choose GCN as the default embedding model for HGMN due to its expressiveness and simplicity, but more sophisticated graph neural network models such as GAT [30] and JK-Net [33] can also be easily incorporated into HGMN.
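A single GCN layer of the form described above can be sketched in a few lines of NumPy. This is a forward pass only, with ReLU assumed as the activation function; the function name `gcn_layer` is hypothetical, and in practice the mapping matrix would be learned by a deep learning framework rather than passed in by hand.

```python
import numpy as np

def gcn_layer(adj, h, weight):
    """One GCN propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    a_tilde = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ h @ weight, 0.0)    # ReLU (assumption)

adj = np.array([[0.0, 1.0], [1.0, 0.0]])   # two connected nodes
h = np.array([[1.0], [3.0]])               # one feature per node
weight = np.array([[1.0]])                 # identity mapping for illustration
print(gcn_layer(adj, h, weight))           # both nodes get the average, 2.0
```

Stacking several such calls, each with its own weight matrix, gives a multi-layer GCN.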

Similar to GRAPHSIM [2], we use the embedding correlation matrices as the input for the neural network that predicts graph similarity. Assume that for a stage, we have two graphs $G_1$ and $G_2$ containing $N_1$ and $N_2$ nodes, respectively. (Footnote: Actually, only in the 0-th stage (i.e., on the original graphs) can the two graphs have different sizes, i.e., $N_1 \neq N_2$. In each stage of clustering, the same output graph size is pre-specified for all graphs (i.e., $N_1 = N_2$) such that the input matrices to the downstream CNN have fixed size. We use $N_1$ and $N_2$ for the sizes of $G_1$ and $G_2$ to consider the most general case.) Their GCN embeddings are denoted as $H_1 \in \mathbb{R}^{N_1 \times d}$ and $H_2 \in \mathbb{R}^{N_2 \times d}$ and the correlation matrix is $S = H_1 H_2^{\top}$. However, as there is no canonical ordering of the nodes in a graph, the rows of $H_1$ and $H_2$ will be permuted under different node numberings, which results in different $S$. We provide an illustration of this phenomenon in Figure 3, in which $S' = H_1 H_2'^{\top}$ and $H_2'$ is a permutation of $H_2$. We want $S$ and $S'$ to be identical as $(H_1, H_2)$ and $(H_1, H_2')$ essentially represent the same pair of graphs. However, as shown in Figure 3, $S$ and $S'$ are quite different. This means that the downstream CNN needs to be robust to node permutation, which makes the learning task difficult. Therefore, we use the earth mover distance [26] to explicitly align the nodes in $G_1$ and $G_2$.

Fig. 3: An illustration of the need for node alignment, different shades in indicate different correlation values, best viewed in color

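We define a distance matrix $\Delta \in \mathbb{R}^{N_1 \times N_2}$ on the embedding matrices $H_1$ and $H_2$ as $\Delta_{ij} = \| H_1[i] - H_2[j] \|_2$, in which $H[i]$ denotes the $i$-th row of matrix $H$. Then the earth mover distance between $H_1$ and $H_2$ is defined as follows

$$\mathrm{EMD}(H_1, H_2) = \min_{F \ge 0} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} F_{ij} \Delta_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{N_2} F_{ij} = \frac{1}{N_1}, \;\; \sum_{i=1}^{N_1} F_{ij} = \frac{1}{N_2}, \qquad (3)$$

in which $F \in \mathbb{R}^{N_1 \times N_2}$ is the weight matrix. Intuitively, $\Delta_{ij}$ models the cost of transporting unit mass from $H_1[i]$ to $H_2[j]$ while $F_{ij}$ models the amount of mass transported from $H_1[i]$ to $H_2[j]$. As $F$ can be marginalized into $1/N_1$ and $1/N_2$, each row of $H_1$ and $H_2$ will send/receive $1/N_1$ and $1/N_2$ unit of the mass, respectively. By minimizing over $F$, the earth mover distance encourages $F_{ij}$ to be large if the distance between $H_1[i]$ and $H_2[j]$ is small.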
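For intuition, the earth mover distance can be computed by brute force when the two embedding sets have the same size and uniform mass: in that case an optimal transport plan can always be chosen as a one-to-one assignment, so it suffices to enumerate permutations. The sketch below makes these assumptions explicit, uses the Euclidean distance between rows (an assumption), and uses the hypothetical name `emd_equal_size`. A real implementation would use a linear programming or optimal transport solver instead.

```python
from itertools import permutations
import math

def emd_equal_size(h1, h2):
    """EMD between two equally sized point sets with uniform mass 1/n.

    Returns (emd, flow), where flow[i][j] is the mass moved from row i
    of h1 to row j of h2. Brute force, so only usable for tiny n.
    """
    n = len(h1)
    assert len(h2) == n, "this sketch only covers equal sizes"
    dist = [[math.dist(a, b) for b in h2] for a in h1]
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        cost = sum(dist[i][perm[i]] for i in range(n)) / n
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    flow = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(best_perm):
        flow[i][j] = 1.0 / n
    return best_cost, flow

h1 = [(0.0, 0.0), (1.0, 0.0)]
h2 = [(1.0, 0.0), (0.0, 0.0)]   # same points, rows swapped
print(emd_equal_size(h1, h2))   # zero cost, mass on the swapped pairs
```

The returned `flow` matrix puts its mass exactly on the row pairs that are close to each other, which is what the node matching step relies on.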

Algorithm 2 shows how to obtain a node matching between two graphs $G_1$ and $G_2$ using the weight matrix $F$ optimized by the earth mover distance. The idea is to match node pairs with large $F_{ij}$ in a greedy fashion. In the 5th line of Algorithm 2, ties are broken by taking the solution with the minimum $j$ value if there are multiple optimal solutions. As $F_{ij}$ is likely to be large when the embeddings of node $i$ in $G_1$ and node $j$ in $G_2$ have a small distance, Algorithm 2 essentially matches a node in $G_1$ to a similar node in $G_2$. If we define an ordering for the nodes in $G_1$, e.g., in descending order of the first dimension of their embedding, and arrange the nodes in $G_2$ using the matched order, the correlation matrix will be the same as long as one embedding matrix of $G_2$ can be transformed into the other via permutation. Therefore, with explicit node alignment, the downstream CNN does not need to be robust to node permutation, which simplifies learning.

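1:  Input: The sizes $N_1$ and $N_2$ of the two graphs $G_1$ and $G_2$, and the weight matrix $F$
2:  Output: A matching vector $m$, in which $m_i$ is the match of node $i$ in graph $G_1$
3:  Initialize the candidate set $J = \{1, \dots, N_2\}$
4:  for $i = 1, \dots, N_1$ do
5:     $j^{*} = \arg\max_{j \in J} F_{ij}$
6:     $m_i = j^{*}$
7:     Delete $j^{*}$ from $J$
8:  end for
Algorithm 2 Node matching with earth mover distance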
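The greedy matching can be sketched in plain Python as below. The function name `greedy_match` is hypothetical, the weight matrix is passed as a list of rows, ties are broken by the smaller column index, and the sketch assumes the first graph has no more nodes than the second so that an unused column is always available.

```python
def greedy_match(flow):
    """Match each row i to the unused column j with the largest flow[i][j]."""
    unused = set(range(len(flow[0])))
    match = []
    for row in flow:
        # Largest weight first; ties go to the smaller column index.
        j = min(unused, key=lambda c: (-row[c], c))
        match.append(j)
        unused.remove(j)
    return match

print(greedy_match([[0.0, 0.5], [0.5, 0.0]]))   # [1, 0]
```

Reordering the rows of the second embedding matrix by the returned matching before computing the correlation matrix is what makes the result independent of the original node numbering.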

One subtlety is that the correlation matrix does not have a fixed size for the original graphs (the 0-th stage of clustering). To ensure that the CNN has fixed size input, we use interpolation to up-sample the correlation matrix to a size of $N_{max} \times N_{max}$ for the 0-th stage, in which $N_{max}$ is the size of the largest graph in the dataset.

III-C Network Structure and Loss Function

Our network uses the correlation matrices from all stages as input to predict graph similarity. As illustrated in the rightmost part of Figure 2, the network consists of multiple convolutional layers and fully connected layers. Since the correlation matrices are similar to images, the convolutional layers are utilized to extract spatial features from them. The fully connected layers allow the features from different stages to interact with each other. The output of the network is a single value that indicates the distance / similarity between two graphs.

We formulate graph similarity prediction as a regression problem and use the mean squared error as the loss function

$$\mathcal{L}(\theta) = \frac{1}{B} \sum_{b=1}^{B} \left( \hat{s}_b(\theta) - s_b \right)^2, \qquad (4)$$

in which $\theta$ is the model parameter, $\hat{s}_b(\theta)$ is the graph distance predicted by the model for the $b$-th graph pair and $s_b$ is the ground-truth distance between the graph pair. We use mini-batch stochastic gradient descent (SGD) for training and in each mini-batch, $B$ graph pairs are randomly sampled from the training set to compute the loss. The trainable parameters in the model include the CNN used for distance prediction and the GCNs used for embedding in all stages.

HGMN can compute graph similarity efficiently, especially for graph similarity search, in which the dataset is known beforehand. Hierarchical graph clustering and node embedding can be conducted for the graphs in the database before the query arrives. The cost of spectral clustering for the query graph and of earth mover distance based node alignment depends on the size of the largest query graph. The complexity will not be a big problem if the graphs are not too large. For other computations that involve neural networks, the complexity of HGMN is similar to existing learning based methods such as GRAPHSIM [2] and GMN [18].

| Dataset | Domain | # Graphs | MIN/MAX nodes per graph | AVG nodes per graph |
|---|---|---|---|---|
| AIDS | Chemical compounds | 700 | 2/10 | 8.9 |
| LINUX | Program dependence graph | 1000 | 4/10 | 7.6 |
| IMDB-MULTI | Ego-networks | 332 | 16/89 | 25.0 |
| PTC | Biochemistry | 256 | 16/103 | 30.2 |

TABLE I: Dataset statistics

IV Experimental Evaluation

In this part, we first introduce the experiment settings, including datasets, performance metrics and baselines. Then we compare HGMN with the baselines for accuracy of graph similarity prediction. Finally, we examine the key designs in HGMN, i.e., hierarchical graph clustering and earth mover distance based node alignment, and test the influence of the parameters on the performance. All code to reproduce the results of HGMN will be released after the review process.

IV-A Experimental Settings

We largely follow the experimental settings in [2] and introduce the details as follows.

Datasets. We used 4 real datasets for the experiments, i.e., AIDS, PTC, LINUX and IMDB-MULTI. The AIDS dataset contains 42,687 chemical compound graphs from the Developmental Therapeutics Program at NCI/NIH and each node in a graph is associated with one out of 29 labels. AIDS has been widely used for the evaluation of graph similarity computation [36, 37, 1] and we randomly sampled 700 graphs from the dataset. PTC consists of 344 chemical compound graphs that report the carcinogenicity for male and female rats. Each node in the PTC dataset has one out of 19 possible labels. LINUX has 48,747 Program Dependence Graphs (PDG) generated from the Linux kernel. In each PDG, a node is one statement and an edge models the dependency between two statements. We randomly sampled 1,000 graphs from the original LINUX dataset. IMDB-MULTI is a movie-collaboration dataset containing 1,500 ego-networks of movie actors/actresses. In the ego-networks, each node represents a person and an edge models the collaboration between two persons. On IMDB-MULTI and PTC, we removed graphs containing fewer than 16 nodes to test the scalability of our method. For both LINUX and IMDB-MULTI, the nodes do not come with a label and we use the all-ones vector as the initial embedding of the nodes for the two datasets. The statistics of the datasets after preprocessing are reported in Table I.

Evaluation Methodology. For each dataset, we generated the training set, validation set and test set with a split ratio of 7:2:1. The model was trained on the training set and the hyper-parameters (e.g., the number of stages in hierarchical graph clustering and the number of layers in GCN ) were tuned using the validation set. Graphs in the test set were treated as queries and we evaluated how accurately the model approximates the similarity between the query graphs and the graphs in the entire dataset. For AIDS and LINUX, the A* algorithm was used to compute the ground-truth GED between the graphs. As A* has an exponential time complexity with respect to the number of nodes in the graphs, it took too much time for PTC and IMDB-MULTI. Therefore, we computed the ground-truth GED for PTC and IMDB-MULTI by taking the minimum of three approximate algorithms, i.e., Beam [21], Hungarian [25] and VJ [10]. The distances returned by the algorithms are larger than or equal to the true GED and the same ground-truth approximation methodology was also adopted in [2]. To compute the ground-truth MCS, we used MC-SPLIT [20] as it can finish in a long but tolerable time for our datasets. Note that the algorithms used to provide the ground-truth distance/similarity are typically orders of magnitude slower than learning based methods [2]. We excluded the test set from model training to show that the trained models can generalize to unseen data and thus improve the efficiency of graph similarity search.

We used four performance metrics to evaluate the accuracy of graph similarity approximation, i.e., average mean squared error (MSE), Spearman's rank correlation coefficient ($\rho$), Kendall's rank correlation coefficient ($\tau$) and precision at 10 ($p@10$). MSE is the mean squared error of the predicted GED/MCS compared with the ground-truth GED/MCS. For a query graph, the graphs in the dataset were ranked according to their predicted graph similarities. Both $\rho$ and $\tau$ evaluate how well the ranking based on predicted similarity matches the ranking based on ground-truth similarity, and a higher value means better performance. $p@10$ is the percentage of true top-10 nearest neighbors among the top-10 nearest neighbors obtained from the estimated graph similarity. MSE measures the accuracy of graph similarity approximation, while $\rho$, $\tau$ and $p@10$ evaluate how well the estimated graph similarity ranks the graphs, which is also important for graph similarity search.
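The three ranking metrics can be sketched in plain Python for a single query as follows. This is an illustrative sketch with hypothetical function names: Spearman's coefficient is computed as the Pearson correlation of average ranks, Kendall's coefficient uses the simple tau-a variant (an assumption, since the variant is not stated), and precision at k treats the inputs as distances, so smaller values mean nearer neighbors.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(pred, true):
    return pearson(ranks(pred), ranks(true))

def kendall_tau(pred, true):
    """Tau-a: (concordant - discordant) pairs over all pairs."""
    n, s = len(pred), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (pred[i] - pred[j]) * (true[i] - true[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def precision_at_k(pred_dist, true_dist, k=10):
    """Fraction of the true k nearest neighbors found in the predicted top k."""
    def top(d):
        return set(sorted(range(len(d)), key=lambda i: d[i])[:k])
    return len(top(pred_dist) & top(true_dist)) / k

pred = [1.0, 2.0, 3.0, 4.0]   # predicted distances from one query
true = [1.0, 2.0, 4.0, 3.0]   # ground-truth distances
print(spearman_rho(pred, true), kendall_tau(pred, true), precision_at_k(pred, true, k=3))
```

Averaging these per-query values over all query graphs in the test set would give dataset-level scores.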

Baselines. As the learning based methods were shown to outperform the graph theory based methods in both accuracy and efficiency [2], we mainly compared with the learning based methods. The baselines include SMPNN [24], GCNMEAN, GCNMAX [17], SIMGNN [1], GMN [18] and GraphSIM [2]. EMBAVG is a simple baseline introduced in [2] that computes graph similarity using the dot product of two graph embeddings. As the results of our runs for SIMGNN and GraphSIM on the AIDS and LINUX datasets are slightly worse than those reported in their papers, we reused the results from their papers. By default, HGMN uses 4 hierarchical clustering stages with sizes 6, 4, 2, 1 for AIDS and LINUX (small graphs), and 6 hierarchical clustering stages with sizes 64, 16, 8, 4, 2, 1 for PTC and IMDB-MULTI (large graphs).

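| Dataset | Metric | Baselines | | | | | | | HGMN | Improvement (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| AIDS | MSE | 4.725 | 3.185 | 2.124 | 3.423 | 1.189 | 1.741 | 0.787 | 0.752 | 4.4 |
| AIDS | $\rho$ | 0.306 | 0.642 | 0.653 | 0.628 | 0.843 | 0.751 | 0.874 | 0.883 | 1.0 |
| AIDS | $\tau$ | 0.480 | 0.592 | 0.629 | 0.505 | 0.690 | 0.642 | 0.776 | 0.778 | 0.3 |
| AIDS | $p@10$ | 0.092 | 0.179 | 0.194 | 0.290 | 0.421 | 0.401 | 0.534 | 0.537 | 0.5 |
| LINUX | MSE | 11.523 | 11.244 | 7.541 | 6.341 | 1.509 | 1.027 | 0.058 | 0.056 | 3.4 |
| LINUX | $\rho$ | 0.046 | 0.245 | 0.579 | 0.724 | 0.939 | 0.941 | 0.981 | 0.984 | 0.3 |
| LINUX | $\tau$ | 0.016 | 0.301 | 0.525 | 0.740 | 0.879 | 0.896 | 0.907 | 0.920 | 1.4 |
| LINUX | $p@10$ | 0.014 | 0.071 | 0.141 | 0.541 | 0.942 | 0.933 | 0.992 | 0.996 | 0.4 |
| IMDB-MULTI | MSE | 32.596 | 71.789 | 68.823 | 58.425 | 2.964 | 3.210 | 1.924 | 0.719 | 62.6 |
| IMDB-MULTI | $\rho$ | 0.107 | 0.229 | 0.402 | 0.449 | 0.781 | 0.725 | 0.825 | 0.930 | 12.7 |
| IMDB-MULTI | $\tau$ | 0.644 | 0.187 | 0.378 | 0.354 | 0.770 | 0.782 | 0.821 | 0.914 | 11.3 |
| IMDB-MULTI | $p@10$ | 0.021 | 0.210 | 0.219 | 0.437 | 0.724 | 0.751 | 0.813 | 0.853 | 4.9 |
| PTC | MSE | 134.124 | 44.184 | 7.428 | 8.329 | 1.473 | 1.854 | 0.889 | 0.820 | 7.8 |
| PTC | $\rho$ | 0.127 | 0.324 | 0.546 | 0.506 | 0.726 | 0.670 | 0.714 | 0.958 | 34.2 |
| PTC | $\tau$ | 0.167 | 0.315 | 0.490 | 0.468 | 0.678 | 0.592 | 0.719 | 0.941 | 30.9 |
| PTC | $p@10$ | 0.087 | 0.144 | 0.210 | 0.241 | 0.475 | 0.374 | 0.541 | 0.623 | 15.2 |

TABLE II: Accuracy comparison for GED approximation. For MSE, a smaller value means better performance; for $\rho$, $\tau$ and $p@10$, a larger value means better performance. The last column is the improvement of HGMN over the best performing baseline.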
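| Dataset | Metric | Baselines | | | | | | | HGMN | Improvement (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| AIDS | MSE | 4.268 | 6.148 | 6.234 | 4.156 | 3.433 | 2.234 | 2.402 | 2.213 | 0.9 |
| AIDS | $\rho$ | 0.772 | 0.723 | 0.756 | 0.801 | 0.822 | 0.901 | 0.858 | 0.902 | 0.1 |
| AIDS | $\tau$ | 0.529 | 0.510 | 0.498 | 0.574 | 0.680 | 0.803 | 0.798 | 0.871 | 9.1 |
| AIDS | $p@10$ | 0.379 | 0.243 | 0.347 | 0.315 | 0.374 | 0.513 | 0.505 | 0.525 | 2.3 |
| LINUX | MSE | 3.397 | 2.784 | 2.689 | 2.170 | 0.729 | 0.794 | 0.164 | 0.153 | 6.7 |
| LINUX | $\rho$ | 0.134 | 0.475 | 0.521 | 0.714 | 0.859 | 0.939 | 0.962 | 0.960 | 0.2 |
| LINUX | $\tau$ | 0.675 | 0.715 | 0.747 | 0.784 | 0.889 | 0.934 | 0.946 | 0.962 | 1.7 |
| LINUX | $p@10$ | 0.235 | 0.378 | 0.421 | 0.459 | 0.850 | 0.949 | 0.951 | 0.960 | 0.9 |
| IMDB-MULTI | MSE | 15.145 | 19.354 | 10.457 | 10.124 | 2.451 | 0.590 | 1.287 | 0.529 | 10.3 |
| IMDB-MULTI | $\rho$ | 0.310 | 0.478 | 0.746 | 0.841 | 0.930 | 0.941 | 0.976 | 0.981 | 0.5 |
| IMDB-MULTI | $\tau$ | 0.530 | 0.386 | 0.611 | 0.619 | 0.879 | 0.920 | 0.946 | 0.981 | 3.7 |
| IMDB-MULTI | $p@10$ | 0.01 | 0.211 | 0.387 | 0.451 | 0.812 | 0.875 | 0.882 | 0.896 | 1.6 |
| PTC | MSE | 14.875 | 26.412 | 12.441 | 13.845 | 5.419 | 3.142 | 3.975 | 2.551 | 18.8 |
| PTC | $\rho$ | 0.578 | 0.647 | 0.578 | 0.6617 | 0.712 | 0.782 | 0.779 | 0.811 | 3.7 |
| PTC | $\tau$ | 0.522 | 0.419 | 0.650 | 0.688 | 0.746 | 0.792 | 0.8 | 0.812 | 1.5 |
| PTC | $p@10$ | 0.187 | 0.352 | 0.384 | 0.402 | 0.356 | 0.584 | 0.498 | 0.609 | 4.3 |

TABLE III: Accuracy comparison for MCS approximation. For MSE, a smaller value means better performance; for $\rho$, $\tau$ and $p@10$, a larger value means better performance. The last column is the improvement of HGMN over the best performing baseline.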

IV-B Main Performance Results

We report our main results in Table II and Table III, which compare HGMN with the baselines for the accuracy of GED and MCS prediction, respectively. We can make two observations from the results. First, HGMN consistently outperforms the baselines for both GED and MCS, across 4 different datasets and 4 different performance metrics. For GED, the performance improvement over the best performing baseline is 11.96% on average (averaged over all datasets and performance metrics) and can be up to 62.6%. Compared with the best performing baseline, the improvement for MCS is 4.15% on average and can be up to 18.8%. Second, the performance improvement of HGMN is significantly larger for the datasets with larger graphs (IMDB-MULTI and PTC) than for the datasets with smaller graphs (AIDS and LINUX). We conjecture that this is because larger graphs have richer structures when they are clustered into more compact graphs. These structures are better captured on the compact graphs than on the original graphs. In contrast, the graphs in AIDS and LINUX are small (with no more than 10 nodes) and thus a GCN with a moderate number of layers is already able to capture the structures at different scales. For large graphs, a GCN with a large number of layers is required, but graph neural networks with too many layers are known to be prone to over-smoothing [7], which often leads to poor performance. This explanation suggests that hierarchical clustering may enable HGMN to perform well on even larger graphs and we provide more evidence to support this explanation in Section IV-C.
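The improvement figures quoted above appear to be relative changes with respect to the best baseline value. The sketch below shows this arithmetic under that assumption; the function name `relative_improvement` is hypothetical, and the inputs are the AIDS and IMDB-MULTI values for GED approximation.

```python
def relative_improvement(best_baseline, hgmn, lower_is_better):
    """Relative improvement of HGMN over a baseline value, in percent."""
    if lower_is_better:
        return 100.0 * (best_baseline - hgmn) / best_baseline
    return 100.0 * (hgmn - best_baseline) / best_baseline

# MSE (lower is better): AIDS and IMDB-MULTI.
print(round(relative_improvement(0.787, 0.752, True), 1))
print(round(relative_improvement(1.924, 0.719, True), 1))
# Spearman's rho (higher is better): AIDS.
print(round(relative_improvement(0.874, 0.883, False), 1))
```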

HGMN Variants | Performance Metrics (columns 1-4: AIDS; columns 5-8: PTC)
HGMN 0.752 0.883 0.778 0.537 0.820 0.958 0.941 0.623
Without node alignment 0.896 0.776 0.685 0.432 0.837 0.914 0.847 0.602
Without hierarchical clustering 2.768 0.612 0.605 0.313 7.285 0.573 0.513 0.238
TABLE IV: Ablation study of the designs of HGMN for GED approximation

IV-C Ablation Study and Parameter Analysis

In Table IV, we study how the two key designs of HGMN, i.e., hierarchical clustering and node alignment, influence the performance. For the variant without node alignment, we used a random ordering of the nodes in the graphs. For the variant without hierarchical clustering, we used only the embedding correlation matrix for the original graph, which is similar to the case of GraphSim [2]. We ran the study on a dataset with small graphs (AIDS) and a dataset with large graphs (PTC). The results show that disabling either node alignment or hierarchical clustering degrades the performance. Comparatively, the performance degradation is more severe for the large PTC dataset than for the small AIDS dataset. This is further evidence that hierarchical clustering helps achieve good performance on datasets with large graphs. Compared with node alignment, hierarchical clustering seems to be more important for the performance: without it, the error reported in the first metric column of Table IV increases by about 2.7x and 7.9x for AIDS and PTC, respectively.
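
The 2.7x and 7.9x figures can be checked against Table IV. The sketch below assumes that the first four value columns of Table IV belong to AIDS and the last four to PTC, that the first column of each group is the smaller-is-better error metric, and that "increases 2.7x" denotes the relative increase over the full model; the helper name `relative_increase` is ours.

```python
def relative_increase(ablated, full):
    """Relative increase of the ablated variant's error over the full model."""
    return (ablated - full) / full


# Error values from Table IV: full HGMN vs. "Without hierarchical clustering".
aids = relative_increase(2.768, 0.752)  # assumed AIDS error column
ptc = relative_increase(7.285, 0.820)   # assumed PTC error column

print(f"{aids:.1f}x, {ptc:.1f}x")  # 2.7x, 7.9x
```

Read this way, the numbers in the table and in the text agree, which also supports the assumed column order.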

In Figure 4, we check the influence of the hyper-parameters on the performance of HGMN. Figure 4(a) shows that when the number of eigenvectors used for embedding pooling increases, the performance of HGMN first improves and then stabilizes. Recall that the number of eigenvectors determines the number of correlation matrices provided by each stage for the downstream CNN. As more eigenvectors are used for pooling, more information in the node embeddings of the previous graph clustering stage is kept, and thus more information is provided to the CNN. However, with a sufficient number of eigenvectors, adding new eigenvectors does not help, as the first eigenvectors (corresponding to the largest eigenvalues) already encode the most significant signals in the embedding matrix.

Figure 4(b) shows that when the number of graph clustering stages increases, the performance of HGMN also first improves and then saturates, similar to the case of pooling eigenvectors. This is because when too many stages are used, each stage makes only a small change in the graph structure (e.g., grouping two nodes into one) and thus provides little additional information. For the PTC dataset, the performance of HGMN stabilizes with 5 stages, while for the AIDS dataset, it stabilizes with 3 stages. This is because graphs in PTC are larger and can have more meaningful stages. This phenomenon also suggests that more stages are required for even larger graphs, on which HGMN may achieve an even greater performance improvement over existing baselines, as they do not use hierarchical clustering.

Fig. 4: The influence of the parameters on the accuracy of GED approximation: (a) number of eigenvectors in pooling; (b) number of stages in graph clustering. Best viewed in color.

IV-D Efficiency Comparison

Fig. 5: The average GED computation time for some methods

We report the average query processing time for GED similarity search for different methods in Figure 5. Query processing time is the time taken to compute the approximate GED between a query and all dataset items, and the reported results are measured on a machine with Intel(R) Xeon(R) E5-2697 v3 @ 2.6GHz CPU (56 physical cores) and 512GB RAM in single thread mode. We did not include the graph theory based methods (e.g., Beam [21], Hungarian [25] and VJ [10]) as they are shown to be orders of magnitude slower than learning based methods [2]. We used a dataset with small graphs (AIDS) and a dataset with large graphs (PTC) to check the influence of graph size.
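
As an illustration of the measurement described above, the following sketch times one query against all dataset items in a single thread. It is not the benchmarking code used for Figure 5: `query_time` and `toy_ged` are hypothetical names, and `toy_ged` is only a placeholder for a learned GED approximator.

```python
import time


def query_time(query, dataset, approx_ged):
    """Score one query against every dataset item and time the whole pass."""
    start = time.perf_counter()
    scores = [approx_ged(query, graph) for graph in dataset]
    elapsed = time.perf_counter() - start
    return scores, elapsed


def toy_ged(g1, g2):
    # Placeholder scorer (assumption): difference in node counts.
    return abs(len(g1) - len(g2))


# Toy "graphs" represented only by their node lists.
dataset = [[0] * n for n in (3, 5, 8)]
scores, elapsed = query_time([0] * 4, dataset, toy_ged)
print(scores, f"{elapsed:.6f}s")
```

Averaging `elapsed` over many queries would give an average query processing time in the sense used here.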

The results show that HGMN takes more time than the other methods because it uses hierarchical graph clustering and explicit node alignment. The higher computational cost of HGMN is more evident for the dataset with larger graphs (PTC vs. AIDS), as larger graphs need more hierarchical clustering stages and make node alignment more expensive. However, HGMN is not significantly slower than the other methods (e.g., 29.1% and 12.5% slower than GraphSim on PTC and AIDS, respectively) because the graph neural network computation on the original graph, which is required by all the methods, dominates the overall cost. EmbAvg is the most efficient among all methods, as it uses a simple dot product between the averaged embeddings of two graphs, but its accuracy is poor according to Table II and Table III. We think HGMN offers a reasonable trade-off between accuracy and efficiency, trading a small increase in computation for better accuracy.
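
To make the low cost of EmbAvg concrete, the sketch below computes a dot product between the averaged node embeddings of two graphs, as described above. It is a simplified illustration under our own assumptions: the node embeddings are made-up toy vectors rather than GNN outputs, any learned scaling or output layer is omitted, and the function names are hypothetical.

```python
def mean_embedding(node_embeddings):
    """Average a list of equal-length node embedding vectors."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(vec[d] for vec in node_embeddings) / n for d in range(dim)]


def embavg_score(graph1_nodes, graph2_nodes):
    """Dot product between the averaged embeddings of two graphs."""
    a = mean_embedding(graph1_nodes)
    b = mean_embedding(graph2_nodes)
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings (assumption): 2 nodes vs. 3 nodes, dimension 2.
g1 = [[1.0, 0.0], [0.0, 1.0]]
g2 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
score = embavg_score(g1, g2)
print(score)  # 1.0
```

After the embeddings are computed, the comparison costs one pass over each graph's nodes and involves no pairwise node interaction, which is consistent with EmbAvg being the fastest but least accurate method.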

V Conclusions

In this paper, we proposed the hierarchical graph matching network (HGMN) for efficient graph similarity computation. Motivated by the observation that two similar graphs should also be similar when they are clustered into more compact graphs, HGMN uses hierarchical clustering to provide the learning algorithm with a multi-scale view of the differences between graphs. In addition, HGMN adopts techniques including eigenvector based embedding pooling and earth mover based node alignment to build a complete machine learning pipeline. Experimental results on 4 datasets and 4 performance metrics show that HGMN consistently outperforms the baselines. Moreover, there is evidence that HGMN can scale to large graphs.


  • [1] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang (2019) Simgnn: a neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 384–392. Cited by: §I, §II-B, §IV-A, §IV-A.
  • [2] Y. Bai, H. Ding, K. Gu, Y. Sun, and W. Wang (2020) Learning-based efficient graph similarity computation via multi-scale convolutional set matching. In AAAI Conference on Artificial Intelligence, Cited by: §I, §I, §II-B, §III-B, §III-C, §IV-A, §IV-A, §IV-A, §IV-C, §IV-D.
  • [3] D. B. Blumenthal and J. Gamper (2018) On the exact computation of the graph edit distance. Pattern Recognition Letters. Cited by: §I.
  • [4] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H. Kriegel (2005) Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56. Cited by: §I.
  • [5] S. Bougleux, L. Brun, V. Carletti, P. Foggia, B. Gaüzère, and M. Vento (2017) Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87, pp. 38–46. Cited by: §II-B.
  • [6] H. Bunke and K. Shearer (1998) A graph distance metric based on the maximal common subgraph. Pattern recognition letters 19 (3-4), pp. 255–259. Cited by: §I.
  • [7] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun (2019) Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211. Cited by: §IV-B.
  • [8] S. Debnath, N. Ganguly, and P. Mitra (2008) Feature weighting in content based recommendation system using social network analysis. In Proceedings of the 17th international conference on World Wide Web, pp. 1041–1042. Cited by: §I.
  • [9] H. Eckert and J. Bajorath (2007) Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug discovery today 12 (5-6), pp. 225–233. Cited by: §I.
  • [10] S. Fankhauser, K. Riesen, and H. Bunke (2011) Speeding up graph edit distance computation through fast bipartite matching. In Proceedings of the 8th international conference on Graph-based representations in pattern recognition, pp. 102–111. Cited by: §I, §II-B, §IV-A, §IV-D.
  • [11] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke (2015) Approximation of graph edit distance based on hausdorff matching. Pattern Recognition 48 (2), pp. 331–343. Cited by: §I, §II-B.
  • [12] P. E. Hart, N. J. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §II-B.
  • [13] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge (1993) Comparing images using the hausdorff distance. IEEE Transactions on pattern analysis and machine intelligence 15 (9), pp. 850–863. Cited by: §II-B.
  • [14] J. Kinable and O. Kostakis (2011) Malware classification based on call graph clustering. Journal in computer virology 7 (4), pp. 233–245. Cited by: §I.
  • [15] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §I, §II-B, §III-B.
  • [16] I. Konstas, V. Stathopoulos, and J. M. Jose (2009) On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 195–202. Cited by: §I.
  • [17] S. I. Ktena, S. Parisot, E. Ferrante, M. Rajchl, M. Lee, B. Glocker, and D. Rueckert (2017) Distance metric learning using graph convolutional networks: application to functional brain networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 469–477. Cited by: §II-B, §IV-A.
  • [18] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli (2019) Graph matching networks for learning the similarity of graph structured objects. In International Conference on Machine Learning, pp. 3835–3845. Cited by: §I, §I, §II-B, §III-C, §IV-A.
  • [19] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang (2019) Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731. Cited by: §III-A.
  • [20] C. McCreesh, P. Prosser, and J. Trimble (2017) A partitioning algorithm for maximum common subgraph problems. Cited by: §I, §II-B, §IV-A.
  • [21] M. Neuhaus, K. Riesen, and H. Bunke (2006) Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 163–172. Cited by: §I, §II-B, §IV-A, §IV-D.
  • [22] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina (2010) Web graph similarity for anomaly detection. Journal of Internet Services and Applications 1 (1), pp. 19–30. Cited by: §I.
  • [23] J. W. Raymond, E. J. Gardiner, and P. Willett (2002) Rascal: calculation of graph similarity using maximum common edge subgraphs. The Computer Journal 45 (6), pp. 631–644. Cited by: §I, §II-A.
  • [24] P. Riba, A. Fischer, J. Lladós, and A. Fornés (2018) Learning graph distances with message passing neural networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2239–2244. Cited by: §II-B, §IV-A.
  • [25] K. Riesen and H. Bunke (2009) Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing 27 (7), pp. 950–959. Cited by: §I, §II-B, §IV-A, §IV-D.
  • [26] Y. Rubner, C. Tomasi, and L. J. Guibas (1998) A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 59–66. Cited by: §I, §III-B.
  • [27] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §III-B.
  • [28] S. Shang, N. Zheng, J. Xu, M. Xu, and H. Zhang (2010) Detecting malware variants via function-call graph similarity. In 2010 5th International Conference on Malicious and Unwanted Software, pp. 113–120. Cited by: §I.
  • [29] W. Tsai and K. Fu (1979) Error-correcting isomorphisms of attributed relational graphs for pattern analysis. IEEE Transactions on systems, man, and cybernetics 9 (12), pp. 757–768. Cited by: §I.
  • [30] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §III-B.
  • [31] R. Xiang, J. Neville, and M. Rogati (2010) Modeling relationship strength in online social networks. In Proceedings of the 19th international conference on World wide web, pp. 981–990. Cited by: §I.
  • [32] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §III-B.
  • [33] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. Cited by: §III-B.
  • [34] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 363–376. Cited by: §I.
  • [35] L. A. Zager and G. C. Verghese (2008) Graph similarity scoring and matching. Applied mathematics letters 21 (1), pp. 86–94. Cited by: §I.
  • [36] Z. Zeng, A. K. Tung, J. Wang, J. Feng, and L. Zhou (2009) Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment 2 (1), pp. 25–36. Cited by: §I, §I, §II-A, §IV-A.
  • [37] X. Zhao, C. Xiao, X. Lin, W. Wang, and Y. Ishikawa (2013) Efficient processing of graph similarity queries with edit distance constraints. The VLDB Journal 22 (6), pp. 727–752. Cited by: §IV-A.